SillyTavern running locally on Mac M1 or M2 with llama-cpp-python backend

In this post, I'll share my method for running SillyTavern locally on a Mac M1/M2 using llama-cpp-python.

And even if you don't have a Metal GPU, this might be the quickest way to run SillyTavern locally - full stop.

Overview

SillyTavern is a powerful chat front-end for LLMs - but it requires a server to actually run the LLM. Many people simply use the OpenAI server - but since this is LocalLLaMA, we should run our own server.

To use SillyTavern locally, you'd usually serve your own LLM API using KoboldCpp, oobabooga, LM Studio, or a variety of other methods. Personally, I've found running any of those LLM API servers cumbersome - and I wanted something simpler.

In fact, many people have been wondering how we could more simply use llama.cpp as an OpenAI-compatible backend for SillyTavern.

Solution: the llama-cpp-python embedded server

It turns out the Python package llama-cpp-python now ships with a server module that speaks the OpenAI API. For SillyTavern, the llama-cpp-python local LLM server is a drop-in replacement for OpenAI. Best of all, on the Mac M1/M2, this method can take advantage of Metal acceleration.

I've found this to be the quickest and simplest method to run SillyTavern locally.
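
To show what "drop-in replacement" means in practice, here's a rough sketch using the official openai Python client (v1+) pointed at the local server set up in the next sections; the port matches the launch command below, and the model name is just a placeholder since the server answers with whichever model file it loaded:

# Rough sketch: point the standard openai client at the local llama-cpp-python server.
# Assumes the server from the sections below is already running on port 8001.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="local-model",  # placeholder - the server serves the model it loaded
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)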

(Optional) Install llama-cpp-python with Metal acceleration

pip uninstall llama-cpp-python -y
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir

Skip this step if you don't have Metal. However, if you DO have a Metal GPU, this is a simple way to ensure you're actually using it.
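
If you want to sanity-check the Metal build before going further, something like this rough sketch should do it - swap in your own model path; a Metal-enabled build should mention Metal initialization in the verbose output when a layer is offloaded:

# Optional sanity check: load the model with one layer offloaded and watch the
# verbose log for Metal initialization. The model path is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="llama2-13b.gguf.q6_K.bin", n_gpu_layers=1, verbose=True)
out = llm("Q: What is 2+2? A:", max_tokens=8)
print(out["choices"][0]["text"])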

Install and run the HTTP server that comes with llama-cpp-python

pip install 'llama-cpp-python[server]'
python -m llama_cpp.server \
    --model "llama2-13b.gguf.q6_K.bin" \
    --n_gpu_layers 1 \
    --port "8001"

In the future, to re-launch the server, just re-run the python command; no need to install each time.
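
Before wiring up SillyTavern, you can check that the server is actually listening - this assumes the default host, the port from the command above, and the usual OpenAI-style /v1/models route that the server exposes:

# Quick smoke test of the running server (port 8001 as launched above).
import requests

print(requests.get("http://127.0.0.1:8001/v1/models").json())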

Obtain SillyTavern and run it too

git clone https://github.com/SillyTavern/SillyTavern
cd SillyTavern
./start.sh

In the future, you can just run ./start.sh. Note that both the llama-cpp-python server and SillyTavern need to be running at the same time.

Connect SillyTavern to your llama-cpp-python server

  1. Click API Connections (the icon looks like an electrical plug)
  2. Under API, select Chat Completion (OpenAI...)
  3. For Chat Completion Source: select OpenAI (there will be no API Key)
  4. Click AI Response Configuration (the leftmost icon; looks like settings/sliders)
  5. Scroll to OpenAI/Claude Reverse Proxy: type http://127.0.0.1:8001/v1
  6. Legacy Streaming Processing: set True

Finished

Now you're able to use SillyTavern locally with Metal acceleration. On Llama2 13b, I'm getting about 10 tokens/s on a 32GB M1 with 8 GPU cores - and I'm really satisfied with the performance. This is nearly the lowest-end Apple silicon chip with Metal support - so I expect almost everyone else will see even better performance.

Now I don't have to deal with additional GitHub repos and multiple configurations for the API server. I've found this method to be simpler than alternatives like KoboldCpp.
