In this post, I'll share my method for running SillyTavern locally on a Mac M1/M2 using llama-cpp-python.
And even if you don't have a Metal GPU, this might be the quickest way to run SillyTavern locally - full stop.
Overview
SillyTavern is a powerful chat front-end for LLMs - but it requires a server to actually run the LLM. Many people simply point it at OpenAI's API - but since this is LocalLLaMA, we should run our own server.
To use SillyTavern locally, you'd usually serve your own LLM API using KoboldCpp, oobabooga, LM Studio, or a variety of other tools. Personally, I've found running any of those LLM API servers cumbersome - and I wanted something simpler.
In fact, many people have been wondering how to use llama.cpp more simply as an OpenAI-compatible backend for SillyTavern.
Solution: the llama-cpp-python embedded server
It turns out that the Python package llama-cpp-python now ships with an OpenAI-compatible server module. For SillyTavern, this local LLM server is a drop-in replacement for the OpenAI API. Best of all, on a Mac M1/M2, it can take advantage of Metal acceleration.
I've found this to be the quickest and simplest method to run SillyTavern locally.
(Optional) Install llama-cpp-python with Metal acceleration
pip uninstall llama-cpp-python -y
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir
Skip this step if you don't have Metal. However, if you DO have a Metal GPU, this is a simple way to ensure you're actually using it.
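If you want to double-check that the reinstall took effect, you can print the installed version from the command line (a quick sanity check I'm adding here, not part of the original steps - it assumes the package exposes __version__, which recent releases do):
# Confirm llama-cpp-python imports and report its version
python -c "import llama_cpp; print(llama_cpp.__version__)"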
Install and run the HTTP server that comes with llama-cpp-python
pip install 'llama-cpp-python[server]'
python -m llama_cpp.server \
--model "llama2-13b.gguf.q6_K.bin" \
--n_gpu_layers 1 \
--port "8001"
In the future, to re-launch the server, just re-run the python command; no need to install each time.
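Once the server is up, you can sanity-check it from another terminal by hitting the OpenAI-style routes it exposes (my own quick test, assuming the default /v1 endpoints):
# Should return a JSON listing of the loaded model
curl http://127.0.0.1:8001/v1/models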
Obtain SillyTavern and run it too
git clone https://github.com/SillyTavern/SillyTavern
cd SillyTavern
./start.sh
In the future, you can just run ./start.sh.
Note that both the llama-cpp-python server and SillyTavern need to be running at the same time.
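The simplest setup is two terminal windows - one for each process. Just as a convenience sketch (same commands as above, nothing new):
# Terminal 1: the llama-cpp-python API server
python -m llama_cpp.server --model "llama2-13b.gguf.q6_K.bin" --n_gpu_layers 1 --port "8001"
# Terminal 2: SillyTavern
cd SillyTavern && ./start.sh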
Connect SillyTavern to your llama-cpp-python server
- Click API Connections (the icon looks like an electrical plug)
- Under API, select Chat Completion (OpenAI...)
- For Chat Completion Source: select OpenAI (no API key is required)
- Click AI Response Configuration (the leftmost icon; looks like settings/sliders)
- Scroll to OpenAI/Claude Reverse Proxy: type http://127.0.0.1:8001/v1 (an equivalent raw request is shown after this list)
- Legacy Streaming Processing: set True
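Under the hood, SillyTavern will now send standard OpenAI-style chat completion requests to that proxy URL. If you want to confirm the endpoint responds before wiring up the UI, a request like this should work (my own example, assuming the default /v1/chat/completions route; the server serves the single model it loaded, so the model field can usually be omitted):
# Send a minimal chat completion request to the local server
curl http://127.0.0.1:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}], "max_tokens": 32}'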
Finished
Now you're able to use SillyTavern locally with Metal acceleration. On Llama 2 13B, I'm getting about 10 tokens/s with a 32GB M1 with 8 GPU cores - and I'm really satisfied with the performance. This is nearly the lowest-end Metal GPU you can get - so I expect almost everyone else will see even better performance.
Now I don't have to deal with additional GitHub repos and multiple configurations for the API server. I've found this method to be simpler than alternatives like KoboldCpp.