In this post, I'll share my method for running SillyTavern locally on a Mac M1/M2 using llama-cpp-python.
And even if you don't have a Metal GPU, this might be the quickest way to run SillyTavern locally - full stop.
Overview
SillyTavern is a powerful chat front-end for LLMs - but it requires a server to actually run the LLM. Many people simply point it at OpenAI's API - but since this is LocalLLaMA, we should run our own server.
To use SillyTavern locally, you'd usually serve your own LLM API using KoboldCpp, oobabooga, LM Studio, or a variety of other tools. Personally, I've found running any of those LLM API servers cumbersome - and I wanted something simpler.
In fact, many people have been wondering how to use llama.cpp more simply as an OpenAI-compatible backend for SillyTavern.
Solution: the llama-cpp-python embedded server
It turns out that the Python package llama-cpp-python now ships with an OpenAI-compatible server module. For SillyTavern, this local LLM server is a drop-in replacement for the OpenAI API. Best of all, on a Mac M1/M2, it can take advantage of Metal acceleration.
I've found this to be the quickest and simplest method to run SillyTavern locally.
(Optional) Install llama-cpp-python with Metal acceleration
pip uninstall llama-cpp-python -y
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir
Skip this step if you don't have Metal. However, if you DO have a Metal GPU, this is a simple way to ensure you're actually using it.
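If you want to double-check that the reinstall took effect, you can print the installed version from the command line (a quick sanity check I'm adding here, not part of the original steps - it assumes the package exposes __version__, which recent releases do):
# Confirm llama-cpp-python imports and report its version
python -c "import llama_cpp; print(llama_cpp.__version__)"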
Install and run the HTTP server that comes with llama-cpp-python
pip install 'llama-cpp-python[server]'
python -m llama_cpp.server \
--model "llama2-13b.gguf.q6_K.bin" \
--n_gpu_layers 1 \
--port "8001"
In the future, to re-launch the server, just re-run the python command; no need to install each time.
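Once the server is up, you can sanity-check it from another terminal by hitting the OpenAI-style routes it exposes (my own quick test, assuming the default /v1 endpoints):
# Should return a JSON listing of the loaded model
curl http://127.0.0.1:8001/v1/models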
Obtain SillyTavern and run it too
git clone https://github.com/SillyTavern/SillyTavern
cd SillyTavern
./start.sh
In the future, you can just run ./start.sh.
Note that both the llama-cpp-python server and SillyTavern need to be running at the same time.
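The simplest setup is two terminal windows - one for each process. Just as a convenience sketch (same commands as above, nothing new):
# Terminal 1: the llama-cpp-python API server
python -m llama_cpp.server --model "llama2-13b.gguf.q6_K.bin" --n_gpu_layers 1 --port "8001"
# Terminal 2: SillyTavern
cd SillyTavern && ./start.sh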
Connect SillyTavern to your llama-cpp-python server
- Click API Connections (the icon looks like an electrical plug)
- Under API, select Chat Completion (OpenAI...)
- For Chat Completion Source: select OpenAI (no API key is required)
- Click AI Response Configuration (the leftmost icon; looks like settings/sliders)
- Scroll to OpenAI/Claude Reverse Proxy: type http://127.0.0.1:8001/v1 (an equivalent raw request is shown after this list)
- Legacy Streaming Processing: set True
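Under the hood, SillyTavern will now send standard OpenAI-style chat completion requests to that proxy URL. If you want to confirm the endpoint responds before wiring up the UI, a request like this should work (my own example, assuming the default /v1/chat/completions route; the server serves the single model it loaded, so the model field can usually be omitted):
# Send a minimal chat completion request to the local server
curl http://127.0.0.1:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}], "max_tokens": 32}'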
Finished
Now you're able to use SillyTavern locally with Metal acceleration. On Llama 2 13B, I'm getting about 10 tokens/s with a 32GB M1 with 8 GPU cores - and I'm really satisfied with the performance. This is nearly the lowest-end Metal GPU you can get - so I expect almost everyone else will see even better performance.
Now I don't have to deal with additional GitHub repos and multiple configurations for the API server. I've found this method to be simpler than alternatives like KoboldCpp.