Does anyone have a guide on how to configure llama.cpp? I can't find much beyond the basic installation guides. I have it up and running, but I'm getting under 1 tk/s, whereas the same model on Ollama gets 15-18 tk/s. I'm also having a weird issue with llama-cpp-python / guidance where it doesn't accept properly formatted function arguments. GPT-4 says it's likely the Python wrapper not passing the argument through to the underlying C API, but I'm honestly in a bit over my head. Any advice appreciated. Timings below:
llama_print_timings: load time = 12337.41 ms
llama_print_timings: sample time = 34.87 ms / 208 runs ( 0.17 ms per token, 5964.33 tokens per second)
llama_print_timings: prompt eval time = 12337.19 ms / 57 tokens ( 216.44 ms per token, 4.62 tokens per second)
llama_print_timings: eval time = 131624.22 ms / 207 runs ( 635.87 ms per token, 1.57 tokens per second)
llama_print_timings: total time = 144722.12 ms / 264 tokens
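For context, a minimal llama-cpp-python setup looks something like the sketch below. The model path and parameter values are placeholders, not from the original post; n_gpu_layers is the knob that controls GPU offload (Metal on Apple Silicon), and verbose=True prints the llama_print_timings block shown above.

```python
# Minimal llama-cpp-python sketch -- model path and parameters are
# placeholders; adjust for your model and hardware.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model.gguf",  # hypothetical path
    n_gpu_layers=-1,   # offload all layers to the GPU; 0 = CPU only
    n_ctx=2048,        # context window
    verbose=True,      # prints the llama_print_timings block on each call
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])
```

If the eval time per token stays in the hundreds of milliseconds with n_gpu_layers set, the build is likely not using the GPU at all.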
SOLVED:
I found this thread:
https://github.com/abetlen/llama-cpp-python/issues/756
They recommended switching to the ARM build of Python; my version was still running under Rosetta. With a native ARM Python I can now specify GPU layers and get around 10 tk/s. A quick way to check for the same problem is sketched below.
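For anyone hitting the same thing, the snippet below checks whether the interpreter is running under Rosetta. The reinstall command in the comment is the commonly documented one for Metal builds of llama-cpp-python and may differ by version.

```python
# If this prints "x86_64" on an Apple Silicon Mac, the interpreter is
# running under Rosetta, so llama.cpp was built without Metal support --
# which explains CPU-only speeds of ~1 tk/s.
import platform
print(platform.machine())  # expect "arm64" on a native ARM Python

# After switching to a native ARM Python, reinstall with Metal enabled
# (flag name may vary by llama-cpp-python version):
#   CMAKE_ARGS="-DLLAMA_METAL=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
```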
Thank you for the help.