Guide to configuring llama.cpp? I just switched and it's 1/10th the speed of Ollama

Does anyone have a guide on how to configure llama.cpp? I can't find much beyond basic installation guides. I have it up and running, but I'm getting <1 tk/s, whereas the same model on Ollama does 15-18 tk/s (timings below). I'm also having a weird issue with llama_cpp_python / guidance where it doesn't accept properly formatted function arguments. GPT-4 says it's likely the Python wrapper not passing the function argument down to the C layer, but I'm honestly in a bit over my head. Any advice appreciated.

llama_print_timings:        load time =   12337.41 ms
llama_print_timings:      sample time =      34.87 ms /   208 runs   (    0.17 ms per token,  5964.33 tokens per second)
llama_print_timings: prompt eval time =   12337.19 ms /    57 tokens (  216.44 ms per token,     4.62 tokens per second)
llama_print_timings:        eval time =  131624.22 ms /   207 runs   (  635.87 ms per token,     1.57 tokens per second)
llama_print_timings:       total time =  144722.12 ms /   264 tokens
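
For context, the loading code looks roughly like this with llama-cpp-python (the model path and layer count are placeholders, not my exact setup; n_gpu_layers=-1 offloads every layer in recent builds, while older ones want an explicit count):

from llama_cpp import Llama

# n_gpu_layers=0 is the default and keeps everything on the CPU,
# which is what produces eval speeds like the ones above.
llm = Llama(
    model_path="./models/model.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # offload all layers to the GPU; 0 = CPU-only
    n_ctx=2048,       # context window
)

out = llm("Q: What is llama.cpp? A:", max_tokens=64)
print(out["choices"][0]["text"])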

SOLVED:

I found this thread:

https://github.com/abetlen/llama-cpp-python/issues/756

and they recommended switching to the ARM build of Python; mine was still running under Rosetta. I can now specify GPU layers and get around 10 tk/s.
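
For anyone else hitting this, here's a quick check for whether your Python is the native ARM build or an Intel build running under Rosetta (the Metal reinstall flag in the comment is what the llama-cpp-python README suggested at the time, so double-check it against the current docs):

import platform

# 'arm64'  -> native Apple Silicon Python (Metal offload available)
# 'x86_64' -> Intel Python running under Rosetta 2 (no Metal offload,
#             so n_gpu_layers has no effect)
print(platform.machine())

# After switching to an arm64 Python, rebuild llama-cpp-python with Metal:
#   CMAKE_ARGS="-DLLAMA_METAL=on" pip install --force-reinstall --no-cache-dir llama-cpp-python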

Thank you for the help.
