Does anyone have a guide on how to configure llama.cpp? I can't find much beyond the basic installation guides. I have it up and running, but I'm getting under 1 tk/s, whereas the same model on Ollama gets 15-18 tk/s. I'm also having a weird issue with llama-cpp-python / guidance where it doesn't accept properly formatted function arguments. GPT-4 says it's likely the Python wrapper not passing the argument through to the underlying C API, but I'm honestly in a bit over my head. Any advice appreciated. Timings below:
llama_print_timings: load time = 12337.41 ms
llama_print_timings: sample time = 34.87 ms / 208 runs ( 0.17 ms per token, 5964.33 tokens per second)
llama_print_timings: prompt eval time = 12337.19 ms / 57 tokens ( 216.44 ms per token, 4.62 tokens per second)
llama_print_timings: eval time = 131624.22 ms / 207 runs ( 635.87 ms per token, 1.57 tokens per second)
llama_print_timings: total time = 144722.12 ms / 264 tokens
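For context, a minimal llama-cpp-python setup looks something like the sketch below. The model path and parameter values are placeholders, not from the original post; n_gpu_layers is the knob that controls GPU offload (Metal on Apple Silicon), and verbose=True prints the llama_print_timings block shown above.

```python
# Minimal llama-cpp-python sketch -- model path and parameters are
# placeholders; adjust for your model and hardware.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model.gguf",  # hypothetical path
    n_gpu_layers=-1,   # offload all layers to the GPU; 0 = CPU only
    n_ctx=2048,        # context window
    verbose=True,      # prints the llama_print_timings block on each call
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])
```

If the eval time per token stays in the hundreds of milliseconds with n_gpu_layers set, the build is likely not using the GPU at all.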
SOLVED:
I found this thread:
https://github.com/abetlen/llama-cpp-python/issues/756
They recommended switching to the ARM build of Python; my version was still running under Rosetta. With a native ARM Python I can now specify GPU layers and get around 10 tk/s. A quick way to check for the same problem is sketched below.
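For anyone hitting the same thing, the snippet below checks whether the interpreter is running under Rosetta. The reinstall command in the comment is the commonly documented one for Metal builds of llama-cpp-python and may differ by version.

```python
# If this prints "x86_64" on an Apple Silicon Mac, the interpreter is
# running under Rosetta, so llama.cpp was built without Metal support --
# which explains CPU-only speeds of ~1 tk/s.
import platform
print(platform.machine())  # expect "arm64" on a native ARM Python

# After switching to a native ARM Python, reinstall with Metal enabled
# (flag name may vary by llama-cpp-python version):
#   CMAKE_ARGS="-DLLAMA_METAL=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
```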
Thank you for the help.