This post has been de-listed
It is no longer included in search results and normal feeds (front page, hot posts, subreddit posts, etc). It remains visible only via the author's post history.
KoboldCpp 1.67 has now integrated whisper.cpp functionality, providing two new Speech-To-Text endpoints `/api/extra/transcribe` used by KoboldCpp, and the OpenAI compatible drop-in `/v1/audio/transcriptions`. Both endpoints accept payloads as .wav file uploads (max 32MB), or base64 encoded wave data.
Kobold Lite can now also utilize the microphone when enabled in settings panel. You can use Push-To-Talk (PTT) or automatic Voice Activity Detection (VAD) aka Hands Free Mode, everything runs locally within your browser including resampling and wav format conversion, and interfaces directly with the KoboldCpp transcription endpoint.
Special thanks to ggerganov and all the developers of whisper.cpp, without which none of this would have been possible.
Additionally, the Quantized KV Cache enhancements from llama.cpp have also been merged, and can now be used in KoboldCpp. Note that using the quantized KV option requires flash attention enabled and context shift disabled.
The setup shown in the video can be run fully offline on a single device.
Text Generation = MistRP 7B (KoboldCpp)
Image Generation = SD 1.5 PicX Real (KoboldCpp)
Speech To Text = whisper-base.en-q5_1 (KoboldCpp)
Image Recognition = mistral-7b-mmproj-v1.5-Q4_1 (KoboldCpp)
Text To Speech = XTTSv2 with custom sample (XTTS API Server)
See full changelog here: https://github.com/LostRuins/koboldcpp/releases/latest
Subreddit
Post Details
- Posted
- 5 months ago
- Reddit URL
- View post on reddit.com
- External URL
- reddit.com/r/LocalLLaMA/...