---
license: mit
library_name: vibevoice.cpp
tags:
- tts
- asr
- speech
- vibevoice
- gguf
- ggml
base_model:
- microsoft/VibeVoice-Realtime-0.5B
- microsoft/VibeVoice-ASR
---

# vibevoice.cpp: quantized model bundle

**Brought to you by the [LocalAI](https://github.com/mudler/LocalAI) team**, the creators of LocalAI, the open-source AI engine that runs any model (LLMs, vision, voice, image, video) on any hardware. No GPU required.

Quantized GGUF weights for [vibevoice.cpp](https://github.com/mudler/vibevoice.cpp),
a C/C++ port of Microsoft VibeVoice (TTS + ASR) on top of `ggml`.

| File | Source | Quant | Size |
| ---- | ------ | ----- | ---- |
| `vibevoice-realtime-0.5B-q8_0.gguf` | `microsoft/VibeVoice-Realtime-0.5B` | Q8_0 (matmul) + F16 | ~1.6 GB |
| `vibevoice-asr-q8_0.gguf` | `microsoft/VibeVoice-ASR` | Q8_0 (matmul) + F16 | ~13 GB |
| `voice-en-Carter_man.gguf` | upstream voice prompt cache | F16 | 8 MB |
| `voice-en-Emma.gguf` | upstream voice prompt cache | F16 | 6 MB |
| `tokenizer.gguf` | Qwen2.5 BPE + VibeVoice specials | n/a | 6 MB |

## Quantization scheme

`scripts/quantize_gguf.py` in the source repo selectively quantizes only the
LM matmul weights (attention q/k/v/o, ffn gate/up/down, and lm_head) to
Q8_0. Everything else (1-D conv kernels, RMSNorm scales, biases,
layer-scale gammas, token embeddings, small scalars) passes through
unchanged. The conv1d implementation in vibevoice.cpp casts kernels to F16
inline rather than dequantizing on the fly, so quantizing those would
corrupt the convolution outputs.
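
The selection rule described above can be sketched as a name-and-shape filter. This is only an illustration: the tensor-name pattern below is hypothetical and is not taken from `scripts/quantize_gguf.py`; the real script may match names differently.

```python
import re

# Hypothetical naming scheme for the LM matmul weights; the actual tensor
# names inside the GGUF files may differ.
MATMUL_WEIGHT = re.compile(
    r"(?:attn_(?:q|k|v|o)|ffn_(?:gate|up|down)|lm_head)\.weight$"
)

def should_quantize(name: str, ndim: int) -> bool:
    """Return True only for 2-D tensors whose name looks like an LM matmul weight.

    Conv kernels, norm scales, biases, and embeddings fall through and
    would be copied unchanged.
    """
    return ndim == 2 and MATMUL_WEIGHT.search(name) is not None

for name, ndim in [
    ("blk.0.attn_q.weight", 2),   # quantize
    ("blk.0.ffn_down.weight", 2), # quantize
    ("token_embd.weight", 2),     # keep: embedding
    ("conv.0.weight", 3),         # keep: conv kernel
]:
    print(name, "->", "Q8_0" if should_quantize(name, ndim) else "unchanged")
```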

Q8_0 was chosen because it can be implemented in pure Python with `gguf-py` and
gives a ~60% size reduction on the 7B ASR model with no measurable
quality regression in the closed-loop TTS → ASR roundtrip test.
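
To show why Q8_0 is easy to express in plain Python, here is a minimal NumPy sketch of the block format: groups of 32 weights share one F16 scale and store one signed 8-bit value each. It is a simplified illustration, not the code from `gguf-py` or the source repo, and its rounding of exact ties may differ from ggml's.

```python
import numpy as np

QK8_0 = 32  # weights per Q8_0 block

def quantize_q8_0(x: np.ndarray):
    """Quantize a 1-D array (length divisible by 32) into (scales, int8 values)."""
    blocks = x.astype(np.float32).reshape(-1, QK8_0)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    d = amax / 127.0                          # per-block scale
    safe_d = np.where(d > 0, d, 1.0)          # avoid dividing by zero
    inv = np.where(d > 0, 1.0 / safe_d, 0.0)
    q = np.round(blocks * inv).astype(np.int8)
    return d.astype(np.float16), q            # scale is stored as F16

def dequantize_q8_0(d: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Reconstruct approximate float32 weights from scales and int8 values."""
    return (d.astype(np.float32) * q.astype(np.float32)).reshape(-1)

# Each block holds a 2-byte scale plus 32 one-byte values.
BITS_PER_WEIGHT = (2 + QK8_0) * 8 / QK8_0     # 8.5 bits per weight

x = np.linspace(-1.0, 1.0, 64)
d, q = quantize_q8_0(x)
max_err = np.abs(dequantize_q8_0(d, q) - x).max()
```

The per-weight error stays below one quantization step of the block's scale, which is consistent with treating Q8_0 as a near-lossless format for matmul weights.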

## Quickstart

```bash
git clone --recursive https://github.com/mudler/vibevoice.cpp
cd vibevoice.cpp && cmake -B build -DVIBEVOICE_BUILD_TESTS=ON && cmake --build build -j

# Pull this bundle
mkdir -p models && cd models
hf download mudler/vibevoice.cpp-models --local-dir .
cd ..

# TTS
build/bin/vibevoice-cli tts \
  --model models/vibevoice-realtime-0.5B-q8_0.gguf \
  --voice models/voice-en-Carter_man.gguf \
  --tokenizer models/tokenizer.gguf \
  --text "Hello world this is a test of the synthesis system." \
  --out hello.wav

# ASR
build/bin/vibevoice-cli asr \
  --model models/vibevoice-asr-q8_0.gguf \
  --tokenizer models/tokenizer.gguf \
  --audio hello.wav
# -> [{"Start":0,"End":2.8,"Speaker":0,"Content":"Hello world, this is a test of the synthesis system."}]
```
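
The ASR result shown above is a JSON array of segments, so it is straightforward to post-process. The sketch below parses the sample output copied from the quickstart; it assumes the array is available as a plain string (for example, captured from the CLI's output), and `format_segments` is a hypothetical helper, not part of vibevoice.cpp.

```python
import json

# Sample ASR output copied from the quickstart above.
raw = (
    '[{"Start":0,"End":2.8,"Speaker":0,'
    '"Content":"Hello world, this is a test of the synthesis system."}]'
)

def format_segments(raw_json: str) -> list[str]:
    """Turn the segment array into one readable line per segment."""
    lines = []
    for seg in json.loads(raw_json):
        lines.append(
            f'[{seg["Start"]:.1f}s-{seg["End"]:.1f}s] '
            f'speaker {seg["Speaker"]}: {seg["Content"]}'
        )
    return lines

for line in format_segments(raw):
    print(line)
```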

## Closed-loop verification

The `test_closed_loop` ctest in vibevoice.cpp runs TTS → ASR end-to-end
and asserts ≥80% source-word recall in the recovered transcript. With
this bundle (both Q8_0 models) it passes at 10/10 (100%).
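
For readers unfamiliar with the metric, source-word recall can be sketched as the fraction of words in the input text that reappear in the transcript. The normalization used here (lowercasing, stripping punctuation, counting duplicates) is an assumption for illustration; the actual `test_closed_loop` implementation may tokenize differently.

```python
import re
from collections import Counter

def word_recall(source: str, transcript: str) -> float:
    """Fraction of source words that also occur in the transcript."""
    def tokenize(text: str) -> list[str]:
        # Assumed normalization: lowercase, keep letters, digits, apostrophes.
        return re.findall(r"[a-z0-9']+", text.lower())

    src = tokenize(source)
    if not src:
        return 1.0
    available = Counter(tokenize(transcript))
    hits = 0
    for word in src:
        if available[word] > 0:   # each transcript word can match only once
            available[word] -= 1
            hits += 1
    return hits / len(src)

source = "Hello world this is a test of the synthesis system."
transcript = "Hello world, this is a test of the synthesis system."
recall = word_recall(source, transcript)
print(f"recall = {recall:.0%}, passes 80% threshold: {recall >= 0.8}")
```

Applied to the quickstart sentence and its transcript, all ten source words are recovered, so the 80% threshold is met.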

## License

Weights are derived from Microsoft VibeVoice
([VibeVoice-Realtime-0.5B](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B)
and [VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR));
follow the upstream model licenses for use. The conversion and quantization
tooling is released under MIT as part of vibevoice.cpp.