	---
	license: mit
	library_name: vibevoice.cpp
	tags:
	- tts
	- asr
	- speech
	- vibevoice
	- gguf
	- ggml
	base_model:
	- microsoft/VibeVoice-Realtime-0.5B
	- microsoft/VibeVoice-ASR
	---

	# vibevoice.cpp — quantized model bundle

	Brought to you by the [LocalAI](https://github.com/mudler/LocalAI) team, the creators of LocalAI: the open-source AI engine that runs any model (LLMs, vision, voice, image, video) on any hardware. No GPU required.

	Quantized GGUF weights for [vibevoice.cpp](https://github.com/mudler/vibevoice.cpp),
	a C/C++ port of Microsoft VibeVoice (TTS + ASR) on top of `ggml`.

	| File | Source | Quant | Size |
	| ---- | ------ | ----- | ---- |
	| `vibevoice-realtime-0.5B-q8_0.gguf` | `microsoft/VibeVoice-Realtime-0.5B` | Q8_0 (matmul) + F16 | ~1.6 GB |
	| `vibevoice-asr-q8_0.gguf` | `microsoft/VibeVoice-ASR` | Q8_0 (matmul) + F16 | ~13 GB |
	| `voice-en-Carter_man.gguf` | upstream voice prompt cache | F16 | 8 MB |
	| `voice-en-Emma.gguf` | upstream voice prompt cache | F16 | 6 MB |
	| `tokenizer.gguf` | Qwen2.5 BPE + VibeVoice specials | n/a | 6 MB |

	## Quantization scheme

	`scripts/quantize_gguf.py` in the source repo selectively quantizes only the
	LM matmul weights — attention q/k/v/o, ffn gate/up/down, and lm_head — to
	Q8_0. Everything else (1-D conv kernels, RMSNorm scales, biases,
	layer-scale gammas, token embeddings, small scalars) passes through
	unchanged. The conv1d implementation in vibevoice.cpp casts kernels to F16
	inline rather than dequantizing on the fly, so quantizing those would
	corrupt the convolution outputs.
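The selection rule above can be sketched as a small predicate. This is an illustration only, not the logic of `scripts/quantize_gguf.py`: the tensor-name fragments and the `should_quantize` helper are hypothetical, and the real names stored in the GGUF files may differ.

```python
# Hypothetical name fragments for the LM matmul weights listed above.
# The actual tensor names used by vibevoice.cpp are an assumption here.
MATMUL_FRAGMENTS = (
    "attn_q", "attn_k", "attn_v", "attn_o",
    "ffn_gate", "ffn_up", "ffn_down",
    "lm_head",
)


def should_quantize(name: str, n_dims: int) -> bool:
    """Return True only for 2-D weight matrices that look like LM matmuls."""
    if n_dims != 2:
        # Conv kernels, norm scales, biases and scalars are not 2-D matmuls.
        return False
    if not name.endswith(".weight"):
        return False
    # Token embeddings are 2-D too, but match none of the fragments,
    # so they pass through unchanged.
    return any(fragment in name for fragment in MATMUL_FRAGMENTS)


for tensor_name, dims in [
    ("blk.0.attn_q.weight", 2),
    ("blk.0.attn_norm.weight", 1),
    ("token_embd.weight", 2),
]:
    action = "Q8_0" if should_quantize(tensor_name, dims) else "keep"
    print(f"{tensor_name}: {action}")
```

An allow-list like this fails safe: a tensor with an unexpected name is left at its original precision rather than quantized by accident.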

	Q8_0 was chosen because it can be implemented in pure Python with
	`gguf-py` and gives a ~60% size reduction on the 7B ASR model with no
	measurable quality regression in the closed-loop TTS → ASR roundtrip test.
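To show why Q8_0 is easy to write without native code, here is a minimal standard-library sketch of the ggml Q8_0 block layout: 32 weights share one F16 scale and are stored as 32 signed 8-bit integers, for 34 bytes per block. The function names are illustrative and this is not the code in `scripts/quantize_gguf.py`, which presumably works on whole tensors rather than one block at a time.

```python
import math
import struct

QK8_0 = 32  # number of weights per Q8_0 block


def quantize_block_q8_0(values):
    """Pack 32 floats as one F16 scale followed by 32 int8 quants (34 bytes)."""
    if len(values) != QK8_0:
        raise ValueError(f"expected {QK8_0} values, got {len(values)}")
    amax = max(abs(v) for v in values)
    d = amax / 127.0                   # scale so the largest weight maps to +/-127
    inv_d = 1.0 / d if d else 0.0      # an all-zero block keeps scale 0
    # Round half away from zero; Python's round() rounds half to even instead.
    quants = [
        int(math.copysign(math.floor(abs(v * inv_d) + 0.5), v)) for v in values
    ]
    return struct.pack("<e32b", d, *quants)


def dequantize_block_q8_0(block):
    """Inverse of quantize_block_q8_0, up to rounding error."""
    d, *quants = struct.unpack("<e32b", block)
    return [d * q for q in quants]


weights = [float(i - 16) for i in range(QK8_0)]
packed = quantize_block_q8_0(weights)
restored = dequantize_block_q8_0(packed)
print(len(packed))  # 34
print(max(abs(a - b) for a, b in zip(weights, restored)))
```

At 34 bytes per 32 weights, a quantized tensor costs 8.5 bits per weight.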

	## Quickstart

	```bash
	git clone --recursive https://github.com/mudler/vibevoice.cpp
	cd vibevoice.cpp && cmake -B build -DVIBEVOICE_BUILD_TESTS=ON && cmake --build build -j

	# Pull this bundle
	mkdir -p models && cd models
	hf download mudler/vibevoice.cpp-models --local-dir .
	cd ..

	# TTS
	build/bin/vibevoice-cli tts \
	--model models/vibevoice-realtime-0.5B-q8_0.gguf \
	--voice models/voice-en-Carter_man.gguf \
	--tokenizer models/tokenizer.gguf \
	--text "Hello world this is a test of the synthesis system." \
	--out hello.wav

	# ASR
	build/bin/vibevoice-cli asr \
	--model models/vibevoice-asr-q8_0.gguf \
	--tokenizer models/tokenizer.gguf \
	--audio hello.wav
	# -> [{"Start":0,"End":2.8,"Speaker":0,"Content":"Hello world, this is a test of the synthesis system."}]
	```
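The ASR output shown above is a JSON array of segments. A short sketch of turning it into a plain transcript follows. It assumes the CLI prints exactly that array, with `Start`, `End`, `Speaker` and `Content` keys, and nothing else on stdout, which this README does not confirm; `transcript_from_segments` is a hypothetical helper.

```python
import json


def transcript_from_segments(raw: str) -> str:
    """Join the Content fields of ASR segments in start-time order."""
    segments = json.loads(raw)
    ordered = sorted(segments, key=lambda seg: seg["Start"])
    return " ".join(seg["Content"].strip() for seg in ordered)


example = (
    '[{"Start":0,"End":2.8,"Speaker":0,'
    '"Content":"Hello world, this is a test of the synthesis system."}]'
)
print(transcript_from_segments(example))
# Hello world, this is a test of the synthesis system.
```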

	## Closed-loop verification

	The `test_closed_loop` ctest in vibevoice.cpp runs TTS → ASR end-to-end
	and asserts that at least 80% of the source words are recovered in the
	transcript. With this bundle (both Q8_0 models) it passes with a recall
	of 10/10 (100%).
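For readers who want to reproduce the metric outside ctest, source-word recall can be approximated as below. The normalization (lowercasing, dropping punctuation, counting repeated words) is an assumption, since the exact rules used by `test_closed_loop` are not described here; `source_word_recall` is a hypothetical helper.

```python
import re
from collections import Counter


def words(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def source_word_recall(source: str, transcript: str) -> float:
    """Fraction of source words that also appear in the transcript."""
    src = Counter(words(source))
    hyp = Counter(words(transcript))
    total = sum(src.values())
    if total == 0:
        raise ValueError("source text contains no words")
    # A repeated source word only counts as often as the transcript has it.
    matched = sum(min(count, hyp[word]) for word, count in src.items())
    return matched / total


source = "Hello world this is a test of the synthesis system."
transcript = "Hello world, this is a test of the synthesis system."
recall = source_word_recall(source, transcript)
print(f"{recall:.0%}")  # 100%
assert recall >= 0.80
```

Applied to the Quickstart sentence and its transcript, this recovers 10 of 10 words.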

	## License

	Weights are derived from Microsoft VibeVoice
	([VibeVoice-Realtime-0.5B](https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B)
	and [VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR));
	follow the upstream model licenses for use. The conversion and quantization
	tooling is released under MIT as part of vibevoice.cpp.