🇧🇬 BG-TTS V5 — Bulgarian Text-to-Speech
The first open-source high-quality Bulgarian TTS model.
Created by Ani (Ани) 🤖 — an AI assistant powered by Claude
🎧 Audio Samples / Аудио примери
Technology / Технологии:
https://huggingface.co/beleata74/bg-tts-v5/resolve/main/examples/v5_106k_long_tech_spk0.wav
Nature / Природа:
https://huggingface.co/beleata74/bg-tts-v5/resolve/main/examples/v5_106k_long_nature_spk0.wav
Medical / Медицина:
https://huggingface.co/beleata74/bg-tts-v5/resolve/main/examples/v5_106k_long_medical_spk0.wav
All samples are generated from completely unseen text — no cherry-picking, no post-processing.
Model Description / Описание на модела
English
BG-TTS V5 is an encoder-decoder Transformer model for Bulgarian text-to-speech synthesis. It converts Bulgarian text directly into speech using a character-level tokenizer and NVIDIA's NanoCodec (0.6kbps, 4 codebooks).
Key features:
- 🎯 250.8M parameters — compact yet powerful
- 🗣️ 2 female speakers — spk0 (AI-generated, clear & fast) and spk1 (real female voice, natural audiobook narrator)
- 📝 Character-level input — no external text processing needed
- 🎵 NanoCodec 0.6kbps — high quality at extreme compression (12.5 fps, 4 codebooks × 4032 codes)
- 🎥 Frame-level position encoding (tokens_per_frame=4) — 4 codebook tokens share the same RoPE position
- 🏗️ Encoder-Decoder architecture — text encoder sees full context bidirectionally, decoder generates audio causally with cross-attention
- 📊 Trained for 106K steps on ~700 hours of Bulgarian speech (Val CE: 3.727)
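The codec figures above imply a simple token budget, which is handy for estimating decoder sequence lengths. A back-of-envelope sketch using only the numbers quoted in this card (12.5 frames/s, 4 codebook tokens per frame):

# Back-of-envelope token budget from the codec figures above (illustrative only).
FRAMES_PER_SECOND = 12.5
TOKENS_PER_FRAME = 4                                       # one token per codebook

tokens_per_second = FRAMES_PER_SECOND * TOKENS_PER_FRAME   # 50 audio tokens per second

for seconds in (2.2, 7.6, 20.2):                           # durations from the speed table below
    print(f"{seconds:>5.1f}s of audio ≈ {int(seconds * tokens_per_second)} decoder tokens")

# The default max_tokens=2000 therefore covers roughly 40 seconds of speech.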
Български
BG-TTS V5 е encoder-decoder Transformer модел за синтез на българска реч. Преобразува български текст директно в реч, използвайки символно (character-level) кодиране и NVIDIA NanoCodec (0.6kbps, 4 кодови книги).
Основни характеристики:
- 🎯 250.8M параметъра — компактен, но мощен
- 🗣️ 2 женски гласа — spk0 (AI-генериран, ясен и бърз) и spk1 (реален женски глас, аудиокниги)
- 📝 Символно кодиране — не е необходима допълнителна обработка на текста
- 🎵 NanoCodec 0.6kbps — високо качество при екстремна компресия
- 🎥 Позиционно кодиране на ниво фрейм (tokens_per_frame=4)
- 🏗️ Encoder-Decoder архитектура — текстовият encoder вижда целия контекст двупосочно
- 📊 Обучен 106K стъпки с ~700 часа българска реч (Val CE: 3.727)
Architecture / Архитектура
Text → [Char Tokenizer] → Text Encoder (bidirectional)
↓ cross-attention
Audio Decoder (causal) → NanoCodec → Waveform
| Component | Details |
|---|---|
| Text Encoder | 6 layers, d=512, 8 heads, bidirectional, learned positional embedding |
| Audio Decoder | 18 layers, d=768, 12 heads, causal, RoPE, cross-attention every layer |
| FFN | SwiGLU |
| Normalization | RMSNorm |
| Codec | NVIDIA NanoCodec 0.6kbps (22kHz, 12.5fps, 4CB × 4032) |
| Vocab | 9 special + 146 text chars + 16,128 audio tokens = 16,283 total |
| Encoder params | 25.5M |
| Decoder params | 224.9M |
| Total params | 250.8M |
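To make the frame-level positioning concrete: with tokens_per_frame=4, the four codebook tokens of every audio frame share one RoPE position index. A minimal sketch of how such position IDs could be built (the function below is illustrative, not the model's actual code), plus the vocabulary arithmetic from the table:

import torch

def frame_level_positions(num_audio_tokens: int, tokens_per_frame: int = 4) -> torch.Tensor:
    """Position IDs where all codebook tokens of a frame share one RoPE position.
    Illustrative sketch; the real implementation in tts_v5 may differ."""
    num_frames = (num_audio_tokens + tokens_per_frame - 1) // tokens_per_frame
    positions = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    return positions[:num_audio_tokens]

print(frame_level_positions(12))   # tensor([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
print(9 + 146 + 4 * 4032)          # vocab size from the table: 16283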
Speaker Notes / Бележки за говорителите
| Speaker | Description | Recommended text length | Tempo |
|---|---|---|---|
| spk0 | AI-generated voice, clear and expressive | Any length (20–500+ chars) | Normal/fast |
| spk1 | Real female voice, audiobook narrator | 250–320 characters (trained on ~20s segments) | Slower, natural |
⚠️ Important: spk1 works best with longer text (250–320 characters) because the training data consists of ~20-second audiobook segments. Short texts may produce suboptimal results for spk1.
⚠️ Важно: spk1 работи най-добре с по-дълъг текст (250–320 символа), тъй като данните за обучение са ~20-секундни аудиокнижни сегменти. Кратките текстове могат да произведат неоптимални резултати за spk1.
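Given this note, a small pre-flight check before calling the model can avoid the short-text failure mode. The helper below is a hypothetical convenience based only on the speaker notes above; it is not part of the tts_v5 package:

def check_speaker_text(text: str, speaker_id: int) -> None:
    """Warn when spk1 receives text outside its comfortable 250-320 character range.
    Hypothetical helper; not part of tts_v5."""
    n = len(text)
    if speaker_id == 1 and n < 250:
        print(f"Warning: spk1 was trained on ~20s segments; {n} chars may sound unnatural.")
    elif speaker_id == 1 and n > 320:
        print(f"Note: {n} chars is above the recommended 250-320 range; consider splitting the text.")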
Usage / Използване
Requirements / Изисквания
pip install torch torchaudio
pip install nemo_toolkit[asr] # for NanoCodec
Quick Start
import torch
import torchaudio
from tts_v5.inference import synthesize

# Speaker 0 — clear, fast AI voice
synthesize(
    checkpoint="checkpoint",
    text="Здравейте, аз съм българска система за синтез на реч.",
    output="output_spk0.wav",
    speaker_id=0,
    temperature=0.25,
    top_k=50,
    top_p=0.8,
)

# Speaker 1 — real female voice, audiobook narrator (use longer text, 250-320 chars)
synthesize(
    checkpoint="checkpoint",
    text="Пролетта в Родопите е невероятно красива. Снегът по върховете бавно се топи "
         "и малките планински реки набъбват от водата. Първите диви цветя се появяват "
         "по поляните, а въздухът е изпълнен с благоуханието на борови гори и свежа "
         "трева. Птиците се завръщат от юг и техните песни огласят тихите долини.",
    output="output_spk1.wav",
    speaker_id=1,
    temperature=0.25,
    top_k=50,
    top_p=0.8,
)
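To sanity-check a generated file, torchaudio (already imported above) can report its sample rate and duration; the codec operates at 22 kHz:

import torchaudio

info = torchaudio.info("output_spk1.wav")   # works for output_spk0.wav as well
print(f"{info.sample_rate} Hz, {info.num_frames / info.sample_rate:.1f} s of audio")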
Command Line / Команден ред
python -m tts_v5.inference \
    --checkpoint checkpoint \
    --text "Добър ден! Как сте днес?" \
    --output output.wav \
    --speaker 0 \
    --temperature 0.25 \
    --top-k 50 \
    --top-p 0.8
Inference Parameters / Параметри за генериране
| Parameter | Default | Recommended | Description |
|---|---|---|---|
| `temperature` | 0.7 | 0.25 | Lower = more stable, higher = more varied |
| `top_k` | 250 | 50 | Number of top tokens to sample from |
| `top_p` | 0.95 | 0.8 | Nucleus sampling threshold |
| `rep_penalty` | 1.1 | 1.1 | Repetition penalty for recent tokens |
| `max_tokens` | 2000 | 2000 | Maximum audio tokens to generate |
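For readers unfamiliar with these knobs, the sketch below shows how temperature, top_k, top_p, and a repetition penalty are typically combined when choosing the next token. It is a generic illustration of what the parameters mean, not the actual sampling code in tts_v5.inference:

import torch

def sample_next_token(logits: torch.Tensor, recent: torch.Tensor,
                      temperature=0.25, top_k=50, top_p=0.8, rep_penalty=1.1) -> int:
    """Generic temperature / top-k / top-p sampling with a repetition penalty.
    `logits` is 1-D over the vocab, `recent` is a LongTensor of recent token ids.
    Illustrative only; the real inference code may differ."""
    logits = logits.clone()
    # Repetition penalty (CTRL-style): make recently generated tokens less likely.
    logits[recent] = torch.where(logits[recent] > 0,
                                 logits[recent] / rep_penalty,
                                 logits[recent] * rep_penalty)
    logits = logits / temperature                    # <1 sharpens, >1 flattens the distribution
    topk_vals, topk_idx = torch.topk(logits, top_k)  # keep the top_k candidates, sorted descending
    probs = torch.softmax(topk_vals, dim=-1)
    keep = torch.cumsum(probs, dim=-1) - probs < top_p   # nucleus: stop once top_p mass is covered
    probs = probs[keep] / probs[keep].sum()          # renormalise over the kept candidates
    choice = torch.multinomial(probs, 1)
    return int(topk_idx[keep][choice])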
Training Details / Детайли за обучението
| Parameter | Value |
|---|---|
| Dataset | ~700 hours Bulgarian speech (2 speakers) |
| spk0 data | ~400h AI-generated TTS |
| spk1 data | ~300h real female voice audiobooks |
| Steps | 106,000 |
| Batch size | 8 (×2 grad accum = effective 16) |
| Learning rate | 1e-4, cosine schedule |
| Dropout | 0.05 |
| CTC weight | 0.0 (disabled after step 28K) |
| tokens_per_frame | 4 (from step 70K) |
| GPU | NVIDIA RTX 5090 (32GB) |
| Training time | ~13 hours |
| Validation CE | 3.727 (best) |
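A quick back-of-envelope cross-check of the throughput implied by this table (values rounded):

steps, hours, effective_batch = 106_000, 13, 16
print(f"{steps / (hours * 3600):.1f} steps/s")            # ≈ 2.3 optimizer steps per second
print(f"{steps * effective_batch / 1e6:.2f}M sequences")  # ≈ 1.70M training sequences seen (with repetition)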
Training Curriculum
- Steps 0–24K: spk0 only, dropout=0.10, CTC=0.1
- Steps 24K–28K: spk0 only, dropout=0.05, CTC=0.1
- Steps 28K–48K: spk0 only, dropout=0.05, CTC=0.0
- Steps 48K–70K: Both speakers (curriculum learning)
- Steps 70K–106K: Both speakers, frame-level positions (tpf=4)
File Structure / Структура на файловете
├── checkpoint/
│ └── checkpoint.pt # Model weights + optimizer + config (2.9GB)
├── tts_v5/
│ ├── __init__.py
│ ├── config.py # Model config & vocab
│ ├── model.py # Encoder-Decoder architecture
│ ├── inference.py # Generation code
│ ├── tokenizer.py # Character-level tokenizer
│ └── codec.py # NanoCodec wrapper
└── examples/
├── v5_106k_long_tech_spk0.wav
├── v5_106k_long_nature_spk0.wav
└── v5_106k_long_medical_spk0.wav
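If you want to confirm what the 2.9 GB checkpoint contains before running inference, plain torch.load is enough to inspect it. The exact key names are an assumption based on the description above (weights + optimizer + config):

import torch

# Inspect on CPU only; weights_only=False because the file also stores optimizer state and config.
ckpt = torch.load("checkpoint/checkpoint.pt", map_location="cpu", weights_only=False)
print(list(ckpt.keys()))   # expected (assumed) keys: model weights, optimizer state, config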
Inference Speed / Скорост на генериране
Benchmarked on NVIDIA RTX 5090 (32GB), no optimizations (no quantization, no speculative decoding).
| Text length | Chars | Generation time | Audio duration | Speed |
|---|---|---|---|---|
| Short | 24 | 3.2s | 2.2s | 0.7x realtime |
| Medium | 114 | 7.0s | 7.6s | 1.1x realtime |
| Long | 297 | 18.7s | 20.2s | 1.1x realtime |
💡 98% of the time is spent on autoregressive token generation. NanoCodec decoding is near-instant (~0.1s).
💡 98% от времето е за авторегресивно генериране на токени. NanoCodec декодирането е почти мигновено (~0.1с).
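To reproduce these numbers on your own hardware, a minimal timing wrapper around synthesize is enough; the realtime factor is audio duration divided by generation time. This sketch assumes synthesize returns only after the wav file has been written:

import time
import torchaudio
from tts_v5.inference import synthesize

def benchmark(text: str, speaker_id: int = 0, output: str = "bench.wav") -> float:
    """Return the realtime factor: seconds of audio produced per wall-clock second."""
    start = time.perf_counter()
    synthesize(checkpoint="checkpoint", text=text, output=output, speaker_id=speaker_id,
               temperature=0.25, top_k=50, top_p=0.8)
    elapsed = time.perf_counter() - start
    info = torchaudio.info(output)
    return (info.num_frames / info.sample_rate) / elapsed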
Limitations / Ограничения
- Bulgarian only — does not support other languages / не поддържа други езици
- Female voices only — both speakers are female voices / и двата говорителя са женски гласове
- spk1 requires 250+ characters — short texts do not work well for spk1 / кратките текстове не работят добре за spk1
- Requires NanoCodec — the NVIDIA NeMo Toolkit is needed for audio decoding
- GPU recommended — inference works on CPU but is slow
License / Лиценз
Apache 2.0
Citation / Цитиране
@misc{bg-tts-v5-2026,
  title={BG-TTS V5: Bulgarian Text-to-Speech with Encoder-Decoder Transformer},
  author={Ani (AI assistant)},
  year={2026},
  url={https://huggingface.co/beleata74/bg-tts-v5}
}
Built with ❤️ for the Bulgarian language by Ani (AI assistant powered by Claude)
Създадено с ❤️ за българския език от Ани (AI асистент, базиран на Claude)