🇧🇬 BG-TTS V5 — Bulgarian Text-to-Speech

The first open-source high-quality Bulgarian TTS model.

Created by Ani (Ани) 🤖 — an AI assistant powered by Claude


🎧 Audio Samples / Аудио примери

Technology / Технологии:

https://huggingface.co/beleata74/bg-tts-v5/resolve/main/examples/v5_106k_long_tech_spk0.wav

Nature / Природа:

https://huggingface.co/beleata74/bg-tts-v5/resolve/main/examples/v5_106k_long_nature_spk0.wav

Medical / Медицина:

https://huggingface.co/beleata74/bg-tts-v5/resolve/main/examples/v5_106k_long_medical_spk0.wav

All samples are generated from completely unseen text — no cherry-picking, no post-processing.


Model Description / Описание на модела

English

BG-TTS V5 is an encoder-decoder Transformer model for Bulgarian text-to-speech synthesis. It converts Bulgarian text directly into speech using a character-level tokenizer and NVIDIA's NanoCodec (0.6kbps, 4 codebooks).

Key features:

  • 🎯 250.8M parameters — compact yet powerful
  • 🗣️ 2 female speakers — spk0 (AI-generated, clear & fast) and spk1 (real female voice, natural audiobook narrator)
  • 📝 Character-level input — no external text processing needed
  • 🎵 NanoCodec 0.6kbps — high quality at extreme compression (12.5 fps, 4 codebooks × 4032 codes)
  • 🎥 Frame-level position encoding (tokens_per_frame=4) — 4 codebook tokens share the same RoPE position
  • 🏗️ Encoder-Decoder architecture — text encoder sees full context bidirectionally, decoder generates audio causally with cross-attention
  • 📊 Trained for 106K steps on ~700 hours of Bulgarian speech (Val CE: 3.727)

Български

BG-TTS V5 е encoder-decoder Transformer модел за синтез на българска реч. Преобразува български текст директно в реч, използвайки символно (character-level) кодиране и NVIDIA NanoCodec (0.6kbps, 4 кодови книги).

Основни характеристики:

  • 🎯 250.8M параметъра — компактен, но мощен
  • 🗣️ 2 женски гласа — spk0 (AI-генериран, ясен и бърз) и spk1 (реален женски глас, аудиокниги)
  • 📝 Символно кодиране — не е необходима допълнителна обработка на текста
  • 🎵 NanoCodec 0.6kbps — високо качество при екстремна компресия
  • 🎥 Позиционно кодиране на ниво фрейм (tokens_per_frame=4)
  • 🏗️ Encoder-Decoder архитектура — текстовият encoder вижда целия контекст двупосочно
  • 📊 Обучен 106K стъпки с ~700 часа българска реч (Val CE: 3.727)
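
To make the character-level input described above concrete, here is a minimal tokenizer sketch. It is illustrative only: the toy character set and special tokens below are assumptions, and the real implementation lives in tts_v5/tokenizer.py.

# Minimal character-level tokenizer sketch (hypothetical; the real one is tts_v5/tokenizer.py).
SPECIAL_TOKENS = ["<pad>", "<bos>", "<eos>", "<unk>"]               # subset, for illustration
TEXT_CHARS = sorted(set("абвгдежзийклмнопрстуфхцчшщъьюя .,!?-"))    # toy character set

char_to_id = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
char_to_id.update({ch: len(SPECIAL_TOKENS) + i for i, ch in enumerate(TEXT_CHARS)})

def encode(text: str) -> list[int]:
    """Map raw Bulgarian text to IDs, one ID per character, with BOS/EOS markers."""
    unk = char_to_id["<unk>"]
    return [char_to_id["<bos>"]] + [char_to_id.get(ch, unk) for ch in text.lower()] + [char_to_id["<eos>"]]

print(encode("Здравей!"))   # IDs depend entirely on the toy vocabulary above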

Architecture / Архитектура

Text → [Char Tokenizer] → Text Encoder (bidirectional)
                                        ↓ cross-attention
                          Audio Decoder (causal) → NanoCodec → Waveform
| Component | Details |
|---|---|
| Text Encoder | 6 layers, d=512, 8 heads, bidirectional, learned positional embedding |
| Audio Decoder | 18 layers, d=768, 12 heads, causal, RoPE, cross-attention in every layer |
| FFN | SwiGLU |
| Normalization | RMSNorm |
| Codec | NVIDIA NanoCodec 0.6 kbps (22 kHz, 12.5 fps, 4 codebooks × 4032 codes) |
| Vocabulary | 9 special + 146 text chars + 16,128 audio tokens = 16,283 total |
| Encoder params | 25.5M |
| Decoder params | 224.9M |
| Total params | 250.8M |
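
The vocabulary split and the frame-level position rule can be written out explicitly. The sketch below only restates the numbers from the table above (9 special + 146 text + 4 × 4032 audio tokens, four codebook tokens per frame); the function and constant names are illustrative, not the model's actual API.

# Vocabulary layout and frame-level RoPE positions (names are illustrative, not the model's API).
N_SPECIAL = 9
N_TEXT_CHARS = 146
N_CODEBOOKS = 4           # NanoCodec codebooks per frame
CODEBOOK_SIZE = 4032
TOKENS_PER_FRAME = 4      # one token per codebook, all four sharing a RoPE position

AUDIO_OFFSET = N_SPECIAL + N_TEXT_CHARS                    # 155
VOCAB_SIZE = AUDIO_OFFSET + N_CODEBOOKS * CODEBOOK_SIZE    # 9 + 146 + 16,128 = 16,283

def audio_token_id(codebook: int, code: int) -> int:
    """Flatten a (codebook, code) pair into the shared vocabulary."""
    return AUDIO_OFFSET + codebook * CODEBOOK_SIZE + code

def rope_position(audio_token_index: int) -> int:
    """Frame-level positions: four consecutive codebook tokens share one position."""
    return audio_token_index // TOKENS_PER_FRAME

assert VOCAB_SIZE == 16283
assert [rope_position(i) for i in range(8)] == [0, 0, 0, 0, 1, 1, 1, 1]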

Speaker Notes / Бележки за говорителите

| Speaker | Description | Recommended text length | Tempo |
|---|---|---|---|
| spk0 | AI-generated voice, clear and expressive | Any length (20–500+ chars) | Normal/fast |
| spk1 | Real female voice, audiobook narrator | 250–320 characters (trained on ~20 s segments) | Slower, natural |

⚠️ Important: spk1 works best with longer text (250–320 characters) because the training data consists of ~20-second audiobook segments. Short texts may produce suboptimal results for spk1.

⚠️ Важно: spk1 работи най-добре с по-дълъг текст (250–320 символа), тъй като данните за обучение са ~20-секундни аудиокнижни сегменти. Кратките текстове могат да произведат неоптимални резултати за spk1.
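
A small helper like the one below can enforce this recommendation before synthesis. It simply restates the guidance above; the function and its thresholds are not part of the released package.

# Hypothetical pre-flight check for speaker-specific text length (not part of tts_v5).
import warnings

RECOMMENDED_LENGTH = {
    0: (20, None),    # spk0: any length from ~20 characters upward
    1: (250, 320),    # spk1: trained on ~20-second audiobook segments
}

def check_text_length(text: str, speaker_id: int) -> None:
    lo, hi = RECOMMENDED_LENGTH[speaker_id]
    n = len(text)
    if n < lo:
        warnings.warn(f"spk{speaker_id}: text has {n} chars; at least {lo} is recommended.")
    elif hi is not None and n > hi:
        warnings.warn(f"spk{speaker_id}: text has {n} chars; up to ~{hi} works best.")

check_text_length("Кратък текст.", speaker_id=1)   # warns: well below the 250-character minimum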


Usage / Използване

Requirements / Изисквания

pip install torch torchaudio
pip install nemo_toolkit[asr]   # for NanoCodec

Quick Start

import torch
import torchaudio
from tts_v5.inference import synthesize

# Speaker 0 — clear, fast AI voice
synthesize(
    checkpoint="checkpoint",
    text="Здравейте, аз съм българска система за синтез на реч.",
    output="output_spk0.wav",
    speaker_id=0,
    temperature=0.25,
    top_k=50,
    top_p=0.8,
)

# Speaker 1 — real female voice, audiobook narrator (use longer text, 250-320 chars)
synthesize(
    checkpoint="checkpoint",
    text="Пролетта в Родопите е невероятно красива. Снегът по върховете бавно се топи "
         "и малките планински реки набъбват от водата. Първите диви цветя се появяват "
         "по поляните, а въздухът е изпълнен с благоуханието на борови гори и свежа "
         "трева. Птиците се завръщат от юг и техните песни огласят тихите долини.",
    output="output_spk1.wav",
    speaker_id=1,
    temperature=0.25,
    top_k=50,
    top_p=0.8,
)

Command Line / Команден ред

python -m tts_v5.inference \
    --checkpoint checkpoint \
    --text "Добър ден! Как сте днес?" \
    --output output.wav \
    --speaker 0 \
    --temperature 0.25 \
    --top-k 50 \
    --top-p 0.8

Inference Parameters / Параметри за генериране

| Parameter | Default | Recommended | Description |
|---|---|---|---|
| temperature | 0.7 | 0.25 | Lower = more stable, higher = more varied |
| top_k | 250 | 50 | Number of top tokens to sample from |
| top_p | 0.95 | 0.8 | Nucleus sampling threshold |
| rep_penalty | 1.1 | 1.1 | Repetition penalty for recent tokens |
| max_tokens | 2000 | 2000 | Maximum audio tokens to generate |
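
The sketch below shows how these parameters typically interact during autoregressive decoding: a repetition penalty on recent tokens, temperature scaling, then top-k and top-p (nucleus) filtering. It is a generic illustration of the technique, not the actual code in tts_v5/inference.py.

# Generic temperature / top-k / top-p sampling with a repetition penalty (illustrative only).
import torch

def sample_next_token(logits, recent_tokens, temperature=0.25, top_k=50, top_p=0.8, rep_penalty=1.1):
    logits = logits.clone()
    # Penalise recently generated tokens so the model does not loop.
    for t in set(recent_tokens):
        logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
    logits = logits / temperature
    # Top-k: keep only the k highest-scoring tokens (torch.topk returns them in descending order).
    topk_vals, topk_idx = torch.topk(logits, k=min(top_k, logits.numel()))
    probs = torch.softmax(topk_vals, dim=-1)
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p, then renormalise.
    cum = torch.cumsum(probs, dim=-1)
    cutoff = int((cum < top_p).sum().item()) + 1
    kept = probs[:cutoff] / probs[:cutoff].sum()
    choice = torch.multinomial(kept, num_samples=1)
    return int(topk_idx[choice])

logits = torch.randn(16283)                          # one decoding step over the full vocabulary
next_id = sample_next_token(logits, recent_tokens=[1000, 1001, 1002])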

Training Details / Детайли за обучението

| Parameter | Value |
|---|---|
| Dataset | ~700 hours of Bulgarian speech (2 speakers) |
| spk0 data | ~400 h AI-generated TTS |
| spk1 data | ~300 h real female voice (audiobooks) |
| Steps | 106,000 |
| Batch size | 8 (× 2 grad accum = effective 16) |
| Learning rate | 1e-4, cosine schedule |
| Dropout | 0.05 |
| CTC weight | 0.0 (disabled after step 28K) |
| tokens_per_frame | 4 (from step 70K) |
| GPU | NVIDIA RTX 5090 (32 GB) |
| Training time | ~13 hours |
| Validation CE | 3.727 (best) |
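
The batch and learning-rate rows translate into roughly the following schedule. Only the batch size, accumulation factor, peak learning rate, and total step count come from the table; the warmup length is an assumption for illustration.

# Effective batch size and cosine LR schedule from the table (warmup length is assumed).
import math

BATCH_SIZE = 8
GRAD_ACCUM = 2             # effective batch = 8 * 2 = 16
PEAK_LR = 1e-4
TOTAL_STEPS = 106_000
WARMUP_STEPS = 1_000       # assumption; not stated in the model card

def lr_at(step: int) -> float:
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1 + math.cos(math.pi * progress))

print(lr_at(0), lr_at(53_000), lr_at(106_000))   # 0 at start, roughly half of peak at midpoint, ~0 at the end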

Training Curriculum

  1. Steps 0–24K: spk0 only, dropout=0.10, CTC=0.1
  2. Steps 24K–28K: spk0 only, dropout=0.05, CTC=0.1
  3. Steps 28K–48K: spk0 only, dropout=0.05, CTC=0.0
  4. Steps 48K–70K: Both speakers (curriculum learning)
  5. Steps 70K–106K: Both speakers, frame-level positions (tpf=4)
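
Expressed as data, the curriculum above would look roughly like this. The phase boundaries, speaker sets, dropout, and CTC weights come from the list; tokens_per_frame before step 70K and the carried-over values in phase 4 are assumptions.

# Training curriculum from the list above as illustrative config entries (field names are hypothetical).
CURRICULUM = [
    {"steps": (0, 24_000),       "speakers": [0],    "dropout": 0.10, "ctc_weight": 0.1, "tokens_per_frame": 1},
    {"steps": (24_000, 28_000),  "speakers": [0],    "dropout": 0.05, "ctc_weight": 0.1, "tokens_per_frame": 1},
    {"steps": (28_000, 48_000),  "speakers": [0],    "dropout": 0.05, "ctc_weight": 0.0, "tokens_per_frame": 1},
    {"steps": (48_000, 70_000),  "speakers": [0, 1], "dropout": 0.05, "ctc_weight": 0.0, "tokens_per_frame": 1},  # dropout/CTC assumed carried over
    {"steps": (70_000, 106_000), "speakers": [0, 1], "dropout": 0.05, "ctc_weight": 0.0, "tokens_per_frame": 4},  # frame-level positions
]
# tokens_per_frame=1 before step 70K is an assumption (token-level positions).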

File Structure / Структура на файловете

├── checkpoint/
│   └── checkpoint.pt          # Model weights + optimizer + config (2.9GB)
├── tts_v5/
│   ├── __init__.py
│   ├── config.py              # Model config & vocab
│   ├── model.py               # Encoder-Decoder architecture
│   ├── inference.py           # Generation code
│   ├── tokenizer.py           # Character-level tokenizer
│   └── codec.py               # NanoCodec wrapper
└── examples/
    ├── v5_106k_long_tech_spk0.wav
    ├── v5_106k_long_nature_spk0.wav
    └── v5_106k_long_medical_spk0.wav
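
The whole model ships as a single checkpoint.pt containing weights, optimizer state, and config. If you want to peek inside, plain torch.load works; the key names are not documented here, so the snippet below only lists whatever it finds.

# Inspect the single-file checkpoint (assumes the usual dict layout; key names are not documented).
import torch

# Non-tensor objects (e.g. the stored config) require weights_only=False on recent PyTorch versions.
ckpt = torch.load("checkpoint/checkpoint.pt", map_location="cpu", weights_only=False)
if isinstance(ckpt, dict):
    for key in ckpt:
        print(key, type(ckpt[key]).__name__)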

Inference Speed / Скорост на генериране

Benchmarked on an NVIDIA RTX 5090 (32 GB) with no optimizations (no quantization, no speculative decoding).

| Text length | Chars | Generation time | Audio duration | Speed |
|---|---|---|---|---|
| Short | 24 | 3.2 s | 2.2 s | 0.7× realtime |
| Medium | 114 | 7.0 s | 7.6 s | 1.1× realtime |
| Long | 297 | 18.7 s | 20.2 s | 1.1× realtime |

💡 98% of the time is spent on autoregressive token generation. NanoCodec decoding is near-instant (0.1s).

💡 98% от времето е за авторегресивно генериране на токени. NanoCodec декодирането е почти мигновено (0.1с).
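
The "Speed" column is simply the audio duration divided by the generation time (realtime factor). Recomputing it from the table:

# Realtime factor = audio duration / generation time, recomputed from the benchmark table above.
benchmarks = {
    "Short (24 chars)":   (3.2, 2.2),     # (generation time s, audio duration s)
    "Medium (114 chars)": (7.0, 7.6),
    "Long (297 chars)":   (18.7, 20.2),
}
for name, (gen_s, audio_s) in benchmarks.items():
    print(f"{name}: {audio_s / gen_s:.1f}x realtime")
# Prints 0.7x, 1.1x, 1.1x, matching the table.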


Limitations / Ограничения

  • Bulgarian only — does not support other languages
  • Female voices only — both speakers are female voices
  • spk1 requires 250+ characters — short texts do not work well for spk1
  • Requires NanoCodec — NVIDIA NeMo Toolkit needed for audio decoding
  • GPU recommended — inference works on CPU but is slow

License / Лиценз

Apache 2.0


Citation / Цитиране

@misc{bg-tts-v5-2026,
  title={BG-TTS V5: Bulgarian Text-to-Speech with Encoder-Decoder Transformer},
  author={Ani (AI assistant)},
  year={2026},
  url={https://huggingface.co/beleata74/bg-tts-v5}
}

Built with ❤️ for the Bulgarian language by Ani (AI assistant powered by Claude)
Създадено с ❤️ за българския език от Ани (AI асистент, базиран на Claude)
