MioTTS-0.1B: Lightweight & Fast LLM-based TTS
MioTTS-0.1B is a lightweight, high-speed Text-to-Speech (TTS) model based on an LLM architecture. It is designed to generate high-quality speech in English and Japanese while maintaining low latency and minimal resource usage.
This model supports zero-shot voice cloning and is built on top of the efficient neural audio codec MioCodec-25Hz-24kHz.
📊 MioTTS Family
We offer a range of model sizes to suit different performance and resource requirements.
| Model Name | Parameters | Base Model | License | RTF (Real-Time Factor) |
|---|---|---|---|---|
| MioTTS-0.1B | 0.1B | tiiuae/Falcon-H1-Tiny-Multilingual-100M-Base | Falcon-LLM License | 0.04 - 0.05 |
| MioTTS-0.4B | 0.4B | LiquidAI/LFM2-350M | LFM Open License v1.0 | 0.035 - 0.045 |
| MioTTS-0.6B | 0.6B | Qwen/Qwen3-0.6B-Base | Apache 2.0 | 0.055 - 0.065 |
| MioTTS-1.2B | 1.2B | LiquidAI/LFM2.5-1.2B-Base | LFM Open License v1.0 | 0.065 - 0.075 |
| MioTTS-1.7B | 1.7B | Qwen/Qwen3-1.7B-Base | Apache 2.0 | 0.10 - 0.11 |
| MioTTS-2.6B | 2.6B | LiquidAI/LFM2-2.6B | LFM Open License v1.0 | 0.135 - 0.145 |
RTF values represent the range observed when generating approximately 15 seconds of audio across multiple runs. Measured on an NVIDIA RTX 5090 using vLLM 0.15.1.
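RTF here is generation wall-clock time divided by the duration of the generated audio, so an RTF of 0.05 means roughly 15 seconds of speech produced in about 0.75 seconds. A minimal sketch of how such a measurement can be taken (the `synthesize` function is a placeholder, not part of this repository):

```python
import time

def measure_rtf(synthesize, text):
    """Real-time factor = wall-clock generation time / duration of generated audio."""
    start = time.perf_counter()
    waveform, sample_rate = synthesize(text)  # placeholder TTS call returning (samples, sr)
    elapsed = time.perf_counter() - start
    audio_seconds = len(waveform) / sample_rate
    return elapsed / audio_seconds  # e.g. 0.75 s spent on 15 s of audio -> RTF 0.05
```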
🌟 Key Features
- Lightweight & Fast: Optimized for speed, making it suitable for consumer-grade GPUs and edge deployment.
- Bilingual Support: Trained on approximately 100,000 hours of English and Japanese data.
- Voice Cloning: Supports high-fidelity zero-shot voice cloning from a short reference audio clip.
- Efficient Codec: Uses Aratako/MioCodec-25Hz-24kHz, which operates at a low framerate (25Hz) for faster generation without sacrificing quality.
🚀 Inference
We provide a dedicated repository for inference, including installation instructions and an example WebUI.
👉 GitHub: Aratako/MioTTS-Inference
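For orientation, the snippet below sketches what LLM-based TTS inference with vLLM roughly looks like. The prompt format, special tokens, and MioCodec decoding call shown here are assumptions for illustration only; use the inference repository above for working code.

```python
# Hypothetical sketch only: the prompt format, special tokens, and codec decoding
# below are assumptions. See Aratako/MioTTS-Inference for actual usage.
from vllm import LLM, SamplingParams

llm = LLM(model="Aratako/MioTTS-0.1B")  # autoregressive speech-token model
params = SamplingParams(temperature=0.8, top_p=1.0, repetition_penalty=1.0,
                        max_tokens=2048)

# Hypothetical prompt: reference speech tokens + target text -> speech tokens.
prompt = "<|ref|>{reference_speech_tokens}<|text|>Hello, nice to meet you.<|speech|>"
outputs = llm.generate([prompt], params)
speech_tokens = outputs[0].outputs[0].text

# The generated speech tokens would then be decoded into a 24 kHz waveform by the
# MioCodec-25Hz-24kHz decoder (the decode API name here is also hypothetical):
# waveform = miocodec.decode(parse_speech_tokens(speech_tokens))
```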
🎧 Audio Samples
Below are some samples generated by MioTTS-0.1B.
Note: The reference audio samples below were generated using Aratako/T5Gemma-TTS-2b-2b and gemini-2.5-pro-tts.
| Case | Text | Reference Audio | Generated Audio |
|---|---|---|---|
| English 1 | "The old library was silent, save for the gentle ticking of a clock somewhere in the shadows. As I ran my fingers along the dusty spines of the books, I felt a strange sense of nostalgia, as if I had lived a thousand lives within these walls." | ||
| English 2 | "Hey! I haven't seen you in ages. Do you want to grab some coffee later? I've got so much to tell you!" | ||
| Japanese 1 | "気象庁によりますと、大型の台風10号は、明日の明け方にかけて関東地方に接近する見込みです。沿岸部では高波に警戒が必要です。" | ||
| Japanese 2 | "その森には、古い言い伝えがありました。月が最も高く昇る夜、静かに耳を澄ませば、風の歌声が聞こえるというのです。私は半信半疑でしたが、その夜、確かに誰かが私を呼ぶ声を聞いたのです。" |
🏗️ Training Details
- Data: ~100k hours of speech data (English & Japanese).
- Codec: MioCodec-25Hz-24kHz
- Base Model: Initialized from tiiuae/Falcon-H1-Tiny-Multilingual-100M-Base.
📊 Evaluation Results
We evaluated MioTTS-0.1B using J-HARD-TTS-Eval, a challenging benchmark for Japanese zero-shot TTS. The evaluation involves synthesizing each test case 5 times and measuring the best, average, and worst performance.
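One plausible reading of this protocol, per test case: compute CER for each of the 5 attempts, then report the minimum, mean, and maximum (the exact aggregation is defined by J-HARD-TTS-Eval itself).

```python
from statistics import mean

def aggregate_cer(cer_per_attempt):
    """cer_per_attempt: CER values for the 5 synthesis attempts of one test case."""
    return {
        "best": min(cer_per_attempt),
        "avg": mean(cer_per_attempt),
        "worst": max(cer_per_attempt),
    }

aggregate_cer([0.5, 1.2, 0.8, 3.0, 0.6])  # -> best 0.5, avg ~1.22, worst 3.0
```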
Key Findings
Despite its ultra-lightweight size (0.1B parameters), MioTTS shows interesting characteristics:
- Exceptional Peak Performance: In the Rhyme test (the closest to standard reading), the model achieves the top 'average' score and ties for the top 'best' score, outperforming significantly larger models. It also attains the best 'best' scores in the Repetition and Continuation tasks.
- Stability Trade-offs: While peak performance is high, the 'average' and 'worst' scores for Repetition and Continuation fall off sharply relative to the best runs, indicating some instability in generation consistency across attempts.
- Limitations: Performance on Short text inputs is notably weaker and remains an area for future improvement.
- Speaker Similarity: The lower similarity scores are an expected trade-off of the architecture, as voice cloning is handled by the highly compressed, lightweight codec rather than the LLM itself.
Character Error Rate (CER)
Lower is better.
| Model | Size | short best | short avg | short worst | rep best | rep avg | rep worst | rhyme best | rhyme avg | rhyme worst | cont best | cont avg | cont worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MioTTS-0.1B | AR: 114.5M (74.4M) + NAR: 81.3M | 13.39 | 50.55 | 107.9 | 4.963 | 22.29 | 50.92 | 0.1419 | 1.022 | 3.654 | 0.2884 | 2.664 | 9.285 |
| XTTS-v2 | 441.0M (424.2M) | 5.512 | 14.33 | 31.50 | 7.792 | 12.12 | 18.61 | 0.1419 | 1.064 | 3.122 | 0.3460 | 1.396 | 3.287 |
| CosyVoice2-0.5B | AR: 505.8M (357.9M) + NAR: 112.5M | 22.83 | 71.50 | 123.6 | 8.139 | 15.25 | 28.68 | 0.1774 | 1.398 | 4.576 | 0.4614 | 5.456 | 16.03 |
| FishAudio-S1-mini | AR: 801.4M (440.5M) + NAR: 58.73M | 0.7874 | 15.59 | 48.82 | 11.81 | 35.19 | 79.90 | 0.4966 | 1.313 | 3.015 | 0.4037 | 1.257 | 2.364 |
| Qwen3-TTS-0.6B | AR: 764.2M (437.3M) + NAR: 141.6M | 7.087 | 22.36 | 45.67 | 6.799 | 13.01 | 21.49 | 2.128 | 4.292 | 7.627 | 0.7497 | 2.076 | 4.037 |
| Qwen3-TTS-1.7B | AR: 1.703B (1.403B) + NAR: 175.1M | 1.575 | 4.724 | 11.02 | 5.261 | 10.57 | 17.67 | 0.6031 | 2.469 | 4.753 | 0.5767 | 1.488 | 2.884 |
Speaker Similarity (Sim)
Higher is better. Computed using varying CER thresholds.
| Model | Size | SS (CER=0) | SS (CER<=10) | SS (CER<=30) | SS (CER<=50) | SS (CER<=100) | SS (Unfiltered) |
|---|---|---|---|---|---|---|---|
| MioTTS-0.1B | AR: 114.5M (74.4M) + NAR: 81.3M | 0.5696 | 0.5651 | 0.5576 | 0.5523 | 0.5430 | 0.5387 |
| XTTS-v2 | 441.0M (424.2M) | 0.6267 | 0.6273 | 0.6218 | 0.6178 | 0.6155 | 0.6145 |
| CosyVoice2-0.5B | AR: 505.8M (357.9M) + NAR: 112.5M | 0.7325 | 0.7251 | 0.7152 | 0.7087 | 0.6858 | 0.6848 |
| FishAudio-S1-mini | AR: 801.4M (440.5M) + NAR: 58.73M | 0.6864 | 0.6833 | 0.6722 | 0.6646 | 0.6531 | 0.6440 |
| Qwen3-TTS-0.6B | AR: 764.2M (437.3M) + NAR: 141.6M | 0.7419 | 0.7496 | 0.7451 | 0.7418 | 0.7354 | 0.7298 |
| Qwen3-TTS-1.7B | AR: 1.703B (1.403B) + NAR: 175.1M | 0.7623 | 0.7614 | 0.7549 | 0.7539 | 0.7537 | 0.7530 |
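One natural reading of the thresholded columns: keep only the generated utterances whose CER does not exceed the threshold, then average speaker similarity over the survivors (the exact filtering rule is defined by J-HARD-TTS-Eval). A minimal sketch under that assumption:

```python
from statistics import mean

def similarity_at_threshold(samples, cer_threshold):
    """samples: list of (cer, speaker_similarity) pairs, one per generated utterance.
    Averages similarity over utterances whose CER is at or below the threshold."""
    kept = [sim for cer, sim in samples if cer <= cer_threshold]
    return mean(kept) if kept else float("nan")
```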
📝 Benchmark Notes
- Data Source: Baseline results (rows other than MioTTS) are transcribed from the Zero-shot results in the J-HARD-TTS-Eval README.
- Model Size: Values in parentheses within the `Size` column indicate AR model parameters excluding the embedding and output head layers.
- Generation Settings: MioTTS inference was performed using vLLM 0.15.1 with `temperature=0.8`, `top_p=1.0`, and `repetition_penalty=1.0`.
- Preprocessing: MioTTS inference applies the same text preprocessing used during its training. This process removes trailing punctuation (e.g., final commas). As a result, the synthesized target text may differ slightly from the original benchmark text, which can influence CER scores.
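As a rough illustration of the kind of difference this preprocessing can introduce, trailing-punctuation removal might look like the sketch below; the actual rules live in the training/inference code and may differ.

```python
import re

def strip_trailing_punctuation(text: str) -> str:
    # Hypothetical approximation: drop punctuation at the very end of the text.
    return re.sub(r"[、。，,.!?！？\s]+$", "", text)

strip_trailing_punctuation("沿岸部では高波に警戒が必要です。")
# -> "沿岸部では高波に警戒が必要です"
```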
📜 License & Ethical Restrictions
License
This model is released under the Falcon-LLM License.
Ethical Considerations & Limitations
While this model is released under a permissive license, we aim to promote responsible AI development and urge users to respect the rights of others.
- Voice Cloning: Please respect the privacy and rights of individuals. We strongly discourage using this model to clone the voices of real people (especially non-consenting individuals) for deceptive or harmful purposes.
- No Misinformation: This model should not be used to generate deepfakes intended to mislead others or spread misinformation.
- Disclaimer: The developers assume no liability for any misuse of this model. Users are solely responsible for ensuring their use of the generated content complies with applicable laws and regulations in their jurisdiction.
🙏 Acknowledgments
- Compute Support: Part of the compute resources for this project were provided by Saldra, Witness and Lumina Logic Minds. We deeply appreciate their support.
- Base Model: We thank the developers of the base LLM for their open-source contributions.
- Community: Thanks to the open-source community for the datasets and tools that made this project possible.
🖊️ Citation
If you use MioTTS in your research or project, please cite it as follows:
```bibtex
@misc{miotts,
  author = {Chihiro Arata},
  title = {MioTTS: Lightweight and Fast LLM-based Text-to-Speech},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/collections/Aratako/miotts}}
}
```