๐ŸŒProject Page ๏ฝœ๐Ÿค— Hugging Face๏ฝœ ๐Ÿค– ModelScope | ๐ŸŽฎ Gradio Demo

Introduction

Ming-omni-tts is a high-performance unified audio generation model that achieves precise control over speech attributes and enables single-channel synthesis of speech, environmental sounds, and music. Powered by a custom 12.5Hz continuous tokenizer and a Patch-by-Patch compression strategy, it delivers competitive inference efficiency (an LLM frame rate of 3.1Hz). The model also features robust text normalization, enabling accurate and natural narration of complex mathematical and chemical expressions.

🚀 Core Capabilities

  • 🔊 Fine-grained Vocal Control: The model supports precise control over speech rate, pitch, volume, emotion, and dialect through simple commands. Notably, its accuracy for Cantonese dialect control reaches 93%, and its emotion control accuracy reaches 46.7%, surpassing CosyVoice3.
  • 🌌 Intelligent Voice Design: Features 100+ premium built-in voices and supports zero-shot voice design through natural language descriptions. Its performance on the InstructTTS-Eval-ZH benchmark is on par with Qwen3-TTS.
  • 🎶 Immersive Unified Generation: The industry's first autoregressive model to jointly generate speech, ambient sound, and music in a single channel. Built on a custom 12.5Hz continuous tokenizer and a DiT head architecture, it delivers a seamless, "in-the-scene" auditory experience.
  • ⚡ High-efficiency Inference: Introduces a Patch-by-Patch compression strategy that reduces the LLM inference frame rate to 3.1Hz, significantly cutting latency and enabling podcast-style audio generation while preserving naturalness and audio detail (see the sketch after this list).
  • 🧪 Professional Text Normalization: The model accurately parses and narrates complex formats, including mathematical expressions and chemical equations, ensuring natural-sounding output for specialized applications.
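
The reported numbers suggest how Patch-by-Patch compression works: grouping consecutive tokenizer frames into fixed-size patches shortens the sequence the LLM must process. Below is a minimal sketch of that length reduction; the patch size of 4 is an assumption inferred from the 12.5Hz-to-3.1Hz ratio (12.5 / 4 ≈ 3.1), not a published detail, and the latent width is arbitrary.

import torch

# Illustrative only: Patch-by-Patch compression as a sequence-length reduction.
TOKENIZER_HZ = 12.5
PATCH_SIZE = 4          # assumed; 12.5 / 4 = 3.125, matching the reported ~3.1Hz
LATENT_DIM = 256        # assumed latent width, for illustration

# 10 seconds of audio -> 125 continuous tokenizer frames
frames = torch.randn(1, 125, LATENT_DIM)            # (batch, T, dim)

# Group consecutive frames into patches so the LLM sees ~4x fewer positions.
T = frames.shape[1] - frames.shape[1] % PATCH_SIZE  # drop the ragged tail
patches = frames[:, :T].reshape(1, T // PATCH_SIZE, PATCH_SIZE * LATENT_DIM)

print(patches.shape)              # torch.Size([1, 31, 1024]) -> ~3.1 patches/sec
print(TOKENIZER_HZ / PATCH_SIZE)  # 3.125

In the real model, a DiT head would expand each patch back into tokenizer frames at decode time; only the length reduction on the input side is shown here.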

Evaluation

  • Reconstruction: The 12Hz tokenizer supports high-quality reconstruction across speech, music, and sound. Its performance is comparable to existing state-of-the-art methods across key fidelity metrics.
  • Dialect Generation: Achieves 96% accuracy on WSYue-TTS-Eval and 86% on WSC-TTS-Eval, outperforming CosyVoice3.
  • Emotional Expressiveness: Delivers an average accuracy of 76.7% on CV3-Eval emotional sets and 46.7% on neutral emotion sets, significantly surpassing CosyVoice3-Base (40%) and reaching SOTA levels.
  • Instruction-based Voice Design: Scores 76.20% on InstructTTS-Eval-ZH; its instruction-following capability is on par with Qwen3-TTS-VoiceDesign.
  • Zero-shot Voice Cloning: Exhibits exceptional stability on Seed-tts-eval (Chinese) with a WER of 0.83%, outperforming SeedTTS and GLM-TTS.
  • Text Normalization (TN): On internal technical test sets, the model achieves a CER of 1.97% in normalized regions, delivering performance comparable to Gemini-2.5 Pro. (A sketch of how WER/CER figures like these are computed follows this list.)
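
For context on the WER/CER figures above, the snippet below shows the conventional way such metrics are computed from reference and hypothesis transcripts using the open-source jiwer library. The transcripts are made-up placeholders, and this is not the official evaluation pipeline (which scores ASR output on synthesized audio).

# pip install jiwer
import jiwer

# Placeholder transcripts, for illustration only.
reference  = "the tokenizer reconstructs speech music and sound"
hypothesis = "the tokenizer reconstructs speech music and sounds"

wer = jiwer.wer(reference, hypothesis)  # word error rate
cer = jiwer.cer(reference, hypothesis)  # character error rate (used for Chinese/TN)
print(f"WER: {wer:.2%}  CER: {cer:.2%}")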

Example Usage

Audio Reconstruction

git clone https://github.com/inclusionAI/MingTok-Audio.git
cd MingTok-Audio
python3 test.py
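
test.py is the repository's own entry point. For orientation, here is a minimal sketch of the encode/decode round trip it presumably performs; every name below (the mingtok_audio module, MingTokAudio, encode, decode) is a hypothetical placeholder, not the actual MingTok-Audio API.

import torchaudio

# Hypothetical API, for illustration only -- consult the repository for real usage.
from mingtok_audio import MingTokAudio  # assumed module/class name

tokenizer = MingTokAudio.from_pretrained("inclusionAI/Ming-omni-tts-tokenizer-12Hz")

wav, sr = torchaudio.load("input.wav")  # any speech, music, or sound clip
latents = tokenizer.encode(wav, sr)     # ~12 continuous latents per second
recon = tokenizer.decode(latents)       # reconstruct the waveform from latents

torchaudio.save("recon.wav", recon, sr)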

Audio Generation

git clone https://github.com/inclusionAI/Ming-omni-tts.git
cd Ming-omni-tts
python3 cookbooks/test.py
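
cookbooks/test.py is the authoritative example. As a hedged sketch only, instruction-controlled generation might look like the following; the module, class, and method names are assumptions, not the published API.

import soundfile as sf

# Hypothetical API, for illustration only.
from ming_omni_tts import MingOmniTTS  # assumed module/class name

model = MingOmniTTS.from_pretrained("inclusionAI/Ming-omni-tts")

# Instruction-controlled synthesis: dialect, emotion, speed, pitch, etc.
# are steered with a natural-language instruction alongside the target text.
audio, sr = model.synthesize(
    text="The molar mass of H2SO4 is 98.08 g/mol.",
    instruction="Speak in Cantonese, cheerful, slightly fast.",
)

sf.write("output.wav", audio, sr)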