---
language:
- zh
- en
- de
- es
- fr
- ja
- it
- he
- ko
- ru
- fa
- ar
- pl
- pt
- cs
- da
- sv
- hu
- el
- tr
license: apache-2.0
library_name: transformers
pipeline_tag: text-to-speech
tags:
- text-to-speech
- audio-tokenizer
- moss
---

# MOSS-TTS Family

<br>

<p align="center">
  &nbsp;&nbsp;&nbsp;&nbsp;
  <img src="https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_imgaes_demo/openmoss_x_mosi" height="50" align="middle" />
</p>



<div align="center">
  <a href="https://github.com/OpenMOSS/MOSS-Audio-Tokenizer"><img src="https://img.shields.io/badge/Project%20Page-GitHub-blue"></a>
  <a href="https://modelscope.cn/collections/OpenMOSS-Team/MOSS-TTS"><img src="https://img.shields.io/badge/ModelScope-Models-lightgrey?logo=modelscope&amp"></a>
  <a href="https://mosi.cn/#models"><img src="https://img.shields.io/badge/Blog-View-blue?logo=internet-explorer&amp"></a>
  <a href="https://huggingface.co/papers/2602.10934"><img src="https://img.shields.io/badge/Arxiv-2602.10934-red?logo=arxiv&amp"></a>

  <a href="https://studio.mosi.cn"><img src="https://img.shields.io/badge/AIStudio-Try-green?logo=internet-explorer&amp"></a>
  <a href="https://studio.mosi.cn/docs/moss-tts"><img src="https://img.shields.io/badge/API-Docs-00A3FF?logo=fastapi&amp"></a>
  <a href="https://x.com/Open_MOSS"><img src="https://img.shields.io/badge/Twitter-Follow-black?logo=x&amp"></a>
  <a href="https://discord.gg/fvm5TaWjU3"><img src="https://img.shields.io/badge/Discord-Join-5865F2?logo=discord&amp"></a>
</div>

## Overview
MOSS‑TTS Family is an open‑source **speech and sound generation model family** from [MOSI.AI](https://mosi.cn/#hero) and the [OpenMOSS team](https://www.open-moss.com/). It is built upon the **MOSS-Audio-Tokenizer**, a unified discrete audio tokenizer based on the **CAT** (Causal Audio Tokenizer with Transformer) architecture presented in the paper [MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models](https://huggingface.co/papers/2602.10934).

## Sample Usage (Audio Reconstruction)

The tokenizer compresses audio into discrete tokens and reconstructs waveforms from those tokens.

```python
import torch
from transformers import AutoModel
import torchaudio

repo_id = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

# Load the audio and resample it to the tokenizer's sampling rate
wav, sr = torchaudio.load('path_to_audio.wav')
if sr != model.sampling_rate:
    wav = torchaudio.functional.resample(wav, sr, model.sampling_rate)
wav = wav.unsqueeze(0)  # add a leading batch dimension

# Run inference without tracking gradients
with torch.no_grad():
    # Encode audio to tokens
    enc = model.encode(wav, return_dict=True)
    print(f"enc.audio_codes.shape: {enc.audio_codes.shape}")

    # Decode tokens back to audio
    dec = model.decode(enc.audio_codes, return_dict=True)
    print(f"dec.audio.shape: {dec.audio.shape}")

wav_rec = dec.audio.squeeze(0)
torchaudio.save("reconstructed.wav", wav_rec, sample_rate=model.sampling_rate)
```

## Introduction

<p align="center">
  <img src="https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_imgaes_demo/moss_tts_family_arch.jpeg" width="85%" />
</p>


When a single piece of audio needs to **sound like a real person**, **pronounce every word accurately**, **switch speaking styles across content**, **remain stable over tens of minutes**, and **support dialogue, role‑play, and real‑time interaction**, a single TTS model is often not enough. The **MOSS‑TTS Family** breaks the workflow into five production‑ready models that can be used independently or composed into a complete pipeline.

- **MOSS‑TTS**: The flagship production TTS foundation model, centered on high-fidelity zero-shot voice cloning with controllable long-form synthesis, pronunciation, and multilingual/code-switched speech.
- **MOSS‑TTSD**: A production long-form dialogue model for expressive multi-speaker conversational audio at scale.
- **MOSS‑VoiceGenerator**: An open-source voice design model that creates speaker timbres directly from free-form text.
- **MOSS‑SoundEffect**: A high-fidelity text-to-sound model with broad category coverage and controllable duration.
- **MOSS‑TTS‑Realtime**: A context-aware, multi-turn streaming TTS model for real-time voice agents.


## Released Models

| Model | Architecture | Size | Hugging Face |
|---|---|---:|---|
| **MOSS-TTS** | MossTTSDelay | 8B | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-TTS) |
|  | MossTTSLocal | 1.7B | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Local-Transformer) |
| **MOSS‑TTSD‑V1.0** | MossTTSDelay | 8B | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-TTSD-v1.0) |
| **MOSS‑VoiceGenerator** | MossTTSDelay | 1.7B | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-Voice-Generator) |
| **MOSS‑SoundEffect** | MossTTSDelay | 8B | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-SoundEffect) |
| **MOSS‑TTS‑Realtime** | MossTTSRealtime | 1.7B | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Realtime) |
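
For scripts that need to pick a checkpoint programmatically, the table can be mirrored as a small lookup. This is only a sketch: the repository IDs are copied from the links in the table, while the dictionary layout and the `repo_ids_for` helper are hypothetical names introduced here and are not part of any MOSS package.

```python
from __future__ import annotations

# Repository IDs copied from the "Released Models" table.
# Keys are (model name, architecture); the structure itself is illustrative.
MOSS_REPOS = {
    ("MOSS-TTS", "MossTTSDelay"): "OpenMOSS-Team/MOSS-TTS",
    ("MOSS-TTS", "MossTTSLocal"): "OpenMOSS-Team/MOSS-TTS-Local-Transformer",
    ("MOSS-TTSD-V1.0", "MossTTSDelay"): "OpenMOSS-Team/MOSS-TTSD-v1.0",
    ("MOSS-VoiceGenerator", "MossTTSDelay"): "OpenMOSS-Team/MOSS-Voice-Generator",
    ("MOSS-SoundEffect", "MossTTSDelay"): "OpenMOSS-Team/MOSS-SoundEffect",
    ("MOSS-TTS-Realtime", "MossTTSRealtime"): "OpenMOSS-Team/MOSS-TTS-Realtime",
}


def repo_ids_for(model_name: str) -> list[str]:
    """Return every repository ID listed for a model name (hypothetical helper)."""
    return [repo for (name, _arch), repo in MOSS_REPOS.items() if name == model_name]


if __name__ == "__main__":
    # MOSS-TTS is released in two architectures, so two IDs are returned.
    for repo_id in repo_ids_for("MOSS-TTS"):
        print(f"https://huggingface.co/{repo_id}")
```

Each ID resolves to a model card at `https://huggingface.co/<repo_id>`; consult that card for the loading code of the specific model.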

## Supported Languages

MOSS-TTS, MOSS-TTSD, and MOSS-TTS-Realtime currently support **20 languages**: Chinese, English, German, Spanish, French, Japanese, Italian, Hebrew, Korean, Russian, Persian (Farsi), Arabic, Polish, Portuguese, Czech, Danish, Swedish, Hungarian, Greek, and Turkish.
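
A front end that routes text to these models may want to check a language tag before synthesis. The sketch below pairs the ISO 639-1 codes from the `language` list in this card's metadata with the names above. The `is_supported` helper is a hypothetical name introduced for illustration, and nothing here implies that the models themselves accept a language-code argument.

```python
# ISO 639-1 codes from the model card metadata, paired with the language
# names listed in "Supported Languages".
SUPPORTED_LANGUAGES = {
    "zh": "Chinese", "en": "English", "de": "German", "es": "Spanish",
    "fr": "French", "ja": "Japanese", "it": "Italian", "he": "Hebrew",
    "ko": "Korean", "ru": "Russian", "fa": "Persian (Farsi)", "ar": "Arabic",
    "pl": "Polish", "pt": "Portuguese", "cs": "Czech", "da": "Danish",
    "sv": "Swedish", "hu": "Hungarian", "el": "Greek", "tr": "Turkish",
}


def is_supported(language_tag: str) -> bool:
    """Check a tag such as 'en', 'en-US', or 'zh_CN' (hypothetical helper).

    Only the primary language subtag is compared; region and script
    subtags are ignored.
    """
    primary = language_tag.strip().lower().replace("_", "-").split("-")[0]
    return primary in SUPPORTED_LANGUAGES


if __name__ == "__main__":
    for tag in ("en-US", "zh_CN", "nl"):
        print(tag, is_supported(tag))
```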

## Evaluation
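MOSS-TTS achieves state-of-the-art results on the zero-shot TTS benchmark Seed-TTS-eval, rivaling the most powerful closed-source systems. The table reports word error rate (WER) for English, character error rate (CER) for Chinese, and speaker similarity (SIM) for both; ↓ means lower is better and ↑ means higher is better.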

| Model | EN WER (%) ↓ | EN SIM (%) ↑ | ZH CER (%) ↓ | ZH SIM (%) ↑ |
|---|---:|---:|---:|---:|
| MossTTSDelay (8B) | **1.79** | 71.46 | 1.32 | 77.05 |
| MossTTSLocal (1.7B) | 1.85 | **73.42** | **1.2** | **78.82** |
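
For readers unfamiliar with the WER and CER columns, both are edit-distance error rates: the number of substitutions, insertions, and deletions needed to turn a transcript of the synthesized speech into the reference text, divided by the reference length. The following is a minimal sketch of that standard definition only. It does not reproduce Seed-TTS-eval's ASR models or text normalization, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def error_rate(reference: str, hypothesis: str, unit: str = "word") -> float:
    """Return WER (unit='word') or CER (unit='char') as a percentage."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    elif unit == "char":
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    else:
        raise ValueError("unit must be 'word' or 'char'")
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(error_rate("the cat sat on the mat", "the cat sat on mat"))
    # One substituted character out of four reference characters.
    print(error_rate("你好世界", "你好世间", unit="char"))
```

Character-level scoring is the usual choice for Chinese because written Chinese has no whitespace word boundaries.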

## Citation
If you use this code or these results in your research, please cite:
```tex
@misc{gong2026mossaudiotokenizerscalingaudiotokenizers,
      title={MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models}, 
      author={Yitian Gong and Kuangwei Chen and Zhaoye Fei and Xiaogui Yang and Ke Chen and Yang Wang and Kexin Huang and Mingshu Chen and Ruixiao Li and Qingyuan Cheng and Shimin Li and Xipeng Qiu},
      year={2026},
      eprint={2602.10934},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2602.10934}, 
}
```