update

Files changed:

- README.md (+258 -3)
- modeling_moss_tts.py (+35 -20)
- processing_moss_tts.py (+32 -56)

README.md CHANGED
@@ -1,3 +1,258 @@

Removed (the old README was only the license front matter):

```diff
- ---
- license: apache-2.0
- ---
```

Added (the new README in full):
---
license: apache-2.0
---

# MOSS-TTS Family

## Overview

MOSS‑TTS Family is an open‑source **speech and sound generation model family** from [MOSI.AI](https://mosi.cn/#hero) and the [OpenMOSS team](https://www.open-moss.com/). It is designed for **high fidelity**, **high expressiveness**, and **complex real‑world scenarios**, covering stable long‑form speech, multi‑speaker dialogue, voice/character design, environmental sound effects, and real‑time streaming TTS.

## Introduction

<p align="center">
<img src="./assets/moss_tts_family.jpeg" width="85%" />
</p>

When a single piece of audio needs to **sound like a real person**, **pronounce every word accurately**, **switch speaking styles across content**, **remain stable over tens of minutes**, and **support dialogue, role‑play, and real‑time interaction**, a single TTS model is often not enough. The **MOSS‑TTS Family** breaks the workflow into five production‑ready models that can be used independently or composed into a complete pipeline.

- **MOSS‑TTS**: The flagship, production-ready text-to-speech foundation model in the family, built to ship, scale, and power real-world voice applications beyond demos. Its core capability is high-fidelity zero-shot voice cloning, complemented by ultra-long speech generation, token-level duration control, multilingual and code-switched synthesis, and fine-grained Pinyin/phoneme pronunciation control. Together, these features make it a robust base model for scalable narration, dubbing, and voice-driven products.
- **MOSS‑TTSD**: A production-oriented long-form spoken dialogue generation model for creating highly expressive, multi-party conversational audio at scale. It supports continuous long-duration generation, flexible multi-speaker turn-taking control, and zero-shot voice cloning from short reference audio, enabling natural conversations with rich interaction dynamics. It is designed for real-world long-form content such as podcasts, audiobooks, commentary, dubbing, and entertainment dialogue.
- **MOSS‑VoiceGenerator**: An open-source voice design system that generates speaker timbres directly from free-form text descriptions, enabling fast creation of voices for characters, personalities, and emotions without requiring reference audio. It unifies timbre design, style control, and content synthesis in a single instruction-driven model, producing high-fidelity, emotionally expressive speech that feels naturally human. It can be used standalone for creative production, or as a voice design layer for downstream TTS systems.
- **MOSS‑SoundEffect**: A high-fidelity sound effect generation model built for real-world content creation, offering strong environmental richness, broad category coverage, and reliable duration controllability. Trained on large-scale, high-quality data, it generates consistent audio from text prompts across natural ambience, urban scenes, creatures, human actions, and music-like clips. It is well suited for film and game production, interactive experiences, and data synthesis pipelines.
- **MOSS‑TTS‑Realtime**: A context-aware, multi-turn streaming TTS foundation model designed for real-time voice agents. Unlike conventional TTS, which synthesizes each reply in isolation, it conditions generation on multi-turn dialogue history, including both textual and acoustic signals from prior user speech, so responses stay coherent, consistent, and natural across turns. With low-latency incremental synthesis and strong voice stability, it enables truly conversational, human-like real-time speech experiences.

## Released Models

| Model | Architecture | Size | Model Card | Hugging Face |
|---|---|---:|---|---|
| **MOSS-TTS** | MossTTSDelay | 8B | [moss_tts_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/moss_tts_model_card.md) | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-TTS) |
| | MossTTSLocal | 1.7B | [moss_tts_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/moss_tts_model_card.md) | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Local-Transformer) |
| **MOSS‑TTSD‑V1.0** | MossTTSDelay | 8B | [moss_ttsd_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/moss_ttsd_model_card.md) | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-TTSD-v1.0) |
| **MOSS‑VoiceGenerator** | MossTTSDelay | 1.7B | [moss_voice_generator_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/moss_voice_generator_model_card.md) | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-Voice-Generator) |
| **MOSS‑SoundEffect** | MossTTSDelay | 8B | [moss_sound_effect_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/moss_sound_effect_model_card.md) | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-SoundEffect) |
| **MOSS‑TTS‑Realtime** | MossTTSRealtime | 1.7B | [moss_tts_realtime_model_card.md](https://github.com/OpenMOSS/MOSS-TTS/blob/main/moss_tts_realtime_model_card.md) | 🤗 [Hugging Face](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Realtime) |

<br>

# MOSS-TTSD

**MOSS-TTSD** is a long-form spoken dialogue generation model that enables highly expressive multi-party conversational speech synthesis across multiple languages. It supports continuous long-duration generation, flexible multi-speaker dialogue control, and state-of-the-art zero-shot voice cloning with only short reference audio. MOSS-TTSD is designed for real-world long-form content creation, including podcasts, audiobooks, sports and esports commentary, dubbing, crosstalk, and entertainment scenarios.

## 1. Overview

### 1.1 TTS Family Positioning

MOSS-TTSD is the long-form dialogue specialist in our open-source TTS Family. While our foundation models focus on high-fidelity single-speaker synthesis, MOSS-TTSD extends this capability into the realm of complex, multi-party interactions. It is designed to bridge the gap between isolated audio clips and cohesive, continuous conversation.

**Design Goals**
- **Authentic Interaction**: Capturing the natural rhythm, overlaps, and dynamics of human conversation.
- **Sustained Coherence**: Maintaining speaker identity and contextual consistency over extended durations (up to 1 hour).
- **Production Adaptability**: Serving diverse high-end scenarios, from rigorous audiobook narration to dynamic sports commentary.

### 1.2 Key Capabilities

MOSS-TTSD transforms static text into living conversations, offering features specifically optimized for multi-speaker environments:

- **Multi-Party Conversational Generation** — Unlike traditional TTS, which is optimized for reading, MOSS-TTSD masters the rhythm of conversation. It supports 1 to 5 speakers with flexible control, handling natural turn-taking, overlapping speech patterns, and distinct persona maintenance (see the illustrative script after this list).

- **Extreme Long-Context Modeling** — Moving beyond short-sentence generation, the model is architected for stability over long durations, supporting up to 60 minutes of coherent audio in a single session without losing speaker identity or prosodic quality.

- **Diverse Scenario Adaptation** — The model is fine-tuned on high-variability scenarios to handle different speaking styles:
  - Conversational media: AI podcasts, interviews.
  - Dynamic commentary: high-energy sports/esports shouting and analysis.
  - Entertainment: audiobooks (narrator + characters), dubbing, and crosstalk (Xiangsheng).

- **Multilingual & Zero-Shot Cloning** — State-of-the-art zero-shot voice cloning that requires only short reference audio (3–10 s), with robust cross-lingual performance across major languages including Chinese, English, Japanese, and European languages.
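
As a concrete reference for the speaker-tag format, here is a short illustrative script. The wording is invented for illustration; the `[S1]`/`[S2]` tag convention matches the Quick Start example below:

```python
# Illustrative only: inline speaker tags ([S1], [S2], ...) mark turn boundaries.
text_to_generate = (
    "[S1] Welcome back to the show. Today we're talking about open-source TTS. "
    "[S2] Glad to be here. Shall we start with zero-shot voice cloning? "
    "[S1] Absolutely. Walk us through how the reference audio works."
)
```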

### 1.3 Model Architecture

MOSS-TTSD is built on **Architecture A: Delay Pattern (MossTTSDelay)** from our MOSS-TTS foundation model — a single Transformer backbone with multi-head parallel prediction that uses delay scheduling for multi-codebook audio tokens.
<!-- For full architecture details, see **`moss_tts_delay/moss_tts_delay_architecture.md`**. -->
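
To make the delay pattern concrete, the sketch below staggers multi-codebook tokens so that codebook `k` at frame `t` is emitted at step `t + k`, while the backbone predicts all codebooks in parallel at each step. This is a minimal illustration, not the repository's implementation (`apply_delay_pattern` and the pad value are assumed names):

```python
import torch

def apply_delay_pattern(codes: torch.Tensor, pad: int) -> torch.Tensor:
    """Stagger an (n_vq, T) token grid: row k is shifted right by k steps,
    so codebook k can condition on codebooks < k of the same frame."""
    n_vq, T = codes.shape
    delayed = torch.full((n_vq, T + n_vq), pad, dtype=codes.dtype)
    for k in range(n_vq):
        delayed[k, k : k + T] = codes[k]
    return delayed

codes = torch.arange(12).reshape(3, 4)  # 3 codebooks, 4 frames
print(apply_delay_pattern(codes, pad=-1))
# tensor([[ 0,  1,  2,  3, -1, -1, -1],
#         [-1,  4,  5,  6,  7, -1, -1],
#         [-1, -1,  8,  9, 10, 11, -1]])
```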

### 1.4 Released Models

| Model | Architecture | NVQ | Parameters |
|-------|-------------|-----|------------|
| MOSS-TTSD | Architecture A: Delay Pattern (MossTTSDelay) | 16 | 8B |

**Recommended decoding hyperparameters**

| Model | audio_temperature | audio_top_p | audio_top_k | audio_repetition_penalty |
|---|---:|---:|---:|---:|
| **MOSS-TTSD** | 1.1 | 0.9 | 50 | 1.1 |

## 2. Quick Start

MOSS-TTSD uses a **continuation** workflow: provide reference audio for each speaker, their transcripts as a prefix, and the dialogue text to generate. The model then continues in each speaker's identity.

```python
from pathlib import Path

import torch
import torchaudio
from transformers import AutoModel, AutoProcessor

pretrained_model_name_or_path = "OpenMOSS-Team/MOSS-TTSD"
audio_tokenizer_name_or_path = "OpenMOSS-Team/MOSS-Audio-Tokenizer"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

processor = AutoProcessor.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
    codec_path=audio_tokenizer_name_or_path,
)
processor.audio_tokenizer = processor.audio_tokenizer.to(device)
processor.audio_tokenizer.eval()

model = AutoModel.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    torch_dtype=dtype,
).to(device)
model.eval()

# --- Inputs ---
prompt_audio_speaker1 = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_02_s1.wav"
prompt_audio_speaker2 = "https://speech-demo.oss-cn-shanghai.aliyuncs.com/moss_tts_demo/tts_readme_demo/reference_02_s2.wav"
prompt_text_speaker1 = "[S1] In short, we embarked on a mission to make America great again for all Americans."
prompt_text_speaker2 = "[S2] NVIDIA reinvented computing for the first time after 60 years. In fact, Erwin at IBM knows quite well that the computer has largely been the same since the 60s."

text_to_generate = "[S1] Listen, let's talk business. China. I'm hearing things. People are saying they're catching up. Fast. What's the real scoop? Their AI—is it a threat? [S2] Well, the pace of innovation there is extraordinary, honestly. They have the researchers, and they have the drive. [S1] Extraordinary? I don't like that. I want us to be extraordinary. Are they winning? [S2] I wouldn't say winning, but their progress is very promising. They are building massive clusters. They're very determined. [S1] Promising. There it is. I hate that word. When China is promising, it means we're losing. It's a disaster, Jensen. A total disaster."

# --- Load & resample audio ---
target_sr = int(processor.model_config.sampling_rate)
wav1, sr1 = torchaudio.load(prompt_audio_speaker1)
wav2, sr2 = torchaudio.load(prompt_audio_speaker2)

# Downmix to mono and resample to the codec's sampling rate.
if wav1.shape[0] > 1:
    wav1 = wav1.mean(dim=0, keepdim=True)
if wav2.shape[0] > 1:
    wav2 = wav2.mean(dim=0, keepdim=True)
if sr1 != target_sr:
    wav1 = torchaudio.functional.resample(wav1, sr1, target_sr)
if sr2 != target_sr:
    wav2 = torchaudio.functional.resample(wav2, sr2, target_sr)

# --- Build conversation ---
# Per-speaker reference codes condition the voices; the concatenated prompt
# audio is the assistant prefix that the model continues from.
reference_audio_codes = processor.encode_audios_from_wav([wav1, wav2], sampling_rate=target_sr)
concat_prompt_wav = torch.cat([wav1, wav2], dim=-1)
prompt_audio = processor.encode_audios_from_wav([concat_prompt_wav], sampling_rate=target_sr)[0]

full_text = f"{prompt_text_speaker1} {prompt_text_speaker2} {text_to_generate}"

conversations = [
    [
        processor.build_user_message(
            text=full_text,
            reference=reference_audio_codes,
        ),
        processor.build_assistant_message(
            audio_codes_list=[prompt_audio]
        ),
    ],
]

# --- Inference ---
batch_size = 1

save_dir = Path("output")
save_dir.mkdir(exist_ok=True, parents=True)
sample_idx = 0
with torch.no_grad():
    for start in range(0, len(conversations), batch_size):
        batch_conversations = conversations[start : start + batch_size]
        batch = processor(batch_conversations, mode="continuation")
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=2000,
        )

        for message in processor.decode(outputs):
            for seg_idx, audio in enumerate(message.audio_codes_list):
                torchaudio.save(save_dir / f"{sample_idx}_{seg_idx}.wav", audio.unsqueeze(0), processor.model_config.sampling_rate)
            sample_idx += 1
```
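
If `flash-attn` is not installed (or the GPU does not support it), loading with PyTorch's built-in SDPA attention should work as a drop-in fallback; this variant is a sketch and untested here:

```python
# Fallback loading without flash-attn (uses PyTorch scaled-dot-product attention).
model = AutoModel.from_pretrained(
    pretrained_model_name_or_path,
    trust_remote_code=True,
    attn_implementation="sdpa",
    torch_dtype=dtype,
).to(device)
model.eval()
```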

### Input Types

**UserMessage**

| Field | Type | Required | Description |
|---|---|---:|---|
| `text` | `str` | Yes | Full dialogue text, including speaker tags (`[S1]`, `[S2]`, ...) and the prompt transcripts. |
| `reference` | `List` | Yes | Per-speaker reference audio codes from `processor.encode_audios_from_wav()`. |

**AssistantMessage**

| Field | Type | Required | Description |
|---|---|---:|---|
| `audio_codes_list` | `List` | Yes | Concatenated prompt audio codes for all speakers. |

### Generation Hyperparameters

| Parameter | Type | Default | Description |
|---|---|---:|---|
| `max_new_tokens` | `int` | — | Total number of generated audio tokens. **1 s ≈ 12.5 tokens.** |
| `audio_temperature` | `float` | 1.1 | Higher values increase variation; lower values stabilize prosody. |
| `audio_top_p` | `float` | 0.9 | Nucleus sampling cutoff. |
| `audio_top_k` | `int` | 50 | Top-k sampling. |
| `audio_repetition_penalty` | `float` | 1.1 | Values > 1.0 discourage repeating patterns. |
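
Since one second of audio corresponds to roughly 12.5 tokens, a target duration converts directly into `max_new_tokens`. A sketch combining this with the recommended decoding settings (the extra headroom for text/control tokens is a guess, not an official value):

```python
target_seconds = 120  # desired length of the generated dialogue
max_new_tokens = int(target_seconds * 12.5) + 64  # ~12.5 audio tokens/s plus headroom

outputs = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_new_tokens=max_new_tokens,
    audio_temperature=1.1,
    audio_top_p=0.9,
    audio_top_k=50,
    audio_repetition_penalty=1.1,
)
```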

## 3. Evaluation

### Objective Evaluation (TTSD-eval)

We introduce a robust evaluation framework that leverages **MMS-FA** for alignment and **wespeaker** for embedding extraction to ensure precise speaker attribution.

- **Method**: Forced-alignment-based segmentation plus similarity-based speaker verification (see the sketch after this list).
- **Metrics**:
  - **Speaker Attribution Accuracy (ACC)**
  - **Speaker Similarity (SIM)**
  - **Word Error Rate (WER)**, computed with **Whisper-large-v3**.
- **Dataset**: 100 multi-turn dialogues (CN/EN) spanning 30 s–720 s, covering diverse scenarios including podcasts, TV dubbing, and crosstalk. Code and data coming soon.
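
Schematically, attribution cuts the generated audio into per-tag segments via forced alignment, embeds each segment, and assigns it to the nearest reference speaker. A minimal sketch of the accuracy computation, assuming segments have already been aligned with MMS-FA and embedded with wespeaker (neither API is reproduced here):

```python
import torch.nn.functional as F

def speaker_attribution_acc(segment_embs, segment_speakers, ref_embs):
    """segment_embs: per-segment speaker embeddings; segment_speakers:
    ground-truth tags like 'S1'; ref_embs: {tag: reference embedding}."""
    correct = 0
    for emb, true_tag in zip(segment_embs, segment_speakers):
        sims = {tag: F.cosine_similarity(emb, ref, dim=0).item()
                for tag, ref in ref_embs.items()}
        pred = max(sims, key=sims.get)  # assign segment to most similar reference
        correct += pred == true_tag
    return correct / len(segment_speakers)
```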
+
<br>
|
| 231 |
+
|
| 232 |
+
| Model | ZH - SIM | ZH - ACC | ZH - WER | EN - SIM | EN - ACC | EN - WER |
|
| 233 |
+
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
|
| 234 |
+
| **Comparison with Open-Source Models** | | | | | | |
|
| 235 |
+
| MOSS-TTSD | **0.7949** | **0.9587** | **0.0485** | **0.7326** | **0.9626** | 0.0988 |
|
| 236 |
+
| MOSS-TTSD v0.7 | 0.7423 | 0.9391 | 0.0517 | 0.6743 | 0.9266 | 0.1612 |
|
| 237 |
+
| Vibevoice 7B | 0.7590 | 0.9222 | 0.0570 | 0.7140 | 0.9554 | **0.0946** |
|
| 238 |
+
| Vibevoice 1.5 B | 0.7415 | 0.8798 | 0.0818 | 0.6961 | 0.9353 | 0.1133 |
|
| 239 |
+
| FireRedTTS2 | 0.7383 | 0.9022 | 0.0768 | - | - | - |
|
| 240 |
+
| Higgs Audio V2 | - | - | - | 0.6860 | 0.9025 | 0.2131 |
|
| 241 |
+
| **Comparison with Proprietary Models** | | | | | | |
|
| 242 |
+
| Eleven V3 | 0.6970 | 0.9653 | **0.0363** | 0.6730 | 0.9498 | **0.0824** |
|
| 243 |
+
| MOSS-TTSD (elevenlabs_voice) | **0.8165** | **0.9736** | 0.0391 | **0.7304** | **0.9565** | 0.1005 |
|
| 244 |
+
| | | | | | | |
|
| 245 |
+
| gemini-2.5-pro-preview-tts | - | - | - | 0.6786 | 0.9537 | **0.0859** |
|
| 246 |
+
| gemini-2.5-flash-preview-tts | - | - | - | 0.7194 | 0.9511 | 0.0871 |
|
| 247 |
+
| MOSS-TTSD (gemini_voice) | - | - | - | **0.7893** | **0.9655** | 0.0984 |
|
| 248 |
+
| | | | | | | |
|
| 249 |
+
| Doubao_Podcast | 0.8034 | 0.9606 | **0.0472** | - | - | - |
|
| 250 |
+
| MOSS-TTSD (doubao_voice) | **0.8226** | **0.9630** | 0.0571 | - | - | - |
|
| 251 |
+
|
| 252 |
+
### Subjective Evaluation
|
| 253 |
+
For open-source models, annotators are asked to score each sample pair in terms of speaker attribution accuracy, voice similarity, prosody, and overall quality. Following the methodology of the LMSYS Chatbot Arena, we compute Elo ratings and confidence intervals for each dimension.
|
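
For reference, a single Elo update from one pairwise judgment looks as follows; the K-factor and 400-point scale are the conventional choices, not necessarily the exact settings used here, and the Arena methodology additionally bootstraps confidence intervals (not shown):

```python
def elo_update(r_a: float, r_b: float, winner: str, k: float = 32.0):
    """One pairwise Elo update; winner is 'a', 'b', or 'tie'."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    r_a += k * (score_a - expected_a)
    r_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a, r_b
```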

![subjective_1](./assets/subjective_1.png)

For closed-source models, annotators only choose the overall preferred sample in each pair, and we compute the win rate accordingly.

![subjective_2](./assets/subjective_2.png)
![subjective_3](./assets/subjective_3.png)

modeling_moss_tts.py CHANGED

```diff
@@ -395,7 +395,7 @@ class MossTTSDelayModel(MossTTSDelayPreTrainedModel):
         input_ids: torch.LongTensor,
         attention_mask: Optional[torch.Tensor] = None,
         max_new_tokens: Optional[int] = None,
-        text_temperature: float = 1.,
+        text_temperature: float = 1.1,
         text_top_p: float = 0.9,
         text_top_k: int = 50,
         audio_temperature: Optional[float] = None,
@@ -460,14 +460,14 @@ class MossTTSDelayModel(MossTTSDelayPreTrainedModel):
         generation_ids = input_ids[:]
         is_stopping = torch.zeros(batch_size, dtype=torch.bool, device=device)
 
-        #
-        audio_lengths = torch.zeros(batch_size, dtype=torch.int64, device=device)  # 0
+        # Three phases: 1) non-audio, 2) audio generation before delay, 3) delayed audio.
+        audio_lengths = torch.zeros(batch_size, dtype=torch.int64, device=device)  # 0 means phase 1.
         torch_int64_max = torch.iinfo(torch.int64).max
-        delayed_lengths = torch.full((batch_size,), torch_int64_max, dtype=torch.int64, device=device)  #
+        delayed_lengths = torch.full((batch_size,), torch_int64_max, dtype=torch.int64, device=device)  # int64 max means phase 2.
 
-        #
-        # NOTE
-        #
+        # Handle continuation where audio_start is already present in input_ids.
+        # NOTE: delayed-audio continuation is currently not handled.
+        # Support both continuation and fresh generation.
         is_continuation = (input_ids[:, -1, 0] == self.config.audio_start_token_id) | (input_ids[:, -1, 0] == self.config.audio_assistant_gen_slot_token_id)
         audio_start_indices = find_last_equal_C(input_ids[..., 0], self.config.audio_start_token_id)
         audio_start_mask = is_continuation & (audio_start_indices != -1)
@@ -480,7 +480,7 @@ class MossTTSDelayModel(MossTTSDelayPreTrainedModel):
         pre_exclude_mask1[[self.config.audio_assistant_gen_slot_token_id, self.config.audio_assistant_delay_slot_token_id]] = False
 
 
-        #
+        # time_step is a generation step, not the absolute dialogue position under continuation.
         for time_step in tqdm(range(max_new_tokens), desc=f"Generating bs{batch_size} ..."):
             outputs = self(
                 input_ids=current_input_ids,
@@ -492,9 +492,10 @@ class MossTTSDelayModel(MossTTSDelayPreTrainedModel):
 
             next_token_logits = [logit[:, -1, :] / text_temperature if logit_idx == 0 else logit[:, -1, :] / audio_temperature for logit_idx, logit in enumerate(outputs.logits)]  # List, len=n_vq+1, [batch_size, 1, vocab_size]
             next_token_logits[0] = next_token_logits[0].clone()
-            # 1
+            # 1) Process the text token first.
             next_text_token = torch.full((batch_size,), self.config.pad_token_id, device=device)
-            #
+            # The second delay-slot token and audio_end are fixed; audio_start, each gen-slot
+            # token, and the first delay-slot token are sampled.
             next_text_token[~is_stopping & (delayed_lengths < n_vq)] = self.config.audio_assistant_delay_slot_token_id
             is_audio_eos = ~is_stopping & (delayed_lengths == n_vq)
             next_text_token[is_audio_eos] = self.config.audio_end_token_id
@@ -507,7 +508,7 @@ class MossTTSDelayModel(MossTTSDelayPreTrainedModel):
             if time_step <= n_vq:
                 next_token_logits[0][..., self.config.im_end_token_id] = float('-inf')
 
-            #
+            # No repetition penalty on the text channel.
             next_text_token[sampling_text_mask] = sample_token(
                 logits=next_token_logits[0][sampling_text_mask],
                 top_p=text_top_p,
@@ -515,15 +516,15 @@ class MossTTSDelayModel(MossTTSDelayPreTrainedModel):
                 do_sample=text_do_sample
             )
             is_audio[next_text_token == self.config.audio_start_token_id] = True
-            #
+            # Single stop condition: next_text_token == <|im_end|>.
             is_stopping[next_text_token == self.config.im_end_token_id] = True
 
-            # 2
-            # audio_start
+            # 2) Then process the audio tokens.
+            # Outside [audio_start, audio_end], keep pad; only fill valid positions.
             next_audio_tokens = torch.full((batch_size, n_vq), self.config.audio_pad_code, device=device)
 
-            #
-            #
+            # Build masks based on distance from audio_start.
+            # True means this position should contain a real audio token.
             pre_audio_mask = audio_lengths.unsqueeze(1) > torch.arange(n_vq, dtype=int, device=device).expand(batch_size, n_vq)
             post_audio_mask = torch.arange(n_vq, dtype=int, device=device).expand(batch_size, n_vq) > delayed_lengths.unsqueeze(1) - 1
             post_audio_mask[delayed_lengths == torch_int64_max] = True
@@ -531,18 +532,32 @@ class MossTTSDelayModel(MossTTSDelayPreTrainedModel):
             next_audio_tokens[~sampling_audio_mask] = self.config.audio_pad_code
 
             if sampling_audio_mask.sum() > 0:
-                audio_logits = torch.stack(next_token_logits[1:], dim=1)[sampling_audio_mask]  # torch.stack -> [batch_size, n_vq - 1, vocab_size]
+                # audio_logits = torch.stack(next_token_logits[1:], dim=1)[sampling_audio_mask]  # torch.stack -> [batch_size, n_vq - 1, vocab_size]
+                audio_logits = torch.stack(next_token_logits[2:], dim=1)[sampling_audio_mask[:, 1:]]
                 audio_logits[..., self.config.audio_pad_code] = float('-inf')
-                next_audio_tokens[sampling_audio_mask] = sample_token(
+
+                # Sample codebook 0 separately so its repetition penalty only sees its own history.
+                audio_ch0_logits = next_token_logits[1][sampling_audio_mask[:, 0]]
+                audio_ch0_logits[..., 1024] = float('-inf')
+                next_audio_tokens[:, 0][sampling_audio_mask[:, 0]] = sample_token(
+                    logits=audio_ch0_logits,
+                    prev_tokens=generation_ids[:, :, 1],
+                    repetition_penalty=audio_repetition_penalty,
+                    top_p=audio_top_p,
+                    top_k=audio_top_k,
+                    do_sample=audio_do_sample
+                )
+                # print(f"{next_audio_tokens[:, 0][sampling_audio_mask[:, 0]] = }")
+                next_audio_tokens[:, 1:][sampling_audio_mask[:, 1:]] = sample_token(
                     logits=audio_logits,
-                    prev_tokens=generation_ids[:, :, 1:],
+                    prev_tokens=generation_ids[:, :, 2:],
                     repetition_penalty=audio_repetition_penalty,
                     top_p=audio_top_p,
                     top_k=audio_top_k,
                     do_sample=audio_do_sample
                 )
+                # print(f"{next_audio_tokens[:, 1:][sampling_audio_mask[:, 1:]] = }")
 
-            #
+            # Update audio_lengths and delayed_lengths for direct use in the next step.
             # audio_lengths[(next_text_token == self.audio_start_token_id) & (audio_lengths > 0)] += 1
             # audio_lengths[(next_text_token == self.audio_start_token_id) | (next_text_token == self.audio_assistant_gen_slot_token_id)] += 1
             audio_lengths[(next_text_token == self.config.audio_start_token_id) | (next_text_token == self.config.audio_assistant_gen_slot_token_id) | (next_text_token == self.config.audio_assistant_delay_slot_token_id)] += 1
```

processing_moss_tts.py CHANGED

```diff
@@ -555,7 +555,7 @@ class MossTTSDelayProcessor(ProcessorMixin):
         truncation: bool,
     ) -> torch.Tensor:
         """
-
+        content is already formatted with the conversation template.
         """
         if role == "user":
             audio_gen_slot_token = audio_delay_slot_token = self.audio_user_slot_token
@@ -740,8 +740,8 @@ class MossTTSDelayProcessor(ProcessorMixin):
 
     def decode(self, output: List[Tuple[int, torch.Tensor]]):
         """
-        1.
-        2.
+        1. Always require complete assistant generation ids.
+        2. Support truncation from arbitrary positions.
         """
 
         genearted_messages = []
@@ -927,58 +927,34 @@ class MossTTSDelayProcessor(ProcessorMixin):
             for codes in audio_tokens_list
         ]
 
-        audio = dec.audio
-        audio_lengths = dec.audio_lengths
-        if audio is None:
-            raise RuntimeError("audio_tokenizer.decode() returned empty audio.")
-            )
-            wav_list.append(wav.contiguous().to(torch.float32).cpu())
-        return wav_list
-
-        if hasattr(audio_tokenizer, "batch_decode"):
-            dec = audio_tokenizer.batch_decode(codes_list)
-            audio = dec.audio
-            audio_lengths = dec.audio_lengths
-            if audio is None or audio_lengths is None:
-                raise RuntimeError(
-                    "audio_tokenizer.batch_decode() returned empty outputs (audio/audio_lengths)."
-                )
-            wav_list = []
-            for i in range(int(audio.shape[0])):
-                length_i = int(audio_lengths[i].item())
-                wav = audio[i, 0, :length_i].contiguous().to(torch.float32).cpu()
-                wav_list.append(wav)
-            return wav_list
-
-        raise RuntimeError("audio_tokenizer has neither decode() nor batch_decode().")
+        # Fallback: pad to (NQ, B, T) + mask, then decode.
+        nq = int(codes_list[0].shape[0])
+        max_t = max(int(c.shape[1]) for c in codes_list)
+        audio_codes = torch.zeros(
+            nq, len(codes_list), max_t, device=device, dtype=torch.long
+        )
+        padding_mask = torch.zeros(
+            len(codes_list), max_t, device=device, dtype=torch.bool
+        )
+        for i, c in enumerate(codes_list):
+            t = int(c.shape[1])
+            audio_codes[:, i, :t] = c
+            padding_mask[i, :t] = True
+        dec = audio_tokenizer.decode(
+            audio_codes, padding_mask=padding_mask, return_dict=True, chunk_duration=8
+        )
+        audio = dec.audio
+        audio_lengths = dec.audio_lengths
 
+        if audio is None or audio_lengths is None:
+            raise RuntimeError(
+                "audio_tokenizer.decode() returned empty outputs (audio/audio_lengths)."
+            )
 
+        # Return historical contract: a list of 1D waveforms (T,).
+        wav_list: List[torch.Tensor] = []
+        for i in range(int(audio.shape[0])):
+            length_i = int(audio_lengths[i].item())
+            wav = audio[i, 0, :length_i].contiguous().to(torch.float32).cpu()
+            wav_list.append(wav)
+        return wav_list
```