multimodal, audio, speech, llms
Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens