Choppy sound

#14
by acatovic - opened

I got everything up and running using https://github.com/NVIDIA/personaplex, but I get choppy sound and the system is unusable. My specs: 64 GiB RAM and an RTX 5070 with 12 GiB VRAM. Are these specs sufficient, and if so, why do I get the choppiness?

I will try to reproduce your setup to see why you get the choppiness. Could you verify whether your inference is running on the GPU or is somehow falling back to the CPU?
Also, could you try this suggested installation fix for Blackwell-based GPUs:
https://github.com/NVIDIA/personaplex/issues/2
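
A quick way to verify the GPU question, assuming a standard PyTorch install:

python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"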

Nice, I'll try that fix and get back to you with results and extra info.

That didn't help unfortunately, and I can confirm it's running on the GPU, since I can see memory usage and utilization via nvidia-smi -l 1
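
For a more targeted view than the full nvidia-smi dump, these standard query flags print just memory use and GPU utilization once per second:

nvidia-smi --query-gpu=memory.used,utilization.gpu --format=csv -l 1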

I tried the offline/evaluation method and that works fine (and confirmed inference is on the GPU), i.e. I ran

python -m moshi.offline --voice-prompt "NATF2.pt" --input-wav "input_assistant.wav" --seed 42424242 --output-wav "output.wav" --output-text "output.json"

See the enclosed output WAV.

So it's likely not the model itself but the whole streaming (frontend<->server) setup. I previously built my own local voice assistant, i.e. ASR->LLM->TTS, and recall having some issues with the streaming, so I landed on a non-dynamic approach (see https://github.com/acatovic/ova).

I will poke around a bit more.

Others are raising this choppiness problem with Blackwell GPUs as well. I am working on reproducing the problem and finding a fix.

Strange, I have been getting torch.OutOfMemoryError on my 5090, even with the offline/evaluation method.

NVIDIA updated their GitHub repo earlier today to add a lowvram flag to the launch command. I'm able to run it now on a 5090 with no problem.

I am also having the same issue with my RTX 5070 Ti. Could you let us know how this could be fixed?

The reason you are experiencing choppy sound is likely GPU offloading to the CPU. Unless you have high-frequency DDR5 memory, performance won't be great. If you don't have CPU offload enabled and you set a max memory cap, then you are likely to hit the out-of-memory exception.
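
For what it's worth, a "max memory cap" like the one described above can be set in plain PyTorch like this (a minimal sketch; the actual PersonaPlex option may differ):

import torch

# Cap this process at 90% of the GPU's VRAM; allocations beyond
# the cap raise torch.OutOfMemoryError.
torch.cuda.set_per_process_memory_fraction(0.9, device=0)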

On my Docker container I used TORCHDYNAMO_DISABLE=1 to avoid having to figure out a Triton error I was getting. I'm not sure if using Triton would improve memory.
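
For reference, that environment variable can be passed at container launch; the image name below is a placeholder:

docker run --gpus all -e TORCHDYNAMO_DISABLE=1 personaplex:latest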

I set max memory to 0.9 on my RTX 3090 (24 GB); it allocates ~20 GB of VRAM.
[screenshot: ~20 GB VRAM allocated]

On stream initialization it spikes to 100% of available memory. If I didn't cap max memory, I would likely also get an out-of-memory exception.
[screenshot: memory spike at stream initialization]

During conversation it can spike to use all available memory, but speech output generally seems to take up roughly 18 GB of VRAM.
[screenshot: ~18 GB VRAM during conversation]
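
If you want the same numbers without screenshots, these standard PyTorch calls log current and peak allocation from inside the process:

import torch

# VRAM currently allocated by this process, and the peak since startup, in GB
print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.1f} GB")
print(f"peak:      {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")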

Can my 3080 20 GB run this model? My system memory is 32 GB DDR4.

Might be OK; hard to tell, you are right there at the edge. Royrajarshi pushed a change to his GitHub repo that reduces memory use on initialization. The screen captures I shared were from before that change. I think you should be OK. There are also some quantizations released that claim to reduce memory to 16 GB. I haven't tried the quants.
