FP8 + INT4 version

#2
by bigstorm - opened

Good day - will you also be releasing a version that keeps the original FP8 sections rather than dequantizing them to BF16, to save a bit of space when deploying to 2x RTX 6000s?

Thanks

Yes, I just haven't had time on my PC beyond launching the initial quantization overnight.

This is exactly my use case! Looking forward to deploying your FP8-AWQ quants :D on my 2x 6000 Pros.

Looking forward to this too!

Wow - there are a lot of 2x RTX 6000 Pro folks. Would y'all want to create a group somewhere to exchange configs? I assume we are all aligned on getting the max performance out of our machines.

Haha, yes there are a few! I featured @mratsim 's M2.1 FP8-AWQ quant on my small YouTube channel here:
https://youtu.be/nMks3l0SFKU

bigstorm changed discussion status to closed
bigstorm changed discussion status to open

Looking forward to this too!
Thanks for mratsim's work!
This may be the best model series for setups with 192 GB of GPU RAM; mine is 4x 4090 48G, and both M2.1 (FP8+AWQ) and M2.5 (BF16+AWQ) work well!

Thanks folks! Currently requanting due to the discussion in https://huggingface.co/mratsim/MiniMax-M2.5-BF16-INT4-AWQ/discussions/4

TL;DR: I tried the new batch_size feature from llmcompressor at batch size 32, but this may have led to calibration data being truncated, so I'm now passing examples one by one to ensure the quality is as good as M2.1.
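
For reference, a rough sketch of what the one-shot AWQ calibration call looks like (not the exact recipe or dataset used for this repo; the model id, dataset, and sample count are placeholders, and kwarg names may differ between llm-compressor versions):

# Hedged sketch: AWQ W4A16 one-shot calibration with llm-compressor, feeding
# calibration samples one at a time (batch_size=1) so long examples are not
# truncated by batched collation.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "MiniMaxAI/MiniMax-M2"  # placeholder; use the actual base model
NUM_SAMPLES, MAX_LEN = 512, 2048

tok = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", trust_remote_code=True)

# Render each chat into one string, then tokenize it; no batched collation here.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")
ds = ds.map(lambda ex: {"text": tok.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda ex: tok(ex["text"], max_length=MAX_LEN, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

recipe = AWQModifier(targets=["Linear"], scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_LEN,
    num_calibration_samples=NUM_SAMPLES,
    batch_size=1,  # pass calibration examples one by one (the new knob discussed above)
)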

New quant with batch_size=1 uploaded

@mratsim do you have benchmark numbers that validate the degradation they were talking about in the other thread? Given what the user was saying, I'm more interested in getting something concrete now.

I could run a couple of benchmarks on each (Luke's NVFP4, BS 32 AWQ, and BS 1 AWQ); I have the BS 32 weights locally.

I'll get back to you on this a bit later today.
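
For reference, something along these lines should work for the comparison (a rough sketch using lm-evaluation-harness on the vLLM backend; the model path, task, and tensor-parallel size are placeholders):

# Hypothetical benchmark run: score one quant on GSM8K with lm-evaluation-harness
# over vLLM, tensor-parallel across 2 GPUs. Repeat per checkpoint and compare.
lm_eval --model vllm \
  --model_args pretrained=/models/MiniMax-M2.5-INT4-AWQ-bs1,tensor_parallel_size=2,gpu_memory_utilization=0.90 \
  --tasks gsm8k \
  --batch_size auto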

@mratsim , something has gone wrong using the newly updated model:
safetensors_rust.SafetensorError: Error while deserializing header: invalid JSON in header: control character (\u0000-\u001F) found while parsing a string at line 1 column 683114

Do you have more of your surrounding logs? Something like this

[image: example of surrounding logs]

I tested in vLLM and SGLang and it loads fine.

From the logs it seems like one of the safetensors files has metadata that causes a problem, but vLLM and SGLang both use safetensors, so ...
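
If you want to narrow down which shard is failing, a quick check like this should work (rough sketch; the directory is whatever your local snapshot path is):

# Try to parse each shard's header without loading weights; the corrupt one raises.
from pathlib import Path
from safetensors import safe_open

model_dir = Path("/models/MiniMax-M2.5-BF16-INT4-AWQ")  # placeholder local path
for shard in sorted(model_dir.glob("*.safetensors")):
    try:
        with safe_open(str(shard), framework="pt") as f:  # only reads the JSON header
            f.metadata()
        print(f"OK   {shard.name}")
    except Exception as e:
        print(f"BAD  {shard.name}: {e}")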

If using vLLM, can you run either pip freeze or

wget https://raw.githubusercontent.com/vllm-project/vllm/main/vllm/collect_env.py
# For security purposes, please feel free to check the contents of collect_env.py before running it.
python collect_env.py

In my vLLM docker I have safetensors==0.7.0

And for SGLang, pip freeze or

python3 -m sglang.check_env

I use the same safetensors==0.7.0 with the latest SGLang from Jan 23.

@mratsim , solved! After deleting the entire model and downloading it again, the issue was resolved.
Maybe something went wrong while downloading the safetensors.
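
For anyone else hitting this: a sketch assuming the huggingface_hub client, in case you can identify the corrupt shard and want to re-fetch only that file instead of the whole repo (repo and file names are placeholders):

# Force re-download of a single corrupted shard, bypassing the cached copy.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="mratsim/MiniMax-M2.5-BF16-INT4-AWQ",  # placeholder: the repo you are loading
    filename="model-00001-of-00050.safetensors",    # placeholder: the shard that failed to parse
    force_download=True,
)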

@mratsim Is it worth waiting for an FP8 + INT4 version, or are the results and VRAM use so similar that it's basically equivalent? I'm on Hopper, so FP8 should be faster, but it's also only a small fraction of the weights that are in BF16, I think.
