FP8 + INT4 version
Good day - will you also be releasing a version that keeps the original FP8 sections rather than dequantizing them to BF16, to save a bit of space when deploying to 2x RTX 6000s?
Thanks
Yes, I just didn't have time on my PC beyond launching the initial quantization overnight.
This is exactly my use case! Looking forward to deploying your FP8-AWQ quants :D on my 2x 6000 Pros
Looking forward to this too!
Wow - there are a lot of 2x RTX 6000 Pro folks. Would y'all want to create a group somewhere to exchange configs? I assume we are all aligned on getting the max performance out of our machines...
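To kick off the config exchange, here's roughly my starting point; a minimal sketch using vLLM's offline Python API, with a hypothetical repo id (the FP8+INT4 upload isn't out yet) and untuned values:

```python
# Sketch: 2-GPU sanity-check load of the quant; repo id is hypothetical
# and the flag values are a starting point, not a tuned config.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mratsim/MiniMax-M2.5-FP8-INT4-AWQ",  # hypothetical repo id
    tensor_parallel_size=2,
    max_model_len=32768,
    gpu_memory_utilization=0.90,
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```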
Haha, yes there are a few! I featured @mratsim's M2.1 FP8-AWQ quant on my small YouTube channel here:
https://youtu.be/nMks3l0SFKU
Looking forward to it too!
Thanks for mratsim's work!
This may be the best model series for rigs with 192 GB of GPU RAM; mine is 4x 4090 48G, and both M2.1 (FP8+AWQ) and M2.5 (BF16+AWQ) work well!
Thanks folks! Currently requanting due to the discussion in https://huggingface.co/mratsim/MiniMax-M2.5-BF16-INT4-AWQ/discussions/4
TL;DR: I tried the new batch_size feature from llmcompressor at batch size 32, but this may have led to the calibration data being truncated, so I'm now passing examples one by one to ensure quality as good as M2.1.
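For reference, the requant is roughly the following; a minimal sketch assuming llm-compressor's `oneshot` entry point and `AWQModifier`, with placeholder model id and calibration set (argument names may differ across versions):

```python
# Sketch only: AWQ requant with calibration examples fed one at a time.
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "MiniMaxAI/MiniMax-M2.5"  # placeholder model id

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

recipe = [
    AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"]),
]

oneshot(
    model=model,
    dataset="open_platypus",      # placeholder calibration set
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    batch_size=1,  # one example per forward pass, avoiding truncation
)

model.save_pretrained("MiniMax-M2.5-INT4-AWQ-bs1", save_compressed=True)
tokenizer.save_pretrained("MiniMax-M2.5-INT4-AWQ-bs1")
```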
New quant with batch_size=1 uploaded
@mratsim do you have benchmark numbers that validate the degradation they were talking about in the other thread? Given what the user was saying, I'm more interested in getting something concrete now.
I could run a couple of benchmarks on each (Luke's NVFP4, BS 32 AWQ, and BS 1 AWQ); I have the BS 32 weights locally.
I'll get back to you on this a bit later today.
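If useful, this is the kind of apples-to-apples run I have in mind; a sketch assuming lm-evaluation-harness's Python API with the vLLM backend, where the path and task are placeholders (one checkpoint per process, since vLLM doesn't reliably free GPU memory between instantiations):

```python
# Sketch: score one checkpoint on one task; repeat per quant
# (NVFP4, BS 32 AWQ, BS 1 AWQ) and compare the metrics.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    # placeholder local path
    model_args="pretrained=/models/MiniMax-M2.5-BF16-INT4-AWQ,tensor_parallel_size=2",
    tasks=["gsm8k"],  # placeholder task
)
print(results["results"]["gsm8k"])
```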
Do you have more of the surrounding logs? Something like this:
I tested in vLLM and SGLang and it loads fine.
From the logs it seems like one of the safetensors files has metadata that causes problems, but vLLM and SGLang both use safetensors, so ...
If using vLLM, can you run either `pip freeze` or:

```bash
wget https://raw.githubusercontent.com/vllm-project/vllm/main/vllm/collect_env.py
# For security purposes, please feel free to check the contents of collect_env.py before running it.
python collect_env.py
```
In my vLLM docker I have safetensors==0.7.0
And for SGLang, `pip freeze` or:

```bash
python3 -m sglang.check_env
```
I use the same safetensors==0.7.0 with the latest SGLang from Jan 23.
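You could also sanity-check whether each shard's header parses at all; a minimal sketch using the standard safetensors Python API, with a placeholder local path:

```python
# Sketch: flag shards whose safetensors header or tensor index is corrupt,
# e.g. from an interrupted or damaged download.
from pathlib import Path
from safetensors import safe_open

model_dir = Path("./MiniMax-M2.5-BF16-INT4-AWQ")  # placeholder local path

for shard in sorted(model_dir.glob("*.safetensors")):
    try:
        with safe_open(shard, framework="pt") as f:
            f.metadata()    # raises if the header metadata is corrupt
            list(f.keys())  # raises if the tensor index is broken
        print(f"OK   {shard.name}")
    except Exception as err:
        print(f"FAIL {shard.name}: {err}")
```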
@mratsim , solved! After deleting the entire model and downloading it again, the issue was resolved.
Maybe something went wrong while downloading the safetensors files.
@mratsim Is it worth waiting for an FP8 + INT4 version, or are the results and VRAM usage so similar that it's basically equivalent? I'm on Hopper, so FP8 should be faster, but it's also only a small fraction of the weights that are in BF16, I think.
