FP8 + INT4 version

#2
by bigstorm - opened

Good day - will you also be releasing a version that keeps the original FP8 sections rather than dequantizing them to BF16, to save a bit of space when deploying to 2x RTX 6000s?

Thanks

Yes, I just haven't had time on my PC beyond launching the initial quantization overnight.

This is exactly my use case! Looking forward to deploying your FP8-AWQ quants :D on my 2x 6000 Pros.

Looking forward to this too!

Wow - there are a lot of 2x RTX 6000 Pro folks. Would y'all want to create a group somewhere to exchange configs? I assume we are all aligned on getting the max performance out of our machines.

Haha, yes there are a few! I featured @mratsim 's M2.1 FP8-AWQ quant on my small YouTube channel here:
https://youtu.be/nMks3l0SFKU

bigstorm changed discussion status to closed
bigstorm changed discussion status to open

Looking forward to this too!
Thanks for mratsim's work!
This may be the best model series for setups with 192 GB of GPU RAM; mine is 4x 4090 48G, and both M2.1 (FP8+AWQ) and M2.5 (BF16+AWQ) work well!

Thanks folks! Currently requanting due to the discussion in https://huggingface.co/mratsim/MiniMax-M2.5-BF16-INT4-AWQ/discussions/4

TL;DR: I tried the new batch_size feature from llmcompressor at batch size 32, but this may have led to calibration data being truncated, so I'm now passing examples one by one to ensure the quality is as good as M2.1.
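
For reference, a rough sketch of what the one-shot AWQ calibration call looks like (not the exact recipe or dataset used for this repo; the model id, dataset, and sample count are placeholders, and kwarg names may differ between llm-compressor versions):

# Hedged sketch: AWQ W4A16 one-shot calibration with llm-compressor, feeding
# calibration samples one at a time (batch_size=1) so long examples are not
# truncated by batched collation.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "MiniMaxAI/MiniMax-M2"  # placeholder; use the actual base model
NUM_SAMPLES, MAX_LEN = 512, 2048

tok = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", trust_remote_code=True)

# Render each chat into one string, then tokenize it; no batched collation here.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")
ds = ds.map(lambda ex: {"text": tok.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda ex: tok(ex["text"], max_length=MAX_LEN, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

recipe = AWQModifier(targets=["Linear"], scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_LEN,
    num_calibration_samples=NUM_SAMPLES,
    batch_size=1,  # pass calibration examples one by one (the new knob discussed above)
)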

New quant with batch_size=1 uploaded

@mratsim do you have benchmark numbers that validate the degradation they were talking about in the other thread? Given what the user was saying, I'm more interested in getting something concrete now.

I could run a couple of benchmarks on each (Luke's NVFP4, BS 32 AWQ, and BS 1 AWQ); I have the BS 32 weights locally.

I'll get back to you on this a bit later today.
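
For reference, something along these lines should work for the comparison (a rough sketch using lm-evaluation-harness on the vLLM backend; the model path, task, and tensor-parallel size are placeholders):

# Hypothetical benchmark run: score one quant on GSM8K with lm-evaluation-harness
# over vLLM, tensor-parallel across 2 GPUs. Repeat per checkpoint and compare.
lm_eval --model vllm \
  --model_args pretrained=/models/MiniMax-M2.5-INT4-AWQ-bs1,tensor_parallel_size=2,gpu_memory_utilization=0.90 \
  --tasks gsm8k \
  --batch_size auto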

@mratsim , something has gone wrong using the newly updated model:
safetensors_rust.SafetensorError: Error while deserializing header: invalid JSON in header: control character (\u0000-\u001F) found while parsing a string at line 1 column 683114

Do you have more of your surrounding logs? Something like this

[image: example of surrounding logs]

I tested in vLLM and SGLang and it loads fine.

From the logs it seems like one of the safetensors files has metadata that causes a problem, but vLLM and SGLang both use safetensors, so ...
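
If you want to narrow down which shard is failing, a quick check like this should work (rough sketch; the directory is whatever your local snapshot path is):

# Try to parse each shard's header without loading weights; the corrupt one raises.
from pathlib import Path
from safetensors import safe_open

model_dir = Path("/models/MiniMax-M2.5-BF16-INT4-AWQ")  # placeholder local path
for shard in sorted(model_dir.glob("*.safetensors")):
    try:
        with safe_open(str(shard), framework="pt") as f:  # only reads the JSON header
            f.metadata()
        print(f"OK   {shard.name}")
    except Exception as e:
        print(f"BAD  {shard.name}: {e}")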

If using vLLM, can you run either pip freeze or

wget https://raw.githubusercontent.com/vllm-project/vllm/main/vllm/collect_env.py
# For security purposes, please feel free to check the contents of collect_env.py before running it.
python collect_env.py

In my vLLM docker I have safetensors==0.7.0

And for SGLang, pip freeze or

python3 -m sglang.check_env

I use the same safetensors==0.7.0 with the latest SGLang from Jan 23.

@mratsim , solved! After deleting the entire model and downloading it again, the issue was resolved.
Maybe something went wrong while downloading the safetensors.
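
For anyone else hitting this: a sketch assuming the huggingface_hub client, in case you can identify the corrupt shard and want to re-fetch only that file instead of the whole repo (repo and file names are placeholders):

# Force re-download of a single corrupted shard, bypassing the cached copy.
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="mratsim/MiniMax-M2.5-BF16-INT4-AWQ",  # placeholder: the repo you are loading
    filename="model-00001-of-00050.safetensors",    # placeholder: the shard that failed to parse
    force_download=True,
)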

@mratsim Is it worth waiting for an FP8 + INT4 version, or are the results and VRAM use so similar that it's basically equivalent? I'm on Hopper, so FP8 should be faster, but it's also only a small fraction of the weights that are in BF16, I think.
