accuracy
@mratsim I have a test that I run with agents to gauge the quality of a model. The test is simple: a prompt describes BigQuery data and their relations, then I ask the agents questions about those data. I also run the corresponding queries with a script to get the expected answers and compare accuracy. This is multi-turn, the prompt is big, and there is a lot of data; the agents need 3-4 turns to find the answers, so the context is about 80-100k tokens for most tests.
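The scoring itself is nothing fancy, roughly the loop below (a minimal sketch; `ask_agent` and `expected_answer_from_bigquery` are hypothetical stand-ins for the agent session and the BigQuery script, not the actual code):

```python
# Minimal sketch of the accuracy check (hypothetical helpers, not the real script).
# ask_agent(): runs the multi-turn agent session and returns its final answer.
# expected_answer_from_bigquery(): runs the corresponding SQL and returns the ground truth.
def score_run(questions, ask_agent, expected_answer_from_bigquery):
    errors = 0
    for q in questions:
        agent_answer = ask_agent(q)                  # 3-4 turns, ~80-100k token context
        expected = expected_answer_from_bigquery(q)  # ground truth from the real data
        if str(agent_answer).strip() != str(expected).strip():
            errors += 1
    return errors, len(questions)
```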
I tested this quant vs @lukealonso's NVFP4, 3 times each.
The AWQ quant consistently made 10-12 errors out of 52 questions. @lukealonso's NVFP4 version has a much smaller error rate of 4-5 errors per run.
So, somehow they are now flipped: MiniMax-M2.5 is more accurate in NVFP4 than in AWQ.
Ah! I would love to share, but these are real BigQuery data, not a dataset I can share.
I will run the tests again...
Interesting, I might requant then.
The only thing I changed compared to the previous quant is using batch_size=32 from the llm-compressor release.
I see that the default is to truncate, but I might change it to padding, or change batch_size to 1.
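For context, the difference comes down to how the calibration samples are prepared once they are batched. A minimal sketch with a plain Hugging Face tokenizer (the actual llm-compressor arguments may differ; the tokenizer here is just illustrative):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer works for the illustration
tok.pad_token = tok.eos_token                # gpt2 has no pad token by default

samples = ["a short calibration sample", "a much longer calibration sample " * 200]

# Truncation: every sample is clipped to max_length, so long calibration
# prompts lose most of their tokens.
truncated = tok(samples, truncation=True, max_length=512, padding=False)

# Padding: short samples are padded up to the longest one instead,
# so no calibration tokens are thrown away.
padded = tok(samples, truncation=False, padding="longest")

print([len(ids) for ids in truncated["input_ids"]])  # short sample unchanged, long one cut to 512
print([len(ids) for ids in padded["input_ids"]])     # both padded to the same (full) length
```

With truncation and batch_size > 1, the longer calibration prompts get clipped, which is the suspected quality hit; padding (or batch_size=1) keeps them intact at the cost of calibration speed.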
@ktsaou
1). NVFP4 is within 1% of FP8, both per Nvidia's own tests and per what most have seen in the wild across most broad tests.
2). INT4, even with some tensors kept at BF16, is still INT4. In my edited vLLM with real PPL measurement, W4A16 deviates from FP8 in the ~7% range, whereas INT8 deviates by only ~0.018%.
3). NVFP4 will ALWAYS be more accurate than INT4. INT8 will almost always be more accurate than NVFP4.
4). @mratsim was playing with batch sizing, as I saw. It delivers INSANE speed, but I've seen ALL of the models I quanted deteriorate in accuracy when using ANY batch size. llm-compressor warns that truncation may occur, and EXTREME truncation occurs at batch sizes > 16.
TL;DR: this is normal for NVFP4 when compared to ANY INT4. @mratsim will requant at batch size 1, and I would expect fewer errors, but not down to NVFP4 levels.
@ktsaou If you want to compare the quant @mratsim did here, you should compare another person's plain W4A16 (INT4) against this one. That way you can see whether keeping some tensors at BF16 actually makes a difference, but remember: MAKE SURE YOU KNOW THE GROUP SIZE before comparing. A W4A16_GS32 is demonstrably better than a W4A16_GS128 when observing nuance and context.
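To make the group-size point concrete, here is a rough numeric illustration (a sketch on synthetic Gaussian weights, not a real model evaluation): quantize the same tensor with symmetric per-group INT4 scales at group size 32 vs 128 and compare the reconstruction error.

```python
import torch

def int4_groupwise_mse(w: torch.Tensor, group_size: int) -> float:
    """Symmetric per-group INT4 quantization (levels -8..7); returns mean squared error."""
    w = w.reshape(-1, group_size)
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0   # map the largest |w| in each group to 7
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return ((q * scale - w) ** 2).mean().item()

torch.manual_seed(0)
w = torch.randn(4096 * 4096)  # synthetic "weight" tensor

for gs in (32, 128):
    print(f"W4A16 group_size={gs}: MSE {int4_groupwise_mse(w, gs):.3e}")
```

Smaller groups give each scale fewer weights to cover, so outliers distort fewer neighbours and the reconstruction error drops, which is the intuition behind GS32 preserving more nuance than GS128.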
Thank you @shambler74. Yes, you're right. However, for MiniMax-M2.1 the quality was flipped between the two quant types. We were discussing this at https://huggingface.co/mratsim/MiniMax-M2.1-FP8-INT4-AWQ/discussions/9#698f3598ff0cc62f5009fb56 - @mratsim did a great job helping @lukealonso understand how to get a max-quality quant, and it seems it paid off.
New quant with batch_size=1 uploaded
