accuracy
@mratsim I have a test that I run with agents to gauge the quality of a model. The test is simple: a prompt describes BigQuery data and their relations, then I ask the agents questions about those data. I also run the corresponding queries with a script to get the expected answers and compare accuracy. This is multi-turn, the prompt is big, and there is a lot of data; the agents need 3-4 turns to find the answers, so the context is about 80-100k tokens for most tests.
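The scoring itself is nothing fancy, roughly the loop below (a minimal sketch; `ask_agent` and `expected_answer_from_bigquery` are hypothetical stand-ins for the agent session and the BigQuery script, not the actual code):

```python
# Minimal sketch of the accuracy check (hypothetical helpers, not the real script).
# ask_agent(): runs the multi-turn agent session and returns its final answer.
# expected_answer_from_bigquery(): runs the corresponding SQL and returns the ground truth.
def score_run(questions, ask_agent, expected_answer_from_bigquery):
    errors = 0
    for q in questions:
        agent_answer = ask_agent(q)                  # 3-4 turns, ~80-100k token context
        expected = expected_answer_from_bigquery(q)  # ground truth from the real data
        if str(agent_answer).strip() != str(expected).strip():
            errors += 1
    return errors, len(questions)
```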
I tested this quant vs @lukealonso's NVFP4, 3 times each.
The AWQ quant consistently made 10-12 errors out of 52 questions. @lukealonso's NVFP4 version has a much smaller error rate of 4-5 errors per run.
So, somehow they are now flipped: MiniMax-M2.5 is more accurate in NVFP4 than in AWQ.
Ah! I would love to share, but these are real BigQuery data, not a dataset I can share.
I will run the tests again...
Interesting, I might requant then.
The only thing I changed compared to the previous quant is using batch_size=32 from the llm-compressor release.
I see that the default is to truncate, but I might change it to padding, or change batch_size to 1.
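For context, the difference comes down to how the calibration samples are prepared once they are batched. A minimal sketch with a plain Hugging Face tokenizer (the actual llm-compressor arguments may differ; the tokenizer here is just illustrative):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer works for the illustration
tok.pad_token = tok.eos_token                # gpt2 has no pad token by default

samples = ["a short calibration sample", "a much longer calibration sample " * 200]

# Truncation: every sample is clipped to max_length, so long calibration
# prompts lose most of their tokens.
truncated = tok(samples, truncation=True, max_length=512, padding=False)

# Padding: short samples are padded up to the longest one instead,
# so no calibration tokens are thrown away.
padded = tok(samples, truncation=False, padding="longest")

print([len(ids) for ids in truncated["input_ids"]])  # short sample unchanged, long one cut to 512
print([len(ids) for ids in padded["input_ids"]])     # both padded to the same (full) length
```

With truncation and batch_size > 1, the longer calibration prompts get clipped, which is the suspected quality hit; padding (or batch_size=1) keeps them intact at the cost of calibration speed.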
@ktsaou
1). NVFP4 is within 1% of FP8, both per Nvidia's own tests and per what most have seen in the wild across most broad tests.
2). INT4, even with some tensors kept at BF16, is still INT4. In my edited vLLM with real PPL measurement, W4A16 deviates from FP8 in the ~7% range, whereas INT8 deviates by only ~0.018%.
3). NVFP4 will ALWAYS be more accurate than INT4. INT8 will almost always be more accurate than NVFP4.
4). @mratsim was playing with batch sizing, as I saw. It delivers INSANE speed, but I've seen ALL of the models I quanted deteriorate in accuracy when using ANY batch size. llm-compressor warns that truncation may occur, and EXTREME truncation occurs at batch sizes > 16.
TL;DR: this is normal for NVFP4 when compared to ANY INT4. @mratsim will requant at batch size 1, and I would expect fewer errors, but not down to NVFP4 levels.
@ktsaou If you want to compare the quant @mratsim did here, you should compare another person's plain W4A16 (INT4) against this one. That way you can see whether keeping some tensors at BF16 actually makes a difference, but remember: MAKE SURE YOU KNOW THE GROUP SIZE before comparing. A W4A16_GS32 is demonstrably better than a W4A16_GS128 when observing nuance and context.
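To make the group-size point concrete, here is a rough numeric illustration (a sketch on synthetic Gaussian weights, not a real model evaluation): quantize the same tensor with symmetric per-group INT4 scales at group size 32 vs 128 and compare the reconstruction error.

```python
import torch

def int4_groupwise_mse(w: torch.Tensor, group_size: int) -> float:
    """Symmetric per-group INT4 quantization (levels -8..7); returns mean squared error."""
    w = w.reshape(-1, group_size)
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0   # map the largest |w| in each group to 7
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return ((q * scale - w) ** 2).mean().item()

torch.manual_seed(0)
w = torch.randn(4096 * 4096)  # synthetic "weight" tensor

for gs in (32, 128):
    print(f"W4A16 group_size={gs}: MSE {int4_groupwise_mse(w, gs):.3e}")
```

Smaller groups give each scale fewer weights to cover, so outliers distort fewer neighbours and the reconstruction error drops, which is the intuition behind GS32 preserving more nuance than GS128.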
Thank you @shambler74. Yes, you're right. However, for MiniMax-M2.1 the quality was flipped between the two quant types. We were discussing this at https://huggingface.co/mratsim/MiniMax-M2.1-FP8-INT4-AWQ/discussions/9#698f3598ff0cc62f5009fb56 - @mratsim did a great job helping @lukealonso understand how to get a max-quality quant, and it seems it paid off.
New quant with batch_size=1 uploaded
