New version for M2.5
Minimax M2.5 just dropped https://huggingface.co/MiniMaxAI/MiniMax-M2.5
I had a great experience with your fp8-int4-awq quant, could you re-create it for M2.5? Thanks.
Thanks for the extremely good quants!
Since it's like a single line of config, could you please include Greek in the calibration data ("calibrate_software_engineer.yaml")? Or other European langs perhaps?
But even better, since I don't like just requesting things without helping, I created a small curated dataset based on EuroBlocks that contains 34 samples, each one in a different language. The entry for your YAML (I think I got it right):
```yaml
- dataset: droussis/euroblocks_sft_1sample_per_lang
  split: train
  columns: [conversations]
  formatter: chat_completion
  num_samples: 34
```
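If you want to sanity-check it before wiring it into your config, a quick peek like this should do (column layout as in the YAML above, nothing fancy):

```python
# Quick look at the calibration dataset (column names as in the YAML above).
from datasets import load_dataset

ds = load_dataset("droussis/euroblocks_sft_1sample_per_lang", split="train")
print(len(ds))                    # should be 34, one sample per language
print(ds.column_names)            # should contain "conversations"
print(ds[0]["conversations"][0])  # first turn of the first sample (assuming a list of chat turns)
```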
I'm asking because a recurring problem I've seen is that AWQ models are often lobotomized in languages where the original (e.g. FP8) model shines.
Thank you very much in advance and keep up the good work!
My script was already cooking and I missed this convo.
Thanks for the dataset, I wasn't aware of it when I looked for multilingual data. I'll try to integrate it for the next quants, or for this one if it somehow fails (that happens sometimes, with OOM or running out of disk space ...).
Given how diverse the dataset is, I hope all experts see activation spikes that reach their true maximum, so that AWQ quantizes them correctly.
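If I wanted to actually verify that instead of hoping, a rough check would be to hook the expert projections during calibration and record the peak activation each one sees. A sketch (not my actual script, and it assumes the experts are nn.Linear modules with "experts" in their module path, which is the usual MoE layout):

```python
# Record the peak |input activation| seen by each expert projection during the
# calibration forward passes. Experts whose peak stays ~0 never saw
# representative tokens and will get bogus AWQ scales.
import torch

def track_expert_peaks(model, expert_keyword="experts"):
    peaks = {}

    def make_hook(name):
        def hook(module, inputs, output):
            peak = inputs[0].detach().abs().max().item()
            peaks[name] = max(peaks.get(name, 0.0), peak)
        return hook

    handles = [
        module.register_forward_hook(make_hook(name))
        for name, module in model.named_modules()
        if expert_keyword in name and isinstance(module, torch.nn.Linear)
    ]
    return peaks, handles  # remove the hook handles after calibration
```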
The dequantized-BF16 + INT4-AWQ is out https://huggingface.co/mratsim/Minimax-M2.5-BF16-INT4-AWQ
Next up is replacing the dequantized-BF16 self-attention with the original FP8.
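The rough idea, not the exact script (tensor name patterns and shard paths below are placeholders, and the real thing also has to carry over the FP8 scale tensors and update the model config): walk the original FP8 shards and copy every self-attention tensor over its dequantized-BF16 counterpart, leaving the INT4-AWQ experts untouched.

```python
# Sketch: overwrite dequantized-BF16 self-attention tensors with the original
# FP8 ones (illustrative paths and name patterns, not the real pipeline).
from safetensors.torch import load_file, save_file

original = load_file("MiniMax-M2.5/model-00001.safetensors")                  # FP8 source
quantized = load_file("Minimax-M2.5-BF16-INT4-AWQ/model-00001.safetensors")   # AWQ target

for name, tensor in original.items():
    if "self_attn" in name:        # assumed naming, check the real state dict
        quantized[name] = tensor   # FP8 weights and their scale tensors

save_file(quantized, "out/model-00001.safetensors")
```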
I've been thinking about this:
> a recurring problem I've seen is that AWQ models are often lobotomized in languages where the original (e.g. FP8) model shines
I think that's because most people don't create modeling files to ensure all experts get calibrated, like I've done in https://github.com/vllm-project/llm-compressor/pull/2171.
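Simplified, the trick in that modeling file boils down to something like this (attribute names assumed, the real code is in the PR): during calibration, push every token through every expert so the observers see real activation ranges, while the layer's actual output stays the normally routed one.

```python
# Calibration-only wrapper: every expert sees every token, so no expert's
# activation range is underestimated; the returned output is still the
# normal routed forward. (Sketch, not the PR's modeling file.)
import torch

class CalibrateAllExperts(torch.nn.Module):
    def __init__(self, moe_block):
        super().__init__()
        self.moe_block = moe_block

    def forward(self, hidden_states):
        for expert in self.moe_block.experts:  # assumed attribute name
            _ = expert(hidden_states)          # only to feed the observers/hooks
        return self.moe_block(hidden_states)   # real output: routed as usual
```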
Missing that coverage leads to the kind of degradation I mention in my README:
Credits: https://avtc.github.io/aquarium-side-by-side/
If, say, Greek and math are delegated to specific experts, and those experts never receive any tokens because the calibration set is too narrow, the AWQ algorithm might assume their activation range is -0.001 to +0.001; then, when they do receive Greek, they are effectively lobotomized.
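Over-simplified toy numbers (AWQ clips weights using activation statistics rather than fake-quantizing activations directly, but the failure mode is the same): once the observed range is ~0.001, anything real gets crushed into it.

```python
# Toy illustration: quantize against a range learned from calibration data.
import torch

def fake_quant(x, observed_max, bits=4):
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit symmetric
    scale = observed_max / qmax
    return (x / scale).round().clamp(-qmax - 1, qmax) * scale

greek = torch.tensor([0.8, -1.2, 0.5])

print(fake_quant(greek, observed_max=1.2))    # calibration saw Greek: ~[0.86, -1.2, 0.51]
print(fake_quant(greek, observed_max=0.001))  # it never did: everything pinned near ±0.001
```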
Re MiniMax-M2.5: following accuracy-degradation concerns after using the new batch_size=32 feature in LLM Compressor, I have re-uploaded the quants with batch_size=1, to ensure my calibration dataset is passed as-is and not truncated to the shortest sequence in each batch. Please re-download for the highest quality! (See the thread: https://huggingface.co/mratsim/MiniMax-M2.5-BF16-INT4-AWQ/discussions/4)
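A toy picture of the failure mode (not LLM Compressor's actual code): if a batch is cut to its shortest member, long calibration conversations are mostly thrown away, while batch_size=1 keeps every sample at full length.

```python
# Toy collator that truncates a batch to its shortest sequence.
samples = [list(range(4096)), list(range(57))]   # one long and one short sample

def naive_batch(batch):
    shortest = min(len(s) for s in batch)
    return [s[:shortest] for s in batch]

print([len(s) for s in naive_batch(samples)])        # batch of 2 -> [57, 57]
print([len(naive_batch([s])[0]) for s in samples])   # batch_size=1 -> [4096, 57]
```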
The new quants also include @droussis's Greek request and dataset.
Thank you very much!