nvfp4

#9 opened by festr2

Hello,

if you compare to lukealonso/MiniMax-M2.1-NVFP4, which quant is better for the RTX 6000 Pro, and why: FP8-INT4-AWQ or that NVFP4?

Quality-wise, I would be surprised if lukealonso/MiniMax-M2.1-NVFP4 scored better than my quant, for three reasons:

1. All-expert calibration

His quant was published on Dec 29, 2025, while I published my all-expert calibration quantization script on Dec 28, so it's unlikely he used it (https://github.com/vllm-project/llm-compressor/pull/2171).

Actually, it seems he used ModelOpt, and ModelOpt also needs specific changes to activate all-expert calibration: https://github.com/NVIDIA/Model-Optimizer/issues/732

Failure to do so can lead to significant degradation; see
https://avtc.github.io/aquarium-side-by-side/

This is even more significant for NVFP4 because while AWQ is W4A16 (16-bit activations), NVFP4 is W4A4 (4-bit activations) meaning activation spikes can really be problematic.

And if your calibration set doesn't trigger spikes on ALL experts, the calibration for 4-bit activations will not take the full range of possible activations into account.
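To make the all-expert idea concrete, here is a minimal sketch of a calibration-only MoE forward where every token passes through every expert, so each expert's quantization observers see realistic activation ranges instead of only the tokens the router happens to select. The `gate`/`experts` attribute names and the top-k softmax routing are assumptions for illustration, not the actual MiniMax modeling code:

```python
import torch

class AllExpertCalibrationMoE(torch.nn.Module):
    """Calibration-only wrapper: every token visits every expert.

    The `gate`/`experts` attribute names and the top-k softmax routing are
    assumptions for illustration, not the actual MiniMax modeling code.
    """

    def __init__(self, moe_block: torch.nn.Module, top_k: int = 2):
        super().__init__()
        self.gate = moe_block.gate        # router linear layer (assumed name)
        self.experts = moe_block.experts  # ModuleList of expert MLPs (assumed name)
        self.top_k = top_k

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        tokens = hidden_states.reshape(-1, hidden_states.shape[-1])

        router_logits = self.gate(tokens)
        routing_weights = torch.softmax(router_logits, dim=-1)
        topk_weights, topk_ids = torch.topk(routing_weights, self.top_k, dim=-1)
        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)

        output = torch.zeros_like(tokens)
        for expert_id, expert in enumerate(self.experts):
            # Difference from inference: ALL tokens pass through EVERY expert,
            # so the quantization observers on each expert see the full
            # activation range ...
            expert_out = expert(tokens)

            # ... but only the normally routed tokens contribute to the output,
            # keeping the calibration forward pass faithful to inference.
            selected = (topk_ids == expert_id).to(expert_out.dtype)
            weight = (topk_weights * selected).sum(dim=-1, keepdim=True)
            output += weight * expert_out

        return output.reshape(hidden_states.shape)
```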

2. Self-attention quantization

As shown in the "Tensors to up-quantize" section of my model card:

Tensors to up-quantize

If there are enough bits, down projections should be prioritized.

According to [4]

Fig. 3: Maximum absolute value over layers for LLaMA3-8B. Each color represents a different projection, and we clearly see that down_proj has the biggest spikes in input and output. We also observe that RMSNorm propagates spikes through the entire model.
According to [5]

Figure 5(a) illustrates the extremal ratio across layers and modules in LLaMA2-7B, highlighting that weight outliers are concentrated in the down-projection matrices W^down_ℓ of the second layer and the last two layers. Figures 5(b) and 5(c) provide detailed visualizations of these outliers in the last two layers.
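If you want to reproduce this kind of spike analysis on your own model, a rough sketch (not part of my quantization pipeline) is to hook every Linear layer and record the maximum absolute input/output values over a few calibration batches:

```python
import torch

def attach_absmax_hooks(model: torch.nn.Module):
    """Track max |input| and max |output| per Linear layer during forwards.

    Handy for spotting which projections (e.g. down_proj) carry the largest
    activation spikes before deciding which tensors to keep in higher precision.
    """
    stats: dict[str, dict[str, float]] = {}

    def make_hook(name: str):
        def hook(module, inputs, output):
            x = inputs[0]
            entry = stats.setdefault(name, {"in": 0.0, "out": 0.0})
            entry["in"] = max(entry["in"], x.detach().abs().max().item())
            entry["out"] = max(entry["out"], output.detach().abs().max().item())
        return hook

    handles = [
        module.register_forward_hook(make_hook(name))
        for name, module in model.named_modules()
        if isinstance(module, torch.nn.Linear)
    ]
    return stats, handles

# Usage sketch: run a few calibration batches, then inspect the biggest spikes.
# stats, handles = attach_absmax_hooks(model)
# ... forward passes over calibration data ...
# for name, s in sorted(stats.items(), key=lambda kv: -kv[1]["in"])[:20]:
#     print(f"{name:60s} max|in|={s['in']:9.1f} max|out|={s['out']:9.1f}")
# for h in handles:
#     h.remove()
```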

Mixture-of-Experts quantization (MoE)

Mixture-of-Experts models require specific quantization techniques.

Mixed-precision quantization

Some layers have a higher impact on LLM performance.
According to [2], spending more bits in attention layers results in large gains compared to spending them in FFN layers.
According to [3] on 2-bit quantization:

  1. Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (2023). Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla. https://arxiv.org/pdf/2310.02410
  2. Precision Where It Matters: A Novel Spike Aware Mixed-Precision Quantization Strategy for LLaMA-based Language Models (2025). Lucas Maisonnave, Cyril Moineau, Olivier Bichler, Fabrice Rastello. https://arxiv.org/pdf/2504.21553
  3. Systematic Outliers in Large Language Models (2025). Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang. https://arxiv.org/pdf/2502.06415v2

Self-attention layers have a large impact, so I keep them in the original FP8.
lukealonso's quant only skips the gates but quantizes self-attention.
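As a rough illustration of what keeping self-attention out of the 4-bit scheme looks like with llm-compressor (the exact regex patterns depend on the model's actual module names, and the AWQModifier/oneshot arguments may differ between library versions, so treat this as a sketch rather than my exact recipe):

```python
# Sketch of an AWQ recipe that quantizes the MoE MLP weights to 4-bit while
# leaving lm_head, the MoE router gates, and all self-attention projections at
# the checkpoint's original precision (FP8 here). Module-name patterns are
# assumptions; adjust them to the actual MiniMax module names.
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

recipe = AWQModifier(
    targets=["Linear"],
    scheme="W4A16",
    ignore=[
        "lm_head",
        "re:.*gate$",        # MoE router gates
        "re:.*self_attn.*",  # q/k/v/o projections stay un-quantized
    ],
)

# oneshot(model=model, dataset=calibration_dataset, recipe=recipe,
#         max_seq_length=8192, num_calibration_samples=590)
```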

3. Calibration data

I used a large, wide-ranging calibration dataset: https://huggingface.co/mratsim/MiniMax-M2.1-FP8-INT4-AWQ/blob/main/calibrate_software_engineer.yaml

The goal was to capture all activation spikes for the best possible AWQ quantization.

calibration_set:
  _templates:
    programming_languages: &programming_languages "Solve the following problem using {{ ['Zephyr', 'Prolog', 'Cobol', 'Apex', 'Crystal', 'Fortran', 'Nim', 'Delphi', 'Ada', 'Objective-C', 'VBA', 'Perl', 'Groovy', 'MATLAB', 'Solidity', 'Visual Basic', 'OCaml', 'Erlang', 'Julia', 'Lisp', 'F#', 'Clojure', 'GDScript', 'Scala', 'R', 'Haskell', 'Ruby', 'Elixir', 'Lua', 'Zig', 'Dart', 'Swift', 'Metal', 'PowerShell', 'PHP', 'Kotlin', 'C', 'Java', 'C++', 'C#', 'Bash/Shell', 'Go', 'Rust', 'TypeScript', 'HTML/CSS', 'SQL', 'JavaScript', 'Python', 'Lean', 'Coq', 'Pony', 'D', 'Racket', 'Haxe', 'x86-64 ASM', 'ARM-64 ASM', 'LLVM IR', 'GLSL', 'CUDA', 'Vulkan'][hash(row|string) % 60] }}\n***\n"
    spoken_languages: &spoken_languages "Answer in {{ ['Arabic', 'Chinese', 'French', 'German', 'Hebrew', 'Hindi', 'Japanese', 'Korean', 'Portuguese', 'Russian', 'Spanish', 'Turkish'][hash(row|string) % 12] }}\n***\n"
  max_seq_length: 8192
  shuffle: true
  seed: 42
  datasets:
    
    # Category Summary (Total: 590 samples)
    # =====================================================
    # General chat (24 samples - 4.07%)
    # Instruction and Reasoning tuning (14 samples - 2.37%)
    # Multilingual (36 samples - 6.10%)
    # Tool use (100 samples - 16.95%)
    # Code / Programming / Software Engineering / Devops (328 samples - 55.59%)
    # Math (12 samples - 2.03%)
    # Sciences (16 samples - 2.71%)
    # Medical (8 samples - 1.36%)
    # Finance (8 samples - 1.36%)
    # Business (16 samples - 2.71%)
    # Humanities and Philosophy (8 samples - 1.36%)
    # Creative Writing, Adventure, Roleplay (13 samples - 2.20%)
    # General Knowledge and Pop Culture (2 samples - 0.34%)
    # Specialized skills (4 samples - 0.68%)
    # Misc (1 sample - 0.17%)
    # =====================================================
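For reference, those `{{ ... [hash(row|string) % N] }}` templates simply prepend a deterministic per-sample instruction so the prompts cover many programming and spoken languages. A rough Python equivalent of the programming-language template (hypothetical helper, not the actual YAML machinery, and with only a subset of the 60 languages):

```python
import hashlib

# A short subset of the 60 programming languages in the real config, for illustration.
PROGRAMMING_LANGUAGES = ["Fortran", "Haskell", "Rust", "CUDA", "SQL", "Python"]

def prepend_programming_language(row: dict, text: str) -> str:
    """Deterministically pick a language from the row contents and prepend a
    'Solve the following problem using <lang>' instruction, mirroring the
    `hash(row|string) % N` template above (the real template relies on the
    config engine's own hash, not sha256)."""
    digest = int(hashlib.sha256(repr(row).encode()).hexdigest(), 16)
    lang = PROGRAMMING_LANGUAGES[digest % len(PROGRAMMING_LANGUAGES)]
    return f"Solve the following problem using {lang}\n***\n{text}"

# Usage sketch:
# row = {"prompt": "Implement a binary search."}
# print(prepend_programming_language(row, row["prompt"]))
```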

Now, another angle is performance. I'm unsure about the state of NVFP4 kernels on the RTX Pro 6000; they might still fall back to the W4A16 Marlin kernel like GPTQ/AWQ, in which case speed, which is dominated by the MLP layers, will be the same. If there are proper NVFP4 kernels, then the NVFP4 quant should be faster.

@lukealonso I can kind of confirm this. I initially installed your NVFP4 version and then switched to @mratsim's AWQ quant. In parallel to my GPU install, I have a MiniMax coding subscription, so I am also using the real MiniMax-M2.1. Primary use: heavy agentic work with very long contexts.

MiniMax-M2.1 is a "sloppy" model; the real model has this issue natively. It looks and feels like Claude, but makes little stupid mistakes, like not paying enough attention to detail.

The NVFP4 version was clearly inferior to the real model; I put this down to poor NVFP4 support in vLLM. By contrast, I cannot find any clearly visible difference between the AWQ version and the real MiniMax. Sometimes I feel the opposite: the real MiniMax frequently injects Chinese into the output, and I have never seen this with the AWQ version (although this may be circumstantial).

Regarding speed: NVFP4 is about 5-10% faster than AWQ on single requests, and the NVFP4 version has significantly better throughput at scale (40+ parallel requests, +30-40% in output tokens/s). Both versions are significantly faster (20-30%+) than the real MiniMax, on a 2x RTX 6000 Pro Blackwell workstation.

And I just noticed MiniMax-M2.5 is out today! https://huggingface.co/MiniMaxAI/MiniMax-M2.5
We would really appreciate an update of your magic quants, guys!
You rock!

@ktsaou @mratsim Appreciate the tips. I think the issue here is definitely the quantization of the self-attention layers. I will try to produce new NVFP4 versions of 2.1 and 2.5. The rest of the potential issues (all experts, good dataset) are covered, I think.

Guys, you rock! Can't wait for your new quants of M2.5.

I have 8x RTX 6000 available if you need raw power to speed this up.

Currently cooking. Unfortunately, LLMCompressor can only quantize with a single GPU, so I'm using 1x RTX Pro 6000.

@lukealonso For all-expert quantization you need to modify ModelOpt itself and add specific modeling; see my investigation here https://github.com/NVIDIA/Model-Optimizer/issues/732 and what I did for LLMCompressor: https://github.com/vllm-project/llm-compressor/pull/2171

@mratsim Good job! Will you also create FP8+INT4, and do we even want to run FP8+INT4 on sm120 instead of BF16? And why? What would be the best available benchmarks that compare the precision loss versus the original M2.5 (what would you recommend)?

Preliminary tests show that the NVFP4 is much more stable in most of the benchmarks, unless I'm doing something wrong when testing it (the KV cache is BF16 when testing both the INT4 and the NVFP4).

Will you also create FP8+INT4, and do we even want to run FP8+INT4 on sm120 instead of BF16? And why?

Yes, I didn't have the time today to assemble the weights.

What would be the best available benchmarks that compare the precision loss versus the original M2.5 (what would you recommend)?

Ideally, you modify vLLM or SGLang to output the logits and compare the KL divergence between the original and the quant. Otherwise, do a vibe/smell test with the applications you use.

I opened an issue in llmcompressor for that: https://github.com/vllm-project/llm-compressor/issues/2031
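A minimal sketch of that KL-divergence comparison, assuming you have already dumped per-token logits for the same prompts from both the original and the quantized model (the dumping itself is the part that needs the vLLM/SGLang modification mentioned above):

```python
import torch
import torch.nn.functional as F

def mean_kl_divergence(ref_logits: torch.Tensor, quant_logits: torch.Tensor) -> float:
    """Average per-token KL(P_ref || P_quant).

    Both tensors are (num_tokens, vocab_size) logits collected on the same
    prompts from the original model and the quantized model. Lower is better;
    0 means the quant reproduces the original output distribution exactly.
    """
    ref_logprobs = F.log_softmax(ref_logits.float(), dim=-1)
    quant_logprobs = F.log_softmax(quant_logits.float(), dim=-1)
    kl = F.kl_div(quant_logprobs, ref_logprobs, log_target=True, reduction="none")
    return kl.sum(dim=-1).mean().item()

# Usage sketch (hypothetical file names):
# kl = mean_kl_divergence(torch.load("logits_bf16.pt"), torch.load("logits_nvfp4.pt"))
# print(f"mean per-token KL divergence: {kl:.4f} nats")
```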
