nvfp4

#9 opened by festr2

Hello,

if you compare to lukealonso/MiniMax-M2.1-NVFP4, which quant is better for the RTX 6000 Pro, and why: FP8-INT4-AWQ or that NVFP4?

Quality-wise, I would be surprised if lukealonso/MiniMax-M2.1-NVFP4 scored better than my quant, for three reasons:

1. All-expert calibration

His quant was published on Dec 29, 2025, while I published my all-expert calibration quantization script on Dec 28, so it's unlikely he used it (https://github.com/vllm-project/llm-compressor/pull/2171).

Actually, it seems he used ModelOpt, and ModelOpt also needs specific changes to activate all-expert calibration: https://github.com/NVIDIA/Model-Optimizer/issues/732

Failure to do so can lead to significant degradation; see
https://avtc.github.io/aquarium-side-by-side/

This is even more significant for NVFP4 because while AWQ is W4A16 (16-bit activations), NVFP4 is W4A4 (4-bit activations) meaning activation spikes can really be problematic.

And if your calibration set doesn't trigger spikes on ALL experts, the calibration for 4-bit activations will not take the full range of possible activations into account.
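To make the all-expert idea concrete, here is a minimal sketch of a calibration-only MoE forward where every token passes through every expert, so each expert's quantization observers see realistic activation ranges instead of only the tokens the router happens to select. The `gate`/`experts` attribute names and the top-k softmax routing are assumptions for illustration, not the actual MiniMax modeling code:

```python
import torch

class AllExpertCalibrationMoE(torch.nn.Module):
    """Calibration-only wrapper: every token visits every expert.

    The `gate`/`experts` attribute names and the top-k softmax routing are
    assumptions for illustration, not the actual MiniMax modeling code.
    """

    def __init__(self, moe_block: torch.nn.Module, top_k: int = 2):
        super().__init__()
        self.gate = moe_block.gate        # router linear layer (assumed name)
        self.experts = moe_block.experts  # ModuleList of expert MLPs (assumed name)
        self.top_k = top_k

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        tokens = hidden_states.reshape(-1, hidden_states.shape[-1])

        router_logits = self.gate(tokens)
        routing_weights = torch.softmax(router_logits, dim=-1)
        topk_weights, topk_ids = torch.topk(routing_weights, self.top_k, dim=-1)
        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)

        output = torch.zeros_like(tokens)
        for expert_id, expert in enumerate(self.experts):
            # Difference from inference: ALL tokens pass through EVERY expert,
            # so the quantization observers on each expert see the full
            # activation range ...
            expert_out = expert(tokens)

            # ... but only the normally routed tokens contribute to the output,
            # keeping the calibration forward pass faithful to inference.
            selected = (topk_ids == expert_id).to(expert_out.dtype)
            weight = (topk_weights * selected).sum(dim=-1, keepdim=True)
            output += weight * expert_out

        return output.reshape(hidden_states.shape)
```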

2. Self-attention quantization

As shown in the "Tensors to up-quantize" section of my model card:

Tensors to up-quantize

If there are enough bits, down projections should be prioritized.

According to [4]

Fig. 3: Maximum absolute value over layers for LLaMA3-8B. Each color represents a different projection, and we clearly see that down_proj has the biggest spikes in input and output. We also observe that RMSNorm propagates spikes through the entire model.
According to [5]

Figure 5(a) illustrates the extremal ratio across layers and modules in LLaMA2-7B, highlighting that weight outliers are concentrated in the down-projection matrices W^down_ℓ of the second layer and the last two layers. Figures 5(b) and 5(c) provide detailed visualizations of these outliers in the last two layers.
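If you want to reproduce this kind of spike analysis on your own model, a rough sketch (not part of my quantization pipeline) is to hook every Linear layer and record the maximum absolute input/output values over a few calibration batches:

```python
import torch

def attach_absmax_hooks(model: torch.nn.Module):
    """Track max |input| and max |output| per Linear layer during forwards.

    Handy for spotting which projections (e.g. down_proj) carry the largest
    activation spikes before deciding which tensors to keep in higher precision.
    """
    stats: dict[str, dict[str, float]] = {}

    def make_hook(name: str):
        def hook(module, inputs, output):
            x = inputs[0]
            entry = stats.setdefault(name, {"in": 0.0, "out": 0.0})
            entry["in"] = max(entry["in"], x.detach().abs().max().item())
            entry["out"] = max(entry["out"], output.detach().abs().max().item())
        return hook

    handles = [
        module.register_forward_hook(make_hook(name))
        for name, module in model.named_modules()
        if isinstance(module, torch.nn.Linear)
    ]
    return stats, handles

# Usage sketch: run a few calibration batches, then inspect the biggest spikes.
# stats, handles = attach_absmax_hooks(model)
# ... forward passes over calibration data ...
# for name, s in sorted(stats.items(), key=lambda kv: -kv[1]["in"])[:20]:
#     print(f"{name:60s} max|in|={s['in']:9.1f} max|out|={s['out']:9.1f}")
# for h in handles:
#     h.remove()
```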

Mixture-of-Experts quantization (MoE)

Mixture-of-Experts models require specific quantization techniques.

Mixed-precision quantization

Some layers have a higher impact on LLM performance.
According to [2], spending more bits in attention layers results in large gains compared to spending them in FFN layers.
According to [3] on 2-bit quantization:

  1. Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness (2023). Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla. https://arxiv.org/pdf/2310.02410
  2. Precision Where It Matters: A Novel Spike Aware Mixed-Precision Quantization Strategy for LLaMA-based Language Models (2025). Lucas Maisonnave, Cyril Moineau, Olivier Bichler, Fabrice Rastello. https://arxiv.org/pdf/2504.21553
  3. Systematic Outliers in Large Language Models (2025). Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang. https://arxiv.org/pdf/2502.06415v2

Self-attention layers have a large impact, so I keep them in the original FP8.
lukealonso's quant only skips the gates but quantizes self-attention.
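As a rough illustration of what keeping self-attention out of the 4-bit scheme looks like with llm-compressor (the exact regex patterns depend on the model's actual module names, and the AWQModifier/oneshot arguments may differ between library versions, so treat this as a sketch rather than my exact recipe):

```python
# Sketch of an AWQ recipe that quantizes the MoE MLP weights to 4-bit while
# leaving lm_head, the MoE router gates, and all self-attention projections at
# the checkpoint's original precision (FP8 here). Module-name patterns are
# assumptions; adjust them to the actual MiniMax module names.
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

recipe = AWQModifier(
    targets=["Linear"],
    scheme="W4A16",
    ignore=[
        "lm_head",
        "re:.*gate$",        # MoE router gates
        "re:.*self_attn.*",  # q/k/v/o projections stay un-quantized
    ],
)

# oneshot(model=model, dataset=calibration_dataset, recipe=recipe,
#         max_seq_length=8192, num_calibration_samples=590)
```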

3. Calibration data

I used a large, wide-ranging calibration dataset: https://huggingface.co/mratsim/MiniMax-M2.1-FP8-INT4-AWQ/blob/main/calibrate_software_engineer.yaml

The goal was to capture all activation spikes for the best possible AWQ quantization.

calibration_set:
  _templates:
    programming_languages: &programming_languages "Solve the following problem using {{ ['Zephyr', 'Prolog', 'Cobol', 'Apex', 'Crystal', 'Fortran', 'Nim', 'Delphi', 'Ada', 'Objective-C', 'VBA', 'Perl', 'Groovy', 'MATLAB', 'Solidity', 'Visual Basic', 'OCaml', 'Erlang', 'Julia', 'Lisp', 'F#', 'Clojure', 'GDScript', 'Scala', 'R', 'Haskell', 'Ruby', 'Elixir', 'Lua', 'Zig', 'Dart', 'Swift', 'Metal', 'PowerShell', 'PHP', 'Kotlin', 'C', 'Java', 'C++', 'C#', 'Bash/Shell', 'Go', 'Rust', 'TypeScript', 'HTML/CSS', 'SQL', 'JavaScript', 'Python', 'Lean', 'Coq', 'Pony', 'D', 'Racket', 'Haxe', 'x86-64 ASM', 'ARM-64 ASM', 'LLVM IR', 'GLSL', 'CUDA', 'Vulkan'][hash(row|string) % 60] }}\n***\n"
    spoken_languages: &spoken_languages "Answer in {{ ['Arabic', 'Chinese', 'French', 'German', 'Hebrew', 'Hindi', 'Japanese', 'Korean', 'Portuguese', 'Russian', 'Spanish', 'Turkish'][hash(row|string) % 12] }}\n***\n"
  max_seq_length: 8192
  shuffle: true
  seed: 42
  datasets:
    
    # Category Summary (Total: 590 samples)
    # =====================================================
    # General chat (24 samples - 4.07%)
    # Instruction and Reasoning tuning (14 samples - 2.37%)
    # Multilingual (36 samples - 6.10%)
    # Tool use (100 samples - 16.95%)
    # Code / Programming / Software Engineering / Devops (328 samples - 55.59%)
    # Math (12 samples - 2.03%)
    # Sciences (16 samples - 2.71%)
    # Medical (8 samples - 1.36%)
    # Finance (8 samples - 1.36%)
    # Business (16 samples - 2.71%)
    # Humanities and Philosophy (8 samples - 1.36%)
    # Creative Writing, Adventure, Roleplay (13 samples - 2.20%)
    # General Knowledge and Pop Culture (2 samples - 0.34%)
    # Specialized skills (4 samples - 0.68%)
    # Misc (1 sample - 0.17%)
    # =====================================================
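For reference, those `{{ ... [hash(row|string) % N] }}` templates simply prepend a deterministic per-sample instruction so the prompts cover many programming and spoken languages. A rough Python equivalent of the programming-language template (hypothetical helper, not the actual YAML machinery, and with only a subset of the 60 languages):

```python
import hashlib

# A short subset of the 60 programming languages in the real config, for illustration.
PROGRAMMING_LANGUAGES = ["Fortran", "Haskell", "Rust", "CUDA", "SQL", "Python"]

def prepend_programming_language(row: dict, text: str) -> str:
    """Deterministically pick a language from the row contents and prepend a
    'Solve the following problem using <lang>' instruction, mirroring the
    `hash(row|string) % N` template above (the real template relies on the
    config engine's own hash, not sha256)."""
    digest = int(hashlib.sha256(repr(row).encode()).hexdigest(), 16)
    lang = PROGRAMMING_LANGUAGES[digest % len(PROGRAMMING_LANGUAGES)]
    return f"Solve the following problem using {lang}\n***\n{text}"

# Usage sketch:
# row = {"prompt": "Implement a binary search."}
# print(prepend_programming_language(row, row["prompt"]))
```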

Now, another angle is performance. I'm unsure about the state of NVFP4 kernels on the RTX Pro 6000; they might still fall back to the W4A16 Marlin kernel like GPTQ/AWQ, in which case speed, which is dominated by the MLP layers, will be the same. If there are proper NVFP4 kernels, then the NVFP4 quant should be faster.

@lukealonso I can kind of confirm this. I initially installed your NVFP4 version and then switched to @mratsim's AWQ quant. In parallel to my GPU install, I have a MiniMax coding subscription, so I am also using the real MiniMax-M2.1. Primary use: heavy agentic work with very long contexts.

MiniMax-M2.1 is a "sloppy" model; the real model has this issue natively. It looks and feels like Claude, but makes little stupid mistakes, like not paying enough attention to detail.

The NVFP4 version was clearly inferior to the real model; I put this down to poor NVFP4 support in vLLM. By contrast, I cannot find any clearly visible difference between the AWQ version and the real MiniMax. Sometimes I feel the opposite: the real MiniMax frequently injects Chinese into the output, and I have never seen this with the AWQ version (although this may be circumstantial).

Regarding speed: NVFP4 is about 5-10% faster than AWQ on single requests, and the NVFP4 version has significantly better throughput at scale (40+ parallel requests, +30-40% in output tokens/s). Both versions are significantly faster (20-30%+) than the real MiniMax, on a 2x RTX 6000 Pro Blackwell workstation.

And I just noticed MiniMax-M2.5 is out today! https://huggingface.co/MiniMaxAI/MiniMax-M2.5
We would really appreciate an update of your magic quants, guys!
You rock!

@ktsaou @mratsim Appreciate the tips. I think the issue here is definitely the quantization of the self-attention layers. I will try to produce new NVFP4 versions of 2.1 and 2.5. The rest of the potential issues (all experts, good dataset) are covered, I think.

Guys, you rock! Can't wait for your new quants of M2.5.

I have 8x RTX 6000 available if you need raw power to speed this up.

Currently cooking. Unfortunately, LLMCompressor can only quantize with a single GPU, so I'm using 1x RTX Pro 6000.

@lukealonso For all-expert quantization you need to modify ModelOpt itself and add specific modeling; see my investigation here https://github.com/NVIDIA/Model-Optimizer/issues/732 and what I did for LLMCompressor: https://github.com/vllm-project/llm-compressor/pull/2171

@mratsim Good job! Will you also create FP8+INT4, and do we even want to run FP8+INT4 on sm120 instead of BF16? And why? What would be the best available benchmarks that compare the precision loss versus the original M2.5 (what would you recommend)?

Preliminary tests show that the NVFP4 is much more stable in most of the benchmarks, unless I'm doing something wrong when testing it (the KV cache is BF16 when testing both the INT4 and the NVFP4).

Will you also create FP8+INT4, and do we even want to run FP8+INT4 on sm120 instead of BF16? And why?

Yes, I didn't have the time today to assemble the weights.

What would be the best available benchmarks that compare the precision loss versus the original M2.5 (what would you recommend)?

Ideally, you modify vLLM or SGLang to output the logits and compare the KL divergence between the original and the quant. Otherwise, do a vibe/smell test with the applications you use.

I opened an issue in llmcompressor for that: https://github.com/vllm-project/llm-compressor/issues/2031
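A minimal sketch of that KL-divergence comparison, assuming you have already dumped per-token logits for the same prompts from both the original and the quantized model (the dumping itself is the part that needs the vLLM/SGLang modification mentioned above):

```python
import torch
import torch.nn.functional as F

def mean_kl_divergence(ref_logits: torch.Tensor, quant_logits: torch.Tensor) -> float:
    """Average per-token KL(P_ref || P_quant).

    Both tensors are (num_tokens, vocab_size) logits collected on the same
    prompts from the original model and the quantized model. Lower is better;
    0 means the quant reproduces the original output distribution exactly.
    """
    ref_logprobs = F.log_softmax(ref_logits.float(), dim=-1)
    quant_logprobs = F.log_softmax(quant_logits.float(), dim=-1)
    kl = F.kl_div(quant_logprobs, ref_logprobs, log_target=True, reduction="none")
    return kl.sum(dim=-1).mean().item()

# Usage sketch (hypothetical file names):
# kl = mean_kl_divergence(torch.load("logits_bf16.pt"), torch.load("logits_nvfp4.pt"))
# print(f"mean per-token KL divergence: {kl:.4f} nats")
```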
