Step-3.5-Flash-NVFP4
NVFP4-quantized version of stepfun-ai/Step-3.5-Flash, an open-source frontier-level reasoning model by StepFun with 196.81B total parameters and ~11B active parameters per token.
Model Description
Step 3.5 Flash is an open-source foundation model designed for frontier-level reasoning and agentic capabilities with exceptional efficiency. Key highlights from the base model:
- AIME 2025: 97.3%
- SWE-bench Verified: 74.4%
- LiveCodeBench-V6: 86.4%
- Terminal-Bench 2.0: 51.0%
- GAIA (no file): 84.5
This NVFP4 quantization reduces the model size from ~372 GB (BF16) to ~105 GB while preserving quality, making it practical to deploy on just 2 GPUs.
Quantization Details
| Property | Value |
|---|---|
| Format | NVFP4 (nvfp4-pack-quantized) |
| Weight precision | FP4 E2M1 with FP8 E4M3 block scales (group_size=16) |
| Input activations | FP8 E4M3 dynamic per-tensor-group (group_size=16) |
| Quant method | compressed-tensors |
| Calibration data | 512 samples from HuggingFaceH4/ultrachat_200k |
| Max calibration seq length | 2048 |
| Quantization tool | llm-compressor |
| Excluded from quantization | lm_head, all MoE router gates (moe.gate) |
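As a rough illustration of the weight format in the table above, the sketch below quantizes a vector to signed FP4 E2M1 values with one scale per 16-element group. It is a simplified approximation of the NVFP4 idea, not the production compressed-tensors code path (the real kernels also quantize the per-group scales to FP8 E4M3 and carry a global scale).

```python
# Illustrative sketch of block-scaled FP4 (E2M1) quantization with group_size=16.
# Approximates the NVFP4 idea; the real compressed-tensors kernels also quantize
# the per-group scales to FP8 E4M3 and keep a global scale.
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # representable FP4 magnitudes

def quantize_nvfp4_like(w: np.ndarray, group_size: int = 16):
    """Quantize a 1-D weight vector to signed E2M1 values with one scale per group."""
    w = w.reshape(-1, group_size)
    # One scale per group, chosen so the largest magnitude in the group maps to 6.0.
    scales = np.abs(w).max(axis=1, keepdims=True) / 6.0
    scales[scales == 0] = 1.0
    scaled = w / scales
    # Round each scaled value to the nearest representable E2M1 magnitude, keep the sign.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q * scales).reshape(-1)

w = np.random.randn(64).astype(np.float32)
q, s = quantize_nvfp4_like(w)
print("max abs error:", np.abs(dequantize(q, s) - w).max())
```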
During calibration, all 288 experts per MoE layer were activated to ensure every expert received calibration data, using a custom Step3p5MoEMLP calibration module.
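For reference, a one-shot llm-compressor run along the lines below could produce a comparable checkpoint. This is a sketch under assumptions: the `NVFP4` preset scheme name, the ultrachat preprocessing, and the `ignore` patterns are illustrative, and the custom Step3p5MoEMLP all-expert calibration module mentioned above is not reproduced.

```python
# Sketch of a one-shot NVFP4 quantization with llm-compressor (illustrative only).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "stepfun-ai/Step-3.5-Flash"
NUM_SAMPLES, MAX_SEQ_LEN = 512, 2048

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# 512 chat samples rendered through the model's chat template, then tokenized.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")
ds = ds.map(lambda ex: {
    "text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)
})
ds = ds.map(
    lambda ex: tokenizer(
        ex["text"], max_length=MAX_SEQ_LEN, truncation=True, add_special_tokens=False
    ),
    remove_columns=ds.column_names,
)

# Quantize Linear layers to NVFP4; keep the LM head and MoE router gates in high precision.
# The ignore pattern for router gates is an assumption, not the exact recipe used.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*moe.gate.*"],
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQ_LEN,
    num_calibration_samples=NUM_SAMPLES,
)

model.save_pretrained("Step-3.5-Flash-NVFP4", save_compressed=True)
tokenizer.save_pretrained("Step-3.5-Flash-NVFP4")
```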
Architecture
| Component | Details |
|---|---|
| Architecture | 45-layer Sparse Mixture-of-Experts (MoE) Transformer |
| Total parameters | 196.81B |
| Active parameters | ~11B per token |
| Experts | 288 routed + 1 shared per MoE layer, top-8 selection |
| Hidden size | 4096 |
| MoE intermediate size | 1280 |
| Dense intermediate size | 11264 |
| MoE layers | 3-44 (42 layers) |
| Attention | GQA with 64 heads, 8 KV groups, head dim 128 |
| Attention pattern | 3:1 ratio of sliding-window (512-token) layers to full-attention layers |
| Context window | 256K tokens (with llama3-style RoPE scaling) |
| Vocabulary | 128,896 tokens |
| Multi-Token Prediction | MTP-3 (predicts 4 tokens simultaneously) |
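As a sanity check on the ~11B active-parameter figure, the back-of-the-envelope estimate below plugs in the dimensions from the table. It assumes the three non-MoE layers (0-2) are dense, counts 8 routed plus 1 shared expert per MoE layer, and ignores MTP heads, norms, and biases; it lands near the reported ~11B.

```python
# Rough estimate of parameters touched per token, using the architecture table.
# Assumptions: layers 0-2 are dense, each token activates 8 routed + 1 shared
# expert per MoE layer, embeddings and LM head count once each.
hidden, moe_inter, dense_inter = 4096, 1280, 11264
layers, moe_layers, dense_layers = 45, 42, 3
heads, kv_groups, head_dim = 64, 8, 128
vocab = 128_896

attn = layers * (
    hidden * heads * head_dim               # q_proj
    + 2 * hidden * kv_groups * head_dim     # k_proj + v_proj
    + heads * head_dim * hidden             # o_proj
)
dense_ffn = dense_layers * 3 * hidden * dense_inter       # gate/up/down projections
moe_ffn = moe_layers * (8 + 1) * 3 * hidden * moe_inter   # top-8 routed + 1 shared expert
embeds = 2 * vocab * hidden                                # embedding + lm_head

print(f"~{(attn + dense_ffn + moe_ffn + embeds) / 1e9:.1f}B active parameters")  # ~10.8B
```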
Layers 43-44 use a swiglustep activation (clipped SwiGLU with limit=7.0) on their MoE experts. All other MoE layers use standard SiLU. This requires vLLM support for swiglustep in the NVFP4 MoE kernels.
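The exact swiglustep formula is defined by the vLLM kernel; as a rough illustration only, a clipped SwiGLU of the kind described (activations clamped at a limit of 7.0 before the gated product) looks like this:

```python
# Illustrative clipped SwiGLU (not necessarily the exact swiglustep formula,
# which lives in the vLLM MoE kernel): gate and up projections are clamped
# at limit=7.0 before the SiLU-gated product.
import torch
import torch.nn.functional as F

def clipped_swiglu(gate: torch.Tensor, up: torch.Tensor, limit: float = 7.0) -> torch.Tensor:
    gate = gate.clamp(max=limit)             # cap the gate branch
    up = up.clamp(min=-limit, max=limit)     # cap the linear branch on both sides
    return F.silu(gate) * up

x = torch.randn(4, 1280) * 10                # values large enough to hit the clip
print(clipped_swiglu(x, x).abs().max())
```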
Requirements
This model requires vLLM with swiglustep MoE activation support. This is available in the following PR:
vllm-project/vllm#34478 -- Add swiglustep activation support for NVFP4 MoE backends
Until the PR is merged, install vLLM from the PR branch or from source with the changes applied.
Usage with vLLM
Serving
```bash
vllm serve tacos4me/Step-3.5-Flash-NVFP4 \
  --served-model-name step3p5-flash \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --reasoning-parser step3p5 \
  --enable-auto-tool-choice \
  --tool-call-parser step3p5 \
  --disable-cascade-attn
```
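Once the server is running, it exposes an OpenAI-compatible API. The snippet below assumes the default local endpoint (`http://localhost:8000/v1`) and the `--served-model-name` value from the command above.

```python
# Query the running vLLM server through its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="step3p5-flash",  # matches --served-model-name above
    messages=[{"role": "user", "content": "Explain the significance of the number 42."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```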
Offline Inference
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="tacos4me/Step-3.5-Flash-NVFP4",
    tensor_parallel_size=2,
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.95,
)

output = llm.generate(
    "Explain the significance of the number 42.",
    SamplingParams(max_tokens=256),
)
print(output[0].outputs[0].text)
```
Performance
| Metric | Value |
|---|---|
| Model size on disk | ~105 GB (23 safetensors shards) |
| Decode throughput | ~108 tok/s |
| Hardware tested | 2x NVIDIA RTX PRO 6000 Blackwell (TP=2) |
| CUDA graphs | Enabled |
Known Issues
- **FlashInfer MoE backend on Blackwell:** The FlashInfer CUTLASS MoE backend may crash with an illegal memory access on Blackwell GPUs (sm_120). Set `VLLM_USE_FLASHINFER_MOE_FP4=0` as a workaround (see the sketch after this list).
- **MTP weights not included:** Speculative decoding (Multi-Token Prediction) weights from the base model are not included in this quantized checkpoint.
- **Minimum 2 GPUs required:** The model requires ~105 GB, so it does not fit on a single 80/96 GB GPU. Use `--tensor-parallel-size 2` or higher.
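If you hit the FlashInfer crash with the offline API, one way to apply the workaround is to set the variable before the engine is constructed; for `vllm serve`, export it in the shell instead. A minimal sketch:

```python
# Workaround sketch for the FlashInfer MoE crash on Blackwell: disable the
# FlashInfer FP4 MoE backend before the engine is constructed.
import os

os.environ["VLLM_USE_FLASHINFER_MOE_FP4"] = "0"

from vllm import LLM  # import after the environment variable is set

llm = LLM(
    model="tacos4me/Step-3.5-Flash-NVFP4",
    tensor_parallel_size=2,
    trust_remote_code=True,
)
```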
Acknowledgments
- Based on stepfun-ai/Step-3.5-Flash by StepFun
- Quantized with llm-compressor by the vLLM project
- NVFP4 MoE swiglustep activation support contributed to vLLM
Citation
If you use this model, please cite the original Step 3.5 Flash paper:
```bibtex
@misc{huang2026step35flashopen,
  title={Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters},
  author={Huang, Ailin and Li, Ang and others},
  year={2026},
  eprint={2602.10604},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.10604}
}
```
License
This model is released under the Apache 2.0 License, same as the base model.