Step-3.5-Flash-NVFP4
NVFP4-quantized version of stepfun-ai/Step-3.5-Flash, an open-source frontier-level reasoning model by StepFun with 196.81B total parameters and ~11B active parameters per token.
Model Description
Step 3.5 Flash is an open-source foundation model designed for frontier-level reasoning and agentic capabilities with exceptional efficiency. Key highlights from the base model:
- AIME 2025: 97.3%
- SWE-bench Verified: 74.4%
- LiveCodeBench-V6: 86.4%
- Terminal-Bench 2.0: 51.0%
- GAIA (no file): 84.5
This NVFP4 quantization reduces the model size from ~372 GB (BF16) to ~105 GB while preserving quality, making it practical to deploy on just 2 GPUs.
Quantization Details
| Property | Value |
|---|---|
| Format | NVFP4 (nvfp4-pack-quantized) |
| Weight precision | FP4 E2M1 with FP8 E4M3 block scales (group_size=16) |
| Input activations | FP8 E4M3 dynamic per-tensor-group (group_size=16) |
| Quant method | compressed-tensors |
| Calibration data | 512 samples from HuggingFaceH4/ultrachat_200k |
| Max calibration seq length | 2048 |
| Quantization tool | llm-compressor |
| Excluded from quantization | lm_head, all MoE router gates (moe.gate) |
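As a rough illustration of the weight format in the table above, the sketch below quantizes a vector to signed FP4 E2M1 values with one scale per 16-element group. It is a simplified approximation of the NVFP4 idea, not the production compressed-tensors code path (the real kernels also quantize the per-group scales to FP8 E4M3 and carry a global scale).

```python
# Illustrative sketch of block-scaled FP4 (E2M1) quantization with group_size=16.
# Approximates the NVFP4 idea; the real compressed-tensors kernels also quantize
# the per-group scales to FP8 E4M3 and keep a global scale.
import numpy as np

E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # representable FP4 magnitudes

def quantize_nvfp4_like(w: np.ndarray, group_size: int = 16):
    """Quantize a 1-D weight vector to signed E2M1 values with one scale per group."""
    w = w.reshape(-1, group_size)
    # One scale per group, chosen so the largest magnitude in the group maps to 6.0.
    scales = np.abs(w).max(axis=1, keepdims=True) / 6.0
    scales[scales == 0] = 1.0
    scaled = w / scales
    # Round each scaled value to the nearest representable E2M1 magnitude, keep the sign.
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_GRID[idx]
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q * scales).reshape(-1)

w = np.random.randn(64).astype(np.float32)
q, s = quantize_nvfp4_like(w)
print("max abs error:", np.abs(dequantize(q, s) - w).max())
```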
During calibration, all 288 experts per MoE layer were activated to ensure every expert received calibration data, using a custom Step3p5MoEMLP calibration module.
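For reference, a one-shot llm-compressor run along the lines below could produce a comparable checkpoint. This is a sketch under assumptions: the `NVFP4` preset scheme name, the ultrachat preprocessing, and the `ignore` patterns are illustrative, and the custom Step3p5MoEMLP all-expert calibration module mentioned above is not reproduced.

```python
# Sketch of a one-shot NVFP4 quantization with llm-compressor (illustrative only).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "stepfun-ai/Step-3.5-Flash"
NUM_SAMPLES, MAX_SEQ_LEN = 512, 2048

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# 512 chat samples rendered through the model's chat template, then tokenized.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")
ds = ds.map(lambda ex: {
    "text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)
})
ds = ds.map(
    lambda ex: tokenizer(
        ex["text"], max_length=MAX_SEQ_LEN, truncation=True, add_special_tokens=False
    ),
    remove_columns=ds.column_names,
)

# Quantize Linear layers to NVFP4; keep the LM head and MoE router gates in high precision.
# The ignore pattern for router gates is an assumption, not the exact recipe used.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*moe.gate.*"],
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQ_LEN,
    num_calibration_samples=NUM_SAMPLES,
)

model.save_pretrained("Step-3.5-Flash-NVFP4", save_compressed=True)
tokenizer.save_pretrained("Step-3.5-Flash-NVFP4")
```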
Architecture
| Component | Details |
|---|---|
| Architecture | 45-layer Sparse Mixture-of-Experts (MoE) Transformer |
| Total parameters | 196.81B |
| Active parameters | ~11B per token |
| Experts | 288 routed + 1 shared per MoE layer, top-8 selection |
| Hidden size | 4096 |
| MoE intermediate size | 1280 |
| Dense intermediate size | 11264 |
| MoE layers | 3-44 (42 layers) |
| Attention | GQA with 64 heads, 8 KV groups, head dim 128 |
| Attention pattern | 3:1 ratio of sliding-window (512-token) layers to full-attention layers |
| Context window | 256K tokens (with llama3-style RoPE scaling) |
| Vocabulary | 128,896 tokens |
| Multi-Token Prediction | MTP-3 (predicts 4 tokens simultaneously) |
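As a sanity check on the ~11B active-parameter figure, the back-of-the-envelope estimate below plugs in the dimensions from the table. It assumes the three non-MoE layers (0-2) are dense, counts 8 routed plus 1 shared expert per MoE layer, and ignores MTP heads, norms, and biases; it lands near the reported ~11B.

```python
# Rough estimate of parameters touched per token, using the architecture table.
# Assumptions: layers 0-2 are dense, each token activates 8 routed + 1 shared
# expert per MoE layer, embeddings and LM head count once each.
hidden, moe_inter, dense_inter = 4096, 1280, 11264
layers, moe_layers, dense_layers = 45, 42, 3
heads, kv_groups, head_dim = 64, 8, 128
vocab = 128_896

attn = layers * (
    hidden * heads * head_dim               # q_proj
    + 2 * hidden * kv_groups * head_dim     # k_proj + v_proj
    + heads * head_dim * hidden             # o_proj
)
dense_ffn = dense_layers * 3 * hidden * dense_inter       # gate/up/down projections
moe_ffn = moe_layers * (8 + 1) * 3 * hidden * moe_inter   # top-8 routed + 1 shared expert
embeds = 2 * vocab * hidden                                # embedding + lm_head

print(f"~{(attn + dense_ffn + moe_ffn + embeds) / 1e9:.1f}B active parameters")  # ~10.8B
```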
Layers 43-44 use a swiglustep activation (clipped SwiGLU with limit=7.0) on their MoE experts. All other MoE layers use standard SiLU. This requires vLLM support for swiglustep in the NVFP4 MoE kernels.
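The exact swiglustep formula is defined by the vLLM kernel; as a rough illustration only, a clipped SwiGLU of the kind described (activations clamped at a limit of 7.0 before the gated product) looks like this:

```python
# Illustrative clipped SwiGLU (not necessarily the exact swiglustep formula,
# which lives in the vLLM MoE kernel): gate and up projections are clamped
# at limit=7.0 before the SiLU-gated product.
import torch
import torch.nn.functional as F

def clipped_swiglu(gate: torch.Tensor, up: torch.Tensor, limit: float = 7.0) -> torch.Tensor:
    gate = gate.clamp(max=limit)             # cap the gate branch
    up = up.clamp(min=-limit, max=limit)     # cap the linear branch on both sides
    return F.silu(gate) * up

x = torch.randn(4, 1280) * 10                # values large enough to hit the clip
print(clipped_swiglu(x, x).abs().max())
```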
Requirements
This model requires vLLM with swiglustep MoE activation support. This is available in the following PR:
vllm-project/vllm#34478 -- Add swiglustep activation support for NVFP4 MoE backends
Until the PR is merged, install vLLM from the PR branch or from source with the changes applied.
Usage with vLLM
Serving
```bash
vllm serve tacos4me/Step-3.5-Flash-NVFP4 \
  --served-model-name step3p5-flash \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --reasoning-parser step3p5 \
  --enable-auto-tool-choice \
  --tool-call-parser step3p5 \
  --disable-cascade-attn
```
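Once the server is running, it exposes an OpenAI-compatible API. The snippet below assumes the default local endpoint (`http://localhost:8000/v1`) and the `--served-model-name` value from the command above.

```python
# Query the running vLLM server through its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="step3p5-flash",  # matches --served-model-name above
    messages=[{"role": "user", "content": "Explain the significance of the number 42."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```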
Offline Inference
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="tacos4me/Step-3.5-Flash-NVFP4",
    tensor_parallel_size=2,
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.95,
)

output = llm.generate(
    "Explain the significance of the number 42.",
    SamplingParams(max_tokens=256),
)
print(output[0].outputs[0].text)
```
Performance
| Metric | Value |
|---|---|
| Model size on disk | ~105 GB (23 safetensors shards) |
| Decode throughput | ~108 tok/s |
| Hardware tested | 2x NVIDIA RTX PRO 6000 Blackwell (TP=2) |
| CUDA graphs | Enabled |
Known Issues
- **FlashInfer MoE backend on Blackwell:** The FlashInfer CUTLASS MoE backend may crash with an illegal memory access on Blackwell GPUs (sm_120). Set `VLLM_USE_FLASHINFER_MOE_FP4=0` as a workaround (see the sketch after this list).
- **MTP weights not included:** Speculative decoding (Multi-Token Prediction) weights from the base model are not included in this quantized checkpoint.
- **Minimum 2 GPUs required:** The model requires ~105 GB, so it does not fit on a single 80/96 GB GPU. Use `--tensor-parallel-size 2` or higher.
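If you hit the FlashInfer crash with the offline API, one way to apply the workaround is to set the variable before the engine is constructed; for `vllm serve`, export it in the shell instead. A minimal sketch:

```python
# Workaround sketch for the FlashInfer MoE crash on Blackwell: disable the
# FlashInfer FP4 MoE backend before the engine is constructed.
import os

os.environ["VLLM_USE_FLASHINFER_MOE_FP4"] = "0"

from vllm import LLM  # import after the environment variable is set

llm = LLM(
    model="tacos4me/Step-3.5-Flash-NVFP4",
    tensor_parallel_size=2,
    trust_remote_code=True,
)
```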
Acknowledgments
- Based on stepfun-ai/Step-3.5-Flash by StepFun
- Quantized with llm-compressor by the vLLM project
- NVFP4 MoE swiglustep activation support contributed to vLLM
Citation
If you use this model, please cite the original Step 3.5 Flash paper:
```bibtex
@misc{huang2026step35flashopen,
  title={Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters},
  author={Huang, Ailin and Li, Ang and others},
  year={2026},
  eprint={2602.10604},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.10604}
}
```
License
This model is released under the Apache 2.0 License, same as the base model.