# Step-3.5-Flash-NVFP4

An NVFP4-quantized version of [stepfun-ai/Step-3.5-Flash](https://huggingface.co/stepfun-ai/Step-3.5-Flash), an open-source frontier-level reasoning model by StepFun with 196.81B total parameters and ~11B active parameters per token.

## Model Description

Step 3.5 Flash is an open-source foundation model designed for frontier-level reasoning and agentic capabilities with exceptional efficiency. Key highlights from the base model:

- AIME 2025: 97.3%
- SWE-bench Verified: 74.4%
- LiveCodeBench-V6: 86.4%
- Terminal-Bench 2.0: 51.0%
- GAIA (no file): 84.5

This NVFP4 quantization reduces the model size from ~372 GB (BF16) to ~105 GB while preserving quality, making it practical to deploy on just 2 GPUs.
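The arithmetic behind those numbers is straightforward. A rough back-of-envelope check (it ignores the unquantized `lm_head`, router gates, and norm weights, which is why it lands slightly below the actual on-disk sizes):

```python
params = 196.81e9  # total parameters

# BF16: 2 bytes per parameter
bf16_gib = params * 2 / 2**30              # ~367 GiB

# NVFP4: 4-bit weight plus one FP8 (1-byte) scale per 16-weight group
nvfp4_bytes = 4 / 8 + 1 / 16               # = 0.5625 bytes/param
nvfp4_gib = params * nvfp4_bytes / 2**30   # ~103 GiB

print(f"BF16 ~{bf16_gib:.0f} GiB, NVFP4 ~{nvfp4_gib:.0f} GiB")
```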

## Quantization Details

| Property | Value |
|----------|-------|
| Format | NVFP4 (`nvfp4-pack-quantized`) |
| Weight precision | FP4 E2M1 with FP8 E4M3 block scales (group_size=16) |
| Input activations | FP8 E4M3, dynamic per-tensor-group (group_size=16) |
| Quantization method | `compressed-tensors` |
| Calibration data | 512 samples from HuggingFaceH4/ultrachat_200k |
| Max calibration sequence length | 2048 |
| Quantization tool | `llm-compressor` |
| Excluded from quantization | `lm_head`, all MoE router gates (`moe.gate`) |

During calibration, all 288 routed experts in each MoE layer were forced active so that every expert received calibration data, using a custom Step3p5MoEMLP calibration module.
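For reference, a quantization run along these lines could look as follows. This is a sketch assuming a recent llm-compressor: the `ignore` regex for the router gates is an assumption about the module naming, and the custom Step3p5MoEMLP all-expert calibration swap is not reproduced here.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "stepfun-ai/Step-3.5-Flash"

# Loading a ~197B-parameter model requires substantial CPU RAM.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# NVFP4 for all Linear layers, skipping lm_head and the MoE router gates
# (the "re:.*moe.gate$" pattern is an assumed module path).
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*moe.gate$"],
)

oneshot(
    model=model,
    dataset="ultrachat_200k",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("Step-3.5-Flash-NVFP4", save_compressed=True)
tokenizer.save_pretrained("Step-3.5-Flash-NVFP4")
```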

## Architecture

| Component | Details |
|-----------|---------|
| Architecture | 45-layer sparse Mixture-of-Experts (MoE) Transformer |
| Total parameters | 196.81B |
| Active parameters | ~11B per token |
| Experts | 288 routed + 1 shared per MoE layer, top-8 selection |
| Hidden size | 4096 |
| MoE intermediate size | 1280 |
| Dense intermediate size | 11264 |
| MoE layers | 3-44 (42 layers) |
| Attention | GQA with 64 heads, 8 KV groups, head dim 128 |
| Attention pattern | 3:1 ratio of sliding-window (512-token) to full-attention layers |
| Context window | 256K tokens (llama3-style RoPE scaling) |
| Vocabulary | 128,896 tokens |
| Multi-Token Prediction | MTP-3 (predicts 4 tokens simultaneously) |

Layers 43-44 use a swiglustep activation (clipped SwiGLU with limit=7.0) in their MoE experts; all other MoE layers use standard SiLU. Running this checkpoint therefore requires vLLM support for swiglustep in the NVFP4 MoE kernels.
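The authoritative swiglustep definition lives in the vLLM PR referenced below. Purely as an illustration of the idea, a clipped SwiGLU with limit=7.0 could look like this (a hypothetical formulation, not the kernel's exact math):

```python
import torch
import torch.nn.functional as F

def clipped_swiglu(gate: torch.Tensor, up: torch.Tensor,
                   limit: float = 7.0) -> torch.Tensor:
    # Hypothetical sketch: clamp both halves of the gated MLP so
    # activations stay within [-limit, limit] before the SwiGLU product.
    gate = gate.clamp(min=-limit, max=limit)
    up = up.clamp(min=-limit, max=limit)
    return F.silu(gate) * up
```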

## Requirements

This model requires vLLM with swiglustep MoE activation support, which is added in the following PR:

- [vllm-project/vllm#34478](https://github.com/vllm-project/vllm/pull/34478): Add swiglustep activation support for NVFP4 MoE backends

Until the PR is merged, install vLLM from the PR branch or from source with the changes applied.
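One way to do that is to fetch the pull-request ref directly from GitHub and build from source (a sketch; it assumes a working CUDA build environment, and the local branch name is arbitrary):

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
# GitHub exposes every PR at pull/<number>/head
git fetch origin pull/34478/head:swiglustep-nvfp4
git checkout swiglustep-nvfp4
pip install -e .
```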

## Usage with vLLM

### Serving

```bash
vllm serve tacos4me/Step-3.5-Flash-NVFP4 \
  --served-model-name step3p5-flash \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --reasoning-parser step3p5 \
  --enable-auto-tool-choice \
  --tool-call-parser step3p5 \
  --disable-cascade-attn
```
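Once the server is up it speaks the OpenAI-compatible API, so any OpenAI client can query it (assuming the default localhost:8000 endpoint):

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on port 8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="step3p5-flash",  # matches --served-model-name above
    messages=[{"role": "user",
               "content": "Explain the significance of the number 42."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```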

### Offline Inference

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="tacos4me/Step-3.5-Flash-NVFP4",
    tensor_parallel_size=2,
    trust_remote_code=True,
    max_model_len=4096,
    gpu_memory_utilization=0.95,
)

output = llm.generate(
    "Explain the significance of the number 42.",
    SamplingParams(max_tokens=256),
)
print(output[0].outputs[0].text)
```

## Performance

| Metric | Value |
|--------|------|
| Model size on disk | ~105 GB (23 safetensors shards) |
| Decode throughput | ~108 tok/s |
| Hardware tested | 2x NVIDIA RTX PRO 6000 Blackwell (TP=2) |
| CUDA graphs | Enabled |

## Known Issues

1. **FlashInfer MoE backend on Blackwell:** The FlashInfer CUTLASS MoE backend may crash with an illegal memory access on Blackwell GPUs (sm_120). Set `VLLM_USE_FLASHINFER_MOE_FP4=0` as a workaround.
2. **MTP weights not included:** The base model's Multi-Token Prediction (speculative decoding) weights are not included in this quantized checkpoint.
3. **Minimum 2 GPUs required:** At ~105 GB, the model does not fit on a single 80/96 GB GPU. Use `--tensor-parallel-size 2` or higher.

## Citation

If you use this model, please cite the original Step 3.5 Flash paper:

```bibtex
@misc{huang2026step35flashopen,
  title={Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters},
  author={Huang, Ailin and Li, Ang and others},
  year={2026},
  eprint={2602.10604},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2602.10604}
}
```

## License

This model is released under the Apache 2.0 License, same as the base model.
