Here's the vLLM recipe I'm using with 2x RTX Pro 6000
Basic launch (copy-paste)
Set your Hugging Face cache and GPUs, then run (from project root with venv activated):
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1
export HF_HOME=/path/to/huggingface
export HUGGINGFACE_HUB_CACHE=$HF_HOME/hub
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export SAFETENSORS_FAST_GPU=1
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_DISABLE_PYNCCL=1
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export OMP_NUM_THREADS=8
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
python -m vllm.entrypoints.openai.api_server \
--model lukealonso/MiniMax-M2.5-NVFP4 \
--download-dir $HUGGINGFACE_HUB_CACHE \
--host 0.0.0.0 \
--port 1235 \
--served-model-name MiniMax-M2.5-NVFP4 \
--trust-remote-code \
--tensor-parallel-size 2 \
--attention-backend FLASH_ATTN \
--gpu-memory-utilization 0.95 \
--max-model-len 190000 \
--max-num-batched-tokens 16384 \
--max-num-seqs 64 \
--disable-custom-all-reduce \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2_append_think
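Once it's up, a quick way to sanity-check the endpoint (a minimal sketch, assuming the port and served model name from the flags above; the prompt is just an example, and jq is only used for pretty-printing):
curl -s http://localhost:1235/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "MiniMax-M2.5-NVFP4",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }' | jq '.choices[0].message'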
Dependencies (tested with)
Install in a Python 3.12 venv; use CUDA 12.x on the host.
| Package | Version | Note |
|---|---|---|
| vllm | 0.15.1 | OpenAI server + NVFP4 MoE |
| torch | 2.9.1+cu128 | CUDA 12.8 build |
| transformers | 4.57.6 | |
| safetensors | 0.7.0 | |
| nvidia-modelopt | 0.41.0 | NVFP4 / ModelOpt format |
| flashinfer-python | 0.6.1 | Optional (we use FLASH_ATTN) |
| nvidia-nccl-cu12 | 2.27.5 | Multi-GPU |
| nvidia-cutlass-dsl* | 4.4.0.dev1 | NVFP4 GEMM (script uses cutlass backend) |
System: CUDA 12.8, cuDNN 9.10.2 (or matching torch cuDNN). Driver must support your GPUs (e.g. Blackwell).
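For reference, one way to get roughly that environment (a sketch, not a verified lockfile; the cu128 index URL and exact wheel availability are assumptions on my part):
python3.12 -m venv .venv && source .venv/bin/activate
pip install "torch==2.9.1" --index-url https://download.pytorch.org/whl/cu128   # CUDA 12.8 build
pip install "vllm==0.15.1" "transformers==4.57.6" "safetensors==0.7.0" \
            "nvidia-modelopt==0.41.0" "flashinfer-python==0.6.1" \
            "nvidia-nccl-cu12==2.27.5"
# nvidia-cutlass-dsl (4.4.0.dev1) is a dev wheel; install it separately per its own instructions.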
Speeds are pretty much identical to M2.1.
you are disabling P2P?
Personally I haven't tried vLLM yet, but on sglang I definitely have P2P enabled.
Will try the above today and add it to the instructions if it works with P2P enabled as well.
That being said, I find sglang performs even better and is a lot less annoying to use.
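For anyone who wants to try the SGLang route, the launch looks roughly like this (a sketch only; flag names and MiniMax tool/reasoning-parser support depend on your SGLang version, so treat the exact options as assumptions):
python -m sglang.launch_server \
  --model-path lukealonso/MiniMax-M2.5-NVFP4 \
  --tp 2 \
  --trust-remote-code \
  --host 0.0.0.0 --port 30000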
I have P2P enabled and (really) working with RTX 6000 Pro BW in vLLM. I have a pending PR; you can check what I did to fix it here:
https://github.com/Gadflyii/vllm/tree/main
Real simple: it manages the custom all-reduce and checks for the correct iommu=pt kernel parameter. If present, it will use P2P and it will really work, instead of silently failing and falling back.
Cool, makes sense. FWIW, I also had to set amd_iommu=pt in addition to iommu=pt; the latter alone was not enough.
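If you want to verify before/after, a couple of quick checks (a sketch; the GRUB paths assume a standard Ubuntu-style setup):
# 1) Confirm the kernel parameters are actually active:
cat /proc/cmdline | tr ' ' '\n' | grep -i iommu   # expect iommu=pt (plus amd_iommu=pt on AMD hosts)
# If missing: add them to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub,
# then: sudo update-grub && sudo reboot
# 2) Confirm the driver really exposes peer access between GPU 0 and 1:
python -c "import torch; print(torch.cuda.can_device_access_peer(0, 1))"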
Thx! It's working, and so fast. I use PP 3, TP 1, EP 1. Here's my config:
"winrm_port": 5985,
"winrm_use_ssl": false,
"winrm_transport": "ntlm",
"wsl_distro": "Ubuntu-vLLM",
"model_path": "/models/minimaxm2.5_nvfp4/",
"venv_path": "/opt/vllm/venv/bin/activate",
"cuda_home": "/usr/local/cuda-13.0",
"script_name": "launch_minimax_3gpu.sh",
"script_remote_path": "/opt/launch_minimax_3gpu.sh",
"log_remote_path": "/opt/minimax_output.txt",
"task_name": "ServeMiniMax",
"tensor_parallel_size": 1,
"pipeline_parallel_size": 3,
"enable_expert_parallel": true,
"trust_remote_code": true,
"dtype": "auto",
"kv_cache_dtype": "fp8",
"host": "0.0.0.0",
"port": 8000,
"gpu_memory_utilization": 0.85,
"max_model_len": "196608",
"max_num_seqs": 2,
"max_num_batched_tokens": 2048,
"enable_chunked_prefill": true,
"enable_prefix_caching": true,
"enable_auto_tool_choice": true,
"tool_call_parser": "minimax_m2",
"reasoning_parser": "minimax_m2",
"gen_max_tokens": "",
"gen_temperature": "",
"gen_top_p": "",
"gen_top_k": "",
"enforce_eager": false,
"cudagraph_capture_sizes": "",
"max_cudagraph_capture_size": "",
"cpu_offload_gb": 0,
"swap_space": 0,
"kv_cache_memory_bytes": "24",
"calculate_kv_scales": false,
"vllm_use_v1": true,
"vllm_use_flashinfer_moe_fp4": false,
"omp_num_threads": 8,
"cuda_visible_devices": "0,1,2 (3 GPU)",
"nccl_debug": "INFO",
"nccl_ib_disable": 1,
"nccl_p2p_disable": 0,
"nccl_shm_disable": 0,
"nccl_nvls_enable": 0,
"nccl_cumem_enable": 0,
"nccl_net_gdr_level": 0,
"nccl_socket_ifname": "eth0",
"safetensors_fast_gpu": true,
"vllm_nvfp4_gemm_backend": "cutlass"
}
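For readers who don't use that wrapper, the JSON above maps to roughly this launch (a sketch reconstructed from those settings; the actual wrapper script isn't shown, so treat this as an approximation):
export CUDA_VISIBLE_DEVICES=0,1,2
export NCCL_P2P_DISABLE=0
export SAFETENSORS_FAST_GPU=1
export VLLM_NVFP4_GEMM_BACKEND=cutlass
python -m vllm.entrypoints.openai.api_server \
  --model /models/minimaxm2.5_nvfp4/ \
  --host 0.0.0.0 --port 8000 \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 3 \
  --enable-expert-parallel \
  --trust-remote-code \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 196608 \
  --max-num-seqs 2 \
  --max-num-batched-tokens 2048 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2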
Anyone else getting random stuck loops every once in a while? Seems like it might be a bug with NVFP4 itself and vLLM?
Weird... is it an issue with this NVFP4 quant?
So what kind of speeds are you guys getting with this?
I see above an average output of 94 tok/s on 2x 96GB Blackwell 6000s, but honestly, for FP4 and just 10B active parameters, that feels like it should go quite a bit higher?
I'll see if I can test it on a B300 in FP8 and with this NVFP4 quant. The B200 vLLM Dockers for 0.15.x have been kinda buggy, but the B300 has been fine when it was available.
@nisten
Single stream, I get about 91 tok/s at low context. Here's a visualizer for performance vs power limit vs concurrency vs context for this model on dual RTX 6000 blackwell:
https://shihanqu.github.io/Blackwell-Wattage-Performance/
Might seem slow vs a B300 beast! But for reference, at 32k context, one of those sold-out $10k 512GB Mac Ultras can serve 4-bit MiniMax M2.5 single-stream at the same speed at which a $20k dual RTX 6000 Pro system can serve 16 simultaneous requests.
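If you want to compare numbers apples-to-apples, here's a rough single-stream tok/s check against the OpenAI endpoint (a sketch, assuming the 2-GPU recipe from the top of the thread: port 1235, served name MiniMax-M2.5-NVFP4; needs jq and bc, and includes prefill time, so it's only approximate):
start=$(date +%s.%N)
resp=$(curl -s http://localhost:1235/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "MiniMax-M2.5-NVFP4",
       "messages": [{"role": "user", "content": "Write a ~500-word story about a lighthouse."}],
       "max_tokens": 1024}')
end=$(date +%s.%N)
tokens=$(echo "$resp" | jq '.usage.completion_tokens')
echo "completion tokens: $tokens"
echo "approx tok/s: $(echo "scale=1; $tokens / ($end - $start)" | bc)"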


