Here's the vLLM recipe I'm using with 2x RTX Pro 6000

#1
by zenmagnets - opened

Basic launch (copy-paste)

Set your Hugging Face cache and GPUs, then run (from project root with venv activated):

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1
export HF_HOME=/path/to/huggingface
export HUGGINGFACE_HUB_CACHE=$HF_HOME/hub
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export SAFETENSORS_FAST_GPU=1
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_DISABLE_PYNCCL=1
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export OMP_NUM_THREADS=8
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

python -m vllm.entrypoints.openai.api_server \
  --model lukealonso/MiniMax-M2.5-NVFP4 \
  --download-dir $HUGGINGFACE_HUB_CACHE \
  --host 0.0.0.0 \
  --port 1235 \
  --served-model-name MiniMax-M2.5-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --attention-backend FLASH_ATTN \
  --gpu-memory-utilization 0.95 \
  --max-model-len 190000 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 64 \
  --disable-custom-all-reduce \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think
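Once the server is up, a quick sanity check is a plain OpenAI-style chat completion against it. Just a smoke-test sketch, assuming the host, port, and served model name from the command above:

# Smoke test against the OpenAI-compatible endpoint started above.
# Adjust host/port/model name if you changed them in the launch command.
curl -s http://localhost:1235/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "MiniMax-M2.5-NVFP4",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'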

Dependencies (tested with)

Install in a Python 3.12 venv; use CUDA 12.x on the host.

Package              Version      Note
vllm                 0.15.1       OpenAI server + NVFP4 MoE
torch                2.9.1+cu128  CUDA 12.8 build
transformers         4.57.6
safetensors          0.7.0
nvidia-modelopt      0.41.0       NVFP4 / ModelOpt format
flashinfer-python    0.6.1        Optional (we use FLASH_ATTN)
nvidia-nccl-cu12     2.27.5       Multi-GPU
nvidia-cutlass-dsl*  4.4.0.dev1   NVFP4 GEMM (script uses cutlass backend)

System: CUDA 12.8, cuDNN 9.10.2 (or matching torch cuDNN). Driver must support your GPUs (e.g. Blackwell).
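If you want to reproduce that environment, here is a rough install sketch (not an official requirements file; the cu128 index URL is my assumption about where the +cu128 torch build comes from):

# Sketch of reproducing the environment above; pins mirror the table and may
# need adjusting for your CUDA / driver combination.
python3.12 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install "torch==2.9.1" --index-url https://download.pytorch.org/whl/cu128
pip install "vllm==0.15.1" "transformers==4.57.6" "safetensors==0.7.0" \
            "nvidia-modelopt==0.41.0" "flashinfer-python==0.6.1" \
            "nvidia-nccl-cu12==2.27.5"
# Plus the CUTLASS DSL package from the table (name abbreviated there) for the
# cutlass NVFP4 GEMM backend.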


Speeds are pretty much identical to M2.1:

[screenshot: speed benchmark]

You are disabling P2P?

Personally I haven't tried vLLM yet, but on sglang I definitely have P2P enabled.

Will try the above today and add it to the instructions if it works with P2P enabled as well.

That being said, I find sglang performs even better and is a lot less annoying to use.

I have P2P enabled and (really) working with RTX 6000 Pro Blackwell in vLLM. I have a pending PR; you can check what I did to fix it here:

https://github.com/Gadflyii/vllm/tree/main

Real simple: it manages the custom all-reduce and checks for the correct iommu=pt kernel parameter. If present, it will use P2P and it will really work, instead of silently failing and falling back.

Cool, makes sense. FWIW, I also had to set amd_iommu=pt in addition to iommu=pt; the latter alone was not enough.
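For anyone following along, a quick way to check both of these before launching (a rough sketch, not part of the PR) is to look at the kernel command line and then ask the driver and PyTorch whether peer access actually works:

# Confirm the IOMMU passthrough parameters are actually on the kernel cmdline.
grep -oE '(amd_)?iommu=pt' /proc/cmdline

# Show the GPU-to-GPU topology nvidia-smi sees.
nvidia-smi topo -m

# Ask PyTorch whether GPU 0 can access GPU 1 peer-to-peer.
python -c "import torch; print(torch.cuda.can_device_access_peer(0, 1))"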

Ok, turned on P2P. Slight performance increase:

[screenshot: speed benchmark with P2P enabled]
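If you're replicating this, the minimal change relative to the recipe at the top is presumably just letting NCCL use P2P again and dropping the custom all-reduce opt-out. A sketch only; which other workarounds you still need depends on the IOMMU setup discussed above:

# Allow NCCL peer-to-peer transfers again.
export NCCL_P2P_DISABLE=0
# ...and remove --disable-custom-all-reduce from the api_server command so
# vLLM's custom all-reduce can actually be used.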

Thanks! It's working, so fast. I use PP=3, TP=1, EP=1:

"winrm_port": 5985,
"winrm_use_ssl": false,
"winrm_transport": "ntlm",
"wsl_distro": "Ubuntu-vLLM",
"model_path": "/models/minimaxm2.5_nvfp4/",
"venv_path": "/opt/vllm/venv/bin/activate",
"cuda_home": "/usr/local/cuda-13.0",
"script_name": "launch_minimax_3gpu.sh",
"script_remote_path": "/opt/launch_minimax_3gpu.sh",
"log_remote_path": "/opt/minimax_output.txt",
"task_name": "ServeMiniMax",
"tensor_parallel_size": 1,
"pipeline_parallel_size": 3,
"enable_expert_parallel": true,
"trust_remote_code": true,
"dtype": "auto",
"kv_cache_dtype": "fp8",
"host": "0.0.0.0",
"port": 8000,
"gpu_memory_utilization": 0.85,
"max_model_len": "196608",
"max_num_seqs": 2,
"max_num_batched_tokens": 2048,
"enable_chunked_prefill": true,
"enable_prefix_caching": true,
"enable_auto_tool_choice": true,
"tool_call_parser": "minimax_m2",
"reasoning_parser": "minimax_m2",
"gen_max_tokens": "",
"gen_temperature": "",
"gen_top_p": "",
"gen_top_k": "",
"enforce_eager": false,
"cudagraph_capture_sizes": "",
"max_cudagraph_capture_size": "",
"cpu_offload_gb": 0,
"swap_space": 0,
"kv_cache_memory_bytes": "24",
"calculate_kv_scales": false,
"vllm_use_v1": true,
"vllm_use_flashinfer_moe_fp4": false,
"omp_num_threads": 8,
"cuda_visible_devices": "0,1,2 (3 GPU)",
"nccl_debug": "INFO",
"nccl_ib_disable": 1,
"nccl_p2p_disable": 0,
"nccl_shm_disable": 0,
"nccl_nvls_enable": 0,
"nccl_cumem_enable": 0,
"nccl_net_gdr_level": 0,
"nccl_socket_ifname": "eth0",
"safetensors_fast_gpu": true,
"vllm_nvfp4_gemm_backend": "cutlass"
}
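For readers who don't use the WSL/WinRM wrapper, the interesting keys above map roughly onto a direct vLLM launch like this (a sketch reconstructed from the JSON, assuming the same local model path and the standard vLLM flag names):

# 3-way pipeline parallel, expert parallel, FP8 KV cache, as in the JSON above.
export CUDA_VISIBLE_DEVICES=0,1,2
export VLLM_NVFP4_GEMM_BACKEND=cutlass
python -m vllm.entrypoints.openai.api_server \
  --model /models/minimaxm2.5_nvfp4/ \
  --host 0.0.0.0 --port 8000 \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 3 \
  --enable-expert-parallel \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 196608 \
  --max-num-seqs 2 \
  --max-num-batched-tokens 2048 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2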

Anyone else getting random stuck looping every once in a while? Seems like it might be a bug with NVFP4 itself and vLLM?

@vkerkez No, it isn't an issue with vLLM and NVFP4; it works fine on other models.

@eddy1111111 Hold up, you're using vLLM in WSL, with 3 GPUs and flashinfer?!

Wizard.

Weird... is it an issue with this NVFP4 quant, then?

So what kind of speeds are you guys getting with this?

I see above an average output of 94 tok/s on 2x 96 GB Blackwell 6000s, but honestly, for FP4 with just 10B active parameters, it feels like it should go quite a bit higher?

I'll see if I can test it on a B300 in FP8 and with this quant. The B200 vLLM Docker images for 0.15.x have been kinda buggy, but B300 has been fine when it was available.

@nisten
Single stream, I get about 91 tok/s at low context. Here's a visualizer for performance vs power limit vs concurrency vs context for this model on dual RTX 6000 Blackwell:

[screenshot: performance vs power limit vs concurrency vs context visualizer]

https://shihanqu.github.io/Blackwell-Wattage-Performance/

Might seem slow vs a B300 beast! But for reference, at 32k context, one of those sold-out $10k 512 GB Mac Ultras can serve 4-bit MiniMax M2.5 single-stream at the same speed at which a $20k dual RTX 6000 Pro system can serve 16 simultaneous requests.
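If you want to compare numbers on your own box, a crude single-stream estimate against the server from the top of the thread (assuming port 1235 and that the response includes usage stats) is just completion tokens divided by wall time:

# Crude single-stream tok/s: time one request and divide the completion token
# count from the usage field by the elapsed seconds.
START=$(date +%s.%N)
RESP=$(curl -s http://localhost:1235/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "MiniMax-M2.5-NVFP4",
       "messages": [{"role": "user", "content": "Write a 500-word story."}],
       "max_tokens": 1024}')
END=$(date +%s.%N)
TOKENS=$(echo "$RESP" | python -c "import sys, json; print(json.load(sys.stdin)['usage']['completion_tokens'])")
echo "scale=1; $TOKENS / ($END - $START)" | bc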
