Here's the vLLM recipe I'm using with 2x RTX Pro 6000

#1
by zenmagnets - opened

Basic launch (copy-paste)

Set your Hugging Face cache and GPUs, then run (from project root with venv activated):

export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1
export HF_HOME=/path/to/huggingface
export HUGGINGFACE_HUB_CACHE=$HF_HOME/hub
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export SAFETENSORS_FAST_GPU=1
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_DISABLE_PYNCCL=1
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
export OMP_NUM_THREADS=8
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1

python -m vllm.entrypoints.openai.api_server \
  --model lukealonso/MiniMax-M2.5-NVFP4 \
  --download-dir $HUGGINGFACE_HUB_CACHE \
  --host 0.0.0.0 \
  --port 1235 \
  --served-model-name MiniMax-M2.5-NVFP4 \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --attention-backend FLASH_ATTN \
  --gpu-memory-utilization 0.95 \
  --max-model-len 190000 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 64 \
  --disable-custom-all-reduce \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2_append_think
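Once the server is up, a quick sanity check is a plain OpenAI-style chat completion against it. Just a smoke-test sketch, assuming the host, port, and served model name from the command above:

# Smoke test against the OpenAI-compatible endpoint started above.
# Adjust host/port/model name if you changed them in the launch command.
curl -s http://localhost:1235/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "MiniMax-M2.5-NVFP4",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'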

Dependencies (tested with)

Install in a Python 3.12 venv; use CUDA 12.x on the host.

Package              Version      Note
vllm                 0.15.1       OpenAI server + NVFP4 MoE
torch                2.9.1+cu128  CUDA 12.8 build
transformers         4.57.6
safetensors          0.7.0
nvidia-modelopt      0.41.0       NVFP4 / ModelOpt format
flashinfer-python    0.6.1        Optional (we use FLASH_ATTN)
nvidia-nccl-cu12     2.27.5       Multi-GPU
nvidia-cutlass-dsl*  4.4.0.dev1   NVFP4 GEMM (script uses cutlass backend)

System: CUDA 12.8, cuDNN 9.10.2 (or matching torch cuDNN). Driver must support your GPUs (e.g. Blackwell).
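If you want to reproduce that environment, here is a rough install sketch (not an official requirements file; the cu128 index URL is my assumption about where the +cu128 torch build comes from):

# Sketch of reproducing the environment above; pins mirror the table and may
# need adjusting for your CUDA / driver combination.
python3.12 -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install "torch==2.9.1" --index-url https://download.pytorch.org/whl/cu128
pip install "vllm==0.15.1" "transformers==4.57.6" "safetensors==0.7.0" \
            "nvidia-modelopt==0.41.0" "flashinfer-python==0.6.1" \
            "nvidia-nccl-cu12==2.27.5"
# Plus the CUTLASS DSL package from the table (name abbreviated there) for the
# cutlass NVFP4 GEMM backend.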


Speeds are pretty much identical to M2.1:

[screenshot: speed benchmark]

You are disabling P2P?

Personally I haven't tried vLLM yet, but on sglang I definitely have P2P enabled.

Will try the above today and add it to the instructions if it works with P2P enabled as well.

That being said, I find sglang performs even better and is a lot less annoying to use.

I have P2P enabled and (really) working with RTX 6000 Pro Blackwell in vLLM. I have a pending PR; you can check what I did to fix it here:

https://github.com/Gadflyii/vllm/tree/main

Real simple: it manages the custom all-reduce and checks for the correct iommu=pt kernel parameter. If present, it will use P2P and it will really work, instead of silently failing and falling back.

Cool, makes sense. FWIW, I also had to set amd_iommu=pt in addition to iommu=pt; the latter alone was not enough.
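For anyone following along, a quick way to check both of these before launching (a rough sketch, not part of the PR) is to look at the kernel command line and then ask the driver and PyTorch whether peer access actually works:

# Confirm the IOMMU passthrough parameters are actually on the kernel cmdline.
grep -oE '(amd_)?iommu=pt' /proc/cmdline

# Show the GPU-to-GPU topology nvidia-smi sees.
nvidia-smi topo -m

# Ask PyTorch whether GPU 0 can access GPU 1 peer-to-peer.
python -c "import torch; print(torch.cuda.can_device_access_peer(0, 1))"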

Ok, turned on P2P. Slight performance increase:

[screenshot: speed benchmark with P2P enabled]
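If you're replicating this, the minimal change relative to the recipe at the top is presumably just letting NCCL use P2P again and dropping the custom all-reduce opt-out. A sketch only; which other workarounds you still need depends on the IOMMU setup discussed above:

# Allow NCCL peer-to-peer transfers again.
export NCCL_P2P_DISABLE=0
# ...and remove --disable-custom-all-reduce from the api_server command so
# vLLM's custom all-reduce can actually be used.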

Thanks! It's working, so fast. I use PP=3, TP=1, EP=1:

"winrm_port": 5985,
"winrm_use_ssl": false,
"winrm_transport": "ntlm",
"wsl_distro": "Ubuntu-vLLM",
"model_path": "/models/minimaxm2.5_nvfp4/",
"venv_path": "/opt/vllm/venv/bin/activate",
"cuda_home": "/usr/local/cuda-13.0",
"script_name": "launch_minimax_3gpu.sh",
"script_remote_path": "/opt/launch_minimax_3gpu.sh",
"log_remote_path": "/opt/minimax_output.txt",
"task_name": "ServeMiniMax",
"tensor_parallel_size": 1,
"pipeline_parallel_size": 3,
"enable_expert_parallel": true,
"trust_remote_code": true,
"dtype": "auto",
"kv_cache_dtype": "fp8",
"host": "0.0.0.0",
"port": 8000,
"gpu_memory_utilization": 0.85,
"max_model_len": "196608",
"max_num_seqs": 2,
"max_num_batched_tokens": 2048,
"enable_chunked_prefill": true,
"enable_prefix_caching": true,
"enable_auto_tool_choice": true,
"tool_call_parser": "minimax_m2",
"reasoning_parser": "minimax_m2",
"gen_max_tokens": "",
"gen_temperature": "",
"gen_top_p": "",
"gen_top_k": "",
"enforce_eager": false,
"cudagraph_capture_sizes": "",
"max_cudagraph_capture_size": "",
"cpu_offload_gb": 0,
"swap_space": 0,
"kv_cache_memory_bytes": "24",
"calculate_kv_scales": false,
"vllm_use_v1": true,
"vllm_use_flashinfer_moe_fp4": false,
"omp_num_threads": 8,
"cuda_visible_devices": "0,1,2 (3 GPU)",
"nccl_debug": "INFO",
"nccl_ib_disable": 1,
"nccl_p2p_disable": 0,
"nccl_shm_disable": 0,
"nccl_nvls_enable": 0,
"nccl_cumem_enable": 0,
"nccl_net_gdr_level": 0,
"nccl_socket_ifname": "eth0",
"safetensors_fast_gpu": true,
"vllm_nvfp4_gemm_backend": "cutlass"
}
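For readers who don't use the WSL/WinRM wrapper, the interesting keys above map roughly onto a direct vLLM launch like this (a sketch reconstructed from the JSON, assuming the same local model path and the standard vLLM flag names):

# 3-way pipeline parallel, expert parallel, FP8 KV cache, as in the JSON above.
export CUDA_VISIBLE_DEVICES=0,1,2
export VLLM_NVFP4_GEMM_BACKEND=cutlass
python -m vllm.entrypoints.openai.api_server \
  --model /models/minimaxm2.5_nvfp4/ \
  --host 0.0.0.0 --port 8000 \
  --trust-remote-code \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 3 \
  --enable-expert-parallel \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 196608 \
  --max-num-seqs 2 \
  --max-num-batched-tokens 2048 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2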

Anyone else getting random stuck looping every once in a while? Seems like it might be a bug with NVFP4 itself and vLLM?

@vkerkez No, it isn't an issue with vLLM and NVFP4; it works fine on other models.

@eddy1111111 Hold up, you're using vLLM in WSL, with 3 GPUs and flashinfer?!

Wizard.

Weird... is it an issue with this NVFP4 quant, then?

So what kind of speeds are you guys getting with this?

I see above an average output of 94 tok/s on 2x 96 GB Blackwell 6000s, but honestly, for FP4 with just 10B active parameters, it feels like it should go quite a bit higher?

I'll see if I can test it on a B300 in FP8 and with this quant. The B200 vLLM Docker images for 0.15.x have been kinda buggy, but B300 has been fine when it was available.

@nisten
Single stream, I get about 91 tok/s at low context. Here's a visualizer for performance vs power limit vs concurrency vs context for this model on dual RTX 6000 Blackwell:

[screenshot: performance vs power limit vs concurrency vs context visualizer]

https://shihanqu.github.io/Blackwell-Wattage-Performance/

Might seem slow vs a B300 beast! But for reference, at 32k context, one of those sold-out $10k 512 GB Mac Ultras can serve 4-bit MiniMax M2.5 single-stream at the same speed at which a $20k dual RTX 6000 Pro system can serve 16 simultaneous requests.
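If you want to compare numbers on your own box, a crude single-stream estimate against the server from the top of the thread (assuming port 1235 and that the response includes usage stats) is just completion tokens divided by wall time:

# Crude single-stream tok/s: time one request and divide the completion token
# count from the usage field by the elapsed seconds.
START=$(date +%s.%N)
RESP=$(curl -s http://localhost:1235/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "MiniMax-M2.5-NVFP4",
       "messages": [{"role": "user", "content": "Write a 500-word story."}],
       "max_tokens": 1024}')
END=$(date +%s.%N)
TOKENS=$(echo "$RESP" | python -c "import sys, json; print(json.load(sys.stdin)['usage']['completion_tokens'])")
echo "scale=1; $TOKENS / ($END - $START)" | bc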
