Thank you, @anikifoss.
I haven't posted a benchmark for a while, so here is one for those interested.
Certainly not the best that can be achieved, since I have started to power-limit my CPU (Epyc 9355) and GPUs (this one is an RTX 5090):
./llama-sweep-bench \
--model "$MODEL_PATH" \
--no-mmap \
-fa -fmoe \
-b 2048 -ub 2048 \
-ctk q8_0 -ctv q8_0 -c 49152 \
-ngl 999 -ot exps=CPU \
--threads 16 \
--threads-batch 28 \
--warmup-batch
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 2048 | 512 | 0 | 6.488 | 315.64 | 29.354 | 17.44 |
| 2048 | 512 | 2048 | 6.577 | 311.38 | 30.003 | 17.07 |
| 2048 | 512 | 4096 | 6.633 | 308.77 | 30.482 | 16.80 |
| 2048 | 512 | 6144 | 6.729 | 304.34 | 31.201 | 16.41 |
| 2048 | 512 | 8192 | 6.817 | 300.43 | 31.760 | 16.12 |
| 2048 | 512 | 10240 | 6.878 | 297.74 | 32.157 | 15.92 |
| 2048 | 512 | 12288 | 6.975 | 293.60 | 32.599 | 15.71 |
| 2048 | 512 | 14336 | 7.063 | 289.96 | 33.259 | 15.39 |
| 2048 | 512 | 16384 | 7.151 | 286.40 | 34.504 | 14.84 |
| 2048 | 512 | 18432 | 7.268 | 281.77 | 35.270 | 14.52 |
| 2048 | 512 | 20480 | 7.430 | 275.65 | 35.006 | 14.63 |
| 2048 | 512 | 22528 | 7.554 | 271.10 | 35.781 | 14.31 |
| 2048 | 512 | 24576 | 7.651 | 267.68 | 37.252 | 13.74 |
| 2048 | 512 | 26624 | 7.763 | 263.81 | 37.740 | 13.57 |
| 2048 | 512 | 28672 | 7.870 | 260.24 | 38.907 | 13.16 |
| 2048 | 512 | 30720 | 7.982 | 256.57 | 39.417 | 12.99 |
| 2048 | 512 | 32768 | 8.099 | 252.88 | 40.261 | 12.72 |
| 2048 | 512 | 34816 | 8.177 | 250.45 | 42.002 | 12.19 |
| 2048 | 512 | 36864 | 8.276 | 247.45 | 42.873 | 11.94 |
| 2048 | 512 | 38912 | 8.429 | 242.98 | 43.775 | 11.70 |
| 2048 | 512 | 40960 | 8.523 | 240.29 | 44.600 | 11.48 |
| 2048 | 512 | 43008 | 8.642 | 236.98 | 45.184 | 11.33 |
| 2048 | 512 | 45056 | 8.725 | 234.73 | 46.276 | 11.06 |
| 2048 | 512 | 47104 | 8.837 | 231.76 | 47.090 | 10.87 |
Interesting how much VRAM this model needs to hold the context...
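A rough back-of-the-envelope for the K/V cache (the attention shape below is an assumption taken from the published GLM-4.5/4.6 configs, so check it against your GGUF metadata rather than trusting me):

```bash
# KV-cache size ≈ 2 (K and V) * n_layer * n_kv_head * head_dim * n_ctx * bytes_per_element
awk 'BEGIN {
  n_layer = 92; n_kv_head = 8; head_dim = 128;   # assumed GLM-4.6 attention shape
  n_ctx = 49152; bytes_per_elem = 1.0625;        # q8_0 stores ~8.5 bits per element
  printf "~%.1f GiB of K/V cache\n", 2 * n_layer * n_kv_head * head_dim * n_ctx * bytes_per_elem / 2^30
}'
# prints ~9.2 GiB for the 48K-token q8_0 cache in the run above
```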
@anikifoss It is a dual Epyc Turin, so 24 channels in total (?). Not quite sure, honestly :). I run it in a VM that is pinned to a single socket, using only the memory attached to that socket.
I was considering upgrading an 8-channel DDR5 system to a dual-socket 2x12-channel DDR5 setup, but I was worried that NUMA issues would prevent any meaningful performance gains.
I'd love to learn more about your experiences with the 24-channel system. Were you able to utilize all 24 channels while running inference?
@anikifoss I never tried.
I studied the NUMA situation a bit, reading about the experiences of others, and decided it is not worth the trouble :-). I only added the second CPU later because I wanted to run Proxmox on the machine with other AI-related VMs, and realized 768 GB of RAM is not enough for Proxmox + ZFS (ARC) + other VMs + the inference VM. I only bought 384 GB for the second socket, making the machine's memory configuration unbalanced across sockets. That does not affect the performance of either socket on its own, but running a workload across both sockets with shared memory would suffer quite a lot due to the crippled interleaving...
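For anyone trying to reproduce the single-socket setup on bare metal rather than in a pinned VM, something like the following should keep both the threads and the allocations on one node (a sketch only; the node index and thread counts are placeholders):

```bash
# Bind the process to NUMA node 0 and allocate only from its local memory.
numactl --cpunodebind=0 --membind=0 \
  ./llama-sweep-bench \
    --model "$MODEL_PATH" --no-mmap \
    -fa -fmoe -ngl 999 -ot exps=CPU \
    --threads 16 --threads-batch 28
# llama.cpp / ik_llama.cpp also expose a --numa option (distribute/isolate/numactl) worth experimenting with.
```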
That said, I have not yet seen a benchmark documenting good LLM inference performance on a dual-CPU setup. I know K-Transformers supported a trick where the model was cloned into the memory of both sockets, but that doesn't seem like an economical solution to me :).
As you can see here, the bus speed between sockets is quite limited:
I see, so a dual-socket system has a 12-channel bottleneck, and that's as good as it gets. Thanks for sharing your research!
I'll hold off on upgrading to a dual-socket system. Though an Epyc with 8 CCDs and 12 channels of DDR5 would give me a 50% boost, it won't give me the 3x boost I was hoping for.
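The channel math behind those numbers, assuming DDR5-6000 and ignoring real-world efficiency losses:

```bash
# Peak DRAM bandwidth = channels * 8 bytes/transfer * transfer rate (6000 MT/s assumed)
awk 'BEGIN {
  split("8 12 24", ch, " ");
  for (i = 1; i <= 3; i++)
    printf "%2d channels -> %4.0f GB/s peak\n", ch[i], ch[i] * 8 * 6000e6 / 1e9
}'
#  8 channels ->  384 GB/s peak
# 12 channels ->  576 GB/s peak  (the ~1.5x over 8 channels)
# 24 channels -> 1152 GB/s peak  (the hoped-for 3x, before any NUMA penalty)
```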
@anikifoss You should fact-check me on this, I have no idea what I am talking about 😀. But if you find someone getting significantly better performance on a dual-socket system than on a single socket, please let me know.
Here is sweep-bench on RTX 6000 with 128K context @ f16 and -ub 16384:
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 16384 | 4096 | 0 | 15.882 | 1031.59 | 236.725 | 17.30 |
| 16384 | 4096 | 16384 | 22.389 | 731.80 | 258.207 | 15.86 |
| 16384 | 4096 | 32768 | 32.372 | 506.11 | 283.656 | 14.44 |
| 16384 | 4096 | 49152 | 41.317 | 396.54 | 307.315 | 13.33 |
| 16384 | 4096 | 65536 | 50.527 | 324.26 | 337.632 | 12.13 |
| 16384 | 4096 | 81920 | 60.364 | 271.42 | 433.913 | 9.44 |
| 16384 | 4096 | 98304 | 70.328 | 232.97 | 516.324 | 7.93 |
| 16384 | 4096 | 114688 | 75.671 | 216.52 | 566.042 | 7.24 |
This is with -ot "\.([1-9][0-9])\.ffn_(gate|up|down)_exps.=CPU", but it does not have much effect.
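In case the override pattern looks cryptic, here is my reading of it (double-check against the actual tensor names in the GGUF; the command skeleton around it is just illustrative):

```bash
# \.([1-9][0-9])\.          matches a two-digit layer index, i.e. layers 10-99
#                           (layers 0-9 keep their experts on the GPU)
# ffn_(gate|up|down)_exps   the routed-expert FFN tensors of those layers
# =CPU                      place the matching tensors in host RAM instead of VRAM
./llama-sweep-bench --model "$MODEL_PATH" -fa -ngl 999 \
  -ot "\.([1-9][0-9])\.ffn_(gate|up|down)_exps.=CPU"
```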
Awesome, thanks for sharing RTX 6000 numbers!
I wonder how it would do with a smaller GLM-4.6 quant on 2x RTX 6000.
I've been playing with offloading experts to MI50s while keeping attention on the 5090. Here is what I'm getting with 4xMI50s (ROCm 6.3.3):
./build/bin/llama-server \
--alias unsloth/GLM-4.6-UD-Q2_K_XL \
--model ~/Env/models/unsloth/GLM-4.6-UD-Q2_K_XL/GLM-4.6-UD-Q2_K_XL-00001-of-00003.gguf \
--temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 \
--ctx-size 38000 \
-ctk f16 -ctv f16 \
-fa on \
-b 1024 -ub 1024 \
-ngl 99 \
--device CUDA0,ROCm0,ROCm1,ROCm2,ROCm3 \
--tensor-split 1,0,0,0,0 \
-ot "blk\.([0-9])\.attn_.*=CUDA0" \
-ot "blk\.([1-9][0-9])\.attn_.*=CUDA0" \
-ot "blk\.([0-2])\.ffn_.*=CUDA0" \
-ot "blk\.(3)\.ffn_.*_exps.*=ROCm3" \
-ot "blk\.([4-5])\.ffn_.*_exps.*=ROCm0" \
-ot "blk\.([6-7])\.ffn_.*_exps.*=ROCm1" \
-ot "blk\.([8-9])\.ffn_.*_exps.*=ROCm2" \
-ot "blk\.([1-2][0-9])\.ffn_.*_exps.*=ROCm0" \
-ot "blk\.([3-4][0-9])\.ffn_.*_exps.*=ROCm1" \
-ot "blk\.([5-6][0-9])\.ffn_.*_exps.*=ROCm2" \
-ot "blk\.([7-9][0-9])\.ffn_.*_exps.*=ROCm3" \
--jinja \
--parallel 1 \
--threads 32 \
--host 127.0.0.1 \
--port 8090
prompt eval time = 49061.74 ms / 2839 tokens ( 17.28 ms per token, 57.87 tokens per second)
eval time = 62459.47 ms / 802 tokens ( 77.88 ms per token, 12.84 tokens per second)
total time = 111521.21 ms / 3641 tokens
prompt eval time = 621226.95 ms / 36349 tokens ( 17.09 ms per token, 58.51 tokens per second)
eval time = 44711.61 ms / 454 tokens ( 98.48 ms per token, 10.15 tokens per second)
total time = 665938.57 ms / 36803 tokens
MoE offloading to another GPU takes a significant penalty for each layer, and GLM-4.6 has 92 layers, so it takes a very big hit.
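A crude way to see why the layer count hurts so much (the per-hop overhead below is an assumed figure, not a measurement): each token's activations have to bounce between the attention GPU and the expert GPU once per offloaded layer, so even a small fixed cost multiplies by ~90.

```bash
# Toy model: per-token overhead from cross-GPU hops alone, ignoring actual compute.
awk 'BEGIN {
  layers = 92;    # roughly every GLM-4.6 layer has its experts on another GPU in the command above
  hop_ms = 0.5;   # ASSUMED round-trip overhead per layer (PCIe latency + launches + syncs)
  printf "~%.0f ms/token of hop overhead -> TG capped somewhere below %.0f t/s\n",
         layers * hop_ms, 1000 / (layers * hop_ms)
}'
# with these assumptions: ~46 ms/token, i.e. a ceiling around 22 t/s before any expert compute
```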
Interesting. I love the MI50 price tag 😀!
Did you try vLLM? Any good or bad experiences with smaller models, like gpt-oss-120b?
vLLM is not currently optimized for MoE on MI50s. To be fair, even the 5090 is not well supported for all models with vLLM, but the MI50 definitely has worse support.
DeepSeek works much better for MoE offloading to cheap GPUs. With a custom quant (basically this DeepSeek-V3.1-Terminus quant, but with all the experts changed to iq1_s to cram it into just four MI50s), I got these results:
./build/bin/llama-server \
--alias anikifoss/DeepSeek-V3.1-Terminus-Q8_0A_XSS \
--model ~/Env/models/anikifoss/DeepSeek-V3.1-Terminus-Q8_0A_XSS/DeepSeek-V3.1-Terminus-Q8_0A_XSS.gguf \
--temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 \
--ctx-size 38000 \
-ctk f16 -ctv f16 \
-fa on \
-b 1024 -ub 1024 \
-ngl 99 \
--device CUDA0,ROCm0,ROCm1,ROCm2,ROCm3 \
--tensor-split 1,0,0,0,0 \
-ot "blk\.([0-9])\.attn_.*=CUDA0" \
-ot "blk\.([1-9][0-9])\.attn_.*=CUDA0" \
-ot "blk\.([0-2])\.ffn_.*=CUDA0" \
-ot "blk\.([3-6])\.ffn_.*_exps.*=ROCm0" \
-ot "blk\.([7-9])\.ffn_.*_exps.*=ROCm1" \
-ot "blk\.(1[0-1])\.ffn_.*_exps.*=ROCm1" \
-ot "blk\.(1[2-5])\.ffn_.*_exps.*=ROCm2" \
-ot "blk\.(1[6-9])\.ffn_.*_exps.*=ROCm3" \
-ot "blk\.(2[0-9])\.ffn_.*_exps.*=ROCm0" \
-ot "blk\.(3[0-9])\.ffn_.*_exps.*=ROCm1" \
-ot "blk\.(4[0-9])\.ffn_.*_exps.*=ROCm2" \
-ot "blk\.(5[0-9])\.ffn_.*_exps.*=ROCm3" \
-ot "blk\.60\.ffn_.*_exps.*=ROCm3" \
--jinja \
--parallel 1 \
--threads 32 \
--host 127.0.0.1 \
--port 8090
prompt eval time = 26065.27 ms / 2874 tokens ( 9.07 ms per token, 110.26 tokens per second)
eval time = 31335.96 ms / 447 tokens ( 70.10 ms per token, 14.26 tokens per second)
total time = 57401.23 ms / 3321 tokens
prompt eval time = 388345.31 ms / 37279 tokens ( 10.42 ms per token, 95.99 tokens per second)
eval time = 36343.06 ms / 484 tokens ( 75.09 ms per token, 13.32 tokens per second)
total time = 424688.38 ms / 37763 tokens
Since the memory bandwidth on MI50s is so high, and compute does not change much with a higher-quality quantization, I believe these numbers would hold with 16xMI50s and something like Q6_K for all the experts.
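A rough sanity check on that intuition (all constants below are assumptions: ~37B active parameters for a DeepSeek-V3-class model, ~6.56 bits/weight for Q6_K, ~1 TB/s of HBM2 per MI50, experts spread evenly across cards):

```bash
# Weight reads per generated token vs. aggregate HBM bandwidth of 16 MI50s.
awk 'BEGIN {
  active = 37e9;      # assumed active params per token
  bpw    = 0.82;      # Q6_K ~ 6.56 bits per weight
  bw     = 16 * 1e12; # 16 x ~1 TB/s HBM2
  printf "~%.1f ms/token just for weight reads\n", active * bpw / bw * 1000
}'
# ~1.9 ms/token, so memory bandwidth is nowhere near the bottleneck; the per-layer hops and
# expert compute dominate, which is why a higher-quality quant should not cost much speed.
```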
And GLM-4.5-Air really shines (with just 46 layers and less compute):
./build/bin/llama-server \
--alias anikifoss/GLM-4.5-Air-HQ4_K \
--model ~/Env/models/anikifoss/GLM-4.5-Air-HQ4_K/GLM-4.5-Air-HQ4_K-00001-of-00002.gguf \
--temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 \
--ctx-size 38000 \
-ctk f16 -ctv f16 \
-fa on \
-b 1024 -ub 1024 \
-ngl 99 \
--device CUDA0,ROCm0,ROCm1,ROCm2,ROCm3 \
--tensor-split 1,0,0,0,0 \
-ot "blk\.([0-9])\.attn_.*=CUDA0" \
-ot "blk\.([1-4][0-9])\.attn_.*=CUDA0" \
-ot "blk\.0\.ffn_.*=CUDA0" \
-ot "blk\.([1-9])\.ffn_.*_exps.*=ROCm0" \
-ot "blk\.(1[0-9])\.ffn_.*_exps.*=ROCm1" \
-ot "blk\.(2[0-9])\.ffn_.*_exps.*=ROCm2" \
-ot "blk\.(3[0-9])\.ffn_.*_exps.*=ROCm3" \
-ot "blk\.(4[0-9])\.ffn_.*_exps.*=ROCm3" \
-ot "blk\.46\.nextn.*=CPU" \
--jinja \
--parallel 1 \
--threads 32 \
--host 127.0.0.1 \
--port 8090
prompt eval time = 6565.97 ms / 2840 tokens ( 2.31 ms per token, 432.53 tokens per second)
eval time = 30444.09 ms / 913 tokens ( 33.35 ms per token, 29.99 tokens per second)
total time = 37010.06 ms / 3753 tokens
prompt eval time = 87546.13 ms / 36349 tokens ( 2.41 ms per token, 415.20 tokens per second)
eval time = 40974.20 ms / 1034 tokens ( 39.63 ms per token, 25.24 tokens per second)
total time = 128520.33 ms / 37383 tokens
I was just about to write that it looks good (DeepSeek), but with 16xMI50 we are talking about an investment exceeding an EPYC + 1 TB RAM build, which has similar performance and, one could argue, might be the more practical build. But the GLM numbers look really nice!
I would definitely love to hear more as you progress with this experiment.
Interestingly Q8_0 (@unsloth) is not that much slower.
But yes... it is not exactly the 30 t/s you might get when fully offloaded 😀:
RTX 5090
./llama-sweep-bench \
--model "$MODEL_PATH" \
--no-mmap \
-fa -fmoe \
-b 8192 -ub 8192 \
-ctk q8_0 -ctv q8_0 -c 32768 \
-ngl 999 -ot exps=CPU \
--threads 16 \
--threads-batch 28 \
--warmup-batch
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 8192 | 2048 | 0 | 13.324 | 614.83 | 143.697 | 14.25 |
| 8192 | 2048 | 8192 | 14.941 | 548.31 | 152.929 | 13.39 |
| 8192 | 2048 | 16384 | 16.703 | 490.45 | 160.305 | 12.78 |
| 8192 | 2048 | 24576 | 18.683 | 438.48 | 172.090 | 11.90 |
RTX 6000
./llama-sweep-bench \
--model "$MODEL_PATH" \
--no-mmap \
-fa -fmoe \
-b 16384 -ub 16384 \
-ctk f16 -ctv f16 -c 163840 \
-ngl 999 -ot exps=CPU \
--threads 16 \
--threads-batch 28 \
--warmup-batch
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 16384 | 4096 | 0 | 18.825 | 870.31 | 287.942 | 14.23 |
| 16384 | 4096 | 16384 | 25.074 | 653.43 | 295.760 | 13.85 |
| 16384 | 4096 | 32768 | 34.702 | 472.14 | 336.472 | 12.17 |
> Interestingly Q8_0 (@unsloth) is not that much slower.
Which model? The 30 tokens/sec is GLM-4.5-Air, not the full GLM.
Ah, okay. The Q8_0 (@unsloth) numbers above are for GLM-4.6.