Thank you


Thank you, @anikifoss.

I haven't posted a benchmark for a while, so here is one for those interested.
Certainly not the best that can be achieved, as I have started to power-limit my CPU (Epyc 9355) and GPUs (this one is an RTX 5090):

./llama-sweep-bench \
  --model "$MODEL_PATH" \
  --no-mmap \
  -fa -fmoe \
  -b 2048 -ub 2048 \
  -ctk q8_0 -ctv q8_0 -c 49152 \
  -ngl 999 -ot exps=CPU \
  --threads 16 \
  --threads-batch 28 \
  --warmup-batch
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 2048 | 512 | 0 | 6.488 | 315.64 | 29.354 | 17.44 |
| 2048 | 512 | 2048 | 6.577 | 311.38 | 30.003 | 17.07 |
| 2048 | 512 | 4096 | 6.633 | 308.77 | 30.482 | 16.80 |
| 2048 | 512 | 6144 | 6.729 | 304.34 | 31.201 | 16.41 |
| 2048 | 512 | 8192 | 6.817 | 300.43 | 31.760 | 16.12 |
| 2048 | 512 | 10240 | 6.878 | 297.74 | 32.157 | 15.92 |
| 2048 | 512 | 12288 | 6.975 | 293.60 | 32.599 | 15.71 |
| 2048 | 512 | 14336 | 7.063 | 289.96 | 33.259 | 15.39 |
| 2048 | 512 | 16384 | 7.151 | 286.40 | 34.504 | 14.84 |
| 2048 | 512 | 18432 | 7.268 | 281.77 | 35.270 | 14.52 |
| 2048 | 512 | 20480 | 7.430 | 275.65 | 35.006 | 14.63 |
| 2048 | 512 | 22528 | 7.554 | 271.10 | 35.781 | 14.31 |
| 2048 | 512 | 24576 | 7.651 | 267.68 | 37.252 | 13.74 |
| 2048 | 512 | 26624 | 7.763 | 263.81 | 37.740 | 13.57 |
| 2048 | 512 | 28672 | 7.870 | 260.24 | 38.907 | 13.16 |
| 2048 | 512 | 30720 | 7.982 | 256.57 | 39.417 | 12.99 |
| 2048 | 512 | 32768 | 8.099 | 252.88 | 40.261 | 12.72 |
| 2048 | 512 | 34816 | 8.177 | 250.45 | 42.002 | 12.19 |
| 2048 | 512 | 36864 | 8.276 | 247.45 | 42.873 | 11.94 |
| 2048 | 512 | 38912 | 8.429 | 242.98 | 43.775 | 11.70 |
| 2048 | 512 | 40960 | 8.523 | 240.29 | 44.600 | 11.48 |
| 2048 | 512 | 43008 | 8.642 | 236.98 | 45.184 | 11.33 |
| 2048 | 512 | 45056 | 8.725 | 234.73 | 46.276 | 11.06 |
| 2048 | 512 | 47104 | 8.837 | 231.76 | 47.090 | 10.87 |
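On the power limiting mentioned at the top: the GPU side is just the standard nvidia-smi knob (a sketch; 400 W is an arbitrary example value, and the CPU limit is set in BIOS/BMC, not shown here):

sudo nvidia-smi -pm 1     # enable persistence mode (recommended before changing limits)
sudo nvidia-smi -pl 400   # set the board power limit in watts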

Interesting how much VRAM this model needs to hold the context...
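A rough way to see why: the KV cache grows linearly with layers, KV heads, head size and context length. A back-of-the-envelope sketch (the KV-head count and head size below are assumptions I have not checked against the GLM-4.6 config; q8_0 is treated as roughly 1 byte per element, ignoring block overhead):

# bytes ≈ 2 (K and V) * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_element
awk 'BEGIN { layers=92; kv_heads=8; head_dim=128; ctx=49152; bytes=1;
             printf "%.1f GiB\n", 2*layers*kv_heads*head_dim*ctx*bytes / 2^30 }'
# -> about 8.6 GiB for the 48K context above, before weights and compute buffers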

@sousekd thanks for sharing benchmark results!

How many memory channels do you have in that system?

@anikifoss It is a dual Epyc Turin, so 24 channels in total (?). Not quite sure, honestly :). I run it in a VM that is pinned to a single socket, using only the memory attached to that socket.
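For anyone doing the same pinning on bare metal instead of a VM, numactl does the equivalent job (a sketch; node 0 is an assumption, check the actual topology first):

numactl --hardware                     # list NUMA nodes and the memory attached to each
numactl --cpunodebind=0 --membind=0 \
  ./llama-sweep-bench --model "$MODEL_PATH" ...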

I was considering upgrading an 8-channel DDR5 system to a dual-socket system with 2x12 DDR5 channels, but I was worried about NUMA issues preventing any meaningful performance gains.

I'd love to learn more about your experiences with the 24-channel system. Were you able to utilize all 24 channels while running inference?

@anikifoss I never tried.

I studied the NUMA topic a bit, read about others' experiences, and decided it is not worth the trouble :-). I only added the second CPU later because I wanted to run Proxmox on the machine with other AI-related VMs, and realized 768 GB of RAM is not enough for Proxmox + ZFS (ARC) + other VMs + the inference VM. I only bought 384 GB for the second socket, making the machine's memory configuration unbalanced across sockets. It does not affect the performance of either socket on its own, but running work across both sockets with shared memory would suffer quite a lot due to the crippled interleaving...

That said, I have not yet seen a benchmark documenting good LLM inference performance in a dual-CPU setup. I know KTransformers supported a trick where the model was cloned into the memory of both sockets, but that doesn't seem like an economical solution to me :).
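For what it's worth, llama.cpp itself only exposes NUMA placement modes rather than anything like that weight-cloning trick (a sketch of the flag; whether either mode actually helps on a dual-socket box is exactly the open question):

# --numa modes in llama.cpp (check --help in your build for the full list)
./llama-sweep-bench --model "$MODEL_PATH" --numa distribute ...   # spread threads across NUMA nodes
./llama-sweep-bench --model "$MODEL_PATH" --numa isolate ...      # keep work on the node the process started on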

As you can see here, the bus speed between sockets is quite limited:
(image: MZ73-LM2 block diagram)

I see, so a dual-socket system has a 12-channel bottleneck, and that's as good as it gets. Thanks for sharing your research!

I'll hold off on upgrading to a dual-socket system. Though an Epyc with 8 CCDs and 12-channel DDR5 would give me a 50% boost, it won't give me the 3x boost I was hoping for.

@anikifoss You should fact-check me on this; I have no idea what I am talking about 😀. But if you find someone with significantly better performance on dual-socket than on a single socket, please let me know.

Here is sweep-bench on RTX 6000 with 128K context @ f16 and -ub 16384:

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 16384 | 4096 | 0 | 15.882 | 1031.59 | 236.725 | 17.30 |
| 16384 | 4096 | 16384 | 22.389 | 731.80 | 258.207 | 15.86 |
| 16384 | 4096 | 32768 | 32.372 | 506.11 | 283.656 | 14.44 |
| 16384 | 4096 | 49152 | 41.317 | 396.54 | 307.315 | 13.33 |
| 16384 | 4096 | 65536 | 50.527 | 324.26 | 337.632 | 12.13 |
| 16384 | 4096 | 81920 | 60.364 | 271.42 | 433.913 | 9.44 |
| 16384 | 4096 | 98304 | 70.328 | 232.97 | 516.324 | 7.93 |
| 16384 | 4096 | 114688 | 75.671 | 216.52 | 566.042 | 7.24 |

This is with -ot "\.([1-9][0-9])\.ffn_(gate|up|down)_exps.=CPU", but it does not have much effect.
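If I read that pattern right, it only matches two-digit layer indices, i.e. the experts of blk.0-blk.9 stay on the GPU and blk.10 and up go to the CPU. A quick way to sanity-check such -ot regexes against tensor names (the printf names are just illustrative):

printf 'blk.%d.ffn_up_exps.weight\n' 3 9 10 42 91 \
  | grep -E '\.([1-9][0-9])\.ffn_(gate|up|down)_exps.'
# -> only blk.10, blk.42 and blk.91 print, so only those experts would be overridden to CPU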

Awesome, thanks for sharing RTX 6000 numbers!

I wonder how it would do with a smaller GLM-4.6 quant on 2x RTX 6000.

I've been playing with offloading experts to MI50s while keeping attention on the 5090. Here is what I'm getting with 4x MI50s (ROCm 6.3.3):

./build/bin/llama-server \
    --alias unsloth/GLM-4.6-UD-Q2_K_XL \
    --model ~/Env/models/unsloth/GLM-4.6-UD-Q2_K_XL/GLM-4.6-UD-Q2_K_XL-00001-of-00003.gguf \
    --temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 \
    --ctx-size 38000 \
    -ctk f16 -ctv f16 \
    -fa on \
    -b 1024 -ub 1024 \
    -ngl 99 \
    --device CUDA0,ROCm0,ROCm1,ROCm2,ROCm3 \
    --tensor-split 1,0,0,0,0 \
    -ot "blk\.([0-9])\.attn_.*=CUDA0" \
    -ot "blk\.([1-9][0-9])\.attn_.*=CUDA0" \
    -ot "blk\.([0-2])\.ffn_.*=CUDA0" \
    -ot "blk\.(3)\.ffn_.*_exps.*=ROCm3" \
    -ot "blk\.([4-5])\.ffn_.*_exps.*=ROCm0" \
    -ot "blk\.([6-7])\.ffn_.*_exps.*=ROCm1" \
    -ot "blk\.([8-9])\.ffn_.*_exps.*=ROCm2" \
    -ot "blk\.([1-2][0-9])\.ffn_.*_exps.*=ROCm0" \
    -ot "blk\.([3-4][0-9])\.ffn_.*_exps.*=ROCm1" \
    -ot "blk\.([5-6][0-9])\.ffn_.*_exps.*=ROCm2" \
    -ot "blk\.([7-9][0-9])\.ffn_.*_exps.*=ROCm3" \
    --jinja \
    --parallel 1 \
    --threads 32 \
    --host 127.0.0.1 \
    --port 8090

prompt eval time =   49061.74 ms /  2839 tokens (   17.28 ms per token,    57.87 tokens per second)
       eval time =   62459.47 ms /   802 tokens (   77.88 ms per token,    12.84 tokens per second)
      total time =  111521.21 ms /  3641 tokens

prompt eval time =  621226.95 ms / 36349 tokens (   17.09 ms per token,    58.51 tokens per second)
       eval time =   44711.61 ms /   454 tokens (   98.48 ms per token,    10.15 tokens per second)
      total time =  665938.57 ms / 36803 tokens

Offloading MoE experts to another GPU incurs a significant per-layer penalty, and GLM-4.6 has 92 layers, so it takes a very big hit.
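To put the per-layer budget in perspective (treating decode time as evenly split across layers, which is a simplification):

# 77.88 ms per generated token, spread over 92 layers:
awk 'BEGIN { printf "%.2f ms per layer per token\n", 77.88 / 92 }'   # ≈ 0.85 ms

So even a few tenths of a millisecond of cross-device launch and transfer overhead per layer eats a large share of the per-token time.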

Interesting. I love the MI50 price tag 😀!
Did you try vLLM? Any good or bad experiences with smaller models, like gpt-oss-120b?

vLLM is not currently optimized for MoE on MI50s. To be fair, even the 5090 is not well supported for all models in vLLM, but the MI50 definitely has worse support.

DeepSeek works much better for MoE offloading to cheap GPUs. With a custom quant that is basically this DeepSeek-V3.1-Terminus quant, but with all the experts changed to iq1_s to cram it into just four MI50s, I got these results:

./build/bin/llama-server \
    --alias anikifoss/DeepSeek-V3.1-Terminus-Q8_0A_XSS \
    --model ~/Env/models/anikifoss/DeepSeek-V3.1-Terminus-Q8_0A_XSS/DeepSeek-V3.1-Terminus-Q8_0A_XSS.gguf \
    --temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 \
    --ctx-size 38000 \
    -ctk f16 -ctv f16 \
    -fa on \
    -b 1024 -ub 1024 \
    -ngl 99 \
    --device CUDA0,ROCm0,ROCm1,ROCm2,ROCm3 \
    --tensor-split 1,0,0,0,0 \
    -ot "blk\.([0-9])\.attn_.*=CUDA0" \
    -ot "blk\.([1-9][0-9])\.attn_.*=CUDA0" \
    -ot "blk\.([0-2])\.ffn_.*=CUDA0" \
    -ot "blk\.([3-6])\.ffn_.*_exps.*=ROCm0" \
    -ot "blk\.([7-9])\.ffn_.*_exps.*=ROCm1" \
    -ot "blk\.(1[0-1])\.ffn_.*_exps.*=ROCm1" \
    -ot "blk\.(1[2-5])\.ffn_.*_exps.*=ROCm2" \
    -ot "blk\.(1[6-9])\.ffn_.*_exps.*=ROCm3" \
    -ot "blk\.(2[0-9])\.ffn_.*_exps.*=ROCm0" \
    -ot "blk\.(3[0-9])\.ffn_.*_exps.*=ROCm1" \
    -ot "blk\.(4[0-9])\.ffn_.*_exps.*=ROCm2" \
    -ot "blk\.(5[0-9])\.ffn_.*_exps.*=ROCm3" \
    -ot "blk\.60\.ffn_.*_exps.*=ROCm3" \
    --jinja \
    --parallel 1 \
    --threads 32 \
    --host 127.0.0.1 \
    --port 8090

prompt eval time =   26065.27 ms /  2874 tokens (    9.07 ms per token,   110.26 tokens per second)
       eval time =   31335.96 ms /   447 tokens (   70.10 ms per token,    14.26 tokens per second)
      total time =   57401.23 ms /  3321 tokens

prompt eval time =  388345.31 ms / 37279 tokens (   10.42 ms per token,    95.99 tokens per second)
       eval time =   36343.06 ms /   484 tokens (   75.09 ms per token,    13.32 tokens per second)
      total time =  424688.38 ms / 37763 tokens

Since the memory bandwidth on MI50s is so high, and the compute cost does not change much with higher-quality quantization, I believe these numbers would be about the same with 16x MI50s and something like Q6_K for all the experts.
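A back-of-the-envelope for that claim (a sketch; the ~30B of routed-expert weights touched per token is a rough assumption, and the bits-per-weight figures are the nominal iq1_s and Q6_K sizes):

# expert bytes each GPU streams per generated token: 4x MI50 @ iq1_s vs 16x MI50 @ Q6_K
awk 'BEGIN {
  active = 30e9                                  # assumed routed-expert params read per token
  printf "4x  iq1_s: %.2f GB per GPU per token\n", active * 1.5625/8 / 4  / 1e9
  printf "16x Q6_K : %.2f GB per GPU per token\n", active * 6.5625/8 / 16 / 1e9
}'
# -> roughly the same bytes per GPU either way, so token generation should stay in the same ballpark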

And GLM-4.5-Air really shines (with just 46 layers and less compute):

./build/bin/llama-server \
    --alias anikifoss/GLM-4.5-Air-HQ4_K \
    --model ~/Env/models/anikifoss/GLM-4.5-Air-HQ4_K/GLM-4.5-Air-HQ4_K-00001-of-00002.gguf \
    --temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 \
    --ctx-size 38000 \
    -ctk f16 -ctv f16 \
    -fa on \
    -b 1024 -ub 1024 \
    -ngl 99 \
    --device CUDA0,ROCm0,ROCm1,ROCm2,ROCm3 \
    --tensor-split 1,0,0,0,0 \
    -ot "blk\.([0-9])\.attn_.*=CUDA0" \
    -ot "blk\.([1-4][0-9])\.attn_.*=CUDA0" \
    -ot "blk\.0\.ffn_.*=CUDA0" \
    -ot "blk\.([1-9])\.ffn_.*_exps.*=ROCm0" \
    -ot "blk\.(1[0-9])\.ffn_.*_exps.*=ROCm1" \
    -ot "blk\.(2[0-9])\.ffn_.*_exps.*=ROCm2" \
    -ot "blk\.(3[0-9])\.ffn_.*_exps.*=ROCm3" \
    -ot "blk\.(4[0-9])\.ffn_.*_exps.*=ROCm3" \
    -ot "blk\.46\.nextn.*=CPU" \
    --jinja \
    --parallel 1 \
    --threads 32 \
    --host 127.0.0.1 \
    --port 8090

prompt eval time =    6565.97 ms /  2840 tokens (    2.31 ms per token,   432.53 tokens per second)
       eval time =   30444.09 ms /   913 tokens (   33.35 ms per token,    29.99 tokens per second)
      total time =   37010.06 ms /  3753 tokens

prompt eval time =   87546.13 ms / 36349 tokens (    2.41 ms per token,   415.20 tokens per second)
       eval time =   40974.20 ms /  1034 tokens (   39.63 ms per token,    25.24 tokens per second)
      total time =  128520.33 ms / 37383 tokens

I was just about to write that it looks good (DeepSeek), but with 16x MI50s we are talking about an investment exceeding an EPYC + 1 TB RAM build, which has similar performance and, one could argue, might be the more practical build. But the GLM numbers look really nice!

I would definitely love to hear more as you progress with this experiment.

Interestingly, Q8_0 (@unsloth) is not that much slower.
But yes... it is not exactly the 30 t/s you might get when fully offloaded 😀:

RTX 5090

./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -fa -fmoe \
    -b 8192 -ub 8192 \
    -ctk q8_0 -ctv q8_0 -c 32768 \
    -ngl 999 -ot exps=CPU \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 8192 | 2048 | 0 | 13.324 | 614.83 | 143.697 | 14.25 |
| 8192 | 2048 | 8192 | 14.941 | 548.31 | 152.929 | 13.39 |
| 8192 | 2048 | 16384 | 16.703 | 490.45 | 160.305 | 12.78 |
| 8192 | 2048 | 24576 | 18.683 | 438.48 | 172.090 | 11.90 |

RTX 6000

./llama-sweep-bench \
    --model "$MODEL_PATH" \
    --no-mmap \
    -fa -fmoe \
    -b 16384 -ub 16384 \
    -ctk f16 -ctv f16 -c 163840 \
    -ngl 999 -ot exps=CPU \
    --threads 16 \
    --threads-batch 28 \
    --warmup-batch
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---:|---:|-----:|-------:|---------:|-------:|---------:|
| 16384 | 4096 | 0 | 18.825 | 870.31 | 287.942 | 14.23 |
| 16384 | 4096 | 16384 | 25.074 | 653.43 | 295.760 | 13.85 |
| 16384 | 4096 | 32768 | 34.702 | 472.14 | 336.472 | 12.17 |

> Interestingly, Q8_0 (@unsloth) is not that much slower.

Which model? The 30 tokens/sec is GLM-4.5-Air, not the full GLM.

Ah, okay. The Q8_0 (@unsloth) numbers above are for GLM-4.6.
