Thank you, @anikifoss.
I haven't posted a benchmark for a while, so here is one for those interested.
Certainly not the best that can be achieved, since I have started to power-limit my CPU (Epyc 9355) and GPUs (this one is an RTX 5090):
./llama-sweep-bench \
--model "$MODEL_PATH" \
--no-mmap \
-fa -fmoe \
-b 2048 -ub 2048 \
-ctk q8_0 -ctv q8_0 -c 49152 \
-ngl 999 -ot exps=CPU \
--threads 16 \
--threads-batch 28 \
--warmup-batch
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 2048 | 512 | 0 | 6.488 | 315.64 | 29.354 | 17.44 |
| 2048 | 512 | 2048 | 6.577 | 311.38 | 30.003 | 17.07 |
| 2048 | 512 | 4096 | 6.633 | 308.77 | 30.482 | 16.80 |
| 2048 | 512 | 6144 | 6.729 | 304.34 | 31.201 | 16.41 |
| 2048 | 512 | 8192 | 6.817 | 300.43 | 31.760 | 16.12 |
| 2048 | 512 | 10240 | 6.878 | 297.74 | 32.157 | 15.92 |
| 2048 | 512 | 12288 | 6.975 | 293.60 | 32.599 | 15.71 |
| 2048 | 512 | 14336 | 7.063 | 289.96 | 33.259 | 15.39 |
| 2048 | 512 | 16384 | 7.151 | 286.40 | 34.504 | 14.84 |
| 2048 | 512 | 18432 | 7.268 | 281.77 | 35.270 | 14.52 |
| 2048 | 512 | 20480 | 7.430 | 275.65 | 35.006 | 14.63 |
| 2048 | 512 | 22528 | 7.554 | 271.10 | 35.781 | 14.31 |
| 2048 | 512 | 24576 | 7.651 | 267.68 | 37.252 | 13.74 |
| 2048 | 512 | 26624 | 7.763 | 263.81 | 37.740 | 13.57 |
| 2048 | 512 | 28672 | 7.870 | 260.24 | 38.907 | 13.16 |
| 2048 | 512 | 30720 | 7.982 | 256.57 | 39.417 | 12.99 |
| 2048 | 512 | 32768 | 8.099 | 252.88 | 40.261 | 12.72 |
| 2048 | 512 | 34816 | 8.177 | 250.45 | 42.002 | 12.19 |
| 2048 | 512 | 36864 | 8.276 | 247.45 | 42.873 | 11.94 |
| 2048 | 512 | 38912 | 8.429 | 242.98 | 43.775 | 11.70 |
| 2048 | 512 | 40960 | 8.523 | 240.29 | 44.600 | 11.48 |
| 2048 | 512 | 43008 | 8.642 | 236.98 | 45.184 | 11.33 |
| 2048 | 512 | 45056 | 8.725 | 234.73 | 46.276 | 11.06 |
| 2048 | 512 | 47104 | 8.837 | 231.76 | 47.090 | 10.87 |
Interesting how much VRAM this model needs to hold the context...
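A rough back-of-the-envelope for the K/V cache (the attention shape below is an assumption taken from the published GLM-4.5/4.6 configs, so check it against your GGUF metadata rather than trusting me):

```bash
# KV-cache size ≈ 2 (K and V) * n_layer * n_kv_head * head_dim * n_ctx * bytes_per_element
awk 'BEGIN {
  n_layer = 92; n_kv_head = 8; head_dim = 128;   # assumed GLM-4.6 attention shape
  n_ctx = 49152; bytes_per_elem = 1.0625;        # q8_0 stores ~8.5 bits per element
  printf "~%.1f GiB of K/V cache\n", 2 * n_layer * n_kv_head * head_dim * n_ctx * bytes_per_elem / 2^30
}'
# prints ~9.2 GiB for the 48K-token q8_0 cache in the run above
```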
@anikifoss It is a dual Epyc Turin, so 24 channels in total (?). Not quite sure, honestly :). I run it in a VM that is pinned to a single socket, using only the memory attached to that socket.
I was considering upgrading an 8-channel DDR5 system to a dual-socket 2x12-channel DDR5 setup, but I was worried that NUMA issues would prevent any meaningful performance gains.
I'd love to learn more about your experiences with the 24-channel system. Were you able to utilize all 24 channels while running inference?
@anikifoss I never tried.
I studied the NUMA situation a bit, reading about the experiences of others, and decided it is not worth the trouble :-). I only added the second CPU later because I wanted to run Proxmox on the machine with other AI-related VMs, and realized 768 GB of RAM is not enough for Proxmox + ZFS (ARC) + other VMs + the inference VM. I only bought 384 GB for the second socket, making the machine's memory configuration unbalanced across sockets. That does not affect the performance of either socket on its own, but running a workload across both sockets with shared memory would suffer quite a lot due to the crippled interleaving...
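For anyone trying to reproduce the single-socket setup on bare metal rather than in a pinned VM, something like the following should keep both the threads and the allocations on one node (a sketch only; the node index and thread counts are placeholders):

```bash
# Bind the process to NUMA node 0 and allocate only from its local memory.
numactl --cpunodebind=0 --membind=0 \
  ./llama-sweep-bench \
    --model "$MODEL_PATH" --no-mmap \
    -fa -fmoe -ngl 999 -ot exps=CPU \
    --threads 16 --threads-batch 28
# llama.cpp / ik_llama.cpp also expose a --numa option (distribute/isolate/numactl) worth experimenting with.
```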
That said, I have not yet seen a benchmark documenting good LLM inference performance on a dual-CPU setup. I know K-Transformers supported a trick where the model was cloned into the memory of both sockets, but that doesn't seem like an economical solution to me :).
As you can see here, the bus speed between sockets is quite limited:
I see, so a dual-socket system has a 12-channel bottleneck, and that's as good as it gets. Thanks for sharing your research!
I'll hold off on upgrading to a dual-socket system. Though an Epyc with 8 CCDs and 12 channels of DDR5 would give me a 50% boost, it won't give me the 3x boost I was hoping for.
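The channel math behind those numbers, assuming DDR5-6000 and ignoring real-world efficiency losses:

```bash
# Peak DRAM bandwidth = channels * 8 bytes/transfer * transfer rate (6000 MT/s assumed)
awk 'BEGIN {
  split("8 12 24", ch, " ");
  for (i = 1; i <= 3; i++)
    printf "%2d channels -> %4.0f GB/s peak\n", ch[i], ch[i] * 8 * 6000e6 / 1e9
}'
#  8 channels ->  384 GB/s peak
# 12 channels ->  576 GB/s peak  (the ~1.5x over 8 channels)
# 24 channels -> 1152 GB/s peak  (the hoped-for 3x, before any NUMA penalty)
```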
@anikifoss You should fact-check me on this, I have no idea what I am talking about 😀. But if you find someone getting significantly better performance on a dual-socket system than on a single socket, please let me know.
Here is sweep-bench on RTX 6000 with 128K context @ f16 and -ub 16384:
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 16384 | 4096 | 0 | 15.882 | 1031.59 | 236.725 | 17.30 |
| 16384 | 4096 | 16384 | 22.389 | 731.80 | 258.207 | 15.86 |
| 16384 | 4096 | 32768 | 32.372 | 506.11 | 283.656 | 14.44 |
| 16384 | 4096 | 49152 | 41.317 | 396.54 | 307.315 | 13.33 |
| 16384 | 4096 | 65536 | 50.527 | 324.26 | 337.632 | 12.13 |
| 16384 | 4096 | 81920 | 60.364 | 271.42 | 433.913 | 9.44 |
| 16384 | 4096 | 98304 | 70.328 | 232.97 | 516.324 | 7.93 |
| 16384 | 4096 | 114688 | 75.671 | 216.52 | 566.042 | 7.24 |
This is with -ot "\.([1-9][0-9])\.ffn_(gate|up|down)_exps.=CPU", but it does not have much effect.
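In case the override pattern looks cryptic, here is my reading of it (double-check against the actual tensor names in the GGUF; the command skeleton around it is just illustrative):

```bash
# \.([1-9][0-9])\.          matches a two-digit layer index, i.e. layers 10-99
#                           (layers 0-9 keep their experts on the GPU)
# ffn_(gate|up|down)_exps   the routed-expert FFN tensors of those layers
# =CPU                      place the matching tensors in host RAM instead of VRAM
./llama-sweep-bench --model "$MODEL_PATH" -fa -ngl 999 \
  -ot "\.([1-9][0-9])\.ffn_(gate|up|down)_exps.=CPU"
```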
Awesome, thanks for sharing RTX 6000 numbers!
I wonder how it would do with a smaller GLM-4.6 quant on 2x RTX 6000.
I've been playing with offloading experts to MI50s while keeping attention on the 5090. Here is what I'm getting with 4xMI50s (ROCm 6.3.3):
./build/bin/llama-server \
--alias unsloth/GLM-4.6-UD-Q2_K_XL \
--model ~/Env/models/unsloth/GLM-4.6-UD-Q2_K_XL/GLM-4.6-UD-Q2_K_XL-00001-of-00003.gguf \
--temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 \
--ctx-size 38000 \
-ctk f16 -ctv f16 \
-fa on \
-b 1024 -ub 1024 \
-ngl 99 \
--device CUDA0,ROCm0,ROCm1,ROCm2,ROCm3 \
--tensor-split 1,0,0,0,0 \
-ot "blk\.([0-9])\.attn_.*=CUDA0" \
-ot "blk\.([1-9][0-9])\.attn_.*=CUDA0" \
-ot "blk\.([0-2])\.ffn_.*=CUDA0" \
-ot "blk\.(3)\.ffn_.*_exps.*=ROCm3" \
-ot "blk\.([4-5])\.ffn_.*_exps.*=ROCm0" \
-ot "blk\.([6-7])\.ffn_.*_exps.*=ROCm1" \
-ot "blk\.([8-9])\.ffn_.*_exps.*=ROCm2" \
-ot "blk\.([1-2][0-9])\.ffn_.*_exps.*=ROCm0" \
-ot "blk\.([3-4][0-9])\.ffn_.*_exps.*=ROCm1" \
-ot "blk\.([5-6][0-9])\.ffn_.*_exps.*=ROCm2" \
-ot "blk\.([7-9][0-9])\.ffn_.*_exps.*=ROCm3" \
--jinja \
--parallel 1 \
--threads 32 \
--host 127.0.0.1 \
--port 8090
prompt eval time = 49061.74 ms / 2839 tokens ( 17.28 ms per token, 57.87 tokens per second)
eval time = 62459.47 ms / 802 tokens ( 77.88 ms per token, 12.84 tokens per second)
total time = 111521.21 ms / 3641 tokens
prompt eval time = 621226.95 ms / 36349 tokens ( 17.09 ms per token, 58.51 tokens per second)
eval time = 44711.61 ms / 454 tokens ( 98.48 ms per token, 10.15 tokens per second)
total time = 665938.57 ms / 36803 tokens
MoE offloading to another GPU takes a significant penalty for each layer, and GLM-4.6 has 92 layers, so it takes a very big hit.
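A crude way to see why the layer count hurts so much (the per-hop overhead below is an assumed figure, not a measurement): each token's activations have to bounce between the attention GPU and the expert GPU once per offloaded layer, so even a small fixed cost multiplies by ~90.

```bash
# Toy model: per-token overhead from cross-GPU hops alone, ignoring actual compute.
awk 'BEGIN {
  layers = 92;    # roughly every GLM-4.6 layer has its experts on another GPU in the command above
  hop_ms = 0.5;   # ASSUMED round-trip overhead per layer (PCIe latency + launches + syncs)
  printf "~%.0f ms/token of hop overhead -> TG capped somewhere below %.0f t/s\n",
         layers * hop_ms, 1000 / (layers * hop_ms)
}'
# with these assumptions: ~46 ms/token, i.e. a ceiling around 22 t/s before any expert compute
```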
Interesting. I love the MI50 price tag 😀!
Did you try vLLM? Any good or bad experiences with smaller models, like gpt-oss-120b?
vLLM is not currently optimized for MoE on MI50s. To be fair, even the 5090 is not well supported for all models with vLLM, but the MI50 definitely has worse support.
DeepSeek works much better for MoE offloading to cheap GPUs. With a custom quant (basically this DeepSeek-V3.1-Terminus quant, but with all the experts changed to iq1_s to cram it into just four MI50s), I got these results:
./build/bin/llama-server \
--alias anikifoss/DeepSeek-V3.1-Terminus-Q8_0A_XSS \
--model ~/Env/models/anikifoss/DeepSeek-V3.1-Terminus-Q8_0A_XSS/DeepSeek-V3.1-Terminus-Q8_0A_XSS.gguf \
--temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 \
--ctx-size 38000 \
-ctk f16 -ctv f16 \
-fa on \
-b 1024 -ub 1024 \
-ngl 99 \
--device CUDA0,ROCm0,ROCm1,ROCm2,ROCm3 \
--tensor-split 1,0,0,0,0 \
-ot "blk\.([0-9])\.attn_.*=CUDA0" \
-ot "blk\.([1-9][0-9])\.attn_.*=CUDA0" \
-ot "blk\.([0-2])\.ffn_.*=CUDA0" \
-ot "blk\.([3-6])\.ffn_.*_exps.*=ROCm0" \
-ot "blk\.([7-9])\.ffn_.*_exps.*=ROCm1" \
-ot "blk\.(1[0-1])\.ffn_.*_exps.*=ROCm1" \
-ot "blk\.(1[2-5])\.ffn_.*_exps.*=ROCm2" \
-ot "blk\.(1[6-9])\.ffn_.*_exps.*=ROCm3" \
-ot "blk\.(2[0-9])\.ffn_.*_exps.*=ROCm0" \
-ot "blk\.(3[0-9])\.ffn_.*_exps.*=ROCm1" \
-ot "blk\.(4[0-9])\.ffn_.*_exps.*=ROCm2" \
-ot "blk\.(5[0-9])\.ffn_.*_exps.*=ROCm3" \
-ot "blk\.60\.ffn_.*_exps.*=ROCm3" \
--jinja \
--parallel 1 \
--threads 32 \
--host 127.0.0.1 \
--port 8090
prompt eval time = 26065.27 ms / 2874 tokens ( 9.07 ms per token, 110.26 tokens per second)
eval time = 31335.96 ms / 447 tokens ( 70.10 ms per token, 14.26 tokens per second)
total time = 57401.23 ms / 3321 tokens
prompt eval time = 388345.31 ms / 37279 tokens ( 10.42 ms per token, 95.99 tokens per second)
eval time = 36343.06 ms / 484 tokens ( 75.09 ms per token, 13.32 tokens per second)
total time = 424688.38 ms / 37763 tokens
Since the memory bandwidth on MI50s is so high, and compute does not change much with a higher-quality quantization, I believe these numbers would hold with 16xMI50s and something like Q6_K for all the experts.
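A rough sanity check on that intuition (all constants below are assumptions: ~37B active parameters for a DeepSeek-V3-class model, ~6.56 bits/weight for Q6_K, ~1 TB/s of HBM2 per MI50, experts spread evenly across cards):

```bash
# Weight reads per generated token vs. aggregate HBM bandwidth of 16 MI50s.
awk 'BEGIN {
  active = 37e9;      # assumed active params per token
  bpw    = 0.82;      # Q6_K ~ 6.56 bits per weight
  bw     = 16 * 1e12; # 16 x ~1 TB/s HBM2
  printf "~%.1f ms/token just for weight reads\n", active * bpw / bw * 1000
}'
# ~1.9 ms/token, so memory bandwidth is nowhere near the bottleneck; the per-layer hops and
# expert compute dominate, which is why a higher-quality quant should not cost much speed.
```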
And GLM-4.5-Air really shines (with just 46 layers and less compute):
./build/bin/llama-server \
--alias anikifoss/GLM-4.5-Air-HQ4_K \
--model ~/Env/models/anikifoss/GLM-4.5-Air-HQ4_K/GLM-4.5-Air-HQ4_K-00001-of-00002.gguf \
--temp 0.5 --top-k 0 --top-p 1.0 --min-p 0.1 \
--ctx-size 38000 \
-ctk f16 -ctv f16 \
-fa on \
-b 1024 -ub 1024 \
-ngl 99 \
--device CUDA0,ROCm0,ROCm1,ROCm2,ROCm3 \
--tensor-split 1,0,0,0,0 \
-ot "blk\.([0-9])\.attn_.*=CUDA0" \
-ot "blk\.([1-4][0-9])\.attn_.*=CUDA0" \
-ot "blk\.0\.ffn_.*=CUDA0" \
-ot "blk\.([1-9])\.ffn_.*_exps.*=ROCm0" \
-ot "blk\.(1[0-9])\.ffn_.*_exps.*=ROCm1" \
-ot "blk\.(2[0-9])\.ffn_.*_exps.*=ROCm2" \
-ot "blk\.(3[0-9])\.ffn_.*_exps.*=ROCm3" \
-ot "blk\.(4[0-9])\.ffn_.*_exps.*=ROCm3" \
-ot "blk\.46\.nextn.*=CPU" \
--jinja \
--parallel 1 \
--threads 32 \
--host 127.0.0.1 \
--port 8090
prompt eval time = 6565.97 ms / 2840 tokens ( 2.31 ms per token, 432.53 tokens per second)
eval time = 30444.09 ms / 913 tokens ( 33.35 ms per token, 29.99 tokens per second)
total time = 37010.06 ms / 3753 tokens
prompt eval time = 87546.13 ms / 36349 tokens ( 2.41 ms per token, 415.20 tokens per second)
eval time = 40974.20 ms / 1034 tokens ( 39.63 ms per token, 25.24 tokens per second)
total time = 128520.33 ms / 37383 tokens
I was just about to write that it looks good (DeepSeek), but with 16xMI50 we are talking about an investment exceeding an EPYC + 1 TB RAM build, which has similar performance and, one could argue, might be the more practical build. But the GLM numbers look really nice!
I would definitely love to hear more as you progress with this experiment.
Interestingly Q8_0 (@unsloth) is not that much slower.
But yes... it is not exactly the 30 t/s you might get when fully offloaded 😀:
RTX 5090
./llama-sweep-bench \
--model "$MODEL_PATH" \
--no-mmap \
-fa -fmoe \
-b 8192 -ub 8192 \
-ctk q8_0 -ctv q8_0 -c 32768 \
-ngl 999 -ot exps=CPU \
--threads 16 \
--threads-batch 28 \
--warmup-batch
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 8192 | 2048 | 0 | 13.324 | 614.83 | 143.697 | 14.25 |
| 8192 | 2048 | 8192 | 14.941 | 548.31 | 152.929 | 13.39 |
| 8192 | 2048 | 16384 | 16.703 | 490.45 | 160.305 | 12.78 |
| 8192 | 2048 | 24576 | 18.683 | 438.48 | 172.090 | 11.90 |
RTX 6000
./llama-sweep-bench \
--model "$MODEL_PATH" \
--no-mmap \
-fa -fmoe \
-b 16384 -ub 16384 \
-ctk f16 -ctv f16 -c 163840 \
-ngl 999 -ot exps=CPU \
--threads 16 \
--threads-batch 28 \
--warmup-batch
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 16384 | 4096 | 0 | 18.825 | 870.31 | 287.942 | 14.23 |
| 16384 | 4096 | 16384 | 25.074 | 653.43 | 295.760 | 13.85 |
| 16384 | 4096 | 32768 | 34.702 | 472.14 | 336.472 | 12.17 |
> Interestingly Q8_0 (@unsloth) is not that much slower.
Which model? The 30 tokens/sec is GLM-4.5-Air, not the full GLM.
Ah, okay. The Q8_0 (@unsloth) numbers above are for GLM-4.6.