Performance report with 72GB VRAM: 32 t/s
#14 · opened by SlavikF
System:
- Nvidia RTX 4090D 48GB
- Nvidia RTX 3090 24GB
- Intel Xeon W5-3425 12 cores
- 256GB DDR5-4800 (8 channels)
- Ubuntu 24
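For context, 8 channels of DDR5-4800 works out to a theoretical peak of 8 × 4800 MT/s × 8 bytes ≈ 307 GB/s of system memory bandwidth. That matters here, since the 103.84 GiB model file cannot fit entirely in the 72 GB of combined VRAM.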
Docker compose:

```yaml
services:
  step35:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda12-b7964
    container_name: step35
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    ports:
      - "8080:8080"
    volumes:
      - /home/slavik/.cache/llama.cpp/router/local-step35-200b:/root/.cache/llama.cpp
    entrypoint: ["./llama-server"]
    command: >
      --model /root/.cache/llama.cpp/stepfun-ai_Step-3.5-Flash-Int4_step3p5_flash_Q4_K_S-00001-of-00012.gguf
      --alias local-step35-200b
      --chat-template-file "/root/.cache/llama.cpp/chat_template.jinja"
      --host 0.0.0.0 --port 8080
      --ctx-size 131072
      --parallel 2
      --temp 1.0
```
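With that file saved as `docker-compose.yml`, a minimal smoke test could look like this (just a sketch: `llama-server` exposes an OpenAI-compatible API, the alias set above doubles as the model name, and the prompt is a placeholder):

```sh
# Start the server container defined above
docker compose up -d step35

# Send a test request to llama-server's OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "local-step35-200b",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```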
A few log lines:

```
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090 D, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /app/libggml-cuda.so
load_backend: loaded CPU backend from /app/libggml-cpu-sapphirerapids.so
build: 7964 (b83111815) with GNU 11.4.0 for Linux x86_64
system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090 D) (0000:00:10.0) - 48150 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:00:11.0) - 23859 MiB free
print_info: file type = Q4_K - Small
print_info: file size = 103.84 GiB (4.53 BPW)
load_tensors: CUDA0 model buffer size = 42404.16 MiB
load_tensors: CUDA1 model buffer size = 18271.29 MiB
llama_context: CUDA_Host output buffer size = 0.98 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 65536 cells
llama_kv_cache: CUDA0 KV buffer size = 2560.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 3584.00 MiB
llama_kv_cache: size = 6144.00 MiB ( 65536 cells, 12 layers, 2/2 seqs), K (f16): 3072.00 MiB, V (f16): 3072.00 MiB
llama_kv_cache_iswa: creating SWA KV cache, size = 1024 cells
llama_kv_cache: CUDA0 KV buffer size = 120.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 144.00 MiB
llama_kv_cache: size = 264.00 MiB ( 1024 cells, 33 layers, 2/2 seqs), K (f16): 132.00 MiB, V (f16): 132.00 MiB
sched_reserve: reserving ...
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve: CUDA0 compute buffer size = 1628.56 MiB
sched_reserve: CUDA1 compute buffer size = 275.75 MiB
sched_reserve: CUDA_Host compute buffer size = 146.02 MiB
srv params_from_: Chat format: Hermes 2 Pro
prompt eval time = 36456.24 ms / 4583 tokens ( 7.95 ms per token, 125.71 tokens per second)
eval time = 275287.06 ms / 8922 tokens ( 30.85 ms per token, 32.41 tokens per second)
```
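Summing the reported buffers: 42404 + 18271 MiB ≈ 59.3 GiB of weights on the two GPUs, 6144 + 264 MiB ≈ 6.3 GiB of KV cache, and roughly 1.9 GiB of compute buffers, for a total of about 67.4 GiB against the ~70.3 GiB of free VRAM reported at load. Since the file itself is 103.84 GiB, the remaining ~44 GiB of weights presumably lives in system RAM (the excerpt above omits the host model buffer line), which would explain a decode speed of 32 t/s rather than a pure-VRAM rate.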