Performance report with 72GB VRAM: 32 t/s

#14
by SlavikF

System:

  • Nvidia RTX 4090D 48GB
  • Nvidia RTX 3090 24GB
  • Intel Xeon W5-3425 12 cores
  • 256GB DDR5-4800 (8 channels)
  • Ubuntu 24

Docker compose:

services:
  step35:
    image: ghcr.io/ggml-org/llama.cpp:server-cuda12-b7964
    container_name: step35
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    ports:
      - "8080:8080"
    volumes:
      - /home/slavik/.cache/llama.cpp/router/local-step35-200b:/root/.cache/llama.cpp
    entrypoint: ["./llama-server"]
    command: >
      --model  /root/.cache/llama.cpp/stepfun-ai_Step-3.5-Flash-Int4_step3p5_flash_Q4_K_S-00001-of-00012.gguf
      --alias local-step35-200b
      --chat-template-file "/root/.cache/llama.cpp/chat_template.jinja" 
      --host 0.0.0.0  --port 8080
      --ctx-size 131072
      --parallel 2
      --temp 1.0
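
Once the container is up, the server can be queried through llama-server's OpenAI-compatible endpoint on the mapped port. A minimal sketch (the model field matches the --alias above; the prompt and max_tokens value are only illustrative):

# sketch: prompt and max_tokens are illustrative; "model" matches --alias in the compose file
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "local-step35-200b",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 128
      }'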

A few log lines:

ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090 D, compute capability 8.9, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from /app/libggml-cuda.so
load_backend: loaded CPU backend from /app/libggml-cpu-sapphirerapids.so

build: 7964 (b83111815) with GNU 11.4.0 for Linux x86_64

system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090 D) (0000:00:10.0) - 48150 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:00:11.0) - 23859 MiB free

print_info: file type   = Q4_K - Small
print_info: file size   = 103.84 GiB (4.53 BPW) 

load_tensors:        CUDA0 model buffer size = 42404.16 MiB
load_tensors:        CUDA1 model buffer size = 18271.29 MiB

llama_context:  CUDA_Host  output buffer size =     0.98 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 65536 cells
llama_kv_cache:      CUDA0 KV buffer size =  2560.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =  3584.00 MiB
llama_kv_cache: size = 6144.00 MiB ( 65536 cells,  12 layers,  2/2 seqs), K (f16): 3072.00 MiB, V (f16): 3072.00 MiB
llama_kv_cache_iswa: creating     SWA KV cache, size = 1024 cells
llama_kv_cache:      CUDA0 KV buffer size =   120.00 MiB
llama_kv_cache:      CUDA1 KV buffer size =   144.00 MiB
llama_kv_cache: size =  264.00 MiB (  1024 cells,  33 layers,  2/2 seqs), K (f16):  132.00 MiB, V (f16):  132.00 MiB
sched_reserve: reserving ...
sched_reserve: Flash Attention was auto, set to enabled
sched_reserve:      CUDA0 compute buffer size =  1628.56 MiB
sched_reserve:      CUDA1 compute buffer size =   275.75 MiB
sched_reserve:  CUDA_Host compute buffer size =   146.02 MiB

srv  params_from_: Chat format: Hermes 2 Pro

prompt eval time =   36456.24 ms /  4583 tokens (    7.95 ms per token,   125.71 tokens per second)
       eval time =  275287.06 ms /  8922 tokens (   30.85 ms per token,    32.41 tokens per second)
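
As a sanity check, the reported rates follow directly from the token counts and wall times in those two lines; a quick recomputation (illustrative only, values copied from the log above):

awk 'BEGIN {
  printf "prompt: %.2f t/s\n", 4583 / 36.45624;   # 36456.24 ms for 4583 tokens
  printf "eval:   %.2f t/s\n", 8922 / 275.28706;  # 275287.06 ms for 8922 tokens
}'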

The smol-IQ3_KS quant is a bit smaller, and using ik_llama.cpp with -sm graph (on 2 or more GPUs) can help speeds even when doing hybrid CPU inference. Here is full offload on 2x A6000 (older sm86 arch, not the new Blackwell ones):

[sweep-bench plot: Step-3.5-Flash]
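
For reference, a launch along those lines might look roughly like the following. This is only a sketch: the model path/filename and context size are placeholders, -sm graph is taken from the comment above, and the remaining flags are the standard llama.cpp server options the fork shares:

# sketch only: model path/filename and context size are placeholders;
# -sm graph is the ik_llama.cpp split mode mentioned above
./llama-server \
  --model /models/Step-3.5-Flash-smol-IQ3_KS.gguf \
  -ngl 99 \
  -sm graph \
  --ctx-size 32768 \
  --host 0.0.0.0 --port 8080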
