Q3_K_XL - Results with: 1x RX 7900 XTX 24GB VRAM // Ryzen 7 9700X, 96GB RAM
Hardware:
GPU: 1x AMD Radeon RX 7900 XTX (24GB)
CPU: Ryzen 7 9700X
RAM: 96GB (2x48GB) DDR5 @ 6000 MT/s
Performance:
"timings":{"cache_n":0,"prompt_n":76,"prompt_ms":2060.365,"prompt_per_token_ms":27.11006578947368,"prompt_per_second":36.886668138897726,"predicted_n":512,"predicted_ms":42042.831,"predicted_per_token_ms":82.114904296875,"predicted_per_second":12.178057181734504}}
Small prompt (76 tokens):
Prompt Eval: ~36.9 t/s
Eval (Generation): ~12.2 t/s
"timings":
{"cache_n":0,"prompt_n":2976,"prompt_ms":25355.593,"prompt_per_token_ms":8.520024529569893,"prompt_per_second":117.37055410220538,"predicted_n":512,"predicted_ms":46464.055,"predicted_per_token_ms":90.750107421875,"predicted_per_second":11.019270702912175}}
Longer prompt (2976 tokens):
Prompt Eval: ~117.37 t/s
Eval (Generation): ~11.0 t/s
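For reference, these timing blocks are taken straight from the server's JSON response. A minimal way to reproduce them (assuming the default llama-server port 8080 and that jq is installed):

# POST a completion request and print only the timings object from the response
curl -s http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a short story about a robot.", "n_predict": 512}' \
  | jq .timings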
Setup:
I'm running the llama.cpp server via the ROCm Docker image: ghcr.io/ggml-org/llama.cpp:server-rocm
The 48k context with q8_0 KV cache takes up about 7.5GB VRAM, leaving roughly 15GB VRAM for the model layers.
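As a rough sanity check on that figure, the usual back-of-the-envelope for llama.cpp's KV cache is below. The layer count, KV-head count and head dimension are placeholders (I don't know this model's exact values, and MLA/SWA-style caches behave differently), so treat it as a sketch rather than the source of the ~7.5GB number:

# KV cache bytes ≈ 2 (K and V) x n_layer x n_ctx x n_kv_heads x head_dim x bytes_per_element
# q8_0 stores roughly 8.5 bits per element ≈ 1.0625 bytes
awk 'BEGIN { n_layer=62; n_ctx=49152; n_kv_heads=8; head_dim=128; bpe=1.0625;
             printf "%.1f GB\n", 2*n_layer*n_ctx*n_kv_heads*head_dim*bpe/1e9 }'

With those placeholder values this prints about 6.6GB; compute buffers add overhead on top, so plug in your model's actual hyperparameters before trusting the estimate.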
Llama.cpp configuration:
"-ot", "blk.([8-9]|[1-5][0-9]|6[0-1]).ffn_.*_exps.weight=CPU",
Memory & Context
"--no-mmap",
"--flash-attn", "on",
"-c", "49152",
"--cache-type-k", "q8_0",
"--cache-type-v", "q8_0"
Why 48k context:
48k is the amount of context I comfortably need for the variety of tasks I run through the model, which is why I chose it. This configuration also leaves some VRAM and RAM headroom for system overhead, so it doesn't go OOM.
Please let me know if you were able to squeeze out more performance with a similar hardware setup. Comment with your config.