Q3_K_XL - Results with: 1x 7900 XTX 24GB VRAM // Ryzen 7 9700X 96GB

#2
by flaviocb - opened

Hardware:

GPU: 1x AMD Radeon RX 7900 XTX (24GB)
CPU: Ryzen 7 9700X
RAM: 96GB (2x48GB) DDR5 @ 6000 MT/s

Performance:

"timings":{"cache_n":0,"prompt_n":76,"prompt_ms":2060.365,"prompt_per_token_ms":27.11006578947368,"prompt_per_second":36.886668138897726,"predicted_n":512,"predicted_ms":42042.831,"predicted_per_token_ms":82.114904296875,"predicted_per_second":12.178057181734504}}

Small prompt (76 tokens):
Prompt Eval: ~36.9 t/s
Eval (Generation): ~12.2 t/s

"timings":
{"cache_n":0,"prompt_n":2976,"prompt_ms":25355.593,"prompt_per_token_ms":8.520024529569893,"prompt_per_second":117.37055410220538,"predicted_n":512,"predicted_ms":46464.055,"predicted_per_token_ms":90.750107421875,"predicted_per_second":11.019270702912175}}

Longer prompt (2976 tokens):
Prompt Eval: ~117.37 t/s
Eval (Generation): ~11.0 t/s
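
The per-second figures above are just the token counts divided by the measured time; a minimal sketch of pulling them out of the server's timings object (using the small-prompt numbers from above):

import json

# timings object from the small-prompt run above, trimmed to the relevant fields
timings = json.loads('{"prompt_n": 76, "prompt_ms": 2060.365, "predicted_n": 512, "predicted_ms": 42042.831}')

prompt_tps = timings["prompt_n"] / timings["prompt_ms"] * 1000          # ~36.9 t/s
generate_tps = timings["predicted_n"] / timings["predicted_ms"] * 1000  # ~12.2 t/s
print(f"prompt eval: {prompt_tps:.1f} t/s, generation: {generate_tps:.1f} t/s")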

Setup:

I'm using the llama.cpp server ROCm Docker image: ghcr.io/ggml-org/llama.cpp:server-rocm
The 48k context with a q8_0 KV cache takes up about 7.5GB of VRAM, leaving roughly 15GB for the model layers.
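
As a rough sanity check on that number: the KV cache grows as 2 (K and V) x layers x KV heads x head dim x context length x bytes per element, and q8_0 stores 32 values in 34 bytes (~1.06 bytes per element). A minimal estimator; the layer/head/dim values below are placeholders, not this model's actual architecture:

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=34/32):
    # K and V caches; q8_0 packs 32 values into 34 bytes (~1.0625 bytes/element)
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

# hypothetical architecture values -- substitute the real numbers from the model's GGUF metadata
print(round(kv_cache_bytes(n_layers=62, n_kv_heads=8, head_dim=128, n_ctx=49152) / 2**30, 2), "GiB")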

Llama.cpp configuration:

"-ot", "blk.([8-9]|[1-5][0-9]|6[0-1]).ffn_.*_exps.weight=CPU",

Memory & Context:

"--no-mmap",
"--flash-attn", "on",
"-c", "49152",
"--cache-type-k", "q8_0",
"--cache-type-v", "q8_0"

My context:

The 48k context is the amount I'm comfortable with for the range of tasks I use the model for, which is why I chose it. This configuration also leaves some VRAM and RAM headroom for system overhead, so it won't go OOM.

Please let me know if you were able to squeeze out more performance with a similar hardware setup. Comment with your config.
