Q3_K_XL - Results with: 1x RX 7900 XTX 24GB VRAM // Ryzen 7 9700X, 96GB RAM
Hardware:
GPU: 1x AMD Radeon RX 7900 XTX (24GB)
CPU: Ryzen 7 9700X
RAM: 96GB (2x48GB) DDR5 @ 6000 MT/s
Performance:
"timings":{"cache_n":0,"prompt_n":76,"prompt_ms":2060.365,"prompt_per_token_ms":27.11006578947368,"prompt_per_second":36.886668138897726,"predicted_n":512,"predicted_ms":42042.831,"predicted_per_token_ms":82.114904296875,"predicted_per_second":12.178057181734504}}
Small prompt (76 tokens):
Prompt Eval: ~36.9 t/s
Eval (Generation): ~12.2 t/s
"timings":
{"cache_n":0,"prompt_n":2976,"prompt_ms":25355.593,"prompt_per_token_ms":8.520024529569893,"prompt_per_second":117.37055410220538,"predicted_n":512,"predicted_ms":46464.055,"predicted_per_token_ms":90.750107421875,"predicted_per_second":11.019270702912175}}
Longer prompt (2976 tokens):
Prompt Eval: ~117.37 t/s
Eval (Generation): ~11.0 t/s
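For reference, these timing blocks are taken straight from the server's JSON response. A minimal way to reproduce them (assuming the default llama-server port 8080 and that jq is installed):

# POST a completion request and print only the timings object from the response
curl -s http://localhost:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a short story about a robot.", "n_predict": 512}' \
  | jq .timings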
Setup:
I'm running the llama.cpp server via the ROCm Docker image: ghcr.io/ggml-org/llama.cpp:server-rocm
The 48k context with q8_0 KV cache takes up about 7.5GB VRAM, leaving roughly 15GB VRAM for the model layers.
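As a rough sanity check on that figure, the usual back-of-the-envelope for llama.cpp's KV cache is below. The layer count, KV-head count and head dimension are placeholders (I don't know this model's exact values, and MLA/SWA-style caches behave differently), so treat it as a sketch rather than the source of the ~7.5GB number:

# KV cache bytes ≈ 2 (K and V) x n_layer x n_ctx x n_kv_heads x head_dim x bytes_per_element
# q8_0 stores roughly 8.5 bits per element ≈ 1.0625 bytes
awk 'BEGIN { n_layer=62; n_ctx=49152; n_kv_heads=8; head_dim=128; bpe=1.0625;
             printf "%.1f GB\n", 2*n_layer*n_ctx*n_kv_heads*head_dim*bpe/1e9 }'

With those placeholder values this prints about 6.6GB; compute buffers add overhead on top, so plug in your model's actual hyperparameters before trusting the estimate.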
Llama.cpp configuration:
"-ot", "blk.([8-9]|[1-5][0-9]|6[0-1]).ffn_.*_exps.weight=CPU",
Memory & Context
"--no-mmap",
"--flash-attn", "on",
"-c", "49152",
"--cache-type-k", "q8_0",
"--cache-type-v", "q8_0"
Why 48k context:
48k is the amount of context I comfortably need for the variety of tasks I run through the model, which is why I chose it. This configuration also leaves some VRAM and RAM headroom for system overhead, so it doesn't go OOM.
Please let me know if you were able to squeeze out more performance with a similar hardware setup. Comment with your config.