What are the best parameters for Strix Halo?
Wondering what you would recommend for Strix Halo 128GB. IQ4_XS? What command settings would maximize context?
Good question: are you compiling with the Vulkan backend, or how are you running it? If you're using ik_llama.cpp, keep in mind you'll still want to use a "mainline" quant recipe for any tensors running on the Vulkan backend.
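For reference, a Vulkan-enabled build of mainline usually looks roughly like the sketch below (just a minimal example assuming the Vulkan SDK/headers are installed; ik_llama.cpp is a fork, so double-check its README for the equivalent build switch on your platform):

```bash
# minimal sketch of a Vulkan build of mainline llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
```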
It's probably gonna be a bit tight in 128GB. You can compress the kv-cache, and in general you can compress k more than v. Here are a couple options (example command below):
- `-ctk q8_0 -ctv q8_0` - will work on mainline or ik on any backends
- `-khad -ctk q6_0 -ctv q8_0` - will only work on ik, and will save a little more with minimal quality loss.
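Putting that together, a hypothetical server command on a 128GB box might look something like this sketch; the model path, context size, and layer count are placeholders to tune for your setup, and quantized V cache generally needs flash attention enabled (add `-fa`, or `-fa on` on newer mainline builds, if your build doesn't turn it on by default):

```bash
# rough sketch, not a tuned config; adjust the path, context, and offload for your model
./build/bin/llama-server \
  -m /models/your-model-IQ4_XS.gguf \
  -c 65536 \
  -ngl 99 \
  -ctk q8_0 -ctv q8_0
```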
Check out some of the other discussions too; I know some Mac people and other Vulkan people are asking basically the same question.
Finally, try it out with llama-sweep-bench to see which quant works best for you.
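For example, assuming llama-sweep-bench on your build takes the usual common flags (check `--help`), you could run the same sweep with each KV-cache setting above and compare prefill/generation speeds as the context fills:

```bash
# hedged sketch: A/B the two KV-cache options at the same context length
./build/bin/llama-sweep-bench -m /models/your-model-IQ4_XS.gguf -c 32768 -ngl 99 -ctk q8_0 -ctv q8_0
./build/bin/llama-sweep-bench -m /models/your-model-IQ4_XS.gguf -c 32768 -ngl 99 -khad -ctk q6_0 -ctv q8_0
```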
While this specific one is a hybrid CPU+CUDA discussion, there are some tips that might apply to you as well, plus an example of A/B testing speed: https://huggingface.co/ubergarm/MiniMax-M2.5-GGUF/discussions/9