What are the best parameters for Strix Halo?
Wondering what you would recommend for Strix Halo 128GB. IQ4_XS? What command settings would maximize context?
Good question: are you compiling with the Vulkan backend, or how are you running it? If you're using ik_llama.cpp, keep in mind you'll still want to use a "mainline" quant recipe for any tensors running on the Vulkan backend.
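For reference, a Vulkan-enabled build of mainline usually looks roughly like the sketch below (just a minimal example assuming the Vulkan SDK/headers are installed; ik_llama.cpp is a fork, so double-check its README for the equivalent build switch on your platform):

```bash
# minimal sketch of a Vulkan build of mainline llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
```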
It's probably gonna be a bit tight in 128GB. You can compress the kv-cache, and in general you can compress k more than v. Here are a couple options (example command below):
- `-ctk q8_0 -ctv q8_0` - will work on mainline or ik on any backends
- `-khad -ctk q6_0 -ctv q8_0` - will only work on ik, and will save a little more with minimal quality loss.
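Putting that together, a hypothetical server command on a 128GB box might look something like this sketch; the model path, context size, and layer count are placeholders to tune for your setup, and quantized V cache generally needs flash attention enabled (add `-fa`, or `-fa on` on newer mainline builds, if your build doesn't turn it on by default):

```bash
# rough sketch, not a tuned config; adjust the path, context, and offload for your model
./build/bin/llama-server \
  -m /models/your-model-IQ4_XS.gguf \
  -c 65536 \
  -ngl 99 \
  -ctk q8_0 -ctv q8_0
```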
Check out some of the other discussions too; I know some Mac people and other Vulkan people are asking basically the same question.
Finally, try it out with llama-sweep-bench to see which quant works best for you.
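For example, assuming llama-sweep-bench on your build takes the usual common flags (check `--help`), you could run the same sweep with each KV-cache setting above and compare prefill/generation speeds as the context fills:

```bash
# hedged sketch: A/B the two KV-cache options at the same context length
./build/bin/llama-sweep-bench -m /models/your-model-IQ4_XS.gguf -c 32768 -ngl 99 -ctk q8_0 -ctv q8_0
./build/bin/llama-sweep-bench -m /models/your-model-IQ4_XS.gguf -c 32768 -ngl 99 -khad -ctk q6_0 -ctv q8_0
```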
While this specific one is a hybrid CPU+CUDA discussion, there are some tips that might apply to you as well, plus an example of A/B testing speed: https://huggingface.co/ubergarm/MiniMax-M2.5-GGUF/discussions/9