What are the best parameters for Strix Halo?

#8
by datayoda - opened

Wondering what you would recommend for Strix Halo with 128 GB. IQ4_XS? What settings for the command to maximize context?

@datayoda

Good question: are you compiling with the Vulkan backend, or how are you running it? If you're using ik_llama.cpp, keep in mind you'll still want to use a "mainline" quant recipe for any tensors running on the Vulkan backend.
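For reference, a typical Vulkan build of mainline llama.cpp looks something like this (a sketch, assuming you have the Vulkan SDK installed; adjust paths and build options for your setup):

```bash
# Build mainline llama.cpp with the Vulkan backend
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j
```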

It's probably going to be a bit tight in 128 GB. You can quantize the KV cache, and in general you can compress K more aggressively than V. Here are a couple of options (an example command follows the list):

  • Works on mainline or ik, on any backend: -ctk q8_0 -ctv q8_0
  • Works only on ik: -khad -ctk q6_0 -ctv q8_0, which saves a little more memory with minimal quality loss.
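As a rough sketch, those KV-cache flags slot into a normal serve command like this. The model path, context size, and -ngl value are placeholders you would tune for your own machine, and this assumes a mainline llama.cpp build:

```bash
# Example: mainline llama.cpp with a q8_0 quantized KV cache and a large context.
# Quantizing the V cache generally also needs flash attention enabled
# (-fa / --flash-attn on recent builds); lower -c if memory gets tight.
./build/bin/llama-server \
  -m /path/to/model-IQ4_XS.gguf \
  -c 65536 -ngl 99 -fa \
  -ctk q8_0 -ctv q8_0
```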

Check out some of the other discussions too; I know some Mac people and other Vulkan people are asking basically the same question.

Finally, try it out with llama-sweep-bench to see which quant works best for you.
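A sweep-bench run might look something like the sketch below (assuming an ik_llama.cpp build; the model path and values are placeholders). Run it once per candidate quant or setting and compare the prefill/generation numbers as the context fills:

```bash
# Benchmark a candidate quant across the context window with ik_llama.cpp
./build/bin/llama-sweep-bench \
  -m /path/to/model-IQ4_XS.gguf \
  -c 32768 -ngl 99 -fa \
  -ctk q8_0 -ctv q8_0
```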

While this one is specifically a hybrid CPU+CUDA discussion, there are some tips that might apply to you as well, plus an example of A/B testing speed: https://huggingface.co/ubergarm/MiniMax-M2.5-GGUF/discussions/9
