Small report (IQ4_XS) & question: IQ4_XS or smol-IQ4_KSS
First, thank you for the wonderful quants. Using them all the time.
I've been using MiniMax-M2.1 as my daily coding driver for quite some time (unsloth UD-Q3_K_XL quant) and was excited to see the M2.5 upgrade dropping... Yesterday I pulled IQ4_XS immediately after you pushed them. May have been the first downloader...
I'm using a system with 4x NVIDIA A40, an AMD EPYC 9334 32-core, and 1.5 TB RAM:
Command line
~/ik_llama.cpp/build/bin/llama-server \
--alias MiniMax-M2.5 \
--model ~/models/ubergarm-MiniMax-M2.5-GGUF/IQ4_XS/MiniMax-M2.5-IQ4_XS-00001-of-00004.gguf \
--ctx-size 106496 \
--threads 28 \
--threads-batch 32 \
--grouped-expert-routing \
--split-mode-graph-scheduling \
--split-mode graph \
--max-extra-alloc 256 \
--n-gpu-layers 99 \
--n-cpu-moe 34 \
--tensor-split 10,12,12,12 \
--ubatch-size 4096 \
--batch-size 4096 \
--parallel 1 \
--host 127.0.0.1 \
--port 15647 \
--no-mmap \
--cache-type-k f16 \
--cache-type-v f16 \
--temp 1.0 \
--top-p 0.95 \
--top-k 40 \
--no-display-prompt \
--jinja
Always trying to juggle bpw, context size, KV quant, etc. to accommodate at least 100k of context. But yeah, as I only have 100 GB of free VRAM, I sadly need to offload... You may wonder about the f16 cache type. I'm under the impression that low KV quants are bad, especially for longer-context runs. So if I can, I try not to quantize the cache...
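For a rough sense of the trade-off: with plain GQA-style attention the f16 KV cache scales roughly as sketched below. (Back-of-the-envelope only; the layer/head/dim numbers are placeholders, not M2.5's real config, which I'd read out of the GGUF metadata.)

```bash
# Generic f16 KV-cache estimate: 2 (K and V) * 2 bytes (f16) per element
# * n_layer * n_kv_head * head_dim * n_ctx.
# NOTE: placeholder model dims, NOT MiniMax-M2.5's actual values.
N_LAYER=60; N_KV_HEAD=8; HEAD_DIM=128; N_CTX=106496
echo "$(( 2 * 2 * N_LAYER * N_KV_HEAD * HEAD_DIM * N_CTX / 1024 / 1024 )) MiB"
# -> 24960 MiB with these made-up dims, i.e. a hefty chunk of the free VRAM
```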
Getting around PP 450, TG 28 T/s, which is ok-ish for coding tasks. Using OpenCode as the harness plus OpenSpec with some customizations. And MiniMax-M2.5-IQ4_XS has been very solid for me after throwing long-context, real coding tasks at it. Good tool calling, pretty good quality code.
So, all in all, it has been a decent upgrade from M2.1, and I'm currently using it as my primary coding model.
IQ4_XS vs. smol-IQ4_KSS
I've seen you recently added an ik_llama.cpp-specific quant in the same size category: smol-IQ4_KSS.
I wonder if I should use that instead. I can see the technical differences but am too stupid with quants to make an educated decision...
| Tensor | IQ4_XS | smol-IQ4_KSS |
|---|---|---|
| token_embd.weight | q4_K | iq4_k |
| output.weight | q6_K | iq6_k |
| ffn_down_exps | iq4_xs | iq4_kss |
| ffn_(gate|up)_exps | iq4_xs | iq4_kss |
Info on that stuff is kind of sparse and often very technical... Can you shed some light? Thanks!
Wow thanks for the very thoughtful and detailed report, I'm impressed!
So you have 4x A40s, which are sm86 arch with 48 GB VRAM each, but you only have 100 GB VRAM free? (Guessing you keep other models loaded as well or something?)
You have plenty of DRAM and a solid CPU, so a very good rig for local AI you have there!
Command Line
Honestly, that is a very clean and solid looking command! And since you can create llama-sweep-bench plots you can definitely dial in as much as you'd like and observe clearly how it is performing.
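For reference, a sweep-bench run mirroring your server flags is roughly where I'd start (a sketch only; llama-sweep-bench shares most of the common options, but double-check the exact flags against your build):

```bash
# Measure PP/TG as the KV cache fills, using (roughly) the same offload
# layout as the server command above. Flag set is a guess, adjust as needed.
~/ik_llama.cpp/build/bin/llama-sweep-bench \
  -m ~/models/ubergarm-MiniMax-M2.5-GGUF/IQ4_XS/MiniMax-M2.5-IQ4_XS-00001-of-00004.gguf \
  -c 106496 -ub 4096 -b 4096 \
  -ngl 99 --n-cpu-moe 34 -ts 10,12,12,12 -sm graph \
  -t 32
```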
A few possible things to try:
- If you can free up some VRAM, you could use all 4x GPUs, given `-sm graph` works quite well for that (and it also works for hybrid CPU+GPU as you noticed).
- You could try quantizing the cache using `-khad -ctk q6_0 -ctv q8_0`, which should hold up okay without too much loss, but I wouldn't go lower than that. Details on khad here: https://github.com/ikawrakow/ik_llama.cpp/pull/1033 (it works on GPU too in more recent PRs). The tl;dr is that khad can improve the quality of a quantized K cache, so you can go a bit smaller on it. Avoid quantizing the V cache as much, though. (A sketch of how this could slot into your command follows this list.)
- Try the new speculative decoding stuff from here: https://github.com/ikawrakow/ik_llama.cpp/pull/1261
- Experiment with a different, slimmer coding harness, e.g. oh-my-pi (I still haven't tried it myself, haha).
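For concreteness, the cache-quantization idea would look roughly like this dropped into your launch command (sketch only; it swaps the f16 cache flags for `-khad` plus quantized K/V and trims the sampler/misc flags for brevity, keep yours as they are):

```bash
# Same model and offload layout as your command, but with -khad and a
# quantized K/V cache to claw back some VRAM for context or GPU layers.
~/ik_llama.cpp/build/bin/llama-server \
  --alias MiniMax-M2.5 \
  --model ~/models/ubergarm-MiniMax-M2.5-GGUF/IQ4_XS/MiniMax-M2.5-IQ4_XS-00001-of-00004.gguf \
  --ctx-size 106496 \
  --n-gpu-layers 99 --n-cpu-moe 34 --tensor-split 10,12,12,12 \
  --split-mode graph --split-mode-graph-scheduling --grouped-expert-routing \
  --ubatch-size 4096 --batch-size 4096 \
  -khad -ctk q6_0 -ctv q8_0 \
  --jinja --host 127.0.0.1 --port 15647
```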
I wonder if I should use that instead.
Generally, try to get the lowest-perplexity quant that fits in your RAM+VRAM and runs fast enough for your workload at the desired context depth. Given you have CUDA, you can use the smol-IQ4_KSS, and as you can see it performs pretty well. To confuse things more, there is a new IQ4_NL which has surprisingly low perplexity. I have not checked the KLD to confirm it is actually "better" in terms of less deviation from the full-size model, but it's probably worth more research.
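If you'd rather measure than guess, a quick perplexity run over the same text file for both quants gives a decent signal (a sketch; the smol-IQ4_KSS filename/shard count is a guess based on your IQ4_XS layout, and point -f at whatever test corpus you normally use, e.g. a wiki.test.raw):

```bash
# Compare perplexity of the two quants on the same corpus.
# Paths/filenames below are guesses -- adjust to what you actually downloaded.
for Q in IQ4_XS smol-IQ4_KSS; do
  ~/ik_llama.cpp/build/bin/llama-perplexity \
    -m ~/models/ubergarm-MiniMax-M2.5-GGUF/$Q/MiniMax-M2.5-$Q-00001-of-00004.gguf \
    -f ~/data/wiki.test.raw \
    -c 512 -ngl 99 --n-cpu-moe 34 -ts 10,12,12,12 --threads 32
done
```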
Anyway, you're doing great! Keep us posted on what you continue to discover especially with actual agentic coding results! Cheers!
Oh, there may be some wins tuning prompt-caching stuff like `--cache-ram XXXX` or other things; I need to learn more about how this could help out with agentic use as well.
- Try the new speculative decoding stuff
Yeah, saw that! So much stuff to try... and more knobs to adjust. Love it...
I'm unsure about good parameters for coding, currently using:
--spec-type ngram-map-k4v \
--spec-ngram-size-n 6 \
--spec-ngram-size-m 4 \
--draft-min 1 \
--draft-max 8 \
--draft-p-min 0.2 \
Not sure though if it's really faster, doesn't seem to slow it down at least...
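To actually check, I might do something like this once per configuration and compare the numbers (assuming the server's native /completion endpoint still returns a timings block the way mainline llama.cpp's server does; if not, the timings in the server log work too):

```bash
# Send the same prompt to the server started with and without the --spec-*
# flags, then read back the self-reported generation speed.
# The /completion endpoint and "timings" field are assumptions borrowed
# from mainline llama.cpp's server.
curl -s http://127.0.0.1:15647/completion \
  -d '{"prompt": "Write a bash function that reverses a string.", "n_predict": 256}' \
  | jq '.timings.predicted_per_second'
```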
- Experiment with a different, slimmer coding harness, e.g. oh-my-pi (I still haven't tried it myself, haha).
Very interesting stuff indeed. I actually tried omp, but I guess I'm too locked in with OpenCode already, which works best for me all things considered...
omp stops execution in the middle of tasks, and file editing fails often despite their hashline thingy. And I'm too lazy/busy to start debugging omp code...
OpenCode issues regarding more robust file editing:
- [Tracking] Edit tool reliability: "modified since last read" errors (Undo/Redo & Persistence)
- [FEATURE]: Add a new experimental "hashline" edit mode (this one explicitly references omp)
there is a new IQ4_NL
Yep, already glancing at it. Just decided to wait a bit though as it's another hefty 130 GB to add to the download queue...
Not sure though if it's really faster, doesn't seem to slow it down at least...
That was my experience, haha... I'm not sure exactly what kinds of workloads would benefit, e.g. repetitively editing JSON data or something? I wish OpenCode had a way to track PP and TG speeds live and over time as context grows or something.
Yep, already glancing at it. Just decided to wait a bit though as it's another hefty 130 GB to add to the download queue...
If you want even more to download, I'm working on a big boi: https://huggingface.co/ubergarm/GLM-5-GGUF
It would be slower, but might be good as a first pass model to initialize the project, then let the smaller models do refactors after context gets big?
I haven't uploaded everything yet, still finishing things up.