quant req: 256 GB RAM + 96 GB VRAM
This would be a perfect size :-)
So 356 GiB / 666 GiB gives us a target close to 4 bpw, which should still give good performance. Perhaps an IQ4_KSS would turn out about right.
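Rough arithmetic behind that target, assuming the 666 GiB figure is roughly the ~8 bpw Q8_0 reference size:

$$
\frac{356\ \text{GiB}}{666\ \text{GiB}} \times 8\ \text{bpw} \approx 4.3\ \text{bpw}
$$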
I'm redoing my bf16 safetensors -> bf16 GGUF currently based on some updated info from ik_llama.cpp here: https://github.com/ikawrakow/ik_llama.cpp/issues/651#issuecomment-3212864652
I'll re-upload the new imatrix made from this new GGUF and use it for the remainder of the quants, so hopefully I can now quantize attn_(k|v)_b for faster TG without losing much quality. I already released the IQ5_K though, given it uses Q8_0 for those tensors anyway; it is basically a max-quality quant, faster and smaller than the full Q8_0.
If anyone uses the IQ5_K, I noticed I needed to pass `llama-server --chat-template deepseek3 ...`, as the template seems to be auto-detected incorrectly, which was causing loops when hitting the /chat/completions endpoint. It works fine with the correct template, or if you're doing your own chat template with ST etc. against the /text/completions endpoint.
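For reference, a minimal launch might look something like this (the model path, context size, and offload flags are placeholders for your own setup; the important part is forcing the template):

```bash
# Force the deepseek3 chat template so /chat/completions doesn't loop on a
# mis-detected template (paths and flags below are illustrative).
./build/bin/llama-server \
    --model /models/DeepSeek-V3.1-IQ5_K-00001-of-00011.gguf \
    --chat-template deepseek3 \
    --ctx-size 32768 \
    --n-gpu-layers 99 \
    --host 127.0.0.1 --port 8080
```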
But yeah, should be a good size!
That sounds great! I just wanted to add that the 96 GB of VRAM is across two cards (A6000 & 48 GB 4090D), which adds a little overhead for CUDA graphs, plus some VRAM inefficiency due to the size of the layers.
In the past I've been using V3/R1 quants at Q3_K_M with good success. Not sure what the difference would be to an IQ4_KSS lol.
PS: my daily driver has been your DeepSeek-TNG-R1T2-Chimera-IQ3_KS quant. It's a great model!
Speaking of DeepSeek-TNG-R1T2-Chimera-IQ3_KS, I only just stumbled across the conversation here:
https://huggingface.co/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/discussions/2
I find it quite coincidental that I wasn't the only one who liked that model. I wonder if it was just good fortune that IQ3_KS turned out to be a sweet spot in terms of trading off size and quality.
Fast forwarding to V3.1, I wonder if there would be much difference between an IQ4_KSS and an IQ3_KS?
PS: There's a wealth of information there that I'll have to incorporate into my configuration...!
Yeah, I have the same size VRAM and system RAM; I'd be interested in an IQ4_KSS.
My biggest annoyance with these bigger models is the time it takes to load them into RAM! I guess a PCIe 5 NVMe is on my upgrade list lol
Excited to see the perplexity differences of these quants -
Fast forwarding to V3.1, I wonder if there would be much difference between an IQ4_KSS and an IQ3_KS?
Excited to see the perplexity differences of these quants -
Yeah, I'm slowly working through some quants now that I have this sorted out (had to re-make my bf16 GGUF) (pinging @anikifoss as this might be relevant if you're cooking new MLA quants too).
I'm expecting perplexity data to trickle in as I cook and test, and I'll have a graph up sometime this weekend.
My current biggest question is about some quants like IQ3_KS and the quantization tweaks in PR624, so I have to cook a quant twice, measure perplexity twice, and then release whichever is "better" haha...
I'll likely have a few options in this 3~4 bpw range with data so you can choose what fits best on your rig. I'm also trying to balance perplexity with TG speed. Definitely pull and rebuild ik_llama.cpp, as it got some more updates, especially for Zen 5 and CPUs with avx_vnni and real 512-bit instructions, for PP speed on CPU.
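If it helps, a typical pull-and-rebuild looks roughly like this (assuming a CUDA build; swap in whatever cmake flags you used originally):

```bash
# Update and rebuild ik_llama.cpp to pick up the recent CPU/PP speed improvements.
cd ik_llama.cpp
git pull
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)
```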
Okay, I've published the first perplexity graph and the IQ4_KSS is uploading now. The IQ3_KS is already there!
Will you make an IQ4_K? I was wondering if you have plans to make a 4.7-4.9 bpw quant, though I will try the closest match to this.
Downloading the iq4_kss now, I think it will just barely squeeze onto my system!
Also, that perplexity chart is very pretty, so linear lmao.
I'll update this comment once the download finishes with some performance numbers.
update 1: I'm going to have to settle for iq3, I just barely can't fit the iq4_kss, hitting an OOM error due to the 256 GB of system RAM I have... :c
```
llm_load_tensors: CPU buffer size = 252954.00 MiB
llm_load_tensors: CUDA_Host buffer size = 497.11 MiB
llm_load_tensors: CUDA0 buffer size = 19796.32 MiB
llm_load_tensors: CUDA1 buffer size = 19769.05 MiB
llm_load_tensors: CUDA2 buffer size = 19769.05 MiB
llm_load_tensors: CUDA3 buffer size = 20104.91 MiB
```
@ubergarm , how did you convert to BF16? I've tried with mainline but there are issues. And I've tried with https://github.com/evshiron/llama.cpp, which worked but your imatrix isn't compatible with the BF16 it produced.
how did you convert to BF16? I've tried with mainline but there are issues.
I used the "triton-cpu" method outlined here in this mainline lcpp issue to use the original deepseek casting script and mainline convert script e.g.:
deepseek-ai fp8 safetensors -> fp8_cast_bf16.py -> bf16 safetensors -> mainline convert_hf_to_gguf.py -> bf16 GGUF -> imatrix
your imatrix isn't compatible with the BF16 it produced.
Given you used the evshiron method, you'll end up with the attn_kv_b MLA tensor. I do have an imatrix for that too, as that is what I originally did. You can still download it from an earlier commit/upload in this repo here: https://huggingface.co/ubergarm/DeepSeek-V3.1-GGUF/blob/db827cc0e0f9c46db67f106840ac594294d6cd11/imatrix-DeepSeek-V3.1-Q8_0.dat
The reason I switched to the "mainline MLA style" for the first time is this comment/discussion: https://github.com/ikawrakow/ik_llama.cpp/issues/651#issuecomment-3212883303
The only real difference is that the evshiron/ik convert method will end up with attn_kv_b, and the ik_llama.cpp imatrix won't give data for the attn_(k|v)_b tensors, so you want to leave them at q8_0. The final file size will be a couple GB bigger too, as you will have to leave attn_kv_b at q8_0 as well, so you kind of have duplicated data.
If you use the cast+mainline convert, or the new direct fp8 safetensors -> bf16 GGUF PR from mainline (possibly with some changes; bartowski did this successfully) https://github.com/ggml-org/llama.cpp/pull/14810 , you will end up without attn_kv_b and you will have imatrix data for attn_(k|v)_b, but you'll probably want to keep them close to q8_0 anyway given they are so small and fairly sensitive.
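As a sketch of what that looks like in a recipe, assuming ik_llama.cpp's `llama-quantize` with its `--custom-q` per-tensor overrides (the file names, regexes, and base type here are illustrative, not my exact recipe):

```bash
# Illustrative only: pin the small, sensitive attn_(k|v)_b tensors to q8_0
# while the bulk of the model uses the lower-bpw base type.
./build/bin/llama-quantize \
    --imatrix imatrix-DeepSeek-V3.1.dat \
    --custom-q "blk\..*\.attn_k_b\.weight=q8_0,blk\..*\.attn_v_b\.weight=q8_0" \
    DeepSeek-V3.1-BF16-00001-of-00030.gguf \
    DeepSeek-V3.1-IQ4_KSS.gguf \
    IQ4_KSS
```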
hmu if you get stuck on any parts, and looking forward to what your systematic approach to quantizing shows for this model!
I'm going to have to settle for iq3, I just barely can't fit the iq4_kss, hitting an OOM error due to the 256 GB of system RAM I have... :c
Say no more, fam:

## IQ3_K 293.177 GiB (3.753 BPW)
This one is a bit different, keeping full q8_0 attn/shexp/first 3 dense layers which will likely slow down TG a little, but should give about the best possible perplexity for the size.
After I run perplexity clean with no NaNs and do a quick vibe check, I'll start uploading it!
Finally got a working quant running locally. Has anyone figured out how to turn on thinking with V3.1 and llama.cpp?
Has anyone figured out how to turn on thinking with V3.1 and llama.cpp?
Right, some folks asked over here and I've been wondering the best way to handle that too...
My impression is that, given where the model expects the `<think>` or `</think>` token to appear, it would require the client to use the /text/completion API endpoint of llama-server. The baked-in deepseek3 template can't handle it, pretty sure.
Here are the options:
- Use `llama-server` and your own client like SillyTavern, hit the `/text/completion` endpoint, and inject the token as desired yourself in your own client-side controlled chat template.
- Add code to `ik_llama.cpp`/`llama.cpp` with two new "chat templates", e.g. `deepseek3-think` and `deepseek3-nothink`, that turn thinking on or off to support the `/chat/completion/` endpoint.
Not super elegant, but the existing heuristics to detect the chat template from the jinja are not working great anymore given how big the chat templates have become, causing false positives and already requiring users to pass `--chat-template deepseek3`...
Any other ideas?? Right now I have a simple python client hitting the /chat/completion endpoint and not specifying anything, which seems to default to no thinking... ??
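For what it's worth, the first option looks roughly like this against the raw completion endpoint (the prompt string is a hand-rolled approximation of the V3.1 template, so treat it as a sketch rather than the exact format):

```bash
# Sketch: hit llama-server's /completion endpoint directly and append <think>
# ourselves so the model enters thinking mode (template string is approximate).
curl -s http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "<｜begin▁of▁sentence｜><｜User｜>Why is the sky blue?<｜Assistant｜><think>",
        "n_predict": 512
      }'
```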
Any other ideas?? Right now I have a simple python client hitting the /chat/completion endpoint and not specifying anything, which seems to default to no thinking... ??
Yeah, this is what I'm seeing as well.
@ubergarm , I finally found my stupid mistake. I had forgotten to copy the tokenizer.json file...
```bash
cp ~/AI/huggingface/DeepSeek-V3.1/tokenizer.json ~/AI/DeepSeek-V3.1-bf16-safetensors/
```
Full commands:
```bash
# Install dependencies
apt install python3-dev python3-pip python3-venv python3-wheel python3-setuptools git acl netcat-openbsd cmake git-lfs pipx build-essential ccache # python-is-python3 sudo netcat
apt-get install --no-install-recommends zlib1g-dev libxml2-dev libssl-dev libgmp-dev libmpfr-dev
# Prepare env
pipx install uv
pipx ensurepath
mkdir -p ~/AI/fp8-to-bf16
cd ~/AI/fp8-to-bf16
exec $SHELL
uv venv ./venv --python 3.12 --python-preference=only-managed
# Activate env
source venv/bin/activate
# More dependencies
uv pip install ninja cmake wheel setuptools pybind11
exec $SHELL
source venv/bin/activate
# Clone llama.cpp for DeepSeek-V3.1
git clone https://github.com/ggml-org/llama.cpp --recursive
cd llama.cpp
# Build llama.cpp
cd ~/AI/fp8-to-bf16/llama.cpp
uv pip install -r requirements/requirements-convert_hf_to_gguf.txt --prerelease=allow --index-strategy unsafe-best-match
cmake -B build -DGGML_AVX=ON -DGGML_AVX2=ON -DLLAMA_CURL=OFF
cmake --build build --config Release -j16
cd ..
# Build triton-cpu
git clone https://github.com/triton-lang/triton-cpu --recursive
cd triton-cpu
# Apply this patch - https://github.com/ikawrakow/ik_llama.cpp/issues/383#issuecomment-2865306085
nano -w CMakeLists.txt
---
# set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Werror -Wno-covered-switch-default -fvisibility=hidden")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-covered-switch-default -fvisibility=hidden")
---
nano -w third_party/cpu/CMakeLists.txt
---
#find_package(dnnl CONFIG)
#if (dnnl_FOUND)
#...
#endif()
---
# Install dependencies
uv pip install -r python/requirements.txt
apt-get update
apt-get install -y ccache
apt-get install -y --no-install-recommends \
zlib1g-dev \
libxml2-dev \
libssl-dev \
libgmp-dev \
libmpfr-dev
# Compile
MAX_JOBS=16 uv pip install -e python --no-build-isolation
# Be patient, "Preparing Packages" downloads a lot of stuff before build begins...
cd ..
# Download model
mkdir -p ~/AI/huggingface
cd ~/AI/huggingface
git lfs clone https://huggingface.co/deepseek-ai/DeepSeek-V3.1
# Additional requirements specific to DeepSeek-V3.1
cd ~/AI
git clone https://github.com/deepseek-ai/DeepSeek-V3.git
cd DeepSeek-V3/inference
nano -w fp8_cast_bf16.py
---
from: device="cuda" to: device="cpu"
---
cd ~/AI/fp8-to-bf16/llama.cpp
mkdir ~/AI/DeepSeek-V3.1-bf16-safetensors
python ~/AI/DeepSeek-V3/inference/fp8_cast_bf16.py \
--input-fp8-hf-path ~/AI/huggingface/DeepSeek-V3.1/ \
--output-bf16-hf-path ~/AI/DeepSeek-V3.1-bf16-safetensors/ 2>&1 | tee -a fp8_cast_bf16-DeepSeek-V3.1.log
cp ~/AI/huggingface/DeepSeek-V3.1/config.json ~/AI/DeepSeek-V3.1-bf16-safetensors/
cp ~/AI/huggingface/DeepSeek-V3.1/generation_config.json ~/AI/DeepSeek-V3.1-bf16-safetensors/
cp ~/AI/huggingface/DeepSeek-V3.1/tokenizer_config.json ~/AI/DeepSeek-V3.1-bf16-safetensors/
cp ~/AI/huggingface/DeepSeek-V3.1/tokenizer.json ~/AI/DeepSeek-V3.1-bf16-safetensors/
cp ~/AI/huggingface/DeepSeek-V3.1/*.py ~/AI/DeepSeek-V3.1-bf16-safetensors/
# DeepSeek-V3.1
cd ~/AI/fp8-to-bf16/llama.cpp
ulimit -n 99999
mkdir -p ~/AI/DeepSeek-V3.1/DeepSeek-V3.1-THIREUS-BF16-SPECIAL_SPLIT/
python convert_hf_to_gguf.py \
--outtype bf16 \
--outfile ~/AI/DeepSeek-V3.1/DeepSeek-V3.1-THIREUS-BF16-SPECIAL_SPLIT/DeepSeek-V3.1-THIREUS-BF16-SPECIAL_TENSOR \
--no-tensor-first-split --split-max-tensors 1 \
~/AI/DeepSeek-V3.1-bf16-safetensors
# Fix name
cd ~/AI/DeepSeek-V3.1/DeepSeek-V3.1-THIREUS-BF16-SPECIAL_SPLIT
for f in $(ls); do mv -f $f $(echo $f | sed 's/DeepSeek-V3-/DeepSeek-V3.1-THIREUS-BF16-SPECIAL_TENSOR-/g'); done
```
Uploaded here: https://huggingface.co/Thireus/DeepSeek-V3.1-THIREUS-BF16-SPECIAL_SPLIT
I'm going to have to settle for iq3, I just barely can't fit the iq4_kss, hitting an OOM error due to the 256 GB of system RAM I have... :c
Say no more, fam:
## IQ3_K 293.177 GiB (3.753 BPW)
This one is a bit different, keeping full q8_0 attn/shexp/first 3 dense layers which will likely slow down TG a little, but should give about the best possible perplexity for the size.
After I run perplexity clean with no NaNs and do a quick vibe check, I'll start uploading it!
you rock! replaced the IQ3_KS with the IQ3_K and honestly the speed difference is only like 0.4 t/s, which is fine as I don't really use deepseek for coding, more so analytical work.
I noticed that this iteration of deepseek seems less creative, more sterile by default... I had to re-prompt like 3 times to get it to provide a creative answer "from your point of view"... like a "here is a scenario that resulted in a disaster, what would you do different in this scenario to avoid the initial result" type analysis... saw other reports of this as well on the main models discussions, seems like they are going the gpt-5 approach with making models a workhorse rather than an outside-of-the-box thinker...
I get it though, I feel as though less initial creativity = more accurate coding & benchmarks, but at the cost of less flair and character... which in turn requires us to make more descriptive prompts!
Thanks for the detailed notes, including the many dependencies to install! Glad you got it going. Also check out my perplexity graphs; the new smol-IQ4_KSS is really amazing. Given it is almost exactly 4 bpw, I wonder if DeepSeek is doing any QAT for 4 bpw knowing it is a popular quant size, just pure speculation here. Normally I use one size larger for the ffn_down tensors than the ffn_(gate|up) tensors, but for the smol I used the same smaller size for all of them.
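As a rough illustration of that difference, again assuming `--custom-q` style overrides (tensor regexes and quant types here are just examples, not the exact recipe):

```bash
# "Normal" recipe: routed ffn_down experts one size up from ffn_(gate|up)
CUSTOM_Q_NORMAL="blk\..*\.ffn_down_exps\.weight=iq5_ks,blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss"

# "smol" recipe: the same smaller size for all routed expert tensors
CUSTOM_Q_SMOL="blk\..*\.ffn_(down|gate|up)_exps\.weight=iq4_kss"
```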
you rock! replaced the IQ3_KS with the IQ3_K and honestly the speed difference is only like 0.4 t/s, which is fine as I don't really use deepseek for coding, more so analytical work.
Okay, good to know it isn't much slower with full q8_0 for the tensors running on CUDA; given the CPU is much slower, it doesn't affect overall TG speed too much.
Also, you might be able to squeeze the new smol-IQ4_KSS which seems like the best quant in the bunch looking at the graph for size/quality trade-off. Hopefully you have enough RAM/VRAM left over to run your desktop as well as get long enough context.
Cheers!
Regarding the ability to enable/disable thinking on DeepSeek-V3.1 without using your own custom client chat template and /text/completion/ endpoint,
I saw a tip from @ddh0 :
if you're using llama.cpp u can do --chat-template-kwargs "{'enable_thinking': false}" or whatever the right kwargs are for your template
So maybe that would work? I gotta see if ik supports this and try it out... ahh no, I don't see the option there... I gotta look into this some more as it seems like possibly the "right way" to do it for the /chat/completion endpoint, unless there is a way to pass kwargs to the API endpoint via the JSON post body? hrmmm
EDIT: just noticed this PR which would add the template-kwargs feature to ik: https://github.com/ikawrakow/ik_llama.cpp/pull/723
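If/when that lands, I'd expect usage to look roughly like this (the kwarg name depends on the model's jinja chat template, so double-check it against the V3.1 template; this is just a sketch):

```bash
# Sketch: toggle V3.1 thinking via chat template kwargs instead of a custom client.
# On mainline llama.cpp you generally need --jinja for template kwargs to apply.
./build/bin/llama-server -m /models/DeepSeek-V3.1-IQ4_KSS.gguf \
    --jinja --chat-template-kwargs '{"thinking": true}'   # thinking on
# ... or '{"thinking": false}' for non-thinking mode.
```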
Also, you might be able to squeeze the new `smol-IQ4_KSS`, which seems like the best quant in the bunch looking at the graph for size/quality trade-off. Hopefully you have enough RAM/VRAM left over to run your desktop as well as get long enough context.
Alright, honestly I'd given up on this iteration of Deepseek because the quality just felt off from previous ones... however, you went through the effort of creating the smol boi, so I figured I'd put in the effort of squeezing this whale into my RAM...
I have filled my system to the BRIM in order to even load this quant; however, in the end I think it's best to run the iq3 quant and eat the perplexity difference... still, I got the iq4 to work a little, which is honestly amazing to me. I had to disable my window manager in order to run it without touching swap, as that would introduce a big performance reduction, but as the context filled, the swap began to get utilized. Below are speeds and proof of just how filled my system is... if I paid for all this RAM I might as well use every byte of it! haha what a fun experiment.
IQ4_KSS:
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 5.564 | 92.03 | 10.392 | 12.32 |
| 512 | 128 | 512 | 6.480 | 79.02 | 11.442 | 11.19 |
| 512 | 128 | 1024 | 8.418 | 60.82 | 14.064 | 9.10 |
| 512 | 128 | 1536 | 5.824 | 87.92 | 16.933 | 7.56 |
and then we get an OOM error... but you can see the token gen speed suffered greatly as the context went into the swap partition of my SSD.
update 1: redownloading IQ3_KS in order to provide performance numbers on a 1x4090 + 3x3090 + 256 GB DDR5 setup... will report results in this comment when done.
IQ3_KS results for comparison: speed degradation as context grows is much more stable.
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 2048 | 512 | 0 | 24.623 | 83.17 | 38.213 | 13.40 |
| 2048 | 512 | 2048 | 24.498 | 83.60 | 38.805 | 13.19 |
| 2048 | 512 | 4096 | 24.432 | 83.82 | 39.865 | 12.84 |
| 2048 | 512 | 6144 | 24.534 | 83.48 | 40.802 | 12.55 |
| 2048 | 512 | 8192 | 24.642 | 83.11 | 41.788 | 12.25 |
| 2048 | 512 | 10240 | 24.732 | 82.81 | 42.526 | 12.04 |
| 2048 | 512 | 12288 | 24.851 | 82.41 | 43.274 | 11.83 |
| 2048 | 512 | 14336 | 25.261 | 81.07 | 44.161 | 11.59 |
Alright, honestly I'd given up on this iteration of Deepseek because the quality just felt off from previous ones...
Something does seem off to me as well. I'm trying to separate my subjective annoyance (nearly every response starts with 'Of course') from objective quality.
Although not properly supported yet (thus rendering the output difficult to use), forcing thinking on does seem to make things better.
Something does seem off to me as well. I'm trying to separate my subjective annoyance (nearly every response starts with 'Of course') from objective quality.
Although not properly supported yet (thus rendering the output difficult to use), forcing thinking on does seem to make things better.
I wonder if the hybrid thinking / non-thinking is actually harming the output quality, similar to how the first iterations of Qwen 3 worked... there's a reason Qwen ditched that method and just released separate instruct and thinking models, and when the revised Qwen models released, the performance gains for both were very impressive.
Sure, hybrid thinking / non-thinking switching on the fly is convenient, but at what cost? In my opinion, a solid non-thinking model is much more impressive than a thinking model, which is probably why I always come back to Kimi K2. Even at an iq2 quant, its speeds are much snappier due to the smaller experts than DeepSeek (I can run it around 17 t/s token gen, which is perfect for non-coding tasks), and it just has more parameters in general, so even quantized it's more knowledgeable.
I just noticed this PR for ik_llama.cpp: https://github.com/ikawrakow/ik_llama.cpp/pull/723
It possibly adds the ability to run --chat-template-kwargs "{'enable_thinking': false}" or whatever the exact chat template kwargs are in the chat template: https://huggingface.co/deepseek-ai/DeepSeek-V3.1#chat-template
I'll give it a try after my coffee kicks in, as it might improve generation over using the /chat/completion endpoint as it stands right now.
Thanks for the feedback and testing, always interesting to hear how these big quants run on folks various rigs and which models y'all prefer!
@ubergarm
Thanks for the tip, --chat-template-kwargs '{"enable_thinking": false}' also works with the GLM 4.5 series to disable thinking.
I got thinking working with V3.1, but llama.cpp does not return the first `<think>` token. This breaks OpenWebUI multi-turn conversations, so I made this filter to inject the `<think>` token into OpenWebUI.
@phakio are you still running Kimi K2?
I found that, at least for me, specifying the Smart Expert Reduction parameter improves token generation speed, seemingly without affecting perplexity.
cf https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF/discussions/7
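For anyone curious, the flag in question (assuming ik_llama.cpp's smart expert reduction option; check `llama-server --help` for the exact syntax on your build) looks something like:

```bash
# Sketch: smart expert reduction in ik_llama.cpp, e.g. consider only ~6 of the
# 8 routed experts per token to speed up TG (values are illustrative).
./build/bin/llama-server -m /models/Kimi-K2-Instruct-IQ2_KS.gguf -ser 6,1
```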
I made this filter to inject the `<think>` token into OpenWebUI.
@anikifoss do you have some instructions on how to get this working in Open WebUI?
@whoisjeremylam
In OpenWebUI, you can add filters via Admin Panel -> Functions -> Filter -> the + button on the right. Make sure to pass `--chat-template-kwargs '{"thinking": true}'` to llama-server to enable thinking.
