Feedback

#1
by bibproj - opened

@sszymczyk

Thank you for creating these. You know quite a lot about LLMs to get this working!

I use LLMs for translating. The DeepSeek models are very good with translations, so I try to test their models, or any model that uses their architecture. Translating is a non-reasoning task, as research has shown that reasoning degrades the quality of translations. Evidently the models tend to second-guess themselves and in the process choose a weaker translation.

I saw your question on the DeepSeek V3.2 model page and thought I could give you some feedback from the translation angle. I tested the Q4_K_M on a Mac Studio, and it generated at 14.8 t/s. On the Mac, MLX tends to be faster than llama.cpp, and the 4-bit MLX quant of DeepSeek V3.2 runs at 20 t/s.

The quality of the translations from the dense attention Q4_K_M is, however, not as good as that of the MLX and API models that use the designed sparse attention. This is one area where the dense attention is not doing quite as well.

Thought I'd just give you the feedback.

Still very impressed that you got a working, non-hallucinating V3.2 model with llama.cpp. Well done.

Thanks for your feedback!

Just wanted to confirm, did you use the recommended DeepSeek V3.2-Exp jinja chat template by explicitly setting it in llama.cpp parameters?
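
For reference, explicitly passing the template would look roughly like this (the model and template file names are just placeholders for your local paths):

$ llama-server -m DeepSeek-V3.2-nolight-Q4_K_M.gguf \
    --jinja --chat-template-file DeepSeek-V3.2-Exp.jinja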

Sorry, no. I did use --jinja, but did not specify the specific chat template.

I will redo my testing.

Okay, it definitely helps to read the instructions! Thank you for verifying.

The translations are working very well. My compliments on getting it all working so smoothly. Quite unexpected. 👍
From what I can see, the results from your dense attention version look the same as that of the sparse attention version.

Also, with the chat template explicitly specified, generation has gone up to 15.2 t/s.

Could I ask you a big favour please? I would like to make a Q5_K_M version. Would you mind sharing your modified convert_hf_to_gguf.py file? Mine is not working.

I see that my older patch from here no longer merges cleanly, so you can try this process instead:

  1. Edit tokenizer_config.json file from the HF model files and change at the top:
"add_bos_token": false,

to

"add_bos_token": true,
  2. Apply these two small changes in llama.cpp's convert_hf_to_gguf.py (not sure if the patch survives HF forum formatting, but they are easy enough to apply manually):
diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
index d9ee390b3..62c798f00 100755
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -7210,6 +7210,7 @@ class DeepseekModel(TextModel):
 

 @ModelBase.register(
     "DeepseekV2ForCausalLM",
     "DeepseekV3ForCausalLM",
+    "DeepseekV32ForCausalLM",
     "KimiVLForConditionalGeneration",
     "YoutuForCausalLM",
     "YoutuVLForConditionalGeneration"
@@ -7330,7 +7331,7 @@ class DeepseekV2Model(TextModel):
 
     def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
         # skip vision tensors and remove "language_model." for Kimi-VL
-        if "vision_tower" in name or "multi_modal_projector" in name:
+        if "vision_tower" in name or "multi_modal_projector" in name or "self_attn.indexer" in name:
             return []
         if name.startswith("siglip2.") or name.startswith("merger."):
             return []
  3. Convert and quantize the model as usual (see the command sketch after this list).
  4. Reverse the changes in tokenizer_config.json (so you have the original content in case you are going to use the model later).
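
For step 3, the usual commands would look roughly like this (output file names and the Q5_K_M target are just examples, adjust to taste):

$ python convert_hf_to_gguf.py /path/to/DeepSeek-V3.2 \
    --outtype bf16 --outfile DeepSeek-V3.2-BF16.gguf
$ ./llama-quantize DeepSeek-V3.2-BF16.gguf DeepSeek-V3.2-Q5_K_M.gguf Q5_K_M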

Thank you!

@ubergarm

Hi John. You might find this experiment interesting.

I'd like to hear your thoughts on the fact that even though DeepSeek V3.2 was designed for sparse attention, this experiment using it with dense attention is working extremely well.
These seem to be the only properly-working GGUFs of DeepSeek V3.2 to date.

@bibproj

Thanks for the heads up! I've seen some folks running DeepSeek V3.2 on llama.cpp only recently (without the sparse attention e.g.

I haven't given it a try yet myself. Looking at this repo's Q4_K_M, the recipe at first glance seems fairly default. It might be possible to improve output quality as well as speed using ik quants and bumping up attn/shexp/the first N dense layers, but I'm not sure it would run on ik yet, as it only recently started running on mainline as far as I know.
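
Just to illustrate the idea, a recipe along those lines might look roughly like this, assuming a recent mainline llama-quantize with per-tensor overrides (the --tensor-type patterns here are from memory and untested, so treat the whole thing as an assumption rather than a recipe):

$ ./llama-quantize --imatrix imatrix.dat \
    --tensor-type attn=q8_0 \
    --tensor-type shexp=q8_0 \
    DeepSeek-V3.2-BF16.gguf DeepSeek-V3.2-mix.gguf Q4_K_M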

GLM-4.7 has been pretty good and I'm happy with the ik quants for that one so far; it is working with pydantic-ai agentic tool-calling type stuff even at a ~2bpw smol-IQ1_KT quant hah...

Excited to see what 2026 holds (and nervous too about the price of consumer computing equipment... :/)

IIUC this is basically just fooling llama.cpp into thinking it's a standard DeepSeekV3-type model, so it should in theory run in ik as well.

If it's basically lossless (aside from reduced performance ofc) compared to the proper sparse attention, then it might be interesting to try to get it running with mixed bpw quants.

@ubergarm Hi, it's me, u/fairydreaming. I tested my dense attention DeepSeek V3.2 GGUF in ik_llama.cpp and it loads fine. But there seems to be a problem with the DeepSeek-V3.2-Exp jinja chat template that I used - llama-cli crashed after answering my prompt. I'm going to investigate some more.

Update: looks like the problem is caused by the usage of the second argument of split() inside the template. Not sure why it works in mainline llama.cpp - it uses the same minja.hpp.
Update2: Fixed template: https://pastebin.com/HaWZGD38

@ubergarm

Thanks for the heads up! I've seen some folks running DeepSeek V3.2 on llama.cpp only recently (without the sparse attention e.g.

Yes, this is the result of that very post.

Excited to see what 2026 holds (and nervous too about the price of consumer computing equipment... :/)

Yeah. You're on the right path with tweaking and squeezing the best performance out of any rig that people already have! Your skills might be even more in demand in 2026 ... 😊

If it's basically lossless (aside from reduced performance ofc) compared to the proper sparse attention, then it might be interesting to try to get it running with mixed bpw quants.

I see you already made some imatrix files at https://huggingface.co/Doctor-Shotgun/DeepSeek-V3.2-dense-attn-imatrix

I'm busy trying out a mixed bpw quant
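
Roughly something like this with mainline llama-quantize, feeding in one of those imatrix files (the imatrix file name and the quant types are just placeholders for whatever mix I end up with):

$ ./llama-quantize --imatrix deepseek-v3.2-dense-attn.imatrix \
    --token-embedding-type q8_0 --output-tensor-type q8_0 \
    DeepSeek-V3.2-BF16.gguf DeepSeek-V3.2-Q4_K_M.gguf Q4_K_M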

Yeah I spun up a cloud instance and ran the q8_0 from this repo through a couple datasets in preparation. Was gonna try Bartowski’s v5 set too but that one woulda taken like 1.5 hours on 8x RTX Pro 6000 lol.

🤞

$ tail -f nohup.out
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/DeepSeek-V3.2-Speciale-GGUF/DeepSeek-V3.2-Speciale-256x20B-safetensors-BF16-00029-of-00030.gguf: n_tensors = 37, total_size = 47.9G
INFO:gguf.gguf_writer:/mnt/data/models/ubergarm/DeepSeek-V3.2-Speciale-GGUF/DeepSeek-V3.2-Speciale-256x20B-safetensors-BF16-00030-of-00030.gguf: n_tensors = 3, total_size = 7.5G
Shard (1/30):  13%|█▎        | 5.82G/43.9G [00:10<01:09, 544Mbyte/s]
Writing:   0%|          | 5.82G/1.34T [00:10<40:55, 544Mbyte/s]

Got the Speciale-Q8_0 running on ik following your notes, initial test on a short prompt seems to be working. CPU-only, single socket of an EPYC 9975 128-Core w/ 12x64GiB DDR5@6400MT/s in NPS1:

INFO [           print_timings] prompt eval time     =     724.85 ms /    20 tokens (   36.24 ms per token,    27.59 tokens per second)
INFO [           print_timings] generation eval time =   27632.42 ms /   291 runs   (   94.96 ms per token,    10.53 tokens per second) 

In the next test it got about 145 tok/sec PP and 9 tok/sec TG on a longer multi-tool-use agentic test using pydantic-ai as the client providing a few tools for it to use. It did see what was available but did NOT actually call the tools, just made up some "simulated" responses haha... Similar thing on mainline, it wasn't actually calling the tools.

Mainline speeds for the longer test are a bit slower than ik:

prompt eval time =   43633.23 ms /  2086 tokens (   20.92 ms per token,    47.81 tokens per second)
eval time =  334559.74 ms /  2503 tokens (  133.66 ms per token,     7.48 tokens per second)

I'll maybe have to poke around their encoding folder to look more into tool calling: https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Speciale#chat-template ... Oh, it says it right there:

Please note that the DeepSeek-V3.2-Speciale variant is designed exclusively for deep reasoning tasks and does not support the tool-calling functionality.

Otherwise I'll play some more with it and see if smaller quants run faster... it is a yappy thinker by design hah...

@ubergarm Tool calls not working is expected - llama.cpp currently recognizes the chat format as DeepSeek V3.1 (visible in the llama-server logs: srv params_from_: Chat format: DeepSeek V3.1) and expects the corresponding tool call tags. Not sure what tags DeepSeek-V3.2-Speciale used for tool calls in your test, but likely they were not detected and parsed at all. If you use DeepSeek V3.2 it probably won't work either, since it uses an entirely different set of <|DSML|...> XML-like tags (but I'm not sure whether it uses them only based on the system prompt or was trained to use these tags).

What you could try is to add a system prompt that explicitly commands the model to use DeepSeek V3.1-like tags for tool calls.

Edit: it worked, but I had to remove the old tool-calling tags from the added tokens in tokenizer.json first (tokens from 128806 to 128814). For the purpose of testing I simply cleared these tokens in the llama.cpp code:

diff --git a/src/llama-vocab.cpp b/src/llama-vocab.cpp
index a20c6525e..b093158b9 100644
--- a/src/llama-vocab.cpp
+++ b/src/llama-vocab.cpp
@@ -2087,6 +2087,7 @@ void llama_vocab::impl::load(llama_model_loader & ml, const LLM_KV & kv) {
 
     for (uint32_t i = 0; i < n_tokens; i++) {
         std::string word = gguf_get_arr_str(ctx, token_idx, i);
+        if (i >= 128806 && i <= 128814) { printf("XXXXXXXXXXXXXXXXXXXXXXXXX clearing token %d\n", i); word=""; }
         if (word.empty()) {
             LLAMA_LOG_WARN("%s: empty token at index %u\n", __func__, i);
             word = "[EMPTY_" + std::to_string(i) + "]";

Then I used the example code from Unsloth: https://unsloth.ai/docs/basics/inference-and-deployment/tool-calling with the added system prompt:

Result messages:

[
   {
      "role":"system",
      "content":"\n## Tools\nYou have access to the following tools:\n\n\n### add_number\nDescription: Add two numbers.\n\nParameters: {\"type\": \"object\", \"properties\": {\"a\": {\"type\": \"string\", \"description\": \"The first number.\"}, \"b\": {\"type\": \"string\", \"description\": \"The second number.\"}}, \"required\": [\"a\", \"b\"]}\n\n\n### multiply_number\nDescription: Multiply two numbers.\n\nParameters: {\"type\": \"object\", \"properties\": {\"a\": {\"type\": \"string\", \"description\": \"The first number.\"}, \"b\": {\"type\": \"string\", \"description\": \"The second number.\"}}, \"required\": [\"a\", \"b\"]}\n\n\n### substract_number\nDescription: Substract two numbers.\n\nParameters: {\"type\": \"object\", \"properties\": {\"a\": {\"type\": \"string\", \"description\": \"The first number.\"}, \"b\": {\"type\": \"string\", \"description\": \"The second number.\"}}, \"required\": [\"a\", \"b\"]}\n\n\n### write_a_story\nDescription: Writes a random story.\n\nParameters: {\"type\": \"object\", \"properties\": {}, \"required\": []}\n\n\n### terminal\nDescription: Perform operations from the terminal.\n\nParameters: {\"type\": \"object\", \"properties\": {\"command\": {\"type\": \"string\", \"description\": \"The command you wish to launch, e.g `ls`, `rm`, ...\"}}, \"required\": [\"command\"]}\n\n\n### python\nDescription: Call a Python interpreter with some Python code that will be ran.\n\nParameters: {\"type\": \"object\", \"properties\": {\"code\": {\"type\": \"string\", \"description\": \"The Python code to run\"}}, \"required\": [\"code\"]}\n\n\nIMPORTANT: ALWAYS adhere to this exact format for tool use:\n<|tool▁calls▁begin|><|tool▁call▁begin|>tool_call_name<|tool▁sep|>tool_call_arguments<|tool▁call▁end|>{additional_tool_calls}<|tool▁calls▁end|>\n\nWhere:\n- `tool_call_name` must be an exact match to one of the available tools\n- `tool_call_arguments` must be valid JSON that strictly follows the tool\\'s Parameters Schema\n- For multiple tool calls, chain them directly without separators or spaces\n"
   },
   {
      "role":"user",
      "content":[
         {
            "type":"text",
            "text":"Could you write me a story ?"
         }
      ]
   },
   {
      "role":"assistant",
      "tool_calls":[
         {
            "id":"UdOqqoRheLPVKHbyOqO12g5ghbPwACbZ",
            "function":{
               "arguments":"{}",
               "name":"write_a_story"
            },
            "type":"function"
         }
      ],
      "content":""
   },
   {
      "role":"tool",
      "tool_call_id":"UdOqqoRheLPVKHbyOqO12g5ghbPwACbZ",
      "name":"write_a_story",
      "content":"A long time ago in a galaxy far far away..."
   },
   {
      "role":"user",
      "content":[
         {
            "type":"text",
            "text":"What is today's date plus 3 days?"
         }
      ]
   },
   {
      "role":"assistant",
      "tool_calls":[
         {
            "id":"KZixRulhFlRSxh1dXUz9ZZzA6M7K8bBd",
            "function":{
               "arguments":"{\"code\":\"from datetime import datetime, timedelta\\n\\n# Get today\\'s date\\ntoday = datetime.now().date()\\n# Add 3 days\\nnew_date = today + timedelta(days=3)\\nprint(f\\\"Today\\'s date: {today}\\\")\\nprint(f\\\"Date after adding 3 days: {new_date}\\\")\"}",
               "name":"python"
            },
            "type":"function"
         }
      ],
      "content":""
   },
   {
      "role":"tool",
      "tool_call_id":"KZixRulhFlRSxh1dXUz9ZZzA6M7K8bBd",
      "name":"python",
      "content":"{'datetime': <class 'datetime.datetime'>, 'timedelta': <class 'datetime.timedelta'>, 'today': datetime.date(2026, 1, 9), 'new_date': datetime.date(2026, 1, 12)}"
   },
   {
      "role":"user",
      "content":[
         {
            "type":"text",
            "text":"Create a Fibonacci function in Python and find fib(20)."
         }
      ]
   },
   {
      "role":"assistant",
      "tool_calls":[
         {
            "id":"We2NVdIDPed1e7WdJkw0gU4UkbFKil2h",
            "function":{
               "arguments":"{\"code\":\"def fib(n):\\n    if n <= 0:\\n        return 0\\n    elif n == 1:\\n        return 1\\n    else:\\n        a, b = 0, 1\\n        for _ in range(2, n+1):\\n            a, b = b, a + b\\n        return b\\n\\nresult = fib(20)\\nprint(f\\\"The 20th Fibonacci number is: {result}\\\")\"}",
               "name":"python"
            },
            "type":"function"
         }
      ],
      "content":""
   },
   {
      "role":"tool",
      "tool_call_id":"We2NVdIDPed1e7WdJkw0gU4UkbFKil2h",
      "name":"python",
      "content":"{'fib': <function fib at 0x7d767fa4be20>, 'result': 6765}"
   }
]
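
For anyone who wants to reproduce this without the Unsloth script, the same kind of request can also be sent directly to llama-server's OpenAI-compatible endpoint, roughly like this (the port, model name and the single example tool are placeholders, and the system prompt is the one shown above):

$ curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "deepseek-v3.2-speciale",
      "messages": [
        {"role": "system", "content": "<tool list plus the V3.1-style tool call format instructions>"},
        {"role": "user", "content": "Could you write me a story ?"}
      ],
      "tools": [
        {"type": "function", "function": {"name": "write_a_story",
         "description": "Writes a random story.",
         "parameters": {"type": "object", "properties": {}, "required": []}}}
      ]
    }'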

There is a PR for a DeepSeek V3.2 tool parser in llama.cpp which hasn't been merged yet. Perhaps that would be the best solution?

Meanwhile I did do a mixed quant for DeepSeek V3.2 with Q8_0+Q4_K+Q4_K+Q5_K for mainline llama.cpp, but it's still uploading as I have cable internet now lol.

EDIT: Hmm rendering the jinja template from the PR yields some weird results. Not sure that it's correct?

@Doctor-Shotgun

I manually copy-pasted the jinja template from the earlier Exp version into tokenizer_config.json before converting my bf16 GGUF. As the original model has custom Python code and no jinja template, you can follow these directions to specify one at runtime: https://huggingface.co/sszymczyk/DeepSeek-V3.2-nolight-GGUF/discussions/2#69628d5390025be8b918f4d6

If I understood you incorrectly, lemme know! Thanks for uploading some more quants!

I have the 3.2 Exp template already embedded in my GGUF, but imo we should try to replicate the correct prompt format for 3.2 non-Exp.

It seems like the llama.cpp PR author added a jinja template for non-Exp DS 3.2 (I'm not sure of the rationale behind the PR, because the model architecture isn't even supported lol), but when I was running test renders of it, it seemed to add an additional assistant and user turn at the end, so I'm not confident that this PR template is correct compared to the Python code.
