Warning! Achtung! Внимание! Set the chat template explicitly when using the model.

Introduction

This repo contains Q8_0 and Q4_K_M quants of DeepSeek V3.2 with the sparse attention lightning indexer tensors removed. This makes it possible to run the model in mainline llama.cpp or ik_llama.cpp until a proper implementation of DeepSeek V3.2 sparse attention is completed.

Usage

llama.cpp

To use the model, save the DeepSeek V3.2-Exp chat template to a file and pass --jinja --chat-template-file <saved-chat-template-file> when running llama-cli or llama-server.
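
For example, a llama-server invocation could look like the following sketch (the model file name, template file name, and context size here are placeholders to adjust for your setup):

  # Hypothetical example; adjust file names and parameters to your setup.
  ./llama-server \
      --model DeepSeek-V3.2-nolight-Q4_K_M.gguf \
      --jinja \
      --chat-template-file deepseek-v3.2-exp.jinja \
      --ctx-size 16384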

Note that tool calls will likely not work correctly with this template.

ik_llama.cpp

ik_llama.cpp needs a modified DeepSeek V3.2-Exp chat template file. Otherwise you will get errors like this:

terminate called after throwing an instance of 'std::runtime_error'
  what():  split method must have between 1 and 1 positional arguments and between 0 and 0 keyword arguments at row 1
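
The error means that the split method in the ik_llama.cpp template engine accepts exactly one positional argument and no keyword arguments. As a purely hypothetical illustration (not a quote from the actual DeepSeek template), an expression like

  {{ message.content.split('</think>', 1) | last }}

would trigger the error and would have to be rewritten so that split receives a single argument, adjusting the surrounding logic if the second argument mattered:

  {{ message.content.split('</think>') | last }}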

Model conversion

If you want to convert the model yourself, perform the following steps:

  1. Edit the tokenizer_config.json file from the DeepSeek V3.2 HF model and change the "add_bos_token" field value from false to true.
  2. Apply the changes below to the llama.cpp convert_hf_to_gguf.py script:
diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
index d9ee390b3..62c798f00 100755
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -7210,6 +7210,7 @@ class DeepseekModel(TextModel):
 @ModelBase.register(
     "DeepseekV2ForCausalLM",
     "DeepseekV3ForCausalLM",
+    "DeepseekV32ForCausalLM",
     "KimiVLForConditionalGeneration",
     "YoutuForCausalLM",
     "YoutuVLForConditionalGeneration"
@@ -7330,7 +7331,7 @@ class DeepseekV2Model(TextModel):
 
     def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
         # skip vision tensors and remove "language_model." for Kimi-VL
-        if "vision_tower" in name or "multi_modal_projector" in name:
+        if "vision_tower" in name or "multi_modal_projector" in name or "self_attn.indexer" in name:
             return []
         if name.startswith("siglip2.") or name.startswith("merger."):
             return []
  3. Convert and quantize the model as usual, as sketched below.
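
A minimal sketch of the conversion and quantization commands, assuming a patched llama.cpp checkout (all paths and output file names below are placeholders):

  # Convert the HF model (with the tokenizer_config.json edit and the patch
  # above applied) to a bf16 GGUF, then produce both quants from it.
  python convert_hf_to_gguf.py /path/to/DeepSeek-V3.2 \
      --outtype bf16 --outfile DeepSeek-V3.2-nolight-bf16.gguf
  ./llama-quantize DeepSeek-V3.2-nolight-bf16.gguf DeepSeek-V3.2-nolight-Q8_0.gguf Q8_0
  ./llama-quantize DeepSeek-V3.2-nolight-bf16.gguf DeepSeek-V3.2-nolight-Q4_K_M.gguf Q4_K_M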

Performance notes

The model has exactly the same tensor shapes as DeepSeek V3/R1/V3.1, so performance should be the same as for those models.

Benchmark results

In my limited testing so far I found no degradation in the model's "intelligence" after removing the lightning indexer.

lineage-bench

I tested the Q4_K_M quant in lineage-bench. In the benchmark run there were 40 quizzes per difficulty level, 160 overall:

| Nr | model_name | lineage | lineage-8 | lineage-64 | lineage-128 | lineage-192 |
|----|------------|---------|-----------|------------|-------------|-------------|
| 1 | deepseek/deepseek-v3.2 | 0.988 | 1.000 | 1.000 | 1.000 | 0.950 |

The model solved almost all quizzes correctly. It made only 2 errors in lineage graphs of 192 nodes (the most difficult quizzes). This result is even better than the one for the original DeepSeek V3.2 tested via API.
