Warning! Achtung! Внимание! Set the chat template explicitly when using the model.

Introduction

This repo contains Q8_0 and Q4_K_M quants of DeepSeek V3.2 with the sparse attention lightning indexer tensors removed. This makes it possible to run the model in mainline llama.cpp or ik_llama.cpp until a proper implementation of DeepSeek V3.2 sparse attention is completed.

Usage

llama.cpp

To use the model, save the DeepSeek V3.2-Exp chat template to a file and pass --jinja --chat-template-file <saved-chat-template-file> when running llama-cli or llama-server.
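
For example, a llama-server invocation could look like the following sketch (the model file name, template file name, and context size here are placeholders to adjust for your setup):

  # Hypothetical example; adjust file names and parameters to your setup.
  ./llama-server \
      --model DeepSeek-V3.2-nolight-Q4_K_M.gguf \
      --jinja \
      --chat-template-file deepseek-v3.2-exp.jinja \
      --ctx-size 16384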

Note that tool calls will likely not work correctly with this template.

ik_llama.cpp

ik_llama.cpp needs a modified DeepSeek V3.2-Exp chat template file. Otherwise you will get errors like this:

terminate called after throwing an instance of 'std::runtime_error'
  what():  split method must have between 1 and 1 positional arguments and between 0 and 0 keyword arguments at row 1
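
The error means that the split method in the ik_llama.cpp template engine accepts exactly one positional argument and no keyword arguments. As a purely hypothetical illustration (not a quote from the actual DeepSeek template), an expression like

  {{ message.content.split('</think>', 1) | last }}

would trigger the error and would have to be rewritten so that split receives a single argument, adjusting the surrounding logic if the second argument mattered:

  {{ message.content.split('</think>') | last }}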

Model conversion

If you want to convert the model yourself, perform the following steps:

  1. Edit the tokenizer_config.json file from the DeepSeek V3.2 HF model and change the "add_bos_token" field value from false to true.
  2. Apply the changes below to the llama.cpp convert_hf_to_gguf.py script:
diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
index d9ee390b3..62c798f00 100755
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -7210,6 +7210,7 @@ class DeepseekModel(TextModel):
 @ModelBase.register(
     "DeepseekV2ForCausalLM",
     "DeepseekV3ForCausalLM",
+    "DeepseekV32ForCausalLM",
     "KimiVLForConditionalGeneration",
     "YoutuForCausalLM",
     "YoutuVLForConditionalGeneration"
@@ -7330,7 +7331,7 @@ class DeepseekV2Model(TextModel):
 
     def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
         # skip vision tensors and remove "language_model." for Kimi-VL
-        if "vision_tower" in name or "multi_modal_projector" in name:
+        if "vision_tower" in name or "multi_modal_projector" in name or "self_attn.indexer" in name:
             return []
         if name.startswith("siglip2.") or name.startswith("merger."):
             return []
  3. Convert and quantize the model as usual, as sketched below.
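
A minimal sketch of the conversion and quantization commands, assuming a patched llama.cpp checkout (all paths and output file names below are placeholders):

  # Convert the HF model (with the tokenizer_config.json edit and the patch
  # above applied) to a bf16 GGUF, then produce both quants from it.
  python convert_hf_to_gguf.py /path/to/DeepSeek-V3.2 \
      --outtype bf16 --outfile DeepSeek-V3.2-nolight-bf16.gguf
  ./llama-quantize DeepSeek-V3.2-nolight-bf16.gguf DeepSeek-V3.2-nolight-Q8_0.gguf Q8_0
  ./llama-quantize DeepSeek-V3.2-nolight-bf16.gguf DeepSeek-V3.2-nolight-Q4_K_M.gguf Q4_K_M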

Performance notes

The model has exactly the same tensor shapes as DeepSeek V3/R1/V3.1, so performance should be the same as for those models.

Benchmark results

In my limited testing so far I found no degradation in the model's "intelligence" after removing the lightning indexer.

lineage-bench

I tested the Q4_K_M quant in lineage-bench. In the benchmark run there were 40 quizzes per difficulty level, 160 overall:

| Nr | model_name | lineage | lineage-8 | lineage-64 | lineage-128 | lineage-192 |
|----|------------|---------|-----------|------------|-------------|-------------|
| 1 | deepseek/deepseek-v3.2 | 0.988 | 1.000 | 1.000 | 1.000 | 0.950 |

The model solved almost all quizzes correctly. It made only 2 errors in lineage graphs of 192 nodes (the most difficult quizzes). This result is even better than the one for the original DeepSeek V3.2 tested via API.
