**Warning! Achtung! Внимание!** Set the chat template explicitly when using the model.
## Introduction
This repo contains Q8_0 and Q4_K_M quants of DeepSeek V3.2 with the sparse attention lightning indexer tensors removed. This makes it possible to run the model in mainline llama.cpp or ik_llama.cpp until a proper implementation of DeepSeek V3.2 sparse attention is completed.
## Usage
### llama.cpp
To use the model, save the DeepSeek V3.2-Exp chat template to a file and pass `--jinja --chat-template-file <saved-chat-template-file>` when running llama-cli or llama-server.
Note that tool calls will likely not work correctly with this template.
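For example, a llama-server invocation could look like this (the model path and template file name below are placeholders, not files from this repo):

```sh
# Serve the model with the DeepSeek V3.2-Exp chat template applied explicitly.
./llama-server \
    -m DeepSeek-V3.2-nolight-Q4_K_M.gguf \
    --jinja \
    --chat-template-file deepseek-v3.2-exp.jinja
```

The same `--jinja --chat-template-file` pair works for llama-cli.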
### ik_llama.cpp
ik_llama.cpp needs a modified DeepSeek V3.2-Exp chat template file. With the unmodified template you will get errors like this:

```
terminate called after throwing an instance of 'std::runtime_error'
  what():  split method must have between 1 and 1 positional arguments and between 0 and 0 keyword arguments at row 1
```

The message indicates that ik_llama.cpp's template engine only accepts `split` with exactly one positional argument and no keyword arguments, so the offending `split` calls in the template have to be rewritten accordingly.
## Model conversion
If you want to convert the model yourself, perform the following steps:
- Edit the `tokenizer_config.json` file from the DeepSeek V3.2 HF model and change the `"add_bos_token"` field value from `false` to `true`.
- Apply the changes below to the llama.cpp `convert_hf_to_gguf.py` script:
```diff
diff --git a/convert_hf_to_gguf.py b/convert_hf_to_gguf.py
index d9ee390b3..62c798f00 100755
--- a/convert_hf_to_gguf.py
+++ b/convert_hf_to_gguf.py
@@ -7210,6 +7210,7 @@ class DeepseekModel(TextModel):
 @ModelBase.register(
     "DeepseekV2ForCausalLM",
     "DeepseekV3ForCausalLM",
+    "DeepseekV32ForCausalLM",
     "KimiVLForConditionalGeneration",
     "YoutuForCausalLM",
     "YoutuVLForConditionalGeneration"
@@ -7330,7 +7331,7 @@ class DeepseekV2Model(TextModel):
     def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iterable[tuple[str, Tensor]]:
         # skip vision tensors and remove "language_model." for Kimi-VL
-        if "vision_tower" in name or "multi_modal_projector" in name:
+        if "vision_tower" in name or "multi_modal_projector" in name or "self_attn.indexer" in name:
             return []
         if name.startswith("siglip2.") or name.startswith("merger."):
             return []
```
- Convert and quantize the model as usual (a sketch of the full sequence follows this list).
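For reference, here is a minimal sketch of the whole sequence. The paths and output file names are placeholders, and the use of `jq` for the `tokenizer_config.json` edit is my choice; any other way of flipping the field works just as well:

```sh
# Change add_bos_token from false to true in the HF checkpoint.
jq '.add_bos_token = true' DeepSeek-V3.2-Exp/tokenizer_config.json > tmp.json \
    && mv tmp.json DeepSeek-V3.2-Exp/tokenizer_config.json

# Convert the checkpoint to GGUF with the patched script.
python convert_hf_to_gguf.py DeepSeek-V3.2-Exp \
    --outtype bf16 --outfile DeepSeek-V3.2-nolight-bf16.gguf

# Quantize to the published Q8_0 and Q4_K_M variants.
./llama-quantize DeepSeek-V3.2-nolight-bf16.gguf DeepSeek-V3.2-nolight-Q8_0.gguf Q8_0
./llama-quantize DeepSeek-V3.2-nolight-bf16.gguf DeepSeek-V3.2-nolight-Q4_K_M.gguf Q4_K_M
```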
## Performance notes
The model has exactly the same tensor shapes as DeepSeek V3/R1/V3.1, so performance should be the same as for those models.
## Benchmark results
In my limited testing so far I have found no degradation in the model's "intelligence" after removing the lightning indexer.
### lineage-bench
I tested the Q4_K_M quant in lineage-bench. The benchmark run had 40 quizzes per difficulty level, 160 overall; the `lineage` column below is the average score over the four difficulty levels.
| Nr | model_name | lineage | lineage-8 | lineage-64 | lineage-128 | lineage-192 |
|---|---|---|---|---|---|---|
| 1 | deepseek/deepseek-v3.2 | 0.988 | 1.000 | 1.000 | 1.000 | 0.950 |
The model solved almost all quizzes correctly, making only 2 errors on lineage graphs of 192 nodes (the most difficult quizzes). This result is even better than that of the original DeepSeek V3.2 tested via API.