Model Overview
- Model Architecture: DeepSeek-OCR
- Input: Image, Text
- Output: Text
- Supported Hardware Microarchitecture: AMD MI350/MI355
- ROCm: 7.1.0
- Operating System(s): Linux
- Inference Engine: vLLM
- Model Optimizer: AMD-Quark (V0.11)
- Weight quantization: Language model, MoE expert layers only, OCP MXFP4, Static
- Activation quantization: Language model, MoE expert layers only, OCP MXFP4, Dynamic
- Calibration Dataset: Pile
This model was built from the DeepSeek-OCR model by applying AMD-Quark MXFP4 quantization.
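The overview lists vLLM as the inference engine. The snippet below is a minimal offline-inference sketch using vLLM's multi-modal API; the prompt template and image placeholder token are assumptions for illustration only and are not confirmed for this checkpoint, so check the vLLM documentation for DeepSeek-OCR before relying on them.
from vllm import LLM, SamplingParams
from PIL import Image

# Load the quantized checkpoint; trust_remote_code is needed for the custom DeepSeek-OCR model code.
llm = LLM(model="amd/DeepSeek-OCR-MXFP4", trust_remote_code=True)

image = Image.open("page.png")  # hypothetical input document image
sampling = SamplingParams(temperature=0.0, max_tokens=512)

# The "<image>" placeholder and the OCR prompt text below are assumptions, not a verified template.
outputs = llm.generate(
    {"prompt": "<image>\nFree OCR.", "multi_modal_data": {"image": image}},
    sampling,
)
print(outputs[0].outputs[0].text)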
Model Quantization
The model was quantized from amd/DeepSeek-OCR using AMD-Quark. The weights and activations are quantized to MXFP4.
Quantization script:
Before quantization, install flash-attn as follows:
pip install flash-attn --no-build-isolation
Note that deepseek_vl_v2 is not in the built-in model template list in Quark V0.11, so it has to be registered before quantization.
import torch
from transformers import AutoModel, AutoTokenizer, AutoProcessor
from quark.torch import LLMTemplate, ModelQuantizer, export_safetensors
from datasets import load_dataset
from quark.contrib.llm_eval import ppl_eval
# Register DeepSeek-OCR template
deepseek_ocr_template = LLMTemplate(
    model_type="deepseek_vl_v2",
    kv_layers_name=["*k_proj", "*v_proj"],
    q_layer_name="*q_proj",
    exclude_layers_name=["lm_head", "model.sam_model*", "model.vision_model*", "model.projector*"],
)
LLMTemplate.register_template(deepseek_ocr_template)
# Configuration
ckpt_path = "amd/DeepSeek-OCR"
output_dir = "amd/DeepSeek-OCR-MXFP4"
quant_scheme = "mxfp4"
exclude_layers = ["*self_attn*", "*mlp.gate", "lm_head", "*mlp.gate_proj", "*mlp.up_proj",
                  "*mlp.down_proj", "*shared_experts.*", "*sam_model*", "*vision_model*", "*projector*"]
# Load model
model = AutoModel.from_pretrained(ckpt_path, use_safetensors=True, trust_remote_code=True,
                                  _attn_implementation='flash_attention_2', device_map="cuda:0", torch_dtype=torch.bfloat16)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(ckpt_path, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(ckpt_path, trust_remote_code=True)
# Get quant config from template
template = LLMTemplate.get(model.config.model_type)
quant_config = template.get_config(scheme=quant_scheme, exclude_layers=exclude_layers)
# Quantize
quantizer = ModelQuantizer(quant_config)
model = quantizer.quantize_model(model)
model = quantizer.freeze(model)
# Export hf_format
export_safetensors(model, output_dir, custom_mode="quark")
tokenizer.save_pretrained(output_dir)
processor.save_pretrained(output_dir)
# Evaluate PPL (optional)
testdata = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
testenc = tokenizer("\n\n".join(testdata["text"]), return_tensors="pt")
ppl = ppl_eval(model, testenc, model.device)
print(f"Perplexity: {ppl.item()}")
Perplexity
| Benchmark | DeepSeek-OCR | DeepSeek-OCR-MXFP4 (this model) | Recovery |
|---|---|---|---|
| WikiText2 PPL | 11.1787 | 11.8868 | 94.04% |
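Recovery appears to be computed as the ratio of the baseline perplexity to the quantized perplexity, 11.1787 / 11.8868 ≈ 94.04%; since lower perplexity is better, values closer to 100% indicate less degradation from quantization.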
License
Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
Model tree for amd/DeepSeek-OCR-MXFP4
- Base model: deepseek-ai/DeepSeek-OCR