# Model Overview
- Model Architecture: Kimi-K2.5
- Input: Text
- Output: Text
- Supported Hardware Microarchitecture: AMD MI350/MI355
- ROCm: 7.1.0
- Operating System(s): Linux
- Inference Engine: vLLM
- Model Optimizer: AMD-Quark
- Weight quantization: MoE-only, OCP MXFP4, Static
- Activation quantization: MoE-only, OCP MXFP4, Dynamic
- Calibration Dataset: Pile
This model was built from the Kimi-K2.5 model by applying AMD-Quark for MXFP4 quantization.
# Model Quantization
The model was quantized from moonshotai/Kimi-K2.5 using AMD-Quark. The weights and activations of the MoE layers are quantized to OCP MXFP4: weight scales are static (fixed at quantization time) and activation scales are dynamic (computed at runtime).
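For reference, the OCP Microscaling (MX) FP4 format groups tensor elements into blocks of 32 that share one power-of-two scale (E8M0), with each element stored as a 4-bit FP4 (E2M1) value; a stored element decodes approximately as

$$x_i \approx 2^{e} \cdot q_i, \qquad q_i \in \{0,\ \pm 0.5,\ \pm 1,\ \pm 1.5,\ \pm 2,\ \pm 3,\ \pm 4,\ \pm 6\},$$

where $e$ is the shared block exponent.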
# Deployment
## Use with vLLM
This model can be deployed efficiently using the vLLM backend.
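For example, a minimal launch (a sketch; the exact command used for evaluation appears under Reproduction below, and the tensor-parallel size should match your GPU count):

```bash
# Serve the quantized checkpoint behind an OpenAI-compatible API on port 8000.
vllm serve amd/Kimi-K2.5-MXFP4 -tp 4 --trust-remote-code
```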
# Evaluation
The model was evaluated on the GSM8K benchmark.
## Accuracy
| Benchmark | Kimi-K2.5 | Kimi-K2.5-MXFP4 (this model) | Recovery |
|-----------|-----------|------------------------------|----------|
| GSM8K (flexible-extract) | 94.09 | 93.25 | 99.1% |
## Reproduction
The GSM8K results were obtained with the lm-evaluation-harness framework, running inside the Docker image vllm/vllm-openai-rocm:v0.14.0.
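One way to start that container (a sketch using the usual ROCm device flags; the entrypoint override is an assumption and may not be needed for this image):

```bash
# Standard ROCm GPU passthrough: /dev/kfd (compute) and /dev/dri (render).
docker run -it --rm \
  --network=host --ipc=host \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  --shm-size 16G \
  --entrypoint /bin/bash \
  vllm/vllm-openai-rocm:v0.14.0
```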
First, install vLLM (at commit 05339a7b207e2f32b56c29398c18d577c74cef3b) and lm-eval (version 0.4.10) inside the container:

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 05339a7b207e2f32b56c29398c18d577c74cef3b
python3 setup.py develop
pip install lm-eval==0.4.10
```
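Optionally, verify the pins from inside the vllm checkout (an illustrative sanity check):

```bash
git rev-parse HEAD               # expect 05339a7b207e2f32b56c29398c18d577c74cef3b
pip show lm-eval | grep Version  # expect Version: 0.4.10
```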
### Launching the server
```bash
vllm serve amd/Kimi-K2.5-MXFP4 -tp 4 \
  --mm-encoder-tp-mode data \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --trust-remote-code \
  --enforce-eager
```
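Once the server is up, a quick request against the completions endpoint confirms it is serving (prompt and parameters are illustrative):

```bash
curl http://0.0.0.0:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "amd/Kimi-K2.5-MXFP4",
        "prompt": "The capital of France is",
        "max_tokens": 16,
        "temperature": 0
      }'
```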
### Evaluating the model in a new terminal
```bash
lm_eval \
  --model local-completions \
  --model_args "model=amd/Kimi-K2.5-MXFP4,base_url=http://0.0.0.0:8000/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size 1
```
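On completion, lm-eval prints a results table; the gsm8k flexible-extract accuracy should land near the 93.25 reported above, allowing for small run-to-run variation.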
# License
Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.