# Model Overview
- Model Architecture: Kimi-K2.5
- Input: Text
- Output: Text
- Supported Hardware Microarchitecture: AMD MI350/MI355
- ROCm: 7.1.0
- Operating System(s): Linux
- Inference Engine: vLLM
- Model Optimizer: AMD-Quark
- Weight quantization: MoE-only, OCP MXFP4, Static
- Activation quantization: MoE-only, OCP MXFP4, Dynamic
- Calibration Dataset: Pile
This model was built from the Kimi-K2.5 model by applying AMD-Quark for MXFP4 quantization.
# Model Quantization
The model was quantized from moonshotai/Kimi-K2.5 using AMD-Quark. The weights and activations of the MoE layers are quantized to OCP MXFP4: weight scales are static (fixed at quantization time) and activation scales are dynamic (computed at runtime).
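For reference, the OCP Microscaling (MX) FP4 format groups tensor elements into blocks of 32 that share one power-of-two scale (E8M0), with each element stored as a 4-bit FP4 (E2M1) value; a stored element decodes approximately as

$$x_i \approx 2^{e} \cdot q_i, \qquad q_i \in \{0,\ \pm 0.5,\ \pm 1,\ \pm 1.5,\ \pm 2,\ \pm 3,\ \pm 4,\ \pm 6\},$$

where $e$ is the shared block exponent.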
# Deployment
## Use with vLLM
This model can be deployed efficiently using the vLLM backend.
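For example, a minimal launch (a sketch; the exact command used for evaluation appears under Reproduction below, and the tensor-parallel size should match your GPU count):

```bash
# Serve the quantized checkpoint behind an OpenAI-compatible API on port 8000.
vllm serve amd/Kimi-K2.5-MXFP4 -tp 4 --trust-remote-code
```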
# Evaluation
The model was evaluated on the GSM8K benchmark.
## Accuracy
| Benchmark | Kimi-K2.5 | Kimi-K2.5-MXFP4 (this model) | Recovery |
|-----------|-----------|------------------------------|----------|
| GSM8K (flexible-extract) | 94.09 | 93.25 | 99.1% |
## Reproduction
The GSM8K results were obtained with the lm-evaluation-harness framework, running inside the Docker image vllm/vllm-openai-rocm:v0.14.0.
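One way to start that container (a sketch using the usual ROCm device flags; the entrypoint override is an assumption and may not be needed for this image):

```bash
# Standard ROCm GPU passthrough: /dev/kfd (compute) and /dev/dri (render).
docker run -it --rm \
  --network=host --ipc=host \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video \
  --shm-size 16G \
  --entrypoint /bin/bash \
  vllm/vllm-openai-rocm:v0.14.0
```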
First, install vLLM (at commit 05339a7b207e2f32b56c29398c18d577c74cef3b) and lm-eval (version 0.4.10) inside the container:

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 05339a7b207e2f32b56c29398c18d577c74cef3b
python3 setup.py develop
pip install lm-eval==0.4.10
```
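Optionally, verify the pins from inside the vllm checkout (an illustrative sanity check):

```bash
git rev-parse HEAD               # expect 05339a7b207e2f32b56c29398c18d577c74cef3b
pip show lm-eval | grep Version  # expect Version: 0.4.10
```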
### Launching the server
```bash
vllm serve amd/Kimi-K2.5-MXFP4 -tp 4 \
  --mm-encoder-tp-mode data \
  --tool-call-parser kimi_k2 \
  --reasoning-parser kimi_k2 \
  --trust-remote-code \
  --enforce-eager
```
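Once the server is up, a quick request against the completions endpoint confirms it is serving (prompt and parameters are illustrative):

```bash
curl http://0.0.0.0:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "amd/Kimi-K2.5-MXFP4",
        "prompt": "The capital of France is",
        "max_tokens": 16,
        "temperature": 0
      }'
```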
### Evaluating the model in a new terminal
```bash
lm_eval \
  --model local-completions \
  --model_args "model=amd/Kimi-K2.5-MXFP4,base_url=http://0.0.0.0:8000/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size 1
```
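On completion, lm-eval prints a results table; the gsm8k flexible-extract accuracy should land near the 93.25 reported above, allowing for small run-to-run variation.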
# License
Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.