Inconsistent results when using fp8?
```bash
vllm serve /share5/projects/llm/models/weight/Kimi-K2-Instruct-0905 \
  --distributed-executor-backend ray \
  --tensor-parallel-size 16 \
  --host 0.0.0.0 --port 8080 \
  --served-model-name kimi-k2-instruct-0905 \
  --trust-remote-code \
  --max-model-len 131072 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.95 \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --calculate-kv-scales \
  --enable-auto-tool-choice \
  --tool-call-parser kimi_k2
```
I served kimi-k2-instruct-0905 on 16 H100 GPUs. When I run inference against the endpoint, I get inconsistent results. Any clues? Is my way of hosting the model correct?
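For reference, the endpoint is queried roughly like this (a minimal sketch: host, port, and served model name are taken from the serve command above, while the prompt, `max_tokens`, and `temperature` are placeholders):

```python
import requests

# Minimal sketch of one request against the OpenAI-compatible route exposed by `vllm serve`.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "kimi-k2-instruct-0905",
        "messages": [{"role": "user", "content": "<prompt goes here>"}],
        "max_tokens": 512,
        "temperature": 0.0,
    },
    timeout=600,
).json()

choice = resp["choices"][0]
print(choice["message"]["content"])
# vLLM also reports these fields, which help when the content comes back empty:
print(choice.get("finish_reason"), choice.get("stop_reason"),
      resp["usage"]["completion_tokens"])
```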
The original prompt (11,701 tokens) consistently fails with kimi-k2:

- 5/5 attempts returned an empty response
- stop_reason: 163586 (appears to be an internal error code; see the quick check below)
- completion_tokens: 1 (only one token is generated before stopping)
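One way to see what that stop id actually maps to (a sketch; it assumes the local model path from the serve command and that the bundled tokenizer loads via `trust_remote_code`):

```python
from transformers import AutoTokenizer

# Decode the reported stop id to check whether 163586 is a special/stop token
# rather than an error code, and compare it with the configured EOS id.
tok = AutoTokenizer.from_pretrained(
    "/share5/projects/llm/models/weight/Kimi-K2-Instruct-0905",
    trust_remote_code=True,
)
print("stop id 163586 decodes to:", tok.decode([163586]))
print("configured eos_token_id:", tok.eos_token_id)
```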
Comparison:
| Prompt Type | Tokens | kimi-k2 | Claude | GPT-5 |
|---|---|---|---|---|
| Simple (same question) | 136 | ✅ Works | ✅ Works | ✅ Works |
| Original complex | 11,701 | ❌ Empty | ✅ Works | ✅ Works |
Investigation with prompt length:
```text
======================================================================
FINDING KIMI-K2 TOKEN THRESHOLD (8K-15K Range)
✅ 1,299 prompt tokens | completion: 12 | The meeting lasts 35 minutes and 25 seconds.
✅ 2,499 prompt tokens | completion: 35 | The meeting lasts 35 minutes and 25 seconds, c
✅ 3,699 prompt tokens | completion: 22 | The meeting lasts 35 minutes and 25 seconds, i
✅ 4,899 prompt tokens | completion: 35 | The meeting lasts 35 minutes and 25 seconds, c
✅ 6,099 prompt tokens | completion: 22 | The meeting lasts 35 minutes and 25 seconds, i
✅ 7,299 prompt tokens | completion: 30 | The meeting lasts 35 minutes and 25 seconds, c
✅ 8,499 prompt tokens | completion: 24 | The meeting lasts 35 minutes and 25 seconds, for a
✅ 9,699 prompt tokens | completion: 34 | The meeting lasts 35 minutes and 25 seconds, c
```
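The sweep above was produced with roughly this kind of script (a sketch: the filler text, probe question, and step sizes are illustrative, not the exact ones used):

```python
import requests

ENDPOINT = "http://localhost:8080/v1/chat/completions"   # from the serve command above
QUESTION = "How long does the meeting last?"              # hypothetical probe question
FILLER = "This sentence is padding to inflate the prompt length. "  # hypothetical filler

def ask(prompt: str) -> dict:
    """Send one chat request and return the raw response JSON."""
    return requests.post(
        ENDPOINT,
        json={
            "model": "kimi-k2-instruct-0905",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
            "temperature": 0.0,
        },
        timeout=600,
    ).json()

# Grow the prompt by a fixed amount of filler per step and record whether the
# model still produces a non-empty answer.
for repeats in range(100, 2100, 200):
    resp = ask(FILLER * repeats + QUESTION)
    usage, choice = resp["usage"], resp["choices"][0]
    text = (choice["message"]["content"] or "").strip()
    mark = "OK  " if text else "FAIL"
    print(f"{mark} {usage['prompt_tokens']:>6} prompt tokens | "
          f"completion: {usage['completion_tokens']:>3} | {text[:50]}")
```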
```text
======================================================================
BINARY SEARCH FOR KIMI-K2 TOKEN THRESHOLD
❌ 12,099 prompt tokens | completion: 1 | (empty)
❌ 14,099 prompt tokens | completion: 1 | (empty)
✅ 16,099 prompt tokens | completion: 12 | The meeting lasts **35 minutes and 25 se
❌ 18,099 prompt tokens | completion: 1 | (empty)
✅ 20,099 prompt tokens | completion: 12 | The meeting lasts **35 minutes and 25 se
❌ 22,099 prompt tokens | completion: 1 | (empty)
✅ 24,099 prompt tokens | completion: 21 | The meeting lasts **35 minutes and 25 se
```
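Note that the failures above are not monotonic in prompt length (12K and 14K fail while 16K works), so a plain bisection over prompt size can only pin down a single transition. A sketch of such a bisection, with hypothetical filler text, probe question, and starting bounds:

```python
import requests

ENDPOINT = "http://localhost:8080/v1/chat/completions"
QUESTION = "How long does the meeting last?"                         # hypothetical
FILLER = "This sentence is padding to inflate the prompt length. "   # hypothetical

def works(repeats: int) -> bool:
    """True if the model returns a non-empty answer at this prompt size."""
    resp = requests.post(
        ENDPOINT,
        json={
            "model": "kimi-k2-instruct-0905",
            "messages": [{"role": "user", "content": FILLER * repeats + QUESTION}],
            "max_tokens": 64,
            "temperature": 0.0,
        },
        timeout=600,
    ).json()
    return bool((resp["choices"][0]["message"]["content"] or "").strip())

lo, hi = 700, 1600  # filler repeats for a known-good and a known-bad prompt size (hypothetical)
while hi - lo > 1:
    mid = (lo + hi) // 2
    if works(mid):
        lo = mid
    else:
        hi = mid
print(f"last good: {lo} filler repeats, first bad: {hi}")
```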