Inconsistent results when using fp8?

#22
by songwang41 - opened
vllm serve /share5/projects/llm/models/weight/Kimi-K2-Instruct-0905 \
  --distributed-executor-backend ray \
  --tensor-parallel-size 16 \
  --host 0.0.0.0 --port 8080 \
  --served-model-name kimi-k2-instruct-0905 \
  --trust-remote-code \
  --max-model-len 131072 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.95 \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --calculate-kv-scales \
  --enable-auto-tool-choice \
  --tool-call-parser kimi_k2

I served kimi-k2-instruct-0905 on 16 H100 GPUs with the command above. When I run inference against the endpoint, I get some inconsistent results. Any clues? Is my hosting of the model correct?
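
Roughly how the endpoint is queried (a minimal sketch, not the exact client code; the host, port, and prompt are placeholders, and the model name comes from `--served-model-name` above):

```python
# Minimal sketch of a single request against the vLLM OpenAI-compatible server.
# Host/port and the prompt are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="kimi-k2-instruct-0905",  # matches --served-model-name
    messages=[{"role": "user", "content": "How long does the meeting last?"}],
    max_tokens=256,
    temperature=0,
)
choice = resp.choices[0]
print(choice.message.content)
print("finish_reason:", choice.finish_reason)
print("completion_tokens:", resp.usage.completion_tokens)
```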

The original prompt (11,701 tokens) consistently fails with kimi-k2:

- 5/5 attempts returned an empty response
- stop_reason: 163586 (appears to be an internal error code)
- completion_tokens: 1 (only generates 1 token before stopping)
Comparison:

| Prompt Type | Tokens | kimi-k2 | Claude | GPT-5 |
|---|---|---|---|---|
| Simple (same question) | 136 | ✅ Works | ✅ Works | ✅ Works |
| Original complex | 11,701 | ❌ Empty | ✅ Works | ✅ Works |
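
The 5/5 count comes from repeating the same request and counting empty completions. A sketch of that loop, assuming the `client` from the snippet above and a hypothetical `long_prompt` variable holding the original 11,701-token prompt:

```python
# Repeat the identical request and count empty responses.
# `long_prompt` is a hypothetical stand-in for the original 11,701-token prompt.
def count_empty(prompt: str, n: int = 5) -> int:
    empty = 0
    for _ in range(n):
        resp = client.chat.completions.create(
            model="kimi-k2-instruct-0905",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=256,
            temperature=0,
        )
        choice = resp.choices[0]
        if not (choice.message.content or "").strip():
            empty += 1
        # stop_reason is a vLLM-specific extra field on each choice
        print(f"completion_tokens={resp.usage.completion_tokens} "
              f"finish_reason={choice.finish_reason} "
              f"stop_reason={getattr(choice, 'stop_reason', None)}")
    return empty

print(f"{count_empty(long_prompt)}/5 attempts returned an empty response")
```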

Investigation with prompt length:

======================================================================
FINDING KIMI-K2 TOKEN THRESHOLD (8K-15K Range)

✅ 1,299 prompt tokens | completion: 12 | The meeting lasts 35 minutes and 25 seconds.
✅ 2,499 prompt tokens | completion: 35 | The meeting lasts 35 minutes and 25 seconds, c
✅ 3,699 prompt tokens | completion: 22 | The meeting lasts 35 minutes and 25 seconds, i
✅ 4,899 prompt tokens | completion: 35 | The meeting lasts 35 minutes and 25 seconds, c
✅ 6,099 prompt tokens | completion: 22 | The meeting lasts 35 minutes and 25 seconds, i
✅ 7,299 prompt tokens | completion: 30 | The meeting lasts 35 minutes and 25 seconds, c
✅ 8,499 prompt tokens | completion: 24 | The meeting lasts 35 minutes and 25 seconds, for a
✅ 9,699 prompt tokens | completion: 34 | The meeting lasts 35 minutes and 25 seconds, c
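
The sweep above pads a fixed question with filler text until the prompt reaches a target token count. A sketch of that harness, assuming the `client` from the first snippet; the filler text, question wording, and local tokenizer path are illustrative assumptions, and token counts are approximate:

```python
# Build prompts of roughly the target length and probe the endpoint at each step.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "/share5/projects/llm/models/weight/Kimi-K2-Instruct-0905",
    trust_remote_code=True,
)
FILLER = "The meeting started at 09:00:00 and ended at 09:35:25. Attendees reviewed the agenda. "
QUESTION = "How long did the meeting last?"

def build_prompt(target_tokens: int) -> str:
    body = ""
    while len(tok.encode(body + QUESTION)) < target_tokens:
        body += FILLER
    return body + QUESTION

for target in range(1_299, 9_700, 1_200):
    resp = client.chat.completions.create(
        model="kimi-k2-instruct-0905",
        messages=[{"role": "user", "content": build_prompt(target)}],
        max_tokens=64,
        temperature=0,
    )
    text = (resp.choices[0].message.content or "").strip()
    mark = "OK   " if text else "EMPTY"
    print(f"{mark} {resp.usage.prompt_tokens:>6} prompt tokens | "
          f"completion: {resp.usage.completion_tokens} | {text[:50]}")
```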

======================================================================
BINARY SEARCH FOR KIMI-K2 TOKEN THRESHOLD

❌ 12,099 prompt tokens | completion: 1 | (empty)
❌ 14,099 prompt tokens | completion: 1 | (empty)
✅ 16,099 prompt tokens | completion: 12 | The meeting lasts **35 minutes and 25 se
❌ 18,099 prompt tokens | completion: 1 | (empty)
✅ 20,099 prompt tokens | completion: 12 | The meeting lasts **35 minutes and 25 se
❌ 22,099 prompt tokens | completion: 1 | (empty)
✅ 24,099 prompt tokens | completion: 21 | The meeting lasts **35 minutes and 25 se
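
The second pass above steps through the 12k-24k range in 2,000-token increments rather than doing a strict bisection; a minimal bisection between the last good and first failing length might look like the sketch below, assuming `client` and `build_prompt()` from the earlier snippets. Note that the alternating pass/fail results (16,099 and 20,099 work while 18,099 does not) suggest the failure is not a clean monotonic threshold, so this only narrows the region of interest:

```python
# Rough bisection between a known-good prompt length and a failing one.
# Caveat: the results above alternate, so a hard length threshold may not exist.
def empty_at(target_tokens: int) -> bool:
    resp = client.chat.completions.create(
        model="kimi-k2-instruct-0905",
        messages=[{"role": "user", "content": build_prompt(target_tokens)}],
        max_tokens=64,
        temperature=0,
    )
    return not (resp.choices[0].message.content or "").strip()

lo, hi = 9_699, 12_099  # last good length from the sweep / first observed failure
while hi - lo > 500:
    mid = (lo + hi) // 2
    if empty_at(mid):
        hi = mid  # failure reproduced: search lower
    else:
        lo = mid  # still fine: search higher
print(f"empty responses start somewhere between {lo} and {hi} prompt tokens")
```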

songwang41 changed discussion title from "Inconsistent results" to "Inconsistent results when using fp8?"
