OCR Request fails the first time, works second time
Hi, I deployed the model with vLLM in a Kubernetes cluster (Docker image vllm/vllm-openai:nightly-ca00b1bfc69e71d860485340f0a197bf584ec004). Here is a sample curl request:
```shell
curl --retry 3 --retry-delay 2 --retry-all-errors --location 'https://deepseek-ocr..../v1/chat/completions' \
  --header 'Content-Type: application/json' \
  --data '{
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://ofasys-multimodal-wlcb-3-toshanghai.oss-accelerate.aliyuncs.com/wpf272043/keepme/image/receipt.png"
            }
          },
          {
            "type": "text",
            "text": "Free OCR."
          }
        ]
      }
    ],
    "model": "deepseek-ai/DeepSeek-OCR",
    "max_tokens": 2048,
    "temperature": 0.0,
    "skip_special_tokens": false,
    "vllm_xargs": {
      "ngram_size": 30,
      "window_size": 90,
      "whitelist_token_ids": [128821, 128822]
    }
  }'
```
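For reference, the same request can be built and sent from Python using only the standard library (a sketch; the base URL is elided here just as it is in the curl command above):

```python
import json
from urllib import request


def build_payload(image_url: str, prompt: str = "Free OCR.") -> dict:
    """Mirror the --data body of the curl request above."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "model": "deepseek-ai/DeepSeek-OCR",
        "max_tokens": 2048,
        "temperature": 0.0,
        "skip_special_tokens": False,
        "vllm_xargs": {
            "ngram_size": 30,
            "window_size": 90,
            "whitelist_token_ids": [128821, 128822],
        },
    }


def post_ocr(base_url: str, image_url: str) -> dict:
    """POST to the OpenAI-compatible chat completions endpoint."""
    body = json.dumps(build_payload(image_url)).encode()
    req = request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```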
The problem: the request fails with a 500 Internal Server Error on the first attempt and succeeds on the second try. This is strange behaviour.
Logs (note the 500 error in the middle and the 200 OK at the end):
```
(APIServer pid=1) DEBUG 11-18 07:06:36 [v1/metrics/loggers.py:221] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 65.0%, MM cache hit rate: 66.7%
(APIServer pid=1) DEBUG 11-18 07:06:39 [v1/engine/async_llm.py:654] Called check_health.
(APIServer pid=1) INFO: 10.129.4.2:49984 - "GET /health HTTP/1.1" 200 OK
(APIServer pid=1) DEBUG 11-18 07:06:39 [v1/engine/async_llm.py:654] Called check_health.
(APIServer pid=1) INFO: 10.129.4.2:49970 - "GET /health HTTP/1.1" 200 OK
(APIServer pid=1) DEBUG 11-18 07:06:44 [entrypoints/chat_utils.py:484] Failed to load AutoTokenizer chat template for deepseek-ai/DeepSeek-OCR
(APIServer pid=1) DEBUG 11-18 07:06:44 [entrypoints/chat_utils.py:484] Traceback (most recent call last):
(APIServer pid=1) DEBUG 11-18 07:06:44 [entrypoints/chat_utils.py:484]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/chat_utils.py", line 482, in resolve_hf_chat_template
(APIServer pid=1) DEBUG 11-18 07:06:44 [entrypoints/chat_utils.py:484]     return tokenizer.get_chat_template(chat_template, tools=tools)
(APIServer pid=1) DEBUG 11-18 07:06:44 [entrypoints/chat_utils.py:484]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) DEBUG 11-18 07:06:44 [entrypoints/chat_utils.py:484]   File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py", line 1824, in get_chat_template
(APIServer pid=1) DEBUG 11-18 07:06:44 [entrypoints/chat_utils.py:484]     raise ValueError(
(APIServer pid=1) DEBUG 11-18 07:06:44 [entrypoints/chat_utils.py:484] ValueError: Cannot use chat template functions because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating
(APIServer pid=1) DEBUG 11-18 07:06:44 [entrypoints/chat_utils.py:484] Failed to load AutoTokenizer chat template for deepseek-ai/DeepSeek-OCR
(APIServer pid=1) DEBUG 11-18 07:06:44 [entrypoints/chat_utils.py:484] Traceback (most recent call last):
(APIServer pid=1) DEBUG 11-18 07:06:44 [entrypoints/chat_utils.py:484]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/chat_utils.py", line 482, in resolve_hf_chat_template
(APIServer pid=1) DEBUG 11-18 07:06:44 [entrypoints/chat_utils.py:484]     return tokenizer.get_chat_template(chat_template, tools=tools)
(APIServer pid=1) DEBUG 11-18 07:06:44 [entrypoints/chat_utils.py:484]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) DEBUG 11-18 07:06:44 [entrypoints/chat_utils.py:484]   File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py", line 1824, in get_chat_template
(APIServer pid=1) DEBUG 11-18 07:06:44 [entrypoints/chat_utils.py:484]     raise ValueError(
(APIServer pid=1) DEBUG 11-18 07:06:44 [entrypoints/chat_utils.py:484] ValueError: Cannot use chat template functions because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating
(APIServer pid=1) DEBUG 11-18 07:06:46 [v1/metrics/loggers.py:221] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 65.0%, MM cache hit rate: 66.7%
(APIServer pid=1) DEBUG 11-18 07:06:49 [v1/engine/async_llm.py:654] Called check_health.
(APIServer pid=1) INFO: 10.129.4.2:40206 - "GET /health HTTP/1.1" 200 OK
(APIServer pid=1) DEBUG 11-18 07:06:49 [v1/engine/async_llm.py:654] Called check_health.
(APIServer pid=1) INFO: 10.129.4.2:40212 - "GET /health HTTP/1.1" 200 OK
(APIServer pid=1) INFO: 10.130.0.2:56758 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=1) DEBUG 11-18 07:06:51 [entrypoints/chat_utils.py:484] Failed to load AutoTokenizer chat template for deepseek-ai/DeepSeek-OCR
(APIServer pid=1) DEBUG 11-18 07:06:51 [entrypoints/chat_utils.py:484] Traceback (most recent call last):
(APIServer pid=1) DEBUG 11-18 07:06:51 [entrypoints/chat_utils.py:484]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/chat_utils.py", line 482, in resolve_hf_chat_template
(APIServer pid=1) DEBUG 11-18 07:06:51 [entrypoints/chat_utils.py:484]     return tokenizer.get_chat_template(chat_template, tools=tools)
(APIServer pid=1) DEBUG 11-18 07:06:51 [entrypoints/chat_utils.py:484]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) DEBUG 11-18 07:06:51 [entrypoints/chat_utils.py:484]   File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py", line 1824, in get_chat_template
(APIServer pid=1) DEBUG 11-18 07:06:51 [entrypoints/chat_utils.py:484]     raise ValueError(
(APIServer pid=1) DEBUG 11-18 07:06:51 [entrypoints/chat_utils.py:484] ValueError: Cannot use chat template functions because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating
(APIServer pid=1) DEBUG 11-18 07:06:51 [entrypoints/chat_utils.py:484] Failed to load AutoTokenizer chat template for deepseek-ai/DeepSeek-OCR
(APIServer pid=1) DEBUG 11-18 07:06:51 [entrypoints/chat_utils.py:484] Traceback (most recent call last):
(APIServer pid=1) DEBUG 11-18 07:06:51 [entrypoints/chat_utils.py:484]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/chat_utils.py", line 482, in resolve_hf_chat_template
(APIServer pid=1) DEBUG 11-18 07:06:51 [entrypoints/chat_utils.py:484]     return tokenizer.get_chat_template(chat_template, tools=tools)
(APIServer pid=1) DEBUG 11-18 07:06:51 [entrypoints/chat_utils.py:484]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) DEBUG 11-18 07:06:51 [entrypoints/chat_utils.py:484]   File "/usr/local/lib/python3.12/dist-packages/transformers/tokenization_utils_base.py", line 1824, in get_chat_template
(APIServer pid=1) DEBUG 11-18 07:06:51 [entrypoints/chat_utils.py:484]     raise ValueError(
(APIServer pid=1) DEBUG 11-18 07:06:51 [entrypoints/chat_utils.py:484] ValueError: Cannot use chat template functions because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating
(EngineCore_DP0 pid=24) DEBUG 11-18 07:06:54 [v1/engine/core.py:893] EngineCore loop active.
(EngineCore_DP0 pid=24) DEBUG 11-18 07:06:54 [v1/engine/core.py:887] EngineCore waiting for work.
(APIServer pid=1) INFO: 10.130.0.2:56758 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 11-18 07:06:56 [v1/metrics/loggers.py:221] Engine 000: Avg prompt throughput: 27.8 tokens/s, Avg generation throughput: 5.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 67.4%, MM cache hit rate: 71.4%
```
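The DEBUG traceback says `tokenizer.chat_template is not set and no template argument was passed`, so one thing that might be worth trying is supplying a chat template explicitly at server startup. This is only a sketch: the template file path below is a placeholder, not a file that ships with the model, and I haven't confirmed this resolves the first-request 500.

```shell
# Hypothetical workaround: pass a Jinja chat template to vLLM's OpenAI-compatible
# server so resolve_hf_chat_template does not fall back to the missing
# tokenizer.chat_template. The .jinja path is a placeholder.
vllm serve deepseek-ai/DeepSeek-OCR \
  --chat-template /path/to/deepseek_ocr_template.jinja
```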