First few tokens being cut
Having an issue when the model is deployed on 4x PRO 6000 Blackwell cards in TP-4, following the recommended vLLM serve command.
It seems to be cutting the first few tokens from the streaming response, including the opening think tag, so I'm running into a lot of issues with applications not working properly, primarily opencode.
It seems to be roughly the first 4 tokens getting cut off. Instead of seeing something like " The user..." at the start of generation, it drops the beginning and just starts with "user...".
Anyone else experiencing this issue and potentially have a fix?
For some reason the XML think tag doesn't show up for me in the published post: there should be an opening think tag before "The user", but it doesn't appear.
Using latest vLLM nightly docker image
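In case it helps anyone reproduce, this is roughly how I've been checking the raw stream (a minimal sketch; the port is from my setup and the model name is a placeholder, so adjust both for your deployment):

# Minimal sketch: print the first few streamed deltas to see whether the
# opening tokens / think tag are missing. Assumes an OpenAI-compatible vLLM
# server on port 11433; "step-3.5-flash" is a placeholder model name.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11433/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="step-3.5-flash",  # placeholder; use your served model name
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)

for i, chunk in enumerate(stream):
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    # reasoning_content is a vLLM extra field; it only shows up when a
    # reasoning parser is active, otherwise everything lands in content.
    print(i, repr(getattr(delta, "reasoning_content", None)), repr(delta.content))
    if i >= 10:  # the first handful of deltas is enough to see the cut-off
        break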
Same issue on vLLM, 4xH100 (both vllm/vllm-openai:nightly-d88a1df699f68e5284fe3a3170f8ae292a3e9c3f and vllm/vllm-openai:nightly-4c4b6f7a9764bac8bf9f2a0bfedf852d8e59c98e)
I'm using FP8 quantization and the step3p5 tool-call and reasoning parsers.
Might be a bug in the reasoning parser?
That's kinda my thought too; I was thinking it might be something template- or tokenizer-related. I tried switching to the deepseek_r1 and minimax parsers. I can see the chat template appends the opening think tag.
The glm45 reasoning parser seems to work on the current nightly.
I'm still having issues with stray newlines in the content after thinking (something the step3p5 parser attempted to fix), as well as errors during tool parsing. This mostly happens with contexts greater than around 80k tokens.
Switched back to MiniMax for now. Much more reliable.
Yeah, I'm in the same boat as you. The newest vLLM nightly seems to fix the tokens being cut, but the reasoning parser still isn't working right. It's supposed to take everything before the closing think tag and mark it as reasoning. It seems to work on the first message of a new session in opencode, then breaks afterwards. We've also noticed lots of tool-call errors when using it via Cursor. I've been back on MiniMax as well for the time being.
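To spell out what I mean by "mark it as reasoning": since the chat template already appends the opening think tag, the model output normally contains only the closing tag, and the parser should split it roughly like this (a toy illustration, not the actual vLLM step3p5 parser code):

# Toy illustration of the expected split, not the real parser implementation.
def split_reasoning(raw_output: str, close_tag: str = "</think>") -> tuple[str, str]:
    # Everything before the closing tag should become reasoning_content,
    # everything after it the visible content.
    if close_tag in raw_output:
        reasoning, _, content = raw_output.partition(close_tag)
        return reasoning.strip(), content.strip()
    # No closing tag generated: treat it all as visible content.
    return "", raw_output.strip()


raw = "The user wants a Flappy Bird game...\n</think>\nNow let me create the files."
reasoning_content, content = split_reasoning(raw)
print(reasoning_content)  # The user wants a Flappy Bird game...
print(content)            # Now let me create the files.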
Thanks for sharing this — that’s really helpful context.
Would you be able to provide a bit more detail so I can try to reproduce the issue on my side?
In particular, it would help a lot to see:
- An example request body (you can redact anything sensitive)
- Whether streaming was enabled or disabled
- Whether you were using structured_output / JSON mode
In opencode it was not parsing the reasoning. On raw LLM API calls I noticed the reasoning content was being returned in the content field; sometimes there was no opening think tag, and sometimes the model generated one (probably because the template already includes it). I was attempting to use it in the opencode CLI and via Cursor with a custom OpenAI endpoint. I used the official vLLM command to run it, the only change being setting the context to 202,000 tokens. Hardware was 4x PRO 6000 Blackwell cards.
It seems like a weird bug in how the reasoning parser works. Tool calls worked mostly fine in opencode; occasionally I believe it would try to hallucinate a tool, but it would quickly recover. Cursor had a more difficult time though: very rarely, on some tool calls it throws a hard JSON schema error. I don't have any screenshots now, but should be able to provide some later today.
All calls were streaming responses.
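For reference, this is roughly the non-streaming check I was doing against the raw endpoint (a rough sketch; the base URL is from my setup and the model name is a placeholder):

# Rough sketch of the non-streaming check. Assumes the vLLM OpenAI-compatible
# endpoint on port 11433; "step-3.5-flash" is a placeholder model name.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11433/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="step-3.5-flash",  # placeholder; use your served model name
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
)

msg = resp.choices[0].message
# With the reasoning parser working, reasoning_content should be populated
# and content should hold only the final answer with no think tags.
print("reasoning_content:", repr(getattr(msg, "reasoning_content", None)))
print("content:", repr(msg.content))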
This is the docker compose file I'm using:
services:
  vllm-code:
    image: 'vllm/vllm-openai:nightly'
    container_name: vllm-code
    restart: unless-stopped
    ports:
      - '11433:11433'
    volumes:
      - '~/.cache/huggingface:/root/.cache/huggingface'
      - '/mnt/storage/models:/mnt/storage/models'
    ipc: host
    environment:
      - OMP_NUM_THREADS=16
      - LMCACHE_CHUNK_SIZE=256
      - LMCACHE_LOCAL_CPU=True
      - LMCACHE_MAX_LOCAL_CPU_SIZE=20.0
      - 'PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids:
                - '0'
                - '1'
                - '2'
                - '3'
              capabilities:
                - gpu
    entrypoint: 'python3 -m vllm.entrypoints.openai.api_server'
    command:
      - '--model'
      - /mnt/storage/models/Step-3.5-Flash-FP8
      - '--tensor-parallel-size'
      - '4'
      - '--host'
      - 0.0.0.0
      - '--port'
      - '11433'
      - '--quantization'
      - fp8
      - '--gpu-memory-utilization'
      - '0.90'
      - '--chat-template'
      - /mnt/storage/models/Step-3.5-Flash-FP8/chat_template.jinja
      - '--max-model-len'
      - '204800'
      - '--trust-remote-code'
      - '--tool-call-parser'
      - step3p5
      - '--reasoning-parser'
      - step3p5
      - '--enable-auto-tool-choice'
      - '--enable-expert-parallel'
      - '--disable-cascade-attn'
      - '--hf-overrides'
      - '{"num_nextn_predict_layers": 1}'
      - '--speculative_config'
      - '{"method": "step3p5_mtp", "num_speculative_tokens": 1}'
      - '--model-loader-extra-config'
      - '{"enable_multithread_load": true, "num_threads": 48}'
One thing I have noticed: when calling this model through Open WebUI it doesn't seem to think at all, though that might just be my configuration.
In opencode, the first message did not do any thinking, but the following ones somewhat did. I have the logs from the LiteLLM proxy I run in front of vLLM, so I can capture these. I can confirm this issue happens even when using the vLLM API endpoint directly without LiteLLM, so that is not the cause. Below is the response from that first message:
{
id:"chatcmpl-b985c646dae941fb",
model:"[company redacted]-code",
usage:{
total_tokens:12031,
prompt_tokens:11600,
completion_tokens:431,
prompt_tokens_details:null,
completion_tokens_details:{
text_tokens:null,
audio_tokens:null,
image_tokens:null,
reasoning_tokens:0,
accepted_prediction_tokens:null,
rejected_prediction_tokens:null
}
},
object:"chat.completion",
choices:[
{
index:0,
message:{
role:"assistant",
content:"I'll create a playable Flappy Bird game for you with separate HTML, CSS, and JavaScript files. Let me start by planning the implementation.",
tool_calls:[
{
id:"chatcmpl-tool-834a42c2d33b8aa9",
type:"function",
function:{
name:"todowrite",
arguments:"{"todos": [{"content": "Create HTML file with canvas element and game structure", "id": "1", "priority": "high", "status": "pending"}, {"content": "Create CSS file for styling and responsive layout", "id": "2", "priority": "high", "status": "pending"}, {"content": "Implement core game logic in JavaScript (bird physics, controls)", "id": "3", "priority": "high", "status": "pending"}, {"content": "Implement pipe generation and movement system", "id": "4", "priority": "high", "status": "pending"}, {"content": "Add collision detection and scoring", "id": "5", "priority": "high", "status": "pending"}, {"content": "Add game states (start, playing, game over) and restart functionality", "id": "6", "priority": "high", "status": "pending"}, {"content": "Test the game and verify all features work correctly", "id": "7", "priority": "medium", "status": "pending"}]}"{\"todos\": [{\"content\": \"Create HTML file with canvas element and game structure\", \"id\": \"1\", \"priority\": \"high\", \"status\": \"pending\"}, {\"content\": \"Create CSS file for styling and responsive layout\", \"id\": \"2\", \"priority\": \"high\", \"status\": \"pending\"}, {\"content\": \"Implement core game logic in JavaScript (bird physics, controls)\", \"id\": \"3\", \"priority\": \"high\", \"status\": \"pending\"}, {\"content\": \"Implement pipe generation and movement system\", \"id\": \"4\", \"priority\": \"high\", \"status\": \"pending\"}, {\"content\": \"Add collision detection and scoring\", \"id\": \"5\", \"priority\": \"high\", \"status\": \"pending\"}, {\"content\": \"Add game states (start, playing, game over) and restart functionality\", \"id\": \"6\", \"priority\": \"high\", \"status\": \"pending\"}, {\"content\": \"Test the game and verify all features work correctly\", \"id\": \"7\", \"priority\": \"medium\", \"status\": \"pending\"}]}""
}
}
],
function_call:null,
provider_specific_fields:null
},
finish_reason:"stop"
}
],
created:1770676739,
system_fingerprint:null
}
The following is the JSON response from the message it did right after. It is missing the opening think tag (and the reasoning is not placed in the reasoning_content field):
{
id:"chatcmpl-a2774e2447626393",
model:"[company redacted]-code",
usage:{
total_tokens:12531,
prompt_tokens:12175,
completion_tokens:356,
prompt_tokens_details:null,
completion_tokens_details:{
text_tokens:null,
audio_tokens:null,
image_tokens:null,
reasoning_tokens:0,
accepted_prediction_tokens:null,
rejected_prediction_tokens:null
}
},
object:"chat.completion",
choices:[
{
index:0,
message:{
role:"assistant",
content:"Good, I've created the todo list. Now let me start implementing the Flappy Bird game. I'll create the HTML, CSS, and JavaScript files one by one. Let me start with the HTML file.
</think>
Now let me create the three files for the Flappy Bird game.",
tool_calls:[
{
id:"chatcmpl-tool-b27691fc287728d6",
type:"function",
function:{
name:"write",
arguments:"{"content": "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n <meta charset=\"UTF-8\">\n <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">\n <title>Flappy Bird</title>\n <link rel=\"stylesheet\" href=\"style.css\">\n</head>\n<body>\n <div class=\"game-container\">\n <canvas id=\"gameCanvas\" width=\"400\" height=\"600\"></canvas>\n <div id=\"scoreDisplay\">Score: <span id=\"score\">0</span></div>\n <div id=\"startScreen\" class=\"overlay\">\n <h1>Flappy Bird</h1>\n <p>Press SPACE or Click to flap</p>\n <button id=\"startButton\">Start Game</button>\n </div>\n <div id=\"gameOverScreen\" class=\"overlay hidden\">\n <h1>Game Over</h1>\n <p>Final Score: <span id=\"finalScore\">0</span></p>\n <button id=\"restartButton\">Play Again</button>\n </div>\n </div>\n <script src=\"game.js\"></script>\n</body>\n</html>\n", "filePath": "C:\\Users\\[companyredacted]\\flappybird.html"}"
}
}
],
function_call:null,
provider_specific_fields:null
},
finish_reason:"stop"
}
],
created:1770676744,
system_fingerprint:null
}
Below is a snippet from the request sent from Cursor when it had the invalid tool call, along with the error it triggered.
{
name:"write",
role:"tool",
content:"Wrote contents to C:\Users\[companyredacted]\Documents\Development\testcalls\test_api.py.
When you're done with your current set of changes to this file, you should call the read_lints tool with the specific file path and fix any newly introduced errors.",
tool_call_id:"chatcmpl-tool-8d5040e319b7302b"
},
{
role:"assistant",
content:"The user wants me to create a test_api.py file that:
1. Uses the OpenAI Python library
2. Makes API calls to http://[ip redacted]:11434/v1 (this is a local Ollama instance)
3. Uses a random API key
4. Model name: vitruvix-code
5. Makes calls that trigger critical thinking
6. Saves the full raw response JSON for each call in separate JSON files
7. Must use streaming (or standard HTTP if better for capturing full streamed response)
I've created the file. Now I should check for linter errors and fix any issues. Let me read the lints for this file.
</think>
Now let me check for any linter errors in the file:",
tool_calls:[
{
id:"chatcmpl-tool-8991ffb88eaae95e",
type:"function",
index:0,
function:{
name:"read_lints",
arguments:"{"paths": ["C:\\Users\\[companyredacted]\\Documents\\Development\\testcalls\\test_api.py"]}"{\"paths\": [\"C:\\\\Users\\\\[companyredacted]\\\\Documents\\\\Development\\\\testcalls\\\\test_api.py\"]}""
}
}
]
},
{
name:"read_lints",
role:"tool",
content:"Error calling tool: Invalid JSON args passed in",
tool_call_id:"chatcmpl-tool-8991ffb88eaae95e"
}
],
This resulted in the following error in vLLM console:
(APIServer pid=1) ERROR 02-09 22:46:21 [serving.py:315] Error in preprocessing prompt inputs
(APIServer pid=1) ERROR 02-09 22:46:21 [serving.py:315] Traceback (most recent call last):
(APIServer pid=1) ERROR 02-09 22:46:21 [serving.py:315]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/serving.py", line 299, in render_chat_request
(APIServer pid=1) ERROR 02-09 22:46:21 [serving.py:315]     conversation, engine_prompts = await self._preprocess_chat(
(APIServer pid=1) ERROR 02-09 22:46:21 [serving.py:315]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 02-09 22:46:21 [serving.py:315]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/engine/serving.py", line 1018, in _preprocess_chat
(APIServer pid=1) ERROR 02-09 22:46:21 [serving.py:315]     (conversation,), (engine_prompt,) = await renderer.render_chat_async(
(APIServer pid=1) ERROR 02-09 22:46:21 [serving.py:315]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 02-09 22:46:21 [serving.py:315]   File "/usr/local/lib/python3.12/dist-packages/vllm/renderers/protocol.py", line 377, in render_chat_async
(APIServer pid=1) ERROR 02-09 22:46:21 [serving.py:315]     for conv, prompt in await asyncio.gather(*rendered):
(APIServer pid=1) ERROR 02-09 22:46:21 [serving.py:315]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 02-09 22:46:21 [serving.py:315]   File "/usr/local/lib/python3.12/dist-packages/vllm/renderers/hf.py", line 694, in render_messages_async
(APIServer pid=1) ERROR 02-09 22:46:21 [serving.py:315]     conversation, mm_data, mm_uuids = await parse_chat_messages_async(
(APIServer pid=1) ERROR 02-09 22:46:21 [serving.py:315]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 02-09 22:46:21 [serving.py:315]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/chat_utils.py", line 1580, in parse_chat_messages_async
(APIServer pid=1) ERROR 02-09 22:46:21 [serving.py:315]     _postprocess_messages(conversation)
(APIServer pid=1) ERROR 02-09 22:46:21 [serving.py:315]   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/chat_utils.py", line 1516, in _postprocess_messages
(APIServer pid=1) ERROR 02-09 22:46:21 [serving.py:315]     item["function"]["arguments"] = json.loads(content)
(APIServer pid=1) ERROR 02-09 22:46:21 [serving.py:315]     ^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 02-09 22:46:21 [serving.py:315]   File "/usr/lib/python3.12/json/__init__.py", line 346, in loads
(APIServer pid=1) ERROR 02-09 22:46:21 [serving.py:315]     return _default_decoder.decode(s)
(APIServer pid=1) ERROR 02-09 22:46:21 [serving.py:315]     ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 02-09 22:46:21 [serving.py:315]   File "/usr/lib/python3.12/json/decoder.py", line 341, in decode
(APIServer pid=1) ERROR 02-09 22:46:21 [serving.py:315]     raise JSONDecodeError("Extra data", s, end)
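For what it's worth, the "Extra data" error looks consistent with the duplicated arguments string in the assistant message above: when vLLM re-parses the tool call arguments with json.loads, a second JSON object appended after the first triggers exactly that exception. Minimal reproduction of just the json.loads behavior (simplified payload, not the vLLM code path):

# Simplified stand-in for the duplicated arguments string seen above
# (the real one repeats the full Windows path payload).
import json

duplicated_args = '{"paths": ["test_api.py"]}{"paths": ["test_api.py"]}'

try:
    json.loads(duplicated_args)
except json.JSONDecodeError as e:
    print(e)  # Extra data: line 1 column 27 (char 26)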
Direct API calls in Python to the vLLM endpoint seem to generate the tags properly, but they are just shown in content, not in the reasoning_content field.
I'm not 100% sure how some of these CLIs are making their calls to the model endpoint. Raw requests in Python (via the Python API, streaming or not) do mostly fine, other than the tags ending up in the content block instead of the reasoning_content field. If there's anything else I can assist with, let me know.
Also, I have run the docker compose without the extra env vars; I just added them back in since they didn't seem to affect the results in testing.

