IQ5_K 136.891 GiB

#9
by Hunterx - opened

Okay, tested it a bit. It seems kind of like MiniMax M2.1: very fast on hybrid CPU/GPU.
It thinks a lot though. Also FYI, this model gets confused very easily. Simple SVG drawings it will do well, but on more complex tasks it will just think for 20k tokens and still give nothing good, whereas MiniMax will at least give a decent result after a lot of token gen... Still have to play with it a bit. Not sure if it's better as instruct without the overthinking.

The crazy part is that I'm running 256k q8_0 context fairly easily somehow!!!

Xeon 8480+ QYFS, W790E Sage, 512 GB RAM, 2x 3090s

Somehow the 256k q8_0 KV cache fits in VRAM along with the offloaded layers:

```bash
./build/bin/llama-server \
    --model "/home/xeon/ik_llama.cpp/models/Step-3.5-Flash-IQ5_K-00001-of-00004.gguf" \
    --alias "Step-Fun-3.5-flash IQ5_K" \
    -c 256000 -ctk q8_0 -ctv q8_0 \
    -b 4096 \
    -amb 1024 \
    -mla 3 \
    -fa on \
    -ub 4096 \
    -ngl 99 \
    -sm layer \
    -gr \
    -smgs \
    -ger \
    --n-cpu-moe 54 \
    -ts 1,1 \
    --parallel 2 \
    --threads 54 \
    --host 0.0.0.0 \
    --port 8080 \
    --merge-qkv \
    --mirostat 2 \
    --mirostat-ent 2 \
    --mirostat-lr 0.05 \
    --jinja
```


OK, edit: just playing around with it in OpenClaw and it's really good at tool calls, and it did a very good deep research report. As an agent that's fast and can search with a lot of repeated tool calls, this one is what you want. Is it the smartest vs Gemini Flash 3? No. It still didn't proactively read through all the memories and databases to figure out answers to questions. But otherwise it drives very well and is very responsive.

Great, I was wondering how it would do with tool calling / agentic stuff. Looks like you're using the built-in jinja template (I included the original version), or did you find a custom one (another discussion has one that was helping people running mainline)?

Oh, does -sm graph work with this one yet? I'll check now... https://github.com/ikawrakow/ik_llama.cpp/pull/1236 yes!

Definitely try that given you have 2x GPUs. I've noticed that even doing hybrid CPU+GPU it helps if you have >=2 GPUs.
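Something like this is what I mean, just swapping the split-mode flag in your command (a trimmed-down, untested sketch; keep the rest of your flags as before):

```bash
# Same idea as the launch above, but with "-sm graph" instead of "-sm layer"
# (untested sketch; add back -gr/-smgs/-ger/--merge-qkv etc. as in the original command)
./build/bin/llama-server \
    --model "/home/xeon/ik_llama.cpp/models/Step-3.5-Flash-IQ5_K-00001-of-00004.gguf" \
    -c 256000 -ctk q8_0 -ctv q8_0 \
    -b 4096 -ub 4096 -amb 1024 -mla 3 -fa on \
    -ngl 99 --n-cpu-moe 54 -ts 1,1 \
    -sm graph \
    --threads 54 --host 0.0.0.0 --port 8080 --jinja
```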

Gonna get some llama-sweep-bench tests going later today!
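Roughly something like this, reusing the same loading flags as the server (illustrative sketch only, not the exact run I'll end up posting):

```bash
# Rough llama-sweep-bench sketch reusing the server's loading flags.
# Illustrative only: the context size and thread count here are placeholders, not tuned values.
./build/bin/llama-sweep-bench \
    --model /path/to/Step-3.5-Flash-IQ5_K-00001-of-00004.gguf \
    -c 32768 -ctk q8_0 -ctv q8_0 \
    -fa on -mla 3 \
    -ngl 99 --n-cpu-moe 54 -ts 1,1 \
    --threads 54
```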

sweep-bench-Step-3.5-Flash

It's exploding with OpenClaw regex btw, and yes, with -sm graph a 290 tokens/s prefill is insane!

```
INFO [ release_slots] slot released | tid="132193615597568" timestamp=1770774507 id_slot=0 id_task=36844 n_ctx=256000 n_past=17330 n_system_tokens=0 n_cache_tokens=17330 truncated=false
INFO [ slots_idle] all slots are idle | tid="132193615597568" timestamp=1770774507
slot print_timing: id 0 | task -1 |
prompt eval time = 2668.46 ms / 775 tokens ( 3.44 ms per token, 290.43 tokens per second)
       eval time = 13336.58 ms / 222 tokens ( 60.07 ms per token, 16.65 tokens per second)
      total time = 16005.04 ms / 997 tokens
INFO [ log_server_request] request | tid="131988832903168" timestamp=1770774507 remote_addr="100.74.165.82" remote_port=57273 status=200 method="POST" path="/v1/chat/completions" params={}
======== Prompt cache: cache size: 17330, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 1.00, cache_ram_similarity: 0.50
  • looking for better prompt, base f_keep = 0.995, sim = 0.985, n_keep = 0, n_discarded_prompt = 0
  • cache state: 1 prompts, 6366.494 MiB (limits: 8192.000 MiB, 0 tokens, 87716 est)
    • prompt 0x5f6c9ec390c0: 68170 tokens, 0 discarded, checkpoints: 0, 6366.494 MiB
prompt cache load took 8.31 ms
terminate called after throwing an instance of 'std::regex_error'
  what():  Number of NFA states exceeds limit. Please use shorter regex string, or use smaller brace expression, or make _GLIBCXX_REGEX_STATE_LIMIT larger.
Aborted (core dumped)
```
```bash
xeon@xeon-System-Product-Name:~/ik_llama.cpp$ # Check the tokenizer config more thoroughly
python3 << 'EOF'
import json
import sys

# Try to extract metadata from GGUF
try:
    with open('/home/xeon/ik_llama.cpp/models/Step-3.5-Flash-IQ5_K-00001-of-00004.gguf', 'rb') as f:
        # Read first 1MB to find metadata
        data = f.read(1024*1024)
        # Look for chat_template or tokenizer configs
        if b'chat_template' in data:
            idx = data.find(b'chat_template')
            print("Found chat_template at offset:", idx)
            print(data[idx:idx+500])
        if b'tokenizer' in data:
            idx = data.find(b'tokenizer')
            print("\nFound tokenizer at offset:", idx)
            print(data[idx:idx+500])
except Exception as e:
    print(f"Error: {e}")
EOF
```

Found tokenizer at offset: 2564
b'tokenizer.ggml.model\x08\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00gpt2\x12\x00\x00\x00\x00\x00\x00\x00tokenizer.ggml.pre\x08\x00\x00\x00\x0b\x00\x00\x00\x00\x00\x00\x00deepseek-v3\x15\x00\x00\x00\x00\x00\x00\x00tokenizer.ggml.tokens\t\x00\x00\x00\x08\x00\x00\x00\x80\xf7\x01\x00\x00\x00\x00\x00\x1d\x00\x00\x00\x00\x00\x00\x00<\xef\xbd\x9cbegin\xe2\x96\x81of\xe2\x96\x81sentence\xef\xbd\x9c>\x1b\x00\x00\x00\x00\x00\x00\x00<\xef\xbd\x9cend\xe2\x96\x81of\xe2\x96\x81sentence\xef\xbd\x9c>\x11\x00\x00\x00\x00\x00\x00\x00<\xef\xbd\x9c\xe2\x96\x81pad\xe2\x96\x81\xef\xbd\x9c>\x01\x00\x00\x00\x00\x00\x00\x00!\x01\x00\x00\x00\x00\x00\x00\x00"\x01\x00\x00\x00\x00\x00\x00\x00#\x01\x00\x00\x00\x00\x00\x00\x00$\x01\x00\x00\x00\x00\x00\x00\x00%\x01\x00\x00\x00\x00\x00\x00\x00&\x01\x00\x00\x00\x00\x00\x00\x00'\x01\x00\x00\x00\x00\x00\x00\x00(\x01\x00\x00\x00\x00\x00\x00\x00)\x01\x00\x00\x00\x00\x00\x00\x00*\x01\x00\x00\x00\x00\x00\x00\x00+\x01\x00\x00\x00\x00\x00\x00\x00,\x01\x00\x00\x00\x00\x00\x00\x00-\x01\x00\x00\x00\x00\x00\x00\x00.\x01\x00\x00\x00\x00\x00\x00\x00/\x01\x00\x00\x00\x00\x00\x00\x000\x01\x00\x00\x00\x00\x00\x00\x001\x01\x00\x00\x00\x00\x00\x00\x002\x01\x00\x00\x00\x00\x00\x00\x003\x01\x00\x00\x00\x00\x00\x00\x004\x01\x00\x00\x00\x00\x00\x00\x005\x01\x00\x00\x00\x00\x00\x00\x006\x01\x00\x00\x00\x00\x00\x00\x007\x01\x00\x00\x00\x00\x00\x00\x008\x01\x00\x00\x00\x00\x00\x00\x009\x01\x00\x00\x00\x00\x00\x00\x00:\x01\x00\x00\x00\x00\x00\x00\x00;\x01\x00\x00\x00\x00\x00\x00\x00<\x01\x00\x00\x00\x00\x00\x00\x00=\x01\x00\x00\x00\x00\x00\x00\x00>\x01\x00\x00'
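For a cleaner look than grepping raw bytes, something like the gguf-py reader should pull the embedded template out directly (untested sketch; it assumes the usual GGUFReader field layout where a string field's bytes sit at parts[data[0]]):

```python
# Untested sketch: read the embedded chat template with gguf-py instead of scanning raw bytes.
# Assumes the usual GGUFReader layout where a string field's raw bytes live at parts[data[0]].
from gguf import GGUFReader

reader = GGUFReader('/home/xeon/ik_llama.cpp/models/Step-3.5-Flash-IQ5_K-00001-of-00004.gguf')
field = reader.fields.get('tokenizer.chat_template')
if field is not None:
    template = bytes(field.parts[field.data[0]]).decode('utf-8')
    print(template)
else:
    print('No tokenizer.chat_template key found in this GGUF part')
```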

> Great, I was wondering how it would do with tool calling / agentic stuff. Looks like you're using the built-in jinja template (I included the original version), or did you find a custom one (another discussion has one that was helping people running mainline)?

I'm using --jinja, whatever is in there. It's a real beast at tool calls; I had it go on quite a ride doing deep search with the browser UI as well as direct pathing through the internet and DuckDuckGo, which filled the context and it was still going... except for the regex explosion issues above. Going to try reducing the pathing to see if that helps:

```bash
export _GLIBCXX_REGEX_STATE_LIMIT=1000000 &&
./build/bin/llama-server \
    --model "/home/xeon/ik_llama.cpp/models/Step-3.5-Flash-IQ5_K-00001-of-00004.gguf" \
    --alias "Step-Fun-3.5-flash" \
    --slot-save-path "/tmp/claw_cache/mem" \
    --prompt-cache "/tmp/claw_cache/mem/step_35_base.bin" \
    --prompt-cache-all \
    -c 256000 -ctk q8_0 -ctv q8_0 \
    -b 4096 \
    -amb 2048 \
    -mla 3 \
    -fa on \
    -ub 4096 \
    -ngl 99 \
    -sm graph \
    -gr \
    -smgs \
    -ger \
    --n-cpu-moe 99 \
    -ts 1,1 \
    --parallel 1 \
    --threads 88 \
    --host 0.0.0.0 \
    --port 8080 \
    --merge-qkv \
    --jinja \
    --mirostat 2 \
    --mirostat-lr 0.05
```

NOTE! This is my favorite model now! I'm dropping GLM 4.7 Q8_0 and Kimi K2.5 IQ3 for this IQ5 btw... I'm sorry, this is just so fast and doing so well with tool calling and logic. It won't do the most complex things in the world (the PS4 controller rendering stuff), but what it does well is being an agent and doing it very fast without hallucinating (a bit of exposition, but what can you do; still better than Gemini 3 Flash claiming it saved code when it only hallucinated saving it). Oh, and the cache hits are really good. I couldn't use MiniMax M2.1 because it just misses the cache and you're stuck waiting for another 50k tokens, which was so painful. I would rather have 6 tok/s tgen like GLM 4.7 with cache hits than 20 tok/s tgen and cache misses with MiniMax M2.1.

@Hunterx

Sweet! Yes, I'm hitting that _GLIBCXX_REGEX_STATE_LIMIT error too, which happens when opencode is trying to do some big grepping or searching of large text files. Maybe the default value is something like 100000, so I'll try setting it 10x bigger to 1000000 with the env var like you show above.

I've found it is quite good with opencode running my smol-IQ3_KS 75.934 GiB (3.312 BPW) in full GPU offload across 2x A6000s with almost 128k context. So I've been mostly using that for speed, and dropping back to Kimi-K2.5-Q4_X to start off a new project or for things that require less context (as it can be slower).

Curious to see how the new GLM5, DS4, and round of Qwen's do (hoping they still release open weights lol)...

EDIT

One more thing, have you tried any of the speculative decoding features: https://github.com/ikawrakow/ik_llama.cpp/pull/1261

e.g. something like --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64 might help speed it up for repetitive tasks. Not sure how to tune it though.
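i.e. just tack those onto your existing launch, something like this (untuned sketch; the draft sizes are the same guesses as above):

```bash
# Untuned sketch: append the ngram speculative-decoding flags to the usual launch
./build/bin/llama-server \
    --model "/home/xeon/ik_llama.cpp/models/Step-3.5-Flash-IQ5_K-00001-of-00004.gguf" \
    -c 256000 -ctk q8_0 -ctv q8_0 -fa on -mla 3 \
    -ngl 99 --n-cpu-moe 99 -sm graph -ts 1,1 --jinja \
    --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64
```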

> One more thing, have you tried any of the speculative decoding features: https://github.com/ikawrakow/ik_llama.cpp/pull/1261
>
> e.g. something like --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64 might help speed it up for repetitive tasks. Not sure how to tune it though.

I haven't tried ngram yet. Might start looking into tuning it if I figure it out.


> Curious to see how the new GLM5, DS4, and round of Qwen's do (hoping they still release open weights lol)...

GLM 5 is out... It's very similar to Kimi K2.5 based on the few tests I ran for performance on the platform. Was expecting better... Now DS4 and Qwen 3.5... oh, also MiniMax 2.5 is out too.

For me on ik_llama and your IQ4_XS quant the tool parser is still broken! :(
P.S. I don't know what the secret sauce in https://github.com/pwilkin/llama.cpp/tree/autoparser is but I wish it could be ported to ik_llama so I could use the model with -sm graph and have tool calling working ok.

@Hunterx

Yeah, a lot of models released this week, maybe for Lunar New Year? I'll hopefully get to GLM5 by this weekend, but it is gonna be slower given more active weights. And yeah, kimi-k2-q4_x is only ~540GB and is "full quality" given they QAT'd it, which is nice for a big model...

@dehnhaide

I've had luck using these quants with ik and tool parsing, specifically with opencode using the built-in template. However, here is another template some folks have suggested you can try: https://huggingface.co/ubergarm/Step-3.5-Flash-GGUF/discussions/1#69878ca7ae66ac235fc2ca95

I use it with -sm graph, and the Step-3.5-Flash smol-IQ3_KS works pretty well in 96GB VRAM for agentic stuff. I was getting an error about regex and just increased that env var. What is the exact error you are seeing, and what client and full ik_llama.cpp command are you using, if you want to try to debug it?

> @dehnhaide
>
> I've had luck using these quants with ik and tool parsing, specifically with opencode using the built-in template. However, here is another template some folks have suggested you can try: https://huggingface.co/ubergarm/Step-3.5-Flash-GGUF/discussions/1#69878ca7ae66ac235fc2ca95
>
> I use it with -sm graph, and the Step-3.5-Flash smol-IQ3_KS works pretty well in 96GB VRAM for agentic stuff. I was getting an error about regex and just increased that env var. What is the exact error you are seeing, and what client and full ik_llama.cpp command are you using, if you want to try to debug it?

Never mind... I was proxying the model via litellm (for some statistics) but I am done with it because it only causes issues down the path. I still see some output like:
"Let me check if there are any actual test files in the project (excluding node_modules).
<function=glob<|im_end|>"
in OpenCode, but it doesn't seem to break anything.

When I serve the model with the "autoparser" llama.cpp fork, everything is a clean slate, but I love the extra speed I get from "-sm graph" on ik_llama on 6x RTX 3090!
I will also try later with the chat template you've proposed + ik_llama and your quant and see how it delivers.

@dehnhaide

Some folks have also been telling me to try pi.dev instead of opencode or claude code or codex etc. Then again, today I hear to try this new agent dev tool: http://blog.can.ac/2026/02/12/the-harness-problem/

hard to keep up haha

https://www.tiktok.com/@startupcode.net/video/7605360360727547150

> @dehnhaide
>
> Some folks have also been telling me to try pi.dev instead of opencode or claude code or codex etc. Then again, today I hear to try this new agent dev tool: http://blog.can.ac/2026/02/12/the-harness-problem/
>
> hard to keep up haha
>
> https://www.tiktok.com/@startupcode.net/video/7605360360727547150

Maddening indeed... I have turned into a TUI guinea pig lately: I already have 5 TUIs x 3 local LLMs = 15 projects that I compare until I lose breath, mind, memory and focus... and then start back again! Is this the new geek drug?
OpenCode - Claude - Mistral Vibe - Factory Droid - OMP

@ubergarm

Apparently the original tool-calling template in the first release (which I think most of your quants were based on) was broken. They fixed it 8 days ago, and people have had great luck with it since then. See https://huggingface.co/stepfun-ai/Step-3.5-Flash-GGUF-Q4_K_S/discussions/16

Since I think you keep most of the template stuff in the small 1st GGUF part, could you consider porting that into these quants? I think that file alone would be the only (relatively small) change needed. Otherwise, I've been really impressed with the IQ4_XS.

@JDWarner

Oof, thanks for the heads up... oddly they didn't change the chat_template.jinja, but you are correct that they re-uploaded a GGUF with an embedded chat template that apparently has changed (I didn't diff it to see the details).

I could probably change that and re-upload just the first part, but I'll just be lazy for now and put a note in the model card with a link to it. Thanks for letting me know so I can warn folks; they'll just have to grab a known-working jinja, e.g. copy-paste the official GGUF's embedded chat template, and use --chat-template-file myTemplate.jinja.

I'll update the README now!

In case it's helpful for anyone else, I extracted the template from the current (as of writing) official quant's first GGUF partfile so you don't have to download 9.5 GB.

This apparently fixes template issues, including some tool calling, with the flags --jinja --chat-template-file myTemplate.jinja (see the launch sketch after the template below).

Though other reports suggest that really making it reliable requires the autoparsing branch of llama.cpp (unsure if anything like this is in ik-llama.cpp): https://github.com/pwilkin/llama.cpp/tree/autoparser

```jinja
{% macro render_content(content) %}{% if content is none %}{{- '' }}{% elif content is string %}{{- content }}{% elif content is mapping %}{{- content['value'] if 'value' in content else content['text'] }}{% elif content is iterable %}{% for item in content %}{% if item.type == 'text' %}{{- item['value'] if 'value' in item else item['text'] }}{% elif item.type == 'image' %}<im_patch>{% endif %}{% endfor %}{% endif %}{% endmacro %}
{{bos_token}}{%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0].role == 'system' %}
        {{- render_content(messages[0].content) + '\n\n' }}
    {%- endif %}
    {{- "# Tools\n\nYou have access to the following functions in JSONSchema format:\n\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson(ensure_ascii=False) }}
    {%- endfor %}
    {{- "\n</tools>\n\nIf you choose to call a function ONLY reply in the following format with NO suffix:\n\n<tool_call>\n<function=example_function_name>\n<parameter=example_parameter_1>\nvalue_1\n</parameter>\n<parameter=example_parameter_2>\nThis is the value for the second parameter\nthat can span\nmultiple lines\n</parameter>\n</function>\n</tool_call>\n\n<IMPORTANT>\nReminder:\n- Function calls MUST follow the specified format: an inner <function=...>\n...\n</function> block must be nested within <tool_call>\n...\n</tool_call> XML tags\n- Required parameters MUST be specified\n</IMPORTANT><|im_end|>\n" }}
{%- else %}
    {%- if messages[0].role == 'system' %}
        {{- '<|im_start|>system\n' + render_content(messages[0].content) + '<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
    {%- set index = (messages|length - 1) - loop.index0 %}
    {%- if ns.multi_step_tool and message.role == "user" and render_content(message.content) is string and not(render_content(message.content).startswith('<tool_response>') and render_content(message.content).endswith('</tool_response>')) %}
        {%- set ns.multi_step_tool = false %}
        {%- set ns.last_query_index = index %}
    {%- endif %}
{%- endfor %}
{%- for message in messages %}
    {%- set content = render_content(message.content) %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
        {%- set role_name = 'observation' if (message.role == "system" and not loop.first and message.name == 'observation') else message.role %}
        {{- '<|im_start|>' + role_name + '\n' + content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {%- set reasoning_content = '' %}
        {%- if enable_thinking %}
            {%- if message.reasoning_content is string %}
                {%- set reasoning_content = render_content(message.reasoning_content) %}
            {%- else %}
                {%- if '</think>' in content %}
                    {%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
                    {%- set content = content.split('</think>')[-1].lstrip('\n') %}
                {%- endif %}
            {%- endif %}
        {%- else %}
            {# If thinking is disabled, strip any inline <think>...</think> from assistant content #}
            {%- if '</think>' in content %}
                {%- set content = content.split('</think>')[-1].lstrip('\n') %}
            {%- endif %}
        {%- endif %}

        {%- if loop.index0 > ns.last_query_index and enable_thinking %}
            {{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.rstrip('\n') + '\n</think>\n' + content.lstrip('\n') }}
        {%- else %}
            {{- '<|im_start|>' + message.role + '\n' + content.lstrip('\n') }}
        {%- endif %}
        {%- if message.tool_calls %}
            {%- for tool_call in message.tool_calls %}
                {%- if tool_call.function is defined %}
                    {%- set tool_call = tool_call.function %}
                {%- endif %}
                {{- '<tool_call>\n<function=' + tool_call.name + '>\n' }}
                {%- if tool_call.arguments is defined %}
                    {%- if tool_call.arguments is mapping %}
                        {%- set arguments = tool_call.arguments %}
                        {%- for args_name, args_value in arguments|items %}
                            {{- '<parameter=' + args_name + '>\n' }}
                            {%- set args_value = args_value | tojson(ensure_ascii=False) | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %}
                            {{- args_value }}
                            {{- '\n</parameter>\n' }}
                        {%- endfor %}
                    {%- elif tool_call.arguments is string %}
                        {# Minja does not support fromjson; preserve raw JSON string as a single parameter #}
                        {{- '<parameter=arguments>\n' + tool_call.arguments + '\n</parameter>\n' }}
                    {%- endif %}
                {%- endif %}
                {{- '</function>\n</tool_call>' }}
            {%- endfor %}
        {%- endif %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>tool_response\n' }}
        {%- endif %}
        {{- '<tool_response>' }}
        {{- content }}
        {{- '</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
    {%- if enable_thinking %}
        {{- '<think>\n' }}
    {%- endif %}
{%- endif %}
```
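To use it, save the template above to a file and point the server at it, roughly like this (sketch; keep whatever other flags you normally run with):

```bash
# Sketch: override the embedded chat template with the extracted one saved as myTemplate.jinja
./build/bin/llama-server \
    --model /path/to/Step-3.5-Flash-IQ5_K-00001-of-00004.gguf \
    --jinja \
    --chat-template-file myTemplate.jinja
```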
