Nanbeige4.1-3B-MLX-4bit (4-bit Quantized)

This is the Nanbeige4.1-3B model converted to MLX format with 4-bit quantization (affine, group_size=64) for efficient inference on Apple Silicon. This is the smallest and fastest variant — ideal for speed-sensitive and memory-constrained use cases.

Other variants: 8-bit and BF16 conversions of the same model are also available; see the comparison below.

All Variants Compared

Performance

| Variant | Size | Memory | Prompt Speed | Gen Speed |
|---|---|---|---|---|
| 4-bit (this) | 2.06 GB | ~2.3 GB | ~279 tok/s | ~103 tok/s |
| 8-bit | 3.91 GB | ~4.3 GB | ~342 tok/s | ~59 tok/s |
| BF16 | 7.35 GB | ~8.0 GB | ~276 tok/s | ~33 tok/s |

Quality Comparison (Head-to-Head, Identical Prompts, temp=0)

All three variants were tested with the same prompts under deterministic settings (temperature=0) to evaluate quality differences:

| Test | 4-bit | 8-bit | BF16 |
|---|---|---|---|
| Math: 47 * 83 | 3901 | 3901 | 3901 |
| Logic: "All but 9 die" trick | 9 | 9 | 9 |
| Code: Binary search | Correct | Correct | Correct |
| Math: f(x) = 2x^2 - 3x + 1, f(5) | 36 | 36 | 36 |
| Nuanced reasoning: Paper folding | Correct | Correct | Correct |
| Tool call: BookFlight JSON | Identical | Identical | Identical |
| AIME-style: 2^100 mod 7 | 2 | 2 | 2 |

Key findings:

  • 8-bit vs BF16: Produced word-for-word identical reasoning and answers in the majority of tests. Essentially zero quality loss.
  • 4-bit vs BF16: Sometimes takes slightly different reasoning paths, but arrives at the same correct answers. Tool calling output is 100% identical across all variants.
  • Recommendation: 4-bit is best for speed and memory. 8-bit is the sweet spot for quality. BF16 is for research and benchmarking where exact reproduction matters.
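
The head-to-head comparison above can be rerun with a greedy (temperature 0) sampler. A minimal sketch follows; the repository names for the 8-bit and BF16 conversions are placeholders, so substitute the actual variant paths.

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

# Placeholder repo IDs for the other variants -- replace with the actual paths.
variants = [
    "andrevp/Nanbeige4.1-3B-MLX-4bit",
    "andrevp/Nanbeige4.1-3B-MLX-8bit",   # hypothetical name
    "andrevp/Nanbeige4.1-3B-MLX-bf16",   # hypothetical name
]

messages = [{"role": "user", "content": "What is 47 * 83?"}]
sampler = make_sampler(temp=0.0)  # temp=0 -> greedy decoding, deterministic output

for repo in variants:
    model, tokenizer = load(repo)
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=False
    )
    response = generate(
        model, tokenizer, prompt=prompt, max_tokens=1024, sampler=sampler
    )
    print(f"--- {repo} ---\n{response}\n")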

Original Model

This is a quantized conversion of the upstream Nanbeige4.1-3B release. See the original model card for training details and the full benchmark setup.

Conversion Details

| Property | Value |
|---|---|
| Quantization | 4-bit affine |
| Group size | 64 |
| Bits per weight | ~4.5 |
| Original size (BF16) | ~7.87 GB |
| Quantized size | ~2.06 GB |
| Compression ratio | 3.8x |
| Conversion tool | mlx-lm v0.30.7 |
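
The ~4.5 bits-per-weight figure follows from the group size: each 64-weight group stores 4 bits per weight plus one 16-bit scale and one 16-bit bias, i.e. 4 + 16/64 + 16/64 = 4.5. A conversion along these lines can be reproduced with mlx-lm's convert API; this is a sketch, and the hf_path below is an assumed location for the upstream weights rather than a confirmed one.

from mlx_lm import convert

# Quantize the upstream BF16 checkpoint to 4-bit affine with group size 64.
# hf_path is an assumption; point it at the original Nanbeige4.1-3B repository.
convert(
    hf_path="Nanbeige/Nanbeige4.1-3B",
    mlx_path="Nanbeige4.1-3B-MLX-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)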

Performance on Apple Silicon

Tested on Apple Silicon:

| Metric | Value |
|---|---|
| Prompt processing | ~279 tokens/sec |
| Generation speed | ~103 tokens/sec |
| Peak memory usage | ~2.3 GB |
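
To get comparable numbers on your own machine, mlx-lm can report throughput and peak memory directly. A minimal sketch, assuming the same setup as the quickstart below:

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("andrevp/Nanbeige4.1-3B-MLX-4bit")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize the history of the transistor."}],
    add_generation_prompt=True, tokenize=False,
)
# verbose=True prints prompt and generation tokens-per-second plus peak memory.
generate(
    model, tokenizer, prompt=prompt, max_tokens=256,
    sampler=make_sampler(temp=0.6, top_p=0.95), verbose=True,
)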

Capabilities

This model retains all capabilities of the original Nanbeige4.1-3B:

  • Reasoning/Thinking: Uses <think>...</think> tags for chain-of-thought reasoning
  • Tool/Function Calling: Generates structured <tool_call>...</tool_call> JSON output
  • Multi-turn Conversation: Supports multi-turn chat with context tracking
  • Multilingual: Strong performance in both English and Chinese
  • Code Generation: Capable of writing and explaining code
  • Deep-Search Agent: Supports deep-search tasks with 500+ rounds of tool invocations (using tokenizer_config_search.json)

Quickstart

Installation

pip install mlx-lm

CLI Usage

mlx_lm generate \
  --model andrevp/Nanbeige4.1-3B-MLX-4bit \
  --prompt "Explain quantum computing in simple terms." \
  --max-tokens 512 \
  --temp 0.6 \
  --top-p 0.95

Python - Chat

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("andrevp/Nanbeige4.1-3B-MLX-4bit")
sampler = make_sampler(temp=0.6, top_p=0.95)

messages = [
    {"role": "user", "content": "Which number is bigger, 9.11 or 9.8?"}
]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
response = generate(
    model, tokenizer, prompt=prompt, max_tokens=512, sampler=sampler
)
print(response)

Python - Tool Calling

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("andrevp/Nanbeige4.1-3B-MLX-4bit")
sampler = make_sampler(temp=0.6, top_p=0.95)

messages = [
    {"role": "user", "content": "What is the weather in Tokyo?"}
]
tools = [
    {
        "type": "function",
        "function": {
            "name": "SearchWeather",
            "description": "Find the current weather in a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name"
                    }
                },
                "required": ["location"]
            }
        }
    }
]
prompt = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, tokenize=False
)
response = generate(
    model, tokenizer, prompt=prompt, max_tokens=256, sampler=sampler
)
print(response)
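
The response wraps the call in a <tool_call>...</tool_call> block containing JSON; the {"name": ..., "arguments": ...} shape assumed here is the common convention for this tag and should be checked against the actual output. A minimal way to extract it from the response generated above:

import json
import re

# Pull the JSON payload out of the <tool_call>...</tool_call> block, if present.
match = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", response, re.DOTALL)
if match:
    call = json.loads(match.group(1))
    print(call.get("name"), call.get("arguments"))
else:
    print("No tool call found in:", response)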

Python - Multi-turn with Tool Responses

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("andrevp/Nanbeige4.1-3B-MLX-4bit")
sampler = make_sampler(temp=0.6, top_p=0.95)

tools = [
    {
        "type": "function",
        "function": {
            "name": "SearchWeather",
            "description": "Find the current weather in a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"}
                },
                "required": ["location"]
            }
        }
    }
]

messages = [
    {"role": "user", "content": "What's the weather like in Paris?"},
    {"role": "assistant", "content": "", "tool_calls": [
        {"function": {"name": "SearchWeather", "arguments": '{"location": "Paris"}'}}
    ]},
    {"role": "tool", "content": '{"temperature": "18°C", "condition": "Partly cloudy"}'}
]

prompt = tokenizer.apply_chat_template(
    messages, tools=tools, add_generation_prompt=True, tokenize=False
)
response = generate(
    model, tokenizer, prompt=prompt, max_tokens=300, sampler=sampler
)
print(response)

Recommended Inference Hyperparameters

| Parameter | Value |
|---|---|
| Temperature | 0.6 |
| Top-p | 0.95 |
| Repeat penalty | 1.0 |
| Max new tokens | 131,072 |
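
Applied with mlx-lm, those defaults look like the sketch below. A repeat penalty of 1.0 leaves the logits unchanged, so the logits-processor line only matters if you raise it above 1.0.

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_logits_processors, make_sampler

model, tokenizer = load("andrevp/Nanbeige4.1-3B-MLX-4bit")

sampler = make_sampler(temp=0.6, top_p=0.95)
# repetition_penalty=1.0 mirrors the table above and is effectively a no-op.
logits_processors = make_logits_processors(repetition_penalty=1.0)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    add_generation_prompt=True, tokenize=False,
)
response = generate(
    model, tokenizer, prompt=prompt,
    max_tokens=2048,  # raise this for long reasoning chains (see the note below)
    sampler=sampler, logits_processors=logits_processors,
)
print(response)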

Benchmarks (Original Model)

General Reasoning Tasks

| Benchmark | Qwen3-4B | Qwen3-8B | Qwen3-14B | Qwen3-32B | Qwen3-30B-A3B | Nanbeige4.1-3B |
|---|---|---|---|---|---|---|
| Code | | | | | | |
| Live-Code-Bench-V6 | 57.4 | 49.4 | 55.9 | 55.7 | 66.0 | 76.9 |
| LCB-Pro-Easy | 40.2 | 41.2 | 33.0 | 42.3 | 60.8 | 81.4 |
| LCB-Pro-Medium | 5.3 | 3.5 | 1.8 | 3.5 | 3.5 | 28.1 |
| Math | | | | | | |
| AIME 2026 I | 81.46 | 70.42 | 76.46 | 75.83 | 87.30 | 87.40 |
| HMMT Nov | 68.33 | 48.33 | 56.67 | 57.08 | 71.25 | 77.92 |
| IMO-Answer-Bench | 48.00 | 36.56 | 41.81 | 43.94 | 54.34 | 53.38 |
| Science | | | | | | |
| GPQA | 65.8 | 62.0 | 63.38 | 68.4 | 73.4 | 83.8 |
| HLE (Text-only) | 6.72 | 5.28 | 7.00 | 9.31 | 11.77 | 12.60 |
| Alignment | | | | | | |
| Arena-Hard-v2 | 34.9 | 26.3 | 36.9 | 56.0 | 60.2 | 73.2 |
| Multi-Challenge | 41.14 | 36.30 | 36.97 | 38.72 | 49.40 | 52.21 |
| Tool Use | | | | | | |
| BFCL-V4 | 44.87 | 42.20 | 45.14 | 47.90 | 48.6 | 56.50 |
| Tau2-Bench | 45.9 | 42.06 | 44.96 | 45.26 | 47.70 | 48.57 |

Deep Search Benchmarks

| Model | xBench-DS-2505 | xBench-DS-2510 | Browse-Comp | GAIA | HLE | SEAL-0 |
|---|---|---|---|---|---|---|
| MiroThinker-v1.0-8B | 61 | - | 31.1 | 66.4 | 21.5 | 40.4 |
| AgentCPM-Explore-4B | 70 | - | 25.0 | 63.9 | 19.1 | 40.0 |
| Qwen3-32B | 39 | 8 | 3.15 | 30.17 | 9.26 | 8.15 |
| Nanbeige4.1-3B | 75 | 39 | 19.12 | 69.90 | 22.29 | 41.44 |

Files Included

| File | Description |
|---|---|
| model.safetensors | Quantized 4-bit model weights |
| model.safetensors.index.json | Weight index mapping |
| config.json | Model architecture config (with quantization params) |
| tokenizer.json | Fast tokenizer |
| tokenizer.model | SentencePiece model |
| tokenizer_config.json | Tokenizer config with all special tokens |
| tokenizer_config_search.json | Tokenizer config for deep-search mode |
| chat_template.jinja | Chat template (chat + tool calling + reasoning) |
| special_tokens_map.json | Special tokens mapping |
| added_tokens.json | Additional token definitions |
| generation_config.json | Default generation parameters |

Special Tokens

| Token | ID | Purpose |
|---|---|---|
| <\|im_start\|> | 166100 | BOS / message start |
| <\|im_end\|> | 166101 | EOS / message end |
| <\|endoftext\|> | 166102 | End of text |
| <think> | 166103 | Start of reasoning |
| </think> | 166104 | End of reasoning |
| <tool_call> | 166105 | Start of tool call |
| </tool_call> | 166106 | End of tool call |
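
These IDs can be verified against the tokenizer shipped in this repository; convert_tokens_to_ids is the standard Hugging Face tokenizer method and is proxied by mlx-lm's tokenizer wrapper.

from mlx_lm import load

_, tokenizer = load("andrevp/Nanbeige4.1-3B-MLX-4bit")

# Print the ID assigned to each special token used for chat, reasoning, and tool calls.
for token in ["<|im_start|>", "<|im_end|>", "<|endoftext|>",
              "<think>", "</think>", "<tool_call>", "</tool_call>"]:
    print(f"{token:>14} -> {tokenizer.convert_tokens_to_ids(token)}")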

Deep-Search Mode

For deep-search agent capabilities, switch to tokenizer_config_search.json and use the miroflow-framework for inference. See the original model card for detailed deep-search setup instructions.
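
mlx-lm resolves the tokenizer from tokenizer_config.json by name, so one way to activate the search configuration is to download the repository into a local folder and swap the config file before loading. This is a sketch under that assumption; the miroflow-framework setup itself is described in the original model card.

import shutil
from huggingface_hub import snapshot_download
from mlx_lm import load

# Download the repo into a local working directory so the files can be modified.
local_dir = snapshot_download(
    "andrevp/Nanbeige4.1-3B-MLX-4bit",
    local_dir="Nanbeige4.1-3B-MLX-4bit-search",
)

# Replace the default tokenizer config with the deep-search variant.
shutil.copyfile(
    f"{local_dir}/tokenizer_config_search.json",
    f"{local_dir}/tokenizer_config.json",
)

model, tokenizer = load(local_dir)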

Important: Extended Thinking Behavior

Nanbeige4.1-3B is a reasoning model that generates chain-of-thought inside <think>...</think> tags before producing a final answer. This is by design — and it affects all variants equally (4-bit, 8-bit, BF16).

What to expect

| Task Type | Typical Thinking Length | Recommended max_tokens |
|---|---|---|
| Math, logic, tool calling | 200–500 tokens | 512–2,048 |
| Code generation | 500–1,500 tokens | 2,048–4,096 |
| Translation, creative writing, commonsense | 3,000–5,000+ tokens | 8,192+ |

For complex tasks (translation, creative writing, open-ended questions), the model may spend 3,000–5,000+ tokens reasoning before delivering an answer. If max_tokens is too low, the output will be truncated mid-thinking and no final answer will appear. This is not a failure; the model simply needs more tokens to finish its thought process.

Workarounds

1. Increase max_tokens (recommended)

# For complex tasks, use high max_tokens
response = generate(
    model, tokenizer, prompt=prompt,
    max_tokens=8192,  # or higher for very complex tasks
    sampler=sampler
)

2. Skip thinking (experimental)

You can pre-fill an empty thinking block to force the model to answer directly. This does not always work — the model may re-enter thinking mode — but it can help for simpler open-ended tasks:

prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
prompt += '<think>\n\n</think>\n\n'  # Force empty thinking block
response = generate(
    model, tokenizer, prompt=prompt,
    max_tokens=512, sampler=sampler
)

3. Post-process to extract the answer

if '</think>' in response:
    answer = response.split('</think>')[-1].strip()
else:
    answer = response  # Still in thinking — increase max_tokens

Limitations

While safety was emphasized during training, the model may still generate unexpected outputs due to its size and probabilistic nature. Users should not propagate harmful content generated by the model, and the developers assume no responsibility for consequences arising from the dissemination of inappropriate content.

License

This model is released under the Apache 2.0 License, the same license as the original Nanbeige4.1-3B model.

Credits

Original model by the Nanbeige team. MLX conversion and 4-bit quantization by andrevp using mlx-lm.
