Qwen3.5-397B-A17B-4bit (MLX)

A 4-bit MLX-quantized version of the text model from Qwen/Qwen3.5-397B-A17B.

Portions of this card were copied or adapted from the original model card, authored by the Qwen team.

Model Overview

Qwen3.5-397B-A17B is Alibaba's latest flagship language model, featuring a hybrid architecture that combines Gated DeltaNet (linear attention) with sparse Mixture-of-Experts for high-throughput inference. Despite having 397B total parameters, only ~17B are activated per token, making it remarkably efficient for its capability level.

This conversion provides a text-only 4-bit quantized version optimized for local inference on Apple Silicon Macs via the MLX framework. The vision encoder from the original multimodal model is not included — for image/video understanding, refer to the original Qwen/Qwen3.5-397B-A17B.

Key Capabilities

  • 201 languages and dialects with deep cultural and regional understanding
  • 262K native context (extensible to 1M+ with YaRN)
  • Thinking mode with chain-of-thought reasoning (<think>...</think>)
  • Tool use and agentic workflows (MCP, function calling)
  • Competitive benchmarks: MMLU-Pro 87.8, SuperGPQA 70.4, C-Eval 93.0

Architecture

Parameter                  Value
Total Parameters           397B
Active Parameters          ~17B
Hidden Size                4,096
Layers                     60
Layer Layout               15 × (3 × Gated DeltaNet + 1 × Full Attention), all with MoE FFN
Total Experts              512
Active Experts per Token   10 routed + 1 shared
Expert Intermediate Size   1,024
Full Attention Heads       32 Q / 2 KV (GQA), head dim 256
Linear Attention Heads     16 QK / 64 V, head dim 128
Context Length             262,144 tokens
Vocab Size                 248,320
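
To make the layer layout concrete, the 60 layers interleave linear- and full-attention blocks in a repeating 3 + 1 pattern. A quick illustrative sketch (the labels below are descriptive, not actual config keys):

# Three Gated DeltaNet (linear attention) blocks followed by one full-attention
# block, repeated 15 times, gives the 60-layer stack; every block uses a MoE FFN.
layout = (["gated_deltanet"] * 3 + ["full_attention"]) * 15
assert len(layout) == 60
print(layout[:4])  # ['gated_deltanet', 'gated_deltanet', 'gated_deltanet', 'full_attention']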

Quantization Details

Parameter            Value
Method               Affine quantization
Bits                 4-bit (weights)
Group Size           64
MoE Router Gates     8-bit (preserved at higher precision)
Model Size on Disk   ~223 GB

The MoE router gates (mlp.gate and mlp.shared_expert_gate for all 60 layers) are kept at 8-bit precision to preserve routing accuracy, which is critical for Mixture-of-Experts models.
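
For reference, a mixed-precision conversion like this can be reproduced with mlx_lm.convert and a per-layer quantization predicate. The sketch below is illustrative only, not the exact recipe used for this checkpoint; it assumes a recent mlx-lm whose convert() exposes a quant_predicate callback, and the gate-matching strings are a simplified approximation of the layer paths.

# Illustrative sketch: assumes mlx_lm.convert() accepts a quant_predicate hook.
from mlx_lm import convert

def keep_gates_at_8bit(path, module, config):
    # Route MoE gating layers to 8-bit; quantize everything else with the
    # default settings passed to convert() below.
    if "mlp.gate" in path or "shared_expert_gate" in path:
        return {"bits": 8, "group_size": 64}
    return True

convert(
    "Qwen/Qwen3.5-397B-A17B",
    mlx_path="Qwen3.5-397B-A17B-4bit",   # hypothetical output directory
    quantize=True,
    q_bits=4,
    q_group_size=64,
    quant_predicate=keep_gates_at_8bit,
)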

Requirements

  • Apple Silicon Mac with at least 256 GB unified memory (e.g., Mac Studio M3 Ultra with 256 GB or more)
  • Python 3.10+
  • mlx-lm from the main branch

Installation

pip install git+https://github.com/ml-explore/mlx-lm

Usage

Quick Start — Python API

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("mlx-community/Qwen3.5-397B-A17B-4bit")

messages = [{"role": "user", "content": "Explain the Riemann hypothesis in simple terms."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Recent mlx-lm releases take sampling settings via a sampler object
# rather than temp/top_p keyword arguments on generate().
sampler = make_sampler(temp=0.6, top_p=0.95)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=4096,
    verbose=True,
    sampler=sampler,
)

Thinking Mode (Default)

The model defaults to thinking mode, producing chain-of-thought reasoning inside <think>...</think> tags before the final answer:

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("mlx-community/Qwen3.5-397B-A17B-4bit")

messages = [
    {"role": "user", "content": "How many r's are in the word 'strawberry'?"}
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

sampler = make_sampler(temp=0.6, top_p=0.95)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=8192,
    verbose=True,
    sampler=sampler,
)
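
If only the final answer is needed, the <think>...</think> block can be stripped from the returned text. The helper below is a hypothetical convenience for illustration, not part of mlx-lm:

def split_thinking(text: str) -> tuple[str, str]:
    """Separate the chain-of-thought from the final answer.

    Assumes the reasoning appears in a single leading <think>...</think>
    block, as produced by the default thinking-mode template.
    """
    open_tag, close_tag = "<think>", "</think>"
    if open_tag in text and close_tag in text:
        start = text.index(open_tag) + len(open_tag)
        end = text.index(close_tag)
        reasoning = text[start:end].strip()
        answer = text[end + len(close_tag):].strip()
        return reasoning, answer
    return "", text.strip()

reasoning, answer = split_thinking(response)
print(answer)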

Non-Thinking Mode

For faster, more direct responses without chain-of-thought reasoning:

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("mlx-community/Qwen3.5-397B-A17B-4bit")

messages = [
    {"role": "user", "content": "Write a haiku about machine learning."}
]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=False,
)

sampler = make_sampler(temp=0.7, top_p=0.8)

response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=2048,
    verbose=True,
    sampler=sampler,
)

Command Line

# Thinking mode (default)
mlx_lm.generate \
    --model mlx-community/Qwen3.5-397B-A17B-4bit \
    --prompt "What are the key differences between TCP and UDP?" \
    --max-tokens 4096 \
    --temp 0.6 \
    --top-p 0.95

# Start a local chat server (OpenAI-compatible)
mlx_lm.server --model mlx-community/Qwen3.5-397B-A17B-4bit

Local OpenAI-Compatible Server

Start the server:

mlx_lm.server --model mlx-community/Qwen3.5-397B-A17B-4bit --port 8080

Then query it with any OpenAI-compatible client:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

response = client.chat.completions.create(
    model="mlx-community/Qwen3.5-397B-A17B-4bit",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a Python function to find all prime numbers up to n using the Sieve of Eratosthenes."},
    ],
    max_tokens=4096,
    temperature=0.6,
    top_p=0.95,
)
print(response.choices[0].message.content)

Or with curl:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen3.5-397B-A17B-4bit",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 512,
    "temperature": 0.6
  }'
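
The server also supports streamed responses. With an OpenAI-compatible client, streaming looks roughly like the sketch below; it assumes mlx_lm.server honors "stream": true on /v1/chat/completions.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="mlx-community/Qwen3.5-397B-A17B-4bit",
    messages=[{"role": "user", "content": "Summarize the CAP theorem in two sentences."}],
    max_tokens=512,
    temperature=0.6,
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()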

Recommended Generation Parameters

Parameter                Thinking Mode    Non-Thinking Mode
temperature              0.6              0.7
top_p                    0.95             0.8
top_k                    20               20
presence_penalty         0.0              1.5
repetition_penalty       1.0              1.0
max_tokens (general)     32,768           32,768
max_tokens (math/code)   81,920           n/a
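
As a rough mapping onto mlx-lm, the sampling columns above translate to make_sampler calls like the following. This is a sketch: top_k support in make_sampler is assumed and may require a recent mlx-lm release, a repetition_penalty of 1.0 is a no-op, and presence_penalty has no direct equivalent in the basic sampler, so neither appears below.

from mlx_lm.sample_utils import make_sampler

# Settings from the table above; pass one of these to generate(),
# e.g. generate(..., sampler=thinking_sampler, max_tokens=32768).
thinking_sampler = make_sampler(temp=0.6, top_p=0.95, top_k=20)
non_thinking_sampler = make_sampler(temp=0.7, top_p=0.8, top_k=20)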

Tips

  • Thinking mode is best for complex reasoning, math, and coding tasks. The model will produce internal reasoning before answering.
  • Non-thinking mode is better for straightforward Q&A, creative writing, and conversational use where latency matters.
  • For math problems, append: "Please reason step by step, and put your final answer within \boxed{}."
  • For multi-turn conversations, the default chat template automatically strips thinking content from prior turns; see the sketch after this list.
  • If running into memory pressure, consider closing other applications to free unified memory.
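
A minimal multi-turn sketch (sampling settings omitted for brevity, so decoding defaults to greedy):

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3.5-397B-A17B-4bit")

messages = [{"role": "user", "content": "Give a one-line description of Rust."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
first_reply = generate(model, tokenizer, prompt=prompt, max_tokens=1024)

# Append the assistant turn and the follow-up question; per the tip above, the
# chat template strips thinking content from prior assistant turns automatically.
messages.append({"role": "assistant", "content": first_reply})
messages.append({"role": "user", "content": "Now compare it to Go in one line."})
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
second_reply = generate(model, tokenizer, prompt=prompt, max_tokens=1024)
print(second_reply)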

Original Model

This is a quantized version of Qwen/Qwen3.5-397B-A17B. Refer to the original model card for full benchmark results, training details, and the technical report.

Citation

@misc{qwen3.5,
    title  = {{Qwen3.5}: Towards Native Multimodal Agents},
    author = {{Qwen Team}},
    month  = {February},
    year   = {2026},
    url    = {https://qwen.ai/blog?id=qwen3.5}
}