# ANEMLL VibeThinker 1.5B – Variable Context State Transition Model
Pre-converted WeiboAI/VibeThinker-1.5B for Apple Neural Engine with dynamic context size support.
This model demonstrates variable context inference: generation starts with a small KV cache (512 tokens) that automatically grows through 1024, 2048, and 3072 up to 4096 tokens as the output gets longer. When the largest context fills up, a shift-refill mechanism compacts the cache and continues generating, enabling 24,000+ token outputs from a 4096-context model, entirely on-device.
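Under the hood this is a ladder of fixed-size Core ML states plus an overflow policy. The control flow looks roughly like the sketch below; `decode_step`, `transition_to`, and `shift_refill` are illustrative placeholders, not the actual ANEMLL API:

```python
# Illustrative control loop for variable-context decoding.
# decode_step / transition_to / shift_refill stand in for ANEMLL internals.
CONTEXTS = [512, 1024, 2048, 3072, 4096]

def generate(prompt_len, max_tokens, decode_step, transition_to, shift_refill):
    ctx = 0                               # index into CONTEXTS
    pos = prompt_len                      # tokens currently held in the KV cache
    for _ in range(max_tokens):
        if pos == CONTEXTS[ctx]:          # current cache is full
            if ctx + 1 < len(CONTEXTS):
                transition_to(CONTEXTS[ctx + 1])  # copy cache into a larger state (~ms)
                ctx += 1
            else:
                pos = shift_refill(pos)   # drop old tokens, re-prefill the rest (~s)
        yield decode_step(pos)            # decode one token at position pos
        pos += 1
```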
## Model Details
| Property | Value |
|---|---|
| Base model | WeiboAI/VibeThinker-1.5B |
| Architecture | Qwen 2.5 (1.5B parameters) |
| Context sizes | 512, 1024, 2048, 3072, 4096 |
| Quantization | LUT6 (LM head), FP16 (FFN/embeddings) |
| FP32 attention | Layer-0 attention runs in FP32 for numerical stability |
| Sampling | Temperature 0.6, top_p 0.95 (recommended) |
| Total size | ~1.6 GB |
| Framework | Core ML (Apple Neural Engine) |
| Converter | ANEMLL v0.3.5 |
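The recommended temperature/top_p pair in the table is plain nucleus sampling. For reference, a minimal NumPy sketch of it (not ANEMLL code; `logits` is one step's raw LM-head output):

```python
import numpy as np

def sample_token(logits, temperature=0.6, top_p=0.95, rng=None):
    """Temperature + nucleus (top-p) sampling over one step of raw logits."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())        # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]              # most probable tokens first
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]                        # smallest set covering top_p mass
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))
```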
## Model Files
| File | Size | Description |
|---|---|---|
| `qwen25_embeddings.mlmodelc` | 445 MB | Token embeddings |
| `qwen25_FFN_PF_statex_chunk_01of03.mlmodelc` | 351 MB | Layers chunk 1 (infer + prefill for all contexts) |
| `qwen25_FFN_PF_statex_chunk_02of03.mlmodelc` | 321 MB | Layers chunk 2 |
| `qwen25_FFN_PF_statex_chunk_03of03.mlmodelc` | 321 MB | Layers chunk 3 |
| `qwen25_FFN_attn_fp32_statex_chunk_01of03.mlmodelc` | 32 MB | FP32 layer-0 attention (numerical stability fix) |
| `qwen25_lm_head_lut6.mlmodelc` | 172 MB | LM head (LUT6 quantized) |
| `meta.yaml` | 2 KB | Model configuration and state transition metadata |
| `tokenizer.json` + `vocab.json` + `merges.txt` | ~15 MB | Tokenizer files |
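The chunks are ordinary compiled Core ML models, so you can load one outside the demo scripts with coremltools if you want to inspect it. A sketch, assuming the download path used below (input/output feature names are chunk-specific):

```python
import os
import coremltools as ct

# Load one compiled chunk and pin it to CPU + Neural Engine.
path = os.path.expanduser(
    "~/Models/ANE/vibethinker_1.5b_xstates/qwen25_embeddings.mlmodelc"
)
embeddings = ct.models.CompiledMLModel(path, compute_units=ct.ComputeUnit.CPU_AND_NE)
# embeddings.predict({...})  # feature names depend on the chunk; see meta.yaml
```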
## Requirements
- macOS 15+ (Sequoia) with Apple Silicon (M1/M2/M3/M4 or later)
- Python 3.9+
- ANEMLL v0.3.5
## Download
### Option 1: Using `huggingface-cli`
```bash
pip install huggingface_hub

huggingface-cli download anemll/anemll-vibethinker-1.5b-state-transition \
  --local-dir ~/Models/ANE/vibethinker_1.5b_xstates
```
### Option 2: Using `git lfs`
```bash
git lfs install
git clone https://huggingface.co/anemll/anemll-vibethinker-1.5b-state-transition \
  ~/Models/ANE/vibethinker_1.5b_xstates
```
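### Option 3: From Python
Equivalent to Option 1, scripted via `huggingface_hub`:
```python
from pathlib import Path
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="anemll/anemll-vibethinker-1.5b-state-transition",
    local_dir=Path("~/Models/ANE/vibethinker_1.5b_xstates").expanduser(),
)
```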
## Setup ANEMLL
```bash
git clone https://github.com/anemll/anemll.git
cd anemll
git checkout 0.3.5-staging
./create_uv_env.sh
source env-anemll/bin/activate
./install_dependencies.sh
```
## Run the Demo
### Quick Start (default prompt)
```bash
python examples/variable_context_demo.py \
  --meta ~/Models/ANE/vibethinker_1.5b_xstates/meta.yaml
```
This generates a Tic Tac Toe game in Python, demonstrating context transitions and overflow handling live in the terminal.
### Custom Prompt
```bash
python examples/variable_context_demo.py \
  --meta ~/Models/ANE/vibethinker_1.5b_xstates/meta.yaml \
  --prompt "Write a complete snake game in Python using curses"
```
### Time-Limited Generation (5 minutes)
```bash
python examples/variable_context_demo.py \
  --meta ~/Models/ANE/vibethinker_1.5b_xstates/meta.yaml \
  --max-time 300
```
### Full Command with All Options
```bash
python examples/variable_context_demo.py \
  --meta ~/Models/ANE/vibethinker_1.5b_xstates/meta.yaml \
  --prompt "Explain the theory of relativity in detail" \
  --max-tokens 24000 \
  --sampling-mode auto \
  --seed 123 \
  --max-context-size 4096 \
  --overflow-reserve-batches 9
```
## Advanced: Direct Runner
For full control over all parameters, use the state transition runner directly:
```bash
python tests/dev/state_transition_growing_inference.py \
  --meta ~/Models/ANE/vibethinker_1.5b_xstates/meta.yaml \
  --max-tokens 24000 \
  --prompt "Write a game of Tic Tac Toe in python, your code should play against human" \
  --prefill-mode batch-prefill \
  --sampling-mode auto \
  --max-context-size 4096 \
  --overflow-preserve-prompt \
  --overflow-policy shift-refill \
  --overflow-reserve-batches 9 \
  --live-events \
  --seed 123
```
## What to Expect
During generation you will see live events showing context transitions and compaction:
```
[transition] ctx512 -> ctx1024 at tokens=512 (2.3 ms, avg decode 45.2 t/s)
[transition] ctx1024 -> ctx2048 at tokens=1024 (3.1 ms, avg decode 44.8 t/s)
[transition] ctx2048 -> ctx3072 at tokens=2048 (4.5 ms, avg decode 43.5 t/s)
...
[compact] ctx4096 drop=3200 keep=896 (1250 ms, avg decode 42.1 t/s)
```
- Transitions happen automatically when the current context fills up (~2-5 ms, invisible to the user)
- Compactions re-prefill the KV cache with recent tokens when the largest context is exhausted (~1-2 seconds)
- Between events, tokens stream continuously to stdout
At the end, a summary shows per-context decode speed, total tokens, and compaction stats.
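For intuition, the `[compact]` event above kept 896 tokens and dropped 3200 from the 4096-token cache. A hedged sketch of what shift-refill does (the real keep/drop split is derived from `--overflow-reserve-batches` and the prefill batch size; `keep_recent` here is just an illustrative parameter):

```python
def shift_refill(cache_tokens, prompt_len, keep_recent, preserve_prompt=True):
    """Pick the tokens that survive compaction (illustrative, not ANEMLL's exact policy)."""
    head = cache_tokens[:prompt_len] if preserve_prompt else []
    cut = max(prompt_len, len(cache_tokens) - keep_recent)  # don't double-keep the prompt
    return head + cache_tokens[cut:]   # the middle span is dropped, survivors re-prefilled
```

This also explains the cost asymmetry in the events above: a transition is a cache copy (milliseconds), while a compaction is a fresh batched prefill of the surviving tokens (about a second).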
## Why FP32 Attention?
VibeThinker has unusually large Q/K projection biases in layer 0 that cause FP16 attention logit overflow on the ANE, producing gibberish output. The fix is to run layer-0 attention in FP32 on the CPU (the `qwen25_FFN_attn_fp32_statex_chunk_01of03.mlmodelc` file). This is handled automatically; no user configuration is needed.
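To see why FP16 is the problem, recall that FP16 saturates at 65504: a dot product over large query/key values (as produced by oversized projection biases) overflows to `inf` before the softmax ever runs. A synthetic NumPy illustration (not the model's actual activations):

```python
import numpy as np

d = 128
q = np.full(d, 24.0, dtype=np.float16)   # large activations, e.g. inflated by big Q/K biases
k = np.full(d, 24.0, dtype=np.float16)

print(np.dot(q, k))                      # 128 * 576 = 73728 > 65504  ->  inf in fp16
print(np.dot(q.astype(np.float32),
             k.astype(np.float32)))      # 73728.0, finite in fp32
```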
## License
The converted model inherits the license from WeiboAI/VibeThinker-1.5B. The ANEMLL converter is MIT licensed.
## Links
- Base model: https://huggingface.co/WeiboAI/VibeThinker-1.5B
- ANEMLL converter: https://github.com/anemll/anemll