# ANEMLL VibeThinker 1.5B – Variable Context State Transition Model
Pre-converted WeiboAI/VibeThinker-1.5B for Apple Neural Engine with dynamic context size support.
This model demonstrates variable context inference: generation starts with a small KV cache (512 tokens) that automatically grows through 1024, 2048, and 3072 up to 4096 tokens as the output gets longer. When the largest context fills up, a shift-refill mechanism compacts the cache and continues generating, enabling 24,000+ token outputs from a 4096-context model, entirely on-device.
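Under the hood this is a ladder of fixed-size Core ML states plus an overflow policy. The control flow looks roughly like the sketch below; `decode_step`, `transition_to`, and `shift_refill` are illustrative placeholders, not the actual ANEMLL API:

```python
# Illustrative control loop for variable-context decoding.
# decode_step / transition_to / shift_refill stand in for ANEMLL internals.
CONTEXTS = [512, 1024, 2048, 3072, 4096]

def generate(prompt_len, max_tokens, decode_step, transition_to, shift_refill):
    ctx = 0                               # index into CONTEXTS
    pos = prompt_len                      # tokens currently held in the KV cache
    for _ in range(max_tokens):
        if pos == CONTEXTS[ctx]:          # current cache is full
            if ctx + 1 < len(CONTEXTS):
                transition_to(CONTEXTS[ctx + 1])  # copy cache into a larger state (~ms)
                ctx += 1
            else:
                pos = shift_refill(pos)   # drop old tokens, re-prefill the rest (~s)
        yield decode_step(pos)            # decode one token at position pos
        pos += 1
```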
## Model Details
| Property | Value |
|---|---|
| Base model | WeiboAI/VibeThinker-1.5B |
| Architecture | Qwen 2.5 (1.5B parameters) |
| Context sizes | 512, 1024, 2048, 3072, 4096 |
| Quantization | LUT6 (LM head), FP16 (FFN/embeddings) |
| FP32 attention | Layer-0 attention runs in FP32 for numerical stability |
| Sampling | Temperature 0.6, top_p 0.95 (recommended) |
| Total size | ~1.6 GB |
| Framework | Core ML (Apple Neural Engine) |
| Converter | ANEMLL v0.3.5 |
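The recommended temperature/top_p pair in the table is plain nucleus sampling. For reference, a minimal NumPy sketch of it (not ANEMLL code; `logits` is one step's raw LM-head output):

```python
import numpy as np

def sample_token(logits, temperature=0.6, top_p=0.95, rng=None):
    """Temperature + nucleus (top-p) sampling over one step of raw logits."""
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())        # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]              # most probable tokens first
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]                        # smallest set covering top_p mass
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))
```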
## Model Files
| File | Size | Description |
|---|---|---|
| `qwen25_embeddings.mlmodelc` | 445 MB | Token embeddings |
| `qwen25_FFN_PF_statex_chunk_01of03.mlmodelc` | 351 MB | Layers chunk 1 (infer + prefill for all contexts) |
| `qwen25_FFN_PF_statex_chunk_02of03.mlmodelc` | 321 MB | Layers chunk 2 |
| `qwen25_FFN_PF_statex_chunk_03of03.mlmodelc` | 321 MB | Layers chunk 3 |
| `qwen25_FFN_attn_fp32_statex_chunk_01of03.mlmodelc` | 32 MB | FP32 layer-0 attention (numerical stability fix) |
| `qwen25_lm_head_lut6.mlmodelc` | 172 MB | LM head (LUT6 quantized) |
| `meta.yaml` | 2 KB | Model configuration and state transition metadata |
| `tokenizer.json` + `vocab.json` + `merges.txt` | ~15 MB | Tokenizer files |
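The chunks are ordinary compiled Core ML models, so you can load one outside the demo scripts with coremltools if you want to inspect it. A sketch, assuming the download path used below (input/output feature names are chunk-specific):

```python
import os
import coremltools as ct

# Load one compiled chunk and pin it to CPU + Neural Engine.
path = os.path.expanduser(
    "~/Models/ANE/vibethinker_1.5b_xstates/qwen25_embeddings.mlmodelc"
)
embeddings = ct.models.CompiledMLModel(path, compute_units=ct.ComputeUnit.CPU_AND_NE)
# embeddings.predict({...})  # feature names depend on the chunk; see meta.yaml
```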
## Requirements
- macOS 15+ (Sequoia) with Apple Silicon (M1/M2/M3/M4 or later)
- Python 3.9+
- ANEMLL v0.3.5
## Download
### Option 1: Using `huggingface-cli`
```bash
pip install huggingface_hub

huggingface-cli download anemll/anemll-vibethinker-1.5b-state-transition \
  --local-dir ~/Models/ANE/vibethinker_1.5b_xstates
```
### Option 2: Using `git lfs`
```bash
git lfs install
git clone https://huggingface.co/anemll/anemll-vibethinker-1.5b-state-transition \
  ~/Models/ANE/vibethinker_1.5b_xstates
```
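### Option 3: From Python
Equivalent to Option 1, scripted via `huggingface_hub`:
```python
from pathlib import Path
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="anemll/anemll-vibethinker-1.5b-state-transition",
    local_dir=Path("~/Models/ANE/vibethinker_1.5b_xstates").expanduser(),
)
```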
## Setup ANEMLL
```bash
git clone https://github.com/anemll/anemll.git
cd anemll
git checkout 0.3.5-staging
./create_uv_env.sh
source env-anemll/bin/activate
./install_dependencies.sh
```
## Run the Demo
### Quick Start (default prompt)
```bash
python examples/variable_context_demo.py \
  --meta ~/Models/ANE/vibethinker_1.5b_xstates/meta.yaml
```
This generates a Tic Tac Toe game in Python, demonstrating context transitions and overflow handling live in the terminal.
### Custom Prompt
```bash
python examples/variable_context_demo.py \
  --meta ~/Models/ANE/vibethinker_1.5b_xstates/meta.yaml \
  --prompt "Write a complete snake game in Python using curses"
```
### Time-Limited Generation (5 minutes)
```bash
python examples/variable_context_demo.py \
  --meta ~/Models/ANE/vibethinker_1.5b_xstates/meta.yaml \
  --max-time 300
```
### Full Command with All Options
```bash
python examples/variable_context_demo.py \
  --meta ~/Models/ANE/vibethinker_1.5b_xstates/meta.yaml \
  --prompt "Explain the theory of relativity in detail" \
  --max-tokens 24000 \
  --sampling-mode auto \
  --seed 123 \
  --max-context-size 4096 \
  --overflow-reserve-batches 9
```
## Advanced: Direct Runner
For full control over all parameters, use the state transition runner directly:
```bash
python tests/dev/state_transition_growing_inference.py \
  --meta ~/Models/ANE/vibethinker_1.5b_xstates/meta.yaml \
  --max-tokens 24000 \
  --prompt "Write a game of Tic Tac Toe in python, your code should play against human" \
  --prefill-mode batch-prefill \
  --sampling-mode auto \
  --max-context-size 4096 \
  --overflow-preserve-prompt \
  --overflow-policy shift-refill \
  --overflow-reserve-batches 9 \
  --live-events \
  --seed 123
```
## What to Expect
During generation you will see live events showing context transitions and compaction:
```
[transition] ctx512 -> ctx1024 at tokens=512 (2.3 ms, avg decode 45.2 t/s)
[transition] ctx1024 -> ctx2048 at tokens=1024 (3.1 ms, avg decode 44.8 t/s)
[transition] ctx2048 -> ctx3072 at tokens=2048 (4.5 ms, avg decode 43.5 t/s)
...
[compact] ctx4096 drop=3200 keep=896 (1250 ms, avg decode 42.1 t/s)
```
- Transitions happen automatically when the current context fills up (~2-5 ms, invisible to the user)
- Compactions re-prefill the KV cache with recent tokens when the largest context is exhausted (~1-2 seconds)
- Between events, tokens stream continuously to stdout
At the end, a summary shows per-context decode speed, total tokens, and compaction stats.
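For intuition, the `[compact]` event above kept 896 tokens and dropped 3200 from the 4096-token cache. A hedged sketch of what shift-refill does (the real keep/drop split is derived from `--overflow-reserve-batches` and the prefill batch size; `keep_recent` here is just an illustrative parameter):

```python
def shift_refill(cache_tokens, prompt_len, keep_recent, preserve_prompt=True):
    """Pick the tokens that survive compaction (illustrative, not ANEMLL's exact policy)."""
    head = cache_tokens[:prompt_len] if preserve_prompt else []
    cut = max(prompt_len, len(cache_tokens) - keep_recent)  # don't double-keep the prompt
    return head + cache_tokens[cut:]   # the middle span is dropped, survivors re-prefilled
```

This also explains the cost asymmetry in the events above: a transition is a cache copy (milliseconds), while a compaction is a fresh batched prefill of the surviving tokens (about a second).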
## Why FP32 Attention?
VibeThinker has unusually large Q/K projection biases in layer 0 that cause FP16 attention logit overflow on the ANE, producing gibberish output. The fix is to run layer-0 attention in FP32 on the CPU (the `qwen25_FFN_attn_fp32_statex_chunk_01of03.mlmodelc` file). This is handled automatically; no user configuration is needed.
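To see why FP16 is the problem, recall that FP16 saturates at 65504: a dot product over large query/key values (as produced by oversized projection biases) overflows to `inf` before the softmax ever runs. A synthetic NumPy illustration (not the model's actual activations):

```python
import numpy as np

d = 128
q = np.full(d, 24.0, dtype=np.float16)   # large activations, e.g. inflated by big Q/K biases
k = np.full(d, 24.0, dtype=np.float16)

print(np.dot(q, k))                      # 128 * 576 = 73728 > 65504  ->  inf in fp16
print(np.dot(q.astype(np.float32),
             k.astype(np.float32)))      # 73728.0, finite in fp32
```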
## License
The converted model inherits the license from WeiboAI/VibeThinker-1.5B. The ANEMLL converter is MIT licensed.
## Links
- Base model: https://huggingface.co/WeiboAI/VibeThinker-1.5B
- ANEMLL converter: https://github.com/anemll/anemll