ANEMLL VibeThinker 1.5B - Variable Context State Transition Model

Pre-converted WeiboAI/VibeThinker-1.5B for the Apple Neural Engine, with dynamic context-size support.

This model demonstrates variable context inference: it starts generating with a small KV cache (512 tokens) and automatically grows through 1024, 2048, and 3072, up to 4096 tokens as the output gets longer. When the largest context fills up, a shift-refill mechanism compacts the cache and continues generating, enabling 24,000+ token outputs from a 4096-context model, entirely on-device.
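Conceptually, the decode loop looks like the minimal Python sketch below. The helpers load_state, prefill, transition, and decode_one are hypothetical stand-ins, not ANEMLL's actual API; the real logic lives in the state transition runner.

# Sketch only: load_state/prefill/transition/decode_one are hypothetical
# stand-ins for the runner's internals.
CONTEXTS = [512, 1024, 2048, 3072, 4096]

def generate(prompt, max_new_tokens, reserve=576):
    # `reserve` is illustrative (roughly overflow-reserve-batches x batch size):
    # slots left free after compaction so it is not immediately re-triggered.
    ctx = 0
    window = list(prompt)                  # tokens currently held in the KV cache
    out = []
    state = load_state(CONTEXTS[ctx])
    prefill(state, window)
    while len(out) < max_new_tokens:
        if len(window) == CONTEXTS[ctx]:
            if ctx + 1 < len(CONTEXTS):
                # Transition: copy the KV cache into the next, larger context (~2-5 ms).
                ctx += 1
                state = transition(state, CONTEXTS[ctx])
            else:
                # Shift-refill: keep the prompt plus the recent tail, drop the
                # middle, and re-prefill the largest context (~1-2 s).
                tail = CONTEXTS[ctx] - len(prompt) - reserve
                window = list(prompt) + window[-tail:]
                state = load_state(CONTEXTS[ctx])
                prefill(state, window)
        tok = decode_one(state, window)
        window.append(tok)
        out.append(tok)
    return out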

Model Details

Property         Value
---------------  -------------------------------------------------------
Base model       WeiboAI/VibeThinker-1.5B
Architecture     Qwen 2.5 (1.5B parameters)
Context sizes    512, 1024, 2048, 3072, 4096
Quantization     LUT6 (LM head), FP16 (FFN/embeddings)
FP32 attention   Layer-0 attention runs in FP32 for numerical stability
Sampling         Temperature 0.6, top_p 0.95 (recommended; sketch below)
Total size       ~1.6 GB
Framework        Core ML (Apple Neural Engine)
Converter        ANEMLL v0.3.5
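
The recommended sampling row above is standard temperature scaling followed by nucleus (top-p) filtering. A self-contained sketch of that combination, independent of ANEMLL's own sampler:

import numpy as np

def sample_token(logits, temperature=0.6, top_p=0.95, rng=None):
    """Temperature + nucleus (top-p) sampling over a 1-D logits array."""
    rng = rng or np.random.default_rng()
    z = logits / temperature
    probs = np.exp(z - z.max())                # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]            # tokens by descending probability
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]                      # smallest set with mass >= top_p
    return int(rng.choice(keep, p=probs[keep] / probs[keep].sum()))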

Model Files

File                                                Size     Description
--------------------------------------------------  -------  -------------------------------------------------
qwen25_embeddings.mlmodelc                          445 MB   Token embeddings
qwen25_FFN_PF_statex_chunk_01of03.mlmodelc          351 MB   Layers chunk 1 (infer + prefill for all contexts)
qwen25_FFN_PF_statex_chunk_02of03.mlmodelc          321 MB   Layers chunk 2
qwen25_FFN_PF_statex_chunk_03of03.mlmodelc          321 MB   Layers chunk 3
qwen25_FFN_attn_fp32_statex_chunk_01of03.mlmodelc   32 MB    FP32 layer-0 attention (numerical stability fix)
qwen25_lm_head_lut6.mlmodelc                        172 MB   LM head (LUT6 quantized)
meta.yaml                                           2 KB     Model configuration and state transition metadata
tokenizer.json + vocab.json + merges.txt            ~15 MB   Tokenizer files
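
The .mlmodelc chunks above are ordinary compiled Core ML models, so they can be inspected outside the demo scripts. A minimal sketch, assuming coremltools 8+ for stateful-model support; the input names and state handling are model-specific:

import coremltools as ct

# Load one compiled chunk; CPU_AND_NE keeps it eligible for the Neural Engine.
chunk = ct.models.CompiledMLModel(
    "qwen25_FFN_PF_statex_chunk_01of03.mlmodelc",
    compute_units=ct.ComputeUnit.CPU_AND_NE,
)
state = chunk.make_state()    # stateful (KV-cache) models carry explicit state
# outputs = chunk.predict(inputs, state=state)   # inputs dict is model-specific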

Requirements

  • macOS 15+ (Sequoia) with Apple Silicon (M1/M2/M3/M4 or later)
  • Python 3.9+
  • ANEMLL v0.3.5

Download

Option 1: Using huggingface-cli

pip install huggingface_hub
huggingface-cli download anemll/anemll-vibethinker-1.5b-state-transition \
  --local-dir ~/Models/ANE/vibethinker_1.5b_xstates

Option 2: Using git lfs

git lfs install
git clone https://huggingface.co/anemll/anemll-vibethinker-1.5b-state-transition \
  ~/Models/ANE/vibethinker_1.5b_xstates

Setup ANEMLL

git clone https://github.com/anemll/anemll.git
cd anemll
git checkout 0.3.5-staging

./create_uv_env.sh
source env-anemll/bin/activate
./install_dependencies.sh

Run the Demo

Quick Start (default prompt)

python examples/variable_context_demo.py \
  --meta ~/Models/ANE/vibethinker_1.5b_xstates/meta.yaml

This generates a Tic Tac Toe game in Python, demonstrating context transitions and overflow handling live in the terminal.

Custom Prompt

python examples/variable_context_demo.py \
  --meta ~/Models/ANE/vibethinker_1.5b_xstates/meta.yaml \
  --prompt "Write a complete snake game in Python using curses"

Time-Limited Generation (5 minutes)

python examples/variable_context_demo.py \
  --meta ~/Models/ANE/vibethinker_1.5b_xstates/meta.yaml \
  --max-time 300

Full Command with All Options

python examples/variable_context_demo.py \
  --meta ~/Models/ANE/vibethinker_1.5b_xstates/meta.yaml \
  --prompt "Explain the theory of relativity in detail" \
  --max-tokens 24000 \
  --sampling-mode auto \
  --seed 123 \
  --max-context-size 4096 \
  --overflow-reserve-batches 9

Advanced: Direct Runner

For full control over all parameters, use the state transition runner directly:

python tests/dev/state_transition_growing_inference.py \
  --meta ~/Models/ANE/vibethinker_1.5b_xstates/meta.yaml \
  --max-tokens 24000 \
  --prompt "Write a game of Tic Tac Toe in python, your code should play against human" \
  --prefill-mode batch-prefill \
  --sampling-mode auto \
  --max-context-size 4096 \
  --overflow-preserve-prompt \
  --overflow-policy shift-refill \
  --overflow-reserve-batches 9 \
  --live-events \
  --seed 123

What to Expect

During generation you will see live events showing context transitions and compaction:

[transition] ctx512 -> ctx1024 at tokens=512 (2.3 ms, avg decode 45.2 t/s)
[transition] ctx1024 -> ctx2048 at tokens=1024 (3.1 ms, avg decode 44.8 t/s)
[transition] ctx2048 -> ctx4096 at tokens=2048 (4.5 ms, avg decode 43.5 t/s)
...
[compact] ctx4096 drop=3200 keep=896 (1250 ms, avg decode 42.1 t/s)

  • Transitions happen automatically when the current context fills up (~2-5 ms, effectively invisible to the user)
  • Compactions re-prefill the KV cache with the most recent tokens when the largest context is exhausted (~1-2 seconds)
  • Between events, tokens stream continuously to stdout

At the end, a summary shows per-context decode speed, total tokens, and compaction stats.
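
If you want these events programmatically rather than scraped from the terminal, the lines are easy to parse. A throwaway sketch; the exact event format may change between ANEMLL versions:

import re

TRANSITION = re.compile(r"\[transition\] ctx(\d+) -> ctx(\d+) at tokens=(\d+)")
COMPACT = re.compile(r"\[compact\] ctx(\d+) drop=(\d+) keep=(\d+)")

with open("run.log") as f:
    for line in f:
        if m := TRANSITION.search(line):
            src, dst, at = map(int, m.groups())
            print(f"grew ctx {src} -> {dst} at token {at}")
        elif m := COMPACT.search(line):
            ctx, drop, keep = map(int, m.groups())
            # In the sample output above, drop + keep == ctx (3200 + 896 = 4096).
            print(f"compacted ctx {ctx}: dropped {drop}, kept {keep}")

Pipe the demo's stdout through tee (e.g. append "| tee run.log" to the command) to capture the events while still watching them live.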

Why FP32 Attention?

VibeThinker has unusually large Q/K projection biases in layer 0 that cause the FP16 attention logits to overflow on the ANE, which produces gibberish output. The fix is to run layer-0 attention in FP32 on the CPU (the qwen25_FFN_attn_fp32_statex_chunk_01of03.mlmodelc file). This is handled automatically; no user configuration is needed.
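
The failure mode is easy to reproduce in isolation: FP16 saturates at 65504, so an attention logit inflated by a large bias overflows to inf, and the softmax normalization then yields NaN. Illustrative numbers only, not the actual layer-0 values:

import numpy as np

x = np.float16(300.0)          # stand-in for a bias-inflated Q.K attention logit
print(x * x)                   # inf -- 90000 exceeds FP16's max of 65504
print((x * x) - (x * x))       # nan -- inf - inf, as in softmax's max-subtraction
print(np.float32(300.0) ** 2)  # 90000.0 -- the same value is fine in FP32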

License

The converted model inherits the license from WeiboAI/VibeThinker-1.5B. The ANEMLL converter is MIT licensed.
