BitVLA (1.58-bit Vision-Language-Action) — Core AI
The first Vision-Language-Action model running fully on-device on iPhone, via Apple Core AI.
A Core AI conversion of lxsy/bitvla-bf16 (BitVLA, arXiv:2506.07530, MIT).
BitVLA takes an image + a natural-language instruction and predicts a 7-DoF robot end-effector action (Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper) — OpenVLA-style discrete action tokens. Every transformer weight, in both the BitNet b1.58-2B language model and the BitSigLIP-SO400M vision tower, is 1.58-bit ternary ({-1, 0, +1}) — ~32× smaller than a full-precision VLA (OpenVLA-7.5B ≈ 15 GB), so the whole policy fits and runs on a phone GPU. The language model's per-layer linears run a custom 2-bit packed-ternary Metal kernel.
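As a rough illustration of what "1.58-bit ternary" and "2-bit packed" mean in practice, here is a minimal pure-Python sketch. It assumes the per-tensor absmean rule named in the Architecture section (scale by the mean absolute weight, round, clip to {-1, 0, +1}) and a hypothetical bit layout of four weights per byte, lowest bits first. The real Metal kernel's layout is not described in this card, so the encoding below is an assumption, not the shipped format.

```python
def absmean_ternarize(weights, eps=1e-5):
    """Quantize a flat list of floats to {-1, 0, +1} with one per-tensor scale."""
    scale = sum(abs(w) for w in weights) / len(weights)
    # Round to the nearest integer, then clip into the ternary range.
    ternary = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return ternary, scale


def pack_ternary(ternary):
    """Pack ternary values four per byte (hypothetical code: value + 1 in 2 bits)."""
    if len(ternary) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    packed = bytearray()
    for i in range(0, len(ternary), 4):
        byte = 0
        for j, t in enumerate(ternary[i:i + 4]):
            if t not in (-1, 0, 1):
                raise ValueError(f"not a ternary value: {t}")
            byte |= (t + 1) << (2 * j)  # -1 -> 0b00, 0 -> 0b01, +1 -> 0b10
        packed.append(byte)
    return bytes(packed)


def unpack_ternary(packed):
    """Inverse of pack_ternary."""
    return [((byte >> (2 * j)) & 0b11) - 1 for byte in packed for j in range(4)]


# Toy weights, not taken from the model.
weights = [0.9, -0.05, 0.4, -1.2, 0.0, 0.02, -0.6, 0.3]
ternary, scale = absmean_ternarize(weights)
packed = pack_ternary(ternary)
print(ternary, "scale:", scale, "bytes:", len(packed))
```

Dequantization is `t * scale` for each unpacked value. The activation side (per-token int8) is left out of this sketch.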
Part of the Core AI model zoo — on-device AI for iPhone & Mac through Apple Core AI: https://github.com/john-rocky/coreai-model-zoo
On-device (iPhone 17 Pro — Core AI GPU, greedy)
One image + instruction → 7-DoF action:
| stage | warm latency |
|---|---|
| vision encode (BitSigLIP-SO400M, 256 tokens) | 0.13 s |
| LLM prefill (≈308 positions, M=1 ternary kernel) | 8.8 s |
| action decode (7 tokens) | 0.26 s |
Resident ≈ 2 GB, no jetsam. On-device output matches the official model: 6/7 action tokens identical, 7-DoF action effectively identical, vision embeddings at per-token cosine 0.999.
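A parity check of this kind can be sketched in a few lines. The snippet below is only an illustration with toy data: the function names are hypothetical, and the card does not say how the 0.999 figure was aggregated (mean or minimum over tokens), so the minimum is used here as an assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def per_token_cosine(device_embeds, reference_embeds):
    """One cosine value per token position (rows are token embeddings)."""
    if len(device_embeds) != len(reference_embeds):
        raise ValueError("token counts differ")
    return [cosine(d, r) for d, r in zip(device_embeds, reference_embeds)]


def count_matching_tokens(device_tokens, reference_tokens):
    """Number of positions where the two token sequences agree."""
    return sum(1 for d, r in zip(device_tokens, reference_tokens) if d == r)


# Toy stand-ins for on-device and reference outputs (not real model data).
reference = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
device = [[1.01, 1.99, 3.02], [0.49, -1.02, 2.01]]
sims = per_token_cosine(device, reference)
print("min per-token cosine:", min(sims))

matches = count_matching_tokens([5, 9, 3, 7, 1, 4, 2], [5, 9, 3, 7, 1, 4, 6])
print("matching action tokens:", matches, "of 7")
```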
What's in this repo
- h18p/: device-ready artifacts, AOT-compiled for the iPhone 17 Pro (h18p) GPU:
  - bitvla_vision/ (BitSigLIP tower)
  - bitvla_llm_act/ (BitNet LLM, 256-row action head + ternary kernel)
  - bitvla_device_data/ (preset-instruction text embeds, the 256-row action-token embed table, norm_stats, and a sample image, so the device needs no tokenizer or embedding table)
- aimodel/: portable source .aimodels (vision + LLM); re-AOT for another device with xcrun coreai-build compile <…>.aimodel --platform iOS --preferred-compute gpu --architecture <arch>.
Architecture
- LLM = BitNet b1.58 2B4T (30L, hidden 2560, FFN 6912, GQA 20/5 hd128, ReLU² FFN, SubLN, RoPE θ500000), W1.58-A8 (per-tensor absmean ternary weight + per-token int8 activation).
- Vision = BitSigLIP-SO400M (26L, hidden 1152, FFN 4304, patch14/224 → 256 tokens), ternary linears; fp16 activations on device.
- Connector = 2-layer MLP (1152→2560→2560), fp16.
- Action = OpenVLA discrete: 256 vision embeds spliced into the LLM; 7 action tokens from the vocab tail → 256-bin → BOUNDS-Q99 un-normalization (norm_stats, 27-dataset OXE mix; e.g. unnorm_key = bridge_orig).
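The action path in the last bullet (vocab-tail token → bin → un-normalized value) can be sketched as follows. This mirrors the convention in the public OpenVLA code: 256 evenly spaced edges on [-1, 1] give 255 bin centers, the bin index is counted back from the end of the vocabulary, and each dimension is rescaled with its q01/q99 bounds unless it is masked out. Whether this conversion uses the identical mapping is an assumption. The vocabulary size, statistics, and mask below are made-up placeholders, not values from norm_stats.

```python
def make_bin_centers(n_bins=256):
    """n_bins evenly spaced edges on [-1, 1] yield n_bins - 1 bin centers."""
    edges = [-1.0 + 2.0 * i / (n_bins - 1) for i in range(n_bins)]
    return [(lo + hi) / 2.0 for lo, hi in zip(edges[:-1], edges[1:])]


def decode_action(token_ids, vocab_size, q01, q99, mask, n_bins=256):
    """Map action token ids to continuous values (OpenVLA-style, assumed here)."""
    centers = make_bin_centers(n_bins)
    action = []
    for tok, lo, hi, use_stats in zip(token_ids, q01, q99, mask):
        # The last vocab entry maps to bin 0, the one before it to bin 1, etc.
        idx = max(0, min(len(centers) - 1, vocab_size - tok - 1))
        norm = centers[idx]  # normalized value in (-1, 1)
        # BOUNDS-Q99: rescale [-1, 1] to [q01, q99]; masked dims stay normalized.
        action.append(0.5 * (norm + 1.0) * (hi - lo) + lo if use_stats else norm)
    return action


# Hypothetical inputs for illustration only.
VOCAB_SIZE = 1000                                   # placeholder, not the real vocab
tokens = [872, 870, 880, 873, 871, 869, 999]        # 7 tokens from the vocab tail
q01 = [-0.03, -0.04, -0.03, -0.08, -0.09, -0.20, 0.0]
q99 = [0.03, 0.04, 0.03, 0.08, 0.09, 0.20, 1.0]
mask = [True] * 6 + [False]                         # gripper left un-scaled (assumed)

print(decode_action(tokens, VOCAB_SIZE, q01, q99, mask))
```

The output order follows the 7-DoF layout stated earlier: Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper.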
Conversion & how it works
Recipe, kernel notes, and the device gotchas (custom kernel must be AOT-compiled — it can't JIT on
device; the dynamic-shape LLM .aimodelc loads with expectFrequentReshapes=false; vision uses
fp16 activations because the in-graph A8 quant stalls the GPU) are in the zoo:
- card: https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/bitvla.md
- knowledge: https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/bitvla-1.58bit-vla.md
- conversion: https://github.com/john-rocky/coreai-model-zoo/tree/main/conversion
License
MIT, inherited from lxsy/bitvla-bf16 / BitVLA. This repo redistributes the converted Core AI artifacts; see the base model for the original terms.