BitVLA (1.58-bit Vision-Language-Action) — Core AI

The first Vision-Language-Action model running fully on-device on iPhone, via Apple Core AI. A Core AI conversion of lxsy/bitvla-bf16 (BitVLA, arXiv:2506.07530, MIT).

BitVLA takes an image + a natural-language instruction and predicts a 7-DoF robot end-effector action (Δx, Δy, Δz, Δroll, Δpitch, Δyaw, gripper) — OpenVLA-style discrete action tokens. Every transformer weight, in both the BitNet b1.58-2B language model and the BitSigLIP-SO400M vision tower, is 1.58-bit ternary ({-1, 0, +1}) — ~32× smaller than a full-precision VLA (OpenVLA-7.5B ≈ 15 GB), so the whole policy fits and runs on a phone GPU. The language model's per-layer linears run a custom 2-bit packed-ternary Metal kernel.
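The "1.58-bit" figure is log2(3), the information content of a three-valued weight; a 2-bit packed format stores four such weights per byte. The sketch below illustrates per-tensor absmean ternarization followed by 2-bit packing in NumPy. It is not the Metal kernel from this repo: the function names, the `q + 1` code mapping, and the low-bits-first byte layout are assumptions made for illustration.

```python
import numpy as np

def absmean_ternarize(w, eps=1e-5):
    """Quantize a weight tensor to {-1, 0, +1} with one per-tensor scale.

    The scale is the mean absolute weight (absmean); weights are divided
    by it, rounded, and clipped to the ternary range.
    """
    scale = float(np.mean(np.abs(w))) + eps
    q = np.clip(np.rint(w / scale), -1, 1).astype(np.int8)
    return q, scale

def pack_ternary_2bit(q):
    """Pack ternary values four per byte.

    Hypothetical layout: each value is stored as the 2-bit code q + 1,
    and the first value of each group of four sits in the lowest bits.
    """
    codes = (q.reshape(-1) + 1).astype(np.uint8)        # {-1,0,1} -> {0,1,2}
    pad = (-codes.size) % 4                             # pad to a multiple of 4
    codes = np.concatenate([codes, np.ones(pad, dtype=np.uint8)])  # code 1 == weight 0
    g = codes.reshape(-1, 4)
    return (g[:, 0] | (g[:, 1] << 2) | (g[:, 2] << 4) | (g[:, 3] << 6)).astype(np.uint8)

def unpack_ternary_2bit(packed, count):
    """Invert pack_ternary_2bit, returning the first `count` ternary values."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 0b11
    return codes.reshape(-1)[:count].astype(np.int8) - 1

if __name__ == "__main__":
    w = np.random.default_rng(0).normal(size=(8, 10)).astype(np.float32)
    q, scale = absmean_ternarize(w)
    packed = pack_ternary_2bit(q)
    print(q.size, "weights ->", packed.size, "bytes, scale", scale)
```

At two bits per weight, the packed tensor takes one eighth of the bytes of an fp16 copy, plus a single scale per tensor.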

Part of the Core AI model zoo — on-device AI for iPhone & Mac through Apple Core AI: https://github.com/john-rocky/coreai-model-zoo

On-device (iPhone 17 Pro — Core AI GPU, greedy)

One image + instruction → 7-DoF action:

Stage                                              Warm
vision encode (BitSigLIP-SO400M, 256 tokens)       0.13 s
LLM prefill (≈308 positions, M=1 ternary kernel)   8.8 s
action decode (7 tokens)                           0.26 s

Resident memory ≈ 2 GB, with no jetsam (memory-pressure) kills. On-device output closely matches the official model: 6/7 action tokens identical, 7-DoF action effectively identical, vision embeddings at per-token cosine 0.999.
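A parity check of this kind can be sketched in a few lines of NumPy. The sketch assumes the device and reference runs have each been dumped as a (tokens, dim) embedding array and a list of 7 action token ids; the function names and the synthetic data are hypothetical and are not part of this repo.

```python
import numpy as np

def per_token_cosine(a, b, eps=1e-8):
    """Cosine similarity between matching rows of two (tokens, dim) arrays."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return num / den

def token_agreement(device_tokens, reference_tokens):
    """Return (number of identical positions, total positions)."""
    matches = sum(int(d == r) for d, r in zip(device_tokens, reference_tokens))
    return matches, len(reference_tokens)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for real dumps: 256 vision tokens of width 1152.
    reference = rng.normal(size=(256, 1152))
    device = reference + 0.01 * rng.normal(size=reference.shape)
    cos = per_token_cosine(device, reference)
    print("min per-token cosine:", cos.min())
    print("token agreement:", token_agreement([5, 9, 3, 7, 1, 4, 2], [5, 9, 3, 7, 1, 4, 8]))
```

Reporting the minimum rather than the mean cosine makes a single badly mismatched token visible.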

What's in this repo

  • h18p/ — device-ready, AOT-compiled for the iPhone 17 Pro (h18p) GPU: bitvla_vision/ (BitSigLIP tower), bitvla_llm_act/ (BitNet LLM, 256-row action head + ternary kernel), bitvla_device_data/ (preset-instruction text embeds + the 256-row action-token embed table + norm_stats + a sample image — so the device needs no tokenizer or embedding table).
  • aimodel/ — portable source .aimodels (vision + LLM); re-AOT for another device with xcrun coreai-build compile <…>.aimodel --platform iOS --preferred-compute gpu --architecture <arch>.

Architecture

  • LLM = BitNet b1.58 2B4T (30L, hidden 2560, FFN 6912, GQA 20/5 hd128, ReLU² FFN, SubLN, RoPE θ500000), W1.58-A8 (per-tensor absmean ternary weight + per-token int8 activation).
  • Vision = BitSigLIP-SO400M (26L, hidden 1152, FFN 4304, patch14/224 → 256 tokens), ternary linears; fp16 activations on device.
  • Connector = 2-layer MLP (1152→2560→2560), fp16.
  • Action = OpenVLA discrete: 256 vision embeds spliced into the LLM; 7 action tokens from the vocab tail → 256-bin → BOUNDS-Q99 un-normalization (norm_stats, 27-dataset OXE mix; e.g. unnorm_key = bridge_orig).
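The last step of the action path, from bin index to a physical 7-DoF action, can be sketched as follows. This follows the OpenVLA convention as I understand it (256 uniform bin edges over [-1, 1], giving 255 bin centers, then a q01/q99 rescale); how the 256-row action head's rows map to bin indices in this conversion is not stated here, so the sketch starts from bin indices. The `q01`, `q99`, and `mask` values in the demo are made-up placeholders, not the real norm_stats, and leaving the gripper dimension un-rescaled is likewise an assumption.

```python
import numpy as np

N_BINS = 256
# Assumption (OpenVLA convention): 256 uniform edges over [-1, 1] -> 255 centers.
_EDGES = np.linspace(-1.0, 1.0, N_BINS)
BIN_CENTERS = (_EDGES[:-1] + _EDGES[1:]) / 2.0

def decode_action(bin_ids, q01, q99, mask):
    """Map 7 bin indices to an action using 1st/99th-percentile bounds.

    bin_ids : 7 integers, one per action dimension
    q01,q99 : per-dimension lower/upper bounds from the dataset statistics
    mask    : True where a dimension is rescaled, False to keep it in [-1, 1]
    """
    idx = np.clip(np.asarray(bin_ids), 0, BIN_CENTERS.size - 1)
    normalized = BIN_CENTERS[idx]                       # values in (-1, 1)
    q01 = np.asarray(q01, dtype=np.float64)
    q99 = np.asarray(q99, dtype=np.float64)
    rescaled = 0.5 * (normalized + 1.0) * (q99 - q01) + q01
    return np.where(np.asarray(mask, dtype=bool), rescaled, normalized)

if __name__ == "__main__":
    # Placeholder statistics for illustration only.
    q01 = [-0.05] * 6 + [0.0]
    q99 = [0.05] * 6 + [1.0]
    mask = [True] * 6 + [False]   # assumption: gripper is not rescaled
    print(decode_action([120, 130, 127, 100, 150, 127, 254], q01, q99, mask))
```

On device, the real bounds would come from the norm_stats entry selected by unnorm_key (for example bridge_orig).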

Conversion & how it works

The recipe, kernel notes, and device gotchas (the custom kernel must be AOT-compiled because it can't JIT on device; the dynamic-shape LLM .aimodelc loads with expectFrequentReshapes=false; vision uses fp16 activations because the in-graph A8 quant stalls the GPU) are documented in the Core AI model zoo linked above.

License

MIT, inherited from lxsy/bitvla-bf16 (BitVLA). This repo redistributes the converted Core AI artifacts; see the base model for the original terms.
