# UniVLA: Unified Vision-Language-Action Model

A general-purpose VLA model designed to unify vision, language, and action for robotics and autonomous driving.

[Technical Report] | [Model Weights] | [Project Page]
## News

- 2025.06.27: Code released for robotic simulations.
- 2025.06.25: Paper released on arXiv.
## Highlights

- Unified Vision-Language-Action Model: supports image grounding, video generation, and action prediction.
- Strong Performance on Several Robotics Benchmarks: supports CALVIN, LIBERO, and SimplerEnv.
- Interleaved Video Training: supports interleaved vision-action training formulated as a Markov decision process.
- Broader Applications: real-robot ALOHA and autonomous driving.
## Repo TODO List
- Policy learning for CALVIN, LIBERO, and SimplerEnv.
- Support for evaluation.
- World model pretraining for video generation.
- Example for real-robot ALOHA.
- Support for autonomous driving.
- Support for general grounding.
## Experiments

### Emu3 Pretraining Models
You can download the pretrained models from Hugging Face via the links provided here.
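For example, the checkpoints can be fetched with the `huggingface-cli` tool. The repository ID and target directory below are placeholders; substitute the actual model repository linked here.

```bash
pip install -U "huggingface_hub[cli]"
# Placeholder repo ID and local path; replace with the pretraining model linked above.
huggingface-cli download <ORG>/<MODEL_NAME> --local-dir ./pretrained/emu3
```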
### World Model Training

More details can be found in the World Model Training document.

```bash
# train the world model
bash scripts/pretrain/train_video_1node.sh
```
This model serves as the pretrained initialization for the downstream policy-learning tasks, such as CALVIN, LIBERO, and SimplerEnv.
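Putting the stages together, the overall recipe looks roughly like the sketch below; how the pretrained world-model checkpoint is passed to the fine-tuning stage is assumed to be handled inside the respective scripts, so check their arguments before running.

```bash
# Stage 1: world-model (video) pretraining
bash scripts/pretrain/train_video_1node.sh

# Stage 2: benchmark-specific policy fine-tuning, e.g. CALVIN video SFT
bash scripts/simulator/calvin/train_calvin_abcd_video.sh
```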
### 1. CALVIN Benchmark
| Method | Mode | Setting | AVG | CKPT |
|---|---|---|---|---|
| UniVLA | video sft | ABCD->D | 4.63 (5x:4.71) | huggingface |
Note: 5× means 5× the inference steps, i.e., 180 steps in total.
#### Training

- A single-node training script is provided here; multi-node training is recommended (see the launch sketch below).

```bash
# video sft
bash scripts/simulator/calvin/train_calvin_abcd_video.sh
```
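For multi-node runs, a minimal `torchrun` launch sketch is shown below; the entry point and its arguments are placeholders, since the actual launcher and flags live inside `scripts/simulator/calvin/train_calvin_abcd_video.sh`.

```bash
# Sketch only: run the same command on every node, changing NODE_RANK per node.
MASTER_ADDR=10.0.0.1   # example IP of the rank-0 node
NNODES=2               # total number of nodes
NODE_RANK=0            # 0 on the master node, 1..NNODES-1 on the others

torchrun \
  --nnodes=$NNODES \
  --node_rank=$NODE_RANK \
  --nproc_per_node=8 \
  --master_addr=$MASTER_ADDR \
  --master_port=29500 \
  train/train.py  # placeholder entry point; copy the real one from the shell script
```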
### 2. LIBERO Benchmark

| Method | Mode | SPATIAL | OBJECTS | GOAL | LONG (10) | AVG | CKPT |
|---|---|---|---|---|---|---|---|
| UniVLA | img sft | 97.0 | 99.0 | 92.6 | 90.8 | 94.8 | huggingface |
| UniVLA | video sft | 95.4 | 98.8 | 93.6 | 94.0 | 95.5 | huggingface |
#### Training

```bash
bash scripts/simulator/libero/train_libero_video.sh
```
### 3. SimplerEnv Benchmark
| Method | Robot | Mode | Put Spoon | Put Carrot | Stack Block | Put Eggplant | AVG | CKPT |
|---|---|---|---|---|---|---|---|---|
| UniVLA | Bridge(WidowX) | video sft | 83.3 | 66.7 | 33.3 | 95.8 | 69.8 | huggingface |
#### Training

```bash
bash scripts/simulator/simplerenv/train_simplerenv_bridge_video.sh
```
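If you only want to use a subset of the GPUs on a node, the standard `CUDA_VISIBLE_DEVICES` mechanism should work, assuming the script relies on PyTorch's default device visibility:

```bash
# Restrict training to the first four GPUs (adjust to your machine)
CUDA_VISIBLE_DEVICES=0,1,2,3 bash scripts/simulator/simplerenv/train_simplerenv_bridge_video.sh
```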
## Setup

Here we provide a conda environment setup for the project.

```bash
conda create -n emu_vla python=3.10
conda activate emu_vla
pip install -r requirements.txt
```
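As a quick sanity check that the environment is usable (assuming PyTorch is installed via `requirements.txt`):

```bash
# Verify that PyTorch imports and sees the GPUs (assumes torch is pulled in by requirements.txt)
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```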
For benchmark-specific setup, training, and evaluation, refer to the corresponding documents.
## Code Structure

```
OmniSim/
├── configs/       # Model configuration files
├── models/        # Tokenizer and diffusion test
├── train/         # Training dataset and pipeline
├── reference/     # Reference code
│   ├── Emu3/      # Base code
│   └── RoboVLMs/  # Evaluation code
├── scripts/       # Shell scripts for training & evaluation
├── tools/         # Data preprocessing tools
└── README.md      # Project description and user guide
```
## Acknowledgement

Our work is built upon the following projects; thanks for their great open-source work!
## Citation

If you find this project useful, please consider citing our work:
```bibtex
@article{wang2025unified,
  title={Unified Vision-Language-Action Model},
  author={Wang, Yuqi and Li, Xinghang and Wang, Wenxuan and Zhang, Junbo and Li, Yingyan and Chen, Yuntao and Wang, Xinlong and Zhang, Zhaoxiang},
  journal={arXiv preprint arXiv:2506.19850},
  year={2025}
}
```