---
pipeline_tag: robotics
library_name: transformers
license: cc-by-nc-sa-4.0
tags:
- vision-language-model
- manipulation
- robotics
---

<div align="center">
<video src="https://cdn-uploads.huggingface.co/production/uploads/678123194248fde89e4fc9bf/_cbIWKHPzffRxIpfmqdFG.mp4"
controls autoplay muted playsinline loop width="720"></video>
<p><em>Best viewed with sound on</em></p>
</div>

# F1: A Vision Language Action Model Bridging<br>Understanding and Generation to Actions

[Paper](https://arxiv.org/abs/2509.06951) | [Code](https://github.com/InternRobotics/F1-VLA) | [Project Page](https://aopolin-lv.github.io/F1-VLA)

## Key Innovations

- **Predictive Inverse Dynamics**: Visual foresight generation for planning-based control
- **Mixture-of-Transformer**: Three specialized experts (Understanding, Generation, Action); see the schematic sketch after this list
- **Three-Stage Training**: Progressive alignment, pretraining, and adaptation

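To make the architecture bullets concrete, the snippet below sketches how three expert branches over a shared token stream can implement a predict-then-act flow: the generation expert produces visual foresight tokens, and the action expert decodes actions conditioned on them. This is a minimal illustrative sketch; the module names, dimensions, routing, and the 16-step action chunk are assumptions for exposition, not the released F1 implementation.

```python
# Illustrative sketch only -- not the released F1 architecture.
# Three expert branches (understanding, generation, action) over a shared
# multimodal token stream; names, sizes, and routing are assumptions.
import torch
import torch.nn as nn


class ExpertBlock(nn.Module):
    """A small Transformer encoder standing in for one specialized expert."""

    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.encoder(tokens)


class MixtureOfTransformerSketch(nn.Module):
    """Understanding -> foresight generation -> action decoding."""

    def __init__(self, dim: int = 512, action_dim: int = 7, horizon: int = 16):
        super().__init__()
        self.understanding = ExpertBlock(dim)  # encodes observation + instruction tokens
        self.generation = ExpertBlock(dim)     # predicts visual foresight tokens
        self.action = ExpertBlock(dim)         # decodes an action chunk
        self.action_head = nn.Linear(dim, action_dim)
        self.horizon = horizon

    def forward(self, multimodal_tokens: torch.Tensor) -> torch.Tensor:
        context = self.understanding(multimodal_tokens)
        foresight = self.generation(context)
        # Predictive-inverse-dynamics flavor: act on context *and* predicted future.
        fused = self.action(torch.cat([context, foresight], dim=1))
        return self.action_head(fused[:, -self.horizon:, :])


if __name__ == "__main__":
    model = MixtureOfTransformerSketch()
    tokens = torch.randn(2, 64, 512)   # batch of 2, 64 tokens, width 512
    print(model(tokens).shape)         # torch.Size([2, 16, 7])
```

In this sketch the foresight tokens are simply concatenated with the context before action decoding; the actual expert routing, attention sharing, and training objectives are described in the paper and repository.
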
## Real-World Robot Experiments

<!-- <div align="center">
<video src="https://cdn-uploads.huggingface.co/production/uploads/678123194248fde89e4fc9bf/FPZ45NJd9_B_T1gOP8QVf.qt"
controls autoplay muted playsinline loop width="720"></video>
<p><em>9 diverse manipulation tasks including pick-and-place, handover, and complex object manipulation</em></p>
</div> -->

<div style="display: flex; flex-direction: column; align-items: center; gap: 10px;">
<!-- First Row -->
<div style="display: flex; justify-content: center; align-items: center; gap: 10px;">
<video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/arx_v2_long.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/arx_v1_dyna.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="210" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/franka_v1_sweep.mp4" type="video/mp4">
</video>
</div>
<!-- Second Row -->
<div style="display: flex; justify-content: center; align-items: center; gap: 10px;">
<video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/genie_v2_handover.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="200" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/genie_v3_tea.mp4" type="video/mp4">
</video>
<video controls autoplay loop muted width="210" style="border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
<source src="https://huggingface.co/spaces/Jia-Zeng/Robot_demos/resolve/main/genie_v1_flower.mp4" type="video/mp4">
</video>
</div>
<p><em>Diverse manipulation tasks across multiple robot platforms.</em></p>
</div>

## Performance Summary

Success rates (%) compared with the π0 baseline; the improvement column is in percentage points.

| Task | Platform | F1 | π0 | Improvement |
|:--------:|:------------:|:------------------:|:------------:|:---------------:|
| Multi-task | Genie-1 | 82.2% | 65.2% | +17.0% |
| Adaptation | Franka | 66.7% | 53.3% | +13.4% |
| Long-horizon | ARX LIFT II | 40.0% | 0.0% | +40.0% |
| Dynamic Env | ARX LIFT II | 66.7% | 33.3% | +33.4% |

## Usage

Please refer to our official repository, [F1-VLA](https://github.com/InternRobotics/F1-VLA), for usage instructions.

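As a minimal sketch of getting the released weights onto disk before following the repository instructions (the repo id below is a placeholder; substitute the id shown at the top of this model card):

```python
# Minimal sketch: download the checkpoint files locally, then follow the
# F1-VLA repository for inference and fine-tuning entry points.
# NOTE: the repo id is a placeholder; use the id of this model card.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="InternRobotics/F1-VLA")
print(f"Checkpoint downloaded to: {local_dir}")
```
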
## Citation

If you find our work helpful, please cite:

```bibtex
@article{f1_vla_2025,
  title={F1: A Vision Language Action Model Bridging Understanding and Generation to Actions},
  author={Qi Lv and Weijie Kong and Hao Li and Jia Zeng and Zherui Qiu and Delin Qu and Haoming Song and Qizhi Chen and Xiang Deng and Jiangmiao Pang},
  journal={arXiv preprint arXiv:2509.06951},
  year={2025},
  url={https://arxiv.org/abs/2509.06951}
}
```

## License

This work is licensed under [CC BY-NC-SA 4.0](LICENSE).

## Acknowledgements

This repository is based on [LeRobot](https://github.com/huggingface/lerobot), [any4lerobot](https://github.com/Tavish9/any4lerobot/), and [VAR](https://github.com/FoundationVision/VAR).