V-JEPA 2 Robot Multi-Task Dataset & Models
Vision-based robot control data using V-JEPA 2 (ViT-L) latent representations from DeepMind Control Suite environments.
Dataset
| Task | Episodes | Transitions | Latent Dim | Action Dim | Success Rate |
|---|---|---|---|---|---|
| reacher_easy | 1,000 | 200,000 | 1024 | 2 | 28.9% |
| point_mass_easy | 1,000 | 200,000 | 1024 | 2 | 0.6% |
| cartpole_swingup | 1,000 | 200,000 | 1024 | 1 | 0.0% |
Each .npz file contains:
- `z_t` → V-JEPA 2 latent state embeddings (N × 1024)
- `a_t` → actions taken (N × action_dim)
- `z_next` → next-state latent embeddings (N × 1024)
- `rewards` → per-step rewards (N,)
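A minimal loading sketch for the layout above. The filename, episode length, and action dimension here are illustrative stand-ins (synthetic arrays are written first so the snippet is self-contained), not the released files:

```python
import numpy as np

# Synthetic stand-in with the documented layout (tiny N for illustration;
# the real files hold 200,000 transitions per task).
N, action_dim = 4, 2
np.savez("example.npz",
         z_t=np.zeros((N, 1024), dtype=np.float32),      # latent states
         a_t=np.zeros((N, action_dim), dtype=np.float32),  # actions
         z_next=np.zeros((N, 1024), dtype=np.float32),   # next-state latents
         rewards=np.zeros(N, dtype=np.float32))          # per-step rewards

data = np.load("example.npz")
z_t, a_t = data["z_t"], data["a_t"]
z_next, rewards = data["z_next"], data["rewards"]
```

Each row is one transition, so `z_t[i]`, `a_t[i]`, `z_next[i]`, and `rewards[i]` together form the tuple (s, a, s', r) used to train the dynamics and reward models.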
Models
For each task, we provide:
- 5× Dynamics Ensemble → `dyn_0.pt` to `dyn_4.pt` (MLP: z + a → z_next, ~1.58M params each)
- 1× Reward Model → `reward.pt` (MLP: z + a → reward, ~329K params)
Architecture
- Dynamics: `Linear(1024+a_dim, 512) → LN → ReLU → ×3 → Linear(512, 1024)` + residual connection
- Reward: `Linear(1024+a_dim, 256) → ReLU → ×2 → Linear(256, 1)`
- Ensemble diversity (weight cosine sim): ~0.60
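The layer shapes above suggest architectures along the following lines. This is a PyTorch sketch: reading "×3"/"×2" as three and two hidden blocks, and the residual connection as delta prediction, are assumptions, though both readings reproduce the stated ~1.58M and ~329K parameter counts for `a_dim = 2`:

```python
import torch
import torch.nn as nn

LATENT_DIM = 1024  # V-JEPA 2 ViT-L latent size


class DynamicsModel(nn.Module):
    """Residual MLP predicting z_next from (z, a) via a learned delta."""

    def __init__(self, action_dim: int, hidden: int = 512, depth: int = 3):
        super().__init__()
        layers = [nn.Linear(LATENT_DIM + action_dim, hidden),
                  nn.LayerNorm(hidden), nn.ReLU()]
        for _ in range(depth - 1):
            layers += [nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU()]
        layers.append(nn.Linear(hidden, LATENT_DIM))
        self.net = nn.Sequential(*layers)

    def forward(self, z: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        # Residual connection: the MLP outputs a correction to z.
        return z + self.net(torch.cat([z, a], dim=-1))


class RewardModel(nn.Module):
    """MLP predicting the per-step scalar reward from (z, a)."""

    def __init__(self, action_dim: int, hidden: int = 256, depth: int = 2):
        super().__init__()
        layers = [nn.Linear(LATENT_DIM + action_dim, hidden), nn.ReLU()]
        for _ in range(depth - 1):
            layers += [nn.Linear(hidden, hidden), nn.ReLU()]
        layers.append(nn.Linear(hidden, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, z: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([z, a], dim=-1)).squeeze(-1)
```

Predicting a delta rather than the full next latent keeps the network's output small relative to the (highly temporally coherent) latents, which typically stabilizes training.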
How It Was Built
- Expert policies collect episodes in dm_control environments
- Each frame rendered at 224×224, encoded with V-JEPA 2 ViT-L (8-frame sliding windows)
- Dynamics ensemble trained with random data splits + different seeds
- Reward model trained to predict per-step rewards from z_t + a_t
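The "random data splits + different seeds" recipe for the ensemble can be sketched as follows (function name and validation fraction are illustrative, not taken from the released training code):

```python
import numpy as np


def random_split(n: int, seed: int, val_frac: float = 0.1):
    """Shuffle transition indices with a per-member seed so each ensemble
    member trains and validates on a different partition of the same data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_val = int(n * val_frac)
    return idx[n_val:], idx[:n_val]  # (train_idx, val_idx)


# Five dynamics members -> five seeds -> five distinct train/val partitions.
splits = [random_split(200_000, seed=s) for s in range(5)]
```

Different splits and seeds give each member a slightly different view of the data and initialization, which is what produces the ensemble diversity (~0.60 weight cosine similarity) reported above.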
Training Details
- GPU: NVIDIA A100-SXM4-80GB (Prime Intellect)
- Total time: 5.4 hours
- Total cost: ~$7
- Dynamics val loss: ~0.0008 (reacher, point_mass), ~0.0002 (cartpole)
- Temporal coherence: >0.998 for all tasks
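The document does not define its temporal-coherence metric exactly; one plausible reading, sketched below, is the mean cosine similarity between each latent and its successor (values > 0.998 then mean consecutive latents are nearly parallel):

```python
import numpy as np


def temporal_coherence(z_t: np.ndarray, z_next: np.ndarray) -> float:
    """Mean cosine similarity between each latent and its next-step latent.
    One plausible definition of the reported metric, not the verified one."""
    num = np.sum(z_t * z_next, axis=-1)
    den = np.linalg.norm(z_t, axis=-1) * np.linalg.norm(z_next, axis=-1)
    return float(np.mean(num / (den + 1e-8)))
```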
Purpose
These world models are designed for "teach-by-showing": demonstrate a task via video, then use the learned dynamics together with CEM (cross-entropy method) planning to reproduce the shown behavior.
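The planning loop can be sketched as vanilla CEM over open-loop action sequences. All names, population sizes, bounds, and the batched NumPy callables standing in for the dynamics ensemble and reward model are illustrative assumptions:

```python
import numpy as np


def cem_plan(z0, dynamics, reward_fn, action_dim, horizon=10,
             pop=64, elites=8, iters=4, seed=0):
    """Cross-entropy method: sample action sequences, roll them out in
    latent space, refit a Gaussian to the highest-return elites, repeat.
    `dynamics(z, a)` and `reward_fn(z, a)` are batched callables."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(iters):
        seqs = rng.normal(mu, sigma, size=(pop, horizon, action_dim))
        seqs = np.clip(seqs, -1.0, 1.0)  # dm_control actions live in [-1, 1]
        returns = np.zeros(pop)
        z = np.repeat(z0[None, :], pop, axis=0)
        for t in range(horizon):
            a = seqs[:, t]
            returns += reward_fn(z, a)   # accumulate predicted reward
            z = dynamics(z, a)           # step the learned world model
        elite = seqs[np.argsort(returns)[-elites:]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-4
    return mu[0]  # execute the first action, then replan (MPC-style)
```

With the 5-member ensemble, `dynamics` would typically average member predictions (or penalize their disagreement) to discourage the planner from exploiting model error.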