V-JEPA 2 Robot Multi-Task Dataset & Models
Vision-based robot control data using V-JEPA 2 (ViT-L) latent representations from DeepMind Control Suite environments.
Dataset
| Task | Episodes | Transitions | Latent Dim | Action Dim | Success Rate |
|---|---|---|---|---|---|
| reacher_easy | 1,000 | 200,000 | 1024 | 2 | 28.9% |
| point_mass_easy | 1,000 | 200,000 | 1024 | 2 | 0.6% |
| cartpole_swingup | 1,000 | 200,000 | 1024 | 1 | 0.0% |
Each .npz file contains:
- `z_t` → V-JEPA 2 latent state embeddings (N × 1024)
- `a_t` → actions taken (N × action_dim)
- `z_next` → next-state latent embeddings (N × 1024)
- `rewards` → per-step rewards (N,)
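A minimal loading sketch for the layout above. The filename, episode length, and action dimension here are illustrative stand-ins (synthetic arrays are written first so the snippet is self-contained), not the released files:

```python
import numpy as np

# Synthetic stand-in with the documented layout (tiny N for illustration;
# the real files hold 200,000 transitions per task).
N, action_dim = 4, 2
np.savez("example.npz",
         z_t=np.zeros((N, 1024), dtype=np.float32),      # latent states
         a_t=np.zeros((N, action_dim), dtype=np.float32),  # actions
         z_next=np.zeros((N, 1024), dtype=np.float32),   # next-state latents
         rewards=np.zeros(N, dtype=np.float32))          # per-step rewards

data = np.load("example.npz")
z_t, a_t = data["z_t"], data["a_t"]
z_next, rewards = data["z_next"], data["rewards"]
```

Each row is one transition, so `z_t[i]`, `a_t[i]`, `z_next[i]`, and `rewards[i]` together form the tuple (s, a, s', r) used to train the dynamics and reward models.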
Models
For each task, we provide:
- 5× Dynamics Ensemble → `dyn_0.pt` to `dyn_4.pt` (MLP: z + a → z_next, ~1.58M params each)
- 1× Reward Model → `reward.pt` (MLP: z + a → reward, ~329K params)
Architecture
- Dynamics: `Linear(1024+a_dim, 512) → LN → ReLU → ×3 → Linear(512, 1024)` + residual connection
- Reward: `Linear(1024+a_dim, 256) → ReLU → ×2 → Linear(256, 1)`
- Ensemble diversity (weight cosine sim): ~0.60
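The layer shapes above suggest architectures along the following lines. This is a PyTorch sketch: reading "×3"/"×2" as three and two hidden blocks, and the residual connection as delta prediction, are assumptions, though both readings reproduce the stated ~1.58M and ~329K parameter counts for `a_dim = 2`:

```python
import torch
import torch.nn as nn

LATENT_DIM = 1024  # V-JEPA 2 ViT-L latent size


class DynamicsModel(nn.Module):
    """Residual MLP predicting z_next from (z, a) via a learned delta."""

    def __init__(self, action_dim: int, hidden: int = 512, depth: int = 3):
        super().__init__()
        layers = [nn.Linear(LATENT_DIM + action_dim, hidden),
                  nn.LayerNorm(hidden), nn.ReLU()]
        for _ in range(depth - 1):
            layers += [nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU()]
        layers.append(nn.Linear(hidden, LATENT_DIM))
        self.net = nn.Sequential(*layers)

    def forward(self, z: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        # Residual connection: the MLP outputs a correction to z.
        return z + self.net(torch.cat([z, a], dim=-1))


class RewardModel(nn.Module):
    """MLP predicting the per-step scalar reward from (z, a)."""

    def __init__(self, action_dim: int, hidden: int = 256, depth: int = 2):
        super().__init__()
        layers = [nn.Linear(LATENT_DIM + action_dim, hidden), nn.ReLU()]
        for _ in range(depth - 1):
            layers += [nn.Linear(hidden, hidden), nn.ReLU()]
        layers.append(nn.Linear(hidden, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, z: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([z, a], dim=-1)).squeeze(-1)
```

Predicting a delta rather than the full next latent keeps the network's output small relative to the (highly temporally coherent) latents, which typically stabilizes training.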
How It Was Built
- Expert policies collect episodes in dm_control environments
- Each frame rendered at 224×224, encoded with V-JEPA 2 ViT-L (8-frame sliding windows)
- Dynamics ensemble trained with random data splits + different seeds
- Reward model trained to predict per-step rewards from z_t + a_t
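The "random data splits + different seeds" recipe for the ensemble can be sketched as follows (function name and validation fraction are illustrative, not taken from the released training code):

```python
import numpy as np


def random_split(n: int, seed: int, val_frac: float = 0.1):
    """Shuffle transition indices with a per-member seed so each ensemble
    member trains and validates on a different partition of the same data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_val = int(n * val_frac)
    return idx[n_val:], idx[:n_val]  # (train_idx, val_idx)


# Five dynamics members -> five seeds -> five distinct train/val partitions.
splits = [random_split(200_000, seed=s) for s in range(5)]
```

Different splits and seeds give each member a slightly different view of the data and initialization, which is what produces the ensemble diversity (~0.60 weight cosine similarity) reported above.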
Training Details
- GPU: NVIDIA A100-SXM4-80GB (Prime Intellect)
- Total time: 5.4 hours
- Total cost: ~$7
- Dynamics val loss: ~0.0008 (reacher, point_mass), ~0.0002 (cartpole)
- Temporal coherence: >0.998 for all tasks
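The document does not define its temporal-coherence metric exactly; one plausible reading, sketched below, is the mean cosine similarity between each latent and its successor (values > 0.998 then mean consecutive latents are nearly parallel):

```python
import numpy as np


def temporal_coherence(z_t: np.ndarray, z_next: np.ndarray) -> float:
    """Mean cosine similarity between each latent and its next-step latent.
    One plausible definition of the reported metric, not the verified one."""
    num = np.sum(z_t * z_next, axis=-1)
    den = np.linalg.norm(z_t, axis=-1) * np.linalg.norm(z_next, axis=-1)
    return float(np.mean(num / (den + 1e-8)))
```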
Purpose
These world models are designed for "teach-by-showing": demonstrate a task via video, then use the learned dynamics together with CEM (cross-entropy method) planning to reproduce the shown behavior.
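The planning loop can be sketched as vanilla CEM over open-loop action sequences. All names, population sizes, bounds, and the batched NumPy callables standing in for the dynamics ensemble and reward model are illustrative assumptions:

```python
import numpy as np


def cem_plan(z0, dynamics, reward_fn, action_dim, horizon=10,
             pop=64, elites=8, iters=4, seed=0):
    """Cross-entropy method: sample action sequences, roll them out in
    latent space, refit a Gaussian to the highest-return elites, repeat.
    `dynamics(z, a)` and `reward_fn(z, a)` are batched callables."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(iters):
        seqs = rng.normal(mu, sigma, size=(pop, horizon, action_dim))
        seqs = np.clip(seqs, -1.0, 1.0)  # dm_control actions live in [-1, 1]
        returns = np.zeros(pop)
        z = np.repeat(z0[None, :], pop, axis=0)
        for t in range(horizon):
            a = seqs[:, t]
            returns += reward_fn(z, a)   # accumulate predicted reward
            z = dynamics(z, a)           # step the learned world model
        elite = seqs[np.argsort(returns)[-elites:]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-4
    return mu[0]  # execute the first action, then replan (MPC-style)
```

With the 5-member ensemble, `dynamics` would typically average member predictions (or penalize their disagreement) to discourage the planner from exploiting model error.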