V-JEPA 2 Robot Multi-Task Dataset & Models

Vision-based robot control data using V-JEPA 2 (ViT-L) latent representations from DeepMind Control Suite environments.

πŸ“Š Dataset

| Task | Episodes | Transitions | Latent Dim | Action Dim | Success Rate |
|---|---|---|---|---|---|
| reacher_easy | 1,000 | 200,000 | 1024 | 2 | 28.9% |
| point_mass_easy | 1,000 | 200,000 | 1024 | 2 | 0.6% |
| cartpole_swingup | 1,000 | 200,000 | 1024 | 1 | 0.0% |

Each .npz file contains:

  • z_t β€” V-JEPA 2 latent state embeddings (N Γ— 1024)
  • a_t β€” actions taken (N Γ— action_dim)
  • z_next β€” next-state latent embeddings (N Γ— 1024)
  • rewards β€” per-step rewards (N,)
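A minimal sketch of reading one episode file in the layout above, using NumPy's standard `.npz` interface (the file name here is hypothetical, and the synthetic data exists only to make the example self-contained):

```python
import numpy as np

# Write a tiny synthetic episode in the documented .npz layout
# (z_t, a_t, z_next, rewards). Shapes follow the dataset card:
# latent dim 1024, action dim 2 (reacher_easy).
N, LATENT, A_DIM = 200, 1024, 2
rng = np.random.default_rng(0)
np.savez(
    "episode_0000.npz",  # hypothetical file name
    z_t=rng.standard_normal((N, LATENT)).astype(np.float32),
    a_t=rng.standard_normal((N, A_DIM)).astype(np.float32),
    z_next=rng.standard_normal((N, LATENT)).astype(np.float32),
    rewards=rng.standard_normal(N).astype(np.float32),
)

# Loading looks the same for the real dataset files.
data = np.load("episode_0000.npz")
print(data["z_t"].shape, data["a_t"].shape,
      data["z_next"].shape, data["rewards"].shape)
```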

πŸ€– Models

For each task, we provide:

  • 5Γ— Dynamics Ensemble β€” dyn_0.pt to dyn_4.pt (MLP: z + a β†’ z_next, ~1.58M params each)
  • 1Γ— Reward Model β€” reward.pt (MLP: z + a β†’ reward, ~329K params)

Architecture

  • Dynamics: Linear(1024+a_dim, 512) β†’ LN β†’ ReLU β†’ Γ—3 β†’ Linear(512, 1024) + residual connection
  • Reward: Linear(1024+a_dim, 256) β†’ ReLU β†’ Γ—2 β†’ Linear(256, 1)
  • Ensemble diversity (pairwise cosine similarity between members' weight vectors): ~0.60
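A minimal PyTorch sketch of the two architectures above. Class names and layer sizes are inferred from the bullet list and the stated parameter counts (three Linear → LayerNorm → ReLU blocks for dynamics, two Linear → ReLU blocks for reward), not taken from the repo's actual code:

```python
import torch
import torch.nn as nn

LATENT_DIM = 1024

class DynamicsModel(nn.Module):
    """Predicts z_next from (z, a), with a residual connection on z."""
    def __init__(self, action_dim: int, hidden: int = 512):
        super().__init__()
        blocks = []
        for d_in in (LATENT_DIM + action_dim, hidden, hidden):
            # Three Linear -> LayerNorm -> ReLU blocks, per the card.
            blocks += [nn.Linear(d_in, hidden), nn.LayerNorm(hidden), nn.ReLU()]
        self.body = nn.Sequential(*blocks)
        self.head = nn.Linear(hidden, LATENT_DIM)

    def forward(self, z: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return z + self.head(self.body(torch.cat([z, a], dim=-1)))

class RewardModel(nn.Module):
    """Predicts a scalar per-step reward from (z, a)."""
    def __init__(self, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([z, a], dim=-1)).squeeze(-1)

dyn, rew = DynamicsModel(action_dim=2), RewardModel(action_dim=2)
n_dyn = sum(p.numel() for p in dyn.parameters())
n_rew = sum(p.numel() for p in rew.parameters())
print(n_dyn, n_rew)  # 1579520 328961 — matches the ~1.58M / ~329K above
```

With `action_dim=2` the parameter counts reproduce the figures in the Models section, which is what pins down the number of hidden blocks in this reconstruction.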

πŸ—οΈ How It Was Built

  1. Expert policies collect episodes in dm_control environments
  2. Each frame rendered at 224Γ—224, encoded with V-JEPA 2 ViT-L (8-frame sliding windows)
  3. Dynamics ensemble trained with random data splits + different seeds
  4. Reward model trained to predict per-step rewards from z_t + a_t
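Step 3 above can be sketched as follows. This assumes each ensemble member gets a fresh seed for initialization plus its own random subset of the transitions, trained with an MSE loss on `z_next`; the helper names, subset fraction, and tiny MLP stand-in are illustrative, not the repo's actual training code:

```python
import numpy as np
import torch
import torch.nn as nn

def make_dynamics(in_dim: int, out_dim: int, hidden: int = 64) -> nn.Module:
    # Minimal stand-in for the card's dynamics MLP (sizes illustrative).
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

def train_ensemble(z_t, a_t, z_next, n_models=5, frac=0.9, epochs=5, lr=1e-3):
    """Train n_models dynamics nets, each with a different seed and a
    different random subset of the data (bootstrap-style splits)."""
    x = torch.cat([torch.as_tensor(z_t), torch.as_tensor(a_t)], dim=-1).float()
    y = torch.as_tensor(z_next).float()
    models = []
    for seed in range(n_models):
        torch.manual_seed(seed)                              # different init
        idx = torch.randperm(len(x))[: int(frac * len(x))]   # random split
        model = make_dynamics(x.shape[-1], y.shape[-1])
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            loss = nn.functional.mse_loss(model(x[idx]), y[idx])
            opt.zero_grad(); loss.backward(); opt.step()
        models.append(model)
    return models

# Smoke test on synthetic data (real data comes from the .npz files).
rng = np.random.default_rng(0)
z = rng.standard_normal((64, 16)).astype(np.float32)
a = rng.standard_normal((64, 2)).astype(np.float32)
zn = z + 0.1 * rng.standard_normal((64, 16)).astype(np.float32)
ensemble = train_ensemble(z, a, zn)
print(len(ensemble))  # 5
```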

πŸ“ˆ Training Details

  • GPU: NVIDIA A100-SXM4-80GB (Prime Intellect)
  • Total time: 5.4 hours
  • Total cost: ~$7
  • Dynamics val loss: ~0.0008 (reacher, point_mass), ~0.0002 (cartpole)
  • Temporal coherence: >0.998 for all tasks

🎯 Purpose

These world models are designed for "teach-by-showing" β€” demonstrating a task via video, then using the learned dynamics + CEM planning to reproduce the shown behavior.
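A hedged sketch of what that planning loop looks like: cross-entropy method (CEM) over action sequences, rolled out entirely in latent space with the dynamics and reward models. The hyperparameters and the toy callables below are illustrative only; in practice `dynamics` and `reward_fn` would wrap the trained ensemble and `reward.pt`:

```python
import numpy as np

def cem_plan(z0, dynamics, reward_fn, horizon=12, pop=64, elites=8,
             iters=5, action_dim=2, seed=0):
    """CEM: sample action sequences from a Gaussian, score them by
    rolling out in latent space, refit the Gaussian to the top elites,
    and return the first action of the final mean sequence."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, action_dim))
    sigma = np.ones((horizon, action_dim))
    for _ in range(iters):
        acts = rng.normal(mu, sigma, size=(pop, horizon, action_dim))
        acts = np.clip(acts, -1.0, 1.0)        # dm_control action bounds
        returns = np.zeros(pop)
        for i in range(pop):
            z = z0
            for t in range(horizon):
                returns[i] += reward_fn(z, acts[i, t])
                z = dynamics(z, acts[i, t])    # latent rollout
        top = np.argsort(returns)[-elites:]    # keep the best sequences
        mu = acts[top].mean(axis=0)
        sigma = acts[top].std(axis=0) + 1e-6
    return mu[0]

# Toy check with frozen dynamics and a reward peaked at a = 0.5:
# the planned first action should be pulled toward 0.5.
dyn = lambda z, a: z
rew = lambda z, a: -np.sum((a - 0.5) ** 2)
a0 = cem_plan(np.zeros(4), dyn, rew)
print(a0)
```

In the teach-by-showing setting, the reward callable would instead score similarity to the latent trajectory extracted from the demonstration video.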
