
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning


TL;DR Summary

V-JEPA 2 is a self-supervised video model trained on vast amounts of internet video plus a small amount of robot interaction data. It delivers strong motion understanding, state-of-the-art human action anticipation, and excellent video question-answering performance, and it enables zero-shot robotic planning.

Abstract

A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.


1. Bibliographic Information

1.1. Title

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

1.2. Authors

Mahmoud Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Geji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xiaodong Ma, Sarath Chandar, Franziska Meier, Yann LeCun, Michael Rabbat, Nicolas Ballas. Affiliations: FAIR (Fundamental AI Research) at Meta, Mila – Quebec AI Institute, and Polytechnique Montréal.

1.3. Journal/Conference

The paper is published as a preprint on arXiv (arXiv:2506.09985) and originates from the FAIR team at Meta. Given the authors (including Yann LeCun) and the scale of the work, it is likely targeted for a top-tier computer vision or machine learning conference such as CVPR or NeurIPS.

1.4. Publication Year

Published on June 11, 2025 (UTC).

1.5. Abstract

The paper explores a self-supervised approach, V-JEPA 2, that combines internet-scale video data with a small amount of robot interaction data to develop world models capable of understanding, predicting, and planning in physical environments. Pre-trained on over 1 million hours of video, V-JEPA 2 achieves strong performance on motion understanding and state-of-the-art (SOTA) results on human action anticipation and on video question answering at the 8-billion-parameter scale. Furthermore, an action-conditioned variant, V-JEPA 2-AC, post-trained on a small robot dataset, enables zero-shot picking and placing on physical robots without environment-specific data, task-specific training, or rewards.

arXiv:2506.09985


2. Executive Summary

2.1. Background & Motivation

A fundamental goal in AI is to create agents that learn about the world much like humans do—largely through observation rather than just trial and error. Traditional methods for training robots often rely on massive amounts of specific interaction data or explicit reward signals (reinforcement learning), which are expensive and difficult to scale.

The core problem is the lack of a "world model": an internal representation that allows an agent to predict the future consequences of its actions. Previous work relied on video generation (predicting every pixel), which is computationally expensive and wastes capacity on irrelevant details (like the movement of every leaf on a tree). V-JEPA 2 addresses this by predicting in a latent space (a compact mathematical representation), focusing on high-level dynamics rather than pixel-perfect reconstruction.

2.2. Main Contributions / Findings

  1. V-JEPA 2 Model: A large-scale video model (up to 1 billion parameters) trained on 1.1 million hours of video using a self-supervised objective.

  2. State-of-the-Art Understanding: The model achieves SOTA results on benchmarks requiring fine-grained motion understanding (Something-Something v2) and temporal reasoning (PerceptionTest).

  3. Action-Conditioned World Model (V-JEPA 2-AC): By adding a small amount of robot data (62 hours), the model learns to predict how its own actions change the environment.

  4. Zero-Shot Planning: The model successfully performs robot manipulation tasks (pick and place) in entirely new lab environments without any task-specific training or fine-tuning, demonstrating strong generalization.


3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Self-Supervised Learning (SSL)

Self-Supervised Learning is a method where a model learns from data without needing human-provided labels (like "this is a cat"). Instead, the data itself provides the supervision. For example, the model might hide a part of a video and try to predict what was there based on the surrounding context.

3.1.2. Joint-Embedding Predictive Architecture (JEPA)

Proposed by Yann LeCun, JEPA is an architecture designed to avoid the pitfalls of generative models. Instead of predicting pixels (which are noisy and hard to get exactly right), a JEPA encoder converts an image/video into a latent embedding (a vector of numbers representing the meaning). The model then tries to predict the embedding of a missing part of the video from the embedding of the visible part.

3.1.3. Vision Transformer (ViT)

The Vision Transformer is a neural network architecture that treats an image or video as a sequence of "patches" (like tiles). It uses an Attention Mechanism to understand the relationships between different patches. The core of this is the Self-Attention formula:

$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $

  • $Q$ (Query): What the model is looking for.
  • $K$ (Key): What each part of the data "offers."
  • $V$ (Value): The actual information contained in that part.
  • $d_k$: The dimension of the keys, used for scaling.
  • $\mathrm{softmax}$: A function that turns the scores into probabilities (ensuring they sum to 1).
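
As a concrete illustration, here is a minimal NumPy sketch of single-head scaled dot-product attention (no learned projections, batching, or masking); it is a toy example under those simplifying assumptions, not the paper's implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention over token matrices of shape (num_tokens, d_k)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarities, scaled by sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability for the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax over keys
    return weights @ V                               # weighted sum of values

# Toy usage: 4 patch tokens of dimension 8
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(tokens, tokens, tokens)   # shape (4, 8)
```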

3.2. Previous Works

  • V-JEPA (2024): The predecessor to this work, which introduced the Joint-Embedding Predictive Architecture for video but at a smaller scale.
  • DINOv2: A popular image-based SSL model. V-JEPA 2 compares itself to DINOv2 to show that video-based pre-training is better for understanding motion.
  • Octo: A generalist robot policy trained with imitation learning. Unlike V-JEPA 2, Octo predicts actions directly rather than modeling the world's physics.

3.3. Technological Evolution

The field has moved from Supervised Learning (needing millions of labels) to Generative SSL (predicting pixels, e.g., VideoMAE) and now to Latent SSL (predicting embeddings, e.g., JEPA). V-JEPA 2 represents the scaling phase of Latent SSL, combining it with robotics for the first time at this magnitude.


4. Methodology

4.1. Principles

The core intuition is that a model that can predict the "missing" parts of a video in a high-level feature space has learned the underlying "laws" of that world. By later conditioning this prediction on an "action," the model becomes a World Model that can "imagine" what will happen if it moves its arm in a certain way.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Stage 1: Action-Free Pre-training (V-JEPA 2)

The first goal is to learn a powerful encoder that understands video. The model uses a mask-denoising objective: it takes a video $y$, hides some parts (masking), and tries to predict the representation of those parts from the visible parts $x$.

The training objective is to minimize the following loss:

$ \min_{\theta, \phi, \Delta_y} \left\| P_{\phi}(\Delta_y, E_{\theta}(x)) - \mathrm{sg}\big(E_{\bar{\theta}}(y)\big) \right\|_1 $

  1. Encoder $E_{\theta}(x)$: A Vision Transformer that converts the visible video patches $x$ into feature vectors; $\theta$ denotes the weights of this encoder.
  2. Target Encoder $E_{\bar{\theta}}(y)$: A "teacher" version of the encoder that processes the full video $y$ to produce the prediction targets. Its weights $\bar{\theta}$ are an Exponential Moving Average (EMA) of the main encoder weights $\theta$, which keeps training stable.
  3. Stop-Gradient $\mathrm{sg}(\cdot)$: Ensures that gradients flow only through the predictor and the main encoder, not the teacher, preventing the representations from collapsing.
  4. Mask Tokens $\Delta_y$: Learnable vectors that tell the predictor where the missing patches were located.
  5. Predictor $P_{\phi}$: A smaller transformer that takes the encoder's output and the mask tokens and predicts the features of the hidden parts.
  6. L1 Loss $\|\cdot\|_1$: The absolute difference between the prediction and the features produced by the teacher.
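
A PyTorch-style sketch of how these pieces fit together is shown below; the module interfaces (`encoder`, `target_encoder`, `predictor`, `mask_tokens`) and their argument names are illustrative assumptions, not the paper's actual code:

```python
import torch
import torch.nn.functional as F

def jepa_pretraining_loss(encoder, target_encoder, predictor, mask_tokens,
                          video, visible_idx, masked_idx):
    """Sketch of the mask-denoising objective (interfaces are illustrative)."""
    # Student encoder E_theta sees only the visible patches x.
    context = encoder(video, patch_indices=visible_idx)

    # Teacher E_theta_bar sees the full video y; stop-gradient via no_grad.
    with torch.no_grad():
        targets = target_encoder(video)[:, masked_idx]

    # Predictor P_phi fills in the masked locations from context + mask tokens Delta_y.
    preds = predictor(context, mask_tokens, masked_idx)

    return F.l1_loss(preds, targets)   # L1 distance in feature space

@torch.no_grad()
def ema_update(encoder, target_encoder, momentum=0.999):
    """Teacher weights track an exponential moving average of the student weights."""
    for p, p_t in zip(encoder.parameters(), target_encoder.parameters()):
        p_t.mul_(momentum).add_(p, alpha=1.0 - momentum)
```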

4.2.2. Stage 2: Action-Conditioned World Model (V-JEPA 2-AC)

After pre-training, the encoder $E$ is frozen. A new action-conditioned predictor $P_{\phi}$ is then trained that takes robot actions into account. The model observes a frame $x_t$, a robot state $s_t$ (the position of the arm), and an action $a_t$ (how the arm moves).

The teacher-forcing loss trains the model to predict the next frame's features $z_{k+1}$:

$ \mathcal{L}_{\text{teacher-forcing}}(\phi) := \frac{1}{T} \sum_{k=1}^{T} \left\| P_{\phi}\big( (a_t, s_t, E(x_t))_{t \leq k} \big) - E(x_{k+1}) \right\|_1 $

  • $E(x_t)$: The features of the current video frame.

  • $a_t, s_t$: The action and the end-effector state (the "hand" of the robot).

  • $E(x_{k+1})$: The ground-truth features of the next frame.

To ensure the model can think multiple steps ahead, a rollout loss is added:

$ \mathcal{L}_{\text{rollout}}(\phi) := \left\| P_{\phi}(a_{1:T}, s_1, z_1) - z_{T+1} \right\|_1 $

This forces the model to predict step $T+1$ from its own previous predictions rather than from ground truth, which helps prevent error accumulation during planning.
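
A schematic sketch of the two training losses follows; the predictor interface `P`, the list-based history format, and the handling of the robot state are simplifying assumptions made for illustration:

```python
import torch
import torch.nn.functional as F

def teacher_forcing_loss(P, z, s, a):
    """z: list of frozen features E(x_1)..E(x_{T+1}); s, a: states and actions (length T)."""
    T = len(a)
    losses = []
    for k in range(T):
        # Predict the next feature from the ground-truth history (a, s, z)_{t <= k}.
        z_next_pred = P(a[: k + 1], s[: k + 1], z[: k + 1])
        losses.append(F.l1_loss(z_next_pred, z[k + 1]))
    return torch.stack(losses).mean()

def rollout_loss(P, z, s, a):
    """Unroll the predictor on its own outputs; compare only the final step to z_{T+1}."""
    T = len(a)
    z_hist = [z[0]]                 # start from the first observed feature z_1
    s_hist = [s[0]]                 # only the initial end-effector state s_1 is given
    for k in range(T):
        z_next = P(a[: k + 1], s_hist, z_hist)
        z_hist.append(z_next)       # feed the model's own prediction back in
        s_hist.append(s_hist[-1])   # simplification: state is not re-observed during rollout
    return F.l1_loss(z_hist[-1], z[T])
```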

4.2.3. Stage 3: Planning via Energy Minimization

To solve a task (e.g., "pick up the cup"), the model is given a goal image $x_g$ and computes its features $z_g = E(x_g)$. The robot then searches for a sequence of actions $a_{1:T}$ that minimizes an energy function:

$ \mathcal{E}(\hat{a}_{1:T}; z_k, s_k, z_g) := \left\| P_{\phi}(\hat{a}_{1:T}; s_k, z_k) - z_g \right\|_1 $

The robot "imagines" different action sequences and picks the one where the final predicted state representation is closest to the goal's representation in latent space. This optimization is performed using the Cross-Entropy Method (CEM), a sampling-based optimization algorithm.


5. Experimental Setup

5.1. Datasets

  • VideoMix22M (VM22M): A massive dataset of 22 million samples (1.1M hours) curated from sources like Something-Something v2, Kinetics, HowTo100M, and YT-Temporal-1B.

  • Droid: 62 hours of robot interaction data (Franka arm) used to train the action-conditioned predictor.

  • Epic-Kitchens-100 (EK100): Used for human action anticipation.

  • VidQA Benchmarks: PerceptionTest, MVP, TempCompass for testing language understanding of video.

The researchers provide an example of curation: cluster-based retrieval is used to select videos from YT-Temporal-1B whose embeddings match the distribution of high-quality datasets like Kinetics, ensuring the model doesn't learn from noisy or irrelevant content.
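
A rough sketch of this cluster-based retrieval idea is shown below; the embedding source, cluster count, and per-cluster selection rule are assumptions made for illustration and differ from the paper's actual pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans

def curate_by_cluster_retrieval(web_embeddings, target_embeddings,
                                num_clusters=1000, per_cluster=100):
    """Keep web videos whose embeddings lie closest to clusters of a curated target set."""
    # Cluster the embeddings of the high-quality target datasets (e.g., Kinetics-like).
    kmeans = KMeans(n_clusters=num_clusters, n_init="auto").fit(target_embeddings)

    # Distance from every web video to every target cluster center.
    distances = kmeans.transform(web_embeddings)       # shape (num_web, num_clusters)
    nearest = distances.argmin(axis=1)

    selected = []
    for c in range(num_clusters):
        members = np.where(nearest == c)[0]
        ranked = members[np.argsort(distances[members, c])]
        selected.extend(ranked[:per_cluster].tolist())  # closest web videos per cluster
    return selected
```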

5.2. Evaluation Metrics

5.2.1. Top-1 Accuracy

  1. Conceptual Definition: The percentage of times the model's most likely prediction (highest score) matches the actual correct label.
  2. Mathematical Formula: $ \mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = y_i) $
  3. Symbol Explanation: $N$ is the total number of samples; $\hat{y}_i$ is the predicted label; $y_i$ is the true label; $\mathbb{1}(\cdot)$ is the indicator function (1 if the condition holds, 0 otherwise).

5.2.2. Mean-Class Recall-at-5

  1. Conceptual Definition: For each category, it measures how often the correct label is among the model's top 5 guesses. It averages this across all classes to account for imbalanced data.
  2. Mathematical Formula: $ \mathrm{mR@5} = \frac{1}{C} \sum_{c=1}^{C} \frac{\sum_{i \in S_c} \mathbb{1}(y_i \in \text{Top-5}(\hat{y}_i))}{|S_c|} $
  3. Symbol Explanation: $C$ is the number of classes; $S_c$ is the set of samples in class $c$; $\text{Top-5}(\hat{y}_i)$ is the set of the 5 highest-scoring predictions for sample $i$.
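
Both metrics are straightforward to compute from a matrix of class scores; here is a small NumPy sketch (variable names are illustrative):

```python
import numpy as np

def top1_accuracy(scores, labels):
    """scores: (N, C) class scores; labels: (N,) ground-truth class indices."""
    return (scores.argmax(axis=1) == labels).mean()

def mean_class_recall_at_5(scores, labels):
    """Average, over classes, of how often the true class is among the top-5 predictions."""
    top5 = np.argsort(-scores, axis=1)[:, :5]          # indices of the 5 highest scores
    hits = (top5 == labels[:, None]).any(axis=1)       # per-sample top-5 hit
    per_class = [hits[labels == c].mean() for c in np.unique(labels)]
    return float(np.mean(per_class))
```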

5.3. Baselines

  • DINOv2 / SigLIP2: State-of-the-art image encoders.

  • Octo: A leading robot policy model.

  • Cosmos: A 7-billion parameter video generation world model from NVIDIA.


6. Results & Analysis

6.1. Core Results Analysis

The results show that scaling is the primary driver of performance. Moving from a 300M parameter model to a 1B parameter model (ViT-g) and increasing resolution/duration (up to 64 frames) consistently improved accuracy. In robotics, V-JEPA 2-AC outperformed Octo (an imitation model) and Cosmos (a generative model), particularly in tasks involving complex object interactions like "Pick-and-Place."

6.2. Data Presentation (Tables)

The following are the results from Table 4 of the original paper (SSv2, Diving-48, and Jester probe motion understanding; K400, COIN, and IN1K probe appearance understanding):

| Group | Method | Param. | Avg. | SSv2 | Diving-48 | Jester | K400 | COIN | IN1K |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Image encoders | DINOv2 | 1.1B | 81.1 | 50.7 | 82.5 | 93.4 | 83.6 | 90.7 | 86.1 |
| Video encoders | V-JEPA | 600M | 85.2 | 74.3 | 87.9 | 97.7 | 84.5 | 87.1 | 80.0 |
| Ours | V-JEPA 2 ViT-g | 1B | 87.5 | 75.3 | 90.1 | 97.8 | 86.6 | 90.7 | 84.6 |
| Ours (high res.) | V-JEPA 2 ViT-g384 | 1B | 88.2 | 77.3 | 90.2 | - | 87.3 | 91.1 | 85.1 |

The following are the results from Table 2 of the original paper regarding zero-shot robot manipulation:

| Method | Reach | Grasp (Cup) | Grasp (Box) | Reach w/ Obj. (Cup) | Reach w/ Obj. (Box) | Pick-&-Place (Cup) | Pick-&-Place (Box) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Octo | 100% | 15% | 0% | 15% | 70% | 15% | 10% |
| V-JEPA 2-AC | 100% | 65% | 25% | 75% | 75% | 80% | 65% |

6.3. Ablation Studies

  • Data Curation: Training on "curated" YouTube data improved performance by +1.4 points over uncurated data.

  • Resolution: Increasing resolution during the "cooldown" phase of training was much more efficient than training at high resolution from the start (an $8\times$ speedup).

  • Predictor Input: Using both the encoder and the predictor features for action anticipation yielded the best results on the EK100 dataset.


7. Conclusion & Reflections

7.1. Conclusion Summary

V-JEPA 2 demonstrates that large-scale self-supervised learning on internet video provides a "common sense" foundation for physical world understanding. By learning to predict in a latent space, the model avoids the heavy computation of pixel generation and focuses on the underlying physics. This approach allows robots to plan actions in new environments zero-shot, a major milestone for general-purpose robotics.

7.2. Limitations & Future Work

  • Camera Sensitivity: The model is sensitive to camera positioning because it implicitly learns action axes from the video; if the camera moves, the "left/right" commands might get confused.
  • Long Horizon Planning: Predicting far into the future still suffers from error accumulation.
  • Task Specification: Currently, the model requires an "image" of the goal. Future work aims to allow language-based goals (e.g., "Put the cup on the shelf").

7.3. Personal Insights & Critique

The most impressive part of this work is the zero-shot generalization. Most robot models fail immediately when moved to a new lab with different lighting or backgrounds. V-JEPA 2's ability to handle this suggests that the 1 million hours of internet video successfully taught the model to ignore "background noise" and focus on the objects and the arm.

However, a potential issue is the CEM optimization used for planning. Sampling 800 trajectories to find the best one takes 16 seconds per action. While much faster than the generative Cosmos model (4 minutes), it is still too slow for high-speed, real-time robotics. Future iterations might need a "policy" that distills the world model's knowledge into a faster, reactive neural network.
