V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
TL;DR Summary
V-JEPA 2 is a self-supervised video model that combines internet-scale video data with a small amount of robot interaction data, achieving strong performance on motion understanding, state-of-the-art human action anticipation, and state-of-the-art results on video question-answering at the 8B-parameter scale.
Abstract
A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supervised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
1.2. Authors
Mahmoud Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Geji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xiaodong Ma, Sarath Chandar, Franziska Meier, Yann LeCun, Michael Rabbat, Nicolas Ballas. Affiliations: FAIR (Fundamental AI Research) at Meta, Mila – Quebec AI Institute, and Polytechnique Montréal.
1.3. Journal/Conference
The paper is published as a preprint on arXiv (arXiv:2506.09985) and originates from the FAIR team at Meta. Given the authors (including Yann LeCun) and the scale of the work, it is likely targeted for a top-tier computer vision or machine learning conference such as CVPR or NeurIPS.
1.4. Publication Year
Published on June 11, 2025 (UTC).
1.5. Abstract
The paper explores a self-supervised approach called V-JEPA 2 that combines internet-scale video data with a small amount of robot interaction data to develop world models. These models are capable of understanding, predicting, and planning in physical environments. Pre-trained on over 1 million hours of video, V-JEPA 2 achieves state-of-the-art (SOTA) performance on motion understanding, action anticipation, and video question-answering. Furthermore, a version called V-JEPA 2-AC, trained on a small robot dataset, enables zero-shot picking and placing on physical robots without environment-specific training or rewards.
1.6. Original Source Link
https://arxiv.org/abs/2506.09985
2. Executive Summary
2.1. Background & Motivation
A fundamental goal in AI is to create agents that learn about the world much like humans do—largely through observation rather than just trial and error. Traditional methods for training robots often rely on massive amounts of specific interaction data or explicit reward signals (reinforcement learning), which are expensive and difficult to scale.
The core problem is the lack of a "world model"—an internal representation that allows an agent to predict the future consequences of its actions. While previous work used video generation (predicting every pixel), this is computationally expensive and focuses on irrelevant details (like the movement of every leaf on a tree). V-JEPA 2 aims to solve this by predicting in a latent space (a compact mathematical representation), focusing on high-level dynamics rather than pixel-perfect reconstruction.
2.2. Main Contributions / Findings
- V-JEPA 2 Model: A large-scale video model (up to 1 billion parameters) trained on 1.1 million hours of video using a self-supervised objective.
- State-of-the-Art Understanding: The model achieves strong results on fine-grained motion understanding (Something-Something v2) and state-of-the-art results on action anticipation (Epic-Kitchens-100) and video question answering (e.g., PerceptionTest).
- Action-Conditioned World Model (V-JEPA 2-AC): By adding a small amount of robot data (62 hours), the model learns to predict how its own actions change the environment.
- Zero-Shot Planning: The model successfully performs robot manipulation tasks (pick and place) in entirely new lab environments without any task-specific training or fine-tuning, demonstrating strong generalization.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Self-Supervised Learning (SSL)
Self-Supervised Learning is a method where a model learns from data without needing human-provided labels (like "this is a cat"). Instead, the data itself provides the supervision. For example, the model might hide a part of a video and try to predict what was there based on the surrounding context.
3.1.2. Joint-Embedding Predictive Architecture (JEPA)
Proposed by Yann LeCun, JEPA is an architecture designed to avoid the pitfalls of generative models. Instead of predicting pixels (which are noisy and hard to get exactly right), a JEPA encoder converts an image/video into a latent embedding (a vector of numbers representing the meaning). The model then tries to predict the embedding of a missing part of the video from the embedding of the visible part.
3.1.3. Vision Transformer (ViT)
The Vision Transformer is a neural network architecture that treats an image or video as a sequence of "patches" (like tiles). It uses an Attention Mechanism to understand the relationships between different patches. The core of this is the Self-Attention formula:
$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
- $Q$ (Query): What the model is looking for.
- $K$ (Key): What each part of the data "offers."
- $V$ (Value): The actual information contained in that part.
- $d_k$: The dimension of the keys, used for scaling.
- softmax: A function that turns the scores into probabilities (ensuring they sum to 1).
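To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product self-attention; the function and variable names are illustrative rather than taken from the paper's code.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the per-row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (num_tokens, d_k) matrices of queries, keys, and values.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # how well each query matches each key
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ V                     # weighted sum of the values

# Toy usage: 4 patch tokens of dimension 8, attending to themselves.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape)  # (4, 8)
```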
3.2. Previous Works
- V-JEPA (2024): The predecessor to this work, which introduced the Joint-Embedding Predictive Architecture for video, but at a smaller scale.
- DINOv2: A popular image-based SSL model. V-JEPA 2 compares itself to DINOv2 to show that video-based pre-training is better for understanding motion.
- Octo: A Vision-Language-Action (VLA) model that uses imitation learning. Unlike V-JEPA 2, Octo predicts actions directly rather than modeling the world's physics.
3.3. Technological Evolution
The field has moved from Supervised Learning (needing millions of labels) to Generative SSL (predicting pixels, e.g., VideoMAE) and now to Latent SSL (predicting embeddings, e.g., JEPA). V-JEPA 2 represents the scaling phase of Latent SSL, combining it with robotics for the first time at this magnitude.
4. Methodology
4.1. Principles
The core intuition is that a model that can predict the "missing" parts of a video in a high-level feature space has learned the underlying "laws" of that world. By later conditioning this prediction on an "action," the model becomes a World Model that can "imagine" what will happen if it moves its arm in a certain way.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Stage 1: Action-Free Pre-training (V-JEPA 2)
The first goal is to learn a powerful encoder that understands video. The model uses a mask-denoising objective: it takes a video, hides some parts $y$ (masking), and tries to predict the representation of those hidden parts from the visible parts $x$.
The training objective is to minimize the following loss:
$ \min_{\theta, \phi, \Delta_y} \left\| P_{\phi}(\Delta_y, E_{\theta}(x)) - \mathrm{sg}(E_{\bar{\theta}}(y)) \right\|_1 $
- Encoder $E_{\theta}$: A Vision Transformer that converts the visible video patches into feature vectors; $\theta$ denotes the weights of this encoder.
- Target Encoder $E_{\bar{\theta}}$: A "teacher" version of the encoder that processes the full video to create targets. Its weights $\bar{\theta}$ are an Exponential Moving Average (EMA) of the main encoder weights, which keeps training stable.
- Stop-Gradient $\mathrm{sg}(\cdot)$: This operation ensures that gradients only flow through the predictor and main encoder, not the teacher, preventing the representations from collapsing.
- Mask Token $\Delta_y$: A learnable vector that tells the predictor where the missing patches were located.
- Predictor $P_{\phi}$: A smaller transformer that takes the encoder's output and the mask tokens and guesses the features of the hidden parts.
- L1 Loss $\|\cdot\|_1$: The model minimizes the absolute difference between its prediction and the features produced by the teacher.
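To make the objective concrete, below is a simplified PyTorch-style sketch of the mask-denoising loss and the EMA teacher update. The `encoder`, `target_encoder`, and `predictor` modules and their call signatures are placeholders for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(target_encoder, encoder, momentum=0.999):
    # Teacher weights track the student encoder as an exponential moving average.
    for p_t, p_s in zip(target_encoder.parameters(), encoder.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def jepa_loss(encoder, target_encoder, predictor, video_tokens, mask):
    # video_tokens: (B, N, D_in) patch tokens of a clip; mask: (B, N) bool, True = hidden.
    # E_theta(x): encode only the visible patches (placeholder call signature).
    context = encoder(video_tokens, visible=~mask)
    with torch.no_grad():
        # sg(E_theta_bar(y)): teacher encodes the full clip; keep only masked positions.
        targets = target_encoder(video_tokens)[mask]
    # P_phi(Delta_y, E_theta(x)): predictor fills in features at the masked positions.
    preds = predictor(context, mask)   # assumed to return one feature per masked patch
    return F.l1_loss(preds, targets)   # L1 distance in latent space
```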
4.2.2. Stage 2: Action-Conditioned World Model (V-JEPA 2-AC)
After pre-training, the encoder is frozen. We now train a new predictor that takes robot actions into account. The model observes a frame $x_t$, a robot state $s_t$ (the position of the arm), and an action $a_t$ (how the arm moves).
The teacher-forcing loss trains the model to predict the next frame's features $E(x_{k+1})$:
$ \mathcal{L}_{\mathrm{teacher-forcing}}(\phi) := \frac{1}{T} \sum_{k=1}^{T} \left\| P_{\phi}\left((a_t, s_t, E(x_t))_{t \leq k}\right) - E(x_{k+1}) \right\|_1 $
- $E(x_t)$: The features of the current video frame.
- $a_t, s_t$: The action and the end-effector state (the "hand" of the robot).
- $E(x_{k+1})$: The ground-truth features of the next frame.
To ensure the model can plan multiple steps ahead, a rollout loss is added (writing $z_t = E(x_t)$ for the frozen frame features):
$ \mathcal{L}_{\mathrm{rollout}}(\phi) := \left\| P_{\phi}(a_{1:T}, s_1, z_1) - z_{T+1} \right\|_1 $
This forces the model to predict each subsequent step from its own previous predictions rather than from ground truth, which helps limit error accumulation during planning.
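Below is a simplified PyTorch-style sketch of the two losses; the `predictor` module, its call signature, and the `rollout` helper are hypothetical stand-ins that only illustrate how the actions, states, and frozen frame features are indexed.

```python
import torch
import torch.nn.functional as F

def teacher_forcing_loss(predictor, frame_feats, states, actions):
    # frame_feats: (T+1, D) frozen encoder features z_1 .. z_{T+1}
    # states:      (T, d_s) end-effector states s_1 .. s_T
    # actions:     (T, d_a) actions a_1 .. a_T
    losses = []
    for k in range(1, frame_feats.shape[0]):
        # Predict z_{k+1} from the ground-truth history (a_t, s_t, z_t) for t <= k.
        pred = predictor(actions[:k], states[:k], frame_feats[:k])
        losses.append(F.l1_loss(pred, frame_feats[k]))
    return torch.stack(losses).mean()

def rollout_loss(predictor, frame_feats, states, actions):
    # Unroll the full action sequence from only the first state and frame feature,
    # feeding the model its own predictions, and compare against the final feature.
    z_pred = predictor.rollout(actions, states[0], frame_feats[0])  # hypothetical rollout helper
    return F.l1_loss(z_pred, frame_feats[-1])
```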
4.2.3. Stage 3: Planning via Energy Minimization
To solve a task (e.g., "pick up the cup"), the model is given a goal image $x_g$. It computes the features of that goal, $z_g = E(x_g)$. The robot then searches for a sequence of actions $\hat{a}_{1:T}$ that minimizes an energy function:
$ \mathcal{E}(\hat{a}_{1:T}; z_k, s_k, z_g) := \left\| P_{\phi}(\hat{a}_{1:T}; s_k, z_k) - z_g \right\|_1 $
The robot "imagines" different action sequences and picks the one where the final predicted state representation is closest to the goal's representation in latent space. This optimization is performed using the Cross-Entropy Method (CEM), a sampling-based optimization algorithm.
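A minimal sketch of such a CEM planning loop over the latent energy is shown below; the `predictor.rollout` call is a hypothetical stand-in, and all hyperparameters apart from the 800-sample budget mentioned later in this analysis are illustrative.

```python
import torch

def cem_plan(predictor, z_current, s_current, z_goal,
             horizon=5, action_dim=7, n_samples=800, n_elites=50, n_iters=5):
    # Sample action sequences from a Gaussian, score each with the latent energy
    # ||P(a_{1:T}; s_k, z_k) - z_g||_1, then refit the Gaussian to the lowest-energy elites.
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(n_iters):
        samples = mean + std * torch.randn(n_samples, horizon, action_dim)
        with torch.no_grad():
            z_pred = predictor.rollout(samples, s_current, z_current)  # hypothetical batched rollout
            energy = (z_pred - z_goal).abs().flatten(1).mean(dim=1)    # L1 distance per candidate
        elites = samples[energy.topk(n_elites, largest=False).indices]
        mean, std = elites.mean(dim=0), elites.std(dim=0)
    return mean[0]  # execute only the first action, then re-plan (receding horizon)
```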
5. Experimental Setup
5.1. Datasets
- VideoMix22M (VM22M): A massive dataset of 22 million samples (1.1M hours) curated from sources such as Something-Something v2, Kinetics, HowTo100M, and YT-Temporal-1B.
- Droid: 62 hours of robot interaction data (Franka arm) used to train the action-conditioned predictor.
- Epic-Kitchens-100 (EK100): Used for human action anticipation.
- VidQA Benchmarks: PerceptionTest, MVP, and TempCompass for testing language understanding of video.

The researchers provide an example of "curation": they use cluster-based retrieval to select videos from YT-Temporal-1B that match the distribution of high-quality datasets like Kinetics, ensuring the model does not learn from noisy or irrelevant content.
5.2. Evaluation Metrics
5.2.1. Top-1 Accuracy
- Conceptual Definition: The percentage of times the model's most likely prediction (highest score) matches the actual correct label.
- Mathematical Formula: $ \mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = y_i) $
- Symbol Explanation: $N$ is the total number of samples; $\hat{y}_i$ is the predicted label; $y_i$ is the true label; $\mathbb{1}(\cdot)$ is the indicator function (1 if the condition is true, 0 otherwise).
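A minimal NumPy version of this metric, with array shapes assumed for illustration:

```python
import numpy as np

def top1_accuracy(logits, labels):
    # logits: (N, num_classes) class scores; labels: (N,) ground-truth class indices.
    preds = logits.argmax(axis=1)      # highest-scoring class per sample
    return (preds == labels).mean()    # fraction of exact matches
```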
5.2.2. Mean-Class Recall-at-5
- Conceptual Definition: For each category, it measures how often the correct label is among the model's top 5 guesses. It averages this across all classes to account for imbalanced data.
- Mathematical Formula: $ \mathrm{mR@5} = \frac{1}{C} \sum_{c=1}^{C} \frac{\sum_{i \in S_c} \mathbb{1}(y_i \in \text{Top-5}(\hat{y}_i))}{|S_c|} $
- Symbol Explanation: $C$ is the number of classes; $S_c$ is the set of samples belonging to class $c$; $\text{Top-5}(\hat{y}_i)$ is the set of the 5 highest-scoring predictions for sample $i$.
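A corresponding NumPy sketch, again with hypothetical array shapes:

```python
import numpy as np

def mean_class_recall_at_5(logits, labels):
    # logits: (N, num_classes) class scores; labels: (N,) ground-truth class indices.
    top5 = np.argsort(-logits, axis=1)[:, :5]            # indices of the 5 highest scores
    hit = (top5 == labels[:, None]).any(axis=1)          # True if the label is in the top 5
    classes = np.unique(labels)
    return np.mean([hit[labels == c].mean() for c in classes])  # average recall per class
```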
5.3. Baselines
- DINOv2 / SigLIP2: State-of-the-art image encoders.
- Octo: A leading robot policy model.
- Cosmos: A 7-billion-parameter video-generation world model from NVIDIA.
6. Results & Analysis
6.1. Core Results Analysis
The results show that scaling is the primary driver of performance. Moving from a 300M parameter model to a 1B parameter model (ViT-g) and increasing resolution/duration (up to 64 frames) consistently improved accuracy. In robotics, V-JEPA 2-AC outperformed Octo (an imitation model) and Cosmos (a generative model), particularly in tasks involving complex object interactions like "Pick-and-Place."
6.2. Data Presentation (Tables)
The following are the results from Table 4 of the original paper:
| | Method | Param. | Avg. | SSv2 | Diving-48 | Jester | K400 | COIN | IN1K |
|---|---|---|---|---|---|---|---|---|---|
| Image Encoders | DINOv2 | 1.1B | 81.1 | 50.7 | 82.5 | 93.4 | 83.6 | 90.7 | 86.1 |
| Video Encoders | V-JEPA (old) | 600M | 85.2 | 74.3 | 87.9 | 97.7 | 84.5 | 87.1 | 80.0 |
| Ours | V-JEPA 2 ViT-g | 1B | 87.5 | 75.3 | 90.1 | 97.8 | 86.6 | 90.7 | 84.6 |
| Ours (High Res) | V-JEPA 2 g384 | 1B | 88.2 | 77.3 | 90.2 | - | 87.3 | 91.1 | 85.1 |

SSv2, Diving-48, and Jester probe motion understanding; K400, COIN, and IN1K probe appearance understanding.
The following are the results from Table 2 of the original paper regarding zero-shot robot manipulation:
| Method | Reach | Grasp (Cup) | Grasp (Box) | Reach w/ Obj. (Cup) | Reach w/ Obj. (Box) | Pick-&-Place (Cup) | Pick-&-Place (Box) |
|---|---|---|---|---|---|---|---|
| Octo | 100% | 15% | 0% | 15% | 70% | 15% | 10% |
| V-JEPA 2-AC | 100% | 65% | 25% | 75% | 75% | 80% | 65% |
6.3. Ablation Studies
- Data Curation: Training on curated YouTube data improved performance by +1.4 points over uncurated data.
- Resolution: Increasing resolution only during the "cooldown" phase of training was much more efficient than training at high resolution from the start.
- Predictor Input: Using both the encoder and the predictor features for action anticipation yielded the best results on the EK100 dataset.
7. Conclusion & Reflections
7.1. Conclusion Summary
V-JEPA 2 demonstrates that large-scale self-supervised learning on internet video provides a "common sense" foundation for physical world understanding. By learning to predict in a latent space, the model avoids the heavy computation of pixel generation and focuses on the underlying physics. This approach allows robots to plan actions in new environments zero-shot, a major milestone for general-purpose robotics.
7.2. Limitations & Future Work
- Camera Sensitivity: The model is sensitive to camera positioning because it implicitly learns action axes from the video; if the camera moves, the "left/right" commands might get confused.
- Long Horizon Planning: Predicting far into the future still suffers from error accumulation.
- Task Specification: Currently, the model requires an "image" of the goal. Future work aims to allow language-based goals (e.g., "Put the cup on the shelf").
7.3. Personal Insights & Critique
The most impressive part of this work is the zero-shot generalization. Most robot models fail immediately when moved to a new lab with different lighting or backgrounds. V-JEPA 2's ability to handle this suggests that the 1 million hours of internet video successfully taught the model to ignore "background noise" and focus on the objects and the arm.
However, a potential issue is the CEM optimization used for planning. Sampling 800 trajectories to find the best one takes 16 seconds per action. While much faster than the generative Cosmos model (4 minutes), it is still too slow for high-speed, real-time robotics. Future iterations might need a "policy" that distills the world model's knowledge into a faster, reactive neural network.