Revisiting Feature Prediction for Learning Visual Representations from Video
TL;DR Summary
V-JEPA learns from 2 million videos using a stand-alone feature prediction objective, without pretrained image encoders, text, or pixel reconstruction. The resulting representations are versatile, performing well on both motion-centric and appearance-centric tasks with a frozen backbone.
Abstract
This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaptation of the model's parameters; e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.
In-depth Reading
1. Bibliographic Information
1.1. Title
Revisiting Feature Prediction for Learning Visual Representations from Video
1.2. Authors
Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. The authors are primarily affiliated with FAIR (Fundamental AI Research) at Meta, along with researchers from Inria, École normale supérieure (PSL), New York University, and Université Gustave Eiffel. Yann LeCun is a pioneer in deep learning and the Chief AI Scientist at Meta.
1.3. Journal/Conference
This paper was published as a preprint on arXiv in April 2024. Given the authorship and the scope of the work, it is representative of high-impact research typically seen at top-tier computer vision conferences such as CVPR, ICCV, or NeurIPS.
1.4. Publication Year
2024 (First published Feb 15, 2024; updated April 15, 2024).
1.5. Abstract
The paper explores feature prediction as a standalone objective for unsupervised learning from video. It introduces V-JEPA, a collection of vision models trained solely by predicting latent features of masked video regions, without using pretrained image encoders, text, negative examples, or pixel-level reconstruction. Trained on 2 million videos, V-JEPA learns versatile representations that perform well on both motion-heavy (e.g., Something-Something-v2) and appearance-heavy (e.g., Kinetics-400) tasks using a frozen backbone. The largest model (ViT-H/16) achieves state-of-the-art results for video-only pretraining, demonstrating high efficiency and label-efficiency.
1.6. Original Source Link
- PDF Link: https://arxiv.org/pdf/2404.08471v1.pdf
- Status: Preprint (widely cited in the SSL community).
2. Executive Summary
2.1. Background & Motivation
In the current landscape of computer vision, Self-Supervised Learning (SSL) aims to train models on vast amounts of unlabeled data. A popular approach for video is Masked Video Modeling (MVM), where parts of a video are hidden (masked) and the model tries to reconstruct the missing pixels. However, pixel reconstruction is computationally expensive and forces the model to focus on irrelevant low-level details (like individual leaf movements in a forest) rather than high-level semantic concepts.
The core problem V-JEPA addresses is: Can we learn powerful video representations by predicting semantic features instead of raw pixels? This follows the "predictive feature principle," suggesting that if a model can predict the "meaning" of a hidden video segment based on the surrounding context, it truly understands the world's dynamics.
2.2. Main Contributions / Findings
- V-JEPA Architecture: A purely feature-predictive model that avoids the overhead of pixel decoders and the need for negative samples (as in contrastive learning).
- Efficiency: V-JEPA trains significantly faster than pixel-reconstruction methods (like VideoMAE) because it predicts in a lower-dimensional latent space.
- Versatility: The model performs exceptionally well on diverse tasks (motion vs. appearance) using a frozen encoder. The representations are rich enough that the backbone does not need to be updated for new tasks; only a small classification probe is trained on top.
- Label Efficiency: When only a small fraction of labeled data (e.g., 5% or 10%) is available, V-JEPA maintains high performance, outperforming previous methods by a wide margin.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Self-Supervised Learning (SSL): A method where the data itself provides the "labels," for example by hiding part of an image and asking the model to guess what's missing.
- Vision Transformer (ViT): Unlike CNNs, which operate on pixel grids, a ViT breaks an image or video into small squares called patches. These patches are converted into vectors called tokens and processed with self-attention, which allows every part of the input to "talk" to every other part.
- Joint-Embedding Predictive Architecture (JEPA): Proposed by Yann LeCun, this architecture does not generate pixels. Instead, it embeds two views of the same data, $x$ and $y$, into a shared latent space and trains a predictor to predict the embedding of $y$ from the embedding of $x$.

The following figure (Figure 3 from the original paper) compares V-JEPA with task-specific models on Something-Something-v2 and Kinetics-400; V-JEPA performs well on both tasks without adapting the model's parameters at evaluation time.
3.2. Previous Works
- Masked Autoencoders (MAE): Models like VideoMAE mask 90% of video patches and use a "decoder" to predict the actual colors of the missing pixels. The reconstruction loss is usually Mean Squared Error (MSE): $\mathcal{L}_{\mathrm{MSE}} = \frac{1}{n} \sum_{i=1}^{n} \left(P_{\mathrm{pixel},i} - T_{\mathrm{pixel},i}\right)^2$
- Contrastive Learning: Models like CLIP or SimCLR compare different images and try to bring "positive" pairs (different views of the same image) closer together while pushing "negative" pairs (different images) apart. V-JEPA avoids this "negative" sampling, which is often complex to implement.
- Exponential Moving Average (EMA): A technique where a "target" model is a slow-moving version of the "online" model. This prevents the model from "collapsing" (outputting the same constant value for every input).
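As a concrete illustration, below is a minimal PyTorch-style sketch of an EMA target-encoder update; the momentum value and function names are illustrative assumptions, not taken from the paper:

```python
import copy
import torch
import torch.nn as nn

def make_target_encoder(online_encoder: nn.Module) -> nn.Module:
    # The target encoder starts as a copy of the online encoder
    # and never receives gradients.
    target = copy.deepcopy(online_encoder)
    for p in target.parameters():
        p.requires_grad = False
    return target

@torch.no_grad()
def ema_update(online: nn.Module, target: nn.Module, momentum: float = 0.998):
    # target <- momentum * target + (1 - momentum) * online
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.mul_(momentum).add_(p_o.detach(), alpha=1.0 - momentum)
```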
3.3. Technological Evolution
The field moved from hand-crafted features → supervised CNNs → contrastive SSL → generative SSL (MAE). V-JEPA represents the latest stage: non-generative, predictive SSL, focusing on efficiency and semantic understanding rather than pixel-perfect reconstruction.
4. Methodology
4.1. Principles
V-JEPA is built on the idea that a model should learn the "gist" of a video. If you see a person starting to throw a ball, you can predict they will release it and it will move forward. You don't need to predict the exact texture of every blade of grass in the background.
4.2. Core Methodology In-depth
Step 1: Tokenization (Turning Video into Math)
A video clip of 16 frames is divided into 3D spatio-temporal patches (tubelets). Each patch covers 16×16 pixels and spans 2 frames.
- The video tensor of shape (Frames × Height × Width × Color) is processed by a 3D convolution.
- The result is a sequence of $L$ tokens. For a standard 16×224×224 clip, $L = 8 \times 14 \times 14 = 1568$.
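A minimal sketch of this tubelet tokenization, assuming a 16-frame 224×224 clip and an embedding dimension of 1024 (the module names and dimension are illustrative):

```python
import torch
import torch.nn as nn

# Assumed: 16 frames, 3 channels, 224x224 resolution,
# tubelet size 2x16x16, embedding dimension 1024 (illustrative).
patch_embed = nn.Conv3d(
    in_channels=3, out_channels=1024,
    kernel_size=(2, 16, 16), stride=(2, 16, 16),
)

video = torch.randn(1, 3, 16, 224, 224)       # (B, C, T, H, W)
tokens = patch_embed(video)                   # (1, 1024, 8, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)    # (1, 1568, 1024)
print(tokens.shape)                           # torch.Size([1, 1568, 1024])
```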
Step 2: Masking (Hiding Information)
V-JEPA uses a Multi-Block Masking strategy.
- Context $x$: The subset of tokens that the model is allowed to see.
- Target $y$: Large, spatially contiguous blocks of tokens that are hidden. These blocks are repeated across the temporal dimension to prevent the model from simply "cheating" by looking at the previous or next frame to see the missing area.
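A rough sketch of spatio-temporal block masking in this spirit follows; the block sizes and counts below are illustrative placeholders, not the paper's exact hyperparameters:

```python
import torch

def sample_block_mask(t=8, h=14, w=14, block_h=6, block_w=6, num_blocks=4):
    """Return a boolean mask of shape (t, h, w): True = masked (target)."""
    mask = torch.zeros(h, w, dtype=torch.bool)
    for _ in range(num_blocks):
        top = torch.randint(0, h - block_h + 1, (1,)).item()
        left = torch.randint(0, w - block_w + 1, (1,)).item()
        mask[top:top + block_h, left:left + block_w] = True
    # Repeat the same spatial mask over every temporal slice so the
    # hidden region cannot be recovered from neighbouring frames.
    return mask.unsqueeze(0).expand(t, h, w)

mask = sample_block_mask()
context_idx = (~mask).flatten().nonzero(as_tuple=True)[0]  # visible tokens
target_idx = mask.flatten().nonzero(as_tuple=True)[0]      # hidden tokens
```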
Step 3: Encoding
V-JEPA uses two encoders:
- Online Encoder $E_{\theta}$: Processes the visible context $x$.
- Target Encoder $\overline{E}_{\theta}$: Processes the entire video clip to produce the ground-truth latent features for the target $y$. This encoder is an Exponential Moving Average (EMA) of the online encoder and does not receive gradients.
Step 4: Prediction and Loss
The predictor $P_{\phi}$ (a narrow Transformer) takes the output of the online encoder together with mask tokens (learnable placeholders carrying the positional information $\Delta_{y}$ of the hidden regions) and tries to predict what the target encoder produced for those regions.
The training objective is to minimize the distance between the predicted features and the target features: $ \mathcal{L} = \left\| P_{\phi}\big(E_{\theta}(x), \Delta_{y}\big) - \mathrm{sg}\big(\overline{E}_{\theta}(y)\big) \right\|_1 $
Symbol Explanation:
- $P_{\phi}$: The predictor network with parameters $\phi$.
- $E_{\theta}(x)$: The representation of the visible context $x$ produced by the online encoder.
- $\Delta_{y}$: The spatio-temporal positions of the missing part $y$.
- $\mathrm{sg}(\cdot)$: Stop-gradient operation. It tells the optimizer: "Don't backpropagate this error into the target encoder; only update the online encoder and predictor."
- $\overline{E}_{\theta}(y)$: The representation of the target part $y$ produced by the target (EMA) encoder.
- $\|\cdot\|_1$: The L1 norm, i.e., the sum of absolute differences. This was found to be more stable than the standard squared (L2) distance.
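Putting the pieces together, a simplified single training step might look like the following; the module interfaces and indexing are simplified assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def vjepa_step(encoder, target_encoder, predictor, video_tokens,
               context_idx, target_idx, target_positions):
    """One simplified V-JEPA training step on pre-tokenized video.

    video_tokens: (B, L, D) patch embeddings of the full clip.
    context_idx / target_idx: indices of visible / hidden tokens.
    target_positions: positional embeddings for the hidden tokens.
    """
    # 1. Encode the visible context with the online encoder.
    context = encoder(video_tokens[:, context_idx])              # (B, Lc, D)

    # 2. Encode the full clip with the EMA target encoder, then keep
    #    only the hidden (target) tokens. no_grad acts as stop-gradient.
    with torch.no_grad():
        targets = target_encoder(video_tokens)[:, target_idx]    # (B, Lt, D)

    # 3. Predict the target features from the context plus
    #    position-aware mask tokens.
    preds = predictor(context, target_positions)                 # (B, Lt, D)

    # 4. L1 loss between predicted and target features.
    return F.l1_loss(preds, targets)
```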
The following figure (Figure 2 from the original paper) summarizes this JEPA flow:
This schematic shows the Joint-Embedding Predictive Architecture: input $x$ is processed by an x-encoder and input $y$ by a y-encoder; a predictor takes the embedding of $x$ and produces a prediction of the embedding of $y$, and a loss is computed between the prediction and the target embedding. The architecture learns representations of the input through prediction.
5. Experimental Setup
5.1. Datasets
The authors created VideoMix2M, a 2-million-video pretraining dataset, by combining:
- HowTo100M: Instructional videos (cooking, DIY).
- Kinetics (400/600/700): Human action videos (running, dancing).
- Something-Something-v2 (SSv2): Videos showing basic physics/interactions (e.g., "dropping a pen").
In addition, ImageNet-1K is used for evaluation, to measure how well the video model understands static images.
5.2. Evaluation Metrics
- Top-1 Accuracy:
  - Definition: The percentage of samples for which the model's highest-probability guess is the correct label.
  - Formula: $\text{Top-1} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = y_i)$
  - Symbols: $N$ is the total number of samples, $\hat{y}_i$ is the predicted class, $y_i$ is the true class, and $\mathbb{1}(\cdot)$ is the indicator function (1 if true, 0 if false).
- Mean Average Precision (mAP):
  - Definition: Used for action localization (AVA dataset). It measures the area under the precision-recall curve, averaged across all classes.
  - Formula: $\text{mAP} = \frac{1}{C} \sum_{c=1}^{C} \text{AP}_c$
  - Symbols: $C$ is the number of classes and $\text{AP}_c$ is the average precision for class $c$.
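For concreteness, here is a small sketch of how top-1 accuracy could be computed from model logits (a generic illustration, not the paper's evaluation code):

```python
import torch

def top1_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """logits: (N, num_classes), labels: (N,) integer class ids."""
    preds = logits.argmax(dim=-1)              # highest-probability class
    return (preds == labels).float().mean().item()

logits = torch.randn(8, 400)                   # e.g., 400 Kinetics classes
labels = torch.randint(0, 400, (8,))
print(f"top-1 accuracy: {top1_accuracy(logits, labels):.3f}")
```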
5.3. Baselines
- VideoMAE / VideoMAEv2: The standard for pixel-reconstruction SSL on video.
- OmniMAE: A model trained on both images and videos using pixel reconstruction.
- DINOv2: A state-of-the-art image model, used to test whether video-trained V-JEPA can compete with image-trained giants.
6. Results & Analysis
6.1. Core Results Analysis
The most striking result is V-JEPA's performance in Frozen Evaluation. Usually, to get a model to work on a new task, you must "fine-tune" (update) all its millions of parameters. V-JEPA is so semantically rich that you can freeze its parameters and just train a simple "probe" to get world-class results.
6.2. Data Presentation (Tables)
The following are the results from Table 5 of the original paper, comparing V-JEPA to pixel-prediction methods. The first four task columns use frozen evaluation with an attentive-pooling probe; the last two report end-to-end fine-tuning:
| Methods | Arch. | K400 (frozen) | SSv2 (frozen) | AVA (frozen) | IN1K (frozen) | K400 (ft) | SSv2 (ft) |
|---|---|---|---|---|---|---|---|
| OmniMAE | ViT-L/16 | 65.6 | 60.6 | 14.4 | 75.1 | 84.0 | 74.2 |
| VideoMAE | ViT-L/16 | 77.8 | 65.5 | 21.6 | 71.1 | 85.4 | 74.3 |
| Hiera | Hiera-L | 75.5 | 64.2 | 15.8 | 68.9 | 87.3 | 75.1 |
| V-JEPA | ViT-L/16 | 80.8 | 69.5 | 25.6 | 74.8 | 85.6 | 75.1 |
Analysis: V-JEPA (ViT-L/16) significantly outperforms VideoMAE and OmniMAE in frozen evaluation across almost all tasks, especially on the AVA task (action detection), where it leads by 4 points.
6.3. Efficiency and Time
Figure 5 in the paper illustrates that V-JEPA reaches high accuracy much faster than pixel-reconstruction models. While VideoMAE might take a long time to learn to draw every pixel, V-JEPA quickly learns the underlying features.
This chart plots SSv2 frozen-evaluation accuracy against pretraining time for V-JEPA and other video models; V-JEPA exceeds 70% accuracy with substantially less pretraining time, highlighting the advantage of video feature prediction.
6.4. Ablation: Attentive Probing
The authors found that instead of just averaging all the tokens at the end (Average Pooling), using an Attentive Probe (a small layer that "pays attention" to the most important parts of the video) boosted performance by over 16 points.
$ \text{Output} = \sum_{i=1}^{L} \text{softmax}\left(\frac{q^\top W_k s_i}{\sqrt{d}}\right) W_v s_i $
- $q$: A learnable query vector.
- $s_i$: The sequence of tokens from the frozen encoder ($i = 1, \dots, L$).
- $W_k, W_v$: Learned key and value projection matrices ($d$ is the key dimension).
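A minimal sketch of such an attentive (cross-attention) probe, assuming a single learnable query and a linear classifier head (a simplification of the paper's probe, with illustrative dimensions):

```python
import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    """Single-query cross-attention pooling + linear classifier (sketch)."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))   # learnable q
        self.w_k = nn.Linear(dim, dim, bias=False)           # W_k
        self.w_v = nn.Linear(dim, dim, bias=False)           # W_v
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, L, D) features from the frozen encoder
        k = self.w_k(tokens)                                   # (B, L, D)
        v = self.w_v(tokens)                                   # (B, L, D)
        attn = (self.query @ k.transpose(1, 2)) / k.shape[-1] ** 0.5
        attn = attn.softmax(dim=-1)                            # (B, 1, L)
        pooled = (attn @ v).squeeze(1)                         # (B, D)
        return self.head(pooled)

probe = AttentiveProbe(dim=1024, num_classes=174)  # SSv2 has 174 classes
logits = probe(torch.randn(2, 1568, 1024))
```

Only the probe's parameters are trained; the backbone stays frozen, which keeps the per-task training cost very low.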
7. Conclusion & Reflections
7.1. Conclusion Summary
V-JEPA proves that feature prediction is superior to pixel reconstruction for learning video representations. By predicting in the latent space, the model ignores "noise" and captures "semantics." It is more efficient to train, more versatile when frozen, and highly label-efficient.
7.2. Limitations & Future Work
- Data Diversity: While 2 million videos is a lot, image models like DINOv2 are trained on 142 million curated images. V-JEPA still lags slightly behind image-only models on static tasks like ImageNet.
- Dataset Bias: Most current video datasets are focused on human actions. Future work should involve more "egocentric" (first-person) or "robotic" data to learn broader world models.
7.3. Personal Insights & Critique
V-JEPA is a major milestone in Yann LeCun's vision of World Models. By moving away from pixels, the model behaves more like a human—we don't remember the exact color of every pixel in a movie; we remember that the hero jumped off a bridge.
Critique: The reliance on attentive probing suggests that the linear separability of these features is not as high as that of contrastive models. However, the trade-off—getting such high performance from a frozen backbone—is a huge win for researchers who lack the massive compute required to fine-tune giant models for every new task. This makes V-JEPA a very "democratic" model for the research community.