
Revisiting Feature Prediction for Learning Visual Representations from Video

Published: 02/16/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

V-JEPA is a collection of vision models trained on 2 million videos with a standalone feature-prediction objective and no pretrained image encoders. The resulting visual representations are versatile, performing strongly on both motion-based and appearance-based tasks with a frozen backbone.

Abstract

This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely using a feature prediction objective, without the use of pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion and appearance-based tasks, without adaption of the model's parameters; e.g., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet1K.


In-depth Reading


1. Bibliographic Information

1.1. Title

Revisiting Feature Prediction for Learning Visual Representations from Video

1.2. Authors

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. The authors are primarily affiliated with FAIR (Fundamental AI Research) at Meta, along with researchers from Inria, École normale supérieure (PSL), New York University, and Université Gustave Eiffel. Yann LeCun is a pioneer in deep learning and the Chief AI Scientist at Meta.

1.3. Journal/Conference

This paper was published as a preprint on arXiv (first posted in February 2024 and revised in April 2024). Given the authorship and the scope of the work, it is representative of high-impact research typically seen at top-tier computer vision conferences such as CVPR, ICCV, or NeurIPS.

1.4. Publication Year

2024 (First published Feb 15, 2024; updated April 15, 2024).

1.5. Abstract

The paper explores feature prediction as a standalone objective for unsupervised learning from video. It introduces V-JEPA, a collection of vision models trained solely by predicting latent features of masked video regions, without using pretrained image encoders, text, negative examples, or pixel-level reconstruction. Trained on 2 million videos, V-JEPA learns versatile representations that perform well on both motion-heavy (e.g., Something-Something-v2) and appearance-heavy (e.g., Kinetics-400) tasks using a frozen backbone. The largest model (ViT-H/16) achieves state-of-the-art results for video-only pretraining, demonstrating high efficiency and label-efficiency.

2. Executive Summary

2.1. Background & Motivation

In the current landscape of computer vision, Self-Supervised Learning (SSL) aims to train models on vast amounts of unlabeled data. A popular approach for video is Masked Video Modeling (MVM), where parts of a video are hidden (masked) and the model tries to reconstruct the missing pixels. However, pixel reconstruction is computationally expensive and forces the model to focus on irrelevant low-level details (like individual leaf movements in a forest) rather than high-level semantic concepts.

The core problem V-JEPA addresses is: Can we learn powerful video representations by predicting semantic features instead of raw pixels? This follows the "predictive feature principle," suggesting that if a model can predict the "meaning" of a hidden video segment based on the surrounding context, it truly understands the world's dynamics.

2.2. Main Contributions / Findings

  • V-JEPA Architecture: A purely feature-predictive model that avoids the overhead of pixel decoders and the need for negative samples (contrastive learning).

  • Efficiency: V-JEPA is significantly faster to train than pixel-reconstruction methods (like VideoMAE) because it operates in a lower-dimensional latent space.

  • Versatility: The model performs exceptionally well on diverse tasks (motion vs. appearance) using a frozen encoder. This means the model's internal "understanding" is so good that it doesn't need to be updated for new tasks; you only need to train a tiny classification layer on top.

  • Label Efficiency: When only a small fraction of labeled data (e.g., 5% or 10%) is available, V-JEPA maintains high performance, outperforming previous methods by a wide margin.


3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Self-Supervised Learning (SSL): A method where the data itself provides the "labels." For example, by hiding part of an image and asking the model to guess what’s missing.

  • Vision Transformer (ViT): Unlike CNNs, which slide convolutional filters over a regular pixel grid, a ViT breaks an image or video into small squares called patches. These patches are converted into vectors called tokens and processed with self-attention, which allows every part of the image to "talk" to every other part.

  • Joint-Embedding Predictive Architecture (JEPA): Proposed by Yann LeCun, this architecture doesn't generate pixels. Instead, it embeds two different views of the same data ($x$ and $y$) into a mathematical space and tries to predict the embedding of $y$ from $x$.

  • The following figure (Figure 1 from the original paper) illustrates this versatility:

    Figure 1: V-JEPA models pretrained on video learn versatile visual representations. They perform well on motion-based tasks (Something-Something-v2) and appearance-based tasks (Kinetics-400) without adaptation of the model's parameters, i.e., using the same frozen backbone for both tasks.

3.2. Previous Works

  • Masked Autoencoders (MAE): Models like VideoMAE mask 90% of video patches and use a "decoder" to predict the actual colors of the missing pixels. The reconstruction loss is usually the Mean Squared Error (MSE): $ \mathcal{L}_{\mathrm{MSE}} = \frac{1}{n} \sum_{i=1}^{n} (P_{\mathrm{pixel},i} - T_{\mathrm{pixel},i})^2 $, where $P_{\mathrm{pixel},i}$ is the predicted pixel value and $T_{\mathrm{pixel},i}$ the true pixel value.
  • Contrastive Learning: Models like CLIP or SimCLR compare different images and try to bring "positive" pairs (different views of the same image) closer together while pushing "negative" pairs (different images) apart. V-JEPA avoids this "negative" sampling, which is often complex to implement.
  • Exponential Moving Average (EMA): A technique where a "target" model is a slow-moving version of the "online" model. This prevents the model from "collapsing" (outputting the same constant value for every input).

3.3. Technological Evolution

The field moved from Hand-crafted features → Supervised CNNs → Contrastive SSL → Generative SSL (MAE). V-JEPA represents the latest stage: Non-generative, Predictive SSL, focusing on efficiency and semantic understanding rather than pixel-perfect reconstruction.


4. Methodology

4.1. Principles

V-JEPA is built on the idea that a model should learn the "gist" of a video. If you see a person starting to throw a ball, you can predict they will release it and it will move forward. You don't need to predict the exact texture of every blade of grass in the background.

4.2. Core Methodology In-depth

Step 1: Tokenization (Turning Video into Math)

A video clip of $T$ frames is divided into 3D spatio-temporal patches. Each patch is $16 \times 16$ pixels and spans 2 frames.

  • The video of shape $16 \times 224 \times 224 \times 3$ (frames $\times$ height $\times$ width $\times$ color channels) is processed by a 3D convolution.
  • The result is a sequence of $L$ tokens. For a standard clip, $L = (16/2) \times (224/16) \times (224/16) = 1568$.
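
As an illustration of the tokenization step above, here is a minimal PyTorch sketch using a 3D convolution (the embedding width and variable names are assumptions for illustration, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

# Illustrative sketch of V-JEPA-style 3D patch embedding (not the official code).
# A tubelet of 2 frames x 16 x 16 pixels is mapped to one token of width `embed_dim`.
embed_dim = 1024  # e.g., a ViT-L width; assumed for illustration

patch_embed = nn.Conv3d(
    in_channels=3,
    out_channels=embed_dim,
    kernel_size=(2, 16, 16),   # (temporal, height, width) tubelet size
    stride=(2, 16, 16),        # non-overlapping tubelets
)

video = torch.randn(1, 3, 16, 224, 224)     # (batch, channels, frames, H, W)
tokens = patch_embed(video)                  # (1, embed_dim, 8, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, L, embed_dim)
print(tokens.shape[1])                       # L = 8 * 14 * 14 = 1568
```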

Step 2: Masking (Hiding Information)

V-JEPA uses a Multi-Block Masking strategy.

  • Context $x$: A subset of tokens that the model is allowed to see.
  • Target $y$: Large, spatially continuous blocks that are hidden. These blocks are often repeated across the temporal dimension to prevent the model from simply "cheating" by looking at the previous or next frame to see the missing area.
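
A minimal sketch of how such a spatially continuous, temporally repeated block mask could be sampled is shown below (the block counts and sizes are illustrative assumptions, not the paper's exact masking parameters):

```python
import torch

def sample_block_mask(t=8, h=14, w=14, num_blocks=2, block_h=7, block_w=7):
    """Return a boolean mask of shape (t*h*w,) where True marks *target* tokens.

    Illustrative sketch: each block is a contiguous spatial rectangle, and the
    same spatial mask is repeated across all temporal positions so the model
    cannot recover the target from neighbouring frames.
    """
    spatial_mask = torch.zeros(h, w, dtype=torch.bool)
    for _ in range(num_blocks):
        top = torch.randint(0, h - block_h + 1, (1,)).item()
        left = torch.randint(0, w - block_w + 1, (1,)).item()
        spatial_mask[top:top + block_h, left:left + block_w] = True
    # Repeat the same spatial block mask over the temporal axis.
    mask = spatial_mask.unsqueeze(0).expand(t, h, w)
    return mask.reshape(-1)

mask = sample_block_mask()
context_idx = (~mask).nonzero(as_tuple=True)[0]   # tokens visible to the encoder
target_idx = mask.nonzero(as_tuple=True)[0]       # tokens the predictor must infer
```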

Step 3: Encoding

V-JEPA uses two encoders:

  1. Online Encoder $E_{\theta}$: Processes the visible context $x$.
  2. Target Encoder $\overline{E}_{\theta}$: Processes the entire video clip to produce the ground-truth latent features for the target $y$. This encoder is an Exponential Moving Average (EMA) of the online encoder and does not receive gradients.
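
Below is a minimal sketch of the EMA update that keeps the target encoder a slow-moving copy of the online encoder (the momentum value and function names here are illustrative placeholders, not the paper's exact settings):

```python
import copy
import torch
import torch.nn as nn

def make_target_encoder(online_encoder: nn.Module) -> nn.Module:
    """The target encoder starts as an exact copy and is never trained by gradients."""
    target_encoder = copy.deepcopy(online_encoder)
    for p in target_encoder.parameters():
        p.requires_grad_(False)
    return target_encoder

@torch.no_grad()
def update_target_encoder(online_encoder: nn.Module,
                          target_encoder: nn.Module,
                          momentum: float = 0.998) -> None:
    """EMA update: target <- momentum * target + (1 - momentum) * online.

    Updating the target only through this slow copy rule (never through the loss)
    is what prevents the trivial 'constant output' collapse.
    """
    for p_online, p_target in zip(online_encoder.parameters(),
                                  target_encoder.parameters()):
        p_target.mul_(momentum).add_(p_online, alpha=1.0 - momentum)
```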

Step 4: Prediction and Loss

The Predictor $P_{\phi}$ (a small Transformer) takes the output of the online encoder and mask tokens (placeholders carrying the position information $\Delta_y$) and tries to predict what the target encoder saw.

The training objective is to minimize the $L_1$ distance between the predicted features and the target features: $ \mathcal{L} = \left\| P_{\phi}\big(E_{\theta}(x), \Delta_{y}\big) - \mathrm{sg}\big(\overline{E}_{\theta}(y)\big) \right\|_1 $

Symbol Explanation:

  • $P_{\phi}$: The predictor network with parameters $\phi$.

  • $E_{\theta}(x)$: The representation of the visible part $x$ produced by the online encoder.

  • $\Delta_{y}$: The spatio-temporal position of the missing part $y$.

  • $\mathrm{sg}(\cdot)$: Stop-gradient operation. It tells the computer: "Don't update the target encoder using this specific error; only update the online encoder."

  • $\overline{E}_{\theta}(y)$: The representation of the target part $y$ produced by the target (EMA) encoder.

  • $\| \cdot \|_1$: The L1 norm, i.e., the sum of the absolute differences. This was found to be more stable than the standard $L_2$ (squared) distance.

    The following figure (Figure 2 from the original paper) summarizes this JEPA flow:

    Figure 2: Joint-Embedding Predictive Architectures are trained to predict the representation of an input $y$ from the representation of another input $x$. The additional variable $z$ provides the predictor with information about the transformation that computes $y$ from $x$. The predictor outputs $\hat{s}_y$, which is compared to the target representation $s_y$ via a distance $D(\hat{s}_y, s_y)$.
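
Putting Steps 1 to 4 together, the following is a minimal PyTorch sketch of one training step under this objective (the module interfaces, shapes, and the way the predictor is conditioned on the target positions are simplifying assumptions, not the authors' exact implementation):

```python
import torch
import torch.nn.functional as F

def vjepa_loss(online_encoder, target_encoder, predictor,
               video_tokens, context_idx, target_idx):
    """L1 feature-prediction loss with a stop-gradient on the target branch.

    Illustrative sketch: `video_tokens` is (B, L, D) after patch embedding;
    the encoders and predictor are assumed to accept token subsets directly.
    """
    # Online branch: encode only the visible context tokens.
    context_repr = online_encoder(video_tokens[:, context_idx])

    # Target branch: encode the full clip, then select the target positions.
    with torch.no_grad():                      # stop-gradient sg(.)
        target_repr = target_encoder(video_tokens)[:, target_idx]

    # Predictor fills in representations at the masked positions,
    # conditioned on their spatio-temporal locations (Delta_y).
    predicted = predictor(context_repr, target_idx)

    return F.l1_loss(predicted, target_repr)
```

In a full training loop, this loss would be backpropagated through the online encoder and predictor only, followed by the EMA update sketched in Step 3.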


5. Experimental Setup

5.1. Datasets

The authors created VideoMix2M, a 2-million-video pretraining set, by combining:

  • HowTo100M: Instructional videos (cooking, DIY).
  • Kinetics (400/600/700): Human action videos (running, dancing).
  • Something-Something-v2 (SSv2): Videos showing basic physics/interactions (e.g., "dropping a pen").

For downstream evaluation, the frozen models are additionally probed on ImageNet-1K to measure how well a video-trained model understands static images.

5.2. Evaluation Metrics

  1. Top-1 Accuracy:

    • Definition: The percentage of times the model's highest-probability guess is the correct label (see the code sketch after this list).
    • Formula: $\text{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = y_i)$
    • Symbols: $N$ is the total number of samples, $\hat{y}_i$ is the predicted class, $y_i$ is the true class, and $\mathbb{1}$ is the indicator function (1 if true, 0 if false).
  2. Mean Average Precision (mAP):

    • Definition: Used for action localization (AVA dataset). It measures the area under the precision-recall curve, averaged across all classes.
    • Formula: $\text{mAP} = \frac{1}{C} \sum_{c=1}^{C} \text{AP}_c$
    • Symbols: $C$ is the number of classes, and $\text{AP}_c$ is the average precision for class $c$.
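
As a concrete illustration of the top-1 accuracy formula above, a short PyTorch sketch (illustrative only):

```python
import torch

def top1_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of samples whose highest-scoring class matches the true label."""
    predictions = logits.argmax(dim=-1)           # \hat{y}_i
    return (predictions == labels).float().mean().item()

logits = torch.tensor([[2.0, 0.5, 0.1],           # predicted class 0
                       [0.2, 1.5, 0.3]])          # predicted class 1
labels = torch.tensor([0, 2])
print(top1_accuracy(logits, labels))              # 0.5
```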

5.3. Baselines

  • VideoMAE / VideoMAEv2: The standard for pixel-reconstruction SSL.

  • OmniMAE: A model trained on both images and videos using pixel reconstruction.

  • DINOv2: A state-of-the-art image model used to see if video-trained V-JEPA can compete with image-trained giants.


6. Results & Analysis

6.1. Core Results Analysis

The most striking result is V-JEPA's performance in Frozen Evaluation. Usually, to get a model to work on a new task, you must "fine-tune" (update) all its millions of parameters. V-JEPA is so semantically rich that you can freeze its parameters and just train a simple "probe" to get world-class results.

6.2. Data Presentation (Tables)

The following are the results from Table 5 of the original paper, comparing V-JEPA to pixel-prediction methods:

| Method | Arch. | K400 (frozen) | SSv2 (frozen) | AVA (frozen) | IN1K (frozen) | K400 (fine-tuned) | SSv2 (fine-tuned) |
|---|---|---|---|---|---|---|---|
| OmniMAE | ViT-L/16 | 65.6 | 60.6 | 14.4 | 75.1 | 84.0 | 74.2 |
| VideoMAE | ViT-L/16 | 77.8 | 65.5 | 21.6 | 71.1 | 85.4 | 74.3 |
| Hiera | Hiera-L | 75.5 | 64.2 | 15.8 | 68.9 | 87.3 | 75.1 |
| V-JEPA | ViT-L/16 | 80.8 | 69.5 | 25.6 | 74.8 | 85.6 | 75.1 |

Frozen-evaluation columns use an attentive probe on top of a frozen backbone; fine-tuning columns update all backbone parameters.

Analysis: V-JEPA (ViT-L/16) significantly outperforms VideoMAE and OmniMAE in frozen evaluation across almost all tasks, especially on the AVA task (action detection), where it leads by 4 points.

6.3. Efficiency and Time

Figure 5 in the paper illustrates that V-JEPA reaches high accuracy much faster than pixel-reconstruction models. While VideoMAE might take a long time to learn to draw every pixel, V-JEPA quickly learns the underlying features.

Figure 5: SSv2 frozen-evaluation performance vs. pretraining time. Wallclock times for all methods are measured on a single GPU with a batch size of 10 clips, using the official codebases for VideoMAE and VideoMAEv2, and linearly extrapolated assuming a global batch size of 2400 samples. However, note that the SSv2 accuracies of video pixel prediction methods are actually obtained with small batch sizes and significantly longer training schedules. V-JEPA outperforms pixel-reconstruction methods while training significantly faster.

6.4. Ablation: Attentive Probing

The authors found that instead of just averaging all the tokens at the end (Average Pooling), using an Attentive Probe (a small layer that "pays attention" to the most important parts of the video) boosted performance by over 16 points.

$ \text{Output} = \sum_{i=1}^{L} \text{softmax}_i\!\left(\frac{q^\top W_k s_i}{\sqrt{d}}\right) W_v s_i $

where the softmax is normalized over the token index $i$.

  • $q$: Learnable query vector.

  • $s_i$: The $i$-th token in the sequence produced by the frozen encoder.

  • $W_k, W_v$: Learned key and value weight matrices.

  • $d$: Token dimension, used to scale the attention logits.
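
Below is a minimal sketch of such a single-query attentive pooling probe in PyTorch (this single-head form mirrors the formula above; the dimensions and classifier head are illustrative assumptions, not the paper's exact probe):

```python
import torch
import torch.nn as nn

class AttentivePool(nn.Module):
    """A single learnable query cross-attends over frozen encoder tokens.

    Illustrative sketch of attentive probing: only this module and its
    classifier head are trained; the backbone stays frozen.
    """
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)
        self.head = nn.Linear(dim, num_classes)
        self.scale = dim ** -0.5

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: (B, L, dim) tokens from the frozen encoder.
        k, v = self.w_k(s), self.w_v(s)                        # (B, L, dim)
        attn = (self.query @ k.transpose(1, 2)) * self.scale   # (B, 1, L)
        attn = attn.softmax(dim=-1)                            # softmax over tokens
        pooled = attn @ v                                      # (B, 1, dim)
        return self.head(pooled.squeeze(1))                    # (B, num_classes)

# probe = AttentivePool(dim=1024, num_classes=400)   # e.g., Kinetics-400
# logits = probe(frozen_features)                    # frozen_features: (B, L, 1024)
```

Only the probe's parameters (query, key/value projections, and head) receive gradients; the backbone remains frozen throughout.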


7. Conclusion & Reflections

7.1. Conclusion Summary

V-JEPA proves that feature prediction is superior to pixel reconstruction for learning video representations. By predicting in the latent space, the model ignores "noise" and captures "semantics." It is more efficient to train, more versatile when frozen, and highly label-efficient.

7.2. Limitations & Future Work

  • Data Diversity: While 2 million videos is a lot, image models like DINOv2 are trained on 142 million curated images. V-JEPA still lags slightly behind image-only models on static tasks like ImageNet.
  • Dataset Bias: Most current video datasets are focused on human actions. Future work should involve more "egocentric" (first-person) or "robotic" data to learn broader world models.

7.3. Personal Insights & Critique

V-JEPA is a major milestone in Yann LeCun's vision of World Models. By moving away from pixels, the model behaves more like a human—we don't remember the exact color of every pixel in a movie; we remember that the hero jumped off a bridge.

Critique: The requirement for attentive probing suggests that the linear separability of these features (i.e., plain linear-probe performance) is not as high as that of contrastive models. However, the trade-off of getting such high performance from a frozen backbone is a huge win for researchers who don't have the massive compute required to fine-tune giant models for every new task. This makes V-JEPA a very "democratic" model for the research community.
