VL-JEPA: Joint Embedding Predictive Architecture for Vision-language
TL;DR Summary
VL-JEPA introduces a vision-language model using a Joint Embedding Predictive Architecture that predicts continuous text embeddings, outperforming traditional models with 50% fewer parameters. It also supports selective decoding, enhancing efficiency for tasks like open-vocabulary classification, text-to-video retrieval, and visual question answering.
Abstract
We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA's predicted embeddings into text. We show that VL-JEPA natively supports selective decoding that reduces the number of decoding operations by 2.85x while maintaining similar performance compared to non-adaptive uniform decoding. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets: GQA, TallyQA, POPE and POPEv2, despite having only 1.6B parameters.
In-depth Reading
1. Bibliographic Information
1.1. Title
VL-JEPA: Joint Embedding Predictive Architecture for Vision-language
1.2. Authors
The authors are Delong Chen, Mustafa Shukor, Théo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Allen Bolourchi, Yann LeCun, and Pascale Fung. They represent Meta FAIR (Fundamental AI Research), the Hong Kong University of Science and Technology (HKUST), Sorbonne Université, and New York University (NYU). Yann LeCun is a pioneer in deep learning and the Chief AI Scientist at Meta.
1.3. Journal/Conference
The paper was published on arXiv (a prominent preprint server for rapid dissemination of research) on December 11, 2025. Given the authorship and the novelty of applying the Joint Embedding Predictive Architecture (JEPA) to vision-language tasks, it is highly likely to be submitted to top-tier computer vision or machine learning conferences like CVPR, ICCV, or NeurIPS.
1.4. Publication Year
2025
1.5. Abstract
The paper introduces VL-JEPA, a vision-language model that moves away from traditional autoregressive token generation. Instead of predicting tokens one by one, it predicts continuous embeddings of target texts in an abstract representation space. This allows the model to focus on semantics rather than surface-level linguistic variability (like word choice or style). In controlled experiments, VL-JEPA achieved stronger performance than standard Vision-Language Models (VLMs) while using 50% fewer trainable parameters. It also features selective decoding, reducing decoding operations by 2.85x. The model excels at video classification, retrieval, and visual question answering (VQA) with 1.6B parameters.
1.6. Original Source Link
-
Original Source: https://arxiv.org/abs/2512.10942
-
PDF Link: https://arxiv.org/pdf/2512.10942v1.pdf
-
Status: Preprint (under review or recently released).
2. Executive Summary
2.1. Background & Motivation
The current dominant paradigm for Vision-Language Models (VLMs) is generative. These models take an image or video and a question, then generate a text response token by token (autoregressively). While effective, this approach has two major drawbacks:
- Computational Waste: Generative models spend significant energy learning "surface" details, such as specific phrasing, grammar, and synonyms, that don't necessarily change the underlying meaning or correctness of an answer.
- High Latency: For real-time applications (like a robot describing a live video stream), waiting for a model to generate text word by word is too slow.

The core problem is that these models work in the "data space" (pixels and words) rather than the "latent space" (abstract concepts). The paper aims to solve this by applying the Joint Embedding Predictive Architecture (JEPA), which predicts high-level semantic embeddings instead of raw data.
2.2. Main Contributions / Findings
- VL-JEPA Architecture: The first non-generative model designed for general-domain vision-language tasks.
- Latent Space Prediction: Demonstrates that predicting abstract embeddings is more efficient and accurate than predicting discrete tokens, especially in data-constrained environments.
- Efficiency: Achieves better performance than traditional VLMs with 50% fewer trainable parameters.
- Selective Decoding: Introduces a method to monitor the "semantic stream" of embeddings and only invoke the expensive text decoder when a significant change in meaning occurs.
- Generalist Capability: A single model that outperforms specialized models in video classification and retrieval while matching large generative models in VQA.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Joint Embedding Predictive Architecture (JEPA)
JEPA is a self-supervised learning paradigm proposed by Yann LeCun. Unlike traditional models that try to reconstruct missing pixels (like Masked Autoencoders) or predict the next token (like GPT), JEPA tries to predict the representation of a target part of the data from the representation of a context part.
- Context: What the model sees (e.g., an image or the start of a video).
- Target: What the model needs to predict (e.g., what happens next or a textual description).
- Benefit: By working in "embedding space," the model can ignore noise. For example, if predicting a video of a tree in the wind, it doesn't need to predict the exact position of every leaf (noise); it only needs to predict "the tree is swaying" (semantics).
3.1.2. Autoregressive vs. Non-Autoregressive
- Autoregressive: The model predicts token $t$ based on tokens $1$ to $t-1$. It is a serial process (one forward pass per generated token).
- Non-Autoregressive: The model predicts the entire target at once, in a single forward pass. VL-JEPA predicts a single vector representing the whole answer, which can then be decoded in one go (a toy illustration of the cost difference follows below).
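To make the cost asymmetry concrete, here is a toy, self-contained sketch (not a model of either architecture): the autoregressive path runs one forward pass per generated token, while the embedding-prediction path runs a single pass regardless of answer length. All module names and sizes are illustrative.

```python
import torch
import torch.nn as nn

D, vocab = 64, 100
step = nn.Linear(D, D)           # toy stand-in for one autoregressive decoder step
to_logits = nn.Linear(D, vocab)  # toy next-token head
to_embedding = nn.Linear(D, D)   # toy stand-in for a one-shot embedding predictor

def autoregressive_passes(context, answer_len=12):
    """Token-by-token decoding: forward passes grow linearly with answer length."""
    state, passes = context, 0
    for _ in range(answer_len):
        state = torch.tanh(step(state))    # one pass per token
        _ = to_logits(state).argmax()      # pick the next token (discarded in this toy)
        passes += 1
    return passes

def jepa_style_passes(context):
    """Embedding prediction: a single pass yields the whole-answer embedding."""
    _ = to_embedding(context)
    return 1

ctx = torch.randn(D)
print(autoregressive_passes(ctx), "passes vs", jepa_style_passes(ctx), "pass")
```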
3.1.3. Embedding Space
An embedding is a numerical vector (a list of numbers) that represents a piece of data (like a word or image) such that similar items are mathematically "close" to each other in a multi-dimensional space.
3.2. Previous Works
The authors build upon several key precursors:
- V-JEPA: A previous model that used JEPA to learn video representations by predicting the latent features of masked video blocks.
- CLIP (Contrastive Language-Image Pre-training): A model that uses two separate encoders (one for images, one for text) and trains them to produce similar embeddings for matching pairs using the InfoNCE loss.
- Generative VLMs (e.g., LLaVA, Flamingo): These connect a vision encoder to a Large Language Model (LLM) and use "next-token prediction" to generate text.
3.3. Technological Evolution
The field has moved from simple classifiers to CLIP-style alignment (matching images to text) to generative VLMs (chatting about images). VL-JEPA represents a new branch: Predictive Latent VLMs, combining the semantic alignment of CLIP with the task flexibility of generative models.
4. Methodology
4.1. Principles
The core intuition of VL-JEPA is that the "answer" to a visual query exists as a semantic concept before it is turned into words. By training the model to reach that semantic point directly, we skip the hard work of learning "how to speak" during the main visual reasoning phase.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Model Architecture
The following figure (Figure 1 from the original paper) shows the system architecture:
This figure is a schematic of the VL-JEPA model architecture. The visual input is encoded by the X-Encoder; the resulting representation, together with the text query, is fed into the Predictor, which outputs a representation of the text target. The model also contains a Y-Encoder and a Y-Decoder for producing the corresponding text output. The loss function is denoted L.
The model consists of four distinct modules (a minimal sketch of the forward pass follows this list):

- X-Encoder: Maps the visual input (images or video frames) into a sequence of visual embeddings. The authors use a frozen V-JEPA 2 ViT-L (304M parameters).
- Predictor: The "brain" of the model. It takes the visual embeddings and a textual query (the prompt/question) and outputs a predicted target embedding. It uses 8 Transformer layers from Llama-3.2-1B.
- Y-Encoder: Maps the ground-truth text into the target embedding, which is used during training as the signal for the Predictor. The authors use EmbeddingGemma-300M.
- Y-Decoder: A lightweight module used only at inference time to turn the predicted embedding back into human-readable text.
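As a rough sketch of how the four modules fit together at training time, the following PyTorch snippet uses tiny stand-in layers in place of the real backbones (V-JEPA 2 ViT-L, the Llama-3.2-1B layers, EmbeddingGemma-300M). Every name, dimension, and the way the query is embedded here are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

D = 256                                                    # embedding width (placeholder)
x_encoder = nn.Linear(3 * 16 * 16, D)                      # stand-in for the frozen X-Encoder
predictor = nn.TransformerEncoder(                         # stand-in for the 8-layer Predictor
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True), num_layers=2)
y_encoder = nn.EmbeddingBag(1000, D)                       # stand-in for the Y-Encoder
# A Y-Decoder would map D-dim embeddings back to tokens; it is omitted here.

def vl_jepa_forward(video_patches, query_ids, target_ids=None):
    """Predict a target-text embedding from visual patches plus a textual query."""
    with torch.no_grad():                                  # the X-Encoder stays frozen
        visual_tokens = x_encoder(video_patches)           # (B, N, D)
    query_emb = y_encoder(query_ids).unsqueeze(1)          # (B, 1, D), simplified query embedding
    fused = torch.cat([visual_tokens, query_emb], dim=1)   # visual tokens + query token
    y_pred = predictor(fused)[:, -1]                       # read the prediction at the query position
    if target_ids is None:                                 # inference: decode later only if needed
        return y_pred
    y_target = y_encoder(target_ids)                       # (B, D) ground-truth text embedding
    return y_pred, y_target                                # both go into the embedding-space loss

# Dummy batch: 2 clips of 4 patch tokens; queries/targets as token-id tensors.
video = torch.randn(2, 4, 3 * 16 * 16)
query = torch.randint(0, 1000, (2, 5))
target = torch.randint(0, 1000, (2, 7))
pred, tgt = vl_jepa_forward(video, query, target)
print(pred.shape, tgt.shape)                               # torch.Size([2, 256]) twice
```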
4.2.2. Training Objective and Formulas
The training is performed in two stages. Unlike generative models that use Cross-Entropy loss on tokens, VL-JEPA uses an embedding-based loss.
Stage 1: Pretraining. The model is trained on massive captioning data to align vision and text. The authors use a bi-directional InfoNCE (Information Noise Contrastive Estimation) loss.
The InfoNCE loss is used to maximize the similarity between the predicted embedding $\hat{y}$ and the actual target embedding $y$, while minimizing similarity with "negative" samples (other texts in the batch). The simplified alignment term is:

$$\mathcal{L}_{\text{align}} = D\left(\hat{y},\, y\right)$$

Where:
- $\hat{y}$ is the predicted embedding from the Predictor.
- $y$ is the actual target embedding from the Y-Encoder.
- $D(\cdot, \cdot)$ is a distance metric (usually based on cosine similarity).
To avoid "representation collapse" (where the model predicts the same constant vector for everything), the
InfoNCEloss includes a uniformity regularization term that pushes different embeddings in a batch apart. The total loss is calculated as:
Where:
- is the cosine similarity between vectors and .
- is a temperature hyper-parameter that controls the sharpness of the distribution.
- is the batch size.
- and are indices for samples in the batch.
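For concreteness, here is a minimal PyTorch sketch of a symmetric (bi-directional) InfoNCE loss over a batch of predicted and target embeddings, following the formula above. The function name and the default temperature are illustrative choices, not values from the paper.

```python
import torch
import torch.nn.functional as F

def bidirectional_infonce(y_pred, y_target, tau=0.07):
    """Symmetric InfoNCE over a batch.

    y_pred   : (B, D) embeddings predicted by the Predictor
    y_target : (B, D) embeddings produced by the Y-Encoder
    tau      : temperature controlling the sharpness of the softmax
    """
    y_pred = F.normalize(y_pred, dim=-1)        # cosine similarity = dot product of unit vectors
    y_target = F.normalize(y_target, dim=-1)
    logits = y_pred @ y_target.t() / tau        # (B, B) pairwise similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    # Row direction: each prediction must pick its own target among all targets in the batch.
    loss_p2t = F.cross_entropy(logits, labels)
    # Column direction: each target must pick its own prediction.
    loss_t2p = F.cross_entropy(logits.t(), labels)
    return 0.5 * (loss_p2t + loss_t2p)

# Toy usage with random embeddings.
B, D = 8, 256
loss = bidirectional_infonce(torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```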
4.2.3. Multi-tasking Logic
The following figure (Figure 2 from the paper) illustrates how the single architecture handles multiple tasks:
This figure is a schematic of the VL-JEPA architecture, showing how the model performs tasks by predicting target embeddings rather than reconstructing the raw targets, as in retrieval. It also illustrates vision-to-text generation, discriminative VQA and classification, text-to-visual retrieval, and the selective decoding process.
- Generation (VQA/Captioning): Predict the target embedding → decode it to text.
- Classification: Predict the target embedding → compare it to candidate label embeddings → pick the closest (a minimal sketch follows this list).
- Retrieval: Use a query to predict an embedding → match it against a database of video embeddings.
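The discriminative paths reduce to nearest-neighbor search in the shared embedding space. Below is a minimal sketch, with random tensors standing in for the predicted embedding, the Y-Encoder label embeddings, and a video-embedding database.

```python
import torch
import torch.nn.functional as F

def rank_by_cosine(query_emb, candidate_embs):
    """Return candidate indices sorted from most to least similar."""
    query = F.normalize(query_emb, dim=-1)           # (D,)
    cands = F.normalize(candidate_embs, dim=-1)      # (K, D)
    scores = cands @ query                           # (K,) cosine similarities
    return torch.argsort(scores, descending=True)

D = 256
y_pred = torch.randn(D)            # embedding predicted by VL-JEPA for this clip/query
label_embs = torch.randn(10, D)    # Y-Encoder embeddings of 10 candidate class names
video_embs = torch.randn(1000, D)  # database of precomputed video-side embeddings

predicted_class = rank_by_cosine(y_pred, label_embs)[0]   # classification: closest label
top5_videos = rank_by_cosine(y_pred, video_embs)[:5]      # retrieval: top-5 matches
print(predicted_class.item(), top5_videos.tolist())
```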
4.2.4. Selective Decoding
For long videos, VL-JEPA monitors the "stream" of predicted embeddings over time. Because it's non-autoregressive, this stream is computationally cheap to produce.
The model calculates the variance of the embeddings in a sliding window. If the variance is low, the scene is semantically stable, and no new decoding is needed. If a "semantic shift" (high variance) is detected, the Y-Decoder is triggered.
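A hedged sketch of what such a variance-based trigger could look like in code follows; the window size, the threshold, and the use of mean per-dimension variance are assumptions made for illustration, not the paper's exact rule.

```python
import torch

def should_decode(embedding_window, threshold=0.05):
    """Decide whether to invoke the Y-Decoder for the current timestep.

    embedding_window : (W, D) most recent predicted embeddings (a sliding window)
    threshold        : variance level treated as a "semantic shift" (assumed value)
    """
    if embedding_window.size(0) < 2:
        return True                                  # always decode the first prediction
    variance = embedding_window.var(dim=0).mean()    # average per-dimension variance
    return bool(variance > threshold)

# Simulated stream: a stable scene followed by an abrupt change.
D, W = 256, 8
stable = torch.randn(1, D).repeat(20, 1) + 0.01 * torch.randn(20, D)
shifted = torch.randn(1, D).repeat(10, 1) + 0.01 * torch.randn(10, D)
stream = torch.cat([stable, shifted], dim=0)

for t in range(stream.size(0)):
    window = stream[max(0, t - W + 1): t + 1]
    if should_decode(window):
        print(f"t={t}: semantic shift detected, run Y-Decoder")
```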
5. Experimental Setup
5.1. Datasets
The authors used a massive collection of data across two stages:
- Pretraining:
  - Images: Datacomp, YFCC-100M, PLM-Image-Auto (approx. 2B samples seen).
  - Videos: Ego4D, HowTo100M (Action100M captions).
- SFT (Supervised Finetuning): A mixture of 25M VQA samples, 2.8M caption samples, and 1.8M classification samples.
- Evaluation Benchmarks:
  - Classification: Kinetics-400, Something-Something v2 (SSv2), Epic-Kitchens 100 (EK-100).
  - VQA: GQA (reasoning), TallyQA (counting), POPE (hallucination).
5.2. Evaluation Metrics
5.2.1. Recall@1 (R@1)
- Conceptual Definition: Used in retrieval tasks. It measures if the correct item is the very first result returned by the model.
- Mathematical Formula:
$$\text{R@1} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left[\text{rank}_i = 1\right]$$
- Symbol Explanation: $N$ is the total number of queries; $\mathbb{1}[\cdot]$ is the indicator function (1 if true, 0 if false); $\text{rank}_i$ is the rank of the ground-truth item for query $i$.
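A minimal implementation of R@1 from a query-to-gallery similarity matrix, assuming the ground-truth match for query $i$ is gallery item $i$ (the usual convention for paired retrieval benchmarks):

```python
import torch

def recall_at_1(similarity):
    """similarity: (N, N) matrix where entry (i, j) scores query i against gallery item j,
    and the ground-truth match for query i is gallery item i."""
    top1 = similarity.argmax(dim=1)                      # best-scoring gallery item per query
    correct = (top1 == torch.arange(similarity.size(0)))
    return correct.float().mean().item()

sims = torch.randn(100, 100) + 5.0 * torch.eye(100)      # toy scores with a strong diagonal
print(f"R@1 = {recall_at_1(sims):.2f}")
```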
5.2.2. Accuracy (Top-1)
- Conceptual Definition: The percentage of times the model's highest-probability prediction matches the ground-truth label.
- Mathematical Formula:
$$\text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_i = y_i\right]$$
- Symbol Explanation: $N$ is the total number of samples; $\hat{y}_i$ is the predicted label and $y_i$ the ground-truth label for sample $i$.
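The same metric in code, as a trivial sketch over class logits and ground-truth labels:

```python
import torch

def top1_accuracy(logits, labels):
    """logits: (N, C) class scores; labels: (N,) ground-truth class indices."""
    return (logits.argmax(dim=1) == labels).float().mean().item()

print(top1_accuracy(torch.randn(50, 10), torch.randint(0, 10, (50,))))
```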
5.2.3. CIDEr (Consensus-based Image Description Evaluation)
- Conceptual Definition: Measures how similar a generated caption is to a set of human reference captions using TF-IDF weighted n-gram overlap.
- Mathematical Formula (Standardized):
$$\text{CIDEr}_n(c_i, S_i) = \frac{1}{M}\sum_{j=1}^{M} \frac{g^n(c_i)\cdot g^n(s_{ij})}{\left\lVert g^n(c_i)\right\rVert \left\lVert g^n(s_{ij})\right\rVert}$$
- Symbol Explanation: $c_i$ is the generated caption; $S_i = \{s_{i1}, \ldots, s_{iM}\}$ is the set of reference captions; $g^n(\cdot)$ is a vector of TF-IDF weights for n-grams of length $n$; $M$ is the number of references. The final score averages $\text{CIDEr}_n$ over n-gram lengths $n = 1$ to $4$.
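Assuming the TF-IDF n-gram vectors $g^n(\cdot)$ have already been computed, the per-$n$ term above is just an average cosine similarity over the references, as in this sketch (the full metric additionally averages over $n = 1$ to $4$):

```python
import torch

def cider_n(g_candidate, g_references, eps=1e-8):
    """g_candidate: (V,) TF-IDF vector of the candidate caption's n-grams;
    g_references: (M, V) TF-IDF vectors of the M reference captions."""
    c = g_candidate / (g_candidate.norm() + eps)
    r = g_references / (g_references.norm(dim=1, keepdim=True) + eps)
    return (r @ c).mean().item()     # average cosine similarity over the references

print(cider_n(torch.rand(500), torch.rand(3, 500)))
```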
5.3. Baselines
- CLIP / SigLIP2: Standard joint-embedding models (encoders only).
- Perception Encoder (PE-Core): A recent state-of-the-art visual embedding model.
- Generative VLMs: InstructBLIP, Qwen-VL, LLaVA-1.5, SmolVLM.
6. Results & Analysis
6.1. Core Results Analysis
- Efficiency: VL-JEPA achieved higher accuracy than VLM baselines while having 50% fewer parameters and seeing significantly less data (2B samples vs. 86B for some baselines).
- Video Mastery: The model performed exceptionally well on motion-centric datasets like SSv2 and EgoExo4D, likely due to the strengths of the V-JEPA vision backbone.
- Selective Decoding: The model reduced decoding operations by 2.85x while maintaining the same CIDEr score.
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper, showing classification and retrieval performance:
| Model | # Params | # Samples Seen | K400 (Cls) | SSv2 (Cls) | EK-100 (Cls) | Avg (8 Cls) | MSR-VTT (Ret) | ActivityNet (Ret) | EK-100 (Ret) | Avg (8 Ret) |
|---|---|---|---|---|---|---|---|---|---|---|
| CLIP ViT-L | 389M | 12.8B | 58.3 | 2.4 | 14.7 | 35.3 | 30.7 | 23.4 | 7.9 | 35.9 |
| SigLIP2 ViT-g | 1.9B | 40B | 68.0 | 6.1 | 26.0 | 47.5 | 38.9 | 33.9 | 22.2 | 43.4 |
| PE-Core ViT-G | 2.3B | 86B | 76.4 | 13.0 | 29.0 | 58.1 | 44.5 | 49.1 | 26.0 | 51.6 |
| VL-JEPA Base | 1.6B | 2.0B | 57.8 | 21.1 | 39.8 | 46.4 | 49.2 | 55.4 | 78.2 | 58.4 |
| VL-JEPA SFT | 1.6B | 2.5B | 81.4 | 59.5 | 60.3 | 70.7 | 49.1 | 53.8 | 81.1 | 59.5 |

Note: "Cls" columns report zero-shot classification and "Ret" columns report zero-shot retrieval; "Avg (8)" refers to the average over 8 datasets in each category.
The following are VQA results from Section 4.2, comparing VL-JEPA SFT to standard VLMs:
| Model | GQA (Acc) | TallyQA (Acc) | POPE (Acc) |
|---|---|---|---|
| InstructBLIP (Vicuna-13B) | 49.5 | 68.0 | 79.0 |
| Qwen-VL-7B | 59.3 | - | - |
| LLaVA-1.5-7B | 62.0 | 71.9 | 85.9 |
| VL-JEPA SFT (1.6B) | 60.8 | 67.4 | 84.2 |
6.3. Ablation Studies / Parameter Analysis
The authors tested several configurations (Table 5 in the paper):
- Pretraining: Removing the caption-based pretraining stage dropped classification accuracy from 49.0 to 27.3.
- Loss Function: InfoNCE was superior to a simple cosine or L2 loss for classification and retrieval.
- Predictor Size: Increasing the number of layers in the Predictor significantly improved VQA performance.

The following figure (Figure 3) shows the sample efficiency comparison:
This figure is a chart comparing VL-JEPA with a conventional vision-language model (VLM) on zero-shot video captioning and video classification. The left chart shows VL-JEPA surpassing the VLM in CIDEr score, while the right chart shows that VL-JEPA has fewer trainable parameters and a lower average inference-time cost; its average text-decoding time is 203 ms.
As seen in Figure 3, VL-JEPA reaches higher performance much faster (with fewer samples) than the VLM baseline.
7. Conclusion & Reflections
7.1. Conclusion Summary
VL-JEPA successfully demonstrates that the Joint Embedding Predictive Architecture is a viable and highly efficient alternative to autoregressive generative models for vision-language tasks. By predicting semantic embeddings rather than tokens, the model achieves better performance with fewer parameters, excels at video-based tasks, and supports real-time streaming via selective decoding. It bridges the gap between discriminative models (like CLIP) and generative ones.
7.2. Limitations & Future Work
- Reasoning and Agents: The authors acknowledge they have not yet tested VL-JEPA on complex multi-step reasoning, tool use, or agentic behaviors where large VLMs currently excel.
- Scaling: While 1.6B parameters is efficient, they have not yet explored the limits of scaling this architecture to 70B+ parameters.
- Latent Reasoning: Future work could involve "reasoning in latent space"—performing logical steps on embeddings before ever decoding them into text.
7.3. Personal Insights & Critique
This paper is a significant shift in thinking. The move from "generating text" to "predicting meaning" aligns with theories of how biological brains might work (not predicting every word, but the concept of what to say).
- Inspiration: The selective decoding mechanism is brilliant for robotics and wearable tech (like smart glasses), where battery life and latency are the biggest constraints.
- Potential Issue: The model relies heavily on the quality of the Y-Encoder. If the text embedding model (Gemma) has biases or lacks specific domain knowledge, VL-JEPA will inherit those weaknesses without the ability to easily "re-learn" the language space.
- Assumption: The paper assumes that InfoNCE is the best loss; however, other JEPA-specific losses like VICReg (which do not require negative samples) might offer even more stability in future iterations.