Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
TL;DR Summary
This paper investigates how fine-tuning Vision-Language-Action (VLA) models degrades their visual representations. It shows that naive action fine-tuning erodes the visual knowledge inherited from the pretrained VLM, hurting performance in out-of-distribution (OOD) scenarios. A visual representation alignment method is introduced to mitigate this degradation and improve OOD generalization.
Abstract
The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe the VLA's hidden representations and analyze attention maps; further, we design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action fine-tuning. We further evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the degradation of VL representations and highlights practical approaches to recover inherited VL capabilities. Code is publicly available: https://blind-vla-paper.github.io
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
1.2. Authors
- Nikita Kachaev: Cognitive AI Lab, Moscow, Russia.
- Mikhail Kolosov: IAI MIPT, Moscow, Russia.
- Daniil Zelezetsky: IAI MIPT, Moscow, Russia.
- Alexey K. Kovalev: Cognitive AI Lab, IAI MIPT, Moscow, Russia.
- Aleksandr I. Panov: Cognitive AI Lab, IAI MIPT, Moscow, Russia.
1.3. Journal/Conference
- Published (UTC): 2025-10-29, as an arXiv preprint.
- Source: arXiv (Repository of electronic preprints).
- Reputation: arXiv is the primary repository for pre-publication research in computer science and AI. While not peer-reviewed in the traditional journal sense, high-impact papers often appear here first. The "Simpler" benchmark and "OpenVLA" context suggest this is cutting-edge work in the Embodied AI community.
1.4. Publication Year
2025
1.5. Abstract
The paper investigates a critical issue in Vision-Language-Action (VLA) models: when large pretrained Vision-Language Models (VLMs) are fine-tuned for robotic actions, they often lose their original rich visual and semantic understanding. This phenomenon leads to poor performance when the robot faces new situations (Out-Of-Distribution or OOD). The authors analyze this degradation through attention maps and latent representation probing. To solve it, they introduce a method called Visual Representation Alignment, which forces the VLA model to maintain a connection to a "teacher" vision model during training, significantly improving generalization.
1.6. Original Source Link
- Project page (code): https://blind-vla-paper.github.io
- Status: arXiv preprint.
2. Executive Summary
2.1. Background & Motivation
- Core Problem: The field of robotics is moving towards VLA (Vision-Language-Action) models. These are built by taking massive, smart VLMs (Vision-Language Models)—like GPT-4V or LLaVA, which understand images and text well—and fine-tuning them to output robot control commands. The hope is that the robot inherits the VLM's "world knowledge." However, the authors identify a critical failure: Representation Collapse. When the model is trained on specific robotic tasks (which usually have limited, repetitive data), it "forgets" the general visual understanding it originally had.
- Importance: If a robot forgets general visual concepts (e.g., what a "stop sign" means or the difference between "odd" and "even" numbers) during training, it becomes "blind" to anything slightly different from its training data. This makes the robot useless in the real world, where environments constantly change (Out-Of-Distribution, or OOD, scenarios).
- Innovation: The paper shifts focus from "how to train better actions" to "how to stop the model from becoming stupid visually." They propose measuring this "blindness" and fixing it by mathematically tethering the robot's brain to a frozen, smart visual encoder (a "Teacher") during training.
2.2. Main Contributions / Findings
- Diagnostic Tools (VL-Think): They created a new test suite called VL-Think designed to measure only the visual-language understanding of a robot, separate from its physical dexterity.
- Evidence of Degradation: They systematically showed that standard fine-tuning (SFT) causes "Attention Sink" (the model stops looking at the relevant object) and "Representation Collapse" (distinct concepts like "cup" and "bottle" blur together in the model's mind).
- Method (Visual Alignment): They proposed a simple, effective fix: adding an auxiliary loss function that forces the VLA's internal representations to match those of a robust, pre-trained vision teacher (like SigLIP or RADIO).
- Results: Their method improved success rates in OOD scenarios by up to 10% compared to standard training, restoring the semantic understanding that was otherwise lost.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, one must grasp several key concepts in modern AI:
- Vision-Language Models (VLMs): These are AI models trained on massive amounts of internet data (images + text). They can look at an image and answer questions about it. Example: "Is this a cat?" -> "Yes." They build rich internal "representations" of the world.
- Vision-Language-Action (VLA) Models: A VLM adapted for robotics. Instead of outputting text like "Yes," it outputs robot actions, such as low-level control commands (e.g., end-effector displacements and a gripper command). The input is usually an image of the workspace and a text instruction like "Pick up the apple."
- Fine-Tuning (SFT): The process of taking a pre-trained model (that knows general things) and training it further on a specific dataset (like robot demonstrations) to specialize it.
- Representation Collapse: Imagine you learn to distinguish 100 types of dogs. Then, you get a job that only requires you to say "animal." After years, you might forget how to tell a Poodle from a Husky. This is collapse—the rich details are lost because the new task doesn't strictly demand them.
- Out-of-Distribution (OOD): Testing the model on data that looks different from what it studied. If it trained on white plates but fails on blue plates, it has poor OOD generalization.
- LoRA (Low-Rank Adaptation): A technique to fine-tune huge models efficiently. Instead of updating all billions of parameters, it updates a small set of extra low-rank parameters, keeping the original model largely frozen (a minimal sketch of the idea follows this list).
- Attention Mechanism: The part of a Transformer model that decides which parts of the input (image pixels or text words) are important. An "Attention Map" visualizes this focus.
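To make the LoRA idea concrete, here is a minimal PyTorch sketch of a low-rank adapter wrapped around a frozen linear layer; the class name `LoRALinear`, the rank, and the scaling are illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # original weights stay frozen
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the low-rank correction; only lora_A and lora_B receive gradients.
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

layer = LoRALinear(nn.Linear(4096, 4096))
out = layer(torch.randn(2, 10, 4096))         # drop-in replacement for the original layer
```

Because only the small `lora_A`/`lora_B` matrices are trained, fine-tuning is cheap; yet the model's effective behavior can still drift, which is exactly the degradation the paper studies.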
3.2. Previous Works
The authors build upon and contrast with several key areas:
- RT-1 & RT-2: Early VLA models that showed scaling VLMs helps robotics.
- OpenVLA: An open-source VLA model based on Llama (language) and SigLIP (vision). This paper uses OpenVLA as its primary testbed.
- Platonic Representation Hypothesis: A theory suggesting that as AI models get bigger and better, their internal view of the world converges to a "shared reality," regardless of whether they are trained on text or images. The authors use this to justify why aligning a robot to a "general" vision teacher is a good idea.
3.3. Technological Evolution
- Standard VLMs: Strong at static images (e.g., CLIP, Prismatic).
- Robotic Pretraining: Using VLMs to initialize robot policies (RT-1, RT-2).
- The Problem: Researchers noticed these robots were brittle.
- Current Paper: Addresses the brittleness by explicitly preserving the "VLM-ness" (visual understanding) inside the "VLA" (robot) during the adaptation phase.
3.4. Differentiation Analysis
- Vs. Freezing: Some prior works simply freeze the vision part of the model to prevent forgetting. The authors show this fails because the frozen vision part creates a disconnect with the action part.
- Vs. Co-training: Others mix robot data with internet data during training to keep general knowledge. This is computationally expensive.
- This Paper: Allows the model to change (via LoRA) but adds a "guide rail" (the alignment loss) to keep it from drifting too far from general understanding. It is lightweight and effective.
4. Methodology
4.1. Principles
The core principle is Visual Representation Alignment. The authors hypothesize that a "Generalist Vision Teacher" (a model trained solely to understand images, like a strong CLIP model) holds a "Platonic" ideal of visual features—it knows exactly what a "red circle" or a "stop sign" looks like. When a VLA trains on limited robot data, its internal features become distorted to fit only that dataset (overfitting). The method forces the VLA's internal features to remain mathematically similar to the Teacher's ideal features, preserving the "Platonic" truth while learning the motor skills.
4.2. Core Methodology In-depth (Layer by Layer)
Step 1: Architecture Setup
The system involves three main components:
- The VLA Model (Student): This is the robot policy being trained. It has an Image Encoder ($E_{\text{img}}$), a Text Encoder ($E_{\text{txt}}$), and a Transformer Backbone ($f_{\theta}$).
- The Teacher Model ($f_{T}$): A frozen, pre-trained vision model (e.g., C-RADIOv3) that is not updated. It acts as the reference standard.
- The Projector ($\phi$): A small neural network that maps the VLA's internal features into the Teacher's space so they can be compared.
Step 2: Input Processing
The input consists of an image $I$ and a text instruction $l$. The VLA processes these into a sequence of tokens $x = (x_1, \dots, x_N)$, where the first $N_v$ tokens are visual and the rest are text:
- $N_v$: Number of image patches (visual tokens).
- $N$: Total number of tokens.
- $d$: The embedding dimension (size of the vector representing a token).
These tokens pass through the VLA's backbone $f_{\theta}$. At a specific internal layer (the authors found middle layers to be best), the hidden states (features) for the image tokens are extracted. Let's call these the student features $H_s \in \mathbb{R}^{N_v \times d}$.
Step 3: Teacher Feature Extraction
Simultaneously, the same image is fed into the frozen Teacher model $f_{T}$, which produces its own set of high-quality teacher features $H_t \in \mathbb{R}^{N_v \times d_t}$:
- $d_t$: The dimension of the teacher's embeddings (which may differ from the VLA's $d$).
- Note: The methodology assumes the number of patches matches, or is aligned, between student and teacher (a minimal extraction sketch follows this step).
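The following is a minimal PyTorch sketch of Steps 2-3: capturing the hidden states of one intermediate backbone layer with a forward hook and slicing out the visual-token positions, plus a frozen teacher pass. The function names, the assumption that a block returns a tensor (or a tuple whose first element is the hidden state), and the `teacher(image)` interface are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn as nn

def extract_student_features(backbone: nn.Module, layer: nn.Module,
                             tokens: torch.Tensor, num_visual_tokens: int) -> torch.Tensor:
    """Run the VLA backbone, capture one intermediate layer's output via a hook,
    and keep only the visual-token positions -> H_s of shape (B, N_v, d)."""
    captured = {}

    def hook(_module, _inputs, output):
        # Transformer blocks often return tuples; keep the hidden-state tensor.
        captured["h"] = output[0] if isinstance(output, tuple) else output

    handle = layer.register_forward_hook(hook)
    try:
        backbone(tokens)
    finally:
        handle.remove()
    return captured["h"][:, :num_visual_tokens, :]   # visual tokens come first in the sequence

@torch.no_grad()
def extract_teacher_features(teacher: nn.Module, image: torch.Tensor) -> torch.Tensor:
    """Frozen teacher forward pass, assumed to return per-patch features of shape (B, N_v, d_t)."""
    return teacher(image)
```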
Step 4: Projection and Alignment
Since the VLA's dimension ($d$) and the Teacher's dimension ($d_t$) might differ, and their latent spaces are different, $H_s$ and $H_t$ cannot be compared directly. The student features are therefore passed through the Projector $\phi$ (often a simple MLP, i.e., a multi-layer perceptron) to map them into the teacher's space: $\hat{H}_s = \phi(H_s)$. Now $\hat{H}_s$ and $H_t$ are in the same mathematical space, and the Alignment Loss ($\mathcal{L}_{\text{align}}$) measures how different they are. The authors use negative cosine similarity (maximizing similarity):
$ \mathcal{L}_{\text{align}} = -\frac{1}{N_v} \sum_{i=1}^{N_v} \cos\!\big(\hat{h}_s^{(i)}, h_t^{(i)}\big) $
- $\hat{h}_s^{(i)}$: The projected feature of the $i$-th image patch from the VLA.
- $h_t^{(i)}$: The feature of the $i$-th image patch from the Teacher.
- $\cos(\cdot, \cdot)$: Cosine similarity function.
- The sum averages this similarity across all visual tokens, forcing the VLA to "think" about the image in the same way the Teacher does (see the sketch below).
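A minimal sketch of Step 4, assuming student/teacher patch features have already been extracted as above. The projector architecture (two-layer MLP with GELU) is an assumption; the paper's description only requires a small projector into the teacher space and a negative cosine similarity averaged over visual tokens.

```python
import torch.nn as nn
import torch.nn.functional as F

class Projector(nn.Module):
    """Maps student features (dim d) into the teacher's feature space (dim d_t)."""
    def __init__(self, d: int, d_t: int, hidden: int = 2048):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, hidden), nn.GELU(), nn.Linear(hidden, d_t))

    def forward(self, h):
        return self.mlp(h)

def alignment_loss(student_feats, teacher_feats, projector):
    """Negative cosine similarity, averaged over visual tokens.
    student_feats: (B, N_v, d); teacher_feats: (B, N_v, d_t)."""
    projected = projector(student_feats)
    cos = F.cosine_similarity(projected, teacher_feats, dim=-1)   # (B, N_v)
    return -cos.mean()
```

Note that, per the ablation in Section 6.4, the projector can be kept frozen (e.g., `for p in projector.parameters(): p.requires_grad_(False)`) so that the VLA itself, rather than the projector, has to adapt its representations toward the teacher.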
Step 5: Joint Optimization
The model is trained on two tasks simultaneously:
- Action Prediction: Predicting the next robot action token $a_t$ correctly. This is the standard autoregressive loss ($\mathcal{L}_{\text{action}}$):
  $ \mathcal{L}_{\text{action}} = -\sum_{t} m_t \log p_{\theta}(a_t \mid x, a_{<t}) $
  - $p_{\theta}$: The probability distribution predicted by the model.
  - $a_t$: The ground-truth action tokens.
  - $m_t$: A mask to select target (action) positions.
- Alignment: The loss $\mathcal{L}_{\text{align}}$ calculated in Step 4.
The Total Loss combines them with a weighting coefficient $\lambda$:
$ \mathcal{L} = \mathcal{L}_{\text{action}} + \lambda \, \mathcal{L}_{\text{align}} $
- $\lambda$: A hyperparameter controlling how much importance is given to alignment vs. action (a minimal training-step sketch follows below).
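A minimal sketch of one joint training step under the assumptions above; the interface `vla(tokens) -> (action_logits, visual_hidden_states)`, the batch keys, and the use of `ignore_index=-100` as the mask are hypothetical conveniences, while the structure (cross-entropy action loss plus a weighted negative-cosine alignment term against a frozen teacher) follows the description.

```python
import torch
import torch.nn.functional as F

def training_step(vla, projector, teacher, batch, lam: float):
    """One joint update: autoregressive action loss plus weighted visual alignment loss."""
    logits, hidden_visual = vla(batch["tokens"])               # assumed model interface
    # Next-token loss over action positions; non-action targets are masked with -100.
    action_loss = F.cross_entropy(
        logits.flatten(0, 1), batch["action_targets"].flatten(), ignore_index=-100
    )
    with torch.no_grad():                                      # the teacher stays frozen
        teacher_feats = teacher(batch["image"])
    projected = projector(hidden_visual)                       # map student features to teacher space
    align_loss = -F.cosine_similarity(projected, teacher_feats, dim=-1).mean()
    return action_loss + lam * align_loss
```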
The following figure (Figure 2 from the original paper) illustrates this process, showing the teacher path (top) and the student VLA path (bottom) converging in the loss function:
The figure is a schematic with three parts: (a) generating features from the vision teacher, (b) fine-tuning with the regularization (alignment) loss, and (c) the loss landscape, illustrating alignment and optimization across different tasks.
5. Experimental Setup
5.1. Datasets
- Base Environment: The experiments are based on the Simpler benchmark, specifically using a WidowX-250S robot arm simulation.
- VL-Think Task Suite: The authors created a custom dataset/benchmark called VL-Think.
- Goal: To isolate visual-language logic from dexterity.
- Setup: A "Pick and Place" task where the robot must pick up a carrot and place it on a specific "board."
- Variation: The "boards" have different textures representing concepts: Shapes, Colors, Traffic Signs, Laundry Symbols, Weather Icons, etc.
- Example Instruction: "Put the object on the yield sign" or "Put the object on the odd number."
- Data Volume: 1400 expert demonstration trajectories collected using a motion planner (MPLib).
- Why this dataset? Standard robot datasets focus on "picking up cans." They don't test if the robot knows what a "yield sign" is. This dataset specifically tests the transfer of semantic knowledge.
The following figure (Figure 3 from the original paper) shows examples of the VL-Think tasks (Traffic, Laundry, Parity, etc.):
The figure shows example instructions for the different task categories (Shapes, Colors, Laundry, Parity, Public Information, Traffic, Weather, Arrows), with the gripper placing a carrot on the corresponding icon.
5.2. Evaluation Metrics
- Success Rate (SR):
- Concept: Simply, did the robot complete the task?
- Definition: The percentage of episodes where the object was successfully placed on the correct target board.
- Formula: $ \text{SR} = \frac{\text{Number of Successful Episodes}}{\text{Total Episodes}} \times 100\% $
- Linear Probing Accuracy (ImageNet-100):
- Concept: A test to see how "clean" the internal visual features are. Can a simple linear classifier distinguish objects using only the features extracted from the VLA?
- Formula: Standard classification accuracy: $ \text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Samples}} $
- Why? If the VLA features are "collapsed," this accuracy will be low because the features for "dog" and "cat" will be mixed up (a minimal probing sketch follows this list).
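Both metrics are straightforward to compute; below is a minimal sketch assuming features have already been extracted from the VLA, using a scikit-learn linear classifier as a stand-in for the paper's exact probing protocol.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(train_feats: np.ndarray, train_labels: np.ndarray,
                          test_feats: np.ndarray, test_labels: np.ndarray) -> float:
    """Fit a linear classifier on frozen features and report top-1 accuracy."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)

def success_rate(successes: list[bool]) -> float:
    """Percentage of episodes in which the task was completed."""
    return 100.0 * sum(successes) / len(successes)
```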
5.3. Baselines
The paper compares three main approaches:
- Default (SFT): Standard naive fine-tuning. The model is updated to minimize action error only.
- Freeze: SFT, but the VLA's visual encoder weights are frozen (locked). The hypothesis is that freezing prevents forgetting.
- Align (Ours): SFT with the proposed Visual Representation Alignment loss.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results strongly support the authors' hypothesis. The Align method consistently outperforms the Default and Freeze methods, particularly in Out-Of-Distribution (OOD) settings.
The following are the results from Table 1 of the original paper. Note the breakdown by "Semantic" (understanding concepts), "Vision" (robustness to visual noise), and "Execution" (robustness to physical changes): the first five task columns (Carrot through Plate) are Semantic, the next five (VisionImg through Whole05) are Vision, and the last three (Position through PosChangeTo) are Execution. Values are success rates.

| Method | Carrot | Instruct | MultiCarrot | MultiPlate | Plate | VisionImg | Tex03 | Tex05 | Whole03 | Whole05 | Position | EEPose | PosChangeTo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Default | 0.49±0.02 | 0.74±0.02 | 0.28±0.02 | 0.43±0.02 | 0.73±0.02 | 0.81±0.01 | 0.67±0.01 | 0.55±0.03 | 0.71±0.02 | 0.56±0.01 | 0.43±0.02 | 0.34±0.01 | 0.23±0.01 |
| Freeze | 0.03±0.01 | 0.05±0.01 | 0.01±0.01 | 0.02±0.01 | 0.03±0.01 | 0.02±0.01 | 0.03±0.01 | 0.01±0.01 | 0.01±0.01 | 0.01±0.01 | 0.03±0.01 | 0.03±0.01 | 0.04±0.01 |
| Align (ours) | **0.61±0.01** | **0.83±0.03** | **0.35±0.02** | **0.49±0.02** | **0.75±0.01** | **0.86±0.02** | **0.70±0.02** | **0.67±0.02** | **0.79±0.02** | **0.60±0.02** | **0.58±0.02** | **0.38±0.02** | 0.20±0.03 |
Analysis:
- Freeze Failure: The "Freeze" baseline fails catastrophically (Success Rates near 0.03). This proves that you cannot simply lock the vision encoder; the robot must adapt its vision to the specific robot setup (camera angle, arm appearance), but it needs to do so without forgetting general concepts.
- Alignment Gains: "Align" beats "Default" by significant margins (e.g., +12% in "Carrot", +9% in "Instruct"). This confirms that forcing the model to respect the teacher's representations helps it generalize better to new instructions and visual variations.
6.2. Visualization Analysis
The authors provide qualitative evidence via attention maps.
- Before Alignment (Default SFT): The attention map is "diffuse" and "noisy." The model looks at random background pixels instead of the object mentioned in the text. This is "Attention Sink."
- After Alignment: The attention map is sharp and focused on the target object.
The following figure (Figure 4 from the original paper) visualizes this contrast. Note how "OpenVLA (Align)" focuses tightly on the object, unlike the default model.
The figure compares attention maps across layers. OpenVLA SFT shows diffuse, noisy patterns in the middle layers, indicating a loss of vision-language grounding, whereas the model trained with the proposed method (OpenVLA Align) keeps its attention focused on the relevant object.
Additionally, t-SNE plots (Figure 5) show the latent space.
- Observation: In OpenVLA (default), clusters for "cup," "bottle," and "knife" overlap significantly (Representation Collapse), while in the original VLM they are distinct. The Alignment method helps separate these clusters again (a minimal t-SNE sketch follows the figure below).
The figure shows t-SNE visualizations of token embeddings at different layers for Qwen 2.5-VL, PrismaticVLM, and OpenVLA. Qwen 2.5-VL and PrismaticVLM maintain good class separation across layers, while OpenVLA shows significant overlap between classes, indicating that action fine-tuning caused representation collapse.
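A minimal sketch of how such a t-SNE view could be reproduced from extracted token embeddings with scikit-learn; the perplexity and other settings are illustrative, not the paper's exact configuration.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features: np.ndarray, labels: np.ndarray, title: str) -> None:
    """features: (num_tokens, d) embeddings; labels: integer object class per token."""
    xy = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(features)
    plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=4, cmap="tab10")
    plt.title(title)
    plt.show()
```

Overlapping clusters in such a plot (as reported for OpenVLA) indicate that the features no longer separate object categories, i.e., representation collapse.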
6.3. Linear Probing & Domain Forgetting
The authors used linear probing on ImageNet-100 to quantify feature quality.
- C-RADIOv3 (Teacher): 87.31% accuracy (the gold standard).
- OpenVLA Pretrained: 79.88% (before robotic training).
- OpenVLA SFT (Default): 77.48% (performance drops after robotic training).
- OpenVLA Align (Ours): 82.13% (not only recovers the loss but improves upon the pretrained base).
This result is crucial: naive robotic training makes the model's vision worse, and alignment fixes this.
6.4. Ablation Studies
The authors tested various components of their method:
- Teacher Choice: Better teachers (C-RADIOv3) yield better students compared to weaker teachers (SigLIP, Theia). The "Platonic" ideal matters.
- Alignment Layer: Aligning the middle layers (backbone) is better than aligning early layers (encoder), because the middle layers are where vision and text fuse.
- Projector: Using a frozen MLP projector works best. If the projector is trainable, the model "cheats" by minimizing the loss via the projector weights rather than fixing the VLA's representations.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper conclusively demonstrates that standard fine-tuning of VLA models damages their visual-language capabilities, leading to "blindness" in Out-Of-Distribution scenarios. They identify "Representation Collapse" and "Attention Sink" as the culprits. Their proposed solution, Visual Representation Alignment, effectively mitigates this by anchoring the VLA to a strong, frozen vision teacher. This method is simple to implement (an auxiliary loss) and yields significant gains in semantic grounding and generalization robustness.
7.2. Limitations & Future Work
- Limitation - Data Scale: The authors hypothesize that their method's inability to fully recover all domains (like Traffic signs) in VL-Think is due to the small size of the fine-tuning dataset (1400 trajectories). The alignment regularizer can only do so much without sufficient data coverage.
- Limitation - Parameter Efficiency: They relied on LoRA (updating few parameters). They suggest that relaxing this constraint might allow for better recovery of concepts, though it would be more expensive.
- Future Work: Expanding the dataset breadth and applying this alignment strategy during the massive pre-training phase (not just fine-tuning) are suggested next steps.
7.3. Personal Insights & Critique
- Inspiration: This paper highlights a fundamental tension in Transfer Learning: Adaptation vs. Retention. We want the model to adapt to a new task (robotics) but retain old skills (vision). This is the "Catastrophic Forgetting" problem in disguise. The "Platonic Representation" solution—using a static teacher as a compass—is an elegant, computationally efficient way to navigate this trade-off.
- Applicability: This concept is not limited to robotics. It could apply to Medical AI (fine-tuning a general model on X-rays without losing general anatomy knowledge) or Legal AI.
- Critique: The reliance on a "frozen" projector is a clever but potentially restrictive trick. It assumes the Teacher and Student spaces can be linearly mapped. If the student drifts too far non-linearly, a frozen MLP might fail. Investigating non-linear or dynamic alignment techniques could be interesting. Furthermore, while the "Simpler" benchmark is good, validating this on real-world hardware is essential to prove the "Attention Sink" doesn't cause safety issues in physical interaction.