
Actions as Language: Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting

Published: 09/26/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper presents VLM2VLA, a method for fine-tuning vision-language models (VLMs) into vision-language-action models (VLAs) without catastrophic forgetting, by representing low-level robot actions in natural language, achieving zero-shot generalization in real experiments.

Abstract

Fine-tuning vision-language models (VLMs) on robot teleoperation data to create vision-language-action (VLA) models is a promising paradigm for training generalist policies, but it suffers from a fundamental tradeoff: learning to produce actions often diminishes the VLM's foundational reasoning and multimodal understanding, hindering generalization to novel scenarios, instruction following, and semantic understanding. We argue that this catastrophic forgetting is due to a distribution mismatch between the VLM's internet-scale pretraining corpus and the robotics fine-tuning data. Inspired by this observation, we introduce VLM2VLA: a VLA training paradigm that first resolves this mismatch at the data level by representing low-level actions with natural language. This alignment makes it possible to train VLAs solely with Low-Rank Adaptation (LoRA), thereby minimally modifying the VLM backbone and averting catastrophic forgetting. As a result, the VLM can be fine-tuned on robot teleoperation data without fundamentally altering the underlying architecture and without expensive co-training on internet-scale VLM datasets. Through extensive Visual Question Answering (VQA) studies and over 800 real-world robotics experiments, we demonstrate that VLM2VLA preserves the VLM's core capabilities, enabling zero-shot generalization to novel tasks that require open-world semantic reasoning and multilingual instruction following.


In-depth Reading


1. Bibliographic Information

1.1. Title

Actions as Language: Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting

Central topic: aligning robot action data with the pretraining distribution of vision-language models (VLMs) by representing low-level actions in natural language, enabling parameter-efficient fine-tuning into vision-language-action (VLA) models without eroding foundational multimodal reasoning.

1.2. Authors

  • Asher J. Hancock (Princeton University, Department of Mechanical and Aerospace Engineering)

  • Xindi Wu (Princeton University, Department of Computer Science)

  • Lihan Zha (Princeton University, Department of Mechanical and Aerospace Engineering)

  • Olga Russakovsky (Princeton University, Department of Computer Science)

  • Anirudha Majumdar (Princeton University, Department of Mechanical and Aerospace Engineering)

    The team spans robotics, computer vision, and embodied AI. Notably, Russakovsky is known for work in visual recognition and fairness, Majumdar for robust and generalist robotics, and the group has prior work on embodied policies and evaluation.

1.3. Journal/Conference

arXiv preprint. While arXiv is not peer-reviewed, it is a widely used venue for disseminating cutting-edge research prior to conference/journal publication.

1.4. Publication Year

2025 (Published at UTC: 2025-09-26T10:54:04.000Z)

1.5. Abstract

The paper tackles a key obstacle in converting VLMs into VLAs: catastrophic forgetting of foundational world knowledge when fine-tuned on robot teleoperation data. The authors argue that the root cause is distribution mismatch between internet-scale image-text pretraining and low-level robotics action data. Their paradigm, VLM2VLA, resolves this mismatch by expressing actions as natural language and uses Low-Rank Adaptation (LoRA) for fine-tuning—minimally modifying the VLM backbone. They claim this averts catastrophic forgetting without expensive co-training on massive internet corpora. Extensive evaluation (VQA benchmarks and 800+ real robot experiments) shows that VLM2VLA preserves core VLM capabilities and enables zero-shot generalization that requires open-world semantic reasoning and multilingual instruction following.

2. Executive Summary

2.1. Background & Motivation

  • Core problem: When fine-tuning VLMs for robotic control (turning them into VLAs), many models lose their pretraining-acquired capabilities (visual understanding, commonsense reasoning, multilingual comprehension)—a phenomenon known as catastrophic forgetting.
  • Why important: Foundation VLMs carry broad knowledge learned from internet-scale data. For robots to be truly generalist (open-world), retaining this knowledge is crucial. Forgetting undermines generalization, instruction following, and robustness.
  • Gaps in prior work:
    • Many VLAs modify architecture/tokenization or add separate action heads; these changes plus full-parameter fine-tuning drift the representation away from pretraining.
    • The popular countermeasure—co-training with large web datasets—helps but is expensive and sensitive to mixture ratios.
  • Key idea (entry point): Treat “actions as language.” Re-express low-level robot actions as natural language within the VLM’s existing vocabulary, aligning fine-tuning data to the VLM’s pretraining distribution. Then adapt using parameter-efficient LoRA only, minimizing perturbation to the backbone.

2.2. Main Contributions / Findings

  • Representing actions as language: The paper proposes translating low-level robot actions (end-effector motions) into text descriptions (e.g., “move down and slightly right”), including magnitude and direction, leveraging the VLM’s innate numeric and spatial priors.
  • Data relabeling and training pipeline: They introduce a scalable pipeline (using Gemini 2.5) to annotate teleoperation trajectories with a hierarchical language format: high-level subtasks, mid-level motion plans, and low-level action chunks—all in text.
  • LoRA-only VLA fine-tuning: With data aligned to the VLM’s representation, LoRA suffices to adapt for control while preserving world knowledge—no architecture changes, no token hacks, and no web-scale co-training.
  • Empirical validation:
    • VQA retention: The fine-tuned VLA retains a large fraction of the base model’s VQA performance (reported as retaining over 85% across challenging benchmarks).
    • Real-world manipulation: Across 800+ trials on a 6-DoF arm, the VLA shows competitive in-distribution performance and strong out-of-distribution generalization, including multilingual tasks and pop-culture semantic grounding (“Ash Ketchum”).
    • Ablation: A token-based action ablation (mapping digits to least likely tokens) shows LoRA helps, but linguistic representation of actions yields better downstream robotic generalization.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Vision-language model (VLM): A neural network (often a transformer) that processes images and text jointly, trained on large corpora of image-text pairs. It can answer questions about images, caption them, or follow instructions grounded in vision.

  • Vision-language-action (VLA) model: Extends a VLM to output actions for robots. Inputs are visual observations and language commands; outputs are sequences of motor actions or intermediate control signals.

  • Catastrophic forgetting: When a model adapts to new data (robot tasks), it degrades performance on earlier capabilities (internet-scale knowledge). In VLA, this manifests as poor VQA, weak semantic understanding, and limited instruction following post-fine-tuning.

  • Low-Rank Adaptation (LoRA): A parameter-efficient fine-tuning method that inserts learnable low-rank matrices into weight update paths, keeping the original backbone weights mostly fixed. Intuition: adapt the model via small low-rank shifts to avoid overwriting base knowledge (a minimal sketch appears after this list).

  • Robot teleoperation trajectories: Sequences of observations (e.g., RGB frames) and actions (e.g., end-effector displacements and gripper states) collected while a human controls the robot to perform tasks.

    Transformer and attention (baseline understanding):

  • Transformers rely on self-attention to let each token attend to all others. The standard scaled dot-product attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V $

    • $Q$: query matrix
    • $K$: key matrix
    • $V$: value matrix
    • $d_k$: key dimensionality (used to scale the dot products)

    Self-attention is central to VLMs, which encode text and vision tokens jointly and learn cross-modal relationships.
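To make the LoRA bullet above concrete, here is a minimal sketch (PyTorch, not from the paper) of the low-rank update $W' = W + \frac{\alpha}{r} BA$: the frozen base weight is perturbed only through two small trainable matrices.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = base(x) + (alpha / r) * x A^T B^T, with the base weight frozen."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # backbone weights stay fixed
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only the adapters train
```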

3.2. Previous Works

  • Token-based action representation: Methods discretize continuous actions and map them to tokens (often least-likely tokens in the vocabulary) so the model can autoregress actions [OpenVLA; FAST]. While simple and fast, assigning arbitrary tokens for numeric values can be far from pretraining distribution.
  • Separate action heads: Some VLAs add diffusion or flow-matching heads to predict continuous action vectors [e.g., π0, DexVLA]. New parameters can corrupt pretrained representations and require careful training (stop-gradients, staged training).
  • Co-training approaches: Mix robot data with large VQA/captioning datasets during VLA training (e.g., RT-2 family, MolmoAct, π0.5). Helps retention but is costly and sensitive to mixture ratios; not a fundamental fix for distribution shift.
  • Reasoning in VLAs: Chain-of-thought for robots (ECoT, CoT-VLA, Inner Monologue) generate plans/subgoals before low-level actions to improve compositionality and long-horizon performance.

3.3. Technological Evolution

  • Early monolithic VLAs fine-tuned VLMs with architectural modifications (token hacks, added heads).
  • Observed forgetting led to co-training and complex training schemes (knowledge insulation, staged expert training).
  • Recent trend: embody reasoning and hierarchical planning within a single VLM to improve robustness and generalization.
  • This paper continues the monolithic VLA trend but pushes a data-centric alignment: actions as language to keep fine-tuning within the base distribution and rely on LoRA for safe adaptation.

3.4. Differentiation Analysis

  • Core difference: Action representation. Instead of arbitrary tokens or new heads, VLM2VLA expresses both high-level plans and low-level movements in natural language, reusing the VLM’s vocabulary and numeric/string priors.
  • Training paradigm: LoRA-only fine-tuning with minimal backbone perturbation, avoiding co-training and architectural changes.
  • Empirical stance: Show that alignment at the data level is sufficient to prevent catastrophic forgetting and support robust OOD generalization.

4. Methodology

4.1. Principles

  • Hypothesis: Catastrophic forgetting stems from distribution mismatch between a VLM’s pretraining (image-text pairs) and VLA fine-tuning (image-action pairs). If robot actions are represented in the same modality as pretraining—natural language—the fine-tuning data better aligns with the model’s representation space.
  • Intuition: With aligned data, parameter-efficient LoRA can adapt specific subspaces for robotic control without overwriting broad knowledge embedded in backbone weights.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Hierarchical “Actions as Language” Representation

The model decomposes the main task into a sequence of $N$ steps (indexed by $i$), and at each step predicts:

  1. High-Level Subtask ($l_i$): Given observation $o_i$ and instruction $L$, describe the immediate sub-goal (e.g., “Move to carrot”).

  2. Mid-Level Motion Plan ($m_i$): Given $l_i$ and $o_i$, produce spatially informative directions with respect to the end-effector (e.g., “move primarily downward and slightly right”).

  3. Low-Level Action Chunk ($\bar{a}_i$): Conditioned on $o_i$, $l_i$, and $m_i$, generate a variable-length list of per-DoF commands as text (e.g., sequences of [dx, dy, dz, gripper]).

    The joint distribution targeted by the VLA factorizes as in the paper: $ p_{\theta}(\bar{a}_i, m_i, l_i \mid \bar{o}_i, L) = \underbrace{p_{\theta}(l_i \mid \bar{o}_i, L)}_{\mathrm{1)~Subtask}} \; \underbrace{p_{\theta}(m_i \mid l_i, \bar{o}_i)}_{\mathrm{2)~Motion}} \; \underbrace{p_{\theta}(\bar{a}_i \mid m_i, l_i, \bar{o}_i)}_{\mathrm{3)~Action}} $

  • $p_{\theta}$: model distribution with parameters $\theta$

  • $\bar{o}_i$: observation(s), typically RGB image tokens for step $i$

  • $L$: main language instruction

  • $l_i$: language subtask at step $i$

  • $m_i$: language motion plan at step $i$

  • $\bar{a}_i$: language-encoded action chunk at step $i$

    All outputs are text tokens drawn from the VLM’s existing vocabulary, ensuring representational alignment.
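For illustration only (the wording and formatting below are our own, not quoted from the paper's dataset), one training target at step $i$ might pair an image with text fields like the following:

```python
# Illustrative example of a single hierarchical training target (all fields are plain text).
step_target = {
    "instruction": "put the carrot on the plate",                                  # L
    "subtask": "Move to carrot",                                                   # l_i
    "motion_plan": "move primarily downward and slightly right, gripper open",     # m_i
    "action_chunk": "[[0.0, 0.02, -0.05, 1], [0.0, 0.01, -0.03, 1]]",              # a_i as text: [dx, dy, dz, gripper]
}
```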

As evidence for alignment, the paper shows (Figure 3) that a pretrained VLM (Gemma-3-12B-IT) assigns significantly higher log-probabilities to action outputs expressed in natural language than to outputs via “reserved least-likely tokens”—a common tokenization strategy in VLAs.

The following figure (Figure 3 from the original paper) shows the pre-fine-tuning action probability distribution under Gemma-3-12B-IT, highlighting higher log-probability for language-based actions:

Figure 3 Distribution of action probabilities under Gemma-3-12B-IT before fine-tuning on robot teleoperation data. The model assigns significantly higher log-probabilities to actions represented as language compared to those defined by explicit tokenization modifications, e.g., least likely token assignment.
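A rough way to reproduce the spirit of this comparison is to score both encodings under a pretrained causal LM. This is a sketch under stated assumptions: `gpt2` stands in for Gemma-3-12B-IT, the prompt and both action strings are invented, and token-boundary effects are ignored.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # stand-in backbone; the paper evaluates Gemma-3-12B-IT
tok = AutoTokenizer.from_pretrained(model_id)
lm = AutoModelForCausalLM.from_pretrained(model_id)
lm.eval()

def completion_logprob(prompt: str, completion: str) -> float:
    """Sum of log-probabilities the model assigns to `completion` given `prompt`."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)   # position t predicts token t+1
    targets = full_ids[:, 1:]
    start = prompt_len - 1                                   # first completion token
    picked = log_probs[0, start:].gather(-1, targets[0, start:, None]).squeeze(-1)
    return picked.sum().item()

prompt = "Describe the robot's next end-effector motion: "
language_action = "move down five centimeters and slightly to the right"
reserved_tokens = "<unk_991><unk_992><unk_993>"  # hypothetical stand-in for least-likely tokens
print(completion_logprob(prompt, language_action), completion_logprob(prompt, reserved_tokens))
```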

4.2.2. Inference Procedure (Closed-loop with Verifier)

  • Subtask generation: At rollout start, given $o_0$ and $L$, the model generates all $N$ subtasks (kept fixed through execution).

  • For each step $i$:

    • Motion planning: Generate $m_i$ conditioned on $l_i$ and the current observation $o_i$.
    • Action generation: Produce $\bar{a}_i$ conditioned on $o_i$, $l_i$, and $m_i$.
    • Execute the action chunk on the robot.
    • Verification: An external verifier $V$ inspects the pre/post images $(\bar{o}_i, \bar{o}_{i+1})$ and subtasks $(l_i, l_{i+1})$ to decide whether to proceed or retry.
  • Verifier: Implemented with Gemini 2.5 Pro; prompt enforces strict preconditions (e.g., a “move to” is only successful if perfectly aligned for a subsequent “grasp”).
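The loop above can be summarized in pseudocode; `vla`, `verifier`, and `robot` are hypothetical interfaces standing in for the fine-tuned model, the Gemini-based verifier, and the arm controller (none of these names come from the paper's code).

```python
def rollout(vla, verifier, robot, instruction, max_retries=3):
    obs = robot.observe()
    subtasks = vla.generate_subtasks(obs, instruction)                  # all N subtasks, fixed for the rollout
    for i, subtask in enumerate(subtasks):
        for _ in range(max_retries):
            obs = robot.observe()
            motion_plan = vla.generate_motion_plan(obs, subtask)            # m_i | l_i, o_i
            action_chunk = vla.generate_actions(obs, subtask, motion_plan)  # a_i | m_i, l_i, o_i
            robot.execute(action_chunk)
            next_obs = robot.observe()
            next_subtask = subtasks[i + 1] if i + 1 < len(subtasks) else None
            if verifier.is_complete(obs, next_obs, subtask, next_subtask):
                break                                                    # move on; otherwise retry this subtask
```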

    The following figure (Figure 2 from the original paper) illustrates the conceptual contrast between a nominal VLM, a typical VLA that overfits to low-level actions, and VLM2VLA which preserves reasoning (e.g., safety) while also producing motor commands:

    Figure 2 Traditional VLA training procedures often overfit to the robot training data, sacrificing their original reasoning capabilities for low-level action prediction (center). In contrast, VLM2VLA (right) preserves the world understanding of the nominal VLM (left), allowing the model to reason about potential safety risks instead of just motor commands.

4.2.3. Data Curation Pipeline (Relabeling Robot Trajectories)

  • Source data: Teleoperation trajectories $\mathcal{D}_{\mathrm{rob}} = \{\tau\}$, where $\tau = \{(o_t, a_t)\}_{t=0}^{T}$ and $L$ is the main instruction. No low-level state (e.g., joint angles) is used beyond end-effector relative motions.

  • Annotation: Use Gemini 2.5 to decompose each trajectory into sub-trajectories, annotating:

    • $o_i$: key observation (image)
    • $l_i$: subtask text
    • $m_i$: motion plan text
    • $\bar{a}_i$: action chunk text (lists of [dx, dy, dz, gripper])
  • Output dataset: $\mathcal{D}_{\mathrm{lan}} = \{\bar{\tau}\}$ with $\bar{\tau} = \{(\bar{o}_i, l_i, m_i, \bar{a}_i)\}_{i=0}^{N-1}$, now cast as image-text pairs suitable for standard supervised fine-tuning.

    The following figure (Figure 4 from the original paper) presents the relabeling pipeline from robot trajectories to natural language annotations:

    Figure 4 VLM2VLA's pipeline for annotating existing robot datasets $\mathcal{D}_{\mathrm{rob}}$ into $\mathcal{D}_{\mathrm{lan}}$ described via natural language. We use Gemini 2.5 [3] to decompose each trajectory into sub-trajectories, each with an associated subtask, motion plan, and action chunk.
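As a sketch of this relabeling step (the splitting-and-annotation interface `annotator` is hypothetical; the paper uses Gemini 2.5 but no such API is released with it), each trajectory becomes a list of image-text records:

```python
def relabel_trajectory(trajectory, instruction, annotator):
    """trajectory: list of (observation, action) pairs with action = [dx, dy, dz, gripper]."""
    records = []
    for seg in annotator.split_into_subtasks(trajectory, instruction):   # sub-trajectories
        records.append({
            "observation": seg["frames"][0],                             # key image o_i
            "subtask": seg["subtask"],                                   # l_i, e.g., "Grasp Lid Handle"
            "motion_plan": seg["motion_plan"],                           # m_i, e.g., "move down and slightly left"
            "action_chunk": str([list(a) for _, a in seg["steps"]]),     # a_i serialized as text
        })
    return records                                                       # one element of D_lan
```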

4.2.4. Training: LoRA-only Fine-Tuning on Gemma-3-12B-IT

  • Backbone: Gemma-3-12B-IT (instruction-tuned VLM).

  • Adaptation: Apply LoRA to all linear modules (q/k/v/o projections, MLP up/down/gate projections).

  • Objective: Cross-entropy over text tokens (subtasks, motion plans, action chunks).

  • Rationale: Keeping backbone weights largely intact while learning low-rank adapters prevents overwriting broad visual-semantic knowledge.
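A hedged sketch of such a configuration with Hugging Face PEFT follows; the module names use common Gemma/LLaMA-style conventions and the rank/alpha values reported in Appendix D, and a small LLaMA-architecture model stands in for the Gemma-3-12B-IT backbone, so this is not the authors' exact setup.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Stand-in backbone with LLaMA-style module names; the paper adapts Gemma-3-12B-IT (multimodal).
base_vlm = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",      # attention projections
                    "gate_proj", "up_proj", "down_proj"],        # MLP projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_vlm, lora_cfg)
model.print_trainable_parameters()   # only the low-rank adapters are updated; the backbone stays frozen
```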

    The following figure (Figure 1 from the original paper) summarizes the entire VLM2VLA approach (data alignment and LoRA-only training) to preserve foundational capabilities:

    Figure 1 We present VLM2VLA, a data pipeline and training methodology for fine-tuning VLMs into VLAs while preserving their foundational perceptual and reasoning capabilities. Our policy retains its pretraining knowledge, enabling strong VQA performance and superior generalization in real robotic manipulation tasks.

4.2.5. Prompting and Decoding Details

  • Multi-stage prompts:
    1. Subtask prediction: Ask for a list of high-level steps (example provided to start with “Move to …”).
    2. Motion planning: Provide concise spatial reasoning using dx, dy, dz conventions (signs and gripper states).
    3. Action generation: Output only a Python list of lists of [dx, dy, dz, gripper], concise and consistent with motion reasoning.
  • Decoding: Nucleus (top-p=0.95). Temperatures:
    • 0.1 for motion planning (precision-oriented)
    • 0.5 for action prediction
    • 0.5 for subtask decomposition in-distribution; 1.0 for OOD scenarios
  • Baselines (OpenVLA, ECoT) use greedy decoding and their recommended prompts.
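A hedged sketch of these decoding settings with Hugging Face `generate`; a text-only `gpt2` stands in for the multimodal policy and the prompt is invented, so only the sampling parameters mirror the values above.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in text-only model to demonstrate the decoding parameters; the paper's policy is multimodal.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("Motion plan for 'Move to carrot':", return_tensors="pt")

motion_ids = model.generate(**inputs, do_sample=True, top_p=0.95,
                            temperature=0.1,    # motion planning: precision-oriented
                            max_new_tokens=64, pad_token_id=tok.eos_token_id)
action_ids = model.generate(**inputs, do_sample=True, top_p=0.95,
                            temperature=0.5,    # low-level action prediction
                            max_new_tokens=128, pad_token_id=tok.eos_token_id)
print(tok.decode(motion_ids[0], skip_special_tokens=True))
```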

4.2.6. Action Post-processing and Auxiliary Signals

  • Action chunking thresholds: Re-aggregate tiny movements along a single dimension into commands with meaningful magnitudes (e.g., 2.5 cm per direction; 5 cm absolute threshold). This keeps the model from emitting negligible commands even when its motion plan is sensible (a minimal sketch follows this list).
  • Auxiliary supervision:
    • Subtask completion tuples $(\bar{o}_i, \bar{o}_j, l_i)$ to train a completion classifier (not used at runtime; left for future work).
    • Directional movement labels (left/right/up/down/none) as additional training signals.
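A minimal sketch of the re-aggregation idea mentioned in the first bullet; the thresholds follow the values quoted above, while the units and the exact merging rule are assumptions.

```python
def aggregate_chunk(actions, per_axis_cm=2.5, total_cm=5.0):
    """actions: list of [dx, dy, dz, gripper]; displacements assumed to be in centimeters."""
    merged, pending = [], [0.0, 0.0, 0.0]
    for dx, dy, dz, grip in actions:
        pending = [p + d for p, d in zip(pending, (dx, dy, dz))]
        # Emit a command once any single axis, or the summed magnitude, is large enough.
        if max(abs(v) for v in pending) >= per_axis_cm or sum(abs(v) for v in pending) >= total_cm:
            merged.append([round(v, 2) for v in pending] + [grip])
            pending = [0.0, 0.0, 0.0]
    return merged
```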

4.2.7. Ablation: Action Tokens vs Actions as Language

  • VLM2VLA-AT: Map digits 0–9 in the action data to Gemma-3's ten least likely tokens. Train identically, differing only in action representation.
  • Purpose: Test whether LoRA alone suffices for retention and whether linguistic representation improves downstream robotic generalization relative to least-likely token assignments.

5. Experimental Setup

5.1. Datasets

  • Robotics fine-tuning:

    • BridgeData v2 subset (robot manipulation on a WidowX 250S in a toy-kitchen setup). Tasks include pick-up and pick-and-place with objects such as carrot, eggplant, pan, and fish. Each trajectory includes RGB frames and end-effector relative motions; the main instruction $L$ is available for some trajectories.
    • Data relabeling executed with Gemini 2.5 to produce $\mathcal{D}_{\mathrm{lan}}$ (hierarchical language annotations).
  • VQA evaluation:

    • Benchmarks: MMMU, MMStar, MME, OCRBench, MMB (en/cn), TextVQA, DocVQA, InfoVQA, AI2D, ChartQA, RealWorldQA. These cover diverse multimodal reasoning: OCR, diagrams, charts, general knowledge, bilingual comprehension.
  • Concrete sample (from Appendix A.1): For “open the silver pot,” Gemini decomposes steps like “Move to Lid Handle,” “Grasp Lid Handle,” “Lift Lid,” etc., and outputs matching action sequences in [dx, dy, dz, gripper].

    Why chosen:

  • BridgeData v2 provides standardized real-world robot tasks to test manipulation and hierarchical reasoning.

  • The VQA suite probes whether foundational multimodal capabilities survive fine-tuning—a key claim of the paper.

5.2. Evaluation Metrics

Robotics:

  • Success rate: $ \mathrm{SuccessRate} = \frac{\text{# successful trials}}{\text{# total trials}} $

    • Numerator: count of trials meeting full task criteria (e.g., grasp + place on plate).
    • Denominator: total trials conducted (e.g., 30 per task).
  • Partial credit scoring (composite tasks):

    • For “Put the Eggplant in the Pan, Then Lift the Fish,” points awarded for: contact with eggplant, placing eggplant in pan, moving toward fish, contacting fish, lifting fish (up to 5).
    • For planning-capable policies (ECoT, VLM2VLA), additional credit for correct subtask decomposition and motion plan keywords (Appendix C.3).
  • Task decomposition correctness (Figure 6):

    • Keyword-based evaluation (case-insensitive), counting presence of correct objects/destinations (e.g., “eggplant/aubergine,” “pan,” “fish,” “carrot/gajar”). $ \mathrm{PlanHitRate} = \frac{\text{# plans containing required keywords}}{\text{# evaluated plans}} $
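A minimal sketch of such a keyword check; the alias sets below are illustrative, and the full rubric is in Appendix C.3.

```python
def plan_hit_rate(plans, required_keywords):
    """required_keywords: one tuple of acceptable aliases per required object/destination."""
    hits = sum(
        all(any(alias in plan.lower() for alias in aliases) for aliases in required_keywords)
        for plan in plans
    )
    return hits / len(plans)

keywords = [("eggplant", "aubergine"), ("pan",)]   # task object and destination
print(plan_hit_rate(["Move to the aubergine, then place it in the pan"], keywords))  # 1.0
```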

VQA:

  • Accuracy (typical for many VQA datasets): $ \mathrm{Accuracy} = \frac{\text{# correct answers}}{\text{# total questions}} $

    • “Correct” often defined by exact match or standardized scoring per benchmark (e.g., normalized answers in TextVQA).
  • OCRBench, AI2D, ChartQA, DocVQA, InfoVQA:

    • Typically use accuracy or exact match (EM): $ \mathrm{EM} = \frac{\text{# predictions exactly matching ground truth}}{\text{# total samples}} $

Symbol explanations:

  • All metrics use standard counts of correct predictions divided by totals; benchmark-specific normalization may apply but is not detailed in the paper. When unspecified, we adopt accuracy/EM conventions from benchmark documentation.

5.3. Baselines

  • OpenVLA (7B Prismatic backbone; fine-tuned on large Open-X-Embodiment): Token-based autoregressive action generation; strong in-distribution grasping due to scale.
  • ECoT (Embodied Chain-of-Thought, 7B Prismatic): Planning-oriented VLA trained on BridgeData v2; produces reasoning traces before actions.
  • MolmoAct (7B, co-trained VLA): Designed for action reasoning with multi-camera inputs.
  • π0.5 (3B, co-trained VLA): Open-world generalization focus; evaluated on single-image VQA by masking other views.
  • Ablation: VLM2VLA-AT (12B Gemma-3 backbone; actions via least-likely tokens).

6. Results & Analysis

6.1. Core Results Analysis

VQA retention (Table 1):

  • OpenVLA and ECoT show severe degradation vs. their Prismatic VLM, indicating catastrophic forgetting.

  • VLM2VLA (12B Gemma-3) preserves strong performance across VQA tasks—only minor losses relative to Gemma-3-12B-IT—supporting the claim that LoRA-only fine-tuning on language-aligned data averts forgetting.

  • Compared to co-trained VLAs (MolmoAct, π0.5), VLM2VLA is competitive or superior across reported VQA benchmarks, despite no web-scale co-training.

    Robotic manipulation (Figure 5):

  • In-distribution tasks (Pick-up, Pick-and-Place): OpenVLA leads (likely due to extensive dataset coverage); VLM2VLA is competitive, answering Q2 that “actions as language” supports competent control.

  • Borderline compositional task (“Eggplant in Pan, then Lift Fish”): VLM2VLA’s hierarchical reasoning and execution outperform ECoT; OpenVLA often completes only the first subtask, revealing limitations in long-horizon compositionality.

  • Out-of-distribution tasks:

    • Multilingual pick-up (Spanish, Mandarin, Hindi): VLM2VLA shows significant advantage, often translating and identifying the correct object, leveraging retained multilingual priors (Figure 7).
    • Pop-culture grounding (“Pick up the item above Ash Ketchum”): VLM2VLA is the only model achieving a meaningful success rate, demonstrating retained open-world semantics and spatial reasoning. These results validate Q3: retained world knowledge enables zero-shot generalization to novel tasks.

Ablation (VLM2VLA-AT):

  • Despite strong VQA (suggesting LoRA mitigates forgetting), token-based actions underperform vs. “actions as language” in robotic tasks, especially OOD and compositional settings. Indicates that representation choice affects how well the VLM’s latent knowledge connects to control.

    The following figure (Figure 5 from the original paper) compares success rates across five tasks; VLM2VLA performs best in multilingual and open-world semantic tasks:

    Figure 5 Performance of OpenVLA, ECoT, VLM2VLA-AT, and VLM2VLA across five robot manipulation tasks (success percentages). VLM2VLA performs best on several tasks, particularly 'Pick Up - T' and 'Pick Up - A'.

The following figure (Figure 6 from the original paper) quantifies task decomposition correctness for OOD tasks; VLM2VLA identifies objects/destinations more reliably:

Figure 6 Analysis of task decomposition for OOD manipulation tasks. Points are awarded if the model's task plan correctly identifies the task object, and task destination if present. See Appendix C.3 for additional details.

The following figure (Figure 7 from the original paper) shows a qualitative multilingual example (Hindi “gajar uthao”—pick up the carrot) where VLM2VLA correctly identifies and acts:

Figure 7 A qualitative demonstration of VLM2VLA's zero-shot multilingual capabilities. Given the language instruction in Hindi ('gajar uthao', i.e., 'pick up the carrot'), our model identifies the correct object amidst distractors (eggplant and banana), demonstrating a genuine understanding of the task. The model decomposes the instruction into three subtasks: 'Move to carrot', 'Grasp carrot', and 'Lift carrot'.

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

Method #Params MMMU MMStar MME OCRBench MMB-en MMB-cn TextVQA DocVQA InfoVQA AI2D ChartQA RealWorldQA
Prismatic VLM Family
Prismatic VLM 7b 35.0 38.8 1456.6 32.0 66.2 55.7 42.5 17.5 19.7 54.6 16.7 30.8
OpenVLA 7b 26.3 0 0 0 43.0 4.1 0 0 0 0 0
ECoT 7b 26.6 0 0 0.01 3.7 0 0 0 0 25.6
Gemma-3 Family (with VLM2VLA)
Gemma-3-4B-IT 4b 39.3 1205.8 70.2 68.6 64.3 61.5 68.8 40.9 70.5 50.3 44.0
Gemma-3-12B-IT 12b 46.0 37.1 46.3 1182.3 75.0 76.9 74.7 68.9 80.6 50.4 78.5 55.1
VLM2VLA-AT 12b 45.9 45.2 1082.2 65.5 70.9 66.8 64.2 74.6 44.8 74.1 41.8
VLM2VLA (Ours) 12b 42.7 48.0 1391.7 63.9 68.5 67.6 64.9 78.4 46.2 74.0 58.3
Open-Source Co-Trained VLAs
MolmoAct 7b 28.4 1.2 1224.5 52.7 55.1 46.3 57.5 58.7 41.9 2.0 55.9 8.6
π0.5 3b 24.0 21.7 1061.9 6.8 6.8 0.3 10.0 4.6 7.7 27.0 5.1 2.7

Note: The table in the paper contains some formatting inconsistencies (e.g., missing entries and merged cells). The transcription above follows the original as closely as possible.

6.3. Ablation Studies / Parameter Analysis

  • Token mapping ablation (VLM2VLA-AT) confirms:

    • LoRA reduces catastrophic forgetting in VQA (scores near VLM2VLA).

    • However, downstream robot generalization is worse, especially for OOD tasks requiring multilingual translation or pop-culture grounding. Indicates that semantic alignment of action representation matters beyond retention: linguistic actions better bridge VLM priors to control.

      Inference latency (Appendix C.6):

  • Median ~6.1 s per cycle; std ~14.3 s with long-tail retries (>45 s) due to output formatting issues. Highlights decoding/robustness as an area to improve for real-time control.

    Training configuration (Appendix D):

  • LoRA rank 16, alpha 32; AdamW (LR 5e-5, linear decay); BF16; DeepSpeed ZeRO-2; global batch size 1 with gradient accumulation (effective small batch); ~300 GPU-hours on 4×A100; 1 epoch sufficed.
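A hedged rendering of these hyperparameters as Hugging Face `TrainingArguments`; the gradient-accumulation value and the DeepSpeed config path are assumptions, and the authors' actual launcher is not reproduced here.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="vlm2vla-lora",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,    # assumption: paper reports batch size 1 with gradient accumulation
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    optim="adamw_torch",
    bf16=True,
    deepspeed="ds_zero2.json",        # hypothetical path to a ZeRO-2 config
)
```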

7. Conclusion & Reflections

7.1. Conclusion Summary

  • VLM2VLA proposes a data-centric alignment—representing robot actions as natural language—to maintain proximity to a VLM’s pretraining distribution.
  • With aligned data, LoRA-only fine-tuning suffices to endow a VLM with control capabilities while preserving foundational multimodal reasoning, multilingual understanding, and open-world semantics.
  • Empirically validated across VQA benchmarks and >800 robot trials: competitive in-distribution performance, strong compositional reasoning, and notable OOD generalization (multilingual and pop-culture tasks).
  • The approach avoids architecture changes and costly co-training, offering a simple, model-agnostic route to generalist policies.

7.2. Limitations & Future Work

  • Inference latency: Autoregressive multi-stage generation (subtasks → motion → actions) is slow; decoding failures can trigger retries. Future work should explore faster decoding, constrained generation, or compact adapters.
  • Dexterous tasks: Current focus on translational DoFs; rotations and fine manipulation remain open. Finer-grained spatial language and improved spatial reasoning in VLMs may help.
  • Verifier dependence: External verifier (Gemini) adds overhead; training the base VLM to verify (or design self-verifying prompts) could streamline execution.
  • Cross-embodiment generalization: Current policy is specific to the WidowX arm; mapping language actions across robots (joint-space, torque control) is non-trivial but promising with a unified linguistic schema.
  • Scale: Applying the relabeling pipeline to larger datasets (e.g., DROID, Open-X-Embodiment) may unlock more robust zero-shot generalization.

7.3. Personal Insights & Critique

  • Strength: The paper reframes retention as a data alignment problem rather than a training regularization problem. This is elegant and practical—aligning modalities lets the model leverage its priors natively.

  • Practicality: LoRA-only fine-tuning is appealing for labs with limited compute. The relabeling cost (~$900 in the paper) is relatively modest compared to co-training experiments.

  • Potential issues:

    • Reliance on an external model (Gemini) for annotations introduces label biases and potential error propagation; manual spot-checking helps but scaling requires robust QA.
    • Numeric precision and unit consistency in language actions may affect control accuracy; future work should formalize numeric expressions and enforce constraints during decoding.
    • The token-based ablation uses least-likely tokens; alternative tokenization (e.g., learned codebooks or numeric-aware tokenization) might partially close the gap, so the superiority of language actions should be revisited with stronger baselines.
  • Transferability: The “actions as language” notion could benefit other agentic systems (software tools, web agents) where action APIs can be described in natural language, aligning with LLM pretraining distributions and enabling LoRA-only adaptation.

  • Broader implication: The paper underscores a general principle—minimize distribution shift at the data level to preserve foundation model capabilities. This may extend to speech-language-action, code-language-action, and other multimodal agent settings.

    Overall, VLM2VLA is a compelling, data-centric step toward robust generalist robot policies: by treating actions as language, it preserves the VLM’s world knowledge while enabling effective control—without architectural complexity or co-training overhead.
