
VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

Published: 09/11/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

VLA-Adapter reduces costly VLA model training by systematically identifying the crucial VL conditions and employing a lightweight Bridge Attention module. This achieves state-of-the-art performance with a 0.5B backbone, no robotic pre-training, and record-fast inference, dramatically lowering the barrier to training and deploying VLA models.

Abstract

Vision-Language-Action (VLA) models typically bridge the gap between perceptual and action spaces by pre-training a large-scale Vision-Language Model (VLM) on robotic data. While this approach greatly enhances performance, it also incurs significant training costs. In this paper, we investigate how to effectively bridge vision-language (VL) representations to action (A). We introduce VLA-Adapter, a novel paradigm designed to reduce the reliance of VLA models on large-scale VLMs and extensive pre-training. To this end, we first systematically analyze the effectiveness of various VL conditions and present key findings on which conditions are essential for bridging perception and action spaces. Based on these insights, we propose a lightweight Policy module with Bridge Attention, which autonomously injects the optimal condition into the action space. In this way, our method achieves high performance using only a 0.5B-parameter backbone, without any robotic data pre-training. Extensive experiments on both simulated and real-world robotic benchmarks demonstrate that VLA-Adapter not only achieves state-of-the-art level performance, but also offers the fastest inference speed reported to date. Furthermore, thanks to the proposed advanced bridging paradigm, VLA-Adapter enables the training of a powerful VLA model in just 8 hours on a single consumer-grade GPU, greatly lowering the barrier to deploying the VLA model. Project page: https://vla-adapter.github.io/.


In-depth Reading

English Analysis

1. Bibliographic Information

  • Title: VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model
  • Authors: Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, Siteng Huang, Yifan Tang, Wenhui Wang, Ru Zhang, Jianyi Liu, Donglin Wang.
  • Affiliations: The authors are from various institutions including Beijing University of Posts and Telecommunications, Westlake University, Zhejiang University, OpenHelix Team, and The Hong Kong University of Science and Technology (Guangzhou).
  • Journal/Conference: The paper is available on arXiv, a preprint server. This indicates it has not yet undergone formal peer review for a specific conference or journal at the time of this analysis.
  • Publication Year: 2025 (based on the arXiv ID 2509.09372, which corresponds to a September 2025 submission).
  • Abstract: The paper addresses the high training costs associated with Vision-Language-Action (VLA) models, which typically rely on pre-training large-scale Vision-Language Models (VLMs) on robotic data. The authors introduce VLA-Adapter, a new paradigm to reduce this dependency. They systematically analyze which visual-language (VL) conditions are most effective for bridging perception to action. Based on these findings, they propose a lightweight "Policy" module with a novel "Bridge Attention" mechanism that automatically injects the optimal VL information into the action generation process. Their method achieves state-of-the-art level performance using a small 0.5B parameter backbone without any robotic data pre-training. It also boasts the fastest inference speed reported and can be trained in just 8 hours on a single consumer-grade GPU, significantly lowering the barrier for VLA model deployment.
  • Original Source Link: https://arxiv.org/abs/2509.09372

2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: State-of-the-art Vision-Language-Action (VLA) models, which enable robots to perform tasks based on natural language instructions, are incredibly powerful but also suffer from major drawbacks. They typically require massive Vision-Language Models (VLMs) (often with 7B+ parameters), extensive pre-training on large-scale robotics datasets, and significant computational resources (high VRAM, long training times). This creates a high barrier to entry for research and practical deployment.
    • Gap in Prior Work: While many VLA models have been proposed, the fundamental question of how to most effectively and efficiently bridge the gap between vision-language (VL) perception and the robot's action space (A) has been underexplored. Existing methods use various ad-hoc strategies without a systematic understanding of what works best.
    • Innovation: This paper tackles the problem from a first-principles perspective. Instead of simply building a bigger model, it systematically investigates the "bridging" mechanism itself. The core innovation is the VLA-Adapter, a paradigm designed for efficiency and effectiveness, proving that a massive backbone and costly pre-training are not prerequisites for high performance.
  • Main Contributions / Findings (What):

    1. First Systematic Analysis of Bridging Paradigms: The paper provides the first systematic study on how different types of VL representations (e.g., from different layers of a VLM, raw features vs. query-based features) impact action generation, offering key design principles for future VLA models.

    2. VLA-Adapter Framework: The authors propose a novel, lightweight framework featuring a Policy module with Bridge Attention. This mechanism intelligently fuses multiple types of VL information—specifically, Raw features and ActionQuery features from all layers of the VLM—to generate precise actions.

    3. State-of-the-Art Performance with Tiny Models: VLA-Adapter achieves SOTA-level performance on complex robotics benchmarks using only a tiny 0.5B parameter VLM backbone, which is over 14 times smaller than common baselines like OpenVLA (7B).

    4. Unprecedented Efficiency: The model can be trained from scratch (without robotic pre-training) in just 8 hours on a single consumer GPU. It also achieves an inference throughput of 219.2 Hz, which is 3x faster than the SOTA competitor OpenVLA-OFT. This dramatically lowers the cost and complexity of developing and deploying VLA models.


3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • Vision-Language Model (VLM): An AI model that can process and understand information from both images (vision) and text (language) simultaneously. For example, it can answer questions about an image or describe what is happening in it.
    • Vision-Language-Action (VLA) Model: An extension of a VLM specifically for robotics. It takes visual input (e.g., from a robot's camera) and a language instruction (e.g., "pick up the red block") and outputs a sequence of actions (e.g., motor commands) for the robot to execute.
    • Policy Network: In robotics and reinforcement learning, the "Policy" is the component of the AI that decides what action to take in a given state. In VLA models, it translates the understanding from the VLM into concrete robot movements.
    • Embodied AI: A field of AI focused on creating intelligent agents (like robots) that can perceive, reason about, and interact with the physical world.
    • Backbone Model: A large, pre-trained model (in this case, a VLM) that serves as the foundation for a more specialized task. The VLA-Adapter is built "on top" of a VLM backbone.
    • Pre-training vs. Fine-tuning: A common training strategy for large models. Pre-training involves training the model on a massive, general dataset (like web images and text). Fine-tuning involves taking the pre-trained model and further training it on a smaller, task-specific dataset (like robotic manipulation data). This paper's method notably avoids the need for a robotic pre-training stage.
  • Previous Works & Bridging Paradigms: The paper categorizes prior work on bridging perception (VL) to action (A) into two main types, as illustrated in Image 7.

    Figure: A schematic of four prior VLA architecture types: (1) RoboVLMs, which uses Raw features from the final layer; (2) GR00T N1, which uses Raw features from an intermediate layer; (3) π₀, which uses Raw features from all layers; and (4) OpenVLA-OFT, which brings in final-layer features through additional query tokens. Arrows and module boxes indicate how features flow from the VLM to the Policy module and what each paradigm takes as input.

    • 1. Raw Features from VLMs: These methods extract feature representations directly from the VLM and feed them into the Policy network.
      • RoboVLMs (Type 1): Use features from the final layer of the VLM, assuming it contains the most abstract and task-relevant semantic information.
      • GR00T N1 (Type 2): Uses features from an intermediate layer, hypothesizing that these layers might retain richer, more fine-grained multimodal details useful for action.
      • π₀ (Type 3): Uses features from all layers, combining information from different levels of abstraction.
    • 2. Additional Query as Interface:
      • OpenVLA-OFT (Type 4): This approach introduces special, learnable "query" tokens that are fed into the VLM alongside the image and text. The VLM processes these queries, infusing them with multimodal information. The resulting output features for these queries are then used as the input to the Policy network. This acts as a dedicated "interface" between the VLM and the policy.
  • Differentiation: VLA-Adapter does not choose one of these methods but innovatively combines them. It uses both Raw features and ActionQuery features from all layers of the VLM. Its key novelty is the Bridge Attention mechanism, which learns how to weigh and combine these different sources of information dynamically to produce the best possible input for the action-generating Policy network. This makes the bridging process more robust and effective, especially when using a small VLM without prior robotic training.
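To make these four choices concrete, below is a minimal PyTorch sketch of how each paradigm selects VLM hidden states for the Policy. It assumes the VLM exposes one hidden-state tensor per layer and that any learnable query tokens occupy the last positions of the token sequence; the function and variable names are illustrative, not taken from any of these codebases.

```python
import torch

def select_condition(hidden_states, paradigm: str, num_queries: int = 64):
    """Pick the VLM features handed to the Policy, following the four prior paradigms.

    hidden_states: list of (batch, seq_len, dim) tensors, one per VLM layer
    (ordered shallow to deep). Query tokens, if any, are assumed to sit at the
    end of the sequence.
    """
    if paradigm == "last_raw":            # Type 1 (RoboVLMs): final-layer Raw features
        return hidden_states[-1]
    if paradigm == "intermediate_raw":    # Type 2 (GR00T N1): mid-layer Raw features
        return hidden_states[len(hidden_states) // 2]
    if paradigm == "all_raw":             # Type 3 (π₀): Raw features from every layer
        return torch.stack(hidden_states, dim=0)
    if paradigm == "last_query":          # Type 4 (OpenVLA-OFT): final-layer query outputs
        return hidden_states[-1][:, -num_queries:, :]
    raise ValueError(f"unknown paradigm: {paradigm}")

# Dummy check: 24 layers, batch 1, 320 tokens (last 64 are queries), hidden size 896.
layers = [torch.randn(1, 320, 896) for _ in range(24)]
print(select_condition(layers, "last_query").shape)   # torch.Size([1, 64, 896])
```

VLA-Adapter, by contrast, keeps both the Raw and the query (ActionQuery) streams from every layer rather than committing to one of these choices.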


4. Methodology (Core Technology & Implementation)

The core of VLA-Adapter is its systematic approach to identifying the best VL conditions for action generation and a novel architecture to leverage them.

  • Principles: The central idea is that different types of features from a VLM (raw vs. query-based) and from different depths (shallow vs. deep layers) contain complementary information crucial for robotic action. A powerful VLA model should be able to flexibly access and fuse this information.

  • Steps & Procedures: The overall framework is shown in Image 8.

    Figure: The unified VLA-Adapter framework and the four condition types. The left side shows the VLA-Adapter architecture, in which multi-layer VLM features feed the Policy module's attention mechanism to bridge vision-language features and action queries. The right side details the four condition inputs: single-layer Raw features, single-layer ActionQuery features, all-layer Raw features, and all-layer ActionQuery features, each fused via attention. Overall, the figure illustrates how different vision-language conditions connect to the action space.

    1. Inputs: At each timestep $t$, the model receives a third-person-view image $\mathcal{X}_t^v$, a gripper-camera image $\mathcal{X}_t^g$, a language instruction $\mathcal{L}_t$, and a set of learnable tokens called the ActionQuery $\mathcal{AQ}_t$.
    2. VLM Backbone: These inputs are processed by a VLM backbone (default is Qwen2.5-0.5B). The images are first encoded into embeddings using DINOv2 and SigLIP. The VLM then produces two sets of output features at each of its $M$ layers (a minimal sketch of how these two streams can be separated appears just before the formulas below):
      • Raw Latent Features ($\mathcal{C}_t^{\mathcal{R}}$): The standard vision and language representations.
      • ActionQuery Latent Features ($\mathcal{C}_t^{\mathcal{AQ}}$): The representations corresponding to the input ActionQuery tokens.
    3. Systematic Analysis of Conditions: Before designing the final architecture, the authors conduct a crucial analysis to determine which features are most useful. They test four conditions, as shown on the right side of Image 8:
      • a) Single-layer Raw features.

      • b) Single-layer ActionQuery features.

      • c) All-layer Raw features.

      • d) All-layer ActionQuery features.

        The results of this analysis are shown in Image 9.

        Figure: Success rate as a function of which single layer supplies the Raw latent versus the ActionQuery latent. The line chart on the left shows that the Raw-latent success rate fluctuates little across layers, while the ActionQuery-latent success rate rises sharply with depth, reaching 90.2% at layer 24. The bar chart on the right compares using all 24 layers: ActionQuery latents (92.6%) slightly outperform Raw latents (90.6%). Overall, ActionQuery latents benefit most from deep-layer features.

    The key findings from this analysis are:

    • Finding 1 (Raw Features): Middle-layer Raw features are most effective, as they balance low-level visual detail and high-level semantics. Deep-layer features are too abstract and lose information needed for precise actions.

    • Finding 2 (ActionQuery Features): Deep-layer ActionQuery features perform best because these learnable queries aggregate rich, task-relevant information as they pass through the entire VLM.

    • Finding 3 (Multi-Layer is Better): Using features from all layers consistently outperforms using features from any single layer. This suggests that combining information across different levels of abstraction is beneficial.

      Based on these findings, the final VLA-Adapter architecture is designed to use all-layer Raw features and all-layer ActionQuery features.

    4. Policy with Bridge Attention: The core innovation is the Policy module, which translates the VLM features into an $H$-step action chunk. Its key component is the Bridge Attention mechanism, detailed in Image 10.

      Figure: The VLA-Adapter architecture. On the left, the vision-language model (VLM) is connected to the Policy module through M stacked Bridge Attention blocks. On the right, the internal mechanism of Bridge Attention is shown: multi-head cross-attention and multi-head self-attention layers use the conditions as keys and values to bridge vision-language representations to actions. The design highlights the lightweight Policy module and the central role of Bridge Attention.

    For each layer $\tau$ of the Policy network (which mirrors the VLM's layers), the Bridge Attention module performs three parallel attention operations:

    • Cross-Attention with Raw Features ($\mathbf{CA}_1$): The current action latent attends to the Raw features ($\mathcal{C}_t^{\mathcal{R}}$) from the corresponding VLM layer.
    • Cross-Attention with ActionQuery Features ($\mathbf{CA}_2$): The action latent attends to the ActionQuery features ($\mathcal{C}_t^{\mathcal{AQ}}$) and the proprioceptive state ($\mathcal{P}_t$).
    • Self-Attention ($\mathbf{SA}$): The action latent attends to itself, allowing it to refine the predicted action sequence.
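Before the formulas, here is a minimal sketch of how the two condition streams described in step 2 could be gathered from the backbone, assuming the ActionQuery tokens are appended as the final positions of the VLM input sequence (an assumption of this sketch; shapes such as the 24-layer, 896-dimensional backbone are illustrative).

```python
import torch

def collect_conditions(hidden_states, num_queries: int = 64):
    """Split each VLM layer's hidden states into Raw and ActionQuery latents.

    hidden_states: list of (batch, seq_len, dim) tensors, one per VLM layer.
    Returns two lists (one entry per layer): C_t^R (vision + language tokens)
    and C_t^AQ (the trailing ActionQuery tokens).
    """
    raw_latents = [h[:, :-num_queries, :] for h in hidden_states]
    query_latents = [h[:, -num_queries:, :] for h in hidden_states]
    return raw_latents, query_latents

# Dummy check with a 24-layer, 896-dim backbone and 64 ActionQuery tokens.
layers = [torch.randn(1, 320, 896) for _ in range(24)]
raw, queries = collect_conditions(layers)
print(raw[0].shape, queries[0].shape)   # torch.Size([1, 256, 896]) torch.Size([1, 64, 896])
```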
  • Mathematical Formulas & Key Details: The outputs of these three attention mechanisms are concatenated to form the updated action latent. Crucially, the influence of the Raw features is modulated by a learnable gating parameter $g$.

    $$\widehat{\mathbf{A}}_t^{\tau} = \left[\, \mathbf{CA}_1\!\left(\widetilde{\mathbf{A}}_t^{\tau},\, \sigma_1(\mathcal{C}_t^{\mathcal{R}})\right)\cdot \tanh(g),\;\; \mathbf{CA}_2\!\left(\widetilde{\mathbf{A}}_t^{\tau},\, \sigma_2\!\left[\mathcal{C}_t^{\mathcal{AQ}},\, \sigma_0(\mathcal{P}_t)\right]\right),\;\; \mathbf{SA}\!\left(\widetilde{\mathbf{A}}_t^{\tau},\, \widetilde{\mathbf{A}}_t^{\tau}\right)\right]$$

    • $\widehat{\mathbf{A}}_t^{\tau}$: The output action latent at layer $\tau$.

    • $\widetilde{\mathbf{A}}_t^{\tau}$: The input action latent at layer $\tau$.

    • $\mathcal{C}_t^{\mathcal{R}}$ and $\mathcal{C}_t^{\mathcal{AQ}}$: The Raw and ActionQuery features from the VLM.

    • $\mathcal{P}_t$: The robot's proprioceptive state (e.g., joint angles).

    • $\sigma_0, \sigma_1, \sigma_2$: MLP layers for feature projection.

    • $g$: A learnable scalar parameter initialized to 0.

    • $\tanh(g)$: A gating function that squashes $g$ to the range $[-1, 1]$. This allows the model to learn whether to amplify, suppress, or even invert the contribution of the Raw features, providing a soft selection mechanism.

      This entire process is repeated for $M$ layers, after which a final MLP outputs the predicted action chunk (a minimal code sketch of one Bridge Attention layer appears at the end of this subsection).

    The model is trained end-to-end with a simple L1 loss, minimizing the difference between the predicted action and the ground truth action.

    $$\min_{\theta}\, \mathcal{I}(\theta) = \mathbb{E}_{\mathbf{A}_t,\, \mathcal{C}_t^{\mathcal{R}},\, \mathcal{C}_t^{\mathcal{AQ}},\, \sigma_0(\mathcal{P}_t),\, \tau}\left[\,\left\| \pi_{\theta}\!\left(\mathbf{A}_t^{\tau},\, \mathcal{C}_t^{\mathcal{R}},\, \mathcal{C}_t^{\mathcal{AQ}},\, \sigma_0(\mathcal{P}_t),\, \tau\right) - \mathbf{A}_t \right\|_1\right]$$

    • $\pi_{\theta}(\cdot)$: The VLA-Adapter policy with parameters $\theta$.

    • $\mathbf{A}_t$: The ground-truth action chunk.

    • $\|\cdot\|_1$: The L1 norm (sum of absolute differences).
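As a concrete illustration of the Policy described above, here is a minimal PyTorch sketch of one Bridge Attention layer. It assumes standard multi-head attention modules and adds a linear fusion of the three concatenated attention outputs back to the model width (that fusion projection is our assumption; the paper's equation only shows the concatenation). Names, default dimensions, and the dummy shapes are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn


class BridgeAttentionLayer(nn.Module):
    """One Bridge Attention block of the Policy (illustrative re-implementation)."""

    def __init__(self, d_model: int = 896, n_heads: int = 8, proprio_dim: int = 8):
        super().__init__()
        # CA_1: action latent attends to the Raw features of the matching VLM layer.
        self.ca_raw = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # CA_2: action latent attends to [ActionQuery features, projected proprio state].
        self.ca_query = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # SA: self-attention over the action latent itself.
        self.sa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # sigma_1, sigma_2, sigma_0: projections of the three condition sources.
        self.proj_raw = nn.Linear(d_model, d_model)
        self.proj_query = nn.Linear(d_model, d_model)
        self.proj_state = nn.Linear(proprio_dim, d_model)
        # Learnable gate g, initialized to 0 so tanh(g) = 0 and the Raw branch starts off.
        self.g = nn.Parameter(torch.zeros(1))
        # Fuse the concatenated [CA_1, CA_2, SA] outputs back to d_model (our assumption).
        self.fuse = nn.Linear(3 * d_model, d_model)

    def forward(self, action_latent, raw_feats, query_feats, proprio):
        # action_latent: (B, H, d); raw_feats: (B, L_r, d); query_feats: (B, L_q, d);
        # proprio: (B, 1, proprio_dim)
        raw = self.proj_raw(raw_feats)
        cond = torch.cat([self.proj_query(query_feats), self.proj_state(proprio)], dim=1)
        out_raw, _ = self.ca_raw(action_latent, raw, raw)
        out_query, _ = self.ca_query(action_latent, cond, cond)
        out_self, _ = self.sa(action_latent, action_latent, action_latent)
        # Raw branch is modulated by tanh(g); the ActionQuery branch is fully injected.
        fused = torch.cat([out_raw * torch.tanh(self.g), out_query, out_self], dim=-1)
        return self.fuse(fused)


# Dummy shapes: 8-step action chunk, 256 Raw tokens, 64 ActionQuery tokens, 896-dim backbone.
layer = BridgeAttentionLayer()
a_hat = layer(torch.randn(1, 8, 896), torch.randn(1, 256, 896),
              torch.randn(1, 64, 896), torch.randn(1, 1, 8))
print(a_hat.shape)  # torch.Size([1, 8, 896])
```

Stacking $M$ such layers (one per VLM backbone layer) and applying a final MLP would yield the $H$-step action chunk; training then reduces to the L1 objective above, e.g. `torch.nn.functional.l1_loss(predicted_chunk, expert_chunk)` averaged over the demonstration data.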


5. Experimental Setup

  • Datasets:

    • LIBERO: A benchmark for lifelong robot learning with diverse manipulation tasks. It is divided into suites: Spatial, Object, Goal, and Long (for long-horizon tasks). Image 5 shows examples.
    • CALVIN: A benchmark designed to evaluate long-horizon, language-conditioned tasks and zero-shot generalization. The model is trained on environments A, B, and C and tested on the unseen environment D (ABC -> D). Image 3 illustrates the setup.
    • Real-World Tasks: The model was tested on a physical 6-DOF Synria Alicia-D robot arm with various pick-and-place, stacking, and long-horizon tasks. Image 4 shows the real-world setup and task examples.
  • Evaluation Metrics:

    1. Success Rate (%):
      • Conceptual Definition: The primary metric for task-oriented robotics. It measures the percentage of trials where the robot successfully completes the assigned task according to predefined criteria. A higher value is better.
      • Mathematical Formula: $$\text{Success Rate} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} \times 100\%$$
      • Symbol Explanation: The symbols are self-explanatory.
    2. Avg. len (Average Length):
      • Conceptual Definition: Used specifically for the CALVIN benchmark, which consists of task chains of 5 subtasks. This metric measures the average number of consecutive subtasks successfully completed in a chain. A higher value (up to 5) indicates better long-horizon performance.
    3. Throughput (Hz):
      • Conceptual Definition: A measure of inference efficiency. It counts how many actions the model generates per second; since the model outputs chunks of several actions at once, throughput equals the chunk size divided by the time to produce one chunk (e.g., the reported 219.2 Hz corresponds to an 8-action chunk generated in 0.0365 s). Higher is better, indicating a faster, more responsive model.
      • Mathematical Formula: $$\text{Throughput} = \frac{\text{Actions per Chunk}}{\text{Time to Generate One Chunk (seconds)}}$$
    4. Latency (Sec):
      • Conceptual Definition: The time it takes for the model to generate a single action chunk. Lower is better, indicating less delay. A small computation sketch for these metrics follows this list.
      • Mathematical Formula: $$\text{Latency} = \frac{\text{Actions per Chunk}}{\text{Throughput}}$$
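To make the metric definitions concrete, here is a minimal sketch assuming a simple list of per-trial results and a fixed-size action chunk; the function names, the example numbers, and the 8-action chunk size are illustrative, not taken from the paper's evaluation code.

```python
# Minimal metric computations matching the definitions above (illustrative only).

def success_rate(trials: list) -> float:
    """Percentage of rollouts marked successful (trials is a list of booleans)."""
    return 100.0 * sum(trials) / len(trials)

def avg_len(completed_subtasks: list) -> float:
    """CALVIN 'Avg. len': mean number of consecutive subtasks (0-5) completed per chain."""
    return sum(completed_subtasks) / len(completed_subtasks)

def throughput_hz(actions_per_chunk: int, chunk_latency_s: float) -> float:
    """Actions emitted per second, e.g. an 8-action chunk in 0.0365 s -> ~219.2 Hz."""
    return actions_per_chunk / chunk_latency_s

print(success_rate([True, True, False, True]))   # 75.0
print(avg_len([5, 4, 5, 3]))                     # 4.25
print(round(throughput_hz(8, 0.0365), 1))        # 219.2
```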
  • Baselines: The paper compares against a comprehensive set of recent VLA models, categorized by size:

    • Large (7B+): OpenVLA, OpenVLA-OFT, UniVLA, FlowVLA.

    • Small (2-4B): π₀, SmolVLA, GR00T N1.

    • Tiny (~0.5B): VLA-OS, Seer, Diffusion Policy. OpenVLA-OFT is considered the primary state-of-the-art (SOTA) baseline for comparison.


6. Results & Analysis

  • Core Results:

    1. Necessity and Effectiveness of VLA-Adapter: Section 4.1 demonstrates why the VLA-Adapter paradigm is crucial, especially for models without robotic pre-training.

    (This is a manual transcription of Table 2 from the paper.)

    | Fine-tuned | B1 + OFT | B1 + Ours | B2 + OFT | B2 + Ours | B3 + OFT | B3 + Ours |
    |---|---|---|---|---|---|---|
    | Success Rate (%) ↑ | 85.8 | 95.0 (9.2% ↑) | 87.5 | 95.2 (7.7% ↑) | 94.5 | 95.4 (0.9% ↑) |
    • Analysis: When using backbones without robotic pre-training (B1 and B2), VLA-Adapter (Ours) dramatically outperforms the OpenVLA-OFT bridging style. With a pre-trained backbone (B3), the gain is smaller because the VLM is already adapted to the action domain. This confirms that VLA-Adapter is highly effective at adapting general-purpose VLMs for robotics without needing expensive pre-training.

      (This is a manual transcription of Table 3 from the paper.)

      | Frozen | OpenVLA-OFT | SmolVLA | VLA-Adapter |
      |---|---|---|---|
      | Success Rate (%) ↑ | 0.0 | 77.0 | 86.4 |
    • Analysis: When the VLM backbone is completely frozen, OpenVLA-OFT fails entirely (0.0% success), while VLA-Adapter remains highly effective, outperforming SmolVLA. This highlights the robustness of its bridging mechanism. Image 6 provides a visual example of this difference in performance.

      Figure: A comparison of action sequences from two models on a robot manipulation task. The upper rows show the OpenVLA-OFT rollout (marked False) as the robot grasps and moves the object.

      (This is a manual transcription of Table 4 from the paper.)

    | Efficiency | OpenVLA | OpenVLA-OFT (w/o X, P) | OpenVLA-OFT | VLA-Adapter |
    |---|---|---|---|---|
    | Throughput (Hz) ↑ | 4.2 | 109.7 | 71.4 | 219.2 |
    | Latency (Sec) ↓ | 0.2396 | 0.0729 | 0.1120 | 0.0365 |
    • Analysis: VLA-Adapter is significantly more efficient, achieving 3x the throughput and 1/3 the latency of the SOTA OpenVLA-OFT model, thanks to its tiny backbone and lightweight policy design. Image 1 visually summarizes these efficiency and performance benefits.

      Figure: A table comparing OpenVLA-OFT (SOTA) with the proposed VLA-Adapter in backbone scale, training VRAM requirements, throughput, and performance.

      2. Overall and Generalization Performance: VLA-Adapter demonstrates SOTA-level performance across a wide range of tasks.

    (This is a manual transcription of Table 5 from the paper.)

    | Scale | LIBERO | Params (B) | Spatial | Object | Goal | Long | Avg. |
    |---|---|---|---|---|---|---|---|
    | Large | OpenVLA-OFT (Kim et al., 2025) (RSS) | 7 | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 |
    | Small | π₀ (Black et al., 2025b) (RSS) | 3 | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 |
    | Small | GR00T N1 (NVIDIA et al., 2025) (arXiv) | 2 | 94.4 | 97.6 | 93.0 | 90.6 | 93.9 |
    | Tiny | VLA-Adapter (Ours) | 0.5 | 97.8 | 99.2 | 97.2 | 95.0 | 97.3 |
    | Tiny | VLA-Adapter-Pro (Ours) | 0.5 | 99.6* | 99.6* | 98.2* | 96.4* | 98.5* |
    • Analysis (LIBERO): The 0.5B VLA-Adapter achieves an average success rate of 97.3%, matching the 7B OpenVLA-OFT (97.1%) and significantly outperforming other smaller models. The improved VLA-Adapter-Pro version sets a new SOTA at 98.5%.

      (This is a manual transcription of Table 6 from the paper.)

      | Scale | CALVIN ABC→D | Params (B) | Avg. len ↑ |
      |---|---|---|---|
      | Large | OpenVLA-OFT (Kim et al., 2025) (RSS) | 7 | 4.10 |
      | Small | VPP† (Hu et al., 2025) (ICML) | 1.5 | 4.33 |
      | Tiny | VLA-Adapter (Ours) | 0.5 | 4.42 |
      | Tiny | VLA-Adapter-Pro (Ours) | 0.5 | 4.50* |
    • Analysis (CALVIN): VLA-Adapter demonstrates superior zero-shot generalization, achieving the highest average task length (Avg. len) of 4.42 (and 4.50 for Pro), outperforming all baselines, including much larger models.

    3. Real-World Performance: As shown in the figure below, VLA-Adapter outperforms baselines like ACT and an OFT-style variant in real-world tasks, showing strong generalization from simulation to reality.

    Figure: A bar chart and four illustrations. The bar chart compares ACT, 0.5B+OFT, and VLA-Adapter on the "Pick", "Move", "Stack", and "Long" tasks and on average success rate, with VLA-Adapter performing best. The four illustrations on the right show the robot arm executing the Pick, Move, Stack, and Long task scenarios, with colored boxes marking the corresponding actions.

  • Ablations / Parameter Sensitivity:

    1. Number of ActionQuery: The figure below shows that performance peaks at 64 ActionQuery tokens. Too few queries cannot capture sufficient multimodal information, while too many introduce redundancy.

    Figure: A line chart of success rate (%) versus the number of ActionQuery tokens. The x-axis is the number of ActionQuery tokens and the y-axis is the success rate. The line shows the last-layer-ActionQuery-only setting, whose success rate generally improves as the number of queries increases. Red stars mark the full VLA-Adapter at 64 and 256 ActionQuery tokens, reaching 95% and 93% respectively, clearly above the last-layer-only setting.

    2. Condition Type: This ablation validates the core design choice.

    (This is a manual transcription of Table 7 from the paper.)

    | Layer | Raw | ActionQuery | Style | SR ↑ |
    |---|---|---|---|---|
    | Last | ✓ | | RoboVLMs (Liu et al., 2024a) | 85.8 |
    | Last | | ✓ | OpenVLA-OFT (Kim et al., 2025) | 90.2 |
    | Intermediate | ✓ | | GR00T N1 (NVIDIA et al., 2025) | 88.4 |
    | All | ✓ | | π₀ (Black et al., 2025b) | 90.6 |
    | All | | ✓ | N/A | 92.6 |
    | All | ✓ | ✓ | VLA-Adapter (Ours) | 95.0 |
    • Analysis: Using both Raw and ActionQuery features from all layers achieves the highest success rate (95.0%), confirming the superiority of the proposed combined bridging paradigm over all previous single-paradigm approaches.

      3. Injection Degree for Policy: This tests the effectiveness of the learnable gating factor $g$.

    (This is a manual transcription of Table 8 from the paper.)

    | Setting | Raw | ActionQuery | Success Rate (%) |
    |---|---|---|---|
    | 1) VLA-Adapter (default) | tanh(g) | 1 | 95.0 |
    | 2) | 1 | 1 | 91.4 |
    | 3) | 1 | tanh(g) | 91.0 |
    | 4) | tanh(g) | tanh(g) | 92.6 |
    • Analysis: The default VLA-Adapter setting (1), which learns the injection degree for Raw features while fully injecting ActionQuery features, performs best. Forcing both to be fully injected (2) degrades performance. This confirms that the adaptive gating mechanism in Bridge Attention is crucial for optimal performance.
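As a small illustration of what the four settings in the table change, the sketch below (a hypothetical helper, not from the paper) shows how the injection degree of each attention branch could be configured as either the learnable tanh(g) gate or a fixed factor of 1.

```python
import torch

def injection_factor(mode: str, g: torch.Tensor) -> torch.Tensor:
    """Scaling applied to an attention branch: learnable tanh(g) or a fixed 1."""
    return torch.tanh(g) if mode == "tanh(g)" else torch.ones_like(g)

g = torch.zeros(1, requires_grad=True)  # g is initialized to 0, so tanh(g) starts at 0
# Setting 1 (VLA-Adapter default): gate the Raw branch, fully inject the ActionQuery branch.
raw_scale = injection_factor("tanh(g)", g)
query_scale = injection_factor("1", g)
print(raw_scale.item(), query_scale.item())   # 0.0 1.0 at initialization
```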


7. Conclusion & Reflections

  • Conclusion Summary: The paper successfully introduces VLA-Adapter, a novel and highly efficient paradigm for building VLA models. By systematically analyzing the VL-to-A bridge and designing a Bridge Attention mechanism to fuse Raw and ActionQuery latents, the authors demonstrate that it is possible to achieve SOTA-level performance with a tiny (0.5B) model backbone and without expensive robotic data pre-training. VLA-Adapter not only matches or exceeds the performance of models 14x its size but also offers unparalleled training and inference efficiency, significantly lowering the barrier to entry for developing and deploying advanced robotic systems.

  • Limitations & Future Work: The authors acknowledge several limitations:

    1. Real-World Generalization: While promising, the generalization of a tiny model in diverse, unstructured real-world scenarios needs further improvement, as it lacks the vast knowledge implicitly learned during large-scale embodied pre-training.
    2. Condition Quality: The model's performance is fundamentally dependent on the quality of the features provided by the VLM. Future work could explore how to enhance these representations for robotics.
    3. Simple Training Process: The current model uses a simple imitation learning (L1 loss) setup. More advanced training methods, such as reinforcement learning, could be explored to further improve performance and robustness.
  • Personal Insights & Critique:

    • Strength in Simplicity and Rigor: The paper's greatest strength is its rigorous, first-principles investigation of a fundamental problem. Instead of chasing scale, the authors focused on understanding the "how" and "why" of information flow, leading to an elegant and resource-efficient solution.
    • Democratizing Robotics AI: The most significant impact of this work is its potential to democratize VLA research. By showing that powerful models can be trained in hours on a single consumer GPU, it opens the door for smaller labs, individual researchers, and startups to contribute to the field without needing access to massive compute clusters.
    • Open Questions: The learnable gating parameter $g$ is a simple yet effective mechanism. However, it's a single scalar applied uniformly. Could a more sophisticated, dynamic gating mechanism (e.g., one that is conditioned on the current state or instruction) provide even better control over information fusion? Furthermore, while the model avoids robotic pre-training, it still relies on a VLM pre-trained on general web data. The ultimate goal might be a system that learns efficiently with even less prior knowledge. Overall, VLA-Adapter is a landmark paper in efficient robotics AI, shifting the focus from "bigger is better" to "smarter is better."
