
Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Published: 02/27/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper introduces an Optimized Fine-Tuning (OFT) recipe for VLAs, integrating parallel decoding, action chunking, a continuous action representation, and L1 regression. OpenVLA-OFT boosts OpenVLA's success rate from 76.5% to 97.1% on LIBERO while generating actions 26× faster. It also enables dexterous, high-frequency bimanual control on a real ALOHA robot, where it outperforms π₀, RDT-1B, Diffusion Policy, and ACT.

Abstract

Recent vision-language-action models (VLAs) build upon pretrained vision-language models and leverage diverse robot datasets to demonstrate strong task execution, language following ability, and semantic generalization. Despite these successes, VLAs struggle with novel robot setups and require fine-tuning to achieve good performance, yet how to most effectively fine-tune them is unclear given many possible strategies. In this work, we study key VLA adaptation design choices such as different action decoding schemes, action representations, and learning objectives for fine-tuning, using OpenVLA as our representative base model. Our empirical analysis informs an Optimized Fine-Tuning (OFT) recipe that integrates parallel decoding, action chunking, a continuous action representation, and a simple L1 regression-based learning objective to altogether improve inference efficiency, policy performance, and flexibility in the model's input-output specifications. We propose OpenVLA-OFT, an instantiation of this recipe, which sets a new state of the art on the LIBERO simulation benchmark, significantly boosting OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26×. In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot and outperform other VLAs (π₀ and RDT-1B) fine-tuned using their default recipes, as well as strong imitation learning policies trained from scratch (Diffusion Policy and ACT) by up to 15% (absolute) in average success rate. We release code for OFT and pretrained model checkpoints at https://openvla-oft.github.io/.

In-depth Reading

1. Bibliographic Information

  • Title: Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success
  • Authors: Moo Jin Kim, Chelsea Finn, Percy Liang (Stanford University)
  • Journal/Conference: This paper is a preprint available on arXiv. It has not yet been published in a peer-reviewed journal or conference.
  • Publication Year: 2025 (v1 submitted to arXiv in February 2025)
  • Abstract: The paper addresses the challenge of adapting pretrained Vision-Language-Action models (VLAs) to new robotic setups. While VLAs show promise, they require fine-tuning, and the best strategy for this is not well-understood. The authors systematically study key design choices for VLA fine-tuning, including action decoding schemes, action representations, and learning objectives, using OpenVLA as a base model. This analysis leads to an Optimized Fine-Tuning (OFT) recipe that combines parallel decoding, action chunking, a continuous action representation, and a simple L1 regression loss. The resulting model, OpenVLA-OFT, achieves a new state-of-the-art success rate of 97.1% on the LIBERO simulation benchmark (up from 76.5%) while being 26x faster. On a real-world bimanual ALOHA robot, the recipe enables dexterous, high-frequency control, outperforming other leading VLAs and imitation learning policies by up to 15%.
  • Original Source Link: https://openvla-oft.github.io/ (project page with code and model checkpoints)

2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: Vision-Language-Action models (VLAs) are powerful robot policies pretrained on large, diverse datasets. However, when deployed on a new robot or task, they often perform poorly without being fine-tuned on task-specific data. The standard fine-tuning approach, which mimics the model's original pretraining (slow, token-by-token generation), is often too slow for real-time control (e.g., 3-5 Hz) and can be ineffective for complex tasks like bimanual manipulation.
    • Importance & Gaps: There is a significant gap in understanding the most effective way to fine-tune a VLA. Practitioners lack a clear "recipe" for adaptation. Existing methods are either too slow (autoregressive models) or introduce new complexities (diffusion models) without a clear analysis of which design choices are most critical for success and efficiency.
    • Innovation: This work provides the first systematic empirical study of key design choices for VLA fine-tuning. Instead of proposing a completely new model, it focuses on creating an optimized recipe that dramatically improves the performance and speed of an existing, open-source VLA (OpenVLA).
  • Main Contributions / Findings (What):

    1. Systematic Study of VLA Fine-Tuning: The paper empirically evaluates three crucial design axes:

      • Action Decoding: Autoregressive (one-by-one) vs. Parallel (all-at-once).
      • Action Representation: Discrete (binned tokens) vs. Continuous (raw vectors).
      • Learning Objective: Next-Token Prediction vs. L1 Regression vs. Diffusion.
    2. Optimized Fine-Tuning (OFT) Recipe: Based on the study, the authors propose a simple yet highly effective recipe combining parallel decoding, action chunking, a continuous action representation, and an L1 regression objective. This recipe makes fine-tuned VLAs significantly faster, more successful, and more flexible in handling inputs/outputs.

    3. State-of-the-Art Performance: An instantiation of this recipe, OpenVLA-OFT, sets a new state-of-the-art on the LIBERO simulation benchmark, boosting the success rate from 76.5% to 97.1% and action generation throughput by 26x.

    4. Real-World Validation: A slightly augmented recipe, OFT+ (which adds FiLM for language grounding), enables the OpenVLA model, originally trained only on single-arm data, to successfully perform dexterous, high-frequency (25 Hz) bimanual tasks on a real ALOHA robot, outperforming strong baselines like π₀ and RDT-1B.


3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • Vision-Language Models (VLMs): These are large AI models, often based on the Transformer architecture, that are trained to understand the relationship between images and text. They can perform tasks like answering questions about an image or generating a caption. Prismatic is the VLM that OpenVLA is built upon.
    • Vision-Language-Action Models (VLAs): VLAs are an extension of VLMs, fine-tuned to not only see and understand language but also to act. They take inputs like camera images and a text command (e.g., "pick up the red block") and output low-level robot control commands (e.g., motor positions). OpenVLA, RT-2, π₀, and RDT-1B are examples of VLAs.
    • Fine-Tuning: The process of taking a large, pretrained model and further training it on a smaller, specific dataset to adapt it to a new task.
    • LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning (PEFT) technique. Instead of updating all billions of parameters in a large model, LoRA freezes the original model and injects small, trainable "adapter" matrices, making fine-tuning much more computationally affordable.
    • Autoregressive vs. Parallel Decoding:
      • Autoregressive: Generating output sequentially, where each new piece of output depends on the previously generated ones. This is how language models like GPT write text, one word at a time. In robotics, it means predicting each dimension of an action vector one by one, which is slow.
      • Parallel: Generating the entire output sequence in a single step. This is much faster as it avoids the sequential dependency.
    • Action Chunking: Instead of predicting the robot's action for just the next single timestep, the model predicts a sequence (a "chunk") of actions for multiple future timesteps (e.g., the next 25 steps). This can lead to smoother motions and higher efficiency, as the model is queried less frequently.
    • Discrete vs. Continuous Actions:
      • Discrete: The continuous range of a robot's action (e.g., position change from -1.0 to 1.0) is divided into a fixed number of bins (e.g., 256). The model then predicts which bin the action falls into, treating it like a word prediction task. This can lose precision; a small numeric sketch at the end of this section illustrates the rounding error.
      • Continuous: The model directly outputs the raw numerical values for the action vector (e.g., [0.34, -0.51, 0.88]). This is typically done with a regression objective.
    • FiLM (Feature-wise Linear Modulation): A technique to condition a neural network's processing on external information (in this case, language). It generates a scaling (γ) and shifting (β) vector from the language instruction and applies it to the visual features, effectively telling the vision part of the model "what to look for."
  • Previous Works & Differentiation:

    • Prior VLAs like RT-2 and the original OpenVLA primarily used autoregressive decoding of discrete actions. While showing impressive generalization, they were too slow (3-5 Hz) for many real-world robots that require high-frequency control (25-50+ Hz).

    • Recent works like FAST and MiniVLA improved efficiency with better action tokenization, but still relied on the fundamentally sequential autoregressive approach, leading to significant latency between action chunks.

    • Other recent VLAs like π₀ and RDT-1B achieved high speed and performance on bimanual tasks using diffusion models. Diffusion models are powerful generative models but can be complex, slower to train, and require multiple inference steps (denoising) to generate an action chunk.

    • Differentiation: This paper stands out because it doesn't just propose a new model. It systematically dissects the fine-tuning process itself. Through controlled experiments, it shows that a much simpler and faster approach—parallel decoding with L1 regression—can match or even exceed the performance of more complex diffusion-based methods, providing a clear, practical, and highly effective recipe for adapting existing VLAs.
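To make the discrete-vs-continuous action distinction above concrete, here is a minimal numeric sketch, not taken from the paper's code, of 256-bin discretization and the rounding error it introduces (helper names and values are illustrative only):

```python
import numpy as np

# Hypothetical sketch of 256-bin action discretization, the style of scheme
# OpenVLA's default recipe uses for each action dimension.
NUM_BINS = 256
BIN_EDGES = np.linspace(-1.0, 1.0, NUM_BINS + 1)

def discretize(action_value: float) -> int:
    """Map a normalized action value in [-1, 1] to a bin index in [0, 255]."""
    return int(np.clip(np.digitize(action_value, BIN_EDGES) - 1, 0, NUM_BINS - 1))

def undiscretize(bin_idx: int) -> float:
    """Recover the bin-center value; its gap to the original is the precision loss."""
    bin_width = 2.0 / NUM_BINS
    return -1.0 + (bin_idx + 0.5) * bin_width

a = 0.3437
print(undiscretize(discretize(a)))  # ~0.3398; error bounded by half a bin width (~0.0039)
```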


4. Methodology (Core Technology & Implementation)

The core of the paper is the systematic investigation of three design choices to create the Optimized Fine-Tuning (OFT) recipe.

  • Principles: The central idea is to shift VLA fine-tuning away from the slow, language-model-like paradigm (autoregressive, discrete tokens) towards a more direct and efficient paradigm suitable for real-time robotic control (parallel, continuous actions).

  • Steps & Procedures (Key Design Decisions Studied):

    The authors start with the baseline OpenVLA model and explore alternatives for three key components, as illustrated below.

    Figure (schematic): the two key decoding schemes for VLAs. Left: autoregressive decoding vs. parallel decoding with bidirectional attention. Right: given image inputs and a language instruction ("What action should the robot take to put the eggplant into the pot?"), the model outputs either discrete actions (via token prediction) or continuous actions (via L1 regression or diffusion), highlighting the design choices that govern inference efficiency and policy performance.

    1. Action Generation Strategy (Autoregressive vs. Parallel):

      • Baseline (Autoregressive): OpenVLA originally predicts a 7-dimensional action by generating 7 discrete tokens sequentially. This requires 7 forward passes through the decoder for a single timestep action, making it very slow.
      • Proposed (Parallel): The authors modify the model for parallel decoding.
        • Input: Instead of feeding the previous ground-truth action token, the model receives "empty" action embeddings with only positional information.
        • Attention: The standard causal attention mask (which prevents looking at future tokens) is replaced with a bidirectional attention mask, allowing every output position to attend to all input information simultaneously.
        • Output: The model generates all action dimensions (and all timesteps in a chunk) in a single forward pass, which dramatically reduces latency; a schematic decoding sketch follows item 3 below.
    2. Action Representation (Discrete vs. Continuous):

      • Baseline (Discrete): Each dimension of the robot's action is normalized to [-1, 1] and discretized into 256 bins. The model predicts one bin per dimension using a standard language model output layer.
      • Proposed (Continuous): The model is modified to output continuous-valued actions directly. The final language model output layer is replaced with a Multi-Layer Perceptron (MLP) "action head" that maps the hidden states directly to normalized continuous action values. This avoids precision loss from discretization.
    3. Learning Objective (Next-Token Prediction vs. L1 vs. Diffusion):

      • Baseline (Next-Token Prediction): Used with discrete actions. The model is trained with a standard cross-entropy loss to predict the correct action token (bin), just like a language model predicts the next word.
      • Proposed (L1 Regression): Used with continuous actions. The model is trained to minimize the mean L1 distance (absolute difference) between the predicted continuous action vector and the ground-truth action vector. This is a simple and efficient regression objective.
      • Alternative (Diffusion): Also used with continuous actions. The model learns to denoise an action that has been corrupted with Gaussian noise. During inference, it starts with pure noise and iteratively refines it over multiple steps to generate the final action. This is more expressive but computationally heavier.
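To make the decoding contrast above concrete, here is a schematic PyTorch sketch (not the paper's implementation; ToyDecoder is a stand-in for the VLA backbone) showing why autoregressive decoding needs one forward pass per action dimension while parallel decoding with bidirectional attention produces all positions in one pass:

```python
import torch
import torch.nn as nn

# Schematic sketch contrasting the two decoding schemes; sizes are illustrative.
class ToyDecoder(nn.Module):
    def __init__(self, dim=32, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, causal: bool):
        T = x.shape[1]
        # Boolean mask: True entries are positions a token may NOT attend to.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1) if causal else None
        h, _ = self.attn(x, x, x, attn_mask=mask)
        return self.proj(h)

def autoregressive_decode(decoder, prefix, action_dims=7):
    """Baseline: one full decoder pass per action dimension (7 passes per timestep)."""
    tokens, outs = prefix, []
    for _ in range(action_dims):
        h = decoder(tokens, causal=True)
        nxt = h[:, -1:, :]                    # only the last position is new
        outs.append(nxt)
        tokens = torch.cat([tokens, nxt], 1)  # fed back sequentially
    return torch.cat(outs, 1)

def parallel_decode(decoder, prefix, empty_action_embs):
    """OFT-style: all action positions produced in a single bidirectional pass."""
    tokens = torch.cat([prefix, empty_action_embs], 1)
    h = decoder(tokens, causal=False)         # bidirectional attention mask
    return h[:, -empty_action_embs.shape[1]:, :]

dec = ToyDecoder()
prefix = torch.randn(1, 10, 32)               # stand-in for image + language embeddings
empty = torch.zeros(1, 7, 32)                 # "empty" action embeddings (positional info only)
print(parallel_decode(dec, prefix, empty).shape)  # torch.Size([1, 7, 32])
```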
  • The Optimized Fine-Tuning (OFT) Recipe: The empirical results from the LIBERO benchmark (discussed in Section 6) lead to the final OFT recipe, which combines the best-performing and most efficient components:

    1. Parallel decoding with action chunking.
    2. Continuous action representation.
    3. L1 regression learning objective.
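A minimal sketch of how these three ingredients combine at training time, under assumed sizes (hidden width, chunk length, and action dimension are illustrative, not the paper's exact values):

```python
import torch
import torch.nn as nn

HIDDEN, CHUNK, ACT_DIM = 4096, 8, 7

# MLP "action head" replacing the LM output layer: hidden states -> continuous actions.
action_head = nn.Sequential(
    nn.Linear(HIDDEN, 1024), nn.ReLU(), nn.Linear(1024, ACT_DIM),
)

# Hidden states at the CHUNK action positions from one bidirectional forward pass
# (in the real model these come from the fine-tuned VLA backbone).
hidden_states = torch.randn(2, CHUNK, HIDDEN)            # (batch, chunk, hidden)
target_actions = torch.rand(2, CHUNK, ACT_DIM) * 2 - 1   # ground truth, normalized to [-1, 1]

pred_actions = action_head(hidden_states)                # all timesteps predicted at once
loss = (pred_actions - target_actions).abs().mean()      # L1 regression objective
loss.backward()                                          # one fine-tuning step (e.g., with LoRA)
```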
  • Augmenting OFT with FiLM (The OFT+ Recipe): For the more challenging real-world ALOHA tasks with multiple camera views, the authors found the model struggled with language grounding (i.e., correctly following the text command). They augmented the recipe with Feature-wise Linear Modulation (FiLM) to solve this.

    • Implementation: The language instruction is embedded and projected into scaling (γ) and shifting (β) vectors. These vectors then modulate the visual features extracted by the Vision Transformer (ViT).

    • Formula: The modulation is applied via an affine transformation: FiLM(F | γ, β) = F̂ = (1 + γ) ⊙ F + β

      • F is the set of visual features (patch embeddings).
      • γ and β are the scaling and shifting vectors derived from the language instruction.
      • ⊙ denotes element-wise multiplication.
      • F̂ are the modulated visual features.
    • Key Detail: A crucial finding was that applying FiLM to modulate each hidden dimension across all spatial patch embeddings worked much better than modulating each patch embedding independently. This mimics how FiLM works in convolutional networks and proved essential for strong language following. This augmented recipe is called OFT+.
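Below is a hedged PyTorch sketch of this FiLM variant, with one (γ, β) pair per hidden channel shared across all patch embeddings; layer sizes and names are assumptions for illustration, not taken from the released code:

```python
import torch
import torch.nn as nn

VIS_DIM, LANG_DIM, NUM_PATCHES = 1024, 4096, 256  # illustrative sizes

class FiLMBlock(nn.Module):
    def __init__(self):
        super().__init__()
        # Project the pooled language embedding to per-channel gamma and beta.
        self.to_gamma_beta = nn.Linear(LANG_DIM, 2 * VIS_DIM)

    def forward(self, patch_feats, lang_emb):
        # patch_feats: (batch, num_patches, VIS_DIM); lang_emb: (batch, LANG_DIM)
        gamma, beta = self.to_gamma_beta(lang_emb).chunk(2, dim=-1)
        gamma = gamma.unsqueeze(1)   # broadcast the same gamma over every patch
        beta = beta.unsqueeze(1)
        return (1 + gamma) * patch_feats + beta   # FiLM(F | gamma, beta)

film = FiLMBlock()
F = torch.randn(1, NUM_PATCHES, VIS_DIM)  # ViT patch embeddings
lang = torch.randn(1, LANG_DIM)           # pooled language instruction embedding
print(film(F, lang).shape)                # torch.Size([1, 256, 1024])
```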


5. Experimental Setup

  • Datasets:

    1. LIBERO Benchmark: A standardized simulation environment for robot manipulation.

      • Robot: Franka Emika Panda arm.

      • Tasks: The study uses four task suites, each with 10 tasks and 500 expert demonstrations. The suites are designed to test different types of generalization: LIBERO-Spatial (different layouts), LIBERO-Object (different objects), LIBERO-Goal (different goals), and LIBERO-Long (long-horizon tasks).

      • Data: Camera images, robot state, language annotations, and end-effector pose actions.

        Fig. 3: LIBERO simulation benchmark [25] task suites. We study VLA fine-tuning design decisions using four representative task suites (LIBERO-Spatial, LIBERO-Object, LIBERO-Goal, LIBERO-Long, each with 10 tasks); two of the ten tasks per suite are depicted.

    2. ALOHA Robot: A real-world, low-cost bimanual manipulation platform.

      • Hardware: Two ViperX 300S arms.
      • Control: Operates at a high frequency of 25 Hz.
      • Inputs: Three camera views (one top-down, two wrist-mounted) and 14-D joint angles.
      • Tasks: Four dexterous tasks designed for this study: fold shorts, fold shirt, scoop X into bowl, and put X into pot. These tasks test deformable object manipulation, long-horizon skills, tool use, and language-conditioned control.
  • Evaluation Metrics:

    1. Success Rate (SR):
      • Conceptual Definition: The primary metric for task performance. It measures the percentage of trials where the robot successfully completes the entire task as defined. A higher SR indicates a more reliable policy.
      • Mathematical Formula: Not explicitly provided, but the standard definition is: Success Rate = (Number of Successful Trials / Total Number of Trials) × 100%.
    2. Throughput (Hz):
      • Conceptual Definition: Measures inference efficiency by quantifying how many single-timestep actions the model can generate per second. Higher throughput is critical for high-frequency robots.
      • Formula: Throughput = Total Actions Generated / Total Time Taken = (Chunk Size × Number of Queries) / Total Time Taken. A short measurement sketch follows this list.
    3. Latency (sec):
      • Conceptual Definition: Measures the time required to perform a single inference pass to generate one action or one chunk of actions. Lower latency is better for real-time reactivity.
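As referenced above, here is a small sketch of how these two efficiency metrics can be measured for a chunking policy; the timing loop and policy interface are placeholders, not the paper's benchmarking code:

```python
import time

def measure_efficiency(policy, observation, chunk_size: int, num_queries: int = 100):
    """Return (throughput in actions/sec, latency in sec/chunk) for a chunking policy."""
    start = time.perf_counter()
    for _ in range(num_queries):
        _ = policy(observation)                       # each call returns one action chunk
    elapsed = time.perf_counter() - start
    latency = elapsed / num_queries                    # seconds per chunk (one model query)
    throughput = (chunk_size * num_queries) / elapsed  # single-timestep actions per second
    return throughput, latency

dummy_policy = lambda obs: [obs] * 8   # stand-in policy returning an 8-action chunk
print(measure_efficiency(dummy_policy, observation=[0.0] * 7, chunk_size=8))
```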
  • Baselines:

    • For LIBERO:
      • OpenVLA (fine-tuned): The original model fine-tuned with its default autoregressive recipe.
      • From-scratch Imitation Learning (IL): Diffusion Policy, MDT.
      • Fine-tuned Pretrained Policies: Octo, DiT Policy, Seer, π₀. These are strong, contemporary models.
    • For ALOHA:
      • ACT and Diffusion Policy: Popular and strong IL policies trained from scratch.

      • RDT-1B and π₀: State-of-the-art diffusion-based VLAs, chosen because they were pretrained on bimanual data and are expected to be strong competitors. They are fine-tuned using their respective authors' recommended recipes.


6. Results & Analysis

Core Results on LIBERO Benchmark

The experiments on LIBERO were designed to answer how each design decision affects performance and efficiency.

1. Task Performance (Success Rate):

Below is the manually transcribed Table I from the paper, comparing the success rates of various policies on the four LIBERO suites.

| Policy | Spatial SR (%) | Object SR (%) | Goal SR (%) | Long SR (%) | Average SR (%) |
| --- | --- | --- | --- | --- | --- |
| *Policy inputs: third-person image, language instruction* | | | | | |
| Diffusion Policy (scratch) [5] | 78.3 | 92.5 | 68.3 | 50.5 | 72.4 |
| Octo (fine-tuned) [47] | 78.9 | 85.7 | 84.6 | 51.1 | 75.1 |
| DiT Policy (fine-tuned) [13] | 84.2 | 96.3 | 85.4 | 63.8 | 82.4 |
| Seer (scratch) [48] | – | – | – | 78.7 | – |
| Seer (pretrained on LIBERO-90, then fine-tuned) [48] | – | – | – | 87.7 | – |
| OpenVLA (fine-tuned) [22] | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| OpenVLA (fine-tuned) + PD&AC | 91.3 | 92.7 | 90.5 | 86.5 | 90.2 |
| OpenVLA (fine-tuned) + PD&AC, Cont-L1 | 96.2 | 98.3 | 96.2 | 90.7 | 95.3 |
| OpenVLA (fine-tuned) + PD&AC, Cont-Diffusion | 96.9 | 98.1 | 95.5 | 91.1 | 95.4 |
| *Policy inputs: third-person image, wrist image, proprio, language instruction* | | | | | |
| MDT (scratch) [38] | 78.5 | 87.5 | 73.5 | 64.8 | 76.1 |
| π₀ + FAST (fine-tuned) [36] | 96.4 | 96.8 | 88.6 | 60.2 | 85.5 |
| π₀ (fine-tuned) [3] | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 |
| OpenVLA-OFT (ours): OpenVLA (fine-tuned) + PD&AC, Cont-L1 | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 |

  • Key Insight 1: Simply switching to Parallel Decoding & Action Chunking (PD&AC) from the autoregressive baseline boosts the average success rate of OpenVLA from 76.5% to 90.2%. This suggests that action chunking helps the policy produce smoother, more coherent actions, especially in long-horizon tasks (LIBERO-Long SR jumped from 53.7% to 86.5%).
  • Key Insight 2: Using continuous actions (Cont-L1, Cont-Diffusion) further improves performance to ~95.4%, likely by avoiding the precision loss of discretization.
  • Key Insight 3: The simple L1 regression objective performs just as well as the more complex diffusion objective, suggesting that for fine-tuning, the high-capacity VLA can model the action distribution effectively without needing a powerful generative model.
  • Key Insight 4: The full OpenVLA-OFT recipe, which uses L1 regression and adds more sensory inputs (wrist camera, proprioception), achieves a new state-of-the-art of 97.1%, outperforming even stronger models like π0π₀.

2. Inference Efficiency:

Below is the manually transcribed Table II, showing the speed improvements.

| Model variant | Throughput (Hz) ↑ | Latency (sec) ↓ |
| --- | --- | --- |
| OpenVLA | 4.2 | 0.240 |
| + PD | 15.9 | 0.063 |
| + PD&AC | 108.8 | 0.074 |
| + PD&AC, Cont-L1 | 109.7 | 0.073 |
| + PD&AC, Cont-Diffusion | 10.1 | 0.792 |
| + PD&AC, Cont-L1 + additional inputs (wrist img, proprio) | 71.4 | 0.112 |

  • Key Insight 1: Parallel Decoding (+ PD) alone gives a roughly 4× speedup over the baseline (4.2 Hz to 15.9 Hz).
  • Key Insight 2: Combining Parallel Decoding with Action Chunking (+PD&AC) yields a massive 26x throughput increase (4.2 Hz to 109.7 Hz). Although latency per chunk increases slightly (due to a longer sequence in the transformer), the total number of actions per second skyrockets.
  • Key Insight 3: The L1 regression model (Cont-L1) has negligible overhead compared to the discrete model. In contrast, the Diffusion variant is much slower due to its 50 denoising steps, making its latency ~10x higher than the L1 model.
  • Key Insight 4: The final OpenVLA-OFT model can process additional inputs (doubling the visual tokens) and still maintain a very high throughput of 71.4 Hz, which is more than sufficient for high-frequency control.
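A quick arithmetic check of these throughput claims, using only the Table II numbers (the inferred chunk size is a back-of-the-envelope estimate, not a figure stated in this summary):

```python
# Throughput gain of the chunked L1 variant over the autoregressive baseline.
baseline_hz, oft_hz = 4.2, 109.7
print(round(oft_hz / baseline_hz, 1))   # ~26.1, matching the reported 26x

# With chunking, throughput ~= actions per chunk / latency per chunk, so the
# Table II numbers imply roughly 8 actions generated per forward pass.
print(round(109.7 * 0.073))             # ~8
```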

Real-World Results on ALOHA Robot

The ALOHA experiments tested the OFT+ recipe (with FiLM) in a challenging bimanual setting, where the base OpenVLA model had no prior experience.

1. Task Performance:

Bar chart: success rates of OpenVLA-OFT+ (ours) versus ACT, Diffusion Policy, RDT-1B, and π₀ on the real-world ALOHA tasks. OpenVLA-OFT+ achieves the best average success rate (87.8%), reaching 100% on "fold shorts," "fold shirt," and "scoop X into bowl," and the highest success rate (51.3%) on "put X into pot."

  • Overall Performance: As seen in the bar chart, OpenVLA-OFT+ achieves the highest average success rate (87.8%), outperforming strong VLAs like π₀ (77.5%) and RDT-1B (78.4%), as well as from-scratch policies ACT and Diffusion Policy. This is remarkable given that π₀ and RDT-1B were pretrained on bimanual data, while OpenVLA was not. This highlights that the fine-tuning recipe can be more critical than the specificity of the pretraining data.

    Fig. 6: Sample successful OpenVLA-OFT+ rollouts on the ALOHA robot. OpenVLA-OFT+ can fold clothes, use a metal spoon to scoop and pour targeted trail mix ingredients into a bowl, and place…

2. Language Following & FiLM Ablation:

Fig. 5: ALOHA language following results. Success rates in approaching language-specified target objects for language-dependent tasks. OpenVLA-OFT+ (ours) shows the strongest language following, e.g., 100% on "scoop X into bowl" and 89.6% on average; removing FiLM sharply reduces success rates, and fine-tuned VLAs follow the user's command more frequently than policies trained from scratch.

  • Importance of FiLM: The ablation study is definitive. When FiLM is removed (OpenVLA-OFT (no FiLM)), the model's ability to follow language instructions drops to chance level (33.3%). This proves that FiLM is essential for robust language grounding in complex, multi-view settings where spurious visual correlations could otherwise distract the model. OpenVLA-OFT+ with FiLM achieves perfect (100%) language following on the scooping task and is the best overall.

3. Inference Efficiency:

Below is the manually transcribed Table III from the paper.

| Policy | Throughput (Hz) ↑ | Latency (sec) ↓ |
| --- | --- | --- |
| OpenVLA | 1.8 | 0.543 |
| OpenVLA-OFT+ | 77.9 | 0.321 |
| RDT-1B | 84.1 | 0.297 |
| Diffusion Policy | 267.4 | 0.090 |
| π₀ | 291.6 | 0.086 |
| ACT | 432.8 | 0.058 |

  • Efficiency Analysis: The original OpenVLA is impractically slow for this setup (1.8 Hz). OpenVLA-OFT+ achieves a throughput of 77.9 Hz, a 43x speedup, and is fast enough for the 25 Hz controller. While smaller, specialized models like ACT and π₀ (with a JAX implementation) are faster, OpenVLA-OFT+ is competitive with RDT-1B despite being ~7x larger, because it avoids the multi-step denoising process of diffusion.

Ablation on Pretrained Representation

An additional experiment tested whether the robot-specific pretraining of OpenVLA was still useful when using the completely different OFT fine-tuning recipe. They fine-tuned the base VLM (Prismatic) directly on LIBERO.

Below is the manually transcribed Table XIV.

| Policy | Spatial SR (%) | Object SR (%) | Goal SR (%) | Long SR (%) | Average SR (%) |
| --- | --- | --- | --- | --- | --- |
| OpenVLA-OFT | 97.6 | 98.4 | 97.9 | 94.5 | 97.1 |
| OpenVLA-OFT (scratch) | 94.3 | 95.2 | 91.7 | 86.5 | 91.9 |

  • Key Insight: The model fine-tuned from the full OpenVLA checkpoint (OpenVLA-OFT) outperforms the one fine-tuned from the base VLM (OpenVLA-OFT (scratch)) by 5.2% absolute. This confirms that the representations learned during the large-scale robot pretraining are still highly beneficial, even when the fine-tuning methodology is substantially different.


7. Conclusion & Reflections

  • Conclusion Summary: This work provides a clear and powerful recipe for adapting VLAs to new robotic applications. The authors demonstrate that by replacing slow autoregressive decoding with parallel decoding and action chunking, and using a continuous action representation with a simple L1 regression objective, one can achieve dramatic improvements in both inference speed and task success. The proposed OpenVLA-OFT model sets a new state-of-the-art on the LIBERO benchmark. Furthermore, with the addition of FiLM for language grounding (OFT+), the recipe enables a single-arm-pretrained VLA to excel at complex, high-frequency bimanual tasks in the real world, outperforming more sophisticated models.

  • Limitations & Future Work: The authors acknowledge several limitations:

    1. Multimodal Demonstrations: The L1 objective learns the average or median behavior from demonstrations. It might struggle in tasks where multiple distinct, valid strategies exist (multimodality), which diffusion models are theoretically better equipped to handle.
    2. Applicability to Pretraining: The study focuses exclusively on fine-tuning. It remains an open question whether the simple OFT recipe is sufficient for large-scale pretraining, or if more expressive models like diffusion are necessary to learn from massive, diverse datasets.
    3. Inconsistent Language Grounding: It's unclear why the base model struggled with language grounding on the ALOHA setup but not in LIBERO. This discrepancy, whether due to the bimanual setup, multiple cameras, or other factors, warrants further investigation.
  • Personal Insights & Critique:

    • This is an excellent example of engineering-driven research that provides immense practical value. Instead of chasing novelty with a new architecture, the paper focuses on a critical, under-explored problem: how to actually use these powerful models effectively. The resulting OFT recipe is simple, elegant, and demonstrably superior.
    • The most significant insight is that for adapting a VLA, a well-designed fine-tuning recipe can be more impactful than the exact composition of the pretraining data. The success on the ALOHA robot is a powerful testament to this.
    • The finding that a simple L1 loss can match diffusion models in performance is a strong statement. It suggests that for many robotic tasks, the complexity may lie more in perception and reasoning (which the large VLA backbone handles) than in generating a complex action distribution. This has important implications for practitioners, as L1 regression is much faster to train and easier to implement and debug than diffusion.
    • The work opens up exciting avenues for making large robotic models more accessible. By providing a clear, open-source recipe for high-performance adaptation, it lowers the barrier for deploying VLAs on new hardware and tasks.
