
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

Published: 09/12/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

SimpleVLA-RL is introduced to enhance the training of Vision-Language-Action models using reinforcement learning, addressing data scarcity and generalization issues. Results show state-of-the-art performance on OpenVLA-OFT, reducing reliance on labeled data.

Abstract

Vision-Language-Action (VLA) models have recently emerged as a powerful paradigm for robotic manipulation. Despite substantial progress enabled by large-scale pretraining and supervised fine-tuning (SFT), these models face two fundamental challenges: (i) the scarcity and high cost of large-scale human-operated robotic trajectories required for SFT scaling, and (ii) limited generalization to tasks involving distribution shift. Recent breakthroughs in Large Reasoning Models (LRMs) demonstrate that reinforcement learning (RL) can dramatically enhance step-by-step reasoning capabilities, raising a natural question: Can RL similarly improve the long-horizon step-by-step action planning of VLA? In this work, we introduce SimpleVLA-RL, an efficient RL framework tailored for VLA models. Building upon veRL, we introduce VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. When applied to OpenVLA-OFT, SimpleVLA-RL achieves SoTA performance on LIBERO and even outperforms $\pi_0$ on RoboTwin 1.0 & 2.0 with the exploration-enhancing strategies we introduce. SimpleVLA-RL not only reduces dependence on large-scale data and enables robust generalization, but also remarkably surpasses SFT in real-world tasks. Moreover, we identify a novel phenomenon "pushcut" during RL training, wherein the policy discovers previously unseen patterns beyond those seen in the previous training process. Github: https://github.com/PRIME-RL/SimpleVLA-RL


In-depth Reading


1. Bibliographic Information

1.1. Title

SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

1.2. Authors

Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, Dehui Wang, Dingxiang Luo, Yuchen Fan, Youbang Sun, Jia Zeng, Jiangmiao Pang, Shanghang Zhang, Yu Wang, Yao Mu, Bowen Zhou, and Ning Ding.

The authors are affiliated with several institutions, including Tsinghua University and other research labs. The large number of authors and the affiliations suggest a significant collaborative effort, which is common in large-scale deep learning and robotics research.

1.3. Journal/Conference

The paper is available on arXiv, which is a preprint server for academic articles. This means it has not yet undergone formal peer review for publication in a journal or conference.

1.4. Publication Year

The paper was posted to arXiv in September 2025.

1.5. Abstract

The abstract introduces Vision-Language-Action (VLA) models as a powerful tool for robotic manipulation. It identifies two major challenges: the high cost and scarcity of human-operated robot data required for Supervised Fine-Tuning (SFT), and the models' limited ability to generalize. Inspired by breakthroughs in Large Reasoning Models, where Reinforcement Learning (RL) enhanced reasoning, the authors ask whether RL can similarly improve VLA models' action planning. They propose SimpleVLA-RL, an efficient RL framework for VLAs that introduces VLA-specific enhancements such as specialized trajectory sampling and parallelization. When applied to the OpenVLA-OFT model, SimpleVLA-RL achieves state-of-the-art (SoTA) performance on the LIBERO benchmark and outperforms the $\pi_0$ model on RoboTwin 1.0 & 2.0. The framework not only mitigates the need for large datasets and improves generalization but also surpasses SFT performance in real-world tasks. A novel phenomenon termed "pushcut" is identified, where the model discovers new behaviors not seen in the training data.

2. Executive Summary

2.1. Background & Motivation

2.1.1. Core Problem

The primary problem this paper addresses is the scalability and generalization of Vision-Language-Action (VLA) models for robotic manipulation. While VLAs show great promise, their training heavily relies on Supervised Fine-Tuning (SFT) using large datasets of human-operated robot trajectories. This approach faces two critical bottlenecks:

  1. Data Scarcity and Cost: Collecting high-quality, diverse robotic manipulation data is extremely expensive, time-consuming, and difficult to scale. This data bottleneck severely limits the potential for improving VLA models through SFT alone.
  2. Poor Generalization: VLA models trained via SFT often struggle to generalize to new tasks, objects, or environments that differ from their training data. This is because SFT primarily teaches the model to imitate the specific patterns present in the demonstrations, rather than learning generalizable, robust strategies.

2.1.2. Importance and Gaps

VLAs are a cornerstone of modern robotics, aiming to create general-purpose robots that can understand and execute tasks based on natural language commands and visual input. Overcoming the data and generalization limitations is crucial for moving from narrow, lab-controlled robots to adaptable, real-world assistants. Prior research has focused on scaling up datasets or improving model architectures, but the fundamental limitations of the SFT paradigm remain.

2.1.3. Innovative Idea

The paper draws inspiration from recent successes in Large Reasoning Models (LRMs), such as DeepSeek-R1, where Reinforcement Learning (RL) was used to significantly enhance step-by-step reasoning abilities, often using simple, outcome-based rewards (e.g., was the final answer correct?). The authors' central hypothesis is that a similar approach can be applied to VLAs. They propose that RL can improve a VLA's step-by-step action planning, allowing the model to learn from its own experience through trial and error, thereby reducing its dependence on expensive human demonstrations and improving its ability to generalize.

2.2. Main Contributions / Findings

The paper makes several key contributions:

  1. An Efficient Online RL Framework for VLAs (SimpleVLA-RL): The authors develop and release an end-to-end framework specifically designed for applying online RL to VLA models. It adapts existing LLM RL frameworks (veRL) with VLA-specific features like interactive trajectory sampling and parallelized multi-environment rendering to handle the unique demands of robotic simulation.

  2. State-of-the-Art Performance: By applying SimpleVLA-RL to an OpenVLA-OFT model, they achieve SoTA results on standard robotics benchmarks. The RL-trained model significantly outperforms strong SFT-based baselines on LIBERO and RoboTwin 1.0 & 2.0, particularly in complex, long-horizon tasks.

  3. Demonstrated Data Efficiency and Generalization:

    • Data Efficiency: The paper shows that with only a single demonstration per task, RL can boost performance to levels that surpass SFT trained on hundreds of demonstrations. For example, on LIBERO-Long tasks, the success rate jumped from 17.1% (one-shot SFT) to 91.7% after RL.
    • Generalization: Experiments on unseen tasks, objects, and spatial configurations show that SimpleVLA-RL leads to robust generalization, whereas SFT tends to overfit to the training data, often resulting in performance degradation on unseen scenarios.
  4. Effective Sim-to-Real Transfer: The policies trained entirely in simulation with SimpleVLA-RL demonstrate significant performance gains when deployed on a real robot, outperforming both the SFT-only model and other baselines. This highlights a viable path for scaling up real-world robot capabilities using low-cost simulation.

  5. Discovery of the "Pushcut" Phenomenon: The authors identify an emergent behavior where the RL-trained agent discovers novel and more efficient strategies (e.g., pushing an object instead of grasping it) that were never present in the original demonstration data. This showcases RL's ability to move beyond mere imitation and find creative solutions.


3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Vision-Language-Action (VLA) Models

A Vision-Language-Action (VLA) model is a type of AI model designed for robotics. It integrates three modalities:

  • Vision: It processes visual input from cameras (e.g., RGB images) to perceive the environment.
  • Language: It understands natural language instructions given by a human (e.g., "pick up the red block and place it in the blue bowl").
  • Action: It generates a sequence of control commands (actions) for a robot to execute in order to complete the instructed task.

Essentially, VLAs are large multimodal models, often built on top of Large Language Model (LLM) architectures, that aim to bridge the gap between human instruction and robotic execution.

3.1.2. Supervised Fine-Tuning (SFT)

Supervised Fine-Tuning (SFT) is a training process where a pre-trained model (like an LLM) is further trained on a specific, labeled dataset. In the context of VLAs, this is a form of imitation learning. The dataset consists of expert demonstrations, where each data point is a trajectory containing a sequence of (state, action) pairs. The state includes vision and language inputs, and the action is the command executed by the expert. The model learns to predict the expert's action given a particular state, effectively learning to "imitate" the expert.
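To make this imitation-learning objective concrete, below is a minimal sketch (not the paper's code) of a token-level SFT loss for a VLA that outputs discrete action tokens; the tensor names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sft_loss(action_logits: torch.Tensor, expert_action_tokens: torch.Tensor) -> torch.Tensor:
    """Behavioral-cloning loss for a token-based VLA.

    action_logits:        (batch, seq_len, vocab_size) scores over action tokens
    expert_action_tokens: (batch, seq_len) ground-truth tokens from demonstrations
    """
    # Standard cross-entropy pushes the model toward the expert's action at each step.
    return F.cross_entropy(
        action_logits.reshape(-1, action_logits.size(-1)),
        expert_action_tokens.reshape(-1),
    )
```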

3.1.3. Reinforcement Learning (RL)

Reinforcement Learning (RL) is a machine learning paradigm where an agent learns to make decisions by interacting with an environment. The core components are:

  • Agent: The learner or decision-maker (in this case, the VLA policy).
  • Environment: The world the agent interacts with (a physical robot setup or a simulation).
  • State ($s$): A representation of the environment at a specific time.
  • Action ($a$): A decision made by the agent.
  • Reward ($r$): A feedback signal from the environment indicating how good an action was.

The agent's goal is to learn a policy ($\pi$), a strategy for choosing actions that maximizes the cumulative reward over time. Unlike SFT, which learns from a fixed set of "correct" examples, RL allows the agent to learn from trial and error.

3.1.4. Online vs. Offline RL

  • Online RL: The agent learns by directly and continuously interacting with the environment, collecting new experience (trajectories), and updating its policy based on that new experience. This allows for exploration and discovery of new strategies.
  • Offline RL: The agent learns from a fixed, pre-collected dataset of interactions, without any further interaction with the environment. This is more similar to SFT but uses RL algorithms to learn from the reward signals in the dataset.

SimpleVLA-RL is an online RL framework.

3.2. Previous Works

3.2.1. OpenVLA and OpenVLA-OFT

OpenVLA is the VLA model architecture used as the backbone in this paper. It is an open-source model built on an LLM (LLaMA-2) that processes interleaved vision and language inputs to produce actions. OpenVLA-OFT is a variant trained with an optimized fine-tuning (OFT) recipe that improves adaptation efficiency and task performance. A key feature relevant to this paper is its token-based action representation: robot actions are discretized into a vocabulary of "action tokens," and the model predicts a probability distribution over these tokens, much as an LLM predicts text tokens. This makes it naturally compatible with RL algorithms that operate on probability distributions.
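As a rough illustration of this token-based action representation, the sketch below bins continuous action values into a fixed set of discrete tokens and maps them back; the bin count and value range are assumptions, not the exact OpenVLA-OFT configuration.

```python
import numpy as np

NUM_BINS = 256                        # assumed number of action tokens per dimension
ACTION_LOW, ACTION_HIGH = -1.0, 1.0   # assumed normalized action range

def action_to_tokens(action: np.ndarray) -> np.ndarray:
    """Discretize a continuous action vector into integer action-token indices."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    scaled = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)  # map to [0, 1]
    return np.round(scaled * (NUM_BINS - 1)).astype(int)

def tokens_to_action(tokens: np.ndarray) -> np.ndarray:
    """Recover (approximate) continuous actions from token indices."""
    return tokens / (NUM_BINS - 1) * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW
```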

3.2.2. DeepSeek-R1 and the Rise of RL for Reasoning

DeepSeek-R1 is a Large Reasoning Model (LRM) that demonstrated remarkable gains in complex reasoning tasks by using online RL. A key takeaway from its success, and the one that motivates this paper, is that significant improvements can be achieved with a very simple, rule-based, outcome-only reward (e.g., a reward of 1 if the final answer to a math problem is correct, 0 otherwise). This showed that RL could teach a model how to reason step-by-step without needing complex, step-by-step reward engineering.

3.2.3. Group Relative Policy Optimization (GRPO)

Group Relative Policy Optimization (GRPO) is the specific RL algorithm used in this paper. It is a variant of policy optimization methods that is value-function-free, meaning it does not need to learn a separate model to estimate the value of states. Instead, it computes the advantage of a trajectory by comparing its total reward to the average reward of a group of trajectories sampled from the same starting state. This simplifies the RL process.

The GRPO objective is given as:

$ J_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{s_0 \sim \mathcal{D},\, \{\tau_i\} \sim \pi_{\theta_{\mathrm{old}}}} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|\tau_i|} \sum_{t=1}^{|\tau_i|} \min\left( r_{i,t}(\theta)\hat{A}_i,\ \mathrm{clip}\left(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_i \right) - \beta\, D_{\mathrm{KL}}\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right) \right] $

where:

  • $\theta$: The parameters of the policy network.
  • $\pi_{\theta_{\mathrm{old}}}$: The old policy used to generate trajectories.
  • $G$: The number of trajectories sampled in a group.
  • $r_{i,t}(\theta) = \frac{\pi_\theta(a_{i,t} \mid s_{i,t})}{\pi_{\theta_{\mathrm{old}}}(a_{i,t} \mid s_{i,t})}$: The importance sampling ratio, which measures how likely the new policy is to take a given action compared to the old policy.
  • $\hat{A}_i = \frac{R_i - \mathrm{mean}(\{R_i\}_{i=1}^{G})}{\mathrm{std}(\{R_i\}_{i=1}^{G})}$: The normalized advantage, where $R_i$ is the total reward for trajectory $i$. Each trajectory's reward is normalized by the mean and standard deviation of rewards within its group, indicating whether the trajectory was better or worse than average.
  • $\mathrm{clip}(\cdot)$: The clipping function from PPO, which prevents policy updates from being too large and destabilizing training.
  • $\beta D_{\mathrm{KL}}(\pi_\theta \| \pi_{\mathrm{ref}})$: A regularization term that penalizes the new policy for diverging too far from a reference policy $\pi_{\mathrm{ref}}$. This term is removed in SimpleVLA-RL to simplify training and encourage more exploration.
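As a concrete illustration of the group-relative advantage and importance ratio defined above, here is a minimal PyTorch sketch. Tensor shapes and function names are illustrative assumptions, not the authors' implementation.

```python
import torch

def group_relative_advantage(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: (G,) total outcome reward of each trajectory in one group."""
    # Normalize each trajectory's reward against its group's statistics.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def importance_ratio(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """Per-token ratio pi_theta / pi_theta_old, computed from log-probabilities
    of the sampled action tokens under the current and old policies."""
    return torch.exp(logp_new - logp_old)
```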

3.3. Technological Evolution

The field of robotic manipulation has evolved from classical control methods to learning-based approaches.

  1. Early Learning: Early methods often used RL with hand-crafted features and dense, engineered reward functions. These were difficult to scale and generalize.
  2. Imitation Learning (IL) / Behavioral Cloning (BC): With the rise of deep learning, imitation learning became dominant. This is essentially the SFT paradigm, where models learn from expert demonstrations. This led to significant progress but created the data dependency problem.
  3. VLA Models: The integration of large pre-trained models (vision and language) led to the VLA paradigm, which improved generalization by leveraging knowledge from vast web-scale datasets. However, the final stage of training still relied heavily on SFT.
  4. RL for VLAs: This paper represents the latest trend, which seeks to combine the strengths of large pre-trained VLAs with the exploratory and adaptive power of RL. The goal is to move beyond simple imitation and enable models to learn and improve from their own interactions, addressing the core limitations of SFT.

3.4. Differentiation Analysis

Compared to related works, SimpleVLA-RL makes the following key differentiations:

  • vs. SFT-based VLAs (e.g., OpenVLA, UniVLA): The core innovation is the use of online RL instead of relying solely on imitation learning. This allows the model to learn from failure, explore, and discover novel strategies ("pushcut"), which is impossible with SFT.
  • vs. Traditional Robotics RL: It avoids complex, hand-engineered reward functions by using a simple, scalable binary outcome reward. This is made possible by starting with a capable pre-trained VLA model.
  • vs. Other VLA-RL works: While other works have explored RL for VLAs, this paper provides a particularly systematic study focused on:
    • An efficient, open-source framework tailored for the unique challenges of VLA rollout (interactive sampling, parallel simulation).

    • A comprehensive analysis of how RL addresses the key bottlenecks of data scarcity and generalization.

    • Demonstration of strong sim-to-real transfer.

    • Qualitative analysis of emergent behaviors.


4. Methodology

4.1. Principles

The core principle of SimpleVLA-RL is to adapt the successful online RL paradigm from Large Reasoning Models to the domain of robotic manipulation. The intuition is that just as RL can refine a model's step-by-step reasoning process to arrive at a correct final answer, it can also refine a VLA's step-by-step action generation process to achieve a successful task outcome. The method leverages a capable SFT-initialized VLA model and improves it through trial-and-error exploration in a simulated environment, guided only by a simple binary signal of task success or failure.

4.2. Core Methodology In-depth

The SimpleVLA-RL framework follows a standard online RL loop: rollout (generate experience), rewarding (evaluate outcomes), and training (update the policy). The paper introduces specific modifications to make this loop efficient and effective for VLA models. The overall workflow is depicted in Figure 2 from the paper.

Figure 2 | Overview of SimpleVLA-RL. The figure is a schematic of the SimpleVLA-RL workflow, covering the limited offline trajectories, the policy's rollout process, and reward computation. Through interaction with the environment, the state $s_t$ and action $a_t$ are updated, ultimately yielding a set of trajectories and their rewards.

4.2.1. Step 1: Interactive VLA Rollout

The first step is to generate trajectories by having the policy interact with the environment. This process for VLAs is fundamentally different from that for LLMs.

  • Challenge: LLM rollouts are simple auto-regressive decoding processes. In contrast, VLA rollouts are interactive and closed-loop. Each action the robot takes changes the state of the world, and the VLA model must receive the new state (new camera images, new robot joint positions) to decide on the next action. This requires continuous communication with a simulator or real robot.

  • Solution: The framework implements an interactive rollout mechanism. To enable exploration, which is crucial for RL, the VLA model must generate diverse trajectories from the same starting point. This is achieved by:

    1. Using a VLA model (OpenVLA-OFT) that outputs a probability distribution over a discrete set of action tokens.

    2. During rollout, actions are chosen by random sampling from this distribution, controlled by a temperature parameter. Higher temperatures lead to more random (i.e., diverse) actions.

      The paper provides pseudo-code (Listing 1) illustrating the transition from a standard LLM rollout function to the interactive VLA rollout function, which includes parallel environment initialization and stepping.

      Figure 1 | Overview of SimpleVLA-RL. SimpleVLA-RL is an efficient RL framework for VLA that improves long-horizon planning under data scarcity, outperforms SFT in simulation and real-world tasks, and reveals new behavior patterns. The figure shows success rates on the LIBERO and RoboTwin tasks as well as the policy improvement behind the discovery of new behavior patterns; overall, SimpleVLA-RL markedly surpasses the SFT model in long-horizon planning.
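To complement the paper's Listing 1, here is a minimal sketch of the interactive, closed-loop rollout with temperature-scaled sampling over parallel environments. The environment and policy interfaces (`envs.reset`, `envs.step`, `policy.action_logits`) are hypothetical placeholders, not the actual SimpleVLA-RL API.

```python
import torch

def interactive_rollout(policy, envs, instruction, max_steps=300, temperature=1.6):
    """Closed-loop rollout: each sampled action changes the environment state,
    and the new observation conditions the next action."""
    obs = envs.reset()                                     # initial observations (images, proprioception)
    done = torch.zeros(envs.num_envs, dtype=torch.bool)
    trajectories = [[] for _ in range(envs.num_envs)]

    for _ in range(max_steps):
        # Temperature-scaled sampling over discrete action tokens yields diverse trajectories.
        logits = policy.action_logits(obs, instruction) / temperature
        actions = torch.distributions.Categorical(logits=logits).sample()

        obs, success, step_done = envs.step(actions)       # simulator advances one control step
        for i in range(envs.num_envs):
            if not done[i]:
                trajectories[i].append((actions[i], success[i]))
        done |= step_done                                  # assumed boolean tensor of terminations
        if done.all():
            break
    return trajectories
```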

4.2.2. Step 2: Outcome Reward Modeling

After a trajectory is completed (either by succeeding, failing, or reaching a time limit), it must be assigned a reward.

  • Principle: Following the DeepSeek-R1 approach, SimpleVLA-RL uses a simple, sparse, binary outcome reward. This avoids the need for complex and task-specific reward engineering.

  • Implementation: A trajectory-level reward is assigned based on feedback from the environment about task completion.

    • If the task is successfully completed, the entire trajectory receives a reward of 1.

    • Otherwise, it receives a reward of 0.

      This trajectory-level reward is then propagated to every individual action token within that trajectory for the purpose of gradient computation. The reward function is formally defined as:

$ R(a_{i,t} \mid s_{i,t}) = \begin{cases} 1, & \text{if } \mathrm{is\_successful}\left[\mathrm{traj}_i(a_i, s_i)\right], \\ 0, & \text{otherwise}. \end{cases} $

  • Symbol Explanation:
    • $R(a_{i,t} \mid s_{i,t})$: The reward assigned to taking action $a_{i,t}$ in state $s_{i,t}$ at time step $t$ of trajectory $i$.
    • $\mathrm{is\_successful}[\mathrm{traj}_i(a_i, s_i)]$: A boolean function that returns true if trajectory $\mathrm{traj}_i$ (composed of the sequence of states $s_i$ and actions $a_i$) resulted in a successful task outcome.
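A minimal sketch of this reward assignment, with names assumed for illustration: the trajectory-level 0/1 outcome is simply broadcast to every action token in the trajectory.

```python
def outcome_reward(trajectory_succeeded: bool, num_action_tokens: int) -> list[float]:
    """Assign the binary trajectory-level reward to every action token
    so that it can be used in token-level gradient computation."""
    r = 1.0 if trajectory_succeeded else 0.0
    return [r] * num_action_tokens
```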

4.2.3. Step 3: Exploration Enhancements

A key challenge in RL, especially with sparse rewards, is ensuring sufficient exploration. The paper introduces three specific techniques to enhance this. The effectiveness of these enhancements is shown in Figure 3.

Figure 3 | The effectiveness of three key enhancements: dynamic sampling, higher rollout temperature, and clip higher. The charts plot LIBERO-Long SR (%) against RL training steps, comparing each strategy's effect via the red and blue curves.

  1. Dynamic Sampling:

    • Problem: In critic-free RL algorithms like GRPO, the advantage is calculated relative to the average reward of a group of trajectories. If all trajectories in a group have the same reward (e.g., all fail, receiving a reward of 0), the advantage for every trajectory becomes zero, leading to a zero gradient and stalling the learning process.
    • Solution: During data collection, the framework keeps only groups of trajectories with a mix of outcomes (at least one success and at least one failure), so that the advantage calculation is always meaningful; a minimal filtering sketch is given after this list. This condition is formally expressed as: $ 0 < \left| \left\{ \mathrm{traj}_i(a_i, s_i) \;\middle|\; \mathrm{is\_successful}\left[\mathrm{traj}_i(a_i, s_i)\right] \right\} \right| < G, $ where $G$ is the total number of trajectories in the group.
  2. Clipping Higher:

    • Problem: The standard clipping in PPO/GRPO (e.g., within [0.8, 1.2]) symmetrically limits how much the probability of an action can be increased or decreased. This can be overly restrictive and can prevent the model from sufficiently increasing the probability of good, but initially low-probability, actions discovered during exploration.
    • Solution: Inspired by DAPO, the paper uses an asymmetric clipping range, specifically [0.8, 1.28]. By setting the upper offset ($\varepsilon_{\mathrm{high}} = 0.28$) larger than the lower offset ($\varepsilon_{\mathrm{low}} = 0.2$), the algorithm allows more aggressive probability increases for successful but initially unlikely actions, promoting exploration.
  3. Higher Rollout Temperature:

    • Problem: A low sampling temperature makes the policy more deterministic, causing it to generate repetitive, non-diverse trajectories, which limits exploration.
    • Solution: The sampling temperature during rollout is increased from a standard value of 1.0 to 1.6. This flattens the probability distribution over action tokens, making the sampling more random and leading to a wider variety of explored behaviors.
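As a rough illustration of dynamic sampling, the filter below keeps only groups with mixed outcomes (at least one success and at least one failure), so the group-normalized advantage is never identically zero. The group representation is an assumption for illustration.

```python
def keep_group(group_rewards: list[float]) -> bool:
    """group_rewards: binary outcome rewards of the G trajectories sampled
    from the same initial state. Keep the group only if 0 < #successes < G."""
    successes = sum(r == 1.0 for r in group_rewards)
    return 0 < successes < len(group_rewards)

def dynamic_sampling(groups: list[list[float]]) -> list[list[float]]:
    """Drop all-success and all-failure groups before computing advantages."""
    return [g for g in groups if keep_group(g)]
```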

4.2.4. Step 4: Training Objective

The final step is to update the VLA policy's parameters using the collected trajectories and their rewards. The paper uses a modified GRPO objective.

  • Modification: The standard GRPO objective includes a KL-divergence penalty to keep the policy from changing too drastically. The authors remove this term. This has two benefits:

    1. Efficiency: It eliminates the need to keep a separate reference model in memory, reducing computational overhead.

    2. Exploration: It removes a constraint that could limit the policy's ability to explore novel behaviors far from its initial strategy.

      The final training objective is:

$ \mathcal{T}(\theta) = \mathbb{E}_{s_0 \sim \mathcal{D},\, \{a_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid s_0)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|a_i|} \sum_{t=1}^{|a_i|} \min\left( r_{i,t}(\theta)\hat{A}_i,\ \mathrm{clip}\left(r_{i,t}(\theta),\, 1-\varepsilon_{\mathrm{low}},\, 1+\varepsilon_{\mathrm{high}}\right)\hat{A}_i \right) \right] $

$ \text{s.t.}\quad 0 < \left| \left\{ \mathrm{traj}_i(a_i, s_i) \;\middle|\; \mathrm{is\_successful}\left[\mathrm{traj}_i(a_i, s_i)\right] \right\} \right| < G, $

where the importance ratio $r_{i,t}(\theta)$ and normalized advantage $\hat{A}_i$ are:

$ r_{i,t}(\theta) = \frac{\pi_\theta(a_{i,t} \mid s_{i,t})}{\pi_{\theta_{\mathrm{old}}}(a_{i,t} \mid s_{i,t})}, \quad \hat{A}_i = \frac{R_i - \mathrm{mean}(\{R_i\}_{i=1}^{G})}{\mathrm{std}(\{R_i\}_{i=1}^{G})}. $

A minimal code sketch of this loss is given after the symbol list below.

  • Symbol Explanation:

    • $\mathcal{T}(\theta)$: The objective function to be maximized.

    • $\theta$: The parameters of the policy being optimized.

    • $\pi_{\theta_{\mathrm{old}}}$: The policy used to generate the data (rendered as $\pi_{\theta_{0d}}$ in the paper, likely a typo for "old").

    • $G$: The number of trajectories in a group.

    • $r_{i,t}(\theta)$: The importance sampling ratio for the action at step $t$ of trajectory $i$.

    • $\hat{A}_i$: The normalized advantage for trajectory $i$.

    • $\varepsilon_{\mathrm{low}}, \varepsilon_{\mathrm{high}}$: The clipping parameters (0.2 and 0.28, respectively).

    • The s.t. (such that) clause enforces the Dynamic Sampling constraint.
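The following is a minimal PyTorch sketch of the modified objective: a PPO-style clipped surrogate with the asymmetric range $[1-\varepsilon_{\mathrm{low}},\ 1+\varepsilon_{\mathrm{high}}]$ and no KL penalty. Shapes and masking are simplified assumptions, not the authors' implementation.

```python
import torch

def simplevla_rl_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                      advantages: torch.Tensor,
                      eps_low: float = 0.2, eps_high: float = 0.28) -> torch.Tensor:
    """logp_new, logp_old: (G, T) log-probs of the sampled action tokens under the
    current and behavior policies. advantages: (G,) group-normalized trajectory advantages."""
    ratio = torch.exp(logp_new - logp_old)                      # r_{i,t}(theta)
    adv = advantages.unsqueeze(1)                               # broadcast A_i over the T tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv
    # Maximize the clipped surrogate -> minimize its negative, averaged over tokens and trajectories.
    return -torch.min(unclipped, clipped).mean()
```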


5. Experimental Setup

5.1. Datasets

The experiments are conducted on three major simulation benchmarks for robotic manipulation:

  1. LIBERO: A benchmark for Lifelong Imitation Benchmarking for Embodied Robots. It is designed to test knowledge transfer and generalization across diverse tasks, objects, and environments. The paper uses four of its suites:

    • LIBERO-Goal: Tasks with varied goal specifications.
    • LIBERO-Spatial: Tasks requiring understanding of spatial relationships.
    • LIBERO-Object: Tasks involving different object types.
    • LIBERO-Long: Tasks that require a long sequence of actions.
  2. RoboTwin 1.0: A simulation benchmark focused on dual-arm (bimanual) manipulation tasks. It provides 17 tasks with limited scene and object diversity.

  3. RoboTwin 2.0: An extension of RoboTwin 1.0 with significantly increased complexity. It features 50 dual-arm tasks, a much larger variety of object instances (731), and extensive domain randomization (varying clutter, lighting, backgrounds, tabletop height, etc.) to improve task diversity and promote robust sim-to-real transfer.

5.2. Evaluation Metrics

The primary evaluation metric used across all experiments is the Success Rate (SR).

  • Conceptual Definition: The Success Rate measures the percentage of trials in which the robot successfully completes the assigned task according to the predefined success criteria for that task. It is a direct and intuitive measure of the policy's effectiveness.
  • Mathematical Formula: $ \mathrm{SR} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} \times 100\% $
  • Symbol Explanation:
    • Number of Successful Trials: The count of evaluation episodes where the task's goal was achieved.
    • Total Number of Trials: The total number of evaluation episodes conducted.
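A trivial sketch of the metric for concreteness:

```python
def success_rate(num_successful_trials: int, total_trials: int) -> float:
    """Success Rate (SR) as a percentage of evaluation episodes."""
    return 100.0 * num_successful_trials / total_trials
```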

5.3. Baselines

The paper compares its method (SimpleVLA-RL applied to OpenVLA-OFT) against a range of recent, state-of-the-art VLA models. These baselines are primarily trained using SFT (imitation learning).

  • UniVLA

  • RDT-1B (Robotic Diffusion Transformer)

  • π₀ and π₀ + FAST

  • Nora

  • OpenVLA

  • Octo

  • DP (Diffusion Policy) and DP3

    These models represent the cutting edge in VLA research and provide a strong benchmark for evaluating the performance gains from adding RL. The direct comparison is with their own SFT-trained OpenVLA-OFT model to isolate the effect of RL.


6. Results & Analysis

6.1. Core Results Analysis

The main experiments evaluate the performance of SimpleVLA-RL on top of a pre-trained OpenVLA-OFT model across the three benchmarks. The results consistently show that the addition of RL provides a substantial performance boost over the SFT-only baseline and other SoTA models.

The following are the results from Table 2 of the original paper:

LIBERO

| Model | Spatial | Object | Goal | Long | Avg |
| --- | --- | --- | --- | --- | --- |
| Octo | 78.9 | 85.7 | 84.6 | 51.1 | 75.1 |
| OpenVLA | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| Nora | 92.2 | 95.4 | 89.4 | 74.6 | 87.9 |
| π₀ + FAST | 96.4 | 96.8 | 88.6 | 60.2 | 85.5 |
| π₀ | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 |
| UniVLA | 96.5 | 96.8 | 95.6 | 92.0 | 95.2 |
| OpenVLA-OFT | 91.6 | 95.3 | 90.6 | 86.5 | 91.0 |
| w/ ours | 99.4 | 99.1 | 99.2 | 98.5 | 99.1 |
| Δ | +7.8 | +3.8 | +8.6 | +12.0 | +8.1 |

  • Analysis: On LIBERO, SimpleVLA-RL ("w/ ours") improves the average success rate from 91.0% to 99.1%, achieving near-perfect performance and outperforming all other SoTA models, including UniVLA (95.2%) and π₀ (94.2%). The improvement is particularly pronounced on the LIBERO-Long suite (+12.0%), demonstrating RL's effectiveness in long-horizon planning.

    The following are the results from Table 3 of the original paper:

RoboTwin 1.0

| Model | Hammer Beat | Block Handover | Blocks Stack | Shoe Place | Avg |
| --- | --- | --- | --- | --- | --- |
| DP | 0.0 | 12.0 | 7.1 | 4.3 | 5.9 |
| DP3 | 64.7 | 84.3 | 24.0 | 59.3 | 58.1 |
| OpenVLA-OFT | 67.2 | 61.6 | 7.1 | 23.4 | 39.8 |
| w/ ours | 92.6 | 89.6 | 40.2 | 59.3 | 70.4 |
| Δ | +25.4 | +28.0 | +33.1 | +35.9 | +30.6 |

The following are the results from Table 4 of the original paper:

Short Horizon Tasks (100-130 Steps)

| Model | Lift Pot | Beat Hammer Block | Pick Dual Bottles | Place Phone Stand | Avg |
| --- | --- | --- | --- | --- | --- |
| π₀ | 51.0 | 59.0 | 50.0 | 22.0 | 45.5 |
| RDT | 45.0 | 22.0 | 18.0 | 13.0 | 24.5 |
| OpenVLA-OFT | 10.1 | 28.1 | 29.7 | 17.1 | 21.3 |
| w/ ours | 64.1 | 87.5 | 68.3 | 39.6 | 64.9 |
| Δ | +54.0 | +59.4 | +38.6 | +22.5 | +43.6 |

Medium Horizon Tasks (150-230 Steps)

| Model | Move Can Pot | Place A2B Left | Place Empty Cup | Handover Mic | Avg |
| --- | --- | --- | --- | --- | --- |
| π₀ | 41.0 | 38.0 | 60.0 | 96.0 | 58.8 |
| RDT | 33.0 | 21.0 | 42.0 | 95.0 | 47.8 |
| OpenVLA-OFT | 28.1 | 37.5 | 77.3 | 45.3 | 47.1 |
| w/ ours | 61.2 | 45.3 | 94.2 | 89.2 | 72.5 |
| Δ | +33.1 | +7.8 | +16.9 | +43.9 | +25.4 |

Long (280-320 Steps) & Extra Long Horizon Tasks (450-650 Steps)

| Model | Handover Block | Stack Bowls Two | Blocks Rank RGB | Put Bottles Dustbin | Avg |
| --- | --- | --- | --- | --- | --- |
| π₀ | 39.0 | 53.0 | 45.0 | 36.0 | 43.3 |
| RDT | 26.0 | 42.0 | 17.0 | 26.0 | 27.8 |
| OpenVLA-OFT | 33.1 | 40.6 | 70.2 | 42.2 | 46.5 |
| w/ ours | 57.8 | 75.8 | 81.3 | 60.9 | 69.0 |
| Δ | +24.7 | +35.2 | +11.1 | +18.7 | +22.4 |

  • Analysis: On RoboTwin 1.0 & 2.0, the improvements are even more dramatic. The average success rate on RoboTwin 1.0 jumps from 39.8% to 70.4% (+30.6%). On the more challenging RoboTwin 2.0, the overall average success rate improves from 38.3% to 68.8%, a relative improvement of about 80%, and significantly outperforms π₀ (49.2%). The gains are consistent across short, medium, and long-horizon tasks, confirming the robustness of the approach.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Overcoming Data Scarcity

This experiment tests the hypothesis that RL can reduce the dependency on large demonstration datasets. The authors compare performance after training with "One-Trajectory SFT" (only one demonstration per task) versus "Full-Trajectory SFT" (all available demonstrations).

The following are the results from Table 5 of the original paper:

One-Trajectory SFT (LIBERO)

| Model | Spatial | Object | Goal | Long | Avg |
| --- | --- | --- | --- | --- | --- |
| OpenVLA-OFT | 63.6 | 54.9 | 59.6 | 17.3 | 48.9 |
| w/ ours | 98.2 | 98.7 | 98.8 | 91.7 | 96.9 |
| Δ | +34.6 | +43.8 | +39.2 | +74.4 | +48.0 |

Full-Trajectory SFT (LIBERO)

| Model | Spatial | Object | Goal | Long | Avg |
| --- | --- | --- | --- | --- | --- |
| OpenVLA-OFT | 91.6 | 95.3 | 90.6 | 86.5 | 91.0 |
| w/ ours | 99.4 | 99.1 | 99.2 | 98.5 | 99.1 |
| Δ | +7.8 | +3.8 | +8.6 | +12.0 | +8.1 |

  • Analysis: The results are striking. With only one demonstration, SFT performance is mediocre (48.9% average). However, applying SimpleVLA-RL on top of this weak model skyrockets the performance to 96.9%. This is even higher than the 91.0% achieved by SFT with the full dataset. This strongly supports the claim that online RL can effectively compensate for a lack of demonstration data.

6.2.2. Generalization Analysis

This analysis investigates whether RL improves generalization better than SFT. The experiment is set up by training on 9 tasks from a LIBERO suite and testing on 1 held-out (unseen) task. The following figure (Figure 4 from the original paper) plots the performance on the unseen task as the performance on the seen (training) tasks improves.

Figure 4 | Generalization Analysis on LIBERO: Goal Unseen (Top), Object Unseen (Middle), Spatial Unseen (Bottom). The charts compare reinforcement learning (RL) and supervised fine-tuning (SFT) on the success rate (SR) of the held-out task as performance on the seen tasks increases; RL achieves higher success rates across the tasks.

  • Analysis: The graphs clearly show that as training progresses and performance on seen tasks improves (moving right on the x-axis), SimpleVLA-RL (blue/green curves) consistently improves performance on the unseen tasks as well. In contrast, SFT (red curves) exhibits severe overfitting. As SFT gets better at the training tasks, its performance on the unseen tasks often degrades, sometimes catastrophically dropping to 0%. This indicates that RL learns more generalizable skills, while SFT memorizes task-specific patterns.

6.2.3. Real-World Experiments (Sim-to-Real)

To validate real-world applicability, policies trained entirely in simulation were deployed on physical robots.

The following are the results from Table 6 of the original paper:

| Model | Stack Bowls | Place Empty Cup | Pick Bottle | Click Bell | Avg |
| --- | --- | --- | --- | --- | --- |
| RDT | 60.0 | 4.0 | 10.0 | 20.0 | 23.5 |
| OpenVLA-OFT | 38.0 | 2.0 | 0.0 | 30.0 | 17.5 |
| w/ ours | 70.0 | 10.0 | 14.0 | 60.0 | 38.5 |
| Δ | +32.0 | +8.0 | +14.0 | +30.0 | +21.0 |

  • Analysis: The RL-trained model (w/ ours) achieves an average success rate of 38.5% in the real world, more than doubling the performance of the SFT-only model (17.5%) and significantly outperforming the RDT baseline (23.5%). This demonstrates that the skills learned and refined through RL in simulation can effectively transfer to the real world, providing a scalable pathway for developing real-world robot policies.

6.2.4. Qualitative Analysis: "Pushcut"

The paper identifies an emergent behavior termed "pushcut". In tasks where the demonstration data always shows a "grasp-move-place" sequence, the RL-trained model sometimes discovers a more efficient strategy of simply pushing the object to the target location. The following figure (Figure 5 from the original paper) illustrates this.

Figure 5 | Illustration of the "pushcut" behavior on two tasks: "Move Can Pot" (left) and "Place A2B Right" (right). The top row shows the supervised fine-tuned (SFT) policy's behavior, labeled "Grasp"; the bottom row shows the RL policy's behavior, labeled "Push".

  • Analysis: This phenomenon highlights a key advantage of RL over SFT. SFT is fundamentally limited to replicating patterns in the data it has seen. RL, driven by an outcome-based reward, is free to explore the entire action space to find any solution that achieves the goal, even if it's a novel one. This ability to discover "shortcuts" or more efficient strategies is a hallmark of true learning beyond imitation.

6.2.5. Failure Modes Analysis

This section investigates the conditions under which SimpleVLA-RL fails. The key factor explored is the capability of the initial SFT model.

The following are the results from Table 7 of the original paper:

RoboTwin 2.0

| SFT data | Model | Move Can Pot | Place A2B Lift | Place A2B Right | Place Phone Stand | Pick Dual Bottles | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 trajs | SFT | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 trajs | +RL | 0 | 0 | 0 | 0 | 0 | 0 |
| 100 trajs | SFT | 9.4 | 7.8 | 7.8 | 10.1 | 1.2 | 7.3 |
| 100 trajs | +RL | 51.6 | 25.0 | 27.2 | 18.8 | 4.3 | 25.4 |
| 100 trajs | Δ | +42.2 | +17.2 | +19.4 | +8.7 | +3.1 | +18.1 |
| 1000 trajs | SFT | 28.1 | 37.5 | 28.7 | 17.1 | 29.7 | 28.2 |
| 1000 trajs | +RL | 61.2 | 45.3 | 37.5 | 39.6 | 68.3 | 50.4 |
| 1000 trajs | Δ | +33.1 | +7.8 | +8.8 | +22.5 | +38.6 | +22.2 |

  • Analysis:
    1. RL fails from scratch: When starting with a base model that has 0% success rate (0 SFT trajectories), RL makes no progress. Because no successful trajectories are ever generated, the model never receives a positive reward signal to learn from.

    2. Better prior leads to better RL: The model initialized with 1000 SFT trajectories (28.2% initial SR) achieves a much higher final performance (50.4%) than the one initialized with 100 trajectories (7.3% initial SR -> 25.4% final SR). This shows that a stronger initial policy provides a better starting point for exploration.

    3. There's a capability threshold: When the initial success rate is extremely low (e.g., 1.2% on "Pick Dual Bottles"), RL provides only marginal gains (+3.1%). This suggests that the policy must have a minimal level of competence for the exploration process to be fruitful.


7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces SimpleVLA-RL, a simple yet powerful online RL framework for training VLA models. By adapting techniques from LLM reinforcement learning, the authors demonstrate that RL can be a highly effective tool for scaling VLA training. The key findings are that SimpleVLA-RL significantly improves performance on complex manipulation tasks, drastically reduces the need for expensive human demonstration data, enhances the model's generalization capabilities to unseen scenarios, and shows strong sim-to-real transfer. The discovery of the "pushcut" phenomenon further underscores RL's potential to unlock novel and efficient robotic behaviors beyond simple imitation.

7.2. Limitations & Future Work

The paper itself does not dedicate a section to limitations, but several can be inferred from the results:

  • Dependency on Initial SFT Policy: The most significant limitation, as shown in the failure modes analysis, is that the method requires an initial SFT-trained model that already has a non-zero success rate. It cannot learn from a completely "cold start." While it reduces the quantity of SFT data needed, it does not eliminate the need for an initial phase of imitation learning.

  • Sparse Reward Limitations: The binary, outcome-based reward is simple and scalable but may be insufficient for extremely long or complex multi-stage tasks where a successful outcome is exceptionally rare. In such cases, exploration could be inefficient, and some form of reward shaping or curriculum learning might be necessary.

  • Simulation Fidelity: The success of sim-to-real transfer is still limited by the fidelity of the simulation. While domain randomization helps, the gap between simulation and the real world remains a challenge for all simulation-based robotics training.

    Future work could focus on addressing these limitations, such as developing methods to make RL effective from weaker initial policies, exploring more sophisticated reward mechanisms that remain scalable, or integrating RL with automated curriculum generation to tackle progressively harder tasks.

7.3. Personal Insights & Critique

This paper is a strong piece of empirical research that provides compelling evidence for the value of online RL in the VLA domain.

  • Strengths:

    • Clarity and Simplicity: The proposed method is conceptually straightforward and builds upon established techniques, making it easy to understand and replicate.
    • Comprehensive Evaluation: The experiments are thorough, covering multiple benchmarks, data regimes, generalization axes, and a real-world deployment. This provides a robust validation of the claims.
    • Practical Significance: By tackling the data scarcity bottleneck, the work offers a practical path forward for scaling up the capabilities of general-purpose robots. The positive sim-to-real results are particularly promising.
    • Insightful Analysis: The "pushcut" and failure mode analyses provide valuable insights into how and when RL works for VLAs, moving beyond just reporting performance numbers.
  • Critique and Areas for Improvement:

    • Novelty of RL Techniques: The RL techniques themselves (GRPO, dynamic sampling, temperature tuning) are not novel; they are adaptations of methods from the LLM reasoning domain. The paper's main contribution is in the successful application and VLA-specific engineering of these techniques.

    • The "One-Shot" Claim: While the one-shot SFT experiment is impressive, it's important to remember that the model still benefits from large-scale pre-training. The RL phase is refining a very knowledgeable prior, not learning from scratch.

    • Cost of Online RL: Online RL requires running many parallel simulations, which can be computationally expensive. While it saves on human data collection costs, it incurs significant compute costs. A detailed comparison of these trade-offs would be beneficial.

      Overall, SimpleVLA-RL represents a significant and practical step forward. It convincingly argues that the future of training highly capable and adaptable robots will likely involve a synergistic combination of large-scale pre-training, a small amount of imitation learning to bootstrap behavior, and extensive online RL to refine, generalize, and discover new skills.
