
SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

Published: 09/12/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

SimpleVLA-RL is introduced to enhance the training of Vision-Language-Action models using reinforcement learning, addressing data scarcity and generalization issues. Results show state-of-the-art performance on OpenVLA-OFT, reducing reliance on labeled data.

Abstract

Vision-Language-Action (VLA) models have recently emerged as a powerful paradigm for robotic manipulation. Despite substantial progress enabled by large-scale pretraining and supervised fine-tuning (SFT), these models face two fundamental challenges: (i) the scarcity and high cost of large-scale human-operated robotic trajectories required for SFT scaling, and (ii) limited generalization to tasks involving distribution shift. Recent breakthroughs in Large Reasoning Models (LRMs) demonstrate that reinforcement learning (RL) can dramatically enhance step-by-step reasoning capabilities, raising a natural question: Can RL similarly improve the long-horizon step-by-step action planning of VLA? In this work, we introduce SimpleVLA-RL, an efficient RL framework tailored for VLA models. Building upon veRL, we introduce VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. When applied to OpenVLA-OFT, SimpleVLA-RL achieves SoTA performance on LIBERO and even outperforms $\pi_0$ on RoboTwin 1.0 & 2.0 with the exploration-enhancing strategies we introduce. SimpleVLA-RL not only reduces dependence on large-scale data and enables robust generalization, but also remarkably surpasses SFT in real-world tasks. Moreover, we identify a novel phenomenon "pushcut" during RL training, wherein the policy discovers previously unseen patterns beyond those seen in the previous training process. Github: https://github.com/PRIME-RL/SimpleVLA-RL


In-depth Reading


1. Bibliographic Information

1.1. Title

SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

1.2. Authors

Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, Dehui Wang, Dingxiang Luo, Yuchen Fan, Youbang Sun, Jia Zeng, Jiangmiao Pang, Shanghang Zhang, Yu Wang, Yao Mu, Bowen Zhou, and Ning Ding.

The authors are affiliated with several institutions, including Tsinghua University and other research labs. The large number of authors and the affiliations suggest a significant collaborative effort, which is common in large-scale deep learning and robotics research.

1.3. Journal/Conference

The paper is available on arXiv, which is a preprint server for academic articles. This means it has not yet undergone formal peer review for publication in a journal or conference.

1.4. Publication Year

The paper was posted to arXiv in September 2025.

1.5. Abstract

The abstract introduces Vision-Language-Action (VLA) models as a powerful tool for robotic manipulation. It identifies two major challenges: the high cost and scarcity of human-operated robot data required for Supervised Fine-Tuning (SFT), and the models' limited ability to generalize. Inspired by breakthroughs in Large Reasoning Models, where Reinforcement Learning (RL) enhanced reasoning, the authors ask whether RL can similarly improve VLA models' action planning. They propose SimpleVLA-RL, an efficient RL framework for VLAs that introduces VLA-specific enhancements such as specialized trajectory sampling and parallelization. When applied to the OpenVLA-OFT model, SimpleVLA-RL achieves state-of-the-art (SoTA) performance on the LIBERO benchmark and outperforms the $\pi_0$ model on RoboTwin 1.0 & 2.0. The framework not only mitigates the need for large datasets and improves generalization but also surpasses SFT performance in real-world tasks. A novel phenomenon termed "pushcut" is identified, where the model discovers new behaviors not seen in the training data.

2. Executive Summary

2.1. Background & Motivation

2.1.1. Core Problem

The primary problem this paper addresses is the scalability and generalization of Vision-Language-Action (VLA) models for robotic manipulation. While VLAs show great promise, their training heavily relies on Supervised Fine-Tuning (SFT) using large datasets of human-operated robot trajectories. This approach faces two critical bottlenecks:

  1. Data Scarcity and Cost: Collecting high-quality, diverse robotic manipulation data is extremely expensive, time-consuming, and difficult to scale. This data bottleneck severely limits the potential for improving VLA models through SFT alone.
  2. Poor Generalization: VLA models trained via SFT often struggle to generalize to new tasks, objects, or environments that differ from their training data. This is because SFT primarily teaches the model to imitate the specific patterns present in the demonstrations, rather than learning generalizable, robust strategies.

2.1.2. Importance and Gaps

VLAs are a cornerstone of modern robotics, aiming to create general-purpose robots that can understand and execute tasks based on natural language commands and visual input. Overcoming the data and generalization limitations is crucial for moving from narrow, lab-controlled robots to adaptable, real-world assistants. Prior research has focused on scaling up datasets or improving model architectures, but the fundamental limitations of the SFT paradigm remain.

2.1.3. Innovative Idea

The paper draws inspiration from recent successes in Large Reasoning Models (LRMs), such as DeepSeek-R1, where Reinforcement Learning (RL) was used to significantly enhance step-by-step reasoning abilities, often using simple, outcome-based rewards (e.g., was the final answer correct?). The authors' central hypothesis is that a similar approach can be applied to VLAs. They propose that RL can improve a VLA's step-by-step action planning, allowing the model to learn from its own experience through trial and error, thereby reducing its dependence on expensive human demonstrations and improving its ability to generalize.

2.2. Main Contributions / Findings

The paper makes several key contributions:

  1. An Efficient Online RL Framework for VLAs (SimpleVLA-RL): The authors develop and release an end-to-end framework specifically designed for applying online RL to VLA models. It adapts existing LLM RL frameworks (veRL) with VLA-specific features like interactive trajectory sampling and parallelized multi-environment rendering to handle the unique demands of robotic simulation.

  2. State-of-the-Art Performance: By applying SimpleVLA-RL to an OpenVLA-OFT model, they achieve SoTA results on standard robotics benchmarks. The RL-trained model significantly outperforms strong SFT-based baselines on LIBERO and RoboTwin 1.0 & 2.0, particularly in complex, long-horizon tasks.

  3. Demonstrated Data Efficiency and Generalization:

    • Data Efficiency: The paper shows that with only a single demonstration per task, RL can boost performance to levels that surpass SFT trained on hundreds of demonstrations. For example, on LIBERO-Long tasks, the success rate jumped from 17.1% (one-shot SFT) to 91.7% after RL.
    • Generalization: Experiments on unseen tasks, objects, and spatial configurations show that SimpleVLA-RL leads to robust generalization, whereas SFT tends to overfit to the training data, often resulting in performance degradation on unseen scenarios.
  4. Effective Sim-to-Real Transfer: The policies trained entirely in simulation with SimpleVLA-RL demonstrate significant performance gains when deployed on a real robot, outperforming both the SFT-only model and other baselines. This highlights a viable path for scaling up real-world robot capabilities using low-cost simulation.

  5. Discovery of the "Pushcut" Phenomenon: The authors identify an emergent behavior where the RL-trained agent discovers novel and more efficient strategies (e.g., pushing an object instead of grasping it) that were never present in the original demonstration data. This showcases RL's ability to move beyond mere imitation and find creative solutions.


3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Vision-Language-Action (VLA) Models

A Vision-Language-Action (VLA) model is a type of AI model designed for robotics. It integrates three modalities:

  • Vision: It processes visual input from cameras (e.g., RGB images) to perceive the environment.
  • Language: It understands natural language instructions given by a human (e.g., "pick up the red block and place it in the blue bowl").
  • Action: It generates a sequence of control commands (actions) for a robot to execute in order to complete the instructed task.

Essentially, VLAs are large multimodal models, often built on top of Large Language Model (LLM) architectures, that aim to bridge the gap between human instruction and robotic execution.

3.1.2. Supervised Fine-Tuning (SFT)

Supervised Fine-Tuning (SFT) is a training process where a pre-trained model (like an LLM) is further trained on a specific, labeled dataset. In the context of VLAs, this is a form of imitation learning. The dataset consists of expert demonstrations, where each data point is a trajectory containing a sequence of (state, action) pairs. The state includes vision and language inputs, and the action is the command executed by the expert. The model learns to predict the expert's action given a particular state, effectively learning to "imitate" the expert.
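To make this imitation-learning objective concrete, below is a minimal sketch (not the paper's code) of a token-level SFT loss for a VLA that outputs discrete action tokens; the tensor names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sft_loss(action_logits: torch.Tensor, expert_action_tokens: torch.Tensor) -> torch.Tensor:
    """Behavioral-cloning loss for a token-based VLA.

    action_logits:        (batch, seq_len, vocab_size) scores over action tokens
    expert_action_tokens: (batch, seq_len) ground-truth tokens from demonstrations
    """
    # Standard cross-entropy pushes the model toward the expert's action at each step.
    return F.cross_entropy(
        action_logits.reshape(-1, action_logits.size(-1)),
        expert_action_tokens.reshape(-1),
    )
```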

3.1.3. Reinforcement Learning (RL)

Reinforcement Learning (RL) is a machine learning paradigm where an agent learns to make decisions by interacting with an environment. The core components are:

  • Agent: The learner or decision-maker (in this case, the VLA policy).
  • Environment: The world the agent interacts with (a physical robot setup or a simulation).
  • State ($s$): A representation of the environment at a specific time.
  • Action ($a$): A decision made by the agent.
  • Reward ($r$): A feedback signal from the environment indicating how good an action was.

The agent's goal is to learn a policy ($\pi$), a strategy for choosing actions that maximizes the cumulative reward over time. Unlike SFT, which learns from a fixed set of "correct" examples, RL allows the agent to learn from trial and error.

3.1.4. Online vs. Offline RL

  • Online RL: The agent learns by directly and continuously interacting with the environment, collecting new experience (trajectories), and updating its policy based on that new experience. This allows for exploration and discovery of new strategies.
  • Offline RL: The agent learns from a fixed, pre-collected dataset of interactions, without any further interaction with the environment. This is more similar to SFT but uses RL algorithms to learn from the reward signals in the dataset.

SimpleVLA-RL is an online RL framework.

3.2. Previous Works

3.2.1. OpenVLA and OpenVLA-OFT

OpenVLA is the VLA model architecture used as the backbone in this paper. It is an open-source model built on an LLM (LLaMA-2) that processes interleaved vision and language inputs to produce actions. OpenVLA-OFT is a variant trained with an optimized fine-tuning (OFT) recipe that improves adaptation efficiency and task performance. A key feature relevant to this paper is its token-based action representation: robot actions are discretized into a vocabulary of "action tokens," and the model predicts a probability distribution over these tokens, much as an LLM predicts text tokens. This makes it naturally compatible with RL algorithms that operate on probability distributions.
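As a rough illustration of this token-based action representation, the sketch below bins continuous action values into a fixed set of discrete tokens and maps them back; the bin count and value range are assumptions, not the exact OpenVLA-OFT configuration.

```python
import numpy as np

NUM_BINS = 256                        # assumed number of action tokens per dimension
ACTION_LOW, ACTION_HIGH = -1.0, 1.0   # assumed normalized action range

def action_to_tokens(action: np.ndarray) -> np.ndarray:
    """Discretize a continuous action vector into integer action-token indices."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    scaled = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)  # map to [0, 1]
    return np.round(scaled * (NUM_BINS - 1)).astype(int)

def tokens_to_action(tokens: np.ndarray) -> np.ndarray:
    """Recover (approximate) continuous actions from token indices."""
    return tokens / (NUM_BINS - 1) * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW
```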

3.2.2. DeepSeek-R1 and the Rise of RL for Reasoning

DeepSeek-R1 is a Large Reasoning Model (LRM) that demonstrated remarkable gains in complex reasoning tasks by using online RL. A key takeaway from its success, and the one that motivates this paper, is that significant improvements can be achieved with a very simple, rule-based, outcome-only reward (e.g., a reward of 1 if the final answer to a math problem is correct, 0 otherwise). This showed that RL could teach a model how to reason step-by-step without needing complex, step-by-step reward engineering.

3.2.3. Group Relative Policy Optimization (GRPO)

Group Relative Policy Optimization (GRPO) is the specific RL algorithm used in this paper. It is a variant of policy optimization methods that is value-function-free, meaning it does not need to learn a separate model to estimate the value of states. Instead, it computes the advantage of a trajectory by comparing its total reward to the average reward of a group of trajectories sampled from the same starting state. This simplifies the RL process.

The GRPO objective is given as:

$ J_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{s_0 \sim \mathcal{D},\, \{\tau_i\} \sim \pi_{\theta_{\mathrm{old}}}} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|\tau_i|} \sum_{t=1}^{|\tau_i|} \min\left( r_{i,t}(\theta)\hat{A}_i,\ \mathrm{clip}\left(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_i \right) - \beta\, D_{\mathrm{KL}}\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right) \right] $

where:

  • $\theta$: The parameters of the policy network.
  • $\pi_{\theta_{\mathrm{old}}}$: The old policy used to generate trajectories.
  • $G$: The number of trajectories sampled in a group.
  • $r_{i,t}(\theta) = \frac{\pi_\theta(a_{i,t} \mid s_{i,t})}{\pi_{\theta_{\mathrm{old}}}(a_{i,t} \mid s_{i,t})}$: The importance sampling ratio, which measures how likely the new policy is to take a given action compared to the old policy.
  • $\hat{A}_i = \frac{R_i - \mathrm{mean}(\{R_i\}_{i=1}^{G})}{\mathrm{std}(\{R_i\}_{i=1}^{G})}$: The normalized advantage, where $R_i$ is the total reward for trajectory $i$. Each trajectory's reward is normalized by the mean and standard deviation of rewards within its group, indicating whether the trajectory was better or worse than average.
  • $\mathrm{clip}(\cdot)$: The clipping function from PPO, which prevents policy updates from being too large and destabilizing training.
  • $\beta D_{\mathrm{KL}}(\pi_\theta \| \pi_{\mathrm{ref}})$: A regularization term that penalizes the new policy for diverging too far from a reference policy $\pi_{\mathrm{ref}}$. This term is removed in SimpleVLA-RL to simplify training and encourage more exploration.
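As a concrete illustration of the group-relative advantage and importance ratio defined above, here is a minimal PyTorch sketch. Tensor shapes and function names are illustrative assumptions, not the authors' implementation.

```python
import torch

def group_relative_advantage(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: (G,) total outcome reward of each trajectory in one group."""
    # Normalize each trajectory's reward against its group's statistics.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def importance_ratio(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """Per-token ratio pi_theta / pi_theta_old, computed from log-probabilities
    of the sampled action tokens under the current and old policies."""
    return torch.exp(logp_new - logp_old)
```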

3.3. Technological Evolution

The field of robotic manipulation has evolved from classical control methods to learning-based approaches.

  1. Early Learning: Early methods often used RL with hand-crafted features and dense, engineered reward functions. These were difficult to scale and generalize.
  2. Imitation Learning (IL) / Behavioral Cloning (BC): With the rise of deep learning, imitation learning became dominant. This is essentially the SFT paradigm, where models learn from expert demonstrations. This led to significant progress but created the data dependency problem.
  3. VLA Models: The integration of large pre-trained models (vision and language) led to the VLA paradigm, which improved generalization by leveraging knowledge from vast web-scale datasets. However, the final stage of training still relied heavily on SFT.
  4. RL for VLAs: This paper represents the latest trend, which seeks to combine the strengths of large pre-trained VLAs with the exploratory and adaptive power of RL. The goal is to move beyond simple imitation and enable models to learn and improve from their own interactions, addressing the core limitations of SFT.

3.4. Differentiation Analysis

Compared to related works, SimpleVLA-RL makes the following key differentiations:

  • vs. SFT-based VLAs (e.g., OpenVLA, UniVLA): The core innovation is the use of online RL instead of relying solely on imitation learning. This allows the model to learn from failure, explore, and discover novel strategies ("pushcut"), which is impossible with SFT.
  • vs. Traditional Robotics RL: It avoids complex, hand-engineered reward functions by using a simple, scalable binary outcome reward. This is made possible by starting with a capable pre-trained VLA model.
  • vs. Other VLA-RL works: While other works have explored RL for VLAs, this paper provides a particularly systematic study focused on:
    • An efficient, open-source framework tailored for the unique challenges of VLA rollout (interactive sampling, parallel simulation).

    • A comprehensive analysis of how RL addresses the key bottlenecks of data scarcity and generalization.

    • Demonstration of strong sim-to-real transfer.

    • Qualitative analysis of emergent behaviors.


4. Methodology

4.1. Principles

The core principle of SimpleVLA-RL is to adapt the successful online RL paradigm from Large Reasoning Models to the domain of robotic manipulation. The intuition is that just as RL can refine a model's step-by-step reasoning process to arrive at a correct final answer, it can also refine a VLA's step-by-step action generation process to achieve a successful task outcome. The method leverages a capable SFT-initialized VLA model and improves it through trial-and-error exploration in a simulated environment, guided only by a simple binary signal of task success or failure.

4.2. Core Methodology In-depth

The SimpleVLA-RL framework follows a standard online RL loop: rollout (generate experience), rewarding (evaluate outcomes), and training (update the policy). The paper introduces specific modifications to make this loop efficient and effective for VLA models. The overall workflow is depicted in Figure 2 from the paper.

Figure 2 | Overview of SimpleVLA-RL. The figure is a schematic of the SimpleVLA-RL workflow, covering the limited offline trajectories, the policy's rollout process, and reward computation. Through interaction with the environment, the state $s_t$ and action $a_t$ are updated, ultimately yielding a set of trajectories and their rewards.

4.2.1. Step 1: Interactive VLA Rollout

The first step is to generate trajectories by having the policy interact with the environment. This process for VLAs is fundamentally different from that for LLMs.

  • Challenge: LLM rollouts are simple auto-regressive decoding processes. In contrast, VLA rollouts are interactive and closed-loop. Each action the robot takes changes the state of the world, and the VLA model must receive the new state (new camera images, new robot joint positions) to decide on the next action. This requires continuous communication with a simulator or real robot.

  • Solution: The framework implements an interactive rollout mechanism. To enable exploration, which is crucial for RL, the VLA model must generate diverse trajectories from the same starting point. This is achieved by:

    1. Using a VLA model (OpenVLA-OFT) that outputs a probability distribution over a discrete set of action tokens.

    2. During rollout, actions are chosen by random sampling from this distribution, controlled by a temperature parameter. Higher temperatures lead to more random (i.e., diverse) actions.

      The paper provides pseudo-code (Listing 1) illustrating the transition from a standard LLM rollout function to the interactive VLA rollout function, which includes parallel environment initialization and stepping.

      Figure 1 | Overview of SimpleVLA-RL. SimpleVLA-RL is an efficient RL framework for VLA that improves long-horizon planning under data scarcity, outperforms SFT in simulation and real-world tasks, and reveals new behavior patterns. The figure shows success rates on the LIBERO and RoboTwin tasks as well as the policy improvement behind the discovery of new behavior patterns; overall, SimpleVLA-RL markedly surpasses the SFT model in long-horizon planning.
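To complement the paper's Listing 1, here is a minimal sketch of the interactive, closed-loop rollout with temperature-scaled sampling over parallel environments. The environment and policy interfaces (`envs.reset`, `envs.step`, `policy.action_logits`) are hypothetical placeholders, not the actual SimpleVLA-RL API.

```python
import torch

def interactive_rollout(policy, envs, instruction, max_steps=300, temperature=1.6):
    """Closed-loop rollout: each sampled action changes the environment state,
    and the new observation conditions the next action."""
    obs = envs.reset()                                     # initial observations (images, proprioception)
    done = torch.zeros(envs.num_envs, dtype=torch.bool)
    trajectories = [[] for _ in range(envs.num_envs)]

    for _ in range(max_steps):
        # Temperature-scaled sampling over discrete action tokens yields diverse trajectories.
        logits = policy.action_logits(obs, instruction) / temperature
        actions = torch.distributions.Categorical(logits=logits).sample()

        obs, success, step_done = envs.step(actions)       # simulator advances one control step
        for i in range(envs.num_envs):
            if not done[i]:
                trajectories[i].append((actions[i], success[i]))
        done |= step_done                                  # assumed boolean tensor of terminations
        if done.all():
            break
    return trajectories
```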

4.2.2. Step 2: Outcome Reward Modeling

After a trajectory is completed (either by succeeding, failing, or reaching a time limit), it must be assigned a reward.

  • Principle: Following the DeepSeek-R1 approach, SimpleVLA-RL uses a simple, sparse, binary outcome reward. This avoids the need for complex and task-specific reward engineering.

  • Implementation: A trajectory-level reward is assigned based on feedback from the environment about task completion.

    • If the task is successfully completed, the entire trajectory receives a reward of 1.

    • Otherwise, it receives a reward of 0.

      This trajectory-level reward is then propagated to every individual action token within that trajectory for the purpose of gradient computation. The reward function is formally defined as:

$ R(a_{i,t} \mid s_{i,t}) = \begin{cases} 1, & \text{if } \mathrm{is\_successful}\left[\mathrm{traj}_i(a_i, s_i)\right], \\ 0, & \text{otherwise}. \end{cases} $

  • Symbol Explanation:
    • $R(a_{i,t} \mid s_{i,t})$: The reward assigned to taking action $a_{i,t}$ in state $s_{i,t}$ at time step $t$ of trajectory $i$.
    • $\mathrm{is\_successful}[\mathrm{traj}_i(a_i, s_i)]$: A boolean function that returns true if trajectory $\mathrm{traj}_i$ (composed of the sequence of states $s_i$ and actions $a_i$) resulted in a successful task outcome.
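A minimal sketch of this reward assignment, with names assumed for illustration: the trajectory-level 0/1 outcome is simply broadcast to every action token in the trajectory.

```python
def outcome_reward(trajectory_succeeded: bool, num_action_tokens: int) -> list[float]:
    """Assign the binary trajectory-level reward to every action token
    so that it can be used in token-level gradient computation."""
    r = 1.0 if trajectory_succeeded else 0.0
    return [r] * num_action_tokens
```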

4.2.3. Step 3: Exploration Enhancements

A key challenge in RL, especially with sparse rewards, is ensuring sufficient exploration. The paper introduces three specific techniques to enhance this. The effectiveness of these enhancements is shown in Figure 3.

Figure 3 | The effectiveness of three key enhancements: dynamic sampling, higher rollout temperature, and clip higher. The charts plot LIBERO-Long SR (%) against RL training steps, comparing each strategy's effect via the red and blue curves.

  1. Dynamic Sampling:

    • Problem: In critic-free RL algorithms like GRPO, the advantage is calculated relative to the average reward of a group of trajectories. If all trajectories in a group have the same reward (e.g., all fail, receiving a reward of 0), the advantage for every trajectory becomes zero, leading to a zero gradient and stalling the learning process.
    • Solution: During data collection, the framework keeps only groups of trajectories with a mix of outcomes (at least one success and at least one failure), so that the advantage calculation is always meaningful; a minimal filtering sketch is given after this list. This condition is formally expressed as: $ 0 < \left| \left\{ \mathrm{traj}_i(a_i, s_i) \;\middle|\; \mathrm{is\_successful}\left[\mathrm{traj}_i(a_i, s_i)\right] \right\} \right| < G, $ where $G$ is the total number of trajectories in the group.
  2. Clipping Higher:

    • Problem: The standard clipping in PPO/GRPO (e.g., within [0.8, 1.2]) symmetrically limits how much the probability of an action can be increased or decreased. This can be overly restrictive and can prevent the model from sufficiently increasing the probability of good, but initially low-probability, actions discovered during exploration.
    • Solution: Inspired by DAPO, the paper uses an asymmetric clipping range, specifically [0.8, 1.28]. By setting the upper offset ($\varepsilon_{\mathrm{high}} = 0.28$) larger than the lower offset ($\varepsilon_{\mathrm{low}} = 0.2$), the algorithm allows more aggressive probability increases for successful but initially unlikely actions, promoting exploration.
  3. Higher Rollout Temperature:

    • Problem: A low sampling temperature makes the policy more deterministic, causing it to generate repetitive, non-diverse trajectories, which limits exploration.
    • Solution: The sampling temperature during rollout is increased from a standard value of 1.0 to 1.6. This flattens the probability distribution over action tokens, making the sampling more random and leading to a wider variety of explored behaviors.
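As a rough illustration of dynamic sampling, the filter below keeps only groups with mixed outcomes (at least one success and at least one failure), so the group-normalized advantage is never identically zero. The group representation is an assumption for illustration.

```python
def keep_group(group_rewards: list[float]) -> bool:
    """group_rewards: binary outcome rewards of the G trajectories sampled
    from the same initial state. Keep the group only if 0 < #successes < G."""
    successes = sum(r == 1.0 for r in group_rewards)
    return 0 < successes < len(group_rewards)

def dynamic_sampling(groups: list[list[float]]) -> list[list[float]]:
    """Drop all-success and all-failure groups before computing advantages."""
    return [g for g in groups if keep_group(g)]
```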

4.2.4. Step 4: Training Objective

The final step is to update the VLA policy's parameters using the collected trajectories and their rewards. The paper uses a modified GRPO objective.

  • Modification: The standard GRPO objective includes a KL-divergence penalty to keep the policy from changing too drastically. The authors remove this term. This has two benefits:

    1. Efficiency: It eliminates the need to keep a separate reference model in memory, reducing computational overhead.

    2. Exploration: It removes a constraint that could limit the policy's ability to explore novel behaviors far from its initial strategy.

      The final training objective is:

$ \mathcal{T}(\theta) = \mathbb{E}_{s_0 \sim \mathcal{D},\, \{a_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid s_0)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|a_i|} \sum_{t=1}^{|a_i|} \min\left( r_{i,t}(\theta)\hat{A}_i,\ \mathrm{clip}\left(r_{i,t}(\theta),\, 1-\varepsilon_{\mathrm{low}},\, 1+\varepsilon_{\mathrm{high}}\right)\hat{A}_i \right) \right] $

$ \text{s.t.}\quad 0 < \left| \left\{ \mathrm{traj}_i(a_i, s_i) \;\middle|\; \mathrm{is\_successful}\left[\mathrm{traj}_i(a_i, s_i)\right] \right\} \right| < G, $

where the importance ratio $r_{i,t}(\theta)$ and normalized advantage $\hat{A}_i$ are:

$ r_{i,t}(\theta) = \frac{\pi_\theta(a_{i,t} \mid s_{i,t})}{\pi_{\theta_{\mathrm{old}}}(a_{i,t} \mid s_{i,t})}, \quad \hat{A}_i = \frac{R_i - \mathrm{mean}(\{R_i\}_{i=1}^{G})}{\mathrm{std}(\{R_i\}_{i=1}^{G})}. $

A minimal code sketch of this loss is given after the symbol list below.

  • Symbol Explanation:

    • $\mathcal{T}(\theta)$: The objective function to be maximized.

    • $\theta$: The parameters of the policy being optimized.

    • $\pi_{\theta_{\mathrm{old}}}$: The policy used to generate the data (rendered as $\pi_{\theta_{0d}}$ in the paper, likely a typo for "old").

    • $G$: The number of trajectories in a group.

    • $r_{i,t}(\theta)$: The importance sampling ratio for the action at step $t$ of trajectory $i$.

    • $\hat{A}_i$: The normalized advantage for trajectory $i$.

    • $\varepsilon_{\mathrm{low}}, \varepsilon_{\mathrm{high}}$: The clipping parameters (0.2 and 0.28, respectively).

    • The s.t. (such that) clause enforces the Dynamic Sampling constraint.
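The following is a minimal PyTorch sketch of the modified objective: a PPO-style clipped surrogate with the asymmetric range $[1-\varepsilon_{\mathrm{low}},\ 1+\varepsilon_{\mathrm{high}}]$ and no KL penalty. Shapes and masking are simplified assumptions, not the authors' implementation.

```python
import torch

def simplevla_rl_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                      advantages: torch.Tensor,
                      eps_low: float = 0.2, eps_high: float = 0.28) -> torch.Tensor:
    """logp_new, logp_old: (G, T) log-probs of the sampled action tokens under the
    current and behavior policies. advantages: (G,) group-normalized trajectory advantages."""
    ratio = torch.exp(logp_new - logp_old)                      # r_{i,t}(theta)
    adv = advantages.unsqueeze(1)                               # broadcast A_i over the T tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv
    # Maximize the clipped surrogate -> minimize its negative, averaged over tokens and trajectories.
    return -torch.min(unclipped, clipped).mean()
```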


5. Experimental Setup

5.1. Datasets

The experiments are conducted on three major simulation benchmarks for robotic manipulation:

  1. LIBERO: A benchmark for Lifelong Imitation Benchmarking for Embodied Robots. It is designed to test knowledge transfer and generalization across diverse tasks, objects, and environments. The paper uses four of its suites:

    • LIBERO-Goal: Tasks with varied goal specifications.
    • LIBERO-Spatial: Tasks requiring understanding of spatial relationships.
    • LIBERO-Object: Tasks involving different object types.
    • LIBERO-Long: Tasks that require a long sequence of actions.
  2. RoboTwin 1.0: A simulation benchmark focused on dual-arm (bimanual) manipulation tasks. It provides 17 tasks with limited scene and object diversity.

  3. RoboTwin 2.0: An extension of RoboTwin 1.0 with significantly increased complexity. It features 50 dual-arm tasks, a much larger variety of object instances (731), and extensive domain randomization (varying clutter, lighting, backgrounds, tabletop height, etc.) to improve task diversity and promote robust sim-to-real transfer.

5.2. Evaluation Metrics

The primary evaluation metric used across all experiments is the Success Rate (SR).

  • Conceptual Definition: The Success Rate measures the percentage of trials in which the robot successfully completes the assigned task according to the predefined success criteria for that task. It is a direct and intuitive measure of the policy's effectiveness.
  • Mathematical Formula: $ \mathrm{SR} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} \times 100\% $
  • Symbol Explanation:
    • Number of Successful Trials: The count of evaluation episodes where the task's goal was achieved.
    • Total Number of Trials: The total number of evaluation episodes conducted.
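A trivial sketch of the metric for concreteness:

```python
def success_rate(num_successful_trials: int, total_trials: int) -> float:
    """Success Rate (SR) as a percentage of evaluation episodes."""
    return 100.0 * num_successful_trials / total_trials
```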

5.3. Baselines

The paper compares its method (SimpleVLA-RL applied to OpenVLA-OFT) against a range of recent, state-of-the-art VLA models. These baselines are primarily trained using SFT (imitation learning).

  • UniVLA

  • RDT-1B (Robotic Diffusion Transformer)

  • π₀ and π₀ + FAST

  • Nora

  • OpenVLA

  • Octo

  • DP (Diffusion Policy) and DP3

    These models represent the cutting edge in VLA research and provide a strong benchmark for evaluating the performance gains from adding RL. The direct comparison is with their own SFT-trained OpenVLA-OFT model to isolate the effect of RL.


6. Results & Analysis

6.1. Core Results Analysis

The main experiments evaluate the performance of SimpleVLA-RL on top of a pre-trained OpenVLA-OFT model across the three benchmarks. The results consistently show that the addition of RL provides a substantial performance boost over the SFT-only baseline and other SoTA models.

The following are the results from Table 2 of the original paper:

LIBERO

| Model | Spatial | Object | Goal | Long | Avg |
| --- | --- | --- | --- | --- | --- |
| Octo | 78.9 | 85.7 | 84.6 | 51.1 | 75.1 |
| OpenVLA | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| Nora | 92.2 | 95.4 | 89.4 | 74.6 | 87.9 |
| π₀ + FAST | 96.4 | 96.8 | 88.6 | 60.2 | 85.5 |
| π₀ | 96.8 | 98.8 | 95.8 | 85.2 | 94.2 |
| UniVLA | 96.5 | 96.8 | 95.6 | 92.0 | 95.2 |
| OpenVLA-OFT | 91.6 | 95.3 | 90.6 | 86.5 | 91.0 |
| w/ ours | 99.4 | 99.1 | 99.2 | 98.5 | 99.1 |
| Δ | +7.8 | +3.8 | +8.6 | +12.0 | +8.1 |

  • Analysis: On LIBERO, SimpleVLA-RL ("w/ ours") improves the average success rate from 91.0% to 99.1%, achieving near-perfect performance and outperforming all other SoTA models, including UniVLA (95.2%) and π₀ (94.2%). The improvement is particularly pronounced on the LIBERO-Long suite (+12.0%), demonstrating RL's effectiveness in long-horizon planning.

    The following are the results from Table 3 of the original paper:

RoboTwin 1.0

| Model | Hammer Beat | Block Handover | Blocks Stack | Shoe Place | Avg |
| --- | --- | --- | --- | --- | --- |
| DP | 0.0 | 12.0 | 7.1 | 4.3 | 5.9 |
| DP3 | 64.7 | 84.3 | 24.0 | 59.3 | 58.1 |
| OpenVLA-OFT | 67.2 | 61.6 | 7.1 | 23.4 | 39.8 |
| w/ ours | 92.6 | 89.6 | 40.2 | 59.3 | 70.4 |
| Δ | +25.4 | +28.0 | +33.1 | +35.9 | +30.6 |

The following are the results from Table 4 of the original paper:

Short Horizon Tasks (100-130 Steps)

| Model | Lift Pot | Beat Hammer Block | Pick Dual Bottles | Place Phone Stand | Avg |
| --- | --- | --- | --- | --- | --- |
| π₀ | 51.0 | 59.0 | 50.0 | 22.0 | 45.5 |
| RDT | 45.0 | 22.0 | 18.0 | 13.0 | 24.5 |
| OpenVLA-OFT | 10.1 | 28.1 | 29.7 | 17.1 | 21.3 |
| w/ ours | 64.1 | 87.5 | 68.3 | 39.6 | 64.9 |
| Δ | +54.0 | +59.4 | +38.6 | +22.5 | +43.6 |

Medium Horizon Tasks (150-230 Steps)

| Model | Move Can Pot | Place A2B Left | Place Empty Cup | Handover Mic | Avg |
| --- | --- | --- | --- | --- | --- |
| π₀ | 41.0 | 38.0 | 60.0 | 96.0 | 58.8 |
| RDT | 33.0 | 21.0 | 42.0 | 95.0 | 47.8 |
| OpenVLA-OFT | 28.1 | 37.5 | 77.3 | 45.3 | 47.1 |
| w/ ours | 61.2 | 45.3 | 94.2 | 89.2 | 72.5 |
| Δ | +33.1 | +7.8 | +16.9 | +43.9 | +25.4 |

Long (280-320 Steps) & Extra Long Horizon Tasks (450-650 Steps)

| Model | Handover Block | Stack Bowls Two | Blocks Rank RGB | Put Bottles Dustbin | Avg |
| --- | --- | --- | --- | --- | --- |
| π₀ | 39.0 | 53.0 | 45.0 | 36.0 | 43.3 |
| RDT | 26.0 | 42.0 | 17.0 | 26.0 | 27.8 |
| OpenVLA-OFT | 33.1 | 40.6 | 70.2 | 42.2 | 46.5 |
| w/ ours | 57.8 | 75.8 | 81.3 | 60.9 | 69.0 |
| Δ | +24.7 | +35.2 | +11.1 | +18.7 | +22.4 |

  • Analysis: On RoboTwin 1.0 & 2.0, the improvements are even more dramatic. The average success rate on RoboTwin 1.0 jumps from 39.8% to 70.4% (+30.6%). On the more challenging RoboTwin 2.0, the overall average success rate improves from 38.3% to 68.8%, a relative improvement of about 80%, and significantly outperforms π₀ (49.2%). The gains are consistent across short, medium, and long-horizon tasks, confirming the robustness of the approach.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Overcoming Data Scarcity

This experiment tests the hypothesis that RL can reduce the dependency on large demonstration datasets. The authors compare performance after training with "One-Trajectory SFT" (only one demonstration per task) versus "Full-Trajectory SFT" (all available demonstrations).

The following are the results from Table 5 of the original paper:

One-Trajectory SFT (LIBERO)

| Model | Spatial | Object | Goal | Long | Avg |
| --- | --- | --- | --- | --- | --- |
| OpenVLA-OFT | 63.6 | 54.9 | 59.6 | 17.3 | 48.9 |
| w/ ours | 98.2 | 98.7 | 98.8 | 91.7 | 96.9 |
| Δ | +34.6 | +43.8 | +39.2 | +74.4 | +48.0 |

Full-Trajectory SFT (LIBERO)

| Model | Spatial | Object | Goal | Long | Avg |
| --- | --- | --- | --- | --- | --- |
| OpenVLA-OFT | 91.6 | 95.3 | 90.6 | 86.5 | 91.0 |
| w/ ours | 99.4 | 99.1 | 99.2 | 98.5 | 99.1 |
| Δ | +7.8 | +3.8 | +8.6 | +12.0 | +8.1 |

  • Analysis: The results are striking. With only one demonstration, SFT performance is mediocre (48.9% average). However, applying SimpleVLA-RL on top of this weak model skyrockets the performance to 96.9%. This is even higher than the 91.0% achieved by SFT with the full dataset. This strongly supports the claim that online RL can effectively compensate for a lack of demonstration data.

6.2.2. Generalization Analysis

This analysis investigates whether RL improves generalization better than SFT. The experiment is set up by training on 9 tasks from a LIBERO suite and testing on 1 held-out (unseen) task. The following figure (Figure 4 from the original paper) plots the performance on the unseen task as the performance on the seen (training) tasks improves.

Figure 4 | Generalization Analysis on LIBERO: Goal Unseen (Top), Object Unseen (Middle), Spatial Unseen (Bottom). The charts compare reinforcement learning (RL) and supervised fine-tuning (SFT) on the success rate (SR) of the held-out task as performance on the seen tasks increases; RL achieves higher success rates across the tasks.

  • Analysis: The graphs clearly show that as training progresses and performance on seen tasks improves (moving right on the x-axis), SimpleVLA-RL (blue/green curves) consistently improves performance on the unseen tasks as well. In contrast, SFT (red curves) exhibits severe overfitting. As SFT gets better at the training tasks, its performance on the unseen tasks often degrades, sometimes catastrophically dropping to 0%. This indicates that RL learns more generalizable skills, while SFT memorizes task-specific patterns.

6.2.3. Real-World Experiments (Sim-to-Real)

To validate real-world applicability, policies trained entirely in simulation were deployed on physical robots.

The following are the results from Table 6 of the original paper:

| Model | Stack Bowls | Place Empty Cup | Pick Bottle | Click Bell | Avg |
| --- | --- | --- | --- | --- | --- |
| RDT | 60.0 | 4.0 | 10.0 | 20.0 | 23.5 |
| OpenVLA-OFT | 38.0 | 2.0 | 0.0 | 30.0 | 17.5 |
| w/ ours | 70.0 | 10.0 | 14.0 | 60.0 | 38.5 |
| Δ | +32.0 | +8.0 | +14.0 | +30.0 | +21.0 |

  • Analysis: The RL-trained model (w/ ours) achieves an average success rate of 38.5% in the real world, more than doubling the performance of the SFT-only model (17.5%) and significantly outperforming the RDT baseline (23.5%). This demonstrates that the skills learned and refined through RL in simulation can effectively transfer to the real world, providing a scalable pathway for developing real-world robot policies.

6.2.4. Qualitative Analysis: "Pushcut"

The paper identifies an emergent behavior termed "pushcut". In tasks where the demonstration data always shows a "grasp-move-place" sequence, the RL-trained model sometimes discovers a more efficient strategy of simply pushing the object to the target location. The following figure (Figure 5 from the original paper) illustrates this.

Figure 5 | Illustration of the "pushcut" behavior on two tasks: "Move Can Pot" (left) and "Place A2B Right" (right). The top row shows the supervised fine-tuned (SFT) policy's behavior, labeled "Grasp"; the bottom row shows the RL policy's behavior, labeled "Push".

  • Analysis: This phenomenon highlights a key advantage of RL over SFT. SFT is fundamentally limited to replicating patterns in the data it has seen. RL, driven by an outcome-based reward, is free to explore the entire action space to find any solution that achieves the goal, even if it's a novel one. This ability to discover "shortcuts" or more efficient strategies is a hallmark of true learning beyond imitation.

6.2.5. Failure Modes Analysis

This section investigates the conditions under which SimpleVLA-RL fails. The key factor explored is the capability of the initial SFT model.

The following are the results from Table 7 of the original paper:

RoboTwin 2.0

| SFT data | Model | Move Can Pot | Place A2B Lift | Place A2B Right | Place Phone Stand | Pick Dual Bottles | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 trajs | SFT | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 trajs | +RL | 0 | 0 | 0 | 0 | 0 | 0 |
| 100 trajs | SFT | 9.4 | 7.8 | 7.8 | 10.1 | 1.2 | 7.3 |
| 100 trajs | +RL | 51.6 | 25.0 | 27.2 | 18.8 | 4.3 | 25.4 |
| 100 trajs | Δ | +42.2 | +17.2 | +19.4 | +8.7 | +3.1 | +18.1 |
| 1000 trajs | SFT | 28.1 | 37.5 | 28.7 | 17.1 | 29.7 | 28.2 |
| 1000 trajs | +RL | 61.2 | 45.3 | 37.5 | 39.6 | 68.3 | 50.4 |
| 1000 trajs | Δ | +33.1 | +7.8 | +8.8 | +22.5 | +38.6 | +22.2 |

  • Analysis:
    1. RL fails from scratch: When starting with a base model that has 0% success rate (0 SFT trajectories), RL makes no progress. Because no successful trajectories are ever generated, the model never receives a positive reward signal to learn from.

    2. Better prior leads to better RL: The model initialized with 1000 SFT trajectories (28.2% initial SR) achieves a much higher final performance (50.4%) than the one initialized with 100 trajectories (7.3% initial SR -> 25.4% final SR). This shows that a stronger initial policy provides a better starting point for exploration.

    3. There's a capability threshold: When the initial success rate is extremely low (e.g., 1.2% on "Pick Dual Bottles"), RL provides only marginal gains (+3.1%). This suggests that the policy must have a minimal level of competence for the exploration process to be fruitful.


7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces SimpleVLA-RL, a simple yet powerful online RL framework for training VLA models. By adapting techniques from LLM reinforcement learning, the authors demonstrate that RL can be a highly effective tool for scaling VLA training. The key findings are that SimpleVLA-RL significantly improves performance on complex manipulation tasks, drastically reduces the need for expensive human demonstration data, enhances the model's generalization capabilities to unseen scenarios, and shows strong sim-to-real transfer. The discovery of the "pushcut" phenomenon further underscores RL's potential to unlock novel and efficient robotic behaviors beyond simple imitation.

7.2. Limitations & Future Work

The paper itself does not dedicate a section to limitations, but several can be inferred from the results:

  • Dependency on Initial SFT Policy: The most significant limitation, as shown in the failure modes analysis, is that the method requires an initial SFT-trained model that already has a non-zero success rate. It cannot learn from a completely "cold start." While it reduces the quantity of SFT data needed, it does not eliminate the need for an initial phase of imitation learning.

  • Sparse Reward Limitations: The binary, outcome-based reward is simple and scalable but may be insufficient for extremely long or complex multi-stage tasks where a successful outcome is exceptionally rare. In such cases, exploration could be inefficient, and some form of reward shaping or curriculum learning might be necessary.

  • Simulation Fidelity: The success of sim-to-real transfer is still limited by the fidelity of the simulation. While domain randomization helps, the gap between simulation and the real world remains a challenge for all simulation-based robotics training.

    Future work could focus on addressing these limitations, such as developing methods to make RL effective from weaker initial policies, exploring more sophisticated reward mechanisms that remain scalable, or integrating RL with automated curriculum generation to tackle progressively harder tasks.

7.3. Personal Insights & Critique

This paper is a strong piece of empirical research that provides compelling evidence for the value of online RL in the VLA domain.

  • Strengths:

    • Clarity and Simplicity: The proposed method is conceptually straightforward and builds upon established techniques, making it easy to understand and replicate.
    • Comprehensive Evaluation: The experiments are thorough, covering multiple benchmarks, data regimes, generalization axes, and a real-world deployment. This provides a robust validation of the claims.
    • Practical Significance: By tackling the data scarcity bottleneck, the work offers a practical path forward for scaling up the capabilities of general-purpose robots. The positive sim-to-real results are particularly promising.
    • Insightful Analysis: The "pushcut" and failure mode analyses provide valuable insights into how and when RL works for VLAs, moving beyond just reporting performance numbers.
  • Critique and Areas for Improvement:

    • Novelty of RL Techniques: The RL techniques themselves (GRPO, dynamic sampling, temperature tuning) are not novel; they are adaptations of methods from the LLM reasoning domain. The paper's main contribution is in the successful application and VLA-specific engineering of these techniques.

    • The "One-Shot" Claim: While the one-shot SFT experiment is impressive, it's important to remember that the model still benefits from large-scale pre-training. The RL phase is refining a very knowledgeable prior, not learning from scratch.

    • Cost of Online RL: Online RL requires running many parallel simulations, which can be computationally expensive. While it saves on human data collection costs, it incurs significant compute costs. A detailed comparison of these trade-offs would be beneficial.

      Overall, SimpleVLA-RL represents a significant and practical step forward. It convincingly argues that the future of training highly capable and adaptable robots will likely involve a synergistic combination of large-scale pre-training, a small amount of imitation learning to bootstrap behavior, and extensive online RL to refine, generalize, and discover new skills.
