AiPaper
Paper status: completed

RaC: Robot Learning for Long-Horizon Tasks by Scaling Recovery and Correction

Published: 09/10/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper introduces `RaC`, a method that improves robot learning for long-horizon tasks by scaling recovery and correction behaviors: robotic policies are fine-tuned on human intervention trajectories that demonstrate these behaviors, improving both efficiency and robustness on complex tasks.

Abstract

Modern paradigms for robot imitation train expressive policy architectures on large amounts of human demonstration data. Yet performance on contact-rich, deformable-object, and long-horizon tasks plateaus far below perfect execution, even with thousands of expert demonstrations. This is due to the inefficiency of existing "expert" data collection procedures based on human teleoperation. To address this issue, we introduce RaC, a new phase of training on human-in-the-loop rollouts after imitation learning pre-training. In RaC, we fine-tune a robotic policy on human intervention trajectories that illustrate recovery and correction behaviors. Specifically, during a policy rollout, human operators intervene when failure appears imminent, first rewinding the robot back to a familiar, in-distribution state and then providing a corrective segment that completes the current sub-task. Training on this data composition expands the robotic skill repertoire to include retry and adaptation behaviors, which we show are crucial for boosting both efficiency and robustness on long-horizon tasks. Across three real-world bimanual control tasks (shirt hanging, airtight container lid sealing, and takeout box packing) and a simulated assembly task, RaC outperforms the prior state-of-the-art using 10× less data collection time and samples. We also show that RaC enables test-time scaling: the performance of the trained RaC policy scales linearly in the number of recovery maneuvers it exhibits. Videos of the learned policy are available at https://rac-scaling-robot.github.io/.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

RaC: Robot Learning for Long-Horizon Tasks by Scaling Recovery and Correction

The title clearly states the paper's core contribution: a method named RaC (Recovery and Correction) designed to improve robot learning, specifically for complex, long-horizon tasks, by focusing on scaling up data related to recovery and correction behaviors.

1.2. Authors

Zheyuan Hu, Robyn Wu, Naveen Enock, Jasmine Li, Riya Kadakia, Zackory Erickson, and Aviral Kumar.

All authors are affiliated with Carnegie Mellon University (CMU), a world-renowned institution for robotics and artificial intelligence research. The authors have research backgrounds in robotic learning, imitation learning, and reinforcement learning, lending significant credibility to the work.

1.3. Journal/Conference

The paper specifies a publication date of September 9, 2025, but does not name a specific conference or journal. However, given the content and timing, it is likely intended for a top-tier robotics conference such as the Conference on Robot Learning (CoRL), Robotics: Science and Systems (RSS), or the IEEE International Conference on Robotics and Automation (ICRA). These venues are highly competitive and influential in the field of robotics.

1.4. Publication Year

2025 (as listed on the preprint).

1.5. Abstract

The abstract highlights a key limitation of modern robot imitation learning: performance plateaus on complex tasks (contact-rich, deformable-object, long-horizon) despite large amounts of expert demonstration data. The authors attribute this to the inefficiency of standard data collection, which focuses on "perfect" executions. To solve this, they introduce RaC, a fine-tuning phase that follows initial imitation learning. In RaC, a human operator intervenes during policy rollouts when failure is imminent. These interventions consist of two parts: first, rewinding the robot to a known good state (recovery), and second, providing a short demonstration to complete the current sub-task (correction). Training on this specific data composition teaches the policy to retry and adapt, boosting robustness and efficiency. The authors demonstrate that RaC outperforms state-of-the-art methods on three real-world bimanual tasks and a simulated task, using 10x less data collection time. A key finding is that the RaC policy exhibits test-time scaling: its performance improves as it executes more recovery maneuvers during deployment.

2. Executive Summary

2.1. Background & Motivation

  • Core Problem: Modern robot learning, particularly imitation learning, has shown great promise by training large models on vast datasets of human demonstrations. However, for complex, long-horizon tasks that involve contact or deformable objects, performance hits a ceiling. Even with thousands of demonstrations, policies struggle to achieve perfect success rates. This is because small errors accumulate over long sequences of actions, a phenomenon known as compounding error.
  • Existing Gaps: Standard imitation learning datasets are heavily biased towards clean, successful trajectories. They teach the robot what to do when everything goes right but fail to teach it how to recover when things go wrong. When the robot inevitably encounters a state slightly different from what it saw in training (an out-of-distribution state), it doesn't know how to get back on track, leading to task failure. Existing human-in-the-loop methods like HG-DAgger collect corrective data but do not explicitly focus on teaching the robot to recover to a known good state first.
  • Innovative Idea: The paper's central idea is that to build robust policies, it is more data-efficient to teach a robot how to recover from failure and retry than to simply show it more examples of perfect success. The authors propose a new data collection paradigm, RaC, that deliberately collects and trains on trajectories containing both recovery (returning to a familiar state) and correction (completing the sub-task from that familiar state). This imbues the policy with a generic, powerful skill: the ability to try again.

2.2. Main Contributions / Findings

  1. A New Data Collection Framework (RaC): The paper introduces RaC, an iterative, human-in-the-loop training phase for imitation learning. The framework is defined by a simple but powerful data collection protocol that structures human interventions to explicitly demonstrate recovery and correction.

  2. Highly Data-Efficient Learning: Across three challenging real-world bimanual manipulation tasks, RaC achieved higher success rates than prior state-of-the-art methods while using an order of magnitude (10x) less data collection time and samples. This demonstrates that scaling the type of data can be more effective than just scaling the quantity.

  3. Discovery of "Test-Time Scaling" in Robotics: A key finding is that policies trained with RaC exhibit a behavior analogous to Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs). At test time, the policy's overall success rate is linearly correlated with the number of recovery maneuvers it performs. Essentially, the more the policy is able to "backtrack and retry," the more likely it is to succeed, a form of scaling computation at inference time.

  4. A Standardized Intervention Protocol: The paper formalizes the intervention process with two simple rules: (1) Recover then Correct, ensuring a balanced diet of both behaviors, and (2) Terminate after Intervention, which improves data efficiency by focusing budget on the most critical parts of the task.


3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Imitation Learning (IL): This is a machine learning paradigm where an agent (e.g., a robot) learns to perform a task by observing and mimicking an expert (e.g., a human teleoperator). The simplest form, Behavior Cloning (BC), treats learning as a supervised learning problem: given an observation (state), predict the expert's action. While effective, IL suffers from "compounding errors" because the policy may drift into states the expert never visited, and it won't know how to act.
  • Human-in-the-Loop Learning: These are methods where a human provides feedback or interventions during the learning process to improve the policy. This is particularly useful for robotics, where collecting perfect, exhaustive expert data beforehand is impractical.
  • DAgger (Dataset Aggregation): A classic interactive imitation learning algorithm. It works as follows: (1) Train a policy on expert data. (2) Run the policy to collect a set of trajectories. (3) Have an expert label the states visited by the policy with the correct actions. (4) Aggregate this new data with the old data and retrain the policy. This iteratively corrects the policy's mistakes.
  • HG-DAgger (Human-in-the-Loop DAgger): An adaptation of DAgger for real-world robotics where a perfect expert policy is not available. Instead of having an expert label all states, a human teleoperator monitors the robot and intervenes only when the policy is about to make a mistake. The intervention data (the human's corrective actions) is then used for retraining (a minimal sketch of this loop follows this list). RaC builds on this idea but structures the interventions differently.
  • Diffusion Models for Policies: Recently popular in robotics, diffusion models are generative models that learn to reverse a noise-adding process. In policy learning, they are trained to take a noisy action sequence and "denoise" it to produce a valid, useful sequence of actions conditioned on the robot's observation. They are highly expressive and can model complex, multi-modal action distributions. The paper uses a variant called a Flow-Matching Policy, which is a more recent and computationally efficient alternative to traditional diffusion models.
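
To make the interactive loop concrete, here is a minimal, hypothetical Python sketch of an HG-DAgger-style procedure; `env`, `policy`, `human`, and `train` are placeholder objects for illustration, not an API from the paper. Note that the human supplies only corrective actions and the episode continues after the takeover, which is precisely what RaC later changes.

```python
# Hypothetical sketch of an HG-DAgger-style interactive imitation learning loop.
# All objects (env, policy, human, train) are illustrative stand-ins.

def hg_dagger(env, policy, human, train, dataset, num_rounds, episodes_per_round):
    """Iteratively aggregate human corrections collected while the policy acts."""
    for _ in range(num_rounds):
        for _ in range(episodes_per_round):
            obs = env.reset()
            done = False
            while not done:
                if human.wants_control(obs):
                    # The human takes over and supplies a corrective action;
                    # only these human-labeled transitions enter the dataset.
                    action = human.act(obs)
                    dataset.append((obs, action))
                else:
                    # Otherwise the current policy acts (its actions are not stored).
                    action = policy.act(obs)
                obs, done = env.step(action)
        # Retrain on everything gathered so far (initial demos + corrections).
        policy = train(dataset)
    return policy
```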

3.2. Previous Works

  • Scaling Data in Robotics (RT-1, ALOHA Unleashed): Recent work has shown that, similar to LLMs, scaling the amount and diversity of training data can lead to more general and capable robot policies. For example, ALOHA Unleashed [50] collected over 5000 demonstrations for a single task (shirt hanging) to achieve high performance. RaC challenges this "brute-force" scaling approach by showing that a more intelligent data composition can achieve better results with far less data.
  • Human-in-the-loop IL (RoboCopilot, IWR): Several systems have been developed to make human interventions more effective. RoboCopilot [46] extends intervention-based learning to mobile manipulation with improved interfaces. Other works [29, 33] explore different ways to combine intervention data, on-policy rollouts, and full demonstrations. RaC differs by explicitly separating interventions into "recovery" and "correction" and demonstrating the unique benefit of learning to recover.
  • Recovery and Correction in IL (Juicer, IntervenGen, VBT): Other researchers have also recognized the importance of recovery data. Juicer [1] and IntervenGen [19] automatically generate recovery trajectories in simulation. Visual Backtracking Teleoperation (VBT) [3] proposes a protocol where operators deliberately demonstrate failure-recovery-success sequences for offline reinforcement learning. RaC distinguishes itself by focusing on the scaling properties of these behaviors within an online, human-in-the-loop imitation learning context and treating recovery as a "skill" to be learned, without modifying the underlying learning algorithm.

3.3. Technological Evolution

The field of robot learning has evolved from learning simple skills from a few demonstrations to tackling complex, long-horizon tasks. This progression has been driven by:

  1. More Data: Moving from small, lab-specific datasets to large-scale, diverse datasets (Open X-Embodiment).

  2. More Expressive Models: Shifting from shallow models to deep neural networks, and more recently, to high-capacity Transformers and Diffusion/Flow models.

  3. More Efficient Learning Paradigms: Moving from pure offline Behavior Cloning to interactive methods like DAgger and HG-DAgger that collect data where the policy actually needs it.

    RaC represents the next step in this evolution. It moves beyond just collecting "more data" or "corrective data" to asking, "What is the right kind of data to collect?" It proposes that a balanced diet of success, recovery, and correction data is the most efficient recipe for robust policies.

3.4. Differentiation Analysis

The core innovation of RaC compared to prior work is its specific, structured approach to data collection that prioritizes learning recovery behaviors.

  • vs. Batched Full Demonstrations (ALOHA Unleashed): Full demonstrations are data-inefficient because they mostly show optimal execution, leaving the policy brittle to small errors. RaC is more efficient because it focuses data collection on the critical failure points and teaches the robot how to handle them.

  • vs. HG-DAgger: HG-DAgger focuses on correction from an out-of-distribution (OoD) state. This requires the policy to learn a new skill from an unfamiliar state, which can be difficult. RaC simplifies the learning problem: first recover to a familiar, in-distribution state (an easier skill to learn), and then correct from there (where the policy already has some knowledge). This two-step process is more sample-efficient.

  • vs. Offline Recovery Methods (VBT): VBT collects recovery data offline by having humans stage failures. RaC collects this data in-the-loop, responding to the actual failures the current policy makes, which is more targeted and effective. Furthermore, RaC frames its contribution within imitation learning and its scaling properties, whereas VBT is studied through the lens of offline RL.

    The following diagram from the paper (Figure 3) illustrates the conceptual difference between a standard HG-DAgger style intervention and the RaC style.

    (Image: trial-progress breakdown showing, for the shirt-hanging sub-tasks on the left and the simulated bimanual-assembly sub-tasks on the right, the percentage of rollouts ending with no progress or at each sub-task as a function of rounds of human collection budget.)


4. Methodology

4.1. Principles

The central principle of RaC is that for long-horizon tasks, the ability to recover from errors and retry is a more critical and data-efficient skill to learn than perfecting every sub-task on the first attempt. Compounding errors are inevitable, so a robust policy must be able to recognize an impending failure, reset itself to a "safe" or "familiar" state, and attempt the sub-task again.

The intuition is that the "skill" of recovery is often easier to learn than the "skill" of task execution. For example, moving a gripper back to an open space it has visited before (recovery) is less precise and has a wider basin of success than inserting a key into a keyhole (execution). By learning the easier recovery skill, the policy gets multiple chances to succeed at the harder execution skill, exponentially increasing its overall probability of success. This is analogous to how a human might solve a puzzle: instead of giving up after one failed attempt, we backtrack and try a different approach.
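
To make the "multiple chances" intuition concrete, consider a simple back-of-the-envelope model (our illustration, not a derivation from the paper): if a single attempt at a hard sub-task succeeds with probability $p$ and the policy can reliably recover and retry up to $n$ times, then the probability that at least one attempt succeeds is

$ P(\text{success within } n \text{ attempts}) = 1 - (1 - p)^n $

so even a modest per-attempt success rate compounds quickly; for example, $p = 0.5$ yields roughly $0.875$ after $n = 3$ attempts, provided the recovery maneuver itself rarely fails.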

4.2. Core Methodology In-depth (Layer by Layer)

RaC is an iterative data collection and training protocol that runs after an initial policy has been pre-trained on a small set of full demonstrations. The process is detailed in Algorithm 1 of the paper.

The following is the data collection protocol presented in Algorithm 1 of the paper:

# Algorithm 1 RaC Data Collection Protocol

# Initialization
1: Given per-round human data collection budget B (measured in frames) and the total number of human intervention rounds K.
2: Initialize flow-matching policy π_θ^(k=0); dataset D_{0:K} ← ∅
3: Collect B frames of expert demonstrations into ΔD_0; D_{0:K} ← ΔD_0; π_θ^(k=0) ← TRAIN(D_{0:K}) via Equation 4.1

# Human Intervention Data Collection Rounds
4: for k = 1 to K do
5:   initialize human policy π_H and intervention function I
6:   ΔD_k ← ∅; b ← 0  ▷ b: data collection budget used in this round
7:   while b < B do
8:     s_0 ← env.reset(); traj ← []; intervened ← false; t ← 0
9:     while not env.done() do
10:      if I(s_t) = 0 then a_t ~ π_θ^(k-1)(·|s_t); is_human ← 0
11:      else a_t ~ π_H(·|s_t); is_human ← 1; intervened ← true  ▷ Rule 1: pair each recovery with a correction
12:      s_{t+1} ← env.step(a_t); traj.push(s_t, a_t, is_human); t ← t + 1
13:      if is_human = 0 and INTERVENTIONDONE() then break  ▷ Rule 2: terminate after the intervention concludes
14:    if intervened = false then  ▷ the trajectory contains no human intervention
15:      ΔD_k ← ΔD_k ∪ traj  ▷ add the full trajectory; no human budget charged
16:    else
17:      ΔD_k ← ΔD_k ∪ {(s, a) ∈ traj : is_human = 1}  ▷ add only the human intervention transitions
18:      b ← b + |traj|  ▷ charge the full episode length to the budget
19:   D_{0:K} ← D_{0:K} ∪ ΔD_k; π_θ^(k) ← TRAIN(D_{0:K})  ▷ aggregate datasets, then retrain via flow matching (Equation 4.1)

Let's break down this process step-by-step.

4.2.1. Initial Pre-training (Round 0)

The process begins by collecting a small initial dataset of full, successful human demonstrations ($\Delta D_0$). An initial policy, $\pi_\theta^{(k=0)}$, is trained on this data using standard imitation learning. This policy serves as the starting point for the RaC fine-tuning loop.

4.2.2. Iterative Human Intervention (Rounds 1 to K)

The core of RaC is a loop that alternates between policy execution, human intervention, data aggregation, and retraining. In each round $k$:

  1. Policy Rollout: The current policy $\pi_\theta^{(k-1)}$ is deployed on the robot to attempt the task.
  2. Human Intervention: A human operator monitors the rollout. If the operator perceives that the policy is about to fail or has entered an unrecoverable state, they intervene using a shared autonomy interface (described below). This intervention is structured according to two critical rules.

RaC's Two Rules for Intervention

  • Rule 1: Recover then Correct. The human intervention is not just a simple correction. It must follow a two-phase structure:

     1. Recovery Phase: The operator first guides the robot backwards or away from the failure state to a previously visited, familiar state. This is a state that is "in-distribution" with respect to the successful demonstration data. The paper proposes a visual aid to help the operator identify these regions: a heatmap of gripper positions from the initial demonstrations (a minimal construction of such a heatmap is sketched below, after Rule 2).

    2. Correction Phase: Once the robot is back in a familiar state, the operator provides a short, corrective demonstration to successfully complete the current sub-task. The following figure from the paper illustrates this process for the shirt-hanging task. When the robot fails to insert the hanger, the human first recovers by moving the hanger backwards and then corrects by re-inserting it properly.

      Figure 5: Visual aid for guiding intervention data collection. We utilize overlaid heatmap of the grippers visitation frequency to illustrate in-distribution regions that a teleoperator should recove… (Image: intervention data collection for the clamshell takeout-box packing task, showing an out-of-distribution state on the left and the recovery on the right, with the marked region indicating where the operator should recover to.)

  • Rule 2: Terminate after Intervention. Once the human's corrective segment is finished, the episode is immediately terminated. The robot does not continue with the rest of the long-horizon task. The rationale is to maintain a clean data distribution. If the policy were to continue, it would be starting the next sub-task from a state reached by a mix of policy and human actions. Training on data from later sub-tasks in such a mixed-distribution rollout might not be helpful for the policy's performance when it attempts those sub-tasks from its own state distribution. Terminating early focuses the data collection budget on improving earlier, more critical sub-tasks.
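
As a concrete illustration of the visual aid mentioned under Rule 1, one simple way to build such a gripper-visitation heatmap is a 2D histogram over end-effector positions pooled from the initial demonstrations; the array names and the assumption that positions are already projected into the overhead-camera view are ours, not the authors' implementation.

```python
import numpy as np

def gripper_visitation_heatmap(demo_positions_xy, bins=64, extent=None):
    """Visitation-frequency map from demonstration gripper positions (sketch).

    demo_positions_xy: (N, 2) array of gripper positions, assumed to be already
    projected into the overhead-camera image plane. The normalized heatmap can
    be alpha-blended over the camera view so the operator sees which regions
    are "in-distribution" and therefore good targets for recovery.
    """
    heatmap, x_edges, y_edges = np.histogram2d(
        demo_positions_xy[:, 0], demo_positions_xy[:, 1], bins=bins, range=extent
    )
    heatmap = heatmap / max(heatmap.max(), 1e-8)  # normalize to [0, 1]
    return heatmap, x_edges, y_edges

# Example with stand-in data (not real demonstration logs).
demo_xy = np.random.rand(5000, 2)
heat, _, _ = gripper_visitation_heatmap(demo_xy, extent=[[0.0, 1.0], [0.0, 1.0]])
```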

  1. Data Aggregation: The human intervention data (both recovery and correction segments) is collected and added to the growing dataset $D_{0:K}$. If a rollout succeeds without any human intervention, that full successful trajectory is also added to the dataset.
  2. Retraining: A new policy, $\pi_\theta^{(k)}$, is trained from scratch on the entire aggregated dataset $D_{0:K}$. This new policy is then used for rollouts in the next round, $k+1$.
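
Condensing Algorithm 1 and the steps above, a minimal Python sketch of one RaC collection round might look like the following; `env`, `policy`, `human`, and `train` are hypothetical placeholders, and the budget accounting mirrors the pseudocode (the full episode length is charged whenever an intervention occurs).

```python
def rac_round(env, policy, human, train, dataset, budget_frames):
    """One RaC human-in-the-loop round (sketch under assumed interfaces, not the authors' code)."""
    used = 0
    while used < budget_frames:
        obs = env.reset()
        traj, intervened = [], False
        while not env.done():
            if human.wants_control(obs):
                # Rule 1: the operator first rewinds the robot to a familiar,
                # in-distribution state, then demonstrates a correction that
                # completes the current sub-task.
                action, is_human = human.act(obs), True
                intervened = True
            else:
                action, is_human = policy.act(obs), False
            traj.append((obs, action, is_human))
            obs = env.step(action)
            # Rule 2: once control has returned to the policy after an
            # intervention, terminate instead of continuing the long-horizon task.
            if intervened and not is_human and human.intervention_done():
                break
        if not intervened:
            # Fully autonomous rollout: keep the whole trajectory, no budget charged.
            dataset.extend((o, a) for o, a, _ in traj)
        else:
            # Keep only the human (recovery + correction) transitions, but charge
            # the full episode length to this round's collection budget.
            dataset.extend((o, a) for o, a, h in traj if h)
            used += len(traj)
    return train(dataset)  # retrain from scratch on the aggregated dataset
```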

4.2.3. Shared Autonomy Interface

To make interventions seamless, the authors designed a lightweight interface using off-the-shelf VR controllers.

  • "Clutch" Mechanism: The operator holds a VR controller. By default, the robot follows its learned policy. When the operator presses a "clutch" button on the controller, they take direct control of the robot's end-effector. Releasing the button returns control to the policy. This allows for instant takeovers.

  • Local-Frame Registration: To avoid the cumbersome need to perfectly align the operator's physical posture with the robot's, the system uses relative pose commands. When the clutch is engaged, the controller's current pose is registered as the origin. Subsequent controller movements are translated into relative end-effector commands ($\Delta p$, $\Delta R$), making the control intuitive and low-friction.
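
The local-frame registration can be illustrated with a short sketch: on clutch press the controller pose is latched as a local origin, and subsequent motion is turned into relative end-effector commands. The class and method names below are assumptions for illustration (using SciPy's rotation utilities), not the authors' interface, and the sketch assumes the controller and robot base frames are roughly aligned.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

class ClutchTeleop:
    """Relative-pose ("clutch") teleoperation sketch."""

    def on_clutch_press(self, ctrl_pos, ctrl_quat, ee_pos, ee_rot):
        # Latch the controller and end-effector poses at the moment of takeover.
        # ee_rot is a scipy Rotation describing the current end-effector orientation.
        self.ctrl_p0, self.ctrl_R0 = np.asarray(ctrl_pos), R.from_quat(ctrl_quat)
        self.ee_p0, self.ee_R0 = np.asarray(ee_pos), ee_rot

    def command(self, ctrl_pos, ctrl_quat):
        # Controller motion since the clutch press (Δp, ΔR) ...
        dp = np.asarray(ctrl_pos) - self.ctrl_p0
        dR = R.from_quat(ctrl_quat) * self.ctrl_R0.inv()
        # ... applied as a relative command to the robot end-effector.
        return self.ee_p0 + dp, dR * self.ee_R0
```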

    The interface is depicted in the paper's Figure 4.

    (Image: charts for the three real-world tasks, shirt hanging, airtight lid sealing, and takeout box packing, plotting success rate against the average number of recoveries per successful trajectory; each panel shows data points with a linear fit and its correlation coefficient r, indicating that success rate rises with the number of recoveries.)

4.2.4. Policy Architecture and Training

The paper uses a high-capacity model to learn from the diverse data (full demos, recoveries, corrections).

  • Architecture: The policy is a Multimodal Diffusion Transformer (MMDiT), a 300M parameter model based on the DiT architecture. It takes as input the robot's proprioceptive state (e.g., joint velocities, end-effector positions) and images from three cameras (one overhead, two on the wrists). The model outputs a chunk of future actions, predicting 1 second (60 timesteps) into the future.

  • Training Objective: The policy is trained using a conditional flow matching objective. Flow matching is a technique for training generative models that learns a vector field to transport a simple noise distribution to a target data distribution. The loss function is:

    $ \mathcal{L}_{\mathrm{Flow}}(\theta) = \mathbb{E}_{o_t, A_t \sim \mathcal{D},\, \tau} \left[ \left\| v_\theta(\tau, o_t, x^\tau) - (A_t - x^0) \right\|_2^2 \right] $ where:

     • $o_t$: The observation at time $t$ (images and robot state).
     • $A_t$: The ground-truth action chunk (a sequence of future actions) from the dataset $\mathcal{D}$.
     • $x^0 \sim \mathcal{N}(0, I_d)$: A sample from a standard normal (noise) distribution.
     • $\tau \sim \mathrm{Unif}([0, 1])$: A random time step between 0 and 1.
     • $x^\tau$: An interpolation between the noise $x^0$ and the data $A_t$ at time $\tau$.
     • $v_\theta(\tau, o_t, x^\tau)$: The velocity vector predicted by the neural network (with parameters $\theta$) at the interpolated point $x^\tau$, conditioned on the observation $o_t$.
     • The objective trains the network $v_\theta$ to predict the direction from the noise sample to the real data sample.
  • Inference: To generate an action, the model starts with random noise and integrates the learned velocity field $v_\theta$ over 10 steps to generate an action chunk. The robot executes the first half of the chunk (0.5 seconds) and then replans.
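
The training objective and the 10-step sampler can be summarized in a compact PyTorch-style sketch; the `velocity_net(tau, obs, x)` signature simply mirrors $v_\theta(\tau, o_t, x^\tau)$ and is an assumption, not the authors' implementation. The action chunk is assumed to have shape (batch, time, action_dim).

```python
import torch

def flow_matching_loss(velocity_net, obs, action_chunk):
    """Conditional flow matching: regress the constant velocity from noise toward the data."""
    x0 = torch.randn_like(action_chunk)            # x^0 ~ N(0, I_d)
    tau = torch.rand(action_chunk.shape[0], 1, 1)  # τ ~ Unif([0, 1]), broadcast over the chunk
    x_tau = (1 - tau) * x0 + tau * action_chunk    # linear interpolation x^τ
    target = action_chunk - x0                     # target velocity (A_t - x^0)
    pred = velocity_net(tau, obs, x_tau)
    return ((pred - target) ** 2).mean()

@torch.no_grad()
def sample_action_chunk(velocity_net, obs, chunk_shape, steps=10):
    """Generate an action chunk by Euler-integrating the learned velocity field from noise."""
    x = torch.randn(chunk_shape)
    dt = 1.0 / steps
    for i in range(steps):
        tau = torch.full((chunk_shape[0], 1, 1), i * dt)
        x = x + dt * velocity_net(tau, obs, x)
    return x
```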

    The policy architecture is shown in the paper's Figure 6.

    (Image: intervention-skill composition and bimanual-assembly ablation results; the left bar chart shows the number of recovery and correction frames collected under different human collection budgets, and the right line chart shows task success rates under different conditions.)


5. Experimental Setup

5.1. Datasets

The authors do not use a pre-existing dataset. Instead, the dataset is created dynamically through their experimental procedure for each task. The experiments are conducted on four challenging bimanual manipulation tasks, designed to be long-horizon and involve contact and deformability.

  • Real-World Tasks:
    1. Shirt-hanging: A 5-step task involving picking up a hanger, passing it between grippers, inserting it into a shirt, and hanging it on a rack.
    2. Airtight-container-lid-sealing: A 5-step task of picking up a lid, placing it, and snapping four tabs in sequence.
    3. Clamshell-takeout-box-packing: A 7-step task involving picking a box, using a spatula to scoop a burger into it, and closing the box.
  • Simulated Task:
    1. Bimanual-assembly: A 3-step task of assembling blocks and placing the final assembly on a platform.

      For each task, an initial budget of full demonstrations is collected. Then, for RaC and HG-DAgger, further data is collected iteratively using a per-round budget of human intervention time. The tasks and setups are shown in Figure 7.

      (Image: sub-tasks and intervention process for the airtight-container-lid-sealing task, comprising sub-task 1: pick up the lid, sub-task 2: place the lid, and sub-task 3: snap the tabs, with failure situations, recovery maneuvers, and correction steps annotated.)

5.2. Evaluation Metrics

The paper uses two primary metrics to evaluate performance:

  1. Task Success Rate:

    • Conceptual Definition: This metric measures the percentage of evaluation trials where the robot successfully completes all sub-tasks in the entire long-horizon sequence. It is a strict, binary measure of overall task completion (1 for full success, 0 for any failure).
    • Mathematical Formula: It is the empirical success probability, calculated as: $ \text{Success Rate} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} $
    • Symbol Explanation: For this proportion, the paper computes a 95% Wilson score interval, which is a method for calculating a confidence interval for a binomial proportion that works well even for small numbers of trials or when the success rate is close to 0 or 1.
  2. Task Progress Score:

    • Conceptual Definition: This metric provides a more granular measure of performance by counting the number of sub-tasks a policy successfully completes before failing. It gives partial credit for completing parts of the task.
    • Mathematical Formula: For a single trial, it is simply: $ \text{Progress Score} = \text{Number of sub-tasks completed successfully} $ The paper reports the average progress score across all trials.
    • Symbol Explanation: This is a simple count. The paper computes a standard 95% confidence interval for the mean of this score across trials.
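
For reference, the 95% Wilson score interval used for the success-rate error bars can be computed with the standard formula below (z = 1.96 for 95% coverage); this is a generic statistics sketch, not code from the paper.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score confidence interval for a binomial proportion."""
    if trials == 0:
        return (0.0, 0.0)
    p = successes / trials
    denom = 1 + z ** 2 / trials
    center = (p + z ** 2 / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2))
    return (max(0.0, center - half_width), min(1.0, center + half_width))

# Example: 47 successes out of 60 trials (hypothetical numbers, not results from the paper).
print(wilson_interval(47, 60))
```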

5.3. Baselines

The performance of RaC is compared against two main approaches:

  1. Batched Full Expert Data Collection: This is the standard imitation learning paradigm. Instead of iterative fine-tuning, the entire data collection budget is used to collect a large, single batch of full human demonstrations. The policy is then trained once on this large dataset. This baseline represents the "scale up data quantity" approach.

  2. HG-DAgger: This represents the state-of-the-art in human-in-the-loop imitation learning. In the paper's implementation, when the human intervenes, they provide only a corrective segment to complete the sub-task from the point of failure. Unlike RaC, there is no explicit "recovery" phase, and the episode continues after the intervention (Rule 2 is not applied). This allows for a direct ablation of RaC's key ideas.


6. Results & Analysis

6.1. Core Results Analysis

RaC demonstrates significant improvements in both absolute performance and data efficiency across all tasks.

1. Order-of-Magnitude Data Efficiency Gain: The most striking result is the comparison with prior state-of-the-art work on the shirt-hanging task. The following are the results from Table 1 of the original paper:

| Name | Policy Architecture | Model Size | Training Data Size | SR |
| --- | --- | --- | --- | --- |
| ALOHA Unleashed [50] | Diffusion Transformer policy | 217M | ∼89 hours (5345 shirt-hanging expert demos) | 75.0% |
| Seed GR-3 [7] | Vision-Language-Action model | 4B | 116 hours of shirt-hanging expert demos and vision-language data | ∼63.6% |
| Ours (RaC) | Flow-matching Transformer policy | 368M | 5 hours (RaC data: expert, recovery, and correction) | 78.3% |

As the table shows, RaC achieves a higher success rate (78.3%) than ALOHA Unleashed (75.0%) using only 5 hours of data, compared to the estimated 89 hours for ALOHA Unleashed. This is a dramatic improvement in data efficiency of over an order of magnitude (more than 10x).

2. Superior Scaling over Baselines: The scaling plots in Figure 8 (referred to as Figure 6 in the text) show how performance improves as more human data is collected.

(Image: the three shirt-hanging sub-tasks and the intervention process; each step is annotated with its task state, including grasping the hanger, handing the hanger to the other arm, and inserting the hanger, together with possible failures and their recovery and correction: on failure, the robot backs up and resets the grippers, then re-inserts the hanger with the correct alignment.)

Across all three real-world tasks, RaC (blue line) consistently achieves higher task progress and success rates than HG-DAgger (orange line) and Batched Data (green line) for the same amount of data collection budget. Crucially, the slope of the RaC curve is much steeper, indicating that each additional hour of human intervention data yields a larger performance gain. This confirms that the RaC data collection strategy is fundamentally more efficient.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Robustness of Intermediate Policies

Figure 9 analyzes the distribution of failures across sub-tasks. For RaC, as more rounds of intervention are performed, the number of rollouts that fail on early sub-tasks ("No Progress", "Sub-task 1") rapidly diminishes. The policy systematically learns to overcome the initial hurdles. In contrast, the Batched Data baseline still has a significant portion of trials failing early on, even with a large dataset. This shows that RaC is more effective at improving robustness and eliminating the "long tail" of easy failures.

(Image: execution of the takeout-box packing task, annotated with three sub-tasks and the corresponding failure and correction steps: grasping the spatula, scooping the burger, and placing the burger; the failure state is a gripper misaligned with the spatula, and the correction backs up and resets the gripper to complete the first sub-task.)

6.2.2. "CoT-style" Test-Time Scaling

Figure 10 presents one of the most interesting findings of the paper. It plots the final task success rate against the average number of recovery maneuvers performed by the policy during successful rollouts.

Figure 2: Bimanual manipulation robot system. An illustration of our bimanual robot setup showing camera placements and workspace setup.

There is a clear positive linear correlation (high $r$ values) for all three tasks. This means that as the policy gets better (through more rounds of RaC training), it learns to use recovery and retry more often, and this behavior directly contributes to its higher success rate. This is termed "test-time scaling" because the policy improves its performance by spending more "computational effort" (in this case, time and actions for recovery) during deployment. This is analogous to how advanced LLMs improve their reasoning by generating longer, more elaborate chains of thought that include self-correction.
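
This analysis is easy to reproduce on one's own evaluation logs: for each checkpoint, record the success rate and the average number of recovery maneuvers in successful rollouts, then fit a line and report the correlation coefficient $r$. The sketch below uses NumPy with placeholder arrays that are not the paper's data.

```python
import numpy as np

def test_time_scaling_fit(avg_recoveries, success_rates):
    """Linear fit of success rate vs. average recoveries; returns (slope, intercept, r)."""
    slope, intercept = np.polyfit(avg_recoveries, success_rates, deg=1)
    r = np.corrcoef(avg_recoveries, success_rates)[0, 1]
    return slope, intercept, r

# Placeholder per-checkpoint measurements; substitute your own evaluation statistics.
avg_recoveries = np.array([0.2, 0.5, 0.9, 1.4])
success_rates = np.array([0.30, 0.45, 0.60, 0.75])
print(test_time_scaling_fit(avg_recoveries, success_rates))
```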

6.2.3. Analysis of Rollout Length

Figure 11 shows the distribution of lengths for successful rollouts. RaC policies produce the longest successful rollouts on average. This is not a negative result; it's a direct consequence of the learned recovery behavior. The extra length comes from the policy taking time to backtrack and retry sub-tasks, which ultimately leads to success. In contrast, policies trained only on full demonstrations tend to have shorter successful rollouts because they can likely only succeed when they get everything right on the first try and stay strictly within the narrow distribution of the training data.

(Image: comparison of HG-DAgger-style (left) and RaC-style (right) data and training strategies; in HG-DAgger the human typically provides only a correction after a failure, whereas RaC adds a recovery maneuver, expanding the robot's skill repertoire and improving efficiency and robustness on long-horizon tasks.)

6.2.4. Importance of the RaC Protocol Rules

Figure 12 provides a direct ablation of the two rules proposed in RaC.

  • Rule 1 (Recover-then-Correct): The left chart shows the composition of intervention data. Without Rule 1 ("Ours w/o Rule 1", equivalent to HG-DAgger + Rule 2), the data collected is heavily skewed towards correction, with very little recovery behavior. With Rule 1 (RaC), the data has a much more balanced ratio of recovery-to-correction frames. The right chart shows that this balanced diet leads to significantly better performance.

  • Rule 2 (Terminate-after-Intervention): The right chart compares RaC without Rule 1 (labeled "Ours w/o Rule 1") to HG-DAgger (labeled "Ours w/o Rule 1 & 2"). Simply adding the termination rule provides a noticeable performance boost over standard HG-DAgger. This confirms the hypothesis that terminating early improves data efficiency by focusing the collection budget on the most relevant state distributions.

    (Image: the VR controller interface and policy execution; the left panel labels the controller buttons, with button A used to start a task, button B to end it, and the side button used during operation, while the right panel shows the robot's behavior under the policy at different points in time, highlighting the human-robot handoff during control.)

Together, these ablations show that both rules are critical components of RaC's success.


7. Conclusion & Reflections

7.1. Conclusion Summary

The paper presents RaC, a new paradigm for robot imitation learning that addresses the problem of performance plateaus in long-horizon tasks. The core insight is that it is more data-efficient to scale the collection of recovery and correction behaviors than to simply scale the quantity of perfect expert demonstrations. By using a structured, human-in-the-loop protocol, RaC teaches policies to mitigate compounding errors by learning to retry from failures. Experiments on challenging real-world tasks show that RaC achieves state-of-the-art performance with an order of magnitude less data. Furthermore, the paper identifies a novel "test-time scaling" phenomenon in robotics, where policy performance correlates with the number of recovery attempts, drawing a compelling parallel to chain-of-thought reasoning in LLMs.

7.2. Limitations & Future Work

The authors suggest several promising avenues for future research:

  • Initialization for Reinforcement Learning (RL): Policies trained with RaC could serve as excellent starting points for online RL fine-tuning. Because they already possess structured exploration behavior (through recovery), they might learn more safely and efficiently than standard IL-initialized policies.
  • Application to Generalist Models: The RaC fine-tuning methodology could be applied to large, pre-trained Vision-Language-Action (VLA) models (e.g., RT-2, Octo) to specialize them for complex tasks and improve their robustness.
  • Systematic Study of Emergent Recovery: The authors propose rigorously studying whether recovery behaviors emerge naturally in very large VLA models and plotting test-time scaling curves for them, which would be a valuable contribution to the community.

7.3. Personal Insights & Critique

  • Personal Insights: This paper is an excellent example of how a simple, well-founded idea can lead to significant practical gains. The shift in perspective from "how do we avoid errors?" to "how do we recover from errors?" is powerful. The analogy to chain-of-thought reasoning is particularly insightful and connects robotics to the broader principles of scalable intelligence being explored in the LLM space. The "test-time scaling" result is a genuinely novel contribution and provides a new way to think about and measure robot policy robustness.
  • Critique and Potential Issues:
    • Scalability of Human Intervention: While RaC is far more data-efficient than collecting full demonstrations, it still relies on a skilled human operator for the intervention loop. The scalability of this process to hundreds or thousands of different tasks remains an open question.
    • Heuristic Nature of "In-Distribution": The method for guiding recovery relies on a visual heatmap of past gripper positions, which is a heuristic for "in-distribution" states. A more principled or automated way to define and identify these recoverable states could make the method even more powerful and less reliant on human intuition.
    • Complexity of Recovery: The paper's core intuition is that recovery is "easier" to learn. While this holds for the tasks presented (e.g., moving an arm back into open space), one can imagine scenarios where recovery is itself a highly complex, multi-step process. In such cases, the data efficiency benefits of RaC might be reduced. However, for a vast range of manipulation tasks, the assumption appears to be a very effective one.
