
ResMimic: From General Motion Tracking to Humanoid Whole-body Loco-Manipulation via Residual Learning

Published: 10/07/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

ResMimic is a two-stage residual learning framework that enhances humanoid control for loco-manipulation. By refining outputs from a general motion tracking policy trained on human data, it significantly improves task success and training efficiency, as shown in simulation and real-world experiments on a Unitree G1 humanoid.

Abstract

Humanoid whole-body loco-manipulation promises transformative capabilities for daily service and warehouse tasks. While recent advances in general motion tracking (GMT) have enabled humanoids to reproduce diverse human motions, these policies lack the precision and object awareness required for loco-manipulation. To this end, we introduce ResMimic, a two-stage residual learning framework for precise and expressive humanoid control from human motion data. First, a GMT policy, trained on large-scale human-only motion, serves as a task-agnostic base for generating human-like whole-body movements. An efficient but precise residual policy is then learned to refine the GMT outputs to improve locomotion and incorporate object interaction. To further facilitate efficient training, we design (i) a point-cloud-based object tracking reward for smoother optimization, (ii) a contact reward that encourages accurate humanoid body-object interactions, and (iii) a curriculum-based virtual object controller to stabilize early training. We evaluate ResMimic in both simulation and on a real Unitree G1 humanoid. Results show substantial gains in task success, training efficiency, and robustness over strong baselines. Videos are available at https://resmimic.github.io/ .

In-depth Reading

1. Bibliographic Information

1.1. Title

The central topic of the paper is "ResMimic: From General Motion Tracking to Humanoid Whole-body Loco-Manipulation via Residual Learning." This title indicates a focus on enabling humanoids to perform complex tasks involving both movement (loco-) and interaction with objects (-manipulation) by building upon existing motion tracking capabilities through a novel residual learning approach.

1.2. Authors

The authors are:

  • Siheng Zhao

  • Yanjie Ze

  • Yue Wang

  • C. Karen Liu

  • Pieter Abbeel

  • Guanya Shi

  • Rocky Duan

    Their affiliations include Amazon FAR (Frontier AI & Robotics), USC, Stanford University, UC Berkeley, and CMU. These affiliations suggest a strong background in robotics, artificial intelligence, and machine learning, particularly within prominent academic institutions and leading industry research labs.

1.3. Journal/Conference

The paper was posted to arXiv at 2025-10-06T17:47:02 UTC. The specific journal or conference is not mentioned in the provided text; the arXiv link indicates it is a preprint accessible to the academic community. Papers posted on arXiv are often subsequently submitted to top-tier conferences or journals in robotics, AI, or machine learning, which undergo rigorous peer review. Given the authors' affiliations and the nature of the research, it is likely intended for a highly reputable venue.

1.4. Publication Year

The publication year, based on the provided UTC timestamp, is 2025.

1.5. Abstract

The paper introduces ResMimic, a two-stage residual learning framework designed to achieve precise and expressive humanoid whole-body loco-manipulation from human motion data. The core idea is to first use a General Motion Tracking (GMT) policy, pre-trained on extensive human motion data, as a foundation for generating human-like movements. This GMT policy is task-agnostic. In the second stage, a task-specific residual policy is learned to refine the GMT outputs. This refinement improves locomotion and enables object interaction, which the base GMT policy lacks. To enhance training efficiency, ResMimic incorporates several novel components: (i) a point-cloud-based object tracking reward for smoother optimization, (ii) a contact reward to guide accurate humanoid-object interactions, and (iii) a curriculum-based virtual object controller to stabilize initial training. The framework's effectiveness is demonstrated through evaluations in both simulation and on a real Unitree G1 humanoid, showing significant improvements in task success, training efficiency, and robustness compared to existing baselines.

The original source link is https://arxiv.org/abs/2510.05070v2. The PDF link is https://arxiv.org/pdf/2510.05070v2.pdf. This indicates the paper is a preprint available on arXiv, a popular open-access archive for scholarly articles.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is enabling humanoid robots to perform whole-body loco-manipulation tasks with both precision and expressiveness. This capability involves combining locomotion (movement) and manipulation (object interaction) in a coordinated, human-like manner.

This problem is highly important due to the transformative potential of humanoids in various real-world applications, such as daily services and warehouse tasks. Humanoid robots, unlike wheeled or quadruped robots, are designed to operate in human-centric infrastructure, leveraging environments built for humans.

However, several challenges and gaps exist in prior research:

  • Precision and Object Awareness: While General Motion Tracking (GMT) policies can reproduce diverse human motions, they often lack the precision required for loco-manipulation and are fundamentally unaware of manipulated objects.

  • Embodiment Gap: Directly imitating human motions for loco-manipulation is challenging because human demonstrations often involve contact locations and relative object poses that do not directly translate to humanoid robots due to differences in physical form, size, and capabilities. This can lead to issues like floating contacts (where the robot's hand appears to touch an object but doesn't exert force) or penetrations (where the robot's body passes through an object), as illustrated in Figure 2.

  • Scalability and Generality: Existing humanoid loco-manipulation approaches are often task-specific, relying on complex designs like stage-wise controllers or handcrafted data pipelines. These approaches limit the scalability and generality of solutions, meaning a new solution might be needed for every new task.

  • Lack of Unified Framework: There is no existing unified, efficient, and precise framework for humanoid loco-manipulation that can handle diverse tasks.

    The paper's entry point or innovative idea is inspired by breakthroughs in foundation models, which leverage pre-training on large-scale data followed by post-training (fine-tuning) for robust and generalized performance. The key insight is that while diverse human motions can be captured by a pre-trained GMT policy, object-centric loco-manipulation primarily requires task-specific corrections. Many whole-body motions (e.g., balance, stepping, reaching) are shared across tasks, requiring adaptation only for fine-grained object interaction. This motivates a residual learning paradigm, where a stable motion prior (from GMT) is augmented with lightweight, task-specific adjustments.

2.2. Main Contributions / Findings

The paper makes four primary contributions:

  1. Two-Stage Residual Learning Framework: The authors propose ResMimic, a novel two-stage residual learning framework that efficiently combines a pre-trained General Motion Tracking (GMT) policy with task-specific corrections. This enables precise and expressive humanoid loco-manipulation by decoupling the general motion generation from the fine-grained object interaction, leading to improved data efficiency and a general framework for loco-manipulation and locomotion enhancement.

  2. Enhanced Training Efficiency and Sim-to-Real Transfer Mechanisms: To overcome challenges in training and deployment, ResMimic introduces several technical innovations:

    • A point-cloud-based object tracking reward that offers a smoother optimization landscape compared to traditional pose-based rewards, making training more stable.
    • A contact reward that explicitly guides the humanoid towards accurate body-object contacts, which is crucial for realistic physical interactions and sim-to-real transfer.
    • A curriculum-based virtual object controller that stabilizes early training phases by providing temporary assistance, especially when dealing with noisy reference motions or heavy objects, preventing premature policy failures.
  3. Extensive Evaluation and Robustness: The paper conducts comprehensive evaluations in both high-fidelity simulation environments (IsaacGym and MuJoCo) and on a real Unitree G1 humanoid robot. The results demonstrate substantial improvements in human motion tracking, object motion tracking, task success rate, training efficiency, robustness, and generalization across challenging loco-manipulation tasks compared to strong baselines.

  4. Resource Release for Research Acceleration: To foster further research in the field, the authors commit to releasing their GPU-accelerated simulation infrastructure, a sim-to-sim evaluation prototype, and motion data. This contribution aims to lower the barrier for other researchers to build upon their work.

    The key findings are that ResMimic successfully addresses the limitations of GMT policies by adding object awareness and precision through residual learning. It significantly outperforms baselines that either directly deploy GMT without adaptation, train from scratch, or use simple fine-tuning. The framework demonstrates real-world applicability and robustness, highlighting the power of a pretrain-finetune paradigm for humanoid whole-body control.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand the ResMimic framework, a novice reader should be familiar with several core concepts in robotics, machine learning, and control theory.

3.1.1. Humanoid Whole-Body Loco-Manipulation

Humanoid robots are robots designed to resemble the human body, typically featuring a torso, head, two arms, and two legs. Whole-body control refers to the coordinated use of all these limbs and the torso to perform tasks. Loco-manipulation is a portmanteau combining locomotion (movement, like walking, kneeling, or balancing) and manipulation (interacting with objects, like grasping, lifting, or carrying). Therefore, humanoid whole-body loco-manipulation refers to the ability of a humanoid robot to move its entire body in a coordinated way to interact with and manipulate objects in its environment, often while simultaneously moving or maintaining balance. This is a complex task because it requires managing high degrees of freedom, maintaining stability, and understanding physical interactions with objects and the environment.

3.1.2. General Motion Tracking (GMT)

General Motion Tracking (GMT) is a technique in robotics where a robot learns to imitate a wide variety of human movements or reference motions. These motions are typically captured from humans using motion capture (MoCap) systems, which record the 3D positions and orientations of markers placed on a human body. A GMT policy, often trained using Reinforcement Learning (RL), aims to reproduce these motions on a robot as accurately as possible. The goal is to make the robot's movements appear natural and human-like. However, traditional GMT policies are primarily focused on the kinematics (movement) of the robot's own body and are usually task-agnostic, meaning they don't have explicit knowledge or awareness of external objects or the specific goals of a manipulation task.

3.1.3. Residual Learning

Residual learning is a machine learning paradigm where a model learns to predict the "residual" or "difference" between a desired output and an output produced by a base model or a simpler system. Instead of learning the entire complex function from scratch, the residual model focuses on learning only the corrections needed to improve the base model's output. In the context of control, this often means training a residual policy to output small adjustments to the actions suggested by a base policy or a classical controller. This approach can be more efficient because the residual model only needs to learn the "hard" parts of the problem, while the base model handles the "easy" or general aspects. It often leads to faster training, better stability, and improved performance.
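To make the idea concrete, below is a minimal, hypothetical Python sketch of residual action composition; the `base_policy` and `residual_policy` callables and the 29-dimensional action are illustrative stand-ins, not the paper's code.

```python
import numpy as np

def compose_action(base_policy, residual_policy, robot_obs, task_obs):
    """Combine a base action with a learned, task-specific correction.

    base_policy: maps robot observations to a coarse action (e.g., target joint angles).
    residual_policy: maps robot + task observations to a small corrective action.
    """
    a_base = base_policy(robot_obs)                 # general, task-agnostic behavior
    delta = residual_policy(robot_obs, task_obs)    # learned correction, ideally near zero at initialization
    return a_base + delta                           # command sent to the low-level controller

# Toy usage with dummy policies (purely illustrative).
rng = np.random.default_rng(0)
base = lambda obs: np.zeros(29)                     # e.g., hold a nominal pose
residual = lambda obs, task: 0.01 * rng.standard_normal(29)
action = compose_action(base, residual, robot_obs=None, task_obs=None)
print(action.shape)  # (29,)
```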

3.1.4. Reinforcement Learning (RL) and PPO

Reinforcement Learning (RL) is an area of machine learning concerned with how intelligent agents should take actions in an environment to maximize the cumulative reward. An RL agent learns by trial and error, observing the state of the environment, taking an action, and receiving a reward signal, which indicates how good or bad that action was. The goal is to learn a policy, which is a mapping from states to actions, that maximizes the expected cumulative discounted reward over time.

The process is often modeled as a Markov Decision Process (MDP), defined by a tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma \rangle$:

  • $\mathcal{S}$: The state space, representing all possible configurations of the environment.

  • $\mathcal{A}$: The action space, representing all possible actions the agent can take.

  • $\mathcal{T}$: The transition dynamics, which describe how the environment changes from one state to another given an action. Mathematically, $P(s' \mid s, a)$ is the probability of reaching state $s'$ from state $s$ after taking action $a$.

  • $\mathcal{R}$: The reward function, which defines the immediate reward the agent receives for taking action $a$ in state $s$, $R(s, a)$.

  • $\gamma$: The discount factor (a value between 0 and 1), which determines the present value of future rewards. A higher $\gamma$ makes future rewards more significant.

    Proximal Policy Optimization (PPO) is a popular on-policy Reinforcement Learning algorithm. It's known for its balance between ease of implementation, sample efficiency, and good performance across a wide range of tasks. PPO works by trying to take the largest possible step towards a better policy without collapsing into a bad policy. It achieves this by using a clipped objective function that discourages large policy changes, thus keeping the new policy "proximal" to the old one. This stability makes it suitable for complex control problems like humanoid robotics.
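For reference, here is a minimal sketch of PPO's clipped surrogate objective, which is the mechanism that keeps updates "proximal"; the tensors below are random placeholders rather than data from the paper's training pipeline.

```python
import torch

def ppo_clipped_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective used by PPO.

    log_probs_*: log pi(a|s) under the new/old policies for sampled actions.
    advantages: estimated advantages A(s, a) for those samples.
    """
    ratio = torch.exp(log_probs_new - log_probs_old)                  # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                      # maximize surrogate = minimize its negative

# Toy usage with random data.
n = 8
loss = ppo_clipped_loss(torch.randn(n), torch.randn(n), torch.randn(n))
print(float(loss))
```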

3.1.5. Sim-to-Real Transfer

Sim-to-real transfer is the process of training a robot control policy in a simulated environment (which is generally safer, faster, and cheaper) and then deploying that trained policy on a physical robot in the real world. A major challenge is the sim-to-real gap, which refers to the discrepancies between the simulated environment (physics, sensor noise, modeling inaccuracies) and the real world. Policies trained purely in simulation may perform poorly or fail entirely when transferred to reality due to these differences. Techniques like domain randomization (varying simulation parameters during training) are used to make policies more robust to these unmodeled real-world variations, thereby improving sim-to-real transfer.
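As a purely illustrative sketch of domain randomization (the parameter names and ranges below are assumptions, not values from the paper), physics parameters might be resampled at the start of every training episode:

```python
import random

def sample_physics_params():
    """Sample randomized simulation parameters for one training episode.

    Ranges are illustrative placeholders; real systems tune them per robot and simulator.
    """
    return {
        "ground_friction": random.uniform(0.4, 1.2),     # vary contact friction
        "link_mass_scale": random.uniform(0.9, 1.1),     # +/-10% mass perturbation
        "motor_strength_scale": random.uniform(0.9, 1.1),
        "obs_noise_std": random.uniform(0.0, 0.02),      # sensor noise injected into observations
        "action_delay_steps": random.randint(0, 2),      # simulate control latency
    }

for episode in range(3):
    params = sample_physics_params()
    print(f"episode {episode}: {params}")  # apply these to the simulator before the rollout
```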

3.1.6. Kinematic Retargeting

Kinematic retargeting is the process of transferring a motion from one character (e.g., a human) to another character (e.g., a humanoid robot) with a different skeletal structure, proportions, or degrees of freedom. The goal is to make the target character perform the same general motion while respecting its own anatomical constraints. This often involves mapping joints from the source to the target, adjusting for limb lengths, and ensuring that the overall style and intent of the motion are preserved. The embodiment gap is a significant challenge here, as direct translation of human contact points or postures might be physically impossible or lead to instability for a robot. GMR [39] (General Motion Retargeting) is a specific method mentioned in the paper for this purpose.

3.2. Previous Works

The paper contextualizes its contributions by discussing prior work in learning-based humanoid control, humanoid loco-manipulation, and residual learning in robotics.

3.2.1. Learning-Based Humanoid Control

Early Reinforcement Learning (RL) for humanoids faced challenges like low data efficiency and the need for substantial effort in task-specific reward design [17]. This often limited work to simpler tasks like locomotion [18], [19] or highly specific skills such as getting up [20] and keeping balance [21].

A promising direction has been learning from human motions [5], which involves kinematic retargeting to map human demonstrations to robots, addressing the embodiment gap. This led to accurate tracking of individual motions [22]-[24] and more versatile general motion tracking (GMT) [6], [7], [25], [26].

  • DeepMimic [5]: A foundational work that enabled physics-based character skills to be learned from human motion examples using deep RL. It showed how to achieve highly dynamic and diverse motions.

  • TWIST [6]: A Teleoperated Whole-body Imitation System that allows humanoids to reproduce diverse human motions, often serving as a strong GMT baseline. The authors refer to TWIST for reward formulation and domain randomization details in their GMT training.

  • GMT [7]: A general motion tracking framework for humanoid whole-body control.

  • VideoMimic [27]: Made progress by not only tracking human motion but also reconstructing the environment, enabling contextual motions like sitting on a chair. However, it was limited to static environments and did not extend to dynamic object interactions.

    The key differentiation of ResMimic from these GMT-focused works is its explicit focus on dynamic loco-manipulation with object awareness, which these prior GMT policies lacked.

3.2.2. Humanoid Loco-Manipulation

Humanoid loco-manipulation is particularly challenging.

  • Teleoperation [6], [28]-[30]: Some works have used teleoperation to control humanoids for these tasks. However, these methods require a human operator and lack explicit object awareness.
  • Autonomous Imitation Learning [4], [31]: Building on teleoperation, others have trained autonomous imitation learning policies from collected data. These efforts, however, were often restricted to tabletop manipulation with limited whole-body expressiveness, tasks potentially better suited for dual-arm mobile manipulators.
  • Modular and Trajectory-Optimized Approaches [10], [11]:
    • Dao et al. [10]: Proposed a modular sim-to-real RL pipeline for box loco-manipulation, decomposing the task into distinct phases (e.g., walking, box-picking) with separate policies. This implies a lack of end-to-end learning and potentially less natural transitions.

    • Liu et al. [11]: Introduced an end-to-end learning pipeline using reference motions generated by task-specific trajectory optimization. This approach, while end-to-end, still relies on task-specific designs for reference motion generation.

      The paper argues that these loco-manipulation approaches demonstrated limited whole-body contact (e.g., using only hands) and expressiveness, and relied on highly task-specific designs. ResMimic aims to overcome this by leveraging a GMT policy as a prior, enabling more expressive whole-body loco-manipulation within a unified framework.

3.2.3. Residual Learning for Robotics

Residual learning has been widely adopted in robotics to refine existing controllers or policies.

  • Refining Hand-Designed Policies [32], [33]: Early works used residual policies to improve hand-designed policies or model predictive controllers for more precise and contact-rich manipulation.

  • Refining Imitation Policies [34], [35]: Later approaches extended residual learning to policies initialized from demonstrations.

  • Dexterous Hand Manipulation [36], [37]: In dexterous hand manipulation, residual policies adapted human hand motions for task-oriented control.

  • ASAP [22]: A notable work in humanoids that leverages residual learning to compensate for dynamics mismatch between simulation and reality, enabling agile whole-body skills.

    ResMimic differentiates itself by leveraging a pre-trained General Motion Tracking (GMT) policy as its foundation. This is distinct from refining classical controllers, arbitrary demonstrations, or solely focusing on sim-to-real dynamics. Instead, ResMimic uses residual learning to add object awareness and task-specific precision on top of a robust, general human-like motion prior.

3.3. Technological Evolution

The field of humanoid control has evolved significantly:

  1. Classical Control/Model-Based Approaches: Early humanoid control often relied on model-based control (e.g., whole-body control with inverse kinematics/dynamics) and pre-programmed motions. These were precise but often lacked adaptability to unforeseen disturbances or varied tasks.
  2. Reinforcement Learning for Locomotion: The advent of RL allowed humanoids to learn more dynamic and robust locomotion skills (walking, running, balancing) by interacting with environments, moving beyond purely model-based methods. However, RL often struggled with data efficiency and complex reward design.
  3. Imitation Learning from Human Motion (GMT): To overcome the complexity of reward design and achieve more natural motions, researchers turned to imitation learning, specifically GMT. By retargeting human motion data, humanoids could learn diverse and expressive whole-body movements. This represented a significant step towards human-like behavior.
  4. Foundation Models & Pre-training Paradigm: Inspired by foundation models in NLP and vision, the idea of pre-training large, general models on vast datasets became prevalent. In robotics, this translates to training powerful base policies on general data (like GMT on human motion) that can then be adapted for specific downstream tasks.
  5. Residual Learning for Task-Specific Adaptation: This paper's work, ResMimic, fits into the latter part of this evolution. It recognizes that GMT provides a powerful motion prior but is insufficient for loco-manipulation due to a lack of object awareness and precision. By applying residual learning, ResMimic efficiently adapts this strong GMT foundation to object-centric tasks, effectively bridging the gap between general human-like motion and precise humanoid loco-manipulation.

3.4. Differentiation Analysis

Compared to the main methods in related work, ResMimic introduces several core differences and innovations:

  • Unified Framework for Expressive Loco-Manipulation: Unlike previous loco-manipulation works that were often task-specific [10], [11] or limited in whole-body expressiveness and contact richness, ResMimic leverages a powerful GMT prior. This enables a unified framework that can handle diverse, expressive whole-body loco-manipulation tasks involving complex contacts (e.g., using torso, arms, not just hands).
  • Decoupling Motion Generation from Object Interaction: The two-stage residual learning framework is a key innovation. Instead of trying to learn both general motion and precise object interaction simultaneously from scratch (which Train from Scratch attempts and fails at) or directly fine-tuning GMT without explicit object input (Finetune), ResMimic separates these concerns. The first stage provides a robust, task-agnostic human-like motion prior, and the second stage efficiently learns only the task-specific corrections for object interaction.
  • Effective Handling of Embodiment Gap and Noisy Data: ResMimic addresses the embodiment gap and imperfect human-object interaction data (leading to penetrations or floating contacts) through its specialized training mechanisms:
    • The virtual object controller stabilizes early training by assisting object manipulation, allowing the residual policy to learn robust interactions despite initial data imperfections.
    • The contact reward explicitly encourages correct body-object contact, which is crucial for real-world deployability and sim-to-real transfer, going beyond simple pose matching.
  • Robustness and Sim-to-Real Performance: ResMimic demonstrates superior sim-to-sim transfer performance (from IsaacGym to MuJoCo) and successful real-world deployment on a Unitree G1 humanoid. This contrasts sharply with Train from Scratch and Finetune policies, which often fail or degrade significantly in more realistic simulation or hardware. This robustness stems from leveraging the generalization capabilities of the pre-trained GMT policy and the proposed training enhancements.
  • Efficient Adaptation: By learning a residual policy, ResMimic significantly improves training efficiency for new tasks compared to training from scratch, as the residual policy is lightweight and only needs to learn corrections, not the entire motion.

4. Methodology

4.1. Principles

The core principle of ResMimic is to decompose the complex problem of humanoid whole-body loco-manipulation into two more manageable parts using residual learning. The intuition is that generating general, human-like motion is a common, reusable skill, while precise object interaction is task-specific. Therefore, a powerful General Motion Tracking (GMT) policy first learns to generate robust, human-like motions. Then, a smaller, more efficient residual policy learns only the necessary corrections to these base motions to achieve precise object interaction and loco-manipulation. This modular approach leverages the strengths of a pre-trained general model while allowing for efficient, task-specific adaptation.

4.2. Core Methodology In-depth (Layer by Layer)

The ResMimic framework formulates the whole-body loco-manipulation task as a goal-conditioned Reinforcement Learning (RL) problem within a Markov Decision Process (MDP).

The MDP is defined by the tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma \rangle$:

  • $\mathcal{S}$ is the state space.

  • $\mathcal{A}$ is the action space.

  • $\mathcal{T}$ denotes the transition dynamics of the environment.

  • $\mathcal{R}$ is the reward function.

  • $\gamma$ is the discount factor.

    At each time step $t$, the state $s_t \in \mathcal{S}$ is composed of four main components:

  1. robot proprioception ($s_t^r$): Information about the robot's current internal state.

  2. object state ($s_t^o$): Information about the current state of the manipulated object.

  3. motion goal state ($\hat{s}_t^r$): The target state for the robot's motion, typically derived from human demonstrations.

  4. object goal state ($\hat{s}_t^o$): The target state for the object, also derived from demonstrations.

    The action $a_t$ generated by the policy specifies target joint angles for the robot. These targets are executed on the real or simulated robot by a Proportional-Derivative (PD) controller, which drives the robot's actual joint angles toward the commanded values.

The reward $r_t = \mathcal{R}(s_t, a_t)$ is computed at each time step from the current state and action. The overall training objective is to maximize the expected cumulative discounted reward, $\mathbb{E}\big[\sum_{t=1}^{T} \gamma^{t-1} r_t\big]$. Both stages of ResMimic are trained using the Proximal Policy Optimization (PPO) [38] algorithm.
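The two low-level mechanics mentioned above can be sketched in a few lines of Python; the PD gains and reward values below are illustrative placeholders, not the paper's settings.

```python
import numpy as np

def pd_torque(q_target, q, q_dot, kp=60.0, kd=2.0):
    """PD control law: drive the measured joint angles q toward the commanded targets."""
    return kp * (q_target - q) - kd * q_dot

def discounted_return(rewards, gamma=0.99):
    """Sum_{t=1..T} gamma^(t-1) * r_t, the quantity both stages maximize in expectation."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

q = np.zeros(29)
q_dot = np.zeros(29)
a_t = 0.1 * np.ones(29)                  # policy output: target joint angles
tau = pd_torque(a_t, q, q_dot)           # torques actually applied by the actuators
print(tau[:3], discounted_return([1.0, 0.9, 0.8]))
```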

The ResMimic framework operates in two distinct stages, as illustrated in Figure 3.

4.2.1. Stage I: General Motion Tracking (GMT)

The first stage focuses on training a task-agnostic General Motion Tracking (GMT) policy, denoted as $\pi_{\mathrm{GMT}}$. This policy serves as the backbone controller, responsible for generating human-like whole-body movements.

  • Input: The GMT policy takes two inputs:
    • robot proprioception $s_t^r$.
    • reference motion $\hat{s}_t^r$.
  • Output: It outputs a coarse action $a_t^{\mathrm{gmt}}$.
    • The action is computed as $a_t^{\mathrm{gmt}} = \pi_{\mathrm{GMT}}(s_t^r, \hat{s}_t^r)$.
  • Objective: The GMT policy is optimized to maximize only the motion tracking reward $r_t^m$.
    • The training objective for this stage is $\mathbb{E}\big[\sum_{t=1}^{T} \gamma^{t-1} r_t^m\big]$. The policy therefore cares only about how well the robot mimics the human motion, not about object interaction.

4.2.1.1. Dataset for GMT

The GMT policy is trained on large-scale human motion capture data. This approach decouples human motion tracking from object interaction, avoiding the need for costly and hard-to-obtain manipulation data for the base policy.

  • Sources: Publicly available MoCap datasets such as AMASS [8] and OMOMO [9], totaling over 15,000 clips (approximately 42 hours).
  • Data Curation: Motions impractical for the humanoid robot (e.g., stair climbing) are filtered out.
  • Retargeting: Kinematics-based motion retargeting (e.g., GMR [39]) is applied to convert the human motion data into humanoid reference trajectories $\{\hat{s}_i^r = \{\hat{s}_t^r\}_{t=1}^{T}\}_{i=1}^{D}$. Here, $i$ indexes individual motion clips, $t$ indexes time steps within a clip, $T$ is the clip length, and $D$ is the total number of clips.

4.2.1.2. Training Strategy for GMT

The GMT policy is trained using a single-stage RL framework in simulation without access to privileged information (information that is only available in simulation, like exact contact forces, which are not directly observable by a real robot).

  • Proprioceptive Observation ($s_t^r$): This describes the robot's internal state. It is defined as a history of observations over a recent window (a small illustrative sketch of assembling such a window follows this list): $s_t^r = [\theta_t, \omega_t, q_t, \dot{q}_t, a_t^{\mathrm{hist}}]_{t-10:t}$
    • $\theta_t$: The root orientation (orientation of the robot's base or main body).
    • $\omega_t$: The root angular velocity (how fast the robot's base is rotating).
    • $q_t \in \mathbb{R}^{29}$: The joint position for each of the robot's 29 degrees of freedom.
    • $\dot{q}_t$: The joint velocity for each joint.
    • $a_t^{\mathrm{hist}}$: The recent action history, i.e., the past actions taken by the robot.
    • The subscript $t-10:t$ indicates that observations from the past 10 time steps up to the current time step $t$ are included.
  • Reference Motion Input ($\hat{s}_t^r$): This describes the target human motion the robot should mimic. It includes future reference information to help the policy anticipate movements: $\hat{s}_t^r = [\hat{p}_t, \hat{\theta}_t, \hat{q}_t]_{t-10:t+10}$
    • $\hat{p}_t$: The reference root translation (target position of the robot's base).
    • $\hat{\theta}_t$: The reference root orientation (target orientation of the robot's base).
    • $\hat{q}_t$: The reference joint position for each joint.
    • The subscript $t-10:t+10$ indicates that the reference motion from the past 10 time steps up to the next 10 time steps is included, allowing the policy to anticipate upcoming targets for smoother tracking.
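Below is a small, hypothetical sketch of how such a windowed proprioceptive observation could be buffered and flattened; the field ordering and padding strategy are assumptions for illustration.

```python
from collections import deque
import numpy as np

class ObsBuffer:
    """Keep the last `hist` proprioceptive frames and flatten them into one policy input."""

    def __init__(self, hist=10):
        self.frames = deque(maxlen=hist + 1)  # past 10 steps plus the current one

    def push(self, root_quat, root_angvel, q, q_dot, last_action):
        frame = np.concatenate([root_quat, root_angvel, q, q_dot, last_action])
        self.frames.append(frame)

    def get(self):
        # Pad by repeating the oldest frame until the window is full (e.g., at episode start).
        frames = list(self.frames)
        while len(frames) < self.frames.maxlen:
            frames.insert(0, frames[0])
        return np.concatenate(frames)

buf = ObsBuffer()
for _ in range(3):
    buf.push(np.array([1.0, 0, 0, 0]), np.zeros(3), np.zeros(29), np.zeros(29), np.zeros(29))
print(buf.get().shape)  # (11 * (4 + 3 + 29 + 29 + 29),)
```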

4.2.1.3. Reward and Domain Randomization for GMT

  • Motion Tracking Reward ($r_t^m$): Following TWIST [6], the reward is a sum of three components:
    • task rewards: Encourage the robot to achieve specific motion goals (e.g., matching joint positions, root velocity).
    • penalty terms: Discourage undesirable behaviors (e.g., falling, excessive joint limits).
    • regularization terms: Promote smooth and energy-efficient movements.
  • Domain Randomization: To improve sim-to-real transfer and robustness, domain randomization is applied during training. This involves varying physical parameters (e.g., friction, mass, sensor noise) in the simulation, forcing the policy to learn to adapt to a range of environmental conditions rather than overfitting to specific simulation parameters.

4.2.2. Stage II: Residual Refinement Policy

Building on the pre-trained GMT policy from Stage I, this stage introduces a residual policy, $\pi_{\mathrm{Res}}$, to refine the coarse actions $a_t^{\mathrm{gmt}}$ and enable precise object manipulation.

  • Input: The residual policy takes a comprehensive input including the robot state, object state, and their respective reference trajectories: $\langle s_t^r, s_t^o, \hat{s}_t^r, \hat{s}_t^o \rangle$
  • Output: It outputs a residual action $\Delta a_t^{\mathrm{res}} \in \mathbb{R}^{29}$.
  • Final Action: The final action $a_t$ sent to the robot's PD controller is the sum of the GMT policy's output and the residual policy's correction: $a_t = a_t^{\mathrm{gmt}} + \Delta a_t^{\mathrm{res}}$
  • Objective: The residual policy is optimized to maximize a combined reward that covers both motion tracking and object interaction.
    • The training objective for this stage is $\mathbb{E}\big[\sum_{t=1}^{T} \gamma^{t-1}(r_t^m + r_t^o)\big]$: the policy refines the motion while specifically aiming to succeed at the object task. A small illustrative sketch of this action composition follows this list.
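A minimal sketch of this two-stage composition at inference time is shown below; `pi_gmt` and `pi_res` are placeholder callables standing in for the trained networks, not the paper's code.

```python
import numpy as np

def step_controller(pi_gmt, pi_res, s_r, s_hat_r, s_o, s_hat_o):
    """One control step of the two-stage policy: coarse GMT action plus residual correction.

    In ResMimic the GMT policy is pre-trained and frozen conceptually as the base,
    and only the residual policy is optimized in Stage II.
    """
    a_gmt = pi_gmt(s_r, s_hat_r)                   # task-agnostic, human-like action
    delta = pi_res(s_r, s_o, s_hat_r, s_hat_o)     # object-aware correction
    return a_gmt + delta                           # final PD targets for the 29 joints

# Dummy policies for illustration only.
pi_gmt = lambda s_r, s_hat_r: np.zeros(29)
pi_res = lambda s_r, s_o, s_hat_r, s_hat_o: 0.02 * np.ones(29)
a_t = step_controller(pi_gmt, pi_res, None, None, None, None)
print(a_t[:5])
```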

4.2.2.1. Reference Motions for Residual Refinement

For the residual policy, reference trajectories are obtained from MoCap systems that simultaneously record both human motion and object motion.

  • Human Motion: The recorded human motion $\{\hat{h}_t\}_{t=1}^{T}$ is retargeted to the humanoid robot using GMR [39] to produce humanoid reference trajectories $\{\hat{s}_t^r = \mathrm{GMR}(\hat{h}_t)\}_{t=1}^{T}$.
  • Object Motion: The recorded object motion $\{\hat{o}_t\}_{t=1}^{T}$ is directly used as the reference object trajectory $\hat{s}_t^o$.
  • Complete Reference: The combined trajectories $\{(\hat{s}_t^r, \hat{s}_t^o)\}_{t=1}^{T}$ guide the training of the residual policy.

4.2.2.2. Training Strategy for Residual Refinement

Similar to the GMT stage, single-stage RL with PPO is used for residual learning.

  • Object State Representation ($s_t^o$): The current state of the object is represented by its pose and velocity: $s_t^o = [p_t^o, \theta_t^o, v_t^o, \omega_t^o]$

    • $p_t^o$: The object root translation (position).
    • $\theta_t^o$: The object root orientation.
    • $v_t^o$: The object root linear velocity.
    • $\omega_t^o$: The object root angular velocity.
  • Reference Object Trajectory ($\hat{s}_t^o$): As with the robot motion, the reference object trajectory includes future information: $\hat{s}_t^o = [\hat{p}_t^o, \hat{\theta}_t^o, \hat{v}_t^o, \hat{\omega}_t^o]_{t-10:t+10}$

    • $\hat{p}_t^o$: The reference object root translation.
    • $\hat{\theta}_t^o$: The reference object root orientation.
    • $\hat{v}_t^o$: The reference object root linear velocity.
    • $\hat{\omega}_t^o$: The reference object root angular velocity.
    • The subscript $t-10:t+10$ indicates a history and future window for the object reference.
  • Network Initialization: At the beginning of training, the residual policy should ideally output actions close to zero, as the humanoid already performs human-like motion. To encourage this, the final layer of the PPO actor network is initialized using Xavier uniform initialization with a small gain factor [40]. This ensures the initial outputs are close to zero, minimizing disruption to the pre-trained GMT policy.

  • Virtual Object Force Curriculum: This mechanism is designed to stabilize training, especially when reference motions are noisy (leading to penetrations) or objects are heavy. These issues can cause the initial policy to fail by knocking over the object or retreating.

    • Mechanism: PD controllers apply virtual forces ($\mathcal{F}_t$) and torques ($\mathcal{T}_t$) to the object, guiding it towards its reference trajectory (an illustrative sketch of this controller and its gain curriculum follows this list).
    • Formulas: $\mathcal{F}_t = k_p(\hat{p}_t^o - p_t^o) - k_d v_t^o$ and $\mathcal{T}_t = k_p(\hat{\theta}_t^o \ominus \theta_t^o) - k_d \omega_t^o$
    • Symbol Explanation:
      • $\mathcal{F}_t$: The virtual control force applied to the object at time $t$.
      • $\mathcal{T}_t$: The virtual control torque applied to the object at time $t$.
      • $k_p$: The proportional gain of the PD controller, determining the strength of the correction based on the position/orientation error.
      • $k_d$: The derivative gain of the PD controller, determining the strength of the correction based on the velocity/angular velocity.
      • $\hat{p}_t^o$: The reference position of the object at time $t$.
      • $p_t^o$: The current position of the object at time $t$.
      • $v_t^o$: The current linear velocity of the object at time $t$.
      • $\hat{\theta}_t^o$: The reference orientation of the object at time $t$.
      • $\theta_t^o$: The current orientation of the object at time $t$.
      • $\omega_t^o$: The current angular velocity of the object at time $t$.
      • $\ominus$: The rotation difference (e.g., the shortest angular displacement between two orientations).
    • Curriculum: The controller gains $(k_p, k_d)$ are gradually decayed over training. This provides strong virtual assistance early on to stabilize learning, then reduces the assistance as the policy learns to handle the task autonomously.
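Below is a hedged sketch of the virtual object PD assistance with a linearly decaying gain curriculum; the gain magnitudes and decay schedule are illustrative assumptions, and quaternion handling relies on SciPy rather than whatever the original implementation uses.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def virtual_object_wrench(p_ref, p, v, quat_ref, quat, omega, kp, kd):
    """Virtual PD force/torque pulling the object toward its reference trajectory."""
    force = kp * (p_ref - p) - kd * v
    # Orientation error as a rotation vector (the "circled minus" in the formula above).
    rot_err = (R.from_quat(quat_ref) * R.from_quat(quat).inv()).as_rotvec()
    torque = kp * rot_err - kd * omega
    return force, torque

def curriculum_gains(step, total_steps, kp0=500.0, kd0=20.0):
    """Linearly decay the assistance to zero; schedule and magnitudes are illustrative."""
    scale = max(0.0, 1.0 - step / total_steps)
    return kp0 * scale, kd0 * scale

kp, kd = curriculum_gains(step=2500, total_steps=10000)
f, tau = virtual_object_wrench(
    np.array([0.0, 0.0, 0.5]), np.zeros(3), np.zeros(3),
    np.array([0.0, 0.0, 0.0, 1.0]), np.array([0.0, 0.0, 0.0, 1.0]), np.zeros(3),
    kp, kd,
)
print(kp, kd, f, tau)
```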

4.2.2.3. Reward and Early Termination for Residual Refinement

The residual refinement stage reuses the motion reward $r_t^m$ and the domain randomization from GMT training. It introduces two additional reward terms: an object tracking reward ($r_t^o$) and a contact tracking reward ($r_t^c$); an illustrative sketch of both rewards follows at the end of this list.

  • Object Tracking Reward ($r_t^o$): Instead of traditional pose-based differences [11], [42], ResMimic uses a point-cloud-based difference for a smoother reward landscape, which implicitly accounts for both translation and rotation without task-specific weight tuning.

    • Formula: $r_t^o = \exp\big(-\lambda_o \sum_{i=1}^{N} \|\mathbf{P}[i]_t - \hat{\mathbf{P}}[i]_t\|_2\big)$
    • Symbol Explanation:
      • $\lambda_o$: A positive scaling factor that determines the sensitivity of the reward to the object tracking error.
      • $N$: The number of points sampled from the object's mesh surface.
      • $\mathbf{P}[i]_t \in \mathbb{R}^3$: The $i$-th sampled 3D point on the current object mesh at time $t$.
      • $\hat{\mathbf{P}}[i]_t \in \mathbb{R}^3$: The $i$-th sampled 3D point on the reference object mesh at time $t$.
      • $\|\cdot\|_2$: The Euclidean norm (L2 norm), measuring the distance between points.
    • The reward is exponential: it is highest when the points on the current object closely match the reference points and decreases rapidly as the difference grows.
  • Contact Reward ($r_t^c$): This reward encourages correct physical interactions and helps the robot learn to make meaningful contact with objects. Contact locations are discretized to the relevant links (e.g., torso, hip, arms, excluding the feet, which primarily contact the ground).

    • Oracle Contact Information: This is derived from the reference human-object interaction trajectory: $\hat{c}_t[i] = \mathbf{1}(\lVert \hat{d}_t[i] \rVert < \sigma_c)$
    • Symbol Explanation:
      • $i$: Indexes the specific robot links (e.g., torso, left forearm, right hand).
      • $\mathbf{1}(\cdot)$: The indicator function, which equals 1 if the condition inside the parentheses is true, and 0 otherwise.
      • $\lVert \hat{d}_t[i] \rVert$: The distance between robot link $i$ and the object surface in the reference trajectory at time $t$.
      • $\sigma_c$: A threshold distance; if the reference distance is below it, a contact should be present.
    • Contact Tracking Reward Formula: $r_t^c = \sum_i \hat{c}_t[i] \cdot \exp\big(-\frac{\lambda}{f_t[i]}\big)$
    • Symbol Explanation:
      • $\hat{c}_t[i]$: The oracle contact indicator for link $i$ at time $t$ (1 if contact is expected, 0 otherwise).
      • $\lambda$: A positive scaling factor for the exponential decay.
      • $f_t[i]$: The actual contact force measured at robot link $i$ at time $t$.
    • This reward encourages the robot to exert force on links where contact is expected based on the reference motion. If contact is expected ($\hat{c}_t[i] = 1$), the reward increases as the contact force $f_t[i]$ increases, saturating at 1 with a rate controlled by $\lambda$. If no contact is expected, the term is zero.
  • Early Termination: To prevent the policy from entering undesirable states, episodes are terminated early under specific conditions:

    • Standard motion tracking conditions: Unintended ground contact (e.g., torso touching ground) or substantial deviation of a body part from its reference.
    • Additional loco-manipulation conditions:
      • The object mesh deviates from its reference beyond a threshold: $\|\mathbf{P}_t - \hat{\mathbf{P}}_t\|_2 > \sigma_o$, where $\sigma_o$ is a predefined threshold.
      • Any required body-object contact (as indicated by $\hat{c}_t[i]$) is lost for more than 10 consecutive frames.
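The sketch below illustrates how the two reward terms defined above could be computed from sampled object points, reference link distances, and measured contact forces; the threshold $\sigma_c$ and scaling factors are placeholder values, not the paper's.

```python
import numpy as np

def object_tracking_reward(points, points_ref, lam_o=1.0):
    """Point-cloud tracking reward: exp(-lambda_o * sum_i ||P_i - P_hat_i||_2)."""
    dists = np.linalg.norm(points - points_ref, axis=-1)
    return float(np.exp(-lam_o * dists.sum()))

def contact_reward(ref_dists, contact_forces, sigma_c=0.03, lam=5.0):
    """Reward force on links that should be in contact according to the reference.

    ref_dists: distance from each tracked link to the object surface in the reference motion.
    contact_forces: measured contact-force magnitudes on the same links.
    """
    expected = (ref_dists < sigma_c).astype(float)   # oracle contact indicator c_hat
    forces = np.maximum(contact_forces, 1e-6)        # avoid division by zero
    return float(np.sum(expected * np.exp(-lam / forces)))

pts = np.random.rand(64, 3)
print(object_tracking_reward(pts, pts + 0.01))
print(contact_reward(np.array([0.01, 0.20]), np.array([15.0, 0.0])))
```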

4.2.3. Overall Architecture (Figure 3)


The image is a schematic diagram illustrating the two stages of the ResMimic method: retargeting of large-scale human motions and residual policy training. In the second stage, motion capture data and the reference humanoid motion are used to optimize humanoid motion and object interaction. Key reward mechanisms include the object tracking reward and the contact reward, with the formula $a_t = a_t^{\mathrm{gmt}} + \Delta a_t^{\mathrm{res}}$ representing the residual composition.

Figure 3 visually summarizes the two-stage process:

  1. Stage 1: GMT Policy Training: Human MoCap data is used to train $\pi_{\mathrm{GMT}}$. This policy takes robot proprioception and the human reference motion as input and outputs a coarse action $a_t^{\mathrm{gmt}}$. This stage focuses on the motion tracking reward $r_t^m$.
  2. Stage 2: Residual Policy Training: Human-object MoCap data (including the object reference) is used. The GMT policy's output $a_t^{\mathrm{gmt}}$ is augmented by a residual policy $\pi_{\mathrm{Res}}$. The residual policy takes robot proprioception, object state, human reference motion, and object reference motion as input, and outputs a residual action $\Delta a_t^{\mathrm{res}}$. The final action $a_t = a_t^{\mathrm{gmt}} + \Delta a_t^{\mathrm{res}}$ is then executed. This stage optimizes both the motion tracking reward $r_t^m$ and the object tracking reward $r_t^o$, incorporating the contact reward $r_t^c$ and using the virtual object controller.

5. Experimental Setup

The authors evaluate ResMimic through extensive sim-to-sim evaluation and real-world deployment on a Unitree G1 humanoid robot. The robot has 29 Degrees of Freedom (DoF) and is 1.3 m tall.

5.1. Datasets

For training the General Motion Tracking (GMT) policy, the authors leverage several publicly available MoCap datasets, including AMASS [8] and OMOMO [9]. These datasets collectively contain over 15,000 clips, amounting to approximately 42 hours of human motion data. Motions deemed impractical for the humanoid robot (e.g., stair climbing) are filtered out. Kinematics-based motion retargeting (specifically, GMR [39]) is then used to map these human motions to the humanoid robot's kinematics.

For training the Residual Refinement Policy, the reference trajectories for both human and object motions are collected with an OptiTrack motion capture system. The system simultaneously records the human motion $\{\hat{h}_t\}_{t=1}^{T}$ and the object motion $\{\hat{o}_t\}_{t=1}^{T}$. The human motion is retargeted to the humanoid robot using GMR [39] to generate humanoid reference trajectories $\{\hat{s}_t^r = \mathrm{GMR}(\hat{h}_t)\}_{t=1}^{T}$, while the object motion is used directly as the reference object trajectory $\hat{s}_t^o$.

These datasets are chosen because they provide a rich source of human-like movements and human-object interactions, which are critical for training policies that aim to mimic human behavior. The MoCap system for object interaction provides precise ground truth for both robot and object trajectories, enabling robust learning for the residual policy.

5.2. Evaluation Metrics

The effectiveness of ResMimic is assessed using several metrics that cover training efficiency, motion fidelity, manipulation accuracy, and overall task completion.

5.2.1. Training Iterations (Iter.)

  • Conceptual Definition: This metric quantifies the computational cost or time required for a policy to converge to a stable and effective solution during training. It measures how many training steps (or batches of data processed) were necessary for the reward to stop significantly increasing, indicating that the learning process has stabilized.
  • Mathematical Formula: Not a specific formula, but rather a count of optimization steps.
  • Symbol Explanation: Iter. represents the total training iterations until convergence. A lower value indicates higher training efficiency.

5.2.2. Object Tracking Error ($E_o$)

  • Conceptual Definition: This metric measures how accurately the simulated or real object's position and orientation match its reference trajectory. The paper uses a point-cloud-based formulation, which is more robust than simple pose differences because it accounts for the entire shape of the object.
  • Mathematical Formula: $E_o = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N} \|\mathbf{P}[i]_t - \hat{\mathbf{P}}[i]_t\|_2$
  • Symbol Explanation:
    • $E_o$: The average object tracking error over an episode.
    • $T$: The total number of time steps in the episode.
    • $N$: The number of points sampled from the object's mesh surface.
    • $\mathbf{P}[i]_t \in \mathbb{R}^3$: The $i$-th sampled 3D point on the current object mesh at time $t$.
    • $\hat{\mathbf{P}}[i]_t \in \mathbb{R}^3$: The $i$-th sampled 3D point on the reference object mesh at time $t$.
    • $\|\cdot\|_2$: The Euclidean norm, i.e., the straight-line distance between two points in 3D space. A lower value indicates better object tracking accuracy.

5.2.3. Motion Tracking Error ($E_m$)

  • Conceptual Definition: This metric quantifies how closely the robot's overall body motion (specifically the global positions of its body parts) matches the corresponding reference human motion.
  • Mathematical Formula: $E_m = \frac{1}{T} \sum_{t=1}^{T} \sum_i \|p_t[i] - \hat{p}_t[i]\|_2$
  • Symbol Explanation:
    • $E_m$: The average motion tracking error over an episode.
    • $T$: The total number of time steps in the episode.
    • $i$: Indexes the body parts (links) of the robot.
    • $p_t[i] \in \mathbb{R}^3$: The global position of body part $i$ on the robot at time $t$.
    • $\hat{p}_t[i] \in \mathbb{R}^3$: The global position of body part $i$ in the reference motion at time $t$.
    • $\|\cdot\|_2$: The Euclidean norm. A lower value indicates higher motion fidelity.

5.2.4. Joint Tracking Error ($E_j$)

  • Conceptual Definition: This metric measures the precision of the robot's joint movements by comparing its actual joint angles to the reference joint angles from the retargeted human motion. It specifically evaluates joint-level accuracy.
  • Mathematical Formula: $E_j = \frac{1}{T} \sum_{t=1}^{T} \|q_t - \hat{q}_t\|_2$
  • Symbol Explanation:
    • $E_j$: The average joint tracking error over an episode.
    • $T$: The total number of time steps in the episode.
    • $q_t \in \mathbb{R}^{29}$: The vector of joint positions (angles) of the robot at time $t$; $\mathbb{R}^{29}$ reflects the robot's 29 degrees of freedom.
    • $\hat{q}_t \in \mathbb{R}^{29}$: The vector of reference joint positions (angles) at time $t$.
    • $\|\cdot\|_2$: The Euclidean norm, applied to the vector of joint differences. A lower value indicates more precise joint-level control.
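A compact sketch of the three tracking-error metrics defined above, assuming trajectories are stored as NumPy arrays; the array shapes in the example are illustrative.

```python
import numpy as np

def object_tracking_error(P, P_ref):
    """E_o: mean over time of the summed point-to-point distances between object point clouds.

    P, P_ref: arrays of shape (T, N, 3).
    """
    return np.linalg.norm(P - P_ref, axis=-1).sum(axis=-1).mean()

def motion_tracking_error(body_pos, body_pos_ref):
    """E_m: mean over time of the summed body-link position errors; shapes (T, L, 3)."""
    return np.linalg.norm(body_pos - body_pos_ref, axis=-1).sum(axis=-1).mean()

def joint_tracking_error(q, q_ref):
    """E_j: mean over time of the L2 norm of the joint-angle error vector; shapes (T, 29)."""
    return np.linalg.norm(q - q_ref, axis=-1).mean()

T, N, L = 100, 64, 15
print(object_tracking_error(np.zeros((T, N, 3)), 0.01 * np.ones((T, N, 3))))
print(motion_tracking_error(np.zeros((T, L, 3)), np.zeros((T, L, 3))))
print(joint_tracking_error(np.zeros((T, 29)), np.zeros((T, 29))))
```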

5.2.5. Task Success Rate (SR)

  • Conceptual Definition: This is a binary metric indicating whether a specific task was successfully completed during a rollout. It encompasses both object manipulation and robot stability.
  • Mathematical Formula: Not a specific formula, but a percentage calculated as (Number of Successful Rollouts / Total Number of Rollouts) * 100%.
  • Symbol Explanation: SR stands for Success Rate. A rollout is considered successful if two conditions are met:
    1. The object tracking error ($E_o$) is below a predefined threshold ($\sigma_o$).
    2. The robot remains balanced (i.e., it does not fall or violate early termination conditions related to stability). A higher value indicates greater task reliability.
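A toy sketch of how the success criterion could be evaluated per rollout; the threshold value below is a placeholder, since the exact $\sigma_o$ is not given here.

```python
def rollout_success(object_error, fell, sigma_o=0.25):
    """A rollout succeeds if the object tracking error stays under the threshold and the robot stays up.

    sigma_o is a placeholder threshold, not the paper's value.
    """
    return (object_error < sigma_o) and (not fell)

results = [rollout_success(0.10, False), rollout_success(0.40, False), rollout_success(0.05, True)]
success_rate = 100.0 * sum(results) / len(results)
print(f"SR = {success_rate:.1f}%")
```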

5.3. Baselines

To validate the effectiveness and efficiency of ResMimic, the authors compare it against three representative and strong baselines:

  1. Base Policy:

    • Description: This baseline directly deploys the pre-trained GMT policy (from Stage I) to follow the human reference motion.
    • Limitation: It does not have access to object information and is not explicitly trained for object interaction. It serves as a benchmark to show the inherent limitation of GMT without specific adaptation for loco-manipulation.
  2. Train from Scratch:

    • Description: This baseline involves training a single-stage Reinforcement Learning (RL) policy completely from scratch. This policy attempts to track both human motion and object trajectories simultaneously.
    • Training Setup: For fairness, it uses the same reward terms (motion, object, and contact rewards) and domain randomization as ResMimic across all tasks, without any task-specific tuning.
    • Purpose: This baseline assesses the difficulty of learning complex loco-manipulation tasks without the benefit of a pre-trained GMT policy or residual learning.
  3. Finetune (Base Policy + Fine-tune):

    • Description: This baseline takes the pre-trained GMT policy and fine-tunes it to track both human motion and object trajectories.
    • Training Setup: The reward terms are identical to those used in ResMimic.
    • Limitation: A crucial limitation is that, due to the architecture of the GMT policy (which is designed for human motion inputs only), this fine-tuned policy cannot incorporate explicit object information as input. It must infer object interaction solely from the rewards. This baseline evaluates if simply adapting the GMT weights with object-related rewards is sufficient, even without explicit object state in the policy's observation.

5.4. Tasks

The authors design four challenging whole-body loco-manipulation tasks to stress different aspects of humanoid control and generalization. The human-object interaction reference motions for these tasks are collected using an OptiTrack motion capture system.

  1. Kneel (Kneel on one knee and lift a box):

    • Challenge: Requires expressive, large-amplitude motion, precise lower-body coordination, and maintaining balance while lifting.
  2. Carry (Carry a box onto the back):

    • Challenge: Demands whole-body expressiveness and maintaining balance under a shifting load distribution as the box is moved to the robot's back.
  3. Squat (Squat and lift a box with arms and torso):

    • Challenge: Highlights whole-body contact-rich manipulation, requiring the coordinated use of arms and torso for lifting, not just hands.
  4. Chair (Lift up a chair):

    • Challenge: Involves manipulating a heavy, irregularly shaped object, testing the policy's ability to generalize beyond simple box geometries.

6. Results & Analysis

The experiments evaluate ResMimic against baselines in both sim-to-sim transfer (from IsaacGym to MuJoCo) and real-world deployment.

6.1. Core Results Analysis

6.1.1. Q1: Can a general motion tracking (GMT) policy, without task-specific retraining, accomplish diverse loco-manipulation tasks?

Answer: No, but it provides a strong initialization. As shown in Table I, the Base Policy (which is the pre-trained GMT policy deployed directly without object information) achieves a very low Task Success Rate (SR) of only 10% on average across the four tasks. While it shows relatively low Joint Tracking Error (Ej) (0.89 on average), indicating it can follow human-like joint movements, its Object Tracking Error (Eo) is high (0.61 on average), and it struggles significantly with Motion Tracking Error (Em) (9.22 on average), meaning the robot's overall motion deviates from the reference in a way that impacts task completion. For tasks like Kneel and Carry, its SR is 0%, meaning it fails completely. This confirms that while GMT provides good joint-level precision for general motions, it is insufficient for loco-manipulation tasks that require object awareness and precise interaction.

6.1.2. Q2: Does initializing from a pre-trained GMT policy improve training efficiency and final performance compared to training from scratch?

Answer: Yes, significantly. Table I clearly demonstrates that Train from Scratch policies fail entirely, achieving a 0% SR across all tasks in MuJoCo. They also require significantly more Training Iterations (Iter.) (4500 on average on the tasks they attempt to learn) than ResMimic, yet still fail. Figure 5 further illustrates this: policies trained from scratch might show partial success in IsaacGym (a simpler simulation environment) but collapse entirely under sim-to-sim transfer to MuJoCo (a more realistic physics engine). In contrast, ResMimic, which uses the GMT policy as a base, achieves a 92.5% average SR with far fewer iterations (1300 on average) and low errors across all metrics. This highlights the necessity of using GMT as a foundation, as its large-scale pre-training imbues generalization and robustness that are critical for overcoming sim-to-sim gaps.

6.1.3. Q3: When adapting GMT policies to loco-manipulation tasks, is residual learning more effective than fine-tuning?

Answer: Yes, residual learning significantly outperforms direct fine-tuning. Table I shows that the Finetune baseline achieves an average SR of only 7.5%, which is marginally better than Train from Scratch but still far from ResMimic's 92.5%. While Finetune does require fewer Iterations (2400 on average) than Train from Scratch, it does not approach ResMimic's efficiency or performance. The key limitation of Finetune is its inability to incorporate explicit object observations as input, because the original GMT policy architecture is designed only for human motion inputs. Although object-tracking rewards provide some supervision, the lack of explicit object state prevents the policy from learning robust object interaction behaviors, especially under randomized object poses. Moreover, fine-tuning tends to overwrite the generalization capability of the base GMT policy, leading to instability across tasks. Figure 5 illustrates this again: fine-tuned policies might succeed in IsaacGym but fail to transfer to MuJoCo. This underscores the superiority of residual learning (where a separate residual policy explicitly takes object state as input) as a more generalizable and extensible adaptation strategy.

6.1.4. Q4: Beyond simulation, can ResMimic achieve precise, expressive, and robust control in the real-world?

Answer: Yes, ResMimic demonstrates strong real-world capabilities. As shown in Figure 1, ResMimic is deployed successfully on a Unitree G1 humanoid.

  • Expressive Carrying Motions: The robot can kneel on one knee to pick up a box and carry the box on its back, showcasing expressive whole-body movements.

  • Humanoid-Object Interaction Beyond Manipulation: The robot can sit down on a chair and then stand up, maintaining balance and contact with the environment, demonstrating interaction with static environments.

  • Heavy Payload Carrying: ResMimic enables the robot to successfully carry a 4.5 kg box, despite the G1's wrist payload limit being around 2.5 kg. This highlights the necessity and success of leveraging whole-body contact (e.g., using the torso and arms) for such tasks.

  • Generalization to Irregular Heavy Objects: The robot is able to lift and carry chairs weighing 4.5 kg and 5.5 kg, demonstrating instance-level generalization to novel, non-box geometries.

    Qualitative comparisons in Figure 6 show that while the Base Policy can superficially mimic human motion, it lacks object awareness. Train from Scratch and Finetuning fail entirely in the real world due to the sim-to-real gap. ResMimic supports both blind deployment (without active object state input) for most Figure 1 results, and non-blind deployment (with MoCap-based object state input) as shown in Figure 4, where it demonstrates manipulation from random initial poses, consecutive loco-manipulation tasks, and reactive behavior to external perturbations.

The first image is an illustrative diagram showing the humanoid performing whole-body loco-manipulation under the ResMimic framework, with tasks labeled (a) to (f) such as picking up boxes, interacting with objects, and sitting down, demonstrating precise control and object awareness.

The second image shows three action scenarios within the ResMimic framework: (a) carrying goods, (b) complex object interaction, and (c) operating under external perturbations, illustrating the humanoid's dynamic performance and adaptability across contexts.

Fig. 6: Real-world qualitative results comparing ResMimic against all other baselines. The figure compares box-carrying under ResMimic, Train from Scratch, the Base Policy, and the Base Policy with fine-tuning; ResMimic shows the strongest control precision and object interaction.

The above figures illustrate the real-world performance of ResMimic and its qualitative superiority over baselines.

6.2. Data Presentation (Tables)

The following are the results from Table I of the original paper:

| Method | Task | SR ↑ | Iter. ↓ | Eo ↓ | Em ↓ | Ej ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| Base Policy | Kneel | 0% | – | 0.76 ± 0.01 | 3.30 ± 0.53 | 0.28 ± 0.01 |
| | Carry | 0% | – | 0.29 ± 0.02 | 2.47 ± 0.26 | 1.19 ± 0.30 |
| | Squat | 40% | – | 0.19 ± 0.01 | 0.93 ± 0.07 | 0.90 ± 0.08 |
| | Chair | 0% | – | 1.19 ± 0.48 | 30.18 ± 33.45 | 1.20 ± 0.23 |
| | Mean | 10% | – | 0.61 | 9.22 | 0.89 |
| Train from Scratch | Kneel | 0% | × | 0.69 ± 0.00 | 5.20 ± 0.62 | 3.41 ± 0.07 |
| | Carry | 0% | 6500 | 0.70 ± 0.03 | 5.39 ± 0.38 | 2.33 ± 0.06 |
| | Squat | 0% | 5000 | 0.68 ± 0.05 | 7.56 ± 2.31 | 4.28 ± 0.70 |
| | Chair | 0% | 2000 | 0.97 ± 0.08 | 10.01 ± 1.28 | 13.36 ± 0.92 |
| | Mean | 0% | 4500 | 0.76 | 7.04 | 5.84 |
| Finetune | Kneel | 0% | × | 0.87 ± 0.01 | 5.92 ± 0.81 | 3.02 ± 0.18 |
| | Carry | 30% | 4500 | 0.33 ± 0.01 | 2.49 ± 0.18 | 2.39 ± 0.06 |
| | Squat | 0% | 2000 | 0.47 ± 0.05 | 5.07 ± 1.06 | 2.53 ± 0.13 |
| | Chair | 0% | 700 | 0.15 ± 0.01 | 0.28 ± 0.05 | 1.26 ± 0.09 |
| | Mean | 7.5% | 2400 | 0.46 | 3.44 | 2.30 |
| ResMimic (Ours) | Kneel | 90% | 2000 | 0.14 ± 0.00 | 0.23 ± 0.06 | 2.17 ± 0.06 |
| | Carry | 100% | 1000 | 0.11 ± 0.00 | 0.08 ± 0.00 | 1.24 ± 0.03 |
| | Squat | 80% | 1500 | 0.07 ± 0.01 | 0.07 ± 0.03 | 1.18 ± 0.03 |
| | Chair | 100% | 700 | 0.16 ± 0.01 | 0.13 ± 0.02 | 0.55 ± 0.01 |
| | Mean | 92.5% | 1300 | 0.12 | 0.13 | 1.29 |

(– indicates the Base Policy is deployed directly without task-specific training; × indicates no iteration count is reported for that task, and such entries are excluded from the per-method mean.)

Table I Analysis:

  • Success Rate (SR ↑): ResMimic drastically outperforms all baselines, achieving a mean SR of 92.5%. The Base Policy, Train from Scratch, and Finetune baselines show very low or zero success rates, indicating their inability to reliably complete the loco-manipulation tasks.

  • Training Iterations (Iter. ↓): ResMimic converges much faster, with a mean of 1300 iterations, compared to Train from Scratch (4500) and Finetune (2400). This highlights its superior training efficiency. The Base Policy has no training iterations for the task, as it is directly deployed.

  • Object Tracking Error (Eo ↓): ResMimic achieves significantly lower Eo (mean 0.12) compared to all baselines (0.61 for Base Policy, 0.76 for Train from Scratch, 0.46 for Finetune), demonstrating its precise object interaction capabilities.

  • Motion Tracking Error (Em ↓): ResMimic also shows the lowest Em (mean 0.13), indicating high fidelity in general motion tracking while performing object tasks. The Base Policy has high Em, suggesting that while it tracks individual joints, its overall body movement is not suitable for object interaction.

  • Joint Tracking Error (Ej ↓): ResMimic achieves competitive Ej (mean 1.29), showing that its refinements do not significantly degrade low-level joint tracking precision while achieving high-level task success.

    In summary, ResMimic demonstrates substantial gains in task success, training efficiency, and robustness across all evaluated metrics and tasks in simulation. The baselines either fail to address the core problem (Base Policy) or are unable to learn effective and generalizable solutions (Train from Scratch, Finetune).
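To make these metrics concrete, the sketch below shows how such tracking errors are commonly computed. The paper's exact definitions, units, and normalization are not reproduced in this analysis, so treat these functions as plausible stand-ins rather than the authors' formulas.

```python
import numpy as np

# Hedged sketch of tracking metrics in the spirit of Table I (assumed definitions).

def object_tracking_error(obj_pos, obj_pos_ref):
    # obj_pos, obj_pos_ref: (T, 3) object positions over T timesteps.
    return float(np.mean(np.linalg.norm(obj_pos - obj_pos_ref, axis=-1)))

def motion_tracking_error(body_pos, body_pos_ref):
    # body_pos, body_pos_ref: (T, B, 3) global positions of B tracked bodies.
    return float(np.mean(np.linalg.norm(body_pos - body_pos_ref, axis=-1)))

def joint_tracking_error(q, q_ref):
    # q, q_ref: (T, J) joint angles in radians.
    return float(np.mean(np.abs(q - q_ref)))
```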

Figure 5 (schematic): comparison of the ResMimic framework with the baseline methods across the IsaacGym and MuJoCo environments. The upper part shows the lifting-an-object-while-walking task and the lower part the chair-sitting task, each accompanied by the corresponding training curves of the object tracking error $E_o$.

Figure 5 visually supports the quantitative results from Table I. It compares training curves (object tracking error, $E_o$) and qualitative performance between IsaacGym and MuJoCo.

  • Sim-to-Sim Transfer: Train from Scratch and Finetune policies often show some success in IsaacGym but exhibit significantly degraded performance or complete failure when transferred to MuJoCo, which is considered a better proxy for real-world physics. This underscores the difficulty of sim-to-sim transfer for these baselines.
  • ResMimic's Robustness: In contrast, ResMimic maintains strong performance with minimal degradation from IsaacGym to MuJoCo, highlighting its superior robustness and generalization capabilities.

6.3. Ablation Studies

6.3.1. Effect of the Virtual Object Controller

The paper conducts an ablation study on the virtual object controller, which is designed to stabilize early-stage training.

  • Problem Addressed: Without the virtual object controller, the policy often struggles when reference motions contain imperfections (e.g., penetrations between the humanoid's hand and the object) or when objects are heavy. These imperfections can cause the robot to knock over the object while trying to imitate the motion, leading to low object rewards and frequent early terminations. This traps the policy in a local minimum where it learns to retreat from the object rather than engage with it effectively.

  • Benefit of the Controller: As illustrated in Figure 7, with the virtual force curriculum, the object remains stabilized during early learning. This allows the policy to overcome motion-data imperfections and converge to precise manipulation strategies. The controlled assistance prevents early failures, giving the residual policy a chance to learn how to interact with the object without destabilizing it.

Fig. 7: Ablation on the virtual object controller. The upper sequence shows training without the controller and the lower sequence shows training with it.

Figure 7 visually contrasts the behavior with and without the virtual object controller. The upper sequence (without controller) shows the robot repeatedly struggling to interact with the object, often knocking it over. The lower sequence (with controller) demonstrates a more stable and successful interaction, indicating that the controller helps the policy learn reliable manipulation.
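The mechanism behind this ablation can be made concrete with a short sketch. The following is an illustrative implementation of a curriculum-scheduled virtual PD force applied to the object in simulation; the gains, the linear annealing schedule, and all names are assumptions rather than the paper's exact design.

```python
import numpy as np

# Illustrative sketch of a curriculum-based virtual object controller: a virtual PD
# force pulls the simulated object toward its reference trajectory, and the
# assistance is annealed to zero over training so the converged policy gets no help.

def virtual_object_force(obj_pos, obj_vel, obj_pos_ref, obj_vel_ref,
                         progress, kp=200.0, kd=20.0):
    """progress in [0, 1]: fraction of the training curriculum completed."""
    obj_pos, obj_vel = np.asarray(obj_pos), np.asarray(obj_vel)
    obj_pos_ref, obj_vel_ref = np.asarray(obj_pos_ref), np.asarray(obj_vel_ref)
    # PD force toward the reference object trajectory.
    force = kp * (obj_pos_ref - obj_pos) + kd * (obj_vel_ref - obj_vel)
    # Curriculum: full assistance early in training, linearly annealed to zero.
    assist = max(0.0, 1.0 - progress)
    return assist * force  # applied to the object in the simulator at each step
```

Under this kind of schedule, early rollouts cannot knock the object far from its reference, which is exactly the failure mode the ablation attributes to removing the controller.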

6.3.2. Effect of the Contact Reward

An ablation study on the contact reward investigates its role in encouraging appropriate whole-body interaction strategies.

  • Two Strategies for Lifting a Box: The authors identify two possible ways for the robot to lift a box:

    1. Relying primarily on wrists and hands.
    2. Engaging both torso and arm contact, as humans typically do.
  • Result without Contact Reward (NCR): Without the contact reward, the policy tends to converge to strategy (1) (using only wrists and hands). While this might sometimes succeed in IsaacGym, it often fails to transfer effectively to MuJoCo and the real world due to instability or insufficient force.

  • Result with Contact Reward (CR): With the contact reward, the humanoid learns to adopt strategy (2), using coordinated torso and arm contact. This strategy, which aligns with human demonstrations, leads to improved sim-to-sim and sim-to-real transfer. The contact reward explicitly guides the policy towards a more robust and human-like interaction.

Fig. 8: Ablation on the contact reward. NCR denotes "No Contact Reward" and CR denotes "with Contact Reward"; the curves at the bottom quantify torso contact force.

Figure 8 provides a visual and quantitative comparison. The top image sequences show NCR (No Contact Reward) leading to manipulation primarily with hands, while CR (with Contact Reward) results in the robot engaging its torso and arms for lifting. The bottom curves quantify the torso contact force. The CR curve shows significantly higher and sustained torso contact force, confirming that the contact reward successfully encourages the desired whole-body contact strategy. This validation highlights the importance of explicit rewards for guiding complex physical interactions.
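As a concrete illustration of how such a reward can steer the policy toward torso-and-arm contact, the sketch below rewards measured contact forces on bodies where the reference motion expects contact. The body-name keys, the exponential saturation, and the scale `sigma_c` are assumptions, not the paper's exact formulation.

```python
import numpy as np

# Illustrative sketch of a contact reward that favors whole-body (torso + arm)
# contact with the object whenever the reference motion indicates such contact.

def contact_reward(contact_forces, ref_contact_flags, sigma_c=50.0):
    """
    contact_forces: dict body name -> measured contact-force magnitude (N)
                    between that body and the object at this timestep.
    ref_contact_flags: dict body name -> 1 if the reference motion expects
                    contact on that body at this timestep, else 0.
    """
    expected_bodies = [b for b, flag in ref_contact_flags.items() if flag]
    if not expected_bodies:
        return 0.0
    reward = 0.0
    for body in expected_bodies:
        f = contact_forces.get(body, 0.0)
        # Saturating term: rewards actually pressing the expected body (e.g., torso
        # or forearm) against the object rather than hovering near it.
        reward += 1.0 - np.exp(-f / sigma_c)
    return reward / len(expected_bodies)

# Example: the reference expects torso and left-forearm contact while lifting.
r = contact_reward(
    contact_forces={"torso": 80.0, "left_forearm": 30.0, "right_hand": 5.0},
    ref_contact_flags={"torso": 1, "left_forearm": 1, "right_hand": 0},
)
```

A term of this shape gives zero credit for hand-only lifting when the reference expects torso contact, which matches the qualitative shift from the NCR to the CR strategy in Figure 8.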

7. Conclusion & Reflections

7.1. Conclusion Summary

This work introduces ResMimic, a highly effective two-stage residual learning framework for humanoid whole-body loco-manipulation. The framework successfully combines a pre-trained, task-agnostic General Motion Tracking (GMT) policy, trained on large-scale human motion data, with a task-specific residual policy. This residual policy refines the GMT outputs to achieve precise object interaction and locomotion. Key innovations, such as the point-cloud-based object tracking reward, contact reward, and curriculum-based virtual object controller, significantly boost training efficiency, robustness, and the ability to handle complex physical interactions. Extensive evaluations in both simulation and on a real Unitree G1 humanoid demonstrate ResMimic's substantial improvements in task success, motion fidelity, training efficiency, and robustness over strong baselines. The paper firmly establishes the transformative potential of leveraging pre-trained policies within a residual learning paradigm for advanced humanoid control.

7.2. Limitations & Future Work

The paper does not explicitly detail a "Limitations" section, but several can be inferred:

  • Dependence on High-Quality MoCap Data: Both the GMT base policy and the residual policy rely heavily on motion capture (MoCap) data for human and human-object interactions. The quality, diversity, and realism of this data directly impact the policies' performance and generalization. Acquiring such data, especially for complex and rare tasks, can be expensive and time-consuming.

  • Embodiment Gap Persistence: While ResMimic addresses the embodiment gap through residual learning and specific rewards, the initial kinematic retargeting of human motions to robots (e.g., via GMR [39]) remains a non-trivial process and can introduce imperfections that the residual policy must then correct.

  • Task-Specificity of Residual Policy: Although the GMT policy is task-agnostic, the residual policy is still trained per-task. While more efficient than training from scratch, adapting to entirely new tasks might still require new human-object interaction MoCap data and retraining the residual policy.

  • Complexity of Reward Engineering: While the paper aims to avoid "per-task reward engineering" for the residual policy by reusing $r_t^m$ and introducing $r_t^o$ and $r_t^c$, tuning the parameters of these rewards (e.g., $\lambda_o$, $\lambda$, $\sigma_c$) and the virtual object controller gains ($k_p$, $k_d$) can still be non-trivial.

    Potential future research directions implied by this work could include:

  • Learning More General Residual Policies: Exploring methods to learn a single residual policy that generalizes across multiple loco-manipulation tasks or objects, potentially through more advanced goal-conditioning or meta-learning techniques.

  • Reducing Reliance on Dense MoCap: Investigating alternative ways to generate reference motions or learn object interactions that require less explicit and dense MoCap data, perhaps through vision-based imitation or language-based task specification.

  • Extending to Unseen Objects and Environments: Enhancing the framework's ability to handle novel objects with varying dynamics and geometries, and operate in more complex, unstructured, or dynamic environments beyond the demonstrated tasks.

  • Combining with Model-Based Control: Further integrating ResMimic with model-predictive control (MPC) or other model-based techniques for even greater robustness and optimality, particularly for highly dynamic or safety-critical tasks.

7.3. Personal Insights & Critique

ResMimic offers several compelling insights into the future of humanoid robotics. The most significant is the power of the pretrain-then-refine paradigm. By leveraging a robust GMT prior, the paper effectively sidesteps the immense challenge of learning complex whole-body dynamics and human-like expressiveness from scratch for every new task. This modularity is a critical step towards scaling humanoid capabilities.

The design choices for the residual policy are particularly clever:

  • The point-cloud-based object tracking reward is an elegant solution to simplify reward engineering for object pose, implicitly handling both translation and rotation and providing a smoother learning signal (a minimal sketch of such a reward follows this list).

  • The virtual object controller curriculum directly addresses a common failure mode in imitation learning with real-world complexities (noisy data, heavy objects), ensuring that the learning process remains stable and productive rather than converging to undesirable local minima.

  • The contact reward is crucial for bridging the sim-to-real gap by explicitly encouraging physically meaningful interactions, which might otherwise be overlooked by purely kinematic or pose-based rewards.

    A potential critique lies in the ongoing reliance on precise MoCap data for both the GMT policy and the task-specific residual policy. While more efficient than training from scratch, the process of collecting, curating, and retargeting human-object interaction data still represents a significant overhead for new tasks. The phrase "task-agnostic base for generating human-like whole-body movements" is true for the GMT policy, but the residual policy itself is still per-task. The ultimate goal for general humanoid control would be a system that requires minimal to no new demonstrations for novel tasks, perhaps by synthesizing new reference motions or learning task-agnostic object interaction skills.
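As referenced above, here is a minimal sketch of one plausible point-cloud-based object tracking reward. It is an illustration under stated assumptions, not the authors' implementation: the point sampling, the exponential shaping, and the temperature `lam` are all assumptions.

```python
import numpy as np

# Illustrative sketch of a point-cloud-based object tracking reward: points sampled
# on the object surface are transformed by the current and reference object poses,
# and the reward decays with their mean pairwise distance.

def transform_points(points, pos, rot):
    # points: (N, 3) in the object's local frame; rot: (3, 3) rotation; pos: (3,).
    return points @ np.asarray(rot).T + np.asarray(pos)

def object_tracking_reward(points, pos, rot, pos_ref, rot_ref, lam=5.0):
    p_cur = transform_points(points, pos, rot)
    p_ref = transform_points(points, pos_ref, rot_ref)
    dist = np.mean(np.linalg.norm(p_cur - p_ref, axis=-1))
    return np.exp(-lam * dist)  # in (0, 1]; higher when the object tracks its reference
```

Because every sampled point moves when the object rotates, a rotation error increases the mean point distance even if the object's center stays in place, which is how a single smooth term can cover both translation and rotation.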

Despite these points, ResMimic provides a strong foundation and valuable blueprint. Its methods could be transferable to other domains requiring precise, expressive control built on general priors, such as dexterous manipulation with multi-fingered hands or even animal-like robots learning complex behaviors from observed natural movements. The paper's emphasis on sim-to-sim validation in MuJoCo as a proxy for real-world performance is also a robust methodology that strengthens confidence in its real-world deployability.
