ResMimic: From General Motion Tracking to Humanoid Whole-body Loco-Manipulation via Residual Learning
TL;DR Summary
ResMimic is a two-stage residual learning framework that enhances humanoid control for loco-manipulation. By refining outputs from a general motion tracking policy trained on human data, it significantly improves task success and training efficiency, as shown in simulation and in real-world experiments on a Unitree G1 humanoid.
Abstract
Humanoid whole-body loco-manipulation promises transformative capabilities for daily service and warehouse tasks. While recent advances in general motion tracking (GMT) have enabled humanoids to reproduce diverse human motions, these policies lack the precision and object awareness required for loco-manipulation. To this end, we introduce ResMimic, a two-stage residual learning framework for precise and expressive humanoid control from human motion data. First, a GMT policy, trained on large-scale human-only motion, serves as a task-agnostic base for generating human-like whole-body movements. An efficient but precise residual policy is then learned to refine the GMT outputs to improve locomotion and incorporate object interaction. To further facilitate efficient training, we design (i) a point-cloud-based object tracking reward for smoother optimization, (ii) a contact reward that encourages accurate humanoid body-object interactions, and (iii) a curriculum-based virtual object controller to stabilize early training. We evaluate ResMimic in both simulation and on a real Unitree G1 humanoid. Results show substantial gains in task success, training efficiency, and robustness over strong baselines. Videos are available at https://resmimic.github.io/ .
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "ResMimic: From General Motion Tracking to Humanoid Whole-body Loco-Manipulation via Residual Learning." This title indicates a focus on enabling humanoids to perform complex tasks involving both movement (loco-) and interaction with objects (-manipulation) by building upon existing motion tracking capabilities through a novel residual learning approach.
1.2. Authors
The authors are:
- Siheng Zhao
- Yanjie Ze
- Yue Wang
- C. Karen Liu
- Pieter Abbeel
- Guanya Shi
- Rocky Duan
Their affiliations include Amazon FAR (Frontier AI & Robotics), USC, Stanford University, UC Berkeley, and CMU. These affiliations suggest a strong background in robotics, artificial intelligence, and machine learning, particularly within prominent academic institutions and leading industry research labs.
1.3. Journal/Conference
The paper was posted on 2025-10-06 (UTC). While the specific journal or conference is not explicitly mentioned in the provided text, the arXiv link indicates it is a preprint accessible to the academic community. Papers posted on arXiv are often submitted to top-tier conferences or journals in robotics, AI, or machine learning, which typically undergo rigorous peer review. Given the authors' affiliations and the nature of the research, it is likely intended for a highly reputable venue.
1.4. Publication Year
The publication year, based on the provided UTC timestamp, is 2025.
1.5. Abstract
The paper introduces ResMimic, a two-stage residual learning framework designed to achieve precise and expressive humanoid whole-body loco-manipulation from human motion data. The core idea is to first use a General Motion Tracking (GMT) policy, pre-trained on extensive human motion data, as a foundation for generating human-like movements. This GMT policy is task-agnostic. In the second stage, a task-specific residual policy is learned to refine the GMT outputs. This refinement improves locomotion and enables object interaction, which the base GMT policy lacks. To enhance training efficiency, ResMimic incorporates several novel components: (i) a point-cloud-based object tracking reward for smoother optimization, (ii) a contact reward to guide accurate humanoid-object interactions, and (iii) a curriculum-based virtual object controller to stabilize initial training. The framework's effectiveness is demonstrated through evaluations in both simulation and on a real Unitree G1 humanoid, showing significant improvements in task success, training efficiency, and robustness compared to existing baselines.
1.6. Original Source Link
The original source link is https://arxiv.org/abs/2510.05070v2.
The PDF link is https://arxiv.org/pdf/2510.05070v2.pdf.
This indicates the paper is a preprint available on arXiv, a popular open-access archive for scholarly articles.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is enabling humanoid robots to perform whole-body loco-manipulation tasks with both precision and expressiveness. This capability involves combining locomotion (movement) and manipulation (object interaction) in a coordinated, human-like manner.
This problem is highly important due to the transformative potential of humanoids in various real-world applications, such as daily services and warehouse tasks. Humanoid robots, unlike wheeled or quadruped robots, are designed to operate in human-centric infrastructure, leveraging environments built for humans.
However, several challenges and gaps exist in prior research:
- Precision and Object Awareness: While General Motion Tracking (GMT) policies can reproduce diverse human motions, they often lack the precision required for loco-manipulation and are fundamentally unaware of manipulated objects.
- Embodiment Gap: Directly imitating human motions for loco-manipulation is challenging because human demonstrations often involve contact locations and relative object poses that do not directly translate to humanoid robots due to differences in physical form, size, and capabilities. This can lead to issues like floating contacts (where the robot's hand appears to touch an object but does not exert force) or penetrations (where the robot's body passes through an object), as illustrated in Figure 2.
- Scalability and Generality: Existing humanoid loco-manipulation approaches are often task-specific, relying on complex designs like stage-wise controllers or handcrafted data pipelines. These approaches limit the scalability and generality of solutions, meaning a new solution might be needed for every new task.
- Lack of Unified Framework: There is no existing unified, efficient, and precise framework for humanoid loco-manipulation that can handle diverse tasks.

The paper's entry point is inspired by breakthroughs in foundation models, which leverage pre-training on large-scale data followed by post-training (fine-tuning) for robust and generalized performance. The key insight is that while diverse human motions can be captured by a pre-trained GMT policy, object-centric loco-manipulation primarily requires task-specific corrections. Many whole-body motions (e.g., balance, stepping, reaching) are shared across tasks, requiring adaptation only for fine-grained object interaction. This motivates a residual learning paradigm, where a stable motion prior (from GMT) is augmented with lightweight, task-specific adjustments.
2.2. Main Contributions / Findings
The paper makes four primary contributions:
- Two-Stage Residual Learning Framework: The authors propose ResMimic, a novel two-stage residual learning framework that efficiently combines a pre-trained General Motion Tracking (GMT) policy with task-specific corrections. This enables precise and expressive humanoid loco-manipulation by decoupling general motion generation from fine-grained object interaction, leading to improved data efficiency and a general framework for loco-manipulation and locomotion enhancement.
- Enhanced Training Efficiency and Sim-to-Real Transfer Mechanisms: To overcome challenges in training and deployment, ResMimic introduces several technical innovations:
  - A point-cloud-based object tracking reward that offers a smoother optimization landscape than traditional pose-based rewards, making training more stable.
  - A contact reward that explicitly guides the humanoid towards accurate body-object contacts, which is crucial for realistic physical interactions and sim-to-real transfer.
  - A curriculum-based virtual object controller that stabilizes early training phases by providing temporary assistance, especially when dealing with noisy reference motions or heavy objects, preventing premature policy failures.
- Extensive Evaluation and Robustness: The paper conducts comprehensive evaluations in both high-fidelity simulation environments (IsaacGym and MuJoCo) and on a real Unitree G1 humanoid robot. The results demonstrate substantial improvements in human motion tracking, object motion tracking, task success rate, training efficiency, robustness, and generalization across challenging loco-manipulation tasks compared to strong baselines.
- Resource Release for Research Acceleration: To foster further research, the authors commit to releasing their GPU-accelerated simulation infrastructure, a sim-to-sim evaluation prototype, and motion data. This contribution aims to lower the barrier for other researchers to build upon their work.

The key findings are that ResMimic successfully addresses the limitations of GMT policies by adding object awareness and precision through residual learning. It significantly outperforms baselines that either directly deploy GMT without adaptation, train from scratch, or use simple fine-tuning. The framework demonstrates real-world applicability and robustness, highlighting the power of a pretrain-finetune paradigm for humanoid whole-body control.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand the ResMimic framework, a novice reader should be familiar with several core concepts in robotics, machine learning, and control theory.
3.1.1. Humanoid Whole-Body Loco-Manipulation
Humanoid robots are robots designed to resemble the human body, typically featuring a torso, head, two arms, and two legs. Whole-body control refers to the coordinated use of all these limbs and the torso to perform tasks. Loco-manipulation is a portmanteau combining locomotion (movement, like walking, kneeling, or balancing) and manipulation (interacting with objects, like grasping, lifting, or carrying). Therefore, humanoid whole-body loco-manipulation refers to the ability of a humanoid robot to move its entire body in a coordinated way to interact with and manipulate objects in its environment, often while simultaneously moving or maintaining balance. This is a complex task because it requires managing high degrees of freedom, maintaining stability, and understanding physical interactions with objects and the environment.
3.1.2. General Motion Tracking (GMT)
General Motion Tracking (GMT) is a technique in robotics where a robot learns to imitate a wide variety of human movements or reference motions. These motions are typically captured from humans using motion capture (MoCap) systems, which record the 3D positions and orientations of markers placed on a human body. A GMT policy, often trained using Reinforcement Learning (RL), aims to reproduce these motions on a robot as accurately as possible. The goal is to make the robot's movements appear natural and human-like. However, traditional GMT policies are primarily focused on the kinematics (movement) of the robot's own body and are usually task-agnostic, meaning they don't have explicit knowledge or awareness of external objects or the specific goals of a manipulation task.
3.1.3. Residual Learning
Residual learning is a machine learning paradigm where a model learns to predict the "residual" or "difference" between a desired output and an output produced by a base model or a simpler system. Instead of learning the entire complex function from scratch, the residual model focuses on learning only the corrections needed to improve the base model's output. In the context of control, this often means training a residual policy to output small adjustments to the actions suggested by a base policy or a classical controller. This approach can be more efficient because the residual model only needs to learn the "hard" parts of the problem, while the base model handles the "easy" or general aspects. It often leads to faster training, better stability, and improved performance.
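As a concrete illustration of the idea (the function names and the 29-dimensional action are placeholders for this sketch, not taken from any specific system), residual control simply adds a learned correction to the base controller's output:

```python
import numpy as np

def base_policy(proprio, motion_goal):
    """Stand-in for a pre-trained general controller returning coarse target joint angles."""
    return np.zeros(29)  # placeholder output for a 29-DoF humanoid

def residual_policy(proprio, obj_state, motion_goal, obj_goal):
    """Stand-in for a lightweight task-specific corrector."""
    return np.zeros(29)  # near zero at the start of training

def control_step(proprio, obj_state, motion_goal, obj_goal):
    a_base = base_policy(proprio, motion_goal)                          # general, human-like motion
    delta = residual_policy(proprio, obj_state, motion_goal, obj_goal)  # task-specific correction
    return a_base + delta                                               # final command
```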
3.1.4. Reinforcement Learning (RL) and PPO
Reinforcement Learning (RL) is an area of machine learning concerned with how intelligent agents should take actions in an environment to maximize the cumulative reward. An RL agent learns by trial and error, observing the state of the environment, taking an action, and receiving a reward signal, which indicates how good or bad that action was. The goal is to learn a policy, which is a mapping from states to actions, that maximizes the expected cumulative discounted reward over time.
The process is often modeled as a Markov Decision Process (MDP), defined by a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma)$:
- $\mathcal{S}$: the state space, representing all possible configurations of the environment.
- $\mathcal{A}$: the action space, representing all possible actions the agent can take.
- $\mathcal{T}$: the transition dynamics, which describe how the environment changes from one state to another given an action. Mathematically, $\mathcal{T}(s' \mid s, a)$ is the probability of reaching state $s'$ from state $s$ after taking action $a$.
- $\mathcal{R}$: the reward function, which defines the immediate reward $R(s, a)$ an agent receives for taking action $a$ in state $s$.
- $\gamma$: the discount factor (a value between 0 and 1), which determines the present value of future rewards. A higher $\gamma$ makes future rewards more significant.

Proximal Policy Optimization (PPO) is a popular on-policy Reinforcement Learning algorithm, known for its balance between ease of implementation, sample efficiency, and good performance across a wide range of tasks. PPO tries to take the largest practical improvement step on the policy without collapsing into a bad policy. It achieves this with a clipped objective function that discourages large policy changes, keeping the new policy "proximal" to the old one. This stability makes it suitable for complex control problems like humanoid robotics.
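For completeness, the clipped surrogate objective that PPO maximizes (the standard form from the PPO literature, not restated in this analysis) is:

$$
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
$$

where $\hat{A}_t$ is an advantage estimate and $\epsilon$ is the clipping range.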
3.1.5. Sim-to-Real Transfer
Sim-to-real transfer is the process of training a robot control policy in a simulated environment (which is generally safer, faster, and cheaper) and then deploying that trained policy on a physical robot in the real world. A major challenge is the sim-to-real gap, which refers to the discrepancies between the simulated environment (physics, sensor noise, modeling inaccuracies) and the real world. Policies trained purely in simulation may perform poorly or fail entirely when transferred to reality due to these differences. Techniques like domain randomization (varying simulation parameters during training) are used to make policies more robust to these unmodeled real-world variations, thereby improving sim-to-real transfer.
3.1.6. Kinematic Retargeting
Kinematic retargeting is the process of transferring a motion from one character (e.g., a human) to another character (e.g., a humanoid robot) with a different skeletal structure, proportions, or degrees of freedom. The goal is to make the target character perform the same general motion while respecting its own anatomical constraints. This often involves mapping joints from the source to the target, adjusting for limb lengths, and ensuring that the overall style and intent of the motion are preserved. The embodiment gap is a significant challenge here, as direct translation of human contact points or postures might be physically impossible or lead to instability for a robot. GMR [39] (General Motion Retargeting) is a specific method mentioned in the paper for this purpose.
3.2. Previous Works
The paper contextualizes its contributions by discussing prior work in learning-based humanoid control, humanoid loco-manipulation, and residual learning in robotics.
3.2.1. Learning-Based Humanoid Control
Early Reinforcement Learning (RL) for humanoids faced challenges like low data efficiency and the need for substantial effort in task-specific reward design [17]. This often limited work to simpler tasks like locomotion [18], [19] or highly specific skills such as getting up [20] and keeping balance [21].
A promising direction has been learning from human motions [5], which involves kinematic retargeting to map human demonstrations to robots, addressing the embodiment gap. This led to accurate tracking of individual motions [22]-[24] and more versatile general motion tracking (GMT) [6], [7], [25], [26].
- DeepMimic [5]: A foundational work that enabled physics-based character skills to be learned from human motion examples using deep RL. It showed how to achieve highly dynamic and diverse motions.
- TWIST [6]: A Teleoperated Whole-body Imitation System that allows humanoids to reproduce diverse human motions, often serving as a strong GMT baseline. The authors refer to TWIST for reward formulation and domain randomization details in their GMT training.
- GMT [7]: A general motion tracking framework for humanoid whole-body control.
- VideoMimic [27]: Made progress by not only tracking human motion but also reconstructing the environment, enabling contextual motions like sitting on a chair. However, it was limited to static environments and did not extend to dynamic object interactions.

The key differentiation of ResMimic from these GMT-focused works is its explicit focus on dynamic loco-manipulation with object awareness, which these prior GMT policies lacked.
3.2.2. Humanoid Loco-Manipulation
Humanoid loco-manipulation is particularly challenging.
- Teleoperation [6], [28]-[30]: Some works have used teleoperation to control humanoids for these tasks. However, these methods require a human operator and lack explicit object awareness.
- Autonomous Imitation Learning [4], [31]: Building on teleoperation, others have trained autonomous imitation learning policies from collected data. These efforts, however, were often restricted to tabletop manipulation with limited whole-body expressiveness, tasks potentially better suited for dual-arm mobile manipulators.
- Modular and Trajectory-Optimized Approaches [10], [11]:
  - Dao et al. [10]: Proposed a modular sim-to-real RL pipeline for box loco-manipulation, decomposing the task into distinct phases (e.g., walking, box-picking) with separate policies. This implies a lack of end-to-end learning and potentially less natural transitions.
  - Liu et al. [11]: Introduced an end-to-end learning pipeline using reference motions generated by task-specific trajectory optimization. This approach, while end-to-end, still relies on task-specific designs for reference motion generation.

The paper argues that these loco-manipulation approaches demonstrated limited whole-body contact (e.g., using only the hands) and expressiveness, and relied on highly task-specific designs. ResMimic aims to overcome this by leveraging a GMT policy as a prior, enabling more expressive whole-body loco-manipulation within a unified framework.
3.2.3. Residual Learning for Robotics
Residual learning has been widely adopted in robotics to refine existing controllers or policies.
- Refining Hand-Designed Policies [32], [33]: Early works used residual policies to improve hand-designed policies or model predictive controllers for more precise and contact-rich manipulation.
- Refining Imitation Policies [34], [35]: Later approaches extended residual learning to policies initialized from demonstrations.
- Dexterous Hand Manipulation [36], [37]: In dexterous hand manipulation, residual policies adapted human hand motions for task-oriented control.
- ASAP [22]: A notable work in humanoids that leverages residual learning to compensate for dynamics mismatch between simulation and reality, enabling agile whole-body skills.

ResMimic differentiates itself by leveraging a pre-trained General Motion Tracking (GMT) policy as its foundation. This is distinct from refining classical controllers, arbitrary demonstrations, or solely focusing on sim-to-real dynamics. Instead, ResMimic uses residual learning to add object awareness and task-specific precision on top of a robust, general human-like motion prior.
3.3. Technological Evolution
The field of humanoid control has evolved significantly:
- Classical Control/Model-Based Approaches: Early humanoid control often relied on model-based control (e.g., whole-body control with inverse kinematics/dynamics) and pre-programmed motions. These were precise but often lacked adaptability to unforeseen disturbances or varied tasks.
- Reinforcement Learning for Locomotion: The advent of RL allowed humanoids to learn more dynamic and robust locomotion skills (walking, running, balancing) by interacting with environments, moving beyond purely model-based methods. However, RL often struggled with data efficiency and complex reward design.
- Imitation Learning from Human Motion (GMT): To overcome the complexity of reward design and achieve more natural motions, researchers turned to imitation learning, specifically GMT. By retargeting human motion data, humanoids could learn diverse and expressive whole-body movements. This represented a significant step towards human-like behavior.
- Foundation Models and the Pre-training Paradigm: Inspired by foundation models in NLP and vision, the idea of pre-training large, general models on vast datasets became prevalent. In robotics, this translates to training powerful base policies on general data (like GMT on human motion) that can then be adapted for specific downstream tasks.
- Residual Learning for Task-Specific Adaptation: ResMimic fits into the latter part of this evolution. It recognizes that GMT provides a powerful motion prior but is insufficient for loco-manipulation due to a lack of object awareness and precision. By applying residual learning, ResMimic efficiently adapts this strong GMT foundation to object-centric tasks, effectively bridging the gap between general human-like motion and precise humanoid loco-manipulation.
3.4. Differentiation Analysis
Compared to the main methods in related work, ResMimic introduces several core differences and innovations:
- Unified Framework for Expressive Loco-Manipulation: Unlike previous loco-manipulation works that were often task-specific [10], [11] or limited in whole-body expressiveness and contact richness, ResMimic leverages a powerful GMT prior. This enables a unified framework that can handle diverse, expressive whole-body loco-manipulation tasks involving complex contacts (e.g., using the torso and arms, not just the hands).
- Decoupling Motion Generation from Object Interaction: The two-stage residual learning framework is a key innovation. Instead of trying to learn both general motion and precise object interaction simultaneously from scratch (which Train from Scratch attempts and fails at) or directly fine-tuning GMT without explicit object input (Finetune), ResMimic separates these concerns. The first stage provides a robust, task-agnostic human-like motion prior, and the second stage efficiently learns only the task-specific corrections for object interaction.
- Effective Handling of the Embodiment Gap and Noisy Data: ResMimic addresses the embodiment gap and imperfect human-object interaction data (which lead to penetrations or floating contacts) through its specialized training mechanisms:
  - The virtual object controller stabilizes early training by assisting object manipulation, allowing the residual policy to learn robust interactions despite initial data imperfections.
  - The contact reward explicitly encourages correct body-object contact, which is crucial for real-world deployability and sim-to-real transfer, going beyond simple pose matching.
- Robustness and Sim-to-Real Performance: ResMimic demonstrates superior sim-to-sim transfer (from IsaacGym to MuJoCo) and successful real-world deployment on a Unitree G1 humanoid. This contrasts sharply with Train from Scratch and Finetune policies, which often fail or degrade significantly in more realistic simulation or on hardware. This robustness stems from leveraging the generalization capabilities of the pre-trained GMT policy and the proposed training enhancements.
- Efficient Adaptation: By learning a residual policy, ResMimic significantly improves training efficiency for new tasks compared to training from scratch, as the residual policy is lightweight and only needs to learn corrections, not the entire motion.
4. Methodology
4.1. Principles
The core principle of ResMimic is to decompose the complex problem of humanoid whole-body loco-manipulation into two more manageable parts using residual learning. The intuition is that generating general, human-like motion is a common, reusable skill, while precise object interaction is task-specific. Therefore, a powerful General Motion Tracking (GMT) policy first learns to generate robust, human-like motions. Then, a smaller, more efficient residual policy learns only the necessary corrections to these base motions to achieve precise object interaction and loco-manipulation. This modular approach leverages the strengths of a pre-trained general model while allowing for efficient, task-specific adaptation.
4.2. Core Methodology In-depth (Layer by Layer)
The ResMimic framework formulates the whole-body loco-manipulation task as a goal-conditioned Reinforcement Learning (RL) problem within a Markov Decision Process (MDP).
The MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma)$:
- $\mathcal{S}$ is the state space.
- $\mathcal{A}$ is the action space.
- $\mathcal{T}$ denotes the transition dynamics of the environment.
- $\mathcal{R}$ is the reward function.
- $\gamma$ is the discount factor.

At each time step $t$, the state $s_t$ is composed of four main components:
- robot proprioception: information about the robot's current internal state;
- object state: information about the current state of the manipulated object;
- motion goal state: the target state for the robot's motion, typically derived from human demonstrations;
- object goal state: the target state for the object, also derived from demonstrations.

The action $a_t$ generated by the policy specifies target joint angles for the robot. These target angles are then executed on the real or simulated robot using a Proportional-Derivative (PD) controller, which attempts to drive the robot's actual joint angles to the commanded targets.
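A minimal sketch of the joint-space PD law that converts target joint angles into joint torques (the gains and the torque-level interface are illustrative assumptions; the robot's actual low-level controller may differ):

```python
import numpy as np

def pd_torques(q_target, q, qd, kp=60.0, kd=2.0):
    """Joint-space PD controller driving measured joint angles q toward q_target.

    q_target, q: (29,) target and measured joint angles in radians.
    qd:          (29,) measured joint velocities in rad/s.
    """
    return kp * (q_target - q) - kd * qd
```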
The reward $r_t$ is computed at each time step based on the current state and action. The overall training objective is to maximize the expected cumulative discounted reward, $\mathbb{E}\big[\sum_{t} \gamma^{t} r_{t}\big]$. Both stages of ResMimic are trained using the Proximal Policy Optimization (PPO) [38] algorithm.
The ResMimic framework operates in two distinct stages, as illustrated in Figure 3.
4.2.1. Stage I: General Motion Tracking (GMT)
The first stage focuses on training a task-agnostic General Motion Tracking (GMT) policy that serves as the backbone controller, responsible for generating human-like whole-body movements.
- Input: The GMT policy takes two inputs: the robot proprioception and the reference motion.
- Output: It outputs a coarse action (target joint angles) computed from these two inputs.
- Objective: The GMT policy is optimized to maximize only the motion tracking reward. This means the policy primarily cares about how well the robot mimics the human motion, not about object interaction.
4.2.1.1. Dataset for GMT
The GMT policy is trained on large-scale human motion capture data. This approach decouples human motion tracking from object interaction, avoiding the need for costly and hard-to-obtain manipulation data for the base policy.
- Sources: Publicly available
MoCapdatasets such asAMASS [8]andOMOMO [9], totaling over 15,000 clips (approximately 42 hours). - Data Curation: Motions impractical for the humanoid robot (e.g., stair climbing) are filtered out.
- Retargeting:
Kinematics-based motion retargeting(e.g.,GMR [39]) is applied to convert the human motion data intohumanoid reference trajectories. Here, indexes individual motion clips, indexes time steps within a clip, is the clip length, and is the total number of clips.
4.2.1.2. Training Strategy for GMT
The GMT policy is trained using a single-stage RL framework in simulation without access to privileged information (information that is only available in simulation, like exact contact forces, which are not directly observable by a real robot).
- Proprioceptive Observation: This describes the robot's internal state and is defined as a history of observations over a recent window, containing:
  - the root orientation (orientation of the robot's base or main body);
  - the root angular velocity (how fast the robot's base is rotating);
  - the joint position for each of the robot's 29 degrees of freedom;
  - the joint velocity for each joint;
  - the recent action history, i.e., the past actions taken by the robot.
  The subscript t-10:t indicates that observations from the past 10 time steps up to the current time step are included.
- Reference Motion Input: This describes the target human motion the robot should mimic. It includes future reference information to help the policy anticipate movements:
  - the reference root translation (target position of the robot's base);
  - the reference root orientation (target orientation of the robot's base);
  - the reference joint position for each joint.
  The subscript t-10:t+10 indicates that the reference motion from the past 10 time steps up to the next 10 time steps is included, allowing the policy to anticipate upcoming targets for smoother tracking.
4.2.1.3. Reward and Domain Randomization for GMT
- Motion Tracking Reward: Following TWIST [6], the reward is a sum of three components:
  - task rewards: encourage the robot to achieve specific motion goals (e.g., matching joint positions and root velocity);
  - penalty terms: discourage undesirable behaviors (e.g., falling, violating joint limits);
  - regularization terms: promote smooth and energy-efficient movements.
- Domain Randomization: To improve sim-to-real transfer and robustness, domain randomization is applied during training. This involves varying physical parameters (e.g., friction, mass, sensor noise) in the simulation, forcing the policy to adapt to a range of environmental conditions rather than overfitting to specific simulation parameters.
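As an illustration of what per-episode domain randomization over physical parameters can look like (the parameter names and ranges below are assumptions for this sketch, not the paper's settings):

```python
import numpy as np

def sample_randomized_physics(rng: np.random.Generator) -> dict:
    """Sample per-episode physics perturbations (all ranges are illustrative)."""
    return {
        "friction_scale": rng.uniform(0.5, 1.5),        # ground/object friction multiplier
        "mass_scale": rng.uniform(0.8, 1.2),            # link mass multiplier
        "motor_strength": rng.uniform(0.9, 1.1),        # actuator gain multiplier
        "obs_noise_std": rng.uniform(0.0, 0.02),        # additive proprioception noise
        "action_delay_steps": int(rng.integers(0, 3)),  # control latency in sim steps
    }

# Example usage: params = sample_randomized_physics(np.random.default_rng(0))
```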
4.2.2. Stage II: Residual Refinement Policy
Building on the pre-trained GMT policy from Stage I, this stage introduces a residual policy to refine the coarse actions and enable precise object manipulation.
- Input: The residual policy takes a comprehensive input including the robot state, the object state, and their respective reference trajectories.
- Output: It outputs a residual action.
- Final Action: The final action sent to the robot's PD controller is the sum of the GMT policy's output and the residual policy's correction.
- Objective: The residual policy is optimized to maximize a combined reward that includes both motion tracking and object interaction, i.e., it refines motions while specifically aiming to succeed at object tasks.
4.2.2.1. Reference Motions for Residual Refinement
For the residual policy, reference trajectories are obtained from MoCap systems that simultaneously record both human motion and object motion.
- Human Motion: The recorded human motion is retargeted to the humanoid robot using GMR [39] to produce humanoid reference trajectories.
- Object Motion: The recorded object motion is directly used as the reference object trajectory.
- Complete Reference: These combined trajectories guide the training of the residual policy.
4.2.2.2. Training Strategy for Residual Refinement
Similar to the GMT stage, single-stage RL with PPO is used for residual learning.
- Object State Representation: The current state of the object is represented by its pose and velocity:
  - the object root translation (position);
  - the object root orientation;
  - the object root linear velocity;
  - the object root angular velocity.
- Reference Object Trajectory: Similar to the robot motion, the reference object trajectory includes future information:
  - the reference object root translation;
  - the reference object root orientation;
  - the reference object root linear velocity;
  - the reference object root angular velocity.
  The subscript t-10:t+10 indicates a history and future window for the object reference.
- Network Initialization: At the beginning of training, the residual policy should ideally output actions close to zero, since the humanoid already performs human-like motion. To encourage this, the final layer of the PPO actor network is initialized using Xavier uniform initialization with a small gain factor [40]. This keeps the initial outputs close to zero, minimizing disruption to the pre-trained GMT policy (a small code sketch of this, together with the virtual-force curriculum, follows this list).
- Virtual Object Force Curriculum: This mechanism is designed to stabilize training, especially when reference motions are noisy (leading to penetrations) or objects are heavy. Such issues can cause the initial policy to fail by knocking over the object or retreating.
  - Mechanism: PD controllers apply virtual forces ($\mathcal{F}_t$) and torques ($\mathcal{T}_t$) to the object, guiding it towards its reference trajectory.
  - Formulas: $ \mathcal F _ { t } = k _ { p } ( \hat { p } _ { t } ^ { o } - p _ { t } ^ { o } ) - k _ { d } v _ { t } ^ { o } $ and $ \mathcal T _ { t } = k _ { p } ( \hat { \theta } _ { t } ^ { o } \ominus \theta _ { t } ^ { o } ) - k _ { d } \omega _ { t } ^ { o } $
  - Symbol Explanation:
    - $\mathcal{F}_t$, $\mathcal{T}_t$: the virtual control force and torque applied to the object at time $t$.
    - $k_p$: the proportional gain of the PD controller, scaling the correction based on position/orientation error.
    - $k_d$: the derivative gain of the PD controller, scaling the correction based on velocity/angular-velocity error.
    - $\hat{p}_t^o$, $p_t^o$: the reference and current positions of the object at time $t$.
    - $v_t^o$: the current linear velocity of the object at time $t$.
    - $\hat{\theta}_t^o$, $\theta_t^o$: the reference and current orientations of the object at time $t$.
    - $\omega_t^o$: the current angular velocity of the object at time $t$.
    - $\ominus$: the rotation difference (e.g., the shortest angular distance between two orientations).
  - Curriculum: The controller gains are gradually decayed over training, providing strong virtual assistance early on to stabilize learning and then reducing the assistance as the policy learns to handle the task autonomously.
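The two training aids above can be sketched concretely as follows (layer sizes, gains, and the decay schedule are illustrative assumptions, not values from the paper; the rotation difference is simplified to an axis-angle error vector):

```python
import numpy as np
import torch.nn as nn

# (i) Near-zero residual output at the start of training: Xavier-uniform with a
# small gain keeps the correction tiny, so the base GMT behavior is preserved.
residual_head = nn.Linear(256, 29)   # 29-DoF action head; hidden size is illustrative
nn.init.xavier_uniform_(residual_head.weight, gain=0.01)
nn.init.zeros_(residual_head.bias)

# (ii) Virtual object PD assist with gains that decay over training.
def virtual_object_wrench(p_ref, p, v, rot_err, omega, kp, kd):
    """PD-style virtual force/torque pulling the object toward its reference.
    rot_err approximates the rotation difference as an axis-angle error vector."""
    force = kp * (p_ref - p) - kd * v
    torque = kp * rot_err - kd * omega
    return force, torque

def assist_gains(step, kp0=500.0, kd0=20.0, decay_steps=3000):
    """Linearly decay the assist gains to zero over training (illustrative schedule)."""
    scale = max(0.0, 1.0 - step / decay_steps)
    return kp0 * scale, kd0 * scale
```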
4.2.2.3. Reward and Early Termination for Residual Refinement
The residual refinement stage reuses the motion reward and domain randomization from GMT training. It introduces two additional reward terms, the object tracking reward ($r_t^o$) and the contact tracking reward ($r_t^c$), along with additional early-termination conditions (a code sketch of these terms follows this list).
- Object Tracking Reward ($r_t^o$): Instead of traditional pose-based differences [11], [42], ResMimic uses a point-cloud-based difference for a smoother reward landscape, which implicitly accounts for both translation and rotation without task-specific weight tuning.
  - Formula: $ r _ { t } ^ { o } = \exp \big( - \lambda _ { o } \sum _ { i = 1 } ^ { N } \| \mathbf { P } [ i ] _ { t } - \hat { \mathbf { P } } [ i ] _ { t } \| _ { 2 } \big) $
  - Symbol Explanation:
    - $\lambda_o$: a positive scaling factor that determines the sensitivity of the reward to the object tracking error.
    - $N$: the number of points sampled from the object's mesh surface.
    - $\mathbf{P}[i]_t$ and $\hat{\mathbf{P}}[i]_t$: the $i$-th sampled 3D points on the current and reference object meshes at time $t$.
    - $\|\cdot\|_2$: the Euclidean (L2) norm, measuring the distance between corresponding points.
  - The reward is an exponential function: it is highest when the points on the current object closely match the reference points and decreases rapidly as the difference grows.
- Contact Reward ($r_t^c$): This reward encourages correct physical interactions and helps the robot learn to make meaningful contacts with objects. It discretizes contact locations to relevant links (e.g., torso, hip, arms, excluding the feet, which primarily contact the ground).
  - Oracle Contact Information: This is derived from the reference human-object interaction trajectory: $ \hat { c } _ { t } [ i ] = \mathbf { 1 } ( \lVert \hat { d } _ { t } [ i ] \rVert < \sigma _ { c } ) $
    - $i$: indexes the specific robot links (e.g., torso, left forearm, right hand).
    - $\mathbf{1}(\cdot)$: the indicator function, which equals 1 if the condition inside the parentheses is true and 0 otherwise.
    - $\hat{d}_t[i]$: the distance between robot link $i$ and the object surface in the reference trajectory at time $t$.
    - $\sigma_c$: a threshold distance; if the reference distance is below it, a contact should be present.
  - Contact Tracking Reward Formula: $ r _ { t } ^ { c } = \sum _ { i } \hat { c } _ { t } [ i ] \cdot \exp \Big ( - \frac { \lambda } { f _ { t } [ i ] } \Big ) $
    - $\hat{c}_t[i]$: the oracle contact indicator for link $i$ at time $t$ (1 if contact is expected, 0 otherwise).
    - $\lambda$: a positive scaling factor for the exponential decay.
    - $f_t[i]$: the actual contact force measured at robot link $i$ at time $t$.
  - This reward encourages the robot to exert force on links where contact is expected based on the reference motion. If contact is expected ($\hat{c}_t[i] = 1$), the reward increases with the contact force (saturating at a rate controlled by $\lambda$); if no contact is expected, the term is zero.
- Early Termination: To prevent the policy from entering undesirable states, episodes are terminated early under specific conditions:
  - Standard motion tracking conditions: unintended ground contact (e.g., the torso touching the ground) or substantial deviation of a body part from its reference.
  - Additional loco-manipulation conditions:
    - The object mesh deviates from its reference beyond a predefined threshold.
    - Any required body-object contact (as indicated by $\hat{c}_t[i]$) is lost for more than 10 consecutive frames.
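A compact sketch of the two task-specific reward terms and the loco-manipulation termination checks (scaling factors, thresholds, and the small epsilon guard are illustrative assumptions, not values from the paper):

```python
import numpy as np

def object_tracking_reward(points, ref_points, lam_o=5.0):
    """Point-cloud object tracking reward over N corresponding sampled points.
    points, ref_points: (N, 3) arrays from the current and reference object meshes."""
    dists = np.linalg.norm(points - ref_points, axis=1)
    return float(np.exp(-lam_o * dists.sum()))

def contact_reward(ref_dists, contact_forces, sigma_c=0.03, lam=50.0, eps=1e-6):
    """Reward contact force on links where the reference indicates contact.
    ref_dists: per-link reference link-object distances (m);
    contact_forces: per-link measured contact force magnitudes (N)."""
    expected = (np.asarray(ref_dists) < sigma_c).astype(float)  # oracle contact indicator
    forces = np.maximum(np.asarray(contact_forces), eps)        # guard against division by zero
    return float(np.sum(expected * np.exp(-lam / forces)))

def should_terminate(obj_point_err, lost_contact_frames, torso_on_ground,
                     obj_err_threshold=0.5, max_lost_frames=10):
    """Early-termination checks for loco-manipulation episodes."""
    return (torso_on_ground                            # unintended ground contact
            or obj_point_err > obj_err_threshold       # object drifted too far from reference
            or lost_contact_frames > max_lost_frames)  # required contact lost too long
```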
4.2.3. Overall Architecture (Figure 3)
Fig. 3 (schematic): The two stages of the ResMimic method: retargeting of large-scale human motion and residual policy training. In the second stage, motion capture data and reference humanoid motion are used to optimize humanoid motion and object interaction. Key reward mechanisms include the object tracking reward and the contact reward, with the action-composition formula representing the residual learning process.
Figure 3 visually summarizes the two-stage process:
- Stage 1: GMT Policy Training: Human MoCap data is used to train the GMT policy. This policy takes robot proprioception and human reference motion as input and outputs a coarse action. This stage focuses on the motion tracking reward.
- Stage 2: Residual Policy Training: Human-object MoCap data (including the object reference) is used. The GMT policy's output is augmented by a residual policy, which takes robot proprioception, object state, human reference motion, and object reference motion as input and outputs a residual action. The final (summed) action is then executed. This stage optimizes for both the motion tracking reward and the object tracking reward, incorporating the contact reward and using the virtual object controller.
5. Experimental Setup
The authors evaluate ResMimic through extensive sim-to-sim evaluation and real-world deployment on a Unitree G1 humanoid robot, which has 29 degrees of freedom (DoF).
5.1. Datasets
For training the General Motion Tracking (GMT) policy, the authors leverage several publicly available MoCap datasets, including AMASS [8] and OMOMO [9]. These datasets collectively contain over 15,000 clips, amounting to approximately 42 hours of human motion data. Motions deemed impractical for the humanoid robot (e.g., stair climbing) are filtered out. Kinematics-based motion retargeting (specifically, GMR [39]) is then used to map these human motions to the humanoid robot's kinematics.
For training the Residual Refinement Policy, the reference trajectories for both human and object motions are collected using an OptiTrack motion capture system, which simultaneously records human motion and object motion. The human motion is retargeted to the humanoid robot using GMR [39] to generate humanoid reference trajectories, while the object motion is directly used as the reference object trajectory.
These datasets are chosen because they provide a rich source of human-like movements and human-object interactions, which are critical for training policies that aim to mimic human behavior. The MoCap system for object interaction provides precise ground truth for both robot and object trajectories, enabling robust learning for the residual policy.
5.2. Evaluation Metrics
The effectiveness of ResMimic is assessed using several metrics that cover training efficiency, motion fidelity, manipulation accuracy, and overall task completion.
5.2.1. Training Iterations (Iter.)
- Conceptual Definition: This metric quantifies the computational cost or time required for a policy to converge to a stable and effective solution during training. It measures how many training steps (or batches of data processed) were necessary for the reward to stop significantly increasing, indicating that the learning process has stabilized.
- Mathematical Formula: Not a specific formula, but rather a count of optimization steps.
- Symbol Explanation:
Iter. represents the total number of training iterations until convergence. A lower value indicates higher training efficiency.
5.2.2. Object Tracking Error ($E_o$)
- Conceptual Definition: This metric measures how accurately the simulated or real object's position and orientation match its reference trajectory. The paper uses a point-cloud-based approach, which is more robust than simple pose differences, as it considers the entire shape of the object.
- Mathematical Formula: $ E _ { o } = \frac { 1 } { T } \sum _ { t = 1 } ^ { T } \sum _ { i = 1 } ^ { N } \| \mathbf { P } [ i ] _ { t } - \hat { \mathbf { P } } [ i ] _ { t } \| _ { 2 } $
- Symbol Explanation:
  - $E_o$: the average object tracking error over an episode.
  - $T$: the total number of time steps in the episode.
  - $N$: the number of points sampled from the object's mesh surface.
  - $\mathbf{P}[i]_t$: the $i$-th sampled 3D point on the current object mesh at time $t$.
  - $\hat{\mathbf{P}}[i]_t$: the $i$-th sampled 3D point on the reference object mesh at time $t$ (written with a bar in the paper's metric definition; it denotes the same reference points as in the object tracking reward).
  - $\|\cdot\|_2$: the Euclidean norm, the straight-line distance between two points in 3D space.
  A lower value indicates better object tracking accuracy.
5.2.3. Motion Tracking Error ($E_m$)
- Conceptual Definition: This metric quantifies how closely the robot's overall body motion (specifically the global positions of its body parts) matches the corresponding reference human motion.
- Mathematical Formula: $ E _ { m } = \frac { 1 } { T } \sum _ { t = 1 } ^ { T } \sum _ { i } \| p _ { t } [ i ] - \hat { p } _ { t } [ i ] \| _ { 2 } $
- Symbol Explanation:
  - $E_m$: the average motion tracking error over an episode.
  - $T$: the total number of time steps in the episode.
  - $i$: indexes the body parts (links) of the robot.
  - $p_t[i]$: the global position of body part $i$ on the robot at time $t$.
  - $\hat{p}_t[i]$: the global position of body part $i$ in the reference motion at time $t$.
  - $\|\cdot\|_2$: the Euclidean norm.
  A lower value indicates higher motion fidelity.
5.2.4. Joint Tracking Error ($E_j$)
- Conceptual Definition: This metric measures the precision of the robot's joint movements by comparing its actual joint angles to the reference joint angles from the human motion. It specifically evaluates joint-level accuracy.
- Mathematical Formula: $ E _ { j } = \frac { 1 } { T } \sum _ { t = 1 } ^ { T } \| q _ { t } - \hat { q } _ { t } \| _ { 2 } $
- Symbol Explanation:
  - $E_j$: the average joint tracking error over an episode.
  - $T$: the total number of time steps in the episode.
  - $q_t \in \mathbb{R}^{29}$: the vector of joint positions (angles) of the robot at time $t$; the dimension reflects the robot's 29 degrees of freedom.
  - $\hat{q}_t$: the vector of reference joint positions (angles) at time $t$.
  - $\|\cdot\|_2$: the Euclidean norm, applied to the vector of joint differences.
  A lower value indicates more precise joint-level control.
5.2.5. Task Success Rate (SR)
- Conceptual Definition: This is a binary metric indicating whether a specific task was successfully completed during a rollout. It encompasses both object manipulation and robot stability.
- Mathematical Formula: Not a specific formula, but a percentage calculated as (Number of Successful Rollouts / Total Number of Rollouts) * 100%.
- Symbol Explanation: SR stands for Success Rate. A rollout is considered successful if two conditions are met:
  - the object tracking error ($E_o$) is below a predefined threshold;
  - the robot remains balanced (i.e., it does not fall or violate the stability-related early-termination conditions).
  A higher value indicates greater task reliability.
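To make the evaluation protocol concrete, a small sketch of how the per-rollout object error and success flag might be computed (the error threshold is an illustrative assumption):

```python
import numpy as np

def rollout_metrics(obj_pts, ref_pts, fell_over, err_threshold=0.3):
    """Per-rollout object tracking error and success flag.

    obj_pts, ref_pts: (T, N, 3) arrays of sampled object points over an episode.
    fell_over: True if the robot violated a stability-related termination condition.
    """
    per_step_err = np.linalg.norm(obj_pts - ref_pts, axis=-1).sum(axis=-1)  # (T,)
    e_o = float(per_step_err.mean())
    success = (e_o < err_threshold) and not fell_over
    return e_o, success

# Success rate over a batch of rollouts:
# sr = 100.0 * np.mean([rollout_metrics(o, r, f)[1] for o, r, f in rollouts])
```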
5.3. Baselines
To validate the effectiveness and efficiency of ResMimic, the authors compare it against three representative and strong baselines:
- Base Policy:
  - Description: This baseline directly deploys the pre-trained GMT policy (from Stage I) to follow the human reference motion.
  - Limitation: It has no access to object information and is not explicitly trained for object interaction. It serves as a benchmark showing the inherent limitation of GMT without specific adaptation for loco-manipulation.
- Train from Scratch:
  - Description: This baseline trains a single-stage Reinforcement Learning (RL) policy completely from scratch, attempting to track both human motion and object trajectories simultaneously.
  - Training Setup: For fairness, it uses the same reward terms (motion, object, and contact rewards) and domain randomization as ResMimic across all tasks, without any task-specific tuning.
  - Purpose: This baseline assesses the difficulty of learning complex loco-manipulation tasks without the benefit of a pre-trained GMT policy or residual learning.
- Finetune (Base Policy + Fine-tune):
  - Description: This baseline takes the pre-trained GMT policy and fine-tunes it to track both human motion and object trajectories.
  - Training Setup: The reward terms are identical to those used in ResMimic.
  - Limitation: Crucially, because the GMT policy architecture is designed for human motion inputs only, the fine-tuned policy cannot incorporate explicit object information as input and must infer object interaction solely from the rewards. This baseline evaluates whether simply adapting the GMT weights with object-related rewards is sufficient, even without explicit object state in the policy's observation.
5.4. Tasks
The authors design four challenging whole-body loco-manipulation tasks to stress different aspects of humanoid control and generalization. The human-object interaction reference motions for these tasks are collected using an OptiTrack motion capture system.
- Kneel (kneel on one knee and lift a box): Requires expressive, large-amplitude motion, precise lower-body coordination, and maintaining balance while lifting.
- Carry (carry a box onto the back): Demands whole-body expressiveness and maintaining balance under a shifting load distribution as the box is moved onto the robot's back.
- Squat (squat and lift a box with arms and torso): Highlights whole-body contact-rich manipulation, requiring the coordinated use of arms and torso for lifting, not just the hands.
- Chair (lift up a chair): Involves manipulating a heavy, irregularly shaped object, testing the policy's ability to generalize beyond simple box geometries.
6. Results & Analysis
The experiments evaluate ResMimic against baselines in both sim-to-sim transfer (from IsaacGym to MuJoCo) and real-world deployment.
6.1. Core Results Analysis
6.1.1. Q1: Can a general motion tracking (GMT) policy, without task-specific retraining, accomplish diverse loco-manipulation tasks?
Answer: No, but it provides a strong initialization.
As shown in Table I, the Base Policy (which is the pre-trained GMT policy deployed directly without object information) achieves a very low Task Success Rate (SR) of only 10% on average across the four tasks. While it shows relatively low Joint Tracking Error (Ej) (0.89 on average), indicating it can follow human-like joint movements, its Object Tracking Error (Eo) is high (0.61 on average), and it struggles significantly with Motion Tracking Error (Em) (9.22 on average), meaning the robot's overall motion deviates from the reference in a way that impacts task completion. For tasks like Kneel and Carry, its SR is 0%, meaning it fails completely. This confirms that while GMT provides good joint-level precision for general motions, it is insufficient for loco-manipulation tasks that require object awareness and precise interaction.
6.1.2. Q2: Does initializing from a pre-trained GMT policy improve training efficiency and final performance compared to training from scratch?
Answer: Yes, significantly.
Table I clearly demonstrates that Train from Scratch policies fail entirely, achieving a 0% SR across all tasks in MuJoCo. They also require a significantly higher number of Training Iterations (Iter.) (4500 on average for tasks they attempt to learn, though still fail) compared to ResMimic. Figure 5 further illustrates this: policies trained from scratch might show partial success in IsaacGym (a simpler simulation environment) but collapse entirely under sim-to-sim transfer to MuJoCo (a more realistic physics engine). In contrast, ResMimic, which uses the GMT policy as a base, achieves a 92.5% average SR with much fewer Iterations (1300 on average) and low errors across all metrics. This highlights the necessity of using GMT as a foundation, as its large-scale pre-training imbues generalization and robustness that are critical for overcoming sim-to-sim gaps.
6.1.3. Q3: When adapting GMT policies to loco-manipulation tasks, is residual learning more effective than fine-tuning?
Answer: Yes, residual learning significantly outperforms direct fine-tuning.
Table I shows that the Finetune baseline achieves an average SR of only 7.5%, which is marginally better than Train from Scratch but still far from ResMimic's 92.5%. While Finetune does require fewer Iterations (2400 on average) than Train from Scratch, it does not approach ResMimic's efficiency or performance.
The key limitation of Finetune is its inability to incorporate explicit object observations as input, because the original GMT policy architecture is designed only for human motion inputs. Although object-tracking rewards provide some supervision, the lack of explicit object state prevents the policy from learning robust object interaction behaviors, especially under randomized object poses. Moreover, fine-tuning tends to overwrite the generalization capability of the base GMT policy, leading to instability across tasks. Figure 5 illustrates this again: fine-tuned policies might succeed in IsaacGym but fail to transfer to MuJoCo. This underscores the superiority of residual learning (where a separate residual policy explicitly takes object state as input) as a more generalizable and extensible adaptation strategy.
6.1.4. Q4: Beyond simulation, can ResMimic achieve precise, expressive, and robust control in the real-world?
Answer: Yes, ResMimic demonstrates strong real-world capabilities.
As shown in Figure 1, ResMimic is deployed successfully on a Unitree G1 humanoid.
- Expressive Carrying Motions: The robot can kneel on one knee to pick up a box and carry the box on its back, showcasing expressive whole-body movements.
- Humanoid-Object Interaction Beyond Manipulation: The robot can sit down on a chair and then stand up, maintaining balance and contact with the environment, demonstrating interaction with static environments.
- Heavy Payload Carrying: ResMimic enables the robot to successfully carry a 4.5 kg box, despite the G1's wrist payload limit of around 2.5 kg. This highlights the necessity and success of leveraging whole-body contact (e.g., using the torso and arms) for such tasks.
- Generalization to Irregular Heavy Objects: The robot is able to lift and carry chairs weighing 4.5 kg and 5.5 kg, demonstrating instance-level generalization to novel, non-box geometries.

Qualitative comparisons in Figure 6 show that while the Base Policy can superficially mimic human motion, it lacks object awareness, and Train from Scratch and Finetune fail entirely in the real world due to the sim-to-real gap. ResMimic supports both blind deployment (without active object state input), used for most of the Figure 1 results, and non-blind deployment (with MoCap-based object state input), shown in Figure 4, where it demonstrates manipulation from random initial poses, consecutive loco-manipulation tasks, and reactive behavior to external perturbations.
Illustration: the humanoid performing whole-body motions under the ResMimic framework across tasks labeled (a) to (f), such as picking up boxes, interacting with objects, and sitting down, demonstrating the robot's precise control and object awareness.
Illustration: three action scenarios within the ResMimic framework: (a) transporting goods; (b) performing complex object interactions; (c) operating under external perturbations, highlighting the dynamic performance and adaptability of the humanoid in varied contexts.
Fig. 6: Real-world qualitative results comparing ResMimic against all other baselines (Train from Scratch, Base Policy, and Base Policy + Finetune) on box carrying; ResMimic shows superior control precision and object interaction.
The above figures illustrate the real-world performance of ResMimic and its qualitative superiority over baselines.
6.2. Data Presentation (Tables)
The following are the results from Table I of the original paper:
| Method | Task | SR ↑ | Iter. ↓ | Eo ↓ | Em ↓ | Ej ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| Base Policy | Kneel | 0% | − | 0.76 ± 0.01 | 3.30 ± 0.53 | 0.28 ± 0.01 |
| Base Policy | Carry | 0% | − | 0.29 ± 0.02 | 2.47 ± 0.26 | 1.19 ± 0.30 |
| Base Policy | Squat | 40% | − | 0.19 ± 0.01 | 0.93 ± 0.07 | 0.90 ± 0.08 |
| Base Policy | Chair | 0% | − | 1.19 ± 0.48 | 30.18 ± 33.45 | 1.20 ± 0.23 |
| Base Policy | Mean | 10% | − | 0.61 | 9.22 | 0.89 |
| Train from Scratch | Kneel | 0% | × | 0.69 ± 0.00 | 5.20 ± 0.62 | 3.41 ± 0.07 |
| Train from Scratch | Carry | 0% | 6500 | 0.70 ± 0.03 | 5.39 ± 0.38 | 2.33 ± 0.06 |
| Train from Scratch | Squat | 0% | 5000 | 0.68 ± 0.05 | 7.56 ± 2.31 | 4.28 ± 0.70 |
| Train from Scratch | Chair | 0% | 2000 | 0.97 ± 0.08 | 10.01 ± 1.28 | 13.36 ± 0.92 |
| Train from Scratch | Mean | 0% | 4500 | 0.76 | 7.04 | 5.84 |
| Finetune | Kneel | 0% | × | 0.87 ± 0.01 | 5.92 ± 0.81 | 3.02 ± 0.18 |
| Finetune | Carry | 30% | 4500 | 0.33 ± 0.01 | 2.49 ± 0.18 | 2.39 ± 0.06 |
| Finetune | Squat | 0% | 2000 | 0.47 ± 0.05 | 5.07 ± 1.06 | 2.53 ± 0.13 |
| Finetune | Chair | 0% | 700 | 0.15 ± 0.01 | 0.28 ± 0.05 | 1.26 ± 0.09 |
| Finetune | Mean | 7.5% | 2400 | 0.46 | 3.44 | 2.30 |
| ResMimic (Ours) | Kneel | 90% | 2000 | 0.14 ± 0.00 | 0.23 ± 0.06 | 2.17 ± 0.06 |
| ResMimic (Ours) | Carry | 100% | 1000 | 0.11 ± 0.00 | 0.08 ± 0.00 | 1.24 ± 0.03 |
| ResMimic (Ours) | Squat | 80% | 1500 | 0.07 ± 0.01 | 0.07 ± 0.03 | 1.18 ± 0.03 |
| ResMimic (Ours) | Chair | 100% | 700 | 0.16 ± 0.01 | 0.13 ± 0.02 | 0.55 ± 0.01 |
| ResMimic (Ours) | Mean | 92.5% | 1300 | 0.12 | 0.13 | 1.29 |
Table I Analysis:

- Success Rate (SR ↑): ResMimic drastically outperforms all baselines, achieving a mean SR of 92.5%. The Base Policy, Train from Scratch, and Finetune baselines show very low or zero success rates, indicating their inability to reliably complete the loco-manipulation tasks.
- Training Iterations (Iter. ↓): ResMimic converges much faster, with a mean of 1300 iterations, compared to Train from Scratch (4500) and Finetune (2400), highlighting its superior training efficiency. The Base Policy has no task-specific training iterations, as it is deployed directly.
- Object Tracking Error (Eo ↓): ResMimic achieves a significantly lower Eo (mean 0.12) than all baselines (0.61 for Base Policy, 0.76 for Train from Scratch, 0.46 for Finetune), demonstrating precise object interaction.
- Motion Tracking Error (Em ↓): ResMimic also shows the lowest Em (mean 0.13), indicating high fidelity in whole-body motion tracking while performing object tasks. The Base Policy has a high Em, suggesting that although it tracks individual joints, its overall body movement is not suited to object interaction.
- Joint Tracking Error (Ej ↓): ResMimic achieves a competitive Ej (mean 1.29), showing that its refinements do not significantly degrade low-level joint tracking precision while achieving high-level task success.

In summary, ResMimic demonstrates substantial gains in task success, training efficiency, and robustness across all evaluated metrics and tasks in simulation. The baselines either fail to address the core problem (Base Policy) or are unable to learn effective and generalizable solutions (Train from Scratch, Finetune).
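To make the Eo metric and the point-cloud-based object tracking reward more concrete, here is a minimal sketch of one plausible formulation: the error is the mean distance between object surface points posed at the current vs. the reference object pose, and an exponential shaping turns it into a bounded reward. The surface-point sampling, the exponential form, and the `scale` constant are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def transform_points(points, rotation, translation):
    """Apply a rigid transform (3x3 rotation matrix, 3-vector translation) to Nx3 points."""
    return points @ rotation.T + translation

def object_tracking_error(surface_points, rot_cur, pos_cur, rot_ref, pos_ref):
    """Mean distance between the object's surface points posed at the current vs.
    the reference pose: a single scalar that reflects both position and orientation error."""
    cur = transform_points(surface_points, rot_cur, pos_cur)
    ref = transform_points(surface_points, rot_ref, pos_ref)
    return float(np.linalg.norm(cur - ref, axis=-1).mean())

def object_tracking_reward(error, scale=5.0):
    """Assumed exponential shaping: smooth, bounded reward in (0, 1]."""
    return float(np.exp(-scale * error))

# Example with a unit cube's corner points as the sampled surface points.
pts = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)], dtype=float)
err = object_tracking_error(pts, np.eye(3), np.zeros(3), np.eye(3), np.array([0.0, 0.0, 0.05]))
print(err, object_tracking_reward(err))  # 0.05 and exp(-0.25)
```

Because the same set of surface points encodes both translation and rotation, a single distance term can replace separately tuned position and orientation rewards, which is the appeal of the point-cloud formulation.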
The image is a schematic comparing the ResMimic framework with the other baseline methods in terms of training processes and outcomes in the IsaacGym and MuJoCo environments. The upper part shows the task of lifting an object while walking, the lower part the task of sitting down on a chair, each accompanied by the corresponding training curves.
Figure 5 visually supports the quantitative results from Table I. It compares training curves (object tracking error Eo) and qualitative performance between IsaacGym and MuJoCo.
- Sim-to-Sim Transfer: Train from Scratch and Finetune policies often show some success in IsaacGym but exhibit significantly degraded performance or complete failure when transferred to MuJoCo, which serves as a better proxy for real-world physics. This underscores the difficulty of sim-to-sim transfer for these baselines.
- ResMimic's Robustness: In contrast, ResMimic maintains strong performance with minimal degradation from IsaacGym to MuJoCo, highlighting its superior robustness and generalization.
6.3. Ablation Studies
6.3.1. Effect of the Virtual Object Controller
The paper conducts an ablation study on the virtual object controller, which is designed to stabilize early-stage training.
- Problem Addressed: Without the virtual object controller, the policy often struggles when reference motions contain imperfections (e.g., penetrations between the humanoid's hand and the object) or when objects are heavy. These imperfections can cause the robot to knock over the object while trying to imitate the motion, leading to low object rewards and frequent early terminations. This traps the policy in a local minimum where it learns to retreat from the object rather than engage with it effectively.
- Benefit of the Controller: As illustrated in Figure 7, with the virtual force curriculum the object remains stabilized during early learning, allowing the policy to overcome motion-data imperfections and converge to precise manipulation strategies. The controlled assistance prevents early failures, giving the residual policy a chance to learn how to interact with the object without destabilizing it.
The image is a schematic comparing the motion performance of the ResMimic framework with and without the virtual object controller. The upper part shows the case without the virtual object controller, the lower part the case with it, illustrating the change in posture and the gain in precision during object interaction.
Fig. 7: Ablation on virtual object controller.
Figure 7 visually contrasts the behavior with and without the virtual object controller. The upper sequence (without controller) shows the robot repeatedly struggling to interact with the object, often knocking it over. The lower sequence (with controller) demonstrates a more stable and successful interaction, indicating that the controller helps the policy learn reliable manipulation.
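The sketch below illustrates one plausible form of such a curriculum-based virtual object controller: a PD-style stabilizing force that pulls the simulated object toward its reference trajectory and is annealed to zero over training. The PD structure, the gains `kp`/`kd`, and the linear decay schedule are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def virtual_object_force(obj_pos, obj_vel, ref_pos, ref_vel,
                         step, curriculum_steps, kp=200.0, kd=10.0):
    """PD-style virtual force that pulls the simulated object toward its
    reference trajectory. The assistance is scaled down linearly over the
    curriculum, so it vanishes once the residual policy can handle the
    object without help."""
    assist = max(0.0, 1.0 - step / curriculum_steps)  # 1 -> 0 over the curriculum
    return assist * (kp * (ref_pos - obj_pos) + kd * (ref_vel - obj_vel))

# Early in training the object receives full stabilizing assistance ...
print(virtual_object_force(np.zeros(3), np.zeros(3),
                           np.array([0.0, 0.0, 0.1]), np.zeros(3),
                           step=0, curriculum_steps=2000))
# ... and none once the curriculum has finished.
print(virtual_object_force(np.zeros(3), np.zeros(3),
                           np.array([0.0, 0.0, 0.1]), np.zeros(3),
                           step=2000, curriculum_steps=2000))
```

The key design idea is that the assistance only needs to be strong enough to prevent the object from toppling during the policy's clumsy early attempts; once the assistance has decayed, the reward is earned entirely by genuine contact-driven manipulation.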
6.3.2. Effect of the Contact Reward
An ablation study on the contact reward investigates its role in encouraging appropriate whole-body interaction strategies.
- Two Strategies for Lifting a Box: The authors identify two possible ways for the robot to lift a box:
  1. Relying primarily on the wrists and hands.
  2. Engaging both torso and arm contact, as humans typically do.
- Result without Contact Reward (NCR): Without the contact reward, the policy tends to converge to strategy (1), using only the wrists and hands. While this can occasionally succeed in IsaacGym, it often fails to transfer to MuJoCo and the real world due to instability or insufficient force.
- Result with Contact Reward (CR): With the contact reward, the humanoid learns to adopt strategy (2), using coordinated torso and arm contact. This strategy, which aligns with human demonstrations, leads to improved sim-to-sim and sim-to-real transfer; the contact reward explicitly guides the policy toward a more robust, human-like interaction.
The image is a chart showing the effect of different reward settings on humanoid manipulation (NCR: no contact reward; CR: with contact reward). The curves at the bottom quantify the contact force over time, reflecting task success and robot-object interaction under the two reward strategies.
Fig. 8: Ablation on contact reward. Here NCR denotes "No Contact Reward", and CR denotes "with Contact Reward". Corresponding curves (bottom) quantify torso contact force.
Figure 8 provides a visual and quantitative comparison. The top image sequences show NCR (No Contact Reward) leading to manipulation primarily with hands, while CR (with Contact Reward) results in the robot engaging its torso and arms for lifting. The bottom curves quantify the torso contact force. The CR curve shows significantly higher and sustained torso contact force, confirming that the contact reward successfully encourages the desired whole-body contact strategy. This validation highlights the importance of explicit rewards for guiding complex physical interactions.
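Below is a minimal sketch of a contact reward in this spirit: it scores how well the simulated contact state of selected body links (e.g., torso and forearms) matches contact flags derived from the reference human-object interaction. The link set, the force threshold, and the binary-matching form are illustrative assumptions rather than the paper's exact reward.

```python
import numpy as np

def contact_reward(contact_forces, ref_contact_mask, force_threshold=5.0):
    """contact_forces: (L, 3) net contact force on each tracked body link.
    ref_contact_mask: (L,) boolean flags from the reference motion saying which
    links should touch the object at this frame (e.g., torso, forearms, hands).
    The reward is the fraction of links whose simulated contact state matches
    the reference, which pushes the policy toward whole-body strategies such as
    hugging a box with the torso instead of lifting with the wrists alone."""
    in_contact = np.linalg.norm(contact_forces, axis=-1) > force_threshold
    return float((in_contact == ref_contact_mask).mean())

# Example: torso and both forearms should be in contact, hands free.
forces = np.array([[0, 0, 30.0], [0, 0, 12.0], [0, 0, 9.0], [0, 0, 0.0]])
mask = np.array([True, True, True, False])
print(contact_reward(forces, mask))  # 1.0: simulated contacts match the reference
```

A purely pose-based reward cannot distinguish the two lifting strategies when the resulting kinematics look similar, which is why an explicit contact term is needed to select the physically robust one.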
7. Conclusion & Reflections
7.1. Conclusion Summary
This work introduces ResMimic, a highly effective two-stage residual learning framework for humanoid whole-body loco-manipulation. The framework successfully combines a pre-trained, task-agnostic General Motion Tracking (GMT) policy, trained on large-scale human motion data, with a task-specific residual policy. This residual policy refines the GMT outputs to achieve precise object interaction and locomotion. Key innovations, such as the point-cloud-based object tracking reward, contact reward, and curriculum-based virtual object controller, significantly boost training efficiency, robustness, and the ability to handle complex physical interactions. Extensive evaluations in both simulation and on a real Unitree G1 humanoid demonstrate ResMimic's substantial improvements in task success, motion fidelity, training efficiency, and robustness over strong baselines. The paper firmly establishes the transformative potential of leveraging pre-trained policies within a residual learning paradigm for advanced humanoid control.
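As a concrete illustration of the two-stage design, the sketch below shows one plausible way the residual correction could be combined with the frozen GMT policy's output before being sent to the robot's joint PD controllers. The additive composition, the clipping bound, and all names are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def compose_action(gmt_action, residual_action, residual_limit=0.2):
    """Stage 1: a frozen, task-agnostic GMT policy proposes human-like joint targets.
    Stage 2: a task-specific residual policy adds a small, bounded correction that
    accounts for the object and contact forces the GMT policy never observed."""
    correction = np.clip(residual_action, -residual_limit, residual_limit)
    return gmt_action + correction  # joint position targets for the PD controller

# Per control step (usage sketch; the policy names are hypothetical):
# gmt_a = gmt_policy(proprio, motion_goal)                 # frozen base policy
# res_a = residual_policy(proprio, motion_goal, obj_state) # trained per task
# target_q = compose_action(gmt_a, res_a)
```

Bounding the residual keeps the final behavior close to the human-like prior while still allowing enough correction for precise object interaction.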
7.2. Limitations & Future Work
The paper does not include an explicit "Limitations" section, but several limitations can be inferred:
- Dependence on High-Quality MoCap Data: Both the GMT base policy and the residual policy rely heavily on motion capture (MoCap) data of human motions and human-object interactions. The quality, diversity, and realism of this data directly affect performance and generalization, and acquiring such data, especially for complex or rare tasks, can be expensive and time-consuming.
- Embodiment Gap Persistence: While ResMimic addresses the embodiment gap through residual learning and dedicated rewards, the initial kinematic retargeting of human motions to the robot (e.g., via GMR [39]) remains non-trivial and can introduce imperfections that the residual policy must then correct.
- Task-Specificity of the Residual Policy: Although the GMT policy is task-agnostic, the residual policy is still trained per task. While far more efficient than training from scratch, adapting to entirely new tasks may still require new human-object interaction MoCap data and retraining of the residual policy.
- Complexity of Reward Engineering: Although the paper avoids per-task reward engineering for the residual policy by reusing the motion tracking rewards and introducing the object tracking and contact rewards, tuning the weights of these reward terms and the gains of the virtual object controller can still be non-trivial.

Potential future research directions implied by this work include:
- Learning More General Residual Policies: Exploring methods to learn a single residual policy that generalizes across multiple loco-manipulation tasks or objects, potentially through more advanced goal-conditioning or meta-learning techniques.
- Reducing Reliance on Dense MoCap: Investigating ways to generate reference motions or learn object interactions that require less explicit and dense MoCap data, for instance through vision-based imitation or language-based task specification.
- Extending to Unseen Objects and Environments: Enhancing the framework's ability to handle novel objects with varying dynamics and geometries, and to operate in more complex, unstructured, or dynamic environments beyond the demonstrated tasks.
- Combining with Model-Based Control: Further integrating ResMimic with model-predictive control (MPC) or other model-based techniques for greater robustness and optimality, particularly in highly dynamic or safety-critical tasks.
7.3. Personal Insights & Critique
ResMimic offers several compelling insights into the future of humanoid robotics. The most significant is the power of the pretrain-then-refine paradigm. By leveraging a robust GMT prior, the paper effectively sidesteps the immense challenge of learning complex whole-body dynamics and human-like expressiveness from scratch for every new task. This modularity is a critical step towards scaling humanoid capabilities.
The design choices for the residual policy are particularly clever:
- The point-cloud-based object tracking reward is an elegant way to simplify reward engineering for object pose, implicitly handling both translation and rotation while providing a smoother learning signal.
- The virtual object controller curriculum directly addresses a common failure mode of imitation learning under real-world complexities (noisy data, heavy objects), keeping training stable and productive rather than letting it converge to undesirable local minima.
- The contact reward is crucial for bridging the sim-to-real gap by explicitly encouraging physically meaningful interactions, which might otherwise be overlooked by purely kinematic or pose-based rewards.

A potential critique lies in the ongoing reliance on precise MoCap data for both the GMT policy and the task-specific residual policy. While far more efficient than training from scratch, collecting, curating, and retargeting human-object interaction data still represents significant overhead for new tasks. The phrase "task-agnostic base for generating human-like whole-body movements" is true for the GMT policy, but the residual policy itself remains per-task. The ultimate goal for general humanoid control would be a system that needs minimal to no new demonstrations for novel tasks, perhaps by synthesizing new reference motions or learning task-agnostic object interaction skills.
Despite these points, ResMimic provides a strong foundation and valuable blueprint. Its methods could be transferable to other domains requiring precise, expressive control built on general priors, such as dexterous manipulation with multi-fingered hands or even animal-like robots learning complex behaviors from observed natural movements. The paper's emphasis on sim-to-sim validation in MuJoCo as a proxy for real-world performance is also a robust methodology that strengthens confidence in its real-world deployability.