PhysHSI: Towards a Real-World Generalizable and Natural Humanoid-Scene Interaction System
TL;DR Summary
The paper presents PhysHSI, a humanoid-scene interaction system that enables robots to perform diverse tasks in real-world environments. It combines adversarial motion prior-based policy learning with a coarse-to-fine object localization module, achieving natural behaviors and robust real-world performance.
Abstract
Deploying humanoid robots to interact with real-world environments--such as carrying objects or sitting on chairs--requires generalizable, lifelike motions and robust scene perception. Although prior approaches have advanced each capability individually, combining them in a unified system is still an ongoing challenge. In this work, we present a physical-world humanoid-scene interaction system, PhysHSI, that enables humanoids to autonomously perform diverse interaction tasks while maintaining natural and lifelike behaviors. PhysHSI comprises a simulation training pipeline and a real-world deployment system. In simulation, we adopt adversarial motion prior-based policy learning to imitate natural humanoid-scene interaction data across diverse scenarios, achieving both generalization and lifelike behaviors. For real-world deployment, we introduce a coarse-to-fine object localization module that combines LiDAR and camera inputs to provide continuous and robust scene perception. We validate PhysHSI on four representative interactive tasks--box carrying, sitting, lying, and standing up--in both simulation and real-world settings, demonstrating consistently high success rates, strong generalization across diverse task goals, and natural motion patterns.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is the development of a real-world system for humanoid-scene interaction that achieves generalizable and natural motions. The system is named PhysHSI, which stands for "Physical-world Humanoid-Scene Interaction."
1.2. Authors
The authors are: Huayi Wang, Wentao Zhang, Runyi Yu, Tao Huang, Junli Ren, Feiyu Jia, Zirui Wang, Xiaojie Niu, Xiao Chen, Jiahe Chen, Qifeng Chen, Jingbo Wang, and Jiangmiao Pang.
- Affiliations: Most authors are affiliated with the first listed institution (implied to be Shanghai Artificial Intelligence Laboratory, based on the acknowledgements). Runyi Yu and Qifeng Chen are also affiliated with a second institution (implied to be Shanghai Jiao Tong University, also based on the acknowledgements).
- Research Backgrounds: The authorship indicates a strong focus on robotics, reinforcement learning, computer vision, and physical simulation, particularly in the context of embodied AI and humanoid robot control.
1.3. Journal/Conference
This paper is published on arXiv, a preprint server for electronic preprints of scientific papers. It is not yet formally published in a peer-reviewed journal or conference proceedings, but arXiv is widely used in the machine learning and robotics communities for early dissemination of research.
1.4. Publication Year
The paper was published on October 13, 2025.
1.5. Abstract
Deploying humanoid robots for real-world interactions, such as carrying objects or sitting, demands generalizable, lifelike movements and robust scene perception. While individual advancements have been made in these areas, their integration into a unified system remains challenging. This work introduces PhysHSI, a physical-world humanoid-scene interaction system that enables autonomous execution of diverse interaction tasks with natural and lifelike behaviors. PhysHSI consists of a simulation training pipeline and a real-world deployment system. The simulation pipeline employs adversarial motion prior (AMP)-based policy learning to imitate natural humanoid-scene interaction data across various scenarios, promoting generalization and lifelike behaviors. For real-world deployment, a coarse-to-fine object localization module is developed, combining LiDAR and camera inputs for continuous and robust scene perception. PhysHSI is validated on four interaction tasks—box carrying, sitting, lying, and standing up—in both simulated and real environments, demonstrating high success rates, strong generalization across task goals, and natural motion patterns.
1.6. Original Source Link
- Official Source: https://arxiv.org/abs/2510.11072
- PDF Link: https://arxiv.org/pdf/2510.11072v1.pdf
- Publication Status: This paper is a preprint available on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is enabling humanoid robots to perform Humanoid-Scene Interaction (HSI) tasks in real-world environments in a generalizable, lifelike, and autonomous manner. Such tasks include complex actions like carrying objects or sitting on chairs.
This problem is crucial because HSI is a fundamental capability for deploying humanoid robots in everyday settings, moving beyond basic locomotion to truly interactive behaviors. However, several significant challenges persist in prior research:
- Limited Generalization: Classical model-based methods (e.g., motion planning, trajectory optimization) are computationally expensive and rely on strong model assumptions, limiting their ability to generalize to diverse real-world scenarios.
- Lack of Lifelike Motions: While Reinforcement Learning (RL)-based methods offer broader generalization by training from diverse simulated experiences, learning policies from scratch often requires extensive and difficult reward shaping and state transition design to produce natural, human-like motions.
- Robust Scene Perception: Real-world environments are complex. Existing perception modules often struggle with limited fields of view, occlusions, and maintaining continuous, accurate object localization over long-horizon tasks.
- Sim-to-Real Gap: Many advanced motion imitation techniques, particularly those leveraging Adversarial Motion Priors (AMP), have largely been confined to simulation and rely on perfect scene observations, leaving sim-to-real transfer as a significant unexplored obstacle.

The paper's innovative idea and entry point is to bridge these gaps by combining advanced motion imitation techniques in simulation with a robust real-world perception system to create a unified framework for HSI. This aims to deliver policies that are both generalizable and natural, and that can be reliably deployed on physical robots.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- An AMP-based Training Pipeline: Introduction of a simulation training pipeline that leverages Adversarial Motion Priors (AMP) to learn from humanoid interaction data. This pipeline enables the generation of natural, lifelike, and generalizable motions for diverse HSI tasks.
- Coarse-to-Fine Real-World Object Localization Module: Development of a robust coarse-to-fine object localization module designed for real-world deployment. This module integrates LiDAR and camera inputs to provide continuous and reliable scene perception, crucial for interaction tasks where objects might be far, occluded, or dynamic.
- Comprehensive Evaluation Protocols: Establishment of thorough evaluation protocols that analyze the system and its individual components in both simulation and real-world environments, providing insights and guidance for future research and development in real-world HSI tasks.

The key conclusions and findings reached by the paper include:

- High Success Rates and Generalization: PhysHSI achieves consistently high success rates across complex, long-horizon HSI tasks (e.g., box carrying, sitting, lying, standing up) and demonstrates strong generalization across diverse scenarios and task goals, outperforming reward-based and tracking-based baselines, especially in full-distribution scenes.
- Natural and Lifelike Motion Patterns: The AMP-based training effectively produces natural and expressive motions, exhibiting human-like behaviors and even stylized locomotion, a significant improvement over RL-reward methods that struggle with motion realism.
- Robust Real-World Deployment: The coarse-to-fine perception system provides accurate and continuous object localization, enabling zero-shot transfer of learned policies to a physical Unitree G1 humanoid robot. The system can autonomously complete tasks in various indoor and outdoor environments using only onboard sensors.
- Portability: The system's ability to operate outdoors with onboard sensing and computation highlights its portability and reduced reliance on external infrastructure such as MoCap systems.
3. Prerequisite Knowledge & Related Work
This section provides the foundational knowledge and context necessary to understand the PhysHSI paper.
3.1. Foundational Concepts
Humanoid-Scene Interaction (HSI)
Humanoid-Scene Interaction (HSI) refers to tasks where a humanoid robot physically interacts with objects and elements within its environment. This goes beyond simple locomotion to involve manipulation, contact, and adaptive behaviors based on the scene. Examples include picking up and carrying objects, sitting on chairs, lying on beds, opening doors, or using tools. The complexity arises from the need for coordinated whole-body movements, maintaining balance, adapting to varying object properties and scene layouts, and robustly perceiving the environment.
Reinforcement Learning (RL)
Reinforcement Learning (RL) is a paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward.
- Agent: The learner or decision-maker (in this paper, the humanoid robot's control policy).
- Environment: The external system with which the agent interacts (the simulated or real world with objects and tasks).
- State: A complete description of the environment at a given time (e.g., robot's joint angles, velocities, base pose, object positions).
- Action: A decision made by the agent that changes the state of the environment (e.g., target joint positions for the robot).
- Reward: A scalar feedback signal from the environment indicating how good or bad the agent's last action was (e.g., positive for reaching a goal, negative for falling).
- Policy (π): A function that maps states to actions, determining the agent's behavior. The goal of RL is to learn an optimal policy (a minimal interaction-loop sketch follows below).
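To make the agent-environment loop concrete, here is a minimal, illustrative Python sketch of a rollout that accumulates discounted reward. The `env` and `policy` names and the Gym-style interface are assumptions for illustration, not the paper's code.

```python
# Illustrative sketch only: a generic RL interaction loop with a Gym-style
# environment and a policy callable.
def rollout(env, policy, horizon=1000, gamma=0.99):
    """Run one episode and return its discounted return."""
    state = env.reset()
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        action = policy(state)                        # policy: state -> action
        state, reward, done, info = env.step(action)  # environment transition + reward
        total += discount * reward                    # accumulate discounted reward
        discount *= gamma
        if done:                                      # episode ends (success, fall, timeout)
            break
    return total
```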
Motion Capture (MoCap)
Motion Capture (MoCap) is the process of digitally recording the movement of people or objects. In the context of robotics and animation, MoCap data typically consists of sequences of joint angles and/or marker positions over time, capturing natural human movements. This data can then be used to drive virtual characters or, as in this paper, retargeted onto robotic platforms to teach them human-like behaviors. While MoCap provides highly accurate ground truth motion, it usually requires specialized equipment in controlled laboratory settings, limiting its real-world applicability for direct robot control.
Adversarial Motion Priors (AMP)
Adversarial Motion Priors (AMP) is a technique in RL that combines generative adversarial networks (GANs) with motion imitation. Instead of explicitly designing reward functions to make motions look natural, AMP uses a discriminator network.
- Generator (Policy): This is the RL agent (the robot's policy) that tries to generate natural-looking motions.
- Discriminator (D): This network is trained to distinguish between motion clips generated by the policy and real human motion clips from a reference motion dataset (the prior).
- Adversarial Training: The policy tries to "fool" the discriminator into thinking its generated motions are real (by maximizing a style reward derived from the discriminator), while the discriminator tries to accurately distinguish between real and fake motions. This adversarial process implicitly guides the policy to produce motions that are stylistically similar to the human demonstrations, leading to natural and lifelike behaviors without explicit reward shaping for motion quality.
Sim-to-Real Transfer
Sim-to-Real transfer refers to the process of training a robot control policy in a simulated environment and then deploying it on a real physical robot. This is highly desirable because training in simulation is much safer, faster, and cheaper than training on real hardware. However, a significant challenge is the sim-to-real gap: the differences between the simulated world and the real world (e.g., accurate physics models, sensor noise, latency, hardware imperfections). Policies trained purely in simulation often perform poorly when transferred to reality. Techniques like domain randomization are used to make the simulated environment diverse enough to cover real-world variations, thereby improving transferability.
LiDAR-Inertial Odometry (LIO)
LiDAR-Inertial Odometry (LIO) is a method for estimating the motion (odometry) of a robot using data from a LiDAR sensor and an Inertial Measurement Unit (IMU).
- LiDAR (Light Detection and Ranging): A sensor that measures distances to objects by emitting pulsed laser light and measuring the time it takes for the reflected light to return. It creates a 3D point cloud of the environment.
- IMU (Inertial Measurement Unit): A sensor that measures angular rate and linear acceleration, providing information about the robot's orientation and motion dynamics.
- Odometry: The estimation of the robot's change in position and orientation over time.

LIO combines the precise distance measurements from LiDAR (for mapping and matching features in the environment) with the high-frequency, robust motion data from the IMU to provide accurate, drift-corrected ego-motion estimation, which is especially useful for long-range navigation.
AprilTag
AprilTag is a visual fiducial system, similar to QR codes, used for various robotics and augmented reality applications. AprilTags are 2D barcodes that can be easily detected and identified by a camera. When an AprilTag is observed, its precise 3D pose (position and orientation) relative to the camera can be estimated using computer vision algorithms. They are robust to lighting changes and partial occlusions, making them suitable for accurate object localization at close range, as used in this paper for fine-grained perception.
PD Controller
A Proportional-Derivative (PD) controller is a widely used feedback control loop mechanism. In robotics, it is often used to control joint positions.
- Given a target joint position (setpoint) and the current joint position (process variable), the controller calculates an error signal.
- The proportional (P) term generates a control output proportional to the current error. A larger error leads to a larger corrective force.
- The derivative (D) term generates a control output proportional to the rate of change of the error. This helps to dampen oscillations and improve stability by anticipating future errors.
- The PD controller outputs a torque or force command to the joint motors to move the joint towards its target position, precisely executing the actions decided by the RL policy (a minimal sketch follows below).
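The following minimal sketch shows how such a PD law turns target joint positions into torques. The gain values and array sizes are illustrative assumptions, not values from the paper.

```python
# Minimal PD joint-control sketch; gains and dimensions are assumptions.
import numpy as np

def pd_torque(q_target, q, q_dot, kp=40.0, kd=1.0):
    """Torque that drives current joint positions q toward q_target.

    kp: proportional gain, reacts to the position error.
    kd: derivative gain, damps joint velocity and reduces oscillation.
    """
    position_error = q_target - q
    return kp * position_error - kd * q_dot  # target joint velocity is implicitly zero

# Example: a 29-DoF humanoid where the policy outputs target joint positions.
q_target = np.zeros(29)
q = 0.1 * np.random.randn(29)
q_dot = np.zeros(29)
tau = pd_torque(q_target, q, q_dot)
```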
3.2. Previous Works
The paper contextualizes its contributions by reviewing prior work across three main areas:
A. Humanoid-Scene Interactions
- Simulation-based HSI: Many works have explored HSI in physics-based simulations, achieving natural and long-horizon behaviors like object loco-manipulation [23, 26–28, 34]. However, these methods typically rely on idealized task observations, leading to large sim-to-real gaps when attempting real-world deployment.
- Classical Real-World Approaches: For real-world robots, classical methods often use model-based motion planning or trajectory optimization to generate whole-body references for tracking [9–13]. These approaches can generate stable motions but suffer from limited generalization, making them impractical for diverse real-world scenarios.
- RL-based Approaches: RL-based methods aim for stronger generalization by learning control policies from scratch [14, 16, 17]. However, they typically require meticulous reward shaping and state transition design, which is complex and often fails to produce genuinely natural motions, especially for long-horizon tasks.
- Motion Prior Guided RL: Some works have used curated motion priors to guide policy learning for tasks like stair climbing and chair sitting [35, 36], improving motion naturalness.
- PhysHSI's Position: PhysHSI builds upon the idea of learning from motion priors but extends it to enable generalizable and natural behaviors for more complex interactions, including box carrying and lying down, specifically addressing the sim-to-real gap that previous AMP works overlooked.
B. Humanoid Motion Imitation
- Physics-based Motion Tracking (Simulation): In simulation, physics-based methods have achieved expressive whole-body motions by imitating individual reference sequences [37–39] or learning universal tracking policies [40].
- Reference-Dependent Real-World Approaches: More recent works have adapted these methods for real-world robots [35, 7], but they remain highly reference-dependent and show limited generalization, which constrains their ability to interact flexibly with diverse scenes and objects.
- Adversarial Motion Priors (AMP) in Simulation: AMP [31] significantly improved generalization in simulation by imitating motion styles rather than specific trajectories, leading to progress in dynamic interactions [23, 27, 28].
- Limited Real-World AMP Applications: Real-world applications of AMP have been sparse, primarily using AMP to regularize tracking policies for basic locomotion skills [36, 41–43]; they typically do not extend to complex scene and object interactions.
- PhysHSI's Position: PhysHSI overcomes these limitations by integrating AMP into a comprehensive system that enables natural behaviors for diverse real-world scene and object interactions, moving beyond basic locomotion or simulation-only performance.
C. Scene Perception
- Motion Capture (MoCap) Systems: MoCap systems provide highly accurate global information, supporting dynamic interactive tasks [44–46]. However, their reliance on external infrastructure restricts them to laboratory environments with limited workspaces, making them unsuitable for real-world, untethered deployment.
- Onboard RGB-D Cameras: Many studies use onboard RGB and depth cameras for scene and object perception [15, 47–52]. The main drawback is that these approaches generally confine target objects to a local workspace and often lose sight of them during long-horizon loco-manipulation tasks.
- LiDAR-Inertial Odometry (LIO): Other studies employ LiDAR-Inertial Odometry (LIO) [53, 54] to obtain global information [35, 55–58]. While LIO provides robust ego-motion and environmental mapping, its interaction accuracy with objects, especially for fine-grained manipulation, remains limited compared to visual methods.
- PhysHSI's Position: PhysHSI proposes a coarse-to-fine object localization system that relies solely on onboard sensors (LiDAR and camera). This system provides continuous and robust scene perception by combining the strengths of LIO for long-range guidance with AprilTag detection for precise close-range localization, addressing the shortcomings of previous perception methods.
3.3. Technological Evolution
The evolution of humanoid robotics for complex interactions has progressed from highly controlled, model-based methods to more adaptive, learning-based approaches. Initially, focus was on stability and basic locomotion through motion planning or trajectory optimization. The introduction of reinforcement learning allowed for more generalized behaviors, but often at the cost of motion naturalness due to the difficulty of reward shaping. Motion imitation, especially with techniques like Adversarial Motion Priors (AMP), marked a significant step towards achieving human-like motions by learning from demonstrations. However, these advancements were largely confined to simulation, facing a substantial sim-to-real gap. Concurrently, robot perception evolved from external MoCap systems to onboard sensors like RGB-D cameras and LiDAR, each with its own strengths and weaknesses regarding range, accuracy, and robustness.
This paper's work, PhysHSI, fits within this timeline as a crucial step in integrating these advancements. It pushes the boundaries of AMP from simulation to real-world deployment, using a practical and robust onboard coarse-to-fine perception system. This represents an evolution towards truly autonomous, natural, and generalizable humanoid robots that can interact effectively in unstructured, real-world environments.
3.4. Differentiation Analysis
Compared to the main methods in related work, PhysHSI offers several core differences and innovations:
- Unified System for Generalization, Naturalness, and Robust Perception: Unlike prior works that often excelled in one area (e.g., generalization from RL, naturalness from AMP in simulation, or basic real-world locomotion), PhysHSI provides a holistic system that successfully combines all three, addressing an integration challenge that previously remained open.
- Real-World Deployment of Advanced Motion Imitation: A key innovation is the successful zero-shot transfer and real-world deployment of AMP-based policies for complex HSI tasks. Previous AMP applications on real robots were mostly limited to locomotion regularization, while PhysHSI enables dynamic interactions such as box carrying and lying down with natural behaviors.
- Novel Coarse-to-Fine Object Localization: The paper introduces a coarse-to-fine object localization module that uniquely combines LiDAR-based odometry (for long-range, coarse guidance) with camera-based AprilTag detection (for close-range, fine accuracy). This addresses the field-of-view and occlusion challenges faced by RGB-D cameras and improves interaction accuracy over pure LIO systems, providing continuous and robust scene perception using only onboard sensors.
- Hybrid Reference State Initialization (Hybrid RSI): PhysHSI enhances RL exploration efficiency and generalization by introducing a hybrid RSI strategy. This strategy samples initial states from motion data while randomizing scene parameters, and also includes episodes starting from default poses with fully randomized scenes, preventing overfitting to limited demonstrations, a common issue in motion imitation.
- Data Preparation with Object Annotation: Post-annotating object information on retargeted humanoid motions effectively creates a dataset suitable for learning humanoid-object interactions, addressing the difficulty of obtaining physically plausible humanoid-object interaction data.

In essence, PhysHSI distinguishes itself by moving AMP from simulation labs to real-world deployment for complex HSI tasks, supported by an intelligent and robust onboard perception system that overcomes typical sim-to-real and perception limitations.
4. Methodology
The PhysHSI system is designed to enable humanoid robots to autonomously perform diverse interaction tasks with natural and lifelike behaviors in real-world environments. It consists of two main components: a simulation training pipeline and a real-world deployment system.
4.1. Principles
The core idea behind PhysHSI is to leverage the strengths of adversarial motion priors (AMP) in simulation to learn highly generalized and natural humanoid motions for interacting with objects, and then to bridge the sim-to-real gap with a robust, onboard coarse-to-fine perception module. The theoretical basis for AMP is Generative Adversarial Networks (GANs), where a generator (the robot's policy) learns to produce data (robot motions) that are indistinguishable from real data (human demonstrations) by a discriminator. This implicit style reward encourages human-like behaviors. For real-world deployment, the intuition is that accurate object pose is critical, but a single sensor type often has limitations (e.g., camera: limited FoV, LiDAR: less precise for fine manipulation). Therefore, a hybrid approach combining LiDAR for long-range, coarse guidance and camera for short-range, fine accuracy is employed.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Simulation Training Pipeline
The simulation training pipeline focuses on learning high-quality humanoid-scene interaction (HSI) policies that are both generalizable and lifelike.
4.2.1.1. Data Preparation
The first step is to prepare humanoid motion data that includes object interactions. Generating physically plausible humanoid-object interaction data (e.g., grasping a box securely) is challenging. The authors adopt a post-annotation strategy:
- Retargeting SMPL Motions: SMPL motions from datasets like AMASS and SAMP [29, 30], which primarily contain human-only movements, are first retargeted onto the humanoid robot using an optimization process.
- Smoothing Filter: A smoothing filter is applied to suppress any retargeting jitter, resulting in a robot-motion-only dataset.
- Manual Object Annotation: Key contact frames (e.g., pickup and placement) are manually annotated.
- Rule-Based Object Trajectory Inference: Between the pickup frame and the placement frame, the object's position is set to the midpoint of the robot's hands, and its orientation is aligned with the robot's base. Before the pickup frame and after the placement frame, the object remains fixed at its respective key contact frame position. This process yields an augmented humanoid motion dataset with consistent and physically coherent object positions, which is crucial for stage conditioning and reference state initialization in the subsequent training.
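The following is a hedged sketch of this rule-based annotation step: between the annotated pickup and placement frames the object follows the midpoint of the hands and the base heading, and outside that interval it stays at its key-frame position. Array shapes and argument names are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of the rule-based object annotation described above.
import numpy as np

def annotate_object_trajectory(left_hand, right_hand, base_yaw,
                               t_pick, t_place, pos_at_pick, pos_at_place):
    """left_hand, right_hand: (T, 3) hand positions; base_yaw: (T,) base heading.
    Returns per-frame object positions (T, 3) and yaw angles (T,)."""
    T = left_hand.shape[0]
    obj_pos = np.zeros((T, 3))
    obj_yaw = np.zeros(T)
    for t in range(T):
        if t < t_pick:                       # before pickup: fixed at the pickup-frame pose
            obj_pos[t] = pos_at_pick
            obj_yaw[t] = base_yaw[t_pick]
        elif t <= t_place:                   # carried: midpoint of the two hands,
            obj_pos[t] = 0.5 * (left_hand[t] + right_hand[t])
            obj_yaw[t] = base_yaw[t]         # orientation aligned with the robot base
        else:                                # after placement: fixed at the placement pose
            obj_pos[t] = pos_at_place
            obj_yaw[t] = base_yaw[t_place]
    return obj_pos, obj_yaw
```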
The following figure (Figure 2 from the original paper) illustrates the overall PhysHSI system, including data preparation:
Image 2: The image is a diagram illustrating the three main stages of the PhysHSI system: data preparation, AMP policy training, and real-world deployment. It details the processing of human and humanoid robot motions, as well as the object localization and task execution using LiDAR and camera.
4.2.1.2. Adversarial Motion Prior Policy Training
The HSI problem is formulated as a Reinforcement Learning (RL) task, using the Adversarial Motion Priors (AMP) framework [31]. This framework involves two main components:
- A policy π that generates humanoid actions.
- A discriminator D that distinguishes between motions produced by the policy and those in the reference motion dataset M.
1) Observation and Action Space
The policy observation at each timestep $t$ is composed of:

- A 5-step history of proprioception $\mathbf{o}_t^{\mathrm{p}}$.
- A 5-step history of task-specific observation $\mathbf{o}_t^{\mathrm{task}}$.

The proprioception at time $t$ is defined as:

$$\mathbf{o}_t^{\mathrm{p}} = \left[ \boldsymbol{\omega}_t,\ \mathbf{g}_t,\ \mathbf{q}_t,\ \dot{\mathbf{q}}_t,\ \mathbf{p}_t^{\mathrm{ee}},\ \mathbf{a}_{t-1} \right]$$

Where:
- $\boldsymbol{\omega}_t$: The robot's base angular velocity at time $t$.
- $\mathbf{g}_t$: The base gravity direction at time $t$, expressed in the base frame. This helps the robot understand its orientation relative to gravity.
- $\mathbf{q}_t$: The joint positions (angles) of the robot's 29 Degrees of Freedom (DoFs) at time $t$.
- $\dot{\mathbf{q}}_t$: The joint velocities of the robot's 29 DoFs at time $t$.
- $\mathbf{p}_t^{\mathrm{ee}}$: The 3D positions of five key end-effectors (left hand, right hand, left foot, right foot, and head) at time $t$, expressed in the robot's base frame.
- $\mathbf{a}_{t-1}$: The action taken at the previous timestep $t-1$, providing a sense of temporal continuity and previous control effort.

The task-specific observation $\mathbf{o}_t^{\mathrm{task}}$ varies by task but generally includes:

- $\mathbf{s}^{\mathrm{obj}}$: The object's shape, represented by its bounding box dimensions.
- $\mathbf{p}_t^{\mathrm{obj}}$: The object's position at time $t$, expressed in the robot's base frame.
- $\mathbf{R}_t^{\mathrm{obj}}$: The object's orientation at time $t$, typically represented by the first two columns of its rotation matrix expressed in the robot's base frame (a representation that can be compressed for planar cases or kept as the full 6D rotation encoding).
- $\mathbf{p}_t^{\mathrm{goal}}$: The goal position for the task at time $t$, expressed in the robot's base frame; for example, the target location to place a box or a chair's sitting position.

The discriminator observation $\mathbf{o}_t^{\mathcal{D}}$ at each timestep consists of privileged information (information that might not be directly available to the robot's policy in the real world but is useful during simulation training):

$$\mathbf{o}_t^{\mathcal{D}} = \left[ h_t,\ \mathbf{v}_t,\ \boldsymbol{\omega}_t,\ \mathbf{g}_t,\ \mathbf{q}_t,\ \mathbf{p}_t^{\mathrm{ee}},\ \mathbf{p}_t^{\mathrm{obj}} \right]$$

Where:
- $h_t$: The base height of the robot at time $t$.
- $\mathbf{v}_t$: The base linear velocity of the robot at time $t$.
- $\boldsymbol{\omega}_t$: The base angular velocity of the robot at time $t$.
- $\mathbf{g}_t$: The base gravity direction at time $t$.
- $\mathbf{q}_t$: The joint positions of the robot's 29 DoFs at time $t$.
- $\mathbf{p}_t^{\mathrm{ee}}$: The 3D positions of the five end-effectors at time $t$.
- $\mathbf{p}_t^{\mathrm{obj}}$: The object's position at time $t$.

The inclusion of $\mathbf{p}_t^{\mathrm{obj}}$ in the discriminator observation is particularly important for long-horizon tasks, as it allows the discriminator to implicitly condition on the different task phases (e.g., approach, pickup, carry, or place), providing enhanced guidance for policy training. A minimal sketch of how these observations could be assembled is given below.
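The following sketch illustrates how the policy and discriminator observations listed above could be packed into flat vectors. All field names and dimensions are assumptions for illustration, not the paper's exact layout.

```python
# Hedged sketch of observation assembly in the robot base frame.
import numpy as np

def build_policy_obs(robot, obj, goal, prev_action):
    """Proprioception + task observation for the policy."""
    proprio = np.concatenate([
        robot["base_ang_vel"],      # (3,) base angular velocity
        robot["gravity_dir"],       # (3,) gravity direction in base frame
        robot["joint_pos"],         # (29,) joint positions
        robot["joint_vel"],         # (29,) joint velocities
        robot["ee_pos"].ravel(),    # (5, 3) -> (15,) key end-effector positions
        prev_action,                # (29,) previous action
    ])
    task = np.concatenate([
        obj["bbox_size"],           # (3,) object bounding-box dimensions
        obj["pos"],                 # (3,) object position in base frame
        obj["rot6d"],               # (6,) first two columns of the rotation matrix
        goal,                       # (3,) goal position in base frame
    ])
    return proprio, task

def build_discriminator_obs(robot, obj):
    """Privileged state used only during simulation training."""
    return np.concatenate([
        np.array([robot["base_height"]]),
        robot["base_lin_vel"], robot["base_ang_vel"], robot["gravity_dir"],
        robot["joint_pos"], robot["ee_pos"].ravel(), obj["pos"],
    ])
```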
The action $\mathbf{a}_t$ output by the policy specifies target joint positions, which are executed by a PD controller across all 29 DoFs of the humanoid robot.
2) Reward Terms and Discriminator Learning
The total reward function is defined as a weighted sum of three components:

$$r_t = \lambda_{\mathrm{task}}\, r_t^{\mathrm{task}} + \lambda_{\mathrm{reg}}\, r_t^{\mathrm{reg}} + \lambda_{\mathrm{style}}\, r_t^{\mathrm{style}}$$

Where:
- $r_t^{\mathrm{task}}$: The task reward that encourages the humanoid to achieve its high-level objectives (e.g., reaching the object, grasping it, placing it at the goal).
- $r_t^{\mathrm{reg}}$: A regularization reward that penalizes excessive joint torques and high joint speeds, promoting smoother and more energy-efficient motions.
- $r_t^{\mathrm{style}}$: The style reward that encourages the humanoid to imitate behaviors from the reference motion dataset.
- $\lambda_{\mathrm{task}}, \lambda_{\mathrm{reg}}, \lambda_{\mathrm{style}}$: The corresponding coefficients that weigh the importance of each reward term.

The style reward is modeled using the adversarial discriminator $D$. The discriminator is trained to differentiate between motion clips produced by the policy and motion clips sampled from the reference dataset $M$. The discriminator is optimized using a minimax objective with a gradient penalty, similar to Wasserstein GANs (WGANs). Specifically, the discriminator's optimization objective is:

$$\min_{D}\; -\mathbb{E}_{d^{M}}\!\left[ D\!\left(\mathbf{o}_{t:t+t^*}^{\mathcal{D}}\right) \right] + \mathbb{E}_{d^{\pi}}\!\left[ D\!\left(\mathbf{o}_{t:t+t^*}^{\mathcal{D}}\right) \right] + \lambda_{\mathrm{gp}}\, \mathbb{E}_{\hat{\mathbf{o}}}\!\left[ \left( \left\| \nabla_{\hat{\mathbf{o}}} D(\hat{\mathbf{o}}) \right\|_2 - 1 \right)^2 \right]$$

Where:
- $D$: The discriminator network.
- $\mathbb{E}[\cdot]$: Expected value.
- $d^{M}(\mathbf{o}_{t:t+t^*}^{\mathcal{D}})$: The distribution of $(t^*+1)$-frame motion clips sampled from the dataset $M$.
- $d^{\pi}(\mathbf{o}_{t:t+t^*}^{\mathcal{D}})$: The distribution of $(t^*+1)$-frame motion clips generated by the policy.
- $\mathbf{o}_{t:t+t^*}^{\mathcal{D}}$: A sequence of discriminator observations over $t^*$ timesteps.
- $t^*$: The length of the motion clip used for discriminator input.
- $\lambda_{\mathrm{gp}}$: A coefficient that regularizes the gradient penalty [61]. The gradient penalty term forces the discriminator's gradient norm to be close to 1, which helps stabilize adversarial training and prevents issues like mode collapse in GANs.
- $\hat{\mathbf{o}}$: A sample interpolated between real and generated motion clips, used for computing the gradient penalty.

Finally, the style reward for the policy is derived from the discriminator's score on the policy-generated motion clip: the more a clip resembles the reference data (i.e., the more likely the discriminator rates it as "real"), the larger the reward, which drives the policy to imitate the style of the reference motions. The policy is optimized using Proximal Policy Optimization (PPO) [62] to maximize the cumulative discounted reward, where $\gamma$ is the discount factor. A sketch of such a discriminator update is given below.
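The following PyTorch sketch illustrates a discriminator update of the kind described above: a Wasserstein-style objective plus a gradient penalty on clips interpolated between reference and policy-generated motion. The network size, clip length, and penalty weight are illustrative assumptions, not the paper's settings.

```python
# Hedged sketch of an adversarial motion prior discriminator update.
import torch
import torch.nn as nn

obs_dim, clip_len = 64, 10                      # illustrative dimensions
disc = nn.Sequential(nn.Linear(obs_dim * clip_len, 512), nn.ReLU(),
                     nn.Linear(512, 1))

def discriminator_loss(real_clips, fake_clips, lambda_gp=5.0):
    """real_clips / fake_clips: (batch, obs_dim * clip_len) flattened motion clips."""
    # Wasserstein-style terms: rate reference clips high and policy clips low.
    loss = -disc(real_clips).mean() + disc(fake_clips).mean()

    # Gradient penalty on interpolated samples, pushing the gradient norm toward 1.
    alpha = torch.rand(real_clips.size(0), 1)
    interp = (alpha * real_clips + (1 - alpha) * fake_clips).requires_grad_(True)
    grad = torch.autograd.grad(disc(interp).sum(), interp, create_graph=True)[0]
    loss = loss + lambda_gp * ((grad.norm(2, dim=1) - 1.0) ** 2).mean()
    return loss
```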
3) Hybrid Reference State Initialization (RSI)
Many HSI tasks are long-horizon, making exploration difficult if all episodes start from a default pose. The reference state initialization (RSI) strategy [37] is adopted, where episodes are initialized from randomly sampled reference motions along with their corresponding labeled object states, which improves exploration efficiency.
However, naive RSI can lead to overfitting to the limited scene configurations present in the demonstrations. To mitigate this, a hybrid RSI strategy is used:
- Compositional Task Stages: For a sampled motion clip, an initial phase is sampled from the motion data, while the scene parameters (e.g., the target goal position) for the remaining phase of the episode are randomized. This allows the robot to learn to adapt to new target goals even when the initial part of the episode follows a demonstration.
- Fully Randomized Episodes: A subset of episodes is initialized from the default starting pose with fully randomized scene parameters (e.g., object size, position, and goal position). This ensures that the policy does not rely solely on the demonstration data for initialization and learns to operate in entirely novel settings.

This hybrid RSI strategy promotes efficient exploration while simultaneously ensuring generalization; a minimal sketch is given below.
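The following sketch illustrates how such a hybrid reset could be implemented. The 50/50 split between demonstration-initialized and fully randomized episodes, the helper names, and the sampling ranges (loosely based on the full-distribution evaluation ranges) are assumptions for illustration.

```python
# Hedged sketch of hybrid reference state initialization (RSI).
import numpy as np

def hybrid_rsi_reset(motion_dataset, rng, demo_init_prob=0.5):
    """Return an initial robot/object state plus randomized scene parameters."""
    if rng.random() < demo_init_prob:
        # Start from a random phase of a random demonstration clip...
        clip = motion_dataset[rng.integers(len(motion_dataset))]
        phase = rng.integers(len(clip["robot_states"]))
        init_state = clip["robot_states"][phase]
        init_object = clip["object_states"][phase]
        # ...but randomize the scene parameters for the rest of the episode
        # (e.g., the goal position), so the policy cannot overfit to the demo.
        goal = rng.uniform(low=[0.0, -2.0], high=[5.0, 2.0])
    else:
        # Fully randomized episode: default pose, randomized object and goal.
        init_state = None                     # placeholder for the default standing pose
        init_object = {"pos": rng.uniform([0.5, -2.0, 0.0], [5.0, 2.0, 0.6]),
                       "size": rng.uniform(0.2, 0.5, size=3)}
        goal = rng.uniform(low=[0.0, -2.0], high=[5.0, 2.0])
    return init_state, init_object, goal
```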
4) Asymmetric Actor-Critic Training
In real-world deployment, the agent receives only partial observations due to sensor noise and limitations. To compensate for this difference between training and deployment environments, an asymmetric actor-critic framework [63] is employed:
- Actor: The actor (which determines the policy) uses only observations that are available at deployment (i.e., proprioception and potentially masked task observations).
- Critic: The critic (which evaluates the value of states) observes a richer state that includes privileged information (e.g., base velocity and unmasked task observations), which is not available to the actor but helps stabilize and accelerate training (see the sketch below).
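A minimal sketch of this asymmetric structure is shown below: the actor consumes only deployable observations, while the critic additionally receives privileged simulator state. Layer sizes and activations are illustrative assumptions.

```python
# Hedged sketch of an asymmetric actor-critic network pair.
import torch
import torch.nn as nn

class AsymmetricActorCritic(nn.Module):
    def __init__(self, actor_obs_dim, privileged_obs_dim, action_dim, hidden=512):
        super().__init__()
        # Actor: proprioception + (possibly masked) task observation -> actions.
        self.actor = nn.Sequential(nn.Linear(actor_obs_dim, hidden), nn.ELU(),
                                   nn.Linear(hidden, action_dim))
        # Critic: actor observation plus privileged info (e.g., base velocity,
        # unmasked object pose) -> scalar value estimate.
        self.critic = nn.Sequential(nn.Linear(actor_obs_dim + privileged_obs_dim, hidden),
                                    nn.ELU(), nn.Linear(hidden, 1))

    def act(self, actor_obs):
        return self.actor(actor_obs)

    def value(self, actor_obs, privileged_obs):
        return self.critic(torch.cat([actor_obs, privileged_obs], dim=-1))
```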
5) Motion Constraints
During training, as rewards accumulate, the agent might exploit shortcuts by producing fast, jerky, or unnatural motions, especially in later stages. To address this and ensure smooth, stable motions suitable for hardware deployment:
- Weighted Style Reward: The style reward weight is initially set small to encourage exploration and then gradually increased as training progresses, aligning the policy's motions more closely with the motion data.
- L2C2 Smoothness Regularization: L2C2 smoothness regularization [64] is adopted. This technique adds a penalty term to the reward function that encourages smoother changes in joint positions and velocities over time, directly enhancing motion smoothness and stability.
4.2.2. Real-World Deployment System
To deploy the trained HSI skills on a physical robot, two critical observations are needed: the end-effector positions $\mathbf{p}_t^{\mathrm{ee}}$ and the object pose ($\mathbf{p}_t^{\mathrm{obj}}$ and $\mathbf{R}_t^{\mathrm{obj}}$). While $\mathbf{p}_t^{\mathrm{ee}}$ can be accurately obtained using forward kinematics (FK) from joint encoder readings, reliable object localization is much more challenging in real-world settings due to limited fields of view (FoV) and frequent occlusions. To achieve robust and continuous localization, a coarse-to-fine perception system is designed that integrates LiDAR and RGB camera inputs.
4.2.2.1. Coarse-to-Fine Object Localization
The pose of an object at time $t$ in the robot's base frame is represented by a transformation matrix:

$$\mathbf{T}_t^{\mathrm{obj}} = \mathcal{T}\!\left(\mathbf{p}_t^{\mathrm{obj}}, \mathbf{R}_t^{\mathrm{obj}}\right) \in SE(3)$$

Where:
- $\mathbf{T}_t^{\mathrm{obj}}$: The homogeneous transformation matrix representing the pose of the object relative to the robot's base frame at time $t$.
- $\mathcal{T}(\cdot)$: A function that maps the object's position and orientation (a rotation matrix, or its 6D representation) to an $SE(3)$ transformation matrix.
- $SE(3)$: The Special Euclidean Group, representing rigid body transformations in 3D space (rotation and translation).
The localization process proceeds in stages:
- Initialization: At the start of a task, the target object may be outside the camera's FoV. A coarse initial pose is assigned: the position is manually specified using the LiDAR point cloud visualization, while the orientation is assumed to be an identity rotation matrix (i.e., no initial rotation relative to the base frame).
- Coarse Localization (Long Range): When the robot is far from the object, FAST-LIO [53] is used to estimate the robot's odometry $\mathbf{T}_t^{\mathrm{odom}}$, i.e., the pose of the current robot base frame $B_t$ with respect to the initial base frame $B_0$. The object's position and orientation in the current base frame are then calculated by transforming the initial object pose with the robot's odometry:

  $$\left(\mathbf{p}_t^{\mathrm{obj}}, \mathbf{R}_t^{\mathrm{obj}}\right) = \mathcal{T}^{-1}\!\left( \left(\mathbf{T}_t^{\mathrm{odom}}\right)^{-1} \mathbf{T}_0^{\mathrm{obj}} \right)$$

  Where:
  - $\mathcal{T}^{-1}(\cdot)$: A function that extracts the position and orientation from a transformation matrix.
  - $\left(\mathbf{T}_t^{\mathrm{odom}}\right)^{-1}$: The inverse of the robot's odometry, representing the transformation from the current base frame $B_t$ to the initial base frame $B_0$.

  This method provides a continuous but coarse estimate of the object pose, sufficient to guide the robot toward the target from a distance.
- Fine Localization (Close Range): For fine-grained localization when the robot is close to the object, AprilTag detection [65] is employed. AprilTags are attached to objects, allowing the camera to provide an accurate object pose in the camera frame. The system automatically transitions from coarse to fine localization upon the AprilTag's first detection. In cases of temporary detection loss (e.g., when the robot turns to sit, causing the tag to temporarily leave the camera's FoV), the last known fine localization (at time $t'$) is propagated using the robot's odometry and the camera's fixed pose relative to the base, $\mathbf{T}^{\mathrm{cam}}$:

  $$\mathbf{T}_t^{\mathrm{obj}} = \left(\mathbf{T}_t^{\mathrm{odom}}\right)^{-1} \mathbf{T}_{t'}^{\mathrm{odom}}\, \mathbf{T}^{\mathrm{cam}}\, \mathbf{T}_{t'}^{\mathrm{obj,cam}}$$

  where $\mathbf{T}_{t'}^{\mathrm{obj,cam}}$ denotes the last detected object pose in the camera frame. This ensures continuous localization even with intermittent tag visibility.
The system further differentiates between static and dynamic objects:

- Static Objects (e.g., chairs): Their pose is assumed fixed. The propagation strategy described above for temporary detection loss is used to update their estimated pose, for example, when the robot maneuvers around them.
- Dynamic Objects (e.g., boxes): The propagated estimate is valid only until the object is grasped. After grasping, if the object leaves the camera's view, both its position and orientation are masked (i.e., set to a default value or ignored), and the policy relies solely on proprioception to complete the task. A simple distance threshold defines the grasp phase: if the estimated object distance exceeds the threshold, the object is treated as static; otherwise, it is assumed to move with the robot.
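The following sketch shows how this coarse-to-fine pose bookkeeping could be expressed with 4x4 homogeneous transforms. Function and variable names are assumptions, not the paper's implementation.

```python
# Hedged sketch of coarse-to-fine object pose propagation.
import numpy as np

def object_pose_in_base(T_odom_t, T_obj_init=None,
                        T_odom_last=None, T_cam=None, T_obj_cam_last=None):
    """Return the object pose in the current robot base frame.

    T_odom_t:       current base pose w.r.t. the initial base frame (FAST-LIO).
    T_obj_init:     manually initialized object pose in the initial base frame.
    T_odom_last, T_obj_cam_last: odometry and camera-frame object pose at the
                    last AprilTag detection (fine mode); T_cam is the fixed
                    camera-to-base transform.
    """
    if T_obj_cam_last is None:
        # Coarse mode: propagate the LiDAR-specified initial pose via odometry.
        return np.linalg.inv(T_odom_t) @ T_obj_init
    # Fine mode: propagate the last AprilTag measurement via relative odometry.
    return np.linalg.inv(T_odom_t) @ T_odom_last @ T_cam @ T_obj_cam_last
```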
4.2.2.2. Sim-to-Real Transfer
To facilitate sim-to-real transfer and ensure the trained policies are robust to real-world complexities, domain randomization [66] is applied during simulation training:
- Sensor Noise and Latency: Random offsets, Gaussian noise, and delays are added to object poses and forward kinematics (FK) observations, mimicking real-world sensor imperfections.
- Masking Mechanism: The masking mechanism for dynamic objects during the grasping stage is replicated in simulation. Masking is applied when the object is outside the camera's view, the goal distance is out of range, or the camera angle deviates excessively from vertical. This trains the policy to cope with intermittent object perception.
- Standard Domain Randomization: Additional standard domain randomization techniques from [55] are adopted to enhance robustness (e.g., randomizing robot dynamics, friction, and sensor noise parameters).

A minimal sketch of this observation-noise and masking scheme appears below.
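The sketch below illustrates how the object observation could be corrupted and masked during training. The noise magnitudes, range threshold, and masking rule are illustrative assumptions, not the paper's exact settings.

```python
# Hedged sketch of object-observation randomization and masking.
import numpy as np

def randomize_object_obs(obj_pos, rng, in_camera_view=True,
                         max_range=5.0, noise_std=0.02, offset=None):
    """Corrupt a simulated object position the way the real sensor stack might.

    offset: a constant per-episode bias, sampled once at environment reset.
    (Latency could additionally be modeled by returning an observation from a
    short history buffer instead of the current one.)
    """
    if offset is None:
        offset = np.zeros(3)
    noisy = obj_pos + offset + rng.normal(0.0, noise_std, size=3)
    # Masking: hide the observation when it would be unobservable on hardware,
    # e.g., outside the camera view or beyond the usable range.
    if not in_camera_view or np.linalg.norm(obj_pos) > max_range:
        noisy = np.zeros(3)
    return noisy
```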
4.2.2.3. Hardware Setup
The PhysHSI system is built on the Unitree G1 humanoid robot. This robot is equipped with:
- LiDAR: A built-in Livox Mid-360 LiDAR.
- Camera: An external Intel RealSense D455 depth camera mounted on the robot's head, providing wide horizontal and vertical fields of view.

All perception modules (point cloud visualization, FAST-LIO, AprilTag detection, and forward kinematics), along with the learned policy, run entirely onboard on a Jetson Orin NX. This enables fully portable deployment without external computation or infrastructure.
5. Experimental Setup
This section details the experimental environment, datasets, evaluation metrics, and baselines used to validate PhysHSI.
5.1. Datasets
The primary motion dataset used for training PhysHSI policies is derived from SMPL motions from the AMASS and SAMP datasets [29, 30]. These datasets primarily contain human movements. As described in the Data Preparation section, these motions are retargeted onto the Unitree G1 humanoid robot and then post-annotated with object information using a rule-based procedure after manual annotation of key contact frames. The resulting dataset contains approximately 25 complete trajectories per task, providing the motion priors for AMP training.
The choice of these datasets is effective for validating the method's performance because they provide a rich source of natural human movements. By retargeting and annotating them, the authors create a physically plausible dataset for humanoid-object interaction, crucial for learning lifelike behaviors.
5.2. Evaluation Metrics
The experiments utilize several metrics to assess both simulation and real-world performance:
5.2.1. Success Rate ($R_{\mathrm{succ}}$)
- Conceptual Definition: Success rate measures the proportion of trials or episodes in which the robot successfully completes the specified task objective. For HSI tasks, this typically means the object is correctly placed at its goal location, or the humanoid achieves the desired final pose (e.g., fully sitting on a chair). It focuses on the functional outcome of the task.
- Mathematical Formula: $ R_{\mathrm{succ}} = \frac{\text{Number of successful trials}}{\text{Total number of trials}} \times 100\% $
- Symbol Explanation:
  - Number of successful trials: The count of individual attempts where the robot met the task's success criteria.
  - Total number of trials: The total count of all attempts made during evaluation.
5.2.2. Human-likeness Score ($S_{\mathrm{human}}$)
- Conceptual Definition: The human-likeness score quantifies how natural and human-like the robot's generated motions appear, capturing the subjective quality of motion beyond task completion. In this paper, it is evaluated by a large language model.
- Mathematical Formula: This is a subjective score assigned by Gemini-2.5-Pro, ranging from 0 to 5. There is no standard mathematical formula, as the score is the output of an AI model's qualitative assessment.
- Symbol Explanation:
  - Gemini-2.5-Pro: A large language model used to assess the qualitative naturalness of the robot's movements based on task descriptions and video clips of experimental trajectories.
  - 0 to 5: The range of the score, where 0 indicates very unnatural motions and 5 indicates highly natural, human-like motions.
5.2.3. Finish Precision ($R_{\mathrm{precision}}$)
- Conceptual Definition: Finish precision measures the accuracy of the robot's final placement of an object, or its final pose, relative to the target. For tasks involving object placement, it quantifies the average spatial error between the actual and desired final object positions.
- Mathematical Formula: Although not explicitly defined in the paper, a common formula for the average position error (Euclidean distance) is: $ R_{\mathrm{precision}} = \frac{1}{N} \sum_{i=1}^{N} \left\| \mathbf{p}_{\mathrm{actual}, i} - \mathbf{p}_{\mathrm{target}, i} \right\|_2 $ (The paper reports the mean and standard deviation over multiple trials; it refers to this as a placement error, expressed as a distance in meters.)
- Symbol Explanation:
  - $N$: The number of trials.
  - $\mathbf{p}_{\mathrm{actual}, i}$: The actual final position of the object (or relevant robot part) in trial $i$.
  - $\mathbf{p}_{\mathrm{target}, i}$: The desired target final position in trial $i$.
  - $\| \cdot \|_2$: The Euclidean distance (L2 norm) in 3D space.
5.2.4. Execution Time ($T_{\mathrm{exec}}$)
- Conceptual Definition: Execution time measures the total duration, in seconds, required for the robot to complete an entire task sequence from start to finish. It quantifies the efficiency of task execution.
- Mathematical Formula: None; it is a direct measurement of time.
- Symbol Explanation: A directly measured temporal duration, in seconds.
5.2.5. Maximum Movement Range ($M_{\mathrm{range}}$)
- Conceptual Definition: Maximum movement range refers to the maximum distance the robot's base (or root) travels during execution of a task. It indicates the spatial extent of the task and the robot's ability to cover ground while interacting.
- Mathematical Formula: None; it is a direct measurement of distance.
- Symbol Explanation: A directly measured spatial distance, in meters.
5.3. Baselines
The paper compares PhysHSI against two commonly adopted baselines to demonstrate its effectiveness in simulation:
- RL-Rewards: This baseline represents a standard Reinforcement Learning approach in which the humanoid learns HSI tasks from scratch. It does not use any motion references or motion priors. Instead, the policy is optimized with a carefully designed reward function combining terms such as gait stability, task progression (e.g., proximity to the object, successful grasp), and regularization penalties (e.g., on joint torques and speed). This baseline highlights the challenges of achieving natural motions and generalization solely through explicit reward shaping.
- Tracking-Based: This baseline trains an agent to mimic motion references directly, tracking the humanoid's and object's trajectories from the augmented motion dataset frame by frame. This method is common in the motion imitation literature but is known to struggle with generalization when faced with novel scene configurations or task goals that deviate from the exact demonstration trajectories. It uses the same underlying motion data as PhysHSI but focuses on precise imitation rather than learning a generalized style.

These baselines are representative because RL-Rewards showcases the difficulty of reward engineering for natural movements, while Tracking-Based highlights the limitations of direct motion imitation in terms of generalization. Comparing against these two approaches effectively demonstrates PhysHSI's advantages in achieving both natural motion and strong generalization in HSI tasks.
5.4. Experimental Setup
All training and evaluation environments for simulation experiments are implemented using IsaacGym [67], a high-performance GPU-accelerated simulation platform.
The methods are benchmarked on four representative HSI tasks:
- Carry Box: Walking to a box, lifting it, carrying it, and placing it at a new goal location. This is a long-horizon task with multiple subtasks.
- Sit Down: Walking to a chair and sitting on it.
- Lie Down: Walking to a bed and lying down on it.
- Stand Up: Starting from a seated position (on a chair) and rising to a standing pose.

For evaluation, two settings are considered:

- In-distribution scenes: Configurations (e.g., object positions, sizes) that are directly sampled from or very similar to those present in the training dataset.
- Full-distribution scenes: Configurations uniformly sampled within a broader task space. For instance, objects are placed within [0, 5] m of the start position, boxes are initialized at heights within [0, 0.6] m, and box size dimensions lie within [0.2, 0.5] m. This setting is designed to test the generalization capabilities of the policies.

For both settings, results are reported as the mean and standard deviation computed over five random seeds. Each seed is evaluated across 1000 episodes, and human-likeness scores are based on three demo clips.
6. Results & Analysis
This section analyzes the experimental results, demonstrating the effectiveness of PhysHSI in both simulation and real-world scenarios, and highlighting the contributions of its individual components through ablation studies.
6.1. Core Results Analysis
6.1.1. Simulation Performance
The following are the results from Table I of the original paper:
| Method | Carry Box Rsucc (%, ↑) | Carry Box Shuman (↑) | Sit Down Rsucc (%, ↑) | Sit Down Shuman (↑) | Lie Down Rsucc (%, ↑) | Lie Down Shuman (↑) | Stand Up Rsucc (%, ↑) | Stand Up Shuman (↑) |
|---|---|---|---|---|---|---|---|---|
| **In-Distribution Scene** | | | | | | | | |
| RL-Rewards | 72.92 (±8.29) | 1.67 (±0.47) | 83.60 (±5.98) | 1.50 (±0.24) | 76.72 (±9.43) | 0.50 (±0.00) | 93.02 (±0.71) | 1.50 (±0.24) |
| Tracking-Based | 11.84 (±3.16) | 4.83 (±0.24) | 31.46 (±2.96) | 3.80 (±0.08) | 19.58 (±1.02) | 2.23 (±0.21) | 99.00 (±1.28) | 4.67 (±0.12) |
| PhysHSI | 91.34 (±1.63) | 4.00 (±0.41) | 96.28 (±0.21) | 4.80 (±0.08) | 97.86 (±0.60) | 4.80 (±0.08) | 99.68 (±0.21) | 3.77 (±0.21) |
| **Full-Distribution Scene** | | | | | | | | |
| RL-Rewards | 63.40 (±8.63) | 1.17 (±0.24) | 73.14 (±4.29) | 3.07 (±0.09) | 55.76 (±12.51) | 2.00 (±1.08) | 90.50 (±2.33) | 1.07 (±0.09) |
| Tracking-Based | 0.02 (±0.01) | 0.50 (±0.00) | 1.12 (±0.51) | 0.50 (±0.00) | 0.94 (±0.45) | 1.00 (±0.41) | 35.32 (±2.51) | 3.27 (±0.54) |
| PhysHSI | 84.60 (±3.74) | 3.83 (±0.24) | 91.32 (±2.48) | 4.77 (±0.05) | 81.28 (±3.99) | 4.43 (±0.33) | 92.24 (±0.75) | 3.77 (±0.52) |
Key Findings from Table I:
- Consistently High Success Rates: PhysHSI demonstrates superior performance across all four long-horizon HSI tasks. For in-distribution scenes, it achieves success rates above 91% for Carry Box, Sit Down, Lie Down, and Stand Up. Even for the complex Carry Box task (with four subtasks), PhysHSI reaches an impressive 91.34% success rate, comparable to the simpler Sit Down task.
- Strong Generalization: PhysHSI maintains high success rates even in full-distribution scenes, where scenarios are uniformly sampled across a wide task space: 84.60% for Carry Box and 91.32% for Sit Down. This is in stark contrast to Tracking-Based methods, which almost completely fail (near 0% success) in full-distribution scenes for complex tasks like Carry Box, Sit Down, and Lie Down. This highlights that PhysHSI's AMP framework enables flexible motion recombination and style alignment rather than rigid trajectory following, allowing it to adapt to novel situations.

  The following figure (Figure 3 from the original paper) illustrates the generalization capability:

  Fig. 3: Spatial Generalization. Root trajectories of the robot are shown for tasks (a) Carry Box and (b) Lie Down. Red trajectories indicate reference data, with others representing sampled policy motions.

  As depicted in Fig. 3, PhysHSI generates diverse root trajectories that successfully reach different target locations for tasks like Carry Box and Lie Down, despite the reference data (red trajectories) being limited. This visually confirms its strong spatial generalization.
- Lifelike Motion Patterns: PhysHSI attains significantly higher human-likeness scores (Shuman) than RL-Rewards methods. For example, in in-distribution scenes, PhysHSI scores 4.00 for Carry Box and 4.80 for Sit Down, whereas RL-Rewards only achieves 1.67 and 1.50 respectively. This indicates that the adversarial training between the policy and discriminator effectively guides the robot to produce natural behaviors by distinguishing between dataset motions and policy-generated motions. RL-Rewards methods, which rely on manually hand-crafted gait and regularization terms, struggle to produce natural motions, especially for long-horizon tasks. Tracking-Based methods generally achieve high human-likeness when the reference is good and the task is in distribution, but their generalization is poor.
6.1.2. Real-World Performance
The following are the results from Table III of the original paper:
| Tasks | Rsucc | Rprecision (m) | Texec (s) | Mrange (m) |
|---|---|---|---|---|
| Carry Box | 8/10, 6/10 | 0.19(±0.10) | 10.5(±2.8) | 5.69 |
| Sit Down | 9/10 | 0.07(±0.03) | 6.2 (±1.3) | 4.14 |
| Lie Down | 8/10 | 0.16(±0.07) | 6.7(±1.0) | 3.76 |
| Stand Up | 8/10 | / | 2.3(±0.4) | 1.74 |
Key Findings from Table III and accompanying text:
- Competitive Success Rates with High Precision: PhysHSI achieves zero-shot transfer and successfully completes all four HSI tasks in real-world settings. For the Carry Box task, it records an 8/10 success rate for lifting and 6/10 for the full sequence, with placement errors (finish precision) under 20 cm (0.19 m ± 0.10 m). The Sit Down task shows particularly strong performance, with 9/10 success and very high precision (0.07 m ± 0.03 m). Lie Down and Stand Up also show robust performance (8/10 success each).
- Effective Generalization in the Real World: The system generalizes effectively across variations in spatial layout and object properties. It handles locomotion over distances up to 5.7 m (for Carry Box) and accommodates diverse box dimensions, heights, and weights (as further detailed in the Limitation Analysis). This is demonstrated in Figure 4 of the paper, which shows examples of varied scene configurations.
- Natural, Human-like Motions: The policies learned in simulation, including the catwalk-style locomotion inherited from the AMASS data, transfer well to the real robot. The framework also supports stylized motion learning, allowing the system to produce diverse locomotion styles such as dinosaur-like walking or high-knee stepping (Fig. 1(e) of the paper).
- Portability and Autonomy: PhysHSI can be deployed outdoors using only onboard sensing and computation (Fig. 1(a)-(c) of the paper). This highlights the system's portability and independence from external infrastructure such as MoCap systems, a significant advantage for practical applications.
6.2. Ablation Studies / Parameter Analysis
The following are the results from Table II of the original paper:
| Method | Carry Box Rsucc (%, ↑) | Carry Box Shuman (↑) | Sit Down Rsucc (%, ↑) | Sit Down Shuman (↑) |
|---|---|---|---|---|
| **Ablation on Data Processing** | | | | |
| w/o Smoothness | 63.28 (±11.72) | 2.33 (±1.03) | 87.24 (±2.19) | 1.33 (±0.23) |
| w/o Object | 55.42 (±8.17) | 2.60 (±0.57) | 72.36 (±6.71) | 3.50 (±0.70) |
| PhysHSI | 79.34 (±4.71) | 3.83 (±0.24) | 91.32 (±2.48) | 4.77 (±0.05) |
| **Ablation on RSI Strategy** | | | | |
| w/o RSI | 41.24 (±6.92) | 2.50 (±1.63) | 78.24 (±3.91) | 4.50 (±0.00) |
| Naive RSI | 5.70 (±2.38) | 0.50 (±0.00) | 18.70 (±5.33) | 1.83 (±0.62) |
| Hybrid RSI | 79.34 (±4.71) | 3.83 (±0.24) | 91.32 (±2.48) | 4.77 (±0.05) |
| **Ablation on Mask Strategy (for dynamic objects)** | | | | |
| w/o Obs Mask | 85.90 (±2.90) | 4.30 (±0.14) | / | / |
| PhysHSI | 79.34 (±4.71) | 3.83 (±0.24) | 91.32 (±2.48) | 4.77 (±0.05) |
Key Observations from Ablation Analysis:
- Data Quality and Object Annotation are Critical:
  - w/o Smoothness: Training without smoothed motion data significantly reduces both the success rate and the human-likeness score. For Carry Box, Rsucc drops from 79.34% to 63.28%, and Shuman drops from 3.83 to 2.33. This indicates that raw, unsmoothed data introduces artifacts (e.g., jittering end-effectors, abrupt motion shifts) that the discriminator may exploit, leading to unnatural behaviors.
  - w/o Object: Removing object annotations from the training data drastically increases failure rates. For Carry Box, Rsucc drops from 79.34% to 55.42%. This confirms that object states in the AMP observations are essential for the policy to learn stage transitions (e.g., distinguishing between approaching, carrying, or placing an object) and appropriate motion styles conditioned on the object's presence and state. For example, the discriminator guides the robot to walk towards a distant object and to keep a grasped box centered.
- Hybrid RSI is Crucial for Generalization and Efficiency:
  - w/o RSI: Training without any Reference State Initialization leads to much lower success rates (e.g., 41.24% for Carry Box), indicating poor exploration efficiency, as the robot rarely experiences critical transitions.
  - Naive RSI: Initializing all episodes from reference states fixed to the dataset settings performs even worse (5.70% for Carry Box, 18.70% for Sit Down) and yields very low human-likeness scores. This demonstrates severe overfitting and poor generalization, because the policy cannot adapt to scene configurations outside the limited demonstrations.
  - Hybrid RSI: The proposed Hybrid RSI strategy significantly improves both generalization and sample efficiency, yielding the highest success rates and strong human-likeness compared to the other RSI variants. It balances exploration with exposure to diverse scene parameters.
- Mask Processing Has Limited Impact on Overall Performance:
  - w/o Obs Mask: Training without the masking strategy (i.e., using complete object states even when they would be unobservable in the real world) achieves slightly higher success rates (85.90% for Carry Box) and human-likeness scores (4.30) than PhysHSI. This represents an upper bound on performance, since the policy has access to perfect information.
  - PhysHSI (with Obs Mask): While slightly slower in training and showing a marginal drop in performance compared to w/o Obs Mask, the masking strategy ensures that the policy is robust to real-world sensing limitations. The minimal impact on the final success rate (79.34% vs. 85.90% for Carry Box) suggests that the domain randomization and asymmetric actor-critic framework effectively compensate for the partial observations, making the policy ready for real-world deployment. (The "/" entries for Sit Down under w/o Obs Mask indicate not applicable: the mask strategy targets dynamic objects, whereas Sit Down involves a static chair.)
6.3. Object Localization Module Analysis
The effectiveness of the coarse-to-fine object localization module was evaluated using 17 real-world HSI trials, with 15 successful completions. The object trajectories estimated by the module were compared against ground-truth trajectories from a MoCap system.
The following figure (Figure 5 from the original paper) shows the real-world localization system analysis:
Fig. 5: Real-World Localization System Analysis. (a) Localization error versus robot-object distance, with coarse-to-fine transition statistics and distribution. (b) A representative object localization trajectory, highlighting three stages: (i) coarse localization, (ii) fine localization, and (iii) grasp.
Key Findings from Fig. 5:
- Coarse-to-Fine Transition: As shown in Fig. 5(a), the localization error is relatively large while the robot is far from the object and relies on coarse localization (LiDAR-based). Once the robot is close enough for AprilTag detection to activate, the system transitions to fine localization (camera-based), and the average localization error drops substantially. This clearly demonstrates the effectiveness of the coarse-to-fine design: coarse localization provides sufficient directional guidance from a distance, while fine localization delivers precise positions at close range (a minimal sketch of this switching logic follows this list).
- Robustness: The high overall success rate (15/17 trials) further confirms the robustness of the module. The two failures were attributed to coarse guidance deviating too far (preventing the AprilTag from entering the camera's FoV) and to a system crash.
- Error Source Analysis (Fig. 5b):
  - (i) Coarse Stage: Errors primarily originate from the manually specified goal point in the LiDAR point-cloud visualization, which introduces an initial deviation from the exact object position. Despite this, the estimated and ground-truth trajectories show consistent trends, which is adequate for long-range guidance.
  - (ii) Fine Stage: During fine localization, errors are mainly due to odometry drift from FAST-LIO and AprilTag noise. However, these errors remain small, and the estimated trajectories closely align with the ground truth.
  - (iii) Grasping Stage: At very close range, particularly during rapid manipulative motions like grasping, AprilTag noise becomes more pronounced and dominates the errors. This highlights the inherent challenges of precise perception during dynamic, close-quarter interactions.
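As a rough illustration of the coarse-to-fine switching described above, the following sketch falls back to a LiDAR-frame coarse goal until an AprilTag detection is available, then transforms the tag-based estimate into the odometry frame. The class, frames, and fallback rule are assumptions for illustration, not the paper's code.

```python
import numpy as np

class CoarseToFineLocalizer:
    def __init__(self, coarse_goal_in_odom):
        # Coarse goal, e.g. specified once in the LiDAR point cloud.
        self.coarse_goal = np.asarray(coarse_goal_in_odom, dtype=float)
        self.last_fine_estimate = None

    def update(self, robot_pose_in_odom, tag_detection=None):
        """Return the current object position estimate in the odometry frame.

        robot_pose_in_odom: (R, t) rotation matrix and translation from
            LiDAR-inertial odometry (e.g. FAST-LIO).
        tag_detection: object position in the robot frame from an AprilTag,
            or None when the tag is not visible.
        """
        R, t = robot_pose_in_odom
        if tag_detection is not None:
            # Fine stage: transform the tag-based estimate into the odom frame.
            self.last_fine_estimate = R @ np.asarray(tag_detection) + t
            return self.last_fine_estimate
        if self.last_fine_estimate is not None:
            # Keep the most recent fine estimate if the tag drops out briefly.
            return self.last_fine_estimate
        # Coarse stage: fall back to the LiDAR-frame coarse goal.
        return self.coarse_goal

# Example usage with made-up numbers.
localizer = CoarseToFineLocalizer(coarse_goal_in_odom=[2.5, 0.3, 0.4])
pose = (np.eye(3), np.zeros(3))
print(localizer.update(pose))                                     # coarse
print(localizer.update(pose, tag_detection=[2.42, 0.28, 0.41]))   # fine
```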
6.4. System Limitation Analysis
The paper also conducted a limitation analysis on the carry box task to understand the boundaries of the system's capabilities.
The following are the results from Table IV of the original paper:
| Box Height | 0 cm | 20 cm | 40 cm | 60 cm |
|---|---|---|---|---|
| Rsucc (↑) | 2/3 | 3/3 | 3/3 | 1/3 |

| Box Weight | 0.6 kg | 1.2 kg | 2.3 kg | 3.6 kg | 4.5 kg |
|---|---|---|---|---|---|
| Rsucc (↑) | 2/3 | 3/3 | 2/3 | 1/3 | 0/3 |

| Maximum Box Size | 20 cm | 30 cm | 40 cm | 45 cm |
|---|---|---|---|---|
| Rsucc (↑) | 2/3 | 3/3 | 2/3 | 3/3 |
Key Findings from Table IV and accompanying text:
- Box Height Limitations: The humanoid robot can stably carry boxes placed at heights of roughly 0 to 40 cm. Carrying boxes placed higher becomes challenging (only 1/3 success at the 60 cm height), primarily because the box exceeds the robot's vertical FoV even when stationary, making AprilTag detection difficult.
- Box Weight Limitations: The robot can handle weights of roughly 0.6 to 2.3 kg. Beyond that (1/3 success for 3.6 kg, 0/3 for 4.5 kg), the robot struggles significantly due to hardware constraints, specifically motor overheating and the limited clamping force of its rubber hand.
- Box Size Limitations: The system performs well across the tested maximum box sizes of 20 to 45 cm. Larger or wider boxes cannot be handled effectively because of the limited reach and grip of the robot's rubber hand and arm length.

Beyond these specific findings, the authors identify several broader limitations:
- Hardware Constraints: The Unitree G1 robot's rubber hand has limited clamping force, restricting its ability to manipulate larger or heavier boxes. Excessive weight can also lead to motor overheating and potential hardware failures.
- Large-Scale High-Quality HSI Data: The current approach relies on manual post-annotation of objects in retargeted humanoid motion data. This manual process is time-consuming and does not scale well to generate the large quantities of high-quality HSI data needed for more complex and diverse tasks.
- Automated Perception Module: The existing object localization module is modular, combining LiDAR and camera inputs. While effective, its reliance on specific components (e.g., AprilTags, manual initialization) introduces complexity and potential fragility. A more automated perception module, potentially incorporating active perception (where the robot actively moves to improve its perception), could enhance robustness and simplify deployment.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduced PhysHSI, a novel real-world system designed for generalizable and natural humanoid-scene interaction. By integrating an effective simulation training pipeline with a robust real-world deployment module, PhysHSI enables humanoid robots to autonomously perform complex tasks like box carrying, sitting down, lying down, and standing up. The system achieves high success rates, demonstrates strong generalization across diverse spatial layouts and object properties, and produces remarkably natural and human-like motion behaviors. A key enabler is the AMP-based training pipeline that learns from humanoid interaction data in simulation, coupled with a pragmatic coarse-to-fine object localization module that fuses LiDAR and camera inputs for continuous and robust scene perception using only onboard sensors. The portability of PhysHSI, exemplified by its successful deployment in outdoor environments with minimal manual initialization (a coarse object pose and a single fiducial tag), marks a significant step towards practical, autonomous humanoid robotics. This work serves as an initial yet comprehensive exploration into real-world HSI tasks, paving the way for more advanced object and scene interaction capabilities.
7.2. Limitations & Future Work
The authors highlight several limitations of PhysHSI and suggest future research directions:
- Hardware Constraints: The current Unitree G1 robot's rubber hand and motor capabilities restrict manipulation to smaller and lighter objects. Future work could explore adapting the system to robots with more dexterous manipulators or investigate techniques for handling heavier loads without motor overheating or failure.
- Scalability of HSI Data: The reliance on manual post-annotation of object information in retargeted motion data is a bottleneck. Developing automated methods for generating large-scale, high-quality HSI data is crucial for expanding the system's capabilities to a wider range of interactions and objects.
- Automated Perception: The current object localization module, while robust, is modular and requires some manual initialization (e.g., a coarse object position). Future research could focus on developing a more fully automated perception module, possibly incorporating active perception strategies where the robot intelligently moves to gather better sensor data, thereby improving robustness and simplifying deployment in truly unknown environments.
7.3. Personal Insights & Critique
PhysHSI presents a compelling blend of advanced learning techniques and pragmatic real-world engineering, which is inspiring. The innovation lies in effectively bridging the sim-to-real gap for complex AMP-based policies, an area where many prior motion imitation works have fallen short. The hybrid RSI strategy is a clever solution to balance efficient exploration with robust generalization, addressing a common challenge in RL from demonstrations. The coarse-to-fine perception module is also a well-thought-out design, leveraging the strengths of different sensor modalities to overcome their individual weaknesses, making real-world deployment feasible with readily available onboard hardware.
Transferability and Applicability: The methodologies presented in PhysHSI are highly transferable. The AMP-based training pipeline could be adapted to learn other complex whole-body skills or manipulation tasks for humanoid robots, not just HSI. The coarse-to-fine perception paradigm could be applied to various robotic platforms (e.g., mobile manipulators) requiring robust and continuous object localization in dynamic environments, such as logistics, assistive robotics in homes, or industrial inspection tasks. The insights on domain randomization and asymmetric actor-critic are also broadly applicable for sim-to-real transfer in robot learning.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Dependency on Fiducial Markers (AprilTags): While AprilTags provide excellent precision, their requirement that objects be tagged limits the system's application in truly unstructured environments where objects cannot be modified. Future work could investigate tag-less object pose estimation using 3D object models, visual tracking, or neural radiance fields (NeRFs) for real-time object reconstruction, enhancing generalization to arbitrary objects.
- Subjectivity of Human-likeness Score: The human-likeness score evaluated by Gemini-2.5-Pro is an interesting and novel approach. However, the inherent subjectivity of such qualitative measures, even from an advanced AI, warrants consideration. A more comprehensive evaluation could involve human user studies (e.g., Mechanical Turk or expert evaluation) to validate the scores and provide a more robust assessment of perceived naturalness.
- Scalability of Data Annotation: The paper acknowledges the manual annotation bottleneck for HSI data. Exploring generative models that can synthesize diverse humanoid-object interaction data (perhaps from simpler demonstrations or linguistic prompts) or self-supervised learning from large-scale unlabeled video data could significantly improve scalability.
- Hardware Constraints as a Fundamental Limit: While recognized, the hardware constraints of the Unitree G1 (e.g., rubber hand, motor limits) are quite fundamental. For genuinely advanced real-world HSI, future robot hardware would need to evolve significantly, or the policies would need to become even more adaptive to work within very strict physical boundaries. The current clamping mechanism might struggle with varying object geometries or textures.
- Explanations of Specific Components: The paper mentions L2C2 smoothness regularization as an important motion constraint but does not detail its implementation or provide specific formulas. For a novice reader, a brief explanation of how this regularization works would be beneficial (a generic sketch of such a smoothness penalty follows this list).
- Real-time Computation Load: While everything is stated to run onboard a Jetson Orin NX, details on the real-time performance, inference latency, and computational load of the entire perception-to-action pipeline would be valuable for assessing its practical deployability in high-frequency interaction scenarios.
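As a generic illustration of the kind of smoothness regularization referenced above (not the paper's exact L2C2 formulation, which is not reproduced here), a common pattern penalizes how much the policy's output changes under small observation perturbations and between consecutive steps. All symbols and weights below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def smoothness_penalty(policy, obs, prev_action, noise_std=0.05,
                       w_spatial=1.0, w_temporal=0.5):
    """Generic smoothness penalty; `policy` maps an observation to an action."""
    action = policy(obs)
    # Spatial term: output should vary little under small input perturbations.
    perturbed = policy(obs + noise_std * rng.standard_normal(obs.shape))
    spatial = np.mean((action - perturbed) ** 2)
    # Temporal term: discourage abrupt changes between consecutive actions.
    temporal = np.mean((action - prev_action) ** 2)
    return w_spatial * spatial + w_temporal * temporal

# Tiny usage example with a linear toy "policy".
W = rng.standard_normal((12, 48)) * 0.1
policy = lambda o: W @ o
obs = rng.standard_normal(48)
prev_action = policy(obs)  # pretend this was the previous step's action
print(smoothness_penalty(policy, obs, prev_action))
```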
Overall, PhysHSI is a strong foundational paper that addresses critical challenges in real-world HSI. Its innovative integration of AMP with a robust onboard perception system sets a promising direction for the development of highly capable and autonomous humanoid robots.