
PhysHSI: Towards a Real-World Generalizable and Natural Humanoid-Scene Interaction System

Published: 10/13/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper presents PhysHSI, a humanoid-scene interaction system that enables robots to perform diverse tasks in real-world environments. It combines adversarial motion prior-based policy learning with a coarse-to-fine object localization module, achieving natural behaviors and robust performance in both simulation and real-world deployment.

Abstract

Deploying humanoid robots to interact with real-world environments--such as carrying objects or sitting on chairs--requires generalizable, lifelike motions and robust scene perception. Although prior approaches have advanced each capability individually, combining them in a unified system is still an ongoing challenge. In this work, we present a physical-world humanoid-scene interaction system, PhysHSI, that enables humanoids to autonomously perform diverse interaction tasks while maintaining natural and lifelike behaviors. PhysHSI comprises a simulation training pipeline and a real-world deployment system. In simulation, we adopt adversarial motion prior-based policy learning to imitate natural humanoid-scene interaction data across diverse scenarios, achieving both generalization and lifelike behaviors. For real-world deployment, we introduce a coarse-to-fine object localization module that combines LiDAR and camera inputs to provide continuous and robust scene perception. We validate PhysHSI on four representative interactive tasks--box carrying, sitting, lying, and standing up--in both simulation and real-world settings, demonstrating consistently high success rates, strong generalization across diverse task goals, and natural motion patterns.


In-depth Reading


1. Bibliographic Information

1.1. Title

The central topic of this paper is the development of a real-world system for humanoid-scene interaction that achieves generalizable and natural motions. The system is named PhysHSI, which stands for "Physical-world Humanoid-Scene Interaction."

1.2. Authors

The authors are: Huayi Wang, Wentao Zhang, Runyi Yu, Tao Huang, Junli Ren, Feiyu Jia, Zirui Wang, Xiaojie Niu, Xiao Chen, Jiahe Chen, Qifeng Chen, Jingbo Wang, and Jiangmiao Pang.

  • Affiliations: Most authors share a primary affiliation (implied by the acknowledgements to be the Shanghai Artificial Intelligence Laboratory); Runyi Yu and Qifeng Chen are also listed under a second affiliation (implied, likewise from the acknowledgements, to be Shanghai Jiao Tong University).
  • Research Backgrounds: The authorship indicates a strong focus on robotics, reinforcement learning, computer vision, and physical simulation, particularly in the context of embodied AI and humanoid robot control.

1.3. Journal/Conference

This paper is published on arXiv, a preprint server for electronic preprints of scientific papers. It is not yet formally published in a peer-reviewed journal or conference proceedings, but arXiv is widely used in the machine learning and robotics communities for early dissemination of research.

1.4. Publication Year

The paper was published on October 13, 2025.

1.5. Abstract

Deploying humanoid robots for real-world interactions, such as carrying objects or sitting, demands generalizable, lifelike movements and robust scene perception. While individual advancements have been made in these areas, their integration into a unified system remains challenging. This work introduces PhysHSI, a physical-world humanoid-scene interaction system that enables autonomous execution of diverse interaction tasks with natural and lifelike behaviors. PhysHSI consists of a simulation training pipeline and a real-world deployment system. The simulation pipeline employs adversarial motion prior (AMP)-based policy learning to imitate natural humanoid-scene interaction data across various scenarios, promoting generalization and lifelike behaviors. For real-world deployment, a coarse-to-fine object localization module is developed, combining LiDAR and camera inputs for continuous and robust scene perception. PhysHSI is validated on four interaction tasks—box carrying, sitting, lying, and standing up—in both simulated and real environments, demonstrating high success rates, strong generalization across task goals, and natural motion patterns.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is enabling humanoid robots to perform Humanoid-Scene Interaction (HSI) tasks in real-world environments in a generalizable, lifelike, and autonomous manner. Such tasks include complex actions like carrying objects or sitting on chairs.

This problem is crucial because HSI is a fundamental capability for deploying humanoid robots in everyday settings, moving beyond basic locomotion to truly interactive behaviors. However, several significant challenges persist in prior research:

  • Limited Generalization: Classical model-based methods (e.g., motion planning, trajectory optimization) are computationally expensive and rely on strong model assumptions, limiting their ability to generalize to diverse real-world scenarios.

  • Lack of Lifelike Motions: While Reinforcement Learning (RL)-based methods offer broader generalization by training from diverse simulated experiences, learning policies from scratch often requires extensive and difficult reward shaping and state transition design to produce natural, human-like motions.

  • Robust Scene Perception: Real-world environments are complex. Existing perception modules often struggle with limited fields of view, occlusions, and maintaining continuous, accurate object localization over long-horizon tasks.

  • Sim-to-Real Gap: Many advanced motion imitation techniques, particularly those leveraging Adversarial Motion Priors (AMP), have largely been confined to simulation, relying on perfect scene observations, leaving the sim-to-real transfer as a significant unexplored obstacle.

    The paper's innovative idea and entry point is to bridge these gaps by combining advanced motion imitation techniques in simulation with a robust real-world perception system to create a unified framework for HSI. This aims to deliver policies that are both generalizable and natural, and can be reliably deployed on physical robots.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  • An AMP-based Training Pipeline: Introduction of a simulation training pipeline that leverages Adversarial Motion Priors (AMP) to learn from humanoid interaction data. This pipeline enables the generation of natural, lifelike, and generalizable motions for diverse HSI tasks.

  • Coarse-to-Fine Real-World Object Localization Module: Development of a robust coarse-to-fine object localization module designed for real-world deployment. This module integrates LiDAR and camera inputs to provide continuous and reliable scene perception, crucial for interaction tasks where objects might be far, occluded, or dynamic.

  • Comprehensive Evaluation Protocols: Establishment of thorough evaluation protocols that analyze the system and its individual components in both simulation and real-world environments. This aims to provide insights and guidance for future research and development in real-world HSI tasks.

    The key conclusions and findings reached by the paper include:

  • High Success Rates and Generalization: PhysHSI achieves consistently high success rates across complex, long-horizon HSI tasks (e.g., box carrying, sitting, lying, standing up) and demonstrates strong generalization across diverse scenarios and task goals, outperforming reward-based and tracking-based baselines, especially in full-distribution scenes.

  • Natural and Lifelike Motion Patterns: The AMP-based training effectively produces natural and expressive motions, exhibiting human-like behaviors and even stylized locomotion, a significant improvement over RL-reward methods that struggle with motion realism.

  • Robust Real-World Deployment: The coarse-to-fine perception system provides accurate and continuous object localization, enabling zero-shot transfer of learned policies to a physical Unitree G1 humanoid robot. The system can autonomously complete tasks in various indoor and outdoor environments using only onboard sensors.

  • Portability: The system's ability to operate outdoors with onboard sensing and computation highlights its portability and reduced reliance on external infrastructure like MoCap systems.

3. Prerequisite Knowledge & Related Work

This section provides the foundational knowledge and context necessary to understand the PhysHSI paper.

3.1. Foundational Concepts

Humanoid-Scene Interaction (HSI)

Humanoid-Scene Interaction (HSI) refers to tasks where a humanoid robot physically interacts with objects and elements within its environment. This goes beyond simple locomotion to involve manipulation, contact, and adaptive behaviors based on the scene. Examples include picking up and carrying objects, sitting on chairs, lying on beds, opening doors, or using tools. The complexity arises from the need for coordinated whole-body movements, maintaining balance, adapting to varying object properties and scene layouts, and robustly perceiving the environment.

Reinforcement Learning (RL)

Reinforcement Learning (RL) is a paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward.

  • Agent: The learner or decision-maker (in this paper, the humanoid robot's control policy).
  • Environment: The external system with which the agent interacts (the simulated or real world with objects and tasks).
  • State: A complete description of the environment at a given time (e.g., robot's joint angles, velocities, base pose, object positions).
  • Action: A decision made by the agent that changes the state of the environment (e.g., target joint positions for the robot).
  • Reward: A scalar feedback signal from the environment indicating how good or bad the agent's last action was (e.g., positive for reaching a goal, negative for falling).
  • Policy ($\pi$): A function that maps states to actions, determining the agent's behavior. The goal of RL is to learn an optimal policy.
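
To make the agent-environment loop concrete, here is a minimal sketch using the Gymnasium API with a random policy. The `Pendulum-v1` environment and the episode length are illustrative stand-ins, not the humanoid setup used in the paper.

```python
# Minimal RL interaction loop (Gymnasium API); the environment is illustrative only.
import gymnasium as gym

env = gym.make("Pendulum-v1")
obs, info = env.reset(seed=0)

for step in range(200):
    # A trained policy would map obs -> action; here we sample randomly.
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```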

Motion Capture (MoCap)

Motion Capture (MoCap) is the process of digitally recording the movement of people or objects. In the context of robotics and animation, MoCap data typically consists of sequences of joint angles and/or marker positions over time, capturing natural human movements. This data can then be used to drive virtual characters or, as in this paper, retargeted onto robotic platforms to teach them human-like behaviors. While MoCap provides highly accurate ground truth motion, it usually requires specialized equipment in controlled laboratory settings, limiting its real-world applicability for direct robot control.

Adversarial Motion Priors (AMP)

Adversarial Motion Priors (AMP) is a technique in RL that combines generative adversarial networks (GANs) with motion imitation. Instead of explicitly designing reward functions to make motions look natural, AMP uses a discriminator network.

  • Generator (Policy): This is the RL agent (the robot's policy) that tries to generate natural-looking motions.
  • Discriminator ($\mathcal{D}$): This network is trained to distinguish between motion clips generated by the policy and real human motion clips from a reference motion dataset (the prior).
  • Adversarial Training: The policy tries to "fool" the discriminator into thinking its generated motions are real (by maximizing a style reward derived from the discriminator), while the discriminator tries to accurately distinguish between real and fake motions. This adversarial process implicitly guides the policy to produce motions that are stylistically similar to the human demonstrations, leading to natural and lifelike behaviors without explicit reward shaping for motion quality.

Sim-to-Real Transfer

Sim-to-Real transfer refers to the process of training a robot control policy in a simulated environment and then deploying it on a real physical robot. This is highly desirable because training in simulation is much safer, faster, and cheaper than training on real hardware. However, a significant challenge is the sim-to-real gap: the differences between the simulated world and the real world (e.g., accurate physics models, sensor noise, latency, hardware imperfections). Policies trained purely in simulation often perform poorly when transferred to reality. Techniques like domain randomization are used to make the simulated environment diverse enough to cover real-world variations, thereby improving transferability.

LiDAR-Inertial Odometry (LIO)

LiDAR-Inertial Odometry (LIO) is a method for estimating the motion (odometry) of a robot using data from a LiDAR sensor and an Inertial Measurement Unit (IMU).

  • LiDAR (Light Detection and Ranging): A sensor that measures distances to objects by emitting pulsed laser light and measuring the time it takes for the reflected light to return. It creates a 3D point cloud of the environment.
  • IMU (Inertial Measurement Unit): A sensor that measures angular rate and linear acceleration, providing information about the robot's orientation and motion dynamics.
  • Odometry: The estimation of the robot's change in position and orientation over time. LIO combines the precise distance measurements from LiDAR (for mapping and matching features in the environment) with the high-frequency and robust motion data from the IMU to provide accurate and drift-corrected ego-motion estimation, especially useful for long-range navigation.

AprilTag

AprilTag is a visual fiducial system, similar to QR codes, used for various robotics and augmented reality applications. AprilTags are 2D barcodes that can be easily detected and identified by a camera. When an AprilTag is observed, its precise 3D pose (position and orientation) relative to the camera can be estimated using computer vision algorithms. They are robust to lighting changes and partial occlusions, making them suitable for accurate object localization at close range, as used in this paper for fine-grained perception.

PD Controller

A Proportional-Derivative (PD) controller is a widely used feedback control loop mechanism. In robotics, it is often used to control joint positions.

  • Given a target joint position (setpoint) and the current joint position (process variable), the controller calculates an error signal.
  • The proportional (P) term generates a control output proportional to the current error. A larger error leads to a larger corrective force.
  • The derivative (D) term generates a control output proportional to the rate of change of the error. This helps to dampen oscillations and improve stability by anticipating future errors.
  • The PD controller outputs a torque or force command to the joint motors to move the joint towards its target position. It helps in precisely executing the actions decided by the RL policy.
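
As a rough illustration (not the paper's controller or gains), a joint-space PD law can be sketched as follows; the gains and joint values are hypothetical.

```python
import numpy as np

def pd_torque(q_target, q, q_dot, kp, kd):
    """Joint-space PD law: tau = Kp * (q_target - q) - Kd * q_dot."""
    return kp * (q_target - q) - kd * q_dot

# Toy example with three joints; gains and states are illustrative only.
kp = np.array([100.0, 100.0, 50.0])
kd = np.array([2.0, 2.0, 1.0])
q = np.array([0.00, 0.10, -0.20])          # measured joint positions
q_dot = np.array([0.0, 0.0, 0.0])          # measured joint velocities
q_target = np.array([0.05, 0.00, -0.10])   # targets output by the policy
print(pd_torque(q_target, q, q_dot, kp, kd))
```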

3.2. Previous Works

The paper contextualizes its contributions by reviewing prior work across three main areas:

A. Humanoid-Scene Interactions

  • Simulation-based HSI: Many works have explored HSI in physics-based simulations, achieving natural and long-horizon behaviors like object loco-manipulation [23, 26–28, 34]. However, these methods typically rely on idealized task observations, leading to large sim-to-real gaps when attempting real-world deployment.
  • Classical Real-World Approaches: For real-world robots, classical methods often use model-based motion planning or trajectory optimization to generate whole-body references for tracking [9–13]. These approaches can generate stable motions but suffer from limited generalization capabilities, making them impractical for diverse real-world scenarios.
  • RL-based Approaches: RL-based methods aim for stronger generalization by learning control policies from scratch [14, 16, 17]. However, they typically require meticulous reward shaping and state transition design, which is complex and often fails to produce genuinely natural motions, especially for long-horizon tasks.
  • Motion Prior Guided RL: Some works have used curated motion priors to guide policy learning for tasks like stair climbing and chair sitting [35, 36], improving motion naturalness.
  • PhysHSI's Position: PhysHSI builds upon the idea of learning from motion priors but extends it to enable generalizable and natural behaviors for more complex interactions, including box carrying and lying down, specifically addressing the sim-to-real gap that previous AMP works overlooked.

B. Humanoid Motion Imitation

  • Physics-based Motion Tracking (Simulation): In simulation, physics-based methods have achieved expressive whole-body motions by imitating individual reference sequences [37–39] or learning universal tracking policies [40].
  • Reference-Dependent Real-World Approaches: More recent works have adapted these methods for real-world robots [35, 7], but they remain highly reference-dependent and show limited generalization. This constrains their ability to interact flexibly with diverse scenes and objects.
  • Adversarial Motion Priors (AMP) in Simulation: AMP [31] significantly improved generalization in simulation by imitating motion styles rather than specific trajectories, leading to progress in dynamic interactions [23, 27, 28].
  • Limited Real-World AMP Applications: Real-world applications of AMP have been sparse, primarily using AMP to regularize tracking policies for basic locomotion skills [36, 41–43]. They typically do not extend to complex scene and object interactions.
  • PhysHSI's Position: PhysHSI overcomes these limitations by integrating AMP into a comprehensive system that enables natural behaviors for diverse real-world scene and object interactions, moving beyond basic locomotion or simulation-only performance.

C. Scene Perception

  • Motion Capture (MoCap) Systems: MoCap systems provide highly accurate global information, supporting dynamic interactive tasks [44–46]. However, their reliance on external infrastructure restricts them to laboratory environments with limited workspaces, making them unsuitable for real-world, untethered deployment.
  • Onboard RGB-D Cameras: Many studies use onboard RGB and depth cameras for scene and object perception [15, 47–52]. The main drawback is that these approaches generally confine target objects to a local workspace and often lose sight of them during long-horizon loco-manipulation tasks.
  • LiDAR-Inertial Odometry (LIO): Other studies employ LiDAR-Inertial Odometry (LIO) [53, 54] to obtain global information [35, 55–58]. While LIO provides robust ego-motion and environmental mapping, its interaction accuracy with objects, especially for fine-grained manipulation, remains limited compared to visual methods.
  • PhysHSI's Position: PhysHSI proposes a coarse-to-fine object localization system that relies solely on onboard sensors (LiDAR and camera). This system aims to provide continuous and robust scene perception by combining the strengths of LIO for long-range guidance with AprilTag detection for precise close-range localization, addressing the shortcomings of previous perception methods.

3.3. Technological Evolution

The evolution of humanoid robotics for complex interactions has progressed from highly controlled, model-based methods to more adaptive, learning-based approaches. Initially, focus was on stability and basic locomotion through motion planning or trajectory optimization. The introduction of reinforcement learning allowed for more generalized behaviors, but often at the cost of motion naturalness due to the difficulty of reward shaping. Motion imitation, especially with techniques like Adversarial Motion Priors (AMP), marked a significant step towards achieving human-like motions by learning from demonstrations. However, these advancements were largely confined to simulation, facing a substantial sim-to-real gap. Concurrently, robot perception evolved from external MoCap systems to onboard sensors like RGB-D cameras and LiDAR, each with its own strengths and weaknesses regarding range, accuracy, and robustness.

This paper's work, PhysHSI, fits within this timeline as a crucial step in integrating these advancements. It pushes the boundaries of AMP from simulation to real-world deployment, using a practical and robust onboard coarse-to-fine perception system. This represents an evolution towards truly autonomous, natural, and generalizable humanoid robots that can interact effectively in unstructured, real-world environments.

3.4. Differentiation Analysis

Compared to the main methods in related work, PhysHSI offers several core differences and innovations:

  • Unified System for Generalization, Naturalness, and Robust Perception: Unlike prior works that often excelled in one area (e.g., generalization from RL, naturalness from AMP in simulation, or basic real-world locomotion), PhysHSI provides a holistic system that successfully combines all three. It addresses the challenge of integrating these capabilities, which was previously an ongoing issue.

  • Real-World Deployment of Advanced Motion Imitation: A key innovation is the successful zero-shot transfer and real-world deployment of AMP-based policies for complex HSI tasks. Previous AMP applications in real robots were mostly limited to locomotion regularization, while PhysHSI enables dynamic interactions like box carrying and lying down with natural behaviors.

  • Novel Coarse-to-Fine Object Localization: The paper introduces a specific coarse-to-fine object localization module that uniquely combines LiDAR-based odometry (for long-range, coarse guidance) with camera-based AprilTag detection (for close-range, fine accuracy). This addresses the field-of-view and occlusion challenges faced by RGB-D cameras and improves interaction accuracy over pure LIO systems, providing continuous and robust scene perception using only onboard sensors.

  • Hybrid Reference State Initialization (Hybrid RSI): PhysHSI enhances RL exploration efficiency and generalization by introducing a hybrid RSI strategy. This strategy intelligently samples from motion data while randomizing scene parameters and also includes episodes starting from default poses with fully randomized scenes, preventing overfitting to limited demonstrations, a common issue in motion imitation.

  • Data Preparation with Object Annotation: The method of post-annotating object information on retargeted humanoid motions effectively creates a dataset suitable for learning humanoid-object interactions, addressing the difficulty of obtaining physically plausible humanoid-object interaction data.

    In essence, PhysHSI distinguishes itself by moving AMP from simulation labs to real-world deployment for complex HSI tasks, supported by an intelligent and robust onboard perception system that overcomes typical sim-to-real and perception limitations.

4. Methodology

The PhysHSI system is designed to enable humanoid robots to autonomously perform diverse interaction tasks with natural and lifelike behaviors in real-world environments. It consists of two main components: a simulation training pipeline and a real-world deployment system.

4.1. Principles

The core idea behind PhysHSI is to leverage the strengths of adversarial motion priors (AMP) in simulation to learn highly generalized and natural humanoid motions for interacting with objects, and then to bridge the sim-to-real gap with a robust, onboard coarse-to-fine perception module. The theoretical basis for AMP is Generative Adversarial Networks (GANs), where a generator (the robot's policy) learns to produce data (robot motions) that are indistinguishable from real data (human demonstrations) by a discriminator. This implicit style reward encourages human-like behaviors. For real-world deployment, the intuition is that accurate object pose is critical, but a single sensor type often has limitations (e.g., camera: limited FoV, LiDAR: less precise for fine manipulation). Therefore, a hybrid approach combining LiDAR for long-range, coarse guidance and camera for short-range, fine accuracy is employed.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Simulation Training Pipeline

The simulation training pipeline focuses on learning high-quality humanoid-scene interaction (HSI) policies that are both generalizable and lifelike.

4.2.1.1. Data Preparation

The first step is to prepare humanoid motion data that includes object interactions. Generating physically plausible humanoid-object interaction data (e.g., grasping a box securely) is challenging. The authors adopt a post-annotation strategy:

  1. Retargeting SMPL Motions: SMPL motions from datasets like AMASS and SAMP [29, 30], which primarily contain human-only movements, are first retargeted onto the humanoid robot using an optimization process.
  2. Smoothing Filter: A smoothing filter is applied to suppress any retargeting jitter, resulting in a robot-motion-only dataset $M_{\mathrm{Robo}}$.
  3. Manual Object Annotation: Key contact frames (e.g., pickup and placement) are manually annotated.
  4. Rule-Based Object Trajectory Inference: Between the pickup frame ($\phi_1$) and the placement frame ($\phi_2$), the object's position $\mathbf{p}^o_t \in \mathbb{R}^3$ is set to the midpoint of the robot's hands, and its orientation is aligned with the robot's base. Before $\phi_1$ and after $\phi_2$, the object remains fixed at its respective key contact frame position. This process yields an augmented humanoid motion dataset $M$ with consistent and physically coherent object positions, which is crucial for stage conditioning and reference state initialization in the subsequent training (a minimal sketch of this rule appears below).
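
The rule-based inference in step 4 could be implemented roughly as follows; the array layout, function name, and yaw-only orientation handling are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def annotate_object_trajectory(left_hand, right_hand, base_yaw, obj_init_pos,
                               pickup_frame, place_frame):
    """Rule-based object annotation sketch (assumed form of the paper's procedure).

    left_hand, right_hand : (T, 3) hand positions from the retargeted robot motion
    base_yaw              : (T,)  robot base heading, used for the object's yaw
    obj_init_pos          : (3,)  object position before pickup
    pickup_frame, place_frame : manually annotated contact frames (phi_1, phi_2)
    Returns (T, 3) object positions and (T,) object yaw angles.
    """
    T = left_hand.shape[0]
    obj_pos = np.zeros((T, 3))
    obj_yaw = np.zeros(T)

    # Before pickup: object stays at its initial position.
    obj_pos[:pickup_frame] = obj_init_pos
    # Between pickup and placement: midpoint of the two hands, yaw aligned with the base.
    mid = 0.5 * (left_hand + right_hand)
    obj_pos[pickup_frame:place_frame] = mid[pickup_frame:place_frame]
    obj_yaw[pickup_frame:place_frame] = base_yaw[pickup_frame:place_frame]
    # After placement: object stays where it was released.
    obj_pos[place_frame:] = mid[place_frame]
    obj_yaw[place_frame:] = base_yaw[place_frame]
    return obj_pos, obj_yaw
```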

The following figure (Figure 2 from the original paper) illustrates the overall PhysHSI system, including data preparation:

Figure 2 (from the original paper): A diagram illustrating the three main stages of the PhysHSI system: data preparation, AMP policy training, and real-world deployment. It details the processing of human and humanoid robot motions, as well as object localization and task execution using LiDAR and camera.

4.2.1.2. Adversarial Motion Prior Policy Training

The HSI problem is formulated as a Reinforcement Learning (RL) task, using the Adversarial Motion Priors (AMP) framework [31]. This framework involves two main components:

  • A policy $\pi_\theta$ that generates humanoid actions.
  • A discriminator $\mathcal{D}$ that distinguishes between motions produced by the policy and those in the reference motion dataset $M$.
1) Observation and Action Space

The policy observation $\mathbf{o}_t^\pi$ at each timestep $t$ is composed of:

  • A 5-step history of proprioception $\mathbf{o}_{t-4:t}^P$.

  • A 5-step history of task-specific observation $\mathbf{o}_{t-4:t}^G$.

The proprioception $\mathbf{o}_t^P \in \mathbb{R}^{108}$ at time $t$ is defined as:

$$\mathbf{o}_t^P \triangleq \left[ \omega_{b_t}, \mathbf{g}_{b_t}, \boldsymbol{\theta}_t, \dot{\boldsymbol{\theta}}_t, \mathbf{p}_{b_t}^{ee}, \mathbf{a}_{t-1} \right]$$

Where:

  • $\omega_{b_t} \in \mathbb{R}^3$: The robot's base angular velocity at time $t$.

  • $\mathbf{g}_{b_t} \in \mathbb{R}^3$: The base gravity direction at time $t$, expressed in the base frame. This helps the robot understand its orientation relative to gravity.

  • $\boldsymbol{\theta}_t \in \mathbb{R}^{29}$: The joint positions (angles) of the robot's 29 Degrees of Freedom (DoFs) at time $t$.

  • $\dot{\boldsymbol{\theta}}_t \in \mathbb{R}^{29}$: The joint velocities of the robot's 29 DoFs at time $t$.

  • $\mathbf{p}_{b_t}^{ee} \in \mathbb{R}^{5 \times 3}$: The 3D positions of five key end-effectors (left hand, right hand, left foot, right foot, and head) at time $t$, expressed in the robot's base frame.

  • $\mathbf{a}_{t-1}$: The action taken at the previous timestep $t-1$. This provides a sense of temporal continuity and previous control effort.

The task-specific observation $\mathbf{o}_t^G$ varies by task but generally includes:

  • $\mathbf{b}_t \in \mathbb{R}^3$: The object's shape, represented by its bounding box dimensions.

  • $\mathbf{p}_{b_t}^{o_t} \in \mathbb{R}^3$: The object's position at time $t$, expressed in the robot's base frame.

  • $\mathbf{R}_{b_t}^{o_t} \in \mathbb{R}^6$: The object's orientation at time $t$, typically represented by the first two columns of its rotation matrix, expressed in the robot's base frame (a 6D representation of $SO(3)$; for planar cases this is sometimes compressed to an $SO(2)$ heading).

  • $\mathbf{p}_{b_t}^{g_t} \in \mathbb{R}^3$: The goal position for the task at time $t$, expressed in the robot's base frame. For example, the target location to place a box or a chair's sitting position.

The discriminator observation $\mathbf{o}_t^{\mathcal{D}} \in \mathbb{R}^{57}$ at each timestep consists of privileged information (information that might not be directly available to the robot's policy in the real world but is useful during simulation training):

$$\mathbf{o}_t^{\mathcal{D}} \triangleq \left[ h_t, \mathbf{v}_{b_t}, \omega_{b_t}, \mathbf{g}_{b_t}, \boldsymbol{\theta}_t, \mathbf{p}_{b_t}^{ee}, \mathbf{p}_{b_t}^{o_t} \right]$$

Where:

  • $h_t \in \mathbb{R}$: The base height of the robot at time $t$.

  • $\mathbf{v}_{b_t} \in \mathbb{R}^3$: The base linear velocity of the robot at time $t$.

  • $\omega_{b_t} \in \mathbb{R}^3$: The base angular velocity of the robot at time $t$.

  • $\mathbf{g}_{b_t} \in \mathbb{R}^3$: The base gravity direction at time $t$.

  • $\boldsymbol{\theta}_t \in \mathbb{R}^{29}$: The joint positions of the robot's 29 DoFs at time $t$.

  • $\mathbf{p}_{b_t}^{ee} \in \mathbb{R}^{5 \times 3}$: The 3D positions of the five end-effectors at time $t$.

  • $\mathbf{p}_{b_t}^{o_t} \in \mathbb{R}^3$: The object's position at time $t$. Including $\mathbf{p}_{b_t}^{o_t}$ in the discriminator observation is particularly important for long-horizon tasks, as it allows the discriminator to implicitly condition on the different task phases (e.g., approach, pickup, carry, or place), providing enhanced guidance for policy training.

The action $\mathbf{a}_t \in \mathbb{R}^{29}$ output by the policy $\pi_\theta(\mathbf{o}_t^\pi)$ specifies target joint positions. These targets are then executed by a PD controller across all 29 DoFs of the humanoid robot.
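
A minimal sketch of how these per-step vectors could be assembled (the policy additionally stacks 5-step histories of both); the function names and the box-carrying task-observation layout are assumptions for illustration.

```python
import numpy as np

def build_proprioception(ang_vel, gravity, q, q_dot, ee_pos, last_action):
    """Assemble o_t^P for the 29-DoF robot: 3 + 3 + 29 + 29 + 15 + 29 = 108 dims."""
    obs = np.concatenate([ang_vel, gravity, q, q_dot, ee_pos.reshape(-1), last_action])
    assert obs.shape == (108,)
    return obs

def build_task_obs(box_size, obj_pos, obj_rot6d, goal_pos):
    """Assemble a task observation o_t^G (assumed box-carrying layout: 3 + 3 + 6 + 3 = 15 dims)."""
    return np.concatenate([box_size, obj_pos, obj_rot6d, goal_pos])
```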

2) Reward Terms and Discriminator Learning

The total reward function $r_t$ is defined as a weighted sum of three components:

$$r_t \triangleq w^G r_t^G + w^R r_t^R + w^S r_t^S$$

Where:

  • $r_t^G$: The task reward that encourages the humanoid to achieve its high-level objectives (e.g., reaching the object, grasping it, placing it at the goal).

  • $r_t^R$: A regularization reward that penalizes excessive joint torques and high joint speeds, promoting smoother and more energy-efficient motions.

  • $r_t^S$: The style reward that encourages the humanoid to imitate behaviors from the reference motion dataset $M$.

  • $w^{(\cdot)}$: The corresponding coefficients that weigh the importance of each reward term.

The style reward $r_t^S$ is modeled using the adversarial discriminator $\mathcal{D}$. The discriminator is trained to differentiate between motion clips produced by the policy and motion clips sampled from the reference dataset. It is optimized with a GAN-style objective augmented by a gradient penalty for training stability:

$$\arg\min_{\mathcal{D}} \; -\mathbb{E}_{d^M(\mathbf{o}_{t:t+t^*}^{\mathcal{D}})}\left[\log \mathcal{D}(\mathbf{o}_{t:t+t^*}^{\mathcal{D}})\right] - \mathbb{E}_{d^\pi(\mathbf{o}_{t:t+t^*}^{\mathcal{D}})}\left[\log\left(1 - \mathcal{D}(\mathbf{o}_{t:t+t^*}^{\mathcal{D}})\right)\right] + w^{\mathrm{gp}}\, \mathbb{E}_{d^M(\mathbf{o}_{t:t+t^*}^{\mathcal{D}})}\left[\left\| \nabla_\eta \mathcal{D}(\eta)\big|_{\eta = \mathbf{o}_{t:t+t^*}^{\mathcal{D}}} \right\|^2\right],$$

Where:

  • $\mathcal{D}$: The discriminator network.

  • $\mathbb{E}[\cdot]$: Expected value.

  • $d^M(\mathbf{o}_{t:t+t^*}^{\mathcal{D}})$: The distribution of $(t^*+1)$-frame motion clips sampled from the dataset $M$.

  • $d^\pi(\mathbf{o}_{t:t+t^*}^{\mathcal{D}})$: The distribution of $(t^*+1)$-frame motion clips generated by the policy $\pi_\theta$.

  • $\mathbf{o}_{t:t+t^*}^{\mathcal{D}}$: A sequence of discriminator observations over $t^*+1$ timesteps.

  • $t^*$: The length of the motion clip used for discriminator input.

  • $w^{\mathrm{gp}}$: A coefficient weighting the gradient penalty [61]. The gradient penalty keeps the discriminator's gradients small on dataset samples, which stabilizes adversarial training and helps prevent issues such as mode collapse.

  • $\eta$: The sample at which the gradient penalty is evaluated (here, a discriminator observation clip drawn from the dataset).

Finally, the style reward $r_t^S$ for the policy is defined as:

$$r_t^S \triangleq -\log\left(1 - \mathcal{D}(\mathbf{o}_{t-t^*:t}^{\mathcal{D}})\right).$$

This reward encourages the policy to generate motions that are highly likely to be classified as "real" by the discriminator, thereby imitating the style of the reference motions. The policy is optimized using Proximal Policy Optimization (PPO) [62] to maximize the expected cumulative discounted reward $\mathbb{E}\left[\sum_{t=1}^T \gamma^{t-1} r_t\right]$, where $\gamma$ is the discount factor.
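
A minimal PyTorch sketch of this discriminator update and style reward, assuming `disc` is a network that maps a flattened observation clip to a single logit; the gradient-penalty weight and clamping constant are illustrative.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(disc, real_clips, policy_clips, w_gp=5.0):
    """AMP-style discriminator update (sketch): dataset clips -> label 1,
    policy-generated clips -> label 0, plus a gradient penalty on dataset samples."""
    real_clips = real_clips.clone().requires_grad_(True)
    real_logits = disc(real_clips)
    fake_logits = disc(policy_clips)

    loss = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) \
         + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))

    # Gradient penalty term from the objective above, evaluated on dataset samples.
    grad = torch.autograd.grad(real_logits.sum(), real_clips, create_graph=True)[0]
    loss = loss + w_gp * grad.pow(2).sum(dim=-1).mean()
    return loss

def style_reward(disc, policy_clip):
    """r_t^S = -log(1 - D(o)), with the discriminator output squashed to (0, 1)."""
    with torch.no_grad():
        prob = torch.sigmoid(disc(policy_clip))
    return -torch.log(torch.clamp(1.0 - prob, min=1e-4))
```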

3) Hybrid Reference State Initialization (RSI)

Many HSI tasks are long-horizon, making exploration difficult if all episodes start from a default pose. The reference state initialization (RSI) strategy [37] is adopted, where episodes are initialized from randomly sampled reference motions along with their corresponding labeled object states, which improves exploration efficiency.

However, naive RSI can lead to overfitting to the limited scene configurations present in the demonstrations. To mitigate this, a hybrid RSI strategy is used:

  1. Compositional Task Stages: For a sampled motion clip, an initial phase $\phi \in [0, 1]$ is sampled from the motion data. The scene parameters (e.g., the target goal position) for the remaining phase of the episode, $(\phi, 1]$, are randomized. This allows the robot to learn to adapt to new target goals even when the initial part of the motion comes from a demonstration.
  2. Fully Randomized Episodes: A subset of episodes is initialized from the default starting pose with fully randomized scene parameters (e.g., object size, position, and goal position). This ensures that the policy does not rely solely on the demonstration data for initialization and learns to operate in entirely novel settings. Together, this hybrid RSI strategy promotes efficient exploration while ensuring generalization; a minimal sampling sketch follows.
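
The sketch below illustrates the hybrid sampling; the default-pose probability, goal ranges, and data layout are illustrative assumptions rather than the paper's actual values.

```python
import numpy as np

def hybrid_rsi_reset(motions, rng, p_default=0.3):
    """Hybrid RSI sketch. With probability p_default, start from the default pose
    with a fully randomized scene; otherwise sample a reference frame (phase phi)
    and randomize the scene parameters of the remaining episode."""
    if rng.random() < p_default:
        return {
            "robot_state": "default_pose",
            "object": {"pos": rng.uniform(0.0, 5.0, size=3),
                       "size": rng.uniform(0.2, 0.5, size=3)},
            "goal_pos": rng.uniform(0.0, 5.0, size=3),
        }
    clip = motions[rng.integers(len(motions))]
    num_frames = len(clip["robot_states"])
    frame = rng.integers(num_frames)  # corresponds to phase phi = frame / num_frames
    return {
        "robot_state": clip["robot_states"][frame],
        "object": clip["object_states"][frame],
        "goal_pos": rng.uniform(0.0, 5.0, size=3),  # scene for phase (phi, 1] is randomized
    }

# Usage: hybrid_rsi_reset(motion_dataset, np.random.default_rng(0))
```
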
4) Asymmetric Actor-Critic Training

In real-world deployment, the agent receives only partial observations due to sensor noise and limitations. To compensate for this difference between training and deployment environments, an asymmetric actor-critic framework [63] is employed:

  • Actor: The actor (which determines the policy $\pi_\theta$) uses observations $\mathbf{o}_t^\pi$ that are available at deployment (i.e., proprioception and potentially masked task observations).
  • Critic: The critic (which evaluates the value of states) observes a richer state $\mathbf{o}_t^V$, which includes privileged information (e.g., base velocity and unmasked task observations) that is not available to the actor but helps stabilize and accelerate training.
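
Conceptually, the asymmetric structure can be sketched as two heads fed different inputs; the network sizes below are illustrative, and the paper's actual architecture is not specified here.

```python
import torch.nn as nn

class AsymmetricActorCritic(nn.Module):
    """Sketch: the actor consumes deployable observations only, while the critic
    additionally receives privileged information (dimensions are illustrative)."""
    def __init__(self, actor_obs_dim, critic_obs_dim, action_dim=29):
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(actor_obs_dim, 512), nn.ELU(),
                                   nn.Linear(512, action_dim))
        self.critic = nn.Sequential(nn.Linear(critic_obs_dim, 512), nn.ELU(),
                                    nn.Linear(512, 1))

    def forward(self, actor_obs, critic_obs):
        action_mean = self.actor(actor_obs)   # used at deployment
        value = self.critic(critic_obs)       # used only during training
        return action_mean, value
```
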
5) Motion Constraints

During training, as rewards accumulate, the agent might exploit shortcuts by producing fast, jerky, or unnatural motions, especially in later stages. To address this and ensure smooth, stable motions suitable for hardware deployment:

  • Weighted Style Reward: The style reward weight $w^S$ is initially set small to encourage exploration, and then gradually increased as training progresses to align the policy's motions more closely with the motion data.
  • L2C2 Smoothness Regularization: L2C2 smoothness regularization [64] is adopted. This technique adds a penalty term to the reward function that encourages smoother changes in joint positions and velocities over time, directly enhancing motion smoothness and stability.
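
The schedule for $w^S$ could, for example, be a simple linear ramp as sketched below; the paper only states that the weight starts small and is gradually increased, so the shape and values here are assumptions.

```python
def style_weight_schedule(iteration, w_start=0.1, w_end=1.0, ramp_iters=5000):
    """Linearly increase the style-reward weight w^S over training (illustrative values)."""
    frac = min(iteration / ramp_iters, 1.0)
    return w_start + frac * (w_end - w_start)
```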

4.2.2. Real-World Deployment System

To deploy the trained HSI skills on a physical robot, two critical observations are needed: the end-effector positions $\mathbf{p}_{b_t}^{ee}$ and the object pose ($\mathbf{p}_{b_t}^{o_t}$ and $\mathbf{R}_{b_t}^{o_t}$). While $\mathbf{p}_{b_t}^{ee}$ can be accurately obtained using forward kinematics (FK) from joint encoder readings, reliable object localization is much more challenging in real-world settings due to limited fields of view (FoV) and frequent occlusions. To achieve robust and continuous localization, a coarse-to-fine perception system is designed, integrating LiDAR and RGB camera inputs.

4.2.2.1. Coarse-to-Fine Object Localization

The pose of an object $o$ at time $t$ in the robot's base frame $b_t$ is represented by a transformation matrix:

$$T_{b_t}^{o_t} = f_{\mathrm{T}}(\mathbf{p}_{b_t}^{o_t}, \mathbf{R}_{b_t}^{o_t}) \in SE(3)$$

Where:

  • $T_{b_t}^{o_t}$: The $4 \times 4$ homogeneous transformation matrix representing the pose of object $o$ relative to the robot's base frame at time $t$.

  • $f_{\mathrm{T}}(\cdot)$: A function that maps the object's position $\mathbf{p}_{b_t}^{o_t} \in \mathbb{R}^3$ and orientation $\mathbf{R}_{b_t}^{o_t}$ (a $3 \times 3$ rotation matrix, or its 6D representation) to an $SE(3)$ transformation matrix.

  • $SE(3)$: The Special Euclidean Group, representing rigid body transformations in 3D space (rotation and translation).

    The localization process proceeds in stages:

  1. Initialization: At the start of a task, the target object may be outside the camera's FoV. A coarse initial pose $T_{b_0}^{o_0}$ is assigned: the position $\mathbf{p}_{b_0}^{o_0}$ is manually specified using LiDAR point cloud visualization, while the orientation $\mathbf{R}_{b_0}^{o_0}$ is assumed to be the identity rotation (i.e., no initial rotation relative to the base frame).

  2. Coarse Localization (Long Range): When the robot is far from the object, FAST-LIO [53] is used to estimate the robot's odometry $T_{b_0}^{b_t}$, the pose of the current base frame $b_t$ with respect to the initial base frame $b_0$. The object's position and orientation in the current base frame are then obtained by transforming the initial object pose with the robot's odometry:

     $$\mathbf{p}_{b_t}^{o_t}, \mathbf{R}_{b_t}^{o_t} = f_{\mathrm{T}}^{-1}\left((T_{b_0}^{b_t})^{-1} T_{b_0}^{o_0}\right),$$

     Where:

     • $f_{\mathrm{T}}^{-1}(\cdot)$: A function that extracts the position and orientation from a transformation matrix.
     • $(T_{b_0}^{b_t})^{-1}$: The inverse of the robot's odometry, representing the transformation from the current base frame $b_t$ to the initial base frame $b_0$.

     This method provides a continuous but coarse estimate of the object pose, sufficient to guide the robot toward the target from a distance.

  3. Fine Localization (Close Range): For fine-grained localization when the robot is close to the object, AprilTag detection [65] is employed. AprilTags are attached to objects, allowing the camera to provide accurate object position $\mathbf{p}_{c_t}^{o_t}$ and orientation $\mathbf{R}_{c_t}^{o_t}$ in the camera frame $c_t$. The system automatically transitions from coarse to fine localization upon the AprilTag's first detection. In cases of temporary detection loss (e.g., when the robot turns to sit, causing the tag to leave the camera's FoV), the last fine localization $T_{c_{t'}}^{o_{t'}}$ (obtained at time $t'$) is propagated using the robot's odometry $T_{b_{t'}}^{b_t}$ and the camera's fixed pose relative to the base $T_{b_{t'}}^{c_{t'}}$. The object pose at the current time $t$ is then:

     $$\mathbf{p}_{b_t}^{o_t}, \mathbf{R}_{b_t}^{o_t} = f_{\mathrm{T}}^{-1}\left((T_{b_{t'}}^{b_t})^{-1} T_{b_{t'}}^{c_{t'}} T_{c_{t'}}^{o_{t'}}\right).$$

     This ensures continuous localization even with intermittent tag visibility.

    The system further differentiates between static and dynamic objects:

    • Static Objects (e.g., chairs): Their pose is assumed fixed. The propagation strategy (as described for temporary detection loss) is used to update their estimated pose, for example, when the robot maneuvers around them.
    • Dynamic Objects (e.g., boxes): The estimated pose is trusted until the object is grasped. After grasping, if the object leaves the camera's view, both its position and orientation are masked (i.e., set to a default value or ignored), and the policy relies solely on proprioception to complete the task. A simple distance threshold $\epsilon$ defines the grasp phase: if the estimated object distance exceeds $\epsilon$, the object is treated as static; otherwise, it is assumed to move with the robot.
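
The two localization equations above reduce to a few SE(3) matrix products; a minimal NumPy sketch (the helper names are ours) is:

```python
import numpy as np

def make_T(p, R):
    """f_T: build a 4x4 SE(3) matrix from a position p (3,) and rotation R (3, 3)."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, p
    return T

def split_T(T):
    """f_T^{-1}: recover (p, R) from a 4x4 SE(3) matrix."""
    return T[:3, 3], T[:3, :3]

def coarse_object_pose(T_b0_bt, T_b0_o0):
    """Coarse stage: (T_b0_bt)^{-1} @ T_b0_o0 gives the object pose in the current base frame."""
    return split_T(np.linalg.inv(T_b0_bt) @ T_b0_o0)

def propagated_fine_pose(T_btp_bt, T_btp_ctp, T_ctp_otp):
    """Fine stage under temporary tag loss: propagate the last camera-frame detection
    (time t') via odometry: (T_bt'_bt)^{-1} @ T_bt'_ct' @ T_ct'_ot'."""
    return split_T(np.linalg.inv(T_btp_bt) @ T_btp_ctp @ T_ctp_otp)
```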

4.2.2.2. Sim-to-Real Transfer

To facilitate sim-to-real transfer and ensure the trained policies are robust to real-world complexities, domain randomization [66] is applied during simulation training:

  1. Sensor Noise and Latency: Random offsets, Gaussian noise, and delays are added to object poses and forward kinematics (FK) observations. This mimics real-world sensor imperfections.
  2. Masking Mechanism: The masking mechanism for dynamic objects during the grasping stage is replicated. This masking is applied when the object is outside the camera's view, the goal distance is out of range, or the camera angle deviates excessively from vertical. This trains the policy to cope with intermittent object perception.
  3. Standard Domain Randomization: Additional standard domain randomization techniques from [55] are adopted to enhance robustness (e.g., randomizing robot dynamics, friction, sensor noise parameters).
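
A minimal sketch of the object-observation randomization in item 1 and the masking in item 2; the noise magnitudes and masking probability are illustrative, not the paper's settings.

```python
import numpy as np

def randomize_object_obs(obj_pos, episode_offset, rng,
                         noise_std=0.02, mask_prob=0.05):
    """Sketch of object-observation randomization in simulation: a fixed per-episode
    offset, per-step Gaussian noise, and occasional masking that mimics intermittent
    perception."""
    noisy = obj_pos + episode_offset + rng.normal(0.0, noise_std, size=3)
    if rng.random() < mask_prob:
        noisy = np.zeros(3)  # masked observation, as during grasp-phase dropout
    return noisy

# Usage: randomize_object_obs(true_obj_pos, rng.uniform(-0.05, 0.05, size=3), rng)
```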

4.2.2.3. Hardware Setup

The PhysHSI system is built on the Unitree G1 humanoid robot. This robot is equipped with:

  • LiDAR: A built-in Livox Mid-360 LiDAR.
  • Camera: An external Intel RealSense D455 depth camera mounted on the robot's head, providing an 86° horizontal and 57° vertical field of view.

All perception modules (point cloud visualization, FAST-LIO, AprilTag detection, and forward kinematics), along with the learned policy, run entirely onboard on a Jetson Orin NX. This enables fully portable deployment without external computation or infrastructure.

5. Experimental Setup

This section details the experimental environment, datasets, evaluation metrics, and baselines used to validate PhysHSI.

5.1. Datasets

The primary motion dataset used for training PhysHSI policies is derived from SMPL motions from the AMASS and SAMP datasets [29, 30]. These datasets primarily contain human movements. As described in the Data Preparation section, these motions are retargeted onto the Unitree G1 humanoid robot and then post-annotated with object information using a rule-based procedure after manual annotation of key contact frames. The resulting dataset contains approximately 25 complete trajectories per task, providing the motion priors for AMP training.

The choice of these datasets is effective for validating the method's performance because they provide a rich source of natural human movements. By retargeting and annotating them, the authors create a physically plausible dataset for humanoid-object interaction, crucial for learning lifelike behaviors.

5.2. Evaluation Metrics

The experiments utilize several metrics to assess both simulation and real-world performance:

5.2.1. Success Rate ($R_{\mathrm{succ}}$)

  • Conceptual Definition: Success rate measures the proportion of trials or episodes in which the robot successfully completes the specified task objective. For HSI tasks, this typically means the object is correctly placed at its goal location, or the humanoid achieves the desired final pose (e.g., fully sitting on a chair). It focuses on the functional outcome of the task.
  • Mathematical Formula:
    $$R_{\mathrm{succ}} = \frac{\text{Number of successful trials}}{\text{Total number of trials}} \times 100\%$$
  • Symbol Explanation:
    • Number of successful trials: The count of individual attempts in which the robot met the task's success criteria.
    • Total number of trials: The total count of all attempts made during evaluation.

5.2.2. Human-likeness Score ($S_{\mathrm{human}}$)

  • Conceptual Definition: The human-likeness score quantifies how natural and human-like the robot's generated motions appear. It aims to capture the subjective quality of motion beyond just task completion. In this paper, it is evaluated by a large language model.
  • Mathematical Formula: This is a subjective score assigned by Gemini-2.5-Pro ranging from 0 to 5. There is no standard mathematical formula for this score, as it's an output from an AI model's qualitative assessment.
  • Symbol Explanation:
    • Gemini-2.5-Pro: A large language model used to assess the qualitative naturalness of the robot's movements based on task descriptions and video clips of experimental trajectories.
    • 0 to 5: The range of the score, where 0 indicates very unnatural motions and 5 indicates highly natural, human-like motions.

5.2.3. Finish Precision ($R_{\mathrm{precision}}$)

  • Conceptual Definition: Finish precision measures the accuracy of the robot's final placement of an object or its final pose relative to the target. For tasks involving object placement, it quantifies the average spatial error between the actual and desired final object position.
  • Mathematical Formula: The paper does not give an explicit formula; a common formulation for placement error (mean Euclidean distance) is:
    $$R_{\mathrm{precision}} = \frac{1}{N} \sum_{i=1}^N \left\| \mathbf{p}_{\mathrm{actual},i} - \mathbf{p}_{\mathrm{target},i} \right\|_2$$
    (The paper reports the mean and standard deviation over multiple trials and refers to this as a placement error, measured in meters.)
  • Symbol Explanation:
    • $N$: The number of trials.
    • $\mathbf{p}_{\mathrm{actual},i}$: The actual final position of the object (or relevant robot part) in trial $i$.
    • $\mathbf{p}_{\mathrm{target},i}$: The desired target final position in trial $i$.
    • $\| \cdot \|_2$: The Euclidean distance (L2 norm) in 3D space.

5.2.4. Execution Time ($T_{\mathrm{exec}}$)

  • Conceptual Definition: Execution time measures the total duration, in seconds, required for the robot to complete an entire task sequence from start to finish. It quantifies the efficiency of the task execution.
  • Mathematical Formula: No specific formula provided in the paper, as it's a direct measurement of time.
  • Symbol Explanation: This is a directly measured temporal duration, typically in seconds.

5.2.5. Maximum Movement Range ($M_{\mathrm{range}}$)

  • Conceptual Definition: Maximum movement range refers to the maximum distance the robot's base (or root) travels during the execution of a task. It indicates the spatial extent of the task and the robot's ability to cover ground while interacting.
  • Mathematical Formula: No specific formula provided in the paper, as it's a direct measurement of distance.
  • Symbol Explanation: This is a directly measured spatial distance, typically in meters.

5.3. Baselines

The paper compares PhysHSI against two commonly adopted baselines to demonstrate its effectiveness in simulation:

  1. RL-Rewards: This baseline represents a standard Reinforcement Learning approach where the humanoid learns HSI tasks from scratch. It does not use any motion references or motion priors. Instead, the policy is optimized using a carefully designed reward function that combines various terms, such as rewards for gait stability, task progression (e.g., proximity to object, successful grasp), and regularization penalties (e.g., for joint torques and speed). This baseline highlights the challenges of achieving natural motions and generalization solely through explicit reward shaping.

  2. Tracking-Based: This baseline involves training an agent to mimic motion references directly. It attempts to track the humanoid's and object's trajectories as provided by the augmented motion dataset $M$ frame by frame. This method is common in motion imitation literature but is known to struggle with generalization when faced with novel scene configurations or task goals that deviate from the exact demonstration trajectories. It uses the same underlying motion data as PhysHSI but focuses on precise imitation rather than learning a generalized style.

    These baselines are representative because RL-Rewards showcases the difficulty of reward engineering for natural movements, while Tracking-Based highlights the limitations of direct motion imitation in terms of generalization. Comparing against these two approaches effectively demonstrates PhysHSI's advantages in achieving both natural motion and strong generalization in HSI tasks.

5.4. Experimental Setup

All training and evaluation environments for simulation experiments are implemented using IsaacGym [67], a high-performance GPU-accelerated simulation platform. The methods are benchmarked on four representative HSI tasks:

  • Carry Box: Involves walking to a box, lifting it, carrying it, and then placing it at a new goal location. This is a long-horizon task with multiple subtasks.

  • Sit Down: Requires the robot to walk to a chair and sit on it.

  • Lie Down: Involves walking to a bed and lying down on it.

  • Stand Up: The robot starts from a seated position (on a chair) and rises to a standing pose.

    For evaluation, two settings are considered:

  1. In-distribution scenes: These scenes include only configurations (e.g., object positions, sizes) that are directly sampled from or very similar to those present in the training dataset.

  2. Full-distribution scenes: These scenes involve uniformly sampled configurations within a broader task space. For instance, objects are placed within [0, 5] m of the start position; boxes are initialized at heights within [0, 0.6] m and have size dimensions within [0.2, 0.5] m. This setting is designed to test the generalization capabilities of the policies.

    For both settings, results are reported as the mean and standard deviation computed over five random seeds. Each seed is evaluated across 1000 episodes, and human-likeness scores are based on three demo clips.

6. Results & Analysis

This section analyzes the experimental results, demonstrating the effectiveness of PhysHSI in both simulation and real-world scenarios, and highlighting the contributions of its individual components through ablation studies.

6.1. Core Results Analysis

6.1.1. Simulation Performance

The following are the results from Table I of the original paper:

In-Distribution Scene

| Method | Carry Box $R_{\mathrm{succ}}$ (%, ↑) | Carry Box $S_{\mathrm{human}}$ (↑) | Sit Down $R_{\mathrm{succ}}$ (%, ↑) | Sit Down $S_{\mathrm{human}}$ (↑) | Lie Down $R_{\mathrm{succ}}$ (%, ↑) | Lie Down $S_{\mathrm{human}}$ (↑) | Stand Up $R_{\mathrm{succ}}$ (%, ↑) | Stand Up $S_{\mathrm{human}}$ (↑) |
|---|---|---|---|---|---|---|---|---|
| RL-Rewards | 72.92 (±8.29) | 1.67 (±0.47) | 83.60 (±5.98) | 1.50 (±0.24) | 76.72 (±9.43) | 0.50 (±0.00) | 93.02 (±0.71) | 1.50 (±0.24) |
| Tracking-Based | 11.84 (±3.16) | 4.83 (±0.24) | 31.46 (±2.96) | 3.80 (±0.08) | 19.58 (±1.02) | 2.23 (±0.21) | 99.00 (±1.28) | 4.67 (±0.12) |
| PhysHSI | 91.34 (±1.63) | 4.00 (±0.41) | 96.28 (±0.21) | 4.80 (±0.08) | 97.86 (±0.60) | 4.80 (±0.08) | 99.68 (±0.21) | 3.77 (±0.21) |

Full-Distribution Scene

| Method | Carry Box $R_{\mathrm{succ}}$ (%, ↑) | Carry Box $S_{\mathrm{human}}$ (↑) | Sit Down $R_{\mathrm{succ}}$ (%, ↑) | Sit Down $S_{\mathrm{human}}$ (↑) | Lie Down $R_{\mathrm{succ}}$ (%, ↑) | Lie Down $S_{\mathrm{human}}$ (↑) | Stand Up $R_{\mathrm{succ}}$ (%, ↑) | Stand Up $S_{\mathrm{human}}$ (↑) |
|---|---|---|---|---|---|---|---|---|
| RL-Rewards | 63.40 (±8.63) | 1.17 (±0.24) | 73.14 (±4.29) | 3.07 (±0.09) | 55.76 (±12.51) | 2.00 (±1.08) | 90.50 (±2.33) | 1.07 (±0.09) |
| Tracking-Based | 0.02 (±0.01) | 0.50 (±0.00) | 1.12 (±0.51) | 0.50 (±0.00) | 0.94 (±0.45) | 1.00 (±0.41) | 35.32 (±2.51) | 3.27 (±0.54) |
| PhysHSI | 84.60 (±3.74) | 3.83 (±0.24) | 91.32 (±2.48) | 4.77 (±0.05) | 81.28 (±3.99) | 4.43 (±0.33) | 92.24 (±0.75) | 3.77 (±0.52) |

Key Findings from Table I:

  • Consistently High Success Rates: PhysHSI demonstrates superior performance across all four long-horizon HSI tasks. For in-distribution scenes, it achieves success rates above 91% for Carry Box, Sit Down, Lie Down, and Stand Up. Even for the complex Carry Box task (with four subtasks), PhysHSI reaches an impressive 91.34% success rate, comparable to the simpler Sit Down task.

  • Strong Generalization: PhysHSI maintains high success rates even in full-distribution scenes, where scenarios are uniformly sampled across a wide task space. For instance, in full-distribution scenes, PhysHSI achieves 84.60% for Carry Box and 91.32% for Sit Down. This is in stark contrast to Tracking-Based methods, which almost completely fail (near 0% success) in full-distribution scenes for complex tasks like Carry Box, Sit Down, and Lie Down. This highlights that PhysHSI's AMP framework enables flexible motion recombination and style alignment rather than rigid trajectory following, allowing it to adapt to novel situations. The following figure (Figure 3 from the original paper) illustrates the generalization capability:

    Fig. 3: Spatial Generalization. Root trajectories of the robot are shown for tasks (a) Carry Box and (b) Lie Down. Red trajectories indicate reference data, with others representing sampled policy motions.

    As depicted in Fig. 3, PhysHSI generates diverse root trajectories that successfully navigate to different target locations for tasks like Carry Box and Lie Down, despite the reference data (red trajectories) being limited. This visually confirms its strong spatial generalization.

  • Lifelike Motion Patterns: PhysHSI attains significantly higher human-likeness scores ($S_{\mathrm{human}}$) compared to RL-Rewards methods. For example, in in-distribution scenes, PhysHSI scores 4.00 for Carry Box and 4.80 for Sit Down, whereas RL-Rewards only achieves 1.67 and 1.50 respectively. This indicates that the adversarial training between the policy and discriminator effectively guides the robot to produce natural behaviors by distinguishing between dataset motions and policy-generated motions. RL-Rewards methods, which rely on manually hand-crafted gait and regularization terms, struggle to produce natural motions, especially for long-horizon tasks. Tracking-Based methods generally achieve high human-likeness if the reference is good and the task is within distribution, but their generalization is poor.

6.1.2. Real-World Performance

The following are the results from Table III of the original paper:

| Task | $R_{\mathrm{succ}}$ | $R_{\mathrm{precision}}$ (m) | $T_{\mathrm{exec}}$ (s) | $M_{\mathrm{range}}$ (m) |
|---|---|---|---|---|
| Carry Box | 8/10 (lift), 6/10 (full) | 0.19 (±0.10) | 10.5 (±2.8) | 5.69 |
| Sit Down | 9/10 | 0.07 (±0.03) | 6.2 (±1.3) | 4.14 |
| Lie Down | 8/10 | 0.16 (±0.07) | 6.7 (±1.0) | 3.76 |
| Stand Up | 8/10 | / | 2.3 (±0.4) | 1.74 |

Key Findings from Table III and accompanying text:

  • Competitive Success Rates with High Precision: PhysHSI achieves zero-shot transfer and successfully completes all four HSI tasks in real-world settings. For the Carry Box task, it records an 8/10 success rate for lifting and 6/10 for the full sequence, with placement errors (finish precision) under 20 cm (0.19 m ± 0.10 m). The Sit Down task shows particularly strong performance with 9/10 success and very high precision (0.07 m ± 0.03 m). Lie Down and Stand Up also show robust performance (8/10 success each).
  • Effective Generalization in Real World: The system generalizes effectively across variations in spatial layout and object properties. It handles locomotion over distances up to 5.7 m (for Carry Box) and accommodates diverse box dimensions, heights, and weights (as further detailed in the Limitation Analysis). This is demonstrated in Figure 4 (not provided in the prompt, but described in the paper as showing examples of varied scene configurations).
  • Natural, Human-like Motions: The policies learned in simulation, particularly the catwalk-style locomotion inherited from the AMASS data, transfer well to the real robot. The framework also supports stylized motion learning, allowing the system to produce diverse locomotion styles such as dinosaur-like walking or high-knee stepping (Fig. 1(e), not provided in the prompt but described).
  • Portability and Autonomy: PhysHSI can be deployed outdoors using only onboard sensing and computation (Fig. 1(a)-(c), referring to the abstract image showing outdoor scenarios). This highlights the system's portability and independence from external infrastructure like MoCap systems, a significant advantage for practical applications.

6.2. Ablation Studies / Parameter Analysis

The following are the results from Table II of the original paper:

| Method | Carry Box: R_{\mathrm{succ}} (%, ↑) | Carry Box: S_{\mathrm{human}} (↑) | Sit Down: R_{\mathrm{succ}} (%, ↑) | Sit Down: S_{\mathrm{human}} (↑) |
| --- | --- | --- | --- | --- |
| Ablation on Data Processing | | | | |
| w/o Smoothness | 63.28 (±11.72) | 2.33 (±1.03) | 87.24 (±2.19) | 1.33 (±0.23) |
| w/o Object | 55.42 (±8.17) | 2.60 (±0.57) | 72.36 (±6.71) | 3.50 (±0.70) |
| PhysHSI | 79.34 (±4.71) | 3.83 (±0.24) | 91.32 (±2.48) | 4.77 (±0.05) |
| Ablation on RSI Strategy | | | | |
| w/o RSI | 41.24 (±6.92) | 2.50 (±1.63) | 78.24 (±3.91) | 4.50 (±0.00) |
| Naive RSI | 5.70 (±2.38) | 0.50 (±0.0) | 18.70 (±5.33) | 1.83 (±0.62) |
| Hybrid RSI | 79.34 (±4.71) | 3.83 (±0.24) | 91.32 (±2.48) | 4.77 (±0.05) |
| Ablation on Mask Strategy (for dynamic objects) | | | | |
| w/o Obs Mask | 85.90 (±2.90) | 4.30 (±0.14) | I | I |
| PhysHSI | 79.34 (±4.71) | 3.83 (±0.24) | 91.32 (±2.48) | 4.77 (±0.05) |

Key Observations from Ablation Analysis:

  • Data Quality and Object Annotation are Critical:

    • w/o Smoothness: Training without smoothed motion data significantly reduces both success rate and human-likeness score. For Carry Box, R_{\mathrm{succ}} drops from 79.34% to 63.28%, and S_{\mathrm{human}} drops from 3.83 to 2.33. This indicates that raw, unsmoothed data introduces artifacts (e.g., jittering end-effectors, abrupt motion shifts) that the discriminator may exploit, leading to unnatural behaviors.
    • w/o Object: Removing object annotations from the training data drastically increases failure rates. For Carry Box, R_{\mathrm{succ}} drops from 79.34% to 55.42%. This confirms that object states in AMP observations are essential for the policy to learn stage transitions (e.g., distinguishing between approaching, carrying, or placing an object) and appropriate motion styles conditioned on the object's presence and state. For example, the discriminator guides the robot to walk towards a distant object and keep a grasped box centered.
  • Hybrid RSI is Crucial for Generalization and Efficiency:

    • w/o RSI: Training without any Reference State Initialization leads to much lower success rates (e.g., 41.24% for Carry Box), indicating poor exploration efficiency as the robot rarely experiences critical transitions.
    • Naive RSI: Simply initializing all episodes from reference states fixed to dataset settings performs even worse (5.70% for Carry Box, 18.70% for Sit Down) and yields very low human-likeness scores. This demonstrates severe overfitting and poor generalization because the policy is unable to adapt to any scene configurations outside the limited demonstrations.
    • Hybrid RSI: The proposed Hybrid RSI strategy significantly improves both generalization and sample efficiency, yielding the highest success rates and strong human-likeness among the RSI variants. It balances exploration (by starting some episodes from informative reference states) against exposure to diverse scene parameters; a sketch of such a reset scheme follows this list.
  • Mask Processing Has Limited Impact on Overall Performance:

    • w/o Obs Mask: Training without the masking strategy (i.e., using complete object states even when they would be unobservable in the real world) achieves slightly higher success rates (85.90% for Carry Box) and human-likeness scores (4.30) than PhysHSI. This represents an upper bound on performance, as the policy has access to perfect information.
    • PhysHSI (with Obs Mask): While slightly slower in training and showing a marginal drop in performance compared to w/o Obs Mask, the masking strategy ensures that the policy is robust to real-world sensing limitations. The minimal impact on the final success rate (e.g., 79.34% vs 85.90% for Carry Box) suggests that the domain randomization and asymmetric actor-critic framework effectively compensate for the partial observations, making the policy ready for real-world deployment. (The "I" in the table for Sit Down w/o Obs Mask likely indicates 'irrelevant' or 'not applicable' for this specific ablation, as mask strategy is for dynamic objects, and Sit Down usually involves static objects.)
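The RSI ablation suggests the value of mixing demonstration-anchored resets with fully randomized ones. Below is a hypothetical sketch of what such a hybrid reference-state-initialization reset could look like; the `env` API, the `ReferenceClip` container, and the 0.5 mixing ratio are all illustrative assumptions, since the paper's exact scheme is not reproduced in this analysis.

```python
import random
from dataclasses import dataclass

@dataclass
class ReferenceClip:
    """One retargeted humanoid-scene interaction demo (illustrative container)."""
    duration: float   # clip length in seconds
    states: list      # time-ordered robot + object states

    def sample_state(self, t: float):
        # Nearest stored state for phase t in [0, duration].
        idx = min(int(t / self.duration * (len(self.states) - 1)), len(self.states) - 1)
        return self.states[idx]


def hybrid_reset(env, ref_dataset: list, p_ref: float = 0.5):
    """Hypothetical hybrid RSI reset: mix demo-anchored starts with randomized scenes.

    With probability p_ref, the episode starts from a state sampled along a
    reference clip, so the robot frequently experiences late-stage transitions
    (lifting, sitting contact) that are hard to reach by exploration alone.
    Otherwise it starts from a default pose with the scene (object pose, chair
    height, goal location) randomized over the full task distribution.
    The env methods below (set_state / reset_to_default_pose / randomize_scene)
    are assumed for illustration, not taken from the paper.
    """
    if ref_dataset and random.random() < p_ref:
        clip = random.choice(ref_dataset)
        t = random.uniform(0.0, clip.duration)
        env.set_state(clip.sample_state(t))
    else:
        env.reset_to_default_pose()
        env.randomize_scene()
```

Pure demo-anchored resets (Naive RSI in the table) tie the scene to the few demonstrated configurations and overfit, while no RSI at all starves the policy of late-stage experience; the mixture is what recovers both exploration efficiency and generalization.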

6.3. Object Localization Module Analysis

The effectiveness of the coarse-to-fine object localization module was evaluated using 17 real-world HSI trials, with 15 successful completions. The object trajectories estimated by the module were compared against ground-truth trajectories from a MoCap system.
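As a rough illustration of how such a comparison against MoCap ground truth can be turned into an error-versus-distance curve (the kind of plot shown later in Fig. 5(a)), here is a small hypothetical helper; the synchronization of samples, the 0.5 m bin width, and the array layout are assumptions, not the authors' evaluation code.

```python
import numpy as np

def error_vs_distance(est_obj_pos, gt_obj_pos, robot_pos, bin_width_m: float = 0.5):
    """Bin object-localization error by robot-object distance (illustrative analysis).

    est_obj_pos, gt_obj_pos, robot_pos: (N, 3) arrays of time-synchronized samples.
    Returns (bin_centers, mean_error_per_bin) in meters.
    """
    est, gt, rob = map(np.asarray, (est_obj_pos, gt_obj_pos, robot_pos))
    err = np.linalg.norm(est - gt, axis=1)    # per-sample localization error (m)
    dist = np.linalg.norm(gt - rob, axis=1)   # robot-object distance (m)
    bins = np.arange(0.0, dist.max() + bin_width_m, bin_width_m)
    idx = np.digitize(dist, bins)
    centers, means = [], []
    for b in range(1, len(bins) + 1):
        mask = idx == b
        if mask.any():
            centers.append(bins[b - 1] + bin_width_m / 2)
            means.append(err[mask].mean())
    return np.array(centers), np.array(means)
```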

The following figure (Figure 5 from the original paper) shows the real-world localization system analysis:

Fig. 5: Real-World Localization System Analysis. (a) Localization error versus robot–object distance, with coarse-to-fine transition statistics and distribution. (b) A representative object localization trajectory, highlighting three stages: (i) coarse localization, (ii) fine localization, and (iii) grasp.

Key Findings from Fig. 5:

  • Coarse-to-Fine Transition: As shown in Fig. 5(a), the localization error is relatively large (0.35 m on average) when the robot is far from the object and relies on coarse, LiDAR-based localization. Once the robot is within roughly 2.4 m of the object, AprilTag detection activates and the system transitions to fine, camera-based localization, where the average error drops to 0.05 m. This clearly demonstrates the effectiveness of the coarse-to-fine design: coarse localization provides sufficient directional guidance from a distance, while fine localization delivers precise positions at close range (a minimal sketch of this switching logic follows this list).
  • Robustness: The high overall success rate (15/17 trials) further confirms the robustness of the module. The two failures were attributed to coarse guidance deviating too far (preventing the AprilTag from entering the camera's FoV) and a system crash.
  • Error Source Analysis (Fig. 5b):
    • (i) Coarse Stage: In this stage, errors primarily originate from the manually specified goal point in the LiDAR point-cloud visualization. For instance, an initial deviation of 0.3 m from the exact position was observed. Despite this, the estimated and ground-truth trajectories show consistent trends, which is adequate for long-range guidance.
    • (ii) Fine Stage: During fine localization, errors are mainly due to odometry drift from FAST-LIO and AprilTag noise. However, these errors remain small, and the estimated trajectories closely align with the ground truth.
    • (iii) Grasping Stage: At very close range, particularly during rapid manipulative motions like grasping, AprilTag noise becomes more pronounced and dominates the errors. This highlights the inherent challenges of precise perception during dynamic, close-quarter interactions.
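The stage-dependent error profile above maps naturally onto a simple switching rule. The sketch below is a stripped-down illustration of that coarse-to-fine idea, assuming the object position is expressed in the odometry frame and that a tag detection is either available or not; the argument names and data structures are hypothetical, and the real module additionally handles detection dropouts, odometry drift from FAST-LIO, and frame transforms.

```python
import numpy as np

def estimate_object_position(coarse_goal_odom, tag_detection):
    """Hypothetical coarse-to-fine object position estimate.

    Coarse stage: with no tag detection (typically beyond ~2.4 m in the paper's
    trials), fall back to the manually specified goal point in the LiDAR/odometry
    frame (about 0.35 m average error in Fig. 5(a)).
    Fine stage: once the AprilTag is visible to the onboard camera, use the
    tag-derived position instead (about 0.05 m average error).
    """
    if tag_detection is not None:
        return np.asarray(tag_detection["position_odom"])  # fine, camera-based
    return np.asarray(coarse_goal_odom)                    # coarse, LiDAR-frame goal

# Far away, no tag in view: the coarse goal still points in roughly the right direction.
print(estimate_object_position([4.0, 1.2, 0.4], tag_detection=None))
# Close up, tag detected: switch to the precise tag-derived position.
print(estimate_object_position([4.0, 1.2, 0.4],
                               tag_detection={"position_odom": [3.92, 1.18, 0.42]}))
```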

6.4. System Limitation Analysis

The paper also conducted a limitation analysis on the carry box task to understand the boundaries of the system's capabilities.

The following are the results from Table IV of the original paper:

| Box Height | 0 cm | 20 cm | 40 cm | 60 cm |
| --- | --- | --- | --- | --- |
| R_{\mathrm{succ}} (↑) | 2/3 | 3/3 | 3/3 | 1/3 |

| Box Weight | 0.6 kg | 1.2 kg | 2.3 kg | 3.6 kg | 4.5 kg |
| --- | --- | --- | --- | --- | --- |
| R_{\mathrm{succ}} (↑) | 2/3 | 3/3 | 2/3 | 1/3 | 0/3 |

| Maximum Box Size | 20 cm | 30 cm | 40 cm | 45 cm |
| --- | --- | --- | --- | --- |
| R_{\mathrm{succ}} (↑) | 2/3 | 3/3 | 2/3 | 3/3 |

Key Findings from Table IV and accompanying text:

  • Box Height Limitations: The humanoid robot can carry boxes placed at heights within [0, 60] cm, but performance degrades at the upper end of that range (1/3 success at 60 cm), primarily because the box exceeds the camera's vertical FoV even when the robot is stationary, making AprilTag detection difficult.

  • Box Weight Limitations: The robot can handle weights in the range of [0.6, 3.6] kg, though success already drops at the top of that range (1/3 at 3.6 kg) and fails entirely at 4.5 kg (0/3). These failures stem from hardware constraints, specifically motor overheating and the limited clamping force of the rubber hands.

  • Box Size Limitations: The system performs well for boxes whose maximum dimension lies within [20, 45] cm. Larger or wider boxes (beyond 45 cm) cannot be handled effectively because of the robot's limited arm reach and the grip of its rubber hands.

    Beyond these specific findings, the authors identify several broader limitations:

  • Hardware Constraints: The Unitree G1 robot's rubber hand has limited clamping force, restricting its ability to manipulate larger or heavier boxes. Excessive weight can also lead to motor overheating and potential hardware failures.

  • Large-Scale High-Quality HSI Data: The current approach relies on manual post-annotation of objects in retargeted humanoid motion data. This manual process is time-consuming and does not scale well to generate the large quantities of high-quality HSI data needed for more complex and diverse tasks.

  • Automated Perception Module: The existing object localization module is modular, combining LiDAR and camera inputs. While effective, its reliance on specific components (e.g., AprilTags, manual initialization) introduces complexity and potential fragility. A more automated perception module, potentially incorporating active perception (where the robot actively moves to improve its perception), could enhance robustness and simplify deployment.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduced PhysHSI, a novel real-world system designed for generalizable and natural humanoid-scene interaction. By integrating an effective simulation training pipeline with a robust real-world deployment module, PhysHSI enables humanoid robots to autonomously perform complex tasks like box carrying, sitting down, lying down, and standing up. The system achieves high success rates, demonstrates strong generalization across diverse spatial layouts and object properties, and produces remarkably natural and human-like motion behaviors. A key enabler is the AMP-based training pipeline that learns from humanoid interaction data in simulation, coupled with a pragmatic coarse-to-fine object localization module that fuses LiDAR and camera inputs for continuous and robust scene perception using only onboard sensors. The portability of PhysHSI, exemplified by its successful deployment in outdoor environments with minimal manual initialization (a coarse object pose and a single fiducial tag), marks a significant step towards practical, autonomous humanoid robotics. This work serves as an initial yet comprehensive exploration into real-world HSI tasks, paving the way for more advanced object and scene interaction capabilities.

7.2. Limitations & Future Work

The authors highlight several limitations of PhysHSI and suggest future research directions:

  • Hardware Constraints: The current Unitree G1 robot's rubber hand and motor capabilities restrict manipulation to smaller and lighter objects. Future work could explore adapting the system to robots with more dexterous manipulators or investigate techniques for handling heavier loads without motor overheating or failure.
  • Scalability of HSI Data: The reliance on manual post-annotation of object information in retargeted motion data is a bottleneck. Developing automated methods for generating large-scale, high-quality HSI data is crucial for expanding the system's capabilities to a wider range of interactions and objects.
  • Automated Perception: The current object localization module, while robust, is modular and requires some manual initialization (e.g., coarse object position). Future research could focus on developing a more fully automated perception module, possibly incorporating active perception strategies where the robot intelligently moves to gather better sensor data, thereby improving robustness and simplifying deployment for truly unknown environments.

7.3. Personal Insights & Critique

PhysHSI presents a compelling blend of advanced learning techniques and pragmatic real-world engineering. Its key innovation lies in effectively bridging the sim-to-real gap for complex AMP-based policies, an area where many prior motion-imitation works have fallen short. The hybrid RSI strategy is a clever way to balance efficient exploration with robust generalization, addressing a common challenge in RL from demonstrations. The coarse-to-fine perception module is likewise a well-thought-out design, leveraging the strengths of different sensor modalities to overcome their individual weaknesses and making real-world deployment feasible with readily available onboard hardware.

Transferability and Applicability: The methodologies presented in PhysHSI are highly transferable. The AMP-based training pipeline could be adapted to learn other complex whole-body skills or manipulation tasks for humanoid robots, not just HSI. The coarse-to-fine perception paradigm could be applied to various robotic platforms (e.g., mobile manipulators) requiring robust and continuous object localization in dynamic environments, such as logistics, assistive robotics in homes, or industrial inspection tasks. The insights on domain randomization and asymmetric actor-critic are also broadly applicable for sim-to-real transfer in robot learning.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  1. Dependency on Fiducial Markers (AprilTags): While AprilTags provide excellent precision, their requirement for objects to be tagged limits the system's application in truly unstructured environments where objects cannot be modified. Future work could investigate tag-less object pose estimation using 3D object models, visual tracking, or neural radiance fields (NeRFs) for real-time object reconstruction, enhancing generalization to arbitrary objects.

  2. Subjectivity of Human-likeness Score: The human-likeness score evaluated by Gemini-2.5-Pro is an interesting and novel approach. However, the inherent subjectivity of such qualitative measures, even from an advanced AI, warrants consideration. A more comprehensive evaluation could involve human user studies (e.g., Mechanical Turk or expert evaluation) to validate the scores and provide a more robust assessment of perceived naturalness.

  3. Scalability of Data Annotation: The paper acknowledges the manual annotation bottleneck for HSI data. Exploring generative models that can synthesize diverse humanoid-object interaction data (perhaps from simpler demonstrations or linguistic prompts) or self-supervised learning from large-scale unlabeled video data could significantly improve scalability.

  4. Hardware Constraints as a Fundamental Limit: While recognized, the hardware constraints of the Unitree G1 (e.g., rubber hand, motor limits) are quite fundamental. For genuinely advanced real-world HSI, future robot hardware would need to evolve significantly, or the policies would need to become even more adaptive to work within very strict physical boundaries. The current clamping mechanism might struggle with varying object geometries or textures.

  5. Explanations of Specific Components: The paper mentions L2C2 smoothness regularization as an important motion constraint but does not detail its implementation or provide specific formulas. For a novice reader, a brief explanation of how this regularization works would be beneficial.

  6. Real-time Computation Load: While stating everything runs onboard a Jetson Orin NX, details on the real-time performance, inference latency, and computational load of the entire perception-to-action pipeline would be valuable for assessing its practical deployability in high-frequency interaction scenarios.

    Overall, PhysHSI is a strong foundational paper that addresses critical challenges in real-world HSI. Its innovative integration of AMP with a robust onboard perception system sets a promising direction for the development of highly capable and autonomous humanoid robots.
