HOMIE: Humanoid Loco-Manipulation with Isomorphic Exoskeleton Cockpit
TL;DR Summary
HOMIE integrates isomorphic exoskeleton arms, motion-sensing gloves, and RL-based body control to enable efficient, coordinated humanoid locomotion and manipulation, enhancing task speed and versatility via a data flywheel approach.
Abstract
Generalizable humanoid loco-manipulation poses significant challenges, requiring coordinated whole-body control and precise, contact-rich object manipulation. To address this, this paper introduces HOMIE, a semi-autonomous teleoperation system that combines a reinforcement learning policy for body control mapped to a pedal, an isomorphic exoskeleton arm for arm control, and motion-sensing gloves for hand control, forming a unified cockpit to freely operate humanoids and establish a data flywheel. The policy incorporates novel designs, including an upper-body pose curriculum, a height-tracking reward, and symmetry utilization. These features enable the system to perform walking and squatting to specific heights while seamlessly adapting to arbitrary upper-body poses. The exoskeleton, by eliminating the reliance on inverse dynamics, delivers faster and more precise arm control. The gloves utilize Hall sensors instead of servos, allowing even compact devices to achieve 15 or more degrees of freedom and freely adapt to any model of dexterous hands. Compared to previous teleoperation systems, HOMIE stands out for its exceptional efficiency, completing tasks in half the time; its expanded working range, allowing users to freely reach high and low areas as well as interact with any objects; and its affordability, with a price of just $500. The system is fully open-source; demos and code can be found at https://homietele.github.io/.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
HOMIE: Humanoid Loco-Manipulation with Isomorphic Exoskeleton Cockpit
1.2. Authors
The authors of the paper are Qingwei Ben, Feiyu Jia, Jia Zeng, Junting Dong, Dahua Lin, and Jiangmiao Pang. The first three authors contributed equally. They are affiliated with the Shanghai AI Laboratory and the Multimedia Laboratory at The Chinese University of Hong Kong. These institutions and authors are active in the fields of robotics, computer vision, and artificial intelligence, contributing to cutting-edge research in humanoid control and machine learning.
1.3. Journal/Conference
The paper was posted to arXiv, a preprint server for academic papers; the version analyzed here was submitted on February 18, 2025. arXiv is a widely used platform in fields like AI and robotics for the rapid dissemination of research findings before formal peer review, and this preprint has not yet been formally published at a conference or in a journal.
1.4. Publication Year
2025 (as listed on the preprint). The version analyzed was submitted on February 18, 2025.
1.5. Abstract
The paper addresses the significant challenges of generalizable humanoid loco-manipulation, which requires a robot to both move around (locomotion) and interact with objects (manipulation). To solve this, the authors introduce HOMIE, a semi-autonomous teleoperation system. HOMIE consists of a "cockpit" that includes: 1) a pedal for controlling a reinforcement learning (RL) policy that manages the robot's body movement, 2) an isomorphic exoskeleton for direct arm control, and 3) motion-sensing gloves for hand control. The RL policy features novel designs like an upper-body pose curriculum and a height-tracking reward, allowing the robot to walk and squat to specific heights while adapting to any upper-body pose. The isomorphic hardware bypasses the need for complex inverse kinematics, resulting in faster and more precise control. The gloves, using Hall sensors, offer high degrees of freedom (15-DoF) and are adaptable to various robotic hands. The paper highlights that HOMIE is twice as efficient (completing tasks in half the time), has a larger operational range, and is highly affordable ($500). The entire system is open-source.
1.6. Original Source Link
- Original Source: https://arxiv.org/abs/2502.13013
- PDF Link: https://arxiv.org/pdf/2502.13013v2.pdf
- Publication Status: This is a preprint available on arXiv and has not yet undergone formal peer review for a conference or journal publication.
2. Executive Summary
2.1. Background & Motivation
The ultimate goal of humanoid robotics is to create machines that can operate effectively in human environments, performing a wide range of physical tasks. Achieving this requires generalizable humanoid loco-manipulation—the ability for a robot to skillfully coordinate its entire body to walk, balance, and manipulate objects.
However, the field has historically been fragmented into two lines of work, each missing what the other provides:
- Advanced Locomotion Policies: Researchers have used Reinforcement Learning (RL) to train robots to walk, run, and maintain balance in complex environments. These policies are robust but often lack intuitive, real-time interfaces for an operator to guide the robot's hands and arms for manipulation tasks.
- Advanced Manipulation Systems: Teleoperation systems have been developed to allow a human to precisely control a robot's arms and hands. However, these systems typically assume the robot is stationary, ignoring locomotion. This severely limits the robot's workspace and its ability to interact with the broader environment.
The core problem is the lack of a unified system that seamlessly integrates dynamic whole-body locomotion with precise upper-body manipulation. This paper's innovative idea is to create a semi-autonomous teleoperation system that intelligently divides the control problem between an AI policy and a human operator. The AI handles the difficult, non-intuitive task of dynamic balancing and leg movement, while the human provides high-level locomotion goals and performs direct, intuitive control of the arms and hands.
2.2. Main Contributions / Findings
The paper presents HOMIE, a system designed to bridge this gap. Its main contributions are:
- A Novel Humanoid Teleoperation Cockpit: HOMIE integrates three key components into a single, unified control station: an RL-based policy for lower-body control, an isomorphic exoskeleton for arm control, and motion-sensing gloves for hand control. This allows a single operator to command the robot's full body efficiently.
- Loco-Manipulation Without Motion Priors: The paper introduces the first successful implementation of a teleoperation-compatible humanoid policy that can perform dynamic walking and squatting without relying on pre-recorded motion capture (MoCap) data. This is a significant step towards scalability, as MoCap data is expensive and time-consuming to collect.
- A Cost-Effective and High-Performance Hardware System: The entire hardware cockpit costs only $500. By using an isomorphic design, it bypasses computationally slow and often inaccurate methods like inverse kinematics (IK), enabling faster and more precise control than existing systems. Experiments show it can reduce task completion times by half compared to VR-based approaches.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Humanoid Loco-manipulation: This term refers to the combined ability of a humanoid robot to perform locomotion (moving from one place to another, e.g., walking, running) and manipulation (interacting with objects using hands or arms, e.g., grasping, pushing). The challenge lies in coordinating the entire body to perform both simultaneously, as moving the arms and torso shifts the center of mass and requires the legs to constantly adjust to maintain balance.
- Teleoperation: A method of controlling a robot from a distance. A human operator's movements or commands are captured and transmitted to the robot, which then executes them. In HOMIE, the operator's arm, hand, and foot movements are used to control the robot.
- Reinforcement Learning (RL): A type of machine learning where an "agent" (the robot's control policy) learns to make decisions by interacting with an "environment". The agent receives "rewards" or "penalties" for its actions and learns a strategy, or "policy," to maximize its cumulative reward over time. HOMIE uses RL to train a policy that controls the robot's lower body for stable walking and squatting.
  - Proximal Policy Optimization (PPO): A popular RL algorithm used in this paper. PPO is known for its stability and efficiency. It works by updating the policy in small, controlled steps, preventing drastic changes that could destabilize the learning process.
- Kinematics (Forward vs. Inverse):
  - Forward Kinematics (FK): If you know the angles of all the joints in a robot's arm, FK is the calculation that tells you the final position and orientation of the robot's hand (the end-effector). This is a straightforward, one-to-one calculation.
  - Inverse Kinematics (IK): This is the reverse, and much harder, problem: given where you want the robot's hand to be, IK calculates the joint angles needed to get it there. This is complex because there can be multiple (or no) solutions, and the calculation is often an iterative, computationally expensive process that may not be perfectly accurate. Many teleoperation systems suffer from delays and inaccuracies due to IK. (A minimal two-link example contrasting the two appears after this list.)
- Isomorphic Exoskeleton: An exoskeleton (a wearable robotic frame) that has the same kinematic structure (i.e., the same number of joints, arranged in the same way) as the robot arm it controls. Because the structures are identical, the joint angles measured from the exoskeleton can be directly mapped to the robot's joints. This is called joint-matching and completely bypasses the need for IK, leading to very fast and precise control. This is a core principle of HOMIE's design.
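To make the FK/IK contrast concrete, here is a minimal sketch (not from the paper) of a planar two-link arm: forward kinematics is a single closed-form evaluation, while inverse kinematics requires an iterative numerical solve that can be slow or fail to converge. The link lengths and solver settings below are illustrative assumptions.

```python
import numpy as np

L1, L2 = 0.3, 0.25  # illustrative link lengths (m)

def forward_kinematics(q):
    """Closed-form FK for a planar 2-link arm: joint angles -> hand (x, y)."""
    x = L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1])
    y = L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])
    return np.array([x, y])

def inverse_kinematics(target, q_init, iters=200, step=0.5, tol=1e-4):
    """Iterative Jacobian-based IK: may need many steps and can fail to converge."""
    q = np.array(q_init, dtype=float)
    for _ in range(iters):
        err = np.asarray(target) - forward_kinematics(q)
        if np.linalg.norm(err) < tol:
            break
        # Finite-difference Jacobian of the hand position w.r.t. the joint angles
        J = np.zeros((2, 2))
        eps = 1e-6
        for i in range(2):
            dq = np.zeros(2)
            dq[i] = eps
            J[:, i] = (forward_kinematics(q + dq) - forward_kinematics(q)) / eps
        q += step * np.linalg.pinv(J) @ err
    return q

# Joint-matching, in contrast, needs neither step: the exoskeleton's measured
# joint angles are themselves the robot's joint targets.
```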
3.2. Previous Works
The paper positions HOMIE against two main categories of prior research:
3.2.1. Teleoperation Systems
The paper provides a comprehensive comparison table (Table I) of existing teleoperation systems. The following are the results from Table I of the original paper:
| Teleop System | Cost (\$) | Arm Tracking | Dex-Hand Tracking | Loco-Manip. | Whole-body | No MoCap |
|---|---|---|---|---|---|---|
| Mobile-ALOHA [14] | 32k | Joint-matching | ✗ | ✓ | ✓ | |
| GELLO [15] | 0.6k | Joint-matching | ✗ | ✗ | ✗ | ✓ |
| AirExo [16] | 0.6k | Joint-matching | ✗ | ✗ | ✗ | ✓ |
| ACE [8] | 0.6k | Joint-matching | Vision Retarget | ✗ | ✗ | ✓ |
| DexCap [13] | 4k | Vision Retarget | MoCap + SLAM | ✓ | ✗ | ✗ |
| AnyTeleop [10] | ~0.3k | Vision Retarget | Vision Retarget | ✗ | ✗ | ✓ |
| OpenTelevision [7] | 4k | VR devices | VR devices | ✗ | ✗ | ✓ |
| HumanPlus [1] | 0.05k | Vision Retarget | Vision Retarget | ✓ | ✓ | ✗ |
| OmniH2O [3] | 0-3.5k | Vision / VR | Vision / VR | ✓ | ✓ | ✗ |
| Mobile-TeleVision [6] | 3.5k | VR devices | VR devices | ✓ | ✓ | ✗ |
| HOMIE (Ours) | 0.5k | Joint-matching | Joint-matching | ✓ | ✓ | ✓ |
- High-Cost, High-Fidelity Systems: Some systems like Mobile-ALOHA use a second, identical robot as the controller. This allows for perfect joint-matching but is extremely expensive ($32k).
- Vision/VR-Based Systems: Systems like AnyTeleop and OpenTelevision use cameras or VR controllers to track the operator's hand/wrist pose and then use IK to control the robot. These are cheaper but suffer from inaccuracies, latency, and problems with occlusion (when the camera's view is blocked).
- MoCap-Based Systems: Systems like DexCap use professional motion capture suits. These are very accurate and fast but prohibitively expensive ($4k+), limiting their widespread use.
- Low-Cost Exoskeletons: Systems like GELLO and AirExo use low-cost isomorphic exoskeletons, but they typically only control simple grippers, not fully dexterous hands, and do not address locomotion.
3.2.2. Whole-body Loco-Manipulation
- Model-Based Optimization: Traditional methods often involve solving complex Optimal Control Problems (OCPs) to generate robot movements. These are computationally intensive and struggle to adapt to unexpected changes in real time.
- Reinforcement Learning (RL): More recent work uses RL to train policies for locomotion. However, many successful approaches (HumanPlus, OmniH2O) rely on MoCap data to provide a "motion prior," which guides the learning process. This dependency on expensive MoCap data limits scalability. Furthermore, these systems often lack a way to precisely control the robot's height (e.g., for squatting) and use inconvenient interfaces (like joysticks) that occupy the operator's hands.
3.3. Technological Evolution
The field has been moving from highly specialized, expensive, and fragmented solutions towards more general-purpose, affordable, and integrated ones.
- Early Stage: Separate, high-cost systems for either locomotion or manipulation.
- Intermediate Stage: Attempts to combine them using vision or VR, but with performance trade-offs. Use of expensive MoCap data to bootstrap learning.
- HOMIE's Position: HOMIE represents a significant step towards a "sweet spot": it is low-cost, integrated, high-performance, and scalable (by avoiding MoCap). It intelligently combines the strengths of RL and direct human control in a single, accessible package.
3.4. Differentiation Analysis
HOMIE's core innovations compared to prior work are:
- Unified Control: It is one of the first systems to provide a single, cohesive cockpit for a single operator to control full-body (locomotion + manipulation) actions, unlike most systems that focus on one or the other.
- IK-Free and Vision-Free Control: By using joint-matching for both the arms and the dexterous hands, it avoids the latency and imprecision of IK and vision-based pose estimation, which plague many other low-cost systems.
- MoCap-Free Training: Its RL policy learns robust loco-manipulation skills (including squatting) from scratch, without needing expensive MoCap data as a reference. This makes the approach more scalable and adaptable.
- Ergonomic and Affordable Design: The use of a foot pedal for locomotion frees the operator's hands for manipulation, and the total system cost of $500 makes it accessible to a much wider range of researchers and developers.
4. Methodology
4.1. Principles
The core principle of HOMIE is a semi-autonomous division of labor between a human operator and an RL policy.
- The RL Policy: Handles the complex, high-frequency, and non-intuitive task of whole-body control. It ensures the robot maintains balance, walks, and adjusts its height, all while compensating for the shifting center of mass caused by upper-body movements.
- The Human Operator: Provides high-level strategic commands and performs intuitive, direct control. The operator specifies what to do (e.g., "walk forward," "squat to this height," "grasp this object"), and the policy figures out how to execute the lower-body portion safely. The operator directly controls the arms and hands, leveraging human dexterity.
This hybrid approach combines the strengths of both: the AI's ability to solve complex dynamics problems and the human's superior planning and manipulation skills.
The overall system architecture is depicted below, showing the operator in the cockpit controlling the robot in either the real world or simulation.
Figure: Schematic of the humanoid whole-body teleoperation system and its control strategy. The left side shows an operator wearing the motion-sensing gloves and exoskeleton, controlling the robot's walking and squatting through the foot pedal; the right side shows the control framework, comprising an arm-action component and a torso/locomotion policy component.
4.2. Core Methodology In-depth
4.2.1. System Overview
The HOMIE system operates in a continuous loop:
- Visual Feedback: The operator wears a display showing the robot's first-person view (FPV), allowing for immersive control.
- Locomotion Command: The operator uses a foot pedal to issue locomotion commands specifying the desired forward velocity, turning speed, and torso height.
- Upper-Body Command: Simultaneously, the operator moves their arms and hands, which are tracked by the isomorphic exoskeleton and motion-sensing gloves. This generates the target joint angles for the robot's upper body.
- Robot Execution: The robot receives these commands via Wi-Fi.
- The lower-body commands are fed into the RL policy, which calculates the necessary motor torques to control the legs.
- The upper-body joint angles are directly set on the robot's arm and hand motors.
- Data Flywheel: This teleoperation process can be recorded to collect a dataset of expert demonstrations. This data can then be used with Imitation Learning (IL) to train a fully autonomous policy, which can eventually take over from the human operator. A minimal sketch of the overall teleoperation loop appears below.
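The loop above can be summarized in a short sketch. The function and attribute names (`read_pedal`, `read_exoskeleton`, `read_gloves`, `send_command`) and the loop rate are placeholders for illustration, not the released HOMIE API.

```python
import time

CONTROL_HZ = 100  # illustrative loop rate; the paper reports >260 Hz acquisition

def teleop_loop(robot, cockpit):
    while True:
        # 1. Lower-body command from the foot pedal:
        #    forward velocity, turning speed, and target torso height.
        vx, wz, height = cockpit.read_pedal()

        # 2. Upper-body targets from the isomorphic exoskeleton and gloves,
        #    already expressed as robot joint angles (no IK or retargeting).
        arm_q = cockpit.read_exoskeleton()
        hand_q = cockpit.read_gloves()

        # 3. Send over Wi-Fi: the on-board RL policy tracks the locomotion
        #    command, while arm and hand joint targets are applied directly.
        robot.send_command(locomotion=(vx, wz, height), arm=arm_q, hand=hand_q)

        time.sleep(1.0 / CONTROL_HZ)
```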
4.2.2. Humanoid Whole-body Control
The paper introduces a novel RL training framework to create the policy, which is capable of zero-shot sim-to-real transfer (i.e., it works in the real world without further training). The framework is depicted in the figure below.
Figure: Visualization of the two functions proposed in the paper. The left plot shows how the conditional probability distribution changes with different parameter values; the right plot is a 3D surface of the knee-reward function over its two variables, illustrating the shape of the height-tracking reward design.
Training Settings
- Observation: The policy's input at each step is a history of observations over the last 5 steps. A single observation is a vector containing:
  - The operator's command (desired velocity and height).
  - The robot body's angular velocity.
  - The gravity vector projected into the robot's coordinate frame (indicates orientation).
  - The current angles of all robot joints.
  - The current velocities of all robot joints.
  - The action taken in the previous step.
- Action: The policy's output is a set of target joint positions for the robot's lower body (legs).
- Torque Calculation: These target positions are then used by a Proportional-Derivative (PD) controller to calculate the final motor torque for each joint $i$:
$ \tau_{t,i} = Kp_i \times (a_{t,i} - q_{0,t,i}) - Kd_i \times \dot{q}_{t,i} $
Where:
  - $\tau_{t,i}$: The torque for joint $i$ at time $t$.
  - $Kp_i$ and $Kd_i$: The proportional (stiffness) and derivative (damping) gains for the joint. These are constants.
  - $a_{t,i}$: The target joint angle output by the policy for joint $i$.
  - $q_{0,t,i}$: The default joint position for joint $i$. (Note: standard PD control in RL locomotion usually measures the error against the current joint position $q_{t,i}$; the paper's formula uses $q_{0,t,i}$, which can be read either as a typo for $q_{t,i}$ or as the default pose with the policy action $a_{t,i}$ acting as an offset from it. Adhering to the paper, it is treated here as the default joint position.)
  - $\dot{q}_{t,i}$: The current velocity of joint $i$. (A minimal sketch of this computation follows this list.)
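A minimal sketch of this PD computation, vectorized over joints. The gain values are placeholders; the real gains are per-joint and robot-specific.

```python
import numpy as np

# Placeholder gains for an imaginary 12-joint lower body.
Kp = np.full(12, 100.0)  # proportional (stiffness) gains
Kd = np.full(12, 2.5)    # derivative (damping) gains

def pd_torques(a_t, q_ref, qdot_t):
    """tau = Kp * (a - q_ref) - Kd * qdot, following the formula above.

    q_ref is the joint position the error is measured against: the default
    joint position under the paper's notation, or the current joint position
    under the more common convention discussed in the note above.
    """
    return Kp * (a_t - q_ref) - Kd * qdot_t

# Example call with zero action and state (all torques come out as zero).
tau = pd_torques(np.zeros(12), np.zeros(12), np.zeros(12))
```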
Upper-body Pose Curriculum
To train a policy that is robust to any upper-body movement, the authors introduce a curriculum. Instead of exposing the robot to wildly random arm poses from the start (which would make it fall), they gradually increase the range of motion. This is controlled by the upper action ratio $\rho_a$, which goes from 0 to 1. The key is how they sample poses: instead of simple uniform sampling, they use a distribution that smoothly transitions from sampling small-range poses to large-range poses.
The target joint angle $a_i$ for an upper-body joint is sampled using the following formula, derived from the underlying probability distribution: $ a_i = \mathcal{U} \left( 0, -\frac{1}{20(1-\rho_a)} \ln \left( 1 - \mathcal{U}(0,1) \left( 1 - e^{-20(1-\rho_a)} \right) \right) \right) $ Where:
- $\mathcal{U}(a, b)$ denotes a uniform random sample between the given lower and upper bounds.
- $\rho_a$ is the curriculum progress variable.
- Intuition: This formula ensures that even early in training (when $\rho_a$ is small), there is still a small chance of sampling a large pose. As training progresses and $\rho_a$ approaches 1, the sampling distribution becomes uniform over the entire range. This smooth, gradual increase in difficulty helps the policy learn more effectively. This is visualized in Figure 16 (left) of the paper. (A sampling sketch is given below.)
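A small sketch of drawing samples from this curriculum distribution. The handling of the $\rho_a = 1$ limit and the final scaling to each joint's actual range are assumptions for illustration.

```python
import numpy as np

def sample_pose_fraction(rho_a, rng):
    """Sample a pose magnitude in [0, 1] following the curriculum formula above."""
    lam = 20.0 * (1.0 - rho_a)
    if lam < 1e-6:
        # Assumed limit: at rho_a = 1 the distribution is uniform over [0, 1].
        upper = 1.0
    else:
        u = rng.uniform(0.0, 1.0)
        # Inner term: inverse-CDF sample of an exponential truncated to [0, 1].
        upper = -np.log(1.0 - u * (1.0 - np.exp(-lam))) / lam
    # Outer uniform draw between 0 and the sampled upper bound.
    return rng.uniform(0.0, upper)

rng = np.random.default_rng(0)
# Early training (rho_a small): most samples stay close to 0 (small poses).
early = [sample_pose_fraction(0.1, rng) for _ in range(5)]
# Later training (rho_a closer to 1): larger poses become much more likely.
late = [sample_pose_fraction(0.99, rng) for _ in range(5)]
```

The sampled fraction would then be scaled to each joint's limit range before being applied as an upper-body target.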
Height Tracking Reward
To enable the robot to squat to a commanded target height $h_t$, a novel reward function is introduced: $ r_{knee} = - \left| (h_{r,t} - h_t) \times \left( \frac{q_{knee,t} - q_{knee,min}}{q_{knee,max} - q_{knee,min}} - \frac{1}{2} \right) \right| $ Where:
- $h_{r,t}$: The robot's actual torso height at time $t$.
- $h_t$: The commanded target height.
- $q_{knee,t}$: The current angle of the knee joints.
- $q_{knee,min}$, $q_{knee,max}$: The minimum and maximum possible knee joint angles.
- Intuition:
  - The term $(h_{r,t} - h_t)$ is the height error.
  - The term $\frac{q_{knee,t} - q_{knee,min}}{q_{knee,max} - q_{knee,min}} - \frac{1}{2}$ maps the knee angle to the range $[-\frac{1}{2}, \frac{1}{2}]$. It is negative when the knee is bent and positive when it is straight.
  - When the robot is too high ($h_{r,t} > h_t$), the height error is positive. The reward encourages the knee term to be negative (i.e., bend the knees).
  - When the robot is too low ($h_{r,t} < h_t$), the height error is negative. The reward encourages the knee term to be positive (i.e., straighten the knees). This formulation directly links the height error to the action of bending or straightening the knees, providing a clear learning signal. This is visualized in Figure 16 (right) of the paper. (A direct transcription of this reward into code follows.)
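A direct transcription of the displayed reward into code (a sketch, not the authors' implementation); the heights and knee limits in the example call are made-up values.

```python
import numpy as np

def knee_height_reward(h_robot, h_cmd, q_knee, q_knee_min, q_knee_max):
    """r_knee = -| (h_robot - h_cmd) * (normalized_knee_angle - 1/2) |."""
    knee_term = (q_knee - q_knee_min) / (q_knee_max - q_knee_min) - 0.5
    return -np.abs((h_robot - h_cmd) * knee_term)

# Example: robot standing at 0.72 m while commanded to squat to 0.50 m.
r = knee_height_reward(h_robot=0.72, h_cmd=0.50, q_knee=0.3,
                       q_knee_min=0.0, q_knee_max=2.5)
```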
Symmetry Utilization
To improve data efficiency and ensure the robot learns a balanced, symmetric gait, a two-part technique is used:
- Symmetric Data Augmentation: For every piece of data collected, a "mirrored" version is created by flipping all left-right symmetric values (e.g., left leg joint angles become right leg joint angles, right becomes left, and turning velocity command is negated). Both the original and mirrored data are used for training, effectively doubling the data.
- Symmetry Loss: To explicitly enforce that the policy network itself is symmetric, additional loss terms are added during optimization: $ \mathcal{L}_{sym}^{actor} = \mathrm{MSE}(a_t, a_t') $ and $ \mathcal{L}_{sym}^{critic} = \mathrm{MSE}(V_t, V_t') $, where $a_t$ and $V_t$ are the action and value computed from the original state, and $a_t'$ and $V_t'$ are the (mirrored) action and value computed from the mirrored state. These losses penalize the network if its outputs for symmetric inputs are not themselves symmetric, forcing it to learn a symmetric function. (A sketch of both mechanisms is given below.)
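A minimal PyTorch-style sketch of the two mechanisms. The mirroring is expressed as an index permutation plus sign flips; the actual permutation and sign vectors are robot-specific and are assumptions here, as are the `actor`/`critic` call signatures.

```python
import torch
import torch.nn.functional as F

def mirror(x, perm, signs):
    """Swap left/right entries and negate the ones that flip sign (e.g., yaw)."""
    return x[..., perm] * signs

def symmetry_terms(actor, critic, obs, obs_perm, obs_signs, act_perm, act_signs):
    obs_m = mirror(obs, obs_perm, obs_signs)          # mirrored observation

    # Augmentation: both obs and obs_m (with mirrored target actions) can be
    # added to the PPO batch, effectively doubling the data.

    # Symmetry losses: the policy's output on the mirrored state, mapped back
    # through the action mirror, should match its output on the original state.
    a = actor(obs)
    a_m = mirror(actor(obs_m), act_perm, act_signs)
    loss_actor = F.mse_loss(a, a_m)

    v, v_m = critic(obs), critic(obs_m)               # value should be invariant
    loss_critic = F.mse_loss(v, v_m)
    return loss_actor, loss_critic
```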
4.2.3. Hardware System Design
The hardware cockpit is designed to be low-cost, intuitive, and high-performance.
Figure: Schematic showing (a) the motion-sensing glove with its sensor modules and microcontroller, and (b) the correspondence between the isomorphic exoskeleton and the humanoid's arm and wrist, with the degrees of freedom of each joint and key components labeled, illustrating the structural design of arm control in HOMIE.
Isomorphic Exoskeleton
- Design: A pair of 7-DoF (Degrees of Freedom) arms are 3D-printed. The kinematic structure is designed to be isomorphic (identical in structure) to the arms of the target humanoid robots (Unitree G1 and Fourier GR-1).
- Joint Mapping: Each joint on the exoskeleton is equipped with a servo motor that can read its absolute angle $p_t$. This angle is directly mapped to the corresponding robot joint angle $q_t$ using a simple linear transformation, bypassing IK entirely:
$ q_t = \pm k_t (p_t + \frac{n_t \pi}{2}) + \tau_t $
Where:
  - $q_t$: The target angle for the robot joint.
  - $p_t$: The angle reading from the exoskeleton servo.
  - $\frac{n_t \pi}{2}$: A fixed offset to account for the mounting orientation of the servo.
  - $k_t$: A scaling factor (usually 1).
  - $\tau_t$: An additional compensation offset (usually 0). This direct mapping is extremely fast and accurate (a short sketch follows this list).
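A sketch of the per-joint mapping. The sign, quarter-turn, and offset values in the example are placeholders; in practice they are fixed constants determined once by how each servo is mounted.

```python
import numpy as np

def exo_to_robot_joint(p_t, sign=1.0, k_t=1.0, n_t=0, comp=0.0):
    """q_t = +/- k_t * (p_t + n_t * pi / 2) + comp -- direct joint-matching, no IK."""
    return sign * k_t * (p_t + n_t * np.pi / 2.0) + comp

# Example: map a 7-DoF exoskeleton arm reading to robot joint targets
# (all constants below are placeholders, not the real calibration).
servo_readings = np.zeros(7)
signs = [1, -1, 1, 1, -1, 1, 1]
quarter_turns = [0, 1, 0, 2, 0, 1, 0]
robot_targets = [exo_to_robot_joint(p, sign=s, n_t=n)
                 for p, s, n in zip(servo_readings, signs, quarter_turns)]
```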
Motion-sensing Gloves
- Design: Based on an open-source project, these gloves are designed to be low-cost and highly dexterous. They provide up to 15-DoF of sensing per hand.
- Sensing Technology: Instead of bulky servos or expensive sensors, they use a simple and effective combination of Hall effect sensors and small magnets at each joint. A Hall effect sensor measures magnetic field strength: as a finger joint bends, the magnet attached to it moves relative to the sensor, changing the measured field, and this change is mapped to a joint angle (a calibration sketch follows this list).
- Versatility: The gloves are modular and can be detached from the exoskeleton, making them reusable and adaptable to control various types of dexterous robotic hands.
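A sketch of one plausible raw-reading-to-angle calibration. The per-user min/max calibration procedure is an assumption for illustration, not necessarily the authors' exact method.

```python
def hall_to_angle(raw, raw_min, raw_max, angle_min=0.0, angle_max=1.57):
    """Linearly map a Hall-sensor reading to a finger joint angle (radians).

    raw_min / raw_max would be recorded during a calibration pass with the
    finger fully extended and fully flexed (assumed procedure).
    """
    t = (raw - raw_min) / (raw_max - raw_min)
    t = min(max(t, 0.0), 1.0)  # clamp to the calibrated range
    return angle_min + t * (angle_max - angle_min)

# Example: a mid-range reading maps to roughly half of the joint's travel.
angle = hall_to_angle(raw=2100, raw_min=1500, raw_max=2700)
```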
Foot Pedal
-
Design: A custom-built foot pedal system with three small pedals and two mode-switching buttons.
-
Function: It serves as the locomotion command interface, freeing the operator's hands for manipulation. The operator can control forward/backward speed, turning speed, and squatting height, similar to driving a car. This is more ergonomic and practical for large-scale teleoperation than using a hand-held joystick. The pedal layout is shown below.
Figure: Schematic with two panels, (a) and (b), showing the exoskeleton hardware for the two robots (Unitree G1 and Fourier GR-1) and its mechanical details; panel (b) highlights the U2D2 board, the docking station, the 3D-printed pins, and the vertical sliding mechanism.
5. Experimental Setup
5.1. Datasets
- RL Training: The policy is trained entirely in simulation using Isaac Gym, a high-performance physics simulator from NVIDIA. This allows for massively parallel training across thousands of simulated environments simultaneously, which is key to the efficiency of modern RL. No real-world data or MoCap data is used for this phase.
- Imitation Learning (IL) Data Collection: To test the "data flywheel" concept, the authors use the HOMIE system to collect a dataset for two specific tasks:
  - Squat Pick: The robot must squat down to pick a tomato from a low sofa.
  - Pick & Place: The robot must pick a tomato from a table and place it elsewhere.

  For each task, 50 demonstration episodes were collected. Each data point includes RGB images from the robot's cameras, the robot's state (joint angles, velocities), and the operator's locomotion and upper-body commands. An example of the data collection setup is shown below, followed by a sketch of one recorded data point.
Figure: Schematic of the pedal control system (Figure 6 of the paper). The three small pedals control the three locomotion command channels; the left toggle button switches between left/right modes, and the right toggle button switches between forward/backward modes.
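Based on the description above, one recorded timestep could look like the following structure. The field names, image size, and joint counts are illustrative assumptions, not the released dataset schema.

```python
import numpy as np

step = {
    "rgb": np.zeros((480, 640, 3), dtype=np.uint8),   # first-person camera frame
    "robot_q": np.zeros(29),                           # joint angles (count is robot-specific)
    "robot_dq": np.zeros(29),                          # joint velocities
    "cmd_locomotion": np.array([0.3, 0.0, 0.65]),      # pedal: velocity, turn rate, height
    "cmd_upper_body": np.zeros(14 + 2 * 12),           # arm + hand joint targets
}
```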
5.2. Evaluation Metrics
5.2.1. For RL Policy Evaluation
- Tracking Linear Velocity Error
  - Conceptual Definition: Measures how well the robot's actual forward/backward speed matches the speed commanded by the operator. A lower value is better, indicating more precise control.
  - Mathematical Formula: $E_{lin} = \| \mathbf{v}_t - \mathbf{v}_t^{cmd} \|$
  - Symbol Explanation:
    - $\mathbf{v}_t$: The robot's actual linear velocity vector.
    - $\mathbf{v}_t^{cmd}$: The commanded linear velocity vector from the pedal.
- Tracking Angular Velocity Error
  - Conceptual Definition: Measures how well the robot's actual turning speed matches the turning speed commanded by the operator. A lower value is better.
  - Mathematical Formula: $E_{ang} = \| \boldsymbol{\omega}_t - \boldsymbol{\omega}_t^{cmd} \|$
  - Symbol Explanation:
    - $\boldsymbol{\omega}_t$: The robot's actual angular velocity vector.
    - $\boldsymbol{\omega}_t^{cmd}$: The commanded angular velocity vector from the pedal.
- Tracking Height Error
  - Conceptual Definition: Measures the difference between the robot's actual torso height and the height commanded by the operator via the pedal. A lower value is better, indicating precise squatting control.
  - Mathematical Formula: $E_{h} = | h_{r,t} - h_t |$
  - Symbol Explanation:
    - $h_{r,t}$: The robot's actual torso height.
    - $h_t$: The commanded torso height.
- Symmetry Loss
  - Conceptual Definition: This is the loss function defined in the methodology section, used to enforce policy symmetry. A lower value indicates the learned policy is more symmetric.
  - Mathematical Formula: $\mathcal{L}_{sym}^{actor} = \mathrm{MSE}(a_t, a_t')$ and $\mathcal{L}_{sym}^{critic} = \mathrm{MSE}(V_t, V_t')$
  - Symbol Explanation: As defined in Section 4.2.2.
- Living Time
  - Conceptual Definition: The average duration (in seconds) that the robot can operate in the simulation without falling over. A higher value is better, indicating a more stable policy.

A sketch of computing the tracking metrics above from a rollout log follows.
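The sketch below shows one way the three tracking errors could be computed from a logged rollout. The time-averaging shown here is an assumption; the paper's exact aggregation is not given in this summary.

```python
import numpy as np

def tracking_errors(v_act, v_cmd, w_act, w_cmd, h_act, h_cmd):
    """Mean tracking errors over a rollout; each argument is a (T, ...) array."""
    lin_err = np.mean(np.linalg.norm(v_act - v_cmd, axis=-1))  # m/s
    ang_err = np.mean(np.linalg.norm(w_act - w_cmd, axis=-1))  # rad/s
    height_err = np.mean(np.abs(h_act - h_cmd))                # m
    return lin_err, ang_err, height_err

# Example with a 3-step dummy log (all zeros, so all errors are zero).
T = 3
errs = tracking_errors(np.zeros((T, 2)), np.zeros((T, 2)),
                       np.zeros((T, 3)), np.zeros((T, 3)),
                       np.zeros(T), np.zeros(T))
```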
5.2.2. For IL and Teleoperation System Evaluation
- Success Rate (SR)
  - Conceptual Definition: The percentage of trials in which the robot successfully completes a given task from start to finish. This is the primary metric for evaluating the final autonomous policy.
  - Mathematical Formula: $SR = \frac{\text{Number of successful trials}}{\text{Total number of trials}} \times 100\%$
  - Symbol Explanation: The terms are self-explanatory.
- Task Completion Time
  - Conceptual Definition: The time taken for an operator (or autonomous policy) to complete a task. This is used to measure the efficiency of the HOMIE system compared to baselines. A lower time is better.
5.3. Baselines
- For RL Ablation Studies: The baselines are variants of the proposed method with specific components removed or altered (e.g., w/o cur, which removes the curriculum, and w/o knee, which removes the height-tracking reward). This is done to isolate and validate the contribution of each new technique.
- For Teleoperation Efficiency: The main baseline is OpenTelevision, a representative system that uses VR devices (like a Meta Quest headset) for teleoperation. This provides a direct comparison between HOMIE's isomorphic exoskeleton approach and a common VR-based approach.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Humanoid Whole-body Control Ablations
The authors conducted ablation studies to validate each component of their RL training framework. The results, shown as learning curves over training steps, are presented in the figure below.
Figure: Schematic and photographs showing (a) the finger-joint angle measurement unit built from a Hall sensor and magnet, and (b) a robotic hand paired with the sensing glove, illustrating the Hall-sensor-based multi-DoF hand control described in the paper.
- Upper-body Pose Curriculum: The top row compares ours (with the proposed curriculum), w/o cur (no curriculum), and rand (a simpler curriculum). ours achieves lower tracking errors for both linear and angular velocity, demonstrating that the smooth, gradual introduction of difficulty helps the policy learn more effectively than either no curriculum or a naive one.
- Height Tracking Reward: The middle row compares ours (with the reward), w/o knee (no special knee reward), and hei (simply increasing the weight of the standard height reward). ours converges significantly faster and to a lower height error. The hei baseline performs poorly on locomotion tracking, showing that naively rewarding height can interfere with other objectives. This validates that the specially designed reward provides a much better learning signal for squatting.
- Symmetry Utilization: The bottom row compares ours (augmentation + loss), an augmentation-only variant, a loss-only variant, and none. The results clearly show that data augmentation (ours and the augmentation-only variant) dramatically speeds up training. Furthermore, ours achieves a much lower final symmetry loss than the augmentation-only variant, confirming that the explicit symmetry loss term is effective at enforcing a symmetric policy.
6.1.2. Training on Different Robots
To demonstrate the generality of the framework, it was applied to two morphologically different robots: the Unitree G1 and the taller, heavier Fourier GR-1.
Figure: Schematic of the pedal device structure, including (a) the pedal assembly with a rotary potentiometer and spring, and (b) the switch assembly with a micro switch and conical spring, illustrating how pedal presses and switch triggering work.
The following are the results from Table II of the original paper:
| Metrics | Unitree G1 | Fourier GR-1 |
|---|---|---|
| Lin. Vel Error (m/s) | 0.194 (±0.003) | 0.273 (±0.003) |
| Ang. Vel Error (rad/s) | 0.451 (±0.006) | 0.540 (±0.002) |
| Height Error (m) | 0.022 (±0.019) | 0.038 (±0.003) |
| Symmetry Loss (-) | 0.019 (±0.017) | 0.009 (±0.001) |
| Living Time (s) | 19.947 (±0.092) | 19.960 (±0.035) |
As shown in the table, the framework successfully trained stable and robust policies for both robots, achieving long living times and low tracking errors. This demonstrates that the proposed methods are not specific to one robot but form a general framework for humanoid loco-manipulation.
6.1.3. Teleoperation Hardware Performance
The performance of the hardware itself is a key contribution.
The following are the results from Table III of the original paper:
| Hardware | Cost (\$) | Acquisition Freq. | Acquisition Acc. |
|---|---|---|---|
| Exoskeleton | 430 | 0.26 kHz | ~2^12 (with 360°) |
| Glove | 30 (each) | 0.3 kHz | ~2^12 |
| Pedal | 20 | 0.5 kHz | ~2^12 (with 270°) |
The total cost is remarkably low (~$500), and the acquisition frequencies are high (260-500 Hz). The most critical result is the comparison of the final output frequency to the robot, which shows the advantage of bypassing IK.
The following are the results from Table IV of the original paper:
| Teleop system | Hardware | Arm (Hz) | Hand (Hz) |
|---|---|---|---|
| Telekinesis [20] | 2 RTX 3080 Ti | 16 | 24 |
| AnyTeleop [10] | RTX 3090 | 125 | 111 |
| OpenTeleVision [7] | M2 Chip | 60 | 60 |
| Ours | No GPU / SoC | 263 | 293 |
HOMIE achieves arm and hand control frequencies of 263 Hz and 293 Hz, respectively. This is 2-5 times faster than VR/vision-based systems that rely on powerful GPUs for pose estimation and IK calculations. This high frequency translates to smoother, more responsive, and more precise control for the operator.
6.1.4. Real-World Loco-Manipulation Tasks
The system was deployed on a real Unitree G1 robot to perform a variety of complex tasks, demonstrating its robustness and versatility.
Figure: A montage of teleoperation scenes showing the humanoid's locomotion and manipulation poses across different tasks in the HOMIE system, including walking, squatting, picking up objects, and interacting with a person, illustrating the system's many degrees of freedom and efficient control.
These tasks include:
- Squatting to pick objects from low shelves and placing them on high ones.
- Pushing heavy objects (a 60 kg person in a chair).
- Opening an oven, which requires coordinated pulling and backward walking.
- Bimanual collaboration, such as passing an object from one hand to the other.
These qualitative results powerfully demonstrate that the system successfully integrates locomotion and manipulation in complex, real-world scenarios.
6.1.5. Efficiency and User Study
To quantify the system's efficiency, a direct comparison was made against the VR-based OpenTelevision system on four desktop manipulation tasks.
The results, shown in the chart below, are striking.
Figure: (Figure 3 of the paper) The reinforcement learning training framework of HOMIE, including robot state inputs from Isaac Gym, multi-view observations during the rollout stage, and the Actor, Critic, and Optimizer modules of the PPO algorithm.
HOMIE completed every task in approximately half the time of the VR-based system. The authors note the gap is largest on tasks requiring high precision, where the inaccuracies of vision-based tracking become a bottleneck.
A user study with five novice operators further demonstrated the system's intuitiveness.
Figure: Robots training in the Isaac Gym environment, walking and squatting while their upper-body poses are continuously varied, showing continuous transitions from standing to squatting and between different motions.
The learning curve shows that new users rapidly improved their task completion time, approaching expert-level performance within just five trials. This indicates the system is easy to learn and use.
6.1.6. Autonomous Policy Learning
A key claim of the paper is that HOMIE can create a "data flywheel" for learning autonomous policies. To validate this, they used the collected demonstration data to train an end-to-end visuomotor policy.
The following are the results from Table V of the original paper:
| Tasks | Squat Pick | Pick & Place |
|---|---|---|
| Success Rate (%) | 73.3 | 80.0 |
The trained autonomous policy achieved a 73.3% success rate on the Squat Pick task and an 80.0% success rate on the Pick & Place task. While not perfect, these results are highly promising and serve as a strong proof-of-concept. They confirm that the data collected via HOMIE is of sufficient quality to train complex, whole-body autonomous skills. The robot performing these tasks autonomously is shown below.
Figure: Multiple line charts showing how linear velocity error, angular velocity error, height error, symmetry loss, and living time evolve over training steps for different training strategies, comparing the methods and reflecting how the policy improvements affect control performance.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper introduces HOMIE, a novel semi-autonomous teleoperation cockpit for humanoid loco-manipulation. By intelligently combining a robust RL policy for lower-body control with a high-performance, low-cost isomorphic exoskeleton for upper-body control, HOMIE effectively solves the fragmentation problem in humanoid robotics.
The key contributions are:
- An RL training framework that produces robust loco-manipulation policies capable of walking and dynamic squatting without relying on expensive MoCap data.
- An open-source, $500 hardware system that provides significantly faster and more precise control than existing low-cost alternatives by bypassing IK and vision-based tracking.
- Demonstration of a complete "data flywheel," showing that data collected with HOMIE can be used to successfully train autonomous policies for complex whole-body tasks.

HOMIE represents a significant step towards creating general-purpose humanoid robots by providing an accessible, efficient, and powerful platform for both teleoperation and data collection.
7.2. Limitations & Future Work
The authors candidly acknowledge several limitations and areas for future research:
- Terrain Generalization: The current locomotion policy is trained on flat ground and does not yet handle diverse or challenging terrains like stairs or slopes.
- Glove Ergonomics: The 15-DoF glove design for the thumb is not perfectly aligned with human anatomy, making control of some robotic hands less intuitive than it could be.
- Lack of Force Feedback: The system does not provide haptic (force) feedback to the operator. This limits its effectiveness in tasks that require a delicate sense of touch or force application.
- No Waist Control: The current exoskeleton only controls the arms and does not provide teleoperation for the robot's waist/torso joints, even though the RL policy can accommodate arbitrary upper-body poses.
7.3. Personal Insights & Critique
- Strengths and Inspirations:
- Pragmatic Design Philosophy: The division of labor between the RL policy and the human operator is a brilliant and pragmatic approach. It leverages the strengths of both AI and human intelligence, avoiding the need to solve the full, end-to-end autonomy problem at once.
- Democratization of Research: The emphasis on low cost ($500) and open-sourcing is a massive contribution to the field. It lowers the barrier to entry for high-quality humanoid robotics research, enabling more labs and individuals to contribute.
- Tackling Core Bottlenecks: The work directly addresses two major bottlenecks in robotics: the over-reliance on expensive MoCap data for learning and the performance limitations imposed by IK in real-time control. The solutions presented are elegant and effective.
- Potential Issues and Critique:
  - Hardware Generality: While the software framework is general, the isomorphic nature of the hardware means a new exoskeleton must be designed and built for any robot with a different arm morphology. This could limit the "plug-and-play" applicability of the system across diverse robot platforms.
  - User Study Scope: The user study is promising but limited in scope (5 novices, 1 task). A more extensive study with more users and a wider variety of loco-manipulation tasks would provide stronger evidence for the system's general usability and intuitiveness.
  - The "Flying Blind" Problem: The lack of force feedback is a significant limitation for practical applications. For tasks like inserting a key, plugging in a cable, or handling fragile objects, an operator relies heavily on haptic cues. Without this, the operator is essentially "flying blind," which could lead to failures or damage. Integrating force feedback would be a critical next step for real-world utility.
  - Autonomy Gap: While the 70-80% success rate for the autonomous policy is a great proof-of-concept, it also highlights the gap that still exists between tele-collected data and fully robust, deployable autonomy. The path from "good data" to "perfect policy" remains a significant research challenge.