
Omnigrasp: Grasping Diverse Objects with Simulated Humanoids

Published: 07/16/2024

TL;DR Summary

Omnigrasp is a method for controlling simulated humanoids to grasp over 1200 diverse objects and move them along specified trajectories. It improves control accuracy and training efficiency through a universal humanoid motion representation, requires no paired full-body and object trajectory training data, and demonstrates excellent scalability.

Abstract

We present a method for controlling a simulated humanoid to grasp an object and move it to follow an object's trajectory. Due to the challenges in controlling a humanoid with dexterous hands, prior methods often use a disembodied hand and only consider vertical lifts or short trajectories. This limited scope hampers their applicability for object manipulation required for animation and simulation. To close this gap, we learn a controller that can pick up a large number (>1200) of objects and carry them to follow randomly generated trajectories. Our key insight is to leverage a humanoid motion representation that provides human-like motor skills and significantly speeds up training. Using only simplistic reward, state, and object representations, our method shows favorable scalability on diverse objects and trajectories. For training, we do not need a dataset of paired full-body motion and object trajectories. At test time, we only require the object mesh and desired trajectories for grasping and transporting. To demonstrate the capabilities of our method, we show state-of-the-art success rates in following object trajectories and generalizing to unseen objects. Code and models will be released.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Omnigrasp: Grasping Diverse Objects with Simulated Humanoids

1.2. Authors

  • Zhengyi Luo (Carnegie Mellon University; Reality Labs Research, Meta)
  • Jinkun Cao (Carnegie Mellon University)
  • Sammy Christen (Reality Labs Research, Meta; ETH Zurich)
  • Alexander Winkler (Reality Labs Research, Meta)
  • Kris Kitani (Carnegie Mellon University; Reality Labs Research, Meta)
  • Weipeng Xu (Reality Labs Research, Meta)

1.3. Journal/Conference

The paper is published as a preprint on arXiv (arXiv:2407.11385v2). While arXiv is not a peer-reviewed journal or conference, it is a widely recognized platform for sharing research in physics, mathematics, computer science, and related fields before (or in parallel with) formal publication. The authors' affiliations with Carnegie Mellon University and Reality Labs Research (Meta) suggest a strong research background in robotics, computer vision, and simulated environments.

1.4. Publication Year

2024 (Published at UTC: 2024-07-16T05:05:02.000Z)

1.5. Abstract

The paper introduces Omnigrasp, a method for controlling a simulated humanoid to grasp and manipulate diverse objects, specifically making them follow a predefined trajectory. The core challenge lies in controlling dexterous humanoid hands within a full-body simulation, which prior methods often simplify by using disembodied hands or limiting tasks to simple lifts and short trajectories. Omnigrasp addresses this by learning a controller capable of picking up over 1200 objects and carrying them along randomly generated trajectories. A key innovation is leveraging a humanoid motion representation (PULSE-X) that provides human-like motor skills, significantly accelerating Reinforcement Learning (RL) training. The method uses simplistic reward, state, and object representations, demonstrating scalability across diverse objects and trajectories. Crucially, it does not require a dataset of paired full-body motion and object trajectories for training. At test time, only the object mesh and desired trajectories are needed. The authors report state-of-the-art success rates in trajectory following and generalization to unseen objects, with code and models planned for release.

  • Original Source Link: https://arxiv.org/abs/2407.11385v2
  • PDF Link: https://arxiv.org/pdf/2407.11385v2.pdf
  • Publication Status: Preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem Omnigrasp aims to solve is the challenging task of controlling a simulated humanoid with dexterous hands to grasp diverse objects and precisely manipulate them along arbitrary trajectories. This capability is crucial for generating realistic human-object interactions in applications like animation, virtual reality (VR), augmented reality (AR), and eventually, for controlling real humanoid robots.

Prior research in simulated grasping has largely faced several significant challenges:

  1. Complexity of Full-Body Control: Controlling a bipedal humanoid requires maintaining balance while simultaneously executing dexterous movements with arms and fingers. This high-dimensional control space (e.g., 153 degrees of freedom for the SMPL-X model used) makes Reinforcement Learning (RL) exploration highly inefficient and prone to unnatural motion.

  2. Limited Scope of Prior Work: Many existing methods simplify the problem by using a disembodied hand (a "floating hand") where its root position and orientation are controlled by non-physical forces, thus avoiding the complexities of full-body balance and locomotion. Even when considering full bodies, tasks are often limited to simple actions like vertical lifts or very short, pre-recorded trajectories for a single object or task. This limited scope severely restricts their applicability for dynamic and diverse object manipulation needed for animation and advanced simulations.

  3. Diversity of Objects and Trajectories: Humans can effortlessly grasp a vast array of objects and manipulate them along countless trajectories. Scaling simulated grasping to thousands of diverse object shapes and arbitrary, randomly generated trajectories is a significant hurdle. Each object might require a unique grasping strategy, and each trajectory demands precise full-body coordination. Prior work typically focuses on simple trajectories or requires task-specific policies.

  4. Data Dependency: Many advanced human-object interaction (HOI) methods rely on Motion Capture (MoCap) data, which is scarce for paired full-body motion and object trajectories, especially for diverse objects or complex interactions.

    The paper's entry point and innovative idea revolve around addressing these challenges simultaneously by:

  • Leveraging a universal dexterous humanoid motion representation (PULSE-X): This representation provides human-like motor skills as a structured action space for the RL agent, significantly speeding up training and preventing unnatural motion. This is a crucial motion prior that constrains the RL exploration to plausible human movements.
  • Designing a hierarchical RL framework with pre-grasp guidance: By using a simple stepwise reward function that incorporates pre-grasp poses (hand pose just before grasping) as guidance, the policy can learn to approach and grasp objects effectively without requiring full kinematic grasp synthesis or paired human-object motion data.
  • Training with randomly generated trajectories and hard-negative mining: This allows the system to learn generalized manipulation skills for diverse, unseen objects and trajectories without relying on MoCap interaction data.

2.2. Main Contributions / Findings

The paper makes several significant contributions to the field of simulated humanoid control and dexterous manipulation:

  1. Dexterous and Universal Humanoid Motion Representation (PULSE-X): The authors design and implement PULSE-X, an extension of the PULSE motion representation that incorporates articulated finger motions. This 48-dimensional, full-body-and-hand motion latent space acts as a powerful human-like motion prior in the RL action space. This structured action space dramatically increases sample efficiency during RL training and allows for the use of simpler state and reward designs. This is a critical innovation for enabling stable and natural full-body control, especially for dexterous tasks.

  2. Data-Efficient Grasping Policy Learning: Omnigrasp demonstrates that leveraging this universal motion representation allows for learning grasping policies using only synthetic grasp poses (pre-grasps) and randomly generated trajectories. Crucially, it does not require any dataset of paired full-body human-object motion data. This overcomes a major data bottleneck in the field, making the method highly scalable.

  3. Scalable and Generalizable Humanoid Controller: The paper shows the feasibility of training a single humanoid controller that achieves:

    • High Success Rates: State-of-the-art grasp success rates and trajectory following success rates on complex tasks.

    • Diversity: Capability to grasp and transport a large number of diverse objects (over 1200 objects from the OakInk dataset) and follow arbitrary complex trajectories.

    • Generalization: Robustly generalizes to unseen objects and reference trajectories at test time, showcasing its applicability to novel scenarios without re-training.

    • Bimanual Manipulation: The policy learns to use both hands for grasping and transporting larger or heavier objects, demonstrating emergent, human-like manipulation skills.

      These findings collectively address the limited scope and data dependency issues of prior work, opening new avenues for creating realistic and versatile human-object interactions in simulation and animation, with potential for sim-to-real transfer to humanoid robotics.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand Omnigrasp, a foundational understanding of Reinforcement Learning (RL), humanoid control, and motion representation is essential.

3.1.1. Reinforcement Learning (RL)

Reinforcement Learning (RL) is a paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent interacts with the environment over a sequence of timesteps.

  • Agent: The entity that learns and makes decisions. In this paper, it's the Omnigrasp policy controlling the humanoid.
  • Environment: The world in which the agent operates. Here, it's a physics simulator (Isaac Gym) containing the humanoid and objects.
  • State ($s$): A complete description of the environment at a given timestep. The humanoid's proprioception (joint angles, velocities, contact forces) and goal state (object information, target trajectory) constitute the state.
  • Action ($a$): A decision made by the agent that influences the environment. In Omnigrasp, the action is a latent motion representation that is then decoded into joint actuation targets for the humanoid.
  • Reward ($r$): A scalar feedback signal from the environment indicating how good or bad the agent's last action was. The agent's goal is to learn a policy that maximizes the expected cumulative reward.
  • Policy ($\pi$): A mapping from states to actions, defining the agent's behavior. The Omnigrasp policy $\pi_{\text{Omnigrasp}}$ learns to choose latent codes based on the current state.
  • Markov Decision Process (MDP): A mathematical framework for modeling RL problems. An MDP is defined by a tuple $\mathcal{M} = \langle S, A, T, \mathcal{R}, \gamma \rangle$, where:
    • $S$: Set of possible states.
    • $A$: Set of possible actions.
    • $T$: Transition dynamics, specifying the probability $P(s' \mid s, a)$ of moving to a new state $s'$ given the current state $s$ and action $a$.
    • $\mathcal{R}$: Reward function, defining the reward $r$ received for taking action $a$ in state $s$ and transitioning to $s'$.
    • $\gamma$: Discount factor, a value between 0 and 1 that determines the present value of future rewards.
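
To make this loop concrete, here is a minimal, hypothetical sketch of the agent-environment interaction defined by an MDP; the `env` and `policy` objects are placeholders, not the paper's implementation.

```python
# Minimal agent-environment loop for an episodic MDP (illustrative only).
def rollout(env, policy, gamma=0.99, max_steps=300):
    """Collect one episode and return its discounted return."""
    state = env.reset()                        # initial state s_0
    discounted_return, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(state)                 # a_t ~ pi(a | s_t)
        state, reward, done = env.step(action) # transition T and reward R
        discounted_return += discount * reward
        discount *= gamma                      # apply discount factor gamma
        if done:
            break
    return discounted_return
```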

3.1.2. Proximal Policy Optimization (PPO)

Proximal Policy Optimization (PPO) [67] is a popular on-policy RL algorithm, meaning it learns from experiences generated by the current policy. PPO updates the policy in a stable manner by taking many small steps rather than large updates that could destabilize training: its clipped objective bounds the policy ratio (the probability of an action under the new policy divided by its probability under the old policy), so the new policy cannot deviate too far from the old one. Omnigrasp uses PPO to maximize the discounted reward.
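
For reference, the standard PPO clipped surrogate objective from Schulman et al. [67] (not restated in the paper) is:

$$L^{\text{CLIP}}(\vartheta) = \mathbb{E}_t\Big[\min\big(r_t(\vartheta)\hat{A}_t,\ \text{clip}(r_t(\vartheta), 1-\epsilon, 1+\epsilon)\hat{A}_t\big)\Big], \qquad r_t(\vartheta) = \frac{\pi_\vartheta(a_t \mid s_t)}{\pi_{\vartheta_{\text{old}}}(a_t \mid s_t)}$$

where $\hat{A}_t$ is an advantage estimate and $\epsilon$ is the clipping range (commonly 0.2).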

3.1.3. Humanoid Control

Humanoid control involves developing algorithms and systems to make simulated or physical humanoid robots perform complex behaviors like walking, running, jumping, and manipulating objects. This is highly challenging due to the high degrees of freedom (DoF), complex kinematics and dynamics, and the need for balance.

  • Degrees of Freedom (DoF): The number of independent parameters that define the configuration of a mechanical system. A humanoid can have many DoF (e.g., SMPL-X has 153 DoF), making control complex.
  • Proportional-Derivative (PD) Controller: A common control loop feedback mechanism used to control the target joint positions of a robot. It computes an error value as the difference between a desired setpoint (the PD target action $a_t$ in this paper) and a measured process variable (the current joint position), then applies a correction based on proportional (P) and derivative (D) terms of this error.
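
As a concrete illustration (the gains and exact form below are assumptions, not the simulator's implementation), a PD controller turns the PD target into joint torques as follows:

```python
import numpy as np

def pd_torque(q_target, q, q_dot, kp, kd):
    """Proportional-derivative torque toward target joint positions.

    q_target: desired joint positions (the PD target action a_t)
    q, q_dot: current joint positions and velocities from the simulator
    kp, kd:   per-joint proportional and derivative gains
    """
    return kp * (q_target - q) - kd * q_dot
```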

3.1.4. Motion Representation and Latent Space

  • Motion Representation: A way to encode or describe human (or humanoid) motion in a compact and meaningful format. Traditional representations might use raw joint angles and positions, but these are high-dimensional and redundant.
  • Latent Space: A lower-dimensional, abstract representation of data where similar data points are grouped closer together. In the context of motion, a motion latent space captures the underlying structure and patterns of human movement, allowing RL policies to operate on more meaningful and less noisy actions.
  • Variational Autoencoder (VAE) [32]: A type of generative model that learns a latent space representation of data. A VAE consists of an encoder that maps input data to a latent distribution (typically Gaussian) and a decoder that reconstructs the data from samples drawn from this latent distribution. The training objective includes both a reconstruction loss and a Kullback-Leibler (KL) divergence term to ensure the latent space is well-structured and follows a prior distribution (e.g., unit Gaussian). PULSE-X uses a similar concept of variational information bottleneck for online distillation.
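
For reference, the standard VAE training objective (the evidence lower bound) combines exactly these two terms:

$$\mathcal{L}_{\text{VAE}} = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\vartheta(x \mid z)\big]}_{\text{reconstruction}} - \underbrace{D_{\text{KL}}\big(q_\phi(z \mid x)\,\Vert\,p(z)\big)}_{\text{latent regularization}}$$

where $q_\phi$ is the encoder, $p_\vartheta$ the decoder, and $p(z)$ the prior; in PULSE-X the unit-Gaussian prior is replaced by a learned, proprioception-conditioned prior.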

3.1.5. Grasping and Dexterous Manipulation

  • Grasping: The act of physically interacting with an object using a hand or manipulator to hold, lift, or move it.
  • Pre-grasp: The hand pose and position right before making contact with an object to initiate a grasp. This is a crucial intermediate step that significantly influences the success of a grasp.
  • Dexterous Hands: Robotic hands with multiple articulated fingers that can perform complex manipulation tasks, similar to human hands.
  • Bimanual Manipulation: Tasks that involve the coordinated use of two hands to manipulate an object.

3.2. Previous Works

The paper positions Omnigrasp in contrast to and building upon several lines of research:

3.2.1. Simulated Humanoid Control

  • Traditional Methods: Model-based control [29] and trajectory optimization [36, 82] are used, but deep RL [13, 53] has gained popularity due to its flexibility and scalability.
  • PHC [42]: Perpetual Humanoid Control. A physics-based humanoid motion imitator that can learn complex locomotion skills from MoCap data. Omnigrasp extends this to PHC-X for dexterous hands. PHC directly imitates kinematic poses, but its average 30mm tracking error for hands can be too large for precise grasping.
  • AMP [56]: Adversarial Motion Priors. Uses adversarial learning to train physics-based characters to perform stylized movements from MoCap data, achieving human-like motion. AMP serves as a baseline; the comparison highlights the importance of placing the motion prior in the action space for accurate trajectory following.
  • Full-body Imitators with Limited Scope: Prior work often limits articulated fingers [3, 6, 36, 48] or focuses on single object interaction sequences [77], encountering difficulties in trajectory following [6].

3.2.2. Dexterous Manipulation

  • Disembodied Hand Approaches: A common approach in robotics [7, 8, 11, 12, 15, 16, 19, 37, 61, 74, 84, 95, 96, 97] and animation [2, 6, 34, 100]. These methods often use non-physical virtual forces to control the hand, simplifying the problem by ignoring full-body dynamics.
    • D-Grasp [16]: Leverages the MANO [65] hand model for physically plausible grasp synthesis and 6DoF target reaching.
    • UniDexGrasp [84] and follow-ups [74]: Use the Shadow Hand [1] model for dexterous grasping, often requiring specialized training procedures like generalist-specialist training or curriculum learning for diverse objects.
    • PGDM [17]: Trains grasping policies for individual object trajectories and identifies pre-grasp initialization as crucial for success. Omnigrasp adopts the idea of pre-grasps in its reward design.
  • Full-body Object Manipulation:
    • PMP [3] and PhysHOI [77]: Train one policy per task or object. PhysHOI explicitly uses MoCap human-object interaction data and interaction graphs to imitate human behavior.
    • Braun et al. [6]: Studies a similar setting to Omnigrasp (full-body hand-object interaction synthesis) but relies on MoCap human-object interaction data and uses only one hand. Omnigrasp aims to be data-agnostic for training and supports bimanual motion.

3.2.3. Kinematic Grasp Synthesis

This area focuses on generating hand poses given an object, often from images/videos [5, 10, 10, 18, 21, 38, 46, 50, 83, 88] or for image generation [51, 89].

  • GrabNet [69]: Trained on object shapes from OakInk [85] to generate hand poses. Omnigrasp uses GrabNet to generate pre-grasps as reward guidance when MoCap data is unavailable.
  • TOCH [102] and GeneOH [39]: Focus on denoising dynamic hand pose predictions for object interactions.
  • Generative Methods for Paired HOI: Some methods [22, 35, 68, 71, 72, 81, 90] can create paired human-object interactions, but often require initialization from ground truth [22, 68, 81] or only predict static full-body grasps [72]. The lack of large-scale MoCap data for synchronized full-body and object trajectories is a key challenge that Omnigrasp circumvents.

3.2.4. Humanoid Motion Representation

This field explores methods to create compact and efficient action spaces for RL to improve sample efficiency and produce coherent motion.

  • Motion Primitives [24, 25, 47, 62] or Motion Latent Space [55, 73]: These approaches aim to provide structured action spaces.
  • Part-based Motion Priors [3, 6]: Effective for single task settings but struggle to scale to free-formed motion.
  • PULSE [41]: Physics-based Universal Latent Space for humanoids. A recently proposed universal humanoid motion representation that Omnigrasp extends to PULSE-X for dexterous humanoids. PULSE distills motor skills into a latent representation using a variational information bottleneck.

3.3. Technological Evolution

The evolution of simulated humanoid control has progressed from model-based control and trajectory optimization to deep Reinforcement Learning. Initial RL efforts focused on locomotion and simple body motions, often without considering articulated hands. The introduction of motion imitation methods like PHC and AMP allowed RL agents to learn human-like motion from MoCap data, improving realism and sample efficiency.

Concurrently, dexterous manipulation research primarily focused on disembodied hands due to the complexity of integrating hand dexterity with full-body balance. Methods like D-Grasp and UniDexGrasp advanced grasp synthesis but typically for floating hands.

This paper represents a significant step in converging these two lines of research. Omnigrasp extends universal motion representations (like PULSE) to include dexterous hands, enabling a full-body humanoid to perform complex, generalized grasping and trajectory following. It moves beyond task-specific policies and heavy MoCap data reliance for human-object interaction, pushing towards more general, data-efficient, and scalable solutions. By leveraging an intelligent motion prior, it tackles the exploration problem that plagued earlier RL approaches for high-DoF humanoids performing dexterous tasks.

3.4. Differentiation Analysis

Omnigrasp differentiates itself from previous work in several key aspects:

  1. Full-Body Dexterous Control with Generalization: Unlike most prior work that uses disembodied hands [16, 17, 60, 84] or limits full-body manipulation to single tasks/objects [77] or simple lifts [16, 84], Omnigrasp controls a full-body humanoid with dexterous hands to grasp diverse objects (over 1200) and follow complex, randomly generated trajectories.

  2. Universal Motion Representation as Action Space: The core innovation is using PULSE-X, a unified universal and dexterous humanoid motion latent space, as the action space for the RL policy. This differs from:

    • Direct Joint Actuation: Training directly on joint actuation space (e.g., PPO-10B baseline) leads to unnatural motion and severe exploration problems.
    • Separate Body/Hand Latent Spaces: Prior work like Braun et al. [6] used adversarial latent spaces for body and hands, but these were often small-scale and curated, not achieving high grasping success rates. Omnigrasp proposes a unified latent space that covers both.
  3. Reduced Data Dependency: Omnigrasp learns grasping policies with synthetic grasp poses (pre-grasps) and randomly generated trajectories, without requiring any dataset of paired full-body human-object motion data. This is a significant advantage over methods like Braun et al. [6] and PhysHOI [77], which rely heavily on MoCap human-object interaction data.

  4. Simple State and Reward Design: By leveraging the strong human-like motion prior from PULSE-X, Omnigrasp can use simplistic reward, state, and object representations. It does not require specialized interaction graphs [77, 101] or reference body motion as input, which simplifies the system and enhances generalizability.

  5. Robustness and Scalability: Omnigrasp demonstrates favorable scalability on diverse objects and trajectories and generalizes to unseen objects with state-of-the-art success rates, including emergent bimanual manipulation strategies. This robustness is further shown under observation noise.

    In essence, Omnigrasp provides a more general, data-efficient, and scalable solution for dexterous full-body humanoid control by intelligently structuring the action space with a universal motion prior and designing a hierarchical RL framework that minimizes reliance on specific kinematic data or complex reward engineering.

4. Methodology

The Omnigrasp method for controlling a simulated humanoid to grasp objects and follow object trajectories is structured as a hierarchical Reinforcement Learning (RL) framework, built upon a novel universal dexterous humanoid motion representation. The entire architecture is visualized in Figure 2 from the original paper.

The following figure (Figure 2 from the original paper) shows that Omnigrasp is trained in two stages:

Figure 2: Omnigrasp is trained in two stages. (a) A universal and dexterous humanoid motion representation (PULSE-X) is trained via distillation. (b) Pre-grasp guided grasping training using the pretrained motion representation. The diagram depicts the states, the action decoder, and the physics simulation environment.

4.1. Principles

The core idea behind Omnigrasp is to simplify the Reinforcement Learning task for dexterous humanoid manipulation by providing the RL agent with a high-level, human-like motion vocabulary rather than forcing it to learn raw joint actuations from scratch. This motion vocabulary acts as a powerful motion prior, significantly improving sample efficiency and promoting natural movements.

The theoretical basis and intuition are rooted in the challenges of high-dimensional action spaces in RL. Directly controlling a humanoid's 153 degrees of freedom (DoF) (including articulated fingers) leads to an immense exploration problem. Random exploration in this space often results in unnatural, unstable motions that quickly destabilize the humanoid or cause it to miss objects, hindering learning progress. By compressing motor skills into a low-dimensional latent space (the PULSE-X representation), the RL policy (Omnigrasp) can choose coherent, human-like movements directly, making the learning process more stable and efficient.

Furthermore, Omnigrasp uses pre-grasps as guidance in its reward function to steer the humanoid's hands towards plausible grasping configurations before actual contact. This stepwise reward strategy helps the policy to first approach, then form a valid pre-grasp, and finally grasp and manipulate the object.

4.2. Core Methodology In-depth (Layer by Layer)

The Omnigrasp training process is divided into two main stages:

4.2.1. Stage 1: PULSE-X: Physics-based Universal Dexterous Humanoid Motion Representation

This stage focuses on learning a universal motion representation for a dexterous humanoid. This representation, called PULSE-X, extends PULSE [41] by incorporating articulated fingers.

4.2.1.1. Data Augmentation

Full-body motion datasets that include articulated finger motion are scarce (e.g., only 9% of AMASS sequences contain finger motion). To address this, the authors augment existing sequences to create a dexterous full-body motion dataset.

  • Process: Similar to BEDLAM [4], full-body motion from AMASS [44] is randomly paired with hand motion sampled from GRAB [70] and Re:InterHand [49].
  • Purpose: This augmentation increases the diversity of dexterous motions in the training data, enhancing the dexterity of the subsequent motion imitator and motion representation.

4.2.1.2. PHC-X: Humanoid Motion Imitation with Articulated Fingers

The next step is to train a humanoid motion imitator (PHC-X) that can scale to the augmented dexterous full-body motion dataset. PHC-X is inspired by PHC [42].

  • Approach: The finger joints are treated similarly to other body joints (e.g., toes, wrists), which is found sufficient for acquiring the necessary dexterity.
  • Goal State for Training $\pi_{\text{PHC-X}}$: The RL policy $\pi_{\text{PHC-X}}$ is trained to imitate a reference motion. Its goal state $s_t^{\text{g-mimic}}$ at timestep $t$ is defined as:
    $$s_t^{\text{g-mimic}} \triangleq \big(\hat{\pmb{\theta}}_{t+1} \odot \pmb{\theta}_t^{-1},\ \hat{\pmb{p}}_{t+1} - \pmb{p}_t,\ \hat{\pmb{v}}_{t+1} - \pmb{v}_t,\ \hat{\pmb{\omega}}_{t+1} - \pmb{\omega}_t,\ \hat{\pmb{\theta}}_{t+1},\ \hat{\pmb{p}}_{t+1}\big)$$
    • $\hat{\pmb{\theta}}_{t+1} \odot \pmb{\theta}_t^{-1}$: The difference in 3D joint rotations between the reference pose at $t+1$ and the current simulated pose at $t$, i.e., the desired rotational change. The $\odot$ symbol denotes a rotation composition operator (e.g., quaternion multiplication with an inverse, or a relative rotation in the 6D representation).
    • $\hat{\pmb{p}}_{t+1} - \pmb{p}_t$: The difference in 3D joint positions between the reference pose at $t+1$ and the current simulated pose at $t$, i.e., the desired positional change.
    • $\hat{\pmb{v}}_{t+1} - \pmb{v}_t$: The difference in linear velocities between the reference pose at $t+1$ and the current simulated pose at $t$, i.e., the desired linear velocity change.
    • $\hat{\pmb{\omega}}_{t+1} - \pmb{\omega}_t$: The difference in angular velocities between the reference pose at $t+1$ and the current simulated pose at $t$, i.e., the desired angular velocity change.
    • $\hat{\pmb{\theta}}_{t+1}$: The target 3D joint rotations from the reference motion at $t+1$.
    • $\hat{\pmb{p}}_{t+1}$: The target 3D joint positions from the reference motion at $t+1$.
    • The hat symbol ($\hat{\cdot}$) indicates values from the reference motion, while symbols without hats (e.g., $\pmb{p}_t$) refer to values from the physics simulation.
    • This goal state encourages the PHC-X policy to match the reference motion's future pose, velocity, and angular velocity, effectively learning to imitate human motion including the fingers.
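
A schematic sketch of assembling this mimicry goal state (array shapes and the rotation-difference helper are assumptions; the paper operates on rotation representations such as quaternions or 6D rotations):

```python
import numpy as np

def mimic_goal_state(ref_rot_next, ref_pos_next, ref_vel_next, ref_angvel_next,
                     sim_rot, sim_pos, sim_vel, sim_angvel, rot_diff):
    """Concatenate next-frame reference vs. current simulated differences.

    rot_diff: helper returning the relative rotation between two rotation sets
              (e.g., quaternion multiply by the inverse of the second argument).
    """
    return np.concatenate([
        rot_diff(ref_rot_next, sim_rot).ravel(),  # desired rotational change
        (ref_pos_next - sim_pos).ravel(),         # desired positional change
        (ref_vel_next - sim_vel).ravel(),         # desired linear-velocity change
        (ref_angvel_next - sim_angvel).ravel(),   # desired angular-velocity change
        ref_rot_next.ravel(),                     # target rotations
        ref_pos_next.ravel(),                     # target positions
    ])
```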

4.2.1.3. Learning Motion Representation via Online Distillation

Once PHC-X is trained, its motor skills are distilled into a latent representation using a variational information bottleneck, similar to a Variational Autoencoder (VAE) [32]. This latent space will then serve as the action space for downstream tasks like object manipulation.

  • Components:
    • Encoder ($\mathcal{E}_{\text{PULSE-X}}$): Maps the humanoid's proprioception and the mimicry goal state to a latent code $z_t$.
    • Decoder ($\mathcal{D}_{\text{PULSE-X}}$): Translates the latent code $z_t$ and proprioception $s_t^{\text{p}}$ into joint actuation targets $a_t$.
    • Prior ($\mathcal{P}_{\text{PULSE-X}}$): Defines a Gaussian distribution over the latent code based solely on the humanoid's proprioception. This learned prior replaces the unit Gaussian distribution typically used in VAEs and increases the expressiveness of the latent space.
  • Mathematical Formulations: The encoder and prior distributions are modeled as diagonal Gaussians:
    $$\mathcal{E}_{\text{PULSE-X}}(z_t \mid s_t^{\text{p}}, s_t^{\text{g-mimic}}) = \mathcal{N}(z_t \mid \mu_t^e, \sigma_t^e), \qquad \mathcal{P}_{\text{PULSE-X}}(z_t \mid s_t^{\text{p}}) = \mathcal{N}(z_t \mid \mu_t^p, \sigma_t^p)$$
    • $z_t$: The latent code at timestep $t$; a 48-dimensional vector in Omnigrasp.
    • $s_t^{\text{p}}$: The humanoid's proprioception at timestep $t$, defined as $s_t^{\text{p}} \triangleq (\pmb{q}_t, \dot{\pmb{q}}_t, \pmb{c}_t)$, where $\pmb{q}_t$ is the 3D body pose (joint rotations and positions), $\dot{\pmb{q}}_t$ is the velocity (angular and linear), and $\pmb{c}_t$ are the contact forces on the hands. All values are normalized with respect to the humanoid heading (yaw).
    • $s_t^{\text{g-mimic}}$: The goal state for motion mimicry (explained in the PHC-X section above).
    • $\mathcal{N}(\cdot \mid \mu, \sigma)$: A Gaussian (normal) distribution with mean $\mu$ and standard deviation $\sigma$.
    • $\mu_t^e, \sigma_t^e$: The mean and standard deviation output by the encoder.
    • $\mu_t^p, \sigma_t^p$: The mean and standard deviation output by the prior.
  • Training Process: Online distillation (similar to DAgger [66]) is used. The encoder-decoder pair is rolled out in simulation, and the PHC-X imitator $\pi_{\text{PHC-X}}$ provides action labels $\bar{\pmb{a}}_t^{\text{PHC-X}}$ to train the PULSE-X components. This process effectively teaches PULSE-X to compress the motor skills learned by PHC-X into a compact latent space.
  • Role in Downstream Tasks: For object manipulation, the frozen decoder $\mathcal{D}_{\text{PULSE-X}}$ and prior $\mathcal{P}_{\text{PULSE-X}}$ are used to translate the latent code chosen by the Omnigrasp policy into joint actuations. The prior also guides downstream learning by forming a residual action space.
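
A simplified sketch of one online-distillation update (the network modules, loss weighting, and the exact form of the bottleneck term are assumptions based on the description above):

```python
import torch
import torch.nn.functional as F

def gaussian_kl(mu_q, logstd_q, mu_p, logstd_p):
    """KL( N(mu_q, std_q) || N(mu_p, std_p) ) for diagonal Gaussians."""
    var_q, var_p = (2 * logstd_q).exp(), (2 * logstd_p).exp()
    return (logstd_p - logstd_q + (var_q + (mu_q - mu_p) ** 2) / (2 * var_p) - 0.5).sum(-1)

def distill_step(encoder, decoder, prior, phc_x, s_p, s_mimic, kl_weight=0.01):
    """One DAgger-style distillation step: match PHC-X's action labels."""
    mu_e, logstd_e = encoder(s_p, s_mimic)               # posterior over latent z_t
    mu_p, logstd_p = prior(s_p)                          # proprioception-only prior
    z = mu_e + torch.randn_like(mu_e) * logstd_e.exp()   # reparameterized sample
    a_pred = decoder(s_p, z)                             # predicted joint actuation targets
    with torch.no_grad():
        a_label = phc_x(s_p, s_mimic)                    # expert action label from PHC-X
    recon = F.mse_loss(a_pred, a_label)                  # action reconstruction loss
    kl = gaussian_kl(mu_e, logstd_e, mu_p, logstd_p).mean()  # bottleneck regularizer
    return recon + kl_weight * kl                        # kl_weight is an assumption
```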

4.2.2. Stage 2: Pre-grasp Guided Object Manipulation

This stage uses the pretrained PULSE-X components to train the main Omnigrasp policy (πOmnigrasp\pi_{\text{Omnigrasp}}) for object grasping and trajectory following.

4.2.2.1. State

The goal state $s_t^{\text{g}}$ provided to the Omnigrasp policy contains information about the object and the desired object trajectory. It is defined as:
$$s_t^{\text{g}} \triangleq \big(\hat{p}_{t+1:t+\phi}^{\text{obj}} - p_t^{\text{obj}},\ \hat{\theta}_{t+1:t+\phi}^{\text{obj}} - \theta_t^{\text{obj}},\ \hat{v}_{t+1:t+\phi}^{\text{obj}} - v_t^{\text{obj}},\ \hat{\omega}_{t+1:t+\phi}^{\text{obj}} - \omega_t^{\text{obj}},\ p_t^{\text{obj}},\ \theta_t^{\text{obj}},\ \sigma^{\text{obj}},\ p_t^{\text{obj}} - p_t^{\text{hand}}\big)$$

  • $\hat{p}_{t+1:t+\phi}^{\text{obj}} - p_t^{\text{obj}}$: The difference between the reference object positions for the next $\phi$ frames and the current object position.
  • $\hat{\theta}_{t+1:t+\phi}^{\text{obj}} - \theta_t^{\text{obj}}$: The difference between the reference object orientations for the next $\phi$ frames and the current object orientation.
  • $\hat{v}_{t+1:t+\phi}^{\text{obj}} - v_t^{\text{obj}}$: The difference between the reference object linear velocities for the next $\phi$ frames and the current object linear velocity.
  • $\hat{\omega}_{t+1:t+\phi}^{\text{obj}} - \omega_t^{\text{obj}}$: The difference between the reference object angular velocities for the next $\phi$ frames and the current object angular velocity.
  • $p_t^{\text{obj}}$: The current object position.
  • $\theta_t^{\text{obj}}$: The current object orientation.
  • $\sigma^{\text{obj}} \in \mathbb{R}^{512}$: The object shape latent code, computed from the canonical object pose using a Basis Point Set (BPS) [57] representation of the object's geometry.
  • $p_t^{\text{obj}} - p_t^{\text{hand}}$: The difference between the current object position and each hand joint position.
  • $\phi$: The number of future frames (e.g., 20 frames at 15 Hz) for which reference trajectory information is provided.
  • Normalization: All values are normalized with respect to the humanoid heading (yaw).
  • Key Design Choice: The goal state $s_t^{\text{g}}$ does not contain body pose, grasp information, or phase variables (which are sometimes used in locomotion policies). This design choice enhances the method's applicability to unseen objects and reference trajectories at test time.
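
To illustrate the object shape code, here is a minimal sketch of the Basis Point Set idea (the basis sampling scheme and ball radius below are assumptions; see [57] for the actual formulation):

```python
import numpy as np

def bps_encode(object_points, num_basis=512, radius=0.15, seed=0):
    """Encode an object point cloud as distances to a fixed set of basis points.

    object_points: (N, 3) points sampled from the canonical object mesh surface.
    Returns a (num_basis,) vector: distance from each fixed basis point to its
    nearest object point, giving a fixed-size shape descriptor.
    """
    rng = np.random.default_rng(seed)
    directions = rng.normal(size=(num_basis, 3))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = radius * rng.uniform(size=(num_basis, 1)) ** (1.0 / 3.0)
    basis = directions * radii                     # basis points inside a ball
    dists = np.linalg.norm(basis[:, None, :] - object_points[None, :, :], axis=-1)
    return dists.min(axis=1)
```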

4.2.2.2. Action

The action space for the Omnigrasp policy $\pi_{\text{Omnigrasp}}$ is the latent motion representation $z_t$, used as a residual relative to the prior's mean $\mu_t^p$ from PULSE-X. The policy outputs a residual latent code $z_t^{\text{omnigrasp}} \in \mathbb{R}^{48}$, which is added to the prior's mean and decoded to produce the PD target $a_t$:
$$\pmb{a}_t = \mathcal{D}_{\text{PULSE-X}}\big(\pi_{\text{Omnigrasp}}(z_t^{\text{omnigrasp}} \mid s_t^{\text{p}}, s_t^{\text{g}}) + \mu_t^p\big)$$

  • $\pi_{\text{Omnigrasp}}(z_t^{\text{omnigrasp}} \mid s_t^{\text{p}}, s_t^{\text{g}})$: The Omnigrasp policy outputs a residual latent code $z_t^{\text{omnigrasp}}$ based on the humanoid's proprioception $s_t^{\text{p}}$ and the goal state $s_t^{\text{g}}$.
  • $\mu_t^p$: The mean of the latent code distribution predicted by the PULSE-X prior $\mathcal{P}_{\text{PULSE-X}}(z_t \mid s_t^{\text{p}})$. This provides a default, human-like motion based on the current proprioception.
  • $\mathcal{D}_{\text{PULSE-X}}(\cdot)$: The PULSE-X decoder, which translates the combined latent code (residual + prior mean) into joint actuation targets for the humanoid.
  • Benefit: This residual action space lets the Omnigrasp policy fine-tune the human-like motion provided by PULSE-X to accomplish grasping and object trajectory following, significantly simplifying the learning problem.
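
A schematic of this residual decoding step (module names mirror the notation above; the network definitions themselves are assumptions):

```python
import torch

@torch.no_grad()
def compute_pd_target(policy, prior, decoder, s_p, s_g):
    """Residual action: the policy output is added to the prior mean, then decoded."""
    z_residual = policy(s_p, s_g)   # z_t^omnigrasp in R^48
    mu_p, _ = prior(s_p)            # prior mean from proprioception only
    z = z_residual + mu_p           # shift the human-like "default" motion
    return decoder(s_p, z)          # PD targets a_t for all joints
```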

4.2.2.3. Reward

The reward function is designed to guide the humanoid through three distinct phases: approach, pre-grasp, and object trajectory following. It incorporates pre-grasp information, either generated (e.g., by GrabNet [69]) or extracted from MoCap data, as guidance. The stepwise pre-grasp reward $r_t^{\text{omnigrasp}}$ is defined as:
$$r_t^{\text{omnigrasp}} = \begin{cases} r_t^{\text{approach}}, & \text{if } \Vert \hat{p}^{\text{pre-grasp}} - p_t^{\text{hand}} \Vert_2 > 0.2 \text{ and } t < \lambda \\ r_t^{\text{pre-grasp}}, & \text{if } \Vert \hat{p}^{\text{pre-grasp}} - p_t^{\text{hand}} \Vert_2 \le 0.2 \text{ and } t < \lambda \\ r_t^{\text{obj}}, & \text{if } t \ge \lambda \end{cases}$$

  • $\hat{p}^{\text{pre-grasp}}$: The target pre-grasp hand position (from GrabNet or MoCap).

  • $p_t^{\text{hand}}$: The current hand position (usually the wrist or palm for the distance calculation).

  • $\lambda$: A time threshold (set to 1.5 seconds) indicating the frame by which grasping should occur.

    Let's break down each component of the reward:

  1. Approach Reward ($r_t^{\text{approach}}$): This reward is active when the hands are far from the pre-grasp target (distance $> 0.2\,\mathrm{m}$) and before the grasping time threshold $\lambda$. It encourages the hands to move closer to the pre-grasp position:
    $$r_t^{\text{approach}} = \lVert \hat{p}^{\text{pre-grasp}} - p_{t-1}^{\text{hand}} \rVert_2 - \lVert \hat{p}^{\text{pre-grasp}} - p_t^{\text{hand}} \rVert_2$$

    • This is a differential reward that is positive when the hand moved closer to the pre-grasp target compared to the previous timestep, and negative when it moved further away.
  2. Pre-grasp Reward ($r_t^{\text{pre-grasp}}$): This reward becomes active when the hands are sufficiently close to the pre-grasp target ($\le 0.2\,\mathrm{m}$) and still before $\lambda$. It encourages precise matching of the hand position and orientation to the pre-grasp pose:
    $$r_t^{\text{pre-grasp}} = w_{\text{hp}}\, e^{-100 \Vert \hat{p}^{\text{pre-grasp}} - p_t^{\text{hand}} \Vert_2} \cdot \mathbb{1}\{\Vert \hat{p}^{\text{pre-grasp}} - \hat{p}_t^{\text{obj}} \Vert_2 \le 0.2\} + w_{\text{hr}}\, e^{-100 \Vert \hat{\theta}^{\text{pre-grasp}} - \theta_t^{\text{hand}} \Vert_2}$$

    • $w_{\text{hp}}, w_{\text{hr}}$: Weighting coefficients for the hand position and hand rotation components, respectively.
    • $e^{-100 \Vert \cdot \Vert_2}$: An exponential decay term that gives a high reward for very small errors and drops off rapidly as the error grows, promoting precision.
    • $\hat{\theta}^{\text{pre-grasp}}$: The target pre-grasp hand orientation.
    • $\theta_t^{\text{hand}}$: The current hand orientation.
    • $\mathbb{1}\{\Vert \hat{p}^{\text{pre-grasp}} - \hat{p}_t^{\text{obj}} \Vert_2 \le 0.2\}$: An indicator variable that is 1 if the pre-grasp position is within $0.2\,\mathrm{m}$ of the object position, and 0 otherwise. This ensures the pre-grasp is relevant to the object's location.
  3. Object Trajectory Following Reward ($r_t^{\text{obj}}$): This reward is active after the grasping time threshold, $t \ge \lambda$. It encourages the humanoid to hold the object and make it follow the reference object trajectory (the paper's arrow notation is interpreted here as the reference quantities, written with hats):
    $$r_t^{\text{obj}} = \big(w_{\text{op}}\, e^{-100 \Vert \hat{p}_t^{\text{obj}} - p_t^{\text{obj}} \Vert_2} + w_{\text{or}}\, e^{-100 \Vert \hat{\theta}_t^{\text{obj}} - \theta_t^{\text{obj}} \Vert_2} + w_{\text{ov}}\, e^{-5 \Vert \hat{v}_t^{\text{obj}} - v_t^{\text{obj}} \Vert_2} + w_{\text{oav}}\, e^{-5 \Vert \hat{\omega}_t^{\text{obj}} - \omega_t^{\text{obj}} \Vert_2}\big) \cdot \mathbb{1}\{\mathbb{C}\} + \mathbb{1}\{\mathbb{C}\} \cdot w_{\text{c}}$$

    • $w_{\text{op}}, w_{\text{or}}, w_{\text{ov}}, w_{\text{oav}}$: Weighting coefficients for object position, rotation, linear velocity, and angular velocity matching, respectively.
    • $e^{-100 \Vert \cdot \Vert_2}$ for position/rotation and $e^{-5 \Vert \cdot \Vert_2}$ for velocity/angular velocity: Exponential decay terms encouraging precise matching of the object's state to the reference trajectory. The smaller decay constant (5) for the velocity terms means looser matching is tolerated for these dynamic quantities.
    • $\mathbb{1}\{\mathbb{C}\}$: An indicator variable that is 1 if the object is in contact with the humanoid's hands and 0 otherwise. This restricts the trajectory-following reward to timesteps when the object is actually held.
    • $w_{\text{c}}$: A weighting coefficient for the contact reward. The term $\mathbb{1}\{\mathbb{C}\} \cdot w_{\text{c}}$ provides a constant positive reward for simply maintaining contact with the object, encouraging stable grasps.
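
A condensed sketch of this stepwise reward (the 0.2 m and $\lambda$ thresholds follow the text; the weight values and the contact test are placeholder assumptions):

```python
import numpy as np

def omnigrasp_reward(t, lam, hand_pos, prev_hand_pos, hand_rot_err,
                     pregrasp_pos, obj_pos_err, obj_rot_err, obj_vel_err,
                     obj_angvel_err, in_contact,
                     w_hp=0.5, w_hr=0.5, w_op=0.5, w_or=0.3, w_ov=0.1, w_oav=0.1, w_c=0.2):
    """Stepwise reward: approach -> pre-grasp -> object trajectory following."""
    dist = np.linalg.norm(pregrasp_pos - hand_pos)
    if t < lam and dist > 0.2:
        # Approach phase: reward progress toward the pre-grasp position.
        prev_dist = np.linalg.norm(pregrasp_pos - prev_hand_pos)
        return prev_dist - dist
    if t < lam:
        # Pre-grasp phase: precise hand position and orientation matching.
        return w_hp * np.exp(-100.0 * dist) + w_hr * np.exp(-100.0 * hand_rot_err)
    # Trajectory-following phase: only rewarded while the object is held.
    track = (w_op * np.exp(-100.0 * obj_pos_err) + w_or * np.exp(-100.0 * obj_rot_err)
             + w_ov * np.exp(-5.0 * obj_vel_err) + w_oav * np.exp(-5.0 * obj_angvel_err))
    return float(in_contact) * (track + w_c)
```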

4.2.2.4. Object 3D Trajectory Generator

To overcome the scarcity of ground-truth object trajectories, Omnigrasp uses a synthetic 3D object trajectory generator $\mathcal{T}^{\text{3D}}$.

  • Purpose: This generator creates diverse trajectories with varying speed and direction, allowing the policy to be trained without relying on ground-truth trajectories.
  • Mechanism: The generator $\mathcal{T}^{\text{3D}}$ takes an initial object pose $\pmb{q}_0^{\text{obj}}$ and produces a sequence of reference object poses $\hat{\pmb{q}}_{1:T}^{\text{obj}}$.
  • Trajectory Characteristics:
    • Velocity: Randomly sampled between $[0, 2]\,\mathrm{m/s}$.
    • Angles: Bounded between [0, 1] radian for general movement.
    • Sharp Turns: With a probability of 0.2, a sharp turn can occur, with angles between $[0, 2\pi]$ radians.
    • Z-direction: Bounded between [0.1, 2.0] meters to prevent trajectories that are too high or too low.
    • Rotation: An extrapolator uses the object's initial trajectory to obtain a sequence of target rotations.
  • Benefit: This generator enables training on a vast and diverse set of object trajectories, crucial for generalization.
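
A rough sketch of such a generator (only the bounds above are taken from the paper; the sampling scheme, timestep, and vertical drift below are assumptions):

```python
import numpy as np

def generate_trajectory(p0, num_steps=200, dt=1.0 / 15.0, rng=None):
    """Generate a random 3D object position trajectory starting at p0 (meters)."""
    rng = rng or np.random.default_rng()
    p = np.asarray(p0, dtype=float)
    positions = [p.copy()]
    heading = rng.uniform(0.0, 2.0 * np.pi)
    for _ in range(num_steps):
        speed = rng.uniform(0.0, 2.0)                  # speed in [0, 2] m/s
        turn = rng.uniform(0.0, 1.0)                   # gentle turn of at most 1 rad
        if rng.uniform() < 0.2:                        # occasional sharp turn
            turn = rng.uniform(0.0, 2.0 * np.pi)
        heading += rng.choice([-1.0, 1.0]) * turn
        direction = np.array([np.cos(heading), np.sin(heading), rng.uniform(-0.2, 0.2)])
        p = p + speed * dt * direction
        p[2] = np.clip(p[2], 0.1, 2.0)                 # keep height within [0.1, 2.0] m
        positions.append(p.copy())
    return np.stack(positions)
```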

4.2.2.5. Training

The training process for Omnigrasp is outlined in Algorithm 1 of the paper (shown below).

  • Hard-Negative Mining: To improve performance and efficiency, Omnigrasp employs a simple hard-negative mining process. Instead of complex curriculum learning [74, 100], it regularly evaluates the policy and identifies hard objects (those that frequently lead to failed lifts/grasps) to prioritize in subsequent training steps.
    • Let $s_j$ be the number of failed lifts for object $j$.

    • The sampling probability for object $j$ is $P(j) = \frac{s_j}{\sum_{i=1}^J s_i}$, where $J$ is the total number of objects. This means objects that are harder to grasp are sampled more frequently.

      The following is Algorithm 1 from the original paper:

1  Function TrainOmnigrasp($\mathcal{D}_{\text{PULSE-X}}$, $\mathcal{P}_{\text{PULSE-X}}$, $\pi_{\text{Omnigrasp}}$, $\hat{O}$, $\mathcal{T}^{\text{3D}}$):
2  Input: Pretrained PULSE-X decoder $\mathcal{D}_{\text{PULSE-X}}$ and prior $\mathcal{P}_{\text{PULSE-X}}$, object mesh dataset $\hat{O}$, 3D trajectory generator $\mathcal{T}^{\text{3D}}$;
3  while not converged do
4      $M \gets \emptyset$ // initialize sampling memory;
5      while not filled($M$) do
6          $\pmb{q}_0^{\text{obj}}, \hat{p}^{\text{pre-grasp}}, s_t^{\text{p}} \sim$ randomly sample initial object pose, pre-grasp, and humanoid state from $\hat{O}_{\text{hard}}$ or from the dataset;
7          $\hat{q}_{1:T}^{\text{obj}} \sim \mathcal{T}^{\text{3D}}(\pmb{q}_0^{\text{obj}})$ // generate object trajectory;
8          for $t = 1 \ldots T$ do
9              $z_t^{\text{omnigrasp}} \sim \pi_{\text{Omnigrasp}}(z_t^{\text{omnigrasp}} \mid s_t^{\text{p}}, s_t^{\text{g}})$ // use pretrained latent space as action space;
10             $\mu_t^p, \sigma_t^p \gets \mathcal{P}_{\text{PULSE-X}}(z_t \mid s_t^{\text{p}})$ // compute prior latent code;
11             $\pmb{a}_t \gets \mathcal{D}_{\text{PULSE-X}}(\pmb{a}_t \mid s_t^{\text{p}}, z_t^{\text{omnigrasp}} + \mu_t^p)$ // decode action using pretrained decoder;
12             $s_{t+1} \gets \mathcal{T}(s_{t+1} \mid s_t, a_t)$ // simulation;
13             $r_t \gets \mathcal{R}(s_t^{\text{p}}, s_t^{\text{g}})$ // compute reward;
14             Add experience $(s_t, a_t, r_t, s_{t+1})$ to $M$;
15         end for
16     end while
17     $\pi_{\text{Omnigrasp}} \gets$ PPO update using experiences collected in $M$;
18     $\hat{O}_{\text{hard}} \gets$ evaluate $\pi_{\text{Omnigrasp}}$ and update hard objects;
19 return $\pi_{\text{Omnigrasp}}$;
  • Line 6: Selects an initial object pose $\pmb{q}_0^{\text{obj}}$, pre-grasp $\hat{p}^{\text{pre-grasp}}$, and humanoid state $s_t^{\text{p}}$, either from a dataset (e.g., GRAB for MoCap initial states) or by dropping the object in simulation. Crucially, it prioritizes hard objects ($\hat{O}_{\text{hard}}$) identified during hard-negative mining.
  • Line 7: A 3D object trajectory $\hat{q}_{1:T}^{\text{obj}}$ is generated using the trajectory generator $\mathcal{T}^{\text{3D}}$.
  • Lines 9-11: The Omnigrasp policy outputs a residual latent code $z_t^{\text{omnigrasp}}$. This is added to the mean of the prior latent code $\mu_t^p$ from PULSE-X and then decoded by PULSE-X's decoder $\mathcal{D}_{\text{PULSE-X}}$ to produce the PD target action $a_t$.
  • Lines 12-14: The simulation advances, the reward is computed, and the experience is stored in memory $M$.
  • Line 17: The Omnigrasp policy is updated using the PPO algorithm with the collected experiences.
  • Line 18: The hard-object set $\hat{O}_{\text{hard}}$ is updated based on the current policy's performance (objects that caused failures are added to $\hat{O}_{\text{hard}}$ and sampled more frequently).
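
A minimal sketch of this hard-negative sampling step (the bookkeeping of failure counts is an assumption):

```python
import numpy as np

def sample_object(failed_lifts):
    """Sample an object index with probability proportional to its failed lifts.

    failed_lifts: array of failure counts s_j per object from the latest evaluation.
    """
    s = np.asarray(failed_lifts, dtype=np.float64)
    if s.sum() == 0:                      # no failures yet: sample uniformly
        return np.random.randint(len(s))
    return np.random.choice(len(s), p=s / s.sum())
```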

4.2.2.6. Object and Humanoid Initial State Randomization

  • Object Randomization: The initial object pose $\pmb{q}_0^{\text{obj}}$ and its component velocities are perturbed to make the policy robust to variations.
  • Humanoid Randomization: The initial humanoid state is sampled from a dataset (e.g., GRAB for MoCap initial states) or set to a standing T-pose if no paired data is available.
  • Independent Training: The final policy only requires the object mesh $O$, the initial object position $p_0^{\text{obj}}$, and the desired object trajectory $\hat{q}_{1:T}^{\text{obj}}$ at test time, demonstrating independence from pre-grasps or paired kinematic human poses during deployment.

5. Experimental Setup

5.1. Datasets

Omnigrasp evaluates its method on three diverse datasets to cover a range of object sizes, types, and interaction complexities:

  1. GRAB Dataset [70]:

    • Characteristics: Contains 1.3k paired full-body motion and object trajectories for 50 distinct objects (excluding a doorknob, which is not movable). The objects are generally small to medium-sized, common household items.
    • Source: Motion Capture (MoCap) data, providing ground-truth reference motion for both the humanoid and the object.
    • Splits:
      • GRAB-Goal-Test: A cross-object split (training on objects seen, testing on 5 unseen objects). This evaluates generalization to novel objects.
      • GRAB-IMoS-Test: A cross-subject split (training on motions from some subjects, testing on 92 sequences from 44 objects, performed by unseen subjects). This evaluates generalization to novel interaction styles.
    • Usage: Used for evaluating trajectory following of MoCap object trajectories. Initial humanoid positions and pre-grasps are extracted from this dataset when available.
  2. OakInk Dataset [85, 86]:

    • Characteristics: A large-scale dataset containing 1700 diverse objects across 32 categories. These objects include real-world scanned and generated meshes, varying significantly in shape, size, and material.
    • Source: Real-world scans and generated meshes.
    • Splits:
      • 1330 objects for training.
      • 185 objects for validation.
      • 185 objects for testing.
      • Splits are category-wise, ensuring training and test splits contain objects from all categories to properly evaluate generalization.
    • Usage: Used for scaling up grasping policies to a large number of objects and testing generalization to unseen object shapes. Since no paired MoCap motion exists, GrabNet [69] is used to create pre-grasps.
    • Example Data Sample (Conceptual): A 3D mesh of a mug with a corresponding pre-grasp hand pose generated by GrabNet. The humanoid needs to grasp this mug and lift it.
  3. OMOMO Dataset [34]:

    • Characteristics: Contains 15 large objects (e.g., table lamps, monitors). The authors select 7 objects with cleaner meshes.

    • Source: Reconstructed meshes.

    • Usage: Primarily for testing if the Omnigrasp pipeline can learn to move larger objects. Due to the limited number, only in-distribution testing (on objects used for training) is conducted to verify the pipeline's capability rather than generalization.

    • Example Data Sample (Conceptual): A 3D mesh of a table lamp. The humanoid needs to grasp this lamp and move it.

      These datasets are chosen to validate Omnigrasp's performance across small to large objects, diverse shapes, and different levels of ground-truth motion availability. GRAB provides a strong MoCap-based benchmark for trajectory following, OakInk tests scalability and generalization to hundreds of unseen objects, and OMOMO verifies handling larger items.

5.2. Evaluation Metrics

The paper uses a comprehensive set of metrics to evaluate both the grasping success and the precision of trajectory following:

  1. Grasp Success Rate ($\text{Succ}_{\text{grasp}}$):

    • Conceptual Definition: Measures the percentage of episodes where the humanoid successfully grasps and holds the object without dropping it, for a minimum duration. It focuses on the primary goal of establishing and maintaining a stable grip.
    • Mathematical Formula: Not explicitly provided in the paper, but conceptually calculated as:
      $$\text{Succ}_{\text{grasp}} = \frac{\text{Number of successful grasps}}{\text{Total number of attempts}} \times 100\%$$
    • Symbol Explanation:
      • Number of successful grasps: Count of episodes where the object is held for at least 0.5 seconds in the physics simulation without being dropped.
      • Total number of attempts: Total number of times the humanoid attempts to grasp an object.
  2. Trajectory Success Rate ($\text{Succ}_{\text{traj}}$):

    • Conceptual Definition: Measures the percentage of episodes where the humanoid successfully grasps the object and follows the entire reference trajectory without the object deviating too far from the reference path at any point. This assesses overall task completion.
    • Mathematical Formula: Not explicitly provided in the paper, but conceptually calculated as:
      $$\text{Succ}_{\text{traj}} = \frac{\text{Number of successful trajectories}}{\text{Total number of attempts}} \times 100\%$$
    • Symbol Explanation:
      • Number of successful trajectories: Count of episodes where grasping is successful AND the object's position remains within $25\,\mathrm{cm}$ of the reference trajectory for the entire duration. If the object deviates by more than $25\,\mathrm{cm}$ at any point, the trajectory is deemed unsuccessful.
      • Total number of attempts: Same as for $\text{Succ}_{\text{grasp}}$.
  3. Trajectory Targets Reached (TTR):

    • Conceptual Definition: Quantifies how well the object tracks the reference trajectory targets over time, but only for episodes where grasping was successful. It is a measure of tracking accuracy conditional on having a stable grasp.
    • Mathematical Formula: Not explicitly provided in the paper, but conceptually calculated as:
      $$\text{TTR} = \frac{\text{Number of timesteps where the object is near its target}}{\text{Total number of timesteps in successful trajectories}} \times 100\%$$
    • Symbol Explanation:
      • Number of timesteps where the object is near its target: Count of timesteps where the object's position is within $12\,\mathrm{cm}$ of the reference target position.
      • Total number of timesteps in successful trajectories: Sum of timesteps across all trajectories deemed successful by $\text{Succ}_{\text{traj}}$.
  4. Position Error ($E_{\text{pos}}$):

    • Conceptual Definition: The average Euclidean distance between the actual object position and the reference object position over the trajectory. It measures how accurately the object's location is tracked.
    • Mathematical Formula:
      $$E_{\text{pos}} = \frac{1}{T} \sum_{t=1}^{T} \Vert p_t^{\text{obj}} - \hat{p}_t^{\text{obj}} \Vert_2 \quad (\text{in mm})$$
    • Symbol Explanation:
      • $T$: Total number of timesteps in the trajectory.
      • $p_t^{\text{obj}}$: Actual object position at timestep $t$.
      • $\hat{p}_t^{\text{obj}}$: Reference object position at timestep $t$.
      • $\Vert \cdot \Vert_2$: Euclidean norm (L2 distance).
      • The result is reported in millimeters (mm).
  5. Rotation Error ($E_{\text{rot}}$):

    • Conceptual Definition: The average angular difference between the actual object orientation and the reference object orientation over the trajectory. It measures how accurately the object's rotation is tracked.
    • Mathematical Formula: A common way to compute the rotation error between two orientations is the geodesic (angle-axis) distance. Assuming quaternions or 6D rotation representations are used internally:
      $$E_{\text{rot}} = \frac{1}{T} \sum_{t=1}^{T} \text{angle\_between}(\theta_t^{\text{obj}}, \hat{\theta}_t^{\text{obj}}) \quad (\text{in radians})$$
    • Symbol Explanation:
      • $T$: Total number of timesteps.
      • $\theta_t^{\text{obj}}$: Actual object orientation at timestep $t$.
      • $\hat{\theta}_t^{\text{obj}}$: Reference object orientation at timestep $t$.
      • angle_between: A function that computes the smallest angle in radians between two orientations.
      • The result is reported in radians.
  6. Acceleration Error (EaccE_{\text{acc}}):

    • Conceptual Definition: The average Euclidean distance between the actual object acceleration and the reference object acceleration over the trajectory. It quantifies how well the dynamics of the object's motion are matched.
    • Mathematical Formula: $E_{\text{acc}} = \frac{1}{T} \sum_{t=1}^{T} \Vert a_t^{\text{obj}} - \hat{a}_t^{\text{obj}} \Vert_2$ (in $\mathrm{mm/frame^2}$)
    • Symbol Explanation:
      • $T$: Total number of timesteps.
      • $a_t^{\text{obj}}$: Actual object acceleration at timestep $t$, typically computed from successive velocity measurements.
      • $\hat{a}_t^{\text{obj}}$: Reference object acceleration at timestep $t$, typically computed from successive reference velocity measurements.
      • $\Vert \cdot \Vert_2$: Euclidean norm.
      • Result is reported in millimeters per frame squared ($\mathrm{mm/frame^2}$).
  7. Velocity Error ($E_{\text{vel}}$):

    • Conceptual Definition: The average Euclidean distance between the actual object linear velocity and the reference object linear velocity over the trajectory. It measures how accurately the object's speed and direction are tracked.
    • Mathematical Formula: $E_{\text{vel}} = \frac{1}{T} \sum_{t=1}^{T} \Vert v_t^{\text{obj}} - \hat{v}_t^{\text{obj}} \Vert_2$ (in mm/frame)
    • Symbol Explanation:
      • $T$: Total number of timesteps.
      • $v_t^{\text{obj}}$: Actual object linear velocity at timestep $t$.
      • $\hat{v}_t^{\text{obj}}$: Reference object linear velocity at timestep $t$.
      • $\Vert \cdot \Vert_2$: Euclidean norm.
      • Result is reported in millimeters per frame (mm/frame).
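As a concrete reference for the definitions above, the following is a minimal numpy sketch that computes TTR and the position, rotation, velocity, and acceleration errors for a single trajectory. The quaternion convention, the finite-difference estimates of velocity and acceleration, and the function names are assumptions made for illustration; note that the paper averages TTR only over successful trajectories.

```python
import numpy as np

def quat_angle(q1: np.ndarray, q2: np.ndarray) -> np.ndarray:
    """Smallest angle (radians) between unit quaternions, per timestep."""
    dot = np.clip(np.abs(np.sum(q1 * q2, axis=-1)), 0.0, 1.0)
    return 2.0 * np.arccos(dot)

def trajectory_metrics(p, p_ref, q, q_ref, thresh_m=0.12):
    """p, p_ref: (T, 3) positions in meters; q, q_ref: (T, 4) unit quaternions."""
    e_pos = np.linalg.norm(p - p_ref, axis=-1)            # (T,) meters
    e_rot = quat_angle(q, q_ref)                           # (T,) radians

    # Per-frame finite differences for velocity and acceleration.
    v, v_ref = np.diff(p, axis=0), np.diff(p_ref, axis=0)
    a, a_ref = np.diff(v, axis=0), np.diff(v_ref, axis=0)

    return {
        "TTR":   100.0 * np.mean(e_pos < thresh_m),        # % of frames near the target
        "E_pos": 1000.0 * e_pos.mean(),                     # mm
        "E_rot": e_rot.mean(),                              # radians
        "E_vel": 1000.0 * np.linalg.norm(v - v_ref, axis=-1).mean(),  # mm / frame
        "E_acc": 1000.0 * np.linalg.norm(a - a_ref, axis=-1).mean(),  # mm / frame^2
    }
```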

5.3. Baselines

Omnigrasp is compared against several representative baselines to demonstrate its superior performance:

  1. PPO-10B: This is a direct Reinforcement Learning baseline trained using PPO without leveraging PULSE-X's latent space (i.e., directly on joint actuation space).

    • Characteristics: It uses a similar state and reward design as Omnigrasp but operates on the raw, high-dimensional joint actuation targets.
    • Purpose: To highlight the significant performance gain and sample efficiency provided by PULSE-X's universal motion representation. The "10B" likely refers to the approximate number of samples collected (around $10^{10}$ samples) over a long training period (one month), showcasing the computational cost of training without a proper motion prior.
  2. PHC [42]: Perpetual Humanoid Control. This baseline represents an imitator-based approach for grasping.

    • Characteristics: A pretrained humanoid motion imitator is directly used. For grasping, ground-truth kinematic body and finger motion (when available) is fed to this imitator to attempt object grasping.
    • Purpose: To evaluate if a direct motion imitation approach is sufficient for precise object grasping. The authors note that PHC has an average $30\,\mathrm{mm}$ tracking error for hands, which might be too large for tasks requiring high precision like grasping.
  3. AMP [56]: Adversarial Motion Priors.

    • Characteristics: A physics-based character control method that uses adversarial learning to generate stylized motions from MoCap data. It is trained with a similar state and reward design (excluding PULSE-X's latent space) and a task and discriminator reward weighting of 0.5 and 0.5.
    • Purpose: To compare Omnigrasp's motion prior (from PULSE-X) with an adversarial motion prior and to see its performance in trajectory following when trained with similar task rewards.
  4. Braun et al. [6]: Physically plausible full-body hand-object interaction synthesis.

    • Characteristics: This is a prior state-of-the-art (SOTA) method that studies a similar full-body grasping setting. It relies on MoCap human-object interaction data and typically uses only one hand.

    • Purpose: To provide a direct comparison with a specialized, data-driven full-body grasping method. Omnigrasp aims to outperform Braun et al. in success rates and generalization while reducing data dependency.

      These baselines are representative because they cover different approaches to simulated humanoid control and grasping: raw RL, direct imitation, adversarial motion priors, and prior SOTA specialized methods.

5.4. Implementation Details

  • Simulator: Isaac Gym [45] is used for physics simulation. Isaac Gym is known for its high-performance, GPU-accelerated physics simulation capabilities, which are crucial for Reinforcement Learning training involving millions of samples.
  • Simulation and Policy Frequencies:
    • Policy runs at $30\,\mathrm{Hz}$.
    • Simulation runs at $60\,\mathrm{Hz}$. This means for every policy decision, the simulation takes two steps.
  • Network Architecture:
    • PHC-X and PULSE-X: Each policy is a 6-layer Multi-Layer Perceptron (MLP).
    • Omnigrasp (Grasping Task): Employs a GRU [14]-based recurrent policy: a GRU with a latent dimension of 512, followed by a 3-layer MLP (a minimal sketch appears after this list). The GRU is important for processing sequential information from trajectories.
  • Training Time:
    • Omnigrasp: Trained for three days, collecting around $10^9$ samples on an Nvidia A100 GPU.
    • PHC-X: Trained once and frozen, taking approximately 1.5 weeks.
    • PULSE-X: Trained once and frozen, taking approximately 3 days.
  • Object Properties:
    • Density: $1000\,\mathrm{kg/m^3}$ (density of water), a common value for objects in simulation.
    • Static and Dynamic Friction Coefficients: These are set to 1.0 for both the object and the humanoid fingers to allow for stable grasping.
  • Reference Trajectory Window: For the goal state ($s_t^{\text{g}}$), $\phi = 20$ future frames of the reference trajectory are sampled at $15\,\mathrm{Hz}$. This means the policy gets a preview of the upcoming object motion.
  • Object Processing:
    • Convex Decomposition: Since physics simulators often require convex objects, the simulator's built-in VHACD function is used to decompose object meshes into convex geometries.
    • Object Latent Code: A 512-dimensional Basis Point Set (BPS) [57] is used. This is computed by randomly sampling 512 points on a unit sphere and calculating their distances to points on the object mesh, providing a compact geometric representation (see the sketch after this list).
    • Mesh Decimation: For objects with more than 50,000 vertices, quadratic decimation is performed to simplify the mesh.
  • Early Termination: During training, an episode is terminated if the object is more than $12\,\mathrm{cm}$ away from its desired reference trajectory at any timestep: $\Vert \hat{p}_t^{\text{obj}} - p_t^{\text{obj}} \Vert_2 > 0.12$. This prevents wasting computational resources on irrecoverable failures.
  • Table Removal: For table-top objects from GRAB and OakInk, a table supports the object initially. However, to prevent collisions with the randomly generated trajectories and since the humanoid lacks environmental awareness, the table is removed after the initial 1.5 seconds.
  • Contact Detection: Since Isaac Gym provides only contact forces and no direct contact labels distinguishing between the table, body, or object, a heuristic-based method is used: an object is deemed in contact with the hands if it is within $0.2\,\mathrm{m}$ of the hands, has non-zero contact forces, and has non-zero velocity.
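To make the network-architecture bullet above concrete, here is a minimal PyTorch sketch of a GRU-based recurrent policy with a 512-dimensional hidden state followed by a 3-layer MLP head. The MLP widths, the activation function, and interpreting the output as a PULSE-X latent action are illustrative assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class GraspPolicy(nn.Module):
    """Sketch: GRU (hidden size 512) followed by a 3-layer MLP head."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 512):
        super().__init__()
        self.gru = nn.GRU(obs_dim, hidden, batch_first=True)
        self.head = nn.Sequential(               # 3-layer MLP head (widths assumed)
            nn.Linear(hidden, 1024), nn.SiLU(),
            nn.Linear(1024, 512), nn.SiLU(),
            nn.Linear(512, act_dim),             # e.g., a latent action for PULSE-X
        )

    def forward(self, obs_seq, h=None):
        # obs_seq: (batch, time, obs_dim); h: (1, batch, hidden) recurrent state
        feats, h = self.gru(obs_seq, h)
        return self.head(feats), h
```

The Basis Point Set object code described above is also simple to sketch: fix 512 random points on the unit sphere once, then encode each object by the distance from every basis point to its nearest mesh point. The numpy version below is an illustrative reimplementation of that idea, assuming the mesh is already in its canonical frame and represented by sampled surface points; it is not the authors' code.

```python
import numpy as np

def bps_encode(surface_points: np.ndarray, n_basis: int = 512, seed: int = 0) -> np.ndarray:
    """Return a (n_basis,) BPS code: distance from each fixed basis point to the object."""
    rng = np.random.default_rng(seed)
    basis = rng.normal(size=(n_basis, 3))                   # random directions...
    basis /= np.linalg.norm(basis, axis=1, keepdims=True)   # ...projected onto the unit sphere

    # Pairwise distances (n_basis, n_points); keep the nearest object point per basis point.
    dists = np.linalg.norm(basis[:, None, :] - surface_points[None, :, :], axis=-1)
    return dists.min(axis=1)
```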

6. Results & Analysis

All experiments are run 10 times, and results are averaged, accounting for slight differences due to floating-point errors in the simulator.

6.1. Grasping and Trajectory Following

6.1.1. GRAB Dataset (50 objects)

The GRAB dataset evaluates Omnigrasp on MoCap object trajectories using the mean body shape humanoid. For fair comparison with Braun et al. [6], Omnigrasp is trained in two settings: one with MoCap object trajectories and one with synthetic trajectories.

The following are the results from Table 1 of the original paper:

GRAB-Goal-Test (Cross-Object, 140 sequences, 5 unseen objects)

| Method | Traj | Succgrasp ↑ | Succtraj ↑ | TTR ↑ | Epos ↓ | Erot ↓ | Eacc ↓ | Evel ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PPO-10B | Gen | 98.4% | 55.9% | 97.5% | 36.4 | 0.4 | 21.0 | 14.5 |
| PHC [42] | MoCap | 3.6% | 11.4% | 81.1% | 66.3 | 0.8 | 1.5 | 3.8 |
| AMP [56] | Gen | 90.4% | 46.6% | 94.0% | 40.7 | 0.6 | 5.3 | 5.3 |
| Braun et al. [6] | MoCap | 79% | - | 85% | - | - | - | - |
| Omnigrasp | MoCap | 94.6% | 84.8% | 98.7% | 28.0 | 0.5 | 4.2 | 4.3 |
| Omnigrasp | Gen | 100% | 94.1% | 99.6% | 30.2 | 0.93 | 5.4 | 4.7 |

GRAB-IMoS-Test (Cross-Subject, 92 sequences, 44 objects)

| Method | Traj | Succgrasp ↑ | Succtraj ↑ | TTR ↑ | Epos ↓ | Erot ↓ | Eacc ↓ | Evel ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PPO-10B | Gen | 96.8% | 53.2% | 97.9% | 35.6 | 0.5 | 19.6 | 13.9 |
| PHC [42] | MoCap | 0% | 3.3% | 97.4% | 56.5 | 0.3 | 1.4 | 2.9 |
| AMP [56] | Gen | 95.8% | 49.2% | 96.5% | 34.9 | 0.5 | 6.2 | 6.0 |
| Braun et al. [6] | MoCap | 64% | - | 65% | - | - | - | - |
| Omnigrasp | MoCap | 95.8% | 85.4% | 99.8% | 27.5 | 0.6 | 5.0 | 5.0 |
| Omnigrasp | Gen | 98.9% | 90.5% | 99.8% | 27.9 | 0.97 | 6.3 | 5.4 |

Analysis:

  • Superior Performance: Omnigrasp (trained with Gen or MoCap trajectories) consistently outperforms all baselines across both GRAB-Goal-Test (cross-object) and GRAB-IMoS-Test (cross-subject) splits.
    • Omnigrasp (Gen) achieves a 100% grasp success rate and 94.1% trajectory success rate on the cross-object test, significantly higher than Braun et al.'s 79% grasp success and 85% TTR.
    • Position error ($E_{\text{pos}}$) for Omnigrasp is around 28-30 mm, which is state-of-the-art and indicative of precise trajectory following.
  • Importance of PULSE-X: PPO-10B, which trains directly on joint actuation without PULSE-X, shows a much lower trajectory success rate (e.g., 55.9% vs. 94.1% for Omnigrasp (Gen) on cross-object test), despite extensive training (~10 billion samples). This strongly validates the sample efficiency and effectiveness of using PULSE-X's latent motion representation.
  • Limitations of Direct Imitation: PHC (using a direct imitator) yields very low grasp success rates (3.6% and 0%). This indicates that while imitators can track kinematic poses, their inherent tracking error (e.g., 30mm for hands) is too large for the precision required for object grasping, especially with body shape mismatches between MoCap and the simulated humanoid.
  • Advantage over AMP: AMP shows better grasp success than PHC but lags significantly behind Omnigrasp in trajectory success rate (46.6% vs. 94.1% for Omnigrasp (Gen)). This further highlights the importance of PULSE-X as a motion prior in the action space.
  • Training Data Source: Omnigrasp trained on randomly generated (Gen) trajectories (100% Succgrasp, 94.1% Succtraj on cross-object) slightly outperforms or is on par with Omnigrasp trained on MoCap trajectories (94.6% Succgrasp, 84.8% Succtraj). This is a key finding, demonstrating that synthetic data can be sufficient and even superior for learning generalizable manipulation skills, reducing reliance on scarce MoCap data. The authors suggest that the slight difference in Erot (0.93 for Gen vs. 0.5 for MoCap) might be due to the generated rotations being less "physically realistic" than MoCap ones, indicating an area for trajectory generator improvement.
  • Gap between Grasp and Trajectory Success: The difference between Succgrasp and Succtraj (e.g., 100% vs. 94.1% for Omnigrasp (Gen)) indicates that while objects can often be grasped, maintaining the grasp and following the trajectory precisely until the end remains challenging.

6.1.2. OakInk Dataset (1700 objects)

The OakInk dataset evaluates Omnigrasp's scalability to a large number of diverse objects and its generalization to unseen objects. The task here is vertical lifting (30cm) and holding (3s).

The following are the results from Table 3 of the original paper:

OakInk-Train (1330 objects)

| Training Data | Succgrasp ↑ | Succtraj ↑ | TTR ↑ | Epos ↓ | Erot ↓ | Eacc ↓ | Evel ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OakInk | 93.7% | 86.2% | 100% | 21.3 | 0.4 | 7.7 | 6.0 |
| GRAB | 84.5% | 75% | 99.9% | 22.4 | 0.4 | 6.8 | 5.7 |
| GRAB + OakInk | 95.6% | 92.0% | 100% | 21.0 | 0.6 | 5.4 | 4.8 |

OakInk-Test (185 objects)

| Training Data | Succgrasp ↑ | Succtraj ↑ | TTR ↑ | Epos ↓ | Erot ↓ | Eacc ↓ | Evel ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| OakInk | 94.3% | 87.5% | 100% | 21.2 | 0.4 | 7.6 | 5.9 |
| GRAB | 81.9% | 72.1% | 99.9% | 22.7 | 0.4 | 7.1 | 5.8 |
| GRAB + OakInk | 93.5% | 89.0% | 100% | 21.3 | 0.6 | 5.4 | 4.8 |

Analysis:

  • High Scalability and Generalization:
    • Omnigrasp trained solely on OakInk achieves 93.7% grasp success and 86.2% trajectory success on the training set (1330 objects), and 94.3% grasp success and 87.5% trajectory success on the test set (185 unseen objects). This demonstrates excellent scalability to a large number of diverse objects and strong generalization to unseen objects.
    • The TTR is 100% across all OakInk experiments, indicating that once a trajectory is deemed successful, the object very accurately tracks its targets.
  • Failure Cases: The authors note that failed objects are typically too large or too small for the humanoid to establish a stable grasp. The hard-negative mining process is also challenged by such a large number of objects.
  • Cross-Dataset Generalization:
    • Training only on GRAB and testing on OakInk still yields a high grasp success rate of 84.5% (train) and 81.9% (test). This is a remarkable result, showing that the grasping policy learned on GRAB (which has only 50 objects) is robust enough to generalize to over 1000 unseen objects from a different dataset, without any prior exposure to their shapes. This highlights the robustness of the policy.
  • Combined Training: Training on GRAB + OakInk yields the highest success rates (95.6% grasp, 92.0% trajectory on train, 93.5% grasp, 89.0% trajectory on test). This combination benefits from GRAB providing bi-manual pre-grasps (which OakInk lacks as it only has one-hand pre-grasps), allowing the policy to learn to use both hands, especially for larger objects.

6.1.3. OMOMO Dataset (7 objects)

The OMOMO dataset evaluates Omnigrasp's ability to handle larger objects.

The following are the results from Table 2 of the original paper:

OMOMO (7 objects)

| Succgrasp ↑ | Succtraj ↑ | TTR ↑ | Epos ↓ | Erot ↓ | Eacc ↓ | Evel ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| 7/7 | 7/7 | 100% | 22.8 | 0.2 | 3.1 | 3.3 |

Analysis:

  • Omnigrasp achieves 100% grasp success and 100% trajectory success on all 7 OMOMO objects, with very low position ($22.8\,\mathrm{mm}$) and rotation ($0.2$ radians) errors. This confirms that the method can successfully pick up and manipulate large objects and follow trajectories with high precision.

  • The qualitative results (Figure 3) show successful manipulation of larger objects like table lamps, further supporting this.

    The following figure (Figure 3 from the original paper) shows qualitative results on unseen objects for GRAB and OakInk, with green dots indicating reference trajectories:

Figure 3: Qualitative results. Unseen objects are tested for GRAB and OakInk. Green dots: reference trajectories. Best seen in videos on our supplement site.

6.2. Ablation Studies / Parameter Analysis

Ablation studies investigate the contribution of different components of the Omnigrasp framework using the cross-object split of the GRAB dataset.

The following are the results from Table 4 of the original paper:

GRAB-Goal-Test (Cross-Object, 140 sequences, 5 unseen objects)

| idx | PULSE-X | Pre-grasp | Dex-AMASS | Rand-pose | Hard-neg | Succgrasp ↑ | Succtraj ↑ | TTR ↑ | Epos ↓ | Erot ↓ | Eacc ↓ | Evel ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | ✗ | ✓ | ✓ | ✓ | ✓ | 97.0% | 33.6% | 92.8% | 43.5 | 0.5 | 10.6 | 8.3 |
| 2 | ✓ | ✗ | ✓ | ✓ | ✓ | 77.1% | 57.9% | 97.4% | 54.9 | 1.0 | 5.5 | 5.2 |
| 3 | ✓ | ✓ | ✗ | ✓ | ✓ | 94.4% | 77.3% | 99.3% | 30.5 | 0.9 | 4.8 | 4.4 |
| 4 | ✓ | ✓ | ✓ | ✗ | ✓ | 92.9% | 79.9% | 99.2% | 31.4 | 1.1 | 4.5 | 4.4 |
| 5 | ✓ | ✓ | ✓ | ✓ | ✗ | 94.0% | 71.6% | 98.4% | 32.3 | 1.3 | 6.2 | 5.7 |
| 6 | ✓ | ✓ | ✓ | ✓ | ✓ | 100% | 94.1% | 99.6% | 30.2 | 0.9 | 5.4 | 4.7 |

Analysis:

  • Impact of PULSE-X (Row 1 vs. Row 6):
    • Removing PULSE-X's action space (Row 1) drastically reduces the trajectory success rate to 33.6% from 94.1% (Row 6). While grasp success remains high (97.0%), the ability to follow trajectories after grasping is severely hampered. This confirms that PULSE-X is critical for learning coherent, human-like motion and efficient exploration for trajectory following. Without it, RL struggles to learn stable full-body control.
  • Impact of Pre-grasp Guidance (Row 2 vs. Row 6):
    • Disabling the pre-grasp guidance reward (Row 2) significantly lowers grasp success to 77.1% from 100% (Row 6). This validates PGDM's [17] finding that pre-grasp initialization (or guidance in this case) is crucial for successful grasping, even with a strong motion prior.
  • Impact of Dex-AMASS (Row 3 vs. Row 6):
    • Training PULSE-X without the dexterous AMASS dataset (Row 3) leads to a lower trajectory success rate of 77.3% compared to 94.1% (Row 6). This suggests that providing diverse hand motion during PULSE-X training is essential for enabling the policy to learn to pick up diverse objects accurately. Without it, the policy might struggle with novel object shapes, even if it can grasp some.
  • Impact of Object Initial Pose Randomization (Row 4 vs. Row 6):
    • Removing randomization of the object's initial pose (Row 4) reduces the trajectory success rate to 79.9% from 94.1%. This indicates that randomizing initial conditions is crucial for training a robust policy that can handle variations in the object's starting position.
  • Impact of Hard-Negative Mining (Row 5 vs. Row 6):
    • Disabling hard-negative mining (Row 5) lowers the trajectory success rate to 71.6% from 94.1%. This shows that adaptively focusing on hard-to-grasp objects is important for improving the generalization and robustness of the policy across the entire object set.

Additional Ablations (Appendix C.3, Table 8): The following are the results from Table 8 of the original paper:

GRAB-Goal-Test (Cross-Object, 140 sequences, 5 unseen objects)

| idx | Object Latent | RNN | Im-obs | Succgrasp ↑ | Succtraj ↑ | TTR ↑ | Epos ↓ | Erot ↓ | Eacc ↓ | Evel ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | ✗ | ✓ | ✗ | 100% | 93.2% | 99.8% | 28.7 | 1.3 | 6.1 | 5.1 |
| 2 | ✓ | ✗ | ✗ | 99.9% | 89.6% | 99.0% | 33.4 | 1.2 | 4.5 | 4.4 |
| 3 | ✓ | ✓ | ✓ | 95.2% | 77.8% | 97.9% | 32.2 | 0.9 | 3.2 | 3.9 |
| 4 | ✓ | ✓ | ✗ | 100% | 94.1% | 99.6% | 30.2 | 0.9 | 5.4 | 4.7 |

Analysis:

  • Impact of Object Latent Code (Row 1 vs. Row 4):
    • On the GRAB cross-object test, removing the object shape latent code ($\sigma^{\text{obj}}$) (Row 1) results in a trajectory success rate (93.2%) comparable to the full model (Row 4, 94.1%). This is because the 5 testing objects in this split might not deviate significantly enough to necessitate explicit shape information; for small objects, the humanoid might learn a general "scooping" or pincer grasp strategy. However, the authors note that on the GRAB cross-subject test (44 objects), $\text{Succ}_{\text{traj}}$ drops to 84.2% without the object latent code, showing its importance for broader generalization. The higher $E_{\text{rot}}$ in Row 1 (1.3) suggests that while grasping is successful, the object's orientation might not be as precisely controlled without shape information.
  • Impact of RNN Policy (Row 2 vs. Row 4):
    • Replacing the RNN-based policy with an MLP-based policy (Row 2) slightly reduces trajectory success rate to 89.6% from 94.1%. This indicates that the RNN's ability to process sequential information is beneficial for trajectory following, which is an inherently sequential task.
  • Impact of Ground Truth Full-Body Pose (Im-obs) (Row 3 vs. Row 4):
    • Providing ground-truth full-body pose ($\hat{q}_t$) as input to the policy (Row 3) actually leads to worse performance (77.8% trajectory success) compared to not providing it (Row 4, 94.1%). This counter-intuitive result suggests that directly imitating kinematic poses without a contact graph (as in PhysHOI [77]) can be detrimental. It also highlights that Omnigrasp's flexible interface, which doesn't require paired MoCap full-body motion, is advantageous for learning and testing on novel objects.

6.2.2. Per-object Success Rate Breakdown (Appendix C.4, Table 9)

The following are the results from Table 9 of the original paper:

| Object | Braun et al. [6] Succgrasp ↑ | Braun et al. [6] Succtraj ↑ | Braun et al. [6] TTR ↑ | Omnigrasp Succgrasp ↑ | Omnigrasp Succtraj ↑ | Omnigrasp TTR ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Apple | 95% | - | 91% | 100% | 99.6% | 99.9% |
| Binoculars | 54% | - | 83% | 100% | 90.5% | 99.6% |
| Camera | 95% | - | 85% | 100% | 97.7% | 99.7% |
| Mug | 89% | - | 74% | 100% | 97.3% | 99.8% |
| Toothpaste | 64% | - | 94% | 100% | 80.9% | 99.0% |

Analysis:

  • Omnigrasp achieves 100% grasp success across all 5 novel objects in the GRAB-Goal (cross-object) split, significantly outperforming Braun et al. [6] on all metrics, especially grasp success (e.g., 54% for Binoculars in Braun et al. vs. 100% in Omnigrasp).
  • The toothpaste is identified as the hardest object for Omnigrasp to pick up, leading to a trajectory success rate of 80.9%, lower than other objects. The authors explain this is due to slipping on its round edges. This highlights a limitation in handling objects with challenging contact surfaces or geometry.

6.3. Analysis: Diverse Grasps

The following figure (Figure 4 from the original paper) shows diverse grasping strategies:

Figure 4: (Top rows) Grasping different objects using both hands. (Bottom) Diverse grasps on the same object.

Analysis:

  • Figure 4 visually demonstrates Omnigrasp's ability to learn diverse grasping strategies based on the object's shape and initial pose.
  • The top rows show the humanoid using both hands to grasp different objects, adapting to their specific geometry.
  • The bottom row illustrates diverse grasps on the same object (a box). This indicates that the policy can find multiple stable grasping configurations, not just a single canonical one.
  • The emergence of bimanual manipulation for larger objects or heavier objects is a notable learned behavior, which comes from pre-grasps in GRAB that utilize both hands. This showcases the advantages of the simulation environment and the reward system in facilitating skill learning.

6.4. Analysis: Robustness and Potential for Sim-to-real Transfer

The authors study the robustness of Omnigrasp to observation noise, a crucial factor for sim-to-real transfer.

The following are the results from Table 5 of the original paper:

GRAB-Goal-Test (Cross-Object, 140 sequences, 5 unseen objects)

| Method | Noise Scale | Succgrasp ↑ | Succtraj ↑ | TTR ↑ | Epos ↓ | Erot ↓ | Eacc ↓ | Evel ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Omnigrasp | 0 | 100% | 94.1% | 99.6% | 30.2 | 0.93 | 5.4 | 4.7 |
| Omnigrasp | 0.01 | 100% | 91.4% | 99.2% | 34.8 | 1.1 | 15.6 | 11.5 |

GRAB-IMoS-Test (Cross-Subject, 92 sequences, 44 objects)

| Method | Noise Scale | Succgrasp ↑ | Succtraj ↑ | TTR ↑ | Epos ↓ | Erot ↓ | Eacc ↓ | Evel ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Omnigrasp | 0 | 98.9% | 90.5% | 99.8% | 27.9 | 0.97 | 6.3 | 5.4 |
| Omnigrasp | 0.01 | 99.5% | 86.2% | 99.6% | 32.5 | 1.0 | 17.9 | 13.2 |

Analysis:

  • Robustness to Noise: Adding uniform random noise of scale 0.01 (a typical value in sim-to-real RL [28]) to task observation (positions, object latent codes, etc.) and proprioception results in a graceful degradation of performance rather than catastrophic failure.
    • Grasp success remains 100% on cross-object test and 99.5% on cross-subject test, indicating strong robustness in establishing initial contact.
    • Trajectory success rate decreases from 94.1% to 91.4% (cross-object) and 90.5% to 86.2% (cross-subject).
    • Position error and rotation error show slight increases.
    • The drop is more prominent in the acceleration and velocity metrics ($E_{\text{acc}}$ and $E_{\text{vel}}$), which are more sensitive to noise.
  • Potential for Sim-to-real Transfer: While Omnigrasp is not yet ready for real-world deployment, its relative robustness to noise, even without specific noise training, suggests that a similar system design, combined with sim-to-real modifications (e.g., domain randomization, distilling into a vision-based policy), holds potential for transfer to physical humanoid robots.
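As a rough illustration of the robustness test described above, the snippet below perturbs a flat observation vector with uniform noise of scale 0.01; treating every observation entry identically is a simplifying assumption for this sketch, not a detail taken from the paper.

```python
import numpy as np

def perturb_observation(obs: np.ndarray, scale: float = 0.01, seed=None) -> np.ndarray:
    """Add uniform noise in [-scale, scale] to every entry of the observation."""
    rng = np.random.default_rng(seed)
    return obs + rng.uniform(-scale, scale, size=obs.shape)

# Example: perturb a dummy 100-dimensional observation at the test-time noise scale.
noisy_obs = perturb_observation(np.zeros(100), scale=0.01)
```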

7. Conclusion & Reflections

7.1. Conclusion Summary

Omnigrasp presents a significant advancement in simulated humanoid control for dexterous object manipulation. By introducing PULSE-X, a universal dexterous humanoid motion representation, the method effectively provides a human-like motion prior as an action space for Reinforcement Learning. This key insight enables Omnigrasp to learn grasping policies for a vast number of diverse objects (over 1200) and follow complex, randomly generated trajectories with high success rates. Crucially, it achieves this without requiring paired full-body human-object motion data for training, relying instead on synthetic pre-grasps and trajectories. The system demonstrates strong scalability, generalization to unseen objects, and robustness to observation noise, marking a state-of-the-art performance in the field.

7.2. Limitations & Future Work

The authors acknowledge several limitations and suggest future research directions:

  • Precision in Bimanual Manipulations: While Omnigrasp supports bimanual motion, it does not yet support precise in-hand manipulations or articulations. The current 6DoF inputs provide a coarse orientation target for the hand, which is insufficient for detailed in-hand dexterous tasks.
  • Trajectory Following Success Rate: Although state-of-the-art, the trajectory following success rate could be improved, as objects can still be dropped or not picked up consistently.
  • Specific Grasp Types: The current method does not allow for achieving specific types of grasps on an object, which might require additional inputs like desired contact points or grasp configurations.
  • Human-Level Dexterity: Human-level dexterity, even in simulation, remains a challenging goal.
  • Future Work Directions:
    • Trajectory Following Diversity: Further improving trajectory following and being able to pick up more object categories.
    • Improved Motion Representation: Investigating improvements to the humanoid motion representation. Separating the motion representation for hands and body [3, 6] could potentially lead to further enhancements in dexterity and coordination.
    • Effective Object Representation: Developing an object representation that does not rely on a canonical object pose and generalizes to vision-based systems. This would be valuable for the model to generalize to even more objects and real-world scenarios.

7.3. Personal Insights & Critique

This paper presents a highly inspiring approach to dexterous humanoid manipulation. The core idea of leveraging a pre-trained, universal motion latent space (PULSE-X) as a structured action space for RL is incredibly powerful. It effectively transforms a high-dimensional, sparse-reward problem into a more manageable low-dimensional, dense-reward problem, a classic technique for RL efficiency but applied here in a novel and effective way to complex full-body dexterity.

The ability to train a generalizable policy with randomly generated trajectories and synthetic pre-grasps, largely independent of paired MoCap data, is a monumental step towards scalability. This addresses one of the most significant bottlenecks in human-object interaction research. The emergent bimanual manipulation and diverse grasping strategies are fascinating examples of RL learning intelligent behaviors when given a proper action space and reward signal.

Critically, the paper's robustness analysis against observation noise is a crucial indicator of its potential for sim-to-real transfer. While not explicitly demonstrated, the architectural choices (e.g., residual action space, recurrent policy) suggest a strong foundation for future work in this direction.

Potential areas for improvement or further exploration might include:

  • Dynamic Re-grasping and In-hand Manipulation: The current system excels at picking up and carrying. Extending it to re-grasp objects in hand, rotate them, or perform more intricate manipulations (e.g., using fingertips) would be a natural next step, possibly requiring a more granular hand motion representation or more explicit contact modeling.

  • Environmental Awareness: The current humanoid lacks environmental awareness beyond the object. Integrating vision-based perception of the environment (e.g., detecting obstacles, surfaces) would make the trajectory following more intelligent and adaptable to complex settings.

  • Human-defined Grasps: While synthetic pre-grasps are effective, allowing a user to define a desired grasp type (e.g., power grasp, precision pinch) could add artistic control for animation or task specificity for robotics.

  • Object Properties Generalization: The paper mentions that too large or too small objects can cause failures. Further research into adaptive grasping strategies or multi-modal object representations that handle extreme size variations more robustly could be valuable.

    Overall, Omnigrasp provides a compelling blueprint for how to tackle complex full-body dexterous tasks in simulation through intelligent motion priors and data-efficient RL. Its implications for animation, VR/AR, and humanoid robotics are substantial.
