SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

Published: 11/11/2025

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The SONIC framework scales model capacity, data, and compute for natural humanoid control, utilizing diverse motion-capture data for dense supervision. It features real-time motion planning and multi-interface support, demonstrating significant performance gains from scaling.

Abstract

Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited behavior set, and are trained on a handful of GPUs over several days. We show that scaling up model capacity, data, and compute yields a generalist humanoid controller capable of creating natural and robust whole-body movements. Specifically, we posit motion tracking as a natural and scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (from 1.2M to 42M parameters), dataset volume (over 100M frames, 700 hours of high-quality motion data), and compute (9k GPU hours). Beyond demonstrating the benefits of scale, we show the practical utility of our model through two mechanisms: (1) a real-time universal kinematic planner that bridges motion tracking to downstream task execution, enabling natural and interactive control, and (2) a unified token space that supports various motion input interfaces, such as VR teleoperation devices, human videos, and vision-language-action (VLA) models, all using the same policy. Scaling motion tracking exhibits favorable properties: performance improves steadily with increased compute and data diversity, and learned representations generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.

In-depth Reading

1. Bibliographic Information

1.1. Title

The central topic of the paper is "SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control." It focuses on enhancing the capabilities and generality of humanoid robot control by significantly increasing the scale of motion tracking through larger models, datasets, and computational resources.

1.2. Authors

The paper lists a large team of authors, primarily affiliated with Nvidia, indicating a strong industry research effort.

  • Co-First Authors: Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li
  • Core Contributors: Sirui Chen, Fernando Castañeda, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, Xingye Da
  • Additional Contributors: Runyu Ding, Cyrus Hogg, Lina Song, Edy Lim, Eugene Jeong, Tairan He, Haoru Xue, Wenli Xiao, Zi Wang, Simon Yuen, Jan Kautz, Yan Chang, Umar Iqbal, Linxi "Jim" Fan
  • Project Lead: Yuke Zhu
  • Affiliation: Nvidia

1.3. Journal/Conference

The paper is published at https://arxiv.org/abs/2511.07820. arXiv is a preprint server widely used in academia for rapid dissemination of research findings before formal peer review and publication in journals or conferences. While arXiv is not a peer-reviewed venue itself, papers posted there often precede publications in top-tier conferences like NeurIPS, ICML, and ICLR, or robotics conferences like RSS, IROS, and ICRA, which are highly reputable and influential in the fields of artificial intelligence, machine learning, and robotics.

1.4. Publication Year

The paper was published on arXiv at 2025-11-11T04:37:40.000Z, which corresponds to November 11, 2025.

1.5. Abstract

The paper addresses the lack of scaling gains in humanoid control compared to other foundation models (e.g., large language models). It argues that current neural controllers for humanoids are modest in size, behavior, and training resources. The authors propose that scaling up model capacity, data, and compute can lead to a generalist humanoid controller capable of natural and robust whole-body movements. They identify motion tracking as a natural and scalable task for humanoid control, utilizing dense supervision from diverse motion-capture data to learn human motion priors without manual reward engineering.

Their approach involves building a foundation model for motion tracking by scaling along three axes: network size (from 1.2M to 42M parameters), dataset volume (over 100M frames, 700 hours of high-quality motion data), and compute (9k GPU hours). Beyond demonstrating scaling benefits, the model shows practical utility through two mechanisms: (1) a real-time universal kinematic planner for interactive control and downstream task execution, and (2) a unified token space that supports various input interfaces (VR teleoperation, human videos, vision-language-action (VLA) models) using the same policy. The paper concludes that scaling motion tracking steadily improves performance, generalizes to unseen motions, and establishes a practical foundation for humanoid control.

  • Official Source: https://arxiv.org/abs/2511.07820
  • PDF Link: https://arxiv.org/pdf/2511.07820v1.pdf
  • Publication Status: This paper is currently a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the lack of scalability in humanoid robot control. In recent years, fields like natural language processing and computer vision have seen monumental progress with billion-parameter foundation models trained on massive datasets using thousands of GPUs. These models demonstrate impressive capabilities like generalization, robustness, and emergent properties that smaller models cannot achieve. However, this trend has not translated to humanoid control.

Current neural controllers for humanoids typically remain small (e.g., three-layer MLPs with a few million parameters), are trained on limited hardware (e.g., a single GPU over days), and target a narrow set of behaviors. A significant challenge in humanoid control is the reliance on manual reward engineering for each specific task. For instance, a controller trained for walking might not be able to dance without extensive re-engineering of its reward function. This task-specific reward engineering is time-consuming, labor-intensive, and limits the general applicability of learned policies, making it difficult to develop flexible controllers that can handle the diverse range of real-world applications required for a truly versatile humanoid.

The paper's entry point or innovative idea is to identify motion tracking as a natural and scalable foundation task for humanoid control. Unlike goal-oriented tasks that require intricate reward designs, motion tracking can leverage motion-capture (MoCap) data which provides dense, frame-by-frame supervision. This eliminates the need for manual reward engineering and allows the robot to acquire human motion priors directly from readily available, diverse human motion datasets covering a wide range of behaviors (walking, running, dancing, sports, object interaction). By demonstrating that motion tracking can be scaled up significantly, the authors propose it as a practical path towards building generalist humanoid controllers.

2.2. Main Contributions / Findings

The paper makes several significant contributions to the field of humanoid control:

  • Scalable Foundation Task for Humanoid Control: It identifies and demonstrates motion tracking as a scalable foundation task that exhibits favorable scaling properties with increasing compute and data diversity. This is a crucial shift from traditional reward-engineered approaches.

  • Supersizing Humanoid Control: The authors scale up humanoid control to unprecedented levels, using 9k GPU hours and 100 million frames (700 hours) of high-quality motion sequences. This massive scale results in a universal tracking capability across diverse human behaviors, going beyond the limited scope of prior work.

  • Kinematic Motion Generation System: Introduction of a real-time universal kinematic planner. This system bridges motion tracking to downstream task execution and enables natural and interactive control. It allows for goal-directed tasks like interactive locomotion, boxing, squatting, kneeling, and crawling, without retraining the core policy.

  • Unified Token Space for Multimodal Inputs: Development of a unified token space that supports various multimodal control inputs. This single policy can handle teleoperation (VR, 3-point), human videos, motion commands, text, music, and even vision-language-action (VLA) foundation models. This unified approach facilitates cross-embodiment motion tracking and seamless integration with high-level AI models.

  • Comprehensive Evaluation and Robustness:

    • Empirical Scaling Trends: Demonstrated consistent performance improvements with increased data, model capacity, and compute.
    • Zero-Shot Transfer: Showed robust generalization to previously unseen motions in simulation.
    • Robust Sim-to-Real Deployment: Achieved 100% success rate on 50 diverse motion trajectories on a physical Unitree G1 humanoid robot in a zero-shot manner.
    • Integration with Foundation Models: Successfully integrated with a GR00T N1.5 VLA model for mobile manipulation tasks, showcasing compatibility and robustness.
  • Practical System for Real-World Deployments: The work presents not just a scaled model, but a practical system combining the scaled tracker with a real-time kinematic planner and a universal token space, making it usable for real-time interactive control and teleoperation.

    The key conclusions are that scaling motion tracking indeed yields reliable, general whole-body control, and when paired with a robust planner and a universal token space, it becomes a practical foundation for advanced humanoid autonomy, capable of bridging the gap between high-level reasoning and low-level control.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the paper "SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control," a reader should be familiar with several core concepts in robotics, artificial intelligence, and machine learning.

  • Humanoid Control:
    • Concept: This refers to the systems and algorithms that enable humanoid robots (robots designed to resemble the human body, typically with two legs, two arms, and a torso) to move, balance, and interact with their environment. It's a complex challenge due to the high degrees of freedom, dynamic stability requirements, and real-world uncertainties.
    • Relevance: The paper aims to create a generalist controller for such robots, moving beyond simple, pre-programmed movements to more natural and adaptable behaviors.
  • Motion Tracking (or Motion Imitation):
    • Concept: A control paradigm where a robot learns to replicate or follow a given reference motion, typically from human data. The goal is for the robot's pose, trajectory, and dynamics to closely match the reference.
    • Relevance: SONIC posits this as the core, scalable task. Instead of defining explicit goal-oriented reward functions, the robot is trained to simply track human motion as accurately as possible. This implicitly teaches the robot how to move naturally and robustly.
  • Reinforcement Learning (RL):
    • Concept: A type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal. It involves trial-and-error learning. Key components include:
      • Agent: The learning entity (e.g., the robot controller).
      • Environment: The world the agent interacts with (e.g., simulation, real world).
      • State ($s$): The current situation or observation of the environment.
      • Action ($a$): What the agent can do in a given state.
      • Reward ($r$): A scalar feedback signal indicating the desirability of an action taken in a state.
      • Policy ($\pi$): A function that maps states to actions.
    • Relevance: The paper uses Proximal Policy Optimization (PPO) (explained below) to train its policies.
  • Proximal Policy Optimization (PPO):
    • Concept: A popular Reinforcement Learning algorithm that balances sample efficiency, performance, and ease of implementation. It works by iteratively updating a policy while ensuring that the new policy does not deviate too much from the old policy, which helps to stabilize training.
    • Relevance: SONIC trains its control policy using PPO to maximize cumulative rewards.
  • Foundation Models:
    • Concept: Large-scale machine learning models, often neural networks, trained on vast amounts of data using significant computational resources. They are designed to be highly versatile and can be adapted (fine-tuned) for a wide range of downstream tasks. Examples include large language models (like GPT) and large vision models.
    • Relevance: The paper draws a direct comparison between the scaling success of foundation models in other domains and the lack thereof in humanoid control, motivating its supersizing approach.
  • Motion Capture (MoCap) Data:
    • Concept: Digitized recordings of human or animal movement, typically captured using specialized cameras and markers or IMU sensors. This data provides high-fidelity, frame-by-frame information about body pose, joint angles, and velocities.
    • Relevance: This is the dense supervision used by SONIC to train its motion tracking policies, eliminating the need for manual reward engineering. The paper specifically mentions in-house datasets and existing large datasets like AMASS and LaFAN.
  • Sim-to-Real Transfer:
    • Concept: The process of training a robot control policy in a simulated environment and then deploying it successfully on a physical robot in the real world. This is challenging because of the sim-to-real gap – the discrepancies between simulation dynamics and real-world physics.
    • Relevance: Domain randomization (explained below) is a common technique to bridge this gap, and SONIC demonstrates successful zero-shot sim-to-real deployment.
  • Neural Networks (NNs) / Multi-Layer Perceptrons (MLPs):
    • Concept: MLPs are fundamental neural networks consisting of multiple layers of interconnected neurons. Each neuron takes inputs, applies a weighted sum and an activation function, and produces an output. MLPs are universal function approximators.
    • Relevance: SONIC uses MLPs extensively for its encoders, decoders, actor (policy network), and critic (value network).
  • Tokenization & Latent Space:
    • Concept: In machine learning, tokenization is the process of breaking down raw data (like text, images, or in this case, motion commands) into smaller, discrete units called tokens. A latent space is a lower-dimensional representation of the input data, where similar data points are grouped closer together. Quantization is the process of mapping continuous values to a finite set of discrete values (like creating discrete tokens from a continuous latent space).
    • Relevance: SONIC introduces a unified token space and vector quantizer (FSQ) to represent diverse motion commands (robot, human, hybrid) in a common, discrete format, enabling a single policy to handle multiple input modalities.
  • Kinematic Planner:
    • Concept: A system that plans robot movements based on kinematics (the study of motion without considering forces), often generating a sequence of poses or joint angles to achieve a desired configuration or trajectory, usually with respect to a user input or high-level command.
    • Relevance: SONIC uses a generative kinematic motion planner to convert user commands (e.g., velocity, style) into short-horizon reference motions that the core motion tracking policy can then follow.
  • Vision-Language-Action (VLA) Models:
    • Concept: Foundation models that can process vision (images/videos), language (text commands), and generate actions for robotic control. They aim to provide high-level, semantic understanding and control.
    • Relevance: SONIC demonstrates compatibility with a VLA model (GR00T N1.5), showing how its low-level, robust control can be layered with high-level reasoning from advanced AI models.

3.2. Previous Works

The paper references several prior works, mainly in the domain of motion tracking and humanoid control. Understanding these helps to contextualize SONIC's advancements.

  • Prior Motion Trackers:
    • Any2Track (Zhang et al., 2025), BeyondMimic (Liao et al., 2025), GMT (Chen et al., 2025): These are state-of-the-art motion tracking methods that SONIC compares itself against.
      • Context: These methods typically focus on tracking human motion in simulation, often trained on smaller datasets like LaFAN or AMASS. Their primary goal is to accurately replicate a given motion.
      • Differentiation: SONIC distinguishes itself by its unprecedented scale (100M+ frames, 9k GPU hours), universal tracking capabilities across diverse human behaviors, and practical utility in downstream tasks beyond just motion tracking, as well as its ability to handle multiple input modalities. SONIC also shows strong sim-to-real performance, which is often a challenge for prior works.
    • TWIST (Ze et al., 2025): Another prior work, mentioned for its evaluation dataset usage (a subset of AMASS).
  • Motion Capture Datasets:
    • AMASS (Mahmood et al., 2019): A very large and diverse dataset of human motion capture (incorporating CMU MoCap, MPI HDM05, KIT, and others), retargeted here for humanoids. SONIC's evaluation uses a subset of this data, highlighting its generalization to unseen motions, and GMT is noted to be trained on AMASS.
    • LaFAN (Harvey et al., 2020): A dataset of human locomotion motions, often used for training motion trackers. SONIC mentions using LaFAN as a smaller baseline for comparison.
  • Motion Generation Models:
    • GENMO (Li et al., 2025): A multimodal motion generation model that SONIC adopts to support multimodal conditioning (video, text, audio) for human motion.
      • Context: GENMO can synthesize complete human motion trajectories from various inputs, effectively acting as a "human motion generator."
      • Relevance to SONIC: SONIC integrates GENMO to translate high-level multimodal inputs into human motion references, which SONIC's human motion encoder then tracks. This allows the humanoid to respond to text commands, music, or video demonstrations.
  • Robot Platforms & Simulators:
    • Unitree G1 (Unitree, 2024): The specific physical humanoid robot platform used for real-world deployment and evaluation in the paper.
    • Isaac Lab (NVIDIA et al., 2025): The simulation platform used for training SONIC models. Isaac Lab is a scalable and GPU-accelerated robotics simulation environment.
    • MuJoCo (Todorov et al., 2012): A physics engine commonly used in robotics research, particularly for comparing with baselines where MuJoCo implementations are standard.
  • Technical Components:
    • SMPL (Loper et al., 2015): A statistical 3D human body model used to represent human pose estimates (e.g., from VR tracking).
    • FSQ (Mentzer et al., 2023): Finite Scalar Quantization, a vector quantization technique used by SONIC to convert continuous latent representations into discrete universal tokens.
    • GR00T N1.5 (Bjorck et al., 2025): A vision-language-action (VLA) foundation model that SONIC integrates with to demonstrate autonomous control, acting as a high-level planner providing commands to SONIC's policy.

3.3. Technological Evolution

The field of humanoid control has evolved significantly:

  1. Early Robotics (Pre-2000s): Dominated by hard-coded, model-based control strategies (e.g., inverse kinematics, zero-moment point control). These methods were precise for known tasks but lacked adaptability to unknown environments or complex, dynamic movements.

  2. Model-Free Reinforcement Learning (2000s-2010s): With advances in computing, RL emerged, allowing robots to learn behaviors directly from experience. However, early RL often struggled with high-dimensional action spaces and required extensive reward engineering. Sim-to-real transfer remained a major hurdle.

  3. Deep Reinforcement Learning (2010s-Present): The integration of deep neural networks with RL (deep RL) led to breakthroughs, enabling robots to learn complex policies directly from raw sensor data. Techniques like PPO, SAC, and TRPO became standard. Motion imitation gained traction as a way to leverage MoCap data to train policies for natural movements, reducing manual reward engineering. However, these policies were often task-specific, limited in scale, and prone to the sim-to-real gap.

  4. Foundation Models Era (Late 2010s-Present): Large language models (LLMs) and large vision models demonstrated that scaling up model capacity, data, and compute could lead to emergent capabilities and generalization. This inspired researchers to seek similar scaling laws in robotics.

    This paper's work fits into the latest stage of this evolution, aiming to bring the benefits of foundation models to humanoid control. It seeks to overcome the limitations of previous deep RL approaches by demonstrating that supersizing motion tracking can lead to a generalist controller that is robust, natural, and capable of handling diverse tasks and inputs, bridging the gap between low-level control and high-level VLA reasoning.

3.4. Differentiation Analysis

Compared to the main methods in related work, SONIC's approach presents several core differences and innovations:

  • Scale of Training (Data, Model, Compute):
    • Previous Work: Prior neural controllers for humanoids are typically modest in size (e.g., a few million parameters), trained on limited data (e.g., LaFAN, a subset of AMASS), and consume relatively little compute (e.g., a single GPU for days).
    • SONIC's Innovation: SONIC scales network size to 42M parameters, dataset volume to over 100M frames (700 hours), and compute to 9k GPU hours using 128 GPUs. This massive scale is a primary differentiator, directly linking to improved performance and generalization, a trend not previously established in humanoid control.
  • Generality and Universality:
    • Previous Work: Most prior methods are task-specific, requiring reward engineering or separate models for different behaviors (e.g., one for walking, another for dancing). Motion tracking policies often struggle to generalize beyond their training data or to perform complex downstream tasks.
    • SONIC's Innovation: SONIC aims for a generalist humanoid controller. It establishes motion tracking at scale as a foundation that acquires human motion priors without manual reward engineering. This leads to a universal tracking capability across diverse human behaviors and allows zero-shot transfer to unseen motions and tasks.
  • Multimodal Input and Unified Token Space:
    • Previous Work: Controllers usually accept specific types of input (e.g., joystick commands for locomotion, MoCap streams for imitation). Bridging different input modalities or embodiments often requires separate retargeting modules or specialized policies.
    • SONIC's Innovation: SONIC introduces a unified token space and cross-embodiment learning via specialized encoders. This allows a single policy to accept and seamlessly switch between highly diverse inputs like VR teleoperation, human videos, text commands, music, and even VLA models, treating them all within a shared latent representation. This flexibility is a major step towards truly interactive and adaptable control.
  • Practical Utility and Downstream Task Execution:
    • Previous Work: Many motion trackers primarily demonstrate their ability to imitate motions in simulation, with limited showcasing of their utility for real-world interactive tasks or integration with higher-level systems.
    • SONIC's Innovation: SONIC integrates a real-time universal kinematic planner that bridges motion tracking to downstream task execution. This enables interactive control for navigation, boxing, squatting, crawling, and more, all building on the same scaled motion tracker without retraining. It also demonstrates direct integration with VLA models for mobile manipulation, showcasing a complete pipeline from high-level intent to robust physical execution.
  • Robust Sim-to-Real Deployment:
    • Previous Work: Sim-to-real transfer remains a significant challenge. While some methods achieve it, demonstrating 100% success rates across diverse, unseen real-world motions for a generalist policy is difficult.
    • SONIC's Innovation: SONIC achieves a 100% success rate on 50 diverse real-world motion trajectories on a Unitree G1 robot in a zero-shot manner. This underlines the robustness and reliability gained from its scaling approach and domain randomization.
  • Single-Stage Training Without Distillation:
    • Previous Work: Some approaches might use distillation from expert policies or multi-stage training to achieve certain capabilities.

    • SONIC's Innovation: The universal token space and cross-embodiment learning are achieved through single-stage training, simplifying the learning process and making it more efficient.

      In essence, SONIC differentiates itself by aggressively scaling up a well-established but previously limited task (motion tracking) and equipping it with practical mechanisms (kinematic planner, unified token space) to create a generalist, multimodal, and robust controller suitable for real-world humanoid applications and integration with future foundation models.

4. Methodology

4.1. Principles

The core idea behind SONIC is that motion tracking can serve as a scalable and natural foundation task for humanoid control. Instead of manually engineering complex reward functions for every desired behavior (e.g., walking, dancing, jumping), the robot can learn to perform natural and robust whole-body movements by simply trying to accurately imitate diverse human motion-capture (MoCap) data. This approach bypasses the laborious reward engineering process and leverages the vast amount of existing, high-quality human motion data to acquire human motion priors.

The theoretical basis and intuition are rooted in the observation that other domains of AI have seen emergent capabilities and generalization by scaling up model capacity, data volume, and computational resources. SONIC hypothesizes that this scaling law also applies to humanoid control. By supersizing motion tracking—training larger neural networks on significantly more diverse motion data using extensive compute—the system can learn a more general, robust, and nuanced understanding of humanoid dynamics and natural movement. This "generalist" motion tracker then acts as a low-level, reactive System 1 controller, capable of precise, stable whole-body skills, which can be effectively combined with higher-level System 2 reasoning from kinematic planners or vision-language-action (VLA) models for complex tasks. The introduction of a unified token space is a key innovation to handle heterogeneous inputs (human, robot, hybrid motions) and various modalities (video, text, VR) under a single control policy, facilitating broad applicability.

4.2. Core Methodology In-depth (Layer by Layer)

The SONIC framework consists of several interconnected components designed to achieve universal humanoid motion tracking and control.

4.2.1. Humanoid Motion Dataset

The foundation of SONIC's scaling approach is its massive and diverse motion dataset. The authors collected an in-house motion-capture dataset featuring:

  • Diversity of Subjects: Over 50 subjects, balanced male and female actors, with a range of body heights (mean = 174.3 cm, std = 10.9 cm), approximately following a normal distribution.

  • Diversity of Motions: Includes daily activities, various locomotion styles, combat motions with varied stylistic expressions, sports movements, and object interactions.

  • Scale: Contains 700 hours of human motion, resulting in over 100 million frames at 50 Hz. This is significantly larger than datasets typically used in prior works.

  • Data Characteristics: Clip durations range from 1 to 180 seconds, with most actions performed by multiple subjects and/or multiple takes, providing intrinsic subject variation.

  • Retargeting: The raw human motion data is retargeted to the specific humanoid robot (Unitree G1) using GMR (Araujo et al., 2025). Retargeting is the process of mapping motion data captured from one body (e.g., a human) to another body (e.g., a robot) that may have different proportions or degrees of freedom.

    The following figure shows random samples from their motion dataset:

Figure 7: Random samples from our motion dataset. (The image shows a wide variety of human body poses, ranging from static postures to dynamic actions, supporting research on natural humanoid control.)

4.2.2. Universal Humanoid Motion Tracking

SONIC's central component is a universal humanoid motion tracking framework, illustrated in the overview diagram below. This framework uses a unified control policy to track diverse motion commands across different embodiments (robot, human, hybrid motions) and modalities.

Figure 8: SONIC enables universal humanoid motion tracking through a universal control policy that handles diverse motion commands and modalities. Specialized encoders process robot, human, and hybrid motion commands into a universal token that drives robot control and motion decoders. This cross-embodiment design supports diverse applications including gamepad control, VR teleoperation, whole-body teleoperation, video teleoperation, and multi-modal control from text and music. (The diagram shows the SONIC architecture for motion generation and control: motion generators for interactive control tasks, such as the kinematic motion planner and a human motion generator, feed a unified control policy that decodes and outputs robot motions.)

4.2.2.1. Motion Tracking Formulation

Humanoid motion tracking is formalized as a Markov Decision Process (MDP) $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma \rangle$:

  • $\mathcal{S}$: State space.

  • $\mathcal{A}$: Action space.

  • $\mathcal{T}$: Transition function (describes how the environment changes based on actions).

  • $\mathcal{R}$: Reward function.

  • $\gamma$: Discount factor.

    The policy is trained using Proximal Policy Optimization (PPO) to maximize the expected discounted cumulative reward $\mathbb{E}\left[ \sum_{t=1}^{\bar{T}} \gamma^{t-1} r_t \right]$, where $\bar{T}$ is the episode length, $\gamma$ is the discount factor, and $r_t$ is the reward at time step $t$.
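As a quick numerical illustration of this objective (our sketch, not the authors' code), the following computes the discounted cumulative reward for one episode:

```python
import numpy as np

def discounted_return(rewards: np.ndarray, gamma: float = 0.99) -> float:
    """Compute sum_{t=1}^{T} gamma^(t-1) * r_t for one episode."""
    discounts = gamma ** np.arange(len(rewards))  # gamma^0, gamma^1, ...
    return float(np.dot(discounts, rewards))

# Example: five reward steps from a short tracking episode (made-up values).
print(discounted_return(np.array([1.0, 0.9, 0.8, 0.7, 0.6])))  # ≈ 3.93
```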

4.2.2.2. States

The state representation $\boldsymbol{s}_t$ at time $t$ consists of two main components:

  1. Proprioceptive sensing $\boldsymbol{s}_t^{\mathrm{p}}$: Information about the robot's own body state: $\boldsymbol{s}_t^{\mathrm{p}} \triangleq \left( \mathbf{q}_t, \dot{\mathbf{q}}_t, \boldsymbol{\omega}_t, \boldsymbol{\psi}_t, \mathbf{a}_{t-1} \right)$
    • $\mathbf{q}_t$: Joint positions (current angles of each joint).
    • $\dot{\mathbf{q}}_t$: Joint velocities (rate of change of joint angles).
    • $\boldsymbol{\omega}_t$: Root angular velocity (rotational speed of the robot's main body).
    • $\boldsymbol{\psi}_t$: Gravity vector $\mathbf{g}_t$ expressed in the root frame (provides information about the robot's orientation relative to gravity).
    • $\mathbf{a}_{t-1}$: Previous action taken at time $t-1$.
  2. Motion command $\boldsymbol{s}_t^{\mathrm{g}}$: The target motion the robot needs to track. This can be one of three types:
    • Robot motion $g_r$: A reference motion defined in terms of robot-specific joint angles and root poses.

    • Human motion $g_h$: A reference motion derived from human MoCap data (e.g., SMPL pose estimates).

    • Hybrid motion $g_m$: A combination, typically sparse upper-body keypoints (such as the head and hands) for real-time tracking, combined with lower-body robot motion.

      All state quantities are expressed in the robot's local heading frame to ensure rotation invariance, meaning the policy's understanding of motion is independent of the robot's absolute orientation in the world.
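To make the state layout concrete, here is a minimal sketch (our illustration, not the authors' code) that assembles the proprioceptive vector $\boldsymbol{s}_t^{\mathrm{p}}$ for a robot with 29 actuated joints (the action dimension reported in Table 3); the flat concatenation order is an assumption:

```python
import numpy as np

def build_proprioceptive_state(q, dq, omega, gravity_root, prev_action):
    """Assemble s_t^p = (q_t, dq_t, omega_t, psi_t, a_{t-1}) as one flat vector.

    All quantities are assumed to already be expressed in the robot's local
    heading frame, making the policy input invariant to the robot's yaw.
    """
    return np.concatenate([
        np.asarray(q),             # joint positions, shape (29,)
        np.asarray(dq),            # joint velocities, shape (29,)
        np.asarray(omega),         # root angular velocity, shape (3,)
        np.asarray(gravity_root),  # gravity direction in root frame, shape (3,)
        np.asarray(prev_action),   # previous action (target joints), shape (29,)
    ])

s_p = build_proprioceptive_state(np.zeros(29), np.zeros(29), np.zeros(3),
                                 np.array([0.0, 0.0, -1.0]), np.zeros(29))
assert s_p.shape == (93,)  # 29 + 29 + 3 + 3 + 29
```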

4.2.2.3. Actions

The policy $\pi$ (the neural network controlling the robot) outputs target joint positions $\mathbf{a}_t$ as its actions. These target positions are sent to proportional-derivative (PD) controllers at each joint, which compute the torques needed to drive the joints toward the targets.
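A minimal sketch of the PD control law implied here; the gains `kp` and `kd` are illustrative values of ours, not numbers from the paper:

```python
import numpy as np

def pd_torques(target_q, q, dq, kp=100.0, kd=2.0):
    """Standard PD law for position-commanded joints:
    tau = kp * (a_t - q_t) - kd * dq_t.

    `target_q` is the policy action a_t (target joint positions); kp and kd
    are illustrative gains, not values reported in the paper.
    """
    return kp * (np.asarray(target_q) - np.asarray(q)) - kd * np.asarray(dq)

tau = pd_torques(target_q=np.zeros(29), q=0.1 * np.ones(29), dq=np.zeros(29))
```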

4.2.2.4. Rewards

The reward function $r_t$ is designed to guide the policy towards accurate motion tracking while penalizing undesirable robot behaviors. Following Liao et al. (2025), it combines a tracking reward and penalty terms: $r_t = \mathcal{R}\big(s_t^{\mathrm{p}}, s_t^{\mathrm{g}}\big) + \mathcal{P}\big(s_t^{\mathrm{p}}, \mathbf{a}_t\big)$

  • $\mathcal{R}(s_t^{\mathrm{p}}, s_t^{\mathrm{g}})$: The tracking term, which minimizes errors between the robot's current proprioceptive state $s_t^{\mathrm{p}}$ and the target motion command $s_t^{\mathrm{g}}$. This includes minimizing errors in:
    • Root position.
    • Root orientation (6D rotation representation).
    • Body link positions (relative to the root).
    • Body link orientations (relative to the root).
    • Body link linear velocities.
    • Body link angular velocities.
  • $\mathcal{P}(s_t^{\mathrm{p}}, \mathbf{a}_t)$: The penalty term, which discourages negative behaviors such as:
    • Abrupt action changes (action rate).

    • Violations of joint limits.

    • Undesired contacts (e.g., unintended body parts touching the ground).

      The detailed reward design, including specific equations and weights, is presented in Table 1. Orientations are represented as 6D rotations (Zhou et al., 2019), a continuous representation that avoids the singularities of Euler angles and the discontinuities quaternions can introduce when used as network outputs.

The following are the results from Table 1 of the original paper:

| Reward term | Equation | Weight |
| --- | --- | --- |
| Tracking rewards $\mathcal{R}(s^{\mathrm{p}}, s^{\mathrm{g}})$ | | |
| Root orientation | $r_{ori}(t) = \exp(-k_{ori} \cdot \lVert ori^g_{root} - ori^p_{root} \rVert^2)$ | 0.5 |
| Body link pos (rel.) | $r_{pos}(t) = \exp(-k_{pos} \cdot \frac{1}{\lvert B \rvert} \sum_{b \in B} \lVert pos^g_{b,rel} - pos^p_{b,rel} \rVert^2)$ | 1.0 |
| Body link ori (rel.) | $r_{body\_ori}(t) = \exp(-k_{ori} \cdot \frac{1}{\lvert B \rvert} \sum_{b \in B} \lVert ori^g_{b,rel} - ori^p_{b,rel} \rVert^2)$ | 1.0 |
| Body link lin. vel | $r_{lin\_vel}(t) = \exp(-k_{vel} \cdot \frac{1}{\lvert B \rvert} \sum_{b \in B} \lVert vel^g_{b,lin} - vel^p_{b,lin} \rVert^2)$ | 1.0 |
| Body link ang. vel | $r_{ang\_vel}(t) = \exp(-k_{ang} \cdot \frac{1}{\lvert B \rvert} \sum_{b \in B} \lVert ang^g_{b,vel} - ang^p_{b,vel} \rVert^2)$ | 1.0 |
| Penalty terms $\mathcal{P}(s^{\mathrm{p}}, a_t)$ | | |
| Action rate | $r_{act}(t) = -\lambda_{act} \cdot \lVert a_t - a_{t-1} \rVert^2$ | -0.1 |
| Joint limit | $r_{joint\_limit}(t) = -\lambda_{joint} \cdot \sum_j \left( \mathbb{I}(q_j < q_{j,min}) + \mathbb{I}(q_j > q_{j,max}) \right)$ | -10.0 |
| Undesired contacts | $r_{contact}(t) = -\lambda_{contact} \cdot \sum_{c \in \{ankles, wrists\}} \mathbb{I}(\lVert force_c \rVert > \tau_{contact})$ | -0.1 |

Symbol Explanation for Table 1:

  • $r_{ori}(t)$: Reward for root orientation tracking at time $t$.
  • $k_{ori}$: Weighting coefficient for orientation error.
  • $ori^g_{root}$: Goal root orientation (from the target motion command).
  • $ori^p_{root}$: Proprioceptive (current) root orientation.
  • $\lVert \cdot \rVert^2$: Squared Euclidean norm of the difference.
  • $r_{pos}(t)$: Reward for body link position tracking.
  • $k_{pos}$: Weighting coefficient for position error.
  • $\lvert B \rvert$: Number of tracked body links in set $B$.
  • $pos^g_{b,rel}$: Goal relative position of body link $b$.
  • $pos^p_{b,rel}$: Proprioceptive relative position of body link $b$.
  • $r_{body\_ori}(t)$: Reward for body link orientation tracking.
  • $r_{lin\_vel}(t)$: Reward for body link linear velocity tracking.
  • $k_{vel}$: Weighting coefficient for linear velocity error.
  • $vel^g_{b,lin}$: Goal linear velocity of body link $b$.
  • $vel^p_{b,lin}$: Proprioceptive linear velocity of body link $b$.
  • $r_{ang\_vel}(t)$: Reward for body link angular velocity tracking.
  • $k_{ang}$: Weighting coefficient for angular velocity error.
  • $ang^g_{b,vel}$: Goal angular velocity of body link $b$.
  • $ang^p_{b,vel}$: Proprioceptive angular velocity of body link $b$.
  • $r_{act}(t)$: Penalty for action rate (abrupt changes in actions).
  • $\lambda_{act}$: Weighting coefficient for the action rate penalty.
  • $a_t$: Action at time $t$ (target joint positions).
  • $a_{t-1}$: Action at time $t-1$.
  • $r_{joint\_limit}(t)$: Penalty for violating joint limits.
  • $\lambda_{joint}$: Weighting coefficient for the joint limit penalty.
  • $\mathbb{I}(\cdot)$: Indicator function, which is 1 if the condition is true and 0 otherwise.
  • $q_j$: Current position of joint $j$.
  • $q_{j,min}$: Minimum allowed position for joint $j$.
  • $q_{j,max}$: Maximum allowed position for joint $j$.
  • $r_{contact}(t)$: Penalty for undesired contacts (e.g., non-foot/hand body parts touching the ground).
  • $\lambda_{contact}$: Weighting coefficient for the contact penalty.
  • $\{ankles, wrists\}$: Set of body parts checked for undesired contacts.
  • $\lVert force_c \rVert$: Magnitude of the contact force on body part $c$.
  • $\tau_{contact}$: Contact force threshold.
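The tracking terms in Table 1 all share the same exponentiated-error form. A minimal sketch of that shared form (the coefficient `k` and the error values are illustrative, not from the paper):

```python
import numpy as np

def exp_tracking_reward(goal, actual, k):
    """Generic Table-1 tracking term: exp(-k * (1/|B|) * sum_b ||err_b||^2).

    `goal` and `actual` hold per-body-link quantities with shape
    (num_links, 3); the mean over links implements the 1/|B| average.
    """
    sq_err = np.sum((np.asarray(goal) - np.asarray(actual)) ** 2, axis=-1)
    return np.exp(-k * np.mean(sq_err))

# Example: body-link position term (weight 1.0) with a 1 cm error per link.
r_pos = 1.0 * exp_tracking_reward(np.zeros((13, 3)),
                                  0.01 * np.ones((13, 3)), k=100.0)
```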

4.2.2.5. Domain Randomization

To enhance robustness and generalization across diverse scenarios and bridge the sim-to-real gap, systematic domain randomization is applied during training. This involves varying physical parameters and applying perturbations.

  • Physical Parameters:
    • Friction coefficients ($\mu_s$, $\mu_d$): Static and dynamic friction.
    • Restitution coefficient ($e$): Bounciness of collisions.
    • Default joint positions ($q_0$): Small variations in initial joint states.
    • Base center-of-mass (COM) position: Variations in the robot's mass distribution.
  • Perturbations:
    • Root linear and angular velocities: Periodic application of random external pushes to simulate real-world disturbances.

    • Motion perturbation to target motion commands ($s_t^{\mathrm{g}}$): Jitter in target positions, orientations, and velocities.

The following are the results from Table 2 of the original paper:

| Domain Randomization | Sampling Distribution |
| --- | --- |
| Physical parameters | |
| Static friction coefficient | $\mu_s \sim \mathcal{U}[0.3, 1.6]$ |
| Dynamic friction coefficient | $\mu_d \sim \mathcal{U}[0.3, 1.2]$ |
| Restitution coefficient | $e \sim \mathcal{U}[0, 0.5]$ |
| Default joint positions | $q_0 \sim q_0 + \mathcal{U}[-0.01, 0.01]$ |
| Base COM offset (x, y, z) | $\Delta x \sim \mathcal{U}[-0.075, 0.075]$, $\Delta y \sim \mathcal{U}[-0.1, 0.1]$, $\Delta z \sim \mathcal{U}[-0.1, 0.1]$ |
| Root velocity perturbations (external pushes) | |
| Root linear vel (x, y, z) | $v_x \sim \mathcal{U}[-0.5, 0.5]$, $v_y \sim \mathcal{U}[-0.5, 0.5]$, $v_z \sim \mathcal{U}[-0.2, 0.2]$ |
| Push duration | $\Delta t \sim \mathcal{U}[1, 3]\,\mathrm{s}$ |
| Root angular vel | $\omega_{roll} \sim \mathcal{U}[-0.52, 0.52]$, $\omega_{pitch} \sim \mathcal{U}[-0.52, 0.52]$, $\omega_{yaw} \sim \mathcal{U}[-0.78, 0.78]$ |
| Target motion perturbations ($s^{\mathrm{g}}$) | |
| Target position jitter | $\Delta p_g \sim \mathcal{U}[-0.05, 0.05]^3$ (x, y: $\pm 0.05$; z: $\pm 0.01$) |
| Target orientation jitter | $\Delta \phi_{roll}, \Delta \phi_{pitch} \sim \mathcal{U}[-0.1, 0.1]$, $\Delta \phi_{yaw} \sim \mathcal{U}[-0.2, 0.2]$ |
| Target linear vel jitter | $\Delta v_g \sim \mathcal{U}[-0.5, 0.5]^3$ (x, y: $\pm 0.5$; z: $\pm 0.2$) |
| Target angular vel jitter | $\Delta \omega_{roll}, \Delta \omega_{pitch} \sim \mathcal{U}[-0.52, 0.52]$, $\Delta \omega_{yaw} \sim \mathcal{U}[-0.78, 0.78]$ |
| Target joint jitter | $\Delta q \sim \mathcal{U}[-0.1, 0.1]$ |

Symbol Explanation for Table 2:

  • $\mathcal{U}[a, b]$: Uniform distribution between $a$ and $b$.
  • $\mu_s$: Static friction coefficient.
  • $\mu_d$: Dynamic friction coefficient.
  • $e$: Restitution coefficient.
  • $q_0$: Default joint positions.
  • $\Delta x, \Delta y, \Delta z$: Offset of the base COM (center of mass) in the x, y, z directions.
  • $v_x, v_y, v_z$: Linear velocity components for root pushes.
  • $\Delta t$: Duration of root pushes.
  • $\omega_{roll}, \omega_{pitch}, \omega_{yaw}$: Angular velocity components (roll, pitch, yaw) for root pushes.
  • $\Delta p_g$: Jitter applied to the target position.
  • $\Delta \phi_{roll}, \Delta \phi_{pitch}, \Delta \phi_{yaw}$: Jitter applied to the target orientation (roll, pitch, yaw).
  • $\Delta v_g$: Jitter applied to the target linear velocity.
  • $\Delta \omega_{roll}, \Delta \omega_{pitch}, \Delta \omega_{yaw}$: Jitter applied to the target angular velocity.
  • $\Delta q$: Jitter applied to target joint positions.
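A minimal sketch of how one environment's physical parameters could be drawn from the Table 2 ranges at episode reset (the dictionary field names are ours, not the authors' code):

```python
import numpy as np

def sample_physical_params(rng: np.random.Generator, num_joints: int = 29):
    """Draw one set of domain-randomized physical parameters (Table 2 ranges)."""
    return {
        "static_friction":  rng.uniform(0.3, 1.6),
        "dynamic_friction": rng.uniform(0.3, 1.2),
        "restitution":      rng.uniform(0.0, 0.5),
        "joint_offsets":    rng.uniform(-0.01, 0.01, size=num_joints),
        "com_offset": np.array([rng.uniform(-0.075, 0.075),   # x
                                rng.uniform(-0.1, 0.1),       # y
                                rng.uniform(-0.1, 0.1)]),     # z
    }

params = sample_physical_params(np.random.default_rng(seed=0))
```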

4.2.2.6. Universal Control Policy Architecture

The Universal Control Policy is the core of SONIC, characterized by its ability to handle multiple motion command types from different embodiments through a unified encoder-decoder architecture.

  • Cross-Embodiment Learning: Achieved by specialized encoders that process heterogeneous inputs (human, robot, hybrid motions) into a shared latent representation. This representation is then quantized into a universal token, which drives a common robot control decoder. This design enables the robot to learn from various sources without needing a separate retargeting module for each.

Encoders: Three specialized encoders (Multi-Layer Perceptrons, as detailed in Table 3) process distinct motion command types into the shared latent space:

  1. Robot motion encoder $\mathcal{E}_r$: Encodes robot joint positions and velocities over $F_r$ future frames with a frame interval $\Delta t_r$.

  2. Human motion encoder $\mathcal{E}_h$: Encodes 3D human joint positions (e.g., in SMPL format) over $F_h$ future frames with a frame interval $\Delta t_h$.

  3. Hybrid motion encoder $\mathcal{E}_m$: Encodes sparse upper-body keypoints (head and hands) of the current frame (for real-time tracking), combined with lower-body robot motion over $F_m$ future frames with a frame interval $\Delta t_m$.

    Multi-frame inputs allow the policy to anticipate future motion and improve robustness.

Quantizer: The encoded latent representation from the encoders is quantized into a universal token $\boldsymbol{z}$ using a vector quantizer, specifically FSQ (Mentzer et al., 2023). The universal token is a $D_z$-dimensional vector with $L_z$ quantization levels per dimension. This discretization creates a shared, discrete representation space; a minimal sketch of FSQ is given below.
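For intuition, here is a minimal FSQ sketch in PyTorch following Mentzer et al. (2023): each latent dimension is bounded and rounded to a small number of levels, with a straight-through estimator so gradients flow past the rounding. The dimensions used here ($D_z = 16$, $L_z = 7$) are placeholders, since the analysis does not state SONIC's actual values:

```python
import torch

def fsq(z: torch.Tensor, levels: int = 7) -> torch.Tensor:
    """Finite Scalar Quantization of a (batch, D_z) latent.

    Each dimension is squashed with tanh and rounded to `levels` evenly
    spaced values (an odd count keeps the rounded grid symmetric); the
    straight-through trick copies gradients around the round() op.
    """
    half = (levels - 1) / 2.0
    bounded = torch.tanh(z) * half                    # each dim in (-half, half)
    quantized = torch.round(bounded)                  # snap to the integer grid
    return bounded + (quantized - bounded).detach()   # straight-through estimator

token = fsq(torch.randn(4, 16))  # 4 commands -> 16-dim universal tokens
```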

Decoders: The universal token $\boldsymbol{z}$ is processed by two separate decoders (MLPs, as detailed in Table 3):

  1. Robot control decoder $\mathcal{D}_c$: Transforms the universal token into motor commands (target joint positions) that control the robot's joints.

  2. Robot motion decoder $\mathcal{D}_r$: Reconstructs the robot motion command. This provides auxiliary supervision during training, helping to improve the quality of the latent space and enhance feature learning.

    The following are the results from Table 3 of the original paper:

| Module | Architecture | Dims |
| --- | --- | --- |
| Network configuration | | |
| Quantizer | FSQ | token dimensions = $D_z$; quantization levels = $L_z$ |
| Encoder ($g_r$) | MLP | hidden = [2048, 1024, 512, 512] |
| Encoder (teleop) | MLP | hidden = [2048, 1024, 512, 512] |
| Encoder (smpl) | MLP | hidden = [2048, 1024, 512, 512] |
| Decoder (actions) | MLP | hidden = [2048, 2048, 1024, 1024, 512, 512] |
| Decoder (refs) | MLP | hidden = [2048, 1024, 512, 512] |
| Action dimension | Diagonal Gaussian | 29 |
| Critic | MLP | hidden = [2048, 2048, 1024, 1024, 512, 512] |
| Motion command | | |
| Future frames | | $F_r = F_h = F_m = 10$ frames |
| Frame interval | | $\Delta t_r = \Delta t_m = 0.1\,\mathrm{s}$; $\Delta t_h = 0.02\,\mathrm{s}$ |

Symbol Explanation for Table 3:

  • $D_z$: Dimension of the universal token.
  • $L_z$: Number of quantization levels per dimension of the universal token.
  • MLP: Multi-Layer Perceptron; "hidden" lists the number of neurons in each hidden layer.
  • Encoder ($g_r$): The robot motion encoder.
  • Encoder (teleop): The hybrid motion encoder used for teleoperation.
  • Encoder (smpl): The human motion encoder (SMPL is a human body model).
  • Decoder (actions): The robot control decoder $\mathcal{D}_c$.
  • Decoder (refs): The robot motion decoder $\mathcal{D}_r$.
  • Action dimension: The number of policy outputs (target joint positions). "Diagonal Gaussian" indicates that the policy outputs the mean and standard deviation of a Gaussian distribution over actions.
  • Critic: The value network in PPO, also an MLP.
  • $F_r, F_h, F_m$: Number of future frames considered by the robot, human, and hybrid motion encoders, respectively.
  • $\Delta t_r, \Delta t_m, \Delta t_h$: Frame intervals for the robot, hybrid, and human motion encoders, respectively.
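Putting the pieces of Table 3 together, here is a minimal sketch of the encoder → quantizer → decoder forward pass. This is our reconstruction, not the released architecture: the input/output widths, the ELU activation, and feeding proprioception to the control decoder are assumptions, and `fsq` is the sketch from above.

```python
import torch
import torch.nn as nn

def mlp(in_dim, hidden, out_dim):
    """Build an MLP with the hidden widths listed in Table 3."""
    layers, d = [], in_dim
    for h in hidden:
        layers += [nn.Linear(d, h), nn.ELU()]  # ELU is an assumed activation
        d = h
    return nn.Sequential(*layers, nn.Linear(d, out_dim))

TOKEN_DIM = 16    # placeholder D_z
PROPRIO_DIM = 93  # matches the proprioceptive sketch above
CMD_DIM = 580     # placeholder: e.g., 10 frames x (29 joint pos + 29 joint vel)

robot_encoder   = mlp(CMD_DIM, [2048, 1024, 512, 512], TOKEN_DIM)
control_decoder = mlp(TOKEN_DIM + PROPRIO_DIM,
                      [2048, 2048, 1024, 1024, 512, 512], 29)  # 29 joint targets
motion_decoder  = mlp(TOKEN_DIM, [2048, 1024, 512, 512], CMD_DIM)

def act(motion_command, proprio):
    z = fsq(robot_encoder(motion_command))            # universal token
    action_mean = control_decoder(torch.cat([z, proprio], dim=-1))
    recon = motion_decoder(z)                         # auxiliary reconstruction
    return action_mean, recon
```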

Training: Training involves preparing synchronized motion data across all three command types ($g_r, g_h, g_m$). Each command type is encoded via its respective encoder and quantized to produce universal tokens ($z_r, z_h, z_m$). The total loss comprises four components (a code sketch of the auxiliary terms follows the list below):

$\mathcal{L} = \mathcal{L}_{\mathrm{ppo}} + \mathcal{L}_{\mathrm{recon}} + \mathcal{L}_{\mathrm{token}} + \mathcal{L}_{\mathrm{cycle}}$

  1. $\mathcal{L}_{\mathrm{ppo}}$: The standard PPO loss. PPO optimizes the policy by maximizing the expected cumulative reward while keeping policy updates stable; the details (clipped surrogate objective and value function loss) are well established in the reinforcement learning literature.

  2. $\mathcal{L}_{\mathrm{recon}}$: The reconstruction loss for the robot motion command across input modalities. It ensures that the decoded motion from any token type matches the corresponding robot motion: $\mathcal{L}_{\mathrm{recon}} = \lVert \mathcal{D}_r(z_r) - g_r \rVert^2 + \lVert \mathcal{D}_r(z_h) - g_r \rVert^2 + \lVert \mathcal{D}_r(z_m) - g_r \rVert^2$

    • $\mathcal{D}_r(z_r)$: Reconstruction of robot motion from the robot token.
    • $\mathcal{D}_r(z_h)$: Reconstruction of robot motion from the human token.
    • $\mathcal{D}_r(z_m)$: Reconstruction of robot motion from the hybrid token.
    • $g_r$: The ground truth robot motion command.
    • $\lVert \cdot \rVert^2$: Squared Euclidean norm.
    • Crucially, when the input command is human motion ($g_h$), the encoder-decoder pipeline ($\mathcal{E}_h \rightarrow \text{Quantizer} \rightarrow \mathcal{D}_r$) acts as a retargeting pipeline from human to robot motion. Thus, $\mathcal{L}_{\mathrm{recon}}$ effectively serves as a retargeting loss that enables cross-embodiment transfer.
  3. $\mathcal{L}_{\mathrm{token}}$: This loss measures the discrepancy between the robot token $z_r$ and the human motion token $z_h$ in the latent space: $\mathcal{L}_{\mathrm{token}} = \lVert z_r - z_h \rVert^2$

    • This loss explicitly encourages the encoder networks to produce aligned representations across embodiments, so that motions from different embodiments (human vs. robot) map to similar points in the shared latent space, which is critical for cross-embodiment learning.
  4. $\mathcal{L}_{\mathrm{cycle}}$: A cycle consistency loss that further reinforces latent-space alignment. It checks whether re-encoding a reconstructed robot motion (derived from a human token) yields a token consistent with the original robot token: $\mathcal{L}_{\mathrm{cycle}} = \lVert \mathcal{E}_r(\mathcal{D}_r(z_h)) - z_r \rVert^2$

    • $\mathcal{E}_r(\cdot)$: The robot motion encoder.
    • $\mathcal{D}_r(z_h)$: Robot motion reconstructed from the human token.
    • $\mathcal{E}_r(\mathcal{D}_r(z_h))$: Re-encoding of that reconstructed motion back into a token.
    • $z_r$: The original robot token.
    • This loss ensures that the translation from human to robot motion and back to a token preserves the essential motion characteristics, keeping the latent space consistent across transformations.
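A minimal sketch of the three auxiliary terms (the PPO term is standard and omitted). We use mean squared error as a stand-in for the squared norms above; module and variable names mirror the notation in the text:

```python
import torch.nn.functional as F

def auxiliary_losses(enc_r, dec_r, z_r, z_h, z_m, g_r):
    """L_recon + L_token + L_cycle from the text above.

    enc_r / dec_r: robot motion encoder and decoder; z_r, z_h, z_m: universal
    tokens from the robot, human, and hybrid encoders; g_r: ground-truth
    robot motion command (flattened tensor).
    """
    l_recon = (F.mse_loss(dec_r(z_r), g_r)     # robot token -> robot motion
               + F.mse_loss(dec_r(z_h), g_r)   # human token -> robot motion (retargeting)
               + F.mse_loss(dec_r(z_m), g_r))  # hybrid token -> robot motion
    l_token = F.mse_loss(z_r, z_h)             # align robot and human tokens
    l_cycle = F.mse_loss(enc_r(dec_r(z_h)), z_r)  # cycle consistency
    return l_recon + l_token + l_cycle         # added to the PPO loss
```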

Adaptive Motion Sampling: To efficiently sample from the large motion dataset during training, a bin-based adaptive motion sampling strategy is used when selecting the initial frame for each episode.

  • The entire motion dataset is partitioned into duration bins.
  • For each bin $i$, the episode failure rate $f_i$ of the current policy is calculated.
  • To prevent oversampling from bins with extremely high failure rates, $f_i$ is capped at $\beta \bar{f}$, where $\beta$ is a hyperparameter and $\bar{f}$ is the average failure rate across all bins.
  • These capped failure rates are normalized to derive preliminary sampling weights $\hat{p}_i$.
  • The final sampling probability $p_i$ for each bin is defined as: $p_i = \alpha \hat{p}_i + (1 - \alpha) \frac{1}{N}$
    • $\alpha$: A blending hyperparameter.
    • $N$: Total number of bins.
    • This formula balances targeted sampling of challenging bins (where the policy struggles) with uniform coverage of the entire dataset; a minimal sketch follows this list. A bin is selected according to $p_i$, and the initial frame is then sampled uniformly from within the chosen bin. All models are trained using Isaac Lab (NVIDIA et al., 2025) as the simulation platform.
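A minimal sketch of the bin-probability computation referenced above, using the Table 4 values $\alpha = 0.1$ and $\beta = 200$ (the example failure rates are made up):

```python
import numpy as np

def bin_sampling_probs(failure_rates, alpha=0.1, beta=200.0):
    """Adaptive motion sampling: p_i = alpha * p_hat_i + (1 - alpha) / N."""
    f = np.asarray(failure_rates, dtype=float)
    f = np.minimum(f, beta * f.mean())            # cap extreme failure rates
    n = len(f)
    p_hat = f / f.sum() if f.sum() > 0 else np.full(n, 1.0 / n)
    return alpha * p_hat + (1.0 - alpha) / n      # sums to 1 by construction

probs = bin_sampling_probs([0.02, 0.10, 0.50, 0.01])
rng = np.random.default_rng(0)
chosen_bin = rng.choice(len(probs), p=probs)      # then sample a frame uniformly
```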

The following are the results from Table 4 of the original paper:

| Training hyperparameter | Value |
| --- | --- |
| Num parallel envs per GPU | 4096 |
| Num steps per env | 24 |
| Learning epochs | 5 |
| Num mini-batches | 4 |
| Discount $\gamma$ | 0.99 |
| GAE $\lambda$ | 0.95 |
| Clip parameter | 0.2 |
| Entropy coefficient | 0.013 |
| Value loss coefficient | 1.0 |
| Actor learning rate | $2 \times 10^{-5}$ |
| Critic learning rate | $1 \times 10^{-3}$ |
| Max gradient norm | 0.1 |
| Desired KL | 0.01 |
| Adaptive LR min/max | $[1 \times 10^{-5}, 2 \times 10^{-4}]$ |
| Init noise std | 0.05 |
| Actor std clamp min/max | [0.001, 0.5] |
| Adaptive sampling bin size | 1 s |
| Adaptive sampling failure rate cap | $\beta = 200$ |
| Adaptive sampling blending hyperparameter | $\alpha = 0.1$ |

Symbol Explanation for Table 4:

  • Num parallel envs per GPU: Number of simulated environments run concurrently on each GPU.
  • Num steps per env: Number of simulation steps collected from each environment per policy iteration.
  • Learning epochs: Number of times the collected data is iterated over for policy updates within one PPO iteration.
  • Num mini-batches: Number of mini-batches used to update the policy from the collected data.
  • Discount $\gamma$: Discount factor for future rewards in RL.
  • GAE $\lambda$: Lambda parameter for Generalized Advantage Estimation, used to estimate advantages more accurately.
  • Clip parameter: Epsilon in PPO's clipped surrogate objective, which limits how much the new policy can deviate from the old.
  • Entropy coefficient: Weight for the entropy bonus term in the loss, encouraging exploration.
  • Value loss coefficient: Weight for the value function loss in the PPO objective.
  • Actor learning rate: Learning rate for the policy network (actor).
  • Critic learning rate: Learning rate for the value network (critic).
  • Max gradient norm: Maximum allowed gradient norm during optimization, to prevent exploding gradients.
  • Desired KL: Target Kullback-Leibler divergence between old and new policies, used in adaptive PPO.
  • Adaptive LR min/max: Minimum and maximum bounds for the adaptive learning rate.
  • Init noise std: Initial standard deviation of the action noise.
  • Actor std clamp min/max: Minimum and maximum bounds for the action standard deviation.
  • Adaptive sampling bin size: Size of the duration bins for adaptive motion sampling.
  • Adaptive sampling failure rate cap ($\beta$): Hyperparameter for capping failure rates in adaptive sampling.
  • Adaptive sampling blending hyperparameter ($\alpha$): Hyperparameter for blending targeted and uniform sampling in adaptive motion sampling.

Tasks and Applications: The universal motion tracking framework enables a wide range of applications:

  1. Interactive Gamepad Control: User gamepad inputs (velocities, styles) are transformed into robot motions by the generative kinematic motion planner, which the universal policy then tracks via the robot motion encoder $\mathcal{E}_r$.
  2. VR 3-Point Teleoperation: A PICO VR headset and handheld controllers provide head and wrist SE(3) pose data and navigation commands. The kinematic motion planner generates lower-body motions, forming a hybrid command (upper-body keypoints, lower-body robot motion) tracked by the hybrid motion encoder $\mathcal{E}_m$.
  3. VLA Integration: A GR00T N1.5 model, post-trained on VR 3-point teleoperation data, provides actions that are fed into the kinematic motion planner. This also results in a hybrid command tracked by $\mathcal{E}_m$.
  4. VR Whole-Body Teleoperation: VR headsets and IMU sensors on the ankles stream full-body human motion, which the universal policy tracks using the human motion encoder $\mathcal{E}_h$.
  5. Video Teleoperation & Multi-Modal Control: The GENMO model (Li et al., 2025) estimates or generates human motions from multi-modal inputs (video, text, audio). The universal policy tracks these generated motions using the human motion encoder $\mathcal{E}_h$.

4.2.3. Generative Kinematic Motion Planner

The generative kinematic motion planner is a large latent generative model trained on the same whole-body motion data as the motion tracking policy. Its function is to convert user commands into plausible robot motion trajectories.

  • Planning Process: Formulated as an autoregressive motion in-betweening generation task. It continually regenerates future kinematic motions conditioned on previous robot states and incoming user commands.
  • Keyframes:
    • Context keyframes: Capture historical robot states (joint positions, root positions).
    • Target keyframes: Generated from user commands (navigation guidance, velocity, direction, style) or skill-specific targets (squatting, crawling, boxing).
  • Motion Representation:
    • Mathematically equivalent to the humanoid pose configuration $q_t$.
    • Uses pelvis-relative joint positions and global joint rotations.
    • During training, samples are randomly rotated to enable planning in arbitrary orientations. Global rotations are crucial for motions where the heading is ill-defined (e.g., squatting, crawling).

Generative Neural Backbone in Latent Space: Planning is conducted in a latent space. Continuous motions are first encoded as a sequence of latent tokens: $ \{\boldsymbol{z}_t\}_{t=1}^{T/4} = \operatorname{enc}\left(\{p_t, r_t\}_{t=1}^{T}\right) $

  • $\boldsymbol{z}_t$: Latent token at time step $t$.
  • $T$: Total duration of the motion segment.
  • $T/4$: Length of the latent token sequence, indicating a downsampling rate of 4.
  • $p_t$: Pose configuration at frame $t$.
  • $r_t$: Root position at frame $t$.
  • $\operatorname{enc}(\cdot)$: Encoding function that converts raw motion data into latent tokens.

    The latent token sequence is processed by models like Transformers or Conv1D networks to capture temporal consistency. The in-betweening process in the token space is guided by two constraints: the starting and target keyframes (e.g., $\{p_t, r_t\}_{t=1}^{4}$ and $\{p_t, r_t\}_{t=T-4}^{T}$). Instead of predicting the sequence directly, a masked token prediction approach (similar to techniques in language models) is used: $ h = \mathcal{F}\left(\{p_t, r_t\}_{t=1}^{4}, \{p_t, r_t\}_{t=T-4}^{T}, \{z_t\}_{t=1}^{T/4}\right) $, $ \operatorname{Prob}(z_t) = \sigma(h) $

  • $\mathcal{F}(\cdot)$: The neural backbone (e.g., Transformer).
  • $h$: Logits for each token position.
  • $\sigma(h)$: Softmax function applied to the logits to compute token probabilities.

  • The process is iterative:

    • Initially, all latent tokens are unknown, initialized with a learnable mask embedding $z_{\mathrm{masked}}$.
    • During training, the proportion of masked tokens is uniformly sampled from $[100\%, 0\%]$.
    • At inference, the neural backbone iteratively finalizes the tokens predicted with the highest probability, progressively refining the prediction. The proportion of tokens finalized by iteration $L$ follows a cosine schedule, $1 - \cos\left(\frac{\pi}{2} \cdot \frac{L}{L_{\max}}\right)$, where $L$ is the current iteration and $L_{\max}$ is the maximum number of iterations.
  • After all tokens are finalized, the predicted tokens are used to reconstruct the kinematic motions and generate robot control signals (see the decoding sketch below).
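A minimal NumPy sketch of this iterative mask-predict decoding under the cosine schedule; the `backbone` callable (returning per-token logits over a discrete codebook) and its signature are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mask_predict(backbone, ctx_kf, tgt_kf, num_tokens, max_iters=8):
    tokens = np.full(num_tokens, -1, dtype=int)      # -1 marks a masked slot
    for L in range(1, max_iters + 1):
        logits = backbone(ctx_kf, tgt_kf, tokens)    # (num_tokens, codebook)
        probs = softmax(logits)
        conf, best = probs.max(axis=-1), probs.argmax(axis=-1)
        conf[tokens >= 0] = -np.inf                  # keep finalized tokens
        # Cosine schedule: fraction of tokens finalized after iteration L.
        frac = 1.0 - np.cos(0.5 * np.pi * L / max_iters)
        n_new = int(np.ceil(frac * num_tokens)) - int(np.sum(tokens >= 0))
        for idx in np.argsort(-conf)[:max(n_new, 0)]:
            tokens[idx] = best[idx]                  # commit most confident
    return tokens
```

At `L = max_iters` the schedule reaches 1.0, so every slot is committed; earlier iterations commit only the most confident predictions, which is what lets the prediction be refined progressively.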

Root Trajectory Spring Model: An intuitive critically damped spring model is used to generate the root position and heading of the keyframes from user commands: $ x(t) = x_T + \left(x_0 - x_T + \left(v_0 + \frac{c}{2}\left(x_0 - x_T\right)\right) t\right) e^{-\frac{c}{2} t} $

  • $x(t)$: Position or angle at time $t$.
  • $x_T$: Target value (e.g., target pelvis position, target heading angle).
  • $x_0$: Initial value.
  • $v_0$: Initial velocity.
  • $c$: Damping coefficient.
  • $e$: Euler's number (base of the natural logarithm).
  • This model is applied to three quantities: the pelvis position along the x-axis, the pelvis position along the y-axis, and the projected heading angle of the pelvis. Damping coefficients of $5\ln 2$ and $20\ln 2$ are used for position and heading, respectively.
  • Target values can come directly from controllers or be derived from a desired velocity by computing the target position 1.0 s ahead. This model improves behavioral predictability and safeguards against unrealistic commands (e.g., abrupt direction reversals). A numerical sketch of the spring response follows this list.
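A minimal NumPy sketch of this critically damped response, using the damping coefficients stated above; the evaluation horizon and numbers are illustrative.

```python
import numpy as np

def spring(x0, v0, x_target, c, t):
    # Critically damped spring: converges to x_target without overshoot.
    j0 = x0 - x_target
    j1 = v0 + 0.5 * c * j0
    return x_target + (j0 + j1 * t) * np.exp(-0.5 * c * t)

c_pos, c_head = 5 * np.log(2), 20 * np.log(2)   # position / heading damping
# Pelvis x-position 0.5 s into a move from rest at 0.0 m toward 1.0 m:
print(spring(x0=0.0, v0=0.0, x_target=1.0, c=c_pos, t=0.5))  # ~0.21 m
```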

Keyframe Module and Application Integration:

  • Traditional motion planning often relies on complex keyframe generation. SONIC provides keyframes more intuitively.
  • For navigation control, target keyframes are generated by sampling a segment from the dataset that aligns with the specified velocity, direction, and style. The planner's variable-length motion generation (0.8s to 2.4s) and robust in-betweening capabilities ensure natural and smooth motions. Autoregressive replanning continuously refreshes the motion, reducing dependence on specific spatial details of the target keyframes.
  • For entertainment tasks (e.g., boxing), target keyframes select expressive segments (e.g., maximal arm extension for a punch).
  • Motion layering is supported, where upper-body movements are specified, and the lower body is generated by the planner.
  • For manipulation tasks (e.g., squatting, kneeling), keyframes are retrieved from the motion library based on desired heights.

4.2.4. Multi-modal Motion Generation Model (GENMO)

SONIC adopts GENMO (Li et al., 2025) to support multi-modal conditioning, meaning it can generate human motions from various input types.

  • Core Idea: GENMO treats motion estimation from videos as a constrained generation problem. The model synthesizes a complete motion trajectory that is consistent with observed video keypoints, while also being capable of generating novel motions from abstract conditions like text or music.
  • Conditioning Modalities and Temporal Layout:
    • Accepts mixed, time-varying conditions: text prompts, audio features, and visual observations.
    • Conditions can appear at different temporal intervals.
    • Each stream is encoded by a modality-specific encoder into features aligned to a common motion frame rate.
    • Missing intervals are handled by empty or masked tokens.
  • Architecture:
    • Condition streams are fused using a temporal transformer with cross-attention from motion tokens to multimodal condition tokens.
    • A diffusion-based motion prior operates over the human motion sequence, denoising from Gaussian noise to a kinematically plausible trajectory. This prior learns human motion dynamics, being agnostic to the specific conditioning type (the modality-specific processing happens in the encoders and fusion layers).
    • The denoised human motion is then encoded into SONIC's universal token space, allowing a single decoder to serve multiple downstream embodiments.
  • Training Objective: Mixes two complementary objectives:
    1. Generative learning: Uses a standard diffusion loss over human motions, conditioned on available modalities (text/audio/video), to learn a broad motion prior.
    2. Estimation-guided learning: Adds reconstruction terms when observations are present (e.g., 2D/3D keypoints, trajectory constraints), ensuring the generated motion matches measurements exactly where they exist, while relying on the prior elsewhere.
    • This allows the same network parameters to serve both generation and estimation tasks.
    • Mixed batches with variable-length sequences and randomly dropped modalities teach the model to handle partial and asynchronous evidence.
  • Inference Modes: Supports pure generation from abstract prompts, constrained generation given video frames, and hybrid control where modalities switch over time.
  • Integration with SONIC: GENMO provides low-latency motion generation using sliding windows with overlap. The diffusion denoising process is modified with inpainting to handle transitions between windows, ensuring smooth, continuous motion: the overlapped part of the denoised human motion is overwritten by the previous window's motions during denoising (see the sketch below).
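A small sketch of that overlap-overwrite step, assuming motion windows are arrays of frames; the function shape is an illustration, not GENMO's actual interface.

```python
import numpy as np

def stitch_window(denoised, prev_window, overlap):
    """Overwrite the overlapped prefix of the current denoised window with
    the tail of the previous window, keeping streamed motion continuous."""
    out = denoised.copy()
    out[:overlap] = prev_window[-overlap:]
    return out

# Toy usage: 32-frame windows of 3D poses with an 8-frame overlap.
prev = np.random.randn(32, 24, 3)
curr = np.random.randn(32, 24, 3)
smooth = stitch_window(curr, prev, overlap=8)
```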

4.2.5. Deployment

Experiments are conducted on a Unitree G1 platform, utilizing its built-in joint-level PD controller.

  • Onboard Execution: The learned policy and its associated management stack execute directly on the robot's onboard CPU/GPU (a Jetson Orin GPU), minimizing feedback latency and improving timing determinism.
  • Control Frequencies:
    • Policy loop: Runs at 50 Hz. At each cycle, observations (robot state, operator commands) are assembled.
    • Target joint positions: Streamed by a separate high-rate process at 500 Hz through the Unitree low-level API.
    • User input: Captured independently at 100 Hz.
    • Kinematic motion planner: Employed at 10 Hz, proposing short-horizon sequences based on input interface commands.
  • Synchronization and Blending: Planner outputs are temporally aligned with the 50 Hz policy loop and smoothly blended into the control objectives, providing a motion reference without sacrificing responsiveness.
  • Robustness: Each periodic process operates on consistent snapshots of the robot state. A "latest-data-wins" convention mitigates jitter, preventing transient delays from stalling the system.
  • Optimization: Both the interactive kinematic motion planner and the policy inference execute onboard using TensorRT with CUDA Graph acceleration. This yields reliable, low-variance execution: 1–2 ms per forward pass for the policy and 12 ms per forward pass for motion generation (kinematic planner).
  • Result: This architecture emphasizes timing robustness and smoothness, producing stable motion while decoupling the loops to preserve platform control stability and low end-to-end latency (a schematic sketch of this multi-rate pattern follows).
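A schematic Python sketch of the multi-rate, latest-data-wins pattern described above; the loop rates match the text, but the payloads and loop bodies are placeholders, not the deployed stack.

```python
import threading
import time

class LatestSnapshot:
    """Single-slot mailbox: writers overwrite, readers take the freshest
    value, so a stalled producer never blocks a consumer."""
    def __init__(self):
        self._lock = threading.Lock()
        self._value = None
    def write(self, value):
        with self._lock:
            self._value = value
    def read(self):
        with self._lock:
            return self._value

plan_buf, cmd_buf = LatestSnapshot(), LatestSnapshot()

def planner_loop():                  # 10 Hz: short-horizon kinematic plans
    while True:
        plan_buf.write("short-horizon motion")   # placeholder payload
        time.sleep(1 / 10)

def policy_loop():                   # 50 Hz: observation -> action cycle
    while True:
        plan, cmd = plan_buf.read(), cmd_buf.read()
        # ... assemble observations and run policy inference here ...
        time.sleep(1 / 50)

for fn in (planner_loop, policy_loop):
    threading.Thread(target=fn, daemon=True).start()
time.sleep(0.1)   # let the loops tick a few times in this demo
```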

5. Experimental Setup

5.1. Datasets

The experiments primarily leverage a proprietary in-house motion-capture dataset, augmented with comparisons to existing public datasets.

  • In-house Dataset (SONIC's Primary Training Data):

    • Source: Internally collected motion-capture data.

    • Scale: 700 hours of human motion, resulting in over 100 million frames at 50 Hz.

    • Characteristics: Collected from over 50 subjects (balanced male/female, diverse body heights). Includes thousands of unique motion behaviors: daily activities, various locomotion styles, combat motions, sports movements, and object interactions. Most actions performed by multiple subjects/takes to provide intrinsic variation.

    • Domain: Humanoid motion for whole-body control.

    • Purpose: This large and diverse dataset is crucial for SONIC's supersizing approach, providing dense supervision for learning human motion priors without manual reward engineering. The human motion is retargeted to the Unitree G1 robot using GMR (Araujo et al., 2025).

      An example of the data diversity can be seen in Figure 7 (already presented in Methodology).

  • AMA (Mahmood et al., 2019):

    • Source: A public human motion-capture dataset, commonly retargeted for humanoid evaluation.
    • Scale: SONIC uses a uniformly random subset of retargeted AMA data for evaluation (9 hours, 1,602 trajectories).
    • Purpose: This dataset serves as a large, previously unseen test set for SONIC to evaluate its generalization capabilities. It is specifically chosen to underscore the robustness of SONIC's generalization as SONIC is not trained on the AMA dataset.
  • LaFAN (Harvey et al., 2020):

    • Source: A public dataset of human locomotion data.
    • Scale: 0.4 million frames.
    • Purpose: Used as a baseline smaller dataset for scaling studies (comparing 0.4M frames vs. 7.4M frames vs. 100M frames) to show the impact of dataset size on performance. Any2Track and BeyondMimic baselines are typically trained on LaFAN.
  • AMASS (Mahmood et al., 2019):

    • Source: A very large and diverse public dataset of human motions.
    • Purpose: GMT (another baseline) is noted to be trained on AMASS.

Why these datasets were chosen: The in-house dataset provides the necessary scale and diversity for the supersizing hypothesis. AMA and LaFAN/AMASS are used for robust evaluation against out-of-distribution motions and for fair comparisons with baselines that were trained on these public datasets, respectively. They are effective for validating the method's performance by testing generalization and demonstrating the benefits of scale.

5.2. Evaluation Metrics

The paper employs a comprehensive set of metrics to quantify motion imitation performance, encompassing both pose-based (kinematic) and physics-based (dynamic) aspects.

  1. Success Rate (succ):

    • Conceptual Definition: Measures the percentage of imitation attempts where the humanoid successfully tracks the reference motion trajectory without significant deviation or failure (e.g., falling). It represents the overall robustness and stability of the controller.
    • Mathematical Formula: The paper defines it conceptually without a specific formula, but it is typically calculated as: $ \text{succ} = \frac{\text{Number of successful trajectories}}{\text{Total number of trajectories}} \times 100\% $
    • Symbol Explanation:
      • Number of successful trajectories: The count of motion sequences where the robot maintained tracking within defined thresholds.
      • Total number of trajectories: The total count of motion sequences attempted for tracking.
    • Failure Criteria: For scaling up motion tracking experiments, an imitation attempt is deemed unsuccessful if the humanoid's body height deviates by more than 0.25 m from the reference motion, or if the root orientation differs by more than 1 radian at any point during tracking. For comparison with baselines, a more relaxed criterion is used: imitation is unsuccessful only if the humanoid falls, defined as its root height deviating by more than 0.25 m from the reference.
  2. Root-relative Mean Per-Joint Position Error (MPJPE):

    • Conceptual Definition: Quantifies the local accuracy of the imitation by measuring the average positional difference between corresponding joints of the simulated humanoid and the reference motion, expressed relative to the root (pelvis). Lower values indicate better tracking fidelity.
    • Mathematical Formula: While the paper reports $E_{\mathrm{mpjpe}}$ in mm, it does not provide its specific formula. A standard root-relative MPJPE formula is: $ E_{\mathrm{mpjpe}} = \frac{1}{N_{\text{frames}} \cdot N_{\text{joints}}} \sum_{t=1}^{N_{\text{frames}}} \sum_{j=1}^{N_{\text{joints}}} \left\| \left(P_{j,t}^{\text{robot}} - P_{\text{root},t}^{\text{robot}}\right) - \left(P_{j,t}^{\text{ref}} - P_{\text{root},t}^{\text{ref}}\right) \right\|_2 $
    • Symbol Explanation:
      • $N_{\text{frames}}$: Total number of frames in the motion trajectory.
      • $N_{\text{joints}}$: Total number of joints being evaluated.
      • $P_{j,t}^{\text{robot}}$: Position of joint $j$ of the robot at time $t$ in global coordinates.
      • $P_{\text{root},t}^{\text{robot}}$: Position of the root (pelvis) of the robot at time $t$ in global coordinates.
      • $P_{j,t}^{\text{ref}}$: Position of joint $j$ of the reference motion at time $t$ in global coordinates.
      • $P_{\text{root},t}^{\text{ref}}$: Position of the root (pelvis) of the reference motion at time $t$ in global coordinates.
      • $\| \cdot \|_2$: Euclidean (L2) norm, representing the distance.
      • Subtracting the root position, $P_{j,t} - P_{\text{root},t}$, makes the error root-relative, removing the influence of global translation. (Implementations of this and the two dynamics metrics below are sketched after this list.)
  3. Differences in Acceleration ($\Sigma_{\mathrm{acc}}$):

    • Conceptual Definition: Assesses the physical fidelity of the robot's motion by comparing the acceleration profiles of its body parts to those of the reference motion. Large differences indicate unnatural or non-physical movements.
    • Mathematical Formula: The paper reports units of mm/frame² but does not provide a specific formula. It is typically the mean error of accelerations, averaged over links and frames: $ \Sigma_{\mathrm{acc}} = \frac{1}{N_{\text{frames}} \cdot N_{\text{links}}} \sum_{t=1}^{N_{\text{frames}}} \sum_{l=1}^{N_{\text{links}}} \left\| A_{l,t}^{\text{robot}} - A_{l,t}^{\text{ref}} \right\|_2 $
    • Symbol Explanation:
      • $N_{\text{links}}$: Number of body links.
      • $A_{l,t}^{\text{robot}}$: Acceleration of body link $l$ of the robot at time $t$.
      • $A_{l,t}^{\text{ref}}$: Acceleration of body link $l$ of the reference motion at time $t$.
      • $\| \cdot \|_2$: Euclidean (L2) norm.
  4. Differences in Velocity ($E_{\mathrm{vel}}$):

    • Conceptual Definition: Another metric for physical fidelity, comparing the velocity profiles of the robot's body parts to the reference motion. Consistent velocity matching contributes to natural movement.
    • Mathematical Formula: The paper reports units of mm/frame but does not provide a specific formula. Analogously to acceleration, it is typically the mean error of velocities: $ E_{\mathrm{vel}} = \frac{1}{N_{\text{frames}} \cdot N_{\text{links}}} \sum_{t=1}^{N_{\text{frames}}} \sum_{l=1}^{N_{\text{links}}} \left\| V_{l,t}^{\text{robot}} - V_{l,t}^{\text{ref}} \right\|_2 $
    • Symbol Explanation:
      • $N_{\text{links}}$: Number of body links.
      • $V_{l,t}^{\text{robot}}$: Velocity of body link $l$ of the robot at time $t$.
      • $V_{l,t}^{\text{ref}}$: Velocity of body link $l$ of the reference motion at time $t$.
      • $\| \cdot \|_2$: Euclidean (L2) norm.
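As referenced above, a minimal NumPy sketch of the three tracking-error metrics, assuming robot and reference link positions are given as (frames, links, 3) arrays; velocities and accelerations are taken as per-frame finite differences, which matches the per-frame units but is otherwise an assumption about the paper's exact computation.

```python
import numpy as np

def mpjpe_root_relative(robot, ref, root=0):
    # Subtract each frame's root position to remove global translation.
    robot_rel = robot - robot[:, root:root + 1]
    ref_rel = ref - ref[:, root:root + 1]
    return np.linalg.norm(robot_rel - ref_rel, axis=-1).mean()

def vel_acc_errors(robot, ref):
    # Per-frame finite-difference velocities and accelerations.
    v_r, v_g = np.diff(robot, axis=0), np.diff(ref, axis=0)
    a_r, a_g = np.diff(v_r, axis=0), np.diff(v_g, axis=0)
    e_vel = np.linalg.norm(v_r - v_g, axis=-1).mean()
    e_acc = np.linalg.norm(a_r - a_g, axis=-1).mean()
    return e_vel, e_acc

# Toy usage with random trajectories (10 frames, 23 links, xyz):
robot = np.random.randn(10, 23, 3)
ref = np.random.randn(10, 23, 3)
print(mpjpe_root_relative(robot, ref), vel_acc_errors(robot, ref))
```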

5.3. Baselines

The paper compares SONIC against several state-of-the-art motion tracking methods to demonstrate its superiority:

  • Any2Track (Zhang et al., 2025):

    • Description: A general motion tracking framework designed to track diverse motions.
    • Training Data: Typically trained on smaller datasets like LaFAN.
    • Why Representative: It's a contemporary deep RL-based motion tracker.
  • BeyondMimic (Liao et al., 2025):

    • Description: Another advanced motion imitation approach.
    • Training Data: Also typically trained on smaller datasets like LaFAN.
    • Why Representative: Represents another strong deep RL baseline for motion imitation.
  • GMT (Chen et al., 2025):

    • Description: General Motion Tracking, a method known for robust motion tracking.
    • Training Data: Trained on the AMASS dataset, a large public motion dataset.
    • Why Representative: It's a high-performing baseline that leverages a substantial dataset, making it a strong contender for comparison against SONIC's scaled approach.

Comparison Protocol: To ensure a fair comparison, all baselines (using officially released models where available, or publicly released code for retraining) are evaluated on the same unseen AMA dataset (a uniformly random subset) as SONIC. All evaluations for this comparison are performed in MuJoCo (Todorov et al., 2012), which is supported by all baseline implementations. A more relaxed termination criterion is used: imitation is considered unsuccessful only if the humanoid falls (root height deviation > 0.25 m).

These baselines are representative because they are recent, state-of-the-art deep RL-based motion tracking techniques, allowing the authors to highlight SONIC's advantages gained from its supersizing strategy and multimodal capabilities.

6. Results & Analysis

6.1. Core Results Analysis

The paper presents compelling evidence that scaling up model capacity, data, and compute significantly improves humanoid control, leading to a generalist controller capable of natural and robust movements.

6.1.1. Scaling Up Motion Tracking

The authors analyze the impact of scaling along three key dimensions: GPU hours (compute), model size (network capacity), and motion dataset size (data volume). All evaluations are conducted in Isaac Lab.

The following figure (Figure 2, parts a, b, c) illustrates the scaling trends:


Figure 2 (a, b, c): Performance metrics as functions of GPU hours, model size, and dataset size. Lower is better for error metrics. (The caption of the original image includes (d-g) Comparing our method with baselines on tracking out-of-distribution motion sequences, but the image only clearly shows (a-c) for scaling up. The success rate for "Ours" is indicated as 99.6%.)

  • Consistent Improvements: Across the board, scaling along any of these three axes leads to consistent improvements in motion imitation performance. This empirically validates the scaling hypothesis for humanoid control, mirroring trends observed in other foundation models.
  • Impact of Dataset Size: Increasing the size of the motion dataset yields the most substantial gains. For dataset size, the paper compares training on LaFAN (0.4M frames), a subset of their in-house dataset (7.4M frames), and their full dataset (100M frames). This highlights the critical role of data volume and diversity in achieving robust and generalizable policies.
  • Impact of Model Size and Compute: Increased model size (from 1.2M to 42M parameters) and compute (GPU hours) further enhance results. For GPU hours, it's noted that parallelizing over more GPUs (e.g., 128 vs. 8) is essential, as training on a smaller number of GPUs yields worse asymptotic performance, suggesting that sufficient compute is needed to fully leverage larger models and datasets.
  • High Success Rate: The "Ours" model achieves a success rate of 99.6% (as indicated in the figure), demonstrating highly robust tracking.

6.1.2. Comparison with Baselines

SONIC is compared against state-of-the-art trackers: Any2Track, BeyondMimic, and GMT. Figure 2 (parts d-g) illustrates this comparison; the figure reproduced above only clearly shows parts (a-c), but the text describes the comparison results.

  • Superior Performance: SONIC significantly outperforms all baselines across all evaluation metrics (success rates, MPJPE, acceleration differences, velocity differences).
  • Generalization to Unseen Motions: This comparison is crucial as it evaluates the models on the same unseen AMA dataset, demonstrating that SONIC's approach yields a universal motion tracker capable of accurately tracking a wide range of out-of-distribution motions. The baselines, trained on smaller or different datasets (LaFAN or AMASS), show notably lower performance on this challenging test set.

6.1.3. Real-World Evaluation

  • Robust Sim-to-Real Transfer: SONIC is deployed on a physical Unitree G1 humanoid robot to assess its real-world performance. It tracks 50 diverse motion trajectories, including dance, jumps, and loco-manipulation tasks.
  • Zero-Shot Success: The policy achieves motion imitation in the real world that closely matches its simulation results, operating in a true zero-shot manner (without any real-world fine-tuning).
  • 100% Success Rate: Remarkably, it succeeds on all 50 sequences without a single failure (a 100% success rate), underscoring the robustness and reliability of the tracker in challenging real-world scenarios.

6.1.4. Interactive Motion Control

SONIC's kinematic motion planner enables real-time interactive control for a variety of tasks.

The following figure (Figure 3) showcases diverse interactive control capabilities:


Figure 3: Top three rows: interactive navigation with in-betweening different velocities, directions, and styles. Bottom two rows: SONIC produces high-quality and responsive boxing motions while preserving the robot's complete freedom of movement throughout the task.

  • Navigation Control: Supports arbitrary velocity (0.0 m/s to 6.0 m/s), direction (0 to 360 degrees), and even style commands (e.g., "drunken walking," "injured walking," "happy walking," "stealth walking"). The critically damped spring model smooths impractical commands.

  • In-betweening Generation: The model robustly generates in-betweening motions, showcasing its flexibility and the potential for natural human-robot interaction.

  • Interactive Entertainment (Boxing): Produces high-quality and responsive boxing motions, retaining the robot's full freedom of movement. Unlike existing academic or industrial solutions that often rely on a limited collection of clips and require switching between multiple expert models, SONIC offers fluid and natural transitions.

  • Manipulation Skills: Extends to skills like squatting, kneeling, and crawling, allowing pelvis height control (0.3m to 0.8m) and omnidirectional crawling (0.0 m/s to 0.5 m/s). This enhances adaptability for teleoperation and navigation in confined environments.

    The following figure (Figure 4) further illustrates interactive squatting, kneeling, and crawling:


Figure 4: Interactive squatting, kneeling and crawling. With SONIC, the robot can squat, kneel and crawl at arbitrary heights, enabling seamless application in real-world downstream scenarios such as teleoperation and navigation in complex environments.

6.1.5. Video Teleoperation and Multi-Modal Cross-Embodiment Control

SONIC demonstrates real-time, multimodal, cross-embodiment control through its universal control policy and unified motion generation system (based on GENMO).

The following figure (Figure 5) illustrates multimodal control:


Figure 5: Video teleoperation, multi-modal control, and VR whole-body teleoperation.

  • Video Teleoperation: Supports real-time control from pre-recorded clips and live monocular webcam streams. Human motion is estimated at ≥60 fps, enabling interactive teleoperation without specialized motion-capture hardware. This provides high-fidelity, precise imitation.
  • Music and Text Control:
    • Text Control: Accepts natural-language prompts and synthesizes target motions at ≥60 fps, executing unseen instructions in a zero-shot manner. An interactive graphical interface supports free-form prompting (e.g., "walk forward," "kick left foot," "dance like a monkey").
    • Music Control: Generates dance motions conditioned on melodic and rhythmic structure, tracking tempo and adapting style to musical characteristics.
  • Seamless Modality Transitions: The system supports smooth transitions between modalities (e.g., initiating fine-grained control via video, switching to text for general commands, then to music for performance).

6.1.6. VR-Based Teleoperation and Connecting to Foundation Models

SONIC also supports various VR-based teleoperation regimes and integration with VLA models.

  • VR-Based Whole-Body Teleoperation: Uses PICO VR headset and ankle trackers to stream full-body human pose estimates in SMPL format. This motion is then processed by the universal control policy for low-latency, stable, human-like control.
  • VR-Based 3-Point Teleoperation: A lightweight, mobile setup using only the VR headset and two handheld controllers (no ankle trackers). It outputs a compact command (head and wrist $SE(3)$ poses, finger joint angles, waist height, locomotion mode, navigation command). This mode is used for scalable and portable data collection.
    • Teleoperation Performance: For a mobile pick-and-place task, the pipeline exhibits a mean latency of 121.9 ms. The mean right-wrist position error is 6 cm (95th percentile: 13.3 cm), and the mean orientation error is 0.145 rad (8.32°) (95th percentile: 0.267 rad (15.31°)).
  • Foundation-Model-Driven Mobile Bimanual Manipulation:
    • Integration with GR00T N1.5 VLA model: The VLA model is fine-tuned on 300 VR-teleoperated trajectories (collected via 3-point teleoperation) for an apple-to-plate mobile pick-and-place task.

    • Autonomous Control: The VLA outputs teleoperation-format commands (upper-body poses, base height, navigation command), which are fed into SONIC's kinematic planner and hybrid encoder, then executed by the universal control policy.

    • Success Rate: Achieves a 95% success rate over 20 trials on this proof-of-concept task. This demonstrates compatibility between foundation-model planning (high-level System 2 capabilities) and SONIC's robust System 1 control (reactive whole-body skills), establishing a powerful hierarchical control architecture.

      The following figure (Figure 6) demonstrates the VLA integration:


Figure 6: Apple-to-plate mobile bimanual manipulation. The Unitree G1 humanoid robot is controlled by a fine-tuned GR00T N1.5 vision-language-action (VLA) model. The VLA emits teleoperation-format commands—poses of head and wrists, base height, and a navigation command—which are executed by SONIC. The model is fine-tuned on 300 VR-teleoperated trajectories and attains 95% success over 20 trials on this proof-of-concept task, demonstrating compatibility between foundation-model planning and our universal control policy.

6.2. Data Presentation (Tables)

All tables from the original paper have been transcribed and explained in the Methodology section:

  • Table 1: Reward design.
  • Table 2: Domain randomization parameters.
  • Table 3: Universal control policy hyperparameters.
  • Table 4: Training hyperparameters.

6.3. Ablation Studies / Parameter Analysis

While the paper doesn't present explicit "ablation studies" in the traditional sense (e.g., removing a component and re-evaluating), the scaling up motion tracking experiments (Figure 2a-c) effectively serve as a form of parameter and architectural analysis:

  • Impact of Model Capacity: By varying model size (1.2M to 42M parameters), the authors show that larger networks consistently lead to better motion imitation performance. This indicates that increased capacity allows the model to learn more complex and generalized motion priors.

  • Impact of Data Volume: The comparison across dataset sizes (0.4M, 7.4M, 100M frames) directly demonstrates that increasing the amount of training data yields significant performance gains. This validates the core hypothesis that supersizing data is critical.

  • Impact of Compute: The analysis of GPU hours (8, 32, 128 GPUs) shows that more compute allows for better asymptotic performance and faster convergence, suggesting that computational resources are necessary to fully exploit the benefits of larger models and datasets.

    These analyses strongly support the paper's central argument that scale across these three dimensions is a primary driver of the observed improvements in natural and robust whole-body control. The consistent performance improvements with increased scale suggest that the underlying architecture and learning objective are well-suited for foundation model-like behavior in robotics.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper "SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control" successfully demonstrates a paradigm shift in humanoid robot control. By casting motion tracking as a core, scalable task, the authors have built a generalist humanoid controller (SONIC) that produces natural and robust whole-body behaviors across diverse conditions. This was achieved by supersizing the training process along three axes: network size (up to 42M parameters), dataset volume (over 100M frames/700 hours of motion data), and compute (9k GPU hours).

Key findings include the empirical validation of scaling laws in humanoid control, where performance consistently improves with increased data, model capacity, and compute. The learned representations exhibit strong generalization to unseen motions in simulation and achieve robust zero-shot sim-to-real deployment on a Unitree G1 robot. Beyond the core motion tracker, the practical utility of SONIC is enhanced by a real-time universal kinematic planner for interactive control and a unified token space that seamlessly integrates heterogeneous input interfaces like VR teleoperation, human videos, text, music, and vision-language-action (VLA) models. This comprehensive system establishes motion tracking at scale as a practical foundation for advanced humanoid autonomy, offering a robust System 1 controller that can be layered with higher-level reasoning.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose future research directions:

  • Formal Treatment of Safety and Compliance: The current work does not formally address safety protocols or compliance mechanisms for physical interaction, which are crucial for real-world humanoid deployment.
  • Energy Efficiency for Extended Deployments: The paper notes that energy efficiency for extended deployments is an area for improvement. Large models and compute-intensive processes can consume significant power.
  • Combating Noisy Input: The system's performance might be affected by noisy input during deployments (e.g., errors in VR tracking or video estimation). Robustness to such noise needs further investigation.
  • Scaling Laws Across Diverse Datasets: Future work will study scaling laws across even more diverse datasets to further understand the limits and potential of this approach.
  • VLA-Instructed Whole-Body Loco-Manipulation: The current VLA integration is a proof-of-concept. Future work will focus on enabling more complex, VLA-instructed whole-body loco-manipulation tasks.
  • Joint Training of Planner, Tokenizers, and Policy: To reduce modality gaps and further improve system coherence, the authors suggest exploring joint training of the kinematic planner, the encoders/quantizers (tokenizers), and the control policy. This could lead to a more end-to-end optimized system.

7.3. Personal Insights & Critique

This paper represents a highly significant step forward in humanoid robotics, particularly in its aggressive pursuit of scaling laws. The core insight that motion tracking can serve as a foundation task that bypasses reward engineering by leveraging dense supervision from MoCap data is elegant and powerful. The empirical demonstration of scaling benefits, consistent across compute, model size, and data, provides strong evidence for this approach.

Insights:

  • Paradigm Shift: The paper offers a compelling vision for how foundation models can be brought to robotics, moving beyond task-specific controllers to truly generalist agents. This could dramatically accelerate the development of versatile humanoid robots.
  • Practicality Focus: Beyond theoretical scaling, the emphasis on real-time deployment, sim-to-real transfer, and multimodal control (especially VR teleoperation and VLA integration) makes SONIC highly practical and immediately impactful. The unified token space is a particularly clever solution for integrating disparate input modalities into a single policy.
  • Robustness from Scale: The 100% success rate on real-world zero-shot motions is astonishing and speaks volumes about the robustness gained from massive data and domain randomization.
  • Hierarchical Control: The clear delineation between SONIC as a System 1 (fast, reactive, whole-body) controller and VLA models as System 2 (slower, deliberative reasoning) highlights a promising architecture for complex, autonomous robotics.

Critique/Areas for Improvement:

  • Data Acquisition Cost: While MoCap data is abundant, acquiring 700 hours of high-quality, diverse, and retargetable motion data is a massive undertaking and likely very expensive. This data bottleneck could limit broader adoption or reproduction by smaller research groups. Future work might explore synthetic data generation or more efficient data collection methods.

  • Sim-to-Real Gap for Complex Manipulation: While the sim-to-real transfer for locomotion and whole-body motions is impressive, the VLA integration for mobile manipulation is a "proof-of-concept." Scaling this to highly dexterous, nuanced manipulation tasks with contact forces and unpredictable object properties might reintroduce significant sim-to-real challenges not fully captured by domain randomization.

  • Interpretability and Safety Guarantees: As models scale, their interpretability decreases. For safety-critical systems like humanoids, understanding why a robot makes a certain decision or fails is paramount. The paper acknowledges safety compliance as a limitation, and this will be a crucial area to address for real-world deployment.

  • Computational Footprint: 9k GPU hours and 42M parameters are substantial. While TensorRT optimization helps with inference, the training demands are immense. Investigating model compression or more efficient architectures for deployment on less powerful hardware could be valuable.

  • Environmental Impact: The energy consumption associated with such large-scale training is considerable. Research into carbon-efficient AI training will become increasingly important.

    Overall, SONIC provides a solid blueprint for advancing humanoid control into the foundation model era. Its methodological rigor and impressive experimental results make it a landmark paper, setting a new bar for scale and generality in the field.
