SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control
TL;DR Summary
The SONIC framework scales model capacity, data, and compute for natural humanoid control, utilizing diverse motion-capture data for dense supervision. It features real-time motion planning and multi-interface support, demonstrating significant performance gains from scaling.
Abstract
Despite the rise of billion-parameter foundation models trained across thousands of GPUs, similar scaling gains have not been shown for humanoid control. Current neural controllers for humanoids remain modest in size, target a limited behavior set, and are trained on a handful of GPUs over several days. We show that scaling up model capacity, data, and compute yields a generalist humanoid controller capable of creating natural and robust whole-body movements. Specifically, we posit motion tracking as a natural and scalable task for humanoid control, leveraging dense supervision from diverse motion-capture data to acquire human motion priors without manual reward engineering. We build a foundation model for motion tracking by scaling along three axes: network size (from 1.2M to 42M parameters), dataset volume (over 100M frames, 700 hours of high-quality motion data), and compute (9k GPU hours). Beyond demonstrating the benefits of scale, we show the practical utility of our model through two mechanisms: (1) a real-time universal kinematic planner that bridges motion tracking to downstream task execution, enabling natural and interactive control, and (2) a unified token space that supports various motion input interfaces, such as VR teleoperation devices, human videos, and vision-language-action (VLA) models, all using the same policy. Scaling motion tracking exhibits favorable properties: performance improves steadily with increased compute and data diversity, and learned representations generalize to unseen motions, establishing motion tracking at scale as a practical foundation for humanoid control.
1. Bibliographic Information
1.1. Title
The central topic of the paper is "SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control." It focuses on enhancing the capabilities and generality of humanoid robot control by significantly increasing the scale of motion tracking through larger models, datasets, and computational resources.
1.2. Authors
The paper lists a large team of authors, primarily affiliated with Nvidia, indicating a strong industry research effort.
- Co-First Authors: Zhengyi Luo, Ye Yuan, Tingwu Wang, Chenran Li
- Core Contributors: Sirui Chen, Fernando Castañeda, Zi-Ang Cao, Jiefeng Li, David Minor, Qingwei Ben, Xingye Da
- Additional Contributors: Runyu Ding, Cyrus Hogg, Lina Song, Edy Lim, Eugene Jeong, Tairan He, Haoru Xue, Wenli Xiao, Zi Wang, Simon Yuen, Jan Kautz, Yan Chang, Umar Iqbal, Linxi "Jim" Fan
- Project Leads: Yuke Zhu
- Affiliation: Nvidia
1.3. Journal/Conference
The paper is published at https://arxiv.org/abs/2511.07820. arXiv is a preprint server widely used in academia for rapid dissemination of research findings before formal peer review and publication in journals or conferences. While not a peer-reviewed venue itself, papers on arXiv often precede publications in top-tier conferences like NeurIPS, ICML, ICLR, or robotics conferences like RSS, IROS, ICRA, which are highly reputable and influential in the fields of artificial intelligence, machine learning, and robotics.
1.4. Publication Year
The paper was published on arXiv on November 11, 2025 (timestamp 2025-11-11T04:37:40.000Z).
1.5. Abstract
The paper addresses the lack of scaling gains in humanoid control compared to other foundation models (e.g., large language models). It argues that current neural controllers for humanoids are modest in size, behavior, and training resources. The authors propose that scaling up model capacity, data, and compute can lead to a generalist humanoid controller capable of natural and robust whole-body movements. They identify motion tracking as a natural and scalable task for humanoid control, utilizing dense supervision from diverse motion-capture data to learn human motion priors without manual reward engineering.
Their approach involves building a foundation model for motion tracking by scaling along three axes: network size (from 1.2M to 42M parameters), dataset volume (over 100M frames, 700 hours of high-quality motion data), and compute (9k GPU hours). Beyond demonstrating scaling benefits, the model shows practical utility through two mechanisms: (1) a real-time universal kinematic planner for interactive control and downstream task execution, and (2) a unified token space that supports various input interfaces (VR teleoperation, human videos, vision-language-action (VLA) models) using the same policy. The paper concludes that scaling motion tracking steadily improves performance, generalizes to unseen motions, and establishes a practical foundation for humanoid control.
1.6. Original Source Link
- Official Source: https://arxiv.org/abs/2511.07820
- PDF Link: https://arxiv.org/pdf/2511.07820v1.pdf
- Publication Status: This paper is currently a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the lack of scalability in humanoid robot control. In recent years, fields like natural language processing and computer vision have seen monumental progress with billion-parameter foundation models trained on massive datasets using thousands of GPUs. These models demonstrate impressive capabilities like generalization, robustness, and emergent properties that smaller models cannot achieve. However, this trend has not translated to humanoid control.
Current neural controllers for humanoids typically remain small (e.g., three-layer MLPs with a few million parameters), are trained on limited hardware (e.g., a single GPU over days), and target a narrow set of behaviors. A significant challenge in humanoid control is the reliance on manual reward engineering for each specific task. For instance, a controller trained for walking might not be able to dance without extensive re-engineering of its reward function. This task-specific reward engineering is time-consuming, labor-intensive, and limits the general applicability of learned policies, making it difficult to develop flexible controllers that can handle the diverse range of real-world applications required for a truly versatile humanoid.
The paper's entry point or innovative idea is to identify motion tracking as a natural and scalable foundation task for humanoid control. Unlike goal-oriented tasks that require intricate reward designs, motion tracking can leverage motion-capture (MoCap) data which provides dense, frame-by-frame supervision. This eliminates the need for manual reward engineering and allows the robot to acquire human motion priors directly from readily available, diverse human motion datasets covering a wide range of behaviors (walking, running, dancing, sports, object interaction). By demonstrating that motion tracking can be scaled up significantly, the authors propose it as a practical path towards building generalist humanoid controllers.
2.2. Main Contributions / Findings
The paper makes several significant contributions to the field of humanoid control:
- Scalable Foundation Task for Humanoid Control: It identifies and demonstrates motion tracking as a scalable foundation task that exhibits favorable scaling properties with increasing compute and data diversity. This is a crucial shift from traditional reward-engineered approaches.
- Supersizing Humanoid Control: The authors scale up humanoid control to unprecedented levels, using 9k GPU hours and 100 million frames (700 hours) of high-quality motion sequences. This massive scale results in a universal tracking capability across diverse human behaviors, going beyond the limited scope of prior work.
- Kinematic Motion Generation System: Introduction of a real-time universal kinematic planner. This system bridges motion tracking to downstream task execution and enables natural and interactive control. It allows for goal-directed tasks like interactive locomotion, boxing, squatting, kneeling, and crawling, without retraining the core policy.
- Unified Token Space for Multimodal Inputs: Development of a unified token space that supports various multimodal control inputs. This single policy can handle teleoperation (VR, 3-point), human videos, motion commands, text, music, and even vision-language-action (VLA) foundation models. This unified approach facilitates cross-embodiment motion tracking and seamless integration with high-level AI models.
- Comprehensive Evaluation and Robustness:
  - Empirical Scaling Trends: Demonstrated consistent performance improvements with increased data, model capacity, and compute.
  - Zero-Shot Transfer: Showed robust generalization to previously unseen motions in simulation.
  - Robust Sim-to-Real Deployment: Achieved 100% success rate on 50 diverse motion trajectories on a physical Unitree G1 humanoid robot in a zero-shot manner.
  - Integration with Foundation Models: Successfully integrated with a GR00T N1.5 VLA model for mobile manipulation tasks, showcasing compatibility and robustness.
- Practical System for Real-World Deployments: The work presents not just a scaled model, but a practical system combining the scaled tracker with a real-time kinematic planner and a universal token space, making it usable for real-time interactive control and teleoperation.
The key conclusions are that scaling motion tracking indeed yields reliable, general whole-body control, and when paired with a robust planner and a universal token space, it becomes a practical foundation for advanced humanoid autonomy, capable of bridging the gap between high-level reasoning and low-level control.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the paper "SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control," a reader should be familiar with several core concepts in robotics, artificial intelligence, and machine learning.
- Humanoid Control:
  - Concept: This refers to the systems and algorithms that enable humanoid robots (robots designed to resemble the human body, typically with two legs, two arms, and a torso) to move, balance, and interact with their environment. It's a complex challenge due to the high degrees of freedom, dynamic stability requirements, and real-world uncertainties.
  - Relevance: The paper aims to create a generalist controller for such robots, moving beyond simple, pre-programmed movements to more natural and adaptable behaviors.
- Motion Tracking (or Motion Imitation):
  - Concept: A control paradigm where a robot learns to replicate or follow a given reference motion, typically from human data. The goal is for the robot's pose, trajectory, and dynamics to closely match the reference.
  - Relevance: SONIC posits this as the core, scalable task. Instead of defining explicit goal-oriented reward functions, the robot is trained to simply track human motion as accurately as possible. This implicitly teaches the robot how to move naturally and robustly.
- Reinforcement Learning (RL):
  - Concept: A type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward signal. It involves trial-and-error learning. Key components include:
    - Agent: The learning entity (e.g., the robot controller).
    - Environment: The world the agent interacts with (e.g., simulation, real world).
    - State ($s$): The current situation or observation of the environment.
    - Action ($a$): What the agent can do in a given state.
    - Reward ($r$): A scalar feedback signal indicating the desirability of an action taken in a state.
    - Policy ($\pi$): A function that maps states to actions.
  - Relevance: The paper uses Proximal Policy Optimization (PPO) (explained below) to train its policies.
- Proximal Policy Optimization (PPO):
  - Concept: A popular Reinforcement Learning algorithm that balances sample efficiency, performance, and ease of implementation. It works by iteratively updating a policy while ensuring that the new policy does not deviate too much from the old policy, which helps to stabilize training.
  - Relevance: SONIC trains its control policy using PPO to maximize cumulative rewards.
- Foundation Models:
  - Concept: Large-scale machine learning models, often neural networks, trained on vast amounts of data using significant computational resources. They are designed to be highly versatile and can be adapted (fine-tuned) for a wide range of downstream tasks. Examples include large language models (like GPT) and large vision models.
  - Relevance: The paper draws a direct comparison between the scaling success of foundation models in other domains and the lack thereof in humanoid control, motivating its supersizing approach.
- Motion Capture (MoCap) Data:
  - Concept: Digitized recordings of human or animal movement, typically captured using specialized cameras and markers or IMU sensors. This data provides high-fidelity, frame-by-frame information about body pose, joint angles, and velocities.
  - Relevance: This is the dense supervision used by SONIC to train its motion tracking policies, eliminating the need for manual reward engineering. The paper specifically mentions in-house datasets and existing large datasets like AMASS and LaFAN.
- Sim-to-Real Transfer:
  - Concept: The process of training a robot control policy in a simulated environment and then deploying it successfully on a physical robot in the real world. This is challenging because of the sim-to-real gap: the discrepancies between simulation dynamics and real-world physics.
  - Relevance: Domain randomization (explained below) is a common technique to bridge this gap, and SONIC demonstrates successful zero-shot sim-to-real deployment.
- Neural Networks (NNs) / Multi-Layer Perceptrons (MLPs):
  - Concept: MLPs are fundamental neural networks consisting of multiple layers of interconnected neurons. Each neuron takes inputs, applies a weighted sum and an activation function, and produces an output. MLPs are universal function approximators.
  - Relevance: SONIC uses MLPs extensively for its encoders, decoders, actor (policy network), and critic (value network).
- Tokenization & Latent Space:
  - Concept: In machine learning, tokenization is the process of breaking down raw data (like text, images, or in this case, motion commands) into smaller, discrete units called tokens. A latent space is a lower-dimensional representation of the input data, where similar data points are grouped closer together. Quantization is the process of mapping continuous values to a finite set of discrete values (like creating discrete tokens from a continuous latent space).
  - Relevance: SONIC introduces a unified token space and vector quantizer (FSQ) to represent diverse motion commands (robot, human, hybrid) in a common, discrete format, enabling a single policy to handle multiple input modalities.
- Kinematic Planner:
  - Concept: A system that plans robot movements based on kinematics (the study of motion without considering forces), often generating a sequence of poses or joint angles to achieve a desired configuration or trajectory, usually with respect to a user input or high-level command.
  - Relevance: SONIC uses a generative kinematic motion planner to convert user commands (e.g., velocity, style) into short-horizon reference motions that the core motion tracking policy can then follow.
- Vision-Language-Action (VLA) Models:
  - Concept: Foundation models that can process vision (images/videos), language (text commands), and generate actions for robotic control. They aim to provide high-level, semantic understanding and control.
  - Relevance: SONIC demonstrates compatibility with a VLA model (GR00T N1.5), showing how its low-level, robust control can be layered with high-level reasoning from advanced AI models.
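To make the quantization idea in the Tokenization bullet concrete, here is a minimal sketch of scalar quantization in the spirit of FSQ: each latent dimension is squashed into a bounded range and rounded to a small grid, and the resulting per-dimension codes are flattened into one integer token. The level counts, the latent vector, and the function names `fsq_quantize` and `code_to_token` are illustrative assumptions, not SONIC's actual configuration or code.

```python
import math

def fsq_quantize(z, levels):
    """Round each latent dimension to one of levels[i] integer grid points.

    Simplified sketch: assumes every entry of `levels` is odd so the grid
    is symmetric around zero (e.g. 5 levels -> {-2, -1, 0, 1, 2}).
    """
    codes = []
    for value, num_levels in zip(z, levels):
        half = (num_levels - 1) / 2
        bounded = math.tanh(value) * half   # squash into (-half, half)
        codes.append(int(round(bounded)))   # snap to the nearest grid point
    return codes

def code_to_token(codes, levels):
    """Flatten per-dimension codes into a single integer token id."""
    token, base = 0, 1
    for code, num_levels in zip(codes, levels):
        half = (num_levels - 1) // 2
        token += (code + half) * base       # shift code into {0..num_levels-1}
        base *= num_levels
    return token

levels = [5, 5, 3]            # hypothetical grid: 5 * 5 * 3 = 75 possible tokens
latent = [0.1, -3.0, 0.9]     # hypothetical continuous encoder output
codes = fsq_quantize(latent, levels)
token = code_to_token(codes, levels)
print(codes, token)
```

Because the codebook is implicit in the grid, encoders for different input types can all emit tokens from the same finite set, which is the property a shared token space relies on.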
3.2. Previous Works
The paper references several prior works, mainly in the domain of motion tracking and humanoid control. Understanding these helps to contextualize SONIC's advancements.
- Prior Motion Trackers:
  - Any2Track (Zhang et al., 2025), BeyondMimic (Liao et al., 2025), GMT (Chen et al., 2025): These are state-of-the-art motion tracking methods that SONIC compares itself against.
    - Context: These methods typically focus on tracking human motion in simulation, often trained on smaller datasets like LaFAN or AMASS. Their primary goal is to accurately replicate a given motion.
    - Differentiation: SONIC distinguishes itself by its unprecedented scale (100M+ frames, 9k GPU hours), universal tracking capabilities across diverse human behaviors, and practical utility in downstream tasks beyond just motion tracking, as well as its ability to handle multiple input modalities. SONIC also shows strong sim-to-real performance, which is often a challenge for prior works.
  - TWIT (Zeng et al., 2025): Another prior work mentioned for its evaluation dataset usage (a subset of AMASS).
- Motion Capture Datasets:
  - AMASS (Mahmood et al., 2019): A very large and diverse dataset of human motions (including CMU MoCap, MPI HDM05, KIT, etc.). GMT is noted to be trained on AMASS, and SONIC's evaluation uses a subset of this data, highlighting its generalization to unseen data.
  - LaFAN (Liao et al., 2025): A dataset of human locomotion motions, often used for training motion trackers. SONIC mentions using LaFAN as a smaller baseline for comparison.
- Motion Generation Models:
  - GENMO (Li et al., 2025): A multimodal motion generation model that SONIC adopts to support multimodal conditioning (video, text, audio) for human motion.
    - Context: GENMO can synthesize complete human motion trajectories from various inputs, effectively acting as a "human motion generator."
    - Relevance to SONIC: SONIC integrates GENMO to translate high-level multimodal inputs into human motion references, which SONIC's human motion encoder then tracks. This allows the humanoid to respond to text commands, music, or video demonstrations.
- Robot Platforms & Simulators:
  - Unitree G1 (Unitree, 2024): The specific physical humanoid robot platform used for real-world deployment and evaluation in the paper.
  - Isaac Lab (NVIDIA et al., 2025): The simulation platform used for training SONIC models. Isaac Lab is a scalable and GPU-accelerated robotics simulation environment.
  - MuJoCo (Todorov et al., 2012): A physics engine commonly used in robotics research, particularly for comparing with baselines where MuJoCo implementations are standard.
- Technical Components:
  - SMPL (Loper et al., 2015): A statistical 3D human body model used to represent human pose estimates (e.g., from VR tracking).
  - FSQ (Mentzer et al., 2023): Finite Scalar Quantization, a quantization technique that rounds each bounded latent dimension to a small set of levels. SONIC uses it to convert continuous latent representations into discrete universal tokens.
  - GR00T N1.5 (Bjorck et al., 2025): A vision-language-action (VLA) foundation model that SONIC integrates with to demonstrate autonomous control, acting as a high-level planner providing commands to SONIC's policy.
3.3. Technological Evolution
The field of humanoid control has evolved significantly:
- Early Robotics (Pre-2000s): Dominated by hard-coded, model-based control strategies (e.g., inverse kinematics, zero-moment point control). These methods were precise for known tasks but lacked adaptability to unknown environments or complex, dynamic movements.
- Model-Free Reinforcement Learning (2000s-2010s): With advances in computing, RL emerged, allowing robots to learn behaviors directly from experience. However, early RL often struggled with high-dimensional action spaces and required extensive reward engineering. Sim-to-real transfer remained a major hurdle.
- Deep Reinforcement Learning (2010s-Present): The integration of deep neural networks with RL (deep RL) led to breakthroughs, enabling robots to learn complex policies directly from raw sensor data. Techniques like PPO, SAC, and TRPO became standard. Motion imitation gained traction as a way to leverage MoCap data to train policies for natural movements, reducing manual reward engineering. However, these policies were often task-specific, limited in scale, and prone to the sim-to-real gap.
- Foundation Models Era (Late 2010s-Present): Large language models (LLMs) and large vision models demonstrated that scaling up model capacity, data, and compute could lead to emergent capabilities and generalization. This inspired researchers to seek similar scaling laws in robotics.
This paper's work fits into the latest stage of this evolution, aiming to bring the benefits of foundation models to humanoid control. It seeks to overcome the limitations of previous deep RL approaches by demonstrating that supersizing motion tracking can lead to a generalist controller that is robust, natural, and capable of handling diverse tasks and inputs, bridging the gap between low-level control and high-level VLA reasoning.
3.4. Differentiation Analysis
Compared to the main methods in related work, SONIC's approach presents several core differences and innovations:
- Scale of Training (Data, Model, Compute):
  - Previous Work: Prior neural controllers for humanoids are typically modest in size (e.g., a few million parameters), trained on limited data (e.g., LaFAN, a subset of AMASS), and consume relatively little compute (e.g., a single GPU for days).
  - SONIC's Innovation: SONIC scales network size to 42M parameters, dataset volume to over 100M frames (700 hours), and compute to 9k GPU hours using 128 GPUs. This massive scale is a primary differentiator, directly linking to improved performance and generalization, a trend not previously established in humanoid control.
- Generality and Universality:
  - Previous Work: Most prior methods are task-specific, requiring reward engineering or separate models for different behaviors (e.g., one for walking, another for dancing). Motion tracking policies often struggle to generalize beyond their training data or to perform complex downstream tasks.
  - SONIC's Innovation: SONIC aims for a generalist humanoid controller. It establishes motion tracking at scale as a foundation that acquires human motion priors without manual reward engineering. This leads to a universal tracking capability across diverse human behaviors and allows zero-shot transfer to unseen motions and tasks.
- Multimodal Input and Unified Token Space:
  - Previous Work: Controllers usually accept specific types of input (e.g., joystick commands for locomotion, MoCap streams for imitation). Bridging different input modalities or embodiments often requires separate retargeting modules or specialized policies.
  - SONIC's Innovation: SONIC introduces a unified token space and cross-embodiment learning via specialized encoders. This allows a single policy to accept and seamlessly switch between highly diverse inputs like VR teleoperation, human videos, text commands, music, and even VLA models, treating them all within a shared latent representation. This flexibility is a major step towards truly interactive and adaptable control.
- Practical Utility and Downstream Task Execution:
  - Previous Work: Many motion trackers primarily demonstrate their ability to imitate motions in simulation, with limited showcasing of their utility for real-world interactive tasks or integration with higher-level systems.
  - SONIC's Innovation: SONIC integrates a real-time universal kinematic planner that bridges motion tracking to downstream task execution. This enables interactive control for navigation, boxing, squatting, crawling, and more, all building on the same scaled motion tracker without retraining. It also demonstrates direct integration with VLA models for mobile manipulation, showcasing a complete pipeline from high-level intent to robust physical execution.
- Robust Sim-to-Real Deployment:
  - Previous Work: Sim-to-real transfer remains a significant challenge. While some methods achieve it, demonstrating 100% success rates across diverse, unseen real-world motions for a generalist policy is difficult.
  - SONIC's Innovation: SONIC achieves a 100% success rate on 50 diverse real-world motion trajectories on a Unitree G1 robot in a zero-shot manner. This underlines the robustness and reliability gained from its scaling approach and domain randomization.
- Single-Stage Training Without Distillation:
  - Previous Work: Some approaches might use distillation from expert policies or multi-stage training to achieve certain capabilities.
  - SONIC's Innovation: The universal token space and cross-embodiment learning are achieved through single-stage training, simplifying the learning process and making it more efficient.
In essence, SONIC differentiates itself by aggressively scaling up a well-established but previously limited task (motion tracking) and equipping it with practical mechanisms (kinematic planner, unified token space) to create a generalist, multimodal, and robust controller suitable for real-world humanoid applications and integration with future foundation models.
4. Methodology
4.1. Principles
The core idea behind SONIC is that motion tracking can serve as a scalable and natural foundation task for humanoid control. Instead of manually engineering complex reward functions for every desired behavior (e.g., walking, dancing, jumping), the robot can learn to perform natural and robust whole-body movements by simply trying to accurately imitate diverse human motion-capture (MoCap) data. This approach bypasses the laborious reward engineering process and leverages the vast amount of existing, high-quality human motion data to acquire human motion priors.
The theoretical basis and intuition are rooted in the observation that other domains of AI have seen emergent capabilities and generalization by scaling up model capacity, data volume, and computational resources. SONIC hypothesizes that this scaling law also applies to humanoid control. By supersizing motion tracking—training larger neural networks on significantly more diverse motion data using extensive compute—the system can learn a more general, robust, and nuanced understanding of humanoid dynamics and natural movement. This "generalist" motion tracker then acts as a low-level, reactive System 1 controller, capable of precise, stable whole-body skills, which can be effectively combined with higher-level System 2 reasoning from kinematic planners or vision-language-action (VLA) models for complex tasks. The introduction of a unified token space is a key innovation to handle heterogeneous inputs (human, robot, hybrid motions) and various modalities (video, text, VR) under a single control policy, facilitating broad applicability.
4.2. Core Methodology In-depth (Layer by Layer)
The SONIC framework consists of several interconnected components designed to achieve universal humanoid motion tracking and control.
4.2.1. Humanoid Motion Dataset
The foundation of SONIC's scaling approach is its massive and diverse motion dataset.
The authors collected an in-house motion-capture dataset featuring:
- Diversity of Subjects: Over 50 subjects, balanced male and female actors, with a range of body heights (mean = 174.3 cm, std = 10.9 cm), approximately following a normal distribution.
- Diversity of Motions: Includes daily activities, various locomotion styles, combat motions with varied stylistic expressions, sports movements, and object interactions.
- Scale: Contains 700 hours of human motion, resulting in over 100 million frames at 50 Hz. This is significantly larger than datasets typically used in prior works.
- Data Characteristics: Clip durations range from 1 to 180 seconds, with most actions performed by multiple subjects and/or multiple takes, providing intrinsic subject variation.
- Retargeting: The raw human motion data is retargeted to the specific humanoid robot (Unitree G1) using GMR (Araujo et al., 2025). Retargeting is the process of mapping motion data captured from one body (e.g., a human) to another body (e.g., a robot) that may have different proportions or degrees of freedom.
The following figure shows random samples from their motion dataset:
The image is a schematic showing a variety of human motion poses, ranging from static stances to dynamic actions, intended to support research on and applications of natural humanoid control.
Figure 7: Random samples from our motion dataset.
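As a quick sanity check on the dataset scale quoted above, the frame count implied by 700 hours at 50 Hz can be computed directly. This sketch only multiplies the figures stated in the text; the variable names are illustrative.

```python
HOURS = 700            # dataset duration stated in the text
FRAME_RATE_HZ = 50     # frame rate stated in the text
SECONDS_PER_HOUR = 3600

total_frames = HOURS * SECONDS_PER_HOUR * FRAME_RATE_HZ

# Clip durations are stated to range from 1 to 180 seconds.
shortest_clip_frames = 1 * FRAME_RATE_HZ
longest_clip_frames = 180 * FRAME_RATE_HZ

print(total_frames)                               # 126000000
print(shortest_clip_frames, longest_clip_frames)  # 50 9000
```

The result, 126 million frames, is consistent with the "over 100 million frames" figure.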
4.2.2. Universal Humanoid Motion Tracking
SONIC's central component is a universal humanoid motion tracking framework, illustrated in the overview diagram below. This framework uses a unified control policy to track diverse motion commands across different embodiments (robot, human, hybrid motions) and modalities.

The image is a schematic of the SONIC model's structure for motion generation and control. It shows the motion generators used for interactive control tasks, such as the motion planner and the human motion generator, with a unified control policy decoding their commands into robot motion output.
Figure 8: SONIC enables universal humanoid motion tracking through a universal control policy that handles diverse motion commands and modalities. Specialized encoders process robot, human, and hybrid motion commands into a universal token that drives robot control and motion decoders. This cross-embodiment design supports diverse applications including gamepad control, VR teleoperation, whole-body teleoperation, video teleoperation, and multi-modal control from text and music.
4.2.2.1. Motion Tracking Formulation
Humanoid motion tracking is formalized as a Markov Decision Process (MDP) $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{T}, R, \gamma \rangle$:
- $\mathcal{S}$: State space.
- $\mathcal{A}$: Action space.
- $\mathcal{T}$: Transition function (describes how the environment changes based on actions).
- $R$: Reward function.
- $\gamma$: Discount factor.
The policy $\pi$ is trained using Proximal Policy Optimization (PPO) to maximize the expected discounted cumulative reward:
$
\mathbb{E} \left[ \sum_{t=1}^{\bar{T}} \gamma^{t-1} r_t \right]
$
where $\bar{T}$ is the episode length, $\gamma$ is the discount factor, and $r_t$ is the reward at time step $t$.
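The discounted cumulative reward in the objective above can be computed for a single episode as follows. This is a minimal sketch with made-up reward values; PPO optimizes the expectation of this quantity over many rollouts, which the sketch does not attempt to reproduce.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**(t-1) * r_t for t = 1..len(rewards)."""
    total = 0.0
    for t, reward in enumerate(rewards, start=1):
        total += gamma ** (t - 1) * reward
    return total

# Hypothetical three-step episode with a constant reward of 1.0.
episode_rewards = [1.0, 1.0, 1.0]
print(discounted_return(episode_rewards, gamma=0.5))  # 1.0 + 0.5 + 0.25 = 1.75
```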
4.2.2.2. States
The state representation at time $t$ consists of two main components:
- Proprioceptive sensing: Information about the robot's own body state.
$
s_t^{\mathrm{p}} \triangleq \left( \mathbf{q}_t, \dot{\mathbf{q}}_t, \omega_t, \psi_t, \mathbf{a}_{t-1} \right)
$
  - $\mathbf{q}_t$: Joint positions (current angles of each joint).
  - $\dot{\mathbf{q}}_t$: Joint velocities (rate of change of joint angles).
  - $\omega_t$: Root angular velocity (rotational speed of the robot's main body).
  - $\psi_t$: Gravity vector in the root frame (provides information about the robot's orientation relative to gravity).
  - $\mathbf{a}_{t-1}$: Previous action taken at time $t-1$.
- Motion command $s_t^{\mathrm{g}}$: The target motion the robot needs to track. This can be one of three types:
  - Robot motion: A reference motion defined in terms of robot-specific joint angles and root poses.
  - Human motion: A reference motion derived from human MoCap data (e.g., SMPL pose estimates).
  - Hybrid motion: A combination, typically sparse upper-body keypoints (like head and hands) for real-time tracking, combined with lower-body robot motion.
All state quantities are expressed in the robot's local heading frame to ensure rotation invariance, meaning the policy's understanding of motion is independent of the robot's absolute orientation in the world.
4.2.2.3. Actions
The policy (the neural network controlling the robot) outputs target joint positions as its actions. These target positions are then sent to proportional-derivative (PD) controllers at each joint, which compute the necessary torques to move the robot's joints to the desired target positions.
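The step from target joint positions to torques can be sketched with the textbook PD law. This is an illustrative sketch only: the gains `kp` and `kd` below are hypothetical values, not the gains used on the robot, and the target joint velocity is assumed to be zero.

```python
import numpy as np


def pd_torque(q_target, q, q_dot, kp, kd):
    """Torque pulling each joint toward q_target, damped by joint velocity."""
    return kp * (q_target - q) - kd * q_dot


q_target = np.array([0.10, -0.20])  # policy action: target joint positions (rad)
q = np.array([0.00, -0.25])         # measured joint positions (rad)
q_dot = np.array([0.5, 0.0])        # measured joint velocities (rad/s)
tau = pd_torque(q_target, q, q_dot, kp=40.0, kd=2.0)  # hypothetical gains
```

Because the policy only chooses `q_target`, the stiffness and damping of the low-level controller shape how aggressively each action is executed.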
4.2.2.4. Rewards
The reward function $r_t$ is designed to guide the policy towards accurate motion tracking while penalizing undesirable robot behaviors. Following Liao et al. (2025), it combines a tracking reward and penalty terms:
$
r _ { t } = \mathcal { R } \big ( s _ { t } ^ { \mathrm { p } } , s _ { t } ^ { \mathrm { g } } \big ) + \mathcal { P } \big ( s _ { t } ^ { \mathrm { p } } , \pmb { a } _ { t } \big )
$
- $\mathcal{R}(s_t^{\mathrm{p}}, s_t^{\mathrm{g}})$: The tracking term, which rewards small errors between the robot's current proprioceptive state $s_t^{\mathrm{p}}$ and the target motion command $s_t^{\mathrm{g}}$. It covers errors in:
  - Root position.
  - Root orientation (6D rotation representation).
  - Body link positions (relative to the root).
  - Body link orientations (relative to the root).
  - Body link linear velocities.
  - Body link angular velocities.
- $\mathcal{P}(s_t^{\mathrm{p}}, \pmb{a}_t)$: The penalty term, which discourages undesirable behaviors such as:
  - Abrupt action changes (action rate).
  - Violations of joint limits.
  - Undesired contacts (e.g., unintended body parts touching the ground).

The detailed reward design, including the per-term weights, is presented in Table 1. Orientations are represented as 6D rotations (Zhou et al., 2019), a continuous representation that avoids the singularities associated with Euler angles.
The following are the results from Table 1 of the original paper:
| Reward term | Equation | Weight |
| Tracking rewards $\mathcal{R}(s_t^{\mathrm{p}}, s_t^{\mathrm{g}})$ | | |
| Root orientation | | 0.5 |
| Body link pos (rel.) | | 1.0 |
| Body link ori (rel.) | | 1.0 |
| Body link lin. vel | | 1.0 |
| Body link ang. vel | | 1.0 |
| Penalty terms $\mathcal{P}(s_t^{\mathrm{p}}, \pmb{a}_t)$ | | |
| Action rate | | -0.1 |
| Joint limit | | -10.0 |
| Undesired contacts | | -0.1 |
Symbol Explanation for Table 1:
- : Reward for root orientation tracking at time .
- : Weighting coefficient for orientation error.
- : Goal root orientation (from target motion command).
- : Proprioceptive (current) root orientation.
- : Squared Euclidean norm of the difference.
- : Reward for body link position tracking.
- : Weighting coefficient for position error.
- : Number of tracked body links in set .
- : Goal relative position of body link .
- : Proprioceptive relative position of body link .
- : Reward for body link orientation tracking. (Equation missing in original table, assuming similar structure to root orientation).
- : Reward for body link linear velocity tracking.
- : Weighting coefficient for linear velocity error.
- : Goal linear velocity of body link .
- : Proprioceptive linear velocity of body link .
- : Reward for body link angular velocity tracking. (Equation missing in original table, assuming similar structure).
- : Weighting coefficient for angular velocity error.
- : Goal angular velocity of body link .
- : Proprioceptive angular velocity of body link .
- : Penalty for
action rate(abrupt changes in actions). - : Weighting coefficient for action rate penalty.
- : Action at time (target joint positions).
- : Action at time
t-1. - : Penalty for violating
joint limits. - : Weighting coefficient for joint limit penalty.
- : Indicator function, which is 1 if the condition is true, and 0 otherwise.
- : Current position of joint .
- : Minimum allowed position for joint .
- : Maximum allowed position for joint .
- : Penalty for undesired contacts (e.g., non-foot/hand body parts touching the ground).
- : Weighting coefficient for contact penalty.
- : Set of body parts checked for undesired contacts.
- : Magnitude of the contact force on body part .
- : Contact force threshold.
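The three penalty terms described above can be sketched as follows. The weights are taken from Table 1, but the functional forms are assumptions made for illustration (squared action difference, a count of out-of-range joints, and a count of body parts whose contact force exceeds a threshold); the exact equations of the paper are not reproduced here.

```python
import numpy as np

# Weights from Table 1.
W_ACTION_RATE, W_JOINT_LIMIT, W_CONTACT = -0.1, -10.0, -0.1


def penalty(a_t, a_prev, q, q_min, q_max, contact_forces, force_threshold):
    """Sum of weighted penalty terms (assumed forms, see lead-in)."""
    action_rate = np.sum((a_t - a_prev) ** 2)           # abrupt action changes
    limit_hits = np.sum((q < q_min) | (q > q_max))      # indicator per joint
    bad_contacts = np.sum(contact_forces > force_threshold)  # indicator per body part
    return float(W_ACTION_RATE * action_rate
                 + W_JOINT_LIMIT * limit_hits
                 + W_CONTACT * bad_contacts)


p = penalty(
    a_t=np.array([1.0, 0.0]), a_prev=np.array([0.0, 0.0]),
    q=np.array([0.0, 2.0]), q_min=np.array([-1.0, -1.0]), q_max=np.array([1.0, 1.0]),
    contact_forces=np.array([0.5, 5.0]), force_threshold=1.0,
)
```

The joint-limit weight is two orders of magnitude larger than the others, so a single violation dominates the penalty in this sketch.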
4.2.2.5. Domain Randomization
To enhance robustness and generalization across diverse scenarios and bridge the sim-to-real gap, systematic domain randomization is applied during training. This involves varying physical parameters and applying perturbations.
- Physical Parameters:
  - Friction coefficients: Static and dynamic friction.
  - Restitution coefficient: Bounciness of collisions.
  - Default joint positions: Small variations in initial joint states.
  - Base center-of-mass (COM) position: Variations in the robot's mass distribution.
- Perturbations:
  - Root linear and angular velocities: Periodic application of random external pushes to simulate real-world disturbances.
  - Motion perturbation to target motion commands: Jitter in target positions, orientations, and velocities.

The following are the results from Table 2 of the original paper:
Domain Randomization Sampling Distribution Physical parameters Static friction coefficients Dynamic friction coefficients Restitution coefficient Default joint positions Base COM offset (x, y, z) , , Root velocity perturbations (external pushes) Root linear vel (x, y, z) , , Push duration Root angular vel , , Target motion perturbations () Target position jitter (x,y: , z: ) Target orientation jitter , Target linear vel jitter (x,y: , z: ) Target angular vel jitter , Target joint jitter
-
Symbol Explanation for Table 2:
- : Uniform distribution between and .
- : Static friction coefficient.
- : Dynamic friction coefficient.
- : Restitution coefficient.
- : Default joint positions.
- : Offset for base
COM(Center of Mass) in x, y, z directions. - : Linear velocity components for root pushes.
- : Duration of root pushes.
- : Angular velocity components (roll, pitch, yaw) for root pushes.
- : Jitter applied to target position.
- : Jitter applied to target orientation (roll, pitch, yaw).
- : Jitter applied to target linear velocity.
- : Jitter applied to target angular velocity.
- : Jitter applied to target joint positions.
4.2.2.6. Universal Control Policy Architecture
The Universal Control Policy is the core of SONIC, characterized by its ability to handle multiple motion command types from different embodiments through a unified encoder-decoder architecture.
- Cross-Embodiment Learning: Achieved by specialized encoders that process heterogeneous inputs (human, robot, hybrid motions) into a shared latent representation. This representation is then quantized into a universal token, which drives a common robot control decoder. This design enables the robot to learn from various sources without needing a separate retargeting module for each.

Encoders:
Three specialized encoders (Multi-Layer Perceptrons as detailed in Table 3) process distinct motion command types into the shared latent space:
- Robot motion encoder: Encodes robot joint positions and velocities over a window of future frames with frame interval $\Delta t_r = 0.1$ s.
- Human motion encoder: Encodes 3D human joint positions (e.g., from SMPL format) over a window of future frames with frame interval $\Delta t_h = 0.02$ s.
- Hybrid motion encoder: Encodes sparse upper-body keypoints (head and hands) of the current frame (for real-time tracking), combined with lower-body robot motion over a window of future frames with frame interval $\Delta t_m = 0.1$ s.

Multi-frame inputs allow the policy to anticipate future motion and improve robustness.
Quantizer:
The encoded latent representation from the encoders is quantized into a universal token using finite scalar quantization (FSQ; Mentzer et al., 2023). The universal token is a vector in which every dimension is rounded to one of a fixed number of quantization levels. This discretization helps create a shared, discrete representation space.
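The quantization step can be sketched as below. This is a simplified reading of FSQ, not the configuration used in SONIC: it assumes an odd number of levels per dimension, and both the latent values and `levels=5` are hypothetical. In training, FSQ typically passes gradients through the rounding with a straight-through estimator, which this forward-only sketch omits.

```python
import numpy as np


def fsq_quantize(z, levels):
    """Bound each latent dimension, then round it to one of `levels` values."""
    half = (levels - 1) / 2.0        # assumes an odd number of levels
    bounded = np.tanh(z) * half      # squash into (-half, half)
    return np.round(bounded) / half  # snap to a uniform grid in [-1, 1]


# With 5 levels the grid is {-1, -0.5, 0, 0.5, 1}.
token = fsq_quantize(np.array([-3.0, 0.1, 0.6, 2.5]), levels=5)
```

Because the grid is fixed, there is no learned codebook to collapse, which is one reason FSQ is a convenient choice for a shared token space.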
Decoders:
The universal token is processed by two separate decoders (MLPs as detailed in Table 3):
- Robot control decoder: Transforms the universal token into motor commands (target joint positions) that control the robot's joints.
- Robot motion decoder: Reconstructs the robot motion command. This provides auxiliary supervision during training, helping to improve the quality of the latent space and enhance feature learning.

The following are the results from Table 3 of the original paper:
Module Architecture Dims Network configuration Quantizer FSQ token dimensions = quantization levels = Encoder () MLP hidden = [2048, 1024, 512, 512] Encoder (teleop) MLP hidden = [2048, 1024, 512, 512] Encoder (smpl) MLP hidden = [2048, 1024, 512, 512] Decoder (actions) MLP hidden = [2048, 2048, 1024, 1024, 512, 512] Decoder (refs) MLP hidden = [2048, 1024, 512, 512] Action dimension Diagonal Gaussian 29 Critic MLP hidden = [2048, 2048, 1024, 1024, 512, 512] Motion command Future frames frames Frame interval `\Delta t_r = \Delta t_m = 0.1s` `\Delta t_h = 0.02s`
Symbol Explanation for Table 3:
- `token dimensions`: Dimension of the universal token.
- `quantization levels`: Number of quantization levels per dimension for the universal token.
- `MLP`: Multi-Layer Perceptron. `hidden` indicates the number of neurons in each hidden layer.
- `Encoder (g_r)`: Refers to the robot motion encoder.
- `Encoder (teleop)`: Likely refers to the hybrid motion encoder for teleoperation.
- `Encoder (smpl)`: Refers to the human motion encoder (as SMPL is a human body model).
- `Decoder (actions)`: Corresponds to the robot control decoder.
- `Decoder (refs)`: Corresponds to the robot motion decoder.
- `Action dimension`: The number of outputs from the policy (e.g., target joint positions). `Diagonal Gaussian` suggests that the policy outputs a mean and standard deviation for a Gaussian distribution over actions.
- `Critic`: The value network in PPO, also an MLP.
- `Future frames`: Number of future frames considered by the robot, human, and hybrid motion encoders, respectively.
- $\Delta t_r, \Delta t_m, \Delta t_h$: Frame intervals for the robot, hybrid, and human motion encoders, respectively.
Training:
Training involves preparing synchronized motion data across all three command types ($g_r$, $g_h$, $g_m$ for robot, human, and hybrid motion). Each command type is encoded via its respective encoder and quantized to produce universal tokens ($z_r$, $z_h$, $z_m$). The total loss function comprises four components:
$ \mathcal { L } = \mathcal { L } _ { \mathrm { ppo } } + \mathcal { L } _ { \mathrm { recon } } + \mathcal { L } _ { \mathrm { token } } + \mathcal { L } _ { \mathrm { cycle } } $
- $\mathcal{L}_{\mathrm{ppo}}$: The standard PPO loss. PPO optimizes the policy by maximizing the expected cumulative reward while keeping policy updates stable; it typically combines a clipped surrogate objective with a value function loss.
- $\mathcal{L}_{\mathrm{recon}}$: The reconstruction loss for the robot motion command across input modalities. It ensures that the motion decoded from any token type matches the corresponding robot motion: $ \mathcal{L}_{\mathrm{recon}} = \| \mathcal{D}_r(z_r) - g_r \|^2 + \| \mathcal{D}_r(z_h) - g_r \|^2 + \| \mathcal{D}_r(z_m) - g_r \|^2 $
  - $\mathcal{D}_r(z_r)$: Reconstruction of robot motion from the robot token.
  - $\mathcal{D}_r(z_h)$: Reconstruction of robot motion from the human token.
  - $\mathcal{D}_r(z_m)$: Reconstruction of robot motion from the hybrid token.
  - $g_r$: The ground truth robot motion command.
  - $\| \cdot \|^2$: Squared Euclidean norm.
  - Crucially, when the input command is human motion ($g_h$), the encoder-decoder pipeline ($g_h \rightarrow z_h \rightarrow \mathcal{D}_r(z_h)$) acts as a retargeting pipeline from human to robot motion. Thus, $\mathcal{L}_{\mathrm{recon}}$ effectively serves as a retargeting loss that enables cross-embodiment transfer.
- $\mathcal{L}_{\mathrm{token}}$: This loss measures the discrepancy between the robot token $z_r$ and the human motion token $z_h$ in the latent space: $ \mathcal{L}_{\mathrm{token}} = \| z_r - z_h \|^2 $
  - This loss explicitly encourages the encoder networks to produce aligned representations across embodiments. It ensures that motions from different embodiments (human vs. robot) are mapped to similar representations in the shared latent space, which is critical for cross-embodiment learning.
- $\mathcal{L}_{\mathrm{cycle}}$: A cycle consistency loss that further reinforces the alignment of the latent space. It checks whether re-encoding a reconstructed robot motion (derived from a human token) yields a token consistent with the original robot token: $ \mathcal{L}_{\mathrm{cycle}} = \| \mathcal{E}_r(\mathcal{D}_r(z_h)) - z_r \|^2 $
  - $\mathcal{E}_r$: The robot motion encoder.
  - $\mathcal{D}_r(z_h)$: Robot motion reconstructed from the human token.
  - $\mathcal{E}_r(\mathcal{D}_r(z_h))$: Re-encoding of the reconstructed robot motion (originally from human motion) back into a token.
  - $z_r$: The original robot token.
  - This loss ensures that the translation from human to robot motion and back to a token preserves the essential motion characteristics and that the latent space is consistently represented across transformations.
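The three auxiliary losses can be written down directly from the formulas above. In this sketch, `decode_r` and `encode_r` are toy linear stand-ins for $\mathcal{D}_r$ and $\mathcal{E}_r$ (the real modules are MLPs with a quantizer in between), and all vectors are made-up values chosen so the arithmetic is easy to follow.

```python
import numpy as np


def sq_norm(x):
    """Squared Euclidean norm."""
    return float(np.sum(x ** 2))


def alignment_losses(g_r, z_r, z_h, z_m, decode_r, encode_r):
    """Return (L_recon, L_token, L_cycle) as defined in the text."""
    recon = sum(sq_norm(decode_r(z) - g_r) for z in (z_r, z_h, z_m))
    token = sq_norm(z_r - z_h)
    cycle = sq_norm(encode_r(decode_r(z_h)) - z_r)
    return recon, token, cycle


# Toy stand-ins: decoder doubles the token, encoder halves the motion.
decode_r = lambda z: 2.0 * z
encode_r = lambda g: 0.5 * g

g_r = np.array([2.0, 4.0])   # ground-truth robot motion command
z_r = np.array([1.0, 2.0])   # robot token (decodes exactly to g_r)
z_h = np.array([1.0, 2.5])   # human token, slightly misaligned
z_m = np.array([1.0, 2.0])   # hybrid token
recon, token, cycle = alignment_losses(g_r, z_r, z_h, z_m, decode_r, encode_r)
```

Only the misaligned human token contributes here: it produces a reconstruction error through the decoder and the same latent gap in both the token and cycle terms.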
Adaptive Motion Sampling:
To efficiently sample from the large motion dataset during training, a bin-based adaptive motion sampling strategy is used when selecting the initial frame for each episode.
- The entire motion dataset is partitioned into duration bins.
- For each bin $i$, the episode failure rate $f_i$ of the current policy is calculated.
- To prevent oversampling from bins with extremely high failure rates, $f_i$ is capped at $\beta \bar{f}$, where $\beta$ is a hyperparameter and $\bar{f}$ is the average failure rate across all bins.
- The capped failure rates are normalized to derive preliminary sampling weights $\hat{p}_i$.
- The final sampling probability for each bin is defined as:
$
p_i = \alpha \hat{p}_i + (1 - \alpha) \frac{1}{N}
$
  - $\alpha$: A blending hyperparameter.
  - $N$: Total number of bins.
- This formula balances targeted sampling of challenging bins (where the policy struggles) with uniform coverage of the entire dataset. A bin is selected according to $p_i$, and the initial frame is then sampled uniformly from within the chosen bin.

All models are trained using Isaac Lab (NVIDIA et al., 2025) as the simulation platform.
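The bin-based sampling rule can be sketched in a few lines. The function name and the example failure rates are hypothetical, and `alpha=0.5`, `beta=2.0` are illustrative values chosen for readability rather than the settings in Table 4; the fallback to uniform weights when no bin has failed is also an assumption.

```python
import numpy as np


def bin_probabilities(failure_rates, alpha, beta):
    """Blend capped, normalized failure rates with a uniform distribution."""
    f = np.asarray(failure_rates, dtype=float)
    capped = np.minimum(f, beta * f.mean())   # cap at beta times the mean rate
    total = capped.sum()
    # Assumed fallback: uniform weights if the policy never fails anywhere.
    p_hat = capped / total if total > 0 else np.full(f.size, 1.0 / f.size)
    return alpha * p_hat + (1.0 - alpha) / f.size


p = bin_probabilities([0.0, 0.1, 0.9, 0.2], alpha=0.5, beta=2.0)

# Pick a bin according to p; the start frame would then be drawn uniformly inside it.
rng = np.random.default_rng(0)
bin_index = int(rng.choice(p.size, p=p))
```

In this example the bin with a 0.9 failure rate is capped at 0.6, so it is still favored but cannot crowd out the rest, and the bin that never fails keeps a floor probability of $(1-\alpha)/N$.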
The following are the results from Table 4 of the original paper:
| Training hyperparameter | Value |
| Num parallel envs per GPU | 4096 |
| Num steps per env | 24 |
| Learning epochs | 5 |
| Num mini-batches | 4 |
| Discount γ | 0.99 |
| GAE λ | 0.95 |
| Clip parameter | 0.2 |
| Entropy coefficient | 0.013 |
| Value loss coefficient | 1.0 |
| Actor learning rate | 2×10−5 |
| Critic learning rate | 1×10−3 |
| Max gradient norm | 0.1 |
| Desired KL | 0.01 |
| Adaptive LR min/max | [1×10−5 , 2×10−4] |
| Init noise std | 0.05 |
| Actor std clamp min/max | [0.001, 0.5] |
| Adaptive sampling bin size | 1s |
| Adaptive sampling failure rate cap | β = 200 |
| Adaptive sampling blending hyperparameter | α = 0.1 |
Symbol Explanation for Table 4:
- `Num parallel envs per GPU`: Number of simulated environments run concurrently on each GPU.
- `Num steps per env`: Number of simulation steps collected from each environment per policy iteration.
- `Learning epochs`: Number of times the collected data is iterated over for policy updates within one PPO iteration.
- `Num mini-batches`: Number of mini-batches used to update the policy from the collected data.
- `Discount` γ: Discount factor applied to future rewards.
- `GAE` λ: Lambda parameter for Generalized Advantage Estimation, which trades off bias and variance in the advantage estimates.
- `Clip parameter`: Epsilon in PPO's clipped surrogate objective, which limits how much the new policy can deviate from the old.
- `Entropy coefficient`: Weight for the entropy bonus term in the loss, encouraging exploration.
- `Value loss coefficient`: Weight for the value function loss in the PPO objective.
- `Actor learning rate`: Learning rate for the policy network (actor).
- `Critic learning rate`: Learning rate for the value network (critic).
- `Max gradient norm`: Maximum allowed gradient norm during optimization, to prevent exploding gradients.
- `Desired KL`: Target Kullback-Leibler divergence between old and new policies, used in adaptive PPO.
- `Adaptive LR min/max`: Minimum and maximum bounds for the adaptive learning rate.
- `Init noise std`: Initial standard deviation of the action noise.
- `Actor std clamp min/max`: Minimum and maximum bounds for the action standard deviation.
- `Adaptive sampling bin size`: Size of the duration bins for adaptive motion sampling.
- `Adaptive sampling failure rate cap` (β): Hyperparameter for capping failure rates in adaptive sampling.
- `Adaptive sampling blending hyperparameter` (α): Hyperparameter for blending targeted and uniform sampling in adaptive motion sampling.
Tasks and Applications: The universal motion tracking framework enables a wide range of applications:
- Interactive Gamepad Control: User gamepad inputs (velocities, styles) are transformed into robot motions by the generative kinematic motion planner, which the universal policy then tracks via the robot motion encoder.
- VR 3-Point Teleoperation: A PICO VR headset and handheld controllers provide head and wrist SE(3) pose data and navigation commands. The kinematic motion planner generates lower-body motions, forming a hybrid command (upper-body keypoints, lower-body robot motion) tracked by the hybrid motion encoder.
- VLA Integration: A GR00T N1.5 model, post-trained on VR 3-Point Teleoperation data, provides actions fed into the kinematic motion planner. This also results in a hybrid command tracked by the hybrid motion encoder.
- VR Whole-body Teleoperation: VR headsets and IMU sensors on the ankles stream full-body human motion, which the universal policy tracks using the human motion encoder.
- Video Teleoperation & Multi-Modal Control: The GENMO model (Li et al., 2025) estimates or generates human motions from multi-modal inputs (video, text, audio). The universal policy tracks these generated motions using the human motion encoder.
4.2.3. Generative Kinematic Motion Planner
The generative kinematic motion planner is a large latent generative model trained on the same whole-body motion data as the motion tracking policy. Its function is to convert user commands into plausible robot motion trajectories.
- Planning Process: Formulated as an autoregressive motion in-betweening generation task. It continually regenerates future kinematic motions conditioned on previous robot states and incoming user commands.
- Keyframes:
  - Context keyframes: Capture historical robot states (joint positions, root positions).
  - Target keyframes: Generated from user commands (navigation guidance, velocity, direction, style) or skill-specific targets (squatting, crawling, boxing).
- Motion Representation:
  - Mathematically equivalent to the humanoid pose configuration.
  - Uses pelvis-relative joint positions and global joint rotations.
  - During training, samples are randomly rotated to enable planning in arbitrary orientations.
  - Global rotations are crucial for motions where the heading is ill-defined (e.g., squatting, crawling).
Generative Neural Backbone in Latent Space:
Planning is conducted in a latent space. Continuous motions are first encoded as a sequence of latent tokens:
$
\left\{ \boldsymbol{z}_t \right\}_{t=1}^{T/4} = \operatorname{enc} \left( \left\{ p_t, r_t \right\}_{t=1}^{T} \right)
$
- $\boldsymbol{z}_t$: Latent token at time step $t$.
- $T$: Total duration of the motion segment, in frames.
- $T/4$: Length of the latent token sequence, indicating a downsampling rate of 4.
- $p_t$: Pose configuration at frame $t$.
- $r_t$: Root position at frame $t$.
- $\operatorname{enc}$: Encoding function that converts raw motion data into latent tokens.

The latent token sequence is processed by models like Transformers or Conv1D networks to capture temporal consistency. The in-betweening process in the token space is guided by two constraints: the starting and target keyframes (e.g., $\{p_t, r_t\}_{t=1}^{4}$ and $\{p_t, r_t\}_{t=T-4}^{T}$). Instead of predicting the sequence directly, a masked token prediction approach (similar to techniques in language models) is used. $ h = \mathcal{F} \left( \{p_t, r_t\}_{t=1}^{4}, \{p_t, r_t\}_{t=T-4}^{T}, \{z_t\}_{t=1}^{T/4} \right) $ $ \mathrm{Prob}(z_t) = \sigma(h) $
- $\mathcal{F}$: The neural backbone (e.g., Transformer).
- $h$: Logits for each token position.
- $\sigma$: Softmax function applied to the logits to compute token probabilities.
- The process is iterative:
  - Initially, all latent tokens are unknown and are initialized with a learnable mask embedding.
  - During training, the proportion of masked tokens is sampled from a uniform distribution.
  - At inference, the neural backbone iteratively finalizes (unmasks) the tokens it predicts with the highest probability, progressively refining the prediction. The proportion of tokens finalized in each iteration follows a schedule that depends on the current iteration and the maximum number of iterations.
- After all tokens are finalized, the predicted tokens are used to reconstruct the kinematic motions and generate robot control signals.
Root Trajectory Spring Model:
An intuitive critically damped spring model is used to generate the root position and heading of the keyframes from user commands:
$
x ( t ) = \left( x _ { T } - x _ { 0 } + \left( v _ { 0 } + \frac { c } { 2 } \left( x _ { T } - x _ { 0 } \right) \right) t \right) e ^ { - \frac { c } { 2 } t }
$
- $x(t)$: Position or angle at time $t$.
- $x_T$: Target value (e.g., target pelvis position, target heading angle).
- $x_0$: Initial value.
- $v_0$: Initial velocity.
- $c$: Damping coefficient.
- $e$: Euler's number (base of the natural logarithm).
- This model is applied to three quantities: pelvis position along the x-axis, pelvis position along the y-axis, and the projected heading angle of the pelvis. Separate damping coefficients are used for position and heading.
- Target values can come directly from controllers or be derived from a desired velocity, in which case the target position is computed 1.0 s ahead. This model improves behavioral predictability and safeguards against unrealistic commands (e.g., abrupt direction reversals).
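For intuition, the textbook closed form of a critically damped spring is sketched below. It is written so that $x(0) = x_0$, $\dot{x}(0) = v_0$, and $x(t) \rightarrow x_T$ as $t$ grows; this sign and offset convention is an assumption and may differ from the expression quoted above. The function name and the example values (including `c=4.0`) are hypothetical.

```python
import math


def critically_damped(t, x0, v0, x_target, c):
    """Closed-form critically damped spring toward x_target (damping c)."""
    j0 = x0 - x_target          # initial offset from the target
    j1 = v0 + 0.5 * c * j0      # chosen so the initial velocity equals v0
    return x_target + (j0 + j1 * t) * math.exp(-0.5 * c * t)


# Pelvis x-position moving from 1.0 toward 2.0 with initial velocity 0.3.
trajectory = [critically_damped(0.1 * k, x0=1.0, v0=0.3, x_target=2.0, c=4.0)
              for k in range(11)]
```

A larger `c` makes the root converge faster; because the system is critically damped, the trajectory approaches the target without oscillating, which is what makes abrupt command changes safe to feed in.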
Keyframe Module and Application Integration:
- Traditional motion planning often relies on complex keyframe generation. SONIC provides keyframes more intuitively.
- For navigation control, target keyframes are generated by sampling a segment from the dataset that aligns with the specified velocity, direction, and style. The planner's variable-length motion generation (0.8s to 2.4s) and robust in-betweening capabilities ensure natural and smooth motions. Autoregressive replanning continuously refreshes the motion, reducing dependence on specific spatial details of the target keyframes.
- For entertainment tasks (e.g., boxing), target keyframes select expressive segments (e.g., maximal arm extension for a punch).
- Motion layering is supported, where upper-body movements are specified, and the lower body is generated by the planner.
- For manipulation tasks (e.g., squatting, kneeling), keyframes are retrieved from the motion library based on desired heights.
4.2.4. Multi-modal Motion Generation Model (GENMO)
SONIC adopts GENMO (Li et al., 2025) to support multi-modal conditioning, meaning it can generate human motions from various input types.
- Core Idea: GENMO treats motion estimation from videos as a constrained generation problem. The model synthesizes a complete motion trajectory that is consistent with observed video keypoints, while also being capable of generating novel motions from abstract conditions like text or music.
- Conditioning Modalities and Temporal Layout:
  - Accepts mixed, time-varying conditions: text prompts, audio features, and visual observations.
  - Conditions can appear at different temporal intervals.
  - Each stream is encoded by a modality-specific encoder into features aligned to a common motion frame rate.
  - Missing intervals are handled by empty or masked tokens.
- Architecture:
  - Condition streams are fused using a temporal transformer with cross-attention from motion tokens to multimodal condition tokens.
  - A diffusion-based motion prior operates over the human motion sequence, denoising from Gaussian noise to a kinematically plausible trajectory. This prior learns human motion dynamics and is agnostic to the specific conditioning type (the modality-specific processing happens in the encoders and fusion layers).
  - The denoised human motion is then encoded into SONIC's universal token space, allowing a single decoder to serve multiple downstream embodiments.
- Training Objective: Mixes two complementary objectives:
  - Generative learning: Uses a standard diffusion loss over human motions, conditioned on available modalities (text/audio/video), to learn a broad motion prior.
  - Estimation-guided learning: Adds reconstruction terms when observations are present (e.g., 2D/3D keypoints, trajectory constraints), so that the generated motion matches measurements where they exist and relies on the prior elsewhere.
  - This allows the same network parameters to serve both generation and estimation tasks. Mixed batches with variable-length sequences and randomly dropped modalities teach the model to handle partial and asynchronous evidence.
- Inference Modes: Supports pure generation from abstract prompts, constrained generation given video frames, and hybrid control where modalities switch over time.
- Integration with SONIC: GENMO provides low-latency motion generation using sliding windows with overlap. The diffusion denoising process is modified with inpainting to handle transitions between windows, ensuring smooth, continuous motion. The overlapped part of the denoised human motion is overwritten by the previous window's motions during denoising.
4.2.5. Deployment
Experiments are conducted on a Unitree G1 platform, utilizing its built-in joint-level PD controller.
- Onboard Execution: The learned policy and its associated management stack execute directly on the robot's onboard computer (a Jetson Orin GPU), minimizing feedback latency and improving timing determinism.
- Control Frequencies:
  - Policy loop: Runs at 50 Hz. At each cycle, observations (robot state, operator commands) are assembled.
  - Target joint positions: Streamed by a separate high-rate process at 500 Hz through the Unitree low-level API.
  - User input: Captured independently at 100 Hz.
  - Kinematic motion planner: Employed at 10 Hz, proposing short-horizon sequences based on input interface commands.
- Synchronization and Blending: Planner outputs are temporally aligned with the 50 Hz policy loop and smoothly blended into the control objectives, providing a motion reference without sacrificing responsiveness.
- Robustness: Each periodic process operates on consistent snapshots of the robot state. A "latest-data-wins" convention mitigates jitter, preventing transient delays from stalling the system.
- Optimization: Both the interactive kinematic motion planner and the policy inference execute onboard using TensorRT with CUDA Graph acceleration. This yields reliable, low-variance execution, with low per-pass latency for the policy and 12 ms per forward pass for motion generation (kinematic planner).
- Result: This architecture emphasizes timing robustness and smoothness, producing stable motion while decoupling loops to preserve platform control stability and bound end-to-end latency.
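The interaction between the 10 Hz planner and the 50 Hz policy loop under a "latest-data-wins" convention can be illustrated with a single-threaded tick simulation. This is a deliberately simplified sketch: the real system runs separate processes, and the function name and the use of the tick index as a stand-in for a plan are hypothetical.

```python
BASE_HZ = 500      # base clock, matching the joint-target streaming rate
PLANNER_HZ = 10    # kinematic motion planner
POLICY_HZ = 50     # policy loop


def simulate(num_ticks):
    """Return which plan each policy step consumed during the simulation."""
    latest_plan = None
    plans_used = []
    for tick in range(num_ticks):
        if tick % (BASE_HZ // PLANNER_HZ) == 0:
            latest_plan = tick              # planner publishes; old plan is overwritten
        if tick % (BASE_HZ // POLICY_HZ) == 0:
            plans_used.append(latest_plan)  # policy reads the latest snapshot, never waits
    return plans_used


plans_used = simulate(BASE_HZ)  # one simulated second
```

Each plan is consumed by five consecutive policy steps; if a planner update were delayed, the policy would simply keep tracking the previous plan instead of stalling.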
5. Experimental Setup
5.1. Datasets
The experiments primarily leverage a proprietary in-house motion-capture dataset, augmented with comparisons to existing public datasets.
- In-house Dataset (SONIC's Primary Training Data):
  - Source: Internally collected motion-capture data.
  - Scale: 700 hours of human motion, resulting in over 100 million frames at 50 Hz.
  - Characteristics: Collected from over 50 subjects (balanced male/female, diverse body heights). Includes thousands of unique motion behaviors: daily activities, various locomotion styles, combat motions, sports movements, and object interactions. Most actions are performed by multiple subjects and in multiple takes to provide intrinsic variation.
  - Domain: Humanoid motion for whole-body control.
  - Purpose: This large and diverse dataset is crucial for SONIC's supersizing approach, providing dense supervision for learning human motion priors without manual reward engineering. The human motion is retargeted to the Unitree G1 robot using GMR (Araujo et al., 2025).

  An example of the data diversity can be seen in Figure 7 (already presented in Methodology).
- AMASS (Mahmood et al., 2019), evaluation subset:
  - Source: A public archive of human motion-capture data, commonly retargeted for humanoids.
  - Scale: SONIC uses a uniformly random subset of retargeted AMASS data for evaluation (9 hours, 1,602 trajectories).
  - Purpose: This subset serves as a large, previously unseen test set for evaluating SONIC's generalization capabilities. It is chosen specifically because SONIC is not trained on AMASS.
- LaFAN (Liao et al., 2025):
  - Source: A public dataset of human locomotion motions.
  - Scale: 0.4 million frames.
  - Purpose: Used as a baseline smaller dataset for scaling studies (comparing 0.4M frames vs. 7.4M frames vs. 100M frames) to show the impact of dataset size on performance. Any2Track and BeyondMimic baselines are typically trained on LaFAN.
- AMASS (Mahmood et al., 2019):
  - Source: A very large and diverse public dataset of human motions.
  - Purpose: GMT (another baseline) is noted to be trained on AMASS.
Why these datasets were chosen:
The in-house dataset provides the necessary scale and diversity for the supersizing hypothesis. The held-out AMASS subset is used to evaluate on out-of-distribution motions, while LaFAN and AMASS allow fair comparisons with baselines that were trained on these public datasets. Together they test generalization and demonstrate the benefits of scale.
5.2. Evaluation Metrics
The paper employs a comprehensive set of metrics to quantify motion imitation performance, encompassing both pose-based (kinematic) and physics-based (dynamic) aspects.
- Success Rate (succ):
  - Conceptual Definition: Measures the percentage of imitation attempts where the humanoid successfully tracks the reference motion trajectory without significant deviation or failure (e.g., falling). It represents the overall robustness and stability of the controller.
  - Mathematical Formula: The paper defines it conceptually without a specific formula, but it is typically calculated as: $ \text{succ} = \frac{\text{Number of successful trajectories}}{\text{Total number of trajectories}} \times 100\% $
  - Symbol Explanation:
    - Number of successful trajectories: The count of motion sequences where the robot maintained tracking within defined thresholds.
    - Total number of trajectories: The total count of motion sequences attempted for tracking.
  - Failure Criteria: For the scaling-up motion tracking experiments, an imitation attempt is deemed unsuccessful if the humanoid's body height deviates by more than 0.25 m from the reference motion, or if the root orientation differs by more than 1 radian at any point during tracking. For the comparison with baselines, a more relaxed criterion is used: imitation is unsuccessful only if the humanoid falls, defined as its root height deviating by more than 0.25 m from the reference.
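The two failure criteria and the resulting success rate can be sketched as follows. The thresholds (0.25 m, 1 radian) come from the text; the function names, the per-frame array layout, and the convention of passing `None` for the relaxed criterion are assumptions made for this example.

```python
import numpy as np


def is_success(height_robot, height_ref, root_angle_err=None,
               max_height_dev=0.25, max_angle=1.0):
    """Check one trajectory; pass root_angle_err=None for the relaxed criterion."""
    dev = np.abs(np.asarray(height_robot) - np.asarray(height_ref))
    height_ok = bool(np.all(dev <= max_height_dev))
    if root_angle_err is None:      # relaxed: only height deviation counts as a fall
        return height_ok
    return height_ok and bool(np.all(np.asarray(root_angle_err) <= max_angle))


def success_rate(flags):
    """Percentage of successful trajectories."""
    return 100.0 * sum(flags) / len(flags)


flags = [
    is_success([0.80, 0.75], [0.80, 0.80]),           # small height error
    is_success([0.80, 0.50], [0.80, 0.80]),           # height drops by 0.3 m
    is_success([0.80], [0.80], root_angle_err=[1.2]), # strict: orientation off
    is_success([0.80], [0.80], root_angle_err=[0.2]),
]
rate = success_rate(flags)
```

The check is applied over every frame, so a single frame outside the threshold marks the whole trajectory as failed.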
- Root-relative Mean Per-Joint Position Error (MPJPE):
- Conceptual Definition: Quantifies the local accuracy of the imitation by measuring the average positional difference between corresponding joints of the simulated humanoid and the reference motion, expressed relative to the root (pelvis). Lower values indicate better tracking fidelity.
- Mathematical Formula: While the paper reports $E_{\mathrm{mpjpe}}$ in mm, it does not provide a specific formula. A standard MPJPE formula is: $ E_{\mathrm{mpjpe}} = \frac{1}{N_{frames} \cdot N_{joints}} \sum_{t=1}^{N_{frames}} \sum_{j=1}^{N_{joints}} \left\| (P_{j,t}^{\text{robot}} - P_{\text{root},t}^{\text{robot}}) - (P_{j,t}^{\text{ref}} - P_{\text{root},t}^{\text{ref}}) \right\|_2 $
- Symbol Explanation:
- $N_{frames}$: Total number of frames in the motion trajectory.
- $N_{joints}$: Total number of joints being evaluated.
- $P_{j,t}^{\text{robot}}$: Position of joint $j$ of the robot at time $t$ in global coordinates.
- $P_{\text{root},t}^{\text{robot}}$: Position of the root (pelvis) of the robot at time $t$ in global coordinates.
- $P_{j,t}^{\text{ref}}$: Position of joint $j$ of the reference motion at time $t$ in global coordinates.
- $P_{\text{root},t}^{\text{ref}}$: Position of the root (pelvis) of the reference motion at time $t$ in global coordinates.
- $\| \cdot \|_2$: Euclidean (L2) norm, representing the distance.
- The subtraction of root positions makes the error root-relative, removing the influence of global translation.
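The standard formula above translates directly into code. The sketch below is a dependency-free illustration, assuming positions are given as nested lists of `(x, y, z)` tuples with the root at index 0; the function name and data layout are hypothetical and not taken from the paper.

```python
import math


def root_relative_mpjpe(robot, ref, root_index=0):
    """Mean per-joint position error after subtracting each frame's root position.

    robot, ref: lists of frames; each frame is a list of (x, y, z) joint
    positions in global coordinates. The result has the same unit as the input
    (multiply by 1000 to go from meters to mm).
    """
    if len(robot) != len(ref):
        raise ValueError("robot and reference must have the same number of frames")
    total, count = 0.0, 0
    for robot_frame, ref_frame in zip(robot, ref):
        robot_root = robot_frame[root_index]
        ref_root = ref_frame[root_index]
        for p_robot, p_ref in zip(robot_frame, ref_frame):
            rel_robot = [a - b for a, b in zip(p_robot, robot_root)]
            rel_ref = [a - b for a, b in zip(p_ref, ref_root)]
            total += math.dist(rel_robot, rel_ref)
            count += 1
    return total / count


# One frame, two joints (root + one joint). The robot is shifted 5 m in x,
# which the root-relative formulation ignores.
ref = [[(0.0, 0.0, 1.0), (0.0, 0.0, 2.0)]]
shifted = [[(5.0, 0.0, 1.0), (5.0, 0.0, 2.0)]]
offset = [[(5.0, 0.0, 1.0), (5.0, 0.3, 2.4)]]

pure_translation_error = root_relative_mpjpe(shifted, ref)
local_error = root_relative_mpjpe(offset, ref)
```

Note that the root joint itself always contributes zero error in this form, so the average is taken over all joints including the root, matching the double sum in the formula.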
- Differences in Acceleration ($E_{\mathrm{acc}}$):
- Conceptual Definition: Assesses the physical fidelity of the robot's motion by comparing the acceleration profiles of its body parts to those of the reference motion. Large differences indicate unnatural or non-physical movements.
- Mathematical Formula: The paper indicates the unit but does not provide a specific formula. It is typically a mean squared or absolute error of acceleration, summed or averaged over joints/links; one common form averages the L2 norm of the difference: $ E_{\mathrm{acc}} = \frac{1}{N_{frames} \cdot N_{links}} \sum_{t=1}^{N_{frames}} \sum_{l=1}^{N_{links}} \left\| A_{l,t}^{\text{robot}} - A_{l,t}^{\text{ref}} \right\|_2 $
- Symbol Explanation:
- $N_{links}$: Number of body links.
- $A_{l,t}^{\text{robot}}$: Acceleration of body link $l$ of the robot at time $t$.
- $A_{l,t}^{\text{ref}}$: Acceleration of body link $l$ of the reference motion at time $t$.
- $\| \cdot \|_2$: Euclidean (L2) norm.
- Differences in Velocity ($E_{\mathrm{vel}}$):
- Conceptual Definition: Another metric for physical fidelity, comparing the velocity profiles of the robot's body parts to the reference motion. Consistent velocity matching contributes to natural movement.
- Mathematical Formula: The paper indicates mm/frame but does not provide a specific formula. Similar to acceleration, it is typically a mean squared or absolute error of velocities; one common form is: $ E_{\mathrm{vel}} = \frac{1}{N_{frames} \cdot N_{links}} \sum_{t=1}^{N_{frames}} \sum_{l=1}^{N_{links}} \left\| V_{l,t}^{\text{robot}} - V_{l,t}^{\text{ref}} \right\|_2 $
- Symbol Explanation:
- $N_{links}$: Number of body links.
- $V_{l,t}^{\text{robot}}$: Velocity of body link $l$ of the robot at time $t$.
- $V_{l,t}^{\text{ref}}$: Velocity of body link $l$ of the reference motion at time $t$.
- $\| \cdot \|_2$: Euclidean (L2) norm.
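Because the velocity metric is reported per frame, one plausible way to compute both velocity and acceleration errors is from finite differences of link positions. The sketch below rests on that assumption (the paper may instead read these quantities from the simulator); all function names and the toy data are hypothetical.

```python
import math


def finite_difference(frames):
    """Per-frame difference: out[t][l] = frames[t + 1][l] - frames[t][l]."""
    return [
        [[b - a for a, b in zip(p0, p1)] for p0, p1 in zip(f0, f1)]
        for f0, f1 in zip(frames, frames[1:])
    ]


def mean_l2_error(robot, ref):
    """Mean Euclidean distance over all frames and links."""
    dists = [
        math.dist(x, y)
        for robot_frame, ref_frame in zip(robot, ref)
        for x, y in zip(robot_frame, ref_frame)
    ]
    return sum(dists) / len(dists)


def velocity_error(robot_pos, ref_pos):
    """Mean velocity difference in position units per frame."""
    return mean_l2_error(finite_difference(robot_pos), finite_difference(ref_pos))


def acceleration_error(robot_pos, ref_pos):
    """Mean acceleration difference in position units per frame squared."""
    robot_acc = finite_difference(finite_difference(robot_pos))
    ref_acc = finite_difference(finite_difference(ref_pos))
    return mean_l2_error(robot_acc, ref_acc)


# One link moving along x (positions in mm). The reference moves at a constant
# 10 mm/frame; the robot overshoots in the third frame and then corrects.
ref_pos = [[[0.0, 0.0, 0.0]], [[10.0, 0.0, 0.0]], [[20.0, 0.0, 0.0]], [[30.0, 0.0, 0.0]]]
robot_pos = [[[0.0, 0.0, 0.0]], [[10.0, 0.0, 0.0]], [[22.0, 0.0, 0.0]], [[30.0, 0.0, 0.0]]]

e_vel = velocity_error(robot_pos, ref_pos)      # velocity errors 0, 2, 2 -> mean 4/3
e_acc = acceleration_error(robot_pos, ref_pos)  # acceleration errors 2, 4 -> mean 3
```

The example shows why the acceleration metric is sensitive to jitter: a single 2 mm position overshoot produces a larger acceleration error than velocity error.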
5.3. Baselines
The paper compares SONIC against several state-of-the-art motion tracking methods to demonstrate its superiority:
- Any2Track (Zhang et al., 2025):
- Description: A general motion tracking framework designed to track diverse motions.
- Training Data: Typically trained on smaller datasets like LaFAN.
- Why Representative: It is a contemporary deep RL-based motion tracker.
- BeyondMimic (Liao et al., 2025):
- Description: Another advanced motion imitation approach.
- Training Data: Also typically trained on smaller datasets like LaFAN.
- Why Representative: Represents another strong deep RL baseline for motion imitation.
- GMT (Chen et al., 2025):
- Description: General Motion Tracking, a method known for robust motion tracking.
- Training Data: Trained on the AMASS dataset, a large public motion dataset.
- Why Representative: It is a high-performing baseline that leverages a substantial dataset, making it a strong contender for comparison against SONIC's scaled approach.
Comparison Protocol:
To ensure a fair comparison, all baselines (using officially released models where available, or publicly released code for retraining) are evaluated on the same unseen AMA dataset (a uniformly random subset) as SONIC. All evaluations for this comparison are performed in MuJoCo (Todorov et al., 2012), which is supported by all baseline implementations. A more relaxed termination criterion is used: imitation is considered unsuccessful only if the humanoid falls (root height deviation > 0.25 m).
These baselines are representative because they are recent, state-of-the-art deep RL-based motion tracking techniques, allowing the authors to highlight SONIC's advantages gained from its supersizing strategy and multimodal capabilities.
6. Results & Analysis
6.1. Core Results Analysis
The paper presents compelling evidence that scaling up model capacity, data, and compute significantly improves humanoid control, leading to a generalist controller capable of natural and robust movements.
6.1.1. Scaling Up Motion Tracking
The authors analyze the impact of scaling along three key dimensions: GPU hours (compute), model size (network capacity), and motion dataset size (data volume). All evaluations are conducted in Isaac Lab.
The following figure (Figure 2, parts a, b, c) illustrates the scaling trends:

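The image consists of several charts showing how success rate relates to data size, model size, and GPU hours. By comparing success and failure rates across models, it shows that performance improves as data and compute increase. The "Ours" model reaches a success rate of 99.6%.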
Figure 2 (a, b, c): Performance metrics as functions of GPU hours, model size, and dataset size. Lower is better for error metrics. (The caption of the original image includes (d-g) Comparing our method with baselines on tracking out-of-distribution motion sequences, but the image only clearly shows (a-c) for scaling up. The success rate for "Ours" is indicated as 99.6%.)
- Consistent Improvements: Scaling along any of these three axes leads to consistent improvements in motion imitation performance. This empirically validates the scaling hypothesis for humanoid control, mirroring trends observed in other foundation models.
- Impact of Dataset Size: Increasing the size of the motion dataset yields the most substantial gains. For dataset size, the paper compares training on LaFAN (0.4M frames), a subset of their in-house dataset (7.4M frames), and their full dataset (100M frames). This highlights the critical role of data volume and diversity in achieving robust and generalizable policies.
- Impact of Model Size and Compute: Increased model size (from 1.2M to 42M parameters) and compute (GPU hours) further enhance results. For GPU hours, the paper notes that parallelizing over more GPUs (e.g., 128 vs. 8) is essential, as training on fewer GPUs yields worse asymptotic performance, suggesting that sufficient compute is needed to fully leverage larger models and datasets.
- High Success Rate: The "Ours" model achieves a success rate of 99.6% (as indicated in the figure), demonstrating highly robust tracking.
6.1.2. Comparison with Baselines
SONIC is compared against state-of-the-art trackers: Any2Track, BeyondMimic, and GMT.
The following figure (Figure 2, parts d, e, f, g) illustrates the comparison with baselines:
(The provided image images/2.jpg only clearly shows (a-c) and a high-level success rate, but the text describes the comparison results.)
- Superior Performance: SONIC significantly outperforms all baselines across all evaluation metrics (success rate, MPJPE, acceleration differences, velocity differences).
- Generalization to Unseen Motions: This comparison is crucial because it evaluates the models on the same unseen AMA dataset, demonstrating that SONIC's approach yields a universal motion tracker capable of accurately tracking a wide range of out-of-distribution motions. The baselines, trained on smaller or different datasets (LaFAN or AMASS), show notably lower performance on this challenging test set.
6.1.3. Real-World Evaluation
- Robust Sim-to-Real Transfer: SONIC is deployed on a physical Unitree G1 humanoid robot to assess its real-world performance. It tracks 50 diverse motion trajectories, including dance, jumps, and loco-manipulation tasks.
- Zero-Shot Success: The policy achieves motion imitation in the real world that closely matches its simulation results, operating in a true zero-shot manner (without any real-world fine-tuning).
- 100% Success Rate: It succeeds on all 50 sequences without a single failure (a 100% success rate), underscoring the robustness and reliability of the tracker in challenging real-world scenarios.
6.1.4. Interactive Motion Control
SONIC's kinematic motion planner enables real-time interactive control for a variety of tasks.
The following figure (Figure 3) showcases diverse interactive control capabilities:

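The image is a schematic showing the robot in different motion states, including walking forward, walking backward, sidestepping, and running. It also shows various robot expressions and fighting poses, such as idle, left punch, and right punch, highlighting the use of rich motion-capture data for natural motion control.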
Figure 3: Top three rows: interactive navigation with in-betweening different velocities, directions, and styles. Bottom two rows: SONIC produces high-quality and responsive boxing motions while preserving the robot's complete freedom of movement throughout the task.
- Navigation Control: Supports arbitrary velocity (0.0 m/s to 6.0 m/s), direction (0 to 360 degrees), and even style commands (e.g., "drunken walking," "injured walking," "happy walking," "stealth walking"). The critically damped spring model smooths impractical commands.
- In-betweening Generation: The model robustly generates in-betweening motions, showcasing its flexibility and the potential for natural human-robot interaction.
- Interactive Entertainment (Boxing): Produces high-quality and responsive boxing motions, retaining the robot's full freedom of movement. Unlike existing academic or industrial solutions that often rely on a limited collection of clips and require switching between multiple expert models, SONIC offers fluid and natural transitions.
- Manipulation Skills: Extends to skills like squatting, kneeling, and crawling, allowing pelvis height control (0.3 m to 0.8 m) and omnidirectional crawling (0.0 m/s to 0.5 m/s). This enhances adaptability for teleoperation and navigation in confined environments.
The following figure (Figure 4) further illustrates interactive squatting, kneeling, and crawling:
The image is a schematic showing a variety of humanoid robot motions, including kneeling, crawling, and standing up. The upper part shows different motion sequences, and the lower part shows the real robot in poses such as one-leg kneeling, two-leg kneeling, and hand crawling. Overall it illustrates the robot's behavior across different postures.
Figure 4: Interactive squatting, kneeling and crawling. With SONIC, the robot can squat, kneel and crawl at arbitrary heights, enabling seamless application in real-world downstream scenarios such as teleoperation and navigation in complex environments.
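The navigation bullet above mentions a critically damped spring model for smoothing impractical commands. The sketch below shows the general technique only: it uses the exact closed-form step of a critically damped spring to smooth a scalar speed command. How SONIC applies the spring (which signals, which stiffness) is not stated here, so the function names, `omega`, `dt`, and the step-command example are all assumptions.

```python
import math


def critically_damped_step(value, rate, target, omega, dt):
    """Advance (value, rate) toward target by dt seconds.

    Uses the exact solution of x'' + 2*omega*x' + omega**2 * (x - target) = 0,
    i.e. a spring at critical damping, so the result never oscillates.
    """
    offset = value - target
    slope = rate + omega * offset
    decay = math.exp(-omega * dt)
    new_value = target + (offset + slope * dt) * decay
    new_rate = (rate - omega * slope * dt) * decay
    return new_value, new_rate


def smooth_command(raw_commands, omega=4.0, dt=0.02, initial=0.0):
    """Smooth a sequence of raw scalar commands (omega and dt are illustrative)."""
    value, rate = initial, 0.0
    smoothed = []
    for target in raw_commands:
        value, rate = critically_damped_step(value, rate, target, omega, dt)
        smoothed.append(value)
    return smoothed


# A user abruptly requests 6.0 m/s (the maximum speed quoted in the text)
# from standstill; the spring turns the step into a smooth ramp.
raw = [6.0] * 200  # 4 seconds at a hypothetical 50 Hz command rate
smoothed = smooth_command(raw)
```

Starting from rest, the smoothed command rises monotonically and approaches the target without overshoot, which is the property that makes critical damping attractive for turning abrupt user input into a trackable reference.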
6.1.5. Video Teleoperation and Multi-Modal Cross-Embodiment Control
SONIC demonstrates real-time, multimodal, cross-embodiment control through its universal control policy and unified motion generation system (based on GENMO).
The following figure (Figure 5) illustrates multimodal control:

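The image is a schematic showing different types of control, including video teleoperation, text control, and VR whole-body teleoperation. It shows the robot performing actions in response to commands, such as opening a door, crawling, and performing martial arts.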
Figure 5: Video teleoperation, multi-modal control, and VR whole-body teleoperation.
- Video Teleoperation: Supports real-time control from pre-recorded clips and live monocular webcam streams. Human motion is estimated at $\ge 60$ fps, executing unseen instructions in a zero-shot manner. An interactive graphical interface supports free-form prompting (e.g., "walk forward," "kick left foot," "dance like a monkey").
- Music Control: Generates dance motions conditioned on melodic and rhythmic structure, tracking tempo and adapting style to musical characteristics.
- Seamless Modality Transitions: The system supports smooth transitions between modalities (e.g., initiating fine-grained control via video, switching to text for general commands, then to music for performance).
6.1.6. VR-Based Teleoperation and Connecting to Foundation Models
SONIC also supports various VR-based teleoperation regimes and integration with VLA models.
- VR-Based Whole-Body Teleoperation: Uses a PICO VR headset and ankle trackers to stream full-body human pose estimates in SMPL format. This motion is then processed by the universal control policy for low-latency, stable, human-like control.
- VR-Based 3-Point Teleoperation: A lightweight, mobile setup using only the VR headset and two handheld controllers (no ankle trackers). It outputs a compact command (head and wrist poses, finger joint angles, waist height, locomotion mode, navigation command). This mode is used for scalable and portable data collection.
- Teleoperation Performance: For a mobile pick-and-place task, the pipeline exhibits a mean latency of 121.9 ms. The mean right-wrist position error is 6 cm (95th percentile: 13.3 cm), and mean orientation error is (95th percentile: ).
- Foundation-Model-Driven Mobile Bimanual Manipulation:
- Integration with the GR00T N1.5 VLA model: The VLA model is fine-tuned on 300 VR-teleoperated trajectories (collected via 3-point teleoperation) for an apple-to-plate mobile pick-and-place task.
- Autonomous Control: The VLA outputs teleoperation-format commands (upper-body poses, base height, navigation command), which are fed into SONIC's kinematic planner and hybrid encoder, then executed by the universal control policy.
- Success Rate: Achieves a 95% success rate over 20 trials on this proof-of-concept task. This demonstrates compatibility between foundation-model planning (high-level System 2 capabilities) and SONIC's robust System 1 control (reactive whole-body skills), establishing a powerful hierarchical control architecture.
The following figure (Figure 6) demonstrates the VLA integration:
The image is a schematic showing several steps of a humanoid robot manipulating objects, illustrating how it interacts with the environment while executing the task. The robot is commanded through a vision-language-action (VLA) model and demonstrates autonomous grasping and object manipulation, validating the model's practical utility.
Figure 6: Apple-to-plate mobile bimanual manipulation. The Unitree G1 humanoid robot is controlled by a fine-tuned GR00T N1.5 vision-language-action (VLA) model. The VLA emits teleoperation-format commands (poses of head and wrists, base height, and a navigation command), which are executed by SONIC. The model is fine-tuned on 300 VR-teleoperated trajectories and attains 95% success over 20 trials on this proof-of-concept task, demonstrating compatibility between foundation-model planning and our universal control policy.
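The data flow described in this subsection (VLA, then kinematic planner and hybrid encoder, then universal control policy) can be pictured with a small sketch. Everything here is hypothetical scaffolding for illustration: the `TeleopCommand` fields mirror the quantities named in the text, but the pose layout, the navigation tuple, the function signatures, and the way the planner output is handed to the encoder are assumptions, not SONIC's or GR00T's actual interfaces.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed pose layout: position (x, y, z) followed by quaternion (w, x, y, z).
Pose = Tuple[float, float, float, float, float, float, float]


@dataclass(frozen=True)
class TeleopCommand:
    """Teleoperation-format command, with fields named after the text."""
    head_pose: Pose
    left_wrist_pose: Pose
    right_wrist_pose: Pose
    base_height_m: float
    navigation: Tuple[float, float, float]  # assumed (vx, vy, yaw_rate)


def control_step(observation, vla, planner, encoder, policy):
    """One pass through the hierarchy, using caller-supplied callables."""
    command = vla(observation)                 # slow, deliberative layer
    planned_motion = planner(command)          # kinematic planner
    tokens = encoder(command, planned_motion)  # hybrid encoder
    return policy(tokens)                      # fast whole-body controller


# Stub components that only record the call order.
IDENTITY_POSE: Pose = (0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0)
trace = []


def fake_vla(observation):
    trace.append("vla")
    return TeleopCommand(IDENTITY_POSE, IDENTITY_POSE, IDENTITY_POSE, 0.7, (0.2, 0.0, 0.0))


def fake_planner(command):
    trace.append("planner")
    return {"pelvis_height_m": command.base_height_m}


def fake_encoder(command, planned_motion):
    trace.append("encoder")
    return (command.navigation, planned_motion["pelvis_height_m"])


def fake_policy(tokens):
    trace.append("policy")
    return {"tokens": tokens}


action = control_step({"image": None}, fake_vla, fake_planner, fake_encoder, fake_policy)
```

The point of the sketch is the separation of concerns: the VLA only has to produce the same compact command a human teleoperator would, so the low-level controller is unchanged whether a person or a model is in the loop.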
6.2. Data Presentation (Tables)
All tables from the original paper have been transcribed and explained in the Methodology section:
- Table 1: Reward design.
- Table 2: Domain randomization parameters.
- Table 3: Universal control policy hyperparameters.
- Table 4: Training hyperparameters.
6.3. Ablation Studies / Parameter Analysis
While the paper doesn't present explicit "ablation studies" in the traditional sense (e.g., removing a component and re-evaluating), the scaling up motion tracking experiments (Figure 2a-c) effectively serve as a form of parameter and architectural analysis:
- Impact of Model Capacity: By varying model size (1.2M to 42M parameters), the authors show that larger networks consistently lead to better motion imitation performance. This indicates that increased capacity allows the model to learn more complex and generalized motion priors.
- Impact of Data Volume: The comparison across dataset sizes (0.4M, 7.4M, 100M frames) directly demonstrates that increasing the amount of training data yields significant performance gains. This validates the core hypothesis that supersizing data is critical.
- Impact of Compute: The analysis of GPU hours (8, 32, 128 GPUs) shows that more compute allows for better asymptotic performance and faster convergence, suggesting that computational resources are necessary to fully exploit the benefits of larger models and datasets.
These analyses strongly support the paper's central argument that scale across these three dimensions is a primary driver of the observed improvements in natural and robust whole-body control. The consistent performance improvements with increased scale suggest that the underlying architecture and learning objective are well-suited for foundation-model-like behavior in robotics.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper "SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control" successfully demonstrates a paradigm shift in humanoid robot control. By casting motion tracking as a core, scalable task, the authors have built a generalist humanoid controller (SONIC) that produces natural and robust whole-body behaviors across diverse conditions. This was achieved by supersizing the training process along three axes: network size (up to 42M parameters), dataset volume (over 100M frames/700 hours of motion data), and compute (9k GPU hours).
Key findings include the empirical validation of scaling laws in humanoid control, where performance consistently improves with increased data, model capacity, and compute. The learned representations exhibit strong generalization to unseen motions in simulation and achieve robust zero-shot sim-to-real deployment on a Unitree G1 robot. Beyond the core motion tracker, the practical utility of SONIC is enhanced by a real-time universal kinematic planner for interactive control and a unified token space that seamlessly integrates heterogeneous input interfaces like VR teleoperation, human videos, text, music, and vision-language-action (VLA) models. This comprehensive system establishes motion tracking at scale as a practical foundation for advanced humanoid autonomy, offering a robust System 1 controller that can be layered with higher-level reasoning.
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose future research directions:
- Formal Treatment of Safety and Compliance: The current work does not formally address safety protocols or compliance mechanisms for physical interaction, which are crucial for real-world humanoid deployment.
- Energy Efficiency for Extended Deployments: The paper notes that energy efficiency for extended deployments is an area for improvement. Large models and compute-intensive processes can consume significant power.
- Combating Noisy Input: The system's performance might be affected by noisy input during deployments (e.g., errors in VR tracking or video estimation). Robustness to such noise needs further investigation.
- Scaling Laws Across Diverse Datasets: Future work will study scaling laws across even more diverse datasets to further understand the limits and potential of this approach.
- VLA-Instructed Whole-Body Loco-Manipulation: The current VLA integration is a proof-of-concept. Future work will focus on enabling more complex, VLA-instructed whole-body loco-manipulation tasks.
- Joint Training of Planner, Tokenizers, and Policy: To reduce modality gaps and further improve system coherence, the authors suggest exploring joint training of the kinematic planner, the encoders/quantizers (tokenizers), and the control policy. This could lead to a more end-to-end optimized system.
7.3. Personal Insights & Critique
This paper represents a highly significant step forward in humanoid robotics, particularly in its aggressive pursuit of scaling laws. The core insight that motion tracking can serve as a foundation task that bypasses reward engineering by leveraging dense supervision from MoCap data is elegant and powerful. The empirical demonstration of scaling benefits, consistent across compute, model size, and data, provides strong evidence for this approach.
Insights:
- Paradigm Shift: The paper offers a compelling vision for how foundation models can be brought to robotics, moving beyond task-specific controllers to truly generalist agents. This could dramatically accelerate the development of versatile humanoid robots.
- Practicality Focus: Beyond theoretical scaling, the emphasis on real-time deployment, sim-to-real transfer, and multimodal control (especially VR teleoperation and VLA integration) makes SONIC highly practical and immediately impactful. The unified token space is a particularly clever solution for integrating disparate input modalities into a single policy.
- Robustness from Scale: The 100% success rate on real-world zero-shot motions is astonishing and speaks volumes about the robustness gained from massive data and domain randomization.
- Hierarchical Control: The clear delineation between SONIC as a System 1 (fast, reactive, whole-body) controller and VLA models as System 2 (slower, deliberative reasoning) highlights a promising architecture for complex, autonomous robotics.
Critique/Areas for Improvement:
- Data Acquisition Cost: While MoCap data is abundant, acquiring 700 hours of high-quality, diverse, and retargetable motion data is a massive undertaking and likely very expensive. This data bottleneck could limit broader adoption or reproduction by smaller research groups. Future work might explore synthetic data generation or more efficient data collection methods.
- Sim-to-Real Gap for Complex Manipulation: While the sim-to-real transfer for locomotion and whole-body motions is impressive, the VLA integration for mobile manipulation is a "proof-of-concept." Scaling this to highly dexterous, nuanced manipulation tasks with contact forces and unpredictable object properties might reintroduce significant sim-to-real challenges not fully captured by domain randomization.
- Interpretability and Safety Guarantees: As models scale, their interpretability decreases. For safety-critical systems like humanoids, understanding why a robot makes a certain decision or fails is paramount. The paper acknowledges safety compliance as a limitation, and this will be a crucial area to address for real-world deployment.
- Computational Footprint: 9k GPU hours and 42M parameters are substantial. While TensorRT optimization helps with inference, the training demands are immense. Investigating model compression or more efficient architectures for deployment on less powerful hardware could be valuable.
- Environmental Impact: The energy consumption associated with such large-scale training is considerable. Research into carbon-efficient AI training will become increasingly important.
Overall, SONIC provides a solid blueprint for advancing humanoid control into the foundation model era. Its methodological rigor and impressive experimental results make it a landmark paper, setting a new bar for scale and generality in the field.