
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Published: 03/19/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

GR00T N1 is an open foundation model for humanoid robots, integrating a vision-language reasoning module and a motion-generation module. Trained end-to-end on a pyramid of heterogeneous data, it outperforms existing imitation learning methods on simulation benchmarks and demonstrates strong performance and high data efficiency on a real Fourier GR-1 humanoid.

Abstract

General-purpose robots need a versatile body and an intelligent mind. Recent advancements in humanoid robots have shown great promise as a hardware platform for building generalist autonomy in the human world. A robot foundation model, trained on massive and diverse data sources, is essential for enabling the robots to reason about novel situations, robustly handle real-world variability, and rapidly learn new tasks. To this end, we introduce GR00T N1, an open foundation model for humanoid robots. GR00T N1 is a Vision-Language-Action (VLA) model with a dual-system architecture. The vision-language module (System 2) interprets the environment through vision and language instructions. The subsequent diffusion transformer module (System 1) generates fluid motor actions in real time. Both modules are tightly coupled and jointly trained end-to-end. We train GR00T N1 with a heterogeneous mixture of real-robot trajectories, human videos, and synthetically generated datasets. We show that our generalist robot model GR00T N1 outperforms the state-of-the-art imitation learning baselines on standard simulation benchmarks across multiple robot embodiments. Furthermore, we deploy our model on the Fourier GR-1 humanoid robot for language-conditioned bimanual manipulation tasks, achieving strong performance with high data efficiency.


In-depth Reading


1. Bibliographic Information

1.1. Title

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

1.2. Authors

The paper is authored by a large, primarily NVIDIA research team, with extensive contributions across model training, simulation, real-robot infrastructure, data curation, and open-sourcing. Key roles include:

  • Research Leads: Linxi “Jim” Fan, Yuke Zhu
  • Core Contributors (selected): Scott Reed, Ruijie Zheng, Guanzhi Wang, Johan Bjorck, Joel Jang, Ao Zhang, Jing Wang, Yinzhen Xu, Fengyuan Hu, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Loic Magne, Zhiding Yu, Zhiqi Li; plus specialized teams for teleoperation, simulation, video generation, IDM training, and infrastructure (see Appendix A.1–A.3 of the paper for full rosters). Affiliation: Predominantly NVIDIA, with acknowledgments to external teams (e.g., 1X and Fourier) for hardware and support.

1.3. Journal/Conference

arXiv (preprint). arXiv is a widely used open-access repository for scientific preprints and is commonly used for disseminating cutting-edge AI/robotics research prior to peer review.

1.4. Publication Year

2025 (Published at UTC: 2025-03-18T21:06:21.000Z).

1.5. Abstract

The paper introduces GR00T N1, an open Vision-Language-Action (VLA) foundation model for generalist humanoid robots. It features a dual-system architecture:

  • System 2: a pre-trained Vision-Language Model (VLM) (Eagle-2) that interprets images and language instructions.
  • System 1: a Diffusion Transformer (DiT) trained via flow-matching that generates fluid motor actions at high frequency. Both modules are tightly coupled via cross-attention and trained end-to-end. Training uses a “data pyramid” that mixes real robot trajectories, human egocentric videos, and synthetically generated datasets (simulation via DexMimicGen and neural video trajectories). GR00T N1 outperforms state-of-the-art imitation-learning baselines (BC-Transformer and Diffusion Policy) across multiple simulation benchmarks and embodiments, and demonstrates strong performance on the Fourier GR-1 humanoid with language-conditioned bimanual manipulation, showing high data efficiency. The paper releases model checkpoints, training data, and simulation benchmarks.

2. Executive Summary

2.1. Background & Motivation

  • Core problem: Building general-purpose, humanoid robots that can robustly execute diverse tasks in the human world, with strong generalization and rapid adaptation from limited data.
  • Why it matters: Humanoid form factors are promising for human environments. However, robot foundation models need massive, diverse, embodied data to reason in novel situations and control complex bodies. Real-world data is scarce and expensive; embodiments and sensors vary widely (leading to “data islands”).
  • Entry point/innovative idea:
    1. A dual-system VLA model: System 2 (VLM reasoning at 10 Hz) coupled to System 1 (DiT action policy at ~120 Hz), trained end-to-end via cross-attention, enabling tight coordination between perception-language reasoning and motor control.
    2. A “data pyramid” strategy: unify heterogeneous sources—web-scale human videos, synthetic data (simulation and neural video generation), and real-robot trajectories—by converting action-less videos into a common latent action space (via VQ-VAE latent action pretraining (LAPA) and inverse dynamics model (IDM) pseudo-actions), allowing cross-embodiment pretraining and post-training.

2.2. Main Contributions / Findings

  • Contributions:
    1. Architecture: A compositional VLA design (Eagle-2 VLM + DiT with flow-matching) for cross-embodiment, closed-loop action generation; embodiment-specific encoders/decoders handle varying state/action dimensions.
    2. Training pipeline: Unified pretraining and post-training across a heterogeneous “data pyramid” by annotating actions in action-less videos via latent actions and IDM, and co-training with synthetic simulation and neural video trajectories.
    3. Open release: GR00T-N1-2B checkpoint (~2.2B parameters; 1.34B in the VLM), training data, and simulation benchmarks; the reported inference latency for a 16-step action chunk is 63.9 ms on an NVIDIA L40 using bf16.
  • Findings:
    • In simulation (RoboCasa, DexMimicGen cross-embodiment suite, GR-1 Tabletop), GR00T N1 consistently outperforms two strong baselines (BC-Transformer and Diffusion Policy), especially in humanoid GR-1 tasks.
    • On real GR-1 humanoid tasks, GR00T N1 achieves high success rates, surpassing Diffusion Policy substantially, even when trained on only 10% of the data (showing data efficiency).
    • Co-training with neural trajectories further boosts performance in both simulation and real-world tasks; latent actions (LAPA) are more beneficial in low-data regimes, while IDM labels shine as data grows.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Foundation model: A large-scale, general-purpose model trained on diverse data to provide transferable capabilities across tasks (e.g., large language models). In robotics, foundation models aim to encode visual-language understanding and control priors.
  • Vision-Language Model (VLM): A model that jointly processes images and text and produces a unified representation for reasoning. Here, Eagle-2 is the VLM backbone, composed of:
    • SmolLM2 (LLM): A compact large language model post-trained for multimodal tasks.
    • SigLIP-2 (image encoder): A multilingual, strong vision-language encoder producing dense features from images.
  • Diffusion Transformer (DiT): A transformer variant for generative modeling where denoising is conditioned via adaptive layer normalization; actions are generated by iterative denoising steps.
  • Flow matching (FM): A generative training objective where the model learns a vector field that moves noisy samples toward data samples. GR00T N1 uses FM to train action denoising.
  • Inverse Dynamics Model (IDM): A model that predicts actions given two states (e.g., current and future frames). Used here to label actions for videos that lack explicit action traces.
  • VQ-VAE (Vector-Quantized Variational Autoencoder): An autoencoder variant where continuous latent embeddings are discretized via a learned codebook; here used to learn “latent actions” (LAPA) from pairs of video frames.
  • Cross-attention: A mechanism in transformers where one sequence (queries) attends to another (keys/values). GR00T N1’s DiT uses cross-attention to condition action generation on vision-language tokens.
  • Transformer basics (for beginners):
    • Self-attention computes attention scores within a sequence to aggregate contextual information. The standard scaled dot-product attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Symbols: $Q$ (queries), $K$ (keys), $V$ (values) are linear projections of input embeddings; $d_k$ is the key dimensionality; softmax normalizes the scores.
    • Multi-head attention (MHA) uses multiple attention heads to capture diverse patterns; outputs are concatenated and projected.
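As a concrete reference, here is a minimal PyTorch sketch of scaled dot-product attention (generic transformer machinery, not GR00T N1's specific implementation; tensor shapes are illustrative):

```python
import torch

def scaled_dot_product_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    # Q, K, V: (batch, seq_len, d_k), produced by linear projections of the inputs.
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (batch, seq_q, seq_k) similarity scores
    weights = torch.softmax(scores, dim=-1)         # normalize over the key dimension
    return weights @ V                              # attention-weighted sum of values

# Cross-attention uses the same computation with Q from one sequence (e.g., action
# tokens in GR00T N1's DiT) and K, V from another (e.g., vision-language tokens).
```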

3.2. Previous Works

  • VLA models and robot foundation models:
    • RT-1/RT-2 (Brohan et al., 2022/2023): Language-conditioned policies with large-scale data; RT-2 transfers web knowledge to robotic control.
    • OpenVLA (Kim et al., 2024): Open-source VLA model; focuses on planning/control via language.
    • π₀ VLA flow model (Black et al., 2024): Flow models for robot control; introduces flow-matching for actions; GR00T N1 builds on similar FM principles but uses a simpler cross-attention coupling (versus mixture-of-experts).
    • Octo (Octo Model Team, 2024): Cross-embodiment generalist policy with embodiment-specific projectors; GR00T N1 similarly uses embodiment-aware encoders/decoders and additionally fine-tunes the VLM.
  • Data scaling and teleoperation:
    • Open X-Embodiment (2024), DROID (Khazatsky et al., 2024), BridgeData v2 (Walke et al., 2023): massive robot datasets enabling broad generalization.
    • Teleoperation platforms: ALOHA, RoboTurk, OpenTeach, Gello, TeleMoMa—provide high-quality human demonstrations.
  • Synthetic data generation:
    • MimicGen (Mandlekar et al., 2023) and DexMimicGen (Jiang et al., 2024): automated synthesis of demonstrations via segment transformation and replay in simulation.
    • Neural video generation (Wan Team, 2025; Agarwal et al., 2025; CogVideoX): used here to create “neural trajectories” that augment real data and cover counterfactual scenarios at scale.
  • Latent action learning:
    • LAPA (Ye et al., 2025): Latent Action Pretraining from Videos—learns action representations from video pairs; GR00T N1 uses a VQ-VAE variant for latent actions to annotate action-less data.

3.3. Technological Evolution

  • From task-specific policies and supervised imitation learning to large-scale VLA foundation models trained on heterogeneous corpora.
  • Growing emphasis on cross-embodiment generalization and multi-modal grounding (vision + language + action).
  • Rise of synthetic data pipelines in both simulation (MimicGen/DexMimicGen) and generative neural videos to overcome real-world data scarcity.

3.4. Differentiation Analysis

  • Architectural simplicity and tight coupling: GR00T N1 couples Eagle-2 VLM and DiT via cross-attention and end-to-end training, rather than heavier mixture-of-experts bridges.
  • Unified data pyramid: It integrates web-scale human videos, synthetic (simulation and neural) trajectories, and real demos by converting video-only data into unified latent action spaces (via LAPA/IDM), enabling a single model across embodiments.
  • Embodiment-aware projectors: Modular encoders/decoders map diverse state/action spaces into shared embeddings, supporting single-arm, bimanual, and humanoid embodiments.
  • Practical efficiency: High-frequency action generation and strong data efficiency in real-world deployment on GR-1 humanoids.

4. Methodology

4.1. Principles

  • Dual-system inspiration: Following “Thinking, Fast and Slow” (Kahneman, 2011), GR00T N1 adopts:
    • System 2 (slow, deliberative): the VLM (Eagle-2) encodes vision-language context at ~10 Hz to interpret scene/task goals.
    • System 1 (fast, reactive): the DiT policy generates closed-loop motor actions at ~120 Hz from noisy action chunks via flow-matching denoising, cross-attending to System 2 tokens.
  • Unified training: End-to-end coupling ensures the language-grounded visual reasoning informs real-time control, while the action module learns fluid, robust actuation across embodiments.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Inputs, Tokenization, and Embeddings

  • Observations:
    • Images: encoded by SigLIP-2 at resolution $224 \times 224$, followed by pixel shuffle (producing 64 image token embeddings per frame).
    • Text instructions: formatted in a “chat-like” prompt to the VLM (consistent with Eagle-2 training).
    • Robot state: proprioceptive state vector (e.g., joint positions/rotations, end-effector states), varies by embodiment.
    • Actions: represented as chunks $A_t = [a_t, a_{t+1}, \dots, a_{t+H-1}]$ with horizon $H = 16$ (chunked actions enable efficient denoising and temporal consistency).
  • Embodiment-aware encoders/decoders:
    • State encoder (MLP per embodiment): projects the varying-dimensional state $q_t$ into a shared embedding space.
    • Action encoder (MLP per embodiment): encodes diffusion timestep and the noised action vector (following Black et al., 2024).
    • Action decoder (MLP per embodiment): decodes the final DiT outputs to actual action vectors for the specific embodiment.
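A minimal sketch of such embodiment-aware projectors follows; the module layout, widths, and the way the flow-matching timestep is appended are illustrative assumptions (the paper specifies per-embodiment MLPs but not their exact shapes):

```python
import torch
import torch.nn as nn

class EmbodimentProjectors(nn.Module):
    """Per-embodiment MLP encoders/decoders mapping heterogeneous state and
    action spaces into and out of a shared model width (illustrative sketch)."""
    def __init__(self, dims: dict[str, tuple[int, int]], d_model: int = 1024):
        super().__init__()
        # dims maps an embodiment tag to its (state_dim, action_dim).
        self.state_enc = nn.ModuleDict(
            {k: nn.Linear(s, d_model) for k, (s, a) in dims.items()})
        # The action encoder also consumes the flow-matching timestep, appended
        # here as one extra scalar feature (an implementation assumption).
        self.action_enc = nn.ModuleDict(
            {k: nn.Linear(a + 1, d_model) for k, (s, a) in dims.items()})
        self.action_dec = nn.ModuleDict(
            {k: nn.Linear(d_model, a) for k, (s, a) in dims.items()})

    def encode(self, emb: str, q: torch.Tensor, noised_actions: torch.Tensor,
               tau: torch.Tensor):
        state_tok = self.state_enc[emb](q)                    # (B, d_model)
        tau_feat = tau.expand(*noised_actions.shape[:-1], 1)  # tau: (B, 1, 1) broadcast over the chunk
        action_tok = self.action_enc[emb](
            torch.cat([noised_actions, tau_feat], dim=-1))    # (B, H, d_model)
        return state_tok, action_tok

    def decode(self, emb: str, h: torch.Tensor) -> torch.Tensor:
        return self.action_dec[emb](h)                        # back to the embodiment's action_dim

# Hypothetical dimensions for two embodiments:
# proj = EmbodimentProjectors({"gr1": (44, 40), "franka": (9, 7)})
```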

4.2.2. Vision-Language Module (System 2)

  • Backbone: Eagle-2 VLM (Li et al., 2025), fine-tuned from SmolLM2 LLM and SigLIP-2 image encoder; aligned on broad vision-language tasks.

  • Processing:

    1. Encode input images to 64 tokens/frame and format language in chat template.
    2. The LLM fuses image and text tokens; the authors extract intermediate LLM layer representations for efficiency and performance (for GR00T-N1-2B, the 12th layer).
    3. Output: vision-language features $\phi_t$ of shape (batch size × sequence length × hidden dimension), fed to the DiT via cross-attention.
  • Empirical note: Middle-layer embeddings gave faster inference and better downstream success than final-layer embeddings.

    The following figure (Figure 2 from the original paper) shows the system overview:

    Figure 2: GR00T N1 Model Overview. The model is a Vision-Language-Action (VLA) model with a dual-system design. Image observations and language instructions are converted into a sequence of tokens processed by the Vision-Language Model (VLM) backbone. The VLM outputs, together with robot state and action encodings, are passed to the Diffusion Transformer module to generate motor actions.
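A hedged sketch of extracting middle-layer VLM features with Hugging Face Transformers; the checkpoint id, processor call, and image path are assumptions (the paper only states that GR00T-N1-2B reads the 12th LLM layer):

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

MODEL_ID = "nvidia/Eagle2-2B"  # illustrative checkpoint name, not confirmed by the paper
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True,
                                  torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

frame = Image.open("egocentric_frame.png")  # hypothetical observation image
inputs = processor(images=frame, text="pick the apple and place it in the basket",
                   return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the input embedding layer, so index 12 is the output of the
# 12th transformer layer: the vision-language tokens phi_t consumed by the DiT.
phi_t = out.hidden_states[12]  # (batch, seq_len, hidden_dim)
```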

4.2.3. Diffusion Transformer Module (System 1) with Flow-Matching

  • Architecture: A DiT variant (Peebles and Xie, 2023) with denoising-step conditioning via adaptive layer normalization, denoted $V_{\theta}$.

    • Self-attention blocks operate over the noised action token embeddings $A_t^{\tau}$ together with the state embeddings $q_t$.

    • Cross-attention blocks condition on the vision-language token embeddings $\phi_t$ output by the VLM.

    • The final DiT block is followed by the embodiment-specific Action Decoder to predict the action chunk.

      The following figure (Figure 3 from the original paper) shows the detailed architecture:

      Figure 3: GR00T N1 Model Architecture. GR00T N1 is trained on a diverse set of embodiments, ranging from single-arm robots to bimanual humanoids with dexterous hands. To handle each embodiment's distinct state observations and actions, the DiT blocks use embodiment-aware state and action encoders. The model leverages latent embeddings of the Eagle-2 VLM to incorporate the robot's visual observations and language instructions; these vision-language tokens are fed into the DiT blocks through cross-attention layers.

  • Flow-matching training:

    • Noising process: Given the ground-truth action chunk $A_t$, a flow-matching timestep $\tau \in [0,1]$, and sampled noise $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, define: $ A_t^{\tau} = \tau A_t + (1 - \tau)\epsilon $ Symbols: $A_t$ is the true action chunk for timesteps $t$ to $t+H-1$; $\epsilon$ is Gaussian noise; $\tau$ sets how close $A_t^{\tau}$ is to $A_t$ versus noise.
    • Objective: The model prediction $V_{\theta}(\phi_t, A_t^{\tau}, q_t)$ approximates the denoising vector field $\epsilon - A_t$ by minimizing: $ \mathcal{L}_{fm}(\theta) = \mathbb{E}_{\tau}\left[ \Vert V_{\theta}(\phi_t, A_t^{\tau}, q_t) - (\epsilon - A_t) \Vert^2 \right] $ Symbols: $\mathcal{L}_{fm}$ is the flow-matching loss; the expectation is over sampled $\tau$; $V_{\theta}$ is parameterized by $\theta$; the norm is the $L^2$ norm over the action-chunk tokens.
    • Timestep distribution: As in Black et al. (2024), the authors use $ p(\tau) = \mathrm{Beta}\left(\frac{s - \tau}{s};\, 1.5, 1\right), \quad s = 0.999 $ Symbols: $\mathrm{Beta}(\cdot;\, \alpha, \beta)$ denotes the Beta distribution with parameters $(1.5, 1)$; $s$ is a scalar shaping parameter; the argument $(s - \tau)/s$ modulates sampling near the endpoints.
  • Inference (denoising):

    • Initialize by sampling $A_t^{0} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
    • Iterative forward-Euler integration with $K$ steps (empirically $K = 4$ across embodiments): $ A_t^{\tau + 1/K} = A_t^{\tau} + \frac{1}{K} V_{\theta}(\phi_t, A_t^{\tau}, q_t) $ Symbols: $A_t^{\tau}$ is the current noisy action chunk; $V_{\theta}$ estimates the vector field; $\frac{1}{K}$ is the step size; $q_t$ is the state embedding; $\phi_t$ the vision-language context.
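Combining the objective and the sampler, here is a minimal PyTorch sketch; the `model` callable and tensor shapes are stand-ins, and the sign convention is normalized so that Euler integration moves noise toward actions, consistent with the interpolation $A_t^{\tau} = \tau A_t + (1-\tau)\epsilon$:

```python
import torch

def flow_matching_loss(model, phi_t, q_t, actions, s=0.999):
    """actions: (B, H, action_dim) ground-truth chunk; model predicts the
    vector field V_theta(phi_t, A_tau, q_t, tau)."""
    B = actions.shape[0]
    u = torch.distributions.Beta(1.5, 1.0).sample((B, 1, 1))  # (s - tau)/s ~ Beta(1.5, 1)
    tau = s * (1.0 - u)
    eps = torch.randn_like(actions)
    a_tau = tau * actions + (1.0 - tau) * eps        # noising: A_t^tau
    target = actions - eps                           # velocity d(A^tau)/d(tau) toward the data
    v_pred = model(phi_t, a_tau, q_t, tau)
    return ((v_pred - target) ** 2).mean()

@torch.no_grad()
def sample_action_chunk(model, phi_t, q_t, shape, K=4):
    """K forward-Euler steps from pure noise (K = 4 in the paper)."""
    a = torch.randn(shape)                           # A_t^0 ~ N(0, I)
    for k in range(K):
        tau = torch.full((shape[0], 1, 1), k / K)
        a = a + (1.0 / K) * model(phi_t, a, q_t, tau)  # A^{tau+1/K} = A^tau + V/K
    return a                                         # then decoded by the action decoder
```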

4.2.4. Chunked Action Modeling

  • Temporal horizon: The policy operates on action chunks $A_t = [a_t, a_{t+1}, \dots, a_{t+H-1}]$ with $H = 16$. Chunking stabilizes training and permits multi-step generation per inference pass while keeping computational cost modest (inspired by Zhao et al., 2023).

4.2.5. Latent Actions (LAPA) from Videos

  • Problem: Many web-scale human egocentric videos and neural trajectories lack robot action labels.
  • Solution: Train a VQ-VAE to learn latent actions $z_t$ from frame pairs $(x_t, x_{t+H})$:
    • The encoder takes $(x_t, x_{t+H})$ and outputs a continuous embedding (pre-quantization); a codebook maps it to the nearest discrete vector (the classic VQ-VAE objective).

    • The decoder takes $(z_t, x_t)$ and reconstructs $x_{t+H}$, forcing $z_t$ to encode the motion dynamics (an inverse-dynamics signal).

    • After training, the encoder serves as a learned inverse-dynamics estimator: feed $(x_t, x_{t+H})$, extract the continuous pre-quantized embedding, and treat it as a latent action label during pretraining. This defines a distinct “LAPA embodiment.”

    • Benefit: Unifies heterogeneous video sources (robot/human) into a common learned latent action space, aiding cross-embodiment generalization.

      The following figure (Figure 4 from the original paper) qualitatively illustrates latent actions across embodiments:

      The panels depict GR00T N1's bimanual manipulation across diverse settings (e.g., handling a red box, kitchen scenes, cutting and cleaning tasks), spanning multiple embodiments and scenarios.
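A compact sketch of this latent-action VQ-VAE, operating on pooled frame features; the feature dimensions, codebook size, and loss weights are assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionVQVAE(nn.Module):
    """Encode (x_t, x_{t+H}) into a quantized latent action z_t; decode
    (z_t, x_t) to reconstruct x_{t+H}. Inputs are pooled frame features."""
    def __init__(self, feat_dim=512, latent_dim=32, codebook_size=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(2 * feat_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim + feat_dim, 256), nn.ReLU(),
                                     nn.Linear(256, feat_dim))

    def forward(self, f_t, f_future):
        z_e = self.encoder(torch.cat([f_t, f_future], dim=-1))  # continuous latent action
        codes = torch.cdist(z_e, self.codebook.weight).argmin(dim=-1)
        z_q = self.codebook(codes)                               # nearest codebook vector
        z_st = z_e + (z_q - z_e).detach()                        # straight-through gradient
        recon = self.decoder(torch.cat([z_st, f_t], dim=-1))     # predict future features
        loss = (F.mse_loss(recon, f_future)                      # reconstruction
                + F.mse_loss(z_q, z_e.detach())                  # codebook update
                + 0.25 * F.mse_loss(z_e, z_q.detach()))          # commitment
        return z_e, loss
```

At annotation time only the encoder is used: the continuous pre-quantized embedding `z_e` serves as the pseudo-action label for action-less video.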

4.2.6. Neural Trajectory Generation

  • Motivation: Real robot data scales linearly with human labor; generating counterfactual trajectories via video models is far cheaper.

  • Approach:

    1. Fine-tune open-source image-to-video models (e.g., WAN2.1-I2V-14B via LoRA) on a subset of GR-1 teleoperation trajectories (3,000 samples; 81 frames; 480p).
    2. Use a multimodal LLM to detect objects and synthesize diverse language prompts for feasible pick-and-place variants (“pick up {object} from {A} to {B}”), expanding instructional diversity.
    3. Post-process: filter generations that do not follow the instruction using an LLM judge over downsampled frames; re-caption if needed.
    4. Label actions: apply latent actions and/or IDM-based pseudo-actions to these videos, allowing joint training with robot data.
  • Scale: ~82 hours of neural videos were generated (about one minute of generation per 10-second video on an L40; ~105k L40 GPU hours across 3,600 L40 GPUs).

    The following figure (Figure 6 in the original paper) illustrates synthetically generated videos:

    Figure 6: Synthetically Generated Videos. We leverage off-the-shelf video generation models to create neural trajectories, increasing the quantity and diversity of our training datasets. The generated data can be used for both pre- and post-training of GR00T N1. The first three rows are generated from the same initial frame but with different objects to pick up; the next row showcases the video model generating a robot trajectory that would be very challenging to produce in simulation (spilling contents inside a mesh cup into a pan); the last row is generated from an initial frame taken from simulation data. Red rectangles indicate the initial frames.
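For the prompt-synthesis step, a toy illustration of the "pick up {object} from {A} to {B}" expansion; the object and receptacle lists are hypothetical stand-ins for the multimodal LLM's detections:

```python
import itertools

objects = ["apple", "cucumber", "mesh cup"]                    # hypothetical detections
places = ["cutting board", "placemat", "tray", "basket", "pan"]

prompts = [
    f"pick up the {obj} from the {src} and place it on the {dst}"
    for obj, (src, dst) in itertools.product(objects, itertools.permutations(places, 2))
]
# Each prompt seeds one image-to-video generation; an LLM judge later filters
# clips whose motion does not follow the instruction.
```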

4.2.7. Simulation Trajectories via DexMimicGen

  • Pipeline:

    • Collect a small set of human demos via teleoperation (e.g., Leap Motion for wrist and finger tracking; retarget via whole-body IK).
    • Segment demos into object-centric subtasks; transform segments to novel environments; interpolate movements to ensure smooth execution; verify task success; keep successful demos.
    • Mass generation: 10,000 new demos per (source, target) receptacle pair; 540k total demonstrations (pretraining regime). Overall: 780,000 simulation trajectories generated (≈6,500 hours; produced in ~11 hours).
  • Environments: RoboCasa tasks (e.g., kitchen manipulation), DexMimicGen cross-embodiment suites including humanoid dexterous tasks and bimanual arms.

    The following figure (Figure 7 in the original paper) shows simulation tasks:

    Figure 7: Simulation Tasks. Our simulation experiments use tasks from two open-source benchmarks (RoboCasa (Nasiriany et al., 2024), top row; DexMimicGen (Jiang et al., 2024), middle row) and a newly developed suite of tabletop manipulation tasks that closely resemble our real-world tasks (bottom row). Omniverse renderings of the tasks are shown above.

4.2.8. Training Strategy (Pre-training and Post-training)

  • Pre-training:
    • Objective: Flow-matching loss Lfm\mathcal{L}_{fm} (as above).
    • Data mixture: Real robot datasets (GR-1 teleop, Open X-Embodiment, AgiBot-Alpha), synthetic simulation (DexMimicGen, RoboCasa), human video datasets (Ego4D, Ego-Exo4D, Assembly-101, EPIC-KITCHENS, HOI4D, HoloAssist, RH20T-Human), and neural-generated trajectories.
    • Labeling strategy:
      • Human videos: use learned latent actions as targets (LAPA).
      • Robot datasets: use ground-truth actions and/or latent actions as targets.
      • Neural trajectories: use latent actions and/or IDM-based pseudo-actions derived from real robot data-trained IDMs.
    • Auxiliary loss: An object-detection auxiliary loss $L_{det}$ using OWL-v2 bounding boxes improves spatial understanding. For each frame, compute the normalized center coordinates of the target object $\mathbf{x}_{gt}$ and minimize $L_{det} = \| \mathbf{x}_{pred} - \mathbf{x}_{gt} \|^2$. Total loss: $\mathcal{L} = \mathcal{L}_{fm} + L_{det}$.
  • Post-training:
    • Fine-tune per single embodiment to adapt for target tasks; keep the language component of the VLM frozen (the text tokenizer stays frozen; vision encoder can be tuned depending on compute).
    • Co-train on real trajectories and neural trajectories at a 1:1 sampling ratio (see the sampling sketch after the figures below).
    • Low-data regimes: Train IDM on limited action data; use IDM pseudo-labels for neural videos (especially for GR-1 humanoid tasks, which are challenging).
  • Infrastructure:
    • Managed by NVIDIA OSMO; up to 1,024 H100 GPUs; pretraining consumed ~50,000 H100 GPU hours; multi-node fault tolerance via Ray.

    • Fine-tuning tested on single A6000 GPU; large batch sizes possible when only tuning adapters and DiT.

      The following figure (Figure 5 in the original paper) depicts teleoperation-based data collection:

      Figure 5: Data Collection via Teleoperation. Our teleoperation infrastructure supports multiple devices to capture human hand motion, including 6-DoF wrist poses and hand skeletons. Robot actions are produced through retargeting and executed on robots in real and simulated environments.
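A minimal sketch of the 1:1 co-training sampling mentioned above; the dataset classes are placeholders:

```python
import random
from torch.utils.data import Dataset

class CoTrainingDataset(Dataset):
    """Alternates real and neural trajectories so batches are sampled 1:1
    (illustrative; real_ds and neural_ds are placeholder Dataset objects)."""
    def __init__(self, real_ds: Dataset, neural_ds: Dataset):
        self.real_ds, self.neural_ds = real_ds, neural_ds

    def __len__(self) -> int:
        return 2 * max(len(self.real_ds), len(self.neural_ds))

    def __getitem__(self, i: int):
        # Even indices draw a real sample, odd indices a neural one.
        ds = self.real_ds if i % 2 == 0 else self.neural_ds
        return ds[random.randrange(len(ds))]
```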

The following figure (Figure 1 in the original paper) illustrates the data pyramid concept:

Figure 1: Data Pyramid for Robot Foundation Model Training. GR00T N1's heterogeneous training corpora can be represented as a pyramid: moving from the bottom to the top, data quantity decreases while embodiment-specificity increases, from web data and human videos (e.g., Common Crawl, Wikipedia) at the base, through synthetic data, to real-world robot data at the top.

5. Experimental Setup

5.1. Datasets

  • Real-robot datasets:
    1. GR00T N1 Humanoid Pre-Training (Fourier GR-1 via teleoperation):
      • Devices: VIVE trackers (wrists), Xsens Metagloves (fingers); explored Vision Pro and Leap Motion.
      • Retarget to humanoid via inverse kinematics; control at 20 Hz; head-mounted egocentric camera.
      • Hierarchical annotations: fine-grained (grasp, move, place) and coarse-grained sequences.
    2. Open X-Embodiment (2024): RT-1, Bridge-v2, Language Table, DROID, MUTEX, RoboSet, Plex.
    3. AgiBot-Alpha (2025): 140,000 trajectories from 100 robots (fine manipulation, tools, multi-robot collaboration).
  • Synthetic datasets:
    • Simulation (RoboCasa; DexMimicGen): humanoid bimanual dexterous tasks, Franka arm tasks, and GR-1 tabletop rearrangements (various receptacles and objects, distractors, novel (source,target) pairs).
    • Neural trajectories: ~82 hours of video generations (filtered, re-captioned); actions labeled via LAPA/IDM.
  • Human video datasets:
    • Ego4D, Ego-Exo4D, Assembly-101, EPIC-KITCHENS, HOI4D, HoloAssist, RH20T-Human—egocentric recordings of task-oriented human-object interactions.

      Example qualitative samples (Figure 14 in the original paper) from human video datasets:

      Figure 14: Human Egocentric Video Dataset Samples. We use seven human video datasets for pre-training. The images above show an example from each of the seven datasets with its corresponding language annotation.

5.2. Evaluation Metrics

  • Success rate (simulation and real robot):
    1. Conceptual definition: Proportion of trials where the policy completes the task within constraints (time, correctness).
    2. Mathematical formula: $ \mathrm{Success\ Rate} = \frac{\text{Number of successful trials}}{\text{Total number of trials}} $
    3. Symbol explanation: Numerator counts completed tasks; denominator is total attempts for that task/config; success criteria follow benchmark protocols.
  • Real-world partial scoring (for Machinery Packing): Count how many out of 5 parts/tools are placed into the bin within 30 seconds; report fractional success accordingly. Conceptually, this is a per-trial completion fraction; one can average across trials to get mean success.
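As a worked illustration of this partial-scoring scheme (the trial counts below are made up):

```python
def packing_score(parts_placed: int, total_parts: int = 5) -> float:
    """Per-trial completion fraction for Machinery Packing (30-second limit)."""
    return parts_placed / total_parts

trials = [3, 5, 4, 2]  # hypothetical parts placed in four trials
mean_success = sum(packing_score(p) for p in trials) / len(trials)
print(f"mean partial success: {mean_success:.2f}")  # 0.70
```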

5.3. Baselines

  • BC-Transformer (Mandlekar et al., 2021): Transformer-based behavior cloning policy (RoboMimic); processes observation sequences; Gaussian Mixture Model (GMM) for action distributions; inputs: 10 observation frames; outputs: next 10 actions.
  • Diffusion Policy (Chi et al., 2024): Action diffusion via a U-Net; removes noise progressively; conditioned on observation sequences; typically consumes a single observation frame and outputs 16 action steps per pass.

5.4. Protocols

  • Simulation benchmarks:
    • RoboCasa Kitchen: 24 atomic tasks (pick-and-place, door operations, faucets, buttons), with Franka arm demos (MimicGen-generated, 3000 demos/task). Observations: RGB images (left, right, wrist). State: end-effector/base poses, gripper state. Actions: relative EE pos/rot + gripper state. Follow Nasiriany et al. (2024) protocols.
    • DexMimicGen cross-embodiment suite: 9 tasks across three embodiments (bimanual Panda with parallel-jaw grippers; bimanual Panda with dexterous hands; GR-1 humanoid with dexterous hands). 1000 demos per task; eval generalization to novel object configs.
    • GR-1 Tabletop tasks: 24 tasks focusing on dexterous hand control; 18 rearrangements (novel combinations unseen in pretraining), 6 articulated placements/closures (cabinets, drawers, microwaves). Observations: egocentric head camera. Actions: joint positions/rotations of arms/hands, waist, neck; optionally end-effector-based actions for whole-body IK controller. 1000 demos per task via DexMimicGen.
  • Real-world benchmarks:
    • Categories: Pick-and-Place (seen/unseen objects), Articulated Object Manipulation (wooden chest, dark cabinet, white drawer), Industrial (machinery packing, mesh cup pouring, cylinder handover), Multi-Agent Coordination (handover sequences).
    • Data-limited: Evaluate with 10% of dataset vs. full dataset; 10 trials per task (except Machinery Packing with partial scoring).

6. Results & Analysis

6.1. Core Results Analysis

  • Simulation (using 100 demos/task):
    • GR00T-N1-2B surpasses BC-Transformer and Diffusion Policy across RoboCasa, DexMimicGen suite, and GR-1 Tabletop. The largest margin is on GR-1 tasks (+17% over Diffusion Policy).
    • This validates the benefits of: (i) dual-system VLA coupling; (ii) flow-matched DiT action generation; (iii) cross-embodiment pretraining with heterogeneous data (including latent actions).
  • Real robot (GR-1):
    • GR00T-N1-2B beats Diffusion Policy by +32.4% (10% data) and +30.4% (full data) averaged across categories.
    • Remarkably, GR00T-N1-2B trained on 10% of data is only 3.8% lower than Diffusion Policy trained on full data—strong evidence of data efficiency.
  • Co-training with neural trajectories:
    • RoboCasa: +4.2%, +8.8%, +6.8% average gains in 30/100/300 regimes.
    • Real GR-1: +5.8% average gain over 8 tasks in low-data setting.
    • LAPA vs. IDM labels: LAPA slightly better in very low-data (30 demos); IDM pulls ahead as data grows (100/300), as pseudo-actions align more closely with real actions.

6.2. Data Presentation (Tables)

The following are the results from Table 2 of the original paper:

| Method | RoboCasa | DexMG | GR-1 | Average |
| --- | --- | --- | --- | --- |
| BC Transformer | 26.3% | 53.9% | 16.1% | 26.4% |
| Diffusion Policy | 25.6% | 56.1% | 32.7% | 33.4% |
| GR00T-N1-2B | 32.1% | 66.5% | 50.0% | 45.0% |

The following are the results from Table 3 of the original paper:

| Method | Pick-and-Place | Articulated | Industrial | Coordination | Average |
| --- | --- | --- | --- | --- | --- |
| Diffusion Policy (10% Data) | 3.0% | 14.3% | 6.7% | 27.5% | 10.2% |
| Diffusion Policy (Full Data) | 36.0% | 38.6% | 61.0% | 62.5% | 46.4% |
| GR00T-N1-2B (10% Data) | 35.0% | 62.0% | 31.0% | 50.0% | 42.6% |
| GR00T-N1-2B (Full Data) | 82.0% | 70.9% | 70.0% | 82.5% | 76.8% |

The following are selected per-task results from Table 4 of the original paper (success rates in %, for Diffusion Policy (DP) and GR00T-N1-2B trained with 30/100/300 demos per task); refer to the original paper for the complete table:

RoboCasa Kitchen (24 tasks, PnP = Pick-and-Place):

| Task | DP 30 | DP 100 | DP 300 | GR00T 30 | GR00T 100 | GR00T 300 |
| --- | --- | --- | --- | --- | --- | --- |
| Close Drawer | 57.5 | 88.2 | 72.6 | 76.9 | 96.1 | 83.3 |
| Coffee Press Button | 32.5 | 46.1 | 91.2 | 27.8 | 56.9 | 85.3 |
| Coffee Serve Mug | 6.7 | 28.4 | 66.7 | 3.7 | 34.3 | 72.6 |
| Open Double Door | 0.0 | 9.8 | 61.8 | 0.0 | 12.8 | 14.7 |
| PnP from Cab to Counter | 2.5 | 4.9 | 10.8 | 0.9 | 3.9 | 19.6 |
| PnP from Counter to Microwave | 0.0 | 2.0 | 8.8 | 0.0 | 0.0 | 12.8 |
| PnP from Counter to Sink | 0.0 | 0.0 | 13.7 | 0.0 | 1.0 | 9.8 |
| PnP from Counter to Stove | 0.0 | 1.0 | 17.7 | 0.0 | 0.0 | 23.5 |
| PnP from Microwave to Counter | 0.0 | 2.0 | 11.8 | 0.0 | 0.0 | 15.7 |
| PnP from Sink to Counter | 4.2 | 8.8 | 42.2 | 0.0 | 5.9 | 33.3 |
| PnP from Stove to Counter | 1.7 | 2.9 | 23.5 | 0.0 | 0.0 | 29.4 |
| Turn Off Microwave | 63.3 | 53.9 | 52.0 | 47.2 | 57.8 | 70.6 |

DexMimicGen Cross-Embodiment Suite (9 tasks):

| Task | DP 30 | DP 100 | DP 300 | GR00T 30 | GR00T 100 | GR00T 300 |
| --- | --- | --- | --- | --- | --- | --- |
| Can Sort | 82.8 | 93.1 | 99.4 | 94.8 | 98.0 | 98.0 |
| Coffee | 35.5 | 68.1 | 79.7 | 44.9 | 79.4 | 73.5 |
| Pouring | 37.0 | 62.3 | 68.8 | 54.4 | 71.6 | 87.3 |
| Threading | 4.2 | 18.3 | 27.5 | 3.9 | 37.3 | 60.8 |
| Three Piece Assembly | 10.0 | 32.5 | 63.3 | 10.8 | 43.1 | 69.6 |
| Transport | 7.5 | 25.0 | 53.3 | 7.8 | 48.0 | 61.8 |
| DexMG Average | 23.7 | 46.9 | 68.4 | 29.6 | 58.5 | 74.2 |

GR-1 Tabletop (24 tasks, selection):

| Task | DP 30 | DP 100 | DP 300 | GR00T 30 | GR00T 100 | GR00T 300 |
| --- | --- | --- | --- | --- | --- | --- |
| Cutting Board to Tiered Basket | 13.7 | 13.7 | 18.6 | 13.7 | 23.5 | 34.3 |
| Cutting Board to Pan | 28.4 | 48.0 | 57.8 | 67.7 | 65.7 | 68.6 |
| Cutting Board to Bowl | 11.8 | 15.7 | 22.6 | 31.4 | 30.4 | 33.3 |
| Cutting Board to Cardboard Box | 14.7 | 18.6 | 23.5 | 31.4 | 39.2 | 39.2 |
| Plate to Pan | 13.7 | 17.7 | 35.3 | 35.3 | 48.0 | 52.9 |
| Plate to Cardboard Box | 12.8 | 13.7 | 27.5 | 34.3 | 38.2 | 32.4 |
| Plate to Bowl | 15.7 | 18.6 | 31.4 | 41.2 | 42.2 | 34.3 |
| Plate to Plate | 25.5 | 39.2 | 61.8 | 72.6 | 85.3 | 68.6 |
| Tray to Tiered Shelf | 2.0 | 6.9 | 15.7 | 17.7 | 27.5 | 14.7 |
| Tray to Tiered Basket | 12.8 | 34.3 | 39.2 | 33.3 | 49.0 | 45.1 |
| Tray to Plate | 26.5 | 41.2 | 49.0 | 53.9 | 68.6 | 62.8 |

The following are the results from Table 5 of the original paper:

| Task | Diffusion Policy (10% Data) | Diffusion Policy (Full Data) | GR00T-N1-2B (10% Data) | GR00T-N1-2B (Full Data) |
| --- | --- | --- | --- | --- |
| Tray to Plate (seen) | 0.0% | 20.0% | 40.0% | 100.0% |
| Cutting Board to Basket (seen) | 0.0% | 30.0% | 10.0% | 100.0% |
| Cutting Board to Pan (seen) | 0.0% | 60.0% | 60.0% | 80.0% |
| Plate to Bowl (seen) | 0.0% | 40.0% | 30.0% | 100.0% |
| Placemat to Basket (seen) | 10.0% | 60.0% | 40.0% | 80.0% |
| Pick-and-Place Seen Object Average | 2.0% | 42.0% | 36.0% | 92.0% |
| Tray to Plate (unseen) | 0.0% | 20.0% | 30.0% | 80.0% |
| Cutting Board to Basket (unseen) | 10.0% | 20.0% | 60.0% | 60.0% |
| Cutting Board to Pan (unseen) | 0.0% | 40.0% | 40.0% | 80.0% |
| Plate to Bowl (unseen) | 0.0% | 20.0% | 10.0% | 40.0% |
| Placemat to Basket (unseen) | 10.0% | 50.0% | 30.0% | 100.0% |
| Pick-and-Place Unseen Object Average | 4.0% | 30.0% | 34.0% | 72.0% |
| Pick-and-Place Average | 3.0% | 36.0% | 35.0% | 82.0% |
| White Drawer | 6.6% | 36.4% | 26.4% | 79.9% |
| Dark Cabinet | 0.0% | 46.2% | 86.6% | 69.7% |
| Wooden Chest | 36.4% | 33.2% | 72.9% | 63.2% |
| Articulated Average | 14.3% | 38.6% | 62.0% | 70.9% |
| Machinery Packing | 20.0% | 44.0% | 8.0% | 56.0% |
| Mesh Cup Pouring | 0.0% | 62.5% | 65.0% | 67.5% |
| Cylinder Handover | 0.0% | 76.5% | 20.0% | 86.6% |
| Industrial Average | 6.7% | 61.0% | 31.0% | 70.0% |
| Coordination Part 1 | 45.0% | 65.0% | 70.0% | 80.0% |
| Coordination Part 2 | 10.0% | 60.0% | 30.0% | 85.0% |
| Coordination Average | 27.5% | 62.5% | 50.0% | 82.5% |
| Average | 10.2% | 46.4% | 42.6% | 76.8% |

The following are the results from Table 6 of the original paper:

| Hyperparameter | Pre-training Value | Post-training Value |
| --- | --- | --- |
| Learning rate | 1e-4 | 1e-4 |
| Optimizer | AdamW | AdamW |
| Adam beta1 | 0.95 | 0.95 |
| Adam beta2 | 0.999 | 0.999 |
| Adam epsilon | 1e-8 | 1e-8 |
| Weight decay | 1e-5 | 1e-5 |
| LR scheduler | cosine | cosine |
| Warmup ratio | 0.05 | 0.05 |
| Batch size | 16,384 | 128 or 1024 |
| Gradient steps | 200,000 | 20,000 to 60,000 |
| Backbone's vision encoder | unfrozen | unfrozen |
| Backbone's text tokenizer | frozen | frozen |
| DiT | unfrozen | unfrozen |

The following are the results from Table 7 of the original paper:

| Dataset | Length (Frames) | Duration (hr) | FPS | Camera View | Category |
| --- | --- | --- | --- | --- | --- |
| GR-1 Teleop Pre-Training | 6.4M | 88.4 | 20 | Egocentric | Real robot |
| DROID (OXE) | 23.1M | 428.3 | 15 | Left, right, wrist | Real robot |
| RT-1 (OXE) | 3.7M | 338.4 | 3 | Egocentric | Real robot |
| Language Table (OXE) | 7.0M | 195.7 | 10 | Front-facing | Real robot |
| Bridge-v2 (OXE) | 2.0M | 111.1 | 5 | Shoulder, left, right, wrist | Real robot |
| MUTEX (OXE) | 362K | 5.0 | 20 | Wrist | Real robot |
| Plex (OXE) | 77K | 1.1 | 20 | Wrist | Real robot |
| RoboSet (OXE) | 1.4M | 78.9 | 5 | Left, right, wrist | Real robot |
| Agibot-Alpha | 213.8M | 1,979.4 | 30 | Egocentric, left, right | Real robot |
| RH20T-Robot | 4.5M | 62.5 | 20 | Wrist | Real robot |
| Ego4D | 154.4M | 2,144.7 | 20 | Egocentric | Human |
| Ego-Exo4D | 8.9M | 123.0 | 30 | Egocentric | Human |
| Assembly-101 | 1.4M | 19.3 | 20 | Egocentric | Human |
| HOI4D | 892K | 12.4 | 20 | Egocentric | Human |
| HoloAssist | 12.2M | 169.6 | 20 | Egocentric | Human |
| RH20T-Human | 1.2M | 16.3 | 20 | Egocentric | Human |
| EPIC-KITCHENS | 2.3M | 31.7 | 20 | Egocentric | Human |
| GR-1 Simulation Pre-Training | 125.5M | 1,742.6 | 20 | Egocentric | Simulation |
| GR-1 Neural Videos | 23.8M | 827.3 | 8 | Egocentric | Neural-generated |
| Total robot data | 262.3M | 3,288.8 | | | |
| Total human data | 181.3M | 2,517.0 | | | |
| Total simulation data | 125.5M | 1,742.6 | | | |
| Total neural data | 23.8M | 827.3 | | | |
| Total | 592.9M | 8,375.7 | | | |

6.3. Ablation Studies / Parameter Analysis

  • Neural trajectories co-training ablation (Figure 9): Demonstrated consistent gains from co-training with neural trajectories across data regimes in both simulation and real deployment.
  • LAPA vs. IDM labels:
    • Low-data (30 demos): LAPA slightly outperforms IDM.

    • More data (100/300 demos): IDM gains more, likely due to improved alignment of pseudo-actions with real-world control as IDM sees more data.

      The following figure (Figure 9 in the original paper) summarizes these ablations:

      Figure 9: Neural Trajectories Ablations. In RoboCasa simulation, neural trajectories are used for post-training across 3 data regimes (30, 100, and 300 demos per task). In the real world, results are shown only for the low-data regime (10% of the demonstrations). Co-training uses 3k neural trajectories per task for RoboCasa and 10 neural trajectories per task for real-world tasks. Both latent- and IDM-labeled actions are explored in simulation; only IDM-labeled actions are used on the real robot. The chart compares success rates of GR00T-N1-2B against Diffusion Policy and the different label combinations across these regimes.

6.4. Qualitative Analyses

  • Pretraining checkpoint behavior (Figure 11): The model can perform two-handed handovers with jerkier motions, implying learned coordination priors.

    Figure 11: Pre-training Qualitative Example. When the pretrained GR00T-N1-2B model is prompted with a post-training task instruction, the robot picks up the apple and places it into the basket via a two-handed handover, albeit with jerkier motion.

  • Post-training comparisons (Figure 12): GR00T N1 exhibits smoother motions and more accurate grasps than Diffusion Policy, resulting in higher success in tasks like “Placemat to Basket” and “Cutting Board to Pan.”

    Figure 12: Post-training Qualitative Example. (Top) Post-trained GR00T-N1-2B successfully places the cucumber into the basket, whereas the Diffusion Policy fails due to an inaccurate grasp. (Bottom) The post-trained model successfully places the object into the pan, while the Diffusion Policy remains stuck.

7. Conclusion & Reflections

7.1. Conclusion Summary

  • GR00T N1 is an open VLA foundation model for generalist humanoid robotics with a dual-system architecture (VLM reasoning + DiT action generation) trained end-to-end.
  • A unified “data pyramid” strategy integrates real robot data, simulation trajectories, neural generative videos, and human egocentric videos by converting action-less videos into latent/pseudo action labels.
  • Across simulation and real GR-1 humanoid tasks, GR00T N1 outperforms strong baselines (BC-Transformer, Diffusion Policy), is highly data-efficient, and benefits further from co-training with neural trajectories.

7.2. Limitations & Future Work

  • Current focus: Short-horizon tabletop manipulation; long-horizon loco-manipulation is future work (requires hardware, architecture, and data advances).
  • Synthetic data quality: Neural/simulation generation can struggle with physics fidelity and diversity; improving generators and automated synthesis systems is an active direction.
  • VLM advancements: Stronger spatial reasoning, language understanding, and cross-modal grounding are expected to further boost performance.

7.3. Personal Insights & Critique

  • Strengths:
    • Elegant, tightly coupled architecture with practical throughput.
    • Holistic data strategy to break “data islands,” making web-scale human videos usable for robot control via latent actions and IDM.
    • Clear empirical wins with rigorous benchmarks and public release, accelerating community progress.
  • Potential issues:
    • Latent action transfer relies on learned inverse dynamics between frames—if scene dynamics or camera viewpoints significantly differ from robot observations, misalignment may occur; robust domain adaptation methods (e.g., contrastive alignment, cycle-consistency) might further improve.
    • IDM pseudo-actions are only as good as the IDM; assessing label noise sensitivity and developing confidence-aware weighting during training could help.
    • Real-world evaluation breadth is promising, but more tasks with long-horizon planning, locomotion, and safety-critical handling would solidify generalist claims.
  • Transferability:
    • The dual-system VLA + flow-matched DiT pattern is broadly applicable to other embodied agents (mobile manipulators, quadrupeds) and potentially to multi-agent coordination beyond bimanual tasks.

    • The data pyramid and latent action labeling could benefit AR/VR embodied assistance, industrial process automation, and teleoperation training simulators.

