
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Published: 03/19/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

GR00T N1 is an open foundation model for humanoid robots, integrating a vision-language reasoning module and a motion-generation module. Trained end-to-end on a pyramid of heterogeneous data, it outperforms existing imitation learning methods on simulation benchmarks and demonstrates strong performance and high data efficiency on a real Fourier GR-1 humanoid.

Abstract

General-purpose robots need a versatile body and an intelligent mind. Recent advancements in humanoid robots have shown great promise as a hardware platform for building generalist autonomy in the human world. A robot foundation model, trained on massive and diverse data sources, is essential for enabling the robots to reason about novel situations, robustly handle real-world variability, and rapidly learn new tasks. To this end, we introduce GR00T N1, an open foundation model for humanoid robots. GR00T N1 is a Vision-Language-Action (VLA) model with a dual-system architecture. The vision-language module (System 2) interprets the environment through vision and language instructions. The subsequent diffusion transformer module (System 1) generates fluid motor actions in real time. Both modules are tightly coupled and jointly trained end-to-end. We train GR00T N1 with a heterogeneous mixture of real-robot trajectories, human videos, and synthetically generated datasets. We show that our generalist robot model GR00T N1 outperforms the state-of-the-art imitation learning baselines on standard simulation benchmarks across multiple robot embodiments. Furthermore, we deploy our model on the Fourier GR-1 humanoid robot for language-conditioned bimanual manipulation tasks, achieving strong performance with high data efficiency.


In-depth Reading


1. Bibliographic Information

1.1. Title

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

1.2. Authors

The paper is authored by a large, primarily NVIDIA research team, with extensive contributions across model training, simulation, real-robot infrastructure, data curation, and open-sourcing. Key roles include:

  • Research Leads: Linxi “Jim” Fan, Yuke Zhu
  • Core Contributors (selected): Scott Reed, Ruijie Zheng, Guanzhi Wang, Johan Bjorck, Joel Jang, Ao Zhang, Jing Wang, Yinzhen Xu, Fengyuan Hu, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Loic Magne, Zhiding Yu, Zhiqi Li; plus specialized teams for teleoperation, simulation, video generation, IDM training, and infrastructure (see Appendix A.1–A.3 of the paper for full rosters). Affiliation: Predominantly NVIDIA, with acknowledgments to external teams (e.g., 1X and Fourier) for hardware and support.

1.3. Journal/Conference

arXiv (preprint). arXiv is a widely used open-access repository for scientific preprints and is commonly used for disseminating cutting-edge AI/robotics research prior to peer review.

1.4. Publication Year

2025 (Published at UTC: 2025-03-18T21:06:21.000Z).

1.5. Abstract

The paper introduces GR00T N1, an open Vision-Language-Action (VLA) foundation model for generalist humanoid robots. It features a dual-system architecture:

  • System 2: a pre-trained Vision-Language Model (VLM) (Eagle-2) that interprets images and language instructions.
  • System 1: a Diffusion Transformer (DiT) trained via flow-matching that generates fluid motor actions at high frequency. Both modules are tightly coupled via cross-attention and trained end-to-end. Training uses a “data pyramid” that mixes real robot trajectories, human egocentric videos, and synthetically generated datasets (simulation via DexMimicGen and neural video trajectories). GR00T N1 outperforms state-of-the-art imitation-learning baselines (BC-Transformer and Diffusion Policy) across multiple simulation benchmarks and embodiments, and demonstrates strong performance on the Fourier GR-1 humanoid with language-conditioned bimanual manipulation, showing high data efficiency. The paper releases model checkpoints, training data, and simulation benchmarks.

2. Executive Summary

2.1. Background & Motivation

  • Core problem: Building general-purpose, humanoid robots that can robustly execute diverse tasks in the human world, with strong generalization and rapid adaptation from limited data.
  • Why it matters: Humanoid form factors are promising for human environments. However, robot foundation models need massive, diverse, embodied data to reason in novel situations and control complex bodies. Real-world data is scarce and expensive; embodiments and sensors vary widely (leading to “data islands”).
  • Entry point/innovative idea:
    1. A dual-system VLA model: System 2 (VLM reasoning at 10 Hz) coupled to System 1 (DiT action policy at ~120 Hz), trained end-to-end via cross-attention, enabling tight coordination between perception-language reasoning and motor control.
    2. A “data pyramid” strategy: unify heterogeneous sources—web-scale human videos, synthetic data (simulation and neural video generation), and real-robot trajectories—by converting action-less videos into a common latent action space (via VQ-VAE latent action pretraining (LAPA) and inverse dynamics model (IDM) pseudo-actions), allowing cross-embodiment pretraining and post-training.

2.2. Main Contributions / Findings

  • Contributions:
    1. Architecture: A compositional VLA design (Eagle-2 VLM + DiT with flow-matching) for cross-embodiment, closed-loop action generation; embodiment-specific encoders/decoders handle varying state/action dimensions.
    2. Training pipeline: Unified pretraining and post-training across a heterogeneous “data pyramid” by annotating actions in action-less videos via latent actions and IDM, and co-training with synthetic simulation and neural video trajectories.
    3. Open release: GR00T-N1-2B checkpoint (~2.2B parameters; 1.34B in the VLM), training data, and simulation benchmarks; the reported inference latency for a 16-step action chunk is 63.9 ms on an NVIDIA L40 using bf16.
  • Findings:
    • In simulation (RoboCasa, DexMimicGen cross-embodiment suite, GR-1 Tabletop), GR00T N1 consistently outperforms two strong baselines (BC-Transformer and Diffusion Policy), especially in humanoid GR-1 tasks.
    • On real GR-1 humanoid tasks, GR00T N1 achieves high success rates, surpassing Diffusion Policy substantially, even when trained on only 10% of the data (showing data efficiency).
    • Co-training with neural trajectories further boosts performance in both simulation and real-world tasks; latent actions (LAPA) are more beneficial in low-data regimes, while IDM labels shine as data grows.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Foundation model: A large-scale, general-purpose model trained on diverse data to provide transferable capabilities across tasks (e.g., large language models). In robotics, foundation models aim to encode visual-language understanding and control priors.
  • Vision-Language Model (VLM): A model that jointly processes images and text and produces a unified representation for reasoning. Here, Eagle-2 is the VLM backbone, composed of:
    • SmolLM2 (LLM): A compact large language model post-trained for multimodal tasks.
    • SigLIP-2 (image encoder): A multilingual, strong vision-language encoder producing dense features from images.
  • Diffusion Transformer (DiT): A transformer variant for generative modeling where denoising is conditioned via adaptive layer normalization; actions are generated by iterative denoising steps.
  • Flow matching (FM): A generative training objective where the model learns a vector field that moves noisy samples toward data samples. GR00T N1 uses FM to train action denoising.
  • Inverse Dynamics Model (IDM): A model that predicts actions given two states (e.g., current and future frames). Used here to label actions for videos that lack explicit action traces.
  • VQ-VAE (Vector-Quantized Variational Autoencoder): An autoencoder variant where continuous latent embeddings are discretized via a learned codebook; here used to learn “latent actions” (LAPA) from pairs of video frames.
  • Cross-attention: A mechanism in transformers where one sequence (queries) attends to another (keys/values). GR00T N1’s DiT uses cross-attention to condition action generation on vision-language tokens.
  • Transformer basics (for beginners):
    • Self-attention computes attention scores within a sequence to aggregate contextual information. The standard scaled dot-product attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Symbols: $Q$ (queries), $K$ (keys), $V$ (values) are linear projections of input embeddings; $d_k$ is the key dimensionality; softmax normalizes the scores.
    • Multi-head attention (MHA) uses multiple attention heads to capture diverse patterns; outputs are concatenated and projected.
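As a concrete reference, here is a minimal PyTorch sketch of scaled dot-product attention (generic transformer machinery, not GR00T N1's specific implementation; tensor shapes are illustrative):

```python
import torch

def scaled_dot_product_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    # Q, K, V: (batch, seq_len, d_k), produced by linear projections of the inputs.
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (batch, seq_q, seq_k) similarity scores
    weights = torch.softmax(scores, dim=-1)         # normalize over the key dimension
    return weights @ V                              # attention-weighted sum of values

# Cross-attention uses the same computation with Q from one sequence (e.g., action
# tokens in GR00T N1's DiT) and K, V from another (e.g., vision-language tokens).
```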

3.2. Previous Works

  • VLA models and robot foundation models:
    • RT-1/RT-2 (Brohan et al., 2022/2023): Language-conditioned policies with large-scale data; RT-2 transfers web knowledge to robotic control.
    • OpenVLA (Kim et al., 2024): Open-source VLA model; focuses on planning/control via language.
    • π₀ VLA flow model (Black et al., 2024): Flow models for robot control; introduces flow-matching for actions; GR00T N1 builds on similar FM principles but uses a simpler cross-attention coupling (versus mixture-of-experts).
    • Octo (Octo Model Team, 2024): Cross-embodiment generalist policy with embodiment-specific projectors; GR00T N1 similarly uses embodiment-aware encoders/decoders and additionally fine-tunes the VLM.
  • Data scaling and teleoperation:
    • Open X-Embodiment (2024), DROID (Khazatsky et al., 2024), BridgeData v2 (Walke et al., 2023): massive robot datasets enabling broad generalization.
    • Teleoperation platforms: ALOHA, RoboTurk, OpenTeach, Gello, TeleMoMa—provide high-quality human demonstrations.
  • Synthetic data generation:
    • MimicGen (Mandlekar et al., 2023) and DexMimicGen (Jiang et al., 2024): automated synthesis of demonstrations via segment transformation and replay in simulation.
    • Neural video generation (Wan Team, 2025; Agarwal et al., 2025; CogVideoX): used here to create “neural trajectories” that augment real data and cover counterfactual scenarios at scale.
  • Latent action learning:
    • LAPA (Ye et al., 2025): Latent Action Pretraining from Videos—learns action representations from video pairs; GR00T N1 uses a VQ-VAE variant for latent actions to annotate action-less data.

3.3. Technological Evolution

  • From task-specific policies and supervised imitation learning to large-scale VLA foundation models trained on heterogeneous corpora.
  • Growing emphasis on cross-embodiment generalization and multi-modal grounding (vision + language + action).
  • Rise of synthetic data pipelines in both simulation (MimicGen/DexMimicGen) and generative neural videos to overcome real-world data scarcity.

3.4. Differentiation Analysis

  • Architectural simplicity and tight coupling: GR00T N1 couples Eagle-2 VLM and DiT via cross-attention and end-to-end training, rather than heavier mixture-of-experts bridges.
  • Unified data pyramid: It integrates web-scale human videos, synthetic (simulation and neural) trajectories, and real demos by converting video-only data into unified latent action spaces (via LAPA/IDM), enabling a single model across embodiments.
  • Embodiment-aware projectors: Modular encoders/decoders map diverse state/action spaces into shared embeddings, supporting single-arm, bimanual, and humanoid embodiments.
  • Practical efficiency: High-frequency action generation and strong data efficiency in real-world deployment on GR-1 humanoids.

4. Methodology

4.1. Principles

  • Dual-system inspiration: Following “Thinking, Fast and Slow” (Kahneman, 2011), GR00T N1 adopts:
    • System 2 (slow, deliberative): the VLM (Eagle-2) encodes vision-language context at ~10 Hz to interpret scene/task goals.
    • System 1 (fast, reactive): the DiT policy generates closed-loop motor actions at ~120 Hz from noisy action chunks via flow-matching denoising, cross-attending to System 2 tokens.
  • Unified training: End-to-end coupling ensures the language-grounded visual reasoning informs real-time control, while the action module learns fluid, robust actuation across embodiments.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Inputs, Tokenization, and Embeddings

  • Observations:
    • Images: encoded by SigLIP-2 at resolution $224 \times 224$, followed by pixel shuffle (producing 64 image token embeddings per frame).
    • Text instructions: formatted in a “chat-like” prompt to the VLM (consistent with Eagle-2 training).
    • Robot state: proprioceptive state vector (e.g., joint positions/rotations, end-effector states), varies by embodiment.
    • Actions: represented as chunks $A_t = [a_t, a_{t+1}, \dots, a_{t+H-1}]$ with horizon $H = 16$ (chunked actions enable efficient denoising and temporal consistency).
  • Embodiment-aware encoders/decoders:
    • State encoder (MLP per embodiment): projects the varying-dimensional state $q_t$ into a shared embedding space.
    • Action encoder (MLP per embodiment): encodes diffusion timestep and the noised action vector (following Black et al., 2024).
    • Action decoder (MLP per embodiment): decodes the final DiT outputs to actual action vectors for the specific embodiment.
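A minimal sketch of such embodiment-aware projectors follows; the module layout, widths, and the way the flow-matching timestep is appended are illustrative assumptions (the paper specifies per-embodiment MLPs but not their exact shapes):

```python
import torch
import torch.nn as nn

class EmbodimentProjectors(nn.Module):
    """Per-embodiment MLP encoders/decoders mapping heterogeneous state and
    action spaces into and out of a shared model width (illustrative sketch)."""
    def __init__(self, dims: dict[str, tuple[int, int]], d_model: int = 1024):
        super().__init__()
        # dims maps an embodiment tag to its (state_dim, action_dim).
        self.state_enc = nn.ModuleDict(
            {k: nn.Linear(s, d_model) for k, (s, a) in dims.items()})
        # The action encoder also consumes the flow-matching timestep, appended
        # here as one extra scalar feature (an implementation assumption).
        self.action_enc = nn.ModuleDict(
            {k: nn.Linear(a + 1, d_model) for k, (s, a) in dims.items()})
        self.action_dec = nn.ModuleDict(
            {k: nn.Linear(d_model, a) for k, (s, a) in dims.items()})

    def encode(self, emb: str, q: torch.Tensor, noised_actions: torch.Tensor,
               tau: torch.Tensor):
        state_tok = self.state_enc[emb](q)                    # (B, d_model)
        tau_feat = tau.expand(*noised_actions.shape[:-1], 1)  # tau: (B, 1, 1) broadcast over the chunk
        action_tok = self.action_enc[emb](
            torch.cat([noised_actions, tau_feat], dim=-1))    # (B, H, d_model)
        return state_tok, action_tok

    def decode(self, emb: str, h: torch.Tensor) -> torch.Tensor:
        return self.action_dec[emb](h)                        # back to the embodiment's action_dim

# Hypothetical dimensions for two embodiments:
# proj = EmbodimentProjectors({"gr1": (44, 40), "franka": (9, 7)})
```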

4.2.2. Vision-Language Module (System 2)

  • Backbone: Eagle-2 VLM (Li et al., 2025), fine-tuned from SmolLM2 LLM and SigLIP-2 image encoder; aligned on broad vision-language tasks.

  • Processing:

    1. Encode input images to 64 tokens/frame and format language in chat template.
    2. The LLM fuses image and text tokens; the authors extract intermediate LLM layer representations for efficiency and performance (for GR00T-N1-2B, the 12th layer).
    3. Output: vision-language features $\phi_t$ of shape (batch size × sequence length × hidden dimension), fed to the DiT via cross-attention.
  • Empirical note: Middle-layer embeddings gave faster inference and better downstream success than final-layer embeddings.

    The following figure (Figure 2 from the original paper) shows the system overview:

    Figure 2: GR00T N1 Model Overview. The model is a Vision-Language-Action (VLA) model with a dual-system design. Image observations and language instructions are converted into a sequence of tokens processed by the Vision-Language Model (VLM) backbone. The VLM outputs, together with robot state and action encodings, are passed to the Diffusion Transformer module to generate motor actions.
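A hedged sketch of extracting middle-layer VLM features with Hugging Face Transformers; the checkpoint id, processor call, and image path are assumptions (the paper only states that GR00T-N1-2B reads the 12th LLM layer):

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

MODEL_ID = "nvidia/Eagle2-2B"  # illustrative checkpoint name, not confirmed by the paper
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True,
                                  torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

frame = Image.open("egocentric_frame.png")  # hypothetical observation image
inputs = processor(images=frame, text="pick the apple and place it in the basket",
                   return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the input embedding layer, so index 12 is the output of the
# 12th transformer layer: the vision-language tokens phi_t consumed by the DiT.
phi_t = out.hidden_states[12]  # (batch, seq_len, hidden_dim)
```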

4.2.3. Diffusion Transformer Module (System 1) with Flow-Matching

  • Architecture: A DiT variant (Peebles and Xie, 2023) with denoising-step conditioning via adaptive layer normalization, denoted $V_{\theta}$.

    • Self-attention blocks operate over the noised action token embeddings $A_t^{\tau}$ together with the state embeddings $q_t$.

    • Cross-attention blocks condition on the vision-language token embeddings $\phi_t$ output by the VLM.

    • The final DiT block is followed by the embodiment-specific Action Decoder to predict the action chunk.

      The following figure (Figure 3 from the original paper) shows the detailed architecture:

      Figure 3: GR00T N1 Model Architecture. GR00T N1 is trained on a diverse set of embodiments, ranging from single-arm robots to bimanual humanoids with dexterous hands. To handle each embodiment's distinct state observations and actions, the DiT blocks use embodiment-aware state and action encoders. The model leverages latent embeddings of the Eagle-2 VLM to incorporate the robot's visual observations and language instructions; these vision-language tokens are fed into the DiT blocks through cross-attention layers.

  • Flow-matching training:

    • Noising process: Given the ground-truth action chunk $A_t$, a flow-matching timestep $\tau \in [0,1]$, and sampled noise $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, define: $ A_t^{\tau} = \tau A_t + (1 - \tau)\epsilon $ Symbols: $A_t$ is the true action chunk for timesteps $t$ to $t+H-1$; $\epsilon$ is Gaussian noise; $\tau$ sets how close $A_t^{\tau}$ is to $A_t$ versus noise.
    • Objective: The model prediction $V_{\theta}(\phi_t, A_t^{\tau}, q_t)$ approximates the denoising vector field $\epsilon - A_t$ by minimizing: $ \mathcal{L}_{fm}(\theta) = \mathbb{E}_{\tau}\left[ \Vert V_{\theta}(\phi_t, A_t^{\tau}, q_t) - (\epsilon - A_t) \Vert^2 \right] $ Symbols: $\mathcal{L}_{fm}$ is the flow-matching loss; the expectation is over sampled $\tau$; $V_{\theta}$ is parameterized by $\theta$; the norm is the $L^2$ norm over the action-chunk tokens.
    • Timestep distribution: As in Black et al. (2024), the authors use $ p(\tau) = \mathrm{Beta}\left(\frac{s - \tau}{s};\, 1.5, 1\right), \quad s = 0.999 $ Symbols: $\mathrm{Beta}(\cdot;\, \alpha, \beta)$ denotes the Beta distribution with parameters $(1.5, 1)$; $s$ is a scalar shaping parameter; the argument $(s - \tau)/s$ modulates sampling near the endpoints.
  • Inference (denoising):

    • Initialize by sampling $A_t^{0} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
    • Iterative forward-Euler integration with $K$ steps (empirically $K = 4$ across embodiments): $ A_t^{\tau + 1/K} = A_t^{\tau} + \frac{1}{K} V_{\theta}(\phi_t, A_t^{\tau}, q_t) $ Symbols: $A_t^{\tau}$ is the current noisy action chunk; $V_{\theta}$ estimates the vector field; $\frac{1}{K}$ is the step size; $q_t$ is the state embedding; $\phi_t$ the vision-language context.
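Combining the objective and the sampler, here is a minimal PyTorch sketch; the `model` callable and tensor shapes are stand-ins, and the sign convention is normalized so that Euler integration moves noise toward actions, consistent with the interpolation $A_t^{\tau} = \tau A_t + (1-\tau)\epsilon$:

```python
import torch

def flow_matching_loss(model, phi_t, q_t, actions, s=0.999):
    """actions: (B, H, action_dim) ground-truth chunk; model predicts the
    vector field V_theta(phi_t, A_tau, q_t, tau)."""
    B = actions.shape[0]
    u = torch.distributions.Beta(1.5, 1.0).sample((B, 1, 1))  # (s - tau)/s ~ Beta(1.5, 1)
    tau = s * (1.0 - u)
    eps = torch.randn_like(actions)
    a_tau = tau * actions + (1.0 - tau) * eps        # noising: A_t^tau
    target = actions - eps                           # velocity d(A^tau)/d(tau) toward the data
    v_pred = model(phi_t, a_tau, q_t, tau)
    return ((v_pred - target) ** 2).mean()

@torch.no_grad()
def sample_action_chunk(model, phi_t, q_t, shape, K=4):
    """K forward-Euler steps from pure noise (K = 4 in the paper)."""
    a = torch.randn(shape)                           # A_t^0 ~ N(0, I)
    for k in range(K):
        tau = torch.full((shape[0], 1, 1), k / K)
        a = a + (1.0 / K) * model(phi_t, a, q_t, tau)  # A^{tau+1/K} = A^tau + V/K
    return a                                         # then decoded by the action decoder
```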

4.2.4. Chunked Action Modeling

  • Temporal horizon: The policy operates on action chunks $A_t = [a_t, a_{t+1}, \dots, a_{t+H-1}]$ with $H = 16$. Chunking stabilizes training and permits multi-step generation per inference pass while keeping computational cost modest (inspired by Zhao et al., 2023).

4.2.5. Latent Actions (LAPA) from Videos

  • Problem: Many web-scale human egocentric videos and neural trajectories lack robot action labels.
  • Solution: Train a VQ-VAE to learn latent actions $z_t$ from frame pairs $(x_t, x_{t+H})$:
    • The encoder takes $(x_t, x_{t+H})$ and outputs a continuous embedding (pre-quantization); a codebook maps it to the nearest discrete vector (the classic VQ-VAE objective).

    • The decoder takes $(z_t, x_t)$ and reconstructs $x_{t+H}$, forcing $z_t$ to encode the motion dynamics (an inverse-dynamics signal).

    • After training, the encoder serves as a learned inverse-dynamics estimator: feed $(x_t, x_{t+H})$, extract the continuous pre-quantized embedding, and treat it as a latent action label during pretraining. This defines a distinct “LAPA embodiment.”

    • Benefit: Unifies heterogeneous video sources (robot/human) into a common learned latent action space, aiding cross-embodiment generalization.

      The following figure (Figure 4 from the original paper) qualitatively illustrates latent actions across embodiments:

      The panels depict GR00T N1's bimanual manipulation across diverse settings (e.g., handling a red box, kitchen scenes, cutting and cleaning tasks), spanning multiple embodiments and scenarios.
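A compact sketch of this latent-action VQ-VAE, operating on pooled frame features; the feature dimensions, codebook size, and loss weights are assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionVQVAE(nn.Module):
    """Encode (x_t, x_{t+H}) into a quantized latent action z_t; decode
    (z_t, x_t) to reconstruct x_{t+H}. Inputs are pooled frame features."""
    def __init__(self, feat_dim=512, latent_dim=32, codebook_size=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(2 * feat_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim + feat_dim, 256), nn.ReLU(),
                                     nn.Linear(256, feat_dim))

    def forward(self, f_t, f_future):
        z_e = self.encoder(torch.cat([f_t, f_future], dim=-1))  # continuous latent action
        codes = torch.cdist(z_e, self.codebook.weight).argmin(dim=-1)
        z_q = self.codebook(codes)                               # nearest codebook vector
        z_st = z_e + (z_q - z_e).detach()                        # straight-through gradient
        recon = self.decoder(torch.cat([z_st, f_t], dim=-1))     # predict future features
        loss = (F.mse_loss(recon, f_future)                      # reconstruction
                + F.mse_loss(z_q, z_e.detach())                  # codebook update
                + 0.25 * F.mse_loss(z_e, z_q.detach()))          # commitment
        return z_e, loss
```

At annotation time only the encoder is used: the continuous pre-quantized embedding `z_e` serves as the pseudo-action label for action-less video.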

4.2.6. Neural Trajectory Generation

  • Motivation: Real robot data scales linearly with human labor; generating counterfactual trajectories via video models is far cheaper.

  • Approach:

    1. Fine-tune open-source image-to-video models (e.g., WAN2.1-I2V-14B via LoRA) on a subset of GR-1 teleoperation trajectories (3,000 samples; 81 frames; 480p).
    2. Use a multimodal LLM to detect objects and synthesize diverse language prompts for feasible pick-and-place variants (“pick up {object} from {A} to {B}”), expanding instructional diversity.
    3. Post-process: filter generations that do not follow the instruction using an LLM judge over downsampled frames; re-caption if needed.
    4. Label actions: apply latent actions and/or IDM-based pseudo-actions to these videos, allowing joint training with robot data.
  • Scale: ~82 hours of neural videos were generated (about one minute of generation per 10-second video on an L40; ~105k L40 GPU hours across 3,600 L40 GPUs).

    The following figure (Figure 6 in the original paper) illustrates synthetically generated videos:

    Figure 6: Synthetically Generated Videos. We leverage off-the-shelf video generation models to create neural trajectories, increasing the quantity and diversity of our training datasets. The generated data can be used for both pre- and post-training of GR00T N1. The first three rows are generated from the same initial frame but with different objects to pick up; the next row showcases the video model generating a robot trajectory that would be very challenging to produce in simulation (spilling contents inside a mesh cup into a pan); the last row is generated from an initial frame taken from simulation data. Red rectangles indicate the initial frames.
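For the prompt-synthesis step, a toy illustration of the "pick up {object} from {A} to {B}" expansion; the object and receptacle lists are hypothetical stand-ins for the multimodal LLM's detections:

```python
import itertools

objects = ["apple", "cucumber", "mesh cup"]                    # hypothetical detections
places = ["cutting board", "placemat", "tray", "basket", "pan"]

prompts = [
    f"pick up the {obj} from the {src} and place it on the {dst}"
    for obj, (src, dst) in itertools.product(objects, itertools.permutations(places, 2))
]
# Each prompt seeds one image-to-video generation; an LLM judge later filters
# clips whose motion does not follow the instruction.
```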

4.2.7. Simulation Trajectories via DexMimicGen

  • Pipeline:

    • Collect a small set of human demos via teleoperation (e.g., Leap Motion for wrist and finger tracking; retarget via whole-body IK).
    • Segment demos into object-centric subtasks; transform segments to novel environments; interpolate movements to ensure smooth execution; verify task success; keep successful demos.
    • Mass generation: 10,000 new demos per (source, target) receptacle pair; 540k total demonstrations (pretraining regime). Overall: 780,000 simulation trajectories generated (≈6,500 hours; produced in ~11 hours).
  • Environments: RoboCasa tasks (e.g., kitchen manipulation), DexMimicGen cross-embodiment suites including humanoid dexterous tasks and bimanual arms.

    The following figure (Figure 7 in the original paper) shows simulation tasks:

    Figure 7: Simulation Tasks. Our simulation experiments use tasks from two open-source benchmarks (RoboCasa (Nasiriany et al., 2024), top row; DexMimicGen (Jiang et al., 2024), middle row) and a newly developed suite of tabletop manipulation tasks that closely resemble our real-world tasks (bottom row). Omniverse renderings of the tasks are shown above.

4.2.8. Training Strategy (Pre-training and Post-training)

  • Pre-training:
    • Objective: Flow-matching loss Lfm\mathcal{L}_{fm} (as above).
    • Data mixture: Real robot datasets (GR-1 teleop, Open X-Embodiment, AgiBot-Alpha), synthetic simulation (DexMimicGen, RoboCasa), human video datasets (Ego4D, Ego-Exo4D, Assembly-101, EPIC-KITCHENS, HOI4D, HoloAssist, RH20T-Human), and neural-generated trajectories.
    • Labeling strategy:
      • Human videos: use learned latent actions as targets (LAPA).
      • Robot datasets: use ground-truth actions and/or latent actions as targets.
      • Neural trajectories: use latent actions and/or IDM-based pseudo-actions derived from real robot data-trained IDMs.
    • Auxiliary loss: An object-detection auxiliary loss $L_{det}$ using OWL-v2 bounding boxes improves spatial understanding. For each frame, compute the normalized center coordinates of the target object $\mathbf{x}_{gt}$ and minimize $L_{det} = \| \mathbf{x}_{pred} - \mathbf{x}_{gt} \|^2$. Total loss: $\mathcal{L} = \mathcal{L}_{fm} + L_{det}$.
  • Post-training:
    • Fine-tune per single embodiment to adapt for target tasks; keep the language component of the VLM frozen (the text tokenizer stays frozen; vision encoder can be tuned depending on compute).
    • Co-train on real trajectories and neural trajectories at a 1:1 sampling ratio (see the sampling sketch after the figures below).
    • Low-data regimes: Train IDM on limited action data; use IDM pseudo-labels for neural videos (especially for GR-1 humanoid tasks, which are challenging).
  • Infrastructure:
    • Managed by NVIDIA OSMO; up to 1,024 H100 GPUs; pretraining consumed ~50,000 H100 GPU hours; multi-node fault tolerance via Ray.

    • Fine-tuning tested on single A6000 GPU; large batch sizes possible when only tuning adapters and DiT.

      The following figure (Figure 5 in the original paper) depicts teleoperation-based data collection:

      Figure 5: Data Collection via Teleoperation. Our teleoperation infrastructure supports multiple devices to capture human hand motion, including 6-DoF wrist poses and hand skeletons. Robot actions are produced through retargeting and executed on robots in real and simulated environments.
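A minimal sketch of the 1:1 co-training sampling mentioned above; the dataset classes are placeholders:

```python
import random
from torch.utils.data import Dataset

class CoTrainingDataset(Dataset):
    """Alternates real and neural trajectories so batches are sampled 1:1
    (illustrative; real_ds and neural_ds are placeholder Dataset objects)."""
    def __init__(self, real_ds: Dataset, neural_ds: Dataset):
        self.real_ds, self.neural_ds = real_ds, neural_ds

    def __len__(self) -> int:
        return 2 * max(len(self.real_ds), len(self.neural_ds))

    def __getitem__(self, i: int):
        # Even indices draw a real sample, odd indices a neural one.
        ds = self.real_ds if i % 2 == 0 else self.neural_ds
        return ds[random.randrange(len(ds))]
```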

The following figure (Figure 1 in the original paper) illustrates the data pyramid concept:

Figure 1: Data Pyramid for Robot Foundation Model Training. GR00T N1's heterogeneous training corpora can be represented as a pyramid: moving from the bottom to the top, data quantity decreases while embodiment-specificity increases, from web data and human videos (e.g., Common Crawl, Wikipedia) at the base, through synthetic data, to real-world robot data at the top.

5. Experimental Setup

5.1. Datasets

  • Real-robot datasets:
    1. GR00T N1 Humanoid Pre-Training (Fourier GR-1 via teleoperation):
      • Devices: VIVE trackers (wrists), Xsens Metagloves (fingers); explored Vision Pro and Leap Motion.
      • Retarget to humanoid via inverse kinematics; control at 20 Hz; head-mounted egocentric camera.
      • Hierarchical annotations: fine-grained (grasp, move, place) and coarse-grained sequences.
    2. Open X-Embodiment (2024): RT-1, Bridge-v2, Language Table, DROID, MUTEX, RoboSet, Plex.
    3. AgiBot-Alpha (2025): 140,000 trajectories from 100 robots (fine manipulation, tools, multi-robot collaboration).
  • Synthetic datasets:
    • Simulation (RoboCasa; DexMimicGen): humanoid bimanual dexterous tasks, Franka arm tasks, and GR-1 tabletop rearrangements (various receptacles and objects, distractors, novel (source,target) pairs).
    • Neural trajectories: ~82 hours of video generations (filtered, re-captioned); actions labeled via LAPA/IDM.
  • Human video datasets:
    • Ego4D, Ego-Exo4D, Assembly-101, EPIC-KITCHENS, HOI4D, HoloAssist, RH20T-Human—egocentric recordings of task-oriented human-object interactions.

      Example qualitative samples (Figure 14 in the original paper) from human video datasets:

      Figure 14: Human Egocentric Video Dataset Samples. We use seven human video datasets for pre-training. The images above show an example from each of the seven datasets with its corresponding language annotation.

5.2. Evaluation Metrics

  • Success rate (simulation and real robot):
    1. Conceptual definition: Proportion of trials where the policy completes the task within constraints (time, correctness).
    2. Mathematical formula: $ \mathrm{Success\ Rate} = \frac{\text{Number of successful trials}}{\text{Total number of trials}} $
    3. Symbol explanation: Numerator counts completed tasks; denominator is total attempts for that task/config; success criteria follow benchmark protocols.
  • Real-world partial scoring (for Machinery Packing): Count how many out of 5 parts/tools are placed into the bin within 30 seconds; report fractional success accordingly. Conceptually, this is a per-trial completion fraction; one can average across trials to get mean success.
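As a worked illustration of this partial-scoring scheme (the trial counts below are made up):

```python
def packing_score(parts_placed: int, total_parts: int = 5) -> float:
    """Per-trial completion fraction for Machinery Packing (30-second limit)."""
    return parts_placed / total_parts

trials = [3, 5, 4, 2]  # hypothetical parts placed in four trials
mean_success = sum(packing_score(p) for p in trials) / len(trials)
print(f"mean partial success: {mean_success:.2f}")  # 0.70
```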

5.3. Baselines

  • BC-Transformer (Mandlekar et al., 2021): Transformer-based behavior cloning policy (RoboMimic); processes observation sequences; Gaussian Mixture Model (GMM) for action distributions; inputs: 10 observation frames; outputs: next 10 actions.
  • Diffusion Policy (Chi et al., 2024): Action diffusion via a U-Net; removes noise progressively; conditioned on observation sequences; typically consumes a single observation frame and outputs 16 action steps per pass.

5.4. Protocols

  • Simulation benchmarks:
    • RoboCasa Kitchen: 24 atomic tasks (pick-and-place, door operations, faucets, buttons), with Franka arm demos (MimicGen-generated, 3000 demos/task). Observations: RGB images (left, right, wrist). State: end-effector/base poses, gripper state. Actions: relative EE pos/rot + gripper state. Follow Nasiriany et al. (2024) protocols.
    • DexMimicGen cross-embodiment suite: 9 tasks across three embodiments (bimanual Panda with parallel-jaw grippers; bimanual Panda with dexterous hands; GR-1 humanoid with dexterous hands). 1000 demos per task; eval generalization to novel object configs.
    • GR-1 Tabletop tasks: 24 tasks focusing on dexterous hand control; 18 rearrangements (novel combinations unseen in pretraining), 6 articulated placements/closures (cabinets, drawers, microwaves). Observations: egocentric head camera. Actions: joint positions/rotations of arms/hands, waist, neck; optionally end-effector-based actions for whole-body IK controller. 1000 demos per task via DexMimicGen.
  • Real-world benchmarks:
    • Categories: Pick-and-Place (seen/unseen objects), Articulated Object Manipulation (wooden chest, dark cabinet, white drawer), Industrial (machinery packing, mesh cup pouring, cylinder handover), Multi-Agent Coordination (handover sequences).
    • Data-limited: Evaluate with 10% of dataset vs. full dataset; 10 trials per task (except Machinery Packing with partial scoring).

6. Results & Analysis

6.1. Core Results Analysis

  • Simulation (using 100 demos/task):
    • GR00T-N1-2B surpasses BC-Transformer and Diffusion Policy across RoboCasa, DexMimicGen suite, and GR-1 Tabletop. The largest margin is on GR-1 tasks (+17% over Diffusion Policy).
    • This validates the benefits of: (i) dual-system VLA coupling; (ii) flow-matched DiT action generation; (iii) cross-embodiment pretraining with heterogeneous data (including latent actions).
  • Real robot (GR-1):
    • GR00T-N1-2B beats Diffusion Policy by +32.4% (10% data) and +30.4% (full data) averaged across categories.
    • Remarkably, GR00T-N1-2B trained on 10% of data is only 3.8% lower than Diffusion Policy trained on full data—strong evidence of data efficiency.
  • Co-training with neural trajectories:
    • RoboCasa: +4.2%, +8.8%, +6.8% average gains in 30/100/300 regimes.
    • Real GR-1: +5.8% average gain over 8 tasks in low-data setting.
    • LAPA vs. IDM labels: LAPA slightly better in very low-data (30 demos); IDM pulls ahead as data grows (100/300), as pseudo-actions align more closely with real actions.

6.2. Data Presentation (Tables)

The following are the results from Table 2 of the original paper:

| Method | RoboCasa | DexMG | GR-1 | Average |
| --- | --- | --- | --- | --- |
| BC Transformer | 26.3% | 53.9% | 16.1% | 26.4% |
| Diffusion Policy | 25.6% | 56.1% | 32.7% | 33.4% |
| GR00T-N1-2B | 32.1% | 66.5% | 50.0% | 45.0% |

The following are the results from Table 3 of the original paper:

| Method | Pick-and-Place | Articulated | Industrial | Coordination | Average |
| --- | --- | --- | --- | --- | --- |
| Diffusion Policy (10% Data) | 3.0% | 14.3% | 6.7% | 27.5% | 10.2% |
| Diffusion Policy (Full Data) | 36.0% | 38.6% | 61.0% | 62.5% | 46.4% |
| GR00T-N1-2B (10% Data) | 35.0% | 62.0% | 31.0% | 50.0% | 42.6% |
| GR00T-N1-2B (Full Data) | 82.0% | 70.9% | 70.0% | 82.5% | 76.8% |

The following are selected per-task results from Table 4 of the original paper (success rates in %, for Diffusion Policy (DP) and GR00T-N1-2B trained with 30/100/300 demos per task); refer to the original paper for the complete table:

RoboCasa Kitchen (24 tasks, PnP = Pick-and-Place):

| Task | DP 30 | DP 100 | DP 300 | GR00T 30 | GR00T 100 | GR00T 300 |
| --- | --- | --- | --- | --- | --- | --- |
| Close Drawer | 57.5 | 88.2 | 72.6 | 76.9 | 96.1 | 83.3 |
| Coffee Press Button | 32.5 | 46.1 | 91.2 | 27.8 | 56.9 | 85.3 |
| Coffee Serve Mug | 6.7 | 28.4 | 66.7 | 3.7 | 34.3 | 72.6 |
| Open Double Door | 0.0 | 9.8 | 61.8 | 0.0 | 12.8 | 14.7 |
| PnP from Cab to Counter | 2.5 | 4.9 | 10.8 | 0.9 | 3.9 | 19.6 |
| PnP from Counter to Microwave | 0.0 | 2.0 | 8.8 | 0.0 | 0.0 | 12.8 |
| PnP from Counter to Sink | 0.0 | 0.0 | 13.7 | 0.0 | 1.0 | 9.8 |
| PnP from Counter to Stove | 0.0 | 1.0 | 17.7 | 0.0 | 0.0 | 23.5 |
| PnP from Microwave to Counter | 0.0 | 2.0 | 11.8 | 0.0 | 0.0 | 15.7 |
| PnP from Sink to Counter | 4.2 | 8.8 | 42.2 | 0.0 | 5.9 | 33.3 |
| PnP from Stove to Counter | 1.7 | 2.9 | 23.5 | 0.0 | 0.0 | 29.4 |
| Turn Off Microwave | 63.3 | 53.9 | 52.0 | 47.2 | 57.8 | 70.6 |

DexMimicGen Cross-Embodiment Suite (9 tasks):

| Task | DP 30 | DP 100 | DP 300 | GR00T 30 | GR00T 100 | GR00T 300 |
| --- | --- | --- | --- | --- | --- | --- |
| Can Sort | 82.8 | 93.1 | 99.4 | 94.8 | 98.0 | 98.0 |
| Coffee | 35.5 | 68.1 | 79.7 | 44.9 | 79.4 | 73.5 |
| Pouring | 37.0 | 62.3 | 68.8 | 54.4 | 71.6 | 87.3 |
| Threading | 4.2 | 18.3 | 27.5 | 3.9 | 37.3 | 60.8 |
| Three Piece Assembly | 10.0 | 32.5 | 63.3 | 10.8 | 43.1 | 69.6 |
| Transport | 7.5 | 25.0 | 53.3 | 7.8 | 48.0 | 61.8 |
| DexMG Average | 23.7 | 46.9 | 68.4 | 29.6 | 58.5 | 74.2 |

GR-1 Tabletop (24 tasks, selection):

| Task | DP 30 | DP 100 | DP 300 | GR00T 30 | GR00T 100 | GR00T 300 |
| --- | --- | --- | --- | --- | --- | --- |
| Cutting Board to Tiered Basket | 13.7 | 13.7 | 18.6 | 13.7 | 23.5 | 34.3 |
| Cutting Board to Pan | 28.4 | 48.0 | 57.8 | 67.7 | 65.7 | 68.6 |
| Cutting Board to Bowl | 11.8 | 15.7 | 22.6 | 31.4 | 30.4 | 33.3 |
| Cutting Board to Cardboard Box | 14.7 | 18.6 | 23.5 | 31.4 | 39.2 | 39.2 |
| Plate to Pan | 13.7 | 17.7 | 35.3 | 35.3 | 48.0 | 52.9 |
| Plate to Cardboard Box | 12.8 | 13.7 | 27.5 | 34.3 | 38.2 | 32.4 |
| Plate to Bowl | 15.7 | 18.6 | 31.4 | 41.2 | 42.2 | 34.3 |
| Plate to Plate | 25.5 | 39.2 | 61.8 | 72.6 | 85.3 | 68.6 |
| Tray to Tiered Shelf | 2.0 | 6.9 | 15.7 | 17.7 | 27.5 | 14.7 |
| Tray to Tiered Basket | 12.8 | 34.3 | 39.2 | 33.3 | 49.0 | 45.1 |
| Tray to Plate | 26.5 | 41.2 | 49.0 | 53.9 | 68.6 | 62.8 |

The following are the results from Table 5 of the original paper:

| Task | Diffusion Policy (10% Data) | Diffusion Policy (Full Data) | GR00T-N1-2B (10% Data) | GR00T-N1-2B (Full Data) |
| --- | --- | --- | --- | --- |
| Tray to Plate (seen) | 0.0% | 20.0% | 40.0% | 100.0% |
| Cutting Board to Basket (seen) | 0.0% | 30.0% | 10.0% | 100.0% |
| Cutting Board to Pan (seen) | 0.0% | 60.0% | 60.0% | 80.0% |
| Plate to Bowl (seen) | 0.0% | 40.0% | 30.0% | 100.0% |
| Placemat to Basket (seen) | 10.0% | 60.0% | 40.0% | 80.0% |
| Pick-and-Place Seen Object Average | 2.0% | 42.0% | 36.0% | 92.0% |
| Tray to Plate (unseen) | 0.0% | 20.0% | 30.0% | 80.0% |
| Cutting Board to Basket (unseen) | 10.0% | 20.0% | 60.0% | 60.0% |
| Cutting Board to Pan (unseen) | 0.0% | 40.0% | 40.0% | 80.0% |
| Plate to Bowl (unseen) | 0.0% | 20.0% | 10.0% | 40.0% |
| Placemat to Basket (unseen) | 10.0% | 50.0% | 30.0% | 100.0% |
| Pick-and-Place Unseen Object Average | 4.0% | 30.0% | 34.0% | 72.0% |
| Pick-and-Place Average | 3.0% | 36.0% | 35.0% | 82.0% |
| White Drawer | 6.6% | 36.4% | 26.4% | 79.9% |
| Dark Cabinet | 0.0% | 46.2% | 86.6% | 69.7% |
| Wooden Chest | 36.4% | 33.2% | 72.9% | 63.2% |
| Articulated Average | 14.3% | 38.6% | 62.0% | 70.9% |
| Machinery Packing | 20.0% | 44.0% | 8.0% | 56.0% |
| Mesh Cup Pouring | 0.0% | 62.5% | 65.0% | 67.5% |
| Cylinder Handover | 0.0% | 76.5% | 20.0% | 86.6% |
| Industrial Average | 6.7% | 61.0% | 31.0% | 70.0% |
| Coordination Part 1 | 45.0% | 65.0% | 70.0% | 80.0% |
| Coordination Part 2 | 10.0% | 60.0% | 30.0% | 85.0% |
| Coordination Average | 27.5% | 62.5% | 50.0% | 82.5% |
| Average | 10.2% | 46.4% | 42.6% | 76.8% |

The following are the results from Table 6 of the original paper:

| Hyperparameter | Pre-training Value | Post-training Value |
| --- | --- | --- |
| Learning rate | 1e-4 | 1e-4 |
| Optimizer | AdamW | AdamW |
| Adam beta1 | 0.95 | 0.95 |
| Adam beta2 | 0.999 | 0.999 |
| Adam epsilon | 1e-8 | 1e-8 |
| Weight decay | 1e-5 | 1e-5 |
| LR scheduler | cosine | cosine |
| Warmup ratio | 0.05 | 0.05 |
| Batch size | 16,384 | 128 or 1024 |
| Gradient steps | 200,000 | 20,000 to 60,000 |
| Backbone's vision encoder | unfrozen | unfrozen |
| Backbone's text tokenizer | frozen | frozen |
| DiT | unfrozen | unfrozen |

The following are the results from Table 7 of the original paper:

| Dataset | Length (Frames) | Duration (hr) | FPS | Camera View | Category |
| --- | --- | --- | --- | --- | --- |
| GR-1 Teleop Pre-Training | 6.4M | 88.4 | 20 | Egocentric | Real robot |
| DROID (OXE) | 23.1M | 428.3 | 15 | Left, right, wrist | Real robot |
| RT-1 (OXE) | 3.7M | 338.4 | 3 | Egocentric | Real robot |
| Language Table (OXE) | 7.0M | 195.7 | 10 | Front-facing | Real robot |
| Bridge-v2 (OXE) | 2.0M | 111.1 | 5 | Shoulder, left, right, wrist | Real robot |
| MUTEX (OXE) | 362K | 5.0 | 20 | Wrist | Real robot |
| Plex (OXE) | 77K | 1.1 | 20 | Wrist | Real robot |
| RoboSet (OXE) | 1.4M | 78.9 | 5 | Left, right, wrist | Real robot |
| Agibot-Alpha | 213.8M | 1,979.4 | 30 | Egocentric, left, right | Real robot |
| RH20T-Robot | 4.5M | 62.5 | 20 | Wrist | Real robot |
| Ego4D | 154.4M | 2,144.7 | 20 | Egocentric | Human |
| Ego-Exo4D | 8.9M | 123.0 | 30 | Egocentric | Human |
| Assembly-101 | 1.4M | 19.3 | 20 | Egocentric | Human |
| HOI4D | 892K | 12.4 | 20 | Egocentric | Human |
| HoloAssist | 12.2M | 169.6 | 20 | Egocentric | Human |
| RH20T-Human | 1.2M | 16.3 | 20 | Egocentric | Human |
| EPIC-KITCHENS | 2.3M | 31.7 | 20 | Egocentric | Human |
| GR-1 Simulation Pre-Training | 125.5M | 1,742.6 | 20 | Egocentric | Simulation |
| GR-1 Neural Videos | 23.8M | 827.3 | 8 | Egocentric | Neural-generated |
| Total robot data | 262.3M | 3,288.8 | | | |
| Total human data | 181.3M | 2,517.0 | | | |
| Total simulation data | 125.5M | 1,742.6 | | | |
| Total neural data | 23.8M | 827.3 | | | |
| Total | 592.9M | 8,375.7 | | | |

6.3. Ablation Studies / Parameter Analysis

  • Neural trajectories co-training ablation (Figure 9): Demonstrated consistent gains from co-training with neural trajectories across data regimes in both simulation and real deployment.
  • LAPA vs. IDM labels:
    • Low-data (30 demos): LAPA slightly outperforms IDM.

    • More data (100/300 demos): IDM gains more, likely due to improved alignment of pseudo-actions with real-world control as IDM sees more data.

      The following figure (Figure 9 in the original paper) summarizes these ablations:

      Figure 9: Neural Trajectories Ablations. In RoboCasa simulation, neural trajectories are used for post-training across 3 data regimes (30, 100, and 300 demos per task). In the real world, results are shown only for the low-data regime (10% of the demonstrations). Co-training uses 3k neural trajectories per task for RoboCasa and 10 neural trajectories per task for real-world tasks. Both latent- and IDM-labeled actions are explored in simulation; only IDM-labeled actions are used on the real robot. The chart compares success rates of GR00T-N1-2B against Diffusion Policy and the different label combinations across these regimes.

6.4. Qualitative Analyses

  • Pretraining checkpoint behavior (Figure 11): The model can perform two-handed handovers with jerkier motions, implying learned coordination priors.

    Figure 11: Pre-training Qualitative Example. When the pretrained GR00T-N1-2B model is prompted with a post-training task instruction, the robot picks up the apple and places it into the basket via a two-handed handover, albeit with jerkier motion.

  • Post-training comparisons (Figure 12): GR00T N1 exhibits smoother motions and more accurate grasps than Diffusion Policy, resulting in higher success in tasks like “Placemat to Basket” and “Cutting Board to Pan.”

    Figure 12: Post-training Qualitative Example. (Top) Post-trained GR00T-N1-2B successfully places the cucumber into the basket, whereas the Diffusion Policy fails due to an inaccurate grasp. (Bottom) The post-trained model successfully places the object into the pan, while the Diffusion Policy remains stuck.

7. Conclusion & Reflections

7.1. Conclusion Summary

  • GR00T N1 is an open VLA foundation model for generalist humanoid robotics with a dual-system architecture (VLM reasoning + DiT action generation) trained end-to-end.
  • A unified “data pyramid” strategy integrates real robot data, simulation trajectories, neural generative videos, and human egocentric videos by converting action-less videos into latent/pseudo action labels.
  • Across simulation and real GR-1 humanoid tasks, GR00T N1 outperforms strong baselines (BC-Transformer, Diffusion Policy), is highly data-efficient, and benefits further from co-training with neural trajectories.

7.2. Limitations & Future Work

  • Current focus: Short-horizon tabletop manipulation; long-horizon loco-manipulation is future work (requires hardware, architecture, and data advances).
  • Synthetic data quality: Neural/simulation generation can struggle with physics fidelity and diversity; improving generators and automated synthesis systems is an active direction.
  • VLM advancements: Stronger spatial reasoning, language understanding, and cross-modal grounding are expected to further boost performance.

7.3. Personal Insights & Critique

  • Strengths:
    • Elegant, tightly coupled architecture with practical throughput.
    • Holistic data strategy to break “data islands,” making web-scale human videos usable for robot control via latent actions and IDM.
    • Clear empirical wins with rigorous benchmarks and public release, accelerating community progress.
  • Potential issues:
    • Latent action transfer relies on learned inverse dynamics between frames—if scene dynamics or camera viewpoints significantly differ from robot observations, misalignment may occur; robust domain adaptation methods (e.g., contrastive alignment, cycle-consistency) might further improve.
    • IDM pseudo-actions are only as good as the IDM; assessing label noise sensitivity and developing confidence-aware weighting during training could help.
    • Real-world evaluation breadth is promising, but more tasks with long-horizon planning, locomotion, and safety-critical handling would solidify generalist claims.
  • Transferability:
    • The dual-system VLA + flow-matched DiT pattern is broadly applicable to other embodied agents (mobile manipulators, quadrupeds) and potentially to multi-agent coordination beyond bimanual tasks.

    • The data pyramid and latent action labeling could benefit AR/VR embodied assistance, industrial process automation, and teleoperation training simulators.

