
RT-1: Robotics Transformer for Real-World Control at Scale

Published: 12/14/2022
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

RT-1 is a Robotics Transformer for general-purpose, real-world robot control, trained with open-ended, task-agnostic learning on a large, diverse dataset. Through extensive real-world data collection and evaluation, the paper shows that the model's performance and generalization scale with data size, model capacity, and data diversity.

Abstract

By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. While this capability has been demonstrated in other fields such as computer vision, natural language processing or speech recognition, it remains to be shown in robotics, where the generalization capabilities of the models are particularly critical due to the difficulty of collecting real-world robotic data. We argue that one of the keys to the success of such general robotic models lies with open-ended task-agnostic training, combined with high-capacity architectures that can absorb all of the diverse, robotic data. In this paper, we present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties. We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks. The project's website and videos can be found at robotics-transformer1.github.io

In-depth Reading

1. Bibliographic Information

1.1. Title

RT-1: Robotics Transformer for Real-World Control at Scale

The title clearly states the paper's core contribution: a new model named RT-1 (Robotics Transformer 1), which is based on the Transformer architecture and is designed for large-scale, real-world robotic control tasks.

1.2. Authors

The paper is authored by a large team of researchers from three main groups: Robotics at Google, Everyday Robots, and Google Research, Brain Team. Everyday Robots was a subsidiary of Google's parent company, Alphabet, focused on building general-purpose learning robots, and much of its team and technology was later integrated into Google's robotics efforts. The author list includes prominent figures in robotics and machine learning like Chelsea Finn and Sergey Levine, known for their work in reinforcement learning, meta-learning, and robot learning. The extensive author list and affiliations highlight that this is a large-scale, industry-led research project requiring significant resources in terms of hardware (robot fleets), data collection, and computational power.

1.3. Journal/Conference

The paper was first released as a preprint on arXiv. While arXiv is not a peer-reviewed venue itself, it is a standard platform for disseminating cutting-edge research quickly. The work was subsequently published at Robotics: Science and Systems (RSS) 2023, a top-tier, highly selective conference covering robotics and robot learning. Publication at RSS signifies that the work has passed rigorous peer review and is considered a significant contribution to the field.

1.4. Publication Year

The paper was first submitted to arXiv on December 13, 2022.

1.5. Abstract

The abstract outlines the central thesis: modern machine learning models achieve high performance by transferring knowledge from large, diverse datasets, a paradigm that has been successful in fields like computer vision and NLP but remains underexplored in robotics. The authors argue that a general-purpose robotics model requires two key elements: open-ended, task-agnostic training on diverse robotic data and a high-capacity architecture capable of absorbing this data. They introduce the Robotics Transformer (RT-1) as such a model. The paper validates RT-1's scalability and generalization capabilities through extensive real-world experiments, analyzing its performance as a function of data size, model size, and data diversity. The experiments were conducted using a large dataset collected from real robots performing real-world tasks.

  • Original Source Link: https://arxiv.org/abs/2212.06817
  • PDF Link: https://arxiv.org/pdf/2212.06817v2.pdf
  • Publication Status: The paper is available as a preprint on arXiv and was subsequently published at Robotics: Science and Systems (RSS) 2023.

2. Executive Summary

2.1. Background & Motivation

  • Core Problem: The core problem is how to build a single, general-purpose robotic control model that can perform a wide variety of tasks and generalize to new tasks, objects, and environments, especially in the real world where data collection is expensive and difficult.
  • Existing Gaps: Traditional robot learning focuses on training specialized models for single or narrow multi-task settings. This "siloed" approach is inefficient and does not scale well because it requires collecting large, task-specific datasets for every new skill. While large pre-trained models (like foundation models) have revolutionized NLP and computer vision by learning from vast, diverse web-scale data, a similar breakthrough has not yet occurred in robotics. A major challenge is the lack of large, diverse, and standardized robotic datasets. Furthermore, existing models often struggle with the trade-off between high capacity (needed to learn from diverse data) and inference speed (critical for real-time robot control).
  • Innovative Idea: The paper's central idea is to apply the "large model" paradigm to robotics. This involves two synergistic components:
    1. Large-scale, diverse data collection: Amassing a broad dataset of robot experiences across hundreds of tasks.

    2. A scalable, efficient model architecture: Proposing the Robotics Transformer 1 (RT-1), which combines the high capacity of Transformers with architectural innovations to make it efficient enough for real-time control (running at 3Hz).

      The hypothesis is that by training a high-capacity model on a sufficiently large and diverse dataset, it can learn generalizable patterns of perception and action, enabling it to perform new tasks in a zero-shot or few-shot manner.

2.2. Main Contributions / Findings

  • The RT-1 Architecture: The paper introduces a novel and efficient Transformer-based model for robotic control. Key features include:
    • Early Fusion of Language and Vision: It uses FiLM layers to inject language instruction embeddings into an EfficientNet vision backbone, allowing the model to extract task-relevant visual features from the start.
    • Efficient Tokenization: It employs TokenLearner to compress the visual feature map into a small, fixed number of tokens, drastically reducing the computational cost of the subsequent Transformer layers.
    • Real-time Inference: These design choices enable the 35M parameter model to run at 3Hz, which is crucial for closed-loop control on a real robot.
  • Large-Scale Real-World Dataset and Evaluation: The authors collected a massive dataset of ~130,000 demonstrations for over 700 tasks using a fleet of 13 real robots over 17 months. They also conducted one of the largest real-world evaluations in robotics literature, with over 3000 trials to test performance, generalization, and robustness.
  • Demonstrated Scalability and Generalization: The experiments show that RT-1 significantly outperforms prior methods. It achieves a 97% success rate on seen tasks and generalizes well to unseen tasks (76% success), distracting objects (83% success), and new environments (59% success).
  • Heterogeneous Data Absorption: RT-1 is shown to be an effective "data sponge." It can be trained on a mix of data from different sources, such as real-world data and simulation data, or even data from different robot morphologies (Everyday Robots and Kuka arms), improving its capabilities without degrading performance on original tasks.
  • Data Diversity over Quantity: Ablation studies reveal a key insight for the field: data diversity (the number of distinct tasks) is more critical for generalization than raw data quantity (the number of demonstrations per task).

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Imitation Learning (IL): This is a machine learning paradigm where an agent learns to perform a task by observing expert demonstrations. The goal is to learn a policy (a mapping from observations to actions) that mimics the expert's behavior. The most common form of IL is Behavioral Cloning (BC), which treats the problem as a standard supervised learning task. Given a dataset of state-action pairs $(s, a)$ from an expert, BC learns a policy $\pi(a \mid s)$ by minimizing the difference between the predicted action and the expert's action (e.g., using mean squared error for continuous actions or cross-entropy for discrete actions). RT-1 is trained using behavioral cloning on a large dataset of human-teleoperated demonstrations (a minimal training-step sketch follows this list).
  • Transformer Architecture: Originally proposed for machine translation in the paper "Attention Is All You Need," the Transformer is a neural network architecture that relies entirely on self-attention mechanisms to process sequential data. Unlike Recurrent Neural Networks (RNNs), it can process all elements of a sequence in parallel, making it highly efficient for large-scale training. Its ability to capture long-range dependencies has made it the dominant architecture in NLP (e.g., GPT, BERT) and increasingly popular in computer vision (e.g., Vision Transformer). RT-1 uses a decoder-only Transformer to model the sequence of observations and output actions.
  • EfficientNet: This is a family of Convolutional Neural Network (CNN) architectures that achieves high accuracy with much fewer parameters than previous models. Its core idea is a principled method called compound scaling, which uniformly scales the network's depth, width, and resolution in a balanced way. RT-1 uses a pre-trained EfficientNet-B3 as its vision backbone to extract features from camera images.
  • FiLM (Feature-wise Linear Modulation): FiLM is a general-purpose conditioning method. It allows one neural network's output to influence the behavior of another by applying an affine (linear) transformation to the intermediate features of the second network. Specifically, for a feature map $F$, a FiLM layer predicts a scaling factor $\gamma$ and a shifting factor $\beta$ from some conditioning input (like a language embedding). The output is then $\gamma \cdot F + \beta$. RT-1 uses FiLM layers to condition the EfficientNet vision backbone on the language instruction, enabling the network to focus on visual features relevant to the given task.
  • TokenLearner: This is a module designed to reduce the number of tokens in a sequence for more efficient processing by subsequent Transformer layers. It learns to generate a smaller number of "semantic" tokens by taking a weighted average of the original tokens. The weights are learned dynamically using an attention-like mechanism. This is crucial for RT-1's efficiency, as it compresses 81 visual tokens per image down to just 8, significantly reducing the quadratic complexity of the self-attention mechanism in the Transformer.
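
A quick way to see what the behavioral-cloning objective looks like in practice: since RT-1 discretizes actions into tokens (Section 4.2.5), BC reduces to per-dimension classification. Below is a minimal, hypothetical PyTorch sketch of one training step; `policy`, `observations`, and the tensor shapes are illustrative assumptions that follow the paper's description (11 action dimensions, 256 bins), not the released implementation.

```python
import torch
import torch.nn.functional as F

def bc_training_step(policy, optimizer, observations, expert_action_tokens):
    """One behavioral-cloning step: minimize cross-entropy between the policy's
    predicted action-token distributions and the expert's discretized actions,
    i.e. maximize the log-probability of the expert's action tokens."""
    # logits: (batch, 11 action dims, 256 bins); expert tokens: (batch, 11), dtype long
    logits = policy(observations)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.shape[-1]),   # (batch * 11, 256)
        expert_action_tokens.reshape(-1),       # (batch * 11,)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```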

3.2. Previous Works

  • Gato (Reed et al., 2022): Proposed by DeepMind, Gato is a "generalist agent" designed to perform a wide range of tasks across different modalities (text, images, robotics). It uses a single, large Transformer network to process tokenized data from various domains. In the context of robotics, Gato was trained on a single real-world block stacking task. The authors of RT-1 compare against a Gato-like architecture, noting that while the concept is similar, RT-1's design is specifically optimized for real-time control and better language-vision fusion. RT-1's evaluation is also far more extensive in terms of real-world task diversity and generalization.
  • BC-Z (Jang et al., 2021): This work focuses on zero-shot task generalization through imitation learning. It uses a simpler, non-Transformer architecture (a ResNet-based policy) conditioned on language embeddings. BC-Z demonstrated that a policy trained on a multi-task dataset could generalize to new instructions by combining known verbs and nouns. It was a key component of the SayCan system. RT-1 builds on this idea but scales it up with a higher-capacity Transformer model and a much larger dataset, leading to superior performance.
  • SayCan (Ahn et al., 2022): This system grounds large language models (LLMs) in robotic affordances. It uses an LLM to break down high-level natural language commands (e.g., "I spilled my drink, can you help?") into a sequence of feasible, low-level skills. The feasibility of each potential step is scored by a "value function" (an affordance model) that predicts the robot's likelihood of successfully completing that skill. SayCan used BC-Z as its underlying policy for executing these skills. The RT-1 paper shows that by replacing BC-Z with the more capable RT-1, the system can perform longer and more complex tasks with higher reliability.
  • Decision Transformer (Chen et al., 2021): This work framed reinforcement learning as a sequence modeling problem. Instead of learning a traditional policy or value function, it uses a Transformer to model trajectories (sequences of states, actions, and returns) and predict future actions autoregressively. This inspired RT-1's approach of treating robotic control as a sequence-to-sequence problem, though RT-1 uses an imitation learning objective rather than an RL one.

3.3. Technological Evolution

The field of robot learning has evolved from simple, engineered controllers to learning-based methods.

  1. Early Machine Learning: Initial works used traditional machine learning (e.g., SVMs) on handcrafted features for specific tasks like grasping.

  2. Deep Learning for Perception: The rise of deep learning led to end-to-end models, often using CNNs to process raw pixel inputs for tasks like grasping (e.g., Lenz et al., 2015) or manipulation.

  3. Multi-Task and Language-Conditioned Learning: Researchers began training single policies for multiple tasks, often conditioned on goal images or language instructions (e.g., BC-Z). This aimed to improve generalization by sharing knowledge across tasks.

  4. Large-Scale Data and Models: Inspired by successes in NLP and vision, the focus shifted towards scaling. Projects like QT-Opt (Kalashnikov et al., 2018) showed that massive data collection (using RL) could lead to robust grasping. RoboNet (Dasari et al., 2019) aggregated data from many different robots to study cross-robot generalization.

    RT-1 represents the culmination of this trend, combining large-scale, multi-task, language-conditioned learning with a state-of-the-art, high-capacity Transformer architecture specifically adapted for the constraints of real-world robotics. It is a direct attempt to create a "foundation model" for robotics.

3.4. Differentiation Analysis

Compared to previous works, RT-1's innovations are:

  • Architecture for Real-Time Control: Unlike generic models like Gato, which were not optimized for inference speed on a robot, RT-1 introduces specific components like TokenLearner and an efficient vision backbone to achieve the 3Hz control frequency necessary for practical use.
  • Early Language-Vision Fusion: RT-1 uses FiLM layers to condition the vision model on language from the earliest stages. This contrasts with "late fusion" methods where language and vision features are combined only before the final decision layers. Early fusion allows the model to actively seek out task-relevant visual information, which proves crucial for robustness against distractors.
  • Scale of Real-World Data and Evaluation: The sheer scale of RT-1's dataset (~130k real-world episodes, 700+ tasks) and evaluation (>3000 real-world trials) is a major step up from most prior academic and even industrial research, which often relied on smaller datasets or simulation.
  • Focus on Data Absorption: RT-1 is explicitly designed and tested as a "data absorbent" model. The experiments showing it can effectively learn from simulation data and data from different robots provide strong evidence for its potential as a generalist backbone model that can be continuously improved with new, heterogeneous data sources.

4. Methodology

4.1. Principles

The core principle of RT-1 is to frame robotic manipulation as a sequence modeling problem, solvable with a powerful and scalable Transformer architecture. The system takes a history of recent camera images and a natural language instruction as input and outputs a sequence of robot actions. The goal is to create a model that is both high-capacity (to absorb knowledge from a large and diverse dataset) and computationally efficient (to enable real-time control on a physical robot).

The methodology achieves this through a carefully designed architecture that tokenizes high-dimensional inputs (images) into a compact representation before feeding them to a Transformer backbone. This avoids the prohibitive computational cost of applying a Transformer directly to raw pixels or dense feature maps.

4.2. Core Methodology In-depth (Layer by Layer)

The RT-1 model operates in a closed loop. At each timestep $t$, it receives the current image observation $x_t$ and processes it along with a history of the last 5 images and the constant language instruction $i$. It then outputs the action $a_t$ for the robot to execute. The overall architecture is shown in Figure 3 of the paper.

Figure 3: The architecture diagram of RT-1. The instruction is transformed into a USE embedding and used to condition a pre-trained EfficientNet via FiLM layers. The resulting vision-language tokens are reduced by TokenLearner and fed into a decoder-only Transformer, which outputs tokenized actions.

The process can be broken down into the following integrated steps:

4.2.1. Input Representation

  • Image History: The model takes a sequence of the last 6 images as input, $\{x_{t-5}, \ldots, x_t\}$. This history provides temporal context, allowing the model to infer dynamics and the state of an ongoing action (e.g., whether the gripper is closing). The images have a resolution of $300 \times 300$ pixels.
  • Language Instruction: A natural language command, such as "pick up the apple", is provided at the start of the episode. This instruction, denoted by $i$, remains constant throughout the episode.

4.2.2. Instruction and Image Tokenization

This stage converts the raw image and text inputs into a sequence of tokens suitable for the Transformer.

  1. Instruction Embedding: The natural language instruction ii is first passed through a pre-trained Universal Sentence Encoder (USE).

    • Universal Sentence Encoder (USE): This is a model developed by Google that encodes sentences into high-dimensional vectors (512 dimensions in this case). These embeddings capture the semantic meaning of the sentence, such that similar sentences are mapped to nearby points in the vector space.
    • This step produces a single, fixed-size language embedding vector that represents the task goal.
  2. Vision Backbone with FiLM Conditioning: The sequence of 6 images is processed by a vision backbone to extract visual features. This backbone is an EfficientNet-B3 model, pre-trained on the ImageNet dataset. The key innovation here is how the language embedding is used to modulate the visual processing.

    • The language embedding from USE is fed into several FiLM (Feature-wise Linear Modulation) layers.
    • These FiLM layers are inserted into the EfficientNet architecture. At specific blocks within the network, the FiLM layer uses the language embedding to generate a scaling parameter $\gamma$ and a shifting parameter $\beta$. These parameters are then applied element-wise to the feature maps of the EfficientNet.
    • Identity Initialization: A crucial detail is that the weights of the dense layers that produce $\gamma$ and $\beta$ are initialized to zero. This makes the initial transformation $\gamma = 1$ and $\beta = 0$ (after passing through activations and additions), meaning the FiLM layer initially acts as an identity function. This prevents the randomly initialized FiLM layer from disrupting the activations of the pre-trained EfficientNet at the beginning of training, thus preserving the benefit of ImageNet pre-training.
    • This early-fusion process results in a spatial feature map of shape $9 \times 9 \times 512$ for each of the 6 input images. The feature map is then flattened into 81 visual tokens per image. These are not just visual tokens; they are vision-language tokens, as they have been conditioned on the task instruction (a minimal FiLM sketch follows this list).
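
Shown below is a minimal sketch of such a FiLM block with the identity initialization described above, written in the common residual parameterization $(1 + \gamma) \cdot F + \beta$ so that zero-initialized projections start as an identity transform. The class name, channel count, and feature-map shapes are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Conditions a CNN feature map on a language embedding via a per-channel
    affine transform: out = (1 + gamma) * features + beta."""

    def __init__(self, embed_dim: int, num_channels: int):
        super().__init__()
        self.to_gamma = nn.Linear(embed_dim, num_channels)
        self.to_beta = nn.Linear(embed_dim, num_channels)
        # Zero initialization -> gamma = beta = 0 at the start of training, so the
        # layer acts as an identity and does not disturb pre-trained EfficientNet features.
        for layer in (self.to_gamma, self.to_beta):
            nn.init.zeros_(layer.weight)
            nn.init.zeros_(layer.bias)

    def forward(self, features, lang_embedding):
        # features: (B, C, H, W); lang_embedding: (B, embed_dim), e.g. a 512-d USE vector
        gamma = self.to_gamma(lang_embedding)[:, :, None, None]
        beta = self.to_beta(lang_embedding)[:, :, None, None]
        return (1.0 + gamma) * features + beta

# Illustrative shapes only: an intermediate EfficientNet block output and a USE embedding.
film = FiLM(embed_dim=512, num_channels=48)
features = torch.randn(2, 48, 38, 38)
lang = torch.randn(2, 512)
out = film(features, lang)   # same shape as `features`
```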

4.2.3. Token Compression with TokenLearner

The output from the previous step is $6 \times 81 = 486$ tokens. Processing this many tokens with a Transformer would be computationally expensive. To address this, the paper uses TokenLearner.

  • For each of the 6 feature maps, the TokenLearner module is applied. It takes the 81 tokens as input and learns to dynamically generate a much smaller set of 8 tokens.
  • TokenLearner works by using an attention-like mechanism. It computes "attention weights" for each of the 81 input tokens, not to select them, but to compute 8 different weighted combinations of them. This allows the model to learn to summarize the important spatial information from the 81 tokens into just 8.
  • After this step, the input sequence is reduced to $6 \times 8 = 48$ tokens in total (a simplified TokenLearner sketch follows this list).
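
The sketch below captures the core of this idea: a small learned head predicts 8 attention maps over the 81 input tokens, and each output token is the corresponding attention-weighted average of the inputs. The real TokenLearner module uses a somewhat richer gating network; names and dimensions here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TokenLearner(nn.Module):
    """Compresses N input tokens into K learned tokens (81 -> 8 per image in RT-1):
    a linear head predicts K attention maps over the inputs, and each output token
    is the attention-weighted average of the input tokens."""

    def __init__(self, token_dim: int, num_output_tokens: int = 8):
        super().__init__()
        self.attn = nn.Sequential(
            nn.LayerNorm(token_dim),
            nn.Linear(token_dim, num_output_tokens),
        )

    def forward(self, tokens):
        # tokens: (B, N, D) -> attention weights: (B, N, K), normalized over the N inputs
        weights = torch.softmax(self.attn(tokens), dim=1)
        # Weighted averages: (B, N, K) x (B, N, D) -> (B, K, D)
        return torch.einsum("bnk,bnd->bkd", weights, tokens)

tl = TokenLearner(token_dim=512, num_output_tokens=8)
image_tokens = torch.randn(2, 81, 512)   # flattened 9x9 vision-language feature map
compressed = tl(image_tokens)            # (2, 8, 512)
```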

4.2.4. Transformer Backbone

The 48 vision-language tokens are the final input to the core of the RT-1 model: a decoder-only Transformer.

  • Positional Encoding: Before being fed into the Transformer, a positional encoding is added to each of the 48 tokens to give the model information about their temporal position (i.e., which of the 6 timesteps they belong to).
  • Architecture: The Transformer consists of 8 self-attention layers. It is "decoder-only," meaning it processes the input sequence and directly generates the output without a separate encoder-decoder structure. It has 19M parameters.
  • Function: The Transformer applies self-attention over the 48 tokens, allowing it to weigh the importance of different spatial features at different points in time to make a decision. The output of the Transformer is a prediction for the action tokens (a rough sketch of this backbone follows this list).
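
The following is a rough, hypothetical sketch of such a backbone: learned positional embeddings over the 48 tokens, a stack of 8 self-attention layers, and a head that emits all 11 x 256 action logits in parallel. PyTorch's generic Transformer encoder layers stand in for the decoder-only stack, and the hidden size, head count, and mean-pooled readout are placeholder choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SequenceBackbone(nn.Module):
    """Sketch of RT-1's sequence model: positional embeddings over 48 tokens
    (6 timesteps x 8 TokenLearner tokens), 8 self-attention layers, and a
    parallel readout of 11 action dimensions x 256 bins."""

    def __init__(self, token_dim=512, num_tokens=48, num_layers=8, num_heads=8):
        super().__init__()
        self.pos_embedding = nn.Parameter(torch.zeros(1, num_tokens, token_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=token_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.action_head = nn.Linear(token_dim, 11 * 256)

    def forward(self, tokens):
        # tokens: (B, 48, D)
        x = self.encoder(tokens + self.pos_embedding)
        # Placeholder readout: pool over the sequence, then predict all 11
        # action-token distributions at once (no autoregressive decoding).
        logits = self.action_head(x.mean(dim=1))
        return logits.view(-1, 11, 256)

backbone = SequenceBackbone()
action_logits = backbone(torch.randn(2, 48, 512))   # (2, 11, 256)
```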

4.2.5. Action Representation and Output

RT-1 uses a discretized action space, which the authors found to be more effective than continuous actions for learning multi-modal behaviors (i.e., when multiple different actions could be correct in a given situation).

  • Action Dimensions: The robot's action is 11-dimensional:
    • Arm (7-DoF): 3D position change (x, y, z), 3D orientation change (roll, pitch, yaw), and gripper opening.
    • Base (3-DoF): 2D position change (x, y) and yaw change.
    • Terminate/Mode Switch (1-D): A discrete variable to switch control between the arm and base, or to terminate the episode.
  • Action Tokenization: Each continuous action dimension is discretized into 256 uniform bins, and the discrete terminate/mode dimension is mapped onto the same 256-value representation, giving 11 tokenized dimensions in total. This turns action prediction from regression into 11 separate 256-way classification problems: for each dimension, the model predicts one of 256 possible bins (a small discretization sketch follows this list).
  • Output: The Transformer outputs a probability distribution over the 256 bins for each of the 11 action dimensions. The model is trained via behavioral cloning to maximize the log-probability of the expert's action tokens from the demonstration data. The loss function is a standard categorical cross-entropy loss.
  • Inference: During execution, the model does not generate the 11 action tokens autoregressively (i.e., predicting the token for dimension 2 based on the predicted token for dimension 1). Instead, it predicts all 11 distributions in parallel. This significantly speeds up inference compared to autoregressive generation.
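
A minimal sketch of the 256-bin discretization and its inverse is given below. The per-dimension bounds and the rounding convention are assumptions for illustration; the paper only states that each dimension is split into 256 uniform bins.

```python
import numpy as np

NUM_BINS = 256

def discretize(value, low, high):
    """Map a continuous action value in [low, high] to one of 256 uniform bins."""
    value = float(np.clip(value, low, high))
    return int(round((value - low) / (high - low) * (NUM_BINS - 1)))

def undiscretize(bin_index, low, high):
    """Map a bin index back to a continuous value for execution on the robot."""
    return low + bin_index * (high - low) / (NUM_BINS - 1)

# Example with assumed bounds of +/- 0.1 m for an arm displacement dimension.
token = discretize(0.03, low=-0.1, high=0.1)         # integer in [0, 255]
recovered = undiscretize(token, low=-0.1, high=0.1)  # approximately 0.03
```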

4.2.6. Inference Speed Optimizations

Two key techniques are used to achieve the required 3Hz control frequency:

  1. TokenLearner: As described above, this reduces the number of tokens the Transformer must process, resulting in a 2.4x speedup.

  2. Reusing Image Tokens: Since the model uses a sliding window of 6 images, the computations for 5 of those images overlap between consecutive timesteps. The model caches and reuses the computed image tokens (the 48 tokens fed to the Transformer) from the previous step, only computing new tokens for the newest image frame. This provides an additional 1.7x speedup (a toy sketch of this caching scheme follows below).

    Together, these optimizations bring the inference latency down to a manageable level for real-time control.
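
A toy sketch of this sliding-window reuse is shown below; `tokenize_fn` is a hypothetical helper standing in for the FiLM-conditioned EfficientNet plus TokenLearner applied to a single frame.

```python
from collections import deque

class ImageTokenCache:
    """Keeps the 8 TokenLearner tokens for each of the last 6 frames. At every
    control step only the newest frame is tokenized; the other 5 frames' tokens
    are reused, which is the source of the reported ~1.7x speedup."""

    def __init__(self, history_len: int = 6):
        self.history = deque(maxlen=history_len)

    def step(self, new_image, tokenize_fn):
        # tokenize_fn(image) -> that frame's 8 tokens; cached frames are not recomputed.
        self.history.append(tokenize_fn(new_image))
        return list(self.history)   # up to 6 x 8 = 48 tokens for the Transformer
```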

5. Experimental Setup

5.1. Datasets

  • Source and Scale: The primary dataset was collected in-house using a fleet of 13 Everyday Robots mobile manipulators over 17 months. The robots have a 7-DoF arm, a two-fingered gripper, and a mobile base. The dataset consists of approximately 130,000 demonstrations of human teleoperated episodes.

  • Environment: Data was collected in "robot classrooms" — mock office kitchen environments designed for large-scale data collection.

  • Tasks and Diversity: The dataset is highly diverse, covering over 700 distinct task instructions. The paper groups these instructions into skills based on the verb (e.g., "pick," "place," "open drawer"). The tasks involve a wide variety of objects. The following table summarizes the main skills.

    Skill | Count | Description | Example Instruction
    Pick Object | 130 | Lift the object off the surface | pick iced tea can
    Move Object Near Object | 337 | Move the first object near the second | move pepsi can near rxbar blueberry
    Place Object Upright | 8 | Place an elongated object upright | place water bottle upright
    Knock Object Over | 8 | Knock an elongated object over | knock redbull can over
    Open Drawer | 3 | Open any of the cabinet drawers | open the top drawer
    Close Drawer | 3 | Close any of the cabinet drawers | close the middle drawer
    Place Object into Receptacle | 84 | Place an object into a receptacle | place brown chip bag into white bowl
    Pick Object from Receptacle and Place on the Counter | 162 | Pick an object up from a location and then place it on the counter | pick green jalapeno chip bag from paper bowl and place on counter
    Section 6.3 and 6.4 tasks | 9 | Skills trained for realistic, long instructions | pull napkin out of dispenser; grab scooper
    Total | 744 | |
  • Data Example: A single data sample (episode) consists of a language instruction $i$ (e.g., "pick coke can") and a sequence of observation-action pairs $\{(x_t, a_t)\}_{t=0}^{T}$, where $x_t$ is the robot's camera image and $a_t$ is the teleoperated action.

  • Heterogeneous Data: For specific experiments, this primary dataset was augmented with:

    • Simulation Data: 518k trajectories from a simulated environment, including objects not seen in the real world.

    • Kuka Robot Data: 209k grasping episodes from the QT-Opt dataset, collected on a different robot (Kuka IIWA) with a different action space.

      These datasets were chosen to rigorously test RT-1's ability to learn from large, diverse, real-world data and its capacity to absorb heterogeneous data from different sources.

5.2. Evaluation Metrics

The primary evaluation metric used throughout the paper is Success Rate.

  1. Conceptual Definition: Success Rate measures the percentage of trials in which the robot successfully completes the task specified by the language instruction. For each trial, a human evaluator observes the robot's behavior and gives a binary score: 1 for success, 0 for failure. This is the most direct and intuitive measure of a policy's real-world performance.
  2. Mathematical Formula: $\text{Success Rate} = \frac{\sum_{j=1}^{N} \text{IsSuccess}(j)}{N} \times 100\%$
  3. Symbol Explanation:
    • $N$: The total number of evaluation trials for a given task or set of tasks.

    • $\text{IsSuccess}(j)$: A binary function that returns 1 if trial $j$ was successful, and 0 otherwise (a one-line computation of this metric follows this list).

      This metric is applied across several evaluation categories:

  • Seen Tasks: Performance on instructions from the training distribution.
  • Unseen Tasks: Generalization to novel combinations of known skills and objects.
  • Robustness: Performance under visual perturbations, like added distractor objects or new backgrounds.
  • Long-Horizon Scenarios: Performance on complex tasks requiring multiple steps, evaluated within the SayCan framework.
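
For completeness, the metric is just an average over binary trial outcomes, as in this trivial sketch:

```python
def success_rate(trial_outcomes):
    """Percentage of successful trials; each outcome is 1 (success) or 0 (failure)."""
    return 100.0 * sum(trial_outcomes) / len(trial_outcomes)

print(success_rate([1] * 29 + [0]))   # 29 successes out of 30 trials -> ~96.7%
```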

5.3. Baselines

The paper compares RT-1 against two state-of-the-art baseline architectures, which are trained on the same large-scale dataset to ensure a fair comparison of model capabilities.

  • Gato (Reed et al., 2022): A Transformer-based generalist agent. The version used here is adapted to be of similar size to RT-1 (37M parameters) to allow for real-time inference. Key differences from RT-1 include: it does not use a pre-trained text encoder, it tokenizes image patches independently, and it uses late fusion of language and vision.

  • BC-Z (Jang et al., 2021): A ResNet-based feedforward policy. It does not use a Transformer or observation history. It uses continuous actions instead of discrete tokens. This represents a strong, simpler, non-Transformer baseline.

  • BC-Z XL: A larger version of BC-Z with a similar parameter count to RT-1, created by the authors to test if BC-Z's performance limitations were due to its smaller size.

    These baselines are representative because they cover both Transformer-based and CNN-based architectures, allowing the authors to isolate the benefits of RT-1's specific design choices.

6. Results & Analysis

6.1. Core Results Analysis

The core results demonstrate that RT-1 significantly outperforms all baselines across performance, generalization, and robustness, validating the effectiveness of its architecture and the benefit of training on a large, diverse dataset.

The following are the results from Table 2 of the original paper:

Model | Seen Tasks | Unseen Tasks | Distractors | Backgrounds
Gato (Reed et al., 2022) | 65% | 52% | 43% | 35%
BC-Z (Jang et al., 2021) | 72% | 19% | 47% | 41%
BC-Z XL | 56% | 43% | 23% | 35%
RT-1 (ours) | 97% | 76% | 83% | 59%
  • Seen Tasks: RT-1 achieves a near-perfect 97% success rate, which is a 25% absolute improvement over the best baseline (BC-Z). This shows its high capacity to memorize and execute the wide range of 700+ training tasks.
  • Unseen Tasks: RT-1's 76% success rate demonstrates strong zero-shot generalization to novel instructions. This is a 24% absolute improvement over Gato. This highlights the model's ability to recompose learned concepts (e.g., applying the "pick" skill to an object it has only seen in "move" tasks). The poor performance of BC-Z (19%) suggests that a simple CNN architecture struggles with this level of compositional generalization.
  • Robustness to Distractors: With an 83% success rate, RT-1 is highly robust to cluttered scenes with distracting objects. This is a 36% improvement over BC-Z. This result strongly supports the hypothesis that RT-1's early language-vision fusion via FiLM allows it to focus on task-relevant objects and ignore distractors.
  • Robustness to Backgrounds: RT-1 achieves 59% success in new environments with different lighting, textures, and backgrounds, an 18% improvement over the next best baseline. This indicates good visual generalization, likely benefiting from the ImageNet pre-training and the diversity of the training data.

6.1.1. Heterogeneous Data Absorption

The paper presents compelling evidence that RT-1 can act as a "data sponge."

  • Absorbing Simulation Data: The following are the results from Table 4 of the original paper:

    Models | Training Data | Real Objects: Seen Skill w/ Objects | Sim Objects (not seen in real): Seen Skill w/ Objects | Sim Objects (not seen in real): Unseen Skill w/ Objects
    RT-1 | Real Only | 92% | 23% | 7%
    RT-1 | Real + Sim | 90% (-2%) | 87% (+64%) | 33% (+26%)

    When training on a mix of real and simulation data, performance on real-world objects remains high (92% -> 90%). Crucially, the model's ability to manipulate objects only seen in simulation improves dramatically (from 23% to 87%). This shows successful sim-to-real transfer. Furthermore, generalization to unseen skills with these sim-only objects also improves significantly (7% to 33%).

  • Absorbing Data from Different Robots: The following are the results from Table 5 of the original paper:

    Models | Training Data | Classroom eval | Bin-picking eval
    RT-1 | Kuka bin-picking data + EDR data | 90% (-2%) | 39% (+17%)
    RT-1 | EDR only data | 92% | 22%
    RT-1 | Kuka bin-picking only data | 0% | 0%

    Mixing data from the Everyday Robot with grasping data from a Kuka robot (QT-Opt dataset) only minimally impacts performance on the original tasks (92% -> 90%). However, it nearly doubles the performance on a new bin-picking task that resembles the Kuka setup (22% -> 39%). This demonstrates that RT-1 can learn from the experiences of a different robot morphology and apply that knowledge, a significant step towards a universal robotics model.

6.1.2. Long-Horizon Scenarios with SayCan

The following are the results from Table 6 of the original paper:

Method | Kitchen1 Planning | Kitchen1 Execution | Kitchen2 Planning | Kitchen2 Execution
Original SayCan (Ahn et al., 2022)* | 73% | 47% | - | -
SayCan w/ Gato (Reed et al., 2022) | 87% | 33% | 87% | 0%
SayCan w/ BC-Z (Jang et al., 2021) | 87% | 53% | 87% | 13%
SayCan w/ RT-1 (ours) | 87% | 67% | 87% | 67%

When integrated into the SayCan framework for long-horizon planning, RT-1 achieves the highest execution success rate (67%). More impressively, its performance does not degrade when moved to Kitchen2, a much more challenging and visually different environment. In contrast, Gato fails completely (0%) and BC-Z's performance drops to 13%. This highlights RT-1's superior robustness and generalization, which are critical for deploying robots in real, unstructured human environments.

6.2. Ablation Studies / Parameter Analysis

The paper performs two main types of ablation studies: on the dataset and on the model architecture.

6.2.1. Data Ablations

The following are the results from Table 7 of the original paper:

Models | % Tasks | % Data | Seen Tasks | Generalization (All) | Unseen Tasks | Distractors | Backgrounds
Smaller Data
RT-1 (ours) | 100% | 100% | 97% | 73% | 76% | 83% | 59%
RT-1 | 100% | 51% | 71% | 50% | 52% | 39% | 59%
RT-1 | 100% | 37% | 55% | 46% | 57% | 35% | 47%
RT-1 | 100% | 22% | 59% | 29% | 14% | 31% | 41%
Narrower Data
RT-1 (ours) | 100% | 100% | 97% | 73% | 76% | 83% | 59%
RT-1 | 75% | 97% | 86% | 54% | 67% | 42% | 53%

This study reveals a crucial finding: data diversity is more important than data quantity.

  • Reducing the dataset size while keeping all tasks (the Smaller Data rows) leads to a graceful degradation in performance.
  • However, the Narrower Data experiment, which removes 25% of the least frequent tasks but keeps 97% of the total data, causes a much sharper drop in generalization performance (from 73% to 54%).
  • The authors note that removing 25% of tasks (losing only 3% of data) has a similar negative impact on generalization as removing 49% of the data. This strongly suggests that for building generalist robot models, prioritizing the breadth of tasks in the dataset is more effective than simply collecting more examples for existing tasks.

6.2.2. Model Ablations

The following are the results from Table 13 of the original paper, showing the performance change relative to the full RT-1 model.

Model | Seen Tasks | Unseen Tasks | Distractors | Backgrounds | Inference Time (ms)
RT-1 (ours) | 97 | 76 | 83 | 59 | 15
RT-1 w/o big model | 89 (-8) | 62 (-14) | 77 (-6) | 53 (-6) | 13.5
RT-1 w/o pre-training | 84 (-13) | 43 (-33) | 60 (-23) | 41 (-18) | 15
RT-1 w/ continuous actions | 68 (-29) | 43 (-33) | 37 (-46) | 35 (-24) | 16
RT-1 w/ auto-regressive actions | 85 (-12) | 71 (-5) | 67 (-16) | 65 (+6) | 36
RT-1 w/o history | 82 (-15) | 62 (-14) | 50 (-33) | 59 (+0) | 15
RT-1 w/o Transformer | 86 (-13) | 62 (-14) | 67 (-16) | 59 (+0) | 26

This table provides rich insights into the importance of each design choice:

  • Pre-training is Critical for Generalization: Removing ImageNet pre-training causes the largest drop in unseen task performance ( -33% ), confirming that leveraging knowledge from large, external vision datasets is hugely beneficial.
  • Action Discretization is Superior: Switching to continuous actions with a Gaussian output leads to a massive performance drop across the board, especially on distractor robustness ( -46% ). This suggests the discrete action space is better at modeling the complex, multi-modal action distributions present in the diverse dataset.
  • History is Key for Dynamic Scenes: Removing the 6-frame image history significantly hurts performance, particularly on distractor robustness ( -33% ). This implies that temporal context is vital for understanding the scene and filtering out irrelevant information.
  • The Transformer is Beneficial: Removing the Transformer component and using a simpler architecture hurts performance across the board, demonstrating the value of its high capacity and attention mechanism.
  • Autoregressive actions are not worth the cost: While autoregressive action generation slightly improves background robustness, it harms performance elsewhere and more than doubles the inference time (15ms -> 36ms), making it impractical for real-time control.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully presents and validates RT-1, a Transformer-based model for real-world robotic control that demonstrates impressive scalability and generalization. By training on a large and diverse dataset of over 130,000 real-world demonstrations, RT-1 achieves a high success rate (97%) on over 700 trained instructions and generalizes effectively to new tasks, objects, and environments, significantly outperforming prior architectures.

The key contributions are:

  1. The novel and efficient RT-1 architecture, which balances the high capacity of Transformers with the real-time inference demands of robotics.
  2. The demonstration that a single model can absorb heterogeneous data from simulation and even different robot platforms, improving its capabilities without forgetting previous skills.
  3. The crucial finding that data diversity is more important than data quantity for improving robotic generalization.
  4. The successful integration of RT-1 into the SayCan framework, enabling the execution of long-horizon, multi-stage tasks in complex, real-world kitchens.

7.2. Limitations & Future Work

The authors acknowledge several limitations:

  • Imitation Learning Ceiling: As an imitation learning method, RT-1 is fundamentally limited by the quality of its demonstration data. It cannot learn to perform tasks better than the human demonstrators.

  • Limited Motion Generalization: While RT-1 shows strong compositional generalization (combining known concepts in new ways), it is not yet capable of generating entirely novel motions that were not present in the training data.

  • Task Dexterity: The manipulation tasks, while diverse, are not highly dexterous (e.g., they do not involve complex in-hand object reorientation).

  • Environment Diversity: While tested in new kitchens, the model's robustness could be further improved by training on data from a much wider range of environments to reduce overfitting to the "office kitchen" setting.

    Future work directions include:

  • Faster Scaling: Developing methods for non-experts to contribute data, potentially through model prompting or interactive correction, to accelerate skill acquisition.

  • Improved Context and Speed: Exploring scalable attention mechanisms and memory architectures to increase RT-1's context window and reaction speed.

  • Expanding Skill Set: Continuously adding more diverse and dexterous skills to the dataset to broaden the robot's capabilities.

7.3. Personal Insights & Critique

  • A Landmark in Robot Learning: This paper feels like a significant milestone, marking a clear shift in robotics research towards the "large-scale data + large models" paradigm that has been so successful elsewhere. The sheer scale of the engineering and data collection effort is formidable and provides a strong "existence proof" that this approach is viable for real-world robotics.
  • The Power of Architectural Co-design: A key takeaway is the importance of co-designing the model architecture with the system's constraints. RT-1 is not just a generic Transformer; its efficiency comes from specific, clever choices like FiLM, TokenLearner, and non-autoregressive action decoding. This is a valuable lesson for applying large models to any resource-constrained domain.
  • "Data Diversity > Quantity" is a Powerful Insight: The data ablation study provides one of the paper's most actionable insights. For anyone working on a robotics project, it suggests that spending resources to collect data for 10 new tasks is likely more valuable than collecting 10x more data for one existing task. This has major implications for how data collection efforts should be prioritized.
  • Potential Issues and Unverified Assumptions:
    • The "Data Sponge" Metaphor: While the heterogeneous data absorption results are impressive, the experiments are still limited (one other robot type, one simulation environment). It remains an open question how well RT-1 would scale if trained on data from dozens of different robots with vastly different morphologies and action spaces. There may be a point of negative interference or "catastrophic forgetting" that was not reached here.
    • Cost and Accessibility: The methodology relies on a level of resources (robot fleets, months of data collection, large engineering teams) that is inaccessible to most academic labs. While the authors open-sourced the model code, the true "secret sauce" is the data. This raises concerns about the reproducibility and further exploration of these methods by the wider community. Future work on data-efficient learning and high-fidelity, reusable simulation will be critical to democratize this line of research.
    • Safety and Reliability: Achieving 97% on seen tasks is excellent, but for real-world deployment in human spaces, even a 3% failure rate is too high, especially if failures can be unsafe. The paper focuses on success rate but does not deeply analyze failure modes or discuss safety guarantees, which will be a critical next step for any practical application.
