Open X-Embodiment: Robotic Learning Datasets and RT-X Models
TL;DR Summary
This work presents a large standardized dataset spanning 22 robot embodiments and 527 skills, and introduces the high-capacity RT-X models for generalist cross-robot policies, demonstrating positive transfer and improved generalization across diverse tasks and platforms.
Abstract
Large, high-capacity models trained on diverse datasets have shown remarkable successes on efficiently tackling downstream applications. In domains from NLP to Computer Vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train a generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160,266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website https://robotics-transformer-x.github.io.
In-depth Reading
1. Bibliographic Information
- Title: Open X-Embodiment: Robotic Learning Datasets and RT-X Models
- Authors: The paper is attributed to the "Open X-Embodiment Collaboration," representing a large-scale joint effort by researchers from 21 institutions, including Google DeepMind, Stanford University, UC Berkeley, Carnegie Mellon University, and many others. This collaborative authorship underscores the paper's focus on community-driven data sharing.
- Journal/Conference: This paper is a preprint available on arXiv. Preprints are research articles shared publicly before or during the formal peer-review process. Given the scale and impact of the work, it is of a caliber suitable for top-tier robotics conferences like Robotics: Science and Systems (RSS) or the Conference on Robot Learning (CoRL).
- Publication Year: 2023
- Abstract: The authors address the challenge of creating generalist robotic policies, contrasting with the conventional approach of training separate models for each robot, task, and environment. Inspired by successes in NLP and Computer Vision, they propose that a consolidation of models can be achieved by training on large, diverse, multi-robot (X-embodiment) datasets. The paper introduces two main contributions: (1) a large-scale, standardized dataset collected from 22 different robots across 21 institutions, encompassing 527 skills and over 160,000 tasks; (2) a high-capacity model, named RT-X, trained on this data, which demonstrates "positive transfer": the ability to improve a robot's capabilities by leveraging experience from other, different robots.
- Original Source Link:
- Official Source: https://arxiv.org/abs/2310.08864
- PDF: https://arxiv.org/pdf/2310.08864v9.pdf
- Publication Status: This is a preprint manuscript.
2. Executive Summary
- Background & Motivation (Why): The field of robotics has historically been fragmented. Unlike natural language processing (NLP) and computer vision, where large-scale, diverse datasets (e.g., from the internet) have enabled the creation of powerful "foundation models" (like GPT-4 or CLIP) that serve as a general-purpose starting point for many applications, robotics has lacked such a unifying resource. Typically, a new policy (a robot's "brain") must be trained from scratch for every new robot, every new task, and even every new environment. This process is inefficient and limits the generalization capabilities of learned models. The central question this paper investigates is whether robotics can follow the same consolidation trend by training a single, generalist X-robot policy that works across different robot bodies ("embodiments"), tasks, and environments. The primary gap was the absence of a large, diverse, and standardized dataset from many different robots, and the lack of proof that training on such data would actually be beneficial.
- Main Contributions / Findings (What): The paper makes two landmark contributions to address this challenge:
  - The Open X-Embodiment (OXE) Dataset: The authors assembled and standardized a massive robotic learning dataset, pooling data from 21 institutions. It contains over 1 million real-robot trajectories from 22 distinct robot embodiments, demonstrating 527 unique skills. The data is provided in a consistent format to facilitate easy use by the research community, making it one of the largest and most diverse robotics datasets of its kind.
  - The RT-X Models and Positive Transfer: The paper presents RT-X, a family of models trained on the OXE dataset. Through extensive experiments, the authors show that positive transfer occurs: training on data from many different robots improves performance on a specific robot beyond what is achievable by training on that robot's data alone.
    - The RT-1-X model achieved a 50% higher average success rate on tasks from small-scale datasets compared to specialized models.
    - The larger RT-2-X model demonstrated a roughly 3x improvement in generalization to novel, "emergent" skills that were not in its own primary training data but were present in the data from other robots.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
  - Imitation Learning: A machine learning paradigm in which an agent (e.g., a robot) learns to perform a task by observing and mimicking expert demonstrations. In this paper, the 1M+ trajectories in the OXE dataset serve as expert demonstrations for training the RT-X policies.
  - Transfer Learning: The process of taking knowledge gained from solving one problem and applying it to a different but related problem. The paper's core hypothesis is that a model can achieve positive transfer by learning general manipulation principles from one robot's data and applying them to improve performance on another robot.
  - X-Embodiment Learning: A term popularized by this work, referring to the training of a single robotic policy on data collected from multiple, physically different robots (i.e., across various "embodiments").
  - Transformer: A neural network architecture that excels at processing sequential data by using a mechanism called self-attention. It can weigh the importance of different parts of the input (e.g., different images in a time series or words in a sentence) to make a prediction; a minimal sketch of self-attention follows this list. RT-1 and RT-2 are both based on Transformers.
  - Vision-Language Models (VLMs): Large-scale models pretrained on vast quantities of internet-scale image and text data. They learn rich, general-purpose representations of the visual world and its connection to language. The RT-2 model is a VLM that is fine-tuned to also output robot actions, allowing it to leverage its extensive "world knowledge" for robotic control.
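To make the self-attention mechanism concrete, here is a minimal single-head scaled dot-product attention in NumPy. This is an illustrative sketch of the generic mechanism rather than the RT-1/RT-2 implementation; the shapes and weight matrices are hypothetical.

```python
import numpy as np

def self_attention(x: np.ndarray, wq: np.ndarray, wk: np.ndarray, wv: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product self-attention over a token sequence.

    x: (seq_len, d_model) input tokens (e.g., image patches or words).
    wq, wk, wv: (d_model, d_model) learned projection matrices.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])           # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the key axis
    return weights @ v                                # attention-weighted mix of values

rng = np.random.default_rng(0)
d = 16
tokens = rng.normal(size=(10, d))                     # 10 tokens of dimension 16
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(tokens, wq, wk, wv)              # shape (10, 16)
```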
- Previous Works: The paper situates itself within three main areas of prior research:
  - Transfer across Embodiments: Previous studies have explored transferring policies between different robots, but often relied on specialized techniques to handle the "embodiment gap," such as learning shared action representations, using modular networks, or explicitly conditioning the policy on the robot's physical properties.
  - Large-scale Robot Learning Datasets: Prior efforts like RoboNet also collected data from multiple robots. However, the Open X-Embodiment dataset is significantly larger and more diverse in terms of the number of unique robots, skills, and contributing institutions.
  - Language-Conditioned Robot Learning: Many modern methods train robots to follow natural language instructions. This work builds on that paradigm, specifically adopting the architectures of RT-1 and RT-2, which are language-conditioned policies.
- Differentiation: The primary innovation of this paper is not a new model architecture but a paradigm shift demonstrated at an unprecedented scale. Instead of designing complex mechanisms to explicitly bridge the differences between robots, the authors show that a simple, brute-force approach of pooling massive, diverse data and training a high-capacity model is sufficient to achieve significant positive transfer. It is a powerful argument for data-centric, "scaling-driven" research in robotics.
4. Methodology (Core Technology & Implementation)
The core of the RT-X methodology is to train a single, large Transformer-based model on the combined Open X-Embodiment dataset. This required a pragmatic approach to handling the heterogeneity of the data.
- Principles: The guiding principle is that a high-capacity model, when exposed to a sufficiently diverse stream of "robot-task-environment" data, can learn underlying, generalizable patterns of interaction that are not specific to any single embodiment. The variation in the data itself forces the model to become more robust and abstract.
- Data Format Consolidation: A critical step was to unify the disparate datasets into a common format.
  - Observations: The model input consists of a language instruction (the task to be performed) and a history of recent images from a single "canonical" camera view. All images are resized to a standard resolution.
  - Action Space: The most significant challenge is the variation in robot hardware. The authors addressed this with a coarsely aligned 7-DOF action space. Every robot's native action is converted into a 7-dimensional vector representing end-effector control: 3D position (x, y, z), 3D rotation (roll, pitch, yaw), and a 1D gripper state (open/closed).
    - Key Detail: This is not a perfectly aligned space. The same action vector [0.1, 0, 0, ...] might mean "move 1 cm forward" for one robot but "move at a velocity of 1 cm/s forward" for another, and the coordinate frames are not globally aligned. The model must learn to interpret its own output differently depending on the embodiment context, which it implicitly infers from the input images. Actions for each dataset are normalized before being discretized into bins; the sketch below illustrates this preprocessing.
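The following sketch shows how a native end-effector action could be mapped to the coarse 7-DOF format, normalized, and discretized into 256 bins (the bin count described for RT-1-X below). The bounds here are hypothetical placeholders; in practice each dataset's normalization statistics would be computed from its own trajectories.

```python
import numpy as np

# Hypothetical per-dataset bounds for (x, y, z, roll, pitch, yaw, gripper);
# real pipelines derive these from each dataset's action statistics.
ACTION_LOW = np.array([-0.05, -0.05, -0.05, -0.25, -0.25, -0.25, 0.0])
ACTION_HIGH = np.array([0.05, 0.05, 0.05, 0.25, 0.25, 0.25, 1.0])
NUM_BINS = 256

def normalize_action(native_action: np.ndarray) -> np.ndarray:
    """Rescale a robot's native 7-DOF end-effector action to [0, 1] per dimension."""
    scaled = (native_action - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
    return np.clip(scaled, 0.0, 1.0)

def discretize_action(normalized: np.ndarray) -> np.ndarray:
    """Quantize each dimension into one of 256 bins, yielding the integer
    tokens the Transformer is trained to predict."""
    return np.minimum((normalized * NUM_BINS).astype(np.int64), NUM_BINS - 1)

native = np.array([0.01, 0.0, -0.02, 0.0, 0.0, 0.1, 1.0])  # small delta-pose, close gripper
tokens = discretize_action(normalize_action(native))
print(tokens)  # [153 128  76 128 128 179 255]
```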
- Policy Architectures: The paper experiments with two architectures, RT-1-X and RT-2-X, which are X-embodiment versions of existing models.
  (Figure: schematic of the two RT-X architecture variants, RT-1-X and RT-2-X. An instruction and camera images are processed by a feature extractor (FiLM EfficientNet or ViT) and then by a Transformer or a large language model, producing discrete action commands for different robots' grasping tasks.)
  - RT-1-X (35M parameters): Based on the RT-1 model.
    - Input: A history of 15 images and a language instruction.
    - Processing: Images are passed through an ImageNet-pretrained EfficientNet to extract visual features. The language instruction is converted to an embedding using USE (Universal Sentence Encoder).
    - Fusion: Visual and language features are fused using FiLM (Feature-wise Linear Modulation) layers.
    - Output: The fused tokens are fed into a decoder-only Transformer, which predicts a discretized action from 256 possible bins for each of the 7-DOF dimensions plus a termination dimension.
  - RT-2-X (5B and 55B parameters): Based on the RT-2 model, which is a Vision-Language-Action (VLA) model.
    - Input: Image(s) and a language instruction.
    - Core Idea: It treats robotics as a "language" problem. Actions are tokenized and represented as a string of text (e.g., "action: 1 128 91 241 5 101 127").
    - Processing: A pretrained Vision-Language Model (PaLI-X, which uses a ViT vision encoder and a UL2 language model) is co-fine-tuned on a mixture of its original web data and the robotics data.
    - Output: The model generates the action string as text, which is then parsed and sent to the robot. This allows the model to directly leverage the vast knowledge and reasoning capabilities learned from web-scale data; a sketch of this text round trip follows below.
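The encode/parse round trip for such action strings might look like the sketch below. The "action: ..." prefix follows the example above, but the exact token layout (e.g., whether a termination flag leads the sequence) and the de-normalization bounds are illustrative assumptions.

```python
import re
import numpy as np

NUM_BINS = 256

def encode_action(tokens: list[int]) -> str:
    """Serialize discretized action tokens as plain text, so a vision-language
    model can emit actions the same way it emits words."""
    return "action: " + " ".join(str(t) for t in tokens)

def decode_action(text: str, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Parse generated text back into a continuous action vector.

    low/high are the per-dataset de-normalization bounds (assumed here).
    """
    match = re.search(r"action:\s*((?:\d+\s*)+)", text)
    if match is None:
        raise ValueError(f"no action string found in {text!r}")
    tokens = np.array([int(t) for t in match.group(1).split()], dtype=np.float64)
    # Map each bin index back to the center of its continuous range.
    return low + (tokens + 0.5) / NUM_BINS * (high - low)

text = encode_action([1, 128, 91, 241, 5, 101, 127])
action = decode_action(text, low=np.full(7, -1.0), high=np.full(7, 1.0))
```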
- Training: Both models are trained with a standard categorical cross-entropy loss to predict the correct action tokens. RT-1-X is trained only on the robotics data mixture, while RT-2-X is co-fine-tuned on both web data and robotics data. A minimal sketch of this objective follows.
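The function below is a minimal NumPy sketch of categorical cross-entropy over discretized action tokens, assuming per-dimension logits over 256 bins (the layout described for RT-1-X). The shapes and the uniform averaging over dimensions are illustrative assumptions.

```python
import numpy as np

def action_cross_entropy(logits: np.ndarray, target_tokens: np.ndarray) -> float:
    """Categorical cross-entropy over discretized action tokens.

    logits: (num_dims, num_bins) unnormalized scores, one row per action
        dimension (e.g., 7 DOF plus termination).
    target_tokens: (num_dims,) integer bin indices of the demonstrated action.
    """
    # Numerically stable log-softmax over the bin axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of the expert's tokens, averaged over dimensions.
    nll = -log_probs[np.arange(len(target_tokens)), target_tokens]
    return float(nll.mean())

rng = np.random.default_rng(0)
loss = action_cross_entropy(rng.normal(size=(8, 256)),
                            rng.integers(0, 256, size=8))
print(loss)  # roughly log(256) ≈ 5.5 for random logits
```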
5. Experimental Setup
- Datasets: The experiments use a subset of the full Open X-Embodiment dataset, comprising data from 9 different manipulators. Key datasets mentioned include RT-1, QT-Opt, Bridge, Jaco Play, Cable Routing, NYU VINN, Austin VIOLA, Berkeley Autolab UR5, TOTO, and Language Table. (A data-loading sketch appears at the end of this section.)
- Evaluation Metrics: The primary metric used is Success Rate.
- Conceptual Definition: The Success Rate is the fraction of experimental trials in which the robot successfully completes the instructed task. It is a simple and intuitive measure of task-level performance, where each trial is judged as a binary success or failure.
- Mathematical Formula:
  $$\text{Success Rate} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}}$$
- Symbol Explanation:
  - Number of Successful Trials: The count of attempts where the robot achieved the task's goal (e.g., placed the block in the correct bowl).
  - Total Number of Trials: The total number of attempts made for a given task or set of tasks.
- Baselines:
  - Original Method: The policy learning algorithm originally proposed by the creators of each individual dataset (e.g., LCBC for the Bridge dataset). This represents a strong, specialized baseline.
  - RT-1 (single-dataset): An RT-1 model trained only on the data from the specific robot being evaluated. This baseline helps isolate the benefit of X-embodiment training versus merely using the RT-1 architecture.
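The component datasets are released in RLDS episode format. A loading sketch with TensorFlow Datasets is shown below, following the pattern in the project's released example notebooks; the exact bucket path, dataset version, and feature keys are assumptions and should be verified against `builder.info`.

```python
import tensorflow_datasets as tfds

# Bucket path and version follow the project's example notebooks at the time
# of writing; verify against the official release before relying on them.
builder = tfds.builder_from_directory(
    builder_dir="gs://gresearch/robotics/bridge/0.1.0")
episodes = builder.as_dataset(split="train[:2]")

for episode in episodes:
    # Each RLDS episode contains a nested tf.data.Dataset of time steps.
    for step in episode["steps"]:
        observation, action = step["observation"], step["action"]
        # Feature keys vary per dataset; inspect builder.info.features to see
        # which camera views and action fields a given robot provides.
        print(list(observation.keys()))
```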
6. Results & Analysis
The experiments systematically answer three core questions about X-embodiment training.
- Core Result 1: In-distribution Performance (Positive Transfer)
  - On Small-Scale Datasets:
    (Figure 4: bar chart comparing success rates of multiple methods across robot platforms and datasets; RT-1-X outperforms the other methods on several tasks, including kitchen manipulation, cable routing, and door opening, as well as on the overall average, with gains of up to 50%.)
    As shown in Figure 4, when evaluated on domains with limited data, the RT-1-X model (trained on the full mixture) dramatically outperforms both the Original Method and the RT-1 model trained on single-robot data. The average success rate is 50% higher, providing clear evidence that co-training on X-embodiment data helps robots in data-poor settings.
  - On Large-Scale Datasets: The table below is a manual transcription of Table I from the paper.

    | | Bridge | Bridge | RT-1 paper 6 skills |
    |---|---|---|---|
    | Evaluation Location | IRIS (Stanford) | RAIL Lab (UCB) | Google Robotic Lab |
    | Robot Embodiment | WidowX | WidowX | Google Robot |
    | Original Method | LCBC [95]: 13% | LCBC [95]: 13% | 92% |
    | RT-1 | 40% | 30% | 73% |
    | RT-1-X | 27% | 27% | 91% |
    | RT-2-X (55B) | 50% | - | - |

    Note: the paper's table appears to contain a typo, and the RT-1 and RT-1-X values in the last column look swapped or misaligned. Under the corrected reading, RT-1 (~91%) is the single-dataset baseline (consistent with the 92% of the Original Method, which for this dataset is RT-1 itself), and RT-1-X (73%) is the multi-dataset model. The smaller RT-1-X (35M parameters) therefore underperforms on the large Bridge and RT-1 datasets. The authors interpret this as underfitting: the model lacks the capacity to both absorb the diverse X-embodiment data and master these complex, data-rich tasks. However, the much larger RT-2-X (55B parameters) achieves strong performance (50% on Bridge), suggesting that high model capacity is essential to benefit from X-embodiment training in data-rich domains.
- Core Result 2: Improved Generalization (Emergent Skills)
  This experiment tests whether RT-2-X can perform new skills on the Google Robot that were only present in the Bridge dataset (collected on the WidowX robot). These are "emergent" skills transferred across embodiments.
  (Figure: illustration of how different instruction types affect manipulation trajectories, with panels showing (a) absolute motion, (b) object-relative motion, and (c) how prepositions change behavior, using real objects and arrows to mark the motion paths.)
  The table below is a manual transcription of the relevant columns of Table II in the paper.

  | Row | Model | Training Data | Emergent Skills Evaluation | RT-2 Generalization Evaluation |
  |---|---|---|---|---|
  | (1) | RT-2 | Google Robot action data | 27.3% | 62% |
  | (2) | RT-2-X | Robotics data | 75.8% | 61% |
  | (3) | RT-2-X | Robotics data except Bridge | 42.8% | 54% |

  Comparing rows (1) and (2), RT-2-X achieves a 75.8% success rate on these emergent skills, a nearly 3x improvement over the baseline RT-2 (27.3%). This is a powerful demonstration of skill transfer. Row (3) is an ablation study: removing the Bridge dataset from the training mix causes performance to drop to 42.8%, confirming that the new skills are indeed being learned from the WidowX robot's data.
- Core Result 3: Design Decisions
  Ablation studies in Table II reveal several key insights:
  - Model Capacity: The 55B-parameter RT-2-X model dramatically outperforms the 5B version on emergent skills (75.8% vs. 44.4%), confirming that larger models are better at cross-embodiment transfer.
  - Web Pretraining: An RT-2-X model trained from scratch (without web pretraining) achieves 0% success, highlighting that the knowledge from large-scale vision-language pretraining is critical.
  - History: Including a short history of images (2 frames) significantly boosts performance for the 5B model (44.4% vs. 14.5%), indicating that temporal context is important for control.
7. Conclusion & Reflections
- Conclusion Summary: The paper demonstrates that a paradigm shift towards generalist, X-embodiment models is feasible in robotics. By creating the massive Open X-Embodiment dataset and training high-capacity RT-X models, the authors provide the first large-scale empirical evidence that positive transfer across different robots is not only possible but highly effective. The RT-1-X model boosts performance for data-poor robots, while the larger RT-2-X model exhibits emergent capabilities by learning skills from other robots. The work provides the community with both a foundational dataset and a strong set of baseline models to catalyze future research into "robot foundation models."
- Limitations & Future Work: The authors acknowledge several limitations and directions for future work:
- The study does not evaluate generalization to entirely new, unseen robot embodiments.
- A clear theoretical or empirical criterion for when positive or negative transfer occurs is still needed.
- The current dataset is limited to robots with similar sensing (RGB cameras) and actuation (7-DOF end-effector). Expanding to include robots with different modalities (e.g., tactile sensors, different gripper types, mobile bases) is an important next step.
- Personal Insights & Critique:
- Impact: This is a landmark paper for the field of robot learning. It provides a tangible path forward for building more general and capable robots, moving away from the fragmented, one-off model paradigm. The OXE dataset and RT-X models could serve as an "ImageNet moment" for X-embodiment robotics, providing a common benchmark to drive progress.
- Pragmatism vs. Principle: The "coarsely aligned" action space is a clever and pragmatic choice that made this large-scale experiment possible. However, it may be a long-term limitation. More principled methods for aligning action spaces or learning embodiment-invariant representations could unlock even better performance.
- Accessibility and Deployment: The best-performing model, RT-2-X, is enormous (55B parameters) and requires cloud infrastructure for inference. This raises practical questions about its accessibility to smaller labs and its feasibility for deployment on real-world robots, where low latency and local computation are often required.
- Open Questions: How do the diversity of tasks and the diversity of embodiments each contribute to generalization? Could this approach be extended beyond manipulation to other domains like locomotion or human-robot interaction? This paper lays the groundwork for a decade of future research to answer these questions.