RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
TL;DR Summary
RDT-1B proposes a 1.2B-parameter diffusion Transformer foundation model for bimanual manipulation, addressing multi-modal actions and data scarcity via a physically interpretable unified action space. Pre-trained on vast datasets, it achieves superior zero-shot generalization, language instruction following, few-shot learning, and dexterous bimanual manipulation.
Abstract
Bimanual manipulation is essential in robotics, yet developing foundation models is extremely challenging due to the inherent complexity of coordinating two robot arms (leading to multi-modal action distributions) and the scarcity of training data. In this paper, we present the Robotics Diffusion Transformer (RDT), a pioneering diffusion foundation model for bimanual manipulation. RDT builds on diffusion models to effectively represent multi-modality, with innovative designs of a scalable Transformer to deal with the heterogeneity of multi-modal inputs and to capture the nonlinearity and high frequency of robotic data. To address data scarcity, we further introduce a Physically Interpretable Unified Action Space, which can unify the action representations of various robots while preserving the physical meanings of original actions, facilitating learning transferrable physical knowledge. With these designs, we managed to pre-train RDT on the largest collection of multi-robot datasets to date and scaled it up to 1.2B parameters, which is the largest diffusion-based foundation model for robotic manipulation. We finally fine-tuned RDT on a self-created multi-task bimanual dataset with over 6K+ episodes to refine its manipulation capabilities. Experiments on real robots demonstrate that RDT significantly outperforms existing methods. It exhibits zero-shot generalization to unseen objects and scenes, understands and follows language instructions, learns new skills with just 1~5 demonstrations, and effectively handles complex, dexterous tasks. We refer to https://rdt-robotics.github.io/rdt-robotics/ for the code and videos.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
- Authors: Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu.
- Affiliations: All authors are affiliated with the Department of Computer Science & Technology at Tsinghua University, a leading institution in AI and computer science research.
- Journal/Conference: The paper is currently available as a preprint on arXiv, an open-access archive for scholarly articles. This means it has not yet undergone formal peer review for a conference or journal publication.
- Publication Year: 2024 (as per its submission to arXiv).
- Abstract: The paper introduces the Robotics Diffusion Transformer (RDT), a 1.2 billion parameter foundation model designed for bimanual robotic manipulation. The authors identify two key challenges: the complexity of coordinating two arms (which creates multi-modal action distributions) and the scarcity of bimanual training data. RDT addresses these by using a diffusion model to handle multi-modality and a scalable Transformer architecture for heterogeneous inputs. To overcome data scarcity, they propose a "Physically Interpretable Unified Action Space," which allows them to pre-train RDT on a massive collection of diverse multi-robot datasets. The model is then fine-tuned on a custom-collected, large-scale bimanual dataset. Experiments on real robots show that RDT significantly outperforms existing methods, demonstrating strong capabilities in zero-shot generalization, instruction following, few-shot learning, and performing complex, dexterous tasks.
- Original Source Link:
- arXiv Page: https://arxiv.org/abs/2410.07864
- PDF Link: http://arxiv.org/pdf/2410.07864v2
2. Executive Summary
-
Background & Motivation (Why):
- Core Problem: Creating a general-purpose, adaptable policy for robots with two arms (bimanual manipulation) is a fundamental goal in robotics. Such tasks are common in human environments but incredibly difficult for robots.
- Key Challenges:
- Complexity & Multi-modality: Coordinating two arms dramatically increases the complexity of possible actions. For any given task, there are often multiple equally valid ways for the two arms to work together (e.g., which hand holds an object vs. which one acts on it). This leads to a multi-modal action distribution, which is difficult for simple models to learn.
- Data Scarcity: Bimanual robot hardware is expensive and complex, making it difficult to collect the massive datasets needed to train large "foundation models."
- Innovation: The paper proposes to tackle these issues by building the largest-ever diffusion-based robotics model. The key idea is to leverage a diffusion model's inherent ability to capture complex distributions to solve the multi-modality problem. To solve the data problem, they introduce a novel unified action space that allows them to pool together dozens of existing (mostly single-arm) robot datasets for pre-training, thereby learning general physical knowledge before specializing in bimanual tasks.
-
Main Contributions / Findings (What):
- Robotics Diffusion Transformer (RDT): The development of a 1.2B parameter diffusion-based visuomotor policy, making it the largest diffusion model for robotic manipulation to date.
- Architectural Innovations for Robotics: RDT is not a standard Diffusion Transformer (DiT). It includes specific modifications—QKNorm & RMSNorm for training stability, an MLP Decoder for capturing nonlinear actions, and Alternating Condition Injection (ACI) for better fusion of vision and language—all tailored to the unique properties of robotic data.
- Physically Interpretable Unified Action Space: A novel method to standardize action representations across different robots without losing the physical meaning of the actions. This was crucial for enabling pre-training on a heterogeneous collection of 46 different robot datasets.
- Extensive Data Curation: The model was pre-trained on over 1 million trajectories and fine-tuned on a new, comprehensive bimanual dataset created by the authors, containing over 6,000 episodes across 300+ challenging tasks.
- State-of-the-Art Performance: On real-world robot experiments, RDT demonstrated superior performance over strong baselines, achieving a 56% improvement in success rates. It excels at zero-shot generalization to new objects/scenes, following novel language instructions, learning new skills from just 1-5 examples, and executing tasks requiring high dexterity.
3. Prerequisite Knowledge & Related Work
-
Foundational Concepts:
-
Bimanual Manipulation: The control of a robot with two arms to perform a task. It is significantly more complex than single-arm (unimanual) manipulation due to the need for coordination, the larger action space, and the potential for collisions between the arms.
-
Foundation Models: A term popularised in AI, referring to very large models (like GPT-4) pre-trained on vast amounts of general data. These models learn broad knowledge and can be adapted (or fine-tuned) to a wide range of specific downstream tasks with relatively little task-specific data.
-
Imitation Learning: A machine learning paradigm where an agent learns to perform a task by observing and mimicking demonstrations from an expert (in this case, a human teleoperating the robot).
-
Diffusion Models: A class of generative models that learn to create data by reversing a noise-adding process. They start with pure noise and iteratively refine it into a coherent sample (e.g., an image or, in this paper, a sequence of robot actions). Their key strength is the ability to model highly complex and multi-modal probability distributions.
-
Transformers: A neural network architecture based on the self-attention mechanism. Transformers have proven to be exceptionally scalable and effective at processing sequential data, making them the backbone of most modern large language models and, increasingly, vision and robotics models.
-
Multi-modal Action Distribution: As shown in Image 7, for a simple task like grasping a cube with two hands, a robot could approach from the left, the right, or with both hands simultaneously. Each of these is a valid "mode." When demonstrations are collected, the human operator might use any of these modes, creating a dataset where the "correct" action for a given state is not unique. A simple regression model would average these modes, resulting in a nonsensical, infeasible action. A model that can represent a multi-modal distribution can learn all valid possibilities.
This image is a schematic: (a) shows the structure of a dual-arm robot system, annotating the left and right wrist cameras, the grippers, and the exterior camera; (b) uses a cube-grasping example to illustrate the multi-modal nature of bimanual actions, with the plot on the right showing the corresponding multi-peaked action distribution.
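To make the mode-averaging failure above concrete, here is a toy numeric illustration (not from the paper): with two equally valid demonstration modes, a mean-squared-error regressor converges to their average, which belongs to neither mode, while a generative model of the full distribution can commit to one valid mode.

```python
import numpy as np

# Toy dataset: for the same observation, demonstrators grasp the cube either
# from the left (action -1.0) or from the right (action +1.0).
demo_actions = np.array([-1.0, -1.0, +1.0, +1.0])

# A deterministic policy trained with MSE converges to the conditional mean:
mse_prediction = demo_actions.mean()          # 0.0 -- neither valid mode

# A model of the full action distribution can instead sample one valid mode:
rng = np.random.default_rng(seed=0)
sampled_action = rng.choice(np.array([-1.0, +1.0]))

print(f"MSE regressor predicts {mse_prediction}, sampled mode is {sampled_action}")
```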
-
-
Previous Works:
- Traditional Bimanual Manipulation: Early methods often relied on hand-crafted rules or task-specific primitives (pre-defined simple movements), which lacked generalizability.
- Early Learning-Based Methods: These were often limited by small models and datasets, working only for simple tasks and failing to generalize to new scenarios.
- Robotics Foundation Models: The paper builds upon recent successes like:
  - RT-1 and RT-2: Large Transformer models from Google DeepMind that showed impressive generalization. However, they use an action discretization approach (turning continuous actions into a vocabulary of tokens, like words). The authors argue this leads to quantization errors and lacks the precision needed for dexterous bimanual control.
  - OpenVLA: The largest open-source robotics model (7B parameters), which also relies on discretization.
  - Octo: A diffusion-based model, similar in spirit to RDT, but much smaller (93M parameters) and pre-trained on a smaller subset of data.
-
Differentiation: RDT distinguishes itself from prior work in four key ways:
- Modeling Choice: It champions continuous diffusion models over discretization for high-precision control.
- Scale: It pushes the model size to 1.2B parameters, significantly larger than previous diffusion-based policies like Octo.
- Data Unification: The Physically Interpretable Unified Action Space is a novel solution to the data heterogeneity problem, allowing for more effective use of diverse datasets than previous methods that either discarded or simplified data.
- Architectural Adaptation: It specifically modifies the popular DiT architecture to better suit the unique characteristics of robotic control signals.
4. Methodology (Core Technology & Implementation)
The core of the paper is the Robotics Diffusion Transformer (RDT) and the data strategy that enables its training.
This image is a schematic of the RDT framework and its unified action space design. On the left, heterogeneous action spaces (single-arm end-effectors, dual-arm joints, wheeled robots, etc.) are embedded into the unified action space. In the middle, low-dimensional inputs (proprioception, noisy action chunks, and control frequency) pass through MLPs and cross-attention Transformer layers to output denoised actions. On the right, image inputs (exterior, right-wrist, and left-wrist camera views) and language inputs are jointly encoded with SigLIP and T5-XXL to condition action prediction.
4.1 RDT Model Architecture
RDT is designed to address the need for an expressive and scalable architecture.
-
Diffusion Modeling:
- Principle: Instead of predicting a single "best" action, RDT learns the entire conditional probability distribution $p(a_t \mid \ell, o_t)$, where $a_t$ is the action, $\ell$ is the language instruction, and $o_t$ is the observation. This naturally handles multi-modality.
- Action Chunking: To improve temporal consistency and reduce the accumulation of errors over long tasks, RDT predicts a sequence (or "chunk") of $T_a$ future actions at once, i.e., it models $p(a_{t:t+T_a} \mid \ell, o_t)$. In the experiments, $T_a = 64$.
- Denoising Process: At inference time, the model starts with a random noisy action chunk and iteratively denoises it over multiple steps. The core of this process is the denoising network $f_\theta$, which predicts the clean action $a_t$ from a noisy version $\hat{a}_t^k$.
- Training Objective: The model is trained to minimize the mean squared error (MSE) between the ground-truth action from the dataset and the model's prediction of that action from a noisy version. The training loss is:
$$\mathcal{L}(\theta) = \mathbb{E}_{\ell,\, o_t,\, a_t,\, k,\, \epsilon}\left\| a_t - f_\theta\!\left(\ell, o_t, \hat{a}_t^k, k\right) \right\|^2, \qquad \hat{a}_t^k = \sqrt{\bar{\alpha}_k}\, a_t + \sqrt{1 - \bar{\alpha}_k}\, \epsilon,\quad \epsilon \sim \mathcal{N}(0, I)$$
Where:
- $\theta$ are the model parameters.
- $a_t$ is the clean action (chunk) from the training data.
- $\ell, o_t$ are the language and observation conditions.
- $\hat{a}_t^k$ is the noisy action created by adding Gaussian noise $\epsilon$ to the clean action. The amount of noise is determined by the noise schedule $\bar{\alpha}_k$ and the diffusion timestep $k$.
- $f_\theta$ is the neural network that tries to predict the original $a_t$.
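For readers who prefer code, below is a minimal PyTorch-style sketch of this denoising objective. It assumes a DDPM-style cumulative noise schedule `alpha_bar` and a network `f_theta(lang, obs, noisy_actions, k)`; these names, shapes, and the schedule are illustrative assumptions, not the paper's implementation.

```python
import torch

def diffusion_loss(f_theta, a_clean, lang_cond, obs_cond, alpha_bar):
    """One training step of the MSE denoising objective (illustrative sketch).

    a_clean:   (B, T_a, D) ground-truth action chunk
    alpha_bar: (K,) cumulative products of the noise schedule, K diffusion steps
    """
    B, K = a_clean.shape[0], alpha_bar.shape[0]

    # Sample a diffusion timestep k and Gaussian noise for each example.
    k = torch.randint(0, K, (B,), device=a_clean.device)
    eps = torch.randn_like(a_clean)

    # Corrupt the clean chunk: a_noisy = sqrt(abar_k) * a + sqrt(1 - abar_k) * eps
    abar_k = alpha_bar[k].view(B, 1, 1)
    a_noisy = abar_k.sqrt() * a_clean + (1.0 - abar_k).sqrt() * eps

    # The network predicts the clean action chunk from the noisy one ...
    a_pred = f_theta(lang_cond, obs_cond, a_noisy, k)

    # ... and is penalized by the MSE to the ground truth.
    return torch.mean((a_clean - a_pred) ** 2)
```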
-
Multi-Modal Input Encoding:
- Low-Dimensional Inputs: Robot joint positions (proprioception), noisy action chunks, and control frequency are encoded using Multi-Layer Perceptrons (MLPs) with Fourier features to better capture high-frequency signals.
- Image Inputs: RGB images from three cameras (exterior, right wrist, left wrist) are encoded using a frozen, pre-trained SigLIP vision encoder.
- Language Inputs: Text instructions are encoded using a frozen, pre-trained T5-XXL language model.
- Using frozen encoders saves significant memory and computation, allowing the model to focus training on the core fusion and policy network.
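A schematic sketch of this encoder setup follows. The Fourier-feature MLP for low-dimensional inputs is an illustrative reconstruction (frequency bands, hidden sizes, and module names are assumptions); the vision and language branches would simply call frozen, off-the-shelf SigLIP and T5-XXL encoders.

```python
import torch
import torch.nn as nn

class FourierMLP(nn.Module):
    """Encode low-dimensional signals (proprioception, noisy action chunks,
    control frequency) with sin/cos Fourier features followed by an MLP."""
    def __init__(self, in_dim, hidden_dim=512, n_freqs=8):
        super().__init__()
        # Fixed frequencies 2^0 .. 2^(n_freqs-1); the choice of bands is an assumption.
        self.register_buffer("freqs", 2.0 ** torch.arange(n_freqs))
        self.mlp = nn.Sequential(
            nn.Linear(in_dim * 2 * n_freqs, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, x):                        # x: (B, in_dim)
        angles = x.unsqueeze(-1) * self.freqs    # (B, in_dim, n_freqs)
        feats = torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)
        return self.mlp(feats)                   # (B, hidden_dim)

# Vision and language encoders stay frozen during RDT training, e.g.:
#   image_tokens = siglip_encoder(images)        # frozen SigLIP
#   lang_tokens  = t5_xxl_encoder(instruction)   # frozen T5-XXL
#   for p in siglip_encoder.parameters(): p.requires_grad_(False)
```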
-
Network Structure ($f_\theta$) - A Modified Diffusion Transformer: The backbone is a Transformer, but with crucial modifications for robotics:
-
QKNorm & RMSNorm: Standard LayerNorm can be unstable in large models and can disrupt time-series data. QKNorm stabilizes attention scores, and RMSNorm is a simpler normalization that works better for time series. Image 9(a) shows that training is unstable without these.
-
MLP Decoder: The final layer that maps the Transformer's output back to an action is an MLP instead of a simple linear layer. This gives the model more capacity to represent the complex, nonlinear dynamics of robot movements. Image 9(b) shows this is critical for dexterous tasks.
-
Alternating Condition Injection (ACI): Vision and language inputs are treated as conditions. Instead of feeding them in all at once in every layer (which could let the more numerous image tokens "drown out" the language tokens), the model alternates injecting them via cross-attention in successive Transformer layers. This ensures balanced processing of both modalities, improving instruction following.
This image is a chart showing training loss curves and success-rate comparisons for different model configurations. (a) Without QKNorm and RMSNorm the training loss curve is unstable, while RDT's curve is smooth; (b) models without the MLP decoder or without the ACI module are compared on the "Dexterity" and "Instr. Following" tasks, where RDT performs significantly better.
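Below is a hedged sketch of how these modifications might fit into one Transformer block: RMSNorm in place of LayerNorm, cross-attention that alternately attends to image or language tokens across layers (ACI), and a nonlinear MLP head instead of a linear decoder. Module names, dimensions, and the even/odd alternation rule are assumptions for illustration; QKNorm (normalizing queries and keys inside attention) is omitted for brevity.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization (no mean-centering)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.scale

class RDTBlock(nn.Module):
    """Self-attention + cross-attention block; `cond` is either image or
    language tokens, depending on the layer index (Alternating Condition Injection)."""
    def __init__(self, dim, n_heads):
        super().__init__()
        self.norm1, self.norm2, self.norm3 = RMSNorm(dim), RMSNorm(dim), RMSNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, cond):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, cond, cond)[0]
        return x + self.ffn(self.norm3(x))

def rdt_backbone(blocks, x, img_tokens, lang_tokens, mlp_decoder):
    # Even layers attend to image tokens, odd layers to language tokens (assumed rule).
    for i, block in enumerate(blocks):
        x = block(x, img_tokens if i % 2 == 0 else lang_tokens)
    return mlp_decoder(x)   # nonlinear MLP head rather than a single linear layer
```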
-
4.2 Data Strategy
-
Physically Interpretable Unified Action Space:
- Problem: Different robots have different numbers of joints and different action spaces (e.g., joint positions vs. end-effector poses). How can a single model be trained on data from all of them?
- Solution: The authors define a standardized 128-dimensional vector that can represent all relevant physical quantities for a gripper-arm robot (e.g., right arm joint positions, left gripper velocities, base velocities). For any given robot, its native action vector is mapped into this unified space by placing each value in its corresponding slot based on its physical meaning. Unused slots are padded.
- Benefit: This preserves the physical meaning of actions (e.g., a value of 1.0 rad/s means the same thing regardless of the robot), facilitating the learning of transferable physical knowledge (e.g., how fast to move to pick up a delicate object).
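A minimal sketch of this mapping is shown below. The slot indices and padding value are hypothetical (the paper's actual 128-dimensional layout is specified in its appendix); the point is that each physical quantity always lands in the same, physically meaningful slot.

```python
import numpy as np

UNIFIED_DIM = 128

# Hypothetical slot layout; the real index assignments are defined in the paper.
SLOTS = {
    "right_arm_joint_pos": slice(0, 7),      # 7-DoF right arm joint positions
    "right_gripper_width": slice(7, 8),
    "left_arm_joint_pos":  slice(50, 57),    # 7-DoF left arm joint positions
    "left_gripper_width":  slice(57, 58),
    "base_linear_vel":     slice(100, 102),  # mobile-base velocities
}

def to_unified(native_action: dict) -> np.ndarray:
    """Map a robot's native action dict into the 128-dim unified vector.

    Each quantity is written to its dedicated slot, so e.g. a joint velocity
    keeps the same physical meaning across robots; unused slots stay padded
    (zeros here, purely as an illustrative choice)."""
    unified = np.zeros(UNIFIED_DIM, dtype=np.float32)
    for name, value in native_action.items():
        unified[SLOTS[name]] = value
    return unified

# A single-arm robot fills only the slots it actually has:
single_arm_action = {
    "right_arm_joint_pos": np.linspace(0.1, 0.7, 7),
    "right_gripper_width": 0.04,
}
print(to_unified(single_arm_action).shape)   # (128,)
```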
-
Pre-Training Dataset: A massive collection of 46 existing datasets, totaling over 1 million trajectories. This includes famous datasets like
RT-1, DROID, and RH20T. See Table 5 in the paper's appendix for a full list.
Fine-Tuning Dataset: A new, high-quality dataset collected by the authors specifically for bimanual manipulation. It is one of the largest of its kind.
-
Scale: 6,000+ episodes, 300+ tasks, 100+ objects, 15+ scenes.
-
Diversity: Includes challenging tasks requiring dexterity (e.g., unscrewing a cap), comprehension (e.g., spelling "love" with blocks), and bimanual coordination (e.g., plugging in a cable).
-
Augmentation: The authors used GPT-4-Turbo to rewrite and expand the language instructions, increasing textual diversity.
This image is a schematic of the key properties of the bimanual fine-tuning dataset, including rich and diverse objects and scenes (100+ objects, 15+ scenes, and varied lighting conditions), diverse and challenging tasks (requiring comprehension and dexterous manipulation), and information-enhanced instructions with multi-modal annotations.
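One plausible way to implement the instruction augmentation mentioned above is sketched here with the OpenAI Python client; the prompt wording and parameters are assumptions, not the authors' actual pipeline.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def expand_instruction(instruction: str, n_variants: int = 3) -> list[str]:
    """Ask GPT-4-Turbo to paraphrase a robot task instruction (illustrative prompt)."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system",
             "content": "Rewrite robot task instructions. Keep the meaning, vary the wording."},
            {"role": "user",
             "content": f"Give {n_variants} paraphrases, one per line: {instruction}"},
        ],
        temperature=0.9,
    )
    return response.choices[0].message.content.strip().splitlines()

# Example: expand_instruction("Pour water into the cup with the left hand")
```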
-
5. Experimental Setup
-
Hardware: All real-world experiments were conducted on the ALOHA dual-arm robot, a popular low-cost platform for bimanual manipulation research.
This image is an illustration of the robot hardware, showing the dual-arm system equipped with wrist cameras and a front camera. The annotations label key components such as the control arms, wrist cameras, front camera, laptop, and battery, giving an overview of the platform's structure.
-
Tasks: 7 challenging tasks were designed to test different aspects of the model's capabilities:
-
The following is a manual transcription of the data in Table 1 from the paper.
| Task Name | Dimension | Explanation |
| --- | --- | --- |
| Wash Cup | Unseen Object (Q1) | To wash one seen and two unseen cups with the faucet |
| Pour Water | Unseen Scene (Q1) | To pour water into the cup in three unseen rooms |
| Pour Water-L-1/3 | Instruction Following (Q2) | To pour water into the cup with the left hand until one-third full |
| Pour Water-R-2/3 | Instruction Following (Q2) | To pour water into the cup with the right hand until two-thirds full |
| Handover | 5-Shot Learning (Q3) | To move the marker to the box, where handover is needed due to far distance |
| Fold Shorts | 1-Shot Learning (Q3) | To fold the shorts in half horizontally |
| Robot Dog | Dexterity (Q4) | To push the joystick straight to control the robot dog to walk in a straight line |
-
-
Evaluation Metrics:
- Success Rate: The primary metric for evaluation.
- Conceptual Definition: It measures the percentage of trials where the robot successfully completes the entire task as defined. A higher success rate indicates better performance.
- Mathematical Formula:
$$\text{Success Rate} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} \times 100\%$$
- Symbol Explanation:
- Number of Successful Trials: The count of experiments where the robot achieved the task goal.
- Total Number of Trials: The total number of times the experiment was run for a given task.
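As a worked example, a task that succeeds in 6 out of 8 trials scores $6/8 \times 100\% = 75\%$.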
-
Baselines:
- ACT (Action Chunking with Transformers): A state-of-the-art method for bimanual manipulation that uses a Variational Autoencoder (VAE) to model the action distribution.
- OpenVLA: The largest open-source robotics foundation model (7B parameters), which uses action discretization.
- Octo: A 93M parameter diffusion-based foundation model.
- RDT (scratch): RDT trained only on the fine-tuning data, without pre-training, to measure the benefit of the large pre-training dataset.
6. Results & Analysis
-
Core Results: RDT consistently and significantly outperforms all baselines across all tasks.
-
The following is a manual transcription of the data in Table 3 from the paper.
Wash Cup: seen cup 1 | unseen cup 1 | unseen cup 2 (Unseen Object); columns: Pick Up Cup, Turn On Faucet, Get Water, Pour Out Water, Place Back Cup, Total
- ACT: 50 12.5 37.5 0 0 0 0 0 0 0 0 0 0 0
- OpenVLA: 0 0 0 0 0 0 0 0 0 0 0 0 0 0
- Octo: 0 0 0 0 0 0 0 0 0 0 0 0 0 0
- RDT (scratch): 37.5 12.5 12.5 12.5 0 0 0 37.5 12.5 0 25 0 0 0
- RDT (ours): 87.5 87.5 50 62.5 75 50 50 75 50 87.5 75 50 62.5 50

Pour Water-L-1/3 | Pour Water-R-2/3 (Instruction Following); columns: Pick Up Bottle, Pour Water, Place Back Bottle, Total, Correct Hand, Correct Amount
- OpenVLA: 50 0 0 0 50 0
- Octo: 0 0 0 0 0 0
- RDT (scratch): 100 0 0 0 62.5 12.5
- RDT (ours): 100 75 62.5 62.5 100 75

Pour Water: unseen room 1 | unseen room 2 | unseen room 3 (Unseen Scene); columns: Pick Up Bottle, Pour Water, Place Back Bottle, Total; plus Fold Shorts (1-Shot) Total
- ACT: 25 87.5 25 0 50 12.5 0 37.5 12.5 0 37.5 12.5 0
- OpenVLA: 0 0 0 0 0 0 0 0 0 0 0 0 0
- Octo: 50 0 12.5 12.5 12.5 0 0 12.5 0 0 0 0 4
- RDT (scratch): 62.5 100 62.5 25 87.5 37.5 25 75 25 25 75 25 40
- RDT (ours): 62.5 100 62.5 62.5 100 62.5 62.5 100 62.5 62.5 100 62.5 68

Handover (5-Shot); columns: Pick Up Pen, Switch Hand, Drop Pen, Fall into Box, Total | Robot Dog (Dexterity); columns: Grab Remote, Push Joystick, Walk Straight, Total
- ACT: 44 0 0 0 0 88 4 4 4
- OpenVLA: 0 0 0 0 0 84 0 0 0
- Octo: 0 0 0 0 0 100 4 0 0
- RDT (scratch): 40 32 24 24 24 100 40 32 32
- RDT (ours): 84 76 76 76 72 100 96 88 84
Q1 & Q2 (Zero-Shot Generalization & Instruction Following): RDT maintains high success rates when faced with unseen objects (Wash Cup), unseen scenes (Pour Water), and unseen instructions (Pour Water-L-1/3, where it has never seen "one-third"). Baselines completely fail in these generalization tasks. Figure 10 visually confirms RDT's precision in the water pouring task.
-
Q3 (Few-Shot Learning): RDT can learn entirely new, complex skills like Handover and Fold Shorts from just 5 and 1 demonstrations, respectively, achieving high success rates (72% and 68%). Baselines are near 0%. This demonstrates its ability to quickly adapt, a key feature of foundation models.
-
Q4 (Dexterity): In the Robot Dog task, which requires precise, steady control of a joystick, RDT achieves an 84% success rate, while others fail to make the dog walk straight. This highlights the advantage of the continuous action space from the diffusion model for fine-grained tasks.
-
-
Ablation Study: This study confirms the importance of each key design choice.
-
The following is a manual transcription of the data in Table 2 from the paper.
| Variant Name | Unseen Object (%) | Unseen Scene (%) | Instruction Following (%) |
| --- | --- | --- | --- |
| RDT (regress) | 12.5 | 50 | 12.5 |
| RDT (small) | 37.5 | 62.5 | 25 |
| RDT (scratch) | 0 | 25 | 62.5 |
| RDT (ours) | 50 | 62.5 | 100 |
Diffusion Modeling is Crucial:
RDT (regress) performs very poorly, showing that simple regression cannot handle the multi-modal nature of the tasks.
Large Model Size Matters:
RDT (small) underperforms the full model, confirming that scale is a key contributor to performance.
Extensive Pre-training is Essential:
RDT (scratch) fails at generalizing to unseen objects and scenes, proving that the knowledge learned from the massive, diverse pre-training dataset is critical for generalization.
-
7. Conclusion & Reflections
-
Conclusion Summary: The paper successfully presents RDT-1B, a pioneering diffusion-based foundation model for bimanual manipulation. By combining a large-scale diffusion transformer architecture, a novel unified action space to leverage massive multi-robot datasets, and a comprehensive fine-tuning dataset, RDT sets a new state-of-the-art. It demonstrates remarkable capabilities in zero-shot generalization, few-shot adaptation, and dexterous control, pushing the boundaries of what is possible with learning-based robotics.
-
Limitations & Future Work:
- Computational Cost: Training a 1.2B parameter model requires immense computational resources (48 H100 GPUs for a month), which is a significant barrier for most academic labs and smaller companies.
- Static Manipulation Focus: While the robot base is mobile, the tasks themselves are static. The next frontier is to extend these capabilities to mobile manipulation, where the robot must navigate and manipulate simultaneously.
- Safety and Reliability: As with any large-scale model deployed in the real world, ensuring safety, predictability, and robustness to edge cases remains a critical challenge that is not deeply explored in the paper.
- Dependence on Human Demonstrations: The model relies entirely on imitation learning. This approach can be limited by the quality of the demonstrations and may struggle with tasks beyond the scope of what a human can teleoperate effectively.
-
Personal Insights & Critique:
- The Physically Interpretable Unified Action Space is arguably the most impactful and transferable contribution of this work. It offers a principled way to solve the long-standing problem of data heterogeneity in robotics, paving the way for more effective cross-embodiment learning.
- The paper provides strong evidence that the scaling laws observed in NLP and vision also apply to robotics: larger models trained on more diverse data lead to better generalization. The success of RDT makes a compelling case for continued investment in large-scale data collection and model development.
- The meticulous architectural adaptations to the DiT model are a valuable piece of engineering, providing a practical guide for others looking to apply diffusion transformers to control problems. The ablation studies convincingly justify each design choice.
- Overall, this paper represents a significant step forward in the quest for generalist robots. It successfully combines the strengths of diffusion models for expressiveness, Transformers for scalability, and a clever data strategy to create a highly capable bimanual manipulation system.