Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics
TL;DR Summary
This paper presents a novel framework for a robotic world model using dual-autoregressive mechanisms and self-supervised training, enabling reliable long-horizon predictions without domain-specific biases. It supports efficient policy optimization and seamless deployment in real-
Abstract
Learning robust and generalizable world models is crucial for enabling efficient and scalable robotic control in real-world environments. In this work, we introduce a novel framework for learning world models that accurately capture complex, partially observable, and stochastic dynamics. The proposed method employs a dual-autoregressive mechanism and self-supervised training to achieve reliable long-horizon predictions without relying on domain-specific inductive biases, ensuring adaptability across diverse robotic tasks. We further propose a policy optimization framework that leverages world models for efficient training in imagined environments and seamless deployment in real-world systems. This work advances model-based reinforcement learning by addressing the challenges of long-horizon prediction, error accumulation, and sim-to-real transfer. By providing a scalable and robust framework, the introduced methods pave the way for adaptive and efficient robotic systems in real-world applications.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is a novel framework for learning robust and generalizable world models in robotics, specifically designed to facilitate robust policy optimization and seamless sim-to-real transfer. The framework is named Robotic World Model (RWM).
1.2. Authors
-
Chenhao Li: ETH Zurich, Switzerland. Email: chenhli@ethz.ch.
-
Andreas Krause: ETH Zurich, Switzerland.
-
Marco Hutter: ETH Zurich, Switzerland.
The authors are affiliated with
ETH Zurich, a prominent research university known for its strong contributions to robotics, machine learning, and control theory. Chenhao Li, as the first author, is likely the primary researcher on this project, while Andreas Krause and Marco Hutter likely serve as senior researchers or advisors, given their established reputations in the field of machine learning and robotics, respectively. Marco Hutter, in particular, is well-known for his work on legged robots such as theANYmal.
1.3. Journal/Conference
The paper is available as a preprint on arXiv. As of the provided publication date, it has not yet been officially published in a journal or conference. However, arXiv is a widely respected platform for disseminating cutting-edge research in physics, mathematics, computer science, and related fields, allowing researchers to share their work before peer review. The abstract mentions advancing model-based reinforcement learning and addressing sim-to-real transfer, indicating its relevance to top-tier conferences in robotics (ICRA, IROS, CoRL) or machine learning (NeurIPS, ICML).
1.4. Publication Year
2025 (Published at UTC: 2025-01-17T10:39:09.000Z)
1.5. Abstract
The paper introduces a novel framework, Robotic World Model (RWM), for learning world models that can accurately capture complex, partially observable, and stochastic dynamics in robotics. The core methodology involves a dual-autoregressive mechanism and self-supervised training to achieve reliable long-horizon predictions without relying on domain-specific inductive biases, which enhances its adaptability across various robotic tasks. Furthermore, the authors propose a policy optimization framework, MBPO-PPO, that leverages these learned world models for efficient training in imagined environments and zero-shot deployment in real-world systems. This work aims to advance model-based reinforcement learning (MBRL) by tackling critical challenges such as long-horizon prediction accuracy, error accumulation, and sim-to-real transfer. The introduced methods provide a scalable and robust framework, paving the way for more adaptive and efficient robotic systems in practical applications.
1.6. Original Source Link
https://arxiv.org/abs/2501.10100 (Preprint on arXiv)
1.7. PDF Link
https://arxiv.org/pdf/2501.10100v4.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is enabling robotic systems to efficiently and robustly learn and adapt in complex, real-world environments. Reinforcement Learning (RL) and control theory have advanced robotics, but a significant limitation remains: the lack of adaptation and continuous learning once a policy is deployed on a real system. This means valuable data from real-world interactions is often underutilized, restricting the robot's robustness and ability to handle evolving scenarios.
This problem is crucial because truly intelligent robotic systems need to operate efficiently with limited data and adapt scalably to real-world conditions. Model-free RL algorithms like PPO and SAC require a vast number of interactions, making them impractical for real-world robotics where interactions are costly and potentially unsafe. Therefore, sample-efficient methods are essential.
World models offer a promising solution by simulating environment dynamics, allowing for planning and policy optimization in "imagination." However, developing reliable and generalizable world models for real-world dynamics (which are often nonlinear, stochastic, and partially observable) presents unique challenges:
-
Complexity of Real-World Dynamics: Robotic environments are inherently complex.
-
Long-Horizon Prediction: Models need to predict far into the future, which is prone to
error accumulation. -
Partial Observability: Robots often don't have access to the full state of the environment.
-
Domain-Specific Inductive Biases: Many existing world models incorporate hand-designed structures or physics principles, limiting their scalability and adaptability to new tasks or environments.
-
Sim-to-Real Transfer: Policies trained in simulation often struggle when deployed on physical hardware due to the
reality gap.The paper's innovative idea is to develop a general framework for learning world models (
RWM) that can overcome these challenges without relying on domain-specific assumptions or handcrafted representations. It focuses on robustness and accuracy over long horizons, coupled with a policy optimization method (MBPO-PPO) that enableszero-shot deploymenton physical hardware.
2.2. Main Contributions / Findings
The paper makes the following primary contributions:
-
A Novel Network Architecture and Training Framework: Introduction of
RWM, which employs adual-autoregressive mechanismandself-supervised trainingto learn reliable world models capable oflong autoregressive rollouts. This is a critical property for downstream planning and control, specifically addressingerror accumulationandpartial observabilitywithout domain-specific biases. -
Comprehensive Evaluation Suite: A thorough evaluation of
RWMacross diverse robotic tasks (manipulation, quadruped, humanoid locomotion) is provided. Comparative experiments demonstrate its effectiveness against existing world model frameworks. -
Efficient Policy Optimization Framework and Hardware Generalization: Proposal of
MBPO-PPO, an efficient policy optimization framework that leverages the learnedRWMfor continuous control. It demonstrates effective generalization to real-world scenarios through hardware experiments onANYmal D(quadruped) andUnitree G1(humanoid) systems, achievingzero-shot deploymentwith minimal performance loss.Key findings include:
-
RWMachieves remarkable alignment between predicted and ground truth trajectories over extended autoregressive rollouts, effectively mitigating compounding errors. -
RWMdemonstrates superior robustness under various noise perturbations compared to baseline models. -
RWMtrained with autoregressive training (RWM-AR) consistently outperforms other baseline architectures (MLP, RSSM, Transformer-based) and its teacher-forcing counterpart (RWM-TF) across diverse robotic environments. -
The
MBPO-PPOframework successfully trains policies on the learnedRWMfor complex velocity tracking tasks, leading to stable and robust behaviors in simulation and successfulzero-shot transferto physical robots. This is achieved for over a hundred autoregressive steps, a capability exceeding many existing frameworks.These findings collectively address the challenges of long-horizon prediction, error accumulation, and
sim-to-real transfer, showcasing a scalable and robust framework for adaptive and efficient robotic systems in real-world applications.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
Reinforcement Learning (RL)
Reinforcement Learning (RL) is a paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent is not explicitly programmed with a solution but learns through trial and error. It interacts with the environment, observes the resulting state and reward, and adjusts its policy (strategy for choosing actions) to achieve its goal.
Markov Decision Process (MDP)
A Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. An MDP is defined by a tuple :
-
: A set of possible
statesof the environment. -
: A set of possible
actionsthe agent can take. -
: A
transition functionthat describes the probability of moving to state from state after taking action . This is based on theMarkov property, meaning the next state depends only on the current state and action, not on the entire history. -
: A
reward functionR(s, a, s')that specifies the immediate reward received after transitioning from state to state via action . -
: A
discount factor, which determines the present value of future rewards. Rewards received sooner are generally valued more than those received later.The goal in an
MDPis to find apolicythat maximizes the expected discounted sum of rewards over time.
Partially Observable Markov Decision Process (POMDP)
A Partially Observable Markov Decision Process (POMDP) is an extension of an MDP where the agent cannot directly observe the current state of the environment. Instead, it receives observations that are probabilistically related to the underlying state. A POMDP is defined by :
-
: A set of possible
states(unobservable by the agent). -
: A set of possible
actions. -
: A set of possible
observationsthe agent can receive. -
: A
transition function. -
: A
reward functionR(s, a, s'). -
: An
observation functionthat describes the probability of observing when the environment is in state . -
: A
discount factor.In a
POMDP, the agent must maintain abelief(a probability distribution over the possible states) and use this belief, along with its history of actions and observations, to make decisions. TheRWMframework models the environment as aPOMDP, recognizing that robotic systems often operate with partial information.
World Models
World models are neural network models that learn to predict the future behavior of an environment. They learn the dynamics of the environment, essentially creating a "simulator" of the world. This allows an RL agent to "imagine" future outcomes of its actions without interacting with the real (or a separate simulated) environment. Key components often include:
-
Dynamics Model: Predicts the next state given the current state and action.
-
Observation Model: Predicts the next observation given the next state (especially useful in
latent space models). -
Reward Model: Predicts the reward for a given state and action.
World models are crucial for
model-based reinforcement learningbecause they enableplanning in imaginationand significantly improvesample efficiency.
Model-Based Reinforcement Learning (MBRL)
Model-Based Reinforcement Learning (MBRL) algorithms learn a model of the environment's dynamics, then use this model to either plan optimal actions or train a policy. The general cycle involves:
- Data Collection: Interact with the real environment to collect
trajectories(sequences of states, actions, rewards, observations). - Model Learning: Train a
world modelusing the collected data to predict future states/observations/rewards. - Policy Optimization: Use the learned world model to:
-
Planning: Search for optimal action sequences within the imagined environment (e.g., using
Model Predictive Control - MPC). -
Policy Training: Generate synthetic experience (
imagination rollouts) to train or refine a policy using amodel-free RLalgorithm.MBRLmethods are typically moresample-efficientthanmodel-free RLbecause they can generate a lot of training data from a small amount of real-world interaction.
-
Model-Free Reinforcement Learning
Model-Free Reinforcement Learning algorithms directly learn a policy or value function without explicitly building a model of the environment's dynamics. They learn directly from trial-and-error interactions. Examples include:
-
Policy Optimization:
Proximal Policy Optimization (PPO) -
Value-Based Methods:
Q-learning,Deep Q-Networks (DQN) -
Actor-Critic Methods:
Soft Actor-Critic (SAC)While
model-free RLcan achieve impressive results, especially in complex tasks, it often requires an extremely large number of interactions with the environment, making it costly or impractical for real-world robotics.
Recurrent Neural Networks (RNNs) and Gated Recurrent Units (GRUs)
Recurrent Neural Networks (RNNs) are a class of neural networks designed to process sequential data, where the output at a given time step depends on previous computations. They have internal memory (hidden states) that allows them to capture temporal dependencies.
Gated Recurrent Units (GRUs) are a type of RNN that addresses some of the limitations of simple RNNs, such as the vanishing gradient problem, by using gating mechanisms. GRUs have two main gates:
- Update Gate: Controls how much of the past information (hidden state) should be passed to the future.
- Reset Gate: Controls how much of the past hidden state to forget.
GRUsare simpler and computationally more efficient thanLong Short-Term Memory (LSTM)networks but still effective at handling long-term dependencies in sequential data.RWMuses aGRU-based architecturefor its ability to maintain long-term historical context.
Multilayer Perceptrons (MLPs)
A Multilayer Perceptron (MLP) is a class of feedforward artificial neural network. It consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer. Each node (neuron) in one layer is connected to every node in the subsequent layer with a certain weight. MLPs are general-purpose function approximators and are often used as components within larger neural network architectures, such as the heads of RWM that predict the mean and standard deviation of observations.
Proximal Policy Optimization (PPO)
Proximal Policy Optimization (PPO) is a model-free RL algorithm that is widely used due to its balance of ease of implementation, sample efficiency, and good performance. It is an actor-critic method, meaning it learns both a policy (actor) and a value function (critic). PPO works by performing multiple epochs of mini-batch stochastic gradient ascent on the clipped surrogate objective function. This objective encourages the new policy to stay close to the old policy, preventing large, destructive updates, which contributes to stability.
The clipped surrogate objective in PPO is typically formulated as:
$
L^{CLIP}(\theta) = \hat{E}_t \left[ \min(r_t(\theta) \hat{A}_t, \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t) \right]
$
Where:
-
: The parameters of the policy network.
-
: Empirical average over a batch of samples.
-
: The
probability ratio, which is the ratio of the new policy's probability of taking action in state to the old policy's probability. -
: The
advantage estimateat time , representing how much better (or worse) action was than average for state . -
: Clips the probability ratio to be within the range , where is a hyperparameter (e.g., 0.2). This clipping prevents excessively large policy updates.
The
PPOalgorithm updates the policy by maximizing this objective, optionally with an addedvalue functionloss andentropybonus to encourage exploration.RWMusesPPOwithin itsMBPO-PPOframework for policy optimization.
Soft Actor-Critic (SAC)
Soft Actor-Critic (SAC) is another model-free RL algorithm, known for its off-policy nature and sample efficiency, making it suitable for continuous control tasks. SAC aims to maximize a trade-off between expected reward and entropy. The entropy term encourages exploration, preventing the policy from converging to a suboptimal local optimum too quickly. SAC learns an actor (policy), a critic (Q-function), and often a target Q-network and an entropy temperature parameter. It has shown impressive results in various robotic domains.
Model Predictive Control (MPC)
Model Predictive Control (MPC) is an advanced control strategy that uses an explicit model of the system to predict future behavior. At each time step, MPC solves an optimization problem to find a sequence of control actions that minimizes a cost function over a finite prediction horizon, subject to system constraints. Only the first action from this optimal sequence is applied, and the process is repeated at the next time step with updated system measurements. MPC is often used with world models for planning, as the world model serves as the system dynamics model.
3.2. Previous Works
World Models for Robotics
- Dynamics Models for Policy Optimization [24]: Early applications of world models focused on describing real-world dynamics to optimize control policies.
- Vision-Based World Models:
Visual Foresight[18, 25, 17]: Techniques that learn visual dynamics for planning in high-dimensional sensory spaces. These models predict future video frames or image features to anticipate outcomes of actions.PlaNet[15]: Deep Planning Network, a pioneeringlatent-space dynamics modelthat learns compact representations of the state space and plans directly in this learned latent space.Dreamer[29, 11, 30]: Successor toPlaNet, extends the concept by integrating anactor-criticframework into the latent dynamics model, allowing for simultaneous learning of both the dynamics model and the policy.DreamerV3[30] is a later iteration known for mastering diverse domains.
- Models with Domain-Specific Inductive Biases: Many approaches incorporate known physics principles or structured state representations to improve model fidelity, especially when generalizing beyond training data. Examples include:
- Foot-placement dynamics [21]
- Object invariance [22]
- Granular media interactions [27]
- Frequency domain parameterization [23]
- Rigid body dynamics [20]
- Semi-structured Lagrangian dynamics models [28] These methods, while effective, often require significant domain knowledge and handcrafted features, limiting their versatility.
Model-Based Reinforcement Learning (MBRL)
- Probabilistic Ensembles with Trajectory Sampling (PETS) [12]: One of the pioneering
MBRLmethods. It uses an ensemble of probabilistic neural networks to model the environment dynamics. By using an ensemble,PETScaptures model uncertainty, which helps in robust planning. Trajectories are sampled from this ensemble to evaluate actions. - PlaNet [15] and Dreamer [29, 11, 30]: As described above, these methods leverage
latent dynamics modelsfor planning and policy learning, demonstrating state-of-the-art performance in various control and navigation tasks, including extensions to real-world robotics [19, 31]. - Architectural Variations for Latent Dynamics Models:
Autoregressive Transformers[32]: Used to improve the generation capabilities of latent dynamics models.Variational Autoencoders[33]: Capture the stochastic nature of the environment.
- TD-MPC and TD-MPC2 [34, 35, 36]: Integrate
model-based learningwithModel Predictive Control (MPC)to achieve high-performance control, particularly in dynamic environments. They combine the predictive power of learned models with the optimal control capabilities ofMPC.
Hybrid Model-Based/Model-Free Approaches
- Model-Based Policy Optimization (MBPO) [13]: A notable hybrid approach that combines the sample efficiency of
MBRLwith the robustness ofmodel-free RL.MBPOuses a model-based approach to generate additional synthetic data for planning and policy optimization, but then refines the policy using standardmodel-freeupdates. A key idea is to selectively rely on the learned model only when its predictions are accurate, mitigating the negative effects of model inaccuracies. - Model-based Offline Policy Optimization (MOPO) [37]: Extends
MBPOto theoffline RLsetting, where learning occurs entirely from a fixed dataset of previously collected data without further environment interactions. - Gradient-Based Policy Optimization [38, 39]: Some
MBRLapproaches usefirst-order gradient-based optimizationfor policy learning, where gradients are propagated through the dynamics model to update the policy.Short-Horizon Actor-Critic (SHAC)[38] is an example that leverages differentiable simulation.
3.3. Technological Evolution
The field of robotic control and RL has seen a significant evolution. Initially, model-free RL methods like PPO and SAC gained prominence for their ability to learn complex behaviors without explicit environment models. However, their sample inefficiency limited their application in real-world robotics where data collection is expensive and often hazardous.
This led to a resurgence of model-based RL, leveraging world models to improve sample efficiency. Early world models often relied on domain-specific inductive biases (e.g., known physics, structured state representations) to achieve fidelity, but this restricted their generalizability and adaptability. The development of latent-space dynamics models (like PlaNet and Dreamer) marked a shift towards learning more general black-box models that could operate on high-dimensional observations (e.g., pixels) without requiring explicit state engineering.
However, challenges remained:
-
Error accumulationover long prediction horizons in purelyautoregressivemodels. -
Difficulty in capturing
stochasticityandpartial observabilityaccurately. -
The persistent
sim-to-real gapwhen deploying policies trained on learned models.This paper's work,
RWM, fits into this timeline by addressing these limitations. It moves beyond previousblack-boxmodels by introducing adual-autoregressive mechanismandself-supervised trainingspecifically designed to enhancelong-horizon prediction robustnessand reduceerror accumulationwithout domain-specific biases. This allowsRWMto generate high-fidelity imagined trajectories suitable for robust policy optimization viaMBPO-PPO, ultimately enablingzero-shot transferto real hardware. It represents an advancement in creating more general, reliable, and deployableworld modelsfor robotics.
3.4. Differentiation Analysis
Compared to the main methods in related work, RWM introduces several core differences and innovations:
-
Generalization without Domain-Specific Biases:
- Prior Methods: Many successful
world modelsfor robotics, especially in legged locomotion or manipulation, incorporate strongdomain-specific inductive biases(e.g., foot-placement dynamics [21], object invariance [22], rigid body dynamics [20]). While effective, these hand-crafted representations limit scalability and adaptability to diverse tasks or novel environments. - RWM's Innovation:
RWMis explicitly designed to learnworld modelswithout relying onhandcrafted representationsorspecialized architectural biases. This approach aims for broader applicability and enhancedgeneralizationacross a wide range of robotic systems (manipulation, quadruped, humanoid locomotion) and scenarios.
- Prior Methods: Many successful
-
Robust Long-Horizon Prediction via Dual-Autoregressive Mechanism and Self-Supervised Training:
- Prior Methods (e.g.,
Dreamer,PlaNet,RSSM): Often useteacher-forcingduring training, where ground-truth observations are always fed as input for the next-step prediction. While efficient for training due to parallelization, this can lead to amismatch between training and inference distributions(known asexposure bias) and poor performance during longautoregressive rolloutsat test time, where errors can compound rapidly.DreamerandTD-MPCtypically operate with shorter planning horizons in imagination. - RWM's Innovation:
RWMemploys a noveldual-autoregressive mechanismwithself-supervised trainingoverlong prediction horizons().- Self-supervised Autoregressive Training: The model is trained to predict future observations by using its own predictions recursively as input, mimicking the test-time scenario. This reduces the
distribution mismatch. - Dual-Autoregression: It combines
inner autoregression(updatingGRU hidden stateswithin thecontext horizon) andouter autoregression(feeding predicted observations from theforecast horizonback into the network). This specifically ensures stability and robustness inlong-horizon predictionsand effectively mitigateserror accumulation, even instochasticandpartially observableenvironments. This allowsRWMto optimize policies overhundreds of autoregressive steps, a capability the paper claims exceeds many existing frameworks.
- Self-supervised Autoregressive Training: The model is trained to predict future observations by using its own predictions recursively as input, mimicking the test-time scenario. This reduces the
- Prior Methods (e.g.,
-
Policy Optimization and Zero-Shot Hardware Transfer:
-
Prior Methods (
MBPO,Dreamer,SHAC): WhileMBPOcombines model-based and model-free, andDreamerlearns policies in latent space,RWMexplicitly highlights its ability to optimizePPOpolicies on its learned model (MBPO-PPO) even over extended rollouts.SHACusesfirst-order gradientsthrough the model, which can be unstable withdiscontinuous dynamics. -
RWM's Innovation: The high fidelity and robustness of
RWM'slong-horizon predictionsdirectly translate into the ability to trainPPOpolicies that are stable and effective. This enableszero-shot deploymenton physical hardware (ANYmal D,Unitree G1) with minimal performance loss, which is a significant practical achievement that many model-based methods struggle with. The paper suggests this is the first framework to reliably train policies on a learned neural network simulator without domain-specific knowledge and deploy them on hardware with minimal performance loss.In essence,
RWMdifferentiates itself by prioritizing a general, robust, and accurateworld modelthat handles long-term dependencies and partial observability through its unique training paradigm, which then directly facilitates more stable and deployablemodel-based policy optimizationfor real-world robotic systems.
-
4. Methodology
4.1. Principles
The core idea behind Robotic World Model (RWM) is to create a highly accurate and robust neural network-based simulator of robotic environments (a world model) that can make reliable long-horizon predictions without relying on domain-specific knowledge. This learned world model then serves as an imagined environment for efficiently training reinforcement learning policies. The theoretical basis hinges on modeling the environment as a Partially Observable Markov Decision Process (POMDP) and using self-supervised autoregressive training to overcome challenges like partial observability and error accumulation inherent in long-term forecasting. The intuition is that by training the model to predict its own future inputs, it becomes robust to the distributional shifts encountered during imagination rollouts, making the synthetic experience generated by the model a faithful proxy for real-world interactions for policy optimization.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Reinforcement Learning and World Models Formulation (Section 3.1)
The environment is formally modeled as a Partially Observable Markov Decision Process (POMDP). This choice acknowledges that robotic systems typically do not have access to the complete underlying state of the environment but rather receive observations.
A POMDP is defined by the tuple where:
-
: Denotes the
state space. This is the set of all possible true states of the environment, which are not directly observable by the agent. -
: Denotes the
action space. This is the set of all possible actions the agent can take. -
: Denotes the
observation space. This is the set of all possible observations the agent receives, which are noisy or partial reflections of the true state. -
: Represents the
transition kernel, which captures the environment dynamics. Specifically, it defines the probability of transitioning to a new state given the current state and action . This is denoted as . -
: Is the
reward function, . It maps a state-action-next state transition to a scalarreward. -
: Is the
observation kernel, . It defines the probabilities with which observations are emitted, given the current true state . -
: Is the
discount factor, which balances immediate and future rewards.The
agent(the robot) aims to learn apolicy. This policy maps observations to actions, with parameters . The objective of the policy is to maximize theexpected discounted return, which is the sum of future rewards discounted by : $ \mathbb { E } _ { \pi _ { \theta } } \left[ \sum _ { t \geq 0 } \gamma ^ { t } r _ { t } \right] $ Here,r _ { t }is the reward at time .
World models approximate the environment dynamics and enable policy optimization through simulated environment interactions (learning in imagination). The typical training process for world models involves three iterative steps:
-
Data Collection: Collect real interaction data from the environment by executing actions with a policy.
-
World Model Training: Train the
world modelusing this collected data to learn the environment's dynamics. -
Policy Optimization: Optimize the
policywithin the simulated environment generated by the trainedworld model.The
Robotic World Model (RWM)framework builds upon this core concept but introduces specific architectural and training innovations to enablereliable long-horizon predictionsin complex, stochastic, and partially observable settings, which are crucial for real-world robotics.
4.2.2. Self-supervised Autoregressive Training (Section 3.2)
The RWM framework utilizes a self-supervised autoregressive training mechanism as its backbone. This mechanism is designed to train the world model (, with parameters ) to predict future observations by using both historical observation-action sequences and its own predictions. This strategy aims to ensure robustness over extended rollouts.
Input and Prediction Mechanism:
At each time step , the world model receives a sequence of observation-action pairs spanning historical steps as input. Based on this history, the model predicts the distribution of the next observation.
The predictions are generated autoregressively. This means that after predicting an observation at step , this predicted observation () is then treated as if it were a real observation and appended to the history. It is then combined with the subsequent action () to form the input for predicting the observation at step . This process continues iteratively over a prediction horizon of steps, generating a sequence of future predictions.
The predicted observation steps ahead, denoted as , is sampled from the model's predicted distribution, which is conditioned on the historical context and the model's own prior predictions. The formula for this is:
$ \begin{array} { r } { o _ { t + k } ^ { \prime } \sim p _ { \phi } \left( \cdot { \mathrm { ~ | ~ } } o _ { t - M + k : t } , o _ { t + 1 : t + k - 1 } ^ { \prime } , a _ { t - M + k : t + k - 1 } \right) . } \end{array} $ (Equation 1)
Where:
- : The predicted observation at time step . The indicates it's sampled from the predicted distribution.
- : The
world modelparameterized by , predicting the distribution of the next observation. The placeholder signifies the output (the predicted observation). - : The actual (ground truth) observations from history, specifically from
M-k+1steps before the current prediction up to the current time . This maintains a window of real observations. - : The
model's own predicted observationsfrom the previousk-1steps within the forecast horizon. This is the core of theautoregressivemechanism. - : The sequence of actions, including historical actions (real) and future actions (which are typically sampled from the policy during imagination).
Privileged Information Prediction:
In addition to observations, the RWM also predicts privileged information , such as contact forces. This serves as an auxiliary learning objective, implicitly embedding crucial information for making accurate long-term predictions, especially in contact-rich robotics tasks.
Optimization Objective:
The world model is optimized by minimizing a multi-step prediction error, which combines the discrepancy between predicted and true observations (L _ { o }) and predicted and true privileged information (L _ { c }). The loss function is:
$ \mathcal { L } = \frac { 1 } { N } \sum _ { k = 1 } ^ { N } \alpha ^ { k } \left[ L _ { o } \left( o _ { t + k } ^ { \prime } , o _ { t + k } \right) + L _ { c } \left( c _ { t + k } ^ { \prime } , c _ { t + k } \right) \right] , $ (Equation 2)
Where:
-
: The total loss function to be minimized.
-
: The
forecast horizon, which is the number of future steps the model is trained to predict autoregressively. -
: A
decay factor(), which can be used to weight predictions differently based on their distance into the future (e.g., giving more weight to closer predictions if ). The paper uses , meaning no decay. -
: A loss function (e.g., mean squared error, negative log-likelihood if is a distribution) that quantifies the discrepancy between the predicted observation and the true observation .
-
: A similar loss function for the
privileged information, comparing the predicted and true .This
autoregressive training objectiveforces the model'shidden statesto learn representations that are robust enough to support accurate and reliable long-horizon predictions, even when fed its own generated data.
Training Data Construction:
Training data is prepared by sliding a window of size over collected trajectories. This ensures that for each training sample, there is sufficient historical context ( steps) and corresponding future ground truth for prediction targets ( steps). Reparameterization tricks are applied during training to enable effective end-to-end optimization, allowing gradients to flow through the stochastic sampling process.
Benefits of Autoregressive Training:
- Mitigates Error Accumulation: By training the model on its own predictions, it becomes more robust to the compounding errors that naturally occur during long
autoregressive rolloutsat test time. - Addresses Partial Observability: Incorporating historical observations () allows
RWMto capture unobservable dynamics and infer missing information, crucial inpartially observableenvironments. - Reduces Distribution Mismatch: This training scheme exposes the model to the distribution of inputs it will encounter during inference (its own predictions), reducing the
exposure biasproblem common inteacher-forcingmethods. - Generalization: Eliminates the need for
handcrafted representationsordomain-specific inductive biases, enhancinggeneralizationacross diverse tasks.
Comparison with Teacher-Forcing:
Figure 2 illustrates the difference between autoregressive training (Fig. 2a) and teacher-forcing training (Fig. 2b).
- Autoregressive Training (Fig. 2a): For a history horizon and forecast horizon , the model takes as input, predicts . Then it takes to predict . This matches how the model will be used during
imagination. - Teacher-Forcing Training (Fig. 2b): This is a special case of
autoregressive trainingwhere theforecast horizon. It always uses theground truth observationto predict . While this allows for greater parallelization during training, it doesn't prepare the model for scenarios where its own predictions are fed back, leading to less robustness toerror accumulation.
Network Architecture:
While the autoregressive training framework is architecture-agnostic, RWM utilizes a GRU-based architecture. GRUs are chosen for their ability to maintain long-term historical context while being computationally efficient and operating on low-dimensional inputs. The network outputs the mean and standard deviation of a Gaussian distribution for the next observation, reflecting the stochasticity of the environment.
Dual-Autoregressive Mechanism:
The RWM framework introduces a specific dual-autoregressive mechanism (visualized in Figure S6):
-
Inner Autoregression: Within the
context horizon, theGRU hidden statesare updatedautoregressivelyafter processing each historical step. This ensures that theGRUeffectively captures and maintains a rich, long-term memory of past events. -
Outer Autoregression: This refers to the process described above where predicted observations from the
forecast horizonare fed back into the network as inputs for subsequent predictions. This is the primary mechanism for mitigatingerror accumulationand training forlong-horizon robustness.This dual mechanism provides a robust way to handle
long-term dependenciesandtransitions, makingRWMsuitable for complex robotics applications.
The following figure (Figure S6 from the original paper) visualizes the dual-autoregressive mechanism:
Figure S6: Dual-autoregressive mechanism employed in RWM. Inner autoregression updates GRU hidden states after each historical step within the context horizon, while outer autoregression feeds predicted observations from the forecast horizon back into the network. The dashed arrows denote the sequential autoregressive prediction steps, highlighting robustness to long-term dependencies and transitions.
Figure S6: Dual-autoregressive mechanism employed in RWM. Inner autoregression updates GRU hidden states after each historical step within the context horizon, while outer autoregression feeds predicted observations from the forecast horizon back into the network. The dashed arrows denote the sequential autoregressive prediction steps, highlighting robustness to long-term dependencies and transitions.
4.2.3. Policy Optimization on Learned World Models (Section 3.3)
The policy optimization in RWM is performed using the learned world model, drawing inspiration from Model-Based Policy Optimization (MBPO) [13] and the Dyna algorithm [42]. This approach combines model-based imagination (generating synthetic experience) with model-free reinforcement learning (using an algorithm like PPO to update the policy) to achieve efficient and robust policy optimization.
Action Generation in Imagination:
During imagination rollouts, the actions are recursively generated by the policy (with parameters ). These actions are conditioned on the observations predicted by the world model , which itself is conditioned on its previous predictions. This creates a closed loop in the imagined environment.
The actions at time in imagination can be written as:
$ \begin{array} { r } { a _ { t + k } ^ { \prime } \sim \pi _ { \theta } \left( \cdot \mid o _ { t + k } ^ { \prime } \right) , } \end{array} $ (Equation 3)
Where:
- : The action sampled from the policy at time in the imagined environment.
- : The
policynetwork, parameterized by , which outputs a distribution over actions conditioned on the current imagined observation. - : The observation at time , which is predicted
autoregressivelyby theworld modelas described in Equation 1.
Reward Calculation:
Rewards for these imagined transitions are computed from the imagined observations and privileged information predicted by the world model.
Policy Optimization Algorithm (MBPO-PPO):
The overall policy optimization process is outlined in Algorithm 1. The approach is denoted as MBPO-PPO because it adapts the MBPO framework to use PPO for policy updates, leveraging the RWM's robust long-horizon rollouts.
The following is Algorithm 1 from the original paper:
| 1: Initialize policy πθ, world model pφ, and replay buffer D |
| 2: for learning iterations = 1, 2, . . . do |
| 3: Collect observation-action pairs in D by interacting with the environment using πθ |
| 4: Update pφ with autoregressive training using data sampled from D according to Eq. 2 |
| 5: Initialize imagination agents with observations sampled from D |
| 6: Roll out imagination trajectories using π0 and pφ for T steps according to Eq. 3 |
| 7: Update πθ using PPO or another reinforcement learning algorithm end for |
Let's break down each step:
-
Initialize policy , world model , and replay buffer :
- The
policynetwork (the agent's brain for deciding actions) and theworld model(the learned simulator) are initialized with random or pre-trained weights. - A
replay bufferis initialized to store real interaction data.
- The
-
for learning iterations = 1, 2, . . . do: The learning process proceeds iteratively.
-
Collect observation-action pairs in by interacting with the environment using :
- The current
policyis deployed in the actual (or high-fidelity simulated) environment for a short period. - The agent collects
observation-action pairs() along with subsequent observations () and rewards (). - This real-world interaction data is stored in the
replay buffer. This step is crucial for gathering fresh, accurate experience.
- The current
-
Update with autoregressive training using data sampled from according to Eq. 2:
- The
world modelis updated using theautoregressive trainingscheme described in Section 3.2 (using Equation 2 as the loss function). - Training data for the
world modelis sampled from thereplay buffer. This step refines theworld modelto better match the real environment's dynamics based on the latest experience.
- The
-
Initialize imagination agents with observations sampled from :
- To begin
imagination rollouts, multiple "imagination agents" are initialized. Each agent starts from a realobservation(orstateif available) sampled from thereplay buffer. This grounds the imagined trajectories in realistic starting points.
- To begin
-
Roll out imagination trajectories using and for steps according to Eq. 3:
- For each
imagination agent, thepolicyproposes an action based on the current imagined observation. - The
world modelthen predicts the next observation and reward based on this action and the current imagined observation. - This process is repeated for steps, generating
imagined trajectories(sequences of imagined observations, actions, and rewards). The actions are generated using Equation 3, and the observations are predicted autoregressively using the model. These imagined trajectories effectively augment the real data.
- For each
-
Update using PPO or another reinforcement learning algorithm:
- The
policyis updated using amodel-free reinforcement learningalgorithm, specificallyProximal Policy Optimization (PPO)in this framework. - The training data for
PPOcomes from both the real data in and, crucially, theimagined trajectoriesgenerated in step 6. The imagined data allowsPPOto learn and explore efficiently without needing extensive real-world interactions.
- The
-
end for: The loop continues, iteratively collecting new real data, updating the
world model, generating imagined data, and refining thepolicy.
The training diagram for MBPO-PPO is visualized in Figure S7:
Figure S7: Model-Based Policy Optimization with learned world models. The framework combines real environment interactions with simulated rollouts for efficient policy optimization. Observation and action pairs from the environment are stored in a replay buffer and used to train the autoregressive world model. Imagination rollouts using the learned model predict future states over a horizon of , providing trajectories for policy updates through reinforcement learning algorithms.
Figure S7: Model-Based Policy Optimization with learned world models. The framework combines real environment interactions with simulated rollouts for efficient policy optimization. Observation and action pairs from the environment are stored in a replay buffer and used to train the autoregressive world model. Imagination rollouts using the learned model predict future states over a horizon of , providing trajectories for policy updates through reinforcement learning algorithms.
Challenges and RWM's Robustness:
While PPO is a strong performer, training it on learned world models is challenging. Model inaccuracies can be exploited by the policy, leading to a reality gap where policies perform well in imagination but poorly in the real world. This problem is exacerbated by the extended autoregressive rollouts required for PPO, which can compound prediction errors.
Despite these challenges, RWM demonstrates significant robustness by successfully optimizing policies over hundreds of autoregressive steps with MBPO-PPO. This is a key differentiator, as many existing frameworks (MBPO, Dreamer, TD-MPC) struggle with such long horizons due to model inaccuracies and error accumulation. This result underscores the accuracy and stability of RWM's training method and its ability to synthesize policies that are reliably deployable on hardware.
5. Experimental Setup
The experiments conducted aim to validate RWM's accuracy, robustness, and architectural design choices across diverse robotic systems and environments. They also demonstrate its effectiveness with MBPO-PPO for policy optimization in simulation and real-world deployment.
5.1. Datasets
The world model is trained using simulation data generated by a velocity tracking policy. This policy interacts with a simulated environment to produce trajectories which are then used to train RWM.
The robots used for evaluation are:
-
ANYmal D: A
quadrupedal robot(four-legged). -
Unitree G1: A
humanoid robot.The simulation environment used is
Isaac Lab[43], a unified simulation framework for interactive robot learning.
The detailed observation, privileged information, and action spaces for the world model and the policy are provided in the supplementary material.
5.1.1. World Model Observation Space
The world model receives observations about the robot's state. The structure of this observation space is detailed in Table S2.
The following are the results from Table S2 of the original paper:
| Entry | Symbol | Dimensions | Entry | Symbol | Dimensions |
|---|---|---|---|---|---|
| ANYmal D | Unitree G1 | ||||
| base linear velocity | U | 0:3 | base linear velocity | U | 0:3 |
| base angular velocity | 3 | 3:6 | base angular velocity | 3 | 3:6 |
| projected gravity | g | 6:9 | projected gravity | g | 6:9 |
| joint positions | q | 9:21 | joint positions | q | 9:38 |
| joint velocities | q | 21:33 | joint velocities | q | 38:67 |
| joint torques | τ | 33:45 | joint torques | τ | 67:96 |
Where:
base linear velocity (v): The robot's linear velocity in the x, y, z directions in its body frame.- : The robot's angular velocity (roll, pitch, yaw rates) in its body frame.
projected gravity (g): A measurement of the gravity vector projected into the robot's body frame, which indicates the robot's orientation relative to gravity (e.g., pitch and roll).joint positions (q): The current angular positions of the robot's joints.joint velocities (\dot{q}): The current angular velocities of the robot's joints.joint torques (\tau): The torques applied at the robot's joints. The dimensions indicate the slice of the observation vector corresponding to each entry. For example, for ANYmal D,base linear velocityoccupies indices 0 to 2 (3 dimensions).
5.1.2. World Model Privileged Information Space
Privileged information is data that is useful for learning but might not be directly available during real-world deployment or is difficult to obtain. In this context, it's used as an auxiliary learning objective for the world model.
The following are the results from Table S3 of the original paper:
| Entry | Symbol | Dimensions | Entry | Symbol | Dimensions |
|---|---|---|---|---|---|
| ANYmal D | Unitree G1 | ||||
| knee contact | 0:4 | body contact | 0:26 | ||
| foot contact | 4:8 | foot height | 26:28 | ||
| foot velocity | 28:30 |
Where:
knee contact: Binary or continuous values indicating contact force/state for the robot's knees.foot contact: Binary or continuous values indicating contact force/state for the robot's feet.body contact: For Unitree G1, this likely refers to contact forces across various parts of the robot's body.foot height: The height of the robot's feet relative to the ground.foot velocity: The velocity of the robot's feet.
5.1.3. World Model Action Space
The action space defines the control signals sent to the robot.
The following are the results from Table S4 of the original paper:
| Entry | Symbol | Dimensions | Entry | Symbol | Dimensions |
|---|---|---|---|---|---|
| ANYmal D | Unitree G1 | ||||
| joint position targets | q* | 0:12 | joint position targets | q* | 0:29 |
Where:
joint position targets (q*): These are the desired target angular positions for the robot's joints, which a low-level controller then tries to achieve. This is a common action representation inRLfor legged robots.
5.1.4. Policy Observation Space
The policy, which learns to control the robot, receives a different set of observations tailored to the control task (velocity tracking).
The following are the results from Table S5 of the original paper:
| Entry | Symbol | Dimensions | Entry | Symbol | Dimensions |
|---|---|---|---|---|---|
| ANYmal D | Unitree G1 | ||||
| base linear velocity | U | 0:3 | base linear velocity | U | 0:3 |
| base angular velocity | 3 | 3:6 | base angular velocity | 3 | 3:6 |
| projected gravity | g | 6:9 | projected gravity | g | 6:9 |
| velocity command | c | 9:12 | velocity command | c | 9:12 |
| joint positions | q | 12:24 | joint positions | q | 12:41 |
| joint velocities | q | 24:36 | joint velocities | q | 41:70 |
Where:
base linear velocity (v): Same as the world model observation.- : Same as the world model observation.
projected gravity (g): Same as the world model observation.velocity command (c): The desired linear (x, y) and angular (z) velocities that the policy is trying to achieve. This makes the policygoal-conditioned.joint positions (q): Same as the world model observation.joint velocities (\dot{q}): Same as the world model observation.
5.2. Evaluation Metrics
5.2.1. Autoregressive Prediction Error ()
Conceptual Definition: This metric quantifies how accurately the world model can predict future observations (or states) when it uses its own predictions as input for subsequent steps, mimicking its behavior during imagination rollouts. A lower error indicates that the model is robust to error accumulation and can maintain fidelity over long horizons, which is critical for model-based policy optimization. The paper mentions it as a relative prediction error.
Mathematical Formula: The paper does not provide an explicit mathematical formula for the relative prediction error . However, based on common practice in world model evaluation, it is generally calculated as the normalized difference between the predicted trajectory and the ground truth trajectory. A common way to define relative prediction error for a sequence of predictions could be:
$
e = \frac{1}{N \cdot D} \sum_{k=1}^{N} \frac{|o_{t+k}^\prime - o_{t+k}|2}{|o{t+k}|2 + \epsilon}
$
Or, more simply, if it refers to the average magnitude of error relative to the magnitude of the actual observation:
$
e = \frac{1}{N} \sum{k=1}^{N} \frac{|o_{t+k}^\prime - o_{t+k}|_2}{\text{scale_factor}}
$
Given the context, the paper likely uses a metric that averages the L2 norm (Euclidean distance) of the difference between predicted and ground-truth observations, potentially normalized by some scale factor to make it "relative" and comparable across different observation dimensions or environments.
Symbol Explanation (based on likely interpretation):
- : The relative autoregressive prediction error.
- : The forecast horizon (number of steps predicted).
- : The dimensionality of the observation space.
- : The
world model's predicted observation at time step . - : The true (ground truth) observation at time step .
- : The
Euclidean (L2) norm, measuring the distance between vectors. - : A normalization constant (e.g., standard deviation of observations, or maximum possible observation value) to make the error "relative."
- : A small constant to prevent division by zero, if normalizing by the magnitude of the ground truth.
5.2.2. Policy Mean Reward ()
Conceptual Definition: This metric represents the average cumulative reward obtained by the trained policy over an episode or a set of episodes. In reinforcement learning, the goal is to maximize this cumulative reward. A higher mean reward indicates a more effective policy that successfully achieves its objectives (e.g., velocity tracking) while adhering to desired behaviors (e.g., maintaining balance, avoiding undesired contacts). The paper distinguishes between estimated rewards (computed from the world model's predictions) and ground truth rewards (reported by the simulator).
Mathematical Formula: For a single episode of length , the discounted return (cumulative reward) is:
$
G = \sum_{t=0}^{T_{episode}-1} \gamma^t r_t
$
The policy mean reward would then be the average of these discounted returns over many episodes.
$
\text{Mean Reward} = \frac{1}{K} \sum_{i=1}^{K} G_i
$
Symbol Explanation:
- : The discounted return for a single episode.
- : The length of an episode.
- : The reward received at time step .
- : The
discount factor. - : The total number of evaluation episodes.
- : The discounted return for the -th episode.
5.2.3. Reward Functions
The total reward for the policy is a sum of several weighted terms. These terms are designed to encourage desired behaviors (e.g., tracking velocity) and penalize undesirable ones (e.g., high joint torques, undesired contacts). The weights for each term are detailed in Table S6.
The following are the results from Table S6 of the original paper:
| Symbol | Value | Symbol | Value | Symbol | Value | Symbol | Value |
|---|---|---|---|---|---|---|---|
| ANYmal D | Unitree G1 | ||||||
| Wvxy | 1.0 | Wωz | 0.5 | Wvxy | 1.0 | Wωz | 0.5 |
| Wvz | -2.0 | Wωxy | -0.05 | Wvz | −2.0 | Wωxy | -0.05 |
| Wqt | -2.5e-5 | Wq | -2.5e-7 | Wqt | -2.5e-5 | Wq | -2.5e-7 |
| W | -0.01 | Wfa | 0.5 | W | −0.05 | Wfa | 0.0 |
| Wc | -1.0 | Wg | -5.0 | Wc | -1.0 | Wg | -5.0 |
| Wfc | 0.0 | Wqd | 0.0 | Wfc | 1.0 | Wqd | −1.0 |
Individual Reward Terms:
- Linear velocity tracking (x, y): Rewards the policy for matching the commanded linear velocity in the horizontal plane.
  $r_{v_{xy}} = w_{v_{xy}} e^{-\|c_{xy} - v_{xy}\|_2^2 / \sigma_{v_{xy}}^2}$
  Where:
  - $w_{v_{xy}}$: Weight for this reward term (e.g., 1.0).
  - $e$: Euler's number (base of the natural logarithm).
  - $c_{xy}$: Commanded base linear velocity in the x, y plane.
  - $v_{xy}$: Current base linear velocity in the x, y plane.
  - $\|\cdot\|_2^2$: Squared Euclidean distance (squared L2 norm).
  - $\sigma_{v_{xy}}$: A temperature factor that controls the sensitivity of the exponential reward; a smaller $\sigma_{v_{xy}}$ makes the reward drop faster as the error increases.
- Angular velocity tracking (z): Rewards the policy for matching the commanded angular velocity around the vertical (z) axis.
  $r_{\omega_z} = w_{\omega_z} e^{-\|c_{z} - \omega_{z}\|_2^2 / \sigma_{\omega_z}^2}$
  Where:
  - $w_{\omega_z}$: Weight for this reward term (e.g., 0.5).
  - $c_z$: Commanded base angular velocity around the z axis.
  - $\omega_z$: Current base angular velocity around the z axis.
  - $\sigma_{\omega_z}$: A temperature factor.
- Linear velocity (z): Penalizes vertical velocity, encouraging the robot to maintain a stable height.
  $r_{v_z} = w_{v_z} v_z^2$
  Where:
  - $w_{v_z}$: Weight for this reward term (e.g., -2.0, negative for penalty).
  - $v_z$: Current base vertical velocity; the squared term makes the penalty independent of direction.
- Angular velocity (x, y): Penalizes roll and pitch angular velocities, encouraging the robot to stay level.
  $r_{\omega_{xy}} = w_{\omega_{xy}} \|\omega_{xy}\|_2^2$
  Where:
  - $w_{\omega_{xy}}$: Weight for this reward term (e.g., -0.05, negative for penalty).
  - $\omega_{xy}$: Current base roll and pitch angular velocities.
- Joint torque: Penalizes large joint torques, encouraging energy-efficient and smooth movements.
  $r_{q_\tau} = w_{q_\tau} \|\tau\|_2^2$
  Where:
  - $w_{q_\tau}$: Weight for this reward term (e.g., -2.5e-5, negative for penalty).
  - $\tau$: Joint torques.
- Joint acceleration: Penalizes high joint accelerations, promoting smoother motions.
  $r_{\ddot{q}} = w_{\ddot{q}} \|\ddot{q}\|_2^2$
  Where:
  - $w_{\ddot{q}}$: Weight for this reward term (e.g., -2.5e-7, negative for penalty).
  - $\ddot{q}$: Joint accelerations.
- Action rate: Penalizes rapid changes in actions (joint position targets), promoting smoother control signals.
  $r_{\dot{a}} = w_{\dot{a}} \|a' - a\|_2^2$
  Where:
  - $w_{\dot{a}}$: Weight for this reward term (e.g., -0.01, negative for penalty).
  - $a'$: Previous action.
  - $a$: Current action.
- Feet air time: Rewards keeping feet in the air for appropriate durations (e.g., during the swing phase of locomotion).
  $r_{f_a} = w_{f_a} t_{f_a}$
  Where:
  - $w_{f_a}$: Weight for this reward term (e.g., 0.5 for ANYmal D, 0.0 for Unitree G1).
  - $t_{f_a}$: Sum of the time for which the feet are in the air.
- Undesired contacts: Penalizes contacts by parts of the robot other than the feet (e.g., knees, body).
  $r_c = w_c c_u$
  Where:
  - $w_c$: Weight for this reward term (e.g., -1.0, negative for penalty).
  - $c_u$: Count of undesired contacts.
- Flat orientation: Penalizes deviations from a flat (level) orientation, encouraging a stable posture.
  $r_g = w_g g_{xy}^2$
  Where:
  - $w_g$: Weight for this reward term (e.g., -5.0, negative for penalty).
  - $g_{xy}$: The x, y components of the projected gravity vector (non-zero when the robot is pitched or rolled).
- Foot clearance: Rewards maintaining sufficient clearance height for the swinging feet to avoid obstacles.
  $r_{f_c} = w_{f_c} h_{f_c}$
  Where:
  - $w_{f_c}$: Weight for this reward term (e.g., 0.0 for ANYmal D, 1.0 for Unitree G1).
  - $h_{f_c}$: Clearance height of the swing feet.
- Joint deviation: Penalizes deviation of joint positions from a default (resting) pose.
  $r_{q_d} = w_{q_d} \|q - q_0\|_1$
  Where:
  - $w_{q_d}$: Weight for this reward term (e.g., 0.0 for ANYmal D, -1.0 for Unitree G1).
  - $q$: Current joint positions.
  - $q_0$: Default joint positions.
  - $\|\cdot\|_1$: L1 norm (sum of absolute differences).
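The sketch below (not from the paper) illustrates how a few of the weighted terms above could be combined into a single scalar reward. The observation dictionary keys, the temperature values `sigma_v` and `sigma_w`, and the selection of terms are assumptions chosen for illustration; the weights are taken from the ANYmal D column of Table S6.

```python
import numpy as np

# Hypothetical weights (subset of the ANYmal D column of Table S6).
weights = {"v_xy": 1.0, "w_z": 0.5, "v_z": -2.0, "q_tau": -2.5e-5, "action_rate": -0.01}

def total_reward(obs: dict, w: dict, sigma_v: float = 0.25, sigma_w: float = 0.25) -> float:
    """Combine a subset of the weighted reward terms listed above."""
    r = 0.0
    # Linear velocity tracking (x, y): exponential kernel on the squared tracking error.
    err_xy = np.sum((obs["cmd_vel_xy"] - obs["base_vel_xy"]) ** 2)
    r += w["v_xy"] * np.exp(-err_xy / sigma_v**2)
    # Angular velocity tracking (z).
    err_wz = (obs["cmd_yaw_rate"] - obs["base_yaw_rate"]) ** 2
    r += w["w_z"] * np.exp(-err_wz / sigma_w**2)
    # Penalties: vertical velocity, joint torques, and action rate.
    r += w["v_z"] * obs["base_vel_z"] ** 2
    r += w["q_tau"] * np.sum(obs["joint_torques"] ** 2)
    r += w["action_rate"] * np.sum((obs["action"] - obs["prev_action"]) ** 2)
    return float(r)

obs = {
    "cmd_vel_xy": np.array([1.0, 0.0]), "base_vel_xy": np.array([0.9, 0.05]),
    "cmd_yaw_rate": 0.0, "base_yaw_rate": 0.02, "base_vel_z": 0.01,
    "joint_torques": np.zeros(12), "action": np.zeros(12), "prev_action": np.zeros(12),
}
print(total_reward(obs, weights))
```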
5.3. Baselines
5.3.1. World Model Baselines
For comparing autoregressive trajectory prediction errors, RWM is evaluated against several common neural network architectures and a variant of itself:
- MLP: A Multilayer Perceptron, a basic feedforward neural network used to predict the next state. This serves as a fundamental baseline showing the benefits of recurrent models.
- RSSM (Recurrent State-Space Model): This architecture, notably used in PlaNet [15] and Dreamer [29, 11, 30], integrates recurrent processing with a latent state space to model dynamics. It is a strong baseline for latent dynamics models.
- Transformer-based architectures [41, 45]: Transformers are powerful sequence models built on attention mechanisms. They have been applied as world models [32] and in RL (e.g., Decision Transformer [41]), representing a state-of-the-art sequence-modeling baseline.
- RWM-TF (RWM with teacher-forcing): A variant of RWM trained with the teacher-forcing paradigm (forecast horizon $N = 1$, i.e., one-step prediction using ground-truth observations as input). This baseline highlights the importance and benefits of autoregressive training (RWM-AR) over teacher-forcing; a sketch contrasting the two training regimes follows this list.
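The following sketch contrasts teacher-forcing and autoregressive training in a single loss function: with `forecast_n = 1` the model is only ever asked for one-step predictions from ground-truth context (teacher-forcing), whereas `forecast_n > 1` forces it to consume its own predictions during training. The step-wise model interface `model(obs, act, hidden) -> (next_obs_prediction, hidden)` is an assumption for illustration, not the paper's API.

```python
import torch

def multi_step_loss(model, obs_seq, act_seq, history_m: int, forecast_n: int):
    """Autoregressive training loss: after consuming M ground-truth steps, the
    model predicts the next N observations from its OWN predictions.
    obs_seq: (history_m + forecast_n, obs_dim), act_seq: (history_m + forecast_n, act_dim).
    """
    hidden, pred = None, None
    # Warm up the recurrent state on the ground-truth history (teacher input).
    for t in range(history_m):
        pred, hidden = model(obs_seq[t], act_seq[t], hidden)  # pred estimates obs_seq[t + 1]

    # Roll forward N steps, feeding the model's own predictions back in.
    loss = torch.zeros(())
    for k in range(forecast_n):
        target = obs_seq[history_m + k]
        loss = loss + torch.mean((pred - target) ** 2)
        if k + 1 < forecast_n:
            pred, hidden = model(pred, act_seq[history_m + k], hidden)
    return loss / forecast_n
```

With this formulation, RWM-TF corresponds to calling `multi_step_loss(..., forecast_n=1)` while RWM-AR uses a larger forecast horizon, which is what exposes the model to compounding errors during training.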
5.3.2. Policy Optimization Baselines
For evaluating the policy learning and hardware transfer capabilities of MBPO-PPO, the following baselines are used:
- Short-Horizon Actor-Critic (SHAC) [38]: An MBRL method that employs a first-order gradient-based optimization approach, propagating gradients through the world model to optimize the policy. It is noted for its ability to leverage differentiable simulation.
- DreamerV3 [30]: A highly advanced MBRL framework that integrates a latent-space dynamics model with an actor-critic framework. It is known for its sample efficiency and robustness in continuous control tasks, mastering diverse domains.
5.4. Network Architectures
5.4.1. RWM Architecture
The RWM model consists of a GRU base and MLP heads.
The following are the results from Table S7 of the original paper:
| Component | Type | Hidden Shape | Activation |
|---|---|---|---|
| base | GRU | 256, 256 | — |
| heads | MLP | 128 | ReLU |
Where:
- base: The main recurrent part of the model, a GRU (Gated Recurrent Unit). It likely has two layers with 256 hidden units each, as indicated by 256, 256.
- heads: MLP (Multilayer Perceptron) networks that branch off the GRU's output and predict the mean and standard deviation of the next observation as well as the privileged information. Each head has a hidden layer of 128 units and uses the ReLU (Rectified Linear Unit) activation function.
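A minimal PyTorch sketch consistent with Table S7 is given below. The head outputs (next-observation mean and log-std plus privileged information) and the example dimensions are assumptions based on the description above, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RWMLikeModel(nn.Module):
    """Sketch of a GRU base with small MLP prediction heads (cf. Table S7)."""

    def __init__(self, obs_dim: int, act_dim: int, priv_dim: int, hidden_dim: int = 256):
        super().__init__()
        # Two-layer GRU base with 256 hidden units per layer.
        self.base = nn.GRU(obs_dim + act_dim, hidden_dim, num_layers=2, batch_first=True)

        # One-hidden-layer (128 units, ReLU) heads branching off the GRU features.
        def head(out_dim: int) -> nn.Sequential:
            return nn.Sequential(nn.Linear(hidden_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

        self.obs_mean = head(obs_dim)
        self.obs_logstd = head(obs_dim)
        self.priv = head(priv_dim)

    def forward(self, obs, act, hidden=None):
        # obs: (batch, T, obs_dim), act: (batch, T, act_dim)
        features, hidden = self.base(torch.cat([obs, act], dim=-1), hidden)
        return self.obs_mean(features), self.obs_logstd(features), self.priv(features), hidden

# Example with hypothetical dimensions (48-D observations, 12 joints, 8-D privileged info).
model = RWMLikeModel(obs_dim=48, act_dim=12, priv_dim=8)
mean, logstd, priv, h = model(torch.randn(4, 32, 48), torch.randn(4, 32, 12))
```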
5.4.2. Baseline Architectures
The architectures for the baseline world models are provided for comparison.
The following are the results from Table S8 of the original paper:
| Network | Parameter | Value |
|---|---|---|
| MLP | hidden shape | 256, 256 |
| | activation | ReLU |
| RSSM | type | GRU |
| | hidden size | 256 |
| | layers | 2 |
| | latent dimension | 64 |
| | prior type | categorical |
| | categories | 32 |
| Transformer | type | decoder |
| | dimension | 64 |
| | heads | 8 |
| | layers | 2 |
| | context length | 32 |
| | positional encoding | sinusoidal |
Where:
- MLP: Similar to RWM's heads, it has two hidden layers of 256 units each and uses ReLU activation.
- RSSM (Recurrent State-Space Model):
  - type: Uses a GRU for its recurrent component.
  - hidden size: 256 units per layer.
  - layers: 2 recurrent layers.
  - latent dimension: The dimensionality of the continuous latent state is 64.
  - prior type: Uses a categorical distribution for the discrete latent state.
  - categories: The number of discrete categories in the latent state is 32.
- Transformer:
  - type: A decoder-only Transformer architecture.
  - dimension: Model dimension of 64.
  - heads: 8 attention heads.
  - layers: 2 Transformer layers.
  - context length: 32, meaning it attends over the 32 previous tokens/steps.
  - positional encoding: Uses sinusoidal positional encoding to incorporate sequence-order information.
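For concreteness, a compact sketch of a decoder-only Transformer dynamics model matching the configuration in Table S8 (model dimension 64, 8 heads, 2 layers, context length 32, sinusoidal positional encoding) is shown below. The class name and input dimensions are illustrative, not the baseline's actual code.

```python
import math
import torch
import torch.nn as nn

class TinyDecoderOnlyDynamics(nn.Module):
    """Sketch of a decoder-only Transformer world-model baseline (cf. Table S8)."""

    def __init__(self, obs_dim: int, act_dim: int, d_model: int = 64, context: int = 32):
        super().__init__()
        self.embed = nn.Linear(obs_dim + act_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, obs_dim)
        # Fixed sinusoidal positional encoding over the context window.
        pos = torch.arange(context).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(context, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, obs, act):
        # obs, act: (batch, T, dim) with T <= context; a causal mask enforces autoregression.
        x = self.embed(torch.cat([obs, act], dim=-1)) + self.pe[: obs.shape[1]]
        mask = nn.Transformer.generate_square_subsequent_mask(obs.shape[1])
        return self.out(self.blocks(x, mask=mask))

model = TinyDecoderOnlyDynamics(obs_dim=48, act_dim=12)
pred = model(torch.randn(2, 32, 48), torch.randn(2, 32, 12))  # (2, 32, 48) next-obs predictions
```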
5.4.3. MBPO-PPO Policy and Value Function Architecture
The policy and value function networks used within the MBPO-PPO framework are MLPs.
The following are the results from Table S9 of the original paper:
| Network | Type | Hidden Shape | Activation |
|---|---|---|---|
| policy | MLP | 128, 128, 128 | ELU |
| value function | MLP | 128, 128, 128 | ELU |
Where:
- policy: The actor network, an MLP with three hidden layers of 128 units each and the ELU (Exponential Linear Unit) activation function.
- value function: The critic network, also an MLP with three hidden layers of 128 units each and ELU activation.
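A minimal sketch of the Table S9 networks, assuming hypothetical observation and action dimensions:

```python
import torch.nn as nn

def mlp(in_dim: int, out_dim: int, hidden: int = 128) -> nn.Sequential:
    """Three hidden layers of 128 units with ELU activation, as in Table S9."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ELU(),
        nn.Linear(hidden, hidden), nn.ELU(),
        nn.Linear(hidden, hidden), nn.ELU(),
        nn.Linear(hidden, out_dim),
    )

policy_net = mlp(in_dim=48, out_dim=12)  # maps observations to action means
value_net = mlp(in_dim=48, out_dim=1)    # maps observations to a scalar state value
```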
5.5. Training Parameters
All learning networks and algorithms are implemented in PyTorch 2.4.0 with CUDA 12.6 and trained on an NVIDIA RTX 4090 GPU.
5.5.1. RWM Training Parameters
The training information for the RWM itself is summarized below.
The following are the results from Table S10 of the original paper:
| Parameter | Symbol | Value |
|---|---|---|
| step time seconds | ∆t | 0.02 |
| max iterations | − | 2500 |
| learning rate | − | 1e-4 |
| weight decay | − | 1e-5 |
| batch size | − | 1024 |
| history horizon | M | 32 |
| forecast horizon | N | 8 |
| forecast decay | α | 1.0 |
| approximate training hours | − | 1 |
| number of seeds | − | 5 |
Where:
- step time seconds (∆t): The duration of a single simulation/control step, 0.02 seconds (corresponding to a 50 Hz control frequency).
- max iterations: The maximum number of training steps for the world model.
- learning rate: The step size for gradient descent (1e-4 = 0.0001).
- weight decay: A regularization term (L2 regularization) to prevent overfitting.
- batch size: The number of samples processed in one training iteration.
- history horizon (M): The number of past observation-action pairs used as context for prediction.
- forecast horizon (N): The number of future steps the model is trained to predict autoregressively.
- forecast decay (α): The decay factor for the multi-step prediction loss; 1.0 means all steps in the forecast horizon are weighted equally (see the sketch below).
- approximate training hours: The approximate training duration for the world model.
- number of seeds: The number of random seeds used for multiple runs to ensure statistical robustness.
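One common way such a decay factor enters the training objective is to weight the loss of forecast step $k$ by $\alpha^k$, as in the short sketch below; the exact weighting used by RWM is not spelled out here, so treat this as an illustrative interpretation.

```python
from typing import Sequence

def weighted_forecast_loss(step_losses: Sequence[float], alpha: float = 1.0) -> float:
    """Weight the k-th forecast-step loss by alpha**k and normalize.

    With alpha = 1.0 (Table S10) every step in the forecast horizon contributes equally.
    """
    weights = [alpha**k for k in range(len(step_losses))]
    return sum(w * l for w, l in zip(weights, step_losses)) / sum(weights)

print(weighted_forecast_loss([0.1, 0.2, 0.4], alpha=1.0))
```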
5.5.2. MBPO-PPO Training Parameters
The training information for the MBPO-PPO algorithm (policy optimization) is summarized below.
The following are the results from Table S11 of the original paper:
| Parameter | Symbol | Value |
|---|---|---|
| imagination environments | − | 4096 |
| imagination steps per iteration | − | 100 |
| step time seconds | ∆t | 0.02 |
| buffer size | \|D\| | 1000 |
| max iterations | − | 2500 |
| learning rate | − | 0.001 |
| weight decay | − | 0.0 |
| learning epochs | − | 5 |
| mini-batches | − | 4 |
| KL divergence target | − | 0.01 |
| discount factor | γ | 0.99 |
| clip range | ε | 0.2 |
| entropy coefficient | − | 0.005 |
| number of seeds | − | 5 |
Where:
- imagination environments: The number of parallel imagination agents (or starting points for rollouts) used to generate synthetic experience.
- imagination steps per iteration: The length of each imagination rollout (the rollout horizon in Algorithm 1).
- step time seconds (∆t): Same as for RWM training.
- buffer size (|D|): The maximum number of real environment transitions stored in the replay buffer.
- max iterations: The maximum number of learning iterations for policy optimization.
- learning rate: Learning rate for the policy and value networks.
- weight decay: Regularization for the policy and value networks (0.0 means no L2 regularization).
- learning epochs: Number of times PPO updates are run over the collected data (real + imagined) per iteration.
- mini-batches: Number of mini-batches per PPO update epoch.
- KL divergence target: A target for the Kullback-Leibler (KL) divergence between the old and new policies, used in some PPO variants for adaptive clipping or learning-rate adjustment.
- discount factor (γ): Same as in the POMDP definition, used for computing discounted returns.
- clip range (ε): The hyperparameter for PPO's clipped surrogate objective (e.g., 0.2); see the sketch after this list.
- entropy coefficient: A term added to the PPO loss to encourage exploration by penalizing low policy entropy.
- number of seeds: Number of random seeds for multiple runs.
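The sketch below shows how the clip range and entropy coefficient from Table S11 typically enter a PPO policy loss. It is the generic clipped surrogate objective, not the authors' exact implementation.

```python
import torch

def ppo_policy_loss(log_prob_new, log_prob_old, advantages, entropy,
                    clip_range: float = 0.2, entropy_coef: float = 0.005):
    """Generic PPO clipped surrogate objective with an entropy bonus."""
    ratio = torch.exp(log_prob_new - log_prob_old)          # importance ratio of new vs. old policy
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    policy_loss = -torch.mean(torch.min(unclipped, clipped))  # pessimistic (clipped) objective
    return policy_loss - entropy_coef * torch.mean(entropy)   # entropy bonus encourages exploration
```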
5.6. Hardware
The learned policies are deployed and validated on two physical robotic platforms in a zero-shot transfer setup:
- ANYmal D [44]: A highly mobile and dynamic quadrupedal robot developed at ETH Zurich.
- Unitree G1: A humanoid robot from Unitree Robotics.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Autoregressive Trajectory Prediction (Section 4.1)
The ability of a world model to make accurate long-horizon predictions is crucial for effective planning and policy optimization. The experiments analyze the autoregressive prediction performance of RWM using trajectories from ANYmal D hardware, with a control frequency of 50 Hz. The world model was trained with a history horizon of M = 32 and a forecast horizon of N = 8 (the settings listed in Table S10).
The left panel of Figure 3 (shown below) visualizes the autoregressive trajectory predictions by RWM against ground truth trajectories. The solid lines represent the ground truth, and the dashed lines denote the predicted state evolution. Predictions begin once the historical observations have been consumed, and future observations are predicted autoregressively by feeding prior predictions back into the model.
The following figure (Figure 3 from the original paper) shows the autoregressive trajectory prediction and robustness under noise:
Figure 3: (Left) Solid lines represent ground truth trajectories, while dashed lines denote predicted state evolution. Predictions commence from the end of the historical observations, with future observations predicted autoregressively by feeding prior predictions back into the model. (Right) Yellow curves denote RWM at varying noise levels, demonstrating consistent robustness and lower error accumulation across forecast steps. Grey curves represent the MLP baseline, which exhibits significantly higher error accumulation and reduced robustness to noise.
Analysis:
The results demonstrate a remarkable alignment between RWM's predicted trajectories and the ground truth across all observed variables. This consistency is maintained even over extended rollouts, which is significant given the inherent challenge of compounding errors in long-horizon predictions. The strong performance is attributed to the dual-autoregressive mechanism introduced in Section 3.2, which stabilizes predictions even though a relatively short forecast horizon (N = 8) was used during training. The comparison of state evolution between RWM predictions and ground truth (Figure 1, bottom row, and Figure S9) further highlights RWM's ability to maintain consistency over horizons well beyond the training forecast horizon. This robustness is critical for stable policy learning and deployment.
6.1.2. Robustness under Noise (Section 4.2)
The capacity of a world model to generalize under noisy conditions, especially during autoregressive rollouts, is vital. Small deviations can rapidly cascade into large errors, leading to hallucinations. To test this, RWM's performance was evaluated under Gaussian noise perturbations applied to both observations and actions. The results were compared against an MLP-based baseline that was also trained autoregressively with the same history and forecast horizons. The right panel of Figure 3 (shown above) illustrates these results, where yellow curves represent RWM's relative prediction error at varying noise levels, and grey curves represent the MLP baseline.
Analysis:
RWM demonstrates a clear advantage over the MLP baseline. The MLP model's relative prediction error grows significantly and diverges much faster as the forecast steps increase, particularly under noise. In contrast, RWM exhibits superior stability, maintaining lower prediction errors even with high noise levels. This robustness is directly attributed to RWM's dual-autoregressive mechanism which, by continually refining the state representation towards long-term predictions, minimizes error accumulation even with noisy inputs.
6.1.3. Generality across Robotic Environments (Section 4.3)
To assess RWM's generality and robustness, its performance was compared against several baseline methods across a diverse range of robotic environments. The baselines include MLP, recurrent state-space model (RSSM), and transformer-based architectures. All models were provided the same context during training and evaluation. The relative autoregressive prediction errors for these models are presented in Figure 4. The tasks encompass manipulation scenarios, as well as quadruped and humanoid locomotion. The study also highlights the importance of autoregressive training by comparing RWM trained with teacher-forcing (RWM-TF) and autoregressive training (RWM-AR).
The following figure (Figure 4 from the original paper) shows autoregressive trajectory prediction errors across diverse robotic environments:
Figure 4: Autoregressive trajectory prediction errors across diverse robotic environments and network architectures. RWM trained with autoregressive training (RWM-AR) consistently outperforms baseline methods, including MLP, recurrent state-space model (RSSM), and transformer-based architectures. RWM-AR demonstrates superior generalization and robustness across tasks, from manipulation to locomotion. Autoregressive training (RWM-AR) reduces compounding errors over long rollouts, significantly improving performance compared to teacher-forcing training (RWM-TF).
Analysis:
Figure 4 clearly shows the superiority of RWM trained with autoregressive training (RWM-AR), which consistently achieves the lowest prediction errors across all evaluated environments. The performance gap is particularly noticeable in complex and dynamic tasks like velocity tracking for legged robots, where accurate long-horizon predictions are essential for effective control.
The comparison between RWM-AR and RWM-TF is also critical: RWM-AR significantly outperforms its teacher-forcing counterpart. This strongly underscores that the autoregressive training mechanism is vital for mitigating compounding prediction errors over long rollouts, validating one of the core contributions of the paper.
The paper notes that baselines are traditionally implemented using teacher-forcing. However, when RSSM is trained with autoregressive training, its performance becomes comparable to the GRU-based RWM. The authors chose the GRU-based RWM due to its simplicity and computational efficiency. Conversely, transformer architectures did not scale effectively with autoregressive training due to GPU memory constraints arising from multi-step gradient propagation, limiting their practicality for this specific approach.
These results affirm that RWM, particularly when coupled with its autoregressive training, achieves robust and generalizable performance across a wide array of robotic tasks. Visualizations of RWM-AR imagination rollouts against ground truth simulations (Figure 1 and S9) further support these claims.
6.1.4. Policy Learning and Hardware Transfer (Section 4.4)
The paper uses MBPO-PPO to train a goal-conditioned velocity tracking policy for ANYmal D and Unitree G1, leveraging the RWM. This policy's observation and action spaces, reward functions, and architectural details are provided in the supplementary material (Sections A.1 and A.2.3). The performance of MBPO-PPO is compared against Short-Horizon Actor-Critic (SHAC) [38] and DreamerV3 [30].
Figure 5 illustrates the model error and policy mean reward during policy optimization for ANYmal D (left) and Unitree G1 (right) velocity tracking tasks. Policies are trained using estimated rewards from RWM predictions, while ground truth rewards (solid lines) are for evaluation only.
The following figure (Figure 5 from the original paper) shows model error and policy mean reward during policy optimization:
Figure 5: Model error and policy mean reward for the ANYmalD (left) and Unitree G1 (right) velocity tracking task with MBPO-PPO. The policy is trained using estimated rewards computed from predicted observations by RWM. Ground truth rewards, visualized with solid lines, are reported by the simulator for evaluation purposes only.
Analysis of Model Error (Top Panels of Figure 5):
- MBPO-PPO shows a significant reduction in model error over the course of training, indicating that the world model becomes more accurate as the policy is refined and more data is collected.
- SHAC, in contrast, struggles with high and fluctuating model error throughout training. Its reliance on first-order gradients propagated through the world model proves problematic for discontinuous dynamics (such as legged locomotion with varying contact patterns), leading to inaccurate gradients and suboptimal policy updates, which results in chaotic robot behaviors.
- Dreamer effectively leverages its latent-space dynamics model, but its shorter planning horizons during training limit its ability to capture long-horizon dependencies in stochastic environments, leading to moderate compounding errors during policy learning.
Analysis of Policy Mean Reward (Bottom Panels of Figure 5):
- MBPO-PPO's predicted rewards (dashed lines) initially overshoot the ground truth rewards (solid lines), a common phenomenon where the policy exploits small inaccuracies or optimistic estimates in the learned model. As training progresses, the predicted rewards align more closely with the ground truth, demonstrating that the model's predictions remain accurate enough to guide effective learning.
- SHAC fails to converge, producing unstable behaviors and degrading both policy and model quality, consistent with its high model error.
- Dreamer shows partial convergence, achieving higher rewards than SHAC but significantly lagging behind MBPO-PPO.
Hardware Transfer:
The ultimate validation is zero-shot transfer to physical hardware. SHAC and Dreamer failed to produce deployable policies due to collapse or instability during training. However, the policy learned using MBPO-PPO (as shown in Figure 1, top row) demonstrates reliable and robust performance on ANYmal D and Unitree G1 hardware. It successfully tracks goal-conditioned velocity commands and maintains stability even under external disturbances and varying terrain conditions. This success is directly attributed to the high-quality trajectory predictions generated by RWM, which enable accurate and effective policy optimization. Videos on the project webpage further showcase this robustness.
Comparison with Model-Free Methods (Table 1):
While MBPO-PPO excels among MBRL methods, the paper acknowledges that it still falls short of well-tuned model-free RL methods trained on high-fidelity simulators. A comparison with a PPO-based method on a high-fidelity simulator is provided in Table 1.
The following are the results from Table 1 of the original paper:
| Method | RWM pretraining | MBPO-PPO | PPO |
|---|---|---|---|
| state transitions | 6M | — | 250M |
| total training time | 50 min | 5 min | 10 min |
| step inference time | − | 1 ms | 1 ms |
| real tracking reward | − | 0.90 ± 0.04 | 0.90 ± 0.03 |
Analysis of Table 1:
- state transitions: PPO (model-free) requires a massive 250 million state transitions, indicating its sample inefficiency. RWM pretraining requires 6 million state transitions, significantly fewer. MBPO-PPO itself does not consume additional environment transitions directly but leverages the pretrained world model, highlighting the sample-efficiency advantage of the model-based approach for learning the dynamics.
- total training time: RWM pretraining takes 50 minutes, and MBPO-PPO policy training takes 5 minutes (once the model is pretrained), compared with 10 minutes for PPO. While RWM pretraining adds an upfront cost, bringing the total to roughly 55 minutes, the policy-training stage itself is comparable to PPO and the pipeline requires drastically fewer real environment interactions (6M vs. 250M).
- step inference time: Both MBPO-PPO and PPO policies have comparable step inference times (1 ms), meaning they are equally fast for real-time control once trained.
- real tracking reward: Both MBPO-PPO and PPO achieve very similar real tracking rewards (0.90 ± 0.04 vs. 0.90 ± 0.03). This is a crucial finding: MBPO-PPO can match the performance of PPO trained on a high-fidelity simulator while being significantly more sample-efficient in terms of real environment interactions. The strength of MBRL lies in scenarios where accurate or efficient simulation for model-free RL is infeasible, making it valuable for real-world environments.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Dual-autoregressive Mechanism (Section A.4.1)
An ablation study was conducted to analyze the impact of the history horizon M and the forecast horizon N on the performance of RWM. The results are presented as heatmaps in Figure S8, showing the relative autoregressive prediction error (left) and the training time (right).
The following figure (Figure S8 from the original paper) shows the ablation study on history and forecast horizons:
Figure S8: Ablation study on the history horizon M and forecast horizon N in RWM. The heatmap on the left shows the relative autoregressive prediction error, with darker colors indicating higher errors. Models trained with larger history horizons exhibit lower errors, although the improvements plateau beyond a certain point. The forecast horizon has a significant impact, with longer horizons leading to better long-term prediction accuracy due to exposure to extended rollouts during training. The heatmap on the right illustrates training time, with darker colors representing longer durations. Increasing N significantly raises training time due to sequential computation, while shorter horizons (e.g., N = 1, teacher-forcing) enable faster training but result in poor prediction accuracy.
Analysis of Prediction Error (Left Heatmap):
- History horizon (M): Models trained with a longer history horizon consistently show lower prediction errors, confirming the importance of providing sufficient historical context for RWM to capture the complex underlying dynamics. However, the improvement from increasing M plateaus beyond a certain point, suggesting diminishing returns for excessively long histories.
- Forecast horizon (N): The forecast horizon plays a decisive role in improving long-term prediction accuracy. Increasing N during training leads to better performance in autoregressive rollouts, because a larger N forces the model to learn representations that are robust to compounding errors over extended prediction horizons, as it must rely on its own predictions for more steps.
Analysis of Training Time (Right Heatmap):
- Increasing N significantly raises training time. This is a direct consequence of the autoregressive nature of the training process: larger N values require more sequential computation during training, reducing parallelization.
- When N = 1 (i.e., teacher-forcing), training can be highly parallelized, resulting in minimal training time. However, as seen in the left heatmap, this setting leads to poor autoregressive performance because the model lacks exposure to long-horizon prediction scenarios during training and thus fails to handle compounding errors effectively.
Optimal Trade-off:
The study highlights a critical trade-off between prediction accuracy and training efficiency. An optimal balance is achieved with moderate values of M and N; for instance, a history horizon of M = 32 and a forecast horizon of N = 8 (the settings in Table S10) yield strong autoregressive performance within a manageable training time. These settings provide enough historical context while robustly training the model for long-term predictions.
6.2.2. Collision Handling and Model Pretraining (Section A.4.3)
The paper discusses how RWM handles collision events and the role of model pretraining.
- Collision Handling: During both pretraining and online fine-tuning, rollouts are terminated and the environment is reset if ground contact by the base is detected, signaling a failure. RWM is explicitly trained to predict these termination events through its privileged information prediction head, which allows the world model to learn about transitions that lead to unsafe situations. During policy optimization, MBPO-PPO treats these termination predictions as episode-ending events in imagination rollouts, which influences PPO's return computation and state values.
- Model Pretraining: RWM is pretrained using simulation data generated by suboptimal policies trained for similar tasks under varied dynamics. The policy is then learned from scratch purely in imagination, with RWM subsequently fine-tuned on a single-environment online dataset. Pretraining is deemed essential for two main reasons:
  - Limited online dataset: The online dataset is very small (mimicking real-world constraints). Training the world model from scratch on such limited data would lead to severe overfitting and long training times.
  - Immature policy failures: An immature policy would frequently cause the robot to fall, generating low-value transitions (mostly failures). Training the world model solely on this chaotic data would result in poor imagined rollouts and, consequently, poor policy updates.

  Pretraining stabilizes training and provides a robust initialization for online fine-tuning, especially in environments with challenging dynamics and frequent failures. Importantly, RWM pretraining does not require data from optimal policies, and it remains robust to domain shifts and injected noise (as shown in Figure 3). This warm-up phase is primarily necessary for locomotion tasks, due to their discontinuous dynamics and frequent environment terminations, but not for the manipulation experiments.
6.2.3. Visualization of Imagination Rollouts (Section A.4.2)
Figure S9 visualizes autoregressive imagination from RWM compared with ground-truth simulation across diverse robotic systems.
The following figure (Figure S9 from the original paper) shows autoregressive imagination of RWM and ground-truth simulation:
Figure S9: Autoregressive imagination of RWM and ground-truth simulation across diverse robotic systems. For each environment, the top row showcases the RWM autoregressively predicting future trajectories in imagination. The second row visualizes the ground truth evolution in simulation. The visualized coordinate and arrow markers denote the predicted and measured end-effector pose and base velocity, respectively.
Analysis:
For each environment (e.g., manipulation, quadruped, humanoid), the top row of Figure S9 displays RWM's autoregressive predictions of future trajectories in imagination, while the second row shows the corresponding ground truth evolution in simulation. The visualizations, including coordinate and arrow markers for end-effector pose and base velocity, confirm the high fidelity of RWM's predictions across varied tasks and robot types. This reinforces the claims about RWM's generality and accuracy in replicating complex dynamics, which is foundational for the subsequent policy optimization and sim-to-real transfer success.
6.3. Additional Discussion
6.3.1. Challenges in Real-World Online Learning (Section A.4.4)
The authors acknowledge that performing the policy training phase directly on real hardware would further demonstrate the advantages of RWM. However, several significant challenges currently prevent real-world online deployment:
- Safety and Collisions: During online learning, policies often exploit minor world-model errors, leading to overly optimistic behaviors that can result in collisions. In simulation, these failures provide corrective signals, but on real hardware they pose a significant risk to the robot. Experiments show that failures occur more than 20 times on average during online learning, which would be detrimental to physical systems.
- Recovery Policies: Fully automating online learning would require a recovery policy capable of safely resetting the robot to an initial state after a failure, which is particularly challenging for large platforms like ANYmal D or Unitree G1.
- Privileged Information: Privileged information (e.g., contact forces) used to fine-tune RWM must be either reliably measured or accurately estimated using onboard sensors, which may not always be available or precise enough.
- Error Exploitation Mitigation: To mitigate the policy's exploitation of model errors, uncertainty-aware world models could be explored, but this would require additional architectural modifications to RWM.

Due to these challenges, the current work approximates real-world constraints by using only a single simulation environment with domain shifts from the pretraining environments, reducing engineering effort while demonstrating feasibility. Future work will specifically address these issues, focusing on uncertainty-aware models and safer online adaptation strategies.
7. Conclusion & Reflections
7.1. Conclusion Summary
This work introduces Robotic World Model (RWM), a robust and scalable framework designed for learning accurate world models specifically tailored to complex robotic tasks. The core innovation lies in its dual-autoregressive mechanism and self-supervised training over long prediction horizons. This approach effectively tackles critical challenges in model-based reinforcement learning (MBRL), including compounding errors, partial observability, and stochastic dynamics, without needing domain-specific inductive biases. Extensive experiments demonstrate RWM's superior autoregressive prediction accuracy compared to state-of-the-art baselines like RSSM and transformer-based architectures across diverse robotic environments (manipulation, quadruped, humanoid locomotion).
Building upon RWM's high world model rollout fidelity, the authors propose MBPO-PPO, a policy optimization framework. Policies trained using MBPO-PPO exhibit superior performance in simulation and achieve seamless zero-shot transfer to physical hardware, as evidenced by successful deployment on ANYmal D and Unitree G1 robots. This work significantly advances MBRL by providing a generalizable, efficient, and scalable framework for learning and deploying world models, thereby paving the way for more adaptive, robust, and high-performing robotic systems in real-world applications.
7.2. Limitations & Future Work
The authors openly acknowledge several limitations of the current RWM framework and suggest directions for future research:
- Performance Gap with Model-Free RL: While MBPO-PPO surpasses existing MBRL methods in robustness and generalization, its performance still falls short of well-tuned model-free RL methods trained on high-fidelity simulators (as shown in Table 1). Model-free RL benefits from maturity and extensive optimization in settings with unlimited access to near-perfect simulators.
- Dependence on Pretraining: The world model is currently pretrained using simulation data before policy optimization. Training RWM from scratch is challenging because immature policies can exploit model inaccuracies during early exploration, leading to inefficiency and instability. This pretraining phase is particularly necessary for locomotion tasks due to their discontinuous dynamics and frequent environment terminations.
- Limited Online Fine-tuning: While RWM is fine-tuned with a single-environment online dataset, the need for additional real-world interaction to continually refine the world model highlights an area for further development.
- Challenges in Real-World Online Learning: Enabling safe and effective online learning directly on hardware remains a significant hurdle.
  - Safety Risks: Policies often exploit minor model errors, leading to optimistic behaviors and collisions, which are risky for physical hardware.
  - Recovery: Fully automated online learning would require robust recovery policies to reset the robot after failures, especially for large platforms.
  - Privileged Information Availability: Privileged information (e.g., contact forces) used for RWM fine-tuning might not be reliably measurable or estimable with onboard sensors in real-world scenarios.
  - Uncertainty Quantification: Incorporating safety constraints and robust uncertainty estimates into RWM would be critical for deployment in real-world, lifelong learning scenarios to mitigate error exploitation.

Future work will focus on addressing these safety concerns and practical challenges, including exploring uncertainty-aware models and developing safer online adaptation strategies.
7.3. Personal Insights & Critique
This paper presents a significant step forward in model-based reinforcement learning for real-world robotics. The emphasis on a generalizable world model without domain-specific inductive biases is particularly appealing, as it promises wider applicability and reduces the engineering effort typically associated with new robotic tasks. The dual-autoregressive mechanism with self-supervised training is a clever and effective way to tackle the long-standing problem of error accumulation in long-horizon predictions, which has historically hampered the practical utility of world models for tasks requiring extended foresight. The ability to achieve robust zero-shot sim-to-real transfer on complex legged and humanoid robots is a strong validation of this approach's practical relevance.
Insights and Transferability:
- Robustness through Self-Supervision: The core idea of training a model by feeding it its own predictions (autoregressive training) to reduce exposure bias is highly valuable. This principle could be applied to other sequential prediction tasks beyond world models, such as time-series forecasting or generative modeling, where maintaining long-term consistency is crucial.
- Hybrid Approach Strength: The MBPO-PPO framework effectively marries the sample efficiency of model-based methods with the robustness of model-free PPO updates. This hybrid strategy is often the sweet spot for real-world applications, enabling faster learning with limited real data while still benefiting from the stability of well-established model-free algorithms.
- Value of Privileged Information: Using privileged information as an auxiliary loss for the world model is a smart way to implicitly embed critical domain knowledge (such as contact states) without hand-designing the model architecture. This could be extended to other forms of auxiliary losses derived from sensors or expert knowledge.
Critique and Areas for Improvement:
- Reliance on Pretraining: While justified by current safety and data limitations, the reliance on pretraining with simulation data (even if suboptimal) means the system is not truly learning from scratch in the wild. Real lifelong learning would ideally involve continuous online adaptation and model refinement directly on hardware without such a prerequisite. The transition from pretraining to fine-tuning, and the potential domain shift between them, could be analyzed further.
- Safety in Online Learning: The authors correctly identify safety as the major hurdle for real-world online learning. The current solution of using a single simulation environment with domain shifts is a pragmatic compromise. However, developing robust uncertainty-aware world models that can quantify their confidence in predictions and trigger safe exploration strategies (e.g., slowing down, asking for human intervention) is essential for truly autonomous online learning on physical robots. Explicit safety layers or guard policies could also be integrated.
- Computational Cost for Transformers: The finding that transformer architectures struggle with GPU memory constraints under autoregressive training is an interesting limitation. Future work might explore more memory-efficient transformer variants or alternative architectures that combine the benefits of attention with the autoregressive training paradigm without prohibitive computational costs.
- Generalization to New Tasks/Objects: While RWM avoids domain-specific biases, its generalization to entirely new tasks or objects that significantly alter environmental dynamics would be a crucial next step. How well does it adapt to unseen kinematics or highly deformable objects, for example?

Overall, RWM represents a highly relevant and impactful contribution to model-based robotics. Its meticulous design for robustness and long-horizon accuracy, combined with successful hardware deployment, sets a strong foundation for future research in adaptive and efficient robotic systems.