Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics
TL;DR Summary
This paper presents a novel framework for a robotic world model using dual-autoregressive mechanisms and self-supervised training, enabling reliable long-horizon predictions without domain-specific biases. It supports efficient policy optimization and seamless deployment in real-
Abstract
Learning robust and generalizable world models is crucial for enabling efficient and scalable robotic control in real-world environments. In this work, we introduce a novel framework for learning world models that accurately capture complex, partially observable, and stochastic dynamics. The proposed method employs a dual-autoregressive mechanism and self-supervised training to achieve reliable long-horizon predictions without relying on domain-specific inductive biases, ensuring adaptability across diverse robotic tasks. We further propose a policy optimization framework that leverages world models for efficient training in imagined environments and seamless deployment in real-world systems. This work advances model-based reinforcement learning by addressing the challenges of long-horizon prediction, error accumulation, and sim-to-real transfer. By providing a scalable and robust framework, the introduced methods pave the way for adaptive and efficient robotic systems in real-world applications.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is a novel framework for learning robust and generalizable world models in robotics, specifically designed to facilitate robust policy optimization and seamless sim-to-real transfer. The framework is named Robotic World Model (RWM).
1.2. Authors
-
Chenhao Li: ETH Zurich, Switzerland. Email: chenhli@ethz.ch.
-
Andreas Krause: ETH Zurich, Switzerland.
-
Marco Hutter: ETH Zurich, Switzerland.
The authors are affiliated with
ETH Zurich, a prominent research university known for its strong contributions to robotics, machine learning, and control theory. Chenhao Li, as the first author, is likely the primary researcher on this project, while Andreas Krause and Marco Hutter likely serve as senior researchers or advisors, given their established reputations in the field of machine learning and robotics, respectively. Marco Hutter, in particular, is well-known for his work on legged robots such as theANYmal.
1.3. Journal/Conference
The paper is available as a preprint on arXiv. As of the provided publication date, it has not yet been officially published in a journal or conference. However, arXiv is a widely respected platform for disseminating cutting-edge research in physics, mathematics, computer science, and related fields, allowing researchers to share their work before peer review. The abstract mentions advancing model-based reinforcement learning and addressing sim-to-real transfer, indicating its relevance to top-tier conferences in robotics (ICRA, IROS, CoRL) or machine learning (NeurIPS, ICML).
1.4. Publication Year
2025 (Published at UTC: 2025-01-17T10:39:09.000Z)
1.5. Abstract
The paper introduces a novel framework, Robotic World Model (RWM), for learning world models that can accurately capture complex, partially observable, and stochastic dynamics in robotics. The core methodology involves a dual-autoregressive mechanism and self-supervised training to achieve reliable long-horizon predictions without relying on domain-specific inductive biases, which enhances its adaptability across various robotic tasks. Furthermore, the authors propose a policy optimization framework, MBPO-PPO, that leverages these learned world models for efficient training in imagined environments and zero-shot deployment in real-world systems. This work aims to advance model-based reinforcement learning (MBRL) by tackling critical challenges such as long-horizon prediction accuracy, error accumulation, and sim-to-real transfer. The introduced methods provide a scalable and robust framework, paving the way for more adaptive and efficient robotic systems in practical applications.
1.6. Original Source Link
https://arxiv.org/abs/2501.10100 (Preprint on arXiv)
1.7. PDF Link
https://arxiv.org/pdf/2501.10100v4.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is enabling robotic systems to efficiently and robustly learn and adapt in complex, real-world environments. Reinforcement Learning (RL) and control theory have advanced robotics, but a significant limitation remains: the lack of adaptation and continuous learning once a policy is deployed on a real system. This means valuable data from real-world interactions is often underutilized, restricting the robot's robustness and ability to handle evolving scenarios.
This problem is crucial because truly intelligent robotic systems need to operate efficiently with limited data and adapt scalably to real-world conditions. Model-free RL algorithms like PPO and SAC require a vast number of interactions, making them impractical for real-world robotics where interactions are costly and potentially unsafe. Therefore, sample-efficient methods are essential.
World models offer a promising solution by simulating environment dynamics, allowing for planning and policy optimization in "imagination." However, developing reliable and generalizable world models for real-world dynamics (which are often nonlinear, stochastic, and partially observable) presents unique challenges:
-
Complexity of Real-World Dynamics: Robotic environments are inherently complex.
-
Long-Horizon Prediction: Models need to predict far into the future, which is prone to
error accumulation. -
Partial Observability: Robots often don't have access to the full state of the environment.
-
Domain-Specific Inductive Biases: Many existing world models incorporate hand-designed structures or physics principles, limiting their scalability and adaptability to new tasks or environments.
-
Sim-to-Real Transfer: Policies trained in simulation often struggle when deployed on physical hardware due to the
reality gap.The paper's innovative idea is to develop a general framework for learning world models (
RWM) that can overcome these challenges without relying on domain-specific assumptions or handcrafted representations. It focuses on robustness and accuracy over long horizons, coupled with a policy optimization method (MBPO-PPO) that enableszero-shot deploymenton physical hardware.
2.2. Main Contributions / Findings
The paper makes the following primary contributions:
-
A Novel Network Architecture and Training Framework: Introduction of
RWM, which employs adual-autoregressive mechanismandself-supervised trainingto learn reliable world models capable oflong autoregressive rollouts. This is a critical property for downstream planning and control, specifically addressingerror accumulationandpartial observabilitywithout domain-specific biases. -
Comprehensive Evaluation Suite: A thorough evaluation of
RWMacross diverse robotic tasks (manipulation, quadruped, humanoid locomotion) is provided. Comparative experiments demonstrate its effectiveness against existing world model frameworks. -
Efficient Policy Optimization Framework and Hardware Generalization: Proposal of
MBPO-PPO, an efficient policy optimization framework that leverages the learnedRWMfor continuous control. It demonstrates effective generalization to real-world scenarios through hardware experiments onANYmal D(quadruped) andUnitree G1(humanoid) systems, achievingzero-shot deploymentwith minimal performance loss.Key findings include:
-
RWMachieves remarkable alignment between predicted and ground truth trajectories over extended autoregressive rollouts, effectively mitigating compounding errors. -
RWMdemonstrates superior robustness under various noise perturbations compared to baseline models. -
RWMtrained with autoregressive training (RWM-AR) consistently outperforms other baseline architectures (MLP, RSSM, Transformer-based) and its teacher-forcing counterpart (RWM-TF) across diverse robotic environments. -
The
MBPO-PPOframework successfully trains policies on the learnedRWMfor complex velocity tracking tasks, leading to stable and robust behaviors in simulation and successfulzero-shot transferto physical robots. This is achieved for over a hundred autoregressive steps, a capability exceeding many existing frameworks.These findings collectively address the challenges of long-horizon prediction, error accumulation, and
sim-to-real transfer, showcasing a scalable and robust framework for adaptive and efficient robotic systems in real-world applications.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
Reinforcement Learning (RL)
Reinforcement Learning (RL) is a paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent is not explicitly programmed with a solution but learns through trial and error. It interacts with the environment, observes the resulting state and reward, and adjusts its policy (strategy for choosing actions) to achieve its goal.
Markov Decision Process (MDP)
A Markov Decision Process (MDP) is a mathematical framework for modeling sequential decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. An MDP is defined by a tuple :
-
: A set of possible
statesof the environment. -
: A set of possible
actionsthe agent can take. -
: A
transition functionthat describes the probability of moving to state from state after taking action . This is based on theMarkov property, meaning the next state depends only on the current state and action, not on the entire history. -
: A
reward functionR(s, a, s')that specifies the immediate reward received after transitioning from state to state via action . -
: A
discount factor, which determines the present value of future rewards. Rewards received sooner are generally valued more than those received later.The goal in an
MDPis to find apolicythat maximizes the expected discounted sum of rewards over time.
Partially Observable Markov Decision Process (POMDP)
A Partially Observable Markov Decision Process (POMDP) is an extension of an MDP where the agent cannot directly observe the current state of the environment. Instead, it receives observations that are probabilistically related to the underlying state. A POMDP is defined by :
-
: A set of possible
states(unobservable by the agent). -
: A set of possible
actions. -
: A set of possible
observationsthe agent can receive. -
: A
transition function. -
: A
reward functionR(s, a, s'). -
: An
observation functionthat describes the probability of observing when the environment is in state . -
: A
discount factor.In a
POMDP, the agent must maintain abelief(a probability distribution over the possible states) and use this belief, along with its history of actions and observations, to make decisions. TheRWMframework models the environment as aPOMDP, recognizing that robotic systems often operate with partial information.
World Models
World models are neural network models that learn to predict the future behavior of an environment. They learn the dynamics of the environment, essentially creating a "simulator" of the world. This allows an RL agent to "imagine" future outcomes of its actions without interacting with the real (or a separate simulated) environment. Key components often include:
-
Dynamics Model: Predicts the next state given the current state and action.
-
Observation Model: Predicts the next observation given the next state (especially useful in
latent space models). -
Reward Model: Predicts the reward for a given state and action.
World models are crucial for
model-based reinforcement learningbecause they enableplanning in imaginationand significantly improvesample efficiency.
Model-Based Reinforcement Learning (MBRL)
Model-Based Reinforcement Learning (MBRL) algorithms learn a model of the environment's dynamics, then use this model to either plan optimal actions or train a policy. The general cycle involves:
- Data Collection: Interact with the real environment to collect
trajectories(sequences of states, actions, rewards, observations). - Model Learning: Train a
world modelusing the collected data to predict future states/observations/rewards. - Policy Optimization: Use the learned world model to:
-
Planning: Search for optimal action sequences within the imagined environment (e.g., using
Model Predictive Control - MPC). -
Policy Training: Generate synthetic experience (
imagination rollouts) to train or refine a policy using amodel-free RLalgorithm.MBRLmethods are typically moresample-efficientthanmodel-free RLbecause they can generate a lot of training data from a small amount of real-world interaction.
-
Model-Free Reinforcement Learning
Model-Free Reinforcement Learning algorithms directly learn a policy or value function without explicitly building a model of the environment's dynamics. They learn directly from trial-and-error interactions. Examples include:
-
Policy Optimization:
Proximal Policy Optimization (PPO) -
Value-Based Methods:
Q-learning,Deep Q-Networks (DQN) -
Actor-Critic Methods:
Soft Actor-Critic (SAC)While
model-free RLcan achieve impressive results, especially in complex tasks, it often requires an extremely large number of interactions with the environment, making it costly or impractical for real-world robotics.
Recurrent Neural Networks (RNNs) and Gated Recurrent Units (GRUs)
Recurrent Neural Networks (RNNs) are a class of neural networks designed to process sequential data, where the output at a given time step depends on previous computations. They have internal memory (hidden states) that allows them to capture temporal dependencies.
Gated Recurrent Units (GRUs) are a type of RNN that addresses some of the limitations of simple RNNs, such as the vanishing gradient problem, by using gating mechanisms. GRUs have two main gates:
- Update Gate: Controls how much of the past information (hidden state) should be passed to the future.
- Reset Gate: Controls how much of the past hidden state to forget.
GRUsare simpler and computationally more efficient thanLong Short-Term Memory (LSTM)networks but still effective at handling long-term dependencies in sequential data.RWMuses aGRU-based architecturefor its ability to maintain long-term historical context.
Multilayer Perceptrons (MLPs)
A Multilayer Perceptron (MLP) is a class of feedforward artificial neural network. It consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer. Each node (neuron) in one layer is connected to every node in the subsequent layer with a certain weight. MLPs are general-purpose function approximators and are often used as components within larger neural network architectures, such as the heads of RWM that predict the mean and standard deviation of observations.
Proximal Policy Optimization (PPO)
Proximal Policy Optimization (PPO) is a model-free RL algorithm that is widely used due to its balance of ease of implementation, sample efficiency, and good performance. It is an actor-critic method, meaning it learns both a policy (actor) and a value function (critic). PPO works by performing multiple epochs of mini-batch stochastic gradient ascent on the clipped surrogate objective function. This objective encourages the new policy to stay close to the old policy, preventing large, destructive updates, which contributes to stability.
The clipped surrogate objective in PPO is typically formulated as:
$
L^{CLIP}(\theta) = \hat{E}_t \left[ \min(r_t(\theta) \hat{A}_t, \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t) \right]
$
Where:
-
: The parameters of the policy network.
-
: Empirical average over a batch of samples.
-
: The
probability ratio, which is the ratio of the new policy's probability of taking action in state to the old policy's probability. -
: The
advantage estimateat time , representing how much better (or worse) action was than average for state . -
: Clips the probability ratio to be within the range , where is a hyperparameter (e.g., 0.2). This clipping prevents excessively large policy updates.
The
PPOalgorithm updates the policy by maximizing this objective, optionally with an addedvalue functionloss andentropybonus to encourage exploration.RWMusesPPOwithin itsMBPO-PPOframework for policy optimization.
Soft Actor-Critic (SAC)
Soft Actor-Critic (SAC) is another model-free RL algorithm, known for its off-policy nature and sample efficiency, making it suitable for continuous control tasks. SAC aims to maximize a trade-off between expected reward and entropy. The entropy term encourages exploration, preventing the policy from converging to a suboptimal local optimum too quickly. SAC learns an actor (policy), a critic (Q-function), and often a target Q-network and an entropy temperature parameter. It has shown impressive results in various robotic domains.
Model Predictive Control (MPC)
Model Predictive Control (MPC) is an advanced control strategy that uses an explicit model of the system to predict future behavior. At each time step, MPC solves an optimization problem to find a sequence of control actions that minimizes a cost function over a finite prediction horizon, subject to system constraints. Only the first action from this optimal sequence is applied, and the process is repeated at the next time step with updated system measurements. MPC is often used with world models for planning, as the world model serves as the system dynamics model.
3.2. Previous Works
World Models for Robotics
- Dynamics Models for Policy Optimization [24]: Early applications of world models focused on describing real-world dynamics to optimize control policies.
- Vision-Based World Models:
Visual Foresight[18, 25, 17]: Techniques that learn visual dynamics for planning in high-dimensional sensory spaces. These models predict future video frames or image features to anticipate outcomes of actions.PlaNet[15]: Deep Planning Network, a pioneeringlatent-space dynamics modelthat learns compact representations of the state space and plans directly in this learned latent space.Dreamer[29, 11, 30]: Successor toPlaNet, extends the concept by integrating anactor-criticframework into the latent dynamics model, allowing for simultaneous learning of both the dynamics model and the policy.DreamerV3[30] is a later iteration known for mastering diverse domains.
- Models with Domain-Specific Inductive Biases: Many approaches incorporate known physics principles or structured state representations to improve model fidelity, especially when generalizing beyond training data. Examples include:
- Foot-placement dynamics [21]
- Object invariance [22]
- Granular media interactions [27]
- Frequency domain parameterization [23]
- Rigid body dynamics [20]
- Semi-structured Lagrangian dynamics models [28] These methods, while effective, often require significant domain knowledge and handcrafted features, limiting their versatility.
Model-Based Reinforcement Learning (MBRL)
- Probabilistic Ensembles with Trajectory Sampling (PETS) [12]: One of the pioneering
MBRLmethods. It uses an ensemble of probabilistic neural networks to model the environment dynamics. By using an ensemble,PETScaptures model uncertainty, which helps in robust planning. Trajectories are sampled from this ensemble to evaluate actions. - PlaNet [15] and Dreamer [29, 11, 30]: As described above, these methods leverage
latent dynamics modelsfor planning and policy learning, demonstrating state-of-the-art performance in various control and navigation tasks, including extensions to real-world robotics [19, 31]. - Architectural Variations for Latent Dynamics Models:
Autoregressive Transformers[32]: Used to improve the generation capabilities of latent dynamics models.Variational Autoencoders[33]: Capture the stochastic nature of the environment.
- TD-MPC and TD-MPC2 [34, 35, 36]: Integrate
model-based learningwithModel Predictive Control (MPC)to achieve high-performance control, particularly in dynamic environments. They combine the predictive power of learned models with the optimal control capabilities ofMPC.
Hybrid Model-Based/Model-Free Approaches
- Model-Based Policy Optimization (MBPO) [13]: A notable hybrid approach that combines the sample efficiency of
MBRLwith the robustness ofmodel-free RL.MBPOuses a model-based approach to generate additional synthetic data for planning and policy optimization, but then refines the policy using standardmodel-freeupdates. A key idea is to selectively rely on the learned model only when its predictions are accurate, mitigating the negative effects of model inaccuracies. - Model-based Offline Policy Optimization (MOPO) [37]: Extends
MBPOto theoffline RLsetting, where learning occurs entirely from a fixed dataset of previously collected data without further environment interactions. - Gradient-Based Policy Optimization [38, 39]: Some
MBRLapproaches usefirst-order gradient-based optimizationfor policy learning, where gradients are propagated through the dynamics model to update the policy.Short-Horizon Actor-Critic (SHAC)[38] is an example that leverages differentiable simulation.
3.3. Technological Evolution
The field of robotic control and RL has seen a significant evolution. Initially, model-free RL methods like PPO and SAC gained prominence for their ability to learn complex behaviors without explicit environment models. However, their sample inefficiency limited their application in real-world robotics where data collection is expensive and often hazardous.
This led to a resurgence of model-based RL, leveraging world models to improve sample efficiency. Early world models often relied on domain-specific inductive biases (e.g., known physics, structured state representations) to achieve fidelity, but this restricted their generalizability and adaptability. The development of latent-space dynamics models (like PlaNet and Dreamer) marked a shift towards learning more general black-box models that could operate on high-dimensional observations (e.g., pixels) without requiring explicit state engineering.
However, challenges remained:
-
Error accumulationover long prediction horizons in purelyautoregressivemodels. -
Difficulty in capturing
stochasticityandpartial observabilityaccurately. -
The persistent
sim-to-real gapwhen deploying policies trained on learned models.This paper's work,
RWM, fits into this timeline by addressing these limitations. It moves beyond previousblack-boxmodels by introducing adual-autoregressive mechanismandself-supervised trainingspecifically designed to enhancelong-horizon prediction robustnessand reduceerror accumulationwithout domain-specific biases. This allowsRWMto generate high-fidelity imagined trajectories suitable for robust policy optimization viaMBPO-PPO, ultimately enablingzero-shot transferto real hardware. It represents an advancement in creating more general, reliable, and deployableworld modelsfor robotics.
3.4. Differentiation Analysis
Compared to the main methods in related work, RWM introduces several core differences and innovations:
-
Generalization without Domain-Specific Biases:
- Prior Methods: Many successful
world modelsfor robotics, especially in legged locomotion or manipulation, incorporate strongdomain-specific inductive biases(e.g., foot-placement dynamics [21], object invariance [22], rigid body dynamics [20]). While effective, these hand-crafted representations limit scalability and adaptability to diverse tasks or novel environments. - RWM's Innovation:
RWMis explicitly designed to learnworld modelswithout relying onhandcrafted representationsorspecialized architectural biases. This approach aims for broader applicability and enhancedgeneralizationacross a wide range of robotic systems (manipulation, quadruped, humanoid locomotion) and scenarios.
- Prior Methods: Many successful
-
Robust Long-Horizon Prediction via Dual-Autoregressive Mechanism and Self-Supervised Training:
- Prior Methods (e.g.,
Dreamer,PlaNet,RSSM): Often useteacher-forcingduring training, where ground-truth observations are always fed as input for the next-step prediction. While efficient for training due to parallelization, this can lead to amismatch between training and inference distributions(known asexposure bias) and poor performance during longautoregressive rolloutsat test time, where errors can compound rapidly.DreamerandTD-MPCtypically operate with shorter planning horizons in imagination. - RWM's Innovation:
RWMemploys a noveldual-autoregressive mechanismwithself-supervised trainingoverlong prediction horizons().- Self-supervised Autoregressive Training: The model is trained to predict future observations by using its own predictions recursively as input, mimicking the test-time scenario. This reduces the
distribution mismatch. - Dual-Autoregression: It combines
inner autoregression(updatingGRU hidden stateswithin thecontext horizon) andouter autoregression(feeding predicted observations from theforecast horizonback into the network). This specifically ensures stability and robustness inlong-horizon predictionsand effectively mitigateserror accumulation, even instochasticandpartially observableenvironments. This allowsRWMto optimize policies overhundreds of autoregressive steps, a capability the paper claims exceeds many existing frameworks.
- Self-supervised Autoregressive Training: The model is trained to predict future observations by using its own predictions recursively as input, mimicking the test-time scenario. This reduces the
- Prior Methods (e.g.,
-
Policy Optimization and Zero-Shot Hardware Transfer:
-
Prior Methods (
MBPO,Dreamer,SHAC): WhileMBPOcombines model-based and model-free, andDreamerlearns policies in latent space,RWMexplicitly highlights its ability to optimizePPOpolicies on its learned model (MBPO-PPO) even over extended rollouts.SHACusesfirst-order gradientsthrough the model, which can be unstable withdiscontinuous dynamics. -
RWM's Innovation: The high fidelity and robustness of
RWM'slong-horizon predictionsdirectly translate into the ability to trainPPOpolicies that are stable and effective. This enableszero-shot deploymenton physical hardware (ANYmal D,Unitree G1) with minimal performance loss, which is a significant practical achievement that many model-based methods struggle with. The paper suggests this is the first framework to reliably train policies on a learned neural network simulator without domain-specific knowledge and deploy them on hardware with minimal performance loss.In essence,
RWMdifferentiates itself by prioritizing a general, robust, and accurateworld modelthat handles long-term dependencies and partial observability through its unique training paradigm, which then directly facilitates more stable and deployablemodel-based policy optimizationfor real-world robotic systems.
-
4. Methodology
4.1. Principles
The core idea behind Robotic World Model (RWM) is to create a highly accurate and robust neural network-based simulator of robotic environments (a world model) that can make reliable long-horizon predictions without relying on domain-specific knowledge. This learned world model then serves as an imagined environment for efficiently training reinforcement learning policies. The theoretical basis hinges on modeling the environment as a Partially Observable Markov Decision Process (POMDP) and using self-supervised autoregressive training to overcome challenges like partial observability and error accumulation inherent in long-term forecasting. The intuition is that by training the model to predict its own future inputs, it becomes robust to the distributional shifts encountered during imagination rollouts, making the synthetic experience generated by the model a faithful proxy for real-world interactions for policy optimization.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Reinforcement Learning and World Models Formulation (Section 3.1)
The environment is formally modeled as a Partially Observable Markov Decision Process (POMDP). This choice acknowledges that robotic systems typically do not have access to the complete underlying state of the environment but rather receive observations.
A POMDP is defined by the tuple where:
-
: Denotes the
state space. This is the set of all possible true states of the environment, which are not directly observable by the agent. -
: Denotes the
action space. This is the set of all possible actions the agent can take. -
: Denotes the
observation space. This is the set of all possible observations the agent receives, which are noisy or partial reflections of the true state. -
: Represents the
transition kernel, which captures the environment dynamics. Specifically, it defines the probability of transitioning to a new state given the current state and action . This is denoted as . -
: Is the
reward function, . It maps a state-action-next state transition to a scalarreward. -
: Is the
observation kernel, . It defines the probabilities with which observations are emitted, given the current true state . -
: Is the
discount factor, which balances immediate and future rewards.The
agent(the robot) aims to learn apolicy. This policy maps observations to actions, with parameters . The objective of the policy is to maximize theexpected discounted return, which is the sum of future rewards discounted by : $ \mathbb { E } _ { \pi _ { \theta } } \left[ \sum _ { t \geq 0 } \gamma ^ { t } r _ { t } \right] $ Here,r _ { t }is the reward at time .
World models approximate the environment dynamics and enable policy optimization through simulated environment interactions (learning in imagination). The typical training process for world models involves three iterative steps:
-
Data Collection: Collect real interaction data from the environment by executing actions with a policy.
-
World Model Training: Train the
world modelusing this collected data to learn the environment's dynamics. -
Policy Optimization: Optimize the
policywithin the simulated environment generated by the trainedworld model.The
Robotic World Model (RWM)framework builds upon this core concept but introduces specific architectural and training innovations to enablereliable long-horizon predictionsin complex, stochastic, and partially observable settings, which are crucial for real-world robotics.
4.2.2. Self-supervised Autoregressive Training (Section 3.2)
The RWM framework utilizes a self-supervised autoregressive training mechanism as its backbone. This mechanism is designed to train the world model (, with parameters ) to predict future observations by using both historical observation-action sequences and its own predictions. This strategy aims to ensure robustness over extended rollouts.
Input and Prediction Mechanism:
At each time step , the world model receives a sequence of observation-action pairs spanning historical steps as input. Based on this history, the model predicts the distribution of the next observation.
The predictions are generated autoregressively. This means that after predicting an observation at step , this predicted observation () is then treated as if it were a real observation and appended to the history. It is then combined with the subsequent action () to form the input for predicting the observation at step . This process continues iteratively over a prediction horizon of steps, generating a sequence of future predictions.
The predicted observation steps ahead, denoted as , is sampled from the model's predicted distribution, which is conditioned on the historical context and the model's own prior predictions. The formula for this is:
$ \begin{array} { r } { o _ { t + k } ^ { \prime } \sim p _ { \phi } \left( \cdot { \mathrm { ~ | ~ } } o _ { t - M + k : t } , o _ { t + 1 : t + k - 1 } ^ { \prime } , a _ { t - M + k : t + k - 1 } \right) . } \end{array} $ (Equation 1)
Where:
- : The predicted observation at time step . The indicates it's sampled from the predicted distribution.
- : The
world modelparameterized by , predicting the distribution of the next observation. The placeholder signifies the output (the predicted observation). - : The actual (ground truth) observations from history, specifically from
M-k+1steps before the current prediction up to the current time . This maintains a window of real observations. - : The
model's own predicted observationsfrom the previousk-1steps within the forecast horizon. This is the core of theautoregressivemechanism. - : The sequence of actions, including historical actions (real) and future actions (which are typically sampled from the policy during imagination).
Privileged Information Prediction:
In addition to observations, the RWM also predicts privileged information , such as contact forces. This serves as an auxiliary learning objective, implicitly embedding crucial information for making accurate long-term predictions, especially in contact-rich robotics tasks.
Optimization Objective:
The world model is optimized by minimizing a multi-step prediction error, which combines the discrepancy between predicted and true observations (L _ { o }) and predicted and true privileged information (L _ { c }). The loss function is:
$ \mathcal { L } = \frac { 1 } { N } \sum _ { k = 1 } ^ { N } \alpha ^ { k } \left[ L _ { o } \left( o _ { t + k } ^ { \prime } , o _ { t + k } \right) + L _ { c } \left( c _ { t + k } ^ { \prime } , c _ { t + k } \right) \right] , $ (Equation 2)
Where:
-
: The total loss function to be minimized.
-
: The
forecast horizon, which is the number of future steps the model is trained to predict autoregressively. -
: A
decay factor(), which can be used to weight predictions differently based on their distance into the future (e.g., giving more weight to closer predictions if ). The paper uses , meaning no decay. -
: A loss function (e.g., mean squared error, negative log-likelihood if is a distribution) that quantifies the discrepancy between the predicted observation and the true observation .
-
: A similar loss function for the
privileged information, comparing the predicted and true .This
autoregressive training objectiveforces the model'shidden statesto learn representations that are robust enough to support accurate and reliable long-horizon predictions, even when fed its own generated data.
Training Data Construction:
Training data is prepared by sliding a window of size over collected trajectories. This ensures that for each training sample, there is sufficient historical context ( steps) and corresponding future ground truth for prediction targets ( steps). Reparameterization tricks are applied during training to enable effective end-to-end optimization, allowing gradients to flow through the stochastic sampling process.
Benefits of Autoregressive Training:
- Mitigates Error Accumulation: By training the model on its own predictions, it becomes more robust to the compounding errors that naturally occur during long
autoregressive rolloutsat test time. - Addresses Partial Observability: Incorporating historical observations () allows
RWMto capture unobservable dynamics and infer missing information, crucial inpartially observableenvironments. - Reduces Distribution Mismatch: This training scheme exposes the model to the distribution of inputs it will encounter during inference (its own predictions), reducing the
exposure biasproblem common inteacher-forcingmethods. - Generalization: Eliminates the need for
handcrafted representationsordomain-specific inductive biases, enhancinggeneralizationacross diverse tasks.
Comparison with Teacher-Forcing:
Figure 2 illustrates the difference between autoregressive training (Fig. 2a) and teacher-forcing training (Fig. 2b).
- Autoregressive Training (Fig. 2a): For a history horizon and forecast horizon , the model takes as input, predicts . Then it takes to predict . This matches how the model will be used during
imagination. - Teacher-Forcing Training (Fig. 2b): This is a special case of
autoregressive trainingwhere theforecast horizon. It always uses theground truth observationto predict . While this allows for greater parallelization during training, it doesn't prepare the model for scenarios where its own predictions are fed back, leading to less robustness toerror accumulation.
Network Architecture:
While the autoregressive training framework is architecture-agnostic, RWM utilizes a GRU-based architecture. GRUs are chosen for their ability to maintain long-term historical context while being computationally efficient and operating on low-dimensional inputs. The network outputs the mean and standard deviation of a Gaussian distribution for the next observation, reflecting the stochasticity of the environment.
Dual-Autoregressive Mechanism:
The RWM framework introduces a specific dual-autoregressive mechanism (visualized in Figure S6):
-
Inner Autoregression: Within the
context horizon, theGRU hidden statesare updatedautoregressivelyafter processing each historical step. This ensures that theGRUeffectively captures and maintains a rich, long-term memory of past events. -
Outer Autoregression: This refers to the process described above where predicted observations from the
forecast horizonare fed back into the network as inputs for subsequent predictions. This is the primary mechanism for mitigatingerror accumulationand training forlong-horizon robustness.This dual mechanism provides a robust way to handle
long-term dependenciesandtransitions, makingRWMsuitable for complex robotics applications.
The following figure (Figure S6 from the original paper) visualizes the dual-autoregressive mechanism:
Figure S6: Dual-autoregressive mechanism employed in RWM. Inner autoregression updates GRU hidden states after each historical step within the context horizon, while outer autoregression feeds predicted observations from the forecast horizon back into the network. The dashed arrows denote the sequential autoregressive prediction steps, highlighting robustness to long-term dependencies and transitions.
Figure S6: Dual-autoregressive mechanism employed in RWM. Inner autoregression updates GRU hidden states after each historical step within the context horizon, while outer autoregression feeds predicted observations from the forecast horizon back into the network. The dashed arrows denote the sequential autoregressive prediction steps, highlighting robustness to long-term dependencies and transitions.
4.2.3. Policy Optimization on Learned World Models (Section 3.3)
The policy optimization in RWM is performed using the learned world model, drawing inspiration from Model-Based Policy Optimization (MBPO) [13] and the Dyna algorithm [42]. This approach combines model-based imagination (generating synthetic experience) with model-free reinforcement learning (using an algorithm like PPO to update the policy) to achieve efficient and robust policy optimization.
Action Generation in Imagination:
During imagination rollouts, the actions are recursively generated by the policy (with parameters ). These actions are conditioned on the observations predicted by the world model , which itself is conditioned on its previous predictions. This creates a closed loop in the imagined environment.
The actions at time in imagination can be written as:
$ \begin{array} { r } { a _ { t + k } ^ { \prime } \sim \pi _ { \theta } \left( \cdot \mid o _ { t + k } ^ { \prime } \right) , } \end{array} $ (Equation 3)
Where:
- : The action sampled from the policy at time in the imagined environment.
- : The
policynetwork, parameterized by , which outputs a distribution over actions conditioned on the current imagined observation. - : The observation at time , which is predicted
autoregressivelyby theworld modelas described in Equation 1.
Reward Calculation:
Rewards for these imagined transitions are computed from the imagined observations and privileged information predicted by the world model.
Policy Optimization Algorithm (MBPO-PPO):
The overall policy optimization process is outlined in Algorithm 1. The approach is denoted as MBPO-PPO because it adapts the MBPO framework to use PPO for policy updates, leveraging the RWM's robust long-horizon rollouts.
The following is Algorithm 1 from the original paper:
| 1: Initialize policy πθ, world model pφ, and replay buffer D |
| 2: for learning iterations = 1, 2, . . . do |
| 3: Collect observation-action pairs in D by interacting with the environment using πθ |
| 4: Update pφ with autoregressive training using data sampled from D according to Eq. 2 |
| 5: Initialize imagination agents with observations sampled from D |
| 6: Roll out imagination trajectories using π0 and pφ for T steps according to Eq. 3 |
| 7: Update πθ using PPO or another reinforcement learning algorithm end for |
Let's break down each step:
-
Initialize policy , world model , and replay buffer :
- The
policynetwork (the agent's brain for deciding actions) and theworld model(the learned simulator) are initialized with random or pre-trained weights. - A
replay bufferis initialized to store real interaction data.
- The
-
for learning iterations = 1, 2, . . . do: The learning process proceeds iteratively.
-
Collect observation-action pairs in by interacting with the environment using :
- The current
policyis deployed in the actual (or high-fidelity simulated) environment for a short period. - The agent collects
observation-action pairs() along with subsequent observations () and rewards (). - This real-world interaction data is stored in the
replay buffer. This step is crucial for gathering fresh, accurate experience.
- The current
-
Update with autoregressive training using data sampled from according to Eq. 2:
- The
world modelis updated using theautoregressive trainingscheme described in Section 3.2 (using Equation 2 as the loss function). - Training data for the
world modelis sampled from thereplay buffer. This step refines theworld modelto better match the real environment's dynamics based on the latest experience.
- The
-
Initialize imagination agents with observations sampled from :
- To begin
imagination rollouts, multiple "imagination agents" are initialized. Each agent starts from a realobservation(orstateif available) sampled from thereplay buffer. This grounds the imagined trajectories in realistic starting points.
- To begin
-
Roll out imagination trajectories using and for steps according to Eq. 3:
- For each
imagination agent, thepolicyproposes an action based on the current imagined observation. - The
world modelthen predicts the next observation and reward based on this action and the current imagined observation. - This process is repeated for steps, generating
imagined trajectories(sequences of imagined observations, actions, and rewards). The actions are generated using Equation 3, and the observations are predicted autoregressively using the model. These imagined trajectories effectively augment the real data.
- For each
-
Update using PPO or another reinforcement learning algorithm:
- The
policyis updated using amodel-free reinforcement learningalgorithm, specificallyProximal Policy Optimization (PPO)in this framework. - The training data for
PPOcomes from both the real data in and, crucially, theimagined trajectoriesgenerated in step 6. The imagined data allowsPPOto learn and explore efficiently without needing extensive real-world interactions.
- The
-
end for: The loop continues, iteratively collecting new real data, updating the
world model, generating imagined data, and refining thepolicy.
The training diagram for MBPO-PPO is visualized in Figure S7:
Figure S7: Model-Based Policy Optimization with learned world models. The framework combines real environment interactions with simulated rollouts for efficient policy optimization. Observation and action pairs from the environment are stored in a replay buffer and used to train the autoregressive world model. Imagination rollouts using the learned model predict future states over a horizon of , providing trajectories for policy updates through reinforcement learning algorithms.
Figure S7: Model-Based Policy Optimization with learned world models. The framework combines real environment interactions with simulated rollouts for efficient policy optimization. Observation and action pairs from the environment are stored in a replay buffer and used to train the autoregressive world model. Imagination rollouts using the learned model predict future states over a horizon of , providing trajectories for policy updates through reinforcement learning algorithms.
Challenges and RWM's Robustness:
While PPO is a strong performer, training it on learned world models is challenging. Model inaccuracies can be exploited by the policy, leading to a reality gap where policies perform well in imagination but poorly in the real world. This problem is exacerbated by the extended autoregressive rollouts required for PPO, which can compound prediction errors.
Despite these challenges, RWM demonstrates significant robustness by successfully optimizing policies over hundreds of autoregressive steps with MBPO-PPO. This is a key differentiator, as many existing frameworks (MBPO, Dreamer, TD-MPC) struggle with such long horizons due to model inaccuracies and error accumulation. This result underscores the accuracy and stability of RWM's training method and its ability to synthesize policies that are reliably deployable on hardware.
5. Experimental Setup
The experiments conducted aim to validate RWM's accuracy, robustness, and architectural design choices across diverse robotic systems and environments. They also demonstrate its effectiveness with MBPO-PPO for policy optimization in simulation and real-world deployment.
5.1. Datasets
The world model is trained using simulation data generated by a velocity tracking policy. This policy interacts with a simulated environment to produce trajectories which are then used to train RWM.
The robots used for evaluation are:
-
ANYmal D: A
quadrupedal robot(four-legged). -
Unitree G1: A
humanoid robot.The simulation environment used is
Isaac Lab[43], a unified simulation framework for interactive robot learning.
The detailed observation, privileged information, and action spaces for the world model and the policy are provided in the supplementary material.
5.1.1. World Model Observation Space
The world model receives observations about the robot's state. The structure of this observation space is detailed in Table S2.
The following are the results from Table S2 of the original paper:
| Entry | Symbol | Dimensions | Entry | Symbol | Dimensions |
|---|---|---|---|---|---|
| ANYmal D | Unitree G1 | ||||
| base linear velocity | U | 0:3 | base linear velocity | U | 0:3 |
| base angular velocity | 3 | 3:6 | base angular velocity | 3 | 3:6 |
| projected gravity | g | 6:9 | projected gravity | g | 6:9 |
| joint positions | q | 9:21 | joint positions | q | 9:38 |
| joint velocities | q | 21:33 | joint velocities | q | 38:67 |
| joint torques | τ | 33:45 | joint torques | τ | 67:96 |
Where:
base linear velocity (v): The robot's linear velocity in the x, y, z directions in its body frame.- : The robot's angular velocity (roll, pitch, yaw rates) in its body frame.
projected gravity (g): A measurement of the gravity vector projected into the robot's body frame, which indicates the robot's orientation relative to gravity (e.g., pitch and roll).joint positions (q): The current angular positions of the robot's joints.joint velocities (\dot{q}): The current angular velocities of the robot's joints.joint torques (\tau): The torques applied at the robot's joints. The dimensions indicate the slice of the observation vector corresponding to each entry. For example, for ANYmal D,base linear velocityoccupies indices 0 to 2 (3 dimensions).
5.1.2. World Model Privileged Information Space
Privileged information is data that is useful for learning but might not be directly available during real-world deployment or is difficult to obtain. In this context, it's used as an auxiliary learning objective for the world model.
The following are the results from Table S3 of the original paper:
| Entry | Symbol | Dimensions | Entry | Symbol | Dimensions |
|---|---|---|---|---|---|
| ANYmal D | Unitree G1 | ||||
| knee contact | 0:4 | body contact | 0:26 | ||
| foot contact | 4:8 | foot height | 26:28 | ||
| foot velocity | 28:30 |
Where:
knee contact: Binary or continuous values indicating contact force/state for the robot's knees.foot contact: Binary or continuous values indicating contact force/state for the robot's feet.body contact: For Unitree G1, this likely refers to contact forces across various parts of the robot's body.foot height: The height of the robot's feet relative to the ground.foot velocity: The velocity of the robot's feet.
5.1.3. World Model Action Space
The action space defines the control signals sent to the robot.
The following are the results from Table S4 of the original paper:
| Entry | Symbol | Dimensions | Entry | Symbol | Dimensions |
|---|---|---|---|---|---|
| ANYmal D | Unitree G1 | ||||
| joint position targets | q* | 0:12 | joint position targets | q* | 0:29 |
Where:
joint position targets (q*): These are the desired target angular positions for the robot's joints, which a low-level controller then tries to achieve. This is a common action representation inRLfor legged robots.
5.1.4. Policy Observation Space
The policy, which learns to control the robot, receives a different set of observations tailored to the control task (velocity tracking).
The following are the results from Table S5 of the original paper:
| Entry | Symbol | Dimensions | Entry | Symbol | Dimensions |
|---|---|---|---|---|---|
| ANYmal D | Unitree G1 | ||||
| base linear velocity | U | 0:3 | base linear velocity | U | 0:3 |
| base angular velocity | 3 | 3:6 | base angular velocity | 3 | 3:6 |
| projected gravity | g | 6:9 | projected gravity | g | 6:9 |
| velocity command | c | 9:12 | velocity command | c | 9:12 |
| joint positions | q | 12:24 | joint positions | q | 12:41 |
| joint velocities | q | 24:36 | joint velocities | q | 41:70 |
Where:
base linear velocity (v): Same as the world model observation.- : Same as the world model observation.
projected gravity (g): Same as the world model observation.velocity command (c): The desired linear (x, y) and angular (z) velocities that the policy is trying to achieve. This makes the policygoal-conditioned.joint positions (q): Same as the world model observation.joint velocities (\dot{q}): Same as the world model observation.
5.2. Evaluation Metrics
5.2.1. Autoregressive Prediction Error ()
Conceptual Definition: This metric quantifies how accurately the world model can predict future observations (or states) when it uses its own predictions as input for subsequent steps, mimicking its behavior during imagination rollouts. A lower error indicates that the model is robust to error accumulation and can maintain fidelity over long horizons, which is critical for model-based policy optimization. The paper mentions it as a relative prediction error.
Mathematical Formula: The paper does not provide an explicit mathematical formula for the relative prediction error . However, based on common practice in world model evaluation, it is generally calculated as the normalized difference between the predicted trajectory and the ground truth trajectory. A common way to define relative prediction error for a sequence of predictions could be:
$
e = \frac{1}{N \cdot D} \sum_{k=1}^{N} \frac{|o_{t+k}^\prime - o_{t+k}|2}{|o{t+k}|2 + \epsilon}
$
Or, more simply, if it refers to the average magnitude of error relative to the magnitude of the actual observation:
$
e = \frac{1}{N} \sum{k=1}^{N} \frac{|o_{t+k}^\prime - o_{t+k}|_2}{\text{scale_factor}}
$
Given the context, the paper likely uses a metric that averages the L2 norm (Euclidean distance) of the difference between predicted and ground-truth observations, potentially normalized by some scale factor to make it "relative" and comparable across different observation dimensions or environments.
Symbol Explanation (based on likely interpretation):
- : The relative autoregressive prediction error.
- : The forecast horizon (number of steps predicted).
- : The dimensionality of the observation space.
- : The
world model's predicted observation at time step . - : The true (ground truth) observation at time step .
- : The
Euclidean (L2) norm, measuring the distance between vectors. - : A normalization constant (e.g., standard deviation of observations, or maximum possible observation value) to make the error "relative."
- : A small constant to prevent division by zero, if normalizing by the magnitude of the ground truth.
5.2.2. Policy Mean Reward ()
Conceptual Definition: This metric represents the average cumulative reward obtained by the trained policy over an episode or a set of episodes. In reinforcement learning, the goal is to maximize this cumulative reward. A higher mean reward indicates a more effective policy that successfully achieves its objectives (e.g., velocity tracking) while adhering to desired behaviors (e.g., maintaining balance, avoiding undesired contacts). The paper distinguishes between estimated rewards (computed from the world model's predictions) and ground truth rewards (reported by the simulator).
Mathematical Formula: For a single episode of length , the discounted return (cumulative reward) is:
$
G = \sum_{t=0}^{T_{episode}-1} \gamma^t r_t
$
The policy mean reward would then be the average of these discounted returns over many episodes.
$
\text{Mean Reward} = \frac{1}{K} \sum_{i=1}^{K} G_i
$
Symbol Explanation:
- : The discounted return for a single episode.
- : The length of an episode.
- : The reward received at time step .
- : The
discount factor. - : The total number of evaluation episodes.
- : The discounted return for the -th episode.
5.2.3. Reward Functions
The total reward for the policy is a sum of several weighted terms. These terms are designed to encourage desired behaviors (e.g., tracking velocity) and penalize undesirable ones (e.g., high joint torques, undesired contacts). The weights for each term are detailed in Table S6.
The following are the results from Table S6 of the original paper:
| Symbol | Value | Symbol | Value | Symbol | Value | Symbol | Value |
|---|---|---|---|---|---|---|---|
| ANYmal D | Unitree G1 | ||||||
| Wvxy | 1.0 | Wωz | 0.5 | Wvxy | 1.0 | Wωz | 0.5 |
| Wvz | -2.0 | Wωxy | -0.05 | Wvz | −2.0 | Wωxy | -0.05 |
| Wqt | -2.5e-5 | Wq | -2.5e-7 | Wqt | -2.5e-5 | Wq | -2.5e-7 |
| W | -0.01 | Wfa | 0.5 | W | −0.05 | Wfa | 0.0 |
| Wc | -1.0 | Wg | -5.0 | Wc | -1.0 | Wg | -5.0 |
| Wfc | 0.0 | Wqd | 0.0 | Wfc | 1.0 | Wqd | −1.0 |
Individual Reward Terms:
- Linear velocity tracking (x, y): Rewards the policy for matching the commanded linear velocity in the horizontal plane.
  $r_{v_{xy}} = w_{v_{xy}} e^{-\|c_{xy} - v_{xy}\|_2^2 / \sigma_{v_{xy}}^2}$
  Where:
  - $w_{v_{xy}}$: Weight for this reward term (e.g., 1.0).
  - $e$: Euler's number (base of the natural logarithm).
  - $c_{xy}$: Commanded base linear velocity in the x, y plane.
  - $v_{xy}$: Current base linear velocity in the x, y plane.
  - $\|\cdot\|_2^2$: Squared Euclidean distance (squared L2 norm).
  - $\sigma_{v_{xy}}$: A temperature factor that controls the sensitivity of the exponential reward; a smaller $\sigma_{v_{xy}}$ makes the reward drop faster as the error increases.
- Angular velocity tracking (z): Rewards the policy for matching the commanded angular velocity around the vertical (z) axis.
  $r_{\omega_z} = w_{\omega_z} e^{-\|c_{z} - \omega_{z}\|_2^2 / \sigma_{\omega_z}^2}$
  Where:
  - $w_{\omega_z}$: Weight for this reward term (e.g., 0.5).
  - $c_z$: Commanded base angular velocity around the z axis.
  - $\omega_z$: Current base angular velocity around the z axis.
  - $\sigma_{\omega_z}$: A temperature factor.
- Linear velocity (z): Penalizes vertical velocity, encouraging the robot to maintain a stable height.
  $r_{v_z} = w_{v_z} v_z^2$
  Where:
  - $w_{v_z}$: Weight for this reward term (e.g., -2.0, negative for penalty).
  - $v_z$: Current base vertical velocity; the squared term makes the penalty independent of direction.
- Angular velocity (x, y): Penalizes roll and pitch angular velocities, encouraging the robot to stay level.
  $r_{\omega_{xy}} = w_{\omega_{xy}} \|\omega_{xy}\|_2^2$
  Where:
  - $w_{\omega_{xy}}$: Weight for this reward term (e.g., -0.05, negative for penalty).
  - $\omega_{xy}$: Current base roll and pitch angular velocities.
- Joint torque: Penalizes large joint torques, encouraging energy-efficient and smooth movements.
  $r_{q_\tau} = w_{q_\tau} \|\tau\|_2^2$
  Where:
  - $w_{q_\tau}$: Weight for this reward term (e.g., -2.5e-5, negative for penalty).
  - $\tau$: Joint torques.
- Joint acceleration: Penalizes high joint accelerations, promoting smoother motions.
  $r_{\ddot{q}} = w_{\ddot{q}} \|\ddot{q}\|_2^2$
  Where:
  - $w_{\ddot{q}}$: Weight for this reward term (e.g., -2.5e-7, negative for penalty).
  - $\ddot{q}$: Joint accelerations.
- Action rate: Penalizes rapid changes in actions (joint position targets), promoting smoother control signals.
  $r_{\dot{a}} = w_{\dot{a}} \|a' - a\|_2^2$
  Where:
  - $w_{\dot{a}}$: Weight for this reward term (e.g., -0.01, negative for penalty).
  - $a'$: Previous action.
  - $a$: Current action.
- Feet air time: Rewards keeping feet in the air for appropriate durations (e.g., during the swing phase of locomotion).
  $r_{f_a} = w_{f_a} t_{f_a}$
  Where:
  - $w_{f_a}$: Weight for this reward term (e.g., 0.5 for ANYmal D, 0.0 for Unitree G1).
  - $t_{f_a}$: Sum of the time for which the feet are in the air.
- Undesired contacts: Penalizes contacts by parts of the robot other than the feet (e.g., knees, body).
  $r_c = w_c c_u$
  Where:
  - $w_c$: Weight for this reward term (e.g., -1.0, negative for penalty).
  - $c_u$: Count of undesired contacts.
- Flat orientation: Penalizes deviations from a flat (level) orientation, encouraging a stable posture.
  $r_g = w_g g_{xy}^2$
  Where:
  - $w_g$: Weight for this reward term (e.g., -5.0, negative for penalty).
  - $g_{xy}$: The x, y components of the projected gravity vector (non-zero when the robot is pitched or rolled).
- Foot clearance: Rewards maintaining sufficient clearance height for the swinging feet to avoid obstacles.
  $r_{f_c} = w_{f_c} h_{f_c}$
  Where:
  - $w_{f_c}$: Weight for this reward term (e.g., 0.0 for ANYmal D, 1.0 for Unitree G1).
  - $h_{f_c}$: Clearance height of the swing feet.
- Joint deviation: Penalizes deviation of joint positions from a default (resting) pose.
  $r_{q_d} = w_{q_d} \|q - q_0\|_1$
  Where:
  - $w_{q_d}$: Weight for this reward term (e.g., 0.0 for ANYmal D, -1.0 for Unitree G1).
  - $q$: Current joint positions.
  - $q_0$: Default joint positions.
  - $\|\cdot\|_1$: L1 norm (sum of absolute differences).
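The sketch below (not from the paper) illustrates how a few of the weighted terms above could be combined into a single scalar reward. The observation dictionary keys, the temperature values `sigma_v` and `sigma_w`, and the selection of terms are assumptions chosen for illustration; the weights are taken from the ANYmal D column of Table S6.

```python
import numpy as np

# Hypothetical weights (subset of the ANYmal D column of Table S6).
weights = {"v_xy": 1.0, "w_z": 0.5, "v_z": -2.0, "q_tau": -2.5e-5, "action_rate": -0.01}

def total_reward(obs: dict, w: dict, sigma_v: float = 0.25, sigma_w: float = 0.25) -> float:
    """Combine a subset of the weighted reward terms listed above."""
    r = 0.0
    # Linear velocity tracking (x, y): exponential kernel on the squared tracking error.
    err_xy = np.sum((obs["cmd_vel_xy"] - obs["base_vel_xy"]) ** 2)
    r += w["v_xy"] * np.exp(-err_xy / sigma_v**2)
    # Angular velocity tracking (z).
    err_wz = (obs["cmd_yaw_rate"] - obs["base_yaw_rate"]) ** 2
    r += w["w_z"] * np.exp(-err_wz / sigma_w**2)
    # Penalties: vertical velocity, joint torques, and action rate.
    r += w["v_z"] * obs["base_vel_z"] ** 2
    r += w["q_tau"] * np.sum(obs["joint_torques"] ** 2)
    r += w["action_rate"] * np.sum((obs["action"] - obs["prev_action"]) ** 2)
    return float(r)

obs = {
    "cmd_vel_xy": np.array([1.0, 0.0]), "base_vel_xy": np.array([0.9, 0.05]),
    "cmd_yaw_rate": 0.0, "base_yaw_rate": 0.02, "base_vel_z": 0.01,
    "joint_torques": np.zeros(12), "action": np.zeros(12), "prev_action": np.zeros(12),
}
print(total_reward(obs, weights))
```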
5.3. Baselines
5.3.1. World Model Baselines
For comparing autoregressive trajectory prediction errors, RWM is evaluated against several common neural network architectures and a variant of itself:
- MLP: A Multilayer Perceptron, a basic feedforward neural network used to predict the next state. This serves as a fundamental baseline showing the benefits of recurrent models.
- RSSM (Recurrent State-Space Model): This architecture, notably used in PlaNet [15] and Dreamer [29, 11, 30], integrates recurrent processing with a latent state space to model dynamics. It is a strong baseline for latent dynamics models.
- Transformer-based architectures [41, 45]: Transformers are powerful sequence models built on attention mechanisms. They have been applied as world models [32] and in RL (e.g., Decision Transformer [41]), representing a state-of-the-art sequence-modeling baseline.
- RWM-TF (RWM with teacher-forcing): A variant of RWM trained with the teacher-forcing paradigm (forecast horizon $N = 1$, i.e., one-step prediction using ground-truth observations as input). This baseline highlights the importance and benefits of autoregressive training (RWM-AR) over teacher-forcing; a sketch contrasting the two training regimes follows this list.
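The following sketch contrasts teacher-forcing and autoregressive training in a single loss function: with `forecast_n = 1` the model is only ever asked for one-step predictions from ground-truth context (teacher-forcing), whereas `forecast_n > 1` forces it to consume its own predictions during training. The step-wise model interface `model(obs, act, hidden) -> (next_obs_prediction, hidden)` is an assumption for illustration, not the paper's API.

```python
import torch

def multi_step_loss(model, obs_seq, act_seq, history_m: int, forecast_n: int):
    """Autoregressive training loss: after consuming M ground-truth steps, the
    model predicts the next N observations from its OWN predictions.
    obs_seq: (history_m + forecast_n, obs_dim), act_seq: (history_m + forecast_n, act_dim).
    """
    hidden, pred = None, None
    # Warm up the recurrent state on the ground-truth history (teacher input).
    for t in range(history_m):
        pred, hidden = model(obs_seq[t], act_seq[t], hidden)  # pred estimates obs_seq[t + 1]

    # Roll forward N steps, feeding the model's own predictions back in.
    loss = torch.zeros(())
    for k in range(forecast_n):
        target = obs_seq[history_m + k]
        loss = loss + torch.mean((pred - target) ** 2)
        if k + 1 < forecast_n:
            pred, hidden = model(pred, act_seq[history_m + k], hidden)
    return loss / forecast_n
```

With this formulation, RWM-TF corresponds to calling `multi_step_loss(..., forecast_n=1)` while RWM-AR uses a larger forecast horizon, which is what exposes the model to compounding errors during training.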
5.3.2. Policy Optimization Baselines
For evaluating the policy learning and hardware transfer capabilities of MBPO-PPO, the following baselines are used:
- Short-Horizon Actor-Critic (SHAC) [38]: An MBRL method that employs a first-order gradient-based optimization approach, propagating gradients through the world model to optimize the policy. It is noted for its ability to leverage differentiable simulation.
- DreamerV3 [30]: A highly advanced MBRL framework that integrates a latent-space dynamics model with an actor-critic framework. It is known for its sample efficiency and robustness in continuous control tasks, mastering diverse domains.
5.4. Network Architectures
5.4.1. RWM Architecture
The RWM model consists of a GRU base and MLP heads.
The following are the results from Table S7 of the original paper:
| Component | Type | Hidden Shape | Activation |
|---|---|---|---|
| base | GRU | 256, 256 | — |
| heads | MLP | 128 | ReLU |
Where:
- base: The main recurrent part of the model, a GRU (Gated Recurrent Unit). It likely has two layers with 256 hidden units each, as indicated by 256, 256.
- heads: MLP (Multilayer Perceptron) networks that branch off the GRU's output and predict the mean and standard deviation of the next observation as well as the privileged information. Each head has a hidden layer of 128 units and uses the ReLU (Rectified Linear Unit) activation function.
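A minimal PyTorch sketch consistent with Table S7 is given below. The head outputs (next-observation mean and log-std plus privileged information) and the example dimensions are assumptions based on the description above, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RWMLikeModel(nn.Module):
    """Sketch of a GRU base with small MLP prediction heads (cf. Table S7)."""

    def __init__(self, obs_dim: int, act_dim: int, priv_dim: int, hidden_dim: int = 256):
        super().__init__()
        # Two-layer GRU base with 256 hidden units per layer.
        self.base = nn.GRU(obs_dim + act_dim, hidden_dim, num_layers=2, batch_first=True)

        # One-hidden-layer (128 units, ReLU) heads branching off the GRU features.
        def head(out_dim: int) -> nn.Sequential:
            return nn.Sequential(nn.Linear(hidden_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

        self.obs_mean = head(obs_dim)
        self.obs_logstd = head(obs_dim)
        self.priv = head(priv_dim)

    def forward(self, obs, act, hidden=None):
        # obs: (batch, T, obs_dim), act: (batch, T, act_dim)
        features, hidden = self.base(torch.cat([obs, act], dim=-1), hidden)
        return self.obs_mean(features), self.obs_logstd(features), self.priv(features), hidden

# Example with hypothetical dimensions (48-D observations, 12 joints, 8-D privileged info).
model = RWMLikeModel(obs_dim=48, act_dim=12, priv_dim=8)
mean, logstd, priv, h = model(torch.randn(4, 32, 48), torch.randn(4, 32, 12))
```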
5.4.2. Baseline Architectures
The architectures for the baseline world models are provided for comparison.
The following are the results from Table S8 of the original paper:
| Network | Parameter | Value |
|---|---|---|
| MLP | hidden shape | 256, 256 |
| | activation | ReLU |
| RSSM | type | GRU |
| | hidden size | 256 |
| | layers | 2 |
| | latent dimension | 64 |
| | prior type | categorical |
| | categories | 32 |
| Transformer | type | decoder |
| | dimension | 64 |
| | heads | 8 |
| | layers | 2 |
| | context length | 32 |
| | positional encoding | sinusoidal |
Where:
- MLP: Similar to RWM's heads, it has two hidden layers of 256 units each and uses ReLU activation.
- RSSM (Recurrent State-Space Model):
  - type: Uses a GRU for its recurrent component.
  - hidden size: 256 units per layer.
  - layers: 2 recurrent layers.
  - latent dimension: The dimensionality of the continuous latent state is 64.
  - prior type: Uses a categorical distribution for the discrete latent state.
  - categories: The number of discrete categories in the latent state is 32.
- Transformer:
  - type: A decoder-only Transformer architecture.
  - dimension: Model dimension of 64.
  - heads: 8 attention heads.
  - layers: 2 Transformer layers.
  - context length: 32, meaning it attends over the 32 previous tokens/steps.
  - positional encoding: Uses sinusoidal positional encoding to incorporate sequence-order information.
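For concreteness, a compact sketch of a decoder-only Transformer dynamics model matching the configuration in Table S8 (model dimension 64, 8 heads, 2 layers, context length 32, sinusoidal positional encoding) is shown below. The class name and input dimensions are illustrative, not the baseline's actual code.

```python
import math
import torch
import torch.nn as nn

class TinyDecoderOnlyDynamics(nn.Module):
    """Sketch of a decoder-only Transformer world-model baseline (cf. Table S8)."""

    def __init__(self, obs_dim: int, act_dim: int, d_model: int = 64, context: int = 32):
        super().__init__()
        self.embed = nn.Linear(obs_dim + act_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, obs_dim)
        # Fixed sinusoidal positional encoding over the context window.
        pos = torch.arange(context).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(context, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, obs, act):
        # obs, act: (batch, T, dim) with T <= context; a causal mask enforces autoregression.
        x = self.embed(torch.cat([obs, act], dim=-1)) + self.pe[: obs.shape[1]]
        mask = nn.Transformer.generate_square_subsequent_mask(obs.shape[1])
        return self.out(self.blocks(x, mask=mask))

model = TinyDecoderOnlyDynamics(obs_dim=48, act_dim=12)
pred = model(torch.randn(2, 32, 48), torch.randn(2, 32, 12))  # (2, 32, 48) next-obs predictions
```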
5.4.3. MBPO-PPO Policy and Value Function Architecture
The policy and value function networks used within the MBPO-PPO framework are MLPs.
The following are the results from Table S9 of the original paper:
| Network | Type | Hidden Shape | Activation |
|---|---|---|---|
| policy | MLP | 128, 128, 128 | ELU |
| value function | MLP | 128, 128, 128 | ELU |
Where:
- policy: The actor network, an MLP with three hidden layers of 128 units each and the ELU (Exponential Linear Unit) activation function.
- value function: The critic network, also an MLP with three hidden layers of 128 units each and ELU activation.
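A minimal sketch of the Table S9 networks, assuming hypothetical observation and action dimensions:

```python
import torch.nn as nn

def mlp(in_dim: int, out_dim: int, hidden: int = 128) -> nn.Sequential:
    """Three hidden layers of 128 units with ELU activation, as in Table S9."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ELU(),
        nn.Linear(hidden, hidden), nn.ELU(),
        nn.Linear(hidden, hidden), nn.ELU(),
        nn.Linear(hidden, out_dim),
    )

policy_net = mlp(in_dim=48, out_dim=12)  # maps observations to action means
value_net = mlp(in_dim=48, out_dim=1)    # maps observations to a scalar state value
```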
5.5. Training Parameters
All learning networks and algorithms are implemented in PyTorch 2.4.0 with CUDA 12.6 and trained on an NVIDIA RTX 4090 GPU.
5.5.1. RWM Training Parameters
The training information for the RWM itself is summarized below.
The following are the results from Table S10 of the original paper:
| Parameter | Symbol | Value |
|---|---|---|
| step time seconds | ∆t | 0.02 |
| max iterations | − | 2500 |
| learning rate | − | 1e-4 |
| weight decay | − | 1e-5 |
| batch size | − | 1024 |
| history horizon | M | 32 |
| forecast horizon | N | 8 |
| forecast decay | α | 1.0 |
| approximate training hours | − | 1 |
| number of seeds | − | 5 |
Where:
- step time seconds (∆t): The duration of a single simulation/control step, 0.02 seconds (corresponding to a 50 Hz control frequency).
- max iterations: The maximum number of training steps for the world model.
- learning rate: The step size for gradient descent (1e-4 = 0.0001).
- weight decay: A regularization term (L2 regularization) to prevent overfitting.
- batch size: The number of samples processed in one training iteration.
- history horizon (M): The number of past observation-action pairs used as context for prediction.
- forecast horizon (N): The number of future steps the model is trained to predict autoregressively.
- forecast decay (α): The decay factor for the multi-step prediction loss; 1.0 means all steps in the forecast horizon are weighted equally (see the sketch below).
- approximate training hours: The approximate training duration for the world model.
- number of seeds: The number of random seeds used for multiple runs to ensure statistical robustness.
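One common way such a decay factor enters the training objective is to weight the loss of forecast step $k$ by $\alpha^k$, as in the short sketch below; the exact weighting used by RWM is not spelled out here, so treat this as an illustrative interpretation.

```python
from typing import Sequence

def weighted_forecast_loss(step_losses: Sequence[float], alpha: float = 1.0) -> float:
    """Weight the k-th forecast-step loss by alpha**k and normalize.

    With alpha = 1.0 (Table S10) every step in the forecast horizon contributes equally.
    """
    weights = [alpha**k for k in range(len(step_losses))]
    return sum(w * l for w, l in zip(weights, step_losses)) / sum(weights)

print(weighted_forecast_loss([0.1, 0.2, 0.4], alpha=1.0))
```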
5.5.2. MBPO-PPO Training Parameters
The training information for the MBPO-PPO algorithm (policy optimization) is summarized below.
The following are the results from Table S11 of the original paper:
| Parameter | Symbol | Value |
|---|---|---|
| imagination environments | − | 4096 |
| imagination steps per iteration | − | 100 |
| step time seconds | ∆t | 0.02 |
| buffer size | \|D\| | 1000 |
| max iterations | − | 2500 |
| learning rate | − | 0.001 |
| weight decay | − | 0.0 |
| learning epochs | − | 5 |
| mini-batches | − | 4 |
| KL divergence target | − | 0.01 |
| discount factor | γ | 0.99 |
| clip range | ε | 0.2 |
| entropy coefficient | − | 0.005 |
| number of seeds | − | 5 |
Where:
- imagination environments: The number of parallel imagination agents (or starting points for rollouts) used to generate synthetic experience.
- imagination steps per iteration: The length of each imagination rollout (the rollout horizon in Algorithm 1).
- step time seconds (∆t): Same as for RWM training.
- buffer size (|D|): The maximum number of real environment transitions stored in the replay buffer.
- max iterations: The maximum number of learning iterations for policy optimization.
- learning rate: Learning rate for the policy and value networks.
- weight decay: Regularization for the policy and value networks (0.0 means no L2 regularization).
- learning epochs: Number of times PPO updates are run over the collected data (real + imagined) per iteration.
- mini-batches: Number of mini-batches per PPO update epoch.
- KL divergence target: A target for the Kullback-Leibler (KL) divergence between the old and new policies, used in some PPO variants for adaptive clipping or learning-rate adjustment.
- discount factor (γ): Same as in the POMDP definition, used for computing discounted returns.
- clip range (ε): The hyperparameter for PPO's clipped surrogate objective (e.g., 0.2); see the sketch after this list.
- entropy coefficient: A term added to the PPO loss to encourage exploration by penalizing low policy entropy.
- number of seeds: Number of random seeds for multiple runs.
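The sketch below shows how the clip range and entropy coefficient from Table S11 typically enter a PPO policy loss. It is the generic clipped surrogate objective, not the authors' exact implementation.

```python
import torch

def ppo_policy_loss(log_prob_new, log_prob_old, advantages, entropy,
                    clip_range: float = 0.2, entropy_coef: float = 0.005):
    """Generic PPO clipped surrogate objective with an entropy bonus."""
    ratio = torch.exp(log_prob_new - log_prob_old)          # importance ratio of new vs. old policy
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    policy_loss = -torch.mean(torch.min(unclipped, clipped))  # pessimistic (clipped) objective
    return policy_loss - entropy_coef * torch.mean(entropy)   # entropy bonus encourages exploration
```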
5.6. Hardware
The learned policies are deployed and validated on two physical robotic platforms in a zero-shot transfer setup:
- ANYmal D [44]: A highly mobile and dynamic quadrupedal robot developed at ETH Zurich.
- Unitree G1: A humanoid robot from Unitree Robotics.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Autoregressive Trajectory Prediction (Section 4.1)
The ability of a world model to make accurate long-horizon predictions is crucial for effective planning and policy optimization. The experiments analyze the autoregressive prediction performance of RWM using trajectories from ANYmal D hardware, with a control frequency of 50 Hz. The world model was trained with a history horizon of M = 32 and a forecast horizon of N = 8 (the settings listed in Table S10).
The left panel of Figure 3 (shown below) visualizes the autoregressive trajectory predictions by RWM against ground truth trajectories. The solid lines represent the ground truth, and the dashed lines denote the predicted state evolution. Predictions begin once the historical observations have been consumed, and future observations are predicted autoregressively by feeding prior predictions back into the model.
The following figure (Figure 3 from the original paper) shows the autoregressive trajectory prediction and robustness under noise:
Figure 3: (Left) Solid lines represent ground truth trajectories, while dashed lines denote predicted state evolution. Predictions commence from the end of the historical observations, with future observations predicted autoregressively by feeding prior predictions back into the model. (Right) Yellow curves denote RWM at varying noise levels, demonstrating consistent robustness and lower error accumulation across forecast steps. Grey curves represent the MLP baseline, which exhibits significantly higher error accumulation and reduced robustness to noise.
Analysis:
The results demonstrate a remarkable alignment between RWM's predicted trajectories and the ground truth across all observed variables. This consistency is maintained even over extended rollouts, which is significant given the inherent challenge of compounding errors in long-horizon predictions. The strong performance is attributed to the dual-autoregressive mechanism introduced in Section 3.2, which stabilizes predictions even though a relatively short forecast horizon (N = 8) was used during training. The comparison of state evolution between RWM predictions and ground truth (Figure 1, bottom row, and Figure S9) further highlights RWM's ability to maintain consistency over horizons well beyond the training forecast horizon. This robustness is critical for stable policy learning and deployment.
6.1.2. Robustness under Noise (Section 4.2)
The capacity of a world model to generalize under noisy conditions, especially during autoregressive rollouts, is vital. Small deviations can rapidly cascade into large errors, leading to hallucinations. To test this, RWM's performance was evaluated under Gaussian noise perturbations applied to both observations and actions. The results were compared against an MLP-based baseline that was also trained autoregressively with the same history and forecast horizons. The right panel of Figure 3 (shown above) illustrates these results, where yellow curves represent RWM's relative prediction error at varying noise levels, and grey curves represent the MLP baseline.
Analysis:
RWM demonstrates a clear advantage over the MLP baseline. The MLP model's relative prediction error grows significantly and diverges much faster as the forecast steps increase, particularly under noise. In contrast, RWM exhibits superior stability, maintaining lower prediction errors even with high noise levels. This robustness is directly attributed to RWM's dual-autoregressive mechanism which, by continually refining the state representation towards long-term predictions, minimizes error accumulation even with noisy inputs.
6.1.3. Generality across Robotic Environments (Section 4.3)
To assess RWM's generality and robustness, its performance was compared against several baseline methods across a diverse range of robotic environments. The baselines include MLP, recurrent state-space model (RSSM), and transformer-based architectures. All models were provided the same context during training and evaluation. The relative autoregressive prediction errors for these models are presented in Figure 4. The tasks encompass manipulation scenarios, as well as quadruped and humanoid locomotion. The study also highlights the importance of autoregressive training by comparing RWM trained with teacher-forcing (RWM-TF) and autoregressive training (RWM-AR).
The following figure (Figure 4 from the original paper) shows autoregressive trajectory prediction errors across diverse robotic environments:
Figure 4: Autoregressive trajectory prediction errors across diverse robotic environments and network architectures. RWM trained with autoregressive training (RWM-AR) consistently outperforms baseline methods, including MLP, recurrent state-space model (RSSM), and transformer-based architectures. RWM-AR demonstrates superior generalization and robustness across tasks, from manipulation to locomotion. Autoregressive training (RWM-AR) reduces compounding errors over long rollouts, significantly improving performance compared to teacher-forcing training (RWM-TF).
Analysis:
Figure 4 clearly shows the superiority of RWM trained with autoregressive training (RWM-AR), which consistently achieves the lowest prediction errors across all evaluated environments. The performance gap is particularly noticeable in complex and dynamic tasks like velocity tracking for legged robots, where accurate long-horizon predictions are essential for effective control.
The comparison between RWM-AR and RWM-TF is also critical: RWM-AR significantly outperforms its teacher-forcing counterpart. This strongly underscores that the autoregressive training mechanism is vital for mitigating compounding prediction errors over long rollouts, validating one of the core contributions of the paper.
The paper notes that baselines are traditionally implemented using teacher-forcing. However, when RSSM is trained with autoregressive training, its performance becomes comparable to the GRU-based RWM. The authors chose the GRU-based RWM due to its simplicity and computational efficiency. Conversely, transformer architectures did not scale effectively with autoregressive training due to GPU memory constraints arising from multi-step gradient propagation, limiting their practicality for this specific approach.
These results affirm that RWM, particularly when coupled with its autoregressive training, achieves robust and generalizable performance across a wide array of robotic tasks. Visualizations of RWM-AR imagination rollouts against ground truth simulations (Figure 1 and S9) further support these claims.
6.1.4. Policy Learning and Hardware Transfer (Section 4.4)
The paper uses MBPO-PPO to train a goal-conditioned velocity tracking policy for ANYmal D and Unitree G1, leveraging the RWM. This policy's observation and action spaces, reward functions, and architectural details are provided in the supplementary material (Sections A.1 and A.2.3). The performance of MBPO-PPO is compared against Short-Horizon Actor-Critic (SHAC) [38] and DreamerV3 [30].
Figure 5 illustrates the model error and policy mean reward during policy optimization for ANYmal D (left) and Unitree G1 (right) velocity tracking tasks. Policies are trained using estimated rewards from RWM predictions, while ground truth rewards (solid lines) are for evaluation only.
The following figure (Figure 5 from the original paper) shows model error and policy mean reward during policy optimization:
Figure 5: Model error and policy mean reward for the ANYmalD (left) and Unitree G1 (right) velocity tracking task with MBPO-PPO. The policy is trained using estimated rewards computed from predicted observations by RWM. Ground truth rewards, visualized with solid lines, are reported by the simulator for evaluation purposes only.
Analysis of Model Error (Top Panels of Figure 5):
- MBPO-PPO shows a significant reduction in model error over the course of training, indicating that the world model becomes more accurate as the policy is refined and more data is collected.
- SHAC, in contrast, struggles with high and fluctuating model error throughout training. Its reliance on first-order gradients propagated through the world model proves problematic for discontinuous dynamics (such as legged locomotion with varying contact patterns), leading to inaccurate gradients and suboptimal policy updates, which results in chaotic robot behaviors.
- Dreamer effectively leverages its latent-space dynamics model, but its shorter planning horizons during training limit its ability to capture long-horizon dependencies in stochastic environments, leading to moderate compounding errors during policy learning.
Analysis of Policy Mean Reward (Bottom Panels of Figure 5):
- MBPO-PPO's predicted rewards (dashed lines) initially overshoot the ground truth rewards (solid lines), a common phenomenon where the policy exploits small inaccuracies or optimistic estimates in the learned model. As training progresses, the predicted rewards align more closely with the ground truth, demonstrating that the model's predictions remain accurate enough to guide effective learning.
- SHAC fails to converge, producing unstable behaviors and degrading both policy and model quality, consistent with its high model error.
- Dreamer shows partial convergence, achieving higher rewards than SHAC but significantly lagging behind MBPO-PPO.
Hardware Transfer:
The ultimate validation is zero-shot transfer to physical hardware. SHAC and Dreamer failed to produce deployable policies due to collapse or instability during training. However, the policy learned using MBPO-PPO (as shown in Figure 1, top row) demonstrates reliable and robust performance on ANYmal D and Unitree G1 hardware. It successfully tracks goal-conditioned velocity commands and maintains stability even under external disturbances and varying terrain conditions. This success is directly attributed to the high-quality trajectory predictions generated by RWM, which enable accurate and effective policy optimization. Videos on the project webpage further showcase this robustness.
Comparison with Model-Free Methods (Table 1):
While MBPO-PPO excels among MBRL methods, the paper acknowledges that it still falls short of well-tuned model-free RL methods trained on high-fidelity simulators. A comparison with a PPO-based method on a high-fidelity simulator is provided in Table 1.
The following are the results from Table 1 of the original paper:
| Method | RWM pretraining | MBPO-PPO | PPO |
|---|---|---|---|
| state transitions | 6M | — | 250M |
| total training time | 50 min | 5 min | 10 min |
| step inference time | − | 1 ms | 1 ms |
| real tracking reward | − | 0.90 ± 0.04 | 0.90 ± 0.03 |
Analysis of Table 1:
- state transitions: PPO (model-free) requires a massive 250 million state transitions, indicating its sample inefficiency. RWM pretraining requires 6 million state transitions, significantly fewer. MBPO-PPO itself does not consume additional environment transitions directly but leverages the pretrained world model, highlighting the sample-efficiency advantage of the model-based approach for learning the dynamics.
- total training time: RWM pretraining takes 50 minutes, and MBPO-PPO policy training takes 5 minutes (once the model is pretrained), compared with 10 minutes for PPO. While RWM pretraining adds an upfront cost, bringing the total to roughly 55 minutes, the policy-training stage itself is comparable to PPO and the pipeline requires drastically fewer real environment interactions (6M vs. 250M).
- step inference time: Both MBPO-PPO and PPO policies have comparable step inference times (1 ms), meaning they are equally fast for real-time control once trained.
- real tracking reward: Both MBPO-PPO and PPO achieve very similar real tracking rewards (0.90 ± 0.04 vs. 0.90 ± 0.03). This is a crucial finding: MBPO-PPO can match the performance of PPO trained on a high-fidelity simulator while being significantly more sample-efficient in terms of real environment interactions. The strength of MBRL lies in scenarios where accurate or efficient simulation for model-free RL is infeasible, making it valuable for real-world environments.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Dual-autoregressive Mechanism (Section A.4.1)
An ablation study was conducted to analyze the impact of the history horizon M and the forecast horizon N on the performance of RWM. The results are presented as heatmaps in Figure S8, showing the relative autoregressive prediction error (left) and the training time (right).
The following figure (Figure S8 from the original paper) shows the ablation study on history and forecast horizons:
Figure S8: Ablation study on the history horizon M and forecast horizon N in RWM. The heatmap on the left shows the relative autoregressive prediction error, with darker colors indicating higher errors. Models trained with larger history horizons exhibit lower errors, although the improvements plateau beyond a certain point. The forecast horizon has a significant impact, with longer horizons leading to better long-term prediction accuracy due to exposure to extended rollouts during training. The heatmap on the right illustrates training time, with darker colors representing longer durations. Increasing N significantly raises training time due to sequential computation, while shorter horizons (e.g., N = 1, teacher-forcing) enable faster training but result in poor prediction accuracy.
Analysis of Prediction Error (Left Heatmap):
- History horizon (M): Models trained with a longer history horizon consistently show lower prediction errors, confirming the importance of providing sufficient historical context for RWM to capture the complex underlying dynamics. However, the improvement from increasing M plateaus beyond a certain point, suggesting diminishing returns for excessively long histories.
- Forecast horizon (N): The forecast horizon plays a decisive role in improving long-term prediction accuracy. Increasing N during training leads to better performance in autoregressive rollouts, because a larger N forces the model to learn representations that are robust to compounding errors over extended prediction horizons, as it must rely on its own predictions for more steps.
Analysis of Training Time (Right Heatmap):
- Increasing N significantly raises training time. This is a direct consequence of the autoregressive nature of the training process: larger N values require more sequential computation during training, reducing parallelization.
- When N = 1 (i.e., teacher-forcing), training can be highly parallelized, resulting in minimal training time. However, as seen in the left heatmap, this setting leads to poor autoregressive performance because the model lacks exposure to long-horizon prediction scenarios during training and thus fails to handle compounding errors effectively.
Optimal Trade-off:
The study highlights a critical trade-off between prediction accuracy and training efficiency. An optimal balance is achieved with moderate values of M and N; for instance, a history horizon of M = 32 and a forecast horizon of N = 8 (the settings in Table S10) yield strong autoregressive performance within a manageable training time. These settings provide enough historical context while robustly training the model for long-term predictions.
6.2.2. Collision Handling and Model Pretraining (Section A.4.3)
The paper discusses how RWM handles collision events and the role of model pretraining.
- Collision Handling: During both pretraining and online fine-tuning, rollouts are terminated and the environment is reset if ground contact by the base is detected, signaling a failure. RWM is explicitly trained to predict these termination events through its privileged information prediction head, which allows the world model to learn about transitions that lead to unsafe situations. During policy optimization, MBPO-PPO treats these termination predictions as episode-ending events in imagination rollouts, which influences PPO's return computation and state values.
- Model Pretraining: RWM is pretrained using simulation data generated by suboptimal policies trained for similar tasks under varied dynamics. The policy is then learned from scratch purely in imagination, with RWM subsequently fine-tuned on a single-environment online dataset. Pretraining is deemed essential for two main reasons:
  - Limited online dataset: The online dataset is very small (mimicking real-world constraints). Training the world model from scratch on such limited data would lead to severe overfitting and long training times.
  - Immature policy failures: An immature policy would frequently cause the robot to fall, generating low-value transitions (mostly failures). Training the world model solely on this chaotic data would result in poor imagined rollouts and, consequently, poor policy updates.

  Pretraining stabilizes training and provides a robust initialization for online fine-tuning, especially in environments with challenging dynamics and frequent failures. Importantly, RWM pretraining does not require data from optimal policies, and it remains robust to domain shifts and injected noise (as shown in Figure 3). This warm-up phase is primarily necessary for locomotion tasks, due to their discontinuous dynamics and frequent environment terminations, but not for the manipulation experiments.
6.2.3. Visualization of Imagination Rollouts (Section A.4.2)
Figure S9 visualizes autoregressive imagination from RWM compared with ground-truth simulation across diverse robotic systems.
The following figure (Figure S9 from the original paper) shows autoregressive imagination of RWM and ground-truth simulation:
Figure S9: Autoregressive imagination of RWM and ground-truth simulation across diverse robotic systems. For each environment, the top row showcases the RWM autoregressively predicting future trajectories in imagination. The second row visualizes the ground truth evolution in simulation. The visualized coordinate and arrow markers denote the predicted and measured end-effector pose and base velocity, respectively.
Analysis:
For each environment (e.g., manipulation, quadruped, humanoid), the top row of Figure S9 displays RWM's autoregressive predictions of future trajectories in imagination, while the second row shows the corresponding ground truth evolution in simulation. The visualizations, including coordinate and arrow markers for end-effector pose and base velocity, confirm the high fidelity of RWM's predictions across varied tasks and robot types. This reinforces the claims about RWM's generality and accuracy in replicating complex dynamics, which is foundational for the subsequent policy optimization and sim-to-real transfer success.
6.3. Additional Discussion
6.3.1. Challenges in Real-World Online Learning (Section A.4.4)
The authors acknowledge that performing the policy training phase directly on real hardware would further demonstrate the advantages of RWM. However, several significant challenges currently prevent real-world online deployment:
- Safety and Collisions: During online learning, policies often exploit minor world-model errors, leading to overly optimistic behaviors that can result in collisions. In simulation, these failures provide corrective signals, but on real hardware they pose a significant risk to the robot. Experiments show that failures occur more than 20 times on average during online learning, which would be detrimental to physical systems.
- Recovery Policies: Fully automating online learning would require a recovery policy capable of safely resetting the robot to an initial state after a failure, which is particularly challenging for large platforms like ANYmal D or Unitree G1.
- Privileged Information: Privileged information (e.g., contact forces) used to fine-tune RWM must be either reliably measured or accurately estimated using onboard sensors, which may not always be available or precise enough.
- Error Exploitation Mitigation: To mitigate the policy's exploitation of model errors, uncertainty-aware world models could be explored, but this would require additional architectural modifications to RWM.

Due to these challenges, the current work approximates real-world constraints by using only a single simulation environment with domain shifts from the pretraining environments, reducing engineering effort while demonstrating feasibility. Future work will specifically address these issues, focusing on uncertainty-aware models and safer online adaptation strategies.
7. Conclusion & Reflections
7.1. Conclusion Summary
This work introduces Robotic World Model (RWM), a robust and scalable framework designed for learning accurate world models specifically tailored to complex robotic tasks. The core innovation lies in its dual-autoregressive mechanism and self-supervised training over long prediction horizons. This approach effectively tackles critical challenges in model-based reinforcement learning (MBRL), including compounding errors, partial observability, and stochastic dynamics, without needing domain-specific inductive biases. Extensive experiments demonstrate RWM's superior autoregressive prediction accuracy compared to state-of-the-art baselines like RSSM and transformer-based architectures across diverse robotic environments (manipulation, quadruped, humanoid locomotion).
Building upon RWM's high world model rollout fidelity, the authors propose MBPO-PPO, a policy optimization framework. Policies trained using MBPO-PPO exhibit superior performance in simulation and achieve seamless zero-shot transfer to physical hardware, as evidenced by successful deployment on ANYmal D and Unitree G1 robots. This work significantly advances MBRL by providing a generalizable, efficient, and scalable framework for learning and deploying world models, thereby paving the way for more adaptive, robust, and high-performing robotic systems in real-world applications.
7.2. Limitations & Future Work
The authors openly acknowledge several limitations of the current RWM framework and suggest directions for future research:
- Performance Gap with Model-Free RL: While MBPO-PPO surpasses existing MBRL methods in robustness and generalization, its performance still falls short of well-tuned model-free RL methods trained on high-fidelity simulators (as shown in Table 1). Model-free RL benefits from maturity and extensive optimization in settings with unlimited access to near-perfect simulators.
- Dependence on Pretraining: The world model is currently pretrained using simulation data before policy optimization. Training RWM from scratch is challenging because immature policies can exploit model inaccuracies during early exploration, leading to inefficiency and instability. This pretraining phase is particularly necessary for locomotion tasks due to their discontinuous dynamics and frequent environment terminations.
- Limited Online Fine-tuning: While RWM is fine-tuned with a single-environment online dataset, the need for additional real-world interaction to continually refine the world model highlights an area for further development.
- Challenges in Real-World Online Learning: Enabling safe and effective online learning directly on hardware remains a significant hurdle.
  - Safety Risks: Policies often exploit minor model errors, leading to optimistic behaviors and collisions, which are risky for physical hardware.
  - Recovery: Fully automated online learning would require robust recovery policies to reset the robot after failures, especially for large platforms.
  - Privileged Information Availability: Privileged information (e.g., contact forces) used for RWM fine-tuning might not be reliably measurable or estimable with onboard sensors in real-world scenarios.
  - Uncertainty Quantification: Incorporating safety constraints and robust uncertainty estimates into RWM would be critical for deployment in real-world, lifelong learning scenarios to mitigate error exploitation.

Future work will focus on addressing these safety concerns and practical challenges, including exploring uncertainty-aware models and developing safer online adaptation strategies.
7.3. Personal Insights & Critique
This paper presents a significant step forward in model-based reinforcement learning for real-world robotics. The emphasis on a generalizable world model without domain-specific inductive biases is particularly appealing, as it promises wider applicability and reduces the engineering effort typically associated with new robotic tasks. The dual-autoregressive mechanism with self-supervised training is a clever and effective way to tackle the long-standing problem of error accumulation in long-horizon predictions, which has historically hampered the practical utility of world models for tasks requiring extended foresight. The ability to achieve robust zero-shot sim-to-real transfer on complex legged and humanoid robots is a strong validation of this approach's practical relevance.
Insights and Transferability:
- Robustness through Self-Supervision: The core idea of training a model by feeding it its own predictions (autoregressive training) to reduce exposure bias is highly valuable. This principle could be applied to other sequential prediction tasks beyond world models, such as time-series forecasting or generative modeling, where maintaining long-term consistency is crucial.
- Hybrid Approach Strength: The MBPO-PPO framework effectively marries the sample efficiency of model-based methods with the robustness of model-free PPO updates. This hybrid strategy is often the sweet spot for real-world applications, enabling faster learning with limited real data while still benefiting from the stability of well-established model-free algorithms.
- Value of Privileged Information: Using privileged information as an auxiliary loss for the world model is a smart way to implicitly embed critical domain knowledge (such as contact states) without hand-designing the model architecture. This could be extended to other forms of auxiliary losses derived from sensors or expert knowledge.
Critique and Areas for Improvement:
- Reliance on Pretraining: While justified by current safety and data limitations, the reliance on pretraining with simulation data (even if suboptimal) means the system is not truly learning from scratch in the wild. Real lifelong learning would ideally involve continuous online adaptation and model refinement directly on hardware without such a prerequisite. The transition from pretraining to fine-tuning, and the potential domain shift between them, could be analyzed further.
- Safety in Online Learning: The authors correctly identify safety as the major hurdle for real-world online learning. The current solution of using a single simulation environment with domain shifts is a pragmatic compromise. However, developing robust uncertainty-aware world models that can quantify their confidence in predictions and trigger safe exploration strategies (e.g., slowing down, asking for human intervention) is essential for truly autonomous online learning on physical robots. Explicit safety layers or guard policies could also be integrated.
- Computational Cost for Transformers: The finding that transformer architectures struggle with GPU memory constraints under autoregressive training is an interesting limitation. Future work might explore more memory-efficient transformer variants or alternative architectures that combine the benefits of attention with the autoregressive training paradigm without prohibitive computational costs.
- Generalization to New Tasks/Objects: While RWM avoids domain-specific biases, its generalization to entirely new tasks or objects that significantly alter environmental dynamics would be a crucial next step. How well does it adapt to unseen kinematics or highly deformable objects, for example?

Overall, RWM represents a highly relevant and impactful contribution to model-based robotics. Its meticulous design for robustness and long-horizon accuracy, combined with successful hardware deployment, sets a strong foundation for future research in adaptive and efficient robotic systems.