Paper status: completed

A Survey: Learning Embodied Intelligence from Physical Simulators and World Models

Published: 07/02/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This survey examines the role of embodied intelligence in achieving AGI, focusing on how the integration of physical simulators and world models enhances robot autonomy and adaptability, providing new insights and challenges for the learning of embodied AI.

Abstract

The pursuit of artificial general intelligence (AGI) has placed embodied intelligence at the forefront of robotics research. Embodied intelligence focuses on agents capable of perceiving, reasoning, and acting within the physical world. Achieving robust embodied intelligence requires not only advanced perception and control, but also the ability to ground abstract cognition in real-world interactions. Two foundational technologies, physical simulators and world models, have emerged as critical enablers in this quest. Physical simulators provide controlled, high-fidelity environments for training and evaluating robotic agents, allowing safe and efficient development of complex behaviors. In contrast, world models empower robots with internal representations of their surroundings, enabling predictive planning and adaptive decision-making beyond direct sensory input. This survey systematically reviews recent advances in learning embodied AI through the integration of physical simulators and world models. We analyze their complementary roles in enhancing autonomy, adaptability, and generalization in intelligent robots, and discuss the interplay between external simulation and internal modeling in bridging the gap between simulated training and real-world deployment. By synthesizing current progress and identifying open challenges, this survey aims to provide a comprehensive perspective on the path toward more capable and generalizable embodied AI systems. We also maintain an active repository that contains up-to-date literature and open-source projects at https://github.com/NJU3DV-LoongGroup/Embodied-World-Models-Survey.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

A Survey: Learning Embodied Intelligence from Physical Simulators and World Models

1.2. Authors

Xiaoxiao Long, Qingrui Zhao, Kaiwen Zhang, Zihao Zhang, Dingrui Wang*, Yumeng Liu, Zhengjhu, Yi*, Shozheng Wang*, Xinzhe Wei, Wei Li, Wei Yin, Yao Yao, Jia Pan, Qiu Shen, Ruigang Yang, Xun Cao†, Qionghai Dai

1.3. Journal/Conference

This paper is a preprint published on arXiv. arXiv is a popular open-access repository for preprints in fields such as physics, mathematics, and computer science. While preprints have not undergone formal peer review, arXiv is widely used for rapid dissemination of research findings before or in parallel with formal publication.

1.4. Publication Year

2025

1.5. Abstract

The paper explores embodied intelligence as a core focus for achieving artificial general intelligence (AGI). It highlights that embodied intelligence requires agents capable of perceiving, reasoning, and acting within the physical world, necessitating advanced perception, control, and grounding of abstract cognition in real-world interactions. The survey identifies two critical enabling technologies: physical simulators and world models. Physical simulators offer controlled, high-fidelity environments for safe and efficient training and evaluation of robotic agents. World models provide robots with internal representations of their surroundings, facilitating predictive planning and adaptive decision-making beyond direct sensory input. The survey systematically reviews recent advancements in learning embodied AI by integrating these two technologies, analyzing their complementary roles in enhancing autonomy, adaptability, and generalization in intelligent robots. It discusses the interplay between external simulation and internal modeling to bridge the sim-to-real gap (the challenge of transferring behaviors learned in simulation to the real world). By synthesizing current progress and identifying open challenges, the survey aims to provide a comprehensive perspective on the path toward more capable and generalizable embodied AI systems. An active repository of literature and open-source projects is maintained at https://github.com/NJU3DV-LoongGroup/Embodied-World-Models-Survey.

https://arxiv.org/abs/2507.00917 (Preprint) PDF Link: https://arxiv.org/pdf/2507.00917v3.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem the paper addresses is how to achieve robust embodied intelligence in robots, which is seen as a crucial step towards artificial general intelligence (AGI). This is important because AGI requires grounding abstract reasoning in real-world understanding and action, moving beyond disembodied intelligence (systems operating purely on symbolic or digital data). Intelligent robots, by acting and perceiving within a physical body, can robustly learn from experience and adapt to dynamic, uncertain environments.

The paper identifies specific challenges in this pursuit:

  1. Safety and Cost of Real-world Experimentation: Training and testing complex robotic behaviors directly in the real world can be expensive, time-consuming, and risky.

  2. Data Bottlenecks: Intelligent robots require vast amounts of high-quality interaction data, which is hard to collect in the real world due to cost, safety, and repeatability issues.

  3. Generalization: Robots need to adapt their behavior and cognition continuously based on feedback from the physical world, requiring robust learning that can generalize to unforeseen scenarios.

  4. Lack of Systematic Evaluation: A comprehensive grading system for robot intelligence that integrates cognition, autonomous behavior, and social interaction is still lacking, hindering technology roadmap clarification and safety assessment.

    The paper's entry point or innovative idea is to systematically explore the synergistic relationship between physical simulators and world models as critical enablers for developing embodied intelligence. Simulators provide a safe, controlled external environment for training, while world models create internal representations for adaptive decision-making, jointly addressing the challenges of data scarcity, safety, and generalization.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  • Proposed a Five-Level Grading Standard for Intelligent Robots: It introduces a comprehensive five-level classification framework (IR-L0 to IR-L4) for evaluating humanoid robot autonomy. This standard assesses robots across four key dimensions: autonomy, task handling ability, environmental adaptability, and societal cognition ability. This framework helps clarify the technological development roadmap and provides guidance for robot regulation and safety assessment.

  • Systematic Review of Robot Learning Techniques: It analyzes recent advancements in intelligent robotics across various domains, including:

    • Legged Locomotion: Covering bipedal walking, unstructured environment adaptation, high dynamic movements, and fall recovery.
    • Manipulation: Discussing unimanual (gripper-based, dexterous hand), bimanual, and whole-body manipulation.
    • Human-Robot Interaction (HRI): Focusing on cognitive collaboration, physical reliability, and social embeddedness.
  • Comprehensive Analysis of Current Physical Simulators: The survey provides a detailed comparative analysis of mainstream simulators (e.g., Webots, Gazebo, MuJoCo, Isaac Gym/Sim/Lab, SAPIEN, Genesis, Newton). This analysis covers their physical simulation capabilities (e.g., suction, deformable objects, fluid dynamics, differentiable physics), rendering quality (e.g., ray tracing, physically-based rendering, parallel rendering), and sensor/joint component support.

  • Review of Recent Advancements in World Models: It revisits the main architectures of world models (e.g., Recurrent State Space Models, Joint-Embedding Predictive Architectures, Transformer-based, Autoregressive, Diffusion-based) and their potential roles. It discusses how world models serve as controllable neural simulators, dynamic models for model-based reinforcement learning (MBRL), and reward models for embodied intelligence. Furthermore, it comprehensively discusses recent world models designed for specific applications like autonomous driving and articulated robots.

    The key conclusions and findings indicate that the synergistic integration of physical simulators and world models is crucial for bridging the sim-to-real gap and fostering autonomous, adaptable, and generalizable embodied AI systems. This combination enables:

  • Efficient Data Generation: Simulators allow for rapid, cost-effective, and safe generation of large volumes of synthetic data with automated annotation.

  • Enhanced Learning and Planning: World models enable agents to learn internal representations of environment dynamics, simulate hypothetical futures, and plan actions through imagined experiences, significantly improving sample efficiency in reinforcement learning.

  • Robust Generalization: High-fidelity simulation and advanced world models help prevent policy overfitting and enhance algorithm generalization by capturing complex physical phenomena and uncertainties.

  • Progress Towards AGI: These technologies lay the foundation for developing IR-L4 fully autonomous systems capable of seamless integration into human society.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand this paper, a reader should be familiar with the following core concepts:

  • Artificial Intelligence (AI): A broad field of computer science that gives computers the ability to simulate human intelligence. This includes learning, problem-solving, perception, and language understanding.
  • Robotics: The interdisciplinary branch of engineering and science that deals with the design, construction, operation, and use of robots. Robots are machines that can carry out a series of actions autonomously or semi-autonomously.
  • Artificial General Intelligence (AGI): A theoretical form of AI that would have the ability to understand, learn, and apply intelligence to any intellectual task that a human being can. It stands in contrast to narrow AI, which is designed to perform a specific task (e.g., playing chess or recommending products).
  • Embodied Intelligence: A paradigm in AI and robotics that emphasizes the importance of a physical body and its interaction with the environment for the development of intelligence. Unlike disembodied intelligence (which operates on symbolic or digital data), embodied intelligence suggests that perception, action, and cognition are deeply intertwined with the physical form and sensory-motor experiences of an agent.
  • Physical Simulators: Software environments that mimic the physical properties and dynamics of the real world, allowing virtual robots to operate, interact, and learn. They provide a controlled, safe, and reproducible platform for designing, testing, and refining robotic algorithms without the costs and risks associated with real-world experiments. Examples include Gazebo and MuJoCo.
  • World Models: Internal, learned representations within an AI agent that capture the dynamics of its environment. These models allow an agent to predict future states given current observations and actions, enabling cognitive processes like planning, imagination, and adaptive decision-making without needing to constantly interact with the real world. They are often generative, meaning they can synthesize future sensory inputs (e.g., video frames).
  • Sim-to-Real Gap: The discrepancy between behaviors learned in a simulated environment and their performance when transferred to a physical robot in the real world. This gap arises due to differences in physics, sensor noise, latency, and other factors that are difficult to perfectly model in simulation. Bridging this gap is a major challenge in robotics.
  • Reinforcement Learning (RL): A type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent learns through trial and error, receiving feedback in the form of rewards or penalties for its actions.
  • Imitation Learning (IL): A machine learning paradigm where an agent learns to perform a task by observing and mimicking demonstrations provided by an expert (e.g., a human). It bypasses the need for explicit reward functions, making it suitable for complex tasks that are hard to specify programmatically.
  • Model Predictive Control (MPC): An advanced control strategy that uses a dynamic model of a system to predict its future behavior over a finite time horizon. At each time step, an optimization problem is solved to compute a sequence of control actions that minimize a cost function while satisfying constraints. Only the first action in the sequence is executed, and the process is repeated.
  • Whole-Body Control (WBC): A comprehensive control framework for complex robots (e.g., humanoids) that coordinates all degrees of freedom (joints and limbs) simultaneously to achieve multiple tasks while satisfying various physical constraints (e.g., balance, joint limits, contact forces).
  • Visual-Language-Action (VLA) Models: Cross-modal AI frameworks that integrate visual perception, natural language understanding, and action generation for robots. They leverage large language models (LLMs) and visual models (VMs) to interpret human instructions and directly map them to robot actions, enabling more intuitive and generalizable control.
  • Foundation Models (FMs): Large-scale machine learning models (e.g., Large Language Models - LLMs, Vision Models - VMs, Vision-Language Models - VLMs) that are pre-trained on vast amounts of diverse, internet-scale data. They possess powerful capabilities in semantic understanding, world knowledge integration, and cross-modal reasoning, making them adaptable for a wide range of downstream tasks through fine-tuning or zero-shot inference.
  • Recurrent State Space Models (RSSMs): A class of world models that use a compact latent (hidden) space to represent the environment's state and a recurrent neural network (RNN) to model its temporal dynamics. They predict future states and observations in this latent space, enabling long-horizon planning.
  • Joint-Embedding Predictive Architectures (JEPAs): World models that learn abstract representations by predicting missing parts of data (e.g., masked image regions or video segments) in a purely self-supervised manner, without requiring explicit generative decoders. They focus on learning rich, semantic representations.
  • Transformer-based Models: Neural network architectures that utilize attention mechanisms (specifically self-attention) to weigh the importance of different parts of input sequences. They excel at capturing long-range dependencies and parallelism, making them powerful for sequence modeling tasks in world models.
    • Attention Mechanism: The core idea of attention is to allow a neural network to focus on specific parts of its input sequence when making predictions or processing information. For example, in natural language processing, when translating a sentence, the model might pay more attention to certain words in the source sentence when generating a word in the target sentence. The Scaled Dot-Product Attention is a common form: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where (a minimal NumPy sketch of this operation follows this list):
      • $Q$ (Query) is a matrix representing the query vectors.
      • $K$ (Key) is a matrix representing the key vectors.
      • $V$ (Value) is a matrix representing the value vectors.
      • $d_k$ is the dimension of the key vectors, used for scaling so the softmax does not saturate and produce vanishing gradients.
      • $QK^T$ computes the dot products between queries and keys, indicating similarity.
      • $\mathrm{softmax}$ normalizes the scores to obtain attention weights.
      • The attention weights are then multiplied by the value matrix $V$ to produce the output, a weighted sum of the values.
  • Autoregressive Generative Models: Models that generate sequences (e.g., video frames) one element at a time, where each new element is conditioned on all previously generated elements. They typically use Transformers to model the sequential dependencies.
  • Diffusion Models: Generative models that learn to create data by reversing a gradual noise process. They start with random noise and iteratively denoise it to produce a desired output (e.g., a high-fidelity image or video). They are known for generating high-quality and diverse samples.
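As a concrete illustration of the attention formula above, the following is a minimal NumPy sketch of scaled dot-product attention; the array shapes and random inputs are illustrative only and are not taken from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal sketch of scaled dot-product attention.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    Returns the attended output (n_queries, d_v) and the attention weights.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # similarity between queries and keys
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy usage: 2 queries attending over 3 key/value pairs.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 8))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)  # (2, 8) (2, 3)
```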

3.2. Previous Works

The paper extensively references numerous previous works, which can be broadly categorized:

  • Robotics & Control:
    • Early Humanoid Control: Model Predictive Control (MPC) and Whole-Body Control (WBC) have foundational works, such as Tom Erez et al. [30] for real-time MPC on humanoids, and Oussama Khatib's operational space formulation [33] for WBC. These methods focused on explicit programming and dynamic modeling.
    • Reinforcement Learning (RL) in Robotics: The application of RL to humanoid robotics dates back to the late 1990s and early 2000s, with Morimoto and Doya [43] demonstrating a simulated two-joint robot learning to stand up. More recently, works like DeepLoco [44] explored deep RL for bipedal tasks, and Xie et al. [47] achieved robust dynamic walking for physical bipedal robots like Cassie. Joonho Lee et al. [86] demonstrated the first successful real-world RL application for legged locomotion in outdoor environments.
    • Imitation Learning (IL) in Robotics: IL allows robots to learn from human demonstrations, as seen in works using human motion capture data [51][53] to achieve natural robot gaits. Challenges include data cost and generalization [49].
    • Visual-Language-Action (VLA) Models: Google DeepMind's RT-2 [65] pioneered this paradigm by discretizing robot control into language-like tokens. Subsequent models [4][66][67] have further advanced VLA in robotics.
    • Human-Robot Interaction (HRI): Studies such as Lemaignan et al. [182] explored cognitive skills for HRI, while work on physical reliability [192][199] used motion planning algorithms like PRM and RRT. Social embeddedness research [219][223] focuses on understanding social norms.
  • Physical Simulators:
    • Traditional Simulators: Webots [239] (1998) and Gazebo [15] (2002) are long-standing open-source platforms. MuJoCo [16] (2012) and PyBullet [240] (2017) are widely used physics engines for contact-rich dynamics and RL. CoppeliaSim [241] (around 2010) is a general-purpose robot simulation software.
    • GPU-Accelerated & High-Fidelity Simulators: The NVIDIA Isaac series, including Isaac Gym [242] (2021) for parallel GPU-accelerated physics, Isaac Sim [243] for digital twin simulation with Omniverse [244] and RTX-based ray tracing, and Isaac Lab [246] as an RL framework built on Isaac Sim. SAPIEN [247] (2020) is designed for part-level interactive objects, leading to benchmarks like ManiSkill [248]. Genesis [250] (2024) is a general-purpose platform unifying various physics solvers. NVIDIA Newton [251] (2025) is an emerging open-source physics engine for high-fidelity simulation.
  • World Models:
    • Pioneering Work: David Ha and Jürgen Schmidhuber's "World Models" [18] (2018) demonstrated learning compact environmental representations for internal planning.
    • Recurrent State Space Models (RSSMs): The Dreamer series [267][271] (starting 2018) popularized RSSMs for learning latent dynamics and enabling MBRL. DreamerV3 [270] achieved state-of-the-art performance across diverse visual control tasks.
    • Joint-Embedding Predictive Architectures (JEPAs): Proposed by Yann LeCun [272], I-JEPA [273] and V-JEPA [274][275] learn abstract representations by predicting masked content in latent space.
    • Generative World Models (Video Generation): Recent advancements in video generation, like Sora [263] (2024) and Kling [264] (2025), have emphasized the potential of video generation models as world simulators. Early video generation frameworks include CogVideo [279] and VideoPoet [281].
    • Diffusion-based Models: Imagen Video [285], VideoLDM [286], SVD [287] paved the way for high-fidelity video synthesis. DriveDreamer [289], Vista [290], and GAIA-2 [291] apply diffusion models to generate driving or 3D scenes.
    • Domain-Specific World Models: Wayve's GAIA series [282][291] for autonomous driving, DOME [297] for 3D occupancy prediction, Cosmos [294] as a unified platform for foundation video models.
    • World Models as Dynamic/Reward Models: PlaNet [267] and the Dreamer series [393][394] use world models as dynamic models for planning in MBRL. VIPER [305] uses video prediction models as reward signals.

3.3. Technological Evolution

The field of embodied AI has seen a profound evolution, mirroring advancements in AI and computing:

  1. Early Robotics (Pre-2000s): Focused on traditional control methods (e.g., PID control, inverse kinematics). Robots were largely program-driven and operated in structured environments (IR-L0). Simulators like Webots and Gazebo emerged to provide basic virtual testing grounds.

  2. Model-Based Control (2000s-early 2010s): Introduction of MPC and WBC allowed for more dynamic and complex behaviors, but still relied on explicit models of robot dynamics and environments. Robots began to show limited reactivity (IR-L1). MuJoCo provided a more accurate physics engine for articulated systems.

  3. Learning-Based Robotics (Mid-2010s-Present): The deep learning revolution brought Reinforcement Learning (RL) and Imitation Learning (IL) to the forefront. Robots started learning behaviors from data, reducing the need for explicit programming and enhancing adaptability (IR-L2 to IR-L3). This period saw increased demand for simulators capable of parallelized training (Isaac Gym).

  4. Emergence of World Models (Late 2010s-Present): Inspired by human cognition, world models began to be integrated, allowing agents to learn internal representations of environment dynamics. This enabled model-based RL, planning in latent spaces, and improved sample efficiency. The Dreamer series played a pivotal role here.

  5. Foundation Models & Generative AI (Early 2020s-Present): The advent of large language models (LLMs) and vision-language models (VLMs) (i.e., Foundation Models) profoundly influenced robotics, leading to Visual-Language-Action (VLA) models and generative world models. These models leverage vast internet data for semantic understanding, task planning, and high-fidelity scene generation, pushing robots towards humanoid cognition and collaboration (IR-L3) and full autonomy (IR-L4). GPU-accelerated simulators like Isaac Sim and Genesis became crucial for training these data-intensive models.

    This paper's work fits within this timeline by synthesizing the advancements in physical simulators and world models, showing how their combined evolution is driving the current progress in embodied AI towards AGI.

3.4. Differentiation Analysis

This survey distinguishes itself from existing literature by providing a comprehensive examination of the synergistic relationship between physical simulators and world models in advancing embodied intelligence.

  • Previous Surveys: The paper notes that prior surveys typically focused on individual components. For example:

    • Robotics simulators [19][21]
    • World models [22][24]
  • This Paper's Innovation:

    • Bridging Two Domains: Instead of treating simulators and world models as separate entities, this survey explicitly analyzes their complementary roles and the interplay between external simulation and internal modeling. It highlights how simulators provide the external training ground, while world models offer the internal cognitive framework.

    • Unified Framework: It presents a unified view of how these two technologies enhance autonomy, adaptability, and generalization in intelligent robots, particularly in bridging the sim-to-real gap.

    • Structured Evaluation: The proposal of a five-level grading standard (IR-L0 to IR-L4) for humanoid robot autonomy offers a new framework for assessing and guiding development, which is currently lacking in comprehensive integration of intelligent cognition and autonomous behavior.

    • Application-Specific Focus: The survey delves into the specific applications and challenges of world models in critical domains like autonomous driving and articulated robots, providing a nuanced understanding of their practical implications.

      By synthesizing these components, the survey offers a more holistic perspective on the path toward more capable and generalizable embodied AI systems.

4. Methodology

This survey primarily presents a comprehensive review and classification framework rather than a novel algorithmic methodology. The "methodology" of the paper is its systematic approach to analyzing the state of the art in embodied intelligence through the lens of physical simulators and world models. This includes a proposed grading system, a review of robotic techniques, a comparative analysis of simulators, and an exploration of world model architectures and applications.

4.1. Principles

The core idea is to establish a systematic understanding of how physical simulators and world models contribute to embodied intelligence, individually and synergistically. The theoretical basis is that embodied AI requires both external (simulated) environments for safe and scalable training and internal (world model) representations for adaptive, intelligent behavior. The intuition is that by combining these two, robots can learn complex skills and generalize to novel situations more effectively than relying on either alone, thereby bridging the sim-to-real gap.

4.2. Core Methodology In-depth (Layer by Layer)

The paper's methodology unfolds through several analytical layers:

4.2.1. Levels of Intelligent Robot (Section 2)

The paper proposes a capability grading model for intelligent robots, systematically outlining five progressive levels (IR-L0 to IR-L4). This classification aims to provide a unified framework to assess and guide the development of intelligent robots.

4.2.1.1. Level Criteria

The classification is based on:

  • The robot's ability to complete tasks independently (from human control to full autonomy).
  • The difficulty of tasks the robot can handle (from simple repetitive labor to innovative problem-solving).
  • The robot's ability to work in dynamic or extreme environments.
  • The robot's capacity to understand, interact with, and respond to social situations within human society.

4.2.1.2. Level Factors

The intelligent level of robots is graded based on the following five factors:

  • Autonomy: The robot's ability to autonomously make decisions across various tasks.

  • Task Handling Ability: The complexity of the tasks the robot can perform.

  • Environmental Adaptability: The robot's performance in different environments.

  • Societal Cognition Ability: The level of intelligence exhibited by robots in social scenarios.

    The relationship between graded levels and level factors is summarized in Table 1.

The following are the results from Table 1 of the original paper:

| Level | Autonomy | Task Handling Ability | Environmental Adaptability | Societal Cognition Ability |
|-------|----------|-----------------------|----------------------------|----------------------------|
| IR-L0 | Human Control | Basic Tasks | Controlled Only | No Social Cognition |
| IR-L1 | Human Supervised | Complex Navigation | Predictable Environments | Basic Recognition |
| IR-L2 | Human Assisted | Dynamic Collaboration | Adaptive Learning | Simple Interaction |
| IR-L3 | Conditional Autonomy | Multitasking | Dynamic Adaptation | Emotional Intelligence |
| IR-L4 | Full Autonomy | Innovation | Universal Flexibility | Advanced Social Intelligence |

4.2.1.3. Classification Levels

The paper defines five discrete levels:

  • IR-L0: Basic Execution Level:

    • Characteristics: Completely non-intelligent, program-driven, focused on repetitive, mechanized, deterministic tasks (e.g., industrial welding, fixed-path material handling). Relies entirely on predefined instructions or teleoperation. Low perception - high execution.
    • Technical Requirements:
      • Hardware: High-precision servomotors, rigid mechanical structures, PLC/MCU-based motion controllers.
      • Perception: Extremely limited (limit switches, encoders).
      • Control Algorithms: Predefined scripts, action sequences, teleoperation (no real-time feedback).
      • Human-Robot Interaction: None, or simple buttons/teleoperation.
  • IR-L1: Programmatic Response Level:

    • Characteristics: Limited rule-based reactive capabilities, executes predefined task sequences (e.g., cleaning/reception robots). Uses fundamental sensors to trigger specific behaviors. Operates only in closed-task environments with clear rules. Limited perception—limited execution.
    • Technical Requirements:
      • Hardware: Basic sensors (infrared, ultrasonic, pressure), moderately enhanced processors.
      • Perception: Detection of obstacles, boundaries, simple human movements.
      • Control Algorithms: Rule engines, finite state machines (FSM), basic SLAM (Simultaneous Localization and Mapping) or random walk.
      • Human-Robot Interaction: Basic voice and touch interfaces for simple command-response.
      • Software Architecture: Embedded real-time operating systems with elementary task scheduling.
  • IR-L2: Environmental Awareness and Adaptive Response Level:

    • Characteristics: Preliminary environmental awareness and autonomous capabilities, responsive to environmental changes, transitions between multiple task modes (e.g., service robot delivering water or navigating while avoiding obstacles). Human supervision is still essential. Demonstrates greater execution flexibility and "contextual understanding."
    • Technical Requirements:
      • Hardware: Multimodal sensor arrays (cameras, LiDAR, microphone arrays), enhanced computational resources.
      • Perception: Visual processing, auditory recognition, spatial localization, basic object identification, environmental mapping.
      • Control Algorithms: Finite state machines, behavior trees, SLAM, path planning, obstacle avoidance.
      • Human-Robot Interaction: Speech recognition and synthesis for basic command comprehension/execution.
      • Software Architecture: Modular design for parallel task execution with preliminary priority management.
  • IR-L3: Humanoid Cognition and Collaboration Level:

    • Characteristics: Autonomous decision-making in complex, dynamic environments, sophisticated multimodal HRI. Infers user intent, adapts behavior, operates within ethical constraints (e.g., eldercare robot detecting emotional state and responding appropriately).
    • Technical Requirements:
      • Hardware: High-performance computing platforms, comprehensive multimodal sensor suites (depth cameras, electromyography, force-sensing arrays).
      • Perception: Multimodal fusion (vision, speech, tactile), affective computing for emotion recognition, dynamic user modeling.
      • Control Algorithms: Deep learning architectures (CNNs, Transformers) for perception/language understanding; Reinforcement Learning for adaptive policy optimization; planning and reasoning modules.
      • Human-Robot Interaction: Multi-turn natural language dialogue, facial expression recognition, foundational empathy and emotion regulation.
      • Software Architecture: Service-oriented, distributed frameworks for task decomposition/collaboration; integrated learning/adaptation mechanisms.
      • Safety and Ethics: Embedded ethical governance systems.
  • IR-L4: Fully Autonomous Level:

    • Characteristics: Pinnacle of intelligent robotics; complete autonomy in perception, decision-making, and execution in any environment without human intervention. Possesses self-evolving ethical reasoning, advanced cognition, empathy, and long-term adaptive learning. Engages in sophisticated social interactions (multi-turn natural language, emotional understanding, cultural adaptation, multi-agent collaboration), comparable to the robots depicted in science-fiction films.
    • Technical Requirements:
      • Hardware: Highly biomimetic structures (full-body, multi-degree-of-freedom articulation), distributed high-performance computing platforms.
      • Perception: Omnidirectional, multi-scale, multimodal sensing systems; real-time environment modeling, intent inference.
      • Control Algorithms: General Artificial Intelligence (AGI) frameworks integrating meta-learning, generative AI, embodied intelligence; autonomous task generation, advanced reasoning.
      • Human-Robot Interaction: Natural language understanding and generation, complex social context adaptation, empathy, ethical deliberation.
      • Software Architecture: Cloud-edge-client collaborative systems; distributed agent architectures for self-evolution/knowledge transfer.
      • Safety and Ethics: Embedded dynamic ethical decision systems for morally sound choices in dilemmas.

4.2.2. Robotic Mobility, Dexterity, and Interaction (Section 3)

This section reviews current progress in intelligent robotic tasks, covering fundamental technical approaches and advancements in locomotion, manipulation, and human-robot interaction.

  • Model Predictive Control (MPC): Optimization-based approach that predicts future system behavior using a dynamic model and computes control actions by solving an optimization problem at each time step, while handling constraints on inputs and states (a minimal receding-horizon sketch follows this list).
  • Whole-Body Control (WBC): Coordinates all joints and limbs simultaneously, formulating motion and force objectives as prioritized tasks solved via optimization or hierarchical control.
  • Reinforcement Learning (RL): Agent learns to perform tasks by interacting with the environment and receiving rewards/penalties, discovering optimal actions through trial and error.
  • Imitation Learning (IL): Robots learn by observing and mimicking demonstrations (human or other agents), bypassing explicit programming or reward functions. Faces challenges like data cost and generalization.
  • Visual-Language-Action (VLA) Models: Cross-modal AI framework integrating visual perception, language understanding, and action generation. Leverages Large Language Models (LLMs) for reasoning, mapping natural language instructions to physical robot actions (e.g., RT-2).
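To make the receding-horizon idea in the MPC entry above concrete, the sketch below implements a simple sampling-based (random-shooting) variant of MPC. The `dynamics` and `cost` arguments are hypothetical placeholders for a learned or analytical model; practical MPC controllers usually solve a constrained optimization (e.g., a QP) at each step rather than sampling, but the plan-execute-replan structure is the same.

```python
import numpy as np

def random_shooting_mpc(state, dynamics, cost, horizon=10, n_candidates=256, action_dim=2):
    """Sampling-based MPC sketch: evaluate random action sequences with a
    dynamics model, keep the best sequence, and execute only its first action.
    `dynamics(s, a) -> s_next` and `cost(s, a) -> float` are placeholders."""
    best_cost, best_first_action = np.inf, None
    for _ in range(n_candidates):
        seq = np.random.uniform(-1.0, 1.0, size=(horizon, action_dim))
        s, total = state, 0.0
        for a in seq:                      # roll the candidate plan forward
            total += cost(s, a)
            s = dynamics(s, a)
        if total < best_cost:
            best_cost, best_first_action = total, seq[0]
    return best_first_action               # receding horizon: replan at the next step
```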

4.2.2.2. Robotic Locomotion

Focuses on natural movement patterns (walking, running, jumping) and dynamic adaptation.

  • Legged Locomotion:

    • Unstructured Environment Adaptation: Ability to maintain stable walking in complex, unknown, or dynamic environments (e.g., rugged terrain, stairs). Early methods used position-controlled robots [57][58][75]. Modern robots use force-controlled joints for better compliance [77][78], enabling more sophisticated algorithms [59][60]. Learning-based methods (e.g., RL with domain randomization [61]; see the sketch after this list) and exteroceptive sensing (depth cameras, LiDAR for height maps [63]) have significantly enhanced adaptability.
    • High Dynamic Movements: Achieving stability and agility in high-speed, dynamic movements (running, jumping). Early studies used simplified dynamic models (SLIP, LIPM, SRBM) [79][80]. RL-based methods [81][82] and Imitation Learning [84] have shown promising results for complex dynamic behaviors.
  • Fall Protection and Recovery: Strategies to reduce damage during falls and efficiently recover to a standing posture.

    • Model-based Methods: Inspired by human biomechanics, using optimization control to generate damage-reducing motion trajectories and recovery movements [97][98][99].
    • Learning-based Methods: RL and IL offer insensitivity to high-precision models and strong generalization, training robots to recover from falls (e.g., HiFAR [102], HoST [101]).
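As referenced in the legged-locomotion entry above, domain randomization perturbs the simulator's physical parameters between training episodes so the learned policy does not overfit to a single dynamics model. The sketch below is a minimal illustration; the `sim` object and its attribute names are hypothetical, since each simulator exposes its own parameter interface.

```python
import numpy as np

def randomize_physics(sim, rng):
    """Domain-randomization sketch: before each training episode, perturb the
    simulator's physical parameters. `sim` is a placeholder object whose
    attributes stand in for whatever interface a concrete simulator exposes."""
    sim.friction = rng.uniform(0.4, 1.2)          # ground friction coefficient
    sim.base_mass_scale = rng.uniform(0.8, 1.2)   # +/-20% mass/payload error
    sim.motor_strength = rng.uniform(0.9, 1.1)    # actuator-gain mismatch
    sim.latency_steps = rng.integers(0, 3)        # control/observation delay
    sim.push_force = rng.uniform(0.0, 50.0)       # random external perturbation (N)

# Usage inside an RL training loop (hypothetical `sim` and `run_episode`):
# rng = np.random.default_rng(0)
# for episode in range(num_episodes):
#     randomize_physics(sim, rng)
#     run_episode(policy, sim)
```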

4.2.2.3. Robotic Manipulation

Covers tasks from picking objects to complex assembly.

  • Unimanual Manipulation Task: Using a single end effector (gripper or dexterous hand).

    • Gripper-based manipulation: Common for grasping, placing, tool use. Traditional methods [105][106] struggled with unstructured environments. Learning-based approaches using CNNs for 6D pose estimation [107], affordance learning [109], Imitation Learning with Neural Descriptor Fields (NDFs) [111], Diffusion Policy [3], and Foundation Models like RT2 [112] have significantly improved capabilities.
    • Dexterous hand manipulation: Aims for human-like versatility and precision using multi-fingered hands. Early work focused on hardware designs [121][122] and theoretical foundations [124][125]. Learning-based methods (two-stage: pose generation then control [126][141]; end-to-end: direct trajectory modeling via RL or IL [142][152]) have become mainstream, often using sim-to-real transfer.
  • Bimanual Manipulation Task: Coordinated use of two arms for complex operations (cooperative transport, assembly, handling deformable objects). Challenges include high-dimensional state-action spaces and inter-arm collisions.

    • Early research used inductive biases or structural decompositions (e.g., BUDS [156], SIMPLe [157]).
    • End-to-end approaches with large-scale data collection and Imitation Learning (e.g., ALOHA series [49][153]) have shown strong generalization.
  • Whole-Body Manipulation Control: Interacting with objects using the entire robot body (dual arms, torso, base).

    • Leverages large pre-trained models (LLMs, VLMs) for semantic understanding (e.g., TidyBot [164], MOO [165]).
    • Visual demonstrations guide learning (e.g., OKAMI [167], iDP3 [168]).
    • Reinforcement Learning Sim-to-Real approaches (e.g., OmniH20 [96]) and Transformer-based low-level control (HumanPlus [6]) are key.
  • Foundation Models in Humanoid Robot Manipulation:

    • Hierarchical Approach: Pretrained language or vision-language foundation models serve as high-level task planning and reasoning engines, passing sub-goals to low-level action policies (e.g., Figure AI's Helix [174], NVIDIA's GR00T N1 [175]).
    • End-to-End Approach: Directly incorporates robot operation data into foundation models to construct Vision-Language-Action (VLA) models [4][68] (e.g., Google DeepMind's RT series [112][177]).

4.2.2.4. Human-Robot Interaction (HRI)

Enabling robots to understand and respond to human needs and emotions.

  • Cognitive Collaboration: Bidirectional cognitive alignment between robots and humans, understanding explicit instructions and implicit intentions. Relies on complex cognitive architectures and multimodal information processing (e.g., Lemaignan et al. [182], multimodal intention learning [183]). LLMs are used for semantic understanding in goal-oriented navigation [186][190].

  • Physical Reliability: Coordinated physical actions to ensure safety and efficiency. Relies on motion planning (sampling-based like PRM/RRT [193][194], optimization-based like CHOMP [200]) and control strategies (impedance/admittance control [208]). Imitation Learning and Reinforcement Learning with large-scale generative datasets from simulation are advancing this [217][191].

  • Social Embeddedness: Ability to recognize and adapt to social norms, cultural expectations, and group dynamics. Involves social space understanding [219][220] and behavior understanding (linguistic and non-linguistic cues) [224][230].

    The following figure (Figure 2 from the original paper) illustrates the different levels of intelligent robots and their relationship with physical simulators and world models:

    Figure 2: The five levels of intelligent robots and their relationship with physical simulators and world models. The left side lists the five levels, from basic execution to full autonomy; the right side covers robotic mobility, dexterity, and interaction.

4.2.3. General Physical Simulators (Section 4)

This section details the role of simulators in addressing data bottlenecks and the sim-to-real gap, followed by a comparative analysis of mainstream platforms.

4.2.3.1. Mainstream Simulators

The paper introduces several key simulators:

  • Webots [239]: Integrated framework for robot modeling, programming, simulation (open-sourced 2018). Multi-language APIs, cross-platform. Lacks deformable bodies, fluid dynamics, advanced physics.

  • Gazebo [15]: Widely adopted open-source simulator (2002), extensible, integrates with ROS. Modular plugin system. Lacks suction, deformable objects, fluid dynamics.

  • MuJoCo [16]: Physics engine for contact-rich dynamics in articulated systems (2012, acquired by Google DeepMind 2021). High-precision physics, optimized generalized-coordinate formulation. Excels in contact dynamics and RL. Limited rendering, no fluid/DEM/LiDAR simulation.

  • PyBullet [240]: Python interface for Bullet physics engine (2017). Lightweight, easy-to-integrate, open-source. Slightly less fidelity than some mainstream simulators.

  • CoppeliaSim [241]: General-purpose robot simulation (around 2010, formerly V-REP). Distributed control architecture, supports various middleware. Educational edition is open-source.

  • NVIDIA Isaac Series:

    • Isaac Gym [242]: Pioneered GPU-accelerated physics simulation (2021), parallel training of thousands of environments. Built on PhysX. Limited rendering fidelity, no ray tracing/fluid/LiDAR.
    • Isaac Sim [243]: Full-featured digital twin simulator, integrates Omniverse [244]. PhysX 5, RTX-based real-time ray tracing, high-fidelity LiDAR simulation. Supports USD (Universal Scene Description) [245].
    • Isaac Lab [246]: Modular RL framework on Isaac Sim. Tiled rendering for multi-camera inputs. Supports IL and RL.
  • SAPIEN [247]: Simulation platform for physically realistic modeling of complex, part-level interactive objects (2020). Led to PartNet-Mobility dataset and ManiSkill benchmarks [248][249]. Lacks soft-body, fluid dynamics, ray tracing, LiDAR, GPS, ROS integration.

  • Genesis [250]: General-purpose physical simulation platform (2024). Unifies rigid body, MPM, SPH, FEM, PBD, fluid solvers. Generative data engine (natural language prompts). Differentiable physics. No LiDAR/GPS/ROS.

  • NVIDIA Newton [251]: Open-source physics engine (2025). Built on NVIDIA Warp for GPU acceleration. Differentiable physics. Compatible with MuJoCo Playground, Isaac Lab. OpenUSD-based scene construction.

    The following figure (Figure 14 from the original paper) shows mainstream simulators for robotic research:

    Fig. 14: Mainstream simulators for robotic research, including Webots, Gazebo, CoppeliaSim, PyBullet, Genesis, Isaac Gym, Isaac Sim, Isaac Lab, MuJoCo, and SAPIEN. These simulators provide high-fidelity environments for training and evaluating robotic agents.

4.2.3.2. Physical Properties of Simulators

High-fidelity physical property simulation enhances realism and algorithm generalization. Table 2 summarizes support for various types of physical simulation.

  • Suction: Modeling non-rigid attachment (e.g., vacuum grasping). MuJoCo (user-defined logic), Gazebo (plugins), Webots, CoppeliaSim, Isaac Sim (native module support).

  • Random external forces: Simulating environmental uncertainties (collisions, wind). Most platforms support it; Isaac Gym for efficient large-scale scenarios.

  • Deformable objects: Materials changing shape under force (cloth, ropes, soft robots). MuJoCo, PyBullet (basic); Isaac Gym, Isaac Sim, Isaac Lab (advanced, GPU/PhysX-based); Genesis (integrates state-of-the-art solvers).

  • Soft-body contacts: Interactions between soft materials. Webots, Gazebo, MuJoCo, CoppeliaSim, PyBullet (basic); Isaac Gym, Isaac Sim, Isaac Lab, Genesis (advanced, GPU/FEM).

  • Fluid mechanism: Motion and interaction of liquids/gases. Webots, Gazebo (basic); Isaac Sim (particle-based); Genesis (native high-fidelity). Others lack native support.

  • DEM (Discrete Element Method) simulation: Models objects as rigid particles, simulating contact/collision/friction (granular materials). Not natively supported by mainstream simulators, though Gazebo can be extended via plugins for indirect simulation.

  • Differentiable physics: Simulator's ability to compute gradients of physical states w.r.t. input parameters, enabling end-to-end optimization. MuJoCo XLA (via JAX), PyBullet (Tiny Differentiable Simulator), Genesis (ground-up design). A toy differentiable-simulation sketch is given after Table 2 below.

    The following are the results from Table 2 of the original paper:

    Simulator Physics Engine Suction Random external forces Deformable objects Soft-body contacts Fluid mechanism DEM simulation Differentiable physics
    Webots ODE(default) X X X
    Gazebo DART(default) X
    MuJoCo MuJoCo - X X
    CoppeliaSim Bullet, ODE, Vortex, Newton S S X X
    PyBullet Bullet X X X
    Isaac Gym PhysX, FleX(GPU) X X
    Isaac Sim PhysX(GPU)
    Isaac Lab PhysX(GPU)
    SAPIEN PhysX X X
    Genesis Custom-designed +

Note: The table uses ✓ for support, X for lack of support, S for support via scripting, - for user-defined logic, and + for complex simulation via custom solvers.
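As noted in the differentiable-physics entry above, such simulators expose gradients of physical states with respect to input parameters. The toy sketch below uses JAX automatic differentiation through a hand-rolled point-mass rollout to optimize an initial velocity; it is a minimal illustration of the idea and is not tied to the API of MuJoCo XLA, the Tiny Differentiable Simulator, or Genesis.

```python
import jax
import jax.numpy as jnp

def simulate(v0, steps=50, dt=0.02, gravity=9.81, drag=0.1):
    """Differentiable toy physics: a point mass launched with velocity v0,
    integrated with explicit Euler steps. Every operation is a JAX op,
    so gradients flow through the whole rollout."""
    pos = jnp.zeros(2)
    vel = v0
    for _ in range(steps):
        acc = jnp.array([0.0, -gravity]) - drag * vel
        vel = vel + dt * acc
        pos = pos + dt * vel
    return pos

def loss(v0, target=jnp.array([1.0, 0.0])):
    # Squared distance between the final position and a target landing point.
    return jnp.sum((simulate(v0) - target) ** 2)

grad_fn = jax.grad(loss)
v0 = jnp.array([1.0, 3.0])
for _ in range(100):              # gradient descent on the initial velocity
    v0 = v0 - 0.1 * grad_fn(v0)
print(simulate(v0))               # final position approaches the target
```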

4.2.3.3. Rendering Capabilities

Crucial for diminishing the sim-to-real gap. Table 3 compares rendering features.

  • Rendering Engine: Core software for creating 2D images from 3D scenes.
    • OpenGL: Webots (WREN), MuJoCo, CoppeliaSim, PyBullet.
    • Vulkan: Isaac Gym, SAPIEN (SapienRenderer).
    • Omniverse RTX Renderer: Isaac Sim, Isaac Lab (leveraging NVIDIA's RTX technology).
    • PyRender + LuisaRender: Genesis.
  • Ray Tracing: Simulates physical behavior of light for accurate shadows, reflections, global illumination, and realistic sensor simulation.
    • No native real-time: Webots, MuJoCo, PyBullet.
    • Static image generation: CoppeliaSim (POV-Ray tracer).
    • Robust real-time: Isaac Sim, Isaac Lab (Omniverse RTX).
    • Significant support: SAPIEN (SapienRenderer), Genesis.
    • Path towards: Gazebo (experimental NVIDIA OptiX).
  • Physically-Based Rendering (PBR): Models light interaction with materials based on physical properties for realistic visuals.
    • Support: Webots (WREN), Modern Gazebo (Ignition Rendering), Isaac Sim, Isaac Lab, SAPIEN, Genesis.
    • Lack support: MuJoCo, CoppeliaSim, PyBullet, Isaac Gym.
  • Parallel Rendering: Rendering multiple independent simulation environments simultaneously.
    • Strong capabilities: Isaac Gym, Isaac Sim/Lab, SAPIEN (ManiSkill), Genesis.

    • Limited/not primary strength: Webots, Gazebo, MuJoCo, CoppeliaSim, PyBullet.

      The following are the results from Table 3 of the original paper:

      Simulator Rendering Engine Ray Tracing Physically-Based Rendering Scalable Parallel Rendering
      Webots WREN (OpenGL-based) X X
      Gazebo Ogre (OpenGL-based)
      Mujoco OpenGL-based X X X
      CoppeliaSim OpenGL-based X X X
      PyBullet OpenGL-based (GPU) TinyRender (CPU) X
      Isaac Gym Vulkan-based X
      Isaac Sim Omniverse RTX Renderer
      Isaac Lab Omniverse RTX Renderer
      SAPIEN SapienRenderer (Vulkan-based)
      Genesis PyRender+LuisaRender

Note: ✓ indicates support, X indicates lack of support.

4.2.3.4. Sensor and Joint Component Types

Realistic sensor models and accurate joint simulation are crucial. Table 4 summarizes support.

  • Sensors: Most mainstream platforms support RGB camera, IMU (Inertial Measurement Unit), and force contact.
    • High-fidelity: Isaac Sim, Isaac Lab, Genesis.
    • Limited: Isaac Gym (vision limited, often combined with Isaac Sim).
    • Specific gaps: Isaac Gym, SAPIEN (no native LiDAR); MuJoCo, PyBullet, SAPIEN (no GPS).
  • Joint types: Define robot degrees of freedom (DOF) and flexibility.
    • Commonly supported: Floating, Fixed, Hinge (revolute), Spherical, Prismatic.

    • Less common: Helical joints (coupled rotational and translational motion), natively implemented only in Gazebo and CoppeliaSim.

      The following are the results from Table 4 of the original paper:

      Simulator Sensor Joint type
      IMU/Force contact/ RGB Camera LiDAR GPS Floating/Fixed/Hinge Spherical/Prismatic Helical
      Webots X X
      Gazebo
      Mujoco X X X
      CoppeliaSim
      PyBullet X X X
      Isaac Gym X X X
      Isaac Sim X
      Isaac Lab X
      SAPIEN + X X
      Genesis X X X

Note: ✓ indicates support, X indicates lack of support, + indicates partial support.

The following figure (Figure 15 from the original paper) shows main joint types in simulators:

Fig. 15: Main joint types in simulators [260]: floating, hinge (revolute), spherical, prismatic, fixed, and helical joints, which realize different motion mechanisms in robot simulators. A small MJCF-style example of several of these joint types follows.
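As a concrete example of how such joint types are declared in practice, the sketch below builds a tiny MJCF scene with free (floating), hinge (revolute), and slide (prismatic) joints and steps it with the open-source MuJoCo Python bindings. The body names, sizes, and step count are illustrative only; a fixed attachment in MJCF is simply a body without a joint.

```python
import mujoco  # official MuJoCo Python bindings (pip install mujoco)

# Minimal MJCF scene illustrating several of the joint types in Fig. 15.
MJCF = """
<mujoco>
  <worldbody>
    <body name="floating_box" pos="0 0 1">
      <freejoint/>                                    <!-- floating (6-DoF) -->
      <geom type="box" size="0.05 0.05 0.05"/>
    </body>
    <body name="pendulum" pos="0.3 0 1">
      <joint name="elbow" type="hinge" axis="0 1 0"/> <!-- revolute -->
      <geom type="capsule" fromto="0 0 0 0 0 -0.2" size="0.02"/>
    </body>
    <body name="slider" pos="-0.3 0 1">
      <joint name="rail" type="slide" axis="1 0 0"/>  <!-- prismatic -->
      <geom type="box" size="0.04 0.04 0.04"/>
    </body>
  </worldbody>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(MJCF)
data = mujoco.MjData(model)
for _ in range(1000):          # advance the physics by 1000 timesteps
    mujoco.mj_step(model, data)
print(data.qpos)               # generalized coordinates of all joints
```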

4.2.3.5. Discussions and Future Perspectives

  • Advantages of Simulators: Cost-effectiveness, safety, control (over variables), repeatability.
  • Challenges of Simulators: Accuracy (simplifications), complexity (real-world systems are intricate), data dependency (calibration/validation), overfitting (to specific scenarios).
  • Future Perspectives: Limitations highlight the need for world models—more sophisticated, adaptable modeling frameworks that leverage ML/AI to adapt to new data, handle complex systems, and reduce reliance on extensive datasets.

4.2.4. World Models (Section 5)

This section delves into the concept of world models as generative AI models understanding real-world dynamics, inspired by human cognition.

4.2.4.1. Definition and Motivation

World models are "generative AI models that understand the dynamics of the real world, including physics and spatial properties" [17]. Pioneering work by Ha and Schmidhuber [18] showed agents learning compressed, generative models for internal simulation. Recent video generation models like Sora [263] and Kling [264] emphasize using these as world simulators. Yann LeCun advocates for video-based world models to achieve human-level cognition, proposing V-JEPA [266] to learn abstract representations.

The following figure (Figure 16 from the original paper) illustrates the role and training of world models in AI systems:

Figure 16: In an environment without rewards, a global world model is learned through task-free exploration; the model then supports prediction and adaptation to different tasks (A, B, C), enabling zero-shot or few-shot adaptation.

4.2.4.2. Representative Architectures of World Models

The architectures reflect different approaches to representing and predicting the world, evolving from compact latent dynamics models to powerful generative architectures.

  • Recurrent State Space Model (RSSM): Uses a compact latent space to encode environmental states and a recurrent structure to model temporal dynamics. Enables long-horizon prediction by simulating futures in latent space. Popularized by the Dreamer series [267][271]. A minimal transition sketch is given after Fig. 18 below.

  • Joint-Embedding Predictive Architecture (JEPA): Models the world in an abstract latent space, but learns by predicting abstract-level representations of missing content in a self-supervised manner, without explicit generative decoders. (I-JEPA [273], V-JEPA [274][275]).

  • Transformer-based State Space Models: Replaces RNNs with attention-based sequence modeling for latent dynamics modeling, capturing long-range dependencies and offering parallelism. Examples include TransDreamer [276], TWM [277], and Genie [278].

  • Autoregressive Generative World Models: Treats world modeling as sequential prediction over tokenized visual observations, using Transformers to generate future observations conditioned on past context. Often integrates multimodal inputs (actions, language). Examples include CogVideo [279], NUWA [280], VideoPoet [281], GAIA-1 [282], OccWorld [283].

  • Diffusion-based Generative World Models: Iteratively denoises from noise to synthesize temporally consistent visual sequences, offering stable training and superior fidelity. Shifts from pixel-space diffusion to latent-space modeling for efficiency (Imagen Video [285], VideoLDM [286], SVD [287]). OpenAI's Sora [263] and Google Deepmind's Veo3 [288] demonstrate visual realism and 3D/physical dynamics modeling. Increasingly adopted for simulating future observations (e.g., DriveDreamer [289], Vista [290], GAIA-2 [291]).

    The following figure (Figure 17 from the original paper) displays representative architectures and applications of world models:

    Fig. 17: Representative architectures and applications of world models, including recurrent state space models, joint-embedding predictive architectures, and diffusion- and Transformer-based world models, together with application examples over the years such as autonomous driving and general-purpose robots.

The following figure (Figure 18 from the original paper) compares autoregressive transformer-based world models and video diffusion-based world model:

Fig. 18: Comparison of autoregressive Transformer-based world models (e.g., GAIA-1) and video diffusion-based world models (e.g., Vista). The left shows the structure of the autoregressive process, while the right outlines the temporal denoising of the video diffusion model and its future-prediction capability.
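To make the RSSM architecture described above more concrete, the following PyTorch sketch implements a minimal latent transition: a deterministic recurrent state updated by a GRU cell plus a stochastic latent sampled from a learned prior (during imagination) or a learned posterior (when an observation embedding is available). All dimensions and layer choices are illustrative and are not taken from the Dreamer papers.

```python
import torch
import torch.nn as nn

class MiniRSSM(nn.Module):
    """Minimal RSSM-style transition sketch: deterministic state h_t plus a
    stochastic latent z_t sampled from a learned prior or posterior."""
    def __init__(self, action_dim=4, stoch=16, deter=64, embed=32):
        super().__init__()
        self.cell = nn.GRUCell(stoch + action_dim, deter)      # deterministic path
        self.prior = nn.Linear(deter, 2 * stoch)                # p(z_t | h_t)
        self.posterior = nn.Linear(deter + embed, 2 * stoch)    # q(z_t | h_t, o_t)

    def _sample(self, stats):
        mean, log_std = stats.chunk(2, dim=-1)
        return mean + log_std.exp() * torch.randn_like(mean)

    def step(self, h, z, action, obs_embed=None):
        h = self.cell(torch.cat([z, action], -1), h)            # h_t = f(h_{t-1}, z_{t-1}, a_{t-1})
        if obs_embed is None:                                    # imagination: use the prior
            z = self._sample(self.prior(h))
        else:                                                    # observation available: use posterior
            z = self._sample(self.posterior(torch.cat([h, obs_embed], -1)))
        return h, z

rssm = MiniRSSM()
h, z = torch.zeros(1, 64), torch.zeros(1, 16)
for _ in range(10):                                              # roll out 10 imagined steps
    h, z = rssm.step(h, z, torch.zeros(1, 4))
```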

4.2.4.3. Core Roles of World Models

World models serve as general-purpose representations of the environment, enabling various applications.

  • World Models as Neural Simulator: Generate controllable, high-fidelity synthetic experiences across vision and action domains.

    • NVIDIA's Cosmos series [294]: Unified platform for building foundation video models as general-purpose world simulators, adaptable via fine-tuning (e.g., Cosmos-Transfer1 [295] for spatially conditioned, multi-modal video generation).
    • Domain-specific simulators: Wayve's GAIA series (GAIA-1 [282], GAIA-2 [291]) for realistic and controllable traffic simulation.
    • 3D-structured neural simulators: Explicitly model physical occupancy or scene geometry (e.g., DriveWorld [296], DOME [297], AETHER [298], DeepVerse [299]).
  • World Models as Dynamic Models: In model-based reinforcement learning (MBRL), agents build an internal model of the environment (including dynamic and reward models) to simulate interactions and improve policy learning. World models capture environment dynamics directly from data.

    • Dreamer series [268][269][270]: Systematically explores latent-space modeling of environment dynamics from visual input using RSSM (e.g., DreamerV3 [270] achieved state-of-the-art across diverse visual control tasks). DayDreamer [271] validated real-world applicability on physical robots.
    • Pretraining on real-world video data: ContextWM [301] learns generalizable visual dynamics unsupervised.
    • Token-based approaches: iVideoGPT [302] tokenizes videos, actions, rewards into multi-modal sequences, using a Transformer for autoregressive prediction.
  • World Models as Reward Models: Address the challenge of designing effective reward signals in RL by inferring rewards automatically. Generative world models trained for video prediction can serve as implicit reward models, interpreting the model's prediction confidence as a learned reward signal.

    • VIPER [305]: Trains an autoregressive video world model on expert demonstrations and uses the model's prediction likelihood as the reward. Enables learning high-quality policies without manual rewards and supports cross-embodiment generalization.

      The following figure (Figure 19 from the original paper) illustrates the general framework of Model-based RL:

      Fig. 19: The general framework of model-based RL. The agent learns a dynamic model $f: (s_t, a_t) \mapsto s_{t+1}$ and a reward model $r: (s_t, a_t) \mapsto r_t$, which are used to simulate interactions and improve policy learning. The figure contains three main components: the dynamic model (environment), the reward model, and the policy model (agent); the state $s_t$ is fed to the policy to produce an action $a_t$, the agent receives a reward $r_t$, and the next state $s_{t+1}$ is predicted. A minimal sketch of this loop follows the figure.
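The following schematic Python sketch spells out the model-based RL loop of Fig. 19. Every argument is a hypothetical placeholder callable (environment interface, learned dynamics and reward models, policy, and their update routines), so it illustrates the structure of the loop rather than any particular library's API.

```python
import numpy as np

def train_with_world_model(env_reset, env_step, dynamics, reward_model, policy,
                           update_world_model, update_policy,
                           real_episodes=100, imagined_rollouts=50, horizon=15):
    """Sketch of the model-based RL loop in Fig. 19; all arguments are
    placeholder callables standing in for learned components."""
    replay = []
    for _ in range(real_episodes):
        # 1) Collect real experience with the current policy.
        s, done = env_reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env_step(a)
            replay.append((s, a, r, s_next))
            s = s_next
        # 2) Fit the world model (dynamics + reward) to the replay buffer.
        update_world_model(replay)
        # 3) Improve the policy purely inside the learned model ("imagination").
        for _ in range(imagined_rollouts):
            s = replay[np.random.randint(len(replay))][0]   # start from a real state
            trajectory = []
            for _ in range(horizon):
                a = policy(s)
                trajectory.append((s, a, reward_model(s, a)))
                s = dynamics(s, a)                          # predicted next state
            update_policy(trajectory)
```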

4.2.5. World Models for Intelligent Agents (Section 6)

This section explores specific applications and challenges of world models in autonomous driving and articulated robots.

4.2.5.1. World Models for Autonomous Driving

Autonomous driving requires real-time comprehension and forecasting of complex, dynamic road environments. Video generation-based world models are well-suited due to their capacity for capturing physics and dynamic interactions. The paper categorizes them into Neural Simulator, Dynamic Model, and Reward Model.

The following figure (Figure 21 from the original paper) illustrates the world model for autonomous driving as a Neural Simulator, Dynamic Model, and Reward Model:

Figure 21: The relationships among the neural simulator, dynamic model, and reward model. Inputs consisting of the current driving state and various conditions are processed by an encoder and the world model, and then used for planning and control in downstream tasks; each module outputs its information through a decoder.

  • WMs as Neural Simulators for Autonomous Driving: Generate realistic driving scenarios for training and testing.

    • GAIA-1 [282]: Pioneered sequence modeling of driving videos with multimodal inputs (video, text, action) via an autoregressive transformer.
    • GAIA-2 [312]: Advanced controllable generation with structured conditioning (ego-vehicle dynamics, multi-agent interactions) using a latent diffusion backbone.
    • DriveDreamer [313] / DriveDreamer-2 [314] / DriveDreamer4D [315]: Diffusion-based generation with structured traffic constraints, integrating LLMs for natural language-driven scenario generation, and 4D driving scene representation.
    • MagicDrive [316] / MagicDrive3D [317] / MagicDrive-V2 [318]: Novel street view generation with diverse inputs and cross-view attention, controllable 3D generation, and high-resolution long video generation using Diffusion Transformer.
    • WoVoGen [320]: Explicit 4D world volumes for multi-camera video generation with sensor interconnectivity.
    • Occupancy representations: OccSora [321] (diffusion-based 4D occupancy generation), DriveWorld [322] (occupancy-based Memory State-Space Models), Drive-OccWorld [323] (occupancy forecasting with end-to-end planning).
  • WMs as Dynamic Models for Autonomous Driving: Learn underlying physics and motion patterns for perception, prediction, and planning.

    • MILE [350]: Model-based imitation learning for urban driving, jointly learning a predictive world model and driving policy.
    • TrafficBots [352]: Multi-agent traffic simulation with configurable agent personalities.
    • Occupancy-based representations: UniWorld [353] (4D geometric occupancy prediction), OccWorld [283] (vector-quantized variational autoencoders), GaussianWorld [368] (4D occupancy forecasting).
    • Sensor fusion & geometric understanding: MUVO [356] (spatial voxel representations from camera/LiDAR), ViDAR [357] (visual point cloud forecasting).
    • Self-supervised learning: LAW [360] (predicting future latent features without perception labels).
    • Integrated reasoning: Cosmos-Reason1 [351] (physical common sense, embodied reasoning), Doe-1 [367] (autonomous driving as next-token generation), DrivingGPT [370] (driving world modeling and trajectory planning).
  • WMs as Reward Models for Autonomous Driving: Evaluate quality and safety of driving behaviors for policy optimization.

    • Vista [376]: Generalizable reward functions using the model's own simulation capabilities.

    • WoTE [379]: Trajectory evaluation using Bird's-Eye View (BEV) world models for real-time safety assessment.

    • Drive-WM [378]: Multi-future trajectory exploration with image-based reward evaluation.

    • Iso-Dream [375]: Separates controllable vs. non-controllable dynamics.

      The following figure (Figure 22 from the original paper) illustrates how the world model processes information from multi-view images during encoding and decoding:

      This schematic shows how the world model processes multi-view image information during encoding and decoding: images captured by multiple cameras (left) are processed by an encoder and fed into the world model module, combined with timestep and condition embeddings to produce denoised images; the reconstructed scene is shown on the right.

4.2.5.2. World Models for Articulated Robots

Articulated robots (robotic arms, quadrupeds, humanoids) place stringent requirements on world modeling for complex loco-manipulation tasks.

  • WMs as Neural Simulators for Articulated Robots: Generate high-fidelity digital environments.

    • NVIDIA's Cosmos World Foundation Model Platform [294]: Unified framework for physics-accurate 3D video predictions, facilitating sim-to-real transfer.

    • WHALE [381]: Generalizable world model with behavior-conditioning for out-of-distribution (OoD) generalization and uncertainty estimation.

    • RoboDreamer [382]: Compositional world model for robotic decision-making by factorizing video generation into primitives.

    • DreMa [383]: Compositional world model combining Gaussian Splatting and physics simulation for photorealistic future prediction.

    • DreamGen [384]: Trains generalizable robot policies via neural trajectories synthesized by video world models.

      The following figure (Figure 23 from the original paper) illustrates the workflow of the Cosmos-Predict world model:

      Fig. 23: The Cosmos-Predict World Foundation Model processes input videos through the Cosmos-Tokenize1-CV8×8×8-720p tokenizer, encoding them into latent representations perturbed with Gaussian noise. A 3D patchification step structures these latents, followed by iterative self-attention, cross-attention (conditioned on text), and MLP blocks, modulated by adaptive layer normalization. Finally, the decoder reconstructs high-fidelity video output from the refined latent space. This architecture enables robust spatiotemporal modeling for diverse Physical AI applications [294]. The model predicts future states from the current state to guide robot task execution; a toy sketch of the patchification step follows.
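For intuition about the 3D patchification step mentioned in the caption, here is a minimal NumPy sketch. The patch sizes used below are illustrative assumptions, not the values used by Cosmos.

```python
import numpy as np

def patchify_3d(latent, pt=2, ph=2, pw=2):
    """Split a latent video tensor of shape (T, H, W, C) into non-overlapping
    3D patches of size (pt, ph, pw) and flatten each patch into one token.
    Patch sizes here are made up for illustration."""
    T, H, W, C = latent.shape
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)        # group the three patch axes together
    return x.reshape(-1, pt * ph * pw * C)      # (num_tokens, token_dim)

tokens = patchify_3d(np.zeros((8, 16, 16, 4)))
print(tokens.shape)  # (256, 32): 4 * 8 * 8 tokens, each of dimension 2 * 2 * 2 * 4
```

Each flattened token then enters the alternating self-attention, cross-attention, and MLP blocks described in the caption.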

  • WMs as Dynamic Models for Articulated Robots: Learn predictive representations of environmental dynamics for MBRL.

    • PlaNet [429]: Latent dynamic model for pixel-based planning.

    • Plan2Explore [262]: Self-supervised RL agent seeking future novelty via model-based planning.

    • Dreamer series [393][394][270]: Learn latent-state dynamics from high-dimensional observations.

    • Dreaming [395] / DreamingV2 [396]: Reconstruction-free MBRL by eliminating Dreamer's decoder.

    • LEXA [398]: Unified framework for unsupervised goal-reaching.

    • FOWM [399]: Offline world model pretraining with online finetuning.

    • DWL [401]: End-to-end RL for humanoid locomotion enabling zero-shot sim-to-real transfer.

    • Puppeteer [404]: Hierarchical world model for visual whole-body humanoid control.

    • PIVOT-R [406]: Primitive-driven waypoint-aware world model for language-guided manipulation.

    • SafeDreamer [408]: Integrates Lagrangian-based methods with world model planning for safe RL.

    • V-JEPA 2 [275]: 1.2B-parameter world model for video-based understanding, prediction, and zero-shot planning.

      The following figure (Figure 25 from the original paper) illustrates latent dynamic models employing distinct transition mechanisms or temporal prediction:

      This diagram shows three types of latent dynamic models: a deterministic model (RNN), a stochastic model (SSM), and the recurrent state-space model (RSSM); each expresses the relationships among actions, hidden states, and observations through a different structure. A minimal RSSM-style cell is sketched below.
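The following is a minimal PyTorch sketch of an RSSM-style cell, combining the deterministic recurrent path with a stochastic latent as in the rightmost model of Fig. 25. All sizes and the single-layer networks are illustrative assumptions, not the architecture of any specific paper.

```python
import torch
import torch.nn as nn

class MiniRSSM(nn.Module):
    """Illustrative RSSM-style cell: deterministic path h_t plus stochastic latent z_t."""
    def __init__(self, action_dim=4, stoch_dim=8, deter_dim=32):
        super().__init__()
        self.gru = nn.GRUCell(stoch_dim + action_dim, deter_dim)  # deterministic recurrence
        self.prior_net = nn.Linear(deter_dim, 2 * stoch_dim)      # predicts mean and log-std

    def forward(self, z_prev, action, h_prev):
        # Deterministic update: h_t = f(h_{t-1}, z_{t-1}, a_{t-1})
        h = self.gru(torch.cat([z_prev, action], dim=-1), h_prev)
        # Stochastic update: z_t ~ N(mu(h_t), sigma(h_t)), reparameterized sample
        mean, log_std = self.prior_net(h).chunk(2, dim=-1)
        z = mean + log_std.exp() * torch.randn_like(mean)
        return z, h

# One imagined step with zero-initialized states and a dummy action (batch size 1).
cell = MiniRSSM()
z, h = torch.zeros(1, 8), torch.zeros(1, 32)
a = torch.zeros(1, 4)
z, h = cell(z, a, h)
print(z.shape, h.shape)  # torch.Size([1, 8]) torch.Size([1, 32])
```

A posterior network conditioned on the current observation, an observation decoder, and a KL term between prior and posterior would complete a Dreamer-style training setup.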

  • WMs as Reward models for Articulated Robots: Implicitly infer rewards by measuring alignment with model predictions.

    • PlaNet [267]: Uses an explicitly learned reward predictor as part of the dynamics model.

    • VIPER [427]: Uses pretrained video prediction models as reward signals, interpreting prediction likelihoods as rewards (a brief sketch of this idea follows Fig. 26).

      The following figure (Figure 26 from the original paper) presents three different architectures: Joint-Embedding Architecture, Generative Architecture, and Joint-Embedding Predictive Architecture:

      This schematic shows three architectures: (a) the Joint-Embedding Architecture, (b) the Generative Architecture, and (c) the Joint-Embedding Predictive Architecture. It depicts the encoders and decoders and the relationships between them; the function \(D(s_x, s_y)\) plays a key role in each architecture.
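As promised above, here is a rough sketch of the likelihood-as-reward idea behind VIPER-style reward models: an observed frame is scored by the log-likelihood a pretrained video model assigns to it, and that score serves as a dense reward. The per-pixel Gaussian predictive model below is a toy stand-in for a real video model.

```python
import numpy as np

def likelihood_reward(pred_mean, pred_std, observed_frame):
    """Score an observed frame by its log-likelihood under the model's prediction,
    here a per-pixel Gaussian (pred_mean, pred_std). Illustrative sketch only."""
    var = pred_std ** 2
    log_prob = -0.5 * (((observed_frame - pred_mean) ** 2) / var
                       + np.log(2 * np.pi * var))
    return float(log_prob.mean())   # mean log-likelihood used as the reward signal

rng = np.random.default_rng(0)
pred = rng.random((8, 8))
on_track = pred + 0.01 * rng.normal(size=(8, 8))   # behavior the model expected
off_track = rng.random((8, 8))                     # behavior it did not expect
print(likelihood_reward(pred, 0.05, on_track) > likelihood_reward(pred, 0.05, off_track))  # True
```

In practice this likelihood reward is typically combined with an exploration term and normalization; the sketch omits those details.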

  • Technical Trends: Tactile-enhanced world models for dexterous manipulation, unified world models for cross-hardware/cross-task generalization, hierarchical world models for long-horizon tasks.
  • Challenges:
    • High-Dimensionality and Partial Observability: Handling vast sensory data and inherent environmental uncertainty.
    • Causal Reasoning versus Correlation Learning: Moving beyond correlations to understand underlying physics and intent.
    • Abstract and Semantic Understanding: Integrating fine-grained physical predictions with abstract concepts (traffic laws, intent, affordances).
    • Systematic Evaluation and Benchmarking: Developing metrics that correlate with downstream task performance.
    • Memory Architecture and Long-Term Dependencies: Retaining and retrieving relevant information over extended timescales.
    • Human Interaction and Predictability: Ensuring legible, predictable, and socially compliant agent behavior.
    • Interpretability and Verifiability: Understanding model rationale for safety-critical applications.
    • Compositional Generalization and Abstraction: Learning disentangled representations for understanding novel scenarios by composing known concepts.
    • Data Curation and Bias: Addressing data quality, bias, and learning from rare but safety-critical events.

4.3. Algorithmic Flows and Formulas

As a survey paper, the core methodology involves the systematic review and categorization of existing research rather than the introduction of new mathematical algorithms or formulas within the paper's main body. However, the paper discusses various algorithms and models that rely on specific mathematical principles. The methodologies described above for MPC, WBC, RL, IL, VLA, and different world model architectures all implicitly rely on underlying mathematical frameworks (e.g., optimization, probability theory, neural network architectures, attention mechanisms).

For example, the section on Transformer-based State Space Models mentions attention mechanisms. The Scaled Dot-Product Attention is a foundational component of Transformers, calculated as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:

  • $Q$ represents the Query matrix.

  • $K$ represents the Key matrix.

  • $V$ represents the Value matrix.

  • $d_k$ is the dimension of the key vectors, used for scaling to prevent the dot products from becoming too large and pushing the softmax function into regions with very small gradients.

  • $QK^T$ computes the similarity scores between queries and keys.

  • $\mathrm{softmax}$ normalizes these scores to produce attention weights.

  • These weights are then applied to the values $V$ to produce the output, which is a weighted sum of the values.

    This formula, while not explicitly derived or introduced as new within the survey, is a fundamental building block for many of the Transformer-based world models and VLA models that the paper discusses. The paper's strength lies in organizing and presenting the landscape of how these underlying techniques are applied and combined in embodied AI.
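As a quick, self-contained illustration of the formula above (not code from the survey), the following NumPy snippet computes scaled dot-product attention for a toy set of queries, keys, and values.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # similarity between queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                    # weighted sum of the values

# Toy example: 3 queries, keys, and values with d_k = 4.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 4)
```

In Transformer-based world models this operation is applied per attention head over learned projections of the token sequence.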

5. Experimental Setup

This paper is a survey, so it does not present its own experimental setup in the traditional sense. Instead, it synthesizes the experimental contexts, datasets, evaluation metrics, and baselines from the numerous research papers it reviews. The "experimental setup" here refers to the common practices and benchmarks within the embodied AI field, as highlighted by the authors.

5.1. Datasets

The paper mentions a wide array of datasets commonly used in the fields of robotics, autonomous driving, and world modeling. These datasets are crucial for training and evaluating various embodied AI systems.

  • General Robotic Control:

    • DeepMind Control (DMC) suite: A collection of continuous control tasks in a physics simulator (MuJoCo), often used for Reinforcement Learning benchmarks. Tasks range from simple balance to complex locomotion and manipulation.
    • Atari: A suite of classic video games, frequently used to benchmark Reinforcement Learning algorithms that learn from pixel inputs.
    • RLBench: A benchmark for robot learning with a focus on dexterous manipulation tasks. It provides a large collection of physically simulated robot manipulation tasks.
    • RoboSuite: An open-source framework for robot control that provides a rich collection of robot manipulation environments.
    • Meta-world: A meta-Reinforcement Learning benchmark that focuses on training agents to quickly adapt to new tasks.
    • Safety-Gymnasium: A Reinforcement Learning benchmark designed to evaluate safe RL algorithms, often used to test safety constraints.
  • Autonomous Driving:

    • nuScenes [327]: A large-scale multimodal dataset for autonomous driving in urban environments, collected in Boston and Singapore. It includes LiDAR, radar, camera, and GPS data, along with 3D bounding box annotations. It is chosen for its diversity and comprehensive sensor suite, enabling research in perception, prediction, and planning.
    • Waymo Open Dataset (WOD) [328]: Another large-scale, high-quality autonomous driving dataset from the Waymo self-driving fleet. It contains LiDAR, camera, and motion data for a variety of urban and suburban driving scenarios. Its scale and diversity make it suitable for generalizable autonomous driving research.
    • CARLA simulation [340]: A high-fidelity open-source simulator for autonomous driving research. It provides realistic rendering and physics, allowing for controlled experimentation in various scenarios, including urban driving. It's often used when real-world data collection is impractical.
    • BDD [343]: Berkeley DeepDrive is a large-scale collection of diverse driving videos and images, often used for computer vision tasks in autonomous driving, such as object detection, semantic segmentation, and traffic light recognition.
    • NuPlan [330]: A benchmark for learning-based planning in real-world autonomous driving, providing a diverse set of real-world driving scenarios.
    • Occ3D [331]: A large-scale 3D occupancy prediction benchmark for autonomous driving.
    • Lyft-Level5 [335]: A self-driving motion prediction dataset for research in autonomous vehicle perception and behavior prediction.
    • KITTI [355]: A widely used dataset for autonomous driving research, providing stereo vision, optical flow, visual odometry, 3D object detection, and 3D tracking benchmarks.
    • NAVSIM [361]: A data-driven non-reactive autonomous vehicle simulation and benchmarking platform.
  • Humanoid and Manipulation Robotics:

    • PartNet-Mobility [247]: A comprehensive dataset featuring motion-annotated, articulated 3D objects, used with the SAPIEN simulator for articulated object manipulation.
    • ManiSkill [248] / ManiSkill3 [249]: Benchmarks for generalizable manipulation skills in realistic physics-based environments, providing diverse tasks and high-quality demonstrations.
    • Human motion capture data: Used in Imitation Learning to derive style-based rewards or reference gaits for natural robot movements.
    • RT-X: Robotic Transformer dataset and benchmark.
    • SeaWave benchmark: For robotic manipulation tasks.
  • World Model Pre-training:

    • Internet-scale video data: Used for pre-training video generation models and world models to learn generalizable visual dynamics and physical intuition (e.g., Cosmos [294], ContextWM [301]).

      These datasets are chosen to cover a wide range of scenarios, from simulated control tasks to complex real-world driving and manipulation, ensuring that proposed methods are evaluated for both fundamental capabilities and practical applicability.

5.2. Evaluation Metrics

The paper discusses various evaluation metrics implicitly through the results and discussions of the surveyed papers. These metrics generally fall into categories assessing performance, efficiency, fidelity, and safety. For clarity, I define common metrics in these categories below; some are explicitly mentioned in the survey, while others are implied by the context of robot learning and simulation. A short code sketch illustrating several of them follows the list.

  • Performance Metrics (Task Completion):

    • Success Rate (SR):
      • Conceptual Definition: The proportion of attempts where an agent successfully completes a given task. It is a direct measure of task-oriented performance.
      • Mathematical Formula: $ \text{SR} = \frac{\text{Number of successful attempts}}{\text{Total number of attempts}} $
      • Symbol Explanation:
        • Number of successful attempts: The count of times the robot achieved the task goal.
        • Total number of attempts: The total number of times the robot attempted the task.
    • Reward (Cumulative Reward):
      • Conceptual Definition: In Reinforcement Learning, the sum of rewards received by an agent over an episode or a series of interactions. The goal of RL is to maximize this cumulative reward.
      • Mathematical Formula: $ R_t = \sum_{k=0}^{T} \gamma^k r_{t+k+1} $
      • Symbol Explanation:
        • $R_t$: Cumulative reward (or return) at time step $t$.
        • $T$: The time horizon or end of the episode.
        • $\gamma$: Discount factor ($0 \le \gamma \le 1$), which determines the present value of future rewards.
        • $r_{t+k+1}$: The reward received at time step $t+k+1$.
    • Driving Score / Safety Score:
      • Conceptual Definition: Composite metrics specific to autonomous driving that combine various factors like compliance with traffic rules, collision avoidance, smoothness of driving, and efficiency to assess overall performance and safety.
      • Mathematical Formula: (Highly task-specific, no universal formula. Typically a weighted sum of sub-metrics.) $ \text{Driving Score} = w_1 \cdot \text{CollisionRate} + w_2 \cdot \text{TrafficRuleViolations} + w_3 \cdot \text{DrivingComfort} + \dots $
      • Symbol Explanation:
        • $w_i$: Weighting factors for the different components.
        • CollisionRate: Frequency of collisions.
        • TrafficRuleViolations: Number of traffic rule infractions.
        • DrivingComfort: Metric for smoothness of accelerations and braking.
  • Efficiency Metrics:

    • Sample Efficiency:
      • Conceptual Definition: How many interactions (samples) an RL agent needs with the environment to learn a high-performing policy. World models often aim to improve sample efficiency by allowing agents to learn from imagined experiences.
      • Mathematical Formula: (No direct formula, usually expressed as the number of environmental steps or episodes required to reach a certain performance level.)
    • Throughput Advantage:
      • Conceptual Definition: A measure of how much faster one system (e.g., a GPU-accelerated simulator) can process tasks or generate data compared to another.
      • Mathematical Formula: $ \text{Throughput Advantage} = \frac{\text{Throughput of Method A}}{\text{Throughput of Method B}} $
      • Symbol Explanation:
        • Throughput of Method A: The rate at which Method A processes units of work (e.g., frames per second, samples per second).
        • Throughput of Method B: The rate at which Method B processes units of work.
  • Fidelity & Quality Metrics:

    • Mean Squared Error (MSE):
      • Conceptual Definition: A common metric for quantifying the difference between values predicted by a model and the true values. Often used for video prediction or state prediction in world models.
      • Mathematical Formula: $ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 $
      • Symbol Explanation:
        • $n$: The number of data points.
        • $Y_i$: The observed (true) value for the $i$-th data point.
        • $\hat{Y}_i$: The predicted value for the $i$-th data point.
    • Fréchet Inception Distance (FID):
      • Conceptual Definition: A metric used to assess the quality of images generated by generative models. It measures the distance between the feature distributions of generated and real images, with lower FID indicating higher quality.
      • Mathematical Formula: $ \text{FID} = ||\mu_1 - \mu_2||^2_2 + \text{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}) $
      • Symbol Explanation:
        • $\mu_1, \mu_2$: The mean feature vectors of real and generated images, respectively, extracted from an Inception-v3 network.
        • $\Sigma_1, \Sigma_2$: The covariance matrices of the feature vectors for real and generated images.
        • $||\cdot||^2_2$: The squared L2 norm.
        • $\mathrm{Tr}(\cdot)$: The trace of a matrix.
        • $(\Sigma_1 \Sigma_2)^{1/2}$: The matrix square root of the product of the covariance matrices.
    • Visual Realism / Fidelity: Often assessed qualitatively by human evaluation or specific quantitative metrics like PSNR (Peak Signal-to-Noise Ratio) or SSIM (Structural Similarity Index Measure) for image/video quality.
  • Generalization Metrics:

    • Zero-shot Generalization: The ability of a model to perform a task without any prior training examples for that specific task, relying on knowledge learned from diverse pre-training data.
    • Few-shot Adaptation: The ability to quickly learn a new task with a very small number of training examples.
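As noted above, the following Python sketch shows how a few of these metrics could be computed. It assumes rewards are available as a plain list, that Inception-v3 features for FID have already been extracted into NumPy arrays (the random arrays here only stand in for such features), and that SciPy is installed for the matrix square root; the driving-score weights are toy values, since real driving scores are benchmark-specific.

```python
import numpy as np
from scipy.linalg import sqrtm

def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward: R_t = sum_k gamma^k * r_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def mse(y_true, y_pred):
    """Mean squared error between true and predicted values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean((y_true - y_pred) ** 2))

def fid(real_feats, gen_feats):
    """Frechet Inception Distance from pre-extracted features (rows = samples)."""
    mu1, mu2 = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    s1 = np.cov(real_feats, rowvar=False)
    s2 = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(s1 @ s2).real      # matrix square root; drop tiny imaginary parts
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2 * covmean))

def driving_score(collision_rate, violations, comfort, weights=(-1.0, -0.5, 0.2)):
    """Toy weighted composite score mirroring the formula above; weights are made up."""
    return weights[0] * collision_rate + weights[1] * violations + weights[2] * comfort

rng = np.random.default_rng(0)
print(discounted_return([1.0, 1.0, 1.0]))                # ~2.9701
print(mse([0.0, 1.0], [0.1, 0.9]))                       # 0.01
print(fid(rng.normal(size=(256, 16)),
          rng.normal(loc=0.5, size=(256, 16))))          # larger when distributions differ
print(driving_score(collision_rate=0.0, violations=1, comfort=0.8))
```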

5.3. Baselines

The baselines mentioned or implied in the paper are typically previous state-of-the-art methods or contrasting architectural approaches within each sub-field.

  • Robot Control & Learning:

    • Traditional Control Methods: PID controllers, kinematics-based approaches, hard-coded behaviors.
    • Model-Free RL: Actor-critic methods that learn policies directly from interaction without an explicit world model.
    • Earlier RL Approaches: Previous iterations of RL algorithms (e.g., DreamerV1 vs. DreamerV3).
    • Sim-to-Real Baselines: Methods that attempt to transfer policies from simulation to reality without world models or advanced sim-to-real techniques (e.g., domain randomization without world model guidance).
  • Simulators:

    • Older Simulators: Comparisons between Webots/Gazebo and newer GPU-accelerated platforms like Isaac Gym/Isaac Sim in terms of speed, fidelity, and specific physics capabilities.
    • CPU-based Physics Engines: Benchmarks against CPU-only physics engines to highlight the speedup of GPU-accelerated ones.
  • World Models:

    • Autoregressive vs. Diffusion Models: GAIA-1 (autoregressive) compared with GAIA-2 or DriveDreamer (diffusion-based) for video generation quality and control.
    • Latent Space vs. Pixel Space Models: Dreamer-style RSSMs (latent space) versus iVideoGPT (tokenized pixel space).
    • Non-Generative Predictive Models: Simpler predictive models that don't synthesize full observations.
    • Models without Foundation Model Integration: Comparisons showing the benefits of LLM/VLM integration in VLA models or world models.
  • Autonomous Driving:

    • Modular Architectures: Traditional perception-prediction-planning-control pipelines are contrasted with end-to-end or world model-integrated systems.

    • Image-based vs. 3D Occupancy-based Models: Different representations for scene understanding and forecasting.

      These baselines serve to highlight the advancements and unique contributions of the methods discussed in the survey by demonstrating improvements in aspects like sample efficiency, generalization, fidelity, speed, and safety.

6. Results & Analysis

This section synthesizes the findings from the various sub-fields reviewed in the paper, drawing conclusions about the effectiveness, advantages, and challenges of physical simulators and world models in embodied AI.

6.1. Core Results Analysis

The survey's analysis consistently highlights that the integration and advancement of physical simulators and world models are driving significant progress in embodied AI.

  1. Capability Grading (IR-L0 to IR-L4): The proposed IR-L framework provides a structured way to assess robot capabilities. It implicitly shows that current research is rapidly moving from IR-L0 (basic execution) and IR-L1 (programmatic response) towards IR-L2 (environmental awareness) and IR-L3 (humanoid cognition and collaboration), with IR-L4 (full autonomy) remaining the ultimate, aspirational goal. This framework helps contextualize the sophistication of various robot control algorithms and AI systems.

  2. Robotic Mobility and Dexterity:

    • Learning-based methods (RL, IL) have enabled robots to achieve complex locomotion and manipulation skills that were previously intractable with traditional control. Examples include robust bipedal walking on unstructured terrain [61] and agile whole-body movements like jump-shots [84].
    • Foundation Models (e.g., LLMs, VLMs) are enhancing semantic understanding and task planning, allowing robots to interpret complex instructions and generalize to new tasks, pushing towards more versatile whole-body manipulation [119]. VLA models provide end-to-end mapping from language to action [65].
    • Fall protection and recovery have seen substantial improvements with learning-based methods, offering better robustness and generalization than model-based approaches [101][102].
  3. Physical Simulators' Impact:

    • Simulators have become indispensable for overcoming the cost, safety, and repeatability issues of real-world training.
    • GPU-accelerated simulators like Isaac Gym, Isaac Sim, Genesis, and NVIDIA Newton offer significantly improved data generation efficiency (e.g., Genesis shows a 2.70× to 11.79× throughput advantage over Isaac Gym across various batch sizes). This is crucial for data-hungry deep reinforcement learning and generative AI models.
    • High-fidelity rendering (e.g., Isaac Sim with RTX-based ray tracing and PBR) helps diminish the sim-to-real gap by providing more realistic visual data for training vision-based perception algorithms.
    • Differentiable physics in emerging simulators (MuJoCo XLA, Genesis, Newton) is a game-changer, enabling end-to-end optimization and tighter integration with machine learning models.
    • Despite advancements, simulators still face accuracy, complexity, and overfitting challenges, highlighting the need for world models.
  4. World Models' Transformative Roles:

    • Neural Simulators: Generative world models (especially diffusion-based ones like DriveDreamer, GAIA-2) can synthesize controllable, high-fidelity driving scenarios or robotic interactions [291][313]. This allows for data augmentation, rare event synthesis, and sim-to-real transfer, serving as scalable alternatives to traditional simulators.
    • Dynamic Models: In model-based RL, world models (like the Dreamer series) learn latent-space dynamics from visual inputs, enabling agents to simulate future states and plan actions through imagined rollouts. This dramatically improves sample efficiency and generalization across tasks [270].
    • Reward Models: World models can implicitly infer reward signals (e.g., VIPER [305]) by assessing how well an agent's behavior aligns with the model's predictions. This addresses the challenging problem of manual reward engineering and supports cross-embodiment generalization.
  5. Autonomous Driving Specifics:

    • World models are evolving from autoregressive to diffusion-based architectures for superior generation quality and control.
    • Multi-modal integration (camera, LiDAR, text, trajectories) and structured conditioning enable the generation of diverse and controllable scenarios, crucial for stress-testing autonomous systems [312][314].
    • A shift towards 3D spatial-temporal understanding and occupancy-based representations provides better geometric consistency and scene understanding than purely image-based approaches [321][322].
    • Increasing end-to-end integration with autonomous driving pipelines aims to unify perception, prediction, and planning within a single neural architecture [367][370].
  6. Articulated Robots Specifics:

    • World models enable zero-shot sim-to-real transfer for complex locomotion [401] and manipulation tasks [275].

    • Compositional world models (e.g., RoboDreamer [382]) allow generalization to unseen object-action combinations.

    • The V-JEPA 2 model [275] demonstrates state-of-the-art performance in action recognition, anticipation, and model-predictive control for robotics with minimal real-world data.

      Overall, the core result is a clear validation of the synergistic power of advanced physical simulators and generative world models in pushing the boundaries of embodied AI toward more capable, adaptable, and generalizable systems.

6.2. Data Presentation (Tables)

The paper provides several comparative tables. I will transcribe them in full.

The following are the results from Table 5 of the original paper:

Category Paper Input (Image / Text / LiDAR / Action) Output World Model Architecture Dataset Code Availability
Neural Simulator GAIA-1 [282] X Image Transformer In-house nuScenes X
DriveDreamer [313] X X Image Diffusion nuScenes [327] & In-house
ADriver-I [326] Image Diffusion In-house X
GAIA-2 [312] Image Diffusion nuScenes X
DriveDreamer-2 [314] X Image Diffusion Waymo Open dataset (WOD) [328]
DriveDreamer4D [315] X Image Transformer NuPlan [330] & In-house X
DrivingWorld [329] X X Image DiT nuScenes
MagicDrive [316] X Image DiT nuScenes X
MagicDrive3D [317] X X Image DiT nuScenes X
MagicDrive-V2 [318] X Image Diffusion nuScenes, Occ3d [331], nuScenes-lidarseg X
WoVoGen [320] X Image Diffusion Waymo Open dataset (WOD)
ReconDreamer [325] X X Image Transformer nuScenes X
DualDiff+ [332] X Image Diffusion nuScenes X
Panacea [319] X Image DiT Cosmos [294]
Cosmos-Transfer1 [295] X Image Diffusion nuScenes X
GeoDrive [333] X Occupancy Transformer nuScenes, Openscene [334] X
DriveWorld [322] X Occupancy Diffusion nuScenes
OccSora [321] X Occupancy Transformer nuScenes, Lyft-Level5 [335] X
Drive-OccWorld [323] X Occupancy Diffusion nuScenes
DOME [297] X Occupancy Diffusion nuScenes
Dynamic Model RenderWorld [336] X Occupancy Transformer nuScenes X
OccLLaMA [337] X Occupancy Transformer NuScenes, Occ3D, NuScenes-QA [338] X
BEVWorld [339] X Image, point cloud Diffusion nuScenes, Carla simulation [340] X
HoloDrive [341] X Image, point cloud Transformer NuScenes X
GEM [342] X Image, depth Diffusion BDD [343], etc [344], [345]
DriveArena [346] X Image Diffusion nuScenes
ACT-Bench [347] Image Transformer nuScenes, ACT-Bench X
InfinityDrive [324] X Image Transformer nuScenes X
Epona [293] X Image Transformer NuPlan
DrivePhysica [348] X X Image Diffusion nuScenes X
Cosmos-Drive [349] Image DiT In-house, WOD
MILE [350] X X Image, BEV Semantics RNN Carla simulation
Cosmos-Reason1 [351] Text Transformer Cosmos-Reason-1-Dataset
TrafficBots [352] X X Trajectory Transformer Waymo Open Motion Dataset nuScenes X
Uniworld [353] X Occupancy Transformer Waymo Open Motion Dataset nuScenes X
Copilot4D [354] X Point cloud Diffusion nuScenes, KITTI [355] X
MUVO [356] X Image, Occupancy, Point cloud RNN Carla simulation X
OccWorld [283] X Occupancy Transformer nuScenes, Occ3D
ViDAR [357] X Point cloud Transformer nuScenes X
Think2Drive [358] X X BEV Semantics RNN CARLA simulation X
Reward Model LidarDM [359] X X Point cloud Diffusion KITTI-360, WOD X
LAW [360] X BEV Semantics Transformer nuScenes, NAVSIM [361], CARLA simulation
UnO [362] X Occupancy Transformer nuScenes, Argoverse 2, KITTI Odometry X
CarFormer [363] X Trajectory Transformer CARLA simulation

Note: ✓ means supported/used; X means not supported/used. For "Code Availability", ✓ means the code is released and X means it is not released. The original table contained an error in row grouping, which is corrected here to match the content categorization.

The following are the results from Table 6 of the original paper:

Category Paper Year Input (Image / Text / Video / Action) Architecture Experiments Code Availability
Neural Simulators WHALE [381] 2024 X X Transformer Robotic arm X
RoboDreamer [382] 2024 X Diffusion Robotic arm
DreMa [383] 2024 X X GS reconstruction Robotic arm
DreamGen [384] 2025 X X DiT Dual robotic arm O
EnerVerse [385] 2025 X Diffusion Robotic arm
WorldEval [386] 2025 X X DiT Robotic arm
Cosmos (NVIDIA) [294] 2025 Diffusion + Autoregressive Robots
Pangu [387] 2025 X X X / / X
RoboTransfer [388] 2025 Diffusion Robotic arm O
TesserAct [389] 2025 X DiT Robotic arm
3DPEWM [390] 2025 X DiT Mobile robot X
SGImageNav [391] 2025 X LLM Quadruped robot X
Embodiedreamer [392] 2025 X Diffusion + ACT Robotic arm O
Dynamics Models PlaNet [267] 2018 X X RSSM DMC
Plan2Explore [262] 2020 X X RSSM DMC
Dreamer [393] 2020 X X RSSM DMC
DreamerV2 [394] 2021 X X RSSM DMC
DreamerV3 [270] 2023 X X RSSM DMC
DayDreamer [271] 2024 X RSSM Robotic arm
Dreaming [395] 2021 X X RSSM DMC
Dreaming V2 [396] 2021 X X RSSM DMC + RoboSuite
DreamerPro [397] 2022 X RSSM DMC
TransDreamer [276] 2024 X TSSM 2D Simulation
LEXA [398] 2021 X RSSM Simulated robotic arm
FOWM [399] 2023 X TD-MPC Robotic arm
SWIM [400] 2023 X X RSSM Robotic arm X
ContextWM [301] 2023 X X RSSM DMC + CARLA + Meta-world
iVideoGPT [302] 2023 X Autoregressive Transformer DMC
DWL [401] 2024 X X Recurrent encoder Robotic arm
Surfer [402] 2024 X X Transformer Humanoid robot X
GAS [403] 2024 X RSSM Surgical robot
Puppeteer [404] 2024 X TD-MPC2 Simulation (56-DoF humanoid) X
TWIST [405] 2024 X X RSSM Robotic arm
Dynamics Models (continued) PIVOT-R [406] 2024 Transformer Robotic arm X
HarmonyDream [407] 2024 X X RSSM Robotic arm
SafeDreamer [408] 2024 X OSRP simulation (Safety-Gymnasium)
WMP [409] 2024 X RSSM Quadruped robot
RWM [410] 2025 X GRU+MLP Quadruped robot X
RWM-O [411] 2025 X GRU+MLP Robotic arm X
SSWM [412] 2025 X X SSM Quadrotor X
WMR [413] 2025 X X LSTM Humanoid robot X
PIN-WM [414] 2025 X GS reconstruction Robotic arm O
LUMOS [415] 2025 X X RSSM Robotic arm
OSVI-WM [416] 2025 X X Transformer Robotic arm X
FOCUS [417] 2025 X X RSSM Robotic arm
FLIP [418] 2025 X X DiT Robotic arm
EnerVerse-AC [419] 2025 X Diffusion Robotic arm
FlowDreamer [420] 2025 X Diffusion Robotic arm
HWM [421] 2025 X MVM Humanoid robot X
MoDem-V2 [422] 2024 X X FM Robotic arm X
V-JEPA 2 [275] 2025 X JEPA Robotic arm X
AdaWorld [423] 2025 X X RSSM Robotic arm X

Note: ✓ means supported/used; X means not supported/used. For "Code Availability", ✓ means the code is released, O means it is not yet available, and X means it is not released. Architectural abbreviations: DiT = Diffusion Transformers; RSSM = Recurrent State-Space Model; TSSM = Transformer State-Space Model; SSM = State-Space Model; TD-MPC = Temporal Difference learning for Model Predictive Control; OSRP = online safety-reward planning; GRU = Gated Recurrent Unit; MLP = Multilayer Perceptron; LSTM = Long Short-Term Memory; GS = Gaussian Splatting; MVM = Masked Video Modelling; FM = Flow Matching; JEPA = Joint Embedding-Predictive Architecture.

The following are the results from Table 7 of the original paper:

Category Paper Year Input (Image / Text / Video / Action) Architecture Experiments Code Availability
Dynamics Models MoSim [424] 2025 X rigid-body dynamics + ODE Robotic arm
DALI [425] 2025 X RSSM DMC
GWM [426] 2025 X X DiT Robotic arm
Reward Models VIPER [427] 2023 X Autoregressive Transformer DMC + RLBench + Atari
PlaNet [267] 2018 X X RSSM DMC

Note: ✓ means supported/used; X means not supported/used. For "Code Availability", ✓ means the code is released, O means it is not yet available, and X means it is not released. Architectural abbreviations: DiT = Diffusion Transformers; RSSM = Recurrent State-Space Model; TSSM = Transformer State-Space Model; SSM = State-Space Model; TD-MPC = Temporal Difference learning for Model Predictive Control; OSRP = online safety-reward planning; GRU = Gated Recurrent Unit; MLP = Multilayer Perceptron; LSTM = Long Short-Term Memory; GS = Gaussian Splatting; MVM = Masked Video Modelling; FM = Flow Matching; JEPA = Joint Embedding-Predictive Architecture.

6.3. Ablation Studies / Parameter Analysis

As a survey paper, this work does not present its own ablation studies or parameter analyses. Instead, it discusses the impact of design choices and parameters in the context of the surveyed literature. For instance:

  • Impact of Architectural Choices: The comparison between autoregressive and diffusion-based world models in autonomous driving (e.g., GAIA-1 vs. GAIA-2) inherently discusses how different architectures affect generation fidelity, controllability, and computational cost.

  • Role of Latent Space vs. High-Dimensional Representations: The Dreamer series explores the benefits of latent-space planning for sample efficiency, implying that directly operating in high-dimensional pixel space can be less efficient.

  • Influence of Multimodal Inputs: The effectiveness of VLA models and advanced autonomous driving world models is attributed to their ability to integrate various modalities (vision, language, LiDAR, actions), suggesting that each modality contributes to richer scene understanding and control.

  • Generalization vs. Specificity: Discussions around domain randomization in simulators or pretraining world models on diverse datasets highlight how these techniques contribute to generalization capabilities, mitigating overfitting to specific simulated environments.

  • Trade-offs in Simulators: The comparison of physical properties and rendering capabilities across simulators (Tables 2, 3, 4) implicitly reveals trade-offs between physical accuracy, rendering fidelity, computational speed (parallel rendering), and sensor support. For example, MuJoCo prioritizes physics accuracy and RL efficiency, while Isaac Sim emphasizes photorealistic rendering and comprehensive sensor simulation.

    The survey emphasizes that researchers often conduct these ablation studies to validate individual components or design choices within their proposed world models or simulation platforms.

7. Conclusion & Reflections

7.1. Conclusion Summary

This survey provides a comprehensive and timely overview of the pivotal roles played by physical simulators and world models in advancing embodied artificial intelligence. It successfully delineates a clear path toward truly intelligent robotic systems by integrating these two foundational technologies. The paper's key contributions include:

  1. A Novel Classification System: The introduction of a five-level grading standard (IR-L0 to IR-L4) offers a robust framework for evaluating robot autonomy, providing a common language for assessing progress and guiding future development in embodied AI.

  2. Exhaustive Review of Simulators: A detailed comparative analysis of mainstream simulation platforms highlights their diverse capabilities in physical simulation, rendering, and sensor/joint support, emphasizing the shift towards GPU-accelerated, high-fidelity, and differentiable simulators to bridge the sim-to-real gap.

  3. In-depth Analysis of World Models: A thorough exploration of world model architectures (from RSSMs to diffusion-based Foundation Models) and their critical roles as neural simulators, dynamic models in model-based RL, and reward models in embodied intelligence.

  4. Application-Specific Insights: A focused discussion on the application of world models in autonomous driving and articulated robots reveals how these systems are enabling unprecedented capabilities in scene generation, prediction, planning, and cross-embodiment generalization.

    The survey concludes that the synergistic integration of physical simulators for external training and world models for internal cognition is the cornerstone for realizing IR-L4 fully autonomous systems, fundamentally transforming robotics from specialized automation to general-purpose intelligence seamlessly integrated into human society.

7.2. Limitations & Future Work

The paper itself identifies several critical challenges that current world models and embodied AI systems face, which also serve as directions for future research:

  1. High-Dimensionality and Partial Observability: Future work needs to develop more robust state estimation and belief-state management techniques to handle the vast, incomplete sensory inputs.

  2. Causal Reasoning versus Correlation Learning: A major leap is required from merely learning correlations to understanding causal relationships and enabling counterfactual reasoning ("what if" scenarios) for true generalization to novel situations.

  3. Abstract and Semantic Understanding: Integrating low-level physical predictions with high-level abstract reasoning about concepts like traffic laws, pedestrian intent, and object affordances is crucial for intelligent, context-aware behavior.

  4. Systematic Evaluation and Benchmarking: New evaluation frameworks and metrics are needed that correlate better with downstream task performance, robustness in safety-critical scenarios, and the capture of causally relevant aspects of the environment, moving beyond simple MSE on predictions.

  5. Memory Architecture and Long-Term Dependencies: Designing efficient and effective memory architectures (e.g., advanced Transformers or State-Space Models) is vital for retaining and retrieving relevant information over extended timescales.

  6. Human Interaction and Predictability: World models must facilitate agent behaviors that are legible, predictable, and socially compliant to humans, requiring a deeper integration of social intelligence.

  7. Interpretability and Verifiability: For safety-critical applications, developing methods to audit and understand the internal decision-making processes of black-box deep learning models is non-negotiable. Formal verification of safety properties is a formidable theoretical and engineering challenge.

  8. Compositional Generalization and Abstraction: Future world models should learn disentangled, abstract representations of entities, relations, and physical properties to understand and predict novel scenarios by composing known concepts, rather than relying on end-to-end pattern matching.

  9. Data Curation and Bias: Addressing biases in training data and systematically collecting data for long-tail, rare-but-safety-critical events is essential for building robust and reliable systems.

    The paper also highlights specific technical trends in world models for robotics, such as tactile-enhanced world models for dexterous manipulation, unified world models for cross-hardware and cross-task generalization, and hierarchical world models for long-horizon tasks.

7.3. Personal Insights & Critique

This survey offers a tremendously valuable synthesis of two rapidly evolving and converging fields: physical simulation and world models. Its clear structure, comprehensive comparisons of simulators, and deep dive into world model architectures and applications make it an excellent resource for anyone seeking to understand the current landscape of embodied AI.

One of the most inspiring aspects is the emphasis on the synergy between external and internal models. The analogy to human cognition—where we mentally simulate possibilities and learn from them—underscores the intuitive power of world models. The advancements in GPU-accelerated differentiable simulators are particularly exciting, as they promise to close the sim-to-real gap more effectively by allowing AI systems to learn directly from gradients in a virtual world that closely mirrors reality. The potential for zero-shot generalization and few-shot adaptation enabled by foundation models and advanced world models is truly transformative, suggesting a future where robots can quickly adapt to novel situations without extensive re-training.

However, the identified limitations also present significant challenges:

  • The "Black Box" Problem: While world models show immense promise, their interpretability and verifiability remain critical concerns, especially for safety-critical applications like autonomous driving. The ability to explain why a world model made a particular prediction or plan is paramount for trust and accountability. This is an area where current deep learning methods still struggle.

  • True Causal Understanding: The distinction between correlation learning and causal reasoning is fundamental. Many world models excel at predicting what will happen, but not necessarily why it will happen based on underlying physical laws or agent intentions. Without genuine causal understanding, compositional generalization to truly novel scenarios might be limited, as the model may just be interpolating rather than truly reasoning.

  • Memory and Long-Term Planning: While Transformers have improved long-range dependencies, modeling long-term memory and planning in highly dynamic, open-ended environments remains exceptionally hard. Robots need to recall past events, adapt behaviors over extended periods, and maintain consistent goals, which taxes current memory architectures.

    The methods and conclusions of this paper have broad applicability. The IR-L classification could be adapted for any AI agent interacting with an environment, not just humanoid robots. The principles of model-based RL and generative world models are transferable to other domains requiring complex dynamic predictions and planning, such as climate modeling, drug discovery, or financial forecasting.

My critique leans towards the philosophical challenge of AGI: while the engineering progress in simulators and world models is astonishing, the bridge from sophisticated pattern matching and prediction to true understanding, consciousness, or common sense reasoning remains the ultimate frontier. The paper articulates these challenges well, particularly in the causal reasoning and abstract understanding sections. Future research must not only focus on scale and fidelity but also on embedding deeper cognitive architectures that mirror human-like reasoning and learning from interactive experience.
