A Survey: Learning Embodied Intelligence from Physical Simulators and World Models
TL;DR Summary
This survey examines the role of embodied intelligence on the path toward AGI, focusing on how the integration of physical simulators and world models enhances robot autonomy and adaptability, and it summarizes the resulting insights and open challenges for learning embodied AI.
Abstract
The pursuit of artificial general intelligence (AGI) has placed embodied intelligence at the forefront of robotics research. Embodied intelligence focuses on agents capable of perceiving, reasoning, and acting within the physical world. Achieving robust embodied intelligence requires not only advanced perception and control, but also the ability to ground abstract cognition in real-world interactions. Two foundational technologies, physical simulators and world models, have emerged as critical enablers in this quest. Physical simulators provide controlled, high-fidelity environments for training and evaluating robotic agents, allowing safe and efficient development of complex behaviors. In contrast, world models empower robots with internal representations of their surroundings, enabling predictive planning and adaptive decision-making beyond direct sensory input. This survey systematically reviews recent advances in learning embodied AI through the integration of physical simulators and world models. We analyze their complementary roles in enhancing autonomy, adaptability, and generalization in intelligent robots, and discuss the interplay between external simulation and internal modeling in bridging the gap between simulated training and real-world deployment. By synthesizing current progress and identifying open challenges, this survey aims to provide a comprehensive perspective on the path toward more capable and generalizable embodied AI systems. We also maintain an active repository that contains up-to-date literature and open-source projects at https://github.com/NJU3DV-LoongGroup/Embodied-World-Models-Survey.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
A Survey: Learning Embodied Intelligence from Physical Simulators and World Models
1.2. Authors
Xiaoxiao Long, Qingrui Zhao, Kaiwen Zhang, Zihao Zhang, Dingrui Wang*, Yumeng Liu, Zhengjhu, Yi*, Shouzheng Wang*, Xinzhe Wei, Wei Li, Wei Yin, Yao Yao, Jia Pan, Qiu Shen, Ruigang Yang, Xun Cao†, Qionghai Dai
1.3. Journal/Conference
This paper is a preprint and is published on arXiv.
Comment: arXiv is a popular open-access repository for preprints of scientific papers in fields like physics, mathematics, computer science, and more. While preprints have not undergone formal peer review, arXiv is widely used for rapid dissemination of research findings before or in parallel with formal publication.
1.4. Publication Year
2025
1.5. Abstract
The paper explores embodied intelligence as a core focus for achieving artificial general intelligence (AGI). It highlights that embodied intelligence requires agents capable of perceiving, reasoning, and acting within the physical world, necessitating advanced perception, control, and grounding of abstract cognition in real-world interactions. The survey identifies two critical enabling technologies: physical simulators and world models. Physical simulators offer controlled, high-fidelity environments for safe and efficient training and evaluation of robotic agents. World models provide robots with internal representations of their surroundings, facilitating predictive planning and adaptive decision-making beyond direct sensory input. The survey systematically reviews recent advancements in learning embodied AI by integrating these two technologies, analyzing their complementary roles in enhancing autonomy, adaptability, and generalization in intelligent robots. It discusses the interplay between external simulation and internal modeling to bridge the sim-to-real gap (the challenge of transferring behaviors learned in simulation to the real world). By synthesizing current progress and identifying open challenges, the survey aims to provide a comprehensive perspective on the path toward more capable and generalizable embodied AI systems. An active repository of literature and open-source projects is maintained at https://github.com/NJU3DV-LoongGroup/Embodied-World-Models-Survey.
1.6. Original Source Link
https://arxiv.org/abs/2507.00917 (Preprint)
PDF Link: https://arxiv.org/pdf/2507.00917v3.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem the paper addresses is how to achieve robust embodied intelligence in robots, which is seen as a crucial step towards artificial general intelligence (AGI). This is important because AGI requires grounding abstract reasoning in real-world understanding and action, moving beyond disembodied intelligence (systems operating purely on symbolic or digital data). Intelligent robots, by acting and perceiving within a physical body, can robustly learn from experience and adapt to dynamic, uncertain environments.
The paper identifies specific challenges in this pursuit:
- Safety and Cost of Real-world Experimentation: Training and testing complex robotic behaviors directly in the real world can be expensive, time-consuming, and risky.
- Data Bottlenecks: Intelligent robots require vast amounts of high-quality interaction data, which is hard to collect in the real world due to cost, safety, and repeatability issues.
- Generalization: Robots need to adapt their behavior and cognition continuously based on feedback from the physical world, requiring robust learning that can generalize to unforeseen scenarios.
- Lack of Systematic Evaluation: A comprehensive grading system for robot intelligence that integrates cognition, autonomous behavior, and social interaction is still lacking, hindering technology roadmap clarification and safety assessment.
The paper's entry point, or innovative idea, is to systematically explore the synergistic relationship between physical simulators and world models as critical enablers for developing embodied intelligence. Simulators provide a safe, controlled external environment for training, while world models create internal representations for adaptive decision-making, jointly addressing the challenges of data scarcity, safety, and generalization.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Proposed a Five-Level Grading Standard for Intelligent Robots: It introduces a comprehensive five-level classification framework (IR-L0 to IR-L4) for evaluating humanoid robot autonomy. This standard assesses robots across four key dimensions: autonomy, task handling ability, environmental adaptability, and societal cognition ability. The framework helps clarify the technological development roadmap and provides guidance for robot regulation and safety assessment.
- Systematic Review of Robot Learning Techniques: It analyzes recent advancements in intelligent robotics across various domains, including: Legged Locomotion (bipedal walking, unstructured environment adaptation, high dynamic movements, and fall recovery); Manipulation (unimanual gripper-based and dexterous-hand, bimanual, and whole-body manipulation); and Human-Robot Interaction (HRI) (cognitive collaboration, physical reliability, and social embeddedness).
- Comprehensive Analysis of Current Physical Simulators: The survey provides a detailed comparative analysis of mainstream simulators (e.g., Webots, Gazebo, MuJoCo, Isaac Gym/Sim/Lab, SAPIEN, Genesis, Newton). This analysis covers their physical simulation capabilities (e.g., suction, deformable objects, fluid dynamics, differentiable physics), rendering quality (e.g., ray tracing, physically-based rendering, parallel rendering), and sensor/joint component support.
- Review of Recent Advancements in World Models: It revisits the main architectures of world models (e.g., Recurrent State Space Models, Joint-Embedding Predictive Architectures, Transformer-based, Autoregressive, Diffusion-based) and their potential roles. It discusses how world models serve as controllable neural simulators, dynamic models for model-based reinforcement learning (MBRL), and reward models for embodied intelligence. Furthermore, it comprehensively discusses recent world models designed for specific applications such as autonomous driving and articulated robots.
The key conclusions and findings indicate that the synergistic integration of physical simulators and world models is crucial for bridging the sim-to-real gap and fostering autonomous, adaptable, and generalizable embodied AI systems. This combination enables:
- Efficient Data Generation: Simulators allow for rapid, cost-effective, and safe generation of large volumes of synthetic data with automated annotation.
- Enhanced Learning and Planning: World models enable agents to learn internal representations of environment dynamics, simulate hypothetical futures, and plan actions through imagined experiences, significantly improving sample efficiency in reinforcement learning.
- Robust Generalization: High-fidelity simulation and advanced world models help prevent policy overfitting and enhance algorithm generalization by capturing complex physical phenomena and uncertainties.
- Progress Towards AGI: These technologies lay the foundation for developing IR-L4 fully autonomous systems capable of seamless integration into human society.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand this paper, a reader should be familiar with the following core concepts:
- Artificial Intelligence (AI): A broad field of computer science that gives computers the ability to simulate human intelligence. This includes learning, problem-solving, perception, and language understanding.
- Robotics: The interdisciplinary branch of engineering and science that deals with the design, construction, operation, and use of robots. Robots are machines that can carry out a series of actions autonomously or semi-autonomously.
- Artificial General Intelligence (AGI): A theoretical form of AI that would have the ability to understand, learn, and apply intelligence to any intellectual task that a human being can. It stands in contrast to narrow AI, which is designed to perform a specific task (e.g., playing chess or recommending products).
- Embodied Intelligence: A paradigm in AI and robotics that emphasizes the importance of a physical body and its interaction with the environment for the development of intelligence. Unlike disembodied intelligence (which operates on symbolic or digital data), embodied intelligence suggests that perception, action, and cognition are deeply intertwined with the physical form and sensory-motor experiences of an agent.
- Physical Simulators: Software environments that mimic the physical properties and dynamics of the real world, allowing virtual robots to operate, interact, and learn. They provide a controlled, safe, and reproducible platform for designing, testing, and refining robotic algorithms without the costs and risks associated with real-world experiments. Examples include Gazebo and MuJoCo.
- World Models: Internal, learned representations within an AI agent that capture the dynamics of its environment. These models allow an agent to predict future states given current observations and actions, enabling cognitive processes like planning, imagination, and adaptive decision-making without needing to constantly interact with the real world. They are often generative, meaning they can synthesize future sensory inputs (e.g., video frames).
- Sim-to-Real Gap: The discrepancy between behaviors learned in a simulated environment and their performance when transferred to a physical robot in the real world. This gap arises due to differences in physics, sensor noise, latency, and other factors that are difficult to perfectly model in simulation. Bridging this gap is a major challenge in robotics.
- Reinforcement Learning (RL): A type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent learns through trial and error, receiving feedback in the form of rewards or penalties for its actions.
- Imitation Learning (IL): A machine learning paradigm where an agent learns to perform a task by observing and mimicking demonstrations provided by an expert (e.g., a human). It bypasses the need for explicit reward functions, making it suitable for complex tasks that are hard to specify programmatically.
- Model Predictive Control (MPC): An advanced control strategy that uses a dynamic model of a system to predict its future behavior over a finite time horizon. At each time step, an optimization problem is solved to compute a sequence of control actions that minimize a cost function while satisfying constraints. Only the first action in the sequence is executed, and the process is repeated.
- Whole-Body Control (WBC): A comprehensive control framework for complex robots (e.g., humanoids) that coordinates all degrees of freedom (joints and limbs) simultaneously to achieve multiple tasks while satisfying various physical constraints (e.g., balance, joint limits, contact forces).
- Visual-Language-Action (VLA) Models: Cross-modal AI frameworks that integrate visual perception, natural language understanding, and action generation for robots. They leverage large language models (LLMs) and visual models (VMs) to interpret human instructions and directly map them to robot actions, enabling more intuitive and generalizable control.
- Foundation Models (FMs): Large-scale machine learning models (e.g., Large Language Models - LLMs, Vision Models - VMs, Vision-Language Models - VLMs) that are pre-trained on vast amounts of diverse, internet-scale data. They possess powerful capabilities in semantic understanding, world knowledge integration, and cross-modal reasoning, making them adaptable for a wide range of downstream tasks through fine-tuning or zero-shot inference.
- Recurrent State Space Models (RSSMs): A class of world models that use a compact latent (hidden) space to represent the environment's state and a recurrent neural network (RNN) to model its temporal dynamics. They predict future states and observations in this latent space, enabling long-horizon planning.
- Joint-Embedding Predictive Architectures (JEPAs): World models that learn abstract representations by predicting missing parts of data (e.g., masked image regions or video segments) in a purely self-supervised manner, without requiring explicit generative decoders. They focus on learning rich, semantic representations.
- Transformer-based Models: Neural network architectures that utilize attention mechanisms (specifically self-attention) to weigh the importance of different parts of input sequences. They excel at capturing long-range dependencies and parallelism, making them powerful for sequence modeling tasks in world models.
  - Attention Mechanism: The core idea of attention is to allow a neural network to focus on specific parts of its input sequence when making predictions or processing information. For example, in natural language processing, when translating a sentence, the model might pay more attention to certain words in the source sentence when generating a word in the target sentence. Scaled Dot-Product Attention is a common form (a minimal code sketch is given at the end of this list):
    $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
    Where:
    - $Q$ (Query) is a matrix representing the query vectors.
    - $K$ (Key) is a matrix representing the key vectors.
    - $V$ (Value) is a matrix representing the value vectors.
    - $d_k$ is the dimension of the key vectors (used for scaling to prevent vanishing gradients during softmax).
    - $QK^T$ computes the dot products between queries and keys, indicating similarity.
    - $\mathrm{softmax}$ normalizes the scores to obtain attention weights.
    - The attention weights are then multiplied by the value vectors to get the output, which is a weighted sum of the values.
- Autoregressive Generative Models: Models that generate sequences (e.g., video frames) one element at a time, where each new element is conditioned on all previously generated elements. They typically use Transformers to model the sequential dependencies.
- Diffusion Models: Generative models that learn to create data by reversing a gradual noise process. They start with random noise and iteratively denoise it to produce a desired output (e.g., a high-fidelity image or video). They are known for generating high-quality and diverse samples.
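To make the attention formula above concrete, here is a minimal NumPy sketch of scaled dot-product attention; the array sizes are arbitrary and chosen only for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # similarity between queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # weighted sum of the values

# Toy example: 3 query tokens attending over 4 key/value tokens of dimension 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)        # (3, 8)
```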
3.2. Previous Works
The paper extensively references numerous previous works, which can be broadly categorized:
- Robotics & Control:
  - Early Humanoid Control: Model Predictive Control (MPC) and Whole-Body Control (WBC) have foundational works, such as Tom Erez et al. [30] for real-time MPC on humanoids, and Oussama Khatib's operational space formulation [33] for WBC. These methods focused on explicit programming and dynamic modeling.
  - Reinforcement Learning (RL) in Robotics: The application of RL to humanoid robotics dates back to the late 1990s and early 2000s, with Morimoto and Doya [43] demonstrating a simulated two-joint robot learning to stand up. More recently, works like DeepLoco [44] explored deep RL for bipedal tasks, and Xie et al. [47] achieved robust dynamic walking for physical bipedal robots like Cassie. Joonho Lee et al. [86] demonstrated the first successful real-world RL application for legged locomotion in outdoor environments.
  - Imitation Learning (IL) in Robotics: IL allows robots to learn from human demonstrations, as seen in works using human motion capture data [51][53] to achieve natural robot gaits. Challenges include data cost and generalization [49].
  - Visual-Language-Action (VLA) Models: Google DeepMind's RT-2 [65] pioneered this paradigm by discretizing robot control into language-like tokens. Subsequent models [4][66][67] have further advanced VLA in robotics.
  - Human-Robot Interaction (HRI): Studies such as Lemaignan et al. [182] explored cognitive skills for HRI, while work on physical reliability [192][199] used motion planning algorithms like PRM and RRT. Social embeddedness research [219][223] focuses on understanding social norms.
- Physical Simulators:
  - Traditional Simulators: Webots [239] (1998) and Gazebo [15] (2002) are long-standing open-source platforms. MuJoCo [16] (2012) and PyBullet [240] (2017) are widely used physics engines for contact-rich dynamics and RL. CoppeliaSim [241] (around 2010) is a general-purpose robot simulation software.
  - GPU-Accelerated & High-Fidelity Simulators: The NVIDIA Isaac series, including Isaac Gym [242] (2021) for parallel GPU-accelerated physics, Isaac Sim [243] for digital-twin simulation with Omniverse [244] and RTX-based ray tracing, and Isaac Lab [246] as an RL framework built on Isaac Sim. SAPIEN [247] (2020) is designed for part-level interactive objects, leading to benchmarks like ManiSkill [248]. Genesis [250] (2024) is a general-purpose platform unifying various physics solvers. NVIDIA Newton [251] (2025) is an emerging open-source physics engine for high-fidelity simulation.
- World Models:
  - Pioneering Work: David Ha and Jürgen Schmidhuber's "World Models" [18] (2018) demonstrated learning compact environmental representations for internal planning.
  - Recurrent State Space Models (RSSMs): The Dreamer series [267][271] (starting 2018) popularized RSSMs for learning latent dynamics and enabling MBRL. DreamerV3 [270] achieved state-of-the-art performance across diverse visual control tasks.
  - Joint-Embedding Predictive Architectures (JEPAs): Proposed by Yann LeCun [272], I-JEPA [273] and V-JEPA [274][275] learn abstract representations by predicting masked content in latent space.
  - Generative World Models (Video Generation): Recent advancements in video generation, like Sora [263] (2024) and Kling [264] (2025), have emphasized the potential of video generation models as world simulators. Early video generation frameworks include CogVideo [279] and VideoPoet [281].
  - Diffusion-based Models: Imagen Video [285], VideoLDM [286], and SVD [287] paved the way for high-fidelity video synthesis. DriveDreamer [289], Vista [290], and GAIA-2 [291] apply diffusion models to generate driving or 3D scenes.
  - Domain-Specific World Models: Wayve's GAIA series [282][291] for autonomous driving, DOME [297] for 3D occupancy prediction, and Cosmos [294] as a unified platform for foundation video models.
  - World Models as Dynamic/Reward Models: PlaNet [267] and the Dreamer series [393][394] use world models as dynamic models for planning in MBRL. VIPER [305] uses video prediction models as reward signals.
3.3. Technological Evolution
The field of embodied AI has seen a profound evolution, mirroring advancements in AI and computing:
- Early Robotics (Pre-2000s): Focused on traditional control methods (e.g., PID control, inverse kinematics). Robots were largely program-driven and operated in structured environments (IR-L0). Simulators like Webots and Gazebo emerged to provide basic virtual testing grounds.
- Model-Based Control (2000s-early 2010s): Introduction of MPC and WBC allowed for more dynamic and complex behaviors, but still relied on explicit models of robot dynamics and environments. Robots began to show limited reactivity (IR-L1). MuJoCo provided a more accurate physics engine for articulated systems.
- Learning-Based Robotics (Mid-2010s-Present): The deep learning revolution brought Reinforcement Learning (RL) and Imitation Learning (IL) to the forefront. Robots started learning behaviors from data, reducing the need for explicit programming and enhancing adaptability (IR-L2 to IR-L3). This period saw increased demand for simulators capable of parallelized training (Isaac Gym).
- Emergence of World Models (Late 2010s-Present): Inspired by human cognition, world models began to be integrated, allowing agents to learn internal representations of environment dynamics. This enabled model-based RL, planning in latent spaces, and improved sample efficiency. The Dreamer series played a pivotal role here.
- Foundation Models & Generative AI (Early 2020s-Present): The advent of large language models (LLMs) and vision-language models (VLMs), i.e., Foundation Models, profoundly influenced robotics, leading to Visual-Language-Action (VLA) models and generative world models. These models leverage vast internet data for semantic understanding, task planning, and high-fidelity scene generation, pushing robots toward humanoid cognition and collaboration (IR-L3) and full autonomy (IR-L4). GPU-accelerated simulators like Isaac Sim and Genesis became crucial for training these data-intensive models.
This paper's work fits within this timeline by synthesizing the advancements in physical simulators and world models, showing how their combined evolution is driving the current progress in embodied AI towards AGI.
3.4. Differentiation Analysis
This survey distinguishes itself from existing literature by providing a comprehensive examination of the synergistic relationship between physical simulators and world models in advancing embodied intelligence.
- Previous Surveys: The paper notes that prior surveys typically focused on individual components, for example:
  - Robotics simulators [19][21]
  - World models [22][24]
- This Paper's Innovation:
  - Bridging Two Domains: Instead of treating simulators and world models as separate entities, this survey explicitly analyzes their complementary roles and the interplay between external simulation and internal modeling. It highlights how simulators provide the external training ground, while world models offer the internal cognitive framework.
  - Unified Framework: It presents a unified view of how these two technologies enhance autonomy, adaptability, and generalization in intelligent robots, particularly in bridging the sim-to-real gap.
  - Structured Evaluation: The proposal of a five-level grading standard (IR-L0 to IR-L4) for humanoid robot autonomy offers a new framework for assessing and guiding development, which is currently lacking in comprehensive integration of intelligent cognition and autonomous behavior.
  - Application-Specific Focus: The survey delves into the specific applications and challenges of world models in critical domains like autonomous driving and articulated robots, providing a nuanced understanding of their practical implications.
By synthesizing these components, the survey offers a more holistic perspective on the path toward more capable and generalizable embodied AI systems.
4. Methodology
This survey primarily presents a comprehensive review and classification framework rather than a novel algorithmic methodology. The "methodology" of the paper is its systematic approach to analyzing the state of the art in embodied intelligence through the lens of physical simulators and world models. This includes a proposed grading system, a review of robotic techniques, a comparative analysis of simulators, and an exploration of world model architectures and applications.
4.1. Principles
The core idea is to establish a systematic understanding of how physical simulators and world models contribute to embodied intelligence, individually and synergistically. The theoretical basis is that embodied AI requires both external (simulated) environments for safe and scalable training and internal (world model) representations for adaptive, intelligent behavior. The intuition is that by combining these two, robots can learn complex skills and generalize to novel situations more effectively than relying on either alone, thereby bridging the sim-to-real gap.
4.2. Core Methodology In-depth (Layer by Layer)
The paper's methodology unfolds through several analytical layers:
4.2.1. Levels of Intelligent Robot (Section 2)
The paper proposes a capability grading model for intelligent robots, systematically outlining five progressive levels (IR-L0 to IR-L4). This classification aims to provide a unified framework to assess and guide the development of intelligent robots.
4.2.1.1. Level Criteria
The classification is based on:
- The robot's ability to complete tasks independently (from human control to full autonomy).
- The difficulty of tasks the robot can handle (from simple repetitive labor to innovative problem-solving).
- The robot's ability to work in dynamic or extreme environments.
- The robot's capacity to understand, interact with, and respond to social situations within human society.
4.2.1.2. Level Factors
The intelligent level of robots is graded based on the following four factors:
- Autonomy: The robot's ability to autonomously make decisions across various tasks.
- Task Handling Ability: The complexity of the tasks the robot can perform.
- Environmental Adaptability: The robot's performance in different environments.
- Societal Cognition Ability: The level of intelligence exhibited by robots in social scenarios.
The relationship between graded levels and level factors is listed in Table 1.
The following are the results from Table 1 of the original paper:
| Level | Autonomy | Task Handling Ability | Environmental Adaptability | Societal Cognition Ability |
| --- | --- | --- | --- | --- |
| IR-L0 | Human Control | Basic Tasks | Controlled Only | No Social Cognition |
| IR-L1 | Human Supervised | Complex Navigation | Predictable Environments | Basic Recognition |
| IR-L2 | Human Assisted | Dynamic Collaboration | Adaptive Learning | Simple Interaction |
| IR-L3 | Conditional Autonomy | Multitasking | Dynamic Adaptation | Emotional Intelligence |
| IR-L4 | Full Autonomy | Innovation | Universal Flexibility | Advanced Social Intelligence |
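Table 1 can also be read as a simple lookup structure. The snippet below is purely illustrative (the dictionary keys are ours, not the survey's) and just restates the table programmatically.

```python
# Illustrative only: Table 1 expressed as a small lookup table over the four grading factors.
IR_LEVELS = {
    "IR-L0": {"autonomy": "Human Control",        "task": "Basic Tasks",
              "environment": "Controlled Only",    "social": "No Social Cognition"},
    "IR-L1": {"autonomy": "Human Supervised",     "task": "Complex Navigation",
              "environment": "Predictable Environments", "social": "Basic Recognition"},
    "IR-L2": {"autonomy": "Human Assisted",       "task": "Dynamic Collaboration",
              "environment": "Adaptive Learning",  "social": "Simple Interaction"},
    "IR-L3": {"autonomy": "Conditional Autonomy", "task": "Multitasking",
              "environment": "Dynamic Adaptation", "social": "Emotional Intelligence"},
    "IR-L4": {"autonomy": "Full Autonomy",        "task": "Innovation",
              "environment": "Universal Flexibility", "social": "Advanced Social Intelligence"},
}

print(IR_LEVELS["IR-L3"]["autonomy"])  # Conditional Autonomy
```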
4.2.1.3. Classification Levels
The paper defines five discrete levels:
- IR-L0: Basic Execution Level:
  - Characteristics: Completely non-intelligent, program-driven, focused on repetitive, mechanized, deterministic tasks (e.g., industrial welding, fixed-path material handling). Relies entirely on predefined instructions or teleoperation. Low perception - high execution.
  - Technical Requirements:
    - Hardware: High-precision servomotors, rigid mechanical structures, PLC/MCU-based motion controllers.
    - Perception: Extremely limited (limit switches, encoders).
    - Control Algorithms: Predefined scripts, action sequences, teleoperation (no real-time feedback).
    - Human-Robot Interaction: None, or simple buttons/teleoperation.
- IR-L1: Programmatic Response Level:
  - Characteristics: Limited rule-based reactive capabilities, executes predefined task sequences (e.g., cleaning/reception robots). Uses fundamental sensors to trigger specific behaviors. Operates only in closed-task environments with clear rules. Limited perception - limited execution.
  - Technical Requirements:
    - Hardware: Basic sensors (infrared, ultrasonic, pressure), moderately enhanced processors.
    - Perception: Detection of obstacles, boundaries, simple human movements.
    - Control Algorithms: Rule engines, finite state machines (FSM), basic SLAM (Simultaneous Localization and Mapping) or random walk.
    - Human-Robot Interaction: Basic voice and touch interfaces for simple command-response.
    - Software Architecture: Embedded real-time operating systems with elementary task scheduling.
- IR-L2: Environmental Awareness and Adaptive Response Level:
  - Characteristics: Preliminary environmental awareness and autonomous capabilities, responsive to environmental changes, transitions between multiple task modes (e.g., a service robot delivering water or navigating while avoiding obstacles). Human supervision is still essential. Demonstrates greater execution flexibility and "contextual understanding."
  - Technical Requirements:
    - Hardware: Multimodal sensor arrays (cameras, LiDAR, microphone arrays), enhanced computational resources.
    - Perception: Visual processing, auditory recognition, spatial localization, basic object identification, environmental mapping.
    - Control Algorithms: Finite state machines, behavior trees, SLAM, path planning, obstacle avoidance.
    - Human-Robot Interaction: Speech recognition and synthesis for basic command comprehension/execution.
    - Software Architecture: Modular design for parallel task execution with preliminary priority management.
- IR-L3: Humanoid Cognition and Collaboration Level:
  - Characteristics: Autonomous decision-making in complex, dynamic environments, sophisticated multimodal HRI. Infers user intent, adapts behavior, operates within ethical constraints (e.g., an eldercare robot detecting emotional state and responding appropriately).
  - Technical Requirements:
    - Hardware: High-performance computing platforms, comprehensive multimodal sensor suites (depth cameras, electromyography, force-sensing arrays).
    - Perception: Multimodal fusion (vision, speech, tactile), affective computing for emotion recognition, dynamic user modeling.
    - Control Algorithms: Deep learning architectures (CNNs, Transformers) for perception/language understanding; Reinforcement Learning for adaptive policy optimization; planning and reasoning modules.
    - Human-Robot Interaction: Multi-turn natural language dialogue, facial expression recognition, foundational empathy and emotion regulation.
    - Software Architecture: Service-oriented, distributed frameworks for task decomposition/collaboration; integrated learning/adaptation mechanisms.
    - Safety and Ethics: Embedded ethical governance systems.
- IR-L4: Fully Autonomous Level:
  - Characteristics: The pinnacle of intelligent robotics; complete autonomy in perception, decision-making, and execution in any environment without human intervention. Possesses self-evolving ethical reasoning, advanced cognition, empathy, and long-term adaptive learning. Engages in sophisticated social interactions (multi-turn natural language, emotional understanding, cultural adaptation, multi-agent collaboration). Comparable to the robots of science fiction.
  - Technical Requirements:
    - Hardware: Highly biomimetic structures (full-body, multi-degree-of-freedom articulation), distributed high-performance computing platforms.
    - Perception: Omnidirectional, multi-scale, multimodal sensing systems; real-time environment modeling, intent inference.
    - Control Algorithms: Artificial General Intelligence (AGI) frameworks integrating meta-learning, generative AI, and embodied intelligence; autonomous task generation, advanced reasoning.
    - Human-Robot Interaction: Natural language understanding and generation, complex social context adaptation, empathy, ethical deliberation.
    - Software Architecture: Cloud-edge-client collaborative systems; distributed agent architectures for self-evolution/knowledge transfer.
    - Safety and Ethics: Embedded dynamic ethical decision systems for morally sound choices in dilemmas.
4.2.2. Robotic Mobility, Dexterity, and Interaction (Section 3)
This section reviews current progress in intelligent robotic tasks, covering fundamental technical approaches and advancements in locomotion, manipulation, and human-robot interaction.
4.2.2.1. Related Robotic Techniques
- Model Predictive Control (MPC): Optimization-based approach that predicts future system behavior using a dynamic model and computes control actions by solving an optimization problem at each time step. Handles constraints on inputs and states. (A minimal receding-horizon sketch appears at the end of this list.)
- Whole-Body Control (WBC): Coordinates all joints and limbs simultaneously, formulating motion and force objectives as prioritized tasks solved via optimization or hierarchical control.
- Reinforcement Learning (RL): Agent learns to perform tasks by interacting with the environment and receiving rewards/penalties, discovering optimal actions through trial and error.
- Imitation Learning (IL): Robots learn by observing and mimicking demonstrations (human or other agents), bypassing explicit programming or reward functions. Faces challenges like data cost and generalization.
- Visual-Language-Action (VLA) Models: Cross-modal AI framework integrating visual perception, language understanding, and action generation. Leverages Large Language Models (LLMs) for reasoning, mapping natural language instructions to physical robot actions (e.g., RT-2).
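As referenced in the MPC bullet above, the receding-horizon idea can be sketched in a few lines. The example below uses a toy double-integrator model and random-shooting optimization instead of a real QP solver; the dynamics, cost, and all names are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def simulate(x, u, dt=0.05):
    """Toy double-integrator dynamics: state x = [position, velocity], input u = force."""
    pos, vel = x
    return np.array([pos + vel * dt, vel + u * dt])

def mpc_action(x, goal, horizon=20, n_candidates=256, rng=np.random.default_rng(0)):
    """Receding-horizon control via random shooting: sample input sequences,
    roll them out with the model, and return the first input of the best sequence."""
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    costs = np.zeros(n_candidates)
    for i, seq in enumerate(candidates):
        state = x.copy()
        for u in seq:
            state = simulate(state, u)
            costs[i] += (state[0] - goal) ** 2 + 0.01 * u ** 2   # tracking + effort cost
    return candidates[np.argmin(costs), 0]                       # execute only the first action, then replan

# Closed loop: replan at every step from the latest state (the essence of MPC).
state, goal = np.array([0.0, 0.0]), 1.0
for _ in range(100):
    state = simulate(state, mpc_action(state, goal))
print(round(state[0], 2))   # the position moves toward the goal of 1.0
```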
4.2.2.2. Robotic Locomotion
Focuses on natural movement patterns (walking, running, jumping) and dynamic adaptation.
- Legged Locomotion:
  - Unstructured Environment Adaptation: The ability to maintain stable walking in complex, unknown, or dynamic environments (e.g., rugged terrains, stairs). Early methods used position-controlled robots [57][58][75]. Modern robots use force-controlled joints for better compliance [77][78], enabling more sophisticated algorithms [59][60]. Learning-based methods (e.g., RL with domain randomization [61]; a minimal randomized-reset sketch appears at the end of this subsection) and exteroceptive sensing (depth cameras, LiDAR for height maps [63]) significantly enhanced adaptability.
  - High Dynamic Movements: Achieving stability and agility in high-speed, dynamic movements (running, jumping). Early studies used simplified dynamic models (SLIP, LIPM, SRBM) [79][80]. RL-based methods [81][82] and Imitation Learning [84] have shown promising results for complex dynamic behaviors.
- Fall Protection and Recovery: Strategies to reduce damage during falls and efficiently recover to a standing posture.
  - Model-based Methods: Inspired by human biomechanics, using optimization control to generate damage-reducing motion trajectories and recovery movements [97][98][99].
  - Learning-based Methods: RL and IL offer insensitivity to high-precision models and strong generalization, training robots to recover from falls (e.g., HiFAR [102], HoST [101]).
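The randomized-reset sketch referenced in the locomotion discussion above: domain randomization simply resamples physical parameters at each episode so the policy cannot overfit one simulator configuration. The `env` interface and the parameter ranges below are hypothetical placeholders, not a specific simulator's API.

```python
import random

# Hypothetical ranges; real setups randomize many more parameters (latency, sensor noise, terrain, ...).
RANDOMIZATION_RANGES = {
    "friction":       (0.4, 1.2),   # ground contact friction coefficient
    "payload_kg":     (0.0, 3.0),   # extra mass attached to the robot base
    "motor_strength": (0.8, 1.2),   # scale factor on commanded torques
}

def sample_physics_params(rng=random.Random(0)):
    """Draw one random physics configuration for the next training episode."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANDOMIZATION_RANGES.items()}

def collect_episode(env, policy):
    """Reset the (hypothetical) simulator with freshly randomized physics, then roll out the policy."""
    obs = env.reset(physics=sample_physics_params())
    done, trajectory = False, []
    while not done:
        action = policy(obs)
        obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
    return trajectory
```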
4.2.2.3. Robotic Manipulation
Covers tasks from picking objects to complex assembly.
- Unimanual Manipulation Task: Using a single end effector (gripper or dexterous hand).
  - Gripper-based manipulation: Common for grasping, placing, and tool use. Traditional methods [105][106] struggled with unstructured environments. Learning-based approaches using CNNs for 6D pose estimation [107], affordance learning [109], Imitation Learning with Neural Descriptor Fields (NDFs) [111], Diffusion Policy [3], and Foundation Models like RT-2 [112] have significantly improved capabilities.
  - Dexterous hand manipulation: Aims for human-like versatility and precision using multi-fingered hands. Early work focused on hardware designs [121][122] and theoretical foundations [124][125]. Learning-based methods (two-stage: pose generation then control [126][141]; end-to-end: direct trajectory modeling via RL or IL [142][152]) have become mainstream, often using sim-to-real transfer.
- Bimanual Manipulation Task: Coordinated use of two arms for complex operations (cooperative transport, assembly, handling deformable objects). Challenges include high-dimensional state-action spaces and inter-arm collisions.
  - Early research used inductive biases or structural decompositions (e.g., BUDS [156], SIMPLe [157]). End-to-end approaches with large-scale data collection and Imitation Learning (e.g., the ALOHA series [49][153]) have shown strong generalization.
- Whole-Body Manipulation Control: Interacting with objects using the entire robot body (dual arms, torso, base).
  - Leverages large pre-trained models (LLMs, VLMs) for semantic understanding (e.g., TidyBot [164], MOO [165]). Visual demonstrations guide learning (e.g., OKAMI [167], iDP3 [168]). Reinforcement Learning sim-to-real approaches (e.g., OmniH2O [96]) and Transformer-based low-level control (HumanPlus [6]) are key.
- Foundation Models in Humanoid Robot Manipulation:
  - Hierarchical Approach: Pretrained language or vision-language foundation models serve as high-level task planning and reasoning engines, passing sub-goals to low-level action policies (e.g., Figure AI's Helix [174], NVIDIA's GR00T N1 [175]).
  - End-to-End Approach: Directly incorporates robot operation data into foundation models to construct Vision-Language-Action (VLA) models [4][68] (e.g., Google DeepMind's RT series [112][177]).
4.2.2.4. Human-Robot Interaction (HRI)
Enabling robots to understand and respond to human needs and emotions.
- Cognitive Collaboration: Bidirectional cognitive alignment between robots and humans, understanding explicit instructions and implicit intentions. Relies on complex cognitive architectures and multimodal information processing (e.g., Lemaignan et al. [182], multimodal intention learning [183]). LLMs are used for semantic understanding in goal-oriented navigation [186][190].
- Physical Reliability: Coordinated physical actions to ensure safety and efficiency. Relies on motion planning (sampling-based methods like PRM/RRT [193][194], optimization-based methods like CHOMP [200]) and control strategies (impedance/admittance control [208]). Imitation Learning and Reinforcement Learning with large-scale generative datasets from simulation are advancing this area [217][191].
- Social Embeddedness: The ability to recognize and adapt to social norms, cultural expectations, and group dynamics. Involves social space understanding [219][220] and behavior understanding (linguistic and non-linguistic cues) [224][230].
The following figure (Figure 2 from the original paper) illustrates the different levels of intelligent robots and their relationship with physical simulators and world models:
(Figure 2: a schematic of the different levels of intelligent robots and their relationship with physical simulators and world models; the left side lists the five levels, from basic execution to full autonomy, while the right side describes robotic mobility, dexterity, and interaction.)
4.2.3. General Physical Simulators (Section 4)
This section details the role of simulators in addressing data bottlenecks and the sim-to-real gap, followed by a comparative analysis of mainstream platforms.
4.2.3.1. Mainstream Simulators
The paper introduces several key simulators:
- Webots [239]: Integrated framework for robot modeling, programming, and simulation (open-sourced 2018). Multi-language APIs, cross-platform. Lacks deformable bodies, fluid dynamics, and other advanced physics.
- Gazebo [15]: Widely adopted open-source simulator (2002), extensible, integrates with ROS. Modular plugin system. Lacks suction, deformable objects, and fluid dynamics.
- MuJoCo [16]: Physics engine for contact-rich dynamics in articulated systems (2012, acquired by Google DeepMind in 2021). High-precision physics, optimized generalized-coordinate formulation. Excels in contact dynamics and RL. Limited rendering; no fluid/DEM/LiDAR simulation.
- PyBullet [240]: Python interface for the Bullet physics engine (2017). Lightweight, easy-to-integrate, open-source. Slightly less fidelity than some mainstream simulators.
- CoppeliaSim [241]: General-purpose robot simulation (around 2010, formerly V-REP). Distributed control architecture, supports various middleware. The educational edition is open-source.
- NVIDIA Isaac Series:
  - Isaac Gym [242]: Pioneered GPU-accelerated physics simulation (2021), parallel training of thousands of environments. Built on PhysX. Limited rendering fidelity; no ray tracing/fluid/LiDAR.
  - Isaac Sim [243]: Full-featured digital twin simulator, integrates Omniverse [244]. PhysX 5, RTX-based real-time ray tracing, high-fidelity LiDAR simulation. Supports USD (Universal Scene Description) [245].
  - Isaac Lab [246]: Modular RL framework on Isaac Sim. Tiled rendering for multi-camera inputs. Supports IL and RL.
- SAPIEN [247]: Simulation platform for physically realistic modeling of complex, part-level interactive objects (2020). Led to the PartNet-Mobility dataset and ManiSkill benchmarks [248][249]. Lacks soft-body and fluid dynamics, ray tracing, LiDAR, GPS, and ROS integration.
- Genesis [250]: General-purpose physical simulation platform (2024). Unifies rigid-body, MPM, SPH, FEM, PBD, and fluid solvers. Generative data engine (natural language prompts). Differentiable physics. No LiDAR/GPS/ROS.
- NVIDIA Newton [251]: Open-source physics engine (2025). Built on NVIDIA Warp for GPU acceleration. Differentiable physics. Compatible with MuJoCo Playground and Isaac Lab. OpenUSD-based scene construction.
The following figure (Figure 14 from the original paper) shows mainstream simulators for robotic research:
(Figure 14: an overview of mainstream simulators for robotics research, including Webots, Gazebo, CoppeliaSim, PyBullet, Genesis, Isaac Gym, Isaac Sim, Isaac Lab, MuJoCo, and SAPIEN. These simulators provide high-fidelity environments for training and evaluating robotic agents.)
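For a concrete feel of the simulators listed above, the snippet below builds a tiny scene and steps the physics with the open-source MuJoCo Python bindings (assuming `pip install mujoco`, version 3.x); the MJCF model string is a minimal illustrative example, not taken from the survey.

```python
import mujoco

# Minimal MJCF scene: a free-falling box above a ground plane.
MJCF = """
<mujoco>
  <worldbody>
    <geom type="plane" size="1 1 0.1"/>
    <body pos="0 0 0.5">
      <freejoint/>
      <geom type="box" size="0.05 0.05 0.05" mass="0.1"/>
    </body>
  </worldbody>
</mujoco>
"""

model = mujoco.MjModel.from_xml_string(MJCF)
data = mujoco.MjData(model)

for _ in range(500):          # 500 steps at the default 2 ms timestep = 1 simulated second
    mujoco.mj_step(model, data)

print(data.qpos[:3])          # box position after falling onto the plane
```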
4.2.3.2. Physical Properties of Simulators
High-fidelity physical property simulation enhances realism and algorithm generalization. Table 2 summarizes support for various types of physical simulation.
- Suction: Modeling non-rigid attachment (e.g., vacuum grasping). MuJoCo (user-defined logic), Gazebo (plugins), Webots, CoppeliaSim, and Isaac Sim (native module support).
- Random external forces: Simulating environmental uncertainties (collisions, wind). Most platforms support it; Isaac Gym is suited to efficient large-scale scenarios.
- Deformable objects: Materials changing shape under force (cloth, ropes, soft robots). MuJoCo and PyBullet (basic); Isaac Gym, Isaac Sim, and Isaac Lab (advanced, GPU/PhysX-based); Genesis (integrates state-of-the-art solvers).
- Soft-body contacts: Interactions between soft materials. Webots, Gazebo, MuJoCo, CoppeliaSim, and PyBullet (basic); Isaac Gym, Isaac Sim, Isaac Lab, and Genesis (advanced, GPU/FEM).
- Fluid mechanism: Motion and interaction of liquids/gases. Webots and Gazebo (basic); Isaac Sim (particle-based); Genesis (native high-fidelity). Others lack native support.
- DEM (Discrete Element Method) simulation: Models objects as rigid particles, simulating contact/collision/friction (granular materials). Not natively supported by mainstream simulators, though Gazebo can extend via plugins for indirect simulation.
- Differentiable physics: The simulator's ability to compute gradients of physical states with respect to input parameters, enabling end-to-end optimization. MuJoCo XLA (via JAX), PyBullet (Tiny Differentiable Simulator), and Genesis (ground-up design). A toy example of such a gradient follows Table 2.
The following are the results from Table 2 of the original paper:
| Simulator | Physics Engine | Suction | Random external forces | Deformable objects | Soft-body contacts | Fluid mechanism | DEM simulation | Differentiable physics |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Webots | ODE (default) | ✓ | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ |
| Gazebo | DART (default) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
| MuJoCo | MuJoCo | - | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ |
| CoppeliaSim | Bullet, ODE, Vortex, Newton | ✓ | S | ✓ | S | ✗ | ✗ | ✓ |
| PyBullet | Bullet | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ |
| Isaac Gym | PhysX, FleX (GPU) | ✗ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ |
| Isaac Sim | PhysX (GPU) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Isaac Lab | PhysX (GPU) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| SAPIEN | PhysX | ✓ | ✓ | ✗ | ✓ | ✗ | ✓ | ✓ |
| Genesis | Custom-designed | + | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Note: In the table, ✓ indicates support, ✗ indicates lack of support, S indicates support via scripting, - indicates user-defined logic, and + indicates complex simulation via custom solvers.
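The toy example referenced under "Differentiable physics" above: a hand-written one-dimensional simulator whose gradient with respect to an input parameter is available in closed form, which is exactly the capability a differentiable physics engine exposes for end-to-end optimization. This is an illustrative sketch, not tied to any engine in Table 2.

```python
# Toy differentiable "simulator": 1-D point mass under gravity, explicit Euler integration.
def rollout(v0, steps=100, dt=0.01, g=-9.81):
    """Return the final height of a point mass launched upward with initial velocity v0."""
    x, v = 0.0, v0
    for _ in range(steps):
        x, v = x + v * dt, v + g * dt
    return x

def rollout_grad(steps=100, dt=0.01):
    """d(final position)/d(v0): each Euler step adds v0 * dt, so the gradient is steps * dt."""
    return steps * dt

# Gradient descent on v0 so that the mass reaches a target height after 1 simulated second.
target, v0, lr = 2.0, 0.0, 0.5
for _ in range(200):
    loss_grad = 2.0 * (rollout(v0) - target) * rollout_grad()   # chain rule through the rollout
    v0 -= lr * loss_grad
print(round(v0, 2))   # ~6.86: the launch speed needed to be at height 2.0 m after 1 s under gravity
```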
4.2.3.3. Rendering Capabilities
Crucial for diminishing the sim-to-real gap. Table 3 compares rendering features.
- Rendering Engine: Core software for creating 2D images from 3D scenes.
  - OpenGL: Webots (WREN), MuJoCo, CoppeliaSim, PyBullet.
  - Vulkan: Isaac Gym, SAPIEN (SapienRenderer).
  - Omniverse RTX Renderer: Isaac Sim, Isaac Lab (leveraging NVIDIA's RTX technology).
  - PyRender + LuisaRender: Genesis.
- Ray Tracing: Simulates the physical behavior of light for accurate shadows, reflections, global illumination, and realistic sensor simulation.
  - No native real-time support: Webots, MuJoCo, PyBullet.
  - Static image generation: CoppeliaSim (POV-Ray tracer).
  - Robust real-time support: Isaac Sim, Isaac Lab (Omniverse RTX).
  - Significant support: SAPIEN (SapienRenderer), Genesis.
  - Path towards support: Gazebo (experimental NVIDIA OptiX).
- Physically-Based Rendering (PBR): Models light interaction with materials based on physical properties for realistic visuals.
  - Support: Webots (WREN), modern Gazebo (Ignition Rendering), Isaac Sim, Isaac Lab, SAPIEN, Genesis.
  - Lack support: MuJoCo, CoppeliaSim, PyBullet, Isaac Gym.
- Parallel Rendering: Rendering multiple independent simulation environments simultaneously.
  - Strong capabilities: Isaac Gym, Isaac Sim/Lab, SAPIEN (ManiSkill), Genesis.
  - Limited / not a primary strength: Webots, Gazebo, MuJoCo, CoppeliaSim, PyBullet.
The following are the results from Table 3 of the original paper:
| Simulator | Rendering Engine | Ray Tracing | Physically-Based Rendering | Scalable Parallel Rendering |
| --- | --- | --- | --- | --- |
| Webots | WREN (OpenGL-based) | ✗ | ✓ | ✗ |
| Gazebo | Ogre (OpenGL-based) | ✓ | ✓ | ✓ |
| MuJoCo | OpenGL-based | ✗ | ✗ | ✗ |
| CoppeliaSim | OpenGL-based | ✗ | ✗ | ✗ |
| PyBullet | OpenGL-based (GPU), TinyRenderer (CPU) | ✓ | ✗ | ✓ |
| Isaac Gym | Vulkan-based | ✗ | ✓ | ✓ |
| Isaac Sim | Omniverse RTX Renderer | ✓ | ✓ | ✓ |
| Isaac Lab | Omniverse RTX Renderer | ✓ | ✓ | ✓ |
| SAPIEN | SapienRenderer (Vulkan-based) | ✓ | ✓ | ✓ |
| Genesis | PyRender + LuisaRender | ✓ | ✓ | ✓ |
Note: ✓ indicates support, ✗ indicates lack of support.
4.2.3.4. Sensor and Joint Component Types
Realistic sensor models and accurate joint simulation are crucial. Table 4 summarizes support.
- Sensors: Most mainstream platforms support RGB camera, IMU (Inertial Measurement Unit), and force contact sensing.
  - High-fidelity: Isaac Sim, Isaac Lab, Genesis.
  - Limited: Isaac Gym (vision limited, often combined with Isaac Sim).
  - Specific gaps: Isaac Gym and SAPIEN (no native LiDAR); MuJoCo, PyBullet, and SAPIEN (no GPS).
- Joint types: Define robot degrees of freedom (DOF) and flexibility.
  - Commonly supported: Floating, Fixed, Hinge (revolute), Spherical, Prismatic.
  - Less common: Helical joints (coupled rotational and translational motion), natively implemented only in Gazebo and CoppeliaSim.
The following are the results from Table 4 of the original paper:
| Simulator | IMU / Force contact / RGB Camera | LiDAR | GPS | Floating / Fixed / Hinge / Spherical / Prismatic | Helical |
| --- | --- | --- | --- | --- | --- |
| Webots | ✓ | ✓ | ✗ | ✓ | ✗ |
| Gazebo | ✓ | ✓ | ✓ | ✓ | ✓ |
| MuJoCo | ✓ | ✗ | ✗ | ✓ | ✗ |
| CoppeliaSim | ✓ | ✓ | ✓ | ✓ | ✓ |
| PyBullet | ✓ | ✗ | ✗ | ✓ | ✗ |
| Isaac Gym | ✓ | ✗ | ✗ | ✓ | ✗ |
| Isaac Sim | ✓ | ✓ | ✓ | ✓ | ✗ |
| Isaac Lab | ✓ | ✓ | ✓ | ✓ | ✗ |
| SAPIEN | ✓ | + | ✗ | ✓ | ✗ |
| Genesis | ✓ | ✗ | ✗ | ✓ | ✗ |
Note: ✓ indicates support, ✗ indicates lack of support, and + indicates partial support.
The following figure (Figure 15 from the original paper) shows main joint types in simulators:
(Figure 15: a schematic of the six main joint types (floating, hinge, spherical, prismatic, fixed, and helical), which are used in robot simulators to realize different motion mechanisms.)
4.2.3.5. Discussions and Future Perspectives
- Advantages of Simulators: Cost-effectiveness, safety, control (over variables), repeatability.
- Challenges of Simulators: Accuracy (simplifications), complexity (real-world systems are intricate), data dependency (calibration/validation), overfitting (to specific scenarios).
- Future Perspectives: These limitations highlight the need for world models: more sophisticated, adaptable modeling frameworks that leverage ML/AI to adapt to new data, handle complex systems, and reduce reliance on extensive datasets.
4.2.4. World Models (Section 5)
This section delves into the concept of world models as generative AI models understanding real-world dynamics, inspired by human cognition.
4.2.4.1. Definition and Motivation
World models are "generative AI models that understand the dynamics of the real world, including physics and spatial properties" [17]. Pioneering work by Ha and Schmidhuber [18] showed agents learning compressed, generative models for internal simulation. Recent video generation models like Sora [263] and Kling [264] emphasize using these as world simulators. Yann LeCun advocates for video-based world models to achieve human-level cognition, proposing V-JEPA [266] to learn abstract representations.
The following figure (Figure 16 from the original paper) illustrates the role and training of world models in AI systems:
(Figure 16: a schematic of learning a global world model through task-free exploration in a reward-free environment; the learned model supports prediction and adaptation to different tasks (A, B, C) with zero-shot or few-shot adaptation.)
4.2.4.2. Representative Architectures of World Models
The architectures reflect different approaches to representing and predicting the world, evolving from compact latent dynamics models to powerful generative architectures.
- Recurrent State Space Model (RSSM): Uses a compact latent space to encode environmental states and a recurrent structure to model temporal dynamics. Enables long-horizon prediction by simulating futures in latent space. Popularized by the Dreamer series [267][271]. (A toy latent-rollout sketch is given after the figures below.)
- Joint-Embedding Predictive Architecture (JEPA): Models the world in an abstract latent space, but learns by predicting abstract-level representations of missing content in a self-supervised manner, without explicit generative decoders (I-JEPA [273], V-JEPA [274][275]).
- Transformer-based State Space Models: Replace RNNs with attention-based sequence modeling for latent dynamics modeling, capturing long-range dependencies and offering parallelism. Examples include TransDreamer [276], TWM [277], and Genie [278].
- Autoregressive Generative World Models: Treat world modeling as sequential prediction over tokenized visual observations, using Transformers to generate future observations conditioned on past context. They often integrate multimodal inputs (actions, language). Examples include CogVideo [279], NUWA [280], VideoPoet [281], GAIA-1 [282], and OccWorld [283].
- Diffusion-based Generative World Models: Iteratively denoise from noise to synthesize temporally consistent visual sequences, offering stable training and superior fidelity. The field has shifted from pixel-space diffusion to latent-space modeling for efficiency (Imagen Video [285], VideoLDM [286], SVD [287]). OpenAI's Sora [263] and Google DeepMind's Veo 3 [288] demonstrate visual realism and 3D/physical dynamics modeling. Such models are increasingly adopted for simulating future observations (e.g., DriveDreamer [289], Vista [290], GAIA-2 [291]).
The following figure (Figure 17 from the original paper) displays representative architectures and applications of world models:
(Figure 17: representative world model architectures and their applications, including recurrent state space models, joint-embedding predictive architectures, and diffusion-based and Transformer-based world models, with application examples across years such as autonomous driving and general-purpose robots.)
The following figure (Figure 18 from the original paper) compares autoregressive transformer-based world models and video diffusion-based world model:
(Figure 18: a comparison of autoregressive Transformer-based world models (e.g., GAIA-1) and video diffusion-based world models (e.g., Vista); the left side shows the structure of the autoregressive process, while the right side outlines the temporal evolution and future-prediction capability of video diffusion models.)
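The toy latent-rollout sketch referenced in the RSSM bullet above: observations are encoded into a latent state, and the dynamics are rolled forward in latent space conditioned on actions, with rewards predicted directly from latents rather than decoding back to pixels at every step. The module choices and sizes are illustrative stand-ins (assuming PyTorch), not any specific published architecture.

```python
import torch
import torch.nn as nn

class TinyLatentWorldModel(nn.Module):
    """Illustrative recurrent latent dynamics model: obs -> latent, then (latent, action) -> next latent."""
    def __init__(self, obs_dim=64, act_dim=4, latent_dim=32):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, latent_dim)      # posterior from the observation
        self.dynamics = nn.GRUCell(act_dim, latent_dim)    # prior transition in latent space
        self.reward_head = nn.Linear(latent_dim, 1)        # predict reward from the latent state

    def imagine(self, obs, actions):
        """Roll out an imagined trajectory in latent space for a sequence of actions."""
        h = torch.tanh(self.encoder(obs))
        rewards = []
        for a in actions:                                   # no real environment steps here
            h = self.dynamics(a.unsqueeze(0), h)
            rewards.append(self.reward_head(h))
        return torch.cat(rewards)

model = TinyLatentWorldModel()
obs = torch.randn(1, 64)
actions = torch.randn(10, 4)
print(model.imagine(obs, actions).shape)                    # 10 imagined reward predictions
```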
4.2.4.3. Core Roles of World Models
World models serve as general-purpose representations of the environment, enabling various applications.
- World Models as Neural Simulators: Generate controllable, high-fidelity synthetic experiences across vision and action domains.
  - NVIDIA's Cosmos series [294]: Unified platform for building foundation video models as general-purpose world simulators, adaptable via fine-tuning (e.g., Cosmos-Transfer1 [295] for spatially conditioned, multi-modal video generation).
  - Domain-specific simulators: Wayve's GAIA series (GAIA-1 [282], GAIA-2 [291]) for realistic and controllable traffic simulation.
  - 3D-structured neural simulators: Explicitly model physical occupancy or scene geometry (e.g., DriveWorld [296], DOME [297], AETHER [298], DeepVerse [299]).
- World Models as Dynamic Models: In model-based reinforcement learning (MBRL), agents build an internal model of the environment (including dynamic and reward models) to simulate interactions and improve policy learning. World models capture environment dynamics directly from data.
  - Dreamer series [268][269][270]: Systematically explores latent-space modeling of environment dynamics from visual input using RSSM (e.g., DreamerV3 [270] achieved state-of-the-art performance across diverse visual control tasks). DayDreamer [271] validated real-world applicability on physical robots.
  - Pretraining on real-world video data: ContextWM [301] learns generalizable visual dynamics unsupervised.
  - Token-based approaches: iVideoGPT [302] tokenizes videos, actions, and rewards into multi-modal sequences, using a Transformer for autoregressive prediction.
- World Models as Reward Models: Address the challenge of designing effective reward signals in RL by inferring rewards automatically. Generative world models trained for video prediction can serve as implicit reward models, interpreting the model's prediction confidence as a learned reward signal.
  - VIPER [305]: Trains an autoregressive video world model on expert demonstrations and uses the model's prediction likelihood as the reward. Enables learning high-quality policies without manual rewards and supports cross-embodiment generalization.
The following figure (Figure 19 from the original paper) illustrates the general framework of model-based RL:
(Figure 19: a schematic of the model-based RL framework with three main components: a dynamics model of the environment, a reward model, and a policy model (the agent). The policy generates an action from the current state and receives a reward, while the next state is predicted; the interplay of these components is used to improve policy learning.)
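A minimal sketch of the reward-model role described above, in the spirit of VIPER: a pretrained generative world model scores how likely the agent's newest observation is given its recent history, and that log-likelihood replaces a hand-designed reward. The `video_model`, `env`, and `policy` interfaces are hypothetical placeholders.

```python
def world_model_reward(video_model, history, observation):
    """Use a pretrained generative world model as a reward function:
    a higher predicted likelihood of the new observation means the behavior looks more expert-like."""
    log_prob = video_model.log_likelihood(observation, context=history)   # hypothetical interface
    return float(log_prob)

def rollout_with_learned_reward(env, policy, video_model, steps=100):
    """Collect an episode while scoring each transition with the world-model-derived reward."""
    history, obs = [], env.reset()
    total = 0.0
    for _ in range(steps):
        action = policy(obs)
        obs, _, done = env.step(action)            # ignore the environment's own reward
        total += world_model_reward(video_model, history[-16:], obs)
        history.append(obs)
        if done:
            break
    return total
```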
4.2.5. World Models for Intelligent Agents (Section 6)
This section explores specific applications and challenges of world models in autonomous driving and articulated robots.
4.2.5.1. World Models for Autonomous Driving
Autonomous driving requires real-time comprehension and forecasting of complex, dynamic road environments. Video generation-based world models are well-suited due to their capacity for capturing physics and dynamic interactions. The paper categorizes them into Neural Simulator, Dynamic Model, and Reward Model.
The following figure (Figure 21 from the original paper) illustrates the world model for autonomous driving as a Neural Simulator, Dynamic Model, and Reward Model:
(Figure 21: a schematic of the relationship between the neural simulator, dynamic model, and reward model roles. Inputs consisting of the current driving state and various conditions are processed by an encoder and the world model, then used for downstream planning and control; each module produces its outputs through a decoder.)
- WMs as Neural Simulators for Autonomous Driving: Generate realistic driving scenarios for training and testing.
  - GAIA-1 [282]: Pioneered sequence modeling of driving videos with multimodal inputs (video, text, action) via an autoregressive transformer.
  - GAIA-2 [312]: Advanced controllable generation with structured conditioning (ego-vehicle dynamics, multi-agent interactions) using a latent diffusion backbone.
  - DriveDreamer [313] / DriveDreamer-2 [314] / DriveDreamer4D [315]: Diffusion-based generation with structured traffic constraints, integrating LLMs for natural language-driven scenario generation and 4D driving scene representation.
  - MagicDrive [316] / MagicDrive3D [317] / MagicDrive-V2 [318]: Novel street-view generation with diverse inputs and cross-view attention, controllable 3D generation, and high-resolution long-video generation using a Diffusion Transformer.
  - WoVoGen [320]: Explicit 4D world volumes for multi-camera video generation with sensor interconnectivity.
  - Occupancy representations: OccSora [321] (diffusion-based 4D occupancy generation), DriveWorld [322] (occupancy-based Memory State-Space Models), Drive-OccWorld [323] (occupancy forecasting with end-to-end planning).
- WMs as Dynamic Models for Autonomous Driving: Learn underlying physics and motion patterns for perception, prediction, and planning.
  - MILE [350]: Model-based imitation learning for urban driving, jointly learning a predictive world model and driving policy.
  - TrafficBots [352]: Multi-agent traffic simulation with configurable agent personalities.
  - Occupancy-based representations: UniWorld [353] (4D geometric occupancy prediction), OccWorld [283] (vector-quantized variational autoencoders), GaussianWorld [368] (4D occupancy forecasting).
  - Sensor fusion & geometric understanding: MUVO [356] (spatial voxel representations from camera/LiDAR), ViDAR [357] (visual point cloud forecasting).
  - Self-supervised learning: LAW [360] (predicting future latent features without perception labels).
  - Integrated reasoning: Cosmos-Reason1 [351] (physical common sense, embodied reasoning), Doe-1 [367] (autonomous driving as next-token generation), DrivingGPT [370] (driving world modeling and trajectory planning).
- WMs as Reward Models for Autonomous Driving: Evaluate the quality and safety of driving behaviors for policy optimization.
  - Vista [376]: Generalizable reward functions using the model's own simulation capabilities.
  - WoTE [379]: Trajectory evaluation using Bird's-Eye View (BEV) world models for real-time safety assessment.
  - Drive-WM [378]: Multi-future trajectory exploration with image-based reward evaluation.
  - Iso-Dream [375]: Separates controllable vs. non-controllable dynamics.

The following figure (Figure 22 from the original paper) illustrates how the world model processes information from multi-view images during encoding and decoding:

Figure 22: Schematic of how the world model handles multi-view image information during encoding and decoding. Multi-camera images on the left are processed by an encoder and passed to the world model module together with time-step and condition embeddings, denoised, and finally decoded into the reconstructed scene shown on the right.
4.2.5.2. World Models for Articulated Robots
Articulated robots (robotic arms, quadrupeds, humanoids) have stringent world modeling requirements for complex locomanipulation tasks.
- WMs as Neural Simulators for Articulated Robots: Generate high-fidelity digital environments.
  - NVIDIA's Cosmos World Foundation Model Platform [294]: Unified framework for physics-accurate 3D video predictions, facilitating sim-to-real transfer.
  - WHALE [381]: Generalizable world model with behavior conditioning for OoD generalization and uncertainty estimation.
  - RoboDreamer [382]: Compositional world model for robotic decision-making by factorizing video generation into primitives.
  - DreMa [383]: Compositional world model combining Gaussian Splatting and physics simulation for photorealistic future prediction.
  - DreamGen [384]: Trains generalizable robot policies via neural trajectories synthesized by video world models.

The following figure (Figure 24 from the original paper) illustrates the workflow of the Cosmos-Predict world model:

Figure 24: Schematic of the Cosmos-Predict world model workflow. The input video is processed by a 3D patchify step and self-attention layers, cross-attended with text conditions, and reconstructed into a high-fidelity video output; the model predicts future states from the current state to guide robot task execution.
- WMs as Dynamic Models for Articulated Robots: Learn predictive representations of environmental dynamics for MBRL.
  - PlaNet [429]: Latent dynamics model for pixel-based planning.
  - Plan2Explore [262]: Self-supervised RL agent seeking future novelty via model-based planning.
  - Dreamer series [393][394][270]: Learn latent-state dynamics from high-dimensional observations.
  - Dreaming [395] / DreamingV2 [396]: Reconstruction-free MBRL by eliminating Dreamer's decoder.
  - LEXA [398]: Unified framework for unsupervised goal-reaching.
  - FOWM [399]: Offline world model pretraining with online finetuning.
  - DWL [401]: End-to-end RL for humanoid locomotion enabling zero-shot sim-to-real transfer.
  - Puppeteer [404]: Hierarchical world model for visual whole-body humanoid control.
  - PIVOT-R [406]: Primitive-driven, waypoint-aware world model for language-guided manipulation.
  - SafeDreamer [408]: Integrates Lagrangian-based methods with world model planning for safe RL.
  - V-JEPA 2 [275]: 1.2B-parameter world model for video-based understanding, prediction, and zero-shot planning.

The following figure (Figure 25 from the original paper) illustrates latent dynamics models employing distinct transition mechanisms for temporal prediction:

Figure 25: Diagram of three types of latent dynamics models: a deterministic model (RNN), a stochastic model (SSM), and a recurrent state-space model (RSSM). Each structure expresses a different relationship among actions, hidden states, and observations.
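As a rough illustration of the recurrent state-space model (RSSM) structure in Figure 25, the following PyTorch sketch implements a single transition step with a deterministic GRU path plus a stochastic latent drawn from either a prior (imagination, action only) or a posterior (filtering, action plus an observation embedding). Layer sizes and names are illustrative assumptions, not the Dreamer implementation.

```python
# Minimal RSSM-style transition: deterministic recurrent path + stochastic latent.
import torch
import torch.nn as nn

class RSSMStep(nn.Module):
    def __init__(self, stoch=8, deter=32, action_dim=2, embed_dim=16):
        super().__init__()
        self.cell = nn.GRUCell(stoch + action_dim, deter)        # deterministic path h_t
        self.prior_net = nn.Linear(deter, 2 * stoch)             # mean, log-std from h_t only
        self.post_net = nn.Linear(deter + embed_dim, 2 * stoch)  # mean, log-std from h_t + obs

    def forward(self, z_prev, a_prev, h_prev, obs_embed=None):
        h = self.cell(torch.cat([z_prev, a_prev], dim=-1), h_prev)
        stats = self.post_net(torch.cat([h, obs_embed], dim=-1)) if obs_embed is not None \
                else self.prior_net(h)
        mean, log_std = stats.chunk(2, dim=-1)
        z = mean + log_std.exp() * torch.randn_like(mean)        # reparameterized sample
        return z, h

step = RSSMStep()
z, h, a = torch.zeros(1, 8), torch.zeros(1, 32), torch.zeros(1, 2)
z, h = step(z, a, h)                       # imagination: prior transition only
z, h = step(z, a, h, torch.zeros(1, 16))   # filtering: posterior with an observation embedding
```

Training such a model additionally regularizes the posterior toward the prior (a KL term) so that imagined rollouts stay consistent with filtered rollouts.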
- WMs as Reward Models for Articulated Robots: Implicitly infer rewards by measuring alignment with model predictions.
  - PlaNet [267]: Uses an explicitly learned reward predictor as part of the dynamics model.
  - VIPER [427]: Uses pretrained video prediction models as reward signals, interpreting prediction likelihoods as rewards.

The following figure (Figure 26 from the original paper) presents three different architectures: Joint-Embedding Architecture, Generative Architecture, and Joint-Embedding Predictive Architecture:

Figure 26: Schematic of three architectures: (a) joint-embedding, (b) generative, and (c) joint-embedding predictive. The diagram shows the encoders and decoders involved and their relationships, with the key discriminator/predictor module playing an important role across the different architectures.
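To illustrate the joint-embedding predictive idea in panel (c) of Figure 26, here is a minimal PyTorch sketch in which a predictor maps a context embedding to the embedding of a target view, and the loss is computed in representation space rather than pixel space. The encoders, dimensions, and the stop-gradient target branch are simplifying assumptions for illustration, not V-JEPA's actual architecture.

```python
# Minimal joint-embedding predictive loss: predict the target's *embedding*,
# not its pixels (contrast with the generative architecture in panel (b)).
import torch
import torch.nn as nn

dim = 64
context_encoder = nn.Sequential(nn.Linear(128, dim), nn.ReLU(), nn.Linear(dim, dim))
target_encoder = nn.Sequential(nn.Linear(128, dim), nn.ReLU(), nn.Linear(dim, dim))
predictor = nn.Linear(dim, dim)

x_context = torch.randn(16, 128)   # e.g., visible/past part of the input
x_target = torch.randn(16, 128)    # e.g., masked/future part of the input

s_context = context_encoder(x_context)
with torch.no_grad():              # target branch is not updated by this loss
    s_target = target_encoder(x_target)

loss = nn.functional.mse_loss(predictor(s_context), s_target)
loss.backward()                    # gradients flow only through the context encoder and predictor
```

Working in embedding space lets the model ignore unpredictable low-level detail, which is the main argument for JEPA-style world models over pixel-reconstruction objectives.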
4.2.5.3. Technical Trends and Challenges
- Technical Trends: Tactile-enhanced world models for dexterous manipulation, unified world models for cross-hardware/cross-task generalization, and hierarchical world models for long-horizon tasks.
- Challenges:
- High-Dimensionality and Partial Observability: Handling vast sensory data and inherent environmental uncertainty.
- Causal Reasoning versus Correlation Learning: Moving beyond correlations to understand underlying physics and intent.
- Abstract and Semantic Understanding: Integrating fine-grained physical predictions with abstract concepts (traffic laws, intent, affordances).
- Systematic Evaluation and Benchmarking: Developing metrics that correlate with downstream task performance.
- Memory Architecture and Long-Term Dependencies: Retaining and retrieving relevant information over extended timescales.
- Human Interaction and Predictability: Ensuring legible, predictable, and socially compliant agent behavior.
- Interpretability and Verifiability: Understanding model rationale for safety-critical applications.
- Compositional Generalization and Abstraction: Learning disentangled representations for understanding novel scenarios by composing known concepts.
- Data Curation and Bias: Addressing data quality, bias, and learning from rare but safety-critical events.
4.3. Algorithmic Flows and Formulas
As a survey paper, the core methodology involves the systematic review and categorization of existing research rather than the introduction of new mathematical algorithms or formulas within the paper's main body. However, the paper discusses various algorithms and models that rely on specific mathematical principles. The methodologies described above for MPC, WBC, RL, IL, VLA, and different world model architectures all implicitly rely on underlying mathematical frameworks (e.g., optimization, probability theory, neural network architectures, attention mechanisms).
For example, the section on Transformer-based State Space Models mentions attention mechanisms. The Scaled Dot-Product Attention is a foundational component of Transformers, calculated as:
$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$
Where:
- $Q$ represents the Query matrix.
- $K$ represents the Key matrix.
- $V$ represents the Value matrix.
- $d_k$ is the dimension of the key vectors, used for scaling to prevent the dot products from becoming too large and pushing the softmax function into regions with very small gradients.
- $QK^T$ computes the similarity scores between queries and keys.
- $\mathrm{softmax}(\cdot)$ normalizes these scores to produce attention weights.
- These weights are then applied to the values to produce the output, which is a weighted sum of the values.
This formula, while not explicitly derived or introduced as new within the survey, is a fundamental building block for many of the Transformer-based world models and VLA models that the paper discusses. The paper's strength lies in organizing and presenting the landscape of how these underlying techniques are applied and combined in embodied AI.
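As a small illustration of the attention formula above, the following numpy sketch computes scaled dot-product attention for random placeholder matrices; real Transformer layers add learned projections, multiple heads, and masking on top of this core operation.

```python
# Scaled dot-product attention, implemented directly from the formula above.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # similarity between queries and keys
    weights = softmax(scores, axis=-1)        # attention weights per query
    return weights @ V                        # weighted sum of the values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 16))
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 16): one output vector per query
```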
5. Experimental Setup
This paper is a survey, so it does not present its own experimental setup in the traditional sense. Instead, it synthesizes the experimental contexts, datasets, evaluation metrics, and baselines from the numerous research papers it reviews. The "experimental setup" here refers to the common practices and benchmarks within the embodied AI field, as highlighted by the authors.
5.1. Datasets
The paper mentions a wide array of datasets commonly used in the fields of robotics, autonomous driving, and world modeling. These datasets are crucial for training and evaluating various embodied AI systems.
- General Robotic Control:
  - DeepMind Control (DMC) suite: A collection of continuous control tasks in a physics simulator (MuJoCo), often used for Reinforcement Learning benchmarks. Tasks range from simple balancing to complex locomotion and manipulation (see the usage sketch after this list).
  - Atari: A suite of classic video games, frequently used to benchmark Reinforcement Learning algorithms that learn from pixel inputs.
  - RLBench: A benchmark for robot learning with a focus on dexterous manipulation tasks, providing a large collection of physically simulated robot manipulation tasks.
  - RoboSuite: An open-source framework for robot control that provides a rich collection of robot manipulation environments.
  - Meta-world: A meta-Reinforcement Learning benchmark that focuses on training agents to quickly adapt to new tasks.
  - Safety-Gymnasium: A Reinforcement Learning benchmark designed to evaluate safe RL algorithms, often used to test safety constraints.
- Autonomous Driving:
  - nuScenes [327]: A large-scale multimodal dataset for autonomous driving in urban environments, collected in Boston and Singapore. It includes LiDAR, radar, camera, and GPS data, along with 3D bounding box annotations. It is chosen for its diversity and comprehensive sensor suite, enabling research in perception, prediction, and planning.
  - Waymo Open Dataset (WOD) [328]: Another large-scale, high-quality autonomous driving dataset from the Waymo self-driving fleet. It contains LiDAR, camera, and motion data for a variety of urban and suburban driving scenarios. Its scale and diversity make it suitable for generalizable autonomous driving research.
  - CARLA simulation [340]: A high-fidelity open-source simulator for autonomous driving research. It provides realistic rendering and physics, allowing for controlled experimentation in various scenarios, including urban driving. It is often used when real-world data collection is impractical.
  - BDD [343]: Berkeley DeepDrive is a large-scale collection of diverse driving videos and images, often used for computer vision tasks in autonomous driving, such as object detection, semantic segmentation, and traffic light recognition.
  - NuPlan [330]: A benchmark for learning-based planning in real-world autonomous driving, providing a diverse set of real-world driving scenarios.
  - Occ3D [331]: A large-scale 3D occupancy prediction benchmark for autonomous driving.
  - Lyft-Level5 [335]: A self-driving motion prediction dataset for research in autonomous vehicle perception and behavior prediction.
  - KITTI [355]: A widely used dataset for autonomous driving research, providing stereo vision, optical flow, visual odometry, 3D object detection, and 3D tracking benchmarks.
  - NAVSIM [361]: A data-driven, non-reactive autonomous vehicle simulation and benchmarking platform.
- Humanoid and Manipulation Robotics:
  - PartNet-Mobility [247]: A comprehensive dataset featuring motion-annotated, articulated 3D objects, used with the SAPIEN simulator for articulated object manipulation.
  - ManiSkill [248] / ManiSkill3 [249]: Benchmarks for generalizable manipulation skills in realistic physics-based environments, providing diverse tasks and high-quality demonstrations.
  - Human motion capture data: Used in Imitation Learning to derive style-based rewards or reference gaits for natural robot movements.
  - RT-X: Robotic Transformer dataset and benchmark.
  - SeaWave benchmark: For robotic manipulation tasks.
- World Model Pre-training:
  - Internet-scale video data: Used for pre-training video generation models and world models to learn generalizable visual dynamics and physical intuition (e.g., Cosmos [294], ContextWM [301]).

These datasets are chosen to cover a wide range of scenarios, from simulated control tasks to complex real-world driving and manipulation, ensuring that proposed methods are evaluated for both fundamental capabilities and practical applicability.
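As referenced in the DMC entry above, the sketch below shows how such benchmark tasks are typically consumed in RL pipelines, assuming the dm_control package is installed; the specific domain and task names are only examples, and the random policy stands in for a learned agent.

```python
# Rolling out a random policy on a DeepMind Control suite task.
import numpy as np
from dm_control import suite

env = suite.load(domain_name="walker", task_name="walk")
spec = env.action_spec()            # bounded action specification
time_step = env.reset()

episode_return = 0.0
while not time_step.last():
    action = np.random.uniform(spec.minimum, spec.maximum, size=spec.shape)
    time_step = env.step(action)
    episode_return += time_step.reward or 0.0   # first step has no reward
print("random-policy episode return:", episode_return)
```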
5.2. Evaluation Metrics
The paper discusses various evaluation metrics implicitly through the results and discussions of the surveyed papers. These metrics generally fall into categories assessing performance, efficiency, fidelity, and safety. For clarity, I will define common metrics in these categories, some of which are explicitly mentioned and others are implied by the context of robot learning and simulation.
- Performance Metrics (Task Completion):
  - Success Rate (SR):
    - Conceptual Definition: The proportion of attempts where an agent successfully completes a given task. It is a direct measure of task-oriented performance.
    - Mathematical Formula: $ \text{SR} = \frac{\text{Number of successful attempts}}{\text{Total number of attempts}} $
    - Symbol Explanation:
      - Number of successful attempts: the count of times the robot achieved the task goal.
      - Total number of attempts: the total number of times the robot attempted the task.
  - Reward (Cumulative Reward):
    - Conceptual Definition: In Reinforcement Learning, the sum of rewards received by an agent over an episode or a series of interactions. The goal of RL is to maximize this cumulative reward (a small computational sketch is given after this metric group).
    - Mathematical Formula: $ R_t = \sum_{k=0}^{T} \gamma^k r_{t+k+1} $
    - Symbol Explanation:
      - $R_t$: Cumulative reward (or return) at time step $t$.
      - $T$: The time horizon or end of the episode.
      - $\gamma$: Discount factor ($0 \leq \gamma \leq 1$), which determines the present value of future rewards.
      - $r_{t+k+1}$: The reward received at time step $t+k+1$.
  - Driving Score / Safety Score:
    - Conceptual Definition: Composite metrics specific to autonomous driving that combine factors such as compliance with traffic rules, collision avoidance, smoothness of driving, and efficiency to assess overall performance and safety.
    - Mathematical Formula: (Highly task-specific, no universal formula; typically a weighted sum of sub-metrics.) $ \text{Driving Score} = w_1 \cdot \text{CollisionRate} + w_2 \cdot \text{TrafficRuleViolations} + w_3 \cdot \text{DrivingComfort} + \dots $
    - Symbol Explanation:
      - $w_i$: Weighting factors for the different components.
      - $\text{CollisionRate}$: Frequency of collisions.
      - $\text{TrafficRuleViolations}$: Number of traffic rule infractions.
      - $\text{DrivingComfort}$: Metric for the smoothness of accelerations and braking.
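The sketch below transcribes the success-rate and discounted-return definitions above into code, using toy per-episode logs as stand-in data.

```python
# Success rate and discounted return, computed directly from their definitions.
import numpy as np

def success_rate(successes):
    """Fraction of attempts that reached the task goal."""
    return sum(successes) / len(successes)

def discounted_return(rewards, gamma=0.99):
    """R_t = sum_k gamma^k * r_{t+k+1} over a finite episode."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

episode_successes = [True, False, True, True]          # toy task outcomes
episode_rewards = np.array([0.0, 0.0, 1.0, 1.0, 5.0])  # toy per-step rewards
print(success_rate(episode_successes))                 # 0.75
print(discounted_return(episode_rewards))              # ≈ 6.75
```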
- Efficiency Metrics:
  - Sample Efficiency:
    - Conceptual Definition: How many interactions (samples) an RL agent needs with the environment to learn a high-performing policy. World models often aim to improve sample efficiency by allowing agents to learn from imagined experiences.
    - Mathematical Formula: (No direct formula; usually expressed as the number of environment steps or episodes required to reach a certain performance level.)
  - Throughput Advantage:
    - Conceptual Definition: A measure of how much faster one system (e.g., a GPU-accelerated simulator) can process tasks or generate data compared to another.
    - Mathematical Formula: $ \text{Throughput Advantage} = \frac{\text{Throughput of Method A}}{\text{Throughput of Method B}} $
    - Symbol Explanation:
      - Throughput of Method A: the rate at which Method A processes units of work (e.g., frames per second, samples per second).
      - Throughput of Method B: the rate at which Method B processes units of work.
- Fidelity & Quality Metrics:
  - Mean Squared Error (MSE):
    - Conceptual Definition: A common metric for quantifying the difference between values predicted by a model and the true values. Often used for video prediction or state prediction in world models.
    - Mathematical Formula: $ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 $
    - Symbol Explanation:
      - $n$: The number of data points.
      - $Y_i$: The observed (true) value for the $i$-th data point.
      - $\hat{Y}_i$: The predicted value for the $i$-th data point.
  - Fréchet Inception Distance (FID):
    - Conceptual Definition: A metric used to assess the quality of images generated by generative models. It measures the distance between the feature distributions of generated and real images, with lower FID indicating higher quality (see the computational sketch after this metric group).
    - Mathematical Formula: $ \text{FID} = ||\mu_1 - \mu_2||^2_2 + \text{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}) $
    - Symbol Explanation:
      - $\mu_1, \mu_2$: The mean feature vectors of real and generated images, respectively, extracted from an Inception-v3 network.
      - $\Sigma_1, \Sigma_2$: The covariance matrices of the feature vectors for real and generated images.
      - $||\cdot||^2_2$: The squared L2 norm.
      - $\text{Tr}(\cdot)$: The trace of a matrix.
      - $(\Sigma_1 \Sigma_2)^{1/2}$: The matrix square root of the product of the covariance matrices.
  - Visual Realism / Fidelity: Often assessed qualitatively by human evaluation or with quantitative metrics such as PSNR (Peak Signal-to-Noise Ratio) or SSIM (Structural Similarity Index Measure) for image/video quality.
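As referenced in the FID entry above, the following sketch computes the FID formula directly, assuming the real and generated image sets have already been converted into Inception-v3 feature matrices (one row per image); the random matrices at the bottom are placeholder inputs.

```python
# Fréchet Inception Distance from precomputed feature matrices.
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real, feats_gen):
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):        # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)

rng = np.random.default_rng(0)
print(frechet_inception_distance(rng.normal(size=(256, 64)),
                                 rng.normal(loc=0.1, size=(256, 64))))
```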
- Generalization Metrics:
- Zero-shot Generalization: The ability of a model to perform a task without any prior training examples for that specific task, relying on knowledge learned from diverse pre-training data.
- Few-shot Adaptation: The ability to quickly learn a new task with a very small number of training examples.
5.3. Baselines
The baselines mentioned or implied in the paper are typically previous state-of-the-art methods or contrasting architectural approaches within each sub-field.
- Robot Control & Learning:
  - Traditional Control Methods: PID controllers, kinematics-based approaches, hard-coded behaviors.
  - Model-Free RL: Actor-critic methods that learn policies directly from interaction without an explicit world model.
  - Earlier RL Approaches: Previous iterations of RL algorithms (e.g., DreamerV1 vs. DreamerV3).
  - Sim-to-Real Baselines: Methods that attempt to transfer policies from simulation to reality without world models or advanced sim-to-real techniques (e.g., domain randomization without world model guidance).
- Simulators:
  - Older Simulators: Comparisons between Webots/Gazebo and newer GPU-accelerated platforms like Isaac Gym/Isaac Sim in terms of speed, fidelity, and specific physics capabilities.
  - CPU-based Physics Engines: Benchmarks against CPU-only physics engines to highlight the speedup of GPU-accelerated ones.
- World Models:
  - Autoregressive vs. Diffusion Models: GAIA-1 (autoregressive) compared with GAIA-2 or DriveDreamer (diffusion-based) for video generation quality and control.
  - Latent Space vs. Pixel Space Models: Dreamer-style RSSMs (latent space) versus iVideoGPT (tokenized pixel space).
  - Non-Generative Predictive Models: Simpler predictive models that do not synthesize full observations.
  - Models without Foundation Model Integration: Comparisons showing the benefits of LLM/VLM integration in VLA models or world models.
- Autonomous Driving:
  - Modular Architectures: Traditional perception-prediction-planning-control pipelines are contrasted with end-to-end or world model-integrated systems.
  - Image-based vs. 3D Occupancy-based Models: Different representations for scene understanding and forecasting.

These baselines serve to highlight the advancements and unique contributions of the methods discussed in the survey by demonstrating improvements in aspects like sample efficiency, generalization, fidelity, speed, and safety.
6. Results & Analysis
This section synthesizes the findings from the various sub-fields reviewed in the paper, drawing conclusions about the effectiveness, advantages, and challenges of physical simulators and world models in embodied AI.
6.1. Core Results Analysis
The survey's analysis consistently highlights that the integration and advancement of physical simulators and world models are driving significant progress in embodied AI.
-
Capability Grading (IR-L0 to IR-L4): The proposed
IR-L framework provides a structured way to assess robot capabilities. It implicitly shows that current research is rapidly moving from IR-L0 (basic execution) and IR-L1 (programmatic response) towards IR-L2 (environmental awareness) and IR-L3 (humanoid cognition and collaboration), with IR-L4 (full autonomy) remaining the ultimate, aspirational goal. This framework helps contextualize the sophistication of various robot control algorithms and AI systems. -
Robotic Mobility and Dexterity:
- Learning-based methods (
RL, IL) have enabled robots to achieve complex locomotion and manipulation skills that were previously intractable with traditional control. Examples include robust bipedal walking on unstructured terrain [61] and agile whole-body movements like jump-shots [84]. Foundation Models (e.g., LLMs, VLMs) are enhancing semantic understanding and task planning, allowing robots to interpret complex instructions and generalize to new tasks, pushing towards more versatile whole-body manipulation [119]. VLA models provide end-to-end mapping from language to action [65]. - Fall protection and recovery have seen substantial improvements with
learning-based methods, offering better robustness and generalization than model-based approaches [101][102].
- Learning-based methods (
-
Physical Simulators' Impact:
Simulatorshave become indispensable for overcoming thecost,safety, andrepeatabilityissues of real-world training.GPU-accelerated simulators likeIsaac Gym,Isaac Sim,Genesis, andNVIDIA Newtonoffer significantly improved data generation efficiency (e.g.,Genesisshows to throughput advantage overIsaac Gymacross various batch sizes). This is crucial for data-hungrydeep reinforcement learningandgenerative AImodels.- High-fidelity rendering (e.g.,
Isaac SimwithRTX-based ray tracing andPBR) helps diminish thesim-to-real gapby providing more realistic visual data for trainingvision-based perception algorithms. Differentiable physicsin emerging simulators (MuJoCo XLA,Genesis,Newton) is a game-changer, enablingend-to-end optimizationand tighter integration withmachine learning models.- Despite advancements,
simulatorsstill faceaccuracy,complexity, andoverfittingchallenges, highlighting the need forworld models.
-
World Models' Transformative Roles:
- Neural Simulators:
Generative world models(especiallydiffusion-basedones likeDriveDreamer,GAIA-2) can synthesizecontrollable,high-fidelitydriving scenarios or robotic interactions [291][313]. This allows for data augmentation, rare event synthesis, andsim-to-real transfer, serving as scalable alternatives to traditional simulators. - Dynamic Models: In
model-based RL,world models(like theDreamerseries) learnlatent-space dynamicsfrom visual inputs, enabling agents tosimulate future statesandplan actionsthrough imagined rollouts. This dramatically improvessample efficiencyandgeneralizationacross tasks [270]. - Reward Models:
World modelscan implicitly inferreward signals(e.g.,VIPER[305]) by assessing how well an agent's behavior aligns with the model's predictions. This addresses the challenging problem ofmanual reward engineeringand supportscross-embodiment generalization.
- Neural Simulators:
-
Autonomous Driving Specifics:
World modelsare evolving fromautoregressivetodiffusion-based architecturesfor superiorgeneration qualityandcontrol.Multi-modal integration(camera,LiDAR, text, trajectories) andstructured conditioningenable the generation of diverse and controllable scenarios, crucial forstress-testing autonomous systems[312][314].- A shift towards
3D spatial-temporal understandingandoccupancy-based representationsprovides bettergeometric consistencyandscene understandingthan purely image-based approaches [321][322]. - Increasing
end-to-end integrationwithautonomous driving pipelinesaims to unifyperception,prediction, andplanningwithin a singleneural architecture[367][370].
-
Articulated Robots Specifics:
-
World modelsenablezero-shot sim-to-real transferfor complexlocomotion[401] andmanipulationtasks [275]. -
Compositional world models(e.g.,RoboDreamer[382]) allow generalization to unseen object-action combinations. -
The
V-JEPA 2 model [275] demonstrates state-of-the-art performance in action recognition, anticipation, and model-predictive control for robotics with minimal real-world data. Overall, the core result is a clear validation of the synergistic power of advanced
physical simulators and generative world models in pushing the boundaries of embodied AI toward more capable, adaptable, and generalizable systems.
-
6.2. Data Presentation (Tables)
The paper provides several comparative tables. I will transcribe them in full.
The following are the results from Table 5 of the original paper:
| Category | Paper | Image | Text | LiDAR | Action | Output | World Model Architecture | Dataset | Code Availability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Neural Simulator | GAIA-1 [282] | ✓ | ✓ | X | ✓ | Image | Transformer | In-house nuScenes | X |
| DriveDreamer [313] | ✓ | X | X | ✓ | Image | Diffusion | nuScenes [327] & In-house | ✓ | |
| ADriver-I [326] | ✓ | ✓ | ✓ | ✓ | Image | Diffusion | In-house | X | |
| GAIA-2 [312] | ✓ | ✓ | ✓ | ✓ | Image | Diffusion | nuScenes | X | |
| DriveDreamer-2 [314] | √ | ✓ | X | ✓ | Image | Diffusion | Waymo Open dataset (WOD) [328] | ✓ | |
| DriveDreamer4D [315] | ✓ | X | ✓ | ✓ | Image | Transformer | NuPlan [330] & In-house | X | |
| DrivingWorld [329] | ✓ | X | X | ✓ | Image | DiT | nuScenes | ✓ | |
| MagicDrive [316] | ✓ | ✓ | X | ✓ | Image | DiT | nuScenes | X | |
| MagicDrive3D [317] | ✓ | X | X | ✓ | Image | DiT | nuScenes | X | |
| MagicDrive-V2 [318] | ✓ | ✓ | X | ✓ | Image | Diffusion | nuScenes, Occ3d [331], nuScenes-lidarseg | X | |
| WoVoGen [320] | √ | X | ✓ | ✓ | Image | Diffusion | Waymo Open dataset (WOD) | ✓ | |
| ReconDreamer [325] | ✓ | X | X | ✓ | Image | Transformer | nuScenes | X | |
| DualDiff+ [332] | √ | X | ✓ | ✓ | Image | Diffusion | nuScenes | X | |
| Panacea [319] | √ | ✓ | X | ✓ | Image | DiT | Cosmos [294] | ✓ | |
| Cosmos-Transfer1 [295] | ✓ | ✓ | X | ✓ | Image | Diffusion | nuScenes | X | |
| GeoDrive [333] | ✓ | X | ✓ | ✓ | Occupancy | Transformer | nuScenes, Openscene [334] | X | |
| DriveWorld [322] | ✓ | X | ✓ | ✓ | Occupancy | Diffusion | nuScenes | ✓ | |
| OccSora [321] | ✓ | X | ✓ | ✓ | Occupancy | Transformer | nuScenes, Lyft-Level5 [335] | X | |
| Drive-OccWorld [323] | ✓ | X | ✓ | ✓ | Occupancy | Diffusion | nuScenes | ✓ | |
| DOME [297] | ✓ | X | ✓ | ✓ | Occupancy | Diffusion | nuScenes | ✓ | |
| Dynamic Model | RenderWorld [336] | ✓ | X | ✓ | ✓ | Occupancy | Transformer | nuScenes | X |
| OccLLama [337] | ✓ | ✓ | X | ✓ | Occupancy | Transformer | NuScenes, Occ3D, NuScenes-QA [338] | X | |
| BEVWorld [339] | ✓ | X | ✓ | ✓ | Image, point cloud | Diffusion | nuScenes, Carla simulation [340] | X | |
| HoloDrive [341] | ✓ | X | ✓ | ✓ | Image, point cloud | Transformer | NuScenes | X | |
| GEM [342] | ✓ | X | ✓ | ✓ | Image, depth | Diffusion | BDD [343], etc [344], [345] | √ | |
| DriveArena [346] | ✓ | X | ✓ | ✓ | Image | Diffusion | nuScenes | ✓ | |
| ACT-Bench [347] | ✓ | ✓ | ✓ | ✓ | Image | Transformer | nuScenes, ACT-Bench | X | |
| InfinityDrive [324] | ✓ | X | ✓ | ✓ | Image | Transformer | nuScenes | X | |
| Epona [293] | ✓ | X | ✓ | ✓ | Image | Transformer | NuPlan | ✓ | |
| DrivePhysica [348] | ✓ | X | X | ✓ | Image | Diffusion | nuScenes | X | |
| Cosmos-Drive [349] | ✓ | ✓ | ✓ | ✓ | Image | DiT | In-house, WOD | ✓ | |
| MILE [350] | ✓ | X | X | ✓ | Image, BEV Semantics | RNN | Carla simulation | ✓ | |
| Cosmos-Reason1 [351] | ✓ | ✓ | ✓ | ✓ | Text | Transformer | Cosmos-Reason-1-Dataset | ✓ | |
| TrafficBots [352] | ✓ | X | X | ✓ | Trajectory | Transformer | Waymo Open Motion Dataset nuScenes | X | |
| Uniworld [353] | ✓ | X | ✓ | ✓ | Occupancy | Transformer | Waymo Open Motion Dataset nuScenes | X | |
| CopilotdD [354] | ✓ | X | ✓ | ✓ | Point cloud | Diffusion | nuScenes, KITTI [355] | X | |
| MUVO [356] | ✓ | X | ✓ | ✓ | Image, Occupancy, Point cloud | RNN | Carla simulation | X | |
| OccWorld [283] | ✓ | X | ✓ | ✓ | Occupancy | Transformer | nuScenes, Occ3D | ✓ | |
| ViDAR [357] | ✓ | X | ✓ | ✓ | Point cloud | Transformer | nuScenes | X | |
| Think2Drive [358] | ✓ | X | X | ✓ | BEV Semantics | RNN | CARLA simulation | X | |
| Reward Model | LidarDM [359] | ✓ | X | ✓ | X | Point cloud | Diffusion | KITTI-360, WOD | X |
| LAW [360] | ✓ | X | ✓ | ✓ | BEV Semantics | Transformer | nuScenes, NAVSIM [361], CARLA simulation | ✓ | |
| UnO [362] | ✓ | X | ✓ | ✓ | Occupancy | Transformer | nuScenes, Argoverse 2, KITTI Odometry | X | |
| CarFormer [363] | ✓ | X | ✓ | ✓ | Trajectory | Transformer | CARLA simulation | ✓ | |
Note: ✓ means supported/used, X means not supported/used. For "Code Availability", ✓ means code is released and X means not released. The original table contained an error in row grouping; it has been corrected here to match the content categorization.
The following are the results from Table 6 of the original paper:
| Category | Paper | Year | Image | Text | Video | Action | Architecture | Experiments | Code Availability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Neural Simulators | WHALE [381] | 2024 | X | ✓ | X | ✓ | Transformer | Robotic arm | X |
| RoboDreamer [382] | 2024 | ✓ | X | ✓ | ✓ | Diffusion | Robotic arm | ✓ | |
| DreMa [383] | 2024 | X | ✓ | X | ✓ | GS reconstruction | Robotic arm | ✓ | |
| DreamGen [384] | 2025 | X | ✓ | X | ✓ | DiT | Dual robotic arm | O | |
| EnerVerse [385] | 2025 | X | ✓ | ✓ | ✓ | Diffusion | Robotic arm | ✓ | |
| WorldEval [386] | 2025 | X | ✓ | X | ✓ | DiT | Robotic arm | ✓ | |
| Cosmos (NVIDIA) [294] | 2025 | ✓ | ✓ | ✓ | ✓ | Diffusion + Autoregressive | Robots | ✓ | |
| Pangu [387] | 2025 | X | ✓ | X | X | / | / | X | |
| RoboTransfer [388] | 2025 | ✓ | ✓ | ✓ | ✓ | Diffusion | Robotic arm | O | |
| TesserAct [389] | 2025 | ✓ | ✓ | X | ✓ | DiT | Robotic arm | ✓ | |
| 3DPEWM [390] | 2025 | ✓ | X | ✓ | ✓ | DiT | Mobile robot | X | |
| SGImageNav [391] | 2025 | ✓ | ✓ | X | ✓ | LLM | Quadruped robot | X | |
| Embodiedreamer [392] | 2025 | ✓ | ✓ | X | ✓ | Diffusion + ACT | Robotic arm | O | |
| Dynamics Models | PlaNet [267] | 2018 | ✓ | X | X | ✓ | RSSM | DMC | ✓ |
| Plan2Explore [262] | 2020 | ✓ | X | X | ✓ | RSSM | DMC | √ | |
| Dreamer [393] | 2020 | ✓ | X | X | ✓ | RSSM | DMC | ✓ | |
| DreamerV2 [394] | 2021 | ✓ | X | X | ✓ | RSSM | DMC | √ | |
| DreamerV3 [270] | 2023 | ✓ | X | X | ✓ | RSSM | DMC | ✓ | |
| DayDreamer [271] | 2024 | ✓ | ✓ | X | √ | RSSM | Robotic arm | √ | |
| Dreaming [395] | 2021 | ✓ | X | X | ✓ | RSSM | DMC | ✓ | |
| Dreaming V2 [396] | 2021 | ✓ | X | X | ✓ | RSSM | DMC + RoboSuite | √ | |
| DreamerPro [397] | 2022 | ✓ | ✓ | X | ✓ | RSSM | DMC | ✓ | |
| TransDreamer [276] | 2024 | ✓ | X | ✓ | ✓ | TSSM | 2D Simulation | √ | |
| LEXA [398] | 2021 | ✓ | X | ✓ | ✓ | RSSM | Simulated robotic arm | ✓ | |
| FOWM [399] | 2023 | ✓ | √ | X | ✓ | TD-MPC | Robotic arm | √ | |
| SWIM [400] | 2023 | X | ✓ | X | ✓ | RSSM | Robotic arm | X | |
| ContextWM [301] | 2023 | ✓ | X | X | ✓ | RSSM | DMC + CARLA + Meta-world | ✓ | |
| iVideoGPT [302] | 2023 | ✓ | X | ✓ | ✓ | Autoregressive Transformer | DMC | ✓ | |
| DWL [401] | 2024 | ✓ | X | X | ✓ | Recurrent encoder | Robotic arm | ✓ | |
| Surfer [402] | 2024 | ✓ | X | X | ✓ | Transformer | Humanoid robot | X | |
| GAS [403] | 2024 | ✓ | ✓ | X | ✓ | RSSM | Surgical robot | √ | |
| Puppeteer [404] | 2024 | ✓ | ✓ | X | ✓ | TD-MPC2 | Simulation (56-DoF humanoid) | X | |
| TWIST [405] | 2024 | ✓ | X | X | ✓ | RSSM | Robotic arm | ✓ | |
| Dynamics Models (continued) | PIVOT-R [406] | 2024 | ✓ | ✓ | ✓ | ✓ | Transformer | Robotic arm | X |
| HarmonyDream [407] | 2024 | ✓ | X | X | ✓ | RSSM | Robotic arm | √ | |
| SafeDreamer [408] | 2024 | ✓ | X | ✓ | ✓ | OSRP | simulation (Safety-Gymnasium) | ✓ | |
| WMP [409] | 2024 | ✓ | X | ✓ | ✓ | RSSM | Quadruped robot | ✓ | |
| RWM [410] | 2025 | ✓ | X | ✓ | ✓ | GRU+MLP | Quadruped robot | X | |
| RWM-O [411] | 2025 | ✓ | X | ✓ | ✓ | GRU+MLP | Robotic arm | X | |
| SSWM [412] | 2025 | ✓ | X | X | ✓ | SSM | Quadrotor | X | |
| WMR [413] | 2025 | ✓ | X | X | ✓ | LSTM | Humanoid robot | X | |
| PIN-WM [414] | 2025 | ✓ | X | ✓ | ✓ | GS reconstruction | Robotic arm | O | |
| LUMOS [415] | 2025 | X | ✓ | X | ✓ | RSSM | Robotic arm | ✓ | |
| OSVI-WM [416] | 2025 | ✓ | X | X | ✓ | Transformer | Robotic arm | X | |
| FOCUS [417] | 2025 | ✓ | X | X | ✓ | RSSM | Robotic arm | ✓ | |
| FLIP [418] | 2025 | X | ✓ | X | ✓ | DiT | Robotic arm | ✓ | |
| EnerVerse-AC [419] | 2025 | ✓ | X | ✓ | ✓ | Diffusion | Robotic arm | ✓ | |
| FlowDreamer [420] | 2025 | ✓ | X | ✓ | ✓ | Diffusion | Robotic arm | ✓ | |
| HWM [421] | 2025 | ✓ | X | ✓ | ✓ | MVM | Humanoid robot | X | |
| MoDem-V2 [422] | 2024 | ✓ | X | X | ✓ | FM | Robotic arm | X | |
| V-JEPA 2 [275] | 2025 | ✓ | X | ✓ | ✓ | JEPA | Robotic arm | X | |
| AdaWorld [423] | 2025 | ✓ | X | X | ✓ | RSSM | Robotic arm | X | |
Note: ✓ means supported/used, X means not supported/used. For "Code Availability", ✓ means code is released, O means the code is not yet available, and X means not released.
Architectural abbreviations:
The following are the results from Table 7 of the original paper:
| Category | Paper | Year | Image | Text | Video | Action | Architecture | Experiments | Code Availability |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dynamics Models | MoSim [424] | 2025 | X | √ | ✓ | ✓ | rigid-body dynamics + ODE | Robotic arm | ✓ |
| DALI [425] | 2025 | ✓ | ✓ | X | ✓ | RSSM | DMC | ✓ | |
| GWM [426] | 2025 | ✓ | X | X | ✓ | DiT | Robotic arm | ✓ | |
| Reward Models | VIPER [427] | 2023 | ✓ | ✓ | X | ✓ | Autoregressive Transformer | DMC + RLBench + Atari | ✓ |
| PlaNet [267] | 2018 | √ | X | X | ✓ | RSSM | DMC | ✓ | |
Note: ✓ means supported/used, X means not supported/used. For "Code Availability", ✓ means code is released, O means the code is not yet available, and X means not released.
Architectural abbreviations:
6.3. Ablation Studies / Parameter Analysis
As a survey paper, this work does not present its own ablation studies or parameter analyses. Instead, it discusses the impact of design choices and parameters in the context of the surveyed literature. For instance:
-
Impact of Architectural Choices: The comparison between
autoregressive and diffusion-based world models in autonomous driving (e.g., GAIA-1 vs. GAIA-2) inherently discusses how different architectures affect generation fidelity, controllability, and computational cost.
Role of Latent Space vs. High-Dimensional Representations: The
Dreamer series explores the benefits of latent-space planning for sample efficiency, implying that directly operating in high-dimensional pixel space can be less efficient.
Influence of
Multimodal Inputs: The effectiveness of VLA models and advanced autonomous driving world models is attributed to their ability to integrate various modalities (vision, language, LiDAR, actions), suggesting that each modality contributes to richer scene understanding and control.
Generalization vs. Specificity: Discussions around
domain randomization in simulators or pretraining world models on diverse datasets highlight how these techniques contribute to generalization capabilities, mitigating overfitting to specific simulated environments.
Trade-offs in Simulators: The
comparison of physical properties and rendering capabilities across simulators (Tables 2, 3, 4) implicitly reveals trade-offs between physical accuracy, rendering fidelity, computational speed (parallel rendering), and sensor support. For example, MuJoCo prioritizes physics accuracy and RL efficiency, while Isaac Sim emphasizes photorealistic rendering and comprehensive sensor simulation. The survey emphasizes that researchers often conduct these
ablation studies to validate individual components or design choices within their proposed world models or simulation platforms.
7. Conclusion & Reflections
7.1. Conclusion Summary
This survey provides a comprehensive and timely overview of the pivotal roles played by physical simulators and world models in advancing embodied artificial intelligence. It successfully delineates a clear path toward truly intelligent robotic systems by integrating these two foundational technologies. The paper's key contributions include:
-
A Novel Classification System: The introduction of a
five-level grading standard (IR-L0 to IR-L4) offers a robust framework for evaluating robot autonomy, providing a common language for assessing progress and guiding future development in embodied AI.
Exhaustive Review of Simulators: A detailed comparative analysis of mainstream
simulation platforms highlights their diverse capabilities in physical simulation, rendering, and sensor/joint support, emphasizing the shift towards GPU-accelerated, high-fidelity, and differentiable simulators to bridge the sim-to-real gap.
In-depth Analysis of World Models: A thorough exploration of
world model architectures (from RSSMs to diffusion-based Foundation Models) and their critical roles as neural simulators, dynamic models in model-based RL, and reward models in embodied intelligence.
Application-Specific Insights: A focused discussion on the application of
world models in autonomous driving and articulated robots reveals how these systems are enabling unprecedented capabilities in scene generation, prediction, planning, and cross-embodiment generalization. The survey concludes that the synergistic integration of
physical simulators for external training and world models for internal cognition is the cornerstone for realizing IR-L4 fully autonomous systems, fundamentally transforming robotics from specialized automation to general-purpose intelligence seamlessly integrated into human society.
7.2. Limitations & Future Work
The paper itself identifies several critical challenges that current world models and embodied AI systems face, which also serve as directions for future research:
-
High-Dimensionality and Partial Observability: Future work needs to develop more robust state estimation and belief-state management techniques to handle the vast, incomplete sensory inputs.
-
Causal Reasoning versus Correlation Learning: A major leap is required from merely learning correlations to understanding
causal relationships and enabling counterfactual reasoning ("what if" scenarios) for true generalization to novel situations.
Abstract and Semantic Understanding: Integrating low-level physical predictions with high-level
abstract reasoning about concepts like traffic laws, pedestrian intent, and object affordances is crucial for intelligent, context-aware behavior.
Systematic Evaluation and Benchmarking: New evaluation frameworks and metrics are needed that correlate better with downstream task performance, robustness in safety-critical scenarios, and the capture of
causally relevant aspects of the environment, moving beyond simple MSE on predictions.
Memory Architecture and Long-Term Dependencies: Designing efficient and effective
memory architectures (e.g., advanced Transformers or State-Space Models) is vital for retaining and retrieving relevant information over extended timescales.
Human Interaction and Predictability:
World models must facilitate agent behaviors that are legible, predictable, and socially compliant to humans, requiring a deeper integration of social intelligence.
Interpretability and Verifiability: For safety-critical applications, developing methods to audit and understand the internal decision-making processes of
black-box deep learning models is non-negotiable. Formal verification of safety properties is a formidable theoretical and engineering challenge.
Compositional Generalization and Abstraction: Future
world models should learn disentangled, abstract representations of entities, relations, and physical properties to understand and predict novel scenarios by composing known concepts, rather than relying on end-to-end pattern matching.
Data Curation and Bias: Addressing biases in
training data and systematically collecting data for long-tail, rare-but-safety-critical events is essential for building robust and reliable systems. The paper also highlights specific technical trends in
world models for robotics, such as tactile-enhanced world models for dexterous manipulation, unified world models for cross-hardware and cross-task generalization, and hierarchical world models for long-horizon tasks.
7.3. Personal Insights & Critique
This survey offers a tremendously valuable synthesis of two rapidly evolving and converging fields: physical simulation and world models. Its clear structure, comprehensive comparisons of simulators, and deep dive into world model architectures and applications make it an excellent resource for anyone seeking to understand the current landscape of embodied AI.
One of the most inspiring aspects is the emphasis on the synergy between external and internal models. The analogy to human cognition—where we mentally simulate possibilities and learn from them—underscores the intuitive power of world models. The advancements in GPU-accelerated differentiable simulators are particularly exciting, as they promise to close the sim-to-real gap more effectively by allowing AI systems to learn directly from gradients in a virtual world that closely mirrors reality. The potential for zero-shot generalization and few-shot adaptation enabled by foundation models and advanced world models is truly transformative, suggesting a future where robots can quickly adapt to novel situations without extensive re-training.
However, the identified limitations also present significant challenges:
-
The "Black Box" Problem: While
world models show immense promise, their interpretability and verifiability remain critical concerns, especially for safety-critical applications like autonomous driving. The ability to explain why a world model made a particular prediction or plan is paramount for trust and accountability. This is an area where current deep learning methods still struggle.
True Causal Understanding: The distinction between
correlation learning and causal reasoning is fundamental. Many world models excel at predicting what will happen, but not necessarily why it will happen based on underlying physical laws or agent intentions. Without genuine causal understanding, compositional generalization to truly novel scenarios might be limited, as the model may just be interpolating rather than truly reasoning.
Memory and Long-Term Planning: While
Transformers have improved long-range dependencies, modeling long-term memory and planning in highly dynamic, open-ended environments remains exceptionally hard. Robots need to recall past events, adapt behaviors over extended periods, and maintain consistent goals, which taxes current memory architectures. The methods and conclusions of this paper have broad applicability. The
IR-L classification could be adapted for any AI agent interacting with an environment, not just humanoid robots. The principles of model-based RL and generative world models are transferable to other domains requiring complex dynamic predictions and planning, such as climate modeling, drug discovery, or financial forecasting.
My critique leans towards the philosophical challenge of AGI: while the engineering progress in simulators and world models is astonishing, the bridge from sophisticated pattern matching and prediction to true understanding, consciousness, or common sense reasoning remains the ultimate frontier. The paper articulates these challenges well, particularly in the causal reasoning and abstract understanding sections. Future research must not only focus on scale and fidelity but also on embedding deeper cognitive architectures that mirror human-like reasoning and learning from interactive experience.