
Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI

Published: 07/09/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This survey comprehensively explores the latest advancements in Embodied AI, emphasizing its role in achieving AGI and bridging cyberspace with the physical world, covering key research areas and challenges.

Abstract

Embodied Artificial Intelligence (Embodied AI) is crucial for achieving Artificial General Intelligence (AGI) and serves as a foundation for various applications (e.g., intelligent mechatronics systems, smart manufacturing) that bridge cyberspace and the physical world. Recently, the emergence of Multi-modal Large Models (MLMs) and World Models (WMs) have attracted significant attention due to their remarkable perception, interaction, and reasoning capabilities, making them a promising architecture for embodied agents. In this survey, we give a comprehensive exploration of the latest advancements in Embodied AI. Our analysis firstly navigates through the forefront of representative works of embodied robots and simulators, to fully understand the research focuses and their limitations. Then, we analyze four main research targets: 1) embodied perception, 2) embodied interaction, 3) embodied agent, and 4) sim-to-real adaptation, covering state-of-the-art methods, essential paradigms, and comprehensive datasets. Additionally, we explore the complexities of MLMs in virtual and real embodied agents, highlighting their significance in facilitating interactions in digital and physical environments. Finally, we summarize the challenges and limitations of embodied AI and discuss potential future directions. We hope this survey will serve as a foundational reference for the research community. The associated project can be found at https://github.com/HCPLab-SYSU/Embodied_AI_Paper_List.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI

1.2. Authors

Yang Liu, Weixing Chen, Yongjie Bai, Xiaodan Liang, Guanbin Li, Wen Gao, Liang Lin

The authors are affiliated with Sun Yat-sen University, Peng Cheng Laboratory, and Peking University in China. Their research backgrounds generally span computer science, engineering, and artificial intelligence, with specific expertise in areas such as multi-modal reasoning, causality learning, computer vision, image processing, deep learning, human-related analysis, and multimedia.

1.3. Journal/Conference

The paper is a preprint published on arXiv. As a preprint, it has not yet undergone formal peer review or been accepted by a specific journal or conference. However, the arXiv platform is a highly influential repository for rapidly disseminating research in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Given the affiliations and past publications of the authors in top-tier venues (e.g., IEEE TPAMI, CVPR, ICCV, NeurIPS), it is likely intended for a high-impact journal or conference in the field of AI or robotics.

1.4. Publication Year

2024

1.5. Abstract

This survey provides a comprehensive exploration of the latest advancements in Embodied Artificial Intelligence (Embodied AI). It highlights Embodied AI's crucial role in achieving Artificial General Intelligence (AGI) and its application in bridging cyberspace and the physical world through systems like intelligent mechatronics and smart manufacturing. The paper notes the significant impact of Multi-modal Large Models (MLMs) and World Models (WMs) due to their perception, interaction, and reasoning capabilities, positioning them as promising architectures for embodied agents. The analysis begins by reviewing embodied robots and simulators to understand current research focuses and limitations. It then delves into four main research targets: 1) embodied perception, 2) embodied interaction, 3) embodied agent, and 4) sim-to-real adaptation, detailing state-of-the-art methods, essential paradigms, and comprehensive datasets. The survey further examines the role of MLMs in both virtual and real embodied agents, emphasizing their importance in digital and physical interactions. Finally, it summarizes challenges, limitations, and discusses future research directions. The authors aim for this survey to serve as a foundational reference for the research community, with an associated project available on GitHub.

Official Source Link: https://arxiv.org/abs/2407.06886
PDF Link: https://arxiv.org/pdf/2407.06886v8.pdf
Publication Status: Preprint

2. Executive Summary

2.1. Background & Motivation

The core problem the paper addresses is the gap between Artificial Intelligence (AI) systems operating in cyberspace (virtual environments) and their ability to function intelligently in the physical world. Traditional AI often excels at abstract problem-solving but struggles with the complexities and unpredictability of real-world interaction. This limitation hinders the realization of Artificial General Intelligence (AGI) – the ability of an AI to understand, learn, and apply intelligence across a wide range of tasks, similar to human cognitive abilities.

This problem is critically important because Embodied AI is seen as a foundational step towards AGI. It is also essential for various practical applications that require AI to interact directly with the physical world, such as intelligent mechatronics systems (systems integrating mechanical engineering, electronics, computer engineering, and control engineering for smart products), smart manufacturing (digitally integrated and optimized production processes), robotics, healthcare, and autonomous vehicles. Without Embodied AI, these applications cannot achieve true autonomy and adaptive functionality.

Prior research primarily focused on disembodied AI (e.g., ChatGPT), where cognition and physical entities are separated. While these models show impressive reasoning and language capabilities, they lack the ability to actively perceive, interact with, and act upon the physical environment. The emergence of Multi-modal Large Models (MLMs) and World Models (WMs) has created a new opportunity to bridge this gap. MLMs combine Natural Language Processing (NLP) and Computer Vision (CV) capabilities, allowing AI to understand both language instructions and visual observations. WMs enable AI to build internal predictive models of its environment, simulating physical laws and understanding causal relationships. These advancements provide strong perception, interaction, and planning capabilities, making them a promising architecture for developing general-purpose embodied agents.

The paper's entry point is to provide a comprehensive survey that specifically contextualizes the field of Embodied AI within this new era of MLMs and WMs. It aims to consolidate the fragmented progress, identify key research directions, and highlight the challenges that remain in aligning AI's virtual intelligence with real-world physical interaction.

2.2. Main Contributions / Findings

The paper makes several primary contributions and presents key findings:

  • First Comprehensive Survey with MLM and WM Focus: To the best of the authors' knowledge, this is the first comprehensive survey of Embodied AI that specifically focuses on the alignment of cyber and physical spaces based on Multi-modal Large Models (MLMs) and World Models (WMs). This offers novel insights into methodologies, benchmarks, challenges, and applications within this rapidly evolving landscape.

  • Detailed Taxonomy of Embodied AI: The survey categorizes and summarizes Embodied AI into several essential parts, providing a detailed taxonomy:

    • Embodied Robots: Discussing various types of robots that serve as physical embodiments.
    • Embodied Simulators: Reviewing general-purpose and real-scene-based simulation platforms crucial for developing and testing embodied agents.
    • Four Main Research Tasks:
      • Embodied Perception: Covering active visual perception and visual language navigation (VLN).
      • Embodied Interaction: Discussing embodied question answering (EQA) and embodied grasping.
      • Embodied Agents: Including multi-modal foundation models and task planning.
      • Sim-to-Real Adaptation: Focusing on embodied world models, data collection and training, and control algorithms.
  • Proposal of ARIO Dataset Standard: To facilitate the development of robust, general-purpose embodied agents, the paper proposes a new dataset standard called ARIO (All Robots In One). This standard aims to unify control and motion data from robots with different morphologies into a single format. A large-scale ARIO dataset is also introduced, encompassing approximately 3 million episodes collected from 258 series and 321,064 tasks.

  • Identification of Challenges and Future Directions: The survey concludes by outlining significant challenges in Embodied AI, such as the need for high-quality robotic datasets, long-horizon task execution, robust causal reasoning, a unified evaluation benchmark, and addressing security and privacy concerns. It also discusses promising future research directions to overcome these limitations.

    The key conclusion is that while Embodied AI has made rapid progress, especially with the integration of MLMs and WMs, significant research is still required to achieve truly general-purpose embodied agents that can reliably and robustly bridge cyberspace and the physical world for AGI and diverse applications. The proposed taxonomy, dataset standard, and future directions serve to guide this ongoing research.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this survey, a reader needs a grasp of several fundamental concepts in AI and robotics:

  • Artificial Intelligence (AI): The broad field of computer science dedicated to creating machines that can perform tasks that typically require human intelligence, such as learning, problem-solving, perception, and decision-making.

  • Artificial General Intelligence (AGI): A hypothetical form of AI that can understand, learn, and apply intelligence to a wide range of problems, similar to human cognitive abilities, rather than being limited to a single task (narrow AI). The paper posits Embodied AI as crucial for achieving AGI.

  • Embodied AI: AI systems that are integrated into physical bodies (e.g., robots, autonomous vehicles) and can interact with the physical world through sensors (perception) and actuators (action). Unlike disembodied AI, embodied AI learns and reasons within a physical context, experiencing and influencing its environment.

    • Disembodied AI: AI that operates purely in cyberspace or virtual environments, without a physical body or direct interaction with the real world. Examples include chatbots like ChatGPT or game-playing AI. Its cognition and physical entities are separate.
  • Multi-modal Large Models (MLMs): Large Models (like large language models, LLMs) that can process and integrate information from multiple modalities, such as text, images, video, and audio. These models are trained on vast datasets to develop strong capabilities in perception (understanding sensory input), interaction (responding to and influencing the environment across modalities), and reasoning (making logical inferences).

    • Large Language Models (LLMs): A type of MLM primarily focused on text. They are deep learning models trained on enormous text datasets, enabling them to understand, generate, and process human language for tasks like translation, summarization, and question answering.
  • World Models (WMs): Computational models that learn a compressed, predictive representation of an environment. An embodied agent uses a World Model to simulate possible future states, understand physical laws, and plan actions without needing to interact directly with the real (or simulated) world for every decision. This allows for more efficient learning and planning. A minimal planning sketch built on this idea appears after this list.

  • Sim-to-Real Adaptation: The process of training AI models or control policies in a simulated environment (cheaper, safer, faster) and then transferring that learned knowledge to a real-world physical system (e.g., a robot). The challenge lies in bridging the "domain gap" – the differences between the simulated and real environments – to ensure the AI performs effectively in reality.

  • Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent learns through trial and error, receiving feedback (rewards or penalties) for its actions.

  • Computer Vision (CV): A field of AI that enables computers to "see" and interpret visual information from the real world, such as images and videos. Tasks include object detection, image segmentation, and 3D reconstruction.

  • Natural Language Processing (NLP): A field of AI focused on enabling computers to understand, interpret, and generate human language. Tasks include language understanding, text generation, and machine translation.

  • Robotics: The interdisciplinary field that deals with the design, construction, operation, and use of robots. Embodied AI is deeply intertwined with robotics as robots provide the physical embodiment for AI agents.
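As a concrete illustration of the World Model idea above, the following is a minimal sketch of model-based planning by random-shooting model-predictive control: candidate action sequences are rolled out inside a learned model rather than the real environment, and only the first action of the best imagined sequence is executed. The `dynamics_model` and `reward_model` callables are hypothetical placeholders for learned networks, not components defined in the survey.

```python
import numpy as np

def plan_with_world_model(state, dynamics_model, reward_model,
                          action_dim=4, horizon=10, n_candidates=256):
    """Pick the first action of the best imagined action sequence."""
    # Sample candidate action sequences: (n_candidates, horizon, action_dim).
    candidates = np.random.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)

    for i, actions in enumerate(candidates):
        s = state
        for a in actions:
            s = dynamics_model(s, a)        # imagined next state (no real interaction)
            returns[i] += reward_model(s)   # accumulate predicted reward
    best = int(np.argmax(returns))
    return candidates[best, 0]              # execute only the first action, then replan
```

In practice such a planner is re-run after every executed action, which is why the predictive accuracy of the learned world model matters so much.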

3.2. Previous Works

The survey differentiates itself from prior works by specifically focusing on the recent advancements brought by MLMs and WMs. It acknowledges existing surveys but highlights their limitations:

  • Outdated Surveys: Several previous surveys on Embodied AI [5], [6], [17], [18] are noted as "outdated" because they were published before the era of MLMs, which started around 2023. This means they do not capture the significant shifts and advancements that MLMs and WMs have introduced to the field.
    • [5] Duan, J., Yu, S., Tan, H. L., Zhu, H., & Tan, C. (2022). A survey of embodied ai: From simulators to research tasks.
    • [17] R. Pfeifer and F. Iida (2004). Embodied artificial intelligence: Trends and challenges. (This is significantly older, highlighting the rapid pace of change).
  • Limited Scope Surveys: Among the more recent surveys after 2023, only two are specifically mentioned:
    • [6] Ma, Y., Song, Z., Zhuang, Y., Hao, J., & King, I. (2024). A survey on vision-language-action models for embodied ai. This work focused on vision-language-action models.
    • [18] Ren, L., Dong, J., Liu, S., Zhang, L., & Wang, L. (2024). Embodied intelligence toward future smart manufacturing in the era of ai foundation model. This survey proposed the ABC model (AI brain, Body, Cross-modal sensors) and focused on Embodied AI systems for smart manufacturing.
  • Overlooked Aspects: The paper points out that previous surveys generally overlooked the full consideration of MLMs, WMs, and embodied agents, as well as recent developments in embodied robots and simulators.

Key Technologies and Representative Models Mentioned:

  • Disembodied AI Example: ChatGPT [4] is cited as a representative disembodied AI agent, demonstrating the contrast with Embodied AI. RoboGPT [8] is mentioned in the context of LLMs for long-term decisions, indicating how LLMs can be integrated into embodied agents.
  • Embodied AI Examples: RT-1 [9], RT-2 [10], and RT-H [3] are highlighted as recent representative embodied models. These Robotics Transformers are crucial examples of Vision-Language-Action (VLA) models that transfer web knowledge to robotic control and utilize action hierarchies.
  • Vision Encoders: CLIP [13] (Learning transferable visual models from natural language supervision) and BLIP-2 [14] (Bootstrapping language-image pre-training with frozen image encoders and large language models) are mentioned as state-of-the-art vision encoders that provide precise estimations of object class, pose, and geometry, crucial for embodied perception.
  • World Models: I-JEPA [15] (Self-supervised learning from images with a joint-embedding predictive architecture) and Sora [16] (a generative model capable of producing realistic videos, suggesting world-modeling capabilities) are examples of World Models or related generative approaches that exhibit simulation capabilities and understanding of physical laws.

3.3. Technological Evolution

The evolution of Embodied AI has progressed through several stages:

  1. Early Robotics and Traditional AI: Initially, robots were controlled by rule-based systems and classical AI algorithms for tasks like SLAM (Simultaneous Localization and Mapping) and simple navigation. These systems were often brittle, limited to well-defined environments, and lacked adaptability.
  2. Deep Learning Era (2012 onwards): The advent of deep learning revolutionized Computer Vision and Natural Language Processing, providing robots with enhanced perception (e.g., better object recognition, semantic segmentation) and some understanding of natural language commands. Deep Reinforcement Learning allowed robots to learn complex behaviors through interaction. However, these models were often data-hungry and struggled with generalization across diverse tasks and environments.
  3. Large Foundation Models Era (2020s onwards): The recent emergence of Large Language Models (LLMs) and Multi-modal Large Models (MLMs) marks a significant leap. These models, trained on vast internet-scale datasets, possess unprecedented perception, reasoning, and generalization capabilities.
    • LLMs enable robots to better understand abstract linguistic instructions and perform high-level task planning through chain-of-thought reasoning.

    • MLMs align visual and linguistic representations, allowing embodied agents to perceive multi-modal elements in complex environments and interact more naturally.

    • World Models further enhance this by providing a predictive understanding of physical laws, enabling agents to simulate consequences and plan more effectively.

      This paper's work sits firmly within this Large Foundation Models Era, specifically focusing on how MLMs and WMs are transforming Embodied AI by providing the "brains" for embodied agents to bridge the cyber-physical gap.

3.4. Differentiation Analysis

Compared to previous surveys and methods in Embodied AI, the core differences and innovations of this paper's approach are:

  • Focus on MLMs and WMs: The most significant differentiation is its explicit and central focus on Multi-modal Large Models (MLMs) and World Models (WMs). While previous works might have touched upon multi-modal learning or predictive models, this survey systematically explores their profound impact on embodied perception, interaction, agent architectures, and sim-to-real adaptation in the context of aligning cyberspace with the physical world. This perspective is new given the recent, rapid rise of these foundation models.

  • Comprehensive Scope: It covers a broader scope by not only detailing theoretical advancements but also providing a structured overview of:

    • Embodied Hardware: A dedicated section on embodied robots (fixed-base, wheeled, tracked, quadruped, humanoid, biomimetic).
    • Simulation Environments: A thorough review of both general-purpose and real-scene-based simulators, which are critical tools for Embodied AI development and sim-to-real transfer.
    • Four Core Research Tasks: A structured analysis of embodied perception (active visual perception, VLN), embodied interaction (EQA, grasping), embodied agents (multi-modal foundation models, task planning), and sim-to-real adaptation. This comprehensive breakdown offers a complete picture of the field's components.
  • Novel Dataset Standard (ARIO): The proposal of ARIO (All Robots In One) as a unified dataset standard and the introduction of a large-scale ARIO dataset directly addresses a critical challenge in the field: the lack of standardized, large-scale, and multi-modal datasets for training general-purpose embodied agents. This is a concrete, practical contribution beyond a mere literature review.

  • Updated Perspective on Challenges and Future Directions: By considering the capabilities and limitations of MLMs and WMs, the survey offers a refined understanding of current challenges (e.g., long-horizon tasks, causal reasoning, security) and proposes future directions that are highly relevant to the latest technological paradigms.

    In essence, this paper serves as an updated roadmap for Embodied AI, explicitly acknowledging and integrating the transformative role of MLMs and WMs—a perspective largely missing from prior reviews.

4. Methodology

This paper is a comprehensive survey, and its "methodology" is the structured approach it takes to review and analyze the field of Embodied AI. The authors systematically categorize and discuss existing research, identifying key components, tasks, and challenges. The core idea is to provide a holistic view of how AI can be embodied to bridge cyberspace and the physical world, with a particular emphasis on the impact of Multi-modal Large Models (MLMs) and World Models (WMs).

4.1. Principles

The theoretical basis and intuition behind this survey's structure are rooted in the understanding that Embodied AI requires a cohesive ecosystem of hardware, software, and learning paradigms to achieve Artificial General Intelligence (AGI). The survey's principles can be broken down as follows:

  1. Cyber-Physical Alignment: The central theme is the alignment of AI's cognitive abilities (developed in cyberspace) with its physical embodiment and interaction capabilities in the physical world. This involves understanding how AI perceives, processes, and acts within real-world constraints.
  2. Foundation Models as Enablers: The survey posits MLMs and WMs as pivotal architectures that provide embodied agents with enhanced perception, interaction, and reasoning. These models offer a pathway to generalize AI capabilities from virtual to physical environments.
  3. Modular Breakdown: To thoroughly analyze a complex field like Embodied AI, it must be broken down into constituent parts. The survey adopts a modular approach, starting with the physical platforms (robots and simulators), then delving into core functional capabilities (perception, interaction), integrating these into an intelligent entity (embodied agent), and finally addressing the critical challenge of transferring knowledge between virtual and real domains (sim-to-real adaptation).
  4. Problem-Centric View: Each section on research targets (perception, interaction, agent, sim-to-real) addresses specific problems embodied agents must solve, outlining state-of-the-art solutions, paradigms, and supporting datasets.
  5. Future-Oriented: Beyond summarizing current achievements, the survey aims to identify limitations and propose future research directions, serving as a guide for the community.

4.2. Core Methodology In-depth (Layer by Layer)

The survey systematically explores Embodied AI through several key areas, progressively building from physical components to advanced cognitive functions and transfer mechanisms.

4.2.1. Embodied Robots

This section reviews the physical platforms that serve as embodiments for AI agents. The selection of a robot type is critical as it dictates the range of interactions and environments an agent can operate in.

The following figure (Figure 2 from the original paper) shows the various types of embodied robots discussed:

Fig. 2. The Embodied Robots include Fixed-base Robots, Quadruped Robots, Humanoid Robots, Wheeled Robots, Tracked Robots, and Biomimetic Robots.

  • Fixed-base Robots: These robots, like Franka Emika Panda [19], Kuka iiwa [21], and Sawyer [23], are typically stationary and excel in precision tasks within structured environments such as industrial automation or laboratory settings. Their strength lies in repetitive, high-accuracy manipulation.
  • Wheeled Robots: Examples include Kiva and Jackal robots [25]. These robots are designed for efficient movement on flat, predictable surfaces, commonly used in logistics and warehousing due to their simple structure and low cost. They face limitations on uneven terrain.
  • Tracked Robots: With track systems, these robots offer stability on soft or uneven terrains, making them suitable for off-road tasks such as agriculture, exploration, and disaster recovery [26].
  • Quadruped Robots: Such as Unitree Robotics' A1 and Go1, and Boston Dynamics Spot. These robots mimic four-legged animals, enabling them to navigate complex and unstructured terrains, making them ideal for exploration and rescue missions.
  • Humanoid Robots: Designed to resemble human form and movement, these robots (e.g., LLM-enhanced humanoids [29]) aim to provide personalized services and perform intricate tasks using dexterous hands. They are expected to enhance efficiency and safety in various service and industrial sectors.
  • Biomimetic Robots: These robots replicate movements and functions of natural organisms (e.g., fish-like [32], insect-like [33], soft-bodied robots [34]). This approach helps them operate efficiently in complex environments by emulating biological mechanisms, often optimizing energy use and adapting to difficult terrains.

4.2.2. Embodied Simulators

Embodied simulators are crucial for Embodied AI research due to their cost-effectiveness, safety, scalability, and ability to generate extensive training data. The survey categorizes them into general simulators and real-scene based simulators.

4.2.2.1. General Simulator

These simulators provide virtual environments that closely mimic the physical world, focusing on realistic physics and dynamic interactions. They are essential for algorithm development and model training.

The following figure (Figure 3 from the original paper) shows examples of general simulators:

Fig. 3. Examples of General Simulators, including Isaac Sim, Webots, PyBullet, V-REP (CoppeliaSim), Genesis, MuJoCo, Unity ML-Agents, AirSim, MORSE, and Gazebo. The MuJoCo figure is from [46].

The following are the results from Table II of the original paper:

Simulator Year HFPS HQGR RRL DLS LSPC ROS MSS CP Physics Engine Main Applications
Genesis [35] 2024 O o O O O O Custom RL, LSPS, RS
Isaac Sim [36] 2023 O O O O O O O PhysX Nav, AD
Isaac Gym [37] 2019 O O O PhysX RL,LSPS
Gazebo [38] 2004 O O O O O ODE, Bullet, Simbody, DART Nav,MR
PyBullet [39] 2017 O O Bullet RL,RS
Webots [40] 1996 O O O O ODE RS
MuJoCo [41] 2012 O o Custom RL, RS
Unity ML-Agents [42] 2017 O O O Custom RL, RS
AirSim [43] 2017 O o Custom Drone sim, AD, RL
MORSE [44] 2015 o O Bullet Nav, MR
V-REP (CoppeliaSim) [45] 2013 O O O O Bullet, ODE, Vortex, Newton MR, RS
  • HFPS: High-fidelity Physical Simulation

  • HQGR: High-quality Graphics Rendering

  • RRL: Rich Robot Libraries

  • DLS: Deep Learning Support

  • LSPC: Large-scale Parallel Computation

  • ROS: Robot Operating System support

  • MSS: Multi-sensor Support

  • CP: Cross-Platform

  • RL: Reinforcement Learning

  • LSPS: Large-scale Parallel Simulation

  • RS: Robot Simulation

  • Nav: Navigation

  • AD: Autonomous Driving

  • MR: Multi-Robot Systems

  • ODE: Open Dynamics Engine

    Key simulators include:

  • Isaac Sim [36]: Known for high-fidelity physics, real-time ray tracing, extensive robot models, and deep learning support, used in autonomous driving and industrial automation.

  • Gazebo [47]: Open-source, strong ROS integration, supports various sensors and pre-built models, mainly for robot navigation and control.

  • PyBullet [39]: Python interface for Bullet physics engine, easy to use, supports real-time physical simulation.

  • Genesis [35]: A newly launched simulator with a differentiable physics engine and generative capabilities.
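To make the simulator discussion concrete, here is a minimal PyBullet example (not taken from the survey) that loads a bundled URDF model and steps the physics engine. The asset names come from `pybullet_data`; a real robot setup would load its own URDF and add controllers and sensors.

```python
import pybullet as p
import pybullet_data

# Connect in headless (DIRECT) mode; use p.GUI for a visual window.
client = p.connect(p.DIRECT)
p.setAdditionalSearchPath(pybullet_data.getDataPath())  # bundled example assets
p.setGravity(0, 0, -9.81)

plane_id = p.loadURDF("plane.urdf")
# "r2d2.urdf" ships with pybullet_data; swap in your own robot's URDF as needed.
robot_id = p.loadURDF("r2d2.urdf", basePosition=[0, 0, 0.5])

for _ in range(240):          # simulate one second at the default 240 Hz timestep
    p.stepSimulation()

position, orientation = p.getBasePositionAndOrientation(robot_id)
print(position, orientation)
p.disconnect()
```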

4.2.2.2. Real-Scene Based Simulators

These simulators focus on creating photorealistic 3D environments, often built from real-world data and game engines (e.g., UE5, Unity), to replicate human daily life and complex indoor activities.

The following figure (Figure 4 from the original paper) shows examples of real-scene based simulators:

Fig. 4. Examples of Real-Scene Based Simulators, including AI2-THOR, Matterport3D, VirtualHome, SAPIEN, Habitat, iGibson, TDW, and InfiniteWorld.

Key simulators include:

  • SAPIEN [48]: Designed for interactions with joint objects (doors, cabinets).

  • VirtualHome [49]: Uses an environment graph for high-level embodied planning based on natural language.

  • AI2-THOR [50]: Offers many interactive scenes, though interactions are often script-based.

  • iGibson [51] and TDW [52]: Provide fine-grained embodied control and highly simulated physical interactions. iGibson excels in large-scale realistic scenes, while TDW offers user freedom, unique audio, and fluid simulations.

  • Matterport3D [53]: A foundational 2D-3D visual dataset widely used in benchmarks.

  • Habitat [107]: Lacks interaction capabilities but is valued for extensive indoor scenes and an open framework, especially for navigation.

  • InfiniteWorld [54]: Focuses on a unified and scalable simulation framework with improvements in implicit asset reconstruction and natural language-driven scene generation.

    The survey also mentions tools for automated simulation scene construction like RoboGen [55], HOLODECK [56], PhyScene [57], and ProcTHOR [58], which leverage LLMs and generative models to create diverse and interactive scenes for training embodied agents.

4.2.3. Embodied Perception

Embodied perception goes beyond passive image recognition, requiring agents to actively move and interact to understand 3D space and dynamic environments. It involves visual reasoning, 3D relation understanding, and prediction for complex tasks.

The following figure (Figure 5 from the original paper) shows the schematic diagram of active visual perception:

Fig. 5. The schematic diagram of active visual perception. Visual SLAM and 3D scene understanding provide the foundation for passive visual perception, while active exploration adds activeness to the passive perception system. These elements work collaboratively to form the active visual perception system.

4.2.3.1. Active Visual Perception

This system combines fundamental capabilities for state estimation, scene perception, and environment exploration.

The following are the results from Table III of the original paper:

Function Type Methods
vSLAM Traditional vSLAM MonoSLAM [60], ORB-SLAM [61], LSD-SLAM [62]
Semantic vSLAM SLAM++ [63], QuadricSLAM [64], So-SLAM [65],SG-SLAM [66], OVD-SLAM [67], GS-SLAM [68]
3D Scene Understanding Projection-based MV3D [69], PointPillars [70], MVCNN [71]
Voxel-based VoxNet [72], SSCNet [73]), MinkowskiNet [74], SSCNs [75], Embodiedscan [76]
Point-based PointNet [77], PointNet++ [78], PointMLP [79], PointTransformer [80], Swin3d [81], PT2 [82],3D-VisTA [83], LEO [84], PQ3D [85], PointMamba [86], Mamba3D [87]
Active Exploration Interacting with the environment Pinto et al. [88], Tatiya et al. [89]
Changing the viewing direction Jayaraman et al. [90], NeU-NBV [91], Hu et al. [92], Fan et al. [93]
  • Visual Simultaneous Localization and Mapping (vSLAM): SLAM aims to determine a robot's position and build an environment map simultaneously [97]. vSLAM uses cameras for this, offering low hardware costs and rich environmental details.
    • Traditional vSLAM: Uses image data and multi-view geometry to estimate pose and construct low-level maps (e.g., sparse point clouds). Methods include filter-based (MonoSLAM [60]), keyframe-based (ORB-SLAM [61]), and direct tracking (LSD-SLAM [62]).
    • Semantic vSLAM: Integrates semantic information (object recognition) into vSLAM to enhance environment interpretation and navigation (SLAM++ [63], QuadricSLAM [64], So-SLAM [65], SG-SLAM [66], OVD-SLAM [67], GS-SLAM [68]).
  • 3D Scene Understanding: Aims to distinguish object semantics, identify locations, and infer geometric attributes from 3D data (e.g., point clouds from LiDAR or RGB-D sensors) [100]. Point clouds are sparse and irregular.
    • Projection-based methods: Project 3D points onto 2D image planes for 2D CNN-based feature extraction (MV3D [69], PointPillars [70], MVCNN [71]).
    • Voxel-based methods: Convert point clouds into regular voxel grids for 3D convolution operations (VoxNet [72], SSCNet [73]), with sparse convolution for efficiency (MinkowskiNet [74], SSCNs [75], Embodiedscan [76]).
    • Point-based methods: Directly process point clouds (PointNet [77], PointNet++ [78], PointMLP [79]). Recent advancements include Transformer-based (PointTransformer [80], Swin3d [81], PT2 [82], 3D-VisTA [83], LEO [84], PQ3D [85]) and Mamba-based (PointMamba [86], Mamba3D [87]) architectures for scalability. PQ3D [85] integrates features from multiple modalities.
  • Active Exploration: Complements passive perception by enabling robots to dynamically interact with and perceive their surroundings.
    • Interacting with the environment: Robots learn visual representations through physical interaction (Pinto et al. [88]), or transfer implicit knowledge via learned exploratory interactions across robot morphologies (Tatiya et al. [89]).
    • Changing the viewing direction: Agents learn to acquire informative visual observations by reducing uncertainty about unobserved parts of the environment using Reinforcement Learning (Jayaraman et al. [90]), or plan camera positions for informative images (NeU-NBV [91]), or predict future state values (Hu et al. [92]), or treat active recognition as sequential evidence-gathering (Fan et al. [93]).
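As a generic schematic of the "changing the viewing direction" strategy above, the following sketch repeatedly selects the candidate viewpoint expected to reduce recognition uncertainty the most and stops once the agent is confident. All helper functions and the `belief` object are hypothetical placeholders, not APIs from the cited works.

```python
import math

def entropy(probs):
    return -sum(p * math.log(p + 1e-12) for p in probs)

def active_recognition(belief, candidate_views, capture_from,
                       predict_class_probs, expected_entropy_after,
                       max_steps=8, stop_entropy=0.1):
    for _ in range(max_steps):
        probs = predict_class_probs(belief)
        if entropy(probs) < stop_entropy:        # confident enough: stop exploring
            break
        # Pick the view whose observation is expected to leave the least uncertainty.
        next_view = min(candidate_views, key=lambda v: expected_entropy_after(belief, v))
        observation = capture_from(next_view)    # move the camera and observe
        belief = belief.update(observation)      # fuse the new evidence
    return predict_class_probs(belief)
```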

4.2.3.2. Visual Language Navigation (VLN)

VLN is a crucial task where embodied agents navigate unseen environments by following linguistic instructions, requiring understanding of diverse visual observations and multi-granular instructions. The process can be represented as $Action = \mathcal{M}(O, H, I)$, where $Action$ is the chosen action, $O$ is the current observation, $H$ is the historical information, and $I$ is the natural language instruction.
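A minimal sketch of the decision loop implied by this formulation is shown below; `env` and `policy` are generic placeholders rather than interfaces defined in the survey.

```python
def run_vln_episode(env, policy, instruction, max_steps=50):
    history = []                         # H: past observations and actions
    observation = env.reset()            # O: current visual observation
    for _ in range(max_steps):
        action = policy(observation, history, instruction)   # Action = M(O, H, I)
        if action == "STOP":
            break
        history.append((observation, action))
        observation = env.step(action)   # simplified interface: step returns the next observation
    return history
```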

The following figure (Figure 6 from the original paper) shows an overview and different tasks of VLN:

Fig. 6. (a) Overview of VLN. The embodied agent communicates with humans through natural language: humans issue instructions, and the agent carries out tasks such as planning and dialog. Acting either collaboratively with humans or independently, the agent then takes actions in interactive or non-interactive environments based on its visual observations and the instructions. (b) Different tasks of VLN.

  • Datasets: Various datasets exist based on instruction granularity and interaction requirements.

    The following are the results from Table IV of the original paper:

    Dataset Year Simulator Environment Feature Size
    R2R [105] 2018 M3D I, D SbS 21,567
    R4R [106] 2019 M3D I, D SbS 200,000+
    VLN-CE [107] 2020 Habitat I, C SbS -
    TOUCHDOWN [108] 2019 - O, D SbS 9,326
    REVERIE [109] 2020 M3D I, D DGN 21,702
    SOON [110] 2021 M3D I, D DGN 3,848
    DDN [111] 2023 AT I, C DDN 30,000+
    ALFRED [112] 2020 AT I, C NwI 25,743
    OVMM [113] 2023 Habitat I, C NwI 7,892
    BEHAVIOR-1K [114] 2023 OG I, C LSNwI 1,000
    CVDN [115] 2020 M3D I, D D&O 2,050
    DialFRED [116] 2022 AT I, C D&O 53,000
  • M3D: Matterport3D

  • AT: AI2-THOR

  • OG: OMNIGIBSON

  • I: Indoor

  • D: Discrete

  • O: Outdoor

  • C: Continuous

  • SbS: Step-by-Step Instructions

  • DGN: Described Goal Navigation

  • DDN: Demand-Driven Navigation

  • NwI: Navigation with Interaction

  • LSNwI: Long-Span Navigation with Interaction

  • D&O: Dialog and Oracle

    Examples include R2R [105] (step-by-step instructions in Matterport3D), REVERIE [109] (goal navigation), ALFRED [112] (household tasks with interaction in AI2-THOR), and CVDN [115] (dialog-based navigation).

  • Method: VLN methods are categorized by their focus.

    The following are the results from Table V of the original paper:

    Method Model Year Feature
    Memory-UnderstandingBased LVERG [117] 2020 Graph Learning
    CMG [118] 2020 Adversarial Learning
    RCM [119] 2021 Reinforcement learning
    FILM [120] 2022 Semantic Map
    LM-Nav [121] 2022 Graph Learning
    HOP [122] 2022 History Modeling
    NaviLLM [123] 2024 Large Model
    FSTT [124] 2024 Test-Time Augmentation
    DiscussNav [125] 2024 Large Model
    GOAT [126] 2024 Causal Learning
    VER [127] 2024 Environment Encoder
    NaVid [128] 2024 Large Model
    Future-PredictionBased LookBY [129] 2018 Reinforcement Learning
    NvEM [130] 2021 Environment Encoder
    BGBL [131] 2022 Graph Learning
    MiC [132] 2023 Large Model
    HNR [133] 2024 Environment Encoder
    ETPNav [134] 2024 Graph Learning
    Others MCR-Agent [135] 2023 Multi-Level Model
    OVLM [136] 2023 Large Model
  • Memory-Understanding Based: Focus on perceiving and understanding the environment using historical observations or trajectories. Methods often use graph-based learning (e.g., LVERG [117], LM-Nav [121]), semantic maps (FILM [120], VER [127]), adversarial learning (CMG [118]), causal learning (GOAT [126]), and increasingly Large Models for understanding (NaviLLM [123], NaVid [128]).

  • Future-Prediction Based: Model, predict, and understand future states. These methods often use graph-based learning for waypoint prediction (BGBL [131], ETPNav [134]), environmental encoding for future observations (NvEM [130], HNR [133]), and reinforcement learning for future state forecasting (LookBY [129]). Large Models are also used for "imagining" future scenes (MiC [132]).

4.2.4. Embodied Interaction

Embodied interaction covers scenarios where agents interact with humans and the environment in physical or simulated spaces, typically focusing on Embodied Question Answering (EQA) and embodied grasping.

4.2.4.1. Embodied Question Answering (EQA)

In EQA, an agent explores an environment from a first-person perspective to gather information to answer questions. This requires autonomous exploration and decision-making about when to stop exploring and provide an answer.

The following figure (Figure 7 from the original paper) shows various types of question answering tasks:

Fig. 7. The gray box displays the scenes an agent observes during exploration. The other boxes show various types of question answering tasks. Except for the task of answering questions based on episodic memory, the agent ceases exploration once it has gathered sufficient information to answer the question.

  • Datasets:

    The following are the results from Table VI of the original paper:

    Dataset Year Type Data Sources Simulator Query Creation Answer Size
    EQA v1 [138] 2018 Active EQA SUNCG House3D Rule-Based open-ended 5,000+
    MT-EQA [139] 2019 Active EQA SUNCG House3D Rule-Based open-ended 19,000+
    MP3D-EQA [140] 2019 Active EQA MP3D Simulator based on MINOS Rule-Based open-ended 1,136
    IQUAD V1 [141] 2018 Interactive EQA AI2THOR Rule-Based multi-choice 75,000+
    VideoNavQA [142] 2019 Episodic Memory EQA SUNCG House3D Rule-Based open-ended 101,000
    SQA3D [143] 2022 QA only ScanNet Manual multi-choice 33,400
    K-EQA [144] 2023 Active EQA AI2THOR Rule-Based open-ended 60,000
    OpenEQA [145] 2024 Active EQA, Episodic Memory EQA ScanNet, HM3D Habitat Manual open-ended 1,600+
    HM-EQA [146] 2024 Active EQA HM3D Habitat VLM multi-choice 500
    S-EQA [147] 2024 Active EQA VirtualHome LLM binary
    EXPRESS-Bench [148] 2025 Exploration-aware EQA HM3D Habitat VLM open-ended 2,044

Examples include EQA v1 [138] (first dataset, synthetic indoor scenes), IQUAD V1 [141] (interactive questions requiring affordance understanding), K-EQA [144] (complex questions with logical clauses and knowledge), and OpenEQA [145] (first open-vocabulary dataset for episodic memory and active exploration).

  • Methods:
    • Neural Network Methods: Early approaches built deep neural networks, often combining Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) for vision, language, navigation, and answering modules (Das et al. [138]). Techniques like imitation learning and reinforcement learning were used for training. Later works integrated navigation and QA modules for joint training (Wu et al. [152]) or introduced Hierarchical Interactive Memory Networks for dynamic environments (Gordon et al. [141]). Some leveraged neural program synthesis and knowledge graphs for external knowledge and action planning (Tan et al. [144]).
    • LLMs/VLMs Methods: More recent methods leverage Large Language Models (LLMs) and Vision-Language Models (VLMs). For episodic memory EQA (EM-EQA), they used Blind LLMs, Socratic LLMs with language descriptions or scene graphs, or VLMs processing multiple scene frames (Majumdar et al. [145]). For active EQA (A-EQA), these methods are extended with frontier-based exploration (FBE) [154] to identify areas for exploration, building semantic maps, and using conformal prediction or image-text matching to decide when to stop exploration (Majumdar et al. [145], Sakamoto et al. [155]). Some use multiple LLM-based agents to collectively answer questions (Patel et al. [156]).
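To make the LLM/VLM-based active EQA pipeline concrete, the following is a schematic loop in the spirit of the methods above: explore frontiers, summarize observations into an episodic memory of captions, and stop once the model's answer confidence is high enough. All helper functions (`answer_with_vlm`, `caption`, `select_frontier`, `navigate_to`) are hypothetical placeholders, not APIs from the cited works.

```python
def active_eqa(env, question, answer_with_vlm, caption, select_frontier, navigate_to,
               max_steps=30, confidence_threshold=0.8):
    memory = []                                   # language descriptions of observed frames
    for _ in range(max_steps):
        answer, confidence = answer_with_vlm(question, memory)
        if confidence >= confidence_threshold:    # enough evidence gathered: stop exploring
            return answer
        frontier = select_frontier(env)           # frontier-based exploration (FBE)
        frame = navigate_to(env, frontier)
        memory.append(caption(frame))             # build up episodic memory as text
    return answer_with_vlm(question, memory)[0]   # best-effort answer once the budget is used
```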

4.2.4.2. Embodied Grasping

Embodied grasping involves performing tasks like picking and placing objects based on human instructions, combining traditional kinematic methods with LLMs and Vision-Language Models (VLMs).

The following figure (Figure 8 from the original paper) illustrates language-guided grasping tasks, human-agent-object interaction, and publication trends:

Fig. 8. (a) Language-guided grasping tasks, illustrated with different instruction types (e.g., direct object specification, spatial reasoning); (b) human-agent-object interaction; (c) publication status, showing the number of published papers per year and the field's growth.

  • Datasets: Initially, grasping datasets focused on kinematic annotations for single or cluttered objects from real (Cornell [159], OCID-Grasp [164]) or virtual environments (Jacquard [160], 6-DOF GraspNet [161], ACRONYM [162], MultiGripperGrasp [163]). With MLMs, these datasets have been augmented or reconstructed to include linguistic text, creating semantic-grasping datasets (OCID-VLG [165], ReasoningGrasp [166], CapGrasp [167]).

    The following are the results from Table VII of the original paper:

    Dataset Year Type Modality Grasp Label Gripper Finger Objects Grasps Scenes Language
    Cornell [159] 2011 Real RGB-D Rect. 2 240 8K Single ×
    Jacquard [160] 2018 Sim RGB-D Rect. 2 11K 1.1M Single ×
    6-DOF GraspNet [161] 2019 Sim 3D 6D 2 206 7.07M Single ×
    ACRONYM [162] 2021 Sim 3D 6D 2 8872 17.7M Multi ×
    MultiGripperGrasp [163] 2024 Sim 3D - 2-5 345 30.4M Single ×
    OCID-Grasp [164] 2021 Real RGB-D Rect. 2 89 75K Multi ×
    OCID-VLG [165] 2023 Real RGB-D,3D Rect. 2 89 75K Multi
    ReasoningGrasp [166] 2024 Real RGB-D 6D 2 64 99.3M Multi
    CapGrasp [167] 2024 Sim 3D - 5 1.8K 50K Single
  • Language-guided grasping: Combines MLMs for semantic scene reasoning, allowing agents to grasp based on implicit or explicit human instructions.

    • Explicit instructions: Clearly specify the object category (e.g., "Grasp the banana").
    • Implicit instructions: Require reasoning to identify the target, involving spatial reasoning (e.g., "Grasp the keyboard that is to the right of the brown kleenex box" [165]) and logical reasoning (e.g., "I am thirsty, can you give me something to drink?" [166]).
  • End-to-End Approaches: These models directly map multi-modal inputs to grasp outputs.

    • CLIPORT [168]: A language-conditioned imitation learning agent combining CLIP (a vision-language pre-trained model) with Transporter Net for semantic understanding and grasp generation. It's trained on expert demonstrations from virtual environments.
    • CROG [165]: Leverages CLIP's visual capabilities to learn grasp synthesis directly from image-text pairs.
    • Reasoning Grasping [166]: Integrates multimodal LLMs with vision-based robotic grasping to generate grasps based on semantics and vision.
    • SemGrasp [167]: Incorporates semantic information into grasp representations to generate dexterous hand grasp postures according to language instructions.
  • Modular Approaches: Break down the task into distinct modules.

    • F3RM [169]: Elevates CLIP's text-image priors into 3D space, using extracted features for language localization followed by grasp generation.
    • GaussianGrasper [170]: Constructs a 3D Gaussian field for feature distillation, performs language-based localization, and then generates grasp poses using a pre-trained grasping network.
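The modular approaches above can be sketched as a two-stage pipeline: score candidate object crops against the instruction with CLIP, then pass the selected region to a separately trained grasp generator. This is a simplified sketch rather than the pipeline of any specific cited system; `object_crops` (PIL images) and `grasp_network` are hypothetical placeholders.

```python
import torch
import clip  # OpenAI CLIP (pip install git+https://github.com/openai/CLIP.git)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def select_and_grasp(instruction, object_crops, grasp_network):
    # Encode the instruction and every candidate object crop with CLIP.
    text = clip.tokenize([instruction]).to(device)
    images = torch.stack([preprocess(c) for c in object_crops]).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(text)
        image_feat = model.encode_image(images)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
        scores = (image_feat @ text_feat.T).squeeze(-1)   # cosine similarity per crop
    target = object_crops[int(scores.argmax())]
    return grasp_network(target)    # e.g., a 6-DOF grasp pose for the referred object
```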

4.2.5. Embodied Agent

An embodied agent is an autonomous entity capable of perceiving its environment and acting to achieve objectives, with its capabilities significantly expanded by MLMs. For a task, an embodied agent performs high-level task planning (decomposing complex tasks) and low-level action planning (executing subtasks and interacting with the environment).

The following figure (Figure 9 from the original paper) shows the framework of the embodied agent:

Fig. 9. The framework of the embodied agent, spanning high-level task planning and low-level action planning (including task planning, visual captioning, and visual representation), and showing how multi-modal large models (LLMs/VLMs) are applied at both levels for embodied tasks.

4.2.5.1. Embodied Task Planning

This involves decomposing abstract and complex tasks into specific subtasks. Traditionally, this was rule-based (e.g., PDDL [173], MCTS [174], A* [175]). With LLMs, planning leverages rich embedded world knowledge for reasoning.

  • Planning utilizing the Emergent Capabilities of LLMs: LLMs can decompose tasks using chain-of-thought reasoning and internal world knowledge without explicit training. A minimal prompt-based sketch of this idea is shown after this list.
    • Translated LM [179] and Inner Monologue [180]: Break down complex tasks using internal logic.
    • ReAd [181]: A multi-agent collaboration framework that refines plans via prompts.
    • Memory banks: Store and recall past successful examples as skills for planning [182], [183], [184].
    • Code as reasoning medium: LLMs generate code based on API libraries for task planning [185], [186].
    • Multi-turn reasoning: Socratic Models [187] and Socratic Planner [188] use questioning to derive reliable plans, correcting hallucinations.
  • Planning utilizing the visual information from embodied perception model: Integrating visual information is crucial to prevent plans from deviating from actual scenarios.
    • Object detectors: Query objects in the environment, feed information back to LLMs to modify plans [187], [189], [190]. RoboGPT [8] refines this by considering different names for similar objects.
    • 3D scene graphs: SayPlan [191] uses hierarchical 3D scene graphs to represent the environment, improving task planning in large settings. ConceptGraphs [192] provides detailed open-world object detection and code-based planning via 3D scene graphs.
  • Planning utilizing the VLMs: VLMs can capture visual details and contextual information in latent space, aligning abstract visual features with structured textual features.
    • EmbodiedGPT [193]: Uses an Embodied-Former module to align embodied, visual, and textual information for task planning.
    • LEO [194]: Encodes 2D egocentric images and 3D scenes into visual tokens for 3D world perception and task execution.
    • EIF-Unknow [195]: Utilizes Semantic Feature Maps from Voxel Features as visual tokens for LLaVA model-based task planning.
    • Embodied multimodal foundation models (VLA models): Examples like RT series [2], [9], PaLM-E [196], and Matcha [197] are trained on large datasets to align visual and textual features for embodied scenarios.
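A minimal prompt-based sketch of LLM task decomposition grounded in detected objects (as discussed above) might look like the following; the `llm` callable, skill list, and prompt format are illustrative assumptions rather than the surveyed systems' actual interfaces.

```python
SKILLS = ["navigate_to(object)", "pick_up(object)", "place_on(object, surface)", "open(object)"]

def plan_task(llm, instruction, detected_objects):
    prompt = (
        "You control a household robot. Available skills:\n"
        + "\n".join(f"- {s}" for s in SKILLS)
        + f"\nObjects visible in the scene: {', '.join(detected_objects)}"
        + f"\nTask: {instruction}\n"
        "Think step by step, then output one skill call per line."
    )
    plan_text = llm(prompt)
    # Keep only lines that look like calls to known skills (a crude validity filter).
    skill_names = {s.split("(")[0] for s in SKILLS}
    return [line.strip() for line in plan_text.splitlines()
            if line.split("(")[0].strip() in skill_names]
```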

4.2.5.2. Embodied Action Planning

This step focuses on executing subtasks derived from task planning, addressing real-world uncertainties due to the insufficient granularity of high-level plans [198].

  • Action utilizing APIs: LLMs are provided with definitions of well-trained policy models (APIs) to use for specific tasks [189], [199]. They can generate code to abstract tools into a function library (Liang et al. [186]). Reflexion [200] adjusts these tools during execution, and DEPS [201] enables LLMs to learn and combine skills through zero-shot learning. This modular approach enhances flexibility but relies on the quality of external policy models. A minimal sketch of this pattern follows this list.
  • Action utilizing VLA model: This paradigm integrates task planning and action execution within the same VLA model system, reducing communication latency and improving response speed. Embodied multimodal foundation models (RT series [10], EmbodiedGPT [193]) tightly integrate perception, decision-making, and execution for efficient handling of complex tasks and dynamic environments. This allows for real-time feedback and strategy self-adjustment.
  • Scalability in Diverse Environments: Strategies include hierarchical SLAM for mapping, multimodal perception, energy-efficient edge computing, multi-agent systems, decentralized communication, and domain adaptation for generalization in new environments.
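A minimal sketch of the "action utilizing APIs" pattern: low-level policies are exposed to the planner as a small function library, and the generated plan is executed step by step, surfacing failures so the planner can replan. The skill implementations and the `robot` handle are hypothetical placeholders.

```python
class SkillLibrary:
    def __init__(self, robot):
        self.robot = robot
        self.skills = {
            "navigate_to": self.navigate_to,
            "pick_up": self.pick_up,
            "place_on": self.place_on,
        }

    def navigate_to(self, target):          # wraps a trained navigation policy
        return self.robot.navigate(target)

    def pick_up(self, obj):                 # wraps a trained grasping policy
        return self.robot.grasp(obj)

    def place_on(self, obj, surface):
        return self.robot.place(obj, surface)

    def execute(self, plan):
        for step in plan:                    # e.g., "pick_up(apple)"
            name, args = step.split("(", 1)
            name = name.strip()
            args = [a.strip() for a in args.rstrip(")").split(",") if a.strip()]
            ok = self.skills[name](*args)
            if not ok:                       # surface failures so the planner can replan
                return False
        return True
```

A plan produced by the task-planning sketch earlier in this section (e.g., `["navigate_to(apple)", "pick_up(apple)"]`) can be passed directly to `execute()`, which is what makes the two layers composable.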

4.2.6. Sim-to-Real Adaptation

Sim-to-Real adaptation is the process of transferring capabilities learned in simulated environments (cyberspace) to real-world scenarios (physical world), ensuring robust and reliable performance. It involves embodied world models, data collection and training methods, and embodied control algorithms.

4.2.6.1. Embodied World Model

These models predict the next state to inform decisions and are crucial for developing physical intuition. They are generally trained from scratch on physical world data, unlike VLA models which are pre-trained.

The following figure (Figure 10 from the original paper) shows three types of embodied world models:

Fig. 10. Embodied world models can be roughly divided into three types. (a) Generation-based Methods learn the transformation relation between the input space and the output space using an autoencoder framework. (b) Prediction-based Methods are more general frameworks where a world model is trained in latent space. (c) Knowledge-driven Methods inject artificially constructed knowledge into the model, giving the model world knowledge to obtain output that meets the given knowledge constraints. Note that the components within the dashed line are optional.

  • Generation-based Methods: Generative models learn to understand and produce data (images [203], videos [16], [204], point clouds [205], or other formats [206]) that adhere to physical laws, thereby internalizing world knowledge. This enhances model generalization, robustness, adaptability, and predictive accuracy. Examples include World Models [203], Sora [16], Pandora [204], 3D-VLA [205], and DWM [206].
  • Prediction-based Methods: These models predict and understand the environment by constructing and utilizing internal representations in a latent space. By reconstructing features based on conditions, they capture deeper semantics and world knowledge, enabling robots to perceive essential environmental representations (e.g., I-JEPA [15], MC-JEPA [207], A-JEPA [208], Point-JEPA [209], IWM [210]) and perform downstream tasks (iVideoGPT [211], IRASim [212], STP [213], MuDreamer [214]). The latent space processing allows for abstracting and decoupling knowledge, leading to better generalization. A simplified training-step sketch of this idea follows this list.
  • Knowledge-driven Methods: These models are endowed with world knowledge by injecting artificially constructed knowledge.
    • Real2Sim2Real [217]: Uses real-world knowledge to build physics-compliant simulators for robot training.
    • Constructing common sense/physics-compliant knowledge for generative models or simulators (e.g., ElastoGen [218], One-2-3-45 [219], PLoT [220]).
    • Combining artificial physical rules with LLMs/MLMs to generate diverse and semantically rich scenes through automatic spatial layout optimization (e.g., Holodeck [56], LEGENT [221], GRUtopia [222]).
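A simplified training step for a prediction-based (JEPA-style) latent world model is sketched below: encode the current and next observations, predict the next latent from the current latent and the action, and regress onto the target latent. Real methods typically add a separate momentum (EMA) target encoder and other safeguards against representation collapse; the network sizes here are illustrative placeholders.

```python
import torch
import torch.nn as nn

class LatentWorldModel(nn.Module):
    def __init__(self, obs_dim=128, action_dim=8, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
        self.predictor = nn.Sequential(nn.Linear(latent_dim + action_dim, 256), nn.ReLU(),
                                       nn.Linear(256, latent_dim))

    def forward(self, obs, action, next_obs):
        z = self.encoder(obs)
        with torch.no_grad():                       # target latent is not back-propagated through
            z_next_target = self.encoder(next_obs)
        z_next_pred = self.predictor(torch.cat([z, action], dim=-1))
        return nn.functional.mse_loss(z_next_pred, z_next_target)

model = LatentWorldModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
obs, action, next_obs = torch.randn(32, 128), torch.randn(32, 8), torch.randn(32, 128)
optimizer.zero_grad()
loss = model(obs, action, next_obs)   # one training step on a dummy transition batch
loss.backward()
optimizer.step()
```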

4.2.6.2. Data Collection and Training

High-quality data is critical for sim-to-real adaptation.

  • Real-World Data: Collecting real-world robotic data is time-consuming and expensive. Efforts are focused on creating large, diverse datasets to enhance generalization. Examples include Open X-Embodiment [202] (data from 22 robots, 527 skills, 160,266 tasks), UMI [224] (framework for bimanual data), Mobile ALOHA [225] (data for full-body mobile manipulation), and human-agent collaboration [226] for data quality.

  • Simulated Data: Simulation-based data collection offers cost-effectiveness and efficiency. Examples include CLIPORT [168] and Transporter Networks [227] using Pybullet data, GAPartNet [228] for part-level annotations, and SemGrasp [167] creating CapGrasp for semantic hand grasping.

  • Sim-to-Real Paradigms: Various paradigms mitigate the need for extensive real-world data.

    The following figure (Figure 11 from the original paper) shows five paradigms for sim-to-real transfer:

    Fig. 11. Paradigms for transferring agents trained in virtual environments to the real world, including Real2Sim2Real, TRANSIC, Domain Randomization, and Lang4Sim2Real, highlighting the associated model training and transfer steps.

  • Real2Sim2Real [229]: Improves imitation learning by using reinforcement learning in a "digital twin" simulation, then transferring strategies to the real world.

  • TRANSIC [230]: Reduces the sim-to-real gap through real-time human intervention and residual policy training based on corrected behaviors.

  • Domain Randomization [231], [232], [233]: Increases model generalization by varying simulation parameters to cover real-world conditions. A minimal sketch of this idea appears at the end of this subsection.

  • System Identification [234], [235]: Creates accurate simulations of real-world scenes to ensure smooth transitions.

  • Lang4sim2real [236]: Leverages natural language descriptions to bridge the sim-to-real gap, improving model generalization with cross-domain image representations.

  • ARIO (All Robots in One): Proposed as a new dataset standard [237] to overcome limitations of existing datasets (lack of comprehensive sensory modalities, unified format, diverse control object representation, data volume, and combined simulated/real data). ARIO unifies control and motion data from diverse robots, facilitating training of high-performing, generalizable embodied AI models. The ARIO dataset comprises approximately 3 million episodes from 258 series and 321,064 tasks.

    The following figure (Figure 12 from the original paper) shows exemplar tasks from ARIO:

    Fig. 12. Exemplar tasks from ARIO, where the top row indicates the task category while the text at the bottom row provides task instructions.

  • Real-world Deployments of Embodied AI Systems: Embodied AI systems are deployed in healthcare (e.g., Da Vinci Surgical System), logistics (Amazon Robotics, Boston Dynamics' Stretch), and manufacturing (Fanuc, ABB), enhancing precision and efficiency.
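Returning to the Domain Randomization paradigm listed above, the following is a minimal sketch of the idea: every training episode runs in a simulator whose physical and visual parameters are drawn from broad ranges, forcing the policy to cope with variation that is meant to cover real-world conditions. `make_sim_env` and `train_episode` are hypothetical placeholders.

```python
import random

PARAM_RANGES = {
    "friction":        (0.4, 1.2),
    "object_mass":     (0.05, 0.5),   # kg
    "light_intensity": (0.3, 1.5),
    "camera_jitter":   (0.0, 0.02),   # meters of random camera offset
}

def sample_domain_params():
    return {k: random.uniform(lo, hi) for k, (lo, hi) in PARAM_RANGES.items()}

def train_with_domain_randomization(make_sim_env, train_episode, policy, n_episodes=10_000):
    for _ in range(n_episodes):
        env = make_sim_env(**sample_domain_params())   # new randomized world each episode
        train_episode(policy, env)
    return policy
```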

5. Experimental Setup

As this paper is a comprehensive survey, it does not present new experimental results from its own methodology. Instead, it reviews the experimental setups, datasets, evaluation metrics, and baselines used by the numerous research papers it surveys across various Embodied AI tasks. This section summarizes the common practices and key components of these experimental setups as described in the survey.

5.1. Datasets

The survey highlights a wide array of datasets tailored for different Embodied AI tasks, reflecting the diversity and complexity of the field. These datasets typically provide visual (RGB, depth, point clouds), textual (instructions, questions), and action data, often collected in or simulated from indoor or outdoor environments.

  • For Embodied Perception (Visual Language Navigation - VLN):

    • Source & Characteristics: Many VLN datasets are built on 3D scanned or procedurally generated indoor environments, often leveraging platforms like Matterport3D [53], AI2-THOR [50], Habitat [107], SUNCG [73], or OmniGibson [51]. Instructions range from step-by-step (SbS) to described goal navigation (DGN) or demand-driven navigation (DDN). Environments can be discrete (graph-based navigation) or continuous (free movement).
    • Examples:
      • R2R [105] (Room to Room): Uses Matterport3D for SbS instructions in indoor discrete environments.
      • ALFRED [112]: Built on AI2-THOR, focusing on Navigation with Interaction (NwI) for household tasks in continuous indoor environments.
      • TOUCHDOWN [108]: Uses Google Street View, providing outdoor SbS navigation.
      • BEHAVIOR-1K [114]: Based on OmniGibson, for long-span navigation with interaction (LSNwI) tasks.
    • Data Sample: For R2R, a data sample would include a starting panorama, a target location, and a natural language instruction like "Walk down the stairs, turn left, and stop at the red chair."
  • For Embodied Interaction (Embodied Question Answering - EQA):

    • Source & Characteristics: EQA datasets require agents to explore and gather information to answer questions. They use environments like SUNCG [73], Matterport3D [151], AI2-THOR [50], ScanNet [143], and HM3D [145]. Questions can be open-ended, multi-choice, or binary, and types vary (e.g., location, color, existence, counting, spatial relationships, knowledge-based).
    • Examples:
      • EQA v1 [138]: Built on SUNCG within House3D, with rule-based questions on object attributes.
      • IQUAD V1 [141]: Based on AI2-THOR, requires agents to interact with dynamic environments to answer questions.
      • OpenEQA [145]: Uses ScanNet and HM3D in Habitat, supporting open-vocabulary and both episodic memory and active exploration.
    • Data Sample: An EQA sample might involve an agent in a kitchen, given the question "What color is the refrigerator?" and requiring navigation to find and identify the object.
  • For Embodied Interaction (Embodied Grasping):

    • Source & Characteristics: Grasping datasets typically provide RGB-D images, point clouds, or 3D scenes annotated with grasp configurations (e.g., planar grasp rectangles for 4-DOF grasping, or full 6-DOF grasp poses for grippers). With the rise of MLMs, these datasets are increasingly augmented with linguistic text for semantic grasping.
    • Examples:
      • Cornell [159]: Real-world RGB-D images with rectangle grasp annotations for single objects.
      • 6-DOF GraspNet [161]: Simulated 3D data with 6D grasp poses.
      • OCID-VLG [165]: Real RGB-D and 3D data with semantic expressions, linking language, vision, and grasping.
      • CapGrasp [167]: Simulated 3D data with semantic descriptions for dexterous hand grasping.
    • Data Sample: A grasping sample would include an image of an object (e.g., a banana) and an instruction like "Grasp the banana," with corresponding valid grasp configurations.
  • For Sim-to-Real Adaptation & General-Purpose Agents:

    • ARIO (All Robots In One) [237]: This proposed standard and dataset aim to unify control and motion data from diverse robot morphologies in a consistent format. It combines simulated and real data to address the sim-to-real gap and facilitate large-scale pretraining for general-purpose embodied agents. The dataset is vast, with approximately 3 million episodes from 258 series and 321,064 tasks.

    • Open X-Embodiment [202]: A large-scale real-world dataset providing data from 22 robots with 527 skills and 160,266 tasks in domestic settings.

      The choice of datasets is driven by the need to validate specific aspects of Embodied AI, from low-level perception and control to high-level reasoning and interaction. Synthetic datasets allow for scale and control, while real-world datasets ensure relevance and robustness. The increasing integration of language into datasets reflects the growing role of MLMs.
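
      To make the "Data Sample" descriptions above concrete, the following Python sketch shows how such samples might be represented. All field names and values are hypothetical illustrations; actual datasets such as R2R, OpenEQA, and OCID-VLG define their own schemas.

```python
# Hypothetical, simplified representations of the data samples described above.
# Field names are illustrative; real datasets define their own formats.

# VLN sample (R2R-style): a start/goal in a scanned scene plus an instruction.
vln_sample = {
    "scan": "matterport_scene_001",           # 3D environment identifier
    "start_viewpoint": "vp_0012",             # starting panorama node
    "goal_viewpoint": "vp_0047",              # target panorama node
    "instruction": "Walk down the stairs, turn left, and stop at the red chair.",
    "path": ["vp_0012", "vp_0018", "vp_0033", "vp_0047"],  # ground-truth nodes
}

# EQA sample: a question that requires exploring the scene before answering.
eqa_sample = {
    "scene": "kitchen_03",
    "question": "What color is the refrigerator?",
    "answer": "white",
    "answer_type": "open-ended",
}

# Grasping sample: an observation paired with an instruction and grasp labels.
grasp_sample = {
    "rgb": "images/banana_0001.png",          # path to RGB frame
    "depth": "depth/banana_0001.png",         # aligned depth map
    "instruction": "Grasp the banana.",
    # planar grasp rectangle: (center_x, center_y, width, height, angle_deg)
    "grasp_rects": [(312, 204, 80, 35, 27.0)],
    # 6-DOF grasp pose: translation (m) + quaternion (x, y, z, w)
    "grasp_poses_6dof": [((0.42, -0.10, 0.08), (0.0, 0.707, 0.0, 0.707))],
}
```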

5.2. Evaluation Metrics

The evaluation metrics used in Embodied AI are highly task-dependent, reflecting different aspects of an agent's performance, such as navigation efficiency, task completion, interaction accuracy, and semantic understanding. A minimal code sketch of the most common of these metrics is given at the end of this subsection.

  • For Visual Language Navigation (VLN):

    1. Success Rate (SR):
      • Conceptual Definition: Measures the percentage of times an agent successfully reaches the target location or completes the instructed task. It is a primary indicator of task completion.
      • Mathematical Formula: $ \mathrm{SR} = \frac{\text{Number of successful episodes}}{\text{Total number of episodes}} \times 100\% $
      • Symbol Explanation:
        • Number of successful episodes: The count of navigation attempts where the agent reaches the goal.
        • Total number of episodes: The total count of navigation attempts.
    2. Success weighted by Path Length (SPL):
      • Conceptual Definition: This metric balances Success Rate with path efficiency. It penalizes agents that take overly long routes to reach the goal, even if they succeed. A higher SPL indicates both successful and efficient navigation.
      • Mathematical Formula: $ \mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \frac{L_{shortest,i}}{\max(P_i, L_{shortest,i})} $
      • Symbol Explanation:
        • $N$: The total number of episodes.
        • $S_i$: Binary indicator, 1 if episode $i$ is successful, 0 otherwise.
        • $L_{shortest,i}$: Length of the shortest path from the start to the goal in episode $i$.
        • $P_i$: Length of the path taken by the agent in episode $i$.
    3. Navigation Error (NE):
      • Conceptual Definition: Measures the geodesic distance between the agent's final position and the target goal position. Lower NE indicates more accurate navigation.
      • Mathematical Formula: $ \mathrm{NE} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{geodesic\_distance}(\text{final\_pos}_i, \text{goal\_pos}_i) $
      • Symbol Explanation:
        • $N$: The total number of episodes.
        • $\text{final\_pos}_i$: The agent's final position in episode $i$.
        • $\text{goal\_pos}_i$: The target goal position in episode $i$.
        • $\mathrm{geodesic\_distance}(\cdot, \cdot)$: The shortest path distance between two points in the environment.
    4. Path Length (PL):
      • Conceptual Definition: The total length of the path traversed by the agent during an episode. Often reported for successful episodes to assess efficiency alongside SPL.
      • Mathematical Formula: $ \mathrm{PL} = \frac{1}{\text{Num successful episodes}} \sum_{i \in \text{successful episodes}} P_i $
      • Symbol Explanation:
        • Num successful episodes: The count of successful navigation attempts.
        • $P_i$: Length of the path taken by the agent in successful episode $i$.
  • For Embodied Question Answering (EQA):

    1. Accuracy (Acc):
      • Conceptual Definition: The proportion of questions for which the agent provides the correct answer. For multiple-choice questions, it's a direct count; for open-ended, it often involves semantic similarity or exact match.
      • Mathematical Formula: $ \mathrm{Acc} = \frac{\text{Number of correctly answered questions}}{\text{Total number of questions}} \times 100\% $
      • Symbol Explanation:
        • Number of correctly answered questions: Count of questions where the agent's answer matches the ground truth.
        • Total number of questions: Total count of questions posed to the agent.
    2. Success (S) / Answer F1: For open-ended EQA, F1-score may be used to evaluate the quality of generated answers by comparing them to reference answers, especially when multiple correct phrasings are possible.
      • Conceptual Definition: The F1-score is the harmonic mean of precision and recall, providing a single metric that balances both. Precision measures the proportion of agent's answers that are correct, while recall measures the proportion of correct answers that the agent identified.
      • Mathematical Formula: $ \mathrm{F1} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $ Where: $ \mathrm{Precision} = \frac{\text{Number of correct words in agent's answer}}{\text{Total words in agent's answer}} $ $ \mathrm{Recall} = \frac{\text{Number of correct words in agent's answer}}{\text{Total words in ground truth answer}} $
      • Symbol Explanation:
        • $\mathrm{Precision}$: The proportion of the agent's predicted words that are relevant.
        • $\mathrm{Recall}$: The proportion of relevant words that are correctly predicted by the agent.
        • Number of correct words in agent's answer: Overlap between agent's answer and ground truth.
        • Total words in agent's answer: Total words in the agent's generated response.
        • Total words in ground truth answer: Total words in the reference answer.
  • For Embodied Grasping:

    1. Grasp Success Rate:
      • Conceptual Definition: The percentage of attempts where the robot successfully grasps the target object in a stable manner.
      • Mathematical Formula: $ \mathrm{Grasp\ Success\ Rate} = \frac{\text{Number of successful grasps}}{\text{Total number of grasp attempts}} \times 100\% $
      • Symbol Explanation:
        • Number of successful grasps: Count of instances where the robot successfully picked up and held the object.
        • Total number of grasp attempts: Total count of attempts to grasp an object.
    2. Placement Success Rate:
      • Conceptual Definition: The percentage of times the robot successfully places a grasped object in the specified target location.
      • Mathematical Formula: $ \mathrm{Placement\ Success\ Rate} = \frac{\text{Number of successful placements}}{\text{Total number of placement attempts}} \times 100\% $
      • Symbol Explanation:
        • Number of successful placements: Count of instances where the robot successfully deposited the object at the target.
        • Total number of placement attempts: Total count of attempts to place an object.
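
As a concrete companion to the formulas above, the following minimal Python sketch computes SR, SPL, NE, and a token-level F1 for open-ended answers. It assumes each episode record already exposes the agent's path length, the shortest-path length, a success flag, and the final geodesic distance to the goal, as typically reported by the simulator; the field names are illustrative.

```python
# Minimal sketch of the evaluation metrics defined above.
from collections import Counter
from typing import Dict, List


def success_rate(episodes: List[Dict]) -> float:
    """SR: fraction of episodes marked successful, in percent."""
    return 100.0 * sum(e["success"] for e in episodes) / len(episodes)


def spl(episodes: List[Dict]) -> float:
    """Success weighted by Path Length: rewards both success and efficiency."""
    total = 0.0
    for e in episodes:
        if e["success"]:
            total += e["shortest_path_length"] / max(
                e["path_length"], e["shortest_path_length"]
            )
    return total / len(episodes)


def navigation_error(episodes: List[Dict]) -> float:
    """NE: mean geodesic distance between the final position and the goal."""
    return sum(e["final_geodesic_distance"] for e in episodes) / len(episodes)


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a generated answer and a reference answer."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = Counter(pred) & Counter(ref)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Example usage with two toy episodes:
episodes = [
    {"success": 1, "path_length": 12.0, "shortest_path_length": 10.0,
     "final_geodesic_distance": 0.5},
    {"success": 0, "path_length": 20.0, "shortest_path_length": 8.0,
     "final_geodesic_distance": 6.0},
]
print(success_rate(episodes), spl(episodes), navigation_error(episodes))
print(token_f1("the refrigerator is white", "white"))
```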

5.3. Baselines

The survey implicitly discusses various baselines against which new Embodied AI methods are compared. These baselines represent the state-of-the-art or common approaches prior to the introduction of specific innovations.

  • Traditional Deep Reinforcement Learning (DRL) Approaches: Many early VLN and EQA models were based on DRL, where an agent learns policies through trial and error in the environment. These are often considered representative as they learn behaviors from scratch.

  • Imitation Learning (IL) Methods: Many embodied control and grasping tasks use IL, where a model learns by observing expert demonstrations. CLIPORT [168] is a language-conditioned IL agent.

  • Rule-Based or Symbolic Planning Methods: For task planning, traditional approaches often rely on explicit rules, logical reasoning (e.g., PDDL [173]), or search algorithms (MCTS [174], A* [175]). These are representative of non-LLM-driven planning.

  • Earlier Vision-Language Models: Prior to MLMs, vision-language models with more limited capacity or specific architectures served as baselines for tasks like VLN and EQA. The survey highlights that MLMs have injected "strong perception, interaction and planning capabilities" compared to these earlier models.

  • Model-Free vs. Model-Based RL: In VLN, LookBY [129] is mentioned as bridging model-free (learning directly from interactions) and model-based (World Model-driven prediction) RL, indicating these as common baseline categories.

  • Without External Knowledge: For K-EQA [144], baselines would be EQA models that do not leverage external knowledge bases for answering complex questions.

  • Sim-only Trained Policies: In Sim-to-Real adaptation, a primary baseline is a policy trained purely in simulation without any domain randomization, system identification, or transfer learning techniques. The performance of such a policy in the real world highlights the sim-to-real gap.

    These baselines are chosen because they represent the established or prior state-of-the-art methods against which new MLM and WM-based approaches aim to demonstrate superior generalization, adaptability, reasoning, and perception capabilities in complex and dynamic embodied AI tasks.

6. Results & Analysis

As a survey paper, this document does not present new experimental results from its own research. Instead, it synthesizes and analyzes the findings from a vast body of existing research in Embodied AI. The "results" discussed here are the overall trends, advancements, and the current state of the art revealed through the comparative review of numerous studies.

6.1. Core Results Analysis

The survey highlights a significant shift and acceleration in Embodied AI research, primarily driven by the emergence of Multi-modal Large Models (MLMs) and World Models (WMs).

  • Enhanced Perception and Reasoning: MLMs have endowed embodied agents with remarkable perception and reasoning capabilities, enabling them to understand complex and dynamic environments more thoroughly than traditional methods. The integration of pre-trained visual representations (e.g., from CLIP, BLIP-2) with Large Language Models (LLMs) has provided precise object estimations, improved linguistic instruction understanding, and robust alignment of visual and linguistic features.

  • Improved Interaction: MLMs facilitate more natural human-robot interaction, allowing agents to understand nuanced instructions and perform complex tasks like embodied grasping based on semantic reasoning (both explicit and implicit instructions). Embodied Question Answering (EQA) has progressed from simple object queries to complex, knowledge-intensive questions, often requiring active exploration.

  • Advanced Agent Architectures: Embodied agents are evolving towards general-purpose capabilities, moving beyond single-task AI. The survey shows a trend towards hierarchical task planning (decomposing abstract goals into subtasks) and action planning (executing low-level steps). LLMs are increasingly used for high-level planning, while Vision-Language-Action (VLA) models integrate perception, decision-making, and execution, reducing latency and improving adaptation.

  • Progress in Sim-to-Real Adaptation: The paper emphasizes the critical need for sim-to-real adaptation to transfer learned capabilities from cost-effective and safe simulations to the physical world. Embodied World Models are shown to be highly promising, as they learn predictive representations of physical laws, allowing agents to develop intuition and simulate future states. Various sim-to-real paradigms (e.g., Domain Randomization, Real2Sim2Real, System Identification) are actively being researched to bridge the domain gap.

  • Growth of Datasets and Simulators: The field is characterized by a continuous development of more realistic and diverse embodied simulators (both general-purpose and real-scene-based) and datasets that are larger-scale, multi-modal, and increasingly designed for complex, long-horizon tasks. The proposal of ARIO underscores the community's effort towards standardized, versatile datasets for general-purpose embodied agents.

    Overall, the survey concludes that MLMs and WMs are providing a feasible and powerful approach for Embodied AI to align cyberspace intelligence with physical world interaction, bringing the field closer to achieving AGI. However, significant challenges remain, particularly in data quality, long-horizon autonomy, causal reasoning, and robust evaluation.

6.2. Data Presentation (Tables)

The following are the results from Table I of the original paper:

| Type | Environment | Physical Entities | Description | Representative Agents |
| --- | --- | --- | --- | --- |
| Disembodied AI | Cyber Space | No | Cognition and physical entities are disentangled | ChatGPT [4], RoboGPT [8] |
| Embodied AI | Physical Space | Robots, Cars, Other devices | Cognition is integrated into physical entities | RT-1 [9], RT-2 [10], RT-H [3] |

The following are the results from Table II of the original paper:

Simulator Year Rendering Robotics-specific features CP Physics Engine Main Applications
HFPS HQGR RRL DLS LSPC ROS MSS
Genesis [35] 2024 O o O O O O Custom RL, LSPS, RS
Isaac Sim [36] 2023 O O O O O O O PhysX Nav, AD
Isaac Gym [37] 2019 O O O PhysX RL,LSPS
Gazebo [38] 2004 O O O O O ODE, Bullet, Simbody, DART Nav,MR
PyBullet [39] 2017 O O Bullet RL,RS
Webots [40] 1996 O O O O ODE RS
MuJoCo [41] 2012 O o Custom RL, RS
Unity ML-Agents [42] 2017 O O O Custom RL, RS
AirSim [43] 2017 O o Custom Drone sim, AD, RL
MORSE [44] 2015 o O Bullet Nav, MR
V-REP (CoppeliaSim) [45] 2013 O O O O Bullet, ODE, Vortex, Newton MR, RS

The following are the results from Table III of the original paper:

| Function | Type | Methods |
| --- | --- | --- |
| vSLAM | Traditional vSLAM | MonoSLAM [60], ORB-SLAM [61], LSD-SLAM [62] |
| vSLAM | Semantic vSLAM | SLAM++ [63], QuadricSLAM [64], So-SLAM [65], SG-SLAM [66], OVD-SLAM [67], GS-SLAM [68] |
| 3D Scene Understanding | Projection-based | MV3D [69], PointPillars [70], MVCNN [71] |
| 3D Scene Understanding | Voxel-based | VoxNet [72], SSCNet [73], MinkowskiNet [74], SSCNs [75], Embodiedscan [76] |
| 3D Scene Understanding | Point-based | PointNet [77], PointNet++ [78], PointMLP [79], PointTransformer [80], Swin3d [81], PT2 [82], 3D-VisTA [83], LEO [84], PQ3D [85], PointMamba [86], Mamba3D [87] |
| Active Exploration | Interacting with the environment | Pinto et al. [88], Tatiya et al. [89] |
| Active Exploration | Changing the viewing direction | Jayaraman et al. [90], NeU-NBV [91], Hu et al. [92], Fan et al. [93] |

The following are the results from Table IV of the original paper:

| Dataset | Year | Simulator | Environment | Feature | Size |
| --- | --- | --- | --- | --- | --- |
| R2R [105] | 2018 | M3D | I, D | SbS | 21,567 |
| R4R [106] | 2019 | M3D | I, D | SbS | 200,000+ |
| VLN-CE [107] | 2020 | Habitat | I, C | SbS | - |
| TOUCHDOWN [108] | 2019 | - | O, D | SbS | 9,326 |
| REVERIE [109] | 2020 | M3D | I, D | DGN | 21,702 |
| SOON [110] | 2021 | M3D | I, D | DGN | 3,848 |
| DDN [111] | 2023 | AT | I, C | DDN | 30,000+ |
| ALFRED [112] | 2020 | AT | I, C | NwI | 25,743 |
| OVMM [113] | 2023 | Habitat | I, C | NwI | 7,892 |
| BEHAVIOR-1K [114] | 2023 | OG | I, C | LSNwI | 1,000 |
| CVDN [115] | 2020 | M3D | I, D | D&O | 2,050 |
| DialFRED [116] | 2022 | AT | I, C | D&O | 53,000 |

The following are the results from Table V of the original paper:

| Method | Model | Year | Feature |
| --- | --- | --- | --- |
| Memory-Understanding Based | LVERG [117] | 2020 | Graph Learning |
| Memory-Understanding Based | CMG [118] | 2020 | Adversarial Learning |
| Memory-Understanding Based | RCM [119] | 2021 | Reinforcement Learning |
| Memory-Understanding Based | FILM [120] | 2022 | Semantic Map |
| Memory-Understanding Based | LM-Nav [121] | 2022 | Graph Learning |
| Memory-Understanding Based | HOP [122] | 2022 | History Modeling |
| Memory-Understanding Based | NaviLLM [123] | 2024 | Large Model |
| Memory-Understanding Based | FSTT [124] | 2024 | Test-Time Augmentation |
| Memory-Understanding Based | DiscussNav [125] | 2024 | Large Model |
| Memory-Understanding Based | GOAT [126] | 2024 | Causal Learning |
| Memory-Understanding Based | VER [127] | 2024 | Environment Encoder |
| Memory-Understanding Based | NaVid [128] | 2024 | Large Model |
| Future-Prediction Based | LookBY [129] | 2018 | Reinforcement Learning |
| Future-Prediction Based | NvEM [130] | 2021 | Environment Encoder |
| Future-Prediction Based | BGBL [131] | 2022 | Graph Learning |
| Future-Prediction Based | MiC [132] | 2023 | Large Model |
| Future-Prediction Based | HNR [133] | 2024 | Environment Encoder |
| Future-Prediction Based | ETPNav [134] | 2024 | Graph Learning |
| Others | MCR-Agent [135] | 2023 | Multi-Level Model |
| Others | OVLM [136] | 2023 | Large Model |
The following are the results from Table VI of the original paper:

| Dataset | Year | Type | Data Sources | Simulator | Query Creation | Answer | Size |
| --- | --- | --- | --- | --- | --- | --- | --- |
| EQA v1 [138] | 2018 | Active EQA | SUNCG | House3D | Rule-Based | open-ended | 5,000+ |
| MT-EQA [139] | 2019 | Active EQA | SUNCG | House3D | Rule-Based | open-ended | 19,000+ |
| MP3D-EQA [140] | 2019 | Active EQA | MP3D | Simulator based on MINOS | Rule-Based | open-ended | 1,136 |
| IQUAD V1 [141] | 2018 | Interactive EQA | AI2THOR | - | Rule-Based | multi-choice | 75,000+ |
| VideoNavQA [142] | 2019 | Episodic Memory EQA | SUNCG | House3D | Rule-Based | open-ended | 101,000 |
| SQA3D [143] | 2022 | QA only | ScanNet | - | Manual | multi-choice | 33,400 |
| K-EQA [144] | 2023 | Active EQA | AI2THOR | - | Rule-Based | open-ended | 60,000 |
| OpenEQA [145] | 2024 | Active EQA, Episodic Memory EQA | ScanNet, HM3D | Habitat | Manual | open-ended | 1,600+ |
| HM-EQA [146] | 2024 | Active EQA | HM3D | Habitat | VLM | multi-choice | 500 |
| S-EQA [147] | 2024 | Active EQA | VirtualHome | - | LLM | binary | - |
| EXPRESS-Bench [148] | 2025 | Exploration-aware EQA | HM3D | Habitat | VLM | open-ended | 2,044 |

The following are the results from Table VII of the original paper:

| Dataset | Year | Type | Modality | Grasp Label | Gripper Fingers | Objects | Grasps | Scenes | Language |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cornell [159] | 2011 | Real | RGB-D | Rect. | 2 | 240 | 8K | Single | × |
| Jacquard [160] | 2018 | Sim | RGB-D | Rect. | 2 | 11K | 1.1M | Single | × |
| 6-DOF GraspNet [161] | 2019 | Sim | 3D | 6D | 2 | 206 | 7.07M | Single | × |
| ACRONYM [162] | 2021 | Sim | 3D | 6D | 2 | 8,872 | 17.7M | Multi | × |
| MultiGripperGrasp [163] | 2024 | Sim | 3D | - | 2-5 | 345 | 30.4M | Single | × |
| OCID-Grasp [164] | 2021 | Real | RGB-D | Rect. | 2 | 89 | 75K | Multi | × |
| OCID-VLG [165] | 2023 | Real | RGB-D, 3D | Rect. | 2 | 89 | 75K | Multi | ✓ |
| ReasoningGrasp [166] | 2024 | Real | RGB-D | 6D | 2 | 64 | 99.3M | Multi | ✓ |
| CapGrasp [167] | 2024 | Sim | 3D | - | 5 | 1.8K | 50K | Single | ✓ |

6.3. Ablation Studies / Parameter Analysis

While the survey itself does not conduct ablation studies, it highlights the importance of such analyses implicitly by discussing various components of Embodied AI systems. In the context of the surveyed papers, ablation studies are crucial for:

  • Validating Component Effectiveness: Researchers in Embodied AI regularly conduct ablation studies to verify the contribution of individual modules or novel architectural choices (e.g., specific multi-modal fusion strategies, memory mechanisms, planning components, sim-to-real transfer techniques). For instance, in VLN or EQA tasks, studies often compare performance with and without semantic mapping, dialogue history, or different visual encoders.

  • Understanding Hyper-parameter Impact: The performance of embodied agents, especially those based on Reinforcement Learning or Large Models, is highly sensitive to hyper-parameters (e.g., learning rates, reward function weights, network sizes, exploration strategies). Surveyed papers would typically perform parameter sweeps or sensitivity analyses to demonstrate robustness or identify optimal configurations.

  • Generalization and Robustness: Ablation studies often involve testing components under varying conditions (e.g., unseen environments, different object configurations, varying levels of noise) to assess how robustly a particular design generalizes. This is particularly relevant for sim-to-real adaptation where the impact of domain randomization parameters or the effectiveness of policy distillation techniques would be ablated.

    The detailed taxonomy presented in the survey, breaking down Embodied AI into perception, interaction, agent, and sim-to-real, directly supports the design of such ablation studies, allowing researchers to isolate and evaluate the contribution of advancements in each sub-area. The push for unified evaluation benchmarks and comprehensive datasets like ARIO is also aimed at making these comparative analyses more standardized and meaningful across the field.

7. Conclusion & Reflections

7.1. Conclusion Summary

This comprehensive survey thoroughly reviews the advancements in Embodied Artificial Intelligence (Embodied AI), positioning it as a critical pathway toward achieving Artificial General Intelligence (AGI). The paper emphasizes how the emergence of Multi-modal Large Models (MLMs) and World Models (WMs) has fundamentally transformed Embodied AI, endowing agents with unprecedented perception, interaction, and reasoning capabilities essential for bridging the gap between cyberspace and the physical world.

The survey provides a detailed taxonomy, starting with a review of embodied robots and simulators (both general and real-scene-based). It then delves into four core research areas: embodied perception (including active visual perception and visual language navigation), embodied interaction (covering embodied question answering and grasping), embodied agents (examining multi-modal foundation models and task planning), and sim-to-real adaptation (focusing on embodied world models, data collection, and training paradigms). A significant contribution is the proposal of ARIO (All Robots In One), a new unified dataset standard and a large-scale dataset designed to facilitate the development of versatile, general-purpose embodied agents. Finally, the survey outlines the pressing challenges and promising future directions, aiming to serve as a foundational reference for the research community.

7.2. Limitations & Future Work

The authors clearly identify several critical challenges and limitations that Embodied AI currently faces, which also point to key future research directions:

  • High-quality Robotic Datasets:

    • Limitation: A scarcity of sufficient, high-quality real-world robotic data due to time and resource intensity. Over-reliance on simulation data exacerbates the sim-to-real gap. Existing multi-robot datasets lack unified formats, comprehensive sensory modalities, and combined simulated/real data.
    • Future Work: Develop more realistic and efficient simulators. Foster extensive collaboration among institutions for diverse real-world data collection. Construct large-scale datasets (like the proposed ARIO) leveraging high-quality simulated data to augment real-world data for generalizable embodied models capable of cross-scenario and cross-task applications.
  • Long-Horizon Task Execution:

    • Limitation: Current high-level task planners struggle with long-horizon tasks (e.g., "clean the kitchen") in diverse scenarios due to insufficient tuning for embodied tasks and limited ability to execute complex sequences of low-level actions.
    • Future Work: Develop efficient planners with robust perception capabilities and extensive commonsense knowledge. Implement hybrid planning approaches that combine lightweight, high-frequency monitoring modules with lower-frequency adapters for subtask and path adaptation reasoning, balancing planning complexity and real-time adaptability.
  • Causal Reasoning:

    • Limitation: Existing data-driven embodied agents often make decisions based on superficial data correlations rather than true causal relations between knowledge, behavior, and environment, leading to biased and unreliable strategies in real-world settings.
    • Future Work: Develop embodied agents driven by world knowledge and capable of autonomous causal reasoning. Agents should learn world workings via abductive reasoning through interaction. Establish spatial-temporal causal relations across modalities via interactive instructions and state predictions. Agents need to understand object affordances for adaptive task planning in dynamic scenes.
  • Unified Evaluation Benchmark:

    • Limitation: Existing benchmarks for low-level control policies vary significantly in assessed skills, and objects/scenes are often constrained by simulator limitations. Many high-level task planner benchmarks only assess planning (e.g., via question-answering) in isolation.
    • Future Work: Develop unified benchmarks that encompass a diverse range of skills using realistic simulators. Evaluate both high-level task planners and low-level control policies together for executing long-horizon tasks, measuring overall success rates rather than isolated component performance.
  • Security and Privacy:

    • Limitation: Embodied agents deployed in sensitive or private spaces face significant security challenges, especially as they rely on LLMs for decision-making. LLMs are susceptible to backdoor attacks (word injection, scenario manipulation, knowledge injection) that could lead to hazardous actions (e.g., autonomous vehicles accelerating into obstacles).
    • Future Work: Evaluate potential attack vectors and develop more robust defenses. Implement secure prompting, state management, and safety validation mechanisms to enhance security and robustness.

7.3. Personal Insights & Critique

This survey is a timely and valuable contribution to the Embodied AI literature. Its strength lies in its explicit focus on Multi-modal Large Models (MLMs) and World Models (WMs), which are indeed defining the current frontier of AI. For a novice, the systematic breakdown of Embodied AI into robots, simulators, and functional tasks (perception, interaction, agent, sim-to-real) provides an excellent structured entry point into a complex field. The consistent emphasis on the cyber-physical alignment clarifies the overarching goal.

The proposal of ARIO is a particularly insightful and practical contribution. The current fragmentation of datasets is a major bottleneck, and a unified standard could genuinely accelerate research by enabling more consistent pre-training and benchmarking of general-purpose embodied agents. The scale of the proposed ARIO dataset (3 million episodes) is impressive and indicative of the large data requirements for foundation models.

Potential Issues or Areas for Improvement:

  1. Depth of Technical Details within Surveyed Methods: While the survey categorizes methods and lists representative works, for a truly "beginner-friendly" deep-dive, it could have provided slightly more technical intuition or a simplified architectural overview for a few exemplar methods within each category (e.g., how ORB-SLAM works at a high level, or the basic structure of a Vision-Language-Action model). Given the breadth, this is a difficult balance, but for a foundational reference, a "typical architecture" for each sub-task would be beneficial.
  2. Explicit Examples of MLM/WM Mechanics: The survey talks about MLMs and WMs extensively, but a slightly deeper dive into how they specifically enhance perception or planning (e.g., "latent space" processing for WMs explained with a clearer analogy, or how cross-modal attention in MLMs grounds language to vision) could further cement understanding for a beginner.
  3. Real-world Impact vs. Research Promise: While applications are mentioned, a more explicit discussion on the current maturity levels of different Embodied AI applications in the real world versus those primarily in research labs could be beneficial. The term "aligning cyberspace with the physical world" is broad, and a clearer delineation of what's currently achievable in practice versus what's still aspirational would add nuance.
  4. Criticality on LLM Hallucinations: The survey touches upon LLM "hallucinations" in task planning (e.g., in RoboGPT context). Given the reliance on LLMs, a slightly more detailed discussion on the common failure modes of LLM-based planning and the practical strategies to mitigate them (beyond just multi-turn reasoning or visual feedback) would be valuable.

Transferability and Applications:

The methods and conclusions outlined in this survey are highly transferable across various domains:

  • Industrial Automation: Enhanced embodied grasping and long-horizon task execution are directly applicable to smart manufacturing and logistics, leading to more flexible and autonomous factory floors and warehouses.

  • Healthcare: Humanoid robots and mobile manipulators with improved perception and interaction can assist in patient care, surgery, and sanitation, especially with causal reasoning for delicate tasks.

  • Autonomous Driving: The advancements in embodied perception (3D scene understanding, vSLAM), world models (predicting future states), and sim-to-real adaptation are foundational for safer and more reliable self-driving vehicles.

  • Exploration and Disaster Response: Quadruped and tracked robots benefit immensely from improved active exploration and causal reasoning to navigate unknown, hazardous environments and perform complex rescue operations.

  • Smart Homes and Personal Service Robotics: Embodied agents that can understand natural language, perform household tasks, and engage in EQA will be transformative for assistive and convenience applications in homes.

    This survey provides a solid foundation for understanding the current state and future trajectory of Embodied AI, making it an indispensable resource for researchers and practitioners alike who aim to bridge the digital and physical realms with intelligent machines.
