Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI
TL;DR Summary
This survey comprehensively explores the latest advancements in Embodied AI, emphasizing its role in achieving AGI and bridging cyberspace with the physical world, covering key research areas and challenges.
Abstract
Embodied Artificial Intelligence (Embodied AI) is crucial for achieving Artificial General Intelligence (AGI) and serves as a foundation for various applications (e.g., intelligent mechatronics systems, smart manufacturing) that bridge cyberspace and the physical world. Recently, the emergence of Multi-modal Large Models (MLMs) and World Models (WMs) have attracted significant attention due to their remarkable perception, interaction, and reasoning capabilities, making them a promising architecture for embodied agents. In this survey, we give a comprehensive exploration of the latest advancements in Embodied AI. Our analysis firstly navigates through the forefront of representative works of embodied robots and simulators, to fully understand the research focuses and their limitations. Then, we analyze four main research targets: 1) embodied perception, 2) embodied interaction, 3) embodied agent, and 4) sim-to-real adaptation, covering state-of-the-art methods, essential paradigms, and comprehensive datasets. Additionally, we explore the complexities of MLMs in virtual and real embodied agents, highlighting their significance in facilitating interactions in digital and physical environments. Finally, we summarize the challenges and limitations of embodied AI and discuss potential future directions. We hope this survey will serve as a foundational reference for the research community. The associated project can be found at https://github.com/HCPLab-SYSU/Embodied_AI_Paper_List.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI
1.2. Authors
Yang Liu, Weixing Chen, Yongjie Bai, Xiaodan Liang, Guanbin Li, Wen Gao, Liang Lin
The authors are affiliated with Sun Yat-sen University, Peng Cheng Laboratory, and Peking University in China. Their research backgrounds generally span computer science, engineering, and artificial intelligence, with specific expertise in areas such as multi-modal reasoning, causality learning, computer vision, image processing, deep learning, human-related analysis, and multimedia.
1.3. Journal/Conference
The paper is a preprint published on arXiv. As a preprint, it has not yet undergone formal peer review or been accepted by a specific journal or conference. However, the arXiv platform is a highly influential repository for rapidly disseminating research in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Given the affiliations and past publications of the authors in top-tier venues (e.g., IEEE TPAMI, CVPR, ICCV, NeurIPS), it is likely intended for a high-impact journal or conference in the field of AI or robotics.
1.4. Publication Year
2024
1.5. Abstract
This survey provides a comprehensive exploration of the latest advancements in Embodied Artificial Intelligence (Embodied AI). It highlights Embodied AI's crucial role in achieving Artificial General Intelligence (AGI) and its application in bridging cyberspace and the physical world through systems like intelligent mechatronics and smart manufacturing. The paper notes the significant impact of Multi-modal Large Models (MLMs) and World Models (WMs) due to their perception, interaction, and reasoning capabilities, positioning them as promising architectures for embodied agents. The analysis begins by reviewing embodied robots and simulators to understand current research focuses and limitations. It then delves into four main research targets: 1) embodied perception, 2) embodied interaction, 3) embodied agent, and 4) sim-to-real adaptation, detailing state-of-the-art methods, essential paradigms, and comprehensive datasets. The survey further examines the role of MLMs in both virtual and real embodied agents, emphasizing their importance in digital and physical interactions. Finally, it summarizes challenges, limitations, and discusses future research directions. The authors aim for this survey to serve as a foundational reference for the research community, with an associated project available on GitHub.
1.6. Original Source Link
Official Source Link: https://arxiv.org/abs/2407.06886 PDF Link: https://arxiv.org/pdf/2407.06886v8.pdf Publication Status: Preprint
2. Executive Summary
2.1. Background & Motivation
The core problem the paper addresses is the gap between Artificial Intelligence (AI) systems operating in cyberspace (virtual environments) and their ability to function intelligently in the physical world. Traditional AI often excels at abstract problem-solving but struggles with the complexities and unpredictability of real-world interaction. This limitation hinders the realization of Artificial General Intelligence (AGI) – the ability of an AI to understand, learn, and apply intelligence across a wide range of tasks, similar to human cognitive abilities.
This problem is critically important because Embodied AI is seen as a foundational step towards AGI. It is also essential for various practical applications that require AI to interact directly with the physical world, such as intelligent mechatronics systems (systems integrating mechanical engineering, electronics, computer engineering, and control engineering for smart products), smart manufacturing (digitally integrated and optimized production processes), robotics, healthcare, and autonomous vehicles. Without Embodied AI, these applications cannot achieve true autonomy and adaptive functionality.
Prior research primarily focused on disembodied AI (e.g., ChatGPT), where cognition and physical entities are separated. While these models show impressive reasoning and language capabilities, they lack the ability to actively perceive, interact with, and act upon the physical environment. The emergence of Multi-modal Large Models (MLMs) and World Models (WMs) has created a new opportunity to bridge this gap. MLMs combine Natural Language Processing (NLP) and Computer Vision (CV) capabilities, allowing AI to understand both language instructions and visual observations. WMs enable AI to build internal predictive models of its environment, simulating physical laws and understanding causal relationships. These advancements provide strong perception, interaction, and planning capabilities, making them a promising architecture for developing general-purpose embodied agents.
The paper's entry point is to provide a comprehensive survey that specifically contextualizes the field of Embodied AI within this new era of MLMs and WMs. It aims to consolidate the fragmented progress, identify key research directions, and highlight the challenges that remain in aligning AI's virtual intelligence with real-world physical interaction.
2.2. Main Contributions / Findings
The paper makes several primary contributions and presents key findings:
- First Comprehensive Survey with MLM and WM Focus: To the best of the authors' knowledge, this is the first comprehensive survey of Embodied AI that specifically focuses on the alignment of cyber and physical spaces based on Multi-modal Large Models (MLMs) and World Models (WMs). This offers novel insights into methodologies, benchmarks, challenges, and applications within this rapidly evolving landscape.
- Detailed Taxonomy of Embodied AI: The survey categorizes and summarizes Embodied AI into several essential parts, providing a detailed taxonomy:
  - Embodied Robots: Discussing various types of robots that serve as physical embodiments.
  - Embodied Simulators: Reviewing general-purpose and real-scene-based simulation platforms crucial for developing and testing embodied agents.
  - Four Main Research Tasks:
    - Embodied Perception: Covering active visual perception and visual language navigation (VLN).
    - Embodied Interaction: Discussing embodied question answering (EQA) and embodied grasping.
    - Embodied Agents: Including multi-modal foundation models and task planning.
    - Sim-to-Real Adaptation: Focusing on embodied world models, data collection and training, and control algorithms.
- Proposal of ARIO Dataset Standard: To facilitate the development of robust, general-purpose embodied agents, the paper proposes a new dataset standard called ARIO (All Robots In One). This standard aims to unify control and motion data from robots with different morphologies into a single format. A large-scale ARIO dataset is also introduced, encompassing approximately 3 million episodes collected from 258 series and 321,064 tasks.
- Identification of Challenges and Future Directions: The survey concludes by outlining significant challenges in Embodied AI, such as the need for high-quality robotic datasets, long-horizon task execution, robust causal reasoning, a unified evaluation benchmark, and addressing security and privacy concerns. It also discusses promising future research directions to overcome these limitations.

The key conclusion is that while Embodied AI has made rapid progress, especially with the integration of MLMs and WMs, significant research is still required to achieve truly general-purpose embodied agents that can reliably and robustly bridge cyberspace and the physical world for AGI and diverse applications. The proposed taxonomy, dataset standard, and future directions serve to guide this ongoing research.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this survey, a reader needs a grasp of several fundamental concepts in AI and robotics:
- Artificial Intelligence (AI): The broad field of computer science dedicated to creating machines that can perform tasks that typically require human intelligence, such as learning, problem-solving, perception, and decision-making.
- Artificial General Intelligence (AGI): A hypothetical form of AI that can understand, learn, and apply intelligence to a wide range of problems, similar to human cognitive abilities, rather than being limited to a single task (narrow AI). The paper posits Embodied AI as crucial for achieving AGI.
- Embodied AI: AI systems that are integrated into physical bodies (e.g., robots, autonomous vehicles) and can interact with the physical world through sensors (perception) and actuators (action). Unlike disembodied AI, embodied AI learns and reasons within a physical context, experiencing and influencing its environment.
  - Disembodied AI: AI that operates purely in cyberspace or virtual environments, without a physical body or direct interaction with the real world. Examples include chatbots like ChatGPT or game-playing AI. Its cognition and physical entities are separate.
- Multi-modal Large Models (MLMs): Large Models (like large language models, LLMs) that can process and integrate information from multiple modalities, such as text, images, video, and audio. These models are trained on vast datasets to develop strong capabilities in perception (understanding sensory input), interaction (responding to and influencing the environment across modalities), and reasoning (making logical inferences).
  - Large Language Models (LLMs): A type of MLM primarily focused on text. They are deep learning models trained on enormous text datasets, enabling them to understand, generate, and process human language for tasks like translation, summarization, and question answering.
- World Models (WMs): Computational models that learn a compressed, predictive representation of an environment. An embodied agent uses a World Model to simulate possible future states, understand physical laws, and plan actions without needing to interact directly with the real (or simulated) world for every decision. This allows for more efficient learning and planning.
- Sim-to-Real Adaptation: The process of training AI models or control policies in a simulated environment (cheaper, safer, faster) and then transferring that learned knowledge to a real-world physical system (e.g., a robot). The challenge lies in bridging the "domain gap" – the differences between the simulated and real environments – to ensure the AI performs effectively in reality.
- Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent learns through trial and error, receiving feedback (rewards or penalties) for its actions.
- Computer Vision (CV): A field of AI that enables computers to "see" and interpret visual information from the real world, such as images and videos. Tasks include object detection, image segmentation, and 3D reconstruction.
- Natural Language Processing (NLP): A field of AI focused on enabling computers to understand, interpret, and generate human language. Tasks include language understanding, text generation, and machine translation.
- Robotics: The interdisciplinary field that deals with the design, construction, operation, and use of robots. Embodied AI is deeply intertwined with robotics, as robots provide the physical embodiment for AI agents.
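To make the perception-action loop behind the Embodied AI and Reinforcement Learning concepts above concrete, here is a minimal, self-contained sketch. The 1-D "corridor" environment and the random policy are illustrative assumptions only, not part of the survey.

```python
# Minimal sketch (not from the survey): the perceive-act-reward loop that underlies
# the Embodied AI and Reinforcement Learning concepts defined above.
import random

class ToyCorridorEnv:
    """Agent starts at position 0 and must reach position +5; the observation is its position."""
    def __init__(self):
        self.position = 0

    def reset(self):
        self.position = 0
        return self.position  # initial observation

    def step(self, action):  # action: -1 (move left) or +1 (move right)
        self.position += action
        done = self.position >= 5
        reward = 1.0 if done else -0.1  # small step penalty, bonus at the goal
        return self.position, reward, done

def random_policy(observation):
    """Placeholder for a learned policy; an RL agent would improve this from rewards."""
    return random.choice([-1, 1])

env = ToyCorridorEnv()
obs, total_reward, done = env.reset(), 0.0, False
for _ in range(1000):           # step cap so the toy episode always terminates
    action = random_policy(obs)               # act
    obs, reward, done = env.step(action)      # perceive the consequence
    total_reward += reward                    # the signal an RL agent maximizes
    if done:
        break
print("episode return:", round(total_reward, 2))
```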
3.2. Previous Works
The survey differentiates itself from prior works by specifically focusing on the recent advancements brought by MLMs and WMs. It acknowledges existing surveys but highlights their limitations:
- Outdated Surveys: Several previous surveys on Embodied AI [5], [6], [17], [18] are noted as "outdated" because they were published before the era of MLMs, which started around 2023. This means they do not capture the significant shifts and advancements that MLMs and WMs have introduced to the field.
  - [5] Duan, J., Yu, S., Tan, H. L., Zhu, H., & Tan, C. (2022). A survey of embodied AI: From simulators to research tasks.
  - [17] Pfeifer, R., & Iida, F. (2004). Embodied artificial intelligence: Trends and challenges. (This is significantly older, highlighting the rapid pace of change.)
- Limited Scope Surveys: Among the more recent surveys after 2023, only two are specifically mentioned:
  - [6] Ma, Y., Song, Z., Zhuang, Y., Hao, J., & King, I. (2024). A survey on vision-language-action models for embodied AI. This work focused on vision-language-action models.
  - [18] Ren, L., Dong, J., Liu, S., Zhang, L., & Wang, L. (2024). Embodied intelligence toward future smart manufacturing in the era of AI foundation model. This survey proposed the ABC model (AI brain, Body, Cross-modal sensors) and focused on Embodied AI systems for smart manufacturing.
- Overlooked Aspects: The paper points out that previous surveys generally overlooked the full consideration of MLMs, WMs, and embodied agents, as well as recent developments in embodied robots and simulators.

Key Technologies and Representative Models Mentioned:
- Disembodied AI Example: ChatGPT [4] is cited as a representative disembodied AI agent, demonstrating the contrast with Embodied AI. RoboGPT [8] is mentioned in the context of LLMs for long-term decisions, indicating how LLMs can be integrated into embodied agents.
- Embodied AI Examples: RT-1 [9], RT-2 [10], and RT-H [3] are highlighted as recent representative embodied models. These Robotics Transformers are crucial examples of Vision-Language-Action (VLA) models that transfer web knowledge to robotic control and utilize action hierarchies.
- Vision Encoders: CLIP [13] (learning transferable visual models from natural language supervision) and BLIP-2 [14] (bootstrapping language-image pre-training with frozen image encoders and large language models) are mentioned as state-of-the-art vision encoders that provide precise estimations of object class, pose, and geometry, which are crucial for embodied perception.
- World Models: I-JEPA [15] (self-supervised learning from images with a joint-embedding predictive architecture) and Sora [16] (a generative model capable of producing realistic videos, suggesting world-modeling capabilities) are examples of World Models or related generative approaches that exhibit simulation capabilities and an understanding of physical laws.
3.3. Technological Evolution
The evolution of Embodied AI has progressed through several stages:
- Early Robotics and Traditional AI: Initially, robots were controlled by rule-based systems and classical AI algorithms for tasks like SLAM (Simultaneous Localization and Mapping) and simple navigation. These systems were often brittle, limited to well-defined environments, and lacked adaptability.
- Deep Learning Era (2012 onwards): The advent of deep learning revolutionized Computer Vision and Natural Language Processing, providing robots with enhanced perception (e.g., better object recognition, semantic segmentation) and some understanding of natural language commands. Deep Reinforcement Learning allowed robots to learn complex behaviors through interaction. However, these models were often data-hungry and struggled with generalization across diverse tasks and environments.
- Large Foundation Models Era (2020s onwards): The recent emergence of Large Language Models (LLMs) and Multi-modal Large Models (MLMs) marks a significant leap. These models, trained on vast internet-scale datasets, possess unprecedented perception, reasoning, and generalization capabilities.
  - LLMs enable robots to better understand abstract linguistic instructions and perform high-level task planning through chain-of-thought reasoning.
  - MLMs align visual and linguistic representations, allowing embodied agents to perceive multi-modal elements in complex environments and interact more naturally.
  - World Models further enhance this by providing a predictive understanding of physical laws, enabling agents to simulate consequences and plan more effectively.

This paper's work sits firmly within this Large Foundation Models Era, specifically focusing on how MLMs and WMs are transforming Embodied AI by providing the "brains" for embodied agents to bridge the cyber-physical gap.
3.4. Differentiation Analysis
Compared to previous surveys and methods in Embodied AI, the core differences and innovations of this paper's approach are:
- Focus on MLMs and WMs: The most significant differentiation is its explicit and central focus on Multi-modal Large Models (MLMs) and World Models (WMs). While previous works might have touched upon multi-modal learning or predictive models, this survey systematically explores their profound impact on embodied perception, interaction, agent architectures, and sim-to-real adaptation in the context of aligning cyberspace with the physical world. This perspective is new given the recent, rapid rise of these foundation models.
- Comprehensive Scope: It covers a broader scope by not only detailing theoretical advancements but also providing a structured overview of:
  - Embodied Hardware: A dedicated section on embodied robots (fixed-base, wheeled, tracked, quadruped, humanoid, biomimetic).
  - Simulation Environments: A thorough review of both general-purpose and real-scene-based simulators, which are critical tools for Embodied AI development and sim-to-real transfer.
  - Four Core Research Tasks: A structured analysis of embodied perception (active visual perception, VLN), embodied interaction (EQA, grasping), embodied agents (multi-modal foundation models, task planning), and sim-to-real adaptation. This comprehensive breakdown offers a complete picture of the field's components.
- Novel Dataset Standard (ARIO): The proposal of ARIO (All Robots In One) as a unified dataset standard and the introduction of a large-scale ARIO dataset directly address a critical challenge in the field: the lack of standardized, large-scale, multi-modal datasets for training general-purpose embodied agents. This is a concrete, practical contribution beyond a mere literature review.
- Updated Perspective on Challenges and Future Directions: By considering the capabilities and limitations of MLMs and WMs, the survey offers a refined understanding of current challenges (e.g., long-horizon tasks, causal reasoning, security) and proposes future directions that are highly relevant to the latest technological paradigms.

In essence, this paper serves as an updated roadmap for Embodied AI, explicitly acknowledging and integrating the transformative role of MLMs and WMs, a perspective largely missing from prior reviews.
4. Methodology
This paper is a comprehensive survey, and its "methodology" is the structured approach it takes to review and analyze the field of Embodied AI. The authors systematically categorize and discuss existing research, identifying key components, tasks, and challenges. The core idea is to provide a holistic view of how AI can be embodied to bridge cyberspace and the physical world, with a particular emphasis on the impact of Multi-modal Large Models (MLMs) and World Models (WMs).
4.1. Principles
The theoretical basis and intuition behind this survey's structure are rooted in the understanding that Embodied AI requires a cohesive ecosystem of hardware, software, and learning paradigms to achieve Artificial General Intelligence (AGI). The survey's principles can be broken down as follows:
- Cyber-Physical Alignment: The central theme is the alignment of AI's cognitive abilities (developed in cyberspace) with its physical embodiment and interaction capabilities in the physical world. This involves understanding how AI perceives, processes, and acts within real-world constraints.
- Foundation Models as Enablers: The survey posits MLMs and WMs as pivotal architectures that provide embodied agents with enhanced perception, interaction, and reasoning. These models offer a pathway to generalize AI capabilities from virtual to physical environments.
- Modular Breakdown: To thoroughly analyze a complex field like Embodied AI, it must be broken down into constituent parts. The survey adopts a modular approach, starting with the physical platforms (robots and simulators), then delving into core functional capabilities (perception, interaction), integrating these into an intelligent entity (the embodied agent), and finally addressing the critical challenge of transferring knowledge between virtual and real domains (sim-to-real adaptation).
- Problem-Centric View: Each section on research targets (perception, interaction, agent, sim-to-real) addresses specific problems embodied agents must solve, outlining state-of-the-art solutions, paradigms, and supporting datasets.
- Future-Oriented: Beyond summarizing current achievements, the survey aims to identify limitations and propose future research directions, serving as a guide for the community.
4.2. Core Methodology In-depth (Layer by Layer)
The survey systematically explores Embodied AI through several key areas, progressively building from physical components to advanced cognitive functions and transfer mechanisms.
4.2.1. Embodied Robots
This section reviews the physical platforms that serve as embodiments for AI agents. The selection of a robot type is critical as it dictates the range of interactions and environments an agent can operate in.
The following figure (Figure 2 from the original paper) shows the various types of embodied robots discussed:
The image is a schematic showing the various types of embodied robots, including fixed-base, wheeled, tracked, quadruped, humanoid, and biomimetic robots, which play an important role in realizing embodied AI applications.
- Fixed-base Robots: These robots, like the Franka Emika Panda [19], KUKA iiwa [21], and Sawyer [23], are typically stationary and excel at precision tasks within structured environments such as industrial automation or laboratory settings. Their strength lies in repetitive, high-accuracy manipulation.
- Wheeled Robots: Examples include the Kiva and Jackal robots [25]. These robots are designed for efficient movement on flat, predictable surfaces and are commonly used in logistics and warehousing due to their simple structure and low cost. They face limitations on uneven terrain.
- Tracked Robots: With track systems, these robots offer stability on soft or uneven terrain, making them suitable for off-road tasks such as agriculture, exploration, and disaster recovery [26].
- Quadruped Robots: Examples include Unitree Robotics' A1 and Go1, and Boston Dynamics' Spot. These robots mimic four-legged animals, enabling them to navigate complex and unstructured terrain, which makes them ideal for exploration and rescue missions.
- Humanoid Robots: Designed to resemble human form and movement, these robots (e.g., LLM-enhanced humanoids [29]) aim to provide personalized services and perform intricate tasks using dexterous hands. They are expected to enhance efficiency and safety in various service and industrial sectors.
- Biomimetic Robots: These robots replicate the movements and functions of natural organisms (e.g., fish-like [32], insect-like [33], and soft-bodied robots [34]). Emulating biological mechanisms helps them operate efficiently in complex environments, often optimizing energy use and adapting to difficult terrain.
4.2.2. Embodied Simulators
Embodied simulators are crucial for Embodied AI research due to their cost-effectiveness, safety, scalability, and ability to generate extensive training data. The survey categorizes them into general simulators and real-scene based simulators.
4.2.2.1. General Simulator
These simulators provide virtual environments that closely mimic the physical world, focusing on realistic physics and dynamic interactions. They are essential for algorithm development and model training.
The following figure (Figure 3 from the original paper) shows examples of general simulators:
The image is a schematic showing examples of general-purpose simulators, including Isaac Sim, Webots, PyBullet, V-REP (CoppeliaSim), Genesis, MuJoCo, Unity ML-Agents, AirSim, MORSE, and Gazebo. These simulators are of great significance in robotics and simulation research.
The following are the results from Table II of the original paper:
| Simulator | Year | HFPS | HQGR | RRL | DLS | LSPC | ROS | MSS | CP | Physics Engine | Main Applications |
| Genesis [35] | 2024 | O | o | O | O | O | O | Custom | RL, LSPS, RS | |
| Isaac Sim [36] | 2023 | O | O | O | O | O | O | O | PhysX | Nav, AD |
| Isaac Gym [37] | 2019 | O | O | O | PhysX | RL,LSPS | ||||
| Gazebo [38] | 2004 | O | O | O | O | O | ODE, Bullet, Simbody, DART | Nav, MR |
| PyBullet [39] | 2017 | O | O | Bullet | RL,RS | |||||
| Webots [40] | 1996 | O | O | O | O | ODE | RS | |||
| MuJoCo [41] | 2012 | O | o | Custom | RL, RS | |||||
| Unity ML-Agents [42] | 2017 | O | O | O | Custom | RL, RS | ||||
| AirSim [43] | 2017 | O | o | Custom | Drone sim, AD, RL | |||||
| MORSE [44] | 2015 | o | O | Bullet | Nav, MR | |||||
| V-REP (CoppeliaSim) [45] | 2013 | O | O | O | O | Bullet, ODE, Vortex, Newton | MR, RS |
- HFPS: High-fidelity Physical Simulation
- HQGR: High-quality Graphics Rendering
- RRL: Rich Robot Libraries
- DLS: Deep Learning Support
- LSPC: Large-scale Parallel Computation
- ROS: Robot Operating System support
- MSS: Multi-sensor Support
- CP: Cross-Platform
- RL: Reinforcement Learning
- LSPS: Large-scale Parallel Simulation
- RS: Robot Simulation
- Nav: Navigation
- AD: Autonomous Driving
- MR: Multi-Robot Systems
- ODE: Open Dynamics Engine

Key simulators include:
- Isaac Sim [36]: Known for high-fidelity physics, real-time ray tracing, extensive robot models, and deep learning support; used in autonomous driving and industrial automation.
- Gazebo [47]: Open-source, with strong ROS integration; supports various sensors and pre-built models, mainly for robot navigation and control.
- PyBullet [39]: A Python interface for the Bullet physics engine; easy to use and supports real-time physical simulation.
- Genesis [35]: A newly launched simulator with a differentiable physics engine and generative capabilities.
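For readers unfamiliar with these tools, the short sketch below shows the kind of workflow the table summarizes, using PyBullet since it is the lightest-weight of the listed simulators. It assumes `pybullet` is installed and uses the bundled `pybullet_data` assets; it is illustrative only, not code from the survey.

```python
# A minimal PyBullet session: load a ground plane and a robot, step the physics, read state.
import pybullet as p
import pybullet_data

client = p.connect(p.DIRECT)                        # headless physics server
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)
p.loadURDF("plane.urdf")                            # ground plane shipped with pybullet_data
robot = p.loadURDF("r2d2.urdf", basePosition=[0, 0, 0.5])

for _ in range(240):                                # roughly one simulated second at 240 Hz
    p.stepSimulation()

position, orientation = p.getBasePositionAndOrientation(robot)
print("robot base position after settling:", position)
p.disconnect(client)
```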
4.2.2.2. Real-Scene Based Simulators
These simulators focus on creating photorealistic 3D environments, often built from real-world data and game engines (e.g., UE5, Unity), to replicate human daily life and complex indoor activities.
The following figure (Figure 4 from the original paper) shows examples of real-scene based simulators:
The image is a schematic showing several real-scene based simulators, including AI2-THOR, Matterport3D, VirtualHome, SAPIEN, Habitat, iGibson, TDW, and InfiniteWorld; these tools support research on human-robot interaction and perception in virtual environments.
Key simulators include:
- SAPIEN [48]: Designed for interactions with articulated objects (doors, cabinets).
- VirtualHome [49]: Uses an environment graph for high-level embodied planning based on natural language.
- AI2-THOR [50]: Offers many interactive scenes, though interactions are often script-based.
- iGibson [51] and TDW [52]: Provide fine-grained embodied control and highly simulated physical interactions. iGibson excels in large-scale realistic scenes, while TDW offers user freedom, unique audio, and fluid simulations.
- Matterport3D [53]: A foundational 2D-3D visual dataset widely used in benchmarks.
- Habitat [107]: Lacks interaction capabilities but is valued for extensive indoor scenes and an open framework, especially for navigation.
- InfiniteWorld [54]: Focuses on a unified and scalable simulation framework with improvements in implicit asset reconstruction and natural-language-driven scene generation.

The survey also mentions tools for automated simulation scene construction, such as RoboGen [55], HOLODECK [56], PhyScene [57], and ProcTHOR [58], which leverage LLMs and generative models to create diverse and interactive scenes for training embodied agents.
4.2.3. Embodied Perception
Embodied perception goes beyond passive image recognition, requiring agents to actively move and interact to understand 3D space and dynamic environments. It involves visual reasoning, 3D relation understanding, and prediction for complex tasks.
The following figure (Figure 5 from the original paper) shows the schematic diagram of active visual perception:
The image is a schematic of active visual perception, showing the components of visual perception, including 3D scene understanding, improved localization and mapping accuracy, and visual SLAM. It also highlights improved observation, the activation of active exploration, and the resulting actions, which together constitute the active visual perception system.
4.2.3.1. Active Visual Perception
This system combines fundamental capabilities for state estimation, scene perception, and environment exploration.
The following are the results from Table III of the original paper:
| Function | Type | Methods |
| vSLAM | Traditional vSLAM | MonoSLAM [60], ORB-SLAM [61], LSD-SLAM [62] |
| Semantic vSLAM | SLAM++ [63], QuadricSLAM [64], So-SLAM [65],SG-SLAM [66], OVD-SLAM [67], GS-SLAM [68] | |
| 3D Scene Understanding | Projection-based | MV3D [69], PointPillars [70], MVCNN [71] |
| Voxel-based | VoxNet [72], SSCNet [73]), MinkowskiNet [74], SSCNs [75], Embodiedscan [76] | |
| Point-based | PointNet [77], PointNet++ [78], PointMLP [79], PointTransformer [80], Swin3d [81], PT2 [82],3D-VisTA [83], LEO [84], PQ3D [85], PointMamba [86], Mamba3D [87] | |
| Active Exploration | Interacting with the environment | Pinto et al. [88], Tatiya et al. [89] |
| Changing the viewing direction | Jayaraman et al. [90], NeU-NBV [91], Hu et al. [92], Fan et al. [93] |
- Visual Simultaneous Localization and Mapping (vSLAM): SLAM aims to determine a robot's position and build an environment map simultaneously [97]. vSLAM uses cameras for this, offering low hardware costs and rich environmental detail.
  - Traditional vSLAM: Uses image data and multi-view geometry to estimate pose and construct low-level maps (e.g., sparse point clouds). Methods include filter-based (MonoSLAM [60]), keyframe-based (ORB-SLAM [61]), and direct tracking (LSD-SLAM [62]) approaches.
  - Semantic vSLAM: Integrates semantic information (object recognition) into vSLAM to enhance environment interpretation and navigation (SLAM++ [63], QuadricSLAM [64], So-SLAM [65], SG-SLAM [66], OVD-SLAM [67], GS-SLAM [68]).
- 3D Scene Understanding: Aims to distinguish object semantics, identify locations, and infer geometric attributes from 3D data (e.g., point clouds from LiDAR or RGB-D sensors) [100]. Point clouds are sparse and irregular.
  - Projection-based methods: Project 3D points onto 2D image planes for 2D CNN-based feature extraction (MV3D [69], PointPillars [70], MVCNN [71]).
  - Voxel-based methods: Convert point clouds into regular voxel grids for 3D convolution operations (VoxNet [72], SSCNet [73]), with sparse convolution for efficiency (MinkowskiNet [74], SSCNs [75], Embodiedscan [76]).
  - Point-based methods: Directly process point clouds (PointNet [77], PointNet++ [78], PointMLP [79]). Recent advancements include Transformer-based (PointTransformer [80], Swin3d [81], PT2 [82], 3D-VisTA [83], LEO [84], PQ3D [85]) and Mamba-based (PointMamba [86], Mamba3D [87]) architectures for scalability. PQ3D [85] integrates features from multiple modalities.
- Active Exploration: Complements passive perception by enabling robots to dynamically interact with and perceive their surroundings.
  - Interacting with the environment: Robots learn visual representations through physical interaction (Pinto et al. [88]), or transfer implicit knowledge via learned exploratory interactions across robot morphologies (Tatiya et al. [89]).
  - Changing the viewing direction: Agents learn to acquire informative visual observations by reducing uncertainty about unobserved parts of the environment using Reinforcement Learning (Jayaraman et al. [90]), plan camera positions that yield informative images (NeU-NBV [91]), predict future state values (Hu et al. [92]), or treat active recognition as sequential evidence-gathering (Fan et al. [93]).
4.2.3.2. Visual Language Navigation (VLN)
VLN is a crucial task where embodied agents navigate unseen environments by following linguistic instructions, requiring understanding of diverse visual observations and multi-granular instructions. The process can be represented as $a_t = \pi(o_t, h_t, \mathcal{I})$, where $a_t$ is the chosen action, $o_t$ is the current observation, $h_t$ is the historical information, and $\mathcal{I}$ is the natural language instruction.
The following figure (Figure 6 from the original paper) shows an overview and different tasks of VLN:
The image is a schematic showing an agent interacting with a human in a virtual environment. It contains two parts: the left side illustrates natural language instructions in the interactive environment, and the right side shows different navigation tasks and interaction steps. By observing and executing tasks, the agent completes goal-directed navigation.
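The decision rule $a_t = \pi(o_t, h_t, \mathcal{I})$ given above can be pictured as a simple policy interface: the policy consumes the current observation, the accumulated history, and the instruction, and emits one action per step. The toy grounding rule below is a stand-in for a learned VLN model (such as those listed in Table V), not an actual system from the survey.

```python
# Schematic VLN policy interface for a_t = pi(o_t, h_t, I); the heuristic is illustrative only.
from dataclasses import dataclass, field
from typing import List

ACTIONS = ["FORWARD", "TURN_LEFT", "TURN_RIGHT", "STOP"]

@dataclass
class VLNState:
    instruction: str                                   # I: natural-language instruction
    history: List[str] = field(default_factory=list)   # h_t: past observations/actions

def policy(observation: str, state: VLNState) -> str:
    """Toy stand-in for a learned pi(o_t, h_t, I)."""
    state.history.append(observation)
    if "chair" in observation and "chair" in state.instruction:
        return "STOP"                                  # naive grounding of the goal phrase
    return "FORWARD"

state = VLNState(instruction="Walk down the hall and stop at the red chair.")
for obs in ["hallway ahead", "doorway on the left", "red chair ahead"]:
    print(obs, "->", policy(obs, state))
```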
-
Datasets: Various datasets exist based on instruction granularity and interaction requirements.
The following are the results from Table IV of the original paper:
| Dataset | Year | Simulator | Environment | Feature | Size |
| R2R [105] | 2018 | M3D | I, D | SbS | 21,567 |
| R4R [106] | 2019 | M3D | I, D | SbS | 200,000+ |
| VLN-CE [107] | 2020 | Habitat | I, C | SbS | - |
| TOUCHDOWN [108] | 2019 | - | O, D | SbS | 9,326 |
| REVERIE [109] | 2020 | M3D | I, D | DGN | 21,702 |
| SOON [110] | 2021 | M3D | I, D | DGN | 3,848 |
| DDN [111] | 2023 | AT | I, C | DDN | 30,000+ |
| ALFRED [112] | 2020 | AT | I, C | NwI | 25,743 |
| OVMM [113] | 2023 | Habitat | I, C | NwI | 7,892 |
| BEHAVIOR-1K [114] | 2023 | OG | I, C | LSNwI | 1,000 |
| CVDN [115] | 2020 | M3D | I, D | D&O | 2,050 |
| DialFRED [116] | 2022 | AT | I, C | D&O | 53,000 |

- M3D: Matterport3D
- AT: AI2-THOR
- OG: OMNIGIBSON
- I: Indoor
- O: Outdoor
- D: Discrete
- C: Continuous
- SbS: Step-by-Step Instructions
- DGN: Described Goal Navigation
- DDN: Demand-Driven Navigation
- NwI: Navigation with Interaction
- LSNwI: Long-Span Navigation with Interaction
- D&O: Dialog and Oracle

Examples include R2R [105] (step-by-step instructions in Matterport3D), REVERIE [109] (goal navigation), ALFRED [112] (household tasks with interaction in AI2-THOR), and CVDN [115] (dialog-based navigation).
Method:
VLNmethods are categorized by their focus.The following are the results from Table V of the original paper:
| Method | Model | Year | Feature |
| Memory-Understanding Based | LVERG [117] | 2020 | Graph Learning |
| | CMG [118] | 2020 | Adversarial Learning |
| | RCM [119] | 2021 | Reinforcement Learning |
| | FILM [120] | 2022 | Semantic Map |
| | LM-Nav [121] | 2022 | Graph Learning |
| | HOP [122] | 2022 | History Modeling |
| | NaviLLM [123] | 2024 | Large Model |
| | FSTT [124] | 2024 | Test-Time Augmentation |
| | DiscussNav [125] | 2024 | Large Model |
| | GOAT [126] | 2024 | Causal Learning |
| | VER [127] | 2024 | Environment Encoder |
| | NaVid [128] | 2024 | Large Model |
| Future-Prediction Based | LookBY [129] | 2018 | Reinforcement Learning |
| | NvEM [130] | 2021 | Environment Encoder |
| | BGBL [131] | 2022 | Graph Learning |
| | MiC [132] | 2023 | Large Model |
| | HNR [133] | 2024 | Environment Encoder |
| | ETPNav [134] | 2024 | Graph Learning |
| Others | MCR-Agent [135] | 2023 | Multi-Level Model |
| | OVLM [136] | 2023 | Large Model |
Memory-Understanding Based: Focus on perceiving and understanding the environment using historical observations or trajectories. Methods often use
graph-based learning(e.g.,LVERG[117],LM-Nav[121]),semantic maps(FILM[120],VER[127]),adversarial learning(CMG[118]),causal learning(GOAT[126]), and increasinglyLarge Modelsfor understanding (NaviLLM[123],NaVid[128]). -
Future-Prediction Based: Model, predict, and understand future states. These methods often use
graph-based learningfor waypoint prediction (BGBL[131],ETPNav[134]),environmental encodingfor future observations (NvEM[130],HNR[133]), andreinforcement learningfor future state forecasting (LookBY[129]).Large Modelsare also used for "imagining" future scenes (MiC[132]).
4.2.4. Embodied Interaction
Embodied interaction covers scenarios where agents interact with humans and the environment in physical or simulated spaces, typically focusing on Embodied Question Answering (EQA) and embodied grasping.
4.2.4.1. Embodied Question Answering (EQA)
In EQA, an agent explores an environment from a first-person perspective to gather information to answer questions. This requires autonomous exploration and decision-making about when to stop exploring and provide an answer.
The following figure (Figure 7 from the original paper) shows various types of question answering tasks:
The image is a diagram showing an agent's exploration of an environment together with various question-answering tasks. The scenes in gray boxes represent the environment observed by the agent, while the other boxes show different question types, such as single-target, multi-target, and interactive tasks; the agent stops exploring once it has gathered enough information. The figure also includes example questions based on memory, knowledge, and object states.
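The core control flow of active EQA, as described under the Methods further below, is "explore, remember, and stop once confident enough to answer." The sketch below illustrates that loop; `vlm_answer`, `explore_step`, and the confidence threshold are hypothetical placeholders, not a specific system surveyed here.

```python
# Sketch of an active-EQA loop: gather observations, query a (placeholder) VLM,
# and stop exploring once the answer confidence crosses a threshold.
from typing import List, Tuple

def vlm_answer(question: str, memory: List[str]) -> Tuple[str, float]:
    """Placeholder for a VLM call: returns (answer, confidence in [0, 1])."""
    if any("refrigerator" in frame for frame in memory):
        return "white", 0.9
    return "unknown", 0.2

def explore_step(step: int) -> str:
    """Placeholder for frontier-based exploration returning the next egocentric view."""
    views = ["hallway", "dining table", "white refrigerator in the kitchen"]
    return views[min(step, len(views) - 1)]

def answer_question(question: str, max_steps: int = 10, threshold: float = 0.8) -> str:
    memory: List[str] = []
    for step in range(max_steps):
        memory.append(explore_step(step))            # gather information actively
        answer, confidence = vlm_answer(question, memory)
        if confidence >= threshold:                  # decide when to stop exploring
            return answer
    return answer                                    # best guess when the budget runs out

print(answer_question("What color is the refrigerator?"))
```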
-
Datasets:
The following are the results from Table VI of the original paper:
| Dataset | Year | Type | Data Sources | Simulator | Query Creation | Answer | Size |
| EQA v1 [138] | 2018 | Active EQA | SUNCG | House3D | Rule-Based | open-ended | 5,000+ |
| MT-EQA [139] | 2019 | Active EQA | SUNCG | House3D | Rule-Based | open-ended | 19,000+ |
| MP3D-EQA [140] | 2019 | Active EQA | MP3D | Simulator based on MINOS | Rule-Based | open-ended | 1,136 |
| IQUAD V1 [141] | 2018 | Interactive EQA | - | AI2THOR | Rule-Based | multi-choice | 75,000+ |
| VideoNavQA [142] | 2019 | Episodic Memory EQA | SUNCG | House3D | Rule-Based | open-ended | 101,000 |
| SQA3D [143] | 2022 | QA only | ScanNet | - | Manual | multi-choice | 33,400 |
| K-EQA [144] | 2023 | Active EQA | - | AI2THOR | Rule-Based | open-ended | 60,000 |
| OpenEQA [145] | 2024 | Active EQA, Episodic Memory EQA | ScanNet, HM3D | Habitat | Manual | open-ended | 1,600+ |
| HM-EQA [146] | 2024 | Active EQA | HM3D | Habitat | VLM | multi-choice | 500 |
| S-EQA [147] | 2024 | Active EQA | - | VirtualHome | LLM | binary | - |
| EXPRESS-Bench [148] | 2025 | Exploration-aware EQA | HM3D | Habitat | VLM | open-ended | 2,044 |

Examples include EQA v1 [138] (the first dataset, with synthetic indoor scenes), IQUAD V1 [141] (interactive questions requiring affordance understanding), K-EQA [144] (complex questions with logical clauses and knowledge), and OpenEQA [145] (the first open-vocabulary dataset for episodic memory and active exploration).
- Methods:
  - Neural Network Methods: Early approaches built deep neural networks, often combining Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) for the vision, language, navigation, and answering modules (Das et al. [138]). Techniques like imitation learning and reinforcement learning were used for training. Later works integrated the navigation and QA modules for joint training (Wu et al. [152]) or introduced Hierarchical Interactive Memory Networks for dynamic environments (Gordon et al. [141]). Some leveraged neural program synthesis and knowledge graphs for external knowledge and action planning (Tan et al. [144]).
  - LLMs/VLMs Methods: More recent methods leverage Large Language Models (LLMs) and Vision-Language Models (VLMs). For episodic memory EQA (EM-EQA), they use Blind LLMs, Socratic LLMs with language descriptions or scene graphs, or VLMs processing multiple scene frames (Majumdar et al. [145]). For active EQA (A-EQA), these methods are extended with frontier-based exploration (FBE) [154] to identify areas for exploration, building semantic maps, and using conformal prediction or image-text matching to decide when to stop exploring (Majumdar et al. [145], Sakamoto et al. [155]). Some use multiple LLM-based agents to collectively answer questions (Patel et al. [156]).
4.2.4.2. Embodied Grasping
Embodied grasping involves performing tasks like picking and placing objects based on human instructions, combining traditional kinematic methods with LLMs and Vision-Language Models (VLMs).
The following figure (Figure 8 from the original paper) illustrates language-guided grasping tasks, human-agent-object interaction, and publication trends:
The image is a schematic showing (a) language-guided grasping tasks, (b) human-agent-object interaction, and (c) publication status. The left part illustrates the relationship between grasping and the scene through different instruction examples (such as direct object descriptions and spatial reasoning), while the right part shows the number of papers published per year, reflecting the field's growth.
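A simple way to picture language-guided grasping, discussed throughout this subsection, is ranking candidate grasps by how well their target object matches the instruction. The naive string match below stands in for CLIP-style image-text similarity, and the candidate structure is a hypothetical illustration, not the interface of any cited system.

```python
# Illustrative language-conditioned grasp selection: combine a (placeholder) language score
# with a geometric quality score and pick the best candidate.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GraspCandidate:
    object_label: str                     # predicted class of the object under the gripper
    pose: Tuple[float, ...]               # e.g., a 6-DOF pose (x, y, z, roll, pitch, yaw)
    quality: float                        # geometric grasp-quality score in [0, 1]

def language_score(instruction: str, label: str) -> float:
    """Naive text match; a real system would use a vision-language model."""
    return 1.0 if label in instruction.lower() else 0.0

def select_grasp(instruction: str, candidates: List[GraspCandidate]) -> GraspCandidate:
    return max(candidates, key=lambda c: language_score(instruction, c.object_label) + c.quality)

candidates = [
    GraspCandidate("banana", (0.4, 0.1, 0.02, 0.0, 1.57, 0.0), quality=0.7),
    GraspCandidate("mug",    (0.2, 0.3, 0.05, 0.0, 1.57, 0.3), quality=0.9),
]
best = select_grasp("Grasp the banana", candidates)
print("selected object:", best.object_label, "pose:", best.pose)
```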
-
Datasets: Initially,
grasping datasetsfocused on kinematic annotations for single or cluttered objects from real (Cornell[159],OCID-Grasp[164]) or virtual environments (Jacquard[160],6-DOF GraspNet[161],ACRONYM[162],MultiGripperGrasp[163]). WithMLMs, these datasets have been augmented or reconstructed to include linguistic text, creatingsemantic-grasping datasets(OCID-VLG[165],ReasoningGrasp[166],CapGrasp[167]).The following are the results from Table VII of the original paper:
| Dataset | Year | Type | Modality | Grasp Label | Gripper Finger | Objects | Grasps | Scenes | Language |
| Cornell [159] | 2011 | Real | RGB-D | Rect. | 2 | 240 | 8K | Single | × |
| Jacquard [160] | 2018 | Sim | RGB-D | Rect. | 2 | 11K | 1.1M | Single | × |
| 6-DOF GraspNet [161] | 2019 | Sim | 3D | 6D | 2 | 206 | 7.07M | Single | × |
| ACRONYM [162] | 2021 | Sim | 3D | 6D | 2 | 8872 | 17.7M | Multi | × |
| MultiGripperGrasp [163] | 2024 | Sim | 3D | - | 2-5 | 345 | 30.4M | Single | × |
| OCID-Grasp [164] | 2021 | Real | RGB-D | Rect. | 2 | 89 | 75K | Multi | × |
| OCID-VLG [165] | 2023 | Real | RGB-D, 3D | Rect. | 2 | 89 | 75K | Multi | √ |
| ReasoningGrasp [166] | 2024 | Real | RGB-D | 6D | 2 | 64 | 99.3M | Multi | √ |
| CapGrasp [167] | 2024 | Sim | 3D | - | 5 | 1.8K | 50K | Single | √ |

- Language-guided grasping: Combines MLMs for semantic scene reasoning, allowing agents to grasp based on implicit or explicit human instructions.
  - Explicit instructions: Clearly specify the object category (e.g., "Grasp the banana").
  - Implicit instructions: Require reasoning to identify the target, involving spatial reasoning (e.g., "Grasp the keyboard that is to the right of the brown kleenex box" [165]) and logical reasoning (e.g., "I am thirsty, can you give me something to drink?" [166]).
- End-to-End Approaches: These models directly map multi-modal inputs to grasp outputs.
  - CLIPORT [168]: A language-conditioned imitation learning agent combining CLIP (a vision-language pre-trained model) with Transporter Net for semantic understanding and grasp generation. It is trained on expert demonstrations from virtual environments.
  - CROG [165]: Leverages CLIP's visual capabilities to learn grasp synthesis directly from image-text pairs.
  - Reasoning Grasping [166]: Integrates multimodal LLMs with vision-based robotic grasping to generate grasps based on semantics and vision.
  - SemGrasp [167]: Incorporates semantic information into grasp representations to generate dexterous hand grasp postures according to language instructions.
- Modular Approaches: Break down the task into distinct modules.
  - F3RM [169]: Elevates CLIP's text-image priors into 3D space, using the extracted features for language-based localization followed by grasp generation.
  - GaussianGrasper [170]: Constructs a 3D Gaussian field for feature distillation, performs language-based localization, and then generates grasp poses using a pre-trained grasping network.
4.2.5. Embodied Agent
An embodied agent is an autonomous entity capable of perceiving its environment and acting to achieve objectives, with its capabilities significantly expanded by MLMs. For a task, an embodied agent performs high-level task planning (decomposing complex tasks) and low-level action planning (executing subtasks and interacting with the environment).
The following figure (Figure 9 from the original paper) shows the framework of the embodied agent:
The image is a schematic showing the flow of high-level task planning and low-level action planning, including task planning, visual captioning, and visual representation. It also covers multi-modal large models (LLM/VLM), their use at the high and low levels, and the related embodied tasks.
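The two-level loop in Figure 9 can be summarized as: a high-level planner decomposes an instruction into subtasks, and a low-level executor dispatches each subtask to a skill. In the sketch below, the hard-coded planner and skill library are stand-ins for an LLM/VLM planner and for trained policy APIs; the names are hypothetical and the code is not from the surveyed systems detailed in the subsections that follow.

```python
# Minimal high-level / low-level planning loop: decompose, then dispatch to skills.
from typing import Callable, Dict, List

def high_level_planner(instruction: str) -> List[str]:
    """Stand-in for LLM chain-of-thought task decomposition."""
    if instruction == "bring me a cup of water":
        return ["navigate_to kitchen", "grasp cup", "fill cup", "navigate_to user", "handover cup"]
    return [instruction]

SKILLS: Dict[str, Callable[[str], bool]] = {      # low-level policy APIs (placeholders)
    "navigate_to": lambda target: True,
    "grasp":       lambda target: True,
    "fill":        lambda target: True,
    "handover":    lambda target: True,
}

def execute(instruction: str) -> None:
    for subtask in high_level_planner(instruction):
        skill_name, _, argument = subtask.partition(" ")
        success = SKILLS[skill_name](argument)     # a real agent could re-plan on failure
        print(f"{subtask}: {'ok' if success else 'failed'}")

execute("bring me a cup of water")
```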
4.2.5.1. Embodied Task Planning
This involves decomposing abstract and complex tasks into specific subtasks. Traditionally, this was rule-based (e.g., PDDL [173], MCTS [174], [175]). With LLMs, planning leverages rich embedded world knowledge for reasoning.
- Planning utilizing the Emergent Capabilities of LLMs: LLMs can decompose tasks using chain-of-thought reasoning and internal world knowledge without explicit training.
  - Translated LM [179] and Inner Monologue [180]: Break down complex tasks using internal logic.
  - ReAd [181]: A multi-agent collaboration framework that refines plans via prompts.
  - Memory banks: Store and recall past successful examples as skills for planning [182], [183], [184].
  - Code as reasoning medium: LLMs generate code based on API libraries for task planning [185], [186].
  - Multi-turn reasoning: Socratic Models [187] and Socratic Planner [188] use questioning to derive reliable plans, correcting hallucinations.
- Planning utilizing the visual information from embodied perception models: Integrating visual information is crucial to prevent plans from deviating from actual scenarios.
  - Object detectors: Query objects in the environment and feed the information back to LLMs to modify plans [187], [189], [190]. RoboGPT [8] refines this by considering different names for similar objects.
  - 3D scene graphs: SayPlan [191] uses hierarchical 3D scene graphs to represent the environment, improving task planning in large settings. ConceptGraphs [192] provides detailed open-world object detection and code-based planning via 3D scene graphs.
- Planning utilizing VLMs: VLMs can capture visual details and contextual information in latent space, aligning abstract visual features with structured textual features.
  - EmbodiedGPT [193]: Uses an Embodied-Former module to align embodied, visual, and textual information for task planning.
  - LEO [194]: Encodes 2D egocentric images and 3D scenes into visual tokens for 3D world perception and task execution.
  - EIF-Unknow [195]: Utilizes Semantic Feature Maps derived from Voxel Features as visual tokens for LLaVA-based task planning.
  - Embodied multimodal foundation models (VLA models): Examples like the RT series [2], [9], PaLM-E [196], and Matcha [197] are trained on large datasets to align visual and textual features for embodied scenarios.
4.2.5.2. Embodied Action Planning
This step focuses on executing subtasks derived from task planning, addressing real-world uncertainties due to the insufficient granularity of high-level plans [198].
- Action utilizing APIs: LLMs are provided with definitions of well-trained policy models (APIs) to use for specific tasks [189], [199]. They can generate code to abstract tools into a function library (Liang et al. [186]). Reflexion [200] adjusts these tools during execution, and DEPS [201] enables LLMs to learn and combine skills through zero-shot learning. This modular approach enhances flexibility but relies on the quality of the external policy models.
- Action utilizing VLA models: This paradigm integrates task planning and action execution within the same VLA model system, reducing communication latency and improving response speed. Embodied multimodal foundation models (the RT series [10], EmbodiedGPT [193]) tightly integrate perception, decision-making, and execution for efficient handling of complex tasks and dynamic environments. This allows for real-time feedback and strategy self-adjustment.
- Scalability in Diverse Environments: Strategies include hierarchical SLAM for mapping, multimodal perception, energy-efficient edge computing, multi-agent systems, decentralized communication, and domain adaptation for generalization in new environments.
4.2.6. Sim-to-Real Adaptation
Sim-to-Real adaptation is the process of transferring capabilities learned in simulated environments (cyberspace) to real-world scenarios (physical world), ensuring robust and reliable performance. It involves embodied world models, data collection and training methods, and embodied control algorithms.
4.2.6.1. Embodied World Model
These models predict the next state to inform decisions and are crucial for developing physical intuition. They are generally trained from scratch on physical world data, unlike VLA models which are pre-trained.
The following figure (Figure 10 from the original paper) shows three types of embodied world models:
The image is a schematic showing the three categories of embodied world models: generation-based, prediction-based, and knowledge-driven methods. Each has a different structure: generation-based methods learn the mapping between input and output spaces with autoencoders, prediction-based methods train the world model in a latent space, and knowledge-driven methods inject manually constructed knowledge into the model to satisfy specific knowledge constraints.
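The prediction-based flavor described below can be boiled down to: encode an observation into a latent state, roll the latent dynamics forward for candidate action sequences, and score the imagined trajectories. In the sketch below, the random linear "networks" and the reward head are placeholders for learned encoders and dynamics (e.g., JEPA-style models), not an implementation of any cited method.

```python
# Schematic latent world-model rollout: encode -> imagine -> pick the best action plan.
from typing import List
import numpy as np

rng = np.random.default_rng(0)
ENCODER = rng.normal(size=(4, 8))       # maps an 8-D observation to a 4-D latent state
DYNAMICS = rng.normal(size=(4, 4 + 1))  # maps (latent, action) to the next latent state

def encode(observation: np.ndarray) -> np.ndarray:
    return np.tanh(ENCODER @ observation)

def predict_next(latent: np.ndarray, action: float) -> np.ndarray:
    return np.tanh(DYNAMICS @ np.concatenate([latent, [action]]))

def imagined_return(latent: np.ndarray, actions: List[float]) -> float:
    total = 0.0
    for action in actions:
        latent = predict_next(latent, action)
        total += float(latent.sum())    # placeholder reward head
    return total

observation = rng.normal(size=8)
z = encode(observation)
candidates = [[1.0, 1.0, 1.0], [-1.0, 0.0, 1.0]]
best = max(candidates, key=lambda plan: imagined_return(z, plan))
print("chosen action sequence:", best)
```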
- Generation-based Methods: Generative models learn to understand and produce data (images [203], videos [16], [204], point clouds [205], or other formats [206]) that adhere to physical laws, thereby internalizing world knowledge. This enhances model generalization, robustness, adaptability, and predictive accuracy. Examples include World Models [203], Sora [16], Pandora [204], 3D-VLA [205], and DWM [206].
- Prediction-based Methods: These models predict and understand the environment by constructing and utilizing internal representations in a latent space. By reconstructing features based on conditions, they capture deeper semantics and world knowledge, enabling robots to perceive essential environmental representations (e.g., I-JEPA [15], MC-JEPA [207], A-JEPA [208], Point-JEPA [209], IWM [210]) and perform downstream tasks (iVideoGPT [211], IRASim [212], STP [213], MuDreamer [214]). Processing in the latent space allows knowledge to be abstracted and decoupled, leading to better generalization.
- Knowledge-driven Methods: These models are endowed with world knowledge by injecting artificially constructed knowledge.
  - Real2Sim2Real [217]: Uses real-world knowledge to build physics-compliant simulators for robot training.
  - Constructing common-sense or physics-compliant knowledge for generative models or simulators (e.g., ElastoGen [218], One-2-3-45 [219], PLoT [220]).
  - Combining artificial physical rules with LLMs/MLMs to generate diverse and semantically rich scenes through automatic spatial layout optimization (e.g., Holodeck [56], LEGENT [221], GRUtopia [222]).
4.2.6.2. Data Collection and Training
High-quality data is critical for sim-to-real adaptation.
-
Real-World Data: Collecting real-world robotic data is time-consuming and expensive. Efforts are focused on creating large, diverse datasets to enhance generalization. Examples include
Open X-Embodiment[202] (data from 22 robots, 527 skills, 160,266 tasks),UMI[224] (framework for bimanual data),Mobile ALOHA[225] (data for full-body mobile manipulation), andhuman-agent collaboration[226] for data quality. -
Simulated Data: Simulation-based data collection offers cost-effectiveness and efficiency. Examples include
CLIPORT[168] andTransporter Networks[227] usingPybulletdata,GAPartNet[228] for part-level annotations, andSemGrasp[167] creatingCapGraspfor semantic hand grasping. -
Sim-to-Real Paradigms: Various paradigms mitigate the need for extensive real-world data.
The following figure (Figure 11 from the original paper) shows five paradigms for
sim-to-real transfer:
The image is a schematic showing how agent skills learned in virtual environments are transferred to the real world, covering Real2Sim2Real, TRANSIC, Domain Randomization, and Lang4Sim2Real, and highlighting the associated model training and transfer steps and strategies.
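As a concrete illustration of the Domain Randomization paradigm listed below, the sketch resamples simulator parameters every episode so that a policy trained in simulation sees enough variation to cover real-world conditions. The parameter names and ranges are assumptions for illustration, not values from the cited works.

```python
# Illustrative domain-randomization loop: re-sample physics and sensing parameters per episode.
import random

def sample_randomized_sim_params() -> dict:
    return {
        "friction":           random.uniform(0.4, 1.2),
        "object_mass_kg":     random.uniform(0.05, 0.5),
        "light_intensity":    random.uniform(0.3, 1.0),
        "camera_jitter_deg":  random.uniform(-5.0, 5.0),
        "control_latency_ms": random.uniform(0.0, 40.0),
    }

def train_policy_in_simulation(episodes: int = 3) -> None:
    for episode in range(episodes):
        params = sample_randomized_sim_params()   # a fresh environment variation each episode
        # simulator.reset(**params); run a rollout; update the policy ...
        print(f"episode {episode}: {params}")

train_policy_in_simulation()
```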
Real2Sim2real[229]: Improvesimitation learningby usingreinforcement learningin a "digital twin" simulation, then transferring strategies to the real world. -
TRANSIC[230]: Reduces thesim-to-real gapthrough real-time human intervention andresidual policy trainingbased on corrected behaviors. -
Domain Randomization[231], [232], [233]: Increasesmodel generalizationby varying simulation parameters to cover real-world conditions. -
System Identification[234], [235]: Creates accurate simulations of real-world scenes to ensure smooth transitions. -
Lang4sim2real[236]: Leverages natural language descriptions to bridge thesim-to-real gap, improvingmodel generalizationwith cross-domain image representations. -
ARIO (All Robots in One): Proposed as a new dataset standard [237] to overcome limitations of existing datasets (lack of comprehensive sensory modalities, unified format, diverse control object representation, data volume, and combined simulated/real data).
ARIOunifies control and motion data from diverse robots, facilitating training of high-performing, generalizableembodied AI models. TheARIO datasetcomprises approximately 3 million episodes from 258 series and 321,064 tasks.The following figure (Figure 12 from the original paper) shows exemplar tasks from
ARIO:
The image is a schematic showing three exemplar task types: a long-horizon task, a bimanual manipulation task, and a contact-rich task. Each task is accompanied by its operating instructions, which involve picking up and placing objects.
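To make the idea behind ARIO more tangible, the sketch below shows a simplified, hypothetical episode record of the kind a unified standard argues for: one schema across robot morphologies, sensor sets, and simulated or real sources. The field names are assumptions for illustration, not the actual ARIO format.

```python
# Hypothetical unified episode schema (illustrative, not the real ARIO specification).
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Step:
    timestamp: float
    observations: Dict[str, bytes]     # e.g., {"rgb": ..., "depth": ..., "proprio": ...}
    action: List[float]                # unified control vector for this episode's robot
    language_instruction: str = ""

@dataclass
class Episode:
    robot_morphology: str              # e.g., "single_arm", "quadruped", "humanoid"
    source: str                        # "real" or "sim"
    task_description: str
    steps: List[Step] = field(default_factory=list)

episode = Episode(robot_morphology="single_arm", source="sim",
                  task_description="pick up the red block and place it in the tray")
episode.steps.append(Step(timestamp=0.0, observations={"rgb": b""},
                          action=[0.0] * 7, language_instruction=episode.task_description))
print(f"{episode.robot_morphology} episode with {len(episode.steps)} step(s)")
```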
Real-world Deployments of Embodied AI Systems:
Embodied AIsystems are deployed in healthcare (e.g.,Da Vinci Surgical System), logistics (Amazon Robotics,Boston Dynamics' Stretch), and manufacturing (Fanuc,ABB), enhancing precision and efficiency.
5. Experimental Setup
As this paper is a comprehensive survey, it does not present new experimental results from its own methodology. Instead, it reviews the experimental setups, datasets, evaluation metrics, and baselines used by the numerous research papers it surveys across various Embodied AI tasks. This section summarizes the common practices and key components of these experimental setups as described in the survey.
5.1. Datasets
The survey highlights a wide array of datasets tailored for different Embodied AI tasks, reflecting the diversity and complexity of the field. These datasets typically provide visual (RGB, depth, point clouds), textual (instructions, questions), and action data, often collected in or simulated from indoor or outdoor environments.
-
For Embodied Perception (Visual Language Navigation - VLN):
- Source & Characteristics: Many
VLNdatasets are built on 3D scanned or procedurally generated indoor environments, often leveraging platforms likeMatterport3D[53],AI2-THOR[50],Habitat[107],SUNCG[73], orOmniGibson[51]. Instructions range fromstep-by-step (SbS)todescribed goal navigation (DGN)ordemand-driven navigation (DDN). Environments can bediscrete(graph-based navigation) orcontinuous(free movement). - Examples:
R2R[105] (Room to Room): UsesMatterport3DforSbSinstructions in indoordiscreteenvironments.ALFRED[112]: Built onAI2-THOR, focusing onNavigation with Interaction (NwI)for household tasks incontinuousindoor environments.TOUCHDOWN[108]: Uses Google Street View, providing outdoorSbSnavigation.BEHAVIOR-1K[114]: Based onOmniGibson, forlong-span navigation with interaction (LSNwI)tasks.
- Data Sample: For
R2R, a data sample would include a starting panorama, a target location, and a natural language instruction like "Walk down the stairs, turn left, and stop at the red chair."
- Source & Characteristics: Many
-
For Embodied Interaction (Embodied Question Answering - EQA):
- Source & Characteristics:
EQAdatasets require agents to explore and gather information to answer questions. They use environments likeSUNCG[73],Matterport3D[151],AI2-THOR[50],ScanNet[143], andHM3D[145]. Questions can beopen-ended,multi-choice, orbinary, and types vary (e.g., location, color, existence, counting, spatial relationships, knowledge-based). - Examples:
EQA v1[138]: Built onSUNCGwithinHouse3D, with rule-based questions on object attributes.IQUAD V1[141]: Based onAI2-THOR, requires agents to interact with dynamic environments to answer questions.OpenEQA[145]: UsesScanNetandHM3DinHabitat, supportingopen-vocabularyand bothepisodic memoryandactive exploration.
- Data Sample: An
EQAsample might involve an agent in a kitchen, given the question "What color is the refrigerator?" and requiring navigation to find and identify the object.
- Source & Characteristics:
-
For Embodied Interaction (Embodied Grasping):
- Source & Characteristics: Grasping datasets typically provide
RGB-D images,point clouds, or3D sceneswith annotated grasp poses (e.g., 4-DOF or 6-DOF rectangles or 6D poses for grippers). WithMLMs, these datasets are augmented withlinguistic textforsemantic grasping. - Examples:
Cornell[159]: Real-worldRGB-Dimages with rectangle grasp annotations for single objects.6-DOF GraspNet[161]: Simulated 3D data with 6D grasp poses.OCID-VLG[165]: RealRGB-Dand 3D data with semantic expressions, linking language, vision, and grasping.CapGrasp[167]: Simulated 3D data with semantic descriptions for dexterous hand grasping.
- Data Sample: A grasping sample would include an image of an object (e.g., a banana) and an instruction like "Grasp the banana," with corresponding valid grasp configurations.
- Source & Characteristics: Grasping datasets typically provide
- For Sim-to-Real Adaptation & General-Purpose Agents:
  - ARIO (All Robots In One) [237]: This proposed standard and dataset aim to unify control and motion data from diverse robot morphologies in a consistent format. It combines simulated and real data to address the sim-to-real gap and facilitate large-scale pretraining for general-purpose embodied agents. The dataset is vast, with approximately 3 million episodes from 258 series and 321,064 tasks.
  - Open X-Embodiment [202]: A large-scale real-world dataset providing data from 22 robots with 527 skills and 160,266 tasks in domestic settings.

The choice of datasets is driven by the need to validate specific aspects of Embodied AI, from low-level perception and control to high-level reasoning and interaction. Synthetic datasets allow for scale and control, while real-world datasets ensure relevance and robustness. The increasing integration of language into datasets reflects the growing role of MLMs; a minimal, hypothetical sketch of such a language-conditioned episode record follows.
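To make the structure of such language-conditioned navigation data concrete, here is a minimal, hypothetical sketch of a single R2R-style episode record in Python; the field names are illustrative and do not reflect the dataset's actual schema.

```python
# Hypothetical sketch of a single R2R-style VLN episode record.
# Field names are illustrative only and do not mirror the actual R2R schema.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class VLNEpisode:
    episode_id: str                                # unique identifier for the episode
    scene_id: str                                  # e.g., a Matterport3D building scan
    start_position: Tuple[float, float, float]     # agent start (x, y, z) in meters
    start_heading: float                           # initial yaw in radians
    goal_position: Tuple[float, float, float]      # target location
    instruction: str                               # natural-language step-by-step instruction
    reference_path: List[str] = field(default_factory=list)  # viewpoint IDs of the shortest path

example = VLNEpisode(
    episode_id="ep_0001",
    scene_id="scene_17DRP5sb8fy",
    start_position=(1.2, 0.0, -3.4),
    start_heading=1.57,
    goal_position=(5.8, 0.0, 2.1),
    instruction="Walk down the stairs, turn left, and stop at the red chair.",
    reference_path=["vp_03", "vp_11", "vp_24"],
)
print(example.instruction)
```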
5.2. Evaluation Metrics
The evaluation metrics used in Embodied AI are highly task-dependent, reflecting different aspects of an agent's performance, such as navigation efficiency, task completion, interaction accuracy, and semantic understanding.
- For Visual Language Navigation (VLN):
  - Success Rate (SR):
    - Conceptual Definition: Measures the percentage of times an agent successfully reaches the target location or completes the instructed task. It is a primary indicator of task completion.
    - Mathematical Formula: $ \mathrm{SR} = \frac{\text{Number of successful episodes}}{\text{Total number of episodes}} \times 100\% $
    - Symbol Explanation:
      - Number of successful episodes: The count of navigation attempts where the agent reaches the goal.
      - Total number of episodes: The total count of navigation attempts.
  - Success weighted by Path Length (SPL):
    - Conceptual Definition: This metric balances Success Rate with path efficiency. It penalizes agents that take overly long routes to reach the goal, even if they succeed. A higher SPL indicates both successful and efficient navigation.
    - Mathematical Formula: $ \mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \frac{L_{shortest,i}}{\max(P_i, L_{shortest,i})} $
    - Symbol Explanation:
      - $N$: The total number of episodes.
      - $S_i$: Binary indicator, 1 if episode $i$ is successful, 0 otherwise.
      - $L_{shortest,i}$: Length of the shortest path from the start to the goal in episode $i$.
      - $P_i$: Length of the path taken by the agent in episode $i$.
  - Navigation Error (NE):
    - Conceptual Definition: Measures the geodesic distance between the agent's final position and the target goal position. Lower NE indicates more accurate navigation.
    - Mathematical Formula: $ \mathrm{NE} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{geodesic\_distance}(\text{final\_pos}_i, \text{goal\_pos}_i) $
    - Symbol Explanation:
      - $N$: The total number of episodes.
      - $\text{final\_pos}_i$: The agent's final position in episode $i$.
      - $\text{goal\_pos}_i$: The target goal position in episode $i$.
      - $\mathrm{geodesic\_distance}$: The shortest path distance between two points in the environment.
  - Path Length (PL):
    - Conceptual Definition: The total length of the path traversed by the agent during an episode. Often reported for successful episodes to assess efficiency alongside SPL.
    - Mathematical Formula: $ \mathrm{PL} = \frac{1}{\text{Num successful episodes}} \sum_{i \in \text{successful episodes}} P_i $
    - Symbol Explanation:
      - Num successful episodes: The count of successful navigation attempts.
      - $P_i$: Length of the path taken by the agent in successful episode $i$.
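The navigation metrics above reduce to simple aggregations over evaluation episodes. The following is a minimal sketch, assuming each episode record provides a success flag, the agent's path length, the shortest-path length, and the final geodesic distance to the goal (field names are illustrative).

```python
# Minimal sketch of computing SR, SPL, and NE for VLN evaluation.
# Each episode dict is assumed to provide: success (bool), path_length (m),
# shortest_path_length (m), and goal_distance (final geodesic distance to goal, m).
from typing import Dict, List

def vln_metrics(episodes: List[Dict]) -> Dict[str, float]:
    n = len(episodes)
    sr = sum(ep["success"] for ep in episodes) / n * 100.0
    spl = sum(
        ep["success"] * ep["shortest_path_length"]
        / max(ep["path_length"], ep["shortest_path_length"])
        for ep in episodes
    ) / n
    ne = sum(ep["goal_distance"] for ep in episodes) / n
    return {"SR (%)": sr, "SPL": spl, "NE (m)": ne}

episodes = [
    {"success": True, "path_length": 12.0, "shortest_path_length": 10.0, "goal_distance": 0.5},
    {"success": False, "path_length": 20.0, "shortest_path_length": 9.0, "goal_distance": 6.2},
]
print(vln_metrics(episodes))  # e.g., {'SR (%)': 50.0, 'SPL': 0.416..., 'NE (m)': 3.35}
```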
- For Embodied Question Answering (EQA):
  - Accuracy (Acc):
    - Conceptual Definition: The proportion of questions for which the agent provides the correct answer. For multiple-choice questions, it is a direct count; for open-ended questions, it often involves semantic similarity or exact match.
    - Mathematical Formula: $ \mathrm{Acc} = \frac{\text{Number of correctly answered questions}}{\text{Total number of questions}} \times 100\% $
    - Symbol Explanation:
      - Number of correctly answered questions: Count of questions where the agent's answer matches the ground truth.
      - Total number of questions: Total count of questions posed to the agent.
  - Success (S) / Answer F1: For open-ended EQA, the F1-score may be used to evaluate the quality of generated answers by comparing them to reference answers, especially when multiple correct phrasings are possible.
    - Conceptual Definition: The F1-score is the harmonic mean of precision and recall, providing a single metric that balances both. Precision measures the proportion of the agent's answer that is correct, while recall measures the proportion of the reference answer that the agent actually produced.
    - Mathematical Formula: $ \mathrm{F1} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $, where $ \mathrm{Precision} = \frac{\text{Number of correct words in agent's answer}}{\text{Total words in agent's answer}} $ and $ \mathrm{Recall} = \frac{\text{Number of correct words in agent's answer}}{\text{Total words in ground truth answer}} $.
    - Symbol Explanation:
      - Precision: The proportion of the agent's predicted words that are relevant.
      - Recall: The proportion of relevant words that are correctly predicted by the agent.
      - Number of correct words in agent's answer: Overlap between the agent's answer and the ground truth.
      - Total words in agent's answer: Total words in the agent's generated response.
      - Total words in ground truth answer: Total words in the reference answer.
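For open-ended answers, the F1 computation above is typically implemented as a bag-of-words overlap between the predicted and reference answers. Below is a minimal sketch assuming simple whitespace tokenization; actual benchmarks may normalize text or use different tokenizers.

```python
# Minimal sketch of a bag-of-words F1 score for open-ended EQA answers.
# Real benchmarks may lowercase, strip punctuation, or use other tokenizers.
from collections import Counter

def answer_f1(prediction: str, ground_truth: str) -> float:
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gt_tokens)   # multiset intersection
    num_correct = sum(common.values())
    if num_correct == 0:
        return 0.0
    precision = num_correct / len(pred_tokens)
    recall = num_correct / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

print(answer_f1("the refrigerator is white", "white"))  # 0.4
```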
- For Embodied Grasping:
  - Grasp Success Rate:
    - Conceptual Definition: The percentage of attempts where the robot successfully grasps the target object in a stable manner.
    - Mathematical Formula: $ \mathrm{Grasp\ Success\ Rate} = \frac{\text{Number of successful grasps}}{\text{Total number of grasp attempts}} \times 100\% $
    - Symbol Explanation:
      - Number of successful grasps: Count of instances where the robot successfully picked up and held the object.
      - Total number of grasp attempts: Total count of attempts to grasp an object.
  - Placement Success Rate:
    - Conceptual Definition: The percentage of times the robot successfully places a grasped object in the specified target location.
    - Mathematical Formula: $ \mathrm{Placement\ Success\ Rate} = \frac{\text{Number of successful placements}}{\text{Total number of placement attempts}} \times 100\% $
    - Symbol Explanation:
      - Number of successful placements: Count of instances where the robot successfully deposited the object at the target.
      - Total number of placement attempts: Total count of attempts to place an object.
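Both grasping metrics are plain ratios over evaluation trials; a minimal sketch with illustrative trial fields is shown below.

```python
# Minimal sketch of tallying grasp and placement success rates over trials.
# Each trial dict is assumed to record whether the grasp was stable and,
# if a placement was attempted, whether the object landed at the target.
def success_rate(flags):
    return sum(flags) / len(flags) * 100.0 if flags else 0.0

trials = [
    {"grasp_success": True, "placement_attempted": True, "placement_success": True},
    {"grasp_success": True, "placement_attempted": True, "placement_success": False},
    {"grasp_success": False, "placement_attempted": False, "placement_success": False},
]
grasp_sr = success_rate([t["grasp_success"] for t in trials])
place_sr = success_rate([t["placement_success"] for t in trials if t["placement_attempted"]])
print(f"Grasp SR: {grasp_sr:.1f}%, Placement SR: {place_sr:.1f}%")  # 66.7%, 50.0%
```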
5.3. Baselines
The survey implicitly discusses various baselines against which new Embodied AI methods are compared. These baselines represent the state-of-the-art or common approaches prior to the introduction of specific innovations.
- Traditional Deep Reinforcement Learning (DRL) Approaches: Many early VLN and EQA models were based on DRL, where an agent learns policies through trial and error in the environment. These are often considered representative as they learn behaviors from scratch.
- Imitation Learning (IL) Methods: Many embodied control and grasping tasks use IL, where a model learns by observing expert demonstrations. CLIPORT [168] is a language-conditioned IL agent.
- Rule-Based or Symbolic Planning Methods: For task planning, traditional approaches often rely on explicit rules, logical reasoning (e.g., PDDL [173]), or search algorithms (MCTS [174], [175]). These are representative of non-LLM-driven planning.
- Earlier Vision-Language Models: Prior to MLMs, vision-language models with more limited capacity or specific architectures served as baselines for tasks like VLN and EQA. The survey highlights that MLMs have injected "strong perception, interaction and planning capabilities" compared to these earlier models.
- Model-Free vs. Model-Based RL: In VLN, LookBY [129] is mentioned as bridging model-free RL (learning directly from interactions) and model-based RL (World Model-driven prediction), indicating these as common baseline categories.
- Without External Knowledge: For K-EQA [144], baselines would be EQA models that do not leverage external knowledge bases for answering complex questions.
- Sim-only Trained Policies: In Sim-to-Real adaptation, a primary baseline is a policy trained purely in simulation without any domain randomization, system identification, or transfer learning techniques. The performance of such a policy in the real world highlights the sim-to-real gap (a minimal sketch of the domain randomization such baselines omit appears after this list).

These baselines are chosen because they represent the established or prior state-of-the-art methods against which new MLM- and WM-based approaches aim to demonstrate superior generalization, adaptability, reasoning, and perception capabilities in complex and dynamic embodied AI tasks.
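To illustrate the domain randomization that sim-only baselines omit, here is a minimal, framework-agnostic sketch in which physical and visual parameters are resampled at every episode reset so the learned policy cannot overfit to a single simulated configuration. The parameter names, ranges, and callbacks are purely illustrative.

```python
# Minimal, framework-agnostic sketch of per-episode domain randomization.
# Parameter names and ranges are illustrative; real setups randomize whatever
# quantities their particular simulator exposes (e.g., via PhysX or Bullet).
import random

def sample_randomized_params():
    return {
        "friction": random.uniform(0.5, 1.2),         # surface friction coefficient
        "object_mass_kg": random.uniform(0.05, 0.5),  # mass of the manipulated object
        "light_intensity": random.uniform(0.3, 1.0),  # visual appearance variation
        "camera_jitter_deg": random.uniform(-3, 3),   # extrinsic calibration noise
        "action_latency_ms": random.uniform(0, 50),   # actuation delay
    }

def train(num_episodes, reset_env, run_policy_update):
    for _ in range(num_episodes):
        params = sample_randomized_params()  # new physics/appearance each episode
        env_state = reset_env(params)        # simulator applies the sampled params
        run_policy_update(env_state)         # collect rollouts and update the policy

# Usage with stub callbacks standing in for a simulator and learner:
train(3, reset_env=lambda p: p, run_policy_update=lambda s: None)
```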
6. Results & Analysis
As a survey paper, this document does not present new experimental results from its own research. Instead, it synthesizes and analyzes the findings from a vast body of existing research in Embodied AI. The "results" discussed here are the overall trends, advancements, and the current state of the art revealed through the comparative review of numerous studies.
6.1. Core Results Analysis
The survey highlights a significant shift and acceleration in Embodied AI research, primarily driven by the emergence of Multi-modal Large Models (MLMs) and World Models (WMs).
- Enhanced Perception and Reasoning: MLMs have endowed embodied agents with remarkable perception and reasoning capabilities, enabling them to understand complex and dynamic environments more thoroughly than traditional methods. The integration of pre-trained visual representations (e.g., from CLIP, BLIP-2) with Large Language Models (LLMs) has provided precise object estimations, improved linguistic instruction understanding, and robust alignment of visual and linguistic features.
- Improved Interaction: MLMs facilitate more natural human-robot interaction, allowing agents to understand nuanced instructions and perform complex tasks like embodied grasping based on semantic reasoning (both explicit and implicit instructions). Embodied Question Answering (EQA) has progressed from simple object queries to complex, knowledge-intensive questions, often requiring active exploration.
- Advanced Agent Architectures: Embodied agents are evolving towards general-purpose capabilities, moving beyond single-task AI. The survey shows a trend towards hierarchical task planning (decomposing abstract goals into subtasks) and action planning (executing low-level steps). LLMs are increasingly used for high-level planning, while Vision-Language-Action (VLA) models integrate perception, decision-making, and execution, reducing latency and improving adaptation (a minimal sketch of this hierarchical pattern appears at the end of this subsection).
- Progress in Sim-to-Real Adaptation: The paper emphasizes the critical need for sim-to-real adaptation to transfer learned capabilities from cost-effective and safe simulations to the physical world. Embodied World Models are shown to be highly promising, as they learn predictive representations of physical laws, allowing agents to develop intuition and simulate future states. Various sim-to-real paradigms (e.g., Domain Randomization, Real2Sim2Real, System Identification) are actively being researched to bridge the domain gap.
- Growth of Datasets and Simulators: The field is characterized by the continuous development of more realistic and diverse embodied simulators (both general-purpose and real-scene-based) and of datasets that are larger-scale, multi-modal, and increasingly designed for complex, long-horizon tasks. The proposal of ARIO underscores the community's effort towards standardized, versatile datasets for general-purpose embodied agents.

Overall, the survey concludes that MLMs and WMs provide a feasible and powerful approach for Embodied AI to align cyberspace intelligence with physical-world interaction, bringing the field closer to achieving AGI. However, significant challenges remain, particularly in data quality, long-horizon autonomy, causal reasoning, and robust evaluation.
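To make the hierarchical planning pattern described above concrete, the following is a minimal sketch of the split between an LLM-based high-level task planner and a library of low-level skills. The `call_llm` stub and the skill names are placeholders, not any specific system's API.

```python
# Minimal sketch of hierarchical planning: an LLM decomposes an abstract goal
# into subtasks, and a fixed library of low-level skills executes each one.
# `call_llm` and the skill names are placeholders, not a specific system's API.
from typing import Callable, Dict, List

def call_llm(prompt: str) -> List[str]:
    # Placeholder for an actual LLM call; returns a hard-coded plan here.
    return ["navigate_to(sink)", "pick_up(sponge)", "wipe(counter)"]

SKILLS: Dict[str, Callable[[str], bool]] = {
    "navigate_to": lambda target: True,   # low-level navigation controller
    "pick_up": lambda target: True,       # grasping controller
    "wipe": lambda target: True,          # wiping/manipulation controller
}

def execute(goal: str) -> bool:
    plan = call_llm(f"Decompose the task '{goal}' into robot subtasks.")
    for step in plan:
        skill, arg = step.split("(")
        ok = SKILLS[skill](arg.rstrip(")"))
        if not ok:                         # replan or abort on subtask failure
            return False
    return True

print(execute("clean the kitchen"))  # True
```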
6.2. Data Presentation (Tables)
The following are the results from Table I of the original paper:
| Type | Environment | Physical Entities | Description | Representative Agents |
| Disembodied AI | Cyber Space | No | Cognition and physical entities are disentangled | ChatGPT [4], RoboGPT [8] |
| Embodied AI | Physical Space | Robots, Cars, Other devices | Cognition is integrated into physical entities | RT-1 [9], RT-2 [10], RT-H [3] |
The following are the results from Table II of the original paper:
| Simulator | Year | Rendering | Robotics-specific features | CP | Physics Engine | Main Applications | ||||
| HFPS | HQGR | RRL | DLS | LSPC | ROS | MSS | ||||
| Genesis [35] | 2024 | O | o | O | O | O | O | Custom | RL, LSPS, RS | |
| Isaac Sim [36] | 2023 | O | O | O | O | O | O | O | PhysX | Nav, AD |
| Isaac Gym [37] | 2019 | O | O | O | PhysX | RL, LSPS | ||||
| Gazebo [38] | 2004 | O | O | O | O | O | ODE, Bullet, Simbody, DART | Nav, MR | ||
| PyBullet [39] | 2017 | O | O | Bullet | RL, RS | |||||
| Webots [40] | 1996 | O | O | O | O | ODE | RS | |||
| MuJoCo [41] | 2012 | O | o | Custom | RL, RS | |||||
| Unity ML-Agents [42] | 2017 | O | O | O | Custom | RL, RS | ||||
| AirSim [43] | 2017 | O | o | Custom | Drone sim, AD, RL | |||||
| MORSE [44] | 2015 | o | O | Bullet | Nav, MR | |||||
| V-REP (CoppeliaSim) [45] | 2013 | O | O | O | O | Bullet, ODE, Vortex, Newton | MR, RS | |||
The following are the results from Table III of the original paper:
| Function | Type | Methods |
| vSLAM | Traditional vSLAM | MonoSLAM [60], ORB-SLAM [61], LSD-SLAM [62] |
| Semantic vSLAM | SLAM++ [63], QuadricSLAM [64], So-SLAM [65],SG-SLAM [66], OVD-SLAM [67], GS-SLAM [68] | |
| 3D Scene Understanding | Projection-based | MV3D [69], PointPillars [70], MVCNN [71] |
| Voxel-based | VoxNet [72], SSCNet [73], MinkowskiNet [74], SSCNs [75], Embodiedscan [76] | |
| Point-based | PointNet [77], PointNet++ [78], PointMLP [79], PointTransformer [80], Swin3d [81], PT2 [82], 3D-VisTA [83], LEO [84], PQ3D [85], PointMamba [86], Mamba3D [87] | |
| Active Exploration | Interacting with the environment | Pinto et al. [88], Tatiya et al. [89] |
| Changing the viewing direction | Jayaraman et al. [90], NeU-NBV [91], Hu et al. [92], Fan et al. [93] |
The following are the results from Table IV of the original paper:
| Dataset | Year | Simulator | Environment | Feature | Size |
| R2R [105] | 2018 | M3D | I, D | SbS | 21,567 |
| R4R [106] | 2019 | M3D | I, D | SbS | 200,000+ |
| VLN-CE [107] | 2020 | Habitat | I, C | SbS | - |
| TOUCHDOWN [108] | 2019 | - | O, D | SbS | 9,326 |
| REVERIE [109] | 2020 | M3D | I, D | DGN | 21,702 |
| SOON [110] | 2021 | M3D | I, D | DGN | 3,848 |
| DDN [111] | 2023 | AT | I, C | DDN | 30,000+ |
| ALFRED [112] | 2020 | AT | I, C | NwI | 25,743 |
| OVMM [113] | 2023 | Habitat | I, C | NwI | 7,892 |
| BEHAVIOR-1K [114] | 2023 | OG | I, C | LSNwI | 1,000 |
| CVDN [115] | 2020 | M3D | I, D | D&O | 2,050 |
| DialFRED [116] | 2022 | AT | I, C | D&O | 53,000 |
The following are the results from Table V of the original paper:
| Method | Model | Year | Feature |
| Memory-Understanding Based | LVERG [117] | 2020 | Graph Learning |
| CMG [118] | 2020 | Adversarial Learning | |
| RCM [119] | 2021 | Reinforcement learning | |
| FILM [120] | 2022 | Semantic Map | |
| LM-Nav [121] | 2022 | Graph Learning | |
| HOP [122] | 2022 | History Modeling | |
| NaviLLM [123] | 2024 | Large Model | |
| FSTT [124] | 2024 | Test-Time Augmentation | |
| DiscussNav [125] | 2024 | Large Model | |
| GOAT [126] | 2024 | Causal Learning | |
| VER [127] | 2024 | Environment Encoder | |
| NaVid [128] | 2024 | Large Model | |
| Future-Prediction Based | LookBY [129] | 2018 | Reinforcement Learning |
| NvEM [130] | 2021 | Environment Encoder | |
| BGBL [131] | 2022 | Graph Learning | |
| Mic [132] | 2023 | Large Model | |
| HNR [133] | 2024 | Environment Encoder | |
| ETPNav [134] | 2024 | Graph Learning | |
| Others | MCR-Agent [135] | 2023 | Multi-Level Model |
| OVLM [136] | 2023 | Large Model |
The following are the results from Table VI of the original paper:
| Dataset | Year | Type | Data Sources | Simulator | Query Creation | Answer | Size |
| EQA v1 [138] | 2018 | Active EQA | SUNCG | House3D | Rule-Based | open-ended | 5,000+ |
| MT-EQA [139] | 2019 | Active EQA | SUNCG | House3D | Rule-Based | open-ended | 19,000+ |
| MP3D-EQA [140] | 2019 | Active EQA | MP3D | Simulator based on MINOS | Rule-Based | open-ended | 1,136 |
| IQUAD V1 [141] | 2018 | Interactive EQA | AI2THOR | | Rule-Based | multi-choice | 75,000+ |
| VideoNavQA [142] | 2019 | Episodic Memory EQA | SUNCG | House3D | Rule-Based | open-ended | 101,000 |
| SQA3D [143] | 2022 | QA only | ScanNet | | Manual | multi-choice | 33,400 |
| K-EQA [144] | 2023 | Active EQA | AI2THOR | | Rule-Based | open-ended | 60,000 |
| OpenEQA [145] | 2024 | Active EQA, Episodic Memory EQA | ScanNet, HM3D | Habitat | Manual | open-ended | 1,600+ |
| HM-EQA [146] | 2024 | Active EQA | HM3D | Habitat | VLM | multi-choice | 500 |
| S-EQA [147] | 2024 | Active EQA | VirtualHome | | LLM | binary | |
| EXPRESS-Bench [148] | 2025 | Exploration-aware EQA | HM3D | Habitat | VLM | open-ended | 2,044 |
The following are the results from Table VII of the original paper:
| Dataset | Year | Type | Modality | Grasp Label | Gripper Finger | Objects | Grasps | Scenes | Language |
| Cornell [159] | 2011 | Real | RGB-D | Rect. | 2 | 240 | 8K | Single | × |
| Jacquard [160] | 2018 | Sim | RGB-D | Rect. | 2 | 11K | 1.1M | Single | × |
| 6-DOF GraspNet [161] | 2019 | Sim | 3D | 6D | 2 | 206 | 7.07M | Single | × |
| ACRONYM [162] | 2021 | Sim | 3D | 6D | 2 | 8872 | 17.7M | Multi | × |
| MultiGripperGrasp [163] | 2024 | Sim | 3D | - | 2-5 | 345 | 30.4M | Single | × |
| OCID-Grasp [164] | 2021 | Real | RGB-D | Rect. | 2 | 89 | 75K | Multi | × |
| OCID-VLG [165] | 2023 | Real | RGB-D, 3D | Rect. | 2 | 89 | 75K | Multi | √ |
| ReasoningGrasp [166] | 2024 | Real | RGB-D | 6D | 2 | 64 | 99.3M | Multi | √ |
| CapGrasp [167] | 2024 | Sim | 3D | - | 5 | 1.8K | 50K | Single | √ |
6.3. Ablation Studies / Parameter Analysis
While the survey itself does not conduct ablation studies, it highlights the importance of such analyses implicitly by discussing various components of Embodied AI systems. In the context of the surveyed papers, ablation studies are crucial for:
- Validating Component Effectiveness: Researchers in Embodied AI regularly conduct ablation studies to verify the contribution of individual modules or novel architectural choices (e.g., specific multi-modal fusion strategies, memory mechanisms, planning components, sim-to-real transfer techniques). For instance, in VLN or EQA tasks, studies often compare performance with and without semantic mapping, dialogue history, or different visual encoders.
- Understanding Hyper-parameter Impact: The performance of embodied agents, especially those based on Reinforcement Learning or Large Models, is highly sensitive to hyper-parameters (e.g., learning rates, reward function weights, network sizes, exploration strategies). Surveyed papers would typically perform parameter sweeps or sensitivity analyses to demonstrate robustness or identify optimal configurations.
- Generalization and Robustness: Ablation studies often involve testing components under varying conditions (e.g., unseen environments, different object configurations, varying levels of noise) to assess how robustly a particular design generalizes. This is particularly relevant for sim-to-real adaptation, where the impact of domain randomization parameters or the effectiveness of policy distillation techniques would be ablated.

The detailed taxonomy presented in the survey, breaking down Embodied AI into perception, interaction, agent, and sim-to-real, directly supports the design of such ablation studies, allowing researchers to isolate and evaluate the contribution of advancements in each sub-area. The push for unified evaluation benchmarks and comprehensive datasets like ARIO is also aimed at making these comparative analyses more standardized and meaningful across the field. A minimal sketch of how such an ablation sweep is typically organized is shown below.
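As a concrete illustration of how such component ablations are commonly organized, the sketch below toggles two hypothetical module flags and records a score per configuration; the flags and the `evaluate` function are placeholders rather than any surveyed system's interface.

```python
# Minimal sketch of a component ablation sweep for an embodied agent.
# The module flags and the evaluate() function are placeholders.
from itertools import product

def evaluate(use_semantic_map: bool, use_dialogue_history: bool) -> float:
    # Placeholder: in practice this would run the agent on a validation split
    # and return a task metric such as SPL or success rate.
    return 0.5 + 0.1 * use_semantic_map + 0.05 * use_dialogue_history

results = {}
for sem_map, dialog in product([False, True], repeat=2):
    results[(sem_map, dialog)] = evaluate(sem_map, dialog)

for config, score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"semantic_map={config[0]}, dialogue_history={config[1]}: {score:.2f}")
```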
7. Conclusion & Reflections
7.1. Conclusion Summary
This comprehensive survey thoroughly reviews the advancements in Embodied Artificial Intelligence (Embodied AI), positioning it as a critical pathway toward achieving Artificial General Intelligence (AGI). The paper emphasizes how the emergence of Multi-modal Large Models (MLMs) and World Models (WMs) has fundamentally transformed Embodied AI, endowing agents with unprecedented perception, interaction, and reasoning capabilities essential for bridging the gap between cyberspace and the physical world.
The survey provides a detailed taxonomy, starting with a review of embodied robots and simulators (both general and real-scene-based). It then delves into four core research areas: embodied perception (including active visual perception and visual language navigation), embodied interaction (covering embodied question answering and grasping), embodied agents (examining multi-modal foundation models and task planning), and sim-to-real adaptation (focusing on embodied world models, data collection, and training paradigms). A significant contribution is the proposal of ARIO (All Robots In One), a new unified dataset standard and a large-scale dataset designed to facilitate the development of versatile, general-purpose embodied agents. Finally, the survey outlines the pressing challenges and promising future directions, aiming to serve as a foundational reference for the research community.
7.2. Limitations & Future Work
The authors clearly identify several critical challenges and limitations that Embodied AI currently faces, which also point to key future research directions:
- High-quality Robotic Datasets:
  - Limitation: A scarcity of sufficient, high-quality real-world robotic data due to time and resource intensity. Over-reliance on simulation data exacerbates the sim-to-real gap. Existing multi-robot datasets lack unified formats, comprehensive sensory modalities, and combined simulated/real data.
  - Future Work: Develop more realistic and efficient simulators. Foster extensive collaboration among institutions for diverse real-world data collection. Construct large-scale datasets (like the proposed ARIO) that leverage high-quality simulated data to augment real-world data, enabling generalizable embodied models capable of cross-scenario and cross-task applications.
- Long-Horizon Task Execution:
  - Limitation: Current high-level task planners struggle with long-horizon tasks (e.g., "clean the kitchen") in diverse scenarios due to insufficient tuning for embodied tasks and limited ability to execute complex sequences of low-level actions.
  - Future Work: Develop efficient planners with robust perception capabilities and extensive commonsense knowledge. Implement hybrid planning approaches that combine lightweight, high-frequency monitoring modules with lower-frequency adapters for subtask and path adaptation reasoning, balancing planning complexity and real-time adaptability.
- Causal Reasoning:
  - Limitation: Existing data-driven embodied agents often make decisions based on superficial data correlations rather than true causal relations between knowledge, behavior, and environment, leading to biased and unreliable strategies in real-world settings.
  - Future Work: Develop embodied agents driven by world knowledge and capable of autonomous causal reasoning. Agents should learn how the world works via abductive reasoning through interaction, establish spatial-temporal causal relations across modalities via interactive instructions and state predictions, and understand object affordances for adaptive task planning in dynamic scenes.
- Unified Evaluation Benchmark:
  - Limitation: Existing benchmarks for low-level control policies vary significantly in assessed skills, and objects/scenes are often constrained by simulator limitations. Many high-level task planner benchmarks only assess planning (e.g., via question-answering) in isolation.
  - Future Work: Develop unified benchmarks that encompass a diverse range of skills using realistic simulators. Evaluate both high-level task planners and low-level control policies together for executing long-horizon tasks, measuring overall success rates rather than isolated component performance.
- Security and Privacy:
  - Limitation: Embodied agents deployed in sensitive or private spaces face significant security challenges, especially as they rely on LLMs for decision-making. LLMs are susceptible to backdoor attacks (word injection, scenario manipulation, knowledge injection) that could lead to hazardous actions (e.g., autonomous vehicles accelerating into obstacles).
  - Future Work: Evaluate potential attack vectors and develop more robust defenses. Implement secure prompting, state management, and safety validation mechanisms to enhance security and robustness.
7.3. Personal Insights & Critique
This survey is a timely and valuable contribution to the Embodied AI literature. Its strength lies in its explicit focus on Multi-modal Large Models (MLMs) and World Models (WMs), which are indeed defining the current frontier of AI. For a novice, the systematic breakdown of Embodied AI into robots, simulators, and functional tasks (perception, interaction, agent, sim-to-real) provides an excellent structured entry point into a complex field. The consistent emphasis on the cyber-physical alignment clarifies the overarching goal.
The proposal of ARIO is a particularly insightful and practical contribution. The current fragmentation of datasets is a major bottleneck, and a unified standard could genuinely accelerate research by enabling more consistent pre-training and benchmarking of general-purpose embodied agents. The scale of the proposed ARIO dataset (3 million episodes) is impressive and indicative of the large data requirements for foundation models.
Potential Issues or Areas for Improvement:
- Depth of Technical Details within Surveyed Methods: While the survey categorizes methods and lists representative works, for a truly "beginner-friendly" deep-dive it could have provided slightly more technical intuition or a simplified architectural overview for a few exemplar methods within each category (e.g., how ORB-SLAM works at a high level, or the basic structure of a Vision-Language-Action model). Given the breadth, this is a difficult balance, but for a foundational reference, a "typical architecture" for each sub-task would be beneficial.
- Explicit Examples of MLM/WM Mechanics: The survey discusses MLMs and WMs extensively, but a slightly deeper dive into how they specifically enhance perception or planning (e.g., "latent space" processing for WMs explained with a clearer analogy, or how cross-modal attention in MLMs grounds language to vision) could further cement understanding for a beginner.
- Real-world Impact vs. Research Promise: While applications are mentioned, a more explicit discussion of the current maturity levels of different Embodied AI applications in the real world versus those primarily in research labs would be beneficial. The term "aligning cyberspace with the physical world" is broad, and a clearer delineation of what is currently achievable in practice versus what is still aspirational would add nuance.
- Criticality on LLM Hallucinations: The survey touches upon LLM "hallucinations" in task planning (e.g., in the RoboGPT context). Given the reliance on LLMs, a more detailed discussion of the common failure modes of LLM-based planning and the practical strategies to mitigate them (beyond multi-turn reasoning or visual feedback) would be valuable.
Transferability and Applications:
The methods and conclusions outlined in this survey are highly transferable across various domains:
- Industrial Automation: Enhanced embodied grasping and long-horizon task execution are directly applicable to smart manufacturing and logistics, leading to more flexible and autonomous factory floors and warehouses.
- Healthcare: Humanoid robots and mobile manipulators with improved perception and interaction can assist in patient care, surgery, and sanitation, especially with causal reasoning for delicate tasks.
- Autonomous Driving: The advancements in embodied perception (3D scene understanding, vSLAM), world models (predicting future states), and sim-to-real adaptation are foundational for safer and more reliable self-driving vehicles.
- Exploration and Disaster Response: Quadruped and tracked robots benefit immensely from improved active exploration and causal reasoning to navigate unknown, hazardous environments and perform complex rescue operations.
- Smart Homes and Personal Service Robotics: Embodied agents that can understand natural language, perform household tasks, and engage in EQA will be transformative for assistive and convenience applications in homes.

This survey provides a solid foundation for understanding the current state and future trajectory of Embodied AI, making it an indispensable resource for researchers and practitioners alike who aim to bridge the digital and physical realms with intelligent machines.