Paper status: completed

Large Model Empowered Embodied AI: A Survey on Decision-Making and Embodied Learning

Published: 08/14/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This survey discusses large model empowered embodied AI, focusing on autonomous decision-making and embodied learning. It explores hierarchical and end-to-end decision-making paradigms, highlighting how large models enhance planning, execution, and feedback as well as Vision-Language-Action models, how they strengthen imitation learning and reinforcement learning, and how world models support both decision-making and learning. It closes with open challenges and future research directions.

Abstract

Embodied AI aims to develop intelligent systems with physical forms capable of perceiving, decision-making, acting, and learning in real-world environments, offering a promising path toward Artificial General Intelligence (AGI). Despite decades of explorations, it remains challenging for embodied agents to achieve human-level intelligence for general-purpose tasks in open dynamic environments. Recent breakthroughs in large models have revolutionized embodied AI by enhancing perception, interaction, planning and learning. In this article, we provide a comprehensive survey on large model empowered embodied AI, focusing on autonomous decision-making and embodied learning. We investigate both hierarchical and end-to-end decision-making paradigms, detailing how large models enhance high-level planning, low-level execution, and feedback for hierarchical decision-making, and how large models enhance Vision-Language-Action (VLA) models for end-to-end decision-making. For embodied learning, we introduce mainstream learning methodologies, elaborating on how large models enhance imitation learning and reinforcement learning in depth. For the first time, we integrate world models into the survey of embodied AI, presenting their design methods and critical roles in enhancing decision-making and learning. Though solid advances have been achieved, challenges still exist, which are discussed at the end of this survey as potential directions for further research.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of this paper is Large Model Empowered Embodied AI: A Survey on Decision-Making and Embodied Learning. It focuses on how large models enhance embodied artificial intelligence, specifically in the areas of autonomous decision-making and embodied learning.

1.2. Authors

The authors are WENLONG LIANG, RUI ZHOU, YANG MA, BING ZHANG, SONGLIN LI, YIJIA LIAO, and PING KUANG. All authors are affiliated with the University of Electronic Science and Technology of China, China.

1.3. Journal/Conference

This paper is published as a preprint on arXiv (Original Source Link: https://arxiv.org/abs/2508.10399). As a preprint, it has not yet undergone formal peer review by a journal or conference. However, arXiv is a highly influential platform for rapid dissemination of research in physics, mathematics, computer science, and related fields, allowing researchers to share their work before formal publication. Papers on arXiv are often later published in top-tier conferences or journals.

1.4. Publication Year

The paper was first posted to arXiv on 2025-08-14 at 06:56:16 UTC, indicating a publication year of 2025.

1.5. Abstract

This survey paper provides a comprehensive overview of large model empowered embodied AI, specifically focusing on autonomous decision-making and embodied learning. The authors investigate two decision-making paradigms: hierarchical and end-to-end. For hierarchical decision-making, the paper details how large models enhance high-level planning, low-level execution, and feedback. For end-to-end decision-making, it elaborates on how large models improve Vision-Language-Action (VLA) models. In terms of embodied learning, the survey introduces mainstream methodologies, explaining how large models enhance imitation learning and reinforcement learning. Uniquely, it integrates world models into the embodied AI survey, discussing their design and critical roles in enhancing both decision-making and learning. Finally, the paper addresses existing challenges and proposes future research directions, aiming to provide a theoretical framework and practical guidance for researchers in the field.

The official source is a preprint on arXiv: https://arxiv.org/abs/2508.10399. The PDF link is: https://arxiv.org/pdf/2508.10399v1.pdf. This paper is currently a preprint and has not been formally published in a journal or conference at the time of this analysis.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is enabling embodied AI systems to achieve human-level intelligence for general-purpose tasks in open, unstructured, and dynamic environments. Embodied AI (systems with physical forms that perceive, decide, act, and learn in the real world) is seen as a promising path toward Artificial General Intelligence (AGI).

This problem is highly important because despite decades of exploration, embodied agents still struggle with generalization and transferability across diverse scenarios. Early systems relied on rigid preprogrammed rules, limiting adaptability. While deep learning improved capabilities, models were often task-specific. Recent breakthroughs in large models (such as Large Language Models, Vision-Language Models, etc.) have significantly enhanced perception, interaction, planning, and learning abilities in embodied AI, laying foundations for general-purpose embodied agents. However, the field is still nascent, facing challenges in generalization, scalability, and seamless environmental interaction.

The paper's entry point is to provide a comprehensive and systematic review of these recent advances in large model empowered embodied AI. It addresses existing gaps in the literature, where current studies are scattered, lack systematic categorization, and often miss the latest advancements, particularly Vision-Language-Action (VLA) models and end-to-end decision-making. The innovative idea is to systematically analyze how large models empower the core aspects of embodied AI: autonomous decision-making and embodied learning, including a novel integration of world models.

2.2. Main Contributions / Findings

The paper's primary contributions are summarized as follows:

  1. Focus on Large Model Empowerment from an Embodied AI Viewpoint: The survey systematically categorizes and reviews works based on how large models empower hierarchical decision-making (high-level planning, low-level execution, feedback enhancement) and end-to-end decision-making (via VLA models). For embodied learning, it details how large models enhance imitation learning (policy and strategy network construction) and reinforcement learning (reward function design and policy network construction).

  2. Comprehensive Review of Embodied Decision-Making and Learning: It provides a detailed review of both hierarchical and end-to-end decision-making paradigms, comparing them in depth. For embodied learning, it covers imitation learning, reinforcement learning, transfer learning, and meta-learning. Crucially, it integrates world models into the survey for the first time in this context, discussing their design and impact on decision-making and learning.

  3. Dual Analytical Approach for In-Depth Insights: The survey employs a dual analytical methodology, combining horizontal analysis (comparing diverse approaches like different large models, decision-making paradigms, and learning strategies) with vertical analysis (tracing the evolution, advances, and challenges of core models/methods). This approach aims to provide both a macro-level overview and deep insights into mainstream embodied AI methods.

    The key conclusions and findings are that large models have revolutionized embodied AI by significantly enhancing perception, interaction, planning, and learning capabilities. They enable more robust and versatile embodied AI systems through improved decision-making (both hierarchical and end-to-end) and more efficient embodied learning (particularly imitation learning and reinforcement learning). The integration of world models further boosts these capabilities by allowing agents to simulate environments and predict outcomes. Despite these advancements, significant challenges remain, including the scarcity of high-quality embodied data, continual learning for long-term adaptability, computational and deployment efficiency, and the sim-to-real gap. Addressing these challenges is crucial for the future path towards AGI.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a foundational grasp of Embodied AI and Large Models is essential.

3.1.1. Embodied AI

Embodied AI refers to intelligent systems that possess a physical form (like robots, intelligent vehicles) and are capable of perceiving their environment, making decisions, performing actions, and learning through interaction within real-world settings. The core belief is that true intelligence emerges from this physical interaction, making it a promising pathway toward Artificial General Intelligence (AGI).

Key aspects of Embodied AI highlighted in the paper include:

  • Physical Entities: The hardware components, such as humanoid robots, quadruped robots, or intelligent vehicles, that interact with the physical world. They execute actions and receive sensory feedback.
  • Intelligent Agents: The cognitive software core that enables autonomous decision-making and learning.
  • Human-like Learning and Problem-Solving Paradigm: Agents interpret human instructions, explore surroundings, perceive multimodal information (e.g., vision, language), and execute tasks. This mimics how humans learn from diverse resources, assess environments, plan, mentally simulate, and adapt.
  • Autonomous Decision-Making: How agents translate perceptions and task understanding into executable actions. The paper discusses two main approaches:
    • Hierarchical Paradigm: Separates functionalities into distinct modules for perception, planning, and execution.
    • End-to-End Paradigm: Integrates these functionalities into a unified framework, often directly mapping multimodal inputs to actions.
  • Embodied Learning: The process by which agents refine their behavioral strategies and cognitive models autonomously through long-term environmental interactions, aiming for continual improvement. Key methods include Imitation Learning, Reinforcement Learning, Transfer Learning, and Meta-Learning.
  • World Models: Internal simulations or representations of the environment that allow agents to anticipate future states, understand cause-and-effect, and plan without constant real-world interaction.

3.1.2. Large Models (LMs)

Large Models are neural network models characterized by their massive scale (billions or trillions of parameters), extensive training data, and typically a Transformer-based architecture. They have shown impressive perception, reasoning, and interaction capabilities. The paper categorizes them into several types:

  • Large Language Model (LLM): Models trained on vast text corpora to understand, generate, and reason with natural language. Examples include BERT, GPT series (GPT-1, GPT-2, GPT-3, ChatGPT, InstructGPT, GPT-3.5), Codex, PaLM, Vicuna, and Llama series. They serve as cognitive backbones, processing linguistic inputs and generating responses.
  • Large Vision Model (LVM): Models specialized in processing visual information. Examples include Vision Transformer (ViT), DINO, DINOv2, Masked Autoencoder (MAE), and Segment Anything Model (SAM). They are used for perception tasks like object recognition, pose estimation, and segmentation.
  • Large Vision-Language Model (LVLM): Models that integrate visual and linguistic processing, allowing them to understand visual inputs and respond to vision-related queries using language. Examples include CLIP, BLIP, BLIP-2, Flamingo, GPT-4V, and DeepSeek-V3. They enhance agents' ability to understand human instructions across text and vision.
  • Multimodal Large Model (MLM): More general models that can process diverse modalities beyond just vision and language, potentially including audio, tactile, etc. They can be multimodal-input text-output (e.g., Video-Chat, VideoLLaMA, Gemini, PaLM-E) or multimodal-input multimodal-output (e.g., DALL·E, DALL·E2, DALL·E3, Sora). These models are crucial for Embodied Large Models (ELM) or Embodied Multimodal Large Models (EMLM).
  • Vision-Language-Action (VLA) Model: A specialized type of MLM or LVLM whose core objective is to directly map multimodal inputs (visual observations, linguistic instructions) to action outputs. This creates a streamlined pipeline for end-to-end decision-making in robots, improving perception-action integration. Examples include RT-2, BYO-VLA, 3D-VLA, PointVLA, Octo, Diffusion-VLA, TinyVLA, and $\pi_0$.

3.1.3. General Capability Enhancements (GCE) of Large Models

To overcome limitations like reasoning ability, hallucination, computational cost, and task specificity, several techniques have been proposed:

  • In-Context Learning (ICL): Enables zero-shot generalization by carefully designing prompts, allowing LMs to tackle new tasks without additional training.
  • X of Thoughts (XoT): A family of reasoning frameworks that guide LMs through intermediate steps:
    • Chain of Thoughts (CoT): Incorporates step-by-step reasoning into prompts.
    • Tree of Thoughts (ToT): Explores multiple reasoning paths in a tree structure.
    • Graph of Thoughts (GoT): Uses a graph structure to represent intermediate states and dependencies, enabling non-linear reasoning.
  • Retrieval Augmented Generation (RAG): Retrieves relevant information from external knowledge bases (databases, web) to augment LM responses, addressing outdated or incomplete knowledge.
  • Reasoning and Acting (ReAct): Integrates reasoning with action execution, allowing LMs to produce explicit reasoning traces before acting, enhancing decision transparency.
  • Reinforcement Learning from Human Feedback (RLHF): Aligns LMs with human preferences and values by using human feedback as a reward signal during training, improving helpfulness, harmlessness, and honesty.
  • Model Context Protocol (MCP): A standardized interface for LMs to interact with external data sources, tools, and services, enhancing interoperability.
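
To make two of these enhancements concrete, the sketch below shows a Chain-of-Thoughts-style prompt and a minimal ReAct-style reason-act loop. It is a hedged illustration only: `fake_llm`, the prompt formats, and the `detect` tool are hypothetical placeholders, not APIs from any cited system.

```python
# Minimal sketch of two GCE techniques; `fake_llm` is a stand-in for a real model.

def fake_llm(prompt: str) -> str:
    """Placeholder LLM: always 'finishes' immediately. Replace with a real backend."""
    return "Finish: place red block on blue block"

# Chain of Thoughts (CoT): the prompt itself asks for intermediate reasoning steps.
cot_prompt = (
    "Task: stack the red block on the blue block.\n"
    "Think step by step, then output the final plan."
)
print(fake_llm(cot_prompt))

# ReAct: interleave reasoning with tool calls, feeding observations back in.
def react(task: str, tools: dict, llm=fake_llm, max_turns: int = 5) -> str:
    transcript = f"Task: {task}\n"
    for _ in range(max_turns):
        step = llm(transcript + "Next: 'Thought: ...', 'Action: tool | input', or 'Finish: ...'\n")
        transcript += step + "\n"
        if step.startswith("Finish"):
            return step
        if step.startswith("Action"):
            name, arg = (x.strip() for x in step.split(":", 1)[1].split("|", 1))
            transcript += f"Observation: {tools[name](arg)}\n"   # ground the next thought
    return transcript

print(react("find the mug", tools={"detect": lambda obj: f"{obj} seen on table"}))
```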

3.2. Previous Works

The paper primarily positions itself as a comprehensive survey that integrates various aspects of large model empowered embodied AI, distinguishing itself from prior reviews. It identifies several categories of related surveys:

  • Specific Surveys on Large Models Themselves: ([29, 104, 113, 151, 191, 225]) These focus on LLMs or VLMs but do not specifically address their synergy with embodied agents.
  • Surveys on Components of Embodied AI:
    • Planning: ([188]) Focuses only on planning in embodied AI.
    • Learning: ([7, 26, 204]) Concentrates on learning methods, sometimes with LLM integration, but often without a systematic analysis of the overall embodied AI paradigm or comprehensive integration of world models and VLA models.
    • Simulators: ([201]) Reviews simulators.
    • Applications: ([157, 201, 209]) Focuses on embodied AI applications.
  • Comprehensive Surveys (with limitations):
    • Some surveys ([48, 220]) are identified as being outdated due to the rapid development of the field.

    • Others, while comprehensive, might miss recent advances like VLA models and end-to-end decision-making ([190, 95]).

    • The review [119] provides a detailed introduction to VLA models but lacks comparison with the hierarchical paradigm and detailed exploration of learning methods.

    • [117] is listed as a comprehensive survey that covers Large models, Hierarchical and End-to-end decision-making, IL, RL, and Other learning methods, but still lacks World Models.

      The paper also refers to specific seminal works that underpin the evolution of large models and embodied AI:

  • BERT [42] (2018): Google's bidirectional Transformer for natural language understanding.
  • GPT [149] (2018): OpenAI's generative Transformer, a breakthrough in text generation.
  • GPT-3 [54] (2020): Milestone for its vast capacity and zero-/few-shot learning.
  • Vision Transformer (ViT) [45] (2020): Adapted Transformer for computer vision.
  • CLIP [148] (2021): OpenAI's LVLM aligning image and text via contrastive learning.
  • RT-2 [234] (2023): Pioneering VLA model, using pretrained VLM to discretize action space.
  • Sora [22] (2024): OpenAI's video generative model, an example of diffusion-based world models.

3.3. Technological Evolution

The field of Embodied AI has evolved through several stages:

  1. Early Systems (Symbolic Reasoning & Behaviorism): In the early decades, embodied AI systems (e.g., [21, 200]) were largely based on rigid preprogrammed rules and symbolic reasoning. These systems excelled in highly controlled environments (like manufacturing robots) but lacked adaptability and generalization to open, dynamic settings.
  2. Deep Learning Era: The advent of machine learning and deep learning ([99, 133]) marked a turning point. Vision-guided planning and reinforcement learning-based control ([173]) significantly reduced the need for precise environment modeling. However, these models were often task-specific and still faced challenges in generalization across diverse scenarios.
  3. Large Model Revolution: Recent breakthroughs in large models ([149, 150, 182, 183]) have propelled embodied AI forward. These models, with their robust perception, interaction, planning, and learning capabilities, are now laying the foundation for general-purpose embodied agents ([137]). This era is characterized by the integration of LLMs, VLMs, MLMs, and the emergence of VLA models, fundamentally changing how embodied agents perceive, reason, and act. The paper focuses on this current, large model empowered stage.

3.4. Differentiation Analysis

Compared to the main methods and existing surveys in related work, this paper's core differences and innovations lie in its comprehensive scope and analytical approach:

  • Integrated Focus on Large Model Empowerment: Unlike surveys focusing solely on large models or specific embodied AI components (e.g., planning, learning, simulators), this paper provides a holistic view of how large models empower the entire embodied AI system, specifically decision-making and learning.
  • Systematic Categorization of Decision-Making: It offers a detailed comparison and analysis of both hierarchical and end-to-end decision-making paradigms, highlighting the role of large models in each. This includes a dedicated discussion of VLA models, which are a very recent and prominent development.
  • In-depth Analysis of Embodied Learning: The survey covers imitation learning, reinforcement learning, transfer learning, and meta-learning, with a specific emphasis on how large models enhance these methodologies.
  • Novel Integration of World Models: For the first time, it integrates world models into a survey of embodied AI, presenting their design methods and critical roles in enhancing decision-making and learning. This is a significant addition as world models are increasingly recognized for their potential to enable more efficient and robust embodied intelligence.
  • Dual Analytical Methodology: The paper explicitly states its use of horizontal and vertical analysis, which allows for both broad comparisons across diverse approaches and deep dives into the evolution and challenges of specific methods. This aims to provide a more structured and insightful overview than previous, potentially scattered, reviews.
  • Timeliness: By being published in 2025, it aims to capture the latest progress, including VLA models and end-to-end decision-making, which earlier surveys might have missed.

4. Methodology

The survey systematically deconstructs large model empowered embodied AI by focusing on autonomous decision-making and embodied learning, integrating world models as a crucial component. The overall organization is depicted in Figure 1 of the original paper.

The following figure (Figure 1 from the original paper) shows the overall organization of this survey:

The figure is a schematic of the decision-making and learning framework for large model empowered embodied AI, detailing the basic concepts of embodied AI, decision-making autonomy, embodied learning methods, and world model design, and covering recent developments in imitation learning and reinforcement learning.

4.1. Hierarchical Autonomous Decision-Making

Hierarchical autonomous decision-making in embodied AI breaks down complex tasks into a structured series of modules: perception and interaction, high-level planning, low-level execution, and feedback and enhancement. Traditionally, these layers relied on vision models, predefined logical rules, and classic control algorithms, respectively. While effective in structured environments, this approach struggled with adaptability and holistic optimization in dynamic settings. The advent of large models has revolutionized this paradigm by injecting robust learning, reasoning, and generalization capabilities into each layer.

The following figure (Figure 5 from the original paper) illustrates the hierarchical decision-making paradigm:

The figure depicts the large model empowered embodied decision-making process, comprising three main modules: high-level planning, low-level execution, and feedback enhancement. High-level planning generates instructions using natural and programming languages; low-level execution combines traditional control algorithms with large vision models; and the feedback-enhancement module refines the decision process through human and environmental feedback.

4.1.1. High-Level Planning

High-level planning is responsible for generating reasonable plans based on task instructions and perceived environmental information. Traditionally, this involved rule-based methods ([59, 75, 126]) using languages like Planning Domain Definition Language (PDDL). These planners used heuristic search to verify action preconditions and select optimal action sequences. While precise in structured environments, their adaptability to unstructured or dynamic scenarios was limited. Large Language Models (LLMs), with their zero-shot and few-shot generalization, have driven significant breakthroughs by enhancing various forms of planning.

The following figure (Figure 6 from the original paper) illustrates high-level planning empowered by large models:

Fig. 6. High-level planning empowered by large models. The figure illustrates how LLMs are applied in three planning approaches: structured language planning, natural language planning, and programming language planning, including how the LLM generates PDDL files and produces plans in different contexts.

4.1.1.1. Structured Language Planning with LLM

LLMs enhance structured language planning (e.g., PDDL) through two main strategies:

  1. LLM as the Planner: LLMs directly generate plans. However, early attempts often resulted in infeasible plans due to strict PDDL syntax and semantics ([185]).
    • LLV [9]: Introduced an external validator (e.g., a PDDL parser or environment simulator) to check LLM-generated plans and iteratively refine them using error feedback.
    • FSP-LLM [175]: Optimized prompt engineering to align plans with logical constraints, improving feasibility.
  2. LLM for Automated PDDL Generation: LLMs automate the creation of PDDL domain files and problem descriptions, reducing manual effort.
    • LLM+P [116]: LLM generated PDDL files, which were then solved by traditional symbolic planners, combining linguistic understanding with symbolic reasoning.
    • PDDL-WM [64]: Used LLM to iteratively construct and refine PDDL domain models, validated by parsers and user feedback.
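
To ground these two strategies, the sketch below mirrors an LLM+P-style pipeline in spirit: an LLM drafts a PDDL problem file, an external validator checks it, errors are fed back for refinement, and a symbolic planner solves the validated file. The helpers `generate_pddl`, `validate_pddl`, and `solve_with_planner` are hypothetical stand-ins, not APIs from the cited works.

```python
def generate_pddl(llm, task_description: str, feedback: str = "") -> str:
    """Hypothetical: ask the LLM to emit a PDDL problem file for the task."""
    prompt = (
        f"Write a PDDL problem file for: {task_description}\n"
        f"Previous validator feedback (fix if non-empty): {feedback}"
    )
    return llm(prompt)

def validate_pddl(pddl_text: str) -> str:
    """Hypothetical external validator (e.g., a PDDL parser or simulator).
    Returns an empty string when the file passes the check."""
    return "" if "(:goal" in pddl_text else "missing (:goal ...) section"

def solve_with_planner(pddl_text: str) -> list[str]:
    """Hypothetical symbolic planner call returning an action sequence."""
    return ["(pick cup)", "(place cup table)"]

def plan(llm, task_description: str, max_rounds: int = 3) -> list[str]:
    feedback = ""
    for _ in range(max_rounds):
        pddl = generate_pddl(llm, task_description, feedback)
        feedback = validate_pddl(pddl)
        if not feedback:                       # valid file: hand off to the planner
            return solve_with_planner(pddl)
    raise RuntimeError("could not produce a valid PDDL file")

# Usage with a canned LLM stub:
print(plan(lambda p: "(define (problem demo) (:goal (on cup table)))", "put the cup on the table"))
```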

4.1.1.2. Natural Language Planning with LLM

Natural language offers greater flexibility, allowing LLMs to decompose complex tasks into sub-plans ([110, 167]). However, natural language planning often generates infeasible plans based on common sense rather than actual environmental constraints (e.g., proposing to use a vacuum cleaner that isn't present).

  • Zero-shot [85]: Explored LLM's ability to decompose high-level tasks into executable language steps, showing preliminary plans based on common sense but lacking physical environment constraints.
  • SayCan [4]: Integrates LLM with reinforcement learning. It combines LLM-generated plans with a predefined skill repository and value functions to evaluate action feasibility. By scoring actions with expected cumulative rewards, it filters out impractical steps.
  • Text2Motion [114]: Enhances planning for spatial tasks by incorporating geometric feasibility. LLM proposes action sequences, which are then evaluated by a checker for physical viability.
  • Grounded Decoding [87]: Addresses the limitation of fixed skill sets by dynamically integrating LLM outputs with a real-time grounded model. This model assesses action feasibility based on current environmental states and agent capabilities, guiding LLM to produce contextually viable plans.
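
The sketch below captures the gist of SayCan-style feasibility filtering: each candidate skill is scored by combining a language-relevance score with a value-function estimate of whether the skill can succeed in the current scene, and the highest-scoring skill is selected. The scoring functions and scene representation here are simplified, hypothetical stand-ins.

```python
# Candidate skills from a predefined repository.
skills = ["pick up the sponge", "go to the sink", "pick up the vacuum cleaner"]

def llm_relevance(instruction: str, skill: str) -> float:
    """Hypothetical stand-in for the LLM's score that `skill` helps `instruction`."""
    scores = {"pick up the sponge": 0.7, "go to the sink": 0.2, "pick up the vacuum cleaner": 0.8}
    return scores[skill]

def value_function(skill: str, observation: dict) -> float:
    """Hypothetical affordance/value estimate: can this skill succeed right now?"""
    return 1.0 if skill.split()[-1] in observation["visible_objects"] or skill.startswith("go") else 0.0

def select_skill(instruction: str, observation: dict) -> str:
    # Combined score: language relevance x execution feasibility.
    return max(skills, key=lambda s: llm_relevance(instruction, s) * value_function(s, observation))

obs = {"visible_objects": ["sponge", "table"]}    # no vacuum cleaner present
print(select_skill("clean the spilled coffee", obs))   # -> "pick up the sponge"
```

Even though the vacuum-cleaner skill scores highest on language relevance, its zero feasibility removes it, which is exactly the failure mode natural language planning alone cannot catch.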

4.1.1.3. Programming Language Planning with LLM

Programming language planning translates natural language instructions into executable programs (e.g., Python code), leveraging code precision to define spatial relationships, function calls, and control APIs for dynamic planning.

  • CaP [112]: Converts task planning into code generation, producing Python-style programs with recursively defined functions to create a dynamic function library, enhancing adaptability to new tasks. Its reliance on perception APIs and unconstrained code generation limits complex instruction handling.
  • Instruct2Act [84]: A more integrated solution using multimodal foundation models to unify perception, planning, and control. It uses vision-language models for accurate object identification and spatial understanding, feeding this data to LLM to generate code-based action sequences from a predefined robot skill repository. This increases planning accuracy and adaptability.
  • ProgPrompt [176]: Employs structured prompts with environmental operations, object descriptions, and example programs to guide LLM in generating tailored, code-based plans, minimizing invalid code and enhancing cross-environment adaptability.
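
In the spirit of CaP/Instruct2Act-style code generation, the sketch below shows what an LLM-generated, code-based plan might look like against a small robot skill API. The API functions (`detect`, `pick`, `place`) are assumed for illustration and are not the interfaces of the cited systems.

```python
# Assumed perception/control API exposed to the code-writing LLM.
def detect(object_name: str) -> tuple[float, float]:
    """Return the (x, y) position of the named object (stubbed here)."""
    return {"red block": (0.40, 0.10), "blue block": (0.42, 0.25)}[object_name]

def pick(position): print(f"picking at {position}")
def place(position): print(f"placing at {position}")

# Example of code an LLM might generate for the instruction
# "stack the red block on the blue block":
def stack(top: str, bottom: str):
    top_pos = detect(top)
    bottom_pos = detect(bottom)
    pick(top_pos)
    place(bottom_pos)

stack("red block", "blue block")
```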

4.1.2. Low-Level Execution

After high-level planning, low-level actions are executed using predefined skill lists. These lists represent basic capabilities (e.g., object recognition, obstacle detection, object grasping, moving) that bridge task planning and physical execution. The implementation of these skills has evolved from traditional control algorithms to learning-driven and modular control.

The following figure (Figure 7 from the original paper) illustrates low-level execution:

Fig. 7. Low-level execution. The figure shows the relationships among traditional control algorithms, learning-based control, and modular control, together with the required skills and sub-tasks: basic skills realized by algorithms such as PID and MPC on the left, the interaction flow of imitation learning and reinforcement learning in the center, and examples of applying large language models (LLMs) to detection and classification tasks on the right.

4.1.2.1. Traditional Control Algorithms

These foundational skills use classic model-based techniques with clear mathematical derivations.

  • Proportional-Integral-Derivative (PID) control [81]: Adjusts parameters to minimize errors in robotic arm joint control.
  • State feedback control [178]: Often paired with Linear Quadratic Regulators (LQR) [125], optimizes performance using system state data.
  • Model Predictive Control (MPC) [2]: Forecasts states and generates control sequences via rolling optimization, suitable for tasks like drone path tracking.

    These techniques offer mathematical interpretability, low computational complexity, and real-time performance, but lack adaptability in dynamic, high-dimensional, or uncertain environments, necessitating integration with data-driven techniques.
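
As a concrete instance of the PID control mentioned above, here is a minimal discrete-time PID controller sketch; the gains and the one-line toy plant are arbitrary illustration values, not tuned for any real robot.

```python
class PID:
    """Minimal discrete-time PID controller: u = Kp*e + Ki*sum(e)*dt + Kd*de/dt."""
    def __init__(self, kp: float, ki: float, kd: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint: float, measurement: float, dt: float) -> float:
        error = setpoint - measurement
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Toy example: drive a first-order joint toward 1.0 rad (gains are illustrative).
pid, position, dt = PID(kp=2.0, ki=0.5, kd=0.1), 0.0, 0.01
for _ in range(500):
    command = pid.update(setpoint=1.0, measurement=position, dt=dt)
    position += command * dt          # crude plant model: velocity proportional to command
print(round(position, 3))             # approaches 1.0
```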

4.1.2.2. Learning-Driven Control with LLM

Robot learning develops control strategies and low-level skills from extensive data (human demonstrations, simulations, environmental interactions).

  • Imitation Learning: Trains strategies from expert demonstrations.
    • Embodied-GPT [131]: Leverages a 7B language model for high-level planning and converts plans to low-level strategies via imitation learning.
  • Reinforcement Learning: Optimizes strategies through iterative trials and environmental rewards.
    • Hi-Core [140]: Employs a two-layer framework where the LLM sets high-level strategies and sub-goals, while reinforcement learning generates specific actions at the low level.

    These methods offer strong adaptability and generalization but require large datasets and computational resources, and they pose challenges in guaranteeing convergence and stability.

4.1.2.3. Modular Control with LLM and Pretrained Models

Modular control integrates LLMs with pretrained strategy models (e.g., CLIP for visual recognition, SAM for segmentation). LLMs are equipped with descriptions of these tools and can invoke them dynamically.

  • DEPS [192]: Combines multiple modules for detection and actions based on task requirements and natural language descriptions of pretrained models.
  • PaLM-E [46]: Merges LLM with visual modules for segmentation and recognition.
  • CLIPort [172]: Leverages CLIP for open-vocabulary detection.
  • CaP [112]: Leveraged the LLM to generate code for creating libraries of callable functions for navigation and operations.

    This approach ensures scalability and reusability, but it introduces computational and communication delays and relies heavily on the quality of the pretrained models.

4.1.3. Feedback and Enhancement

To ensure the quality of task planning in hierarchical decision-making, a closed-loop feedback mechanism is introduced. This feedback can originate from the large model itself, humans, or external environments.

The following figure (Figure 8 from the original paper) illustrates feedback and enhancement:

Fig. 8. Feedback and enhancement. The figure illustrates the feedback and enhancement mechanisms of large models, covering self-reflection, human feedback, and environment feedback, and details the roles of large models in planning, execution, and feedback, as well as the mechanism of policy optimization.

4.1.3.1. Self-Reflection of Large Models

Large models can act as task planners, evaluators, and optimizers, iteratively refining decision-making without external intervention. Agents autonomously detect and analyze failed executions, learning from past tasks.

  • Re-Prompting [153]: Triggers plan regeneration based on detected execution failures or precondition errors. It integrates error context as feedback to dynamically adjust prompts and correct LLM-generated plans.
    • DEPS [153]: Adopts a "describe, explain, plan, select" framework, where LLM describes execution, explains failure causes, and re-prompts to correct plans.
  • Introspection Mechanism: Enables LLM to evaluate and refine its output independently.
    • Self-Refine [121]: Uses a single LLM for planning and optimization, iteratively improving plan rationality through multiple self-feedback cycles.
    • Reflexion [170]: Extends Self-Refine by incorporating long-term memory to store evaluation results, combining multiple feedback mechanisms.
    • ISR-LLM [231]: Applies iterative self-optimization in PDDL-based planning, generating initial plans, performing rationality checks, and refining outcomes via self-feedback.
    • Voyager [189]: Tailored for programming language planning, it builds a dynamic code skill library by extracting feedback from execution failures.
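
The loop below sketches the Self-Refine/Reflexion-style introspection pattern described above: the same large model plans, critiques its own plan, and revises it until the critique passes. The `llm` callable and the response format are stubbed placeholders, not the prompts of the cited systems.

```python
def self_refine(llm, task: str, max_iters: int = 3) -> str:
    plan = llm(f"Propose a step-by-step plan for: {task}")
    for _ in range(max_iters):
        critique = llm(f"Critique this plan for task '{task}'. Reply 'OK' if sound:\n{plan}")
        if critique.strip() == "OK":          # introspection says the plan is sound
            break
        plan = llm(f"Revise the plan for '{task}' using this feedback:\n{critique}\n\nPlan:\n{plan}")
    return plan

# Stub LLM that 'fixes' the plan after one round of self-feedback.
responses = iter([
    "1. pick mug  2. pour water",
    "Step order misses locating the mug first.",
    "1. locate mug  2. pick mug  3. pour water",
    "OK",
])
print(self_refine(lambda prompt: next(responses), "fill the mug with water"))
```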

4.1.3.2. Human Feedback

Human feedback enhances planning accuracy and efficiency through an interactive closed-loop mechanism, allowing agents to dynamically adjust behaviors based on human input.

  • KNOwNO [161]: Introduces an uncertainty measurement framework for LLM to identify knowledge gaps and seek human assistance in high-risk scenarios.
  • EmbodiedGPT [132]: Uses a planning-execution-feedback loop, where agents request human input when low-level controls fail. This feedback, combined with reinforcement learning and self-supervised optimization, refines planning strategies.
  • YAY Robot [168]: Allows users to pause robots with commands and provide real-time language-based corrections, recorded for strategy fine-tuning.
  • IRAP [80]: Facilitates interactive question-answering with humans to acquire task-specific knowledge for precise robot instructions.

4.1.3.3. Environment Feedback

Environment feedback enhances LLM-based planning via dynamic interactions with the environment.

  • Inner Monologue [88]: Transforms multimodal inputs into linguistic descriptions for an "inner monologue" reasoning, allowing LLM to adjust plans based on environmental feedback.

  • TaPA [203]: Integrates open-vocabulary object detection and tailors plans for navigation and operations.

  • DoReMi [65]: Detects discrepancies between planned and actual outcomes and uses multimodal feedback to adjust tasks dynamically.

  • RoCo [123]: In multi-agent settings, it leverages environmental feedback and inter-agent communications for real-time robotic arm path planning corrections.

    Vision-Language Models (VLMs) simplify feedback by integrating visual inputs and language reasoning, avoiding feedback conversions.

  • ViLaIn [171]: Integrates LLM with VLM to generate machine-readable PDDL from language instructions and scene observations.

  • ViLA [83] and Octopus [211]: Achieve robot vision language planning by leveraging GPT-4V MLM to generate plans, integrating perception data for robust zero-shot reasoning.

  • VoxPoser [86]: Exploits MLM to extract spatial geometric information, generating 3D coordinates and constraint maps from robot observations to populate code parameters, enhancing spatial accuracy.

4.2. End-to-End Autonomous Decision-Making

The hierarchical paradigm suffers from error accumulation across separate modules and struggles with generalization to diverse tasks due to the difficulty in directly applying high-level semantic knowledge from large models to low-level robotic actions. To address these issues, end-to-end autonomous decision-making has gained attention, aiming to directly map multimodal inputs (visual observations, linguistic instructions) to action outputs. This is typically implemented by Vision-Language-Action (VLA) models.

The following figure (Figure 9 from the original paper) illustrates end-to-end decision-making by VLA:

Fig. 9. End-to-end decision-making by VLA. The figure shows the end-to-end autonomous decision-making pipeline realized by Vision-Language-Action (VLA) models, which unifies planning, execution, and perception, eliminates inter-module communication delays, and enables real-time adaptation and fast response in changing environments.

4.2.1. Vision-Language-Action Models

VLA models integrate perception, language understanding, planning, action execution, and feedback optimization into a unified framework. They leverage the rich prior knowledge of large models to achieve precise and adaptable task execution in dynamic, open environments. A typical VLA model comprises three key components: tokenization and representation, multimodal information fusion, and action detokenization.

The following figure (Figure 10 from the original paper) illustrates the components of Vision-Language-Action Models:

Fig. 10. Vision-Language-Action Models. The figure shows the structure and information flow of VLA models: vision, language, and state encoders handle their respective input types; multimodal information fusion produces a unified representation that is decoded into action instructions and fed to the action head to execute concrete robot actions. The pipeline highlights the feedback-and-update loop from perception to execution.

4.2.1.1. Tokenization and Representation

VLA models use four types of tokens to encode multimodal inputs:

  • Vision Tokens: Encode environmental scenes (e.g., images) into embeddings.
  • Language Tokens: Encode linguistic instructions into embeddings, forming task context.
  • State Tokens: Capture the agent's physical configuration (e.g., joint positions, force-torque, gripper status, end-effector pose, object locations).
  • Action Tokens: Generated autoregressively, representing low-level control signals (e.g., joint angles, torque, wheel velocities) or high-level movement primitives (e.g., "move to grasp pose", "rotate wrist"). They allow VLA models to act as language-driven policy generators.

4.2.1.2. Multimodal Information Fusion

Visual tokens, language tokens, and state tokens are fused into a unified embedding for decision-making. This is typically achieved through a cross-modal attention mechanism within a Transformer architecture. This mechanism dynamically weighs the contributions of each modality, enabling the VLA model to jointly reason over object semantics, spatial layouts, and physical constraints based on the task context.
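
To illustrate the fusion step, the sketch below uses PyTorch multi-head attention to let language and state tokens attend over vision tokens and produce a fused embedding. The dimensions and the single attention block are simplifying assumptions for illustration, not the architecture of any specific VLA model.

```python
import torch
import torch.nn as nn

d_model, n_heads = 256, 8
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

# Toy token sequences for one sample: vision patches, language tokens, robot state.
vision_tokens = torch.randn(1, 196, d_model)    # e.g., 14x14 image patches
language_tokens = torch.randn(1, 20, d_model)   # tokenized instruction
state_tokens = torch.randn(1, 4, d_model)       # joint positions, gripper status, ...

# Cross-modal attention: language/state queries attend over the visual context,
# weighing each modality's contribution to the fused task representation.
queries = torch.cat([language_tokens, state_tokens], dim=1)
fused, attn_weights = attn(query=queries, key=vision_tokens, value=vision_tokens)

print(fused.shape)          # (1, 24, 256): unified embedding for action decoding
print(attn_weights.shape)   # (1, 24, 196): per-query attention over vision tokens
```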

4.2.1.3. Action Detokenization

The fused embedding is fed into an autoregressive decoder (typically a Transformer) to generate a series of action tokens.

  • Discrete Action Generation: The model selects from a predefined set of actions or discretized control signals.
  • Continuous Action Generation: The model outputs fine-grained control signals, often sampled from a continuous distribution using a final Multi-Layer Perceptron (MLP) layer, enabling precise manipulation or navigation.

    These action tokens are then detokenized (mapped to executable control commands) and passed to the execution loop, which provides updated state information, allowing the VLA model to adapt dynamically to perturbations.

Example: Robotics Transformer 2 (RT-2) [234]

RT-2 utilizes Vision Transformer (ViT) for visual processing and PaLM to integrate vision, language, and robot state information. It discretizes the action space into eight dimensions (e.g., 6-DoF end-effector displacement, gripper status, termination commands). Each dimension (except termination) is divided into 256 discrete intervals and embedded into the VLM vocabulary as action tokens. RT-2 employs a two-stage training strategy:

  1. Pretraining: With Internet-scale vision-language data to enhance semantic generalization.
  2. Fine-tuning: To map inputs (robot camera images, text task descriptions) to outputs (action token sequences, e.g., "1 128 91 241 5 101 127 255"). By modeling actions as "language," RT-2 leverages large models' capabilities to imbue low-level actions with rich semantic knowledge.
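
The snippet below illustrates the general idea of discretizing continuous action dimensions into 256 bins and mapping predicted tokens back to continuous commands, as described for RT-2 above. The bin ranges, the 7-D action layout, and the helper functions are assumptions for illustration, not RT-2's actual tokenizer.

```python
import numpy as np

N_BINS = 256

def actions_to_tokens(action: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Discretize each continuous action dimension into one of 256 integer bins."""
    normalized = (action - low) / (high - low)            # scale to [0, 1]
    return np.clip((normalized * (N_BINS - 1)).round(), 0, N_BINS - 1).astype(int)

def tokens_to_actions(tokens: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Detokenize: map integer bins back to continuous control values."""
    return low + (tokens / (N_BINS - 1)) * (high - low)

# Toy 7-D action: 6-DoF end-effector displacement + gripper opening (ranges assumed).
low = np.array([-0.1] * 6 + [0.0])
high = np.array([0.1] * 6 + [1.0])
action = np.array([0.02, -0.05, 0.0, 0.01, 0.0, -0.03, 0.8])

tokens = actions_to_tokens(action, low, high)
print(tokens)                                 # integers in [0, 255], one per dimension
print(tokens_to_actions(tokens, low, high))   # approximately recovers the action
```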

4.2.2. Enhancements on VLA

Despite their power, VLA end-to-end decision-making architectures have limitations: sensitivity to visual/language perturbations, limited 3D perception, reliance on simplistic policy networks for action generation, and high computational cost. Researchers have proposed enhancements categorized into perception capability enhancement, trajectory action optimization, and training cost reduction.

The following figure (Figure 11 from the original paper) illustrates enhancements on Vision-Language-Action Models:

Fig. 11. Enhancements on Vision-Language-Action Models. The figure groups VLA enhancements into three parts: perception capability enhancement (e.g., SigLIP, Ego3D position encoding), trajectory action optimization (e.g., Octo and Diffusion-VLA), and training cost reduction (e.g., pretrained backbones and action experts).

4.2.2.1. Perception Capability Enhancement

These enhancements address sensitivity to noise and limited 3D perception.

  • BYO-VLA [74]: Implements a runtime observation intervention mechanism using automated image preprocessing to filter visual noise (occlusions, cluttered backgrounds).
  • TraceVLA [229]: Introduces visual trajectory prompts to the cross-modal attention mechanism during multimodal information fusion. By incorporating trajectory data with vision, language, and state tokens, it enhances spatio-temporal awareness for precise action trajectory predictions.
  • 3D-VLA [226]: Combines a 3D large model with a diffusion-based world model to process point clouds and language instructions. It generates semantic scene representations and predicts future point cloud sequences, improving 3D object relationship understanding.
  • SpatialVLA [147]: Emphasizes spatial understanding in robot sorting tasks. It proposes Ego3D position encoding to inject 3D information into input observations and uses adaptive action schemes.

4.2.2.2. Trajectory Action Optimization

These methods provide smoother and more controllable actions, overcoming the limitations of discrete action spaces.

  • Octo [180]: Combines Transformer and diffusion models. It processes multimodal inputs via Transformer, extracts visual-language features, and uses conditional diffusion decoders to iteratively optimize action sequences, generating smooth and precise trajectories. Achieves cross-task generalization with minimal task-specific data.
  • Diffusion-VLA [196]: Unifies a language model with a diffusion policy decoder. The autoregressive language model parses instructions and generates preliminary task representations, which the diffusion policy decoder optimizes through gradual denoising. Employs end-to-end training to jointly optimize language understanding and action generation, ensuring smooth and robust action trajectories. Higher computational cost than Octo but better for complex tasks needing deep semantic-action fusion.
  • $\pi_0$ [18]: Exploits flow matching to represent complex continuous action distributions. Compared to multi-step sampling in diffusion models, flow matching optimizes action generation through continuous flow field modeling, reducing computational overhead and improving real-time performance. This makes it suitable for resource-constrained applications requiring high-precision continuous control.

4.2.2.3. Training Cost Reduction

These methods reduce computational cost for VLA models, improving inference speed, data efficiency, and real-time performance for resource-constrained platforms.

  • TinyVLA [198]: Achieves significant improvements by designing a lightweight multimodal model and a diffusion strategy decoder.
  • OpenVLA-OFT [92]: Uses parallel decoding instead of traditional autoregressive generation to generate complete action sequences in a single forward pass, significantly reducing inference time.
  • $\pi_0$ Fast [143]: Introduces an efficient action tokenization scheme based on the Discrete Cosine Transform (DCT), enabling autoregressive models to handle high-frequency tasks and speeding up training.
  • Edge-VLA [25]: A streamlined VLA tailored for edge devices, achieving high inference speeds (30-50Hz) with comparable performance to OpenVLA while optimized for low-power deployment.

4.2.3. Mainstream VLA Models

The following are the results from Table 2 of the original paper:

| Model | Architecture | Contributions |
| --- | --- | --- |
| RT-2 [234] (2023) | Vision encoder: ViT-22B/ViT-4B; Language encoder: PaLI-X/PaLM-E; Action decoder: symbol tuning | Pioneering large-scale VLA, jointly fine-tuned on web-based VQA and robotic datasets, unlocking advanced emergent functionalities. |
| Seer [63] (2023) | Vision encoder: visual backbone; Language encoder: Transformer-based; Action decoder: autoregressive action prediction head | Efficiently predicts future video frames from language instructions by extending a pretrained text-to-image diffusion model. |
| Octo [180] (2024) | Vision encoder: CNN; Language encoder: T5-base; Action decoder: diffusion Transformer | First generalist policy trained on a massive multi-robot dataset (800k+ trajectories); a powerful open-source foundation model. |
| OpenVLA [94] (2024) | Vision encoder: DINOv2 + SigLIP; Language encoder: Prismatic-7B; Action decoder: symbol tuning | An open-source alternative to RT-2 with superior parameter efficiency and strong generalization via efficient LoRA fine-tuning. |
| Mobility-VLA [37] (2024) | Vision encoder: long-context ViT + goal image encoder; Language encoder: T5-based instruction encoder; Action decoder: hybrid diffusion + autoregressive ensemble | Leverages demonstration tour videos as an environmental prior, using a long-context VLM and topological graphs for navigation based on complex multimodal instructions. |
| TinyVLA [198] (2025) | Vision encoder: FastViT with low-latency encoding; Language encoder: compact language encoder (128-d); Action decoder: diffusion policy decoder (50M parameters) | Outpaces OpenVLA in speed and precision; eliminates pretraining needs; achieves 5x faster inference for real-time applications. |
| Diffusion-VLA [196] (2024) | Vision encoder: Transformer-based visual encoder for contextual perception; Language encoder: autoregressive reasoning module with next-token prediction; Action decoder: diffusion policy head for robust action sequence generation | Leverages diffusion-based action modeling for precise control; superior contextual awareness and reliable sequence planning. |
| PointVLA [105] (2025) | Vision encoder: CLIP + 3D point cloud; Language encoder: Llama-2; Action decoder: Transformer with spatial token fusion | Excels at long-horizon and spatial reasoning tasks; avoids retraining by preserving pretrained 2D knowledge. |
| VLA-Cache [208] (2025) | Vision encoder: SigLIP with token memory buffer; Language encoder: Prismatic-7B; Action decoder: Transformer with dynamic token reuse | Faster inference with near-zero loss; dynamically reuses static features for real-time robotics. |
| $\pi_0$ [18] (2024) | Vision encoder: PaliGemma VLM backbone; Language encoder: PaliGemma (multimodal) | Employs flow matching to produce smooth, high-frequency (50Hz) action trajectories for real-time control. |
| $\pi_0$ Fast [143] (2025) | Vision encoder: PaliGemma VLM backbone; Language encoder: PaliGemma (multimodal); Action decoder: autoregressive Transformer with FAST | Introduces an efficient action tokenization scheme based on the Discrete Cosine Transform (DCT), enabling autoregressive models to handle high-frequency tasks and significantly speeding up training. |
| Edge-VLA [25] (2025) | Vision encoder: SigLIP + DINOv2; Language encoder: Qwen2 (0.5B parameters); Action decoder: joint control prediction (non-autoregressive) | Streamlined VLA tailored for edge devices, delivering 30-50Hz inference speed with OpenVLA-comparable performance, optimized for low-power, real-time deployment. |
| OpenVLA-OFT [92] (2025) | Vision encoder: SigLIP + DINOv2 (multi-view); Language encoder: Llama-2 7B; Action decoder: parallel decoding with action chunking and L1 regression | An optimized fine-tuning recipe for VLAs that integrates parallel decoding and a continuous action representation to improve inference speed and task success. |
| SpatialVLA [147] (2025) | Vision encoder: SigLIP from PaliGemma2 4B; Language encoder: PaliGemma2; Action decoder: adaptive action grids and autoregressive Transformer | Enhances spatial intelligence by injecting 3D information via "Ego3D Position Encoding" and representing actions with "Adaptive Action Grids". |
| MoLe-VLA [219] (2025) | Vision encoder: multi-stage ViT with STAR router; Language encoder: CogKD-enhanced Transformer; Action decoder: sparse Transformer with dynamic routing | A brain-inspired architecture that uses dynamic layer-skipping (Mixture-of-Layers) and knowledge distillation to improve efficiency. |
| DexGraspVLA [230] (2025) | Vision encoder: object-centric spatial ViT; Language encoder: Transformer with grasp sequence reasoning; Action decoder: diffusion controller for grasp pose generation | A hierarchical framework for general dexterous grasping, using a VLM for high-level planning and a diffusion policy for low-level control. |
| Dex-VLA [197] (2025) | | A large plug-in diffusion-based action expert and an embodiment curriculum learning strategy for efficient cross-robot training and adaptation. |

Table 2. Mainstream VLA models.

4.2.4. Hierarchical versus End-to-End Decision-Making

Hierarchical and end-to-end paradigms represent two distinct philosophies for autonomous decision-making in embodied intelligence.

The following are the results from Table 3 of the original paper:

| Aspect | Hierarchical | End-to-End |
| --- | --- | --- |
| Architecture | Perception: dedicated modules (e.g., SLAM, CLIP); High-level planning: structured, natural, or programming language; Low-level execution: predefined skill lists; Feedback: LLM self-reflection, human, environment | Perception: integrated in tokenization; Planning: implicit via VLA pretraining; Action generation: autoregressive generation with diffusion-based decoders; Feedback: inherent in closed-loop cycle |
| Performance | Reliable in structured tasks; limited in dynamic settings | Superior in complex, open-ended tasks with strong generalization; dependent on training data |
| Interpretability | High, with clear modular design | Low, due to the black-box nature of neural networks |
| Generalization | Limited, due to reliance on human-designed structures | Strong, driven by large-scale pretraining; sensitive to data gaps |
| Real-time | Low, inter-module communications may introduce delays in complex scenarios | High, direct perception-to-action mapping minimizes processing overhead |
| Computational cost | Moderate, with independent module optimization but coordination overhead | High, requiring significant resources for training |
| Application | Suitable for industrial automation, drone navigation, autonomous driving | Suitable for domestic robots, virtual assistants, human-robot collaboration |
| Advantages | High interpretability; high reliability; easy to integrate domain knowledge | Seamless multimodal integration; efficient in complex tasks; minimal error accumulation |
| Limitations | Sub-optimal due to module coordination issues; low adaptability to unstructured settings | Low interpretability; high dependency on training data; high computational costs; low generalization in out-of-distribution scenarios |

Table 3. Comparison of hierarchical and end-to-end decision-making.
  • Hierarchical Architectures: Decompose decision-making into separate perception, planning, execution, and feedback modules. This modularity facilitates debuggability, optimization, and maintenance. They excel in integrating domain knowledge (e.g., physical constraints, rules), offering high interpretability and reliability. However, module separation can lead to sub-optimal solutions due to coordination issues, especially in dynamic environments, and manual task decomposition can limit adaptability to unseen scenarios.
  • End-to-End Architectures: Employ a large-scale neural network (e.g., VLA) to directly map multimodal inputs to actions. Built on large multimodal models and trained on extensive datasets, VLA models achieve simultaneous visual perception, language understanding, and action generation. Their integrated architecture minimizes error accumulation and enables efficient end-to-end optimization, leading to strong generalization in complex, unstructured environments. The main drawbacks include their black-box nature (low interpretability), heavy reliance on quality and diversity of training data, and high computational cost for training.

4.3. Embodied Learning

Embodied learning aims to enable agents to acquire complex skills and refine their capabilities through continuous interactions with environments. This continuous learning is crucial for embodied agents to achieve precise decision-making and real-time adaptation in the complex and variable real world. It can be achieved through the coordination of various learning strategies.

The following figure (Figure 12 from the original paper) illustrates embodied learning processes and methodologies:

Fig. 12. Embodied learning: process and methodologies. The figure illustrates how different methods contribute to self-improvement: imitation learning acquires skills quickly from diverse sources, transfer learning applies learned skills to related tasks, meta-learning improves learning efficiency, and reinforcement learning refines skills through rewards.

4.3.1. Embodied Learning Methods

Embodied learning can be modeled as a goal-conditional partially observable Markov decision process (POMDP), defined as an 8-tuple $(S, A, G, T, R, \Omega, O, \gamma)$:

  • $S$: The set of states of the environment, encoding multimodal information (textual descriptions, images, structured data).

  • $A$: The set of actions, representing instructions or commands, often in natural language.

  • $G$: The set of possible goals ($g \in G$), specifying objectives (e.g., "purchase a laptop").

  • $T(s' \mid s, a)$: The state transition probability function, defining the probability distribution over next states $s' \in S$ for each state-action pair $(s, a)$.

  • $R: S \times A \times G \to \mathbb{R}$: The goal-conditional reward function, evaluating how well an action $a$ in state $s$ advances goal $g$. Rewards can be numeric or textual.

  • $\Omega$: The set of observations, which may include textual, visual, or multimodal data, representing the agent's partial view of the state.

  • $O(o' \mid s', a)$: The observation probability function, defining the probability of observing $o' \in \Omega$ after transitioning to state $s'$ via action $a$.

  • $\gamma \in [0, 1)$: The discount factor, balancing immediate and long-term rewards, used when rewards are numeric.

    At time $t$, an agent receives observation $o_t \in \Omega$ and goal $g \in G$, selecting action $a_t \in A$ according to policy $\pi(a_t \mid o_t, g)$. The environment transitions to $s_{t+1} \sim T(s' \mid s_t, a_t)$, yielding observation $o_{t+1} \sim O(o' \mid s_{t+1}, a_t)$ and reward $R(s_{t+1}, a_t, g)$.

  • For end-to-end decision-making: The VLA model directly encodes the policy $\pi(a \mid o, g)$, processing multimodal observation $o$ and producing action $a$.

  • For hierarchical decision-making: A high-level agent generates a context-aware subgoal $g_{sub}$ via an LLM-enhanced policy $\pi_{high}(g_{sub} \mid o, g)$, then a low-level policy $\pi_{low}(a \mid o, g_{sub})$ maps the subgoal to an action $a$. This low-level policy can be learned through imitation learning or reinforcement learning.
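
To make the formulation concrete, the sketch below runs a generic goal-conditioned agent-environment loop under this POMDP view. It is a minimal toy illustration, not code from the paper: the environment, noise model, and random placeholder policy are all assumptions.

```python
import random

class ToyGoalEnv:
    """Hypothetical goal-conditioned environment: the hidden state is an integer
    position; the agent only observes a noisy reading of it (partial observability)."""
    def __init__(self):
        self.state = 0

    def reset(self, goal):
        self.state = 0
        return self._observe()

    def _observe(self):
        # o ~ O(o | s): noisy view of the hidden state
        return self.state + random.choice([-1, 0, 1])

    def step(self, action, goal):
        # s' ~ T(s' | s, a): deterministic move left/right
        self.state += action
        reward = 1.0 if self.state == goal else 0.0   # R(s', a, g)
        done = self.state == goal
        return self._observe(), reward, done

def random_policy(observation, goal):
    """pi(a | o, g): a placeholder policy; a VLA model or an LLM-guided
    hierarchy would replace this in practice."""
    return random.choice([-1, 1])

def rollout(env, policy, goal, gamma=0.99, max_steps=50):
    obs, ret, discount = env.reset(goal), 0.0, 1.0
    for _ in range(max_steps):
        action = policy(obs, goal)                  # a_t ~ pi(a | o_t, g)
        obs, reward, done = env.step(action, goal)  # transition + new observation
        ret += discount * reward                    # accumulate gamma^t * r_t
        discount *= gamma
        if done:
            break
    return ret

print(rollout(ToyGoalEnv(), random_policy, goal=3))
```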

    The following are the results from Table 4 of the original paper:

| Methods | Strengths | Limitations | Applications |
| --- | --- | --- | --- |
| Imitation Learning | Rapid policy learning by mimicking expert demonstrations; efficient for tasks with high-quality data | Dependent on diverse, high-quality demonstrations; limited adaptability to new tasks or sparse-data scenarios | Robotic manipulation; structured navigation; human-robot interaction with expert guidance |
| Reinforcement Learning | Optimizes policies in dynamic, uncertain environments via trial and error; excels in tasks with clear reward signals | Requires large samples and computational resources; sensitive to reward function and discount factor | Autonomous navigation; adaptive human-robot interaction; dynamic task optimization |
| Transfer Learning | Accelerates learning by transferring knowledge between related tasks; enhances generalization in related tasks | Risks negative transfer when tasks differ significantly; requires task similarity for effective learning | Navigation across diverse environments; manipulation with shared structures; cross-task skill reuse |
| Meta-Learning | Rapid adaptation to new tasks with minimal data; ideal for diverse embodied tasks | Demands extensive pre-training and large datasets; establishing a universal meta-policy is resource-intensive | Rapid adaptation in navigation, manipulation, or interaction across diverse tasks and environments |

4.3.1.1. Imitation Learning

Imitation learning ([204]) allows agents to learn policies by mimicking expert demonstrations, making it highly efficient for tasks with high-quality data (e.g., robotic manipulation). The training is supervised, using datasets of expert state-action pairs $(s, a)$. The objective is to learn a policy $\pi(a \mid s)$ that replicates the expert's behavior by minimizing the negative log-likelihood of expert actions.

The objective function for imitation learning is defined as: $ \mathcal{L}(\pi) = -\mathbb{E}_{\tau \sim D} [\log \pi(a \mid s)] $ where:

  • $\mathcal{L}(\pi)$: The loss function for the policy $\pi$.

  • $\mathbb{E}_{\tau \sim D}$: The expectation over expert demonstrations $\tau$ sampled from the expert data distribution $D$.

  • $\log \pi(a \mid s)$: The log-probability of the expert action $a$ given the state $s$ under the learned policy $\pi$. This term encourages the policy to assign high probability to expert actions.

    Each demonstration $\tau_i$ consists of a sequence of state-action pairs $(s_t, a_t)$ of length $L$: $ \tau_i = [(s_1, a_1), \cdots, (s_t, a_t), \cdots, (s_L, a_L)] $ In continuous action spaces, $\pi(\cdot)$ is often modeled as a Gaussian distribution, and the objective is approximated using the Mean Squared Error (MSE) between predicted and expert actions. Imitation learning is sample-efficient but highly dependent on the quality and coverage of demonstration data, struggling with unseen scenarios. Combining it with reinforcement learning can enhance robustness.
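
As a concrete illustration of this objective, the sketch below implements a minimal behavior-cloning update in PyTorch, using MSE against expert actions for a continuous action space. The random "expert" data and small MLP policy are assumed toy placeholders, not the setup of any cited work.

```python
import torch
import torch.nn as nn

# Toy expert demonstrations: states s (dim 8) and expert actions a (dim 2).
states = torch.randn(256, 8)
expert_actions = torch.randn(256, 2)

# pi(a | s): a small deterministic policy network (mean of a fixed-variance Gaussian).
policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(100):
    predicted = policy(states)
    # MSE surrogate for the negative log-likelihood under a fixed-variance Gaussian.
    loss = nn.functional.mse_loss(predicted, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```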

4.3.1.2. Reinforcement Learning

Reinforcement learning ([139]) is a dominant method, enabling agents to learn policies by interacting with environments through trial and error, well suited for dynamic and uncertain settings. At each time step $t$, the agent observes state $s$, selects action $a$ according to policy $\pi(a \mid s)$, receives reward $r$ from $R(s, a, g)$, and the environment transitions to $s'$.

The objective function of reinforcement learning is to maximize the expected cumulative reward: $ \mathcal{T}(\pi) = \mathbb{E}_{\pi, T, O} \left( \sum_{t=0}^{\infty} \gamma^{t} R(s, a, g) \right) $ where:

  • $\mathcal{T}(\pi)$: The expected total reward (return) for policy $\pi$.

  • $\mathbb{E}_{\pi, T, O}$: The expectation taken over trajectories generated by policy $\pi$, state transitions $T$, and observations $O$.

  • $\gamma \in [0, 1)$: The discount factor, which determines the present value of future rewards. A value closer to 0 emphasizes immediate rewards, while a value closer to 1 emphasizes long-term rewards.

  • $R(s, a, g)$: The reward received for taking action $a$ in state $s$ to achieve goal $g$.

    Reinforcement learning excels in optimizing policies for complex tasks but requires extensive exploration, making it computationally costly. Hybrid approaches with imitation learning can mitigate this by providing initial policies.
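
As a concrete illustration of the return being maximized above (not part of the paper), the snippet below computes discounted returns for a single episode and forms the classic REINFORCE policy-gradient surrogate; the rollout that produces `log_probs` and `rewards` is assumed to exist elsewhere.

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """G_t = sum_k gamma^k * r_{t+k+1}, accumulated backwards over one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return torch.tensor(list(reversed(returns)))

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Monte-Carlo policy-gradient surrogate: -sum_t log pi(a_t|s_t) * G_t."""
    returns = discounted_returns(rewards, gamma)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # variance reduction
    return -(torch.stack(log_probs) * returns).sum()

# log_probs: scalar tensors recorded while acting; rewards: floats from R(s, a, g).
```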

4.3.1.3. Transfer Learning

Transfer learning ([152]) accelerates learning on target tasks by leveraging knowledge from source tasks, reducing the need for extensive data and time. It adapts a source policy $\pi_s$ (from a source task with state $s \in S$ and action $a \in A$) to a target task with different dynamics or goals. The objective is to minimize the divergence between $\pi_s$ and the target policy $\pi_t$, typically by fine-tuning using a small amount of target task data.

The process is guided by the task-specific loss of the target task, constrained by the Kullback-Leibler (KL) divergence for policy alignment: $ \theta_t^* = \arg\min_{\theta_t} \mathbb{E}_{s \sim S_t}\left[ D_{KL}\big(\pi_s(\cdot \mid s; \theta_s) \,\|\, \pi_t(\cdot \mid s; \theta_t)\big) \right] + \lambda \mathcal{L}_t(\theta_t) $ where:

  • $\theta_t^*$: The optimal policy parameters for the target task.

  • $\theta_s$: The parameters of the source policy $\pi_s$.

  • $\theta_t$: The parameters of the target policy $\pi_t$.

  • $\mathbb{E}_{s \sim S_t}$: The expectation over states $s$ from the target task's state distribution $S_t$.

  • $D_{KL}\big(\pi_s(\cdot \mid s; \theta_s) \,\|\, \pi_t(\cdot \mid s; \theta_t)\big)$: The Kullback-Leibler divergence measuring the difference between the source policy $\pi_s$ and the target policy $\pi_t$ given state $s$. It penalizes deviations of the target policy from the source policy.

  • $\mathcal{L}_t(\theta_t)$: The task-specific loss of the target task, which measures how well the target policy performs on the target task.

  • $\lambda$: A regularization parameter that balances the importance of policy alignment (KL divergence) and target task performance.

    In embodied settings, transfer learning enables skill reuse across environments and goals. However, large disparities between tasks can lead to negative transfer, where performance degrades.
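
A minimal sketch of the KL-regularized objective above, assuming discrete actions so that both policies output logits over the same action set; `lam` plays the role of $\lambda$, and the frozen source policy provides the reference distribution. Names and the surrounding training loop are illustrative.

```python
import torch.nn.functional as F

def transfer_loss(target_logits, source_logits, task_loss, lam=1.0):
    """E_s[ D_KL(pi_s(.|s) || pi_t(.|s)) ] + lambda * L_t(theta_t).

    target_logits, source_logits: [batch, num_actions] logits for the same states,
    from the trainable target policy and the frozen source policy respectively.
    """
    log_pt = F.log_softmax(target_logits, dim=-1)
    log_ps = F.log_softmax(source_logits.detach(), dim=-1)
    # F.kl_div(input, target, log_target=True) computes KL(target || input),
    # i.e. how far the target policy deviates from the source policy.
    kl = F.kl_div(log_pt, log_ps, log_target=True, reduction="batchmean")
    return kl + lam * task_loss
```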

4.3.1.4. Meta-Learning

Meta-learning ([51, 66]), or "learning how to learn," allows agents to swiftly infer optimal policies for new tasks with minimal samples. At each time step $t$, the agent receives observation $o \in \Omega$ and goal $g$, selecting an action $a$ according to a meta-policy that adapts to task-specific dynamics. The objective is to optimize expected performance across tasks by minimizing the loss on task-specific data.

In Model-Agnostic Meta-Learning (MAML) [52], this is achieved by learning an initial set of model parameters $\theta$ that can be adapted quickly to new tasks with minimal updates. Specifically, for a set of tasks $\mathcal{T}_i$, MAML optimizes the meta-objective as: $ \theta^* = \arg\min_{\theta} \sum_{\mathcal{T}_i} \mathcal{L}_{\mathcal{T}_i}\left(f_{\theta_i}\right), \quad \theta_i = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}\left(f_{\theta}\right) $ where:

  • $\theta^*$: The optimal meta-policy parameters.

  • $\sum_{\mathcal{T}_i}$: Summation over a set of distinct tasks $\mathcal{T}_i$.

  • $\mathcal{L}_{\mathcal{T}_i}\left(f_{\theta_i}\right)$: The task-specific loss for task $\mathcal{T}_i$, evaluated on the model $f$ with parameters $\theta_i$.

  • $f_{\theta}$: The model parameterized by $\theta$.

  • $\theta_i$: The task-specific parameters after a single (or few) gradient updates from $\theta$ using the loss $\mathcal{L}_{\mathcal{T}_i}\left(f_{\theta}\right)$.

  • $\alpha$: The inner loop learning rate for adapting to a specific task.

  • $\nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}\left(f_{\theta}\right)$: The gradient of the task-specific loss with respect to the meta-parameters $\theta$.

    Meta-learning enables agents to quickly adapt to new tasks by fine-tuning a pretrained model with few demonstrations. However, it requires substantial pretraining and large samples across diverse tasks.
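
The two MAML equations above map directly onto an inner adaptation step and an outer meta-update. Below is a compact second-order sketch using `torch.autograd.grad`; `loss_fn(params, batch)` is an assumed user-supplied function that evaluates the model functionally with the given parameter list.

```python
import torch

def maml_step(params, tasks, loss_fn, inner_lr=0.01, outer_lr=0.001):
    """One meta-update over a batch of tasks.

    params:  list of tensors with requires_grad=True (the meta-parameters theta)
    tasks:   iterable of (support_batch, query_batch) pairs, one per task T_i
    loss_fn: callable(params, batch) -> scalar loss L_Ti(f_params)
    """
    meta_loss = 0.0
    for support, query in tasks:
        # Inner loop: theta_i = theta - alpha * grad_theta L_Ti(f_theta) on support data.
        grads = torch.autograd.grad(loss_fn(params, support), params, create_graph=True)
        adapted = [p - inner_lr * g for p, g in zip(params, grads)]
        # Outer objective: L_Ti(f_{theta_i}) evaluated on held-out query data.
        meta_loss = meta_loss + loss_fn(adapted, query)
    meta_grads = torch.autograd.grad(meta_loss, params)
    with torch.no_grad():
        for p, g in zip(params, meta_grads):
            p -= outer_lr * g  # gradient step on the meta-parameters
    return float(meta_loss)
```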

4.3.2. Imitation Learning Empowered by Large Models

Imitation learning aims for agents to achieve expert-level performance by mimicking demonstrations. Key challenges include capturing complex behaviors, generalizing to unseen states, ensuring robustness against distribution shifts, and achieving sample efficiency. Large models have significantly enhanced behavior cloning, the most important imitation learning approach, which formulates policy learning as a supervised regression task that predicts actions $a \in A$ from observations $o \in \Omega$ and goals $g \in G$.

The following figure (Figure 13 from the original paper) illustrates imitation learning empowered by diffusion models or Transformers:

Fig. 13. Imitation learning empowered by diffusion models or Transformers. (The figure shows the expert dataset, the relationship between states/observations and actions, and decision-network structures built on diffusion models and Transformers.)

4.3.2.1. Diffusion-based Policy Network

Diffusion models excel in handling complex multimodal distributions, generating diverse action trajectories, and enhancing robustness and expressiveness of policies.

  • Pearce [142]: Proposes a diffusion model-based imitation learning framework that integrates diffusion models into policy networks. It optimizes expert demonstrations iteratively through noise addition and removal, capturing action distribution diversity.
  • DABC [34]: Adopts a two-stage process: pretrains a base policy via behavior cloning, then refines action distribution modeling via a diffusion model.
  • Diffusion Policy [36]: Uses a diffusion model as the decision model for vision-driven robot tasks. It takes visual input and robot's current state as conditions, uses U-Net as a denoising network, and predicts denoising steps to generate continuous action sequences.
  • 3D-Diffusion [217]: Proposes a diffusion policy framework based on 3D inputs (simple 3D representations). It leverages a diffusion model to generate action sequences, improving generalization of visual motion policies by capturing spatial information.
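
All four approaches above train a conditional denoiser over actions. The sketch below shows the core DDPM-style training objective such a policy typically uses, assuming a user-supplied noise-prediction network `eps_model(noisy_action, t, obs)` and a precomputed noise schedule; it is a simplified illustration, not the training code of any cited work.

```python
import torch
import torch.nn.functional as F

def diffusion_policy_loss(eps_model, actions, obs, alphas_cumprod):
    """Conditional denoising objective: predict the noise added to expert actions.

    actions:        [B, action_dim] clean expert actions from demonstrations
    obs:            [B, obs_dim]    conditioning observations (e.g., image features)
    alphas_cumprod: [T] cumulative products of (1 - beta_t) for the noise schedule
    """
    B = actions.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,))      # random diffusion step per sample
    a_bar = alphas_cumprod[t].unsqueeze(-1)              # [B, 1]
    noise = torch.randn_like(actions)
    noisy_actions = a_bar.sqrt() * actions + (1 - a_bar).sqrt() * noise
    pred_noise = eps_model(noisy_actions, t, obs)        # epsilon_theta(a_t, t, o)
    return F.mse_loss(pred_noise, noise)
```

At inference time the policy would start from pure noise and iteratively denoise, conditioned on the current observation, to produce a continuous action sequence.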

4.3.2.2. Transformer-based Policy Network

Transformer architectures empower imitation learning by treating expert trajectories as sequential data and using self-attention mechanisms to model dependencies between actions, states, and goals, minimizing error accumulation.

  • RT-1 [20]: Google's pioneering work demonstrating Transformers in robot control. Combines a large, diverse dataset (130k+ trajectories, 700+ tasks) with a pretrained vision-language model to improve task generalization.
  • RT-Trajectory [62]: Introduces "trajectory sketch" to incorporate low-level visual cues, enhancing task generalization.
  • ALOHA [224]: Stanford's work using Transformer's encoding-decoding structure to generate robotic arm action sequences from multi-view images, achieving precise double-arm operations. Its follow-up uses action chunking for multi-step predictions, improving stability.
  • Mobile ALOHA [58]: Extends ALOHA to whole-body coordinated mobile operation tasks using a mobile platform and teleoperation interface.
  • HiveFormer [224] and RVT [60]: Utilize multi-view data and CLIP for visual-language feature fusion to directly predict 6D grasp poses, achieving state-of-the-art performance in complex spatial modeling.
  • RoboCat [19]: Employs cross-task, cross-entity embodied imitation learning, integrating VQ-GAN ([50]) to tokenize visual inputs and Decision Transformer to predict actions and observations, enabling rapid policy generalization.
  • RoboAgent [17]: Adopts a similar encoding-decoding structure, fusing vision, task descriptions, and robot states to minimize action sequence prediction errors.
  • CrossFormer [44]: Proposes a Transformer-based imitation learning architecture for cross-embodied tasks, trained on large-scale expert data to unify processing of manipulation, navigation, mobility, and aerial tasks, demonstrating multi-task learning potential.

4.3.3. Reinforcement Learning Empowered by Large Models

Reinforcement learning ([11]) allows agents to develop optimal control strategies through environmental interactions. While traditional methods like Q-learning ([194]), SARSA ([164]), and Deep Reinforcement Learning (DRL) (DQN [130], PPO [166], SAC [68]) have achieved significant successes, they still face limitations in reward function design and policy network construction. Large models offer solutions to these challenges.

The following figure (Figure 14 from the original paper) illustrates reinforcement learning empowered by large models:

Fig. 14. Reinforcement learning empowered by large models. (The figure shows the environment, states, and actions, together with reward-function design and policy-network construction based on diffusion models, Transformers, and LLMs, illustrating how large models shape rewards to optimize the agent's decision process.)

4.3.3.1. Reward Function Design

Designing effective reward functions ([49]) is challenging due to complexity, task-specificity, and issues like reward hacking. Large models offer solutions by generating reward signals or entire reward functions.

  • Kwon et al. and Language to Rewards (L2R) [215]: Use zero-shot and few-shot approaches with GPT-3 to produce reward signals directly from textual behavior prompts, translating high-level goals to hardware-specific policies. Limitations include sparse rewards and heavy dependence on precise prompts.
  • Text2Reward [205]: Generates dense interpretable Python reward functions from environment descriptions and examples, iteratively refined via human feedback, achieving high success rates.
  • Eureka [120]: Leverages GPT-4 to create dense rewards from task and environment prompts. It automates reward function optimization through an iterative strategy, mitigating reliance on human feedback (unlike Text2Reward), and can surpass human-crafted rewards.
  • Auto MC-Reward [106]: Implements full automation for Minecraft with a multi-stage pipeline: a reward designer generates signals, a validator ensures quality, and a trajectory analyzer refines rewards through failure-driven iterations. Achieves significant efficiency but is domain-specific.
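
To give a flavor of what these pipelines produce, below is a hypothetical example of the kind of dense, interpretable Python reward function a Text2Reward- or Eureka-style system might generate for a simple reaching task; the state fields, thresholds, and weights are invented for illustration and are not taken from the cited papers.

```python
import numpy as np

def reward(state, action, goal):
    """Hypothetical LLM-generated dense reward for 'move the gripper to the block'.

    state: dict with 'gripper_pos' and 'block_pos' as 3-D numpy arrays.
    """
    dist = np.linalg.norm(state["gripper_pos"] - state["block_pos"])
    reach_reward = -1.0 * dist                          # dense distance shaping
    action_penalty = -0.01 * np.square(action).sum()    # discourage jerky motions
    success_bonus = 5.0 if dist < 0.05 else 0.0         # sparse completion bonus
    return reach_reward + action_penalty + success_bonus
```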

4.3.3.2. Policy Network Construction

Offline reinforcement learning ([101]) learns policies from pre-collected datasets without online interaction, but policy regularization is needed to mitigate errors for actions absent from datasets. Large models enhance policy network construction to improve expressiveness and adaptability.

The following figure (Figure 15 from the original paper) illustrates policy network construction empowered by large models:

Fig. 15. Policy network construction empowered by large models.

Policy network construction with diffusion models:

  • DiffusionQL [193]: Employs diffusion models as a foundation policy to model action distributions and trains them to maximize value function objectives within the Q-learning framework. This generates high-reward policies that fit multimodal or non-standard action distributions in offline datasets.
  • EDP [91]: Introduces an efficient sampling method that reconstructs actions from intermediate noised states in a single step, significantly reducing computational overhead of diffusion models. It can integrate with various offline reinforcement learning frameworks.

Policy network construction with Transformer-based architectures: Transformers capture long-term dependencies in trajectories, improving policy flexibility and accuracy.

  • Decision Transformer [31]: Re-frames offline reinforcement learning as a conditional sequence modeling problem, treating state-action-reward trajectories as sequential inputs, and applies supervised learning to generate optimal actions from offline datasets.
  • Prompt-DT [207]: Enhances generalization in few-shot scenarios by incorporating prompt engineering (trajectory prompts with task-specific encoding) to guide action generation for new tasks.
  • Online Decision Transformer (ODT) [228]: Pretrains Transformer via offline reinforcement learning to learn sequence generation, then fine-tunes it through online reinforcement learning interactions.
  • Q-Transformer [30]: Integrates Transformer's sequence modeling with Q-function estimation, learning Q-values autoregressively to generate optimal actions.
  • Gato [158]: A Transformer-based sequence modeling approach for multi-task offline reinforcement learning, but it relies heavily on dataset optimality and incurs high training cost.
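
A minimal sketch of the return-conditioned sequence modeling idea behind Decision Transformer: returns-to-go, states, and actions are embedded, interleaved into one token sequence, and passed through a causal Transformer that predicts the next action from each state token. Dimensions, layer counts, and the absence of timestep embeddings are simplifications, not the published configuration.

```python
import torch
import torch.nn as nn

class TinyDecisionTransformer(nn.Module):
    def __init__(self, state_dim, action_dim, d_model=128, n_layers=3, max_len=20):
        super().__init__()
        self.embed_rtg = nn.Linear(1, d_model)            # return-to-go token
        self.embed_state = nn.Linear(state_dim, d_model)
        self.embed_action = nn.Linear(action_dim, d_model)
        self.pos = nn.Embedding(3 * max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.predict_action = nn.Linear(d_model, action_dim)

    def forward(self, rtg, states, actions):
        # rtg: [B, T, 1], states: [B, T, state_dim], actions: [B, T, action_dim]
        B, T, _ = states.shape
        tokens = torch.stack(
            [self.embed_rtg(rtg), self.embed_state(states), self.embed_action(actions)],
            dim=2,
        ).reshape(B, 3 * T, -1)                            # sequence (R_1, s_1, a_1, R_2, ...)
        tokens = tokens + self.pos(torch.arange(3 * T, device=tokens.device))
        causal_mask = nn.Transformer.generate_square_subsequent_mask(3 * T)
        h = self.backbone(tokens, mask=causal_mask)        # causal self-attention
        return self.predict_action(h[:, 1::3])             # predict a_t from each s_t token
```

At training time the predicted actions are regressed against the dataset actions; at test time a desired return-to-go is supplied to condition the generated behavior.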

Policy network construction with LLM: LLMs leverage pretrained knowledge to streamline offline reinforcement learning.

  • GLAM [28]: Uses LLM as policy agents to generate executable action sequences for language-defined tasks, optimized online via PPO with contextual memory.
  • LaMo [169]: Employs GPT-2 as a base policy, fine-tuned with LoRA to preserve prior knowledge, converting state-action-reward sequences into language prompts for task-aligned policy generation.
  • Reid [159]: Explores LLM's transferability using pretrained BERT, fine-tuned for specific tasks and augmented by external knowledge bases. It can outperform Decision Transformer while reducing training time.
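
Approaches such as LaMo work by serializing numeric trajectories into text so that a pretrained LLM can consume them. The snippet below shows one plausible serialization; the exact prompt format of the cited works is not reproduced here, so treat it as a hypothetical illustration.

```python
def trajectory_to_prompt(trajectory, task_description):
    """Serialize (state, action, reward) steps into a text prompt for an LLM policy.

    trajectory: list of (state: list[float], action: list[float], reward: float)
    """
    lines = [f"Task: {task_description}"]
    for t, (s, a, r) in enumerate(trajectory):
        s_txt = ", ".join(f"{x:.2f}" for x in s)
        a_txt = ", ".join(f"{x:.2f}" for x in a)
        lines.append(f"step {t}: state=[{s_txt}] action=[{a_txt}] reward={r:.2f}")
    lines.append("next action:")
    return "\n".join(lines)

prompt = trajectory_to_prompt(
    [([0.1, 0.2], [0.0, 1.0], 0.5)], "push the red block to the target"
)
```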

4.4. World Models

World models serve as internal simulations or representations of environments, enabling intelligent systems to anticipate future states, comprehend cause-and-effect relationships, and make decisions without solely relying on expensive or infeasible real-world interactions. They provide a rich cognitive framework for efficient learning, decision-making, and adaptation in complex dynamic environments.

The following figure (Figure 16 from the original paper) illustrates world models and applications in decision-making and embodied learning:

Fig. 16. World models and applications in decision-making and embodied learning. (The figure has three parts: world model designs, including latent space, Transformer-based, and diffusion-based models as well as JEPA; the role of world models in decision-making, emphasizing prediction and the integration of contextual knowledge; and the use of world models in embodied learning, covering state transitions, rewards, and data generation.)

4.4.1. Design of World Models

Traditional reinforcement learning (RL) is costly due to repeated interactions. World models allow learning in a simulated environment. Current world models are categorized into latent space world models, Transformer-based world models, diffusion-based world models, and joint embedding predictive architectures.

4.4.1.1. Latent Space World Model

These models predict in latent spaces, represented by Recurrent State Space Models (RSSMs) [67, 69]. RSSMs learn dynamic environment models from pixel observations and plan actions in encoded latent spaces, decomposing the latent state into stochastic and deterministic parts.

  • PlaNet [71]: Employs RSSM with a Gated Recurrent Unit (GRU) and a Convolutional Variational AutoEncoder (CVAE) for latent dynamics and model predictive control.
  • Dreamer [70]: Advances PlaNet by learning the actor and value networks from latent representations.
  • Dreamer V2 [72]: Further uses the actor-critic algorithm to learn behaviors purely from imagined sequences generated by the world model, achieving human-level performance on Atari.
  • Dreamer V3 [73]: Enhances stability with symlog predictions, layer normalization, and normalized returns via exponential moving average, outperforming specialized algorithms.
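
A skeletal view of the RSSM decomposition shared by PlaNet and the Dreamer family: a deterministic GRU path carries history while a stochastic latent is sampled from a learned Gaussian. Encoders, decoders, and the training losses are omitted, and the sizes are illustrative.

```python
import torch
import torch.nn as nn

class MiniRSSM(nn.Module):
    """Latent state = (z_t, h_t): stochastic sample z_t plus deterministic GRU state h_t."""
    def __init__(self, action_dim, stoch_dim=32, deter_dim=200):
        super().__init__()
        self.gru = nn.GRUCell(stoch_dim + action_dim, deter_dim)
        self.prior_net = nn.Linear(deter_dim, 2 * stoch_dim)  # predicts mu and log_sigma

    def step(self, z_prev, action, h_prev):
        h = self.gru(torch.cat([z_prev, action], dim=-1), h_prev)  # deterministic path
        mu, log_sigma = self.prior_net(h).chunk(2, dim=-1)
        z = mu + log_sigma.exp() * torch.randn_like(mu)            # stochastic path
        return z, h  # rolled forward repeatedly to "imagine" future latent states
```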

4.4.1.2. Transformer-based World Model

These models leverage the attention mechanism to model multimodal inputs, overcoming CNN and RNN limitations in high-dimensional, continuous, or multimodal environments, especially for complex memory-interaction tasks.

  • IRIS [129]: One of the first to apply Transformer in world models. Agents learn skills within an autoregressive Transformer-based world model. IRIS tokenizes images using a Vector Quantized Variational Autoencoder (VQ-VAE) and predicts future tokens.
  • Google's Genie [24]: Built on spatial-temporal Transformer ([206]), trained on vast unlabeled Internet video datasets via self-supervised learning. It provides a paradigm for manipulable, generative, interactive environments.
  • TWM [162]: Proposes a Transformer-XL-based world model, migrating Transformer-XL's segment-level recurrence mechanism to capture long-term dependencies. It trains a model-free agent within latent imagination to enhance efficiency.
  • STORM [222]: Utilizes a stochastic Transformer, fusing state and action into a single token, improving training efficiency and matching Dreamer V3 performance.

4.4.1.3. Diffusion-based World Model

These models excel in generating predictive video sequences in the original image space.

  • OpenAI's Sora [22]: Leverages an encoding network to convert videos/images into tokens, then a large-scale diffusion model applies noising and denoising processes to these tokens, mapping them back to the original image space for multi-step image predictions based on language descriptions. This can generate trajectory videos for agents.
  • UniPi [47]: Employs diffusion models to model agent trajectories in image space, generating future key video frames from language inputs and initial images, followed by super-resolution in time series.
  • UniSim [212]: Improves trajectory prediction by jointly training diffusion models on Internet data and robot interaction videos, enabling prediction of long-sequence video trajectories for both high-level and low-level task instructions.

4.4.1.4. Joint Embedding Predictive Architecture (JEPA)

Proposed by Yann LeCun at Meta ([102]), JEPA aims to overcome the lack of real-world common sense in data-driven world models. Inspired by human brains, it introduces hierarchical planning and self-supervised learning in a high-level representation space.

  • Hierarchical Planning: Breaks complex tasks into multiple abstraction levels, focusing on semantic features rather than pixel-level outputs.
  • Self-supervised Learning: Trains networks to predict missing or hidden input data, enabling pretraining on large unlabeled datasets and fine-tuning for diverse tasks.
  • Architecture: Comprises a perception module and a cognitive module, forming a world model using latent variables to capture essential information while filtering redundancies. This supports efficient decision-making and future scenario planning.
  • Dual-System Concept: Balances "fast" intuitive reactions with "slow" deliberate reasoning.

4.4.2. World Model in Decision-Making

World models provide powerful internal representations, enabling agents to predict environmental dynamics and outcomes before acting. For decision-making, they serve two main roles: simulated validation and knowledge augmentation.

4.4.2.1. World Model for Simulated Validation

In robotics, testing decisions in the real world is expensive and time-consuming. World models allow agents to "try out" actions and observe likely consequences in a simulated environment, dramatically shortening iteration time and facilitating safe testing of high-risk scenarios. This ability helps agents identify and avoid potential mistakes, optimizing performance.

  • NeBula [3]: Constructs probabilistic belief spaces using Bayesian filtering, enabling robots to reason across diverse structural configurations, even in unknown environments, and predict outcomes under uncertainty.
  • UniSim [212]: A generative simulator for real-world interactions, capable of simulating visual outcomes of both high-level instructions and low-level controls. It integrates diverse datasets across different modalities.

4.4.2.2. World Model for Knowledge Augmentation

World models augment agents with predictive and contextual knowledge essential for strategy planning. By predicting future environmental states or enriching understanding of the world, they enable agents to anticipate outcomes, avoid mistakes, and optimize performance.

  • World Knowledge Model (WKM) [146]: Imitates human mental world knowledge by providing global prior knowledge before a task and maintaining local dynamic knowledge during the task. It synthesizes global task knowledge and local state knowledge from experts and sampled trajectories.
  • Agent-Pro [221]: Transforms an agent's interactions with its environment (especially with other agents) into "beliefs," representing the agent's social understanding and informing subsequent decisions.
  • GovSim [144]: Explores the emergence of cooperative behaviors within societies of LLM agents. Agents gather information through multi-agent conversations, implicitly forming high-level insights and representations of the world model.

4.4.3. World Model in Embodied Learning

World models enable agents to learn new skills and behaviors efficiently. In contrast to model-free reinforcement learning (which is computationally expensive and data-inefficient), model-based reinforcement learning uses world models to streamline learning by simulating state transitions and generating data.

4.4.3.1. World Model for State Transitions

Model-based reinforcement learning leverages a world model that explicitly captures state transitions and dynamics, allowing agents to learn from simulated environments for safe, cost-effective, and data-efficient training. The world model creates virtual representations of the real world, enabling agents to explore hypothetical actions and refine policies without real-world risks.

  • RobotDreamPolicy [145]: Learns a world model and develops the policy within it, drastically reducing real-environment interactions.
  • DayDreamer [202]: Leverages Dreamer V2 (an RSSM-based world model) to encode observations into latent states and predict future states, achieving rapid skill learning in real robots with high sample efficiency.
  • SWIM [128]: Utilizes Internet-scale human video data to understand human interactions and gain affordances. Initially trained on egocentric videos, it's fine-tuned with robot data to adapt to robot domains, enabling efficient task learning.
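
The common pattern behind these systems is to roll the policy out inside the learned model rather than the real environment. The pseudocode-style sketch below assumes `world_model.step(state, action)` returns a predicted next state and reward; the interface is illustrative rather than taken from any specific codebase.

```python
def imagine_rollout(world_model, policy, start_state, horizon=15):
    """Roll out the policy inside the learned world model (no real-world interaction).

    Assumed interfaces:
      world_model.step(state, action) -> (next_state, predicted_reward)
      policy(state) -> action
    """
    states, actions, rewards = [start_state], [], []
    state = start_state
    for _ in range(horizon):
        action = policy(state)
        state, reward = world_model.step(state, action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
    # The imagined rewards (and learned value estimates) are then used to update
    # the policy, e.g., by backpropagating through the model as in Dreamer.
    return states, actions, rewards
```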

4.4.3.2. World Model for Data Generation

World models, especially diffusion-based world models, can synthesize data, which is crucial for embodied AI due to challenges in collecting diverse real-world data. They can synthesize realistic trajectory data, state representations, and dynamics, augmenting existing datasets or creating new ones.

  • SynthER [118]: Utilizes diffusion-based world models to generate low-dimensional offline RL trajectory data to augment original datasets, enhancing performance in both offline and online settings.
  • MTDiff [77]: Applies diffusion-based world models to generate multi-task trajectories, using expert trajectories as prompts to guide the generation of agent trajectories aligned with specific task objectives and dynamics.
  • VPDD [76]: Trains trajectory prediction world models using a large-scale human operation dataset, then fine-tunes the action generation module with minimal labeled action data, significantly reducing the need for extensive robot interaction data for policy learning.

5. Experimental Setup

As a comprehensive survey paper, this article does not present its own novel experimental results, datasets, or evaluation metrics. Instead, it synthesizes the methodologies, findings, and challenges observed across numerous prior research works in large model empowered embodied AI. Therefore, this section will discuss the general trends in datasets, evaluation metrics, and baselines that are commonly found in the literature surveyed by the authors, providing context for the advancements and limitations discussed throughout the paper.

5.1. Datasets

The paper highlights a critical challenge in embodied AI: the scarcity of high-quality embodied data. While large models thrive on massive datasets, collecting real-world robot interaction data is prohibitively expensive and complex. The types of datasets commonly used or leveraged in the field include:

  • Robot Trajectory Datasets: These are datasets of recorded robot actions and corresponding observations (e.g., camera images, joint states) from human demonstrations or robot trials.
    • VIMA [89]: Mentioned as a large dataset with 650,000 demonstrations for robot manipulation.
    • RT-1 [20]: Mentioned with 130,000 demonstrations (trajectories from 700+ tasks).
    • RT-X [186]: An initiative to collect robot arm data from over 60 laboratories to build the open X-Embodiment dataset.
    • Open X-Embodiment dataset [186]: A large-scale, diverse dataset of robot trajectories.
  • Internet-Scale Vision-Language Data: Large models leverage vast web-scraped datasets for pretraining, particularly VLMs and MLMs.
    • LAION-5B: A massive dataset with 5.75 billion text-image pairs, dwarfing current embodied datasets. This highlights the data disparity between general vision-language models and embodied AI.
  • Human Video Datasets: These provide rich real-world dynamics and observations from human perspectives.
    • Ego4D [61]: Provides egocentric videos, offering insights into human behaviors and interactions, which can be used to improve contextual understanding for robotic tasks.
  • Simulated Datasets: Simulators ([163]) are crucial for generating large and diverse datasets cost-effectively. They allow agents to be trained in virtual environments before deployment in the real world.
    • Minecraft (Auto MC-Reward [106]): A specific example of a simulation environment used for reward design in reinforcement learning.

    • D4RL benchmarks [57]: A collection of offline reinforcement learning datasets derived from various simulated locomotion and manipulation environments.

      The choice of datasets is driven by the need to validate a method's performance on tasks that are representative of real-world embodied AI challenges, such as robotic manipulation, navigation, and human-robot interaction. Leveraging diverse data sources, from robot-specific interactions to general web data and human videos, is seen as critical for improving generalization and transferability.

5.2. Evaluation Metrics

The paper, being a survey, does not define its own evaluation metrics. However, it implicitly discusses the performance aspects that are crucial for embodied AI and are commonly evaluated in the literature it surveys. Based on the challenges and enhancements discussed, key evaluation metrics in this field generally fall into categories such as task success, efficiency, generalization, and robustness.

Here are some common evaluation metrics, which are not explicitly defined with formulas in the paper, but are standard in the field:

  1. Task Success Rate (SR) / Success Percentage:

    • Conceptual Definition: Measures the proportion of trials or episodes in which the agent successfully completes the assigned task according to predefined criteria. It directly quantifies the agent's effectiveness in achieving its goals.
    • Mathematical Formula: $ SR = \frac{\text{Number of successful trials}}{\text{Total number of trials}} \times 100\% $
    • Symbol Explanation:
      • SR: The Task Success Rate.
      • Number of successful trials: The count of attempts where the agent fully achieved the task objective.
      • Total number of trials: The total number of attempts made by the agent for a given task.
  2. Cumulative Reward (Return):

    • Conceptual Definition: In reinforcement learning, this metric sums all the rewards an agent receives from the environment over an episode, potentially discounted by a factor $\gamma$. It reflects the total utility or performance achieved by the agent over a task execution.
    • Mathematical Formula: $ R_t = \sum_{k=0}^{T} \gamma^k r_{t+k+1} $
    • Symbol Explanation:
      • $R_t$: The cumulative reward (return) at time step $t$.
      • $T$: The final time step of the episode.
      • $\gamma$: The discount factor, $\gamma \in [0, 1)$, which weighs immediate rewards more heavily than future rewards.
      • $r_{t+k+1}$: The instantaneous reward received at time step $t+k+1$.
  3. Sample Efficiency:

    • Conceptual Definition: Measures how quickly an agent can learn an effective policy from a limited amount of data or environmental interactions. Higher sample efficiency means less data or fewer interactions are needed, which is crucial for real-world robotics where data collection is expensive.
    • Mathematical Formula: Often evaluated by plotting performance (e.g., Task Success Rate or Cumulative Reward) against the number of environmental interactions, demonstrating a steeper learning curve or achieving a target performance with fewer samples. No single universal formula, but typically analyzed visually or by comparing data points needed for convergence.
    • Symbol Explanation: This is an observational metric rather than a single formula. It might involve comparing Number of interactions (e.g., timesteps, episodes, demonstrations) to reach a certain Performance threshold.
  4. Generalization Capability:

    • Conceptual Definition: Assesses the agent's ability to perform well on tasks or in environments that were not seen during training. This is a critical aspect for embodied AI to operate in open, dynamic, and unstructured real-world settings.
    • Mathematical Formula: Typically measured by Task Success Rate or Cumulative Reward on unseen tasks, novel object instances, or different environmental layouts/conditions. No single formula, but involves comparing performance between training distribution and out-of-distribution (OOD) scenarios.
    • Symbol Explanation: This is an aggregate measure, often reported as $\text{Performance}_{OOD}$ versus $\text{Performance}_{train}$.
  5. Computational Cost / Inference Latency / Memory Footprint:

    • Conceptual Definition: These metrics quantify the resources required by the model for training and deployment. Computational cost (e.g., PFLOPs) measures total operations, inference latency (e.g., milliseconds) measures response time, and memory footprint (e.g., GB) measures RAM/VRAM usage. These are crucial for deploying embodied AI on resource-constrained platforms.
    • Mathematical Formula: Directly measured by hardware monitors or profiling tools.
      • Latency = Time(response) - Time(request)
      • Memory = RAM/VRAM usage
      • PFLOPs is typically reported by specialized benchmarks.
    • Symbol Explanation: Directly corresponds to measured time, memory, or computational operations.

5.3. Baselines

Since this is a survey paper, it does not propose a new method to compare against baselines in its own experiments. However, throughout its review of the literature, the paper implicitly or explicitly refers to various baseline approaches that large model empowered embodied AI methods aim to surpass. These baselines represent the state-of-the-art or traditional methods against which new large model-based approaches are evaluated in the original research papers.

Common baselines and comparison points include:

  1. Traditional Rule-Based and Symbolic Planning Systems:

    • Rule-based methods [59, 75, 126]: Often using PDDL and heuristic search planners, these are baselines for high-level planning. LLM-enhanced planning aims to overcome their limitations in adaptability to unstructured or dynamic scenarios.
    • Classic Control Algorithms [81, 125, 2]: Such as PID, LQR, and MPC, serve as baselines for low-level execution, particularly for foundational skills. Learning-driven and modular control methods aim to enhance their adaptability to high-dimensional and uncertain dynamics.
  2. Traditional (Deep) Reinforcement Learning (DRL) Algorithms:

    • Q-learning [194], SARSA [164], DQN [130], PPO [166], SAC [68]: These represent strong model-free RL baselines. Large model empowered RL seeks to improve sample efficiency, reward function design, and policy expressiveness, especially in complex tasks or offline settings.
    • Decision Transformer [31]: This itself serves as a baseline in offline RL comparisons, particularly when LLMs are used for policy network construction, as seen in Reid [159].
  3. Traditional Imitation Learning (IL) Approaches:

    • Behavior Cloning [53]: The fundamental IL approach that diffusion-based and Transformer-based policy networks aim to significantly improve upon, especially regarding generalization, robustness, and handling multimodal action distributions.
  4. Prior Large Models or Architectures:

    • Specific LLMs/VLMs: Earlier versions of LLMs (e.g., GPT-3, PaLM) or VLMs (CLIP) are often used as components or prior baselines when new multimodal large models or VLA models (RT-2, Octo) are introduced. For instance, OpenVLA [94] is presented as an open-source alternative to RT-2.
    • Latent Space World Models: RSSM and its variants (Dreamer V2/V3) are baselines for newer Transformer-based (Genie, TWM) or diffusion-based world models (Sora, UniPi), which aim to improve prediction accuracy and generative capabilities in complex or pixel-space domains.
  5. Hierarchical vs. End-to-End Paradigms: The paper itself constructs a comparative framework between these two paradigms, where one often serves as a conceptual baseline or alternative to the other, depending on the specific task requirements and design goals.

    These baselines are representative because they either embody the established practices in their respective sub-fields or represent the immediate predecessors that newer large model empowered approaches are designed to outperform or generalize beyond.

6. Results & Analysis

As a comprehensive survey paper, this article does not present its own novel experimental results. Instead, it synthesizes the findings and advancements from numerous prior research works, drawing conclusions about the effectiveness, advantages, and limitations of large model empowered embodied AI across different paradigms and methodologies. The "results" in this context are the summarized insights and comparisons derived from the vast body of literature reviewed by the authors.

6.1. Core Results Analysis

The core results of this survey highlight the transformative impact of large models on embodied AI, particularly in decision-making and embodied learning.

6.1.1. Impact on Autonomous Decision-Making

  • Hierarchical Decision-Making: Large models (especially LLMs) significantly enhance all layers.
    • High-Level Planning: LLMs overcome the adaptability limitations of traditional rule-based planners by enabling structured language planning (e.g., generating PDDL or validating plans), natural language planning (decomposing tasks, filtering impractical actions), and programming language planning (generating executable code). This leads to more dynamic and flexible plans.
    • Low-Level Execution: LLMs integrate with learning-driven control (imitation learning, reinforcement learning) and modular control (calling pretrained models like CLIP or SAM), enhancing adaptability and generalization beyond traditional control algorithms.
    • Feedback and Enhancement: Large models enable advanced feedback mechanisms: self-reflection (iterative plan refinement, error correction), human feedback (real-time guidance, knowledge acquisition), and environment feedback (dynamic plan adjustment based on observations). VLMs simplify this by integrating visual and language reasoning.
  • End-to-End Decision-Making: VLA models represent a breakthrough by directly mapping multimodal inputs to action outputs, minimizing error accumulation and enabling efficient learning. They leverage large models' prior knowledge for precise and adaptable task execution. Enhancements focus on:
    • Perception: Improving robustness to visual noise and enhancing 3D spatial understanding.
    • Trajectory Action Optimization: Using diffusion models or flow matching to generate smoother, more controllable, and high-precision action trajectories.
    • Training Cost Reduction: Developing lightweight architectures, efficient tokenization, and parallel decoding for real-time deployment on resource-constrained devices.

6.1.2. Impact on Embodied Learning

Large models significantly enhance imitation learning and reinforcement learning.

  • Imitation Learning: Diffusion models are used to construct policy networks that capture complex multimodal action distributions, enhancing robustness and expressiveness. Transformer-based policy networks leverage self-attention to model long-term dependencies in trajectories, improving fidelity, consistency, and generalization across tasks and embodiments.
  • Reinforcement Learning:
    • Reward Function Design: Large models automate the generation of dense, interpretable reward signals or reward functions, reducing reliance on manual design and mitigating reward hacking.
    • Policy Network Construction: Diffusion models, Transformer-based architectures, and LLMs empower offline reinforcement learning by enhancing policy expressiveness, capturing long-term dependencies, and leveraging pretrained knowledge for sample-efficient learning and generalization.

6.1.3. Role of World Models

The integration of world models is a crucial finding. They provide agents with internal simulations of environments, allowing them to anticipate future states and comprehend cause-and-effect.

  • Decision-Making: World models enable simulated validation (testing actions without real-world risk) and knowledge augmentation (providing predictive and contextual knowledge for planning).

  • Embodied Learning: They facilitate model-based reinforcement learning by simulating state transitions (for safe, cost-effective, and data-efficient training) and generating synthetic data (augmenting scarce real-world datasets).

    Overall, the survey concludes that large models offer unparalleled capabilities for embodied AI, pushing the field closer to AGI by making agents more capable, adaptable, and efficient. However, significant challenges remain, which are discussed in Section 7.

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

| Survey type | Related surveys | Publication time | Coverage (Large models / Hierarchical / End2end / IL / RL / Other / World model) |
| :--- | :--- | :--- | :--- |
| Specific | [29, 104, 113, 151, 191, 225] | 2024 | × × × × × × |
| Specific | [210] | 2024 | × × |
| Specific | [26] | 2024 | × × × × × × × |
| Specific | [7, 227] | 2025 | × × × × |
| Specific | [188] | 2024 | × × × × × × |
| Specific | [204] | 2024 | × × × × × |
| Specific | [165] | 2025 | × × × × × × |
| Specific | [43, 122] | 2024 | × × × × × × |
| Comprehensive | [119] | 2024 | × × |
| Comprehensive | [190] | 2024 | × √ × × × |
| Comprehensive | [95] | 2024 | × √ √ × × × |
| Comprehensive | [117] | 2024 | √ √ √ √ × × |
| Comprehensive | Ours | — | √ √ √ √ √ √ √ |
  • Analysis of Table 1 (Comparison of Surveys): This table is crucial for establishing the uniqueness and comprehensive nature of the current survey. It clearly shows that many existing surveys are specific (e.g., focusing only on large models, planning, or a particular learning method) or, if comprehensive, they often lack coverage of World Models or the latest end-to-end decision-making and VLA models. The "Ours" row distinctively marks coverage across Large Models, Hierarchical and End-to-end decision-making, Imitation Learning (IL), Reinforcement Learning (RL), Other learning methods, and, uniquely, World Models. This table validates the authors' claim of providing a more systematic and up-to-date review compared to previous works.

    The following are the results from Table 2 of the original paper:

Mainstream VLA models (P: perception, A: trajectory action, C: training cost):

| Model | Architecture | Contributions | Enhancements (P / A / C) |
| :--- | :--- | :--- | :--- |
| RT-2 [234] (2023) | Vision Encoder: ViT-22B/ViT-4B; Language Encoder: PaLI-X/PaLM-E; Action Decoder: Symbol tuning | Pioneering large-scale VLA, jointly fine-tuned on web-based VQA and robotic datasets, unlocking advanced emergent functionalities. | — |
| Seer [63] (2023) | Vision Encoder: Visual backbone; Language Encoder: Transformer-based; Action Decoder: Autoregressive action prediction head | Efficiently predicts future video frames from language instructions by extending a pretrained text-to-image diffusion model. | — |
| Octo [180] (2024) | Vision Encoder: CNN; Language Encoder: T5-base; Action Decoder: Diffusion Transformer | First generalist policy trained on a massive multi-robot dataset (800k+ trajectories); a powerful open-source foundation model. | A |
| OpenVLA [94] (2024) | Vision Encoder: DINOv2 + SigLIP; Language Encoder: Prismatic-7B; Action Decoder: Symbol tuning | An open-source alternative to RT-2 with superior parameter efficiency and strong generalization through efficient LoRA fine-tuning. | C |
| Mobility-VLA [37] (2024) | Vision Encoder: Long-context ViT + goal image encoder; Language Encoder: T5-based instruction encoder; Action Decoder: Hybrid diffusion + autoregressive ensemble | Leverages demonstration tour videos as an environmental prior, using a long-context VLM and topological graphs to navigate from complex multimodal instructions. | P |
| TinyVLA [198] (2025) | Vision Encoder: FastViT with low-latency encoding; Language Encoder: Compact language encoder (128-d); Action Decoder: Diffusion policy decoder (50M parameters) | Outpaces OpenVLA in speed and precision; eliminates pretraining needs; achieves 5x faster inference for real-time applications. | C |
| Diffusion-VLA [196] (2024) | Vision Encoder: Transformer-based visual encoder for contextual perception; Language Encoder: Autoregressive reasoning module with next-token prediction; Action Decoder: Diffusion policy head for robust action sequence generation | Leverages diffusion-based action modeling for precise control; superior contextual awareness and reliable sequence planning. | A |
| Point-VLA [105] (2025) | Vision Encoder: CLIP + 3D point cloud; Language Encoder: Llama-2; Action Decoder: Transformer with spatial token fusion | Excels at long-horizon and spatial reasoning tasks; avoids retraining by preserving pretrained 2D knowledge. | P |
| VLA-Cache [208] (2025) | Vision Encoder: SigLIP with token memory buffer; Language Encoder: Prismatic-7B; Action Decoder: Transformer with dynamic token reuse | Faster inference with near-zero loss; dynamically reuses static features for real-time robotics. | C |
| $\pi_0$ [18] (2024) | Vision Encoder: PaliGemma VLM backbone; Language Encoder: PaliGemma (multimodal); Action Decoder: Flow-matching action expert | Employs flow matching to produce smooth, high-frequency (50Hz) action trajectories for real-time control. | A |
| $\pi_0$-Fast [143] (2025) | Vision Encoder: PaliGemma VLM backbone; Language Encoder: PaliGemma (multimodal); Action Decoder: Autoregressive Transformer with FAST | Introduces an efficient action tokenization scheme based on the Discrete Cosine Transform (DCT), enabling autoregressive models to handle high-frequency tasks and significantly speeding up training. | A, C |
| Edge-VLA [25] (2025) | Vision Encoder: SigLIP + DINOv2; Language Encoder: Qwen2 (0.5B parameters); Action Decoder: Joint control prediction (non-autoregressive) | Streamlined VLA tailored for edge devices, delivering 30-50Hz inference with OpenVLA-comparable performance, optimized for low-power, real-time deployment. | C |
| OpenVLA-OFT [92] (2025) | Vision Encoder: SigLIP + DINOv2 (multi-view); Language Encoder: Llama-2 7B; Action Decoder: Parallel decoding with action chunking and L1 regression | An optimized fine-tuning recipe for VLAs that integrates parallel decoding and a continuous action representation to improve inference speed and task success. | C |
| Spatial-VLA [147] (2025) | Vision Encoder: SigLIP from PaLiGemma2 4B; Language Encoder: PaLiGemma2; Action Decoder: Adaptive Action Grids with autoregressive Transformer | Enhances spatial intelligence by injecting 3D information via "Ego3D Position Encoding" and representing actions with "Adaptive Action Grids". | P |
| MoLe-VLA [219] (2025) | Vision Encoder: Multi-stage ViT with STAR router; Language Encoder: CogKD-enhanced Transformer; Action Decoder: Sparse Transformer with dynamic routing | A brain-inspired architecture that uses dynamic layer-skipping (Mixture-of-Layers) and knowledge distillation to improve efficiency. | C |
| DexGrasp-VLA [230] (2025) | Vision Encoder: Object-centric spatial ViT; Language Encoder: Transformer with grasp sequence reasoning; Action Decoder: Diffusion controller for grasp pose generation | A hierarchical framework for general dexterous grasping, using a VLM for high-level planning and a diffusion policy for low-level control. | A |
| Dex-VLA [197] (2025) | — | A large plug-in diffusion-based action expert and an embodiment curriculum learning strategy for efficient cross-robot training and adaptation. | — |
  • Analysis of Table 2 (Mainstream VLA Models): This table provides a valuable snapshot of the rapidly evolving VLA landscape. It demonstrates a clear trend toward:

    • Diverse Architectures: Utilizing various Vision Encoders (ViT, CNN, DINOv2, SigLIP, CLIP, 3D Point Cloud), Language Encoders (PaLM, T5-base, Prismatic-7B, Llama-2, PaliGemma, Qwen2), and Action Decoders (Symbol-tuning, Autoregressive action prediction head, Diffusion Transformer, Flow matching).
    • Focused Enhancements: Models like Mobility-VLA, Point-VLA, and Spatial-VLA prioritize Perception (P) (e.g., long-context VLM, 3D information). Octo, Diffusion-VLA, $\pi_0$, and DexGrasp-VLA focus on Trajectory Action (A) optimization, often through diffusion models or flow matching for smoother, more precise control. OpenVLA, TinyVLA, VLA-Cache, $\pi_0$-Fast, Edge-VLA, OpenVLA-OFT, and MoLe-VLA aim for Training Cost (C) reduction, emphasizing parameter efficiency, faster inference, and deployment on edge devices.
    • Generalization and Open-Source: Models like Octo and Open-VLA highlight the shift towards generalist policies and open-source foundation models trained on massive datasets. This table effectively illustrates the diverse strategies employed to improve VLA models' perception, action generation, and efficiency for end-to-end embodied AI.

The following are the results from Table 3 of the original paper:

| Aspect | Hierarchical | End-to-End |
| :--- | :--- | :--- |
| Architecture | Perception: dedicated modules (e.g., SLAM, CLIP); High-level planning: structured, language, program; Low-level execution: predefined skill lists; Feedback: LLM self-reflection, human, environment | Perception: integrated in tokenization; Planning: implicit via VLA pretraining; Action generation: autoregressive generation or diffusion-based decoders; Feedback: inherent in the closed-loop cycle |
| Performance | Reliable in structured tasks; Limited in dynamic settings | Superior in complex, open-ended tasks with strong generalization; Dependent on training data |
| Interpretability | High, with clear modular design | Low, due to the black-box nature of neural networks |
| Generalization | Limited, due to reliance on human-designed structures | Strong, driven by large-scale pretraining; Sensitive to data gaps |
| Real-time | Low, inter-module communications may introduce delays in complex scenarios | High, direct perception-to-action mapping minimizes processing overhead |
| Computational Cost | Moderate, with independent module optimization but coordination overhead | High, requiring significant resources for training |
| Application | Suitable for industrial automation, drone navigation, autonomous driving | Suitable for domestic robots, virtual assistants, human-robot collaboration |
| Advantages | High interpretability; High reliability; Easy to integrate domain knowledge | Seamless multimodal integration; Efficient in complex tasks; Minimal error accumulation |
| Limitations | Sub-optimal, due to module coordination issues; Low adaptability to unstructured settings | Low interpretability; High dependency on training data; High computational costs; Low generalization in out-of-distribution scenarios |
  • Analysis of Table 3 (Comparison of Hierarchical and End-to-End Decision-Making): This table provides a concise yet comprehensive comparison, clarifying the trade-offs between the two major decision-making paradigms.
    • Architecture & Performance: Hierarchical systems emphasize modularity and explicit planning, offering high interpretability and reliability in structured tasks. However, they can suffer from sub-optimal solutions due to inter-module coordination issues and limited adaptability to dynamic settings. In contrast, End-to-End systems (like VLA models) integrate perception-to-action mapping, resulting in superior performance and strong generalization in complex, open-ended tasks, often with high real-time capabilities due to minimized processing overhead.
    • Interpretability & Generalization: The black-box nature of end-to-end neural networks leads to low interpretability compared to the clear modular design of hierarchical systems. While hierarchical systems have limited generalization due to human-designed structures, end-to-end systems achieve strong generalization through large-scale pretraining but are sensitive to data gaps.
    • Computational Cost & Applications: Hierarchical systems have moderate computational cost but can incur coordination overhead. End-to-End systems, especially during training, have high computational costs. Applications diverge: hierarchical for industrial automation, drone navigation, autonomous driving, and end-to-end for domestic robots, virtual assistants, and human-robot collaboration. The table effectively summarizes that while hierarchical approaches offer control and transparency, end-to-end approaches provide adaptability and seamlessness, with the choice depending on task complexity, interpretability needs, and available computational resources.

6.3. Ablation Studies / Parameter Analysis

As a survey paper, the article does not present its own ablation studies or parameter analyses. However, the discussions throughout the paper implicitly highlight the importance of such analyses performed by the original research works it cites. For example:

  • Effectiveness of LLM components in hierarchical planning: Studies on structured language planning (e.g., LLV [9], FSP-LLM [175]) would involve ablation studies to demonstrate the value of external validators or optimized prompt engineering in generating feasible plans.

  • Contribution of RLHF in reward function design: Eureka [120] and Text2Reward [205] would have analyzed how human feedback or automated iterative optimization (Eureka) improves the quality and density of generated reward functions compared to manual or simpler LLM-based approaches.

  • Impact of diffusion models vs. Transformers in policy networks: Papers like Diffusion Policy [36] or Decision Transformer [31] would have conducted experiments to show how their respective architectures (diffusion for diverse action distributions, Transformer for sequence modeling) improve imitation learning or offline RL performance.

  • Efficiency gains in VLA models: Works like TinyVLA [198] or OpenVLA-OFT [92] would have performed analyses on factors like model compression techniques (knowledge distillation, quantization), action tokenization schemes, or parallel decoding to quantify their impact on inference speed, memory footprint, and data efficiency.

  • Role of World Models components: Research on latent space world models (Dreamer V2/V3 [72, 73]) or Transformer-based world models (IRIS [129]) would typically include ablations on components like stochastic vs. deterministic state representations or attention mechanisms to show their contribution to prediction accuracy and sample efficiency.

    These implicit analyses from the surveyed literature collectively inform the conclusions drawn about the strengths and weaknesses of different approaches and the efficacy of various large model enhancements.

7. Conclusion & Reflections

7.1. Conclusion Summary

This comprehensive survey rigorously analyzes the burgeoning field of large model empowered embodied AI, focusing on two fundamental pillars: autonomous decision-making and embodied learning. The authors systematically detail how the capabilities of large models have revolutionized both hierarchical and end-to-end decision-making paradigms, enhancing planning, execution, and feedback mechanisms. Furthermore, the survey elucidates the profound impact of large models on embodied learning methods, particularly imitation learning and reinforcement learning. A key unique contribution is the integration and in-depth discussion of world models, highlighting their design and critical role in facilitating simulated validation, knowledge augmentation, state transitions, and data generation for robust embodied intelligence. The paper concludes that large models have unlocked unprecedented intelligent capabilities for embodied agents, making substantial strides towards Artificial General Intelligence (AGI).

7.2. Limitations & Future Work

Despite the significant advances, the survey identifies several persistent challenges that define the frontier of embodied AI research:

  1. Scarcity of Embodied Data:

    • Limitation: Real-world robotic data is immensely diverse and complex to collect, leading to datasets (e.g., VIMA, RT-1) that are orders of magnitude smaller than those for vision-language models (e.g., LAION-5B). This hinders generalization and scalability. Direct transfer from human video datasets (e.g., Ego4D) faces misalignment issues due to morphological differences between humans and robots.
    • Future Work:
      • Leveraging world models (especially diffusion-based) to synthesize high-quality new data from existing agent experiences (SynthER [118]).
      • Improving techniques for integrating large human datasets while addressing reality gap and alignment issues.
  2. Continual Learning for Long-Term Adaptability:

    • Limitation: Embodied AI systems need to continually update knowledge and optimize strategies in dynamic environments while avoiding catastrophic forgetting of previously acquired skills. Efficient autonomous exploration to balance new experiences with existing knowledge remains challenging in high-dimensional, sparse-reward scenarios. The unpredictability of the real world (sensor degradation, mechanical wear) further complicates learning.
    • Future Work:
      • Enhancing experience replay [10] and regularization techniques [98] to mitigate catastrophic forgetting.
      • Developing data mixing strategies [100] to reduce feature distortion.
      • Improving self-supervised learning to drive active exploration via intrinsic motivation.
      • Incorporating multi-agent collaboration mechanisms to accelerate individual learning.
  3. Computation and Deployment Efficiency:

    • Limitation: The sophisticated nature of large models demands substantial computational resources (e.g., Diffusion-VLA [196] requires hundreds of GPUs and weeks of training, resulting in seconds of inference latency). High memory footprints (e.g., RT-2 [234] requiring 20GB video memory) hinder deployment on resource-constrained edge devices. Cloud-based deployment faces issues of data privacy, security, and real-time constraints.
    • Future Work:
      • Further developing Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA [82] to reduce fine-tuning costs.
      • Advancing model compression techniques (e.g., knowledge distillation, quantization) to deploy models on limited hardware (e.g., TinyVLA [198] achieving low latency and memory footprint).
      • Designing inherently lightweight architectures and hardware acceleration (e.g., MiniGPT-4 [232]).
  4. Sim-to-Real Gap:

    • Limitation: Training agents in simulators is scalable and cost-effective, but fundamental discrepancies between simulated and real-world environments (e.g., inaccurate physical dynamics, visual rendering differences) lead to the sim-to-real gap. Policies trained in simulation often fail unexpectedly in reality, especially in out-of-distribution scenarios. Modeling real-world complexity accurately is inherently challenging, and small errors accumulate in long-term decision-making.
    • Future Work:
      • Developing advanced simulators with more precise physics modeling and photorealistic rendering (e.g., Genesis [134]).
      • Researching domain adaptation and domain randomization techniques to make policies robust to variations between simulation and reality (see the randomization sketch after this list).
      • Exploring methods that can learn directly from real-world interactions more efficiently, possibly with world models aiding in data generation.
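
To make the continual-learning direction above more concrete, the following is a minimal, hedged sketch of rehearsal-based experience replay: a bounded buffer of past-task samples is mixed into each new-task batch to reduce catastrophic forgetting. The `RehearsalBuffer` class, the `(observation, action)` sample format, and `replay_ratio` are illustrative assumptions, not APIs from the surveyed works.

```python
# Hedged sketch of rehearsal-based continual learning. The buffer stores
# samples from earlier tasks; `replay_ratio` controls how much old data is
# mixed into each new-task batch. All names are illustrative.
import random
from collections import deque

class RehearsalBuffer:
    """Bounded store of past-task samples used for replay during later training."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest samples are evicted first

    def add(self, sample):
        self.buffer.append(sample)            # e.g., an (observation, action) pair

    def sample(self, k):
        k = min(k, len(self.buffer))
        return random.sample(list(self.buffer), k) if k > 0 else []

def mixed_batch(new_task_batch, replay_buffer, replay_ratio=0.5):
    """Combine fresh samples with replayed old-task samples for one update step."""
    n_replay = int(len(new_task_batch) * replay_ratio)
    return list(new_task_batch) + replay_buffer.sample(n_replay)
```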
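
For the parameter-efficiency direction, the sketch below shows the core idea behind LoRA-style adapters: the pretrained weight is frozen and only a low-rank update is trained. This is a minimal PyTorch illustration under assumed hyperparameters (`r`, `alpha`), not the reference implementation from [82].

```python
# Hedged sketch of a LoRA-style adapter in PyTorch: the pretrained layer is
# frozen and only the low-rank factors A and B are trained, so fine-tuning
# touches a small fraction of the parameters.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Computes base(x) + (alpha / r) * B(A(x)) with the base layer frozen."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # freeze pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.lora_A.weight, std=0.01)
        nn.init.zeros_(self.lora_B.weight)          # low-rank update starts at zero
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))
```

Wrapping, for instance, the attention projections of a pretrained policy backbone with such adapters leaves the backbone intact while keeping the number of trainable parameters small.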
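
For the sim-to-real direction, a hedged sketch of domain randomization is given below: simulator parameters are re-sampled every episode so the learned policy cannot overfit to a single set of dynamics. The `sim.set_params` / `sim.rollout` interface and the parameter ranges are hypothetical placeholders for whatever simulator is actually used.

```python
# Hedged sketch of domain randomization. `sim` and `policy` are hypothetical
# stand-ins; the parameter ranges are placeholders, not values from the
# surveyed papers.
import random

def randomized_physics():
    """Sample a fresh set of dynamics and rendering parameters for one episode."""
    return {
        "friction":   random.uniform(0.5, 1.5),
        "mass_scale": random.uniform(0.8, 1.2),
        "latency_ms": random.uniform(0.0, 40.0),
        "light_hue":  random.uniform(-0.1, 0.1),
    }

def train_with_randomization(sim, policy, episodes=1000):
    """Re-randomize the simulated world every episode so the policy sees wide variation."""
    for _ in range(episodes):
        sim.set_params(**randomized_physics())    # new dynamics/appearance each episode
        trajectory = sim.rollout(policy)          # collect experience under them
        policy.update(trajectory)                 # any IL/RL update rule
```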

7.3. Personal Insights & Critique

This survey provides an exceptionally timely and well-structured overview of the rapidly evolving large model empowered embodied AI landscape. The "dual analytical approach" is particularly effective in clarifying the complex interplay between diverse large models, decision-making paradigms, and learning methodologies. The inclusion of world models as a distinct and critical component is a significant strength, reflecting their growing importance in enabling more intelligent and autonomous agents. The detailed comparisons in Table 1, Table 2, and Table 3 are invaluable for researchers seeking to understand the current state-of-the-art and identify promising avenues.

One key inspiration drawn from this paper is the sheer potential of VLA models for end-to-end decision-making. The idea of directly mapping multimodal observations and instructions to actions, bypassing complex modular pipelines, represents a paradigm shift that could significantly simplify robot programming and enhance adaptability. The rapid advancements in trajectory action optimization and training cost reduction for VLA models suggest that real-time, general-purpose embodied agents might be closer than previously imagined. The concept of flow matching for continuous action generation, in particular, seems like a powerful technique for achieving both precision and efficiency.
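
To illustrate why flow matching is attractive for continuous action generation, here is a minimal, hedged sketch in PyTorch: training regresses the constant velocity of a straight-line path from Gaussian noise to the expert action, and inference integrates the learned ODE with a handful of Euler steps. The `velocity_net(x_t, t, obs_emb)` module and the tensor shapes are assumptions for illustration; surveyed VLA systems differ in their exact conditioning and schedules.

```python
# Hedged sketch of flow matching for a continuous action head in PyTorch.
# `velocity_net` is an assumed network predicting a velocity vector; shapes
# (batch, action_dim) are illustrative.
import torch
import torch.nn as nn

def flow_matching_loss(velocity_net: nn.Module, obs_emb: torch.Tensor, action: torch.Tensor):
    """Regress the constant velocity of a straight path from noise to the expert action."""
    noise = torch.randn_like(action)                           # x_0 ~ N(0, I)
    t = torch.rand(action.shape[0], 1, device=action.device)   # random time in [0, 1)
    x_t = (1.0 - t) * noise + t * action                       # point on the linear path
    target_velocity = action - noise                           # d x_t / dt along that path
    pred_velocity = velocity_net(x_t, t, obs_emb)
    return ((pred_velocity - target_velocity) ** 2).mean()

@torch.no_grad()
def sample_action(velocity_net: nn.Module, obs_emb: torch.Tensor, action_dim: int, steps: int = 10):
    """Integrate the learned ODE from noise to an action with a few Euler steps."""
    x = torch.randn(obs_emb.shape[0], action_dim, device=obs_emb.device)
    for i in range(steps):
        t = torch.full((obs_emb.shape[0], 1), i / steps, device=obs_emb.device)
        x = x + velocity_net(x, t, obs_emb) / steps
    return x
```

Because the path is straight, only a handful of integration steps are typically needed at inference, which is one reason the approach appeals for low-latency action generation.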

However, one assumption that deserves scrutiny is the degree of generalization attributed to large models when they are applied to embodied AI. While LLMs and VLMs show impressive generalization on digital data, the sim-to-real gap and the scarcity of embodied data remain formidable hurdles. The paper acknowledges these limitations, but the implied path forward relies heavily on either synthesizing more realistic data or improving transfer from internet-scale data. A critical question remains: how much of the linguistic and visual common sense learned from the internet actually translates into the robust physical common sense required for safe interaction in dynamic real-world environments? The black-box nature of end-to-end VLA models, while architecturally seamless, may also mask unverified assumptions about the underlying physical reasoning, making debugging and safety assurance difficult in safety-critical applications.

The methods and conclusions of this paper are highly transferable and applicable to various domains beyond traditional robotics, such as autonomous driving, virtual reality/augmented reality agents, smart manufacturing, and elderly care robots. Any system requiring intelligent agents to operate physically in dynamic, uncertain environments can benefit from these insights.

In critique, while the survey is extensive, it primarily summarizes existing work. Future iterations could perhaps delve deeper into the mechanisms by which large models specifically acquire physical common sense or causal reasoning relevant to the embodied world, rather than simply stating that they enhance planning or perception. For example, what specific architectural choices or training objectives allow LLMs to reason about object permanence or physical forces beyond abstract language correlation? This would bridge a crucial gap in understanding how large models genuinely "embody" intelligence.
