Large Model Empowered Embodied AI: A Survey on Decision-Making and Embodied Learning
TL;DR Summary
This survey discusses large model empowered embodied AI, focusing on autonomous decision-making and embodied learning. It explores hierarchical and end-to-end decision paradigms, highlighting how large models enhance decision processes and Vision-Language-Action models, how they strengthen imitation and reinforcement learning, and how world models support both; it closes with open challenges and future research directions.
Abstract
Embodied AI aims to develop intelligent systems with physical forms capable of perceiving, deciding, acting, and learning in real-world environments, providing a promising path toward Artificial General Intelligence (AGI). Despite decades of exploration, it remains challenging for embodied agents to achieve human-level intelligence for general-purpose tasks in open dynamic environments. Recent breakthroughs in large models have revolutionized embodied AI by enhancing perception, interaction, planning and learning. In this article, we provide a comprehensive survey on large model empowered embodied AI, focusing on autonomous decision-making and embodied learning. We investigate both hierarchical and end-to-end decision-making paradigms, detailing how large models enhance high-level planning, low-level execution, and feedback for hierarchical decision-making, and how large models enhance Vision-Language-Action (VLA) models for end-to-end decision-making. For embodied learning, we introduce mainstream learning methodologies, elaborating on how large models enhance imitation learning and reinforcement learning in depth. For the first time, we integrate world models into the survey of embodied AI, presenting their design methods and critical roles in enhancing decision-making and learning. Though solid advances have been achieved, challenges still exist, which are discussed at the end of this survey, potentially serving as further research directions.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is Large Model Empowered Embodied AI: A Survey on Decision-Making and Embodied Learning. It focuses on how large models enhance embodied artificial intelligence, specifically in the areas of autonomous decision-making and embodied learning.
1.2. Authors
The authors are Wenlong Liang, Rui Zhou, Yang Ma, Bing Zhang, Songlin Li, Yijia Liao, and Ping Kuang. All authors are affiliated with the University of Electronic Science and Technology of China, China.
1.3. Journal/Conference
This paper is published as a preprint on arXiv (Original Source Link: https://arxiv.org/abs/2508.10399). As a preprint, it has not yet undergone formal peer review by a journal or conference. However, arXiv is a highly influential platform for rapid dissemination of research in physics, mathematics, computer science, and related fields, allowing researchers to share their work before formal publication. Papers on arXiv are often later published in top-tier conferences or journals.
1.4. Publication Year
The paper was published on arXiv at 2025-08-14T06:56:16 (UTC), indicating a publication year of 2025.
1.5. Abstract
This survey paper provides a comprehensive overview of large model empowered embodied AI, specifically focusing on autonomous decision-making and embodied learning. The authors investigate two decision-making paradigms: hierarchical and end-to-end. For hierarchical decision-making, the paper details how large models enhance high-level planning, low-level execution, and feedback. For end-to-end decision-making, it elaborates on how large models improve Vision-Language-Action (VLA) models. In terms of embodied learning, the survey introduces mainstream methodologies, explaining how large models enhance imitation learning and reinforcement learning. Uniquely, it integrates world models into the embodied AI survey, discussing their design and critical roles in enhancing both decision-making and learning. Finally, the paper addresses existing challenges and proposes future research directions, aiming to provide a theoretical framework and practical guidance for researchers in the field.
1.6. Original Source Link
The official source is a preprint on arXiv: https://arxiv.org/abs/2508.10399.
The PDF link is: https://arxiv.org/pdf/2508.10399v1.pdf.
This paper is currently a preprint and has not been formally published in a journal or conference at the time of this analysis.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is enabling embodied AI systems to achieve human-level intelligence for general-purpose tasks in open, unstructured, and dynamic environments. Embodied AI (systems with physical forms that perceive, decide, act, and learn in the real world) is seen as a promising path toward Artificial General Intelligence (AGI).
This problem is highly important because despite decades of exploration, embodied agents still struggle with generalization and transferability across diverse scenarios. Early systems relied on rigid preprogrammed rules, limiting adaptability. While deep learning improved capabilities, models were often task-specific. Recent breakthroughs in large models (such as Large Language Models, Vision-Language Models, etc.) have significantly enhanced perception, interaction, planning, and learning abilities in embodied AI, laying foundations for general-purpose embodied agents. However, the field is still nascent, facing challenges in generalization, scalability, and seamless environmental interaction.
The paper's entry point is to provide a comprehensive and systematic review of these recent advances in large model empowered embodied AI. It addresses existing gaps in the literature, where current studies are scattered, lack systematic categorization, and often miss the latest advancements, particularly Vision-Language-Action (VLA) models and end-to-end decision-making. The innovative idea is to systematically analyze how large models empower the core aspects of embodied AI: autonomous decision-making and embodied learning, including a novel integration of world models.
2.2. Main Contributions / Findings
The paper's primary contributions are summarized as follows:
- Focus on Large Model Empowerment from an Embodied AI Viewpoint: The survey systematically categorizes and reviews works based on how large models empower hierarchical decision-making (high-level planning, low-level execution, feedback enhancement) and end-to-end decision-making (via VLA models). For embodied learning, it details how large models enhance imitation learning (policy and strategy network construction) and reinforcement learning (reward function design and policy network construction).
- Comprehensive Review of Embodied Decision-Making and Learning: It provides a detailed review of both hierarchical and end-to-end decision-making paradigms, comparing them in depth. For embodied learning, it covers imitation learning, reinforcement learning, transfer learning, and meta-learning. Crucially, it integrates world models into the survey for the first time in this context, discussing their design and impact on decision-making and learning.
- Dual Analytical Approach for In-Depth Insights: The survey employs a dual analytical methodology, combining horizontal analysis (comparing diverse approaches such as different large models, decision-making paradigms, and learning strategies) with vertical analysis (tracing the evolution, advances, and challenges of core models and methods). This approach aims to provide both a macro-level overview and deep insights into mainstream embodied AI methods.

The key conclusions and findings are that large models have revolutionized embodied AI by significantly enhancing perception, interaction, planning, and learning capabilities. They enable more robust and versatile embodied AI systems through improved decision-making (both hierarchical and end-to-end) and more efficient embodied learning (particularly imitation learning and reinforcement learning). The integration of world models further boosts these capabilities by allowing agents to simulate environments and predict outcomes. Despite these advancements, significant challenges remain, including the scarcity of high-quality embodied data, continual learning for long-term adaptability, computational and deployment efficiency, and the sim-to-real gap. Addressing these challenges is crucial on the future path toward AGI.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a foundational grasp of Embodied AI and Large Models is essential.
3.1.1. Embodied AI
Embodied AI refers to intelligent systems that possess a physical form (like robots, intelligent vehicles) and are capable of perceiving their environment, making decisions, performing actions, and learning through interaction within real-world settings. The core belief is that true intelligence emerges from this physical interaction, making it a promising pathway toward Artificial General Intelligence (AGI).
Key aspects of Embodied AI highlighted in the paper include:
- Physical Entities: The hardware components, such as humanoid robots, quadruped robots, or intelligent vehicles, that interact with the physical world. They execute actions and receive sensory feedback.
- Intelligent Agents: The cognitive software core that enables autonomous decision-making and learning.
- Human-like Learning and Problem-Solving Paradigm: Agents interpret human instructions, explore surroundings, perceive multimodal information (e.g., vision, language), and execute tasks. This mimics how humans learn from diverse resources, assess environments, plan, mentally simulate, and adapt.
- Autonomous Decision-Making: How agents translate perceptions and task understanding into executable actions. The paper discusses two main approaches:
  - Hierarchical Paradigm: Separates functionalities into distinct modules for perception, planning, and execution.
  - End-to-End Paradigm: Integrates these functionalities into a unified framework, often directly mapping multimodal inputs to actions.
- Embodied Learning: The process by which agents refine their behavioral strategies and cognitive models autonomously through long-term environmental interactions, aiming for continual improvement. Key methods include Imitation Learning, Reinforcement Learning, Transfer Learning, and Meta-Learning.
- World Models: Internal simulations or representations of the environment that allow agents to anticipate future states, understand cause-and-effect, and plan without constant real-world interaction.
3.1.2. Large Models (LMs)
Large Models are neural network models characterized by their massive scale (billions or trillions of parameters), extensive training data, and typically a Transformer-based architecture. They have shown impressive perception, reasoning, and interaction capabilities. The paper categorizes them into several types:
- Large Language Model (LLM): Models trained on vast text corpora to understand, generate, and reason with natural language. Examples include BERT, the GPT series (GPT-1, GPT-2, GPT-3, ChatGPT, InstructGPT, GPT-3.5), Codex, PaLM, Vicuna, and the Llama series. They serve as cognitive backbones, processing linguistic inputs and generating responses.
- Large Vision Model (LVM): Models specialized in processing visual information. Examples include Vision Transformer (ViT), DINO, DINOv2, Masked Autoencoder (MAE), and Segment Anything Model (SAM). They are used for perception tasks like object recognition, pose estimation, and segmentation.
- Large Vision-Language Model (LVLM): Models that integrate visual and linguistic processing, allowing them to understand visual inputs and respond to vision-related queries using language. Examples include CLIP, BLIP, BLIP-2, Flamingo, GPT-4V, and DeepSeek-V3. They enhance agents' ability to understand human instructions across text and vision.
- Multimodal Large Model (MLM): More general models that can process diverse modalities beyond just vision and language, potentially including audio, tactile signals, etc. They can be multimodal-input text-output (e.g., Video-Chat, VideoLLaMA, Gemini, PaLM-E) or multimodal-input multimodal-output (e.g., DALL·E, DALL·E 2, DALL·E 3, Sora). These models are crucial for Embodied Large Models (ELM) or Embodied Multimodal Large Models (EMLM).
- Vision-Language-Action (VLA) Model: A specialized type of MLM or LVLM whose core objective is to directly map multimodal inputs (visual observations, linguistic instructions) to action outputs. This creates a streamlined pipeline for end-to-end decision-making in robots, improving perception-action integration. Examples include RT-2, BYO-VLA, 3D-VLA, PointVLA, Octo, Diffusion-VLA, TinyVLA, and π0.
3.1.3. General Capability Enhancements (GCE) of Large Models
To overcome limitations such as limited reasoning ability, hallucination, computational cost, and task specificity, several techniques have been proposed (a minimal ReAct-style sketch follows this list):
- In-Context Learning (ICL): Enables zero-shot generalization by carefully designing prompts, allowing LMs to tackle new tasks without additional training.
- X of Thoughts (XoT): A family of reasoning frameworks that guide LMs through intermediate steps:
  - Chain of Thoughts (CoT): Incorporates step-by-step reasoning into prompts.
  - Tree of Thoughts (ToT): Explores multiple reasoning paths in a tree structure.
  - Graph of Thoughts (GoT): Uses a graph structure to represent intermediate states and dependencies, enabling non-linear reasoning.
- Retrieval Augmented Generation (RAG): Retrieves relevant information from external knowledge bases (databases, the web) to augment LM responses, addressing outdated or incomplete knowledge.
- Reasoning and Acting (ReAct): Integrates reasoning with action execution, allowing LMs to produce explicit reasoning traces before acting, enhancing decision transparency.
- Reinforcement Learning from Human Feedback (RLHF): Aligns LMs with human preferences and values by using human feedback as a reward signal during training, improving helpfulness, harmlessness, and honesty.
- Model Context Protocol (MCP): A standardized interface for LMs to interact with external data sources, tools, and services, enhancing interoperability.
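To make the ReAct pattern concrete, the following is a minimal sketch of a reasoning-acting loop in Python. The `llm_complete` and `execute_tool` callables, as well as the "Thought/Action/Observation" prompt convention, are illustrative assumptions rather than the API of any specific framework discussed in the survey.

```python
# Minimal ReAct-style loop: the model alternates "Thought"/"Action" steps,
# and tool observations are appended to the prompt before the next step.
# llm_complete() and execute_tool() are hypothetical stand-ins.

def react_loop(task: str, llm_complete, execute_tool, max_steps: int = 5) -> str:
    prompt = (
        "Solve the task by alternating Thought and Action lines.\n"
        f"Task: {task}\n"
    )
    for _ in range(max_steps):
        step = llm_complete(prompt)          # e.g. "Thought: ...\nAction: search('...')"
        prompt += step + "\n"
        if "Final Answer:" in step:          # the model decided it is done
            return step.split("Final Answer:")[-1].strip()
        if "Action:" in step:
            action = step.split("Action:")[-1].strip()
            observation = execute_tool(action)   # run the tool / environment call
            prompt += f"Observation: {observation}\n"
    return "No answer within step budget"
```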
3.2. Previous Works
The paper primarily positions itself as a comprehensive survey that integrates various aspects of large model empowered embodied AI, distinguishing itself from prior reviews. It identifies several categories of related surveys:
- Specific Surveys on Large Models Themselves ([29, 104, 113, 151, 191, 225]): These focus on LLMs or VLMs but do not specifically address their synergy with embodied agents.
- Surveys on Components of Embodied AI:
  - Planning ([188]): Focuses only on planning in embodied AI.
  - Learning ([7, 26, 204]): Concentrates on learning methods, sometimes with LLM integration, but often without a systematic analysis of the overall embodied AI paradigm or comprehensive integration of world models and VLA models.
  - Simulators ([201]): Reviews simulators.
  - Applications ([157, 201, 209]): Focuses on embodied AI applications.
- Comprehensive Surveys (with limitations):
  - Some surveys ([48, 220]) are identified as outdated due to the rapid development of the field.
  - Others, while comprehensive, miss recent advances such as VLA models and end-to-end decision-making ([190, 95]).
  - The review [119] provides a detailed introduction to VLA models but lacks comparison with the hierarchical paradigm and detailed exploration of learning methods.
  - [117] is listed as a comprehensive survey that covers large models, hierarchical and end-to-end decision-making, IL, RL, and other learning methods, but still lacks world models.

The paper also refers to specific seminal works that underpin the evolution of large models and embodied AI:
- BERT [42] (2018): Google's bidirectional Transformer for natural language understanding.
- GPT [149] (2018): OpenAI's generative Transformer, a breakthrough in text generation.
- GPT-3 [54] (2020): A milestone for its vast capacity and zero-/few-shot learning.
- Vision Transformer (ViT) [45] (2020): Adapted the Transformer for computer vision.
- CLIP [148] (2021): OpenAI's LVLM aligning image and text via contrastive learning.
- RT-2 [234] (2023): Pioneering VLA model, using a pretrained VLM to discretize the action space.
- Sora [22] (2024): OpenAI's video generative model, an example of diffusion-based world models.
3.3. Technological Evolution
The field of Embodied AI has evolved through several stages:
- Early Systems (Symbolic Reasoning & Behaviorism): In the early decades, embodied AI systems (e.g., [21, 200]) were largely based on rigid preprogrammed rules and symbolic reasoning. These systems excelled in highly controlled environments (like manufacturing robots) but lacked adaptability and generalization to open, dynamic settings.
- Deep Learning Era: The advent of machine learning and deep learning ([99, 133]) marked a turning point. Vision-guided planning and reinforcement learning-based control ([173]) significantly reduced the need for precise environment modeling. However, these models were often task-specific and still faced challenges in generalization across diverse scenarios.
- Large Model Revolution: Recent breakthroughs in large models ([149, 150, 182, 183]) have propelled embodied AI forward. These models, with their robust perception, interaction, planning, and learning capabilities, are now laying the foundation for general-purpose embodied agents ([137]). This era is characterized by the integration of LLMs, VLMs, and MLMs, and the emergence of VLA models, fundamentally changing how embodied agents perceive, reason, and act. The paper focuses on this current, large model empowered stage.
3.4. Differentiation Analysis
Compared to the main methods and existing surveys in related work, this paper's core differences and innovations lie in its comprehensive scope and analytical approach:
- Integrated Focus on Large Model Empowerment: Unlike surveys focusing solely on large models or specific embodied AI components (e.g., planning, learning, simulators), this paper provides a holistic view of how large models empower the entire embodied AI system, specifically decision-making and learning.
- Systematic Categorization of Decision-Making: It offers a detailed comparison and analysis of both hierarchical and end-to-end decision-making paradigms, highlighting the role of large models in each. This includes a dedicated discussion of VLA models, which are a very recent and prominent development.
- In-depth Analysis of Embodied Learning: The survey covers imitation learning, reinforcement learning, transfer learning, and meta-learning, with a specific emphasis on how large models enhance these methodologies.
- Novel Integration of World Models: For the first time, it integrates world models into a survey of embodied AI, presenting their design methods and critical roles in enhancing decision-making and learning. This is a significant addition, as world models are increasingly recognized for their potential to enable more efficient and robust embodied intelligence.
- Dual Analytical Methodology: The paper explicitly states its use of horizontal and vertical analysis, which allows for both broad comparisons across diverse approaches and deep dives into the evolution and challenges of specific methods. This aims to provide a more structured and insightful overview than previous, potentially scattered, reviews.
- Timeliness: By being published in 2025, it aims to capture the latest progress, including VLA models and end-to-end decision-making, which earlier surveys might have missed.
4. Methodology
The survey systematically deconstructs large model empowered embodied AI by focusing on autonomous decision-making and embodied learning, integrating world models as a crucial component. The overall organization is depicted in Figure 1 of the original paper.
The following figure (Figure 1 from the original paper) shows the overall organization of this survey:
This figure is a schematic diagram of the decision-making and learning framework of large model empowered embodied AI. It details the basic concepts of embodied AI, autonomous decision-making, embodied learning methods, and the design of world models, covering recent developments in imitation learning and reinforcement learning.
4.1. Hierarchical Autonomous Decision-Making
Hierarchical autonomous decision-making in embodied AI breaks down complex tasks into a structured series of modules: perception and interaction, high-level planning, low-level execution, and feedback and enhancement. Traditionally, these layers relied on vision models, predefined logical rules, and classic control algorithms, respectively. While effective in structured environments, this approach struggled with adaptability and holistic optimization in dynamic settings. The advent of large models has revolutionized this paradigm by injecting robust learning, reasoning, and generalization capabilities into each layer.
The following figure (Figure 5 from the original paper) illustrates the hierarchical decision-making paradigm:
This figure is a schematic diagram of the large model empowered embodied decision-making process, comprising three main modules: high-level planning, low-level execution, and feedback enhancement. High-level planning uses natural and programming languages to generate instructions; low-level execution combines traditional control algorithms with large vision models; and the feedback-enhancement module optimizes the decision flow through human and environmental feedback.
4.1.1. High-Level Planning
High-level planning is responsible for generating reasonable plans based on task instructions and perceived environmental information. Traditionally, this involved rule-based methods ([59, 75, 126]) using languages like Planning Domain Definition Language (PDDL). These planners used heuristic search to verify action preconditions and select optimal action sequences. While precise in structured environments, their adaptability to unstructured or dynamic scenarios was limited. Large Language Models (LLMs), with their zero-shot and few-shot generalization, have driven significant breakthroughs by enhancing various forms of planning.
The following figure (Figure 6 from the original paper) illustrates high-level planning empowered by large models:
This figure illustrates the application of LLMs in three planning approaches: structured language planning, natural language planning, and programming language planning. It shows how an LLM can generate PDDL files and produce plans in different contexts.
4.1.1.1. Structured Language Planning with LLM
LLMs enhance structured language planning (e.g., PDDL) through two main strategies:
- LLM as the Planner: LLMs directly generate plans. However, early attempts often resulted in infeasible plans due to strict PDDL syntax and semantics ([185]).
  - LLV [9]: Introduced an external validator (e.g., a PDDL parser or environment simulator) to check LLM-generated plans and iteratively refine them using error feedback (see the sketch after this list).
  - FSP-LLM [175]: Optimized prompt engineering to align plans with logical constraints, improving feasibility.
- LLM for Automated PDDL Generation: LLMs automate the creation of PDDL domain files and problem descriptions, reducing manual effort.
  - LLMP [116]: The LLM generated PDDL files, which were then solved by traditional symbolic planners, combining linguistic understanding with symbolic reasoning.
  - PDDL-WM [64]: Used an LLM to iteratively construct and refine PDDL domain models, validated by parsers and user feedback.
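The planner-plus-validator pattern described above can be sketched as a simple refinement loop; the helpers `llm_generate_plan` and `validate_with_pddl_parser` are hypothetical placeholders in the spirit of LLV [9], not actual APIs from the cited works.

```python
# Iterative plan refinement: an LLM proposes a PDDL plan, an external
# validator (parser or simulator) checks it, and error feedback is fed
# back into the prompt. All helpers are hypothetical placeholders.

def plan_with_validator(task, domain_pddl, llm_generate_plan,
                        validate_with_pddl_parser, max_rounds=3):
    feedback = ""
    for _ in range(max_rounds):
        plan = llm_generate_plan(task=task, domain=domain_pddl, feedback=feedback)
        ok, errors = validate_with_pddl_parser(plan, domain_pddl)
        if ok:
            return plan                      # syntactically and semantically valid plan
        # Turn validator errors into natural-language feedback for the next round.
        feedback = "The previous plan failed validation: " + "; ".join(errors)
    raise RuntimeError("No valid plan found within the refinement budget")
```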
4.1.1.2. Natural Language Planning with LLM
Natural language offers greater flexibility, allowing LLMs to decompose complex tasks into sub-plans ([110, 167]). However, natural language planning often generates infeasible plans based on common sense rather than actual environmental constraints (e.g., proposing to use a vacuum cleaner that isn't present).
- Zero-shot [85]: Explored the LLM's ability to decompose high-level tasks into executable language steps, showing preliminary plans based on common sense but lacking physical environment constraints.
- SayCan [4]: Integrates an LLM with reinforcement learning. It combines LLM-generated plans with a predefined skill repository and value functions to evaluate action feasibility. By scoring actions with expected cumulative rewards, it filters out impractical steps (see the sketch after this list).
- Text2Motion [114]: Enhances planning for spatial tasks by incorporating geometric feasibility. The LLM proposes action sequences, which are then evaluated by a checker for physical viability.
- Grounded Decoding [87]: Addresses the limitation of fixed skill sets by dynamically integrating LLM outputs with a real-time grounded model. This model assesses action feasibility based on current environmental states and agent capabilities, guiding the LLM to produce contextually viable plans.
4.1.1.3. Programming Language Planning with LLM
Programming language planning translates natural language instructions into executable programs (e.g., Python code), leveraging code precision to define spatial relationships, function calls, and control APIs for dynamic planning.
- CaP [112]: Converts task planning into code generation, producing Python-style programs with recursively defined functions to create a dynamic function library, enhancing adaptability to new tasks. Its reliance on perception APIs and unconstrained code generation limits complex instruction handling.
- Instruct2Act [84]: A more integrated solution using multimodal foundation models to unify perception, planning, and control. It uses vision-language models for accurate object identification and spatial understanding, feeding this data to the LLM to generate code-based action sequences from a predefined robot skill repository. This increases planning accuracy and adaptability.
- ProgPrompt [176]: Employs structured prompts with environmental operations, object descriptions, and example programs to guide the LLM in generating tailored, code-based plans, minimizing invalid code and enhancing cross-environment adaptability (an illustrative generated program is sketched after this list).
4.1.2. Low-Level Execution
After high-level planning, low-level actions are executed using predefined skill lists. These lists represent basic capabilities (e.g., object recognition, obstacle detection, object grasping, moving) that bridge task planning and physical execution. The implementation of these skills has evolved from traditional control algorithms to learning-driven and modular control.
The following figure (Figure 7 from the original paper) illustrates low-level execution:
This figure is a diagram showing the relationship among traditional control algorithms, learning-based control, and modular control, along with the required skills and subtasks. The left side shows basic skills from algorithms such as PID and MPC, the center highlights the interactive workflow of imitation learning and reinforcement learning, and the right side presents examples of large language models (LLMs) applied to detection and classification tasks.
4.1.2.1. Traditional Control Algorithms
These foundational skills use classic model-based techniques with clear mathematical derivations.
- Proportional-Integral-Derivative (PID) control [81]: Adjusts parameters to minimize errors in robotic arm joint control (a minimal sketch is given below).
- State feedback control [178]: Often paired with Linear Quadratic Regulators (LQR) [125], it optimizes performance using system state data.
- Model Predictive Control (MPC) [2]: Forecasts states and generates control sequences via rolling optimization, suitable for tasks like drone path tracking.

These algorithms offer mathematical interpretability, low computational complexity, and real-time performance, but lack adaptability in dynamic, high-dimensional, or uncertain environments, necessitating integration with data-driven techniques.
4.1.2.2. Learning-Driven Control with LLM
Robot learning develops control strategies and low-level skills from extensive data (human demonstrations, simulations, environmental interactions).
- Imitation Learning: Trains strategies from expert demonstrations.
  - Embodied-GPT [131]: Leverages a 7B language model for high-level planning and converts plans to low-level strategies via imitation learning.
- Reinforcement Learning: Optimizes strategies through iterative trials and environmental rewards.
  - Hi-Core [140]: Employs a two-layer framework where an LLM sets high-level strategies and sub-goals, while reinforcement learning generates specific actions at the low level.

These methods offer strong adaptability and generalization but require large datasets and computational resources, and pose challenges in guaranteeing convergence and stability.
4.1.2.3. Modular Control with LLM and Pretrained Models
Modular control integrates LLMs with pretrained strategy models (e.g., CLIP for visual recognition, SAM for segmentation). LLMs are equipped with descriptions of these tools and can invoke them dynamically.
- DEPS [192]: Combines multiple modules for detection and actions based on task requirements and natural language descriptions of pretrained models.
- PaLM-E [46]: Merges an LLM with visual modules for segmentation and recognition.
- CLIPort [172]: Leverages CLIP for open-vocabulary detection.
- CaP [112]: Leverages an LLM to generate code for creating libraries of callable functions for navigation and operations.

This approach ensures scalability and reusability but introduces computational and communication delays and relies heavily on the quality of pretrained models. A minimal sketch of tool selection over a skill registry is given below.
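The modular-control idea, an LLM choosing among described pretrained tools, might be organized as in the following sketch; the registry contents and the `llm_choose_tool` helper are hypothetical and not tied to any specific system above.

```python
# Modular control sketch: pretrained models are exposed as named tools with
# natural-language descriptions; the LLM decides which one to invoke.
# The registry contents and llm_choose_tool() are hypothetical placeholders.

def run_modular_step(subtask: str, image, tool_registry: dict, llm_choose_tool):
    # tool_registry maps a name to (description, callable), e.g.
    # {"segment": ("SAM-style segmentation", sam_fn),
    #  "detect":  ("CLIP-style open-vocabulary detection", clip_fn)}
    descriptions = {name: desc for name, (desc, _) in tool_registry.items()}
    chosen = llm_choose_tool(subtask, descriptions)   # the LLM returns a tool name
    _, tool_fn = tool_registry[chosen]
    return tool_fn(image, subtask)                    # invoke the pretrained model
```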
4.1.3. Feedback and Enhancement
To ensure the quality of task planning in hierarchical decision-making, a closed-loop feedback mechanism is introduced. This feedback can originate from the large model itself, humans, or external environments.
The following figure (Figure 8 from the original paper) illustrates feedback and enhancement:
This figure is a schematic diagram of the feedback and enhancement mechanisms of large models, comprising self-reflection, human feedback, and environment feedback. It details the roles of large models in planning, execution, and feedback, as well as the mechanisms of policy optimization.
4.1.3.1. Self-Reflection of Large Models
Large models can act as task planners, evaluators, and optimizers, iteratively refining decision-making without external intervention. Agents autonomously detect and analyze failed executions, learning from past tasks.
- Re-Prompting [153]: Triggers plan regeneration based on detected execution failures or precondition errors. It integrates error context as feedback to dynamically adjust prompts and correct LLM-generated plans.
  - DEPS [153]: Adopts a "describe, explain, plan, select" framework, where the LLM describes execution, explains failure causes, and re-prompts to correct plans.
- Introspection Mechanism: Enables the LLM to evaluate and refine its output independently.
  - Self-Refine [121]: Uses a single LLM for planning and optimization, iteratively improving plan rationality through multiple self-feedback cycles.
  - Reflexion [170]: Extends Self-Refine by incorporating long-term memory to store evaluation results, combining multiple feedback mechanisms.
  - ISR-LLM [231]: Applies iterative self-optimization in PDDL-based planning, generating initial plans, performing rationality checks, and refining outcomes via self-feedback.
  - Voyager [189]: Tailored for programming language planning, it builds a dynamic code skill library by extracting feedback from execution failures.
4.1.3.2. Human Feedback
Human feedback enhances planning accuracy and efficiency through an interactive closed-loop mechanism, allowing agents to dynamically adjust behaviors based on human input.
- KNOwNO [161]: Introduces an uncertainty measurement framework for the LLM to identify knowledge gaps and seek human assistance in high-risk scenarios.
- EmbodiedGPT [132]: Uses a planning-execution-feedback loop, where agents request human input when low-level controls fail. This feedback, combined with reinforcement learning and self-supervised optimization, refines planning strategies.
- YAY Robot [168]: Allows users to pause robots with commands and provide real-time language-based corrections, recorded for strategy fine-tuning.
- IRAP [80]: Facilitates interactive question-answering with humans to acquire task-specific knowledge for precise robot instructions.
4.1.3.3. Environment Feedback
Environment feedback enhances LLM-based planning via dynamic interactions with the environment.
- Inner Monologue [88]: Transforms multimodal inputs into linguistic descriptions for an "inner monologue" reasoning process, allowing the LLM to adjust plans based on environmental feedback.
- TaPA [203]: Integrates open-vocabulary object detection and tailors plans for navigation and operations.
- DoReMi [65]: Detects discrepancies between planned and actual outcomes and uses multimodal feedback to adjust tasks dynamically.
- RoCo [123]: In multi-agent settings, it leverages environmental feedback and inter-agent communications for real-time robotic arm path planning corrections.

Vision-Language Models (VLMs) simplify feedback by integrating visual inputs and language reasoning, avoiding feedback conversions.

- ViLaIn [171]: Integrates an LLM with a VLM to generate machine-readable PDDL from language instructions and scene observations.
- ViLA [83] and Octopus [211]: Achieve robot vision-language planning by leveraging the GPT-4V MLM to generate plans, integrating perception data for robust zero-shot reasoning.
- VoxPoser [86]: Exploits an MLM to extract spatial geometric information, generating 3D coordinates and constraint maps from robot observations to populate code parameters, enhancing spatial accuracy.
4.2. End-to-End Autonomous Decision-Making
The hierarchical paradigm suffers from error accumulation across separate modules and struggles with generalization to diverse tasks due to the difficulty in directly applying high-level semantic knowledge from large models to low-level robotic actions. To address these issues, end-to-end autonomous decision-making has gained attention, aiming to directly map multimodal inputs (visual observations, linguistic instructions) to action outputs. This is typically implemented by Vision-Language-Action (VLA) models.
The following figure (Figure 9 from the original paper) illustrates end-to-end decision-making by VLA:
This figure is a schematic diagram of the end-to-end autonomous decision-making pipeline realized by Vision-Language-Action (VLA) models. It shows the unified handling of planning, execution, and perception, emphasizing the advantage of eliminating communication delays so that robots can adapt in real time and respond quickly in changing environments.
4.2.1. Vision-Language-Action Models
VLA models integrate perception, language understanding, planning, action execution, and feedback optimization into a unified framework. They leverage the rich prior knowledge of large models to achieve precise and adaptable task execution in dynamic, open environments. A typical VLA model comprises three key components: tokenization and representation, multimodal information fusion, and action detokenization.
The following figure (Figure 10 from the original paper) illustrates the components of Vision-Language-Action Models:
This figure is a schematic diagram of the construction and information-processing flow of Vision-Language-Action models. It includes a vision encoder, a language encoder, and a state encoder, each handling a different type of input. Through multimodal information fusion, the inputs are decoded into action instructions, which are then fed to an action head to execute specific robot actions. The flow emphasizes the feedback and update mechanism from perception to execution, illustrating the complexity of the decision-making process.
4.2.1.1. Tokenization and Representation
VLA models use four types of tokens to encode multimodal inputs:
- Vision Tokens: Encode environmental scenes (e.g., images) into embeddings.
- Language Tokens: Encode linguistic instructions into embeddings, forming task context.
- State Tokens: Capture the agent's physical configuration (e.g., joint positions, force-torque, gripper status, end-effector pose, object locations).
- Action Tokens: Generated autoregressively, representing low-level control signals (e.g., joint angles, torque, wheel velocities) or high-level movement primitives (e.g., "move to grasp pose", "rotate wrist"). They allow VLA models to act as language-driven policy generators.
4.2.1.2. Multimodal Information Fusion
Visual tokens, language tokens, and state tokens are fused into a unified embedding for decision-making. This is typically achieved through a cross-modal attention mechanism within a Transformer architecture. This mechanism dynamically weighs the contributions of each modality, enabling the VLA model to jointly reason over object semantics, spatial layouts, and physical constraints based on the task context.
4.2.1.3. Action Detokenization
The fused embedding is fed into an autoregressive decoder (typically a Transformer) to generate a series of action tokens.
- Discrete Action Generation: The model selects from a predefined set of actions or discretized control signals.
- Continuous Action Generation: The model outputs fine-grained control signals, often sampled from a continuous distribution using a final Multi-Layer Perceptron (MLP) layer, enabling precise manipulation or navigation.

The action tokens are then detokenized (mapped to executable control commands) and passed to the execution loop, which provides updated state information, allowing the VLA model to adapt dynamically to perturbations. A schematic sketch of this tokenize-fuse-decode pipeline follows.
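The following PyTorch-style sketch summarizes the tokenize-fuse-detokenize structure. The encoder feature sizes, the pooling step, and the per-dimension classification heads are illustrative assumptions (assuming PyTorch is available), not a reproduction of any specific published VLA architecture.

```python
import torch
import torch.nn as nn

class TinyVLAPolicy(nn.Module):
    """Schematic VLA policy: project vision/language/state features into tokens,
    fuse them with a Transformer, and predict a discrete action token per action
    dimension. All sizes are illustrative."""

    def __init__(self, d_model=256, n_action_dims=7, n_bins=256):
        super().__init__()
        self.vision_proj = nn.Linear(512, d_model)    # e.g. ViT patch features -> vision tokens
        self.language_proj = nn.Linear(768, d_model)  # e.g. text encoder features -> language tokens
        self.state_proj = nn.Linear(16, d_model)      # proprioception -> one state token
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # One classification head per action dimension over discretized bins.
        self.action_heads = nn.ModuleList(
            [nn.Linear(d_model, n_bins) for _ in range(n_action_dims)]
        )

    def forward(self, vision_feats, language_feats, state):
        # vision_feats: (B, Nv, 512), language_feats: (B, Nl, 768), state: (B, 16)
        tokens = torch.cat(
            [self.vision_proj(vision_feats),
             self.language_proj(language_feats),
             self.state_proj(state).unsqueeze(1)], dim=1)
        fused = self.fusion(tokens)                   # attention across all modality tokens
        pooled = fused.mean(dim=1)                    # simple pooling for the sketch
        return [head(pooled) for head in self.action_heads]  # logits over action bins
```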
Example: Robotics Transformer 2 (RT-2) [234]
RT-2 utilizes Vision Transformer (ViT) for visual processing and PaLM to integrate vision, language, and robot state information. It discretizes the action space into eight dimensions (e.g., 6-DoF end-effector displacement, gripper status, termination commands). Each dimension (except termination) is divided into 256 discrete intervals and embedded into the VLM vocabulary as action tokens.
RT-2 employs a two-stage training strategy:
- Pretraining: Uses Internet-scale vision-language data to enhance semantic generalization.
- Fine-tuning: Maps inputs (robot camera images, text task descriptions) to outputs (action token sequences).

By modeling actions as "language," RT-2 leverages large models' capabilities to imbue low-level actions with rich semantic knowledge.
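The action-as-token idea can be illustrated with a simple binning scheme: each continuous action dimension is mapped into one of 256 uniform intervals and back. The value ranges and the 7-dimensional action layout below are illustrative placeholders, not RT-2's exact configuration.

```python
import numpy as np

N_BINS = 256  # RT-2 discretizes each action dimension into 256 intervals

def action_to_tokens(action, low, high, n_bins=N_BINS):
    """Map a continuous action vector to integer action tokens."""
    action, low, high = map(np.asarray, (action, low, high))
    normalized = (action - low) / (high - low)               # scale into [0, 1]
    return np.clip((normalized * n_bins).astype(int), 0, n_bins - 1)

def tokens_to_action(tokens, low, high, n_bins=N_BINS):
    """Detokenize: map integer tokens back to bin-center continuous values."""
    tokens, low, high = map(np.asarray, (tokens, low, high))
    return low + (tokens + 0.5) / n_bins * (high - low)

# Illustrative 7-D action: 6-DoF end-effector displacement + gripper open/close.
low, high = np.full(7, -1.0), np.full(7, 1.0)
tokens = action_to_tokens(np.array([0.1, -0.2, 0.0, 0.3, 0.0, 0.0, 1.0]), low, high)
recovered = tokens_to_action(tokens, low, high)
```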
4.2.2. Enhancements on VLA
Despite their power, VLA end-to-end decision-making architectures have limitations: sensitivity to visual/language perturbations, limited 3D perception, reliance on simplistic policy networks for action generation, and high computational cost. Researchers have proposed enhancements categorized into perception capability enhancement, trajectory action optimization, and training cost reduction.
The following figure (Figure 11 from the original paper) illustrates enhancements on Vision-Language-Action Models:
This figure is a schematic diagram of methods for enhancing Vision-Language-Action (VLA) models, in three parts: the first shows perception-capability enhancements such as SigLIP and Ego3D position encoding; the second describes the trajectory action optimization pipeline using Octo and Diffusion-VLA; the third highlights training cost reduction using pretrained models and action experts. Overall it reflects the application of large models in decision-making and learning.
4.2.2.1. Perception Capability Enhancement
These enhancements address sensitivity to noise and limited 3D perception.
- BYO-VLA [74]: Implements a runtime observation intervention mechanism using automated image preprocessing to filter visual noise (occlusions, cluttered backgrounds).
- TraceVLA [229]: Introduces visual trajectory prompts to the cross-modal attention mechanism during multimodal information fusion. By incorporating trajectory data with vision, language, and state tokens, it enhances spatio-temporal awareness for precise action trajectory predictions.
- 3D-VLA [226]: Combines a 3D large model with a diffusion-based world model to process point clouds and language instructions. It generates semantic scene representations and predicts future point cloud sequences, improving 3D object relationship understanding.
- SpatialVLA [147]: Emphasizes spatial understanding in robot sorting tasks. It proposes Ego3D position encoding to inject 3D information into input observations and uses adaptive action schemes.
4.2.2.2. Trajectory Action Optimization
These methods provide smoother and more controllable actions, overcoming the limitations of discrete action spaces.
- Octo [180]: Combines Transformer and diffusion models. It processes multimodal inputs via a Transformer, extracts visual-language features, and uses conditional diffusion decoders to iteratively optimize action sequences, generating smooth and precise trajectories. It achieves cross-task generalization with minimal task-specific data.
- Diffusion-VLA [196]: Unifies a language model with a diffusion policy decoder. The autoregressive language model parses instructions and generates preliminary task representations, which the diffusion policy decoder refines through gradual denoising. It employs end-to-end training to jointly optimize language understanding and action generation, ensuring smooth and robust action trajectories. Its computational cost is higher than Octo's, but it is better suited to complex tasks needing deep semantic-action fusion.
- π0 [18]: Exploits flow matching to represent complex continuous action distributions. Compared to multi-step sampling in diffusion models, flow matching optimizes action generation through continuous flow field modeling, reducing computational overhead and improving real-time performance. This makes it suitable for resource-constrained applications requiring high-precision continuous control.
4.2.2.3. Training Cost Reduction
These methods reduce computational cost for VLA models, improving inference speed, data efficiency, and real-time performance for resource-constrained platforms.
- TinyVLA [198]: Achieves significant improvements by designing a lightweight multimodal model and a diffusion strategy decoder.
- OpenVLA-OFT [92]: Uses parallel decoding instead of traditional autoregressive generation to produce complete action sequences in a single forward pass, significantly reducing inference time.
- FAST [143]: Introduces an efficient action tokenization scheme based on the Discrete Cosine Transform (DCT), enabling autoregressive models to handle high-frequency tasks and speeding up training.
- Edge-VLA [25]: A streamlined VLA tailored for edge devices, achieving high inference speeds (30-50 Hz) with performance comparable to OpenVLA while optimized for low-power deployment.
4.2.3. Mainstream VLA Models
The following are the results from Table 2 of the original paper:
Table 2. Mainstream VLA models (P: perception, A: trajectory action, C: training cost).

| Model | Architecture | Contributions | Enhancements |
| --- | --- | --- | --- |
| RT-2 [234] (2023) | Vision Encoder: ViT-22B/ViT-4B; Language Encoder: PaLI-X/PaLM-E; Action Decoder: Symbol-tuning | Pioneering large-scale VLA, jointly fine-tuned on web-based VQA and robotic datasets, unlocking advanced emergent functionalities. | — |
| Seer [63] (2023) | Vision Encoder: Visual backbone; Language Encoder: Transformer-based; Action Decoder: Autoregressive action prediction head | Efficiently predicts future video frames from language instructions by extending a pretrained text-to-image diffusion model. | P, C |
| Octo [180] (2024) | Vision Encoder: CNN; Language Encoder: T5-base; Action Decoder: Diffusion Transformer | First generalist policy trained on a massive multi-robot dataset (800k+ trajectories); a powerful open-source foundation model. | A |
| OpenVLA [94] (2024) | Vision Encoder: DINOv2 + SigLIP; Language Encoder: Prismatic-7B; Action Decoder: Symbol-tuning | An open-source alternative to RT-2 with superior parameter efficiency and strong generalization via efficient LoRA fine-tuning. | C |
| Mobility-VLA [37] (2024) | Vision Encoder: Long-context ViT + goal image encoder; Language Encoder: T5-based instruction encoder; Action Decoder: Hybrid diffusion + autoregressive ensemble | Leverages demonstration tour videos as an environmental prior, using a long-context VLM and topological graphs to navigate from complex multimodal instructions. | P, A |
| TinyVLA [198] (2025) | Vision Encoder: FastViT with low-latency encoding; Language Encoder: Compact language encoder (128-d); Action Decoder: Diffusion policy decoder (50M parameters) | Outpaces OpenVLA in speed and precision; eliminates pretraining needs; achieves 5x faster inference for real-time applications. | C |
| Diffusion-VLA [196] (2024) | Vision Encoder: Transformer-based visual encoder for contextual perception; Language Encoder: Autoregressive reasoning module with next-token prediction; Action Decoder: Diffusion policy head for robust action sequence generation | Leverages diffusion-based action modeling for precise control; superior contextual awareness and reliable sequence planning. | A |
| PointVLA [105] (2025) | Vision Encoder: CLIP + 3D point cloud; Language Encoder: Llama-2; Action Decoder: Transformer with spatial token fusion | Excels at long-horizon and spatial reasoning tasks; avoids retraining by preserving pretrained 2D knowledge. | P |
| VLA-Cache [208] (2025) | Vision Encoder: SigLIP with token memory buffer; Language Encoder: Prismatic-7B; Action Decoder: Transformer with dynamic token reuse | Faster inference with near-zero loss; dynamically reuses static features for real-time robotics. | C |
| π0 [18] (2024) | Vision Encoder: PaliGemma VLM backbone; Language Encoder: PaliGemma (multimodal) | Employs flow matching to produce smooth, high-frequency (50 Hz) action trajectories for real-time control. | A |
| π0 Fast [143] (2025) | Vision Encoder: PaliGemma VLM backbone; Language Encoder: PaliGemma (multimodal); Action Decoder: Autoregressive Transformer with FAST | Introduces an efficient action tokenization scheme based on the Discrete Cosine Transform (DCT), enabling autoregressive models to handle high-frequency tasks and significantly speeding up training. | A, C |
| Edge-VLA [25] (2025) | Vision Encoder: SigLIP + DINOv2; Language Encoder: Qwen2 (0.5B parameters); Action Decoder: Joint control prediction (non-autoregressive) | Streamlined VLA tailored for edge devices, delivering 30-50 Hz inference speed with OpenVLA-comparable performance, optimized for low-power, real-time deployment. | C |
| OpenVLA-OFT [92] (2025) | Vision Encoder: SigLIP + DINOv2 (multi-view); Language Encoder: Llama-2 7B; Action Decoder: Parallel decoding with action chunking and L1 regression | An optimized fine-tuning recipe for VLAs that integrates parallel decoding and a continuous action representation to improve inference speed and task success. | C |
| SpatialVLA [147] (2025) | Vision Encoder: SigLIP from PaLiGemma2 4B; Language Encoder: PaLiGemma2; Action Decoder: Adaptive Action Grids and autoregressive transformer | Enhances spatial intelligence by injecting 3D information via Ego3D Position Encoding and representing actions with Adaptive Action Grids. | P |
| MoLe-VLA [219] (2025) | Vision Encoder: Multi-stage ViT with STAR router; Language Encoder: CogKD-enhanced Transformer; Action Decoder: Sparse Transformer with dynamic routing | A brain-inspired architecture that uses dynamic layer-skipping (Mixture-of-Layers) and knowledge distillation to improve efficiency. | C |
| DexGraspVLA [230] (2025) | Vision Encoder: Object-centric spatial ViT; Language Encoder: Transformer with grasp sequence reasoning; Action Decoder: Diffusion controller for grasp pose generation | A hierarchical framework for general dexterous grasping, using a VLM for high-level planning and a diffusion policy for low-level control. | A |
| Dex-VLA [197] (2025) | — | A large plug-in diffusion-based action expert and an embodiment curriculum learning strategy for efficient cross-robot training and adaptation. | C |
4.2.4. Hierarchical versus End-to-End Decision-Making
Hierarchical and end-to-end paradigms represent two distinct philosophies for autonomous decision-making in embodied intelligence.
The following are the results from Table 3 of the original paper:
Table 3. Comparison of hierarchical and end-to-end decision-making.

| Aspect | Hierarchical | End-to-End |
| --- | --- | --- |
| Architecture | Perception: dedicated modules (e.g., SLAM, CLIP); High-level planning: structured, natural, or programming language; Low-level execution: predefined skill lists; Feedback: LLM self-reflection, human, environment | Perception: integrated in tokenization; Planning: implicit via VLA pretraining; Action generation: autoregressive generation with diffusion-based decoders; Feedback: inherent in closed-loop cycle |
| Performance | Reliable in structured tasks; limited in dynamic settings | Superior in complex, open-ended tasks with strong generalization; dependent on training data |
| Interpretability | High, with clear modular design | Low, due to the black-box nature of neural networks |
| Generalization | Limited, due to reliance on human-designed structures | Strong, driven by large-scale pretraining; sensitive to data gaps |
| Real-time | Low, inter-module communications may introduce delays in complex scenarios | High, direct perception-to-action mapping minimizes processing overhead |
| Computational Cost | Moderate, with independent module optimization but coordination overhead | High, requiring significant resources for training |
| Application | Suitable for industrial automation, drone navigation, autonomous driving | Suitable for domestic robots, virtual assistants, human-robot collaboration |
| Advantages | High interpretability; high reliability; easy to integrate domain knowledge | Seamless multimodal integration; efficient in complex tasks; minimal error accumulation |
| Limitations | Sub-optimal, due to module coordination issues; low adaptability to unstructured settings | Low interpretability; high dependency on training data; high computational costs; low generalization in out-of-distribution scenarios |
- Hierarchical Architectures: Decompose decision-making into separate perception, planning, execution, and feedback modules. This modularity facilitates debuggability, optimization, and maintenance. They excel at integrating domain knowledge (e.g., physical constraints, rules), offering high interpretability and reliability. However, module separation can lead to sub-optimal solutions due to coordination issues, especially in dynamic environments, and manual task decomposition can limit adaptability to unseen scenarios.
- End-to-End Architectures: Employ a large-scale neural network (e.g., a VLA model) to directly map multimodal inputs to actions. Built on large multimodal models and trained on extensive datasets, VLA models achieve simultaneous visual perception, language understanding, and action generation. Their integrated architecture minimizes error accumulation and enables efficient end-to-end optimization, leading to strong generalization in complex, unstructured environments. The main drawbacks include their black-box nature (low interpretability), heavy reliance on the quality and diversity of training data, and high computational cost for training.
4.3. Embodied Learning
Embodied learning aims to enable agents to acquire complex skills and refine their capabilities through continuous interactions with environments. This continuous learning is crucial for embodied agents to achieve precise decision-making and real-time adaptation in the complex and variable real world. It can be achieved through the coordination of various learning strategies.
The following figure (Figure 12 from the original paper) illustrates embodied learning processes and methodologies:
This figure is a schematic diagram showing how different methods are applied in embodied learning, including imitation learning, transfer learning, meta-learning, and reinforcement learning. Imitation learning rapidly acquires skills from diverse sources, transfer learning applies learned skills to related tasks, meta-learning improves learning efficiency, and reinforcement learning optimizes skills via rewards.
4.3.1. Embodied Learning Methods
Embodied learning can be modeled as a goal-conditional partially observable Markov decision process (POMDP), defined as an 8-tuple $(S, A, G, T, R, \Omega, O, \gamma)$:

- $S$: The set of states of the environment, encoding multimodal information (textual descriptions, images, structured data).
- $A$: The set of actions, representing instructions or commands, often in natural language.
- $G$: The set of possible goals ($g \in G$), specifying objectives (e.g., "purchase a laptop").
- $T(s' \mid s, a)$: The state transition probability function, defining the probability distribution over next states for each state-action pair $(s, a)$.
- $R(s', a, g)$: The goal-conditional reward function, evaluating how well an action $a$ in state $s$ advances goal $g$. Rewards can be numeric or textual.
- $\Omega$: The set of observations, which may include textual, visual, or multimodal data, representing the agent's partial view of the state.
- $O(o \mid s', a)$: The observation probability function, defining the probability of observing $o$ after transitioning to state $s'$ via action $a$.
- $\gamma$: The discount factor, balancing immediate and long-term rewards, used when rewards are numeric.

At time $t$, an agent receives observation $o_t$ and goal $g$, selecting action $a_t$ according to policy $\pi(a_t \mid o_t, g)$. The environment transitions to $s_{t+1}$, yielding observation $o_{t+1}$ and reward $R(s_{t+1}, a_t, g)$.

- For end-to-end decision-making: The VLA model directly encodes the policy $\pi(a_t \mid o_t, g)$, processing multimodal observation $o_t$ and producing action $a_t$.
- For hierarchical decision-making: A high-level agent generates a context-aware subgoal $g_{sub}$ via an LLM-enhanced policy, then a low-level policy maps the subgoal to an action $a_t$. This low-level policy can be learned through imitation learning or reinforcement learning.

The following are the results from Table 4 of the original paper:
| Methods | Strengths | Limitations | Applications |
| --- | --- | --- | --- |
| Imitation Learning | Rapid policy learning by mimicking expert demonstrations; efficient for tasks with high-quality data | Dependent on diverse, high-quality demonstrations; limited adaptability to new tasks or sparse-data scenarios | Robotic manipulation; structured navigation; human-robot interaction with expert guidance |
| Reinforcement Learning | Optimizes policies in dynamic, uncertain environments via trial-and-error; excels in tasks with clear reward signals | Requires large samples and computational resources; sensitive to reward function and discount factor | Autonomous navigation; adaptive human-robot interaction; dynamic task optimization |
| Transfer Learning | Accelerates learning by transferring knowledge between related tasks; enhances generalization in related tasks | Risks negative transfer when tasks differ significantly; requires task similarity for effective learning | Navigation across diverse environments; manipulation with shared structures; cross-task skill reuse |
| Meta-Learning | Rapid adaptation to new tasks with minimal data; ideal for diverse embodied tasks | Demands extensive pretraining and large datasets; establishing a universal meta-policy is resource-intensive | Rapid adaptation in navigation, manipulation, or interaction across diverse tasks and environments |
4.3.1.1. Imitation Learning
Imitation learning ([204]) allows agents to learn policies by mimicking expert demonstrations, making it highly efficient for tasks with high-quality data (e.g., robotic manipulation). The training is supervised, using datasets of expert state-action pairs $(s, a)$. The objective is to learn a policy $\pi$ that replicates the expert's behavior by minimizing the negative log-likelihood of expert actions.

The objective function for imitation learning is defined as:
$
\mathcal{L}(\pi) = -\mathbb{E}_{\tau \sim p_{D}}\left[ \log \pi(a \mid s) \right]
$
where:

- $\mathcal{L}(\pi)$: The loss function for the policy $\pi$.
- $\mathbb{E}_{\tau \sim p_D}$: The expectation over expert demonstrations $\tau$ sampled from the expert data distribution $p_D$.
- $\log \pi(a \mid s)$: The log-probability of the expert action $a$ given the state $s$ under the learned policy $\pi$. This term encourages the policy to assign high probability to expert actions.

Each demonstration $\tau_i$ consists of a sequence of state-action pairs $(s_t, a_t)$ of length $L$:
$
\tau_i = [(s_1, a_1), \cdots, (s_t, a_t), \cdots, (s_L, a_L)]
$
In continuous action spaces, $\pi(a \mid s)$ is often modeled as a Gaussian distribution, and the objective is approximated using the Mean Squared Error (MSE) between predicted and expert actions. Imitation learning is sample-efficient but highly dependent on the quality and coverage of demonstration data, struggling with unseen scenarios. Combining it with reinforcement learning can enhance robustness.
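In the continuous-action case, the MSE approximation of this objective reduces to a standard supervised training loop. The following sketch uses an illustrative MLP policy and random placeholder data (assuming PyTorch is available); it is not the survey's reference implementation.

```python
import torch
import torch.nn as nn

# Behavior cloning sketch: regress expert actions from states with MSE,
# approximating the negative log-likelihood under a Gaussian policy.
state_dim, action_dim = 10, 4                     # illustrative dimensions
policy = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, action_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def bc_update(states: torch.Tensor, expert_actions: torch.Tensor) -> float:
    predicted = policy(states)
    loss = nn.functional.mse_loss(predicted, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One update on a random "expert" batch (placeholder data).
loss = bc_update(torch.randn(32, state_dim), torch.randn(32, action_dim))
```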
4.3.1.2. Reinforcement Learning
Reinforcement learning ([139]) is a dominant method, enabling agents to learn policies by interacting with environments through trial and error, well-suited for dynamic and uncertain settings. At each time step $t$, the agent observes state $s_t$, selects action $a_t$ according to policy $\pi$, receives reward $R(s, a, g)$, and the environment transitions to $s_{t+1}$.

The objective function of reinforcement learning is to maximize the expected cumulative reward:
$
\mathcal{T}(\pi) = \mathbb{E}_{\pi, T, O}\left( \sum_{t=0}^{\infty} \gamma^{t} R(s, a, g) \right)
$
where:

- $\mathcal{T}(\pi)$: The expected total reward (return) for policy $\pi$.
- $\mathbb{E}_{\pi, T, O}$: The expectation taken over trajectories generated by policy $\pi$, state transitions $T$, and observations $O$.
- $\gamma$: The discount factor, which determines the present value of future rewards. A value closer to 0 emphasizes immediate rewards, while a value closer to 1 emphasizes long-term rewards.
- $R(s, a, g)$: The reward received for taking action $a$ in state $s$ to achieve goal $g$.

Reinforcement learning excels in optimizing policies for complex tasks but requires extensive exploration, making it computationally costly. Hybrid approaches with imitation learning can mitigate this by providing initial policies.
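The discounted return inside this objective can be made concrete with a small helper that accumulates rewards backwards through an episode; the reward sequence and discount value below are illustrative.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = sum_k gamma^k * r_{t+k} for every step of one episode."""
    returns, running = [], 0.0
    for r in reversed(rewards):          # accumulate from the final step backwards
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

# Example: sparse goal-conditioned reward where only the last step succeeds.
print(discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.9))
# -> [0.729, 0.81, 0.9, 1.0]
```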
4.3.1.3. Transfer Learning
Transfer learning ([152]) accelerates learning on target tasks by leveraging knowledge from source tasks, reducing the need for extensive data and time. It adapts a source policy $\pi_s$ (learned on a source task with states $s$ and actions $a$) to a target task with different dynamics or goals. The objective is to minimize the divergence between $\pi_s$ and the target policy $\pi_t$, typically by fine-tuning $\pi_s$ using a small amount of target task data.
The process is guided by the task-specific loss of the target task, constrained by the Kullback-Leibler (KL) divergence for policy alignment:
$
\theta_t^* = \arg\min_{\theta_t} \mathbb{E}_{s \sim S_t}\left[ D_{KL}\big( \pi_s(\cdot \mid s; \theta_s) \,\|\, \pi_t(\cdot \mid s; \theta_t) \big) \right] + \lambda \mathcal{L}_t(\theta_t)
$
where:
- $\theta_t^*$: The optimal policy parameters for the target task.
- $\theta_s$: The parameters of the source policy $\pi_s$.
- $\theta_t$: The parameters of the target policy $\pi_t$.
- $\mathbb{E}_{s \sim S_t}$: The expectation over states $s$ drawn from the target task's state distribution $S_t$.
- $D_{KL}(\pi_s(\cdot \mid s; \theta_s) \,\|\, \pi_t(\cdot \mid s; \theta_t))$: The Kullback-Leibler divergence measuring the difference between the source policy and the target policy given state $s$. It penalizes deviations of the target policy from the source policy.
- $\mathcal{L}_t(\theta_t)$: The task-specific loss of the target task, which measures how well the target policy performs on the target task.
- $\lambda$: A regularization parameter that balances the importance of policy alignment (KL divergence) and target task performance.

In embodied settings, transfer learning enables skill reuse across environments and goals. However, large disparities between tasks can lead to negative transfer, where performance degrades.
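A minimal sketch of this KL-regularized fine-tuning objective for discrete-action policies is shown below; the function name, the discrete-action assumption, and the weighting scheme are illustrative assumptions, not the exact formulation of any surveyed method.

```python
import torch
import torch.nn.functional as F

def transfer_loss(target_logits: torch.Tensor,
                  source_logits: torch.Tensor,
                  task_loss: torch.Tensor,
                  lam: float = 0.1) -> torch.Tensor:
    # KL(pi_s || pi_t) keeps the target policy close to the frozen source policy,
    # while the task-specific loss drives performance on the target task.
    kl = F.kl_div(
        F.log_softmax(target_logits, dim=-1),          # log pi_t
        F.softmax(source_logits.detach(), dim=-1),     # pi_s, treated as fixed
        reduction="batchmean",
    )
    return kl + lam * task_loss
```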
4.3.1.4. Meta-Learning
Meta-learning ([51, 66]), or "learning how to learn," allows agents to swiftly infer optimal policies for new tasks with minimal samples. At each time step $t$, the agent receives an observation and a goal, selecting an action according to a meta-policy that adapts to task-specific dynamics. The objective is to optimize expected performance across tasks by minimizing the loss on task-specific data.
In Model-Agnostic Meta-Learning (MAML) [52], this is achieved by learning an initial set of model parameters that can be adapted quickly to new tasks with minimal updates. Specifically, for a set of tasks $\{\mathcal{T}_i\}$, MAML optimizes the meta-objective as:
$
\begin{array}{l}
\theta^* = \arg\min_{\theta} \sum_{\mathcal{T}_i} \mathcal{L}_{\mathcal{T}_i}\left( f_{\theta_i} \right) \\
\theta_i = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}\left( f_{\theta} \right)
\end{array}
$
where:
- $\theta^*$: The optimal meta-policy parameters.
- $\sum_{\mathcal{T}_i}$: Summation over a set of distinct tasks $\mathcal{T}_i$.
- $\mathcal{L}_{\mathcal{T}_i}(f_{\theta_i})$: The task-specific loss for task $\mathcal{T}_i$, evaluated on the model with adapted parameters $\theta_i$.
- $f_{\theta}$: The model parameterized by $\theta$.
- $\theta_i$: The task-specific parameters after a single (or few) gradient updates from $\theta$ using the loss $\mathcal{L}_{\mathcal{T}_i}$.
- $\alpha$: The inner-loop learning rate for adapting to a specific task.
- $\nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(f_{\theta})$: The gradient of the task-specific loss with respect to the meta-parameters $\theta$.

Meta-learning enables agents to quickly adapt to new tasks by fine-tuning a pretrained model with few demonstrations. However, it requires substantial pretraining and large numbers of samples across diverse tasks.
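The inner/outer loop structure of MAML can be sketched as follows; the functional-parameter interface (`loss_fn(params, batch)`), the single inner step, and the plain SGD meta-update are simplifying assumptions for illustration.

```python
import torch

def maml_meta_step(params, tasks, loss_fn, alpha: float = 0.01, beta: float = 1e-3):
    """One MAML meta-update.
    params: list of leaf tensors with requires_grad=True (the shared initialization theta).
    tasks: iterable of (support_batch, query_batch).
    loss_fn(params, batch) -> scalar task loss."""
    meta_grads = [torch.zeros_like(p) for p in params]
    for support, query in tasks:
        # Inner loop: adapt theta to the task with one gradient step on the support set.
        inner_loss = loss_fn(params, support)
        grads = torch.autograd.grad(inner_loss, params, create_graph=True)
        adapted = [p - alpha * g for p, g in zip(params, grads)]
        # Outer loop: evaluate the adapted parameters on the query set and
        # backpropagate through the adaptation step to the initialization.
        outer_loss = loss_fn(adapted, query)
        task_grads = torch.autograd.grad(outer_loss, params)
        meta_grads = [mg + tg for mg, tg in zip(meta_grads, task_grads)]
    # Meta-update of the shared initialization theta.
    return [(p - beta * g).detach().requires_grad_(True) for p, g in zip(params, meta_grads)]
```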
4.3.2. Imitation Learning Empowered by Large Models
Imitation learning aims for agents to achieve expert-level performance by mimicking demonstrations. Key challenges include capturing complex behaviors, generalizing to unseen states, ensuring robustness against distribution shifts, and achieving sample efficiency. Large models have significantly enhanced behavior cloning, the most important imitation learning approach, which is formulated as a supervised regression task that predicts actions from observations and goals.
The following figure (Figure 13 from the original paper) illustrates imitation learning empowered by diffusion models or Transformers:
The figure illustrates imitation learning empowered by diffusion models and Transformers, showing the expert dataset, the relation between states/observations and actions, and the structure of diffusion-based and Transformer-based decision networks.
4.3.2.1. Diffusion-based Policy Network
Diffusion models excel in handling complex multimodal distributions, generating diverse action trajectories, and enhancing robustness and expressiveness of policies.
- Pearce [142]: Proposes a diffusion model-based imitation learning framework that integrates diffusion models into policy networks. It optimizes expert demonstrations iteratively through noise addition and removal, capturing action distribution diversity.
- DABC [34]: Adopts a two-stage process: pretrains a base policy via behavior cloning, then refines action distribution modeling via a diffusion model.
- Diffusion Policy [36]: Uses a diffusion model as the decision model for vision-driven robot tasks. It takes visual input and the robot's current state as conditions, uses a U-Net as the denoising network, and predicts denoising steps to generate continuous action sequences (a simplified sampling sketch follows this list).
- 3D-Diffusion [217]: Proposes a diffusion policy framework based on 3D inputs (simple 3D representations). It leverages a diffusion model to generate action sequences, improving the generalization of visual motion policies by capturing spatial information.
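To illustrate how a diffusion policy turns noise into an action sequence at inference time, here is a simplified DDPM-style reverse sampling loop; the `denoiser` callable, its signature, and the linear noise schedule are assumptions for illustration and do not reproduce Diffusion Policy's exact implementation.

```python
import torch

@torch.no_grad()
def sample_action_sequence(denoiser, obs_embedding: torch.Tensor,
                           horizon: int, action_dim: int, n_steps: int = 50) -> torch.Tensor:
    """Reverse-diffusion sampling of an action trajectory conditioned on observations.
    denoiser(noisy_actions, t, obs_embedding) is assumed to predict the added noise."""
    betas = torch.linspace(1e-4, 0.02, n_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    actions = torch.randn(1, horizon, action_dim)          # start from pure noise
    for t in reversed(range(n_steps)):
        eps = denoiser(actions, torch.tensor([t]), obs_embedding)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        actions = (actions - coef * eps) / torch.sqrt(alphas[t])   # posterior mean
        if t > 0:
            actions = actions + torch.sqrt(betas[t]) * torch.randn_like(actions)
    return actions
```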
4.3.2.2. Transformer-based Policy Network
Transformer architectures empower imitation learning by treating expert trajectories as sequential data and using self-attention mechanisms to model dependencies between actions, states, and goals, minimizing error accumulation.
- RT-1 [20]: Google's pioneering work demonstrating Transformers in robot control. Combines a large, diverse dataset (130k+ trajectories, 700+ tasks) with a pretrained vision-language model to improve task generalization.
- RT-Trajectory [62]: Introduces "trajectory sketches" to incorporate low-level visual cues, enhancing task generalization.
- ALOHA [224]: Stanford's work using a Transformer encoder-decoder structure to generate robotic arm action sequences from multi-view images, achieving precise dual-arm operations. Its follow-up uses action chunking for multi-step predictions, improving stability (see the sketch after this list).
- Mobile ALOHA [58]: Extends ALOHA to whole-body coordinated mobile operation tasks using a mobile platform and teleoperation interface.
- HiveFormer [224] and RVT [60]: Utilize multi-view data and CLIP for visual-language feature fusion to directly predict 6D grasp poses, achieving state-of-the-art performance in complex spatial modeling.
- RoboCat [19]: Employs cross-task, cross-embodiment imitation learning, integrating VQ-GAN [50] to tokenize visual inputs and a Decision Transformer to predict actions and observations, enabling rapid policy generalization.
- RoboAgent [17]: Adopts a similar encoder-decoder structure, fusing vision, task descriptions, and robot states to minimize action-sequence prediction errors.
- CrossFormer [44]: Proposes a Transformer-based imitation learning architecture for cross-embodiment tasks, trained on large-scale expert data to unify the processing of manipulation, navigation, mobility, and aerial tasks, demonstrating multi-task learning potential.
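A minimal Transformer policy with action chunking in the spirit of the works above might look as follows; layer counts, dimensions, the mean-pooling step, and the chunk length are arbitrary placeholders rather than details of any specific model.

```python
import torch
import torch.nn as nn

class ChunkedTransformerPolicy(nn.Module):
    """Encodes a window of observation tokens and decodes a short chunk of future actions."""
    def __init__(self, obs_dim: int = 128, action_dim: int = 7, chunk: int = 8, d_model: int = 256):
        super().__init__()
        self.obs_proj = nn.Linear(obs_dim, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.action_head = nn.Linear(d_model, chunk * action_dim)
        self.chunk, self.action_dim = chunk, action_dim

    def forward(self, obs_tokens: torch.Tensor) -> torch.Tensor:
        h = self.encoder(self.obs_proj(obs_tokens))        # (B, T, d_model)
        pooled = h.mean(dim=1)                              # simple pooling over the window
        return self.action_head(pooled).view(-1, self.chunk, self.action_dim)

# Toy usage: a batch of 4 windows of 10 observation tokens each -> (4, 8, 7) action chunks.
policy = ChunkedTransformerPolicy()
action_chunk = policy(torch.randn(4, 10, 128))
```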
4.3.3. Reinforcement Learning Empowered by Large Models
Reinforcement learning ([11]) allows agents to develop optimal control strategies through environmental interactions. While traditional methods like Q-learning ([194]), SARSA ([164]), and Deep Reinforcement Learning (DRL) (DQN [130], PPO [166], SAC [68]) have achieved significant successes, they still face limitations in reward function design and policy network construction. Large models offer solutions to these challenges.
The following figure (Figure 14 from the original paper) illustrates reinforcement learning empowered by large models:
The figure is a schematic of reinforcement learning empowered by large models, depicting the environment, states, and actions together with the design of reward functions and policy networks via diffusion-, Transformer-, and LLM-based construction methods, and showing how large models set rewards to optimize the agent's decision process.
4.3.3.1. Reward Function Design
Designing effective reward functions ([49]) is challenging due to complexity, task-specificity, and issues like reward hacking. Large models offer solutions by generating reward signals or entire reward functions.
- Kwon et al. and Language to Rewards (L2R) [215]: Use zero-shot and few-shot approaches with GPT-3 to produce reward signals directly from textual behavior prompts, translating high-level goals to hardware-specific policies. Limitations include sparse rewards and heavy dependence on precise prompts.
- Text2Reward [205]: Generates dense, interpretable Python reward functions from environment descriptions and examples, iteratively refined via human feedback, achieving high success rates (an illustrative example of such a generated function follows this list).
- Eureka [120]: Leverages GPT-4 to create dense rewards from task and environment prompts. It automates reward-function optimization through an iterative strategy, mitigating reliance on human feedback (unlike Text2Reward), and can surpass human-crafted rewards.
- Auto MC-Reward [106]: Implements full automation for Minecraft with a multi-stage pipeline: a reward designer generates signals, a validator ensures quality, and a trajectory analyzer refines rewards through failure-driven iterations. Achieves significant efficiency but is domain-specific.
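For intuition, the snippet below shows the kind of dense, interpretable reward code an LLM might emit for a pick-and-place task; the state fields (`gripper_pos`, `object_pos`, `goal_pos`) and the weights are hypothetical and not taken from Text2Reward or Eureka.

```python
import numpy as np

def generated_reward(state: dict, action: np.ndarray) -> float:
    """Hypothetical LLM-generated dense reward for a pick-and-place task."""
    reach = -np.linalg.norm(state["gripper_pos"] - state["object_pos"])   # approach the object
    place = -np.linalg.norm(state["object_pos"] - state["goal_pos"])      # move the object to the goal
    effort = -0.01 * float(np.square(action).sum())                       # penalize large actions
    success = 1.0 if np.linalg.norm(state["object_pos"] - state["goal_pos"]) < 0.02 else 0.0
    return 0.5 * reach + 1.0 * place + effort + success
```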
4.3.3.2. Policy Network Construction
Offline reinforcement learning ([101]) learns policies from pre-collected datasets without online interaction, but policy regularization is needed to mitigate errors for actions absent from datasets. Large models enhance policy network construction to improve expressiveness and adaptability.
The following figure (Figure 15 from the original paper) illustrates policy network construction empowered by large models:

Policy network construction with diffusion models:
- DiffusionQL [193]: Employs diffusion models as a foundation policy to model action distributions and trains them to maximize value-function objectives within the Q-learning framework. This generates high-reward policies that fit multimodal or non-standard action distributions in offline datasets.
- EDP [91]: Introduces an efficient sampling method that reconstructs actions from intermediate noised states in a single step, significantly reducing the computational overhead of diffusion models. It can integrate with various offline reinforcement learning frameworks.
Policy network construction with Transformer-based architectures:
Transformers capture long-term dependencies in trajectories, improving policy flexibility and accuracy.
- Decision Transformer [31]: Re-frames offline reinforcement learning as a conditional sequence modeling problem, treating state-action-reward trajectories as sequential inputs and applying supervised learning to generate optimal actions from offline datasets (see the token-construction sketch after this list).
- Prompt-DT [207]: Enhances generalization in few-shot scenarios by incorporating prompt engineering (trajectory prompts with task-specific encoding) to guide action generation for new tasks.
- Online Decision Transformer (ODT) [228]: Pretrains a Transformer via offline reinforcement learning to learn sequence generation, then fine-tunes it through online reinforcement learning interactions.
- Q-Transformer [30]: Integrates Transformer sequence modeling with Q-function estimation, learning Q-values autoregressively to generate optimal actions.
- Gato [158]: A Transformer-based sequence modeling approach for multi-task offline reinforcement learning, but it relies heavily on dataset optimality and incurs high training cost.
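The following sketch shows how an offline trajectory is typically rearranged into the (return-to-go, state, action) triplets that Decision-Transformer-style models consume; the embedding layers and the causal Transformer itself are omitted, and the interface is an illustrative assumption.

```python
import torch

def build_dt_inputs(states: torch.Tensor, actions: torch.Tensor,
                    rewards: torch.Tensor, gamma: float = 1.0):
    """Return (returns-to-go, states, actions) for one offline trajectory.
    Each timestep t contributes three tokens: R_t, s_t, a_t."""
    T = rewards.shape[0]
    rtg = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg.unsqueeze(-1), states, actions
```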
Policy network construction with LLM:
LLMs leverage pretrained knowledge to streamline offline reinforcement learning.
- GLAM [28]: Uses LLMs as policy agents to generate executable action sequences for language-defined tasks, optimized online via PPO with contextual memory.
- LaMo [169]: Employs GPT-2 as a base policy, fine-tuned with LoRA to preserve prior knowledge, converting state-action-reward sequences into language prompts for task-aligned policy generation (a minimal setup sketch follows this list).
- Reid [159]: Explores LLM transferability using a pretrained BERT, fine-tuned for specific tasks and augmented by external knowledge bases. It can outperform Decision Transformer while reducing training time.
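As a hedged sketch of the general pattern (reusing a pretrained GPT-2 backbone with LoRA adapters and feeding trajectory context as text), the snippet below uses Hugging Face `transformers` and `peft`; the prompt format and hyperparameters are illustrative assumptions, not LaMo's actual recipe, and a task-specific action head would still be required.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Pretrained language backbone; most weights stay frozen while LoRA adapters are trained.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
backbone = AutoModelForCausalLM.from_pretrained("gpt2")
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], lora_dropout=0.05)
policy = get_peft_model(backbone, lora_cfg)

# Illustrative prompt: a state/return context the policy should complete with an action.
prompt = "state: [0.12, -0.30, 0.74] return-to-go: 12.5 action:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = policy(**inputs)   # logits over the vocabulary; a decoder would map these to controls
```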
4.4. World Models
World models serve as internal simulations or representations of environments, enabling intelligent systems to anticipate future states, comprehend cause-and-effect relationships, and make decisions without solely relying on expensive or infeasible real-world interactions. They provide a rich cognitive framework for efficient learning, decision-making, and adaptation in complex dynamic environments.
The following figure (Figure 16 from the original paper) illustrates world models and applications in decision-making and embodied learning:
The figure is a schematic of world models and their applications in decision-making and embodied learning. The upper left shows different world model designs, including latent space models, Transformer-based models, and diffusion-based models; the right side shows the role of world models in decision-making, emphasizing prediction and the integration of contextual knowledge; the bottom illustrates their use in embodied learning, including state transitions, rewards, and data generation. JEPA and the interactions among related components are also depicted.
4.4.1. Design of World Models
Traditional reinforcement learning (RL) is costly due to repeated interactions. World models allow learning in a simulated environment. Current world models are categorized into latent space world models, Transformer-based world models, diffusion-based world models, and joint embedding predictive architectures.
4.4.1.1. Latent Space World Model
These models predict in latent spaces, represented by Recurrent State Space Models (RSSMs) [67, 69]. RSSMs learn dynamic environment models from pixel observations and plan actions in encoded latent spaces, decomposing the latent state into stochastic and deterministic parts.
- PlaNet [71]: Employs an RSSM with a Gated Recurrent Unit (GRU) and a Convolutional Variational AutoEncoder (CVAE) for latent dynamics and model predictive control.
- Dreamer [70]: Advances PlaNet by learning the actor and value networks from latent representations.
- Dreamer V2 [72]: Further uses the actor-critic algorithm to learn behaviors purely from imagined sequences generated by the world model, achieving human-level performance on Atari.
- Dreamer V3 [73]: Enhances stability with symlog predictions, layer normalization, and normalized returns via exponential moving averages, outperforming specialized algorithms.
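To make the RSSM idea concrete, here is a toy latent imagination step combining a deterministic GRU path with a stochastic Gaussian latent; dimensions are placeholders, and the observation, reward, and value heads used by PlaNet and Dreamer are omitted.

```python
import torch
import torch.nn as nn

class MiniRSSM(nn.Module):
    """Toy recurrent state-space model: deterministic GRU state plus a stochastic latent."""
    def __init__(self, action_dim: int = 4, stoch: int = 8, deter: int = 32):
        super().__init__()
        self.gru = nn.GRUCell(stoch + action_dim, deter)
        self.prior = nn.Linear(deter, 2 * stoch)   # predicts mean and log-std of the latent

    def imagine_step(self, stoch_state, deter_state, action):
        deter_state = self.gru(torch.cat([stoch_state, action], dim=-1), deter_state)
        mean, log_std = self.prior(deter_state).chunk(2, dim=-1)
        stoch_state = mean + log_std.exp() * torch.randn_like(mean)   # reparameterized sample
        return stoch_state, deter_state

# Imagining a short rollout from zero-initialized states and random actions.
rssm = MiniRSSM()
s, h = torch.zeros(1, 8), torch.zeros(1, 32)
for _ in range(5):
    s, h = rssm.imagine_step(s, h, torch.randn(1, 4))
```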
4.4.1.2. Transformer-based World Model
These models leverage the attention mechanism to model multimodal inputs, overcoming CNN and RNN limitations in high-dimensional, continuous, or multimodal environments, especially for complex memory-interaction tasks.
- IRIS [129]: One of the first to apply Transformers in world models. Agents learn skills within an autoregressive Transformer-based world model. IRIS tokenizes images using a Vector Quantized Variational Autoencoder (VQ-VAE) and predicts future tokens.
- Google's Genie [24]: Built on a spatial-temporal Transformer [206], trained on vast unlabeled Internet video datasets via self-supervised learning. It provides a paradigm for manipulable, generative, interactive environments.
- TWM [162]: Proposes a Transformer-XL-based world model, migrating Transformer-XL's segment-level recurrence mechanism to capture long-term dependencies. It trains a model-free agent within latent imagination to enhance efficiency.
- STORM [222]: Utilizes a stochastic Transformer, fusing state and action into a single token, improving training efficiency and matching Dreamer V3 performance.
4.4.1.3. Diffusion-based World Model
These models excel in generating predictive video sequences in the original image space.
- OpenAI's Sora [22]: Leverages an encoding network to convert videos/images into tokens, then a large-scale diffusion model applies noising and denoising processes to these tokens, mapping them back to the original image space for multi-step image predictions based on language descriptions. This can generate trajectory videos for agents.
- UniPi [47]: Employs diffusion models to model agent trajectories in image space, generating future key video frames from language inputs and initial images, followed by super-resolution over the time series.
- UniSim [212]: Improves trajectory prediction by jointly training diffusion models on Internet data and robot interaction videos, enabling prediction of long-sequence video trajectories for both high-level and low-level task instructions.
4.4.1.4. Joint Embedding Predictive Architecture (JEPA)
Proposed by Yann LeCun at Meta ([102]), JEPA aims to overcome the lack of real-world common sense in data-driven world models. Inspired by human brains, it introduces hierarchical planning and self-supervised learning in a high-level representation space.
- Hierarchical Planning: Breaks complex tasks into multiple abstraction levels, focusing on semantic features rather than pixel-level outputs.
- Self-supervised Learning: Trains networks to predict missing or hidden input data, enabling pretraining on large unlabeled datasets and fine-tuning for diverse tasks.
- Architecture: Comprises a perception module and a cognitive module, forming a world model that uses latent variables to capture essential information while filtering redundancies. This supports efficient decision-making and future scenario planning.
- Dual-System Concept: Balances "fast" intuitive reactions with "slow" deliberate reasoning.
4.4.2. World Model in Decision-Making
World models provide powerful internal representations, enabling agents to predict environmental dynamics and outcomes before acting. For decision-making, they serve two main roles: simulated validation and knowledge augmentation.
4.4.2.1. World Model for Simulated Validation
In robotics, testing decisions in the real world is expensive and time-consuming. World models allow agents to "try out" actions and observe likely consequences in a simulated environment, dramatically shortening iteration time and facilitating safe testing of high-risk scenarios. This ability helps agents identify and avoid potential mistakes, optimizing performance.
- NeBula [3]: Constructs probabilistic belief spaces using Bayesian filtering, enabling robots to reason across diverse structural configurations, even in unknown environments, and to predict outcomes under uncertainty.
- UniSim [212]: A generative simulator for real-world interactions, capable of simulating the visual outcomes of both high-level instructions and low-level controls. It integrates diverse datasets across different modalities.
4.4.2.2. World Model for Knowledge Augmentation
World models augment agents with predictive and contextual knowledge essential for strategy planning. By predicting future environmental states or enriching understanding of the world, they enable agents to anticipate outcomes, avoid mistakes, and optimize performance.
- World Knowledge Model (WKM) [146]: Imitates human mental world knowledge by providing global prior knowledge before a task and maintaining local dynamic knowledge during the task. It synthesizes global task knowledge and local state knowledge from experts and sampled trajectories.
- Agent-Pro [221]: Transforms an agent's interactions with its environment (especially with other agents) into "beliefs," representing the agent's social understanding and informing subsequent decisions.
- GovSim [144]: Explores the emergence of cooperative behaviors within societies of LLM agents. Agents gather information through multi-agent conversations, implicitly forming high-level insights and representations of the world model.
4.4.3. World Model in Embodied Learning
World models enable agents to learn new skills and behaviors efficiently. In contrast to model-free reinforcement learning (which is computationally expensive and data-inefficient), model-based reinforcement learning uses world models to streamline learning by simulating state transitions and generating data.
4.4.3.1. World Model for State Transitions
Model-based reinforcement learning leverages a world model that explicitly captures state transitions and dynamics, allowing agents to learn from simulated environments for safe, cost-effective, and data-efficient training. The world model creates virtual representations of the real world, enabling agents to explore hypothetical actions and refine policies without real-world risks.
- RobotDreamPolicy [145]: Learns a world model and develops the policy within it, drastically reducing real-environment interactions.
- DayDreamer [202]: Leverages Dreamer V2 (an RSSM-based world model) to encode observations into latent states and predict future states, achieving rapid skill learning on real robots with high sample efficiency.
- SWIM [128]: Utilizes Internet-scale human video data to understand human interactions and learn affordances. Initially trained on egocentric videos, it is fine-tuned with robot data to adapt to robot domains, enabling efficient task learning.
4.4.3.2. World Model for Data Generation
World models, especially diffusion-based world models, can synthesize data, which is crucial for embodied AI due to challenges in collecting diverse real-world data. They can synthesize realistic trajectory data, state representations, and dynamics, augmenting existing datasets or creating new ones.
- SynthER [118]: Utilizes diffusion-based world models to generate low-dimensional offline RL trajectory data that augments original datasets, enhancing performance in both offline and online settings.
- MTDiff [77]: Applies diffusion-based world models to generate multi-task trajectories, using expert trajectories as prompts to guide the generation of agent trajectories aligned with specific task objectives and dynamics.
- VPDD [76]: Trains trajectory prediction world models using a large-scale human operation dataset, then fine-tunes the action generation module with minimal labeled action data, significantly reducing the need for extensive robot interaction data for policy learning.
5. Experimental Setup
As a comprehensive survey paper, this article does not present its own novel experimental results, datasets, or evaluation metrics. Instead, it synthesizes the methodologies, findings, and challenges observed across numerous prior research works in large model empowered embodied AI. Therefore, this section will discuss the general trends in datasets, evaluation metrics, and baselines that are commonly found in the literature surveyed by the authors, providing context for the advancements and limitations discussed throughout the paper.
5.1. Datasets
The paper highlights a critical challenge in embodied AI: the scarcity of high-quality embodied data. While large models thrive on massive datasets, collecting real-world robot interaction data is prohibitively expensive and complex. The types of datasets commonly used or leveraged in the field include:
- Robot Trajectory Datasets: These are datasets of recorded robot actions and corresponding observations (e.g., camera images, joint states) from human demonstrations or robot trials.
  - VIMA [89]: Mentioned as a large dataset with 650,000 demonstrations for robot manipulation.
  - RT-1 [20]: Mentioned with 130,000 demonstrations (trajectories from 700+ tasks).
  - RT-X [186]: An initiative to collect robot arm data from over 60 laboratories to build the Open X-Embodiment dataset.
  - Open X-Embodiment dataset [186]: A large-scale, diverse dataset of robot trajectories.
- Internet-Scale Vision-Language Data: Large models leverage vast web-scraped datasets for pretraining, particularly VLMs and MLMs.
  - LAION-5B: A massive dataset with 5.75 billion text-image pairs, dwarfing current embodied datasets. This highlights the data disparity between general vision-language models and embodied AI.
- Human Video Datasets: These provide rich real-world dynamics and observations from human perspectives.
  - Ego4D [61]: Provides egocentric videos, offering insights into human behaviors and interactions, which can be used to improve contextual understanding for robotic tasks.
- Simulated Datasets: Simulators ([163]) are crucial for generating large and diverse datasets cost-effectively. They allow agents to be trained in virtual environments before deployment in the real world.
  - Minecraft (Auto MC-Reward [106]): A specific example of a simulation environment used for reward design in reinforcement learning.
  - D4RL benchmarks [57]: A collection of offline reinforcement learning datasets derived from various simulated locomotion and manipulation environments.

The choice of datasets is driven by the need to validate a method's performance on tasks that are representative of real-world embodied AI challenges, such as robotic manipulation, navigation, and human-robot interaction. Leveraging diverse data sources, from robot-specific interactions to general web data and human videos, is seen as critical for improving generalization and transferability.
5.2. Evaluation Metrics
The paper, being a survey, does not define its own evaluation metrics. However, it implicitly discusses the performance aspects that are crucial for embodied AI and are commonly evaluated in the literature it surveys. Based on the challenges and enhancements discussed, key evaluation metrics in this field generally fall into categories such as task success, efficiency, generalization, and robustness.
Here are some common evaluation metrics, which are not explicitly defined with formulas in the paper, but are standard in the field:
- Task Success Rate (SR) / Success Percentage:
  - Conceptual Definition: Measures the proportion of trials or episodes in which the agent successfully completes the assigned task according to predefined criteria. It directly quantifies the agent's effectiveness in achieving its goals.
  - Mathematical Formula: $ SR = \frac{\text{Number of successful trials}}{\text{Total number of trials}} \times 100\% $
  - Symbol Explanation:
    - SR: The Task Success Rate.
    - Number of successful trials: The count of attempts where the agent fully achieved the task objective.
    - Total number of trials: The total number of attempts made by the agent for a given task.
- Cumulative Reward (Return):
  - Conceptual Definition: In reinforcement learning, this metric sums all the rewards an agent receives from the environment over an episode, potentially discounted by a factor $\gamma$. It reflects the total utility or performance achieved by the agent over a task execution.
  - Mathematical Formula: $ R_t = \sum_{k=0}^{T} \gamma^k r_{t+k+1} $
  - Symbol Explanation:
    - $R_t$: The cumulative reward (return) at time step $t$.
    - $T$: The final time step of the episode.
    - $\gamma$: The discount factor (between 0 and 1, exclusive) that weighs immediate rewards more heavily than future rewards.
    - $r_{t+k+1}$: The instantaneous reward received at time step $t+k+1$.
- Sample Efficiency:
  - Conceptual Definition: Measures how quickly an agent can learn an effective policy from a limited amount of data or environmental interactions. Higher sample efficiency means less data or fewer interactions are needed, which is crucial for real-world robotics where data collection is expensive.
  - Mathematical Formula: Often evaluated by plotting performance (e.g., Task Success Rate or Cumulative Reward) against the number of environmental interactions, demonstrating a steeper learning curve or achieving a target performance with fewer samples. There is no single universal formula; it is typically analyzed visually or by comparing the data points needed for convergence.
  - Symbol Explanation: This is an observational metric rather than a single formula. It might involve comparing the number of interactions (e.g., timesteps, episodes, demonstrations) needed to reach a certain performance threshold.
- Generalization Capability:
  - Conceptual Definition: Assesses the agent's ability to perform well on tasks or in environments that were not seen during training. This is a critical aspect for embodied AI to operate in open, dynamic, and unstructured real-world settings.
  - Mathematical Formula: Typically measured by Task Success Rate or Cumulative Reward on unseen tasks, novel object instances, or different environmental layouts/conditions. There is no single formula; it involves comparing performance between the training distribution and out-of-distribution (OOD) scenarios.
  - Symbol Explanation: This is an aggregate measure, often reported as in-distribution performance versus out-of-distribution performance.
- Computational Cost / Inference Latency / Memory Footprint:
  - Conceptual Definition: These metrics quantify the resources required by the model for training and deployment. Computational cost (e.g., PFLOPs) measures total operations, inference latency (e.g., milliseconds) measures response time, and memory footprint (e.g., GB) measures RAM/VRAM usage. These are crucial for deploying embodied AI on resource-constrained platforms.
  - Mathematical Formula: Directly measured by hardware monitors or profiling tools. Latency = Time(response) - Time(request); PFLOPs are typically reported by specialized benchmarks.
  - Symbol Explanation: Directly corresponds to measured time, memory, or computational operations.
5.3. Baselines
Since this is a survey paper, it does not propose a new method to compare against baselines in its own experiments. However, throughout its review of the literature, the paper implicitly or explicitly refers to various baseline approaches that large model empowered embodied AI methods aim to surpass. These baselines represent the state-of-the-art or traditional methods against which new large model-based approaches are evaluated in the original research papers.
Common baselines and comparison points include:
- Traditional Rule-Based and Symbolic Planning Systems:
  - Rule-based methods [59, 75, 126]: Often using PDDL and heuristic search planners, these are baselines for high-level planning. LLM-enhanced planning aims to overcome their limitations in adaptability to unstructured or dynamic scenarios.
  - Classic Control Algorithms [81, 125, 2]: PID, LQR, and MPC serve as baselines for low-level execution, particularly for foundational skills. Learning-driven and modular control methods aim to enhance their adaptability to high-dimensional and uncertain dynamics.
- Traditional (Deep) Reinforcement Learning (DRL) Algorithms:
  - Q-learning [194], SARSA [164], DQN [130], PPO [166], SAC [68]: These represent strong model-free RL baselines. Large model empowered RL seeks to improve sample efficiency, reward function design, and policy expressiveness, especially in complex tasks or offline settings.
  - Decision Transformer [31]: This itself serves as a baseline in offline RL comparisons, particularly when LLMs are used for policy network construction, as seen in Reid [159].
- Traditional Imitation Learning (IL) Approaches:
  - Behavior Cloning [53]: The fundamental IL approach that diffusion-based and Transformer-based policy networks aim to significantly improve upon, especially regarding generalization, robustness, and handling multimodal action distributions.
- Prior Large Models or Architectures:
  - Specific LLMs/VLMs: Earlier versions of LLMs (e.g., GPT-3, PaLM) or VLMs (CLIP) are often used as components or prior baselines when new multimodal large models or VLA models (RT-2, Octo) are introduced. For instance, OpenVLA [94] is presented as an open-source alternative to RT-2.
  - Latent Space World Models: RSSM and its variants (Dreamer V2/V3) are baselines for newer Transformer-based (Genie, TWM) or diffusion-based world models (Sora, UniPi), which aim to improve prediction accuracy and generative capabilities in complex or pixel-space domains.
- Hierarchical vs. End-to-End Paradigms: The paper itself constructs a comparative framework between these two paradigms, where one often serves as a conceptual baseline or alternative to the other, depending on the specific task requirements and design goals.

These baselines are representative because they either embody the established practices in their respective sub-fields or represent the immediate predecessors that newer large model empowered approaches are designed to outperform or generalize beyond.
6. Results & Analysis
As a comprehensive survey paper, this article does not present its own novel experimental results. Instead, it synthesizes the findings and advancements from numerous prior research works, drawing conclusions about the effectiveness, advantages, and limitations of large model empowered embodied AI across different paradigms and methodologies. The "results" in this context are the summarized insights and comparisons derived from the vast body of literature reviewed by the authors.
6.1. Core Results Analysis
The core results of this survey highlight the transformative impact of large models on embodied AI, particularly in decision-making and embodied learning.
6.1.1. Impact on Autonomous Decision-Making
- Hierarchical Decision-Making: Large models (especially LLMs) significantly enhance all layers.
  - High-Level Planning: LLMs overcome the adaptability limitations of traditional rule-based planners by enabling structured language planning (e.g., generating PDDL or validating plans), natural language planning (decomposing tasks, filtering impractical actions), and programming language planning (generating executable code). This leads to more dynamic and flexible plans.
  - Low-Level Execution: LLMs integrate with learning-driven control (imitation learning, reinforcement learning) and modular control (calling pretrained models like CLIP or SAM), enhancing adaptability and generalization beyond traditional control algorithms.
  - Feedback and Enhancement: Large models enable advanced feedback mechanisms: self-reflection (iterative plan refinement, error correction), human feedback (real-time guidance, knowledge acquisition), and environment feedback (dynamic plan adjustment based on observations). VLMs simplify this by integrating visual and language reasoning.
- End-to-End Decision-Making: VLA models represent a breakthrough by directly mapping multimodal inputs to action outputs, minimizing error accumulation and enabling efficient learning. They leverage large models' prior knowledge for precise and adaptable task execution. Enhancements focus on:
  - Perception: Improving robustness to visual noise and enhancing 3D spatial understanding.
  - Trajectory Action Optimization: Using diffusion models or flow matching to generate smoother, more controllable, and high-precision action trajectories.
  - Training Cost Reduction: Developing lightweight architectures, efficient tokenization, and parallel decoding for real-time deployment on resource-constrained devices.
6.1.2. Impact on Embodied Learning
Large models significantly enhance imitation learning and reinforcement learning.
- Imitation Learning: Diffusion models are used to construct policy networks that capture complex multimodal action distributions, enhancing robustness and expressiveness. Transformer-based policy networks leverage self-attention to model long-term dependencies in trajectories, improving fidelity, consistency, and generalization across tasks and embodiments.
- Reinforcement Learning:
  - Reward Function Design: Large models automate the generation of dense, interpretable reward signals or reward functions, reducing reliance on manual design and mitigating reward hacking.
  - Policy Network Construction: Diffusion models, Transformer-based architectures, and LLMs empower offline reinforcement learning by enhancing policy expressiveness, capturing long-term dependencies, and leveraging pretrained knowledge for sample-efficient learning and generalization.
6.1.3. Role of World Models
The integration of world models is a crucial finding. They provide agents with internal simulations of environments, allowing them to anticipate future states and comprehend cause-and-effect.
- Decision-Making: World models enable simulated validation (testing actions without real-world risk) and knowledge augmentation (providing predictive and contextual knowledge for planning).
- Embodied Learning: They facilitate model-based reinforcement learning by simulating state transitions (for safe, cost-effective, and data-efficient training) and generating synthetic data (augmenting scarce real-world datasets).

Overall, the survey concludes that large models offer unparalleled capabilities for embodied AI, pushing the field closer to AGI by making agents more capable, adaptable, and efficient. However, significant challenges remain, which are discussed in Section 7.
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| Survey type | Related surveys | Publication time | Large models | Hierarchical | End-to-end | IL | RL | Other | World model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Specific | [29, 104, 113, 151, 191, 225] | 2024 | √ | × | × | × | × | × | × |
| | [210] | 2024 | × | √ | × | √ | √ | √ | |
| | [26] | 2024 | √ | × | × | × | × | × | × |
| | [7, 227] | 2025 | × | × | × | √ | √ | √ | × |
| | [188] | 2024 | × | √ | × | × | × | × | × |
| | [204] | 2024 | × | √ | × | × | × | √ | × |
| | [165] | 2025 | × | × | × | × | √ | × | × |
| | [43, 122] | 2024 | × | × | × | × | × | × | √ |
| Comprehensive | [119] | 2024 | √ | √ | √ | √ | √ | × | × |
| | [190] | 2024 | × | √ | √ | √ | × | × | × |
| | [95] | 2024 | × | √ | √ | √ | × | × | × |
| | [117] | 2024 | √ | √ | √ | √ | × | √ | × |
| Ours | | | √ | √ | √ | √ | √ | √ | √ |
- Analysis of Table 1 (Comparison of Surveys): This table is crucial for establishing the uniqueness and comprehensive nature of the current survey. It clearly shows that many existing surveys are specific (e.g., focusing only on large models, planning, or a particular learning method) or, if comprehensive, they often lack coverage of world models or the latest end-to-end decision-making and VLA models. The "Ours" row distinctively marks coverage across large models, hierarchical and end-to-end decision-making, imitation learning (IL), reinforcement learning (RL), other learning methods, and, uniquely, world models. This table validates the authors' claim of providing a more systematic and up-to-date review compared to previous works.

The following are the results from Table 2 of the original paper:
Table 2. Mainstream VLA models (P: perception, A: trajectory action, C: training cost). Cells that could not be recovered from the source are left blank.

| Model | Architecture | Contributions | P | A | C |
| --- | --- | --- | --- | --- | --- |
| RT-2 [234] (2023) | Vision Encoder: ViT-22B/ViT-4B; Language Encoder: PaLI-X/PaLM-E; Action Decoder: Symbol-tuning | Pioneering large-scale VLA, jointly fine-tuned on web-based VQA and robotic datasets, unlocking advanced emergent functionalities. | | | |
| Seer [63] (2023) | Vision Encoder: Visual backbone; Language Encoder: Transformer-based; Action Decoder: Autoregressive action prediction head | Efficiently predicts future video frames from language instructions by extending a pretrained text-to-image diffusion model. | √ | × | √ |
| Octo [180] (2024) | Vision Encoder: CNN; Language Encoder: T5-base; Action Decoder: Diffusion Transformer | First generalist policy trained on a massive multi-robot dataset (800k+ trajectories). A powerful open-source foundation model. | × | √ | × |
| OpenVLA [94] (2024) | Vision Encoder: DINOv2 + SigLIP; Language Encoder: Prismatic-7B; Action Decoder: Symbol-tuning | An open-source alternative to RT-2 with superior parameter efficiency and strong generalization via efficient LoRA fine-tuning. | × | × | √ |
| Mobility-VLA [37] (2024) | Vision Encoder: Long-context ViT + goal image encoder; Language Encoder: T5-based instruction encoder; Action Decoder: Hybrid diffusion + autoregressive ensemble | Leverages demonstration tour videos as an environmental prior, using a long-context VLM and topological graphs to navigate from complex multimodal instructions. | √ | √ | × |
| TinyVLA [198] (2025) | Vision Encoder: FastViT with low-latency encoding; Language Encoder: Compact language encoder (128-d); Action Decoder: Diffusion policy decoder (50M parameters) | Outpaces OpenVLA in speed and precision; eliminates pretraining needs; achieves 5x faster inference for real-time applications. | × | × | √ |
| DiffusionVLA [196] (2024) | Vision Encoder: Transformer-based visual encoder for contextual perception; Language Encoder: Autoregressive reasoning module with next-token prediction; Action Decoder: Diffusion policy head for robust action sequence generation | Leverages diffusion-based action modeling for precise control; superior contextual awareness and reliable sequence planning. | × | √ | × |
| PointVLA [105] (2025) | Vision Encoder: CLIP + 3D point cloud; Language Encoder: Llama-2; Action Decoder: Transformer with spatial token fusion | Excels at long-horizon and spatial reasoning tasks; avoids retraining by preserving pretrained 2D knowledge. | √ | × | × |
| VLA-Cache [208] (2025) | Vision Encoder: SigLIP with token memory buffer; Language Encoder: Prismatic-7B; Action Decoder: Transformer with dynamic token reuse | Faster inference with near-zero loss; dynamically reuses static features for real-time robotics. | × | × | √ |
| π0 [18] (2024) | Vision Encoder: PaliGemma VLM backbone; Language Encoder: PaliGemma (multimodal) | Employs flow matching to produce smooth, high-frequency (50Hz) action trajectories for real-time control. | | | |
| π0 Fast [143] (2025) | Vision Encoder: PaliGemma VLM backbone; Language Encoder: PaliGemma (multimodal); Action Decoder: Autoregressive Transformer with FAST | Introduces an efficient action tokenization scheme based on the Discrete Cosine Transform (DCT), enabling autoregressive models to handle high-frequency tasks and significantly speeding up training. | × | √ | √ |
| Edge-VLA [25] (2025) | Vision Encoder: SigLIP + DINOv2; Language Encoder: Qwen2 (0.5B parameters); Action Decoder: Joint control prediction (non-autoregressive) | Streamlined VLA tailored for edge devices, delivering 30-50Hz inference speed with OpenVLA-comparable performance, optimized for low-power, real-time deployment. | × | × | √ |
| OpenVLA-OFT [92] (2025) | Vision Encoder: SigLIP + DINOv2 (multi-view); Language Encoder: Llama-2 7B; Action Decoder: Parallel decoding with action chunking and L1 regression | An optimized fine-tuning recipe for VLAs that integrates parallel decoding and a continuous action representation to improve inference speed and task success. | × | | √ |
| SpatialVLA [147] (2025) | Vision Encoder: SigLIP from PaLiGemma2 4B; Language Encoder: PaLiGemma2; Action Decoder: Adaptive Action Grids and autoregressive transformer | Enhances spatial intelligence by injecting 3D information via "Ego3D Position Encoding" and representing actions with "Adaptive Action Grids". | √ | | × |
| MoLe-VLA [219] (2025) | Vision Encoder: Multi-stage ViT with STAR router; Language Encoder: CogKD-enhanced Transformer; Action Decoder: Sparse Transformer with dynamic routing | A brain-inspired architecture that uses dynamic layer-skipping (Mixture-of-Layers) and knowledge distillation to improve efficiency. | × | × | √ |
| DexGraspVLA [230] (2025) | Vision Encoder: Object-centric spatial ViT; Language Encoder: Transformer with grasp sequence reasoning; Action Decoder: Diffusion controller for grasp pose generation | A hierarchical framework for general dexterous grasping, using a VLM for high-level planning and a diffusion policy for low-level control. | × | √ | × |
| DexVLA [197] (2025) | | A large plug-in diffusion-based action expert and an embodiment curriculum learning strategy for efficient cross-robot training and adaptation. | | | |

- Analysis of Table 2 (Mainstream VLA Models): This table provides a valuable snapshot of the rapidly evolving VLA landscape. It demonstrates a clear trend toward:
  - Diverse Architectures: Utilizing various vision encoders (ViT, CNN, DINOv2, SigLIP, CLIP, 3D point clouds), language encoders (PaLM, T5-base, Prismatic-7B, Llama-2, PaliGemma, Qwen2), and action decoders (symbol-tuning, autoregressive action prediction heads, diffusion Transformers, flow matching).
  - Focused Enhancements: Models like Mobility-VLA, PointVLA, and SpatialVLA prioritize Perception (P) (e.g., long-context VLMs, 3D information). Octo, DiffusionVLA, and DexGraspVLA focus on Trajectory Action (A) optimization, often through diffusion models or flow matching for smoother, more precise control. OpenVLA, TinyVLA, VLA-Cache, π0 Fast, Edge-VLA, OpenVLA-OFT, and MoLe-VLA aim for Training Cost (C) reduction, emphasizing parameter efficiency, faster inference, and deployment on edge devices.
  - Generalization and Open-Source: Models like Octo and OpenVLA highlight the shift towards generalist policies and open-source foundation models trained on massive datasets. This table effectively illustrates the diverse strategies employed to improve VLA models' perception, action generation, and efficiency for end-to-end embodied AI.

The following are the results from Table 3 of the original paper:
| Aspect | Hierarchical | End-to-End |
| --- | --- | --- |
| Architecture | Perception: dedicated modules (e.g., SLAM, CLIP); High-level planning: structured, language, program; Low-level execution: predefined skill lists; Feedback: LLM self-reflection, human, environment | Perception: integrated in tokenization; Planning: implicit via VLA pretraining; Action generation: autoregressive generation with diffusion-based decoders; Feedback: inherent in the closed-loop cycle |
| Performance | Reliable in structured tasks; limited in dynamic settings | Superior in complex, open-ended tasks with strong generalization; dependent on training data |
| Interpretability | High, with clear modular design | Low, due to the black-box nature of neural networks |
| Generalization | Limited, due to reliance on human-designed structures | Strong, driven by large-scale pretraining; sensitive to data gaps |
| Real-time | Low, inter-module communications may introduce delays in complex scenarios | High, direct perception-to-action mapping minimizes processing overhead |
| Computational Cost | Moderate, with independent module optimization but coordination overhead | High, requiring significant resources for training |
| Application | Suitable for industrial automation, drone navigation, autonomous driving | Suitable for domestic robots, virtual assistants, human-robot collaboration |
| Advantages | High interpretability; high reliability; easy to integrate domain knowledge | Seamless multimodal integration; efficient in complex tasks; minimal error accumulation |
| Limitations | Sub-optimal, due to module coordination issues; low adaptability to unstructured settings | Low interpretability; high dependency on training data; high computational costs; low generalization in out-of-distribution scenarios |

Table 3. Comparison of hierarchical and end-to-end decision-making.
- Analysis of Table 3 (Comparison of Hierarchical and End-to-End Decision-Making): This table provides a concise yet comprehensive comparison, clarifying the trade-offs between the two major decision-making paradigms.
  - Architecture & Performance: Hierarchical systems emphasize modularity and explicit planning, offering high interpretability and reliability in structured tasks. However, they can suffer from sub-optimal solutions due to inter-module coordination issues and limited adaptability to dynamic settings. In contrast, end-to-end systems (like VLA models) integrate perception-to-action mapping, resulting in superior performance and strong generalization in complex, open-ended tasks, often with high real-time capability due to minimized processing overhead.
  - Interpretability & Generalization: The black-box nature of end-to-end neural networks leads to low interpretability compared to the clear modular design of hierarchical systems. While hierarchical systems have limited generalization due to human-designed structures, end-to-end systems achieve strong generalization through large-scale pretraining but are sensitive to data gaps.
  - Computational Cost & Applications: Hierarchical systems have moderate computational cost but can incur coordination overhead; end-to-end systems, especially during training, have high computational costs. Applications diverge: hierarchical approaches suit industrial automation, drone navigation, and autonomous driving, while end-to-end approaches suit domestic robots, virtual assistants, and human-robot collaboration. The table effectively summarizes that while hierarchical approaches offer control and transparency, end-to-end approaches provide adaptability and seamlessness, with the choice depending on task complexity, interpretability needs, and available computational resources.
6.3. Ablation Studies / Parameter Analysis
As a survey paper, the article does not present its own ablation studies or parameter analyses. However, the discussions throughout the paper implicitly highlight the importance of such analyses performed by the original research works it cites. For example:
- Effectiveness of LLM components in hierarchical planning: Studies on structured language planning (e.g., LLV [9], FSP-LLM [175]) would involve ablation studies to demonstrate the value of external validators or optimized prompt engineering in generating feasible plans.
- Contribution of RLHF in reward function design: Eureka [120] and Text2Reward [205] would have analyzed how human feedback or automated iterative optimization (Eureka) improves the quality and density of generated reward functions compared to manual or simpler LLM-based approaches.
- Impact of diffusion models vs. Transformers in policy networks: Papers like Diffusion Policy [36] or Decision Transformer [31] would have conducted experiments to show how their respective architectures (diffusion for diverse action distributions, Transformers for sequence modeling) improve imitation learning or offline RL performance.
- Efficiency gains in VLA models: Works like TinyVLA [198] or OpenVLA-OFT [92] would have performed analyses on factors like model compression techniques (knowledge distillation, quantization), action tokenization schemes, or parallel decoding to quantify their impact on inference speed, memory footprint, and data efficiency.
- Role of world model components: Research on latent space world models (Dreamer V2/V3 [72, 73]) or Transformer-based world models (IRIS [129]) would typically include ablations on components like stochastic vs. deterministic state representations or attention mechanisms to show their contribution to prediction accuracy and sample efficiency.

These implicit analyses from the surveyed literature collectively inform the conclusions drawn about the strengths and weaknesses of different approaches and the efficacy of various large model enhancements.
7. Conclusion & Reflections
7.1. Conclusion Summary
This comprehensive survey rigorously analyzes the burgeoning field of large model empowered embodied AI, focusing on two fundamental pillars: autonomous decision-making and embodied learning. The authors systematically detail how the capabilities of large models have revolutionized both hierarchical and end-to-end decision-making paradigms, enhancing planning, execution, and feedback mechanisms. Furthermore, the survey elucidates the profound impact of large models on embodied learning methods, particularly imitation learning and reinforcement learning. A key unique contribution is the integration and in-depth discussion of world models, highlighting their design and critical role in facilitating simulated validation, knowledge augmentation, state transitions, and data generation for robust embodied intelligence. The paper concludes that large models have unlocked unprecedented intelligent capabilities for embodied agents, making substantial strides towards Artificial General Intelligence (AGI).
7.2. Limitations & Future Work
Despite the significant advances, the survey identifies several persistent challenges that define the frontier of embodied AI research:
- Scarcity of Embodied Data:
  - Limitation: Real-world robotic data is immensely diverse and complex to collect, leading to datasets (e.g., VIMA, RT-1) that are orders of magnitude smaller than those for vision-language models (e.g., LAION-5B). This hinders generalization and scalability. Direct transfer from human video datasets (e.g., Ego4D) faces misalignment issues due to morphological differences between humans and robots.
  - Future Work:
    - Leveraging world models (especially diffusion-based) to synthesize high-quality new data from existing agent experiences (SynthER [118]).
    - Improving techniques for integrating large human datasets while addressing reality-gap and alignment issues.
- Continual Learning for Long-Term Adaptability:
  - Limitation: Embodied AI systems need to continually update knowledge and optimize strategies in dynamic environments while avoiding catastrophic forgetting of previously acquired skills. Efficient autonomous exploration to balance new experiences with existing knowledge remains challenging in high-dimensional, sparse-reward scenarios. The unpredictability of the real world (sensor degradation, mechanical wear) further complicates learning.
  - Future Work:
    - Enhancing experience replay [10] and regularization techniques [98] to mitigate catastrophic forgetting.
    - Developing data mixing strategies [100] to reduce feature distortion.
    - Improving self-supervised learning to drive active exploration via intrinsic motivation.
    - Incorporating multi-agent collaboration mechanisms to accelerate individual learning.
- Computation and Deployment Efficiency:
  - Limitation: The sophisticated nature of large models demands substantial computational resources (e.g., DiffusionVLA [22] requires hundreds of GPUs and weeks of training, resulting in seconds of inference latency). High memory footprints (e.g., RT-2 [234] requiring 20GB of video memory) hinder deployment on resource-constrained edge devices. Cloud-based deployment faces issues of data privacy, security, and real-time constraints.
  - Future Work:
    - Further developing Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA [82] to reduce fine-tuning costs.
    - Advancing model compression techniques (e.g., knowledge distillation, quantization) to deploy models on limited hardware (e.g., TinyVLA [234] achieving low latency and a small memory footprint).
    - Designing inherently lightweight architectures and hardware acceleration (e.g., MiniGPT-4 [232]).
- Sim-to-Real Gap:
  - Limitation: Training agents in simulators is scalable and cost-effective, but fundamental discrepancies between simulated and real-world environments (e.g., inaccurate physical dynamics, visual rendering differences) lead to the sim-to-real gap. Policies trained in simulation often fail unexpectedly in reality, especially in out-of-distribution scenarios. Modeling real-world complexity accurately is inherently challenging, and small errors accumulate in long-term decision-making.
  - Future Work:
    - Developing advanced simulators with more precise physics modeling and photorealistic rendering (e.g., Genesis [134]).
    - Researching domain adaptation and domain randomization techniques to make policies more robust to variations between simulation and reality.
    - Exploring methods that learn directly from real-world interactions more efficiently, possibly with world models aiding in data generation.
7.3. Personal Insights & Critique
This survey provides an exceptionally timely and well-structured overview of the rapidly evolving large model empowered embodied AI landscape. The "dual analytical approach" is particularly effective in clarifying the complex interplay between diverse large models, decision-making paradigms, and learning methodologies. The inclusion of world models as a distinct and critical component is a significant strength, reflecting their growing importance in enabling more intelligent and autonomous agents. The detailed comparisons in Table 1, Table 2, and Table 3 are invaluable for researchers seeking to understand the current state-of-the-art and identify promising avenues.
One key inspiration drawn from this paper is the sheer potential of VLA models for end-to-end decision-making. The idea of directly mapping multimodal observations and instructions to actions, bypassing complex modular pipelines, represents a paradigm shift that could significantly simplify robot programming and enhance adaptability. The rapid advancements in trajectory action optimization and training cost reduction for VLA models suggest that real-time, general-purpose embodied agents might be closer than previously imagined. The concept of flow matching for continuous action generation, in particular, seems like a powerful technique for achieving both precision and efficiency.
However, a potential area for further improvement or a subtle unverified assumption lies in the inherent generalization capabilities attributed to large models when applied to embodied AI. While LLMs and VLMs show impressive generalization on digital data, the sim-to-real gap and the scarcity of embodied data remain formidable hurdles. The paper acknowledges these limitations, but the implied path forward heavily relies on either synthesizing more realistic data or improving transferability from internet-scale data. A critical question remains: how much of the linguistic and visual common sense learned from the internet truly grounds itself into robust, safe, and physical common sense required for dynamic real-world interaction? The black-box nature of end-to-end VLA models, while offering seamlessness, might also mask unverified assumptions about the underlying physical reasoning, making debugging and ensuring safety challenging in safety-critical applications.
The methods and conclusions of this paper are highly transferable and applicable to various domains beyond traditional robotics, such as autonomous driving, virtual reality/augmented reality agents, smart manufacturing, and elderly care robots. Any system requiring intelligent agents to operate physically in dynamic, uncertain environments can benefit from these insights.
In critique, while the survey is extensive, it primarily summarizes existing work. Future iterations could perhaps delve deeper into the mechanisms by which large models specifically acquire physical common sense or causal reasoning relevant to the embodied world, rather than simply stating that they enhance planning or perception. For example, what specific architectural choices or training objectives allow LLMs to reason about object permanence or physical forces beyond abstract language correlation? This would bridge a crucial gap in understanding how large models genuinely "embody" intelligence.