Large Language Model Agent: A Survey on Methodology, Applications and Challenges
TL;DR Summary
This survey systematically analyzes large language model agents' architecture, collaboration, and evolution, unifying fragmented research and outlining evaluation, tools, and applications to guide future advances toward artificial general intelligence.
Abstract
The era of intelligent agents is upon us, driven by revolutionary advancements in large language models. Large Language Model (LLM) agents, with goal-driven behaviors and dynamic adaptation capabilities, potentially represent a critical pathway toward artificial general intelligence. This survey systematically deconstructs LLM agent systems through a methodology-centered taxonomy, linking architectural foundations, collaboration mechanisms, and evolutionary pathways. We unify fragmented research threads by revealing fundamental connections between agent design principles and their emergent behaviors in complex environments. Our work provides a unified architectural perspective, examining how agents are constructed, how they collaborate, and how they evolve over time, while also addressing evaluation methodologies, tool applications, practical challenges, and diverse application domains. By surveying the latest developments in this rapidly evolving field, we offer researchers a structured taxonomy for understanding LLM agents and identify promising directions for future research. The collection is available at https://github.com/luo-junyu/Awesome-Agent-Papers.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: Large Language Model Agent: A Survey on Methodology, Applications and Challenges
- Authors: Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, Rongcheng Tu, Xiao Luo, Wei Ju, Zhiping Xiao, Yifan Wang, Meng Xiao, Chenwu Liu, Jingyang Yuan, Shichang Zhang, Yiqiao Jin, Fan Zhang, Xian Wu, Hanqing Zhao, Dacheng Tao, and Philip S. Yu. The large number of authors from various institutions suggests a comprehensive, collaborative effort to survey this broad field.
- Journal/Conference: This paper is a preprint available on arXiv. Preprints are research articles shared publicly before or during the formal peer-review process.
- Publication Year: The paper was posted to arXiv on March 27, 2025, making it a very recent preprint that has not yet been formally published.
- Abstract: The abstract introduces Large Language Model (LLM) agents as a potential pathway toward Artificial General Intelligence (AGI), highlighting their goal-driven and adaptive capabilities. The authors propose a systematic survey that deconstructs LLM agents using a methodology-centered taxonomy. This taxonomy connects three core dimensions: how agents are constructed, how they collaborate, and how they evolve. The survey aims to unify existing research, provide a structured architectural perspective, and cover practical aspects like evaluation, tools, challenges (e.g., security, ethics), and applications. The goal is to offer researchers a clear framework for understanding the field and to identify future research directions.
- Original Source Link:
- arXiv Link: https://arxiv.org/abs/2503.21460
- PDF Link: https://arxiv.org/pdf/2503.21460v1.pdf
- The paper also provides a link to a curated collection of related papers: https://github.com/luo-junyu/Awesome-Agent-Papers
2. Executive Summary
- Background & Motivation (Why): The field of artificial intelligence is witnessing a paradigm shift towards "intelligent agents" powered by Large Language Models (LLMs). Unlike traditional AI, these LLM agents can perceive environments, reason about goals, and execute actions dynamically. However, research in this area is fragmented and evolving at an explosive pace. There is a critical need for a systematic framework to organize and understand the principles behind how these agents are designed, how they work together, and how they improve. This paper addresses that gap by providing a comprehensive survey that unifies disparate research threads under a single, coherent taxonomy.
- Main Contributions / Findings (What):
- Methodology-Centered Taxonomy: The paper introduces a novel taxonomy for understanding LLM agents, focusing on the core methodologies behind their creation and operation. This provides a structured way to analyze and compare different agent systems.
- Build-Collaborate-Evolve Framework: The central contribution is a holistic framework that examines LLM agents through three interconnected stages of their lifecycle:
- Construction (Build): How an individual agent is built, including its profile, memory, planning, and action capabilities.
- Collaboration (Collaborate): How multiple agents interact, using centralized, decentralized, or hybrid architectures.
- Evolution (Evolve): How agents learn and improve over time, through self-optimization, multi-agent co-evolution, or external feedback.
- Comprehensive Ecosystem Review: Beyond methodology, the survey provides a broad overview of the LLM agent ecosystem, including evaluation benchmarks, development tools, real-world challenges (security, privacy, ethics), and diverse applications. This offers a 360-degree view of the field, from theory to practice.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Large Language Model (LLM): A type of AI model (like GPT-4 or Llama) trained on vast amounts of text data. LLMs excel at understanding and generating human-like language, making them powerful "brains" for reasoning and decision-making.
- Intelligent Agent: An autonomous entity that can perceive its environment, make decisions, and take actions to achieve specific goals.
- LLM Agent: An intelligent agent that uses an LLM as its core processing unit or "cognitive engine." It leverages the LLM's reasoning, planning, and language capabilities to interact with digital or physical environments.
- Artificial General Intelligence (AGI): A hypothetical form of AI that possesses the ability to understand, learn, and apply its intelligence to solve any intellectual task that a human being can. LLM agents are considered a potential stepping stone toward AGI.
- Previous Works: The authors explicitly position their survey as a more comprehensive and methodologically focused work compared to prior surveys. They note that previous efforts have been narrower in scope, focusing on:
  - Specific applications like gaming ([11], [12]).
  - Particular deployment environments ([13], [14]).
  - Specific capabilities like multi-modality ([15]) or security ([16]).
  - Other surveys that provided broad overviews without a deep methodological breakdown ([1], [17]) or focused on just one aspect, like multi-agent interaction ([18]) or workflows ([19]).
- Technological Evolution: The paper attributes the rise of modern LLM agents to the convergence of three key technological advancements:
- Unprecedented Reasoning Capabilities: LLMs have become powerful general-purpose reasoners.
- Tool Manipulation: Agents can now interact with external software, APIs, and environments.
- Sophisticated Memory: New architectures allow agents to accumulate experiences and learn over time.
- Differentiation: This survey distinguishes itself through its unique `Build-Collaborate-Evolve` framework. Unlike previous work that often treated individual agent design and multi-agent systems separately, this paper presents an integrated perspective, showing the continuity from how a single agent is built to how groups of agents collaborate and evolve. Its methodology-centered taxonomy provides a deeper, more structured analysis of the architectural foundations of LLM agents.
4. Methodology (Core Technology & Implementation)
The core contribution of this paper is its detailed taxonomy of LLM agent methodologies, organized into the Build-Collaborate-Evolve framework.
Image 1: This diagram illustrates the paper's overall structure. The central focus is "Agent Methodology," broken down into Construction, Collaboration, and Evolution. This is surrounded by practical considerations like Evaluation, Tools, Real-World Issues, and Applications.
Image 2: This figure provides a detailed map of the paper's core taxonomy. It shows how the three main pillars—Construction, Collaboration, and Evolution—are further deconstructed into their fundamental components.
4.1 Agent Construction (How an agent is built)
This phase covers the foundational components of a single agent.
- 1. Profile Definition: This sets the agent's identity and behavioral rules.
  - Human-Curated Static Profiles: Experts manually define an agent's role, rules, and knowledge. This ensures predictable, consistent behavior, ideal for structured tasks. Examples include `Camel` and `AutoGen`, where agents have predefined roles like "assistant" and "user proxy," or `MetaGPT` and `ChatDev`, where agents play specific roles in a software development team (e.g., "programmer," "product manager"); a minimal sketch of this pattern follows the list.
  - Batch-Generated Dynamic Profiles: Agent profiles are generated automatically with controlled variations in personality, knowledge, or values. This is useful for simulating diverse social behaviors or creating varied user profiles for testing, as seen in human behavior simulation studies. `DSPy` can be used to optimize these generation parameters.
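To make the static-profile pattern concrete, here is a minimal Python sketch; the `AgentProfile` class and the example role are hypothetical illustrations, not code from `Camel`, `MetaGPT`, or any other surveyed system.

```python
from dataclasses import dataclass, field

@dataclass
class AgentProfile:
    """Hypothetical container for a human-curated static profile."""
    name: str                                        # role label, e.g. "programmer"
    role: str                                        # natural-language role description
    rules: list[str] = field(default_factory=list)   # behavioral constraints

    def system_prompt(self) -> str:
        # Compile the profile into the system prompt given to the underlying LLM.
        rules = "\n".join(f"- {r}" for r in self.rules)
        return f"You are the {self.name}. {self.role}\nRules:\n{rules}"

# A static software-team role in the spirit of MetaGPT/ChatDev.
programmer = AgentProfile(
    name="programmer",
    role="You write clean, tested Python code from the product manager's specs.",
    rules=["Ask the product manager when a requirement is ambiguous."],
)
print(programmer.system_prompt())
```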
- 2. Memory Mechanism: This enables agents to store and retrieve information.
  - Short-Term Memory: Holds recent conversational history and environmental feedback for immediate context. It is essential for step-by-step reasoning but is limited by the LLM's context window size and is transient. Used in frameworks like `ReAct` and `ChatDev`.
  - Long-Term Memory: Stores experiences and knowledge for future use. This can be implemented as:
    - Skill Libraries: Codified procedures that an agent learns, like the Minecraft skills in `Voyager`.
    - Experience Repositories: Databases of past successes and failures, as in `Reflexion`.
    - Tool Synthesis: Frameworks where agents create new tools from existing ones, as in `TPTU` and `OpenAgents`.
  - Knowledge Retrieval as Memory: Instead of relying only on internal memory, agents use external knowledge sources. This is commonly done with Retrieval-Augmented Generation (RAG), where agents pull information from text corpora or knowledge graphs (`GraphRAG`) to answer questions or perform tasks (a toy combination of short-term and retrieval-based memory is sketched after this list).
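The sketch below pairs a bounded short-term buffer with a crude retrieval step over a long-term store. All names are hypothetical, and the word-overlap scoring is a stand-in for the embedding-based retrieval a real RAG pipeline would use.

```python
from collections import deque

class AgentMemory:
    """Toy memory: bounded short-term buffer plus a keyword-scored long-term store."""

    def __init__(self, short_term_size: int = 8):
        self.short_term = deque(maxlen=short_term_size)  # recent turns only, transient
        self.long_term: list[str] = []                   # persistent experiences

    def observe(self, event: str) -> None:
        self.short_term.append(event)
        self.long_term.append(event)  # a real system would embed and index this

    def retrieve(self, query: str, k: int = 3) -> list[str]:
        # Stand-in for RAG retrieval: rank entries by word overlap with the query.
        words = set(query.lower().split())
        ranked = sorted(self.long_term,
                        key=lambda e: len(words & set(e.lower().split())),
                        reverse=True)
        return ranked[:k]

memory = AgentMemory()
memory.observe("Crafting a pickaxe requires wood and a crafting table.")
memory.observe("A furnace smelts iron ore into ingots.")
print(memory.retrieve("crafting a pickaxe", k=1))
```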
- 3. Planning Capability: This allows agents to break down complex goals into manageable steps.
  - Task Decomposition Strategies:
    - Single-Path Chaining: A linear plan of subtasks is created and executed sequentially. The simplest form is Chain-of-Thought (CoT) prompting. Dynamic planning improves on this by generating the next step based on the current situation.
    - Multi-Path Tree Expansion: A tree-like structure of possible reasoning paths is explored, allowing the agent to backtrack and correct mistakes. The Tree-of-Thought (ToT) method is a prime example. This can be combined with search algorithms such as Monte Carlo Tree Search for complex tasks in robotics or game playing (a minimal search skeleton follows this list).
    - Feedback-Driven Iteration: The agent refines its plan based on feedback from various sources: the environment (e.g., a robot bumping into a wall), humans (corrections), the model itself (self-verification), or other agents (collaboration).
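Below is a minimal skeleton of multi-path tree expansion, assuming stand-in `propose` and `score` callables where a real system would query an LLM. It is a sketch of the ToT idea, not the published algorithm.

```python
from typing import Callable

def tree_of_thought(task: str,
                    propose: Callable[[str], list[str]],
                    score: Callable[[str], float],
                    beam: int = 2, depth: int = 3) -> str:
    """Breadth-first multi-path expansion: grow candidate reasoning paths,
    keep the best `beam` at each level, and return the top-scoring path."""
    frontier = [task]
    for _ in range(depth):
        candidates = [f"{path}\n{step}" for path in frontier for step in propose(path)]
        if not candidates:
            break  # nothing left to expand; a real agent could backtrack here
        frontier = sorted(candidates, key=score, reverse=True)[:beam]
    return frontier[0]

# Toy stand-ins: a real system would ask the LLM to propose and rate steps.
best = tree_of_thought(
    "Goal: prove n^2 is even when n is even",
    propose=lambda path: ["write n = 2k", "expand (2k)^2 = 4k^2"],
    score=len,  # placeholder heuristic; an LLM judge would score coherence
)
print(best)
```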
- 4. Action Execution: This is the agent's ability to interact with its environment.
  - Tool Utilization: Using external tools to overcome LLM limitations. This involves deciding when to use a tool and which tool to select. Examples include using a calculator for math, a search engine for real-time information, or a code interpreter for programming (a minimal dispatch sketch follows this list).
  - Physical Interaction: For embodied agents (e.g., robots), this involves executing actions in the real world, which requires understanding hardware constraints, social norms, and other agents.
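Here is a toy tool-dispatch sketch, assuming the agent's LLM emits decisions of the form `tool_name: argument`; the registry and the parsing convention are invented for illustration.

```python
from typing import Callable

# Hypothetical tool registry; a real agent would also expose JSON schemas
# so the LLM can see each tool's name, purpose, and argument format.
TOOLS: dict[str, Callable[[str], str]] = {
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),  # demo only, unsafe in production
    "search": lambda query: f"(stub) top result for: {query}",
}

def act(decision: str) -> str:
    """Dispatch an LLM decision of the form 'tool_name: argument' to a tool."""
    tool_name, _, argument = decision.partition(":")
    tool = TOOLS.get(tool_name.strip())
    if tool is None:
        return f"error: unknown tool {tool_name!r}"  # fed back so the agent can retry
    return tool(argument.strip())

print(act("calculator: 3*(4+5)"))        # -> 27
print(act("search: LLM agent surveys"))  # -> stubbed search result
```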
4.2 Agent Collaboration (How agents work together)
This section explores how multiple agents can team up to solve problems that are too complex for a single agent.
- 1. Centralized Control: A "manager" or "controller" agent coordinates the work of other agents.
  - Explicit Controller Systems: A dedicated agent acts as the central coordinator. For example, in `Coscientist`, a human controller directs specialized agents in a scientific workflow; in `MetaGPT`, a manager agent assigns tasks to programmer, tester, and other role agents (a minimal routing sketch follows this list).
  - Differentiation-based Systems: A single powerful LLM assumes different sub-roles as needed, acting as its own controller. `Meta-Prompting` is an example where a single model coordinates sub-tasks by assigning them to "virtual" specialized agents.
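A minimal sketch of the explicit-controller pattern, assuming both manager and workers are prompt-in, text-out callables (stubs stand in for real LLM calls):

```python
from typing import Callable

LLM = Callable[[str], str]  # any prompt-in, text-out model

def manager_dispatch(task: str, manager: LLM, workers: dict[str, LLM]) -> str:
    """Explicit-controller pattern: the manager picks one specialized worker,
    and that worker produces the final output."""
    roles = ", ".join(workers)
    choice = manager(f"Task: {task}\nPick exactly one worker from [{roles}].").strip()
    worker = workers.get(choice, next(iter(workers.values())))  # fall back if the pick is invalid
    return worker(f"As the {choice} agent, complete this task: {task}")

# Stub LLMs stand in for real API calls.
print(manager_dispatch(
    "Write a unit test for the sorting module",
    manager=lambda prompt: "tester",
    workers={
        "programmer": lambda prompt: "def sort(xs): return sorted(xs)",
        "tester": lambda prompt: "def test_sort(): assert sort([2, 1]) == [1, 2]",
    },
))
```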
- 2. Decentralized Collaboration: Agents interact directly with each other without a central manager.
  - Revision-based Systems: Agents work on a shared output, iteratively refining each other's contributions. In `MedAgents`, different medical expert agents propose and modify a diagnosis until a consensus is reached.
  - Communication-based Systems: Agents engage in dialogue or debate to solve a problem. `AutoGen` implements a group chat where agents can discuss and refine solutions, while `MAD` (Multi-Agent Debate) uses structured debate to improve reasoning and avoid fixation on initial bad ideas (a minimal debate loop is sketched after this list).
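The following sketch captures a MAD-style debate loop under the assumption that each agent is a prompt-to-text callable; the prompt wording and stopping rule are illustrative only.

```python
from typing import Callable

LLM = Callable[[str], str]

def debate(question: str, agents: list[LLM], rounds: int = 2) -> list[str]:
    """Each agent answers, then repeatedly revises after reading the others'
    answers; a judge agent or majority vote would pick the final result."""
    answers = [agent(question) for agent in agents]
    for _ in range(rounds):
        answers = [
            agent(f"Question: {question}\n"
                  f"Other answers: {answers[:i] + answers[i + 1:]}\n"
                  "Critique them, then state your revised answer.")
            for i, agent in enumerate(agents)
        ]
    return answers

# Stub agents that always agree, just to show the call shape.
print(debate("Is 17 prime?", agents=[lambda p: "Yes, 17 is prime."] * 3))
```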
- 3. Hybrid Architecture: These systems combine centralized and decentralized approaches to get the best of both worlds.
  - Static Systems: The collaboration structure is predefined. `CAMEL` uses decentralized role-playing within teams but centralized coordination between teams, while `AFlow` uses a three-tier hierarchy with different collaboration styles at each level.
  - Dynamic Systems: The collaboration structure adapts in real time based on the task. `DyLAN` identifies the most important agents for a task and adjusts the communication topology to focus on them; `MDAgents` routes simple tasks to a single agent and complex tasks to a hierarchical team.

Here is a transcription of Table 1 from the paper, summarizing these collaboration methods.
Table 1: A summary of agent collaboration methods.
| Category | Method | Key Contribution |
|---|---|---|
| Centralized Control | Coscientist [73] | Human-centralized experimental control |
| | LLM-Blender [74] | Cross-attention response fusion |
| | MetaGPT [27] | Role-specialized workflow management |
| | AutoAct [75] | Triple-agent task differentiation |
| | Meta-Prompting [76] | Meta-prompt task decomposition |
| | WJudge [77] | Weak-discriminator validation |
| Decentralized Collaboration | MedAgents [78] | Expert voting consensus |
| | ReConcile [79] | Multi-agent answer refinement |
| | METAL [115] | Domain-specific revision agents |
| | DS-Agent [116] | Database-driven revision |
| | MAD [80] | Structured anti-degeneration protocols |
| | MADR [81] | Verifiable fact-checking critiques |
| | MDebate [82] | Stubborn-collaborative consensus |
| | AutoGen [26] | Group-chat iterative debates |
| Hybrid Architecture | CAMEL [25] | Grouped role-play coordination |
| | AFlow [29] | Three-tier hybrid planning |
| | EoT [117] | Multi-topology collaboration patterns |
| | DiscoGraph [118] | Pose-aware distillation |
| | DyLAN [119] | Importance-aware topology |
| | MDAgents [120] | Complexity-aware routing |
4.3 Agent Evolution (How agents improve)
This dimension explores mechanisms that enable agents to learn and adapt over time.
- 1. Autonomous Optimization and Self-Learning: Agents improve without direct human supervision.
  - Self-Supervised Learning: Agents learn from unlabeled data they generate. `Evolutionary optimization` techniques allow models to merge and adapt efficiently.
  - Self-Reflection and Self-Correction: Agents critique and refine their own work. `SELF-REFINE` uses an iterative feedback loop, while `STaR` (Self-Taught Reasoner) trains models to generate rationales and then filters the correct ones to fine-tune themselves (a minimal refinement loop is sketched after this list).
  - Self-Rewarding and Reinforcement Learning (RL): Agents generate their own reward signals to guide learning. A model can act as its own "judge" (`LLM-as-a-Judge`) to score its outputs, providing a reward signal for improvement via RL.
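A minimal sketch of the generate-critique-revise loop in the spirit of SELF-REFINE, assuming a stand-in `model` callable and a simple "DONE" stopping convention invented for illustration:

```python
from typing import Callable

LLM = Callable[[str], str]

def self_refine(task: str, model: LLM, max_iters: int = 3) -> str:
    """Generate a draft, ask the same model for critique, revise, and stop
    once the critique signals no remaining flaws."""
    draft = model(task)
    for _ in range(max_iters):
        feedback = model(f"Task: {task}\nDraft: {draft}\n"
                         "List concrete flaws, or reply DONE if none remain.")
        if "DONE" in feedback:
            break
        draft = model(f"Task: {task}\nDraft: {draft}\nFeedback: {feedback}\n"
                      "Rewrite the draft to address the feedback.")
    return draft

# Stub model: approves on the critique step, so the loop exits after one pass.
stub = lambda p: "DONE" if "flaws" in p else "LLM agents: build, collaborate, evolve."
print(self_refine("Summarize the survey in one line.", model=stub))
```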
- 2. Multi-Agent Co-Evolution: Agents improve by interacting with other agents.
  - Cooperative and Collaborative Learning: Agents learn by working together. In `ProAgent`, agents infer their teammates' intentions to coordinate better; in `CAMEL`, role-playing agents collaborate to solve tasks.
  - Competitive and Adversarial Co-Evolution: Agents improve by competing or debating. `Red-Teaming` involves one agent trying to find flaws ("red team") in another agent to make it more robust, while `Multi-Agent Debate` frameworks force agents to defend their reasoning and critique others, improving factuality (a minimal red-team round is sketched after this list).
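The sketch below shows one adversarial round in the spirit of red-teaming, with hypothetical `attacker`, `target`, and `judge` callables and an invented UNSAFE/SAFE verdict convention:

```python
from typing import Callable

LLM = Callable[[str], str]

def red_team_round(target: LLM, attacker: LLM, judge: LLM, n_probes: int = 5) -> list[str]:
    """One adversarial round: the attacker probes the target, the judge flags
    unsafe replies, and flagged probes become data for hardening the target."""
    failures = []
    for i in range(n_probes):
        probe = attacker(f"Write adversarial prompt #{i} that might make the target misbehave.")
        reply = target(probe)
        verdict = judge(f"Prompt: {probe}\nReply: {reply}\nAnswer UNSAFE or SAFE.")
        if "UNSAFE" in verdict:
            failures.append(probe)  # would feed fine-tuning / alignment in the next round
    return failures

# Stubs illustrate the loop; real red-teaming uses three separate models.
found = red_team_round(
    target=lambda p: "I refuse.",
    attacker=lambda p: p.upper(),
    judge=lambda p: "SAFE",
)
print(found)  # -> [] since the stub target always refuses
```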
- 3. Evolution via External Resources: Agents improve by leveraging outside information and feedback.
  - Knowledge-Enhanced Evolution: Agents integrate structured knowledge (e.g., from a knowledge base) to improve planning and decision-making, as seen in `KnowAgent`.
  - External Feedback-Driven Evolution: Agents use feedback from tools or environments. `CRITIC` allows an agent to use an external tool to check its output and correct it, while `SelfEvolve` uses feedback from a code executor to debug and improve its generated code.

Below is a transcription of Table 2 from the paper, summarizing agent evolution methods.
Table 2: A summary of agent evolution methods.
| Category | Method | Key Contribution |
|---|---|---|
| Self-Supervised Learning | SE [86] | Adaptive token masking for pretraining |
| | Evolutionary Optimization [87] | Efficient model merging and adaptation |
| | DiverseEvol [88] | Improved instruction tuning via diverse data |
| Self-Reflection & Self-Correction | SELF-REFINE [89] | Iterative self-feedback for refinement |
| | STaR [90] | Bootstrapping reasoning with few rationales |
| | V-STaR [91] | Training a verifier using DPO |
| | Self-Verification [92] | Backward verification for correction |
| Self-Rewarding & RL | Self-Rewarding [93] | LLM-as-a-Judge for self-rewarding |
| | RLCD [94] | Contrastive distillation for alignment |
| | RLC [95] | Evaluation-generation gap for optimization |
| Cooperative Co-Evolution | ProAgent [96] | Intent inference for teamwork |
| | CORY [97] | Multi-agent RL fine-tuning |
| | CAMEL [25] | Role-playing framework for cooperation |
| Competitive Co-Evolution | Red-Team LLMs [98] | Adversarial robustness training |
| | Multi-Agent Debate [82] | Iterative critique for refinement |
| | MMAD [99] | Debate-driven divergent thinking |
| Knowledge-Enhanced Evolution | KnowAgent [83] | Action knowledge for planning |
| | WKM [84] | Synthesizing prior and dynamic knowledge |
| Feedback-Driven Evolution | CRITIC [100] | Tool-assisted self-correction |
| | STE [101] | Simulated trial-and-error tool learning |
| | SelfEvolve [102] | Automated debugging and refinement |
5. Experimental Setup
As a survey paper, it does not conduct its own experiments. Instead, Section 3 reviews the landscape of evaluation benchmarks and tools used by the community to assess LLM agents.
Image 3: This diagram organizes the evaluation ecosystem. The left side lists benchmarks for general, domain-specific, and collaborative assessment. The right side shows tools used by agents, created by agents, and used for deploying agents.
- Datasets & Benchmarks: The paper categorizes evaluation frameworks into three types:
  - General Assessment Frameworks: These benchmarks test a broad range of agent capabilities across multiple environments.
    - `AgentBench`: Evaluates agents across eight different interactive environments.
    - `Mind2Web`: Tests agents on real-world tasks across 137 websites.
    - `MMAU`: Breaks down agent intelligence into five core competencies using over 3,000 tasks.
    - `VisualAgentBench`: Focuses on multimodal agents that handle visual tasks.
    - `BENCHAGENTS`: A framework in which LLM agents themselves help create new evaluation tasks, making the benchmark self-evolving.
  - Domain-Specific Evaluation Systems: These are tailored to test agent performance in specialized fields.
    - Healthcare: `MedAgentBench` and `AI Hospital` simulate clinical tasks and workflows.
    - Autonomous Driving: `LaMPilot` evaluates agents on code generation for driving systems.
    - Data Science: `DSEval` and `DA-Code` cover the full data science lifecycle.
    - Security: `AgentHarm` assesses the risk of agents being used for malicious purposes.
    - Real-World Simulation: `OSWorld` provides a real computer environment (Ubuntu/Windows/macOS) for agents to perform tasks, while `EgoLife` uses egocentric video to test agents on daily human activities.
  - Collaborative Evaluation of Complex Systems: These benchmarks assess the performance of multi-agent systems.
    - `TheAgentCompany`: Simulates a software company to test collaborative coding and web interaction.
    - `MLRB` and `MLE-Bench`: Evaluate multi-agent teams on competitive machine learning research and engineering tasks (like Kaggle competitions).
- Evaluation Metrics: The paper does not propose specific metrics. It notes that evaluation is moving beyond simple success rate toward more nuanced, multi-dimensional assessments of reasoning, planning, and adaptability, as captured by the diverse tasks in the benchmarks listed above (a toy benchmark harness is sketched at the end of this section).
- Tools Ecosystem: The paper also surveys the tools that are integral to the agent ecosystem.
  - Tools used by LLM agents:
    - Knowledge Retrieval: Search engines like DuckDuckGo (`ToolCoder`) or commercial APIs (`WebGPT`).
    - Computation: Python interpreters (`CodeActAgent`) and calculators (`Toolformer`).
    - API Interactions: Tools for interacting with external services via REST APIs (`RestGPT`).
  - Tools created by LLM agents: The provided text is incomplete and cuts off while introducing this topic. It suggests that agents can also create their own tools to handle novel tasks that existing tools cannot address.
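To make the move beyond a single success rate concrete, here is a toy harness that reports an overall score plus a per-category breakdown; the task format and `check` functions are invented for illustration and are not drawn from any benchmark above.

```python
from typing import Callable

def evaluate(agent: Callable[[str], str], tasks: list[dict]) -> dict:
    """Report an overall success rate plus a per-category breakdown, in the
    spirit of multi-dimensional benchmarks rather than a single number."""
    by_cat: dict[str, list[bool]] = {}
    for task in tasks:
        ok = task["check"](agent(task["prompt"]))  # task-specific success test
        by_cat.setdefault(task["category"], []).append(ok)
    total = sum(len(v) for v in by_cat.values())
    return {
        "success_rate": sum(sum(v) for v in by_cat.values()) / total,
        "by_category": {c: sum(v) / len(v) for c, v in by_cat.items()},
    }

tasks = [
    {"category": "math", "prompt": "2+2?", "check": lambda out: "4" in out},
    {"category": "recall", "prompt": "Capital of France?", "check": lambda out: "Paris" in out},
]
print(evaluate(lambda prompt: "4" if "2+2" in prompt else "Paris", tasks))
```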
6. Results & Analysis
The "results" of this survey are the synthesized insights and identified trends in the field of LLM agents.
- Core Synthesis:
  - Architectural Convergence: The paper reveals that seemingly different agent systems are built from a common set of modular components (profile, memory, planning, action). The `Build-Collaborate-Evolve` framework provides a unified lens for understanding these systems.
  - From Individual to Collective Intelligence: There is a clear trend from single-agent systems toward multi-agent collaboration. The paper's analysis of centralized, decentralized, and hybrid architectures highlights the trade-offs between control and flexibility.
  - Performance Gaps in Real-World Scenarios: Domain-specific benchmarks reveal that general-purpose agents often struggle with the complexities and constraints of specialized fields like healthcare or autonomous driving. This highlights the need for more specialized agent development and evaluation.
  - The Importance of Evolution: The most advanced agents are not static; they learn and improve. The survey categorizes the key mechanisms for this evolution, from self-correction to competitive co-evolution, underscoring that continuous learning is crucial for building more capable agents.
- Real-World Issues & Challenges: While the full section on this topic is not included in the provided text, the introduction, abstract, and diagrams point to critical challenges:
  - Security: Malicious use of agents (`AgentHarm`), agent-centric attacks (e.g., prompt injection), and data-centric threats.
  - Privacy: Agents might memorize and leak sensitive data from their training or interactions.
  - Social Impact & Ethics: The paper points to the need to consider broader societal implications, including job displacement, ethical decision-making, and ensuring agent behavior aligns with human values.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully provides a systematic and comprehensive survey of the LLM agent landscape. Its primary achievement is the `Build-Collaborate-Evolve` framework, which offers a structured, methodology-centered taxonomy for analyzing, comparing, and developing LLM agent systems. By unifying fragmented research, highlighting the ecosystem of tools and evaluations, and touching upon real-world challenges, the survey serves as a foundational guide for both newcomers and experienced researchers in this rapidly advancing field.
- Limitations & Future Work:
  - Author-Acknowledged: The paper implicitly suggests future work in every sub-category, such as creating more robust memory systems, more adaptive collaboration protocols, and more efficient evolution mechanisms. A key direction is bridging the gap between generalist agents and the demands of specialized, real-world domains.
  - Inherent Limitation: The biggest limitation of any survey in such a fast-moving field is that it is a snapshot in time; new architectures and systems are published weekly. The authors attempt to mitigate this by providing a companion GitHub repository, which can be updated more frequently.
  - Incomplete Text: The provided markdown is incomplete, cutting off in the middle of Section 3.2.2. A full analysis of the paper would require the complete text, especially the sections on `Applications` and `Real-World Issues`.
- Personal Insights & Critique:
  - Strength: The paper's main strength is its clarity and structure. The `Build-Collaborate-Evolve` framework is exceptionally intuitive and provides a powerful mental model for thinking about agent systems. It effectively transforms a chaotic collection of research papers into a well-organized map.
  - Value: This survey is highly valuable for anyone entering the field of LLM agents. It provides the necessary concepts, terminology, and key research examples to get up to speed quickly. For experts, it offers a way to contextualize their own work within the broader landscape.
  - Critique: While the taxonomy is excellent, the boundaries between categories can sometimes be blurry. For example, a system like `AutoGen` appears in multiple categories (profile definition, decentralized collaboration), which shows the interconnectedness of the field but also the challenge of clean categorization. However, this is more a reflection of the field's complexity than a flaw in the paper. The survey's true long-term impact will depend on how well its proposed framework adapts to future innovations.