Agent Learning via Early Experience
TL;DR Summary
This paper proposes an "early experience" paradigm for language agents to learn from self-generated interaction data, using future states as reward-free supervision. By employing implicit world modeling and self-reflection, it consistently improves task performance and out-of-domain generalization across eight environments, and lays a strong foundation for subsequent reinforcement learning.
Abstract
A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies of using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.
English Analysis
1. Bibliographic Information
- Title: Agent Learning via Early Experience
- Authors: Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, Jian Xie, Yuan Sun, Go3 Q, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, Dat Huynh, Hengduo Li, Zi Yang, Sara Cao, Lawrence Jang, Shuyan Zhou, Jiacheng Zhu, Huan Sun, Jason Weston, Yu Su, Yifan Wu.
- Affiliations: The authors are affiliated with Meta Superintelligence Labs, FAIR at Meta, and The Ohio State University. The collaboration between a major industrial AI lab and a leading academic institution highlights the paper's focus on both foundational research and practical application.
- Journal/Conference: This paper is an arXiv preprint. arXiv is a widely-used open-access repository for scientific papers, often serving as a platform for disseminating research findings prior to formal peer-review and publication at a conference or in a journal.
- Publication Year: The paper is dated October 2025, which is consistent with its arXiv identifier (2510.08558 denotes an October 2025 submission).
- Abstract: The paper addresses the difficulty of training language agents to learn from their own experience. Current methods either rely on supervised fine-tuning (SFT) on limited expert data, which generalizes poorly, or reinforcement learning (RL), which is challenging in environments without clear rewards. The authors propose a middle-ground paradigm called early experience, where agents learn from interaction data generated by their own actions. The resulting future states serve as a reward-free supervision signal. Two strategies are explored: (1) Implicit World Modeling, where the agent learns to predict future states to understand environment dynamics, and (2) Self-Reflection, where the agent learns from its suboptimal actions to improve reasoning. Evaluated across eight diverse environments, these methods consistently improve agent effectiveness and generalization. Furthermore, they provide a strong foundation for subsequent reinforcement learning, positioning early experience as a practical bridge between imitation learning and fully experience-driven agents.
- Original Source Link (arXiv): https://arxiv.org/abs/2510.08558
- PDF Link: http://arxiv.org/pdf/2510.08558v2
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: The grand vision for AI is to create autonomous agents that learn and improve from experience. However, the two dominant training paradigms for language agents have critical flaws. Imitation Learning (or Supervised Fine-Tuning) trains agents on static datasets of expert actions. This is straightforward but leads to brittle agents that fail in new situations (a problem known as "distributional shift") and is difficult to scale due to the high cost of creating expert data. Reinforcement Learning allows agents to learn from trial and error but requires a well-defined reward signal, which is often unavailable in complex, open-ended environments like the web. It can also be highly inefficient and unstable, especially for tasks with long action sequences.
- Why Now?: Large Language Models (LLMs) have made language agents a tangible reality, but their training remains a major bottleneck. The field needs a practical way to move beyond static datasets and enable agents to learn from their own interactions without the full complexity and requirements of traditional RL.
- Fresh Angle: The paper introduces the early experience paradigm. The key innovation is to use the consequences of an agent's own actions (the resulting future states) as a direct, reward-free source of supervision. This creates a middle path that combines the interactivity of RL with the simplicity of supervised learning, allowing agents to learn from mistakes and explore the environment without needing an explicit reward function.
- Main Contributions / Findings (What):
  - Primary Contributions:
    - It formalizes the early experience paradigm as a practical and scalable training method that bridges imitation learning and reinforcement learning.
    - It proposes two concrete learning strategies under this paradigm: Implicit World Modeling (IWM), which grounds the agent in environment dynamics by having it predict future states, and Self-Reflection (SR), which improves the agent's reasoning by teaching it to analyze and learn from its suboptimal choices.
    - It provides a comprehensive and rigorous empirical validation across eight different agent environments (spanning web navigation, tool use, and embodied tasks) and multiple LLM families.
  - Key Conclusions:
    - Agents trained with early experience (both IWM and SR) consistently outperform agents trained with imitation learning alone.
    - The paradigm enhances the agent's ability to generalize to out-of-domain scenarios not seen in the expert training data.
    - Early experience serves as an excellent pre-training stage for reinforcement learning: agents initialized from an early experience checkpoint achieve significantly better performance after RL fine-tuning than those started from an imitation learning checkpoint.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Language Agents: These are systems, typically built on LLMs, designed to interact with an environment (e.g., a website, an operating system, a game) to achieve goals. They perceive the environment's state as text, reason about the next step, and generate an action in text format.
- Imitation Learning (IL) / Supervised Fine-Tuning (SFT): A training method where a model learns to map states to actions by mimicking a dataset of expert demonstrations. It's a form of supervised learning. While simple and effective for learning a baseline policy, it suffers from the distributional shift problem: if the agent makes a mistake and enters a state not seen in the expert data, it may not know how to recover, causing errors to compound.
- Reinforcement Learning (RL): A paradigm where an agent learns through trial and error. It interacts with an environment, receives rewards or penalties for its actions, and adjusts its policy to maximize its cumulative reward over time. It's powerful but often requires a carefully designed reward function and can be sample-inefficient (requiring many interactions to learn).
- Markov Decision Process (MDP): The mathematical framework for modeling sequential decision-making problems. An MDP consists of a set of states ($\mathcal{S}$), actions ($\mathcal{A}$), a transition function ($T(s' \mid s, a)$) that defines the probability of moving to a new state given a current state and action, and a reward function ($R(s, a)$). The agent's goal is to learn a policy ($\pi(a \mid s)$) that maximizes expected cumulative reward; a compact formalization is sketched after this list.
- World Models: These are models that learn the dynamics of an environment. A world model can predict the next state ($s_{t+1}$) and reward ($r_t$) given the current state ($s_t$) and an action ($a_t$). They can be used for planning, allowing an agent to "imagine" the consequences of its actions before executing them.
- Self-Reflection: A technique where an LLM is prompted to critique its own output, identify errors, and generate a revised, improved response. It is often used at inference time to enhance reasoning without changing the model's weights.
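As referenced in the MDP bullet above, the standard formalization can be stated compactly; this is generic textbook notation (including a discount factor $\gamma$), not notation taken from the paper:

```latex
% An MDP is a tuple (S, A, T, R, gamma); the agent seeks a policy that
% maximizes expected discounted cumulative reward.
\begin{aligned}
\mathcal{M} &= (\mathcal{S}, \mathcal{A}, T, R, \gamma), \qquad
T(s' \mid s, a) = \Pr\!\left(s_{t+1} = s' \mid s_t = s,\; a_t = a\right), \\
\pi^{*} &= \arg\max_{\pi}\; \mathbb{E}_{\pi,\,T}\!\left[\sum_{t \ge 0} \gamma^{t}\, R(s_t, a_t)\right].
\end{aligned}
```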
- Differentiation from Previous Work:
  - vs. Imitation Learning: Standard IL is passive; the agent only learns from a fixed dataset. Early experience is active; the agent generates its own data by interacting with the environment, allowing it to learn about the consequences of non-expert actions and improving its robustness.
  - vs. Reinforcement Learning: Traditional RL is driven by an external reward signal. Early experience is reward-free; it uses the rich, descriptive information in the resulting future state as the supervision signal, making it applicable to many real-world settings where rewards are unavailable.
  - vs. Traditional World Models: Traditional world models are often trained as separate components used for planning at inference time. The paper's Implicit World Modeling integrates this learning directly into the agent's policy network as an auxiliary training objective, making the agent itself "world-aware" in a more lightweight and direct manner.
  - vs. Inference-Time Self-Reflection: Previous self-reflection techniques are typically prompting strategies applied during inference. This paper's Self-Reflection method uses the generated reflections as training data to permanently update the model's parameters, thereby instilling a more robust, generalizable reasoning capability into the agent. Crucially, the reflections are grounded in the actual outcomes of alternative actions observed in the environment.

The paper's progression of training paradigms is visually summarized in Figure 1.
Figure 1 (schematic): the evolution of training paradigms for language agents, from the era of human data (imitation learning), through the early experience paradigm proposed in this paper, to the era of experience (reinforcement learning), highlighting early experience's advantages in data scalability and freedom from reward requirements.
4. Methodology (Core Technology & Implementation)
The core idea of early experience is to augment the expert demonstration dataset ($\mathcal{D}_{\text{expert}} = \{(s_i, a_i)\}$) with a new dataset of interactions ($\mathcal{D}_{\text{rollout}}$) generated by the agent itself.
- Data Collection for Early Experience:
  1. Start with the expert trajectories from $\mathcal{D}_{\text{expert}}$.
  2. For each state $s_i$ in an expert trajectory, use the current agent policy to sample $K$ alternative actions $\{a_i^1, \dots, a_i^K\}$.
  3. Execute each alternative action $a_i^j$ in the environment to observe the resulting next state $s_i^j$.
  4. Collect these interactions into a rollout dataset $\mathcal{D}_{\text{rollout}} = \{(s_i, a_i^j, s_i^j)\}$. This dataset contains rich information about what happens when the agent deviates from the expert path.

Based on this collected data, the paper proposes two training strategies, visualized in Figure 2; a minimal sketch of the collection loop follows the figure caption.
Figure 2 (schematic): the two strategies for using early-experience data in agent training, Implicit World Modeling (left, with its two-stage training process and loss) and Self-Reflection (right, with its data construction, training stage, and corresponding loss).
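A minimal sketch of the collection loop described above, assuming a hypothetical `env.reset_to`/`env.step`-style interface that returns only the next observation and a `policy.sample_actions` helper (none of these names come from the paper):

```python
from dataclasses import dataclass

@dataclass
class RolloutExample:
    state: str       # s_i: textual observation from the expert trajectory
    action: str      # a_i^j: alternative action proposed by the current policy
    next_state: str  # s_i^j: observation the environment returns for that action

def collect_early_experience(expert_trajectories, env, policy, k=4):
    """Branch off each expert state with K alternative actions and record the outcomes."""
    rollout = []
    for trajectory in expert_trajectories:            # trajectory: list of (state, expert_action)
        for state, _expert_action in trajectory:
            for alt_action in policy.sample_actions(state, k=k):
                env.reset_to(state)                   # assumes the env can be reset to a stored state
                next_state = env.step(alt_action)     # observe the consequence; no reward needed
                rollout.append(RolloutExample(state, alt_action, next_state))
    return rollout
```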
- Strategy 1: Implicit World Modeling (IWM)
- Principle: The agent should be able to predict the consequences of its actions. By training the policy model to predict the next state, it internalizes the environment's dynamics. This makes the agent more "aware" of how the world responds to its actions.
- Procedure: IWM is implemented as a two-stage training process:
- World Modeling Stage: The agent model is trained on $\mathcal{D}_{\text{rollout}}$ to predict the next state $s_i^j$ given the current state $s_i$ and an alternative action $a_i^j$. This is achieved by minimizing a standard next-token prediction loss.
- Continual Training Stage: The model from Stage 1 is then fine-tuned on the original expert dataset $\mathcal{D}_{\text{expert}}$ using the standard imitation learning objective to learn the desired policy.
- Mathematical Formulation: The loss for the world modeling stage is:
  $$\mathcal{L}_{\text{IWM}}(\theta) = -\sum_{(s_i,\, a_i^j,\, s_i^j) \in \mathcal{D}_{\text{rollout}}} \log p_\theta\!\left(s_i^j \mid s_i, a_i^j\right)$$
  - $s_i$: The current state.
  - $a_i^j$: An alternative action taken by the agent.
  - $s_i^j$: The resulting next state observed from the environment.
  - $p_\theta$: The probability distribution over next tokens generated by the language model with parameters $\theta$. The loss maximizes the probability of generating the true next state. A small data-formatting sketch for this stage is shown below.
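To make the world-modeling stage concrete, here is a small sketch of how each rollout triple could be serialized into a next-token-prediction example; the prompt template and field names are illustrative, not the paper's exact format:

```python
def build_iwm_example(example):
    """Turn a rollout triple (s_i, a_i^j, s_i^j) into a (prompt, target) pair."""
    prompt = (
        "Current state:\n" + example.state + "\n\n"
        "Action taken:\n" + example.action + "\n\n"
        "Predict the next state:\n"
    )
    target = example.next_state  # the next-token loss is applied to these tokens only
    return prompt, target

# Stage 1 (world modeling): fine-tune on (prompt, target) pairs built from D_rollout.
# Stage 2 (continual training): fine-tune on expert (state -> action) pairs from D_expert.
```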
- Strategy 2: Self-Reflection (SR)
- Principle: The agent should learn not just what the expert action is, but why it is better than other plausible alternatives. This is achieved by generating natural language explanations (rationales) that contrast the outcome of the expert action with the outcomes of suboptimal actions.
- Procedure:
- Data Construction: For each state $s_i$, collect the expert outcome $s_{i+1}$ (from the expert action $a_i$) and the alternative outcomes $s_i^j$ (from actions $a_i^j$). Prompt a language model with this information and ask it to generate a rationale explaining why $a_i$ is preferable to the alternatives. This creates a reflection dataset $\mathcal{D}_{\text{refl}}$.
- Training: The agent model is trained on a mix of the original expert data and the new reflection data. For the reflection data, the model learns to predict the rationale $c_i$ followed by the expert action $a_i$, conditioned on the state $s_i$.
- Mathematical Formulation: The loss for the self-reflection data is:
  $$\mathcal{L}_{\text{SR}}(\theta) = -\sum_{(s_i,\, c_i,\, a_i) \in \mathcal{D}_{\text{refl}}} \log p_\theta\!\left(c_i, a_i \mid s_i\right)$$
  - $c_i$: The generated chain-of-thought rationale explaining why the expert action $a_i$ is better than the sampled alternatives $a_i^j$.
  - The loss trains the model to jointly generate the correct reasoning and the correct action; a small data-construction sketch follows.
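A sketch of how a reflection training example could be constructed; `llm_generate` stands in for whichever model produces the rationale, and the prompt wording is illustrative rather than the paper's template:

```python
def build_reflection_example(state, expert_action, expert_outcome,
                             alt_actions, alt_outcomes, llm_generate):
    """Contrast expert vs. alternative outcomes, then keep (rationale, expert action) as the target."""
    contrast = "\n".join(
        f"Alternative action: {a}\nObserved outcome: {o}"
        for a, o in zip(alt_actions, alt_outcomes)
    )
    reflection_prompt = (
        f"State:\n{state}\n\n"
        f"Expert action: {expert_action}\nExpert outcome: {expert_outcome}\n\n"
        f"{contrast}\n\n"
        "Explain step by step why the expert action is preferable."
    )
    rationale = llm_generate(reflection_prompt)  # c_i
    # Training target: the rationale followed by the expert action, conditioned on the state.
    return {"input": state, "target": rationale + "\n" + expert_action}
```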
5. Experimental Setup
- Datasets / Environments: The evaluation is conducted across eight diverse benchmarks, demonstrating the general applicability of the early experience paradigm. The details are transcribed from Table 1 below (manually transcribed from the paper).

| Environment | Description | # Traj. | # D_expert |
| --- | --- | --- | --- |
| MISC (Embodied and Scientific Simulation, and Travel Planning) | | | |
| ALFWorld (Shridhar et al., 2021) | Embodied instruction-following tasks in a simulated household, combining textual descriptions with high-level symbolic actions. | 3,553 | 21,031 |
| ScienceWorld (Wang et al., 2022) | An interactive science lab simulator rendered in natural language, where agents perform multi-step experiments. | 1,000 | 14,506 |
| TravelPlanner (Xie et al., 2024a) | Long-horizon travel planning tasks that require generating and refining multi-day itineraries using various tools. | 45 | 1,395 |
| Multi-Turn Tool-Use | | | |
| BFCLv3 (Patil et al., 2025) | Multi-turn tool-use tasks from the Berkeley Function Call Leaderboard v3, where agents interact with a Python-based API environment. | 125 | 1,264 |
| Tau-Bench (Yao et al., 2025) | Realistic customer-service scenarios requiring agents to interact with LM-simulated users and perform multi-turn tool use. | 452 | 5,239 |
| SearchQA (Jin et al., 2025) | Multi-hop question answering where agents issue search queries and reason over retrieved snippets to answer complex questions. | 2,082 | 7,691 |
| Web Navigation | | | |
| WebShop (Yao et al., 2022) | Shopping tasks in a simulated e-commerce site, where agents must navigate, filter, and select the correct product. | 1,571 | 15,464 |
| WebArena-Lite (Zhou et al., 2024) | Web navigation tasks across domains like e-commerce, forums, and content management. | 554 | 7,044 |
- Evaluation Metrics:
  - Success Rate (%):
    - Conceptual Definition: The primary metric for most tasks. It measures the percentage of tasks that the agent completes successfully according to the environment's specific winning conditions. A higher success rate indicates better task completion ability.
    - Mathematical Formula: $\text{Success Rate} = \frac{\#\,\text{successful tasks}}{\#\,\text{total tasks}} \times 100\%$
  - F1 Score:
    - Conceptual Definition: Used for the SearchQA benchmark. It measures the quality of the agent's final generated answer by comparing the tokens in the predicted answer with the tokens in the ground-truth answer. It is the harmonic mean of precision and recall, providing a balanced measure of accuracy.
    - Mathematical Formula: $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$ (a minimal token-level implementation is sketched after this list)
    - Symbol Explanation:
      - Precision: The proportion of tokens in the predicted answer that are also in the ground-truth answer. It measures how much of the prediction is correct.
      - Recall: The proportion of tokens in the ground-truth answer that are also in the predicted answer. It measures how much of the correct answer was captured by the prediction.
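A minimal token-level implementation of the F1 metric described above; whitespace tokenization and lowercasing are simplifying assumptions, and SearchQA-style evaluations typically also normalize punctuation and articles:

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """Token-overlap F1 between a predicted answer and the ground-truth answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: token_f1("red wagon", "a red wagon") ~= 0.8 (precision 1.0, recall 2/3).
```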
- Baselines:
  - Prompt (Raw Instruct Model): The off-the-shelf instruction-tuned LLM without any task-specific fine-tuning. This establishes a zero-shot performance baseline.
  - Imitation Learning: The main baseline, where the model is fine-tuned solely on the expert demonstration dataset ($\mathcal{D}_{\text{expert}}$).
  - Long CoT & STaR-style: Additional baselines used in the discussion section to test whether performance gains can be achieved simply by encouraging longer reasoning at test time (Long CoT) or by training on ungrounded, self-generated rationales (STaR-style).
6. Results & Analysis
- Core Results: Effectiveness. The main results, transcribed from Table 2, show that both early experience methods (Ours-IWM and Ours-SR) consistently outperform the Imitation Learning baseline across all eight environments and multiple model sizes. This table is manually transcribed from the paper.
| Benchmark | Model | Prompt | Imitation Learning | Ours-IWM | Ours-SR |
| --- | --- | --- | --- | --- | --- |
| Embodied and Scientific Simulation, and Travel Planning | | | | | |
| ALFWorld | Llama-3.2-3B | 8.6 | 78.1 | 83.6 (+5.5) | 85.9 (+7.8) |
| | Qwen-2.5-7B | 20.3 | 78.1 | 82.8 (+4.7) | 82.0 (+3.9) |
| | Llama-3.1-8B | 25.0 | 80.5 | 85.9 (+5.4) | 85.2 (+4.7) |
| ScienceWorld | Llama-3.2-3B | 2.3 | 51.6 | 55.5 (+3.9) | 56.2 (+4.6) |
| | Qwen-2.5-7B | 3.9 | 53.9 | 59.4 (+5.5) | 57.8 (+3.9) |
| | Llama-3.1-8B | 3.1 | 54.7 | 57.0 (+2.3) | 68.0 (+13.3) |
| TravelPlanner | Llama-3.2-3B | 0.0 | 19.4 | 28.3 (+8.9) | 32.2 (+12.8) |
| | Qwen-2.5-7B | 0.0 | 16.7 | 22.2 (+5.5) | 31.7 (+15.0) |
| | Llama-3.1-8B | 0.0 | 17.2 | 25.0 (+7.8) | 32.2 (+15.0) |
| Multi-Turn Tool-Use | | | | | |
| BFCLv3 | Llama-3.2-3B | 1.3 | 21.3 | 25.3 (+4.0) | 29.3 (+8.0) |
| | Qwen-2.5-7B | 10.6 | 26.7 | 29.3 (+2.6) | 32.0 (+5.3) |
| | Llama-3.1-8B | 6.7 | 16.0 | 20.0 (+4.0) | 20.0 (+4.0) |
| Tau-Bench | Llama-3.2-3B | 5.2 | 24.3 | 26.1 (+1.8) | 28.7 (+4.4) |
| | Qwen-2.5-7B | 20.0 | 33.9 | 38.7 (+4.8) | 39.5 (+5.6) |
| | Llama-3.1-8B | 6.0 | 35.9 | 40.8 (+4.9) | 41.7 (+5.8) |
| SearchQA (F1) | Llama-3.2-3B | 13.3 | 38.0 | 39.0 (+1.0) | 38.6 (+0.6) |
| | Qwen-2.5-7B | 19.3 | 39.9 | 40.8 (+0.9) | 42.0 (+2.1) |
| | Llama-3.1-8B | 21.0 | 41.0 | 44.3 (+3.3) | 41.8 (+0.8) |
| Web Navigation | | | | | |
| WebShop | Llama-3.2-3B | 0.0 | 41.8 | 60.2 (+18.4) | 52.7 (+10.9) |
| | Qwen-2.5-7B | 0.8 | 51.6 | 56.2 (+4.6) | 62.2 (+10.6) |
| | Llama-3.1-8B | 0.0 | 47.3 | 58.6 (+11.3) | 58.2 (+10.9) |
| WebArena-Lite | Llama-3.2-3B | 1.2 | 6.1 | 8.5 (+2.4) | 7.3 (+1.2) |
| | Qwen-2.5-7B | 1.8 | 4.2 | 7.3 (+3.1) | 6.1 (+1.9) |
| | Llama-3.1-8B | 0.6 | 4.9 | 8.5 (+3.6) | 8.5 (+3.6) |
  - Key Insight: Implicit World Modeling (IWM) shows particularly strong gains in environments with predictable dynamics (e.g., WebShop with +18.4, ALFWorld). Self-Reflection (SR) excels in tasks requiring complex, multi-step reasoning and constraint satisfaction (e.g., TravelPlanner with up to +15.0, ScienceWorld with +13.3). This demonstrates that the two methods capture complementary aspects of agent learning.
- Out-Of-Domain (OOD) Generalization: Table 3 shows that the performance improvements carry over, and are often amplified, in OOD settings. This is a critical finding, as it suggests that learning from one's own exploratory actions better prepares an agent for novel situations not covered by the expert data. This table is manually transcribed from the paper.
| Method | ALFWorld (Llama-3.2-3B) | ALFWorld (Qwen-2.5-7B) | ALFWorld (Llama-3.1-8B) | BFCLv3 (Llama-3.2-3B) | BFCLv3 (Qwen-2.5-7B) | BFCLv3 (Llama-3.1-8B) | SearchQA F1 (Llama-3.2-3B) | SearchQA F1 (Qwen-2.5-7B) | SearchQA F1 (Llama-3.1-8B) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Prompt | 5.5 | 4.7 | 18.8 | 1.3 | 7.1 | 6.2 | 24.6 | 33.1 | 37.0 |
| Imitation Learning | 74.2 | 64.1 | 63.3 | 5.3 | 7.6 | 6.7 | 40.5 | 47.0 | 47.4 |
| Ours-IWM | 77.3 (+3.1) | 70.3 (+6.2) | 78.1 (+14.8) | 8.9 (+3.6) | 12.9 (+5.3) | 7.6 (+0.9) | 45.4 (+4.9) | 49.5 (+2.5) | 49.6 (+2.2) |
| Ours-SR | 77.3 (+3.1) | 71.1 (+7.0) | 72.7 (+9.4) | 13.8 (+8.5) | 8.3 (+0.7) | 8.0 (+1.3) | 44.0 (+3.5) | 51.2 (+4.2) | 50.7 (+3.3) |
- Reinforcement Learning Following Early Experience: Figure 3 powerfully illustrates the paradigm's role as a bridge to RL. In all three environments tested, starting RL training (with the GRPO algorithm) from an early experience checkpoint (IWM or SR) leads to a higher final performance ceiling than starting from an Imitation Learning checkpoint.

  Figure 3 (three bar charts): success rate and F1 after GRPO training on WebShop, ALFWorld, and SearchQA for models of different sizes initialized from Imitation Learning, Implicit World Modeling, or Self-Reflection checkpoints.

  - Key Insight: Early experience doesn't just produce a better initial policy; it creates a policy that is more amenable to further improvement via RL. The agent has already learned about environment dynamics and suboptimal actions, providing a much stronger foundation for exploration and credit assignment during RL.
- Ablations and Further Analysis
  - Comparison with Baselines (Table 4): The paper shows that simpler methods like prompting for longer reasoning (Long CoT) or training on ungrounded rationales (STaR) are ineffective and can even degrade performance. This underscores that the gains from early experience come from grounding the learning in actual, observed environment feedback.
  - Impact of Human Data (Figure 4a): Early experience methods are significantly more data-efficient. On WebShop, an agent trained with early experience on only 1/8th of the expert data outperforms an agent trained with Imitation Learning on the full dataset.
  - Impact of Branching Factor (Figure 4b): Increasing the number of alternative actions ($K$) generally benefits IWM. For SR, a moderate number of alternatives (e.g., 2-4) is often optimal, as too many alternatives can dilute the contrastive signal.
  - Model Scaling (Figure 5): The performance advantage of early experience methods holds and even grows as the model size increases from 3B to 70B parameters. This indicates that the paradigm provides a valuable learning signal that complements the raw capacity of larger models.

  Figure 5 (chart): success rates on WebArena-Lite for Llama models of different sizes trained with Raw (Instruct), Imitation Learning, Implicit World Modeling, and Self-Reflection; as model scale increases, the early experience methods markedly improve success rates.
7. Conclusion & Reflections
-
Conclusion Summary: The paper successfully introduces and validates
early experience
as a scalable, reward-free paradigm for training more capable language agents. By leveraging the agent's own interactions to generate supervision from future states, it effectively bridges the gap between passive imitation learning and reward-dependent reinforcement learning. The two proposed methods,Implicit World Modeling
andSelf-Reflection
, consistently improve in-domain performance, out-of-domain generalization, and provide a superior starting point for downstream RL, establishingearly experience
as a foundational technique for the future of agent development. -
- Limitations & Future Work:
  - The authors acknowledge that their current methods primarily focus on single-step outcomes. Extending these ideas to handle long-horizon credit assignment (i.e., understanding how an early action affects a much later outcome) without explicit rewards remains a significant challenge.
  - Future directions include exploring more sophisticated self-supervised objectives, transferring knowledge learned in one environment to another, and integrating early experience into a continual learning framework for real-world deployment.
- Personal Insights & Critique:
  - Strengths:
- The core concept is elegant and highly practical. It directly tackles the most significant bottleneck in agent training today: the scarcity of high-quality, reward-annotated interaction data.
- The empirical evaluation is exceptionally thorough. Testing on eight distinct environments with varying action and observation spaces provides strong evidence for the paradigm's generality.
    - The demonstration that early experience enhances downstream RL is a standout contribution, offering a clear and actionable path for practitioners to improve their agent training pipelines.
- Potential Weaknesses and Open Questions:
- Cost of Rollouts: While reward-free, the paradigm is not "cost-free." Generating rollouts requires significant interaction with the environment, which can be computationally expensive or slow, especially in non-simulated, real-world settings. The paper does not deeply analyze this trade-off.
- Quality of Self-Generated Data: The effectiveness of both IWM and SR depends on the quality of the agent's initial policy to propose meaningful alternative actions and the quality of the LLM used to generate reflections. A very poor initial policy might not explore useful parts of the state space, and noisy reflections could harm training.
- Applicability to Physical Embodiment: The experiments are all in simulated or digital environments. Applying this paradigm to robotics, where interaction is slow and has real-world consequences, would present a much greater challenge.