Paper status: completed

LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training

Published: 10/17/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

UI-Simulator uses LLMs to generate structured UI states for scalable training data synthesis, reducing real data costs. UI-Simulator-Grow further enhances data efficiency, achieving superior robustness and performance on benchmarks with smaller base models.

Abstract

Digital agents require diverse, large-scale UI trajectories to generalize across real-world tasks, yet collecting such data is prohibitively expensive in terms of human annotation, infrastructure, and engineering effort. To this end, we introduce UI-Simulator, a scalable paradigm that generates structured UI states and transitions to synthesize training trajectories at scale. Our paradigm integrates a digital world simulator for diverse UI states, a guided rollout process for coherent exploration, and a trajectory wrapper that produces high-quality and diverse trajectories for agent training. We further propose UI-Simulator-Grow, a targeted scaling strategy that enables more rapid and data-efficient scaling by prioritizing high-impact tasks and synthesizing informative trajectory variants. Experiments on WebArena and AndroidWorld show that UI-Simulator rivals or surpasses open-source agents trained on real UIs with significantly better robustness, despite using weaker teacher models. Moreover, UI-Simulator-Grow matches the performance of Llama-3-70B-Instruct using only Llama-3-8B-Instruct as the base model, highlighting the potential of the targeted synthesis scaling paradigm to continuously and efficiently enhance digital agents.

In-depth Reading

English Analysis

1. Bibliographic Information

  • Title: LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training
  • Authors: Yiming Wang, Da Yin, Yuedong Cui, Ruichen Zheng, Zhiqian Li, Zongyu Lin, Di Wu, Xueqing Wu, Chenchen Ye, Yu Zhou, Kai-Wei Chang.
  • Affiliations: The authors are from UCLA and Harvard University.
  • Journal/Conference: This paper is an arXiv preprint. The version provided is v1. The venue is not a formal peer-reviewed conference or journal at this stage, which is common for fast-moving research in AI.
  • Publication Year: The paper is dated October 16, 2025.
  • Abstract: The authors address the high cost and complexity of collecting large-scale User Interface (UI) interaction data (trajectories) needed to train general-purpose digital agents. They propose UI-Simulator, a paradigm that leverages Large Language Models (LLMs) to generate structured UI states and transitions, effectively synthesizing training data at scale without relying on real environments. To improve efficiency, they also introduce UI-Simulator-Grow, a targeted scaling strategy that focuses on synthesizing data for tasks that offer the most learning potential. Experiments on the WebArena and AndroidWorld benchmarks show their method rivals or surpasses agents trained on real UIs, demonstrates superior robustness, and allows a smaller model (Llama-3-8B) to match the performance of a much larger one (Llama-3-70B), highlighting the data efficiency of their targeted synthesis approach.
  • Original Source Link:

2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: Training powerful digital agents that can navigate websites and mobile apps requires vast amounts of high-quality training data in the form of UI interaction trajectories. Collecting this data is a major bottleneck due to the immense cost in human annotation hours, infrastructure for running real environments, and complex engineering.
    • Existing Gaps: Prior methods for automatic data synthesis either convert existing documentation (Synatra) or rely on exploring real-world environments (NNetNav, OS-Genesis). Exploring real environments is resource-intensive, faces issues like network instability, and is limited by what the real environments can offer (e.g., login walls, "search not found" pages).
    • Innovation: This paper's key innovation is to simulate the digital world itself using an LLM. Instead of interacting with a real UI, the agent interacts with a simulated UI generated by an LLM. This decouples data generation from the constraints of real-world infrastructure, enabling the creation of diverse, novel, and even idealized UI scenarios at scale. The paper further innovates with a data-centric scaling strategy (UI-Simulator-Grow) that intelligently selects what data to synthesize next.
  • Main Contributions / Findings (What):

    1. UI-Simulator: A scalable paradigm for synthesizing UI training trajectories. It consists of three main components:
      • An LLM-based digital world simulator that generates structured UI states and their transitions.
      • A guided rollout process that uses step-wise task controls to ensure the agent's exploration is coherent and diverse.
      • A trajectory wrapper that retrospectively generates a user instruction and refines the agent's reasoning steps to create a high-quality, supervised training instance.
    2. UI-Simulator-Grow: A targeted and data-efficient scaling paradigm. Instead of blindly adding more data, it iteratively identifies tasks that are "just right" in difficulty (not too easy, not too hard) and synthesizes variants of these high-impact tasks to accelerate agent improvement.
    3. Strong Empirical Results:
      • Agents trained with UI-Simulator rival or outperform open-source agents trained on real UI data, even when using a weaker teacher model (GPT-4o-mini vs. GPT-4o).

      • The method demonstrates significantly better robustness when tested on UIs with perturbed layouts.

      • UI-Simulator-Grow is highly data-efficient, enabling a Llama-3-8B-Instruct model to match the performance of a Llama-3-70B-Instruct model on WebArena using only 66% of the training data.

        Figure 1: Overview and performance highlights of UI-Simulator and UI-Simulator-Grow. The figure shows the overall structure and performance highlights of UI-Simulator and UI-Simulator-Grow, including the LLM pre-training corpus, the multi-module design, and comparisons of performance gains across multiple tasks and data-scaling settings.

3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • Digital Agent: An AI program designed to understand human instructions and accomplish goals by interacting with graphical user interfaces (GUIs) on computers, websites, or mobile devices. They act like an automated user, clicking buttons, typing text, and navigating pages.
    • UI Trajectory: A recorded sequence of an agent's interactions with a UI. Each step in the trajectory typically consists of an observation (the state of the screen/UI) and an action (what the agent did). This data is crucial for training agents via supervised learning.
    • World Model: In AI, a world model is an internal representation of an environment that can predict how the environment will change in response to actions. This paper uses an LLM as a world model for UIs, where it predicts the next UI state $s_{t+1}$ given the current state $s_t$ and an action $a_t$.
    • LLM (Large Language Model): A type of deep learning model trained on vast amounts of text and code. Due to this training, LLMs like GPT-4 have a rich internal "knowledge" of concepts, reasoning, and structures, including how websites and applications are built (e.g., HTML, CSS), which this paper leverages.
    • Teacher-Forcing Loss: A metric used in training sequence models (like LLMs). During training, to predict the next token in a sequence, the model is given the correct, ground-truth previous token as input, rather than its own (potentially incorrect) previous prediction. The loss measures the difference between the model's prediction and the actual correct next token. This paper uses it to gauge how difficult a task is for a student agent compared to a teacher's actions.
    • Continual Learning: A machine learning paradigm where a model is trained sequentially on new data or tasks without "forgetting" what it has learned from previous data. Replay is a common technique where the model is retrained on a mix of new data and a small, representative sample of old data.
  • Previous Works:

    • World Models: The concept is not new (e.g., Ha and Schmidhuber, 2018). Recently, LLMs have been explored as world models for planning (Hao et al., 2023) or for web navigation (Chae et al., 2025).
    • Synthetic Data for Digital Agents: Previous work has focused on two main approaches:
      1. Knowledge Conversion: Methods like Synatra and AgentTrek convert human-readable documentation (e.g., web tutorials) into executable training trajectories.
      2. Unsupervised Exploration: Methods like NNetNav and OS-Genesis let an agent autonomously explore a real website or app and then retroactively label the action sequence with a plausible instruction.
  • Differentiation:

    • Simulation vs. Exploration: The core difference is that UI-Simulator generates the environment itself, whereas prior works like NNetNav and OS-Genesis explore existing, real environments. This gives UI-Simulator more flexibility to create diverse and novel UI states, overcoming the physical and practical limitations of real-world infrastructure.
    • Efficiency and Control: By simulating, the authors avoid issues like network latency, flaky UIs, and access restrictions (e.g., login pages). They can also generate scenarios that are hard to find in the wild, leading to more robust training.
    • Targeted Scaling: UI-Simulator-Grow introduces a strategic, data-centric approach to scaling, a departure from the "more data is always better" assumption. It focuses on synthesizing the most informative data at each stage, leading to greater efficiency.

4. Methodology (Core Technology & Implementation)

The paper's methodology can be broken down into three main parts: simulating the digital world, collecting trajectories within that world, and intelligently scaling the data collection.

Digital World Models in UI-SIMULATOR

The foundation of the framework is an LLM that acts as a simulator for digital environments. The UI state is represented as a structured accessibility tree, a format LLMs are well-suited to understand and generate due to their pre-training on front-end code.
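
To make this concrete, here is a purely hypothetical example (not taken from the paper) of the kind of accessibility-tree-style text a simulator could read and emit for a single UI state; the element IDs, labels, and the `bbox=` annotation are illustrative assumptions, not the paper's exact format.

```python
# A hypothetical accessibility-tree-style UI state (illustrative only):
# each element carries an ID, a role, a text label, and a bounding box.
EXAMPLE_UI_STATE = """
[1] RootWebArea 'Online Shop - Home'
  [5] textbox 'Search products' bbox=(120, 40, 520, 72)
  [6] button 'Search' bbox=(530, 40, 600, 72)
  [12] link 'My Account' bbox=(640, 40, 720, 72)
  [20] heading 'Featured items' bbox=(120, 120, 400, 150)
  [21] link 'Running shoes - $59.99' bbox=(120, 160, 400, 190)
"""
```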

  • Formulation:

    • The state of the UI at timestep $t$ is denoted as $s_t$. Each element $e$ in the state has properties like text, attributes, and a bounding box $\mathrm{bbox}(e)$.
    • The agent only sees a partial observation $o_t$, which consists of the UI elements visible within the current viewport $\mathcal{V}_t$. The viewport is defined as a rectangular region: $\mathcal{V}_t = [x_0, x_1] \times [y_0, y_1]$.
    • The observation is the set of elements whose bounding boxes intersect with the viewport: $o_t = \{ e \in s_t \mid \mathrm{bbox}(e) \cap \mathcal{V}_t \neq \emptyset \}$.
    • The world's dynamics are governed by a transition function $s_{t+1} = \mathcal{T}(s_t, a_t)$, where $\mathcal{T}$ is either the LLM-based simulator or a simple rule for deterministic actions (like scrolling). (A minimal code sketch of this formulation appears at the end of this subsection.)
  • Simulation Processes: The paper proposes two ways to simulate the next UI state, as illustrated in Figure 2.

    Figure 2: Overall process of how the retrieval-free/retrieval-augmented simulators predict the next UI state. The diagram shows how the two simulator variants within the LLM world simulator predict the next UI state, highlighting the effect of having or lacking retrieved reference information on the prediction.

    1. (Retrieval-Free) Simulation: This method relies entirely on the LLM's internal knowledge to generate the next UI state. It follows a three-step Chain-of-Thought (CoT) process:

      • Step 1: Predict an Overview. Given the current state and action (e.g., clicking a "search" button), the LLM first predicts a high-level overview of the next state (e.g., "a search results page").
      • Step 2: Generate Rich Draft. Based on the overview, the LLM generates a detailed, unstructured description of the new UI in natural language. This encourages creativity and content richness without being constrained by a strict format.
      • Step 3: Convert to Structured Format. The LLM then acts as a "style transfer" model, converting the unstructured draft into a well-defined structured format (the accessibility tree) and assigning coordinates to each UI element to finalize the next state $s_{t+1}$.
    2. (Retrieval-Augmented) Simulation: This method is used when a small amount of experience from a real target environment is available. It grounds the simulation in reality while still allowing for creative generation.

      • An offline corpus $\mathcal{D}$ is created from the available real-world data, containing state-action-next_state transitions.
      • When simulating a transition, the system retrieves the most relevant past state $\tilde{s}_{\mathrm{ret}}$ from $\mathcal{D}$.
      • The LLM is then prompted with the current context and the retrieved state to generate the next state: $s_{t+1} = \mathcal{M}_{\mathrm{LLM}}(s_t, a_t, [s_{\mathrm{ret}}])$. This helps generate UIs that are more stylistically consistent with the target domain.
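
As a minimal sketch of the formulation and the retrieval-free simulation process above: the code below extracts a partial observation $o_t$ from a simulated state and chains the three simulation steps together. The `UIElement` class, the `(x0, y0, x1, y1)` bbox/viewport convention, and the `call_llm` helper with its prompt strings are assumptions made for illustration, not the paper's actual implementation or prompts.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class UIElement:
    text: str
    attributes: dict
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates (illustrative convention)


def get_observation(state: list[UIElement], viewport: tuple) -> list[UIElement]:
    """o_t = { e in s_t | bbox(e) intersects the viewport V_t }."""
    vx0, vy0, vx1, vy1 = viewport
    visible = []
    for e in state:
        ex0, ey0, ex1, ey1 = e.bbox
        # Keep the element if its bounding box overlaps the viewport rectangle.
        if ex0 < vx1 and ex1 > vx0 and ey0 < vy1 and ey1 > vy0:
            visible.append(e)
    return visible


def simulate_next_state(state_text: str, action: str,
                        call_llm: Callable[[str], str]) -> str:
    """Retrieval-free transition T(s_t, a_t) via three chained LLM calls:
    overview -> rich natural-language draft -> structured accessibility tree."""
    overview = call_llm(
        f"Current UI:\n{state_text}\nAction: {action}\n"
        "Give a one-sentence overview of the next page.")
    draft = call_llm(
        f"Overview: {overview}\n"
        "Describe the next UI page in rich natural language.")
    next_state = call_llm(
        f"Draft: {draft}\n"
        "Convert this draft into an accessibility tree and assign coordinates "
        "to every element.")
    return next_state
```

In the retrieval-augmented variant, the retrieved state $\tilde{s}_{\mathrm{ret}}$ would simply be appended to the prompt as extra reference context, keeping the generated UI stylistically closer to the target domain.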

Scalable Synthetic Training Trajectory Collection

With the world simulator in place, the next step is to generate high-quality training trajectories.

  • Step-Wise Guided Rollout Process: To prevent the teacher agent from generating boring or repetitive trajectories, a guided process is used:

    1. Task Control Proposal: At each stage of the rollout, the teacher agent proposes a high-level "task control" or sub-goal (e.g., "Navigate to my account page").
    2. Iterative Refinement: Once a sub-goal is completed, the agent proposes a new one based on the current UI state (e.g., after reaching the account page, it might propose "Check my order history"). This chains together simple steps into a complex, coherent task.
    3. Thought & Action Generation: Under the guidance of the current task control, the teacher agent generates its reasoning (thought), the specific action to take, and a summary of the step.
  • Trajectory Wrapper: After a full trajectory is rolled out, a final wrapping process refines it into a clean training instance:

    1. Instruction Generation: A summarizer LLM retrospectively creates a concise user instruction $G$ that accurately describes the task accomplished by the trajectory.
    2. Thought Refinement: The step-by-step thoughts are rewritten to be consistent with the final instruction $G$, ensuring a coherent reasoning chain. (A rough code sketch of this rollout-and-wrapping flow follows below.)
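
The sketch below shows how the guided rollout and the trajectory wrapper could fit together into one collection loop. All injected callables (`propose_task_control`, `generate_step`, `simulate_next_state`, `summarize_instruction`, `refine_thought`) are hypothetical stand-ins for the teacher-agent and simulator prompts described above, not functions from the paper's codebase.

```python
from typing import Any, Callable


def collect_trajectory(
    initial_state: str,
    propose_task_control: Callable[[str], str],      # teacher proposes a sub-goal from a state
    generate_step: Callable[[str, str], tuple],      # (state, sub-goal) -> (thought, action, subgoal_done)
    simulate_next_state: Callable[[str, str], str],  # LLM world simulator: (state, action) -> next state
    summarize_instruction: Callable[[list], str],    # wrapper: trajectory -> user instruction G
    refine_thought: Callable[[str, str], str],       # wrapper: (thought, instruction) -> rewritten thought
    max_steps: int = 10,
) -> tuple[str, list[dict[str, Any]]]:
    """Roll out one synthetic trajectory in the simulated world, then wrap it."""
    state, trajectory = initial_state, []
    task_control = propose_task_control(state)       # e.g. "Navigate to my account page"
    for _ in range(max_steps):
        # Teacher generates thought and action under the current sub-goal.
        thought, action, subgoal_done = generate_step(state, task_control)
        next_state = simulate_next_state(state, action)
        trajectory.append({"obs": state, "thought": thought, "action": action})
        if subgoal_done:
            # Chain a new sub-goal from the state just reached, keeping the task coherent.
            task_control = propose_task_control(next_state)
        state = next_state

    # Trajectory wrapper: derive the instruction retrospectively, then rewrite each
    # thought so the reasoning chain stays consistent with that instruction.
    instruction = summarize_instruction(trajectory)
    for step in trajectory:
        step["thought"] = refine_thought(step["thought"], instruction)
    return instruction, trajectory
```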

UI-SIMULATOR-GROW: Targeted Scaling

Instead of simply generating more and more data, UI-Simulator-Grow is a strategic approach to scale training efficiently.

  • Target Task Selection: The core idea is to focus on tasks that provide the most learning value.

    • The student agent's performance on a validation set is measured using teacher-forcing loss. A high loss means the student struggles with the task, while a low loss means it has mastered it.

    • Tasks are ranked by this loss. Those that are too easy (bottom 25%) or too hard (top 25%) are discarded.

    • The "sweet spot" of tasks in the middle (25th-75th percentile) are selected as targets for the next round of synthesis.

      Figure 5: Illustration of the overall target task selection process. The figure shows the target task selection process for web and mobile tasks, with teacher-forcing loss on a dynamic validation set on the horizontal axis and the 25th/75th percentile cutoffs marking the tasks chosen for the next round of trajectory synthesis.

  • Synthesizing Diverse Variants: For each selected target task, the system generates new training data using "lightweight task rewriting." It modifies the task slightly (e.g., changing "search for running shoes" to "search for hiking boots") while keeping the overall task structure (search, click, etc.) the same. This creates meaningful variations without needing to design entirely new tasks from scratch.

  • Continual Learning: To incorporate the new data without the model forgetting past knowledge, a replay strategy is used. A small, representative sample of trajectories from the previous training iteration is mixed with the newly synthesized data for the next training phase.
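
A rough sketch of the selection rule, under stated assumptions: the loss computation assumes a HuggingFace-style causal LM where `model(input_ids).logits` returns next-token logits and a caller-supplied mask marks which tokens belong to the teacher's action; the 25th/75th percentile cutoffs follow the description above, while task rewriting and replay mixing are omitted here.

```python
import numpy as np
import torch
import torch.nn.functional as F


def teacher_forcing_loss(student, input_ids, action_mask):
    """Mean cross-entropy of the student on the teacher's action tokens only.

    `input_ids` holds the full prompt plus teacher action; `action_mask` is 1 for
    tokens belonging to the teacher's action and 0 elsewhere (both [batch, seq_len]).
    """
    with torch.no_grad():
        logits = student(input_ids).logits[:, :-1]   # predict token t+1 from its prefix
        targets = input_ids[:, 1:]
        per_token = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
            reduction="none",
        )
    mask = action_mask[:, 1:].reshape(-1).float()
    return ((per_token * mask).sum() / mask.sum()).item()


def select_target_tasks(task_losses: dict[str, float]) -> list[str]:
    """Keep tasks between the 25th and 75th percentile of validation loss;
    discard the easiest and hardest quartiles."""
    losses = np.array(list(task_losses.values()))
    lo, hi = np.percentile(losses, [25, 75])
    return [task for task, loss in task_losses.items() if lo <= loss <= hi]
```

The selected tasks would then be passed through lightweight rewriting to produce variants, and mixed with a small replay sample from the previous iteration before the next training phase.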

5. Experimental Setup

  • Datasets:

    • WebArena: A benchmark with 812 complex, realistic web navigation tasks across various domains like e-commerce, social media, and content management.
    • AndroidWorld: A benchmark with 116 challenging tasks focused on daily mobile app usage on the Android operating system.
  • Evaluation Metrics: The primary metric used is Success Rate (SR).

    1. Conceptual Definition: SR measures the percentage of tasks the agent successfully completes according to the predefined success criteria for each task in the benchmark. It is a direct measure of the agent's task-completion capability.
    2. Mathematical Formula: $\mathrm{SR}(\%) = \frac{\text{Number of Successfully Completed Tasks}}{\text{Total Number of Tasks}} \times 100$
    3. Symbol Explanation:
      • Number of Successfully Completed Tasks: The count of tasks where the agent achieved the goal.
      • Total Number of Tasks: The total number of tasks in the evaluation set (e.g., 812 for WebArena).
  • Baselines: The authors compare their method against a wide range of models:

    • Base LLMs: Zero-shot performance of models like Llama-3-8B-Instruct, Llama-3-70B-Instruct, and GPT-4o.
    • Agent Training Baselines:
      • AgentFlan: An early model trained on web-related instructional data.
      • NNetNav: An agent trained via unsupervised exploration of real websites.
      • Synatra: An agent trained on data synthesized from web tutorials.
      • OS-Genesis: An agent trained via unsupervised exploration on real environments, using the strong GPT-4o as a teacher.
  • Implementation Details:

    • Simulator & Teacher Agent: GPT-4o-mini was used for both simulating the UI world and acting as the teacher agent during rollouts. This is a deliberate choice to show the method's effectiveness even with a weaker, more cost-efficient teacher.
    • Student Agent Base Models: Llama-3-8B-Instruct for WebArena and Qwen-2.5-7B-Instruct for AndroidWorld (due to its longer context window support).

6. Results & Analysis

The paper presents a comprehensive analysis of the UI-Simulator framework's performance, robustness, and scaling properties.

Core Results

The main results are summarized in Table 1, which compares UI-Simulator variants with baselines on WebArena and AndroidWorld.

(This table is a transcription of Table 1 from the paper.)

| Models | Teacher Agents | Train Under Real Env.? | WebArena SR (%) | AndroidWorld SR (%) |
| --- | --- | --- | --- | --- |
| **Base Open-Source LLMs and Proprietary LLMs** | | | | |
| Llama-3-8B-Instruct | - | ✗ | 2.34 | - |
| CodeLlama-34B-Instruct | - | ✗ | 4.06 | - |
| Lemur-chat-70B | - | ✗ | 5.30 | - |
| Llama-3-70B-Instruct | - | ✗ | 7.02 | - |
| Gemini Pro | - | ✗ | 7.12 | - |
| Qwen-1.5-72B-Instruct | - | ✗ | 7.14 | - |
| Qwen-2.5-7B-Instruct | - | ✗ | 3.94 | 0.0 |
| Qwen-2-VL-7B | - | ✗ | - | 5.0 |
| Qwen-2-VL-72B | - | ✗ | - | 5.0 |
| Gemma-2-27B | - | ✗ | - | 9.5 |
| GPT-4o | - | ✗ | 13.10 | 11.7 |
| **Digital Agent Training Data Synthesis Baselines** | | | | |
| AgentFlan | N/A | | 4.68 | - |
| NNetNav | Llama-3.1-70B | | 4.80 | - |
| Synatra | GPT-4-turbo | | 6.28 | - |
| OS-Genesis | GPT-4o | | 6.16 | 9.1 |
| GUIMid (Post-Train) | N/A | | 6.20 | 9.0 |
| **UI-Simulator-Series Variants** | | | | |
| UI-Simulator-F | GPT-4o-mini | ✗ | 6.28 | 8.6 |
| UI-Simulator-R | GPT-4o-mini | ✓ (≪) | 6.40 | 12.9 |
| UI-Simulator-Grow-R | GPT-4o-mini | ✓ (≪) | 7.14 | 13.4 |

(✓ (≪) indicates that only a small amount of real-environment experience is used, namely for retrieval.)
  • Key Takeaways:
    • UI-Simulator-F (trained only on simulated data) significantly boosts the base model's performance and is competitive with methods like OS-Genesis that train on real data.
    • UI-Simulator-R (with minimal real data for retrieval) achieves a high SR of 12.9% on AndroidWorld, outperforming the powerful zero-shot GPT-4o.
    • UI-Simulator-Grow-R pushes performance further, matching the 72B-parameter Qwen model on WebArena and surpassing GPT-4o on AndroidWorld, despite being based on a much smaller 8B-class model.
    • Crucially, UI-Simulator variants achieve these results using a weaker teacher model (GPT-4o-mini) than baselines like OS-Genesis (GPT-4o), highlighting the power of the simulation paradigm itself.

Ablation Study

The authors conduct several ablations to validate their design choices, with results presented in Table 2.

(This table is a transcription of Table 2 from the paper.)

| Models | WebArena SR (%) | AndroidWorld SR (%) |
| --- | --- | --- |
| UI-Simulator-F | 6.28 | 8.6 |
| – Perturbed Env. | 5.54 | 8.7 |
| – Synthesize in Real Env. | 4.31 | 4.7 |
| UI-Simulator-R | 6.40 | 12.9 |
| – Synthesize in Real Env. | 4.31 | 9.1 |
| – w/o Step-Wise Task Control | 1.72 | 5.2 |
| – w/o Multi-Step Simulation | 4.06 | 9.1 |
| OS-Genesis | 6.16 | 9.1 |
| – Perturbed Env. | 4.43 | 8.7 |
| – Same # of Experience | 1.48 | 5.2 |
  • Agent Robustness: When UI layouts are perturbed, UI-Simulator-F (trained on diverse simulated data) shows a much smaller performance drop than OS-Genesis (trained on real data), proving that simulation leads to more robust agents.
  • Simulated vs. Real Environment: Surprisingly, training on simulated data outperforms training on real environments using the same trajectory collection pipeline. The authors reason this is because simulators can bypass real-world limitations (e.g., generate a successful search result every time) and create more diverse scenarios than a limited set of real accounts or website states would allow.
  • Component Importance: Removing Step-Wise Task Control causes a massive performance drop, as the generated trajectories become homogeneous and less diverse. Removing the Multi-Step Simulation process also hurts performance, as single-step simulation tends to generate biased, less rich content.

UI-SIMULATOR-GROW vs. Standard Scaling

Figure 3: The effect of standard scaling and UI-Simulator-Grow targeted scaling. The chart compares the success rates of standard UI-Simulator scaling and UI-Simulator-Grow targeted scaling on WebArena and AndroidWorld as the scaling ratio increases; UI-Simulator-Grow outperforms standard scaling across the scaling ratios shown.

Figure 3 shows that UI-Simulator-Grow leads to a much steeper performance improvement compared to standard scaling (blindly adding more data). It achieves better performance with less data, demonstrating superior data efficiency. On WebArena, it reaches the final performance of the standard 3x scaling using only 2x the data (which corresponds to 66% of the total dataset).

Figure 4: Successful task numbers across the 5 main task categories through the three iterations of UI-Simulator-Grow scaling. The chart shows the number of successfully completed tasks in each of the five main categories across the three iterations; success counts generally rise with each iteration, reflecting the effect of the targeted scaling paradigm.

Figure 4 breaks down the performance gains by task category. UI-Simulator-Grow shows consistent improvement across most categories. Notably, in complex categories like Repo (code repository operations), the agent only begins to solve tasks in the final iteration, highlighting the paradigm's ability to help agents master increasingly difficult skills.

7. Conclusion & Reflections

  • Conclusion Summary: The paper successfully introduces UI-Simulator, a novel and scalable paradigm for training digital agents by using LLMs as world simulators. This approach is not only effective—rivaling or surpassing methods trained on real environments—but also more robust and efficient. The proposed targeted scaling strategy, UI-Simulator-Grow, further enhances data efficiency by strategically synthesizing the most impactful training examples. The work demonstrates that simulation is a powerful and promising direction for overcoming the data bottleneck in digital agent development.

  • Limitations & Future Work:

    • The authors acknowledge that their current simulation operates on a textual, structured representation of the UI (the accessibility tree).
    • Their future work aims to extend the simulation to the pixel level, which would help narrow the "sim-to-real" gap for visually complex applications where the accessibility tree is insufficient.
    • They also plan to extend the paradigm to other UI domains like desktop operation and to any environment that can be represented in text.
  • Personal Insights & Critique:

    • Novelty and Impact: The core idea of simulating the environment itself is a significant conceptual shift from prior work that focused on exploring real environments. This decoupling is a powerful abstraction that addresses fundamental bottlenecks in infrastructure and data diversity. The results are compelling and suggest this is a highly promising research direction.

    • Data-Centric AI: UI-Simulator-Grow is an excellent example of a data-centric approach. It emphasizes that the quality and relevance of data can be more important than sheer quantity, a crucial lesson for building AI systems efficiently.

    • Potential Weakness (Sim-to-Real Gap): While the results are strong, the simulation is still an abstraction. The real world is messy; websites have visual bugs, non-standard components, and dynamic JavaScript elements that may not be perfectly captured by an accessibility tree. An agent trained purely in this "clean" simulated world might struggle with these real-world imperfections. The authors' plan to move to pixel-level simulation is the right next step to address this.

    • The Role of the Simulator LLM: The quality of the entire pipeline is ultimately bounded by the capability of the LLM used as the world simulator. As base LLMs become more powerful, the quality and realism of UI-Simulator will likely improve in tandem, making this a very future-proof approach.

    • Human Evaluation Interface: The paper's appendix (referenced as containing human evaluation details) likely uses the interface shown in Figure 6, which provides a structured way to assess the quality of synthesized trajectories, indicating a thorough evaluation process.

      Figure 6: The front-end web interface for trajectory human evaluation. The interface is used for human assessment of trajectory plausibility and completion, and includes a trajectory viewer and an evaluation form showing individual steps and state information.
