Can LLM Agents Simulate Multi-Turn Human Behavior? Evidence from Real Online Customer Behavior Data
TL;DR Summary
This study quantitatively evaluates LLM agents on 31,865 real shopping sessions and finds low baseline accuracy (11.86%) in mimicking multi-turn human behavior. Fine-tuning on real click data augmented with synthesized reasoning traces boosts accuracy to 17.26%, significantly improving prediction fidelity.
Abstract
Recent research shows that LLM Agents can generate "believable" human behaviors via prompt-only methods, and such agents have been increasingly adopted in downstream applications. However, existing evaluation of these agents only focuses on qualitative believability (whether human raters think they are accurate), leaving open questions of whether LLM agents can accurately generate step-by-step actions mimicking a particular human's behavior in a multi-turn interaction task. In this work, we take shopping as a case study and present the first large-scale quantitative evaluation of state-of-the-art LLMs' ability to accurately simulate human behavior. Using real-world data from 31,865 online shopping sessions containing 230,965 user actions, our evaluation reveals that prompt-based LLMs (DeepSeek-R1, Llama, Claude) achieve only 11.86% accuracy in generating human actions, highlighting a substantial gap in actual behavioral accuracy. Through experiments, we also showcase that strategies as simple as fine-tuning LLMs on real human click-through data augmented with synthesized reasoning traces can greatly enhance models' performance. The fine-tuned Qwen2.5-7B achieves 17.26% action generation accuracy and 33.86% F1 score on final purchase prediction, representing substantial improvements of 5.4% and 13.85% over prompt-only baselines. This work establishes the first rigorous benchmark for human behavior simulation and provides actionable insights for developing more accurate LLM agents for future downstream applications.
In-depth Reading
1. Bibliographic Information
- Title: Can LLM Agents Simulate Multi-Turn Human Behavior? Evidence from Real Online Customer Behavior Data
- Authors: Yuxuan Lu, Jing Huang, Yan Han, Bingsheng Yao, Sisong Bei, Jiri Gesi, Yaochen Xie, Zheshen (Jessie) Wang, Qi He, Dakuo Wang
- Affiliations: The authors are affiliated with Amazon.com, Inc. and Northeastern University. This blend of industry and academic researchers suggests access to large-scale, real-world data and a focus on both practical application and rigorous academic evaluation.
- Journal/Conference: This paper is a preprint submitted to arXiv. Preprints are drafts of research papers shared publicly before (or during) formal peer review. While not yet certified by a conference or journal, they provide a first look at cutting-edge research.
- Publication Year: The paper was first submitted to arXiv on March 26, 2025; the linked PDF is version 7, reflecting subsequent revisions.
- Abstract: The abstract highlights that while LLM Agents can generate "believable" human behavior, their quantitative accuracy is unknown. This paper presents the first large-scale quantitative evaluation using real online shopping data (31,865 sessions). It finds that standard prompt-based LLMs achieve very low accuracy (11.86%) in predicting a user's step-by-step actions. The authors show that fine-tuning LLMs on this real data, especially when augmented with synthetically generated "reasoning" for each action, significantly improves performance. Their best fine-tuned model achieves 17.26% action accuracy and a 33.86% F1 score for purchase prediction, establishing a new benchmark and providing a path forward for creating more accurate human simulators.
- Original Source Link:
- Official Source: https://arxiv.org/abs/2503.20749
- PDF Link: https://arxiv.org/pdf/2503.20749v7.pdf
- Publication Status: This is a preprint and has not yet undergone formal peer review for a conference or journal publication.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Recent advancements have led to the creation of LLM Agents—AI systems using Large Language Models (LLMs) to simulate human behavior in various tasks like social interactions or web browsing. However, existing evaluations of these agents are flawed. They either focus on qualitative "believability" (do human raters feel the agent is human-like?) or on outcome-centric accuracy (did the agent complete the final goal, like buying a product?).
- Gap in Prior Work: There is no large-scale, rigorous, quantitative evaluation of whether these agents can accurately replicate the step-by-step actions of a real human in a multi-turn interaction. It's unknown if an agent that seems human also acts human at each decision point.
- Innovation: This paper provides the first systematic, process-centric evaluation of LLM agents. Using a massive real-world dataset of online shopping sessions, it directly measures how well LLMs can predict the next action a specific user will take. It also introduces a novel technique: augmenting the training data with synthesized reasoning traces to help the model learn the "why" behind an action, not just the "what".
- Main Contributions / Findings (What):
- First Rigorous Benchmark: The paper establishes the first large-scale, quantitative benchmark for process-centric human behavior simulation, using over 230,000 real user actions from an e-commerce platform.
- Prompt-Only LLMs are Inaccurate: State-of-the-art LLMs (like Claude, Llama, DeepSeek-R1), when used "out-of-the-box" with prompting, are poor simulators of step-by-step human actions, achieving at best 11.86% accuracy.
- Fine-Tuning is Effective: Simply fine-tuning a smaller LLM on real human action data significantly boosts performance, demonstrating the need for domain-specific adaptation.
- Synthesized Reasoning Further Improves Accuracy: The most significant finding is that fine-tuning an LLM with not just actions but also synthetically generated reasoning (explanations for each action) provides a substantial additional performance gain. The fine-tuned Qwen2.5-7B model with reasoning achieved 17.26% action accuracy and a 33.86% F1 score on final purchase prediction, large improvements over the baselines.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Large Language Model (LLM): A deep learning model trained on vast amounts of text data, capable of understanding and generating human-like language (e.g., GPT-4, Llama 3).
- LLM Agent: An autonomous system that uses an LLM as its core "brain" to perceive its environment, reason, plan, and execute actions to achieve goals.
- Multi-Turn Interaction: A sequential task where a user and a system (or a simulated user and an environment) interact over multiple steps. Online shopping is a classic example, involving a sequence of searches, clicks, and filters.
- Prompt-Only / In-Context Learning (ICL): A method of using a pre-trained LLM without changing its weights. The desired behavior is guided by providing instructions and a few examples ("shots") within the input prompt.
- Fine-Tuning: The process of taking a pre-trained LLM and further training it on a smaller, domain-specific dataset. This adapts the model's knowledge and behavior to the specific task (in this case, simulating shopper actions).
- Reasoning Trace / Chain-of-Thought: A series of intermediate reasoning steps, usually expressed in natural language, that show the model's "thinking process" before it produces a final answer or action.
- Previous Works:
- Studies on "Believable" Simulation:
  - Park et al. (2023) created Generative Agents that simulated believable social behaviors in a virtual town; the evaluation was qualitative, focusing on emergent social phenomena.
  - UXAgent simulated usability testing, with evaluation relying on qualitative interviews about perceived realism.
  - These studies established that LLMs can appear human-like, but didn't measure whether their actions precisely match real human actions.
- Studies on "Outcome-Centric" Evaluation:
  - Yao et al. (2023) with ReAct and Zhou et al. (2024) with WebArena benchmarked web agents based on final task completion rates (e.g., did the agent successfully buy the item?). Other work evaluated whether LLM agents could replicate the final outcome of a human's decision in a trust game.
  - These studies ignored the fidelity of the intermediate steps, focusing only on the final result. An agent could reach the right outcome through a completely non-human-like path.
- Reasoning in Simulation:
  - ReAct (Yao et al., 2023) and WebAgent (Gur et al., 2023) used prompting to make LLMs generate reasoning before acting, improving task success. DeepSeek-AI et al. (2025) showed that synthesizing reasoning traces could help in reinforcement learning.
  - These works used reasoning only in a prompt-based setting. This paper is the first to investigate whether adding synthesized reasoning during the fine-tuning process improves simulation accuracy.
- Differentiation: This work distinguishes itself by uniquely combining three key elements:
- Process-Centric Evaluation: It measures accuracy at every single action, not just the final outcome.
- Real-World Data: It uses a large-scale dataset of actual human behavior, not a synthetic environment.
- Fine-Tuning with Synthesized Reasoning: It goes beyond prompting and demonstrates that fine-tuning with machine-generated reasoning is a powerful technique for improving behavioral fidelity.
4. Methodology (Core Technology & Implementation)
The core of the paper is a framework for training and evaluating LLMs on a next-action prediction task in a multi-turn setting.
Figure 1: Schematic of next-step action prediction. The model takes the currently observed context and the preceding sequence of <Context, Reasoning, Action>_{1:t-1} triples as input and generates the next (Reasoning, Action)_t pair. Because real data lack reasoning traces, synthesized reasoning is supplied instead.
As shown in Figure 1, the model's task is to predict the user's next action and the reasoning behind it, given the history of the current session.
- Principles: The central idea is to frame human behavior simulation as a sequence generation problem. The model learns to predict the next (reasoning, action) pair based on the history of interactions. The key innovation is enriching this history with synthesized reasoning to provide a stronger learning signal.
- Steps & Procedures: The model learns a function that maps the history of a session to the next reasoning-action pair.
$$f(c_{1...t},\; a_{1...t-1},\; r_{1...t-1}) = (r_t, a_t)$$
* $c_{1...t}$: The sequence of **contexts** observed by the user up to the current step $t$. The context is the state of the webpage.
* $a_{1...t-1}$: The sequence of **actions** taken by the user in previous steps.
* $r_{1...t-1}$: The sequence of synthesized **reasoning** traces corresponding to the previous actions.
* $(r_t, a_t)$: The **output** of the model: the predicted reasoning and action for the current step $t$.
- Input Representations:
- Context ($c$): To represent the webpage, the authors use a simplified HTML format. They remove irrelevant elements like CSS and scripts but preserve the core structure (lists, tables) and text. This is more robust than parsing rules and less noisy than raw HTML. To allow the model to refer to specific elements (like a button), each interactable element is given a unique, human-readable hierarchical name (e.g., columbia_shirt.view_product); a schematic example follows this list.
- Action ($a$): The action space is defined by three low-level browser operations:
  - click: Clicks on a specific element.
  - type_and_submit: Types text into a field and submits it (e.g., a search query).
  - terminate: Ends the session (simulating closing the browser).
  This generic action space makes the framework adaptable to other web-based tasks.
- Reasoning ($r$): This is a natural language explanation for why an action was taken. For example, for the action of clicking a "4 stars and up" filter, the reasoning might be: "I want to find a comfortable piece of clothing, so I'm looking for options with high ratings."
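To make the representation concrete, here is a minimal sketch of what a simplified context and the three serialized actions might look like. The element names and the dict-style schema are illustrative assumptions, not the paper's published format.

```python
# Illustrative sketch of the context/action representation described above.
# Element names and the dict schema are hypothetical stand-ins; the paper
# does not publish its exact serialization.

# Simplified HTML context: CSS/scripts stripped, structure and text kept,
# each interactable element tagged with a unique hierarchical name.
context = """
<input name="search.input" placeholder="Search"/>
<ul name="results">
  <li name="columbia_shirt">
    Columbia Men's PFG Shirt - $29.99 - 4.4 stars
    <button name="columbia_shirt.view_product">View product</button>
  </li>
</ul>
<button name="filters.four_stars_and_up">4 stars &amp; up</button>
"""

# The three low-level browser operations, serialized as dicts:
click = {"type": "click", "name": "columbia_shirt.view_product"}
type_and_submit = {"type": "type_and_submit", "name": "search.input",
                   "text": "mens fishing shirt"}
terminate = {"type": "terminate"}  # user leaves the session
```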
- Synthesized Reasoning Trace: Since real-world clickstream data does not contain users' thoughts, the authors generate synthetic reasoning traces (a prompt sketch follows this list). They use a powerful LLM (Claude 3.5 Sonnet) and prompt it with:
- The context (webpage state) before the action.
- The ground-truth action the human took.
- In-context examples from real human think-aloud protocols to guide the style of reasoning. The goal is not to perfectly guess the human's true thought process but to create a plausible and logically consistent intermediate step that helps the model being trained to better connect context to action.
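Below is a minimal sketch of how such a synthesis prompt could be assembled. The prompt wording, the think-aloud example, and the call_llm helper are assumptions made for illustration; the paper uses Claude 3.5 Sonnet but does not publish its exact prompt.

```python
# Sketch of reasoning-trace synthesis. All wording here is hypothetical;
# only the inputs (context, ground-truth action, think-aloud examples)
# come from the paper's description.

THINK_ALOUD_EXAMPLE = (
    'Context: results page for "running shoes". '
    "Action: click on asics_gel.view_product. "
    "Reasoning: The Asics pair is in my price range and well reviewed, "
    "so I want a closer look.\n"
)  # in practice: excerpts from real human think-aloud protocols

def build_synthesis_prompt(context: str, action: dict) -> str:
    """Ask a strong LLM for a plausible reasoning trace linking the observed
    page state to the action the human actually took."""
    return (
        "You are annotating a shopper's clickstream with their likely thoughts.\n"
        f"Style examples:\n{THINK_ALOUD_EXAMPLE}\n"
        f"Current page state:\n{context}\n"
        f"The shopper's next action was: {action}\n"
        "Write one short, first-person sentence explaining why a shopper "
        "might take exactly this action."
    )

# reasoning = call_llm(build_synthesis_prompt(context, action))  # hypothetical helper
```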
- Model Architecture and Training:
- The base models are pre-trained LLMs (e.g., Qwen, Mistral).
- During training, the model is fed a full session history as a single long sequence: $(c_1, r_1, a_1, c_2, r_2, a_2, \ldots, c_T, r_T, a_T)$.
- The training objective is to minimize the standard next-token prediction loss, but only for the reasoning and action tokens. The loss for the context tokens is ignored (masked out), as the model should predict what to do, not what it sees; a code sketch follows this list.
- During evaluation, the model operates autoregressively: given the history, it first generates the reasoning for the next step, and then, conditioned on that generated reasoning, it generates the action.
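Here is a minimal sketch of that loss-masking scheme in a Hugging Face-style causal-LM fine-tune. The tokenizer choice and segment layout are assumptions; only the core idea, computing loss on reasoning/action tokens while masking context tokens with the ignore index, follows the paper's description.

```python
# Minimal loss-masking sketch for fine-tuning on <context, reasoning, action>
# sequences. The tokenizer/model name is an assumption for illustration.
import torch
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B")

def encode_session(turns: list[tuple[str, str, str]]):
    """turns: (context, reasoning, action) strings for one session.
    Returns input_ids plus labels in which every context position is set to
    -100, the ignore index of PyTorch's cross-entropy loss, so next-token
    loss is only computed on reasoning and action tokens."""
    input_ids, labels = [], []
    for context, reasoning, action in turns:
        ctx_ids = tok(context, add_special_tokens=False).input_ids
        tgt_ids = tok(reasoning + "\n" + action, add_special_tokens=False).input_ids
        input_ids += ctx_ids + tgt_ids
        labels += [-100] * len(ctx_ids) + tgt_ids  # mask context positions
    return torch.tensor(input_ids), torch.tensor(labels)
```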
5. Experimental Setup
- Datasets:
- The study uses a proprietary dataset from a major global e-commerce platform.
- Size: It contains 31,865 user sessions from 3,526 users, totaling 230,965 user actions.
- Characteristics: Sessions start with a search and end with either a purchase (4,432 sessions) or termination (27,433 sessions).
- Privacy: The data was collected from users who opted-in, was fully anonymized, and is not publicly available due to its sensitive nature.
- Test Set: A subset of sessions was held out for testing, ensuring no user overlap with the training set. Test cases were created from the second action onward in each session.
- Evaluation Metrics:
- Next Action Generation Accuracy:
  - Conceptual Definition: This metric measures the percentage of times the model predicts the exact next action taken by the human user. A prediction is correct only if the action type (e.g., click), the target element, and any attributes (e.g., search query text) all perfectly match the ground truth. It is averaged per-session to give each session equal weight.
  - This is a very strict, process-centric metric.
- Session Outcome F1 Score:
  - Conceptual Definition: This metric evaluates the model's ability to predict the final outcome of a session. At the last step, the model must predict whether the user will make a purchase (click on the "buy" button) or leave (terminate). This is treated as a binary classification problem. The F1 score is used because the dataset is imbalanced (many more terminations than purchases); it provides a balanced measure between precision and recall.
  - Mathematical Formula:
    $$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
  - Symbol Explanation:
    - Precision = $TP / (TP + FP)$: Of all the times the model predicted "purchase," how often was it correct?
    - Recall = $TP / (TP + FN)$: Of all the actual purchases, how many did the model correctly identify?
    - A True Positive (TP) here is a correctly predicted purchase action. (A computational sketch of both metrics follows this list.)
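The sketch below shows one way these two metrics could be computed; the exact-match rule and the purchase-detection predicate are my reading of the definitions above, not the authors' evaluation code.

```python
# Hedged sketch of the two evaluation metrics described above.

def next_action_accuracy(pred_sessions, gold_sessions):
    """Exact-match accuracy, averaged per session so each session counts equally.
    Actions are dicts; equality requires type, target, and attributes to match."""
    per_session = []
    for preds, golds in zip(pred_sessions, gold_sessions):
        hits = sum(p == g for p, g in zip(preds, golds))
        per_session.append(hits / len(golds))
    return sum(per_session) / len(per_session)

def session_outcome_f1(pred_last, gold_last, is_purchase):
    """F1 on the binary final-step outcome: purchase (positive) vs. terminate.
    is_purchase: caller-supplied predicate mapping an action dict to bool."""
    tp = sum(is_purchase(p) and is_purchase(g) for p, g in zip(pred_last, gold_last))
    fp = sum(is_purchase(p) and not is_purchase(g) for p, g in zip(pred_last, gold_last))
    fn = sum(not is_purchase(p) and is_purchase(g) for p, g in zip(pred_last, gold_last))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```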
- Baselines:
- Prompt-Only (ICL) Models: A range of powerful, instruction-tuned LLMs were evaluated without fine-tuning. This includes open-source models like Llama 3.1, Mixtral, Qwen2.5, and the reasoning-focused DeepSeek-R1, as well as proprietary models like the Claude 3 series.
- Fine-Tuned Models: Several open-source models of varying sizes were fine-tuned on the custom dataset: Qwen2.5-7B, Mistral-7B-v0.3, and Llama-3.2-3B.
6. Results & Analysis
- Core Results: The main results are presented in Table 1, which compares prompt-only, fine-tuned, and reasoning-augmented fine-tuned models.
(Manual transcription of Table 1, as no image was provided)
| Model | Next Action Accuracy | %Δ vs Base | vs DS-R1 | Outcome F1 | %Δ vs Base | vs DS-R1 |
| --- | --- | --- | --- | --- | --- | --- |
| **Open-Source Models** | | | | | | |
| DeepSeek-R1 | 11.86% | – | – | 20.01% | – | – |
| Llama 3.1 8B | 5.05% | – | -6.81% | 10.87% | – | -9.14% |
| Llama 3.1 70B | 8.19% | – | -3.67% | 12.69% | – | -7.32% |
| Mixtral 8x7B | 5.41% | – | -6.45% | 13.16% | – | -6.85% |
| Qwen2.5-7B | 4.25% | – | -7.61% | 11.94% | – | -8.07% |
| *(other open-source models show similar or worse performance)* | | | | | | |
| **Proprietary Models** | | | | | | |
| Claude 3.5 Haiku | 9.18% | – | -2.68% | 14.77% | – | -5.24% |
| Claude 3 Opus | 6.78% | – | -5.08% | 15.08% | – | -4.93% |
| Claude 3.5 Sonnet v2 | 11.69% | – | -0.17% | 18.54% | – | -1.47% |
| *(other proprietary models show similar performance)* | | | | | | |
| **Fine-tuned Models** | | | | | | |
| Qwen2.5-7B | 16.67% | – | +4.81% | 26.92% | – | +6.91% |
| + reasoning | 17.26% | +3.54% | +5.40% | 33.86% | +25.78% | +13.85% |
| Mistral-7B-v0.3 | 14.17% | – | +2.31% | 17.99% | – | -2.02% |
| + reasoning | 15.84% | +11.79% | +3.98% | 30.12% | +67.43% | +10.11% |
| Llama-3.2-3B | 9.31% | – | -2.55% | 4.73% | – | -15.28% |
| + reasoning | 15.77% | +69.39% | +3.91% | 33.99% | +618.60% | +13.98% |

*("vs DS-R1" columns are absolute percentage-point differences from DeepSeek-R1; "%Δ vs Base" is the relative gain over the same model without reasoning.)*

- Analysis:
  - Prompt-Only Models Fail: The best prompt-only model, DeepSeek-R1, only achieves 11.86% action accuracy and a 20.01% F1 score. Other powerful models like Llama 3.1 70B and Claude 3.5 Sonnet v2 perform similarly or worse. This confirms the central hypothesis that out-of-the-box LLMs cannot accurately simulate step-by-step human actions.
  - Fine-Tuning Works: Fine-tuning Qwen2.5-7B on action data alone boosts accuracy to 16.67%, a substantial improvement.
  - Reasoning Provides the Biggest Edge: Adding synthesized reasoning during fine-tuning (+ reasoning) consistently improves performance. Qwen2.5-7B + reasoning achieves the highest action accuracy at 17.26% (a 5.4 percentage-point improvement over the best baseline). The effect is even more dramatic for final outcome prediction, where Llama-3.2-3B + reasoning achieves an F1 score of 33.99%, a massive relative increase and a 13.98-point absolute gain over DeepSeek-R1.
- Action Distribution Analysis:
Figure 2: A stacked bar chart showing how the percentage distribution over five action categories (search, filter, click product, purchase, terminate) differs between models (Qwen2.5-7B, DeepSeek-R1, Claude) and real human behavior.
  - Figure 2 Analysis: This chart reveals a significant behavioral mismatch.
- Humans: Real users rely heavily on iterative search (refining queries, correcting typos) and frequently terminate sessions without buying. They rarely use filters.
- Prompt-Only LLMs (Claude, DeepSeek-R1): These models overuse filters and have a disproportionately high purchase rate. They tend to stick with their initial search query. The authors hypothesize this is because LLM agent benchmarks often reward task completion (purchasing) above all else, creating a bias.
- Fine-Tuned Model (Qwen2.5-7B): The action distribution of the fine-tuned model much more closely mirrors the human ground truth, with more searching, fewer filters, and a more realistic termination rate. This shows it has learned the patterns of human browsing, not just how to complete a task.
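For intuition, a Figure 2-style comparison can be produced by bucketing generated and ground-truth actions into the five categories and comparing their shares. The bucketing rules below are illustrative guesses, not the paper's exact categorization.

```python
# Sketch of an action-distribution comparison (as in Figure 2).
from collections import Counter

CATEGORIES = ["search", "filter", "click_product", "purchase", "terminate"]

def categorize(action: dict) -> str:
    # Illustrative rules; the paper's exact mapping is not public.
    if action["type"] == "type_and_submit":
        return "search"
    if action["type"] == "terminate":
        return "terminate"
    name = action.get("name", "")
    if "filter" in name:
        return "filter"
    return "purchase" if name.endswith("buy_now") else "click_product"

def distribution(actions):
    counts = Counter(categorize(a) for a in actions)
    total = len(actions)
    return {c: counts[c] / total for c in CATEGORIES}

# Comparing distribution(human_actions) with distribution(model_actions)
# surfaces the over-filtering / over-purchasing bias discussed above.
```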
- Ablation Study: The impact of reasoning is isolated by comparing the fine-tuned models with and without the + reasoning augmentation in Table 1.
  - For Qwen2.5-7B, adding reasoning improves action accuracy from 16.67% to 17.26% (a 3.54% relative gain) and the F1 score from 26.92% to 33.86% (a 25.78% relative gain).
  - For Llama-3.2-3B, the improvement is even more stark, with a 69% relative jump in accuracy and a massive 618% relative jump in F1 score.
  - This confirms that providing an explicit reasoning signal during fine-tuning is crucial for improving both step-by-step accuracy and final outcome prediction.
- Error Analysis:
Figure 3: A stacked bar chart showing the distribution of error counts across error categories for DeepSeek-R1, Claude, and Qwen2.5-7B, reflecting how the models differ in the types of errors they make when generating shopping-behavior predictions.
  - Figure 3 Analysis:
    - The biggest error for prompt-only models (Claude, DeepSeek-R1) is "Didn't terminate": they continue the session when a human would have given up and left. This aligns with their purchase-oriented bias.
    - The fine-tuned Qwen2.5-7B model has a much lower "Didn't terminate" error rate, showing it has better learned human termination heuristics.

(Manual transcription of Table 2, as no image was provided)

| | Example 1 | Example 2 |
| --- | --- | --- |
| Previous Action | search for "disney gift" | search for "tee conector" |
| Human Next Action | search for "disney gift card" | search for "tee connector" |
| Qwen-2.5-7B | search for "disney gift card" | search for "tee connector" |
| Claude | click on disney_gift_card_... | click on spalolen_30_pack_... |
- Table 2 Analysis: This table provides concrete examples of the fine-tuned model's superior ability to capture nuanced human behavior.
- In both examples, the human performs an iterative search: refining a query in Example 1 ("disney gift" -> "disney gift card") and correcting a typo in Example 2 ("tee conector" -> "tee connector").
- The fine-tuned Qwen-2.5-7B correctly mimics this iterative search behavior.
- The prompt-only Claude model fails to do so. Instead of refining the search, it proceeds to click on a product from the initial, imperfect search results. This shows fine-tuning helps capture subtle but common user patterns.
7. Conclusion & Reflections
- Conclusion Summary: This paper makes a strong case that current LLM agents, while "believable," are not accurate simulators of step-by-step human behavior. Through a rigorous, large-scale quantitative evaluation on real-world shopping data, the authors demonstrate a significant gap between prompt-only models and actual human actions. They show that fine-tuning on real data, especially when augmented with synthesized reasoning traces, is a highly effective strategy to close this gap. This work not only establishes a much-needed benchmark but also provides a concrete methodology for building more faithful and useful human behavioral simulators.
- Limitations & Future Work: The authors acknowledge several limitations:
  - Reasoning Evaluation: The "goodness" of the synthesized reasoning was only measured by its downstream impact on action accuracy, not by its interpretability or alignment with actual human thought.
  - Generalizability: The study was confined to the online shopping domain. Findings may not generalize to other tasks.
  - Synthetic vs. Real Reasoning: The model was not tested on a dataset with genuine, human-annotated reasoning traces.
  - Bias in Synthesis: The process of generating reasoning might introduce its own biases.
  - Simplified Action Space: The action space was limited (e.g., no scrolling or waiting), which simplifies the simulation.

  Future work could involve using reinforcement learning to optimize reasoning, scaling the dataset, incorporating user personas for personalization, and using vision-language models (VLMs) to better understand web interfaces.
- Personal Insights & Critique:
- Significance: This paper is a crucial reality check for the field of LLM agents. It moves the conversation from "can agents seem human?" to "can agents act human?"—a much higher and more meaningful bar for evaluation.
- Low Absolute Accuracy: While the relative improvements are impressive, the best action accuracy achieved is still only 17.26%. This highlights the immense difficulty of predicting human behavior, which is often stochastic, irrational, and context-dependent. The paper's call for metrics that account for plausible variations and human-like errors is spot-on.
- The Power of Synthesized Data: The success of using synthesized reasoning is a powerful demonstration of data augmentation for LLMs. It shows that we can "bootstrap" models with machine-generated intermediate steps to improve performance on complex tasks, even when ground-truth intermediate data is unavailable.
- Open Questions: A key question remains: does the synthesized reasoning actually resemble human cognition, or is it just a useful computational artifact that helps the model regularize its predictions? Answering this would require cognitive science studies and human evaluation, as the authors note. The transferability of this technique to domains outside of goal-oriented web tasks (e.g., creative writing, social chat) is also an interesting avenue for future exploration.