Customer-R1: Personalized Simulation of Human Behaviors via RL-based LLM Agent in Online Shopping
TL;DR Summary
Customer-R1 uses RL-based LLMs conditioned on user personas for personalized step-wise behavior simulation in online shopping, outperforming prompting and supervised fine-tuning in accuracy and fidelity on the OPeRA dataset.
Abstract
Simulating step-wise human behavior with Large Language Models (LLMs) has become an emerging research direction, enabling applications in various practical domains. While prior methods, including prompting, supervised fine-tuning (SFT), and reinforcement learning (RL), have shown promise in modeling step-wise behavior, they primarily learn a population-level policy without conditioning on a user's persona, yielding generic rather than personalized simulations. In this work, we pose a critical question: how can LLM agents better simulate personalized user behavior? We introduce Customer-R1, an RL-based method for personalized, step-wise user behavior simulation in online shopping environments. Our policy is conditioned on an explicit persona, and we optimize next-step rationale and action generation via action correctness reward signals. Experiments on the OPeRA dataset demonstrate that Customer-R1 not only significantly outperforms prompting and SFT-based baselines in next-action prediction tasks, but also better matches users' action distribution, indicating higher fidelity in personalized behavior simulation.
In-depth Reading
1. Bibliographic Information
- Title: Customer-R1: Personalized Simulation of Human Behaviors via RL-based LLM Agent in Online Shopping
- Authors: Ziyi Wang¹, Yuxuan Lu¹, Yimeng Zhang², Jing Huang³, Dakuo Wang¹
- Affiliations: ¹Northeastern University, ²Michigan State University, ³Amazon
- Journal/Conference: The paper is an arXiv preprint dated October 2025 and has not yet appeared at a peer-reviewed venue. arXiv is a well-known repository for pre-publication academic papers in fields like computer science, allowing for rapid dissemination of research.
- Publication Year: The paper is a preprint from 2025.
- Abstract: The paper addresses the limitation of existing human behavior simulation models, which typically create generic, "average-user" policies. The authors introduce CUSTOMER-R1, a novel method that uses Reinforcement Learning (RL) to create personalized, step-by-step simulations of user behavior in online shopping. The key idea is to condition the model on an explicit user persona (a profile of the user's traits and preferences) and optimize it using reward signals based on the correctness of its predicted actions. Experiments on the OPeRA dataset show that CUSTOMER-R1 significantly outperforms standard approaches like prompting and Supervised Fine-Tuning (SFT), not only in predicting the next action but also in more closely matching the action patterns of individual users.
- Original Source Link:
- arXiv: https://arxiv.org/abs/2510.07230
- PDF: https://arxiv.org/pdf/2510.07230v2.pdf
- Status: This is a preprint and has not yet been formally peer-reviewed and published at a conference.
2. Executive Summary
- Background & Motivation (Why):
  - Core Problem: Simulating how humans behave step-by-step using Large Language Model (LLM) agents is a growing field. However, current methods, whether based on simple prompting, Supervised Fine-Tuning (SFT), or Reinforcement Learning (RL), tend to learn a one-size-fits-all policy. They predict the most common action an "average" user would take in a given situation, ignoring the vast differences in goals, preferences, and browsing styles among individuals.
  - Importance & Gaps: This lack of personalization is a major drawback. In real-world applications like e-commerce, understanding individual user behavior is critical for tasks like usability testing, system design, and personalized recommendations. Prior work either simulated an "average" user or showed only marginal benefits from adding persona information when using off-the-shelf LLMs without specialized training. This leaves a critical question unanswered: how can LLM agents be trained to better simulate personalized user behavior?
  - Innovation: This paper introduces a method that directly tackles this personalization gap. It combines the power of Reinforcement Learning with explicit persona conditioning, training an LLM agent to not just predict a correct action, but to predict the correct action for a specific user.
- Main Contributions / Findings (What):
  - A Novel Framework: The authors propose CUSTOMER-R1, an RL-based framework designed specifically for personalized, step-wise simulation of online shopping behavior.
  - Persona-Conditioned Policy: The model's policy is explicitly conditioned on a rich user persona (including demographics, personality, and shopping habits), guiding it to generate behaviors consistent with an individual's unique style.
  - Custom Reward Design: A tailored reward function is introduced that provides a clear, verifiable signal based on the exact correctness of the predicted action (type and attributes), encouraging high-fidelity predictions.
  - State-of-the-Art Performance: On the OPeRA dataset, CUSTOMER-R1 is shown to significantly outperform baselines (zero-shot prompting, SFT) in next-action prediction accuracy and in matching the statistical distribution of real user actions.
  - In-depth Analysis: The paper provides extensive ablation studies confirming that both persona and intermediate rationale (the "why" behind an action) are crucial for high-quality personalized simulation.
3. Prerequisite Knowledge & Related Work
To understand this paper, one should be familiar with the following concepts:
- Foundational Concepts:
  - Large Language Models (LLMs): These are massive neural networks (e.g., GPT-4, Qwen) trained on vast amounts of text data. They excel at understanding and generating human-like language and possess powerful reasoning capabilities.
  - LLM Agents: An LLM that is augmented to interact with an environment. Instead of just generating text, an LLM agent can perform actions, use tools (like a search engine), and make decisions to achieve a goal. In this paper, the agent simulates a user browsing a website.
  - Supervised Fine-Tuning (SFT): A training process where a pre-trained LLM is further trained on a smaller, specific dataset of input-output examples. The model learns to mimic the "correct" outputs provided in the dataset.
  - Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent learns from trial and error, guided by a reward signal that indicates how good or bad an action was.
  - Persona: A detailed profile describing a fictional or real user. In this paper, it includes demographics (age, gender), personality traits (e.g., from a personality survey), and shopping preferences (e.g., price sensitivity, brand loyalty).
- Previous Works & Technological Evolution:
  - Early work on LLM-based simulation, like Park et al. (2023), focused on creating "believable" social simulations in virtual worlds (Generative Agents) without rigorous quantitative validation against real human data.
  - Later studies moved towards accurately modeling step-wise actions in real-world scenarios like online shopping. They used SFT but trained an "average-user" model on private data.
  - Zhang et al. (2025b) introduced Shop-R1, which applied RL to improve action prediction accuracy, but again, this was for a generic, non-personalized user.
  - The concept of using personas is not new. Park et al. (2024) showed that agents with persona profiles performed better in survey-taking tasks. Critically, Wang et al. (2025b) (who also created the OPeRA dataset used in this paper) benchmarked the use of persona but only with simple prompting on off-the-shelf LLMs, finding the performance gain to be marginal. This highlighted that just giving an LLM a persona is not enough; it needs to be trained to use it effectively.
  - In RL for LLMs, the field has advanced from preference-based methods like RLHF to methods using verifiable rewards, such as GRPO (DeepSeek-R1). These methods are more stable and effective for tasks with clear correctness criteria (like math problems), and this paper adapts that paradigm to the more open-ended task of behavior simulation.
- Differentiation: CUSTOMER-R1 stands out by being the first to effectively combine RL with explicit persona conditioning for verifiable, step-wise personalized behavior simulation. While prior works had pieces of the puzzle (RL for average users, or persona with weak prompting), this paper integrates them into a cohesive framework and demonstrates significant gains, proving that this combination is key to achieving high-fidelity personalization.
4. Methodology (Core Technology & Implementation)
The core of the paper is the CUSTOMER-R1 framework, which trains an LLM agent to simulate a specific user's shopping journey.
Figure 1 (from the paper): an illustration of user behavior simulation in an online shopping scenario. Given a user's historical action sequence (e.g., searching, clicking on a brand or a product), the model infers the next action, reflecting personalized behavior prediction.
As shown in Figure 1, the task involves observing a user's past actions (e.g., searching for "earbuds," clicking on a brand) to predict what they will do next.
- Task Formulation: The goal is to predict the next action and the rationale behind it. The model $\pi_\theta$ takes a sequence of past information and a user persona to generate the current rationale and action $(r_t, a_t)$. The formal definition is:
  $$(r_t, a_t) \sim \pi_\theta\left(\cdot \mid a_{1:t-1},\, r_{1:t-1},\, o_{1:t},\, p_u\right)$$
  Where:
  - $a_{1:t-1}$: The history of actions taken by the user up to the previous step.
  - $r_{1:t-1}$: The history of rationales explaining why those actions were taken.
  - $o_{1:t}$: The sequence of web page states (as HTML) the user has observed.
  - $p_u$: The persona profile for the specific user $u$.
  - $(r_t, a_t)$: The predicted rationale and action for the current step $t$.

  The actions have different types and required attributes, as detailed in the manually transcribed table below.

  Table 1: Required prediction attributes and example for each action type.

  | Action Type | Attributes | Example |
  | --- | --- | --- |
  | click | element_name | click on "flter_price" |
  | input | element_name, text | type "earbuds" into the search box |
  | terminate | None | terminate the session |
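To make the action schema concrete, here is a minimal sketch of structured action objects matching Table 1, plus a check that the required attributes are present. The field names ("type", "element_name", "text") and the helper are illustrative assumptions, not the paper's exact format.

```python
# Assumed schema for the three action types in Table 1 (illustrative, not the paper's exact format).
REQUIRED_ATTRS = {
    "click": ["element_name"],
    "input": ["element_name", "text"],
    "terminate": [],
}

def is_well_formed(action: dict) -> bool:
    """Return True if the action has a known type and all required attributes."""
    attrs = REQUIRED_ATTRS.get(action.get("type"))
    if attrs is None:
        return False
    return all(key in action for key in attrs)

# Example actions mirroring Table 1.
print(is_well_formed({"type": "click", "element_name": "filter_price"}))                    # True
print(is_well_formed({"type": "input", "element_name": "search_box", "text": "earbuds"}))   # True
print(is_well_formed({"type": "terminate"}))                                                # True
```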
- CUSTOMER-R1 Framework: The framework, illustrated in Figure 2, outlines the training process.

  Figure 2 (from the paper): the CUSTOMER-R1 framework observes the user's history in an online shopping session (HTML page states, actions, and rationales) together with the user persona, predicts the next action and its rationale, and computes a reward from the match between the predicted and ground-truth action to optimize the policy.

  - Steps & Procedures:
    - Input: The model receives the user's shopping history, the current web page HTML, and the user's detailed persona profile.
    - Generation: The model is prompted to generate a rationale for its next move, followed by the action itself in a structured JSON format.
    - Reward Calculation: The generated action is compared to the ground-truth action from the dataset to compute a reward.
    - Policy Optimization: This reward is used to update the model's policy using an RL algorithm, encouraging it to generate actions that are more likely to be correct for that specific user.
- Mathematical Formulas & Key Details: The reward function is central to the framework's success. It is composed of two parts:
  - Action Reward ($R_{\text{action}}$): This is a strict, binary reward. It is 1 if and only if the predicted action type and all of its required attributes exactly match the ground-truth action:
    $$R_{\text{action}} = \mathbb{1}\left[\hat{a}_t = a_t\right]$$
    - $\hat{a}_t$: The action predicted by the model.
    - $a_t$: The ground-truth action from the dataset.
  - Format Reward ($R_{\text{format}}$): A binary reward that is 1 if the model's output is a valid JSON object conforming to the required schema, and 0 otherwise.

  The overall reward $R$ combines these with a special weighting function:
  $$R = f(a_t) \cdot R_{\text{action}} + R_{\text{format}}$$
  - $f(\cdot)$: A difficulty-aware weighting function. It assigns a higher reward for correctly predicting more complex or rarer actions (e.g., an input action) compared to simpler, more frequent ones (e.g., a common click action). This prevents the model from just learning to predict the easiest actions and encourages it to master a wider range of behaviors.
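As an illustration, here is a minimal Python sketch of this two-part reward. The weight values and the strict dictionary comparison are assumptions for illustration, not the paper's actual implementation.

```python
import json

# Assumed difficulty-aware weights: rarer/harder action types earn more when correct.
# The specific values are illustrative only.
ACTION_WEIGHTS = {"click": 1.0, "input": 2.0, "terminate": 2.0}

def compute_reward(model_output: str, gt_action: dict) -> float:
    """Format reward: 1 if the output parses as JSON (a proxy for schema validity here).
    Action reward: 1 only if type and all attributes exactly match the ground truth."""
    try:
        pred = json.loads(model_output)
        r_format = 1.0
    except json.JSONDecodeError:
        return 0.0  # no valid JSON: no format reward, and the action cannot be checked
    r_action = 1.0 if pred == gt_action else 0.0  # strict exact match
    weight = ACTION_WEIGHTS.get(gt_action.get("type"), 1.0)
    return weight * r_action + r_format
```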
  The paper uses Group Relative Policy Optimization (GRPO) as its RL algorithm. The optimization objective is:
  $$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\left(\min\!\Big(\rho_i \hat{A}_i,\ \operatorname{clip}\big(\rho_i,\, 1-\epsilon,\, 1+\epsilon\big)\hat{A}_i\Big) - \beta\, \mathbb{D}_{\text{KL}}\!\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)\right)\right]$$
  - $\mathcal{J}_{\text{GRPO}}(\theta)$: The objective function to be maximized.
  - $\rho_i = \pi_\theta(o_i \mid q) / \pi_{\theta_{\text{old}}}(o_i \mid q)$: The ratio of probabilities of the new policy ($\pi_\theta$) to the old policy ($\pi_{\theta_{\text{old}}}$), which measures how much the policy has changed.
  - $\hat{A}_i$: The group-relative advantage. It normalizes the reward for a given sample by subtracting the mean reward and dividing by the standard deviation of the rewards within a group of generated samples. This helps stabilize training by indicating if an action was better or worse than the average action in that context.
  - $\operatorname{clip}(\cdot,\, 1-\epsilon,\, 1+\epsilon)$: A function that constrains the policy ratio to be within a small range to prevent overly large, destabilizing updates.
  - $\mathbb{D}_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$: A Kullback-Leibler (KL) divergence term that penalizes the new policy for deviating too much from a reference policy (often the initial SFT model), keeping the model from "forgetting" its general language capabilities.
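To make the group-relative advantage concrete, here is a small, self-contained sketch of how rewards within a sampled group could be normalized; it is an illustration of the standard normalization described above, not the paper's code.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each sampled response's reward against its group:
    subtract the group mean and divide by the group standard deviation."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean_r) / std_r for r in rewards]

# Example: 4 responses sampled for the same step; only one matched the ground truth.
print(group_relative_advantages([2.0, 1.0, 1.0, 1.0]))
```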
5. Experimental Setup
- Datasets: The study uses the OPeRA-filtered dataset from Wang et al. (2025b). This is the only publicly available dataset with real-world, step-by-step user shopping logs that include action pairs, annotated rationales, and rich user personas.
  - Size: 527 sessions, 5,856 action pairs, 207 annotated rationales.
  - Users: 49 real users.
  - Characteristics: The action distribution is imbalanced, with click being the most common action. The tables below (transcribed from the paper) show the distribution.

  Table 2: Action type distribution in OPeRA-filtered.

  | Action Type | Count | Percentage |
  | --- | --- | --- |
  | Click | 5,051 | 86.3% |
  | Input | 597 | 10.2% |
  | Terminate | 208 | 3.6% |
  | All | 5,856 | |

  Table 3: Fine-grained click type distribution in OPeRA-filtered dataset.

  | Click Type | Count | Percentage |
  | --- | --- | --- |
  | review | 1052 | 20.8% |
  | search | 763 | 15.1% |
  | product_option | 700 | 13.9% |
  | product_link | 537 | 10.6% |
  | other | 449 | 8.9% |
  | purchase | 321 | 6.4% |
  | nav_bar | 283 | 5.6% |
  | page_related | 198 | 3.9% |
  | quantity | 191 | 3.8% |
  | suggested_term | 182 | 3.6% |
  | cart_side_bar | 145 | 2.9% |
  | cart_page_select | 139 | 2.8% |
  | filter | 91 | 1.8% |
- Data Processing:
  - Dynamic Content Selection: To handle long session histories that exceed the model's context window, the authors use a smart truncation strategy. They keep the full HTML for the most recent steps but discard the HTML for older steps, retaining only the action and rationale information to preserve behavioral cues without overflowing the context (see the sketch after this list).
  - Rationale Augmentation: Since not all actions in the dataset have human-annotated rationales, the authors use a powerful LLM (claude-3.5-sonnet) to generate synthetic rationales for the missing entries. This provides a complete dataset for training.
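Here is a minimal sketch of the dynamic content selection idea, assuming each step is stored as a dict with "html", "action", and "rationale" fields; the field names and the cutoff parameter are illustrative assumptions, not the paper's implementation.

```python
def build_context(history: list[dict], keep_html_last_n: int = 3) -> list[dict]:
    """Keep full HTML only for the most recent steps; for older steps,
    drop the HTML and retain just the action and rationale."""
    context = []
    for i, step in enumerate(history):
        if i >= len(history) - keep_html_last_n:
            context.append(step)  # recent step: keep html + action + rationale
        else:
            context.append({"action": step["action"], "rationale": step["rationale"]})
    return context
```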
- Evaluation Metrics:
  - Next Action Generation Accuracy:
    - Conceptual Definition: Measures the percentage of predictions where the action type and all its required attributes (e.g., element_name, text) exactly match the ground truth. This is a very strict metric of prediction quality (illustrated in the sketch after this list).
    - Mathematical Formula:
      $$\text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left[\hat{a}_i = a_i\right]$$
    - Symbol Explanation: $N$ is the total number of predictions, $\hat{a}_i$ is the $i$-th predicted action, $a_i$ is the $i$-th ground-truth action, and $\mathbb{1}[\cdot]$ is the indicator function, which is 1 if the condition inside is true and 0 otherwise.
  - Action Type F1 (Macro):
    - Conceptual Definition: First, the F1 score is calculated for each action type (click, input, terminate). The Macro-F1 is the unweighted average of these per-class F1 scores. It is useful for imbalanced datasets because it treats each class equally, regardless of its frequency.
    - Mathematical Formula:
      $$F1_c = \frac{2\,TP_c}{2\,TP_c + FP_c + FN_c}, \qquad \text{Macro-F1} = \frac{1}{C}\sum_{c=1}^{C} F1_c$$
    - Symbol Explanation: TP (True Positives), FP (False Positives), and FN (False Negatives) are counts of correct and incorrect classifications for class $c$; $C$ is the number of classes, and $F1_c$ is the F1 score for class $c$.
  - Fine-grained Type Accuracy:
    - Conceptual Definition: A more detailed accuracy metric. For click actions, it checks if the predicted click subtype (e.g., review, search, purchase) matches the ground truth. For other actions, it checks if input and terminate are correctly identified.
  - Session Outcome F1 (Weighted):
    - Conceptual Definition: Evaluates the model's ability to predict the final outcome of a shopping session: did the user purchase or terminate? The F1 score is computed for these two outcomes, and the Weighted-F1 averages them, weighted by the number of instances in each class. This measures how well the model captures the user's ultimate intent.
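As a small illustration of the first two metrics, the sketch below computes strict exact-match accuracy and action-type Macro-F1 with scikit-learn; the toy predictions are made up and the action schema is the assumed one used earlier.

```python
from sklearn.metrics import f1_score

# Toy predictions: each item is (predicted action dict, ground-truth action dict).
pairs = [
    ({"type": "click", "element_name": "search"}, {"type": "click", "element_name": "search"}),
    ({"type": "click", "element_name": "review"}, {"type": "input", "element_name": "search_box", "text": "earbuds"}),
    ({"type": "terminate"}, {"type": "terminate"}),
]

# Next Action Generation Accuracy: strict exact match of type and all attributes.
exact_match = sum(pred == gt for pred, gt in pairs) / len(pairs)

# Action Type Macro-F1: unweighted average of per-type F1 scores.
pred_types = [pred["type"] for pred, _ in pairs]
gt_types = [gt["type"] for _, gt in pairs]
macro_f1 = f1_score(gt_types, pred_types, average="macro", zero_division=0)

print(f"exact match: {exact_match:.2f}, macro-F1: {macro_f1:.2f}")
```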
- Baselines: The authors compare four different training configurations of the Qwen2.5-7B-Instruct-1M model:
  - Zero-shot Inference: The pre-trained LLM is used directly with prompting, without any further training on the shopping dataset.
  - SFT (Supervised Fine-Tuning): The model is fine-tuned on the dataset to predict the ground-truth action for each step.
  - RL (Reinforcement Learning): The model is trained from scratch (from its pre-trained state) using only the RL objective (GRPO) and the custom reward function.
  - SFT+RL: The model is first fine-tuned with SFT and then further optimized with RL. This is the proposed CUSTOMER-R1 method.
6. Results & Analysis
- Core Results: The main results clearly demonstrate the superiority of the approach.

  Table 4: Evaluation of next action prediction task. (Manual Transcription)

  | Method | Next Action Gen. (Accuracy) | Action Type (Macro-F1) | Fine-grained Type (Accuracy) | Session Outcome (Weighted-F1) |
  | --- | --- | --- | --- | --- |
  | Zero-shot Inference | 7.32 | 33.43 | 25.72 | 41.11 |
  | RL | 24.72 | 31.17 | 39.58 | 40.51 |
  | SFT | 35.14 | 72.66 | 56.43 | 66.29 |
  | SFT+RL | 39.58 | 78.50 | 61.20 | 79.45 |

  - Analysis: Zero-shot performance is very poor, confirming that this is a difficult task requiring specialized training. RL alone improves exact-match accuracy but is unstable and performs poorly on F1 metrics, suggesting it struggles with the imbalanced action types. SFT provides a strong baseline, significantly improving all metrics by learning the basic patterns from the data. The winning approach, SFT+RL (CUSTOMER-R1), achieves the best results across the board. The SFT phase provides a stable and knowledgeable starting point, which the RL phase then refines by optimizing directly for the strict goal of action correctness.
- Ablations / Parameter Sensitivity: The paper conducts several crucial ablation studies to understand what makes CUSTOMER-R1 work.

  1. Effect of Persona and Rationale: This study investigates the impact of removing persona and rationale from the model's input.

  Table 5: Model performance without persona or rationale. (Manual Transcription)

  | Method | Setting | Next Action Gen. (Accuracy) | Action Type (Macro-F1) | Fine-grained Type (Accuracy) | Session Outcome (Weighted-F1) |
  | --- | --- | --- | --- | --- | --- |
  | Zero-shot | | 7.32 | 33.43 | 25.72 | 41.11 |
  | Zero-shot | w/o persona | 10.20 | 33.10 | 26.05 | 35.88 |
  | Zero-shot | w/o rationale | 4.10 | 25.33 | 16.91 | 38.78 |
  | RL | | 24.72 | 31.17 | 39.58 | 40.51 |
  | RL | w/o persona | 26.27 | 31.20 | 41.13 | 32.46 |
  | RL | w/o rationale | 12.64 | 31.20 | 20.84 | 44.25 |
  | SFT | | 35.14 | 75.28 | 56.43 | 75.85 |
  | SFT | w/o persona | 35.37 | 64.22 | 57.43 | 60.95 |
  | SFT | w/o rationale | 32.04 | 67.93 | 52.22 | 71.38 |
  | SFT+RL | | 39.58 | 78.50 | 61.20 | 79.45 |
  | SFT+RL | w/o persona | 37.80 | 66.67 | 59.42 | 59.73 |
  | SFT+RL | w/o rationale | 34.15 | 73.15 | 53.99 | 67.37 |

  - Analysis: For the best model (SFT+RL), removing persona causes a significant drop in performance, especially in Action Type (Macro-F1) and Session Outcome (F1). This shows that the persona helps the model balance different action types and, crucially, make better decisions about when to purchase or give up. Removing rationale also consistently hurts performance across all settings, confirming its role as a reasoning "scaffold" that helps the model link context to action.

  2. Effect of Model Size and Context Length: This study shows that a larger model and a longer context window are beneficial.

  Table 6: Ablation results showing the effect of model size and context length. (Manual Transcription)

  | Model Size | Context | Next Action Gen. (Accuracy) | Action Type (Macro-F1) | Fine-grained Type (Accuracy) | Session Outcome (Weighted-F1) |
  | --- | --- | --- | --- | --- | --- |
  | Qwen2.5-7B | 65k | 24.72 | 31.17 | 39.58 | 40.51 |
  | Qwen2.5-7B | 40k | 18.85 | 31.14 | 28.60 | 41.41 |
  | Qwen2.5-3B | 65k | 18.07 | 31.30 | 38.91 | 3.97 |

  - Analysis: Increasing the context length from 40k to 65k tokens for the 7B model boosts accuracy significantly, showing that having more of the user's history is vital for good predictions. The smaller 3B model performs much worse, particularly on Session Outcome, where its F1 score plummets to 3.97%. This indicates that sufficient model capacity is required to understand complex user intent over a long session.

  3. Analysis of Error Patterns & Persona's Role:
  - Reward Hacking: The RL-only model exhibits "reward hacking": it learns to exploit the reward system by over-predicting frequent and simple click actions while almost never predicting input or terminate actions.

    Table 7: RL-only action type distribution and accuracy (Reward Hacking). (Manual Transcription)

    | Action Type | Ground Truth | Predicted | Correct | Accuracy |
    | --- | --- | --- | --- | --- |
    | Click | 786 | 831 | 739 | 94.0% |
    | Terminate | 40 | 0 | 0 | 0.0% |
    | Input | 76 | 4 | 1 | 1.3% |
    | Other | 0 | 67 | 0 | 0.0% |

  - Initializing with SFT prevents this, as the SFT model already knows how to produce a balanced distribution of actions. The RL phase can then refine this balanced policy instead of collapsing to a simplistic one. This is visualized in Figure 3.

    Figure 3 (from the paper): bar charts comparing fine-grained user action distributions under three training setups: a) the model trained with RL only, b) the model trained with SFT+RL, and c) the model trained with SFT+RL but without persona. Bar heights show action counts, with colors distinguishing ground truth, predictions, and exact matches.

    Figure 3 shows that the RL-only model (a) over-predicts purchase and review clicks, while the SFT+RL model (b) has a much more balanced distribution that is closer to the ground truth. Removing the persona (c) hurts the prediction of terminate.
  - Persona is Essential for Personalization: To prove the model is actually using the persona information, the authors conducted an experiment where they shuffled the personas (i.e., gave the model the wrong user's persona).

    Table 8: Model performance with shuffled persona. (Manual Transcription)

    | Method | Setting | Next Action Gen. (Accuracy) | Action Type (Macro-F1) | Fine-grained Type (Accuracy) | Session Outcome (Weighted-F1) |
    | --- | --- | --- | --- | --- | --- |
    | SFT+RL | persona | 39.58 | 78.50 | 61.20 | 79.45 |
    | SFT+RL | shuffle | 28.94 | 38.88 | 40.35 | 48.41 |

    Performance collapses across all metrics when the persona is shuffled. This provides strong evidence that the model learns to ground its predictions in the specific user's profile, shifting its policy from an "average user" to a "specific user" and thereby achieving true personalization.
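For intuition, the shuffled-persona control amounts to assigning each session another user's persona before evaluation. The sketch below shows one way such a control could be set up; the data layout (a list of session dicts with a "persona" field) is a hypothetical assumption, not the paper's code.

```python
import random

def shuffle_personas(sessions: list[dict], seed: int = 0) -> list[dict]:
    """Control condition: give each session a persona from a different session
    by permuting the persona list until no persona stays in place."""
    rng = random.Random(seed)
    personas = [s["persona"] for s in sessions]
    shuffled = personas[:]
    while True:
        rng.shuffle(shuffled)
        if len(personas) < 2 or all(a is not b for a, b in zip(shuffled, personas)):
            break
    return [{**s, "persona": p} for s, p in zip(sessions, shuffled)]
```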
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully demonstrates that CUSTOMER-R1, an RL-based method conditioned on explicit user personas, can significantly improve the personalization and accuracy of human behavior simulation in online shopping. It outperforms existing SFT and prompting baselines by a large margin. The key takeaways are that:
  - Combining SFT with RL (SFT+RL) is a highly effective strategy, where SFT provides a stable foundation and RL fine-tunes for correctness.
  - Explicit persona information is critical for personalization, helping the model make more accurate high-level decisions (like purchasing vs. terminating) and balancing its action predictions.
  - Step-wise rationales act as a valuable reasoning aid, improving the model's ability to generate precise actions.
Limitations & Future Work: The authors acknowledge several limitations:
- The model still shows some bias towards frequent actions, even with the reward weighting.
- The reward function is based solely on action correctness and does not capture higher-level concepts like user satisfaction, effort, or task success.
- Future work could focus on designing richer reward signals, developing more sophisticated methods for integrating persona information, and using more detailed context from the web page.
-
Personal Insights & Critique:
- Strength: This paper is a strong piece of research with a clear motivation, a well-designed methodology, and convincing experiments. The use of the
shuffled personaexperiment is particularly effective at proving that the model is genuinely personalizing its outputs. - Potential Weakness: The framework's performance relies heavily on the quality of two things: the
personadata and the augmentedrationales. TheOPeRAdataset is unique, but creating such rich persona profiles is labor-intensive and may not scale easily to millions of users. Furthermore, the use ofclaude-3.5-sonnetto generate missing rationales introduces a potential dependency and source of bias from the "teacher" model. - Open Questions: The difficulty-aware reward weights are hand-tuned. An interesting future direction could be to learn these weights automatically or use a more dynamic reward-shaping approach. It would also be valuable to see if
CUSTOMER-R1can generalize to other domains beyond e-commerce, such as travel booking, content browsing, or software usage. - Overall Impact: This work represents a significant step forward in building high-fidelity, personalized digital twins of users. It provides a strong blueprint for future research and has practical implications for creating more realistic simulators for A/B testing, user experience research, and personalized system design.
- Strength: This paper is a strong piece of research with a clear motivation, a well-designed methodology, and convincing experiments. The use of the