Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning
TL;DR Summary
Shop-R1 uses reinforcement learning with distinct rewards for rationale generation and action prediction to enhance LLMs' simulation of online shopping behavior, achieving over 65% performance improvement over baselines.
Abstract
Large Language Models (LLMs) have recently demonstrated strong potential in generating 'believable human-like' behavior in web environments. Prior work has explored augmenting training data with LLM-synthesized rationales and applying supervised fine-tuning (SFT) to enhance reasoning ability, which in turn can improve downstream action prediction. However, the performance of such approaches remains inherently bounded by the reasoning capabilities of the model used to generate the rationales. In this paper, we introduce Shop-R1, a novel reinforcement learning (RL) framework aimed at enhancing the reasoning ability of LLMs for simulation of real human behavior in online shopping environments. Specifically, Shop-R1 decomposes the human behavior simulation task into two stages: rationale generation and action prediction, each guided by distinct reward signals. For rationale generation, we leverage internal model signals (e.g., logit distributions) to guide the reasoning process in a self-supervised manner. For action prediction, we propose a hierarchical reward structure with difficulty-aware scaling to prevent reward hacking and enable fine-grained reward assignment. This design evaluates both high-level action types and the correctness of fine-grained sub-action details (attributes and values), rewarding outputs proportionally to their difficulty. Experimental results show that our method achieves a relative improvement of over 65% compared to the baseline.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: Shop-R1: Rewarding LLMs to Simulate Human Behavior in Online Shopping via Reinforcement Learning
- Authors: Yimeng Zhang, Tian Wang, Jiri Gesi, Ziyi Wang, Yuxuan Lu, Jiacheng Lin, Sinong Zhan, Vianne Gao, Ruochen Jiao, Junze Liu, Kun Qian, Yuxin Tang, Ran Xue, Houyu Zhang, Qingjun Cui, Yufan Guo, Dakuo Wang.
- Affiliations: The authors are affiliated with several prominent academic institutions (Michigan State University, Northeastern University, University of Illinois Urbana-Champaign, Northwestern University) and a major industry research lab (Store Foundation AI, Amazon). This mix of academic and industry expertise suggests a focus on both foundational research and practical application.
- Journal/Conference: The paper is currently available as a preprint on arXiv. An arXiv preprint is a preliminary version of a research paper that has not yet undergone formal peer review for publication in a conference or journal.
- Publication Year: The paper was submitted to arXiv on July 23, 2025.
- Abstract: The abstract introduces the problem of using Large Language Models (LLMs) to simulate human-like behavior on the web. It notes that previous methods, like Supervised Fine-Tuning (SFT) on LLM-generated rationales, are limited by the capabilities of the LLM that created the training data. To overcome this, the paper proposes Shop-R1, a reinforcement learning (RL) framework. Shop-R1 divides the simulation task into two stages: rationale generation and action prediction, each with its own reward signal. The rationale generation is guided by the model's own internal confidence signals (self-supervision), while action prediction uses a sophisticated hierarchical reward system. This system rewards both the general action type and the specific details, scaling the reward based on the action's difficulty to prevent "reward hacking." The authors report that their method achieves a relative improvement of over 65% compared to the baseline.
- Original Source Link:
- arXiv Page: https://arxiv.org/abs/2507.17842
- PDF Link: https://arxiv.org/pdf/2507.17842v1.pdf
2. Executive Summary
-
Background & Motivation (Why):
- Core Problem: Large Language Models (LLMs) are promising for simulating how humans interact with websites, but they often fail to produce actions that are truly representative of real user behavior.
- Existing Gaps: The most common approach is Supervised Fine-Tuning (SFT), where an LLM is trained on examples of user actions. To improve the model's reasoning, these examples are often augmented with "rationales" (explanations for each action), which are themselves generated by another powerful LLM (e.g., Claude 3.5 Sonnet). The critical flaw in this method is that the simulation model's performance is capped by the quality and diversity of the rationales produced by the initial data-generating LLM. It can't learn to reason better than its "teacher."
- Fresh Angle: This paper proposes moving beyond the limitations of SFT by using Reinforcement Learning (RL). RL allows a model (or "agent") to learn by trial and error, receiving "rewards" for good behaviors and "penalties" for bad ones. This is the first work to apply a sophisticated RL framework specifically to the task of simulating human behavior in online shopping, as opposed to just completing a task efficiently. The key innovation is a carefully designed reward system that guides the model to be more human-like.
-
Main Contributions / Findings (What):
- A Novel RL Framework for Behavior Simulation: The paper introduces Shop-R1, a framework that reframes human behavior simulation as a two-stage RL problem: (1) generate a rationale, and (2) predict an action.
- Hybrid Reward Design: Shop-R1 uses a multi-faceted reward system to guide the model:
  - Format Reward: A basic reward for producing output in the correct, machine-readable (JSON) format.
  - Self-Certainty Reward: For rationale generation, the model is rewarded based on its own internal confidence, avoiding the need for "ground-truth" human rationales, which are difficult to obtain.
  - Hierarchical Action Reward: For action prediction, the model gets partial credit. It is rewarded for getting the high-level action type right (e.g., `click`) and receives additional rewards for getting the fine-grained details correct (e.g., clicking the correct button).
  - Difficulty-Aware Reward Scaling (DARS): To prevent the model from cheating by only performing easy actions (a common RL problem called "reward hacking"), rewards for difficult actions (like typing a search query) are significantly amplified.
- State-of-the-Art Performance: Experiments show that Shop-R1 achieves an exact match accuracy of 27.72%, a relative improvement of over 65% compared to the 16.76% accuracy of the standard SFT baseline. This demonstrates the strong effectiveness of the proposed RL approach.
3. Prerequisite Knowledge & Related Work
-
Foundational Concepts:
- Large Language Models (LLMs): These are massive neural networks (like GPT-4 or Qwen) trained on vast amounts of text data. They are capable of understanding and generating human-like text, and can be adapted for tasks like planning, reasoning, and decision-making.
- Human Behavior Simulation: The goal is to create a computational model (an "agent") that can mimic the actions a real human would take in a given environment, such as browsing an e-commerce website. This is useful for testing new website designs, understanding user experience, and training recommender systems.
- Supervised Fine-Tuning (SFT): A common technique for adapting a pre-trained LLM to a specific task. The model is shown a dataset of correct input-output pairs (e.g., website state -> user action) and trained to replicate them.
- Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by interacting with an environment. The agent performs actions and receives rewards or penalties. Its goal is to learn a "policy"—a strategy for choosing actions—that maximizes its total cumulative reward over time.
- Rationale: An explanation or "chain of thought" that justifies why a particular action was chosen. In this paper, it's a piece of text like, "I am looking for a cheaper alternative, so I will click the 'Sort by Price' button."
- Reward Hacking: A common failure mode in RL where an agent discovers an unintended way to get high rewards without actually solving the problem as intended. For example, if an agent gets a small reward for any action, it might learn to spam a simple, low-effort action repeatedly.
- Kullback-Leibler (KL) Divergence: A statistical measure of how one probability distribution differs from a reference probability distribution. In this paper, it is used to measure the model's "self-certainty." A high KL divergence from a uniform (random) distribution means the model is highly confident in its prediction, as its probability is concentrated on a few specific tokens.
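As a quick sanity check on this intuition, the tiny Python example below (illustrative only) compares the KL-from-uniform value of a peaked distribution with that of a near-uniform one over a toy 4-token vocabulary:

```python
import math

def kl_from_uniform(p):
    """KL(p || U) for a discrete distribution p over len(p) outcomes."""
    v = len(p)
    return sum(pi * math.log(pi * v) for pi in p if pi > 0)

print(kl_from_uniform([0.97, 0.01, 0.01, 0.01]))  # ~1.22  (peaked -> "confident")
print(kl_from_uniform([0.28, 0.26, 0.24, 0.22]))  # ~0.004 (near-uniform -> "uncertain")
```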
-
Previous Works:
- Zero-Shot Prompting: The simplest approach, where an LLM is given instructions in a prompt to act like a user. The paper notes this lacks the personalization and accuracy needed for high-fidelity simulation.
- SFT with Synthetic Rationales (Lu et al., 2025): This is the main baseline. A powerful LLM like Claude 3.5 Sonnet is used to generate rationales for a dataset of human actions. A smaller model is then fine-tuned on these (context, action, rationale) triplets. The key limitation, which this paper addresses, is that the fine-tuned model's quality is inherently bounded by the rationale-generating LLM.
- RL for Task Completion (WebArena, etc.): Previous work has used RL to train web agents, but the goal was typically to complete a task (e.g., book a flight), and the reward was given for task success. This paper's focus is different: it aims to simulate the process of human behavior, not just achieve an outcome.
- Reward Design Paradigms: The paper situates its work in the context of modern RL reward design.
  - RLHF (Reinforcement Learning from Human Feedback): Uses human preferences (e.g., "response A is better than B") to train a reward model, which then guides the LLM. It is powerful but expensive and hard to scale.
  - DPO (Direct Preference Optimization): A more direct and computationally lighter alternative to RLHF that still relies on preference data.
  - RLVR (Reinforcement Learning with Verifiable Rewards): Uses automated, rule-based verifiers for reward calculation in domains with clear correctness criteria (like math or code). This is precise but not applicable to subjective tasks.
-
Differentiation: Shop-R1 stands out by being the first to introduce RL to the task of simulation-oriented human behavior modeling in this context. Its main innovation is the hybrid reward framework, which combines automated, internal signals (self-certainty) with a carefully structured, task-specific reward scheme (the hierarchical action reward). This approach avoids the high cost of human feedback (as in RLHF) and the rigidity of simple rule-based rewards (as in RLVR), creating a system specifically tailored to the nuances of human web navigation.
4. Methodology (Core Technology & Implementation)
The core of the paper is the Shop-R1 framework, which uses reinforcement learning to train an LLM to simulate human shopping behavior.
- Problem Statement: The task is to simulate a user session on a shopping website. A session consists of a sequence of actions $(a_1, \dots, a_T)$ and their corresponding rationales $(r_1, \dots, r_T)$. At each step $t$, the model observes the website's state $c_t$ (represented as simplified HTML) and must predict the next rationale $r_t$ and action $a_t$, given the history of all previous contexts, actions, and rationales. The goal is to learn a function $f$ such that:

$$ f(c_{1 \dots t}, \; a_{1 \dots t-1}, \; r_{1 \dots t-1}) = (r_t, a_t) $$

Here:
* $c_{1 \dots t}$: The history of website observations (simplified HTML pages) up to the current step $t$.
* $a_{1 \dots t-1}$: The history of actions taken by the user.
* $r_{1 \dots t-1}$: The history of rationales for those actions.
* $(r_t, a_t)$: The predicted rationale and action for the current step.

The **action space** includes three main types: `type_and_submit` (e.g., entering text into a search bar), `click` (e.g., clicking a button or link), and `terminate` (ending the session).
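To make the input/output contract concrete, here is a small illustrative sketch of what one predicted step could look like as JSON. The `rationale`/`action` keys and the three action types come from the paper; the exact field names inside `action` (`type`, `name`, `text`) are assumptions for illustration.

```python
import json

# Hypothetical example of a single predicted step; nested field names are
# illustrative assumptions, not the paper's exact schema.
predicted_step = {
    "rationale": "The user seems to want running shoes, so the most likely next "
                 "move is to search for them directly.",
    "action": {
        "type": "type_and_submit",              # click | type_and_submit | terminate
        "name": "search-box",                   # target element (attribute)
        "text": "men's running shoes size 10",  # text typed before submitting
    },
}

print(json.dumps(predicted_step, indent=2))
```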
-
Steps & Procedures: The training pipeline consists of two main phases:
-
Cold Start with SFT: Before applying RL, the model is first trained using Supervised Fine-Tuning (SFT). It is trained on a dataset of real user trajectories where rationales have been synthetically generated by a powerful LLM (Claude 3.5 Sonnet). This initial phase helps the model learn the basic structure of the task, such as the format of a valid action and the general relationship between context, rationale, and action. The SFT objective is to maximize the log-likelihood of the ground-truth rationale-action pair, $\max_\theta \, \mathbb{E}\left[\log \pi_\theta(r_t, a_t \mid x)\right]$, where $x$ represents the entire input history $(c_{1 \dots t}, a_{1 \dots t-1}, r_{1 \dots t-1})$.
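For readers who want to see what this warm-up phase looks like in code, below is a minimal sketch using Hugging Face Transformers: cross-entropy is computed only over the target (rationale, action) tokens, with the context masked out of the labels. The prompt construction and the absence of a full training loop are simplifications, not the authors' exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Qwen-2.5-3B-Instruct is the base model used in the paper's experiments.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

def sft_loss(context: str, target_json: str) -> torch.Tensor:
    """Negative log-likelihood of the ground-truth (rationale, action) JSON
    given the session context. A real run would batch, pack, and optimize this
    with a Trainer; this is only a sketch of the objective."""
    prompt_ids = tokenizer(context, return_tensors="pt").input_ids
    target_ids = tokenizer(target_json, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.size(1)] = -100  # supervise only the completion tokens
    return model(input_ids=input_ids, labels=labels).loss
```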
-
Shop-R1 Reinforcement Learning: After the SFT "warm-up," the model is further trained using RL with a custom, multi-part reward function designed to encourage human-like behavior. The overall RL objective is to find a policy that maximizes the expected total reward while staying close to a reference policy (typically the SFT model) to maintain stability:

$$ \max_{\pi_\theta} \; \mathbb{E}\left[ v(a) + \alpha \, s(r) \right] \; - \; \beta \, D_{\mathrm{KL}}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right) $$

Let's break down each component:
- $\pi_\theta$: The policy (the LLM) being trained; $\pi_{\mathrm{ref}}$ is the frozen reference (SFT) policy.
- $v(a)$: The action reward, which is a hierarchical score for the predicted action $a$.
- $s(r)$: The rationale reward, based on the model's self-certainty in generating rationale $r$.
- $\alpha, \beta$: Hyperparameters that balance the importance of the rationale reward and the KL regularization term.
- $D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$: The KL divergence term, which penalizes the policy for straying too far from the initial SFT model, preventing catastrophic forgetting and stabilizing training.
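As a concrete (and deliberately simplified) illustration of this objective, the sketch below estimates its value for a batch of sampled rationale-action pairs. The coefficient values and the per-sample log-probability approximation of the KL term are assumptions, and an actual RL algorithm (e.g., PPO or GRPO) would optimize a surrogate loss rather than this quantity directly.

```python
import torch

def objective_value(action_reward: torch.Tensor, self_certainty: torch.Tensor,
                    logp_policy: torch.Tensor, logp_ref: torch.Tensor,
                    alpha: float = 0.01, beta: float = 0.001) -> torch.Tensor:
    """Monte-Carlo estimate of the KL-regularized objective for one batch.

    All inputs are per-sample tensors of equal shape; alpha/beta are
    illustrative values, not the paper's.
    """
    reward = action_reward + alpha * self_certainty   # v(a) + alpha * s(r)
    kl_estimate = logp_policy - logp_ref              # rough per-sample KL(pi_theta || pi_ref)
    return (reward - beta * kl_estimate).mean()
```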
-
-
Mathematical Formulas & Key Details (The Reward Function): The innovation of Shop-R1 lies in its detailed reward function, which is a sum of several components.

- Format Reward: A simple binary reward. If the model's output is a valid JSON object with the expected `rationale` and `action` keys, it gets a fixed score (e.g., 0.5). If not, the reward is 0. This encourages the model to produce parsable outputs.
- Rationale Reward (Self-Certainty Score): Instead of requiring ground-truth rationales (which are often unavailable or incomplete), the model is rewarded for being confident in its generated rationale. This confidence is measured as the average KL divergence between the model's output probability distribution and a uniform distribution:

  $$ s(r) = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{|V|} p_{i,j} \, \log \frac{p_{i,j}}{1/|V|} $$

  - $N$: Number of tokens in the generated rationale $r$.
  - $|V|$: Size of the model's vocabulary.
  - $p_{i,j}$: The model's predicted probability for token $j$ at position $i$.
  - $1/|V|$: The probability of a token under a uniform distribution. A higher score indicates that the model's predictions are "peaked" and non-random, reflecting higher certainty.
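Below is a minimal PyTorch sketch of this self-certainty score, reading the description above as the token-averaged KL(p ‖ Uniform); the authors' actual implementation may differ in details such as the KL direction or normalization.

```python
import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Average KL divergence between each token's predictive distribution and
    a uniform distribution over the vocabulary.

    logits: tensor of shape [num_rationale_tokens, vocab_size], taken at the
    positions of the generated rationale.
    """
    log_probs = F.log_softmax(logits, dim=-1)      # log p_{i,j}
    probs = log_probs.exp()
    vocab_size = logits.size(-1)
    # KL(p_i || U) = sum_j p_{i,j} * (log p_{i,j} - log(1/|V|))
    kl_per_token = (probs * (log_probs + math.log(vocab_size))).sum(dim=-1)
    return kl_per_token.mean()                     # average over the N rationale tokens
```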
-
Hierarchical Action Reward with DARS: This is the most complex part, designed to provide fine-grained feedback and prevent reward hacking. The structure is detailed in Table 1.

Manual transcription of Table 1: Hierarchical reward schedule with Difficulty-Aware Reward Scaling (DARS).

| Action Type | Type Reward | Sub-action Attribute Reward | Text-Similarity Value Reward |
|---|---|---|---|
| `terminate` | 0.3 | None | None |
| `click` | 0.3 | +0.2 (if `name` matches) | +DARS × ROUGE-L(`name`) |
| `type_and_submit` | 0.3 | +0.1 (if `name` matches), +0.1 (if `text` is non-empty) | +0.1 × ROUGE-L(`name`), +DARS × ROUGE-L(`text`) |
Explanation of Table 1:

- Base Reward: Every correctly predicted action type (`terminate`, `click`, `type_and_submit`) gets a base reward of 0.3.
- Hierarchical Credit: More complex actions can earn additional rewards.
  - For a `click` action, the model gets +0.2 if it correctly identifies the attribute to act on (e.g., the `name` of a button).
  - For a `type_and_submit` action, it gets partial credit for identifying the input field (`name`) and for generating the text (`text`).
- Text-Similarity Reward: For actions involving text (like a button's name or a search query), the reward is proportional to the ROUGE-L similarity between the predicted text and the ground truth.
- Difficulty-Aware Reward Scaling (DARS): The reward for correctly predicting long-text values (the hardest part) is multiplied by a large DARS factor (e.g., 1000). This makes it much more profitable for the agent to correctly perform difficult `click` or `type_and_submit` actions than to simply spam the easy `terminate` action. This directly combats reward hacking.
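Putting the schedule of Table 1 into code, the sketch below shows one way the hierarchical action reward with DARS could be computed. The `rouge-score` package, the default DARS value of 1000, and the reading of the attribute conditions ("name matches exactly", "text is non-empty") are assumptions for illustration rather than the authors' exact implementation.

```python
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def rouge_l(pred: str, ref: str) -> float:
    """ROUGE-L F-measure between two strings (0.0 if either is empty)."""
    if not pred or not ref:
        return 0.0
    return _scorer.score(ref, pred)["rougeL"].fmeasure

def action_reward(pred: dict, gt: dict, dars: float = 1000.0) -> float:
    """Hierarchical action reward following Table 1 (sketch, not official code)."""
    if pred.get("type") != gt.get("type"):
        return 0.0
    reward = 0.3                                        # base reward for the correct type
    if gt["type"] == "click":
        if pred.get("name") == gt.get("name"):
            reward += 0.2                               # correct target attribute
        reward += dars * rouge_l(pred.get("name", ""), gt.get("name", ""))
    elif gt["type"] == "type_and_submit":
        if pred.get("name") == gt.get("name"):
            reward += 0.1                               # correct input field
        if pred.get("text"):
            reward += 0.1                               # produced some query text
        reward += 0.1 * rouge_l(pred.get("name", ""), gt.get("name", ""))
        reward += dars * rouge_l(pred.get("text", ""), gt.get("text", ""))
    # "terminate" earns only the base reward
    return reward
```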
Figure: A schematic of the Shop-R1 reinforcement learning framework, showing how the model generates an action rationale and predicts the next user action from browser observations and the interaction history, and how multi-dimensional rewards (format, rationale, action type, and sub-action accuracy) provide the feedback used for optimization.
-
-
5. Experimental Setup
-
Datasets: The study used a proprietary corpus of 52,137 real-world shopping sessions from a major e-commerce platform.
- The raw data consists of multi-turn interactions (clickstreams).
- Rationales were not present in the original data; they were synthetically generated for each action using Claude 3.5 Sonnet.
- The observation context (website state) was represented as simplified HTML, which removes irrelevant code like scripts and styles but preserves the core structure and content.
- For the RL dataset, the class distribution was deliberately rebalanced to provide more training examples for the harder `click` and `type_and_submit` actions.
-
Evaluation Metrics:
- Exact Match Accuracy:
  - Conceptual Definition: This is the strictest metric. A prediction is counted as correct only if all parts of the action perfectly match the ground truth. For a `click` action, both the action type and the exact name of the clicked element must match. For `type_and_submit`, the type, target field, and meaning of the text must match.
  - Mathematical Formula: It is a simple accuracy calculation:

    $$ \text{Exact Match Acc.} = \frac{\#\{\text{predictions matching the ground truth in every field}\}}{\#\{\text{all predictions}\}} $$

- Action Type Accuracy:
  - Conceptual Definition: A more lenient metric that only checks whether the high-level action type (`click`, `type_and_submit`, or `terminate`) is correct, ignoring finer-grained details like the element name or text content.
  - Mathematical Formula:

    $$ \text{Action Type Acc.} = \frac{\#\{\text{predictions with the correct action type}\}}{\#\{\text{all predictions}\}} $$

- F1 Score:
  - Conceptual Definition: The harmonic mean of precision and recall. It is particularly useful for evaluating classification tasks where the classes may be imbalanced. A high F1 score indicates that the model has both low false positives (high precision) and low false negatives (high recall). It is reported for the action type classification.
  - Mathematical Formula:

    $$ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$

  - Symbol Explanation:
    - $\text{Precision} = \frac{TP}{TP + FP}$ (of all the times the model predicted a class, how often was it right?)
    - $\text{Recall} = \frac{TP}{TP + FN}$ (of all the actual instances of a class, how many did the model find?)
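Below is a small sketch of how these three metrics could be computed for a batch of predictions; treating exact match as a strict field-by-field comparison and F1 as macro-averaged over the three action types are simplifying assumptions relative to the paper's semantic matching of long text.

```python
from sklearn.metrics import f1_score

def evaluate(preds: list[dict], gts: list[dict]) -> dict:
    """Exact match accuracy, action type accuracy, and macro F1 over action types."""
    n = len(gts)
    exact = sum(p == g for p, g in zip(preds, gts)) / n          # all fields identical
    pred_types = [p.get("type", "") for p in preds]
    gt_types = [g.get("type", "") for g in gts]
    type_acc = sum(p == g for p, g in zip(pred_types, gt_types)) / n
    macro_f1 = f1_score(gt_types, pred_types, average="macro")
    return {"exact_match_acc": exact, "action_type_acc": type_acc, "f1": macro_f1}
```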
-
Baselines:
- Zero-shot prompting: An off-the-shelf Qwen-2.5-3B-Instruct model given only instructions.
- RL (Binary): A model trained from scratch with RL using only a simple binary reward (1 for exact match, 0 otherwise).
- SFT: The standard approach of supervised fine-tuning on the dataset with LLM-generated rationales.
- SFT + RL (Binary): A model first trained with SFT, then further trained with RL using the simple binary reward.
- Shop-R1 (Ours): The full proposed framework (SFT warm-up + RL with the hybrid reward system).
6. Results & Analysis
-
Core Results: The main results demonstrate the superiority of the Shop-R1 framework.

Manual transcription of Table 2: Simulation accuracy under different fine-tuning methods.

| Model | Settings | Exact Action Acc. | Action Type Acc. | Action Type F1 |
|---|---|---|---|---|
| Qwen-2.5-3B-Instruct | Zero-shot prompting | 0.32% | 15.33% | 16.15% |
| Qwen-2.5-3B-Instruct | RL (Binary) | 1.01% | 6.17% | 9.92% |
| Qwen-2.5-3B-Instruct | SFT | 16.76% | 22.25% | 24.52% |
| Qwen-2.5-3B-Instruct | SFT + RL (Binary) | 16.55% | 23.74% | 28.07% |
| Qwen-2.5-3B-Instruct | Shop-R1 (Ours) | 27.72% | 36.40% | 31.28% |
| Qwen-2.5-1.5B-Instruct | Zero-shot prompting | 0.53% | 3.94% | 6.16% |
| Qwen-2.5-1.5B-Instruct | SFT | 10.86% | 23.58% | 29.02% |
| Qwen-2.5-1.5B-Instruct | Shop-R1 (Ours) | 24.11% | 34.54% | 29.19% |
| Qwen-2.5-0.5B-Instruct | Zero-shot prompting | 6.76% | 12.88% | 15.55% |
| Qwen-2.5-0.5B-Instruct | SFT | 9.90% | 17.72% | 21.61% |
| Qwen-2.5-0.5B-Instruct | Shop-R1 (Ours) | 27.72% | 31.83% | 21.20% |

- Analysis of Table 2:
-
  - Zero-shot and RL (Binary) from scratch perform very poorly, showing the task is too complex to be solved without dense supervision.
  - SFT provides a strong baseline (16.76% exact match), confirming that learning from demonstrations is effective.
  - SFT + RL (Binary) offers no significant improvement and even slightly degrades exact match accuracy. This shows that a simple, sparse reward signal is insufficient to guide the model beyond what SFT teaches it.
  - Shop-R1 dramatically outperforms all baselines, achieving 27.72% exact match accuracy, a 65.4% relative improvement over the SFT baseline, proving the effectiveness of the hybrid, hierarchical reward structure.

Manual transcription of Table 3: Accuracy per action type.

| Model | Settings | Exact Acc.: click | Exact Acc.: type_and_submit | Exact Acc.: terminate | Action Type Acc.: click | Action Type Acc.: type_and_submit | Action Type Acc.: terminate |
|---|---|---|---|---|---|---|---|
| Qwen-2.5-3B-Instruct | Zero-shot prompting | 0.58% | 0.15% | 0.00% | 38.7% | 1.62% | 0.00% |
| Qwen-2.5-3B-Instruct | SFT | 4.93% | 3.84% | 49.80% | 8.55% | 15.36% | 49.80% |
| Qwen-2.5-3B-Instruct | SFT + RL (Binary) | 8.12% | 3.25% | 45.51% | 17.25% | 13.88% | 45.51% |
| Qwen-2.5-3B-Instruct | Shop-R1 (Ours) | 7.39% | 7.53% | 81.84% | 10.29% | 28.66% | 81.84% |
-
(Note: The table is partially transcribed for brevity to focus on the 3B model, which is the main model discussed.)
- Analysis of Table 3: Shop-R1 significantly improves performance on all action types, especially the difficult type_and_submit (from 3.84% to 7.53% exact match) and the simple terminate (from 49.80% to 81.84%). This shows the reward structure successfully encourages both complex and simple actions when appropriate, avoiding reward hacking.
-
Ablations / Parameter Sensitivity:
-
Model Size: Shop-R1 provides significant gains across all model sizes (0.5B, 1.5B, 3B). However, larger models are better at handling complex, high-entropy actions (`click`, `type_and_submit`), while the smallest model (0.5B) achieves high overall accuracy mainly by mastering the simple `terminate` action. This suggests that model scale is still important for behavioral diversity.
Sampling Temperature:

Figure 2 (chart): the effect of varying the sampling temperature τ on action type accuracy, action type F1, and exact match accuracy, with temperature on the x-axis and the percentage score on the y-axis.

The analysis of Figure 2 shows that Shop-R1 is robust to changes in sampling temperature. Action Type Accuracy is stable across temperatures. The F1 Score decreases as temperature rises, suggesting increased confusion between classes. Exact Match Accuracy peaks at a moderate temperature: a small amount of randomness helps the model generate correct long-text arguments that greedy decoding might miss, but too much randomness becomes harmful. The optimal range is found to be 0.6-0.8.
-
Training Components: Manual transcription of Table 4: Ablation study on training components.

| Model | SFT | Format Reward | Rationale Reward | Reward Scale | Action Reward | Exact Acc. | Action Type Acc. | Action Type F1 |
|---|---|---|---|---|---|---|---|---|
| Qwen-2.5-3B-Instruct | ❌ | ✔️ | ✔️ | ✔️ | hierarchical | 4.63% | 36.56% | 21.92% |
| Qwen-2.5-3B-Instruct | ✔️ | ❌ | ✔️ | ✔️ | hierarchical | 2.87% | 3.19% | 5.04% |
| Qwen-2.5-3B-Instruct | ✔️ | ✔️ | ❌ | ✔️ | hierarchical | 26.93% | 37.25% | 33.74% |
| Qwen-2.5-3B-Instruct | ✔️ | ✔️ | ✔️ | ❌ | hierarchical | 27.83% | 27.20% | 11.70% |
| Qwen-2.5-3B-Instruct | ✔️ | ✔️ | ✔️ | ✔️ | binary | 27.41% | 27.46% | 12.11% |
| Qwen-2.5-3B-Instruct | ✔️ | ✔️ | ✔️ | ✔️ | hierarchical | 27.72% | 36.40% | 31.28% |
Analysis of Table 4: This is a crucial analysis showing that every component is necessary.
- Removing SFT is catastrophic (4.63% accuracy), showing the model needs a supervised warm-up.
- Removing the Format Reward is even worse (2.87% accuracy), as the model can't get any reward if its output is unparsable.
- Removing the Rationale Reward causes a small drop in exact match, suggesting it helps refine the long-text predictions.
- Removing Reward Scaling (DARS) or using a simple binary reward leads to reward hacking. The model achieves high exact accuracy but a very low F1 score (11-12%), indicating it over-predicts the easy `terminate` action.
-
Context: Manual transcription of Table 5: Comparison of whole-session vs. latest-step context.
| Settings | Exact Action Acc. | Action Type Acc. | Action Type F1 |
|---|---|---|---|
| whole-session | 27.72% | 36.40% | 31.28% |
| latest-step | 14.74% | 30.46% | 33.48% |
Analysis of Table 5: Using the whole session history (including simplified HTML from all visited pages) is critical. Removing it cuts exact match accuracy nearly in half (from 27.72% to 14.74%). While the model can still guess the action type reasonably well from the dialogue trace, it cannot generate the correct fine-grained details (like a button name) without seeing the actual page context.
-
7. Conclusion & Reflections
-
Conclusion Summary: The paper successfully introduces Shop-R1, a novel RL framework for simulating human behavior in online shopping. By decomposing the task into rationale generation and action prediction and using a sophisticated hybrid reward system, Shop-R1 overcomes the limitations of SFT-based approaches. The framework's components (self-certainty for rationales, hierarchical rewards for actions, format rewards, and difficulty-aware scaling) are all shown to be crucial for achieving high-fidelity simulation, preventing reward hacking, and significantly outperforming existing baselines.
Limitations & Future Work:
- The authors' work paves the way for more realistic virtual user modeling. Future work could involve expanding the action space, incorporating more complex user personas, or applying the framework to other interactive environments beyond e-commerce.
- The reliance on a proprietary dataset means the results are not directly reproducible by the wider research community.
- The initial SFT phase still depends on rationales generated by an external, powerful LLM. While the RL phase improves upon this, the "cold start" quality is still tied to this external model.
-
Personal Insights & Critique:
- Strengths: The paper's primary strength is its rigorous methodology and comprehensive ablation study. The authors systematically prove the value of each component of their proposed Shop-R1 framework. The design of the hierarchical, difficulty-aware reward is particularly clever and directly addresses a classic problem in RL (reward hacking).
- Critique: The use of a proprietary dataset is a significant limitation for reproducibility. Releasing the dataset, or a public-data-based equivalent, would greatly increase the paper's impact. Additionally, while "self-certainty" is a clever proxy for rationale quality, it is an imperfect one; a highly confident model can still be consistently wrong.
- Transferability: The core principles of Shop-R1 are highly transferable. The idea of a two-stage (rationale + action) process with a hybrid reward system could be applied to simulate human behavior in many other domains, such as using complex software (e.g., Photoshop), interacting on social media platforms, or even debugging code. The framework provides a strong template for training agents to mimic nuanced human processes, not just achieve simple goals.