PAARS: Persona Aligned Agentic Retail Shoppers
TL;DR Summary
The PAARS framework creates persona-driven retail agents equipped with shopping tools and aligns their group-level behavior distributions with those of real humans, improving simulation accuracy and enabling applications such as automated A/B testing.
Abstract
In e-commerce, behavioral data is collected for decision making which can be costly and slow. Simulation with LLM powered agents is emerging as a promising alternative for representing human population behavior. However, LLMs are known to exhibit certain biases, such as brand bias, review rating bias and limited representation of certain groups in the population, hence they need to be carefully benchmarked and aligned to user behavior. Ultimately, our goal is to synthesise an agent population and verify that it collectively approximates a real sample of humans. To this end, we propose a framework that: (i) creates synthetic shopping agents by automatically mining personas from anonymised historical shopping data, (ii) equips agents with retail-specific tools to synthesise shopping sessions and (iii) introduces a novel alignment suite measuring distributional differences between humans and shopping agents at the group (i.e. population) level rather than the traditional "individual" level. Experimental results demonstrate that using personas improves performance on the alignment suite, though a gap remains to human behaviour. We showcase an initial application of our framework for automated agentic A/B testing and compare the findings to human results. Finally, we discuss applications, limitations and challenges setting the stage for impactful future work.
In-depth Reading
Bibliographic Information
- Title: PAARS: Persona Aligned Agentic Retail Shoppers
- Authors: Saab Mansour, Leonardo Perelli, Lorenzo Mainetti, George Davidson, Stefano D'Amato
- Affiliations: The authors are all affiliated with Amazon.
- Journal/Conference: This paper was submitted to arXiv, a popular preprint server. This means it has not yet undergone formal peer review for publication in a conference or journal.
- Publication Year: The paper was submitted to arXiv with a listed publication date of 2025-03-31.
- Abstract: The authors address the high cost and slow pace of collecting behavioral data in e-commerce. They propose simulation using Large Language Model (LLM) powered agents as a faster, more scalable alternative. Acknowledging that LLMs have inherent biases, they introduce PAARS, a framework designed to create and align synthetic shopping agents with real human behavior. The framework (i) automatically mines "personas" from anonymized historical shopping data, (ii) gives agents retail-specific tools (like search, view, purchase) to simulate shopping sessions, and (iii) introduces a novel alignment suite that measures how well the distribution of agent behavior matches the distribution of human behavior at a population level (group alignment), rather than just matching individual actions (individual alignment). Experiments show that using these mined personas improves alignment, although a performance gap to real human behavior remains. The paper also demonstrates a preliminary application for automated A/B testing and discusses the broader potential and limitations of the approach.
- Original Source Link:
- arXiv Page: https://arxiv.org/abs/2503.24228
- PDF Link: https://arxiv.org/pdf/2503.24228v1.pdf
Executive Summary
Background & Motivation (Why)
In e-commerce, companies like Amazon make critical business decisions by analyzing customer behavior, often through live experiments like A/B testing. However, collecting this data is expensive, slow, and resource-intensive. A promising alternative is to simulate human behavior using intelligent agents powered by Large Language Models (LLMs). These agents can "act" like shoppers, generating vast amounts of behavioral data quickly and cheaply.
The primary challenge is ensuring these simulations are realistic. LLMs are not perfect mirrors of human populations; they have known biases, such as favoring popular brands (brand bias), being overly influenced by positive reviews (rating bias), and poorly representing minority groups. If the simulated behavior doesn't accurately reflect real human behavior, any decisions based on it will be flawed.
This paper tackles this challenge by proposing a framework to create a population of synthetic shopping agents that collectively approximates the behavior of a real human population. The innovation lies in moving beyond generic LLM prompts to create specific, data-driven "personas" for each agent.
Main Contributions / Findings (What)
The paper introduces PAARS (Persona Aligned Agentic Retail Shoppers), a framework with three core contributions:
- Persona-Powered Simulation Framework: Instead of using a generic LLM, PAARS first mines personas from real, anonymized historical shopping data. These personas, which include demographic profiles and shopping preferences, are then used to instruct LLM agents equipped with a set of retail-specific tools (e.g., search, view, purchase) to simulate entire shopping sessions.
- Novel Group-Level Alignment Suite: The paper argues that for applications like A/B testing, it is more important that the overall distribution of agent behavior matches the human population than that each agent perfectly mimics one specific human. They formalize this concept as group alignment and propose metrics to measure it, primarily using Kullback-Leibler (KL) divergence to compare the statistical distributions of agent and human actions.
- Experimental Validation and Application: Experiments demonstrate that agents equipped with the mined personas (+ Persona) consistently outperform generic agents (Base) on the alignment suite, showing behavior that is statistically closer to real humans. The paper also presents a proof-of-concept application by using PAARS to simulate A/B tests, finding that the agent-based simulations correctly predicted the directional outcome of 2 out of 3 real historical tests.
Prerequisite Knowledge & Related Work
Foundational Concepts
- Large Language Models (LLMs): These are massive neural networks (like GPT-4 or Anthropic's Claude) trained on vast amounts of text and code. They can understand and generate human-like text, making them suitable for powering conversational agents.
- Agentic Framework: In this context, an "agent" is more than just an LLM. It is an LLM given a specific goal, a "persona" or role to play, and a set of "tools" it can use to interact with an environment. For example, a shopping agent might be given the persona of a "price-sensitive student" and tools to search for products or view their details on a simulated website.
- Persona: A persona is a detailed, synthetic profile of a user. In this paper, it is not just a simple instruction but a rich description mined from real data, including a consumer profile (age, interests), shopping preferences (price sensitivity, brand loyalty), and the user's actual shopping history.
- A/B Testing: A common method for online experimentation. Users are randomly split into two groups: a control group (A) sees the existing version of a website, and a treatment group (B) sees a new version. By comparing metrics (e.g., sales, clicks) between the groups, companies can determine if the change had a positive, negative, or no effect.
- Algorithmic Fidelity: A term coined by Argyle et al. (2023), it refers to the ability of an LLM to accurately emulate the characteristics and responses of diverse human subgroups when prompted appropriately. This paper aims to achieve high algorithmic fidelity for retail shoppers.
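The A/B testing comparison described above can be made concrete with a toy two-proportion z-test on conversion rates; the traffic and conversion numbers below are invented for illustration and are not from the paper:

```python
import math

def ab_test_z(conversions_a, n_a, conversions_b, n_b):
    """Two-proportion z-test: return (lift, z) for treatment (B) vs. control (A).

    lift is the absolute difference in conversion rate; |z| > 1.96 would
    indicate a statistically significant difference at the 5% level.
    """
    p_a = conversions_a / n_a
    p_b = conversions_b / n_b
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)  # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return p_b - p_a, (p_b - p_a) / se

# Made-up example: 10,000 users per arm, slightly higher conversion in B.
lift, z = ab_test_z(conversions_a=480, n_a=10_000, conversions_b=540, n_b=10_000)
```

A simulated A/B test with agent populations would compare the same kind of aggregate metric between the two simulated arms.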
Previous Works
The paper builds upon several lines of research in agent-based simulation and LLM evaluation:
- Social and Psychological Simulation: Prior work has used LLMs to simulate human behavior in social contexts.
  - Argyle et al. (2023) showed that GPT-3 could simulate a wide variety of human political subgroups with high accuracy, establishing the concept of algorithmic fidelity.
  - Park et al. (2023) created "Generative Agents," an interactive sandbox where LLM-powered agents exhibited emergent social behaviors like spreading information.
  - Aher et al. (2023) replicated human subject studies using LLM agents, but also found that simulations could be distorted in certain domains.
- Multi-Agent Collaboration: Some research uses agents with distinct roles to solve complex tasks.
  - Hong et al. (2024) (MetaGPT) and Qian et al. (2024) (ChatDev) assigned roles like "project manager" and "engineer" to agents to improve software development performance. This demonstrates the power of role-playing, which is analogous to using personas.
- LLM Bias and Alignment in Recommendation: Researchers have identified and tried to mitigate biases in LLMs for tasks like product recommendation.
  - Kamruzzaman et al. (2024) highlighted brand bias in LLMs.
  - Yoon et al. (2024) noted a positive rating bias and proposed a behavioral alignment suite for movie recommendation, but focused on individual alignment (predicting a specific user's next action).
Differentiation
PAARS distinguishes itself from prior work in several key ways:
- Domain Focus: While most agent simulation has focused on social science or collaborative tasks, PAARS is one of the first comprehensive frameworks specifically designed for the e-commerce and retail domain.
- Data-Driven Persona Mining: Unlike approaches that use hand-crafted or simple role-playing prompts, PAARS induces rich personas directly from real, anonymized shopping histories. This makes the agents more grounded in actual consumer behavior.
- Introduction of Group Alignment: The most significant conceptual contribution is the formalization of group alignment. Previous work like Yoon et al. (2024) focused on individual alignment (e.g., "Can the agent predict the exact movie this user will watch next?"). PAARS argues that for population-level analysis like A/B testing, it is more critical to ask, "Does the overall distribution of movies watched by my agent population match the distribution of movies watched by the human population?" This is a more tractable and often more useful goal.
Methodology (Core Technology & Implementation Details)
The PAARS framework, as shown in Figure 1, consists of three main stages: Persona Mining, Session Generation with tools, and Alignment Evaluation.
(Figure 1: Schematic of the PAARS framework. Agent populations are generated from human shoppers' behavioral data, behavioral consistency is evaluated at the individual and group levels via the alignment suite, and the framework supports potential applications such as automated A/B testing and personalized recommendation.)
1. Persona Mining
The core idea is to create a detailed, synthetic persona for an LLM agent to embody. This is done in a two-step prompting process using an LLM (in this case, Anthropic Claude Sonnet 3.0).
- Step 1: Consumer Profile Generation:
  - Input: The LLM receives an anonymized shopping history for a single customer. This history includes search queries, viewed items, and purchased items over the last 6 months.
  - Task: The LLM is prompted to synthesize a generic consumer profile based on this history. The profile includes fields like age range, marital status, income, and interests. The LLM is also asked to provide reasoning for each field (e.g., "Age Group: 30-45 - Reason: Interest in solo travel and gear").
- Step 2: Shopping Preferences Inference:
  - Input: The LLM is prompted again, this time with both the original shopping history and the newly generated consumer profile.
  - Task: The LLM infers the user's shopping preferences, such as price sensitivity, brand loyalty, and reliance on customer reviews.

The final persona given to the agent is a combination of the synthetic consumer profile, the inferred shopping preferences, and the original shopping history.
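The two-step prompting process above can be sketched as follows. The prompt wording and the `llm` callable (text in, text out) are illustrative assumptions, not the paper's actual prompts; only the two-step structure follows the text:

```python
# Hypothetical sketch of the two-step persona-mining pipeline.
# `llm` is any callable taking a prompt string and returning a completion.

PROFILE_PROMPT = (
    "Given this anonymized shopping history, infer a consumer profile "
    "(age range, marital status, income, interests), with a reason per field:\n{history}"
)
PREFS_PROMPT = (
    "Given the shopping history and consumer profile below, infer shopping "
    "preferences (price sensitivity, brand loyalty, reliance on reviews):\n"
    "History:\n{history}\nProfile:\n{profile}"
)

def mine_persona(history: str, llm) -> dict:
    """Step 1: history -> consumer profile.
    Step 2: history + profile -> shopping preferences.
    The final persona bundles profile, preferences, and the raw history."""
    profile = llm(PROFILE_PROMPT.format(history=history))
    preferences = llm(PREFS_PROMPT.format(history=history, profile=profile))
    return {"profile": profile, "preferences": preferences, "history": history}
```

The key design point is that step 2 conditions on the output of step 1, so the inferred preferences stay consistent with the synthesized profile.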
2. Alignment Evaluation Suite
The paper proposes a novel evaluation suite to measure how well the agent population mimics the human population at both an individual and a group level.
Individual vs. Group Alignment
This is a key distinction. Let $H = \{h_1, \dots, h_N\}$ be a population of human shoppers and $A = \{a_1, \dots, a_N\}$ be a population of agents designed to mimic them. For a given task $T$, let $O_H = \{o_{h_1}, \dots, o_{h_N}\}$ and $O_A = \{o_{a_1}, \dots, o_{a_N}\}$ be the sets of outputs from humans and agents, respectively.
- Individual Alignment: Measures how well each agent $a_i$ replicates the specific actions of its human counterpart $h_i$. A general formula is:
  $$\text{IndAlign}(A, H) = g\big(\{f(o_{a_i}, o_{h_i})\}_{i=1}^{N}\big)$$
  - $f$: a function that compares the output of a single agent ($o_{a_i}$) to that of its human counterpart ($o_{h_i}$), for example an equality check.
  - $g$: an aggregation function, like the average, that combines the comparison results across the whole population (e.g., overall accuracy).
- Group Alignment: Measures whether the overall distribution of outputs from the agent population, $P_{O_A}$, is similar to the distribution from the human population, $P_{O_H}$. It does not require a one-to-one mapping between agents and humans. The formula is:
  $$\text{GroupAlign}(A, H) = D\big(P_{O_A} \,\|\, P_{O_H}\big)$$
  - $D$: a measure of distributional dissimilarity. The paper uses Kullback-Leibler (KL) divergence.

The goal is to find a set of personas $\pi$ that minimizes this distributional difference:
$$\pi^* = \arg\min_{\pi} D\big(P_{O_{A(\pi)}} \,\|\, P_{O_H}\big)$$
Computing KL Divergence ($D_{KL}$)

The paper uses KL divergence to measure group alignment. $D_{KL}(P \,\|\, Q)$ measures how much information is lost when distribution $Q$ is used to approximate distribution $P$. A lower value means the distributions are more similar.

- For 1D discrete data (e.g., purchase ranks): the standard formula
  $$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$
  is used, where the probabilities are calculated from histograms (relative frequencies) of the data.
- For multi-dimensional continuous data (e.g., query embeddings): since there are no discrete bins, the distributions $P(x)$ and $Q(x)$ are first estimated using a Kernel Density Estimator (KDE). The KL divergence is then approximated using a Monte Carlo estimator:
  $$D_{KL}(P \,\|\, Q) \approx \frac{1}{n} \sum_{i=1}^{n} \log \frac{P(x_i)}{Q(x_i)},$$
  where the $x_i$ are samples drawn from the true distribution $P$.
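Both estimators can be sketched in a few lines. This is a sketch under stated assumptions, not the paper's implementation: the function names and smoothing constants are our own, and the KDE variant follows the Monte Carlo scheme described above using SciPy's Gaussian KDE:

```python
import numpy as np
from scipy.stats import gaussian_kde

def kl_discrete(p_samples, q_samples, bins):
    """D_KL(P || Q) from histograms over shared bins (e.g. viewed-item ranks)."""
    eps = 1e-9  # smoothing so empty bins do not produce log(0)
    p, _ = np.histogram(p_samples, bins=bins)
    q, _ = np.histogram(q_samples, bins=bins)
    p = p / p.sum() + eps
    q = q / q.sum() + eps
    return float(np.sum(p * np.log(p / q)))

def kl_kde_monte_carlo(p_samples, q_samples, n_mc=2000, seed=0):
    """Monte Carlo D_KL(P || Q) for multi-dimensional continuous data:
    fit KDEs to both sample sets, draw x_i from the estimated P, and
    average log p(x_i) - log q(x_i)."""
    p_kde = gaussian_kde(p_samples.T)  # gaussian_kde expects shape (dim, n)
    q_kde = gaussian_kde(q_samples.T)
    x = p_kde.resample(n_mc, seed=seed)
    return float(np.mean(np.log(p_kde(x) + 1e-12) - np.log(q_kde(x) + 1e-12)))
```

In the paper's setting, `p_samples` would be human behavior (ranks or query embeddings) and `q_samples` the agent population's behavior; lower values mean better group alignment.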
3. Task Definitions
The alignment suite includes three tasks that mimic a typical shopping journey.
- Query Generation:
  - Task: Given a product title that a human viewed, predict the search query they used to find it.
  - Individual alignment metric: cosine similarity between the embeddings of the agent's predicted query and the human's actual query.
  - Group alignment metric: KL divergence between the distributions of human query embeddings and agent query embeddings.
- Item Selection:
  - Task (individual): The agent is shown four items: one that the human actually purchased and three distractors. The agent must choose which one to "purchase."
  - Individual alignment metric: accuracy (the percentage of times the agent chooses the same item as the human).
  - Task (group): The agent is given a ranked list of search results and asked to select an item to view.
  - Group alignment metric: KL divergence between the distribution of ranks selected by humans and the distribution of ranks selected by agents.
- Session Generation:
  - Task: The agent interacts with a text-based simulated retail environment using tools like Search, View, and Cart. The goal is to generate a full shopping session.
  - Group alignment metric: KL divergence between the distributions of session statistics (e.g., number of searches, views, and purchases per session) for humans and agents.
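The session generation setup can be illustrated with a minimal text-based environment loop. Only the tool names (Search, View, Cart) come from the paper; the observation strings and the `policy` callable, which stands in for the persona-conditioned LLM agent, are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class SessionStats:
    # Per-session counts whose distributions are compared via KL divergence.
    searches: int = 0
    views: int = 0
    purchases: int = 0

def run_session(policy, max_steps=20):
    """Run one simulated shopping session and return its statistics.

    `policy(observation)` returns a (tool, argument) pair; any unrecognized
    tool (including "End") terminates the session."""
    stats = SessionStats()
    observation = "homepage"
    for _ in range(max_steps):
        action, arg = policy(observation)
        if action == "Search":
            stats.searches += 1
            observation = f"results for '{arg}'"
        elif action == "View":
            stats.views += 1
            observation = f"detail page of {arg}"
        elif action == "Cart":
            stats.purchases += 1
            observation = "cart updated"
        else:
            break
    return stats
```

Aggregating `SessionStats` over many simulated sessions yields the agent-side distributions used in the group alignment metric.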
Experimental Setup
Datasets
The paper uses internal, anonymized historical shopping data from Amazon. The datasets are constructed specifically for each task.
- Data Source: Anonymized shopping sessions from real customers.
- Query Generation: A test set of 3,058 (query, viewed product) pairs was created by taking the first search in a session and a product view that followed within 60 seconds.
- Item Selection (Individual): A dataset of 4,600 test cases. Each case consists of a ground-truth purchased item and three distractor items. All four items are removed from the shopping history provided to the agent to prevent trivial data leakage.
- Session Generation: 2,400 sessions were simulated for each agent configuration (with/without persona) to gather statistics for group alignment.
- Data Example (from Appendix A): The paper provides an illustrative example of a shopping session and the persona it might generate.
  - Shopping session snippet:
    2024-09-10
    <SEARCH> waterproof hiking shoes - at 10:12
    <VIEW> Men's Low height boots - at 10:14
    <PURCHASE> <Brand1> Waterproof hiking boots - at 10:42
  - Induced persona snippet:
    Profile: Age Group: 30-45 - Reason: Interest in solo travel and gear
    Interests: Hiking, camping
    Shopping Preferences: Price Sensitivity: Willing to invest in durable outdoor gear
Evaluation Metrics
- Accuracy: Used for the individual item selection task.
  - Conceptual definition: measures the percentage of times the agent's prediction is exactly correct.
  - Mathematical formula:
    $$\text{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}} \times 100\%$$
- Cosine similarity: Used for the individual query generation task.
  - Conceptual definition: measures the cosine of the angle between two vectors. In this case, it measures how similar the semantic meaning of the agent's query is to the human's query. A value of 1 means identical direction (meaning), 0 means orthogonal (unrelated), and -1 means opposite.
  - Mathematical formula: for two vectors $u$ and $v$:
    $$\cos(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}$$
  - Symbol explanation: $u \cdot v$ is the dot product of the vectors, and $\|u\|$ is the magnitude (L2 norm) of vector $u$. The vectors are obtained from an embedding model (all-MiniLM-L6-v2).
- Kullback-Leibler (KL) divergence: The primary metric for group alignment.
  - Conceptual definition: measures the difference between two probability distributions. A lower KL divergence indicates that the agent population's behavior is statistically more similar to the human population's behavior.
- Perplexity: Used to measure the complexity of a query.
  - Conceptual definition: a measure of how well a probability model predicts a sample. In NLP, lower perplexity means the language model is less "surprised" by the sequence of words, indicating a more common or predictable phrase. Higher perplexity suggests a more complex or unusual query.
- Token-Type Ratio (TTR): Used to measure lexical diversity in the session generation task.
  - Conceptual definition: measures the diversity of vocabulary used. A higher TTR indicates a wider variety of unique words (types) relative to the total number of words (tokens).
  - Mathematical formula:
    $$\text{TTR} = \frac{\text{number of unique tokens (types)}}{\text{total number of tokens}}$$
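The three simplest metrics above (accuracy, cosine similarity, TTR) can be sketched directly. The whitespace tokenizer used for TTR is an assumption, since the paper does not specify its tokenization:

```python
import math

def accuracy(predictions, targets):
    """Percentage of exact matches (used for item selection)."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return 100.0 * correct / len(targets)

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| ||v||), for embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def type_token_ratio(texts):
    """Unique tokens / total tokens over a corpus, with naive whitespace split."""
    tokens = [tok for text in texts for tok in text.lower().split()]
    return len(set(tokens)) / len(tokens)
```

In the paper's pipeline the vectors passed to `cosine_similarity` would come from the all-MiniLM-L6-v2 embedding model rather than being raw token counts.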
Baselines
The primary comparison is between two agent configurations:
- Base: The LLM agent (Anthropic Claude Sonnet 3.0) without any persona conditioning. It receives only the task-specific inputs.
- + Persona: The same LLM agent, but prompted with the full persona (profile, preferences, and history) mined from the corresponding human's data.

The behavior of the Human population serves as the ground truth or target distribution.
Results & Analysis
The experimental results consistently show that providing agents with mined personas improves their alignment with human behavior, although a gap still exists.
Query Generation
- Individual alignment: Agents with personas (+ Persona) achieved an average cosine similarity score of 0.69, a 17% relative improvement over the Base agents' score of 0.59.
- Manual transcription of Table 1: This example shows the + Persona agent generating a more specific and better aligned query.

  | Method | Query |
  | --- | --- |
  | Base | knee brace for pain relief |
  | + Persona | knee brace for women |
  | Human | adjustable knee brace for women |

- Figure 2 analysis: The plot shows that while similarity scores decrease for all agents as query perplexity (complexity) increases, the + Persona agents consistently outperform the Base agents across all complexity levels.
  (Figure 2: a scatter plot of average similarity to human queries versus perplexity on a log scale; blue and green points denote agents without and with personas, and the trend lines show consistently higher similarity with personas.)
- Group alignment: The + Persona agents achieve a lower KL divergence (17.51) than Base agents (18.81), indicating their query distribution is closer to the human distribution.
Item Selection
- Individual alignment (purchase prediction): The results in Table 3 clearly demonstrate the value of adding more context. The Base model performs close to random chance (25%). Adding each component of the persona progressively improves accuracy, with the full persona achieving 47.26%, a significant leap.
- Manual transcription of Table 3: Item selection individual alignment - purchase prediction task.

  | Shopping Background | Accuracy (%) |
  | --- | --- |
  | Base | 25.46 |
  | + Consumer profile | 35.95 |
  | + Shopping Preferences | 39.01 |
  | + History | 41.11 |
  | + Persona | 47.26 |

- Group alignment: Figure 3 shows the distribution of viewed item ranks. All populations tend to view items ranked higher in search results, but the distribution for + Persona agents is visibly closer to the Human distribution than the Base agent distribution. This is confirmed by the KL divergence in Table 2, where + Persona (1.08) is much lower than Base (2.40).
  (Figure 3: distributions of viewed item ranks for real users and for agents with and without personas; the y-axis shows the proportion of views and the x-axis the item-rank bucket on the retail site.)
Session Generation & A/B Testing
- Group alignment (session statistics): For session-level metrics like the number of searches, clicks, and purchases, the + Persona agents again show substantially lower KL divergence than Base agents, as seen in Table 2. This means the overall "shape" of their shopping sessions is more human-like.
- Manual transcription of Table 2: Group alignment metrics - KL divergence.

  | Method | Query generation | Item selection | Session: # Searches | Session: # Clicks | Session: # Purchases |
  | --- | --- | --- | --- | --- | --- |
  | Base | 18.81 | 2.40 | 11.69 | 11.70 | 11.68 |
  | + Persona | 17.51 | 1.08 | 3.71 | 3.72 | 3.68 |

- Lexical diversity (TTR): As shown in Table 4, agents with personas use a more diverse vocabulary (Query-TTR of 0.23, Product-TTR of 0.66) than base agents, bringing them closer to the diversity seen in human shoppers. However, a significant gap to human-level diversity remains (0.38 and 0.97, respectively).
- Manual transcription of Table 4: TTR metrics for queries searched and products viewed.

  | Method | Query-TTR | Product-TTR |
  | --- | --- | --- |
  | Base | 0.013 | 0.035 |
  | + Persona | 0.23 | 0.66 |
  | Human | 0.38 | 0.97 |

- A/B Testing Simulation: In a preliminary experiment simulating three historical A/B tests, the PAARS framework correctly predicted the directional change in sales for 2 out of 3 tests. However, the magnitude of the change was much larger (10-30x) in the simulation. The authors hypothesize this is because the agents are currently designed with a strong bias toward purchasing.
Conclusion & Personal Thoughts
Conclusion Summary
The paper introduces PAARS, a novel framework for simulating e-commerce shoppers using LLM-powered agents. Its key innovations are the automatic mining of personas from real data and the introduction of a group alignment evaluation methodology. Experiments convincingly demonstrate that conditioning agents on these rich personas leads to simulated behavior that is statistically more aligned with real human populations compared to using generic agents. The framework shows promise for practical applications like automated A/B testing, even in its early stages. While a gap to perfect human replication remains, PAARS sets a clear path for future research in creating high-fidelity, scalable simulations of human economic behavior.
Limitations & Future Work
The authors acknowledge several limitations and areas for future work:
- Modality and Language: The current framework is text-only and limited to the English language and the US marketplace. Real e-commerce is highly visual and global, requiring multimodal models and testing for cultural nuances.
- Simulation Fidelity: The simulation environment is simplified. The agents need more sophisticated capabilities like navigation and filtering. The discrepancy in the magnitude of A/B test results suggests the agents' underlying shopping "intent" needs to be modeled more realistically.
- Persona Dynamics: Human preferences change over time. The framework currently uses static personas, and future work needs to explore cost-effective ways to update them dynamically.
- LLM Biases: The paper acknowledges that foundational LLM biases (brand, price, etc.) still need to be systematically tested and mitigated.
Personal Insights & Critique
- Significance of Group Alignment: The formalization of group alignment is the paper's most important conceptual contribution. It correctly identifies that for many large-scale applications (A/B testing, market trend analysis, surveying), aggregate statistical matching is more valuable and achievable than perfect individual mimicry. The dice-rolling example in Appendix B provides an intuitive explanation of this concept.
- Practicality and Impact: The approach of mining personas from existing, anonymized data is highly practical for large companies like Amazon that possess such datasets. If matured, this technology could revolutionize how e-commerce companies innovate, allowing them to run thousands of simulated experiments in a fraction of the time and cost of live tests. This could significantly accelerate product development and personalization efforts.
- Reproducibility and Generalizability: A major weakness is the use of a proprietary dataset and a commercial LLM (Claude 3.0 Sonnet). This makes the results impossible for the broader academic community to reproduce. While the framework itself is generalizable, its performance is tied to the quality of the underlying LLM and the data used for persona mining.
- Ethical Considerations: The paper touches on ethics, but the implications run deep. While personas are synthetic, they are derived from real user data, raising privacy concerns. More importantly, a highly effective simulator of consumer behavior could be used to develop more persuasive or even manipulative marketing strategies. The "guardrail" function is positive, but the same tool could be used to optimize for user engagement and spending in ways that may not be in the consumer's best interest. This duality warrants deeper ethical investigation.
- The "Last Mile" Problem: The results show a clear improvement with personas, but also a persistent "gap to human behavior." This gap likely stems from the aspects of human psychology that are hardest to capture from behavioral data alone—impulse, social influence, complex trade-offs, and the physical experience of a product. Closing this final gap will likely require more than just better LLMs or more data; it may require integrating theories from cognitive science and behavioral economics directly into the agent architecture.