
PAARS: Persona Aligned Agentic Retail Shoppers

Published: 03/31/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The PAARS framework creates persona-driven retail agents equipped with shopping tools and aligns their group-level behavior distributions with those of real human shoppers, improving simulation fidelity and enabling applications such as automated A/B testing.

Abstract

In e-commerce, behavioral data is collected for decision making which can be costly and slow. Simulation with LLM powered agents is emerging as a promising alternative for representing human population behavior. However, LLMs are known to exhibit certain biases, such as brand bias, review rating bias and limited representation of certain groups in the population, hence they need to be carefully benchmarked and aligned to user behavior. Ultimately, our goal is to synthesise an agent population and verify that it collectively approximates a real sample of humans. To this end, we propose a framework that: (i) creates synthetic shopping agents by automatically mining personas from anonymised historical shopping data, (ii) equips agents with retail-specific tools to synthesise shopping sessions and (iii) introduces a novel alignment suite measuring distributional differences between humans and shopping agents at the group (i.e. population) level rather than the traditional "individual" level. Experimental results demonstrate that using personas improves performance on the alignment suite, though a gap remains to human behaviour. We showcase an initial application of our framework for automated agentic A/B testing and compare the findings to human results. Finally, we discuss applications, limitations and challenges setting the stage for impactful future work.

In-depth Reading

Bibliographic Information

  • Title: PAARS: Persona Aligned Agentic Retail Shoppers
  • Authors: Saab Mansour, Leonardo Perelli, Lorenzo Mainetti, George Davidson, Stefano D'Amato
  • Affiliations: The authors are all affiliated with Amazon.
  • Journal/Conference: This paper was submitted to arXiv, a popular preprint server. This means it has not yet undergone formal peer review for publication in a conference or journal.
  • Publication Year: The paper was submitted to arXiv with a listed publication date of 2025-03-31.
  • Abstract: The authors address the high cost and slow pace of collecting behavioral data in e-commerce. They propose simulation using Large Language Model (LLM) powered agents as a faster, more scalable alternative. Acknowledging that LLMs have inherent biases, they introduce PAARS, a framework designed to create and align synthetic shopping agents with real human behavior. The framework (i) automatically mines "personas" from anonymized historical shopping data, (ii) gives agents retail-specific tools (like search, view, purchase) to simulate shopping sessions, and (iii) introduces a novel alignment suite that measures how well the distribution of agent behavior matches the distribution of human behavior at a population level (group alignment), rather than just matching individual actions (individual alignment). Experiments show that using these mined personas improves alignment, although a performance gap to real human behavior remains. The paper also demonstrates a preliminary application for automated A/B testing and discusses the broader potential and limitations of the approach.

Executive Summary

Background & Motivation (Why)

In e-commerce, companies like Amazon make critical business decisions by analyzing customer behavior, often through live experiments like A/B testing. However, collecting this data is expensive, slow, and resource-intensive. A promising alternative is to simulate human behavior using intelligent agents powered by Large Language Models (LLMs). These agents can "act" like shoppers, generating vast amounts of behavioral data quickly and cheaply.

The primary challenge is ensuring these simulations are realistic. LLMs are not perfect mirrors of human populations; they have known biases, such as favoring popular brands (brand bias), being overly influenced by positive reviews (rating bias), and poorly representing minority groups. If the simulated behavior doesn't accurately reflect real human behavior, any decisions based on it will be flawed.

This paper tackles this challenge by proposing a framework to create a population of synthetic shopping agents that collectively approximates the behavior of a real human population. The innovation lies in moving beyond generic LLM prompts to create specific, data-driven "personas" for each agent.

Main Contributions / Findings (What)

The paper introduces PAARS (Persona Aligned Agentic Retail Shoppers), a framework with three core contributions:

  1. Persona-Powered Simulation Framework: Instead of using a generic LLM, PAARS first mines personas from real, anonymized historical shopping data. These personas, which include demographic profiles and shopping preferences, are then used to instruct LLM agents equipped with a set of retail-specific tools (e.g., search, view, purchase) to simulate entire shopping sessions.
  2. Novel Group-Level Alignment Suite: The paper argues that for applications like A/B testing, it's more important that the overall distribution of agent behavior matches the human population, rather than each agent perfectly mimicking one specific human. They formalize this concept as group alignment and propose metrics to measure it, primarily using Kullback-Leibler (KL) divergence to compare the statistical distributions of agent and human actions.
  3. Experimental Validation and Application: Experiments demonstrate that agents equipped with the mined personas (+ Persona) consistently outperform generic agents (Base) on the alignment suite, showing behavior that is statistically closer to real humans. The paper also presents a proof-of-concept application by using PAARS to simulate A/B tests, finding that the agent-based simulations correctly predicted the directional outcome of 2 out of 3 real historical tests.

Prerequisite Knowledge & Related Work

Foundational Concepts

  • Large Language Models (LLMs): These are massive neural networks (like GPT-4 or Anthropic's Claude) trained on vast amounts of text and code. They can understand and generate human-like text, making them suitable for powering conversational agents.
  • Agentic Framework: In this context, an "agent" is more than just an LLM. It is an LLM given a specific goal, a "persona" or role to play, and a set of "tools" it can use to interact with an environment. For example, a shopping agent might be given the persona of a "price-sensitive student" and tools to search for products or view their details on a simulated website.
  • Persona: A persona is a detailed, synthetic profile of a user. In this paper, it's not just a simple instruction but a rich description mined from real data, including a consumer profile (age, interests), shopping preferences (price sensitivity, brand loyalty), and the user's actual shopping history.
  • A/B Testing: A common method for online experimentation. Users are randomly split into two groups: a control group (A) sees the existing version of a website, and a treatment group (B) sees a new version. By comparing metrics (e.g., sales, clicks) between the groups, companies can determine whether the change had a positive, negative, or no effect (a minimal read-out sketch follows this list).
  • Algorithmic Fidelity: A term coined by Argyle et al. (2023), it refers to the ability of an LLM to accurately emulate the characteristics and responses of diverse human subgroups when prompted appropriately. This paper aims to achieve high algorithmic fidelity for retail shoppers.
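To make the A/B read-out concrete, here is a minimal sketch of comparing conversion rates between control and treatment with a two-proportion z-test. The counts and the ab_ztest helper are illustrative assumptions, not material from the paper.

    import math
    from scipy.stats import norm

    def ab_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int) -> tuple[float, float]:
        """Two-proportion z-test: is treatment (B) conversion different from control (A)?"""
        p_a, p_b = conv_a / n_a, conv_b / n_b
        p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled conversion rate
        se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        z = (p_b - p_a) / se
        p_value = 2 * (1 - norm.cdf(abs(z)))              # two-sided p-value
        return z, p_value

    # Hypothetical counts: 10k users per arm, 480 vs. 530 conversions.
    z, p = ab_ztest(conv_a=480, n_a=10_000, conv_b=530, n_b=10_000)
    print(f"z={z:.2f}, p={p:.3f}")  # the sign of z gives the direction of the effect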

Previous Works

The paper builds upon several lines of research in agent-based simulation and LLM evaluation:

  • Social and Psychological Simulation: Prior work has used LLMs to simulate human behavior in social contexts.
    • Argyle et al. (2023) showed that GPT-3 could simulate a wide variety of human political subgroups with high accuracy, establishing the concept of algorithmic fidelity.
    • Park et al. (2023) created "Generative Agents," an interactive sandbox where LLM-powered agents exhibited emergent social behaviors like spreading information.
    • Aher et al. (2023) replicated human subject studies using LLM agents, but also found that simulations could be distorted in certain domains.
  • Multi-Agent Collaboration: Some research uses agents with distinct roles to solve complex tasks.
    • Hong et al. (2024) (MetaGPT) and Qian et al. (2024) (ChatDev) assigned roles like "project manager" and "engineer" to agents to improve software development performance. This demonstrates the power of role-playing, which is analogous to using personas.
  • LLM Bias and Alignment in Recommendation: Researchers have identified and tried to mitigate biases in LLMs for tasks like product recommendation.
    • Kamruzzaman et al. (2024) highlighted brand bias in LLMs.
    • Yoon et al. (2024) noted a positive rating bias and proposed a behavioral alignment suite for movie recommendation, but focused on individual alignment (predicting a specific user's next action).

Differentiation

PAARS distinguishes itself from prior work in several key ways:

  1. Domain Focus: While most agent simulation has focused on social science or collaborative tasks, PAARS is one of the first comprehensive frameworks specifically designed for the e-commerce and retail domain.
  2. Data-Driven Persona Mining: Unlike approaches that use hand-crafted or simple role-playing prompts, PAARS induces rich personas directly from real, anonymized shopping histories. This makes the agents more grounded in actual consumer behavior.
  3. Introduction of Group Alignment: The most significant conceptual contribution is the formalization of group alignment. Previous work like Yoon et al. (2024) focused on individual alignment (e.g., "Can the agent predict the exact movie this user will watch next?"). PAARS argues that for population-level analysis like A/B testing, it's more critical to ask, "Does the overall distribution of movies watched by my agent population match the distribution of movies watched by the human population?" This is a more tractable and often more useful goal.

Methodology (Core Technology & Implementation Details)

The PAARS framework, as shown in Figure 1, consists of three main stages: Persona Mining, Session Generation with tools, and Alignment Evaluation.

Figure 1: The PAARS framework: we synthesize personas from anonymised human shopper sessions, generate shopping sessions by powering LLM-based agents with personas and retail tools, and measure individual and group alignment. The diagram traces the pipeline from human shopper behavioral data to an agent population, evaluated by the alignment suite at the individual and group levels, supporting applications such as automated A/B testing and personalized recommendation.

1. Persona Mining

The core idea is to create a detailed, synthetic persona for an LLM agent to embody. This is done in a two-step prompting process using an LLM (in this case, Anthropic's Claude 3 Sonnet).

  • Step 1: Consumer Profile Generation:
    • Input: The LLM receives an anonymized shopping history for a single customer. This history includes search queries, viewed items, and purchased items over the last 6 months.
    • Task: The LLM is prompted to synthesize a generic consumer profile based on this history. The profile includes fields like age range, marital status, income, and interests. The LLM is also asked to provide reasoning for each field (e.g., "Age Group: 30-45 - Reason: Interest in solo travel and gear").
  • Step 2: Shopping Preferences Inference:
    • Input: The LLM is prompted again, this time with both the original shopping history and the newly generated consumer profile.

    • Task: The LLM infers the user's shopping preferences, such as price sensitivity, brand loyalty, and reliance on customer reviews.

      The final persona given to the agent is a combination of the synthetic consumer profile, the inferred shopping preferences, and the original shopping history.
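A minimal sketch of the two-step prompting pipeline described above. The prompt wording and the call_llm stub are illustrative assumptions; the paper's actual prompts are not reproduced here, and any chat-completion client (the paper uses Claude 3 Sonnet) could back the stub.

    def call_llm(prompt: str) -> str:
        """Placeholder for a chat-completion call; wire up your own LLM client here."""
        raise NotImplementedError

    def mine_persona(shopping_history: str) -> str:
        # Step 1: synthesize a consumer profile (age range, marital status,
        # income, interests), with reasoning attached to each field.
        profile = call_llm(
            "Given this anonymized shopping history, infer a consumer profile "
            "(age range, marital status, income, interests) and justify each field.\n\n"
            f"History:\n{shopping_history}"
        )
        # Step 2: infer shopping preferences conditioned on history + profile.
        preferences = call_llm(
            "Given the shopping history and consumer profile below, infer shopping "
            "preferences (price sensitivity, brand loyalty, reliance on reviews).\n\n"
            f"History:\n{shopping_history}\n\nProfile:\n{profile}"
        )
        # The final persona combines the profile, the preferences, and the raw history.
        return f"{profile}\n\n{preferences}\n\nShopping history:\n{shopping_history}"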

2. Alignment Evaluation Suite

The paper proposes a novel evaluation suite to measure how well the agent population mimics the human population at both an individual and a group level.

Individual vs. Group Alignment

This is a key distinction. Let \mathbf{H} = \{h_i\}_{i=1}^n be a population of n human shoppers and \mathbf{A} = \{a_i\}_{i=1}^n be a population of n agents designed to mimic them. For a given task \mathbf{T}, let \mathbf{O_H} and \mathbf{O_A} be the sets of outputs from humans and agents, respectively.

  • Individual Alignment: Measures how well each agent a_i replicates the specific actions of its human counterpart h_i. A general formula is: \mathcal{M}_{\mathrm{individual}} = f_{\mathrm{agg}}\left(\{f_{\mathrm{comp}}(o_{a,i}, o_{h,i})\}_{i=1}^{n}\right)

    • f_{\mathrm{comp}}: A function that compares the output of a single agent (o_{a,i}) to its human counterpart (o_{h,i}). For example, an equality check.
    • f_{\mathrm{agg}}: An aggregation function, like the average, that combines the comparison results across the whole population (e.g., overall accuracy).
  • Group Alignment: Measures whether the overall distribution of outputs from the agent population \mathbf{A} is similar to the distribution from the human population \mathbf{H}. It does not require a one-to-one mapping between agents and humans. The formula is: \mathcal{M}_{\mathrm{group}} = \Phi(\mathbf{O_H}, \mathbf{O_A})

    • \Phi: A measure of distributional dissimilarity. The paper uses Kullback-Leibler (KL) divergence.

      The goal is to find a set of personas \mathbf{P}^* that minimizes this distributional difference: \mathbf{P}^* = \underset{\mathbf{P}}{\mathrm{argmin}}\ D_{\mathrm{KL}}(\mathbf{O_H} \,\|\, \mathbf{O_{A(P)}})
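To make the distinction concrete, here is a toy sketch with made-up numbers, assuming each agent is paired with one human:

    import numpy as np

    human_choices = np.array([1, 2, 1, 4, 3, 1])  # e.g., search ranks chosen by humans
    agent_choices = np.array([1, 2, 2, 4, 1, 1])  # choices of the matched agents

    # Individual alignment: f_comp is an equality check, f_agg is the mean,
    # i.e., plain accuracy over matched (human, agent) pairs.
    individual = (human_choices == agent_choices).mean()  # -> 0.67

    # Group alignment ignores the pairing and compares the two *distributions*
    # of choices; the KL-divergence helpers sketched below compute exactly that.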

Computing KL Divergence (D_{\mathrm{KL}})

The paper uses KL divergence to measure group alignment. D_{\mathrm{KL}}(P \| Q) measures how much information is lost when distribution Q is used to approximate distribution P. A lower value means the distributions are more similar.

  • For 1D Discrete Data (e.g., purchase ranks): The standard formula is used, where probabilities are calculated from histograms (relative frequencies) of the data: D_{\mathrm{KL}}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
  • For Multi-dimensional Continuous Data (e.g., query embeddings): Since there are no discrete bins, the densities P(x) and Q(x) are first estimated using a Kernel Density Estimator (KDE). The KL divergence is then approximated with a Monte Carlo estimator: D_{\mathrm{KL}}(P \| Q) \approx \frac{1}{N} \sum_{i=1}^{N} \left[ \log P(x_i) - \log Q(x_i) \right], where the x_i are samples drawn from the true distribution P.
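A sketch of both estimators, assuming NumPy/SciPy; the smoothing constant and sample counts are implementation choices, not values from the paper:

    import numpy as np
    from scipy.stats import gaussian_kde

    def kl_discrete(p_samples, q_samples, bins):
        """KL(P||Q) for 1D discrete data (e.g., selected ranks) via histograms."""
        p, _ = np.histogram(p_samples, bins=bins)
        q, _ = np.histogram(q_samples, bins=bins)
        p = p / p.sum()
        q = q / q.sum()
        eps = 1e-10  # smoothing so empty agent bins don't yield infinite divergence
        p, q = p + eps, q + eps
        return float(np.sum(p * np.log(p / q)))

    def kl_continuous_mc(p_samples, q_samples, n_mc=10_000):
        """Monte Carlo KL(P||Q) for continuous data (e.g., query embeddings).

        p_samples and q_samples are (n, d) arrays; densities are fit with a
        Gaussian KDE. For high-dimensional embeddings one would typically
        reduce dimensionality first (e.g., PCA), an assumption made here
        since KDE degrades badly as d grows.
        """
        p_kde = gaussian_kde(p_samples.T)  # gaussian_kde expects shape (d, n)
        q_kde = gaussian_kde(q_samples.T)
        x = p_kde.resample(n_mc)           # draw x_i ~ P
        return float(np.mean(p_kde.logpdf(x) - q_kde.logpdf(x)))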

3. Task Definitions

The alignment suite includes three tasks that mimic a typical shopping journey.

  1. Query Generation:

    • Task: Given a product title that a human viewed, predict the search query they used to find it.
    • Individual Alignment Metric: Cosine Similarity between the embeddings of the agent's predicted query and the human's actual query.
    • Group Alignment Metric: KL Divergence between the distributions of human query embeddings and agent query embeddings.
  2. Item Selection:

    • Task (Individual): The agent is shown four items: one that the human actually purchased and three distractors. The agent must choose which one to "purchase."
    • Individual Alignment Metric: Accuracy (the percentage of times the agent chooses the same item as the human).
    • Task (Group): The agent is given a ranked list of search results and asked to select an item to view.
    • Group Alignment Metric: KL Divergence between the distribution of ranks selected by humans and the distribution of ranks selected by agents.
  3. Session Generation:

    • Task: The agent interacts with a text-based simulated retail environment using tools like Search, View, and Cart. The goal is to generate a full shopping session.
    • Group Alignment Metric: KL Divergence between the distributions of session statistics (e.g., number of searches, number of views, number of purchases per session) for humans and agents.
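A hedged sketch of the session-generation loop. The tool names (Search, View, Cart) come from the paper; the agent/env interfaces, the Stop action, and the step cap are assumptions made for illustration.

    from dataclasses import dataclass, field

    @dataclass
    class SessionStats:
        searches: int = 0
        views: int = 0
        purchases: int = 0
        trace: list = field(default_factory=list)

    def run_session(agent, env, max_steps: int = 20) -> SessionStats:
        """Roll out one simulated shopping session and collect its statistics."""
        stats = SessionStats()
        observation = env.reset()
        for _ in range(max_steps):
            action = agent.act(observation)  # LLM picks a tool call given persona + state
            if action.tool == "Stop":
                break
            if action.tool == "Search":
                stats.searches += 1
            elif action.tool == "View":
                stats.views += 1
            elif action.tool == "Cart":
                stats.purchases += 1
            stats.trace.append(action)
            observation = env.step(action)   # environment returns the new page text
        return stats

Group alignment then compares the distributions of searches, views, and purchases across many such rollouts against the human session statistics.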

Experimental Setup

Datasets

The paper uses internal, anonymized historical shopping data from Amazon. The datasets are constructed specifically for each task.

  • Data Source: Anonymized shopping sessions from real customers.
  • Query Generation: A test set of 3,058 ⟨search query, viewed product⟩ pairs was created by taking the first search in a session and a subsequent product view within 60 seconds.
  • Item Selection (Individual): A dataset of 4,600 test cases. Each case consists of a ground-truth purchased item and three distractor items. All four items are removed from the shopping history provided to the agent to prevent trivial data leakage.
  • Session Generation: 2,400 sessions were simulated for each agent configuration (with/without persona) to gather statistics for group alignment.
  • Data Example (from Appendix A): The paper provides an illustrative example of a shopping session and the persona it might generate.
    • Shopping Session Snippet:
      2024-09-10
      <SEARCH> waterproof hiking shoes - at 10:12
      <VIEW> Men's Low height boots - at 10:14
      <PURCHASE> <Brand1> Waterproof hiking boots - at 10:42
      
    • Induced Persona Snippet:
      Profile:
      Age Group: 30-45
      - Reason: Interest in solo travel and gear
        Interests: Hiking, camping
      Shopping Preferences:
      Price Sensitivity:
      - Willing to invest in durable outdoor gear
      

Evaluation Metrics

  • Accuracy: Used for the individual item selection task.
    • Conceptual Definition: Measures the percentage of times the agent's prediction is exactly correct.
    • Mathematical Formula: \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}
  • Cosine Similarity: Used for the individual query generation task.
    • Conceptual Definition: Measures the cosine of the angle between two vectors. In this case, it measures how similar the semantic meaning of the agent's query is to the human's query. A value of 1 means identical direction (meaning), 0 means orthogonal (unrelated), and -1 means opposite.
    • Mathematical Formula: For two vectors \vec{A} and \vec{B}: \text{Similarity} = \cos(\theta) = \frac{\vec{A} \cdot \vec{B}}{\|\vec{A}\| \|\vec{B}\|}
    • Symbol Explanation: \vec{A} \cdot \vec{B} is the dot product of the vectors, and \|\vec{A}\| is the magnitude (L2 norm) of \vec{A}. The vectors are obtained from an embedding model (all-MiniLM-L6-v2); a code sketch follows this list.
  • Kullback-Leibler (KL) Divergence: The primary metric for group alignment.
    • Conceptual Definition: Measures the difference between two probability distributions. A lower KL divergence indicates that the agent population's behavior is statistically more similar to the human population's behavior.
  • Perplexity: Used to measure the complexity of a query.
    • Conceptual Definition: A measure of how well a probability model predicts a sample. In NLP, lower perplexity means the language model is less "surprised" by the sequence of words, indicating a more common or predictable phrase. Higher perplexity suggests a more complex or unusual query.
  • Token-Type Ratio (TTR): Used to measure lexical diversity in the session generation task.
    • Conceptual Definition: Measures the diversity of vocabulary used. A higher TTR indicates a wider variety of unique words (types) relative to the total number of words (tokens).
    • Mathematical Formula: \text{TTR} = \frac{\text{Number of Unique Tokens (Types)}}{\text{Total Number of Tokens}}
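The two text-level metrics are straightforward to compute. A sketch using the sentence-transformers package (the all-MiniLM-L6-v2 model is named in the paper; whitespace tokenization for TTR is a simplifying assumption):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # embedding model used in the paper

    def cosine_similarity(a: str, b: str) -> float:
        """Semantic similarity between two queries via embedding cosine."""
        va, vb = model.encode([a, b])
        return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

    def ttr(texts: list[str]) -> float:
        """Token-Type Ratio: unique tokens over total tokens."""
        tokens = [tok for t in texts for tok in t.lower().split()]
        return len(set(tokens)) / len(tokens)

    print(cosine_similarity("knee brace for women", "adjustable knee brace for women"))
    print(ttr(["waterproof hiking shoes", "waterproof boots for men"]))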

Baselines

The primary comparison is between two agent configurations:

  1. Base: The LLM agent (Anthropic's Claude 3 Sonnet) without any persona conditioning. It receives only the task-specific inputs.

  2. + Persona: The same LLM agent but prompted with the full persona (profile, preferences, and history) mined from the corresponding human's data.

    The behavior of the Human population serves as the ground truth or target distribution.

Results & Analysis

The experimental results consistently show that providing agents with mined personas improves their alignment with human behavior, although a gap still exists.

Query Generation

  • Individual Alignment: Agents with personas (+ Persona) achieved an average cosine similarity score of 0.69, a 17% relative improvement over the Base agents' score of 0.59.

  • Manual Transcription of Table 1: This example shows the + Persona agent generating a more specific and aligned query.

    Method     | Query
    Baseline   | knee brace for pain relief
    + Persona  | knee brace for women
    Human      | adjustable knee brace for women
  • Figure 2 Analysis: The plot below shows that while similarity scores decrease for all agents as query perplexity (complexity) increases, the + Persona agents consistently outperform the Base agents across all complexity levels.

    Figure 2: Query generation task: we compare agents with and without personas, by measuring the cosine similarity of the agentic queries against the human ones across different query perplexity levels. The scatter plot shows mean similarity scores against perplexity (log scale), with trend lines indicating consistently higher similarity for persona-conditioned agents.

  • Group Alignment: The + Persona agents achieve a lower KL divergence (17.51) compared to Base agents (18.81), indicating their query distribution is closer to the human distribution.

Item Selection

  • Individual Alignment (Purchase Prediction): The results in Table 3 clearly demonstrate the value of adding more context. The Base model performs close to random chance (25%). Adding each component of the persona progressively improves accuracy, with the full persona achieving 47.26%, a significant leap.

  • Manual Transcription of Table 3: Item selection individual alignment - purchase prediction task.

    Shopping Background     | Accuracy (%)
    Base                    | 25.46
    + Consumer profile      | 35.95
    + Shopping Preferences  | 39.01
    + History               | 41.11
    + Persona               | 47.26
  • Group Alignment: Figure 3 shows the distribution of viewed item ranks. All populations tend to view items ranked higher in search results. However, the distribution for + Persona agents is visibly closer to the Human distribution than the Base agent distribution. This is confirmed by the KL divergence in Table 2, where + Persona (1.08) is much lower than Base (2.40).

    Figure 3: Search rank distribution of viewed items comparing human behavior to agents with/without personas. The chart plots the proportion of views (y-axis) against search result rank buckets (x-axis) for each population.

Session Generation & A/B Testing

  • Group Alignment (Session Statistics): For session-level metrics like the number of searches, clicks, and purchases, the + Persona agents again show substantially lower KL divergence than Base agents, as seen in Table 2. This means the overall "shape" of their shopping sessions is more human-like.

  • Manual Transcription of Table 2: Group alignment metrics - KL divergence.

    Method    | Query generation | Item selection | Session: # Searches | Session: # Clicks | Session: # Purchases
    Base      | 18.81            | 2.40           | 11.69               | 11.70             | 11.68
    + Persona | 17.51            | 1.08           | 3.71                | 3.72              | 3.68
  • Lexical Diversity (TTR): As shown in Table 4, agents with personas use a more diverse vocabulary (Query-TTR of 0.23, Product-TTR of 0.66) than base agents, bringing them closer to the diversity seen in human shoppers. However, a significant gap remains to human-level diversity (0.38 and 0.97, respectively).

  • Manual Transcription of Table 4: TTR metrics for queries searched and products viewed.

    Method    | Query-TTR | Product-TTR
    Base      | 0.013     | 0.035
    + Persona | 0.23      | 0.66
    Human     | 0.38      | 0.97
  • A/B Testing Simulation: In a preliminary experiment simulating three historical A/B tests, the PAARS framework correctly predicted the directional change in sales for 2 out of 3 tests. However, the magnitude of the change was much larger (10-30x) in the simulation. The authors hypothesize this is because the agents are currently designed with a strong bias toward purchasing.

Conclusion & Personal Thoughts

Conclusion Summary

The paper introduces PAARS, a novel framework for simulating e-commerce shoppers using LLM-powered agents. Its key innovations are the automatic mining of personas from real data and the introduction of a group alignment evaluation methodology. Experiments convincingly demonstrate that conditioning agents on these rich personas leads to simulated behavior that is statistically more aligned with real human populations compared to using generic agents. The framework shows promise for practical applications like automated A/B testing, even in its early stages. While a gap to perfect human replication remains, PAARS sets a clear path for future research in creating high-fidelity, scalable simulations of human economic behavior.

Limitations & Future Work

The authors acknowledge several limitations and areas for future work:

  • Modality and Language: The current framework is text-only and limited to the English language and the US marketplace. Real e-commerce is highly visual and global, requiring multimodal models and testing for cultural nuances.
  • Simulation Fidelity: The simulation environment is simplified. The agents need more sophisticated capabilities like navigation and filtering. The discrepancy in the magnitude of A/B test results suggests the agents' underlying shopping "intent" needs to be modeled more realistically.
  • Persona Dynamics: Human preferences change over time. The framework currently uses static personas, and future work needs to explore cost-effective ways to update them dynamically.
  • LLM Biases: The paper acknowledges that foundational LLM biases (brand, price, etc.) still need to be systematically tested and mitigated.

Personal Insights & Critique

  • Significance of Group Alignment: The formalization of group alignment is the paper's most important conceptual contribution. It correctly identifies that for many large-scale applications (A/B testing, market trend analysis, surveying), aggregate statistical matching is more valuable and achievable than perfect individual mimicry. The dice-rolling example in Appendix B provides a brilliant and intuitive explanation of this concept.
  • Practicality and Impact: The approach of mining personas from existing, anonymized data is highly practical for large companies like Amazon that possess such datasets. If matured, this technology could revolutionize how e-commerce companies innovate, allowing them to run thousands of simulated experiments in a fraction of the time and cost of live tests. This could accelerate product development and personalization efforts significantly.
  • Reproducibility and Generalizability: A major weakness is the use of a proprietary dataset and a commercial LLM (Claude 3.0 Sonnet). This makes the results impossible for the broader academic community to reproduce. While the framework itself is generalizable, its performance is tied to the quality of the underlying LLM and the data used for persona mining.
  • Ethical Considerations: The paper touches on ethics, but the implications run deep. While personas are synthetic, they are derived from real user data, raising privacy concerns. More importantly, a highly effective simulator of consumer behavior could be used to develop more persuasive or even manipulative marketing strategies. The "guardrail" function is positive, but the same tool could be used to optimize for user engagement and spending in ways that may not be in the consumer's best interest. This duality warrants deeper ethical investigation.
  • The "Last Mile" Problem: The results show a clear improvement with personas, but also a persistent "gap to human behavior." This gap likely stems from the aspects of human psychology that are hardest to capture from behavioral data alone—impulse, social influence, complex trade-offs, and the physical experience of a product. Closing this final gap will likely require more than just better LLMs or more data; it may require integrating theories from cognitive science and behavioral economics directly into the agent architecture.
