LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings
TL;DR Summary
The SSR method maps LLMs' textual responses to Likert distributions via semantic similarity, replicating human purchase intent at 90% of human test-retest reliability while preserving realistic survey response patterns and interpretability, enabling scalable consumer research simulation.
Abstract
Consumer research costs companies billions annually yet suffers from panel biases and limited scale. Large language models (LLMs) offer an alternative by simulating synthetic consumers, but produce unrealistic response distributions when asked directly for numerical ratings. We present semantic similarity rating (SSR), a method that elicits textual responses from LLMs and maps these to Likert distributions using embedding similarity to reference statements. Testing on an extensive dataset comprising 57 personal care product surveys conducted by a leading corporation in that market (9,300 human responses), SSR achieves 90% of human test-retest reliability while maintaining realistic response distributions (KS similarity > 0.85). Additionally, these synthetic respondents provide rich qualitative feedback explaining their ratings. This framework enables scalable consumer research simulations while preserving traditional survey metrics and interpretability.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation of Likert Ratings
- Authors: Benjamin F. Maier, Ulf Aslak, Luca Fiaschi, Nina Rismal, Kemble Fletcher, Christian C. Luhmann, Thomas V. Wiecki (PyMC Labs); Robbie Dow, Kli Pappas (Colgate-Palmolive Company). The collaboration between an AI/ML consulting firm (PyMC Labs) and a major consumer packaged goods corporation (Colgate-Palmolive) is a notable strength, grounding the research in a real-world industrial application.
- Journal/Conference: The paper is available on arXiv, which is a preprint server for academic articles. Preprints are not yet formally peer-reviewed. The abstract notes a potential publication at the 33rd ACM Conference on User Modeling, Adaptation and Personalization (UMAP '25).
- Publication Year: 2025 (the arXiv version is dated 2025-10-09).
- Abstract: The authors address the high cost and inherent biases of traditional consumer research by exploring the use of Large Language Models (LLMs) as "synthetic consumers." They identify a key problem: LLMs produce unrealistic response distributions when asked for direct numerical ratings (e.g., on a Likert scale). To solve this, they introduce Semantic Similarity Rating (SSR), a method where the LLM provides a textual response, which is then mapped to a Likert scale distribution based on its semantic similarity to predefined reference statements. Tested on a massive dataset of 57 real-world product surveys (9,300 human responses) from a leading personal care company, SSR achieves 90% of the human test-retest reliability for ranking products and generates realistic response distributions (Kolmogorov-Smirnov similarity > 0.85). A key benefit is that the synthetic respondents also generate rich qualitative feedback. The framework offers a scalable and interpretable alternative to traditional consumer research.
- Original Source Link:
2. Executive Summary
- Background & Motivation (Why): Companies spend billions annually on consumer research to test new product concepts. A central part of this research is measuring Purchase Intent (PI), typically on a 1-to-5 Likert scale. However, this process is expensive, slow, and suffers from human biases. LLMs offer a potential solution by simulating "synthetic consumers," but prior work has shown that when LLMs are asked to provide a numerical rating directly, they produce unrealistic distributions that are often too narrow and skewed, failing to capture the variance seen in human responses. The paper tackles this critical gap, questioning whether the failure lies with the LLMs themselves or with the method used to elicit responses.
- Main Contributions / Findings (What):
- A Novel Elicitation Method (SSR): The paper proposes Semantic Similarity Rating (SSR), a two-step process that avoids direct numerical rating. First, the LLM is prompted to give a free-text response expressing its purchase intent. Second, this text is converted into a numerical vector (an embedding), and its similarity (using cosine similarity) to predefined "anchor" statements for each point on the Likert scale is calculated. This results in a probability distribution over the Likert scale, preserving the ambiguity of the textual response.
- State-of-the-Art Performance on Real-World Data: On a large, proprietary dataset of 57 personal care product surveys, SSR successfully replicated human survey outcomes. It achieved 90% of the human test-retest reliability in ranking products by mean purchase intent and produced response distributions that were highly similar to human ones (KS similarity > 0.85).
- Importance of Demographic Conditioning: The study demonstrates that providing LLMs with demographic personas (age, income, etc.) is crucial. Without this conditioning, the model failed to produce a meaningful ranking of products, even though it could mimic the overall shape of the human response distribution.
- Rich Qualitative Insights: As a byproduct, the SSR method generates detailed textual rationales from the synthetic consumers, offering qualitative feedback that is often richer than the brief comments provided by human participants. This allows for deeper analysis of a product's perceived strengths and weaknesses.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Large Language Models (LLMs): These are advanced AI models (like GPT-4 and Gemini) trained on vast amounts of text data. They can understand context, generate human-like text, and perform complex reasoning tasks, including impersonating a persona.
- Likert Scale: A widely used psychometric scale in surveys where respondents specify their level of agreement or attitude towards a statement. The paper focuses on a 5-point scale for purchase intent, from "definitely not" (1) to "definitely yes" (5).
- Purchase Intent (PI): A consumer's self-reported likelihood of buying a product. It is a key metric in market research for forecasting demand and making product launch decisions.
- Text Embeddings & Cosine Similarity: Text embeddings are numerical representations of text in a high-dimensional space, where semantically similar texts are located closer to each other. Cosine similarity measures the cosine of the angle between two embedding vectors; a value of 1 means they are identical in orientation, 0 means they are orthogonal (unrelated), and -1 means they are opposite. The paper uses this to measure how close an LLM's textual response is to predefined reference statements (e.g., "I would probably buy it").
- Test-Retest Reliability: A measure of a survey or test's consistency. If the same test is given to the same people at two different times, the correlation between their scores is the test-retest reliability. The paper simulates this by splitting its human sample to establish a "performance ceiling" for how well any method could possibly correlate with the noisy human data.
- Kolmogorov-Smirnov (KS) Distance: A statistic used to quantify the distance between two probability distributions, defined as the maximum absolute difference between their cumulative distribution functions (CDFs). The paper uses 1 − KS distance as a similarity metric.
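To make the embedding-similarity idea concrete, here is a minimal sketch using toy vectors in place of real `text-embedding-3-small` outputs (which have far more dimensions); the anchor statements and all numeric values are illustrative, not from the paper.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "embeddings" (real embedding models use hundreds
# or thousands of dimensions).
response  = [0.9, 0.1, 0.0, 0.2]
anchor_hi = [1.0, 0.0, 0.0, 0.1]   # e.g. "I would definitely buy it"
anchor_lo = [0.0, 1.0, 0.3, 0.0]   # e.g. "I would definitely not buy it"

print(cosine_similarity(response, anchor_hi))  # close to 1 (similar)
print(cosine_similarity(response, anchor_lo))  # close to 0 (unrelated)
```

Because embeddings of semantically similar texts point in similar directions, the response scores high against the "definitely buy" anchor and low against the opposite one.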
- Previous Works: The authors situate their work within the growing field of using LLMs for social science and market research. They note that:
- Many previous studies tried direct numeric elicitation, asking LLMs to output a Likert score directly. This consistently failed, producing distributions with too little variance and a tendency to regress to the mean (e.g., always choosing '3' on a 1-5 scale).
- Some studies used textual responses but then mapped them back to a single number, losing valuable nuance.
- Others have shown that demographic conditioning (giving the LLM a persona) improves the alignment of synthetic responses with human subgroup data, but this alone did not solve the distribution problem.
- The paper positions itself as a zero-shot approach, meaning it does not require fine-tuning the LLM on existing survey data, making it more accessible and generalizable.
- Differentiation: The core innovation of this paper is the Semantic Similarity Rating (SSR) method. Unlike prior work, SSR does not force the LLM to choose a single number. Instead, it leverages the richness of a textual response and maps it to a full probability distribution over the Likert scale. This elegant approach solves the "unrealistic distribution" problem while retaining the qualitative insights from the text, all within a zero-shot framework.
4. Methodology (Core Technology & Implementation)
The paper's methodology centers on generating and evaluating synthetic consumer responses to product concepts.
- Data: The study uses a proprietary dataset from Colgate-Palmolive, consisting of 57 consumer surveys on different personal care product concepts.
- Participants: 9,300 unique U.S. participants in total, with 150-400 per survey.
- Demographics: Data on age, gender, location, income, and ethnicity was available for participants.
- Stimulus: Each survey presented a product concept via a slide containing a text description and often an image.
- Task: Participants rated their purchase intent on a 5-point Likert scale.
- Synthetic Response Generation: The authors created "synthetic consumers" by prompting an LLM (GPT-4o or Gemini-2.0-flash) with a demographic persona matching a real human participant and showing it the product concept. They evaluated three distinct strategies for generating a rating:
- Direct Likert Rating (DLR): The LLM is directly asked to output an integer from 1 to 5. This serves as the naive baseline.
- Follow-up Likert Rating (FLR): A two-step process:
- The LLM first generates a short, free-text statement of its purchase intent.
- A new instance of the same LLM, prompted to act as a "Likert rating expert," then reads this text and assigns a single integer rating (1-5).
- Semantic Similarity Rating (SSR): This is the paper's main proposed method.
- Step 1: Textual Elicitation. Like FLR, the LLM first generates a free-text response expressing its purchase intent.
- Step 2: Embedding. The textual response and a set of predefined reference statements (one for each Likert point, e.g., "I would definitely buy this" for rating 5) are converted into numerical vectors using OpenAI's text-embedding-3-small model.
- Step 3: Similarity Calculation. The cosine similarity is computed between the response embedding and each reference statement embedding.
- Step 4: Probability Mapping. The similarities are converted into a probability mass function (pmf) over the 5 Likert points. This mapping re-bases the scores by subtracting the minimum similarity, which prevents the distribution from being overly flat. The probability for rating $\ell$ is given by:

  $$p_i^{(R)}(\ell) \propto \left(s_{i\ell}^{(R)} - s_{\min,i}^{(R)}\right) + \epsilon\,\delta_{\ell,\ell_{\min}}$$

- Symbol Explanation:
  - $p_i^{(R)}(\ell)$: The probability that synthetic consumer $i$ gives rating $\ell$, based on reference set $R$.
  - $s_{i\ell}^{(R)}$: Cosine similarity between the response text of consumer $i$ and the reference statement for rating $\ell$ in set $R$.
  - $s_{\min,i}^{(R)}$: The lowest similarity score obtained across all 5 reference statements in set $R$. Subtracting this value re-bases the scores.
  - $\epsilon$: A small constant to ensure no probability is exactly zero (set to 0 in this study).
  - $\delta_{\ell,\ell_{\min}}$: The Kronecker delta, which is 1 if $\ell$ is the rating with the minimum similarity and 0 otherwise. The resulting values are normalized to sum to 1, creating a pmf. The paper averages results across 6 different sets of reference statements to improve robustness.
This process is visualized in the diagram below.
The image is a schematic showing how the semantic similarity rating (SSR) method maps the similarity between a response vector and reference statements in embedding space onto a probability distribution over Likert-scale responses.
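As a rough sketch of Steps 2-4 above, the similarity-to-pmf mapping can be implemented as follows; the similarity values and the function name are illustrative (ours, not the paper's), standing in for real embedding outputs.

```python
import numpy as np

def ssr_pmf(similarities, eps=0.0):
    """Map cosine similarities between one textual response and the 5
    reference statements onto a Likert pmf: re-base by subtracting the
    minimum similarity, optionally add eps at the minimum rating so no
    probability is exactly zero, then normalize to sum to 1."""
    s = np.asarray(similarities, dtype=float)
    shifted = s - s.min()
    shifted[np.argmin(s)] += eps  # Kronecker-delta term at the minimum
    return shifted / shifted.sum()

# Illustrative similarities of one response to the 5 reference statements.
sims = [0.30, 0.50, 0.70, 0.60, 0.40]
print(ssr_pmf(sims))  # most mass on rating 3, none on rating 1 (eps = 0)
```

With ε = 0 (as in the study), the least-similar rating receives zero probability; the paper further averages the resulting pmfs over 6 reference-statement sets for robustness.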
5. Experimental Setup
- Datasets: As described above, 57 real-world consumer surveys on personal care products with 9,300 total human responses.
- Evaluation Metrics: The authors use two primary metrics to judge the success of their synthetic panels.
  - Distributional Similarity (KS similarity): This metric assesses how closely the distribution of synthetic ratings matches the distribution of human ratings for each survey.
    - Conceptual Definition: It measures the "shape" of the response distribution. A high similarity means the synthetic panel produces the same proportions of 1s, 2s, 3s, 4s, and 5s as the human panel. It is based on the Kolmogorov-Smirnov (KS) test, which finds the maximum difference between two cumulative distribution functions.
- Mathematical Formula: The KS similarity is defined as 1 minus the KS distance:

  $$S_{\mathrm{KS}}^{(j)} = 1 - \sup_{\ell}\left|F_j^{\mathrm{hum}}(\ell) - F_j^{\mathrm{syn}}(\ell)\right|$$

- Symbol Explanation:
  - $F_j^{\mathrm{hum}}(\ell)$: The cumulative distribution function (CDF) of human responses for survey $j$. For a rating $\ell$, it is the proportion of responses less than or equal to $\ell$.
  - $F_j^{\mathrm{syn}}(\ell)$: The CDF of synthetic responses for survey $j$.
  - $\sup_{\ell}$: The supremum, or maximum value, of the absolute difference between the two CDFs over all possible ratings $\ell$. A KS distance of 0 indicates identical distributions. The paper reports the mean KS similarity across all 57 surveys.
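A minimal sketch of the KS similarity for two discrete Likert distributions, using made-up pmfs (the real computation runs over each survey's empirical human and synthetic distributions):

```python
import numpy as np

def ks_similarity(pmf_a, pmf_b):
    """1 minus the maximum absolute difference between the CDFs of two
    discrete Likert response distributions (1 minus the KS distance)."""
    cdf_a = np.cumsum(pmf_a)
    cdf_b = np.cumsum(pmf_b)
    return 1.0 - float(np.max(np.abs(cdf_a - cdf_b)))

# Illustrative response proportions for ratings 1..5 (not from the paper).
human = [0.05, 0.10, 0.25, 0.40, 0.20]
synth = [0.02, 0.08, 0.30, 0.42, 0.18]
print(ks_similarity(human, synth))  # close to 1: distributions match well
```

Identical distributions give a similarity of exactly 1; the further apart the CDFs drift at any rating, the lower the score.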
- Correlation Attainment ($\rho_a$): This metric evaluates how well the synthetic method ranks the 57 product concepts from least to most appealing, benchmarked against the noise inherent in human responses.
  - Conceptual Definition: Instead of just aiming for a perfect correlation of 1.0, this metric cleverly asks: "What percentage of the maximum possible correlation is our method achieving?" The maximum possible correlation is estimated by measuring the test-retest reliability of the human data itself.
  - Mathematical Formula:

    $$\rho_a = \frac{\rho\!\left(\bar{\mathbf{x}}^{\mathrm{hum}}, \bar{\mathbf{x}}^{\mathrm{syn}}\right)}{\mathbb{E}\!\left[\rho_{\mathrm{tt}}\right]}$$

  - Symbol Explanation:
    - $\bar{\mathbf{x}}^{\mathrm{hum}}$ and $\bar{\mathbf{x}}^{\mathrm{syn}}$: Vectors containing the mean purchase intent scores for all 57 products, for humans and synthetics, respectively.
    - $\rho(\bar{\mathbf{x}}^{\mathrm{hum}}, \bar{\mathbf{x}}^{\mathrm{syn}})$: The Pearson correlation between the human and synthetic mean PI rankings.
    - $\rho_{\mathrm{tt}}$: The simulated test-retest correlation of the human data. To calculate this, the human sample for each survey is randomly split into two halves ("test" and "control"), and the correlation of mean PIs is computed between these two splits. This gives an estimate of the "noise ceiling" of the data.
    - $\mathbb{E}[\rho_{\mathrm{tt}}]$: The expected value, approximated by averaging the correlations over 2,000 random splits.
    - $\rho_a$: Correlation attainment. A value of $\rho_a = 0.9$ means the synthetic method achieved 90% of the correlation that one would expect from a repeated human survey.
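The split-half noise ceiling and correlation attainment can be sketched on toy data as follows; all numbers, survey counts, and helper names are illustrative (the paper averages over 2,000 splits of the 57 real surveys).

```python
import numpy as np

rng = np.random.default_rng(0)

def pearson(a, b):
    """Pearson correlation between two vectors."""
    return float(np.corrcoef(a, b)[0, 1])

def split_half_reliability(responses_per_survey, n_splits=200):
    """Simulated test-retest ceiling: split each survey's sample in half,
    correlate the two vectors of per-survey mean PIs, average over splits."""
    corrs = []
    for _ in range(n_splits):
        means_a, means_b = [], []
        for resp in responses_per_survey:
            perm = rng.permutation(resp)
            half = len(perm) // 2
            means_a.append(perm[:half].mean())
            means_b.append(perm[half:].mean())
        corrs.append(pearson(means_a, means_b))
    return float(np.mean(corrs))

# Toy data: 10 "surveys", each with noisy 1-5 ratings around its own mean.
surveys = [np.clip(rng.normal(mu, 1.0, 200).round(), 1, 5)
           for mu in rng.uniform(2.5, 4.0, 10)]
ceiling = split_half_reliability(surveys)

# A hypothetical synthetic panel whose means track the human means closely.
human_means = [s.mean() for s in surveys]
synth_means = [m + rng.normal(0, 0.05) for m in human_means]
attainment = pearson(human_means, synth_means) / ceiling
print(round(ceiling, 3), round(attainment, 3))
```

Dividing by the ceiling rewards a method for matching human rankings up to the limit set by human sampling noise, rather than penalizing it for noise no method could overcome.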
- Baselines:
  - Direct Likert Rating (DLR): The most basic method of prompting for a number.
  - Follow-up Likert Rating (FLR): A more advanced baseline involving a text-to-number mapping by a second LLM call.
  - LightGBM: A powerful gradient-boosting machine learning model, trained on half of the survey data to predict responses on the other half, serving as a strong supervised learning baseline.
6. Results & Analysis
The paper's results demonstrate the clear superiority of the SSR method.
- Core Results: SSR Outperforms Baselines
- Direct Likert Rating (DLR): This method failed significantly. While it achieved a surprisingly decent correlation attainment (~80%), its response distributions were completely unrealistic. The models almost exclusively responded with '3' or '4', rarely using the extremes ('1' or '5'), leading to very poor distributional similarity. The high correlation was an artifact of small shifts away from the mean.
- Follow-up Likert Rating (FLR): This method improved on DLR but remained inferior to SSR, especially in distributional similarity.
- Semantic Similarity Rating (SSR): This method was the clear winner, excelling on both key metrics. It achieved a correlation attainment of ~90% and a very high mean distributional similarity (KS similarity > 0.85) with GPT-4o. This shows it can both rank products correctly and reproduce the shape of human response distributions.
The charts below (Figures 2 and 3 from the paper) vividly illustrate these differences for GPT-4o. SSR (iii) achieves both high correlation and high distributional similarity, whereas DLR (i) has poor distributions.
The image is Figure 2 of the paper, comparing real and synthetic survey results for GPT-4o at a given temperature. Panel A shows scatter plots of mean purchase intent for Direct Likert Rating (DLR), Follow-up Likert Rating (FLR), and Semantic Similarity Rating (SSR), with correlations ρ and confidence intervals. Panel B compares probability mass functions (pmfs) for several individual surveys, showing that the SSR distributions are closest to the real data.
The image is Figure 23, comparing the distributions of GPT-4o's text-elicited semantic similarity ratings (SSR) with follow-up Likert ratings across multiple surveys, grouped by whether demographic information was included, illustrating how closely SSR matches the real ratings.
- Ablations / Parameter Sensitivity:
- The Crucial Role of Demographics: A key experiment was running the SSR method without providing demographic information to the LLMs. The results were striking and counter-intuitive:
- Distributional similarity increased. The model became excellent at mimicking the average human response distribution (which was skewed positive).
- However, correlation attainment plummeted. Without personas, the LLM rated all products similarly and positively, failing to discriminate between good and bad concepts.
- Conclusion: Demographic conditioning is essential for the LLM to generate a meaningful signal for product ranking. It forces the model to consider how different types of people would react to a product, rather than just giving a generic, positive response.
- Demographic and Product Feature Analysis: As shown in Figure 4, the SSR method successfully replicated several nuanced trends from the human data:
- Age: Both humans and synthetic consumers showed lower purchase intent for the youngest and oldest age groups.
- Income: Both showed higher purchase intent for higher-income groups.
- Product Features: Both reacted similarly to different product categories, sources, and price tiers. This demonstrates that the LLMs were effectively using the persona information provided.
The image is a multi-panel chart comparing rating distributions from the SSR method and the traditional Follow-up Likert method across multiple surveys (x-axis: Likert rating; y-axis: probability density), contrasting real data with simulations run without demographic variables (w/o demographics), supporting the effectiveness of the SSR method.
- Comparison to Supervised ML: The zero-shot SSR method outperformed a supervised LightGBM model trained on in-domain survey data, achieving higher correlation attainment and distributional similarity. This is a powerful result, showing that the LLM's pre-trained world knowledge is more effective at this task than a supervised model trained only on survey features.
- Generalization to Other Questions: When tested on a different survey question ("How relevant was the concept?"), the framework also performed well, achieving correlation attainment of 82-91%, suggesting the method is broadly applicable.
Note: The paper refers to a Tab. 1 containing success metrics for all experiments. This table was not included in the provided document, but the key quantitative results are summarized in the text and figures.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully demonstrates that LLMs can be used as reliable synthetic consumers for product concept testing, provided the right elicitation method is used. The proposed Semantic Similarity Rating (SSR) method overcomes the major limitation of prior work—unrealistic response distributions—by eliciting free-text responses and mapping them probabilistically to a Likert scale. This zero-shot approach achieves high fidelity with human data in both product ranking (90% correlation attainment) and distributional shape (KS similarity > 0.85), while also providing valuable qualitative feedback.
- Limitations & Future Work: The authors responsibly outline the method's limitations and areas for future research:
- Dependence on Reference Statements: The quality of the SSR mapping depends on the manually created reference statements. Future work could explore dynamically generating or optimizing these statements.
- Incomplete Demographic Replication: While the models captured age and income effects well, they were less successful with gender and region. More research is needed to make persona conditioning fully reliable across all demographic subgroups.
- Domain Knowledge Boundary: The method's success is likely tied to the fact that personal care products are a common topic in the LLM's training data. It may not work as well for highly niche or novel product domains where the LLM lacks background knowledge.
- Future Directions: The authors suggest generalizing the method to other survey constructs, optimizing SSR parameters (like temperature), exploring more complex prompting pipelines, and developing hybrid models that combine SSR with light fine-tuning.
- Personal Insights & Critique:
- High Impact and Credibility: This is a high-impact paper due to its direct commercial relevance and rigorous validation. The collaboration with Colgate-Palmolive and the use of a large, real-world dataset lend it significant credibility.
- Methodological Rigor: The correlation attainment metric is an excellent contribution. By benchmarking against the inherent noise of human data (test-retest reliability), the authors provide a much more honest and insightful evaluation of their model's performance than a simple correlation score.
- Insightful Trade-off: The discovery that removing demographics improves distributional similarity while destroying ranking ability is a fascinating and crucial insight. It highlights a fundamental trade-off: a model can be tuned either to mimic an aggregate human response pattern or to discriminate between items like an individual would, but doing both perfectly is a challenge.
- Practical Implications: If these results hold up, the SSR framework could revolutionize early-stage market research. It would allow companies to screen a vast number of product ideas quickly and cheaply, reserving expensive human panels for validating only the most promising candidates.
- A Note on the Acknowledgements: The paper humorously acknowledges "ChatGPT-5" for help with writing. Given the paper's 2025 publication date, this is likely a forward-looking joke, but it's an unusual and noteworthy inclusion in a formal research paper.