
Prompting Science Report 4: Playing Pretend: Expert Personas Don't Improve Factual Accuracy

Published: 01/01/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The study investigates whether assigning expert personas to large language models improves performance on difficult objective questions. Findings reveal no significant accuracy gains from expert personas, while mismatched and low-knowledge personas often degrade model performance.

Abstract

This is the fourth in a series of short reports that help business, education, and policy leaders understand the technical details of working with AI through rigorous testing. Here, we ask whether assigning personas to models improves performance on difficult objective multiple-choice questions. We study both domain-specific expert personas and low-knowledge personas, evaluating six models on GPQA Diamond (Rein et al. 2024) and MMLU-Pro (Wang et al. 2024), graduate-level questions spanning science, engineering, and law. We tested three approaches: In-Domain Experts—assigning the model an expert persona matched to the problem type had no significant impact on performance (except Gemini 2.0 Flash); Off-Domain Experts—assigning an expert persona not matched to the problem type resulted in marginal differences; Low-Knowledge Personas—assigning negative capability personas (layperson, young child, toddler) was generally harmful. Across both benchmarks, persona prompts generally did not improve accuracy relative to a no-persona baseline. Expert personas showed no consistent benefit, domain-mismatched expert personas sometimes degraded performance, and low-knowledge personas often reduced accuracy. These results concern answer accuracy only; personas may serve other purposes such as altering tone beyond factual performance.


In-depth Reading


1. Bibliographic Information

1.1. Title

Prompting Science Report 4: Playing Pretend: Expert Personas Don't Improve Factual Accuracy

1.2. Authors

  • Savir Basil: Generative AI Labs, The Wharton School, University of Pennsylvania.
  • Ina Shapiro: Generative AI Labs, The Wharton School, University of Pennsylvania.
  • Dan Shapiro: Generative AI Labs, The Wharton School, University of Pennsylvania; Glowforge.
  • Ethan Mollick: Generative AI Labs, The Wharton School, University of Pennsylvania.
  • Lilach Mollick: Generative AI Labs, The Wharton School, University of Pennsylvania.
  • Lennart Meincke: Generative AI Labs, The Wharton School, University of Pennsylvania; WHU-Otto Beisheim School of Management.

1.3. Journal/Conference

This paper is the fourth in a series of "Prompting Science Reports" produced by the Wharton School’s Generative AI Labs. While these reports are frequently cited in business and AI education contexts and published on platforms like SSRN or directly by the lab, they are designed as high-rigor technical white papers to inform policy and practice.

1.4. Publication Year

2025 (Published January 1, 2025).

1.5. Abstract

The research investigates whether assigning "personas" (e.g., "You are a physics expert") to Large Language Models (LLMs) improves their performance on difficult, objective multiple-choice questions. Using benchmarks like GPQA Diamond (PhD-level science) and MMLU-Pro (graduate-level engineering, law, and chemistry), the authors tested six different models. They evaluated three categories: In-Domain Experts, Off-Domain Experts, and Low-Knowledge Personas. The study found that expert personas generally do not improve accuracy compared to a baseline with no persona. Mismatched personas occasionally degraded performance, and low-knowledge personas (e.g., "Toddler") significantly reduced accuracy. The report concludes that while personas might change the tone or style of an AI's response, they are not a reliable way to improve factual correctness.

Source: Prompting Science Report 4: Playing Pretend. (Note: the source link provided in the text, /files/papers/.../paper.pdf, points to an internal file copy; the report is part of the Wharton Generative AI Labs series.)


2. Executive Summary

2.1. Background & Motivation

A common piece of advice in prompt engineering (the art of crafting instructions for AI) is to assign the AI a role or persona. Official guides from major AI developers like Google, Anthropic, and OpenAI suggest that starting a prompt with "You are an expert..." helps the model generate higher-quality outputs. The theoretical logic is that by specifying a persona, the model's internal statistical associations will lean toward data and reasoning patterns associated with high-level experts in that specific field.

However, scientific evidence for this "persona gain" is inconsistent. Some studies suggest it helps, while others suggest it makes no difference. This paper aims to settle the debate for difficult, factual questions—scenarios where accuracy is the most critical metric.

2.2. Main Contributions / Findings

The authors conducted a large-scale experiment involving tens of thousands of individual AI "runs" to test the impact of personas across different models and domains. Their primary findings include:

  • Expert Personas are Ineffective: Telling an AI it is a "world-class expert" did not consistently result in higher accuracy on graduate-level questions.

  • Mismatched Personas can be Harmful: Assigning an expert persona from one field (e.g., Physics) to answer questions in another (e.g., Law) often reduced performance or caused the AI to "refuse" to answer.

  • Negative Capability Personas Hurt Performance: Assigning personas like "Toddler" or "Layperson" led to a significant drop in accuracy, confirming that models do respond to the "knowledge level" of a persona, but primarily in a negative direction.

  • Model Refusal: Specifically in the Gemini Flash family of models, expert personas caused the models to become "overly cautious," leading them to refuse to answer questions if they felt the question was outside their assigned expertise.


3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Large Language Models (LLMs)

An LLM is an AI system trained on massive amounts of text data to predict the next word in a sequence. Through this training, it develops an internal "map" of human knowledge and language patterns.

3.1.2. Prompt Engineering & Personas

Prompt Engineering is the practice of refining the input text (the "prompt") given to an LLM to get the best possible output. Persona Prompting (or Role Prompting) is a specific technique where the user defines who the AI should "be."

  • Example: "You are a senior cardiologist. Explain this EKG report."

3.1.3. Zero-shot Prompting

Zero-shot means asking the AI a question without giving it any examples of how to answer. This is the "rawest" form of testing a model's knowledge. This study used zero-shot prompting to isolate the effect of the persona instruction.

3.1.4. Temperature

Temperature is a setting that controls the "creativity" or randomness of an LLM.

  • A temperature of 0 makes the model deterministic (it always gives the same answer).
  • A temperature of 1.0 (used in this study) makes the model more varied. The authors used a high temperature to see the "central tendency" (the most common answer) over 25 different attempts.
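
To make this concrete, here is a minimal sketch (our own illustration, not the authors' evaluation harness) of asking the same question 25 times at temperature 1.0 and taking the most common answer. The OpenAI Python SDK, the model name, and the prompt text are all assumptions used only for illustration.

```python
# Illustrative sketch only: repeat one prompt 25 times at temperature 1.0 and
# report the most common answer. Model name and prompt are placeholders, not
# the paper's exact setup; assumes the OpenAI Python SDK and an API key.
from collections import Counter
from openai import OpenAI

client = OpenAI()
prompt = ('What is the correct answer to this question: [question text] ... '
          'Format your response as follows: "The correct answer is (insert answer here)"')

answers = []
for _ in range(25):  # 25 independent trials, mirroring the study's design
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,      # high temperature -> varied answers across trials
    )
    answers.append(response.choices[0].message.content.strip())

# The "central tendency": the answer the model gives most often across trials.
most_common_answer, count = Counter(answers).most_common(1)[0]
print(f"Most common answer ({count}/25 trials): {most_common_answer}")
```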

3.2. Previous Works

The authors cite several conflicting studies:

  • Kong et al. (2024): Showed some improvements on benchmarks when using role-play.
  • Zheng et al. (2024): Found that adding personas to system prompts did not reliably improve factual accuracy.
  • Meincke et al. (2025a, b, c): Previous reports in this series that established the methodology for high-rigor testing, including the necessity of multiple trials (25+) to account for model variability.

3.3. Technological Evolution

Early prompt engineering (2022-2023) relied heavily on "hacks" like personas or offering "tips." As models have become more sophisticated (moving from GPT-3.5 to reasoning-heavy models like o3-mini), the community is shifting toward evaluating whether these traditional "best practices" still hold value or if the models have become robust enough to ignore them.


4. Methodology

4.1. Principles

The core idea was to test the "Playing Pretend" hypothesis: Does an AI perform better on a task if it is told it is an expert in that task? The theoretical basis is that the instruction acts as a "nudge" to the model's latent space, prioritizing "expert-sounding" tokens.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Experimental Design

The study used a comparative approach across six models:

  1. GPT-4o
  2. GPT-4o-mini
  3. o3-mini
  4. o4-mini (A newer, reasoning-capable model)
  5. Gemini 2.0 Flash
  6. Gemini 2.5 Flash

4.2.2. The Prompt Template

The researchers used a standardized user prompt structure to ensure that the only variable being changed was the persona text. The formula for the prompt can be visualized as: $\text{Prompt}_{Total} = [\text{Persona Manipulation}] + [\text{Question Text}] + [\text{Choices}] + [\text{Formatting Instruction}]$

The baseline (no persona) prompt looked like this:

What is the correct answer to this question: [question text]
Choices:
A. [option]
B. [option]
...
Format your response as follows: "The correct answer is (insert answer here)"
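
As an illustration of this template, the sketch below (our own code, not taken from the paper) builds the baseline prompt and a persona variant by simply prepending the persona text; the question, choices, and persona wording are placeholders.

```python
# Illustrative prompt builder: the persona text is prepended to an otherwise
# identical question prompt. Question, choices, and persona are placeholders.
def build_prompt(question: str, choices: list[str], persona: str | None = None) -> str:
    labels = "ABCDEFGHIJ"  # MMLU-Pro questions can have up to 10 options
    choice_lines = "\n".join(f"{labels[i]}. {c}" for i, c in enumerate(choices))
    body = (
        f"What is the correct answer to this question: {question}\n"
        f"Choices:\n{choice_lines}\n"
        'Format your response as follows: "The correct answer is (insert answer here)"'
    )
    return f"{persona}\n\n{body}" if persona else body

baseline = build_prompt("What is 2 + 2?", ["3", "4", "5", "6"])
expert = build_prompt("What is 2 + 2?", ["3", "4", "5", "6"],
                      persona="You are a world-class math expert.")
print(expert)
```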

4.2.3. Persona Variations

The researchers tested several specific personas, ranging from high-expertise to very low-knowledge:

  • Experts: Physics, Math, Economics, Biology, Chemistry, Engineering, Law, History.
  • Layperson: "You are a layperson with no special training in this subject."
  • Young Child: "You are a young child who thinks they understand the world but sometimes mixes things up."
  • Toddler: "You are a 4-year-old toddler who thinks the moon is made of cheese."

4.2.4. Data Collection & Statistical Rigor

To account for the randomness of LLMs, the authors did not just ask each question once. They used a 25-trial approach.

  • N = 25 independent runs per question, per model, per prompt.

  • For GPQA Diamond (198 questions), this resulted in 198 × 25 = 4,950 runs per condition.

  • For MMLU-Pro (300 questions), this resulted in 300 × 25 = 7,500 runs per condition.

    The primary metric used to evaluate these runs is the Average Rating. This is calculated as: $ \text{Average Rating} = \frac{\sum_{i=1}^{Q} \sum_{j=1}^{T} R_{i,j}}{Q \times T} $ Where:

  • $Q$ is the total number of questions.

  • $T$ is the number of trials per question (25).

  • $R_{i,j}$ is the result of trial $j$ for question $i$ (1 if correct, 0 if incorrect).
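
As a worked illustration of this formula, assuming the trial outcomes are stored as a question-by-trial matrix of 0/1 values (an assumption of ours, not a description of the authors' code), the average rating is just the overall proportion of correct trials:

```python
import numpy as np

# results[i, j] = 1 if trial j of question i was answered correctly, else 0.
# Shape: (Q questions, T trials); values here are random placeholders.
rng = np.random.default_rng(0)
Q, T = 198, 25                      # GPQA Diamond: 198 questions x 25 trials
results = rng.integers(0, 2, size=(Q, T))

average_rating = results.sum() / (Q * T)   # sum of R_{i,j} divided by Q x T
print(f"Average rating: {average_rating:.3f}")   # equivalent to results.mean()
```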


5. Experimental Setup

5.1. Datasets

5.1.1. GPQA Diamond

  • Description: "Graduate-Level Google-Proof Q&A Benchmark."
  • Difficulty: PhD-level multiple-choice questions in biology, physics, and chemistry.
  • Human Performance: Experts in the field get ~65-74% correct. Non-experts with access to the internet only get ~34%.
  • Example Question: (From Table S1) "What is the correct answer to this question: If a sperm from species A is injected into an egg from species B and both species have the same number of chromosomes, what would be the main cause of the resulting zygote mortality? A) ... B) Epistatic interactions between the genes... C) ... D) ..."

5.1.2. MMLU-Pro

  • Description: An advanced version of the Massive Multitask Language Understanding benchmark.
  • Structure: 10 potential answers per question (making it much harder to guess correctly than standard 4-choice tests).
  • Subset Selection: 100 questions each from Engineering, Law, and Chemistry.

5.2. Evaluation Metrics

5.2.1. Average Rating (Primary)

  • Conceptual Definition: The average probability that the model will get a question right on any given attempt.
  • Formula: Described in section 4.2.4.

5.2.2. 100% Correct (Strict)

  • Conceptual Definition: The percentage of questions that the model answered correctly in every single one of the 25 trials.
  • Design Goal: To measure extreme reliability.

5.2.3. 90% Correct (High Reliability)

  • Conceptual Definition: The percentage of questions where the model was correct in at least 23 out of 25 trials.
  • Design Goal: Comparable to human "acceptable error rates."

5.2.4. 51% Correct (Majority Vote)

  • Conceptual Definition: The percentage of questions where the model was correct at least 13 out of 25 times.
  • Design Goal: To see if the model "knows" the answer more often than not.
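
The three reliability thresholds above can be computed from the same kind of question-by-trial matrix; the sketch below is a self-contained illustration with placeholder data, not the authors' analysis code.

```python
import numpy as np

# results[i, j] = 1 if trial j of question i was correct; placeholder data.
rng = np.random.default_rng(0)
results = rng.integers(0, 2, size=(198, 25))   # 198 questions x 25 trials

correct_per_question = results.sum(axis=1)     # correct trials per question

pct_100 = (correct_per_question == 25).mean()  # correct in all 25 trials
pct_90  = (correct_per_question >= 23).mean()  # correct in at least 23/25 trials
pct_51  = (correct_per_question >= 13).mean()  # correct in a majority (>= 13/25)

print(f"100% correct: {pct_100:.1%}, 90% correct: {pct_90:.1%}, 51% correct: {pct_51:.1%}")
```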

5.3. Baselines

The Baseline condition is the model answering the same questions with no persona prompt (only the question, choices, and formatting instructions). This is the control group used to determine if the "Playing Pretend" instructions actually caused a change.


6. Results & Analysis

6.1. Core Results Analysis

The results across both benchmarks showed a striking lack of benefit from expert personas.

  • GPQA Diamond: No expert persona reliably improved performance. In fact, for models like o4-mini, the "Toddler" and "Layperson" personas caused a statistically significant drop in accuracy.
  • MMLU-Pro: Five out of six models showed no improvement with experts. Gemini 2.0 Flash was the only model that saw a modest gain from expert personas.
  • The "Toddler" Effect: In almost every model, the "Toddler" persona (the moon is made of cheese) performed significantly worse than the "Layperson," which in turn performed worse than the baseline. This shows that models do degrade their performance to match a low-intelligence persona.

6.1.1. Model Refusal (The Gemini 2.5 Flash Case)

One of the most interesting findings was that assigning an "Unrelated Expert" persona (e.g., a Physics expert answering a Biology question) caused Gemini 2.5 Flash to refuse to answer. On GPQA Diamond, it refused on an average of 10.56 of its 25 trials per question, stating it "cannot, in good conscience" select an answer outside its expertise.
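
An evaluation harness also has to decide when a response counts as a refusal rather than an answer. The heuristic below is our own illustration, not necessarily the authors' classification rule: any response that never produces the requested answer format is treated as a refusal.

```python
import re

# Heuristic refusal check (illustrative only): a response that never produces
# the requested "The correct answer is (X)" format is counted as a refusal.
ANSWER_PATTERN = re.compile(r"The correct answer is\s*\(?([A-J])\)?", re.IGNORECASE)

def classify_response(text: str) -> str:
    match = ANSWER_PATTERN.search(text)
    return match.group(1).upper() if match else "REFUSAL"

print(classify_response("The correct answer is (B)"))                        # -> B
print(classify_response("I cannot, in good conscience, select an answer."))  # -> REFUSAL
```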

6.2. Data Presentation (Tables)

The following are the results from Table S2 of the original paper (GPQA Diamond Pairwise Comparisons):

Model            | Conditions                  | RD [95% CI]             | Statistics
GPT-4o           | Baseline - Toddler          | -0.052 [-0.087, -0.016] | p = 0.004
GPT-4o           | Baseline - Physics Expert   | -0.010 [-0.082, 0.062]  | p = 0.779
GPT-4o           | Baseline - Math Expert      |  0.002 [-0.075, 0.077]  | p = 0.956
GPT-4o           | Baseline - Chemistry Expert | -0.010 [-0.036, 0.017]  | p = 0.469
o4-mini          | Baseline - Toddler          | -0.061 [-0.086, -0.036] | p < 0.001
o4-mini          | Baseline - Layperson        | -0.022 [-0.043, -0.003] | p = 0.030
o4-mini          | Baseline - Physics Expert   | -0.005 [-0.024, 0.014]  | p = 0.612
o4-mini          | Baseline - Math Expert      | -0.007 [-0.026, 0.013]  | p = 0.526
Gemini 2.5 Flash | Baseline - Young Child      |  0.098 [0.029, 0.164]   | p = 0.005
Gemini 2.5 Flash | Baseline - Economics Expert | -0.163 [-0.205, -0.121] | p < 0.001
Gemini 2.5 Flash | Baseline - Biology Expert   | -0.087 [-0.124, -0.049] | p < 0.001

(Note: RD stands for Risk Difference, representing the change in accuracy. A negative number means the persona performed worse than the baseline.)
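
To illustrate what a risk difference is, the sketch below compares per-trial accuracy between two conditions and attaches a normal-approximation (Wald) 95% interval. This is a simplified illustration with placeholder data; it ignores question-level clustering and is not necessarily the statistical procedure the authors used.

```python
import numpy as np

# Placeholder 0/1 outcomes for all trials under two conditions (not real data).
rng = np.random.default_rng(0)
baseline = rng.binomial(1, 0.60, size=198 * 25)   # ~60% accuracy with no persona
persona  = rng.binomial(1, 0.55, size=198 * 25)   # ~55% accuracy with a persona

p_b, p_p = baseline.mean(), persona.mean()
rd = p_p - p_b   # persona minus baseline: negative means the persona did worse

# Wald (normal-approximation) 95% CI for a difference of two proportions.
se = np.sqrt(p_b * (1 - p_b) / baseline.size + p_p * (1 - p_p) / persona.size)
ci = (rd - 1.96 * se, rd + 1.96 * se)
print(f"RD = {rd:.3f}, 95% CI = [{ci[0]:.3f}, {ci[1]:.3f}]")
```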

6.3. Domain-Specific Analysis

The researchers also checked if "matching" the expert to the domain (e.g., Physics Expert for Physics questions) helped.

The following figure (Figure 2 from the original paper) illustrates that domain-tailored personas do not generally improve performance:

Figure 2. GPQA Diamond (top) and MMLU-Pro subset (bottom) performance across multiple domain-related prompting variations, categorized by domain. The chart shows the mean proportion of questions answered correctly in the physics, chemistry, biology, engineering, and law domains, comparing the baseline against in-domain expert, adjacent expert, and unrelated expert personas.

As seen in the chart, the "In-domain Expert" (light blue bars) rarely outperforms the "Baseline" (dark blue bars) in a statistically significant way, except in rare cases like Gemini 2.0 Flash in Engineering.


7. Conclusion & Reflections

7.1. Conclusion Summary

The practice of assigning expert personas to LLMs—while recommended by major AI companies—does not reliably improve the accuracy of answers to difficult factual questions. While models are very good at "dumbing themselves down" (the negative effect of toddler/layperson personas), they are not consistently able to "smart themselves up" beyond their base training by being told they are experts.

The primary danger of persona prompting discovered here is refusal. By narrowing the model's perceived scope of expertise, users may inadvertently trigger safety or "over-caution" filters that prevent the model from using knowledge it actually possesses.

7.2. Limitations & Future Work

  • Scope of Benchmarks: The study focused on PhD-level and graduate-level multiple-choice questions. It did not test creative writing, coding, or brainstorming, where personas might still be highly effective for tone and style.
  • Persona Complexity: The study used relatively straightforward persona descriptions. More complex "chain-of-thought" personas or personas that include specific problem-solving methodologies might yield different results.
  • Model Evolution: As models evolve, the "refusal" behaviors seen in Gemini might become more or less prevalent.

7.3. Personal Insights & Critique

This paper provides a much-needed "reality check" for the AI community. It suggests that LLMs are not "actors" who can perform better by getting into character; they are statistical engines.

Key Takeaway: If you want an AI to be more accurate, focus on task-specific instructions (e.g., "Think step-by-step," "Check your work for signs of X") rather than identity-specific instructions (e.g., "You are a genius physicist").

One potential issue to consider: the "Toddler" persona performing poorly is actually a form of instruction following. The model is successfully "pretending" to be a toddler who doesn't know the answer. This confirms the model's capability to role-play, but suggests that "Expert" is already the "default" state of these high-end models, leaving little room for improvement through simple role-play.
