PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time
TL;DR Summary
PersonaAgent introduces a test-time personalized LLM agent combining memory and action modules with user-preference alignment, enabling dynamic, tailored responses and outperforming existing methods in real-world applications.
Abstract
Large Language Model (LLM) empowered agents have recently emerged as advanced paradigms that exhibit impressive capabilities in a wide range of domains and tasks. Despite their potential, current LLM agents often adopt a one-size-fits-all approach, lacking the flexibility to respond to users' varying needs and preferences. This limitation motivates us to develop PersonaAgent, the first personalized LLM agent framework designed to address versatile personalization tasks. Specifically, PersonaAgent integrates two complementary components - a personalized memory module that includes episodic and semantic memory mechanisms; a personalized action module that enables the agent to perform tool actions tailored to the user. At the core, the persona (defined as a unique system prompt for each user) functions as an intermediary: it leverages insights from personalized memory to control agent actions, while the outcomes of these actions in turn refine the memory. Based on the framework, we propose a test-time user-preference alignment strategy that simulates the latest n interactions to optimize the persona prompt, ensuring real-time user preference alignment through textual loss feedback between simulated and ground-truth responses. Experimental evaluations demonstrate that PersonaAgent significantly outperforms other baseline methods by not only personalizing the action space effectively but also scaling during test-time real-world applications. These results underscore the feasibility and potential of our approach in delivering tailored, dynamic user experiences.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time
1.2. Authors
Weizhi Zhang, Xinyang Zhang, Chenwei Zhang, Liangwei Yang, Jingbo Shang, Zhepei Wei, Henry Peng Zou, Zijie Huang, Zhengyang Wang, Yifan Gao, Xiaoman Pan, Lian Xiong, Jingguo Liu, Philip S. Yu, Xian Li
Affiliations: The authors are primarily affiliated with Amazon and the University of Illinois Chicago, with additional affiliations from the University of California San Diego and the University of Virginia. This blend of industry (Amazon) and academia suggests a strong foundation in both practical application and theoretical research in large language models and artificial intelligence.
1.3. Journal/Conference
The paper is published as a preprint on arXiv, submitted on June 6, 2025. As a preprint, it indicates that the work is yet to undergo full peer review for formal publication in a conference or journal. However, arXiv is a highly respected platform for disseminating cutting-edge research in fields like AI, allowing for rapid sharing and feedback within the scientific community. Many influential papers are first released on arXiv.
1.4. Publication Year
2025
1.5. Abstract
This paper introduces PersonaAgent, a novel framework for personalized Large Language Model (LLM) agents, designed to overcome the one-size-fits-all limitation of existing LLM agents. PersonaAgent integrates two core components: a personalized memory module (comprising episodic and semantic memory) and a personalized action module that enables user-tailored tool actions. A persona, defined as a unique system prompt for each user, acts as an intermediary, utilizing memory insights to guide actions and refining memory based on action outcomes. The framework proposes a test-time user-preference alignment strategy which optimizes the persona prompt by simulating recent interactions. This optimization uses textual loss feedback to align simulated responses with ground-truth user preferences in real-time. Experimental evaluations across four personalization tasks demonstrate that PersonaAgent significantly outperforms baseline methods, effectively personalizing the action space and scaling during real-world applications, thus enabling tailored and dynamic user experiences.
1.6. Original Source Link
Official Source: https://arxiv.org/abs/2506.06254 PDF Link: https://arxiv.org/pdf/2506.06254v1.pdf Publication Status: Preprint (on arXiv).
2. Executive Summary
2.1. Background & Motivation
The field of Artificial Intelligence (AI) has long strived for both superior intelligence and enhanced personalization. Large Language Models (LLMs) and LLM-empowered agents have significantly advanced superior intelligence by demonstrating impressive capabilities in reasoning, language comprehension, and instruction following, often by integrating external tools, memory mechanisms, and goal-directed reasoning.
However, a critical limitation of current LLM agents is their one-size-fits-all approach. They are typically trained on generic datasets and armed with general tools, making them inflexible in adapting to the varying needs and preferences of individual users. To truly harness the potential of these intelligent systems in everyday human contexts, they must be capable of adapting tailored behaviors and interactions to cater to different users. This lack of personalization leads to less relevant responses, reduced user engagement, and a failure to establish trust through tailored interactions.
The core problem the paper aims to solve is the absence of effective, scalable, and dynamic personalization in LLM agents. Prior attempts at personalization, such as human-preference alignment (e.g., RLHF), offer generalized improvements but fall short of individual user preference alignment. User-specific fine-tuning provides individual personalization but faces computational complexity and scalability issues in real-world scenarios. Non-parametric personalization workflows (e.g., RAG) utilize external data but rely on fixed retrieval or summarization, limiting their adaptability in complex situations requiring continuous understanding. The paper seeks to bridge this gap by developing an agent framework that can dynamically utilize personal data and adapt to evolving user preferences in real-time.
The paper's entry point or innovative idea is the concept of a PersonaAgent that leverages a dynamically evolving persona (a unique system prompt per user) as an intermediary between personalized memory and personalized actions, optimized through a test-time user-preference alignment strategy. This allows for real-time adaptation and fine-grained personalization without the computational overhead of constant retraining.
2.2. Main Contributions / Findings
The paper makes several significant contributions:
- First Personalized LLM Agent Framework: Introduces PersonaAgent, the first framework designed for versatile personalization tasks within a unified memory-action architecture. This framework directly addresses the one-size-fits-all limitation of current LLM agents.
- User-Specific Persona as Intermediary: Proposes the concept of a user-specific persona (a dynamic system prompt) that acts as a central mediator, bridging the gap between the personalized memory and action modules. This persona enforces personalization over the action space and guides action decisions at every step, allowing the agent to adapt its behavior to individual user contexts and preferences.
- Novel Test-Time User Preference Alignment Strategy: Introduces a novel strategy for test-time user-preference alignment. This method optimizes the persona prompt by simulating recent interactions and minimizing textual loss (discrepancies between simulated agent responses and ground-truth user responses). This enables seamless, real-time adaptation to evolving user preferences without computationally expensive retraining.
- Demonstrated State-of-the-Art Performance: Conducted comprehensive experiments across four diverse personalization tasks (Personalized Citation Identification, Movie Tagging, News Categorization, and Product Rating). PersonaAgent consistently achieved state-of-the-art results, significantly outperforming non-personalized, personalized-workflow, and general agentic baselines.
- Validation of Framework Components and Scalability: Through ablation studies, the paper confirmed the critical contribution of each component (personalized memory, action, and alignment) to the overall system's success. Persona analysis (t-SNE visualization and Jaccard similarity) demonstrated the effectiveness of persona optimization in capturing unique user traits. Furthermore, test-time scaling experiments and evaluation across different LLM backbones proved the framework's robustness, efficiency, and ability to capture nuanced, evolving user preferences even with varying model capabilities.

These findings collectively demonstrate the feasibility and potential of PersonaAgent in delivering tailored, dynamic, and engaging user experiences in real-world applications.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand PersonaAgent, a foundational grasp of Large Language Models (LLMs), LLM Agents, Memory Mechanisms, Prompt Engineering, and Personalization is essential.
- Large Language Models (LLMs): These are advanced artificial intelligence models, such as GPT (Generative Pre-trained Transformer) or LLaMa, trained on vast amounts of text data to understand, generate, and respond to human language. They exhibit emergent capabilities like reasoning, comprehension, and instruction following. They are typically autoregressive models, meaning they predict the next token in a sequence based on previous tokens.
- LLM Agents: Moving beyond standalone LLMs, an LLM agent is a system that integrates an LLM with external components to enable more autonomous and complex behaviors. These components often include:
  - External Tools: Mechanisms to interact with the external world (e.g., web search, calculators, databases).
  - Memory: Systems to retain information over time, allowing for context-aware and consistent interactions.
  - Planning/Reasoning: The ability to break down complex goals into sub-tasks and decide on a sequence of actions. An LLM agent aims to act more like a human, interacting with its environment to achieve goals.
- Personalization: In the context of AI, personalization refers to tailoring an AI's behavior, responses, or recommendations to the individual characteristics, preferences, and historical interactions of a specific user. The goal is to make the AI more relevant, engaging, and useful for that particular user.
- Memory Mechanisms in AI: Inspired by human cognition, AI systems can incorporate different types of memory:
  - Episodic Memory: Stores specific events, experiences, and interactions along with their context (e.g., "what happened, when, and where"). It is like recalling a specific moment from one's past.
  - Semantic Memory: Stores generalized knowledge, facts, concepts, and abstract understanding, independent of specific events. It is like knowing what a "dog" is, rather than remembering a specific instance of seeing a dog.
- Prompt Engineering: The art and science of crafting effective inputs (prompts) to guide an LLM to generate desired outputs. A system prompt is a special type of prompt given to an LLM at the beginning of a conversation to set its role, behavior, or constraints for the entire interaction. A persona in this context is a system prompt designed to imbue the LLM with a specific character or set of preferences.
- Retrieval-Augmented Generation (RAG): A technique where an LLM's generation is augmented by retrieving information from an external knowledge base. When a query is made, relevant documents or data snippets are first retrieved, and the LLM then uses this retrieved information as additional context to generate a more informed and grounded response.
- Reinforcement Learning from Human Feedback (RLHF): A training paradigm for LLMs where human preferences (e.g., rankings of model responses) are used to train a reward model. This reward model then guides a reinforcement learning algorithm to fine-tune the LLM, aligning its behavior more closely with human values and preferences.
3.2. Previous Works
The paper contextualizes PersonaAgent by discussing previous approaches to personalization intelligence, categorizing them and highlighting their limitations.
- Human-Preference Aligned LLMs:
  - Examples: Supervised fine-tuning (SFT) (Zhang et al., 2023) and Reinforcement Learning from Human Feedback (RLHF) (Schulman et al., 2017; Rafailov et al., 2023).
  - Concept: These methods aim to align LLMs with human preferences to make them more natural and helpful. RLHF, in particular, trains a reward model from human feedback on pairs of responses and then uses this reward model to fine-tune the LLM via Proximal Policy Optimization (PPO) or similar algorithms.
  - Limitation: While successful in improving general instruction following, they achieve population-level preference alignment, not individual-level personalization. They provide a general alignment that does not adapt to specific users' unique needs.
- User-Specific Fine-Tuning:
  - Examples: Parameter-efficient fine-tuning (PEFT) methods (Tan et al., 2024b,a).
  - Concept: These approaches attempt to personalize LLMs by adjusting model parameters specifically for individual users, for example by fine-tuning a small portion of the model or using adapters.
  - Limitation: They face significant scalability hurdles. The computational complexity increases linearly with the number of users, making them impractical for large-scale deployments. Frequent re-tuning for new interactions further exacerbates computational demands and latency.
- Personalized LLM Workflows:
  - Examples: Retrieval-Augmented Generation (RAG) (Salemi et al., 2024b,a) and Profile-Augmented Generation (PAG) (Richardson et al., 2023).
  - Concept: These non-parametric methods incorporate external personalized user data (e.g., user profiles or past interactions) into the LLM's generation process, often by retrieving relevant information and prepending it to the prompt.
  - Limitation: They typically follow a fixed pipeline and rely on retrieving only a limited number of relevant interactions or trivial summarization of user data. This prevents comprehensive and adaptive personalization, especially in complex scenarios requiring holistic understanding and continuous adaptation.
- General LLM Agents:
  - Examples: ReAct (Yao et al., 2023b) and MemBank (Zhong et al., 2024).
  - Concept: These agents integrate LLMs with tools and memory to enable complex task execution and reasoning. ReAct interweaves Reasoning (thinking steps) and Acting (tool use) to solve problems. MemBank explicitly incorporates a long-term memory module for task generalization.
  - Limitation: They are designed for general-purpose tasks and adopt a one-size-fits-all approach, lacking mechanisms to explicitly adapt their reasoning or tool use to individual user preferences. While they have agentic intelligence, they are weak in personal data utilization and real-time preference alignment.
- Personalization of LLM Agents for Specific Domains:
  - Examples: Agents for long-term dialogues (Li et al., 2024), personalized web agents (Cai et al., 2024), medical assistants (Zhang et al., 2024b), conversational health agents (Abbasian et al., 2023), and recommendation agents (Wang et al., 2024b; Zhang et al., 2024a).
  - Concept: These agents are tailored with specialized memory modules, data integration, or knowledge bases for particular application domains.
  - Limitation: Their domain-specific design significantly limits their versatility and generalizability across diverse personalization tasks. They lack a unified, adaptable framework for broad application.
3.3. Technological Evolution
The evolution of AI for human assistance has progressed from basic expert systems to highly versatile Large Language Models, and now to intelligent agents.
- Early AI (Expert Systems): Rule-based systems providing assistance in specific domains. Lacked adaptability and personalization.
- Machine Learning Era: Introduction of models that learn from data, but still largely generic in application.
- Deep Learning & LLMs (e.g., GPT-3, LLaMa): A paradigm shift, offering unprecedented language understanding and generation. Initially, these were standalone models with a generic knowledge base.
- Human-Aligned LLMs (SFT, RLHF): Efforts to make LLMs safer, more helpful, and generally aligned with human preferences, but still population-averaged.
- LLM Agents (e.g., ReAct, MemBank): Enhancing LLMs with external tools and memory to enable complex, multi-step task execution. These agents marked a move toward more autonomous and intelligent systems but remained general-purpose in their application.
- Personalized LLM Workflows (RAG, PAG): Initial attempts to bring user-specific data into LLMs, but often through fixed, non-adaptive retrieval mechanisms.
- Domain-Specific Personalized Agents: Development of personalized agents for narrow domains (e.g., web navigation, medical advice), highlighting the need for personalization but lacking generalizability.
- PersonaAgent (This Paper): Represents the next step by introducing a unified, versatile, and dynamically adaptable personalized LLM agent framework. It addresses the limitations of previous approaches by integrating personalized memory, actions, and a dynamically optimized persona through test-time alignment, offering a scalable solution for versatile personalization tasks across domains.
3.4. Differentiation Analysis
PersonaAgent differentiates itself from previous works by combining agentic intelligence with dynamic, real-time, individual-level personalization in a scalable manner, particularly through its novel test-time alignment strategy.
The following table, originally from the paper, highlights the comparison:
The following are the results from Table 1 of the original paper:
| Approach Categories | Agentic Intelligence | Real-world Applicability | Personal Data Utilization | Preference Alignment |
| --- | --- | --- | --- | --- |
| Human-Preference Aligned | ✗ | ✓ | ✗ | ◐ |
| User-Specific Fine-Tuning | ✗ | ◐ | ✓ | ✓ |
| Personalized LLM Workflow | ◐ | ◐ | ✓ | ◐ |
| General LLM Agent | ✓ | ✓ | ✗ | ✗ |
| Personalized LLM Agent (ours) | ✓ | ✓ | ✓ | ✓ |

✓: fully covered; ◐: partially covered; ✗: not covered at all. Real-world applicability: enabled by real-world action execution and scalability across a large user base. Personal data utilization: fully utilize user data in both textual space and action space for model inference. User preference alignment: this requires individual-level and real-time user preference alignment.
Here's a breakdown of the differentiation:
- Human-Preference Aligned (e.g., RLHF): Good for general alignment but fails to cover individual preference alignment (only partial coverage of Preference Alignment). It does not use personal data effectively for individual users.
- User-Specific Fine-Tuning: Addresses individual personalization (covering Preference Alignment and Personal Data Utilization). However, its real-world applicability is limited due to computational complexity and scalability issues with a large user base (only partial coverage of Real-world Applicability).
- Personalized LLM Workflow (e.g., RAG, PAG): Utilizes personal data and offers some preference alignment, but only partially covers agentic intelligence and real-world applicability because it relies on fixed retrieval pipelines rather than dynamic, adaptive actions and comprehensive user understanding.
- General LLM Agent (e.g., ReAct, MemBank): Excels in agentic intelligence and real-world applicability by using tools and memory. However, it lacks individual personalization (no coverage of Personal Data Utilization or Preference Alignment), as it is a one-size-fits-all model.
- PersonaAgent (ours): Aims to fully cover all four dimensions:
  - Agentic Intelligence: Achieved through its personalized action module and tool use.
  - Real-world Applicability: Enabled by its efficient test-time alignment, which avoids heavy retraining and makes it scalable for large user bases.
  - Personal Data Utilization: Leverages both episodic and semantic memory to fully incorporate user data.
  - Preference Alignment: Ensures individual-level and real-time alignment through continuous persona optimization.

In essence, PersonaAgent uniquely combines the autonomous, tool-using capabilities of general LLM agents with a dynamic, user-specific personalization mechanism, all while maintaining scalability for real-world deployment. The key innovation is the persona acting as an intermediary, optimized through textual loss feedback at test time, which no other method provides comprehensively.
4. Methodology
The PersonaAgent framework is designed to bring personalization to LLM agents by integrating user-specific memory, actions, and a dynamically evolving persona. This section details the architecture and the test-time user-preference alignment strategy.
4.1. Principles
The core idea behind PersonaAgent is to move beyond generic LLM agent behaviors by making them inherently user-centric. This is achieved through three interconnected principles:
- Personalized Memory: Equipping the agent with memory mechanisms (episodic and semantic) that specifically store and abstract individual user's interaction history and preferences.
- Personalized Actions: Enabling the agent to select and parameterize its actions and external tools based on these personalized memory insights, ensuring actions are tailored to the user's needs.
- Dynamic Persona as Intermediary: Introducing a persona (a unique system prompt for each user) that serves as the central control mechanism. This persona is responsible for translating memory insights into action guidance and is continuously refined based on the outcomes of these actions, ensuring real-time alignment with evolving user preferences.
4.2. Core Methodology In-depth (Layer by Layer)
The PersonaAgent framework extends general LLM agent architectures by incorporating user-specific personalization via two complementary modules—personalized memory and personalized action—interconnected through a dynamically evolving persona.
4.2.1. The Definition of "Persona" for Personalized LLM Agents
The persona is defined as a structured representation that unifies persistent user-specific memory (capturing long-term preferences) and explicit agent instructions (e.g., tool usage guidelines). This unified structure forms the unique system prompt for each user, which then governs all personalized user-agent interactions. It's the central piece that dictates how the agent should behave and interpret user requests.
4.2.2. Personalized Memory Module
The personalized memory module is composed of two types of memory, inspired by cognitive psychology: episodic memory and semantic memory.
4.2.2.1. Episodic Memory
Episodic memory records fine-grained, time-stamped user interactions, allowing the agent to recall specific past events. This is crucial for maintaining context-aware personalization and consistency over short to medium terms.
For each user $u$, an episodic buffer is maintained:
$
\mathcal{D}^{u} = \left\{ \left( q_{i}, r_{i}^{\mathrm{gt}}, m_{i} \right) \right\}_{i=1}^{N^{u}}
$
Where:
- $q_i$: Represents a past query or user input.
- $r_i^{\mathrm{gt}}$: Corresponds to the ground-truth response or the desired outcome for query $q_i$.
- $m_i$: Denotes auxiliary metadata associated with the interaction, such as a timestamp, session context, or other relevant information.
- $N^u$: Is the total number of interaction histories stored for user $u$.
Upon receiving a new query $q^*$, the agent needs to retrieve relevant past interactions from this buffer. This is done by computing an embedding for the new query and comparing it to embeddings of stored memories.
- First, the embedding of the new query, denoted as $\mathbf{h}_{q^*}$, is computed using an embedding function (e.g., a sentence encoder) that transforms textual queries into a dense vector space.
- Similarly, embeddings $\mathbf{h}_i$ are computed for each stored memory event (which could be the query, response, or a combination).
- The top $K$ most similar memories, $\mathcal{R}^u(q^*)$, are then retrieved using a similarity function $\mathrm{sim}(\cdot, \cdot)$:
$
\mathcal{R}^{u}(q^{*}) = \operatorname{TopK}_{i \in [1, N^{u}]} \mathrm{sim}(\mathbf{h}_{q^{*}}, \mathbf{h}_{i})
$
Where:
- $\mathcal{R}^u(q^*)$: Is the set of retrieved relevant memories for user $u$ given the new query $q^*$.
- $\operatorname{TopK}$: Denotes the operation of selecting the top $K$ items from the set of all memories for user $u$.
- $\mathrm{sim}(\cdot, \cdot)$: Is a similarity function (e.g., cosine similarity) that measures how semantically close the new query's embedding is to a stored memory's embedding. These retrieved memories are then used to ground the agent's next response, ensuring consistency and relevance based on the user's past behavior. A minimal code sketch of this retrieval step is shown below.
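To make the retrieval step concrete, here is a small illustrative Python sketch of embedding-based Top-K lookup over an episodic buffer. The embedding model (sentence-transformers) and the toy buffer contents are assumptions for illustration only; the paper does not specify its encoder.

```python
# Illustrative sketch of episodic-memory Top-K retrieval (assumed encoder, toy data).
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed sentence encoder

# Episodic buffer D^u: (query, ground-truth response, metadata) triples.
episodic_buffer = [
    ("Which tag fits a neo-noir detective film?", "noir", {"timestamp": "2024-01-03"}),
    ("Thoughts on 1950s sci-fi classics?", "classic sci-fi", {"timestamp": "2024-02-11"}),
    ("Tag for a dystopian book adaptation?", "dystopia", {"timestamp": "2024-03-22"}),
]

def retrieve_topk(new_query: str, buffer: list, k: int = 4) -> list:
    """Return the top-k stored interactions most similar to the new query."""
    h_query = encoder.encode(new_query, normalize_embeddings=True)
    h_memory = encoder.encode([q for q, _, _ in buffer], normalize_embeddings=True)
    scores = h_memory @ h_query                      # cosine similarity (vectors are normalized)
    top_idx = np.argsort(-scores)[: min(k, len(buffer))]
    return [buffer[i] for i in top_idx]

relevant = retrieve_topk("How would this user tag a space-opera film?", episodic_buffer, k=2)
```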
4.2.2.2. Semantic Memory
Semantic memory abstracts and consolidates stable user traits, such as enduring preferences and long-term goals, into a compact profile. Unlike episodic memory, it focuses on generalized user-centric knowledge that persists across sessions.
Formally, a summarization function integrates episodic memory events into a coherent profile: $ \mathcal{P}^{u} = f_{s}\big( S_{t}, \mathcal{D}^{u} \big) $ Where:
- $\mathcal{P}^u$: Represents the semantic memory profile for user $u$. This profile serves as a long-term knowledge base.
- $f_s$: Is the summarization function that processes and abstracts information. This function is typically implemented by an LLM that reads through the episodic memory.
- $S_t$: Is a task-based summarization prompt that guides the LLM in what kind of information to extract and consolidate from the episodic memory. For example, $S_t$ might instruct the LLM to identify the user's favorite genres, preferred communication style, or common problem-solving approaches.
- $\mathcal{D}^u$: Is the episodic memory buffer for user $u$. This semantic memory ensures that the agent's behavior remains aligned with the user's established characteristics, even if specific past events are not explicitly recalled from episodic memory. It provides a stable, generalized understanding of the user. A minimal sketch of this summarization step is shown below.
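The following is a small hypothetical sketch of how the summarization function could be implemented with a single LLM call. The `call_llm` wrapper, the prompt wording, and the toy interaction are placeholders rather than the paper's actual implementation (the paper runs Claude-3.5 Sonnet via Amazon Bedrock).

```python
# Illustrative sketch of the semantic-memory summarization f_s (assumed prompt wording).
def call_llm(prompt: str) -> str:
    # Placeholder: replace with a real chat-completion client (e.g., a Bedrock call).
    return "<LLM response placeholder>"

def summarize_profile(task_prompt: str, episodic_buffer: list) -> str:
    """Build the semantic profile P^u by asking an LLM to abstract the episodic buffer D^u."""
    history = "\n".join(f"Query: {q}\nUser outcome: {r_gt}" for q, r_gt, _ in episodic_buffer)
    prompt = (
        f"{task_prompt}\n\n"
        "Summarize this user's stable preferences, long-term goals, and traits "
        "based on the interactions below.\n\n"
        f"{history}\n\nUser profile:"
    )
    return call_llm(prompt)

# Example task-based summarization prompt S_t (hypothetical wording):
profile = summarize_profile(
    "You summarize a user's movie-tagging habits for a personalization agent.",
    [("Tag for a dystopian book adaptation?", "dystopia", {})],
)
```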
4.2.3. Personalized Actions
The personalized action module dictates how the agent selects and parameterizes its actions to serve the user effectively. Unlike general LLM agents that use a fixed set of tools and policies, PersonaAgent tailors these.
At each time step $t$:
- The agent receives an observation from the environment.
- It then selects an action $a_t$ based on its policy.

In PersonaAgent, this process is personalized: $ a_{t} \sim \pi_{P}\big( \cdot \mid c_{t} \big), \qquad a_{t} \in \hat{\mathcal{A}}. $ Where:
- $a_t$: Is the action chosen by the agent at time step $t$.
- $\pi_P$: Is the personalized policy, which is conditioned not only on the current context but also on the current persona $P$. The persona modulates the policy, tailoring both general tools and personalized operations.
- $c_t$: Represents the context up to time $t$, including past observations and actions.
- $\hat{\mathcal{A}}$: Is the augmented action space, which combines:
  - the set of fundamental, general tools (e.g., web search, calculator), and
  - tools specifically designed to access personalized user data and histories (e.g., memory retrieval from episodic or semantic memory, personalized search functions).

The persona influences the policy $\pi_P$, which in turn determines how general tools are used (e.g., what search query to formulate for web search) and how personalized tools are invoked (e.g., what specific information to retrieve from memory). This ensures that the agent's actions are always aligned with the user's unique profile and preferences.
The two tools detailed in Appendix C are:
- Wikipedia API for General Knowledge: A standard tool for factual information retrieval.
- RAG API for Personalized Episodic Memory: A specific tool for retrieving relevant items/histories from the user's episodic memory. The requirement is that this tool must be used at least once to answer the question, highlighting the importance of personalized memory.
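The sketch below illustrates, under stated assumptions, how a persona-conditioned policy might pick between the two tools; the tool bodies, the `call_llm` wrapper, and the output format are hypothetical and only meant to show the control flow, not the paper's implementation.

```python
# Minimal sketch of a persona-conditioned policy over the augmented action space
# (general Wikipedia tool + personalized episodic-memory RAG tool). Hypothetical stand-ins.
def call_llm(prompt: str) -> str:
    return "episodic_memory_rag | space opera films"   # placeholder for a real LLM call

TOOLS = {
    "wikipedia_search": lambda q: "...general factual snippet about " + q,
    "episodic_memory_rag": lambda q: "...user interactions relevant to " + q,
}

def select_action(persona: str, context: str) -> tuple:
    """Ask the LLM, conditioned on the persona system prompt, which tool to call and with what input."""
    decision = call_llm(
        f"{persona}\n\nContext so far:\n{context}\n\n"
        f"Available tools: {', '.join(TOOLS)}.\n"
        "Reply exactly as '<tool_name> | <tool_input>'."
    )
    tool_name, tool_input = decision.split("|", 1)
    return tool_name.strip(), tool_input.strip()

def step(persona: str, context: str) -> str:
    """One agent step: choose an action under the persona-conditioned policy, execute it, record the observation."""
    tool_name, tool_input = select_action(persona, context)
    observation = TOOLS[tool_name](tool_input)
    return context + f"\nAction: {tool_name}({tool_input})\nObservation: {observation}"
```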
4.3. Test-Time User Preference Alignment
To enable real-time adaptation and precise alignment with individual user preferences, PersonaAgent introduces a novel test-time user-preference alignment strategy. This strategy directly optimizes the persona prompt based on recent user interactions.
The core idea is to iteratively refine the persona by minimizing textual discrepancies between responses simulated by the agent (using the current persona) and the actual ground-truth responses provided by the user.
Given a batch of recent user interactions: $ \mathcal{D}_{batch} = \{ (q_{j}, \hat{r}_{j}, r_{j}^{gt}) \}_{j=1}^{n} $ Where:
- $q_j$: Is the query for the $j$-th interaction in the batch.
- $\hat{r}_j$: Is the agent's simulated response to $q_j$ when conditioned on the persona $P$.
- $r_j^{gt}$: Is the ground-truth response or desired outcome for query $q_j$, representing the user's true preference.
- $n$: Is the size of the batch of recent interactions.

The persona is optimized by minimizing a textual loss function $L$: $ P^{*} = \arg\min_{P} \sum_{j=1}^{n} L(\hat{r}_{j}, r_{j}^{gt} \mid q_{j}) $ Where:
- $P^*$: Represents the optimized persona.
- $\arg\min_{P}$: Denotes finding the persona $P$ that minimizes the following sum.
- $L$: Is the textual loss function. It quantifies the discrepancy between the agent's simulated response $\hat{r}_j$ and the ground-truth response $r_j^{gt}$ for a given query $q_j$. The loss is textual, implying it operates directly on the text of the responses, possibly using LLM-based evaluators or semantic similarity metrics.

The optimization process is iterative, as detailed in Algorithm 1.
4.3.1. Algorithm 1: Test-Time User Preference Alignment
The algorithm outlines how the persona is iteratively optimized.
Algorithm 1: Test-Time User Preference Alignment
1: Input: Test user data $\mathcal{D}$, initial persona $P$
2: Output: Optimized persona $P^*$
3: procedure OPTIMIZATION($\mathcal{D}_{batch}$, $P$)
4:   Initialize an empty list for loss gradients $\hat{\nabla}$
5:   for each $(q, \hat{r}, r^{gt})$ in $\mathcal{D}_{batch}$ do
6:     Compute $\nabla \gets LLM_{grad}(q, \hat{r}, r^{gt})$
7:     Add loss gradient/feedback $\nabla$ to $\hat{\nabla}$
8:   end for
9:   Gradient update $P^{*} \gets LLM_{update}(P, \hat{\nabla})$
10:  return updated persona $P^{*}$
11: end procedure
12: for iteration $e = 1$ to $\mathcal{E}$ do
13:   Obtain batch $\mathcal{D}_{batch}$ from user data $\mathcal{D}$
14:   Add agent responses $\hat{r}$ to $\mathcal{D}_{batch}$
15:   $P \gets$ OPTIMIZATION($\mathcal{D}_{batch}$, $P$)
16: end for
Step-by-step Explanation of Algorithm 1:
Outer Loop (Lines 12-16):
- Line 12: `for iteration e = 1 to` $\mathcal{E}$ `do`: The optimization runs for a predefined number of iterations $\mathcal{E}$, signifying an iterative refinement process.
- Line 13: `Obtain batch` $\mathcal{D}_{batch}$ `from user data` $\mathcal{D}$: In each iteration, a batch of recent user interaction data is sampled from the overall test user data $\mathcal{D}$. This batch contains queries and their corresponding ground-truth responses.
- Line 14: `Add agent responses to` $\mathcal{D}_{batch}$: For each query in the obtained batch, the current PersonaAgent (conditioned on its current persona $P$) generates a simulated response $\hat{r}$. These simulated responses are added to the batch, creating the triplets $(q, \hat{r}, r^{gt})$ used by the loss function.
- Line 15: $P \gets$ OPTIMIZATION($\mathcal{D}_{batch}$, $P$): The core optimization procedure is called. It takes the current batch of data and the current persona as input and returns an updated, optimized persona.
- Line 16: `end for`: The loop continues, iteratively refining the persona.
Inner Procedure OPTIMIZATION (Lines 3-11):
- Line 3: `procedure OPTIMIZATION(` $\mathcal{D}_{batch}$, $P$ `)`: This sub-procedure performs one step of persona optimization.
- Line 4: `Initialize an empty list for loss gradients` $\hat{\nabla}$: A list is initialized to store the feedback, or "gradients", for each interaction in the batch.
- Line 5: `for each` $(q, \hat{r}, r^{gt})$ `in` $\mathcal{D}_{batch}$ `do`: The procedure iterates through each interaction triplet in the current batch, where $\hat{r}$ and $r^{gt}$ are the simulated and ground-truth responses for the interaction being processed.
- Line 6: `Compute` $\nabla \gets LLM_{grad}(q, \hat{r}, r^{gt})$: This is the crucial step where textual loss feedback is generated. $LLM_{grad}$ is a function (another LLM call) that takes the query, the agent's simulated response, and the ground-truth response, and outputs a textual gradient, or feedback, describing how to adjust the persona so the agent's response becomes more aligned with the ground truth. This is a form of textual gradient optimization, where the "gradient" is not a numerical vector but a textual instruction.
  - The prompt used for $LLM_{grad}$ is the "Loss Gradient/Feedback Prompt" found in Appendix A. This prompt guides an LLM to act as a "meticulous and critical evaluator" that analyzes the discrepancy and provides feedback on how to improve the system prompt (persona). It specifically asks for ways to improve search keywords for tools, to consider the user's prior interactions/preferences, and to provide explicit user profile descriptions not specific to the current task.
- Line 7: `Add loss gradient/feedback` $\nabla$ `to` $\hat{\nabla}$: The feedback for this interaction is appended to the list of gradients.
- Line 9: `Gradient update` $P^{*} \gets LLM_{update}(P, \hat{\nabla})$: All the collected textual feedback (a list of individual feedbacks, $\hat{\nabla}$) is aggregated and used to update the current persona $P$. $LLM_{update}$ is another function (an LLM call) that synthesizes this feedback and revises the persona.
  - The prompt used for $LLM_{update}$ is the "Gradient Update Prompt" found in Appendix A. This prompt instructs an LLM to act as a "prompt engineering assistant" to refine the current system prompt (persona) based on the provided aggregated feedback. It explicitly asks to highlight the user's unique preferences and instruct the agent to align its responses accordingly.
- Line 10: `return updated persona` $P^{*}$: The newly updated persona is returned to the outer loop.

This iterative process ensures that the persona continuously approximates real-time user preferences and intentions, enabling adaptive, personalized interactions. While the set of available tools remains fixed, the agent's behavior emerges from the personalized policy $\pi_P$, which leverages the optimized persona to choose optimal actions and corresponding action parameters (e.g., specific search queries). A minimal code sketch of this alignment loop is given below.
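The following Python sketch shows one plausible way to wire Algorithm 1 together. The `call_llm` and `run_agent` helpers and the prompt wording are hypothetical stand-ins (the exact prompt templates are quoted in the next subsection).

```python
# Minimal sketch of Algorithm 1 (test-time user-preference alignment) with textual "gradients".
def call_llm(prompt: str) -> str:
    # Placeholder: replace with a real chat-completion client (e.g., a Bedrock call).
    return "<LLM response placeholder>"

def llm_grad(query: str, sim_response: str, gt_response: str) -> str:
    """LLM_grad: textual feedback on how the persona should change for this interaction."""
    return call_llm(
        "You are a meticulous and critical evaluator of personalized AI agent responses.\n"
        f"Question: {query}\nExpected Answer: {gt_response}\nAgent Response: {sim_response}\n"
        "Give concise feedback on how to adjust the persona system prompt.\nFeedback:"
    )

def llm_update(persona: str, feedback: list) -> str:
    """LLM_update: revise the persona given the aggregated textual feedback."""
    return call_llm(
        "You are a prompt engineering assistant refining a personal agent system prompt.\n"
        f"Current system prompt: {persona}\n"
        "Provided Feedback:\n" + "\n".join(feedback) + "\nNew system prompt:"
    )

def align_persona(persona: str, user_data: list, run_agent, iterations: int = 3, batch_size: int = 3) -> str:
    """Outer loop of Algorithm 1: simulate recent interactions, collect feedback, update the persona."""
    for _ in range(iterations):
        batch = user_data[-batch_size:]                  # latest n interactions
        gradients = []
        for query, gt_response in batch:
            sim_response = run_agent(persona, query)     # agent response under the current persona
            gradients.append(llm_grad(query, sim_response, gt_response))
        persona = llm_update(persona, gradients)         # P <- LLM_update(P, gradients)
    return persona
```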
4.3.2. Textual Optimization Prompts (from Appendix A)
The paper relies on LLMs themselves to generate "gradients" and update the persona, which is a key aspect of its textual gradient optimization.
4.3.2.1. Loss Gradient/Feedback Prompt
This prompt is used for the LLM_grad function in Algorithm 1. It instructs an LLM to evaluate the agent's response against the ground truth and provide textual feedback for persona improvement.
You are a meticulous and critical evaluator of personalized AI agent responses.
Analyze the following and give the feedback on how to improve the system prompt to align with the user's preferences.
Question: [Question] Expected Answer: [Ground Truth] Agent Response: [Response]
Your feedback should focus on how to adjust the persona system prompt to tailor the agent's responses to the individual user's unique characteristics. Make sure the feedback is concise and clear.
Tips:
1. Explain on how to improve the search keywords of tools for this user.
2. Take the user's prior interactions, preferences, and any personalization aspects into consideration.
3. Provide explicit description for user profile and preferences that is not specific to this task.
Feedback:
Explanation: This prompt defines the role of the LLM as an evaluator. It provides the necessary context (Question, Expected Answer, Agent Response) and then explicitly instructs the LLM on what kind of feedback to generate. The "Tips" are crucial for guiding the LLM to produce useful, actionable feedback for persona refinement, covering aspects like tool usage, general preferences, and long-term user profiles.
4.3.2.2. Gradient Update Prompt
This prompt is used for the LLM_update function in Algorithm 1. It takes the current persona and the aggregated textual feedback and instructs an LLM to generate an updated persona.
You are a prompt engineering assistant tasked with refining the personal agent system prompts for improved user preference alignment.
Current system prompt: [Current Persona] Provided Feedback: [Aggregated Feedback]
Based on the feedback above, generate an updated system prompt that explicitly highlights the user's unique preferences. Ensure that the prompt instructs the agent to align its responses with the user's preferences, including detailed user profile or preferences. Please maintain a helpful and clear tone in the system prompt.
New system prompt:
Explanation: This prompt sets the LLM's role as a "prompt engineering assistant." It provides the Current Persona (the system prompt that needs refinement) and the Aggregated Feedback (the textual "gradients" from the previous step). The instructions are clear: synthesize the feedback to generate a New system prompt that explicitly incorporates user preferences and provides detailed profile information, ensuring a helpful and clear tone. This mechanism allows the persona to dynamically evolve based on performance feedback against user preferences.
4.3.3. Persona Prompt Initialization (from Appendix B)
The initial persona for each user is provided as a system prompt to the LLM agent. This forms the starting point for the test-time user-preference alignment process.
# Initial System Prompt (Persona Initialization)
You are a helpful personalized assistant. Take more than two actions to infer the user preference and answer the question. User summary: [Initial Semantic Memory]
STRICT RULES: when using tools, always:
1. Think step-by-step about what information you need.
2. MUST use at least TWO tools to answer the question.
3. Use tools precisely and deliberately and try to get the most accurate information from different tools.
4. Provide clear, concise responses. Do not give explanation in the final answer.
Explanation: The initial persona sets a general helpful role and includes an Initial Semantic Memory which is a summary of the user's historical behaviors, derived from PAG (Richardson et al., 2023) as mentioned in Appendix D. Crucially, it includes STRICT RULES for tool usage, encouraging a minimum of two tool actions and precise, deliberate information gathering. This ensures that even from the start, the agent is pushed towards leveraging external information and personalized data.
5. Experimental Setup
5.1. Datasets
The experiments evaluate PersonaAgent on a subset of the LaMP (Salemi et al., 2024b) benchmarks, specifically focusing on four decision-making tasks to assess personalized agent effectiveness across diverse domains. The datasets are constructed from the time-ordered version of LaMP, with test sets consisting of the 100 users with the most extensive activity histories. For each user, data is chronologically ordered and partitioned into a profile set (historical behaviors) and a test set (final evaluation).
- LaMP-1: Personalized Citation Identification
  - Description: A binary classification task. The agent needs to determine which of two papers (one real citation, one negative sample from other users' citing papers) a specific user is more likely to cite in a given context.
  - Characteristics: Focuses on topic-level user interests in academic paper citation behavior.
  - Domain: Academic research/recommendation.
- LaMP-2M: Personalized Movie Tagging
  - Description: A multi-classification task. Given a movie description and the user's prior movie-tag pairs (personalization profile), the agent must predict the most aligned tag the user would assign.
  - Characteristics: Requires understanding an individual user's unique tagging habits and subjective interpretation of movie content.
  - Domain: Entertainment/recommendation.
- LaMP-2N: Personalized News Categorization
  - Description: A multi-classification task. The agent categorizes a news article based on user interests, using the author's historical profile. The dataset was refined to have a compact set of categories by filtering infrequent/overlapping labels.
  - Characteristics: Assesses the model's ability to incorporate individual user preferences into news classification.
  - Domain: News/content personalization.
- LaMP-3: Personalized Product Rating
  - Description: A multi-class classification task (predicting numeric ratings 1-5). Given a review text, the model predicts the user's rating, conditioned on their past reviewing behavior (style, sentiment, rating tendencies).
  - Characteristics: Challenges user understanding by requiring personalized numeric predictions from user descriptions, leveraging implicit personalization signals from past reviews.
  - Domain: E-commerce/recommendation.

These datasets were chosen because they provide rich historical user data, which is crucial for effective personalization, and cover a variety of decision-making tasks (binary classification, multi-class classification, and regression-like rating prediction) across different domains.
5.2. Evaluation Metrics
The paper uses standard evaluation metrics appropriate for classification and regression tasks.
5.2.1. Accuracy (Acc. ↑)
- Conceptual Definition: Accuracy measures the proportion of correctly classified instances out of the total number of instances. It represents the overall correctness of the model's predictions. A higher accuracy indicates better performance.
- Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
- Symbol Explanation:
- Number of Correct Predictions: The count of instances where the model's predicted label matches the true label.
- Total Number of Predictions: The total number of instances evaluated.
5.2.2. F1 Score (F1 ↑)
- Conceptual Definition: The F1 Score is the harmonic mean of Precision and Recall. It is a useful metric, especially in classification tasks with imbalanced classes, as it balances both false positives and false negatives. A higher F1 score indicates better performance.
- Mathematical Formula:
$
\text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$
Where:
- Precision:
$
\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
$
True Positives (TP): Instances correctly predicted as positive. False Positives (FP): Instances incorrectly predicted as positive (actually negative).
- Recall:
$
\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
$
False Negatives (FN): Instances incorrectly predicted as negative (actually positive).
- Symbol Explanation:
- True Positives: Number of positive instances correctly identified.
- False Positives: Number of negative instances incorrectly identified as positive.
- True Negatives: Number of negative instances correctly identified.
- False Negatives: Number of positive instances incorrectly identified as negative.
5.2.3. Mean Absolute Error (MAE ↓)
- Conceptual Definition: MAE measures the average magnitude of the errors in a set of predictions, without considering their direction. It is the average of the absolute differences between predicted and actual values. A lower MAE indicates better performance.
- Mathematical Formula: $ \text{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| $
- Symbol Explanation:
- $N$: The total number of data points.
- $y_i$: The actual (ground-truth) value for the $i$-th data point.
- $\hat{y}_i$: The predicted value for the $i$-th data point.
- $|y_i - \hat{y}_i|$: The absolute difference between the actual and predicted values for the $i$-th data point.
5.2.4. Root Mean Squared Error (RMSE ↓)
- Conceptual Definition: RMSE is a frequently used measure of the differences between values predicted by a model or an estimator and the values observed. It is the square root of the average of squared errors. RMSE gives a relatively high weight to large errors, making it particularly useful when large errors are undesirable. A lower RMSE indicates better performance.
- Mathematical Formula: $ \text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2} $
- Symbol Explanation:
- $N$: The total number of data points.
- $y_i$: The actual (ground-truth) value for the $i$-th data point.
- $\hat{y}_i$: The predicted value for the $i$-th data point.
- $(y_i - \hat{y}_i)^2$: The squared difference between the actual and predicted values for the $i$-th data point.
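For concreteness, here is a toy computation of the four metrics. The use of scikit-learn and macro-averaged F1 are assumptions for illustration; the paper does not state its exact metric implementation.

```python
# Toy computation of the four reported metrics (assumed scikit-learn implementation).
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, mean_absolute_error, mean_squared_error

# Classification-style tasks (e.g., LaMP-1/2M/2N)
y_true_cls = ["sci-fi", "classic", "comedy", "sci-fi"]
y_pred_cls = ["sci-fi", "classic", "sci-fi", "sci-fi"]
acc = accuracy_score(y_true_cls, y_pred_cls)                  # 3 of 4 correct -> 0.75
f1 = f1_score(y_true_cls, y_pred_cls, average="macro")        # macro average over classes (assumed)

# Rating prediction (e.g., LaMP-3, ratings 1-5)
y_true_reg = np.array([5, 3, 4, 1])
y_pred_reg = np.array([4, 3, 5, 2])
mae = mean_absolute_error(y_true_reg, y_pred_reg)             # 0.75
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))    # ~0.866
```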
5.3. Baselines
PersonaAgent is compared against a comprehensive set of baselines spanning non-personalized, personalized workflow, and general-purpose agentic systems.
- Non-Personalized Methods:
  - Prompt (Direct Prompting): The simplest baseline, where the LLM receives the task query directly without any explicit personalization or additional context beyond the basic instruction.
  - ICL (In-Context Learning): The LLM is provided with a few-shot demonstration of examples (input-output pairs) within the prompt itself, before the actual query. These examples guide the LLM's response style and format, but without explicit modeling of user preferences.
- Personalized Workflow Approaches: These methods use external personalized data but typically follow a fixed workflow.
  - RAG-1 (Retrieval-Augmented Generation with 1 sample): Retrieves 1 relevant user data sample (e.g., a past interaction) and prepends it to the LLM's prompt as context.
  - RAG-4 (Retrieval-Augmented Generation with 4 samples): Similar to RAG-1, but retrieves 4 relevant user data samples, providing more context.
  - PAG-4 (Profile-Augmented Generation): Extends RAG by incorporating a summarized user profile in addition to retrieved samples. The "4" likely refers to the number of retrieved items, similar to RAG-4. This aims for more holistic user understanding beyond just raw interactions.
- General Agentic Systems: These are LLM agents with reasoning and tool-use capabilities, but without specific personalization mechanisms.
  - ReAct (Reasoning and Acting): An agent framework that integrates tool use and reasoning. It enables the LLM to interleave Thought (internal reasoning steps) and Action (using external tools) to solve problems. It demonstrates planning capabilities but is generic.
  - MemBank (MemoryBank): An agentic system that introduces an explicit long-term memory module to support task generalization. It aims to improve an agent's ability to learn and adapt over time by recalling past experiences, but this memory is general-purpose, not explicitly personalized to individual users in the way PersonaAgent proposes.
5.4. Experimental Details
- LLM Implementation: All agentic methods, including PersonaAgent and the baselines, are implemented on top of LangChain (Chase, 2022), a popular framework for developing applications powered by language models.
- Base LLM: Unless otherwise specified, all models are evaluated using Claude-3.5 Sonnet (Anthropic, 2024), ensuring a fair comparison across methods by using a unified and powerful foundation model.
- Persona Initialization: The initial semantic memory for the persona prompt (used in the Initial System Prompt in Appendix B) is generated by summarizing user behaviors, following the approach of PAG (Richardson et al., 2023).
- Test-Time User Preference Alignment Parameters:
  - Alignment batch size ($n$): Set to 3 recent user interactions.
  - Alignment iterations ($\mathcal{E}$): Set to 3 iterations.
- Memory Retrieval: The number of retrieved memories ($K$ in episodic memory retrieval) is set to 4 by default, aligning with the RAG-4 and PAG-4 baselines.
- Reproducibility: The LLM sampling temperature is fixed at 0.1, which makes the outputs nearly deterministic and aids reproducibility of results.
- Compute Environment: All experiments were run on Amazon Bedrock (Amazon Web Services, 2023), a fully managed service that provides access to foundation models.
- Tools: Only two tools are used, to specifically highlight the effectiveness of the memory-action framework and test-time alignment over the persona rather than benefits from a wide variety of tools:
  - Wikipedia API for General Knowledge: For general factual information.
  - RAG API for Personalized Episodic Memory: For retrieving user-specific interaction histories. This tool is mandated to be used at least once per question.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate PersonaAgent's superior performance across all four personalized decision-making tasks, highlighting its effectiveness in personalizing action space and scaling during real-world applications.
The following are the results from Table 2 of the original paper:
| Dataset | Metric | Prompt | ICL | RAG-1 | RAG-4 | PAG-4 | ReAct | MemBank | PersonaAgent |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LaMP-1: Personalized Citation Identification | Acc. ↑ | 0.772 | 0.780 | 0.683 | 0.715 | 0.837 | 0.837 | 0.862 | 0.919 |
| LaMP-1: Personalized Citation Identification | F1 ↑ | 0.771 | 0.766 | 0.705 | 0.714 | 0.837 | 0.853 | 0.861 | 0.918 |
| LaMP-2M: Personalized Movie Tagging | Acc. ↑ | 0.387 | 0.283 | 0.320 | 0.427 | 0.430 | 0.450 | 0.470 | 0.513 |
| LaMP-2M: Personalized Movie Tagging | F1 ↑ | 0.302 | 0.217 | 0.256 | 0.386 | 0.387 | 0.378 | 0.391 | 0.424 |
| LaMP-2N: Personalized News Categorization | Acc. ↑ | 0.660 | 0.388 | 0.687 | 0.742 | 0.768 | 0.639 | 0.741 | 0.796 |
| LaMP-2N: Personalized News Categorization | F1 ↑ | 0.386 | 0.145 | 0.439 | 0.484 | 0.509 | 0.381 | 0.456 | 0.532 |
| LaMP-3: Personalized Product Rating | MAE ↓ | 0.295 | 0.277 | 0.304 | 0.313 | 0.339 | 0.313 | 0.321 | 0.241 |
| LaMP-3: Personalized Product Rating | RMSE ↓ | 0.590 | 0.543 | 0.655 | 0.713 | 0.835 | 0.590 | 0.582 | 0.509 |

Prompt and ICL are non-personalized baselines; RAG-1, RAG-4, and PAG-4 are personalized workflows; ReAct and MemBank are general agents.
Overall Analysis:
- Dominant Performance: PersonaAgent consistently achieves the best performance across all datasets and metrics. This strong result validates the efficacy of its integrated personalized memory, personalized action module, and test-time persona optimization.
- Outperforming Personalized Workflows: For tasks like LaMP-1, LaMP-2M, and LaMP-2N, which require capturing topic-level user interests, PersonaAgent significantly improves over RAG-4, PAG-4, and even MemBank. For instance, on LaMP-1, PersonaAgent achieves 0.919 Accuracy and 0.918 F1, considerably higher than PAG-4 (0.837 Acc, 0.837 F1) and MemBank (0.862 Acc, 0.861 F1). This indicates its superior ability to model nuanced user intent through its dynamic persona and memory-action integration.
- Outperforming General Agents: PersonaAgent also surpasses ReAct and MemBank, demonstrating that while agentic capabilities are important, explicit personalization driven by user-specific memory and persona is crucial for tasks requiring individual tailoring. MemBank, despite having a memory module, lacks the fine-grained personalized alignment that PersonaAgent achieves.
- Importance of Personalization over Generic Approaches: ICL (In-Context Learning) sometimes underperforms Prompt (direct prompting) (e.g., LaMP-2M, LaMP-2N), especially when few-shot examples are not perfectly aligned with individual user preferences. This underscores that generic examples can sometimes be misleading and highlights the necessity of true personalization techniques for user-specific tasks. RAG-1 and RAG-4 show that adding more raw retrieved data does not automatically translate to better personalization without a mechanism to interpret and integrate it intelligently (e.g., RAG-4 on LaMP-3 performs worse than Prompt in MAE/RMSE).
- Handling Complex Personalization (LaMP-3): In LaMP-3 (Personalized Product Rating), PersonaAgent achieves the lowest MAE (0.241) and RMSE (0.509). This task is particularly challenging as it requires predicting numeric ratings based on user reviews and historical interactions, demanding a deep understanding of individual user sentiment and rating tendencies. The fact that PersonaAgent outperforms all other methods, including general agents and even Prompt (which sometimes performs competitively in other tasks), showcases its robust test-time alignment mechanism and its ability to generalize effectively to complex personalized rating scenarios.
- Addressing the "Failure" of Other Personalized Approaches on LaMP-3: Both the other personalized workflows (RAG-1, RAG-4, PAG-4) and general-purpose agents (ReAct, MemBank) fail to outperform direct prompting on LaMP-3 in terms of MAE and RMSE. This suggests that simply augmenting with profiles or having general agentic capabilities is insufficient; a dedicated and dynamic personalization framework like PersonaAgent is required.

In summary, the results strongly validate PersonaAgent's unified framework of personalized memory, actions, and persona optimization for delivering dynamic and fine-grained personalization across diverse domains.
6.2. Ablation Studies / Parameter Analysis
To understand the contribution of each component to PersonaAgent's overall performance, an ablation study was conducted.
The following are the results from Table 3 of the original paper:
| Variants | LaMP-1 Acc. ↑ | LaMP-1 F1 ↑ | LaMP-2M Acc. ↑ | LaMP-2M F1 ↑ | LaMP-2N Acc. ↑ | LaMP-2N F1 ↑ | LaMP-3 MAE ↓ | LaMP-3 RMSE ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PersonaAgent | 0.919 | 0.918 | 0.513 | 0.424 | 0.796 | 0.532 | 0.241 | 0.509 |
| w/o alignment | 0.894 | 0.893 | 0.487 | 0.403 | 0.775 | 0.502 | 0.259 | 0.560 |
| w/o persona | 0.846 | 0.855 | 0.463 | 0.361 | 0.769 | 0.483 | 0.277 | 0.542 |
| w/o Memory | 0.821 | 0.841 | 0.460 | 0.365 | 0.646 | 0.388 | 0.348 | 0.661 |
| w/o Action | 0.764 | 0.789 | 0.403 | 0.329 | 0.626 | 0.375 | 0.375 | 0.756 |
Analysis of Ablation Study:
- w/o alignment (Without Test-Time User Preference Alignment):
  - Removing the test-time alignment module leads to a noticeable drop in performance across all tasks. For instance, on LaMP-1, Accuracy drops from 0.919 to 0.894, and F1 drops from 0.918 to 0.893. Similar drops are observed on the other tasks (e.g., MAE on LaMP-3 increases from 0.241 to 0.259).
  - Conclusion: This confirms that the test-time user-preference alignment strategy is critical for adapting to real-time user preferences and continuously refining the persona, thereby ensuring optimal performance.
- w/o persona (Without the Dynamically Evolving Persona Prompt):
  - Removing the persona prompt (the unique system prompt for each user) results in further degradation beyond removing alignment alone. For example, on LaMP-1, F1 drops from 0.893 (w/o alignment) to 0.855. On LaMP-2M, F1 drops from 0.403 to 0.361.
  - Conclusion: The persona acts as the centralized controller between memory and actions. Its absence significantly hinders the agent's ability to bridge memory-driven insights with tailored agent behavior, emphasizing its crucial role in orchestrating personalized interactions.
- w/o Memory (Without the Personalized Memory Module):
  - Removing the personalized memory module (both episodic and semantic memory) has a substantial impact, especially on tasks like LaMP-2N (Acc. drops from 0.796 to 0.646) and LaMP-3 (MAE increases from 0.241 to 0.348). The performance drop on LaMP-1 and LaMP-2M is also notable, but it is particularly pronounced where historical context is paramount.
  - Conclusion: This highlights the importance of both episodic memory (for detailed, context-rich interactions) and semantic memory (for stable, abstracted user profiles) in modeling historical user context and preferences. Without this memory, the agent cannot effectively learn or recall user-specific information.
- w/o Action (Without the Personalized Action Module / Adaptive Tool Usage):
  - Removing the personalized action module leads to the most significant performance drop across all tasks. For example, on LaMP-1, Accuracy drops drastically from 0.919 to 0.764, and F1 drops from 0.918 to 0.789. On LaMP-3, MAE jumps from 0.241 to 0.375.
  - Conclusion: This demonstrates that reasoning alone (even with personalized memory and persona) is insufficient. The ability to perform adaptive tool usage, guided by personalized data and the persona, is essential for effective decision-making and interaction. The personalized action module allows the agent to actively engage with the environment and utilize resources in a user-tailored manner.

Overall Conclusion from the Ablation Study: Each component of PersonaAgent (the test-time alignment, the persona prompt, the personalized memory module, and the personalized action module) contributes substantially to its success. The complete system, with all components integrated, delivers the strongest and most balanced performance, validating the synergistic design of the framework.
6.3. Persona Analysis
To gain deeper insights into how the test-time alignment impacts user modeling, the paper visualizes the optimized persona embeddings and provides case studies.
6.3.1. t-SNE Visualization of Personas
The following figure (Figure 2 from the original paper) shows the t-SNE embedding visualization:

The image is a t-SNE dimensionality-reduction visualization showing the initial prompt and the personalized system prompts (personas) of three different users; each persona describes the user's preferences for movie genres and content, illustrating the user-profile design of a personalized dialogue system.
Figure 2: Persona case studies on the LaMP-2M movie tagging task.
Analysis:
- t-SNE (t-Distributed Stochastic Neighbor Embedding) is a dimensionality-reduction technique, used here to project high-dimensional persona embeddings into a 2D space so they can be visualized (a minimal sketch of this kind of projection is given after this list).
- Each point in the plot represents a learned persona after the test-time user preference alignment process; the initial system prompt template is also shown as a reference point.
- Well-Separated Clusters: The most striking observation is that the learned personas of different users are well separated in the latent space, indicating that the optimization procedure effectively captures user-specific traits and differentiates between individual users.
- User Similarity Reflected: User A and User B, both described as interested in "historical and classic films," have persona embeddings located relatively close to each other, forming a cluster. This suggests that the optimization process correctly identifies and groups users with similar underlying preferences.
- Clear Divergence: User C's persona, with interests in "sci-fi, action, and book-to-film adaptations," is distinctly separated from the cluster formed by Users A and B. This divergence shows that the persona optimization mechanism can identify unique and contrasting user preferences, moving beyond generic behavior instructions.
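To make the visualization procedure concrete, the following is a minimal sketch of how persona prompts could be embedded and projected with t-SNE. The embedding model (`all-MiniLM-L6-v2`) and the persona strings are illustrative assumptions, not the paper's actual setup.

```python
# Minimal sketch: embed persona prompts and project them to 2D with t-SNE.
# The encoder and persona texts below are placeholders, not the authors' data.
from sklearn.manifold import TSNE
from sentence_transformers import SentenceTransformer
import matplotlib.pyplot as plt

personas = {
    "initial_prompt": "You are a helpful assistant for movie tagging.",
    "user_A": "Analytical cinephile focused on classic and cult films, dark comedy, satire.",
    "user_B": "Film-history buff who values cultural context; favorite tag: dystopia.",
    "user_C": "Sci-fi/action fan with deep knowledge of book-to-film adaptations.",
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
embeddings = encoder.encode(list(personas.values()))

# Perplexity must be smaller than the number of samples; keep it small here.
coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(embeddings)

for (label, _), (x, y) in zip(personas.items(), coords):
    plt.scatter(x, y)
    plt.annotate(label, (x, y))
plt.title("t-SNE projection of persona embeddings (illustrative)")
plt.show()
```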
6.3.2. Persona Case Studies (from Appendix F)
The paper provides snippets of the optimized persona prompts for three representative users (A, B, C) on the LaMP-2M movie tagging task. These textual examples corroborate the findings from the t-SNE visualization.
6.3.2.1. Persona of User A
User A's persona details a "Strong interest in film analysis, genre classification, and cinematic themes," with a "Preference for concise, direct communication," "Extensive knowledge of classic and cult films," and an "Analytical thinker with a focus on dark comedy and satirical films." The instructions include prioritizing brevity, assuming a high level of film knowledge, providing historically accurate information, and categorizing films based on themes.
6.3.2.2. Persona of User B
User B's persona describes a "Cinephile with deep knowledge of film history, genres, and iconic directors," preferring "concise, factual responses," appreciating "cultural context and diversity," and interested in "classic, critically acclaimed, and influential films." The instructions similarly emphasize direct answers, appropriate terminology, reputable sources, and tailoring responses to global/historical films. It also notes a specific "most popular tag preference: dystopia."
Comparison of User A and B: Both personas share interests in classic/historical films and concise communication, which aligns with their proximity in the t-SNE plot. However, their nuances (User A's focus on dark comedy/satire vs. User B's emphasis on cultural context/dystopia) are still distinct enough to form separate, albeit related, clusters.
6.3.2.3. Persona of User C
User C's persona reveals an "Adult with a strong interest in film analysis and genre classification," "Extensive knowledge of literature, popular book series, and their film adaptations," and a "Preference for sci-fi and action genres." The instructions prioritize literary connections, book-to-film adaptations, and considering sci-fi/action elements favorably.
Divergence of User C: This persona is clearly distinct, focusing on different genres and the literary aspect of film. This textual evidence supports its isolated position in the t-SNE plot, demonstrating the persona optimization mechanism's ability to capture diverse and fine-grained user preferences.
6.3.3. Jaccard Similarity Matrix of Learned Personas (from Appendix G)
The following figure (Figure 5 from the original paper) shows the Jaccard similarity matrix:

The image is a chart showing the similarity matrix between the personalized personas of different users in PersonaAgent. Color intensity indicates the Jaccard similarity between users' personas; the main diagonal shows perfect similarity (value 1), while overall similarity is low, reflecting the diversity of personas.
Figure 5: Jaccard similarity of learned personas on LaMP-2M.
Analysis:
- The heatmap displays the pairwise Jaccard similarities between the personas inferred for each of the 100 users on the LaMP-2M dataset.
- Jaccard Similarity: For two sets A and B, the Jaccard similarity is defined as the size of their intersection divided by the size of their union: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$. For textual personas, this can be computed over word sets, n-grams, or semantic units; here it measures the overlap in the content of the persona prompts (a minimal sketch of a word-set variant follows this list).
- Main Diagonal (Red): The bright red values along the main diagonal (where a persona is compared with itself) are 1.0, indicating perfect self-consistency, as expected.
- Off-Diagonal Entries (Cool Blue): The off-diagonal entries, representing similarities between different users' personas, are predominantly cool blue, indicating minimal overlap between different users' profiles.
- Conclusion: This clear separation in the similarity matrix reinforces the findings from the t-SNE plot and the case studies. It provides quantitative evidence that the test-time preference-alignment mechanism effectively captures and preserves each individual's unique persona, yielding distinctly differentiated and tailored user profiles.
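The paper does not state which textual unit the Jaccard similarity is computed over; the sketch below assumes lowercase word sets and uses a few illustrative placeholder personas instead of the 100 learned ones.

```python
# Minimal sketch: pairwise word-set Jaccard similarity over persona prompts,
# rendered as a heatmap. Personas below are illustrative placeholders.
import numpy as np
import matplotlib.pyplot as plt

def jaccard(a: str, b: str) -> float:
    """J(A, B) = |A ∩ B| / |A ∪ B| over lowercase word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

personas = [
    "Prefers classic films, concise answers, dark comedy and satire",
    "Cinephile, film history, cultural context, favorite tag dystopia",
    "Sci-fi and action fan, book-to-film adaptations, literary connections",
]

n = len(personas)
sim = np.array([[jaccard(personas[i], personas[j]) for j in range(n)] for i in range(n)])

plt.imshow(sim, cmap="coolwarm", vmin=0.0, vmax=1.0)
plt.colorbar(label="Jaccard similarity")
plt.title("Pairwise persona similarity (illustrative)")
plt.show()
```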
6.4. Test-Time Scaling
The paper explores the impact of various scaling factors during the alignment process, which are crucial for PersonaAgent's real-world applicability and efficiency.
The following figure (Figure 3 from the original paper) illustrates the test-time scaling effects:

The image is a chart showing PersonaAgent's performance under different test-time scaling conditions, including the effects of alignment batch size, number of iterations, and number of retrieved interactions on accuracy and F1 score.
Figure 3: Test-time scaling effects on PersonaAgent.
Analysis:
- Scaling alignment batch samples:
  - Observation: Larger alignment batch sizes (the number of recent interaction samples used per optimization iteration in Algorithm 1) lead to improved alignment quality and performance.
  - Reasoning: Using more recent interaction samples for each optimization iteration gives the model a more comprehensive and robust snapshot of current user behavior. This richer feedback signal allows better persona refinement and stronger personalization, since the textual-gradient optimization can draw on a broader set of examples to identify user preferences.
- Scaling alignment iterations:
  - Observation: Increasing the number of alignment iterations yields consistent gains in both accuracy and F1 score up to approximately 3 iterations, after which performance plateaus or slightly declines.
  - Reasoning: A small number of update steps is sufficient for effective preference alignment. Beyond that point, additional iterations may overfit to the immediate batch of interactions or introduce noise, so performance stabilizes or slightly degrades. This finding matters for computational efficiency: PersonaAgent can adapt quickly at test time without extensive iterative optimization.
- Scaling retrieved memory:
  - Observation: Retrieving more memory entries (i.e., increasing the number of entries retrieved from episodic memory) for alignment and generation significantly enhances performance.
  - Reasoning: A richer user context, derived from a larger pool of relevant past interactions, strengthens the grounding of both the agent's reasoning and its response generation. More memory provides more evidence for the persona to learn from and for the agent to use when formulating responses, leading to more accurate and personalized outputs.

Overall Conclusion for Test-Time Scaling: These scaling effects validate the design choices of PersonaAgent. The ability to improve performance with larger batch sizes and more retrieved memory indicates that the underlying personalization mechanisms are effective. Crucially, the rapid convergence of alignment iterations demonstrates the computational efficiency and practical applicability of the test-time alignment strategy, making it suitable for dynamic, real-world scenarios. A simplified sketch of this alignment loop and its scaling knobs follows.
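The following is a highly simplified, hypothetical sketch of how the alignment loop and its three scaling knobs (batch size, iterations, retrieved memory) might fit together. `simulate`, `critique`, `revise_persona`, and `retrieve_memory` are placeholder callables standing in for the paper's LLM and memory calls; this is not the authors' implementation.

```python
# Hypothetical sketch of a test-time persona-alignment loop with its scaling knobs.
def align_persona(persona, recent_interactions, simulate, critique, revise_persona,
                  retrieve_memory, batch_size=8, iterations=3, top_k_memory=5):
    """Refine a persona prompt from the latest user interactions.

    batch_size    -- number of recent interaction samples per optimization iteration
    iterations    -- alignment iterations (gains were observed to plateau around 3)
    top_k_memory  -- number of episodic memory entries retrieved per query
    """
    for _ in range(iterations):
        batch = recent_interactions[-batch_size:]   # latest interactions as the alignment batch
        feedback = []
        for query, ground_truth in batch:
            memory = retrieve_memory(query, k=top_k_memory)      # richer context grounds reasoning
            simulated = simulate(persona, query, memory)          # simulate the agent's response
            # "Textual gradient": natural-language critique of the gap between the
            # simulated response and the user's actual (ground-truth) response.
            feedback.append(critique(simulated, ground_truth))
        persona = revise_persona(persona, feedback)               # rewrite the persona prompt
    return persona
```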
6.5. Effects of base LLM capability
To evaluate the robustness of PersonaAgent and its framework-level benefits, the paper tests its performance across different underlying LLM backbones.
The following figure (Figure 4 from the original paper) illustrates the effects on LLM base model capability:

The image is a bar chart comparing the accuracy and percentile performance of different methods across four LLM base models. PersonaAgent achieves the best performance on all base models, highlighting its advantage as a personalized agent.
Figure 4: Effects on LLM base model capability.
Analysis:
- LLM Backbones Tested: The experiments compare PersonaAgent against baselines using four different foundation models: Mistral-Small, Mistral-Large, Claude-3.5, and Claude-3.7. These represent a range of model sizes and capabilities, with Claude-3.7 generally being the most advanced.
- Consistent Outperformance: The most striking observation is that PersonaAgent consistently outperforms all baselines (Prompt, RAG, PAG, ReAct, MemBank) regardless of the base model's capability. This indicates that the personalization framework itself provides a substantial, additive benefit, independent of the raw intelligence of the underlying LLM.
- Gains with Smaller Models: Even with a smaller model such as Mistral-Small, PersonaAgent achieves strong gains over the non-personalized and general agentic baselines. This highlights the model-agnostic improvement offered by the test-time user preference alignment, which is particularly important for resource-constrained scenarios or deployment on local edge devices, where smaller, less powerful LLMs may be necessary.
- Scaling with Model Intelligence: As the capability of the base LLM increases (moving from Mistral-Small to Claude-3.7), PersonaAgent not only maintains its lead but also achieves its highest accuracy (55.0%) with Claude-3.7. The proposed personalization framework therefore scales with model intelligence: the benefits of PersonaAgent are amplified when paired with more powerful LLMs.
- Distinct Advantages: The results show that PersonaAgent offers distinct advantages even in lower-resource LLM regimes while continuing to excel with state-of-the-art foundation models, making it a versatile solution for personalized AI across a spectrum of operational environments.

Overall Conclusion for Base LLM Capability: The study convincingly shows that the PersonaAgent framework provides a universal boost to personalization performance, making it a valuable addition regardless of the specific LLM being used. The personalization mechanisms elevate the performance of even smaller models significantly, and they enhance the capabilities of larger models even further.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces PersonaAgent, a pioneering framework for personalized Large Language Model (LLM) agents designed to tackle the challenge of a one-size-fits-all approach in current LLM systems. The core innovation lies in its unified memory-action architecture, which integrates episodic and semantic memory modules with a personalized action module. At the heart of this framework is the persona, a unique system prompt for each user, which acts as an intermediary, leveraging memory insights to guide agent actions and, in turn, refining memory based on action outcomes.
A key contribution is the novel test-time user-preference alignment strategy. This strategy dynamically optimizes the persona prompt by simulating recent interactions and minimizing textual loss feedback between simulated and ground-truth responses, ensuring real-time alignment with individual user preferences.
Extensive experimental evaluations across four diverse personalized decision-making tasks demonstrate that PersonaAgent consistently achieves state-of-the-art results, outperforming non-personalized, personalized workflow, and general agentic baselines. Ablation studies confirm the critical role of each component—memory, action, and alignment—in the framework's success, with persona analysis further highlighting its effectiveness in capturing unique user traits. Furthermore, studies on test-time scaling and the use of different LLM backbones illustrate the framework's robustness, efficiency, and ability to dynamically adapt to evolving user preferences while maintaining scalability.
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose future research directions:
- Reliance on Textual Feedback: PersonaAgent's reliance on textual feedback for preference alignment may overlook implicit or multi-modal user signals, such as non-verbal cues (e.g., emotional tone, facial expressions, visual interactions), which are crucial in human-computer interaction but are not captured by a textual loss.
- Privacy Risks with Personalized Data: While the paper avoids large-scale user-data training through test-time personalization, the intensive use of personalized data inherently introduces privacy risks, a significant concern for any personalized AI system.
- Future Work on Privacy-Preserving Mechanisms: To address these privacy concerns, the authors suggest future work on privacy-preserving mechanisms such as federated learning (Zhang et al., 2021), which would allow models to adapt on decentralized user data without directly accessing raw personal information.
7.3. Personal Insights & Critique
This paper presents a highly innovative and practical approach to personalization in LLM agents. The concept of a dynamically evolving persona as a system prompt, optimized through textual gradient feedback at test-time, is a brilliant solution to the scalability challenges faced by user-specific fine-tuning methods.
Strengths:
- Elegance of Test-Time Alignment: The most compelling aspect is the test-time user-preference alignment. Instead of computationally expensive full model fine-tuning per user, leveraging the LLM itself to generate "textual gradients" and update its own system prompt is resource-efficient and highly adaptive. This in-situ learning makes PersonaAgent far more viable for real-world, large-scale deployments.
- Unified Framework: The integration of personalized episodic and semantic memory with personalized actions, all orchestrated by the persona, creates a truly holistic and intelligent agent that moves beyond simple RAG or general tool use.
- Robustness Across LLMs: The demonstration that PersonaAgent provides benefits across a range of LLM capabilities, including smaller models, is crucial. The framework is not limited to only the most powerful (and expensive) LLMs, opening doors for wider application, potentially even on edge devices.
- Comprehensive Evaluation: The experiments across diverse tasks and baselines, coupled with thorough ablation studies and persona analysis, provide strong evidence for the effectiveness and internal coherence of the proposed framework.
Potential Issues/Areas for Improvement:
- Quality of Textual Feedback: The effectiveness of test-time alignment relies heavily on the quality of the textual feedback generated by the LLM_grad prompt. If the LLM acting as the "evaluator" or "prompt engineer" is not sufficiently capable or aligned, the persona optimization could be suboptimal or even drift. The paper assumes the base LLM can provide meaningful feedback, which is not always guaranteed, especially with smaller models; this is a potential point of failure.
- Prompt Engineering Fragility: While elegant, direct persona manipulation via textual prompts can be fragile: small changes in wording can lead to unexpected behavior shifts in LLMs. The robustness of this textual optimization over long periods and diverse user interactions may need further investigation.
- Defining "Textual Loss": While the paper mentions textual loss feedback, the specific implementation details of how this loss is computed are not fully elaborated beyond the prompts. Is it a simple string comparison, a semantic similarity metric, or a more complex LLM-based evaluation? A deeper dive into the exact mechanism would help clarify its nuances (one plausible instantiation is sketched after this list).
- Implicit Signals: The acknowledged limitation regarding implicit or multi-modal signals is significant, since many user preferences are never explicitly stated. Future work could integrate multi-modal inputs (e.g., user gaze, tone of voice, interaction patterns) into the memory and alignment process, for instance by converting them into textual descriptions or embeddings that the current framework could still process.
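Since the paper leaves the loss computation open, here is one plausible (assumed) instantiation: an embedding-based semantic distance between the simulated and ground-truth responses. The embedding model is an illustrative choice, and the paper may instead rely purely on LLM-generated natural-language critiques.

```python
# One assumed way a "textual loss" could be instantiated: semantic distance between
# the simulated and ground-truth responses. Not confirmed by the paper.
from sentence_transformers import SentenceTransformer, util

_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def semantic_textual_loss(simulated: str, ground_truth: str) -> float:
    """Return 1 - cosine similarity; lower means closer to the user's actual response."""
    emb = _encoder.encode([simulated, ground_truth], convert_to_tensor=True)
    return 1.0 - util.cos_sim(emb[0], emb[1]).item()

# Example: a response matching the user's preferred tag scores a lower loss.
print(semantic_textual_loss("Tag: dystopia", "Tag: dystopia"))        # close to 0.0
print(semantic_textual_loss("Tag: romantic comedy", "Tag: dystopia"))  # noticeably higher
```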
Transferability and Application:
The methods and conclusions of PersonaAgent have broad applicability:
- Customer Service & Support: Personalized AI agents could provide highly tailored assistance, remembering past interactions, preferences, and even emotional states to offer empathetic and efficient support.
- Education: Personalized tutors could adapt teaching styles, content, and pacing to individual students' learning preferences and knowledge gaps.
- Healthcare: Medical assistants could tailor advice and information to a patient's specific health history, preferences, and communication style, while also integrating privacy-preserving techniques.
- Creative Industries: Content-creation tools or virtual assistants for writers and artists could adapt to their unique creative workflows and stylistic preferences.

PersonaAgent offers a compelling vision for the future of human-AI interaction, where AI systems are not just intelligent but also deeply personal, adapting dynamically to our individual needs and evolving over time. The "free lunch" from LLMs in textual gradient optimization is a powerful concept that will likely inspire further research in adaptive AI systems.