Large Language Models Empowered Personalized Web Agents
TL;DR Summary
This work formulates LLM-empowered personalized Web agents, which integrate user data to improve instruction comprehension and execution. It introduces the PersonalWAB benchmark and PUMA, a memory-enhanced alignment framework, for more accurate personalization.
Abstract
Web agents have emerged as a promising direction to automate Web task completion based on user instructions, significantly enhancing user experience. Recently, Web agents have evolved from traditional agents to Large Language Models (LLMs)-based Web agents. Despite their success, existing LLM-based Web agents overlook the importance of personalized data (e.g., user profiles and historical Web behaviors) in assisting the understanding of users' personalized instructions and executing customized actions. To overcome the limitation, we first formulate the task of LLM-empowered personalized Web agents, which integrate personalized data and user instructions to personalize instruction comprehension and action execution. To address the absence of a comprehensive evaluation benchmark, we construct a Personalized Web Agent Benchmark (PersonalWAB), featuring user instructions, personalized user data, Web functions, and two evaluation paradigms across three personalized Web tasks. Moreover, we propose a Personalized User Memory-enhanced Alignment (PUMA) framework to adapt LLMs to the personalized Web agent task. PUMA utilizes a memory bank with a task-specific retrieval strategy to filter relevant historical Web behaviors. Based on the behaviors, PUMA then aligns LLMs for personalized action execution through fine-tuning and direct preference optimization. Extensive experiments validate the superiority of PUMA over existing Web agents on PersonalWAB.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is "Large Language Models Empowered Personalized Web Agents," which focuses on enhancing Web agents with personalization capabilities using large language models.
1.2. Authors
The authors are:
- Hongru Cai (National University of Singapore, Singapore)
- Yongqi Li (The Hong Kong Polytechnic University, Hong Kong SAR, China)
- Wenjie Wang (University of Science and Technology of China, Hefei, China)
- Fengbin Zhu (National University of Singapore, Singapore)
- Xiaoyu Shen (Eastern Institute of Technology, Ningbo, China)
- Wenjie Li (The Hong Kong Polytechnic University, Hong Kong SAR, China)
- Tat-Seng Chua (National University of Singapore, Singapore)
1.3. Journal/Conference
This paper is slated for publication in the Proceedings of the ACM Web Conference 2025 (WWW '25), April 28-May 2, 2025, Sydney, NSW, Australia. WWW is a highly prestigious and influential conference in the field of the World Wide Web, covering topics such as Web technologies, applications, and their societal impact. Publication at WWW signifies high-quality research and significant contributions to the field.
1.4. Publication Year
The paper was posted to arXiv on 2024-10-22, with the ACM reference indicating publication in 2025.
1.5. Abstract
This paper introduces the concept of LLM-empowered personalized Web agents, which leverage personalized user data (e.g., user profiles and historical Web behaviors) to improve the understanding of user instructions and execute customized actions. Recognizing the lack of a suitable evaluation standard, the authors developed PersonalWAB, the first comprehensive benchmark for this task. PersonalWAB includes user instructions, personalized user data, Web functions, and supports both single-turn and multi-turn evaluation across three personalized Web tasks. Furthermore, the paper proposes Personalized User Memory-enhanced Alignment (PUMA), a framework that adapts Large Language Models (LLMs) for this task. PUMA utilizes a memory bank with a task-specific retrieval strategy to filter relevant historical behaviors, and then aligns LLMs for personalized action execution through fine-tuning and direct preference optimization (DPO). Extensive experiments on PersonalWAB demonstrate PUMA's superior performance compared to existing Web agents.
1.6. Original Source Link
Official Source: https://arxiv.org/abs/2410.17236 PDF Link: https://arxiv.org/pdf/2410.17236v2.pdf Publication Status: This paper is available as a preprint on arXiv, with an ACM reference indicating it is accepted for publication at WWW '25.
2. Executive Summary
2.1. Background & Motivation
The World Wide Web has become an indispensable part of daily life, offering a multitude of services from information retrieval to online shopping. However, the sheer scale and complexity of modern Web services pose significant challenges for users, particularly those who struggle with vast amounts of unstructured data and intricate interactions. Web agents emerged as a promising solution to automate these Web tasks based on user instructions, thereby enhancing user experience and efficiency.
Initially, Web agents were primarily built using reinforcement learning techniques for Web navigation, but their limited context understanding and reasoning capabilities restricted their generalization to complex and novel scenarios. The advent of Large Language Models (LLMs) has revolutionized this field, endowing Web agents with powerful understanding, planning, and reasoning capabilities. Modern LLM-based Web agents utilize techniques like in-context learning, fine-tuning, and reinforcement learning to improve their instruction-following abilities, and some even support multi-turn interactions for conversational Web navigation.
Despite these advancements, existing LLM-based Web agents largely overlook a crucial aspect: personalization. User experience can be significantly enhanced by incorporating personalized data such as user profiles and historical Web behaviors. This personalized data reveals implicit user preferences, which can:
- Supplement user context for personalized instruction comprehension: Users often don't explicitly state all their preferences (e.g., a price range for a product search). Personalized data can fill these gaps.
- Enable personalized action execution: Different users have varying habits and preferences for Web services, leading to customized function calls with tailored parameters.

The core problem this paper aims to solve is the lack of personalization in current LLM-based Web agents. The field lacks both a systematic formulation of the LLM-empowered personalized Web agent task and a comprehensive benchmark for its training and evaluation. Without these, the development of Web agents that truly understand and cater to individual user needs remains limited.
The paper's entry point is to formalize this personalized Web agent task and bridge the gap by constructing the first dedicated benchmark, PersonalWAB, and proposing a novel framework, PUMA, to effectively address it.
2.2. Main Contributions / Findings
The paper makes several significant contributions:
- Task Formulation: It formally defines the task of LLM-empowered personalized Web agents, emphasizing the integration of personalized user data for both instruction comprehension and action execution. This bridges the gap between generic Web agents and customized Web services.
- Benchmark Construction (PersonalWAB): The authors construct the first benchmark specifically designed for LLM-empowered personalized Web agents. PersonalWAB features:
  - A diverse set of 1,000 users with simulated profiles and real historical Web behaviors.
  - Instructions for three personalized Web tasks: search, recommendation, and review generation.
  - A set of callable Web functions to interact with the environment.
  - Two distinct evaluation paradigms: single-turn and multi-turn interaction, with the latter utilizing an LLM-based user simulator.
- Framework Proposal (PUMA): The paper introduces Personalized User Memory-enhanced Alignment (PUMA), a novel framework designed to adapt LLMs to the personalized Web agent task. PUMA's key components include:
  - A user memory bank to store long-term historical behaviors.
  - A task-specific retrieval strategy to filter relevant information from the memory.
  - Function parameter optimization using supervised fine-tuning (SFT) with heuristically constructed pseudo-labels and Direct Preference Optimization (DPO) for alignment with personalized user preferences.
- Extensive Validation: Through extensive experiments on PersonalWAB, the paper demonstrates that PUMA consistently outperforms existing Web agents across both single-turn and multi-turn personalized Web tasks. This validates PUMA's effectiveness in better aligning with personalized user instructions and preferences, showcasing the potential for more intelligent, customized, and user-centered Web services.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
Web Agents
At its core, a Web agent is an intelligent software program designed to automate tasks on the internet. It acts on behalf of a user to achieve specific goals, such as finding information, making a purchase, or filling out a form, by interacting with Web services or Web interfaces.
- Traditional Web Agents: These agents often rely on predefined rules, scripts, or reinforcement learning techniques to navigate Web UIs (User Interfaces). They typically learn optimal sequences of actions (e.g., clicks, text inputs) to complete tasks. Their context understanding and reasoning capabilities are generally limited, making it hard for them to adapt to new or complex Web environments or out-of-distribution scenarios.
- LLM-based Web Agents: With the rise of Large Language Models (LLMs), Web agents have evolved significantly. LLMs possess extensive world knowledge and strong understanding, planning, and reasoning capabilities. These agents can interpret natural language instructions, generate plans, and execute actions by interacting with Web elements or by calling functions that abstract Web services. Techniques like in-context learning (providing examples in the prompt), fine-tuning (adapting the LLM with task-specific data), and reinforcement learning are employed to enhance their performance.
Personalization
Personalization refers to tailoring a system's behavior, content, or services to individual users based on their unique characteristics, preferences, or historical interactions. In the context of Web agents, personalization means that the agent's actions and responses would adapt to a specific user, rather than providing a generic solution.
- User Profile: A collection of static attributes about a user, such as demographics (gender, age, occupation), interests, and stated preferences.
- Historical Web Behaviors: Records of a user's past interactions with Web services, such as purchase history, search queries, ratings, reviews, and browsing patterns. This data often reveals implicit preferences that are not explicitly stated in a user profile.
- Implicit Preferences: Preferences that are inferred from a user's behavior rather than explicitly declared. For example, consistently buying products from a certain brand implies a brand preference.
Large Language Models (LLMs)
LLMs are deep learning models, often based on the Transformer architecture, trained on vast amounts of text data. They are capable of understanding, generating, and processing human language with remarkable fluency and coherence.
- Fine-tuning: A process where a pre-trained LLM is further trained on a smaller, task-specific dataset. This allows the model to adapt its general knowledge to the nuances of a particular task or domain, improving its performance for that specific application.
- Direct Preference Optimization (DPO): A reinforcement learning from human feedback (RLHF) technique used to align LLMs with human preferences. Instead of explicitly training a separate reward model, DPO directly optimizes the LLM's policy based on preferences for one response over another (e.g., A is better than B), often collected as pairwise comparisons. It simplifies the RLHF pipeline by translating preference data into a loss function that can be applied directly during fine-tuning.
- In-context Learning: The ability of LLMs to learn new tasks or adapt to new instructions based on examples provided within the prompt itself, without requiring explicit fine-tuning of the model weights.
- Cosine Similarity: A metric used to measure the similarity between two non-zero vectors in an inner product space. It is often used in natural language processing to determine how similar two pieces of text are by comparing their vector embeddings. A cosine similarity of 1 means the vectors are identical (same direction), 0 means they are orthogonal (no similarity), and -1 means they are opposite. The formula for cosine similarity between two vectors $\mathbf{A}$ and $\mathbf{B}$ is:
$ \text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|} = \frac{\sum_{i=1}^n A_i B_i}{\sqrt{\sum_{i=1}^n A_i^2} \sqrt{\sum_{i=1}^n B_i^2}} $
  Where:
  - $\mathbf{A} \cdot \mathbf{B}$ is the dot product of vectors $\mathbf{A}$ and $\mathbf{B}$.
  - $\|\mathbf{A}\|$ and $\|\mathbf{B}\|$ are the Euclidean magnitudes (or L2 norms) of vectors $\mathbf{A}$ and $\mathbf{B}$.
  - $A_i$ and $B_i$ are the components of vectors $\mathbf{A}$ and $\mathbf{B}$, respectively.
- Sentence Embeddings: Numerical vector representations of sentences that capture their semantic meaning. Models like Sentence-BERT are trained to produce these embeddings such that semantically similar sentences are mapped to nearby points in the vector space.
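To make these two concepts concrete, here is a minimal sketch that embeds two review texts and computes their cosine similarity; the `sentence-transformers` package and the `all-MiniLM-L6-v2` checkpoint are illustrative assumptions, not choices made by the paper.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity = dot(A, B) / (||A|| * ||B||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Any sentence-embedding checkpoint would work; this one is only an example.
model = SentenceTransformer("all-MiniLM-L6-v2")

review_a = "Great espresso machine, heats up quickly and is easy to clean."
review_b = "The coffee maker warms up fast and cleaning it takes no effort."

emb_a, emb_b = model.encode([review_a, review_b])
print(f"similarity = {cosine_similarity(emb_a, emb_b):.3f}")  # close to 1 for paraphrases
```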
3.2. Previous Works
The paper contextualizes its work by reviewing two main lines of research: Web agents and Personalized LLMs.
Web Agents
Previous research on Web agents has largely focused on automating Web tasks.
- Traditional Web Agents: Early work focused on reinforcement learning for Web navigation. Examples include [27], which provided a platform for agents to complete tasks using keyboard and mouse interactions on website widgets, and WebShop [55], which introduced a simulated e-commerce environment. These systems primarily dealt with predefined Web UIs and struggled with generalization.
- LLM-based Web Agents: More recent advancements leverage LLMs for improved context understanding and reasoning. These studies explore automating tasks in more complex settings:
  - Multi-domain & Multi-hop: Mind2Web [6] explores generalist agents for the Web across multiple domains. MMInA [58] focuses on multihop multimodal Internet agents.
  - Real-time Interactions & Visual Understanding: WebArena [62] provides a realistic Web environment for autonomous agents, and WebVoyager [14] and VWA [22] focus on visual UI understanding and multimodal agents.
  - Enhancement Techniques: Researchers have applied fine-tuning [11, 13], prompting [42, 56, 59], and reinforcement learning [34] to LLMs for Web agent tasks.
- Conversational & Multi-turn Agents: A distinct direction integrates user interactions into the agent's execution process. META-GUI [46] focuses on mobile app automation with conversational instructions. RUSS [54] and WebLINX [29] design datasets for dialogue-centric Web navigation. MT-Mind2Web [7] extends Mind2Web to multi-turn instruction following. ChatShop [4] explores interactive information seeking with language agents using Web functions. WorkArena [9] evaluates Web agents on common knowledge work tasks in multi-turn settings.

Differentiation: Despite these advancements, the paper highlights that prior Web agent research, including LLM-based work, overlooks the dimension of personalization. While some, like WebArena [62], simulate users with distinct roles, these roles are predefined and do not require the agent to understand user preferences or adjust its strategy based on them. This paper explicitly focuses on LLM-empowered personalized Web agents, which is a novel emphasis.
Personalized LLMs
This field focuses on LLMs that adapt to individual users' needs by handling user personas (background, historical behaviors).
- Personalized Content Generation: This category addresses generating content tailored to users. Examples include using publicly available user data (Reddit [50], Facebook, Twitter [43], blogs [21]) for pre-training LLMs. Tasks include stance classification, demographic inference [44], and personalized sentiment prediction [31]. Benchmarks like LaMP [39] and LongLaMP [23] provide datasets for evaluating personalized text classification and content generation.
- User-facing Applications: This includes personalized dialogue systems. Datasets have been built by crowd-workers authoring dialogues based on personas [57] or by extracting attributes from social media (Reddit [30], Weibo [61]). Apollonion [5] dynamically updates user profiles for personalized responses. Memory mechanisms [24, 28, 52] help models recall past conversations and important events. Personalized LLMs are also applied in specialized domains like healthcare [1, 18], education [8, 40], and robotics [51].

Differentiation: The paper notes that previous personalized LLM studies have not explored personalized function calls tailored to user-specific needs in Web environments. This work bridges the gap by adapting agents' actions based on personalized user data to complete personalized tasks within Web environments.
3.3. Technological Evolution
The technological landscape for Web agents has evolved from rudimentary, rule-based systems to sophisticated LLM-driven intelligent agents.
- Early Web Automation (Pre-LLM Era): This involved scripting languages, web scrapers, and bots for repetitive tasks. Reinforcement learning later offered more adaptive approaches, allowing agents to learn optimal interactions with Web UIs. However, these agents were brittle, struggling with dynamic Web pages and novel task instructions due to limited understanding and generalization.
- LLM Integration (Current Era): The emergence of powerful LLMs like GPT-3, GPT-4, and Llama 2 marked a paradigm shift. LLMs brought unprecedented natural language understanding, reasoning, and planning capabilities. This allowed Web agents to interpret complex natural language instructions, generate action plans, and interact with Web services at a higher level of abstraction (e.g., WebArena, Mind2Web). Techniques like in-context learning and fine-tuning became central to adapting LLMs for Web agent tasks. Multi-turn dialogues also became feasible, enabling more interactive Web navigation.
- Personalization (This Paper's Contribution): This paper introduces the next logical step: personalization. While LLMs enhanced general Web agent capabilities, they typically treated all users generically. This work integrates personalized user data (profiles, historical behaviors) into LLM-based Web agents, enabling agents not just to follow instructions, but to comprehend personalized instructions (inferring implicit preferences) and execute customized actions (making personalized function calls). This moves beyond generic intelligence to truly user-centric intelligence, aiming to provide services that anticipate and align with individual user needs. The PUMA framework and PersonalWAB benchmark are designed specifically for this stage of Web agent evolution.
4. Methodology
The core idea of this paper is to advance LLM-based Web agents by integrating personalized user data to achieve personalized instruction understanding and action execution. The paper first formulates the task and then proposes the PUMA (Personalized User Memory-enhanced Alignment) framework to address it.
4.1. Principles
The fundamental principle behind the proposed methodology is that personalized user data (such as user profiles and historical Web behaviors) holds crucial information about a user's implicit preferences. By leveraging this data, an LLM-based Web agent can move beyond merely following explicit instructions to understand underlying user needs and execute actions that are customized and optimal for that specific user. This involves two main challenges: first, making the LLM select the correct Web function, and second, generating the most personalized and effective parameters for that function.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Task Formulation
LLM-empowered personalized Web agents act as intermediaries between users and Web services. The task is formally defined by considering the following elements:
- User ($u$): Each user is unique, possessing a distinct profile $P_u$ and historical Web behaviors $H_u$.
  - $P_u$: This profile contains static attributes like demographics.
  - $H_u$: This records the user's past Web behaviors as a time-ordered sequence, where each entry represents a single Web behavior, such as a purchase or a review.
- Instruction ($i_u$): This is a natural language sentence provided by the user, expressing their specific needs and requirements.
- Web Environment: This is abstracted as a collection of Web functions.
  - $f$: A Web function that can be invoked.
  - $p$: An input parameter required to invoke a function.
  - $O_{f_p}$: The corresponding result returned by invoking function $f$ with parameter $p$. Notably, different input parameters yield different function results.

Goal: Given the user instruction $i_u$ and the personalized data ($P_u$ and $H_u$), the LLM-empowered personalized Web agent aims to:
1. Select the appropriate Web function $f$.
2. Determine the optimal parameter $p$ that invokes personalized results $O_{f_p}$ from the Web environment.
4.2.2. PUMA Framework Overview
The PUMA (Personalized User Memory-enhanced Alignment) framework is designed to enable LLM-empowered personalized Web agents to effectively complete tasks based on user instructions. As illustrated in Figure 6, PUMA consists of two main steps: Web function identification and function parameter generation.
Figure 6: Illustration of the PUMA framework, consisting of two main steps: Web Function Identification and Parameter Generation, the latter comprising Task-specific Memory Retrieval and Function Parameter Optimization. Arrows in the figure distinguish the fine-tuning (SFT) and direct preference optimization (DPO) flows.
1. Web Function Identification:
The first step involves identifying the correct Web function that the user's instruction intends to invoke.
- A Large Language Model (LLM) (e.g., LLaMA-2-7B) is fine-tuned using "instruction-function" pairs from a training dataset. This training equips the LLM with the ability to map a given user instruction to the most appropriate Web function from the available set (e.g., search_product_by_query, get_recommendations_by_history, add_product_review).
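The paper does not spell out the exact fine-tuning data format; a plausible minimal representation of such "instruction-function" pairs might look like the following (the instructions and field names are invented for illustration, while the function names come from the benchmark):

```python
# Hypothetical "instruction -> function" SFT pairs for Web function identification.
function_id_examples = [
    {
        "input": "Find a stainless steel French press like the one I bought last year, under $40.",
        "label": "search_product_by_query",
    },
    {
        "input": "Suggest some new snacks I might like.",
        "label": "get_recommendations_by_history",
    },
    {
        "input": "Write a short review for the running shoes I just received.",
        "label": "add_product_review",
    },
]
```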
2. Function Parameter Generation:
Once the correct Web function is identified, the next and more complex step is to generate the appropriate parameters for that function, taking into account the user's personalized data. This step is further broken down into two sub-components: Task-specific Memory Retrieval and Function Parameter Optimization.
2.2.1. Task-specific Memory Retrieval
This component is responsible for collecting and filtering relevant personalized data for the LLM to use in generating function parameters.
- Long-term Memory Bank: This is a storage system that maintains a detailed record of each user's historical Web behaviors. For each user, this bank stores information about their purchased products and the associated reviews. Product details include attributes like "title", "price", "store", and other relevant metadata. Review details encompass the "rating", "review title", and "comment" provided by the user.
  - Formally, if a user has purchased $n$ products, their long-term memory is represented as:
$ M = \{ m_i \mid i = 1, 2, \ldots, n \} $
  Where each $m_i$ corresponds to a specific historical behavior or product interaction.
- Task-specific Memory Retrieval Strategy: This strategy extracts only the most relevant information from the long-term memory bank based on the user's current instruction and the identified function.
  - Top-K Retrieval: Given a user instruction $i$ and the identified Web function $f$, the system first retrieves the top $K$ memory entries by computing the cosine similarity between the instruction and each memory entry $m_j$ in the bank $M$. This narrows down the vast memory bank to potentially relevant past behaviors.
  - Targeted Feature Extraction: Based on the specific identified Web function, more targeted features are then extracted from these retrieved memory entries.
    - If $f$ is the search function: product details such as "product title", "category", "price", and "store" are extracted.
    - If $f$ is the recommendation function: product details like "title", "category", and "parent ASIN" (a unique product identifier) are retained.
    - If $f$ is the review function: only the user's past "ratings" and "comments" are kept.
  - This process is formally defined as:
$ M_i = \mathrm{Extract}\left( \mathrm{TopK}\left( M, \mathrm{sim}(i, m_j), K \right), f \right) $
  Where:
    - $M_i$ represents the task-specific memory constructed for instruction $i$.
    - $\mathrm{Extract}(\cdot, f)$ is a function that extracts targeted features based on the identified Web function $f$.
    - $\mathrm{TopK}(\cdot)$ selects the $K$ memory entries from $M$ with the highest cosine similarity to the instruction $i$.
    - $\mathrm{sim}(i, m_j)$ is the cosine similarity between the instruction $i$ and memory entry $m_j$.
2.2.2. Function Parameter Optimization
After obtaining the task-specific memory $M_i$, this component focuses on generating Web function parameters that are not only reasonable but also optimally aligned with user preferences. This is achieved through a combination of Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).

- Heuristic Fine-tuning for Parameter Generation:
  - The LLM is initially equipped with a foundational ability to generate reasonable parameters through SFT.
  - The inputs for SFT are structured as a combination of the user instruction, the task-specific memory $M_i$, and the identified Web function.
  - The labels for this SFT are Web function parameters constructed using heuristic methods tailored to each Web function:
    - For the search function: ChatGPT is used to generate textual queries based on the instruction and memory. These generated queries serve as pseudo-labels.
    - For the recommendation function: the pseudo-labels consist of the most recent product ASINs (Amazon Standard Identification Numbers) from the same category found in $M_i$.
    - For the review function: the actual review text provided by the dataset is used as the label.
  - These heuristics create meaningful pseudo-labels for parameter generation, ensuring the model learns to produce plausible and contextually appropriate function parameters.
- Diverse Parameter Sampling for Pair-wise Optimization (DPO):
  - After SFT provides a baseline capability, the model's performance is further enhanced using Direct Preference Optimization (DPO) [36] over a diverse set of parameter candidates.
  - Candidate Generation: A diverse set of function parameters is first generated from the SFT-tuned LLM using high-temperature sampling (to increase output variability) and beam search (to explore multiple plausible sequences).
  - Pair-wise Preference Data Construction: These candidate parameters are then evaluated based on their result accuracy for instruction completion. For each instruction, the best-performing parameters ($p_i^{\mathrm{b}}$) and worst-performing parameters ($p_i^{\mathrm{w}}$) are identified and paired. This pair-wise preference data is formally defined as:
$ \mathcal{D}_{\mathrm{DPO}} = \left\{ \left( p_i^{\mathrm{b}}, p_i^{\mathrm{w}}, x_i \right) \right\}, $
  Where:
    - $p_i^{\mathrm{b}}$ represents the best-performing function parameters for instruction $i$.
    - $p_i^{\mathrm{w}}$ represents the worst-performing function parameters for instruction $i$.
    - $x_i$ represents the input to the model, which includes the user instruction, the task-specific memory $M_i$, and the Web function.
  - DPO Optimization: DPO is then applied to optimize the SFT-tuned model (referred to as the reference model) by encouraging it to generate parameters similar to $p^{\mathrm{b}}$ and discouraging it from generating parameters similar to $p^{\mathrm{w}}$. The DPO loss is given by:
$ \mathcal{L}_{\mathrm{DPO}} = - \mathbb{E} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(p^{\mathrm{b}} \mid x)}{\pi_{\mathrm{ref}}(p^{\mathrm{b}} \mid x)} - \beta \log \frac{\pi_\theta(p^{\mathrm{w}} \mid x)}{\pi_{\mathrm{ref}}(p^{\mathrm{w}} \mid x)} \right) \right], $
  Where:
    - $\sigma$ is the sigmoid function, which maps any real-valued number to a value between 0 and 1, used here to model preference probabilities.
    - $\beta$ is a temperature-like parameter that controls the sensitivity of the model's preference to the log-ratio difference between the policy model (the model being optimized) and the reference model (the SFT-tuned model). A higher $\beta$ makes the model more sensitive to preference differences.
    - $\pi_\theta(p \mid x)$ is the probability of generating parameters $p$ given input $x$ under the policy model (the one currently being trained).
    - $\pi_{\mathrm{ref}}(p \mid x)$ is the probability of generating parameters $p$ given input $x$ under the reference model (the SFT-tuned version before DPO).
  - This DPO step ensures superior alignment with personalized user preferences by directly optimizing the model to generate preferred outputs and to avoid less preferred ones.
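To make the preference-alignment step concrete, here is a minimal sketch of the pairwise DPO loss. It assumes the sequence log-probabilities of the best and worst parameter candidates (obtained, per the paper, via high-temperature sampling and beam search from the SFT model) have already been computed under both the policy and the frozen reference model; the variable names and the $\beta$ value are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_best: torch.Tensor,
             policy_logp_worst: torch.Tensor,
             ref_logp_best: torch.Tensor,
             ref_logp_worst: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Pairwise DPO loss: -log sigmoid(beta * (best log-ratio - worst log-ratio))."""
    best_ratio = policy_logp_best - ref_logp_best     # log pi_theta(p_b|x) - log pi_ref(p_b|x)
    worst_ratio = policy_logp_worst - ref_logp_worst  # log pi_theta(p_w|x) - log pi_ref(p_w|x)
    return -F.logsigmoid(beta * (best_ratio - worst_ratio)).mean()

# Toy usage with a batch of three sequence log-probabilities.
loss = dpo_loss(torch.tensor([-12.1, -8.3, -15.0]),
                torch.tensor([-14.7, -9.9, -18.2]),
                torch.tensor([-12.5, -8.1, -15.4]),
                torch.tensor([-13.9, -9.5, -17.8]))
print(loss.item())
```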
5. Experimental Setup
5.1. Datasets
The paper constructs the first Personalized Web Agent Benchmark (PersonalWAB) to address the absence of a comprehensive evaluation benchmark. PersonalWAB is built upon the Amazon Review dataset [15], a large-scale collection of users' Web behaviors including purchases and product ratings across various categories.
The construction of PersonalWAB involves the following steps:
- Personalized Data Construction:
  - User Sampling: 1,000 diverse users were randomly selected from the Amazon Review dataset across five product categories: Electronics; Home and Kitchen; Grocery and Gourmet Food; Clothing, Shoes, and Jewelry; and Health and Household. For each user, all their interactions across these categories (detailed purchased product information and user evaluations) were collected.
  - Data Split: User interactions were chronologically ordered and split: 80% for historical data, 10% for the training set, and the final 10% for the test set.
  - User Profile Generation: Unique profiles for each of the 1,000 users were generated using an LLM (specifically, gpt-4o-mini-2024-07-18) to infer and summarize potential profiles based on their entire behavior history. The prompt template for profile generation is provided in Figure 10 and Figure 11 in the appendix.
    - Example of a generated user profile structure (Figure 11):
      - Basic information: Gender, Age, Occupation (e.g., Male, 35-44, Engineer).
      - Shopping preferences: Price Sensitivity (e.g., Medium: Balanced Buyer), Shopping Interests (summarized product information), Brand Preferences (specific brand names).
      - Behavioral tendencies: Diversity Preference (e.g., Balanced: mix of new and familiar), Interaction Complexity (e.g., Concise: to-the-point reviews), Tone and Style (e.g., Neutral, Objective), Item Reference (keywords related to what they reference), Focus Aspects (e.g., Average Rating, Price, Material).
    - The user profiles support personalized instruction generation and multi-turn evaluation.
- User Instruction Creation: LLMs (specifically, claude-3-5-sonnet@20240620) were prompted to generate personalized instructions for each user, based on their profile and real Web behaviors, across three tasks:
  - Search Instructions: Generated based on the user profile and product information to search for similar products (see Figure 12 for the prompt). These vary in length, tone, and specificity.
  - Recommendation Instructions: Tend to be shorter and more general, generated from the user profile and integrated products (see Figure 13 for the prompt).
  - Review Instructions: Generated from the user profile, target product information, and the actual review text, incorporating personalized requirements (see Figure 14 for the prompt).
- Web Environment Implementation: The Web environment is abstracted as a series of Web functions, simplifying interactions compared to Web GUIs.
  - search_product_by_query: Takes a textual query and returns the 10 most similar products. Implemented using BM25 with Pyserini [26].
  - get_recommendations_by_history: Accepts product IDs and returns 10 recommended products. Implemented by training a SASRec model [19].
  - add_product_review: Requires review text; the review is assumed to be posted.
  - respond: Allows agent-user dialogue.
  - stop: Signals task termination.

The following are the results from [Table 2] of the original paper:

| Item | Train | Test |
|---|---|---|
| # Users | 939 | 1,000 |
| # Avg. profile tokens | 247 | |
| # Avg. behavior length | 32 | 38 |
| # Avg. behavior tokens | 7,597 | 9,270 |
| # Instructions | 6,896 | 2,174 |
| # Avg. instruction tokens | 46 | 45 |
| # Products | 8,236 | |
| # Avg. product tokens | 665 | |

Table 2: Statistics of the PersonalWAB Benchmark.

The dataset statistics show:
- Users: 939 in training and 1,000 in test, with an average of 247 profile tokens, average behavior lengths of 32 (train) / 38 (test) items, and 7,597 (train) / 9,270 (test) behavior tokens.
- Instructions: 6,896 in training and 2,174 in test, with an average of 46 (train) / 45 (test) tokens.
- Products: 8,236 unique products with an average of 665 tokens per product.
User diversity is shown in Figure 3, illustrating distributions across gender, age, and occupation. Figure 4(a) further details behavioral attributes like Price Sensitivity, Diversity Preference, and Interaction Complexity. Figure 4(b) shows instruction statistics, indicating that recommendation instructions are shortest, while review instructions are more complex.
Figure 3: Distribution of users by gender, age, and occupation.
Figure 4: (a) Distribution of behaviors by Price Sensitivity, Diversity Preference, and Interaction Complexity; (b) Statistics of the instructions on different tasks.
5.2. Evaluation Metrics
The paper establishes two distinct evaluation tracks: single-turn and multi-turn.
5.2.1. Single-turn Track
In this track, the agent has one opportunity to execute the user's instruction.
- Function accuracy (function acc): This metric assesses the agent's ability to select the correct Web function and provide parameters in the correct format.
  - Conceptual Definition: It measures whether the agent correctly identifies the intended Web function for a given instruction and structures its parameters in the expected format.
  - Formula: If the agent selects the appropriate tool for the task and the input parameters are correctly formatted, it receives a score of 1; otherwise, the score is 0.
- Result accuracy (res acc): This metric evaluates the quality of the results generated by the agent's function calls.
  - Conceptual Definition (Search and Recommendation): For search and recommendation tasks, it measures how well the agent's output (a list of products) aligns with the user's genuinely liked item (ground truth). It assigns a higher score if the target product appears higher in the returned list.
  - Mathematical Formula (Search and Recommendation):
$ \mathrm{Res\ Acc} = \begin{cases} 1 - \frac{r-1}{10}, & \text{if } r \leq 10, \\ 0, & \text{if } r > 10, \end{cases} \quad \text{with } r \in \mathbb{N}^{+} $
  Where:
    - $r$ is the rank of the target product within the returned product list.
    - The formula penalizes lower ranks, with rank 1 yielding a score of 1.0 and rank 10 yielding 0.1. If the target product is not in the top 10, the score is 0.
  - Conceptual Definition (Review): For review tasks, it assesses the semantic similarity between the agent's generated review text and the user's actual ground-truth review.
  - Formula (Review): The sentence-transformer [37] model is used to compute the cosine similarity between the generated and ground-truth review texts, yielding a res acc between 0 and 1. (No specific formula is provided in the paper, but the cosine similarity formula is explained in Section 3.1.)
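A minimal sketch of how these single-turn result-accuracy scores could be computed is given below; the ranking score follows the formula above, the review score uses sentence-transformer cosine similarity, and the specific model checkpoint is an assumption.

```python
from sentence_transformers import SentenceTransformer, util

def res_acc_ranking(returned_ids: list[str], target_id: str) -> float:
    """Search/recommendation: 1 - (r - 1)/10 if the target is ranked r <= 10, else 0."""
    if target_id not in returned_ids[:10]:
        return 0.0
    r = returned_ids.index(target_id) + 1  # 1-based rank
    return 1.0 - (r - 1) / 10.0

_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

def res_acc_review(generated: str, ground_truth: str) -> float:
    """Review: cosine similarity between generated and ground-truth review embeddings."""
    emb = _model.encode([generated, ground_truth], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

print(res_acc_ranking(["B01", "B02", "B03"], "B02"))  # rank 2 -> 0.9
```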
5.2.2. Multi-turn Track
This track evaluates the agent's ability to interact with users over multiple turns, using an LLM-based user simulator to provide real-time feedback (Figure 15 for prompt details).
- The function acc and res acc metrics from the single-turn track are also used here.
- Average steps: This additional metric measures the efficiency of the agent.
  - Conceptual Definition: It counts the total number of actions (steps) taken by the agent to complete the task.
  - Formula: No specific formula is provided, but it is defined as the total number of actions taken. The goal is to encourage the agent to accomplish tasks with minimal attempts.
5.2.3. Profile Consistency Evaluation
To verify the reliability of the generated user profiles, the paper conducts consistency evaluations:
- Profile-behavior consistency evaluation:
  - Conceptual Definition: Given a user profile, the task is to identify the correct user from a group of candidates (the true user plus negative users), where each candidate is represented by their behavior sequence.
  - Formula: top-1 accuracy. This measures how often the correct user's behavior sequence is matched to their profile when compared against others.
- Profile-product consistency evaluation:
  - Conceptual Definition: Using a user profile to rank a set of candidate items (a mixture of positive/interacted and negative/random items). The objective is to prioritize positive items.
  - Formula: NDCG@5 (Normalized Discounted Cumulative Gain at 5) and Recall@5.
    - Recall@K:
      - Conceptual Definition: Recall measures the proportion of relevant items that are successfully retrieved out of the total number of relevant items. Recall@K specifically checks how many of the relevant items are present in the top $K$ recommendations.
      - Mathematical Formula:
$ \text{Recall@K} = \frac{\text{Number of relevant items in top } K \text{ recommendations}}{\text{Total number of relevant items}} $
      - Symbol Explanation:
        - Number of relevant items in top $K$ recommendations: the count of actual relevant items that appear within the first $K$ items recommended by the system.
        - Total number of relevant items: the total count of items that are genuinely relevant to the user's preferences.
    - NDCG@K (Normalized Discounted Cumulative Gain at K):
      - Conceptual Definition: NDCG measures the usefulness, or gain, of an item based on its position in the result list. The gain is accumulated from the top of the result list to the bottom, with the gain of a highly relevant item at a lower position being discounted. NDCG normalizes the score by dividing by the ideal DCG (IDCG), which is the DCG of the ideal ordering of results.
      - Mathematical Formula: First, Cumulative Gain (CG) at position $k$:
$ \text{CG}_k = \sum_{i=1}^{k} \text{rel}_i $
      Then, Discounted Cumulative Gain (DCG) at position $k$:
$ \text{DCG}_k = \sum_{i=1}^{k} \frac{\text{rel}_i}{\log_2(i+1)} $
      Finally, Normalized Discounted Cumulative Gain (NDCG) at position $k$:
$ \text{NDCG}_k = \frac{\text{DCG}_k}{\text{IDCG}_k} $
      - Symbol Explanation:
        - $k$: the position in the ranked list (for NDCG@5, $k = 5$).
        - $\text{rel}_i$: the relevance score of the item at position $i$ in the ranked list (often binary: 1 if relevant, 0 if not).
        - $\log_2(i+1)$: the discounting factor, which reduces the impact of lower-ranked items.
        - $\text{IDCG}_k$: the ideal DCG at position $k$, i.e., the maximum possible DCG score if all relevant items were perfectly ranked at the top.
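For reference, here is a small, self-contained sketch of Recall@K and NDCG@K with binary relevance, matching the formulas above; the item IDs in the usage example are made up.

```python
import math

def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """Fraction of all relevant items that appear in the top-k list."""
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

def ndcg_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int = 5) -> float:
    """DCG of the ranking divided by the DCG of an ideal ranking (binary relevance)."""
    dcg = sum(1.0 / math.log2(i + 2)              # i is 0-based, so position = i + 1
              for i, item in enumerate(ranked_ids[:k]) if item in relevant_ids)
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["p3", "p7", "p1", "p9", "p4"]
relevant = {"p1", "p7"}
print(recall_at_k(ranked, relevant), round(ndcg_at_k(ranked, relevant), 3))
```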
5.3. Baselines
The paper evaluates the PUMA framework against a range of baselines, categorized into three groups, using gpt-4o-mini-2024-07-18 as the backbone LLM for all baselines (unless specified otherwise for PUMA):
5.3.1. Memory Retrieval Methods
These baselines explore different strategies for selecting and utilizing user history to understand the impact of various memory selection techniques on task performance. The general prompt template for task execution is provided in Figure 16 (single-turn) and Figure 17 (multi-turn) in the appendix, with differences only in the memory component.
- No Memory: The agent operates without access to any user history, relying solely on the current instruction.
- Random Memory: The agent randomly selects a portion of behaviors from the user's history for context.
- Last Memory: The agent uses only the most recent behaviors from the user's history, assuming recent context is most relevant. For single-turn, the memory length is 50 behaviors; for multi-turn, it is 20 behaviors.
- Relevant Memory: The agent selects past behaviors based on cosine similarity with the current instruction, aiming to filter for contextually relevant details. Sentence-transformer [37] is used for the cosine similarity calculation. Memory length settings are the same as Last Memory.
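As an illustration of these baseline strategies, the sketch below selects memory entries by each rule; the real baselines use sentence-transformer cosine similarity and memory lengths of 50 (single-turn) or 20 (multi-turn) behaviors, whereas the word-overlap scorer and the default k here are simplifications.

```python
import random

def _overlap(a: str, b: str) -> float:
    """Toy relevance scorer: word overlap as a stand-in for embedding cosine similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / (len(wa | wb) or 1)

def select_memory(history: list[dict], strategy: str,
                  instruction: str = "", k: int = 20) -> list[dict]:
    if strategy == "no_memory":
        return []
    if strategy == "random_memory":
        return random.sample(history, min(k, len(history)))
    if strategy == "last_memory":
        return history[-k:]                                  # most recent k behaviors
    if strategy == "relevant_memory":
        return sorted(history, key=lambda m: _overlap(instruction, str(m)), reverse=True)[:k]
    raise ValueError(f"unknown strategy: {strategy}")
```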
5.3.2. Enhanced Reasoning Methods
These frameworks are designed to improve the agent's reasoning and decision-making.
- ReAct [56]: This framework guides the LLM to "think" before acting. It instructs the model to generate a "Thought:" (reasoning) followed by an "Action:" (a JSON-formatted action argument) to interact with the environment. This allows the model to deliberate on available information. For evaluation, ReAct is combined with the Last Memory approach to provide recent context.
- Reflexion [42]: Building upon ReAct, Reflexion adds a self-evaluation phase. The agent reviews and analyzes its previous actions and outcomes, learns from mistakes, and refines its strategy in subsequent interactions. This baseline is evaluated only in the multi-turn track, where each user message is treated as feedback for the reflection and adjustment process.
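For illustration, a single ReAct-style turn as described above might be serialized as follows; the exact prompt wording used in the benchmark appears in the paper's appendix figures, so this is only an approximation.

```python
# Hypothetical ReAct-style output for one turn (Thought followed by a JSON-formatted Action).
react_turn = (
    "Thought: The user wants a budget espresso machine similar to their last purchase, "
    "so I should search rather than recommend.\n"
    'Action: {"function": "search_product_by_query", '
    '"parameters": {"query": "budget stainless steel espresso machine"}}'
)
print(react_turn)
```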
5.3.3. Recommendation-Specific Memory Frameworks
Given that recommendation tasks are inherently personalized, these baselines leverage memory mechanisms specifically developed for recommendation agents.
- RecMind [49]: An LLM-powered agent for general recommendations. It consists of two memory types: personalized memory (user reviews, ratings) and world knowledge (item metadata, real-time info via Web search). In this setup, the personalized memory retains user reviews and ratings, and an additional get_product_details_by_asin function is incorporated to allow RecMind to access detailed product information. The memory length is set to 400 behaviors.
- InteRecAgent [17]: This framework uses LLMs as a reasoning engine and recommender models as functions for interactive recommendations. Its memory includes a candidate bus (current item candidates) and a user profile (like, dislike, and expect preferences). The user profile memory is adopted and updated at the end of each task based on conversation history. This method is evaluated only in the multi-turn setting due to its reliance on ongoing dialogue for user profile synthesis.
5.3.4. PUMA Implementation Details
- LLM Backbone: LLaMA2-7B [47] is used for fine-tuning.
- Fine-tuning: Performed with LoRA [16] using NVIDIA A5000 GPUs.
- Learning Rates: separate learning rates are used for SFT and for DPO.
- Batch Size: 1 per GPU.
- Memory Token Length: Constrained to 256, 512, and 768 tokens due to GPU memory limitations during training.
- Parameter Generation: High-temperature sampling (temperature of 1.5) and beam search (beam size of 10) are used to generate diverse function parameters.
- Pseudo-label Generation: gpt-4o-mini-2024-07-18 is used to generate search function parameters for the initial SFT labels.
5.4. Profile Consistency Evaluation Details
As mentioned in Section A.1 of the paper, the profile consistency evaluation uses the following settings:
- Profile-behavior consistency: Task is to match a user profile with the correct user's past Web behaviors among other candidate users. Metric: top-1 accuracy.
- Profile-product consistency: Task is to rank candidate items (positive + negative) for a user based on their profile. Metrics: NDCG@5 and Recall@5.
- Settings: The number of positive samples is set to 1 and 3, and the number of negative samples to 4 and 7, for the user prediction and recommendation tasks respectively.
- LLM used: gpt-4o-mini-2024-07-18.
Figure 5: Results of profile consistency evaluation experiments, comparing PersonalWAB with Apollonion [5] on profile-product Recall@5 and NDCG@5 and on profile-behavior Acc@1 (improvements of 25.8%, 18.3%, and 13.3%, respectively). Our generated profiles align better with users' actual Web behaviors and interested products than Apollonion [5].
Figure 5 shows that PersonalWAB's generated profiles exhibit significant improvements over Apollonion [5] across both tasks, with PersonalWAB achieving higher top-1 accuracy (e.g., 0.85 vs 0.71 for profile-behavior) and higher NDCG@5 and Recall@5 (e.g., 0.61 vs 0.45 and 0.81 vs 0.65 for profile-product), indicating enhanced distinctiveness and alignment with actual user behaviors.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Single-turn Track
The following are the results from [Table 3] of the original paper:
| Method (backbone) | Search Func. Acc | Search Res Acc | Rec. Func. Acc | Rec. Res Acc | Review Func. Acc | Review Res Acc | Overall Func. Acc | Overall Res Acc |
|---|---|---|---|---|---|---|---|---|
| No Memory (gpt-4o) | 1.000 | 0.647 | 0.092 | 0.000 | 1.000 | 0.444 | 0.684 | 0.355 |
| Random Memory (gpt-4o) | 0.974 | 0.640 | 0.296 | 0.018 | 0.996 | 0.442 | 0.745 | 0.357 |
| Last Memory (gpt-4o) | 0.937 | 0.626 | 0.432 | 0.028 | 1.000 | 0.442 | 0.782 | 0.357 |
| Relevant Memory (gpt-4o) | 0.928 | 0.622 | 0.492 | 0.030 | 1.000 | 0.443 | 0.800 | 0.356 |
| ReAct [56] (gpt-4o) | 0.903 | 0.605 | 0.560 | 0.027 | 0.996 | 0.444 | 0.815 | 0.350 |
| RecMind [49] (gpt-4o) | 0.981 | 0.645 | 0.226 | 0.017 | 0.990 | 0.442 | 0.721 | 0.359 |
| PUMA (gpt-4o) | 1.000 | 0.649 | 0.939 | 0.048 | 1.000 | 0.449 | 0.979 | 0.373 |
| PUMA (LLaMA-7B) | 0.996 | 0.652 | 0.987 | 0.054 | 1.000 | 0.538 | 0.994 | 0.406 |

Table 3: Single-turn track results. In the original paper, the best performance in each column is indicated by bold and the second best by underline.
Key insights from the single-turn track results (Table 3):
- Recommendation Task Difficulty: Recommendation instructions show poor function accuracy and result accuracy for most baselines. For instance, No Memory has a function acc of only 0.092 and a res acc of 0.000 for recommendation. This indicates a significant challenge in correctly identifying the recommendation function and generating effective parameters. Further analysis (e.g., Figure 8(b) in the original paper) reveals that many recommendation instructions were incorrectly assigned to the search function.
- Impact of Memory: Methods incorporating memory generally show improved function accuracy compared to No Memory. Relevant Memory and ReAct exhibit higher function accuracy, suggesting that retrieving relevant information and explicit reasoning help in function selection. However, the result accuracy for most baselines remains similar to No Memory, implying they fail to significantly enhance personalized task execution, especially for the recommendation task where res acc stays very low (0.000-0.030).
- PUMA's Superiority: PUMA significantly outperforms all baselines across all tasks. PUMA (LLaMA-7B) achieves the highest overall function accuracy (0.994) and result accuracy (0.406).
  - For recommendation, PUMA (LLaMA-7B) achieves a function acc of 0.987 (vs. 0.560 for ReAct) and a res acc of 0.054 (vs. 0.030 for Relevant Memory), demonstrating a substantial improvement.
  - This superiority highlights the effectiveness of PUMA's task-specific memory retrieval and function parameter optimization (SFT + DPO) in enabling the agent to focus on relevant behaviors and generate higher-quality personalized actions.
- Efficiency: Despite using a smaller backbone LLM (LLaMA-7B) compared to gpt-4o, PUMA (LLaMA-7B) still achieves the best performance, indicating its efficiency and effectiveness.
6.1.2. Multi-turn Track
The following are the results from [Table 4] of the original paper:
| Method (backbone) | Search F.Acc | Search R.Acc | Search Avg.Steps | Rec. F.Acc | Rec. R.Acc | Rec. Avg.Steps | Review F.Acc | Review R.Acc | Review Avg.Steps | Overall F.Acc | Overall R.Acc | Overall Avg.Steps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No Memory (gpt-4o) | 0.996 | 0.656 | 2.398 | 0.096 | 0.000 | 2.420 | 1.000 | 0.446 | 2.019 | 0.685 | 0.358 | 2.280 |
| Random Memory (gpt-4o) | 0.999 | 0.680 | 4.193 | 0.703 | 0.042 | 4.474 | 1.000 | 0.448 | 2.007 | 0.896 | 0.380 | 3.564 |
| Last Memory (gpt-4o) | 0.996 | 0.676 | 4.229 | 0.708 | 0.045 | 4.252 | 1.000 | 0.449 | 2.007 | 0.897 | 0.381 | 3.498 |
| Relevant Memory (gpt-4o) | 0.996 | 0.686 | 4.233 | 0.715 | 0.042 | 4.564 | 0.999 | 0.448 | 2.008 | 0.899 | 0.383 | 3.609 |
| ReAct [56] (gpt-4o) | 0.996 | 0.674 | 4.657 | 0.218 | 0.013 | 5.468 | 0.974 | 0.448 | 2.129 | 0.718 | 0.369 | 4.098 |
| Reflexion [42] (gpt-4o) | 1.000 | 0.686 | 5.406 | 0.281 | 0.014 | 6.145 | 0.976 | 0.449 | 2.145 | 0.741 | 0.373 | 4.579 |
| RecMind [49] (gpt-4o) | 0.997 | 0.642 | 6.728 | 0.347 | 0.026 | 6.003 | 0.997 | 0.451 | 2.107 | 0.771 | 0.364 | 4.938 |
| InteRecAgent [17] (gpt-4o) | 0.999 | 0.642 | 3.110 | 0.618 | 0.022 | 3.008 | 1.000 | 0.447 | 2.001 | 0.867 | 0.362 | 2.706 |
| PUMA (gpt-4o) | 0.999 | 0.720 | 5.082 | 0.984 | 0.052 | 3.791 | 1.000 | 0.453 | 2.002 | 0.994 | 0.399 | 3.608 |

Table 4: Multi-turn track results.
Key insights from the multi-turn track results (Table 4):
- Baselines Benefit from Multi-turn: Compared to the single-turn track, baselines generally perform better in search and recommendation tasks. This is attributed to the ability to benefit from multiple attempts and user feedback, allowing them to correct initial errors. Review tasks show minimal improvement as they are often straightforward.
- Memory Retrieval Baselines: Similar trends to single-turn are observed. Relevant Memory slightly improves function accuracy and result accuracy, but often at the cost of additional steps.
- Reasoning Methods (ReAct, Reflexion): ReAct and Reflexion perform worse than memory retrieval methods in terms of function accuracy and result accuracy for recommendation, and require more average steps. The added complexity of explicit reasoning and self-reflection (which increases input token length) seems to hinder efficiency and accuracy in these complex multi-turn settings, potentially due to context window limitations or the difficulty of effective self-correction.
- Recommendation-Specific Frameworks (RecMind, InteRecAgent): RecMind requires a higher number of average steps (6.728 for search, 6.003 for recommendation) due to additional function calls, and struggles with instruction identification (low function acc for recommendation). InteRecAgent uses fewer steps (3.008 for recommendation) due to its streamlined memory, but this simplification leads to lower result accuracy (0.022 for recommendation).
- PUMA's Strong Performance: PUMA (gpt-4o) demonstrates strong performance, especially in search and recommendation tasks. It achieves the highest overall function accuracy (0.994) and result accuracy (0.399) among the gpt-4o models. For recommendation, PUMA significantly improves function accuracy (0.984 vs. 0.715 for Relevant Memory) and result accuracy (0.052 vs. 0.045 for Last Memory). By extracting relevant information and filtering redundant data, PUMA enables more informed decisions with fewer steps in recommendation (3.791 Avg.Steps vs. 4.564 for Relevant Memory). While the full PUMA (with LLaMA-7B fine-tuning) was not evaluated in multi-turn due to model limitations, the gpt-4o variant still shows the benefits of its task-specific memory.
6.2. In-depth Analysis
6.2.1. Analysis on efficiency
Figure 7: Comparison between the average task completion time (in seconds) for different methods.
The average task completion time is a critical factor for user experience. Figure 7 illustrates the efficiency comparison:
- GPT-based Baselines: Most GPT-based methods, including No Memory, Random Memory, Last Memory, Relevant Memory, ReAct, Reflexion, RecMind, and InteRecAgent, show similar completion times, ranging from approximately 6.5 to 6.9 seconds. This is likely due to the inherent latency in calling the GPT models and the memory processing overhead (even for No Memory, there is a baseline processing time).
- PUMA's Superior Efficiency: PUMA significantly outperforms all baselines, achieving an average task completion time of just 2.8 seconds. This substantial efficiency gain is attributed to two factors:
  - Smaller Model: PUMA utilizes a LLaMA-7B backbone, which is much smaller and faster to run than gpt-4o.
  - Compact Memory Structure: PUMA's task-specific memory retrieval mechanism is designed to filter out irrelevant information, resulting in a more compact and manageable input. This minimizes inference time and reduces the computational load, making PUMA highly effective for real-world Web applications where quick response times are essential.
6.2.2. Ablation Study
The following are the results from [Table 5] of the original paper:
| Method | Search Function Acc | Search Result Acc | Rec. Function Acc | Rec. Result Acc | Review Function Acc | Review Result Acc | Overall Function Acc | Overall Result Acc |
|---|---|---|---|---|---|---|---|---|
| PUMA | 0.996 | 0.652 | 0.987 | 0.054 | 1.000 | 0.538 | 0.994 | 0.406 |
| w/o Task-specific Memory | 0.990 | 0.643 | 0.992 | 0.008 | 1.000 | 0.496 | 0.994 | 0.373 |
| w/o SFT | 1.000 | 0.000 | 0.983 | 0.000 | 1.000 | 0.160 | 0.994 | 0.054 |
| w/o DPO | 0.996 | 0.648 | 0.987 | 0.047 | 1.000 | 0.529 | 0.994 | 0.399 |
Table 5: Ablation study on key components of PUMA in single-turn track.
An ablation study (Table 5) was conducted to assess the impact of PUMA's key components on performance:
- w/o Task-specific Memory: Removing the task-specific memory retrieval leads to a drop in result accuracy across all tasks (e.g., from 0.054 to 0.008 for recommendation, and from 0.538 to 0.496 for review). This highlights the critical role of effectively filtered memory in retaining the relevant information necessary for generating accurate function parameters.
- w/o SFT: When the supervised fine-tuning (SFT) phase is removed, result accuracy dramatically declines to near zero (e.g., 0.000 for search and recommendation, 0.160 for review). This indicates that SFT is fundamental in equipping the model with the basic ability to generate plausible and contextually appropriate function parameters. Without it, the LLM struggles significantly.
- w/o DPO: Removing the Direct Preference Optimization (DPO) phase results in a slight but noticeable decrease in result accuracy (e.g., from 0.054 to 0.047 for recommendation, and from 0.538 to 0.529 for review). This suggests that DPO plays a crucial role in refining the function parameters and better aligning them with personalized user preferences, thus improving the overall quality of execution.

Overall, the ablation study confirms that all three components (task-specific memory, SFT, and DPO) are essential for PUMA's superior performance, with SFT being foundational and memory and DPO providing critical enhancements for personalization and optimization.
6.2.3. Analysis on memory length
The following are the results from [Table 6] of the original paper:
| Memory Length (tokens) | Search Function Acc | Search Result Acc | Rec. Function Acc | Rec. Result Acc | Review Function Acc | Review Result Acc | Overall Function Acc | Overall Result Acc |
|---|---|---|---|---|---|---|---|---|
| 256 | 0.997 | 0.651 | 0.985 | 0.019 | 1.000 | 0.530 | 0.994 | 0.395 |
| 512 | 0.991 | 0.648 | 0.988 | 0.032 | 1.000 | 0.531 | 0.993 | 0.395 |
| 768 | 0.996 | 0.652 | 0.987 | 0.054 | 1.000 | 0.538 | 0.994 | 0.406 |
Table 6: Performance comparison of different memory token lengths in PUMA.
The impact of different memory token lengths (256, 512, and 768 tokens) on PUMA's performance was analyzed:
- Function Accuracy: Memory length has minimal impact on function accuracy. The model identifies the correct function equally well regardless of memory size, with function accuracy remaining consistently high (around 0.99 overall).
- Result Accuracy: In contrast, memory length significantly affects result accuracy, especially for recommendation tasks.
  - For recommendation, increasing the memory length from 256 to 768 tokens yields a notable improvement in result accuracy (from 0.019 to 0.054). Shorter memory lengths limit the number of stored products and behaviors, hindering the model's ability to select appropriate product IDs for recommendations.
  - Search and review tasks are less sensitive to memory length; their result accuracy remains relatively stable across lengths. These tasks rely more on information in the user instruction itself than on extensive historical memory for parameter generation, which also implies a ceiling on the gains obtainable by merely increasing memory length.

The analysis indicates that while longer memory benefits tasks requiring richer historical context (such as recommendation), judicious selection of memory content (as done by task-specific memory retrieval) is crucial; simply increasing the length does not guarantee improvement across all task types. A minimal sketch of packing retrieved behaviors into a fixed token budget follows.
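To make the role of the token budget concrete, here is a minimal sketch (not the authors' implementation) of how ranked, retrieved behaviors might be packed into a fixed memory length. The function name `pack_memory` and the `count_tokens` callable are hypothetical; the latter stands in for the agent LLM's tokenizer.

```python
from typing import Callable, List

def pack_memory(behaviors: List[str], budget: int,
                count_tokens: Callable[[str], int]) -> str:
    """Greedily keep the highest-ranked retrieved behaviors until the token budget is reached.

    `behaviors` is assumed to be pre-ranked by the task-specific retrieval strategy
    (most relevant first); `count_tokens` approximates the LLM's tokenizer.
    """
    kept, used = [], 0
    for entry in behaviors:
        cost = count_tokens(entry)
        if used + cost > budget:
            break  # adding this entry would exceed the memory length (e.g., 256/512/768)
        kept.append(entry)
        used += cost
    return "\n".join(kept)

# Toy usage with a whitespace token count as a stand-in tokenizer.
history = [
    "purchased: wireless earbuds, 4-star review praising battery life",
    "searched: 'running shoes size 10', clicked two lightweight models",
    "reviewed: green tea sampler, mentioned preference for low caffeine",
]
print(pack_memory(history, budget=30, count_tokens=lambda s: len(s.split())))
```

With a larger budget more behaviors survive the cut, which matches the observation that recommendation (the most memory-hungry task) gains the most from longer memory.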
6.2.4. Analysis on action transitions
Figure 8: Transitions of the agent's actions in multi-turn search and recommendation tasks. Each color represents a specific function. The horizontal axis shows interaction steps, while the width of each color band indicates the proportion of the agent's focus on that action. The flow between steps illustrates how the agent adapts its strategy over steps.
Figure 8 visualizes PUMA's actions in each interaction turn within the multi-turn track (excluding review instructions, which are typically completed in two steps):
- Search Instructions (Figure 8a): The agent tends to alternate between the search and respond functions. This pattern is logical: the agent can use respond to solicit user feedback, clarify ambiguities, or present preliminary results, and then adjust its search action in subsequent turns based on that feedback. The interaction flow is relatively direct and focused.
- Recommendation Instructions (Figure 8b): The action transitions for recommendation instructions are more entangled, indicating a more complex and varied action sequence. This suggests that multi-turn recommendation tasks are inherently more challenging: the agent must accurately identify user intent, dynamically adjust its strategy based on nuanced feedback, and potentially explore different avenues, leading to a less linear interaction flow than search. This underlines the difficulty of continually refining recommendations through dialogue.
6.2.5. Analysis of multi-turn performance variation
Figure 9: Analysis of the agent's performance across multiple attempts in the multi-turn track. The line shows the number of attempts at each step, while the bars show the average result accuracy for each task category (search, recommendation, review) and overall.
Figure 9 presents the analysis of the agent's performance over multiple attempts in the multi-turn track, showing both Result Accuracy (Res Acc) and the number of solved tasks as the number of attempt steps increases:
- Early Task Completion: A large number of tasks are completed within the first five attempts, indicating that most tasks are relatively straightforward and resolvable early in the interaction. Review tasks, in particular, are typically finished within the first two attempts, implying minimal need for extended user interaction about review requirements.
- Res Acc Trend: Res Acc is high during the initial attempts but tends to decline with each subsequent attempt. This pattern suggests that easier tasks are resolved quickly, leaving the more difficult or ambiguous tasks for later turns; as the agent encounters these harder scenarios, its accuracy decreases.
- Outliers: A few tasks achieve higher Res Acc in later steps, but these are rare outliers involving only one or two tasks and do not alter the overall declining trend.
- Feedback Utilization Challenges: The declining Res Acc in later attempts also implies that the agent struggles to effectively leverage user feedback in complex, prolonged interactions. This could stem from a lack of sufficient multi-turn training data for teaching robust self-correction and adaptation over extended dialogues.
6.2.6. Analysis on function usage and outcome accuracy
The following are the results from [Table 7] of the original paper:
| Method | Search F. Acc. | Search R. Acc. | Search O. Acc. | Rec. F. Acc. | Rec. R. Acc. | Rec. O. Acc. |
|---|---|---|---|---|---|---|
| No Memory | 1.000 | 0.647 | 0.647 | 0.092 | 0.000 | 0.155 |
| Random Memory | 0.974 | 0.640 | 0.642 | 0.296 | 0.018 | 0.159 |
| Last Memory | 0.937 | 0.626 | 0.632 | 0.432 | 0.028 | 0.161 |
| Relevant Memory | 0.928 | 0.622 | 0.631 | 0.492 | 0.030 | 0.159 |
| ReAct [56] | 0.903 | 0.605 | 0.628 | 0.560 | 0.027 | 0.160 |
| RecMind [49] | 0.981 | 0.645 | 0.647 | 0.226 | 0.017 | 0.152 |
| PUMA | 1.000 | 0.649 | 0.649 | 0.939 | 0.048 | 0.164 |
Table 7: Single-turn performance comparison of different methods in terms of function accuracy (F. Acc.), result accuracy (R. Acc.), and outcome accuracy (O. Acc.) in search and recommendation.
In real-world applications, users prioritize the relevance of retrieved results over the specific function employed. To reflect this user-centric goal, the paper introduces Outcome Accuracy (O. Acc.).
- Conceptual Definition: Outcome Accuracy (O. Acc.) evaluates the correctness of the returned results (e.g., product lists) independently of whether the agent invoked the search or the recommendation function. It focuses on whether the final output aligns with user intent, irrespective of the precise tool used.
- Analysis: As seen in Table 7, function accuracy (F. Acc.) and result accuracy (R. Acc.) vary significantly between search and recommendation; for instance, No Memory has perfect F. Acc. for search but very low F. Acc. for recommendation. Outcome Accuracy provides a more balanced perspective: PUMA achieves the highest Outcome Accuracy among all methods (0.649 for search and 0.164 for recommendation), demonstrating its ability to deliver relevant results even when the boundary between functions is blurred from the user's point of view.
- This metric prioritizes the relevance of final outputs over strict adherence to function selection, better reflecting real-world user needs (a minimal sketch of the computation follows this list).
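Since this excerpt does not give an exact formula, the following is a minimal sketch under the assumption that outcome accuracy checks the returned item list against the target product regardless of which function produced it, while result accuracy additionally requires the expected function. Field names such as `called_function` and `target_item` are hypothetical.

```python
from typing import Dict, List

def hit(returned_items: List[str], target_item: str) -> bool:
    """Assumed correctness check: the ground-truth product appears in the returned list."""
    return target_item in returned_items

def evaluate(episodes: List[Dict]) -> Dict[str, float]:
    """Compute function, result, and outcome accuracy over single-turn episodes.

    Each episode dict is assumed to hold 'called_function', 'gold_function',
    'returned_items', and 'target_item' (hypothetical field names).
    """
    n = len(episodes)
    function_acc = sum(e["called_function"] == e["gold_function"] for e in episodes) / n
    # Result accuracy: results are correct AND the expected function was invoked (assumption).
    result_acc = sum(
        e["called_function"] == e["gold_function"] and hit(e["returned_items"], e["target_item"])
        for e in episodes
    ) / n
    # Outcome accuracy: results are correct regardless of which function was invoked.
    outcome_acc = sum(hit(e["returned_items"], e["target_item"]) for e in episodes) / n
    return {"function_acc": function_acc, "result_acc": result_acc, "outcome_acc": outcome_acc}
```

Under this reading, an agent that answers a recommendation-style instruction via the search function can still score on outcome accuracy, which is exactly the user-centric behavior the metric is meant to capture.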
6.2.7. Analysis on search function implementation
The following are the results from [Table 8] of the original paper:
| Method | Result Acc. (BM25) | Result Acc. (Dense Retrieval) |
|---|---|---|
| No Memory | 0.647 | 0.502 |
| Random Memory | 0.640 | 0.504 |
| Last Memory | 0.626 | 0.498 |
| Relevant Memory | 0.622 | 0.499 |
| ReAct [56] | 0.605 | 0.496 |
| RecMind [49] | 0.645 | 0.498 |
| PUMA | 0.649 | 0.506 |
Table 8: Comparison of search result accuracy using BM25 and Dense retrieval methods in single-turn track.
The search function implementation in the benchmark is flexible. The paper conducted an alternative retrieval experiment in the single-turn track, replacing BM25 (a sparse retrieval model) with a dense retrieval model based on Sentence-BERT [38].
- Performance Comparison (Table 8): Dense retrieval leads to a noticeable degradation in result accuracy across all methods compared to BM25; for example, No Memory drops from 0.647 (BM25) to 0.502 (dense retrieval), and PUMA drops from 0.649 to 0.506. This is attributed to dense retrieval capturing richer semantic representations while also introducing noise by embedding extensive product details, which do not always align with the direct search intent in this setup.
- PUMA's Robustness: Despite the variation across retrieval methods, PUMA consistently outperforms all baselines under both BM25 and dense retrieval. This demonstrates PUMA's robustness and its ability to exploit the underlying search function regardless of its implementation; the modular design also leaves room for exploring other retrieval models and recommendation strategies in the future. A minimal sketch of the two retrieval variants follows this list.
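To make the two search-function variants concrete, here is a minimal, illustrative sketch of sparse versus dense retrieval over a toy product corpus. The `rank_bm25` and `sentence-transformers` libraries and the `all-MiniLM-L6-v2` checkpoint are assumptions for illustration; the paper specifies only BM25 and a Sentence-BERT-based dense retriever, not these exact tools.

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

# Toy product corpus standing in for the benchmark's product database.
products = [
    "wireless noise-cancelling headphones with 30h battery",
    "organic green tea, 100 bags, low caffeine blend",
    "lightweight running shoes, size 10, breathable mesh",
]

def bm25_search(query: str, k: int = 2) -> list:
    """Sparse retrieval: rank products by BM25 score over whitespace tokens."""
    bm25 = BM25Okapi([p.split() for p in products])
    scores = bm25.get_scores(query.split())
    order = sorted(range(len(products)), key=lambda i: scores[i], reverse=True)
    return [products[i] for i in order[:k]]

def dense_search(query: str, k: int = 2) -> list:
    """Dense retrieval: rank products by cosine similarity of sentence embeddings."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative checkpoint, not the paper's
    doc_emb = model.encode(products, convert_to_tensor=True)
    query_emb = model.encode(query, convert_to_tensor=True)
    sims = util.cos_sim(query_emb, doc_emb)[0]
    order = sims.argsort(descending=True)[:k]
    return [products[int(i)] for i in order]

print(bm25_search("noise cancelling headphones"))
print(dense_search("noise cancelling headphones"))
```

Because the agent only generates the query (the function parameters), either retriever can be swapped in behind the same search function, which is why PUMA's advantage persists across both implementations.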
6.2.8. Analysis on zero-shot and few-shot performance
The following are the results from [Table 9] of the original paper:
| Method | Search Func. Acc. | Search Res. Acc. | Rec. Func. Acc. | Rec. Res. Acc. | Review Func. Acc. | Review Res. Acc. | Overall Func. Acc. | Overall Res. Acc. |
|---|---|---|---|---|---|---|---|---|
| No Memory | 1.000 | 0.684 | 0.050 | 0.000 | 1.000 | 0.388 | 0.625 | 0.328 |
| Random Memory | 0.974 | 0.684 | 0.301 | 0.060 | 0.996 | 0.391 | 0.715 | 0.352 |
| Last Memory | 1.000 | 0.683 | 0.314 | 0.058 | 1.000 | 0.396 | 0.730 | 0.353 |
| Relevant Memory | 0.928 | 0.675 | 0.405 | 0.078 | 1.000 | 0.397 | 0.743 | 0.358 |
| ReAct [56] | 0.945 | 0.675 | 0.475 | 0.080 | 0.996 | 0.393 | 0.774 | 0.358 |
| RecMind [49] | 0.973 | 0.680 | 0.320 | 0.063 | 0.996 | 0.394 | 0.722 | 0.354 |
| PUMA | 1.000 | 0.686 | 0.892 | 0.090 | 1.000 | 0.396 | 0.958 | 0.366 |
Table 9: Performance comparison in zero-shot and few-shot scenarios in single-turn track.
To evaluate performance in zero-shot and few-shot scenarios, the paper analyzed 139 users (16.2% of the test set) with fewer than 10 historical records in the single-turn track.
- Task-dependent Effects:
  - Search: Search performance remained stable or slightly improved, with result accuracy (e.g., PUMA at 0.686) comparable to the full-data scenario. This may be due to a reduction in potentially irrelevant historical information, allowing the agent to focus on the instruction itself.
  - Recommendation: Recommendation performance also improved (e.g., PUMA's result accuracy rose from 0.054 to 0.090), which is counter-intuitive for a memory-dependent task. The paper suggests that limited memory simplifies retrieval, making it easier for the agent to pinpoint relevant items from a smaller, less noisy set.
  - Review: Review tasks showed a decline in performance (e.g., PUMA's result accuracy dropped from 0.538 to 0.396). The lack of past reviews for these users hinders the agent's ability to generate truly personalized, contextually rich responses, since review generation relies heavily on a user's past expression style and preferences.
- PUMA's Consistent Superiority: Despite these variations, PUMA consistently outperformed all baselines, achieving the highest function accuracy and result accuracy across tasks and particularly excelling in recommendation. This demonstrates PUMA's adaptability when user history is sparse and its ability to exploit even limited personalized data more effectively than other methods.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper marks a significant advancement in the field of Web agents by introducing the concept of LLM-empowered personalized Web agents. It formally articulates the task of integrating personalized user data (profiles, historical behaviors) to achieve nuanced instruction understanding and customized action execution. To facilitate research and development in this new domain, the authors constructed PersonalWAB, the first comprehensive benchmark for personalized Web agents, encompassing diverse users, three personalized Web tasks (search, recommendation, review), callable Web functions, and supporting both single-turn and multi-turn evaluations. Furthermore, the paper proposes PUMA, a novel framework that enhances LLMs for this task through a user memory bank with task-specific retrieval and function parameter optimization via supervised fine-tuning (SFT) and Direct Preference Optimization (DPO). Extensive experiments on PersonalWAB robustly demonstrate PUMA's superior performance over existing Web agents, affirming its capacity to align better with personalized user instructions and preferences. This work lays foundational groundwork, expanding the research scope and introducing new challenges for future Web agent scenarios.
7.2. Limitations & Future Work
The authors acknowledge several limitations and outline promising avenues for future research:
- Benchmark Expansion: PersonalWAB could be extended to a more diverse range of task scenarios, further challenging and evaluating the personalization capabilities of Web agents. This implies tasks beyond shopping, covering broader Web interactions.
- Sophisticated User Modeling: Future work could integrate more advanced user modeling techniques, such as dynamic preference learning, to enhance agents' adaptability to evolving user needs and preferences over time, moving beyond static profiles.
- User-in-the-Loop Settings: Exploring user-in-the-loop settings presents an exciting opportunity: agents that better integrate user feedback, proactively identify missing information, and actively engage with users to request necessary details could significantly improve the effectiveness and efficiency of task execution.
- Ethical and Privacy Considerations: The paper explicitly flags the importance of ethical and privacy considerations. The use of personalized data can introduce biases (e.g., popularity bias [2]) and raise data security concerns. Future work should pursue fairness-aware personalization, diversity-promoting strategies, and privacy-preserving techniques to mitigate these risks.
- Scope Generalization: While the current work focuses on the shopping domain, the framework is generalizable. Extending it to broader Web environments (e.g., news recommendation, social media content curation) introduces additional complexities that require further investigation.
7.3. Personal Insights & Critique
This paper makes a compelling case for the next generation of Web agents: truly personalized agents. The core idea of bringing personalized data into the LLM-agent loop is highly intuitive and addresses a clear gap in existing research. My key insights are:
- Practical Relevance: The shopping domain is an excellent choice for demonstrating personalization, given its rich, quantifiable user behavior data and direct impact on user experience. An agent that can anticipate a user's price sensitivity, brand preference, or even review style represents a tangible leap in utility.
- Methodological Rigor: The construction of PersonalWAB is a substantial contribution. The attention to detail, from user sampling and LLM-based profile generation to instruction creation and Web function abstraction, provides a robust and replicable foundation. The profile consistency evaluation further validates the quality of the synthetic data, a crucial step for any benchmark relying on synthetic elements.
- PUMA's Design: PUMA's multi-stage approach, combining task-specific memory retrieval, SFT for foundational parameter generation, and DPO for preference alignment, is well thought out. The ablation studies clearly demonstrate the incremental value of each component. The use of DPO is particularly elegant, directly optimizing for user preferences without the complexity of a separate reward model.
- Efficiency Aspect: The focus on efficiency is highly practical; in real-world Web interactions, latency directly shapes user experience. PUMA's ability to achieve strong performance with a smaller model (LLaMA-7B) and a compact memory structure is a significant advantage for deployment.
-
Dynamic Profile Updates: While
PUMAuses historical data, theuser profilegeneration is somewhat static. Future work could explore how the agent dynamically learns and updatesuser profilesduring interactions, especially inmulti-turn scenarios. This would make the personalization even more adaptive. -
Handling Conflicting Preferences: Users can have complex, sometimes contradictory, preferences. How would a
personalized Web agenthandle conflicting signals inhistorical dataor ambiguousinstructions? For example, a user who values budget but occasionally splurges. -
Explainability of Personalization: As agents become more personalized, their decisions might become less transparent. Providing explanations for why a particular recommendation was made or why certain search parameters were chosen could enhance user trust and control.
-
Real-time Adaptation to External Factors:
Personalized agentscould benefit from integrating external real-time factors (e.g., current events, weather, social trends) that might influence a user's immediate preferences, even if not explicitly in their historical data. -
Robustness to Adversarial Instructions: How robust is the
personalizationto subtle adversarial or manipulativeinstructions? Ensuring that the agent acts truly in the user's best interest is paramount.Overall, this paper is a foundational piece for
personalized Web agents. Its clear task formulation, robust benchmark, and effective framework will undoubtedly inspire extensive follow-up research and bring us closer to a future whereWeb agentsare not just smart, but trulypersonal.
-
Similar papers
Recommended via semantic vector search.