
Large Language Models Empowered Personalized Web Agents

Published: 10/23/2024

TL;DR Summary

This work formulates LLM-powered personalized Web agents, integrating user data to improve instruction comprehension and execution. It introduces PersonalWAB benchmark and a memory-enhanced alignment framework PUMA for more accurate personalization.

Abstract

Web agents have emerged as a promising direction to automate Web task completion based on user instructions, significantly enhancing user experience. Recently, Web agents have evolved from traditional agents to Large Language Models (LLMs)-based Web agents. Despite their success, existing LLM-based Web agents overlook the importance of personalized data (e.g., user profiles and historical Web behaviors) in assisting the understanding of users' personalized instructions and executing customized actions. To overcome the limitation, we first formulate the task of LLM-empowered personalized Web agents, which integrate personalized data and user instructions to personalize instruction comprehension and action execution. To address the absence of a comprehensive evaluation benchmark, we construct a Personalized Web Agent Benchmark (PersonalWAB), featuring user instructions, personalized user data, Web functions, and two evaluation paradigms across three personalized Web tasks. Moreover, we propose a Personalized User Memory-enhanced Alignment (PUMA) framework to adapt LLMs to the personalized Web agent task. PUMA utilizes a memory bank with a task-specific retrieval strategy to filter relevant historical Web behaviors. Based on the behaviors, PUMA then aligns LLMs for personalized action execution through fine-tuning and direct preference optimization. Extensive experiments validate the superiority of PUMA over existing Web agents on PersonalWAB.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of this paper is "Large Language Models Empowered Personalized Web Agents," which focuses on enhancing Web agents with personalization capabilities using large language models.

1.2. Authors

The authors are:

  • Hongru Cai (National University of Singapore, Singapore)
  • Yongqi Li (The Hong Kong Polytechnic University, Hong Kong SAR, China)
  • Wenjie Wang (University of Science and Technology of China, Hefei, China)
  • Fengbin Zhu (National University of Singapore, Singapore)
  • Xiaoyu Shen (Eastern Institute of Technology, Ningbo, Ningbo, China)
  • Wenjie Li (The Hong Kong Polytechnic University, Hong Kong SAR, China)
  • Tat-Seng Chua (National University of Singapore, Singapore)

1.3. Journal/Conference

This paper is slated for publication in the Proceedings of the ACM Web Conference 2025 (WWW '25), April 28-May 2, 2025, Sydney, NSW, Australia. WWW is a highly prestigious and influential conference in the field of the World Wide Web, covering topics such as Web technologies, applications, and their societal impact. Publication at WWW signifies high-quality research and significant contributions to the field.

1.4. Publication Year

The paper was posted to arXiv on October 22, 2024 (2024-10-22T17:54:45.000Z), with the ACM reference indicating publication in 2025.

1.5. Abstract

This paper introduces the concept of LLM-empowered personalized Web agents, which leverage personalized user data (e.g., user profiles and historical Web behaviors) to improve the understanding of user instructions and execute customized actions. Recognizing the lack of a suitable evaluation standard, the authors developed PersonalWAB, the first comprehensive benchmark for this task. PersonalWAB includes user instructions, personalized user data, Web functions, and supports both single-turn and multi-turn evaluation across three personalized Web tasks. Furthermore, the paper proposes Personalized User Memory-enhanced Alignment (PUMA), a framework that adapts Large Language Models (LLMs) for this task. PUMA utilizes a memory bank with a task-specific retrieval strategy to filter relevant historical behaviors, and then aligns LLMs for personalized action execution through fine-tuning and direct preference optimization (DPO). Extensive experiments on PersonalWAB demonstrate PUMA's superior performance compared to existing Web agents.

Official Source: https://arxiv.org/abs/2410.17236
PDF Link: https://arxiv.org/pdf/2410.17236v2.pdf
Publication Status: This paper is available as a preprint on arXiv, with an ACM reference indicating it is accepted for publication at WWW '25.

2. Executive Summary

2.1. Background & Motivation

The World Wide Web has become an indispensable part of daily life, offering a multitude of services from information retrieval to online shopping. However, the sheer scale and complexity of modern Web services pose significant challenges for users, particularly those who struggle with vast amounts of unstructured data and intricate interactions. Web agents emerged as a promising solution to automate these Web tasks based on user instructions, thereby enhancing user experience and efficiency.

Initially, Web agents were primarily built using reinforcement learning techniques for Web navigation, but their limited context understanding and reasoning capabilities restricted their generalization to complex and novel scenarios. The advent of Large Language Models (LLMs) has revolutionized this field, endowing Web agents with powerful understanding, planning, and reasoning capabilities. Modern LLM-based Web agents utilize techniques like in-context learning, fine-tuning, and reinforcement learning to improve their instruction-following abilities, and some even support multi-turn interactions for conversational Web navigation.

Despite these advancements, existing LLM-based Web agents largely overlook a crucial aspect: personalization. User experience can be significantly enhanced by incorporating personalized data such as user profiles and historical Web behaviors. This personalized data reveals implicit user preferences, which can:

  1. Supplement user context for personalized instruction comprehension: Users often don't explicitly state all their preferences (e.g., a price range for a product search). Personalized data can fill these gaps.

  2. Enable personalized action execution: Different users have varying habits and preferences for Web services, leading to customized function calls with tailored parameters.

The core problem this paper aims to solve is the lack of personalization in current LLM-based Web agents. The field currently lacks both a systematic formulation of the LLM-empowered personalized Web agent task and a comprehensive benchmark for its training and evaluation. Without these, the development of Web agents that truly understand and cater to individual user needs remains limited.

The paper's entry point is to formalize this personalized Web agent task and bridge the gap by constructing the first dedicated benchmark, PersonalWAB, and proposing a novel framework, PUMA, to effectively address it.

2.2. Main Contributions / Findings

The paper makes several significant contributions:

  • Task Formulation: It formally defines the task of LLM-empowered personalized Web agents, emphasizing the integration of personalized user data for both instruction comprehension and action execution. This bridges the gap between generic Web agents and customized Web services.
  • Benchmark Construction (PersonalWAB): The authors construct the first benchmark specifically designed for LLM-empowered personalized Web agents. PersonalWAB features:
    • A diverse set of 1,000 users with simulated profiles and real historical Web behaviors.
    • Instructions for three personalized Web tasks: search, recommendation, and review generation.
    • A set of callable Web functions to interact with the environment.
    • Two distinct evaluation paradigms: single-turn and multi-turn interaction, with the latter utilizing an LLM-based user simulator.
  • Framework Proposal (PUMA): The paper introduces Personalized User Memory-enhanced Alignment (PUMA), a novel framework designed to adapt LLMs for the personalized Web agent task. PUMA's key components include:
    • A user memory bank to store long-term historical behaviors.
    • A task-specific retrieval strategy to filter relevant information from the memory.
    • Strategies for function parameter optimization using supervised fine-tuning (SFT) with heuristically constructed pseudo-labels and Direct Preference Optimization (DPO) for alignment with personalized user preferences.
  • Extensive Validation: Through extensive experiments on PersonalWAB, the paper demonstrates that PUMA consistently outperforms existing Web agents across both single-turn and multi-turn personalized Web tasks. This validates PUMA's effectiveness in better aligning with personalized user instructions and preferences, showcasing the potential for more intelligent, customized, and user-centered Web services.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

Web Agents

At its core, a Web agent is an intelligent software program designed to automate tasks on the internet. It acts on behalf of a user to achieve specific goals, such as finding information, making a purchase, or filling out a form, by interacting with Web services or Web interfaces.

  • Traditional Web Agents: These agents often rely on predefined rules, scripts, or reinforcement learning techniques to navigate Web UIs (User Interfaces). They typically learn optimal sequences of actions (e.g., clicks, text inputs) to complete tasks. Their context understanding and reasoning capabilities are generally limited, making it hard for them to adapt to new or complex Web environments or out-of-distribution scenarios.
  • LLM-based Web Agents: With the rise of Large Language Models (LLMs), Web agents have evolved significantly. LLMs possess extensive world knowledge, strong understanding, planning, and reasoning capabilities. These agents can interpret natural language instructions, generate plans, and execute actions by interacting with Web elements or by calling functions that abstract Web services. Techniques like in-context learning (providing examples in the prompt), fine-tuning (adapting the LLM with task-specific data), and reinforcement learning are employed to enhance their performance.

Personalization

Personalization refers to tailoring a system's behavior, content, or services to individual users based on their unique characteristics, preferences, or historical interactions. In the context of Web agents, personalization means that the agent's actions and responses would adapt to a specific user, rather than providing a generic solution.

  • User Profile: A collection of static attributes about a user, such as demographics (gender, age, occupation), interests, and stated preferences.
  • Historical Web Behaviors: Records of a user's past interactions with Web services, such as purchase history, search queries, ratings, reviews, and browsing patterns. This data often reveals implicit preferences that are not explicitly stated in a user profile.
  • Implicit Preferences: Preferences that are inferred from a user's behavior rather than explicitly declared. For example, consistently buying products from a certain brand implies a brand preference.

Large Language Models (LLMs)

LLMs are deep learning models, often based on the Transformer architecture, trained on vast amounts of text data. They are capable of understanding, generating, and processing human language with remarkable fluency and coherence.

  • Fine-tuning: A process where a pre-trained LLM is further trained on a smaller, task-specific dataset. This allows the model to adapt its general knowledge to the nuances of a particular task or domain, improving its performance for that specific application.
  • Direct Preference Optimization (DPO): A technique for aligning LLMs with human preferences that simplifies the reinforcement learning from human feedback (RLHF) pipeline. Instead of explicitly training a separate reward model, DPO directly optimizes the LLM's policy on pairwise preference data (e.g., response A is preferred over response B) by translating the preferences into a loss function that can be applied directly during fine-tuning.
  • In-context Learning: The ability of LLMs to learn new tasks or adapt to new instructions based on examples provided within the prompt itself, without requiring explicit fine-tuning of the model weights.
  • Cosine Similarity: A metric used to measure the similarity between two non-zero vectors in an inner product space. It is often used in natural language processing to determine how similar two pieces of text are by comparing their vector embeddings. A cosine similarity of 1 means the vectors are identical in direction, 0 means they are orthogonal (no similarity), and -1 means they are opposite. The formula for cosine similarity between two vectors $\mathbf{A}$ and $\mathbf{B}$ is: $ \text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|} = \frac{\sum_{i=1}^n A_i B_i}{\sqrt{\sum_{i=1}^n A_i^2} \, \sqrt{\sum_{i=1}^n B_i^2}} $ Where:
    • $\mathbf{A} \cdot \mathbf{B}$ is the dot product of vectors $\mathbf{A}$ and $\mathbf{B}$.
    • $\|\mathbf{A}\|$ and $\|\mathbf{B}\|$ are the Euclidean magnitudes (L2 norms) of vectors $\mathbf{A}$ and $\mathbf{B}$.
    • $A_i$ and $B_i$ are the components of vectors $\mathbf{A}$ and $\mathbf{B}$, respectively.
  • Sentence Embeddings: Numerical vector representations of sentences that capture their semantic meaning. Models like Sentence-BERT are trained to produce these embeddings such that semantically similar sentences are mapped to nearby points in the vector space.
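
As a quick illustration of both concepts, the snippet below embeds two short texts and computes their cosine similarity; this is a minimal sketch, and the sentence-transformers checkpoint is an arbitrary choice for the example rather than one specified by the paper.

```python
# Cosine similarity between sentence embeddings (checkpoint is illustrative).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) = (A . B) / (||A|| ||B||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

instruction = "Find wireless earbuds with good noise cancellation under $100."
memory_entry = "Purchased: wireless noise-cancelling earbuds, rated 5 stars."

emb = model.encode([instruction, memory_entry])
print(cosine_similarity(emb[0], emb[1]))  # value in [-1, 1]; higher means more similar
```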

3.2. Previous Works

The paper contextualizes its work by reviewing two main lines of research: Web agents and Personalized LLMs.

Web Agents

Previous research on Web agents has largely focused on automating Web tasks.

  • Traditional Web Agents: Early work focused on reinforcement learning for Web navigation. Examples include MiniWoB++ [27], which provided a platform for agents to complete tasks using keyboard and mouse interactions on website widgets, and WebShop [55], which introduced a simulated e-commerce environment. These systems primarily dealt with predefined Web UIs and struggled with generalization.

  • LLM-based Web Agents: More recent advancements leverage LLMs for improved context understanding and reasoning. These studies explore automating tasks in more complex settings:

    • Multi-domain & Multi-hop: Mind2Web [6] explores generalist agents for the Web across multiple domains. MMInA [58] focuses on multihop multimodal Internet agents.
    • Real-time Interactions & Visual Understanding: WebArena [62] provides a realistic Web environment for autonomous agents, and WebVoyager [14] and VWA [22] focus on visual UI understanding and multimodal agents.
    • Enhancement Techniques: Researchers have applied fine-tuning [11, 13], prompting [42, 56, 59], and reinforcement learning [34] to LLMs for Web agent tasks.
  • Conversational & Multi-turn Agents: A distinct direction integrates user interactions into the agent's execution process. META-GUI [46] focuses on mobile app automation with conversational instructions. RUSS [54] and WebLINX [29] design datasets for dialogue-centric Web navigation. MT-Mind2Web [7] extends Mind2Web to multi-turn instruction following. ChatShop [4] explores interactive information seeking with language agents using Web functions. WorkArena [9] evaluates Web agents on common knowledge work tasks in multi-turn settings.

    Differentiation: Despite these advancements, the paper highlights that prior Web agent research, including LLM-based ones, overlooks the dimension of personalization. While some, like WebArena [62], simulate users with distinct roles, these roles are predefined and do not require the agent to understand user preferences or adjust strategy based on them. This paper explicitly focuses on LLM-empowered personalized Web agents, which is a novel emphasis.

Personalized LLMs

This field focuses on LLMs that adapt to individual users' needs by handling user personas (background, historical behaviors).

  • Personalized Content Generation: This category addresses generating content tailored to users. Examples include using publicly available user data (Reddit [50], Facebook, Twitter [43], blogs [21]) for pre-training LLMs. Tasks include stance classification, demographic inference [44], and personalized sentiment prediction [31]. Benchmarks like LaMP [39] and LongLaMP [23] provide datasets for evaluating personalized text classification and content generation.

  • User-facing Applications: This includes personalized dialogue systems. Datasets have been built by crowd-workers authoring dialogues based on personas [57] or extracting attributes from social media (Reddit [30], Weibo [61]). Apollonion [5] dynamically updates user profiles for personalized responses. Memory mechanisms [24, 28, 52] help models recall past conversations and important events. Personalized LLMs are also applied in specialized domains like healthcare [1, 18], education [8, 40], and robotics [51].

    Differentiation: The paper notes that previous personalized LLM studies have not explored personalized function calls tailored to user-specific needs in Web environments. This work bridges this gap by emphasizing adapting agents' actions based on personalized user data to complete personalized tasks within Web environments.

3.3. Technological Evolution

The technological landscape for Web agents has evolved from rudimentary, rule-based systems to sophisticated LLM-driven intelligent agents.

  1. Early Web Automation (Pre-LLM Era): This involved scripting languages, web scrapers, and bots for repetitive tasks. Reinforcement learning later offered more adaptive approaches, allowing agents to learn optimal interactions with Web UIs (e.g., MiniWoB++). However, these agents were brittle, struggling with dynamic Web pages and novel task instructions due to limited understanding and generalization.
  2. LLM Integration (Current Era): The emergence of powerful LLMs like GPT-3, GPT-4, and Llama 2 marked a paradigm shift. LLMs brought unprecedented natural language understanding, reasoning, and planning capabilities. This allowed Web agents to interpret complex natural language instructions, generate action plans, and interact with Web services at a higher level of abstraction (e.g., WebArena, Mind2Web). Techniques like in-context learning and fine-tuning became central to adapting LLMs for Web agent tasks. Multi-turn dialogues also became feasible, enabling more interactive Web navigation.
  3. Personalization (This Paper's Contribution): This paper introduces the next logical step: personalization. While LLMs enhanced general Web agent capabilities, they typically treated all users generically. This work integrates personalized user data (profiles, historical behaviors) into LLM-based Web agents. This enables agents to not just follow instructions, but to comprehend personalized instructions (inferring implicit preferences) and execute customized actions (making personalized function calls). This moves beyond generic intelligence to truly user-centric intelligence, aiming to provide services that anticipate and align with individual user needs. The PUMA framework and PersonalWAB benchmark are designed specifically for this advanced stage of Web agent evolution.

4. Methodology

The core idea of this paper is to advance LLM-based Web agents by integrating personalized user data to achieve personalized instruction understanding and action execution. The paper first formulates the task and then proposes the PUMA (Personalized User Memory-enhanced Alignment) framework to address it.

4.1. Principles

The fundamental principle behind the proposed methodology is that personalized user data (such as user profiles and historical Web behaviors) holds crucial information about a user's implicit preferences. By leveraging this data, an LLM-based Web agent can move beyond merely following explicit instructions to understand underlying user needs and execute actions that are customized and optimal for that specific user. This involves two main challenges: first, making the LLM select the correct Web function, and second, generating the most personalized and effective parameters for that function.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Task Formulation

LLM-empowered personalized Web agents act as intermediaries between users and Web services. The task is formally defined by considering the following elements:

  • User ($u$): Each user $u \in \mathcal{U}$ is unique, possessing a distinct profile $P_u$ and historical Web behaviors $H_u$.
    • $P_u$: This profile contains static attributes like demographics.
    • $H_u$: This records the user's past Web behaviors as a time-ordered sequence $\{ h_u^1, h_u^2, \ldots, h_u^N \}$. Each $h_u^i$ represents a single Web behavior, such as a purchase or a review.
  • Instruction ($i_u$): This is a natural language sentence provided by the user, expressing their specific needs and requirements.
  • Web Environment ($\mathcal{T}$): This is abstracted as a collection of Web functions.
    • $f \in \mathcal{T}$: Each function $f$ can be invoked.

    • $\mathcal{P}$: An input parameter required to invoke a function.

    • $O_{f_p}$: The corresponding result returned by invoking function $f$ with parameter $\mathcal{P}$. Notably, different input parameters yield different function results.

      Goal: Given the user instruction $i_u$ and the personalized data ($P_u$ and $H_u$), the LLM-empowered personalized Web agent aims to:

  1. Select the appropriate Web function ($f$).
  2. Determine the optimal parameter ($\mathcal{P}$) to invoke personalized results ($O_{f_p}$) from the Web environment.
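
To make the formulation concrete, the following is a minimal, hypothetical sketch of the task elements as data structures plus a bare agent loop; all names and the trivial default policy are illustrative and not taken from the paper's implementation.

```python
# Hypothetical data structures for the task formulation (names are illustrative).
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class WebBehavior:
    """One historical Web behavior h_u^i, e.g. a purchase or a review."""
    kind: str                   # "purchase" or "review"
    details: Dict[str, Any]     # product attributes, rating, comment, ...

@dataclass
class User:
    profile: Dict[str, str]                                    # P_u: static attributes
    history: List[WebBehavior] = field(default_factory=list)   # H_u: time-ordered behaviors

# The Web environment T is abstracted as named callable functions f;
# invoking f with parameter p returns the result O_{f,p}.
WebEnvironment = Dict[str, Callable[[str], Any]]

def select_function_and_parameter(user: User, instruction: str) -> tuple:
    """Placeholder policy. In PUMA this is a fine-tuned LLM; here we simply forward
    the instruction to the search function as a trivial default."""
    return "search_product_by_query", instruction

def run_agent(user: User, instruction: str, env: WebEnvironment) -> Any:
    """The agent must (1) select a function f and (2) generate a personalized parameter."""
    f_name, parameter = select_function_and_parameter(user, instruction)
    return env[f_name](parameter)
```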

4.2.2. PUMA Framework Overview

The PUMA (Personalized User Memory-enhanced Alignment) framework is designed to enable LLM-empowered personalized Web agents to effectively complete tasks based on user instructions. As illustrated in Figure 6, PUMA consists of two main steps: Web function identification and function parameter generation.


Figure 6: Illustration of the PUMA framework, consisting of two main steps: Web Function Identification and Parameter Generation, which includes Task-specific Memory Retrieval and Function Parameter Optimization.

1. Web Function Identification: The first step involves identifying the correct Web function that the user's instruction intends to invoke.

  • A Large Language Model (LLM) (e.g., LLaMa-2-7b) is fine-tuned using "instruction-function" pairs from a training dataset. This training equips the LLM with the ability to map a given user instruction to the most appropriate Web function from the available set (e.g., search_product_by_query, get_recommendations_by_history, add_product_review).
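
As a rough illustration, an "instruction-function" fine-tuning pair could look like the example below; the prompt layout is an assumption (only the function names appear in the paper).

```python
# Hypothetical SFT example for Web function identification (layout is assumed).
sft_example = {
    "prompt": (
        "Available functions: search_product_by_query, "
        "get_recommendations_by_history, add_product_review\n"
        "Instruction: I'm running low on my usual decaf coffee pods, can you find some?\n"
        "Which Web function should be called?"
    ),
    "response": "search_product_by_query",
}
```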

2. Function Parameter Generation: Once the correct Web function is identified, the next and more complex step is to generate the appropriate parameters for that function, taking into account the user's personalized data. This step is further broken down into two sub-components: Task-specific Memory Retrieval and Function Parameter Optimization.

2.1. Task-specific Memory Retrieval

This component is responsible for collecting and filtering relevant personalized data for the LLM to use in generating function parameters.

  • Long-term Memory Bank: This is a storage system that maintains a detailed record of each user's historical Web behaviors. For a user $u$, this bank stores information about their purchased products ($h_{\mathit{purchase}}$) and associated reviews ($h_{\mathit{review}}$), collectively denoted as $m$.
    • Product details include attributes like "title", "price", "store", and other relevant metadata.
    • Review details encompass the "rating", "review title", and "comment" provided by the user.
    • Formally, if user $u$ has purchased $n$ products, their long-term memory $M$ is represented as: $ M = \{ m_i \mid i = 1, 2, \ldots, n \} $ where each $m_i$ corresponds to a specific historical behavior or product interaction.
  • Task-specific Memory Retrieval Strategy: This strategy extracts only the most relevant information from the long-term memory bank based on the user's current instruction and the identified function.
    1. Top-K Retrieval: Given a user instruction $i$ and the identified Web function $f$, the system first retrieves the top $K$ memory entries by computing the cosine similarity between the instruction $i$ and each memory entry $m_j$ in the bank $M$. This narrows the vast memory bank down to potentially relevant past behaviors.
    2. Targeted Feature Extraction: Based on the specific identified Web function $f$, more targeted features are then extracted from these retrieved memory entries.
      • If $f$ is a search function: product details such as "product title", "category", "price", and "store" are extracted.
      • If $f$ is a recommendation function: product details like "title", "category", and "parent ASIN" (a unique product identifier) are retained.
      • If $f$ is a review function: only the user's past "ratings" and "comments" are kept.
    • This process is formally defined as: $ M_i = \mathrm{Extract}\left( \mathrm{TopK}\left( M, \mathrm{sim}(i, m_j), K \right), f \right) $ Where:
      • $M_i$ represents the task-specific memory constructed for instruction $i$.
      • $\mathrm{Extract}(\cdot, f)$ is a function that extracts targeted features based on the identified Web function $f$.
      • $\mathrm{TopK}(M, \mathrm{sim}(i, m_j), K)$ selects the $K$ memory entries from $M$ with the highest cosine similarity $\mathrm{sim}(i, m_j)$ to the instruction $i$.
      • $\mathrm{sim}(i, m_j)$ is the cosine similarity between the instruction $i$ and memory entry $m_j$.
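
The retrieval step above can be sketched as follows; the embedding model and dictionary keys are assumptions, while the Top-K cosine-similarity retrieval and the per-function feature lists follow the description in the paper.

```python
# Sketch of task-specific memory retrieval: Top-K entries by cosine similarity,
# then function-specific feature extraction. Keys and encoder are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder checkpoint

# Features retained per identified Web function (per the paper's description).
FEATURES = {
    "search_product_by_query": ["title", "category", "price", "store"],
    "get_recommendations_by_history": ["title", "category", "parent_asin"],
    "add_product_review": ["rating", "comment"],
}

def retrieve_task_specific_memory(instruction: str, memory: list, function: str, k: int = 10):
    """M_i = Extract(TopK(M, sim(i, m_j), K), f)."""
    embeddings = encoder.encode([instruction] + [str(m) for m in memory])
    query, entries = embeddings[0], embeddings[1:]
    sims = entries @ query / (np.linalg.norm(entries, axis=1) * np.linalg.norm(query))
    top_idx = np.argsort(-sims)[:k]                      # TopK by cosine similarity
    keep = FEATURES[function]
    return [{key: memory[j].get(key) for key in keep} for j in top_idx]  # Extract(., f)
```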

2.2. Function Parameter Optimization

After obtaining the task-specific memory $M_i$, this component focuses on generating Web function parameters that are not only reasonable but also optimally aligned with user preferences. This is achieved through a combination of Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO).

  • Heuristic Fine-tuning for Parameter Generation:

    • The LLM is initially equipped with a foundational ability to generate reasonable parameters through SFT.
    • The inputs for SFT are structured as a combination of the user instruction $i$, the task-specific memory $M_i$, and the identified Web function $f$.
    • The labels for this SFT are Web function parameters constructed using heuristic methods tailored to each Web function:
      1. For the search function: ChatGPT is used to generate textual queries based on the instruction and memory. These generated queries serve as pseudo-labels.
      2. For recommendation functions: The pseudo-labels consist of the most recent product ASINs (Amazon Standard Identification Numbers) from the same category found in $M_i$.
      3. For review functions: The actual review text provided by the dataset is used as the labels.
    • These heuristics help create meaningful pseudo-labels for parameter generation, ensuring the model learns to produce plausible and contextually appropriate function parameters.
  • Diverse Parameter Sampling for Pair-wise Optimization (DPO):

    • After SFT provides a baseline capability, the model's performance is further enhanced using Direct Preference Optimization (DPO) [36] over a diverse set of parameter candidates.
    • Candidate Generation: A diverse set of function parameters is first generated from the SFT-tuned LLM using high-temperature sampling (to increase output variability) and beam search (to explore multiple plausible sequences).
    • Pair-wise Preference Data Construction: These candidate parameters are then evaluated based on their result accuracy for instruction completion. For each instruction $i$, the best-performing parameters ($p_i^{\mathrm{b}}$) and the worst-performing parameters ($p_i^{\mathrm{w}}$) are identified and paired.
    • This pair-wise preference data $\mathcal{D}_{\mathrm{DPO}}$ is formally defined as: $ \mathcal{D}_{\mathrm{DPO}} = \left\{ \left( p_i^{\mathrm{b}}, p_i^{\mathrm{w}}, x_i \right) \right\} $ Where:
      • $p_i^{\mathrm{b}}$ represents the best-performing function parameters for instruction $i$.
      • $p_i^{\mathrm{w}}$ represents the worst-performing function parameters for instruction $i$.
      • $x_i$ represents the input to the model, which includes the user instruction $i$, the task-specific memory $M_i$, and the Web function $f$.
    • DPO Optimization: DPO is then applied to optimize the SFT-tuned model (which serves as the reference model $\pi_{\mathrm{ref}}$) by encouraging it to generate parameters similar to $p_i^{\mathrm{b}}$ and discouraging it from generating parameters similar to $p_i^{\mathrm{w}}$.
    • The DPO loss is given by: $ \mathcal{L}_{\mathrm{DPO}} = - \mathbb{E} \left[ \log \sigma \left( \beta \log \frac{ \pi_{\theta}(p^{\mathrm{b}} \mid x) }{ \pi_{\mathrm{ref}}(p^{\mathrm{b}} \mid x) } - \beta \log \frac{ \pi_{\theta}(p^{\mathrm{w}} \mid x) }{ \pi_{\mathrm{ref}}(p^{\mathrm{w}} \mid x) } \right) \right] $ Where:
      • $\sigma(\cdot)$ is the sigmoid function, which maps any real-valued number to a value between 0 and 1, used here to model preference probabilities.
      • $\beta$ is a temperature-like parameter that controls the sensitivity of the model's preference to the log-ratio difference between the policy model $\pi_{\theta}$ (the model being optimized) and the reference model $\pi_{\mathrm{ref}}$ (the SFT-tuned model). A higher $\beta$ makes the model more sensitive to preference differences.
      • $\pi_{\theta}(p \mid x)$ is the probability of generating parameters $p$ given input $x$ under the policy model (the one currently being trained).
      • $\pi_{\mathrm{ref}}(p \mid x)$ is the probability of generating parameters $p$ given input $x$ under the reference model (the SFT-tuned version before DPO).
    • This DPO step ensures superior alignment with personalized user preferences by directly optimizing the model to generate preferred outputs and avoid less preferred ones.
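
To illustrate how the preference pairs and the DPO step could be wired together, here is a rough sketch using Hugging Face's datasets and TRL libraries; the sampling and scoring callables are assumptions (the paper samples with high temperature plus beam search and scores candidates by result accuracy), and the TRL argument names may differ across library versions.

```python
# Sketch: building DPO preference pairs (best vs. worst parameters per instruction),
# then optimizing against the SFT model as reference. Callables are assumptions.
from datasets import Dataset
from trl import DPOConfig, DPOTrainer  # noqa: F401  (trainer wiring shown commented out)

def build_preference_pairs(inputs, sample_fn, score_fn, n_samples=10):
    """sample_fn(x, n) -> n candidate parameter strings from the SFT model
    (e.g., high-temperature sampling plus beam search);
    score_fn(x, p) -> result accuracy of executing parameter p for input x."""
    pairs = []
    for x in inputs:
        candidates = sorted(sample_fn(x, n_samples), key=lambda p: score_fn(x, p))
        pairs.append({"prompt": x, "chosen": candidates[-1], "rejected": candidates[0]})
    return Dataset.from_list(pairs)

# dpo_dataset = build_preference_pairs(train_inputs, sample_fn, score_fn)
# trainer = DPOTrainer(model=policy_model, ref_model=sft_model,
#                      args=DPOConfig(beta=0.1, learning_rate=5e-5),
#                      train_dataset=dpo_dataset, processing_class=tokenizer)
# trainer.train()
```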

5. Experimental Setup

5.1. Datasets

The paper constructs the first Personalized Web Agent Benchmark (PersonalWAB) to address the absence of a comprehensive evaluation benchmark. PersonalWAB is built upon the Amazon Review dataset [15], a large-scale collection of users' Web behaviors including purchases and product ratings across various categories.

The construction of PersonalWAB involves the following steps:

  1. Personalized Data Construction:
    • User Sampling: 1,000 diverse users were randomly selected from the Amazon Review dataset across five product categories: Electronics; Home and Kitchen; Grocery and Gourmet Food; Clothing, Shoes, and Jewelry; and Health and Household. For each user, all of their interactions across these categories (detailed purchased-product information and user evaluations) were collected.
    • Data Split: User interactions were chronologically ordered and split: 80% for historical data, 10% for the training set, and the final 10% for the test set.
    • User Profile Generation: Unique profiles for each of the 1,000 users were generated using an LLM (specifically, gpt-4o-mini-2024-07-18) to infer and summarize potential profiles based on their entire behavior history. The prompt template for profile generation is provided in Figure 10 and Figure 11 in the appendix.
      • Example of a generated user profile structure (Figure 11):
        • Basic information: Gender, Age, Occupation (e.g., Male, 35-44, Engineer).
        • Shopping preferences: Price Sensitivity (e.g., Medium: Balanced Buyer), Shopping Interests (summarized product information), Brand Preferences (specific brand names).
        • Behavioral tendencies: Diversity Preference (e.g., Balanced: mix of new and familiar), Interaction Complexity (e.g., Concise: to-the-point reviews), Tone and Style (e.g., Neutral, Objective), Item Reference (keywords related to what they reference), Focus Aspects (e.g., Average Rating, Price, Material).
      • The user profiles support personalized instruction generation and multi-turn evaluation.
  2. User Instruction Creation: LLMs (specifically, claude-3-5-sonnet@20240620) were prompted to generate personalized instructions for each user, based on their profile and real Web behaviors, across three tasks:
    • Search Instructions: Generated based on user profile and product information to search for similar products (Figure 12 for prompt). These vary in length, tone, and specificity.
    • Recommendation Instructions: Tend to be shorter and more general, generated from user profile and integrated products (Figure 13 for prompt).
    • Review Instructions: Generated from user profile, target product info, and actual review text, incorporating personalized requirements (Figure 14 for prompt).
  3. Web Environment Implementation: The Web environment is abstracted as a series of Web functions, simplifying interactions compared to Web GUIs.
    • search_product_by_query: Takes a textual query and returns the 10 most similar products. Implemented using BM25 with Pyserini [26]; a minimal retrieval sketch follows this function list.

    • get_recommendations_by_history: Accepts product IDs and returns 10 recommended products. Implemented by training a SASRec model [19].

    • add_product_review: Requires review text, assumes review is posted.

    • respond: Allows agent-user dialogue.

    • stop: Signals task termination.
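
For illustration, a minimal version of the search function might look like the sketch below; the paper implements it with BM25 via Pyserini, while rank_bm25 and the toy product catalog are substituted here to keep the example self-contained.

```python
# Toy sketch of search_product_by_query: BM25 retrieval over product titles.
# The paper uses Pyserini; rank_bm25 and the toy catalog below are illustrative.
from rank_bm25 import BM25Okapi

products = [
    {"asin": "B001", "title": "Stainless steel French press coffee maker"},
    {"asin": "B002", "title": "Decaf dark roast coffee pods, 60 count"},
    {"asin": "B003", "title": "Wireless noise-cancelling earbuds"},
]
bm25 = BM25Okapi([p["title"].lower().split() for p in products])

def search_product_by_query(query: str, k: int = 10):
    """Return the k products most similar to the textual query."""
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(range(len(products)), key=lambda j: -scores[j])[:k]
    return [products[j] for j in ranked]

print(search_product_by_query("decaf coffee pods"))  # B002 ranked first
```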

      The following are the results from [Table 2] of the original paper:

      Item                          Train     Test
      User
        # Users                     939       1,000
        # Avg. profile tokens       247
        # Avg. behavior length      32        38
        # Avg. behavior tokens      7,597     9,270
      Instruction
        # Instructions              6,896     2,174
        # Avg. tokens               46        45
      Product
        # Products                  8,236
        # Avg. tokens               665

Table 2: Statistics of the PersonalWAB Benchmark.

The dataset statistics show:

  • Users: 939 in training and 1,000 in test, with an average of 247 profile tokens, average behavior lengths of 32 (train) / 38 (test) items, and average behavior token counts of 7,597 (train) / 9,270 (test).

  • Instructions: 6,896 in training, 2,174 in test, with an average of 46 (train) / 45 (test) tokens.

  • Products: 8,236 unique products with an average of 665 tokens per product.

    User diversity is shown in Figure 3, illustrating distributions across gender, age, and occupation. Figure 4(a) further details behavioral attributes like Price Sensitivity, Diversity Preference, and Interaction Complexity. Figure 4(b) shows instruction statistics, indicating that recommendation instructions are shortest, while review instructions are more complex.


Figure 3: Distribution of users by gender, age, and occupation.


Figure 4: (a) Distribution of behaviors by Price Sensitivity, Diversity Preference, and Interaction Complexity; (b) Statistics of the instructions on different tasks.

5.2. Evaluation Metrics

The paper establishes two distinct evaluation tracks: single-turn and multi-turn.

5.2.1. Single-turn Track

In this track, the agent has one opportunity to execute the user's instruction.

  • Function accuracy (function acc): This metric assesses the agent's ability to select the correct Web function and provide parameters in the correct format.
    • Conceptual Definition: It measures whether the agent correctly identifies the intended Web function for a given instruction and structures its parameters in the expected format.
    • Formula: If the agent selects the appropriate tool for the task and the input parameters are correctly formatted, it receives a score of 1; otherwise, the score is 0.
  • Result accuracy (res acc): This metric evaluates the quality of the results generated by the agent's function calls.
    • Conceptual Definition (Search and Recommendation): For search and recommendation tasks, it measures how well the agent's output (a list of products) aligns with the user's genuinely liked item (ground truth). It assigns a higher score if the target product appears higher in the returned list.
    • Mathematical Formula (Search and Recommendation): $ \mathrm{Res\ Acc} = \begin{cases} 1 - \frac{r - 1}{10}, & \text{if } r \leq 10, \\ 0, & \text{if } r > 10, \end{cases} \quad \text{with } r \in \mathbb{N}^{+} $ Where:
      • $r$ is the rank of the target product within the returned product list.
      • The formula penalizes lower ranks: rank 1 yields a score of $1 - (1-1)/10 = 1$ and rank 10 yields $1 - (10-1)/10 = 0.1$. If the target product is not in the top 10, the score is 0.
    • Conceptual Definition (Review): For review tasks, it assesses the semantic similarity between the agent's generated review text and the user's actual ground truth review.
    • Formula (Review): The sentence-transformer [37] model is used to compute the cosine similarity between the generated and ground-truth review texts, yielding a res acc between 0 and 1. (No separate formula is given in the paper; the cosine similarity formula is explained in Section 3.1.)
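
A small helper capturing the two result-accuracy variants described above; the sentence-transformers checkpoint is an assumption (the paper only states that a sentence-transformer model is used).

```python
# Result accuracy: rank-based for search/recommendation, embedding similarity for reviews.
from sentence_transformers import SentenceTransformer, util

_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

def res_acc_rank(rank: int) -> float:
    """Res Acc = 1 - (r - 1)/10 if r <= 10, else 0, where r is the target product's rank."""
    return 1 - (rank - 1) / 10 if 1 <= rank <= 10 else 0.0

def res_acc_review(generated: str, ground_truth: str) -> float:
    """Cosine similarity between generated and ground-truth review embeddings."""
    emb = _encoder.encode([generated, ground_truth], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

print(res_acc_rank(1), res_acc_rank(10), res_acc_rank(25))  # ≈ 1.0, 0.1, 0.0
```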

5.2.2. Multi-turn Track

This track evaluates the agent's ability to interact with users over multiple turns, using an LLM-based user simulator to provide real-time feedback (Figure 15 for prompt details).

  • The function acc and res acc metrics from the single-turn track are also used here.
  • Average steps: This additional metric measures the efficiency of the agent.
    • Conceptual Definition: It counts the total number of actions (steps) taken by the agent to complete the task.
    • Formula: No specific formula is provided, but it is defined as the total number of actions taken. The goal is to encourage the agent to accomplish tasks with minimal attempts.

5.2.3. Profile Consistency Evaluation

To verify the reliability of the generated user profiles, the paper conducts consistency evaluations:

  • Profile-behavior consistency evaluation:
    • Conceptual Definition: Given a user profile, the task is to identify the correct user from a group of candidates (true user + negative users), where each candidate is represented by their behavior sequence.
    • Formula: top-1 accuracy. This measures how often the correct user's behavior sequence is matched to their profile when compared against others.
  • Profile-product consistency evaluation:
    • Conceptual Definition: Using a user profile to rank a set of candidate items (mixture of positive/interacted and negative/random items). The objective is to prioritize positive items.
    • Formula: NDCG@5 (Normalized Discounted Cumulative Gain at 5) and Recall@5.
      • Recall@K:
        • Conceptual Definition: Recall measures the proportion of relevant items that are successfully retrieved out of the total number of relevant items. Recall@K restricts this count to the top $K$ recommendations.
        • Mathematical Formula: $ \text{Recall@K} = \frac{\text{Number of relevant items in top K recommendations}}{\text{Total number of relevant items}} $
        • Symbol Explanation:
          • Number of relevant items in top K recommendations: The count of actual relevant items that appear within the first $K$ items recommended by the system.
          • Total number of relevant items: The total count of items that are genuinely relevant to the user's preferences.
      • NDCG@K (Normalized Discounted Cumulative Gain at K):
        • Conceptual Definition: NDCG measures the usefulness, or gain, of a document based on its position in the result list. The gain is accumulated from the top of the result list to the bottom, with the gain of a highly relevant document at a lower position being discounted. NDCG normalizes the score by dividing by the Ideal DCG (IDCG), which is the DCG of the ideal ordering of results.
        • Mathematical Formula: First, Cumulative Gain (CG) at position $k$: $ \text{CG}_k = \sum_{i=1}^{k} \text{rel}_i $ Then, Discounted Cumulative Gain (DCG) at position $k$: $ \text{DCG}_k = \sum_{i=1}^{k} \frac{\text{rel}_i}{\log_2(i+1)} $ Finally, Normalized Discounted Cumulative Gain (NDCG) at position $k$: $ \text{NDCG}_k = \frac{\text{DCG}_k}{\text{IDCG}_k} $
        • Symbol Explanation:
          • $k$: The position in the ranked list (for NDCG@5, $k = 5$).
          • $\text{rel}_i$: The relevance score of the item at position $i$ in the ranked list (often binary: 1 if relevant, 0 if not).
          • $\log_2(i+1)$: The discounting factor, which reduces the impact of lower-ranked items.
          • $\text{IDCG}_k$: The ideal DCG at position $k$, i.e., the maximum possible DCG score if all relevant items were perfectly ranked at the top.
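
For reference, both ranking metrics can be computed as follows for binary relevance; this is a standard implementation, not code from the paper.

```python
# Recall@K and NDCG@K for binary relevance, as used in the profile-product
# consistency evaluation (K = 5 in the paper).
import math

def recall_at_k(ranked_relevances, total_relevant, k=5):
    return sum(ranked_relevances[:k]) / total_relevant if total_relevant else 0.0

def ndcg_at_k(ranked_relevances, k=5):
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_relevances[:k]))
    ideal = sorted(ranked_relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg else 0.0

# Example: one relevant item ranked 2nd among 5 candidates.
rels = [0, 1, 0, 0, 0]
print(recall_at_k(rels, total_relevant=1))  # 1.0
print(ndcg_at_k(rels))                      # ≈ 0.63
```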

5.3. Baselines

The paper evaluates the PUMA framework against a range of baselines, categorized into three groups, using gpt-4o-mini-2024-07-18 as the backbone LLM for all baselines (unless specified otherwise for PUMA):

5.3.1. Memory Retrieval Methods

These baselines explore different strategies for selecting and utilizing user history to understand the impact of various memory selection techniques on task performance. The general prompt template for task execution is provided in Figure 16 (single-turn) and Figure 17 (multi-turn) in the appendix, with differences only in the memory component.

  • No Memory: The agent operates without access to any user history, relying solely on the current instruction.
  • Random Memory: The agent randomly selects a portion of behaviors from the user's history for context.
  • Last Memory: The agent uses only the most recent behaviors from the user's history, assuming recent context is most relevant. For single-turn, memory length is 50 behaviors; for multi-turn, it's 20 behaviors.
  • Relevant Memory: The agent selects past behaviors based on cosine similarity with the current instruction, aiming to filter for contextually relevant details. Sentence-transformer [37] is used for cosine similarity calculation. Memory length settings are the same as Last Memory.

5.3.2. Enhanced Reasoning Methods

These frameworks are designed to improve the agent's reasoning and decision-making.

  • ReAct [56]: This framework guides the LLM to "think" before acting. It instructs the model to generate a "Thought:" (reasoning) followed by an "Action:" (JSON-formatted action argument) to interact with the environment. This allows the model to deliberate on available information. For evaluation, ReAct is combined with the Last Memory approach to provide recent context.
  • Reflexion [42]: Building upon ReAct, Reflexion adds a self-evaluation phase. The agent reviews and analyzes its previous actions and outcomes, learns from mistakes, and refines its strategy in subsequent interactions. This baseline is evaluated only in the multi-turn track, where each user message is treated as feedback for the reflection-and-adjustment process.

5.3.3. Recommendation-Specific Memory Frameworks

Given that recommendation tasks are inherently personalized, these baselines leverage memory mechanisms specifically developed for recommendation agents.

  • RecMind [49]: An LLM-powered agent for general recommendations. It consists of two memory types: personalized memory (user reviews, ratings) and world knowledge (item metadata, real-time info via Web search). In this setup, the personalized memory retains user reviews and ratings, and an additional get_product_details_by_asin function is incorporated to allow RecMind to access detailed product information. Memory length is set to 400 behaviors.
  • InteRecAgent [17]: This framework uses LLMs as a reasoning engine and recommender models as functions for interactive recommendations. Its memory includes a candidate bus (current item candidates) and a user profile (like, dislike, expect preferences). The user profile memory is adopted and updated at the end of each task based on conversation history. This method is evaluated only in the multi-turn setting due to its reliance on ongoing dialogue for user profile synthesis.

5.3.4. PUMA Implementation Details

  • LLM Backbone: LLaMA2-7B [47] is used for fine-tuning.
  • Fine-tuning: Performed with LoRA [16] using 4 × 24GB NVIDIA A5000 GPUs.
  • Learning Rates: $4 \times 10^{-3}$ for SFT and $5 \times 10^{-5}$ for DPO.
  • Batch Size: 1 per GPU.
  • Memory Token Length: Constrained to 256, 512, and 768 tokens due to GPU memory limitations during training.
  • Parameter Generation: High-temperature sampling (temperature of 1.5) and beam search (beam size of 10) are used to generate diverse function parameters.
  • Pseudo-label Generation: gpt-4o-mini-2024-07-18 is used to generate search function parameters for initial SFT labels.
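
For orientation, the LoRA fine-tuning stage could be configured roughly as below with PEFT and TRL; the learning rate, batch size, backbone, and 768-token budget follow the details above, while the LoRA rank/alpha, epoch count, and exact trainer arguments are assumptions and may differ across library versions.

```python
# Sketch of the LoRA SFT configuration (LLaMA-2-7B, lr 4e-3, batch size 1 per GPU from
# the paper; LoRA hyperparameters and trainer wiring are assumptions).
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer  # noqa: F401  (trainer call shown commented out)

lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

sft_config = SFTConfig(
    output_dir="puma-sft",
    learning_rate=4e-3,
    per_device_train_batch_size=1,
    num_train_epochs=3,        # assumption: not reported in this analysis
    max_seq_length=768,        # matches the largest memory token budget
)

# trainer = SFTTrainer(model="meta-llama/Llama-2-7b-hf", args=sft_config,
#                      train_dataset=sft_dataset, peft_config=lora_config)
# trainer.train()
```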

5.4. Profile Consistency Evaluation Details

As mentioned in Section A.1 of the paper, the profile consistency evaluation uses the following settings:

  • Profile-behavior consistency: Task is to match a user profile with the correct user's past Web behaviors among other candidate users. Metric: top-1 accuracy.

  • Profile-product consistency: Task is to rank candidate items (positive + negative) for a user based on their profile. Metric: NDCG@5 and Recall@5.

  • Settings: Number of positive samples set to 1 and 3, and negative samples to 4 and 7 for user prediction and recommendation tasks respectively.

  • LLM used: gpt-4o-mini-2024-07-18.


Figure 5: Results of profile consistency evaluation experiments. Our generated profiles align better with users' actual Web behaviors and interested products than Apollonion [5].

Figure 5 shows that PersonalWAB's generated profiles exhibit significant improvements over Apollonion [5] across both tasks, with PersonalWAB achieving higher top-1 accuracy (e.g., 0.85 vs 0.71 for profile-behavior) and higher NDCG@5 and Recall@5 (e.g., 0.61 vs 0.45 and 0.81 vs 0.65 for profile-product), indicating enhanced distinctiveness and alignment with actual user behaviors.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Single-turn Track

The following are the results from [Table 3] of the original paper:

Method (backbone)           Search                  Recommendation          Review                  Overall
                            Function Acc  Res Acc   Function Acc  Res Acc   Function Acc  Res Acc   Function Acc  Res Acc
No Memory (gpt-4o)          1.000         0.647     0.092         0.000     1.000         0.444     0.684         0.355
Random Memory (gpt-4o)      0.974         0.640     0.296         0.018     0.996         0.442     0.745         0.357
Last Memory (gpt-4o)        0.937         0.626     0.432         0.028     1.000         0.442     0.782         0.357
Relevant Memory (gpt-4o)    0.928         0.622     0.492         0.030     1.000         0.443     0.800         0.356
ReAct [56] (gpt-4o)         0.903         0.605     0.560         0.027     0.996         0.444     0.815         0.350
RecMind [49] (gpt-4o)       0.981         0.645     0.226         0.017     0.990         0.442     0.721         0.359
PUMA (gpt-4o)               1.000         0.649     0.939         0.048     1.000         0.449     0.979         0.373
PUMA (LLaMA-7B)             0.996         0.652     0.987         0.054     1.000         0.538     0.994         0.406

Table 3: Single-turn track results. In the original paper, the best performance in each column is shown in bold and the second best is underlined.

Key insights from the single-turn track results (Table 3):

  • Recommendation Task Difficulty: Recommendation instructions show poor function accuracy and result accuracy for most baselines. For instance, No Memory has a function acc of only 0.092 and res acc of 0.000 for recommendation. This indicates a significant challenge in correctly identifying the recommendation function and generating effective parameters. Further analysis (e.g., Figure 8(b) in the original paper) reveals many recommendation instructions were incorrectly assigned to the search function.
  • Impact of Memory: Methods incorporating memory generally show improved function accuracy compared to No Memory. Relevant Memory and ReAct exhibit higher function accuracy, suggesting that retrieving relevant information and explicit reasoning help in function selection. However, the result accuracy for most baselines remains similar to No Memory, implying they fail to significantly enhance personalized task execution, especially for the recommendation task where res acc stays very low (0.000-0.030).
  • PUMA's Superiority: PUMA significantly outperforms all baselines across all tasks.
    • PUMA(LLaMA-7B) achieves the highest overall function accuracy (0.994) and result accuracy (0.406).
    • For recommendation, PUMA(LLaMA-7B) achieves a function acc of 0.987 (vs. 0.560 for ReAct) and res acc of 0.054 (vs. 0.030 for Relevant Memory), demonstrating a substantial improvement.
    • This superiority highlights the effectiveness of PUMA's task-specific memory retrieval and function parameter optimization (SFT+DPO) in enabling the agent to focus on relevant behaviors and generate higher-quality personalized actions.
  • Efficiency: Despite using a smaller backbone LLM (LLaMA-7B) compared to gpt-4o, PUMA(LLaMA-7B) still achieves the best performance, indicating its efficiency and effectiveness.

6.1.2. Multi-turn Track

The following are the results from [Table 4] of the original paper:

Method (backbone)           Search                        Recommendation                Review                        Overall
                            F.Acc   R.Acc   Avg.Steps     F.Acc   R.Acc   Avg.Steps     F.Acc   R.Acc   Avg.Steps     F.Acc   R.Acc   Avg.Steps
No Memory (gpt-4o)          0.996   0.656   2.398         0.096   0.000   2.420         1.000   0.446   2.019         0.685   0.358   2.280
Random Memory (gpt-4o)      0.999   0.680   4.193         0.703   0.042   4.474         1.000   0.448   2.007         0.896   0.380   3.564
Last Memory (gpt-4o)        0.996   0.676   4.229         0.708   0.045   4.252         1.000   0.449   2.007         0.897   0.381   3.498
Relevant Memory (gpt-4o)    0.996   0.686   4.233         0.715   0.042   4.564         0.999   0.448   2.008         0.899   0.383   3.609
ReAct [56] (gpt-4o)         0.996   0.674   4.657         0.218   0.013   5.468         0.974   0.448   2.129         0.718   0.369   4.098
Reflexion [42] (gpt-4o)     1.000   0.686   5.406         0.281   0.014   6.145         0.976   0.449   2.145         0.741   0.373   4.579
RecMind [49] (gpt-4o)       0.997   0.642   6.728         0.347   0.026   6.003         0.997   0.451   2.107         0.771   0.364   4.938
InteRecAgent [17] (gpt-4o)  0.999   0.642   3.110         0.618   0.022   3.008         1.000   0.447   2.001         0.867   0.362   2.706
PUMA (gpt-4o)               0.999   0.720   5.082         0.984   0.052   3.791         1.000   0.453   2.002         0.994   0.399   3.608

Table 4: Multi-turn track results (F.Acc = Function Acc., R.Acc = Res Acc., Avg.Steps = average steps).

Key insights from the multi-turn track results (Table 4):

  • Baselines Benefit from Multi-turn: Compared to the single-turn track, baselines generally perform better in search and recommendation tasks. This is attributed to the ability to benefit from multiple attempts and user feedback, allowing them to correct initial errors. Review tasks show minimal improvement as they are often straightforward.
  • Memory Retrieval Baselines: Similar trends to single-turn are observed. Relevant Memory slightly improves function accuracy and result accuracy but often at the cost of additional steps.
  • Reasoning Methods (ReAct, Reflexion): ReAct and Reflexion perform worse than memory retrieval methods in terms of function accuracy and result accuracy for recommendation, and require more average steps. The added complexity of explicit reasoning and self-reflection (which increases input token length) seems to hinder efficiency and accuracy in these complex multi-turn settings, potentially due to context window limitations or the difficulty of effective self-correction.
  • Recommendation-Specific Frameworks (RecMind, InteRecAgent): RecMind requires a higher number of average steps (6.728 for search, 6.003 for recommendation) due to additional function calls, and struggles with instruction identification (low function acc for recommendation). InteRecAgent uses fewer steps (3.008 for recommendation) due to its streamlined memory, but this simplification leads to lower result accuracy (0.022 for recommendation).
  • PUMA's Strong Performance: PUMA (gpt-4o) demonstrates strong performance, especially in search and recommendation tasks. It achieves the highest overall function accuracy (0.994) and result accuracy (0.399) among the gpt-4o models. For recommendation, PUMA significantly improves function accuracy (0.984 vs. 0.715 for Relevant Memory) and result accuracy (0.052 vs. 0.045 for Last Memory). By extracting relevant information and filtering redundant data, PUMA enables more informed decisions with fewer steps in recommendation (3.791 Avg.Steps vs. 4.564 for Relevant Memory). While the full PUMA (with LLaMA-7B fine-tuning) was not evaluated in multi-turn due to model limitations, the gpt-4o variant still shows the benefits of its task-specific memory.

6.2. In-depth Analysis

6.2.1. Analysis on efficiency


Figure 7: Comparison between the average task completion time (in seconds) for different methods.

The average task completion time is a critical factor for user experience. Figure 7 illustrates the efficiency comparison:

  • GPT-based Baselines: Most GPT-based methods, including No Memory, Random Memory, Last Memory, Relevant Memory, ReAct, Reflexion, RecMind, and InteRecAgent, show similar completion times, ranging from approximately 6.5 to 6.9 seconds. This is likely due to inherent latency in calling the GPT models and the memory processing overhead (even for No Memory, there's a baseline processing time).
  • PUMA's Superior Efficiency: PUMA significantly outperforms all baselines, achieving an average task completion time of just 2.8 seconds. This substantial efficiency gain is attributed to two factors:
    1. Smaller Model: PUMA utilizes a LLaMA-7B backbone, which is much smaller and faster to run than gpt-4o.
    2. Compact Memory Structure: PUMA's task-specific memory retrieval mechanism is designed to filter out irrelevant information, resulting in a more compact and manageable input. This minimizes inference time and reduces the computational load. This makes PUMA highly effective for real-world Web applications where quick response times are essential.

6.2.2. Ablation Study

The following are the results from [Table 5] of the original paper:

Method Search Recommendation Review Overall
Function Acc Result Acc Function Acc Result Acc Function Acc Result Acc Function Acc Result Acc
PUMA 0.996 0.652 0.987 0.054 1.000 0.538 0.994 0.406
w/o Task-specific Memory 0.990 0.643 0.992 0.008 1.000 0.496 0.994 0.373
w/o SFT 1.000 0.000 0.983 0.000 1.000 0.160 0.994 0.054
w/o DPO 0.996 0.648 0.987 0.047 1.000 0.529 0.994 0.399

Table 5: Ablation study on key components of PUMA in single-turn track.

An ablation study (Table 5) was conducted to assess the impact of PUMA's key components on performance:

  • w/o Task-specific Memory: Removing the task-specific memory retrieval leads to a drop in result accuracy across all tasks (e.g., from 0.054 to 0.008 for recommendation, from 0.538 to 0.496 for review). This highlights the critical role of effectively filtered memory in retaining relevant information necessary for generating accurate function parameters.

  • w/o SFT: When the supervised fine-tuning (SFT) phase is removed, result accuracy dramatically declines to near zero (e.g., 0.000 for search and recommendation, 0.160 for review). This indicates that SFT is fundamental in equipping the model with the basic ability to generate plausible and contextually appropriate function parameters. Without it, the LLM struggles significantly.

  • w/o DPO: Removing the Direct Preference Optimization (DPO) phase results in a slight but noticeable performance decrease in result accuracy (e.g., from 0.054 to 0.047 for recommendation, from 0.538 to 0.529 for review). This suggests that DPO plays a crucial role in refining the function parameters, better aligning them with personalized user preferences, and thus improving the overall quality of execution.

    Overall, the ablation study confirms that all three components—task-specific memory, SFT, and DPO—are essential for PUMA's superior performance, with SFT being foundational and memory and DPO providing critical enhancements for personalization and optimization.
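
For reference, the sketch below shows the standard DPO objective (Rafailov et al.) that such an alignment stage builds on; it is a minimal PyTorch-style illustration, not PUMA's actual training code. The inputs are assumed to be summed token log-probabilities of the preferred ("chosen") and dispreferred ("rejected") function parameters under the policy and a frozen reference model, and β = 0.1 is an arbitrary choice.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: push the policy to prefer the chosen parameters
    over the rejected ones, measured relative to the frozen reference model."""
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # -log(sigmoid(x)) == softplus(-x); average over the preference pairs
    return F.softplus(-(chosen_margin - rejected_margin)).mean()
```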

6.2.3. Analysis on memory length

The following are the results from [Table 6] of the original paper:

| Memory Length | Search Function Acc | Search Result Acc | Rec. Function Acc | Rec. Result Acc | Review Function Acc | Review Result Acc | Overall Function Acc | Overall Result Acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 256 | 0.997 | 0.651 | 0.985 | 0.019 | 1.000 | 0.530 | 0.994 | 0.395 |
| 512 | 0.991 | 0.648 | 0.988 | 0.032 | 1.000 | 0.531 | 0.993 | 0.395 |
| 768 | 0.996 | 0.652 | 0.987 | 0.054 | 1.000 | 0.538 | 0.994 | 0.406 |

Table 6: Performance comparison of different memory token lengths in PUMA.

The impact of different memory token lengths (256, 512, and 768 tokens) on PUMA's performance was analyzed:

  • Function Accuracy: Memory length has minimal impact on function accuracy. The model maintains similar performance in identifying the correct function regardless of the memory size, with function acc remaining consistently high (around 0.99 for overall).
  • Result Accuracy: In contrast, memory length significantly affects result accuracy, especially for recommendation tasks.
    • For recommendation, increasing memory length from 256 to 768 tokens leads to a notable improvement in result accuracy (from 0.019 to 0.054). Shorter memory lengths limit the number of stored products and behaviors, hindering the model's ability to select appropriate product IDs for recommendations.

    • Search and review tasks are less sensitive to memory length changes. Their result accuracy remains relatively stable across different lengths. This is because these tasks often rely more heavily on information present in the direct user instruction rather than extensive historical memory for parameter generation. This reduced dependence also implies a potential ceiling for performance improvement from merely increasing memory length for these tasks.

      The analysis indicates that while longer memory can be beneficial for tasks requiring richer historical context (like recommendation), judicious selection of memory content (as done by task-specific memory retrieval) is crucial, and merely increasing length doesn't guarantee universal improvement for all task types.
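
A minimal sketch of how such a memory token budget could be enforced is shown below; it assumes the retrieved behaviors are already sorted by task-specific relevance and uses a whitespace split as a stand-in for the model's tokenizer, so the names and counting rule are illustrative rather than the paper's implementation.

```python
def truncate_memory(behaviors, max_tokens=768):
    """Keep the most relevant retrieved behaviors until the token budget is used up."""
    kept, used = [], 0
    for entry in behaviors:  # behaviors assumed sorted by relevance, most relevant first
        n_tokens = len(entry.split())  # rough proxy for model tokens
        if used + n_tokens > max_tokens:
            break
        kept.append(entry)
        used += n_tokens
    return kept
```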

6.2.4. Analysis on action transitions

Figure 8: Transitions of the agent's actions in multi-turn search and recommendation tasks. Each color represents a specific function. The horizontal axis shows interaction steps, while the width of each color band indicates the proportion of the agent's focus on that action. The flow between steps illustrates how the agent adapts its strategy across steps.

Figure 8 visualizes PUMA's actions in each interaction turn within the multi-turn track (excluding review instructions, which are typically completed in two steps):

  • Search Instructions (Figure 8a): The agent tends to alternately call the search and respond functions. This pattern is logical, as the agent can use the respond function to solicit user feedback, clarify ambiguities, or present preliminary results. Based on this feedback, it can then adjust its search action in subsequent turns. The interaction flow appears more direct and focused.
  • Recommendation Instructions (Figure 8b): The action transitions for recommendation instructions are "more entangled," indicating a more complex and varied action sequence. This complexity suggests that multi-turn recommendation tasks are inherently more challenging. The agent needs to accurately identify user intent, dynamically adjust its strategy based on nuanced feedback, and potentially explore different avenues, leading to a less linear interaction flow compared to search. This underlines the difficulty in constantly refining recommendations through dialogue.
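
The flows visualized in Figure 8 can be recovered from logged action sequences; the sketch below tabulates which function the agent calls at each step and how often it switches between functions. The log format (a list of per-task function-name sequences such as ['search', 'respond', ...]) is an assumption for illustration.

```python
from collections import Counter

def transition_stats(action_logs):
    """action_logs: e.g. [['search', 'respond', 'search'], ...] (format assumed)."""
    per_step, transitions = {}, Counter()
    for seq in action_logs:
        for step, action in enumerate(seq):
            per_step.setdefault(step, Counter())[action] += 1
            if step > 0:
                transitions[(step, seq[step - 1], action)] += 1  # (step, from, to)
    # Share of each function at every step, i.e. the band widths in Figure 8.
    proportions = {
        step: {a: c / sum(counts.values()) for a, c in counts.items()}
        for step, counts in per_step.items()
    }
    return proportions, transitions
```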

6.2.5. Analysis of multi-turn performance variation

Figure 9: Analysis of the agent's performance across multiple attempts in the multi-turn track. A line traces the number of attempts at each step, while bars show the result accuracy for each task category (search, recommendation, review) and overall.

Figure 9 presents the analysis of the agent's performance over multiple attempts in the multi-turn track, showing both Result Accuracy (Res Acc) and the number of solved tasks as the number of attempt steps increases:

  • Early Task Completion: A high number of tasks are completed within the first five attempts, indicating that most tasks are relatively straightforward and resolvable early in the interaction. Review tasks, in particular, are typically finished within the first two attempts, implying minimal need for extensive user interaction regarding review requirements.
  • Res Acc Trend: Res Acc is high during the initial attempts but tends to decline with each subsequent attempt. This pattern suggests that easier tasks are quickly resolved, leaving the more difficult or ambiguous tasks to be addressed in later turns. As the agent encounters more challenging scenarios, its ability to achieve high accuracy decreases.
  • Outliers: There are a few instances where tasks achieve higher Res Acc in later steps. However, these are rare outliers, involving only one or two tasks, which do not significantly alter the overall declining trend.
  • Feedback Utilization Challenges: The declining Res Acc in later attempts also implies that the agent struggles to effectively leverage user feedback in more complex, prolonged interactions. This could be due to a lack of sufficient multi-turn training data to tune the agent for robust self-correction and adaptation over extended dialogues.

6.2.6. Analysis on function usage and outcome accuracy

The following are the results from [Table 7] of the original paper:

| Method | Search F. Acc. | Search R. Acc. | Search O. Acc. | Rec. F. Acc. | Rec. R. Acc. | Rec. O. Acc. |
| --- | --- | --- | --- | --- | --- | --- |
| No Memory | 1.000 | 0.647 | 0.647 | 0.092 | 0.000 | 0.155 |
| Random Memory | 0.974 | 0.640 | 0.642 | 0.296 | 0.018 | 0.159 |
| Last Memory | 0.937 | 0.626 | 0.632 | 0.432 | 0.028 | 0.161 |
| Relevant Memory | 0.928 | 0.622 | 0.631 | 0.492 | 0.030 | 0.159 |
| ReAct [56] | 0.903 | 0.605 | 0.628 | 0.560 | 0.027 | 0.160 |
| RecMind [49] | 0.981 | 0.645 | 0.647 | 0.226 | 0.017 | 0.152 |
| PUMA | 1.000 | 0.649 | 0.649 | 0.939 | 0.048 | 0.164 |

Table 7: Single-turn performance comparison of different methods in terms of function accuracy (F. Acc.), result accuracy (R. Acc.), and outcome accuracy (O. Acc.) in search and recommendation.

In real-world applications, users prioritize the relevance of retrieved results over the specific function employed. To reflect this user-centric goal, the paper introduces Outcome Accuracy (O. Acc.).

  • Outcome Accuracy (O. Acc.) Conceptual Definition: This metric evaluates the correctness of the returned results (e.g., product lists) independently of whether the agent used a search or recommendation function. It focuses on whether the final output aligns with user intent, irrespective of the precise tool invoked.
  • Analysis: As seen in Table 7, function accuracy (F. Acc.) and result accuracy (R. Acc.) vary significantly between search and recommendation tasks. For instance, No Memory achieves perfect F. Acc. for search (1.000) but only 0.092 for recommendation. Outcome Accuracy, in contrast, provides a more balanced perspective.
    • PUMA achieves the highest Outcome Accuracy (0.649 for search and 0.164 for recommendation) among all methods. This demonstrates PUMA's ability to deliver relevant results effectively, even when the boundary between the search and recommendation functions is blurred from the user's perspective.
    • This metric offers a comprehensive evaluation that prioritizes the relevance of final outputs over strict adherence to function selection, better reflecting real-world user needs.
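
As a rough illustration of how Outcome Accuracy differs from function accuracy, the sketch below scores an episode purely on whether the target item appears in the returned list, regardless of which function produced it. The exact scoring used in the paper is not restated here, and the episode format (returned_ids, target_id) is an assumption.

```python
def outcome_accuracy(episodes):
    """episodes: list of dicts like {'returned_ids': [...], 'target_id': '...'} (format assumed).
    An episode counts as correct if the target item is returned, whether it came
    from the search function or the recommendation function."""
    if not episodes:
        return 0.0
    hits = sum(ep['target_id'] in ep['returned_ids'] for ep in episodes)
    return hits / len(episodes)
```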

6.2.7. Analysis on search function implementation

The following are the results from [Table 8] of the original paper:

| Method | Search Result Acc (BM25) | Search Result Acc (Dense Retrieval) |
| --- | --- | --- |
| No Memory | 0.647 | 0.502 |
| Random Memory | 0.640 | 0.504 |
| Last Memory | 0.626 | 0.498 |
| Relevant Memory | 0.622 | 0.499 |
| ReAct [56] | 0.605 | 0.496 |
| RecMind [49] | 0.645 | 0.498 |
| PUMA | 0.649 | 0.506 |

Table 8: Comparison of search result accuracy using BM25 and Dense retrieval methods in single-turn track.

The search function implementation in the benchmark is flexible. The paper conducted an alternative retrieval experiment in the single-turn track, replacing BM25 (a sparse retrieval model) with a dense retrieval model based on Sentence-BERT [38].

  • Performance Comparison (Table 8):
    • Compared with BM25, dense retrieval yields a consistent drop in result accuracy across all methods. For example, No Memory falls from 0.647 (BM25) to 0.502 (Dense Retrieval), and PUMA from 0.649 to 0.506.
    • This is explained by dense retrieval capturing richer semantic representations but potentially introducing noise by embedding extensive product details, which might not always align perfectly with the direct search intent in this specific setup.
  • PUMA's Robustness: Despite the variations caused by different retrieval methods, PUMA consistently outperformed all baselines in both BM25 and Dense Retrieval scenarios. This demonstrates PUMA's robustness and its ability to effectively utilize the underlying search function regardless of its specific implementation. This modularity allows for future exploration of different retrieval models and recommendation strategies to further enhance performance.
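
To illustrate what swapping the retrieval back-end involves, here is a minimal sketch of both a BM25 and a dense (Sentence-BERT style) product search; the rank_bm25 and sentence-transformers libraries and the all-MiniLM-L6-v2 checkpoint are assumptions for illustration, not the benchmark's actual implementation.

```python
from rank_bm25 import BM25Okapi                              # sparse retrieval
from sentence_transformers import SentenceTransformer, util  # dense retrieval

def bm25_search(query, product_texts, top_k=5):
    """Rank products by BM25 score over whitespace-tokenized text."""
    bm25 = BM25Okapi([t.lower().split() for t in product_texts])
    scores = bm25.get_scores(query.lower().split())
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]

def dense_search(query, product_texts, top_k=5, model_name="all-MiniLM-L6-v2"):
    """Rank products by cosine similarity of sentence embeddings (model choice assumed)."""
    model = SentenceTransformer(model_name)
    doc_embs = model.encode(product_texts, convert_to_tensor=True)
    query_emb = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, doc_embs)[0]
    return scores.topk(min(top_k, len(product_texts))).indices.tolist()
```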

6.2.8. Analysis on zero-shot and few-shot performance

The following are the results from [Table 9] of the original paper:

| Method | Search Function Acc. | Search Result Acc. | Rec. Function Acc. | Rec. Result Acc. | Review Function Acc. | Review Result Acc. | Overall Function Acc. | Overall Result Acc. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| No Memory | 1.000 | 0.684 | 0.050 | 0.000 | 1.000 | 0.388 | 0.625 | 0.328 |
| Random Memory | 0.974 | 0.684 | 0.301 | 0.060 | 0.996 | 0.391 | 0.715 | 0.352 |
| Last Memory | 1.000 | 0.683 | 0.314 | 0.058 | 1.000 | 0.396 | 0.730 | 0.353 |
| Relevant Memory | 0.928 | 0.675 | 0.405 | 0.078 | 1.000 | 0.397 | 0.743 | 0.358 |
| ReAct [56] | 0.945 | 0.675 | 0.475 | 0.080 | 0.996 | 0.393 | 0.774 | 0.358 |
| RecMind [49] | 0.973 | 0.680 | 0.320 | 0.063 | 0.996 | 0.394 | 0.722 | 0.354 |
| PUMA | 1.000 | 0.686 | 0.892 | 0.090 | 1.000 | 0.396 | 0.958 | 0.366 |

Table 9: Performance comparison in zero-shot and few-shot scenarios in single-turn track.

To evaluate performance in zero-shot and few-shot scenarios, the paper analyzed 139 users (16.2% of the test set) with fewer than 10 historical records in the single-turn track.

  • Task-dependent Effects:
    • Search: Search performance remained stable or slightly improved, with result accuracy (e.g., PUMA at 0.686) comparable to full-data scenarios. This might be due to a reduction in potentially irrelevant historical information, allowing the agent to focus on the instruction itself.
    • Recommendation: Recommendation performance also improved (e.g., PUMA res acc from 0.054 to 0.090), which is counter-intuitive for memory-dependent tasks. The paper suggests this could be because limited memory simplifies retrieval, making it easier for the agent to pinpoint relevant items from a smaller, less noisy set.
    • Review: Review tasks showed a decline in performance (e.g., PUMA res acc from 0.538 to 0.396). The lack of past reviews for these users hinders the agent's ability to generate truly personalized and contextually rich responses, as review generation relies heavily on the user's past expression style and preferences.
  • PUMA's Consistent Superiority: Despite these variations, PUMA consistently outperformed all baselines, achieving the highest function accuracy and result accuracy across tasks, particularly excelling in Recommendation scenarios. This demonstrates PUMA's adaptability and effectiveness even when user history is sparse, highlighting its ability to leverage even limited personalized data more effectively than other methods.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper marks a significant advancement in the field of Web agents by introducing the concept of LLM-empowered personalized Web agents. It formally articulates the task of integrating personalized user data (profiles, historical behaviors) to achieve nuanced instruction understanding and customized action execution. To facilitate research and development in this new domain, the authors constructed PersonalWAB, the first comprehensive benchmark for personalized Web agents, encompassing diverse users, three personalized Web tasks (search, recommendation, review), callable Web functions, and supporting both single-turn and multi-turn evaluations. Furthermore, the paper proposes PUMA, a novel framework that enhances LLMs for this task through a user memory bank with task-specific retrieval and function parameter optimization via supervised fine-tuning (SFT) and Direct Preference Optimization (DPO). Extensive experiments on PersonalWAB robustly demonstrate PUMA's superior performance over existing Web agents, affirming its capacity to align better with personalized user instructions and preferences. This work lays foundational groundwork, expanding the research scope and introducing new challenges for future Web agent scenarios.

7.2. Limitations & Future Work

The authors acknowledge several limitations and outline promising avenues for future research:

  • Benchmark Expansion: PersonalWAB could be extended to incorporate a more diverse range of task scenarios, which would further challenge and evaluate the personalization capabilities of Web agents. This implies a need for tasks beyond shopping, encompassing broader Web interactions.
  • Sophisticated User Modeling: Future work could integrate more advanced user modeling techniques, such as dynamic preference learning. This would enhance agents' adaptability to evolving user needs and preferences over time, moving beyond static profiles.
  • User-in-the-Loop Settings: Exploring user-in-the-loop settings presents an exciting opportunity. This involves developing agents that can better integrate user feedback, proactively identify missing information, and actively engage with users to request necessary details. This approach could significantly improve task execution effectiveness and efficiency.
  • Ethical and Privacy Considerations: The paper explicitly flags the critical importance of ethical and privacy considerations. The use of personalized data can lead to biases (e.g., popularity bias [2]) and raise data security concerns. Future work must focus on fairness-aware personalization techniques, diversity-promoting strategies, and privacy-preserving techniques to mitigate these risks.
  • Scope Generalization: While the current work focuses on the shopping domain, the framework is generalizable. Extending it to broader Web environments (e.g., news recommendation, social media content curation) presents additional complexities that require further investigation.

7.3. Personal Insights & Critique

This paper makes a compelling case for the next generation of Web agents: truly personalized agents. The core idea of bringing personalized data into the LLM-agent loop is highly intuitive and addresses a clear gap in existing research. My key insights are:

  • Practical Relevance: The shopping domain is an excellent choice for demonstrating personalization due to the rich, quantifiable user behavior data and direct impact on user experience. The ability of an agent to anticipate a user's price sensitivity, brand preference, or even their review style is a tangible leap in utility.
  • Methodological Rigor: The construction of PersonalWAB is a monumental contribution. The attention to detail, from user sampling and profile generation via LLMs to instruction creation and Web function abstraction, provides a robust and replicable foundation. The profile consistency evaluation further validates the quality of the synthetic data, a crucial step for any benchmark relying on synthetic elements.
  • PUMA's Design: The PUMA framework's multi-stage approach, combining task-specific memory retrieval, SFT for foundational parameter generation, and DPO for preference alignment, is well-thought-out. The ablation studies clearly demonstrate the incremental value of each component. The use of DPO is particularly elegant, directly optimizing for user preferences without the complexity of a separate reward model.
  • Efficiency Aspect: The focus on efficiency is highly practical; in real-world Web interactions, latency can make or break the user experience. PUMA's ability to achieve superior performance with a smaller model (LLaMA-7B) and a compact memory structure is a significant advantage, making it more deployable.
  • Critical Points for Improvement:
    • Dynamic Profile Updates: While PUMA uses historical data, the user profile generation is somewhat static. Future work could explore how the agent dynamically learns and updates user profiles during interactions, especially in multi-turn scenarios. This would make the personalization even more adaptive.

    • Handling Conflicting Preferences: Users can have complex, sometimes contradictory, preferences. How would a personalized Web agent handle conflicting signals in historical data or ambiguous instructions? For example, a user who values budget but occasionally splurges.

    • Explainability of Personalization: As agents become more personalized, their decisions might become less transparent. Providing explanations for why a particular recommendation was made or why certain search parameters were chosen could enhance user trust and control.

    • Real-time Adaptation to External Factors: Personalized agents could benefit from integrating external real-time factors (e.g., current events, weather, social trends) that might influence a user's immediate preferences, even if not explicitly in their historical data.

    • Robustness to Adversarial Instructions: How robust is the personalization to subtle adversarial or manipulative instructions? Ensuring that the agent acts truly in the user's best interest is paramount.

      Overall, this paper is a foundational piece for personalized Web agents. Its clear task formulation, robust benchmark, and effective framework will undoubtedly inspire extensive follow-up research and bring us closer to a future where Web agents are not just smart, but truly personal.
