iAgent: LLM Agent as a Shield between User and Recommender Systems
TL;DR Summary
The paper introduces a user-agent-platform paradigm with LLM agents as a protective shield, addressing vulnerabilities in traditional recommender systems. It develops the INSTRUCTREC datasets and two agents, iAgent and i2Agent, with the latter showing an average 16.6% improvement across ranking metrics.
Abstract
Traditional recommender systems usually take the user-platform paradigm, where users are directly exposed under the control of the platform's recommendation algorithms. However, defects in recommendation algorithms may put users in very vulnerable positions under this paradigm. First, many sophisticated models are often designed with commercial objectives in mind, focusing on the platform's benefits, which may hinder their ability to protect and capture users' true interests. Second, these models are typically optimized using data from all users, which may overlook individual users' preferences. Due to these shortcomings, users may experience several disadvantages under the traditional user-platform direct exposure paradigm, such as lack of control over the recommender system, potential manipulation by the platform, echo chamber effects, or lack of personalization for less active users due to the dominance of active users during collaborative learning. Therefore, there is an urgent need to develop a new paradigm to protect user interests and alleviate these issues. Recently, some researchers have introduced LLM agents to simulate user behaviors, but these approaches primarily aim to optimize platform-side performance, leaving core issues in recommender systems unresolved. To address these limitations, we propose a new user-agent-platform paradigm, where the agent serves as a protective shield between the user and the recommender system, enabling indirect exposure.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
iAgent: LLM Agent as a Shield between User and Recommender Systems
1.2. Authors
Wujiang Xu, Yunxiao Shi, Zujie Liang, Xuying Ning, Kai Mei, Kun Wang, Xi Zhu, Min Xu, Yongfeng Zhang
Affiliations: Rutgers University; University of Technology Sydney; Independent Researcher; University of Illinois Urbana-Champaign; Nanyang Technological University
1.3. Journal/Conference
Published on arXiv, a preprint server, indicating that the paper has not yet undergone formal peer review for a specific conference or journal. arXiv is a well-regarded platform for rapid dissemination of research in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.
1.4. Publication Year
2025 (posted 2025-02-20T15:58:25 UTC)
1.5. Abstract
Traditional recommender systems (RS) typically operate under a user-platform paradigm where users are directly exposed to algorithms often designed with commercial objectives, leading to issues like lack of user control, potential manipulation, echo chambers, and insufficient personalization for less active users. To address these vulnerabilities, the paper proposes a new user-agent-platform paradigm, introducing a large language model (LLM) agent as a protective shield between the user and the recommender system. The authors construct four new recommendation datasets, called INSTRUCTREC, which include user instructions. They then design an Instruction-aware Agent (iAgent) that uses tools to acquire external knowledge to understand user intentions. Further enhancing this, they introduce an Individual Instruction-aware Agent (i2Agent), which incorporates a dynamic memory mechanism to learn from individual user feedback. Empirical results on the INSTRUCTREC datasets show that i2Agent significantly outperforms state-of-the-art baselines, achieving an average improvement of 16.6% across ranking metrics. Moreover, i2Agent effectively mitigates echo chamber effects and alleviates model bias for disadvantaged (less-active) users, thus fulfilling its role as a user shield.
1.6. Original Source Link
https://arxiv.org/abs/2502.14662 (Publication status: Preprint) PDF Link: https://arxiv.org/pdf/2502.14662v4.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve revolves around the inherent vulnerabilities of users within the traditional user-platform paradigm of recommender systems. In this prevalent model, users are directly subject to the platform's recommendation algorithms.
This problem is important because, despite their widespread adoption, traditional recommender systems exhibit several critical flaws:
- Commercial Objectives vs. User Interests: Many sophisticated recommendation models are primarily designed to optimize platform benefits (e.g., clicks, conversion rates) rather than genuinely protecting or capturing users' true interests. This can lead to algorithmic manipulation where users are swayed towards items benefiting the platform, not necessarily themselves.
- Lack of Individual Personalization: Models are often optimized using aggregate data from all users, which can overlook unique individual preferences and needs. This results in generalized recommendations that fail to cater to specific users.
- User Disadvantages: Under this paradigm, users experience:
  - Lack of Control: Users have minimal say over the recommendations they receive.
  - Manipulation: Algorithms can subtly guide user choices for commercial gain.
  - Echo Chamber Effects: Users can become trapped in filter bubbles, repeatedly receiving homogeneous items that reinforce existing interests, leading to a lack of diversity and exposure to new content.
  - Bias Against Less-Active Users: Recommendation algorithms, particularly those relying on collaborative learning, tend to favor active users whose extensive interaction data dominates the learning process, leading to poor personalization for less-active or "disadvantaged" users.

Prior research, including some using Large Language Model (LLM) agents to simulate user behaviors, has predominantly focused on optimizing platform-side performance, leaving these core user-centric issues largely unaddressed.

The paper's entry point and innovative idea is to introduce a new user-agent-platform paradigm. Instead of direct user-platform interaction, an intelligent LLM agent acts as a "protective shield" between the user and the recommender system, enabling indirect exposure. This agent is designed to prioritize user interests, understand individual instructions, learn from personal feedback, and mitigate the aforementioned biases, thereby empowering users and enhancing their control over the recommendation experience.
2.2. Main Contributions / Findings
The paper makes several primary contributions to address the limitations of traditional recommender systems:
- Proposed New Paradigm: Introduces the user-agent-platform paradigm, positioning an LLM agent as a "protective shield" to mediate interactions between users and recommender systems, ensuring indirect exposure and prioritizing user interests.
- Novel Datasets (INSTRUCTREC): Constructed four new recommendation datasets (from Amazon, Goodreads, and Yelp) called INSTRUCTREC. These datasets are unique in that they include user-driven, free-text instructions for each interaction record, enabling research into instruction-aware recommendation.
- Instruction-aware Agent (iAgent): Developed iAgent, a foundational LLM-based agent capable of understanding free-text user instructions. It leverages a parser to extract user intentions and, when needed, external tools (e.g., search APIs) to acquire domain-specific knowledge, acting as an expert to inform a reranker.
- Individual Instruction-aware Agent (i2Agent): Enhanced iAgent by introducing i2Agent, which incorporates a dynamic memory mechanism. This mechanism includes a profile generator (to build and maintain user-specific profiles from individual feedback) and a dynamic extractor (to capture evolving interests based on real-time instructions). This ensures individual optimization, independent of other users' behaviors.
- Empirical Validation & Superior Performance: Conducted extensive experiments on the four INSTRUCTREC datasets, demonstrating that i2Agent consistently and significantly outperforms state-of-the-art baselines across standard ranking metrics, achieving an average improvement of 16.6%.
- Mitigation of Systemic Biases: Demonstrated that i2Agent effectively mitigates the echo chamber effect (by filtering unwanted ads and recommending diverse items) and alleviates model bias against disadvantaged (less-active) users, offering more personalized services.
- Self-reflection Mechanism: Introduced a self-reflection mechanism within both iAgent and i2Agent to verify the content of the reranked list and address the hallucination problems often associated with generative LLMs, significantly reducing the hallucination rate.

These contributions collectively address the core issues in traditional recommender systems by shifting the focus from platform-centric optimization to user-centric empowerment and protection, leading to more personalized, diverse, and fair recommendations.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with the following foundational concepts:
- Recommender Systems (RS): These are information filtering systems that predict what a user might like based on their past behavior, preferences, and similar users' behaviors. They are widely used across various online platforms (e.g., e-commerce, streaming services, social media) to suggest items (products, movies, news, etc.). The goal is to enhance user experience and drive engagement or sales.
- Personalization: The process of tailoring recommendations to individual users. Effective personalization means that the recommendations are highly relevant to a specific user's unique tastes and needs, rather than being generic or popular items.
- User-Platform Paradigm: This refers to the traditional model where users interact directly with a platform's algorithms. The platform's recommender system processes user data and delivers recommendations without an intermediary. The paper argues this direct exposure can make users vulnerable.
- LLM Agent: An LLM agent (Large Language Model agent) is an artificial intelligence program that leverages the capabilities of a large language model (like GPT-3, GPT-4, Llama, etc.) to perform tasks autonomously. Unlike a simple LLM that just generates text, an LLM agent can understand prompts, reason, plan a sequence of actions, use external tools (e.g., search engines, calculators, APIs), execute those actions, and reflect on the outcomes to improve future performance. LLM agents often have memory to maintain context over time.
- Echo Chamber Effect (or Filter Bubble): This phenomenon occurs in recommender systems or social media when algorithms reinforce a user's existing interests or beliefs by repeatedly recommending homogeneous items or content. This leads to a lack of diversity in recommended content and can narrow a user's perspective, making them less exposed to new ideas or different viewpoints. The paper aims to mitigate this.
- Bias in Recommender Systems: This refers to systematic errors or unfairness in recommendations. Examples include:
- Popularity Bias: Recommending popular items more often, leading to a "rich-get-richer" effect for popular items and neglecting niche or less-known items.
- Exposure Bias: Certain items or users get more exposure than others.
- Bias against Less-Active Users: Users with fewer interactions (less-active users) often receive poorer quality or less personalized recommendations because collaborative filtering algorithms rely heavily on sufficient interaction data.
- Hallucination in LLMs: This refers to the phenomenon where LLMs generate plausible-sounding but factually incorrect or nonsensical information. In the context of recommender systems, an LLM agent could "hallucinate" an item that doesn't exist or incorrectly associate an item with certain attributes.
- Top-N Recommendation: The task of recommending a ranked list of N items to a user. Evaluation metrics like
HR@NandNDCG@Nare typically used for this task. - Reranking: The process of reordering an initial list of recommended items generated by a primary recommender system. Reranking is often used to optimize for secondary objectives like diversity, fairness, or, in this paper's case, alignment with specific user instructions.
3.2. Previous Works
The paper builds upon and differentiates itself from several categories of previous work:
3.2.1. Traditional Recommender Systems
Traditional recommender systems primarily focus on predicting user preferences based on historical interactions.
- Sequential Recommendation Models: These models aim to capture the temporal dynamics of user behavior, predicting the next item a user will interact with based on their sequence of past interactions.
- GRU4Rec (Hidasi et al., 2015): Utilizes Gated Recurrent Units (GRUs), a type of recurrent neural network (RNN), to model session-based recommendations. It focuses on capturing short-term user interests within a session.
- SASRec (Kang and McAuley, 2018): Stands for Self-Attentive Sequential Recommendation. It employs a self-attention mechanism, similar to the Transformer architecture, to capture long-range dependencies and identify relevant items in a user's action history to predict the next item.
- BERT4Rec (Sun et al., 2019): Adapts the Bidirectional Encoder Representations from Transformers (BERT) architecture for sequential recommendation. It uses a cloze objective (masked item prediction) to learn bidirectional context from user sequences.
- Key Concept: Self-Attention (as in SASRec): The core idea of self-attention is to weigh the importance of different items in a sequence when predicting the next item. For an input sequence of items, self-attention calculates a weighted sum of values for each item, where weights are determined by the similarity between a query (current item) and keys (other items in the sequence).
The original attention mechanism (from Vaswani et al., 2017) is defined as:
$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
Where:
  - $Q$ (Query), $K$ (Key), $V$ (Value) are matrices derived from the input embeddings.
  - $QK^T$ calculates the dot product similarity between queries and keys.
  - $\sqrt{d_k}$ is a scaling factor to prevent large dot products from pushing the softmax function into regions with tiny gradients.
  - $\mathrm{softmax}$ normalizes these scores into a probability distribution.
  - $V$ is then weighted by these probabilities. SASRec uses this to capture dependencies between items in a user's historical sequence, allowing it to focus on relevant past interactions. (A minimal code sketch follows below.)
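As a concrete illustration of the formula above, here is a minimal NumPy sketch of scaled dot-product self-attention. The toy shapes and random inputs are illustrative assumptions, not the paper's or SASRec's actual implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single sequence."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values

# Toy example: 4 items in a user's history, embedding dimension 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                         # item embeddings (illustrative)
out = scaled_dot_product_attention(X, X, X)         # self-attention: Q = K = V = X
print(out.shape)                                    # (4, 8)
```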
3.2.2. Conversational Recommender Systems (CRS)
- Traditional CRS (Sun and Zhang, 2018; Zhang et al., 2018): These systems aim to understand user intentions through multi-turn dialogues and provide personalized recommendations. They often use explicit feedback and dialogue context. However, conventional language models used in early CRS had limitations in dialogue flexibility (e.g., fixed dialogue formats, limited turns).
- Mathematical formulation for CRS: Appendix A of the paper summarizes this as:
$
\hat{i} = \arg\max_{i \in I} P(s_{T+1} = i \mid S_u, H_u; \psi).
$
Where:
  - $\hat{i}$ is the predicted next item.
  - $I$ is the set of all items.
  - $P(s_{T+1} = i \mid S_u, H_u; \psi)$ is the probability of item $i$ being the next interaction, given the user's historical interaction sequence $S_u$, their historical dialogues $H_u$, and the model parameters $\psi$.
  - $S_u$ is the user's historical interaction sequence.
  - $H_u$ represents multiple historical dialogues of a user, with $N$ being the number of dialogues.
  - $\psi$ denotes the model's parameters.
- LLM-enhanced CRS (Friedman et al., 2023; Feng et al., 2023): More recent work leverages the power of LLMs to improve dialogue understanding and flexibility in CRS, overcoming some limitations of conventional models.
3.2.3. LLM-based Recommendation Agents
These are newer approaches that use LLMs to simulate user behavior or perform recommendation tasks.
- ToolRec (Zhao et al., 2024): Enhances recommender systems by using LLMs as surrogate users who employ external tools (e.g., search and retrieval tools) to refine recommendations based on preferences. It focuses on attribute-oriented tools.
- AgentCF (Zhang et al., 2024b): Constructs both user and item agents, powered by LLMs, to simulate interactions. These agents have memory modules for preferences and behaviors, and they use a collaborative reflection mechanism for continuous improvement.
- User-side operations (Wang et al., 2024; Huang et al., 2023b): Some recent works have started to focus on user-side operations, generating reranking results based on user instructions and individual memory. The current paper cites these as related but aims to provide a more robust and explicitly user-protective framework.
3.3. Technological Evolution
The field of recommender systems has evolved from basic collaborative filtering and content-based methods to sophisticated deep learning models that capture complex patterns in sequential user behavior. The integration of large language models represents a significant recent leap.
- Early RS: Simple algorithms like item-based collaborative filtering (Sarwar et al., 2001) or matrix factorization.
- Deep Learning RS: Introduction of neural networks for sequence modeling (e.g., GRU4Rec), attention mechanisms (SASRec), and transformer-based models (BERT4Rec) to better capture temporal and contextual user interests.
- Generative RS: LLMs are used to generate recommendations, sometimes treating item IDs as tokens (Geng et al., 2022).
- Agent-based RS: The latest evolution, where LLMs are empowered with reasoning, planning, and tool-use capabilities to act as intelligent agents. Initially, these agents were primarily platform-centric (simulating users to optimize platform goals). This paper's work represents a critical shift towards user-centric agents, where the agent directly serves and protects the individual user.
3.4. Differentiation Analysis
Compared to the main methods in related work, this paper's iAgent and especially i2Agent introduce several core differences and innovations:
- User-Agent-Platform Paradigm: The most fundamental difference is the introduction of an intermediary LLM agent as a protective shield for the user. Unlike previous user-platform (traditional RS) or LLM-as-surrogate-user-for-platform-optimization (some RecAgents) paradigms, iAgent explicitly operates on the user's behalf to ensure indirect exposure and control.
- Focus on User Instructions: While some CRS can handle dialogues, iAgent is specifically designed to understand free-text user instructions that can be highly flexible and go beyond simple product attributes. This is explicitly supported by the new INSTRUCTREC datasets.
- Individual Optimization: i2Agent is uniquely optimized for individual users. Its dynamic memory mechanism (profile generator, dynamic extractor) builds and updates user profiles solely based on that user's feedback and instructions, without influence from other users' behaviors. This directly addresses the lack of personalization and the bias against less-active users. In contrast, most traditional RS and even many RecAgents are optimized using data from all users, leading to generalization but often neglecting individual nuances.
- Proactive User Control: The agent doesn't just react to implicit behavior but actively processes explicit user instructions to guide recommendations. This gives users a sense of control over their recommendation experience.
- Mitigation of Echo Chambers and Bias: By design, i2Agent aims to act as a shield against echo chambers (e.g., by filtering irrelevant "ads" and promoting diversity) and against bias towards active users, which is not a primary objective of many existing systems.
- Integration of External Knowledge and Self-reflection: iAgent and i2Agent actively use tools to acquire external knowledge (e.g., via search) to become domain-specific experts, and they incorporate a self-reflection mechanism to reduce LLM hallucination, enhancing the reliability of their reranking outputs. While ToolRec also uses tools, iAgent's integration sits within a user-centric, protective framework.

The key distinction lies in i2Agent's fundamental design principle: it acts as a personal, intelligent proxy, safeguarding and prioritizing the individual user's dynamic interests above platform objectives or aggregate user trends.
4. Methodology
The paper proposes a new user-agent-platform paradigm where an LLM agent serves as a protective shield between the user and the recommender system. This section details the two proposed agents: iAgent (Instruction-aware Agent) and i2Agent (Individual Instruction-aware Agent).
4.1. Principles
The core idea behind the proposed methodology is to empower users by placing an intelligent, personalized LLM agent in control of their recommendation experience. Instead of users being directly exposed to platform algorithms, the agent acts as an intermediary, interpreting user instructions, leveraging external knowledge, and learning from individual feedback to provide recommendations that align with the user's true interests, rather than purely commercial objectives. This indirect exposure aims to mitigate issues like algorithmic manipulation, echo chambers, and lack of personalization for less-active users. The theoretical basis is that by creating a user-specific intelligent entity, it can act as a knowledgeable advocate, filtering and reranking recommendations from the platform to better serve the individual's needs.
The paper first defines its task in Appendix A, distinguishing it from traditional sequential and conversational recommendation:
- Sequential Recommendation: The goal is to predict the next item a user will interact with, based on their past interactions $S_u$.
$
\hat{i} = \arg\max_{i \in I} P(s_{T+1} = i \mid S_u; \psi).
$
Where:
  - $\hat{i}$ is the predicted next item.
  - $I$ is the set of all items.
  - $P(s_{T+1} = i \mid S_u; \psi)$ is the probability distribution over items for the next interaction, given the user's historical interaction sequence $S_u$ and the model's parameters $\psi$.
- Conversational Recommendation: This involves analyzing user intentions through multi-turn dialogues alongside historical information to achieve personalized recommendations.
$
\hat{i} = \arg\max_{i \in I} P(s_{T+1} = i \mid S_u, H_u; \psi).
$
Where:
  - $\hat{i}$, $I$, $S_u$, and $\psi$ are as defined for sequential recommendation.
  - $H_u$ represents multiple historical dialogues of a user, with $N$ being the number of dialogues.
- Our Task (iAgent/i2Agent Paradigm): Unlike the above, this task focuses on learning from the user's explicit instructions to build an agentic shield and provide personalized recommendations.
$
\hat{i} = \arg\max_{i \in I} P(s_{T+1} = i \mid S_u, \Omega_u, E; \psi_u).
$
Where:
  - $\hat{i}$, $I$, and $S_u$ are as defined previously.
  - $\Omega_u$ represents the user's instructions.
  - $E$ represents the external environment, which can supply real-time information to the agent.
  - $\psi_u$ denotes the user-specific model parameters, emphasizing individual personalization.
4.2. Core Methodology In-depth (Layer by Layer)
The workflow of the proposed agents is shown in Figure 2 (from the original paper).
VLM Description: The image is a schematic diagram illustrating the structures of the iAgent and i²Agent user agent models. The iAgent focuses on processing static memory and experience, while the i²Agent incorporates dynamic memory and dynamic interest to optimize the feedback mechanism of the recommendation system.
4.2.1. iAgent
iAgent is the basic instruction-aware agent designed to understand user intentions from free-text instructions and leverage external knowledge. It consists of a Parser, a Reranker, and a Self-reflection Mechanism.
4.2.1.1. Parser
The Parser component is built upon a large language model (LLM), denoted as $M_p$. Its role is to process a user's instruction, identify both direct demands and hidden preferences, and determine whether external tools are needed to gather more information.
- Initial Prompting: The user's instruction $X_I$ is concatenated with a specific parser prompt template $P_{tp}$. This combined input is fed into the LLM $M_p$.
- Internal Knowledge Generation & Tool Decision: Based on this input, $M_p$ generates internal knowledge related to the instruction. Simultaneously, it decides whether to use external tools (e.g., a search engine) and generates instruction keywords if tool usage is deemed necessary. The process is formally described as:
$ O_T, X_{KW}, X_{IK} = M_p(X_I \parallel P_{tp}) $
Where:
  - $O_T$ represents the potential external tool options decided by $M_p$.
  - $X_{KW}$ denotes the instruction keywords generated for tool usage.
  - $X_{IK}$ is the internal knowledge generated by $M_p$ about the instruction.
  - $M_p$ is the large language model acting as the parser.
  - $X_I$ is the user's instruction.
  - $P_{tp}$ is the prompt template for the parser.
  - $\parallel$ denotes concatenation.
- For example, if a user asks for "feel-good books that offer an escape from reality and focus on athletic fashion for everyday people," the parser might generate keywords like "feel-good books," "escape from reality," and "athletic fashion," and decide to use a Google Search tool.
- External Knowledge Acquisition (if needed): If $M_p$ decides to use external tools, the generated instruction keywords $X_{KW}$ and the identified tool options $O_T$ are used to explore the open world and extract external knowledge. This step is formulated as:
$ X_{EK} = M_p(O_T \parallel X_{KW}) $
Where:
  - $X_{EK}$ is the external knowledge acquired using the tools.
  - $M_p$ is used again, presumably to interpret the results from the external tool query.
  - $O_T$ and $X_{KW}$ are the tool options and keywords from the previous step.
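To make the two parser equations concrete, here is a minimal Python sketch of the flow. The helpers `call_llm` and `web_search`, the prompt template, and the JSON output format are all illustrative assumptions, not the paper's actual interface.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call (any chat-completion API); stubbed for the sketch."""
    raise NotImplementedError

def web_search(query: str) -> str:
    """Hypothetical search tool (e.g., a search-API wrapper); stubbed."""
    raise NotImplementedError

# Illustrative stand-in for the parser prompt template P_tp.
PARSER_TEMPLATE = (
    "Analyze the user's instruction. Return JSON with keys "
    "'internal_knowledge' (string), 'use_tool' (bool), 'keywords' (list).\n"
    "Instruction: {instruction}"
)

def parse_instruction(instruction: str) -> dict:
    # O_T, X_KW, X_IK = M_p(X_I || P_tp): one call yields the tool decision,
    # the instruction keywords, and the internal knowledge.
    parsed = json.loads(call_llm(PARSER_TEMPLATE.format(instruction=instruction)))
    if parsed["use_tool"]:
        # X_EK = M_p(O_T || X_KW): query the tool, let the LLM digest the results.
        results = web_search(" ".join(parsed["keywords"]))
        parsed["external_knowledge"] = call_llm(
            f"Summarize these search results for the instruction above:\n{results}"
        )
    return parsed
```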
4.2.1.2. Reranker
After the Parser component has gathered both internal and external knowledge related to the user's instruction, the Reranker component takes this information and reorders an initial list of recommendations provided by a platform.
- Input Collection: The reranker, an LLM-based model denoted as $M_r$, receives several pieces of information:
  - The instruction-related internal knowledge $X_{IK}$ and external knowledge $X_{EK}$ obtained from the Parser.
  - The user's historical sequential information $X_{SU}$, which acts as a static memory of the user's past interactions.
  - The textual information $X_{Item}$ (e.g., title, description) of the candidate items present in the initial ranking list $\mathcal{R}$ provided by the recommender platform, including the item indices from $\mathcal{R}$.
  - A specific reranker prompt template $P_{tr}$.
- Reranking Process: All these inputs are concatenated and fed into the reranker LLM to generate a new, re-ranked list of items. Formally, this process is expressed as:
$
\mathcal{R}^* = M_r(X_{IK} \parallel X_{EK} \parallel X_{SU} \parallel X_{Item} \parallel P_{tr})
$
Where:
  - $\mathcal{R}^*$ is the re-ranked item list, optimized according to the user's instructions and knowledge.
  - $M_r$ is the large language model acting as the reranker.
  - $X_{IK}$ is the internal knowledge.
  - $X_{EK}$ is the external knowledge.
  - $X_{SU}$ is the user's static historical sequence.
  - $X_{Item}$ is the textual information of candidate items in the initial list $\mathcal{R}$.
  - $P_{tr}$ is the prompt template for the reranker.
4.2.1.3. Self-reflection Mechanism
LLMs are prone to generating "hallucinations" (incorrect or nonsensical outputs). To counter this, iAgent incorporates a Self-reflection Mechanism to verify the consistency of the re-ranked item list.
- Comparison: The mechanism compares the elements of the newly generated re-ranked list $\mathcal{R}^*$ with the initial ranking list $\mathcal{R}$.
- Verification and Regeneration:
  - If no discrepancies are found, $\mathcal{R}^*$ is directly outputted.
  - If differences are detected (e.g., an item is missing from the original candidate list, or new items are introduced), the self-reflection module triggers the reranker to regenerate the list. This regeneration includes an additional self-reflection prompt to guide the LLM to align its output with the original candidates. The formulation remains similar to the reranker's, but with an updated prompt:
$
\mathcal{R}^* = M_r(X_{IK} \parallel X_{EK} \parallel X_{SU} \parallel X_{Item} \parallel P_{sr})
$
Where $P_{sr}$ replaces $P_{tr}$ to specifically instruct the LLM to correct its output to match the original list. This ensures that the agent only reranks existing items and does not "invent" new ones.
4.2.2. i2Agent
While iAgent is instruction-aware, it doesn't learn from individual user feedback or dynamically adapt to evolving interests. i2Agent (Individual Instruction-aware Agent) extends iAgent by adding a dynamic memory mechanism to address these limitations, making it uniquely optimized for individual users.
4.2.2.1. Profile Generator
The Profile Generator is designed to build and maintain a user's personal profile by learning from their feedback. It simulates a neural network's training process over interaction rounds.
- Feedback Iteration Setup: In each round $T$ of the feedback update iterations, the generator takes a positive sample (the most recent interacted item) and a negative sample (a randomly selected non-interacted item).
- Item Selection for Profile Update: These sampled items, along with their textual information ($X_i^+$ for the positive, $X_i^-$ for the negative), the user's static memory $X_{SU}$, and a rank prompt template $P_{pr1}$, are fed into the generator LLM $M_{ge}$. The LLM then selects one of the two items as a 'recommended' item, which is part of the feedback loop for profile generation. The previous round's user profile $\mathcal{F}^{T-1}$ is also an input to maintain continuity. The process is described as:
$
X_G^T = M_{ge}(X_{SU} \parallel X_i^+ \parallel X_i^- \parallel \mathcal{F}^{T-1} \parallel P_{pr1})
$
Where:
  - $X_G^T$ is the recommended item generated by $M_{ge}$ in round $T$.
  - $M_{ge}$ is the large language model acting as the profile generator.
  - $X_{SU}$ is the user's static historical sequence.
  - $X_i^+$ and $X_i^-$ are the textual information of the positive and negative samples, respectively.
  - $\mathcal{F}^{T-1}$ is the user's profile from the previous round $T-1$.
  - $P_{pr1}$ is the prompt template for item selection in the generator.
- Profile Update: The user's profile for the current round $T$ is then updated by integrating the previous profile $\mathcal{F}^{T-1}$, the ground-truth interacted item (the positive sample, potentially augmented with user reviews as $X_i^{+*}$), and the item $X_G^T$ generated in the previous step. A corresponding prompt template $P_{pr2}$ guides this update, formulated as:
$
\mathcal{F}^T = M_{ge}(\mathcal{F}^{T-1} \parallel X_i^{+*} \parallel X_G^T \parallel P_{pr2})
$
Where:
  - $\mathcal{F}^T$ is the updated user profile for round $T$.
  - $X_i^{+*}$ contains the positive sample's textual information augmented with user feedback data (e.g., reviews).
  - $P_{pr2}$ is the prompt template for profile updating.
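The two-step round can be sketched as follows; `call_llm` and the prompt wording are illustrative assumptions standing in for $M_{ge}$, $P_{pr1}$, and $P_{pr2}$.

```python
def update_profile(call_llm, profile, pos_item, neg_item, pos_review, static_memory):
    """One profile-generator round: select an item, then revise the profile."""
    # X_G^T = M_ge(X_SU || X_i+ || X_i- || F^{T-1} || P_pr1):
    # ask the LLM to pick one of the two candidates given the current profile.
    picked = call_llm(
        f"User history: {static_memory}\nCurrent profile: {profile}\n"
        f"Candidate A: {pos_item}\nCandidate B: {neg_item}\n"
        "Which candidate would this user choose? Answer with its text."
    )
    # F^T = M_ge(F^{T-1} || X_i+* || X_G^T || P_pr2):
    # revise the profile using the ground-truth item (with its review) as feedback.
    return call_llm(
        f"Previous profile: {profile}\n"
        f"Ground-truth item and review: {pos_item} | {pos_review}\n"
        f"Model's pick: {picked}\n"
        "Rewrite the user profile so future picks match the ground truth."
    )
```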
4.2.2.2. Dynamic Extractor
Similar to an attention mechanism, the Dynamic Extractor focuses on extracting information most relevant to the current instruction, forming a dynamic memory.
- Input Collection: An extractor LLM $M_e$ is prompted with:
  - The latest user profile $\mathcal{F}^T$ from the Profile Generator.
  - The user's static memory $X_{SU}$.
  - The current user instruction $X_I$.
  - The instruction-related internal knowledge $X_{IK}$ and external knowledge $X_{EK}$ (as derived by the Parser in iAgent).
  - A specific extractor prompt template $P_e$.
- Dynamic Memory Generation: The extractor processes these inputs to generate a dynamic profile and a dynamic interest. These two components together constitute the user's dynamic memory, which captures evolving interests based on the immediate instruction and updated profile. This process is formalized as:
$
\mathcal{F}_d^T, X_{DU} = M_e(\mathcal{F}^T \parallel X_{SU} \parallel X_I \parallel X_{IK} \parallel X_{EK} \parallel P_e)
$
Where:
  - $\mathcal{F}_d^T$ is the dynamic profile.
  - $X_{DU}$ represents the dynamic interest.
  - $M_e$ is the large language model acting as the dynamic extractor.
  - Other symbols are as defined previously.
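A compact sketch of the extractor call; the prompt wording and the two-line output convention are assumptions made for illustration.

```python
def extract_dynamic_memory(call_llm, profile, static_memory, instruction,
                           internal_knowledge, external_knowledge):
    """F_d^T, X_DU = M_e(F^T || X_SU || X_I || X_IK || X_EK || P_e) -- a sketch."""
    reply = call_llm(
        f"Profile: {profile}\nHistory: {static_memory}\n"
        f"Instruction: {instruction}\n"
        f"Knowledge: {internal_knowledge}\n{external_knowledge}\n"
        "Return two lines: (1) the DYNAMIC PROFILE relevant to this instruction; "
        "(2) the DYNAMIC INTEREST it implies."
    )
    dynamic_profile, dynamic_interest = reply.split("\n", 1)
    return dynamic_profile, dynamic_interest
```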
4.2.2.3. Reranker (for i2Agent)
With the enhanced dynamic memory, the reranker in i2Agent can make even more personalized reranking decisions.
- Enhanced Input for Reranking: Similar to iAgent's reranker, this component uses an LLM $M_r$. However, its input now also includes the newly generated dynamic profile $\mathcal{F}_d^T$ and dynamic interest $X_{DU}$. The reranking process is expressed as:
$
\mathcal{R}^* = M_r(X_{IK} \parallel X_{EK} \parallel X_{SU} \parallel \mathcal{F}_d^T \parallel X_{DU} \parallel X_{Item} \parallel P_{tr}^*)
$
Where:
  - $\mathcal{R}^*$ is the final re-ranked item list.
  - $P_{tr}^*$ represents the specific prompt template for the reranker in i2Agent.
  - All other symbols are as defined previously.
- Self-reflection: A self-reflection mechanism is also implemented in i2Agent's reranker, identical in function to that in iAgent, to ensure consistency and prevent hallucination. It uses the same inputs as the reranker but with the self-reflection prompt during regeneration if discrepancies are detected.
5. Experimental Setup
5.1. Datasets
The authors constructed four new datasets, collectively named INSTRUCTREC, because no existing dataset included proactive user instructions in the user-agent-platform paradigm. These datasets are derived from well-known public recommendation datasets, augmented with generated user instructions.
- Source Datasets:
- Amazon (Ni et al., 2019): Two subsets were used: "Books" and "Movies and TV." These provide 1-5 star ratings, textual reviews, and metadata (titles, descriptions, categories, pricing).
- Yelp: This dataset contains over 67,000 business reviews (primarily restaurants) from three major English-speaking cities. It includes business metadata (name, location, category, attributes) and user interactions (ratings, reviews).
- Goodreads (Wan et al., 2019): Derived from an online platform for book reviews, it offers user-generated ratings, reviews, and book metadata (ISBNs, title, author, publication year, genre).
- Data Preprocessing:
- Users and items with fewer than 5 associated actions (interactions) were removed to ensure sufficient data density.
- INSTRUCTREC Dataset Construction:
- Instruction Generator:
- Manual annotation of several instruction-review pairs (few-shot examples) was performed.
- A random persona from Persona Hub (Chan et al., 2024) was assigned to each user.
- An LLM (GPT-4-mini is mentioned in the future-directions discussion, but the generation model is not explicitly stated here) was prompted with the few-shot examples, a user's review, and their persona to generate a free-text instruction for each interaction.
- A list of instruction-review pairs was maintained to dynamically update the few-shot examples, allowing the LLM to decide if new instructions should be included.
- Instruction Cleaner:
- To prevent data leakage (where the instruction might implicitly reveal the ground-truth item), an LLM was used to attempt to recover the item from the generated instruction.
- Given an instruction, the LLM was asked to choose between the ground-truth item and a randomly selected negative item and to generate a certainty score.
- Instructions were retained if the LLM could not infer the ground-truth item. An equal number of correctly inferred instructions with low certainty scores were also kept. This ensures instructions are not trivially predictable from the item itself.
The overview of the INSTRUCTREC dataset construction is illustrated in the figure below:
VLM Description: The image is a diagram illustrating the data flow between the instruction generator and instruction cleaner. User reviews are processed by the instruction generator (LLM) to create examples, which are then refined by the instruction cleaner (LLM) to produce instructions.
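As a concrete illustration of the cleaner's leakage check, here is a minimal sketch; the `call_llm` helper, the "choice|certainty" reply format, and the 0.5 threshold are assumptions, not the paper's stated values.

```python
def keep_instruction(call_llm, instruction, true_item, negative_item):
    """Leakage filter sketch: keep the instruction only if an LLM cannot
    confidently recover the ground-truth item from it."""
    reply = call_llm(
        f"Instruction: {instruction}\n"
        f"Item A: {true_item}\nItem B: {negative_item}\n"
        "Which item does the instruction refer to? "
        "Answer as '<A or B>|<certainty between 0 and 1>'."
    )
    choice, certainty = reply.split("|")
    # Retain if the LLM guessed wrong, or guessed right with low certainty.
    return choice.strip() != "A" or float(certainty) < 0.5
```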
The statistics of the constructed INSTRUCTREC datasets are as follows:
The following are the results from Table 1 of the original paper:
| Dataset | \|U\| | \|V\| | \|ε\| | Density | #\|X_I\| | #\|S_U\| |
|---|---|---|---|---|---|---|
| INSTRUCTREC - Amazon Book | 7,377 | 120,925 | 207,759 | 0.023% | 164 | 1276 |
| INSTRUCTREC - Amazon Movietv | 5,649 | 28,987 | 79,737 | 0.049% | 40 | 726 |
| INSTRUCTREC - Goodreads | 11,734 | 57,364 | 618,330 | 0.092% | 41 | 2827 |
| INSTRUCTREC - Yelp | 2,950 | 31,636 | 63,142 | 0.068% | 40 | 1976 |
Where:
- $|U|$ represents the number of unique users.
- $|V|$ represents the number of unique items.
- $|ε|$ represents the number of interactions (user-item pairs).
- Density is the ratio of interactions to the total possible user-item pairs.
- $\#|X_I|$ denotes the average token length of user instructions.
- $\#|S_U|$ represents the average token length of the user's static memory (historical interactions).
5.2. Evaluation Metrics
The paper uses standard top-N ranking metrics and introduces specialized metrics to evaluate the mitigation of the echo chamber effect and popularity bias.
5.2.1. Standard Ranking Metrics
These metrics evaluate the quality of the ranked list of recommendations. For all these metrics, higher values indicate better performance.
- Hit Rate (HR@N):
  - Conceptual Definition: Measures whether the ground-truth item is present in the top-N recommended items. It's a binary measure per user: 1 if the item is there, 0 otherwise.
  - Mathematical Formula:
$
\mathrm{HR@N} = \frac{\text{Number of users for whom the ground-truth item is in top-N}}{\text{Total number of users}}
$
  - Symbol Explanation:
    - N: The number of top recommendations considered.
    - "Number of users...": The count of individual users for whom the target item was successfully recommended within the first N positions.
    - "Total number of users": The total count of users in the evaluation set.
- Normalized Discounted Cumulative Gain (NDCG@N):
  - Conceptual Definition: A position-aware metric that evaluates the relevance of recommended items, giving higher scores to relevant items that appear earlier in the list.
  - Mathematical Formula:
$
\mathrm{NDCG@N} = \frac{\mathrm{DCG@N}}{\mathrm{IDCG@N}}, \quad \mathrm{DCG@N} = \sum_{k=1}^{N} \frac{\mathrm{rel}_k}{\log_2(k+1)}
$
And IDCG@N is the ideal DCG, calculated for a perfect ranking where all relevant items are at the top. For implicit feedback (where relevance is binary, 1 if interacted, 0 otherwise):
$
\mathrm{IDCG@N} = \sum_{k=1}^{\min(N, |R|)} \frac{1}{\log_2(k+1)}
$
(assuming all relevant items are ranked perfectly)
  - Symbol Explanation:
    - N: The number of top recommendations considered.
    - $\mathrm{rel}_k$: The relevance score of the item at position $k$. In implicit feedback scenarios, it is typically 1 if the item is the ground-truth item and 0 otherwise.
    - $\log_2(k+1)$: A logarithmic discount factor, meaning items at higher ranks (smaller $k$) contribute more to the score.
    - DCG@N: Discounted Cumulative Gain at rank N.
    - IDCG@N: Ideal Discounted Cumulative Gain at rank N, representing the maximum possible DCG for a given query.
    - $|R|$: The total number of relevant items for the user (in this case, typically 1 ground-truth item).
- Mean Reciprocal Rank (MRR):
  - Conceptual Definition: Measures the average of the reciprocal ranks of the first relevant item. If the first relevant item is at rank $r$, its reciprocal rank is $1/r$.
  - Mathematical Formula:
$
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}
$
  - Symbol Explanation:
    - $|Q|$: The total number of queries (users) in the evaluation set.
    - $\mathrm{rank}_i$: The rank position of the first relevant item for the $i$-th query.
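A minimal sketch of these three metrics for the paper's setting of one ground-truth item per user (so per-user IDCG is 1); the toy ranks are illustrative.

```python
import math

def hr_at_n(rank: int, n: int) -> float:
    """Per-user HR@N for a single ground-truth item at 1-indexed `rank`."""
    return 1.0 if rank <= n else 0.0

def ndcg_at_n(rank: int, n: int) -> float:
    """Per-user NDCG@N with one relevant item: IDCG = 1/log2(2) = 1."""
    return 1.0 / math.log2(rank + 1) if rank <= n else 0.0

def mrr(ranks: list[int]) -> float:
    """Mean reciprocal rank over all users' first-relevant-item ranks."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Example: ground-truth items ranked at positions 1, 3, and 7 for three users.
ranks = [1, 3, 7]
print(sum(hr_at_n(r, 3) for r in ranks) / len(ranks))    # HR@3 ≈ 0.667
print(sum(ndcg_at_n(r, 3) for r in ranks) / len(ranks))  # NDCG@3
print(mrr(ranks))                                        # (1 + 1/3 + 1/7) / 3
```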
5.2.2. Echo Chamber Effect and Popularity Bias Metrics
These metrics are designed to specifically assess how well the proposed method mitigates the echo chamber effect and avoids simply recommending popular items. Higher values generally indicate better mitigation.
- Filtered Ads Rate (FR@k):
  - Conceptual Definition: Measures whether Ads items (irrelevant items from other domains, simulating advertisements) are ranked below a certain position $k$. A high FR@k means ads are successfully demoted.
  - Mathematical Formula:
$
\mathrm{FR@k} = \begin{cases} 1, & \text{if } r_{Ads} > k, \\ 0, & \text{if } r_{Ads} \leq k. \end{cases}
$
  - Symbol Explanation:
    - $k$: The threshold rank position.
    - $r_{Ads}$: The position of an Ads item in the re-ranked list. If the Ads item is successfully pushed beyond rank $k$, it contributes 1 to the FR@k score; the average across all users is then reported. Ads items are randomly selected from a different data domain (e.g., from InstructRec - Amazon Movietv for InstructRec - Amazon Book). They are randomly inserted into the candidate list to mitigate position bias in LLMs.
- Popularity-Weighted Ranking Metrics (P-HR@N, P-MRR, P-NDCG@N):
  - Conceptual Definition: These metrics adjust standard ranking metrics by penalizing recommendations of very popular items. This helps assess whether the system recommends diverse items, including less popular ones, rather than just relying on popularity.
  - Mathematical Formula for Popularity Weighting:
$
\mathrm{P\text{-}Rank} = (1 - \sigma(\mathrm{freq}_i)) \cdot \mathrm{Rank}.
$
  - Symbol Explanation:
    - $\mathrm{P\text{-}Rank}$: The popularity-weighted version of a standard ranking metric (e.g., P-HR@3, P-MRR, P-NDCG@3).
    - $\mathrm{Rank}$: The standard ranking metric (e.g., HR, MRR, NDCG).
    - $\mathrm{freq}_i$: The frequency of item $i$ in the dataset, representing its popularity.
    - $\sigma$: The sigmoid function, $\sigma(x) = \frac{1}{1 + e^{-x}}$, which maps any real value to a value between 0 and 1. This ensures the popularity weight $(1 - \sigma(\mathrm{freq}_i))$ lies between 0 and 1. If an item is very popular (high $\mathrm{freq}_i$), $\sigma(\mathrm{freq}_i)$ will be close to 1, making the weight close to 0 and heavily penalizing the rank score. Conversely, for less popular items, $\sigma(\mathrm{freq}_i)$ will be smaller, leading to a higher popularity weight and less penalization.
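Both metrics are simple to compute; a minimal sketch follows. How $\mathrm{freq}_i$ is scaled (raw count vs. normalized frequency) is not specified in this summary and is an assumption here.

```python
import math

def fr_at_k(ad_rank: int, k: int) -> float:
    """Per-user FR@k: 1 if the injected ad sits below position k, else 0."""
    return 1.0 if ad_rank > k else 0.0

def popularity_weighted(rank_score: float, item_freq: float) -> float:
    """P-Rank = (1 - sigmoid(freq_i)) * Rank."""
    sigmoid = 1.0 / (1.0 + math.exp(-item_freq))
    return (1.0 - sigmoid) * rank_score

# Example: an ad reranked to position 5 passes FR@3; a hit on a rare item
# (freq near 0) keeps about half its weight, a very popular one keeps little.
print(fr_at_k(5, 3))                  # 1.0
print(popularity_weighted(1.0, 0.1))  # ≈ 0.475
print(popularity_weighted(1.0, 6.0))  # ≈ 0.0025
```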
5.2.3. Additional Analyses
- Active vs. Less-Active Users: Users are categorized based on their activity (top 20% most active, remaining 80% less active) to assess performance disparities.
- Reranking Ratio: The probability of changes in the top-ranked items after reranking (e.g., top 1, 3, 5 positions). This indicates how frequently the agent intervenes.
- Hallucination Rate: Measured by assessing the occurrence rate of LLM hallucinations in the reranking list, used especially to evaluate the effectiveness of the self-reflection mechanism.
5.3. Baselines
The proposed methods (iAgent, i2Agent) are compared against three classes of baselines:
5.3.1. Sequential Recommendation Methods
These are traditional models that focus on predicting the next item based on a user's interaction sequence, without explicit instructions.
- GRU4Rec (Hidasi et al., 2015): A session-based recommendation model using Gated Recurrent Units (GRUs) to capture short-term user interests.
- BERT4Rec (Sun et al., 2019): Applies the BERT architecture to sequential recommendation, using a bidirectional encoder to model user behavior sequences with a masked item prediction objective.
- SASRec (Kang and McAuley, 2018): A self-attention-based model designed to capture long-term user interests by identifying relevant items in a user's history using an attention mechanism.
5.3.2. Instruction-aware Methods
These baselines use text matching or LLM capabilities to incorporate textual instructions or queries. For these methods, the concatenated text of the instruction serves as the query, and candidate items' metadata (title, description) are treated as documents.
- BM25 (Robertson et al., 2009): A probabilistic ranking function widely used in information retrieval. It measures the similarity between a query (instruction) and a document (item) based on term frequency and inverse document frequency.
- BGE-Rerank (Xiao et al., 2023): A cross-encoder model that processes both the query and document together to generate a relevance score. It captures fine-grained interactions for higher accuracy in reordering candidate documents.
- EasyRec (Ren and Huang, 2024): A lightweight LLM-based recommendation system that uses contrastive learning to align semantic representations from textual data with collaborative filtering signals. It employs a bi-encoder architecture.
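To illustrate the BM25 baseline setup described above (instruction as query, item metadata as documents), here is a minimal sketch assuming the third-party `rank_bm25` package; the corpus and query are toy examples.

```python
from rank_bm25 import BM25Okapi

docs = [
    "feel-good novel about small-town friendships",
    "dark psychological thriller set in winter",
    "running shoes and athletic fashion guide",
]
bm25 = BM25Okapi([d.split() for d in docs])        # tokenize by whitespace
query = "feel-good books that offer an escape".split()
scores = bm25.get_scores(query)                    # one relevance score per document
ranking = sorted(range(len(docs)), key=lambda i: -scores[i])
print(ranking)                                     # document indices reranked by BM25
```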
5.3.3. Recommendation Agents
These are recent LLM-based agents designed for recommendation tasks.
- ToolRec (Zhao et al., 2024): Uses LLMs as surrogate users to enhance recommendations by leveraging external attribute-oriented tools (e.g., rank and retrieval tools) to explore and refine item suggestions. The self-reflection mechanism from the current paper was added to ToolRec for a fairer comparison regarding hallucination.
- AgentCF (Zhang et al., 2024b): Constructs LLM-powered user and item agents with memory modules to simulate user-item interactions and improve modeling via a collaborative reflection mechanism. For fair comparison, the number of memory-building rounds was set to 1, and the self-reflection mechanism was also equipped.
6. Results & Analysis
The empirical evaluation aims to answer four research questions (RQs):
- RQ1: How does the performance of iAgent and i2Agent compare to state-of-the-art baselines across various datasets?
- RQ2: Can our method mitigate the echo chamber effect?
- RQ3: How well does our method perform for both active and less-active user groups?
- RQ4: Are the proposed reranker and self-reflection mechanism effective in practice?
6.1. Core Results Analysis
The main results, comparing iAgent and i2Agent against various baselines, are presented in Tables 2 and 3 for the INSTRUCTREC datasets.
The following are the results from Table 2 of the original paper:
(Columns 2-5: InstructRec - Amazon Book; columns 6-9: InstructRec - Amazon Movietv)

| Model | HR@1 | HR@3 | NDCG@3 | MRR | HR@1 | HR@3 | NDCG@3 | MRR |
|---|---|---|---|---|---|---|---|---|
| GRU4Rec | 11.00 | 31.41 | 22.53 | 30.10 | 15.80 | 36.85 | 27.63 | 34.36 |
| BERT4Rec | 11.48 | 30.90 | 22.32 | 30.31 | 14.74 | 35.13 | 26.36 | 33.43 |
| SASRec | 11.08 | 31.34 | 22.42 | 30.15 | 34.52 | 49.71 | 43.18 | 48.06 |
| BM25 | 9.92 | 24.48 | 18.21 | 27.00 | 11.29 | 30.27 | 22.09 | 30.04 |
| BGE-Rerank | 25.36 | 45.90 | 37.11 | 42.84 | 25.44 | 47.48 | 38.02 | 43.28 |
| EasyRec | 30.70 | 48.87 | 41.09 | 46.14 | 34.96 | 61.30 | 50.15 | 52.98 |
| ToolRec | 10.56 | 30.60 | 21.88 | 29.77 | 13.84 | 35.67 | 26.20 | 33.21 |
| AgentCF | 14.24 | 34.16 | 25.55 | 32.77 | 25.90 | 49.82 | 39.64 | 44.23 |
| iAgent | 31.89 | 48.99 | 41.69 | 47.23 | 38.19 | 56.87 | 48.93 | 53.04 |
| i2Agent | 35.11 | 53.51 | 45.64 | 50.28 | 46.43 | 65.77 | 57.67 | 60.43 |
The following are the results from Table 3 of the original paper:
(Columns 2-5: InstructRec - Goodreads; columns 6-9: InstructRec - Yelp)

| Model | HR@1 | HR@3 | NDCG@3 | MRR | HR@1 | HR@3 | NDCG@3 | MRR |
|---|---|---|---|---|---|---|---|---|
| GRU4Rec | 15.36 | 39.52 | 29.08 | 35.41 | 10.94 | 30.67 | 21.88 | 29.70 |
| BERT4Rec | 12.70 | 34.69 | 25.02 | 32.32 | 10.99 | 31.02 | 22.32 | 30.05 |
| SASRec | 18.52 | 41.24 | 31.47 | 37.60 | 12.59 | 31.09 | 22.65 | 30.15 |
| BM25 | 14.25 | 40.34 | 29.01 | 35.40 | 12.85 | 33.08 | 24.34 | 31.85 |
| BGE-Rerank | 17.26 | 40.82 | 30.60 | 36.97 | 33.05 | 55.29 | 45.70 | 49.90 |
| EasyRec | 13.94 | 35.38 | 26.11 | 33.27 | 32.41 | 56.31 | 46.04 | 49.86 |
| ToolRec | 19.06 | 42.79 | 32.61 | 38.44 | 12.07 | 30.92 | 22.83 | 30.21 |
| AgentCF | 21.61 | 46.09 | 35.60 | 40.96 | 13.36 | 34.83 | 25.66 | 32.61 |
| iAgent | 23.56 | 47.01 | 36.98 | 42.19 | 37.40 | 56.33 | 48.28 | 52.42 |
| i2Agent | 30.97 | 56.69 | 45.76 | 49.14 | 39.22 | 57.92 | 49.96 | 53.78 |
Analysis:
- Superiority of i2Agent: i2Agent consistently achieves the best performance across all four datasets and all standard ranking metrics (HR@1, HR@3, NDCG@3, MRR). It significantly outperforms the second-best baseline, EasyRec, with an average improvement of 16.6%. This strong performance validates the effectiveness of its design, particularly the dynamic memory mechanism that incorporates individual feedback and evolving interests.
- Effectiveness of iAgent: Even the simpler iAgent (without dynamic memory) shows strong results, often ranking third or fourth, and notably outperforming traditional sequential models and other recommendation agents like ToolRec and AgentCF. This highlights the benefits of its instruction-aware parser and reranker in understanding user intentions and leveraging external knowledge.
- Instruction-aware vs. Sequential: Instruction-aware baselines (BGE-Rerank, EasyRec) generally outperform traditional sequential recommendation methods (GRU4Rec, BERT4Rec, SASRec). This suggests that incorporating explicit user instructions provides valuable signals that improve recommendation quality.
- LLM-based Baselines: EasyRec, which is pre-trained on Amazon datasets and aligns collaborative filtering with natural language information, performs very well, often being the second-best model. This indicates the power of combining LLMs with traditional recommendation techniques. ToolRec and AgentCF perform better than sequential baselines but are notably surpassed by iAgent and i2Agent, suggesting that their agentic designs are less effective at capturing individual user-specific instructions and feedback than the proposed models.
6.2. Data Presentation (Tables)
The core results are presented above in Tables 2 and 3. The analysis of echo chamber effects and active/less-active users is presented in the following subsections.
6.3. Ablation Studies / Parameter Analysis
The paper includes analyses on the echo chamber effect, performance for active and less-active users, and the impact of the self-reflection mechanism and reranking ratio, which can be considered forms of ablation or component analysis.
6.3.1. Echo Chamber Effect (RQ2)
The echo chamber effect is evaluated using FR@k (Filtered Ads Rate) and P-Rank (Popularity-weighted ranking metrics). Higher values for these metrics indicate better mitigation of the echo chamber effect and popularity bias.
The following are the results from Table 4 of the original paper:
(Columns 2-5: InstructRec - Amazon Book; columns 6-9: InstructRec - Yelp)

| Model | FR@1 | FR@3 | P-HR@3 | P-MRR | FR@1 | FR@3 | P-HR@3 | P-MRR |
|---|---|---|---|---|---|---|---|---|
| EasyRec | 68.41 | 64.32 | 59.28 | 56.09 | 76.45 | 66.50 | 61.05 | 56.85 |
| ToolRec | 70.13 | 66.61 | 36.74 | 35.80 | 72.64 | 63.64 | 32.50 | 32.73 |
| AgentCF | 58.02 | 50.04 | 41.10 | 39.42 | 71.30 | 64.15 | 38.46 | 36.44 |
| iAgent | 71.98 | 67.82 | 59.51 | 57.32 | 78.24 | 69.71 | 62.74 | 58.76 |
| i2Agent | 77.15 | 70.15 | 64.70 | 60.87 | 87.69 | 84.20 | 64.48 | 60.20 |
The following are the results from Table 7 of the original paper:
(All columns: InstructRec - Amazon Book)

| Model | FR@1 | FR@3 | FR@5 | FR@10 | P-HR@1 | P-HR@3 | P-NDCG@3 | P-MRR |
|---|---|---|---|---|---|---|---|---|
| EasyRec | 68.41 | 64.32 | 60.30 | 0.03 | 37.60 | 59.28 | 50.00 | 56.09 |
| ToolRec | 70.13 | 66.61 | 62.41 | 0.00 | 12.63 | 36.74 | 26.24 | 35.80 |
| AgentCF | 58.02 | 50.04 | 41.32 | 0.06 | 17.00 | 41.10 | 30.68 | 39.42 |
| iAgent | 71.98 | 67.82 | 60.74 | 0.08 | 38.85 | 59.51 | 50.70 | 57.32 |
| i2Agent | 77.15 | 70.15 | 64.05 | 0.09 | 42.62 | 64.70 | 55.25 | 60.87 |
The following are the results from Table 8 of the original paper:
(Columns 2-5: InstructRec - Amazon Movietv; columns 6-9: InstructRec - Goodreads)

| Model | P-HR@1 | P-HR@3 | P-NDCG@3 | P-MRR | P-HR@1 | P-HR@3 | P-NDCG@3 | P-MRR |
|---|---|---|---|---|---|---|---|---|
| EasyRec | 37.31 | 65.45 | 53.54 | 56.69 | 14.22 | 35.98 | 26.56 | 33.84 |
| ToolRec | 14.73 | 38.12 | 27.96 | 35.57 | 19.21 | 43.22 | 32.92 | 38.88 |
| AgentCF | 27.61 | 53.33 | 42.37 | 47.37 | 21.82 | 46.62 | 35.99 | 41.47 |
| iAgent | 40.50 | 60.71 | 52.11 | 56.61 | 23.75 | 47.50 | 37.34 | 42.68 |
| i2Agent | 49.51 | 70.47 | 61.67 | 64.69 | 31.22 | 57.33 | 46.23 | 49.71 |
The following are the results from Table 9 of the original paper:
(All columns: InstructRec - Yelp)

| Model | FR@1 | FR@3 | FR@5 | FR@10 | P-HR@1 | P-HR@3 | P-NDCG@3 | P-MRR |
|---|---|---|---|---|---|---|---|---|
| EasyRec | 76.45 | 66.50 | 57.16 | 0.05 | 37.18 | 61.05 | 52.51 | 56.85 |
| ToolRec | 72.64 | 63.64 | 53.29 | 0.00 | 12.40 | 32.50 | 23.88 | 32.73 |
| AgentCF | 71.30 | 64.15 | 52.01 | 0.02 | 14.73 | 38.46 | 28.33 | 36.44 |
| iAgent | 78.24 | 69.71 | 56.17 | 0.12 | 41.74 | 62.74 | 53.82 | 58.76 |
| i2Agent | 87.69 | 86.20 | 84.00 | 0.16 | 43.67 | 64.48 | 55.62 | 60.20 |
Analysis:
- Ad Filtering (FR@k): i2Agent demonstrates superior ability to filter out unwanted Ads items. For InstructRec - Amazon Book, its FR@1 is 77.15%, and for InstructRec - Yelp it reaches 87.69% (FR@3 of 84.20%). This indicates that i2Agent can accurately interpret user instructions and identify irrelevant or undesired items, pushing them further down the ranking list or out of the top positions. This is a direct measure of its function as a "protective shield."
- Mitigation of Popularity Bias (P-HR@N, P-MRR, P-NDCG@N): i2Agent consistently achieves the highest scores across the popularity-weighted metrics (P-HR@3, P-MRR, P-NDCG@3), often by a significant margin compared to other baselines, especially ToolRec and AgentCF. This means i2Agent is not simply recommending popular items but is providing more diverse recommendations, including less popular items that are still relevant to the user's specific instructions and dynamically learned interests.
- Overall Conclusion for RQ2: The results strongly suggest that i2Agent is highly effective in mitigating the echo chamber effect and popularity bias, validating its role as a user shield by providing more diversified and instruction-aligned recommendations.
6.3.2. Protect Less-Active Users (RQ3)
The performance for active (top 20%) and less-active (remaining 80%) users is analyzed to assess the agent's ability to provide personalization regardless of user activity level.
The following are the results from Table 5 of the original paper:
(Columns 2-5: Less-Active Users; columns 6-9: Active Users)

| Model | HR@1 | HR@3 | NDCG@3 | MRR | HR@1 | HR@3 | NDCG@3 | MRR |
|---|---|---|---|---|---|---|---|---|
| EasyRec | 32.93 | 51.07 | 43.32 | 48.04 | 28.71 | 47.64 | 39.53 | 44.61 |
| ToolRec | 10.57 | 30.86 | 22.01 | 29.88 | 10.04 | 31.73 | 22.32 | 29.54 |
| AgentCF | 14.79 | 35.00 | 26.26 | 33.35 | 14.87 | 34.37 | 25.93 | 33.24 |
| iAgent | 34.07 | 50.79 | 43.67 | 49.00 | 29.96 | 47.73 | 40.14 | 45.71 |
| i2Agent | 37.92 | 55.75 | 47.84 | 52.11 | 33.27 | 51.74 | 43.81 | 48.67 |
The following are the results from Table 10 of the original paper:
(Columns 2-5: Less-Active Users; columns 6-9: Active Users)

| Model | HR@1 | HR@3 | NDCG@3 | MRR | HR@1 | HR@3 | NDCG@3 | MRR |
|---|---|---|---|---|---|---|---|---|
| EasyRec | 35.17 | 61.56 | 50.39 | 53.21 | 35.47 | 63.15 | 51.26 | 53.64 |
| ToolRec | 14.43 | 36.56 | 26.96 | 33.81 | 12.98 | 32.18 | 23.94 | 31.79 |
| AgentCF | 27.38 | 50.98 | 40.91 | 45.36 | 21.84 | 45.58 | 35.57 | 40.76 |
| iAgent | 39.36 | 57.85 | 49.98 | 53.96 | 34.95 | 55.19 | 46.88 | 51.02 |
| i2Agent | 47.32 | 66.64 | 58.57 | 61.22 | 44.71 | 64.99 | 56.60 | 59.30 |
The following are the results from Table 11 of the original paper:
(Columns 2-5: Less-Active Users; columns 6-9: Active Users)

| Model | HR@1 | HR@3 | NDCG@3 | MRR | HR@1 | HR@3 | NDCG@3 | MRR |
|---|---|---|---|---|---|---|---|---|
| EasyRec | 14.44 | 35.77 | 26.55 | 33.67 | 14.13 | 36.86 | 27.09 | 33.86 |
| ToolRec | 19.85 | 43.34 | 33.29 | 39.11 | 17.89 | 42.02 | 31.63 | 37.35 |
| AgentCF | 22.91 | 46.67 | 36.50 | 41.89 | 19.82 | 46.70 | 35.22 | 40.10 |
| iAgent | 24.57 | 48.12 | 38.00 | 43.04 | 22.62 | 46.96 | 36.64 | 41.70 |
| i2Agent | 32.67 | 58.08 | 47.28 | 50.46 | 29.76 | 55.39 | 44.56 | 48.19 |
The following are the results from Table 12 of the original paper:
(Columns 2-5: Less-Active Users; columns 6-9: Active Users)

| Model | HR@1 | HR@3 | NDCG@3 | MRR | HR@1 | HR@3 | NDCG@3 | MRR |
|---|---|---|---|---|---|---|---|---|
| EasyRec | 32.83 | 56.50 | 46.29 | 50.13 | 30.17 | 50.87 | 42.03 | 47.16 |
| ToolRec | 11.79 | 31.21 | 22.88 | 30.14 | 14.21 | 32.42 | 24.66 | 32.11 |
| AgentCF | 13.11 | 34.72 | 25.50 | 32.46 | 13.22 | 36.41 | 26.45 | 32.89 |
| iAgent | 37.80 | 56.17 | 48.37 | 52.70 | 39.40 | 59.10 | 50.62 | 53.90 |
| i2Agent | 39.02 | 58.49 | 50.23 | 53.88 | 43.25 | 57.75 | 51.48 | 56.05 |
Analysis:
- Performance for Less-Active Users: i2Agent consistently shows the highest performance for less-active users across all metrics and datasets. For example, on InstructRec-Amazon Book, i2Agent achieves an MRR of 52.11% for less-active users, significantly higher than EasyRec's 48.04%. This is a crucial finding, as less-active users are typically disadvantaged in traditional recommender systems due to limited interaction data. The success of i2Agent here is attributed to its dynamic memory mechanism, which builds individual profiles from each user's own feedback rather than being dominated by the aggregate behavior of more active users.
- Performance for Active Users: i2Agent also performs strongly for active users, often surpassing the other baselines. However, the paper notes that for active users with very long text sequences (extensive interaction histories), LLM performance can decline due to context window limitations (Liu et al., 2024b), which may explain why the gain over iAgent or EasyRec is sometimes less pronounced for active users than for less-active ones.
- Overall Conclusion for RQ3: i2Agent effectively enhances personalization for both active and less-active user groups. Its ability to provide robust recommendations for less-active users, by forming dedicated individual profiles, directly addresses a critical fairness issue in recommender systems.
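For reference, a minimal sketch of the ranking metrics reported in these tables, assuming a single ground-truth item per test case (consistent with the single-ground-truth feedback setting the paper describes); the function name is illustrative:

```python
import math

def ranking_metrics(ranked_items, ground_truth, ks=(1, 3)):
    """HR@K, NDCG@K, and MRR for one test case with one relevant item."""
    rank = None
    if ground_truth in ranked_items:
        rank = ranked_items.index(ground_truth) + 1  # 1-based position
    metrics = {}
    for k in ks:
        hit = rank is not None and rank <= k
        metrics[f"HR@{k}"] = 1.0 if hit else 0.0
        # With a single relevant item, NDCG@K reduces to 1 / log2(rank + 1).
        metrics[f"NDCG@{k}"] = 1.0 / math.log2(rank + 1) if hit else 0.0
    metrics["MRR"] = 1.0 / rank if rank is not None else 0.0
    return metrics
```

The table values would then be these per-case metrics averaged over all test cases within each user group.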
6.3.3. Model Study (RQ4)
This section investigates the effectiveness of the self-reflection mechanism and the impact of the reranker on the ranking list.
The following figure (Figure 4 from the original paper) shows the hallucination rate with and without the self-reflection mechanism, and the probability of changes in the ranking list after reranking:
VLM Description: The image is a chart showing hallucination rates with and without the self-reflection mechanism for the different agent models (first row) and the probability of changes in the ranking list after reranking (second row), with comparisons on the Amazon Books, Goodreads, and Yelp datasets.
Analysis of Self-reflection Mechanism:
- The top row of Figure 4 illustrates the hallucination rate (reported as "Error Rate") with and without the self-reflection mechanism across datasets for iAgent and i2Agent, as well as for the baselines ToolRec and AgentCF (which were also equipped with self-reflection for a fair comparison).
- The results clearly show that the self-reflection mechanism dramatically reduces the hallucination rate for all models. For instance, without self-reflection, the error rate can be very high (e.g., around 10-20% for iAgent). With self-reflection, the error rate drops by at least 20-fold, becoming negligible (close to 0% for ToolRec and AgentCF, and very low for iAgent and i2Agent).
- The paper notes that i2Agent may still exhibit a slightly higher error rate than the others even with self-reflection. This is attributed to the longer text sequences (dynamic memory, profiles, instructions, static memory) that i2Agent processes, which can challenge LLM performance due to context-length limitations.
- Conclusion for Self-reflection: The self-reflection mechanism is highly effective in mitigating LLM-induced hallucinations, ensuring that the reranked list contains only items from the original candidate set and thus improving the reliability and consistency of the agent's output.
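The paper does not publish the mechanism's implementation, but the behavior described here (accept the reranked list only if it is a permutation of the candidate set, otherwise re-prompt) can be sketched as follows; `llm_rerank`, its signature, and the retry budget are all assumptions:

```python
def rerank_with_reflection(llm_rerank, candidates, context, max_retries=3):
    """Wrap an LLM reranker with a self-reflection check: the output is
    accepted only if it contains exactly the original candidate items."""
    feedback = None
    for _ in range(max_retries):
        ranked = llm_rerank(candidates, context, feedback)  # assumed interface
        hallucinated = [i for i in ranked if i not in set(candidates)]
        missing = [i for i in candidates if i not in set(ranked)]
        if not hallucinated and not missing:
            return ranked  # passes the reflection check
        feedback = (f"Invalid output: items outside the candidate set "
                    f"{hallucinated}, dropped candidates {missing}. "
                    "Rerank using exactly the original candidates.")
    return list(candidates)  # fall back to the platform's original order
```

Under this reading, the "Error Rate" in Figure 4 would correspond to the fraction of reranking calls whose raw output fails the containment check.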
Analysis of Reranking Ratio:
- The bottom row of Figure 4 shows the "Reranking Ratio," which measures the probability that the elements in the top-K (specifically K=1, 3, 5) positions of the ranking list change after reranking.
- The results indicate that changes occur almost every time during reranking for all models employing a reranker (iAgent, i2Agent, ToolRec, AgentCF, EasyRec). For example, at top-1 the reranking ratio is consistently above 90% across datasets, and it remains very high at top-3 and top-5.
- Conclusion for Reranking Ratio: This demonstrates that the agent is not merely passing through the initial recommendations but is actively and consistently performing personalized reranking based on user instructions and learned interests. The high reranking ratio confirms that the agent fulfills its purpose of modifying the platform's initial list to better suit individual user needs.
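A sketch of how the reranking ratio could be computed from paired before/after lists; whether a "change" counts reordering within the top-K or only membership changes is not fully specified, so this sketch counts any difference in the top-K prefix:

```python
def reranking_ratio(original_lists, reranked_lists, k):
    """Fraction of test cases whose top-K prefix differs after reranking."""
    changed = sum(before[:k] != after[:k]
                  for before, after in zip(original_lists, reranked_lists))
    return changed / len(original_lists)
```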
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces a novel user-agent-platform paradigm for recommender systems, addressing critical user vulnerabilities in the traditional user-platform model. The core innovation is the iAgent and its advanced version, i2Agent, which act as intelligent, personalized LLM-based shields between users and recommendation algorithms.
The key contributions include:
- New Paradigm: Shifting from direct user-platform interaction to agent-mediated indirect exposure, prioritizing user interests.
- INSTRUCTREC Datasets: Creation of unique datasets with user-driven free-text instructions, enabling research into instruction-aware recommendation.
- iAgent: A foundational agent capable of understanding flexible user instructions, leveraging internal and external knowledge (via tools) to rerank items.
- i2Agent: An enhanced agent incorporating a dynamic memory mechanism (profile generator and dynamic extractor) to learn from individual user feedback and adapt to evolving interests, ensuring optimization specific to each user.
- Empirical Superiority: i2Agent consistently outperforms state-of-the-art baselines, achieving a substantial 16.6% average improvement across ranking metrics on the INSTRUCTREC datasets.
- Mitigation of Biases: The proposed agents effectively mitigate the echo chamber effect (by filtering ads and promoting diversity) and alleviate model bias against less-active users, providing robust personalization.
- Self-reflection: A self-reflection mechanism significantly reduces LLM hallucination, enhancing the reliability of the reranking process.

In essence, i2Agent successfully serves as a protective and empowering shield for users, delivering more personalized, diverse, and fair recommendations that align with explicit user instructions and dynamic individual preferences.
7.2. Limitations & Future Work
The authors acknowledge several limitations of their current work and propose future research directions:
7.2.1. Limitations
- Language Dependency: The current implementation primarily focuses on English instructions, and its effectiveness across different languages remains to be explored. This suggests a potential language bias in its current form.
- Nuance of User Satisfaction: While evaluation metrics show improvements in recommendation quality, they may not fully capture the nuanced aspects of user satisfaction and long-term engagement. Standard metrics might not perfectly reflect a user's subjective experience or happiness with the recommendations over time.
7.2.2. Future Work
- More Effective Reranker: The current reranker is a zero-shot LLM. Future work could fine-tune smaller, open-source LLMs (e.g., Phi-3, Gemma) on the INSTRUCTREC dataset to build a more efficient and effective reranker. Additionally, existing advanced recommendation models could serve as tools for the agent to retrieve candidate items more effectively.
- Multi-step Feedback: The current feedback mechanism is limited to a single ground-truth item with insufficient feedback explanations. In a real-world deployment, collecting continuous, multi-step feedback on user-agent interactions, along with detailed user explanations, could enable the development of more interpretable and adaptive agents.
- Mutual Learning: There is potential for mutual learning between user-side agents (i2Agent) and platform-side recommendation models. i2Agent could provide feedback and explanations to platform models, helping them improve their performance; conversely, existing RecAgents could iteratively improve through collaboration with i2Agent. i2Agent could also serve as a sophisticated reward function for reinforcement learning (RL)-based recommendation models, guiding them toward more user-centric optimization.
7.3. Personal Insights & Critique
This paper presents a highly relevant and forward-thinking approach to recommender systems, particularly in the age of powerful LLMs. The shift to a user-agent-platform paradigm is a significant conceptual leap, moving beyond the platform's commercial interests to genuinely advocate for the user.
Inspirations:
- User Empowerment: The idea of an agent as a "protective shield" is compelling. It offers a tangible solution to common user frustrations with recommender systems, such as feeling manipulated or being stuck in filter bubbles. This user-centric philosophy could become a cornerstone for future recommendation research.
- Dynamic Personalization: The dynamic memory mechanism of i2Agent, which learns from individual feedback and extracts evolving interests, is a powerful concept. It highlights how LLMs can go beyond static profiles to capture the fluid nature of human preferences, which is especially inspiring for personalized learning environments and adaptive interfaces.
- Bridging LLMs and Traditional RS: The paper cleverly integrates LLMs for instruction understanding and reranking while still leveraging the output of traditional recommender systems, demonstrating a practical path for LLMs to enhance existing infrastructure rather than replace it.
- Addressing Fairness: The explicit focus on less-active users and echo chamber mitigation shows a strong ethical consideration, pushing the boundaries of what recommender systems can achieve in terms of fairness and diversity.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Computational Cost & Latency: LLM inference, especially across multiple components (parser, generator, extractor, reranker, self-reflection) plus potential external tool calls, can be computationally expensive and introduce significant latency. This could be a major barrier for real-time recommendation, especially with a personal agent running for each user. The paper's proposal to fine-tune smaller LLMs as future work is a good direction for efficiency.
- Scalability for Billions of Users: While i2Agent is optimized for individual users, deploying a separate, stateful LLM agent for each of potentially billions of users is currently economically and computationally prohibitive. The "individual" aspect needs to be balanced against scalability; a shared foundation model with personalized adapters or efficient per-user memory storage could be explored.
- Instruction Quality & Ambiguity: The approach's effectiveness relies heavily on the quality and clarity of user instructions. While LLMs are good at understanding natural language, ambiguous or contradictory instructions could lead to suboptimal recommendations. How the agent handles user frustration with its own interpretations is also an important user-experience question that is not fully explored.
- Ethical Implications of Agent Autonomy: Giving an agent "control" over recommendations raises questions about its autonomy. How transparent is the agent to the user? Can the user fully override or inspect the agent's decisions? What if the agent's interpretation deviates significantly from the user's intent? The self-reflection mechanism addresses internal consistency, but not necessarily alignment with subjective user satisfaction.
- Robustness to Adversarial Instructions: Could users or malicious actors craft instructions to manipulate the agent, potentially for personal gain or to bypass platform controls?
- Domain-Specificity of External Tools: The reliance on external tools (such as Google Search) introduces dependencies and potential biases from those tools, and effective external knowledge acquisition may be challenging for niche domains.

Overall, iAgent and i2Agent represent a compelling vision for the future of recommender systems, offering a powerful blueprint for user-centric and ethical AI in recommendation. The technical challenges, though significant, open up exciting avenues for future research.