iAgent: LLM Agent as a Shield between User and Recommender Systems
TL;DR Summary
The paper introduces a user-agent-platform paradigm with LLM agents as a protective shield, addressing vulnerabilities in traditional recommender systems. It develops the INSTRUCTREC datasets and two agents, iAgent and i2Agent, with the latter showing an average 16.6% improvement across ranking metrics.
Abstract
Traditional recommender systems usually take the user-platform paradigm, where users are directly exposed under the control of the platform's recommendation algorithms. However, defects in recommendation algorithms may put users in very vulnerable positions under this paradigm. First, many sophisticated models are often designed with commercial objectives in mind, focusing on the platform's benefits, which may hinder their ability to protect and capture users' true interests. Second, these models are typically optimized using data from all users, which may overlook individual users' preferences. Due to these shortcomings, users may experience several disadvantages under the traditional user-platform direct exposure paradigm, such as lack of control over the recommender system, potential manipulation by the platform, echo chamber effects, or lack of personalization for less active users due to the dominance of active users during collaborative learning. Therefore, there is an urgent need to develop a new paradigm to protect user interests and alleviate these issues. Recently, some researchers have introduced LLM agents to simulate user behaviors, but these approaches primarily aim to optimize platform-side performance, leaving core issues in recommender systems unresolved. To address these limitations, we propose a new user-agent-platform paradigm, where the agent serves as a protective shield between the user and the recommender system, enabling indirect exposure.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
iAgent: LLM Agent as a Shield between User and Recommender Systems
1.2. Authors
Wujiang Xu, Yunxiao Shi, Zujie Liang, Xuying Ning, Kai Mei, Kun Wang, Xi Zhu, Min Xu, Yongfeng Zhang
Affiliations: Rutgers University; University of Technology Sydney; Independent Researcher; University of Illinois Urbana-Champaign; Nanyang Technological University
1.3. Journal/Conference
Published on arXiv, a preprint server, indicating that the paper has not yet undergone formal peer review for a specific conference or journal. arXiv is a well-regarded platform for rapid dissemination of research in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.
1.4. Publication Year
2025 (posted 2025-02-20T15:58:25 UTC)
1.5. Abstract
Traditional recommender systems (RS) typically operate under a user-platform paradigm where users are directly exposed to algorithms often designed with commercial objectives, leading to issues like lack of user control, potential manipulation, echo chambers, and insufficient personalization for less active users. To address these vulnerabilities, the paper proposes a new user-agent-platform paradigm, introducing a large language model (LLM) agent as a protective shield between the user and the recommender system. The authors construct four new recommendation datasets, called INSTRUCTREC, which include user instructions. They then design an Instruction-aware Agent (iAgent) that uses tools to acquire external knowledge to understand user intentions. Further enhancing this, they introduce an Individual Instruction-aware Agent (i2Agent), which incorporates a dynamic memory mechanism to learn from individual user feedback. Empirical results on the INSTRUCTREC datasets show that i2Agent significantly outperforms state-of-the-art baselines, achieving an average improvement of 16.6% across ranking metrics. Moreover, i2Agent effectively mitigates echo chamber effects and alleviates model bias for disadvantaged (less-active) users, thus fulfilling its role as a user shield.
1.6. Original Source Link
https://arxiv.org/abs/2502.14662 (Publication status: Preprint) PDF Link: https://arxiv.org/pdf/2502.14662v4.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve revolves around the inherent vulnerabilities of users within the traditional user-platform paradigm of recommender systems. In this prevalent model, users are directly subject to the platform's recommendation algorithms.
This problem is important because, despite their widespread adoption, traditional recommender systems exhibit several critical flaws:
- Commercial Objectives vs. User Interests: Many sophisticated recommendation models are primarily designed to optimize platform benefits (e.g., clicks, conversion rates) rather than genuinely protecting or capturing users' true interests. This can lead to algorithmic manipulation where users are swayed towards items benefiting the platform, not necessarily themselves.
- Lack of Individual Personalization: Models are often optimized using aggregate data from all users, which can overlook unique individual preferences and needs. This results in generalized recommendations that fail to cater to specific users.
- User Disadvantages: Under this paradigm, users experience:
  - Lack of Control: Users have minimal say over the recommendations they receive.
  - Manipulation: Algorithms can subtly guide user choices for commercial gain.
  - Echo Chamber Effects: Users can become trapped in filter bubbles, repeatedly receiving homogeneous items that reinforce existing interests, leading to a lack of diversity and exposure to new content.
  - Bias Against Less-Active Users: Recommendation algorithms, particularly those relying on collaborative learning, tend to favor active users whose extensive interaction data dominates the learning process, leading to poor personalization for less-active or "disadvantaged" users.

Prior research, including some using Large Language Model (LLM) agents to simulate user behaviors, has predominantly focused on optimizing platform-side performance, leaving these core user-centric issues largely unaddressed.

The paper's entry point and innovative idea is to introduce a new user-agent-platform paradigm. Instead of direct user-platform interaction, an intelligent LLM agent acts as a "protective shield" between the user and the recommender system, enabling indirect exposure. This agent is designed to prioritize user interests, understand individual instructions, learn from personal feedback, and mitigate the aforementioned biases, thereby empowering users and enhancing their control over the recommendation experience.
2.2. Main Contributions / Findings
The paper makes several primary contributions to address the limitations of traditional recommender systems:
- Proposed New Paradigm: Introduces the user-agent-platform paradigm, positioning an LLM agent as a "protective shield" to mediate interactions between users and recommender systems, ensuring indirect exposure and prioritizing user interests.
- Novel Datasets (INSTRUCTREC): Constructed four new recommendation datasets (from Amazon, Goodreads, and Yelp) called INSTRUCTREC. These datasets are unique in that they include user-driven, free-text instructions for each interaction record, enabling research into instruction-aware recommendation.
- Instruction-aware Agent (iAgent): Developed iAgent, a foundational LLM-based agent capable of understanding free-text user instructions. It leverages a parser to extract user intentions and, when needed, external tools (e.g., search APIs) to acquire domain-specific knowledge, acting as an expert to inform a reranker.
- Individual Instruction-aware Agent (i2Agent): Enhanced iAgent by introducing i2Agent, which incorporates a dynamic memory mechanism. This mechanism includes a profile generator (to build and maintain user-specific profiles from individual feedback) and a dynamic extractor (to capture evolving interests based on real-time instructions). This ensures individual optimization, independent of other users' behaviors.
- Empirical Validation & Superior Performance: Conducted extensive experiments on the four INSTRUCTREC datasets, demonstrating that i2Agent consistently and significantly outperforms state-of-the-art baselines across standard ranking metrics, achieving an average improvement of 16.6%.
- Mitigation of Systemic Biases: Demonstrated that i2Agent effectively mitigates the echo chamber effect (by filtering unwanted ads and recommending diverse items) and alleviates model bias against disadvantaged (less-active) users, offering more personalized services.
- Self-reflection Mechanism: Introduced a self-reflection mechanism within both iAgent and i2Agent to verify the content of the reranked list and address the hallucination problems often associated with generative LLMs, significantly reducing the hallucination rate.

These contributions collectively address the core issues in traditional recommender systems by shifting the focus from platform-centric optimization to user-centric empowerment and protection, leading to more personalized, diverse, and fair recommendations.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with the following foundational concepts:
- Recommender Systems (RS): These are information filtering systems that predict what a user might like based on their past behavior, preferences, and similar users' behaviors. They are widely used across various online platforms (e.g., e-commerce, streaming services, social media) to suggest items (products, movies, news, etc.). The goal is to enhance user experience and drive engagement or sales.
- Personalization: The process of tailoring recommendations to individual users. Effective personalization means that the recommendations are highly relevant to a specific user's unique tastes and needs, rather than being generic or popular items.
- User-Platform Paradigm: This refers to the traditional model where users interact directly with a platform's algorithms. The platform's recommender system processes user data and delivers recommendations without an intermediary. The paper argues this direct exposure can make users vulnerable.
- LLM Agent: An LLM agent (Large Language Model agent) is an artificial intelligence program that leverages the capabilities of a large language model (like GPT-3, GPT-4, Llama, etc.) to perform tasks autonomously. Unlike a simple LLM that just generates text, an LLM agent can understand prompts, reason, plan a sequence of actions, use external tools (e.g., search engines, calculators, APIs), execute those actions, and reflect on the outcomes to improve future performance. LLM agents often have memory to maintain context over time.
- Echo Chamber Effect (or Filter Bubble): This phenomenon occurs in recommender systems or social media when algorithms reinforce a user's existing interests or beliefs by repeatedly recommending homogeneous items or content. This leads to a lack of diversity in recommended content and can narrow a user's perspective, making them less exposed to new ideas or different viewpoints. The paper aims to mitigate this.
- Bias in Recommender Systems: This refers to systematic errors or unfairness in recommendations. Examples include:
- Popularity Bias: Recommending popular items more often, leading to a "rich-get-richer" effect for popular items and neglecting niche or less-known items.
- Exposure Bias: Certain items or users get more exposure than others.
- Bias against Less-Active Users: Users with fewer interactions (less-active users) often receive poorer quality or less personalized recommendations because collaborative filtering algorithms rely heavily on sufficient interaction data.
- Hallucination in LLMs: This refers to the phenomenon where LLMs generate plausible-sounding but factually incorrect or nonsensical information. In the context of recommender systems, an LLM agent could "hallucinate" an item that doesn't exist or incorrectly associate an item with certain attributes.
- Top-N Recommendation: The task of recommending a ranked list of N items to a user. Evaluation metrics like
HR@NandNDCG@Nare typically used for this task. - Reranking: The process of reordering an initial list of recommended items generated by a primary recommender system. Reranking is often used to optimize for secondary objectives like diversity, fairness, or, in this paper's case, alignment with specific user instructions.
3.2. Previous Works
The paper builds upon and differentiates itself from several categories of previous work:
3.2.1. Traditional Recommender Systems
Traditional recommender systems primarily focus on predicting user preferences based on historical interactions.
- Sequential Recommendation Models: These models aim to capture the temporal dynamics of user behavior, predicting the next item a user will interact with based on their sequence of past interactions.
- GRU4Rec (Hidasi et al., 2015): Utilizes Gated Recurrent Units (GRUs), a type of recurrent neural network (RNN), to model session-based recommendations. It focuses on capturing short-term user interests within a session.
- SASRec (Kang and McAuley, 2018): Stands for Self-Attentive Sequential Recommendation. It employs a self-attention mechanism, similar to the Transformer architecture, to capture long-range dependencies and identify relevant items in a user's action history to predict the next item.
- BERT4Rec (Sun et al., 2019): Adapts the Bidirectional Encoder Representations from Transformers (BERT) architecture for sequential recommendation. It uses a cloze objective (masked item prediction) to learn bidirectional context from user sequences.
- Key Concept: Self-Attention (as in SASRec): The core idea of self-attention is to weigh the importance of different items in a sequence when predicting the next item. For an input sequence of items, self-attention calculates a weighted sum of values for each item, where weights are determined by the similarity between a query (current item) and keys (other items in the sequence).
The original attention mechanism (from Vaswani et al., 2017) is defined as:
$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
Where:
  - $Q$ (Query), $K$ (Key), $V$ (Value) are matrices derived from the input embeddings.
  - $QK^T$ calculates the dot product similarity between queries and keys.
  - $\sqrt{d_k}$ is a scaling factor to prevent large dot products from pushing the softmax function into regions with tiny gradients.
  - $\mathrm{softmax}$ normalizes these scores into a probability distribution.
  - $V$ is then weighted by these probabilities. SASRec uses this to capture dependencies between items in a user's historical sequence, allowing it to focus on relevant past interactions. (A minimal code sketch follows below.)
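As a concrete illustration of the formula above, here is a minimal NumPy sketch of scaled dot-product self-attention. The toy shapes and random inputs are illustrative assumptions, not the paper's or SASRec's actual implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single sequence."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values

# Toy example: 4 items in a user's history, embedding dimension 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                         # item embeddings (illustrative)
out = scaled_dot_product_attention(X, X, X)         # self-attention: Q = K = V = X
print(out.shape)                                    # (4, 8)
```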
3.2.2. Conversational Recommender Systems (CRS)
- Traditional CRS (Sun and Zhang, 2018; Zhang et al., 2018): These systems aim to understand user intentions through multi-turn dialogues and provide personalized recommendations. They often use explicit feedback and dialogue context. However, conventional language models used in early CRS had limitations in dialogue flexibility (e.g., fixed dialogue formats, limited turns).
- Mathematical formulation for CRS: Appendix A of the paper summarizes this as:
$
\hat{i} = \arg\max_{i \in I} P(s_{T+1} = i \mid S_u, H_u; \psi).
$
Where:
  - $\hat{i}$ is the predicted next item.
  - $I$ is the set of all items.
  - $P(s_{T+1} = i \mid S_u, H_u; \psi)$ is the probability of item $i$ being the next interaction, given the user's historical interaction sequence $S_u$, their historical dialogues $H_u$, and the model parameters $\psi$.
  - $S_u$ is the user's historical interaction sequence.
  - $H_u$ represents multiple historical dialogues of a user, with $N$ being the number of dialogues.
  - $\psi$ denotes the model's parameters.
- LLM-enhanced CRS (Friedman et al., 2023; Feng et al., 2023): More recent work leverages the power of LLMs to improve dialogue understanding and flexibility in CRS, overcoming some limitations of conventional models.
3.2.3. LLM-based Recommendation Agents
These are newer approaches that use LLMs to simulate user behavior or perform recommendation tasks.
- ToolRec (Zhao et al., 2024): Enhances recommender systems by using LLMs as surrogate users who employ external tools (e.g., search and retrieval tools) to refine recommendations based on preferences. It focuses on attribute-oriented tools.
- AgentCF (Zhang et al., 2024b): Constructs both user and item agents, powered by LLMs, to simulate interactions. These agents have memory modules for preferences and behaviors, and they use a collaborative reflection mechanism for continuous improvement.
- User-side operations (Wang et al., 2024; Huang et al., 2023b): Some recent works have started to focus on user-side operations, generating reranking results based on user instructions and individual memory. The current paper cites these as related but aims to provide a more robust and explicitly user-protective framework.
3.3. Technological Evolution
The field of recommender systems has evolved from basic collaborative filtering and content-based methods to sophisticated deep learning models that capture complex patterns in sequential user behavior. The integration of large language models represents a significant recent leap.
- Early RS: Simple algorithms like item-based collaborative filtering (Sarwar et al., 2001) or matrix factorization.
- Deep Learning RS: Introduction of neural networks for sequence modeling (e.g., GRU4Rec), attention mechanisms (SASRec), and transformer-based models (BERT4Rec) to better capture temporal and contextual user interests.
- Generative RS: LLMs are used to generate recommendations, sometimes treating item IDs as tokens (Geng et al., 2022).
- Agent-based RS: The latest evolution, where LLMs are empowered with reasoning, planning, and tool-use capabilities to act as intelligent agents. Initially, these agents were primarily platform-centric (simulating users to optimize platform goals). This paper's work represents a critical shift towards user-centric agents, where the agent directly serves and protects the individual user.
3.4. Differentiation Analysis
Compared to the main methods in related work, this paper's iAgent and especially i2Agent introduce several core differences and innovations:
- User-Agent-Platform Paradigm: The most fundamental difference is the introduction of an intermediary LLM agent as a protective shield for the user. Unlike previous user-platform (traditional RS) or LLM-as-surrogate-user-for-platform-optimization (some RecAgents) paradigms, iAgent explicitly operates on the user's behalf to ensure indirect exposure and control.
- Focus on User Instructions: While some CRS can handle dialogues, iAgent is specifically designed to understand free-text user instructions that can be highly flexible and go beyond simple product attributes. This is explicitly supported by the new INSTRUCTREC datasets.
- Individual Optimization: i2Agent is uniquely optimized for individual users. Its dynamic memory mechanism (profile generator, dynamic extractor) builds and updates user profiles solely based on that user's feedback and instructions, without influence from other users' behaviors. This directly addresses the lack of personalization and the bias against less-active users. In contrast, most traditional RS and even many RecAgents are optimized using data from all users, leading to generalization but often neglecting individual nuances.
- Proactive User Control: The agent doesn't just react to implicit behavior but actively processes explicit user instructions to guide recommendations. This gives users a sense of control over their recommendation experience.
- Mitigation of Echo Chambers and Bias: By design, i2Agent aims to act as a shield against echo chambers (e.g., by filtering irrelevant "ads" and promoting diversity) and against bias towards active users, which is not a primary objective of many existing systems.
- Integration of External Knowledge and Self-reflection: iAgent and i2Agent actively use tools to acquire external knowledge (e.g., via search) to become domain-specific experts, and they incorporate a self-reflection mechanism to reduce LLM hallucination, enhancing the reliability of their reranking outputs. While ToolRec also uses tools, iAgent's integration sits within a user-centric, protective framework.

The key distinction lies in i2Agent's fundamental design principle: it acts as a personal, intelligent proxy, safeguarding and prioritizing the individual user's dynamic interests above platform objectives or aggregate user trends.
4. Methodology
The paper proposes a new user-agent-platform paradigm where an LLM agent serves as a protective shield between the user and the recommender system. This section details the two proposed agents: iAgent (Instruction-aware Agent) and i2Agent (Individual Instruction-aware Agent).
4.1. Principles
The core idea behind the proposed methodology is to empower users by placing an intelligent, personalized LLM agent in control of their recommendation experience. Instead of users being directly exposed to platform algorithms, the agent acts as an intermediary, interpreting user instructions, leveraging external knowledge, and learning from individual feedback to provide recommendations that align with the user's true interests, rather than purely commercial objectives. This indirect exposure aims to mitigate issues like algorithmic manipulation, echo chambers, and lack of personalization for less-active users. The theoretical basis is that by creating a user-specific intelligent entity, it can act as a knowledgeable advocate, filtering and reranking recommendations from the platform to better serve the individual's needs.
The paper first defines its task in Appendix A, distinguishing it from traditional sequential and conversational recommendation:
- Sequential Recommendation: The goal is to predict the next item a user will interact with, based on their past interactions $S_u$.
$
\hat{i} = \arg\max_{i \in I} P(s_{T+1} = i \mid S_u; \psi).
$
Where:
  - $\hat{i}$ is the predicted next item.
  - $I$ is the set of all items.
  - $P(s_{T+1} = i \mid S_u; \psi)$ is the probability distribution over items for the next interaction, given the user's historical interaction sequence $S_u$ and the model's parameters $\psi$.
- Conversational Recommendation: This involves analyzing user intentions through multi-turn dialogues alongside historical information to achieve personalized recommendations.
$
\hat{i} = \arg\max_{i \in I} P(s_{T+1} = i \mid S_u, H_u; \psi).
$
Where:
  - $\hat{i}$, $I$, $S_u$, and $\psi$ are as defined for sequential recommendation.
  - $H_u$ represents multiple historical dialogues of a user, with $N$ being the number of dialogues.
- Our Task (iAgent/i2Agent Paradigm): Unlike the above, this task focuses on learning from the user's explicit instructions to build an agentic shield and provide personalized recommendations.
$
\hat{i} = \arg\max_{i \in I} P(s_{T+1} = i \mid S_u, \Omega_u, E; \psi_u).
$
Where:
  - $\hat{i}$, $I$, and $S_u$ are as defined previously.
  - $\Omega_u$ represents the user's instructions.
  - $E$ represents the external environment, which can supply real-time information to the agent.
  - $\psi_u$ denotes the user-specific model parameters, emphasizing individual personalization.
4.2. Core Methodology In-depth (Layer by Layer)
The workflow of the proposed agents is shown in Figure 2 (from the original paper).
VLM Description: The image is a schematic diagram illustrating the structures of the iAgent and i²Agent user agent models. The iAgent focuses on processing static memory and experience, while the i²Agent incorporates dynamic memory and dynamic interest to optimize the feedback mechanism of the recommendation system.
4.2.1. iAgent
iAgent is the basic instruction-aware agent designed to understand user intentions from free-text instructions and leverage external knowledge. It consists of a Parser, a Reranker, and a Self-reflection Mechanism.
4.2.1.1. Parser
The Parser component is built upon a large language model (LLM), denoted as $M_p$. Its role is to process a user's instruction, identify both direct demands and hidden preferences, and determine whether external tools are needed to gather more information.
- Initial Prompting: The user's instruction $X_I$ is concatenated with a specific parser prompt template $P_{tp}$. This combined input is fed into the LLM $M_p$.
- Internal Knowledge Generation & Tool Decision: Based on this input, $M_p$ generates internal knowledge related to the instruction. Simultaneously, it decides whether to use external tools (e.g., a search engine) and generates instruction keywords if tool usage is deemed necessary. The process is formally described as:
$ O_T, X_{KW}, X_{IK} = M_p(X_I \parallel P_{tp}) $
Where:
  - $O_T$ represents the potential external tool options decided by $M_p$.
  - $X_{KW}$ denotes the instruction keywords generated for tool usage.
  - $X_{IK}$ is the internal knowledge generated by $M_p$ about the instruction.
  - $M_p$ is the large language model acting as the parser.
  - $X_I$ is the user's instruction.
  - $P_{tp}$ is the prompt template for the parser.
  - $\parallel$ denotes concatenation.
- For example, if a user asks for "feel-good books that offer an escape from reality and focus on athletic fashion for everyday people," the parser might generate keywords like "feel-good books," "escape from reality," and "athletic fashion," and decide to use a Google Search tool.
- External Knowledge Acquisition (if needed): If $M_p$ decides to use external tools, the generated instruction keywords $X_{KW}$ and the identified tool options $O_T$ are used to explore the open world and extract external knowledge. This step is formulated as:
$ X_{EK} = M_p(O_T \parallel X_{KW}) $
Where:
  - $X_{EK}$ is the external knowledge acquired using the tools.
  - $M_p$ is used again, presumably to interpret the results from the external tool query.
  - $O_T$ and $X_{KW}$ are the tool options and keywords from the previous step.
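To make the two parser equations concrete, here is a minimal Python sketch of the flow. The helpers `call_llm` and `web_search`, the prompt template, and the JSON output format are all illustrative assumptions, not the paper's actual interface.

```python
import json

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call (any chat-completion API); stubbed for the sketch."""
    raise NotImplementedError

def web_search(query: str) -> str:
    """Hypothetical search tool (e.g., a search-API wrapper); stubbed."""
    raise NotImplementedError

# Illustrative stand-in for the parser prompt template P_tp.
PARSER_TEMPLATE = (
    "Analyze the user's instruction. Return JSON with keys "
    "'internal_knowledge' (string), 'use_tool' (bool), 'keywords' (list).\n"
    "Instruction: {instruction}"
)

def parse_instruction(instruction: str) -> dict:
    # O_T, X_KW, X_IK = M_p(X_I || P_tp): one call yields the tool decision,
    # the instruction keywords, and the internal knowledge.
    parsed = json.loads(call_llm(PARSER_TEMPLATE.format(instruction=instruction)))
    if parsed["use_tool"]:
        # X_EK = M_p(O_T || X_KW): query the tool, let the LLM digest the results.
        results = web_search(" ".join(parsed["keywords"]))
        parsed["external_knowledge"] = call_llm(
            f"Summarize these search results for the instruction above:\n{results}"
        )
    return parsed
```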
4.2.1.2. Reranker
After the Parser component has gathered both internal and external knowledge related to the user's instruction, the Reranker component takes this information and reorders an initial list of recommendations provided by a platform.
- Input Collection: The reranker, an LLM-based model denoted as $M_r$, receives several pieces of information:
  - The instruction-related internal knowledge $X_{IK}$ and external knowledge $X_{EK}$ obtained from the Parser.
  - The user's historical sequential information $X_{SU}$, which acts as a static memory of the user's past interactions.
  - The textual information $X_{Item}$ (e.g., title, description) of the candidate items present in the initial ranking list $\mathcal{R}$ provided by the recommender platform, including the item indices from $\mathcal{R}$.
  - A specific reranker prompt template $P_{tr}$.
- Reranking Process: All these inputs are concatenated and fed into the reranker LLM to generate a new, re-ranked list of items. Formally, this process is expressed as:
$
\mathcal{R}^* = M_r(X_{IK} \parallel X_{EK} \parallel X_{SU} \parallel X_{Item} \parallel P_{tr})
$
Where:
  - $\mathcal{R}^*$ is the re-ranked item list, optimized according to the user's instructions and knowledge.
  - $M_r$ is the large language model acting as the reranker.
  - $X_{IK}$ is the internal knowledge.
  - $X_{EK}$ is the external knowledge.
  - $X_{SU}$ is the user's static historical sequence.
  - $X_{Item}$ is the textual information of candidate items in the initial list $\mathcal{R}$.
  - $P_{tr}$ is the prompt template for the reranker.
4.2.1.3. Self-reflection Mechanism
LLMs are prone to generating "hallucinations" (incorrect or nonsensical outputs). To counter this, iAgent incorporates a Self-reflection Mechanism to verify the consistency of the re-ranked item list.
- Comparison: The mechanism compares the elements of the newly generated re-ranked list $\mathcal{R}^*$ with the initial ranking list $\mathcal{R}$.
- Verification and Regeneration:
  - If no discrepancies are found, $\mathcal{R}^*$ is directly outputted.
  - If differences are detected (e.g., an item is missing from the original candidate list, or new items are introduced), the self-reflection module triggers the reranker to regenerate the list. This regeneration includes an additional self-reflection prompt to guide the LLM to align its output with the original candidates. The formulation remains similar to the reranker's, but with an updated prompt:
$
\mathcal{R}^* = M_r(X_{IK} \parallel X_{EK} \parallel X_{SU} \parallel X_{Item} \parallel P_{sr})
$
Where $P_{sr}$ replaces $P_{tr}$ to specifically instruct the LLM to correct its output to match the original list. This ensures that the agent only reranks existing items and does not "invent" new ones.
4.2.2. i2Agent
While iAgent is instruction-aware, it doesn't learn from individual user feedback or dynamically adapt to evolving interests. i2Agent (Individual Instruction-aware Agent) extends iAgent by adding a dynamic memory mechanism to address these limitations, making it uniquely optimized for individual users.
4.2.2.1. Profile Generator
The Profile Generator is designed to build and maintain a user's personal profile by learning from their feedback. It simulates a neural network's training process over interaction rounds.
- Feedback Iteration Setup: In each round $T$ of the feedback update iterations, the generator takes a positive sample (the most recent interacted item) and a negative sample (a randomly selected non-interacted item).
- Item Selection for Profile Update: These sampled items, along with their textual information ($X_i^+$ for the positive, $X_i^-$ for the negative), the user's static memory $X_{SU}$, and a rank prompt template $P_{pr1}$, are fed into the generator LLM $M_{ge}$. The LLM then selects one of the two items as a 'recommended' item, which is part of the feedback loop for profile generation. The previous round's user profile $\mathcal{F}^{T-1}$ is also an input to maintain continuity. The process is described as:
$
X_G^T = M_{ge}(X_{SU} \parallel X_i^+ \parallel X_i^- \parallel \mathcal{F}^{T-1} \parallel P_{pr1})
$
Where:
  - $X_G^T$ is the recommended item generated by $M_{ge}$ in round $T$.
  - $M_{ge}$ is the large language model acting as the profile generator.
  - $X_{SU}$ is the user's static historical sequence.
  - $X_i^+$ and $X_i^-$ are the textual information of the positive and negative samples, respectively.
  - $\mathcal{F}^{T-1}$ is the user's profile from the previous round $T-1$.
  - $P_{pr1}$ is the prompt template for item selection in the generator.
- Profile Update: The user's profile for the current round $T$ is then updated by integrating the previous profile $\mathcal{F}^{T-1}$, the ground-truth interacted item (the positive sample, potentially augmented with user reviews as $X_i^{+*}$), and the item $X_G^T$ generated in the previous step. A corresponding prompt template $P_{pr2}$ guides this update, formulated as:
$
\mathcal{F}^T = M_{ge}(\mathcal{F}^{T-1} \parallel X_i^{+*} \parallel X_G^T \parallel P_{pr2})
$
Where:
  - $\mathcal{F}^T$ is the updated user profile for round $T$.
  - $X_i^{+*}$ contains the positive sample's textual information augmented with user feedback data (e.g., reviews).
  - $P_{pr2}$ is the prompt template for profile updating.
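The two-step round can be sketched as follows; `call_llm` and the prompt wording are illustrative assumptions standing in for $M_{ge}$, $P_{pr1}$, and $P_{pr2}$.

```python
def update_profile(call_llm, profile, pos_item, neg_item, pos_review, static_memory):
    """One profile-generator round: select an item, then revise the profile."""
    # X_G^T = M_ge(X_SU || X_i+ || X_i- || F^{T-1} || P_pr1):
    # ask the LLM to pick one of the two candidates given the current profile.
    picked = call_llm(
        f"User history: {static_memory}\nCurrent profile: {profile}\n"
        f"Candidate A: {pos_item}\nCandidate B: {neg_item}\n"
        "Which candidate would this user choose? Answer with its text."
    )
    # F^T = M_ge(F^{T-1} || X_i+* || X_G^T || P_pr2):
    # revise the profile using the ground-truth item (with its review) as feedback.
    return call_llm(
        f"Previous profile: {profile}\n"
        f"Ground-truth item and review: {pos_item} | {pos_review}\n"
        f"Model's pick: {picked}\n"
        "Rewrite the user profile so future picks match the ground truth."
    )
```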
4.2.2.2. Dynamic Extractor
Similar to an attention mechanism, the Dynamic Extractor focuses on extracting information most relevant to the current instruction, forming a dynamic memory.
- Input Collection: An extractor LLM $M_e$ is prompted with:
  - The latest user profile $\mathcal{F}^T$ from the Profile Generator.
  - The user's static memory $X_{SU}$.
  - The current user instruction $X_I$.
  - The instruction-related internal knowledge $X_{IK}$ and external knowledge $X_{EK}$ (as derived by the Parser in iAgent).
  - A specific extractor prompt template $P_e$.
- Dynamic Memory Generation: The extractor processes these inputs to generate a dynamic profile and a dynamic interest. These two components together constitute the user's dynamic memory, which captures evolving interests based on the immediate instruction and updated profile. This process is formalized as:
$
\mathcal{F}_d^T, X_{DU} = M_e(\mathcal{F}^T \parallel X_{SU} \parallel X_I \parallel X_{IK} \parallel X_{EK} \parallel P_e)
$
Where:
  - $\mathcal{F}_d^T$ is the dynamic profile.
  - $X_{DU}$ represents the dynamic interest.
  - $M_e$ is the large language model acting as the dynamic extractor.
  - Other symbols are as defined previously.
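A compact sketch of the extractor call; the prompt wording and the two-line output convention are assumptions made for illustration.

```python
def extract_dynamic_memory(call_llm, profile, static_memory, instruction,
                           internal_knowledge, external_knowledge):
    """F_d^T, X_DU = M_e(F^T || X_SU || X_I || X_IK || X_EK || P_e) -- a sketch."""
    reply = call_llm(
        f"Profile: {profile}\nHistory: {static_memory}\n"
        f"Instruction: {instruction}\n"
        f"Knowledge: {internal_knowledge}\n{external_knowledge}\n"
        "Return two lines: (1) the DYNAMIC PROFILE relevant to this instruction; "
        "(2) the DYNAMIC INTEREST it implies."
    )
    dynamic_profile, dynamic_interest = reply.split("\n", 1)
    return dynamic_profile, dynamic_interest
```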
4.2.2.3. Reranker (for i2Agent)
With the enhanced dynamic memory, the reranker in i2Agent can make even more personalized reranking decisions.
- Enhanced Input for Reranking: Similar to iAgent's reranker, this component uses an LLM $M_r$. However, its input now also includes the newly generated dynamic profile $\mathcal{F}_d^T$ and dynamic interest $X_{DU}$. The reranking process is expressed as:
$
\mathcal{R}^* = M_r(X_{IK} \parallel X_{EK} \parallel X_{SU} \parallel \mathcal{F}_d^T \parallel X_{DU} \parallel X_{Item} \parallel P_{tr}^*)
$
Where:
  - $\mathcal{R}^*$ is the final re-ranked item list.
  - $P_{tr}^*$ represents the specific prompt template for the reranker in i2Agent.
  - All other symbols are as defined previously.
- Self-reflection: A self-reflection mechanism is also implemented in i2Agent's reranker, identical in function to that in iAgent, to ensure consistency and prevent hallucination. It uses the same inputs as the reranker but with the self-reflection prompt during regeneration if discrepancies are detected.
5. Experimental Setup
5.1. Datasets
The authors constructed four new datasets, collectively named INSTRUCTREC, because no existing dataset included proactive user instructions in the user-agent-platform paradigm. These datasets are derived from well-known public recommendation datasets, augmented with generated user instructions.
- Source Datasets:
- Amazon (Ni et al., 2019): Two subsets were used: "Books" and "Movies and TV." These provide 1-5 star ratings, textual reviews, and metadata (titles, descriptions, categories, pricing).
- Yelp: This dataset contains over 67,000 business reviews (primarily restaurants) from three major English-speaking cities. It includes business metadata (name, location, category, attributes) and user interactions (ratings, reviews).
- Goodreads (Wan et al., 2019): Derived from an online platform for book reviews, it offers user-generated ratings, reviews, and book metadata (ISBNs, title, author, publication year, genre).
- Data Preprocessing:
- Users and items with fewer than 5 associated actions (interactions) were removed to ensure sufficient data density.
- INSTRUCTREC Dataset Construction:
- Instruction Generator:
- Manual annotation of several instruction-review pairs (few-shot examples) was performed.
- A random persona from Persona Hub (Chan et al., 2024) was assigned to each user.
- An LLM (GPT-4-mini is mentioned in the future-directions discussion, but the generation model is not explicitly stated here) was prompted with the few-shot examples, a user's review, and their persona to generate a free-text instruction for each interaction.
- A list of instruction-review pairs was maintained to dynamically update the few-shot examples, allowing the LLM to decide if new instructions should be included.
- Instruction Cleaner:
- To prevent data leakage (where the instruction might implicitly reveal the ground-truth item), an LLM was used to attempt to recover the item from the generated instruction.
- Given an instruction, the LLM was asked to choose between the ground-truth item and a randomly selected negative item and to generate a certainty score.
- Instructions were retained if the LLM could not infer the ground-truth item. An equal number of correctly inferred instructions with low certainty scores were also kept. This ensures instructions are not trivially predictable from the item itself.
The overview of the INSTRUCTREC dataset construction is illustrated in the figure below:
VLM Description: The image is a diagram illustrating the data flow between the instruction generator and instruction cleaner. User reviews are processed by the instruction generator (LLM) to create examples, which are then refined by the instruction cleaner (LLM) to produce instructions.
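As a concrete illustration of the cleaner's leakage check, here is a minimal sketch; the `call_llm` helper, the "choice|certainty" reply format, and the 0.5 threshold are assumptions, not the paper's stated values.

```python
def keep_instruction(call_llm, instruction, true_item, negative_item):
    """Leakage filter sketch: keep the instruction only if an LLM cannot
    confidently recover the ground-truth item from it."""
    reply = call_llm(
        f"Instruction: {instruction}\n"
        f"Item A: {true_item}\nItem B: {negative_item}\n"
        "Which item does the instruction refer to? "
        "Answer as '<A or B>|<certainty between 0 and 1>'."
    )
    choice, certainty = reply.split("|")
    # Retain if the LLM guessed wrong, or guessed right with low certainty.
    return choice.strip() != "A" or float(certainty) < 0.5
```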
The statistics of the constructed INSTRUCTREC datasets are as follows:
The following are the results from Table 1 of the original paper:
| Dataset | \|U\| | \|V\| | \|ε\| | Density | #\|X_I\| | #\|S_U\| |
|---|---|---|---|---|---|---|
| INSTRUCTREC - Amazon Book | 7,377 | 120,925 | 207,759 | 0.023% | 164 | 1276 |
| INSTRUCTREC - Amazon Movietv | 5,649 | 28,987 | 79,737 | 0.049% | 40 | 726 |
| INSTRUCTREC - Goodreads | 11,734 | 57,364 | 618,330 | 0.092% | 41 | 2827 |
| INSTRUCTREC - Yelp | 2,950 | 31,636 | 63,142 | 0.068% | 40 | 1976 |
Where:
- $|U|$ represents the number of unique users.
- $|V|$ represents the number of unique items.
- $|ε|$ represents the number of interactions (user-item pairs).
- Density is the ratio of interactions to the total possible user-item pairs.
- $\#|X_I|$ denotes the average token length of user instructions.
- $\#|S_U|$ represents the average token length of the user's static memory (historical interactions).
5.2. Evaluation Metrics
The paper uses standard top-N ranking metrics and introduces specialized metrics to evaluate the mitigation of the echo chamber effect and popularity bias.
5.2.1. Standard Ranking Metrics
These metrics evaluate the quality of the ranked list of recommendations. For all these metrics, higher values indicate better performance.
- Hit Rate (HR@N):
  - Conceptual Definition: Measures whether the ground-truth item is present in the top-N recommended items. It's a binary measure per user: 1 if the item is there, 0 otherwise.
  - Mathematical Formula:
$
\mathrm{HR@N} = \frac{\text{Number of users for whom the ground-truth item is in top-N}}{\text{Total number of users}}
$
  - Symbol Explanation:
    - N: The number of top recommendations considered.
    - "Number of users...": The count of individual users for whom the target item was successfully recommended within the first N positions.
    - "Total number of users": The total count of users in the evaluation set.
- Normalized Discounted Cumulative Gain (NDCG@N):
  - Conceptual Definition: A position-aware metric that evaluates the relevance of recommended items, giving higher scores to relevant items that appear earlier in the list.
  - Mathematical Formula:
$
\mathrm{NDCG@N} = \frac{\mathrm{DCG@N}}{\mathrm{IDCG@N}}, \quad \mathrm{DCG@N} = \sum_{k=1}^{N} \frac{\mathrm{rel}_k}{\log_2(k+1)}
$
And IDCG@N is the ideal DCG, calculated for a perfect ranking where all relevant items are at the top. For implicit feedback (where relevance is binary, 1 if interacted, 0 otherwise):
$
\mathrm{IDCG@N} = \sum_{k=1}^{\min(N, |R|)} \frac{1}{\log_2(k+1)}
$
(assuming all relevant items are ranked perfectly)
  - Symbol Explanation:
    - N: The number of top recommendations considered.
    - $\mathrm{rel}_k$: The relevance score of the item at position $k$. In implicit feedback scenarios, it is typically 1 if the item is the ground-truth item and 0 otherwise.
    - $\log_2(k+1)$: A logarithmic discount factor, meaning items at higher ranks (smaller $k$) contribute more to the score.
    - DCG@N: Discounted Cumulative Gain at rank N.
    - IDCG@N: Ideal Discounted Cumulative Gain at rank N, representing the maximum possible DCG for a given query.
    - $|R|$: The total number of relevant items for the user (in this case, typically 1 ground-truth item).
- Mean Reciprocal Rank (MRR):
  - Conceptual Definition: Measures the average of the reciprocal ranks of the first relevant item. If the first relevant item is at rank $r$, its reciprocal rank is $1/r$.
  - Mathematical Formula:
$
\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}
$
  - Symbol Explanation:
    - $|Q|$: The total number of queries (users) in the evaluation set.
    - $\mathrm{rank}_i$: The rank position of the first relevant item for the $i$-th query.
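A minimal sketch of these three metrics for the paper's setting of one ground-truth item per user (so per-user IDCG is 1); the toy ranks are illustrative.

```python
import math

def hr_at_n(rank: int, n: int) -> float:
    """Per-user HR@N for a single ground-truth item at 1-indexed `rank`."""
    return 1.0 if rank <= n else 0.0

def ndcg_at_n(rank: int, n: int) -> float:
    """Per-user NDCG@N with one relevant item: IDCG = 1/log2(2) = 1."""
    return 1.0 / math.log2(rank + 1) if rank <= n else 0.0

def mrr(ranks: list[int]) -> float:
    """Mean reciprocal rank over all users' first-relevant-item ranks."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Example: ground-truth items ranked at positions 1, 3, and 7 for three users.
ranks = [1, 3, 7]
print(sum(hr_at_n(r, 3) for r in ranks) / len(ranks))    # HR@3 ≈ 0.667
print(sum(ndcg_at_n(r, 3) for r in ranks) / len(ranks))  # NDCG@3
print(mrr(ranks))                                        # (1 + 1/3 + 1/7) / 3
```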
5.2.2. Echo Chamber Effect and Popularity Bias Metrics
These metrics are designed to specifically assess how well the proposed method mitigates the echo chamber effect and avoids simply recommending popular items. Higher values generally indicate better mitigation.
- Filtered Ads Rate (FR@k):
  - Conceptual Definition: Measures whether Ads items (irrelevant items from other domains, simulating advertisements) are ranked below a certain position $k$. A high FR@k means ads are successfully demoted.
  - Mathematical Formula:
$
\mathrm{FR@k} = \begin{cases} 1, & \text{if } r_{Ads} > k, \\ 0, & \text{if } r_{Ads} \leq k. \end{cases}
$
  - Symbol Explanation:
    - $k$: The threshold rank position.
    - $r_{Ads}$: The position of an Ads item in the re-ranked list. If the Ads item is successfully pushed beyond rank $k$, it contributes 1 to the FR@k score; the average across all users is then reported. Ads items are randomly selected from a different data domain (e.g., from InstructRec - Amazon Movietv for InstructRec - Amazon Book). They are randomly inserted into the candidate list to mitigate position bias in LLMs.
- Popularity-Weighted Ranking Metrics (P-HR@N, P-MRR, P-NDCG@N):
  - Conceptual Definition: These metrics adjust standard ranking metrics by penalizing recommendations of very popular items. This helps assess whether the system recommends diverse items, including less popular ones, rather than just relying on popularity.
  - Mathematical Formula for Popularity Weighting:
$
\mathrm{P\text{-}Rank} = (1 - \sigma(\mathrm{freq}_i)) \cdot \mathrm{Rank}.
$
  - Symbol Explanation:
    - $\mathrm{P\text{-}Rank}$: The popularity-weighted version of a standard ranking metric (e.g., P-HR@3, P-MRR, P-NDCG@3).
    - $\mathrm{Rank}$: The standard ranking metric (e.g., HR, MRR, NDCG).
    - $\mathrm{freq}_i$: The frequency of item $i$ in the dataset, representing its popularity.
    - $\sigma$: The sigmoid function, $\sigma(x) = \frac{1}{1 + e^{-x}}$, which maps any real value to a value between 0 and 1. This ensures the popularity weight $(1 - \sigma(\mathrm{freq}_i))$ lies between 0 and 1. If an item is very popular (high $\mathrm{freq}_i$), $\sigma(\mathrm{freq}_i)$ will be close to 1, making the weight close to 0 and heavily penalizing the rank score. Conversely, for less popular items, $\sigma(\mathrm{freq}_i)$ will be smaller, leading to a higher popularity weight and less penalization.
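Both metrics are simple to compute; a minimal sketch follows. How $\mathrm{freq}_i$ is scaled (raw count vs. normalized frequency) is not specified in this summary and is an assumption here.

```python
import math

def fr_at_k(ad_rank: int, k: int) -> float:
    """Per-user FR@k: 1 if the injected ad sits below position k, else 0."""
    return 1.0 if ad_rank > k else 0.0

def popularity_weighted(rank_score: float, item_freq: float) -> float:
    """P-Rank = (1 - sigmoid(freq_i)) * Rank."""
    sigmoid = 1.0 / (1.0 + math.exp(-item_freq))
    return (1.0 - sigmoid) * rank_score

# Example: an ad reranked to position 5 passes FR@3; a hit on a rare item
# (freq near 0) keeps about half its weight, a very popular one keeps little.
print(fr_at_k(5, 3))                  # 1.0
print(popularity_weighted(1.0, 0.1))  # ≈ 0.475
print(popularity_weighted(1.0, 6.0))  # ≈ 0.0025
```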
5.2.3. Additional Analyses
- Active vs. Less-Active Users: Users are categorized based on their activity (top 20% most active, remaining 80% less active) to assess performance disparities.
- Reranking Ratio: The probability of changes in the top-ranked items after reranking (e.g., top 1, 3, 5 positions). This indicates how frequently the agent intervenes.
- Hallucination Rate: Measured by assessing the occurrence rate of LLM hallucinations in the reranking list, used especially to evaluate the effectiveness of the self-reflection mechanism.
5.3. Baselines
The proposed methods (iAgent, i2Agent) are compared against three classes of baselines:
5.3.1. Sequential Recommendation Methods
These are traditional models that focus on predicting the next item based on a user's interaction sequence, without explicit instructions.
- GRU4Rec (Hidasi et al., 2015): A session-based recommendation model using Gated Recurrent Units (GRUs) to capture short-term user interests.
- BERT4Rec (Sun et al., 2019): Applies the BERT architecture to sequential recommendation, using a bidirectional encoder to model user behavior sequences with a masked item prediction objective.
- SASRec (Kang and McAuley, 2018): A self-attention-based model designed to capture long-term user interests by identifying relevant items in a user's history using an attention mechanism.
5.3.2. Instruction-aware Methods
These baselines use text matching or LLM capabilities to incorporate textual instructions or queries. For these methods, the concatenated text of the instruction serves as the query, and candidate items' metadata (title, description) are treated as documents.
- BM25 (Robertson et al., 2009): A probabilistic ranking function widely used in information retrieval. It measures the similarity between a query (instruction) and a document (item) based on term frequency and inverse document frequency.
- BGE-Rerank (Xiao et al., 2023): A cross-encoder model that processes both the query and document together to generate a relevance score. It captures fine-grained interactions for higher accuracy in reordering candidate documents.
- EasyRec (Ren and Huang, 2024): A lightweight LLM-based recommendation system that uses contrastive learning to align semantic representations from textual data with collaborative filtering signals. It employs a bi-encoder architecture.
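To illustrate the BM25 baseline setup described above (instruction as query, item metadata as documents), here is a minimal sketch assuming the third-party `rank_bm25` package; the corpus and query are toy examples.

```python
from rank_bm25 import BM25Okapi

docs = [
    "feel-good novel about small-town friendships",
    "dark psychological thriller set in winter",
    "running shoes and athletic fashion guide",
]
bm25 = BM25Okapi([d.split() for d in docs])        # tokenize by whitespace
query = "feel-good books that offer an escape".split()
scores = bm25.get_scores(query)                    # one relevance score per document
ranking = sorted(range(len(docs)), key=lambda i: -scores[i])
print(ranking)                                     # document indices reranked by BM25
```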
5.3.3. Recommendation Agents
These are recent LLM-based agents designed for recommendation tasks.
- ToolRec (Zhao et al., 2024): Uses LLMs as surrogate users to enhance recommendations by leveraging external attribute-oriented tools (e.g., rank and retrieval tools) to explore and refine item suggestions. The self-reflection mechanism from the current paper was added to ToolRec for a fairer comparison regarding hallucination.
- AgentCF (Zhang et al., 2024b): Constructs LLM-powered user and item agents with memory modules to simulate user-item interactions and improve modeling via a collaborative reflection mechanism. For fair comparison, the number of memory-building rounds was set to 1, and the self-reflection mechanism was also equipped.
6. Results & Analysis
The empirical evaluation aims to answer four research questions (RQs):
- RQ1: How does the performance of iAgent and i2Agent compare to state-of-the-art baselines across various datasets?
- RQ2: Can our method mitigate the echo chamber effect?
- RQ3: How well does our method perform for both active and less-active user groups?
- RQ4: Are the proposed reranker and self-reflection mechanism effective in practice?
6.1. Core Results Analysis
The main results, comparing iAgent and i2Agent against various baselines, are presented in Tables 2 and 3 for the INSTRUCTREC datasets.
The following are the results from Table 2 of the original paper:
(Columns 2-5: InstructRec - Amazon Book; columns 6-9: InstructRec - Amazon Movietv)

| Model | HR@1 | HR@3 | NDCG@3 | MRR | HR@1 | HR@3 | NDCG@3 | MRR |
|---|---|---|---|---|---|---|---|---|
| GRU4Rec | 11.00 | 31.41 | 22.53 | 30.10 | 15.80 | 36.85 | 27.63 | 34.36 |
| BERT4Rec | 11.48 | 30.90 | 22.32 | 30.31 | 14.74 | 35.13 | 26.36 | 33.43 |
| SASRec | 11.08 | 31.34 | 22.42 | 30.15 | 34.52 | 49.71 | 43.18 | 48.06 |
| BM25 | 9.92 | 24.48 | 18.21 | 27.00 | 11.29 | 30.27 | 22.09 | 30.04 |
| BGE-Rerank | 25.36 | 45.90 | 37.11 | 42.84 | 25.44 | 47.48 | 38.02 | 43.28 |
| EasyRec | 30.70 | 48.87 | 41.09 | 46.14 | 34.96 | 61.30 | 50.15 | 52.98 |
| ToolRec | 10.56 | 30.60 | 21.88 | 29.77 | 13.84 | 35.67 | 26.20 | 33.21 |
| AgentCF | 14.24 | 34.16 | 25.55 | 32.77 | 25.90 | 49.82 | 39.64 | 44.23 |
| iAgent | 31.89 | 48.99 | 41.69 | 47.23 | 38.19 | 56.87 | 48.93 | 53.04 |
| i2Agent | 35.11 | 53.51 | 45.64 | 50.28 | 46.43 | 65.77 | 57.67 | 60.43 |
The following are the results from Table 3 of the original paper:
(Columns 2-5: InstructRec - Goodreads; columns 6-9: InstructRec - Yelp)

| Model | HR@1 | HR@3 | NDCG@3 | MRR | HR@1 | HR@3 | NDCG@3 | MRR |
|---|---|---|---|---|---|---|---|---|
| GRU4Rec | 15.36 | 39.52 | 29.08 | 35.41 | 10.94 | 30.67 | 21.88 | 29.70 |
| BERT4Rec | 12.70 | 34.69 | 25.02 | 32.32 | 10.99 | 31.02 | 22.32 | 30.05 |
| SASRec | 18.52 | 41.24 | 31.47 | 37.60 | 12.59 | 31.09 | 22.65 | 30.15 |
| BM25 | 14.25 | 40.34 | 29.01 | 35.40 | 12.85 | 33.08 | 24.34 | 31.85 |
| BGE-Rerank | 17.26 | 40.82 | 30.60 | 36.97 | 33.05 | 55.29 | 45.70 | 49.90 |
| EasyRec | 13.94 | 35.38 | 26.11 | 33.27 | 32.41 | 56.31 | 46.04 | 49.86 |
| ToolRec | 19.06 | 42.79 | 32.61 | 38.44 | 12.07 | 30.92 | 22.83 | 30.21 |
| AgentCF | 21.61 | 46.09 | 35.60 | 40.96 | 13.36 | 34.83 | 25.66 | 32.61 |
| iAgent | 23.56 | 47.01 | 36.98 | 42.19 | 37.40 | 56.33 | 48.28 | 52.42 |
| i2Agent | 30.97 | 56.69 | 45.76 | 49.14 | 39.22 | 57.92 | 49.96 | 53.78 |
Analysis:
- Superiority of i2Agent: i2Agent consistently achieves the best performance across all four datasets and all standard ranking metrics (HR@1, HR@3, NDCG@3, MRR). It significantly outperforms the second-best baseline, EasyRec, with an average improvement of 16.6%. This strong performance validates the effectiveness of its design, particularly the dynamic memory mechanism that incorporates individual feedback and evolving interests.
- Effectiveness of iAgent: Even the simpler iAgent (without dynamic memory) shows strong results, often ranking third or fourth, and notably outperforming traditional sequential models and other recommendation agents like ToolRec and AgentCF. This highlights the benefits of its instruction-aware parser and reranker in understanding user intentions and leveraging external knowledge.
- Instruction-aware vs. Sequential: Instruction-aware baselines (BGE-Rerank, EasyRec) generally outperform traditional sequential recommendation methods (GRU4Rec, BERT4Rec, SASRec). This suggests that incorporating explicit user instructions provides valuable signals that improve recommendation quality.
- LLM-based Baselines: EasyRec, which is pre-trained on Amazon datasets and aligns collaborative filtering with natural language information, performs very well, often being the second-best model. This indicates the power of combining LLMs with traditional recommendation techniques. ToolRec and AgentCF perform better than sequential baselines but are notably surpassed by iAgent and i2Agent, suggesting that their agentic designs are less effective at capturing individual user-specific instructions and feedback than the proposed models.
6.2. Data Presentation (Tables)
The core results are presented above in Tables 2 and 3. The analysis of echo chamber effects and active/less-active users is presented in the following subsections.
6.3. Ablation Studies / Parameter Analysis
The paper includes analyses on the echo chamber effect, performance for active and less-active users, and the impact of the self-reflection mechanism and reranking ratio, which can be considered forms of ablation or component analysis.
6.3.1. Echo Chamber Effect (RQ2)
The echo chamber effect is evaluated using FR@k (Filtered Ads Rate) and P-Rank (Popularity-weighted ranking metrics). Higher values for these metrics indicate better mitigation of the echo chamber effect and popularity bias.
The following are the results from Table 4 of the original paper:
(Columns 2-5: InstructRec - Amazon Book; columns 6-9: InstructRec - Yelp)

| Model | FR@1 | FR@3 | P-HR@3 | P-MRR | FR@1 | FR@3 | P-HR@3 | P-MRR |
|---|---|---|---|---|---|---|---|---|
| EasyRec | 68.41 | 64.32 | 59.28 | 56.09 | 76.45 | 66.50 | 61.05 | 56.85 |
| ToolRec | 70.13 | 66.61 | 36.74 | 35.80 | 72.64 | 63.64 | 32.50 | 32.73 |
| AgentCF | 58.02 | 50.04 | 41.10 | 39.42 | 71.30 | 64.15 | 38.46 | 36.44 |
| iAgent | 71.98 | 67.82 | 59.51 | 57.32 | 78.24 | 69.71 | 62.74 | 58.76 |
| i2Agent | 77.15 | 70.15 | 64.70 | 60.87 | 87.69 | 84.20 | 64.48 | 60.20 |
The following are the results from Table 7 of the original paper:
(All columns: InstructRec - Amazon Book)

| Model | FR@1 | FR@3 | FR@5 | FR@10 | P-HR@1 | P-HR@3 | P-NDCG@3 | P-MRR |
|---|---|---|---|---|---|---|---|---|
| EasyRec | 68.41 | 64.32 | 60.30 | 0.03 | 37.60 | 59.28 | 50.00 | 56.09 |
| ToolRec | 70.13 | 66.61 | 62.41 | 0.00 | 12.63 | 36.74 | 26.24 | 35.80 |
| AgentCF | 58.02 | 50.04 | 41.32 | 0.06 | 17.00 | 41.10 | 30.68 | 39.42 |
| iAgent | 71.98 | 67.82 | 60.74 | 0.08 | 38.85 | 59.51 | 50.70 | 57.32 |
| i2Agent | 77.15 | 70.15 | 64.05 | 0.09 | 42.62 | 64.70 | 55.25 | 60.87 |
The following are the results from Table 8 of the original paper:
(Columns 2-5: InstructRec - Amazon Movietv; columns 6-9: InstructRec - Goodreads)

| Model | P-HR@1 | P-HR@3 | P-NDCG@3 | P-MRR | P-HR@1 | P-HR@3 | P-NDCG@3 | P-MRR |
|---|---|---|---|---|---|---|---|---|
| EasyRec | 37.31 | 65.45 | 53.54 | 56.69 | 14.22 | 35.98 | 26.56 | 33.84 |
| ToolRec | 14.73 | 38.12 | 27.96 | 35.57 | 19.21 | 43.22 | 32.92 | 38.88 |
| AgentCF | 27.61 | 53.33 | 42.37 | 47.37 | 21.82 | 46.62 | 35.99 | 41.47 |
| iAgent | 40.50 | 60.71 | 52.11 | 56.61 | 23.75 | 47.50 | 37.34 | 42.68 |
| i2Agent | 49.51 | 70.47 | 61.67 | 64.69 | 31.22 | 57.33 | 46.23 | 49.71 |
The following are the results from Table 9 of the original paper:
(All columns: InstructRec - Yelp)

| Model | FR@1 | FR@3 | FR@5 | FR@10 | P-HR@1 | P-HR@3 | P-NDCG@3 | P-MRR |
|---|---|---|---|---|---|---|---|---|
| EasyRec | 76.45 | 66.50 | 57.16 | 0.05 | 37.18 | 61.05 | 52.51 | 56.85 |
| ToolRec | 72.64 | 63.64 | 53.29 | 0.00 | 12.40 | 32.50 | 23.88 | 32.73 |
| AgentCF | 71.30 | 64.15 | 52.01 | 0.02 | 14.73 | 38.46 | 28.33 | 36.44 |
| iAgent | 78.24 | 69.71 | 56.17 | 0.12 | 41.74 | 62.74 | 53.82 | 58.76 |
| i2Agent | 87.69 | 86.20 | 84.00 | 0.16 | 43.67 | 64.48 | 55.62 | 60.20 |
Analysis:
- Ad Filtering (FR@k): i2Agent demonstrates superior ability to filter out unwanted Ads items. For InstructRec - Amazon Book, its FR@1 is 77.15%, and for InstructRec - Yelp it reaches 87.69% (FR@3 of 84.20%). This indicates that i2Agent can accurately interpret user instructions and identify irrelevant or undesired items, pushing them further down the ranking list or out of the top positions. This is a direct measure of its function as a "protective shield."
- Mitigation of Popularity Bias (P-HR@N, P-MRR, P-NDCG@N): i2Agent consistently achieves the highest scores across the popularity-weighted metrics (P-HR@3, P-MRR, P-NDCG@3), often by a significant margin compared to other baselines, especially ToolRec and AgentCF. This means i2Agent is not simply recommending popular items but is providing more diverse recommendations, including less popular items that are still relevant to the user's specific instructions and dynamically learned interests.
- Overall Conclusion for RQ2: The results strongly suggest that i2Agent is highly effective in mitigating the echo chamber effect and popularity bias, validating its role as a user shield by providing more diversified and instruction-aligned recommendations.
6.3.2. Protect Less-Active Users (RQ3)
The performance for active (top 20%) and less-active (remaining 80%) users is analyzed to assess the agent's ability to provide personalization regardless of user activity level.
The following are the results from Table 5 of the original paper:
(Columns 2-5: Less-Active Users; columns 6-9: Active Users)

| Model | HR@1 | HR@3 | NDCG@3 | MRR | HR@1 | HR@3 | NDCG@3 | MRR |
|---|---|---|---|---|---|---|---|---|
| EasyRec | 32.93 | 51.07 | 43.32 | 48.04 | 28.71 | 47.64 | 39.53 | 44.61 |
| ToolRec | 10.57 | 30.86 | 22.01 | 29.88 | 10.04 | 31.73 | 22.32 | 29.54 |
| AgentCF | 14.79 | 35.00 | 26.26 | 33.35 | 14.87 | 34.37 | 25.93 | 33.24 |
| iAgent | 34.07 | 50.79 | 43.67 | 49.00 | 29.96 | 47.73 | 40.14 | 45.71 |
| i2Agent | 37.92 | 55.75 | 47.84 | 52.11 | 33.27 | 51.74 | 43.81 | 48.67 |
The following are the results from Table 10 of the original paper:
(Columns 2-5: Less-Active Users; columns 6-9: Active Users)

| Model | HR@1 | HR@3 | NDCG@3 | MRR | HR@1 | HR@3 | NDCG@3 | MRR |
|---|---|---|---|---|---|---|---|---|
| EasyRec | 35.17 | 61.56 | 50.39 | 53.21 | 35.47 | 63.15 | 51.26 | 53.64 |
| ToolRec | 14.43 | 36.56 | 26.96 | 33.81 | 12.98 | 32.18 | 23.94 | 31.79 |
| AgentCF | 27.38 | 50.98 | 40.91 | 45.36 | 21.84 | 45.58 | 35.57 | 40.76 |
| iAgent | 39.36 | 57.85 | 49.98 | 53.96 | 34.95 | 55.19 | 46.88 | 51.02 |
| i2Agent | 47.32 | 66.64 | 58.57 | 61.22 | 44.71 | 64.99 | 56.60 | 59.30 |
The following are the results from Table 11 of the original paper:
(Columns 2-5: Less-Active Users; columns 6-9: Active Users)

| Model | HR@1 | HR@3 | NDCG@3 | MRR | HR@1 | HR@3 | NDCG@3 | MRR |
|---|---|---|---|---|---|---|---|---|
| EasyRec | 14.44 | 35.77 | 26.55 | 33.67 | 14.13 | 36.86 | 27.09 | 33.86 |
| ToolRec | 19.85 | 43.34 | 33.29 | 39.11 | 17.89 | 42.02 | 31.63 | 37.35 |
| AgentCF | 22.91 | 46.67 | 36.50 | 41.89 | 19.82 | 46.70 | 35.22 | 40.10 |
| iAgent | 24.57 | 48.12 | 38.00 | 43.04 | 22.62 | 46.96 | 36.64 | 41.70 |
| i2Agent | 32.67 | 58.08 | 47.28 | 50.46 | 29.76 | 55.39 | 44.56 | 48.19 |
The following are the results from Table 12 of the original paper:
(Columns 2-5: Less-Active Users; columns 6-9: Active Users)

| Model | HR@1 | HR@3 | NDCG@3 | MRR | HR@1 | HR@3 | NDCG@3 | MRR |
|---|---|---|---|---|---|---|---|---|
| EasyRec | 32.83 | 56.50 | 46.29 | 50.13 | 30.17 | 50.87 | 42.03 | 47.16 |
| ToolRec | 11.79 | 31.21 | 22.88 | 30.14 | 14.21 | 32.42 | 24.66 | 32.11 |
| AgentCF | 13.11 | 34.72 | 25.50 | 32.46 | 13.22 | 36.41 | 26.45 | 32.89 |
| iAgent | 37.80 | 56.17 | 48.37 | 52.70 | 39.40 | 59.10 | 50.62 | 53.90 |
| i2Agent | 39.02 | 58.49 | 50.23 | 53.88 | 43.25 | 57.75 | 51.48 | 56.05 |
Analysis:
- Performance for Less-Active Users: i2Agent consistently shows the highest performance for less-active users across all metrics and datasets. For example, on InstructRec-Amazon Book, i2Agent achieves an MRR of 52.11% for less-active users, significantly higher than EasyRec's 48.04%. This is a crucial finding, as less-active users are typically disadvantaged in traditional recommender systems due to limited interaction data. The success of i2Agent here is attributed to its dynamic memory mechanism, which builds individual profiles from each user's own feedback rather than being dominated by the aggregate behavior of more active users.
- Performance for Active Users: i2Agent also performs strongly for active users, often surpassing the other baselines. However, the paper notes that for active users with very long text sequences (extensive interaction histories), LLM performance can decline due to context window limitations (Liu et al., 2024b), which may explain why the gain over iAgent or EasyRec is sometimes less pronounced for active users than for less-active ones.
- Overall Conclusion for RQ3: i2Agent effectively enhances personalization for both active and less-active user groups. Its ability to provide robust recommendations for less-active users, by forming dedicated individual profiles, directly addresses a critical fairness issue in recommender systems.
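For reference, a minimal sketch of the ranking metrics reported in these tables, assuming a single ground-truth item per test case (consistent with the single-ground-truth feedback setting the paper describes); the function name is illustrative:

```python
import math

def ranking_metrics(ranked_items, ground_truth, ks=(1, 3)):
    """HR@K, NDCG@K, and MRR for one test case with one relevant item."""
    rank = None
    if ground_truth in ranked_items:
        rank = ranked_items.index(ground_truth) + 1  # 1-based position
    metrics = {}
    for k in ks:
        hit = rank is not None and rank <= k
        metrics[f"HR@{k}"] = 1.0 if hit else 0.0
        # With a single relevant item, NDCG@K reduces to 1 / log2(rank + 1).
        metrics[f"NDCG@{k}"] = 1.0 / math.log2(rank + 1) if hit else 0.0
    metrics["MRR"] = 1.0 / rank if rank is not None else 0.0
    return metrics
```

The table values would then be these per-case metrics averaged over all test cases within each user group.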
6.3.3. Model Study (RQ4)
This section investigates the effectiveness of the self-reflection mechanism and the impact of the reranker on the ranking list.
The following figure (Figure 4 from the original paper) shows the hallucination rate with and without the self-reflection mechanism, and the probability of changes in the ranking list after reranking:
VLM Description: The image is a chart showing hallucination rates with and without the self-reflection mechanism for the different agent models (first row) and the probability of changes in the ranking list after reranking (second row), with comparisons on the Amazon Books, Goodreads, and Yelp datasets.
Analysis of Self-reflection Mechanism:
- The top row of Figure 4 illustrates the hallucination rate (reported as "Error Rate") with and without the self-reflection mechanism across datasets for iAgent and i2Agent, as well as for the baselines ToolRec and AgentCF (which were also equipped with self-reflection for a fair comparison).
- The results clearly show that the self-reflection mechanism dramatically reduces the hallucination rate for all models. For instance, without self-reflection, the error rate can be very high (e.g., around 10-20% for iAgent). With self-reflection, the error rate drops by at least 20-fold, becoming negligible (close to 0% for ToolRec and AgentCF, and very low for iAgent and i2Agent).
- The paper notes that i2Agent may still exhibit a slightly higher error rate than the others even with self-reflection. This is attributed to the longer text sequences (dynamic memory, profiles, instructions, static memory) that i2Agent processes, which can challenge LLM performance due to context-length limitations.
- Conclusion for Self-reflection: The self-reflection mechanism is highly effective in mitigating LLM-induced hallucinations, ensuring that the reranked list contains only items from the original candidate set and thus improving the reliability and consistency of the agent's output.
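The paper does not publish the mechanism's implementation, but the behavior described here (accept the reranked list only if it is a permutation of the candidate set, otherwise re-prompt) can be sketched as follows; `llm_rerank`, its signature, and the retry budget are all assumptions:

```python
def rerank_with_reflection(llm_rerank, candidates, context, max_retries=3):
    """Wrap an LLM reranker with a self-reflection check: the output is
    accepted only if it contains exactly the original candidate items."""
    feedback = None
    for _ in range(max_retries):
        ranked = llm_rerank(candidates, context, feedback)  # assumed interface
        hallucinated = [i for i in ranked if i not in set(candidates)]
        missing = [i for i in candidates if i not in set(ranked)]
        if not hallucinated and not missing:
            return ranked  # passes the reflection check
        feedback = (f"Invalid output: items outside the candidate set "
                    f"{hallucinated}, dropped candidates {missing}. "
                    "Rerank using exactly the original candidates.")
    return list(candidates)  # fall back to the platform's original order
```

Under this reading, the "Error Rate" in Figure 4 would correspond to the fraction of reranking calls whose raw output fails the containment check.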
Analysis of Reranking Ratio:
- The bottom row of Figure 4 shows the "Reranking Ratio," which measures the probability that the elements in the top-K (specifically K=1, 3, 5) positions of the ranking list change after reranking.
- The results indicate that changes occur almost every time during reranking for all models employing a reranker (iAgent, i2Agent, ToolRec, AgentCF, EasyRec). For example, at top-1 the reranking ratio is consistently above 90% across datasets, and it remains very high at top-3 and top-5.
- Conclusion for Reranking Ratio: This demonstrates that the agent is not merely passing through the initial recommendations but is actively and consistently performing personalized reranking based on user instructions and learned interests. The high reranking ratio confirms that the agent fulfills its purpose of modifying the platform's initial list to better suit individual user needs.
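A sketch of how the reranking ratio could be computed from paired before/after lists; whether a "change" counts reordering within the top-K or only membership changes is not fully specified, so this sketch counts any difference in the top-K prefix:

```python
def reranking_ratio(original_lists, reranked_lists, k):
    """Fraction of test cases whose top-K prefix differs after reranking."""
    changed = sum(before[:k] != after[:k]
                  for before, after in zip(original_lists, reranked_lists))
    return changed / len(original_lists)
```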
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces a novel user-agent-platform paradigm for recommender systems, addressing critical user vulnerabilities in the traditional user-platform model. The core innovation is the iAgent and its advanced version, i2Agent, which act as intelligent, personalized LLM-based shields between users and recommendation algorithms.
The key contributions include:
- New Paradigm: Shifting from direct user-platform interaction to agent-mediated indirect exposure, prioritizing user interests.
- INSTRUCTREC Datasets: Creation of unique datasets with user-driven free-text instructions, enabling research into instruction-aware recommendation.
- iAgent: A foundational agent capable of understanding flexible user instructions, leveraging internal and external knowledge (via tools) to rerank items.
- i2Agent: An enhanced agent incorporating a dynamic memory mechanism (profile generator and dynamic extractor) to learn from individual user feedback and adapt to evolving interests, ensuring optimization specific to each user.
- Empirical Superiority: i2Agent consistently outperforms state-of-the-art baselines, achieving a substantial 16.6% average improvement across ranking metrics on the INSTRUCTREC datasets.
- Mitigation of Biases: The proposed agents effectively mitigate the echo chamber effect (by filtering ads and promoting diversity) and alleviate model bias against less-active users, providing robust personalization.
- Self-reflection: A self-reflection mechanism significantly reduces LLM hallucination, enhancing the reliability of the reranking process.

In essence, i2Agent successfully serves as a protective and empowering shield for users, delivering more personalized, diverse, and fair recommendations that align with explicit user instructions and dynamic individual preferences.
7.2. Limitations & Future Work
The authors acknowledge several limitations of their current work and propose future research directions:
7.2.1. Limitations
- Language Dependency: The current implementation primarily focuses on English instructions, and its effectiveness across different languages remains to be explored. This suggests a potential language bias in its current form.
- Nuance of User Satisfaction: While evaluation metrics show improvements in recommendation quality, they may not fully capture the nuanced aspects of user satisfaction and long-term engagement. Standard metrics might not perfectly reflect a user's subjective experience or happiness with the recommendations over time.
7.2.2. Future Work
- More Effective Reranker: The current reranker is a zero-shot LLM. Future work could fine-tune smaller, open-source LLMs (e.g., Phi-3, Gemma) on the INSTRUCTREC dataset to build a more efficient and effective reranker. Additionally, existing advanced recommendation models could serve as tools for the agent to retrieve candidate items more effectively.
- Multi-step Feedback: The current feedback mechanism is limited to a single ground-truth item with insufficient feedback explanations. In a real-world deployment, collecting continuous, multi-step feedback on user-agent interactions, along with detailed user explanations, could enable the development of more interpretable and adaptive agents.
- Mutual Learning: There is potential for mutual learning between user-side agents (i2Agent) and platform-side recommendation models. i2Agent could provide feedback and explanations to platform models, helping them improve their performance; conversely, existing RecAgents could iteratively improve through collaboration with i2Agent. i2Agent could also serve as a sophisticated reward function for reinforcement learning (RL)-based recommendation models, guiding them toward more user-centric optimization.
7.3. Personal Insights & Critique
This paper presents a highly relevant and forward-thinking approach to recommender systems, particularly in the age of powerful LLMs. The shift to a user-agent-platform paradigm is a significant conceptual leap, moving beyond the platform's commercial interests to genuinely advocate for the user.
Inspirations:
- User Empowerment: The idea of an agent as a "protective shield" is compelling. It offers a tangible solution to common user frustrations with recommender systems, such as feeling manipulated or being stuck in filter bubbles. This user-centric philosophy could become a cornerstone for future recommendation research.
- Dynamic Personalization: The dynamic memory mechanism of i2Agent, which learns from individual feedback and extracts evolving interests, is a powerful concept. It highlights how LLMs can go beyond static profiles to capture the fluid nature of human preferences, which is especially inspiring for personalized learning environments and adaptive interfaces.
- Bridging LLMs and Traditional RS: The paper cleverly integrates LLMs for instruction understanding and reranking while still leveraging the output of traditional recommender systems, demonstrating a practical path for LLMs to enhance existing infrastructure rather than replace it.
- Addressing Fairness: The explicit focus on less-active users and echo chamber mitigation shows a strong ethical consideration, pushing the boundaries of what recommender systems can achieve in terms of fairness and diversity.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Computational Cost & Latency: LLM inference, especially across multiple components (parser, generator, extractor, reranker, self-reflection) plus potential external tool calls, can be computationally expensive and introduce significant latency. This could be a major barrier for real-time recommendation, especially with a personal agent running for each user. The paper's proposal to fine-tune smaller LLMs as future work is a good direction for efficiency.
- Scalability for Billions of Users: While i2Agent is optimized for individual users, deploying a separate, stateful LLM agent for each of potentially billions of users is currently economically and computationally prohibitive. The "individual" aspect needs to be balanced against scalability; a shared foundation model with personalized adapters or efficient per-user memory storage could be explored.
- Instruction Quality & Ambiguity: The approach's effectiveness relies heavily on the quality and clarity of user instructions. While LLMs are good at understanding natural language, ambiguous or contradictory instructions could lead to suboptimal recommendations. How the agent handles user frustration with its own interpretations is also an important user-experience question that is not fully explored.
- Ethical Implications of Agent Autonomy: Giving an agent "control" over recommendations raises questions about its autonomy. How transparent is the agent to the user? Can the user fully override or inspect the agent's decisions? What if the agent's interpretation deviates significantly from the user's intent? The self-reflection mechanism addresses internal consistency, but not necessarily alignment with subjective user satisfaction.
- Robustness to Adversarial Instructions: Could users or malicious actors craft instructions to manipulate the agent, potentially for personal gain or to bypass platform controls?
- Domain-Specificity of External Tools: The reliance on external tools (such as Google Search) introduces dependencies and potential biases from those tools, and effective external knowledge acquisition may be challenging for niche domains.

Overall, iAgent and i2Agent represent a compelling vision for the future of recommender systems, offering a powerful blueprint for user-centric and ethical AI in recommendation. The technical challenges, though significant, open up exciting avenues for future research.