
Towards Next-Generation Recommender Systems: A Benchmark for Personalized Recommendation Assistant with LLMs

Published: March 12, 2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper introduces RecBench+, a benchmark dataset for assessing LLMs on complex personalized recommendation tasks, and finds that while LLMs show preliminary capabilities as recommendation assistants, they struggle with queries that require reasoning or contain misleading information.

Abstract

Recommender systems (RecSys) are widely used across various modern digital platforms and have garnered significant attention. Traditional recommender systems usually focus only on fixed and simple recommendation scenarios, making it difficult to generalize to new and unseen recommendation tasks in an interactive paradigm. Recently, the advancement of large language models (LLMs) has revolutionized the foundational architecture of RecSys, driving their evolution into more intelligent and interactive personalized recommendation assistants. However, most existing studies rely on fixed task-specific prompt templates to generate recommendations and evaluate the performance of personalized assistants, which limits the comprehensive assessments of their capabilities. This is because commonly used datasets lack high-quality textual user queries that reflect real-world recommendation scenarios, making them unsuitable for evaluating LLM-based personalized recommendation assistants. To address this gap, we introduce RecBench+, a new dataset benchmark designed to assess LLMs' ability to handle intricate user recommendation needs in the era of LLMs. RecBench+ encompasses a diverse set of queries that span both hard conditions and soft preferences, with varying difficulty levels. We evaluated commonly used LLMs on RecBench+ and uncovered the following findings: 1) LLMs demonstrate preliminary abilities to act as recommendation assistants, 2) LLMs are better at handling queries with explicitly stated conditions, while facing challenges with queries that require reasoning or contain misleading information. Our dataset has been released at https://github.com/jiani-huang/RecBench.git.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Towards Next-Generation Recommender Systems: A Benchmark for Personalized Recommendation Assistant with LLMs

1.2. Authors

1.3. Journal/Conference

The paper is published by ACM. While the specific conference is not identified in the provided metadata, ACM is a highly reputable organization for computing research, and its conferences and journals are generally well-regarded and influential in the field of computer science, including recommender systems.

1.4. Publication Year

2025 (Published at UTC: 2025-03-12T13:28:23.000Z)

1.5. Abstract

Traditional recommender systems are typically limited to fixed and simple recommendation scenarios, struggling to adapt to new and interactive recommendation tasks. The recent advancements in large language models (LLMs) have led to their integration into recommender systems, transforming them into more intelligent and interactive personalized recommendation assistants. However, current evaluation methods for these LLM-based assistants often rely on fixed, task-specific prompt templates and datasets lacking real-world textual user queries, hindering a comprehensive assessment of their capabilities.

To address this gap, this paper introduces RecBench+, a novel dataset benchmark designed to evaluate LLMs' ability to handle complex user recommendation needs. RecBench+ features approximately 30,000 high-quality, diverse user queries encompassing both hard conditions (explicit, implicit, misinformed) and soft preferences (interest-based, demographics-based), with varying difficulty levels.

The authors evaluated several commonly used LLMs on RecBench+ and derived key findings: 1) LLMs demonstrate preliminary capabilities as recommendation assistants; 2) LLMs perform better with queries that have explicitly stated conditions but face challenges with queries requiring reasoning or containing misleading information. The dataset has been open-sourced for further research.

https://arxiv.org/abs/2503.09382v1 (preprint status indicated by the v1 version and the arXiv link). PDF Link: https://arxiv.org/pdf/2503.09382v1.pdf

2. Executive Summary

2.1. Background & Motivation

Traditional recommender systems (RecSys) are foundational to modern digital platforms but are limited in their ability to generalize to new, interactive, and complex recommendation scenarios. They typically handle fixed tasks like "Customers Who Viewed This Also Viewed" or "Based on Your Browsing History" and struggle with nuanced natural language queries from users, such as "a durable laptop for graphic design under $1500".

The emergence of large language models (LLMs) has introduced a new paradigm, allowing RecSys to evolve into personalized recommendation assistants that can interact conversationally and understand complex user requests. However, the evaluation of these LLM-based assistants is currently hampered by two major issues:

  1. Fixed and Simple Prompt Templates: Most existing studies use overly simplistic and fixed prompt templates for generating recommendations and evaluating performance (e.g., "Will the user like {movie_i}. Please answer Yes or No."). This does not reflect the complexity of real-world user interactions.

  2. Lack of High-Quality Textual User Queries: Commonly used datasets (e.g., Movielens-1M, Amazon Beauty) are designed for traditional RecSys and lack the rich, complex textual user queries needed to assess LLMs' capabilities in handling intricate, interactive recommendation tasks. This leads to a testing paradigm that fails to align with practical scenarios.

    The core problem the paper aims to solve is the lack of a comprehensive and realistic benchmark for evaluating LLM-based personalized recommendation assistants that can handle diverse and complex user queries in an interactive setting. This problem is crucial because, without proper evaluation, the true potential and limitations of LLMs in next-generation RecSys cannot be accurately understood or improved upon. The paper's innovative idea is to create such a benchmark that simulates real-world complex user query scenarios.

2.2. Main Contributions / Findings

The paper makes several significant contributions to the field of next-generation recommender systems:

  • Novel Recommendation Paradigm: It introduces and formalizes a new paradigm for RecSys where LLMs act as interactive and intelligent personalized recommendation assistants. This shifts from traditional, fixed recommendation tasks to a more context-aware and personalized user experience.
  • Dataset Construction (RecBench+): The paper presents RecBench+, a comprehensive and high-quality benchmark dataset comprising approximately 30,000 complex user queries across movie and book domains. This dataset is meticulously designed to simulate practical recommendation scenarios for LLM-based assistants, incorporating variations in difficulty, number of conditions, and user profiles. It is the first public dataset specifically for evaluating personalized recommendation assistants in the LLM era.
  • Comprehensive Evaluation: The authors conducted extensive experiments with seven state-of-the-art LLMs (including GPT-4o, Gemini-1.5-Pro, DeepSeek-R1, etc.) on RecBench+, analyzing their strengths and limitations.
  • Actionable Insights and Findings: The evaluation revealed eight detailed observations that shed light on LLMs' capabilities as recommendation assistants:
    1. LLMs demonstrate preliminary abilities to act as recommendation assistants, with GPT-4o and DeepSeek-R1 excelling in explicit condition queries, while Gemini-1.5-Pro and DeepSeek-R1 perform better in queries requiring user profile understanding.

    2. Model performance decreases with increasing query difficulty: LLMs handle Explicit Condition Queries best but struggle more with Implicit Condition Queries and Misinformed Condition Queries.

    3. Precision and Recall improve with more conditions for Condition-based Queries, but Condition Match Rate (CMR) declines for Explicit Condition Queries while rising for Implicit and Misinformed ones.

    4. Incorporating user-item interaction history significantly enhances recommendation quality by improving Precision across all query types. However, it can also introduce "distractor" items, potentially reducing CMR by diverting the model's strict adherence to conditions.

    5. For User Profile-based Queries, Gemini-1.5 Pro and DeepSeek-R1 showed better performance compared to other models.

    6. Demographics-based Queries generally exhibit lower Recall than Interest-based Queries, implying LLMs struggle more with inferring preferences from broad demographic data.

    7. For Interest-based Queries, Precision and Recall are higher for more prevalent interests (for movies), as these are more easily recognizable by LLMs. For books, the trend is opposite due to variants of popular books and the exact match evaluation.

    8. For Demographics-based Queries, LLMs show variations based on demographics, performing better for female users, sales/marketing professionals, and the 50-55 age group, reflecting more consistent preference patterns or better data availability.

      These findings provide a solid foundation for future research and development in LLM-based personalized recommendation assistants.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a reader should be familiar with the following core concepts:

  • Recommender Systems (RecSys): These are information filtering systems that predict what a user might prefer. They are widely used across various platforms (e-commerce, entertainment, social media) to suggest items (products, movies, news articles, etc.) that are likely to be of interest to a particular user.
  • Personalized Recommendation: The ability of a RecSys to tailor recommendations specifically to an individual user's preferences, behaviors, and context, rather than providing generic suggestions.
  • Large Language Models (LLMs): These are deep learning models trained on vast amounts of text data, enabling them to understand, generate, and process human language. They possess strong capabilities in natural language understanding, reasoning, and generalization, which makes them suitable for conversational and complex tasks. Examples include GPT-4o, Gemini, Llama.
  • Knowledge Graph (KG): A structured representation of knowledge that consists of entities (e.g., movies, actors, directors) and relationships between them (e.g., "directed by", "starred in"). KGs provide a rich source of factual information and semantic relationships, which can be leveraged by RecSys to enhance recommendation quality.
  • User-Item Interactions: Records of how users have engaged with items, such as purchases, clicks, ratings, views, or searches. These interactions are fundamental data for training and evaluating most recommender systems.
  • Evaluation Metrics (Precision, Recall, FTR, CMR): Standard measures used to quantify the performance of recommender systems.
    • Precision: The proportion of recommended items that are relevant to the user. It answers: "Of all items I recommended, how many were actually good?"
    • Recall: The proportion of relevant items that are successfully recommended out of all available relevant items. It answers: "Of all the good items available, how many did I actually recommend?"
    • Fail to Recommend (FTR): The proportion of queries for which the model failed to generate any recommendations. A lower FTR is generally desirable, except in Misinformed Condition Queries where a higher FTR might indicate the model correctly identified misinformation and refrained from making bad recommendations.
    • Condition Match Rate (CMR): A metric specifically proposed in this paper for Condition-based Queries. It measures the percentage of recommended items that strictly meet the conditions specified in the user's query. This is crucial for LLM-based assistants to ensure conditional adherence.
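The minimal Python sketch below (illustrative, not from the paper) shows how these four metrics could be computed for a single query, assuming the recommended list, the ground-truth item set, and a per-item condition check are available; all names are hypothetical. Per-query values would then typically be averaged over the full query set to obtain benchmark-level scores.

```python
def evaluate_query(recommended, relevant, meets_conditions):
    """Toy per-query metrics.

    recommended:      list of items returned by the assistant (may be empty)
    relevant:         set of ground-truth items for the query
    meets_conditions: callable mapping an item to True/False (condition check)
    """
    if not recommended:
        # No output at all counts toward Fail to Recommend (FTR).
        return {"precision": 0.0, "recall": 0.0, "cmr": 0.0, "ftr": 1.0}
    hits = [i for i in recommended if i in relevant]
    matched = [i for i in recommended if meets_conditions(i)]
    return {
        "precision": len(hits) / len(recommended),
        "recall": len(hits) / len(relevant) if relevant else 0.0,
        "cmr": len(matched) / len(recommended),  # Condition Match Rate
        "ftr": 0.0,
    }

# Example: 2 of 3 recommendations are relevant; all 3 satisfy the conditions.
print(evaluate_query(["a", "b", "c"], {"a", "b", "d", "e"}, lambda i: True))
# -> precision ~ 0.667, recall = 0.5, cmr = 1.0, ftr = 0.0
```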

3.2. Previous Works

The paper contextualizes its work within the broader evolution of RecSys and LLM evaluation.

3.2.1. Traditional Recommender Systems

  • Collaborative Filtering (CF): Early and foundational methods like matrix factorization (MF) techniques [19] (SVD++ [18], NCF [13]) learn latent representations of users and items based on user-item interactions to predict matching scores. These systems excel at typical recommendation scenarios like "Customers Who Viewed This Also Viewed" or "Based on Your Browsing History".

  • Graph Neural Networks (GNNs): With the rise of deep learning, GNNs (LightGCN [12], GraphRec [7], [37]) have gained prominence for their ability to model high-order user collaborative filtering information by aggregating neighborhood information for user and item embeddings.

    These traditional systems are well-suited for fixed, simple scenarios but struggle with generalizing to new and unseen recommendation tasks or handling interactive paradigms with complex natural language queries.

3.2.2. LLM-based Recommender Systems

More recently, LLMs have been integrated into RecSys due to their powerful reasoning and language understanding capabilities:

  • LLMs as text generation tasks: Some approaches unify various recommendation tasks as text generation tasks, such as P5 [9].
  • Fine-tuning with user-item interaction data: Models like TALLRec [1] and CoLLM [39] fine-tune LLMs on user-item interaction data to improve their recommendation abilities, especially when inadequate recommendation data was present during pre-training.
  • Interactive and conversational RecSys: Examples like Chat-Rec [8] and LLaRA [20] directly transform user interactions into natural language prompts, enabling conversational recommendations. AUTOGLM [23] showcases LLMs acting as autonomous assistants for complex tasks, inspiring their role in RecSys.
  • LLM Benchmarks for RecSys:
    • LLMRec [22]: Focuses on evaluating LLMs for traditional recommendation tasks (e.g., rating prediction, sequential recommendation), establishing baselines against classical methods.
    • Beyond Utility [16]: Proposes a multi-dimensional framework to assess LLMs as recommenders, considering utility, novelty, history length sensitivity, and hallucination (generation of non-existent items).
    • PerRecBench [31]: Aims to isolate personalization accuracy by removing biases in user and item ratings to test LLMs' ability to infer user preferences.
    • Is ChatGPT Fair for Recommendation? [38]: Evaluates fairness in LLM-based recommendation.

3.2.3. LLM Evaluation as General Assistants

The paper also references broader LLM evaluation benchmarks for assistant capabilities:

  • Gaia [24]: Measures LLMs' capabilities as a general-purpose AI assistant.
  • VoiceBench [5] and Mobile-Bench [3]: Domain-specific frameworks for evaluating voice assistants and mobile agents, respectively.
  • MMRC [36]: A large-scale benchmark for understanding multimodal LLMs in real-world conversation.

3.3. Technological Evolution

The evolution from traditional RecSys to LLM-based personalized recommendation assistants represents a significant shift from statistical modeling and item-user correlation to semantic understanding, reasoning, and natural language interaction.

  • Early RecSys (e.g., MF): Focused on implicit correlations and matrix completions. Limited interpretability and generalization to new types of queries.

  • Deep Learning RecSys (e.g., GNNs): Enhanced modeling of complex relationships and higher-order interactions, but still largely reliant on structured data and less flexible with natural language.

  • LLM-based RecSys: Leverages LLMs' pre-trained knowledge and reasoning for semantic understanding of queries and items, natural language generation for recommendations, and interactive capabilities. This moves RecSys towards a conversational and context-aware paradigm.

    This paper's work fits within the cutting edge of this evolution by providing a crucial evaluation framework for the interactive and intelligent personalized recommendation assistant phase. It addresses the gap where existing benchmarks for LLM-based RecSys still largely evaluate LLMs on traditional tasks or in simplistic interactive settings, rather than assessing their ability to handle complex, real-world natural language queries.

3.4. Differentiation Analysis

Compared to existing methods and benchmarks, RecBench+ offers several key differentiators:

  • Focus on Complex, Interactive Queries: Unlike benchmarks that use fixed prompt templates or evaluate LLMs on traditional RecSys tasks, RecBench+ specifically focuses on high-quality textual user queries that reflect the complexity and diversity of real-world user needs in an interactive setting.

  • Diverse Query Categorization: It introduces a novel categorization of queries into Condition-based (Explicit, Implicit, Misinformed) and User Profile-based (Interest-based, Demographics-based), which comprehensively assesses different facets of an LLM's recommendation capabilities (reasoning, knowledge retrieval, preference understanding, robustness).

  • Ground Truth Generation from KGs and User Data: The benchmark's queries are systematically constructed using Knowledge Graphs and user interaction histories, ensuring realism and providing robust ground truth for evaluation. This is a significant improvement over synthetic or simple prompt-based evaluation.

  • Evaluation of Conditional Adherence: The introduction of Condition Match Rate (CMR) specifically addresses whether LLMs strictly adhere to explicit conditions, a crucial aspect for personalized assistants that traditional metrics often miss.

  • Broader Assessment of LLM Behaviors: The benchmark reveals insights into LLMs' strengths and weaknesses with different query types, their ability to handle misinformation, the impact of user history, and their performance across demographic groups, which goes beyond basic accuracy or utility.

    In essence, RecBench+ is designed to evaluate LLMs not just as components of a RecSys, but as fully-fledged personalized recommendation assistants capable of natural, intelligent, and context-aware interaction, a gap not fully addressed by prior benchmarks.

4. Methodology

The core methodology of this paper revolves around the construction of the RecBench+ benchmark dataset, designed to evaluate LLMs as personalized recommendation assistants. The dataset is built around two main categories of user queries: Condition-based Queries and User Profile-based Queries, each reflecting different real-world recommendation scenarios.

4.1. Principles

The fundamental principle behind RecBench+ is to simulate realistic and complex user interactions with a recommendation assistant. This simulation involves generating diverse user queries that incorporate hard conditions (explicit requirements for items) and soft preferences (inferred from user profiles or history), and then evaluating how well LLMs can understand and fulfill these requests. The theoretical basis is that LLMs, with their advanced natural language processing and reasoning capabilities, should be able to interpret nuanced human language and context to provide personalized and accurate recommendations, moving beyond the limitations of traditional RecSys. The intuition is that by creating a benchmark with queries that mimic actual user behavior, the strengths and weaknesses of LLMs as conversational recommendation agents can be accurately assessed.

4.2. Core Methodology In-depth (Layer by Layer)

The RecBench+ benchmark is composed of approximately 30,000 high-quality, complex user queries. These queries are categorized into two main types to evaluate different capabilities of LLM-based recommendation assistants: Condition-based Queries and User Profile-based Queries.

4.2.1. Condition-based Query Construction

Condition-based Queries simulate scenarios where users have specific requirements or constraints for the items they want. The construction process leverages Knowledge Graphs (KGs) to ensure realism and diversity.

The overall process for Condition-based Query Construction involves three key steps:

  1. Item Knowledge Graph (Item KG) Construction: Building KGs that link items (movies/books) to their attributes (directors, actors, genres, authors, categories).

  2. Shared Relation Extraction: Identifying common attributes among items in a user's interaction history to form the basis of conditions.

  3. Query Generation: Using LLMs to generate natural language queries based on these extracted conditions, categorized into Explicit, Implicit, and Misinformed.

    The following figure (Figure 2 from the original paper) illustrates the process of constructing Condition-based Queries.

    (Figure 2: A schematic of constructing the Item Knowledge Graph, extracting shared relations, and generating condition-based queries; it distinguishes explicit, implicit, and misinformed conditions and shows example generated queries.)

4.2.1.1. Item KG Construction

The Item KG forms the foundational data for generating realistic conditions.

  • Movies: Data is extracted from Wikipedia, focusing on 7 key attributes such as directors, actors, composers, and genres. Each movie node is linked to these attributes. The movie dataset used is Movielens-1M [11].

  • Books: Metadata from the Amazon Book Dataset is used, connecting attributes like authors and categories to book nodes.

    These KGs are then combined with traditional recommendation datasets (Movielens-1M and Amazon-Book) to facilitate query generation.
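As a rough illustration of the underlying data structure, an Item KG of this kind can be represented as a mapping from each item to its set of (relation, target) attribute pairs. The sketch below is a hypothetical, minimal representation and not the paper's actual data format.

```python
# Each item maps to its attribute set A_i of (relation, target) pairs.
item_kg = {
    "The Abyss (1989)": {
        ("directed_by", "James Cameron"),
        ("genre", "Science Fiction"),
    },
    "Titanic (1997)": {
        ("directed_by", "James Cameron"),
        ("starring", "Kate Winslet"),
        ("genre", "Romance"),
    },
}
```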

4.2.1.2. Shared Relation Extraction

Shared relations are common attributes found among items in a user's interaction history. This step identifies these commonalities to create meaningful conditions for queries.

Given a user $u$ with an interaction history $\mathcal{H}_u = \{i_1, i_2, \ldots, i_k\}$, a KG retrieval function $\mathcal{R}$ is employed to identify shared attributes (relations) across subsets of items in $\mathcal{H}_u$. Each shared relation is defined as a tuple $(r, t)$, where $r$ is the type of relation (e.g., "directed by") and $t$ is the target value (e.g., the name of a director).

The extraction process results in groups of shared relations and their corresponding subsets of items, represented as: $ \mathcal{R}(\mathcal{H}_u, \mathrm{KG}) = \left\{ (\mathcal{G}_{\mathrm{sub}}, C_{\mathrm{shared}}) \mid \mathcal{G}_{\mathrm{sub}} \subseteq \mathcal{H}_u, C_{\mathrm{shared}} = \{(r_1, t_1), (r_2, t_2), \ldots\} \right\} $ Where:

  • $\mathcal{G}_{\mathrm{sub}}$: A subset of items from the user's history $\mathcal{H}_u$.
  • $C_{\mathrm{shared}}$: The set of shared relations (conditions) that all items in $\mathcal{G}_{\mathrm{sub}}$ possess. These extracted shared relations subsequently serve as the conditions for query generation.
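A minimal sketch of this shared-relation extraction, reusing the illustrative `item_kg` mapping from the previous sketch; the naive enumeration over all subsets is an assumption for small histories, since the paper does not describe its exact enumeration strategy.

```python
from itertools import combinations

def shared_relations(history, kg, min_size=2):
    """Yield (G_sub, C_shared): subsets of the user's history together with the
    (relation, target) pairs shared by every item in the subset."""
    for size in range(min_size, len(history) + 1):
        for subset in combinations(history, size):
            shared = set.intersection(*(kg.get(i, set()) for i in subset))
            if shared:
                yield set(subset), shared

# Example: both toy films share ("directed_by", "James Cameron").
for g_sub, c_shared in shared_relations(["The Abyss (1989)", "Titanic (1997)"], item_kg):
    print(g_sub, c_shared)
```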

4.2.1.3. Query Generation

After extracting shared relations, an LLM (specifically GPT-4o [14]) is used to generate three types of Condition-based Queries: explicit, implicit, and misinformed.

  • Explicit Condition Construction: For explicit conditions, the shared relations $C_{\mathrm{shared}}$ are directly adopted. This means the conditions $C_{\mathrm{explicit}}$ are directly derived from the attributes shared by the items in $\mathcal{G}_{\mathrm{sub}}$ (e.g., director, genre). The prompt used for Explicit Condition Query generation is:

    You are given a set of attributes. Please simulate a real user and generate a natural language
    query covering these attributes to search or request recommendations for related movies. The
    attributes are {Explicit Conditions}...
    
  • Implicit Condition Construction: For implicit conditions, the goal is to make the LLM infer the conditions rather than having them explicitly stated. This is achieved by describing the conditions indirectly through related items in the KG. Specifically, for a shared relation $(r_m, t_m) \in C_{\mathrm{shared}}$ (e.g., (director, Cameron)), the target value $t_m$ (e.g., Cameron) is replaced with an indirect reference that describes $t_m$'s relation to another item $i_k$ (e.g., 'The Abyss') from the KG. The item $i_k$ is chosen such that $t_m$ is related to $i_k$ via relation $r_m$. Mathematically, $i_k$ is selected from the set: $ i_k \in \{ i \in \mathcal{I} \mid (t_m) \xrightarrow{r_m} (i) \text{ in KG} \} $ Where:

    • $\mathcal{I}$: The set of all items in the KG.
    • $\xrightarrow{r_m}$: Denotes the relation $r_m$ between $t_m$ and $i$. The resulting implicit condition is then: $ C_{\mathrm{implicit}} = (r_m, \mathrm{ref}(i_k)) $ Where:
    • $\mathrm{ref}(i_k)$: A textual reference to $i_k$, such as "the director of 'The Abyss'". The prompt used for Implicit Condition Query generation is:
    You are given a set of attributes and relevant information for [MASK] attributes. Your task is to
    generate a query that meets the following criteria:
    Query should ask for items that share the input attributes.
    Do not directly mention the [MASK] attribute that has additional relevant information (e.g.,
    name of cinematography or starring role) in queries. Instead, describe the [MASK] attribute
    using the relevant information provided.
    The input attributes and relevant information is: ….
    
  • Misinformed Condition Construction: Misinformed conditions are created by intentionally introducing factual errors into the conditions to test the LLM's robustness in identifying and handling misinformation. For a shared relation $(r_i, t_i) \in C_{\mathrm{shared}}$, one or more items $\{i_1, i_2, \ldots, i_m\} \subseteq \mathcal{I}$ are randomly selected from the KG that do not have the specified relationship $r_i$ with $t_i$, i.e., $i_k \not\xrightarrow{r_i} t_i$ for each $i_k \in \{i_1, i_2, \ldots, i_m\}$. The condition is then constructed with "error info" that falsely claims these items are related to $t_i$ through $r_i$: $ C_{\mathrm{misinformed}} = (r_i, t_i, \mathrm{error\ info}: \{i_1, i_2, \ldots, i_m\} \xrightarrow{r_i} t_i) $ For example, a condition (director, Cameron) might be misinformed as (director, Cameron, error info: Cameron is the director of 'Star Trek'). The prompt used for Misinformed Condition Query generation is:

    You are given a set of attributes and relevant information for one of the attributes.
    Your task is to generate query that meets the following criteria:
    Imagine you are a real user, queries should ask for movies that share the input attributes.
    One of the attributes is provided with relevant information, such as the movie name, the
    relation between the movie and the attribute, and the person's name. You should describe this
    attribute using the relevant information provided.
    - There may be factual errors between the relevant information and the corresponding attribute,
      but you still need to describe it based on the relevant information provided without making
    corrections or explanations. The input attributes and relevant information is: …
    
  • Final Query Structure: For each generated query $q$ (of any type), a sample is constructed, including the ground truth items: $ q = \mathrm{LLM}(C, \mathcal{P}) $ Where:

    • $C$: The specific condition (explicit, implicit, or misinformed).
    • $\mathcal{P}$: The prompt template used for query generation. The ground truth item set $\mathcal{G}_q$ for a query $q$ consists of items satisfying the shared conditions $C_{\mathrm{shared}}$ (which is the actual, correct condition, even for implicit and misinformed queries): $ \mathcal{G}_q = \{ i \in \mathrm{KG} \mid C_{\mathrm{shared}} \subseteq \mathcal{A}_i \} $ Where:
    • $\mathcal{A}_i$: The attributes of item $i$ in the KG. The user's interaction history $\mathcal{H}_u'$ used for testing excludes items in $\mathcal{G}_q$.
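Continuing the illustrative sketches above, the ground-truth set $\mathcal{G}_q$ for a query can be materialized by selecting every KG item whose attribute set contains all shared conditions, excluding items kept in the user's test history; the helper below is hypothetical.

```python
def ground_truth(kg, c_shared, exclude=()):
    """All KG items i with C_shared a subset of A_i, minus items already shown to the user."""
    return {
        item
        for item, attrs in kg.items()
        if c_shared <= attrs and item not in exclude
    }

# Using the toy item_kg from earlier: items directed by James Cameron.
print(ground_truth(item_kg, {("directed_by", "James Cameron")}))
```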

4.2.2. User Profile-based Query Construction

User Profile-based Queries evaluate the LLM's ability to provide personalized recommendations based on inferred preferences rather than explicit conditions, using user profiles (interests, demographics). This category is divided into Interest-based and Demographics-based Queries.

The following figure (Figure 3 from the original paper) illustrates the construction methods for User Profile-based Queries.

(Figure 3: A schematic of the construction methods for Interest-based and Demographics-based Queries, showing how queries and supporting reasons are generated from users' interaction records and user groups to enable personalized recommendation.)

4.2.2.1. Interest-based Query

These queries capture user interests inferred from collective user behaviors (shared behaviors from multiple users).

  1. Identify Common Interest Sets: Sets of items that are frequently interacted with consecutively by multiple users are identified. Let $\mathcal{H}_u$ denote the interaction history of user $u$, and $\mathcal{H} = \{\mathcal{H}_u \mid u \in \mathcal{U}\}$ denote the set of all users' interaction histories. A common interest set $S$ is defined as: $ S = \{ s \mid \exists u \in \mathcal{U}, s \subseteq \mathcal{H}_u, f(s) \geq \theta \} $ Where:
     • $s$: A sequence of items.
     • $f(s)$: The frequency of sequence $s$ across all users' histories.
     • $\theta$: A predefined frequency threshold.
  2. Extract Preceding Item Sequences: For each common interest set $s$, the sequences of items $p$ that commonly appear immediately before $s$ in user histories are extracted. Let $P(s)$ denote the set of preceding item sequences for a target sequence $s$: $ P(s) = \{ p \mid \exists u \in \mathcal{U}, p \prec s \subseteq \mathcal{H}_u \} $ Where:
     • $p \prec s$: Denotes that sequence $p$ appears immediately before sequence $s$ in the interaction history $\mathcal{H}_u$. (A code sketch of these two steps follows this list.)
  3. Query Generation: An LLM is then used to analyze these patterns and infer the reasoning behind common interests. The generated query aims to reflect both contextual features of items and collective user interests. The prompt used for Interest-based Query generation is:

     ## Input Introduction
     You are given a "Popular Movie List" and "Previous Movie Statistics". "Popular Movie List"
     represents movies collectively watched by multiple users, while "Previous Movie Statistics"
     refers to the statistics of movies that some users watched before watching the "Popular
     Movie List". Here is an example:
     Popular Movie List: [A, D, M, ...]
     Previous Movie Statistics:
     [Z, V, K], count: 6
     [M, O, Z, Y], count: 5
     [Y, C, E, O, Z], count:
     In this example, 6 users watched [A, D, M, ...] after watching [Z, V, K]
     ## Task Introduction
     1. **Step 1: Generate Reasons**
        Please use all your knowledge about these movies to analyze why users watch movies in the
        Popular Movie List after watching Previous Movies. What is the relationship between them?
        Do not give reasons that are too far-fetched or broad. If you think it cannot be explained,
        don't force an explanation.
     2. **Step 2: Generate User Queries and Answers**
        For each reason you generate:
        - Create **realistic user queries**. These queries should simulate how a real user might ask
          recommender systems for movie recommendations using natural language. The query should be
          as complex and rich as possible, close to the reason, and specific enough not to make the
          recommendation system recommend movies other than the subset of answers.
        - Provide a **subset of the Popular Movie List/Previous Movies** as the answer, ensuring the
          recommendations align with the reason and query. This subset will serve as the answer to
          the query, so do not mention the movie title of the subset in the query.
     ## Output Format
     The output must follow this JSON structured format:
     ['reason': <reason>, 'query': <query>, 'movie subset': <movie subset>, 'reason': <reason>,
     'query': <query>, 'movie subset': <movie subset>, ...]
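The sketch below illustrates steps 1 and 2 of this construction (mining common interest sets and the sequences that precede them); it assumes contiguous subsequences and a simple frequency threshold, since the paper does not spell out its exact mining procedure, and all names are illustrative.

```python
from collections import Counter

def frequent_sequences(histories, length=3, theta=5):
    """Contiguous item subsequences of a given length seen at least theta times
    across all users' interaction histories."""
    counts = Counter(
        tuple(h[i:i + length])
        for h in histories
        for i in range(len(h) - length + 1)
    )
    return {s for s, freq in counts.items() if freq >= theta}

def preceding_sequences(histories, target, length=3):
    """Count the length-`length` sequences that appear immediately before
    `target` in the interaction histories (the set P(s) with frequencies)."""
    preceding = Counter()
    for h in histories:
        for i in range(length, len(h) - len(target) + 1):
            if tuple(h[i:i + len(target)]) == target:
                preceding[tuple(h[i - length:i])] += 1
    return preceding
```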

4.2.2.2. Demographics-based Query

These queries focus on how demographic attributes (age, gender, occupation) influence recommendations.

  1. Group Users by Demographics: Users are categorized into different user groups based on permutations and combinations of demographic attributes.
  2. Identify Popular Items per Group: For each user group, the set of items most frequently consumed by users within that group is identified.
  3. Query Generation: The user group demographics and the list of most popular items are provided as input to an LLM. The LLM analyzes underlying patterns and generates a query that encapsulates both the distinct characteristics of the user group demographics and the preferences reflected in the popular items. The prompt used for Demographics-based Query generation is:
    You are tasked with analyzing a specific user group and their movie preferences. Based on the
    provided data, you should generate structured reasons, user queries, and answers in a clear
    and organized format. Follow these steps to complete the task:
    ## **Input Data**:
    1. **User Group**: A description of the user group.
    2. **List of Movies**: A ranked list of movies relevant to the user group, including their
       titles, release years, TF-IDF scores, and the percentage of users in the group who viewed them.
    ## **Your Task**:
    1. **Step 1: Generate Reasons**
       Analyze the user group description and the provided movie list.
       Generate a structured list of reasons explaining why this user group prefers these movies.
    2. **Step 2: Generate User Queries and Answers**
       For each reason you generate:
       - Create **realistic user queries** that naturally combine the reason and user group characteristics.
         These queries should simulate how a real user from the group might ask recommender
         systems for movie recommendations using natural language.
       - Provide a **subset of the movie list** as the answer, ensuring the recommendations align
         with the reason and query. Use the movie title to represent a movie.
    
    The list of most popular items serves as the ground truth for evaluating recommendation relevance.
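A minimal sketch of steps 1 and 2 above (grouping users by demographic attributes and ranking the items most consumed within each group); the TF-IDF weighting mentioned in the prompt is omitted for brevity, and all names are hypothetical.

```python
from collections import Counter, defaultdict

def popular_items_per_group(users, interactions, top_n=20):
    """users:        {user_id: (gender, age_band, occupation)}
    interactions: {user_id: [item, item, ...]}
    Returns, for each demographic group, its most frequently consumed items."""
    group_counts = defaultdict(Counter)
    for uid, demographics in users.items():
        group_counts[demographics].update(interactions.get(uid, []))
    return {
        demographics: [item for item, _ in counts.most_common(top_n)]
        for demographics, counts in group_counts.items()
    }
```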

4.2.3. Statistics of RecBench+

The benchmark dataset's statistical breakdown is provided in Table 1. The following are the results from Table 1 of the original paper:

| Major Category | Sub Category | Condition | Movie | Book |
|---|---|---|---|---|
| Condition-based Query | Explicit Condition | 1 | 2,225 | 2,260 |
| Condition-based Query | Explicit Condition | 2 | 2,346 | 2,604 |
| Condition-based Query | Explicit Condition | 3-4 | 426 | 271 |
| Condition-based Query | Implicit Condition | 1 | 1,753 | 1,626 |
| Condition-based Query | Implicit Condition | 2 | 1,552 | 2,071 |
| Condition-based Query | Implicit Condition | 3-4 | 342 | 213 |
| Condition-based Query | Misinformed Condition | 1 | 1,353 | 1,626 |
| Condition-based Query | Misinformed Condition | 2 | 1,544 | 2,075 |
| Condition-based Query | Misinformed Condition | 3-4 | 342 | 215 |
| User Profile-based Query | Interest-based | - | 7,365 | 2,004 |
| User Profile-based Query | Demographics-based | - | 2,790 | - |
| Queries in Total | | | 19,529 | 14,965 |
| Number of Users | | | 6,036 | 4,421 |
| Number of Items | | | 3,247 | 9,016 |
| Number of User-Item Interactions | | | 33,291 | 29,285 |

From Table 1:

  • The RecBench+ dataset contains a total of 19,529 Movie queries and 14,965 Book queries, providing a substantial testbed.
  • The distribution across Condition-based Query subcategories (Explicit, Implicit, Misinformed) and the number of conditions (1, 2, 3-4) ensures diversity in difficulty.
  • User Profile-based Queries also have a significant presence, especially Interest-based ones for movies (7,365 queries).
  • The dataset is built upon Movielens-1M (for movies) and Amazon-Book (for books), with large numbers of users, items, and user-item interactions, indicating a rich underlying data source.

5. Experimental Setup

5.1. Datasets

The RecBench+ benchmark dataset is constructed using two primary sources:

  • Movie Domain:
    • Source: Movielens-1M [11] (for user-item interactions and basic item metadata) and Wikipedia (for detailed item attributes to build the KG).
    • Scale: 19,529 queries in total. 6,036 users, 3,247 items, and 33,291 user-item interactions.
    • Characteristics: Queries are generated using 7 key attributes (e.g., directors, actors, composers, genres) from Wikipedia to form the Item KG.
  • Book Domain:
    • Source: Amazon Book Dataset (for user-item interactions and metadata like authors and categories to build the KG).
    • Scale: 14,965 queries in total. 4,421 users, 9,016 items, and 29,285 user-item interactions.
    • Characteristics: Queries are generated using attributes like authors and categories.

Example of data sample (Conceptual): A user query in RecBench+ could look like this:

  • Explicit Condition Query: "I'm really interested in classic films and would love to watch something that showcases Charlie Chaplin's legendary comedic talent. Additionally, I've heard that Roland Totheroh's cinematography adds an exceptional visual quality to movies. If you could point me in the direction of films that include both of these elements, I'd greatly appreciate it!"
    • Here, Charlie Chaplin (actor) and Roland Totheroh (cinematographer) are explicit conditions.
  • Implicit Condition Query: "I recently watched Clockers (1995) and Bamboozled (2000), and I was really impressed by the direction in both films. I'm eager to explore more works from the director, as I found their storytelling style and vision very engaging. If you could suggest other films associated with this director, that would be fantastic."
    • The director's name (Spike Lee) is not explicitly stated but must be inferred from the provided movies.
  • Misinformed Condition Query: "I recently watched Lorenzo's Oil and was really impressed by the cinematography done by Mac Ahlberg. I'm interested in finding more films that showcase his cinematographic style. I also remember seeing his work in Beyond Rangoon, so if there are any other movies he contributed to, I'd love to check them out!"
    • Mac Ahlberg is incorrectly attributed as the cinematographer for Lorenzo's Oil and Beyond Rangoon. The LLM needs to detect this misinformation.
  • Interest-based Query: "I'm fond of romantic and dramatic films from the golden age of Hollywood like 'Roman Holiday' and 'My Fair Lady'. Are there any other dramatic romances from that period you would recommend?"
    • The user's interest in golden age romantic dramas is inferred from their liked movies.
  • Demographics-based Query: "I'm a psychology professor and I'm looking for movies that delve into human emotions and relationships. Have you got any?"
    • The recommendation is based on the user's occupation and inferred preferences.

      These datasets were chosen because they are widely recognized and frequently used in RecSys research (Movielens-1M, Amazon Book Dataset), providing a familiar foundation. More importantly, their integration with detailed Knowledge Graphs and sophisticated query generation techniques allows RecBench+ to create realistic, complex, and diverse textual queries that are specifically designed to test the capabilities of LLMs in personalized recommendation assistant scenarios, which traditional datasets alone cannot achieve.

5.2. Evaluation Metrics

The paper utilizes four key metrics to evaluate the performance of LLM-based recommendation assistants: Precision, Recall, Condition Match Rate (CMR), and Fail to Recommend (FTR).

5.2.1. Precision

Conceptual Definition: Precision measures the accuracy of the recommendations provided by the system. It quantifies the proportion of recommended items that are truly relevant to the user's query or preferences. A high precision indicates that the system is good at not recommending irrelevant items.

Mathematical Formula: $ \text{Precision} = \frac{|\text{Recommended Items} \cap \text{Relevant Items}|}{|\text{Recommended Items}|} $

Symbol Explanation:

  • $|\text{Recommended Items} \cap \text{Relevant Items}|$: The number of items that were both recommended by the system and are actually relevant (belong to the ground truth set).
  • $|\text{Recommended Items}|$: The total number of items recommended by the system.

5.2.2. Recall

Conceptual Definition: Recall measures the completeness of the recommendations. It quantifies the proportion of all truly relevant items that were successfully recommended by the system. A high recall indicates that the system is good at finding most of the relevant items.

Mathematical Formula: $ \text{Recall} = \frac{|\text{Recommended Items} \cap \text{Relevant Items}|}{|\text{Relevant Items}|} $

Symbol Explanation:

  • $|\text{Recommended Items} \cap \text{Relevant Items}|$: The number of items that were both recommended by the system and are actually relevant (belong to the ground truth set).
  • $|\text{Relevant Items}|$: The total number of items that are actually relevant (the size of the ground truth set).

5.2.3. Condition Match Rate (CMR)

Conceptual Definition: CMR is a novel metric introduced in this paper specifically for Condition-based Queries. It assesses the strict adherence of the recommended items to the conditions specified in the user's query. This is crucial for evaluating LLMs as assistants, as users expect their explicit constraints to be met. Items that do not satisfy the conditions are considered unsatisfactory.

Mathematical Formula: $ \text{CMR} = \frac{\sum_{i \in \text{Recommended Items}} \mathbb{I}(\text{item } i \text{ meets all specified conditions})}{|\text{Recommended Items}|} $

Symbol Explanation:

  • $\mathbb{I}(\cdot)$: An indicator function that equals 1 if the condition inside the parentheses is true, and 0 otherwise.
  • $\text{item } i \text{ meets all specified conditions}$: Whether a recommended item $i$ possesses all the attributes or satisfies all the constraints explicitly or implicitly requested in the user's query.
  • $|\text{Recommended Items}|$: The total number of items recommended by the system.

5.2.4. Fail to Recommend (FTR)

Conceptual Definition: FTR measures the proportion of queries for which the model failed to generate any recommended items. A low FTR is generally desirable, indicating the model's ability to consistently provide recommendations. However, in Misinformed Condition Queries, a higher FTR can indicate that the model successfully identified misinformation and correctly chose not to provide potentially wrong recommendations, thus demonstrating robustness.

Mathematical Formula: $ \text{FTR} = \frac{\text{Number of queries with no recommendations}}{\text{Total number of queries}} $

Symbol Explanation:

  • $\text{Number of queries with no recommendations}$: The count of queries for which the LLM assistant did not output any items.

  • $\text{Total number of queries}$: The total number of evaluation queries.

    The paper notes that for testing, a fixed number of recommendations $K$ was not specified in the main prompts, to simulate real-world scenarios where users do not predefine $K$. However, experiments with a fixed $K = 5$ are also included in Appendix F for reference.

5.3. Baselines

The paper evaluates seven widely used and state-of-the-art LLMs as baselines for their performance as personalized recommendation assistants:

  • GPT-4o (2024-08-06) [14]: A powerful, recent multimodal LLM from OpenAI, known for its advanced reasoning and understanding capabilities.

  • GPT-4o-mini (2024-07-18) [14]: A smaller, more efficient version of GPT-4o, likely used to assess the trade-off between model size and performance.

  • Gemini (gemini-1.5-pro-002) [32, 33]: Google's multimodal LLM, representing another leading model family.

  • Claude (claude-3-5-sonnet-20241022): A strong competitor from Anthropic, known for its conversational abilities and safety.

  • DeepSeek-V3 [21]: An LLM from DeepSeek, which often focuses on open-source contributions and competitive performance.

  • DeepSeek-R1 [10]: Another model from DeepSeek, specifically designed with a focus on incentivizing reasoning capabilities via reinforcement learning, making it a critical baseline for Implicit and Misinformed Condition Queries.

  • Llama (Llama-3.1-70B-Instruct) [6]: A large open-source LLM from Meta, widely used in research and applications.

    These baselines are representative because they cover a range of leading proprietary models (GPT, Gemini, Claude) and prominent open-source models (DeepSeek, Llama). This diverse selection allows for a comprehensive comparison of different architectural designs, training methodologies, and scale on the challenging RecBench+RecBench+ tasks. The exclusion of GPT-o1 was due to usage policy violations with the prompts.

6. Results & Analysis

The experimental results provide insights into the capabilities and limitations of various LLMs when acting as personalized recommendation assistants, particularly across different query types and the influence of factors like the number of conditions and user interaction history.

6.1. Core Results Analysis

6.1.1. Performance on Condition-based Query

The following are the results from Table 2 of the original paper:

(P = Precision, R = Recall; Exp. = Explicit Condition (Easy), Imp. = Implicit Condition (Medium), Mis. = Misinformed Condition (Hard). Arrows indicate whether higher or lower values are better.)

| Domain | Model | Exp. P↑ | Exp. R↑ | Exp. CMR↑ | Exp. FTR↓ | Imp. P↑ | Imp. R↑ | Imp. CMR↑ | Imp. FTR↓ | Mis. P↑ | Mis. R↑ | Mis. CMR↑ | Mis. FTR↑ | Avg. P↑ | Avg. R↑ | Avg. CMR↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Movie | GPT-4o-mini | 0.185 | 0.322 | 0.531 | 0.009 | 0.083 | 0.167 | 0.198 | 0.017 | 0.028 | 0.060 | 0.153 | 0.104 | 0.099 | 0.183 | 0.294 |
| Movie | GPT-4o | 0.308 | 0.408 | 0.714 | 0.016 | 0.145 | 0.224 | 0.301 | 0.021 | 0.019 | 0.039 | 0.106 | 0.270 | 0.157 | 0.224 | 0.374 |
| Movie | Gemini | 0.256 | 0.408 | 0.644 | 0.052 | 0.104 | 0.206 | 0.203 | 0.014 | 0.024 | 0.049 | 0.076 | 0.030 | 0.128 | 0.221 | 0.308 |
| Movie | Claude | 0.201 | 0.422 | 0.658 | 0.014 | 0.105 | 0.269 | 0.281 | 0.011 | 0.033 | 0.079 | 0.128 | 0.087 | 0.069 | 0.183 | 0.277 |
| Movie | DeepSeek-V3 | 0.190 | 0.401 | 0.621 | 0.001 | 0.090 | 0.260 | 0.217 | 0.001 | 0.027 | 0.078 | 0.105 | 0.013 | 0.102 | 0.246 | 0.314 |
| Movie | DeepSeek-R1 | 0.224 | 0.447 | 0.651 | 0.001 | 0.197 | 0.463 | 0.496 | 0.005 | 0.024 | 0.068 | 0.096 | 0.024 | 0.148 | 0.326 | 0.414 |
| Movie | Llama-3.1-70B | 0.238 | 0.342 | 0.609 | 0.003 | 0.097 | 0.164 | 0.210 | 0.012 | 0.037 | 0.050 | 0.116 | 0.109 | 0.124 | 0.185 | 0.312 |
| Book | GPT-4o-mini | 0.059 | 0.159 | 0.475 | 0.003 | 0.035 | 0.081 | 0.446 | 0.003 | 0.013 | 0.038 | 0.581 | 0.044 | 0.036 | 0.093 | 0.501 |
| Book | GPT-4o | 0.088 | 0.192 | 0.567 | 0.027 | 0.057 | 0.133 | 0.472 | 0.021 | 0.011 | 0.024 | 0.500 | 0.445 | 0.052 | 0.116 | 0.513 |
| Book | Gemini | 0.076 | 0.221 | 0.623 | 0.011 | 0.035 | 0.135 | 0.319 | 0.013 | 0.014 | 0.044 | 0.274 | 0.072 | 0.042 | 0.133 | 0.405 |
| Book | Claude | 0.054 | 0.193 | 0.608 | 0.010 | 0.043 | 0.161 | 0.515 | 0.010 | 0.020 | 0.068 | 0.444 | 0.056 | - | 0.141 | 0.522 |
| Book | DeepSeek-V3 | 0.040 | 0.124 | 0.667 | 0.008 | 0.056 | 0.190 | 0.385 | 0.005 | 0.014 | 0.047 | 0.230 | 0.066 | 0.037 | 0.120 | 0.351 |
| Book | DeepSeek-R1 | 0.072 | 0.230 | 0.471 | 0.018 | 0.060 | 0.194 | 0.167 | 0.031 | 0.015 | 0.051 | 0.333 | 0.097 | 0.049 | 0.159 | 0.324 |
| Book | Llama-3.1-70B | 0.073 | 0.170 | 0.542 | 0.022 | 0.038 | 0.082 | 0.470 | 0.052 | 0.014 | 0.028 | 0.178 | 0.158 | 0.042 | 0.093 | 0.397 |

Observation 1: LLM Performance Varies Across Models.

  • GPT-4o and DeepSeek-R1 generally outperform other models in Condition-based Queries. GPT-4o achieves the highest Precision and the second-highest average CMR for both movie and book datasets. DeepSeek-R1 leads in Recall and is second in Precision.
  • Advanced Reasoning Capability: DeepSeek-R1 shows particular strength in queries requiring reasoning, such as Implicit Condition Queries. Its performance drop from Explicit to Implicit conditions is notably smaller compared to other models, indicating superior ability to infer unstated attributes. This is attributed to its advanced reasoning capabilities.
  • Domain Differences: Performance metrics (P, R, CMR) are generally lower for books compared to movies across all models. This could be due to the inherent complexity or data characteristics of the book domain.

Observation 2: Performance Decreases with Query Difficulty.

  • Explicit Condition Query (Easy): Most LLMs perform best on these queries, indicating they are more adept at handling clearly stated conditions.
  • Implicit Condition Query (Medium): These queries pose a greater challenge, requiring models to infer constraints. Performance (P, R, CMR) generally drops compared to Explicit queries.
  • Misinformed Condition Query (Hard): This is the most difficult category, resulting in the lowest Recall and CMR. This highlights LLMs' struggles with misleading information.
  • FTR in Misinformed Queries: A higher FTR in Misinformed Condition Queries (e.g., GPT-4o has a FTR of 0.270 for movies) suggests that LLMs might be leveraging general knowledge to detect misinformation and avoid making bad recommendations, rather than simply failing to generate output. This indicates a degree of robustness.

Observation 3: Impact of Number of Conditions. The following figure (Figure 4 from the original paper) illustrates the performance on Condition-based Query with different numbers of conditions.

Figure 4: Performance on Condition-based Queries with different numbers of conditions. The chart shows how Recall, Precision, CMR, and FTR change for explicit, implicit, and misinformed queries as the number of conditions increases.

  • Precision and Recall: For Condition-based Queries, Precision and Recall generally improve with an increasing number of conditions. More conditions narrow down the search space, making it easier to pinpoint relevant items.
  • Condition Match Rate (CMR):
    • For Explicit Condition Queries, CMR tends to decline as the number of conditions increases. This is because fulfilling more explicit constraints simultaneously becomes harder, leading to a reduced coverage of all conditions.
    • For Implicit and Misinformed Condition Queries, CMR improves with more conditions. Additional context provided by more conditions helps the model better infer implicit requirements and mitigate the impact of factual errors.

Observation 4: Effect of User-Item Interaction History. The following figure (Figure 5 from the original paper) illustrates the effect of incorporating user-item history.

Figure 5: The effect of incorporating user-item history. The bar chart compares Precision and CMR for GPT-4o, Gemini, and Claude with and without user history, showing how history influences recommendation results. Examples of both using and not using history are provided in Appendix D.3.

  • Improved Precision: Incorporating user-item interaction history generally enhances Precision across all query types. History provides personalization cues, allowing the model to filter irrelevant candidates from a large pool, especially when queries have limited conditions.
  • Trade-off with CMR: However, incorporating user history does not always improve CMR. The model might prioritize user preferences (learned from history) over strict adherence to query conditions, potentially recommending "distractor" items that align with past interests but violate specific query constraints. This highlights a critical trade-off between personalization and conditional accuracy.

6.1.2. Performance on User Profile-based Query

The following are the results from Table 3 of the original paper:

(P = Precision, R = Recall; Int. = Interest-based Query, Dem. = Demographics-based Query. Demographics-based queries are not available for the Book domain.)

| Domain | Model | Int. P↑ | Int. R↑ | Int. FTR↓ | Dem. P↑ | Dem. R↑ | Dem. FTR↓ | Avg. P↑ | Avg. R↑ | Avg. FTR↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| Movie | GPT-4o-mini | 0.013 | 0.058 | 0.000 | 0.018 | 0.054 | 0.000 | 0.015 | 0.056 | 0.000 |
| Movie | GPT-4o | 0.018 | 0.067 | 0.001 | 0.021 | 0.059 | 0.000 | 0.020 | 0.063 | 0.000 |
| Movie | Gemini | 0.019 | 0.072 | 0.007 | 0.019 | 0.063 | 0.000 | 0.019 | 0.072 | 0.004 |
| Movie | Claude | 0.015 | 0.082 | 0.000 | 0.018 | 0.054 | 0.000 | 0.017 | 0.068 | 0.000 |
| Movie | DeepSeek-V3 | 0.015 | 0.071 | 0.000 | 0.019 | 0.060 | 0.000 | 0.017 | 0.066 | 0.000 |
| Movie | DeepSeek-R1 | 0.014 | 0.081 | 0.000 | 0.015 | 0.068 | 0.000 | 0.015 | 0.075 | 0.000 |
| Movie | Llama-3.1-70B | 0.014 | 0.061 | 0.000 | 0.015 | 0.046 | 0.000 | 0.015 | 0.054 | 0.000 |
| Book | GPT-4o-mini | 0.038 | 0.104 | 0.004 | - | - | - | 0.038 | 0.104 | 0.004 |
| Book | GPT-4o | 0.043 | 0.101 | 0.022 | - | - | - | 0.043 | 0.101 | 0.022 |
| Book | Gemini | 0.056 | 0.127 | 0.049 | - | - | - | 0.056 | 0.127 | 0.049 |
| Book | Claude | 0.018 | 0.072 | 0.012 | - | - | - | 0.018 | 0.072 | 0.012 |
| Book | DeepSeek-V3 | 0.020 | 0.081 | 0.005 | - | - | - | 0.020 | 0.081 | 0.005 |
| Book | DeepSeek-R1 | 0.030 | 0.112 | 0.031 | - | - | - | 0.030 | 0.112 | 0.031 |
| Book | Llama-3.1-70B | 0.049 | 0.098 | 0.003 | - | - | - | 0.049 | 0.098 | 0.003 |

Observation 5: Top Performers for User Profile-based Queries.

  • Gemini-1.5 Pro and DeepSeek-R1 generally demonstrate better performance (higher Precision and Recall) for User Profile-based Queries.
  • Smaller models like GPT-4o-mini tend to underperform.
  • Most models exhibit low FTR, indicating their reliability in generating recommendations when only profile information is provided.

Observation 6: Differences Between Interest-based and Demographics-based Queries.

  • Demographics-based Queries generally show lower Recall than Interest-based Queries across all models (for movies).
  • Precision is similar between the two types.
  • This suggests that LLMs find it harder to infer relevant recommendations from broad demographic attributes compared to more specific interest patterns derived from user interactions.

Observation 7: Impact of Interest Popularity (for Interest-based Query). The following figure (Figure 6 from the original paper) illustrates the impact of interest popularity on Precision & Recall.

Figure 6: Impact of interest popularity on Precision and Recall. The bar chart shows Recall and Precision across popularity groups (from most to least popular) for the movie and book domains; Recall is higher for movies across groups, while Precision for books is better in some groups.

  • Movies: For movie recommendations, queries based on more prevalent/popular interests tend to yield higher Precision and Recall. This is likely because widely shared interests are more easily recognized and interpreted by LLMs.
  • Books: The book domain shows an opposite trend. Popular books (e.g., "Dune") often have many editions or publishers, leading to lower metrics because LLMs might confuse variants, making exact matches difficult. Less popular books have fewer variants, leading to higher performance for exact matches.

Observation 8: Performance Variation Across User Demographics (for Demographics-based Query). The following figure (Figure 9 in Appendix E) illustrates the average Recall of queries constructed based on different user demographics.

Figure 9: Average Recall of queries constructed based on different user demographics. The chart is divided into gender, occupation, and age panels, each showing Recall per category and highlighting performance differences across specific occupations and age groups.

  • Gender: LLMs exhibit higher accuracy (Recall) for female users. This could be due to more consistent preference patterns observed in female users (e.g., stable genre preferences like romantic comedies).
  • Occupation: Performance is best for users in sales/marketing roles, possibly because these professions are associated with more consistent and recognizable behavioral patterns.
  • Age:
    • LLMs perform best in Recall for the 50-55 age group. Their preferences might be more focused and less influenced by rapidly changing popular culture compared to younger users.
    • Performance for 56 and above users is weaker, possibly due to less online activity (e.g., ratings, reviews) in this age group, resulting in less training data for LLMs to understand their preferences.

6.2. Data Presentation (Tables)

The following are the results from Table 5 of the original paper, showing performance with a fixed number of recommendations ($K = 5$).

(Exp. = Explicit Condition, Imp. = Implicit Condition, Mis. = Misinformed Condition.)

| Model | Exp. Precision | Exp. Recall | Exp. CMR | Exp. FTR | Imp. Precision | Imp. Recall | Imp. CMR | Imp. FTR | Mis. Precision | Mis. Recall | Mis. CMR | Mis. FTR | Avg. Precision | Avg. Recall | Avg. CMR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 0.233 | 0.436 | 0.657 | 0.002 | 0.123 | 0.245 | 0.266 | 0.003 | 0.024 | 0.052 | 0.077 | 0.100 | 0.127 | 0.244 | 0.333 |
| Gemini | 0.223 | 0.416 | 0.607 | 0.002 | 0.099 | 0.205 | 0.195 | 0.003 | 0.023 | 0.048 | 0.062 | 0.003 | 0.115 | 0.223 | 0.288 |
| Claude | 0.234 | 0.436 | 0.647 | 0.002 | 0.126 | 0.247 | 0.277 | 0.010 | 0.035 | 0.075 | 0.105 | 0.063 | 0.132 | 0.253 | 0.343 |

These results with a fixed $K = 5$ generally align with the observations from the main experiments, confirming that LLMs perform best on Explicit Condition Queries and performance degrades with increasing query difficulty.

6.3. Ablation Studies / Parameter Analysis

6.3.1. Impact of Number of Conditions (K=5)

The following figure (Figure 7 in Appendix F) illustrates the performance on Condition-based Queries with different numbers of conditions when $K = 5$.

Figure 7: Performance on Condition-based Queries with different numbers of conditions when $K = 5$. The chart shows how Recall, Precision, CMR, and FTR change for explicit, implicit, and misinformed queries as the number of conditions (1 to 4) varies.

The trends observed in this fixed $K = 5$ experiment reinforce Observation 3 from the main text:

  • Precision and Recall for Condition-based Queries generally increase as the number of conditions rises.
  • CMR for Explicit Condition Queries tends to decrease with more conditions, while for Implicit and Misinformed Condition Queries, it tends to increase. This supports the idea that more context helps LLMs in complex reasoning tasks, even if explicit constraint satisfaction becomes harder.
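To make the ablation concrete, a query with n conditions can be viewed as a conjunction of constraints embedded in the prompt; the more conditions there are, the harder it is for every recommended item to satisfy all of them at once. The builder below is a hypothetical illustration of that setup, not the benchmark's actual query template.

```python
def build_condition_query(conditions, k=5):
    """Assemble a recommendation request from a list of textual constraints,
    e.g. ["directed by Nora Ephron", "released in 1998"]. The phrasing is a
    hypothetical stand-in for the benchmark's templates."""
    clause = "; ".join(conditions)
    return (
        f"Please recommend {k} movies that satisfy all of the following "
        f"conditions: {clause}. Return only the titles."
    )


# One condition vs. four: the conjunction the model must satisfy grows,
# which is one intuition for why exact condition matching gets harder.
print(build_condition_query(["directed by Nora Ephron"]))
print(build_condition_query([
    "directed by Nora Ephron",
    "starring Meg Ryan",
    "released in 1998",
    "categorized as a romantic comedy",
]))
```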

6.3.2. Effect of Incorporating User-Item History (K=5)

The following figure (Figure 8 in Appendix F) illustrates the effect of incorporating user-item history when K=5.

Figure 8: The effect of incorporating user-item history when K=5. The chart compares the Precision and CMR of GPT-4o, Gemini, and Claude with and without user-item history under Explicit, Implicit, and Misinformed conditions; GPT-4o and Claude with history outperform their no-history counterparts across the settings shown.

This experiment with fixed K=5 further confirms Observation 4 regarding the impact of user-item history:

  • Incorporating user history consistently improves Precision for all models across Explicit, Implicit, and Misinformed query types.
  • The impact on CMR is more complex, with history sometimes leading to a slight decrease due to the LLM prioritizing user preferences over strict conditional adherence. This reinforces the identified trade-off. For example, GPT-4o with history shows higher Precision but slightly lower CMR in some cases compared to without history.
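The history ablation can be pictured as two prompt variants that differ only in whether a short summary of the user's past interactions is prepended. The snippet below is a hypothetical illustration of that setup, reusing the Appendix D.3 example; it is not the paper's actual prompt format.

```python
def build_prompt(query, history=None, k=5):
    """Compose the request sent to the LLM, optionally prefixed with the
    user's interaction history (the formatting here is an assumption)."""
    parts = []
    if history:
        liked = "; ".join(history)
        parts.append(f"The user has previously enjoyed: {liked}.")
    parts.append(query)
    parts.append(f"Recommend {k} items and return only the titles.")
    return "\n".join(parts)


query = "I am looking for films featuring Karen Dotrice and Matthew Garber."
history = ["Beauty and the Beast", "Beetlejuice"]

prompt_without_history = build_prompt(query)           # condition-only variant
prompt_with_history = build_prompt(query, history)     # personalization variant
print(prompt_with_history)
```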

Case Study in Appendix D.1 (Impact of LLM Knowledge): The case study illustrates that GPT-4o successfully recommends films by Ellery Ryan (e.g., I Love You Too (2010)), while GPT-4o-mini fails, stating it lacks data. This highlights that access to a more extensive knowledge base (likely correlated with model size/training data) is crucial for an LLM's performance as a recommendation assistant.

Case Study in Appendix D.2 (Impact of Advanced Reasoning Capability): Two cases demonstrate the importance of advanced reasoning.

  1. When asked for films by John Blick (incorrectly associated with The Mirror Has Two Faces), DeepSeek-V3 gives incorrect recommendations, while DeepSeek-R1 correctly identifies the misinformation and outputs "None" after reasoning that the true cinematographer was Dante Spinotti.
  2. Similarly, when Scott Ambrozy is incorrectly cited for Taps and Absence of Malice, DeepSeek-V3 provides wrong recommendations. DeepSeek-R1, however, questions the name, identifies Owen Roizman as the actual cinematographer, and then recommends his works. These cases strongly validate the claim that LLMs with superior reasoning capabilities can detect and handle misinformation or implicit requirements more effectively, leading to robust and accurate recommendations.

Case Study in Appendix D.3 (Impact of User-Item Interaction History): A query for films featuring Karen Dotrice and Matthew Garber demonstrates the value of user history.

  • Without history, GPT-4o recommends The Three Lives of Thomasina, which features one actor but doesn't align with the user's implicit preference for whimsical/adventurous films.
  • With history (showing preferences for films like Beauty and the Beast, Beetlejuice), GPT-4o correctly recommends Mary Poppins (1964) and The Gnome-Mobile (1967), which fit both the explicit conditions and the user's inferred preferences. This case clearly shows how user history enables personalized recommendations that align more deeply with user preferences beyond just explicit conditions, even if it might slightly deviate from strict conditional adherence.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces RecBench+, a novel and comprehensive benchmark dataset specifically designed to evaluate the potential of LLMs to function as personalized recommendation assistants in next-generation recommender systems. By creating approximately 30,000 high-quality, complex user queries categorized into Condition-based (Explicit, Implicit, Misinformed) and User Profile-based (Interest-based, Demographics-based), the benchmark effectively simulates diverse real-world recommendation scenarios.

Through extensive experiments with leading LLMs, the authors demonstrate that while LLMs possess preliminary capabilities as recommendation assistants, their performance varies significantly across different query types and models. Key findings highlight that LLMs excel in handling explicit conditions but struggle with queries requiring deep reasoning or containing misinformation. The study also reveals a crucial trade-off: incorporating user interaction history enhances personalization and Precision but can sometimes dilute strict Condition Match Rate. The benchmark provides valuable insights into how model knowledge, reasoning capabilities, interest popularity, and user demographics influence recommendation performance. RecBench+ establishes a new standard for evaluating LLM-based interactive recommender systems, pushing the field towards more intelligent and context-aware solutions.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose exciting future directions:

  • Specialized Fine-tuning: The current study primarily evaluates "pure LLM backbones." Future work should explore the effectiveness of LLMs after specialized fine-tuning for recommendation tasks. This includes assessing performance improvements, the extent of these improvements, and which models benefit most from such training.
  • Integration of External Tools: LLMs have static training data and may lack up-to-date information or real-world context. Integrating external tools like search engines and knowledge bases could provide LLMs with dynamic access to information, leading to more timely, intelligent, and customized recommendations.
  • Domain-Specific Knowledge through Hybrid Models: In domains where LLMs lack specialized knowledge (e.g., e-commerce with vast, frequently updated item pools), combining LLMs with techniques like Retrieval-Augmented Generation (RAG) or other hybrid models is crucial. This would allow LLMs to query databases for domain-specific information (e.g., detailed product specifications) to enhance accuracy and contextual relevance, which they struggle with in isolation.
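As a rough illustration of the hybrid direction sketched in the last point, a retrieval step could feed the LLM catalog facts it would otherwise lack. The following is a generic retrieve-then-prompt sketch with assumed function names and a naive keyword retriever; it is not a method proposed by the paper.

```python
def retrieve_item_facts(query, item_database, top_n=10):
    """Naive keyword retrieval over an in-memory catalog of fact strings;
    a real system would use a vector index or search engine (assumption)."""
    terms = set(query.lower().split())
    scored = [
        (sum(term in doc.lower() for term in terms), doc)
        for doc in item_database
    ]
    scored.sort(reverse=True)
    return [doc for score, doc in scored[:top_n] if score > 0]


def build_rag_prompt(query, item_database):
    """Prepend retrieved catalog facts so the LLM grounds its recommendations
    in up-to-date, domain-specific information."""
    facts = retrieve_item_facts(query, item_database)
    context = "\n".join(f"- {fact}" for fact in facts)
    return (
        "Use only the catalog facts below when recommending items.\n"
        f"Catalog facts:\n{context}\n\n"
        f"User request: {query}\n"
        "Recommend 5 items with a one-line reason each."
    )
```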

7.3. Personal Insights & Critique

This paper makes a critical contribution by introducing RecBench+, effectively bridging the gap between advanced LLM capabilities and the practical demands of interactive recommender systems. The detailed categorization of queries and the introduction of CMR are particularly insightful, moving beyond traditional accuracy metrics to assess conditional adherence, which is vital for user trust in intelligent assistants.

One significant inspiration from this paper is the emphasis on reasoning capabilities and robustness to misinformation. The case studies showcasing DeepSeek-R1's ability to identify and handle factual errors highlight a frontier where LLMs can truly differentiate themselves from traditional RecSys, which would simply fail or propagate errors. This capability is transferable to many other AI assistant domains where user input might be ambiguous, incomplete, or erroneous.

However, some aspects invite further consideration:

  • Ground Truth Ambiguity for Soft Preferences: While the construction of Condition-based Queries is rigorous, the "ground truth" for Interest-based and Demographics-based Queries might inherently be softer. Inferred interests or demographic preferences can be subjective, and the ground truth derived from popular items within a group might not fully capture individual nuances or evolving tastes. How well LLMs can generalize beyond collective patterns to truly personalized soft recommendations remains a complex challenge.

  • Scalability of KG-based Query Generation: The KG-based query generation is effective for domains like movies and books with rich, structured metadata. However, applying this to highly dynamic, granular, or less-structured domains (e.g., news articles with rapidly changing topics, niche e-commerce products) might be challenging without robust, domain-specific KGs or alternative construction methods.

  • Evaluation of Recommendation Explanations: The paper focuses on the recommended items. In an interactive assistant paradigm, explanation generation is equally crucial for user satisfaction and trust. Evaluating the quality, relevance, and persuasiveness of LLM-generated explanations for recommendations could be a valuable extension.

  • Long-Term User Interaction and Feedback: The current benchmark evaluates single-turn queries. Real-world personalized assistants involve multi-turn conversations and continuous learning from user feedback. Future benchmarks could incorporate session-based evaluation or reinforcement learning from human feedback to better simulate and improve interactive recommendation loops.

  • Metrics for Novelty and Diversity: While Precision, Recall, and CMR are essential, novelty (recommending items the user hasn't encountered but would like) and diversity (recommending a variety of items) are also key aspects of good recommender systems. RecBench+ could be augmented with metrics or query types designed to assess these aspects for LLMs.

Overall, RecBench+ is a timely and well-executed benchmark that sets a strong foundation for future research. Its insights underscore the exciting potential of LLMs while clearly outlining the path forward for developing truly intelligent and user-centric recommendation assistants.
