LLM-REDIAL: A Large-Scale Dataset for Conversational Recommender Systems Created from User Behaviors with LLMs
TL;DR Summary
LLM-REDIAL is a large-scale dataset for conversational recommender systems, addressing limitations of existing datasets. It combines historical user behavior with carefully designed dialogue templates to guide LLM-generated dialogues, featuring 47.6k multi-turn dialogues with consistent semantics, validated by human evaluation.
Abstract
The large-scale conversational recommendation dataset is pivotal for the development of conversational recommender systems (CRS). Most existing CRS datasets suffer from the problems of data inextensibility and semantic inconsistency. To tackle these limitations and establish a benchmark in the conversational recommendation scenario, in this paper, we introduce the LLM-REDIAL dataset to facilitate the research in CRS. LLM-REDIAL is constructed by leveraging large language models (LLMs) to generate high-quality dialogues. To provide the LLMs with detailed guidance, we integrate historical user behavior data with dialogue templates that are carefully designed through the combination of multiple pre-defined goals. LLM-REDIAL has two main advantages. First, it is the largest multi-domain CRS dataset, consisting of 47.6k multi-turn dialogues with 482.6k utterances across 4 domains. Second, the dialogue semantics and the users' historical interaction information are highly consistent. Human evaluations are conducted to verify the quality of LLM-REDIAL. In addition, we evaluate the usability of advanced LLM-based models on LLM-REDIAL.
In-depth Reading
1. Bibliographic Information
1.1. Title
LLM-REDIAL: A Large-Scale Dataset for Conversational Recommender Systems Created from User Behaviors with LLMs
1.2. Authors
The paper lists the following authors and their affiliations:
- Tingting Liang (Hangzhou Dianzi University, China; Zhoushan Tongbo Marine Electronic Information Research Institute of Hangzhou Dianzi University, China)
- Chenxin Jin (Hangzhou Dianzi University, China)
- Lingzhi Wang (The Chinese University of Hong Kong, Hong Kong, China)
- Wenqi Fan (Hangzhou Dianzi University, China)
- Congying Xia (Salesforce Research, Palo Alto, USA)
- Kai Chen (Hangzhou Dianzi University, China)
- Yuyu Yin (Hangzhou Dianzi University, China; Zhoushan Tongbo Marine Electronic Information Research Institute of Hangzhou Dianzi University, China)
The authors are primarily affiliated with academic institutions in China and Hong Kong, with one author from an industry research lab (Salesforce Research). Their research interests appear to lie in areas such as recommender systems, natural language processing, and potentially large language models given the paper's topic.
1.3. Journal/Conference
The paper was published at "Findings of the Association for Computational Linguistics (ACL) 2024". ACL is a premier conference in the field of computational linguistics and natural language processing (NLP). Publishing at ACL indicates a high standard of peer review and significance within the NLP community.
1.4. Publication Year
2024
1.5. Abstract
This paper introduces LLM-REDIAL, a novel large-scale dataset designed for conversational recommender systems (CRS). The authors highlight two major limitations in existing CRS datasets: data inextensibility (difficulty in scaling due to reliance on human annotation) and semantic inconsistency (lack of alignment between dialogue content and users' actual historical behaviors). To overcome these, LLM-REDIAL is constructed by leveraging large language models (LLMs) to generate high-quality dialogues. The generation process integrates historical user behavior data with carefully designed dialogue templates, which are built upon multiple pre-defined conversational goals. The dataset boasts two main advantages: first, it is the largest multi-domain CRS dataset, comprising 47.6k multi-turn dialogues with 482.6k utterances across four domains; second, it ensures high consistency between dialogue semantics and users' historical interaction information. The quality of LLM-REDIAL is validated through human evaluation, and its utility is demonstrated by evaluating advanced LLM-based models on conversational recommendation tasks using the dataset.
1.6. Original Source Link
https://aclanthology.org/2024.findings-acl.529.pdf (Officially published)
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the scarcity and limitations of high-quality, large-scale datasets for conversational recommender systems (CRS). CRS are systems that provide personalized and context-aware recommendations through natural language conversations, integrating conversational aspects with traditional recommender systems.
This problem is crucial because existing CRS methods are largely data-driven, meaning they require extensive and diverse datasets for effective model training and evaluation. However, current datasets suffer from two significant drawbacks:
- Data Inextensibility: Most existing datasets rely heavily on manual human annotation, either through crowd-workers or expert annotators. This process is time-consuming, expensive, and limits the scalability of dataset creation. Even with the advent of large language models (LLMs), generating high-quality conversational recommendation data has remained a bottleneck.
- Semantic Inconsistency: Existing methods for generating dialogues (e.g., simulated dialogues by crowd-workers or semi-automatic generation based on user profiles) often fail to maintain consistency between the conversation content and users' actual historical behaviors. This inconsistency makes it difficult to thoroughly evaluate the recommendation aspect of CRS, as the dialogue might not accurately reflect a user's true preferences and interaction history.

The paper's entry point and innovative idea lie in leveraging the advanced text generation capabilities of Large Language Models (LLMs) to systematically create a large-scale, high-quality, and semantically consistent CRS dataset. By integrating historical user behavior data with pre-defined dialogue templates and guiding LLMs with detailed prompts, the authors aim to overcome the limitations of prior dataset construction methods.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Introduction of LLM-REDIAL: The creation and release of LLM-REDIAL, a new large-scale, multi-domain dataset specifically designed for conversational recommender systems. This dataset addresses the critical need for scalable and high-quality data in the field.
- LLM-driven Data Generation Methodology: A novel methodology for generating CRS dialogues using LLMs. This method integrates historical user behaviors (positive/negative feedback, reviews) with carefully designed dialogue templates (composed of primary and sub-goals), providing rich guidance to LLMs for generating semantically consistent and fluent conversations.
- Largest Multi-domain CRS Dataset: LLM-REDIAL is highlighted as the largest multi-domain CRS dataset to date, featuring 47.6k multi-turn dialogues and 482.6k utterances across four distinct domains (Books, Movies, Sports, Electronics). This scale and diversity are crucial for training more robust and generalizable CRS models.
- Semantic Consistency: The dataset explicitly tackles the semantic inconsistency problem by ensuring a strong alignment between the dialogue content (items discussed, recommendations made) and the users' actual historical interactions and preferences. This makes it user-centric, enabling better evaluation of recommendation performance.
- Quality Verification and Benchmark: The quality of the generated dialogues in LLM-REDIAL is rigorously verified through extensive human evaluations, demonstrating superior fluency, informativeness, logicality, and coherence compared to existing datasets. Furthermore, the paper provides a benchmark evaluation of advanced LLM-based models on LLM-REDIAL, showcasing its applicability and the importance of incorporating historical interaction information for effective recommendations.

The key findings demonstrate that:
- LLM-generated dialogues, when properly guided, can achieve high quality and scale, surpassing previous human-annotated or crowd-sourced datasets in certain aspects.
- The user-centric nature and semantic consistency of LLM-REDIAL are vital for evaluating the recommendation capabilities of CRS, as evidenced by the improved performance of LLM-based models when historical interaction information is considered.
- While LLMs excel at generating coherent and natural language responses in CRS, fine-tuning and the integration of external knowledge (like user history) are essential for achieving strong recommendation performance, moving beyond just fluent chat.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner should be familiar with the following core concepts:
- Recommender Systems (RS): Software systems that suggest items (e.g., movies, products, articles) to users based on their preferences, past behaviors, and other contextual information. Traditional RS often rely on explicit ratings, implicit feedback (like purchases), or content-based filtering.
- Conversational Recommender Systems (CRS): An evolution of traditional RS that allows users to interact with the system through natural language conversations. Instead of static interfaces, users can express preferences, ask questions, and refine their needs dynamically, making the recommendation process more interactive and personalized.
- Large Language Models (LLMs): Advanced artificial intelligence models trained on vast amounts of text data to understand, generate, and process human language. Examples include GPT-3, GPT-3.5, GPT-4 (developed by OpenAI), LLaMA (Meta), Vicuna, Baize, and Guanaco. They excel at tasks like text generation, summarization, translation, and question answering.
- Multi-turn Dialogues: Conversations that involve multiple exchanges (turns) between participants, where each turn builds upon the previous context. In CRS, multi-turn dialogues allow for iterative refinement of user preferences and recommendations.
- Utterance: A single spoken or written statement by one participant in a conversation, forming a unit of dialogue.
- User-centric Dataset: A dataset where information is organized around individual users, including their historical behaviors, preferences, and potentially multiple conversations associated with them. This contrasts with dialogue-centric datasets, where conversations might be isolated without explicit links to a consistent user profile.
- Data Inextensibility: A term used in the paper to describe the difficulty and high cost of expanding a dataset, especially when it relies heavily on manual human effort.
- Semantic Inconsistency: Refers to a lack of meaningful alignment or logical connection between different pieces of information. In this context, it means the generated dialogue content doesn't accurately or consistently reflect a user's actual preferences or historical interactions.
- Dialogue Templates: Pre-defined structures or patterns for conversations. They specify the sequence of turns, the type of information exchanged, and the goals for each utterance, providing a scaffold for generating more structured and purposeful dialogues.
- Prompts: Textual instructions or initial inputs given to an LLM to guide its generation. Effective prompt design is crucial for steering LLMs to produce desired outputs.
- Fine-tuning (LLMs): A process where a pre-trained LLM is further trained on a smaller, specific dataset to adapt it to a particular task or domain. This typically improves its performance on that specific task compared to its general-purpose pre-trained state.
- Zero-shot Learning: An LLM's ability to perform a task it has not been explicitly trained on, relying solely on its general knowledge acquired during pre-training and the instructions provided in the prompt.
- Few-shot Learning: An LLM's ability to perform a task with only a few examples provided in the prompt, guiding its understanding of the desired output format and content.
- Generative Retrieval: A method where an LLM generates recommendations directly as text, rather than selecting from a pre-defined list. The generated textual recommendations then need to be mapped back to actual items.
- Evaluation Metrics for Recommender Systems:
  - Recall@K: Measures the proportion of relevant items that are successfully recommended within the top-$K$ items.
    - Conceptual Definition: Recall@K assesses how many of the items a user would have liked are actually present in the system's top-$K$ recommendations. A higher Recall@K means the recommender system is better at not missing relevant items.
    - Mathematical Formula: $ \mathrm{Recall}@K = \frac{\sum_{u \in U} |R_{u,K} \cap T_u|}{\sum_{u \in U} |T_u|} $
    - Symbol Explanation:
      - $U$: Set of all users.
      - $R_{u,K}$: Set of top-$K$ recommended items for user $u$.
      - $T_u$: Set of relevant items (ground truth) for user $u$.
      - $|\cdot|$: Cardinality of a set (number of elements).
      - $\cap$: Set intersection.
  - Normalized Discounted Cumulative Gain (NDCG)@K: A ranking quality metric that accounts for the position of relevant items in the recommendation list, giving higher scores to relevant items that appear earlier in the list.
    - Conceptual Definition: NDCG@K measures the usefulness of a recommended list, where relevant items appearing higher in the list contribute more to the score. It is "normalized" to ensure scores across different recommendation lists are comparable, ranging from 0 to 1.
    - Mathematical Formula: $ \mathrm{NDCG}@K = \frac{\mathrm{DCG}@K}{\mathrm{IDCG}@K} $ where $ \mathrm{DCG}@K = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i+1)} $ and $\mathrm{IDCG}@K$ is the ideal DCG@K, obtained by sorting all relevant items by their relevance.
    - Symbol Explanation:
      - $K$: The number of top recommendations considered.
      - $\mathrm{DCG}@K$: Discounted Cumulative Gain at rank $K$.
      - $\mathrm{IDCG}@K$: Ideal Discounted Cumulative Gain at rank $K$.
      - $i$: Rank of the item in the recommended list (from 1 to $K$).
      - $rel_i$: Relevance score of the item at rank $i$ (often binary: 1 if relevant, 0 if not).
      - $\log_2(i+1)$: Discount factor, penalizing relevant items at lower ranks.

  A short computational sketch of both metrics is given below.
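To make these two metrics concrete, here is a minimal sketch of how Recall@K and NDCG@K can be computed for a single ranked list with binary relevance; the function names and the toy example are illustrative, not part of the paper's evaluation code.

```python
import math

def recall_at_k(ranked_items, relevant_items, k):
    """Fraction of the user's relevant items that appear in the top-k list."""
    top_k = set(ranked_items[:k])
    if not relevant_items:
        return 0.0
    return len(top_k & set(relevant_items)) / len(relevant_items)

def ndcg_at_k(ranked_items, relevant_items, k):
    """NDCG@k with binary relevance (rel_i = 1 if the item is relevant, else 0)."""
    relevant = set(relevant_items)
    # (2^1 - 1) / log2(i + 1) with 1-based rank i, i.e. log2(index + 2) for 0-based index.
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked_items[:k]) if item in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: the single ground-truth item sits at position 3 of the top-5 list.
print(recall_at_k(["a", "b", "c", "d", "e"], ["c"], 5))  # 1.0
print(ndcg_at_k(["a", "b", "c", "d", "e"], ["c"], 5))    # 1 / log2(4) = 0.5
```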
3.2. Previous Works
The paper contextualizes its contribution by discussing existing CRS datasets and their limitations. Key prior studies and datasets mentioned include:
- REDIAL (Li et al., 2018): One of the earliest and most widely known CRS datasets for movie recommendations. It consists of over 10,000 dialogues collected by pairing Amazon Mechanical Turk (AMT) workers and guiding them to recommend movies.
  - Limitation: Collected by crowd-workers, quality not always guaranteed, not user-centric, limited to one domain.
- TG-ReDial (Zhou et al., 2020b): A topic-guided CRS dataset constructed using topic threads-based utterance retrieval and human annotation.
  - Limitation: Similar to REDIAL, relies on human annotation/retrieval, not user-centric, limited to one domain.
- DuRecDial (Liu et al., 2020): A human-to-human recommendation-oriented multi-type dialogue dataset created by manual annotation with pre-defined goals. It covers multiple domains like movie, music, and food.
  - Limitation: Manual annotation limits scalability, not user-centric in terms of linking dialogues to comprehensive user historical behaviors.
- INSPIRED (Hayati et al., 2020): Another dataset focusing on sociable recommendation dialogue systems.
  - Limitation: Smaller scale (1k dialogues), not user-centric.
- OpenDialKG (Moon et al., 2019): A dataset for explainable conversational reasoning using knowledge graphs, covering movies and books.
  - Limitation: Not user-centric, some dialogues might end abruptly or lack recommendations as observed in human evaluation.

These datasets, while foundational for the development of CRS, largely suffer from issues of data inextensibility (due to heavy reliance on human annotation or retrieval) and semantic inconsistency (dialogues often lack a strong link to a user's genuine historical preferences and are not structured around a persistent user identity). The paper highlights that even with LLMs, generating conversational recommendation data has been challenging, and existing LLM applications in this area have shown less promising performance.
3.3. Technological Evolution
The field of recommender systems has evolved from basic collaborative filtering and content-based methods to more sophisticated approaches incorporating deep learning and, more recently, natural language processing for conversational interactions.
- Traditional RS: Focused on implicit/explicit feedback, matrix factorization, etc.
- Deep Learning RS: Utilized neural networks for better feature learning and representation.
- Conversational RS: Emerged to address the limitations of static interfaces, aiming for interactive and dynamic recommendation processes. Early CRS datasets relied on human-in-the-loop data collection.
- Knowledge Graph Enhanced CRS: Integrated external knowledge bases to improve understanding and explanation capabilities.
- LLM-powered CRS: The latest frontier, where LLMs are used for dialogue generation, understanding user intent, and even direct recommendation generation. This paper sits firmly in this latest stage, leveraging LLMs not just for the CRS itself, but for creating the data to train and evaluate future CRS, thereby accelerating research in this area.
3.4. Differentiation Analysis
The core differences and innovations of LLM-REDIAL compared to existing CRS datasets are:
- Scale and Diversity: LLM-REDIAL is significantly larger than previous datasets (47.6k dialogues vs. ~10-15k for others) and covers multiple domains, offering richer and more diverse interaction patterns. This addresses the data inextensibility problem by providing a scalable generation method.
- LLM-driven Generation: Instead of relying on crowd-workers or retrieval, LLM-REDIAL uses LLMs (specifically GPT-3.5-turbo) for dialogue generation. This allows for higher quality, fluency, and semantic richness in the generated conversations.
- Semantic Consistency & User-Centricity: This is a critical differentiator. LLM-REDIAL explicitly integrates historical user behavior data (positive/negative interactions, review texts) into the dialogue generation process. This ensures that the items discussed and recommended in the dialogues are highly consistent with a specific user's actual preferences, making the dataset user-centric. Previous datasets were largely dialogue-centric, where conversations were independent and lacked a consistent user identity or comprehensive historical interaction data. This user-centric design is crucial for accurately evaluating the recommendation component of CRS.
- Structured Prompt Engineering: The paper proposes a structured approach to guide LLM generation using pre-defined dialogue templates combined with goal design and user behavior data. This detailed guidance ensures the LLMs generate dialogues that follow a logical recommendation flow and incorporate relevant item details from reviews, avoiding the common pitfalls of unconstrained LLM generation.
- Cost-Effectiveness: While LLMs incur API costs, the automated generation process is significantly more scalable and potentially more cost-effective for large datasets than extensive human annotation, especially for ensuring semantic consistency.
The following are the results from [Table 1] of the original paper:
| Datasets | #Dialogues | #Utterances | #Tokens | #4-Grams | Domains | User-Centric |
| --- | --- | --- | --- | --- | --- | --- |
| REDIAL | 10k | 182k | 4.5k | 58k | Movie | No |
| TG-REDIAL | 10k | 129k | 50k | 7.5k | Movie | No |
| DuRecDial | 10.2k | 156k | 17.6k | 461k | Movie, music, food, etc. | No |
| INSPIRED | 1k | 35k | 11k | 182k | Movie | No |
| OpenDialKG | 15k | 91k | 22k | 547k | Movie, book | No |
| LLM-REDIAL | 47.6k | 482.6k | 124.2k | 4.6M | Movie, book, sport, etc. | Yes |
As Table 1 clearly illustrates, LLM-REDIAL stands out in terms of scale (number of dialogues, utterances, tokens, and 4-grams) and diversity (multi-domain). Crucially, it is the only dataset marked as User-Centric, which underscores its key innovation in linking dialogues to specific user histories. The higher 4-Grams value also indicates richer and more complex patterns in the conversational texts, benefiting from LLM generation.
4. Methodology
The LLM-REDIAL dataset is constructed through a systematic process that leverages Large Language Models (LLMs) to generate high-quality, multi-turn dialogues for conversational recommender systems. The core idea is to guide LLMs with detailed dialogue templates and historical user behavior data to ensure semantic consistency and scalability.
4.1. Principles
The core idea behind the methodology is to overcome the limitations of data inextensibility and semantic inconsistency in existing CRS datasets. The theoretical basis or intuition is that LLMs, with their powerful text generation capabilities, can produce realistic and coherent dialogues. However, to make these dialogues relevant for conversational recommendation, they need to be grounded in actual user preferences and follow a structured conversational flow. This grounding is achieved by:
- User Behavior Integration: Directly incorporating real historical user behaviors (likes, dislikes, reviews) to ensure recommendations and discussions are personalized and consistent with user history.
- Template-Guided Generation: Using dialogue templates with pre-defined goals for each utterance to structure the conversation, ensuring it covers key recommendation phases (e.g., asking for recommendations, making recommendations, providing feedback, acceptance/rejection). This makes the generation process controllable and ensures the dialogues serve the purpose of CRS.
- Prompt Engineering: Combining the structured templates and user data into effective prompts that guide the LLM to generate natural, fluent, and semantically rich dialogues.
4.2. Core Methodology In-depth (Layer by Layer)
The overall process of dataset construction consists of three sequential phases: data preprocessing, template construction, and dialogue generation.
The following figure (Figure 2 from the original paper) provides an overview of the dataset construction framework:
The image is a schematic diagram showing the construction pipeline of the LLM-REDIAL dataset, comprising three main parts: data preprocessing, template construction, and dialogue generation. It illustrates data filtering, grouping, the different goals used in template design, and the interface with the large language model. Through designed steps such as greeting and recommending, high-quality dialogues are generated that keep user behaviors consistent with dialogue content.
Figure: Overview of the LLM-REDIAL dataset construction framework consisting of data preprocessing, template generation, and dialogue generation.
4.2.1. Data Preprocessing
The goal of this phase is to transform raw review data into a usable format for dialogue generation, focusing on extracting user preferences and relevant item information. The dataset source is product reviews from Amazon (He and McAuley, 2016), which contain user reviews and rating information.
The steps involved are:
- Tokenization and Irregular Token Removal: Non-word tokens are removed from review texts to clean the data.
- Review Text Filtering: Review texts are filtered to retain records with a word count between 20 and 400. This ensures the content is substantial enough for dialogue generation but not excessively long, which could confuse the LLM.
- User and Item Interaction Filtering: Users and items with fewer than 10 interactions are removed. This ensures that there is sufficient historical data for each user to support the generation of dialogues representing the recommendation process.
- Interaction Classification (Positive/Negative Feedback): User ratings are used to classify interactions:
  - Ratings above 3 are designated as positive feedback.
  - Ratings below 3 are designated as negative feedback.
  - Ratings of 3 are typically considered neutral and are not used for explicit positive/negative classification in this context, though the paper doesn't explicitly state their treatment.
- Chronological Sorting and Collection Formation:
  - Positive and negative interactions are sorted chronologically.
  - Two main collections are formed: LIKES (items with positive feedback) and DISLIKES (items with negative feedback), ready for prompt generation.
  - A special collection, MIGHT_LIKES, is created by moving the last 10% of positive interactions for each user. Items from MIGHT_LIKES are specifically chosen to be the final golden recommendation in the generated dialogues, implying they are items the user would accept. This ensures a clear positive outcome for some recommendation scenarios. A minimal sketch of this preprocessing pipeline follows below.
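To make the preprocessing steps concrete, below is a minimal sketch of the pipeline under the assumption that each Amazon review is available as a dict with user, item, rating, text, and timestamp fields; the function name and field names are illustrative, while the thresholds follow the steps listed above.

```python
from collections import defaultdict

def preprocess(reviews, min_interactions=10, min_words=20, max_words=400):
    """Filter reviews and split each user's history into LIKES / DISLIKES / MIGHT_LIKES."""
    # 1) Keep reviews whose cleaned text has 20-400 words.
    reviews = [r for r in reviews if min_words <= len(r["text"].split()) <= max_words]

    # 2) Drop users and items with fewer than 10 interactions.
    user_count, item_count = defaultdict(int), defaultdict(int)
    for r in reviews:
        user_count[r["user"]] += 1
        item_count[r["item"]] += 1
    reviews = [r for r in reviews
               if user_count[r["user"]] >= min_interactions
               and item_count[r["item"]] >= min_interactions]

    # 3) Classify by rating and sort each user's interactions chronologically.
    by_user = defaultdict(list)
    for r in reviews:
        by_user[r["user"]].append(r)

    collections = {}
    for user, recs in by_user.items():
        recs.sort(key=lambda r: r["timestamp"])
        likes = [r for r in recs if r["rating"] > 3]
        dislikes = [r for r in recs if r["rating"] < 3]
        # 4) Reserve the most recent 10% of positive interactions as golden recommendations.
        split = int(0.9 * len(likes))
        collections[user] = {"LIKES": likes[:split],
                             "MIGHT_LIKES": likes[split:],
                             "DISLIKES": dislikes}
    return collections
```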
4.2.2. Template Construction
This phase focuses on designing the conversational structure and flow. It involves defining goals for utterances and combining these goals into dialogue templates.
4.2.2.1. Goal Design
- Primary Goals: Eight primary communicative functions are designed for utterances, inspired by the international standard ISO 24617-2. These broadly categorize the intent of a dialogue turn (e.g., Greeting, Ask, Respond, Recommend, Feedback, Chit-Chat, Talk, Reason).
- Sub-Goals: Under each primary goal, detailed sub-goals are provided (totaling 30 sub-goals). These sub-goals come in two types:
  - Fixed Instructions: Explicit instructions like "Ask for recommendation".
  - Flexible Instructions (with Slots): Instructions with placeholders (slots) that will be filled with specific user information during dialogue generation. For example, "Recommend [USER_HIS_LIKES]", where [USER_HIS_LIKES] will be replaced by an item randomly sampled from the user's LIKES collection.
- Example Sub-Goals (from Table 2):
  - Greeting -> "Greeting with [USER_HIS_DISLIKES] and [USER_HIS_DISLIKES_REVIEW]" (user starts the conversation referencing a disliked item).
  - Ask -> "Ask for recommendation" (user seeks recommendations).
  - Recommend -> "Recommend [USER_HIS_LIKES]" (system recommends an item the user likes but that will be rejected in the template).
  - Feedback -> "Reject recommendation with reason" (user rejects an item).
4.2.2.2. Template Construction
- Multiple dialogue templates are created by combining these sub-goals.
- Diversity: To enhance dialogue diversity, templates are varied based on the frequency of recommendations. The count is restricted to 1-3 times.
- Rejection Scenarios: For templates with 2 or 3 recommendations, all preceding recommendations (except the final one) are assumed to be rejected by the user. This creates realistic interaction patterns where users don't always accept the first suggestion.
- Dialogue Lengths: The dialogue lengths are constrained to ranges similar to existing CRS datasets (around 6-16 turns) to ensure realism. Templates with more recommendations naturally have longer dialogue lengths.
- Manual Design: The combinations of goals are manually and carefully designed, resulting in 168 distinct dialogue templates. A small illustrative sketch of a template and its slot filling follows below.
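To illustrate the template idea, here is a minimal sketch of how one template (a sequence of sub-goals with slots) could be represented and concretized with items sampled from a user's LIKES/DISLIKES/MIGHT_LIKES collections; the slot names follow the paper, but the data structures and helper function are assumptions for illustration.

```python
import random

# One illustrative template: the user opens by mentioning a disliked item, asks for a
# recommendation, rejects the first suggestion, then accepts the golden item.
TEMPLATE = [
    ("User",  "Greeting with [USER_HIS_DISLIKES] and [USER_HIS_DISLIKES_REVIEW]"),
    ("User",  "Ask for recommendation"),
    ("Agent", "Recommend [USER_HIS_LIKES]"),
    ("User",  "Reject recommendation with reason"),
    ("Agent", "Recommend [USER_MIGHT_LIKES]"),
    ("User",  "Accept recommendation"),
]

def concretize(template, user_collections):
    """Fill slot placeholders with items sampled from one user's history.

    Assumes the LIKES, DISLIKES, and MIGHT_LIKES collections are non-empty.
    """
    disliked = random.choice(user_collections["DISLIKES"])
    liked = random.choice(user_collections["LIKES"])
    golden = random.choice(user_collections["MIGHT_LIKES"])
    slots = {
        "[USER_HIS_DISLIKES]": disliked["item"],
        "[USER_HIS_DISLIKES_REVIEW]": disliked["text"],
        "[USER_HIS_LIKES]": liked["item"],
        "[USER_MIGHT_LIKES]": golden["item"],
    }
    filled = []
    for speaker, goal in template:
        for slot, value in slots.items():
            goal = goal.replace(slot, value)
        filled.append((speaker, goal))
    return filled
```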
4.2.3. Dialogue Generation
This is the core phase where LLMs are used to generate the actual dialogues.
4.2.3.1. Generation with LLMs
- Prompt Construction: The prompt fed to the LLM is a combination of two main parts:
  - Static Prompt: A pre-defined, task-agnostic textual instruction that describes the task and requirements in plain language.
  - Concretized Template: A specific dialogue template filled with user information.
- User Information Integration: For each dialogue, specific user information is obtained by sampling interactions and review texts from one user's historical behavior (from the LIKES, DISLIKES, and MIGHT_LIKES collections). This information is structured, for example, in a JSON file. This ensures the generated dialogue is specific to a user's past interactions.
- Review Enrichment: To establish a strong connection between dialogue content and item information, real user reviews associated with the sampled items are introduced. The LLM is instructed to enrich the dialogue using this review information without verbatim replication.
- Sentence Length Constraint: To prevent verbosity and ensure quality, each generated sentence is limited to 60 words.
- LLM Selection: GPT-3.5-turbo (the static version of ChatGPT) is used for dialogue generation to facilitate reproducibility.
- Output Observation: The LLM output is a complete multi-turn dialogue. The design ensures that the dialogue flows smoothly, reflecting key steps like requesting, providing, and accepting recommendations, and seamlessly incorporating item information from reviews. The strong generation capabilities of LLMs help maintain naturalness and coherence. A minimal generation sketch is given below.
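Below is a minimal sketch of the generation step using the OpenAI Python client; the static prompt wording, the JSON layout of the user information, and the helper names are illustrative assumptions rather than the paper's exact prompt, and only the gpt-3.5-turbo model choice is taken from the paper.

```python
import json
from openai import OpenAI  # assumes the openai package is installed and an API key is configured

client = OpenAI()

STATIC_PROMPT = (
    "You will write a multi-turn conversation between a user and a recommendation "
    "assistant. Follow the goal of each turn in the template, fill it with the given "
    "user information, enrich utterances with the review texts without copying them "
    "verbatim, and keep every sentence under 60 words."
)

def generate_dialogue(filled_template, user_info):
    """Combine the static prompt, concretized template, and user info into one request."""
    template_text = "\n".join(f"{speaker}: {goal}" for speaker, goal in filled_template)
    prompt = (
        f"{STATIC_PROMPT}\n\nTemplate:\n{template_text}\n\n"
        f"User information (JSON):\n{json.dumps(user_info, indent=2)}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```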
The following figure (Figure 3 from the original paper) illustrates the inputs and outputs for LLM-based dialogue generation:
The image is a schematic showing the relationship between the inputs of the dialogue generation process (dialogue template and static prompt) and the output (the generated dialogue). It includes user information, agent responses, and a dialogue example, illustrating how a large language model is used to generate high-quality dialogues.
Figure 3: The inputs (Template and Prompt) and outputs (Dialogue) of LLMs for the dialogue generation.
This figure visually demonstrates how a Dialogue Template (a sequence of sub-goals), combined with a Static Prompt (general instructions to the LLM) and User Information (historical interactions, reviews) as input, leads to a Generated Dialogue by GPT-3.5-turbo. The example shows how [USER_HIS_LIKES] and [USER_MIGHT_LIKES] slots are filled with actual movie titles and review snippets.
4.2.3.2. Dialogue Filtering
Due to the inherent randomness of LLMs and the potential for long, confusing review texts, direct LLM outputs may contain invalid or noisy cases. A multi-step automatic filtering process is applied to ensure high-quality dialogues:
- Completeness Check: Dialogues that are not completely generated (e.g., cut off mid-sentence) are removed.
- Character Validity Check: Dialogues containing garbled or unreadable characters are discarded.
- Template Filling Check: Dialogues where template slots were not successfully filled with user information (i.e., placeholders like [USER_HIS_LIKES] remain) are removed.
- Length Consistency Check: Dialogues inconsistent in length with their related dialogue templates are discarded.

This filtering ensures that the final LLM-REDIAL dataset contains only high-quality, structured, and semantically consistent multi-turn dialogues suitable for CRS research.
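Below is a minimal sketch of how these four automatic checks could be applied to raw LLM outputs; the regular expressions, the sentence-ending heuristic, and the strict turn-count match are illustrative assumptions rather than the paper's exact rules.

```python
import re

SLOT_PATTERN = re.compile(r"\[USER_[A-Z_]+\]")           # unfilled template slots
NON_PRINTABLE = re.compile(r"[^\x09\x0A\x0D\x20-\x7E]")  # crude garbled-character test

def keep_dialogue(text, expected_turns):
    """Return True if a generated dialogue passes all four filtering checks."""
    turns = [t for t in text.splitlines() if t.strip()]
    # 1) Completeness: the dialogue must end with a finished sentence.
    if not turns or not turns[-1].rstrip().endswith((".", "!", "?", '"')):
        return False
    # 2) Character validity: reject garbled / unreadable characters.
    if NON_PRINTABLE.search(text):
        return False
    # 3) Template filling: no placeholder slots may remain.
    if SLOT_PATTERN.search(text):
        return False
    # 4) Length consistency: the turn count must match the template.
    if len(turns) != expected_turns:
        return False
    return True
```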
4.3. Dataset Construction Cost Analysis
The primary cost for creating LLM-REDIAL is associated with API calls to GPT-3.5-turbo-16k.
- Time Cost: Generating one dialogue takes approximately 10-20 seconds.
- Monetary Cost: GPT-3.5-Turbo-16k is priced at $0.003 per 1K input tokens and $0.004 per 1K output tokens.
- Total Cost: Approximately 100,000 API calls were made, resulting in a total cost of around $750 for generating the preliminary dialogues before filtering. This highlights the relative efficiency of LLM-based generation for large datasets compared to manual annotation.
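As a rough sanity check of these figures (per-call token counts are not reported, so the token numbers below are assumptions), the per-dialogue cost at the quoted rates can be estimated as follows:

```python
def dialogue_cost(input_tokens, output_tokens, in_price=0.003, out_price=0.004):
    """Estimated API cost (USD) of one generation call at the quoted per-1K-token rates."""
    return input_tokens / 1000 * in_price + output_tokens / 1000 * out_price

# Assuming roughly 1.5K input tokens (static prompt + template + reviews)
# and 1K output tokens per dialogue:
per_call = dialogue_cost(1500, 1000)     # 0.0085 USD per call
print(per_call, per_call * 100_000)      # ~850 USD for 100k calls, same order as the reported ~$750
```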
5. Experimental Setup
5.1. Datasets
The LLM-REDIAL dataset itself is the primary focus of the experiments, used both for evaluation of its quality and for benchmarking CRS models.
- Source Data for LLM-REDIAL Generation: Amazon review dataset (He and McAuley, 2016). This dataset contains user reviews and rating information.
- Domains within LLM-REDIAL: The current version of LLM-REDIAL is constructed from 4 domains, chosen from the 24 available in the Amazon review dataset:
  - Books
  - Movies
  - Sports
  - Electronics
- Scale of LLM-REDIAL: 47,651 dialogues with 482,684 utterances.
- Characteristics: Multi-domain, multi-turn, user-centric (each dialogue linked to a user with historical interactions), high semantic consistency.
The following are the results from [Table 3] of the original paper:
|  | Books | Movies | Sports | Electronics | Total |
| --- | --- | --- | --- | --- | --- |
| #Dialogues | 25,080 | 10,093 | 6,218 | 6,260 | 47,651 |
| #Utterances | 259,850 | 106,151 | 58,289 | 58,394 | 482,684 |
| #Tokens | 79,540 | 40,285 | 35,137 | 31,331 | 124,269 |
| #4-Grams | 2,385,204 | 1,100,472 | 757,201 | 679,257 | 4,679,146 |
| #Users | 9,893 | 3,133 | 5,128 | 4,469 | 22,151 |
| #Items | 112,913 | 11,589 | 34,733 | 18,034 | 177,269 |
| Avg. #Dialogues per User | 2.54 | 3.22 | 1.21 | 1.40 | 2.15 |
| Avg. #Utterances per Dialogue | 10.36 | 10.52 | 9.37 | 9.33 | 10.13 |
Table 3 provides detailed statistics for LLM-REDIAL across its four domains. The "Books" domain is the largest in terms of dialogues, utterances, tokens, and users. The average number of utterances per dialogue is around 9-10, consistent with the template design. The user-centric nature is highlighted by "Avg. #Dialogues per User", which shows that users in Books and Movies tend to have more associated dialogues, possibly due to richer interaction histories in these categories.
- Comparison Datasets for Human Evaluation:
  - REDIAL (Li et al., 2018)
  - INSPIRED (Hayati et al., 2020)
  - OpenDialKG (Moon et al., 2019)

  These datasets are chosen as representative English CRS datasets for comparative quality assessment.
5.2. Evaluation Metrics
The paper uses different evaluation metrics for two main aspects: dataset quality (human evaluation) and conversational recommendation performance (model evaluation).
5.2.1. Human Evaluation Metrics (Dataset Quality)
Human annotators evaluate the quality of dialogues at both utterance-level and conversation-level.
- Utterance-Level Metrics (Scale 0-2):
- Fluency:
- Conceptual Definition: Assesses whether an utterance is grammatically correct, easy to understand, and free from awkward phrasing or errors.
- Grading Criteria: 0 (poor - severe errors, difficult to comprehend), 1 (normal - some errors, generally understandable), 2 (good - fluent, no noticeable errors, clear).
- Informativeness:
- Conceptual Definition: Determines if an utterance provides meaningful content, avoiding generic "safe responses" or repetitive statements.
- Grading Criteria: 0 (poor - lacking information, safe response), 1 (normal - some information but lacks detail), 2 (good - rich, detailed, in-depth, provides relevant content).
- Logicality:
- Conceptual Definition: Evaluates the logical consistency of an utterance, checking if it aligns with common sense, follows a logical flow, and is relevant to the preceding context.
- Grading Criteria: 0 (poor - severe logical errors, unrelated to context, self-contradictory), 1 (normal - some logical issues, insufficiently related/reasonable), 2 (good - maintains logical coherence, related and reasonable).
- Coherence:
- Conceptual Definition: Ensures that an utterance logically connects to and flows smoothly from the previous conversation turn, maintaining contextual links.
- Grading Criteria: 0 (poor - highly incoherent, no clear contextual connections), 1 (normal - moderately coherent, occasional ruptures or insufficient links), 2 (good - highly coherent, clear logical connections, smooth transitions).
- Conversation-Level Metric:
  - Direct Pairwise Comparison: Annotators compare two conversations (one from LLM-REDIAL and one from a baseline dataset) and select which one has overall higher quality. This is a subjective but holistic assessment.
- Annotator Agreement: Kendall's coefficient of concordance (W) is used to measure the agreement among the seven annotators for utterance-level evaluations.
  - Conceptual Definition: Kendall's W is a non-parametric statistic that assesses the agreement among multiple raters or judges. A value of 1 indicates perfect agreement, and 0 indicates no agreement.
  - Mathematical Formula: $ W = \frac{12 \sum_{i=1}^{N} (R_i - \bar{R})^2}{m^2 (N^3 - N)} $ where $ R_i = \sum_{j=1}^{m} r_{ij} $ and $ \bar{R} = \frac{1}{N} \sum_{i=1}^{N} R_i $
  - Symbol Explanation:
    - $W$: Kendall's coefficient of concordance.
    - $N$: Number of items or subjects being ranked/rated (e.g., utterances).
    - $m$: Number of raters or judges.
    - $r_{ij}$: Rank (or score, in this case) assigned by rater $j$ to item $i$.
    - $R_i$: Sum of ranks (scores) for item $i$ across all raters.
    - $\bar{R}$: Mean of the sums of ranks (scores) over all items.
  - The significance of $W$ is typically tested using a Chi-square statistic $ \chi^2 = m(N-1)W $ with $N-1$ degrees of freedom.
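A minimal sketch of this computation on an annotators-by-items score matrix, following the formula above; the toy scores are illustrative (the tie correction is omitted).

```python
def kendalls_w(scores):
    """Kendall's W for a list of per-rater score lists (m raters, N items)."""
    m, n = len(scores), len(scores[0])
    # R_i: sum of the scores given to item i by all m raters.
    r = [sum(rater[i] for rater in scores) for i in range(n)]
    r_bar = sum(r) / n
    ss = sum((ri - r_bar) ** 2 for ri in r)
    return 12 * ss / (m ** 2 * (n ** 3 - n))

# Toy example: 3 annotators scoring 4 utterances on the 0-2 scale.
scores = [
    [2, 1, 0, 2],
    [2, 1, 1, 2],
    [1, 1, 0, 2],
]
w = kendalls_w(scores)
chi2 = len(scores) * (len(scores[0]) - 1) * w   # m (N - 1) W
print(round(w, 3), round(chi2, 3))
```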
5.2.2. Conversational Recommendation Performance Metrics (Model Evaluation)
These metrics are used to evaluate how well LLM-based models perform the recommendation task on LLM-REDIAL. The evaluation focuses on the "Movie" domain.
- Recall@K (with K = 5, 10, 50): Defined and explained in Section 3.1.
- NDCG@K (with K = 5, 10, 50): Defined and explained in Section 3.1.
5.3. Baselines
For evaluating conversational recommendation performance, the paper compares several LLM-based models. The task is to predict the item that will appear in the next response given the preceding dialogue context.
- ChatGPT-based model: Uses GPT-3.5-turbo from OpenAI as the recommender.
- Vicuna-based model: Uses Vicuna-7B (Chiang et al., 2023), an open-source LLM fine-tuned from LLaMA (Touvron et al., 2023).
- Baize-based model: Uses Baize-v2-7B (Xu et al., 2023), another open-source LLM based on LLaMA.
- Guanaco-based model: Uses Guanaco-7B (Dettmers et al., 2023), also an open-source LLM based on LLaMA.
Settings for LLM-based Baselines: Each model is tested under three settings:
- Zero-shot: The model receives only the dialogue context (or context + historical interactions) and the instruction to recommend 50 items. No examples are provided.
- Few-shot: The model receives the dialogue context (or context + historical interactions) along with 5 case examples to guide its recommendation.
- Fine-tuning: The model is fine-tuned on a training set of dialogues before being evaluated.
  - For ChatGPT-based: 200 dialogues for testing. Fine-tuned with 200 training examples.
  - For Vicuna, Baize, Guanaco: 1,500 dialogues for testing. Fine-tuned with 8,593 training examples.
Prompting Details:
- Static Prompt: "Pretend you are a movie recommender system. I will give you a conversation between a human and assistant. Based on the conversation, you reply me with 50 recommendations without extra sentences."
- Historical Interaction (H.I.) Integration: An additional prompt part "Here is the item lists: {}" is added, where {} is filled with the user's historical interaction data.
- Few-shot Examples: For few-shot settings, "Here is the examples: {}" is added, containing 5 correct recommendation examples.
- Decoding Temperature: Set to 0 for all models to ensure deterministic outputs.
- Recommendation Mapping: Since LLMs generate text, a fuzzy matching approach (following He et al., 2023) is used to convert the generated textual recommendation list into an item ranking list. A minimal sketch of this evaluation pipeline is given below.
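To illustrate the evaluation pipeline, here is a minimal sketch of assembling such a recommendation prompt and fuzzily mapping the generated titles back to catalog items; the matching threshold, helper names, and the use of Python's difflib are assumptions (the paper follows He et al., 2023 for fuzzy matching), while the static prompt text is the one quoted above.

```python
import difflib

STATIC_PROMPT = (
    "Pretend you are a movie recommender system. I will give you a conversation "
    "between a human and assistant. Based on the conversation, you reply me with "
    "50 recommendations without extra sentences."
)

def build_prompt(dialogue, history_items=None, examples=None):
    """Assemble the recommendation prompt for the Dial. Only / Dial. + H.I. settings."""
    parts = [STATIC_PROMPT, dialogue]
    if history_items is not None:                      # Dial. + H.I. setting
        parts.append("Here is the item lists: " + ", ".join(history_items))
    if examples is not None:                           # few-shot setting
        parts.append("Here is the examples: " + "\n".join(examples))
    return "\n\n".join(parts)

def map_to_catalog(generated_titles, catalog, cutoff=0.8):
    """Fuzzy-match each generated title to the closest catalog item name."""
    ranked = []
    for title in generated_titles:
        match = difflib.get_close_matches(title, catalog, n=1, cutoff=cutoff)
        if match and match[0] not in ranked:
            ranked.append(match[0])
    return ranked  # item ranking list fed to Recall@K / NDCG@K
```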
6. Results & Analysis
6.1. Core Results Analysis
The evaluation focuses on two main aspects: the quality of the LLM-REDIAL dataset itself through human evaluation, and the performance of LLM-based models on conversational recommendation tasks using the dataset.
6.1.1. Human Evaluation on Dataset Quality
6.1.1.1. Utterance-Level Evaluation
- Setup: 10 dialogues were randomly sampled from each of the four datasets (LLM-REDIAL, REDIAL, INSPIRED, OpenDialKG), shuffled, and presented to seven graduate student annotators. Each annotator scored 1,996 utterances based on Fluency, Informativeness, Logicality, and Coherence (scale 0-2).
- Annotator Agreement: Kendall's coefficient of concordance (W) was 0.312. The Chi-square value (4353.788) was significantly greater than the critical value, indicating statistically significant agreement among annotators.
The following are the results from [Table 4] of the original paper:
| Dataset | Fluency (0-2) | Informative (0-2) | Logical (0-2) | Coherence (0-2) |
| --- | --- | --- | --- | --- |
| LLM-REDIAL | 1.98 | 1.28 | 1.90 | 1.88 |
| REDIAL | 1.83 | 1.18 | 1.76 | 1.77 |
| INSPIRED | 1.86 | 1.01 | 1.83 | 1.79 |
| OpenDialKG | 1.95 | 1.03 | 1.84 | 1.78 |
Analysis:
Table 4shows thatLLM-REDIALachieved the highest scores across all four utterance-level metrics.- It demonstrated extremely high
Fluency(1.98),Logicality(1.90), andCoherence(1.88), which can be attributed to the strong generative capabilities of LLMs. - Its superiority in
Informativeness(1.28) was particularly significant compared to other datasets. The authors explain this by the integration of users' historical interactions and review information into the dialogue templates, allowing for more detailed and in-depth discussions. In contrast, crowd-sourced datasets struggle to incorporate such rich, personalized information.
- It demonstrated extremely high
6.1.1.2. Conversation-Level Evaluation
-
Setup: Three groups were formed, each pairing LLM-REDIAL with one of the comparison datasets (REDIAL, INSPIRED, OpenDialKG). For each group, 50 dialogues from each dataset were randomly matched to form 50 pairs. Seven annotators compared 150 pairs (50 pairs per group) and selected the overall higher-quality conversation.

The following figure (Figure 4 from the original paper) shows the conversation-level human evaluation results:
The image is a chart comparing LLM-REDIAL with the other datasets in the conversation-level human evaluation. Red bars represent comparisons in which LLM-REDIAL performed better, and green bars represent comparisons in which it performed worse, showing LLM-REDIAL's advantage across the datasets.
Figure 4: Conversation-level human evaluation on the LLM-REDIAL dataset.
- Analysis:
  - Figure 4 illustrates that in all three pairwise comparisons, a significantly higher proportion of annotators (over 80% for REDIAL and INSPIRED, and about 88% for OpenDialKG) rated LLM-REDIAL dialogues as having better overall quality.
  - An interesting observation was made regarding OpenDialKG: despite its good utterance-level scores, a large majority of annotators found its overall conversation quality inferior. This was attributed to OpenDialKG dialogues sometimes ending abruptly or lacking clear recommendations, issues that LLM-REDIAL's template-driven generation avoids.
6.1.2. Evaluation on Conversational Recommendation
- Setup: Experiments were conducted on the "Movie" domain of LLM-REDIAL to test the applicability of the dataset for conversational recommendation tasks using LLM-based models. The task was to predict the next recommended item.
- Metrics: Recall@K and NDCG@K (K = 5, 10, 50).
- Settings: Zero-shot, Few-shot, and Fine-tuning.
- Input Variations: Dial. Only (only dialogue text as input) vs. Dial. + H.I. (dialogue text plus users' historical interactions as input).

The following are the results from [Table 5] of the original paper:
| Method | Setting | Input | REDIAL R@5 | R@10 | R@50 | N@5 | N@10 | N@50 | LLM-REDIAL R@5 | R@10 | R@50 | N@5 | N@10 | N@50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ChatGPT-based | Zero-Shot | Dial. Only | 0.0100 | 0.0100 | 0.0150 | 0.0072 | 0.0071 | 0.0085 | 0.0000 | 0.0000 | 0.0400 | 0.0000 | 0.0000 | 0.0086 |
| | | Dial. + H.I. | / | / | / | / | / | / | 0.0000 | 0.0050 | 0.0350 | 0.0000 | 0.0015 | 0.0077 |
| | Few-Shot | Dial. Only | 0.0100 | 0.0150 | 0.0200 | 0.0100 | 0.0115 | 0.0130 | 0.0000 | 0.0000 | 0.0350 | 0.0000 | 0.0000 | 0.0075 |
| | | Dial. + H.I. | 0.2000 | 0.2600 | 0.4400 | 0.1953 | 0.2021 | 0.2625 | 0.0000 | 0.0000 | 0.0400 | 0.0000 | 0.0000 | 0.0087 |
| | Fine-Tuning | Dial. Only | / | / | / | / | / | / | 0.1757 | 0.3150 | 0.4600 | 0.5175 | 0.5100 | 0.1716 |
| | | Dial. + H.I. | 0.4500 | 0.4270 | 0.4295 | 0.4265 | | | | | | | | |
| Vicuna-based | Zero-Shot | Dial. Only | 0.0005 | 0.0007 | 0.0013 | 0.0001 | 0.0003 | 0.0004 | 0.0010 | 0.0013 | 0.0027 | 0.0007 | 0.0006 | 0.0010 |
| | | Dial. + H.I. | / | / | / | / | / | / | 0.0033 | 0.0080 | 0.0507 | 0.0025 | 0.0034 | 0.0128 |
| | Few-Shot | Dial. Only | 0.0004 | 0.0007 | 0.0053 | 0.0005 | 0.0007 | 0.0016 | 0.0000 | 0.0027 | 0.0100 | 0.0000 | 0.0009 | 0.0026 |
| | | Dial. + H.I. | / | / | / | / | / | / | 0.0080 | 0.0133 | 0.0553 | 0.0073 | 0.0089 | 0.0172 |
| | Fine-Tuning | Dial. Only | 0.1945 | 0.3018 | 0.4993 | 0.1397 | 0.1642 | 0.2080 | 0.2869 | 0.3325 | 0.6090 | 0.2624 | 0.2684 | 0.2988 |
| | | Dial. + H.I. | / | / | / | / | / | / | 0.3260 | 0.3980 | 0.6940 | 0.2569 | 0.2655 | 0.3108 |
| Baize-based | Zero-Shot | Dial. Only | 0.0005 | 0.0007 | 0.0020 | 0.0002 | 0.0003 | 0.0006 | 0.0017 | 0.0031 | 0.0119 | 0.0012 | 0.0016 | 0.0034 |
| | | Dial. + H.I. | / | / | / | / | / | / | 0.0021 | 0.0039 | 0.0109 | 0.0027 | 0.0037 | 0.0041 |
| | Few-Shot | Dial. Only | 0.0007 | 0.0008 | 0.0033 | 0.0003 | 0.0004 | 0.0008 | 0.0039 | 0.0069 | 0.0135 | 0.0029 | 0.0037 | 0.0052 |
| | | Dial. + H.I. | / | / | / | / | / | / | 0.0095 | 0.0135 | 0.0195 | 0.0074 | 0.0084 | 0.0094 |
| | Fine-Tuning | Dial. Only | 0.2103 | 0.3104 | 0.4260 | 0.1295 | 0.1406 | 0.1809 | 0.2173 | 0.3227 | 0.4867 | 0.1600 | 0.1665 | 0.1873 |
| | | Dial. + H.I. | / | / | / | / | / | / | 0.3327 | 0.4580 | 0.5513 | 0.1769 | 0.1920 | 0.2087 |
| Guanaco-based | Zero-Shot | Dial. Only | 0.0011 | 0.0006 | 0.0007 | 0.0040 | 0.0002 | 0.0003 | 0.0008 | 0.0026 | 0.0013 | 0.0044 | 0.0099 | 0.0096 0.0006 0.0008 |
| | | Dial. + H.I. | 0.0007 | 0.0007 | 0.0020 | 0.0003 | 0.0003 | 0.0006 | 0.0028 | 0.0048 | 0.0019 | 0.0024 | 0.0034 | |
| | Few-Shot | Dial. Only | / | / | / | / | / | / | 0.0093 | 0.0133 | 0.0100 | 0.0213 | 0.0019 | 0.0025 |
| | | Dial. + H.I. | 0.2028 | 0.2367 | 0.3133 | 0.1195 | 0.1267 | 0.1608 | 0.1867 | 0.2567 | 0.4140 | 0.0081 | 0.0097 | |
| | Fine-Tuning | Dial. Only | / | / | / | / | / | / | 0.1993 | 0.2827 | 0.4533 | 0.1430 | 0.1536 | 0.1833 |
| | | Dial. + H.I. | / | / | / | / | / | / | 0.1680 | 0.1751 | 0.1922 | | | |
Analysis of Table 5:
- Zero-shot and Few-shot Performance: All baseline models (ChatGPT-based, Vicuna-based, Baize-based, Guanaco-based) show very poor performance in both zero-shot and few-shot settings on
LLM-REDIAL. This indicates that pre-trained LLMs, while capable of generating coherent text, cannot directly perform conversational recommendation effectively without specific adaptation. Thefew-shotsetting provides only marginal improvements. - Impact of Fine-tuning: There are significant performance improvements across all models when
fine-tuningon the training data. This confirms the necessity of adapting LLMs to the specific task of conversational recommendation. The ranking of models under fine-tuning generally aligns with their reported performance on general LLM leaderboards (e.g., AlpacaEval for Vicuna), suggestingLLM-REDIALis a valid benchmark for distinguishing LLM capabilities in CRS. - Importance of Historical Interactions (H.I.): The
Dial. + H.I.setting consistently outperformsDial. Onlyacross all models and settings, especially under fine-tuning. This is a crucial finding: incorporating users' historical interaction records significantly improves recommendation performance. This validatesLLM-REDIAL'suser-centricdesign philosophy, as most existing CRS datasets lack this explicit link to historical user data. This highlights a key advantage ofLLM-REDIALfor developing more effective CRS. - Comparison with REDIAL: The table also includes some performance figures on
REDIALfor the ChatGPT-based model. While not a direct comparison for all models, it generally shows that performance onLLM-REDIAL(especially with H.I. and fine-tuning) can be competitive or better, underscoring its utility as a benchmark.
6.2. Case Study
The paper provides a case study to intuitively explore the effect of response generation with recommendations based on LLMs under different settings.
The following figure (Figure 6 from the original paper) shows an example:

Figure 6: Case study of response generation for recommendation based on LLMs under different settings.
- Analysis:
  - Zero-shot and Few-shot: The example shows that in these settings, the ChatGPT-based model generates responses that are coherent and natural in terms of language, but the actual recommendation performance is relatively poor. The model struggles to provide a recommendation relevant to the user's implicit preferences (e.g., the user dislikes "Game Change" for its portrayal of political figures, but the model still tries to recommend similar themes). This suggests that while LLMs excel at dialogue generation, their inherent knowledge is not sufficient for precise recommendation without task-specific adaptation.
  - Fine-tuning: After fine-tuning, the model is "more likely to make recommendations meeting users' requirements in the generated responses." This reinforces the quantitative results that adaptation is critical for LLMs to become effective recommenders within a conversational context.
  - Ground Truth: The ground truth shows an example of a relevant recommendation ("Ghost Dog: The Way of the Samurai") that aligns with the user's expressed interest in "Vicky Cristina Barcelona" (implied interest in character development, complex themes) and their dislike of "Game Change" (political themes). The fine-tuned model's output, while not explicitly shown in full detail for this specific case, is expected to move closer to such ground truth.

The case study visually supports the conclusion that while LLMs simplify response generation, significant effort (like fine-tuning and integrating historical context) is needed to achieve effective recommendations.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully introduces LLM-REDIAL, a large-scale, multi-turn dialogue dataset specifically designed for conversational recommender systems (CRS). It addresses the critical limitations of data inextensibility and semantic inconsistency prevalent in existing CRS datasets. By leveraging Large Language Models (LLMs) and guiding them with historical user behavior data and pre-designed dialogue templates, the authors generated a high-quality, user-centric dataset. LLM-REDIAL is currently the largest multi-domain CRS dataset, comprising 47.6k dialogues across four domains, featuring strong semantic consistency between dialogue content and user interaction history. Human evaluations confirm its superior quality (fluency, informativeness, logicality, coherence) compared to other benchmarks. Furthermore, experiments with LLM-based models on LLM-REDIAL demonstrate its usability for evaluating CRS, highlighting that fine-tuning and incorporating user historical interactions are crucial for effective recommendation performance. The paper concludes that LLM-REDIAL serves as a valuable resource for advancing CRS research, particularly in the context of LLM-powered systems.
7.2. Limitations & Future Work
The authors acknowledge several limitations of LLM-REDIAL and suggest future research directions:
- Prompt Design for LLMs: The quality of generated dialogues is heavily influenced by prompt design. The current work focused on generating a large-scale dataset rather than optimizing prompts. Future work could explore prompt tuning techniques to improve LLM output quality for dialogue generation in conversational recommendation scenarios.
- Manual Template Construction: The template construction phase (goal design and template combination) relies heavily on manual effort. This limits the efficiency and diversity of dataset construction. Future research should aim to reduce human intervention in goal and template design, possibly through automated or semi-automated methods.
- Bias in Source Data: The Amazon review dataset used as the source introduces potential biases:
  - User Rating Bias: Different users may have varying rating standards, leading to inconsistencies in defining "likes" and "dislikes".
  - Review Bias: Review content itself can be polarized, exaggerated, or depreciative. Dialogues generated from such reviews might inherit these biases. Detecting and correcting these biases in LLM-generated dialogues is non-trivial due to the diverse outputs. Future work needs to explore more nuanced and sophisticated processes to correct user rating and review biases before dialogue generation.
7.3. Personal Insights & Critique
This paper presents a highly relevant and timely contribution to the field of conversational recommender systems, especially with the rise of LLMs. The core innovation of using LLMs to generate a large-scale, semantically consistent, and user-centric dataset is a powerful approach to address the data bottleneck.
Insights:
- Paradigm Shift in Dataset Creation: The methodology represents a significant shift from costly, human-intensive dataset annotation to scalable, LLM-driven generation. This could accelerate research not just in CRS but in other NLP domains requiring large amounts of structured conversational data.
- Emphasis on Semantic Consistency: The explicit focus on linking dialogues to historical user behaviors is crucial. Many previous CRS datasets treated conversations as isolated entities, making it difficult to properly evaluate recommendation effectiveness in a personalized context. LLM-REDIAL's user-centric design closes this gap.
- LLMs as Tools, Not Just Models: The paper effectively demonstrates LLMs' utility as tools for data generation, not just as end-to-end models. This highlights a powerful application of LLMs that goes beyond direct task execution.
- Hybrid Approach Validity: The success of LLM-REDIAL stems from a hybrid approach: leveraging LLM power for generation while providing strong, structured guidance through templates and real user data. This controlled generation avoids the "hallucinations" or irrelevant outputs that might arise from unconstrained LLM use.
Critique/Areas for Improvement:
- Template Diversity and Generalization: While 168 templates are used, the manual design process is a potential limitation for truly open-ended conversational scenarios. The paper notes this as future work, but it's a critical point. Can these templates fully capture the vast range of human conversational styles and recommendation nuances? Exploring how to automate or semi-automate template generation and goal definition would be a crucial next step to make the dataset generation even more extensible.
- LLM "Black Box" Dependency: The quality heavily relies on the LLM's capabilities. While GPT-3.5-turbo is powerful, future iterations of LLMs will inevitably change. The robustness of the dataset generation process to different (or future) LLMs is an interesting question. The prompt engineering itself might need adaptation as LLM capabilities evolve.
- Bias Mitigation: The paper acknowledges biases from the Amazon review dataset. While ethical considerations are mentioned regarding privacy, the deeper issue of algorithmic bias inherent in the source data and its propagation into the generated dialogues warrants more detailed investigation. For example, if Amazon reviews show gender or racial bias in product recommendations, these could implicitly be replicated in LLM-REDIAL. Detecting and mitigating such subtle biases is an active area of research for LLMs.
- Fuzzy Matching: The reliance on fuzzy matching to map generated textual recommendations back to actual items could introduce noise or errors. While a practical solution, its impact on the ground truth for evaluation should be thoroughly analyzed and potentially improved upon.
- Beyond 4 Domains: The initial release covers 4 domains. While a good start, expanding to a wider variety of domains would further enhance the dataset's utility and generalizability for real-world CRS, which often operate across many product categories.

Overall, LLM-REDIAL is a meticulously constructed and well-evaluated dataset that offers a compelling solution to a long-standing problem in CRS research. Its methodology provides a blueprint for future dataset creation efforts, fostering the development of more sophisticated and user-aware conversational recommender systems.