Refining Text Generation for Realistic Conversational Recommendation via Direct Preference Optimization
TL;DR Summary
This paper introduces an improved Conversational Recommender System method using Large Language Models to generate dialogue summaries and recommendations, capturing both explicit and implicit user preferences. Direct Preference Optimization (DPO) is employed to ensure the generated dialogue summaries and item recommendation information are rich in content crucial for effective recommendations.
Abstract
Conversational Recommender Systems (CRSs) aim to elicit user preferences via natural dialogue to provide suitable item recommendations. However, current CRSs often deviate from realistic human interactions by rapidly recommending items in brief sessions. This work addresses this gap by leveraging Large Language Models (LLMs) to generate dialogue summaries from dialogue history and item recommendation information from item description. This approach enables the extraction of both explicit user statements and implicit preferences inferred from the dialogue context. We introduce a method using Direct Preference Optimization (DPO) to ensure dialogue summary and item recommendation information are rich in information crucial for effective recommendations. Experiments on two public datasets validate our method's effectiveness in fostering more natural and realistic conversational recommendation processes. Our implementation is publicly available at: https://github.com/UEC-InabaLab/Refining-LLM-Text
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Refining Text Generation for Realistic Conversational Recommendation via Direct Preference Optimization
1.2. Authors
Manato Tajiri and Michimasa Inaba, affiliated with The University of Electro-Communications, Tokyo, Japan. Their contact emails are t2530085@gl.cc.uec.ac.jp and m-inaba@uec.ac.jp.
1.3. Journal/Conference
The paper lists a publication date of 2025-08-27 but no venue. Given the arXiv link, it is most likely a preprint posted ahead of a conference or journal submission, as is common in academic research.
1.4. Publication Year
2025
1.5. Abstract
The paper addresses a significant limitation in current Conversational Recommender Systems (CRSs), which often generate rapid item recommendations in brief sessions, deviating from natural human interactions. To bridge this gap, the authors propose a method that utilizes Large Language Models (LLMs) to generate dialogue summaries from dialogue history and item recommendation information from item descriptions. This dual generation aims to capture both explicit user statements and implicit preferences. A key innovation is the application of Direct Preference Optimization (DPO) to fine-tune the LLMs, ensuring that the generated texts are rich in information crucial for effective recommendations. Experiments on two public datasets demonstrate the method's superiority in fostering more natural and realistic conversational recommendation processes. The implementation is publicly available.
1.6. Original Source Link
Official Source: https://arxiv.org/abs/2508.19918 PDF Link: https://arxiv.org/pdf/2508.19918v3.pdf Publication Status: Preprint (arXiv)
2. Executive Summary
2.1. Background & Motivation
The core problem the paper addresses is the lack of realism in many existing Conversational Recommender Systems (CRSs). Traditional CRSs often recommend items too quickly within short dialogue sessions, relying heavily on immediate user feedback. This approach contrasts sharply with human-to-human recommendation scenarios, where a recommender typically gathers comprehensive user preferences, experiences, and context before making thoughtful suggestions. Furthermore, current CRSs often under-explore the effective integration of implicit information, such as context, sentiment, and unarticulated past experiences, which are crucial in natural interactions. This discrepancy limits the practical utility and naturalness of existing systems.
The paper's entry point is to enhance the SumRec approach (Asahara et al., 2023), which uses Large Language Models (LLMs) to generate dialogue summaries from conversation history and item recommendation information from item descriptions. While SumRec aims to extract both explicit and implicit user preferences and articulate item relevance in natural language, its primary limitation is that the generated texts may sometimes lack information critical for downstream recommendation tasks (e.g., item selection or scoring). This can hinder the system's ability to interpret the relationship between user needs and item suitability. The paper aims to guide the LLM to generate precisely the essential information.
2.2. Main Contributions / Findings
The primary contributions of this research are:

- Proposed Extension to SumRec with DPO: The authors extend the SumRec framework by fine-tuning the LLM using Direct Preference Optimization (DPO). This method is specifically tailored for realistic conversational recommendation datasets, aiming to generate dialogue summaries and item recommendation information that are rich in content essential for accurate and appropriate item recommendations.
- Superior Recommendation Performance: Through extensive experiments on two public Japanese datasets (Tabidachi Corpus and ChatRec), the proposed DPO-enhanced approach demonstrates superior recommendation performance (measured by Hit Rate (HR) and Mean Reciprocal Rank (MRR)) compared to existing baseline methods, including the original SumRec. The improvements are particularly significant in enhancing the quality of top-position recommendations.
- Analysis of Generated Texts and Ablation Study: The paper provides a quantitative analysis of the generated texts, showing that DPO leads to longer summaries and item recommendation information, prioritizing task-relevant details over lexical diversity or superficial overlap with original descriptions. An ablation study further confirms that DPO training for dialogue summaries is a critical driver of overall system performance improvements.
- Human Evaluation: A human evaluation study corroborates the automated metrics, indicating that DPO significantly improves the quality of dialogue summaries in terms of consistency, fluency, and usefulness for recommendations.

These findings collectively address the challenges of unnatural dialogue processes and insufficient integration of implicit information in CRSs, leading to more human-like and effective recommendation experiences.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a foundational understanding of several key concepts is essential:

- Recommender Systems: These are information filtering systems that aim to predict the "rating" or "preference" a user would give to an item. They are ubiquitous in e-commerce (e.g., Amazon, Netflix) and help users discover new products or content. Traditional systems often rely on historical user behavior (e.g., past purchases, ratings).
- Cold-Start Problem: A common challenge in recommender systems where it's difficult to make accurate recommendations for new users or new items due to a lack of historical data.
- Conversational Recommender Systems (CRSs): An advanced type of recommender system that interacts with users through natural language dialogue to elicit their preferences and provide recommendations. CRSs aim to overcome limitations like the cold-start problem by dynamically gathering information through conversation, making the recommendation process more interactive and personalized.
- Large Language Models (LLMs): These are advanced artificial intelligence models, such as GPT (Generative Pre-trained Transformer), Llama, and PaLM, trained on vast amounts of text data to understand, generate, and process human language. LLMs are capable of various natural language processing tasks, including summarization, text generation, and question answering, by predicting the next word in a sequence. Their ability to capture context and generate coherent text makes them valuable tools for tasks like dialogue summarization and information generation.
- Fine-tuning (of LLMs): The process of taking a pre-trained LLM (trained on a general dataset) and further training it on a smaller, task-specific dataset. This adapts the model's knowledge and capabilities to a particular task or domain, improving its performance for that specific application.
- Direct Preference Optimization (DPO): A novel algorithm for fine-tuning LLMs based on human preference data. Unlike traditional reinforcement learning from human feedback (RLHF) methods that require training a separate reward model, DPO directly optimizes the language model policy to align with human preferences. It simplifies the alignment process by reparameterizing the reward function in terms of the policy, making it more stable and computationally efficient. DPO works by presenting the model with pairs of generated texts (one preferred, one not preferred) and directly optimizing the model's likelihood of generating the preferred text over the non-preferred one (a minimal sketch follows this list).
- Transformer Encoder: A key component of the Transformer neural network architecture, widely used in LLMs. The encoder processes input sequences (like text) by mapping them into a continuous representation. It uses self-attention mechanisms to weigh the importance of different words in the input sequence, capturing contextual relationships, and feed-forward neural networks for further processing. DeBERTa is an example of a model built on the Transformer encoder architecture.
- DeBERTa (Decoding-enhanced BERT with disentangled attention): A pre-trained language model that enhances the BERT (Bidirectional Encoder Representations from Transformers) architecture. Key improvements include disentangled attention (where content and position embeddings are encoded separately and then combined to compute attention scores) and an enhanced mask decoder for a better pre-training objective. It is often used for tasks like text classification, question answering, and, in this paper, as a score predictor.
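To make the DPO objective above concrete, here is a minimal PyTorch sketch. It assumes per-sequence log-probabilities have already been computed for a preferred and a dispreferred output under both the policy being trained and a frozen reference model; all names and values are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss: push the policy to prefer the chosen text over the
    rejected text, measured relative to a frozen reference model."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp        # log(pi/pi_ref) for preferred text
    rejected_logratio = policy_rejected_logp - ref_rejected_logp  # log(pi/pi_ref) for dispreferred text
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up sequence log-probabilities (one preference pair):
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                torch.tensor([-12.9]), torch.tensor([-14.8]))
print(loss.item())
```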
3.2. Previous Works
The paper contextualizes its work by discussing existing research in Conversational Recommender Systems (CRSs), dialogue summarization using LLMs, and recommendation via augmentation or refinement of item description.

- Conversational Recommender Systems (CRSs):
  - Early CRSs and Datasets (e.g., Li et al., 2018; Zhou et al., 2020; Hayati et al., 2020; Liu et al., 2021): Many prior datasets and systems (e.g., REDIAL by Li et al., 2018) exhibit a common pattern: items are recommended rapidly or in quick succession within short dialogue sessions, with subsequent recommendations often determined by immediate user feedback (as shown in Figure 1 of the paper).
  - Limitations: This focus on rapid, short-session recommendations limits their applicability to more realistic, nuanced conversations where recommendations are made more deliberately after comprehensive preference elicitation. The paper highlights this as a key gap, contrasting REDIAL (short dialogues, rapid recommendations) with Tabidachi Corpus (longer dialogues, more deliberate recommendations) (Figure 1).
  - This paper's position: Aims to foster more natural dialogue processes and integrate implicit information, moving beyond the rapid recommendation paradigm.
- Dialogue Summarization using LLMs:
  - General LLM Capabilities (e.g., GPT, Llama, PaLM): LLMs have shown significant success in various AI tasks, including summarization.
  - Existing Summarization Approaches:
    - Zhu et al. (2025a): Proposed generating factual summaries using smaller language models with GPT-3.5-Turbo as a teacher for contrastive learning. This method is primarily for short dialogues and struggles with longer conversations.
    - Zhong et al. (2022): Addressed long dialogues by pre-training Transformer-based models, which typically require substantial data and computational resources.
    - Zhang et al. (2022) - SummN: A method that fine-tunes LLMs via supervised learning to first summarize dialogue chunks and then create a final summary from these partial summaries. This is particularly relevant as the current paper adopts SummN's approach for handling long dialogues in Tabidachi Corpus.
  - This paper's position: Focuses on achieving high-quality text generation through fine-tuning (specifically DPO), without extensive pre-training, and adopts SummN's multi-stage summarization for long dialogues.
- Recommendation via Augmentation or Refinement of Item Description:
  - Lyu et al. (2024): Proposed augmenting item descriptions using LLMs for recommendations.
    - Limitation: Their study does not explicitly incorporate user preferences or experiences from the ongoing conversation into the augmentation process.
  - Li et al. (2023): Introduced a method to generate more appropriate item recommendation information by imposing vocabulary constraints.
  - Other Approaches (e.g., Cheng et al., 2023; Yang et al., 2024; Ma et al., 2024): Involve retrieving similar reviews or external information to generate item recommendation information.
  - Limitations of existing methods: They primarily focus on historical user behavior data or past reviews, making their application challenging in scenarios where only the current dialogue history and item description are available.
  - This paper's position: Proposes generating pertinent item recommendation information based solely on user preferences and experiences derived from the dialogue history, combined with the item description.
3.3. Technological Evolution
The field of recommender systems has evolved from basic collaborative filtering (e.g., item-to-item collaborative filtering by Linden et al., 2003, at Amazon) to more sophisticated deep learning models. Conversational Recommender Systems (CRSs) emerged as a way to address challenges like the cold-start problem and enhance user interaction by leveraging natural language. Initially, CRSs often relied on predefined dialogue policies or rule-based systems, or learned policies from constrained datasets.
The advent of powerful Large Language Models (LLMs) marked a significant shift. LLMs, with their ability to understand context, generate coherent text, and adapt to various tasks through fine-tuning, opened new possibilities for CRSs. Approaches like SumRec (Asahara et al., 2023) began to use LLMs for tasks like dialogue summarization and item description augmentation, making the system more flexible and capable of handling open-domain conversations.
This paper's work represents a further refinement in this evolution. While LLMs offer powerful text generation capabilities, ensuring the quality and relevance of the generated text for a specific downstream task (like recommendation scoring) remains a challenge. The integration of Direct Preference Optimization (DPO) is a crucial step in this evolution. DPO allows for a more direct and efficient way to align the LLM's text generation with desired outcomes (i.e., generating summaries and recommendation information that lead to better recommendation scores), without the complexities of traditional Reinforcement Learning from Human Feedback (RLHF). This positions the paper at the forefront of leveraging advanced LLM alignment techniques to enhance the realism and effectiveness of CRSs.
3.4. Differentiation Analysis
Compared to prior Conversational Recommender Systems (CRSs):

- Focus on Realistic Dialogue: Unlike many existing CRSs that make rapid recommendations in short sessions, this work aims for a more natural, human-like dialogue flow where comprehensive preference elicitation precedes recommendations.
- Implicit Information Integration: It explicitly addresses the under-explored area of integrating implicit user preferences and contextual information, which is a key differentiator from systems that primarily rely on explicit feedback.

Compared to SumRec (Asahara et al., 2023):

- Enhanced Information Extraction: While SumRec leveraged LLMs for dialogue summary and item recommendation information generation, it relied on prompt engineering, which could result in abstract or generic outputs lacking crucial information for the score predictor. This paper overcomes that limitation by applying Direct Preference Optimization (DPO) to fine-tune the LLM, ensuring the generated texts are optimally rich in recommendation-relevant content. SumRec did not fine-tune these generation models, relying purely on base LLM capabilities guided by prompts.
- Generalizability: The paper extends SumRec's applicability beyond specific domains (like tourist recommendations from chit-chat in ChatRec) to more general and realistic conversational scenarios (Tabidachi Corpus), aiming for improved recommendation quality across diverse datasets.

Compared to dialogue summarization using LLMs:

- Task-Specific Optimization: While previous works focused on general summarization quality (e.g., factual accuracy, handling long dialogues), this paper specifically optimizes summaries to contain information crucial for a downstream recommendation task. This goal-oriented summarization, achieved through DPO, differentiates it from generic summarization methods.

Compared to recommendation via augmentation or refinement of item descriptions:

- Dialogue-Contextualized Information: Unlike methods that augment item descriptions using historical data or external reviews, this paper generates item recommendation information dynamically based solely on the current dialogue history (user preferences and experiences) and the item description. This makes the generated information highly tailored to the ongoing conversation.
- DPO for Relevance: The use of DPO ensures that the item recommendation information not only augments the description but does so in a way that is most useful for the score predictor in determining item suitability based on user preferences.
4. Methodology
The proposed method extends SumRec by integrating Direct Preference Optimization (DPO) to refine the generation of dialogue summaries and item recommendation information, thereby enhancing recommendation performance in realistic conversational settings.
4.1. Task Definition
The task focuses on item recommendation within conversational settings.
Given:

- A dialogue history up to the current turn, consisting of alternating utterances from the operator (recommender) and the customer (recommendee).
- A set of candidate items available at that point in the dialogue.
- Item descriptions corresponding to the candidate items.

The objective is to predict the correct item that will be included in the operator's next utterance.
4.2. SumRec
The paper builds upon the SumRec framework, which involves three main components: Dialogue Summary Generation Model, Item Recommendation Information Generation Model, and Score Predictor. The overall flow is depicted in Figure 2. SumRec feeds the generated dialogue summary, item recommendation information, and the original item description into a score predictor to estimate a recommendation score for each candidate item.
The following figure (Figure 2 from the original paper) illustrates the item recommendation flow in SumRec:
This image is a schematic diagram showing the flow of generating a dialogue summary and recommendation information from the dialogue history and item description. The user's naturally expressed preferences pass through the dialogue summary generation model, and recommendation information for Tokyo Dome is ultimately generated, with a predicted score of 0.608.
4.2.1. Dialogue Summary Generation Model
This model uses a Large Language Model (LLM) to generate a dialogue summary from the dialogue history. The goal is to extract information crucial for recommendations, such as the customer's preferences and experiences.
For longer dialogues, particularly in datasets like Tabidachi Corpus, a multi-stage summarization approach inspired by Zhang et al. (2022) is adopted:

- The dialogue history is divided into chunks.
- Partial summaries are generated for each chunk.
- A final dialogue summary is generated from the concatenated partial summaries.

The specific prompts used for summary generation are provided in Appendix B.1 of the paper. A minimal sketch of this chunked flow is shown below.
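Below is a minimal sketch of this chunked, SummN-style flow. The `generate` callable stands in for any LLM text-generation function (for example, a Llama-3.1-Swallow pipeline), and the prompts and chunk size are placeholders rather than the paper's actual prompts.

```python
# Illustrative sketch of the multi-stage (SummN-style) summarization flow.

def chunk_dialogue(utterances, max_utts_per_chunk=30):
    """Split a long dialogue history into fixed-size chunks of utterances."""
    return [utterances[i:i + max_utts_per_chunk]
            for i in range(0, len(utterances), max_utts_per_chunk)]

def summarize_dialogue(utterances, generate):
    # Stage 1: summarize each chunk independently (partial summaries).
    partial_summaries = [
        generate("Summarize the customer's preferences in this dialogue chunk:\n"
                 + "\n".join(chunk))
        for chunk in chunk_dialogue(utterances)
    ]
    # Stage 2: produce the final summary from the concatenated partial summaries.
    return generate("Combine these partial summaries into one dialogue summary:\n"
                    + "\n".join(partial_summaries))
```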
4.2.2. Item Recommendation Information Generation Model
Item descriptions typically contain objective facts but often lack details about what kind of user an item is suitable for. To address this, SumRec employs an LLM to create item recommendation information based on the item description of candidate items. This generated information aims to articulate the item's relevance to user preferences and experiences in natural language. For instance, an item description might state facts about a location, while the item recommendation information might add a phrase like "It's also suitable for people who want to enjoy entertainment as various events are held there," explaining its suitability for a specific user type. The actual prompts are detailed in Appendix B.2 of the paper.
4.2.3. Score Predictor
The dialogue summary and item recommendation information generated by the respective models, along with the original item description, are concatenated using a [SEP] token. This combined text is then fed into a score predictor.
The score predictor is a pre-trained language model based on a Transformer encoder. It is trained as a regression task: items explicitly recommended within the dialogue are assigned a target score of 1, while all other candidate items are assigned 0. In the experiments, DeBERTa (He et al., 2021) is used as the score predictor. A minimal sketch of this scoring step follows.
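The sketch below illustrates the scoring step, assuming a DeBERTa checkpoint with a single-output (regression) head; the checkpoint identifier, maximum length, and preprocessing are illustrative assumptions rather than the authors' exact setup.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical Japanese DeBERTa-v3 checkpoint; the paper's exact model/revision may differ.
CKPT = "globis-university/deberta-v3-japanese-large"
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT, num_labels=1)  # regression head

def predict_score(summary: str, rec_info: str, description: str) -> float:
    """Concatenate summary, recommendation information, and description with [SEP]
    and return the predicted recommendation score."""
    text = tokenizer.sep_token.join([summary, rec_info, description])
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits.squeeze().item()
```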
4.3. Improving Information Extraction Performance using DPO
The core innovation of this study is the application of Direct Preference Optimization (DPO) to the LLMs responsible for generating dialogue summaries and item recommendation information. While SumRec aimed to include necessary information, its reliance on prompt engineering alone often resulted in outputs that were too abstract or generic, lacking the precise details crucial for accurate recommendation. This study proposes to fine-tune these generation models using DPO to ensure they produce texts that adequately contain information essential for the score predictor to make accurate predictions.
The following figure (Figure 3 from the original paper) depicts the training flow of the proposed method:
This image is a schematic diagram showing the training process of the conversational recommender system with Direct Preference Optimization (DPO). The figure is divided into two stages: Stage 1 trains the score predictor, and Stage 2 trains the generation models. Key components include the dialogue history, the dialogue summary generation model, and the recommendation information generation model.
The training process consists of two main stages:

- Stage 1: Training the score predictor.
- Stage 2: Training the dialogue summary generation model and the item recommendation information generation model using DPO.
4.3.1. Training the Score Predictor
The score predictor is trained first, as its output is used to create the preference data for DPO training of the generation models. DeBERTa is used as the score predictor.
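The sketch below illustrates this first stage as a regression over the 0/1 targets described in Section 4.2.3; the concrete scoring formulation follows as Equation 1. The mean-squared-error objective, batching, and optimizer settings are illustrative assumptions, since the paper only states that the score predictor is trained as a regression task.

```python
# Hedged sketch of Stage 1: fine-tuning the DeBERTa score predictor as a regressor.
# `examples` is a list of dicts with "summary", "rec_info", "description", and a
# 0/1 "label" (1 = item actually recommended in the dialogue).
import torch
from torch.utils.data import DataLoader

def train_score_predictor(model, tokenizer, examples, epochs=1, lr=2e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()  # assumed regression objective
    loader = DataLoader(examples, batch_size=8, shuffle=True, collate_fn=lambda b: b)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            texts = [tokenizer.sep_token.join([ex["summary"], ex["rec_info"], ex["description"]])
                     for ex in batch]
            inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
            targets = torch.tensor([float(ex["label"]) for ex in batch])
            preds = model(**inputs).logits.squeeze(-1)
            loss = loss_fn(preds, targets)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```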
The prediction score from DeBERTa is obtained by concatenating the dialogue summary $S$, the item recommendation information $R$, and the item description $E$.

The score predictor's estimation is expressed by Equation 1:

$ \hat{y} = \mathrm{DeBERTa}(S \, [\mathrm{SEP}] \, R \, [\mathrm{SEP}] \, E) $

Where:

- $\hat{y}$ is the predicted recommendation score for an item.
- $\mathrm{DeBERTa}(\cdot)$ denotes the DeBERTa model functioning as the score predictor.
- $S$ is the dialogue summary generated from the dialogue history.
- $R$ is the item recommendation information generated from the item description.
- $E$ is the original item description.

The score predictor is trained as a regression task, where the target score 1 is assigned to items that were actually recommended in the dialogue, and 0 to all other items.
4.3.2. Training the Dialogue Summary Generation Model
This section details the DPO training procedure for the dialogue summary generation model.

- Summary Generation: For a given dialogue history, the LLM first generates a set of partial summaries, which are concatenated into a combined text $PS_n$. From $PS_n$, the LLM generates multiple final dialogue summaries $S_{n,1}, \ldots, S_{n,K}$.
- Score Prediction for Summaries: Each generated summary $S_{n,k}$ is combined with the item recommendation information $R_{n,j}$ and the item description $E_{n,j}$ of a candidate item $j$, and the score predictor estimates a score. This is given by Equation 2:

  $ \hat{y}_{n,j,k} = \mathrm{DeBERTa}(S_{n,k} \, [\mathrm{SEP}] \, R_{n,j} \, [\mathrm{SEP}] \, E_{n,j}) $

  Where:
  - $\hat{y}_{n,j,k}$ is the predicted score for candidate item $j$ when using dialogue summary $S_{n,k}$.
  - $S_{n,k}$ is the $k$-th generated dialogue summary for dialogue $n$.
  - $R_{n,j}$ is the item recommendation information for candidate item $j$ in dialogue $n$.
  - $E_{n,j}$ is the item description for candidate item $j$ in dialogue $n$.
- Preference Data Creation: The absolute difference between the predicted score $\hat{y}_{n,j,k}$ and the ground-truth score $y_{n,j}$ (which is either 0 or 1) is calculated.
  - The dialogue summary that results in the prediction closest to the ground-truth score is selected as the preferred (winner) sample, denoted $S^{w}_{n,j}$. This is defined by Equation 3, where the $\arg\min$ selects the summary that minimizes the absolute difference between the ground-truth score and the predicted score:

    $ S^{w}_{n,j} = \underset{S_{n,k}}{\arg\min} \, \left| y_{n,j} - \hat{y}_{n,j,k} \right| $

  - Conversely, the dialogue summary that results in the prediction furthest from the ground-truth score is selected as the dispreferred (loser) sample, denoted $S^{l}_{n,j}$. This is defined by Equation 4, where the $\arg\max$ selects the summary that maximizes the absolute difference between the ground-truth score and the predicted score:

    $ S^{l}_{n,j} = \underset{S_{n,k}}{\arg\max} \, \left| y_{n,j} - \hat{y}_{n,j,k} \right| $

  These $(S^{w}_{n,j}, S^{l}_{n,j})$ pairs form the preference data for DPO.
- DPO Loss Function: The DPO loss function is then applied to fine-tune the dialogue summary generation model. This loss encourages the model to generate the preferred summary more frequently than the dispreferred summary. It is expressed by Equation 5:

  $ \mathcal{L}_{\mathrm{DPO}} = - \mathbb{E}_{n \in N, \, j \in M_n} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(S^{w}_{n,j} \mid PS_n)}{\pi_{\mathrm{ref}}(S^{w}_{n,j} \mid PS_n)} - \beta \log \frac{\pi_\theta(S^{l}_{n,j} \mid PS_n)}{\pi_{\mathrm{ref}}(S^{l}_{n,j} \mid PS_n)} \right) \right] $

  Where:
  - $\mathcal{L}_{\mathrm{DPO}}$ is the Direct Preference Optimization loss.
  - $\mathbb{E}$ denotes the expected value over all preference-data samples.
  - $PS_n$ is the concatenation of the partial summaries from which the final dialogue summary is generated for dialogue history $n$.
  - $S^{w}_{n,j}$ is the preferred dialogue summary for item $j$ in dialogue $n$.
  - $S^{l}_{n,j}$ is the dispreferred dialogue summary for item $j$ in dialogue $n$.
  - $N$ is the set of indices for all dialogue histories.
  - $M_n$ is the set of indices for candidate items for dialogue history $n$.
  - $\beta$ is a temperature parameter that controls the strength of the preference modeling.
  - $\pi_\theta$ is the output probability (likelihood) of the dialogue summary generation model being trained.
  - $\pi_{\mathrm{ref}}$ is the output probability of the dialogue summary generation model before DPO training (the reference model).
  - $\sigma$ is the sigmoid function, which squashes values between 0 and 1.

  The goal is to maximize the log-likelihood ratio of the preferred summary over the dispreferred summary, scaled by $\beta$ and passed through the sigmoid, thereby optimizing the model.
4.3.3. Training the Item Recommendation Information Generation Model
The item recommendation information generation model is trained using DPO in an analogous manner. The objective is to enable this model to generate item recommendation information that more effectively incorporates details crucial for item recommendation.

- Item Recommendation Information Generation: For each candidate item description whose ground-truth score is 1 (i.e., the item was actually recommended), the LLM generates multiple pieces of item recommendation information. The reason for focusing only on items with a ground-truth score of 1 is to prevent the model from being trained to consider texts lacking necessary recommendation information as good outputs.
- Score Prediction for Item Recommendation Information: A dialogue summary (generated by the LLM from the dialogue history) is combined with each generated piece of item recommendation information and the item description, and fed into the score predictor.
- Preference Data Creation: As with dialogue summaries, the absolute difference between the predicted score and the ground-truth score is calculated.
  - The item recommendation information that yields the score closest to the ground-truth score is designated as the preferred (winner) sample $R^{w}$.
  - The item recommendation information that yields the score furthest from the ground-truth score is designated as the dispreferred (loser) sample $R^{l}$.
  These pairs constitute the preference data for DPO.
- DPO Loss Function: The loss function for training the item recommendation information generation model is structurally identical to Equation 5. The key difference is that the policy now generates item recommendation information conditioned on the item description (and possibly the dialogue summary), rather than a summary conditioned on the partial summaries $PS_n$. This DPO process allows the model to learn to generate item recommendation information that better serves the recommendation task.

A compact code sketch of this preference-pair construction, which is shared by both generation models, follows.
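The sketch below illustrates the preference-pair construction (Equations 2-4), written for the summary generator but carrying over directly to the item recommendation information generator. The helper names, the number of sampled candidates `K`, and the output format ("prompt"/"chosen"/"rejected") are illustrative assumptions.

```python
# Sketch of preference-data construction: sample several candidate summaries from PS_n,
# score each against every candidate item with the trained score predictor, and keep the
# summaries closest to / furthest from the 0-or-1 ground-truth score as winner / loser.
def build_preference_pairs(ps_n, candidates, generate_summaries, predict_score, K=4):
    """candidates: list of dicts with keys 'rec_info', 'description', 'label' (0 or 1)."""
    summaries = generate_summaries(ps_n, num_samples=K)  # K candidate final summaries
    pairs = []
    for item in candidates:
        errors = [abs(item["label"] - predict_score(s, item["rec_info"], item["description"]))
                  for s in summaries]
        winner = summaries[min(range(K), key=errors.__getitem__)]  # closest to ground truth
        loser = summaries[max(range(K), key=errors.__getitem__)]   # furthest from ground truth
        pairs.append({"prompt": ps_n, "chosen": winner, "rejected": loser})
    return pairs
```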
5. Experimental Setup
5.1. Datasets
Experiments were conducted using two public Japanese datasets: Tabidachi travel agency task dialogue corpus and ChatRec.
The following are the statistics from Table 4 of the original paper:
| Metric | Tabidachi Corpus | ChatRec (T) | ChatRec (E) | ChatRec (N) | ChatRec (ALL) |
|---|---|---|---|---|---|
| Dialogues | 165 | 237 | 223 | 545 | 1,005 |
| (Train / Val / Test) | 126 / 15 / 24 | 189 / 13 / 35 | 178 / 12 / 33 | 436 / 28 / 81 | 803 / 53 / 149 |
| Utterances | 42,663 | 5,238 | 5,009 | 11,735 | 21,982 |
5.1.1. Tabidachi Corpus

- Source & Characteristics: This corpus features tourist spot recommendation dialogues between an operator and a customer planning a sightseeing trip via Zoom. The operator uses a system to find tourist information while conversing, and the customer makes travel plans based on a predefined scenario. This dataset is designed to resemble actual dialogue scenarios with longer dialogue histories.
- Data Collection Details: 55 participants (25 adults, 10 elderly, 20 children) acted as customers, each engaging in six dialogues. The study specifically used dialogues conducted without screen sharing, as visual information cannot be directly input into the LLM.
- Item Description: The item description for Tabidachi Corpus is formed by concatenating the "Summary" and "Feature" fields from the tourist destination information.
- Ground Truth: For Tabidachi Corpus, items explicitly recommended within the dialogue are assigned a target score of 1, and all others 0.

The following is a dialogue example from Table 5 of the original paper:
Operator: Hello. Thank you for using our service today. Um, regarding your travel plans, um, do you have any particular destination in mind that you would like to visit?
Customer: Yes. Um, I would like to go to Hokkaido.
Operator: Ah, yes. Um, do you have any preference for the season?
Customer: Um, around autumn, I think.
Operator: Um, how many people are planning to go?
Customer: Ah, just me, just myself alone.
Operator: Ah, understood. <> I will look into it, so please wait a moment.
Customer: Yes. Ah, yes. Yes, please do.
Operator: Um, is there anything specific you'd like to do, or any particular preferences?
Customer: Yes. Ah, well. Um, I'd like to go somewhere with beautiful autumn leaves.
Operator: Yes. Ah, there is one thing but... Yes.
Customer: Um, <>, around Sapporo and Mount Hakodate, particularly, are there any other places you'd like to visit?
Operator: Ah, yes. Around that area, if there are any recommendations.
Customer: Let me see... also Yes.
Operator: <>, it's near Sapporo but...
Customer: Yes.
Operator: There is a place called Satellite Place
Customer: Yes.
The following is an example of item description in Tabidachi Corpus from Table 6 of the original paper:
| Field | Value |
|---|---|
| SightID | 80042498 |
| Title | Former Sougenji Stone Gate (KyuuSougenji Ishimon) |
| Area | Kyushu/Okinawa > Okinawa Prefecture > Naha/Southern Main Island |
| Genre1 | See > Buildings/Historic Sites > Historical Structures |
| Genre2 | |
| Summary | A triple-arch gate made of Ryukyu limestone. The massive stone gate extending nearly 100 m was built using cut stone masonry technique and is designated as a National Important Cultural Property. The interior was the temple grounds where Sougenji Temple, which enshrined the spirits of the Sho Dynasty, once stood, but was completely destroyed during the Battle of Okinawa. |
| Time | |
| Closed | |
| Price | Free to visit |
| Tel | 098-868-4887 |
| Address | 1-9-1 Tomari, Naha City, Okinawa Prefecture |
| Station | Miebashi |
| Parking | None |
| Traffic1 | 10-minute walk from Yui Rail (Okinawa Monorail) Miebashi Station or Makishi Station |
| Traffic2 | 6 km from Okinawa Naha Airport |
| Feature | Takes about 30 minutes to visit / Recommended for women / Recommended for history enthusiasts |
| Treasure | Important Cultural Property (Structure) |
5.1.2. ChatRec Dataset

- Source & Characteristics: This dataset was used by the original SumRec paper and is included for comparison, although it does not represent realistic recommendation dialogues in the same way Tabidachi Corpus does. It comprises chit-chat dialogues between two CrowdWorks participants (minimum 10 turns each) under a "strangers in a waiting room" scenario, collected under three topic conditions: Travel (T), Except for Travel (E), and No Restriction (N). ALL refers to the combination of these conditions.
- Item Information: The tourist destination information consists of 3,290 domestic spots, filtered from Rurubu by excluding those with fewer than 100 TripAdvisor reviews. These spots were organized into 147 files, each containing 10-20 spots grouped by prefecture.
- Ground Truth: After each dialogue, one file was randomly assigned, and workers rated the spots. A "human-predicted score" (average of interest scores by five third-party workers on a 5-point scale) is provided. For this study, scores of 2 or less were converted to "dislike" (0) and scores of 3 or more to "like" (1), aligning with the Tabidachi Corpus scoring method.

The following is a dialogue example from Table 7 of the original paper:

A: What are your plans for dinner?
B: Thank you in advance. I'm planning to make ginger pork for dinner today. How about you?
A: Since it's cold, I'm thinking about having shabu-shabu, but ginger pork sounds good too.
B: That sounds nice. But I've already prepared for ginger pork today, so I'm thinking about having shabu-shabu tomorrow.
A: Pork for two days in a row, do you like pork?
B: I do like pork. I prefer chicken or pork over beef.
A: Do you use pork for curry?
B: Yes. We usually make it with pork at home. Are you perhaps a beef person?
A: We use beef at home. Does that mean you live in the eastern region?
B: Not necessarily, but for some reason we've always used pork at my home.
A: I see, what's your favorite pork dish?
B: For pork dishes, I like wrapping cheese with pork and seasoning it with a sweet and savory sauce.
A: That's quite elaborate. Do you put only cheese inside?
B: Not at all. I also add shiso leaves.
A: Is this fried, or do you just grill it?
B: It's delicious when fried too, but I'm concerned about the calories, so currently I just grill it.
A: What's the best side dish for it?
B: I'm not sure if it's the best, but I usually serve it with lettuce and cherry tomatoes.
A: Just imagining it makes me hungry.
B: Indeed. Do you like beef?
A: I do! I love steak and yakiniku (grilled meat).
A: Thank you for your time!
The following is an example of item description in the ChatRec dataset from Table 8 of the original paper:
| id | 7 |
|---|---|
| name | Sumida Park |
| description | Located alongside the Sumida River, it has long been known as a famous cherry blossom viewing spot. In spring, when approximately 500 cherry trees planted along the Sumida embankment bloom, the park becomes crowded with many flower-viewing visitors. The park, which extends from Azuma Bridge, features walking paths that make for an ideal strolling course. From the X-shaped Sakura Bridge, visitors can enjoy a view of the Sumida River below. |
5.2. Evaluation Metrics
The task involves selecting an item from a set of candidates, so standard retrieval task metrics are used: Hit Rate (HR) and Mean Reciprocal Rank (MRR).

- Hit Rate (HR):
  - Conceptual Definition: Hit Rate measures the proportion of users for whom the target item (the item actually recommended) is present within the top-K recommendations generated by the system. It indicates how often the system successfully recommends the relevant item at least once within the specified rank.
  - Mathematical Formula:
    $ \mathrm{HR@K} = \frac{\text{Number of users for whom the target item is in top-K}}{\text{Total number of users}} $
  - Symbol Explanation:
    - $\mathrm{HR@K}$: Hit Rate at rank $K$.
    - Number of users for whom the target item is in top-K: The count of unique users for whom the ground-truth recommended item appeared among the top $K$ items predicted by the system.
    - Total number of users: The total number of users (or recommendation instances) in the evaluation set.
- Mean Reciprocal Rank (MRR):
  - Conceptual Definition: Mean Reciprocal Rank measures the average of the reciprocals of the ranks of the first relevant item. It is particularly useful when only one relevant item is expected or desired, and it penalizes systems that recommend the relevant item at lower ranks. A higher MRR indicates that the relevant item is found earlier in the ranked list.
  - Mathematical Formula:
    $ \mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i} $
  - Symbol Explanation:
    - $\mathrm{MRR}$: Mean Reciprocal Rank.
    - $|Q|$: The total number of queries (or recommendation instances) in the evaluation set.
    - $\mathrm{rank}_i$: The rank position of the first relevant item for the $i$-th query. If no relevant item is found, the reciprocal rank is 0.

A small helper computing both metrics is sketched below.
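The helper below computes HR@K and MRR (with a rank cutoff, as reported in the paper's tables) from ranked candidate lists; the data layout, a list of (ranked item IDs, target item) pairs, is an illustrative assumption.

```python
# Sketch of HR@K and MRR@K over ranked candidate lists; `ranked` holds item IDs
# sorted by predicted score and `target` is the ground-truth item for that instance.
def hit_rate_at_k(instances, k):
    hits = sum(1 for ranked, target in instances if target in ranked[:k])
    return hits / len(instances)

def mrr_at_k(instances, k):
    total = 0.0
    for ranked, target in instances:
        if target in ranked[:k]:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(instances)

# Toy example: target ranked 2nd in one instance, 1st in the other.
instances = [(["a", "b", "c"], "b"), (["x", "y", "z"], "x")]
print(hit_rate_at_k(instances, 1), mrr_at_k(instances, 3))  # 0.5 0.75
```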
5.3. Baselines
The proposed method is compared against two baselines:

- Baseline: This baseline generates the dialogue summary using the LLM (Llama-3.1-Swallow-8B-v0.1) without DPO training. Only this dialogue summary and the item description are input to the score predictor. Crucially, this model does not generate or utilize item recommendation information.
- SumRec: This model represents the original SumRec approach. It uses the LLM (Llama-3.1-Swallow-8B-v0.1) without DPO training to generate both the dialogue summary and the item recommendation information. The generated dialogue summary, item recommendation information, and item description are then fed into the score predictor.
5.4. Implementation Details

- Framework: Python 3.10.12 with PyTorch (2.4.1), Hugging Face Transformers (4.46.2), Tokenizers (0.20.3), SacreBLEU (2.5.1), rouge-score (0.1.2), Fugashi (1.4.0) with MeCab, Optuna (4.1.0), Hugging Face Datasets (3.1.0), Hugging Face TRL (0.12.1).
- Text Generation Model (LLM): Llama-3.1-Swallow-8B-v0.1 (Okazaki et al., 2024; Fujii et al., 2024), a medium-scale model.
- Score Predictor Model: deberta-v3-japanese-large (352M parameters) (He et al., 2021).
- Hyperparameter Optimization: Optuna (Akiba et al., 2019) was used to optimize hyperparameters for DPO fine-tuning. The hyperparameters yielding the best recommendation performance on the validation set were selected.
- Training Runs: Each model was trained five times with the selected hyperparameters, and the average results were reported.
- Computational Resources: All experiments were run primarily on four Nvidia A100 80GB GPUs.
- Training Time: Llama-3.1-Swallow-8B training for a single epoch took approximately 24 hours on four GPUs. DeBERTa training for a single epoch took approximately 4 hours on four GPUs.
- Licensing: Tabidachi Corpus (CC BY 4.0), ChatRec (MIT License), Llama-3.1-Swallow-8B (Meta Llama 3.1 Community License + Gemma Terms of Use), deberta-v3-japanese-large (CC BY-SA 4.0).

The following are the hyperparameters for the summary generation model and item recommendation information generation model, fine-tuned using DPO on Tabidachi Corpus, from Table 9 of the original paper:

| Parameter | Summary Model (DPO) | Recommendation Model (DPO) |
|---|---|---|
| learning_rate | 1.1593 × 10−7 | 8.7340 × 10−6 |
| per_device_train_batch_size | 12 | 16 |
| num_train_epochs | 1 | 1 |
| optimizer | AdamW (β1 = 0.9, β2 = 0.999, ε = 10−8, weight_decay = 0) | AdamW (β1 = 0.9, β2 = 0.999, ε = 10−8, weight_decay = 0) |
| max_grad_norm | 1.0 | 1.0 |
| gradient_checkpointing | True | True |
| bf16 | True | True |
| disable_dropout | True | True |
| DPO-Specific Parameter | | |
| β | 0.1768 | 0.06109 |
The following are the hyperparameters for the summary generation model and item recommendation information generation model, fine-tuned using DPO on the ChatRec dataset from Table 10 of the original paper:
| Parameter | Summary Model (DPO) | Recommendation Model (DPO) |
|---|---|---|
| learning_rate | 6.4087 × 10−7 | 1.7718 × 10−7 |
| per_device_train_batch_size | 8 | 8 |
| num_train_epochs | 1 | 1 |
| optimizer | AdamW (β1 = 0.9, β2 = 0.999, ε = 10−8, weight_decay = 0) | AdamW (β1 = 0.9, β2 = 0.999, ε = 10−8, weight_decay = 0) |
| max_grad_norm | 1.0 | 1.0 |
| gradient_checkpointing | True | True |
| bf16 | True | True |
| disable_dropout | True | True |
| DPO-specific Parameter | ||
| β | 0.1253 | 0.03949 |
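As a rough illustration of how these settings could be wired up with the Hugging Face TRL version listed in the implementation details, consider the sketch below. Exact argument names may differ slightly across TRL releases, the checkpoint identifier is an assumption, and the preference data is a placeholder shaped like the earlier preference-pair sketch ("prompt"/"chosen"/"rejected" fields).

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

MODEL_ID = "tokyotech-llm/Llama-3.1-Swallow-8B-v0.1"  # assumed checkpoint identifier
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Placeholder preference data; in practice this comes from the preference-pair sketch.
pairs = [{"prompt": "concatenated partial summaries ...",
          "chosen": "winner summary ...", "rejected": "loser summary ..."}]
train_dataset = Dataset.from_list(pairs)

config = DPOConfig(
    output_dir="summary-dpo",
    beta=0.1768,                     # DPO temperature for the Tabidachi summary model (Table 9)
    learning_rate=1.1593e-7,
    per_device_train_batch_size=12,
    num_train_epochs=1,
    max_grad_norm=1.0,
    gradient_checkpointing=True,
    bf16=True,
    disable_dropout=True,
)
trainer = DPOTrainer(model=model, args=config,
                     train_dataset=train_dataset, processing_class=tokenizer)
trainer.train()  # the reference model defaults to a frozen copy of `model`
```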
6. Results & Analysis
6.1. Core Results Analysis
The experimental results, presented in Table 1, compare the proposed method (Ours) against the Baseline and SumRec on two datasets: Tabidachi Corpus and ChatRec. The metrics used are Hit Rate (HR) and Mean Reciprocal Rank (MRR) at different rank cutoffs ( and ).
The following are the comparison results of Hit Rate (HR) and Mean Reciprocal Rank (MRR) from Table 1 of the original paper:
| Dataset | Method | Metric | @1 | @3 | @5 |
|---|---|---|---|---|---|
| Tabidachi Corpus | Baseline | HR ↑ | 0.2439 | 0.5056 | 0.7146 |
| | | MRR ↑ | 0.2439 | 0.3587 | 0.4057 |
| | SumRec | HR ↑ | 0.2040 | 0.5376 | 0.7574 |
| | | MRR ↑ | 0.2040 | 0.3527 | 0.4032 |
| | Ours | HR ↑ | 0.2474 | 0.5525 | 0.7231 |
| | | MRR ↑ | 0.2474 | 0.3796 | 0.4181 |
| ChatRec | Baseline | HR ↑ | 0.8423 | 0.9799 | 0.9933 |
| | | MRR ↑ | 0.8423 | 0.9049 | 0.9081 |
| | SumRec | HR ↑ | 0.8255 | 0.9698 | 1.0 |
| | | MRR ↑ | 0.8255 | 0.8915 | 0.8984 |
| | Ours | HR ↑ | 0.8591 | 0.9832 | 0.9933 |
| | | MRR ↑ | 0.8591 | 0.9172 | 0.9196 |
Analysis for Tabidachi Corpus:

- The proposed method (Ours) achieves the best MRR at every rank cutoff and the highest HR@1 and HR@3.
- Specifically, Ours attains HR@1 of 0.2474, HR@3 of 0.5525, and MRR@5 of 0.4181, while SumRec retains the highest HR@5 (0.7574 vs. 0.7231 for Ours).
- The gains, especially at HR@3 and in MRR, indicate that the DPO-enhanced method substantially improves the quality of the candidate list presented to users, making it more likely that the correct item is found earlier.
- It is noteworthy that SumRec shows a slightly lower HR@1 and MRR@1 than Baseline, but higher HR@3 and HR@5. This suggests SumRec surfaces more relevant items at slightly lower ranks, while the Baseline is better at finding the absolute top item. Ours, in turn, delivers the strongest top-rank performance and the best MRR overall.
Analysis for ChatRec:

- ChatRec inherently presents a task with very high baseline performance, with HR@5 already near 1.0 for all methods. This might be due to its nature as a chit-chat dataset with more general item suitability.
- Despite this high baseline, Ours still achieves comparable or slightly superior HR levels (HR@1: 0.8591, HR@3: 0.9832, HR@5: 0.9933). Notably, SumRec achieves HR@5 of 1.0, but Ours is very close.
- Crucially, Ours consistently achieves the best MRR across all rank cutoffs (MRR@1: 0.8591, MRR@3: 0.9172, MRR@5: 0.9196). This indicates that even when multiple methods find the correct item, Ours is more precise in placing it at the very top of the recommendation list.
- These results confirm that the proposed method improves the quality of top-position recommendations across diverse datasets, contributing to more rapid and highly accurate recommendations crucial for practical applications.
6.2. Analysis of Generated Texts
The paper conducts a quantitative analysis of the generated texts (dialogue summaries and item recommendation information) to understand the impact of DPO. Avg. Len. (average length), Distinct-1/2 (lexical diversity), BLEU, and ROUGE-L (n-gram similarity to original item descriptions) are measured.
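For reference, Distinct-n is the ratio of unique n-grams to total n-grams over the generated texts. The sketch below uses naive whitespace tokenization for illustration; Japanese text would need a morphological tokenizer such as MeCab (via fugashi, as listed in the implementation details).

```python
# Sketch of the Distinct-1/2 lexical-diversity metric reported in Table 2.
def distinct_n(texts, n):
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

print(distinct_n(["the park is large the park is quiet"], 2))  # 5 unique / 7 total ≈ 0.714
```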
The following are the automatic analysis results of dialogue summaries and item recommendation information from Table 2 of the original paper:
| Method | Avg. Len. | Distinct-1/2 | BLEU | ROUGE-L |
|---|---|---|---|---|
| Dialogue Summary | | | | |
| SumRec | 118.6 | 0.251 / 0.611 | | |
| Proposed | 151.2 | 0.187 / 0.526 | | |
| Item Recommendation Information | | | | |
| SumRec | 149.7 | 0.247 / 0.586 | 3.608 | 0.087 |
| Proposed | 247.2 | 0.164 / 0.433 | 1.455 | 0.019 |
Analysis for Dialogue Summaries:

- Average Length: The Proposed method's dialogue summaries are significantly longer (151.2) than SumRec's (118.6). This suggests that DPO enables the model to retain and include more detailed user preferences and conversational context within the summary.
- Distinct-1/2: Both Distinct-1 (0.187 vs. 0.251) and Distinct-2 (0.526 vs. 0.611) scores decreased for the Proposed method. This indicates reduced lexical diversity, meaning the model tends to reuse the same keywords or phrases. The authors interpret this as a consequence of DPO guiding the model to prioritize and preserve phrases deemed important by the score predictor, even if it leads to less varied vocabulary.
Analysis for Item Recommendation Information:

- Average Length: Similar to summaries, the item recommendation information generated by the Proposed method is much longer (247.2) compared to SumRec's (149.7). This again implies a greater emphasis on incorporating more explanatory elements relevant to recommendation.
- Distinct-1/2: Distinct-1 (0.164 vs. 0.247) and Distinct-2 (0.433 vs. 0.586) also decreased, suggesting lower lexical diversity, consistent with the dialogue summaries.
- BLEU and ROUGE-L: The n-gram similarity metrics BLEU (1.455 vs. 3.608) and ROUGE-L (0.019 vs. 0.087), calculated against the original item descriptions, significantly decreased for the Proposed method. This indicates that the DPO-trained model prioritizes generating new, explanatory content that is useful for recommendation, rather than merely paraphrasing or having superficial overlap with the factual item description. The generated item recommendation information moves further away from a direct rephrasing of the original description, focusing on adding suitability and user-centric details.

Overall Interpretation: These observations imply that DPO successfully optimized both dialogue summaries and item recommendation information towards "adequately containing information necessary for the task," even at the cost of lexical diversity or direct textual similarity to original sources, which is a desirable outcome for the system's performance.
The following is an example of output sentences before and after DPO from Table 11 of the original paper:
| Dialogue History | Operator: Thank you for using our service today. Customer: My pleasure. |
| Operator: Thank you. Ma'am/Sir, are you here for a travel consultation today? | |
| .. (omitted) ... Customer: Hmm. If possible, I'd like to visit a remote island. | |
| .…. (omitted) .. Operator: Yes, that's right... Ma'am/Sir, earlier you mentioned you like Agu pork... | |
| Customer: Yes. Uh-huh. Motobu beef, yes. ... (omitted) ... Customer: Ah, is that your top recommendation? Operator: Hmm, there might be better places—<> well, ah. | |
| Item description | Enjoy reasonably priced dishes made with carefully selected ingredients, including A5 and A4 grade Motobu beef, Agu pork, Yanbaru young chicken, and exceptionally fresh seafood chosen by the head chef. Duration: 30-60 minutes / English menu available / 3000-5000 yen (dinner) / Recommended for business entertainment. |
| SumRec Dialogue Summary | A woman planning a trip to Okinawa wants to gaze at the sea in beach sandals and ride an ox-drawn cart. She is also looking forward to Agu pork dishes. |
| SumRec Item Recommendation Information | This is a restaurant located in Motobu Town, Okinawa Prefecture, where you can enjoy delicious dishes made with A-grade beef and fresh seafood. It is also recommended for business entertainment. |
| Proposed Dialogue Summary | ... the sea and also wishing to stay on a remote island. She is also interested in a tour along the coast in an ox-drawn cart and dishes made with Agu pork. Ms. A is introducing ... |
| Proposed Item Recommendation Information | This restaurant offers you the opportunity to savor the luxurious taste of A-rank Kuroge Wagyu steak and abundant local seafood. The restaurant has a calm atmosphere, perfect for a date with a loved one or an anniversary dinner. Private rooms are also available for a relaxing time with family or friends. You can enjoy delicious food at a reasonable price, so it's easy on your wallet. |
The following is an example of an item recommendation information containing incorrect information from Table 12 of the original paper:
| Item Information | KiKiYOKOCHO is a new concept zone that gathers items to tickle women's sensibilities by mixing beauty, food, and miscellaneous goods. The concept is "try, find, enjoy." For those who want to compare and try things they are interested in to find what matches their personal preferences. A collection of shops that fulfills such selfish desires. It's packed with unprecedented enjoyment. Duration: around 30-60 minutes. English pamphlets available. |
| Item Recommendation Information by Proposed Method | This is a shopping mall targeted at women, featuring stores from various genres such as beauty, gourmet, fashion, and interior design. The interior is stylish and has a calm atmosphere, allowing you to enjoy shopping at a leisurely pace. Additionally, English signboards are available, so foreign visitors can also use it with peace of mind. The shop staff are also kind and helpful, so even first-time visitors can visit casually. |
The following are examples of positive (Winner) and negative (Loser) Item Recommendation Information used in DPO from Table 13 of the original paper:
| Winner | Rows of old buildings create an atmosphere of traditional Japan. There is also a spacious park where families with children can enjoy themselves with peace of mind. Within the park, there is an exhibition hall where visitors can learn about the region's traditions and culture, making it an attractive spot especially for families. In particular, the area is cool and pleasant in summer, making it highly recommended for family visits. |
|---|---|
| Loser | Rows of old buildings stand, and various exhibitions are held. There is also a large park where families with children can play safely. Visitors can also enjoy light hiking and experience nature. In summer, it is cool and an ideal place for children to have fun. |
6.3. Ablation Study
An ablation study was conducted on the Tabidachi Corpus to assess the individual contributions of DPO training for the dialogue summary generation model and the item recommendation information generation model.
The following are the ablation study results on Tabidachi Corpus from Table 3 of the original paper:
| Method | Metric | @1 | @3 | @5 |
|---|---|---|---|---|
| Ours | HR ↑ | 0.2474 | 0.5525 | 0.7231 |
| | MRR ↑ | 0.2474 | 0.3796 | 0.4181 |
| w/o Rec-DPO | HR ↑ | 0.2393 | 0.5560 | 0.7402 |
| | MRR ↑ | 0.2393 | 0.3772 | 0.4195 |
| w/o Sum-DPO | HR ↑ | 0.2341 | 0.5176 | 0.7363 |
| | MRR ↑ | 0.2341 | 0.3554 | 0.4051 |
Analysis:

- w/o Rec-DPO (DPO only on summary generation): This method removes DPO from the item recommendation information generation model, meaning only the dialogue summary generation model is DPO-trained.
  - It generally surpassed SumRec (refer to Table 1 for SumRec scores on Tabidachi Corpus) on most metrics, with notable gains in HR and MRR at higher ranks. For instance, its HR@3 (0.5560) is higher than SumRec's (0.5376), and its MRR scores are also better than SumRec's.
  - This indicates that enhancing the quality of the dialogue summary (allowing user preferences to be reflected more precisely) significantly improves the relevance of initially presented items.
- w/o Sum-DPO (DPO only on recommendation information generation): This method removes DPO from the dialogue summary generation model, meaning only the item recommendation information generation model is DPO-trained.
  - While it showed some improvement over the Baseline in some aspects, its effect was not as pronounced as that of w/o Rec-DPO. For example, its HR@1 (0.2341) and MRR@1 (0.2341) are lower than those of the Baseline (0.2439) and w/o Rec-DPO (0.2393).
  - The performance gap tended to widen at higher ranks, suggesting that while improving recommendation information offers a supplementary benefit, refining the dialogue summary (which forms the foundation of the recommendation process) is more critical.
- Ours (DPO on both models): The full proposed method, with DPO training for both generation models, demonstrated the strongest top-rank performance (HR@1: 0.2474; MRR@1: 0.2474; MRR@3: 0.3796), although w/o Rec-DPO retains slightly higher HR@3/@5 and MRR@5. This supports the claim that fine-tuning both the summary and recommendation information generation synergistically enhances the quality of user preference representation and item description, further boosting recommendation accuracy where it matters most.

In summary, the ablation study highlights that DPO training for the dialogue summary is particularly impactful, serving as the primary driver of improved recommendation performance. While DPO on item recommendation information also contributes, its effect is less pronounced.
6.4. Human Evaluation
A human evaluation was conducted using CrowdWorks to assess the quality of generated dialogue summaries and item recommendation information from Ours and SumRec on the Tabidachi Corpus. Ten crowd workers evaluated outputs based on four criteria: Consistency, Conciseness, Fluency, and Usefulness. The evaluation covered 54 recommendation dialogues and their item descriptions.
The following figure (Figure 4 from the original paper) illustrates the results of the human evaluation:

Analysis:

- Dialogue Summaries:
  - The Proposed method outperformed SumRec on Consistency (47.72% win rate vs. 27.90% loss), Fluency (47.43% vs. 35.24%), and Usefulness (51.54% vs. 29.52%).
  - The most significant difference was observed in Usefulness, where approximately half of the evaluators rated Ours summaries as superior. This strongly suggests that DPO effectively enhanced summary quality by enabling more accurate capture of user preferences relevant for recommendations.
  - Conciseness showed no substantial difference (43.47% win vs. 44.35% loss), indicating that Ours improved other aspects without significantly compromising brevity, despite the slight tendency towards verbosity observed in the automatic metrics.
- Item Recommendation Information:
  - Conversely, SumRec performed better across all metrics for item recommendation information (Figure 4, lower part). This finding aligns with the ablation study results (Table 3), where applying DPO solely to item recommendation information (w/o Sum-DPO) did not yield performance gains as consistent or as strong as DPO on summaries.
  - The authors mitigate this decline in quality by noting that item recommendation information is intended for internal use by the score predictor and is not directly presented to the user, so its human-perceived quality may be less critical than its machine-readable utility.

Conclusion from Human Evaluation: The human evaluation corroborates that DPO training significantly improves the quality of dialogue summaries, particularly their ability to extract recommendation-relevant information, which aligns with the ablation study's finding that enhanced dialogue summaries are key drivers of overall system performance.
The following are the requests to Crowd Workers from Figure 5 of the original paper:
This image is an illustration showing the details of the request presented to crowd workers, consisting of a series of instructions. The instructions emphasize answering the survey honestly and remind participants about privacy and data-usage considerations.
The following is the English version of the requests to Crowd Workers from Figure 7 of the original paper: Request for Survey Cooperation
This survey aims to evaluate the quality of dialogue summary texts and tourist spot recommendation texts to help improve our system in the future. The survey takes approximately 30-40 minutes to complete. Your responses will be processed statistically and your personal information will not be identified, so please feel comfortable participating.
Response Procedure
- Evaluation of Summary Texts (13 questions total)
- We present summary texts ① and ② generated based on the dialogue history posted on this Notion page. For each question, please select the number of the summary text that you feel is superior in each of the following four aspects:
  - Consistency: How well the content of the summary text matches the content of the dialogue history (accuracy of facts, reflection of important points)
  - Conciseness: Whether it conveys necessary information efficiently without unnecessary verbose expressions (not simply fewer characters, but density and efficiency of information)
  - Fluency: Readability of the text, natural expressions, and logical connections (no unnatural phrasing, reads smoothly)
  - Usefulness: Whether it is possible to make tourist spot recommendations to the person being recommended in the dialogue after reading this summary (whether the hobbies and preferences of the person being recommended (Speaker B) are reflected)
Please keep the Notion page open while proceeding with the survey.
- Evaluation of Tourist Spot Recommendation Texts (12 questions total)
  - We present tourist spot recommendation texts ① and ② generated based on the "tourist spot information" shown in each question. For each question, please select the number of the recommendation text that you feel is superior in each of the following four aspects:
    - Consistency: How well the content of the recommendation text matches the content of the tourist spot information (accuracy of facts, reflection of important points)
    - Conciseness: Whether it conveys necessary information efficiently without unnecessary verbose expressions (not simply fewer characters, but density and efficiency of information)
    - Fluency: Readability of the text, natural expressions, and logical connections (no unnatural phrasing, reads smoothly)
    - Usefulness: Whether you can understand what kind of person the tourist spot is recommended for by reading this recommendation text (whether important features and benefits are clearly communicated)
Notes
- There are no right or wrong answers. Please answer honestly based on your impressions.
- If you close your browser in the middle of the survey, your responses may be lost, so please do not close the screen until you press the submit button.
- The information obtained in this survey will only be used for research purposes, and the results will be anonymized when published or shared.
The following is the Crowdworker response screen from Figure 6 of the original paper:
The image shows the questionnaire response interface, presenting several questions and answer options used to collect participants' evaluations of the different summaries. Participants select an answer for each question, with numbered labels distinguishing the options.
7. Conclusion & Reflections
7.1. Conclusion Summary
This study successfully proposed a novel method to enhance Conversational Recommender Systems (CRSs) by integrating Direct Preference Optimization (DPO) into the generation of dialogue summaries and item recommendation information. By fine-tuning Large Language Models (LLMs) with DPO, the system is guided to produce texts that are rich in information crucial for effective recommendations, thereby fostering more natural and realistic conversational processes. Experimental results on two public datasets (Tabidachi Corpus and ChatRec) demonstrated that the proposed method achieved superior recommendation performance (higher Hit Rate and Mean Reciprocal Rank) compared to existing baselines, including the original SumRec. An ablation study and human evaluation confirmed that DPO training for dialogue summaries was a particularly critical factor in boosting the overall system's efficacy by enhancing the extraction of recommendation-useful information.
7.2. Limitations & Future Work
The authors acknowledge several limitations of their work:
- Model Scale: The study utilized medium-scale LLMs (Llama-3.1-Swallow-8B-v0.1 and DeBERTa-v3-japanese-large), not state-of-the-art models with hundreds of billions of parameters. While larger models might offer enhanced performance, they come with significant trade-offs in GPU memory consumption and inference latency, presenting operational cost challenges.
- Narrow Evaluation Scope: Experiments were conducted exclusively on two Japanese datasets, primarily within the travel domain (Tabidachi Corpus and ChatRec). This limits the generalizability of the method, as its effectiveness across other domains and languages remains unverified.
- Content Hallucination: A persistent limitation is the occurrence of hallucinations in the generated item recommendation information and dialogue summaries. The model sometimes fabricates features not present in the source content. Even if not shown directly to users, such fabrications can adversely affect the model's explainability and potentially lead to inaccurate recommendations (as exemplified in Table 12).

Future work will focus on:
- Further improving recommendation performance while maintaining the quality (e.g., factual accuracy) of generated item recommendation information.
- Addressing potential risks such as data-specific biases, content hallucination, and misuse of the system.
7.3. Personal Insights & Critique
This paper presents a thoughtful and incremental yet significant improvement to Conversational Recommender Systems (CRSs). The explicit focus on addressing the "unrealistic rapid recommendation" problem and integrating implicit preferences is highly valuable, moving CRSs closer to natural human interaction.
Strengths:
- Principled Approach to LLM Alignment: The use of Direct Preference Optimization (DPO) is a strong methodological choice. It offers a more stable and efficient way to align LLM outputs with a specific task objective (improving recommendation scores) than traditional RLHF, and generating preference data from score predictor performance is a clever self-bootstrapping mechanism (see the preference-pair sketch after this list).
- Clear Problem Definition and Solution: The paper clearly articulates the gap in existing CRSs and proposes a well-structured solution building upon SumRec. The two-stage training flow is logical and well-explained.
- Comprehensive Evaluation: The combination of automatic metrics (HR, MRR), text generation analysis (length, diversity, n-gram similarity), ablation studies, and human evaluation provides a robust validation of the proposed method's effectiveness and sheds light on why it works (e.g., the critical role of DPO-trained summaries).
- Practical Relevance: Enhancing the naturalness and accuracy of conversational recommendations has immense practical value for applications ranging from e-commerce to travel planning.
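To illustrate the self-bootstrapping idea referenced in the first strength, here is a minimal sketch assuming hypothetical helper functions generate_summary and predict_score (these names and the sampling strategy are illustrative, not the authors' implementation): two candidate summaries are scored by the downstream score predictor, and the higher-scoring one becomes the "chosen" response of a DPO preference pair.

```python
def build_preference_pair(dialogue, gold_item, generate_summary, predict_score):
    """Construct one DPO preference pair by letting the downstream score
    predictor rank two candidate summaries of the same dialogue."""
    cand_a = generate_summary(dialogue)          # candidate 1 (e.g., sampled with temperature)
    cand_b = generate_summary(dialogue)          # candidate 2 (a different sample)
    score_a = predict_score(cand_a, gold_item)   # recommendation score for the ground-truth item
    score_b = predict_score(cand_b, gold_item)
    chosen, rejected = (cand_a, cand_b) if score_a >= score_b else (cand_b, cand_a)
    return {"prompt": dialogue, "chosen": chosen, "rejected": rejected}
```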
Potential Issues/Areas for Improvement:
- Hallucination Mitigation: While acknowledged as a limitation, hallucination in the generated item recommendation information is significant. Although these texts are internal, incorrect facts could lead to misguided recommendations or, if exposed, erode user trust. Future work could explore more robust fact-checking mechanisms or constrained generation techniques during DPO training to penalize factual inaccuracies more heavily.
- Generalizability Beyond the Japanese Travel Domain: Evaluating on only two Japanese datasets, primarily in the travel domain, limits direct generalizability. Replication on diverse domains (e.g., movies, books, products) and languages would strengthen the claims of a "realistic conversational recommendation" system.
- Interpretability of DPO's Effect on Recommendation Information: The human evaluation showed a decrease in perceived quality for item recommendation information generated by the DPO-trained model, despite overall system performance improvements. This suggests a trade-off: what is "useful" for the score predictor (e.g., specific keywords, longer explanations) is not always "natural" or "concise" for human judges. Further analysis could investigate whether this less natural output would be harmful if it were ever exposed to users, or whether machine utility and human readability can be optimized jointly.
- Computational Cost: The reliance on larger LLMs and their associated costs highlights a practical barrier. Exploring parameter-efficient fine-tuning (PEFT) methods beyond gradient checkpointing (e.g., LoRA, QLoRA) could enable leveraging larger models without prohibitive costs (a minimal LoRA sketch follows this list).
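As a concrete illustration of the PEFT direction suggested in the last point, here is a minimal LoRA setup with the Hugging Face peft library; the hyperparameters, target modules, and model identifier are assumptions for illustration, not settings from the paper.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base checkpoint name is assumed; substitute the model actually used.
base = AutoModelForCausalLM.from_pretrained("tokyotech-llm/Llama-3.1-Swallow-8B-v0.1")

lora_cfg = LoraConfig(
    r=16,                                  # low-rank adapter dimension
    lora_alpha=32,                         # adapter scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()          # only the LoRA adapters are trainable
```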
Transferability and Application:
The core idea of using a score predictor to generate preference data for DPO fine-tuning of text generation models is highly transferable. This paradigm could be applied to any task where LLMs generate intermediate texts that feed into a downstream model (a sketch of the underlying DPO objective follows the list below). For example:
- Summarization for Question Answering: Fine-tune a summarizer to create summaries that are most useful for a downstream QA model to answer questions accurately.
- Explanation Generation for XAI: Generate explanations that not only sound plausible but also help a user debug or understand a model's decision more effectively, as judged by a separate explanation evaluator.
- Dialogue Policy Learning: DPO could directly optimize dialogue policy generation in other interactive AI systems.
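For reference, here is a minimal PyTorch sketch of the generic DPO objective that such scorer-derived preference pairs would be optimized against; the sequence log-probabilities are assumed to be precomputed, and this is the standard DPO loss, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective over a batch of (chosen, rejected) sequence log-probs.
    beta controls how strongly the policy is kept close to the reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```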
Overall, this paper provides a valuable contribution to the field of conversational AI by demonstrating a practical and effective way to align LLM generation with specific task objectives, paving the way for more intelligent and human-centric interactive systems.