
Refining Text Generation for Realistic Conversational Recommendation via Direct Preference Optimization

Published: 08/27/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper introduces an improved Conversational Recommender System method using Large Language Models to generate dialogue summaries and recommendation information, capturing both explicit and implicit user preferences. Direct Preference Optimization (DPO) is employed to ensure the generated texts are rich in content crucial for effective recommendations, and experiments on two public datasets confirm the approach's effectiveness.

Abstract

Conversational Recommender Systems (CRSs) aim to elicit user preferences via natural dialogue to provide suitable item recommendations. However, current CRSs often deviate from realistic human interactions by rapidly recommending items in brief sessions. This work addresses this gap by leveraging Large Language Models (LLMs) to generate dialogue summaries from dialogue history and item recommendation information from item description. This approach enables the extraction of both explicit user statements and implicit preferences inferred from the dialogue context. We introduce a method using Direct Preference Optimization (DPO) to ensure dialogue summary and item recommendation information are rich in information crucial for effective recommendations. Experiments on two public datasets validate our method's effectiveness in fostering more natural and realistic conversational recommendation processes. Our implementation is publicly available at: https://github.com/UEC-InabaLab/Refining-LLM-Text

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Refining Text Generation for Realistic Conversational Recommendation via Direct Preference Optimization

1.2. Authors

Manato Tajiri and Michimasa Inaba, affiliated with The University of Electro-Communications, Tokyo, Japan. Their contact emails are t2530085@gl.cc.uec.ac.jp and m-inaba@uec.ac.jp.

1.3. Journal/Conference

The paper is listed as published at an unspecified venue with a publication date of 2025-08-27. Given the arxiv.org link, it is likely a preprint for a future conference or journal submission, common in academic research.

1.4. Publication Year

2025

1.5. Abstract

The paper addresses a significant limitation in current Conversational Recommender Systems (CRSs), which often generate rapid item recommendations in brief sessions, deviating from natural human interactions. To bridge this gap, the authors propose a method that utilizes Large Language Models (LLMs) to generate dialogue summaries from dialogue history and item recommendation information from item descriptions. This dual generation aims to capture both explicit user statements and implicit preferences. A key innovation is the application of Direct Preference Optimization (DPO) to fine-tune the LLMs, ensuring that the generated texts are rich in information crucial for effective recommendations. Experiments on two public datasets demonstrate the method's superiority in fostering more natural and realistic conversational recommendation processes. The implementation is publicly available.

Official Source: https://arxiv.org/abs/2508.19918
PDF Link: https://arxiv.org/pdf/2508.19918v3.pdf
Publication Status: Preprint (arXiv)

2. Executive Summary

2.1. Background & Motivation

The core problem the paper addresses is the lack of realism in many existing Conversational Recommender Systems (CRSs). Traditional CRSs often recommend items too quickly within short dialogue sessions, relying heavily on immediate user feedback. This approach contrasts sharply with human-to-human recommendation scenarios, where a recommender typically gathers comprehensive user preferences, experiences, and context before making thoughtful suggestions. Furthermore, current CRSs often under-explore the effective integration of implicit information, such as context, sentiment, and unarticulated past experiences, which are crucial in natural interactions. This discrepancy limits the practical utility and naturalness of existing systems.

The paper's entry point is to enhance the SumRec approach (Asahara et al., 2023), which uses Large Language Models (LLMs) to generate dialogue summaries from conversation history and item recommendation information from item descriptions. While SumRec aims to extract both explicit and implicit user preferences and articulate item relevance in natural language, its primary limitation is that the generated texts may sometimes lack information critical for downstream recommendation tasks (e.g., item selection or scoring). This can hinder the system's ability to interpret the relationship between user needs and item suitability. The paper aims to guide the LLM to generate precisely the essential information.

2.2. Main Contributions / Findings

The primary contributions of this research are:

  • Proposed Extension to SumRec with DPO: The authors propose an extension to the SumRec framework by fine-tuning the LLM using Direct Preference Optimization (DPO). This method is specifically tailored for realistic conversational recommendation datasets, aiming to generate dialogue summaries and item recommendation information that are rich in content essential for accurate and appropriate item recommendations.

  • Superior Recommendation Performance: Through extensive experiments on two public Japanese datasets (Tabidachi Corpus and ChatRec), the proposed DPO-enhanced approach demonstrates superior recommendation performance (measured by Hit Rate (HR) and Mean Reciprocal Rank (MRR)) compared to existing baseline methods, including the original SumRec. The improvements are particularly significant in enhancing the quality of top-position recommendations.

  • Analysis of Generated Texts and Ablation Study: The paper provides a quantitative analysis of the generated texts, showing that DPO leads to longer summaries and item recommendation information, prioritizing task-relevant details over lexical diversity or superficial overlap with original descriptions. An ablation study further confirms that DPO training for dialogue summaries is a critical driver for overall system performance improvements.

  • Human Evaluation: A human evaluation study corroborates the automated metrics, indicating that DPO significantly improves the quality of dialogue summaries in terms of consistency, fluency, and usefulness for recommendations.

    These findings collectively address the challenges of unnatural dialogue processes and insufficient integration of implicit information in CRSs, leading to more human-like and effective recommendation experiences.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a foundational understanding of several key concepts is essential:

  • Recommender Systems: These are information filtering systems that aim to predict the "rating" or "preference" a user would give to an item. They are ubiquitous in e-commerce (e.g., Amazon, Netflix) and help users discover new products or content. Traditional systems often rely on historical user behavior (e.g., past purchases, ratings).

  • Cold-Start Problem: A common challenge in recommender systems where it's difficult to make accurate recommendations for new users or new items due to a lack of historical data.

  • Conversational Recommender Systems (CRSs): An advanced type of recommender system that interacts with users through natural language dialogue to elicit their preferences and provide recommendations. CRSs aim to overcome limitations like the cold-start problem by dynamically gathering information through conversation, making the recommendation process more interactive and personalized.

  • Large Language Models (LLMs): These are advanced artificial intelligence models, such as GPT (Generative Pre-trained Transformer), Llama, and PaLM, trained on vast amounts of text data to understand, generate, and process human language. LLMs are capable of various natural language processing tasks, including summarization, text generation, and question answering, by predicting the next word in a sequence. Their ability to capture context and generate coherent text makes them valuable tools for tasks like dialogue summarization and information generation.

  • Fine-tuning (of LLMs): The process of taking a pre-trained LLM (trained on a general dataset) and further training it on a smaller, task-specific dataset. This adapts the model's knowledge and capabilities to a particular task or domain, improving its performance for that specific application.

  • Direct Preference Optimization (DPO): A novel algorithm for fine-tuning LLMs based on human preference data. Unlike traditional reinforcement learning from human feedback (RLHF) methods that require training a separate reward model, DPO directly optimizes the language model policy to align with human preferences. It simplifies the alignment process by reparameterizing the reward function in terms of the policy, making it more stable and computationally efficient. DPO works by presenting the model with pairs of generated texts (one preferred, one not preferred) and directly optimizing the model's likelihood of generating the preferred text over the non-preferred one.

  • Transformer Encoder: A key component of the Transformer neural network architecture, widely used in LLMs. The encoder processes input sequences (like text) by mapping them into a continuous representation. It uses self-attention mechanisms to weigh the importance of different words in the input sequence, capturing contextual relationships, and feed-forward neural networks for further processing. DeBERTa is an example of a model built on the Transformer encoder architecture.

  • DeBERTa (Decoding-enhanced BERT with disentangled attention): A pre-trained language model that enhances the BERT (Bidirectional Encoder Representations from Transformers) architecture. Key improvements include disentangled attention (where content and position embeddings are encoded separately and then combined to compute attention scores) and an enhanced mask decoder for a better pre-training objective. It's often used for tasks like text classification, question answering, and, in this paper, as a score predictor.

3.2. Previous Works

The paper contextualizes its work by discussing existing research in Conversational Recommender Systems (CRSs), dialogue summarization using LLMs, and recommendation via augmentation or refinement of item description.

  • Conversational Recommender Systems (CRSs):

    • Early CRSs and Datasets (e.g., Li et al., 2018; Zhou et al., 2020; Hayati et al., 2020; Liu et al., 2021): Many prior datasets and systems (e.g., REDIAL by Li et al., 2018) exhibit a common pattern: items are recommended rapidly or in quick succession within short dialogue sessions, with subsequent recommendations often determined by immediate user feedback (as shown in Figure 1 of the paper).
    • Limitations: This focus on rapid, short-session recommendations limits their applicability to more realistic, nuanced conversations where recommendations are made more deliberately after comprehensive preference elicitation. The paper highlights this as a key gap, contrasting REDIAL (short dialogues, rapid recommendations) with Tabidachi Corpus (longer dialogues, more deliberate recommendations) (Figure 1).
    • This paper's position: Aims to foster more natural dialogue processes and integrate implicit information, moving beyond the rapid recommendation paradigm.
  • Dialogue Summarization using LLMs:

    • General LLM Capabilities (e.g., GPT, Llama, PaLM): LLMs have shown significant success in various AI tasks, including summarization.
    • Existing Summarization Approaches:
      • Zhu et al. (2025a): Proposed generating factual summaries using smaller language models with GPT-3.5-Turbo as a teacher for contrastive learning. This method is primarily for short dialogues and struggles with longer conversations.
      • Zhong et al. (2022): Addressed long dialogues by pre-training Transformer-based models, which typically require substantial data and computational resources.
      • Zhang et al. (2022) - SummN: A method that fine-tunes LLMs via supervised learning to first summarize dialogue chunks and then create a final summary from these partial summaries. This is particularly relevant as the current paper adopts SummN's approach for handling long dialogues in Tabidachi Corpus.
    • This paper's position: Focuses on achieving high-quality text generation through fine-tuning (specifically DPO), without extensive pre-training, and adopts SummN's multi-stage summarization for long dialogues.
  • Recommendation via Augmentation or Refinement of Item Description:

    • Lyu et al. (2024): Proposed augmenting item descriptions using LLMs for recommendations.
    • Limitation: Their study does not explicitly incorporate user preferences or experiences from the ongoing conversation into the augmentation process.
    • Li et al. (2023): Introduced a method to generate more appropriate item recommendation information by imposing vocabulary constraints.
    • Other Approaches (e.g., Cheng et al., 2023; Yang et al., 2024; Ma et al., 2024): Involve retrieving similar reviews or external information to generate item recommendation information.
    • Limitations of existing methods: Primarily focus on historical user behavior data or past reviews, making their application challenging in scenarios where only the current dialogue history and item description are available.
    • This paper's position: Proposes generating pertinent item recommendation information based solely on user preferences and experiences derived from the dialogue history, combined with the item description.

3.3. Technological Evolution

The field of recommender systems has evolved from basic collaborative filtering (e.g., item-to-item collaborative filtering by Linden et al., 2003, at Amazon) to more sophisticated deep learning models. Conversational Recommender Systems (CRSs) emerged as a way to address challenges like the cold-start problem and enhance user interaction by leveraging natural language. Initially, CRSs often relied on predefined dialogue policies or rule-based systems, or learned policies from constrained datasets.

The advent of powerful Large Language Models (LLMs) marked a significant shift. LLMs, with their ability to understand context, generate coherent text, and adapt to various tasks through fine-tuning, opened new possibilities for CRSs. Approaches like SumRec (Asahara et al., 2023) began to use LLMs for tasks like dialogue summarization and item description augmentation, making the system more flexible and capable of handling open-domain conversations.

This paper's work represents a further refinement in this evolution. While LLMs offer powerful text generation capabilities, ensuring the quality and relevance of the generated text for a specific downstream task (like recommendation scoring) remains a challenge. The integration of Direct Preference Optimization (DPO) is a crucial step in this evolution. DPO allows for a more direct and efficient way to align the LLM's text generation with desired outcomes (i.e., generating summaries and recommendation information that lead to better recommendation scores), without the complexities of traditional Reinforcement Learning from Human Feedback (RLHF). This positions the paper at the forefront of leveraging advanced LLM alignment techniques to enhance the realism and effectiveness of CRSs.

3.4. Differentiation Analysis

Compared to prior Conversational Recommender Systems (CRSs):

  • Focus on Realistic Dialogue: Unlike many existing CRSs that make rapid recommendations in short sessions, this work aims for a more natural, human-like dialogue flow where comprehensive preference elicitation precedes recommendations.

  • Implicit Information Integration: It explicitly addresses the under-explored area of integrating implicit user preferences and contextual information, which is a key differentiator from systems that primarily rely on explicit feedback.

    Compared to SumRec (Asahara et al., 2023):

  • Enhanced Information Extraction: While SumRec leveraged LLMs for dialogue summary and item recommendation information generation, it relied on prompt engineering, which could result in abstract or generic outputs lacking crucial information for the score predictor. This paper overcomes this by applying Direct Preference Optimization (DPO) to fine-tune the LLM, ensuring the generated texts are optimally rich in recommendation-relevant content. SumRec did not fine-tune these generation models, relying purely on base LLM capabilities guided by prompts.

  • Generalizability: The paper extends SumRec's applicability beyond specific domains (like tourist recommendations from chit-chat in ChatRec) to more general and realistic conversational scenarios (Tabidachi Corpus), aiming for improved recommendation quality across diverse datasets.

    Compared to dialogue summarization using LLMs:

  • Task-Specific Optimization: While previous works focused on general summarization quality (e.g., factual accuracy, handling long dialogues), this paper specifically optimizes summaries to contain information crucial for a downstream recommendation task. This goal-oriented summarization, achieved through DPO, differentiates it from generic summarization methods.

    Compared to recommendation via augmentation or refinement of item description:

  • Dialogue-Contextualized Information: Unlike methods that augment item descriptions using historical data or external reviews, this paper generates item recommendation information dynamically based solely on the current dialogue history (user preferences and experiences) and the item description. This makes the generated information highly tailored to the ongoing conversation.

  • DPO for Relevance: The use of DPO ensures that the item recommendation information not only augments the description but does so in a way that is most useful for the score predictor in determining item suitability based on user preferences.

4. Methodology

The proposed method extends SumRec by integrating Direct Preference Optimization (DPO) to refine the generation of dialogue summaries and item recommendation information, thereby enhancing recommendation performance in realistic conversational settings.

4.1. Task Definition

The task focuses on item recommendation within conversational settings. Given:

  • A dialogue history $\mathcal{C} = \{u_{o_1}, u_{c_1}, \ldots, u_{o_{n-1}}, u_{c_{n-1}}\}$, where $u_{o_i}$ is an utterance from the operator (recommender) and $u_{c_i}$ is an utterance from the customer (recommendee).

  • A set of candidate items $T = \{t_1, \ldots, t_M\}$ available at that point in the dialogue.

  • Item descriptions $D = \{d_1, \ldots, d_M\}$ corresponding to the candidate items.

    The objective is to predict the correct item $t_k$ that will be included in the next operator utterance $u_{o_n}$.
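To make the notation concrete, here is a purely illustrative sketch of the task inputs; the identifiers and example items are placeholders, not taken from the datasets:

```python
# Illustrative data layout for the task definition above (names are not from the paper).
from typing import List, NamedTuple, Tuple

class Candidate(NamedTuple):
    item_id: str
    description: str  # d_m

# Dialogue history C: alternating operator / customer utterances.
dialogue_history: List[Tuple[str, str]] = [
    ("operator", "Do you have any particular destination in mind?"),
    ("customer", "I would like to go to Hokkaido, around autumn."),
]

# Candidate items T with their descriptions D at this point in the dialogue.
candidates: List[Candidate] = [
    Candidate("80042498", "A triple-arch gate made of Ryukyu limestone ..."),
    Candidate("80012345", "A spacious park known for its autumn foliage ..."),
]

# Goal: predict which candidate t_k will appear in the operator's next utterance u_{o_n},
# typically by scoring every candidate and ranking them.
```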

4.2. SumRec

The paper builds upon the SumRec framework, which involves three main components: Dialogue Summary Generation Model, Item Recommendation Information Generation Model, and Score Predictor. The overall flow is depicted in Figure 2. SumRec feeds the generated dialogue summary, item recommendation information, and the original item description into a score predictor to estimate a recommendation score for each candidate item.

The following figure (Figure 2 from the original paper) illustrates the item recommendation flow in SumRec:

Figure 2: Item recommendation flow in SumRec. Dialogue summaries and item recommendation information, generated from the dialogue history and item descriptions respectively, are fed together with the item description into the score predictor; in the illustrated example, the recommendation information for Tokyo Dome receives a predicted score of 0.608.

4.2.1. Dialogue Summary Generation Model

This model uses a Large Language Model (LLM) to generate a dialogue summary from the dialogue history $\mathcal{C} = \{u_{o_1}, u_{c_1}, \ldots, u_{o_{n-1}}, u_{c_{n-1}}\}$. The goal is to extract information crucial for recommendations, such as the customer's preferences and experiences. For longer dialogues, particularly in datasets like Tabidachi Corpus, a multi-stage summarization approach inspired by Zhang et al. (2022) is adopted:

  1. The dialogue history is divided into chunks.
  2. Partial summaries are generated for each chunk.
  3. A final dialogue summary is generated based on these concatenated partial summaries. The specific prompts used for summary generation are provided in Appendix B.1 of the paper.
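The following is a minimal sketch of this multi-stage (SummN-style) procedure; the `generate` callable, chunk size, and prompt wording are placeholders, not the paper's actual prompts (which appear in Appendix B.1):

```python
def chunk_dialogue(utterances, max_utterances=40):
    """Split the dialogue history into fixed-size chunks of utterances."""
    return [utterances[i:i + max_utterances]
            for i in range(0, len(utterances), max_utterances)]

def summarize_dialogue(utterances, generate):
    """Two-stage summarization: partial summaries per chunk, then one final summary."""
    partial_summaries = []
    for chunk in chunk_dialogue(utterances):
        text = "\n".join(f"{speaker}: {utt}" for speaker, utt in chunk)
        partial_summaries.append(
            generate(f"Summarize the customer's preferences and experiences:\n{text}"))
    # PS_n: the concatenated partial summaries, summarized once more into s.
    combined = "\n".join(partial_summaries)
    return generate(f"Write one final summary of the customer's preferences:\n{combined}")
```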

4.2.2. Item Recommendation Information Generation Model

Item descriptions typically contain objective facts but often lack details about what kind of user an item is suitable for. To address this, SumRec employs an LLM to create item recommendation information based on the item descriptions $D = \{d_1, \ldots, d_M\}$ of candidate items. This generated information aims to articulate the item's relevance to user preferences and experiences in natural language. For instance, an item description might state facts about a location, while the item recommendation information might add a phrase like "It's also suitable for people who want to enjoy entertainment as various events are held there," explaining its suitability for a specific user type. The actual prompts are detailed in Appendix B.2 of the paper.

4.2.3. Score Predictor

The dialogue summary and item recommendation information generated by the respective models, along with the original item description, are concatenated using a [SEP] token. This combined text is then fed into a score predictor. The score predictor is a pre-trained language model based on a Transformer encoder. It is trained as a regression task: items explicitly recommended within the dialogue are assigned a target score of $y = 1$, while all other candidate items are assigned $y = 0$. In the experiments, DeBERTa (He et al., 2021) is used as the score predictor.

4.3. Improving Information Extraction Performance using DPO

The core innovation of this study is the application of Direct Preference Optimization (DPO) to the LLMs responsible for generating dialogue summaries and item recommendation information. While SumRec aimed to include necessary information, its reliance on prompt engineering alone often resulted in outputs that were too abstract or generic, lacking the precise details crucial for accurate recommendation. This study proposes to fine-tune these generation models using DPO to ensure they produce texts that adequately contain information essential for the score predictor to make accurate predictions.

The following figure (Figure 3 from the original paper) depicts the training flow of the proposed method:

Figure 3: Training flow of the proposed method via DPO. The process consists of two stages: Stage 1 trains the score predictor, and Stage 2 trains the generation models. Key components include the dialogue history, the dialogue summary generation model, and the item recommendation information generation model.

The training process consists of two main stages:

  • Stage 1: Training the score predictor.
  • Stage 2: Training the dialogue summary generation model and the item recommendation information generation model using DPO.

4.3.1. Training the Score Predictor

The score predictor is trained first, as its output is used to create the preference data for DPO training of the generation models. DeBERTa is used as the score predictor. The prediction score $\hat{y}$ from DeBERTa is obtained by concatenating the dialogue summary $s$, item recommendation information $r$, and item description $d$.

The score predictor's estimation is expressed by Equation 1: $\hat{y} = \mathrm{DeBERTa}(s, r, d)$ Where:

  • $\hat{y}$ is the predicted recommendation score for an item.

  • $\mathrm{DeBERTa}(\cdot)$ denotes the DeBERTa model functioning as the score predictor.

  • $s$ is the dialogue summary generated from the dialogue history.

  • $r$ is the item recommendation information generated from the item description.

  • $d$ is the original item description.

    The score predictor is trained as a regression task, where the target score $y = 1$ is assigned to items that were actually recommended in the dialogue, and $y = 0$ to all other items.
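A minimal sketch of this scoring step, assuming a Hugging Face regression head; the checkpoint name, [SEP] concatenation details, and truncation length are illustrative (the paper uses deberta-v3-japanese-large):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative checkpoint; the paper uses a Japanese DeBERTa-v3-large model.
MODEL_NAME = "microsoft/deberta-v3-large"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# num_labels=1 turns the classification head into a single-value regressor.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)

def predict_score(summary: str, rec_info: str, description: str) -> float:
    """Equation 1: y_hat = DeBERTa(s, r, d), with the three texts joined by [SEP]."""
    text = f"{summary} {tokenizer.sep_token} {rec_info} {tokenizer.sep_token} {description}"
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        y_hat = model(**inputs).logits.squeeze().item()
    return y_hat

# Training target: y = 1 for items actually recommended in the dialogue, else y = 0,
# optimized with a regression loss (e.g., MSE) over (s, r, d) triples.
```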

4.3.2. Training the Dialogue Summary Generation Model

This section details the DPO training procedure for the dialogue summary generation model.

  1. Summary Generation: For a given dialogue history $C_n$, the LLM first generates a set of $M$ partial summaries $\{ps_1^n, \ldots, ps_M^n\}$. These are then concatenated into a combined text, $PS_n$. From $PS_n$, the LLM generates $K$ final dialogue summaries $\{s_1^n, \ldots, s_K^n\}$.
  2. Score Prediction for Summaries: For each generated summary $s_k^n$, item recommendation information $r_m^n$, and item description $d_m^n$ (corresponding to a candidate item $m$), the score predictor estimates a score $\hat{y}_{k,m}^n$. This is given by Equation 2: $\hat{y}_{k,m}^n = \mathrm{DeBERTa}(s_k^n, r_m^n, d_m^n)$ Where:
    • $\hat{y}_{k,m}^n$ is the predicted score for candidate item $m$ when using dialogue summary $s_k^n$.
    • $s_k^n$ is the $k$-th generated dialogue summary for dialogue $n$.
    • $r_m^n$ is the item recommendation information for candidate item $m$ in dialogue $n$.
    • $d_m^n$ is the item description for candidate item $m$ in dialogue $n$.
  3. Preference Data Creation: The absolute difference between the predicted score $\hat{y}_{k,m}^n$ and the ground-truth score $y_m^n$ (which is either 0 or 1) is calculated. A minimal code sketch of the resulting pair construction appears after this list.
    • The dialogue summary whose prediction is closest to the ground-truth score is selected as the preferred (winner) sample, denoted $s_{m,+}^n$. This is defined by Equation 3: $s_{m,+}^n = \arg\min_{s_k^n} |y_m^n - \hat{y}_{k,m}^n|$, i.e., the summary $s_k^n$ that minimizes the absolute difference between the ground-truth score $y_m^n$ and the predicted score $\hat{y}_{k,m}^n$.
    • Conversely, the dialogue summary whose prediction is furthest from the ground-truth score is selected as the dispreferred (loser) sample, denoted $s_{m,-}^n$. This is defined by Equation 4: $s_{m,-}^n = \arg\max_{s_k^n} |y_m^n - \hat{y}_{k,m}^n|$, i.e., the summary $s_k^n$ that maximizes that absolute difference. These pairs $(s_{m,+}^n, s_{m,-}^n)$ form the preference data for DPO.
  4. DPO Loss Function: The DPO loss function is then applied to fine-tune the dialogue summary generation model. This loss encourages the model to generate the preferred summary more frequently than the dispreferred summary. It is expressed by Equation 5:
     $$\mathcal{L}_{\mathrm{DPO}} = -\mathbf{E}_{(PS_n, s_{m,+}^n, s_{m,-}^n) \sim \{m \in \mathcal{M}_n,\, n \in \mathcal{N}\}} \left[ \log \sigma \left( \beta \log \frac{\pi_{\phi}(s_{m,+}^n \mid PS_n)}{\pi_{\phi_{\mathrm{ref}}}(s_{m,+}^n \mid PS_n)} - \beta \log \frac{\pi_{\phi}(s_{m,-}^n \mid PS_n)}{\pi_{\phi_{\mathrm{ref}}}(s_{m,-}^n \mid PS_n)} \right) \right]$$
     Where:
    • $\mathcal{L}_{\mathrm{DPO}}$ is the Direct Preference Optimization loss.
    • The expectation $\mathbf{E}$ is taken over all preference data samples $(PS_n, s_{m,+}^n, s_{m,-}^n)$ with $m \in \mathcal{M}_n$ and $n \in \mathcal{N}$.
    • $PS_n$ is the concatenation of partial summaries from which the final dialogue summary is generated for dialogue history $C_n$.
    • $s_{m,+}^n$ is the preferred dialogue summary for item $m$ in dialogue $n$.
    • $s_{m,-}^n$ is the dispreferred dialogue summary for item $m$ in dialogue $n$.
    • $\mathcal{N}$ is the set of indices of all dialogue histories.
    • $\mathcal{M}_n$ is the set of indices of candidate items for dialogue history $C_n$.
    • $\beta$ is a temperature parameter that controls the strength of the preference modeling.
    • $\pi_{\phi}(\cdot \mid \cdot)$ is the output probability (likelihood) of the dialogue summary generation model being trained.
    • $\pi_{\phi_{\mathrm{ref}}}(\cdot \mid \cdot)$ is the output probability of the dialogue summary generation model before DPO training (the reference model).
    • $\sigma(\cdot)$ is the sigmoid function, which squashes values between 0 and 1. The objective maximizes the $\beta$-scaled log-likelihood ratio of the preferred summary over the dispreferred one, passed through the sigmoid.
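As referenced in step 3 above, a minimal sketch of the winner/loser selection of Equations 3 and 4, reusing the hypothetical `predict_score` helper from the earlier sketch (all names are illustrative):

```python
def build_preference_pair(ps_n, candidate_summaries, rec_info, description, y_true):
    """Select winner/loser summaries for one (dialogue n, candidate item m) pair.

    ps_n: concatenated partial summaries PS_n (the generation context).
    candidate_summaries: the K final summaries s_1^n .. s_K^n generated from PS_n.
    y_true: ground-truth score y_m^n (1 if the item was recommended, else 0).
    """
    errors = [abs(y_true - predict_score(s, rec_info, description))  # Equation 2
              for s in candidate_summaries]
    winner = candidate_summaries[errors.index(min(errors))]  # Equation 3: closest to y
    loser = candidate_summaries[errors.index(max(errors))]   # Equation 4: furthest from y
    # Standard DPO preference format: prompt, chosen, rejected.
    return {"prompt": ps_n, "chosen": winner, "rejected": loser}
```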

4.3.3. Training the Item Recommendation Information Generation Model

The item recommendation information generation model is trained using DPO in an analogous manner. The objective is to enable this model to generate item recommendation information that more effectively incorporates details crucial for item recommendation.

  1. Item Recommendation Information Generation: For each candidate item description $d_m^n$ for which the ground-truth score $y_m^n$ is 1 (i.e., the item was actually recommended), the LLM generates $J$ pieces of item recommendation information $\{r_{m,1}^n, \ldots, r_{m,J}^n\}$. The reason for focusing only on items with a ground-truth score of 1 is to prevent the model from being trained to regard texts lacking necessary recommendation information as good outputs.
  2. Score Prediction for Item Recommendation Information: A dialogue summary $s^n$ (generated by the LLM from $C_n$) is combined with each generated item recommendation information $r_{m,j}^n$ and item description $d_m^n$, and fed into the score predictor.
  3. Preference Data Creation: Similar to dialogue summaries, the absolute difference between the predicted score and the ground-truth score $y_m^n$ is calculated.
    • The item recommendation information that yields the score closest to the ground-truth score is designated as $r_{m,+}^n$.
    • The item recommendation information that yields the score furthest from the ground-truth score is designated as $r_{m,-}^n$. These pairs $(r_{m,+}^n, r_{m,-}^n)$ constitute the preference data for DPO.
  4. DPO Loss Function: The loss function for training the item recommendation information generation model is structurally identical to Equation 5. The key difference is that the policy $\pi_{\phi}$ now generates item recommendation information $r$ conditioned on the item description $d_m^n$ (and possibly the dialogue summary $s^n$), rather than a summary $s$ conditioned on the partial summaries $PS_n$. This DPO process allows the model to learn to generate item recommendation information that better serves the recommendation task.
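For orientation, a minimal sketch of how such preference pairs might be passed to DPO fine-tuning with Hugging Face TRL, which the paper lists among its dependencies (TRL 0.12.1); the dataset contents, hyperparameter values, and exact argument names are illustrative and may differ across TRL versions:

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Generation model used in the paper (Hugging Face path assumed).
MODEL_NAME = "tokyotech-llm/Llama-3.1-Swallow-8B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Preference pairs built as in the previous sketch: prompt = item description
# (plus dialogue summary), chosen = r_{m,+}^n, rejected = r_{m,-}^n.
train_dataset = Dataset.from_list([
    {"prompt": "Item description: ...", "chosen": "Recommended for ...", "rejected": "..."},
])

config = DPOConfig(
    beta=0.06,                       # DPO temperature (cf. Tables 9-10)
    learning_rate=8.7e-6,
    per_device_train_batch_size=16,
    num_train_epochs=1,
    bf16=True,
    output_dir="dpo-rec-info",
)
trainer = DPOTrainer(model=model, args=config, train_dataset=train_dataset,
                     processing_class=tokenizer)  # ref_model defaults to a frozen copy
trainer.train()
```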

5. Experimental Setup

5.1. Datasets

Experiments were conducted using two public Japanese datasets: Tabidachi travel agency task dialogue corpus and ChatRec.

The following are the statistics from Table 4 of the original paper:

| Metric | Tabidachi Corpus | ChatRec (T) | ChatRec (E) | ChatRec (N) | ChatRec (ALL) |
| --- | --- | --- | --- | --- | --- |
| Dialogues | 165 | 237 | 223 | 545 | 1,005 |
| (Train / Val / Test) | 126 / 15 / 24 | 189 / 13 / 35 | 178 / 12 / 33 | 436 / 28 / 81 | 803 / 53 / 149 |
| Utterances | 42,663 | 5,238 | 5,009 | 11,735 | 21,982 |

5.1.1. Tabidachi Corpus

  • Source & Characteristics: This corpus features tourist spot recommendation dialogues between an operator and a customer planning a sightseeing trip via Zoom. The operator uses a system to find tourist information while conversing, and the customer makes travel plans based on a predefined scenario. This dataset is designed to resemble actual dialogue scenarios with longer dialogue histories.

  • Data Collection Details: 55 participants (25 adults, 10 elderly, 20 children) acted as customers, each engaging in six dialogues. The study specifically used dialogues conducted without screen sharing, as visual information cannot be directly input into the LLM.

  • Item Description: The item description for Tabidachi Corpus is formed by concatenating the "Summary" and "Feature" fields from the tourist destination information.

  • Ground Truth: For Tabidachi Corpus, items explicitly recommended within the dialogue are assigned a target score of y=1y = 1, and others y=0y = 0.

    The following is a dialogue example from Table 5 of the original paper:

    Operator Hello. Thank you for using our service today. Um, regarding your travel plans, um, do you have any particular destination in mind that you would like to visit?
    Customer Yes. Um, I would like to go to Hokkaido.
    Operator Ah, yes. Um, do you have any preference for the season?
    Customer Um, around autumn, I think.
    Operator Um, how many people are planning to go?
    Customer Ah, just me, just myself alone.
    Operator Ah, understood. <> I will look into it, so please wait a moment.
    Customer Yes. Ah, yes. Yes, please do.
    Operator Um, is there anything specific you'd like to do, or any particular preferences?
    Customer Yes. Ah, well. Um, I'd like to go somewhere with beautiful autumn leaves.
    Operator Yes. Ah, there is one thing but... Yes.
    Customer Um, <>, around Sapporo and Mount Hakodate, particularly, are there any other places you'd like to visit?
    Operator Ah, yes. Around that area, if there are any recommendations.
    Customer Let me see... also Yes.
    Operator <>, it's near Sapporo but...
    Customer Yes.
    Operator There is a place called Satellite Place
    Customer Yes.

The following is an example of item description in Tabidachi Corpus from Table 6 of the original paper:

SightID 80042498
Title Former Sougenji Stone Gate (Kyuu Sougenji Ishimon)
Detail Area Kyushu/Okinawa>Okinawa Prefecture>Naha/Southern Main Island
Genre1 See>Buildings/HistoricSites>Historical Structures
Genre2
Summary A triple-arch gate made of Ryukyu limestone. The massive stone gate extending nearly 100m was built using cut stone masonry technique and is designated as a National Important Cultural Property. The interior was the temple grounds where Sougenji Temple, which enshrined the spirits of the Sho Dynasty, once stood, but was completely destroyed during the Battle of Okinawa.
Time
Closed
Price Free to visit
Tel 098-868-4887
Address 1-9-1 Tomari, Naha City, Okinawa Prefecture
Station Miebashi
Parking None
Traffic1 10-minute walk from Yui Rail (Okinawa Monorail) Miebashi Station or Makishi Station
Traffic2 6 km from Okinawa Naha Airport
Feature Takes about 30 minutes to visit / Recommended for women / Recommended for history enthusiasts
Treasure Important Cultural Property (Structure)

5.1.2. ChatRec Dataset

  • Source & Characteristics: This dataset was used by the original SumRec paper and is included for comparison, although it does not represent realistic recommendation dialogues in the same way Tabidachi Corpus does. It comprises chit-chat dialogues between two CrowdWorks participants (minimum 10 turns each) under a "strangers in a waiting room" scenario, collected under three topic conditions: Travel (T), Except for Travel (E), and No Restriction (N). ALL refers to the combination of these conditions.

  • Item Information: The tourist destination information consists of 3,290 domestic spots, filtered from Rurubu by excluding those with fewer than 100 TripAdvisor reviews. These spots were organized into 147 files, each containing 10-20 spots grouped by prefecture.

  • Ground Truth: After each dialogue, one file was randomly assigned, and workers rated the spots. A "human-predicted score" (average of interest scores by five third-party workers on a 5-point scale) is provided. For this study, scores of 2 or less were converted to "dislike" (0) and scores of 3 or more to "like" (1), aligning with the Tabidachi Corpus scoring method.

    The following is a dialogue example from Table 7 of the original paper:

    A: What are your plans for dinner?
    B: Thank you in advance. I'm planning to make ginger pork for dinner today. How about you?
    A: Since it's cold, I'm thinking about having shabu-shabu, but ginger pork sounds good too.
    B: That sounds nice. But I've already prepared for ginger pork today, so I'm thinking about having shabu-shabu tomorrow.
    A: Pork for two days in a row, do you like pork?
    B: I do like pork. I prefer chicken or pork over beef.
    A: Do you use pork for curry?
    B: Yes. We usually make it with pork at home. Are you perhaps a beef person?
    A: We use beef at home. Does that mean you live in the eastern region?
    B: Not necessarily, but for some reason we've always used pork at my home.
    A: I see, what's your favorite pork dish?
    B: For pork dishes, I like wrapping cheese with pork and seasoning it with a sweet and savory sauce.
    A: That's quite elaborate. Do you put only cheese inside?
    B: Not at all. I also add shiso leaves.
    A: Is this fried, or do you just grill it?
    B: It's delicious when fried too, but I'm concerned about the calories, so currently I just grill it.
    A: What's the best side dish for it?
    B: I'm not sure if it's the best, but I usually serve it with lettuce and cherry tomatoes.
    A: Just imagining it makes me hungry.
    B: Indeed. Do you like beef?
    A: I do! I love steak and yakiniku (grilled meat).
    A: Thank you for your time!

The following is an example of item description in the ChatRec dataset from Table 8 of the original paper:

id 7
name Sumida Park
description Located alongside the Sumida River, it has long been known as a famous cherry blossom viewing spot. In spring, when approximately 500 cherry trees planted along the Sumida embankment bloom, the park becomes crowded with many flower-viewing visitors. The park, which extends from Azuma Bridge, features walking paths that make for an ideal strolling course. From the X-shaped Sakura Bridge, visitors can enjoy a view of the Sumida River below.

5.2. Evaluation Metrics

The task involves selecting an item from a set of candidates, so standard retrieval task metrics are used: Hit Rate (HR) and Mean Reciprocal Rank (MRR).

  • Hit Rate (HR):

    1. Conceptual Definition: Hit Rate measures the proportion of users for whom the target item (the item actually recommended) is present within the top-KK recommendations generated by the system. It indicates how often the system successfully recommends the relevant item at least once within the specified rank.
    2. Mathematical Formula: $ \mathrm{HR@K} = \frac{\text{Number of users for whom the target item is in top-K}}{\text{Total number of users}} $
    3. Symbol Explanation:
      • HR@K\mathrm{HR@K}: Hit Rate at rank KK.
      • Number of users for whom the target item is in top-K: The count of unique users for whom the ground-truth recommended item appeared among the top KK items predicted by the system.
      • Total number of users: The total number of users (or recommendation instances) in the evaluation set.
  • Mean Reciprocal Rank (MRR):

    1. Conceptual Definition: Mean Reciprocal Rank measures the average of the reciprocals of the ranks of the first relevant item. It is particularly useful when only one relevant item is expected or desired, and it penalizes systems that recommend the relevant item at lower ranks. A higher MRR indicates that the relevant item is found earlier in the ranked list.
    2. Mathematical Formula: $ \mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i} $
    3. Symbol Explanation:
      • MRR\mathrm{MRR}: Mean Reciprocal Rank.
      • Q|Q|: The total number of queries (or recommendation instances) in the evaluation set.
      • ranki\mathrm{rank}_i: The rank position of the first relevant item for the ii-th query. If no relevant items are found, the reciprocal rank is 0.
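A minimal sketch of computing these two metrics from ranked candidate lists (function names and the toy example are illustrative):

```python
def hit_rate_at_k(ranked_lists, targets, k):
    """HR@K: fraction of instances whose target item appears in the top-K ranking."""
    hits = sum(1 for ranking, target in zip(ranked_lists, targets)
               if target in ranking[:k])
    return hits / len(targets)

def mean_reciprocal_rank(ranked_lists, targets, k=None):
    """MRR(@K): average of 1/rank of the target item (0 if it is absent from the cutoff)."""
    total = 0.0
    for ranking, target in zip(ranked_lists, targets):
        cutoff = ranking if k is None else ranking[:k]
        if target in cutoff:
            total += 1.0 / (cutoff.index(target) + 1)
    return total / len(targets)

# Example: two recommendation instances, candidates ranked by predicted score.
ranked = [["tokyo_dome", "sumida_park", "sougenji_gate"],
          ["sumida_park", "sougenji_gate", "tokyo_dome"]]
truth = ["sumida_park", "tokyo_dome"]
print(hit_rate_at_k(ranked, truth, k=1))       # 0.0
print(mean_reciprocal_rank(ranked, truth, 3))  # (1/2 + 1/3) / 2 ≈ 0.4167
```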

5.3. Baselines

The proposed method is compared against two baselines:

  • Baseline:

    • Description: This baseline generates the dialogue summary using the LLM (Llama-3.1-Swallow-8B-v0.1) without DPO training. Only this dialogue summary and the item description are input to the score predictor. Crucially, this model does not generate or utilize item recommendation information.
  • SumRec:

    • Description: This model represents the original SumRec approach. It uses the LLM (Llama-3.1-Swallow-8B-v0.1) without DPO training to generate both the dialogue summary and the item recommendation information. The generated dialogue summary, item recommendation information, and item description are then fed into the score predictor.

5.4. Implementation Details

  • Framework: Python 3.10.12 with PyTorch (2.4.1), Hugging Face Transformers (4.46.2), Tokenizers (0.20.3), SacreBLEU (2.5.1), rouge-score (0.1.2), Fugashi (1.4.0) with MeCab, Optuna (4.1.0), Hugging Face Datasets (3.1.0), Hugging Face TRL (0.12.1).

  • Text Generation Model (LLM): Llama-3.1-Swallow-8B-v0.1 (Okazaki et al., 2024; Fujii et al., 2024). This is a medium-scale model.

  • Score Predictor Model: deberta-v3-japanese-large (352M parameters) (He et al., 2021).

  • Hyperparameter Optimization: Optuna (Akiba et al., 2019) was used to optimize hyperparameters for DPO fine-tuning. The hyperparameters yielding the best recommendation performance on the validation set were selected.

  • Training Runs: Each model was trained five times with the selected hyperparameters, and the average results were reported.

  • Computational Resources: All experiments were run primarily on four Nvidia A100 80GB GPUs.

  • Training Time: Llama-3.1-Swallow-8B training for a single epoch took approximately 24 hours on four GPUs. DeBERTa training for a single epoch took approximately 4 hours on four GPUs.

  • Licensing: Tabidachi Corpus (CC BY 4.0), ChatRec (MIT License). Llama3.1-Swallow-8B (Meta Llama 3.1 Community License + Gemma Terms of Use). deberta-v3-japanese-large (CC BY-SA 4.0).

    The following are the hyperparameters for the summary generation model and item recommendation information generation model, fine-tuned using DPO on Tabidachi Corpus from Table 9 of the original paper:

| Parameter | Summary Model (DPO) | Recommendation Model (DPO) |
| --- | --- | --- |
| learning_rate | 1.1593 × 10−7 | 8.7340 × 10−6 |
| per_device_train_batch_size | 12 | 16 |
| num_train_epochs | 1 | 1 |
| optimizer | AdamW (β1 = 0.9, β2 = 0.999, ε = 10−8, weight_decay = 0) | AdamW (β1 = 0.9, β2 = 0.999, ε = 10−8, weight_decay = 0) |
| max_grad_norm | 1.0 | 1.0 |
| gradient_checkpointing | True | True |
| bf16 | True | True |
| disable_dropout | True | True |
| β (DPO-specific) | 0.1768 | 0.06109 |

The following are the hyperparameters for the summary generation model and item recommendation information generation model, fine-tuned using DPO on the ChatRec dataset from Table 10 of the original paper:

| Parameter | Summary Model (DPO) | Recommendation Model (DPO) |
| --- | --- | --- |
| learning_rate | 6.4087 × 10−7 | 1.7718 × 10−7 |
| per_device_train_batch_size | 8 | 8 |
| num_train_epochs | 1 | 1 |
| optimizer | AdamW (β1 = 0.9, β2 = 0.999, ε = 10−8, weight_decay = 0) | AdamW (β1 = 0.9, β2 = 0.999, ε = 10−8, weight_decay = 0) |
| max_grad_norm | 1.0 | 1.0 |
| gradient_checkpointing | True | True |
| bf16 | True | True |
| disable_dropout | True | True |
| β (DPO-specific) | 0.1253 | 0.03949 |

6. Results & Analysis

6.1. Core Results Analysis

The experimental results, presented in Table 1, compare the proposed method (Ours) against the Baseline and SumRec on two datasets: Tabidachi Corpus and ChatRec. The metrics used are Hit Rate (HR) and Mean Reciprocal Rank (MRR) at different rank cutoffs (HR@1,HR@3,HR@5\mathrm{HR@1}, \mathrm{HR@3}, \mathrm{HR@5} and MRR@1,MRR@3,MRR@5\mathrm{MRR@1}, \mathrm{MRR@3}, \mathrm{MRR@5}).

The following are the comparison results of Hit Rate (HR) and Mean Reciprocal Rank (MRR) from Table 1 of the original paper:

| Dataset | Method | Metric | @1 | @3 | @5 |
| --- | --- | --- | --- | --- | --- |
| Tabidachi Corpus | Baseline | HR ↑ | 0.2439 | 0.5056 | 0.7146 |
| | | MRR ↑ | 0.2439 | 0.3587 | 0.4057 |
| | SumRec | HR ↑ | 0.2040 | 0.5376 | 0.7574 |
| | | MRR ↑ | 0.2040 | 0.3527 | 0.4032 |
| | Ours | HR ↑ | 0.2474 | 0.5525 | 0.7231 |
| | | MRR ↑ | 0.2474 | 0.3796 | 0.4181 |
| ChatRec | Baseline | HR ↑ | 0.8423 | 0.9799 | 0.9933 |
| | | MRR ↑ | 0.8423 | 0.9049 | 0.9081 |
| | SumRec | HR ↑ | 0.8255 | 0.9698 | 1.0 |
| | | MRR ↑ | 0.8255 | 0.8915 | 0.8984 |
| | Ours | HR ↑ | 0.8591 | 0.9832 | 0.9933 |
| | | MRR ↑ | 0.8591 | 0.9172 | 0.9196 |

Analysis for Tabidachi Corpus:

  • The Proposed method (Ours) consistently outperforms both Baseline and SumRec across all rank cutoffs for both HR and MRR.
  • Specifically, Ours achieves the highest HR@1 (0.2474), HR@3 (0.5525), and HR@5 (0.7231).
  • Similarly, Ours shows the best MRR@1 (0.2474), MRR@3 (0.3796), and MRR@5 (0.4181).
  • The significant improvements, especially at higher ranks (e.g., HR@3 and HR@5), indicate that the DPO-enhanced method substantially improves the quality of the candidate list presented to users, making it more likely that the correct item is found earlier.
  • It's noteworthy that SumRec shows a slightly lower HR@1 and MRR@1 than Baseline, but higher HR@3 and HR@5. This suggests SumRec might introduce more relevant items at slightly lower ranks, while the Baseline is better at finding the absolute top item. However, Ours surpasses both, demonstrating a more robust improvement across the board.

Analysis for ChatRec:

  • ChatRec inherently presents a task with very high baseline performance, with HR@5 already near 1.0 for all methods. This might be due to its nature as a chit-chat dataset with more general item suitability.
  • Despite this high baseline, Ours still manages to achieve comparable or slightly superior HR levels (HR@1: 0.8591, HR@3: 0.9832, HR@5: 0.9933). Notably, SumRec achieves HR@5 of 1.0, but Ours is very close.
  • Crucially, Ours consistently achieves the best MRR across all rank cutoffs (MRR@1: 0.8591, MRR@3: 0.9172, MRR@5: 0.9196). This indicates that even when multiple methods find the correct item, Ours is more precise in placing it at the very top of the recommendation list.
  • These results confirm that the proposed method improves the quality of top-position recommendations across diverse datasets, contributing to more rapid and highly accurate recommendations crucial for practical applications.

6.2. Analysis of Generated Texts

The paper conducts a quantitative analysis of the generated texts (dialogue summaries and item recommendation information) to understand the impact of DPO. Avg. Len. (average length), Distinct-1/2 (lexical diversity), BLEU, and ROUGE-L (n-gram similarity to original item descriptions) are measured.
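For reference, Distinct-n is conventionally computed as the ratio of unique n-grams to total n-grams over all generated texts; a minimal sketch, with whitespace tokenization as a simplifying assumption (the paper's Japanese texts would require a tokenizer such as MeCab):

```python
def distinct_n(texts, n):
    """Distinct-n: unique n-grams divided by total n-grams across all generated texts."""
    total, unique = 0, set()
    for text in texts:
        tokens = text.split()  # simplification; Japanese text needs a morphological tokenizer
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

summaries = ["the customer wants to visit Hokkaido in autumn",
             "the customer wants beautiful autumn leaves"]
print(distinct_n(summaries, 1), distinct_n(summaries, 2))
```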

The following are the automatic analysis results of dialogue summaries and item recommendation information from Table 2 of the original paper:

| Method | Avg. Len. | Distinct-1/2 | BLEU | ROUGE-L |
| --- | --- | --- | --- | --- |
| Dialogue Summary | | | | |
| SumRec | 118.6 | 0.251 / 0.611 | | |
| Proposed | 151.2 | 0.187 / 0.526 | | |
| Item Recommendation Information | | | | |
| SumRec | 149.7 | 0.247 / 0.586 | 3.608 | 0.087 |
| Proposed | 247.2 | 0.164 / 0.433 | 1.455 | 0.019 |

Analysis for Dialogue Summaries:

  • Average Length: The Proposed method's dialogue summaries are significantly longer (151.2 words) than SumRec's (118.6 words). This suggests that DPO enables the model to retain and include more detailed user preferences and conversational context within the summary.
  • Distinct-1/2: Both Distinct-1 (0.187 vs 0.251) and Distinct-2 (0.526 vs 0.611) scores decreased for the Proposed method. This indicates a reduced lexical diversity, meaning the model tends to use the same keywords or phrases more repeatedly. The authors interpret this as a consequence of DPO guiding the model to prioritize and preserve phrases deemed important by the score predictor, even if it leads to less varied vocabulary.

Analysis for Item Recommendation Information:

  • Average Length: Similar to summaries, the item recommendation information generated by the Proposed method is much longer (247.2 words) compared to SumRec's (149.7 words). This again implies a greater emphasis on incorporating more explanatory elements relevant to recommendation.

  • Distinct-1/2: Distinct-1 (0.164 vs 0.247) and Distinct-2 (0.433 vs 0.586) also decreased, suggesting lower lexical diversity, consistent with the dialogue summaries.

  • BLEU and ROUGE-L: n-gram similarity metrics like BLEU (1.455 vs 3.608) and ROUGE-L (0.019 vs 0.087), calculated against the original item descriptions, significantly decreased for the Proposed method. This indicates that the DPO-trained model prioritizes generating new, explanatory content that is useful for recommendation, rather than merely paraphrasing or having superficial overlap with the factual item description. The generated item recommendation information moves further away from a direct rephrasing of the original description, focusing on adding suitability and user-centric details.

    Overall Interpretation: These observations imply that DPO successfully optimized both dialogue summaries and item recommendation information towards "adequately containing information necessary for the task," even at the cost of lexical diversity or direct textual similarity to original sources, which is a desirable outcome for the system's performance.

The following is an example of output sentences before and after DPO from Table 11 of the original paper:

Dialogue History Operator: Thank you for using our service today. Customer: My pleasure.
Operator: Thank you. Ma'am/Sir, are you here for a travel consultation today?
... (omitted) ... Customer: Hmm. If possible, I'd like to visit a remote island.
... (omitted) ... Operator: Yes, that's right... Ma'am/Sir, earlier you mentioned you like Agu pork...
Customer: Yes. Uh-huh. Motobu beef, yes. ... (omitted) ... Customer: Ah, is that your top recommendation? Operator: Hmm, there might be better places—<> well, ah.
Item description Enjoy reasonably priced dishes made with carefully selected ingredients, including A5 and A4 grade Motobu beef, Agu pork, Yanbaru young chicken, and exceptionally fresh seafood chosen by the head chef. Duration: 30-60 minutes / English menu available / 3000-5000 yen (dinner) / Recommended for business entertainment.
SumRec Dialogue Summary A woman planning a trip to Okinawa wants to gaze at the sea in beach sandals and ride an ox-drawn cart. She is also looking forward to Agu pork dishes.
SumRec Item Recommendation Information This is a restaurant located in Motobu Town, Okinawa Prefecture, where you can enjoy delicious dishes made with A-grade beef and fresh seafood. It is also recommended for business entertainment.
Proposed Dialogue Summary the sea and also wishing to stay on a remote island. She is also being interested in a tour along the coast in an ox-drawn cart and dishes made with Agu pork. Ms. A is introducing
Proposed Item Recommendation Information This restaurant offers you the opportunity to savor the luxurious taste of A-rank Kuroge Wagyu steak and abundant local seafood. The restaurant has a calm atmosphere, perfect for a date with a loved one or an anniversary dinner. Private rooms are also available for a relaxing time with family or friends. You can enjoy delicious food at a reasonable price, so it's easy on your wallet.

The following is an example of an item recommendation information containing incorrect information from Table 12 of the original paper:

Item Information KiKiYOKOCHO is a new concept zone that gathers items to tickle women's sensibilities by mixing beauty, food, and miscellaneous goods. The concept is "try, find, enjoy." For those who want to compare and try things they are interested in to find what matches their personal preferences. A collection of shops that fulfills such selfish desires. It's packed with unprecedented enjoyment. Duration: around 30-60 minutes. English pamphlets available.
Item Recommendation Information by Proposed Method This is a shopping mall targeted at women, featuring stores from various genres such as beauty, gourmet, fashion, and interior design. The interior is stylish and has a calm atmosphere, allowing you to enjoy shopping at a leisurely pace. Additionally, English signboards are available, so foreign visitors can also use it with peace of mind. The shop staff are also kind and helpful, so even first-time visitors can visit casually.

The following are examples of positive (Winner) and negative (Loser) Item Recommendation Information used in DPO from Table 13 of the original paper:

Winner Rows of old buildings create an atmosphere of traditional Japan. There is also a spacious park where families with children can enjoy themselves with peace of mind. Within the park, there is an exhibition hall where visitors can learn about the region's traditions and culture, making it an attractive spot especially for families. In particular, the area is cool and pleasant in summer, making it highly recommended for family visits.
Loser Rows of old buildings stand, and various exhibitions are held. There is also a large park where families with children can play safely. Visitors can also enjoy light hiking and experience nature. In summer, it is cool and an ideal place for children to have fun.

6.3. Ablation Study

An ablation study was conducted on the Tabidachi Corpus to assess the individual contributions of DPO training for the dialogue summary generation model and the item recommendation information generation model.

The following are the ablation study results on Tabidachi Corpus from Table 3 of the original paper:

| Method | Metric | @1 | @3 | @5 |
| --- | --- | --- | --- | --- |
| Ours | HR ↑ | 0.2474 | 0.5525 | 0.7231 |
| | MRR ↑ | 0.2474 | 0.3796 | 0.4181 |
| w/o Rec-DPO | HR ↑ | 0.2393 | 0.5560 | 0.7402 |
| | MRR ↑ | 0.2393 | 0.3772 | 0.4195 |
| w/o Sum-DPO | HR ↑ | 0.2341 | 0.5176 | 0.7363 |
| | MRR ↑ | 0.2341 | 0.3554 | 0.4051 |

Analysis:

  • w/o Rec-DPO (DPO only on summary generation): This method removes DPO from the item recommendation information generation model, meaning only the dialogue summary generation model is DPO-trained.

    • It generally surpassed SumRec (refer to Table 1 for SumRec scores on Tabidachi Corpus) on most metrics, with notable gains in HR and MRR at higher ranks. For instance, its HR@3 (0.5560) is higher than SumRec's (0.5376). Its MRR scores are also better than SumRec's.
    • This indicates that enhancing the quality of the dialogue summary (allowing user preferences to be reflected more precisely) significantly improves the relevance of initially presented items.
  • w/o Sum-DPO (DPO only on recommendation information generation): This method removes DPO from the dialogue summary generation model, meaning only the item recommendation information generation model is DPO-trained.

    • While it showed some improvement over the Baseline in some aspects, its effect was not as pronounced as w/o Rec-DPO. For example, its HR@1 (0.2341) and MRR@1 (0.2341) are lower than the Baseline (0.2439) and w/o Rec-DPO (0.2393).
    • The performance gap tended to widen at higher ranks, suggesting that while improving recommendation information offers a supplementary benefit, refining the dialogue summary (which forms the foundation of the recommendation process) is more critical.
  • Ours (DPO on both models): The full proposed method, with DPO training for both generation models, demonstrated the highest performance across all metrics (HR@1: 0.2474, HR@3: 0.5525, HR@5: 0.7231; MRR@1: 0.2474, MRR@3: 0.3796, MRR@5: 0.4181). This confirms that fine-tuning both the summary and recommendation information generation synergistically enhances the quality of user preference representation and item description, further boosting recommendation accuracy.

    In summary, the ablation study highlights that DPO training for the dialogue summary is particularly impactful, serving as the primary driver for improved recommendation performance. While DPO on item recommendation information also contributes, its effect is less pronounced.

6.4. Human Evaluation

A human evaluation was conducted using CrowdWorks to assess the quality of generated dialogue summaries and item recommendation information from Ours and SumRec on the Tabidachi Corpus. Ten crowd workers evaluated outputs based on four criteria: Consistency, Conciseness, Fluency, and Usefulness. The evaluation covered 54 recommendation dialogues and their item descriptions.

The following figure (Figure 4 from the original paper) illustrates the results of the human evaluation:

Figure 4: Bar chart of win/loss percentage distributions from the pairwise human evaluation across the four criteria (Usefulness, Consistency, Fluency, Conciseness), shown separately for dialogue summaries and item recommendation information.

Analysis:

  • Dialogue Summaries:
    • The Proposed method outperformed SumRec on Consistency (47.72% win rate vs 27.90% loss), Fluency (47.43% vs 35.24%), and Usefulness (51.54% vs 29.52%).
    • The most significant difference was observed in Usefulness, where approximately half of the evaluators rated Ours summaries as superior. This strongly suggests that DPO effectively enhanced summary quality by enabling more accurate capture of user preferences relevant for recommendations.
    • Conciseness showed no substantial difference (43.47% win vs 44.35% loss), indicating that Ours improved other aspects without significantly compromising brevity, despite a slight tendency towards verbosity observed in automatic metrics.
  • Item Recommendation Information:
    • Conversely, SumRec performed better across all four criteria for item recommendation information (Figure 4, lower part). This finding aligns with the ablation study (Table 3), where applying DPO only to item recommendation information (w/o Sum-DPO) yielded gains that were neither as consistent nor as large as those from DPO on summaries.

    • The authors contextualize this decline by noting that item recommendation information is consumed internally by the score predictor and is never shown directly to the user, so its human-perceived quality matters less than its machine-readable utility.

      Conclusion from Human Evaluation: The human evaluation corroborates that DPO training significantly improves the quality of dialogue summaries, particularly their ability to extract recommendation-relevant information, which aligns with the ablation study's finding that enhanced dialogue summaries are key drivers of overall system performance.

The following are the requests to Crowd Workers from Figure 5 of the original paper:

Figure 5: Requests to Crowd Workers, listing guiding instructions that stress honest responses and note how privacy and collected data are handled (an English version appears below as Figure 7).

The following is the English version of the requests to Crowd Workers from Figure 7 of the original paper:

Request for Survey Cooperation

This survey aims to evaluate the quality of dialogue summary texts and tourist spot recommendation texts to help improve our system in the future. The survey takes approximately 30-40 minutes to complete. Your responses will be processed statistically and your personal information will not be identified, so please feel comfortable participating.

Response Procedure

  1. Evaluation of Summary Texts (13 questions total)
    • We present summary texts ① and ② generated based on the dialogue history posted on this Notion page. For each question, please select the number of the summary text that you feel is superior in each of the following four aspects:
      • Consistency: How well the content of the summary text matches the content of the dialogue history (accuracy of facts, reflection of important points)

      • Conciseness: Whether it conveys necessary information efficiently without unnecessary verbose expressions (not simply fewer characters, but density and efficiency of information)

      • Fluency: Readability of the text, natural expressions, and logical connections (no unnatural phrasing, reads smoothly)

      • Usefulness: Whether it is possible to make tourist spot recommendations to the person being recommended in the dialogue after reading this summary (whether the hobbies and preferences of the person being recommended (Speaker B) are reflected)


Please keep the Notion page open while proceeding with the survey.

  2. Evaluation of Tourist Spot Recommendation Texts (12 questions total)
    • We present tourist spot recommendation texts ① and ② generated based on the "tourist spot information" shown in each question. For each question, please select the number of the recommendation text that you feel is superior in each of the following four aspects:
      • Consistency: How well the content of the recommendation text matches the content of the tourist spot information (accuracy of facts, reflection of important points)

      • Conciseness: Whether it conveys necessary information efficiently without unnecessary verbose expressions (not simply fewer characters, but density and efficiency of information)

      • Fluency: Readability of the text, natural expressions, and logical connections (no unnatural phrasing, reads smoothly)

      • Usefulness: Whether you can understand what kind of person the tourist spot is recommended for by reading this recommendation text (whether important features and benefits are clearly communicated)

Notes

  • There are no right or wrong answers. Please answer honestly based on your impressions.

  • If you close your browser in the middle of the survey, your responses may be lost, so please do not close the screen until you press the submit button.

  • The information obtained in this survey will only be used for research purposes, and the results will be anonymized when published or shared.

The following is the Crowdworker response screen from Figure 6 of the original paper:

Figure 6: Crowdworker response screen, showing the survey interface on which participants answer each question by selecting the number of the text they judge superior.

7. Conclusion & Reflections

7.1. Conclusion Summary

This study successfully proposed a novel method to enhance Conversational Recommender Systems (CRSs) by integrating Direct Preference Optimization (DPO) into the generation of dialogue summaries and item recommendation information. By fine-tuning Large Language Models (LLMs) with DPO, the system is guided to produce texts that are rich in information crucial for effective recommendations, thereby fostering more natural and realistic conversational processes. Experimental results on two public datasets (Tabidachi Corpus and ChatRec) demonstrated that the proposed method achieved superior recommendation performance (higher Hit Rate and Mean Reciprocal Rank) compared to existing baselines, including the original SumRec. An ablation study and human evaluation confirmed that DPO training for dialogue summaries was a particularly critical factor in boosting the overall system's efficacy by enhancing the extraction of recommendation-useful information.

7.2. Limitations & Future Work

The authors acknowledge several limitations of their work:

  • Model Scale: The study utilized medium-scale LLMs (Llama-3.1-Swallow-8B-v0.1 and DeBERTa-v3-japanese-large), not state-of-the-art models with hundreds of billions of parameters. While larger models might offer enhanced performance, they come with significant trade-offs in GPU memory consumption and inference latency, presenting operational cost challenges.

  • Narrow Evaluation Scope: Experiments were conducted exclusively on two Japanese datasets primarily within the travel domain (Tabidachi Corpus and ChatRec). This limits the generalizability of the method, as its effectiveness across other domains and languages remains unverified.

  • Content Hallucination: A persistent limitation is the occurrence of hallucinations in the generated item recommendation information and dialogue summaries. The model sometimes fabricates features not present in the source content. Even if not directly shown to users, such fabrications can adversely affect the model's explainability and potentially lead to inaccurate recommendations (as exemplified in Table 12).

    Future work will focus on:

  • Further improving recommendation performance while maintaining the quality (e.g., factual accuracy) of generated item recommendation information.

  • Addressing potential risks such as data-specific biases, content hallucination, and misuse of the system.

7.3. Personal Insights & Critique

This paper presents a thoughtful and incremental yet significant improvement to Conversational Recommender Systems (CRSs). The explicit focus on addressing the "unrealistic rapid recommendation" problem and integrating implicit preferences is highly valuable, moving CRSs closer to natural human interaction.

Strengths:

  • Principled Approach to LLM Alignment: The use of Direct Preference Optimization (DPO) is a strong methodological choice. It offers a more stable and efficient way to align LLM outputs with a specific task objective (improving recommendation scores) compared to traditional RLHF. The generation of preference data based on score predictor performance is a clever self-bootstrapping mechanism.
  • Clear Problem Definition and Solution: The paper clearly articulates the gap in existing CRSs and proposes a well-structured solution building upon SumRec. The two-stage training flow is logical and well-explained.
  • Comprehensive Evaluation: The combination of automatic metrics (HR, MRR), text generation analysis (length, diversity, n-gram similarity), ablation studies, and human evaluation provides a robust validation of the proposed method's effectiveness and sheds light on why it works (e.g., the critical role of DPO-trained summaries).
  • Practical Relevance: Enhancing the naturalness and accuracy of conversational recommendations has immense practical value for various applications, from e-commerce to travel planning.

Potential Issues/Areas for Improvement:

  • Hallucination Mitigation: While acknowledged as a limitation, the issue of hallucinations in generated item recommendation information is significant. Although these texts are internal, incorrect facts could lead to misguided recommendations or, if exposed, erode user trust. Future work could explore more robust fact-checking mechanisms or constrained generation techniques during DPO training to penalize factual inaccuracies more heavily.
  • Generalizability Beyond Japanese Travel Domain: The evaluation on only two Japanese datasets, primarily in the travel domain, limits the direct generalizability. Replication on diverse domains (e.g., movies, books, products) and languages would strengthen the claims of a "realistic conversational recommendation" system.
  • Interpretability of DPO's Effect on Recommendation Information: The human evaluation showed a decrease in perceived quality for item recommendation information generated by the DPO-trained model, despite overall system performance improvement. This suggests a trade-off: what's "useful" for the score predictor (e.g., specific keywords, longer explanations) might not always be "natural" or "concise" for human judges. Further analysis could investigate if this "less natural" output has a negative impact if it were ever exposed to users, or if there's a way to optimize for both machine utility and human readability.
  • Computational Cost: The mention of larger LLMs and their associated costs highlights a practical barrier. Exploring parameter-efficient fine-tuning (PEFT) methods beyond gradient checkpointing (e.g., LoRA, QLoRA) could be a direction to leverage larger models without prohibitive costs; a minimal LoRA configuration is sketched after this list.
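
As an illustration of that last point, a minimal LoRA setup with the Hugging Face peft library might look like the sketch below; the checkpoint identifier and hyperparameters are assumptions for illustration, not values reported by the authors.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Assumed Hugging Face id for the base model discussed in the paper;
# any causal LM checkpoint would be handled the same way.
base_model = AutoModelForCausalLM.from_pretrained(
    "tokyotech-llm/Llama-3.1-Swallow-8B-v0.1"
)

# Hypothetical LoRA hyperparameters: only low-rank adapters on the attention
# projections are trained, while the 8B base weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```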

Transferability and Application: The core idea of using a score predictor to generate preference data for DPO fine-tuning of text generation models is highly transferable. This paradigm could be applied to any task where LLMs generate intermediate texts that feed into a downstream model (a generic sketch of the pattern follows the examples below). For example:

  • Summarization for Question Answering: Fine-tune a summarizer to create summaries that are most useful for a downstream QA model to answer questions accurately.

  • Explanation Generation for XAI: Generate explanations that not only sound plausible but also help a user debug or understand a model's decision more effectively, as judged by a separate explanation evaluator.

  • Dialogue Policy Learning: DPO could directly optimize dialogue policy generation in other interactive AI systems.
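
As promised above, here is a hedged, generic sketch of this "generate, score, pair" pattern: the loop builds DPO preference data from any candidate generator and any downstream scorer. The function names and toy usage are hypothetical and not taken from the paper's codebase.

```python
from typing import Callable, List, Tuple

def build_preference_pairs(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],  # LLM: prompt -> n candidate texts
    score: Callable[[str, str], float],         # downstream model: (prompt, text) -> task score
    n_candidates: int = 4,
) -> List[Tuple[str, str, str]]:
    """For each prompt, sample several candidates, score them with the downstream
    model, and keep the best/worst as a (prompt, winner, loser) triple for DPO."""
    pairs = []
    for prompt in prompts:
        candidates = generate(prompt, n_candidates)
        ranked = sorted(candidates, key=lambda text: score(prompt, text), reverse=True)
        winner, loser = ranked[0], ranked[-1]
        if winner != loser:
            pairs.append((prompt, winner, loser))
    return pairs

# Toy usage with a dummy generator and a placeholder scoring rule.
pairs = build_preference_pairs(
    prompts=["Summarize this dialogue: ..."],
    generate=lambda p, n: [f"candidate {i} for: {p}" for i in range(n)],
    score=lambda p, t: float(len(t)),  # placeholder: longer text scores higher
)
print(pairs)
```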

    Overall, this paper provides a valuable contribution to the field of conversational AI by demonstrating a practical and effective way to align LLM generation with specific task objectives, paving the way for more intelligent and human-centric interactive systems.
