Abstract

Recommendation dialogue systems aim to build social bonds with users and provide high-quality recommendations. This paper pushes forward towards a promising paradigm called target-driven recommendation dialogue systems, which is highly desired yet under-explored. We focus on how to naturally lead users to accept the designated targets gradually through conversations. To this end, we propose a Target-driven Conversation Planning (TCP) framework to plan a sequence of dialogue actions and topics, driving the system to transit between different conversation stages proactively. We then apply our TCP with planned content to guide dialogue generation. Experimental results show that our conversation planning significantly improves the performance of target-driven recommendation dialogue systems.

1. Bibliographic Information

1.1. Title

The paper is titled "Follow Me: Conversation Planning for Target-driven Recommendation Dialogue Systems". It develops methods for dialogue systems that proactively guide users towards specific recommended items.

1.2. Authors

The paper is authored by:

Jian Wang (The Hong Kong Polytechnic University)
Dongding Lin (The Hong Kong Polytechnic University)
Wenjie Li (The Hong Kong Polytechnic University)

1.3. Journal/Conference

The paper was published on 2022-08-06 (13:23:42 UTC). The provided text does not name a specific conference or journal. Given the subject matter (natural language processing, dialogue systems), a venue such as ACL, EMNLP, NAACL, or COLING would be plausible, but this is an inference rather than a stated fact.

1.4. Publication Year

2022

1.5. Abstract

The research addresses the challenge of building recommendation dialogue systems that not only provide high-quality recommendations but also foster social connections with users. It introduces a novel paradigm called target-driven recommendation dialogue systems, which is currently under-explored. The core problem tackled is how to naturally guide users through conversations to accept pre-designated target recommendations. To achieve this, the paper proposes a Target-driven Conversation Planning (TCP) framework. This framework is designed to plan a sequence of dialogue actions and topics, enabling the system to proactively transition between different conversation stages. The TCP framework's planned content then guides the dialogue generation process. Experimental results demonstrate that this conversation planning approach significantly enhances the performance of target-driven recommendation dialogue systems.

1.6. Original Source Link

Original Source Link: https://arxiv.org/abs/2208.03516v1
PDF Link: https://arxiv.org/pdf/2208.03516v1.pdf

The paper was published on arXiv, a preprint server, and this is version v1. Posting a preprint before or alongside peer-reviewed conference or journal publication is common practice for academic papers.

2. Executive Summary

2.1. Background & Motivation (Why)

The paper tackles a critical limitation in existing recommendation dialogue systems: their reactive nature.

Core Problem: Current systems primarily respond to user utterances, inferring preferences and then recommending items. This approach is limited because users might not always have clear preferences, especially for unfamiliar topics or items. It hinders the system's ability to proactively introduce and recommend specific items.
Why Important: The ability for a dialogue system to proactively lead conversations towards a designated target (e.g., a specific movie, song, or restaurant) in a natural and sociable manner is highly desirable. It mimics human-like persuasive conversations and opens new possibilities for personalized and guided user experiences.
Gaps in Prior Work: While datasets like DuRecDial have emerged to support proactive dialogue, the specific challenge of target-driven recommendation dialogue systems – where the system aims to lead the user to a pre-determined target – remains under-explored. Existing systems often rely on multi-task learning or predict-then-generate paradigms, which don't explicitly focus on planning a conversational path to a specific target. The key challenge lies in making reasonable plans to drive the conversation step-by-step while maintaining engagement and arousing user interest, rather than just passively discovering preferences.
Novel Approach: The paper introduces a Target-driven Conversation Planning (TCP) framework. This framework uniquely plans a sequence of dialogue actions and topics to proactively steer the conversation towards a designated target. Crucially, it plans this path backward from the target to the current turn, leveraging target-side information, and then uses this plan to guide utterance generation.

2.2. Main Contributions / Findings (What)

The paper highlights two primary contributions:

Formulating Target-driven Recommendation Dialogue: The authors are the first to shift from the reactive recommendation dialogue paradigm to a proactive one by formally defining and addressing the target-driven recommendation dialogue task. This involves the system actively working to lead a conversation to a specific, pre-assigned target.
Proposing the TCP Framework: They introduce the Target-driven Conversation Planning (TCP) framework. This framework is designed to plan a coherent path of dialogue actions and topics. This planned content then serves to proactively guide the system's conversation flow and to inform the generation of relevant and engaging utterances.

The main findings demonstrate that this conversation planning approach significantly improves the performance of target-driven recommendation dialogue systems, achieving higher target recommendation success rates and better dialogue generation quality across various metrics.

3.1. Foundational Concepts

To understand this paper, a beginner should be familiar with the following core concepts:

Dialogue Systems: Computer systems designed to interact with humans using natural language. They can be broadly categorized into task-oriented dialogue systems (which help users achieve specific goals, like booking a flight) and chit-chat systems (which focus on open-ended, social conversations).
Recommendation Systems: Systems that suggest items (e.g., movies, products, articles) to users based on their preferences, past behavior, or other data.
Recommendation Dialogue Systems: A specialized type of task-oriented dialogue system that combines dialogue capabilities with recommendation functions, allowing users to discover items through natural conversation.
Dialogue Actions: Predefined communicative functions or intentions of a dialogue system's utterance (e.g., greeting, ask user preference, movie recommendation, chit-chat).
Dialogue Topics: The subject matter being discussed in a conversation (e.g., a specific movie, a genre, an actor).
Knowledge Graphs: Structured representations of information where entities (e.g., "Andy Lau," "McDull") are nodes and their relationships (e.g., "voice cast," "actor in") are edges. They provide a rich source of domain knowledge.
Perplexity (PPL): A measure of how well a probability model predicts a sample. In NLP, it quantifies how well a language model predicts a sequence of words. Lower PPL indicates a more fluent and accurate model.
F1 Score: The harmonic mean of precision and recall. In NLP, word-level F1 measures the overlap between generated and ground-truth utterances based on individual words, while Knowledge F1 measures the accuracy of generating correct knowledge entities.
- Precision: The proportion of correctly predicted positive observations among all predicted positive observations.
- Recall: The proportion of correctly predicted positive observations among all actual positive observations.
BLEU (Bilingual Evaluation Understudy): A metric for evaluating the quality of text generated by machine translation systems. It measures the $n$-gram overlap between the generated text and one or more reference texts. Higher BLEU scores indicate better quality.
DIST (Distinct): A metric used to evaluate the diversity of generated text. DIST-1 and DIST-2 measure the proportion of unique unigrams (single words) and bigrams (two-word sequences) in the generated output, respectively. Higher DIST scores indicate more diverse and less repetitive output.
Transformer [17]: A neural network architecture introduced in 2017, foundational for modern NLP. It relies entirely on self-attention mechanisms to process input sequences, allowing it to weigh the importance of different words in a sequence when processing each word. It avoids recurrent or convolutional layers.
BERT (Bidirectional Encoder Representations from Transformers) [2]: A pre-trained language model developed by Google, based on the Transformer's encoder architecture. It's pre-trained on a massive text corpus and can be fine-tuned for various downstream NLP tasks. "Bidirectional" means it considers context from both the left and right of a word.
BART (Bidirectional and Auto-Regressive Transformers) [8]: A denoising autoencoder for pre-training sequence-to-sequence models. It's a Transformer-based model that learns to reconstruct original text from corrupted versions, making it effective for both natural language understanding and generation.
GPT-2 (Generative Pre-trained Transformer 2) [15]: A large Transformer-based language model developed by OpenAI, famous for its ability to generate coherent and diverse human-like text. It is an autoregressive model, meaning it predicts the next token in a sequence based on the preceding ones.
End-to-End Memory Network [16]: A neural network architecture that incorporates an external memory component, allowing it to store and retrieve information over long sequences or multiple interactions, useful for tasks requiring reasoning over context.
Graph Attention Transformer [3]: An extension of the Transformer architecture designed to process graph-structured data. It uses attention mechanisms to weigh the importance of different nodes (entities) and edges (relations) in a graph when learning representations.
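To make the DIST metric from the list above concrete, here is a minimal sketch of Distinct-n. It assumes a corpus-level definition (unique n-grams divided by total n-grams over all generated utterances); some implementations average per utterance instead, and the function name `distinct_n` and the toy utterances are illustrative only.

```python
def distinct_n(utterances, n):
    """Ratio of unique n-grams to total n-grams over tokenized utterances.

    ASSUMPTION: corpus-level counting; per-utterance averaging is a common variant.
    """
    total = 0
    unique = set()
    for tokens in utterances:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


# Toy generated outputs (hypothetical, already tokenized).
generated = [
    ["i", "like", "this", "movie"],
    ["i", "like", "this", "song"],
]
dist1 = distinct_n(generated, 1)  # 5 unique unigrams out of 8
dist2 = distinct_n(generated, 2)  # 4 unique bigrams out of 6
```

The two utterances share most of their words, so both scores fall well below 1, which is the repetitiveness that DIST is meant to expose.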

3.2. Previous Works

Previous research in recommendation dialogue systems primarily focused on reactive interactions:

Early Datasets:
- GoRecDial [6]: One of the early datasets for recommendation dialogue, contributing to the field's growth.
- TG-ReDial [21]: A topic-guided conversational recommender system dataset.
- INSPIRED [4]: Another dataset for sociable recommendation dialogue systems.
- Limitation: The paper argues that most systems built on these datasets converse reactively, meaning they primarily respond to user queries or expressed preferences rather than proactively steering the conversation.
Reactive Models:
- Ma et al. [13]: Proposed a tree-structured reasoning framework over knowledge graphs to guide recommendations and response generation. This is a reactive approach.
- Liang et al. [10]: Introduced the NTRD framework, combining slot filling (a classic task-oriented dialogue technique) with neural language generation for item recommendations. Still reactive.
Proactive Paradigm Emergence:
- DuRecDial [12]: This dataset is highlighted as crucial because it features dialogues where the system proactively leads the conversation with rich interactive actions (e.g., chit-chat, question answering, recommendation). It's a key inspiration for the target-driven approach. The example in Figure 1 from DuRecDial shows a system proactively guiding a user towards "McDull, Prince de la Bun".
Related Paradigms for Dialogue Generation:
- Multi-task Learning [11]: As depicted in Figure 2(a), this paradigm involves training a single model to perform multiple related tasks simultaneously (e.g., predicting next action, topic, and generating response). While beneficial for shared representations, it might not explicitly enforce target-driven planning.
- Predict-then-Generate [12, 19]: As depicted in Figure 2(b), this paradigm first predicts intermediate dialogue actions or topics (or other dialogue states) and then uses these predictions to guide the subsequent utterance generation. MGCG_G [12] and KERS [19] are examples of models following this paradigm.
  - MGCG_G [12]: Employs predicted dialogue actions and topics to guide generation.
  - KERS [19]: Features a knowledge-enhanced mechanism and multiple subgoals for recommendation dialogue generation.
- Distinction: While predict-then-generate involves prediction, the paper argues these models still struggle to proactively lead users toward a specific designated target because their predictions are often local (next turn) and not necessarily tied to a long-term, overarching target.

3.3. Technological Evolution

The field of recommendation dialogue systems has evolved from purely reactive interactions, often based on slot filling or simple template-based responses, towards more natural language generation capabilities. The advent of large pre-trained language models (PLMs) like BERT, BART, and GPT-2 significantly boosted the fluency and coherence of generated responses. Concurrently, the development of specialized datasets like DuRecDial that include proactive system behaviors has laid the groundwork for systems that can take initiative. This paper leverages these advancements in PLMs and dataset availability to push towards a truly proactive, target-driven system.

3.4. Differentiation

The proposed Target-driven Conversation Planning (TCP) framework differentiates itself significantly from existing approaches:

Reactive vs. Proactive: Unlike most prior recommendation dialogue systems that react to user input, TCP is inherently proactive. It is given a designated target and aims to naturally lead users to accept it.
Planning Horizon: While predict-then-generate models (e.g., MGCG_G, KERS) predict the next dialogue action and topic, TCP performs conversation planning for an entire sequence of dialogue actions and topics to reach a designated target. This implies a longer-term, strategic planning capability.
Backward Planning: A key innovation is that TCP plans the sequence of actions and topics from the target turn of the conversation to the current turn. This allows it to explicitly incorporate and leverage target-side information throughout the planning process, ensuring all intermediate steps contribute to achieving the final target.
Explicit Guidance: TCP explicitly generates a plan (sequence of actions and topics) which then serves as a concrete guide for the dialogue generation module, rather than just using implicitly predicted labels.
Knowledge Integration: TCP integrates user profiles, domain knowledge, and conversation history within its Transformer-based planner, using a knowledge-target mutual attention module and an information fusion layer to strategically combine these diverse information sources for effective planning.

The comparison in Figure 2 visually summarizes this differentiation:

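Figure 2: Schematic of three dialogue generation paradigms: (a) the multi-task learning paradigm, (b) the predict-then-generate paradigm, and (c) the target-driven conversation planning paradigm proposed in the paper. Modules and arrows show the flow among the input, actions, topics, and system response.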
(a) Multi-task Learning Paradigm: This approach (e.g., Dongding Lin et al. [11]) typically involves a single model performing multiple tasks simultaneously (e.g., response generation, action prediction, topic prediction). It aims for efficiency and shared representations but doesn't necessarily impose a strong, long-term planning mechanism towards a specific target.
(b) Predict-then-Generate Paradigm: This approach (e.g., Liu et al. [12], Zhang et al. [19]) first predicts an intermediate dialogue state (like the next dialogue action or topic) and then uses this predicted state to generate the system's utterance. While providing some guidance, the predictions are often local to the next turn and don't explicitly incorporate a pre-defined overall conversation target.
(c) Our target-driven planning enhanced generation framework: This is the proposed TCP framework. It explicitly plans a sequence of dialogue actions and topics (Planned Action-Topic Path) with a designated target in mind. This plan then directly guides the dialogue generation process, ensuring that the entire conversation progresses towards the target. The crucial difference is the Target-driven Conversation Planner component that works backwards from the target.

4. Methodology

4.1. Principles

The core idea behind the Target-driven Conversation Planning (TCP) framework is to enable a recommendation dialogue system to proactively and naturally lead users towards a designated target topic or item (e.g., recommending a specific movie). This is achieved by strategically planning the conversation flow, specifically by determining a sequence of dialogue actions and topics that guide the user step-by-step to accept the target. The intuition is that for effective proactive guidance, the system needs a forward-looking plan. To make this plan target-aware from the outset, the planning process is designed to work backward from the final target to the current conversation turn, leveraging the information about the ultimate goal.

4.2. Steps & Procedures

4.2.1. Problem Formulation

The problem is defined over a recommendation-oriented dialogue corpus $\mathcal{D}$. Each entry $i$ in the corpus consists of:

User Profile $\mathcal{U}_i = \{u_{i,j}\}^{N_u}_{j=1}$: A collection of user attributes, each $u_{i,j}$ being a <key, value> pair (e.g., occupation: student).
Domain Knowledge $\mathcal{K}_i = \{k_{i,j}\}^{N_K}_{j=1}$: A set of knowledge triples, each $k_{i,j}$ being a <subject, relation, object> triple (e.g., <Andy Lau, voice cast, "McDull, Prince de la Bun">).
Conversation Content $\mathcal{H}_i = \{(X_{i,t}, Y_{i,t})\}^T_{t=1}$: The dialogue history with $T$ turns, where $X_{i,t}$ is the user's utterance and $Y_{i,t}$ is the system's utterance at turn $t$.
Annotated Plans $\mathcal{P}_i = \{(a_{i,l}, z_{i,l})\}^L_{l=1}$: A sequence of planned dialogue actions $a_{i,l}$ and topics $z_{i,l}$, where each pair may span multiple conversation turns and $L$ is the number of planned (action, topic) pairs.

Given a designated target topic $z_{T'}$ with its corresponding target action $a_{T'}$ , a user profile $\mathcal{U}'$ , relevant domain knowledge $\mathcal{K}'$ , and conversation history $\mathcal{H}'$ , the objective is to generate coherent system utterances that engage the user and recommend $z_{T'}$ appropriately.
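The corpus entry described above can be pictured as a small record type. The following is only a sketch: the class and field names (`Turn`, `PlanStep`, `DialogueEntry`) are hypothetical and not taken from the paper or from the DuRecDial data files, and `None` is used here to stand in for the NULL topic.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Turn:
    user: str    # X_{i,t}: the user's utterance at turn t
    system: str  # Y_{i,t}: the system's utterance at turn t


@dataclass
class PlanStep:
    action: str           # a_{i,l}
    topic: Optional[str]  # z_{i,l}; None stands in for the NULL topic


@dataclass
class DialogueEntry:
    user_profile: Dict[str, str]           # U_i: <key, value> pairs
    knowledge: List[Tuple[str, str, str]]  # K_i: <subject, relation, object> triples
    history: List[Turn]                    # H_i: T turns
    plans: List[PlanStep]                  # P_i: L planned (action, topic) pairs


# Toy entry built from the examples given in the text.
entry = DialogueEntry(
    user_profile={"occupation": "student"},
    knowledge=[("Andy Lau", "voice cast", "McDull, Prince de la Bun")],
    history=[Turn(user="Hi!", system="Hello! Do you like animated movies?")],
    plans=[
        PlanStep(action="Greeting", topic=None),
        PlanStep(action="Movie recommendation", topic="McDull, Prince de la Bun"),
    ],
)
```

Note that a plan step may cover several turns, so `len(entry.plans)` and `len(entry.history)` are not expected to match.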

The problem is decomposed into three sub-tasks:

Action Planning: Determining a sequence of dialogue actions to proactively lead the conversation.
Topic Planning: Determining appropriate dialogue topics to move towards the target topic.
Dialogue Generation: Generating a proper system utterance that realizes the planned action and topic at each turn.

4.2.2. Our Method: TCP Framework

The TCP framework guides dialogue generation in a pipeline manner, as illustrated in Figure 2(c). It consists of three main stages: Encoders, Target-driven Conversation Planner, and TCP-Enhanced Dialogue Generation.

4.2.2.1. Encoders

Different types of input information are encoded into rich representations:

User Profile ( $\mathcal{U}'$ ): An end-to-end memory network [16] is used to encode the user profile into a representation $\mathbf{U}$ . The user profile is composed of $m$ entries, $\mathbf{U} = \text{EPM}(\mathbf{u}_1, \mathbf{u}_2, \dots, \mathbf{u}_m)$ .
Domain Knowledge ( $\mathcal{K}'$ ): A Graph Attention Transformer [3] encodes the domain knowledge. Knowledge triples are converted into relation-entity pairs to save space. The final representation is $\mathbf{K} = (\mathbf{k}_1, \mathbf{k}_2, \dots, \mathbf{k}_k)$ , where $k$ is the length of the knowledge sequence. Pre-trained language models (PLMs) like BERT can initialize the embedding layers.
Conversation History ( $\mathcal{H}'$ ): A BERT [2] model encodes the conversation history into a token-level representation $\mathbf{H} = (\mathbf{h}_1, \mathbf{h}_2, \dots, \mathbf{h}_n)$ , where $n$ is the length of the history.

4.2.2.3. Target-driven Conversation Planner

This module is the core of TCP. It plans a path of dialogue actions and topics by generating a sequence of tokens. A key design choice is to generate the path from the target turn to the current turn (backward planning), which allows the planner to integrate target-side information more effectively.

The planner is based on the Transformer [17] decoder architecture (see Figure 3).

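Figure 3: Overview of Target-driven Conversation Planner. The figure shows the planner's architecture: multi-head attention, the information fusion layer, and cross-attention modules work together to process knowledge-target, user preference, and context information and drive dialogue generation.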

Input to Planner: During training, tokens of the target action and target topic are prepended to the plan sequence as input. The generated plan sequence tokens follow a specific format: $[A] a_1 a_2 ... [T] t_1 t_2 ... [EOS]$ , where [A] separates an action, [T] separates a topic, and [EOS] marks the end of the sequence.
Query Representation: The shifted token-level plan representation serves as the query. This query is passed through three masked multi-head attention layers (which ensure that the prediction for a token only depends on previous tokens) followed by add and normalization layers. This process generates three query representations: $\mathbf{P}_k$ (for knowledge), $\mathbf{P}_u$ (for user profile), and $\mathbf{P}_h$ (for conversation history).
Knowledge-Target Mutual Attention: This module is designed to highlight the influence of the target on domain knowledge. It calculates a relevance score between the domain knowledge $\mathbf{K}$ and the target representation $\mathbf{T}$ (derived from the target action/topic). This score is used as a weight when attending to $\mathbf{K}$ .
User and History Attention: $\mathbf{P}_u$ and $\mathbf{P}_h$ are used to attend to the user profile $\mathbf{U}$ and conversation history $\mathbf{H}$ respectively. These are similar to the encoder-decoder cross-attention in the original Transformer decoder, allowing the planner to consider user preferences and past dialogue turns. This results in attended representations $\mathbf{A}_u$ and $\mathbf{A}_h$ .
Information Fusion Layer: To dynamically combine the different attended results ($\mathbf{A}_k$, $\mathbf{A}_u$, $\mathbf{A}_h$) based on their relevance, a gate-controlled information fusion layer is used.
- First, $\mathbf{A}_u$ and $\mathbf{A}_h$ are fused: $\begin{array}{rcl} \mathbf{A}_1 &=& \beta \cdot \mathbf{A}_u + (1 - \beta) \cdot \mathbf{A}_h \\ \beta &=& \mathrm{sigmoid}(\mathbf{W}_1 [\mathbf{A}_u ; \mathbf{A}_h] + \mathbf{b}_1) \end{array}$ Here, $\beta$ is a sigmoid gate controlling the contribution of $\mathbf{A}_u$ and $\mathbf{A}_h$ . $\mathbf{W}_1$ and $\mathbf{b}_1$ are trainable parameters. The notation $[\mathbf{A}_u ; \mathbf{A}_h]$ denotes concatenation.
- Then, the result $\mathbf{A}_1$ is fused with $\mathbf{A}_k$ : $\begin{array}{rcl} \mathbf{A} &=& \gamma \cdot \mathbf{A}_k + (1 - \gamma) \cdot \mathbf{A}_1 \\ \gamma &=& \mathrm{sigmoid}(\mathbf{W}_2 [\mathbf{A}_k ; \mathbf{A}_1] + \mathbf{b}_2) \end{array}$ Similarly, $\gamma$ is a sigmoid gate, and $\mathbf{W}_2, \mathbf{b}_2$ are trainable parameters. $\mathbf{A}$ represents the final fused representation, which is then used to predict the next token in the plan sequence.
Training and Inference: The planner is trained using cross-entropy loss against ground-truth plan sequences. During inference, greedy search decoding is employed to generate the plan sequences.
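The plan format and the backward ordering described above can be illustrated with a small serializer. This is a sketch under assumptions: the helper name `serialize_plan_backward`, the literal string "NULL" for an empty topic, and the action labels are illustrative, and the real planner works on tokenizer output rather than raw strings.

```python
def serialize_plan_backward(steps):
    """Turn (action, topic) pairs into '[A] ... [T] ... [EOS]' text.

    `steps` is ordered from the current turn to the target, so reversing it
    puts the target pair first, matching the target-to-current plan order.
    """
    parts = [f"[A] {action} [T] {topic}" for action, topic in reversed(steps)]
    return " ".join(parts) + " [EOS]"


# Hypothetical plan in conversation order; the last pair is the target.
steps = [
    ("Greeting", "NULL"),
    ("Chat about stars", "Andy Lau"),
    ("Movie recommendation", "McDull, Prince de la Bun"),
]
plan_sequence = serialize_plan_backward(steps)
```

Because the target pair leads the sequence, every later token is generated with the target already in its left context, which is one way to read the claim that backward planning exposes target-side information to the decoder.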

4.2.2.4. TCP-Enhanced Dialogue Generation

Once a plan path is generated by TCP, it guides the dialogue generation process:

Guiding Prompt: Since the plan is generated backward (from target to current), the last action $a_t$ and the last topic $z_t$ in the generated path correspond to the current turn's planned content. These are used as the guiding prompt.
Knowledge Extraction: If $z_t$ is not NULL (i.e., not a chit-chat action), $z_t$ is treated as the center topic. The system then extracts topic-centric attributes and reviews (relevant knowledge triples) from the domain knowledge $\mathcal{K}'$ that are related to $z_t$ . If $a_t$ is chit-chat, the extracted knowledge is empty.
Input for Generation: The concatenated text of the user profile, the extracted knowledge, the conversation history, and the planned action $a_t$ forms the input to a backbone dialogue generation model.
Backbone Models: Various backbone models (e.g., BART, GPT-2) can then be fine-tuned on this combined input to produce the final system utterance.
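The four steps above can be sketched as a small preprocessing routine. Everything named here is an assumption made for illustration: the `[SEP]` separator, the string "NULL" for an empty topic, matching a triple when the topic is its subject or object, and the second (toy) knowledge triple. The paper's actual input template may differ.

```python
import re


def parse_plan(sequence):
    """Parse '[A] action [T] topic ... [EOS]' into (action, topic) pairs."""
    body = sequence.split("[EOS]")[0]
    return re.findall(r"\[A\]\s*(.*?)\s*\[T\]\s*(.*?)\s*(?=\[A\]|$)", body)


def build_generation_input(profile, knowledge, history, plan_sequence):
    plan = parse_plan(plan_sequence)
    # The plan runs from target to current turn, so the last pair is the current one.
    action, topic = plan[-1]
    if topic == "NULL":
        extracted = []  # chit-chat: no topic-centric knowledge
    else:
        # ASSUMPTION: a triple is "related" if the topic is its subject or object.
        extracted = [t for t in knowledge if topic in (t[0], t[2])]
    parts = [
        " ".join(f"{k}: {v}" for k, v in profile.items()),
        " ".join(" ".join(t) for t in extracted),
        " ".join(history),
        action,
    ]
    return action, topic, " [SEP] ".join(parts)


profile = {"Name": "Yuzhen Hu", "Occupation": "student"}
knowledge = [
    ("Andy Lau", "voice cast", "McDull, Prince de la Bun"),
    ("Toy Song", "singer", "Toy Singer"),  # hypothetical, unrelated triple
]
history = ["Hello!"]
plan_sequence = (
    "[A] Movie recommendation [T] McDull, Prince de la Bun "
    "[A] Chat about stars [T] Andy Lau [EOS]"
)
action, topic, text = build_generation_input(profile, knowledge, history, plan_sequence)
```

The resulting `text` string is what a backbone such as BART or GPT-2 would be fine-tuned on, paired with the reference system utterance.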

4.3. Mathematical Formulas & Key Details

4.3.1. Knowledge-Target Mutual Attention

The relevance score between domain knowledge $\mathbf{K}$ and the target representation $\mathbf{T}$ is calculated via scaled dot-product attention, followed by mean pooling to get a weight that reflects the target's influence on reasoning over knowledge:

$\mathbf{K}_{weight} = \mathrm{MeanPooling}\left(\frac{\mathbf{K}\mathbf{T}^\top}{\sqrt{d}}\right)$

Then, when $\mathbf{P}_k$ (the query representation for knowledge) attends to $\mathbf{K}$:

$\mathbf{A}_k = \mathrm{softmax}\left(\frac{\mathbf{P}_k \mathbf{K}^\top}{\sqrt{d}} * \mathbf{K}_{weight}\right) \mathbf{K}$

Where:

$\mathbf{K}$ : Represents the encoded domain knowledge. It's a matrix where each row corresponds to a knowledge embedding.
$\mathbf{T}$ : Represents the encoded target (action and topic).
$^\top$ : Denotes the matrix transpose.
$d$ : Is the hidden size or dimension of the key/query vectors. Dividing by $\sqrt{d}$ is a scaling factor common in Transformer attention to prevent large dot products from pushing the softmax into regions with tiny gradients.
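$\mathrm{MeanPooling}(\cdot)$ : An operation that computes the average of its input. Here, it averages the knowledge-target relevance scores.
$\mathbf{K}_{weight}$ : The weight expressing how much the target influences the processing of knowledge. Its shape is not stated here; for the element-wise product with $\mathbf{P}_k \mathbf{K}^\top$ to be well defined, it is most naturally read as one weight per knowledge entry (that is, pooling over the target tokens).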
$\mathbf{P}_k$ : Query representation for knowledge, derived from the plan sequence through masked multi-head attention.
$\mathrm{softmax}(\cdot)$ : The softmax function, used to convert raw scores into a probability distribution.
*: Element-wise multiplication.
$\mathbf{A}_k$ : The attended representation of the domain knowledge, weighted by its relevance to the target and the current plan query.
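A pure-Python sketch of the two formulas above may help with the shapes. It assumes that mean pooling runs over the target tokens, giving one weight per knowledge entry; that reading is an interpretation, not something the text confirms. Matrices are plain lists of rows, and the toy values are arbitrary.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def knowledge_target_attention(P_k, K, T):
    scale = math.sqrt(len(K[0]))  # sqrt(d)
    relevance = matmul(K, transpose(T))  # shape (k, t): knowledge vs. target tokens
    # ASSUMPTION: pool over the target axis -> one weight per knowledge entry.
    k_weight = [sum(r) / len(r) / scale for r in relevance]
    scores = matmul(P_k, transpose(K))  # shape (p, k)
    weighted = [[s / scale * w for s, w in zip(row, k_weight)] for row in scores]
    attn = [softmax(row) for row in weighted]
    return matmul(attn, K), attn, k_weight  # A_k has shape (p, d)


K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three knowledge entries, d = 2
T = [[1.0, 0.0]]                          # one target token
P_k = [[1.0, 0.0], [0.0, 1.0]]            # two plan positions
A_k, attn, k_weight = knowledge_target_attention(P_k, K, T)
```

In this toy case the second knowledge entry is orthogonal to the target, so its weight is zero and its raw score is suppressed before the softmax.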

4.3.2. Information Fusion Layer

This layer uses gate mechanisms to strategically combine different attended representations:

Step 1: Fuse User Profile and Conversation History Attentions

$\mathbf{A}_1 = \beta \cdot \mathbf{A}_u + (1 - \beta) \cdot \mathbf{A}_h$

$\beta = \mathrm{sigmoid}(\mathbf{W}_1 [\mathbf{A}_u ; \mathbf{A}_h] + \mathbf{b}_1)$

Where:

$\mathbf{A}_u$ : Attended representation from the user profile $\mathbf{U}$ .
$\mathbf{A}_h$ : Attended representation from the conversation history $\mathbf{H}$ .
$[\mathbf{A}_u ; \mathbf{A}_h]$ : Denotes the concatenation of $\mathbf{A}_u$ and $\mathbf{A}_h$ .
$\mathbf{W}_1 \in \mathbb{R}^{2d}$ : A trainable weight applied to the concatenated attended representations. Its dimension $2d$ matches the concatenation of $\mathbf{A}_u$ and $\mathbf{A}_h$, each of dimension $d$. As written it is a vector rather than a matrix, which would make the gate a scalar at each position.
$\mathbf{b}_1$ : A trainable bias vector.
$\mathrm{sigmoid}(\cdot)$ : The sigmoid activation function, which squashes values between 0 and 1, acting as a gating mechanism.
$\beta$ : The gate value, determining the relative importance of $\mathbf{A}_u$ and $\mathbf{A}_h$ for the combined representation $\mathbf{A}_1$ . A higher $\beta$ means more weight on $\mathbf{A}_u$ .

Step 2: Fuse Knowledge Attention with the Combined User/History Attention

$\mathbf{A} = \gamma \cdot \mathbf{A}_k + (1 - \gamma) \cdot \mathbf{A}_1$

$\gamma = \mathrm{sigmoid}(\mathbf{W}_2 [\mathbf{A}_k ; \mathbf{A}_1] + \mathbf{b}_2)$

Where:

$\mathbf{A}_k$ : The attended representation from the domain knowledge (derived using knowledge-target mutual attention).
$\mathbf{A}_1$ : The fused representation from user profile and conversation history.
$[\mathbf{A}_k ; \mathbf{A}_1]$ : Denotes the concatenation of $\mathbf{A}_k$ and $\mathbf{A}_1$ .
$\mathbf{W}_2 \in \mathbb{R}^{2d}$ : A second trainable weight, applied to the concatenation $[\mathbf{A}_k ; \mathbf{A}_1]$.
$\mathbf{b}_2$ : Another trainable bias vector.
$\gamma$ : The gate value, determining the relative importance of knowledge ( $\mathbf{A}_k$ ) and the user/history context ( $\mathbf{A}_1$ ) for the final fused representation $\mathbf{A}$ .
$\mathbf{A}$ : The final fused attended representation, which combines information from the knowledge (weighted by its relevance to the target), the user profile, and the conversation history. This $\mathbf{A}$ is then used by the feed-forward networks to predict the next token in the plan.
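The two-step gating can be traced with a few lines of plain Python. This sketch handles a single position and assumes the gate is a scalar, which follows from reading $\mathbf{W}_1, \mathbf{W}_2 \in \mathbb{R}^{2d}$ as vectors; the weights below are set to zero purely so the result is easy to verify by hand, whereas in the model they are learned.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(x, y, w, b):
    """Return g * x + (1 - g) * y with g = sigmoid(w . [x; y] + b)."""
    concat = x + y  # list concatenation plays the role of [x ; y]
    g = sigmoid(sum(wi * ci for wi, ci in zip(w, concat)) + b)
    return [g * xi + (1.0 - g) * yi for xi, yi in zip(x, y)], g


# Toy attended vectors for one plan position (d = 2).
A_u, A_h, A_k = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
W1, b1 = [0.0] * 4, 0.0  # zero weights -> gate of exactly 0.5
W2, b2 = [0.0] * 4, 0.0

A_1, beta = gated_fuse(A_u, A_h, W1, b1)  # step 1: user profile vs. history
A, gamma = gated_fuse(A_k, A_1, W2, b2)   # step 2: knowledge vs. fused context
```

With both gates at 0.5 the output is a plain average at each step; a trained gate closer to 1 would let the first argument (the user profile in step 1, the knowledge in step 2) dominate.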

5. Experimental Setup

5.1. Datasets

The experiments are conducted using the DuRecDial [12] dataset.

Origin and Characteristics: DuRecDial is a prominent dataset for conversational recommendation. It is unique because its dialogues often feature the system proactively leading the conversation using a variety of interactive actions (e.g., chit-chat, question answering, recommendation). The dataset is in Chinese.
Size: It contains approximately 10,000 multi-turn Chinese conversations and 156,000 utterances. Crucially, it includes annotations for sequences of dialogue actions and topics for the system's turns.
Domain: The domain typically involves recommendations for movies, music, or food, often grounded in specific factual knowledge about these items.
Example Data Sample (conceptual, based on Figure 1): Imagine a user profile containing Name: Yuzhen Hu and Occupation: student, and domain knowledge containing triples such as <Andy Lau, voice cast, "McDull, Prince de la Bun">. If the target action is "Movie Recommendation" and the target topic is "McDull, Prince de la Bun", the conversation might look like this:
Bot: "Hello Yuzhen Hu! I see you're a student. Do you like animated movies?" (Greeting, ask user preference)
User: "Yes, sometimes."
Bot: "Do you know Andy Lau? He is quite a famous actor." (Chat about the star)
User: "Yes, I do."
Bot: "He is in the voice cast of 'McDull, Prince de la Bun', a charming animated film. Have you heard of it?" (Movie Recommendation)
This proactive flow, greeting -> ask user preference -> chat about the star -> movie recommendation, is what the TCP framework aims to plan.
Repurposing for Target-driven Task: The original DuRecDial dataset was adapted for the target-driven task:
- The topic that the user accepted at the end of each conversation was designated as the target topic.
- The system's corresponding action at that point was considered the target action.
- This includes movie, music, food, and point-of-interest recommendations.
Statistics of Repurposed Dataset:
- Total actions: 15
- Total topics: 678 (including a NULL topic for chit-chat)
- Splits: 5,400 conversations for training, 800 for development, 1,804 for testing.
- Conversation Length: Average of 7.9 turns, maximum of 14 turns.
- Plan Transitions: Average of 4.5 different action/topic transitions from the start to the target.
Justification: DuRecDial was chosen because, unlike GoRecDial [6] and TG-ReDial [21] which are primarily reactive, it contains dialogues with proactive system behaviors and explicit dialogue action/topic annotations, making it suitable for evaluating target-driven conversation planning.
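The repurposing rule described above (the accepted topic becomes the target topic, and the system's action at that point becomes the target action) can be sketched as follows. This is an assumption-laden illustration: it identifies the accepted item as the last plan step whose action name ends in "recommendation", and the action labels are illustrative rather than the dataset's exact strings.

```python
def designate_target(plan, suffix="recommendation"):
    """Pick the target (action, topic) from an annotated plan in conversation order.

    ASSUMPTION: the accepted item is the last step whose action name ends with
    `suffix`; later steps such as a goodbye are skipped.
    """
    for action, topic in reversed(plan):
        if action.lower().endswith(suffix):
            return action, topic
    return None  # no recommendation step found


# Hypothetical annotated plan for one conversation.
plan = [
    ("Greeting", "NULL"),
    ("Chat about stars", "Andy Lau"),
    ("Movie recommendation", "McDull, Prince de la Bun"),
    ("Say goodbye", "NULL"),
]
target = designate_target(plan)
```

Scanning from the end avoids mistaking a closing step for the target when a conversation continues briefly after the recommendation is accepted.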

5.2. Evaluation Metrics

The paper employs a comprehensive set of metrics to evaluate both the dialogue generation and conversation planning aspects.

5.2.1. Dialogue Generation Metrics

These metrics assess the quality of the system's generated utterances.

Perplexity (PPL)
- Conceptual Definition: Perplexity is a measure of how well a probability distribution or language model predicts a sample. In simpler terms, it quantifies how "surprised" the model is by new data. A lower PPL indicates that the model is more confident and accurate in its predictions, leading to more fluent and natural-sounding generated text.
- Mathematical Formula: For a sequence of words $W = (w_1, w_2, \dots, w_N)$ , the perplexity is defined as: $\mathrm{PPL}(W) = P(w_1, w_2, \dots, w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1, w_2, \dots, w_N)}}$ This is often computed as: $\mathrm{PPL}(W) = e^{-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i | w_1, \dots, w_{i-1})}$
- Symbol Explanation:
  - $W$ : A sequence of words.
  - $N$ : The total number of words in the sequence.
  - $P(w_1, w_2, \dots, w_N)$ : The joint probability of the entire sequence of words, as predicted by the language model.
  - $P(w_i | w_1, \dots, w_{i-1})$ : The conditional probability of the $i$ -th word given all preceding words, as predicted by the language model.
  - $e$ : Euler's number (base of the natural logarithm).
  - $\log$ : Natural logarithm.
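A minimal sketch of the second form of the formula, assuming the per-token natural-log probabilities have already been obtained from some language model (the function name and input format are illustrative):

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy input: a model that assigns probability 0.25 to each of four tokens.
log_probs = [math.log(0.25)] * 4
ppl = perplexity(log_probs)
```

With a uniform probability of 0.25 per token, the perplexity equals 4, which matches the intuition that the model is choosing among four equally likely options at each step.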
Word-level F1
- Conceptual Definition: Word-level F1 measures the overlap between the words in the generated utterance and the words in the ground-truth (human-written) utterance. It balances precision (how many generated words are correct) and recall (how many correct words were generated). A higher F1 score indicates that the generated utterance contains more of the relevant words from the reference.
- Mathematical Formula: $\mathrm{Precision} = \frac{\text{Number of common words}}{\text{Number of words in generated utterance}}$ $\mathrm{Recall} = \frac{\text{Number of common words}}{\text{Number of words in ground-truth utterance}}$ $\mathrm{F1} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
- Symbol Explanation:
  - Number of common words: Count of words present in both the generated and ground-truth utterances.
  - Number of words in generated utterance: Total count of words in the system's output.
  - Number of words in ground-truth utterance: Total count of words in the reference human utterance.
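The formula can be sketched as follows. Counting common words as a multiset overlap (each word is matched at most as often as it appears in the reference) is an assumption of this sketch; the paper's evaluation script may count differently. For Chinese text with character-based tokenization, the tokens would be characters.

```python
from collections import Counter
from typing import List


def word_f1(generated: List[str], reference: List[str]) -> float:
    """Token-level F1 using multiset overlap between the two token lists."""
    common = Counter(generated) & Counter(reference)  # per-token minimum counts
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(generated)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


generated = ["i", "like", "this", "movie"]
reference = ["i", "like", "that", "movie", "too"]
score = word_f1(generated, reference)  # precision 3/4, recall 3/5
```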
BLEU (Bilingual Evaluation Understudy)
- Conceptual Definition: BLEU score is a metric for evaluating the quality of text which has been machine-translated or, in this case, generated. It compares the generated text to one or more high-quality reference texts and measures the $n$ -gram overlap. It assigns a penalty for brevity if the generated text is too short. Higher BLEU scores indicate closer resemblance to human-written text. BLEU-1 uses unigram precision only, while BLEU-2 combines unigram and bigram precision, and so on.
- Mathematical Formula: $\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log \mathrm{Precision}_n\right)$ where $\mathrm{BP}$ is the brevity penalty: $\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{(1-r/c)} & \text{if } c \le r \end{cases}$
- Symbol Explanation:
  - $N$ : The maximum $n$ -gram order to consider (e.g., for BLEU-2, $N=2$ ).
  - $w_n$ : Weight for each $n$ -gram precision (typically $1/N$ ).
  - $\mathrm{Precision}_n$ : The $n$ -gram precision, calculated as the count of matched $n$ -grams in the candidate text divided by the total number of $n$ -grams in the candidate text.
  - $c$ : Length of the candidate (generated) utterance.
  - $r$ : Effective reference corpus length (closest reference length to $c$ ).
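A simplified single-reference, sentence-level sketch of this formula with uniform weights $w_n = 1/N$ is shown below. It uses clipped n-gram counts, as in standard BLEU, and returns 0 when any n-gram precision is zero; smoothing, multiple references, and corpus-level aggregation are deliberately left out, so it is not a substitute for an established BLEU implementation.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: List[str], reference: List[str], max_n: int = 2) -> float:
    """Sentence-level BLEU-N against one reference, uniform weights, no smoothing."""
    if not candidate:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        total = sum(cand.values())
        matched = sum((cand & ref).values())  # clipped matches
        if total == 0 or matched == 0:
            return 0.0
        log_sum += math.log(matched / total) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_sum)
```

For example, the candidate `["a", "b"]` against the reference `["a", "b", "c", "d"]` has perfect unigram and bigram precision, so its BLEU-2 is determined entirely by the brevity penalty $e^{1 - 4/2}$.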
DIST (Distinct)
- Conceptual Definition: Distinct (DIST) metrics (DIST-1, DIST-2) evaluate the diversity of generated responses. They measure the proportion of unique unigrams (DIST-1) or bigrams (DIST-2) within the generated output. A higher DIST score indicates less repetitive and more diverse responses, which is desirable for engaging dialogue.
- Mathematical Formula: For DIST-N: $\text{DIST-}N = \frac{\text{Count of unique N-grams}}{\text{Total count of N-grams}}$
- Symbol Explanation:
  - Count of unique N-grams: The number of distinct N-gram sequences found in the generated utterances.
  - Total count of N-grams: The total number of N-gram sequences in the generated utterances.
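A short sketch of DIST-N, assuming the n-grams are pooled over all generated utterances before counting (a per-utterance average is another common variant, and the source does not say which one the paper uses):

```python
from typing import List


def distinct_n(utterances: List[List[str]], n: int) -> float:
    """Ratio of unique n-grams to total n-grams, pooled over all utterances."""
    all_ngrams = [
        tuple(tokens[i:i + n])
        for tokens in utterances
        for i in range(len(tokens) - n + 1)
    ]
    if not all_ngrams:
        return 0.0
    return len(set(all_ngrams)) / len(all_ngrams)


outputs = [["a", "b", "a"], ["a", "b"]]
dist1 = distinct_n(outputs, 1)  # 2 unique unigrams out of 5
dist2 = distinct_n(outputs, 2)  # 2 unique bigrams out of 3
```

Because the denominator grows with the size of the test set while the number of unique n-grams saturates, pooled DIST values on large corpora tend to be small, which is consistent with the low absolute values commonly reported.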
Knowledge F1 (Know. F1)
- Conceptual Definition: Knowledge F1 specifically evaluates how well the system generates correct knowledge entities (e.g., topics, attributes, entity names) that are relevant to the conversation and derived from the domain knowledge. It measures the F1 score for knowledge triples or entities present in the generated text compared to the ground truth.
- Mathematical Formula: Similar to word-level F1, but applied to knowledge entities. Let $G$ be the set of knowledge entities in the generated utterance and $R$ be the set of knowledge entities in the reference utterance. $\mathrm{Precision} = \frac{|G \cap R|}{|G|}$ $\mathrm{Recall} = \frac{|G \cap R|}{|R|}$ $\mathrm{F1} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
- Symbol Explanation:
  - $G$ : Set of knowledge entities identified in the generated utterance.
  - $R$ : Set of knowledge entities identified in the ground-truth utterance.
  - $|G \cap R|$ : Number of common knowledge entities between generated and reference.
  - $|G|$ : Total number of knowledge entities in the generated utterance.
  - $|R|$ : Total number of knowledge entities in the ground-truth utterance.
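The set-based variant can be sketched as follows. How entities are extracted from an utterance is not covered here; the sketch assumes that step has already produced two sets of entity strings.

```python
from typing import Set


def knowledge_f1(generated: Set[str], reference: Set[str]) -> float:
    """F1 over sets of knowledge entities (G = generated, R = reference)."""
    overlap = len(generated & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(generated)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


G = {"Andy Lau", "McDull, Prince de la Bun"}
R = {"McDull, Prince de la Bun"}
score = knowledge_f1(G, R)  # precision 1/2, recall 1
```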
Target Recommendation Success Rate (Target Succ.)
- Conceptual Definition: This metric is crucial for the target-driven paradigm. It measures how often the system successfully generates the designated target topic (i.e., makes the target recommendation) at the "target turn" of the conversation. A higher success rate indicates better goal-achievement for the target-driven system.
- Mathematical Formula: $\mathrm{Target~Succ.} = \frac{\text{Number of dialogues where target topic is correctly generated at target turn}}{\text{Total number of test dialogues}}$
- Symbol Explanation: Self-explanatory from the definition.
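A minimal sketch of this ratio is given below. Treating "correctly generated" as a plain substring match between the target topic and the utterance produced at the target turn is an assumption of the sketch; the paper's exact matching rule may differ.

```python
from typing import List, Tuple


def target_success_rate(dialogues: List[Tuple[str, str]]) -> float:
    """dialogues: (utterance generated at the target turn, target topic) pairs."""
    if not dialogues:
        return 0.0
    hits = sum(1 for utterance, topic in dialogues if topic in utterance)
    return hits / len(dialogues)


test_dialogues = [
    ("You might enjoy McDull, Prince de la Bun.", "McDull, Prince de la Bun"),
    ("How about some music instead?", "McDull, Prince de la Bun"),
]
rate = target_success_rate(test_dialogues)  # one hit out of two dialogues
```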

5.2.2. Conversation Planning Metrics

These metrics evaluate the accuracy of the planned dialogue actions and topics.

Accuracy (Acc.)
- Conceptual Definition: Accuracy measures the proportion of correctly predicted/generated dialogue actions or dialogue topics for the next step in the conversation. It's a standard classification metric.
- Mathematical Formula: $\mathrm{Acc.} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$
- Symbol Explanation: Self-explanatory from the definition.
Bigram Accuracy (Bi. Acc.)
- Conceptual Definition: Bigram Accuracy is a more lenient version of accuracy for conversation planning. It acknowledges that multiple planning strategies might be reasonable in a conversation. It therefore expands the set of ground-truth labels for a turn to include not only that turn's own action/topic but also those of the previous turn and the following turn. A prediction that matches any label in this expanded set counts as correct.
- Mathematical Formula: (Conceptual, as specific expansion rules can vary) $\mathrm{Bi.~Acc.} = \frac{\text{Number of predictions matching the ground-truth label of the current, previous, or following turn}}{\text{Total number of predictions}}$
- Symbol Explanation: Self-explanatory from the definition.
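The difference between the two planning metrics can be sketched as follows, assuming predictions and gold labels are aligned turn by turn within one conversation. The window of one turn on each side follows the description above; the function names are illustrative.

```python
from typing import List


def plan_accuracy(pred: List[str], gold: List[str]) -> float:
    """Strict accuracy: a prediction must equal the gold label of the same turn."""
    if not gold or len(pred) != len(gold):
        raise ValueError("pred and gold must be non-empty and equally long")
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def bigram_accuracy(pred: List[str], gold: List[str]) -> float:
    """Lenient accuracy: gold labels of the previous and following turns also count."""
    if not gold or len(pred) != len(gold):
        raise ValueError("pred and gold must be non-empty and equally long")
    hits = 0
    for t, p in enumerate(pred):
        allowed = set(gold[max(0, t - 1): t + 2])  # turns t-1, t, t+1
        hits += p in allowed
    return hits / len(gold)


gold = ["Greeting", "Ask user", "Chat about star", "Movie Recommendation"]
pred = ["Greeting", "Chat about star", "Chat about star", "Movie Recommendation"]
strict = plan_accuracy(pred, gold)
lenient = bigram_accuracy(pred, gold)
```

In this toy case the planner moves to "Chat about star" one turn early. Strict accuracy penalizes that step, while bigram accuracy accepts it because the label appears in the following turn of the gold plan.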

5.3. Baselines

The proposed TCP framework is compared against a range of baseline models, categorized by their primary function:

5.3.1. General Dialogue Generation Models

These models are primarily focused on generating fluent and coherent responses without specific recommendation or planning mechanisms, often leveraging large-scale pre-training. They serve as a benchmark for basic generation quality.

Transformer [17]: The foundational Transformer architecture, widely used for sequence-to-sequence tasks. It's included to show the impact of basic Transformer without pre-training or specialized dialogue enhancements.
DialoGPT [20]: A pre-trained dialogue generation model (based on GPT-2) specifically designed for conversational response generation. It represents a strong baseline for open-domain dialogue.
BART [8]: A denoising sequence-to-sequence pre-trained model (Transformer-based) capable of strong language generation. It's used as a backbone model for TCP as well.
GPT-2 [15]: A pre-trained autoregressive generation model known for its ability to produce high-quality, coherent text. It's also used as a backbone model for TCP.

5.3.2. State-of-the-art Recommendation Dialogue Generation Models

These models incorporate mechanisms for recommendations and often follow a predict-then-generate paradigm.

MGCG_G [12]: A model that employs a predicted next dialogue action and topic to guide its utterance generation. This represents a strong predict-then-generate baseline from the DuRecDial paper itself.
KERS [19]: A knowledge-enhanced framework for recommendation dialogue systems that incorporates multiple subgoals. It also operates under a predict-then-generate principle.

5.3.3. Conversation Planning Baselines (for the planning sub-task)

These models are specifically compared for their ability to predict or generate dialogue actions and topics. The input for these baselines is adjusted to be fair to the target-driven setting (i.e., they also receive the target action and topic).

MGCG [12]: This model (the planning component of MGCG_G) aims to perform multi-task predictions for the next dialogue action and topic. The paper notes that its original formulation assumes ground-truth historical actions/topics are known, but for fair comparison, it's adapted to receive only the target action and topic as input, similar to TCP.
KERS [19]: The planning component of KERS, which uses a Transformer network to generate the next dialogue action and topic. It is also adapted to the target-driven input setting.
BERT [2]: A pre-trained BERT model fine-tuned for this task by adding two fully-connected layers to jointly predict the next dialogue action and topic. This serves as a strong PLM-based baseline for the planning sub-task.

5.4. Implementation Details

Tokenization: Since the DuRecDial dataset is in Chinese, character-based tokenization is used.
TCP Training:
- Uses the pre-trained Chinese BERT_base model for initialization.
- Vocabulary size: 21,128.
- Hidden size: 768.
- The target-driven conversation planner (the Transformer decoder) has 12 layers and 8 attention heads. Its embeddings are randomly initialized.
- Optimizer: Adam [7] with an initial learning rate of 1e-5.
- Training schedule: Trained for 10 epochs with a warm-up phase over the first 3,000 training steps and linear decay of the learning rate.
- Model Selection: Best model chosen based on performance on the validation set.
- Inference: Greedy search decoding is used for generating plan sequences.
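The warm-up and linear-decay schedule listed in the TCP training settings can be sketched as a plain function of the step count. Several details are assumptions here, not facts from the paper: that 1e-5 is the peak rate reached at the end of warm-up, that the warm-up itself is linear, that the rate decays to zero, and the value of `total_steps` (which depends on batch size and is not given above).

```python
def learning_rate(step: int, total_steps: int,
                  peak_lr: float = 1e-5, warmup_steps: int = 3000) -> float:
    """Linear warm-up to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run length of 10,000 steps, chosen only for illustration.
schedule = [learning_rate(s, total_steps=10_000) for s in (0, 1500, 3000, 6500, 10_000)]
```

In practice a library scheduler would normally be used instead of a hand-written function; the sketch only makes the shape of the schedule explicit.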
Dialogue Generation:
- Backbone models: Chinese BART_base and GPT-2_base from Hugging Face's Transformers [18] library are used for fine-tuning after TCP planning.
- Parameter settings: Each backbone model uses the same parameter settings as in the baseline experiments.

6. Results & Analysis

6.1. Core Results

6.1.1. Dialogue Generation Evaluation Results

The following table shows the results from Table 1, comparing TCP with various general and recommendation dialogue generation models. The metrics cover fluency (PPL), content quality (F1, BLEU, Know. F1), diversity (DIST), and the Target Success Rate.

Model	PPL (↓)	F1 (%)	BLEU-1/2	DIST-1/2	Know. F1 (%)	Target Succ. (%)
Generation
Transformer	22.83	27.95	0.224 / 0.165	0.001 / 0.005	17.73	9.28
DialoGPT	5.45	29.60	0.287 / 0.213	0.005 / 0.036	27.26	40.31
BART	6.29	34.07	0.312 / 0.242	0.008 / 0.067	38.16	53.84
GPT-2	4.93	38.93	0.367 / 0.291	0.007 / 0.058	43.83	60.49
Predict-then-generate
MGCG_G	18.76	33.48	0.279 / 0.203	0.007 / 0.043	35.12	42.06
KERS	12.55	34.04	0.302 / 0.220	0.005 / 0.030	40.75	49.40
Ours
Ours (BART w/ TCP)	5.23	36.41*	0.335* / 0.254*	0.008 / 0.082	44.30*	62.73*
Ours (GPT-2 w/ TCP)	4.22	41.40*	0.376* / 0.299*	0.007 / 0.072	48.63*	68.57*

PPL (↓): Lower is better.
F1 (%), BLEU-1/2, Know. F1 (%), Target Succ. (%): Higher is better.
DIST-1/2: Higher is better (for diversity).
*: Significant improvements over the backbone model results (t-test, $p < 0.05$ ).

Analysis of Dialogue Generation Results:

Vanilla Transformer: Performs the worst on every metric, demonstrating the necessity of pre-training and specialized dialogue mechanisms. Its Target Succ. is extremely low (9.28%), indicating almost no target-driven capability.
Pre-trained Models (DialoGPT, BART, GPT-2): These models show significant improvements over the vanilla Transformer in terms of PPL, F1, BLEU, DIST, and Know. F1. This highlights the power of pre-trained language models in generating fluent and diverse responses. Notably, GPT-2 achieves the lowest PPL (4.93) and highest F1 (38.93%), BLEU, and Know. F1 among standalone generation models, and a reasonable Target Succ. of 60.49%.
Predict-then-Generate Models (MGCG_G, KERS): These models, despite not using PLMs in their original forms (as noted by the authors), outperform the vanilla Transformer on F1, BLEU, and Know. F1, and they beat DialoGPT on F1 and Know. F1. On BLEU, KERS also beats DialoGPT, while MGCG_G trails it slightly (0.279 / 0.203 versus 0.287 / 0.213). This suggests that incorporating some form of action/topic planning (even if local) helps generate more informative utterances. However, their Target Succ. (42.06% for MGCG_G, 49.40% for KERS) is notably lower than that of BART (53.84%) or GPT-2 (60.49%) alone, suggesting they struggle to lead users towards a designated target.
Our Models (TCP-Enhanced):
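- Between them, the two TCP-enhanced models achieve the best score on every metric: Ours (BART w/ TCP) leads on DIST (tying BART on DIST-1), and Ours (GPT-2 w/ TCP) leads on all the others.
- Each significantly improves F1, BLEU, Know. F1, and, most importantly, Target Succ. over its own backbone model (BART or GPT-2). Ours (BART w/ TCP) still trails plain GPT-2 on F1 and BLEU (e.g., 36.41% vs. 38.93% F1), but surpasses it on Know. F1 and Target Succ.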
- Ours (GPT-2 w/ TCP) achieves the lowest PPL (4.22), highest F1 (41.40%), BLEU-1 (0.376), BLEU-2 (0.299), Know. F1 (48.63%), and Target Succ. (68.57%).
- The substantial improvement in Target Succ. (e.g., from 60.49% for GPT-2 to 68.57% for Ours (GPT-2 w/ TCP)) is a strong validation of the TCP framework's effectiveness in guiding the system to achieve its target recommendation goal. The conversation planning explicitly helps the system to generate appropriate utterances that lead to the target.
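The size of these Target Succ. gains can be checked with a few lines of arithmetic on the Table 1 values. The helper name `gains` is illustrative; the numbers are copied from the table above.

```python
from typing import Tuple

# Target Succ. (%) from Table 1
backbone = {"BART": 53.84, "GPT-2": 60.49}
with_tcp = {"BART": 62.73, "GPT-2": 68.57}


def gains(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    return round(absolute, 2), round(100 * absolute / before, 2)


summary = {name: gains(backbone[name], with_tcp[name]) for name in backbone}
```

Under this calculation BART gains 8.89 points (about 16.5% relative) and GPT-2 gains 8.08 points (about 13.4% relative), so the weaker backbone benefits slightly more from the plan guidance.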

6.1.2. Conversation Planning Evaluation Results

The following table shows the results from Table 2, comparing TCP against other planning methods for predicting dialogue actions and topics.

Model	Dialogue Action Acc. (%)	Dialogue Action Bi. Acc. (%)	Dialogue Topic Acc. (%)	Dialogue Topic Bi. Acc. (%)
MGCG	84.78	86.52	64.31	66.65
KERS	89.17	90.49	76.34	79.33
BERT	90.19	91.35	83.53	85.61
TCP	92.22*	93.82*	87.67*	89.40*

Acc. (%) and Bi. Acc. (%): Higher is better.
*: Significant improvements over the baseline models (t-test, $p < 0.05$ ).

Analysis of Conversation Planning Results:

Difficulty of Topic Planning: It is observed that dialogue topic planning is generally more challenging than dialogue action planning. This is expected, as the topic space (678 topics) is much larger and more granular than the action space (15 actions).
Baseline Planning Models:
- MGCG shows the lowest performance, especially for topic planning.
- KERS significantly improves over MGCG, demonstrating the benefit of its knowledge-enhanced mechanisms.
- BERT fine-tuned for planning outperforms both MGCG and KERS, modestly on actions (90.19% vs. 89.17% Acc. for KERS) and considerably on topics (83.53% vs. 76.34% Acc. for KERS). This underscores the power of pre-trained models even for the planning sub-task.
Our TCP Model:
- TCP achieves the highest performance across all planning metrics.
- It shows significant improvements over all baselines for both dialogue action accuracy (92.22% Acc., 93.82% Bi. Acc.) and dialogue topic accuracy (87.67% Acc., 89.40% Bi. Acc.).
- These results validate that TCP is highly effective in generating an appropriate sequence of dialogue actions and topics. The target-driven approach, including backward planning and the knowledge-target mutual attention, helps the planner make more informed decisions about what to say next to reach the goal. This improved planning directly translates to better dialogue generation as shown in the previous table.

6.2. Ablations / Parameter Sensitivity

The paper does not explicitly report ablation studies or detailed parameter sensitivity analyses. However, the comparison between TCP-enhanced models and their respective backbone models (e.g., BART w/ TCP vs. BART) implicitly serves as a form of ablation, demonstrating the standalone contribution of the TCP framework to dialogue generation performance. The significant improvements observed when TCP is integrated (marked with * in Table 1) confirm that the conversation planning component is indeed crucial and effective.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces and addresses the target-driven recommendation dialogue task, a paradigm in which the dialogue system proactively steers the conversation towards a designated target item or topic. The core contribution is the Target-driven Conversation Planning (TCP) framework, which plans a sequence of dialogue actions and topics. The planning process operates backward from the target, leveraging target-side information and integrating user profiles, domain knowledge, and conversation history through attention and fusion mechanisms. The planned content then explicitly guides the dialogue generation process. Experimental results on the DuRecDial dataset show that TCP significantly improves the accuracy of conversation planning and also improves dialogue generation quality, most notably the target recommendation success rate.

7.2. Limitations & Future Work

The authors outline two directions for future work:

More Precise Planning: Investigating methods to plan conversation paths with even greater precision. This might involve more granular control over dialogue states or incorporating more sophisticated reasoning capabilities.
More Effective Guidance for Dialogue Generation: Exploring ways to guide the dialogue generation models more effectively using the planned content. This could involve tighter integration between the planner and the generator, or developing generation models specifically tailored to execute complex plans.

7.3. Personal Insights & Critique

Novelty and Impact: The paradigm shift from reactive to proactive target-driven recommendation dialogue is a highly valuable contribution. It moves conversational AI closer to human-like interaction, where persuasion and guidance are common. The backward planning approach is an elegant solution to infuse target awareness throughout the entire conversation flow, rather than just locally predicting the next turn. This focus on a long-term goal is a significant step forward.
Beginner-Friendliness: The paper is well-structured and explains its core concepts clearly. The illustrative example (Figure 1) is very helpful for understanding the problem. The use of standard Transformer-based architectures makes the methodology accessible to those familiar with modern NLP.
Potential Limitations Not Explicitly Discussed:
- User Acceptance and Naturalness: While the paper shows quantitative improvements, the "naturalness" of the conversation flow (i.e., whether users feel they are being naturally led or overtly manipulated) is often best evaluated through extensive human user studies. The current metrics, while robust, don't fully capture this subjective aspect.
- Robustness to User Deviation: What happens if the user strongly deviates from the planned path? How gracefully does the system recover or re-plan? The current framework assumes a relatively linear progression towards the target, which might be challenged by complex or unexpected user behaviors.
- Multiple Targets/Dynamic Targets: The current setup assumes a single, predetermined target. In real-world scenarios, a system might have multiple potential recommendations, or the target itself might need to adapt based on real-time user engagement and evolving preferences. This would require a more dynamic planning mechanism.
- Computational Cost of Planning: While greedy search is used for inference, the planning process, especially if it involves exploring multiple paths or more complex reasoning, could be computationally intensive, impacting real-time interaction.
- Generalizability to Other Domains/Languages: The experiments are conducted on a Chinese dataset. While the Transformer architecture is language-agnostic, the specific performance and nuances might vary in other languages or domains with different cultural conversational norms or knowledge graph structures.
Future Research Avenues:
- Personalized Planning: Incorporating deeper user modeling to generate plans that are not just target-driven but also highly personalized to individual user traits, moods, and learning styles.
- Explainable Planning: Making the planning process more transparent, perhaps by generating justifications for chosen actions or topics, could build greater user trust.
- Reinforcement Learning for Planning: Training the planner using reinforcement learning with rewards for target success and user engagement could lead to more adaptive and robust planning strategies in dynamic conversational environments.
- Multi-modal Integration: Extending target-driven planning to multi-modal dialogue systems where recommendations might involve visual or auditory elements, and planning could account for different interaction modalities.
  
Overall, this paper provides a robust framework and a significant step towards more intelligent and proactive conversational AI, particularly for recommendation tasks. Its emphasis on explicit planning and target-driven guidance sets a strong foundation for future work in this exciting area.

Follow Me: Conversation Planning for Target-driven Recommendation Dialogue Systems

TL;DR Summary