AiPaper
Paper status: completed

Experience as Source for Anticipation and Planning: Experiential Policy Learning for Target-driven Recommendation Dialogues

Published: 11/01/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This work introduces Experiential Policy Learning, which leverages past dialogue experiences via an experiential scoring function and combines LLMs with Monte-Carlo Tree Search for training-free, hierarchical reasoning in target-driven recommendation dialogues, demonstrating superior performance on benchmark datasets.

Abstract

Target-driven recommendation dialogues present unique challenges in dialogue management due to the necessity of anticipating user interactions for successful conversations. Current methods face significant limitations: (I) inadequate capabilities for conversation anticipation, (II) computational inefficiencies due to costly simulations, and (III) neglect of valuable past dialogue experiences. To address these limitations, we propose a new framework, Experiential Policy Learning (EPL), to enhance such dialogues. Specifically, EPL embodies the principle of Learning From Experience, facilitating anticipation with an experiential scoring function that estimates the potential of a dialogue state based on similar past interactions stored in long-term memory. (Published in the Findings of the Association for Computational Linguistics: EMNLP 2024, pages 14179–14198, November 12-16, 2024.)


In-depth Reading


1. Bibliographic Information

1.1. Title

Experience as Source for Anticipation and Planning: Experiential Policy Learning for Target-driven Recommendation Dialogues

1.2. Authors

Huy Dao, Yang Deng, Khanh-Huyen Bui, Dung D. Le, Lizi Liao

Affiliations:

  • Singapore Management University: Huy Dao, Yang Deng, Lizi Liao
  • FPT Software AI Center: Khanh-Huyen Bui
  • College of Engineering and Computer Science, VinUniversity: Dung D. Le

1.3. Journal/Conference

Published in the Findings of the Association for Computational Linguistics: EMNLP 2024 (Empirical Methods in Natural Language Processing).

Comment on Venue's Reputation: EMNLP is one of the premier conferences in the field of Natural Language Processing (NLP). It is highly respected for showcasing cutting-edge research, attracting a global audience of academics and industry professionals. Publication at EMNLP signifies significant contributions and rigorous peer review in the NLP community.

1.4. Publication Year

2024

1.5. Abstract

Target-driven recommendation dialogues face significant challenges in managing conversations, primarily due to the need to anticipate user interactions for successful outcomes. Current methods are limited by: (I) inadequate capabilities for anticipating conversation flow, (II) computational inefficiencies stemming from costly simulations, and (III) a failure to leverage valuable past dialogue experiences. To overcome these limitations, the authors propose a novel framework called Experiential Policy Learning (EPL). EPL is designed around the principle of Learning From Experience, facilitating anticipation through an experiential scoring function. This function estimates the potential of a dialogue state by drawing upon similar past interactions stored in a long-term memory. To demonstrate its flexibility, the paper introduces Tree-structured EPL (T-EPL) as a training-free realization that integrates Large Language Models (LLMs) and Monte-Carlo Tree Search (MCTS). T-EPL employs LLMs to assess past dialogue states and uses MCTS to enable hierarchical and multi-level reasoning. Extensive experiments conducted on two established datasets confirm the superiority and efficacy of T-EPL compared to existing methods.

Official PDF: https://aclanthology.org/2024.findings-emnlp.829.pdf. The paper was published in the Findings of EMNLP 2024.

2. Executive Summary

2.1. Background & Motivation (Why)

The core problem addressed by this paper lies in the domain of target-driven recommendation dialogues. In these systems, the goal is not just to recommend items based on user preferences, but to proactively steer the conversation towards specific, pre-defined target items (e.g., promoting a new product).

This problem is important because traditional conversational recommender systems (CRSs) often adopt a reactive approach, identifying user interests as they emerge. While effective in some contexts, these systems lack the ability to proactively guide users towards specific items, a capability crucial for business objectives like promoting new products and increasing sales.

Existing target-driven CRS models face several significant limitations:

  • (I) Inadequate capabilities for conversation anticipation: Current models primarily focus on evaluating individual next-turn interactions, neglecting the ability to foresee and plan for future user-system interactions. This foresight is critical for successfully guiding conversations toward specific target items.

  • (II) Computational inefficiencies due to costly simulations: Previous attempts to incorporate foresight often rely on expensive online simulations of user interactions (e.g., using Reinforcement Learning or Monte-Carlo Tree Search with LLMs). These simulations can be computationally prohibitive, especially at inference time.

  • (III) Neglect of valuable past dialogue experiences: Almost all prior work fails to leverage newly obtained interactions during inference to continuously enhance their performance. This means policies remain static and cannot adapt to evolving user behaviors or conversation dynamics.

    The paper's novel approach, Experiential Policy Learning (EPL), aims to overcome these limitations by embodying the principle of Learning From Experience. It proposes to use similar past interactions from a long-term memory to anticipate conversational trajectories and estimate the potential value of dialogue states, thereby reducing the need for costly real-time simulations and enabling adaptability.

2.2. Main Contributions / Findings (What)

The paper makes three primary contributions:

  1. A novel dialogue policy learning framework, Experiential Policy Learning (EPL): This framework integrates past interactions into the planning process for target-driven recommendation dialogues. It facilitates future anticipation by utilizing an experiential scoring function that approximates the potential value of a dialogue state based on similar interactions stored in memory, rather than relying on expensive online rollouts.
  2. Tree-structured EPL (T-EPL), a training-free realization of EPL: T-EPL is introduced as a flexible, training-free implementation that leverages Large Language Models (LLMs) to assess the potential of past dialogue states and integrates this into a Monte-Carlo Tree Search (MCTS) algorithm. MCTS is used to achieve hierarchical and multi-level reasoning, while the experiential scoring function reduces the need for costly LLM-based evaluations during the tree search. This design allows T-EPL to quickly adapt to newly encountered interactions.
  3. Extensive experimental validation: The paper conducts interactive evaluations on two published datasets (DuRecDial 2.0 and INSPIRED). The experiments demonstrate that T-EPL consistently outperforms state-of-the-art approaches in terms of both performance (higher objective and subjective success rates, fewer average turns) and efficiency (lower API call complexity during inference). The ablation studies confirm the effectiveness of EPL's core components.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a reader should be familiar with the following concepts:

  • Conversational Recommender Systems (CRSs): These are interactive systems that help users find items (e.g., movies, products) through natural language conversations. Unlike traditional recommender systems that might present a static list, CRSs engage in multi-turn dialogues to understand user preferences and provide recommendations.
  • Target-driven Recommendation Dialogues: A specialized type of CRS where the system has a pre-defined target item it aims to recommend (e.g., a specific new movie, a particular restaurant). The system's dialogue policy is designed to proactively guide the conversation towards this target, making the recommendation at an opportune moment. This contrasts with purely reactive CRSs that just respond to user queries.
  • Dialogue Management/Policy Learning: This is the component of a dialogue system responsible for deciding what the system should say or do next, given the current dialogue state (the history of the conversation, user preferences, system goals). Policy learning involves training models to make these decisions effectively, often to optimize certain objectives like task success or user satisfaction.
  • Markov Decision Process (MDP): A mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. An MDP is defined by a tuple M = (S, A, R, T, γ):
    • S: A set of possible states (e.g., the current dialogue history, the user's stated preferences).
    • A: A set of possible actions the agent can take (e.g., ask a question, make a recommendation, chit-chat).
    • R(s, a, s'): A reward function that specifies the immediate reward received after taking action a in state s and transitioning to state s'. This quantifies the desirability of outcomes.
    • T(s, a, s') or P(s' | s, a): A transition function (or probability distribution) that describes the probability of moving from state s to state s' after taking action a.
    • γ: A discount factor (0 ≤ γ ≤ 1) that determines the present value of future rewards; rewards received in the near future are worth more than those received in the distant future. In the context of this paper, a state s_t at turn t is defined as (h_{≤t}, a_{≤t}), where h_{≤t} is the dialogue context and a_{≤t} is the sequence of previous actions.
  • Monte-Carlo Tree Search (MCTS): A search algorithm often used in artificial intelligence for decision processes, particularly in games with a vast state space (like Go) or sequential decision-making. MCTS builds a search tree by iteratively simulating possible trajectories. It consists of four main steps:
    1. Selection: Starting from the root node (current state), it traverses the tree by selecting child nodes that balance exploration (visiting less-explored nodes) and exploitation (visiting nodes that seem promising) using a strategy like Upper Confidence bounds applied to Trees (UCT).
    2. Expansion: When a node with unvisited children is reached, one (or more) of its children is added to the tree.
    3. Simulation/Rollout: From the newly expanded node, a simulation (or "rollout") is performed to a terminal state (e.g., end of game/dialogue) by randomly choosing actions. The outcome of this simulation is a reward.
    4. Backpropagation: The reward obtained from the simulation is backpropagated up the tree, updating the statistics (e.g., visit counts, average rewards) of all nodes on the path from the new node to the root. MCTS is used here to facilitate hierarchical and multi-level reasoning for dialogue planning.
  • Large Language Models (LLMs): Powerful neural networks trained on vast amounts of text data, capable of generating human-like text, understanding context, and performing various language tasks. In this paper, LLMs (specifically GPT-3.5-Turbo and Llama 2) are used for:
    • User Simulation: Mimicking a user's responses in a dialogue to allow for interactive evaluation.
    • Target-driven Assessment: Evaluating the success of a conversation or the potential value of a dialogue state by providing a score based on specific criteria.
  • Dense Retrieval: A technique used to find relevant items (e.g., dialogue states, documents) from a large collection by comparing their dense vector representations (embeddings). A retrieval model (e.g., all-MiniLM-L6-v2 from Sentence Transformers) converts dialogue states into high-dimensional vectors, and similarity (e.g., cosine similarity) is used to find the k-nearest neighbors. Faiss is a library for efficient similarity search and clustering of dense vectors.
  • Reinforcement Learning (RL): A paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. It involves trial and error, where the agent receives feedback (rewards) for its actions. Some prior works use RL to fine-tune dialogue policies.

3.2. Previous Works

The paper discusses several categories of related work:

  • Target-driven Proactive Dialogue Systems:
    • Early efforts (e.g., Liu et al., 2020; Zhang et al., 2021; Liu et al., 2021; Wang et al., 2022; Deng et al., 2023b; Wang et al., 2023a; Dao et al., 2023) focused on guiding conversations toward predefined targets like negotiation or recommendation. These methods include techniques like Brownian-motion dialogue planning (Wang et al., 2023b) and long-short term strategic balancing (Dao et al., 2023).
    • Limitation: A common drawback of these methods is their inability to incorporate foresight user-system interactions, which are crucial for anticipating conversation trajectories.
  • Approaches to Address Foresight:
    • Deng et al. (2023a) used LLM-generated interactions to fine-tune a pre-trained policy with Reinforcement Learning (RL).
      • Limitation: The resulting pre-trained policy is static and struggles to adapt to newly encountered situations during deployment.
    • Yu et al. (2023) proposed using open-loop Monte-Carlo Tree Search (MCTS) for future interaction estimation.
      • Limitation: This approach suffers from computational inefficiencies due to frequent costly LLM-based evaluations during inference.
    • General Limitation: Both types of approaches often fail to leverage valuable interactions learned during inference to dynamically enhance performance.
  • Reflecting on Experience:
    • This concept has shown benefits in various domains:
      • Multimodal response generation (Ye et al., 2022): Retrieving similar dialogues from training data to improve response quality.
      • Decision-making (Shinn et al., 2023): Using results from past trials to enhance predictions in current ones.
      • Recommendation (Lin et al., 2023): Extracting information from dialogues with similar users to understand current user preferences.
    • Connection to current paper: These works inspire the core idea of EPL, which leverages analogous past interactions to improve future anticipation in dialogue planning.

3.3. Technological Evolution

The field of conversational recommender systems has evolved from reactive systems, which merely respond to user queries, to proactive or target-driven systems that aim to steer conversations towards specific items. Initially, these proactive systems focused on rule-based or simple learned policies. The integration of Reinforcement Learning provided a framework for optimizing dialogue policies over sequences of interactions. More recently, the advent of powerful Large Language Models (LLMs) has revolutionized the capabilities of dialogue systems, enabling more natural interactions, sophisticated user simulation, and even direct policy generation or evaluation. Monte-Carlo Tree Search (MCTS) has emerged as a robust planning algorithm, offering a way to explore future interaction trajectories and make more informed decisions. However, combining these advanced techniques efficiently and adaptably has remained a challenge, often leading to computational bottlenecks or static policies. This paper attempts to bridge this gap by introducing experiential learning to MCTS, making it more efficient and adaptive without constant reliance on expensive LLM calls during core planning steps.

3.4. Differentiation

The proposed Experiential Policy Learning (EPL) and its realization Tree-structured EPL (T-EPL) differentiate themselves from existing methods by addressing their core limitations:

  • Against simulation-heavy methods (e.g., MCTS with rollouts, GDP-Zero): T-EPL avoids the computational inefficiencies of costly rollout simulations or frequent LLM-based state evaluations by using an experiential scoring function. Instead of simulating new interactions, it retrieves and aggregates values from similar past interactions stored in a memory.
  • Against static pre-trained policies (e.g., PPDPP, RL-tuned policies): T-EPL is designed as a training-free framework that can adapt on the fly. It continuously updates its memory component with newly encountered interactions during inference, allowing its policy to refine and become more effective over time, unlike methods with fixed pre-trained policies.
  • Against memory-less approaches: While previous works in experience reflection exist, EPL specifically integrates this into a dialogue policy learning framework for target-driven CRSs. It explicitly constructs a memory structure of past states and their assessments to guide future anticipation and planning.
  • Addressing anticipation: By leveraging similar past interactions, EPL directly tackles the inadequate capabilities for conversation anticipation that plague existing target-driven CRS models, enabling better foresight and planning for conversational trajectories.

4. Methodology

4.1. Principles

The core idea behind Experiential Policy Learning (EPL) is Learning From Experience. It posits that past interactions, especially those similar to the current dialogue state, contain valuable information for anticipating future outcomes and guiding decision-making. The intuition is that if a particular sequence of actions led to a successful recommendation in a similar past situation, it is likely to be a good strategy now. This principle is embodied through an experiential scoring function that estimates the potential of a dialogue state by recalling and aggregating knowledge from a long-term memory of previously observed interactions.

Tree-structured EPL (T-EPL) extends this by combining the experiential scoring with Monte-Carlo Tree Search (MCTS) to enable robust, hierarchical, and multi-level reasoning for dialogue planning. Instead of relying on expensive real-time simulations or pre-trained neural networks for state evaluation within MCTS, T-EPL uses the experiential scoring function. Furthermore, Large Language Models (LLMs) are utilized to provide initial assessments of past dialogue states, populating the memory with valuable experiential data in a training-free manner.

4.2. Steps & Procedures

The EPL framework first defines a way to assess the potential value of a dialogue state with respect to a target item. Then, T-EPL shows how to realize this framework using MCTS, LLMs, and a dense retrieval model for memory management.

4.2.1. Experiential Policy Learning (EPL)

Target-driven Scoring Function: The goal is to estimate the potential value of a dialogue state s with respect to a target item v, denoted as F(s, v). The outcome r of a conversation is modeled as a binary random variable, r ∈ L = {success, fail}. The function F(s, v) is initially defined as the expected value of an outcome function f(r):

\mathcal{F}(s, v) = \mathbb{E}_{r \sim \mathbf{P}(r \mid s, v)}[f(r)] = \sum_{r \in \mathcal{L}} \mathbf{P}(r \mid s, v) f(r)

where:

  • F(s, v): The potential value of state s and target item v.

  • f(r): A scalar-valued function mapping an outcome (success/fail) to a numerical value (e.g., 1 for success, 0 for fail).

  • P(r | s, v): The probability of outcome r given the state s and target item v.

    Estimating P(r | s, v) can be challenging because s might not contain enough information. To better capture future interactions, the function is re-formalized by considering dialogue continuations c:

\mathcal{F}(s, v) = \sum_{r \in \mathcal{L}} f(r) \sum_{c \sim \mathbf{P}(c \mid s, v)} \mathbf{P}(r \mid c, s, v) \mathbf{P}(c \mid s, v)

where:

  • c: A dialogue continuation, encompassing subsequent user-system interactions.

  • P(c | s, v): The probability of a specific continuation c given state s and target v.

  • P(r | c, s, v): The probability of outcome r given a continuation c, state s, and target v. If v is explicitly mentioned in c, P(r = success | c, s, v) could be 1. However, explicitly computing or modeling P(c | s, v) is computationally intractable. While online rollout simulations can sample from this distribution, they are computationally inefficient at inference time.

An Experiential Approximation: To overcome the computational intractability, EPL proposes an approximated scoring function, F_M(s, v), that leverages similar past interactions. The intuition is that similar users exhibit similar preferences, so past interactions can guide current conversations. The experiential approximation is initially expressed as:

F_{\mathcal{M}}(s, v) = \sum_{r \in \mathcal{L}} f(r) \sum_{c' \sim \mathbf{P}_{\mathcal{M}}(c' \mid s, v)} \mathbf{P}(r \mid c', s, v) \mathbf{P}(c' \mid s, v),

where M = {(s', c', v', F(s', v'))} is a memory storing tuples of an experienced state s', its continuation c', its target item v', and its assessed score F(s', v'). Since (s', c') forms a completed conversation, s' can be seen as a proxy for c'.

Re-arranging the summation and making an assumption about the relationship between s' and c', the formulation becomes:

F_{\mathcal{M}}(s, v) = \sum_{(s', c') \sim \mathbf{P}_{\mathcal{M}}(s' \mid s, v)} \mathbf{P}(s' \mid s, v) \sum_{r \in \mathcal{L}} f(r) \mathbf{P}(r \mid c', s, v).

By considering only the k most similar states s' ∈ M_k for a given state s, and assuming P(r | c', s, v) ≈ P(r | c', s', v), the final experiential approximation is:

F_{\mathcal{M}}(s, v) \approx \sum_{s' \in \mathcal{M}_k} \mathbf{P}(s' \mid s, v) F(s', v) = \mathbb{E}_{s' \sim \mathbf{P}^{k}_{\mathcal{M}}(s' \mid s, v)}[F(s', v)],

where:

  • M_k: The set of k most similar past states retrieved from memory M.
  • P(s' | s, v): The probability (or similarity weight) of a retrieved state s' given the current state s and target v.
  • F(s', v): The pre-computed potential value of the experienced state s' with its target v.
  • F(s', v) is assumed to be 0 if the conversation (s', c', v) is not in M. This formulation bypasses the need for rollout simulations by directly using pre-computed values from similar past interactions, yielding a fast approximation of the target-driven scoring function.
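
To make the approximation concrete, here is a minimal sketch (under assumed array layouts, not the authors' code) that turns the k most similar memory entries into the experiential score of Equation 5, using a softmax over similarities as the retrieval distribution.

```python
import numpy as np

def experiential_score(similarities, memory_values, k=20):
    """Approximate F_M(s, v) (Eq. 5) as a similarity-weighted average of the
    pre-computed values F(s', v') of the k most similar experienced states.

    similarities:  (N,) similarity between the current state s and each s' in memory
    memory_values: (N,) LLM-assessed scores F(s', v') stored alongside each state
    """
    similarities = np.asarray(similarities, dtype=float)
    memory_values = np.asarray(memory_values, dtype=float)
    topk = np.argsort(-similarities)[:k]                 # indices of the k nearest states

    # Softmax over the top-k similarities stands in for P_M(s' | s, v).
    weights = np.exp(similarities[topk] - similarities[topk].max())
    weights /= weights.sum()

    # Expectation of F(s', v') under the retrieval distribution.
    return float(weights @ memory_values[topk])
```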

4.2.2. Tree-structured EPL (T-EPL)

T-EPL is a training-free realization of EPL that adapts to new interactions using LLMs, a dense document retrieval model, and MCTS. The overall process is illustrated in Figure 2.

The architecture of the system is shown below (Figure 2 from the original paper):

The image is a schematic of the Tree-structured EPL framework based on large language models and Monte-Carlo Tree Search, showing the interaction flow among the target setting, dialogue history, tree search, dialogue-state retrieval, assessment, and estimation modules.

The figure illustrates the T-EPL framework. The MCTS (Monte-Carlo Tree Search) constructs a search tree. The Retrieval phase (corresponding to Equation 6) finds similar past interactions. The Assessment and Estimation phases (corresponding to Equations 2 and 1 respectively) evaluate the potential value of nodes using an experiential target-driven scoring function, integrating LLM-based assessment for past dialogues.

MCTS-guided Tree Search: T-EPL integrates the experiential scoring function F_M(s, v) as the state value function within the MCTS algorithm (Algorithm 1). This enhances both planning capability and efficiency by replacing expensive rollouts or direct LLM evaluations for every tree node.

The MCTS procedure (detailed in Algorithm 1 in the paper and Appendix A.1) involves four stages:

  1. Selection: Starting from the root node (current state s_t), traverse the tree using the Upper Confidence bounds applied to Trees (UCT) formula until a leaf node is reached. UCT balances exploration (visiting less-explored paths) and exploitation (choosing paths with high estimated value):

s \gets \underset{s' \in \mathrm{Children}(s)}{\arg\max} \left[ V(s') + c \, \pi_p(a_{s'} \mid s) \sqrt{\frac{2 \mathbf{N}(s)}{\mathbf{N}(s')}} \right]

where:

    • s': A child node of the current state s.
    • V(s'): The estimated value of the child node s'.
    • c: A hyperparameter balancing exploitation and exploration.
    • π_p(a_{s'} | s; θ): A backbone policy (e.g., RTCP from Dao et al., 2023) that provides a prior assessment of actions during tree search. This guides the search towards more promising actions.
    • N(s): The number of times the parent state s has been visited.
    • N(s'): The number of times the child node s' has been visited.
  2. Expansion: From the selected leaf node s, potential actions Â_{t+1} are sampled using the backbone policy π_p. A new state ŝ_{t+1} is generated by taking an action â_{t+1}. If ŝ_{t+1} is not yet in the tree, a new node is created for it.

  3. Estimation: For the newly constructed node ŝ_{t+1} and the target item v, the experiential target-driven scoring function F_M(ŝ_{t+1}, v) is computed using Equation 5. This is where the Learning From Experience principle is applied, drawing values from the memory.

  4. Backpropagation: The estimated value F_M(ŝ_{t+1}, v) is propagated back up the tree, updating the statistics (e.g., estimated values V(s) and visit counts N(s)) of all nodes along the path from the new node to the root.

    After a specified number of simulation steps (n), the algorithm selects the action â_{t+1} from the root state s_t that leads to the highest total value, calculated as R(s_t, â_{t+1}, ŝ_{t+1}) + V(ŝ_{t+1}).
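
As an illustration of how these four stages can be wired together (a sketch under simplifying assumptions, not the paper's Algorithm 1), the following fragment runs n simulation steps with UCT selection and uses an experiential scoring function in place of rollouts; backbone_policy is assumed to return a non-empty list of (action, prior) pairs, and transition and score_fn are placeholder callables.

```python
import math
import random

class Node:
    """A search-tree node holding MCTS statistics for one dialogue state."""
    def __init__(self, state, prior=1.0, parent=None):
        self.state = state
        self.prior = prior            # pi_p(a | s) of the action that led here
        self.parent = parent
        self.children = {}            # action -> Node
        self.visits = 0
        self.value = 0.0              # running mean of backpropagated values

def uct_select(node, c=1.0):
    """Pick the child maximizing V(s') + c * pi_p(a|s) * sqrt(2 N(s) / N(s'))."""
    return max(
        node.children.values(),
        key=lambda ch: ch.value
        + c * ch.prior * math.sqrt(2.0 * (node.visits + 1) / (ch.visits + 1)),
    )

def plan_next_action(root_state, backbone_policy, transition, score_fn, target, n_steps=5):
    """One planning call: n_steps simulations, experiential scores instead of rollouts."""
    root = Node(root_state)
    for _ in range(n_steps):
        node = root
        # 1. Selection: descend with UCT until an unexpanded node is reached.
        while node.children:
            node = uct_select(node)
        # 2. Expansion: sample candidate actions from the backbone policy pi_p.
        for action, prior in backbone_policy(node.state):
            node.children[action] = Node(transition(node.state, action), prior, node)
        leaf = random.choice(list(node.children.values()))
        # 3. Estimation: the experiential score F_M(s, v) replaces a costly rollout.
        value = score_fn(leaf.state, target)
        # 4. Backpropagation: update visit counts and running-mean values to the root.
        while leaf is not None:
            leaf.visits += 1
            leaf.value += (value - leaf.value) / leaf.visits
            leaf = leaf.parent
    # Return the root action whose child has the highest estimated value.
    return max(root.children.items(), key=lambda kv: kv[1].value)[0]
```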

LLM-based Target-driven Assessment: To populate the memory M with assessed values, LLMs (specifically Llama 2 and GPT-3.5-Turbo) are used to assess the success of a completed conversation (s', c') regarding a target item v'. This assessment, F(s', v'), is computed with an additional penalty term for lengthy trajectories:

F(s', v') = \frac{1}{T} \sum_{i=1}^{T} \mathcal{V}_r\left(\mathbf{LLM}\left(\mathcal{P}(s', c', v')\right)\right) + \lambda \cdot e^{\frac{-l_{c'}}{\alpha}}

where:

  • V_r(.): A function that maps a textual output from the LLM (e.g., 'accept' or 'reject') to a scalar value (e.g., 1 or -1).
  • LLM(P(s', c', v')): The LLM's assessment of the conversation (s', c') and target v', obtained by prompting it with P(.). The LLM is prompted T times with temperature ρ = 1.1 to get diverse assessments, and the average is taken.
  • λ, α: Hyperparameters for the length penalty.
  • l_{c'}: The length of the continuation c' (number of turns).
  • The exponential decay term e^{-l_{c'}/α} penalizes longer conversations, encouraging more concise paths to success.
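
A minimal sketch of this assessment, assuming a placeholder llm_judge callable that returns 'accept' or 'reject' for a given prompt and illustrative default values for λ and α:

```python
import math

def assess_conversation(llm_judge, prompt, num_turns, T=5, lam=1.0, alpha=10.0):
    """Sketch of the LLM-based target-driven assessment: the judge is sampled
    T times, textual verdicts are mapped to scalars via V_r(.), and a length
    penalty rewards shorter successful conversations.

    llm_judge: callable(prompt) -> 'accept' or 'reject'   (placeholder)
    num_turns: length l_{c'} of the continuation c'
    lam, alpha: illustrative values for the penalty hyperparameters
    """
    to_scalar = {"accept": 1.0, "reject": -1.0}            # V_r(.)
    verdicts = [to_scalar.get(llm_judge(prompt), 0.0) for _ in range(T)]
    avg_score = sum(verdicts) / T
    length_penalty = lam * math.exp(-num_turns / alpha)    # favors concise paths
    return avg_score + length_penalty
```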

Dense Retrieval of Dialogue States: The probability distribution P_M(s' | s, v) (used in Equation 5) is modeled with a retrieval model f_rev(s' | s; θ_rev). This model computes a similarity score between the current state s and an experienced state s' in the memory M:

\mathbf{P}_{\mathcal{M}}(s' \mid s) = \frac{\exp\left(\phi(s')^{\mathrm{T}} \phi(s, v)\right)}{\sum_{s'' \in \mathcal{M}} \exp\left(\phi(s'')^{\mathrm{T}} \phi(s, v)\right)}

where:

  • φ: An encoder function (e.g., all-MiniLM-L6-v2 from Sentence Transformers) that maps dialogue states to their high-dimensional vector representations (embeddings).
  • φ(s')ᵀ φ(s, v): The dot-product similarity between the embedding of an experienced state s' and the embedding of the current state s combined with the target v. This softmax formulation normalizes the similarity scores into a probability distribution over the retrieved states.

Memory Construction: The memory M is dynamically built and updated. A prior policy π_prior(a | s; θ) (e.g., the fine-tuned RTCP model) first interacts with a user simulator (an LLM). For each generated dialogue, its target-driven assessment score F(s', v') is computed using the LLM-based assessment described above. The completed dialogue is then broken down into individual dialogue states and their continuations, which are stored as tuples in the memory. Faiss is used as the underlying library for efficient storage and retrieval of these experiences. This continuous update mechanism allows T-EPL to refine its policy online during testing.
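
The sketch below shows how such a memory could be backed by Sentence Transformers and Faiss (both named in the paper); the class name, the text serialization of states, and the method signatures are assumptions rather than the authors' implementation. The similarities and stored scores it returns can be fed into the Equation 5 aggregation sketched earlier.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

class ExperienceMemory:
    """Stores (state text, assessed score F(s', v')) pairs and supports
    top-k retrieval by inner-product similarity (a sketch, not the
    authors' implementation)."""

    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.encoder = SentenceTransformer(model_name)       # the encoder phi(.)
        dim = self.encoder.get_sentence_embedding_dimension()
        self.index = faiss.IndexFlatIP(dim)                   # exact inner-product search
        self.values = []                                      # F(s', v') for each stored state

    def add(self, state_texts, scores):
        """Add serialized dialogue states and their LLM-assessed scores."""
        emb = self.encoder.encode(state_texts).astype("float32")
        self.index.add(emb)
        self.values.extend(scores)

    def search(self, state_text, k=20):
        """Return similarities and stored scores of the k nearest states."""
        query = self.encoder.encode([state_text]).astype("float32")
        sims, ids = self.index.search(query, min(k, self.index.ntotal))
        values = np.array([self.values[i] for i in ids[0]])
        return sims[0], values
```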

5. Experimental Setup

5.1. Datasets

The experiments are conducted on two publicly available datasets:

  1. DuRecDial 2.0 (Liu et al., 2021):

    • Origin: A bilingual parallel corpus for conversational recommendation.
    • Characteristics: Encompasses conversations across multiple domains, including Movie, Music, Food, and Point-of-Interest (POI). The data exhibits a bias towards Music and Movie recommendations, with less data for POI and Food domains.
    • Size: 16.5K conversations, 255K utterances, 13 goals, 646 topics.
    • Target Items: 471/285/376 for train/validation/test sets.
    • Example Data: While the paper doesn't show a raw data example, a typical conversation turn in DuRecDial 2.0 might look like:
      • User: "I'm looking for a movie recommendation."
      • System: "How about KungFu Panda 3?"
      • User: "Oh, is that a comedy?"
    • Justification: A well-established multi-domain dataset for evaluating conversational recommendation systems.
  2. INSPIRED (Hayati et al., 2020):

    • Origin: A dataset specifically focusing on movie recommendation scenarios.

    • Characteristics: Primarily dedicated to the movie domain, potentially featuring longer and more complex conversations related to movie preferences.

    • Size: 1001 conversations, 35,811 utterances, 14 goals, 1169 topics.

    • Target Items: 368/42/55 for train/validation/test sets.

    • Example Data: A typical turn might be:

      • User: "I like action movies with strong female leads."
      • System: "Have you seen The Island? It stars Scarlett Johansson."
    • Justification: A common benchmark for movie recommendation dialogues, allowing for comparison with existing research.

      The detailed statistics of the datasets are provided in Table 1 and Table 5 (domains) and Table 6 (dialogue strategies).

The following table shows the results from Table 1:

| | DuRecDial 2.0 | INSPIRED |
| :--- | :--- | :--- |
| # convs | 16.5K | 1001 |
| # utterances | 255K | 35,811 |
| # goals | 13 | 14 |
| # topics | 646 | 1169 |
| # target items | 471/285/376 | 368/42/55 |
| domains | Movie/Music/Food/POI | Movie |

The following table shows the results from Table 5:

| Domain | DuRecDial 2.0 | INSPIRED |
| :--- | :--- | :--- |
| Movie | 190/121/161 | 368/42/55 |
| Music | 139/109/120 | - |
| Food | 48/13/30 | - |
| Point-of-interest (POI) | 96/42/65 | - |

The following table shows the results from Table 6:

| DuRecDial 2.0 strategy | Amount | INSPIRED strategy | Amount |
| :--- | :--- | :--- | :--- |
| Greetings | 4,948 | Opinion inquiry | 1,258 |
| Ask about weather | 4,393 | Self modeling | 235 |
| Play music | 10,034 | Personal opinion | 1,388 |
| Q/A | 6,072 | Credibility | 1,563 |
| Music on demand | 1,692 | Encouragement | 1,146 |
| Movie recommendation | 14,882 | Similarity | 539 |
| Chat about stars | 16,276 | Rephrase preference | 103 |
| Say goodbye | 12,819 | Preference confirmation | 436 |
| Music recommendation | 13,170 | Acknowledgment | 814 |
| Ask about date | 2,401 | Personal experience | 304 |
| Ask questions | 2,100 | Experience inquiry | 880 |
| POI recommendation | 5,451 | Offer help | 449 |
| Food recommendation | 4,465 | Transparency | 120 |
| | | No strategy | 1,423 |

5.2. Evaluation Metrics

The paper employs both automatic and human evaluations.

5.2.1. Automatic Evaluation Metrics

  1. Objective Success Rate (Obj_SR):

    • Conceptual Definition: Measures whether the system's generated response explicitly contains the pre-defined target item that the system is trying to recommend. It is a binary measure: 1 if the target item is mentioned, 0 otherwise.
    • Importance: It indicates the system's ability to introduce the target item into the conversation. However, the paper notes that relying solely on this metric can overestimate effectiveness, as a user might reject a recommended item even if it is mentioned.
    • Mathematical Formula: Not explicitly provided in the paper, but conceptually:

\mathbf{Obj_{SR}} = \frac{\text{Number of dialogues where the target item is mentioned}}{\text{Total number of dialogues}}

    • Symbol Explanation:
      • Number of dialogues where the target item is mentioned: Count of conversations in which the system uttered the specified target item at least once.
      • Total number of dialogues: The total count of conversations evaluated.
  2. Subjective Success Rate (Subj_SR):

    • Conceptual Definition: Determines whether the LLM-based assessment score of a generated conversation surpasses a pre-defined threshold ε. This metric aims to capture user satisfaction or willingness to accept the recommendation, providing a more realistic measure of success than merely mentioning the item.
    • Importance: Addresses the limitation of Obj_SR by incorporating an LLM-simulated judgment of the user's attitude towards the recommendation, better reflecting real-world success.
    • Mathematical Formula: Not explicitly provided, but based on the LLM assessment:

\mathbf{Subj_{SR}} = \frac{\text{Number of dialogues with LLM assessment score} > \epsilon}{\text{Total number of dialogues}}

    • Symbol Explanation:
      • Number of dialogues with LLM assessment score > ε: Count of conversations where the LLM's evaluation of user satisfaction/acceptance (computed via the method described in Section 4.2.2 for LLM-based Target-driven Assessment) exceeds the threshold ε.
      • Total number of dialogues: The total count of conversations evaluated.
      • ε: A pre-defined threshold (set to 1 in this work).
  3. Average Number of Turns (Avg. T):

    • Conceptual Definition: The average number of turns required to recommend the target item successfully or to reach the end of the conversation. A lower Avg. T is generally preferred for efficiency.
    • Importance: Measures the efficiency and conciseness of the dialogue. A system that can achieve its target in fewer turns is often more user-friendly.
    • Mathematical Formula: Not explicitly provided, but conceptually:

\text{Avg. T} = \frac{\sum_{i=1}^{\text{Total dialogues}} \text{Turns}_i}{\text{Total number of dialogues}}

    • Symbol Explanation:
      • Turns_i: The number of turns in dialogue i.
      • Total number of dialogues: The total count of conversations evaluated.
  4. Approximated Number of API Calls (#APIC):

    • Conceptual Definition: The computational cost of each model at inference time, measured in Big-O notation representing the complexity in terms of API calls to external services (e.g., LLMs, the user simulator).
    • Importance: Crucial for evaluating the practical deployability and scalability of models, especially those relying on external services that can incur monetary costs and latency.
    • Mathematical Formula: Expressed in Big-O notation (e.g., O(TH)), which represents the upper bound of the growth rate of API calls as input size increases.
    • Symbol Explanation:
      • T: Number of target items.
      • H: Conversation horizon (maximum number of turns).
      • K: Number of simulation steps (for MCTS-based algorithms).
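
For illustration, the three automatic metrics could be computed from logged dialogues roughly as follows; the record fields are assumed, not taken from the paper's code.

```python
def evaluate(dialogues, epsilon=1.0):
    """Compute Obj_SR, Subj_SR, and Avg. T from logged dialogues.

    Each dialogue is assumed to be a dict with (illustrative) fields:
      'mentions_target' : bool  - target item appeared in a system response
      'llm_score'       : float - LLM-based target-driven assessment
      'num_turns'       : int   - number of turns in the conversation
    """
    n = len(dialogues)
    obj_sr = sum(d["mentions_target"] for d in dialogues) / n
    subj_sr = sum(d["llm_score"] > epsilon for d in dialogues) / n
    avg_t = sum(d["num_turns"] for d in dialogues) / n
    return {"Obj_SR": obj_sr, "Subj_SR": subj_sr, "Avg_T": avg_t}
```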

5.2.2. Human Evaluation Metrics

For human evaluation, two annotators assessed randomly sampled dialogues based on:

  1. Satisfaction:

    • Conceptual Definition: Assesses which dialogue offers more convincing justifications for the user to accept the target item.
    • Importance: Directly measures the persuasive quality and overall user experience.
    • Formula: Not applicable; reported as Win/Loss rates for T-EPL against baselines.
  2. Coherency:

    • Conceptual Definition: Assesses which dialogue offers more reasonable topical transitions towards the target item.
    • Importance: Measures the naturalness and logical flow of the conversation, ensuring the recommendation is introduced smoothly and relevantly.
    • Formula: Not applicable; reported as Win/Loss rates for T-EPL against baselines.
  • Inter-annotator Agreement: Fleiss' Kappa (McHugh, 2012) is used to measure the agreement between the two human annotators, ensuring the reliability of the human evaluation results.

5.3. Baselines

The proposed T-EPL is compared against various dialogue policy methods, categorized by their approach:

  1. Predictive Policies: These models predict a probability distribution over dialogue actions.
    • BERT (Devlin et al., 2019): A general pre-trained language model, used here to predict the next dialogue strategy.
    • RTCP (Dao et al., 2023): A state-of-the-art target-driven recommendation model balancing short-term and long-term planning with a strategic balancing mechanism. This also serves as the backbone policy for T-EPL.
    • UNIMIND (Deng et al., 2023b): A goal-aware conversational recommender system using a multi-task learning paradigm and prompt-based learning to unify subtasks in multi-goal CRSs.
  2. Generative Policies: These models directly generate dialogue strategies or responses.
    • TCP (Wang et al., 2022): An early target-driven recommender system that uses a text generation model to produce a sequence of actions, starting from the target action.
    • COLOR (Wang et al., 2023b): A recent target-driven dialogue system that learns latent transitions within dialogues via a Brownian-bridge stochastic process.
  3. MCTS-based Policies: Methods leveraging Monte-Carlo Tree Search.
    • MCTS: A vanilla Monte-Carlo Tree Search with rollouts, which simulates future interactions to estimate state values. For fair comparison, RTCP is used as its prior policy.
    • GDP-Zero (Yu et al., 2023): A recent target-driven dialogue system that uses open-loop MCTS for look-ahead planning, where state values are estimated by prompting an LLM model. RTCP is used as its prior policy.
  4. RL-based Policies:
    • PPDPP (Deng et al., 2023a): A target-driven dialogue system that fine-tunes a small language model (as a prior dialogue policy) using background datasets, then further fine-tunes it with simulated conversations generated via Reinforcement Learning to maximize long-term rewards.

Implementation Details for Baselines:

  • Published source codes of baselines are leveraged.
  • For MCTS-based approaches (MCTS, GDP-Zero, T-EPL), RTCP is used as their prior policies for a fair comparison.
  • The RTCP policy was fine-tuned on the training data of DuRecDial 2.0 and INSPIRED for 5 epochs with learning rates of 5e-5 and 1e-5, respectively.
  • For T-EPL, the memory buffer size k is set to 20, and the number of simulation steps n is set to 5.
  • A generated conversation is considered successful if its target-driven assessment score (from the LLM) surpasses a threshold of ε = 1.
  • A BART-based model with 114M parameters is used as the response generation model for all dialogue policy methods to ensure fair comparison of planning capabilities.

5.4. User Response Simulation

An LLM (specifically GPT-3.5-Turbo) is prompted to act as a user simulator. A 1-shot prompting scheme is used, where the LLM's input includes the current dialogue history, the newly generated system response, and one demonstrative conversation. The LLM is prompted with a temperature ρ = 0 to ensure deterministic and consistent user behavior. The target item is intentionally excluded from the user simulator prompt to avoid bias.
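
A minimal sketch of such a simulator call via the OpenAI chat completions API; the prompt wording and message layout are assumptions rather than the paper's exact prompt.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def simulate_user_turn(dialogue_history, system_response, demo_conversation):
    """Prompt GPT-3.5-Turbo to answer as the user (1-shot, temperature 0).
    The prompt below is illustrative; note that the target item is
    deliberately omitted from the prompt to avoid bias."""
    prompt = (
        "You are a user talking to a recommendation assistant. "
        "Reply with the user's next utterance only.\n\n"
        f"Example conversation:\n{demo_conversation}\n\n"
        f"Current conversation:\n{dialogue_history}\n"
        f"Assistant: {system_response}\nUser:"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,           # deterministic, consistent user behaviour
    )
    return resp.choices[0].message.content.strip()
```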

5.5. LLM-based Target-driven Assessment

A separate LLM (GPT-3.5-Turbo) is used to assess the success of a conversation and determine whether the user is happy and willing to accept the target item. The LLM is prompted with a temperature ρ = 1.1 for 5 times, and the averaged score is computed. The textual outputs (e.g., 'accept'/'reject') are mapped to scalar values (e.g., 1/-1).

6. Results & Analysis

6.1. Core Results

The main empirical results demonstrate the superiority and efficacy of T-EPL across various datasets and evaluation metrics.

The following table shows the results from Table 2:

| Model | #APIC | DuRecDial 2.0 Obj_SR | DuRecDial 2.0 Subj_SR | DuRecDial 2.0 Avg. T(↓) | INSPIRED Obj_SR | INSPIRED Subj_SR | INSPIRED Avg. T(↓) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| BERT* (Devlin et al., 2019) | O(TH) | 0.851 | 0.773 | 6.045 | 0.090 | 0.062 | 13.647 |
| UNIMIND* (Deng et al., 2023b) | O(TH) | 0.720 | 0.668 | 6.873 | 0.163 | 0.109 | 13.321 |
| TCP* (Wang et al., 2022) | O(TH) | 0.742 | 0.679 | 7.091 | 0.201 | 0.168 | 13.178 |
| COLOR* (Wang et al., 2023b) | O(TH) | 0.805 | 0.749 | 7.067 | 0.206 | 0.172 | 13.283 |
| RTCP* (Dao et al., 2023) | O(TH) | 0.877 | 0.786 | 5.993 | 0.136 | 0.099 | 13.479 |
| T-EPL* (ours) | O(TKH) | 0.904 | 0.813 | 5.255 | 0.218 | 0.172 | 13.158 |
| MCTS+ | O(TKH^2) | 0.867 | 0.784 | 5.550 | 0.137 | 0.112 | 13.537 |
| GDP-Zero (Yu et al., 2023) | O(TKH^2) | 0.850 | 0.825 | 4.475 | 0.250 | 0.217 | 13.216 |
| PPDPP‡ (Deng et al., 2023a) | O(TH) | 0.667 | 0.650 | 7.283 | 0.150 | 0.100 | 13.416 |
| T-EPL+ (ours) | O(TKH) | 0.900 | 0.833 | 5.034 | 0.225 | 0.225 | 13.000 |

Notes:

  • Obj_SR: Objective Success Rate.
  • Subj_SR: Subjective Success Rate.
  • Avg. T(↓): Average number of turns (lower is better).
  • #APIC: Approximated number of API calls.
  • * denotes performance on the whole test set.
  • + denotes performance on a sub-sample of the dataset (for computationally costly methods).
  • An additional marker denotes methods with a quadratic computational cost.
  • Best performance in each category is bolded.

6.1.1. Objective versus Subjective Metrics

The results consistently show a substantial difference between Obj_SR and Subj_SR across all methods. This confirms the paper's argument that Obj_SR alone can lead to an overestimation of a system's effectiveness. For instance, a system might mention the target item (high Obj_SR) but the user might reject it, leading to a lower Subj_SR. This highlights the importance of incorporating user feedback (even simulated) to assess true recommendation success.

6.1.2. Generative versus Predictive Policies

  • DuRecDial 2.0: Predictive policies (e.g., BERT, RTCP) generally perform better.
  • INSPIRED: Generative policies (e.g., TCP, COLOR) are more effective. This trend is attributed to the action space size. INSPIRED has a significantly larger action space, which generative policies can handle better as they directly generate strategies, whereas predictive policies might struggle with predicting over a vast set of discrete actions.

6.1.3. Performance Comparison against Baseline Methods

  • T-EPL's Superiority: The proposed T-EPL consistently outperforms most existing target-driven dialogue policies (marked with *) across both datasets and evaluation metrics. This is attributed to its effective utilization of similar past interactions from its memory, leading to enhanced dialogue planning.
    • On DuRecDial 2.0, T-EPL achieves the highest Obj_SR (0.904) and Subj_SR (0.813), and the lowest Avg. T (5.255) among the models evaluated on the full test set.
    • On INSPIRED, T-EPL also shows strong performance, matching COLOR's Subj_SR (0.172) and having a slightly lower Avg. T (13.158).
  • Comparison with MCTS-based baselines (+): When compared on the sub-sampled datasets (marked with +), T-EPL (0.900 Obj_SR, 0.833 Subj_SR, 5.034 Avg. T on DuRecDial 2.0; 0.225 Obj_SR, 0.225 Subj_SR, 13.000 Avg. T on INSPIRED) still demonstrates strong performance and outperforms vanilla MCTS across the board. Against GDP-Zero on DuRecDial 2.0, T-EPL achieves a higher Subj_SR (0.833 vs 0.825) at the cost of slightly more turns (5.034 vs 4.475), suggesting a better overall outcome. On INSPIRED, GDP-Zero shows higher Obj_SR and Subj_SR on the sub-sample, which might be due to dataset characteristics or specific tuning. The paper argues, however, that T-EPL's experiential target-driven scoring function provides accurate assessments while avoiding GDP-Zero's costly LLM-based estimation.
  • PPDPP's Lower Performance: PPDPP shows relatively lower performance, possibly due to its limitation in fine-tuning its pre-trained policy on a limited number of interactions, hindering its generalizability.
  • MCTS Computational Constraint: Vanilla MCTS is restricted by computational costs, limiting its ability to use sufficient rollouts for effective policy learning.

6.1.4. Efficiency Comparison

  • Offline Policies: BERT, UNIMIND, TCP, COLOR, and RTCP are offline policies with lower computational costs, typically O(TH) API calls.

  • MCTS & GDP-Zero: These incur high computational costs. Vanilla MCTS has a quadratic cost O(TKH^2) due to expensive rollouts. GDP-Zero also has O(TKH^2) complexity because it relies on costly LLM-based evaluations for state values at each simulation step.

  • T-EPL: While T-EPL introduces API calls for assessing past interactions, this step is pre-computed efficiently. During inference, T-EPL exhibits a linear scaling of API calls, O(TKH), similar to offline policy models. This makes it significantly more efficient than other MCTS-based methods at inference time.

    The following table shows the results from Table 7 (Inference time):

| Model | DuRecDial 2.0 (s) | INSPIRED (s) |
| :--- | :--- | :--- |
| BERT (Devlin et al., 2019) | 6.01 | 6.62 |
| UNIMIND (Deng et al., 2023b) | 7.54 | 9.21 |
| TCP (Wang et al., 2022) | 13.60 | 34.34 |
| COLOR (Wang et al., 2023b) | 10.81 | 26.07 |
| RTCP (Dao et al., 2023) | 7.53 | 8.69 |
| MCTS | 105.29 | 232.04 |
| GDP-Zero (Yu et al., 2023) | 90.62 | 148.70 |
| PPDPP (Deng et al., 2023a) | 7.59 | 9.49 |
| T-EPL (ours) | 50.71 | 84.43 |

The inference times in Table 7 corroborate the Big-O notation complexities, showing T-EPL as significantly faster than other MCTS-based methods (MCTS, GDP-Zero).

6.2. Ablations / Parameter Sensitivity

The ablation study investigates the contribution of key components of the T-EPL algorithm.

The following table shows the results from Table 3:

| Model | DuRecDial 2.0 Subj_SR | DuRecDial 2.0 Avg. T(↓) | INSPIRED Subj_SR | INSPIRED Avg. T(↓) |
| :--- | :--- | :--- | :--- | :--- |
| T-EPL | 0.813 | 5.255 | 0.172 | 13.158 |
| - w/o Len | 0.837 | 5.312 | 0.145 | 13.161 |
| - w/o Exp | 0.801 | 5.435 | 0.136 | 13.372 |

Notes:

  • w/o Len: Variant without the length-penalized term (λ = 0).

  • w/o Exp: Variant without the experiential target-driven scoring function (F_M(s, v) = 0).

  • Impact of Length-Penalized Term (w/o Len):

    • DuRecDial 2.0: Removing the length penalty increases Subj_SR (0.837 vs 0.813) but slightly degrades Avg. T (5.312 vs 5.255). This suggests a trade-off: allowing longer conversations might lead to slightly higher satisfaction for users who appreciate thoroughness, but at the cost of more turns.
    • INSPIRED: Removing the length penalty negatively impacts both Subj_SR (0.145 vs 0.172) and Avg. T (13.161 vs 13.158). This is explained by INSPIRED having longer conversations, where accumulated error during planning for longer trajectories is higher. Penalizing long trajectories helps T-EPL prioritize shorter, more certain paths.
  • Impact of Experiential Target-driven Scoring Function (w/o Exp):

    • Removing this core component leads to significant performance drops on both datasets (Subj_SR 0.801 vs 0.813 on DuRecDial 2.0, and 0.136 vs 0.172 on INSPIRED; Avg. T also worsens). This strongly emphasizes the critical importance of experiential learning in improving target-driven planning capabilities.

6.2.1. Impact of the Number of Simulation Steps (n)

The performance of T-EPL generally improves with an increasing number of simulation steps (n). A larger n allows MCTS to construct a more comprehensive search tree, leading to better policy decisions. However, this comes with an inherent trade-off in computational cost, as a higher n increases API calls and runtime. This necessitates careful selection of n to balance performance and efficiency.

The following figure shows the performance of T-EPL with different values of simulation steps (n) and # retrieved examples (k) (Figure 4 from the original paper):

Figure 4: Performance of T-EPL with different values of simulation steps (n) and number of retrieved examples (k), covering success rate (SR), average number of turns, runtime, and value-estimation variance.

Figure 4.a (top left) shows that Subj_SR increases with n up to a point, while Figure 4.c (bottom left) shows runtime and the number of API calls increasing linearly with n.

6.2.2. Analyses on the Number of Retrieved Interactions (k)

The number of retrieved interactions k also affects T-EPL's performance. Performance initially increases with k, as considering more interactions refines policy decisions. However, beyond a certain point, performance can decrease. This is because a very large k can introduce noisy interactions from memory or lead to value-estimation saturation, where adding more (potentially less relevant) similar interactions does not provide significant new insight and may even dilute the quality of the aggregated value.

Figure 4.b (top right) illustrates this by showing Subj_SR rising and then falling as k increases, while Figure 4.d (bottom right) shows that the value-estimation variance decreases, indicating saturation.

6.3. Human Evaluation

The human evaluation results compare T-EPL against key baselines on the DuRecDial 2.0 dataset.

The following table shows the results from Table 4:

| T-EPL vs | Stat. Win (%) | Stat. Lose (%) | Coh. Win (%) | Coh. Lose (%) |
| :--- | :--- | :--- | :--- | :--- |
| RTCP | 38 | 32 | 27 | 24 |
| COLOR | 45 | 34 | 21 | 18 |
| PPDPP | 27 | 19 | 34 | 23 |
| GDP-Zero | 32 | 29 | 26 | 22 |

Notes:

  • Stat.: Satisfaction.

  • Coh.: Coherency.

  • Win.(%): Percentage of dialogues where T-EPL was preferred.

  • Lose.(%): Percentage of dialogues where the baseline was preferred.

  • The inter-annotator agreement score (Fleiss' Kappa) is 0.69, indicating substantial agreement.

  • Overall Superiority: T-EPL generally achieves better performance in Satisfaction and Coherency compared to RTCP, COLOR, PPDPP, and GDP-Zero.

  • Improvement over Backbone (RTCP): T-EPL significantly improves upon its backbone policy, RTCP, in both Satisfaction (38% win vs 32% loss) and Coherency (27% win vs 24% loss). This highlights that RTCP's limitation in lacking foresight interaction is effectively addressed by T-EPL's use of experienced interactions from memory.

  • Strong Performance against Generative (COLOR) and MCTS-based (GDP-Zero, PPDPP) Baselines: T-EPL shows favorable win rates against these diverse baselines, reinforcing its robust planning capabilities.

6.4. In-depth Analyses

6.4.1. Frequency of Target Items and Performance Comparison w.r.t Conversation Turns

The analysis of recommendation timing (Figure 3, Figure 5, Figure 6, Figure 7) reveals distinct strategies:

The following figure shows the frequency of target items (left) and relative success rate (right) w.r.t conversation turns of different models (Figure 3 from the original paper):

Figure 3: Frequency of target items (left) and relative success rate (right) w.r.t conversation turns of different models. Green lines show the baseline BERT\\*. More results can be found in Appendix…
The image is Figure 3 of the paper, showing the frequency of target items and the relative success rate of different models across conversation turns: the left bar chart shows target-item frequency, and the right line chart shows relative success rate, with colors distinguishing the baseline (Base) from TCP, RTCP, and T-EPL.

Figure 3 (left panel) shows that RTCP and T-EPL tend to introduce recommendations earlier than TCP. Figure 3 (right panel) compares the relative Subj_SR of these models to the BERT baseline, showing T-EPL's consistent gains.

The following figure shows the frequency of target items (left) and relative success rate (right) w.r.t conversation turns of BERT, UNIMIND, and COLOR models (Figure 5 from the original paper):

Figure 5: Frequencies of target items (left) and relative success rate (right) w.r.t conversation turns of BERT, UNIMIND, and COLOR models.
The image is a chart showing, for the BERT, UNIMIND, and COLOR models, the target-item frequency (left) and relative success rate (right) at different conversation turns, reflecting how performance differs as the dialogue progresses.

Figure 5 further illustrates that predictive policies (like BERT) often favor both early and late recommendations, while generative policies (UNIMIND, COLOR) tend to prioritize late suggestions.

The following figure shows the performance comparison of relative success rate against the standard baseline BERT at different conversation turns (Figure 6 from the original paper):

The image is a chart of the relative success rate of different methods across conversation turns on the DuRecDial and INSPIRED datasets; T-EPL performs best, reflecting the benefit of experiential learning for target-driven recommendation dialogue management.

Figure 6 demonstrates that T-EPL consistently outperforms baselines across conversation turns, with a more significant gap in earlier rounds, indicating superior early recommendation management. On the INSPIRED dataset (longer conversations), T-EPL maintains strong long-range planning capabilities even in later turns.

The following figure shows the performance comparison of relative success rate against the standard baseline MCTS at different conversation turns (Figure 7 from the original paper):

The image shows two line charts of the relative success rate across conversation turns on the DuRecDial and INSPIRED datasets, comparing the T-EPL, PPDPP, GDP-Zero, and MCTS (Base) strategies.

Figure 7 shows that T-EPL consistently outperforms vanilla MCTS. Compared to GDP-Zero, T-EPL is competitive, especially in later turns. While GDP-Zero might show an edge in the initial turn (possibly due to aggressive early recommendations), this might lead to poor user experience if not well-justified.

The following figure shows the frequencies of predicted dialogue strategies for different conversation turns for TCP, RTCP, and T-EPL (Figure 8 from the original paper):

The image is a set of grouped bar charts showing how often TCP, RTCP, and T-EPL select different dialogue strategies at different conversation turns (3-4, 5-6, 7-8, 9-10).

Figure 8 provides detailed frequencies of dialogue actions. In early turns (1-4), models prioritize rapport building. From turns 5-6, Music, Food, and Movie recommendations emerge, with RTCP and T-EPL showing higher frequency. Movie and POI suggestions are more prevalent in later turns (7-10). This highlights that optimal dialogue strategies vary by domain and conversation stage.

6.4.2. Performance Comparison w.r.t Different Recommendation Domains

The paper investigates performance across different recommendation domains within DuRecDial 2.0.

The following table shows the results from Table 8:

| Model | Movie Subj_SR | Movie Avg. T(↓) | Music Subj_SR | Music Avg. T(↓) | POI Subj_SR | POI Avg. T(↓) | Food Subj_SR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BERT* | 0.869 | 6.782 | 0.833 | 4.333 | 0.723 | 6.938 | 0.533 |
| UNIMIND* | 0.788 | 6.503 | 0.858 | 4.825 | 0.261 | 7.769 | 0.500 |
| TCP* | 0.832 | 6.881 | 0.791 | 5.975 | 0.569 | 7.000 | 0.200 |
| COLOR* | 0.782 | 7.391 | 0.800 | 6.700 | 0.584 | 8.153 | 0.533 |
| RTCP* | 0.925 | 6.204 | 0.875 | 4.375 | 0.738 | 7.107 | 0.333 |
| T-EPL* (ours) | 0.851 | 5.708 | 0.891 | 3.891 | 0.831 | 4.892 | 0.667 |

Notes:

  • Subj_SR: Subjective Success Rate (a short sketch of how these per-domain metrics could be aggregated appears after this list).

  • Avg. T(↓): Average number of turns (lower is better).

  • * denotes performance on the whole test set.

  • Domain-specific Challenges: POI and Food recommendations consistently show lower performance compared to Movie and Music. This is likely due to the limited amount of training data for these domains, making it harder for models to learn effective dialogue policies.

  • T-EPL's Generalizability: Despite these challenges, T-EPL significantly outperforms all baselines in Subj_SR in 3 out of 4 domains (Music, POI, Food), achieves the lowest Avg. T in the Movie, Music, and POI domains, and remains competitive in Subj_SR for Movie. This demonstrates T-EPL's generalizability across recommendation domains and its capacity to enhance target-driven dialogue planning irrespective of the domain.
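
As a concrete reading of these columns, the following is a minimal sketch, under assumed field names, of how per-domain Subj_SR and Avg. T could be aggregated from logged simulated dialogues; it is not the paper's evaluation script.

```python
# Illustrative aggregation of per-domain metrics; record fields are assumptions.
from collections import defaultdict

def per_domain_metrics(dialogues):
    """dialogues: iterable of dicts like {"domain": "Movie", "success": True, "turns": 6}."""
    grouped = defaultdict(list)
    for d in dialogues:
        grouped[d["domain"]].append(d)
    metrics = {}
    for domain, items in grouped.items():
        n = len(items)
        metrics[domain] = {
            "Subj_SR": sum(1 for d in items if d["success"]) / n,   # fraction of successful dialogues
            "Avg_T": sum(d["turns"] for d in items) / n,            # average number of turns
        }
    return metrics
```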

7. Conclusion & Reflections

7.1. Conclusion Summary

This work introduces Experiential Policy Learning (EPL), a novel framework for enhancing target-driven recommendation dialogues. The core innovation of EPL lies in its Learning From Experience principle, which leverages similar past interactions stored in a long-term memory to anticipate future conversational trajectories and estimate dialogue state potential. This approach effectively addresses critical limitations of prior methods, such as inadequate conversation anticipation, computational inefficiencies from costly simulations, and the neglect of valuable past experiences.

The paper further presents Tree-structured EPL (T-EPL) as a training-free realization of EPL, integrating Large Language Models (LLMs) for assessing past dialogue states and Monte-Carlo Tree Search (MCTS) for hierarchical and multi-level reasoning. Through extensive interactive experiments on two published datasets (DuRecDial 2.0 and INSPIRED), T-EPL consistently demonstrates superior performance and efficiency compared to state-of-the-art baselines. The ablation studies confirm the effectiveness of its experiential scoring function and length-penalized assessment.
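
Although the paper's exact formulation is not reproduced here, the general shape of an experiential scoring function can be sketched as a similarity-weighted average of stored state values, optionally discounted by dialogue length. The weighting scheme, the discount factor gamma, the top-k cutoff, and the use of cosine similarity below are all illustrative assumptions rather than the authors' definition.

```python
# A hedged sketch of one possible experiential scoring function (not the paper's exact form).
import numpy as np

def experiential_score(query_emb, memory_embs, memory_values, lengths=None, gamma=0.95, k=5):
    """query_emb: (d,) unit vector for the current dialogue state.
    memory_embs: (N, d) unit vectors of stored past states.
    memory_values: (N,) assessed values of those past states.
    lengths: optional (N,) dialogue lengths used for a length penalty."""
    sims = memory_embs @ query_emb                  # cosine similarity on unit vectors
    top = np.argsort(-sims)[:k]                     # indices of the k most similar past states
    weights = np.maximum(sims[top], 0.0) + 1e-8     # non-negative, non-zero retrieval weights
    values = np.asarray(memory_values)[top]
    if lengths is not None:
        values = values * gamma ** np.asarray(lengths)[top]   # simple length-penalized assessment
    return float(np.average(values, weights=weights))
```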

7.2. Limitations & Future Work

The authors acknowledge several potential limitations of the proposed T-EPL algorithm:

  1. Memory Availability: T-EPL's performance relies heavily on the availability and quality of its memory component. In real-world applications, this memory might not be readily available (a cold-start problem for experiences) and would require additional effort to construct before deployment.

  2. Computational Cost of Interactive LLM Evaluation: While T-EPL reduces LLM calls during core MCTS planning, its memory construction and LLM-based assessment still rely on interactive LLM evaluations. This can incur significant computational costs (both time and monetary) when building or updating the memory, even if it's done offline.

  3. Dependence on Retrieval Models: T-EPL's effectiveness is constrained by the quality of its retrieval model (e.g., all-MiniLM-L6-v2) and of the retrieved interactions and their corresponding state values. If the retrieval model is weak, or if the stored experiences are not representative or diverse, the experiential scoring function can become inaccurate and degrade performance (a minimal sketch of this retrieval step appears at the end of this subsection).

    These limitations implicitly point to future work, such as developing more efficient memory-construction methods, reducing the reliance on costly LLM evaluations, and improving the robustness of the retrieval mechanism.
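
To ground the third point, below is a minimal sketch of the kind of dense-retrieval step T-EPL depends on, assuming the all-MiniLM-L6-v2 encoder from sentence-transformers and a Faiss inner-product index over normalized embeddings. Memory contents, function names, and the choice of index are illustrative, not the authors' implementation.

```python
# Illustrative dense-retrieval memory over past dialogue states.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def build_memory_index(past_states):
    """past_states: list of dialogue-state strings stored in long-term memory."""
    embs = encoder.encode(past_states, normalize_embeddings=True)
    index = faiss.IndexFlatIP(embs.shape[1])        # inner product == cosine on unit vectors
    index.add(np.asarray(embs, dtype="float32"))
    return index

def retrieve_similar(index, current_state, k=5):
    """Return (memory_id, similarity) pairs for the k most similar stored states."""
    q = encoder.encode([current_state], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q, dtype="float32"), k)
    return list(zip(ids[0].tolist(), scores[0].tolist()))
```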

7.3. Personal Insights & Critique

The Experiential Policy Learning (EPL) framework is a highly intuitive and powerful concept, particularly in dialogue systems where interactions can be complex and diverse. The analogy to human reflection on experience is well-placed and resonates with how intelligent agents should ideally operate.

Novelty: The core novelty lies in integrating experiential learning into MCTS for dialogue policy learning in a training-free and adaptive manner. While MCTS and LLMs have been used, the way T-EPL uses a dense retrieval memory to approximate state values, circumventing costly online simulations or constant LLM calls during planning, is a significant contribution to efficiency and adaptability. The dynamic memory update during inference is also a strong point, enabling online policy refinement.

Potential Improvements & Unexplored Avenues:

  • Memory Management and Forgetting: The paper notes that similar users tend to exhibit similar preferences. However, user preferences evolve, and some experiences may become outdated or less relevant. Future work could explore dynamic memory pruning or weighting mechanisms (e.g., temporal decay or context-specific relevance) to keep the memory useful and free of noisy or stale entries (a decay-weighted scoring variant is sketched after this list).
  • Definition of "Similar": The dense retrieval relies on vector similarity. While effective, the definition of "similarity" might be too generic. Could domain-specific or goal-oriented similarity measures further enhance retrieval quality? For instance, similarity based on dialogue acts or topic shifts rather than just semantic content.
  • Scalability of Memory: While Faiss is efficient, what happens when the memory grows to an extremely large scale? How does retrieval latency impact real-time performance in production?
  • LLM Assessment Robustness: The LLM-based assessment used to populate the memory is crucial. Although a sampling temperature of ρ = 1.1 and T = 5 samples are used, LLMs remain prone to hallucination and bias, and the quality of the assessment prompts (Table 10) also plays a large role. Further studies on the reliability and consistency of these LLM assessments would be valuable.
  • Transferability to other domains: While T-EPL shows generalizability across domains in DuRecDial 2.0, it would be interesting to see its performance in drastically different dialogue contexts (e.g., customer service, technical support) where target items might be less tangible.
  • Explainability: The experiential scoring function is more transparent than a black-box neural network. However, providing explanations for why certain past interactions were deemed similar and influential could enhance trust and debuggability.
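
As one concrete way to realize the temporal-decay idea from the first bullet, the sketch below down-weights older memory entries when computing an experiential score. The exponential decay form, the half-life parameter, and the record fields are purely hypothetical and are not part of the paper.

```python
# Hypothetical recency-aware weighting for retrieved memory entries.
import time

def decayed_weight(similarity, stored_at, now=None, half_life_days=30.0):
    """Combine retrieval similarity with an exponential recency decay."""
    now = now if now is not None else time.time()
    age_days = max(0.0, (now - stored_at) / 86400.0)
    decay = 0.5 ** (age_days / half_life_days)      # halve the weight every `half_life_days`
    return similarity * decay

def decayed_experiential_score(retrieved, now=None, half_life_days=30.0):
    """retrieved: list of dicts like {"similarity": 0.8, "value": 0.6, "stored_at": 1716000000.0}."""
    weights = [decayed_weight(r["similarity"], r["stored_at"], now, half_life_days) for r in retrieved]
    total = sum(weights) or 1e-8
    return sum(w * r["value"] for w, r in zip(weights, retrieved)) / total
```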

Assumptions:

  • The primary assumption is that past experiences are genuinely indicative of future outcomes in similar dialogue states. While often true, outliers or rapid shifts in user behavior could challenge this.

  • The quality of the user simulator and LLM-based assessment is assumed to be high enough to generate reliable experiential data for the memory.

    Overall, T-EPL represents a significant step towards building more adaptive, efficient, and foresightful dialogue policies for target-driven recommendation systems, cleverly leveraging the strengths of LLMs, MCTS, and experiential learning.
