Plug-and-Play Policy Planner for Large Language Model Powered Dialogue Agents
TL;DR Summary
PPDPP introduces a tunable dialogue policy planner enhancing LLMs' proactive dialogue capabilities via supervised fine-tuning and reinforcement learning, achieving superior generalization and performance across diverse applications.
Abstract
Proactive dialogues serve as a practical yet challenging dialogue problem in the era of large language models (LLMs), where the dialogue policy planning is the key to improving the proactivity of LLMs. Most existing studies enable the dialogue policy planning of LLMs using various prompting schemes or iteratively enhance this capability in handling the given case with verbal AI feedback. However, these approaches are either bounded by the policy planning capability of the frozen LLMs or hard to be transferred to new cases. In this work, we introduce a new dialogue policy planning paradigm to strategize LLMs for proactive dialogue problems with a tunable language model plug-in as a plug-and-play dialogue policy planner, named PPDPP. Specifically, we develop a novel training framework to facilitate supervised fine-tuning over available human-annotated data as well as reinforcement learning from goal-oriented AI feedback with dynamic interaction data collected by the LLM-based self-play simulation. In this manner, the LLM-powered dialogue agent can not only be generalized to different cases after the training, but also be applicable to different applications by just substituting the learned plug-in. In addition, we propose to evaluate the policy planning capability of dialogue systems under the interactive setting. Experimental results demonstrate that PPDPP consistently and substantially outperforms existing approaches on three different proactive dialogue applications, including negotiation, emotional support, and tutoring dialogues.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Plug-and-Play Policy Planner for Large Language Model Powered Dialogue Agents
1.2. Authors
Yang Deng, Wenxuan Zhang, Wai Lam, See-Kiong Ng, Tat-Seng Chua
The authors appear to be affiliated with the National University of Singapore and The Chinese University of Hong Kong, indicating a strong academic background in natural language processing and artificial intelligence research.
1.3. Journal/Conference
The paper was published on arXiv, which is a preprint server for electronic preprints of scientific papers. While not a peer-reviewed journal or conference proceeding itself, arXiv serves as a widely recognized platform for rapid dissemination of research findings in fields like AI, often preceding or accompanying formal publication in top-tier venues. Its reputation is high for sharing cutting-edge research quickly.
1.4. Publication Year
2023
1.5. Abstract
This paper addresses the challenge of proactive dialogues in Large Language Model (LLM)-powered dialogue agents, where dialogue policy planning is crucial for improving LLM proactivity. Existing methods, primarily prompting schemes or verbal AI feedback, are limited by the capabilities of frozen LLMs or lack transferability to new cases. The authors introduce a novel paradigm named Plug-and-Play Dialogue Policy Planner (PPDPP), which utilizes a tunable language model plug-in to strategize LLMs for proactive dialogue problems. PPDPP employs a unique training framework combining supervised fine-tuning (SFT) on human-annotated data and reinforcement learning (RL) from goal-oriented AI feedback generated through LLM-based self-play simulation. This approach allows the LLM-powered agent to generalize to different cases and be adaptable across various applications by simply substituting the learned plug-in. Additionally, the paper proposes an interactive evaluation method for policy planning capability. Experimental results demonstrate that PPDPP significantly outperforms existing methods across three proactive dialogue applications: negotiation, emotional support, and tutoring.
1.6. Original Source Link
Official Source: https://arxiv.org/abs/2311.00262 PDF Link: https://arxiv.org/pdf/2311.00262v2.pdf Publication Status: Preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the limitation of Large Language Models (LLMs) in proactive dialogues. While LLMs like ChatGPT excel at understanding context and generating responses, they are fundamentally designed to be passive, following user instructions. This inherent passivity makes them less effective in proactive dialogue problems, where an agent needs to strategically take the initiative to steer a conversation towards a specific goal. Examples of such problems include negotiation, emotional support, and tutoring, all of which require dynamic, goal-oriented interaction.
This problem is important because many real-world applications of dialogue agents require more than just reactive responses; they demand agents that can pursue objectives, guide users, and manage conversational flow strategically. Existing solutions, primarily prompting schemes or iterative refinement with verbal AI feedback, face significant challenges:
- Bounded by frozen LLMs: Their effectiveness is limited by the inherent policy planning capability of the underlying frozen LLMs (i.e., LLMs whose parameters are not updated during the policy learning process).
- Lack of transferability: Methods involving iterative refinement often require multiple rounds of self-play dialogue simulations for each new case, making them impractical for broad application.
- Inadequate evaluation: Traditional turn-level response quality measurements fail to assess the policy planning capability effectively, which is about achieving long-term conversational goals.

The paper's innovative idea is to introduce a plug-and-play mechanism for dialogue policy planning, effectively decoupling strategic decision-making from the LLM's core generation capabilities. By using a separate, tunable language model plug-in, the system can learn and adapt its strategic behavior without modifying the large, expensive, and often proprietary backbone LLM.
2.2. Main Contributions / Findings
The paper makes several primary contributions to address the identified challenges:
- Introduction of PPDPP: They propose the Plug-and-Play Dialogue Policy Planner (PPDPP), a novel paradigm that strategizes LLM-powered dialogue agents using a tunable language model plug-in. This plug-in acts as the policy agent, predicting dialogue strategies independently of the main LLM.
- Novel Training Framework: PPDPP is trained in two phases: Supervised Fine-Tuning (SFT) on available human-annotated corpora to initialize the plug-in, followed by Reinforcement Learning (RL) from goal-oriented AI feedback collected via LLM-based self-play simulations. The simulation involves two LLMs (assistant and user) with competing goals, plus a third LLM acting as a reward model that provides goal-oriented verbal feedback, which is then converted into scalar rewards for RL.
- Enhanced Generalizability and Applicability: This approach allows the LLM-powered dialogue agent to generalize to different cases after training and, crucially, to be applied to different applications simply by substituting the learned plug-in, without affecting the LLM's context understanding and response generation capabilities.
- Interactive Evaluation Protocol: The paper proposes an LLM-based interactive evaluation approach using user simulators and reward models to assess success rate and the average number of turns for achieving designated goals in a dynamic, multi-turn setting, overcoming the limitations of static turn-level metrics.
- Superior Performance across Applications: Experimental results demonstrate that PPDPP consistently and substantially outperforms existing approaches on three diverse proactive dialogue applications: negotiation, emotional support, and tutoring dialogues. It shows improvements in both efficiency (fewer turns) and effectiveness (higher success rates/benefits).
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully grasp the methodology and contributions of this paper, a beginner should understand several key concepts:
- Large Language Models (LLMs): These are advanced artificial intelligence models, like ChatGPT, Vicuna, and LLaMA2-Chat, trained on vast amounts of text data. They excel at understanding natural language, generating human-like text, answering questions, summarizing information, and performing various language-related tasks. A key characteristic highlighted in this paper is their passive nature, meaning they typically respond to explicit instructions rather than proactively steering conversations towards specific goals.
- Dialogue Agents: Also known as chatbots or conversational AI, these are computer programs designed to engage in natural language conversations with humans. They can range from simple rule-based systems to highly sophisticated AI models. This paper focuses on LLM-powered dialogue agents, which leverage the advanced capabilities of LLMs for their conversational abilities.
- Proactive Dialogue Systems: Unlike reactive dialogue systems that merely respond to user inputs, proactive systems take the initiative to guide the conversation towards an anticipated goal. This involves strategic decision-making about what actions to take next to achieve a specific objective, such as closing a negotiation deal, providing emotional support, or teaching a concept. This paper defines proactive dialogues as those where the agent needs to strategically take the initiative to steer the conversation towards an anticipated goal.
- Dialogue Policy Planning: This refers to the process by which a dialogue agent decides what to say or do next in a conversation to achieve its goals. It is the strategic component of a dialogue system, determining the sequence of actions or dialogue acts (e.g., "ask a question," "propose a price," "offer reassurance") that will lead to a successful outcome. In proactive dialogues, effective policy planning is paramount.
- Supervised Fine-Tuning (SFT): A common technique in machine learning, particularly with pre-trained language models. Fine-tuning involves taking a model that has already been trained on a large, general dataset (like an LLM) and further training it on a smaller, task-specific dataset. Supervised means this training uses labeled examples (input-output pairs), where the model learns to map inputs to desired outputs. In this paper, SFT is used to initialize the PPDPP plug-in with human-annotated dialogue strategies.
- Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent receives feedback (rewards or penalties) from the environment for its actions and learns through trial and error. Key components include:
  - Agent: The decision-maker (here, the PPDPP plug-in).
  - Environment: The setting in which the agent operates (here, the self-play simulation involving LLM-based users).
  - State: The current situation or observation of the environment (e.g., the dialogue history).
  - Action: A decision made by the agent (e.g., a dialogue strategy).
  - Reward: A scalar value indicating how good an action was in a given state, often delayed.
  - Policy (π): A mapping from states to actions, defining the agent's behavior. The goal of RL is to learn an optimal policy.
- Markov Decision Process (MDP): A mathematical framework used to model decision-making in environments where outcomes are partly random and partly under the control of a decision-maker. It is a common formalization for problems solved with RL. An MDP is defined by states, actions, transition probabilities between states, and rewards. The paper explicitly formulates the dialogue process as an MDP.
- Reinforcement Learning from AI Feedback (RLAIF): An extension of Reinforcement Learning from Human Feedback (RLHF). Instead of relying on human annotators to provide preference labels or feedback, RLAIF uses an AI model (often a powerful LLM) as the reward model or critic to evaluate the agent's behavior and generate feedback. This feedback is then used to train the agent. This paper utilizes a third LLM (the reward model) to provide goal-oriented verbal feedback, which is mapped to scalar rewards for RL.
- Self-play Simulation: A technique where an AI agent plays against itself or other AI agents in a simulated environment to generate data and learn. In this paper, two LLMs (one acting as the assistant, another as the user) engage in a conversation, simulating real-world interactions. This generates rich, dynamic interaction data for RL training, allowing the agent to explore various scenarios and learn optimal policies.
- Prompting Schemes: Methods of providing instructions or context to LLMs to guide their output.
  - Zero-shot prompting: Giving an LLM a task without any examples.
  - Few-shot prompting: Providing a few examples of input-output pairs along with the task instruction to guide the LLM.
  - Chain-of-Thought (CoT) prompting: Guiding an LLM to generate intermediate reasoning steps before arriving at a final answer, often improving performance on complex tasks. The paper refers to ProCoT as using CoT.
- RoBERTa: A Transformer-based language model developed by Facebook AI. It is an optimized version of BERT (Bidirectional Encoder Representations from Transformers), known for its robust pre-training approach. RoBERTa is a smaller language model compared to the very large generative LLMs like ChatGPT, making it suitable as a tunable plug-in for specific tasks without the computational overhead of fine-tuning an entire large LLM.
3.2. Previous Works
The paper contextualizes its work by referencing various existing approaches to dialogue policy planning, broadly categorized into pre-LLM and LLM-era methods.
- Pre-LLM Era (Corpus-Based Learning):
  - Mechanism: These methods primarily relied on corpus-based learning from static human-annotated dialogues to predict dialogue strategies (e.g., Joshi et al., 2021; Cheng et al., 2022; Wang et al., 2023c).
  - Limitations:
    - Heavy reliance on static human-annotated dialogues.
    - Failure to optimize long-term conversational goals.
    - Lack of adaptability to new or unseen scenarios beyond the training corpus.
    - Costly and unrealistic to fine-tune entire dialogue systems for every new application.
- LLM Era (Prompt-Based Policy Planning): With the advent of LLMs, research shifted towards leveraging their inherent capabilities:
  - Self-Thinking/Strategy Planning per Turn: Some works prompt a frozen actor LLM to think about its strategy for each turn (Zhang et al., 2023a; Deng et al., 2023b; Wang et al., 2023a).
    - MI-Prompt (Chen et al., 2023): Investigates mixed-initiative strategy-based prompting in proactive dialogues.
    - Ask-an-Expert (AnE) (Zhang et al., 2023a): Prompts another LLM as a strategic expert to reason about the next dialogue strategy.
    - ProCoT (Deng et al., 2023b): Improves proactive dialogue by having the LLM generate a chain-of-thought descriptive analysis for strategy planning.
  - Iterative Refinement with AI Feedback: Other approaches generate AI feedback given the whole dialogue history to iteratively improve policy planning for a specific case (Fu et al., 2023; Yu et al., 2023).
    - ICL-AIF (Fu et al., 2023): Uses in-context learning with AI feedback from self-play simulation to refine dialogue strategies at the dialogue level.
  - Limitations (common to LLM-era prompting):
    - Limited by frozen LLMs: Performance is still bounded by the policy planning capability of the underlying frozen LLMs.
    - Lack of transferability: Iterative refinement methods (Fu et al., 2023; Yu et al., 2023) are case-exclusive, meaning they require re-simulation for every new case, making them impractical for real-world application.
    - Evaluation deficiencies: Evaluation still often relies on turn-level response quality metrics rather than comprehensive dialogue-level goal achievement.
- Learnable Plug-ins for LLMs:
  - This is a recent trend leveraging smaller models or APIs to enhance LLMs for specific tasks without full fine-tuning.
  - External APIs/Models: Using APIs (Schick et al., 2023), vision models (Wu et al., 2023), or functional models (Shen et al., 2023). These are often fixed and do not learn from feedback.
  - Small Language Model Plug-ins: Using smaller language models for tasks like text classification (Xu et al., 2023), summarization (Li et al., 2023), QA (Yao et al., 2023), or specific capabilities like mental state reasoning (Sclar et al., 2023). These can be fine-tuned or trained with RL.
- Reinforcement Learning from AI Feedback (RLAIF):
  - RLAIF (Bai et al., 2022) was proposed to train models without human preference labels.
  - Existing RLAIF methods often leverage natural language feedback from LLMs to self-refine prompts (Shinn et al., 2023; Fu et al., 2023; Madaan et al., 2023; Hao et al., 2023), rather than deriving scalar rewards for model training.
3.3. Technological Evolution
The field of dialogue policy planning has evolved significantly:
- Early Rule-Based Systems: Simple, deterministic systems with hand-crafted rules for responses. Limited flexibility.
- Statistical/ML-Based Systems (Pre-LLM): Shifted to corpus-based learning, where models learned policies from human-annotated dialogues using statistical methods or traditional machine learning. While more flexible, they were heavily reliant on domain-specific data and struggled with long-term optimization. Examples include Joshi et al., 2021 and Cheng et al., 2022.
- Emergence of Large Language Models (LLMs): LLMs brought unprecedented context understanding and response generation capabilities. Initial attempts involved prompting LLMs to perform policy planning directly, often on a turn-by-turn basis. This showed promise but was constrained by the frozen nature of LLMs and the difficulty of transferring learned policies.
- Refinement with AI Feedback: The next step was to use LLMs not just for generation but also for feedback (e.g., ICL-AIF, Fu et al., 2023), attempting to iteratively improve the policy. However, this often remained case-specific.
- This Paper's Position (Plug-and-Play Tunable Policy Planner): The current paper represents a significant step by introducing a tunable language model plug-in specifically for dialogue policy planning. This decouples the strategic component from the core LLM, allowing for targeted training with SFT and RL from goal-oriented AI feedback in self-play simulations. This innovation aims to achieve generalizability and transferability across cases and applications, addressing key limitations of prior LLM-based methods.
3.4. Differentiation Analysis
Compared to the main methods in related work, PPDPP offers several core differences and innovations:
- Tunable Plug-in vs. Frozen LLM:
  - Prior prompt-based methods (e.g., Standard, Proactive, ProCoT, AnE): Rely on prompting a frozen LLM to conduct policy planning. Their policy planning capability is inherently bounded by the pre-trained knowledge and reasoning abilities of the LLM, which cannot be explicitly improved for specific dialogue goals.
  - PPDPP: Introduces a tunable language model plug-in (a smaller model like RoBERTa) dedicated to policy planning. This plug-in's parameters are learnable and can be iteratively optimized through SFT and RL, directly enhancing its policy planning capability.
- Generalizability and Transferability:
  - Iterative refinement methods (e.g., ICL-AIF): Enhance policy planning by iteratively refining strategies through self-play simulations and AI feedback for a given case. They therefore lack transferability, since multiple rounds of simulation are needed for every new case.
  - PPDPP: After training, the learned plug-in can be generalized to new cases without requiring further iterative simulation. Furthermore, the plug-in can be swapped out to adapt the LLM to different applications by simply substituting the learned plug-in for that domain, without retraining the large backbone LLM. This provides a level of flexibility and efficiency unmatched by case-exclusive refinement methods.
- Goal-Oriented RL from Scalar AI Feedback:
  - Most existing RLAIF approaches: Often directly use natural language feedback from LLMs to self-refine prompts (Shinn et al., 2023; Fu et al., 2023), which is less direct for gradient-based optimization.
  - PPDPP: Employs a goal-oriented AI feedback mechanism where a reward-model LLM provides verbal feedback that is explicitly transformed into scalar rewards. This allows direct application of reinforcement learning algorithms (such as policy gradient) to optimize the plug-in's policy, capturing long-term goal-oriented rewards from dynamic multi-turn interactions.
- Interactive Evaluation for Policy Planning:
  - Traditional evaluation: Typically focuses on turn-level response quality against fixed reference responses, which fails to assess strategic policy planning capability in multi-turn, goal-oriented dialogues.
  - PPDPP: Proposes and uses an LLM-based interactive evaluation approach that harnesses LLM-based user simulators and reward models to assess dialogue-level metrics such as success rate and average turn for goal achievement. This is a more appropriate and automated way to measure policy planning effectiveness.

In summary, PPDPP innovates by decoupling policy planning into a dedicated, learnable component, enabling systematic improvement through a principled RL framework with AI feedback, and ensuring better generalizability and modularity than prior LLM-based dialogue policy methods.
4. Methodology
4.1. Principles
The core idea behind PPDPP is to decouple the dialogue policy planning component from the Large Language Model's (LLM) response generation capabilities. Instead of relying on a frozen LLM to implicitly plan strategies through prompting, PPDPP employs a separate, smaller, and tunable language model as a plug-in specifically for policy planning. This plug-in (PPDPP) predicts the optimal dialogue action (strategy) at each turn. The rationale is that by offloading the strategic decision-making to a specialized, learnable component, the system can achieve better generalizability and transferability across different cases and applications. The theoretical basis for learning this optimal policy comes from Reinforcement Learning (RL), where the plug-in learns to maximize cumulative rewards in a simulated dialogue environment. The intuition is that dialogue policy planning is a distinct task from language generation, and by giving it its own learning capacity, it can be optimized more effectively for proactive dialogue problems.
4.2. Core Methodology In-depth (Layer by Layer)
The methodology of PPDPP involves several interconnected components: an MDP environment formulation, the Plug-and-Play Dialogue Policy Planner itself, a self-play interaction mechanism, an LLM as a Reward Model, and a Reinforcement Learning framework.
4.2.1. MDP Environment Formulation
The paper formulates the entire dialogue process as a Markov Decision Process (MDP). This is a standard way to model sequential decision-making problems, making them amenable to Reinforcement Learning.
At each turn $t$ of the dialogue, the dialogue system observes the current dialogue history, which represents the state $s_t$. Based on this state, it selects a dialogue action $a_t$ from a predefined set of candidate strategies $\mathcal{A}$. The user (or user simulator) then responds to this action. This interaction continues until a conversational goal is achieved or a maximum number of turns is reached.
The objective of this MDP is to learn an optimal policy $\pi^*$ that maximizes the expected cumulative rewards over the entire dialogue episode. This is mathematically expressed as:
$
\pi^* = \arg\max_{\pi \in \Pi} \mathbb{E}\left[ \sum_{t=0}^{T} r(s_t, a_t) \right]
$
Where:
- $\pi^*$: The optimal dialogue policy that the system aims to learn.
- $\pi$: A specific policy, i.e., a function that maps observed states to actions.
- $\Pi$: The set of all possible policies.
- $t$: The current turn (timestep) in the dialogue.
- $T$: The maximum number of turns allowed in a dialogue episode.
- $r(s_t, a_t)$: The immediate reward received at turn $t$ for taking action $a_t$ in state $s_t$; also written as $r_t$.
- $s_t$: The state of the dialogue at turn $t$, which encapsulates the dialogue history.
- $a_t$: The action (dialogue strategy) chosen by the dialogue system at turn $t$.
4.2.2. Plug-and-Play Dialogue Policy Planner (PPDPP)
PPDPP is the central component responsible for dialogue policy planning. It is implemented as a smaller, tunable pre-trained language model (e.g., RoBERTa) that acts as a plug-in to the main LLM-powered dialogue agent. Its role is to predict the next dialogue action $a_t$.
The training of PPDPP involves two phases:
4.2.2.1. Supervised Fine-Tuning (SFT)
Before interactive online learning with Reinforcement Learning, the PPDPP plug-in is initialized through Supervised Fine-Tuning (SFT). This phase uses an available human-annotated dialogue corpus $\mathcal{D}$. The goal of SFT is to teach PPDPP to mimic human-expert dialogue strategy choices.
Given the dialogue history up to the previous turn (which forms the current state $s_t$), PPDPP predicts the action $a_t$ for the current turn. The dialogue history is represented as a sequence of system and user utterances: $\{u_1^{\mathrm{sys}}, u_1^{\mathrm{usr}}, \ldots, u_{t-1}^{\mathrm{sys}}, u_{t-1}^{\mathrm{usr}}\}$.
The prediction process is:
$
a_t = \mathrm{PPDPP}(u_1^{\mathrm{sys}}, u_1^{\mathrm{usr}}, \ldots, u_{t-1}^{\mathrm{sys}}, u_{t-1}^{\mathrm{usr}})
$
Where:
- $a_t$: The predicted dialogue action (strategy) for the current turn $t$.
- $\mathrm{PPDPP}$: The Plug-and-Play Dialogue Policy Planner model, which takes the dialogue history as input.
- $u_t^{\mathrm{sys}}$: The utterance generated by the system at turn $t$.
- $u_t^{\mathrm{usr}}$: The utterance generated by the user at turn $t$.
The SFT phase optimizes PPDPP by minimizing the cross-entropy loss between its predicted action distributions and the human-labeled actions in the annotated dialogues:
$
\mathcal{L}_c = -\frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} \frac{1}{T_d} \sum_{t=1}^{T_d} a_t \log y_t
$
Where:
- $\mathcal{L}_c$: The cross-entropy loss function.
- $|\mathcal{D}|$: The total number of dialogues in the human-annotated corpus.
- $d$: An individual dialogue within the corpus.
- $T_d$: The number of turns in dialogue $d$.
- $y_t$: The probability distribution over possible actions predicted by PPDPP for turn $t$.
- $a_t$: Here, the one-hot encoded vector representing the human-labeled (ground-truth) action for turn $t$.
This SFT initialization, while potentially sub-optimal on its own, is crucial for accelerating the convergence of the subsequent interactive online training (RL phase).
4.2.3. Self-play Interaction
After SFT, the system enters the interactive online learning phase, which relies on self-play conversations to simulate dynamic user-assistant interactions. This simulation involves using two LLMs configured to act as distinct roles: an assistant and a user.
Each of these LLMs is given a role description and specific conversational goals through prompts. For instance:
- Negotiation: A buyer LLM aims for a lower price, while a seller LLM aims for a higher price.
- Emotional Support: A patient LLM has a specific emotional problem, and a therapist LLM aims to reduce distress.
- Tutoring: A student LLM has a knowledge gap, and a teacher LLM aims to impart knowledge.

The dialogue proceeds as follows (a sketch of this loop is given after the state definitions below):
- PPDPP Predicts Action: When it is the assistant's turn to speak, the PPDPP plug-in (trained in the SFT phase) first predicts the next action $a_t$ based on the interaction history so far.
- Action to Natural Language: The predicted action is mapped to a predefined natural language instruction using a mapping function $\mathcal{M}_a$. This instruction serves as an explicit prompt for the assistant LLM.
- Assistant Generates Response: The assistant player (an LLM, e.g., ChatGPT) then generates its strategic response $u_t^{\mathrm{sys}}$ based on the current dialogue history and the natural language action instruction:
$
u_t^{\mathrm{sys}} = \mathrm{LLM}_{\mathrm{sys}}(p_{\mathrm{sys}}; \mathcal{M}_a(a_t); u_1^{\mathrm{sys}}, u_1^{\mathrm{usr}}, \ldots, u_{t-1}^{\mathrm{sys}}, u_{t-1}^{\mathrm{usr}})
$
  Where:
  - $u_t^{\mathrm{sys}}$: The utterance generated by the system (assistant LLM) at turn $t$.
  - $\mathrm{LLM}_{\mathrm{sys}}$: The Large Language Model acting as the system (assistant).
  - $p_{\mathrm{sys}}$: The initial prompt defining the assistant's role and goal.
  - $\mathcal{M}_a(a_t)$: The natural language instruction derived from the PPDPP's predicted action $a_t$.
  - The remaining terms are the dialogue history up to turn $t-1$.
- User Generates Response: Subsequently, the user player (another LLM) generates its response $u_t^{\mathrm{usr}}$ based on the updated dialogue history:
$
u_t^{\mathrm{usr}} = \mathrm{LLM}_{\mathrm{usr}}(p_{\mathrm{usr}}; u_1^{\mathrm{sys}}, u_1^{\mathrm{usr}}, \ldots, u_{t-1}^{\mathrm{sys}}, u_{t-1}^{\mathrm{usr}}, u_t^{\mathrm{sys}})
$
  Where:
  - $u_t^{\mathrm{usr}}$: The utterance generated by the user (user LLM) at turn $t$.
  - $\mathrm{LLM}_{\mathrm{usr}}$: The Large Language Model acting as the user.
  - $p_{\mathrm{usr}}$: The initial prompt defining the user's role and goal.
  - The remaining terms are the dialogue history including the system's latest utterance $u_t^{\mathrm{sys}}$.

This process continues until a terminal state is reached. Three types of states are defined:
- ON-GOING: The conversation is still active, and the goal has not been met.
- GOAL-COMPLETED: The designated conversational goal has been achieved (e.g., a deal reached, emotional problem solved, exercise mastered).
- GOAL-FAILED: The conversational goal is not met within the maximum number of turns.
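The sketch below illustrates one self-play episode under the assumptions that `policy` wraps the PPDPP plug-in, `chat` wraps any chat LLM, and `action_instructions` implements the mapping $\mathcal{M}_a$; all names and the prompt handling are illustrative, not the paper's implementation.

```python
def run_selfplay_episode(policy, chat, sys_prompt, usr_prompt,
                         action_instructions, max_turns=8):
    """One simulated dialogue between an assistant LLM and a user LLM,
    with PPDPP choosing the assistant's strategy each turn (illustrative sketch).

    policy: callable mapping dialogue history -> strategy index
    chat: callable(prompt, history) -> utterance, wrapping any chat LLM
    action_instructions: M_a, mapping strategy index -> natural-language instruction
    """
    history = []                      # list of (speaker, utterance) pairs
    trajectory = []                   # (state, action) pairs kept for RL
    for t in range(max_turns):
        action = policy(history)                                 # PPDPP predicts a_t
        instruction = action_instructions[action]                # M_a(a_t)
        sys_utt = chat(f"{sys_prompt}\n{instruction}", history)  # assistant LLM
        history.append(("sys", sys_utt))
        usr_utt = chat(usr_prompt, history)                      # user LLM replies
        history.append(("usr", usr_utt))
        trajectory.append((list(history), action))
        # The reward-model LLM (next section) decides GOAL-COMPLETED / ON-GOING here.
    return history, trajectory
```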
4.2.4. LLM as Reward Model ($\mathrm{LLM_{rwd}}$)
A third LLM is designated as the reward model ($\mathrm{LLM_{rwd}}$). This model has two critical functions:
- Determine Goal Completion: It assesses whether the conversational goal has been achieved at any point during the dialogue.
- Evaluate Policy Outcome: It provides feedback that is transformed into scalar rewards for the Reinforcement Learning process.

To achieve this, the reward model is prompted with a multi-choice question designed to elicit goal-oriented AI feedback. This verbal feedback is then converted into scalar rewards using a predefined mapping $\mathcal{M}_r$.
To mitigate the subjectivity and variance often present in LLM-generated outputs, the common practice of sampling is employed. The reward model generates goal-oriented AI feedback $l$ times, and these verbal feedbacks are converted into scalar values and then averaged to produce a robust scalar value $v_t$:
$
v_t = \frac{1}{l} \sum_{i=1}^{l} \mathcal{M}_r\big(\mathrm{LLM}_{\mathrm{rwd}}(p_{\mathrm{rwd}}; u_1^{\mathrm{sys}}, u_1^{\mathrm{usr}}, \ldots, u_{t-1}^{\mathrm{sys}}, u_{t-1}^{\mathrm{usr}}, u_t^{\mathrm{sys}}, u_t^{\mathrm{usr}}; \tau)\big)
$
Where:
- $v_t$: The averaged scalar value representing the reward at turn $t$.
- $l$: The number of times the reward model's output is sampled.
- $i$: The index of each sample.
- $\mathcal{M}_r$: The mapping function that transforms the verbal feedback from the reward model into a scalar value.
- $\mathrm{LLM}_{\mathrm{rwd}}$: The Large Language Model acting as the reward model.
- $p_{\mathrm{rwd}}$: The prompt given to the reward model for evaluation.
- $\tau$: The temperature parameter used for sampling from the LLM, controlling the randomness of its output. Higher temperatures lead to more diverse (and potentially more variable) outputs.
- The remaining terms are the complete dialogue history up to the current turn $t$.

This value $v_t$ is first used to determine the state of the self-play interaction. If $v_t$ is greater than or equal to a predefined threshold, the state is declared GOAL-COMPLETED.
The reward for Reinforcement Learning is assigned as follows:
- If the conversation reaches a terminal state (GOAL-COMPLETED or GOAL-FAILED), then $r_t = v_t$, so the final assessment directly contributes to the reward.
- If the conversation is ON-GOING (not yet terminal), a small negative reward is assigned. This penalizes lengthy conversations and implicitly promotes efficient goal completion.
4.2.5. Reinforcement Learning
Once a dialogue episode reaches a terminal state (either GOAL-COMPLETED or GOAL-FAILED), goal-oriented rewards are obtained. These rewards are then used to update the PPDPP plug-in's policy through Reinforcement Learning. The policy agent is denoted as $\pi_\theta(a_t \mid s_t)$, which gives the probability of taking action $a_t$ given state $s_t$.
The paper uses the vanilla policy gradient method (Sutton et al., 1999) to optimize the policy agent. This method updates the model's parameters in the direction that increases the probability of actions that lead to higher rewards. The update rule is:
$
\theta \gets \theta - \alpha \nabla \log \pi_\theta(a_t \mid s_t) R_t
$
Where:
- $\theta$: The parameters of the policy agent (PPDPP).
- $\alpha$: The learning rate, a hyper-parameter that controls the step size of the parameter updates.
- $\nabla \log \pi_\theta(a_t \mid s_t)$: The gradient of the log-probability of taking action $a_t$ in state $s_t$, with respect to the policy parameters. It indicates how to change the parameters to make action $a_t$ more or less likely.
- $R_t$: The cumulative discounted return from turn $t$ onwards, i.e., the sum of future rewards discounted by a factor $\gamma$.

The cumulative discounted return is calculated as:
$
R_t = \sum_{t'=t}^{T} \gamma^{T-t'} r_{t'}
$
Where:
- $t'$: An index for summing over future turns.
- $\gamma$: The discount factor, a value between 0 and 1 that determines the present value of future rewards. A higher $\gamma$ means future rewards are considered more important.
- $r_{t'}$: The immediate reward received at turn $t'$.
4.2.6. Inference Phase
During inference (when the system is deployed and interacting with real users), the tuned PPDPP directly provides the action prompt based on the dialogue history. This action prompt guides the dialogue LLM (the assistant LLM) to generate the next response. Crucially, the reward LLM is not used during inference. This design ensures that the LLM-powered dialogue agent, equipped with the tuned PPDPP, can strategically guide conversations without requiring multiple, costly iterations of simulation for every new case in real-time. This is a key aspect of the "plug-and-play" and generalizability claims.
5. Experimental Setup
5.1. Datasets
The authors evaluate the proposed framework across three distinct proactive dialogue applications, each with its own dataset:
- CraigslistBargain (He et al., 2018):
  - Domain: Bargain negotiation dialogues.
  - Task: Buyer and seller negotiate the price of an item.
  - Statistics:
    - Number of Cases (train/dev/test): 3,090 / 188 / 188
    - Dialogue Acts: 11 negotiation strategies are used in the experiments (the corpus also annotates 4 terminal acts, which are excluded).
  - Characteristics: Each case includes an item category, an item description, a buyer target price, and a seller target price. These serve as contextual instruction information. The goal is to reach a deal that is as favorable as possible to the buyer.
  - Example Data Sample (from Appendix F):
    - Item Name: Furniture
    - Item Description: Macybed Plush Queen Mattress MacyBed 8.5" Plush Pillowtop Queen Mattress in excellent condition. Bought in December of 2013, 3.5 years old. Only had one owner in one household (one person sleeping on it, minimal ware). No stains or discoloring. Been covered with mattress cover since purchase.
    - Listed Price (Seller Target Price): 150
    - Buyer Target Price: 135
- ESConv (Liu et al., 2021):
  - Domain: Emotional support conversations.
  - Task: A therapist assists a patient in reducing emotional distress and working through challenges.
  - Statistics:
    - Number of Cases (train/dev/test): 1,040 / 130 / 130
    - Number of Strategies: 8 (types of support strategies)
  - Characteristics: Each case is accompanied by a problem type, an emotion type, and a situation description provided by the patient. The goal is to solve the patient's emotional issue.
  - Example Data Sample (from Appendix F):
    - Emotion Type: Fear
    - Problem Type: Job Crisis
    - Situation: I think I will be losing my job soon. I just read an email taking about the need for us to cut cost and also how we have not got any support from the government.
- CIMA (Stasaski et al., 2020):
  - Domain: Tutoring dialogues.
  - Task: A teacher tutors a student on translating an English prepositional sentence into Italian.
  - Statistics:
    - Number of Cases (train/dev/test): 909 / 113 / 113 (random 8:1:1 split)
    - Number of Pedagogical Strategies: 5
  - Characteristics: Each case involves a specific exercise (English sentence) and the student's individual problem or knowledge state related to that exercise. The goal is for the student to master the exercise.

These datasets were chosen because they represent diverse proactive dialogue problems with clear, quantifiable goals (negotiation outcome, emotional problem resolution, learning mastery), making them suitable for validating a policy planning framework. They also provide human-annotated data for SFT and case backgrounds for RL simulation.
5.2. Evaluation Metrics
The paper emphasizes dialogue-level interactive evaluation over traditional turn-level metrics to assess policy planning capability.
- Average Turn (AT)
  - Conceptual Definition: AT measures the efficiency of goal completion. It quantifies the average number of turns required for the dialogue system to successfully achieve its designated conversational goal. A lower AT indicates a more efficient policy.
  - Mathematical Formula: The paper does not provide an explicit formula, but it can be commonly understood as:
$
\mathrm{AT} = \frac{\sum_{e \in \mathrm{SuccessfulEpisodes}} \mathrm{turns}(e)}{\text{Number of Successful Episodes}}
$
  - Symbol Explanation:
    - $\mathrm{SuccessfulEpisodes}$: The set of dialogue episodes where the conversational goal was successfully achieved.
    - $\mathrm{turns}(e)$: The total number of turns taken in a specific successful episode $e$.
    - Number of Successful Episodes: The total count of dialogue episodes that ended in GOAL-COMPLETED.
- Success Rate (SR@t)
  - Conceptual Definition: SR measures the effectiveness of goal completion. It represents the proportion of dialogue episodes in which the conversational goal was successfully achieved within a predefined maximum number of turns $t$. A higher SR indicates a more effective policy. The paper sets the maximum turn to 8 for the experiments (SR@8).
  - Mathematical Formula: The paper does not provide an explicit formula, but it can be commonly understood as:
$
\mathrm{SR@}t = \frac{\text{Number of Successful Episodes within } t \text{ turns}}{\text{Total Number of Episodes}}
$
  - Symbol Explanation:
    - Number of Successful Episodes within $t$ turns: The count of dialogue episodes that reached GOAL-COMPLETED within $t$ turns.
    - Total Number of Episodes: The total count of all dialogue episodes simulated or evaluated.
- Sale-to-List Ratio (SL%)
  - Conceptual Definition: SL% is a specific metric used for negotiation dialogues (CraigslistBargain) to determine the effectiveness of goal completion from the buyer's perspective. It quantifies how much benefit the buyer gains from the deal relative to the potential negotiation range. A higher SL% means the buyer achieved a price closer to their target price (and further from the seller's initial price), indicating greater benefit. If no deal is reached, SL% is assigned as 0.
  - Mathematical Formula:
$
\mathrm{SL\%} = \frac{\text{deal price} - \text{seller target price}}{\text{buyer target price} - \text{seller target price}}
$
  - Symbol Explanation:
    - deal price: The final negotiated price agreed upon in the dialogue.
    - seller target price: The initial (listed) price or desired minimum price of the seller.
    - buyer target price: The desired maximum price the buyer is willing to pay.
- Human Evaluation Metrics:
  - For ESConv (Emotional Support):
    - Identification: How helpful the assistant is in exploring and identifying the problem.
    - Comforting: How skillful the assistant is in comforting the user.
    - Suggestion: How helpful the suggestions are for solving the problem.
  - For CraigslistBargain (Negotiation):
    - Persuasive: How persuasive the assistant is in the negotiation.
    - Coherent: How on-topic and consistent with the conversation history the assistant is.
    - Natural: How human-like the assistant's responses are.
  - Overall: A general assessment of the dialogue quality. These are subjective metrics, evaluated by human annotators in a pairwise comparison (Win/Tie/Lose) between PPDPP and baselines.
The evaluation is conducted in an interactive setting using LLM-based user simulators and LLM-based reward models, as described in the methodology, to simulate diverse interactions and automatically assess dialogue-level outcomes.
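For concreteness, the sketch below (not from the paper) computes AT, SR@8, and SL% from simple episode records; with the example case above (listed price 150, buyer target 135), a hypothetical deal at 140 would give SL% = (140 - 150) / (135 - 150) ≈ 0.67.

```python
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class EpisodeRecord:           # illustrative record of one evaluated dialogue
    success: bool              # reached GOAL-COMPLETED within the turn budget?
    turns: int                 # number of turns used
    deal_price: Optional[float] = None   # negotiation only; None if no deal

def average_turn(episodes: List[EpisodeRecord]) -> float:
    done = [e for e in episodes if e.success]
    return sum(e.turns for e in done) / len(done) if done else float("nan")

def success_rate(episodes: List[EpisodeRecord], max_turns: int = 8) -> float:
    return sum(e.success and e.turns <= max_turns for e in episodes) / len(episodes)

def sale_to_list(e: EpisodeRecord, seller_target: float, buyer_target: float) -> float:
    if e.deal_price is None:   # no deal reached -> SL% = 0
        return 0.0
    return (e.deal_price - seller_target) / (buyer_target - seller_target)

# Example: hypothetical deal at 140 with listed price 150 and buyer target 135 -> ~0.67
print(sale_to_list(EpisodeRecord(True, 5, 140.0), 150.0, 135.0))
```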
5.3. Baselines
The paper compares PPDPP against several existing methods, encompassing both general fine-tuned dialogue models and various LLM-based dialogue policy planning approaches:
- DialoGPT (Zhang et al., 2020): A general fine-tuned dialogue model. This represents a pre-LLM baseline for response generation.
- Standard: A vanilla LLM prompting scheme. This involves directly prompting two LLMs to conduct self-play conversations using task instructions without considering any explicit dialogue strategy. It serves as a strong baseline for the raw capability of LLMs.
- AnE (Ask-an-Expert) (Zhang et al., 2023a): This method prompts another LLM to act as a strategic expert. The expert LLM is asked a series of questions to reason about the next dialogue strategy. The strategy is a verbal description rather than a selection from a predefined taxonomy. The original work targeted emotional support dialogues and is adapted to the other applications by changing the roles in the questions.
- Proactive (Deng et al., 2023b): This approach prompts the LLM-based dialogue system to first select the most appropriate strategy for the next turn, and then generate the response based on the selected strategy. The predicted strategy label is mapped into a mixed-initiative strategy prompt (MI-Prompt) as per Chen et al. (2023).
  - + MI-Prompt (Chen et al., 2023): This variant specifically incorporates the mixed-initiative strategy-based prompting detailed in Chen et al. (2023) into the Proactive scheme, explicitly guiding the LLM with strategic instructions.
- ProCoT (Deng et al., 2023b): This method builds upon Proactive by first prompting the LLM-based dialogue system to generate a chain-of-thought (CoT) descriptive analysis for planning the next turn's strategy. This aims to improve the LLM's reasoning before action selection.
  - + MI-Prompt (Chen et al., 2023): Similar to Proactive, this explicitly incorporates the MI-Prompt mechanism within the ProCoT framework.
- ICL-AIF (In-Context Learning from AI Feedback) (Fu et al., 2023): This method prompts an LLM to provide dialogue-level feedback to a player to improve their dialogue strategies. The feedback is verbal and is used over iterations of self-play simulation to refine the overall strategy. Unlike AnE, it focuses on dialogue-level rather than turn-level feedback.
Implementation Details:
- PPDPP Plug-in: RoBERTa (roberta-large) is used as the default plug-in.
- Backbone LLMs:
  - For the role-playing LLMs ($\mathrm{LLM_{sys}}$ and $\mathrm{LLM_{usr}}$) and the reward model ($\mathrm{LLM_{rwd}}$), ChatGPT (gpt-3.5-turbo-0613) is the primary choice.
  - For the role-playing LLMs, the temperature is set so that outputs are deterministic.
  - For the reward model, a sampling temperature is used and the output is sampled multiple times (l = 10; see Table 7) to integrate scalar rewards.
  - Performance comparisons are also made with open-source LLMs, Vicuna-13B-delta-v1.1 and LLaMA-2-13B-Chat, as backbone LLMs for response generation, while the user simulator and reward model remain ChatGPT.
5.4. Training Details
The training process for PPDPP is divided into two phases: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL).
The following hyper-parameter settings are taken from [Table 6] of the original paper:
| Training Phase | Hyper-parameter | Value |
|---|---|---|
| SFT | Batch Size | 16 |
| SFT | Training Epochs | 10 |
| SFT | Learning Rate | 6e-6 |
| SFT | Max Sequence Length | 512 |
| RL | Training Episodes | 1,000 |
| RL | Learning Rate | 1e-6 |
| RL | Max Conversation Turn | 8 |
| RL | Discount Factor γ | 0.999 |
| RL | Max New Tokens | 32 |
SFT Phase:
- PPDPP is fine-tuned on the training set of the respective datasets. Checkpoint selection is based on the best performance on the validation set.
- Hyper-parameters include a batch size of 16, 10 training epochs, a learning rate of 6e-6, a max sequence length of 512, a linear learning-rate scheduler, and a weight decay of 0.01.
RL Phase:
- Cases from the training set are randomly sampled for online training.
- Hyper-parameters include 1,000 training episodes, a learning rate of 1e-6, a max conversation turn of 8, a discount factor of 0.999, and max new tokens of 32.
- Experiments are conducted on a server equipped with 8 Tesla V100 GPUs.
5.5. Reliability Analysis of LLMs as Reward Models and User Simulators
Before deploying the self-play evaluation and RL training, the authors performed a reliability analysis on using LLMs as reward models and user simulators.
5.5.1. Analysis of LLMs as Reward Model
The reward model is tasked with selecting the situation that best matches the current user state. The reliability is assessed by computing the F1 score of its predictions against human-annotated labels on 50 sampled self-play dialogues from each dataset.
The following figure (Figure 5 from the original paper) shows the analysis of LLMs as reward models:
Figure 5 is a chart comparing different large language models (Vicuna, LLaMA2-Chat, ChatGPT) used as the reward model on the CraigslistBargain, ESConv, and CIMA datasets; ChatGPT performs best overall.
As seen in Figure 5, ChatGPT performs well across all three datasets (CraigslistBargain, ESConv, CIMA). Vicuna-13B and LLaMA2-Chat-13B also perform well for CraigslistBargain and ESConv. However, Vicuna and LLaMA2 struggle significantly with CIMA, which involves Italian translation. This is because they were not trained on large-scale Italian data, making them unable to correctly evaluate the students' Italian translations. This analysis confirms ChatGPT's suitability as the reward model for all three problems due to its strong multilingual capabilities.
5.5.2. Analysis of LLMs as User Simulator
Simulated users are expected to accurately play their assigned role within a specific context. The quality of user simulators is evaluated based on naturalness (fluency and human-likeness of utterances) and usefulness (consistency with role descriptions) in single-turn and multi-turn free-form conversations. A pairwise evaluation (Win/Tie/Lose) was conducted by two annotators, comparing LLM-based simulators with a fine-tuned DialoGPT and original human conversations.
The following are the results from [Table 5] of the original paper:
| Setting | Single-turn Natural | Single-turn Useful | Multi-turn Natural | Multi-turn Useful |
|---|---|---|---|---|
| DialoGPT | 8% | 4% | 2% | 5% |
| ChatGPT | 63% | 72% | 78% | 74% |
| Tie | 29% | 24% | 20% | 21% |
| Human | 14% | 22% | 18% | 27% |
| ChatGPT | 49% | 42% | 36% | 33% |
| Tie | 37% | 36% | 46% | 41% |

(The first three rows compare DialoGPT against ChatGPT as the user simulator; the last three compare original human conversations against ChatGPT. Percentages are preference rates from the pairwise evaluation.)
The Cohen's Kappa between annotators was 0.72, indicating substantial agreement. ChatGPT-based simulators show significantly superior performance compared to DialoGPT, especially in naturalness for multi-turn conversations. Even when compared to human-annotated dialogues, ChatGPT-based simulators achieve competitive performance (e.g., 49% Win rate vs. Human in single-turn Naturalness, with 37% Tie). These results validate the reliability of using ChatGPT as the user simulator.
6. Results & Analysis
6.1. Core Results Analysis
The paper presents a comprehensive evaluation comparing PPDPP with various baselines across three proactive dialogue applications. The primary metrics are Average Turn (AT) (lower is better for efficiency), Success Rate (SR) (higher is better for effectiveness), and Sale-to-List Ratio (SL%) for negotiation (higher is better for buyer benefit).
The following are the results from [Table 3] of the original paper:
(CB = CraigslistBargain, ESC = ESConv.)

| Method | #Tokens | CB AT↓ | CB SR↑ | CB SL%↑ | ESC AT↓ | ESC SR↑ | CIMA AT↓ | CIMA SR↑ |
|---|---|---|---|---|---|---|---|---|
| DialoGPT | - | 6.73 | 0.3245 | 0.2012 | 5.31 | 0.7538 | 5.43 | 0.4956 |
| Standard | O(L) | 6.47 | 0.3830 | 0.1588 | 5.10 | 0.7692 | 3.89 | 0.6903 |
| AnE (Zhang et al., 2023a) | O((M + 1)L) | 5.91 | 0.4521 | 0.2608 | 4.76 | 0.8000 | 3.86 | 0.6549 |
| Proactive (Deng et al., 2023b) | O(L) | 5.80 | 0.5638 | 0.2489 | 5.08 | 0.7538 | 4.84 | 0.5310 |
| + MI-Prompt (Chen et al., 2023) | O(2L) | 5.74 | 0.5691 | 0.2680 | 4.78 | 0.7846 | 4.70 | 0.5664 |
| ProCoT (Deng et al., 2023b) | O(L) | 6.22 | 0.5319 | 0.2486 | 4.75 | 0.7923 | 4.58 | 0.5487 |
| + MI-Prompt (Chen et al., 2023) | O(2L) | 6.12 | 0.5532 | 0.3059 | 4.83 | 0.7769 | 4.72 | 0.5221 |
| ICL-AIF (Fu et al., 2023) | O((N + 1)L) | 6.53 | 0.3617 | 0.1881 | 4.69 | 0.8079 | 4.19 | 0.6106 |
| PPDPP | O(L) | 5.62 | 0.6117 | 0.3376 | 4.56 | 0.8462 | 3.03 | 0.8407 |
| - w/o SFT | O(L) | 5.71 | 0.6223 | 0.3354 | 4.68 | 0.8384 | 3.18 | 0.8230 |
| - w/o RL | O(L) | 5.57 | 0.6649 | 0.2280 | 5.24 | 0.7308 | 3.41 | 0.7965 |
Overall Observations:
- PPDPP consistently outperforms baselines: The proposed PPDPP method achieves the best performance across all three datasets and almost all metrics (lower AT, higher SR/SL%). This demonstrates its effectiveness and efficiency in achieving conversational goals.
- Efficiency in token usage: PPDPP requires O(L) tokens, which is similar to Standard, Proactive, and ProCoT, and significantly less than AnE (O((M + 1)L)) and ICL-AIF (O((N + 1)L)), making it more practical for black-box LLMs.
- Importance of Reinforcement Learning (RL): The PPDPP - w/o RL variant (meaning only SFT was applied) generally performs worse than the full PPDPP, highlighting the crucial role of RL with simulated interactions in optimizing the policy.
Task-Specific Observations:
- Negotiation Dialogues (CraigslistBargain):
  - PPDPP achieves the lowest AT (5.62), highest SR (0.6117), and highest SL% (0.3376), significantly outperforming all baselines.
  - Prompt-based methods (AnE, Proactive, ProCoT with/without MI-Prompt) show substantial improvement over Standard and DialoGPT in SR and SL%. ICL-AIF performs poorly in negotiation, even negatively affecting SR compared to Standard, suggesting dialogue-level AI feedback might not be dynamic enough for negotiation.
  - Trade-off between SR and SL% in PPDPP: The PPDPP - w/o RL variant has a higher SR (0.6649) but a much lower SL% (0.2280) compared to the full PPDPP. This indicates that RL optimizes for higher buyer benefit (SL%) at the cost of slightly reducing the likelihood of a deal (SR), which is expected behavior in competitive negotiation.
- Emotional Support Dialogues (ESConv):
  - PPDPP achieves the lowest AT (4.56) and highest SR (0.8462).
  - Standard prompting performs reasonably well (SR of 0.7692). PPDPP - w/o RL performs worse than Standard prompting (SR 0.7308 vs. 0.7692), suggesting that corpus-based SFT alone is insufficient for the complexity and diversity of emotional support. RL significantly boosts PPDPP's SR from 0.7308 to 0.8462, demonstrating its ability to learn effective support strategies.
- Tutoring Dialogues (CIMA):
  - PPDPP achieves the lowest AT (3.03) and highest SR (0.8407).
  - All baseline methods struggle to beat Standard prompting (SR 0.6903), indicating ChatGPT's inherent strength in tutoring/translation.
  - Interestingly, PPDPP - w/o RL (SFT only) substantially outperforms all baselines (SR 0.7965). This is attributed to the CIMA dataset's narrower scope (Italian translation for a specific grammatical structure), where corpus-based learning can be highly effective for cases similar to the training data.
  - RL still further improves PPDPP's SR from 0.7965 to 0.8407, demonstrating continuous optimization capabilities.
6.2. In-depth Analysis
6.2.1. Performance w.r.t Turns
The paper analyzes the relative success rate (SR@t) and relative Sale-to-List Ratio (SL%@t) against the Standard prompting method at different conversation turns.
The following figure (Figure 2 from the original paper) shows the comparison of relative success rate against Standard at different conversation turns:
Figure 2 is a chart comparing the relative success rate of PPDPP and other methods across the three tasks (CraigslistBargain, ESConv, CIMA) at different conversation turns; the y-axis is the relative success rate (%) and the x-axis is the conversation turn. PPDPP performs best overall.
The following figure (Figure 3 from the original paper) shows the comparison of relative Sale-to-List Ratio against Standard at different turns:

Observations:
- PPDPP's consistent superiority: PPDPP consistently outperforms the baselines across all datasets and at almost every turn.
- AnE's early strength, later decline: AnE shows strong performance in the first few turns, especially in negotiation (Figure 3, positive SL% early on). This suggests AnE is effective at quickly understanding simple situations and achieving goals in short dialogues. However, its performance drops rapidly as conversations become longer and more complex, indicating a failure to achieve long-term goals or adapt to complicated situations. This highlights the limitation of turn-level expert advice without a learnable long-term policy.
- Baselines' struggle in tutoring (CIMA): In tutoring dialogues (CIMA, Figure 2), all baselines perform worse than Standard prompting after three turns. This implies they get stuck or make wrong decisions that hinder long-term goal achievement, further emphasizing the importance of a robust policy planning mechanism like PPDPP.
6.2.2. Comparisons with Different LLMs
The paper investigates PPDPP's performance when using different LLMs (ChatGPT, Vicuna, LLaMA2-Chat) as the backbone for response generation, while the user simulator and reward model remain ChatGPT.
The following figure (Figure 4 from the original paper) shows the testing performance curves over training episodes for different backbone LLMs:

Observations:
- RL effectiveness across LLMs: RL training effectively enhances the performance of all LLM-powered dialogue agents on each dialogue problem, demonstrating the robustness of the PPDPP framework regardless of the specific backbone LLM. The optimization objective (SL% for CraigslistBargain, SR for ESConv and CIMA) generally increases with the number of training episodes.
- LLM inherent strengths: ChatGPT does not universally outperform other LLMs:
  - Negotiation (CraigslistBargain): Vicuna and LLaMA2-Chat achieve higher benefits (SL%) than ChatGPT, but with a lower success rate of reaching a deal. This suggests ChatGPT may be more prone to making compromises to reach a deal, potentially due to its general-purpose training favoring responsiveness and collaboration.
  - Emotional Support (ESConv): Vicuna achieves performance competitive with ChatGPT.
  - Tutoring (CIMA): ChatGPT substantially outperforms the others. This is attributed to ChatGPT's superior multilingual capabilities, which are crucial for the Italian translation task.
- Implication: Different LLMs possess inherent strengths in different dialogue problems stemming from their pre-training. This suggests that an ensemble of multiple agents (each leveraging different LLMs) could potentially address a wider range of dialogue problems more effectively.
6.3. Ablation Study of the Sampling Strategy
An ablation study was conducted to validate the benefits of sampling goal-oriented AI feedback multiple times, specifically focusing on its impact on state prediction and reward estimation.
The following are the results from [Table 7] of the original paper:
(The first three result columns are state-prediction F1 scores; the remaining columns are dialogue results under the corresponding reward estimation. CB = CraigslistBargain, ESC = ESConv.)

| Method | State F1↑ (CB) | State F1↑ (ESC) | State F1↑ (CIMA) | CB AT↓ | CB SR↑ | CB SL%↑ | ESC AT↓ | ESC SR↑ | CIMA AT↓ | CIMA SR↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| PPDPP (l = 10) | 93.7 | 93.4 | 94.6 | 5.62 | 0.6117 | 0.3376 | 4.56 | 0.8462 | 3.03 | 0.8407 |
| PPDPP (l = 1) | 91.4 | 88.2 | 90.3 | 5.87 | 0.5957 | 0.2623 | 4.67 | 0.8307 | 3.29 | 0.7965 |
Observations:
- Analysis of State Prediction (left part of Table 7):
  - PPDPP (l = 10), which samples the AI feedback 10 times, significantly improves the F1 score for state prediction across all three datasets (e.g., from 91.4 to 93.7 for CraigslistBargain, 88.2 to 93.4 for ESConv, and 90.3 to 94.6 for CIMA) compared to PPDPP (l = 1), which samples once.
  - This indicates that sampling effectively reduces the variance of the LLM-generated output, leading to more reliable goal-completion determination by the reward model.
- Analysis of Reward Estimation (right part of Table 7):
  - PPDPP (l = 10), which uses averaged continuous rewards, generally outperforms PPDPP (l = 1), which uses discrete rewards, on the dialogue performance metrics (AT, SR, SL%). For example, in CraigslistBargain, SL% increases from 0.2623 to 0.3376 with l = 10.
  - This suggests that fine-grained continuous rewards (derived from averaging multiple samples) lead to better performance: they make the policy planning outcomes more distinguishable during the reinforcement learning process, enabling more precise optimization. A minimal sketch of this reward-averaging step follows this list.
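The sketch below illustrates the reward-averaging step, assuming that each sampled verbal verdict from the reward model is mapped to a scalar and the l samples are averaged; the verdict strings and scalar values in the mapping are illustrative, not the paper's exact wording or scores.

```python
import random
from typing import Callable, List

# Illustrative verdict-to-scalar mapping; the paper's exact labels/values may differ.
VERDICT_TO_SCORE = {"goal achieved": 1.0, "almost there": 0.5, "not yet": -1.0}

def goal_oriented_reward(
    reward_llm: Callable[[str], str],  # e.g. ChatGPT asked whether the goal is met
    dialogue: str,
    l: int = 10,
) -> float:
    """Sample the reward model l times (with non-zero temperature) and average.

    l = 1 corresponds to the discrete-reward ablation in Table 7, while
    l = 10 yields the fine-grained continuous reward used by PPDPP.
    """
    scores: List[float] = []
    for _ in range(l):
        verdict = reward_llm(dialogue)
        scores.append(VERDICT_TO_SCORE.get(verdict.strip().lower(), 0.0))
    return sum(scores) / len(scores)

# Toy usage with a stochastic stand-in for the reward LLM:
fake_reward_llm = lambda _prompt: random.choice(list(VERDICT_TO_SCORE))
print(goal_oriented_reward(fake_reward_llm, "dialogue history ...", l=10))
```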
6.4. Human Evaluation
Human evaluation was conducted on 100 randomly sampled dialogues from ESConv and CraigslistBargain, with three annotators comparing PPDPP against AnE, ProCoT, and ICL-AIF.
The following are the results from [Table 4] of the original paper:
| PPDPP vs. | ESConv Ide. | ESConv Com. | ESConv Sug. | ESConv Ove. | CB Per. | CB Coh. | CB Nat. | CB Ove. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AnE | 31% / 15% | 14% / 27% | 52% / 12% | 34% / 24% | 40% / 23% | 22% / 12% | 14% / 7% | 31% / 18% |
| ProCoT | 27% / 21% | 34% / 20% | 38% / 15% | 30% / 11% | 24% / 21% | 17% / 15% | 9% / 6% | 27% / 21% |
| ICL-AIF | 35% / 12% | 32% / 28% | 33% / 29% | 29% / 22% | 55% / 11% | 39% / 12% | 25% / 3% | 62% / 4% |
Note: Each row compares PPDPP against one baseline, and each cell reports PPDPP's win rate / lose rate for that metric (a lose means the baseline was preferred); CB denotes CraigslistBargain. The remaining percentage corresponds to ties. For example, for PPDPP vs. AnE on Identification (ESConv), 100% − 31% − 15% = 54% of the judgments were ties.
Observations:
- Overall Superiority of PPDPP: PPDPP generally outperforms the other baselines across almost all human evaluation perspectives and Overall scores for both ESConv and CraigslistBargain. This reinforces the findings from the automatic metrics.
- ESConv (Emotional Support): PPDPP shows strong win rates in Identification (vs. AnE: 31% win, 15% lose; vs. ProCoT: 27% win, 21% lose; vs. ICL-AIF: 35% win, 12% lose) and Suggestion.
  - The only exception is Comforting, where AnE has a higher win rate against PPDPP (27% win for AnE vs. PPDPP's 14% win). This might be because AnE is observed to provide detailed and empathetic emotional support strategies (as mentioned in the qualitative case study), which contributes to strong comforting. However, the paper notes that a dialogue system needs to go beyond empathy to proactively explore and solve the patient's issues, which PPDPP does better.
- CraigslistBargain (Negotiation): PPDPP demonstrates significantly higher win rates in Persuasive, Coherent, Natural, and Overall compared to all baselines. For instance, PPDPP has an Overall win rate of 62% against ICL-AIF with only a 4% lose rate, indicating substantial human preference.
6.5. Example Conversations
The paper includes qualitative case studies in Appendix F to illustrate the differences in conversational behavior and outcomes among various methods.
Negotiation Dialogues (Example from Appendix F):
- Scenario: The buyer (LLM agent) and seller (LLM user simulator) negotiate a furniture price. Listed price: $150, buyer target price: $135. (The SL% values below are worked through in the short example after this list.)
- Standard: Directly reveals the budget, sticks to it rigidly, and shows no negotiation strategy. Leads to no deal (SL% = 0).
- Ask-an-Expert (AnE): Effectively reaches a deal quickly but results in a large compromise for the buyer (SL% = 0.3333). Suggestions from the expert LLM might prioritize closing the deal over maximizing the buyer's benefit.
- ProCoT: Employs effective negotiation strategies, leading to a much better deal (SL% = 0.8). Shows strategic turn-by-turn planning.
- ICL-AIF: Provides dialogue-level suggestions that are adopted, but fails to dynamically adapt to user interactions, leading to no deal (SL% = 0). This highlights the limitations of static dialogue-level feedback for dynamic negotiations.
- PPDPP: Similar to ProCoT, uses effective negotiation strategies to achieve a better deal (SL% = 0.8333). It also shows adaptability, maximizing buyer benefit when the seller shows willingness to compromise. This demonstrates the combination of strategic planning and dynamic adaptation.
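For reference, the SL% numbers quoted above can be reproduced with a small worked example. The computation below assumes a buyer-side Sale-to-List formulation, SL% = (listed price − deal price) / (listed price − buyer target), with no deal scored as 0; the deal prices are back-solved from the reported ratios purely for illustration, and the exact metric definition should be checked against the original paper.

```python
from typing import Optional

def buyer_sl(listed: float, buyer_target: float, deal: Optional[float]) -> float:
    """Assumed buyer-side Sale-to-List ratio: 1.0 means the buyer paid exactly
    the target price; 0.0 means no deal (or a deal at the listed price)."""
    if deal is None:  # negotiation ended without agreement
        return 0.0
    return (listed - deal) / (listed - buyer_target)

# Scenario above: listed price $150, buyer target $135.
print(buyer_sl(150, 135, None))   # 0.0    -> no deal (e.g., Standard, ICL-AIF)
print(buyer_sl(150, 135, 145))    # 0.3333 -> a deal with a large buyer compromise
print(buyer_sl(150, 135, 138))    # 0.8    -> a much better deal for the buyer
print(buyer_sl(150, 135, 137.5))  # 0.8333 -> close to the buyer's target
```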
Emotional Support Dialogues (Example from Appendix F):
- Scenario: The patient (LLM user simulator) is facing a job crisis and feeling fear.
- Standard: Consistently conveys empathy but becomes less useful as the conversation progresses and the patient's emotional intensity decreases. Lacks goal-oriented progression.
- Ask-an-Expert (AnE): Produces engaging conversations with detailed empathetic actions but shares the drawback of not progressing beyond empathy to problem-solving.
- ProCoT: Adopts effective emotional support strategies to efficiently solve the patient's issue by providing helpful suggestions, demonstrating a more goal-oriented approach.
- ICL-AIF: Provides dialogue-level suggestions covering different stages of emotional support (validation, exploration, coping strategies), which are effectively implemented to interact with the patient.
- PPDPP: Optimizes the policy planner for efficient goal achievement, leading to shorter conversations. It strategically moves from acknowledgement and reassurance to suggestions for coping. This illustrates the benefit of RL in learning efficient pathways to goal completion.
These examples underscore PPDPP's ability to not only employ effective strategies but also adapt dynamically and efficiently steer conversations towards goals, distinguishing it from methods that are either too passive, too rigid, or lack long-term optimization.
7. Conclusion & Reflections
7.1. Conclusion Summary
This work introduces PPDPP, a novel plug-and-play dialogue policy planner paradigm designed to empower Large Language Model (LLM)-powered dialogue agents with enhanced proactivity. The core innovation lies in decoupling dialogue policy planning from the LLM's response generation using a separate, tunable language model plug-in. The paper also develops a robust training framework that combines supervised fine-tuning (SFT) on human-annotated data for initialization and reinforcement learning (RL) from goal-oriented AI feedback obtained through LLM-based self-play simulations. This framework allows the LLM agent to generalize to new cases and adapt to different applications by simply replacing the learned plug-in, crucially without re-training the expensive backbone LLM. Furthermore, the authors propose an interactive evaluation protocol that leverages LLM-based user simulators and reward models to assess dialogue-level effectiveness and efficiency in multi-turn conversations. Experimental results across negotiation, emotional support, and tutoring dialogues consistently demonstrate PPDPP's superior performance compared to existing methods, highlighting its ability to effectively and efficiently guide conversations towards designated goals.
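To make the two-stage training framework concrete, the sketch below shows one possible shape of the plug-in updates, assuming the plug-in is a small classifier over candidate dialogue actions and assuming a REINFORCE-style policy-gradient objective as one standard way to use the scalar self-play reward; the paper's exact objective, hyperparameters, and plug-in architecture may differ.

```python
import torch
import torch.nn.functional as F

def sft_step(policy, optimizer, context_ids, gold_action: int) -> float:
    """Stage 1 (SFT): cross-entropy on a human-annotated (context, strategy) pair."""
    logits = policy(context_ids)                       # assumed shape: [num_actions]
    loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([gold_action]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def rl_step(policy, optimizer, context_ids, reward: float) -> int:
    """Stage 2 (RL): REINFORCE-style update from a goal-oriented self-play reward.

    Simplified: in practice the action is sampled during the self-play rollout
    and its log-probability is reused once the episode reward arrives.
    """
    probs = F.softmax(policy(context_ids), dim=-1)
    action = int(torch.multinomial(probs, 1))          # sample a dialogue action
    loss = -torch.log(probs[action]) * reward          # reinforce rewarded actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return action
```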
7.2. Limitations & Future Work
The authors identify several implications and avenues for future research:
- Potential of Tunable Plug-ins: The success of PPDPP highlights the significant potential of using tunable plug-ins to address specific shortcomings of LLMs. This approach can be extended to various other applications and could involve integrating multiple plug-ins to tackle even more complex dialogue challenges.
- Inherent LLM Strengths and Ensemble Agents: The findings reveal that dialogue agents powered by different LLMs possess inherent strengths on different problems, influenced by their respective training processes. Given the resource-intensive nature of training specialized LLMs, this insight suggests the value of employing an ensemble of multiple agents, each potentially leveraging a different LLM's strengths, to collaboratively address a wide range of dialogue problems.
- Limitations of LLM-based Simulators and Reward Models: While the reliability of the LLM-based user simulators and reward models was validated, they still inherently carry potential biases or limitations of the underlying LLMs, which could affect the fidelity of the self-play environment and the RL feedback.
- Cost of LLM Calls: Despite PPDPP being more token-efficient than some baselines, the reliance on multiple LLMs (assistant, user, reward model) during self-play simulation can still incur significant computational and API costs, especially for proprietary LLMs.
- Scope of Proactive Dialogues: The current work focuses on specific types of proactive dialogues (negotiation, emotional support, tutoring). Generalizability to other complex proactive scenarios (e.g., medical diagnosis, financial advice) needs further exploration.
7.3. Personal Insights & Critique
This paper presents a highly practical and innovative approach to enhancing the proactivity of LLM-powered dialogue agents.
- Decoupling is Key: The plug-and-play paradigm of decoupling policy planning from response generation is a significant architectural improvement. For commercial applications, where direct fine-tuning of proprietary LLMs is impossible or prohibitively expensive, having a small, tunable policy plug-in is an elegant solution. It allows domain-specific strategic intelligence to be injected and learned without compromising the general linguistic capabilities of the powerful backbone LLM. This modularity greatly simplifies development and deployment cycles for specialized dialogue agents.
- Interactive RL from AI Feedback: The LLM-based self-play simulation combined with goal-oriented AI feedback for reinforcement learning is a powerful learning mechanism. It allows the system to learn from dynamic, multi-turn interactions, which is essential for proactive dialogues where long-term goals matter. The use of sampling to stabilize the reward signal from the reward model is a clever detail that addresses the inherent stochasticity and subjectivity of LLM outputs. This approach effectively bridges the gap between natural language feedback and structured scalar rewards for RL.
- Robust Evaluation Framework: The emphasis on interactive evaluation using LLM-based user simulators and reward models is crucial. It moves beyond superficial turn-level metrics to directly assess the dialogue-level success and efficiency of policy planning, which is the true measure of a proactive system. This framework itself could be widely adopted for evaluating other goal-oriented dialogue systems.
- Trade-offs and Nuances: The observation of trade-offs, such as SL% vs. SR in negotiation, is a valuable insight. It highlights that optimizing for one aspect of a goal (e.g., maximizing buyer benefit) might inherently impact another (e.g., deal success rate). This complex behavior underscores the policy planner's ability to learn nuanced strategies rather than simplistic ones.
- Potential Issues / Areas for Improvement:
  - LLM Hallucination/Bias in Simulators: While validated, the LLM-based user simulators and reward models are still susceptible to hallucinations or biases present in their training data. If the user simulator does not accurately reflect human behavior, or the reward model has a flawed understanding of "success", the learned policy could be optimized for an imperfect simulation rather than real-world effectiveness. Further research into robust verification of these LLM components (perhaps through human-in-the-loop validation during RL) could strengthen the approach.
  - Action Space Design: The effectiveness heavily relies on a well-defined set of candidate dialogue actions and their mapping to natural language instructions. Designing this action space for complex domains can be challenging and might require significant domain expertise (a hypothetical example of such a mapping is sketched at the end of this subsection).
  - Domain Adaptation Costs: While the plug-in is described as "transferable" across applications, it still requires re-training (SFT + RL) for each new application. The term "plug-and-play" implies a level of immediate usability, but it is more accurate to call it a "plug-in-and-train" approach for new domains. However, this cost is still significantly less than fine-tuning the entire backbone LLM.
  - Interpretability of Learned Policies: While the policy plug-in is smaller, understanding why it chooses certain actions, especially after RL training, can still be a challenge. Improving the interpretability of these learned policies could build greater trust and facilitate debugging.

Overall, PPDPP offers a compelling solution for building more intelligent, goal-oriented conversational agents using LLMs. Its modular design and principled learning framework lay a strong foundation for future advancements in proactive dialogue systems.
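As referenced in the Action Space Design point above, the snippet below shows what a hand-designed action-to-instruction mapping might look like for the emotional support domain. The strategy labels loosely follow common ESConv-style strategies, and the instruction wording is invented for illustration; PPDPP's actual action sets and prompts are defined in the original paper.

```python
# Hypothetical action-to-instruction mapping for an emotional-support planner.
# Each new domain needs its own action set and instruction wording, which is
# where the domain expertise discussed above comes in.
EMOTIONAL_SUPPORT_ACTIONS = {
    "question": "Ask an open question to explore what the user is going through.",
    "reflection_of_feelings": "Name and reflect the emotion the user has expressed.",
    "affirmation_and_reassurance": "Acknowledge the user's effort and reassure them.",
    "providing_suggestions": "Suggest one concrete coping step the user could take.",
}

def instruction_for(action: str) -> str:
    """Resolve a planned dialogue action to the instruction given to the backbone LLM."""
    return EMOTIONAL_SUPPORT_ACTIONS[action]
```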