Plug-and-Play Policy Planner for Large Language Model Powered Dialogue Agents
TL;DR Summary
PPDPP introduces a tunable dialogue policy planner enhancing LLMs' proactive dialogue capabilities via supervised fine-tuning and reinforcement learning, achieving superior generalization and performance across diverse applications.
Abstract
Proactive dialogues serve as a practical yet challenging dialogue problem in the era of large language models (LLMs), where the dialogue policy planning is the key to improving the proactivity of LLMs. Most existing studies enable the dialogue policy planning of LLMs using various prompting schemes or iteratively enhance this capability in handling the given case with verbal AI feedback. However, these approaches are either bounded by the policy planning capability of the frozen LLMs or hard to be transferred to new cases. In this work, we introduce a new dialogue policy planning paradigm to strategize LLMs for proactive dialogue problems with a tunable language model plug-in as a plug-and-play dialogue policy planner, named PPDPP. Specifically, we develop a novel training framework to facilitate supervised fine-tuning over available human-annotated data as well as reinforcement learning from goal-oriented AI feedback with dynamic interaction data collected by the LLM-based self-play simulation. In this manner, the LLM-powered dialogue agent can not only be generalized to different cases after the training, but also be applicable to different applications by just substituting the learned plug-in. In addition, we propose to evaluate the policy planning capability of dialogue systems under the interactive setting. Experimental results demonstrate that PPDPP consistently and substantially outperforms existing approaches on three different proactive dialogue applications, including negotiation, emotional support, and tutoring dialogues.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Plug-and-Play Policy Planner for Large Language Model Powered Dialogue Agents
1.2. Authors
Yang Deng, Wenxuan Zhang, Wai Lam, See-Kiong Ng, Tat-Seng Chua
The authors appear to be affiliated with the National University of Singapore and The Chinese University of Hong Kong, indicating a strong academic background in natural language processing and artificial intelligence research.
1.3. Journal/Conference
The paper was published on arXiv, which is a preprint server for electronic preprints of scientific papers. While not a peer-reviewed journal or conference proceeding itself, arXiv serves as a widely recognized platform for rapid dissemination of research findings in fields like AI, often preceding or accompanying formal publication in top-tier venues. Its reputation is high for sharing cutting-edge research quickly.
1.4. Publication Year
2023
1.5. Abstract
This paper addresses the challenge of proactive dialogues in Large Language Model (LLM)-powered dialogue agents, where dialogue policy planning is crucial for improving LLM proactivity. Existing methods, primarily prompting schemes or verbal AI feedback, are limited by the capabilities of frozen LLMs or lack transferability to new cases. The authors introduce a novel paradigm named Plug-and-Play Dialogue Policy Planner (PPDPP), which utilizes a tunable language model plug-in to strategize LLMs for proactive dialogue problems. PPDPP employs a unique training framework combining supervised fine-tuning (SFT) on human-annotated data and reinforcement learning (RL) from goal-oriented AI feedback generated through LLM-based self-play simulation. This approach allows the LLM-powered agent to generalize to different cases and be adaptable across various applications by simply substituting the learned plug-in. Additionally, the paper proposes an interactive evaluation method for policy planning capability. Experimental results demonstrate that PPDPP significantly outperforms existing methods across three proactive dialogue applications: negotiation, emotional support, and tutoring.
1.6. Original Source Link
Official Source: https://arxiv.org/abs/2311.00262 PDF Link: https://arxiv.org/pdf/2311.00262v2.pdf Publication Status: Preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the limitation of Large Language Models (LLMs) in proactive dialogues. While LLMs like ChatGPT excel at understanding context and generating responses, they are fundamentally designed to be passive, following user instructions. This inherent passivity makes them less effective in proactive dialogue problems, where an agent needs to strategically take the initiative to steer a conversation towards a specific goal. Examples of such problems include negotiation, emotional support, and tutoring, all of which require dynamic, goal-oriented interaction.
This problem is important because many real-world applications of dialogue agents require more than just reactive responses; they demand agents that can pursue objectives, guide users, and manage conversational flow strategically. Existing solutions, primarily prompting schemes or iterative refinement with verbal AI feedback, face significant challenges:
- Bounded by frozen LLMs: Their effectiveness is limited by the inherent policy planning capability of the underlying frozen LLMs (i.e., LLMs whose parameters are not updated during the policy learning process).
- Lack of transferability: Methods involving iterative refinement often require multiple rounds of self-play dialogue simulations for each new case, making them impractical for broad application.
- Inadequate evaluation: Traditional turn-level response quality measurements fail to assess the policy planning capability effectively, which is about achieving long-term conversational goals.

The paper's innovative idea is to introduce a plug-and-play mechanism for dialogue policy planning, effectively decoupling strategic decision-making from the LLM's core generation capabilities. By using a separate, tunable language model plug-in, the system can learn and adapt its strategic behavior without modifying the large, expensive, and often proprietary backbone LLM.
2.2. Main Contributions / Findings
The paper makes several primary contributions to address the identified challenges:
- Introduction of PPDPP: They propose the Plug-and-Play Dialogue Policy Planner (PPDPP), a novel paradigm that strategizes LLM-powered dialogue agents using a tunable language model plug-in. This plug-in acts as the policy agent, predicting dialogue strategies independently of the main LLM.
- Novel Training Framework: PPDPP is trained in two phases: Supervised Fine-Tuning (SFT) on available human-annotated corpora to initialize the plug-in, followed by Reinforcement Learning (RL) from goal-oriented AI feedback collected via LLM-based self-play simulations. The simulation involves two LLMs (assistant and user) with competing goals, plus a third LLM acting as a reward model that provides goal-oriented verbal feedback, which is then converted into scalar rewards for RL.
- Enhanced Generalizability and Applicability: This approach allows the LLM-powered dialogue agent to generalize to different cases after training and, crucially, to be applied to different applications simply by substituting the learned plug-in, without affecting the LLM's context understanding and response generation capabilities.
- Interactive Evaluation Protocol: The paper proposes an LLM-based interactive evaluation approach using user simulators and reward models to assess success rate and the average number of turns for achieving designated goals in a dynamic, multi-turn setting, overcoming the limitations of static turn-level metrics.
- Superior Performance across Applications: Experimental results demonstrate that PPDPP consistently and substantially outperforms existing approaches on three diverse proactive dialogue applications: negotiation, emotional support, and tutoring dialogues. It shows improvements in both efficiency (fewer turns) and effectiveness (higher success rates/benefits).
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully grasp the methodology and contributions of this paper, a beginner should understand several key concepts:
- Large Language Models (LLMs): These are advanced artificial intelligence models, like ChatGPT, Vicuna, and LLaMA2-Chat, trained on vast amounts of text data. They excel at understanding natural language, generating human-like text, answering questions, summarizing information, and performing various language-related tasks. A key characteristic highlighted in this paper is their passive nature, meaning they typically respond to explicit instructions rather than proactively steering conversations towards specific goals.
- Dialogue Agents: Also known as chatbots or conversational AI, these are computer programs designed to engage in natural language conversations with humans. They can range from simple rule-based systems to highly sophisticated AI models. This paper focuses on LLM-powered dialogue agents, which leverage the advanced capabilities of LLMs for their conversational abilities.
- Proactive Dialogue Systems: Unlike reactive dialogue systems that merely respond to user inputs, proactive systems take the initiative to guide the conversation towards an anticipated goal. This involves strategic decision-making about what actions to take next to achieve a specific objective, such as closing a negotiation deal, providing emotional support, or teaching a concept. This paper defines proactive dialogues as those where the agent needs to strategically take the initiative to steer the conversation towards an anticipated goal.
- Dialogue Policy Planning: This refers to the process by which a dialogue agent decides what to say or do next in a conversation to achieve its goals. It is the strategic component of a dialogue system, determining the sequence of actions or dialogue acts (e.g., "ask a question," "propose a price," "offer reassurance") that will lead to a successful outcome. In proactive dialogues, effective policy planning is paramount.
- Supervised Fine-Tuning (SFT): A common technique in machine learning, particularly with pre-trained language models. Fine-tuning involves taking a model that has already been trained on a large, general dataset (like an LLM) and further training it on a smaller, task-specific dataset. Supervised means this training uses labeled examples (input-output pairs), where the model learns to map inputs to desired outputs. In this paper, SFT is used to initialize the PPDPP plug-in with human-annotated dialogue strategies.
- Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent receives feedback (rewards or penalties) from the environment for its actions and learns through trial and error. Key components include:
  - Agent: The decision-maker (here, the PPDPP plug-in).
  - Environment: The setting in which the agent operates (here, the self-play simulation involving LLM-based users).
  - State: The current situation or observation of the environment (e.g., the dialogue history).
  - Action: A decision made by the agent (e.g., a dialogue strategy).
  - Reward: A scalar value indicating how good an action was in a given state, often delayed.
  - Policy (π): A mapping from states to actions, defining the agent's behavior. The goal of RL is to learn an optimal policy.
- Markov Decision Process (MDP): A mathematical framework used to model decision-making in environments where outcomes are partly random and partly under the control of a decision-maker. It is a common formalization for problems solved with RL. An MDP is defined by states, actions, transition probabilities between states, and rewards. The paper explicitly formulates the dialogue process as an MDP.
- Reinforcement Learning from AI Feedback (RLAIF): An extension of Reinforcement Learning from Human Feedback (RLHF). Instead of relying on human annotators to provide preference labels or feedback, RLAIF uses an AI model (often a powerful LLM) as the reward model or critic to evaluate the agent's behavior and generate feedback. This feedback is then used to train the agent. This paper utilizes a third LLM (the reward model) to provide goal-oriented verbal feedback, which is mapped to scalar rewards for RL.
- Self-play Simulation: A technique where an AI agent plays against itself or other AI agents in a simulated environment to generate data and learn. In this paper, two LLMs (one acting as the assistant, another as the user) engage in a conversation, simulating real-world interactions. This generates rich, dynamic interaction data for RL training, allowing the agent to explore various scenarios and learn optimal policies.
- Prompting Schemes: Methods of providing instructions or context to LLMs to guide their output.
  - Zero-shot prompting: Giving an LLM a task without any examples.
  - Few-shot prompting: Providing a few examples of input-output pairs along with the task instruction to guide the LLM.
  - Chain-of-Thought (CoT) prompting: Guiding an LLM to generate intermediate reasoning steps before arriving at a final answer, often improving performance on complex tasks. The paper refers to ProCoT as using CoT.
- RoBERTa: A Transformer-based language model developed by Facebook AI. It is an optimized version of BERT (Bidirectional Encoder Representations from Transformers), known for its robust pre-training approach. RoBERTa is a smaller language model compared to the very large generative LLMs like ChatGPT, making it suitable as a tunable plug-in for specific tasks without the computational overhead of fine-tuning an entire large LLM.
3.2. Previous Works
The paper contextualizes its work by referencing various existing approaches to dialogue policy planning, broadly categorized into pre-LLM and LLM-era methods.
- Pre-LLM Era (Corpus-Based Learning):
  - Mechanism: These methods primarily relied on corpus-based learning from static human-annotated dialogues to predict dialogue strategies (e.g., Joshi et al., 2021; Cheng et al., 2022; Wang et al., 2023c).
  - Limitations:
    - Heavy reliance on static human-annotated dialogues.
    - Failure to optimize long-term conversational goals.
    - Lack of adaptability to new or unseen scenarios beyond the training corpus.
    - Costly and unrealistic to fine-tune entire dialogue systems for every new application.
- LLM Era (Prompt-Based Policy Planning): With the advent of LLMs, research shifted towards leveraging their inherent capabilities:
  - Self-Thinking/Strategy Planning per Turn: Some works prompt a frozen actor LLM to think about its strategy for each turn (Zhang et al., 2023a; Deng et al., 2023b; Wang et al., 2023a).
    - MI-Prompt (Chen et al., 2023): Investigates mixed-initiative strategy-based prompting in proactive dialogues.
    - Ask-an-Expert (AnE) (Zhang et al., 2023a): Prompts another LLM as a strategic expert to reason about the next dialogue strategy.
    - ProCoT (Deng et al., 2023b): Improves proactive dialogue by having the LLM generate a chain-of-thought descriptive analysis for strategy planning.
  - Iterative Refinement with AI Feedback: Other approaches generate AI feedback given the whole dialogue history to iteratively improve policy planning for a specific case (Fu et al., 2023; Yu et al., 2023).
    - ICL-AIF (Fu et al., 2023): Uses in-context learning with AI feedback from self-play simulation to refine dialogue strategies at the dialogue level.
  - Limitations (common to LLM-era prompting):
    - Limited by frozen LLMs: Performance is still bounded by the policy planning capability of the underlying frozen LLMs.
    - Lack of transferability: Iterative refinement methods (Fu et al., 2023; Yu et al., 2023) are case-exclusive, meaning they require re-simulation for every new case, making them impractical for real-world application.
    - Evaluation deficiencies: Evaluation still often relies on turn-level response quality metrics rather than comprehensive dialogue-level goal achievement.
- Learnable Plug-ins for LLMs:
  - This is a recent trend leveraging smaller models or APIs to enhance LLMs for specific tasks without full fine-tuning.
  - External APIs/Models: Using APIs (Schick et al., 2023), vision models (Wu et al., 2023), or functional models (Shen et al., 2023). These are often fixed and do not learn from feedback.
  - Small Language Model Plug-ins: Using smaller language models for tasks like text classification (Xu et al., 2023), summarization (Li et al., 2023), QA (Yao et al., 2023), or specific capabilities like mental state reasoning (Sclar et al., 2023). These can be fine-tuned or trained with RL.
- Reinforcement Learning from AI Feedback (RLAIF):
  - RLAIF (Bai et al., 2022) was proposed to train models without human preference labels.
  - Existing RLAIF methods often leverage natural language feedback from LLMs to self-refine prompts (Shinn et al., 2023; Fu et al., 2023; Madaan et al., 2023; Hao et al., 2023), rather than deriving scalar rewards for model training.
3.3. Technological Evolution
The field of dialogue policy planning has evolved significantly:
- Early Rule-Based Systems: Simple, deterministic systems with hand-crafted rules for responses. Limited flexibility.
- Statistical/ML-Based Systems (Pre-LLM): Shifted to corpus-based learning, where models learned policies from human-annotated dialogues using statistical methods or traditional machine learning. While more flexible, they were heavily reliant on domain-specific data and struggled with long-term optimization. Examples include Joshi et al., 2021 and Cheng et al., 2022.
- Emergence of Large Language Models (LLMs): LLMs brought unprecedented context understanding and response generation capabilities. Initial attempts involved prompting LLMs to perform policy planning directly, often on a turn-by-turn basis. This showed promise but was constrained by the frozen nature of LLMs and the difficulty of transferring learned policies.
- Refinement with AI Feedback: The next step was to use LLMs not just for generation but also for feedback (e.g., ICL-AIF, Fu et al., 2023), attempting to iteratively improve the policy. However, this often remained case-specific.
- This Paper's Position (Plug-and-Play Tunable Policy Planner): The current paper represents a significant step by introducing a tunable language model plug-in specifically for dialogue policy planning. This decouples the strategic component from the core LLM, allowing for targeted training with SFT and RL from goal-oriented AI feedback in self-play simulations. This innovation aims to achieve generalizability and transferability across cases and applications, addressing key limitations of prior LLM-based methods.
3.4. Differentiation Analysis
Compared to the main methods in related work, PPDPP offers several core differences and innovations:
- Tunable Plug-in vs. Frozen LLM:
  - Prior prompt-based methods (e.g., Standard, Proactive, ProCoT, AnE): Rely on prompting a frozen LLM to conduct policy planning. Their policy planning capability is inherently bounded by the pre-trained knowledge and reasoning abilities of the LLM, which cannot be explicitly improved for specific dialogue goals.
  - PPDPP: Introduces a tunable language model plug-in (a smaller model like RoBERTa) dedicated to policy planning. This plug-in's parameters are learnable and can be iteratively optimized through SFT and RL, directly enhancing its policy planning capability.
- Generalizability and Transferability:
  - Iterative refinement methods (e.g., ICL-AIF): Enhance policy planning by iteratively refining strategies through self-play simulations and AI feedback for a given case. They therefore lack transferability, since multiple rounds of simulation are needed for every new case.
  - PPDPP: After training, the learned plug-in can be generalized to new cases without requiring further iterative simulation. Furthermore, the plug-in can be swapped out to adapt the LLM to different applications by simply substituting the learned plug-in for that domain, without retraining the large backbone LLM. This provides a level of flexibility and efficiency unmatched by case-exclusive refinement methods.
- Goal-Oriented RL from Scalar AI Feedback:
  - Most existing RLAIF approaches: Often directly use natural language feedback from LLMs to self-refine prompts (Shinn et al., 2023; Fu et al., 2023), which is less direct for gradient-based optimization.
  - PPDPP: Employs a goal-oriented AI feedback mechanism where a reward-model LLM provides verbal feedback that is explicitly transformed into scalar rewards. This allows direct application of reinforcement learning algorithms (such as policy gradient) to optimize the plug-in's policy, capturing long-term goal-oriented rewards from dynamic multi-turn interactions.
- Interactive Evaluation for Policy Planning:
  - Traditional evaluation: Typically focuses on turn-level response quality against fixed reference responses, which fails to assess strategic policy planning capability in multi-turn, goal-oriented dialogues.
  - PPDPP: Proposes and uses an LLM-based interactive evaluation approach that harnesses LLM-based user simulators and reward models to assess dialogue-level metrics such as success rate and average turn for goal achievement. This is a more appropriate and automated way to measure policy planning effectiveness.

In summary, PPDPP innovates by decoupling policy planning into a dedicated, learnable component, enabling systematic improvement through a principled RL framework with AI feedback, and ensuring better generalizability and modularity than prior LLM-based dialogue policy methods.
4. Methodology
4.1. Principles
The core idea behind PPDPP is to decouple the dialogue policy planning component from the Large Language Model's (LLM) response generation capabilities. Instead of relying on a frozen LLM to implicitly plan strategies through prompting, PPDPP employs a separate, smaller, and tunable language model as a plug-in specifically for policy planning. This plug-in (PPDPP) predicts the optimal dialogue action (strategy) at each turn. The rationale is that by offloading the strategic decision-making to a specialized, learnable component, the system can achieve better generalizability and transferability across different cases and applications. The theoretical basis for learning this optimal policy comes from Reinforcement Learning (RL), where the plug-in learns to maximize cumulative rewards in a simulated dialogue environment. The intuition is that dialogue policy planning is a distinct task from language generation, and by giving it its own learning capacity, it can be optimized more effectively for proactive dialogue problems.
4.2. Core Methodology In-depth (Layer by Layer)
The methodology of PPDPP involves several interconnected components: an MDP environment formulation, the Plug-and-Play Dialogue Policy Planner itself, a self-play interaction mechanism, an LLM as a Reward Model, and a Reinforcement Learning framework.
4.2.1. MDP Environment Formulation
The paper formulates the entire dialogue process as a Markov Decision Process (MDP). This is a standard way to model sequential decision-making problems, making them amenable to Reinforcement Learning.
At each turn $t$ of the dialogue, the dialogue system observes the current dialogue history, which represents the state $s_t$. Based on this state, it selects a dialogue action $a_t$ from a predefined set of candidate strategies $\mathcal{A}$. The user (or user simulator) then responds to this action. This interaction continues until a conversational goal is achieved or a maximum number of turns is reached.
The objective of this MDP is to learn an optimal policy $\pi^*$ that maximizes the expected cumulative rewards over the entire dialogue episode. This is mathematically expressed as:
$
\pi^* = \arg\max_{\pi \in \Pi} \mathbb{E}\left[ \sum_{t=0}^{T} r(s_t, a_t) \right]
$
Where:
- $\pi^*$: The optimal dialogue policy that the system aims to learn.
- $\pi$: A specific policy, i.e., a function that maps observed states to actions.
- $\Pi$: The set of all possible policies.
- $t$: The current turn (timestep) in the dialogue.
- $T$: The maximum number of turns allowed in a dialogue episode.
- $r(s_t, a_t)$: The immediate reward received at turn $t$ for taking action $a_t$ in state $s_t$; also written as $r_t$.
- $s_t$: The state of the dialogue at turn $t$, which encapsulates the dialogue history.
- $a_t$: The action (dialogue strategy) chosen by the dialogue system at turn $t$.
4.2.2. Plug-and-Play Dialogue Policy Planner (PPDPP)
PPDPP is the central component responsible for dialogue policy planning. It is implemented as a smaller, tunable pre-trained language model (e.g., RoBERTa) that acts as a plug-in to the main LLM-powered dialogue agent. Its role is to predict the next dialogue action $a_t$.
The training of PPDPP involves two phases:
4.2.2.1. Supervised Fine-Tuning (SFT)
Before interactive online learning with Reinforcement Learning, the PPDPP plug-in is initialized through Supervised Fine-Tuning (SFT). This phase uses an available human-annotated dialogue corpus $\mathcal{D}$. The goal of SFT is to teach PPDPP to mimic human-expert dialogue strategy choices.
Given the dialogue history up to the previous turn (which forms the current state $s_t$), PPDPP predicts the action $a_t$ for the current turn. The dialogue history is represented as a sequence of system and user utterances: $\{u_1^{\mathrm{sys}}, u_1^{\mathrm{usr}}, \ldots, u_{t-1}^{\mathrm{sys}}, u_{t-1}^{\mathrm{usr}}\}$.
The prediction process is:
$
a_t = \mathrm{PPDPP}(u_1^{\mathrm{sys}}, u_1^{\mathrm{usr}}, \ldots, u_{t-1}^{\mathrm{sys}}, u_{t-1}^{\mathrm{usr}})
$
Where:
- $a_t$: The predicted dialogue action (strategy) for the current turn $t$.
- $\mathrm{PPDPP}$: The Plug-and-Play Dialogue Policy Planner model, which takes the dialogue history as input.
- $u_t^{\mathrm{sys}}$: The utterance generated by the system at turn $t$.
- $u_t^{\mathrm{usr}}$: The utterance generated by the user at turn $t$.
The SFT phase optimizes PPDPP by minimizing the cross-entropy loss between its predicted action distributions and the human-labeled actions in the annotated dialogues:
$
\mathcal{L}_c = -\frac{1}{|\mathcal{D}|} \sum_{d \in \mathcal{D}} \frac{1}{T_d} \sum_{t=1}^{T_d} a_t \log y_t
$
Where:
- $\mathcal{L}_c$: The cross-entropy loss function.
- $|\mathcal{D}|$: The total number of dialogues in the human-annotated corpus.
- $d$: An individual dialogue within the corpus.
- $T_d$: The number of turns in dialogue $d$.
- $y_t$: The probability distribution over possible actions predicted by PPDPP for turn $t$.
- $a_t$: Here, the one-hot encoded vector representing the human-labeled (ground-truth) action for turn $t$.
This SFT initialization, while potentially sub-optimal on its own, is crucial for accelerating the convergence of the subsequent interactive online training (RL phase).
4.2.3. Self-play Interaction
After SFT, the system enters the interactive online learning phase, which relies on self-play conversations to simulate dynamic user-assistant interactions. This simulation involves using two LLMs configured to act as distinct roles: an assistant and a user.
Each of these LLMs is given a role description and specific conversational goals through prompts. For instance:
- Negotiation: A buyer LLM aims for a lower price, while a seller LLM aims for a higher price.
- Emotional Support: A patient LLM has a specific emotional problem, and a therapist LLM aims to reduce distress.
- Tutoring: A student LLM has a knowledge gap, and a teacher LLM aims to impart knowledge.

The dialogue proceeds as follows (a sketch of this loop is given after the state definitions below):
- PPDPP Predicts Action: When it is the assistant's turn to speak, the PPDPP plug-in (trained in the SFT phase) first predicts the next action $a_t$ based on the interaction history so far.
- Action to Natural Language: The predicted action is mapped to a predefined natural language instruction using a mapping function $\mathcal{M}_a$. This instruction serves as an explicit prompt for the assistant LLM.
- Assistant Generates Response: The assistant player (an LLM, e.g., ChatGPT) then generates its strategic response $u_t^{\mathrm{sys}}$ based on the current dialogue history and the natural language action instruction:
$
u_t^{\mathrm{sys}} = \mathrm{LLM}_{\mathrm{sys}}(p_{\mathrm{sys}}; \mathcal{M}_a(a_t); u_1^{\mathrm{sys}}, u_1^{\mathrm{usr}}, \ldots, u_{t-1}^{\mathrm{sys}}, u_{t-1}^{\mathrm{usr}})
$
  Where:
  - $u_t^{\mathrm{sys}}$: The utterance generated by the system (assistant LLM) at turn $t$.
  - $\mathrm{LLM}_{\mathrm{sys}}$: The Large Language Model acting as the system (assistant).
  - $p_{\mathrm{sys}}$: The initial prompt defining the assistant's role and goal.
  - $\mathcal{M}_a(a_t)$: The natural language instruction derived from the PPDPP's predicted action $a_t$.
  - The remaining terms are the dialogue history up to turn $t-1$.
- User Generates Response: Subsequently, the user player (another LLM) generates its response $u_t^{\mathrm{usr}}$ based on the updated dialogue history:
$
u_t^{\mathrm{usr}} = \mathrm{LLM}_{\mathrm{usr}}(p_{\mathrm{usr}}; u_1^{\mathrm{sys}}, u_1^{\mathrm{usr}}, \ldots, u_{t-1}^{\mathrm{sys}}, u_{t-1}^{\mathrm{usr}}, u_t^{\mathrm{sys}})
$
  Where:
  - $u_t^{\mathrm{usr}}$: The utterance generated by the user (user LLM) at turn $t$.
  - $\mathrm{LLM}_{\mathrm{usr}}$: The Large Language Model acting as the user.
  - $p_{\mathrm{usr}}$: The initial prompt defining the user's role and goal.
  - The remaining terms are the dialogue history including the system's latest utterance $u_t^{\mathrm{sys}}$.

This process continues until a terminal state is reached. Three types of states are defined:
- ON-GOING: The conversation is still active, and the goal has not been met.
- GOAL-COMPLETED: The designated conversational goal has been achieved (e.g., a deal reached, emotional problem solved, exercise mastered).
- GOAL-FAILED: The conversational goal is not met within the maximum number of turns.
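The sketch below illustrates one self-play episode under the assumptions that `policy` wraps the PPDPP plug-in, `chat` wraps any chat LLM, and `action_instructions` implements the mapping $\mathcal{M}_a$; all names and the prompt handling are illustrative, not the paper's implementation.

```python
def run_selfplay_episode(policy, chat, sys_prompt, usr_prompt,
                         action_instructions, max_turns=8):
    """One simulated dialogue between an assistant LLM and a user LLM,
    with PPDPP choosing the assistant's strategy each turn (illustrative sketch).

    policy: callable mapping dialogue history -> strategy index
    chat: callable(prompt, history) -> utterance, wrapping any chat LLM
    action_instructions: M_a, mapping strategy index -> natural-language instruction
    """
    history = []                      # list of (speaker, utterance) pairs
    trajectory = []                   # (state, action) pairs kept for RL
    for t in range(max_turns):
        action = policy(history)                                 # PPDPP predicts a_t
        instruction = action_instructions[action]                # M_a(a_t)
        sys_utt = chat(f"{sys_prompt}\n{instruction}", history)  # assistant LLM
        history.append(("sys", sys_utt))
        usr_utt = chat(usr_prompt, history)                      # user LLM replies
        history.append(("usr", usr_utt))
        trajectory.append((list(history), action))
        # The reward-model LLM (next section) decides GOAL-COMPLETED / ON-GOING here.
    return history, trajectory
```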
4.2.4. LLM as Reward Model ($\mathrm{LLM_{rwd}}$)
A third LLM is designated as the reward model ($\mathrm{LLM_{rwd}}$). This model has two critical functions:
- Determine Goal Completion: It assesses whether the conversational goal has been achieved at any point during the dialogue.
- Evaluate Policy Outcome: It provides feedback that is transformed into scalar rewards for the Reinforcement Learning process.

To achieve this, the reward model is prompted with a multi-choice question designed to elicit goal-oriented AI feedback. This verbal feedback is then converted into scalar rewards using a predefined mapping $\mathcal{M}_r$.
To mitigate the subjectivity and variance often present in LLM-generated outputs, the common practice of sampling is employed. The reward model generates goal-oriented AI feedback $l$ times, and these verbal feedbacks are converted into scalar values and then averaged to produce a robust scalar value $v_t$:
$
v_t = \frac{1}{l} \sum_{i=1}^{l} \mathcal{M}_r\big(\mathrm{LLM}_{\mathrm{rwd}}(p_{\mathrm{rwd}}; u_1^{\mathrm{sys}}, u_1^{\mathrm{usr}}, \ldots, u_{t-1}^{\mathrm{sys}}, u_{t-1}^{\mathrm{usr}}, u_t^{\mathrm{sys}}, u_t^{\mathrm{usr}}; \tau)\big)
$
Where:
- $v_t$: The averaged scalar value representing the reward at turn $t$.
- $l$: The number of times the reward model's output is sampled.
- $i$: The index of each sample.
- $\mathcal{M}_r$: The mapping function that transforms the verbal feedback from the reward model into a scalar value.
- $\mathrm{LLM}_{\mathrm{rwd}}$: The Large Language Model acting as the reward model.
- $p_{\mathrm{rwd}}$: The prompt given to the reward model for evaluation.
- $\tau$: The temperature parameter used for sampling from the LLM, controlling the randomness of its output. Higher temperatures lead to more diverse (and potentially more variable) outputs.
- The remaining terms are the complete dialogue history up to the current turn $t$.

This value $v_t$ is first used to determine the state of the self-play interaction. If $v_t$ is greater than or equal to a predefined threshold, the state is declared GOAL-COMPLETED.
The reward for Reinforcement Learning is assigned as follows:
- If the conversation reaches a terminal state (GOAL-COMPLETED or GOAL-FAILED), then $r_t = v_t$, so the final assessment directly contributes to the reward.
- If the conversation is ON-GOING (not yet terminal), a small negative reward is assigned. This penalizes lengthy conversations and implicitly promotes efficient goal completion.
4.2.5. Reinforcement Learning
Once a dialogue episode reaches a terminal state (either GOAL-COMPLETED or GOAL-FAILED), goal-oriented rewards are obtained. These rewards are then used to update the PPDPP plug-in's policy through Reinforcement Learning. The policy agent is denoted as $\pi_\theta(a_t \mid s_t)$, which gives the probability of taking action $a_t$ given state $s_t$.
The paper uses the vanilla policy gradient method (Sutton et al., 1999) to optimize the policy agent. This method updates the model's parameters in the direction that increases the probability of actions that lead to higher rewards. The update rule is:
$
\theta \gets \theta - \alpha \nabla \log \pi_\theta(a_t \mid s_t) R_t
$
Where:
- $\theta$: The parameters of the policy agent (PPDPP).
- $\alpha$: The learning rate, a hyper-parameter that controls the step size of the parameter updates.
- $\nabla \log \pi_\theta(a_t \mid s_t)$: The gradient of the log-probability of taking action $a_t$ in state $s_t$, with respect to the policy parameters. It indicates how to change the parameters to make action $a_t$ more or less likely.
- $R_t$: The cumulative discounted return from turn $t$ onwards, i.e., the sum of future rewards discounted by a factor $\gamma$.

The cumulative discounted return is calculated as:
$
R_t = \sum_{t'=t}^{T} \gamma^{T-t'} r_{t'}
$
Where:
- $t'$: An index for summing over future turns.
- $\gamma$: The discount factor, a value between 0 and 1 that determines the present value of future rewards. A higher $\gamma$ means future rewards are considered more important.
- $r_{t'}$: The immediate reward received at turn $t'$.
4.2.6. Inference Phase
During inference (when the system is deployed and interacting with real users), the tuned PPDPP directly provides the action prompt based on the dialogue history. This action prompt guides the dialogue LLM (the assistant LLM) to generate the next response. Crucially, the reward LLM is not used during inference. This design ensures that the LLM-powered dialogue agent, equipped with the tuned PPDPP, can strategically guide conversations without requiring multiple, costly iterations of simulation for every new case in real-time. This is a key aspect of the "plug-and-play" and generalizability claims.
5. Experimental Setup
5.1. Datasets
The authors evaluate the proposed framework across three distinct proactive dialogue applications, each with its own dataset:
- CraigslistBargain (He et al., 2018):
  - Domain: Bargain negotiation dialogues.
  - Task: Buyer and seller negotiate the price of an item.
  - Statistics:
    - Number of Cases (train/dev/test): 3,090 / 188 / 188
    - Dialogue Acts: 11 negotiation strategies are used in the experiments (the corpus also annotates 4 terminal acts, which are excluded).
  - Characteristics: Each case includes an item category, an item description, a buyer target price, and a seller target price. These serve as contextual instruction information. The goal is to reach a deal that is as favorable as possible to the buyer.
  - Example Data Sample (from Appendix F):
    - Item Name: Furniture
    - Item Description: Macybed Plush Queen Mattress MacyBed 8.5" Plush Pillowtop Queen Mattress in excellent condition. Bought in December of 2013, 3.5 years old. Only had one owner in one household (one person sleeping on it, minimal ware). No stains or discoloring. Been covered with mattress cover since purchase.
    - Listed Price (Seller Target Price): 150
    - Buyer Target Price: 135
- ESConv (Liu et al., 2021):
  - Domain: Emotional support conversations.
  - Task: A therapist assists a patient in reducing emotional distress and working through challenges.
  - Statistics:
    - Number of Cases (train/dev/test): 1,040 / 130 / 130
    - Number of Strategies: 8 (types of support strategies)
  - Characteristics: Each case is accompanied by a problem type, an emotion type, and a situation description provided by the patient. The goal is to solve the patient's emotional issue.
  - Example Data Sample (from Appendix F):
    - Emotion Type: Fear
    - Problem Type: Job Crisis
    - Situation: I think I will be losing my job soon. I just read an email taking about the need for us to cut cost and also how we have not got any support from the government.
- CIMA (Stasaski et al., 2020):
  - Domain: Tutoring dialogues.
  - Task: A teacher tutors a student on translating an English prepositional sentence into Italian.
  - Statistics:
    - Number of Cases (train/dev/test): 909 / 113 / 113 (random 8:1:1 split)
    - Number of Pedagogical Strategies: 5
  - Characteristics: Each case involves a specific exercise (English sentence) and the student's individual problem or knowledge state related to that exercise. The goal is for the student to master the exercise.

These datasets were chosen because they represent diverse proactive dialogue problems with clear, quantifiable goals (negotiation outcome, emotional problem resolution, learning mastery), making them suitable for validating a policy planning framework. They also provide human-annotated data for SFT and case backgrounds for RL simulation.
5.2. Evaluation Metrics
The paper emphasizes dialogue-level interactive evaluation over traditional turn-level metrics to assess policy planning capability.
- Average Turn (AT)
  - Conceptual Definition: AT measures the efficiency of goal completion. It quantifies the average number of turns required for the dialogue system to successfully achieve its designated conversational goal. A lower AT indicates a more efficient policy.
  - Mathematical Formula: The paper does not provide an explicit formula, but it can be commonly understood as:
$
\mathrm{AT} = \frac{\sum_{e \in \mathrm{SuccessfulEpisodes}} \mathrm{turns}(e)}{\text{Number of Successful Episodes}}
$
  - Symbol Explanation:
    - $\mathrm{SuccessfulEpisodes}$: The set of dialogue episodes where the conversational goal was successfully achieved.
    - $\mathrm{turns}(e)$: The total number of turns taken in a specific successful episode $e$.
    - Number of Successful Episodes: The total count of dialogue episodes that ended in GOAL-COMPLETED.
- Success Rate (SR@t)
  - Conceptual Definition: SR measures the effectiveness of goal completion. It represents the proportion of dialogue episodes in which the conversational goal was successfully achieved within a predefined maximum number of turns $t$. A higher SR indicates a more effective policy. The paper sets the maximum turn to 8 for the experiments (SR@8).
  - Mathematical Formula: The paper does not provide an explicit formula, but it can be commonly understood as:
$
\mathrm{SR@}t = \frac{\text{Number of Successful Episodes within } t \text{ turns}}{\text{Total Number of Episodes}}
$
  - Symbol Explanation:
    - Number of Successful Episodes within $t$ turns: The count of dialogue episodes that reached GOAL-COMPLETED within $t$ turns.
    - Total Number of Episodes: The total count of all dialogue episodes simulated or evaluated.
- Sale-to-List Ratio (SL%)
  - Conceptual Definition: SL% is a specific metric used for negotiation dialogues (CraigslistBargain) to determine the effectiveness of goal completion from the buyer's perspective. It quantifies how much benefit the buyer gains from the deal relative to the potential negotiation range. A higher SL% means the buyer achieved a price closer to their target price (and further from the seller's initial price), indicating greater benefit. If no deal is reached, SL% is assigned as 0.
  - Mathematical Formula:
$
\mathrm{SL\%} = \frac{\text{deal price} - \text{seller target price}}{\text{buyer target price} - \text{seller target price}}
$
  - Symbol Explanation:
    - deal price: The final negotiated price agreed upon in the dialogue.
    - seller target price: The initial (listed) price or desired minimum price of the seller.
    - buyer target price: The desired maximum price the buyer is willing to pay.
- Human Evaluation Metrics:
  - For ESConv (Emotional Support):
    - Identification: How helpful the assistant is in exploring and identifying the problem.
    - Comforting: How skillful the assistant is in comforting the user.
    - Suggestion: How helpful the suggestions are for solving the problem.
  - For CraigslistBargain (Negotiation):
    - Persuasive: How persuasive the assistant is in the negotiation.
    - Coherent: How on-topic and consistent with the conversation history the assistant is.
    - Natural: How human-like the assistant's responses are.
  - Overall: A general assessment of the dialogue quality. These are subjective metrics, evaluated by human annotators in a pairwise comparison (Win/Tie/Lose) between PPDPP and baselines.
The evaluation is conducted in an interactive setting using LLM-based user simulators and LLM-based reward models, as described in the methodology, to simulate diverse interactions and automatically assess dialogue-level outcomes.
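For concreteness, the sketch below (not from the paper) computes AT, SR@8, and SL% from simple episode records; with the example case above (listed price 150, buyer target 135), a hypothetical deal at 140 would give SL% = (140 - 150) / (135 - 150) ≈ 0.67.

```python
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class EpisodeRecord:           # illustrative record of one evaluated dialogue
    success: bool              # reached GOAL-COMPLETED within the turn budget?
    turns: int                 # number of turns used
    deal_price: Optional[float] = None   # negotiation only; None if no deal

def average_turn(episodes: List[EpisodeRecord]) -> float:
    done = [e for e in episodes if e.success]
    return sum(e.turns for e in done) / len(done) if done else float("nan")

def success_rate(episodes: List[EpisodeRecord], max_turns: int = 8) -> float:
    return sum(e.success and e.turns <= max_turns for e in episodes) / len(episodes)

def sale_to_list(e: EpisodeRecord, seller_target: float, buyer_target: float) -> float:
    if e.deal_price is None:   # no deal reached -> SL% = 0
        return 0.0
    return (e.deal_price - seller_target) / (buyer_target - seller_target)

# Example: hypothetical deal at 140 with listed price 150 and buyer target 135 -> ~0.67
print(sale_to_list(EpisodeRecord(True, 5, 140.0), 150.0, 135.0))
```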
5.3. Baselines
The paper compares PPDPP against several existing methods, encompassing both general fine-tuned dialogue models and various LLM-based dialogue policy planning approaches:
- DialoGPT (Zhang et al., 2020): A general fine-tuned dialogue model. This represents a pre-LLM baseline for response generation.
- Standard: A vanilla LLM prompting scheme. This involves directly prompting two LLMs to conduct self-play conversations using task instructions without considering any explicit dialogue strategy. It serves as a strong baseline for the raw capability of LLMs.
- AnE (Ask-an-Expert) (Zhang et al., 2023a): This method prompts another LLM to act as a strategic expert. The expert LLM is asked a series of questions to reason about the next dialogue strategy. The strategy is a verbal description rather than a selection from a predefined taxonomy. The original work targeted emotional support dialogues and is adapted to the other applications by changing the roles in the questions.
- Proactive (Deng et al., 2023b): This approach prompts the LLM-based dialogue system to first select the most appropriate strategy for the next turn, and then generate the response based on the selected strategy. The predicted strategy label is mapped into a mixed-initiative strategy prompt (MI-Prompt) as per Chen et al. (2023).
  - + MI-Prompt (Chen et al., 2023): This variant specifically incorporates the mixed-initiative strategy-based prompting detailed in Chen et al. (2023) into the Proactive scheme, explicitly guiding the LLM with strategic instructions.
- ProCoT (Deng et al., 2023b): This method builds upon Proactive by first prompting the LLM-based dialogue system to generate a chain-of-thought (CoT) descriptive analysis for planning the next turn's strategy. This aims to improve the LLM's reasoning before action selection.
  - + MI-Prompt (Chen et al., 2023): Similar to Proactive, this explicitly incorporates the MI-Prompt mechanism within the ProCoT framework.
- ICL-AIF (In-Context Learning from AI Feedback) (Fu et al., 2023): This method prompts an LLM to provide dialogue-level feedback to a player to improve their dialogue strategies. The feedback is verbal and is used over iterations of self-play simulation to refine the overall strategy. Unlike AnE, it focuses on dialogue-level rather than turn-level feedback.
Implementation Details:
- PPDPP Plug-in: RoBERTa (roberta-large) is used as the default plug-in.
- Backbone LLMs:
  - For the role-playing LLMs ($\mathrm{LLM_{sys}}$ and $\mathrm{LLM_{usr}}$) and the reward model ($\mathrm{LLM_{rwd}}$), ChatGPT (gpt-3.5-turbo-0613) is the primary choice.
  - For the role-playing LLMs, the temperature is set so that outputs are deterministic.
  - For the reward model, a sampling temperature is used and the output is sampled multiple times (l = 10; see Table 7) to integrate scalar rewards.
  - Performance comparisons are also made with open-source LLMs, Vicuna-13B-delta-v1.1 and LLaMA-2-13B-Chat, as backbone LLMs for response generation, while the user simulator and reward model remain ChatGPT.
5.4. Training Details
The training process for PPDPP is divided into two phases: Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL).
The following hyper-parameter settings are taken from [Table 6] of the original paper:
| Training Phase | Hyper-parameter | Value |
|---|---|---|
| SFT | Batch Size | 16 |
| SFT | Training Epochs | 10 |
| SFT | Learning Rate | 6e-6 |
| SFT | Max Sequence Length | 512 |
| RL | Training Episodes | 1,000 |
| RL | Learning Rate | 1e-6 |
| RL | Max Conversation Turn | 8 |
| RL | Discount Factor γ | 0.999 |
| RL | Max New Tokens | 32 |
SFT Phase:
- PPDPP is fine-tuned on the training set of the respective datasets. Checkpoint selection is based on the best performance on the validation set.
- Hyper-parameters include a batch size of 16, 10 training epochs, a learning rate of 6e-6, a max sequence length of 512, a linear learning-rate scheduler, and a weight decay of 0.01.
RL Phase:
- Cases from the training set are randomly sampled for online training.
- Hyper-parameters include 1,000 training episodes, a learning rate of 1e-6, a max conversation turn of 8, a discount factor of 0.999, and max new tokens of 32.
- Experiments are conducted on a server equipped with 8 Tesla V100 GPUs.
5.5. Reliability Analysis of LLMs as Reward Models and User Simulators
Before deploying the self-play evaluation and RL training, the authors performed a reliability analysis on using LLMs as reward models and user simulators.
5.5.1. Analysis of LLMs as Reward Model
The reward model is tasked with selecting the situation that best matches the current user state. The reliability is assessed by computing the F1 score of its predictions against human-annotated labels on 50 sampled self-play dialogues from each dataset.
The following figure (Figure 5 from the original paper) shows the analysis of LLMs as reward models:
Figure 5 is a chart comparing different large language models (Vicuna, LLaMA2-Chat, ChatGPT) used as the reward model on the CraigslistBargain, ESConv, and CIMA datasets; ChatGPT performs best overall.
As seen in Figure 5, ChatGPT performs well across all three datasets (CraigslistBargain, ESConv, CIMA). Vicuna-13B and LLaMA2-Chat-13B also perform well for CraigslistBargain and ESConv. However, Vicuna and LLaMA2 struggle significantly with CIMA, which involves Italian translation. This is because they were not trained on large-scale Italian data, making them unable to correctly evaluate the students' Italian translations. This analysis confirms ChatGPT's suitability as the reward model for all three problems due to its strong multilingual capabilities.
5.5.2. Analysis of LLMs as User Simulator
Simulated users are expected to accurately play their assigned role within a specific context. The quality of user simulators is evaluated based on naturalness (fluency and human-likeness of utterances) and usefulness (consistency with role descriptions) in single-turn and multi-turn free-form conversations. A pairwise evaluation (Win/Tie/Lose) was conducted by two annotators, comparing LLM-based simulators with a fine-tuned DialoGPT and original human conversations.
The following are the results from [Table 5] of the original paper:
| Setting | Single-turn Natural | Single-turn Useful | Multi-turn Natural | Multi-turn Useful |
|---|---|---|---|---|
| DialoGPT | 8% | 4% | 2% | 5% |
| ChatGPT | 63% | 72% | 78% | 74% |
| Tie | 29% | 24% | 20% | 21% |
| Human | 14% | 22% | 18% | 27% |
| ChatGPT | 49% | 42% | 36% | 33% |
| Tie | 37% | 36% | 46% | 41% |

(The first three rows compare DialoGPT against ChatGPT as the user simulator; the last three compare original human conversations against ChatGPT. Percentages are preference rates from the pairwise evaluation.)
The Cohen's Kappa between annotators was 0.72, indicating substantial agreement. ChatGPT-based simulators show significantly superior performance compared to DialoGPT, especially in naturalness for multi-turn conversations. Even when compared to human-annotated dialogues, ChatGPT-based simulators achieve competitive performance (e.g., 49% Win rate vs. Human in single-turn Naturalness, with 37% Tie). These results validate the reliability of using ChatGPT as the user simulator.
6. Results & Analysis
6.1. Core Results Analysis
The paper presents a comprehensive evaluation comparing PPDPP with various baselines across three proactive dialogue applications. The primary metrics are Average Turn (AT) (lower is better for efficiency), Success Rate (SR) (higher is better for effectiveness), and Sale-to-List Ratio (SL%) for negotiation (higher is better for buyer benefit).
The following are the results from [Table 3] of the original paper:
(CB = CraigslistBargain, ESC = ESConv.)

| Method | #Tokens | CB AT↓ | CB SR↑ | CB SL%↑ | ESC AT↓ | ESC SR↑ | CIMA AT↓ | CIMA SR↑ |
|---|---|---|---|---|---|---|---|---|
| DialoGPT | - | 6.73 | 0.3245 | 0.2012 | 5.31 | 0.7538 | 5.43 | 0.4956 |
| Standard | O(L) | 6.47 | 0.3830 | 0.1588 | 5.10 | 0.7692 | 3.89 | 0.6903 |
| AnE (Zhang et al., 2023a) | O((M + 1)L) | 5.91 | 0.4521 | 0.2608 | 4.76 | 0.8000 | 3.86 | 0.6549 |
| Proactive (Deng et al., 2023b) | O(L) | 5.80 | 0.5638 | 0.2489 | 5.08 | 0.7538 | 4.84 | 0.5310 |
| + MI-Prompt (Chen et al., 2023) | O(2L) | 5.74 | 0.5691 | 0.2680 | 4.78 | 0.7846 | 4.70 | 0.5664 |
| ProCoT (Deng et al., 2023b) | O(L) | 6.22 | 0.5319 | 0.2486 | 4.75 | 0.7923 | 4.58 | 0.5487 |
| + MI-Prompt (Chen et al., 2023) | O(2L) | 6.12 | 0.5532 | 0.3059 | 4.83 | 0.7769 | 4.72 | 0.5221 |
| ICL-AIF (Fu et al., 2023) | O((N + 1)L) | 6.53 | 0.3617 | 0.1881 | 4.69 | 0.8079 | 4.19 | 0.6106 |
| PPDPP | O(L) | 5.62 | 0.6117 | 0.3376 | 4.56 | 0.8462 | 3.03 | 0.8407 |
| - w/o SFT | O(L) | 5.71 | 0.6223 | 0.3354 | 4.68 | 0.8384 | 3.18 | 0.8230 |
| - w/o RL | O(L) | 5.57 | 0.6649 | 0.2280 | 5.24 | 0.7308 | 3.41 | 0.7965 |
Overall Observations:
- PPDPP consistently outperforms baselines: The proposed PPDPP method achieves the best performance across all three datasets and almost all metrics (lower AT, higher SR/SL%). This demonstrates its effectiveness and efficiency in achieving conversational goals.
- Efficiency in token usage: PPDPP requires O(L) tokens, which is similar to Standard, Proactive, and ProCoT, and significantly less than AnE (O((M + 1)L)) and ICL-AIF (O((N + 1)L)), making it more practical for black-box LLMs.
- Importance of Reinforcement Learning (RL): The PPDPP - w/o RL variant (meaning only SFT was applied) generally performs worse than the full PPDPP, highlighting the crucial role of RL with simulated interactions in optimizing the policy.
Task-Specific Observations:
- Negotiation Dialogues (CraigslistBargain):
  - PPDPP achieves the lowest AT (5.62), highest SR (0.6117), and highest SL% (0.3376), significantly outperforming all baselines.
  - Prompt-based methods (AnE, Proactive, ProCoT with/without MI-Prompt) show substantial improvement over Standard and DialoGPT in SR and SL%. ICL-AIF performs poorly in negotiation, even negatively affecting SR compared to Standard, suggesting dialogue-level AI feedback might not be dynamic enough for negotiation.
  - Trade-off between SR and SL% in PPDPP: The PPDPP - w/o RL variant has a higher SR (0.6649) but a much lower SL% (0.2280) compared to the full PPDPP. This indicates that RL optimizes for higher buyer benefit (SL%) at the cost of slightly reducing the likelihood of a deal (SR), which is expected behavior in competitive negotiation.
- Emotional Support Dialogues (ESConv):
  - PPDPP achieves the lowest AT (4.56) and highest SR (0.8462).
  - Standard prompting performs reasonably well (SR of 0.7692). PPDPP - w/o RL performs worse than Standard prompting (SR 0.7308 vs. 0.7692), suggesting that corpus-based SFT alone is insufficient for the complexity and diversity of emotional support. RL significantly boosts PPDPP's SR from 0.7308 to 0.8462, demonstrating its ability to learn effective support strategies.
- Tutoring Dialogues (CIMA):
  - PPDPP achieves the lowest AT (3.03) and highest SR (0.8407).
  - All baseline methods struggle to beat Standard prompting (SR 0.6903), indicating ChatGPT's inherent strength in tutoring/translation.
  - Interestingly, PPDPP - w/o RL (SFT only) substantially outperforms all baselines (SR 0.7965). This is attributed to the CIMA dataset's narrower scope (Italian translation for a specific grammatical structure), where corpus-based learning can be highly effective for cases similar to the training data.
  - RL still further improves PPDPP's SR from 0.7965 to 0.8407, demonstrating continuous optimization capabilities.
6.2. In-depth Analysis
6.2.1. Performance w.r.t Turns
The paper analyzes the relative success rate (SR@t) and relative Sale-to-List Ratio (SL%@t) against the Standard prompting method at different conversation turns.
The following figure (Figure 2 from the original paper) shows the comparison of relative success rate against Standard at different conversation turns:
Figure 2 is a chart comparing the relative success rate of PPDPP and other methods across the three tasks (CraigslistBargain, ESConv, CIMA) at different conversation turns; the y-axis is the relative success rate (%) and the x-axis is the conversation turn. PPDPP performs best overall.
The following figure (Figure 3 from the original paper) shows the comparison of relative Sale-to-List Ratio against Standard at different turns:

Observations:
- PPDPP's consistent superiority: PPDPP consistently outperforms the baselines across all datasets and at almost every turn.
- AnE's early strength, later decline: AnE shows strong performance in the first few turns, especially in negotiation (Figure 3, positive SL% early on). This suggests AnE is effective at quickly understanding simple situations and achieving goals in short dialogues. However, its performance drops rapidly as conversations become longer and more complex, indicating a failure to achieve long-term goals or adapt to complicated situations. This highlights the limitation of turn-level expert advice without a learnable long-term policy.
- Baselines' struggle in tutoring (CIMA): In tutoring dialogues (CIMA, Figure 2), all baselines perform worse than Standard prompting after three turns. This implies they get stuck or make wrong decisions that hinder long-term goal achievement, further emphasizing the importance of a robust policy planning mechanism like PPDPP.
6.2.2. Comparisons with Different LLMs
The paper investigates PPDPP's performance when using different LLMs (ChatGPT, Vicuna, LLaMA2-Chat) as the backbone for response generation, while the user simulator and reward model remain ChatGPT.
The following figure (Figure 4 from the original paper) shows the testing performance curves over training episodes for different backbone LLMs:

Observations:
- RL effectiveness across LLMs: RL training effectively enhances the performance of all LLM-powered dialogue agents on each dialogue problem, demonstrating the robustness of the PPDPP framework regardless of the specific backbone LLM. The optimization objective (SL% for CraigslistBargain, SR for ESConv and CIMA) generally increases with the number of training episodes.
- LLM inherent strengths: ChatGPT does not universally outperform other LLMs:
  - Negotiation (CraigslistBargain): Vicuna and LLaMA2-Chat achieve higher benefits (SL%) than ChatGPT, but with a lower success rate of reaching a deal. This suggests ChatGPT may be more prone to making compromises to reach a deal, potentially due to its general-purpose training favoring responsiveness and collaboration.
  - Emotional Support (ESConv): Vicuna achieves performance competitive with ChatGPT.
  - Tutoring (CIMA): ChatGPT substantially outperforms the others. This is attributed to ChatGPT's superior multilingual capabilities, which are crucial for the Italian translation task.
- Implication: Different LLMs possess inherent strengths in different dialogue problems stemming from their pre-training. This suggests that an ensemble of multiple agents (each leveraging different LLMs) could potentially address a wider range of dialogue problems more effectively.
6.3. Ablation Study of the Sampling Strategy
An ablation study was conducted to validate the benefits of sampling goal-oriented AI feedback multiple times, specifically focusing on its impact on state prediction and reward estimation.
The following are the results from [Table 7] of the original paper:
(The first three result columns are state-prediction F1 scores; the remaining columns are dialogue results under the corresponding reward estimation. CB = CraigslistBargain, ESC = ESConv.)

| Method | State F1↑ (CB) | State F1↑ (ESC) | State F1↑ (CIMA) | CB AT↓ | CB SR↑ | CB SL%↑ | ESC AT↓ | ESC SR↑ | CIMA AT↓ | CIMA SR↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| PPDPP (l = 10) | 93.7 | 93.4 | 94.6 | 5.62 | 0.6117 | 0.3376 | 4.56 | 0.8462 | 3.03 | 0.8407 |
| PPDPP (l = 1) | 91.4 | 88.2 | 90.3 | 5.87 | 0.5957 | 0.2623 | 4.67 | 0.8307 | 3.29 | 0.7965 |
Observations:
- Analysis of State Prediction (left part of Table 7):
  - PPDPP (l = 10), which samples the AI feedback 10 times, significantly improves the F1 score for state prediction across all three datasets (e.g., from 91.4 to 93.7 for CraigslistBargain, 88.2 to 93.4 for ESConv, and 90.3 to 94.6 for CIMA) compared to PPDPP (l = 1), which samples once.
  - This indicates that sampling effectively reduces the variance of the LLM-generated output, leading to more reliable goal-completion determination by the reward model.
- Analysis of Reward Estimation (right part of Table 7):
  - PPDPP (l = 10), which uses averaged continuous rewards, generally outperforms PPDPP (l = 1), which uses discrete rewards, on the dialogue performance metrics (AT, SR, SL%). For example, in CraigslistBargain, SL% increases from 0.2623 to 0.3376 with l = 10.
  - This suggests that fine-grained continuous rewards (derived from averaging multiple samples) lead to better performance: they make the policy planning outcomes more distinguishable during the reinforcement learning process, enabling more precise optimization. A minimal sketch of this reward-averaging step follows this list.
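The sketch below illustrates the reward-averaging step, assuming that each sampled verbal verdict from the reward model is mapped to a scalar and the l samples are averaged; the verdict strings and scalar values in the mapping are illustrative, not the paper's exact wording or scores.

```python
import random
from typing import Callable, List

# Illustrative verdict-to-scalar mapping; the paper's exact labels/values may differ.
VERDICT_TO_SCORE = {"goal achieved": 1.0, "almost there": 0.5, "not yet": -1.0}

def goal_oriented_reward(
    reward_llm: Callable[[str], str],  # e.g. ChatGPT asked whether the goal is met
    dialogue: str,
    l: int = 10,
) -> float:
    """Sample the reward model l times (with non-zero temperature) and average.

    l = 1 corresponds to the discrete-reward ablation in Table 7, while
    l = 10 yields the fine-grained continuous reward used by PPDPP.
    """
    scores: List[float] = []
    for _ in range(l):
        verdict = reward_llm(dialogue)
        scores.append(VERDICT_TO_SCORE.get(verdict.strip().lower(), 0.0))
    return sum(scores) / len(scores)

# Toy usage with a stochastic stand-in for the reward LLM:
fake_reward_llm = lambda _prompt: random.choice(list(VERDICT_TO_SCORE))
print(goal_oriented_reward(fake_reward_llm, "dialogue history ...", l=10))
```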
6.4. Human Evaluation
Human evaluation was conducted on 100 randomly sampled dialogues from ESConv and CraigslistBargain, with three annotators comparing PPDPP against AnE, ProCoT, and ICL-AIF.
The following are the results from [Table 4] of the original paper:
| PPDPP vs. | ESConv Ide. | ESConv Com. | ESConv Sug. | ESConv Ove. | CB Per. | CB Coh. | CB Nat. | CB Ove. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AnE | 31% / 15% | 14% / 27% | 52% / 12% | 34% / 24% | 40% / 23% | 22% / 12% | 14% / 7% | 31% / 18% |
| ProCoT | 27% / 21% | 34% / 20% | 38% / 15% | 30% / 11% | 24% / 21% | 17% / 15% | 9% / 6% | 27% / 21% |
| ICL-AIF | 35% / 12% | 32% / 28% | 33% / 29% | 29% / 22% | 55% / 11% | 39% / 12% | 25% / 3% | 62% / 4% |
Note: Each row compares PPDPP against one baseline, and each cell reports PPDPP's win rate / lose rate for that metric (a lose means the baseline was preferred); CB denotes CraigslistBargain. The remaining percentage corresponds to ties. For example, for PPDPP vs. AnE on Identification (ESConv), 100% − 31% − 15% = 54% of the judgments were ties.
Observations:
- Overall Superiority of PPDPP: PPDPP generally outperforms the other baselines across almost all human evaluation perspectives and Overall scores for both ESConv and CraigslistBargain. This reinforces the findings from the automatic metrics.
- ESConv (Emotional Support): PPDPP shows strong win rates in Identification (vs. AnE: 31% win, 15% lose; vs. ProCoT: 27% win, 21% lose; vs. ICL-AIF: 35% win, 12% lose) and Suggestion.
  - The only exception is Comforting, where AnE has a higher win rate against PPDPP (27% win for AnE vs. PPDPP's 14% win). This might be because AnE is observed to provide detailed and empathetic emotional support strategies (as mentioned in the qualitative case study), which contributes to strong comforting. However, the paper notes that a dialogue system needs to go beyond empathy to proactively explore and solve the patient's issues, which PPDPP does better.
- CraigslistBargain (Negotiation): PPDPP demonstrates significantly higher win rates in Persuasive, Coherent, Natural, and Overall compared to all baselines. For instance, PPDPP has an Overall win rate of 62% against ICL-AIF with only a 4% lose rate, indicating substantial human preference.
6.5. Example Conversations
The paper includes qualitative case studies in Appendix F to illustrate the differences in conversational behavior and outcomes among various methods.
Negotiation Dialogues (Example from Appendix F):
- Scenario: The buyer (LLM agent) and seller (LLM user simulator) negotiate a furniture price. Listed price: $150, buyer target price: $135. (The SL% values below are worked through in the short example after this list.)
- Standard: Directly reveals the budget, sticks to it rigidly, and shows no negotiation strategy. Leads to no deal (SL% = 0).
- Ask-an-Expert (AnE): Effectively reaches a deal quickly but results in a large compromise for the buyer (SL% = 0.3333). Suggestions from the expert LLM might prioritize closing the deal over maximizing the buyer's benefit.
- ProCoT: Employs effective negotiation strategies, leading to a much better deal (SL% = 0.8). Shows strategic turn-by-turn planning.
- ICL-AIF: Provides dialogue-level suggestions that are adopted, but fails to dynamically adapt to user interactions, leading to no deal (SL% = 0). This highlights the limitations of static dialogue-level feedback for dynamic negotiations.
- PPDPP: Similar to ProCoT, uses effective negotiation strategies to achieve a better deal (SL% = 0.8333). It also shows adaptability, maximizing buyer benefit when the seller shows willingness to compromise. This demonstrates the combination of strategic planning and dynamic adaptation.
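For reference, the SL% numbers quoted above can be reproduced with a small worked example. The computation below assumes a buyer-side Sale-to-List formulation, SL% = (listed price − deal price) / (listed price − buyer target), with no deal scored as 0; the deal prices are back-solved from the reported ratios purely for illustration, and the exact metric definition should be checked against the original paper.

```python
from typing import Optional

def buyer_sl(listed: float, buyer_target: float, deal: Optional[float]) -> float:
    """Assumed buyer-side Sale-to-List ratio: 1.0 means the buyer paid exactly
    the target price; 0.0 means no deal (or a deal at the listed price)."""
    if deal is None:  # negotiation ended without agreement
        return 0.0
    return (listed - deal) / (listed - buyer_target)

# Scenario above: listed price $150, buyer target $135.
print(buyer_sl(150, 135, None))   # 0.0    -> no deal (e.g., Standard, ICL-AIF)
print(buyer_sl(150, 135, 145))    # 0.3333 -> a deal with a large buyer compromise
print(buyer_sl(150, 135, 138))    # 0.8    -> a much better deal for the buyer
print(buyer_sl(150, 135, 137.5))  # 0.8333 -> close to the buyer's target
```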
Emotional Support Dialogues (Example from Appendix F):
- Scenario: The patient (LLM user simulator) is facing a job crisis and feeling fear.
- Standard: Consistently conveys empathy but becomes less useful as the conversation progresses and the patient's emotional intensity decreases. Lacks goal-oriented progression.
- Ask-an-Expert (AnE): Produces engaging conversations with detailed empathetic actions but shares the drawback of not progressing beyond empathy to problem-solving.
- ProCoT: Adopts effective emotional support strategies to efficiently solve the patient's issue by providing helpful suggestions, demonstrating a more goal-oriented approach.
- ICL-AIF: Provides dialogue-level suggestions covering different stages of emotional support (validation, exploration, coping strategies), which are effectively implemented to interact with the patient.
- PPDPP: Optimizes the policy planner for efficient goal achievement, leading to shorter conversations. It strategically moves from acknowledgement and reassurance to suggestions for coping. This illustrates the benefit of RL in learning efficient pathways to goal completion.
These examples underscore PPDPP's ability to not only employ effective strategies but also adapt dynamically and efficiently steer conversations towards goals, distinguishing it from methods that are either too passive, too rigid, or lack long-term optimization.
7. Conclusion & Reflections
7.1. Conclusion Summary
This work introduces PPDPP, a novel plug-and-play dialogue policy planner paradigm designed to empower Large Language Model (LLM)-powered dialogue agents with enhanced proactivity. The core innovation lies in decoupling dialogue policy planning from the LLM's response generation using a separate, tunable language model plug-in. The paper also develops a robust training framework that combines supervised fine-tuning (SFT) on human-annotated data for initialization and reinforcement learning (RL) from goal-oriented AI feedback obtained through LLM-based self-play simulations. This framework allows the LLM agent to generalize to new cases and adapt to different applications by simply replacing the learned plug-in, crucially without re-training the expensive backbone LLM. Furthermore, the authors propose an interactive evaluation protocol that leverages LLM-based user simulators and reward models to assess dialogue-level effectiveness and efficiency in multi-turn conversations. Experimental results across negotiation, emotional support, and tutoring dialogues consistently demonstrate PPDPP's superior performance compared to existing methods, highlighting its ability to effectively and efficiently guide conversations towards designated goals.
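To make the two-stage training framework concrete, the sketch below shows one possible shape of the plug-in updates, assuming the plug-in is a small classifier over candidate dialogue actions and assuming a REINFORCE-style policy-gradient objective as one standard way to use the scalar self-play reward; the paper's exact objective, hyperparameters, and plug-in architecture may differ.

```python
import torch
import torch.nn.functional as F

def sft_step(policy, optimizer, context_ids, gold_action: int) -> float:
    """Stage 1 (SFT): cross-entropy on a human-annotated (context, strategy) pair."""
    logits = policy(context_ids)                       # assumed shape: [num_actions]
    loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([gold_action]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def rl_step(policy, optimizer, context_ids, reward: float) -> int:
    """Stage 2 (RL): REINFORCE-style update from a goal-oriented self-play reward.

    Simplified: in practice the action is sampled during the self-play rollout
    and its log-probability is reused once the episode reward arrives.
    """
    probs = F.softmax(policy(context_ids), dim=-1)
    action = int(torch.multinomial(probs, 1))          # sample a dialogue action
    loss = -torch.log(probs[action]) * reward          # reinforce rewarded actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return action
```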
7.2. Limitations & Future Work
The authors identify several implications and avenues for future research:
- Potential of Tunable Plug-ins: The success of PPDPP highlights the significant potential of using tunable plug-ins to address specific shortcomings of LLMs. This approach can be extended to various other applications and could involve integrating multiple plug-ins to tackle even more complex dialogue challenges.
- Inherent LLM Strengths and Ensemble Agents: The findings reveal that dialogue agents powered by different LLMs possess inherent strengths on different problems, influenced by their respective training processes. Given the resource-intensive nature of training specialized LLMs, this insight suggests the value of employing an ensemble of multiple agents, each potentially leveraging a different LLM's strengths, to collaboratively address a wide range of dialogue problems.
- Limitations of LLM-based Simulators and Reward Models: While the reliability of the LLM-based user simulators and reward models was validated, they still inherently carry potential biases or limitations of the underlying LLMs, which could affect the fidelity of the self-play environment and the RL feedback.
- Cost of LLM Calls: Despite PPDPP being more token-efficient than some baselines, the reliance on multiple LLMs (assistant, user, reward model) during self-play simulation can still incur significant computational and API costs, especially for proprietary LLMs.
- Scope of Proactive Dialogues: The current work focuses on specific types of proactive dialogues (negotiation, emotional support, tutoring). Generalizability to other complex proactive scenarios (e.g., medical diagnosis, financial advice) needs further exploration.
7.3. Personal Insights & Critique
This paper presents a highly practical and innovative approach to enhancing the proactivity of LLM-powered dialogue agents.
- Decoupling is Key: The plug-and-play paradigm of decoupling policy planning from response generation is a significant architectural improvement. For commercial applications, where direct fine-tuning of proprietary LLMs is impossible or prohibitively expensive, having a small, tunable policy plug-in is an elegant solution. It allows domain-specific strategic intelligence to be injected and learned without compromising the general linguistic capabilities of the powerful backbone LLM. This modularity greatly simplifies development and deployment cycles for specialized dialogue agents.
- Interactive RL from AI Feedback: The LLM-based self-play simulation combined with goal-oriented AI feedback for reinforcement learning is a powerful learning mechanism. It allows the system to learn from dynamic, multi-turn interactions, which is essential for proactive dialogues where long-term goals matter. The use of sampling to stabilize the reward signal from the reward model is a clever detail that addresses the inherent stochasticity and subjectivity of LLM outputs. This approach effectively bridges the gap between natural language feedback and structured scalar rewards for RL.
- Robust Evaluation Framework: The emphasis on interactive evaluation using LLM-based user simulators and reward models is crucial. It moves beyond superficial turn-level metrics to directly assess the dialogue-level success and efficiency of policy planning, which is the true measure of a proactive system. This framework itself could be widely adopted for evaluating other goal-oriented dialogue systems.
- Trade-offs and Nuances: The observation of trade-offs, such as SL% vs. SR in negotiation, is a valuable insight. It highlights that optimizing for one aspect of a goal (e.g., maximizing buyer benefit) might inherently impact another (e.g., deal success rate). This complex behavior underscores the policy planner's ability to learn nuanced strategies rather than simplistic ones.
- Potential Issues / Areas for Improvement:
  - LLM Hallucination/Bias in Simulators: While validated, the LLM-based user simulators and reward models are still susceptible to hallucinations or biases present in their training data. If the user simulator does not accurately reflect human behavior, or the reward model has a flawed understanding of "success", the learned policy could be optimized for an imperfect simulation rather than real-world effectiveness. Further research into robust verification of these LLM components (perhaps through human-in-the-loop validation during RL) could strengthen the approach.
  - Action Space Design: The effectiveness heavily relies on a well-defined set of candidate dialogue actions and their mapping to natural language instructions. Designing this action space for complex domains can be challenging and might require significant domain expertise (a hypothetical example of such a mapping is sketched at the end of this subsection).
  - Domain Adaptation Costs: While the plug-in is described as "transferable" across applications, it still requires re-training (SFT + RL) for each new application. The term "plug-and-play" implies a level of immediate usability, but it is more accurate to call it a "plug-in-and-train" approach for new domains. However, this cost is still significantly less than fine-tuning the entire backbone LLM.
  - Interpretability of Learned Policies: While the policy plug-in is smaller, understanding why it chooses certain actions, especially after RL training, can still be a challenge. Improving the interpretability of these learned policies could build greater trust and facilitate debugging.

Overall, PPDPP offers a compelling solution for building more intelligent, goal-oriented conversational agents using LLMs. Its modular design and principled learning framework lay a strong foundation for future advancements in proactive dialogue systems.
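As referenced in the Action Space Design point above, the snippet below shows what a hand-designed action-to-instruction mapping might look like for the emotional support domain. The strategy labels loosely follow common ESConv-style strategies, and the instruction wording is invented for illustration; PPDPP's actual action sets and prompts are defined in the original paper.

```python
# Hypothetical action-to-instruction mapping for an emotional-support planner.
# Each new domain needs its own action set and instruction wording, which is
# where the domain expertise discussed above comes in.
EMOTIONAL_SUPPORT_ACTIONS = {
    "question": "Ask an open question to explore what the user is going through.",
    "reflection_of_feelings": "Name and reflect the emotion the user has expressed.",
    "affirmation_and_reassurance": "Acknowledge the user's effort and reassure them.",
    "providing_suggestions": "Suggest one concrete coping step the user could take.",
}

def instruction_for(action: str) -> str:
    """Resolve a planned dialogue action to the instruction given to the backbone LLM."""
    return EMOTIONAL_SUPPORT_ACTIONS[action]
```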