A Survey on Proactive Dialogue Systems: Problems, Methods, and Prospects
TL;DR Summary
This survey reviews proactive dialogue systems, highlighting problems, methods, and motivational strategies for goal-driven, strategic conversational agents, offering a comprehensive overview to advance conversational AI toward more complex, interactive tasks.
Abstract
Proactive dialogue systems, related to a wide range of real-world conversational applications, equip the conversational agent with the capability of leading the conversation direction towards achieving pre-defined targets or fulfilling certain goals from the system side. It is empowered by advanced techniques to progress to more complicated tasks that require strategical and motivational interactions. In this survey, we provide a comprehensive overview of the prominent problems and advanced designs for conversational agent's proactivity in different types of dialogues. Furthermore, we discuss challenges that meet the real-world application needs but require a greater research focus in the future. We hope that this first survey of proactive dialogue systems can provide the community with a quick access and an overall picture to this practical problem, and stimulate more progresses on conversational AI to the next level.
1. Bibliographic Information
1.1. Title
A Survey on Proactive Dialogue Systems: Problems, Methods, and Prospects
1.2. Authors
Yang Deng, Wenqiang Lei†, Wai Lam, Tat-Seng Chua

Affiliations: National University of Singapore; Sichuan University; The Chinese University of Hong Kong
1.3. Journal/Conference
This paper was published on arXiv, a preprint server for scientific papers. arXiv is not a peer-reviewed journal or conference: submissions pass a basic moderation check but not formal peer review. It is nevertheless widely used, particularly in computer science, to disseminate findings before or in parallel with formal publication. Given the recency of the preprint, it may also have been submitted to a conference or journal in Natural Language Processing or Artificial Intelligence.
1.4. Publication Year
2023
1.5. Abstract
This paper presents a comprehensive survey of proactive dialogue systems, which are conversational agents designed with the ability to guide conversations toward specific goals or targets. These systems leverage advanced techniques for strategic and motivational interactions, enabling them to tackle more complex tasks. The survey provides an overview of key problems and advanced designs for conversational proactivity across three main types of dialogues: open-domain dialogues, task-oriented dialogues, and information-seeking dialogues. Furthermore, it discusses current challenges in real-world applications that require intensified future research, including hybrid dialogues, robust evaluation protocols, and ethical considerations. The authors aim for this survey to offer quick access and a broad understanding of this practical problem, thereby stimulating further advancements in conversational AI.
1.6. Original Source Link
The paper is available at: https://arxiv.org/abs/2305.02750.
The PDF version can be accessed at: https://arxiv.org/pdf/2305.02750v2.pdf.
It is published as a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper addresses is the passivity of most existing dialogue systems. Traditional dialogue systems are typically designed to passively respond to user queries, follow user-initiated topics, or fulfill explicit user requests. Examples include open-domain dialogue systems (for general conversation), task-oriented dialogue systems (for specific tasks like booking), and conversational information-seeking systems (for finding information).
This passivity presents several limitations:
- Limited Engagement: Passive systems may struggle to maintain user engagement or guide conversations effectively, especially in complex scenarios.
- Handling Ambiguity/Problematic Content: They might passively accept ambiguous user queries or even problematic/harmful statements, leading to suboptimal or unsafe interactions (e.g., providing randomly guessed answers, failing to address biased conversations).
- Lack of Strategy/Motivation: They lack the proactivity to strategically steer the conversation towards pre-defined system goals or to motivate users towards certain actions, which is crucial for more sophisticated applications.
- Absence of "Strong AI" Trait: The ability to take initiative and anticipate impacts (i.e., proactivity) is an essential property of intelligent, human-like conversations and a significant step towards strong AI that possesses autonomy and human-like consciousness. Even advanced models like ChatGPT exhibit some of these limitations due to a lack of proactivity.

The problem is important because equipping conversational agents with proactivity can significantly enhance user engagement, improve service efficiency, and enable the system to handle more complicated tasks that require strategical and motivational interactions.
The paper's entry point is to define and categorize proactive dialogue systems as conversational agents capable of leading the conversation direction towards achieving pre-defined targets or fulfilling certain goals from the system side. It aims to systematically review existing efforts in this emerging field, identify key problems, advanced designs, and future research directions.
2.2. Main Contributions / Findings
The paper makes several primary contributions by providing the first comprehensive survey specifically focused on proactive dialogue systems:
- Systematic Categorization: It provides a systematic overview of proactive dialogue systems by categorizing them into three main types of dialogues:
  - Proactive Open-domain Dialogues: Where the system leads general conversations (e.g., target-guided dialogues, prosocial dialogues).
  - Proactive Task-oriented Dialogues: Where the system goes beyond fulfilling explicit requests (e.g., non-collaborative dialogues, enriched task-oriented dialogues).
  - Proactive Conversational Information Seeking Systems: Where the system actively refines information search (e.g., asking clarification questions, user preference elicitation).
- Identification of Prominent Problems and Advanced Designs: For each category, the survey details the specific problems that necessitate proactivity and outlines the advanced techniques and designs developed to address them, including specific subtasks like topic-shift detection, dialogue strategy learning, and clarification need prediction.
- Review of Data Resources and Evaluation Protocols: It summarizes available datasets and commonly adopted evaluation metrics and protocols for each identified problem, providing a valuable resource for researchers entering the field.
- Discussion of Challenges and Prospects: The paper highlights crucial open challenges that align with real-world application needs but require greater research focus. These include:
  - Proactivity in Hybrid Dialogues: Addressing conversations with multiple, varying goals.
  - Evaluation Protocols for Proactivity: Developing robust, multidisciplinary metrics beyond traditional dialogue evaluation.
  - Ethics of Conversational Agent's Proactivity: Ensuring factuality, morality, and privacy in proactive interactions.

The key conclusion is that proactivity is a critical, yet often overlooked, property in conversational AI that holds immense potential for developing more intelligent, engaging, and capable dialogue systems. The findings solve the problem of fragmented understanding in this emerging field by synthesizing diverse research efforts, offering a structured view, and pointing towards future high-impact research avenues.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a foundational understanding of dialogue systems and key concepts within Natural Language Processing (NLP) is essential.
- Dialogue Systems: At its core, a dialogue system (or conversational AI) is a computer program designed to converse with human users in natural language. Its goal is typically to provide social support, answer questions, complete tasks, or offer recommendations.
  - Open-domain Dialogue Systems (ODD): These systems are designed for general conversations without a specific task. Their goal is to maintain engagement, build rapport, and provide social interaction. Examples include chatbots for casual chat.
  - Task-oriented Dialogue Systems (TOD): These systems aim to help users complete specific tasks, such as booking flights, making restaurant reservations, or setting reminders. They require understanding user intent, managing dialogue states, and interacting with external APIs.
  - Conversational Information-Seeking Systems (CIS): These systems help users find information through a conversational interface, often combining elements of search engines, recommender systems, and question-answering systems.
- Proactivity: In the context of dialogue systems, proactivity refers to the agent's ability to take initiative, anticipate user needs or conversation direction, and actively steer the dialogue towards a predefined goal or target, rather than merely reacting to user input. This contrasts with traditional passive dialogue systems.
- Natural Language Processing (NLP): The field of AI that focuses on enabling computers to understand, interpret, and generate human language. Techniques from NLP are fundamental to all dialogue systems, including natural language understanding (NLU) and natural language generation (NLG).
- Machine Learning (ML) / Deep Learning (DL): Many modern dialogue systems are built using ML and DL techniques, particularly neural networks.
  - Transformers: A deep learning architecture introduced in 2017, which has become dominant in NLP. It relies heavily on the self-attention mechanism to process sequential data, allowing it to weigh the importance of different parts of the input sequence. Models like BERT, GPT, and T5 are based on the Transformer architecture. The paper mentions T5-based models and BERT-based models.
  - Reinforcement Learning (RL): A type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. It's often used in dialogue systems for dialogue policy learning, where the agent learns the optimal strategy for responding to users.
- Knowledge Graphs: Structured representations of knowledge that store facts as nodes (entities) and edges (relationships). They are used to inject factual information and common sense into dialogue systems, particularly for knowledge-grounded conversations or topic planning.
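To make the self-attention idea above concrete, here is a minimal pure-Python sketch of single-head scaled dot-product attention. It is an illustration only: it assumes identity projections (queries, keys, and values are all the raw token vectors), whereas real Transformer layers learn separate projection matrices and use multiple heads. The function names and toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head self-attention with identity projections (Q = K = V = x).

    x is a list of token vectors (equal-length lists of floats).
    Returns (outputs, weights); weights[i][j] is how much token i attends to token j.
    """
    d = len(x[0])
    outputs, weights = [], []
    for q in x:
        # Scaled dot product between this token and every token in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        weights.append(w)
        # Each output is the attention-weighted average of all token vectors.
        outputs.append([sum(w[j] * x[j][t] for j in range(len(x))) for t in range(d)])
    return outputs, weights

# Toy sequence of three 2-dimensional token vectors (assumed values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Each row of `weights` is a probability distribution over the input tokens, which is what "weighing the importance of different parts of the input sequence" amounts to in practice.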
3.2. Previous Works
The paper frames its discussion by first highlighting the existing landscape of dialogue systems and then pinpointing the gap that proactivity addresses.
- Conventional Dialogue Research:
  - Response-ability Focus: Prior research largely focused on the response-ability of systems, meaning their ability to understand context and generate appropriate replies.
  - User-oriented/Passive Nature:
    - Open-domain dialogue systems (e.g., PersonaChat [Zhang et al., 2018a]): Aim to establish long-term connections by echoing user topics, emotions, or views. The paper notes that PersonaChat is often used as a base dataset for target-guided dialogue research, demonstrating how passive systems are being adapted for proactive goals.
    - Task-oriented dialogue systems (e.g., SimpleTOD [Hosseini-Asl et al., 2020]): Designed to passively follow user instructions to complete specific tasks. The paper mentions SimpleTOD as a baseline that needs extension to handle enriched TODs with chit-chats.
    - Conversational information-seeking systems (e.g., general conversational search [Aliannejadi et al., 2019]): Passively respond to user queries.
- Early Attempts at Proactivity:
  - Topic Introduction [Li et al., 2016]: Pioneering work recognized the need for systems to proactively introduce new topics.
  - Useful Suggestions [Yan and Zhao, 2018]: Other early studies explored agents offering useful suggestions.

  These efforts laid the groundwork but highlighted the need for more defined problem settings and applications.
- Key Models/Architectures Referenced (and underlying concepts):
  - BERT (Bidirectional Encoder Representations from Transformers): A Transformer-based model pre-trained on large text corpora, widely used for NLU tasks like classification and question answering. For example, BERT-based models are used for question selection in clarification question generation [Aliannejadi et al., 2019].
  - T5 (Text-to-Text Transfer Transformer): Another Transformer-based model that frames all NLP tasks as a text-to-text problem (taking text as input and producing text as output). It's used in topic-shift management [Xie et al., 2021].
  - Evaluation Metrics (crucial for understanding performance):
    - BLEU (Bilingual Evaluation Understudy): A metric for evaluating the quality of machine-generated text by comparing it to reference texts. It measures the n-gram overlap.
      - Conceptual Definition: BLEU quantifies how similar a candidate translation (or generated response) is to a set of high-quality reference translations. It's widely used in machine translation and text generation tasks, focusing on precision (how many words in the candidate are also in the reference).
      - Mathematical Formula: $ \text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) $ where $\text{BP}$ is the brevity penalty, $N$ is the maximum n-gram order (typically 4), $w_n$ are positive weights summing to 1 (often $w_n = 1/N$), and $p_n$ is the modified n-gram precision. The brevity penalty is calculated as: $ \text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{(1-r/c)} & \text{if } c \le r \end{cases} $ where $c$ is the length of the candidate text and $r$ is the effective reference corpus length. The n-gram precision $p_n$ is calculated as: $ p_n = \frac{\sum_{S \in \{\text{candidates}\}} \sum_{ngram \in S} \text{Count}_{\text{clip}}(ngram)}{\sum_{S' \in \{\text{candidates}\}} \sum_{ngram' \in S'} \text{Count}(ngram')} $
      - Symbol Explanation:
        - $\text{BP}$: Brevity penalty, which penalizes generated texts that are too short compared to the reference.
        - $c$: Length of the candidate (generated) text.
        - $r$: Effective reference corpus length, which is the sum of the lengths of the reference segments whose length is closest to the candidate segment length.
        - $N$: Maximum n-gram order.
        - $w_n$: Weight for the $n$-gram precision.
        - $p_n$: Modified n-gram precision for $n$-grams.
        - $\text{Count}_{\text{clip}}(ngram)$: The number of times an $n$-gram appears in the candidate, clipped by its maximum count in any single reference.
        - $\text{Count}(ngram')$: The total number of $n$-grams in the candidate.
    - ROUGE (Recall-Oriented Understudy for Gisting Evaluation): A set of metrics used for evaluating automatic summarization and machine translation software. It works by comparing an automatically produced summary or translation against a set of reference summaries or translations. It focuses on recall (how many words in the reference are also in the candidate). The paper mentions ROUGE for RoT generation and prosocial generation.
      - Conceptual Definition: ROUGE measures the overlap of n-grams, word sequences, and word pairs between the system-generated summary and human-created reference summaries. Different variants (ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S) exist, with ROUGE-N focusing on n-gram overlap.
      - Mathematical Formula (for ROUGE-N): $ \text{ROUGE-N} = \frac{\sum_{S \in \{\text{references}\}} \sum_{ngram_n \in S} \text{Count}_{\text{match}}(ngram_n)}{\sum_{S' \in \{\text{references}\}} \sum_{ngram'_n \in S'} \text{Count}(ngram'_n)} $
      - Symbol Explanation:
        - $n$: The length of the n-gram.
        - $\text{Count}_{\text{match}}(ngram_n)$: The number of times an $n$-gram appears in both the candidate and reference summaries.
        - $\text{Count}(ngram'_n)$: The number of times an $n$-gram appears in the reference summary.
    - PPL (Perplexity): A measure of how well a probability distribution or probability model predicts a sample. In NLP, it's used to evaluate language models by measuring how well the model predicts a given sequence of words. A lower PPL indicates a better model.
      - Conceptual Definition: Perplexity measures the uncertainty of a language model. It's the inverse probability of the test set, normalized by the number of words. A model with low perplexity is good at predicting the next word in a sequence.
      - Mathematical Formula: $ \text{PPL}(W) = \left( \prod_{i=1}^{N} \frac{1}{P(w_i | w_1, \dots, w_{i-1})} \right)^{\frac{1}{N}} = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i | w_1, \dots, w_{i-1}) \right) $
      - Symbol Explanation:
        - $W = (w_1, \dots, w_N)$: A sequence of words.
        - $P(w_i | w_1, \dots, w_{i-1})$: The probability assigned by the language model to the $i$-th word given the preceding $i-1$ words.
        - $N$: The total number of words in the sequence.
    - F1 Score: The harmonic mean of precision and recall, often used for classification tasks.
      - Conceptual Definition: The F1 score provides a single score that balances the precision and recall of a classifier. Precision measures the proportion of true positive predictions among all positive predictions, while recall measures the proportion of true positive predictions among all actual positive instances.
      - Mathematical Formula: $ \text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $ where $ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $ and $ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $
      - Symbol Explanation:
        - True Positives: Instances correctly identified as positive.
        - False Positives: Instances incorrectly identified as positive (Type I error).
        - False Negatives: Instances incorrectly identified as negative (Type II error).
        - Precision: The ability of the classifier not to label as positive a sample that is negative.
        - Recall: The ability of the classifier to find all the positive samples.
    - Accuracy: The proportion of correctly classified instances among the total number of instances.
      - Conceptual Definition: Accuracy measures the overall correctness of a classification model, indicating the ratio of correct predictions to the total number of predictions made.
      - Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
      - Symbol Explanation:
        - Number of Correct Predictions: The sum of true positives and true negatives.
        - Total Number of Predictions: The sum of true positives, true negatives, false positives, and false negatives.
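The metric definitions above can be turned into a few lines of code. The following is a minimal sketch for intuition, not a replacement for established evaluation toolkits: it assumes pre-tokenized input (lists of tokens), computes sentence-level rather than corpus-level BLEU, applies no smoothing, and breaks ties in the effective reference length in favour of the shorter reference. All function names are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision p_n: each candidate n-gram count is capped
    by its maximum count in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU with uniform weights w_n = 1/max_n and no smoothing."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU is 0 in this case
    c = len(candidate)
    # Effective reference length: the reference closest in length to the candidate.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def rouge_n(candidate, references, n=1):
    """ROUGE-N recall: matched reference n-grams over total reference n-grams."""
    cand_counts = Counter(ngrams(candidate, n))
    match = total = 0
    for ref in references:
        ref_counts = Counter(ngrams(ref, n))
        match += sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
        total += sum(ref_counts.values())
    return match / total if total else 0.0

def perplexity(token_probs):
    """Perplexity from the model's probability for each token in the sequence."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

For example, a candidate consisting of "the" repeated seven times scored against the reference "the cat is on the mat" has a clipped unigram precision of 2/7, because "the" appears only twice in the reference. This is the case the clipping in $\text{Count}_{\text{clip}}$ is designed to handle.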
3.3. Technological Evolution
The evolution of dialogue systems has largely progressed from rule-based systems to statistical machine learning approaches, and more recently, to deep learning and large language models (LLMs). Initially, systems were highly scripted and passive, following rigid rules. With the advent of statistical methods, systems gained some flexibility in understanding and generating language but remained largely reactive. The deep learning era brought significant advancements, especially with sequence-to-sequence models and later Transformers, enabling more fluent and contextually aware conversations.
This paper's work on proactive dialogue systems represents a crucial step in this evolution. It moves beyond mere response-ability (a system's ability to respond appropriately) towards proactivity (a system's ability to initiate and guide), which is essential for strong AI. Early attempts at proactivity were often limited to specific, simple interventions like topic introduction or suggestions. However, the integration of advanced deep learning techniques, reinforcement learning for policy learning, and sophisticated knowledge representation (e.g., knowledge graphs) has empowered proactive dialogue systems to tackle more complex, strategic, and motivational interactions, as highlighted in this survey. The field is moving from systems that merely respond to those that strategically engage and lead.
3.4. Differentiation Analysis
Compared to previous surveys in dialogue systems, this paper distinguishes itself by its singular focus on proactivity. While other surveys might cover open-domain, task-oriented, or information-seeking dialogues generally, or specific technical aspects like dialogue state tracking or response generation, this is explicitly presented as the "first survey to focus on proactive dialogue systems."
The core innovations and differences of this paper's approach are:
- Unified Definition of Proactivity: It provides a clear, derived definition of conversational agent's proactivity, grounding it in concepts from organizational behaviors, which helps unify understanding across diverse application areas.
- Cross-Domain Categorization: Instead of focusing on one type of dialogue, it systematically surveys proactivity across open-domain, task-oriented, and information-seeking dialogues, revealing common themes and distinct challenges.
- Problem-Centric View: It structures its analysis around "prominent problems" that proactivity aims to solve (e.g., target-guided conversation, prosocial dialogue, non-collaborative dialogue, clarification questions, user preference elicitation), rather than just surveying models.
- Forward-Looking Perspective: It dedicates a significant section to "Challenges and Prospects," offering a critical perspective on hybrid dialogues, evaluation protocols, and ethics that are specific to the proactive nature of these systems. This future-oriented view is crucial for guiding subsequent research.

In essence, while foundational dialogue system concepts are well-surveyed, this paper carves out a new, important sub-field, systematically defining, categorizing, and charting the progress and future of proactive conversational AI.
4. Methodology
This paper is a survey, so its "methodology" is the systematic approach it takes to review and categorize existing research on proactive dialogue systems. The authors define proactivity and then structure their analysis around how this concept manifests in three major types of dialogue systems: open-domain, task-oriented, and information-seeking.
4.1. Principles
The core idea behind the survey's methodology is to provide a comprehensive, structured overview of proactivity in dialogue systems. The theoretical basis is the definition of proactivity itself: the capability of a conversational agent to "create or control the conversation by taking the initiative and anticipating the impacts on themselves or human users, rather than only passively responding to the users." The intuition is that moving beyond passive response-ability towards proactivity is essential for more intelligent, engaging, and goal-oriented conversational AI.
The survey's methodological principles include:
- Problem-Driven Analysis: Identifying specific problems (e.g., target-guided conversation, non-collaborative dialogue) where proactivity is a key solution.
- Solution-Oriented Review: Detailing advanced designs and methods developed to implement proactivity for these problems.
- Resource Mapping: Connecting problems and solutions to available data resources and evaluation protocols.
- Forward-Looking Critique: Discussing open challenges and future research directions to guide the community.
4.2. Core Methodology In-depth (Layer by Layer)
The survey systematically breaks down proactive dialogue systems into categories based on the type of dialogue and then further into specific problems within those categories. For each problem, it describes the definition, common methods, relevant datasets, and evaluation protocols.
The following figure (Figure 1 from the original paper) shows the classification structure of proactive dialogue systems:
Figure 1: Summary of proactive dialogue systems.
The following figure (Figure 2 from the original paper) illustrates examples for different problems in proactive dialogue systems:
Figure 2: Examples for different problems in proactive dialogue systems.
4.2.1. Proactive Open-domain Dialogue Systems
Open-domain dialogue systems (ODD) traditionally aim for general social interaction. Proactivity here means the system doesn't just echo user topics or emotions but actively guides the conversation.
4.2.1.1. Target-guided Dialogues
- Problem Definition: Given a secret target known only to the agent, the system must lead a conversation, starting from any initial topic, towards that target over multiple turns. The generated responses must ensure:
  - Transition smoothness: Natural and appropriate content within the dialogue history.
  - Target achievement: Successfully guiding the conversation to the designated target.

  The target can be a topical keyword [Tang et al., 2019], a knowledge entity [Wu et al., 2019], or a conversational goal [Liu et al., 2020]. The system maintains a set of candidate targets.
- Methods: Involve three main subtasks:
  - Topic-shift Detection: Identifying when the user introduces a new topic.
    - Rachna et al. [2021] fine-tune XLNet-base to classify utterances into major, minor, or off topics.
    - Xie et al. [2021] created the TIAGE dataset with topic-shift annotations and proposed TSMANAGER, a T5-based model, to predict topic shifts.
  - Topic Planning: The core of target-guided dialogues, aiming to make the conversation follow a desired path.
    - Early works [Tang et al., 2019; Zhong et al., 2021] use discourse-level strategies constrained by keyword transitions.
    - To improve coherence, event knowledge graphs are constructed [Xu et al., 2020].
    - Latest studies [Yang et al., 2022] leverage external knowledge graphs and graph reasoning for better topic path planning.
    - Lei et al. [2022] learn topic transitions from user interactions instead of corpus-based methods.
  - Topic-aware Response Generation: Producing responses relevant to the planned topic path.
    - Kishinami et al. [2022] generate a complete responding plan to lead to the target.
    - Gupta et al. [2022] use a bridging path of commonsense knowledge concepts between current and target topics to generate transition responses.
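As a rough illustration of topic planning over a keyword graph, the sketch below finds the shortest chain of keyword transitions from the current topic to the target using breadth-first search. This is a deliberate simplification and not the method of any cited work, which typically learn transition scores or reason over much larger knowledge graphs. The graph contents and the function name `plan_topic_path` are hypothetical.

```python
from collections import deque

def plan_topic_path(graph, start, target):
    """Breadth-first search for the shortest keyword path from start to target.

    graph maps each keyword to the keywords it may smoothly transition to.
    Returns the path as a list of keywords, or None if the target is unreachable.
    """
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

# Toy keyword transition graph (assumed for illustration).
graph = {
    "weather": ["rain", "weekend"],
    "rain": ["umbrella"],
    "weekend": ["hiking", "movies"],
    "movies": ["science fiction"],
    "hiking": ["mountains"],
}
path = plan_topic_path(graph, "weather", "science fiction")
print(path)
```

Each hop in the returned path would then condition one system response, which is where topic-aware response generation takes over. Shorter paths reach the target faster but may trade off transition smoothness, which is why the surveyed methods score transitions rather than simply minimizing hops.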
4.2.1.2. Prosocial Dialogues
- Problem Definition: Given a dialogue context (the preceding utterances), the system first classifies the safety label of the user's utterance and then generates an appropriate response to mitigate problematic user utterances (e.g., unsafe, unethical, toxic) by leading the conversation in a prosocial manner (following social norms, benefiting others).
- Methods:
  - Safety Detection: Identifying problematic user utterances.
    - Dinan et al. [2019] developed human-in-the-loop training for offensive utterance detection, improved by adversarial learning [Xu et al., 2021].
    - Baheti et al. [2021] fine-tuned offensive language detection classifiers on ToxiChat.
    - Kim et al. [2022] introduced a fine-grained safety classification schema (Needs Caution, Needs Intervention, Casual) to avoid social exclusion for minority users.
  - Rule-of-Thumb (RoT) Generation: Explaining why a statement is acceptable or problematic.
    - Forbes et al. [2020] introduced the SOCIAL-CHEM-101 corpus and NORM TRANSFORMER for reasoning about social norms.
    - Ziems et al. [2022] proposed MORAL TRANSFORMER to generate new RoTs.
    - Kim et al. [2022] developed Canary, a sequence-to-sequence model generating both safety labels and RoTs.
  - Prosocial Response Generation: Actively generating appropriate responses.
    - Baheti et al. [2021] investigated controllable text generation to mitigate agreement with offensive user utterances.
    - Kim et al. [2022] proposed Prost to generate prosocial responses conditioned on RoTs and dialogue context.
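The three subtasks above compose naturally into a pipeline: classify safety, explain with a rule of thumb, then respond. The sketch below shows only that control flow. The three stage functions are toy keyword-based placeholders invented for this example; in the surveyed work each stage is a trained neural model, and none of the names here correspond to an actual API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProsocialTurn:
    safety_label: str             # "Casual", "Needs Caution", or "Needs Intervention"
    rule_of_thumb: Optional[str]  # None when the context is judged casual
    response: str

def prosocial_pipeline(context, classify, explain, respond):
    """Run safety detection, RoT generation, and response generation in order."""
    label = classify(context)
    rot = explain(context) if label != "Casual" else None
    return ProsocialTurn(label, rot, respond(context, rot))

# Toy stand-ins for the three stages (assumptions, not real models).
def toy_classify(context):
    last = context[-1].lower()
    if "hurt" in last:
        return "Needs Intervention"
    if "cheat" in last:
        return "Needs Caution"
    return "Casual"

def toy_explain(context):
    return "It is wrong to cheat on exams."

def toy_respond(context, rot):
    if rot is None:
        return "That sounds nice, tell me more."
    return f"I understand the pressure, but {rot[0].lower()}{rot[1:]} Could a study group help?"

turn = prosocial_pipeline(
    ["I'm going to cheat on my exam tomorrow."], toy_classify, toy_explain, toy_respond
)
casual = prosocial_pipeline(["I love hiking."], toy_classify, toy_explain, toy_respond)
```

Passing the stages in as callables keeps the control flow separate from the models, so a pipeline of separate components and a single model that emits both label and RoT can be swapped in without changing the surrounding logic.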
4.2.2. Proactive Task-oriented Dialogue Systems
TOD systems usually act as obedient assistants. Proactivity here allows them to handle non-collaborative tasks or enrich conversations with unrequested but useful information.
4.2.2.1. Non-collaborative Dialogues
- Problem Definition: Under non-collaborative settings, the system and user have competing interests but aim for an agreement (e.g., price negotiation, persuasion). The goal is to generate a response with an appropriate dialogue strategy that leads to a consensus state, given the dialogue history, previous strategies, and dialogue background. The dialogue strategy can be coarse or fine-grained.
- Methods:
  - Dialogue Strategy Learning: Beyond intent detection, this requires strategic reasoning.
    - He et al. [2018] decoupled strategy and generation to control dialogue strategy for different negotiation goals.
    - Zhou et al. [2020] used finite state transducers (FSTs) to predict the next strategy from effective sequences.
    - Advanced models like DIALOGRAPH [Joshi et al., 2021] with interpretable strategy-graph networks and RESPER [Dutt et al., 2021] with resisting strategy modeling have been developed.
  - User Personality Modeling: Understanding human decision-making.
    - Yang et al. [2021] generated strategic dialogue by modeling opponent personality types using Theory of Mind (ToM).
    - Shi et al. [2021] developed DialGAIL, an RL-based generative algorithm with separate user/system profile builders to reduce repetition.
  - Persuasive Response Generation: Generating responses that lead to consensus.
    - Modularized [He et al., 2018] and end-to-end [Li et al., 2020; Wu et al., 2021] methods incorporate persuasive strategies.
    - Recent studies [Mishra et al., 2022; Samad et al., 2022] focus on building empathetic connections for persuasion.
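To give a feel for predicting the next dialogue strategy from past strategy sequences, here is a minimal sketch that counts first-order transitions and picks the most frequent successor. This is a plain Markov-style simplification standing in for the finite state transducers mentioned above, not a reimplementation of them. The strategy labels and example sequences are invented for illustration.

```python
from collections import Counter, defaultdict

def fit_transitions(sequences):
    """Count how often each strategy is followed by each other strategy."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            transitions[prev][nxt] += 1
    return transitions

def predict_next(transitions, current):
    """Return the most frequent successor of the current strategy, or None if unseen."""
    if current not in transitions:
        return None
    return transitions[current].most_common(1)[0][0]

# Hypothetical strategy sequences from successful negotiations.
sequences = [
    ["greet", "inquire", "propose_price", "agree"],
    ["greet", "inquire", "propose_price", "counter", "agree"],
    ["greet", "propose_price", "counter", "agree"],
]
transitions = fit_transitions(sequences)
print(predict_next(transitions, "propose_price"))
```

A first-order model like this ignores everything except the most recent strategy. The graph-based and personality-aware models listed above exist precisely because effective negotiation depends on longer history and on the opponent.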
4.2.2.2. Enriched Task-oriented Dialogues
- Problem Definition: The system proactively provides additional information not explicitly requested but useful to the user, improving the quality and effectiveness of functional service. The problem formulation is similar to general TODs, but responses should be both functionally accurate and socially engaging.
- Methods:
  - Sun et al. [2021] created ACCENTOR by adding topical chit-chats to TOD responses.
  - SimpleTOD [Hosseini-Asl et al., 2020] was extended to handle enriched TODs with a new chit-chat dialogue action.
  - Zhao et al. [2022] developed UniDS, an end-to-end method with a unified dialogue data schema for both chit-chat and task-oriented dialogues.
  - Chen et al. [2022b] proposed KETOD for knowledge-grounded chit-chat regarding relevant entities, using a pipeline-based method called Combiner to reduce interference.
4.2.3. Proactive Conversational Information Seeking Systems
CIS systems fulfill user information needs. Proactivity eliminates uncertainty for efficient and precise information seeking by initiating subdialogues.
4.2.3.1. Asking Clarification Questions
- Problem Definition: Clarifying ambiguity in user queries. Formulated as two subtasks [Aliannejadi et al., 2021]:
  - Clarification need prediction: A binary classification problem to predict if a query is ambiguous.
  - Clarification question generation: Generating the actual question, either by selection from a bank [Aliannejadi et al., 2019] or on the fly [Zamani et al., 2020].
- Methods:
  - Aliannejadi et al. [2019] proposed NeuQS, a question retrieval-selection pipeline using BERT-based models for re-ranking.
  - Zamani et al. [2020] developed QCM, a reinforcement learning based method to generate questions by maximizing a clarification utility function.
  - Complete pipeline-based systems are presented by Aliannejadi et al. [2021] and Guo et al. [2021], with binary classification for clarification need followed by question generation.
  - Deng et al. [2022a] proposed UniPCQA, an end-to-end framework using a unified sequence-to-sequence formulation for clarification need prediction, question generation, and conversational question answering.
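The two-subtask formulation above (predict clarification need, then produce a question) can be illustrated with a minimal sketch. It is not the implementation of any cited system: the ambiguity score is assumed to come from some external classifier, the 0.5 threshold is arbitrary, and plain word overlap stands in for the learned re-ranking used in retrieval-selection pipelines. All function names are hypothetical.

```python
def needs_clarification(ambiguity_score, threshold=0.5):
    """Subtask 1: binary decision on whether the query is ambiguous.

    ambiguity_score is assumed to be produced by an external classifier.
    """
    return ambiguity_score >= threshold


def select_question(query, question_bank):
    """Subtask 2 (selection variant): pick the bank question that shares
    the most lowercase whitespace tokens with the query."""
    query_tokens = set(query.lower().split())

    def overlap(question):
        return len(query_tokens & set(question.lower().split()))

    return max(question_bank, key=overlap)


def respond(query, ambiguity_score, question_bank, answer_fn):
    """Pipeline: ask a clarification question if needed, otherwise answer."""
    if needs_clarification(ambiguity_score):
        return ("clarify", select_question(query, question_bank))
    return ("answer", answer_fn(query))


bank = [
    "Are you referring to the movie Lion or the animal?",
    "Which city do you mean?",
]
print(respond("Tell me about Lion", 0.8, bank, lambda q: "..."))
```

An end-to-end model would replace both stages with a single generator; the pipeline form is shown only because it mirrors the two subtasks directly.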
4.2.3.2. User Preference Elicitation
- Problem Definition: Proactively acquiring user preferences by asking questions in conversational recommendation (e.g., "Which brand of laptop do you prefer?"), rather than passively learning from context. This involves predicting the item attribute to ask about next.
- Methods:
  - Zhang et al. [2018b] designed PMMN (personalized multi-memory network) to incorporate user embeddings into next question prediction at the turn level.
  - For multi-turn interactions, recent works tackle user preference elicitation at the dialogue level as a multi-step decision making process using reinforcement learning (RL) [Deng et al., 2021; Zhang et al., 2022].
  - Deng et al. [2021] proposed UNICORN, a graph-based RL framework for policy learning that models real-time user preference with a dynamic weighted graph structure.
  - Zhang et al. [2022] proposed the MCMIPL framework for multi-choice question asking to efficiently obtain user preferences.
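To make "predicting the item attribute to ask about next" concrete, here is a small sketch of a multi-turn elicitation loop. It deliberately swaps in a simple greedy heuristic (ask about the attribute whose values split the remaining candidates most evenly, measured by entropy) in place of the learned RL policies cited above. The item schema, the `name` key, and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def attribute_entropy(items, attribute):
    """Shannon entropy (bits) of an attribute's values over candidate items."""
    counts = Counter(item[attribute] for item in items)
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def next_attribute(items, asked):
    """Greedy choice: the not-yet-asked attribute with the highest entropy."""
    candidates = [a for a in items[0] if a != "name" and a not in asked]
    if not candidates:
        return None
    return max(candidates, key=lambda a: attribute_entropy(items, a))


def elicit(items, user_answers, max_turns=3):
    """Ask up to max_turns questions, filtering candidates after each answer.

    user_answers maps attribute -> the simulated user's preferred value.
    """
    asked = []
    for _ in range(max_turns):
        if len(items) <= 1:
            break
        attribute = next_attribute(items, asked)
        if attribute is None:
            break
        asked.append(attribute)
        items = [i for i in items if i[attribute] == user_answers[attribute]]
    return asked, items


laptops = [
    {"name": "A", "brand": "X", "size": "13"},
    {"name": "B", "brand": "X", "size": "15"},
    {"name": "C", "brand": "Y", "size": "13"},
    {"name": "D", "brand": "Z", "size": "13"},
]
print(elicit(laptops, {"brand": "X", "size": "15"}))
```

A learned policy would additionally decide when to stop asking and recommend; this sketch only stops when one candidate or no unasked attribute remains.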
5. Experimental Setup
As this paper is a survey, it does not present its own experimental setup in the traditional sense of proposing a new model and evaluating it. Instead, it systematically reviews the datasets and evaluation protocols used by the proactive dialogue systems research it covers.
5.1. Datasets
The paper summarizes representative datasets used for evaluating various types of proactive dialogue systems.
The following are the results from [Table 1] of the original paper:
| Dataset | Problem | Language | #Dial. | #Turns | Featured Annotations |
|---|---|---|---|---|---|
| TGC [Tang et al., 2019] | Target-guided Dialogues | English | 9,939 | 11.35 | Turn-level Topical Keywords |
| DuConv [Wu et al., 2019] | Target-guided Dialogues | Chinese | 29,858 | 9.1 | Turn-level Entities & Dialogue-level Goals |
| MIC [Ziems et al., 2022] | Prosocial Dialogues | English | 38K | 2.0 | Rules of Thumbs (RoTs) & Revised Responses |
| ProsocialDialog [Kim et al., 2022] | Prosocial Dialogues | English | 58K | 5.7 | Safety Labels and Reasons & RoTs |
| CraigslistBargain [He et al., 2018] | Non-collaborative Dialogues | English | 6,682 | 9.2 | Coarse Dialogue Acts |
| P4G [Wang et al., 2019] | Non-collaborative Dialogues | English | 1,017 | 10.43 | Dialogue Strategies |
| ACCENTOR [Sun et al., 2021] | Enriched Task-oriented Dialogues | English | 23.8K | | Enriched Responses with Chit-chats |
| KETOD [Chen et al., 2022b] | Enriched Task-oriented Dialogues | English | 5,324 | 9.78 | Turn-level Entities & Enriched Responses with Knowledge |
| Abg-CoQA [Guo et al., 2021] | Asking Clarification Questions | English | 8,615 | 5.0 | Clarification Need Labels and Questions |
| PACIFIC [Deng et al., 2022a] | Asking Clarification Questions | English | 2,757 | 6.89 | Clarification Need Labels and Questions |
Here's a description of selected datasets:
- TGC (Target-Guided Conversation) [Tang et al., 2019]:
  - Source/Characteristics: Constructed from Persona-Chat by removing persona information and defining targets as keywords. Targets are automatically extracted.
  - Domain: Open-domain dialogue, focused on guiding conversations towards specific keywords.
  - Example Data Sample: A conversation where the agent tries to steer the discussion towards a specific keyword like "Music" or "Blackpink." The dataset would contain turns of dialogue with an associated target keyword for each conversation. For instance, a dialogue might start about daily life and the agent tries to subtly introduce "music" as a topic.
- DuConv [Wu et al., 2019]:
  - Source/Characteristics: Human-human conversations generated based on two linked entities from a grounded knowledge graph.
  - Domain: Open-domain, knowledge-driven, target-guided dialogues in Chinese.
  - Example Data Sample: A conversation about "pizza" might have the agent trying to steer towards "Italy" as a knowledge entity or "cooking" as a dialogue-level goal, leveraging facts from a knowledge graph.
- MIC (Moral Integrity Corpus) [Ziems et al., 2022]:
  - Source/Characteristics: Manually annotated prompt-reply pairs (AI-generated response to an open query) with Rules of Thumb (RoTs) from SOCIAL-CHEM-101 [Forbes et al., 2020].
  - Domain: Prosocial dialogues, evaluating the ethical and moral aspects of AI responses.
  - Example Data Sample: A user might say something like "I really don't feel like sharing my notes, even though my friend helped me study." An AI-generated response might be assessed with an RoT like "It's generally good to reciprocate kindness" to guide a revised, more prosocial response.
- ProsocialDialog [Kim et al., 2022]:
  - Source/Characteristics: Human-AI collaboration where AI acts as a problematic user, and crowdworkers act as a prosocial agent. Includes safety labels, RoTs, and prosocial responses.
  - Domain: Prosocial dialogues, specifically for handling problematic user utterances.
  - Example Data Sample: A user (AI) might express a problematic statement (e.g., "Cheating is sometimes okay to pass a difficult class."). The dataset would contain the safety label for this statement, a relevant RoT (e.g., "Academic integrity is important"), and a prosocial agent's response that addresses the issue constructively.
- CraigslistBargain [He et al., 2018]:
  - Source/Characteristics: Human-human conversations where workers negotiate the price of an item.
  - Domain: Non-collaborative, bargain negotiation in task-oriented dialogues.
  - Example Data Sample: A dialogue between a buyer and a seller over the price of a used bicycle, where each side has a different goal (buyer wants lower price, seller wants higher). Coarse Dialogue Acts (e.g., offer, accept, reject, propose) are annotated.
- P4G (PERSUASIONFORGOOD) [Wang et al., 2019]:
  - Source/Characteristics: Persuasion conversations for charity donation, including user profiles and manual annotations of persuasion strategies and dialogue acts.
  - Domain: Non-collaborative, persuasion dialogues (e.g., convincing to donate).
  - Example Data Sample: A conversation where a system tries to persuade a user to donate to a charity, leveraging information about the user's interests (from their profile) and applying specific persuasion strategies (e.g., appeal to emotion, establish common ground).
- ACCENTOR [Sun et al., 2021]:
  - Source/Characteristics: Augments task-oriented dialogues with topical chit-chats.
  - Domain: Enriched task-oriented dialogues, making interactions more engaging.
  - Example Data Sample: In a restaurant booking dialogue, after confirming details, the system might add a chit-chat like "I hope you enjoy your meal, Italian food is my favorite!"
- KETOD (Knowledge-Enriched Task-Oriented Dialogue) [Chen et al., 2022b]:
  - Source/Characteristics: Provides turn-level entities and enriched responses with knowledge.
  - Domain: Enriched task-oriented dialogues with knowledge-grounded chit-chat.
  - Example Data Sample: A user asks to book a hotel in Paris. The system confirms and then proactively adds a knowledge-enriched chit-chat like "Did you know Paris is also known as the City of Love? There are many romantic spots near your hotel."
- Abg-CoQA (Ambiguous Conversational Question Answering) [Guo et al., 2021]:
  - Source/Characteristics: Includes clarification need labels and questions.
  - Domain: Conversational Question Answering where ambiguity needs to be resolved.
  - Example Data Sample: User: "Tell me about 'Lion'." The system might classify this as ambiguous (movie, animal, king?) and generate a clarification question like "Are you referring to the movie 'Lion' or the animal?"
- PACIFIC (Proactive Conversational Question Answering over Tabular and Textual Data in Finance) [Deng et al., 2022a]:
  - Source/Characteristics: Focuses on clarification need labels and questions over tabular and textual financial data.
  - Domain: Conversational QA in the finance domain, requiring clarification.
  - Example Data Sample: User: "What is the yield of Apple?" (Apple stock, apple orchard?). The system detects ambiguity and asks: "Are you referring to Apple Inc. stock or a type of fruit?"

These datasets are chosen because they provide concrete examples and annotations for the specific proactive behaviors discussed (e.g., target tracking, prosocial interventions, negotiation strategies, clarification needs), making them effective for validating the performance of proactive dialogue systems.
5.2. Evaluation Metrics
The paper discusses a range of evaluation metrics, distinguishing between general dialogue system metrics and those specific to proactivity.
5.2.1. General Evaluation Metrics
- BLEU, Dist-N, PPL: Commonly used for text generation tasks.
  - BLEU (Bilingual Evaluation Understudy): (See explanation in Section 3.2. for BLEU)
  - Dist-N (Distinct N-grams):
    - Conceptual Definition: Distinct N-grams (Dist-N) measures the diversity of generated responses as the ratio of unique n-grams to the total number of n-grams. A higher Dist-N indicates greater diversity and less repetition in the generated text. Dist-1 is computed over unigrams, Dist-2 over bigrams, etc.
    - Mathematical Formula: $ \text{Dist-N} = \frac{\text{Count of unique n-grams}}{\text{Total number of n-grams}} $
    - Symbol Explanation:
      - Count of unique n-grams: The number of distinct n-grams in the generated text.
      - Total number of n-grams: The total number of n-grams in the generated text.
  - PPL (Perplexity): (See explanation in Section 3.2. for PPL)
- ROUGE: (See explanation in Section 3.2. for ROUGE)
- F1 Score: (See explanation in Section 3.2. for F1 Score)
- Accuracy: (See explanation in Section 3.2. for Accuracy)
- ROC AUC (Receiver Operating Characteristic Area Under the Curve):
  - Conceptual Definition: ROC AUC is a performance measurement for classification problems across threshold settings. The ROC curve plots the True Positive Rate against the False Positive Rate as the decision threshold varies, and AUC, the area under that curve, summarizes how well the model separates the classes. The higher the AUC, the better the model is at ranking positive instances above negative ones.
  - Mathematical Formula:
    $
    \text{AUC} = \int_0^1 \text{TPR}(\text{FPR}) \, d(\text{FPR})
    $
    where TPR is the True Positive Rate (Sensitivity or Recall) and FPR is the False Positive Rate (1 - Specificity). $ \text{TPR} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $ $ \text{FPR} = \frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}} $
  - Symbol Explanation:
    - TPR: True Positive Rate (or Recall), the proportion of actual positive cases that are correctly identified.
    - FPR: False Positive Rate, the proportion of actual negative cases that are incorrectly identified as positive.
    - True Positives: Instances correctly identified as positive.
    - False Positives: Instances incorrectly identified as positive.
    - True Negatives: Instances correctly identified as negative.
    - False Negatives: Instances incorrectly identified as negative.
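The Dist-N and ROC AUC definitions above can be checked on toy data with a short sketch. Two assumptions are made here: Dist-N is pooled over all responses with whitespace tokenization (some papers average per response instead), and AUC is computed through its equivalent pairwise-ranking form (the probability that a random positive is scored above a random negative, ties counted as 0.5) rather than by integrating the curve. Function names are illustrative.

```python
def dist_n(texts, n):
    """Unique n-grams divided by total n-grams, pooled over all texts."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


print(dist_n(["a b a b"], 1))  # 2 unique unigrams out of 4
print(dist_n(["a b a b"], 2))  # 2 unique bigrams out of 3
print(roc_auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6]))  # 3 of 4 pairs ordered correctly
```

The pairwise form is quadratic in the number of examples, which is fine for illustration; library implementations typically sort the scores instead.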
5.2.2. Proactivity-Specific Evaluation Protocols
5.2.2.1. For Target-guided Dialogues (Turn-level & Dialogue-level)
- Turn-level Evaluation:
  - P@K (Precision at K) and R@K (Recall at K): For keyword prediction in each turn.
    - Conceptual Definition: P@K measures the proportion of the top-K retrieved items (keywords in this case) that are relevant (actual targets). R@K measures the proportion of all relevant items that are found among the top-K retrieved items. These are commonly used in ranking and retrieval tasks.
    - Mathematical Formula: $ \text{P@K} = \frac{\text{Number of relevant items in top-K}}{\text{K}} $ $ \text{R@K} = \frac{\text{Number of relevant items in top-K}}{\text{Total number of relevant items}} $
    - Symbol Explanation:
      - $K$: The number of top items considered.
      - Number of relevant items in top-K: How many of the predicted top-K keywords match the actual target.
      - Total number of relevant items: The total number of actual target keywords for that turn (usually 1 for a single target).
  - Embedding-based correlation scores: Measure semantic similarity between predicted and target topics.
  - Proactivity/Smoothness: Human evaluation scores for how well the system introduces new topics while maintaining coherence.
- Dialogue-level Evaluation: Often uses user simulators due to the cost of real user experiments.
  - SR@t (Success Rate at turn t): Success rate of achieving targets by the $t$-th turn.
    - Conceptual Definition: SR@t measures the cumulative percentage of dialogues where the system successfully achieved its target goal within a specified number of turns ($t$).
    - Mathematical Formula: $ \text{SR@t} = \frac{\text{Number of dialogues with target achieved by turn t}}{\text{Total number of dialogues}} $
    - Symbol Explanation:
      - $t$: The maximum number of turns allowed for target achievement.
      - Number of dialogues with target achieved by turn t: Count of dialogues where the target was reached within $t$ turns.
      - Total number of dialogues: Total number of evaluation dialogues.
  - #Turns: Average number of turns to reach the target.
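As a worked illustration of the turn-level and dialogue-level formulas above, the following sketch computes P@K, R@K, SR@t, and the average number of turns on toy inputs. The data layout is an assumption: each dialogue is summarized by the turn at which the target was reached, or `None` if it never was. Averaging #Turns over successful dialogues only is also an assumption, since papers differ on how failed dialogues are counted.

```python
def precision_recall_at_k(ranked, relevant, k):
    """P@K and R@K for one ranked keyword list against a set of targets."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def success_rate_at_t(turns_to_target, t):
    """Share of dialogues whose target was reached by turn t."""
    if not turns_to_target:
        return 0.0
    reached = sum(1 for x in turns_to_target if x is not None and x <= t)
    return reached / len(turns_to_target)


def average_turns(turns_to_target):
    """Mean number of turns over dialogues that reached the target."""
    done = [x for x in turns_to_target if x is not None]
    return sum(done) / len(done) if done else float("nan")


print(precision_recall_at_k(["music", "food", "sports"], {"music"}, 2))
outcomes = [3, None, 6, 8]  # turn at which each dialogue hit its target
print(success_rate_at_t(outcomes, 6))
print(average_turns(outcomes))
```

With a single target keyword per turn, R@K reduces to a 0/1 hit indicator, which is why it is often reported on its own for keyword prediction.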
5.2.2.2. For Prosocial Dialogues
- Human evaluation or trained classification models are used to quantify attributes like agreement, respect, fairness, etc., alongside general text generation metrics (ROUGE, BLEU, PPL).
5.2.2.3. For Non-collaborative Dialogues
- Dialogue strategy prediction accuracy: Measures how well the system predicts or learns optimal strategies (using Accuracy, F1, ROC AUC).
- Human evaluation for persuasiveness and task success.
5.2.2.4. For User Preference Elicitation in Conversational Recommendation
- Turn-level:
  - HR@k,t (Hit Ratio at k, turn t): For next question prediction (top-k predicted attributes at turn t).
    - Conceptual Definition: HR@k,t measures whether the correct item/attribute is among the top-k recommendations/predictions at turn $t$.
    - Mathematical Formula: Typically, HR@k is calculated as: $ \text{HR@k} = \frac{\text{Number of users for whom the target item is in top-k recommendations}}{\text{Total number of users}} $ For HR@k,t, this would apply at a specific turn $t$.
    - Symbol Explanation:
      - $k$: The size of the recommendation/prediction list.
      - $t$: The current turn of the conversation.
      - Number of users...: Count of users where the target preference was among the top-k predicted attributes for elicitation at turn $t$.
      - Total number of users: Total users in the evaluation.
- General recommendation metrics (MRR@k,t, MAP@k,t, NDCG@k,t) for item recommendation based on elicited preferences.
  - MRR@k (Mean Reciprocal Rank at k):
    - Conceptual Definition: MRR is a statistical measure for evaluating any process that produces a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank is the inverse of the rank of the first correct answer. If no correct answer is found, the reciprocal rank is 0. MRR@k considers only the first correct answer up to rank $k$.
    - Mathematical Formula: $ \text{MRR@k} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i} \quad (\text{if rank}_i \le k, \text{else } 0) $
    - Symbol Explanation:
      - $|Q|$: The total number of queries (or sessions).
      - $\text{rank}_i$: The rank of the first relevant item for the $i$-th query.
      - $k$: The maximum rank to consider.
  - MAP@k (Mean Average Precision at k):
    - Conceptual Definition: MAP is a popular metric for evaluating ranked retrieval results. For each query, it averages the precision values at the ranks where relevant items occur, and then averages these Average Precision values across all queries. MAP@k is MAP truncated at $k$ items.
    - Mathematical Formula:
      $
      \text{MAP@k} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \text{AP@k}(i)
      $
      where $\text{AP@k}(i)$ is the Average Precision for query $i$: $ \text{AP@k}(i) = \frac{1}{\min(k, R_i)} \sum_{j=1}^{k} P(j) \cdot \text{rel}(j) $ $P(j)$ is the precision at cut-off $j$ in the list, $\text{rel}(j)$ is an indicator function equal to 1 if the item at rank $j$ is relevant and 0 otherwise, and $R_i$ is the number of relevant items for query $i$ (normalizing by $\min(k, R_i)$ is one common convention).
    - Symbol Explanation:
      - $|Q|$: The total number of queries.
      - $\text{AP@k}(i)$: Average Precision for the $i$-th query up to rank $k$.
      - $k$: The maximum rank to consider.
      - $P(j)$: Precision at rank $j$.
      - $\text{rel}(j)$: Indicator equal to 1 if the item at rank $j$ is relevant, and 0 otherwise.
      - $R_i$: The number of relevant items for the $i$-th query.
1. Bibliographic Information
1.1. Title
A Survey on Proactive Dialogue Systems: Problems, Methods, and Prospects
1.2. Authors
Yang Deng, Wenqiang Lei, Wai Lam, Tat-Seng Chua
National University of Singapore Sichuan University The Chinese University of Hong Kong
1.3. Journal/Conference
Published on arXiv, a preprint server.
Comment on Venue's Reputation: arXiv is a widely respected open-access archive for preprints of scientific papers in fields like mathematics, physics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. While not a peer-reviewed journal or conference in itself, it serves as a crucial platform for early dissemination of research, allowing authors to share their work rapidly and gather feedback before or during formal peer review. Many highly influential papers are first published on arXiv.
1.4. Publication Year
2023
1.5. Abstract
This paper presents a comprehensive overview of proactive dialogue systems, which are conversational agents capable of guiding conversations towards predefined targets or system goals, rather than merely responding passively to user input. The survey highlights the prominent problems and the advanced design techniques for proactivity in conversational agents across various types of dialogues. It further discusses critical challenges related to real-world applications that demand increased future research. The authors express hope that this first survey on proactive dialogue systems will offer the research community accessible insights and a holistic understanding of this practical problem, thereby fostering further advancements in conversational AI.
1.6. Original Source Link
https://arxiv.org/abs/2305.02750
Publication Status: This is a preprint available on arXiv.
1.7. PDF Link
https://arxiv.org/pdf/2305.02750v2.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem this paper aims to solve is the lack of proactivity in conventional dialogue systems. Traditional dialogue systems are primarily designed to be response-able, meaning they passively follow user-initiated conversations or fulfill user requests. This includes systems like open-domain dialogue systems, task-oriented dialogue systems, and conversational information-seeking systems.
This problem is important because the absence of proactivity hinders the development of truly intelligent and human-like conversational agents. The paper highlights several specific challenges and gaps in prior research:
- Limited User Engagement and Service Efficiency: Passive systems may struggle to maintain engaging conversations or efficiently guide users towards specific outcomes.
- Inability to Handle Complicated Tasks: Tasks requiring strategical and motivational interactions go beyond simple responsiveness.
- Limitations of Current Advanced Systems: Even powerful models like ChatGPT exhibit limitations due to a lack of proactivity, such as providing randomly guessed answers to ambiguous queries or failing to handle problematic (harmful/biased) requests constructively.
- Step Towards Strong AI: Proactivity is considered a significant step towards achieving strong AI that possesses autonomy and human-like consciousness.

The paper's entry point is recognizing that while early attempts (e.g., introducing new topics or suggestions) identified the need, recent years have seen many advanced designs for conversational agent's proactivity across various task formulations and application scenarios. This survey aims to consolidate and categorize these diverse efforts.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- First Comprehensive Survey: It provides the first systematic and comprehensive overview of proactive dialogue systems, summarizing recent studies across three common types of dialogues: open-domain dialogues, task-oriented dialogues, and information-seeking dialogues.
- Categorization of Problems and Designs: For each dialogue type, it identifies prominent problems (e.g., target-guided dialogues, prosocial dialogues, non-collaborative dialogues, enriched task-oriented dialogues, asking clarification questions, user preference elicitation) and the advanced designs (methods) used to address them.
- Resource and Evaluation Overview: It presents available data resources and commonly adopted evaluation protocols pertinent to each problem area, aiding researchers in accessing and assessing relevant work.
- Identification of Future Research Directions: It discusses significant open challenges and promising research prospects, including proactivity in hybrid dialogues, evaluation protocols for proactivity, and ethics of conversational agent's proactivity.

Key conclusions and findings reached by the paper include:
- Proactivity is an essential property for intelligent conversations that can significantly improve user engagement and service efficiency.
- The field has progressed beyond simple topic introduction to handle complex strategical and motivational interactions.
- A diverse set of proactive capabilities has emerged, tailored to the specific goals of different dialogue types.
- Despite advancements, challenges remain in areas such as seamlessly integrating multiple conversational goals (hybrid dialogues), developing robust and multi-disciplinary evaluation metrics, and addressing ethical considerations (factuality, morality, privacy) inherent in proactive systems.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with the following fundamental concepts:
- Dialogue Systems: Broadly, dialogue systems (also known as conversational AI or chatbots) are computer programs designed to converse with human users using natural language. They aim to simulate human conversation.
  - Open-domain Dialogue Systems (ODD): These systems aim to engage in general conversations on a wide range of topics, often for social support or entertainment. Examples include chatbots like ChatGPT or Meena. Their primary goal is to maintain a coherent and engaging conversation.
  - Task-oriented Dialogue Systems (TOD): These systems are designed to help users accomplish specific tasks in particular domains, such as booking flights, making restaurant reservations, or providing customer service. They typically involve understanding user intent, managing dialogue state, and interacting with external databases or APIs.
  - Conversational Information-Seeking Systems (CIS): These systems facilitate finding information through natural language interactions. This includes conversational search, conversational recommendation, and conversational question answering, where the system helps refine queries or explore preferences.
- Proactivity: In the context of dialogue systems, proactivity refers to the system's capability to take initiative, anticipate user needs or potential issues, and guide the conversation direction towards specific goals or targets from the system's side, rather than merely reacting to user input. It stems from the organizational behavior definition of initiating actions to create or control situations.
- Natural Language Processing (NLP): A field of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language. Core NLP tasks relevant here include:
  - Dialogue Context Understanding: Interpreting the meaning and intent of user utterances within the ongoing conversation.
  - Response Generation: Creating natural and relevant textual responses.
  - Text Classification: Categorizing text into predefined labels (e.g., safety detection, topic-shift detection).
  - Sequence-to-Sequence (Seq2Seq) Models: A common architecture in NLP for tasks like machine translation and text summarization, where an input sequence is mapped to an output sequence. Many dialogue generation models are based on Seq2Seq.
- Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. In dialogue systems, RL can be used to learn optimal dialogue policies (strategies) that guide the conversation flow, especially for tasks involving multi-step decision making or strategic interactions.
- Knowledge Graphs: A structured representation of knowledge that stores information in a graph format, where nodes represent entities (e.g., people, places, concepts) and edges represent relationships between them. Knowledge graphs can be leveraged in dialogue systems for knowledge-grounded response generation, topic planning, or factuality checking.
3.2. Previous Works
The paper contextualizes proactive dialogue systems by contrasting them with conventional, largely passive systems and acknowledging early attempts at proactivity.
- Conventional Dialogue Research:
  - Focus on Response-ability: Most prior work emphasizes understanding dialogue context and generating appropriate responses.
  - Passive Nature: Systems are typically designed to follow user-oriented conversations or fulfill explicit user requests.
  - Examples of Passive Systems: Open-domain dialogue systems like those described by [Zhang et al., 2018a] often aim for engaging chitchat without a specific system-driven goal. Task-oriented dialogue systems (e.g., [Hosseini-Asl et al., 2020]) focus on efficiently completing user-defined tasks. Conversational information-seeking systems (e.g., [Aliannejadi et al., 2019]) primarily respond to user queries for information.
- Early Attempts at Proactivity:
  - Researchers recognized the need for improved interactions beyond pure reactivity.
  - [Li et al., 2016] explored enabling conversational agents to proactively introduce new topics.
  - [Yan and Zhao, 2018] investigated systems that could offer useful suggestions proactively.
  - These pioneering studies laid the groundwork, indicating a need for more robust problem settings and tangible applications for proactive dialogue systems.
- Limitations of Modern Advanced Systems (e.g., ChatGPT):
  - Even highly advanced models like ChatGPT (though not explicitly cited with a paper, it's a widely known large language model) are noted to lack true proactivity.
  - They may passively provide randomly guessed answers to ambiguous queries, rather than proactively asking for clarification.
  - They can struggle to handle problematic requests (e.g., harmful or biased content) constructively, often relying on pre-programmed guardrails rather than sophisticated prosocial strategies.
3.3. Technological Evolution
The evolution of dialogue systems can be broadly seen as a progression from rule-based and retrieval-based systems to sophisticated neural network models.
- Early Systems (Rule-based/Retrieval-based): Focused on predefined scripts or retrieving responses from large corpora. Limited flexibility and domain-specific.
- Statistical/Machine Learning Models: Introduced more flexibility, especially for task-oriented dialogue, with components like natural language understanding (NLU), dialogue state tracking (DST), dialogue policy learning, and natural language generation (NLG).
- Deep Learning Era (2015-present):
  - End-to-End Models: Attempts to build dialogue systems that learn directly from raw text, often using sequence-to-sequence (Seq2Seq) models.
  - Pre-trained Language Models (PLMs): Emergence of powerful models like BERT, GPT, T5, XLNet, which, after pre-training on vast text corpora, can be fine-tuned for various dialogue tasks with remarkable performance. These models significantly improved context understanding and response generation.
  - Shift to Proactivity: This paper highlights a recent, crucial shift within the deep learning era: moving from passive response-ability to proactive guidance. This involves integrating elements like strategy learning, user modeling, and ethical considerations into the core design of dialogue systems.

This paper's work fits within this technological timeline as a meta-analysis that synthesizes the advancements specifically aimed at integrating proactivity into the deep learning era of dialogue systems. It marks a coming-of-age for this particular sub-field, moving beyond foundational capabilities to more sophisticated, human-like interaction paradigms.
3.4. Differentiation Analysis
Compared to general surveys on dialogue systems, this paper's core differentiation and innovation lie in its explicit focus on proactivity.
- Specific Focus: While other surveys might cover open-domain, task-oriented, or information-seeking dialogues generally, this paper specifically dissects how proactivity is implemented and studied within each of these categories.
- Unified Definition: It provides a clear, unifying definition of conversational agent's proactivity, drawing from organizational behavior, and applies this lens across diverse dialogue tasks.
- Categorization by Proactive Goal: Instead of just classifying systems by their domain (e.g., TOD vs. ODD), it further classifies them by their proactive goals within those domains (e.g., target-guided, prosocial, non-collaborative, clarification seeking). This provides a novel, goal-oriented perspective.
- Identification of Unique Challenges: It specifically highlights challenges and prospects that are unique to proactive dialogue systems, such as evaluation protocols for proactivity (requiring multi-disciplinary approaches) and ethics of conversational agent's proactivity.
- First of its Kind: The authors explicitly state, "To our knowledge, this survey is the first to focus on proactive dialogue systems," which underscores its unique contribution as a foundational review for this emerging field.
4. Methodology
As a survey paper, this document does not propose a novel methodology in the traditional sense of a new model or algorithm. Instead, its "methodology" lies in its structured approach to analyzing and categorizing existing research on proactive dialogue systems. The core idea is to break down the broad concept of proactivity into specific problems, methods, datasets, and evaluation protocols across different types of dialogue systems.
4.1. Principles
The core idea of this survey is to systematically review and categorize the rapidly growing body of research on proactive dialogue systems. The theoretical basis is derived from the definition of proactivity as the capability of an agent to create or control the conversation by taking the initiative and anticipating the impacts on themselves or human users, rather than only passively responding to the users. This principle guides the identification and classification of various proactive behaviors observed in dialogue systems.
The survey's structural principles are:
- Typology-based Classification: Organize research based on established dialogue system types (open-domain, task-oriented, information-seeking).
- Problem-centric Analysis: Within each type, identify specific problems that necessitate or benefit from proactivity.
- Methodological Deep-dive: For each problem, detail the advanced designs and techniques employed by researchers.
- Resource Mapping: Connect each problem with relevant data resources and evaluation protocols.
- Future-oriented Discussion: Highlight open challenges and future research directions for proactive dialogue systems.
4.2. Core Methodology In-depth (Layer by Layer)
The survey systematically categorizes proactive dialogue systems into three main types, each with specific problems and corresponding methods.
4.2.1. Proactive Open-domain Dialogue Systems
These systems aim to build long-term connections by proactively guiding conversations or addressing sensitive topics.
4.2.1.1. Target-guided Dialogues
Problem Definition: The system needs to proactively lead the conversation towards a designated target topic (e.g., a keyword, knowledge entity, or conversational goal) unknown to the user, ensuring transition smoothness and target achievement.
Methods:
- Topic-shift Detection: Aims to identify changes in user topics.
  - Rachna et al. [2021] fine-tune XLNet-base for classifying utterances into major, minor, or off topics. XLNet is a generalized autoregressive pretraining method that combines the advantages of BERT (bidirectional context) and autoregressive models (predicting tokens one by one) by using a permutation language modeling objective.
  - Xie et al. [2021] propose TSMANAGER, a T5-based model to predict topic shifts after augmenting the PersonaChat dataset. T5 (Text-To-Text Transfer Transformer) is a transformer-based model that frames all NLP tasks as a text-to-text problem.
- Topic Planning: The core task of defining a path to the target.
  - Early strategies [Tang et al., 2019; Zhong et al., 2021] used discourse-level keyword transitions.
  - Later, event knowledge graphs [Xu et al., 2020] were used to enhance coherence.
  - More advanced approaches [Yang et al., 2022] leverage external knowledge graphs with graph reasoning techniques for better topic path planning.
  - Reinforcement Learning (RL) is employed by Lei et al. [2022] to learn topic transitions directly from user interactions.
- Topic-aware Response Generation: Producing responses that move the conversation towards the target.
  - Kishinami et al. [2022] generate a complete responding plan.
  - Gupta et al. [2022] use a bridging path of commonsense knowledge concepts between current and target topics to generate transition responses.
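To make the idea of topic planning concrete, the following is a minimal sketch that finds a keyword path from the current topic to the target with a plain breadth-first search. It is not the graph reasoning or RL method of any cited paper; the graph `TOY_GRAPH` and the function name `plan_topic_path` are hypothetical and exist only for illustration.

```python
from collections import deque

# Hypothetical toy keyword graph: each topic maps to topics it can shift to.
TOY_GRAPH = {
    "weather": ["travel", "sports"],
    "travel": ["food", "music"],
    "sports": ["music"],
    "food": ["cooking"],
    "music": ["concert"],
}


def plan_topic_path(graph, start, target):
    """Return the shortest topic path from start to target, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        # Expand neighbours that have not been visited yet.
        for nxt in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # The target cannot be reached from the start topic.


path = plan_topic_path(TOY_GRAPH, "weather", "concert")
print(path)
```

A shortest path keeps the number of transitions small, but it says nothing about smoothness; the surveyed methods address that with knowledge graphs, learned transition models, or user feedback.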
4.2.1.2. Prosocial Dialogues
Problem Definition: Given a dialogue context, the system first classifies the safety label of user utterances and then generates a proper response to constructively mitigate problematic user statements, adhering to social norms.
Methods:
- Safety Detection: Identifying problematic user utterances.
  - Dinan et al. [2019] use human-in-the-loop training for offensive utterance detection, improved by adversarial learning [Xu et al., 2021]. Adversarial learning here means training on inputs deliberately crafted to expose the classifier's failures, which improves robustness.
  - Baheti et al. [2021] fine-tune offensive language detection classifiers on annotated datasets like ToxiChat.
  - Kim et al. [2022] introduce a fine-grained safety classification schema (Needs Caution, Needs Intervention, Casual) to avoid broad "unsafe" labels.
- Rule-of-Thumb (RoT) Generation: Explaining why a statement is acceptable or problematic.
  - Forbes et al. [2020] propose NORM TRANSFORMER to reason about social norms from the Social-Chem-101 corpus. A Transformer is a neural network architecture based on self-attention mechanisms.
  - Ziems et al. [2022] propose MORAL TRANSFORMER for generating new RoTs.
  - Kim et al. [2022] developed Canary, a sequence-to-sequence model that generates both safety labels and relevant RoTs.
- Prosocial Response Generation: Generating helpful and socially responsible responses.
  - Baheti et al. [2021] investigate controllable text generation to prevent agreement with offensive content.
  - Kim et al. [2022] propose PROST to generate prosocial responses conditioned on RoTs and dialogue context.
4.2.2. Proactive Task-oriented Dialogue Systems
These systems go beyond simple task completion to handle non-collaborative scenarios or enrich conversations.
4.2.2.1. Non-collaborative Dialogues
Problem Definition: The system and user have competing interests or goals but aim for an agreement. The system needs to generate responses with appropriate dialogue strategies to achieve its goal (e.g., negotiation, persuasion).
Methods:
- Dialogue Strategy Learning: Learning to select actions that steer the conversation.
  - He et al. [2018] decouple strategy and generation to control dialogue strategy for negotiation goals.
  - Zhou et al. [2020] use finite state transducers (FSTs) to predict the next strategy based on effective sequences of strategies. An FST is a finite-state machine that maps input sequences to output sequences.
  - Advanced models include DialoGraph [Joshi et al., 2021] using interpretable strategy-graph networks and RESPER [Dutt et al., 2021] for resisting strategy modeling.
- User Personality Modeling: Understanding human decision-making.
  - Yang et al. [2021] generate strategic dialogue by modeling and inferring personality types based on Theory of Mind (ToM). ToM is the ability to attribute mental states (beliefs, intentions, desires) to oneself and others.
  - Shi et al. [2021] develop DialGAIL, an RL-based generative algorithm with separate user and system profile builders to improve persuasion dialogues.
- Persuasive Response Generation: Generating effective responses to achieve consensus.
  - Both modularized [He et al., 2018] and end-to-end [Li et al., 2020; Wu et al., 2021] methods incorporate persuasive dialogue strategies.
  - Recent work [Mishra et al., 2022; Samad et al., 2022] focuses on building empathetic connections for better persuasive responses.
4.2.2.2. Enriched Task-oriented Dialogues
Problem Definition: Beyond functional accuracy, the system proactively provides additional information (e.g., chit-chats, knowledge) that is useful but not explicitly requested by the user, making interactions more engaging.
Methods:
- Adding Topical Chit-chats:
  - Sun et al. [2021] create ACCENTOR by adding topical chit-chats to TOD responses.
  - SimpleTOD [Hosseini-Asl et al., 2020] is extended by introducing a chit-chat dialogue action.
  - UniDS [Zhao et al., 2022] is an end-to-end method using a unified dialogue data schema for both chit-chat and task-oriented dialogues.
- Knowledge-grounded Chit-chats:
  - Chen et al. [2022b] propose KETOD for knowledge-grounded chit-chats regarding relevant entities.
  - Combiner is a pipeline-based method designed to reduce interference between dialogue state tracking and knowledge-enriched response generation.
4.2.3. Proactive Conversational Information Seeking Systems
These systems proactively clarify ambiguities or elicit preferences to achieve more efficient and precise information seeking.
4.2.3.1. Asking Clarification Questions
Problem Definition: Clarifying potential ambiguity in user queries. Formulated into two subtasks: clarification need prediction and clarification question generation.
Methods:
- Clarification Need Prediction: Typically a binary classification problem to predict if a query is ambiguous.
- Clarification Question Generation:
  - NeuQS [Aliannejadi et al., 2019] uses a retrieval-selection pipeline to select questions from a question bank via BERT-based reranking. BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model pre-trained on large text corpora, known for its ability to learn contextual representations of words.
  - QCM [Zamani et al., 2020] is an RL-based method to generate questions by maximizing a clarification utility function.
  - Pipeline-based systems [Aliannejadi et al., 2021; Guo et al., 2021] first predict clarification need, then generate questions.
  - UniPCQA [Deng et al., 2022a] is an end-to-end framework using a unified sequence-to-sequence formulation for clarification need prediction, question generation, and conversational question answering.
4.2.3.2. User Preference Elicitation
Problem Definition: Proactively acquiring user preferences by asking questions in conversational recommendation systems.
Methods:
- Turn-level Preference Elicitation:
  - PMMN (personalized multi-memory network) [Zhang et al., 2018b] incorporates user embeddings into next question prediction.
- Dialogue-level Preference Elicitation (Multi-step Decision Making):
  - Reinforcement Learning (RL) frameworks are used to learn what questions to ask.
  - UNICORN [Deng et al., 2021] is a graph-based RL framework that models real-time user preferences with a dynamic weighted graph structure.
  - MCMIPL [Zhang et al., 2022] is proposed for efficiently obtaining user preferences by asking multi-choice questions.
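As a point of reference for what "learning what questions to ask" replaces, the sketch below uses a simple hand-written heuristic instead of RL: it asks about the attribute whose yes/no answer splits the remaining candidate items most evenly (maximum binary entropy). This is not the method of PMMN, UNICORN, or MCMIPL; the item catalogue `ITEMS` and all function names are hypothetical.

```python
import math

# Hypothetical catalogue: item name -> set of attributes.
ITEMS = {
    "song_a": {"rock", "english"},
    "song_b": {"rock", "english"},
    "song_c": {"jazz", "english"},
    "song_d": {"jazz", "english", "live"},
}


def attribute_entropy(items, attribute):
    """Binary entropy of 'does a candidate item have this attribute?'."""
    if not items:
        return 0.0
    p = sum(1 for attrs in items.values() if attribute in attrs) / len(items)
    if p in (0.0, 1.0):
        return 0.0  # The answer is already known, so asking gains nothing.
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def next_question(items, asked):
    """Pick the not-yet-asked attribute with the highest entropy."""
    candidates = sorted({a for attrs in items.values() for a in attrs} - set(asked))
    if not candidates:
        return None
    return max(candidates, key=lambda a: attribute_entropy(items, a))


def update_candidates(items, attribute, user_likes):
    """Keep only items consistent with the user's yes/no answer."""
    return {
        name: attrs
        for name, attrs in items.items()
        if (attribute in attrs) == user_likes
    }
```

A typical loop would alternate `next_question` and `update_candidates` until one item remains or a turn budget is exhausted. RL-based approaches learn this ask-or-recommend policy from interaction instead of fixing it by hand.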
5. Experimental Setup
This section outlines the datasets and evaluation protocols discussed in the survey paper, reflecting the experimental setups commonly used in the research it reviews.
5.1. Datasets
The following are the results from Table 1 of the original paper:
| Dataset | Problem | Language | #Dial. | #Turns | Featured Annotations |
| --- | --- | --- | --- | --- | --- |
| TGC [Tang et al., 2019] | Target-guided Dialogues | English | 9,939 | 11.35 | Turn-level Topical Keywords |
| DuConv [Wu et al., 2019] | Target-guided Dialogues | Chinese | 29,858 | 9.1 | Turn-level Entities & Dialogue-level Goals |
| MIC [Ziems et al., 2022] | Prosocial Dialogues | English | 38K | 2.0 | Rules of Thumbs (RoTs) & Revised Responses |
| ProsocialDialog [Kim et al., 2022] | Prosocial Dialogues | English | 58K | 5.7 | Safety Labels and Reasons & RoTs |
| CraigslistBargain [He et al., 2018] | Non-collaborative Dialogues | English | 6,682 | 9.2 | Coarse Dialogue Acts |
| P4G [Wang et al., 2019] | Non-collaborative Dialogues | English | 1,017 | 10.43 | Dialogue Strategies |
| ACCENTOR [Sun et al., 2021] | Enriched Task-oriented Dialogues | English | 23.8K | - | Enriched Responses with Chit-chats |
| KETOD [Chen et al., 2022b] | Enriched Task-oriented Dialogues | English | 5,324 | 9.78 | Turn-level Entities & Enriched Responses with Knowledge |
| Abg-CoQA [Guo et al., 2021] | Asking Clarification Questions | English | 8,615 | 5.0 | Clarification Need Labels and Questions |
| PACIFIC [Deng et al., 2022a] | Asking Clarification Questions | English | 2,757 | 6.89 | Clarification Need Labels and Questions |
Here's a more detailed description of some key datasets:
- Target-guided Dialogues:
  - TGC (Target-Guided Conversation) [Tang et al., 2019]: Derived from Persona-Chat but without persona information. Targets are topical keywords extracted by rules. This dataset is created by labeling targets on existing conversations.
  - DuConv [Wu et al., 2019]: Consists of human-human conversations based on linked entities from a grounded knowledge graph. Targets are turn-level entities and dialogue-level goals. This dataset is constructed by generating conversations based on designated targets.
- Prosocial Dialogues:
  - MIC (Moral Integrity Conversation) [Ziems et al., 2022]: Contains prompt-reply pairs manually annotated with Rules of Thumb (RoTs) from Social-Chem-101, where each RoT serves as a moral judgment.
  - ProsocialDialog [Kim et al., 2022]: Constructed via a human-AI collaboration framework where AI plays a problematic user and crowdworkers act as prosocial agents. It includes safety labels, RoTs for problematic contexts, and prosocial responses.
- Non-collaborative Dialogues:
  - CraigslistBargain [He et al., 2018]: Contains conversations where two workers (buyer and seller) negotiate the price of an item. Annotations include coarse dialogue acts.
  - P4G (PERSUASIONFORGOOD) [Wang et al., 2019]: Features persuasion conversations for charity donations, along with user profiles and manual annotations for persuasion strategies and dialogue acts.
- Enriched Task-oriented Dialogues:
  - ACCENTOR [Sun et al., 2021]: Enriches task-oriented dialogues with topical chit-chats to make interactions more engaging.
  - KETOD (Knowledge-Enriched Task-Oriented Dialogue) [Chen et al., 2022b]: Focuses on knowledge-grounded chit-chats related to turn-level entities.
- Asking Clarification Questions:
  - Abg-CoQA [Guo et al., 2021]: For conversational question answering, with clarification need labels and questions.
  - PACIFIC [Deng et al., 2022a]: For proactive conversational question answering over tabular and textual data, also with clarification need labels and questions.
  - Other mentioned datasets include Qulac [Aliannejadi et al., 2019] and ClariQ [Aliannejadi et al., 2021].
- User Preference Elicitation:
  - Many studies use synthetic conversation data generated from product reviews [Zhang et al., 2018b] or purchase logs [Deng et al., 2021; Zhang et al., 2022]. The paper notes a demand for human-human conversations in this area.

These datasets are chosen to validate methods across the diverse range of proactive dialogue system tasks, providing specific annotations (e.g., topical keywords, safety labels, dialogue strategies) that are crucial for training and evaluating models with proactive capabilities.
5.2. Evaluation Metrics
The survey details specific evaluation protocols for proactive dialogue systems, often complementing general dialogue system metrics (like BLEU, Dist-N, PPL) with task-specific measures.
5.2.1. Target-guided Dialogues
Evaluation is typically done at two levels:
- Turn-level Evaluation:
  - P@K (Precision at K) and R@K (Recall at K) for keyword prediction: These metrics measure the accuracy of predicted keywords within the top $K$ candidates.
    - Conceptual Definition: P@K is the proportion of relevant items (correct keywords) among the top $K$ predicted items; R@K is the proportion of all relevant items that appear among the top $K$ predicted items.
    - Mathematical Formula (for a single query): $ P@K = \frac{\text{Number of relevant items in top } K}{K} $ $ R@K = \frac{\text{Number of relevant items in top } K}{\text{Total number of relevant items}} $
    - Symbol Explanation:
      - $\text{Number of relevant items in top } K$: The count of correct keywords among the first $K$ predicted keywords.
      - $K$: The number of top predicted keywords considered.
      - $\text{Total number of relevant items}$: The total number of actual target keywords for the turn.
  - Embedding-based correlation scores: Measure semantic similarity between generated responses/topics and target topics using vector embeddings.
  - Proactivity/Smoothness: Human evaluation scores to assess how well the system introduces new topics coherently.
- Dialogue-level Evaluation: Typically uses user simulators due to the high cost of real user experiments.
  - SR@t (Success Rate at turn t): Measures the cumulative success rate of achieving the targets by the $t$-th turn.
    - Conceptual Definition: The percentage of dialogues where the system successfully reaches the designated target by a specific turn $t$.
    - Mathematical Formula: $ SR@t = \frac{\text{Number of dialogues where target is reached by turn } t}{\text{Total number of dialogues}} \times 100\% $
    - Symbol Explanation:
      - $\text{Number of dialogues where target is reached by turn } t$: Count of conversations where the system successfully guided to the target within $t$ turns.
      - $\text{Total number of dialogues}$: The total number of conversations in the evaluation set.
  - #Turns: The average number of turns required to reach the target.
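The turn-level and dialogue-level formulas above translate directly into code. The following is a minimal sketch; the function names and the convention of recording `None` for dialogues that never reach the target are assumptions made for this illustration.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k predicted keywords that are relevant."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant keywords that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def success_rate_at_t(reach_turns, t):
    """Percentage of dialogues whose target was reached by turn t.

    reach_turns holds, per dialogue, the turn at which the target was
    reached, or None if it was never reached (an assumed convention).
    """
    successes = sum(1 for turn in reach_turns if turn is not None and turn <= t)
    return 100.0 * successes / len(reach_turns)


# Toy data, purely illustrative.
ranked_keywords = ["music", "food", "travel", "sports"]
gold_keywords = {"food", "sports"}
print(precision_at_k(ranked_keywords, gold_keywords, 2))
print(recall_at_k(ranked_keywords, gold_keywords, 2))
print(success_rate_at_t([3, None, 5, 8], t=5))
```

Because SR@t is cumulative, it can only stay the same or grow as $t$ increases, which is why it is usually reported together with the average number of turns.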
5.2.2. Prosocial Dialogues
- Safety Detection: As a classification problem.
  - Accuracy:
    - Conceptual Definition: The proportion of correctly classified instances (both true positives and true negatives) out of the total number of instances.
    - Mathematical Formula: $ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $
    - Symbol Explanation:
      - TP: True Positives (correctly predicted positive).
      - TN: True Negatives (correctly predicted negative).
      - FP: False Positives (incorrectly predicted positive).
      - FN: False Negatives (incorrectly predicted negative).
  - F1 score:
    - Conceptual Definition: The harmonic mean of precision and recall, providing a balanced measure for classification performance, especially useful for imbalanced datasets.
    - Mathematical Formula: $ F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $ where $ \text{Precision} = \frac{TP}{TP + FP} $ and $ \text{Recall} = \frac{TP}{TP + FN} $
    - Symbol Explanation:
      - TP, TN, FP, FN: Same as for Accuracy.
      - $\text{Precision}$: The proportion of true positive predictions among all positive predictions.
      - $\text{Recall}$: The proportion of true positive predictions among all actual positives.
- RoT Generation and Prosocial Response Generation:
  - ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
    - Conceptual Definition: A set of metrics for evaluating automatic summarization and machine translation. It works by comparing an automatically produced summary or translation with a set of reference summaries or translations. It measures overlap of n-grams, word sequences, and word pairs.
    - Mathematical Formula (ROUGE-N, for N-gram overlap): $ \text{ROUGE-N} = \frac{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\text{ngram}_N \in S} \text{Count}_{\text{match}}(\text{ngram}_N)}{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\text{ngram}_N \in S} \text{Count}(\text{ngram}_N)} $
    - Symbol Explanation:
      - $\text{ngram}_N$: An N-gram (contiguous sequence of $N$ items) from the text.
      - $\text{Count}_{\text{match}}(\text{ngram}_N)$: The maximum number of N-grams co-occurring in the candidate response and a reference summary.
      - $\text{Count}(\text{ngram}_N)$: The number of N-grams in the reference summary.
  - BLEU (Bilingual Evaluation Understudy):
    - Conceptual Definition: A metric for evaluating the quality of text which has been machine-translated from one natural language to another. It measures the correspondence between a machine's output and a human reference output.
    - Mathematical Formula: $ \text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) $ where $ BP = \min\left(1, \exp\left(1 - \frac{\text{reference length}}{\text{candidate length}}\right)\right) $ (Brevity Penalty) and $ p_n = \frac{\sum_{\text{ngram}} \min\left(\text{Count}(\text{ngram}), \text{MaxRefCount}(\text{ngram})\right)}{\sum_{\text{ngram}} \text{Count}(\text{ngram})} $ (N-gram precision)
    - Symbol Explanation:
      - BP: Brevity Penalty, penalizes short candidate translations.
      - $p_n$: Modified n-gram precision for n-grams of length $n$.
      - $w_n$: Weight for each n-gram precision (often uniform, e.g., $1/N$).
      - $N$: Maximum n-gram order considered (typically 4).
      - $\text{Count}(\text{ngram})$: Count of n-grams in the candidate.
      - $\text{MaxRefCount}(\text{ngram})$: Maximum count of n-grams in any single reference.
  - PPL (Perplexity):
    - Conceptual Definition: A measure of how well a probability distribution or probability model predicts a sample. In NLP, it is used to evaluate language models; lower perplexity indicates a better model.
    - Mathematical Formula: $ \text{Perplexity}(W) = \sqrt[N]{\frac{1}{P(w_1, w_2, \ldots, w_N)}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i | w_1, \ldots, w_{i-1})}} $
    - Symbol Explanation:
      - $W = w_1, w_2, \ldots, w_N$: A sequence of $N$ words.
      - $P(w_1, w_2, \ldots, w_N)$: The probability of the entire sequence according to the language model.
      - $P(w_i | w_1, \ldots, w_{i-1})$: The probability of word $w_i$ given the preceding words, as estimated by the language model.
- Human Evaluation: For quantifying attributes like agreement, respect, fairness, prosociality, etc.
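For the generation metrics, a small sketch may help show what the ROUGE-N and perplexity formulas actually compute. It assumes pre-tokenized text and, for perplexity, that the per-token conditional probabilities have already been obtained from some language model; the function names are hypothetical, and production evaluations normally rely on an established metric package instead.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, references, n):
    """ROUGE-N recall: matched reference n-grams / total reference n-grams."""
    cand_counts = ngram_counts(candidate, n)
    matched = 0
    total = 0
    for reference in references:
        ref_counts = ngram_counts(reference, n)
        total += sum(ref_counts.values())
        # A reference n-gram matches at most as often as it occurs in the candidate.
        matched += sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return matched / total if total else 0.0


def perplexity(token_probs):
    """Perplexity from per-token conditional probabilities P(w_i | w_<i)."""
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))


candidate = ["it", "is", "kind", "to", "help"]
reference = ["it", "is", "good", "to", "help", "others"]
print(rouge_n(candidate, [reference], 1))
print(rouge_n(candidate, [reference], 2))
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

Working in log space, as `perplexity` does, avoids the numerical underflow that the product form of the formula would cause on long sequences.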
5.2.3. Non-collaborative Dialogues
- Strategy Learning:
  - Accuracy: See above definition.
  - F1 score: See above definition.
  - ROC AUC (Receiver Operating Characteristic Area Under the Curve):
    - Conceptual Definition: A measure of a classifier's performance across all possible classification thresholds. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. An AUC of 1 represents a perfect classifier, while 0.5 represents a random classifier.
    - Mathematical Formula: The AUC is the area under the ROC curve. There is no simple closed-form formula; it is calculated by numerically integrating the area under the curve formed by plotting TPR vs. FPR for different thresholds. $ \text{TPR} = \text{Recall} = \frac{TP}{TP + FN} $ $ \text{FPR} = \frac{FP}{FP + TN} $
    - Symbol Explanation:
      - TP, TN, FP, FN: Same as for Accuracy and F1.
      - TPR: True Positive Rate, also known as Recall or Sensitivity.
      - FPR: False Positive Rate, the proportion of negative instances incorrectly classified as positive.
- Response Generation:
  - Human evaluation for persuasiveness, task success, etc.
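The classification metrics used for strategy learning (and for safety detection earlier) can be sketched in a few lines. The AUC below is computed through its rank interpretation, the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, which equals the area under the ROC curve. Labels are assumed to be binary (1 = positive, 0 = negative), and all function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels in {0, 1}."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # Ties count as half a correct ranking.
    return wins / (len(positives) * len(negatives))
```

For example, with `y_true = [1, 1, 1, 0, 0, 0, 0, 0]` and `y_pred = [1, 0, 0, 0, 0, 0, 0, 1]`, accuracy is 0.625 while F1 is only 0.4, which illustrates why F1 is preferred on imbalanced strategy or safety labels. The pairwise loop is quadratic and meant for clarity, not for large evaluation sets.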
5.2.4. Enriched Task-oriented Dialogues
- General TOD metrics for functional accuracy.
- Human evaluation for engagement and interactivity of the enriched responses.
5.2.5. Conversational Information Seeking Systems
- Asking Clarification Questions:
  - Binary classification metrics (Accuracy, F1) for clarification need prediction.
  - General text generation metrics (ROUGE, BLEU, PPL) for question generation.
  - Task-specific metrics depending on the application (e.g., improved search success rate in conversational search).
5.2.6. User Preference Elicitation
- Turn-level:
  - HR@k,t (Hit Ratio at k, turn t):
    - Conceptual Definition: Measures if the correct item (or attribute) is among the top $k$ recommendations (or predictions) at turn $t$.
    - Mathematical Formula: $ \text{HR@}k,t = \frac{\text{Number of hits in top } k \text{ at turn } t}{\text{Total number of interactions at turn } t} $
    - Symbol Explanation:
      - $\text{Number of hits in top } k \text{ at turn } t$: Count of times the target item/attribute was found in the top $k$ predictions/recommendations at turn $t$.
      - $\text{Total number of interactions at turn } t$: Total number of prediction/recommendation instances at turn $t$.
  - MRR@k,t (Mean Reciprocal Rank at k, turn t):
    - Conceptual Definition: A statistical measure for evaluating any process that produces a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank is $1/\text{rank}$. MRR is the average of the reciprocal ranks of results for a set of queries.
    - Mathematical Formula: $ \text{MRR@}k,t = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \begin{cases} \frac{1}{\text{rank}_q} & \text{if } \text{rank}_q \le k \\ 0 & \text{otherwise} \end{cases} $
    - Symbol Explanation:
      - $|Q|$: Number of queries (or instances) in the set.
      - $\text{rank}_q$: Rank position of the first relevant item for query $q$.
      - $k$: Maximum rank position to consider.
  - MAP@k,t (Mean Average Precision at k, turn t):
    - Conceptual Definition: A popular metric for evaluating ranked retrieval results. It computes the average precision for each query (considering only ranks up to $k$) and then averages these across all queries.
    - Mathematical Formula: $ \text{MAP@}k,t = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \text{AP}_q(k) $ where $ \text{AP}_q(k) = \frac{\sum_{i=1}^{k} P(i) \cdot \text{rel}(i)}{\text{Number of relevant items}} $
    - Symbol Explanation:
      - $|Q|$: Number of queries.
      - $\text{AP}_q(k)$: Average Precision for query $q$ up to rank $k$.
      - $P(i)$: Precision at cut-off $i$.
      - $\text{rel}(i)$: An indicator function, 1 if the item at rank $i$ is relevant, 0 otherwise.
  - NDCG@k,t (Normalized Discounted Cumulative Gain at k, turn t):
    - Conceptual Definition: Measures the usefulness, or gain, of a document based on its position in the result list. The gain is accumulated from the top of the result list to the bottom, with the gain of each result discounted at lower ranks.
    - Mathematical Formula: $ \text{DCG}_k = \sum_{i=1}^{k} \frac{2^{\text{rel}_i} - 1}{\log_2(i+1)} $ $ \text{IDCG}_k = \sum_{i=1}^{k} \frac{2^{\text{rel}_{i, \text{ideal}}} - 1}{\log_2(i+1)} $ $ \text{NDCG}_k = \frac{\text{DCG}_k}{\text{IDCG}_k} $
    - Symbol Explanation:
      - $\text{rel}_i$: The relevance score of the result at position $i$.
      - $\text{rel}_{i, \text{ideal}}$: The relevance score of the result at position $i$ in the ideal ranking.
      - $\text{DCG}_k$: Discounted Cumulative Gain at rank $k$.
      - $\text{IDCG}_k$: Ideal Discounted Cumulative Gain at rank $k$.
- Dialogue-level:
  - SR@t (Success Rate at turn t): Measures the cumulative ratio of successful recommendation by turn $t$. (Similar to target-guided dialogues)
  - AT (Average Turns): The average number of turns required for all sessions to reach a successful recommendation.
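The ranking metrics above can be illustrated with a short sketch. It evaluates a single turn, assumes each session has exactly one target item for HR and MRR, and takes graded relevance scores as input for NDCG; the function names and toy sessions are hypothetical.

```python
import math


def hit_at_k(ranked, target, k):
    """1.0 if the target item appears in the top k, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def reciprocal_rank_at_k(ranked, target, k):
    """1 / rank of the target if it is within the top k, else 0.0."""
    for position, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / position
    return 0.0


def ndcg_at_k(relevances, k):
    """NDCG@k for a list of relevance scores given in ranked order."""
    def dcg(rels):
        return sum(
            (2 ** rel - 1) / math.log2(i + 1)
            for i, rel in enumerate(rels[:k], start=1)
        )

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def mean(values):
    return sum(values) / len(values)


# Toy sessions at one turn: (ranked recommendations, target item).
sessions = [
    (["a", "b", "c"], "b"),
    (["x", "y", "z"], "z"),
    (["p", "q", "r"], "m"),
]
hr_at_2 = mean([hit_at_k(ranked, target, 2) for ranked, target in sessions])
mrr_at_3 = mean([reciprocal_rank_at_k(ranked, target, 3) for ranked, target in sessions])
print(hr_at_2, mrr_at_3, ndcg_at_k([0, 1, 0], 3))
```

Averaging the per-session values over all sessions at a fixed turn $t$ gives HR@k,t and MRR@k,t; repeating this per turn shows how quickly elicitation improves the ranking.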
5.3. Baselines
As a survey paper, this document does not present its own experimental results or compare its proposed method against baselines. Instead, it reviews and categorizes the methods and comparisons made within the individual research papers it surveys. Within those surveyed papers, baselines would typically include:
- Passive Dialogue Systems: Simple response-only models, or task-oriented systems that only follow explicit user instructions.
- Rule-based Systems: For strategy learning or topic management, where rules are handcrafted.
- Simpler Neural Models: For generation tasks, models without specific proactive components (e.g., Seq2Seq without strategy networks or knowledge grounding).
- Ablations: Removing proactive components from a proposed model to show their effectiveness.

The representativeness of these baselines stems from their role as established or simpler approaches, against which the proactive elements of newer models can demonstrate their added value.
6. Results & Analysis
This section synthesizes the findings and trends reported in the survey paper regarding the effectiveness and characteristics of various proactive dialogue systems. Since this is a survey, it does not present new experimental results but rather analyzes the landscape of existing research.
6.1. Core Results Analysis
The work reviewed in the survey indicates that integrating proactivity into dialogue systems improves user interaction and task accomplishment across diverse domains.
- Effectiveness in Open-domain Dialogues:
  - Target-guided dialogues show that systems can successfully steer conversations towards predefined topics or goals, moving beyond simple chitchat. The use of knowledge graphs and reinforcement learning for topic planning highlights a sophisticated approach to managing conversational flow.
  - Prosocial dialogues illustrate the critical role of proactivity in handling problematic user input, shifting from passive acceptance to constructive intervention. This ensures safety and ethical interaction, which are crucial for real-world deployment. The development of safety detection and Rule-of-Thumb generation mechanisms indicates a growing maturity in addressing social complexities.
- Effectiveness in Task-oriented Dialogues:
  - For non-collaborative dialogues, proactivity is essential for systems to achieve their own objectives (e.g., negotiation, persuasion) rather than just fulfilling user requests. Techniques like dialogue strategy learning and user personality modeling allow for more sophisticated, goal-oriented interactions, leading to successful conflict resolution or persuasion.
  - Enriched task-oriented dialogues show that proactivity enhances user experience by providing useful supplementary information or engaging chit-chats not explicitly requested. This improves the quality and effectiveness of the service, making the system more human-like and helpful.
- Effectiveness in Conversational Information Seeking Systems:
  - Asking clarification questions directly addresses the ambiguity inherent in natural language queries, leading to more efficient and precise information retrieval. Proactively seeking clarification prevents irrelevant results and improves user satisfaction.
  - User preference elicitation allows conversational recommendation systems to actively learn user needs, moving beyond passive observation of past behavior. This leads to more personalized and accurate recommendations, improving the final recommendation results.
Advantages and Disadvantages of Proactive Approaches:
- Advantages:
- Improved User Engagement: By taking initiative, systems can maintain more dynamic and interesting conversations.
- Increased Efficiency: Proactively clarifying or guiding can reduce turns and lead to faster task completion or information discovery.
- Enhanced Goal Achievement: Systems can ensure their own objectives (e.g., sales, safety) are met alongside user goals.
- More Human-like Interaction: Mimicking human conversational habits of leading and steering.
- Handling Complex Scenarios: Enables systems to tackle non-collaborative, ambiguous, or sensitive situations.
- Disadvantages (implied by challenges):
-
Complexity: Designing proactive systems requires more sophisticated
dialogue policies,user modeling, andstrategy learning. -
Evaluation Difficulty: Measuring the success of
proactivityis inherently complex and often relies on costlyhuman evaluationoruser simulators. -
Ethical Risks: The power to lead or influence conversations introduces risks related to
factuality,morality, andprivacy.The survey highlights a clear trend towards more intelligent, goal-driven, and socially aware
conversational AIthroughproactivity.
-
- Advantages:
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| Dataset | Problem | Language | #Dial. | #Turns | Featured Annotations |
| --- | --- | --- | --- | --- | --- |
| TGC [Tang et al., 2019] | Target-guided Dialogues | English | 9,939 | 11.35 | Turn-level Topical Keywords |
| DuConv [Wu et al., 2019] | Target-guided Dialogues | Chinese | 29,858 | 9.1 | Turn-level Entities & Dialogue-level Goals |
| MIC [Ziems et al., 2022] | Prosocial Dialogues | English | 38K | 2.0 | Rules of Thumbs (RoTs) & Revised Responses |
| ProsocialDialog [Kim et al., 2022] | Prosocial Dialogues | English | 58K | 5.7 | Safety Labels and Reasons & RoTs |
| CraigslistBargain [He et al., 2018] | Non-collaborative Dialogues | English | 6,682 | 9.2 | Coarse Dialogue Acts |
| P4G [Wang et al., 2019] | Non-collaborative Dialogues | English | 1,017 | 10.43 | Dialogue Strategies |
| ACCENTOR [Sun et al., 2021] | Enriched Task-oriented Dialogues | English | 23.8K | - | Enriched Responses with Chit-chats |
| KETOD [Chen et al., 2022b] | Enriched Task-oriented Dialogues | English | 5,324 | 9.78 | Turn-level Entities & Enriched Responses with Knowledge |
| Abg-CoQA [Guo et al., 2021] | Asking Clarification Questions | English | 8,615 | 5.0 | Clarification Need Labels and Questions |
| PACIFIC [Deng et al., 2022a] | Asking Clarification Questions | English | 2,757 | 6.89 | Clarification Need Labels and Questions |
This table effectively summarizes the key datasets available for research in proactive dialogue systems. It highlights:
- Diversity of Problems: Datasets exist for all major proactive problems identified (Target-guided, Prosocial, Non-collaborative, Enriched TOD, Clarification Questions).
- Language Coverage: While predominantly English, DuConv provides a valuable Chinese resource for target-guided dialogues.
- Scale and Turn Length: The datasets vary significantly in scale (from ~1K to 58K dialogues) and average turns, reflecting the complexity and data collection efforts for different tasks. MIC has a very low average turn count (2.0), suggesting it might focus on single problematic utterances and immediate responses rather than extended dialogues.
- Rich Annotations: The Featured Annotations column is crucial, showing that these datasets provide specific labels (e.g., topical keywords, RoTs, safety labels, dialogue acts, clarification need labels, entities) essential for training and evaluating proactive capabilities. This indicates a strong foundation for supervised and reinforcement learning approaches.
6.3. Ablation Studies / Parameter Analysis
As a survey paper, this document does not conduct its own ablation studies or parameter analyses. Such analyses are typically performed within individual research papers to validate the contribution of specific components of a proposed model or to optimize its performance. The survey's role is to report on the types of methods and approaches taken in the field, including those that might have been validated via ablation studies in their original publications.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper serves as the first comprehensive survey on proactive dialogue systems, offering a structured overview of the field's problems, methods, and future directions. It meticulously categorizes existing research into three main types of dialogues: open-domain, task-oriented, and conversational information-seeking. Within these categories, the authors delineate specific proactive problems such as target-guided dialogues, prosocial dialogues, non-collaborative dialogues, enriched task-oriented dialogues, asking clarification questions, and user preference elicitation. For each problem, the survey details the advanced designs and techniques employed, along with representative datasets and evaluation protocols. The overarching conclusion is that proactivity is a crucial, emerging property for conversational AI that enables more engaging, efficient, and sophisticated interactions, moving systems beyond passive responsiveness towards achieving strategic goals and human-like intelligence.
7.2. Limitations & Future Work
The authors highlight several critical challenges and promising research directions for the future:
- Proactivity in Hybrid Dialogues:
  - Limitation: Current dialogue systems often assume a single conversational goal. Real-world interactions, however, involve multiple, varied objectives and require natural transitions between different dialogue types (e.g., shifting from chit-chat to task completion and then to recommendation). Few studies have adequately addressed this.
  - Future Work: More research is needed to ensure natural and smooth transitions among different types of dialogues and to proactively discover user interests for guiding multi-goal conversations. This includes developing systems that can adapt to changing conversational goals without losing performance in any specific dialogue type. The emergence of datasets like DuRecDial, FusedChat, SalesBot, and OB-MultiWOZ indicates a growing interest in this area.
- Evaluation Protocols for Proactivity:
  - Limitation: Evaluating proactivity is complex and often relies on expensive human evaluation. While user simulators offer an alternative, robust and effective evaluation metrics are still lacking. Traditional dialogue metrics are insufficient, as proactivity involves aspects from human-computer interaction, sociology, and psychology.
  - Future Work: There is an urgent need for more effective and robust multidisciplinary evaluation protocols. This involves designing metrics that can accurately quantify aspects like smoothness, persuasiveness, safety, and goal achievement from a proactive stance, perhaps integrating quantitative measures with qualitative user experience insights.
- Ethics of Conversational Agent's Proactivity:
  - Limitation: The ability to proactively guide conversations is a "double-edged sword." While beneficial, it introduces significant ethical concerns that are often overlooked.
  - Future Work: Researchers must prioritize responsible AI. This includes focusing on:
    - Factuality: Ensuring the accuracy of system-initiative information and external knowledge to prevent hallucinations or factual incorrectness.
    - Morality: Addressing issues beyond general toxic language and social bias, specifically focusing on aggressiveness in non-collaborative conversations and promoting prosocial and empathetic interactions.
    - Privacy: Heightened concerns arise regarding the misuse of personal information when agents proactively engage with user data. Robust privacy-preserving mechanisms are essential.
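To make the evaluation-protocol point above more concrete, the following is a minimal sketch of two simple quantitative proxies for goal achievement: the fraction of dialogues in which the system goal was reached, and the average number of turns needed to reach it. The `DialogueLog` schema, the function names, and the sample data are all hypothetical and are not taken from the survey; qualities such as smoothness, persuasiveness, and safety would still require human judgment or separate classifiers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DialogueLog:
    """One logged or simulated dialogue (hypothetical schema, for illustration only)."""
    turns: int                       # total number of turns in the dialogue
    goal_turn: Optional[int] = None  # turn at which the system goal was reached, or None


def goal_success_rate(logs: List[DialogueLog]) -> float:
    """Fraction of dialogues in which the system-side goal was reached."""
    if not logs:
        return 0.0
    return sum(log.goal_turn is not None for log in logs) / len(logs)


def average_turns_to_goal(logs: List[DialogueLog]) -> Optional[float]:
    """Mean turn index at which the goal was reached, over successful dialogues only."""
    reached = [log.goal_turn for log in logs if log.goal_turn is not None]
    if not reached:
        return None  # no successful dialogue, so the average is undefined
    return sum(reached) / len(reached)


# Made-up sample data: three of four dialogues reach the goal.
logs = [
    DialogueLog(turns=8, goal_turn=5),
    DialogueLog(turns=8, goal_turn=None),
    DialogueLog(turns=6, goal_turn=3),
    DialogueLog(turns=8, goal_turn=8),
]

print(f"success rate: {goal_success_rate(logs):.2f}")
print(f"avg turns to goal: {average_turns_to_goal(logs):.2f}")
```

Reporting the two numbers together matters: a system can raise its success rate by steering abruptly, so a turn-based measure alone says nothing about whether the transition felt natural to the user.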
7.3. Personal Insights & Critique
This survey is a timely and valuable contribution to the conversational AI community. Its primary strength lies in providing a clear, systematic framework for understanding proactivity, a concept that has been implicitly present but rarely explicitly defined and categorized across the diverse landscape of dialogue systems.
Inspirations:
- Paradigm Shift: The paper reinforces the idea that conversational AI is moving beyond mere functional task completion or social chitchat towards becoming strategic, ethical, and truly intelligent partners. This paradigm shift from reactive to proactive is critical for AI adoption in complex real-world scenarios.
- Cross-pollination: The categorization highlights how solutions from one proactive problem (e.g., strategy learning in non-collaborative dialogues) could potentially inspire methods in another (e.g., topic planning in target-guided dialogues), fostering interdisciplinary research.
- Human-Centric AI: The emphasis on prosociality and ethics underscores the growing importance of designing AI that not only performs well but also interacts responsibly and beneficially with humans.
Potential Issues/Areas for Improvement (Critique):
- Definition of "Proactivity" Nuances: While providing a definition, the spectrum of proactivity can be subtle. Is a system that offers a suggestion after a user query proactive, or merely responsive with added value? A deeper exploration of degrees of proactivity might be beneficial (e.g., mild suggestions vs. strong steering).
- Practical Implementation Challenges beyond Research: The survey focuses on research problems. Real-world deployment faces additional hurdles like latency, scalability, and robustness to out-of-domain inputs when proactive strategies are involved. These practical considerations could be discussed more.
- User Acceptance and Control: While proactivity can improve efficiency, excessive or poorly executed proactivity can be perceived as intrusive or annoying by users. The balance between system initiative and user control is a crucial design challenge, and its evaluation is complex. This aspect is touched upon in the ethics section but could be expanded.
- Lack of Unified Benchmarks: Although many datasets are listed, the diversity of evaluation metrics and task formulations across different proactive problems suggests a lack of truly unified benchmarks. This makes direct comparison across different proactive capabilities difficult, hindering holistic progress.

Overall, this survey successfully charts the emerging landscape of proactive dialogue systems, providing an invaluable resource for researchers and practitioners alike. It effectively sets the stage for the next generation of conversational AI that is not only intelligent but also intentional and responsible.