
A Survey on Proactive Dialogue Systems: Problems, Methods, and Prospects

Published: 05/04/2023

TL;DR Summary

This survey reviews proactive dialogue systems, covering the prominent problems and methods for goal-driven, strategic conversational agents that lead the conversation from the system side, and offering a comprehensive overview to advance conversational AI toward more complex, interactive tasks.

Abstract

Proactive dialogue systems, related to a wide range of real-world conversational applications, equip the conversational agent with the capability of leading the conversation direction towards achieving pre-defined targets or fulfilling certain goals from the system side. They are empowered by advanced techniques to progress to more complicated tasks that require strategic and motivational interactions. In this survey, we provide a comprehensive overview of the prominent problems and advanced designs for conversational agents' proactivity in different types of dialogues. Furthermore, we discuss challenges that meet real-world application needs but require a greater research focus in the future. We hope that this first survey of proactive dialogue systems can provide the community with quick access to, and an overall picture of, this practical problem, and stimulate further progress that takes conversational AI to the next level.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

A Survey on Proactive Dialogue Systems: Problems, Methods, and Prospects

1.2. Authors

Yang Deng¹, Wenqiang Lei²†, Wai Lam³, Tat-Seng Chua¹

¹National University of Singapore; ²Sichuan University; ³The Chinese University of Hong Kong

1.3. Journal/Conference

This paper was published on arXiv, a preprint server for scientific papers. While arXiv itself is not a peer-reviewed journal or conference, it is a widely used platform in the scientific community, particularly in fields like computer science, to rapidly disseminate research findings before or in parallel with formal publication. Papers on arXiv undergo a basic moderation process but are not subject to the rigorous peer review typically associated with academic journals or conferences. Given the recency of the publication date, it is likely intended for submission or has been submitted to a major conference or journal in the field of Natural Language Processing or Artificial Intelligence.

1.4. Publication Year

2023

1.5. Abstract

This paper presents a comprehensive survey of proactive dialogue systems, which are conversational agents designed with the ability to guide conversations toward specific goals or targets. These systems leverage advanced techniques for strategic and motivational interactions, enabling them to tackle more complex tasks. The survey provides an overview of key problems and advanced designs for conversational proactivity across three main types of dialogues: open-domain dialogues, task-oriented dialogues, and information-seeking dialogues. Furthermore, it discusses current challenges in real-world applications that require intensified future research, including hybrid dialogues, robust evaluation protocols, and ethical considerations. The authors aim for this survey to offer quick access and a broad understanding of this practical problem, thereby stimulating further advancements in conversational AI.

The paper is available at: https://arxiv.org/abs/2305.02750. The PDF version can be accessed at: https://arxiv.org/pdf/2305.02750v2.pdf. It is published as a preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper addresses is the passivity of most existing dialogue systems. Traditional dialogue systems are typically designed to passively respond to user queries, follow user-initiated topics, or fulfill explicit user requests. Examples include open-domain dialogue systems (for general conversation), task-oriented dialogue systems (for specific tasks like booking), and conversational information-seeking systems (for finding information).

This passivity presents several limitations:

  1. Limited Engagement: Passive systems may struggle to maintain user engagement or guide conversations effectively, especially in complex scenarios.

  2. Handling Ambiguity/Problematic Content: They might passively accept ambiguous user queries or even problematic/harmful statements, leading to suboptimal or unsafe interactions (e.g., providing randomly guessed answers, failing to address biased conversations).

  3. Lack of Strategy/Motivation: They lack the proactivity to strategically steer the conversation towards pre-defined system goals or to motivate users towards certain actions, which is crucial for more sophisticated applications.

  4. Absence of "Strong AI" Trait: The ability to take initiative and anticipate impacts (i.e., proactivity) is an essential property of intelligent, human-like conversations and a significant step towards strong AI that possesses autonomy and human-like consciousness. Even advanced models like ChatGPT exhibit some of these limitations due to a lack of proactivity.

    The problem is important because equipping conversational agents with proactivity can significantly enhance user engagement, improve service efficiency, and enable the system to handle more complicated tasks that require strategical and motivational interactions.

The paper's entry point is to define and categorize proactive dialogue systems as conversational agents capable of leading the conversation direction towards achieving pre-defined targets or fulfilling certain goals from the system side. It aims to systematically review existing efforts in this emerging field, identify key problems, advanced designs, and future research directions.

2.2. Main Contributions / Findings

The paper makes several primary contributions by providing the first comprehensive survey specifically focused on proactive dialogue systems:

  1. Systematic Categorization: It provides a systematic overview of proactive dialogue systems by categorizing them into three main types of dialogues:
    • Proactive Open-domain Dialogues: Where the system leads general conversations (e.g., target-guided dialogues, prosocial dialogues).
    • Proactive Task-oriented Dialogues: Where the system goes beyond fulfilling explicit requests (e.g., non-collaborative dialogues, enriched task-oriented dialogues).
    • Proactive Conversational Information Seeking Systems: Where the system actively refines information search (e.g., asking clarification questions, user preference elicitation).
  2. Identification of Prominent Problems and Advanced Designs: For each category, the survey details the specific problems that necessitate proactivity and outlines the advanced techniques and designs developed to address them, including specific subtasks like topic-shift detection, dialogue strategy learning, and clarification need prediction.
  3. Review of Data Resources and Evaluation Protocols: It summarizes available datasets and commonly adopted evaluation metrics and protocols for each identified problem, providing a valuable resource for researchers entering the field.
  4. Discussion of Challenges and Prospects: The paper highlights crucial open challenges that align with real-world application needs but require greater research focus. These include:
    • Proactivity in Hybrid Dialogues: Addressing conversations with multiple, varying goals.

    • Evaluation Protocols for Proactivity: Developing robust, multidisciplinary metrics beyond traditional dialogue evaluation.

    • Ethics of Conversational Agent's Proactivity: Ensuring factuality, morality, and privacy in proactive interactions.

      The key conclusion is that proactivity is a critical, yet often overlooked, property in conversational AI that holds immense potential for developing more intelligent, engaging, and capable dialogue systems. The findings solve the problem of fragmented understanding in this emerging field by synthesizing diverse research efforts, offering a structured view, and pointing towards future high-impact research avenues.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a foundational understanding of dialogue systems and key concepts within Natural Language Processing (NLP) is essential.

  • Dialogue Systems: At its core, a dialogue system (or conversational AI) is a computer program designed to converse with human users in natural language. Its goal is typically to provide social support, answer questions, complete tasks, or offer recommendations.

    • Open-domain Dialogue Systems (ODD): These systems are designed for general conversations without a specific task. Their goal is to maintain engagement, build rapport, and provide social interaction. Examples include chatbots for casual chat.
    • Task-oriented Dialogue Systems (TOD): These systems aim to help users complete specific tasks, such as booking flights, making restaurant reservations, or setting reminders. They require understanding user intent, managing dialogue states, and interacting with external APIs.
    • Conversational Information-Seeking Systems (CIS): These systems help users find information through a conversational interface, often combining elements of search engines, recommender systems, and question-answering systems.
  • Proactivity: In the context of dialogue systems, proactivity refers to the agent's ability to take initiative, anticipate user needs or conversation direction, and actively steer the dialogue towards a predefined goal or target, rather than merely reacting to user input. This contrasts with traditional passive dialogue systems.

  • Natural Language Processing (NLP): The field of AI that focuses on enabling computers to understand, interpret, and generate human language. Techniques from NLP are fundamental to all dialogue systems, including natural language understanding (NLU) and natural language generation (NLG).

  • Machine Learning (ML) / Deep Learning (DL): Many modern dialogue systems are built using ML and DL techniques, particularly neural networks.

    • Transformers: A deep learning architecture introduced in 2017, which has become dominant in NLP. It relies heavily on the self-attention mechanism to process sequential data, allowing it to weigh the importance of different parts of the input sequence. Models like BERT, GPT, and T5 are based on the Transformer architecture. The paper mentions T5-based models and BERT-based models.
    • Reinforcement Learning (RL): A type of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. It's often used in dialogue systems for dialogue policy learning, where the agent learns the optimal strategy for responding to users.
  • Knowledge Graphs: Structured representations of knowledge that store facts as nodes (entities) and edges (relationships). They are used to inject factual information and common sense into dialogue systems, particularly for knowledge-grounded conversations or topic planning.

3.2. Previous Works

The paper frames its discussion by first highlighting the existing landscape of dialogue systems and then pinpointing the gap that proactivity addresses.

  • Conventional Dialogue Research:

    • Response-ability Focus: Prior research largely focused on the response-ability of systems, meaning their ability to understand context and generate appropriate replies.
    • User-oriented/Passive Nature:
      • Open-domain dialogue systems (e.g., PersonaChat [Zhang et al., 2018a]): Aim to establish long-term connections by echoing user topics, emotions, or views. The paper notes that PersonaChat is often used as a base dataset for target-guided dialogue research, demonstrating how passive systems are being adapted for proactive goals.
      • Task-oriented dialogue systems (e.g., SimpleTOD [Hosseini-Asl et al., 2020]): Designed to passively follow user instructions to complete specific tasks. The paper mentions SimpleTOD as a baseline that needs extension (SimpleTOD+) to handle enriched TODs with chit-chats.
      • Conversational information-seeking systems (e.g., general conversational search [Aliannejadi et al., 2019]): Passively respond to user queries.
  • Early Attempts at Proactivity:

    • Topic Introduction [Li et al., 2016]: Pioneering work recognized the need for systems to proactively introduce new topics.
    • Useful Suggestions [Yan and Zhao, 2018]: Other early studies explored agents offering useful suggestions. These efforts laid the groundwork but highlighted the need for more defined problem settings and applications.
  • Key Models/Architectures Referenced (and underlying concepts):

    • BERT (Bidirectional Encoder Representations from Transformers): A Transformer-based model pre-trained on large text corpora, widely used for NLU tasks like classification and question answering. For example, BERT-based models are used for question selection in clarification question generation [Aliannejadi et al., 2019].
    • T5 (Text-to-Text Transfer Transformer): Another Transformer-based model that frames all NLP tasks as a text-to-text problem (taking text as input and producing text as output). It's used in topic-shift management [Xie et al., 2021].
    • Evaluation Metrics (Crucial for understanding performance):
      • BLEU (Bilingual Evaluation Understudy): A metric for evaluating the quality of machine-generated text by comparing it to reference texts. It measures the n-gram overlap.
        • Conceptual Definition: BLEU quantifies how similar a candidate translation (or generated response) is to a set of high-quality reference translations. It's widely used in machine translation and text generation tasks, focusing on precision (how many words in the candidate are also in the reference).
        • Mathematical Formula: $ \text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) $ where $\text{BP}$ is the brevity penalty, $N$ is the maximum n-gram order (typically 4), $w_n$ are positive weights summing to 1 (often $1/N$), and $p_n$ is the modified n-gram precision. The brevity penalty is calculated as: $ \text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{(1-r/c)} & \text{if } c \le r \end{cases} $ where $c$ is the length of the candidate text and $r$ is the effective reference corpus length. The n-gram precision $p_n$ is calculated as: $ p_n = \frac{\sum_{S \in \text{candidates}} \sum_{ngram \in S} \text{Count}_{\text{clip}}(ngram)}{\sum_{S' \in \text{candidates}} \sum_{ngram' \in S'} \text{Count}(ngram')} $
        • Symbol Explanation:
          • $\text{BP}$: Brevity Penalty, penalizes generated texts that are too short compared to the reference.
          • $c$: Length of the candidate (generated) text.
          • $r$: Effective reference corpus length, i.e., the sum of the lengths of the reference segments whose length is closest to the corresponding candidate segment length.
          • $N$: Maximum n-gram order.
          • $w_n$: Weight for the $n$-gram precision.
          • $p_n$: Modified precision for $n$-grams.
          • $\text{Count}_{\text{clip}}(ngram)$: The number of times an $n$-gram appears in the candidate, clipped by its maximum count in any single reference.
          • $\text{Count}(ngram')$: The total number of $n$-grams in the candidate.
      • ROUGE (Recall-Oriented Understudy for Gisting Evaluation): A set of metrics used for evaluating automatic summarization and machine translation software. It works by comparing an automatically produced summary or translation against a set of reference summaries or translations. It focuses on recall (how many words in the reference are also in the candidate). The paper mentions ROUGE for RoT generation and prosocial generation.
        • Conceptual Definition: ROUGE measures the overlap of n-grams, word sequences, and word pairs between the system-generated summary and human-created reference summaries. Different variants (ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S) exist, with ROUGE-N focusing on n-gram overlap.
        • Mathematical Formula (for ROUGE-N): $ \text{ROUGE-N} = \frac{\sum_{S \in \text{references}} \sum_{ngram_n \in S} \text{Count}_{\text{match}}(ngram_n)}{\sum_{S \in \text{references}} \sum_{ngram_n \in S} \text{Count}(ngram_n)} $
        • Symbol Explanation:
          • $n$: The length of the n-gram.
          • $\text{Count}_{\text{match}}(ngram_n)$: The maximum number of times an $n$-gram co-occurs in both the candidate and the reference summaries.
          • $\text{Count}(ngram_n)$: The number of times an $n$-gram appears in the reference summaries.
      • PPL (Perplexity): A measure of how well a probability distribution or probability model predicts a sample. In NLP, it's used to evaluate language models by measuring how well the model predicts a given sequence of words. A lower PPL indicates a better model.
        • Conceptual Definition: Perplexity measures the uncertainty of a language model. It's the inverse probability of the test set, normalized by the number of words. A model with low perplexity is good at predicting the next word in a sequence.
        • Mathematical Formula: $ \text{PPL}(W) = \left( \prod_{i=1}^{N} \frac{1}{P(w_i | w_1, \dots, w_{i-1})} \right)^{\frac{1}{N}} = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i | w_1, \dots, w_{i-1}) \right) $
        • Symbol Explanation:
          • $W = w_1, w_2, \dots, w_N$: A sequence of $N$ words.
          • $P(w_i | w_1, \dots, w_{i-1})$: The probability assigned by the language model to the $i$-th word given the preceding $i-1$ words.
          • $N$: The total number of words in the sequence.
      • F1 Score: The harmonic mean of precision and recall, often used for classification tasks.
        • Conceptual Definition: The F1 score provides a single score that balances the precision and recall of a classifier. Precision measures the proportion of true positive predictions among all positive predictions, while recall measures the proportion of true positive predictions among all actual positive instances.
        • Mathematical Formula: $ \text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $ where $ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $ and $ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $
        • Symbol Explanation:
          • True Positives: Instances correctly identified as positive.
          • False Positives: Instances incorrectly identified as positive (Type I error).
          • False Negatives: Instances incorrectly identified as negative (Type II error).
          • Precision: The ability of the classifier not to label as positive a sample that is negative.
          • Recall: The ability of the classifier to find all the positive samples.
      • Accuracy: The proportion of correctly classified instances among the total number of instances.
        • Conceptual Definition: Accuracy measures the overall correctness of a classification model, indicating the ratio of correct predictions to the total number of predictions made.
        • Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
        • Symbol Explanation:
          • Number of Correct Predictions: The sum of true positives and true negatives.
          • Total Number of Predictions: The sum of true positives, true negatives, false positives, and false negatives.
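
To ground the text-generation metrics above, the following is a minimal, self-contained Python sketch of sentence-level BLEU, single-reference ROUGE-N, and perplexity. It is an illustrative simplification of the formulas as stated, not a substitute for standard toolkits (e.g., sacrebleu or the official ROUGE package); tokenization, smoothing, and multi-reference handling are deliberately naive.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(candidate, references, max_n=4):
    """BLEU with clipped n-gram precisions and a brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        max_ref_counts = Counter()
        for ref in references:
            for gram, cnt in Counter(ngrams(ref, n)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], cnt)
        # Clip each candidate n-gram count by its max count in any reference.
        clipped = sum(min(cnt, max_ref_counts[gram]) for gram, cnt in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # any zero n-gram precision drives BLEU to zero (no smoothing)
        log_precisions.append(math.log(clipped / total))
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: abs(length - c))
    bp = 1.0 if c > r else math.exp(1 - r / c)  # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)  # uniform weights w_n = 1/N

def rouge_n(candidate, reference, n=2):
    """Recall-oriented n-gram overlap against a single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(cnt, cand_counts[gram]) for gram, cnt in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0

def perplexity(token_log_probs):
    """PPL from per-token log-probabilities log P(w_i | w_1..w_{i-1}), natural log."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

candidate = "the agent steers the topic".split()
reference = "the agent shifts the topic".split()
print(sentence_bleu(candidate, [reference]))  # 0.0 here: no matching 3-grams
print(rouge_n(candidate, reference, n=2))     # 0.5
print(perplexity([math.log(0.25)] * 4))       # 4.0
```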
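
Likewise, a compact sketch of the classification metrics (accuracy, precision, recall, F1) computed from binary labels, matching the formulas above; libraries such as scikit-learn provide the same metrics with many more options.

```python
def binary_classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from 0/1 labels, per the formulas above."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

print(binary_classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))
# {'accuracy': 0.6, 'precision': 0.667, 'recall': 0.667, 'f1': 0.667} (rounded)
```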

3.3. Technological Evolution

The evolution of dialogue systems has largely progressed from rule-based systems to statistical machine learning approaches, and more recently, to deep learning and large language models (LLMs). Initially, systems were highly scripted and passive, following rigid rules. With the advent of statistical methods, systems gained some flexibility in understanding and generating language but remained largely reactive. The deep learning era brought significant advancements, especially with sequence-to-sequence models and later Transformers, enabling more fluent and contextually aware conversations.

This paper's work on proactive dialogue systems represents a crucial step in this evolution. It moves beyond mere response-ability (a system's ability to respond appropriately) towards proactivity (a system's ability to initiate and guide), which is essential for strong AI. Early attempts at proactivity were often limited to specific, simple interventions like topic introduction or suggestions. However, the integration of advanced deep learning techniques, reinforcement learning for policy learning, and sophisticated knowledge representation (e.g., knowledge graphs) has empowered proactive dialogue systems to tackle more complex, strategic, and motivational interactions, as highlighted in this survey. The field is moving from systems that merely respond to those that strategically engage and lead.

3.4. Differentiation Analysis

Compared to previous surveys in dialogue systems, this paper distinguishes itself by its singular focus on proactivity. While other surveys might cover open-domain, task-oriented, or information-seeking dialogues generally, or specific technical aspects like dialogue state tracking or response generation, this is explicitly presented as the "first survey to focus on proactive dialogue systems."

The core innovations and differences of this paper's approach are:

  1. Unified Definition of Proactivity: It provides a clear, derived definition of conversational agent's proactivity, grounding it in concepts from organizational behaviors, which helps unify understanding across diverse application areas.

  2. Cross-Domain Categorization: Instead of focusing on one type of dialogue, it systematically surveys proactivity across open-domain, task-oriented, and information-seeking dialogues, revealing common themes and distinct challenges.

  3. Problem-Centric View: It structures its analysis around "prominent problems" that proactivity aims to solve (e.g., target-guided conversation, prosocial dialogue, non-collaborative dialogue, clarification questions, user preference elicitation), rather than just surveying models.

  4. Forward-Looking Perspective: It dedicates a significant section to "Challenges and Prospects," offering a critical perspective on hybrid dialogues, evaluation protocols, and ethics that are specific to the proactive nature of these systems. This future-oriented view is crucial for guiding subsequent research.

    In essence, while foundational dialogue system concepts are well-surveyed, this paper carves out a new, important sub-field, systematically defining, categorizing, and charting the progress and future of proactive conversational AI.

4. Methodology

This paper is a survey, so its "methodology" is the systematic approach it takes to review and categorize existing research on proactive dialogue systems. The authors define proactivity and then structure their analysis around how this concept manifests in three major types of dialogue systems: open-domain, task-oriented, and information-seeking.

4.1. Principles

The core idea behind the survey's methodology is to provide a comprehensive, structured overview of proactivity in dialogue systems. The theoretical basis is the definition of proactivity itself: the capability of a conversational agent to "create or control the conversation by taking the initiative and anticipating the impacts on themselves or human users, rather than only passively responding to the users." The intuition is that moving beyond passive response-ability towards proactivity is essential for more intelligent, engaging, and goal-oriented conversational AI.

The survey's methodological principles include:

  1. Problem-Driven Analysis: Identifying specific problems (e.g., target-guided conversation, non-collaborative dialogue) where proactivity is a key solution.
  2. Solution-Oriented Review: Detailing advanced designs and methods developed to implement proactivity for these problems.
  3. Resource Mapping: Connecting problems and solutions to available data resources and evaluation protocols.
  4. Forward-Looking Critique: Discussing open challenges and future research directions to guide the community.

4.2. Core Methodology In-depth (Layer by Layer)

The survey systematically breaks down proactive dialogue systems into categories based on the type of dialogue and then further into specific problems within those categories. For each problem, it describes the definition, common methods, relevant datasets, and evaluation protocols.

The following figure (Figure 1 from the original paper) shows the classification structure of proactive dialogue systems:

Figure 1: Summary of proactive dialogue systems.

The following figure (Figure 2 from the original paper) illustrates examples for different problems in proactive dialogue systems:

Figure 2: Examples for different problems in proactive dialogue systems.

4.2.1. Proactive Open-domain Dialogue Systems

Open-domain dialogue systems (ODD) traditionally aim for general social interaction. Proactivity here means the system doesn't just echo user topics or emotions but actively guides the conversation.

4.2.1.1. Target-guided Dialogues

  • Problem Definition: Given a secret target ($t$) known only to the agent, the system must lead a conversation, starting from any initial topic, towards that target over multiple turns. The generated responses $\{u_n\}$ must ensure:

    1. Transition smoothness: Natural and appropriate content within the dialogue history.
    2. Target achievement: Successfully guiding the conversation to the designated target. The target can be a topical keyword [Tang et al., 2019], a knowledge entity [Wu et al., 2019], or a conversational goal [Liu et al., 2020]. The system maintains a set of candidate targets.
  • Methods: Involve three main subtasks:

    • Topic-shift Detection: Identifying when the user introduces a new topic.
      • Konigari et al. [2021] fine-tune XLNet-base to classify utterances into major, minor, or off-topic categories.
      • Xie et al. [2021] created TIAGE dataset with topic-shift annotations and proposed TSMANAGER, a T5-based model, to predict topic shifts.
    • Topic Planning: The core of target-guided dialogues, aiming to make the conversation follow a desired path.
      • Early works [Tang et al., 2019; Zhong et al., 2021] use discourse-level strategies constrained by keyword transitions.
      • To improve coherence, event knowledge graphs are constructed [Xu et al., 2020].
      • Latest studies [Yang et al., 2022] leverage external knowledge graphs and graph reasoning for better topic path planning.
      • Lei et al. [2022] learn topic transitions from user interactions instead of corpus-based methods.
    • Topic-aware Response Generation: Producing responses relevant to the planned topic path.
      • Kishinami et al. [2022] generate a complete responding plan to lead to the target.
      • Gupta et al. [2022] use a bridging path of commonsense knowledge concepts between current and target topics to generate transition responses.
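
To make the three subtasks concrete, here is a minimal, illustrative sketch of a single target-guided turn: plan a topic path from the user's current topic to the hidden target over a keyword-transition graph, then generate a transition response conditioned on the next planned topic. The toy graph, BFS planner, and template generator are hypothetical stand-ins for the learned components in the works cited above.

```python
from collections import deque

# Toy keyword-transition graph; real systems learn transitions from corpora,
# external knowledge graphs, or user interactions.
TOPIC_GRAPH = {
    "weather": ["travel", "sports"],
    "travel": ["food", "music"],
    "sports": ["music"],
    "food": ["music"],
    "music": [],
}

def plan_topic_path(current_topic, target):
    """Topic planning: breadth-first search for a short transition path."""
    queue, seen = deque([[current_topic]]), {current_topic}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in TOPIC_GRAPH.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # target unreachable from the current topic

def transition_response(user_topic, target):
    """Topic-aware response generation: a template stands in for an NLG model
    conditioned on the next planned topic."""
    path = plan_topic_path(user_topic, target)
    next_topic = path[1] if path and len(path) > 1 else target
    return f"Speaking of {user_topic}, that always makes me think of {next_topic}..."

print(plan_topic_path("weather", "music"))  # ['weather', 'travel', 'music']
print(transition_response("weather", "music"))
```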

4.2.1.2. Prosocial Dialogues

  • Problem Definition: Given a dialogue context (utterances $\{u_1, \dots, u_{t-1}\}$), the system first classifies the safety label ($y$) of the user's utterance and then generates an appropriate response ($u_t$) to mitigate problematic user utterances (e.g., unsafe, unethical, toxic) by leading the conversation in a prosocial manner (following social norms, benefiting others).

  • Methods:

    • Safety Detection: Identifying problematic user utterances.
      • Dinan et al. [2019] developed human-in-the-loop training for offensive utterance detection, improved by adversarial learning [Xu et al., 2021].
      • Baheti et al. [2021] fine-tuned offensive language detection classifiers on ToxiChat.
      • Kim et al. [2022] introduced a fine-grained safety classification schema (Needs Caution, Needs Intervention, Casual) to avoid social exclusion for minority users.
    • Rule-of-Thumb (RoT) Generation: Explaining why a statement is acceptable or problematic.
      • Forbes et al. [2020] introduced the SOCIAL-CHEM-101 corpus and the NORM TRANSFORMER for reasoning about social norms.
      • Ziems et al. [2022] proposed MORAL TRANSFORMER to generate new RoTs.
      • Kim et al. [2022] developed Canary, a sequence-to-sequence model generating both safety labels and RoTs.
    • Prosocial Response Generation: Actively generating appropriate responses.
      • Baheti et al. [2021] investigated controllable text generation to mitigate agreement with offensive user utterances.
      • Kim et al. [2022] proposed Prost to generate prosocial responses conditioned on RoTs and dialogue context.
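
A minimal sketch of the prosocial pipeline described above (safety detection, then RoT generation, then RoT-conditioned response generation). All three components are keyword/template placeholders standing in for learned models such as Canary and Prost; the sketch only illustrates the control flow.

```python
UNSAFE_CUES = ("cheat", "steal", "lie to")  # toy lexicon, not a real classifier

def classify_safety(utterance):
    """Safety detection stand-in; real systems predict fine-grained learned
    labels such as 'casual', 'needs caution', 'needs intervention'."""
    text = utterance.lower()
    return "needs caution" if any(cue in text for cue in UNSAFE_CUES) else "casual"

def generate_rot(utterance):
    """Rule-of-Thumb generation stand-in: returns a canned social norm."""
    return "It is wrong to deceive others for personal gain."

def prosocial_response(utterance):
    label = classify_safety(utterance)
    if label == "casual":
        return None  # no intervention needed; respond normally
    rot = generate_rot(utterance)
    # A learned generator would condition on both the dialogue context and the RoT.
    return f"I understand the pressure, but {rot[0].lower() + rot[1:]} Maybe there is another way?"

print(prosocial_response("Cheating is sometimes okay to pass a difficult class."))
```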

4.2.2. Proactive Task-oriented Dialogue Systems

TOD systems usually act as obedient assistants. Proactivity here allows them to handle non-collaborative tasks or enrich conversations with unrequested but useful information.

4.2.2.1. Non-collaborative Dialogues

  • Problem Definition: Under non-collaborative settings, the system and user have competing interests but aim for an agreement (e.g., price negotiation, persuasion). The goal is to generate a response ($u_t$) with an appropriate dialogue strategy ($s_t$) that leads to a consensus state, given the dialogue history $\{u_1, \dots, u_{t-1}\}$, previous strategies $\{s_1, \dots, s_{t-1}\}$, and the dialogue background ($c$). The dialogue strategy can be coarse- or fine-grained.

  • Methods:

    • Dialogue Strategy Learning: Beyond intent detection, this requires strategic reasoning.
      • He et al. [2018] decoupled strategy and generation to control dialogue strategy for different negotiation goals.
      • Zhou et al. [2020] used finite state transducers (FSTs) to predict the next strategy from effective sequences.
      • Advanced models like DIALOGRAPH [Joshi et al., 2021] with interpretable strategy-graph networks and RESPER [Dutt et al., 2021] with resisting-strategy modeling have been developed.
    • User Personality Modeling: Understanding human decision-making.
      • Yang et al. [2021] generated strategic dialogue by modeling opponent personality types using Theory of Mind (ToM).
      • Shi et al. [2021] developed DialGAIL, an RL-based generative algorithm with separate user/system profile builders to reduce repetition.
    • Persuasive Response Generation: Generating responses that lead to consensus.
      • Modularized [He et al., 2018] and end-to-end [Li et al., 2020; Wu et al., 2021] methods incorporate persuasive strategies.
      • Recent studies [Mishra et al., 2022; Samad et al., 2022] focus on building empathetic connections for persuasion.
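
As a toy illustration of dialogue strategy learning from annotated strategy sequences (in the spirit of the FST-based approach above, though far simpler), the sketch below fits a bigram model over strategy labels and predicts the most frequent next strategy. The strategy labels and training sequences are invented for the example.

```python
from collections import Counter, defaultdict

def train_bigram_policy(strategy_sequences):
    """Count strategy-to-strategy transitions in annotated dialogues."""
    transitions = defaultdict(Counter)
    for seq in strategy_sequences:
        for prev, nxt in zip(seq, seq[1:]):
            transitions[prev][nxt] += 1
    return transitions

def predict_next_strategy(transitions, last_strategy):
    """Predict the most frequent follow-up strategy seen in training."""
    counts = transitions.get(last_strategy)
    return counts.most_common(1)[0][0] if counts else None

policy = train_bigram_policy([
    ["greet", "propose", "agree"],
    ["greet", "propose", "agree"],
    ["greet", "propose", "counter", "agree"],
])
print(predict_next_strategy(policy, "propose"))  # -> 'agree'
```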

4.2.2.2. Enriched Task-oriented Dialogues

  • Problem Definition: The system proactively provides additional information not explicitly requested but useful to the user, improving the quality and effectiveness of functional service. The problem formulation is similar to general TODs, but responses should be both functionally accurate and socially engaging.

  • Methods:

    • Sun et al. [2021] created ACCENTOR by adding topical chit-chats to TOD responses.
    • SimpleTOD [Hosseini-Asl et al., 2020] was extended to SimpleTOD+ to handle enriched TODs with a new chit-chat dialogue action.
    • Zhao et al. [2022] developed UniDS, an end-to-end method with a unified dialogue data schema for both chit-chat and task-oriented dialogues.
    • Chen et al. [2022b] proposed KETOD for knowledge-grounded chit-chat regarding relevant entities, using a pipeline-based method called Combiner to reduce interference.

4.2.3. Proactive Conversational Information Seeking Systems

CIS systems fulfill user information needs. Proactivity eliminates uncertainty for efficient and precise information seeking by initiating subdialogues.

4.2.3.1. Asking Clarification Questions

  • Problem Definition: Clarifying ambiguity in user queries. Formulated as two subtasks [Aliannejadi et al., 2021]:

    1. Clarification need prediction: A binary classification problem to predict if a query is ambiguous.
    2. Clarification question generation: Generating the actual question, either by selection from a bank [Aliannejadi et al., 2019] or on the fly [Zamani et al., 2020].
  • Methods:

    • Aliannejadi et al. [2019] proposed NeuQS, a question retrieval-selection pipeline using BERT-based models for re-ranking.
    • Zamani et al. [2020] developed QCM, a reinforcement learning based method to generate questions by maximizing a clarification utility function.
    • Complete pipeline-based systems are presented by Aliannejadi et al. [2021] and Guo et al. [2021], with binary classification for clarification need followed by question generation.
    • Deng et al. [2022a] proposed UniPCQA, an end-to-end framework using a unified sequence-to-sequence formulation for clarification need prediction, question generation, and conversational question answering.
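
A minimal sketch of the two-subtask formulation above: clarification need prediction followed by clarification question generation (here, by selecting among known interpretations). The ambiguity lexicon and heuristics are hypothetical stand-ins for the learned classifiers and generators in the cited pipelines.

```python
# Toy ambiguity lexicon: maps an ambiguous mention to candidate interpretations.
AMBIGUOUS_TERMS = {"lion": ["the movie 'Lion'", "the animal"]}

def needs_clarification(query):
    """Subtask 1: binary clarification-need prediction (heuristic stand-in)."""
    return any(term in query.lower() for term in AMBIGUOUS_TERMS)

def clarification_question(query):
    """Subtask 2: question generation, here by selecting among interpretations."""
    for term, senses in AMBIGUOUS_TERMS.items():
        if term in query.lower():
            return f"Do you mean {senses[0]} or {senses[1]}?"
    return None

query = "Tell me about Lion."
if needs_clarification(query):
    print(clarification_question(query))  # Do you mean the movie 'Lion' or the animal?
```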

4.2.3.2. User Preference Elicitation

  • Problem Definition: Proactively acquiring user preferences by asking questions in conversational recommendation (e.g., "Which brand of laptop do you prefer?"), rather than passively learning from context. This involves predicting the item attribute to ask about next.

  • Methods:

    • Zhang et al. [2018b] designed PMMN (personalized multi-memory network) to incorporate user embeddings into next question prediction at the turn level.
    • For multi-turn interactions, recent works tackle user preference elicitation at the dialogue level as a multi-step decision making process using reinforcement learning (RL) [Deng et al., 2021; Zhang et al., 2022].
    • Deng et al. [2021] proposed UNICORN, a graph-based RL framework for policy learning that models real-time user preference with a dynamic weighted graph structure.
    • Zhang et al. [2022] proposed the MCMIPL framework for multi-choice question asking to efficiently obtain user preferences.
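
To illustrate dialogue-level preference elicitation as multi-step decision making, the sketch below repeatedly asks about the attribute that best discriminates the remaining candidate items (a crude information-gain proxy) until one item remains. It is a hand-written heuristic stand-in for learned RL policies such as UNICORN or MCMIPL, with an invented toy catalogue.

```python
# Toy catalogue; items are attribute dictionaries.
ITEMS = [
    {"name": "laptop A", "brand": "X", "size": "13in"},
    {"name": "laptop B", "brand": "X", "size": "15in"},
    {"name": "laptop C", "brand": "Y", "size": "15in"},
]

def best_attribute_to_ask(candidates, asked):
    """Ask about the attribute with the most distinct values among candidates
    (a crude proxy for information gain)."""
    attrs = [a for a in candidates[0] if a != "name" and a not in asked]
    return max(attrs, key=lambda a: len({item[a] for item in candidates}), default=None)

def elicit(user_prefs, max_turns=3):
    candidates, asked = ITEMS, set()
    for _ in range(max_turns):
        if len(candidates) <= 1:
            break
        attr = best_attribute_to_ask(candidates, asked)
        if attr is None:
            break
        asked.add(attr)
        answer = user_prefs[attr]  # simulated reply to "Which {attr} do you prefer?"
        candidates = [item for item in candidates if item[attr] == answer]
    return candidates[0]["name"] if candidates else None

print(elicit({"brand": "X", "size": "15in"}))  # -> 'laptop B'
```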

5. Experimental Setup

As this paper is a survey, it does not present its own experimental setup in the traditional sense of proposing a new model and evaluating it. Instead, it systematically reviews the datasets and evaluation protocols used by the proactive dialogue systems research it covers.

5.1. Datasets

The paper summarizes representative datasets used for evaluating various types of proactive dialogue systems.

The following are the results from [Table 1] of the original paper:

| Dataset | Problem | Language | #Dial. | #Turns | Featured Annotations |
| --- | --- | --- | --- | --- | --- |
| TGC [Tang et al., 2019] | Target-guided Dialogues | English | 9,939 | 11.35 | Turn-level Topical Keywords |
| DuConv [Wu et al., 2019] | Target-guided Dialogues | Chinese | 29,858 | 9.1 | Turn-level Entities & Dialogue-level Goals |
| MIC [Ziems et al., 2022] | Prosocial Dialogues | English | 38K | 2.0 | Rules of Thumb (RoTs) & Revised Responses |
| ProsocialDialog [Kim et al., 2022] | Prosocial Dialogues | English | 58K | 5.7 | Safety Labels and Reasons & RoTs |
| CraigslistBargain [He et al., 2018] | Non-collaborative Dialogues | English | 6,682 | 9.2 | Coarse Dialogue Acts |
| P4G [Wang et al., 2019] | Non-collaborative Dialogues | English | 1,017 | 10.43 | Dialogue Strategies |
| ACCENTOR [Sun et al., 2021] | Enriched Task-oriented Dialogues | English | 23.8K |  | Enriched Responses with Chit-chats |
| KETOD [Chen et al., 2022b] | Enriched Task-oriented Dialogues | English | 5,324 | 9.78 | Turn-level Entities & Enriched Responses with Knowledge |
| Abg-CoQA [Guo et al., 2021] | Asking Clarification Questions | English | 8,615 | 5.0 | Clarification Need Labels and Questions |
| PACIFIC [Deng et al., 2022a] | Asking Clarification Questions | English | 2,757 | 6.89 | Clarification Need Labels and Questions |

Here's a description of selected datasets:

  • TGC (Target-Guided Conversation) [Tang et al., 2019]:
    • Source/Characteristics: Constructed from Persona-Chat by removing persona information and defining targets as keywords. Targets are automatically extracted.
    • Domain: Open-domain dialogue, focused on guiding conversations towards specific keywords.
    • Example Data Sample: A conversation where the agent tries to steer the discussion towards a specific keyword like "Music" or "Blackpink." The dataset would contain turns of dialogue with an associated target keyword for each conversation. For instance, a dialogue might start about daily life and the agent tries to subtly introduce "music" as a topic.
  • DuConv [Wu et al., 2019]:
    • Source/Characteristics: Human-human conversations generated based on two linked entities from a grounded knowledge graph.
    • Domain: Open-domain, knowledge-driven, target-guided dialogues in Chinese.
    • Example Data Sample: A conversation about "pizza" might have the agent trying to steer towards "Italy" as a knowledge entity or "cooking" as a dialogue-level goal, leveraging facts from a knowledge graph.
  • MIC (Moral Integrity Conversation) [Ziems et al., 2022]:
    • Source/Characteristics: Manually annotated prompt-reply pairs (AI-generated responses to open queries) with Rules of Thumb (RoTs) from SOCIAL-CHEM-101 [Forbes et al., 2020].
    • Domain: Prosocial dialogues, evaluating the ethical and moral aspects of AI responses.
    • Example Data Sample: A user might say something like "I really don't feel like sharing my notes, even though my friend helped me study." An AI-generated response might be assessed with an RoT like "It's generally good to reciprocate kindness" to guide a revised, more prosocial response.
  • ProsocialDialog [Kim et al., 2022]:
    • Source/Characteristics: Human-AI collaboration where AI acts as a problematic user, and crowdworkers act as a prosocial agent. Includes safety labels, RoTs, and prosocial responses.
    • Domain: Prosocial dialogues, specifically for handling problematic user utterances.
    • Example Data Sample: A user (AI) might express a problematic statement (e.g., "Cheating is sometimes okay to pass a difficult class."). The dataset would contain the safety label for this statement, a relevant RoT (e.g., "Academic integrity is important"), and a prosocial agent's response that addresses the issue constructively.
  • CraigslistBargain [He et al., 2018]:
    • Source/Characteristics: Human-human conversations where workers negotiate the price of an item.
    • Domain: Non-collaborative, bargain negotiation in task-oriented dialogues.
    • Example Data Sample: A dialogue between a buyer and a seller over the price of a used bicycle, where each side has a different goal (buyer wants lower price, seller wants higher). Coarse Dialogue Acts (e.g., offer, accept, reject, propose) are annotated.
  • P4G (PERSUASIONFORGOOD) [Wang et al., 2019]:
    • Source/Characteristics: Persuasion conversations for charity donation, including user profiles and manual annotations of persuasion strategies and dialogue acts.
    • Domain: Non-collaborative, persuasion dialogues (e.g., convincing to donate).
    • Example Data Sample: A conversation where a system tries to persuade a user to donate to a charity, leveraging information about the user's interests (from their profile) and applying specific persuasion strategies (e.g., appeal to emotion, establish common ground).
  • ACCENTOR [Sun et al., 2021]:
    • Source/Characteristics: Augments task-oriented dialogues with topical chit-chats.
    • Domain: Enriched task-oriented dialogues, making interactions more engaging.
    • Example Data Sample: In a restaurant booking dialogue, after confirming details, the system might add a chit-chat like "I hope you enjoy your meal, Italian food is my favorite!"
  • KETOD (Knowledge-Enriched Task-Oriented Dialogue) [Chen et al., 2022b]:
    • Source/Characteristics: Provides turn-level entities and enriched responses with knowledge.
    • Domain: Enriched task-oriented dialogues with knowledge-grounded chit-chat.
    • Example Data Sample: A user asks to book a hotel in Paris. The system confirms and then proactively adds a knowledge-enriched chit-chat like "Did you know Paris is also known as the City of Love? There are many romantic spots near your hotel."
  • Abg-CoQA (Ambiguous Conversational Question Answering) [Guo et al., 2021]:
    • Source/Characteristics: Includes clarification need labels and questions.
    • Domain: Conversational Question Answering where ambiguity needs to be resolved.
    • Example Data Sample: User: "Tell me about 'Lion'." The system might classify this as ambiguous (movie, animal, king?) and generate a clarification question like "Are you referring to the movie 'Lion' or the animal?"
  • PACIFIC (Proactive Conversational Question Answering over Tabular and Textual Data in Finance) [Deng et al., 2022a]:
    • Source/Characteristics: Focuses on clarification need labels and questions over tabular and textual financial data.

    • Domain: Conversational QA in the finance domain, requiring clarification.

    • Example Data Sample: User: "What is the yield of Apple?" (Apple stock, apple orchard?). The system detects ambiguity and asks: "Are you referring to Apple Inc. stock or a type of fruit?"

      These datasets are chosen because they provide concrete examples and annotations for the specific proactive behaviors discussed (e.g., target tracking, prosocial interventions, negotiation strategies, clarification needs), making them effective for validating the performance of proactive dialogue systems.

5.2. Evaluation Metrics

The paper discusses a range of evaluation metrics, distinguishing between general dialogue system metrics and those specific to proactivity.

5.2.1. General Evaluation Metrics

  • BLEU, Dist-N, PPL: Commonly used for text generation tasks.
    • BLEU (Bilingual Evaluation Understudy): (See explanation in Section 3.2. for BLEU)
    • Dist-N (Distinct N-grams):
      • Conceptual Definition: Distinct N-grams (Dist-N) measures the diversity of generated responses by counting the number of unique n-grams. A higher Dist-N indicates greater diversity and less repetition in the generated text. Dist-1 measures unique unigrams, Dist-2 measures unique bigrams, etc.
      • Mathematical Formula: $ \text{Dist-N} = \frac{\text{Count of unique n-grams}}{\text{Total number of n-grams}} $
      • Symbol Explanation:
        • Count of unique n-grams: The number of distinct n-grams in the generated text.
        • Total number of n-grams: The total number of n-grams in the generated text.
    • PPL (Perplexity): (See explanation in Section 3.2. for PPL)
    • ROUGE: (See explanation in Section 3.2. for ROUGE)
    • F1 Score: (See explanation in Section 3.2. for F1 Score)
    • Accuracy: (See explanation in Section 3.2. for Accuracy)
    • ROC AUC (Receiver Operating Characteristic Area Under the Curve):
      • Conceptual Definition: ROC AUC is a performance measurement for classification problems at various threshold settings. ROC is a probability curve, and AUC represents the degree or measure of separability. It tells how much the model is capable of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
      • Mathematical Formula: $ \text{AUC} = \int_0^1 \text{TPR}(\text{FPR}) \, d(\text{FPR}) $ where TPR is the True Positive Rate (Sensitivity or Recall) and FPR is the False Positive Rate (1 - Specificity). $ \text{TPR} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $ and $ \text{FPR} = \frac{\text{False Positives}}{\text{False Positives} + \text{True Negatives}} $
      • Symbol Explanation:
        • TPR: True Positive Rate (or Recall), the proportion of actual positive cases that are correctly identified.
        • FPR: False Positive Rate, the proportion of actual negative cases that are incorrectly identified as positive.
        • True Positives: Instances correctly identified as positive.
        • False Positives: Instances incorrectly identified as positive.
        • True Negatives: Instances correctly identified as negative.
        • False Negatives: Instances incorrectly identified as negative.
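
Two of the metrics above are easy to illustrate in a few lines. First, a minimal sketch of Dist-N over a batch of generated responses:

```python
def dist_n(responses, n=2):
    """Unique n-grams divided by total n-grams across all generated responses."""
    grams = [tuple(toks[i:i + n])
             for toks in (resp.split() for resp in responses)
             for i in range(len(toks) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0

print(dist_n(["i like music", "i like movies"], n=1))  # 4 unique / 6 total ~= 0.667
```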
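
Second, a minimal sketch of ROC AUC computed via its rank-statistic interpretation: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, counting ties as one half.

```python
def roc_auc(y_true, scores):
    """AUC as the Mann-Whitney statistic: P(positive score > negative score)."""
    pos = [s for label, s in zip(y_true, scores) if label == 1]
    neg = [s for label, s in zip(y_true, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.35, 0.8]))  # -> 0.75
```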

5.2.2. Proactivity-Specific Evaluation Protocols

5.2.2.1. For Target-guided Dialogues (Turn-level & Dialogue-level)

  • Turn-level Evaluation:

    • P@K (Precision at K) and R@K (Recall at K): For keyword prediction in each turn.
      • Conceptual Definition: P@K measures the proportion of the top-K retrieved items (keywords in this case) that are relevant (actual targets). R@K measures the proportion of all relevant items that are found among the top-K retrieved items. These are commonly used in ranking and retrieval tasks.
      • Mathematical Formula: $ \text{P@K} = \frac{\text{Number of relevant items in top-K}}{\text{K}} $ $ \text{R@K} = \frac{\text{Number of relevant items in top-K}}{\text{Total number of relevant items}} $
      • Symbol Explanation:
        • $K$: The number of top items considered.
        • Number of relevant items in top-K: How many of the predicted top-K keywords match the actual target.
        • Total number of relevant items: The total number of actual target keywords for that turn (usually 1 for a single target).
    • Embedding-based correlation scores: Measure semantic similarity between predicted and target topics.
    • Proactivity/Smoothness: Human evaluation scores for how well the system introduces new topics while maintaining coherence.
  • Dialogue-level Evaluation: Often uses user simulators due to the cost of real user experiments.

    • SR@t (Success Rate at turn t): Success rate of achieving targets by the $t$-th turn.
      • Conceptual Definition: SR@t measures the cumulative percentage of dialogues where the system successfully achieved its target goal within a specified number of turns ($t$).
      • Mathematical Formula: $ \text{SR@t} = \frac{\text{Number of dialogues with target achieved by turn } t}{\text{Total number of dialogues}} $
      • Symbol Explanation:
        • $t$: The maximum number of turns allowed for target achievement.
        • Number of dialogues with target achieved by turn $t$: Count of dialogues where the target was reached within $t$ turns.
        • Total number of dialogues: Total number of evaluation dialogues.
    • #Turns: Average number of turns to reach the target.
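
A minimal sketch of these protocol metrics: P@K and R@K for turn-level keyword prediction, plus SR@t and average #Turns over (simulated) dialogues. The dialogue record format is a hypothetical structure for illustration.

```python
def precision_recall_at_k(ranked_keywords, relevant, k):
    """Turn-level P@K and R@K for predicted keywords against gold targets."""
    hits = sum(1 for kw in ranked_keywords[:k] if kw in relevant)
    return hits / k, hits / len(relevant)

def success_rate_at_t(dialogues, t):
    """SR@t: fraction of dialogues whose target was reached within t turns.
    Each record is a dict like {"target_turn": 3}, with None if never reached."""
    reached = sum(1 for d in dialogues
                  if d["target_turn"] is not None and d["target_turn"] <= t)
    return reached / len(dialogues)

def avg_turns(dialogues):
    """#Turns: average turns among dialogues that reached the target."""
    turns = [d["target_turn"] for d in dialogues if d["target_turn"] is not None]
    return sum(turns) / len(turns) if turns else float("inf")

dialogues = [{"target_turn": 3}, {"target_turn": 6}, {"target_turn": None}]
print(precision_recall_at_k(["music", "sports"], {"music"}, k=2))  # (0.5, 1.0)
print(success_rate_at_t(dialogues, t=4))                           # ~0.33
print(avg_turns(dialogues))                                        # 4.5
```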

5.2.2.2. For Prosocial Dialogues

  • Human evaluation or trained classification models are used to quantify attributes like agreement, respect, fairness, etc., alongside general text generation metrics (ROUGE, BLEU, PPL).

5.2.2.3. For Non-collaborative Dialogues

  • Dialogue strategy prediction accuracy: Measures how well the system predicts or learns optimal strategies (using Accuracy, F1, ROC AUC).
  • Human evaluation for persuasiveness, task success.

5.2.2.4. For User Preference Elicitation in Conversational Recommendation

  • Turn-level:
    • HR@k,t (Hit Ratio at k, turn t): For next question prediction (top-k predicted attributes at turn t).
      • Conceptual Definition: HR@k,t measures whether the correct item/attribute is among the top-k recommendations/predictions at turn $t$.
      • Mathematical Formula: Typically, HR@k is calculated as: $ \text{HR@k} = \frac{\text{Number of users for whom the target item is in top-k recommendations}}{\text{Total number of users}} $ For HR@k,t, this is computed at a specific turn $t$.
      • Symbol Explanation:
        • $k$: The size of the recommendation/prediction list.
        • $t$: The current turn of the conversation.
        • Number of users...: Count of users whose target preference was among the top-k predicted attributes for elicitation at turn $t$.
        • Total number of users: Total users in the evaluation.
    • General recommendation metrics (MRR@k,t, MAP@k,t, NDCG@k,t) for item recommendation based on elicited preferences.
      • MRR@k (Mean Reciprocal Rank at k):
        • Conceptual Definition: MRR is a statistic measure for evaluating any process that produces a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank is the inverse of the rank of the first correct answer. If no correct answer is found, the reciprocal rank is 0. MRR@k considers only the first correct answer up to rank kk.
        • Mathematical Formula: $ \text{MRR@k} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i} $, where $\frac{1}{\text{rank}_i}$ is taken as 0 if $\text{rank}_i > k$.
        • Symbol Explanation:
          • $|Q|$: The total number of queries (or sessions).
          • $\text{rank}_i$: The rank of the first relevant item for the $i$-th query.
          • $k$: The maximum rank to consider.
      • MAP@k (Mean Average Precision at k):
        • Conceptual Definition: MAP is a popular metric for evaluating ranked retrieval results. It computes the average precision for each relevant item in a ranked list and then averages these Average Precision values across all queries. MAP@k is MAP truncated at kk items.
        • Mathematical Formula: $ \text{MAP@k} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \text{AP@k}(i) $ where $\text{AP@k}(i)$ is the Average Precision for query $i$: $ \text{AP@k}(i) = \frac{1}{\min(R_i, k)} \sum_{j=1}^{k} P(j) \cdot \text{rel}(j) $ Here $P(j)$ is the precision at cut-off $j$ in the list, $\text{rel}(j)$ is an indicator function equal to 1 if the item at rank $j$ is relevant (0 otherwise), and $R_i$ is the number of relevant items for query $i$.
        • Symbol Explanation:
          • $|Q|$: The total number of queries.
          • $\text{AP@k}(i)$: Average Precision for the $i$-th query up to rank $k$.
          • $k$: The maximum rank to consider.
          • $P(j)$: Precision at rank $j$.
          • $\text{rel}(j)$: Indicator function equal to 1 if the item at rank $j$ is relevant, and 0 otherwise.
          • $R_i$: The number of relevant items for the $i$-th query.
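
Finally, a minimal sketch of these ranking metrics (hit ratio, reciprocal rank, and average precision, each truncated at k) for a single ranked list; averaging over queries or users at a given turn t yields HR@k,t, MRR@k,t, and MAP@k,t as reported. The sketch follows the standard definitions above rather than any single paper's exact variant.

```python
def hit_ratio_at_k(ranked, relevant, k):
    """HR@k: 1 if any relevant item appears in the top-k, else 0."""
    return int(any(item in relevant for item in ranked[:k]))

def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the first relevant item in the top-k (0 if none)."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ap_at_k(ranked, relevant, k):
    """Average precision at k, normalized by min(#relevant, k)."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # P(j) at each relevant rank j
    return precision_sum / min(len(relevant), k) if relevant else 0.0

ranked, relevant = ["size", "brand", "color"], {"brand"}
print(hit_ratio_at_k(ranked, relevant, 2))  # 1
print(mrr_at_k(ranked, relevant, 2))        # 0.5
print(ap_at_k(ranked, relevant, 2))         # 0.5
```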

1.1. Title

A Survey on Proactive Dialogue Systems: Problems, Methods, and Prospects

1.2. Authors

Yang Deng1^1, Wenqiang Lei2^2, Wai Lam3^3, Tat-Seng Chua1^1

1^1National University of Singapore 2^2Sichuan University 3^3The Chinese University of Hong Kong

1.3. Journal/Conference

Published on arXiv, a preprint server.

Comment on Venue's Reputation: arXiv is a widely respected open-access archive for preprints of scientific papers in fields like mathematics, physics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. While not a peer-reviewed journal or conference in itself, it serves as a crucial platform for early dissemination of research, allowing authors to share their work rapidly and gather feedback before or during formal peer review. Many highly influential papers are first published on arXiv.

1.4. Publication Year

2023

1.5. Abstract

This paper presents a comprehensive overview of proactive dialogue systems, which are conversational agents capable of guiding conversations towards predefined targets or system goals, rather than merely responding passively to user input. The survey highlights the significant problems encountered and the advanced design techniques employed for conversational agent's proactivity across various types of dialogues. It further discusses critical challenges related to real-world applications that demand increased future research. The authors express hope that this first survey on proactive dialogue systems will offer the research community accessible insights and a holistic understanding of this practical problem, thereby fostering further advancements in conversational AI.

https://arxiv.org/abs/2305.02750

Publication Status: This is a preprint available on arXiv.

https://arxiv.org/pdf/2305.02750v2.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem this paper aims to solve is the lack of proactivity in conventional dialogue systems. Traditional dialogue systems are primarily designed to be response-able, meaning they passively follow user-initiated conversations or fulfill user requests. This includes systems like open-domain dialogue systems, task-oriented dialogue systems, and conversational information-seeking systems.

This problem is important because the absence of proactivity hinders the development of truly intelligent and human-like conversational agents. The paper highlights several specific challenges and gaps in prior research:

  • Limited User Engagement and Service Efficiency: Passive systems may struggle to maintain engaging conversations or efficiently guide users towards specific outcomes.

  • Inability to Handle Complicated Tasks: Tasks requiring strategical and motivational interactions go beyond simple responsiveness.

  • Limitations of Current Advanced Systems: Even powerful models like ChatGPT exhibit limitations due to a lack of proactivity, such as providing randomly guessed answers to ambiguous queries or failing to handle problematic (harmful/biased) requests constructively.

  • Step Towards Strong AI: Proactivity is considered a significant step towards achieving strong AI that possesses autonomy and human-like consciousness.

    The paper's entry point is recognizing that while early attempts (e.g., introducing new topics or suggestions) identified the need, recent years have seen many advanced designs for conversational agent's proactivity across various task formulations and application scenarios. This survey aims to consolidate and categorize these diverse efforts.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  • First Comprehensive Survey: It provides the first systematic and comprehensive overview of proactive dialogue systems, summarizing recent studies across three common types of dialogues: open-domain dialogues, task-oriented dialogues, and information-seeking dialogues.

  • Categorization of Problems and Designs: For each dialogue type, it identifies prominent problems (e.g., target-guided dialogues, prosocial dialogues, non-collaborative dialogues, enriched task-oriented dialogues, asking clarification questions, user preference elicitation) and the advanced designs (methods) used to address them.

  • Resource and Evaluation Overview: It presents available data resources and commonly adopted evaluation protocols pertinent to each problem area, aiding researchers in accessing and assessing relevant work.

  • Identification of Future Research Directions: It discusses significant open challenges and promising research prospects, including proactivity in hybrid dialogues, evaluation protocols for proactivity, and ethics of conversational agent's proactivity.

    Key conclusions and findings reached by the paper include:

  • Proactivity is an essential property for intelligent conversations that can significantly improve user engagement and service efficiency.

  • The field has progressed beyond simple topic introduction to handle complex strategical and motivational interactions.

  • A diverse set of proactive capabilities has emerged, tailored to the specific goals of different dialogue types.

  • Despite advancements, challenges remain in areas such as seamlessly integrating multiple conversational goals (hybrid dialogues), developing robust and multi-disciplinary evaluation metrics, and addressing ethical considerations (factuality, morality, privacy) inherent in proactive systems.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a reader should be familiar with the following fundamental concepts:

  • Dialogue Systems: Broadly, dialogue systems (also known as conversational AI or chatbots) are computer programs designed to converse with human users using natural language. They aim to simulate human conversation.

    • Open-domain Dialogue Systems (ODD): These systems aim to engage in general conversations on a wide range of topics, often for social support or entertainment. Examples include chatbots like ChatGPT or Meena. Their primary goal is to maintain a coherent and engaging conversation.
    • Task-oriented Dialogue Systems (TOD): These systems are designed to help users accomplish specific tasks in particular domains, such as booking flights, making restaurant reservations, or providing customer service. They typically involve understanding user intent, managing dialogue state, and interacting with external databases or APIs.
    • Conversational Information-Seeking Systems (CIS): These systems facilitate finding information through natural language interactions. This includes conversational search, conversational recommendation, and conversational question answering, where the system helps refine queries or explore preferences.
  • Proactivity: In the context of dialogue systems, proactivity refers to the system's capability to take initiative, anticipate user needs or potential issues, and guide the conversation direction towards specific goals or targets from the system's side, rather than merely reacting to user input. It stems from the organizational behavior definition of initiating actions to create or control situations.

  • Natural Language Processing (NLP): A field of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language. Core NLP tasks relevant here include:

    • Dialogue Context Understanding: Interpreting the meaning and intent of user utterances within the ongoing conversation.
    • Response Generation: Creating natural and relevant textual responses.
    • Text Classification: Categorizing text into predefined labels (e.g., safety detection, topic-shift detection).
    • Sequence-to-Sequence (Seq2Seq) Models: A common architecture in NLP for tasks like machine translation and text summarization, where an input sequence is mapped to an output sequence. Many dialogue generation models are based on Seq2Seq.
  • Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. In dialogue systems, RL can be used to learn optimal dialogue policies (strategies) that guide the conversation flow, especially for tasks involving multi-step decision making or strategic interactions (a minimal toy sketch follows this list).

  • Knowledge Graphs: A structured representation of knowledge that stores information in a graph format, where nodes represent entities (e.g., people, places, concepts) and edges represent relationships between them. Knowledge graphs can be leveraged in dialogue systems for knowledge-grounded response generation, topic planning, or factuality checking.
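
To make the RL framing above concrete, here is a minimal tabular Q-learning sketch for a toy dialogue policy. The action set, states, reward, and hyperparameters are illustrative assumptions made for this analysis, not components of any system the survey covers.

```python
# A toy tabular Q-learning dialogue policy (illustrative assumptions throughout).
import random
from collections import defaultdict

ACTIONS = ["ask_clarification", "recommend", "chitchat"]   # assumed toy action set
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2  # learning rate, discount, exploration rate

Q = defaultdict(float)  # Q[(state, action)] -> estimated cumulative reward

def choose_action(state):
    """Epsilon-greedy policy: explore occasionally, otherwise exploit estimates."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """One-step Q-learning backup after observing the user's reaction."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# One simulated turn: the agent acts, the "user" rewards a successful recommendation.
a = choose_action("browsing")
update("browsing", a, reward=1.0 if a == "recommend" else 0.0, next_state="done")
```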

3.2. Previous Works

The paper contextualizes proactive dialogue systems by contrasting them with conventional, largely passive systems and acknowledging early attempts at proactivity.

  • Conventional Dialogue Research:

    • Focus on Response-ability: Most prior work emphasizes understanding dialogue context and generating appropriate responses.
    • Passive Nature: Systems are typically designed to follow user-oriented conversations or fulfill explicit user requests.
    • Examples of Passive Systems:
      • Open-domain dialogue systems like those described by [Zhang et al., 2018a] often aim for engaging chitchat without a specific system-driven goal.
      • Task-oriented dialogue systems (e.g., [Hosseini-Asl et al., 2020]) focus on efficiently completing user-defined tasks.
      • Conversational information-seeking systems (e.g., [Aliannejadi et al., 2019]) primarily respond to user queries for information.
  • Early Attempts at Proactivity:

    • Researchers recognized the need for improved interactions beyond pure reactivity.
    • [Li et al., 2016] explored enabling conversational agents to proactively introduce new topics.
    • [Yan and Zhao, 2018] investigated systems that could offer useful suggestions proactively.
    • These pioneering studies laid the groundwork, indicating a need for more robust problem settings and tangible applications for proactive dialogue systems.
  • Limitations of Modern Advanced Systems (e.g., ChatGPT):

    • Even highly advanced models like ChatGPT, a widely known large language model not explicitly cited in the paper, are noted to lack true proactivity.
    • They may passively provide randomly guessed answers to ambiguous queries, rather than proactively asking for clarification.
    • They can struggle to handle problematic requests (e.g., harmful or biased content) constructively, often relying on pre-programmed guardrails rather than sophisticated prosocial strategies.

3.3. Technological Evolution

The evolution of dialogue systems can be broadly seen as a progression from rule-based and retrieval-based systems to sophisticated neural network models.

  1. Early Systems (Rule-based/Retrieval-based): Focused on predefined scripts or retrieving responses from large corpora, offering limited flexibility and remaining domain-specific.
  2. Statistical/Machine Learning Models: Introduced more flexibility, especially for task-oriented dialogue, with components like natural language understanding (NLU), dialogue state tracking (DST), dialogue policy learning, and natural language generation (NLG).
  3. Deep Learning Era (2015-present):
    • End-to-End Models: Attempts to build dialogue systems that learn directly from raw text, often using sequence-to-sequence (Seq2Seq) models.

    • Pre-trained Language Models (PLMs): Emergence of powerful models like BERT, GPT, T5, XLNet, which, after pre-training on vast text corpora, can be fine-tuned for various dialogue tasks with remarkable performance. These models significantly improved context understanding and response generation.

    • Shift to Proactivity: This paper highlights a recent, crucial shift within the deep learning era: moving from passive response-ability to proactive guidance. This involves integrating elements like strategy learning, user modeling, and ethical considerations into the core design of dialogue systems.

      This paper's work fits within this technological timeline as a meta-analysis that synthesizes the advancements specifically aimed at integrating proactivity into the deep learning era of dialogue systems. It marks a coming-of-age for this particular sub-field, moving beyond foundational capabilities to more sophisticated, human-like interaction paradigms.

3.4. Differentiation Analysis

Compared to general surveys on dialogue systems, this paper's core differentiation and innovation lie in its explicit focus on proactivity.

  • Specific Focus: While other surveys might cover open-domain, task-oriented, or information-seeking dialogues generally, this paper specifically dissects how proactivity is implemented and studied within each of these categories.
  • Unified Definition: It provides a clear, unifying definition of conversational agent's proactivity, drawing from organizational behavior, and applies this lens across diverse dialogue tasks.
  • Categorization by Proactive Goal: Instead of just classifying systems by their domain (e.g., TOD vs. ODD), it further classifies them by their proactive goals within those domains (e.g., target-guided, prosocial, non-collaborative, clarification seeking). This provides a novel, goal-oriented perspective.
  • Identification of Unique Challenges: It specifically highlights challenges and prospects that are unique to proactive dialogue systems, such as evaluation protocols for proactivity (requiring multi-disciplinary approaches) and ethics of conversational agent's proactivity.
  • First of its Kind: The authors explicitly state, "To our knowledge, this survey is the first to focus on proactive dialogue systems," which underscores its unique contribution as a foundational review for this emerging field.

4. Methodology

As a survey paper, this document does not propose a novel methodology in the traditional sense of a new model or algorithm. Instead, its "methodology" lies in its structured approach to analyzing and categorizing existing research on proactive dialogue systems. The core idea is to break down the broad concept of proactivity into specific problems, methods, datasets, and evaluation protocols across different types of dialogue systems.

4.1. Principles

The core idea of this survey is to systematically review and categorize the rapidly growing body of research on proactive dialogue systems. The theoretical basis is derived from the definition of proactivity as the capability of an agent to create or control the conversation by taking the initiative and anticipating the impacts on themselves or human users, rather than only passively responding to the users. This principle guides the identification and classification of various proactive behaviors observed in dialogue systems.

The survey's structural principles are:

  1. Typology-based Classification: Organize research based on established dialogue system types (open-domain, task-oriented, information-seeking).
  2. Problem-centric Analysis: Within each type, identify specific problems that necessitate or benefit from proactivity.
  3. Methodological Deep-dive: For each problem, detail the advanced designs and techniques employed by researchers.
  4. Resource Mapping: Connect each problem with relevant data resources and evaluation protocols.
  5. Future-oriented Discussion: Highlight open challenges and future research directions for proactive dialogue systems.

4.2. Core Methodology In-depth (Layer by Layer)

The survey systematically categorizes proactive dialogue systems into three main types, each with specific problems and corresponding methods.

4.2.1. Proactive Open-domain Dialogue Systems

These systems aim to build long-term connections by proactively guiding conversations or addressing sensitive topics.

4.2.1.1. Target-guided Dialogues

Problem Definition: The system needs to proactively lead the conversation towards a designated target topic (e.g., a keyword, knowledge entity, or conversational goal) unknown to the user, ensuring transition smoothness and target achievement.

Methods:

  • Topic-shift Detection: Aims to identify changes in user topics.
    • Rachna et al. [2021] fine-tune XLNet-base for classifying utterances into major, minor, or off topics. XLNet is a generalized autoregressive pretraining method that combines the advantages of BERT (bidirectional context) and autoregressive models (predicting tokens one by one) by using a permutation language modeling objective.
    • Xie et al. [2021] propose TSMANAGER, a T5-based model to predict topic shifts after augmenting the PersonaChat dataset. T5 (Text-To-Text Transfer Transformer) is a transformer-based model that frames all NLP tasks as a text-to-text problem.
  • Topic Planning: The core task of defining a path to the target.
    • Early strategies [Tang et al., 2019; Zhong et al., 2021] used discourse-level keyword transitions (a toy sketch appears after this list).
    • Later, event knowledge graphs [Xu et al., 2020] were used to enhance coherence.
    • More advanced approaches [Yang et al., 2022] leverage external knowledge graphs with graph reasoning techniques for better topic path planning.
    • Reinforcement Learning (RL) is employed by Lei et al. [2022] to learn topic transitions directly from user interactions.
  • Topic-aware Response Generation: Producing responses that move the conversation towards the target.
    • Kishinami et al. [2022] generate a complete responding plan.
    • Gupta et al. [2022] use a bridging path of commonsense knowledge concepts between current and target topics to generate transition responses.
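
To illustrate the keyword-transition idea, here is a toy sketch that picks the next topical keyword by balancing transition smoothness (similarity to the current topic) against progress toward the target. The embedding table and threshold are made-up values for illustration; real systems use pretrained word vectors or graph reasoning over knowledge graphs.

```python
# A minimal sketch of discourse-level keyword transition for target-guided
# dialogue. The embeddings and smoothness threshold are toy assumptions,
# not values from any cited paper.
import numpy as np

# Hypothetical keyword embeddings (in practice, pretrained word vectors).
EMB = {
    "music":  np.array([0.9, 0.1, 0.0]),
    "guitar": np.array([0.8, 0.3, 0.1]),
    "travel": np.array([0.1, 0.9, 0.2]),
    "beach":  np.array([0.0, 0.8, 0.5]),
}

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def next_keyword(current, target, candidates, smooth_min=0.2):
    """Pick the candidate closest to the target among those that stay
    sufficiently similar to the current keyword (transition smoothness)."""
    smooth = [k for k in candidates if cos(EMB[k], EMB[current]) >= smooth_min]
    return max(smooth or candidates, key=lambda k: cos(EMB[k], EMB[target]))

print(next_keyword("music", "beach", ["guitar", "travel"]))  # -> "travel"
```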

4.2.1.2. Prosocial Dialogues

Problem Definition: Given a dialogue context, the system first classifies the safety label of user utterances and then generates a proper response to constructively mitigate problematic user statements, adhering to social norms.

Methods:

  • Safety Detection: Identifying problematic user utterances.
    • Dinan et al. [2019] use human-in-the-loop training for offensive utterance detection, improved by adversarial learning [Xu et al., 2021]. Adversarial learning involves a generator that creates synthetic data and a discriminator that tries to distinguish real from synthetic data, improving robustness.
    • Baheti et al. [2021] fine-tune offensive language detection classifiers on annotated datasets like ToxiChat.
    • Kim et al. [2022] introduce a fine-grained safety classification schema (Needs Caution, Needs Intervention, Casual) to avoid broad "unsafe" labels.
  • Rule-of-Thumb (RoT) Generation: Explaining why a statement is acceptable or problematic.
    • Forbes et al. [2020] propose NORM TRANSFORMER to reason about social norms from the SOCIAL-CHEM-101 corpus. A Transformer is a neural network architecture based on self-attention mechanisms.
    • Ziems et al. [2022] propose MORAL TRANSFORMER for generating new RoTs.
    • Kim et al. [2022] developed Canary, a sequence-to-sequence model that generates both safety labels and relevant RoTs.
  • Prosocial Response Generation: Generating helpful and socially responsible responses.
    • Baheti et al. [2021] investigate controllable text generation to prevent agreement with offensive content.
    • Kim et al. [2022] propose PROST to generate prosocial responses conditioned on RoTs and dialogue context.
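
A minimal inference sketch of the two-stage prosocial pipeline described above: first predict a fine-grained safety label, then generate a response conditioned on the dialogue context and a rule-of-thumb (RoT). The checkpoint names below are hypothetical placeholders, not released artifacts of the cited papers.

```python
from transformers import pipeline

# "safety-clf" and "rot-gen" are hypothetical fine-tuned checkpoints; we assume
# the classifier exposes the fine-grained labels below via its id2label config.
safety_clf = pipeline("text-classification", model="safety-clf")
responder = pipeline("text2text-generation", model="rot-gen")

def prosocial_reply(context: str, rot: str) -> str:
    label = safety_clf(context)[0]["label"]  # casual / needs_caution / needs_intervention
    if label == "casual":
        prompt = context                     # no intervention needed
    else:
        # Condition generation on the RoT so the reply pushes back constructively.
        prompt = f"context: {context} rot: {rot} label: {label}"
    return responder(prompt, max_new_tokens=64)[0]["generated_text"]
```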

4.2.2. Proactive Task-oriented Dialogue Systems

These systems go beyond simple task completion to handle non-collaborative scenarios or enrich conversations.

4.2.2.1. Non-collaborative Dialogues

Problem Definition: The system and user have competing interests or goals but aim for an agreement. The system needs to generate responses with appropriate dialogue strategies to achieve its goal (e.g., negotiation, persuasion).

Methods:

  • Dialogue Strategy Learning: Learning to select actions that steer the conversation.
    • He et al. [2018] decouple strategy and generation to control dialogue strategy for negotiation goals.
    • Zhou et al. [2020] use finite state transducers (FSTs) to predict the next strategy based on effective sequences of strategies. An FST is a finite-state machine that maps input sequences to output sequences.
    • Advanced models include DialoGraph [Joshi et al., 2021], which uses interpretable strategy-graph networks, and RESPER [Dutt et al., 2021] for modeling users' resisting strategies.
  • User Personality Modeling: Understanding human decision-making.
    • Yang et al. [2021] generate strategic dialogue by modeling and inferring personality types based on Theory of Mind (ToM). ToM is the ability to attribute mental states (beliefs, intentions, desires) to oneself and others.
    • Shi et al. [2021] develop DialGAIL, an RL-based generative algorithm with separate user and system profile builders to improve persuasion dialogues.
  • Persuasive Response Generation: Generating effective responses to achieve consensus.
    • Both modularized [He et al., 2018] and end-to-end [Li et al., 2020; Wu et al., 2021] methods incorporate persuasive dialogue strategies.
    • Recent work [Mishra et al., 2022; Samad et al., 2022] focuses on building empathetic connections for better persuasive responses.
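
As a toy illustration of FST-style strategy prediction, the sketch below encodes "effective sequences of strategies" as a transition table mapping (current strategy, observed user act) to the next system strategy. The table entries are invented for illustration, not learned from CraigslistBargain or any negotiation corpus.

```python
# A minimal finite-state-transducer-style sketch of next-strategy prediction
# for negotiation, loosely in the spirit of Zhou et al. [2020].
TRANSITIONS = {
    ("start", "greet"): "ask_price",
    ("ask_price", "propose_price"): "counter_offer",
    ("counter_offer", "reject"): "justify_price",
    ("counter_offer", "accept"): "close_deal",
}

def next_strategy(state: str, observed_user_act: str) -> str:
    # Fall back to a safe default when the sequence was never observed.
    return TRANSITIONS.get((state, observed_user_act), "ask_question")

print(next_strategy("counter_offer", "reject"))  # -> "justify_price"
```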

4.2.2.2. Enriched Task-oriented Dialogues

Problem Definition: Beyond functional accuracy, the system proactively provides additional information (e.g., chit-chats, knowledge) that is useful but not explicitly requested by the user, making interactions more engaging.

Methods:

  • Adding Topical Chit-chats:
    • Sun et al. [2021] create ACCENTOR by adding topical chit-chats to TOD responses.
    • SimpleTOD+ extends SimpleTOD [Hosseini-Asl et al., 2020] by introducing a chit-chat dialogue action.
    • UniDS [Zhao et al., 2022] is an end-to-end method using a unified dialogue data schema for both chit-chat and task-oriented dialogues.
  • Knowledge-grounded Chit-chats:
    • Chen et al. [2022b] propose KETOD for knowledge-grounded chit-chats regarding relevant entities.
    • Combiner is a pipeline-based method designed to reduce interference between dialogue state tracking and knowledge-enriched response generation.
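
A minimal sketch of what a unified dialogue data schema might look like, where chit-chat and task-oriented turns share one representation (in the spirit of UniDS). The field names are illustrative assumptions, not the actual schema of Zhao et al. [2022].

```python
# Illustrative unified turn representation covering both TOD and chit-chat.
from dataclasses import dataclass, field

@dataclass
class UnifiedTurn:
    user_utterance: str
    belief_state: dict = field(default_factory=dict)  # empty for pure chit-chat turns
    db_result: str = ""       # database lookup result; unused in chit-chat turns
    act: str = "chitchat"     # "chitchat" or a task-oriented dialogue act
    response: str = ""

dialogue = [
    UnifiedTurn("Any good jazz bars nearby?",
                belief_state={"type": "bar", "music": "jazz"},
                db_result="3 matches", act="inform",
                response="I found 3 jazz bars nearby."),
    UnifiedTurn("I've loved jazz since college.",
                response="Great taste! Two of them host live bands on weekends."),
]
```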

4.2.3. Proactive Conversational Information Seeking Systems

These systems proactively clarify ambiguities or elicit preferences to achieve more efficient and precise information seeking.

4.2.3.1. Asking Clarification Questions

Problem Definition: Clarifying potential ambiguity in user queries. Formulated into two subtasks: clarification need prediction and clarification question generation.

Methods:

  • Clarification Need Prediction: Typically a binary classification problem to predict if a query is ambiguous.
  • Clarification Question Generation:
    • NeuQS [Aliannejadi et al., 2019] uses a retrieval-selection pipeline to select questions from a question bank via BERT-based reranking. BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based model pre-trained on large text corpora, known for its ability to learn contextual representations of words.
    • QCM [Zamani et al., 2020] is an RL-based method to generate questions by maximizing a clarification utility function.
    • Pipeline-based systems [Aliannejadi et al., 2021; Guo et al., 2021] first predict clarification need then generate questions.
    • UniPCQA [Deng et al., 2022a] is an end-to-end framework using a unified sequence-to-sequence formulation for clarification need prediction, question generation, and conversational question answering.
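
The retrieval-selection idea can be sketched in a few lines: embed the ambiguous query, score it against a question bank, and return the best-matching clarification question. A sentence-transformers bi-encoder stands in here for the BERT-based reranking used in NeuQS, and the question bank is a toy assumption.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # lightweight public bi-encoder
question_bank = [
    "Do you mean the programming language or the snake?",
    "Which year are you asking about?",
    "Are you looking for flights or hotels?",
]
bank_emb = model.encode(question_bank, convert_to_tensor=True)

def select_clarification(query: str) -> str:
    """Score every bank question against the query and return the best match."""
    q_emb = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, bank_emb)[0]
    return question_bank[int(scores.argmax())]

print(select_clarification("Tell me about python."))
```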

4.2.3.2. User Preference Elicitation

Problem Definition: Proactively acquiring user preferences by asking questions in conversational recommendation systems.

Methods:

  • Turn-level Preference Elicitation:
    • PMMN (personalized multi-memory network) [Zhang et al., 2018b] incorporates user embeddings into next question prediction.
  • Dialogue-level Preference Elicitation (Multi-step Decision Making):
    • Reinforcement Learning (RL) frameworks are used to learn what questions to ask.
    • UNICORN [Deng et al., 2021] is a graph-based RL framework that models real-time user preferences with a dynamic weighted graph structure.
    • MCMIPL [Zhang et al., 2022] is proposed for efficiently obtaining user preferences by asking multi-choice questions.
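
A toy sketch of dialogue-level preference elicitation as multi-step decision making: at each turn the agent either asks about an attribute or, once the candidate set is narrowed down, recommends. The greedy ask-first policy and simulated user below are illustrative stand-ins for the learned RL policies cited above, not the graph-based policy of UNICORN.

```python
def elicit(user_likes, attributes, items, max_turns=5):
    """items maps item id -> set of attributes; user_likes is the (hidden)
    set of attributes the simulated user cares about."""
    confirmed, asked = set(), set()
    for turn in range(1, max_turns + 1):
        candidates = [i for i, attrs in items.items() if confirmed <= attrs]
        unasked = [a for a in attributes if a not in asked]
        if len(candidates) == 1 or not unasked:
            return f"turn {turn}: recommend {candidates[0]}"
        attr = unasked[0]          # greedy: ask attributes in a fixed order
        asked.add(attr)
        if attr in user_likes:     # simulated user confirms the attribute
            confirmed.add(attr)
        else:                      # user rejects: prune items with that attribute
            items = {i: s for i, s in items.items() if attr not in s}
    return f"no recommendation within {max_turns} turns"

items = {"itemA": {"jazz", "cheap"}, "itemB": {"rock", "cheap"},
         "itemC": {"jazz", "pricey"}}
print(elicit({"jazz", "cheap"}, ["jazz", "cheap", "rock"], items))
# -> "turn 3: recommend itemA"
```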

5. Experimental Setup

This section outlines the datasets and evaluation protocols discussed in the survey paper, reflecting the experimental setups commonly used in the research it reviews.

5.1. Datasets

The following are the results from Table 1 of the original paper:

| Dataset | Problem | Language | #Dial. | #Turns | Featured Annotations |
| --- | --- | --- | --- | --- | --- |
| TGC [Tang et al., 2019] | Target-guided Dialogues | English | 9,939 | 11.35 | Turn-level Topical Keywords |
| DuConv [Wu et al., 2019] | Target-guided Dialogues | Chinese | 29,858 | 9.1 | Turn-level Entities & Dialogue-level Goals |
| MIC [Ziems et al., 2022] | Prosocial Dialogues | English | 38K | 2.0 | Rules of Thumb (RoTs) & Revised Responses |
| ProsocialDialog [Kim et al., 2022] | Prosocial Dialogues | English | 58K | 5.7 | Safety Labels and Reasons & RoTs |
| CraigslistBargain [He et al., 2018] | Non-collaborative Dialogues | English | 6,682 | 9.2 | Coarse Dialogue Acts |
| P4G [Wang et al., 2019] | Non-collaborative Dialogues | English | 1,017 | 10.43 | Dialogue Strategies |
| ACCENTOR [Sun et al., 2021] | Enriched Task-oriented Dialogues | English | 23.8K | – | Enriched Responses with Chit-chats |
| KETOD [Chen et al., 2022b] | Enriched Task-oriented Dialogues | English | 5,324 | 9.78 | Turn-level Entities & Enriched Responses with Knowledge |
| Abg-CoQA [Guo et al., 2021] | Asking Clarification Questions | English | 8,615 | 5.0 | Clarification Need Labels and Questions |
| PACIFIC [Deng et al., 2022a] | Asking Clarification Questions | English | 2,757 | 6.89 | Clarification Need Labels and Questions |

Here's a more detailed description of some key datasets:

  • Target-guided Dialogues:
    • TGC (Target-Guided Conversation) [Tang et al., 2019]: Derived from Persona-Chat but without persona information. Targets are topical keywords extracted with rule-based methods. This dataset is created by labeling targets on existing conversations.
    • DuConv [Wu et al., 2019]: Consists of human-human conversations based on linked entities from a grounded knowledge graph. Targets are turn-level entities and dialogue-level goals. This dataset is constructed by generating conversations based on designated targets.
  • Prosocial Dialogues:
    • MIC (Moral Integrity Conversation) [Ziems et al., 2022]: Contains prompt-reply pairs manually annotated with Rules of Thumb (RoTs) from SOCIAL-CHEM-101, where each RoT serves as a moral judgment.
    • ProsocialDialog [Kim et al., 2022]: Constructed via a human-AI collaboration framework where AI plays a problematic user and crowdworkers act as prosocial agents. It includes safety labels, RoTs for problematic contexts, and prosocial responses.
  • Non-collaborative Dialogues:
    • CraigslistBargain [He et al., 2018]: Contains conversations where two workers (buyer and seller) negotiate the price of an item. Annotations include coarse dialogue acts.
    • P4G (PERSUASIONFORGOOD) [Wang et al., 2019]: Features persuasion conversations for charity donations, along with user profiles and manual annotations for persuasion strategies and dialogue acts.
  • Enriched Task-oriented Dialogues:
    • ACCENTOR [Sun et al., 2021]: Enriches task-oriented dialogues with topical chit-chats to make interactions more engaging.
    • KETOD (Knowledge-Enriched Task-Oriented Dialogue) [Chen et al., 2022b]: Focuses on knowledge-grounded chit-chats related to turn-level entities.
  • Asking Clarification Questions:
    • Abg-CoQA [Guo et al., 2021]: For conversational question answering, with clarification need labels and questions.
    • PACIFIC [Deng et al., 2022a]: For proactive conversational question answering over tabular and textual data, also with clarification need labels and questions.
    • Other mentioned datasets include Qulac [Aliannejadi et al., 2019] and ClariQ [Aliannejadi et al., 2021].
  • User Preference Elicitation:
    • Many studies use synthetic conversation data generated from product reviews [Zhang et al., 2018b] or purchase logs [Deng et al., 2021; Zhang et al., 2022]. The paper notes a demand for human-human conversations in this area.

      These datasets are chosen to validate methods across the diverse range of proactive dialogue system tasks, providing specific annotations (e.g., topical keywords, safety labels, dialogue strategies) that are crucial for training and evaluating models with proactive capabilities.

5.2. Evaluation Metrics

The survey details specific evaluation protocols for proactive dialogue systems, often complementing general dialogue system metrics (like BLEU, Dist-N, PPL) with task-specific measures.

5.2.1. Target-guided Dialogues

Evaluation is typically done at two levels:

  • Turn-level Evaluation:

    • P@K (Precision at K) and R@K (Recall at K) for keyword prediction: These metrics measure the accuracy of predicted keywords within the top $K$ candidates.
      • Conceptual Definition: Measures the proportion of relevant items (predicted keywords) among the top $K$ retrieved items, and the proportion of retrieved relevant items to the total number of relevant items.
      • Mathematical Formula (for a single query): $ P@K = \frac{\text{Number of relevant items in top } K}{K} $ $ R@K = \frac{\text{Number of relevant items in top } K}{\text{Total number of relevant items}} $
      • Symbol Explanation:
        • Number of relevant items in top K: The count of correct keywords among the first $K$ predicted keywords.
        • K: The number of top predicted keywords considered.
        • Total number of relevant items: The total number of actual target keywords for the turn.
    • Embedding-based correlation scores: Measures semantic similarity between generated responses/topics and target topics using vector embeddings.
    • Proactivity/Smoothness: Human evaluation scores to assess how well the system introduces new topics coherently.
  • Dialogue-level Evaluation: Typically uses user simulators due to the high cost of real user experiments.

    • SR@t (Success Rate at turn t): Measures the cumulative success rate of achieving the targets by the $t$-th turn.
      • Conceptual Definition: The percentage of dialogues where the system successfully reaches the designated target by a specific turn $t$.
      • Mathematical Formula: $ SR@t = \frac{\text{Number of dialogues where target is reached by turn } t}{\text{Total number of dialogues}} \times 100\% $
      • Symbol Explanation:
        • Number of dialogues where target is reached by turn t: Count of conversations where the system successfully guided the conversation to the target within $t$ turns.
        • Total number of dialogues: The total number of conversations in the evaluation set.
    • #Turns: The average number of turns required to reach the target.
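
A minimal sketch of the turn-level and dialogue-level metrics defined above, with toy inputs:

```python
# Toy implementations of P@K / R@K and SR@t as defined above.
def precision_recall_at_k(predicted, gold, k):
    top_k = predicted[:k]
    hits = sum(1 for kw in top_k if kw in gold)
    return hits / k, hits / len(gold)

def success_rate_at_t(success_turns, t):
    """success_turns[i] is the turn at which dialogue i reached its target,
    or None if it never did."""
    reached = sum(1 for s in success_turns if s is not None and s <= t)
    return 100.0 * reached / len(success_turns)

print(precision_recall_at_k(["music", "film", "sport"], {"music", "sport"}, k=2))
# -> (0.5, 0.5): one of the top-2 keywords is correct; one of two gold keywords found
print(success_rate_at_t([3, 5, None, 2], t=4))  # -> 50.0
```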

5.2.2. Prosocial Dialogues

  • Safety Detection: As a classification problem.
    • Accuracy:
      • Conceptual Definition: The proportion of correctly classified instances (both true positives and true negatives) out of the total number of instances.
      • Mathematical Formula: $ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $
      • Symbol Explanation:
        • TP: True Positives (correctly predicted positive).
        • TN: True Negatives (correctly predicted negative).
        • FP: False Positives (incorrectly predicted positive).
        • FN: False Negatives (incorrectly predicted negative).
    • F1 score:
      • Conceptual Definition: The harmonic mean of precision and recall, providing a balanced measure for classification performance, especially useful for imbalanced datasets.
      • Mathematical Formula: $ F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $ where $ \text{Precision} = \frac{TP}{TP + FP} $ and $ \text{Recall} = \frac{TP}{TP + FN} $
      • Symbol Explanation:
        • TP, TN, FP, FN: Same as for Accuracy.
        • Precision: The proportion of true positive predictions among all positive predictions.
        • Recall: The proportion of true positive predictions among all actual positives.
  • RoT Generation and Prosocial Response Generation:
    • ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
      • Conceptual Definition: A set of metrics for evaluating automatic summarization and machine translation. It works by comparing an automatically produced summary or translation with a set of reference summaries or translations. It measures overlap of n-grams, word sequences, and word pairs.
      • Mathematical Formula (ROUGE-N, for N-gram overlap): $ \text{ROUGE-N} = \frac{\sum_{S \in \text{References}} \sum_{\text{ngram}_N \in S} \text{Count}_{\text{match}}(\text{ngram}_N)}{\sum_{S \in \text{References}} \sum_{\text{ngram}_N \in S} \text{Count}(\text{ngram}_N)} $
      • Symbol Explanation:
        • $\text{ngram}_N$: An N-gram (contiguous sequence of $N$ items) from the text.
        • $\text{Count}_{\text{match}}(\text{ngram}_N)$: The maximum number of N-grams co-occurring in the candidate response and a reference summary.
        • $\text{Count}(\text{ngram}_N)$: The number of N-grams in the reference summary.
    • BLEU (Bilingual Evaluation Understudy):
      • Conceptual Definition: A metric for evaluating the quality of text which has been machine-translated from one natural language to another. It measures the correspondence between a machine's output and a human reference output.
      • Mathematical Formula: $ \text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) $ where $ BP = \min\left(1, \exp\left(1 - \frac{\text{Reference Length}}{\text{Candidate Length}}\right)\right) $ (Brevity Penalty) and $ p_n = \frac{\sum_{\text{ngram}_n \in \text{Candidate}} \min(\text{Count}(\text{ngram}_n, \text{Candidate}), \text{MaxCount}(\text{ngram}_n, \text{References}))}{\sum_{\text{ngram}_n \in \text{Candidate}} \text{Count}(\text{ngram}_n, \text{Candidate})} $ (modified n-gram precision)
      • Symbol Explanation:
        • BP: Brevity Penalty, penalizes short candidate translations.
        • $p_n$: Modified n-gram precision for n-grams of length $n$.
        • $w_n$: Weight for each n-gram precision (often uniform, e.g., $1/N$).
        • $N$: Maximum n-gram order considered (typically 4).
        • $\text{Count}(\text{ngram}_n, \text{Candidate})$: Count of n-grams in the candidate.
        • $\text{MaxCount}(\text{ngram}_n, \text{References})$: Maximum count of n-grams in any single reference.
    • PPL (Perplexity):
      • Conceptual Definition: A measure of how well a probability distribution or probability model predicts a sample. In NLP, it's used to evaluate language models; lower perplexity indicates a better model.
      • Mathematical Formula: $ \text{Perplexity}(W) = \sqrt[N]{\frac{1}{P(w_1, w_2, \ldots, w_N)}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i | w_1, \ldots, w_{i-1})}} $
      • Symbol Explanation:
        • $W = (w_1, w_2, \ldots, w_N)$: A sequence of $N$ words.
        • $P(w_1, w_2, \ldots, w_N)$: The probability of the entire sequence according to the language model.
        • $P(w_i | w_1, \ldots, w_{i-1})$: The probability of word $w_i$ given the preceding words, as estimated by the language model.
  • Human Evaluation: For quantifying attributes like agreement, respect, fairness, prosociality, etc.
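
A worked numeric check of two of the metrics above (F1 from raw counts, and perplexity from per-token probabilities), with toy inputs:

```python
import math

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def perplexity(token_probs):
    """Perplexity as the geometric-mean inverse probability of the sequence,
    computed in log space for numerical stability."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

print(round(f1_score(tp=8, fp=2, fn=4), 3))          # precision 0.8, recall 0.667 -> 0.727
print(round(perplexity([0.5, 0.25, 0.5, 0.25]), 3))  # -> 2.828
```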

5.2.3. Non-collaborative Dialogues

  • Strategy Learning:
    • Accuracy: See above definition.
    • F1 score: See above definition.
    • ROC AUC (Receiver Operating Characteristic Area Under the Curve):
      • Conceptual Definition: A measure of a classifier's performance across all possible classification thresholds. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings. An AUC of 1 represents a perfect classifier, while 0.5 represents a random classifier.
      • Mathematical Formula: The AUC is the area under the ROC curve. There is no simple closed-form formula, it's calculated by numerically integrating the area under the curve formed by plotting TPR vs FPR for different thresholds. $ \text{TPR} = \text{Recall} = \frac{TP}{TP + FN} $ $ \text{FPR} = \frac{FP}{FP + TN} $
      • Symbol Explanation:
        • TP, TN, FP, FN: Same as for Accuracy and F1.
        • TPR: True Positive Rate, also known as Recall or Sensitivity.
        • FPR: False Positive Rate, the proportion of negative instances incorrectly classified as positive.
  • Response Generation:
    • Human evaluation for persuasiveness, task success, etc.
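
A quick numeric check of the ROC AUC definition above, using scikit-learn's implementation on toy labels and scores:

```python
# Toy check: two negatives scored 0.1 and 0.4, two positives scored 0.35 and 0.8.
# Three of the four (negative, positive) score pairs are correctly ordered -> AUC 0.75.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]   # classifier confidence for the positive class
print(roc_auc_score(y_true, y_score))  # -> 0.75
```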

5.2.4. Enriched Task-oriented Dialogues

  • General TOD metrics for functional accuracy.
  • Human evaluation for engagement and interactivity of the enriched responses.

5.2.5. Conversational Information Seeking Systems

  • Asking Clarification Questions:
    • Binary classification metrics (Accuracy, F1) for clarification need prediction.
    • General text generation metrics (ROUGE, BLEU, PPL) for question generation.
    • Task-specific metrics depending on the application (e.g., improved search success rate in conversational search).

5.2.6. User Preference Elicitation

  • Turn-level:
    • HR@k,t (Hit Ratio at K, turn t):
      • Conceptual Definition: Measures whether the correct item (or attribute) is among the top $K$ recommendations (or predictions) at turn $t$.
      • Mathematical Formula: $ HR@k,t = \frac{\text{Number of hits in top } K \text{ at turn } t}{\text{Total number of interactions at turn } t} $
      • Symbol Explanation:
        • Number of hits in top K at turn t: Count of times the target item/attribute was found in the top $K$ predictions/recommendations at turn $t$.
        • Total number of interactions at turn t: Total number of prediction/recommendation instances at turn $t$.
    • MRR@k,t (Mean Reciprocal Rank at K, turn t):
      • Conceptual Definition: A statistic for evaluating any process that produces a list of possible responses to a query, ordered by probability of correctness. The reciprocal rank is $1/\text{rank}$; MRR is the average of the reciprocal ranks over a set of queries.
      • Mathematical Formula: $ \text{MRR@}k,t = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \frac{\mathbb{1}[\text{rank}_q \le k]}{\text{rank}_q} $
      • Symbol Explanation:
        • $|Q|$: Number of queries (or instances) in the set.
        • $\text{rank}_q$: Rank position of the first relevant item for query $q$.
        • $k$: Maximum rank position to consider; queries whose first relevant item ranks below $k$ contribute 0.
    • MAP@k,t (Mean Average Precision at K, turn t):
      • Conceptual Definition: A popular metric for evaluating ranked retrieval results. It computes the average precision for each query (considering only ranks up to $K$) and then averages these across all queries.
      • Mathematical Formula: $ \text{MAP@}k,t = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \text{AP}_q(k) $ where $ \text{AP}_q(k) = \frac{\sum_{i=1}^{k} P(i) \times \text{rel}(i)}{\text{Number of relevant items up to } k} $
      • Symbol Explanation:
        • $|Q|$: Number of queries.
        • $\text{AP}_q(k)$: Average Precision for query $q$ up to rank $k$.
        • P(i): Precision at cut-off $i$.
        • $\text{rel}(i)$: An indicator function, 1 if the item at rank $i$ is relevant, 0 otherwise.
    • NDCG@k,t (Normalized Discounted Cumulative Gain at K, turn t):
      • Conceptual Definition: Measures the usefulness, or gain, of a result based on its position in the ranked list. The gain is accumulated from the top of the list to the bottom, with the gain of each result discounted at lower ranks.
      • Mathematical Formula: $ \text{DCG}_k = \sum_{i=1}^{k} \frac{2^{\text{rel}_i} - 1}{\log_2(i+1)} $, $ \text{IDCG}_k = \sum_{i=1}^{k} \frac{2^{\text{rel}_{i,\text{ideal}}} - 1}{\log_2(i+1)} $, $ \text{NDCG}_k = \frac{\text{DCG}_k}{\text{IDCG}_k} $
      • Symbol Explanation:
        • $\text{rel}_i$: The relevance score of the result at position $i$.
        • $\text{rel}_{i,\text{ideal}}$: The relevance score of the result at position $i$ in the ideal ranking.
        • $\text{DCG}_k$: Discounted Cumulative Gain at rank $k$.
        • $\text{IDCG}_k$: Ideal Discounted Cumulative Gain at rank $k$.
  • Dialogue-level:
    • SR@t (Success Rate at turn t): Measures the cumulative ratio of successful recommendations by turn $t$ (analogous to target-guided dialogues).
    • AT (Average Turns): The average number of turns required for all sessions to reach a successful recommendation.
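
A minimal sketch of the turn-level ranking metrics above for a single ranked list with one relevant item (binary relevance), with toy inputs:

```python
import math

def hit_ratio_at_k(ranked, target, k):
    return 1.0 if target in ranked[:k] else 0.0

def mrr_at_k(ranked, target, k):
    return next((1.0 / (i + 1) for i, x in enumerate(ranked[:k]) if x == target), 0.0)

def ndcg_at_k(ranked, target, k):
    # Binary relevance with a single relevant item: IDCG = 1/log2(2) = 1,
    # so NDCG reduces to the discounted gain at the item's rank.
    return sum(1.0 / math.log2(i + 2) for i, x in enumerate(ranked[:k]) if x == target)

ranked = ["itemB", "itemA", "itemC"]
print(hit_ratio_at_k(ranked, "itemA", 3),   # -> 1.0
      mrr_at_k(ranked, "itemA", 3),         # -> 0.5 (rank 2)
      round(ndcg_at_k(ranked, "itemA", 3), 3))  # -> 0.631
```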

5.3. Baselines

As a survey paper, this document does not present its own experimental results or compare its proposed method against baselines. Instead, it reviews and categorizes the methods and comparisons made within the individual research papers it surveys. Within those surveyed papers, baselines would typically include:

  • Passive Dialogue Systems: Simple response-only models, or task-oriented systems that only follow explicit user instructions.

  • Rule-based Systems: For strategy learning or topic management, where rules are handcrafted.

  • Simpler Neural Models: For generation tasks, models without specific proactive components (e.g., Seq2Seq without strategy networks or knowledge grounding).

  • Ablations: Removing proactive components from a proposed model to show their effectiveness.

    The representativeness of these baselines stems from their role as established or simpler approaches, against which the proactive elements of newer models can demonstrate their added value.

6. Results & Analysis

This section synthesizes the findings and trends reported in the survey paper regarding the effectiveness and characteristics of various proactive dialogue systems. Since this is a survey, it does not present new experimental results but rather analyzes the landscape of existing research.

6.1. Core Results Analysis

The survey effectively demonstrates that integrating proactivity into dialogue systems leads to significant advancements in user interaction and task accomplishment across diverse domains.

  • Effectiveness in Open-domain Dialogues:

    • Target-guided dialogues show that systems can successfully steer conversations towards predefined topics or goals, moving beyond simple chitchat. The use of knowledge graphs and reinforcement learning for topic planning highlights a sophisticated approach to managing conversational flow.
    • Prosocial dialogues illustrate the critical role of proactivity in handling problematic user input, shifting from passive acceptance to constructive intervention. This ensures safety and ethical interaction, which are crucial for real-world deployment. The development of safety detection and Rule-of-Thumb generation mechanisms indicates a growing maturity in addressing social complexities.
  • Effectiveness in Task-oriented Dialogues:

    • For non-collaborative dialogues, proactivity is essential for systems to achieve their own objectives (e.g., negotiation, persuasion) rather than just fulfilling user requests. Techniques like dialogue strategy learning and user personality modeling allow for more sophisticated, goal-oriented interactions, leading to successful conflict resolution or persuasion.
    • Enriched task-oriented dialogues show that proactivity enhances user experience by providing useful supplementary information or engaging chit-chats not explicitly requested. This improves the quality and effectiveness of the service, making the system more human-like and helpful.
  • Effectiveness in Conversational Information Seeking Systems:

    • Asking clarification questions directly addresses the ambiguity inherent in natural language queries, leading to more efficient and precise information retrieval. Proactively seeking clarification prevents irrelevant results and improves user satisfaction.
    • User preference elicitation allows conversational recommendation systems to actively learn user needs, moving beyond passive observation of past behavior. This leads to more personalized and accurate recommendations, improving the final recommendation results.
  • Advantages and Disadvantages of Proactive Approaches:

    • Advantages:
      • Improved User Engagement: By taking initiative, systems can maintain more dynamic and interesting conversations.
      • Increased Efficiency: Proactively clarifying or guiding can reduce turns and lead to faster task completion or information discovery.
      • Enhanced Goal Achievement: Systems can ensure their own objectives (e.g., sales, safety) are met alongside user goals.
      • More Human-like Interaction: Mimicking human conversational habits of leading and steering.
      • Handling Complex Scenarios: Enables systems to tackle non-collaborative, ambiguous, or sensitive situations.
    • Disadvantages (implied by challenges):
      • Complexity: Designing proactive systems requires more sophisticated dialogue policies, user modeling, and strategy learning.

      • Evaluation Difficulty: Measuring the success of proactivity is inherently complex and often relies on costly human evaluation or user simulators.

      • Ethical Risks: The power to lead or influence conversations introduces risks related to factuality, morality, and privacy.

        The survey highlights a clear trend towards more intelligent, goal-driven, and socially aware conversational AI through proactivity.

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

| Dataset | Problem | Language | #Dial. | #Turns | Featured Annotations |
| --- | --- | --- | --- | --- | --- |
| TGC [Tang et al., 2019] | Target-guided Dialogues | English | 9,939 | 11.35 | Turn-level Topical Keywords |
| DuConv [Wu et al., 2019] | Target-guided Dialogues | Chinese | 29,858 | 9.1 | Turn-level Entities & Dialogue-level Goals |
| MIC [Ziems et al., 2022] | Prosocial Dialogues | English | 38K | 2.0 | Rules of Thumb (RoTs) & Revised Responses |
| ProsocialDialog [Kim et al., 2022] | Prosocial Dialogues | English | 58K | 5.7 | Safety Labels and Reasons & RoTs |
| CraigslistBargain [He et al., 2018] | Non-collaborative Dialogues | English | 6,682 | 9.2 | Coarse Dialogue Acts |
| P4G [Wang et al., 2019] | Non-collaborative Dialogues | English | 1,017 | 10.43 | Dialogue Strategies |
| ACCENTOR [Sun et al., 2021] | Enriched Task-oriented Dialogues | English | 23.8K | – | Enriched Responses with Chit-chats |
| KETOD [Chen et al., 2022b] | Enriched Task-oriented Dialogues | English | 5,324 | 9.78 | Turn-level Entities & Enriched Responses with Knowledge |
| Abg-CoQA [Guo et al., 2021] | Asking Clarification Questions | English | 8,615 | 5.0 | Clarification Need Labels and Questions |
| PACIFIC [Deng et al., 2022a] | Asking Clarification Questions | English | 2,757 | 6.89 | Clarification Need Labels and Questions |

This table effectively summarizes the key datasets available for research in proactive dialogue systems. It highlights:

  • Diversity of Problems: Datasets exist for all major proactive problems identified (Target-guided, Prosocial, Non-collaborative, Enriched TOD, Clarification Questions).
  • Language Coverage: While predominantly English, DuConv provides a valuable Chinese resource for target-guided dialogues.
  • Scale and Turn Length: The datasets vary significantly in scale (from ~1K to 58K dialogues) and average turns, reflecting the complexity and data collection efforts for different tasks. MIC has a very low average turn count (2.0), suggesting it might focus on single problematic utterances and immediate responses rather than extended dialogues.
  • Rich Annotations: The Featured Annotations column is crucial, showing that these datasets provide specific labels (e.g., topical keywords, RoTs, safety labels, dialogue acts, clarification need labels, entities) essential for training and evaluating proactive capabilities. This indicates a strong foundation for supervised and reinforcement learning approaches.

6.3. Ablation Studies / Parameter Analysis

As a survey paper, this document does not conduct its own ablation studies or parameter analyses. Such analyses are typically performed within individual research papers to validate the contribution of specific components of a proposed model or to optimize its performance. The survey's role is to report on the types of methods and approaches taken in the field, including those that might have been validated via ablation studies in their original publications.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper serves as the first comprehensive survey on proactive dialogue systems, offering a structured overview of the field's problems, methods, and future directions. It meticulously categorizes existing research into three main types of dialogues: open-domain, task-oriented, and conversational information-seeking. Within these categories, the authors delineate specific proactive problems such as target-guided dialogues, prosocial dialogues, non-collaborative dialogues, enriched task-oriented dialogues, asking clarification questions, and user preference elicitation. For each problem, the survey details the advanced designs and techniques employed, along with representative datasets and evaluation protocols. The overarching conclusion is that proactivity is a crucial, emerging property for conversational AI that enables more engaging, efficient, and sophisticated interactions, moving systems beyond passive responsiveness towards achieving strategic goals and human-like intelligence.

7.2. Limitations & Future Work

The authors highlight several critical challenges and promising research directions for the future:

  • Proactivity in Hybrid Dialogues:

    • Limitation: Current dialogue systems often assume a single conversational goal. Real-world interactions, however, involve multiple, varied objectives and require natural transitions between different dialogue types (e.g., shifting from chit-chat to task completion and then to recommendation). Few studies have adequately addressed this.
    • Future Work: More research is needed to ensure natural and smooth transitions among different types of dialogues and to proactively discover user interests for guiding multi-goal conversations. This includes developing systems that can adapt to changing conversational goals without losing performance in any specific dialogue type. The emergence of datasets like DuRecDial, FusedChat, SalesBot, and OB-MultiWOZ indicates a growing interest in this area.
  • Evaluation Protocols for Proactivity:

    • Limitation: Evaluating proactivity is complex and often relies on expensive human evaluation. While user simulators offer an alternative, robust and effective evaluation metrics are still lacking. Traditional dialogue metrics are insufficient, as proactivity involves aspects from human-computer interaction, sociology, and psychology.
    • Future Work: There is an urgent need for more effective and robust multidisciplinary evaluation protocols. This involves designing metrics that can accurately quantify aspects like smoothness, persuasiveness, safety, and goal achievement from a proactive stance, perhaps integrating quantitative measures with qualitative user experience insights.
  • Ethics of Conversational Agent's Proactivity:

    • Limitation: The ability to proactively guide conversations is a "double-edged sword." While beneficial, it introduces significant ethical concerns that are often overlooked.
    • Future Work: Researchers must prioritize responsible AI. This includes focusing on:
      • Factuality: Ensuring the accuracy of system-initiative information and external knowledge to prevent hallucinations or factual incorrectness.
      • Morality: Addressing issues beyond general toxic language and social bias, specifically focusing on aggressiveness in non-collaborative conversations and promoting prosocial and empathetic interactions.
      • Privacy: Heightened concerns arise regarding the misuse of personal information when agents proactively engage with user data. Robust privacy-preserving mechanisms are essential.

7.3. Personal Insights & Critique

This survey is a timely and valuable contribution to the conversational AI community. Its primary strength lies in providing a clear, systematic framework for understanding proactivity, a concept that has been implicitly present but rarely explicitly defined and categorized across the diverse landscape of dialogue systems.

Inspirations:

  • Paradigm Shift: The paper reinforces the idea that conversational AI is moving beyond mere functional task completion or social chitchat towards becoming strategic, ethical, and truly intelligent partners. This paradigm shift from reactive to proactive is critical for AI adoption in complex real-world scenarios.
  • Cross-pollination: The categorization highlights how solutions from one proactive problem (e.g., strategy learning in non-collaborative dialogues) could potentially inspire methods in another (e.g., topic planning in target-guided dialogues), fostering interdisciplinary research.
  • Human-Centric AI: The emphasis on prosociality and ethics underscores the growing importance of designing AI that not only performs well but also interacts responsibly and beneficially with humans.

Potential Issues/Areas for Improvement (Critique):

  • Definition of "Proactivity" Nuances: While providing a definition, the spectrum of proactivity can be subtle. Is a system that offers a suggestion after a user query proactive, or merely responsive with added value? A deeper exploration of degrees of proactivity might be beneficial (e.g., mild suggestions vs. strong steering).

  • Practical Implementation Challenges beyond Research: The survey focuses on research problems. Real-world deployment faces additional hurdles like latency, scalability, and robustness to out-of-domain inputs when proactive strategies are involved. These practical considerations could be discussed more.

  • User Acceptance and Control: While proactivity can improve efficiency, excessive or poorly executed proactivity can be perceived as intrusive or annoying by users. The balance between system initiative and user control is a crucial design challenge, and its evaluation is complex. This aspect is touched upon in the ethics section but could be expanded.

  • Lack of Unified Benchmarks: Although many datasets are listed, the diversity of evaluation metrics and task formulations across different proactive problems suggests a lack of truly unified benchmarks. This makes direct comparison across different proactive capabilities difficult, hindering holistic progress.

    Overall, this survey successfully charts the emerging landscape of proactive dialogue systems, providing an invaluable resource for researchers and practitioners alike. It effectively sets the stage for the next generation of conversational AI that is not only intelligent but also intentional and responsible.
