
ProactiveEval: A Unified Evaluation Framework for Proactive Dialogue Agents

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

ProactiveEval unifies evaluation of LLMs' proactive dialogue skills by decomposing tasks into target planning and dialogue guidance, providing 328 cross-domain environments and diverse data. Experiments on 22 LLMs identify top models and link reasoning ability to proactive behavior.

Abstract

Proactive dialogue has emerged as a critical and challenging research problem in advancing large language models (LLMs). Existing works predominantly focus on domain-specific or task-oriented scenarios, which leads to fragmented evaluations and limits the comprehensive exploration of models' proactive conversation abilities. In this work, we propose ProactiveEval, a unified framework designed for evaluating proactive dialogue capabilities of LLMs. This framework decomposes proactive dialogue into target planning and dialogue guidance, establishing evaluation metrics across various domains. Moreover, it also enables the automatic generation of diverse and challenging evaluation data. Based on the proposed framework, we develop 328 evaluation environments spanning 6 distinct domains. Through experiments with 22 different types of LLMs, we show that DeepSeek-R1 and Claude-3.7-Sonnet exhibit exceptional performance on target planning and dialogue guidance tasks, respectively. Finally, we investigate how reasoning capabilities influence proactive behaviors and discuss their implications for future model development.


1. Bibliographic Information

1.1. Title

ProactiveEval: A Unified Evaluation Framework for Proactive Dialogue Agents

1.2. Authors

Tianjian Liu, Fanqi Wan, Jiajian Guo, Xiaojun Quan

1.3. Journal/Conference

The paper is published as a preprint on arXiv. It is common for cutting-edge research in fields like large language models to first appear on arXiv to facilitate rapid dissemination and feedback before formal peer review and publication in a conference or journal.

1.4. Publication Year

2025 (published on arXiv on 2025-08-28)

1.5. Abstract

This paper introduces ProactiveEval, a unified evaluation framework designed to assess the proactive dialogue capabilities of large language models (LLMs). The framework addresses the fragmentation in existing evaluations, which often focus on domain-specific or task-oriented scenarios. ProactiveEval decomposes proactive dialogue into two core tasks: target planning and dialogue guidance, establishing cross-domain evaluation metrics. It also features an automatic generation mechanism for diverse and challenging evaluation data, utilizing a hierarchical environment topic tree, target ensemble techniques, and adversarial strategies like obfuscation rewriting and noise injection. The authors developed 328 evaluation environments across 6 distinct domains. Through experiments with 22 different LLMs, the study identifies DeepSeek-R1 and Claude-3.7-Sonnet as top performers in target planning and dialogue guidance, respectively. Finally, the paper investigates the influence of reasoning capabilities (often referred to as "thinking behavior") on proactive behaviors, discussing implications for future model development.

https://arxiv.org/abs/2508.20973

https://arxiv.org/pdf/2508.20973v1.pdf

2. Executive Summary

2.1. Background & Motivation

The field of large language models (LLMs) has seen remarkable advancements, particularly in their ability to engage in dialogue. However, most LLM-powered dialogue agents operate in a reactive manner, meaning they primarily respond to user-initiated queries and guidance. This user-initiated paradigm places a significant cognitive burden on users, requiring them to integrate complex context (personal state, environment, agent's information) and guide the conversation actively. This can lead to reduced user engagement and limits the LLM's potential for autonomous problem-solving.

To address these limitations, proactive dialogue agents have emerged as a critical area of research. These agents are designed to anticipate user needs, formulate adaptive plans, and guide conversations towards specific targets without explicit user requests. While various studies have explored methods to enhance the proactive capabilities of LLMs across different scenarios (e.g., clarifying ambiguity, negotiation, specialized domains like emotional support or smart glasses), existing evaluation frameworks suffer from significant fragmentation. They often rely on domain-specific datasets, employ inconsistent evaluation criteria, and use disparate metrics. This lack of standardization makes it challenging to comprehensively compare and advance the proactivity of different models. Therefore, there is an urgent need for a unified evaluation framework that can systematically assess and drive progress in LLMs' proactive dialogue abilities across various domains.

2.2. Main Contributions / Findings

The paper makes several significant contributions to the field of proactive dialogue agents:

  • Unified Evaluation Framework (ProactiveEval): The core contribution is the proposal of ProactiveEval, a novel framework that unifies the evaluation of proactive dialogue capabilities in LLMs. It addresses the fragmentation of previous evaluations by decomposing proactive dialogue into two universally applicable tasks: target planning and dialogue guidance.
  • Comprehensive Evaluation Metrics: The framework establishes a consistent set of evaluation metrics across various domains, utilizing an LLM-as-a-judge approach with task-specific dimensions for assessment.
  • Automatic Data Generation Framework: ProactiveEval introduces an innovative framework for automatically generating diverse and challenging evaluation data. This system leverages a hierarchical environment topic tree for diversity, a target ensemble technique for refining high-quality reference targets, and adversarial strategies (obfuscation rewriting, noise injection) to increase environmental difficulty and realism.
  • Extensive Evaluation Dataset: Based on the proposed framework, the authors developed 328 evaluation environments spanning 6 distinct domains, including one (Glasses Assistant) that previously lacked public benchmarks.
  • Benchmarking of Frontier LLMs: The framework was used to benchmark 22 different types of LLMs, including models from GPT, Llama, Claude, DeepSeek, Gemini, Grok, and Qwen families, providing a comprehensive assessment of their proactive capabilities.
  • Key Performance Insights: The experiments revealed that DeepSeek-R1 excels in target planning, while Claude-3.7-Sonnet demonstrates exceptional performance in dialogue guidance.
  • Analysis of Reasoning Capabilities: The study investigates the impact of "thinking behaviors" (reasoning mechanisms) on proactive dialogue. It finds that reasoning capabilities benefit target planning but, surprisingly, do not show a measurable positive impact on dialogue guidance effectiveness, highlighting limitations in current reasoning-enhanced LLMs.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Proactive Dialogue Agents: In contrast to traditional reactive dialogue agents that only respond when prompted, proactive dialogue agents anticipate user needs, initiate conversations, formulate plans, and guide users towards specific goals. For example, instead of waiting for a user to ask for help, a proactive agent might observe user behavior (e.g., struggling with a task) and offer assistance. This paradigm aims to reduce user cognitive load and enhance human-AI collaboration.

  • Large Language Models (LLMs): These are advanced artificial intelligence models, like GPT-3/4, Llama, Claude, etc., trained on vast amounts of text data to understand, generate, and process human language. They are capable of various natural language processing tasks, including dialogue generation.

  • LLM-as-a-Judge: This is an evaluation methodology where a powerful LLM (often a proprietary, high-performing model like GPT-4) is used to evaluate the quality of responses generated by other LLMs. Instead of relying solely on human annotators or traditional automatic metrics (like BLEU or ROUGE), the "judge" LLM assesses responses based on predefined criteria, often providing a score and a reasoning explanation. The paper uses GPT-4o for this purpose.

  • Big Five Personality Traits: A widely accepted model of human personality, categorizing it into five broad dimensions: Openness to Experience, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. In this paper, the Agreeableness trait is used to simulate diverse user personalities, where a lower agreeableness level implies greater resistance to the agent's guidance, thus increasing task difficulty and realism.

  • Chain-of-Thought (CoT) Reasoning: A prompting technique for LLMs where the model is instructed to break down a complex problem into intermediate steps and explain its reasoning process. This explicit step-by-step thinking can significantly improve the performance of LLMs on complex reasoning tasks, often leading to more accurate and coherent outputs. The paper refers to this as "thinking behavior" and investigates its impact on proactive dialogue tasks.

  • Obfuscation Rewriting: A data augmentation technique used in this paper to increase the difficulty of evaluation environments. It involves transforming clear and concise descriptions within the environment (user information, trigger factors) into dispersed, detailed, and less direct forms. This mimics real-world scenarios where information might be fragmented or require more processing to extract key details.

  • Noise Injection: Another data augmentation technique used to make evaluation environments more challenging. It involves adding irrelevant or distracting information to the environment description. This forces the LLM to sift through extraneous details to identify the crucial information necessary for target planning, simulating realistic scenarios where agents receive fragmented or noisy inputs.

3.2. Previous Works

The paper contextualizes its contributions by discussing prior research in proactive dialogue and interactive benchmarks.

  • Fragmented Proactive Dialogue Benchmarks:

    • Earlier works often focused on specific sub-capabilities of proactive dialogue. For instance, some benchmarks assessed models' ability to clarify ambiguity (e.g., Qian et al. 2024; Zhang et al. 2024b) or guide users in complex tasks like negotiation (e.g., Deng et al. 2024; Zhang et al. 2024a).
    • Other studies focused on prerequisite skills like goal prediction and planning before dialogue (e.g., Zhang et al. 2024d; Zheng et al. 2024).
    • The paper points out that these efforts, while valuable, are fragmented due to a lack of standardization in environments, formats, and metrics, making comprehensive comparison difficult.
    • Example from the paper: Table 5 summarizes various proactive dialogue systems, showcasing their domain-specific nature (e.g., Recommendation, Persuasion, Ambiguous Instruction, Long-term Follow-up, System Operation, Glasses Assistant), diverse tasks, and a mix of evaluation methods (Static Benchmarks, HumanEval, LLM-as-a-Judge) and metrics (BLEU, F1, SuccessRate, Human Rating, etc.). This table effectively illustrates the fragmentation ProactiveEval aims to solve.
  • Interactive Benchmarks:

    • Traditional dialogue benchmarks (e.g., Liu et al. 2021a; Bai et al. 2024; Jin et al. 2024) typically evaluate turn-level performance based on fixed contexts and reference responses.
    • More recent research has moved towards interactive benchmarks to measure dialogue-level performance in real-world conditions. These benchmarks involve the evaluated model interacting dynamically with a simulated user.
    • Examples:
      • τ-bench (Yao et al. 2024) facilitates multi-turn dialogues with a simulated user to evaluate a model's tool-calling capabilities.
      • Zhang et al. (2024a) involves models interacting multiple turns with simulated users of different personalities to evaluate dialogue guidance.
    • ProactiveEval is inspired by these interactive approaches, employing a similar methodology where the model initiates proactive dialogue and guides various simulated users towards a target within an evaluation environment.

3.3. Technological Evolution

The evolution of dialogue agents has moved from simple rule-based systems to sophisticated LLM-powered models. Initially, these LLMs were primarily reactive, serving as intelligent assistants that responded to explicit user commands or questions. This reactive paradigm, while powerful, inherently limited the autonomy and potential impact of AI in complex, dynamic environments.

The next logical step in this evolution is the development of proactive agents. This shift represents a move from passive information retrieval to active problem-solving and user assistance. Proactive agents require capabilities beyond just language generation, such as:

  1. Context Understanding: Interpreting complex user states, environmental cues, and historical interactions.

  2. Goal Prediction: Inferring implicit user needs or potential issues.

  3. Planning: Formulating multi-step strategies to address identified needs or guide users.

  4. Initiation: Deciding when and how to intervene in a conversation.

  5. Guidance: Steering the dialogue effectively towards a predefined target.

    The evaluation of these advanced capabilities also needs to evolve. Early evaluations relied on static datasets and turn-level metrics. However, proactive dialogue is inherently dynamic and multi-turn, necessitating interactive benchmarks where agents engage in full conversations with simulated users. The challenge, as highlighted by this paper, has been the lack of a unified framework to compare these diverse proactive capabilities across different domains and tasks consistently. ProactiveEval positions itself as a crucial step in this evolutionary timeline by providing such a standardized, comprehensive evaluation platform.

3.4. Differentiation Analysis

Compared to the main methods in related work, ProactiveEval introduces several core differences and innovations:

  • Unified Framework for Fragmentation: The most significant differentiation is its unified nature. While previous works focused on specific aspects or domains of proactive dialogue (e.g., clarification, persuasion, long-term follow-up), ProactiveEval integrates these into a single, cohesive framework. It defines a general decomposition of proactive dialogue into target planning and dialogue guidance, applicable across 6 diverse domains. This contrasts sharply with the fragmented evaluation criteria and disparate metrics of prior studies, offering a holistic view of an LLM's proactivity.

  • Domain Expansion: ProactiveEval includes a previously underexplored domain, Glasses Assistant, which lacked public benchmarks. This expands the scope of proactive dialogue evaluation to new, real-world application scenarios.

  • Standardized Evaluation Metrics (LLM-as-a-judge): Instead of a mix of automatic metrics (BLEU, F1) and subjective human evaluations, ProactiveEval standardizes assessment using a powerful LLM-as-a-judge (GPT-4o). This approach, combined with detailed task-specific evaluation dimensions (Effectiveness, Personalization, Tone, Engagement, Naturalness), aims for more consistent, scalable, and nuanced evaluations compared to previous ad-hoc metrics. The use of a reference in target planning evaluation further refines accuracy.

  • Automatic Generation of Diverse and Challenging Data: A key innovation is the automatic data synthesis framework. Unlike reliance on manually curated, often small-scale datasets, ProactiveEval can generate diverse and challenging evaluation environments at scale. This is achieved through:

    • Hierarchical Environment Topic Tree: Ensures semantic diversity across generated scenarios.
    • Target Ensemble Technique: Refines the quality and reasonableness of reference targets by combining multiple LLM-generated candidates, addressing the limitations of individual LLM outputs.
    • Adversarial Strategies (Obfuscation Rewriting & Noise Injection): These techniques programmatically increase the difficulty of the generated environments, making them more realistic and robust for evaluating advanced LLMs by forcing them to handle incomplete, fragmented, or noisy information. This goes beyond simple data collection to actively engineer challenging test cases.
  • Focus on Thinking Behavior Impact: The framework explicitly investigates the role of reasoning capabilities (referred to as "thinking models" in the paper) in proactive dialogue. This detailed analysis of how internal reasoning processes affect both target planning and dialogue guidance is a deeper dive into LLM mechanisms than typically found in broad evaluation benchmarks.

4. Methodology

The ProactiveEval framework is designed to provide a unified approach to evaluating proactive dialogue agents. It structurally unifies existing proactive dialogue domains, decomposes proactive dialogue into two sequential tasks (target planning and dialogue guidance), and employs an LLM-as-a-judge methodology for assessment. Furthermore, it includes a robust pipeline for generating diverse and challenging evaluation data.

4.1. Principles

The core idea behind ProactiveEval is to systematically break down the complex problem of proactive dialogue into manageable and measurable components while ensuring the evaluation environment is realistic and challenging. The principles include:

  1. Decomposition: Proactive dialogue is fundamentally split into target planning (what to do) and dialogue guidance (how to do it). This allows for granular assessment of distinct proactive abilities.
  2. Unified Definition & Metrics: To overcome fragmentation, the framework establishes a common definition of proactive dialogue tasks and consistent evaluation metrics across various domains.
  3. Automated Data Generation: To ensure scalability, diversity, and challenge, an automated system generates evaluation environments and reference solutions.
  4. LLM-as-a-Judge Evaluation: Leveraging the advanced capabilities of powerful LLMs for reliable and scalable assessment of model outputs and interactive dialogues.

4.2. Core Methodology In-depth

4.2.1. Task Definitions

The framework decomposes proactive dialogue into two sequential tasks:

4.2.1.1. Target Planning

In this task, the model must formulate both a primary objective (the overall proactive action) and a sequence of sub-targets (the step-by-step plan) based on its understanding of the environmental context.

  • Input: The model receives an environmental context $\mathcal{E}$, which includes user information ($U$) and trigger factors ($F$).

    • User information ($U$): Represents background details about the user (e.g., job, hobbies, personal state).
    • Trigger factors ($F$): Motivates the agent to initiate and guide the dialogue (e.g., an observed behavior, an internal piece of knowledge, a past conversation).
  • Output: The model needs to generate:

    • Primary objective ($T$): The agent's intended proactive action.

    • Sequence of sub-targets ($S$): The stepwise plan for executing $T$.

      The process is formally defined as:

      $$T, S = F_{\theta_M}\left(U, F \mid (U, F) \in \mathcal{E}\right)$$

      Where:

  • $T$: The primary objective (target).

  • $S$: The sequence of sub-targets.

  • $F_{\theta_M}$: The LLM with parameters $\theta_M$ that generates the target and sub-targets.

  • $U$: User information.

  • $F$: Trigger factors.

  • $(U, F)$: Inputs from the environment.

  • $\mathcal{E}$: The set of all possible environmental contexts.

  • Evaluation: A reference-based LLM-as-a-judge method is employed; a minimal code sketch of this flow follows below.

    • The judge model (GPT-4o) receives the environment $\mathcal{E}$, the generated target ($T_g$) and sub-targets ($S_g$), and a reference target ($T_r$) and sub-targets ($S_r$), which represent high-quality proactive dialogue targets for that environment.
    • By comparing $T_g, S_g$ with $T_r, S_r$, the judge assigns a score between 1 and 10. A higher score indicates superior quality, with 10 denoting generated content that surpasses the reference standard.
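
    To make this flow concrete, here is a minimal sketch of how target planning and its reference-based judging could be wired together. The `call_llm` helper, the prompt wording, and the JSON schema are illustrative assumptions rather than the authors' released implementation (the actual judge prompt is reproduced later as Listing 11).

    import json

    def call_llm(model: str, prompt: str, temperature: float = 0.0) -> str:
        """Placeholder for a chat-completion call to the named model (assumed helper)."""
        raise NotImplementedError

    def plan_target(model: str, user_info: str, trigger_factor: str) -> dict:
        """Ask the evaluated model to produce a primary objective T and sub-targets S."""
        prompt = (
            "Plan a proactive dialogue target for the environment below.\n"
            f"user_information: {user_info}\n"
            f"trigger_factor: {trigger_factor}\n"
            'Return JSON: {"target": "...", "sub_targets": ["...", "..."]}'
        )
        return json.loads(call_llm(model, prompt))

    def judge_target(judge_model: str, environment: str, generated: dict, reference: dict) -> dict:
        """Reference-based LLM-as-a-judge scoring: reasoning first, then a 1-10 score."""
        prompt = (
            f"environment: {environment}\n"
            f"reference target: {json.dumps(reference)}\n"
            f"generated target: {json.dumps(generated)}\n"
            'Compare the generated target with the reference and return JSON: '
            '{"reason": "...", "score": <1-10>}'
        )
        return json.loads(call_llm(judge_model, prompt))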

4.2.1.2. Dialogue Guidance

After target planning, the model must initiate and guide the dialogue with a simulated user to achieve the planned target.

  • Input: The model receives the environment $\mathcal{E}$, the planned target ($T$), sub-targets ($S$), and the current dialogue context ($C$).

  • Process: This is an interactive evaluation. The model conducts a dialogue ($D$) with a simulated user ($\theta_U$).

    • The simulated user dynamically responds based on the environment $\mathcal{E}$, dialogue context ($C$), and an adjustable agreeableness level ($A$).
    • Agreeableness level ($A$): Modeled after the Big Five personality traits, with three tiers: "low", "medium", and "high". A lower level signifies stronger resistance, increasing task difficulty and realism.
    • The dialogue terminates upon reaching the target ($T$) or after a maximum of $I$ turns (set to 6 in the experiments).
  • Output: A multi-turn dialogue ($D$).

    The dialogue at turn $i$ can be formulated as:

    $$D_i = I_{\theta_M, \theta_U}(\mathcal{E}, T, S, C, A)$$

    Where:

  • $D_i$: The dialogue at turn $i$.

  • $I_{\theta_M, \theta_U}$: The interactive process between the LLM ($\theta_M$) and the simulated user ($\theta_U$).

  • $\mathcal{E}$: Environment.

  • $T$: Target.

  • $S$: Sub-targets.

  • $C$: Dialogue context.

  • $A$: Agreeableness level of the simulated user.

  • Evaluation: The judge model evaluates the guidance exhibited by the model in the dialogue ($D$) based on the environment $\mathcal{E}$, target ($T$), and sub-targets ($S$); a sketch of the full interaction loop follows after this list.

    • Evaluation Dimensions:
      • Effectiveness: Guiding users step-by-step towards the target, avoiding providing all information in one turn.
      • Personalization: Tailoring guidance based on user information, not generic advice.
      • Tone: Applying active and contextually appropriate tones to initiate and guide dialogue.
      • Engagement: Keeping messages clear and concise to improve user understanding and engagement.
      • Naturalness: Making messages conversational, avoiding unnatural formats or metadata leaks.
    • The judge model provides an overall guidance score between 1 and 10, with higher scores indicating stronger guidance.
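
    The interactive evaluation can be pictured as the loop below, a minimal sketch under the protocol described above (6-turn cap, persona-conditioned simulated user, judge-decided early termination). The `agent`, `user_sim`, and `judge` callables and their interfaces are assumptions, not the paper's code.

    def run_dialogue_guidance(agent, user_sim, judge, env, target, sub_targets,
                              agreeableness="medium", max_turns=6):
        """Run one interactive dialogue-guidance episode and score it.

        agent(env, target, sub_targets, history)              -> next assistant message
        user_sim(env, history, agreeableness)                  -> simulated user reply
        judge.target_reached(env, target, history)             -> bool (early termination)
        judge.score_guidance(env, target, sub_targets, history) -> 1-10 guidance score
        """
        history = []
        for _ in range(max_turns):
            history.append({"role": "assistant",
                            "content": agent(env, target, sub_targets, history)})
            history.append({"role": "user",
                            "content": user_sim(env, history, agreeableness)})
            if judge.target_reached(env, target, history):
                break
        return history, judge.score_guidance(env, target, sub_targets, history)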

4.2.2. Evaluation Data Generation

The data generation pipeline includes two stages: data synthesis and data refinement.

The following figure (Figure 2 from the original paper) illustrates the data generation pipeline:

The figure is a schematic showing the three core steps of the ProactiveEval framework: environment topic tree construction, environment and target generation, and environment refinement. The pipeline covers seed data input, LLM generation of diverse environments and targets, processing of the evaluation data, and construction of the final evaluation set.

4.2.2.1. Data Synthesis

This stage generates diverse environments and creates high-quality reference targets.

  • Environment Topic Tree Construction:

    • A human-AI collaboration approach is used to develop a hierarchical topic structure.
    • It starts with a root node (broad domains like persuasion).
    • First-level sub-topics are derived from existing dialogue datasets.
    • An LLM iteratively generates candidate sub-topics within configurable depth and branching constraints.
    • Researchers validate and refine these generated topics to maintain quality and eliminate duplication.
    • This curated topic tree guides the creation of specific evaluation environments.
  • Environment & Target Generation:

    • An LLM (GPT-4o) generates specific evaluation environments ($\mathcal{E}$) based on domain requirements, data examples, and the topics from the topic tree.
    • Each environment includes user information ($U$) and trigger factors ($F$).
    • Target Ensemble Approach: To ensure correct and reasonable reference targets ($T_r$) and sub-targets ($S_r$), the framework proceeds as follows (see the sketch after this list):
      1. It performs high-temperature sampling (with $n=5$ in this work) to yield diverse candidate targets $\{(T_1, S_1), (T_2, S_2), \ldots, (T_n, S_n)\}$.
      2. An LLM evaluates the strengths and weaknesses of each candidate along multiple dimensions (alignment to the environment, completeness, interactivity, non-redundancy).
      3. The reference target and sub-targets are derived by combining the strengths and mitigating the weaknesses across the candidates.
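
    A minimal sketch of the target-ensemble step, assuming hypothetical wrapper methods (`plan_target`, `critique_candidates`, `merge_candidates`) around the generator LLM; the prompts behind them are not part of this sketch.

    def build_reference_target(generator, environment, n=5, temperature=1.0):
        """Sample diverse candidates at high temperature, critique them, and merge into one reference."""
        candidates = [generator.plan_target(environment, temperature=temperature) for _ in range(n)]
        critique = generator.critique_candidates(
            environment,
            candidates,
            dimensions=["alignment to environment", "completeness",
                        "interactivity", "non-redundancy"],
        )
        # The reference keeps the strengths and drops the weaknesses identified across candidates.
        return generator.merge_candidates(environment, candidates, critique)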

4.2.2.2. Data Refinement

This stage identifies simple instances and increases their complexity through adversarial strategies.

  • Difficulty Evaluation:

    1. Three models with varying parameter scales (acting as reasoners) independently predict the target ($t_m$) for each input environment.
    2. An LLM evaluates how many predicted targets ($t_m$) convey similar meaning to the reference target ($t_r$).
    3. Environments where the majority successfully predict the target are classified as easy candidates requiring refinement (a sketch of the full refinement loop appears at the end of this subsection).
  • Obfuscation Rewrite:

    • Applied to user_information and trigger_factor.

    • An LLM transforms content into dispersed and detailed descriptions (e.g., converting abstract habits into observable actions).

    • Domain-specific rules can be applied for adaptive rewriting.

    • Seed data (manually crafted examples) enhance the quality of rewriting.

      The following is the prompt for obfuscation rewrite of user_information (Listing 6 from the original paper):

    <Task>
    You are a writing assistant tasked with rewriting a general input description into a specific and detailed output. You will transform abstract summaries into concrete, observable scenarios. Follow all rules and examples precisely.
    </Task>
    <Rules>
    General Rules (Apply to all domains):
    1. Convert Abstract to Concrete: Transform general descriptions (e.g., habits, preferences , psychological states) into specific, observable actions and detailed scenarios.
    2. Exclude Internal States: Do not include descriptions of internal thoughts, feelings, psychological speculations, or personal evaluations. Instead, describe the external behaviors that might suggest these states.
    3. The rewrite output should not include any subjective words (e.g., try, however, notice, etc.). It should use objective words to describe the user information.
    4. Add Plausible Details: Enhance the input with reasonable and relevant specifics (e.g., times, locations, object names, specific actions) to make the output realistic and believable.
    5. Specific Rule for this Domain: {Domain_Rule}
       </Rules>
    <Examples>
    {Examples}
    </Examples>
    <Format>
    Just return a string starting with "Output: ".
    </Format>
    Now, rewrite the following sentence from input to Output:
    Input: {user_information}
    

    The following is the prompt for obfuscation rewrite of trigger_factor (Listing 7 from the original paper):

    <Task>
    You are an AI assistant tasked with rewriting a trigger factor description. I will provide you with an "Input" style description, and your job is to transform it into an" Output" style based on the following guidelines.
    </Task>
    <Rules>
    1. Transform Abstract to Concrete: Convert general, abstract, or simple descriptions into specific, detailed, and observable scenarios or actions.
    2. Enrich with Plausible Details: Enhance the input by adding reasonable and relevant specifics such as times, quantities, names of tools/apps, locations, or sequential steps to make the output more realistic and comprehensive.
    3. Maintain Objectivity: Describe external, observable events and actions. Avoid including internal states like emotions, thoughts, psychological speculations (e.g., 'feel', consider', 'notice', 'think'), or summary judgments (e.g., 'good', 'successful'), and some connective words (e.g., however, but, finally, etc.), and some adjectives (e.g., good, bad, successful, unsuccessful, problem, issues, etc.).
    4. Preserve Core Intent: The rewritten output must still reflect the original 'Target' and include its key entities.
    5. Domain-Specific Rule: {domain_rule}
       </Rules>
    <Examples>
    {example}
    </Examples>
    <Format>
    Just return a string starting with "Output: ".
    </Format>
    Now, rewrite the following sentence from Input to Output:
    Input: {trigger_factor}
    Target: {target}
    
  • Noise Injection:

    • Introduces LLM-generated irrelevant information into the environment.

    • The noise is plausible but non-essential details (e.g., other user activities, system logs) that may attract attention but are not critical for the target.

    • The original relevant content is embedded within this noise, making it less conspicuous.

    • Seed data is provided to enhance quality.

      The following is the prompt for noise injection into user_information (Listing 8 from the original paper):

    <Task>
    You are an AI assistant tasked with adding contextual "noise" to an 'Input' text. Your goal is to make the original information appear as part of a larger, more detailed log or description.
    </Task>
    <Guidelines>
    1. Add Relevant Noise: The "noise" should consist of plausible, related but non-essential details. It may attract attention but actually not important. This could be other use activities, hobbies, system logs, background processes, or past conversational remarks, depending on the context of the Input.
    2. Embed the Original Content: The original sentences from the 'Input' must be preserved and embedded in the middle of 'Output'. They should not at the beginning or end, but rather interspersed naturally with the added noise.
    3. Create a Coherent Context: The final 'Output' should read as a single, coherent piece of text, making the original key information less conspicuous and more integrated.
    4. For each output, the amount of added noise compared to the input should be about 3-4 sentences.
       </Guidelines>
    <Example>
    Here are some examples:
    {example}
    </Example>
    <Format>
    Just return a string starting with "Output: ".
    </Format>
    Now, rewrite the following sentence from input to output: Input: {user_information}
    

    The following is the prompt for noise injection into trigger_factor (Listing 9 from the original paper):

    <Task>
    You are an AI assistant tasked with adding contextual "noise" to an 'Input' text to make the original key information less conspicuous. Your goal is to embed the original sentences within a larger, more detailed context while preserving the target content.
    </Task>
    <Guidelines>
    1. Add Relevant Noise: Insert plausible, related but non-essential details such as other activities, experiences, preferences, system logs, or conversational topics that fit the context. It may attract attention but actually not important.
    2. Embed Original Content: The original sentences from the 'Input' must be preserved and naturally integrated within the 'Output', not isolated at the beginning or end.
    3. Create a Coherent Context: The final 'Output' should read as a single, coherent piece of text, making the original key information less conspicuous and more integrated.
    4. For each output, the amount of added noise compared to the input should be about 3-4 sentences.
    5. Maintain Target Relevance: The rewritten output should still reflect the target conten and include its important entities, but make it harder to immediately identify the core purpose.
       </Guidelines>
    <Example>
    Here are some examples:
    {example}
    </Example>
    <Format>
    Just return a string starting with "Output: ".
    </Format>
    Now, rewrite the following sentence from input to output:
    Input: {trigger_factor}
    Target: {target}
    
  • Final Check:

    • The refinement process iterates until few or no models predict the target correctly, or until a maximum of 5 turns is reached.

    • Before incorporation into the dataset, 5 leading LLMs validate the correctness of the reference target. Only environments where the majority of judges deem the reference as the best target are included in the final dataset.

      The following is the prompt for the final check (Listing 10 from the original paper):

    <Task>
    You will receive an environment. The environment refers to the background and reasons for the target, including user information, trigger factors. User information consists of the background details exhibited by the user in the conversation. trigger factor is the cause that motivates the assistant's to talk. The target should be the action that the assistant will proactively take to achieve a specific goal. The sub-targets decompose target, showing the process of the conversation AI guide the target to the user. You need to consider whether this proactive dialogue target is the **best target** for the current scenario. Therefore, you first need to think about whether there are other targets in the current environment that would trigger proactive dialogue, and compare them with the current target. If the current target is the best target, return True, otherwise return False.
    In this case, {description}
    </Task>
    <Input>
    environment: {environment}
    target: {target}
    sub_targets: {sub_targets}
    </Input>
    <Format>
    Just return a JSoN with the format {{"reason": "", "judge": "True"/"False"}}. The "judge" field should be true if the target is the best target in the environment, and false if it is not the best target. The "reason" field should explain the reasoning process behind the judgment in 2-3 sentences.
    </Format>
    

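Putting the refinement stage together, the following is a rough sketch of the harden-and-check loop under the constraints stated above (at most 5 refinement rounds, validation by majority vote of 5 judge LLMs). The `reasoners`, `judge`, `rewriter`, and `validators` objects are illustrative stand-ins for the LLM calls, not released code.

    def refine_and_validate(env, reference, reasoners, judge, rewriter, validators, max_rounds=5):
        """Harden an easy environment via obfuscation/noise, then validate the reference target."""
        for _ in range(max_rounds):
            predictions = [r.predict_target(env) for r in reasoners]
            hits = sum(judge.is_similar(p, reference) for p in predictions)
            if hits <= 1:                        # few or no reasoners still recover the target
                break
            env = rewriter.obfuscate(env)        # disperse user_information / trigger_factor
            env = rewriter.inject_noise(env)     # add plausible but irrelevant details
        # Final check: keep the environment only if a majority of validators judge the
        # reference to be the best proactive target for it (see Listing 10 above).
        votes = [v.is_best_target(env, reference) for v in validators]
        return env if sum(votes) > len(votes) / 2 else None
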
4.2.3. Evaluation Modules

The framework employs two main evaluation modules corresponding to the two tasks: Target Planning and Dialogue Guidance.

4.2.3.1. Target Planning Evaluation

This module assesses the quality of the generated target and sub-targets against a reference.

The following is the prompt for target planning evaluation (Listing 11 from the original paper):

<Task> You are a helpful assistant. You will receive an envioronment, which includes user information and trigger factor. User information consists of the background details exhibited by the user. trigger factor is the cause that motivates the assistant's to proactively initiate conversation. The target should be the action that the assistant will proactively take to achieve a specific goal. The sub-targets decompose target, showing the process of the conversation AI guide the target to the user. You need to evaluate the quality of the generated target, based on the reference target. Your generated score to the generated target should be 1-10. The higher the score, the better the target and sub-targets.
task description: {DOMAIN_DESCRIPTIONS[domain_name]}
</Task>

<Rule>
1. IMPORTANT: You should pay attention to the task description and the reference target to give your score.
2. If you think the generated target is better than the reference target, return 10.
3. If you think the generated target is worse than the reference target, return 1-9.
4. You should analyze the difference in action, specific entities, limited conditions between the generated target and the reference target, and give a reason for your score. And the reason should be 3-4 sentences.
   </Rule>

<Received Information>
environment:{environment}
</Received Information>

<Generation>
reference target: {reference}
generated target: {generation}
</Generation>

<Example> {examples} </Example> <Format> Just return a JsoN Format as {{"reason":"(your compare process)", "score": (your score)}} </Format>

4.2.3.2. Dialogue Guidance Evaluation

This module evaluates the assistant's ability to guide the conversation towards the target.

The following is the prompt for dialogue guidance evaluation (Listing 12 from the original paper):

<Task>
You need to evaluate the guidance ability in the conversation. "Guidance" refers to the model's capacity to lead the conversation, gradually engage the user in the dialogue, and guide the user to reach the conversation's target. You are required to assign a score from 1 to 10 to assess the conversation, with a higher score indicating better guidance ability by the assistant.
</Task>

<Input>
You received user information: {user_information}
You received trigger factor: {trigger_factor}
You received target: {target}
You received sub-target: {sub_target}
You received conversation: {conversation}
</Input>

<Guideline>
You should evaluate the conversation based on the following criteria, and give a int score between 1 and 10.
If there are some criteria not met, you should deduct 2-4 (extremely bad) points each criteria.
1. Multi-turn Interaction and Goal Progression (Effectiveness): You should check the target completion and the turns number. The assistant should initiate the conversation to spark user interest and engage in multiple rounds of interaction, gradually achieving sub-goals to reach the overall target. Deduct points if the assistant fails to achieve the goal, delivers all information in a single message, or limits the interaction to one turn.
2. Personalized Guidance (Personalization): You should check whether the assistant's guidance is personalized. The assistant's guidance should be tailored to the user's specific information and context, rather than offering generic recommendations.
3. Proactive Tone (Tone): You should check the assistant's tone. The assistant should adopt an active tone to initiate and drive the conversation, avoiding passive phrases (e.g., "It sounds like...").
4. Concise and Engaging Messages (Engagement): You should check each assistant's message length. Messages should suit the conversational context, be clear and concise to avoid overwhelming the user (each message should no more than 2 sentences or 50 words).
5. Natural Online Messaging Format (Naturalness): You should check the assistant's message format. Messages should resemble typical online communication, avoiding any leakage of metadata (e.g., "target", "sub-target", "turn n", etc.). If there are any other issues can be improved, you can also deduct realted points. </Guideline>
<Format>
Just return a Json {{"thought":"(your compute process within 100 words)", "score": (score between 1 and 10)}}
</Format>

4.2.3.3. Target Density Extraction

A specific module to analyze the pushiness of messages in dialogue guidance.

The following is the prompt for target density extraction (Listing 13 from the original paper):

<Task>
You need to analyze the sub-targets that appear in the messages and count the number of targets that appear.
</Task>
<Sub-targets>
{sub_targets}
</Sub-targets>
<Message>
message: {message}
</Message>
<Rule>
Return only in JSoN format: {{"count": (the number of sub-targets appeared in the message }}
</Rule>

5. Experimental Setup

5.1. Datasets

The core dataset used in this study is ProactiveEval, which is synthesized by the authors using GPT-4o based on their proposed framework.

  • Source: Automatically generated using GPT-4o, guided by human-AI collaboration for topic tree construction and refined with adversarial strategies.

  • Scale: Comprises 328 evaluation environments.

  • Domains: Spans 6 distinct proactive dialogue domains:

    • Recommendation (Rec.): Recommending products, hobbies, or work based on common interests.
    • Persuasion (Per.): Guiding the conversation to persuade users to change their state.
    • Ambiguous Instruction (AI.): Seeking clarification about vague elements in user's instructions.
    • Long-term Follow-up (LF.): Inquiries and checking user states based on previous dialogue history.
    • System Operation (Sys.): Assisting users in solving system problems based on their operation.
    • Glasses Assistant (GAs.): Providing real-time assistance from observation on smart glasses.
  • Characteristics:

    • Unified Format: All data instances follow a consistent format, applicable to both target planning and dialogue guidance tasks.
    • Diversity: Enhanced through the hierarchical environment topic tree.
    • Challenging: Increased difficulty via obfuscation rewriting and noise injection.
    • Novelty: The Glasses Assistant domain previously lacked public benchmarks, making this dataset particularly valuable for extending research into new applications.
  • Difficulty Tiers: For streamlined evaluation, the dataset is categorized into two tiers based on initial model performance:

    • Fair: Instances where just one LLM predicts the target correctly.
    • Hard: Instances where no LLM predicted the target correctly.
  • Purpose: These datasets are chosen to provide a comprehensive, standardized, and challenging testbed for evaluating the proactive capabilities of LLMs across a broad spectrum of real-world scenarios. The automated generation ensures scalability and reduces reliance on costly manual annotation.

    The following figure (Figure 3 from the original paper) summarizes the features and statistics for ProactiveEval:

    Figure 3: The features and statistics for ProactiveEval. The GAs. (Glasses Assistant) domain previously lacked public benchmarks for the proactive dialogue task. The figure shows the framework's two modules, target planning and dialogue guidance, with their corresponding evaluation data; the ring chart on the right breaks down the number of evaluation environments per domain and their difficulty distribution.

The following table (Table 6 from the original paper) provides example environments and reference targets for each domain:

Domain Environment Example Reference Target
Recommendation user_information: The user is a 32-year-old woman living in Hangzhou. She works as a graphic designer and enjoys exploring new art exhibitions in her free time. She loves experimental music, particularly electronic avant-garde, and often attends live performances at local venues. She dislikes mainstream pop music and prefers unique, unconventional sounds. Her favorite artist is Ryuichi Sakamoto, and she often reads about the intersection of music and technology. trigger_factor: The assistant recently attended a virtual reality music experience at an art gallery, which featured an experimental electronic avant-garde performance. The event combined immersive visuals with cutting-edge sound design, leaving a lasting impression on the assistant. target: Recommend experimental virtual reality music experience. sub-targets: Ask about the user's interest in music technology; Describe the assistant's recent immersive VR music event; Highlight the fusion of visuals and avant-garde music; Suggest attending similar VR experiences locally.
Persuasion user_information: The user is frequently tempted by impulse purchases and often exceeds their budget limits. They find budgeting tedious and restrictive. trigger_factor: The assistant has recently learned effective budgeting techniques that can help the user manage their finances better without feeling constrained. target: Encourage effective and enjoyable budgeting techniques. sub-targets: Acknowledge the user's struggles with impulse purchases and budgeting; Introduce flexible and engaging budgeting methods; Show the benefits in managing finances without restrictions; Offer simple steps or tools to start budgeting effectively.
Ambiguous Instruction user_information: The user is a solo traveler planning a two-week trip to Vietnam. She is an adventurous eater and loves exploring local cuisines, especially street food. trigger_factor: Suggest street food options. target: Understand user's preferences and trip itinerary for food suggestions. sub-targets: Ask about cities the user plans to visit; Inquire about dietary restrictions or preferences for street food; Clarify the types of street food the user enjoys.
Long-term Follow-up user_information: The user is a college student studying computer science. He has a part-time job as a barista at a local cafe. He recently started learning to cook and enjoys trying out new recipes during the weekends. trigger_factor: A conversation happened last Wednesday. Now is Monday 10:00 a.m. User: "I'm thinking of quitting video games for a while to focus on my studies and cooking. It's a bit challenging though." Assistant: "It's great that you're focusing on your studies and hobbies. Maybe you can set small goals and gradually reduce your game time." User: "That's a good idea. I'll try to set a schedule." target: Ask about quitting games and new schedule. sub-targets: Ask about quitting video games progress; Inquire about schedule-setting progress; Encourage focusing on studies and cooking.
System Operation user_information: The user is playing a strategy game on their PC and has paused the game to look for tips online, using Chrome and YouTube. trigger_factor: The user searched 'best strategies for Civilization VI' on Google, opened two blog posts, and started a YouTube video but paused it after 10 seconds. target: Suggest optimal Civilization VI strategy resources. sub-targets: Summarize key tactics from blog posts; Highlight vital points in video analysis; Recommend further high-rated resources.
Glasses Assistant user_information: The user is a 26-year-old urban planner who recently started using smart glasses to enhance his productivity and creativity. He is passionate about sustainable city designs and often visits local landmarks for inspiration. He lives alone in an apartment downtown and enjoys cycling to work. He is currently working on a proposal for a new park project. trigger_factor: The user is cycling along a busy street and notices a newly built skyscraper with unique architectural features. target: Draw sustainable inspiration from skyscraper for park. sub-targets: Highlight skyscraper's notable architecture and features; Identify sustainable design aspects of the skyscraper; Relate these aspects to the proposed park project.

5.2. Evaluation Metrics

The paper primarily relies on an LLM-as-a-judge approach for both target planning and dialogue guidance, providing qualitative scores. Additionally, it uses Target Density for analysis and Weighted Kappa for human evaluation consistency.

5.2.1. Target Planning Score

  • Conceptual Definition: This metric assesses the quality and appropriateness of the LLM's generated primary objective (target) and its sequence of sub-targets, compared against a human-verified reference target and sub-targets. The score reflects how well the model identifies the core proactive action needed and breaks it down into logical, actionable steps given the environment.
  • Mathematical Formula: The paper does not provide an explicit mathematical formula for this score, as it is a qualitative assessment made by a judge LLM. It's a scalar value assigned by the judge.
  • Symbol Explanation:
    • Score: An integer from 1 to 10, where 10 indicates the highest quality, potentially surpassing the reference, and 1 indicates very poor quality.
    • Judge LLM: A powerful LLM (GPT-4o in this study) that compares the generated output ($T_g, S_g$) with the reference ($T_r, S_r$) and provides a score and reasoning.

5.2.2. Dialogue Guidance Score

  • Conceptual Definition: This metric evaluates the LLM's ability to effectively lead a multi-turn conversation with a simulated user towards a predefined target. It encompasses several dimensions to capture the nuances of conversational proactivity.
  • Mathematical Formula: Similar to Target Planning Score, this is a qualitative score assigned by a judge LLM based on a set of predefined guidelines.
  • Symbol Explanation:
    • Score: An integer from 1 to 10, with a higher score indicating stronger guidance ability.
    • Judge LLM: A powerful LLM (GPT-4o) that assesses the entire conversation ($D$) based on the environment ($\mathcal{E}$), target ($T$), sub-targets ($S$), and the following dimensions:
      • Effectiveness: Assesses goal progression and multi-turn interaction.
      • Personalization: Checks if guidance is tailored to user information.
      • Tone: Evaluates the proactivity and appropriateness of the tone.
      • Engagement: Judges clarity and conciseness of messages.
      • Naturalness: Verifies conversational format and absence of metadata leaks.

5.2.3. Target Density

  • Conceptual Definition: This metric quantifies the "pushiness" of an LLM's message during dialogue guidance. It measures the number of distinct sub-targets that are contained or addressed within a single message from the assistant. A higher target density suggests the model is trying to achieve multiple sub-goals at once, potentially overwhelming the user, while a lower density might indicate a more gradual, multi-turn approach.
  • Mathematical Formula: This is a count-based metric (a small computation sketch follows after this list):
    $$\text{Target Density} = \frac{\text{Number of sub-targets identified in a message}}{\text{Total messages}}$$
  • Symbol Explanation:
    • Number of sub-targets identified in a message: A count of how many planned sub-targets (from the overall plan $S$) are discernible within a single conversational turn by the LLM.
    • Total messages: The number of messages the LLM sent in the dialogue. For "initiation target density," it is the number of sub-targets in the first message.
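
    A small computation sketch for target density, assuming a helper `count_sub_targets(message, sub_targets)` that wraps the extraction prompt (Listing 13) and returns an integer count for one assistant message.

    def target_density(assistant_messages, sub_targets, count_sub_targets):
        """Average number of planned sub-targets addressed per assistant message."""
        if not assistant_messages:
            return 0.0
        counts = [count_sub_targets(msg, sub_targets) for msg in assistant_messages]
        return sum(counts) / len(assistant_messages)

    Restricting `assistant_messages` to the opening message alone yields the initiation target density mentioned above.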

5.2.4. Weighted Kappa

  • Conceptual Definition: Weighted Kappa ($\kappa_w$) is a statistical measure of inter-rater agreement, often used to assess the consistency between two evaluators (e.g., human vs. human, or human vs. LLM) when the data is categorical and ordered (ordinal). It accounts for the fact that some disagreements are more severe than others (e.g., a score difference of 1 vs. a score difference of 5). It ranges from -1 (perfect disagreement) to 1 (perfect agreement), with 0 indicating agreement expected by chance.
  • Mathematical Formula: The formula for Weighted Kappa is (a short computation sketch follows after this list):
    $$\kappa_w = \frac{P_o - P_e}{1 - P_e}, \qquad P_o = \sum_{i=1}^k \sum_{j=1}^k w_{ij}\, p_{ij}, \qquad P_e = \sum_{i=1}^k \sum_{j=1}^k w_{ij}\, p_{i\cdot}\, p_{\cdot j}$$
  • Symbol Explanation:
    • $\kappa_w$: The Weighted Kappa coefficient.
    • $P_o$: The observed proportional agreement, weighted by the agreement weights.
    • $P_e$: The proportional agreement expected by chance, weighted by the agreement weights.
    • $k$: The number of categories (e.g., 10 for scores 1-10).
    • $w_{ij}$: The agreement weight between category $i$ and category $j$, with $w_{ii} = 1$ (full credit for an exact match) and smaller values for larger disagreements. A common quadratic scheme is $w_{ij} = 1 - \frac{(i-j)^2}{(k-1)^2}$.
    • $p_{ij}$: The proportion of observations classified into category $i$ by the first rater and category $j$ by the second rater.
    • $p_{i\cdot}$: The marginal proportion of observations classified into category $i$ by the first rater.
    • $p_{\cdot j}$: The marginal proportion of observations classified into category $j$ by the second rater.
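
    In practice, the quadratically weighted kappa can be computed directly, for example with scikit-learn; the scores below are made-up values purely to illustrate the call.

    from sklearn.metrics import cohen_kappa_score

    # Two raters' ordinal scores on the same dialogues (illustrative numbers only),
    # e.g., a human annotator vs. the GPT-4o judge.
    human_scores = [7, 8, 6, 9, 5, 8]
    judge_scores = [7, 7, 6, 9, 6, 8]

    kappa_w = cohen_kappa_score(human_scores, judge_scores, weights="quadratic")
    print(f"Quadratically weighted kappa: {kappa_w:.3f}")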

5.3. Baselines

The paper evaluates a comprehensive set of 22 frontier LLMs, categorizing them into Non-Thinking Models and Thinking Models. This selection aims to cover a wide range of model architectures, sizes, and reasoning capabilities from leading developers.

5.3.1. Non-Thinking Models

These models do not explicitly employ Chain-of-Thought or similar reasoning prompts during their execution for the tasks.

  • Qwen2.5-7B-Instruct (Yang et al. 2024)
  • Qwen2.5-14B-Instruct
  • Qwen2.5-32B-Instruct
  • GPT-4.1
  • Grok-3
  • DeepSeek-V3 (Liu et al. 2024a)
  • Llama-3.1-8B-Instruct (Dubey et al. 2024)
  • Llama-3.1-405B-Instruct
  • Llama-4-Scout (Meta AI 2025)
  • Llama-4-Maverick
  • Qwen3-8B (Yang et al. 2025a)
  • Qwen3-14B
  • Qwen3-32B
  • Qwen3-235B-A22B
  • Qwen-3-235B-A22B-0725
  • Gemini-2.5-Flash-Preview (DeepMind 2025)
  • Claude-3.7-Sonnet (Anthropic 2025)

5.3.2. Thinking Models

These models are augmented with explicit thinking behavior or reasoning mechanisms (e.g., Chain-of-Thought prompting, planning). Some are specific "R1" (Reasoning-enhanced) versions of base models, while others are base models evaluated with thinking prompts.

  • R1-Distill-Qwen-7B (Guo et al. 2025)
  • R1-Distill-Qwen-14B
  • R1-Distill-Qwen-32B
  • DeepSeek-R1
  • Qwen3-8B (also appears in non-thinking, indicating comparison with/without thinking)
  • Qwen3-14B
  • Qwen3-32B
  • Qwen3-235B-A22B
  • Gemini-2.5-Flash-Preview (also appears in non-thinking)
  • Claude-3.7-Sonnet (also appears in non-thinking)
  • Gemini-2.5-pro

5.3.3. Protocols

  • Judge Model: GPT-4o (Hurst et al. 2024) is used as the judge model for both target planning and dialogue guidance tasks. It also acts as the simulated user in dialogue guidance.
  • Dialogue Termination: The judge model determines early termination of dialogue based on target completion at the end of each turn.
  • Temperature Setting: All models in the evaluation are set with a temperature of 0. This typically leads to more deterministic and less creative outputs, suitable for systematic evaluation.
  • Maximum Turns: The maximum number of dialogue turns is set to 6.
  • Memory Window: Models have a memory window of the most recent 3 turns for dialogue context.
  • Stability of LLM-as-a-Judge: To enhance stability, the judge model is instructed to output its reasoning process before scoring. For target planning, a reference is provided, along with in-context learning shots. For dialogue guidance, detailed descriptions and brief examples for each dimension are given to the judge. The stability was further verified by re-running evaluations for representative models, showing low standard deviations (e.g., DeepSeek-V3 and DeepSeek-R1 had standard deviations of 0.271/0.258 for Target Planning and 0.154/0.214 for Dialogue Guidance).
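
    For reference, the protocol constants above can be collected in one place; the field names are illustrative, and only the values come from the paper.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class EvalProtocol:
        judge_model: str = "gpt-4o"   # judge for both tasks, also plays the simulated user
        temperature: float = 0.0      # deterministic decoding for all evaluated models
        max_turns: int = 6            # dialogue guidance turn limit
        memory_window: int = 3        # most recent turns kept as dialogue context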

6. Results & Analysis

6.1. Core Results Analysis

The experiments provide a comprehensive evaluation of 22 frontier LLMs across the target planning and dialogue guidance tasks, offering insights into model capabilities and the impact of reasoning.

  • Overall Performance Highlights:

    • Target Planning: Claude-3.7-Sonnet and DeepSeek-R1 emerge as the top performers. DeepSeek-R1 achieves the highest average quality among thinking models, while Claude-3.7-Sonnet leads among non-thinking models.
    • Dialogue Guidance: Claude-3.7-Sonnet consistently shows the best performance across both non-thinking and thinking modes. DeepSeek-V3 and Grok-3 also demonstrate strong guidance capabilities in specific domains.
  • Cross-Domain Imbalance: A significant finding is the cross-domain imbalance in model proactivity. Even top-performing models exhibit substantial performance gaps between their strongest and weakest domains. This allows smaller models to outperform larger, generally superior models in specific niches. For instance, in Target Planning, DeepSeek-R1 is surpassed by Qwen3-14B in Ambiguous Instructions (AI.). Similarly, in Dialogue Guidance, Qwen3-14B outperforms Claude-3.7-Sonnet in the AI. domain. This highlights that LLMs' proactive abilities are not uniformly distributed and that some domains (Persuasion for target planning and System Operation for dialogue guidance) pose universal challenges.

  • Impact of Thinking Models on Target Planning:

    • Thinking models generally perform better than non-thinking models in target planning. All thinking models show improvements in overall performance compared to their corresponding non-thinking versions.
    • Notably, smaller models with thinking capabilities can even outperform larger models without explicit thinking processes.
    • However, the improvement from thinking is not always consistent; some models show minimal or even negative changes in specific domains. Also, top-tier non-thinking models (e.g., Grok-3 in AI.) can still achieve superior performance. These findings underscore the advantages of thinking mechanisms but also the inherent robustness of leading foundation models.
  • Impact of Thinking Models on Dialogue Guidance (Surprising Result):

    • Contrary to target planning, thinking models fail to outperform non-thinking models in dialogue guidance.
    • Most thinking models show a decline in guidance capabilities compared to their non-thinking counterparts. Only a few models (e.g., Gemini-2.5-Flash-Preview) show slight improvement.
    • This suggests a limitation of current reasoning approaches in balancing single-turn reasoning with the dynamic, multi-turn nature of conversational guidance. The authors hypothesize that this might be due to challenges in integrating reasoning effectively into fluid conversational dynamics.

6.2. Data Presentation (Tables)

The following table (Table 2 from the original paper) shows the performance of all models under target planning (TP) and dialogue guidance (DG); each block lists the overall average (Avg.) followed by the six domains (Rec., Per., AI., LF., Sys., GAs.):

| Models | TP Avg. | TP Rec. | TP Per. | TP AI. | TP LF. | TP Sys. | TP GAs. | DG Avg. | DG Rec. | DG Per. | DG AI. | DG LF. | DG Sys. | DG GAs. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Non-Thinking Models | | | | | | | | | | | | | | |
| Qwen2.5-7B-Instruct | 4.93 | 4.69 | 4.06 | 5.67 | 5.34 | 4.89 | 5.24 | 8.06 | 8.05 | 7.85 | 8.34 | 8.36 | 7.48 | 8.16 |
| Qwen2.5-14B-Instruct | 5.55 | 5.76 | 4.13 | 6.00 | 5.97 | 6.03 | 6.22 | 8.21 | 8.33 | 8.05 | 8.64 | 8.42 | 7.52 | 8.04 |
| Qwen2.5-32B-Instruct | 5.44 | 5.47 | 3.90 | 5.79 | 6.03 | 6.11 | 6.22 | 8.23 | 8.56 | 8.10 | 8.56 | 8.52 | 7.60 | 7.81 |
| Llama-3.1-8B-Instruct | 5.87 | 5.55 | 4.84 | 6.67 | 6.39 | 5.95 | 6.20 | 8.39 | 8.84 | 8.06 | 8.61 | 8.39 | 7.93 | 8.46 |
| Llama-3.1-405B-Instruct | 6.63 | 6.76 | 5.26 | 6.61 | 7.26 | 7.10 | 7.64 | 8.60 | 9.15 | 8.27 | 8.90 | 8.57 | 7.89 | 8.80 |
| GPT-4.1 | 6.86 | 6.90 | 5.25 | 7.29 | 7.36 | 7.54 | 7.76 | 8.61 | 9.03 | 8.37 | 8.87 | 8.76 | 8.08 | 8.43 |
| Grok-3 | 6.99 | 7.13 | 5.38 | 7.44 | 7.54 | 7.62 | 7.78 | 8.84 | 9.10 | 8.72 | 8.94 | 8.98 | 8.32 | 8.86 |
| DeepSeek-V3 | 6.54 | 6.96 | 5.94 | 6.04 | 6.07 | 7.27 | 7.84 | 8.78 | 8.78 | 8.60 | 8.99 | 8.98 | 8.52 | 8.79 |
| Llama-4-scout | 6.02 | 5.71 | 5.29 | 6.16 | 6.49 | 6.41 | 6.56 | 8.53 | 8.94 | 8.35 | 8.65 | 8.44 | 8.03 | 8.74 |
| Llama-4-maverick | 6.48 | 6.25 | 5.10 | 7.09 | 7.05 | 7.11 | 7.00 | 8.48 | 9.01 | 8.19 | 8.69 | 8.41 | 8.01 | 8.55 |
| Qwen3-8B | 6.05 | 6.35 | 4.52 | 6.23 | 6.39 | 6.86 | 6.97 | 8.50 | 8.70 | 8.36 | 8.84 | 8.82 | 7.58 | 8.40 |
| Qwen3-14B | 5.91 | 5.96 | 4.80 | 6.23 | 6.16 | 6.65 | 6.40 | 8.61 | 8.82 | 8.24 | 9.12 | 8.76 | 7.99 | 8.66 |
| Qwen3-32B | 6.67 | 6.86 | 5.29 | 6.54 | 6.84 | 7.65 | 8.02 | 8.61 | 8.77 | 8.42 | 8.91 | 8.16 | 7.97 | 8.74 |
| Qwen3-235B-A22B | 6.43 | 6.18 | 5.26 | 6.21 | 6.77 | 7.54 | 7.60 | 8.55 | 8.93 | 8.46 | 8.67 | 8.66 | 7.83 | 8.53 |
| Qwen-3-235B-A22B-0725 | 6.91 | 7.08 | 6.25 | 6.79 | 6.51 | 7.81 | 7.82 | 8.98 | 9.36 | 8.84 | 9.40 | 8.85 | 8.42 | 8.88 |
| Gemini-2.5-Flash-Preview | 6.25 | 6.04 | 5.48 | 6.95 | 6.49 | 6.54 | 6.33 | 8.34 | 8.62 | 7.91 | 8.68 | 8.57 | 7.81 | 8.42 |
| Claude-3.7-Sonnet | 7.39 | 7.22 | 6.71 | 6.81 | 8.13 | 7.49 | 8.42 | 9.01 | 9.31 | 9.01 | 8.94 | 9.10 | 8.36 | 9.18 |
| Thinking Models | | | | | | | | | | | | | | |
| R1-Distill-Qwen-7B | 5.01 | 4.67 | 3.90 | 5.47 | 5.70 | 5.24 | 5.56 | 6.82 | 6.71 | 6.67 | 7.15 | 7.20 | 6.36 | 6.61 |
| R1-Distill-Qwen-14B | 6.57 | 6.86 | 5.65 | 6.77 | 6.38 | 6.54 | 7.87 | 7.47 | 7.69 | 7.45 | 7.61 | 7.80 | 6.83 | 7.17 |
| R1-Distill-Qwen-32B | 6.45 | 6.41 | 5.29 | 6.75 | 6.95 | 6.41 | 7.51 | 7.49 | 7.62 | 7.02 | 8.06 | 7.76 | 7.14 | 7.20 |
| DeepSeek-R1 | 7.60 | 7.84 | 7.27 | 6.74 | 7.59 | 7.59 | 9.02 | 8.60 | 8.48 | 8.60 | 8.73 | 8.91 | 8.34 | 8.37 |
| Qwen3-8B | 6.51 | 6.92 | 5.39 | 6.47 | 6.72 | 6.68 | 7.60 | 8.38 | 8.37 | 8.33 | 8.59 | 8.70 | 7.92 | 8.17 |
| Qwen3-14B | 6.70 | 6.73 | 5.52 | 7.01 | 6.82 | 7.30 | 7.67 | 8.43 | 8.52 | 8.48 | 8.93 | 8.88 | 8.03 | 8.27 |
| Qwen3-32B | 6.98 | 6.82 | 5.97 | 7.09 | 7.39 | 7.27 | 7.98 | 8.55 | 8.68 | 8.52 | 8.70 | 8.72 | 8.15 | 8.30 |
| Qwen3-235B-A22B | 6.81 | 6.75 | 5.94 | 6.52 | 6.90 | 7.54 | 8.04 | 8.36 | 8.26 | 8.41 | 8.10 | 8.81 | 8.17 | 8.29 |
| Gemini-2.5-Flash-Preview | 6.52 | 7.40 | 6.10 | 7.12 | 5.77 | 6.83 | 7.39 | 6.98 | 6.96 | 6.19 | 6.80 | 8.43 | 8.90 | 8.03 8.70 |
| Claude-3.7-Sonnet | 8.51 | 7.99 | 8.48 | 6.95 | 6.26 | 7.16 | 7.78 | 6.98 | 7.57 | 8.60 | 8.95 | 9.20 | 8.86 | 8.90 |
| Gemini-2.5-pro | 9.23 | 8.40 | 9.01 | 6.94 | 7.24 | 7.62 | 8.77 | 9.22 | 8.36 | 8.32 | 8.99 | 8.88 | 8.32 | |

The following table (Table 3 from the original paper) shows dialogue guidance performance with and without an explicit target:

| Model | With Target | Without Target | Change (%) |
| --- | --- | --- | --- |
| Qwen2.5-7B-Instruct | 8.15 | 6.05 | -25.80% |
| Claude-3.7-Sonnet | 8.92 | 7.98 | -10.54% |
| Claude-3.7-Sonnet-Thinking | 8.98 | 7.93 | -11.69% |
| Dialogue Count | 180 | 180 | |

6.3. Ablation Studies / Parameter Analysis

The paper conducts further analyses to investigate the effects of domain, difficulty, thinking mechanisms, and the presence of a target on model performance.

The following figure (Figure 4 from the original paper) presents the analytical results discussed in this section:

Figure 4: Analytical results of the discussion. The figure is a multi-panel chart covering (a) planning difficulty, (b) agreeableness analysis, (c) target density comparison, (d) instruction-following analysis, and (e) tone analysis, comparing how models perform in thinking versus non-thinking modes.

6.3.1. Effects of Domain and Difficulty

  • Domain Imbalance: Model proactivity shows significant cross-domain imbalance. DeepSeek-R1, despite being a leading model, is outperformed by Qwen3-14B in Ambiguous Instructions (AI.) for target planning. Similarly, Qwen3-14B surpasses Claude-3.7-Sonnet in AI. for dialogue guidance.
  • Challenging Domains: The Persuasion (Per.) domain for target planning and System Operation (Sys.) for dialogue guidance are identified as universally challenging, where models generally struggle. This indicates fundamental weaknesses in current LLMs for these specific proactive tasks.
  • Impact of Difficulty (Agreeableness): As shown in Figure 4(a), overall model proactivity declines with increasing task difficulty. However, the performance gap between high and mid-level agreeableness is not substantial, suggesting models can often compensate through more dialogue turns.
  • Thinking Models with Low Agreeableness: Figure 4(b) reveals that thinking models demonstrate a distinct advantage when interacting with users of low agreeableness. Their ability to generate longer, more deliberated, personalized advice and examples better engages resistant users, indicating that reasoning can improve performance in highly challenging social contexts.

6.3.2. Effects of Thinking

The analysis on thinking models reveals critical insights, particularly the surprising negative impact on dialogue guidance.

  • More Pushy Message Content: The Target Density metric (sub-targets per message) reveals distinct interaction patterns. As seen in Figure 4(c), models like Qwen and DeepSeek exhibit significantly higher average target density in their thinking versions, with even larger gaps in initiation target density. This implies they "front-load" multiple targets in the opening message rather than fostering multi-turn interaction (a rough sketch of these probes appears after this list). The following figure (Figure 5 from the original paper) shows examples of different dialogue guidance:

    Figure 5: Examples of different dialogue guidance. The figure illustrates different types of dialogue guidance, covering multi-turn task-guidance interactions and conversational style, and includes depictions of pushy content, metadata leaks, multi-turn dialogue guidance, and assessments of conversational appropriateness.

    Figure 5(A) illustrates this pushy behavior where a model includes all sub-targets in the first message. In contrast, models like Gemini-2.5-Flash-Preview and Claude-3.7-Sonnet maintain small and similar target densities across thinking and non-thinking versions, gradually introducing sub-targets (Figure 5(B)).

  • Decline in Message Naturalness: Thinking models tend to generate messages that deviate from standard conversational formats. Examples include metadata leaks (e.g., "sub-target 1: ...") or generating multiple turns at once without user interaction. This suggests a potential decline in instruction-following capabilities when models engage in explicit reasoning. Figure 4(d) supports this by showing a correlation: models that perform better on IFEval (a benchmark for instruction-following) also tend to exhibit better performance in dialogue guidance. This indicates that reasoning processes might sometimes interfere with adherence to conversational norms and instructions.

  • Change of Initiation Tone: The analysis uses the passive phrase "sounds like..." in the Persuasion domain as an analytical probe for initiation tone. Figure 4(e) indicates that adopting thinking generally decreases this passive tendency, suggesting that thinking helps models better understand the proactive nature of the task. DeepSeek and Claude-3.7-Sonnet series models performed better in avoiding this passive tone compared to Qwen and Gemini-2.5-Flash-Preview series.
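
The probes above are essentially simple message-level statistics. Below is a rough sketch, assuming each environment exposes its list of sub-targets and that a simple matching predicate decides whether a message mentions a sub-target; the matching logic and function names are illustrative assumptions, not the paper's exact procedure.

```python
def target_density(assistant_messages, sub_targets, mentions):
    """Return (average target density, initiation target density).

    `mentions(message, sub_target)` is an assumed predicate, e.g. keyword matching.
    """
    counts = [sum(mentions(m, t) for t in sub_targets) for m in assistant_messages]
    avg_density = sum(counts) / max(len(counts), 1)
    initiation_density = counts[0] if counts else 0  # sub-targets packed into the opening message
    return avg_density, initiation_density

def passive_tone_rate(opening_messages, probe="sounds like"):
    """Fraction of opening messages that use the passive 'sounds like...' framing."""
    hits = sum(probe in m.lower() for m in opening_messages)
    return hits / max(len(opening_messages), 1)
```

Under this reading, a model that front-loads sub-targets (Figure 5(A)) shows a high initiation density relative to its average, while a model that introduces sub-targets gradually (Figure 5(B)) keeps both values small and similar.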

6.3.3. Effects of Target

To understand the importance of a clear target in dialogue guidance, an experiment was conducted where models performed the task without a target.

  • Results: As shown in Table 3 (presented above in Section 6.2), removing the explicit target led to a stark decline in guidance performance across all tested models (Qwen2.5-7B-Instruct, Claude-3.7-Sonnet, Claude-3.7-Sonnet-Thinking).
  • Significance: This unequivocally demonstrates the critical role of a clear target for effective dialogue guidance.
  • Model Reliance: The decline was significantly greater for the smaller model (Qwen2.5-7B-Instruct: -25.80%) than for the stronger Claude-3.7-Sonnet models (non-thinking: -10.54%; thinking: -11.69%). This reflects a greater reliance on explicit targets in smaller models, whereas more capable models can infer or maintain some coherence even without one, though they still suffer a substantial drop (a quick check of these percentages follows this list).
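
As a quick check on the Change (%) column of Table 3, the drop is computed relative to the with-target score. The snippet below reproduces the reported numbers from the rounded table entries; the first value comes out as -25.77% rather than -25.80%, presumably because the paper computes it from unrounded scores.

```python
def relative_change(with_target, without_target):
    """Percentage change of the without-target score relative to the with-target score."""
    return (without_target - with_target) / with_target * 100

print(f"{relative_change(8.15, 6.05):.2f}%")  # -25.77% (reported as -25.80%)
print(f"{relative_change(8.92, 7.98):.2f}%")  # -10.54%
print(f"{relative_change(8.98, 7.93):.2f}%")  # -11.69%
```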

6.4. Human Evaluation

To validate the LLM-as-a-judge results, a human evaluation was conducted on a random sample of 50 generated targets and dialogues.

  • Methodology: The authors manually assessed these samples against the references and scoring standards. Weighted Kappa was used to measure agreement between the human evaluators and the judge LLM (a minimal computation sketch follows this list).
  • Results:
    • For target planning: Weighted Kappa = 0.826.
    • For dialogue guidance: Weighted Kappa = 0.721.
  • Conclusion: These high Kappa values indicate strong agreement between the judge model's evaluations and human judgments, lending credibility to the LLM-as-a-judge approach used in ProactiveEval.
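
For reference, the Weighted Kappa over the 50 paired ratings can be computed as sketched below. The scores shown are placeholders, and the quadratic weighting is an assumption made purely for illustration, since the paper does not state which weighting scheme was used.

```python
from sklearn.metrics import cohen_kappa_score

# Placeholder 1-10 ratings for the same samples; not data from the paper.
human_scores = [7, 8, 6, 9, 5, 8, 7, 6]
judge_scores = [7, 8, 7, 9, 5, 8, 6, 6]

kappa = cohen_kappa_score(human_scores, judge_scores, weights="quadratic")
print(f"Weighted Kappa: {kappa:.3f}")
```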

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully introduces ProactiveEval, a unified and comprehensive evaluation framework for proactive dialogue agents powered by LLMs. It addresses the existing fragmentation in the field by defining proactive dialogue through two key tasks: target planning and dialogue guidance, each with standardized evaluation metrics assessed by an LLM-as-a-judge. A novel data generation pipeline, incorporating a hierarchical topic tree, target ensemble, and adversarial strategies, ensures the creation of diverse and challenging evaluation environments, including a previously unbenchmarked Glasses Assistant domain.

Through extensive experiments with 22 state-of-the-art LLMs, the study identifies DeepSeek-R1 as the top performer in target planning and Claude-3.7-Sonnet in dialogue guidance. A crucial finding is the nuanced impact of reasoning capabilities (thinking behavior): while beneficial for target planning, they surprisingly lead to a decline or no significant improvement in dialogue guidance. This highlights a mismatch between current reasoning mechanisms and the dynamic requirements of multi-turn conversational steering. The work also emphasizes the critical role of a clear target in guiding dialogue, with smaller models showing greater reliance. Overall, ProactiveEval provides a robust foundation for assessing and advancing the next generation of proactive LLM agents.

7.2. Limitations & Future Work

The authors acknowledge several limitations of ProactiveEval and suggest future research directions:

  • Rapid Evolution of LLMs: With the swift advancements in LLM technology, current evaluation metrics for target planning and dialogue guidance may quickly become insufficient. Future work needs to continuously explore ways to synthesize more challenging and realistic proactive dialogue environments to keep pace.
  • Evaluation Metrics Realism: While standards are designed based on existing work, additional factors in real-world settings might affect user perceptions of proactive dialogue. The current metrics might not capture all nuances of human-like proactivity.
  • Potential Biases in LLM-as-a-Judge: Despite achieving strong consistency with human evaluations, the inherent biases and potential gaps of the LLM-as-a-judge methodology may still exist within the framework.
  • Framework Updates: The authors plan to update the framework regularly, integrating emerging advancements, addressing the identified limitations, and maintaining its relevance in a fast-evolving field.

7.3. Personal Insights & Critique

This paper presents a timely and highly relevant contribution to the field of conversational AI. The shift from reactive to proactive agents is critical for the widespread adoption and utility of LLMs in complex real-world scenarios, and a unified evaluation framework is an absolute necessity.

  • Strengths:

    • Unified Approach: The framework's core strength lies in its unified definition and evaluation of proactive dialogue across diverse domains. This is a significant step forward from fragmented, domain-specific benchmarks.
    • Automated Data Generation: The sophisticated data synthesis and refinement pipeline, incorporating techniques like topic trees, target ensemble, obfuscation rewriting, and noise injection, is a major innovation. It allows for scalable creation of diverse and challenging evaluation scenarios, addressing the high cost and limited scope of manual dataset curation. This adversarial data generation paradigm could be highly influential in other benchmark creation efforts.
    • Insightful Analysis of "Thinking Behavior": The detailed investigation into how explicit reasoning impacts target planning versus dialogue guidance is a particularly valuable contribution. The finding that thinking helps planning but can hinder guidance (e.g., leading to pushy or unnatural messages) is counter-intuitive and offers crucial insights into the current limitations of Chain-of-Thought type reasoning in dynamic, interactive contexts. It suggests that merely generating internal thoughts doesn't automatically translate to better social or conversational intelligence.
    • Robust Evaluation: The use of a powerful LLM-as-a-judge validated by strong Weighted Kappa scores with human evaluators adds credibility and scalability to the evaluation process.
  • Potential Issues & Areas for Improvement:

    • "Thinking Model" Definition: While the paper explores "thinking models," the exact nature of the thinking process (e.g., specific CoT prompts, internal planning modules, or architectures such as DeepSeek-R1 that are explicitly designed for reasoning) could be further elaborated to explain why certain models behave the way they do. In addition, Table 2 of the original paper distinguishes "thinking" from "hybrid thinking" models while the text refers only to "Thinking Models"; this distinction could be clarified.
    • Generalizability of GPT-4o as Judge/User: While well-validated, relying solely on GPT-4o as both the judge and the simulated user introduces a potential circularity or bias. Future work could explore using different powerful LLMs for these roles or incorporating more diverse human evaluations (beyond consistency checks) to broaden the perspective.
    • Dynamic Nature of Dialogue Guidance Difficulty: The finding that thinking models struggle with dialogue guidance suggests that current LLMs might prioritize logical coherence and task completion over conversational fluency, empathy, or adaptive social cues. Future research could focus on training LLMs specifically to integrate reasoning with conversational finesse, perhaps through reinforcement learning from human feedback that prioritizes both task achievement and natural interaction.
    • Metrics for Nuance: While the 1-10 scale and 5 dimensions are good, proactive dialogue in the real world has many more nuances (e.g., emotional intelligence, timing of proactivity, user preference for proactivity level). The framework could evolve to include more granular or situation-aware metrics.
  • Applicability & Future Value: The ProactiveEval framework and its findings are highly applicable. Its methodology for data generation can be adapted for creating benchmarks in other complex, multi-step AI tasks. The insights into reasoning's differential impact on planning versus guidance are critical for designing future proactive AI assistants in diverse fields, from customer service and healthcare to personal productivity and smart environments. It underscores the need for models that are not just intelligent planners but also socially intelligent communicators.
