
Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs


TL;DR Summary

Learn-to-Ask learns proactive LLMs from offline expert logs without simulators by leveraging observed future data to infer turn-by-turn rewards, decomposing long-horizon tasks for effective training and deployment in real-world high-stakes domains.

Abstract

Large Language Models (LLMs) excel as passive responders, but teaching them to be proactive, goal-oriented partners, a critical capability in high-stakes domains, remains a major challenge. Current paradigms either myopically optimize single-turn attributes or rely on brittle, high-cost user simulators, creating a persistent "reality gap". To bridge this gap, we introduce Learn-to-Ask, a general, simulator-free framework for learning and deploying proactive dialogue agents directly from offline expert data, bypassing the need to model complex user dynamics. Our key insight is to reframe the offline policy learning problem by leveraging the observed future of each expert trajectory. This allows us to infer a dense, turn-by-turn reward signal grounded in the expert's revealed strategy, decomposing the intractable long-horizon problem into a series of supervised learning tasks, and training a policy to output a structured (action, state_assessment) tuple, governing both what to ask and, crucially, when to stop. To ensure reward fidelity, our Automated Grader Calibration pipeline systematically purges noise from the LLM-based reward model with minimal human supervision. Empirically, we demonstrate the efficacy of Learn-to-Ask in a real-world medical dataset, using LLMs of varying sizes up to 32B. Our approach culminates in the successful deployment of LLMs into a live, large-scale online AI service. In rigorous in-house evaluations, our model was launched and achieved performance even superior to human experts, proving our framework's ability to translate offline data into tangible, real-world impact. We hope this work provides a practical and economically viable blueprint for transforming passive LLMs into proactive, goal-oriented LLM applications.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs

1.2. Authors

Fei Wei*, Daoyuan Chen*, Ce Wang†, Yilun Huang†, Yushuo Chen, Xuchen Pan, Yaliang Li‡, Bolin Ding‡. All authors are affiliated with Alibaba Group.

1.3. Journal/Conference

The paper is published on arXiv, a preprint server, indicating it has not yet undergone formal peer review for a specific conference or journal. The listed timestamp, "Published at (UTC): 2025-10-29T12:08:07.000Z", corresponds to the arXiv posting date. In the context of academic publishing, arXiv allows early dissemination of research findings.

1.4. Publication Year

2025

1.5. Abstract

Large Language Models (LLMs) currently excel at passive responses, but a significant challenge remains in training them to be proactive, goal-oriented partners, especially in critical, high-stakes fields. Existing methods either focus narrowly on single-turn attributes or rely on expensive and often inaccurate user simulators, leading to a persistent "reality gap." To address this, the paper introduces Learn-to-Ask, a general, simulator-free framework that learns and deploys proactive dialogue agents directly from existing offline expert data. This approach avoids modeling complex user behaviors. The core innovation is reframing the offline policy learning problem by utilizing the observed future of each expert conversation trajectory. This allows for the inference of a dense, turn-by-turn reward signal that is grounded in the expert's actual strategy. This decomposes the otherwise intractable long-horizon problem into a series of supervised learning tasks. The policy is trained to output a structured (action, state_assessment) tuple, which dictates both what to ask and, critically, when to stop. To maintain the accuracy of this reward signal, an Automated Grader Calibration pipeline is employed, systematically removing noise from the LLM-based reward model with minimal human oversight. The effectiveness of Learn-to-Ask is demonstrated empirically on a real-world medical dataset, utilizing LLMs up to 32 billion parameters. The method culminates in the successful deployment of the LLM into a live, large-scale online AI service, where rigorous in-house evaluations showed performance even surpassing human experts. The authors posit this framework offers a practical and economically viable blueprint for transforming passive LLMs into proactive, goal-oriented applications.

Original Link: https://arxiv.org/abs/2510.25441
PDF Link: https://arxiv.org/pdf/2510.25441v1.pdf
Publication Status: Preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper addresses is the inability of current Large Language Models (LLMs) to act as proactive, goal-oriented partners, particularly in high-stakes domains such as healthcare, law, and finance. While LLMs are proficient as passive responders, there's a critical need for them to take initiative, gather information, and drive conversations towards a specific objective.

This problem is important because numerous daily goal-oriented conversations between human experts and clients generate a vast amount of valuable dialogue data (a "goldmine") that LLMs currently fail to harness effectively. The default passive behavior of LLMs severely limits their potential for truly collaborative and intelligent applications.

The specific challenges or gaps in prior research, which contribute to this reality gap, are two-fold:

  1. Attribute-based alignment: This approach focuses on optimizing single-turn qualities (e.g., clarity, relevance) using preference data. It is myopic as it fails to learn a coherent, sequential policy that considers conversational flow and, crucially, lacks a principled mechanism to decide when to stop.

  2. Simulation-based optimization: This method uses user simulators to train agents for long-horizon rewards. However, creating high-fidelity simulators for complex, open-ended, expert-level domains is notoriously difficult, computationally expensive, and often results in policies that fail to generalize to real-world interactions due to the inherent combinatorial explosion of states.

    The paper's entry point or innovative idea is to bypass these limitations by asking a fundamental question: Can an effective, long-horizon questioning policy be learned directly from offline expert data, thereby eliminating the need for a simulator and bridging the reality gap? The Learn-to-Ask framework is proposed as an affirmative answer to this question.

2.2. Main Contributions / Findings

The paper makes three primary contributions:

  1. A Simulator-Free Policy Learning Framework (Learn-to-Ask): The paper introduces Learn-to-Ask, a novel framework that enables an LLM to learn a complete, sequential questioning policy—including a critical stopping condition—directly from offline expert logs. This provides a grounded, data-driven, and economically viable alternative to the brittle and costly user simulators traditionally used in Reinforcement Learning (RL) for dialogue agents.

  2. Hindsight-based Reward Inference with Automated Calibration: A key innovation is a method to infer dense, turn-by-turn reward signals by leveraging the observed future of expert trajectories. This reframes the intractable Reinforcement Learning problem into a series of supervised learning tasks. To ensure the fidelity and accuracy of these rewards, the framework includes an Automated Grader Calibration pipeline. This pipeline systematically purges noise from the LLM-based reward model with minimal human supervision, mitigating oracle noise and ensuring the learning signal is truly aligned with expert intent.

  3. Demonstrated Real-World Impact and Superhuman Performance: The framework's efficacy is validated not just through offline experiments on a real-world medical dataset (RealMedConv) but, more importantly, through its successful deployment in a live, large-scale commercial AI service. In rigorous in-house evaluations, the Learn-to-Ask-trained model achieved task-success rates exceeding those of human experts and demonstrated significant business impact (e.g., a 1.87x lift in dialog-to-purchase conversion rate). This practical deployment provides strong evidence that the offline learning paradigm directly translates to superior real-world performance, offering a practical blueprint for developing proactive LLM applications.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the paper, a beginner should be familiar with the following core concepts:

  • Large Language Models (LLMs): These are advanced artificial intelligence models, typically based on the transformer architecture, trained on vast amounts of text data to understand, generate, and process human language. They excel at tasks like text generation, translation, summarization, and question answering. In this paper, LLMs are the base agents being taught to be proactive.

  • Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent doesn't receive explicit instructions but learns through trial and error. Key components of RL are:

    • Agent: The learner or decision-maker.
    • Environment: The world the agent interacts with.
    • State: The current situation or context of the environment.
    • Action: A move made by the agent in a given state.
    • Reward: A feedback signal (positive or negative) from the environment, indicating the desirability of an action.
    • Policy ($\pi$): A strategy that maps states to actions, determining the agent's behavior.
    • Value Function: A prediction of the future reward starting from a given state or state-action pair.
  • Offline Reinforcement Learning (Offline RL): A subfield of RL where the agent learns a policy solely from a fixed dataset of previously collected interactions (trajectories) without any further interaction with the environment. This is crucial for real-world applications where online interaction is expensive, dangerous, or impractical (e.g., healthcare, finance). A major challenge in offline RL is dealing with out-of-distribution (OOD) actions, where the learned policy might try to exploit actions not seen in the training data, leading to extrapolation error.

  • Supervised Fine-Tuning (SFT) / Behavioral Cloning: A common technique to adapt a pre-trained LLM to a specific task. It involves training the LLM on a dataset of input-output pairs (e.g., conversation history - expert's next utterance). The model learns to directly mimic the behavior shown in the training data. In the context of dialogue, it teaches the model to produce the next utterance given the current conversation history. It's myopic because it optimizes for single-step imitation rather than long-term goals.

  • Direct Preference Optimization (DPO): A technique for aligning LLMs with human preferences without explicitly training a separate reward model. Instead, it directly optimizes the LLM's policy to prefer chosen responses over rejected responses based on pairwise human feedback. The loss function encourages the model to assign higher probabilities to preferred responses and lower probabilities to dispreferred ones. While effective for local preferences (e.g., clarity of a single turn), it struggles with complex sequential decision-making and long-horizon goals.

  • Markov Decision Process (MDP): A mathematical framework for modeling sequential decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. An MDP is defined by:

    • A set of states $S$.
    • A set of actions $A$.
    • A transition function $P(s' \mid s, a)$, which gives the probability of moving to state $s'$ from state $s$ after taking action $a$.
    • A reward function $R(s, a, s')$, which specifies the immediate reward received after transitioning from $s$ to $s'$ via action $a$.
    • A discount factor $\gamma$, which determines the present value of future rewards.
  • Hindsight Experience Replay (HER): A technique used in Reinforcement Learning to make sparse rewards more dense and improve sample efficiency. When an agent fails to achieve its intended goal, HER re-labels the failed trajectory by pretending that the goal that was actually achieved at the end of the trajectory was the intended goal from the beginning. This generates useful training examples even from episodes that did not achieve the original goal, making learning from sparse rewards more tractable. This paper draws inspiration from HER's core idea of leveraging achieved outcomes.
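    To ground the HER idea, here is a minimal sketch of hindsight relabeling for a generic goal-conditioned trajectory; the Transition fields are illustrative and not taken from the paper.

```python
# Minimal sketch of HER-style hindsight relabeling for a generic
# goal-conditioned trajectory; the Transition fields are illustrative,
# not taken from the paper.
from dataclasses import dataclass, replace
from typing import List

@dataclass
class Transition:
    state: str
    action: str
    intended_goal: str
    achieved_goal: str  # what the episode actually reached

def hindsight_relabel(trajectory: List[Transition]) -> List[Transition]:
    """Pretend the goal achieved at the end was the intended goal all along,
    turning a 'failed' episode into useful supervision."""
    if not trajectory:
        return []
    final_goal = trajectory[-1].achieved_goal
    return [replace(t, intended_goal=final_goal) for t in trajectory]
```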

3.2. Previous Works

The paper frames previous approaches to instilling proactivity in LLMs into two main paradigms, highlighting their limitations:

  1. Attribute-based alignment / Single-turn optimization:

    • Methods: This category includes Supervised Fine-Tuning (SFT) and preference optimization techniques like Direct Preference Optimization (DPO) (Rafailov et al., 2023). Early forms also included simple prompting (Deng et al., 2023b; Zhao & Dou, 2024) to elicit behaviors like clarifying questions.
    • Focus: These methods optimize for single-turn attributes such as clarity, relevance, or safety (Li et al., 2025b; Qian et al., 2023; Xu et al., 2025; Zhou et al., 2022).
    • Limitations: They are myopic and fail to learn a long-horizon, stateful policy. They don't account for temporal dependencies in a conversation and, critically, lack a mechanism for deciding when to stop. The paper notes that DPO's single binary preference signal is often insufficient for guiding learning of dual objectives (what to ask and when to stop).
  2. Simulation-based optimization:

    • Methods: This involves using Reinforcement Learning (RL) in simulated user environments (Wu et al., 2025; Xu et al., 2023). Story-related reasoning tasks using tree-based extensions for data simulation have also been explored (Zhou et al., 2025).
    • Focus: These approaches aim to tackle sequential decision-making and long-horizon rewards by interacting with a synthetic user.
    • Limitations (Reality Gap): Building a high-fidelity user simulator for complex, open-ended domains (like medical consultation) is notoriously difficult, computationally prohibitive, and suffers from a combinatorial explosion of states (Hao et al., 2024). Policies optimized in synthetic environments often fail to generalize to real human interactions, leading to poor performance in the real world.
  3. Offline RL from Human Data:

    • Alignment: The paper's work is philosophically aligned with offline RL from human-involved data (Shani et al., 2024; Zhou et al., 2024; Shi et al., 2024).
    • Distinction: Unlike standard offline RL which assumes a fixed reward function, this paper's key challenge is to infer the reward signal itself from expert behavior. Their methodology is distinct: they decompose long trajectories into single-turn decisions and infer fine-grained, turn-level rewards by using the observed future of the real conversation. This is seen as a novel adaptation and significant extension of the hindsight learning paradigm.

3.3. Technological Evolution

The evolution of teaching LLMs proactive behavior can be summarized as follows:

  • Early Dialogue Systems (Pre-LLM): Rule-based or statistical methods were used for proactive behaviors, typically in narrow domains (Deng et al., 2023a; Ling et al., 2025). These lacked the general knowledge and adaptability of modern LLMs.
  • LLMs with Simple Prompting: With the advent of powerful LLMs, initial efforts leveraged their vast world knowledge through prompt engineering. This involved crafting specific instructions to elicit proactive behaviors like asking clarifying questions (Deng et al., 2023b; Zhao & Dou, 2024). While straightforward, these methods lacked the ability to learn complex, domain-specific strategies from data.
  • LLM Fine-tuning for Single-Turn Attributes: This represented a step towards data-driven learning. Techniques like SFT and DPO were used to align LLMs with desirable single-turn qualities (e.g., relevance, clarity, safety) by training on synthetic preference data (Li et al., 2025b). This improved local response quality but remained myopic and couldn't learn long-horizon sequential policies.
  • LLM Fine-tuning with Simulation-based RL: To address sequential decision-making, Reinforcement Learning was applied, often interacting with user simulators (Wu et al., 2025). This aimed for long-horizon rewards but introduced the reality gap problem: policies trained in synthetic environments often failed in real-world human interactions due to the difficulty of creating realistic simulators.
  • Learn-to-Ask's Position: This paper's Learn-to-Ask framework is situated at a critical juncture in this evolution. It moves beyond myopic single-turn optimization and directly tackles the reality gap of simulation-based RL. By learning a sequential policy and a stopping condition directly from offline expert data using a novel hindsight-based reward inference, it offers a simulator-free and data-driven solution for truly proactive LLM agents. It aims to make LLMs collaborative partners by leveraging real-world expert strategies.

3.4. Differentiation Analysis

The Learn-to-Ask framework differentiates itself from existing methods primarily in its approach to learning a long-horizon, proactive dialogue policy from offline data.

Here's a comparison with the main methods in related work:

  • Compared to Attribute-based Alignment (e.g., SFT, DPO):

    • Core Difference: Attribute-based methods are myopic, optimizing single-turn qualities (e.g., clarity, relevance of a question). Learn-to-Ask, in contrast, learns a complete, sequential policy that accounts for temporal dependencies and long-term conversational goals.
    • Innovation: Learn-to-Ask specifically addresses the crucial decision of when to stop asking questions, a component entirely absent in attribute-focused methods. SFT learns to mimic a single path, failing to generalize to alternative valid strategies, while DPO struggles with conflicting preference signals across diverse expert trajectories. Learn-to-Ask overcomes this by using a hindsight-driven objective that estimates "remaining nodes to be covered," making it robust to path variations.
    • Reward Signal: SFT uses direct imitation as a signal, DPO uses binary preference pairs. Learn-to-Ask infers a dense, turn-by-turn reward signal directly from the observed future of expert trajectories, which is grounded in what experts actually did to achieve their goals, providing a more nuanced learning signal.
  • Compared to Simulation-based Optimization (e.g., RL with user simulators):

    • Core Difference: Simulation-based methods rely on user simulators to generate interaction data for Reinforcement Learning. Learn-to-Ask is simulator-free.
    • Innovation: Learn-to-Ask bypasses the notorious reality gap problem where policies optimized in synthetic environments often fail to generalize to real-world human interactions. It directly learns from real-world offline expert data, ensuring real-world applicability and robustness. Building high-fidelity simulators for complex, open-ended domains is extremely difficult and computationally expensive, a challenge Learn-to-Ask completely avoids.
    • Stability: Simulation-based methods often face instability in offline value estimation due to extrapolation errors. Learn-to-Ask reframes the problem into supervised learning on hindsight-based objectives, avoiding the need to estimate a long-horizon, unstable value function altogether, leading to a much more stable and direct learning process.
  • Compared to standard Offline RL from Human Data:

    • Core Difference: While philosophically aligned, standard offline RL often assumes a fixed reward function. Learn-to-Ask's primary contribution lies in its reward inference methodology.

    • Innovation: Instead of assuming a reward, Learn-to-Ask infers the reward signal itself from expert behavior. It decomposes long trajectories into single-turn decisions and infers fine-grained, turn-level rewards by using the observed future of the real conversation as a grounded source of truth. This makes policy learning more precise and data-efficient. It also adapts the Hindsight Experience Replay (HER) concept to the high-dimensional language space for learning a complete dialogue policy with an explicit stopping condition, which HER does not directly address.

      In essence, Learn-to-Ask offers a practical, economically viable, and robust solution by learning directly from the rich, sequential structure of existing expert trajectories, transforming intractable long-horizon RL into tractable supervised learning tasks, and ensuring reward fidelity through automated calibration, all without the need for expensive and often unrealistic simulators.

4. Methodology

4.1. Principles

The core idea behind Learn-to-Ask is to transform the complex and often intractable problem of offline Reinforcement Learning (RL) for proactive dialogue policies into a series of tractable, single-step supervised learning tasks. This is achieved by leveraging the observed future of each expert trajectory as a grounded oracle. Instead of estimating a speculative long-horizon value function or relying on a user simulator, the framework directly infers dense, turn-by-turn reward signals that reflect the expert's revealed strategy. This hindsight-driven objective decomposition allows the policy to learn both what to ask (to gather target information) and when to stop (when the goal is met) in a data-driven and stable manner.

4.2. Core Methodology In-depth (Layer by Layer)

The Learn-to-Ask framework is designed to move beyond myopic imitation and address the challenges of offline RL in dialogue by decomposing the long-horizon problem into single-step supervised learning tasks. The overall workflow is illustrated in Figure 1.

The following figure (Figure 1 from the original paper) shows the overall workflow of the proposed Learn-to-Ask framework:

Figure 1: The overview of the proposed Learn-to-Ask framework, which transforms the intractable offline RL problem into a sequence of tractable supervised learning tasks. The framework leverages the future observed in expert dialogue trajectories and a hierarchical reward model, culminating in the integrated reward $R_s \cdot (1 + \beta \cdot R_a) + \Omega$.

As seen in Figure 1, the framework consists of three main parts:

  • Part A: Offline RL Problem Formulation: The initial problem of learning a proactive dialogue policy from static offline data is formulated as an offline RL problem.
  • Part B: Hindsight-driven Reward Pipeline: This is the core innovation where the observed future of expert trajectories is analyzed to extract ground-truth objectives (target information set and stopping decision). This process is guided by an Automated Grader Calibration pipeline.
  • Part C: Policy Optimization: Using the grounded objectives from Part B, the policy is trained through Reinforcement Fine-Tuning (RFT) to generate structured utterances (action, state_assessment) that align with the expert's strategy.

4.2.1. Problem Formulation: Proactive Dialogue As Offline RL

The task of proactive, goal-oriented dialogue is formulated as a sequential decision-making problem. The agent aims to learn a policy, $\pi$, from a static, offline dataset of expert-led conversations, denoted as $\mathcal{D} = \{\tau_1, \tau_2, \dots, \tau_N\}$.

Each trajectory $\tau \in \mathcal{D}$ represents a complete conversation: $\tau = (u_0, x_1, u_1, \dots, x_{T-1}, u_{T-1})$, where:

  • $u_t$ is the user's utterance at turn $t$.

  • $x_t$ is the agent's utterance at turn $t$.

  • $T$ is the total number of turns in the conversation.

    At each turn $t$, the policy $\pi$ observes the conversation history up to that point, $C_{t-1} = (u_0, x_1, \dots, u_{t-1})$, and generates a structured utterance tuple $x_t = (a_t, s_t)$. Here:

  • $a_t$ is a natural language question aimed at gathering new information.

  • $s_t \in \{\mathsf{CONTINUE}, \mathsf{STOP}\}$ is a discrete state assessment indicating whether the agent believes the conversational goal has been met.

    The policy is thus defined as $\pi(a_t, s_t \mid C_{t-1})$. The objective is for the learned policy to mimic the expert's strategy for effective and efficient task completion (e.g., medical diagnosis).

This problem is modeled as an offline Markov Decision Process (MDP) with the following components:

  1. State: The conversation history $C_{t-1}$.

  2. Action: The agent's structured utterance $(a_t, s_t)$.

  3. Transition Dynamics (P): The unknown user response dynamics, which govern the state transition $P(C_t \mid C_{t-1}, a_t)$. The next state $C_t$ is formed by appending the agent's question $a_t$ and the user's subsequent utterance $u_t$ to the history $C_{t-1}$. This is unknown in the offline setting.

  4. Reward Function (R): The unknown reward function that implicitly guided the expert's actions.

    The central challenges addressed are operating in an offline setting (the policy cannot query $P$) and inferring the reward function $R$ directly from expert trajectories.
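To make the formulation concrete, here is a minimal sketch of the state and action structures as plain Python data classes; the class and field names are illustrative and not prescribed by the paper.

```python
# Illustrative data structures for the dialogue MDP above; class and field
# names are hypothetical, not prescribed by the paper.
from dataclasses import dataclass, field
from typing import List, Literal

@dataclass
class Turn:
    speaker: Literal["user", "agent"]
    text: str

@dataclass
class DialogueState:
    """State C_{t-1}: the conversation history observed so far."""
    history: List[Turn] = field(default_factory=list)

@dataclass
class AgentOutput:
    """Structured action x_t = (a_t, s_t)."""
    question: str                            # a_t: what to ask next
    assessment: Literal["CONTINUE", "STOP"]  # s_t: whether the goal is met
```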

4.2.2. Overview: Objective Decomposition via Hindsight

To overcome the challenges of standard offline RL (like the simulator gap and instability of value estimation), Learn-to-Ask introduces a novel objective decomposition inspired by Hindsight Learning. The core idea is to transform the intractable sequential decision problem into a sequence of tractable, single-step supervised learning tasks by using the observed future of each real trajectory as a grounded oracle.

Instead of estimating a long-horizon value, for each turn $t$, a Hindsight-driven Reward Pipeline (Part B in Figure 1) analyzes the future conversation segment $C_t^c = C \setminus C_{t-1}$ to extract a ground-truth tuple $(I_t^*, s_t^*)$.

  • $I_t^*$: The target information set that the expert went on to collect.

  • $s_t^*$: The expert's implicit stopping decision (CONTINUE or STOP).

    This process effectively creates a dataset of (state, hindsight-objective) pairs. This allows for stable policy optimization (Part C) to train a policy $\pi(a_t, s_t \mid C_{t-1})$ that aligns with this hindsight-derived objective. This decomposition grounds the entire learning process in demonstrated expert strategy, teaching the policy both what to ask (to cover $I_t^*$) and when to stop (to match $s_t^*$).
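The decomposition itself is mechanical: every turn of an offline trajectory yields a (context, observed future) pair, which the reward pipeline then converts into a hindsight objective. A minimal sketch, assuming the dialogue is stored as a flat list of utterance strings:

```python
# A minimal sketch of the per-turn decomposition, assuming the dialogue is
# stored as a flat, chronologically ordered list of utterance strings.
from typing import List, Tuple

def split_by_turn(turns: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Return (context C_{t-1}, observed future C_t^c) pairs for every
    agent decision point; the reward pipeline later maps each future
    segment to a hindsight objective (I_t^*, s_t^*)."""
    return [(turns[:t], turns[t:]) for t in range(1, len(turns))]
```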

4.2.3. Ground Truth Extraction from Observed Trajectories

For each turn $t$ in a successful dialogue trajectory $C$ (where the designated goal $g$ was achieved by the end), a ground truth tuple $(I_t^*, s_t^*)$ is extracted from the future context $C_t^c$. This process is guided by a powerful LLM, $\pi^*$, which acts as a noisy oracle for interpreting the expert's latent intent.

Micro-Goal $I_t^*$ (Target Information Set)

This represents the set of goal-relevant information that the expert sought and obtained in the subsequent turns within $C_t^c$. It is defined as the "information delta" that the expert successfully closed. To extract this, the LLM $\pi^*$ is used as an information extractor. For each turn $t$, $\pi^*$ is prompted with the overall goal $g$, the current context $C_{t-1}$, and the future conversation $C_t^c$. The prompt instructs the LLM to identify and list only the critical new pieces of information present in the user's responses within $C_t^c$ that were not already available in $C_{t-1}$.

The structured extraction process, governed by $\pi^*$, yields the target information set for turn $t$ (a code sketch of this extraction step follows the symbol list below): $$ I_t^* = \mathrm{Extract}(C_{t-1}, C_t^c, g; \pi^*) \quad \text{for } t < T_C; \qquad I_{T_C}^* = \emptyset. $$ Where:

  • $I_t^*$: The target information set for turn $t$.

  • $\mathrm{Extract}$: The function executed by the LLM $\pi^*$ to perform information extraction.

  • $C_{t-1}$: The conversation history up to turn $t-1$.

  • $C_t^c$: The future conversation segment from turn $t$ onwards.

  • $g$: The overall conversational goal.

  • $\pi^*$: The powerful LLM (e.g., Qwen2.5-32B-Instruct) acting as the information extractor.

  • $T_C$: The final turn of the conversation.

  • $I_{T_C}^* = \emptyset$: At the final turn, there is no more future information to extract, so the target set is empty.

    This process ensures that the micro-goal is grounded in the actual information-gathering path taken by a human expert. A critical step is to avoid extracting overly generic or context-independent information (e.g., common medical questions like "pregnancy status") to prevent reward hacking.
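As a concrete illustration of the extraction step, the sketch below wraps the oracle LLM behind a generic `llm_call(prompt) -> str` helper (a hypothetical interface; the paper's calibrated Extractor prompt is not reproduced here) and assumes the model answers with a JSON list:

```python
# Hedged sketch of the micro-goal extraction step. `llm_call(prompt) -> str`
# is a hypothetical helper standing in for the oracle LLM pi* (e.g.,
# Qwen2.5-32B-Instruct); the prompt wording is illustrative, not the paper's
# calibrated Extractor prompt, and the model is assumed to return a JSON list.
import json
from typing import Callable, List

def extract_target_info(context: str, future: str, goal: str,
                        llm_call: Callable[[str], str]) -> List[str]:
    """Return I_t^*: critical NEW user-provided facts in the observed future
    that are not already present in the current context."""
    prompt = (
        f"Overall goal: {goal}\n"
        f"Conversation so far:\n{context}\n"
        f"Observed future of the conversation:\n{future}\n"
        "List only the critical new pieces of information revealed by the user "
        "in the future segment that are NOT already present above. Ignore "
        "generic, context-independent facts. Answer as a JSON list of strings."
    )
    return json.loads(llm_call(prompt))
```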

Macro-Goal $s_t^*$ (Target Situation Assessment)

This represents the ideal action (CONTINUE or STOP) at turn $t$, reflecting the expert's implicit decision. It is inferred based on whether there was still critical information to be gathered (a code transcription of this rule follows the symbol list below): $$ s_t^* = \begin{cases} \mathsf{CONTINUE} & \text{if } I_t^* \neq \emptyset \text{ and } t < T_C, \\ \mathsf{STOP} & \text{if } I_t^* = \emptyset \text{ or } t = T_C. \end{cases} $$ Where:

  • $s_t^*$: The target situation assessment for turn $t$.

  • $\mathsf{CONTINUE}$: The expert's implicit decision is to continue the conversation.

  • $\mathsf{STOP}$: The expert's implicit decision is to stop the conversation.

  • $I_t^* \neq \emptyset$: The target information set is not empty, meaning there is still information to be gathered.

  • $t < T_C$: The current turn is not the final turn.

  • $I_t^* = \emptyset$: The target information set is empty, meaning all necessary information has been gathered.

  • $t = T_C$: The current turn is the final turn, implying the conversation has concluded.

    This mechanism directly learns an expert-aligned stopping policy from data, which is a component typically absent in attribute-focused methods.
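The stopping rule translates directly into code; a minimal sketch, with $I_t^*$ represented as a list of strings:

```python
# Direct transcription of the stopping rule above; I_t^* is represented as a
# list of strings and final_turn stands for T_C.
from typing import List

def target_assessment(target_info: List[str], t: int, final_turn: int) -> str:
    """s_t^* = CONTINUE while information remains to be gathered before the
    final turn; otherwise STOP."""
    return "CONTINUE" if target_info and t < final_turn else "STOP"
```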

4.2.4. Automated Prompt Calibration (Auto-Prompt)

The Learn-to-Ask framework heavily relies on LLMs for ground-truth extraction, reward grading, and policy sampling. The behavior of these LLMs is dictated by natural language prompts, making their alignment with true expert intent a crucial concern. An uncalibrated prompt can introduce systematic bias, leading the policy to pursue phantom goals or misinterpret its actions.

To ensure robustness, the Auto-Prompt pipeline is introduced for automatically calibrating all three types of prompts with minimal human supervision, creating a verifiable chain of fidelity.

The following algorithm (Algorithm 1 from the original paper) describes the automated prompt optimization process:

Algorithm 1 Automated Prompt Optimization
1: Input: Initial prompts $P_{\mathrm{seed}}$, calibration sets $\mathcal{D}_{\mathrm{calib}}$, $\mathcal{D}_{\mathrm{anchor}}$, number of iterations $K$.
2: $P_{\mathrm{best}} \leftarrow P_{\mathrm{seed}}$ (for [EXTRACT, GRADER, ROLLOUT]).
3: for $k = 1, \ldots, K$ do
4:   Generate candidate prompts $\mathcal{P}_{\mathrm{cand}}$ from $P_{\mathrm{best}}$.
5:   Execute type-specific pipelines for each candidate: $O_j = \mathtt{Pipeline}(P_j, \mathcal{D}_{\mathrm{calib}}, T)$
6:   Compute consistency score against labels from $\mathcal{D}_{\mathrm{anchor}}$: $S_j = \mathtt{Score}(O_j, \mathcal{D}_{\mathrm{anchor}}, T)$
7:   Update $P_{\mathrm{best}} \leftarrow \arg\max_{P_j} S_j$. ▷ Maximizing score
8: end for
9: Output: Calibrated prompts $P_{\mathrm{best}}$.

Where:

  • $P_{\mathrm{seed}}$: The initial set of prompts.

  • $\mathcal{D}_{\mathrm{calib}}$: A calibration dataset used for executing pipelines.

  • $\mathcal{D}_{\mathrm{anchor}}$: A small, high-quality, human-verified anchor set used for scoring prompt consistency.

  • $K$: The number of iterations for prompt optimization.

  • $P_{\mathrm{best}}$: The best-performing prompt found during optimization.

  • $\mathcal{P}_{\mathrm{cand}}$: A set of candidate prompt variations generated from $P_{\mathrm{best}}$.

  • $O_j$: The output of a pipeline execution for candidate prompt $P_j$.

  • $\mathtt{Pipeline}(P_j, \mathcal{D}_{\mathrm{calib}}, T)$: Executes a pipeline specific to prompt type $T$ (Extractor, Grader, or Rollout) using prompt $P_j$ on $\mathcal{D}_{\mathrm{calib}}$.

  • $\mathtt{Score}(O_j, \mathcal{D}_{\mathrm{anchor}}, T)$: Computes a consistency score $S_j$ for the output $O_j$ against the human-curated anchor set $\mathcal{D}_{\mathrm{anchor}}$ for prompt type $T$.

    The pipeline iteratively performs four key steps:

  1. Candidate Generation: A generator LLM proposes variations of the current best prompt (e.g., semantic paraphrasing, rule-based mutations).

  2. Type-specific Pipeline Execution on Calibration Set ($\mathcal{D}_{\mathrm{calib}}$): Each candidate prompt is used in its respective pipeline (information extraction, reward grading, or policy rollout) on a flexible calibration dataset.

    • For the Info-Extractor ($T = \mathtt{EXTRACT}$): The information set extraction pipeline is run, returning $O_j$ as the extracted information set.
    • For the Reward Grader ($T = \mathtt{GRADER}$): The grader model returns rewards $O_j$ for prepared rollouts in $\mathcal{D}_{\mathrm{calib}}$.
    • For the Policy Rollout ($T = \mathtt{ROLLOUT}$): The policy model generates rollouts with $P_j$ on $\mathcal{D}_{\mathrm{calib}}$, and a fixed grader computes rewards $O_j$.
  3. Consistency Scoring with Human Anchors ($\mathcal{D}_{\mathrm{anchor}}$): The quality of each candidate prompt is measured against a small, human-verified anchor set.

    • For the Info-Extractor: $S_j$ is measured by accuracy (e.g., F1-score or exact match) between $O_j$ and human-annotated information sets. This ensures the prompt reproduces expert information extraction.
    • For the Reward Grader: $S_j$ is measured by negative Mean Squared Error (MSE) between $O_j$ and human-assigned graded scores (e.g., 0.0, 0.5, 1.0). This ensures the prompt's scoring logic mimics human judgment.
    • For the Policy Rollout: Once the grader is calibrated, human anchors are not needed; $S_j$ is simply the average reward of the generated rollouts.
  4. Selection and Iteration: The candidate prompt with the highest consistency score is selected as the new best prompt for the next iteration. This loop runs automatically until performance converges.
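A minimal sketch of this calibration loop (Algorithm 1), assuming user-supplied callables for candidate generation, pipeline execution on the calibration set, and scoring against the anchor set; the function signatures are hypothetical, not an API defined by the paper:

```python
# Minimal sketch of the Auto-Prompt loop (Algorithm 1). The three callables
# are user-supplied and their signatures are hypothetical: the paper does not
# fix an implementation API.
from typing import Any, Callable, List

def auto_prompt(seed_prompt: str,
                generate_candidates: Callable[[str], List[str]],  # step 1
                run_pipeline: Callable[[str], Any],                # step 2, on D_calib
                score_against_anchors: Callable[[Any], float],     # step 3, vs. D_anchor
                iterations: int = 5) -> str:
    best_prompt, best_score = seed_prompt, float("-inf")
    for _ in range(iterations):
        for candidate in generate_candidates(best_prompt):
            output = run_pipeline(candidate)
            score = score_against_anchors(output)
            if score > best_score:                                 # step 4
                best_prompt, best_score = candidate, score
    return best_prompt
```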

    This process ensures:

  1. Grounding the Objective: The Extractor Prompt aligns $I_t^*$ with human-verified goals.

  2. Grounding the Learning Signal: The Grader Prompt ensures reward scores mimic human judgment.

  3. Grounding the Exploration: The Policy Sampler Prompt generates a diverse and high-quality candidate action space.

4.2.5. Grounded Reward Formulation

With the calibrated reward model and the extracted ground truth $(I_t^*, s_t^*)$, any candidate generation $(a_t, s_t)$ produced by the policy can be scored. The reward function is designed to be grounded in the observable outcomes of the expert's dialogue path, composed of two heads: Micro-Reward and Macro-Reward.

Micro-Reward (Question Utility) $R_a$

This component measures how effectively the generated question $a_t$ targets the necessary information $I_t^*$ that the expert deemed critical to collect next. It uses a graded scoring system output by the calibrated grader $R_\phi$, providing a more nuanced learning signal than simple binary preference.

The Micro-Reward $R_a(a_t; I_t^*)$ is defined as: $$ R_a(a_t; I_t^*) = \begin{cases} 1.0 & \text{if } a_t \text{ precisely targets an element of } I_t^*, \\ 0.5 & \text{if } a_t \text{ is contextually relevant but not precise}, \\ 0.0 & \text{if } a_t \text{ is irrelevant to } I_t^*. \end{cases} $$ Where:

  • $a_t$: The question generated by the agent at turn $t$.

  • $I_t^*$: The target information set (micro-goal) for turn $t$.

  • 1.0: Full reward for a question precisely targeting an element in $I_t^*$.

  • 0.5: Partial reward for a question that is relevant but not precise.

  • 0.0: No reward for an irrelevant question.

    This graded structure helps mitigate the sparse reward problem by crediting partially correct attempts and incentivizing precision.

Macro-Reward (Assessment Accuracy) $R_s$

This component evaluates the correctness of the agent's decision to continue or stop ($s_t$) against the expert's implicit decision ($s_t^*$). It is a straightforward binary reward: $$ R_s(s_t; s_t^*) = \begin{cases} 1 & \text{if } s_t = s_t^*, \\ 0 & \text{otherwise.} \end{cases} $$ Where:

  • $s_t$: The agent's state assessment (CONTINUE or STOP) at turn $t$.

  • $s_t^*$: The target situation assessment (macro-goal) for turn $t$.

  • 1: Full reward if the agent's decision matches the expert's decision.

  • 0: No reward otherwise.

    This is a critical component for learning an expert-aligned stopping policy.

Reward Integration

A hierarchical fusion function is used to integrate these two rewards, prioritizing the macro-decision (when to stop) over the micro-action (what to ask). The reward for asking a good question is only granted if the strategic decision to continue is correct.

The integrated reward function $R(a_t, s_t)$ is defined as: $$ R(a_t, s_t) = R_s(s_t; s_t^*) \cdot (1 + \beta \cdot R_a(a_t; I_t^*)) + \Omega(a_t, s_t). $$ Where:

  • $R(a_t, s_t)$: The total reward for the agent's action $(a_t, s_t)$.

  • $R_s(s_t; s_t^*)$: The Macro-Reward for assessment accuracy. This term acts as a hierarchical gate: if $R_s = 0$, the entire multiplicative term becomes zero, nullifying the Micro-Reward.

  • $(1 + \beta \cdot R_a(a_t; I_t^*))$: This term incorporates the Micro-Reward for question utility.

    • The $+1$ term ensures that the Macro-Reward $R_s$ is still granted even if $R_a = 0$, preventing the Micro-Reward from entirely negating the Macro-Reward when the question is irrelevant but the stopping decision is correct.
    • $\beta > 0$: A tunable hyperparameter that balances the preference for generating good questions against the aggressiveness of the stopping decision. A higher $\beta$ places more emphasis on question utility.
  • $\Omega(a_t, s_t)$: A flexible reward or penalty term used to regulate other aspects of the output, such as format and length. For example, it can penalize or reward generating a specific number of questions or adhering to certain stylistic guidelines.

    The multiplicative formulation enforces a lexicographic preference for the macro-decision, meaning the agent only gets credit for good questions ($R_a$) if the strategic decision to continue ($R_s = 1$) is correct. This prevents the agent from being rewarded for asking good questions at the wrong time (e.g., after the goal has been met).

The specific definition of $\Omega(a_t, s_t)$ used in the experiments is as follows. For $s_t = \mathsf{CONTINUE}$: $$ \Omega(a_t, s_t = \mathsf{CONTINUE}) = \begin{cases} 1 & \text{if } R_s = 1 \text{ and } a_t \text{ contains exactly one question}, \\ 0.5 & \text{if } R_s = 1 \text{ and } a_t \text{ contains exactly two questions}, \\ 0 & \text{otherwise.} \end{cases} $$ For $s_t = \mathsf{STOP}$: $$ \Omega(a_t, s_t = \mathsf{STOP}) = \begin{cases} 1 & \text{if } R_s = 1 \text{ and } a_t = \langle\mathrm{STOP}\rangle, \\ 0 & \text{otherwise.} \end{cases} $$ This term regulates the output format (e.g., generating exactly one question to avoid a shotgun effect).
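The reward integration can be summarized in a few lines of code. The sketch below assumes the graded scores $R_a \in \{0.0, 0.5, 1.0\}$ and $R_s \in \{0, 1\}$ come from the calibrated grader, and that a simple output parser supplies the question count and STOP-token check (both hypothetical helpers):

```python
# Sketch of the integrated reward with the experimental Omega term. The graded
# scores r_a in {0.0, 0.5, 1.0} and r_s in {0, 1} are assumed to come from the
# calibrated grader; `num_questions` and `is_stop_token` come from a simple
# output parser (both hypothetical helpers).
def omega(r_s: int, assessment: str, num_questions: int, is_stop_token: bool) -> float:
    if assessment == "CONTINUE" and r_s == 1:
        if num_questions == 1:
            return 1.0
        if num_questions == 2:
            return 0.5
    if assessment == "STOP" and r_s == 1 and is_stop_token:
        return 1.0
    return 0.0

def integrated_reward(r_s: int, r_a: float, omega_val: float, beta: float = 1.0) -> float:
    """R(a_t, s_t) = R_s * (1 + beta * R_a) + Omega: the macro-decision gates
    the micro-reward, so a good question at the wrong time earns nothing."""
    return r_s * (1.0 + beta * r_a) + omega_val
```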

4.2.6. Policy Optimization Via Reinforcement Finetuning

With a structured dataset derived from real logs and a well-defined, grounded reward function, the policy is trained using an offline reinforcement learning approach. The training dataset consists of tuples $\langle C_{t-1}, a_t, s_t, R(a_t, s_t) \rangle$, where $(a_t, s_t)$ are sampled responses to the context $C_{t-1}$, and $R$ is their calculated reward.

This formulation allows the method to be applied to various offline RFT (Reinforcement Fine-Tuning) algorithms without ad-hoc modifications. The paper primarily studies Group Relative Policy Optimization (GRPO) (Shao et al., 2024).

GRPO is chosen because:

  • Unlike PPO (Proximal Policy Optimization) which requires a separate critic model to estimate advantages, GRPO estimates advantage directly and efficiently from a group of sampled responses.
  • Its group optimization nature utilizes the advantage of the Learn-to-Ask method in exploring possible question spaces.
  • Its group-wise advantage estimation naturally handles the graded, non-binary nature of the rewards, as the normalization process dynamically adjusts the learning signal based on the quality distribution of sampled responses. This helps navigate the nuances of expert-level conversation.
  • These features make GRPO more adaptive, stable, and less complex to implement, benefiting real-world deployment pipelines.
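For intuition, the sketch below shows the group-relative advantage normalization at the heart of GRPO-style training: rewards for a group of responses sampled from the same context are standardized within the group, so no separate critic is needed. This is a simplified illustration and omits the importance ratios, clipping, and KL regularization of the full objective:

```python
# Simplified illustration of group-relative advantage estimation: rewards for
# a group of responses sampled from the same context C_{t-1} are standardized
# within the group, removing the need for a separate critic. The full GRPO
# objective (importance ratios, clipping, KL regularization) is omitted.
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(group_rewards: List[float], eps: float = 1e-6) -> List[float]:
    mu, sigma = mean(group_rewards), pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# e.g., total rewards R(a_t, s_t) of 5 sampled candidates for one context
print(group_relative_advantages([3.2, 2.0, 0.0, 3.0, 1.0]))
```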

5. Experimental Setup

5.1. Datasets

The experiments are conducted on the RealMedConv dataset.

  • Source: It is built from anonymized logs of real-world interactions between licensed pharmacists and users seeking over-the-counter (OTC) medication advice.
  • Scale: It contains 2,000 dialogues, with 1,600 used for training and 400 for evaluation.
  • Characteristics: Each session has a clear goal: gather sufficient symptom information to make a safe and appropriate recommendation. Dialogues are typically 3-5 turns long, reflecting the efficient, goal-directed nature of expert interactions.
  • Domain: Medical dialogue, specifically for OTC medication recommendations.
  • Data Preparation: For each full dialogue trajectory $\tau = (u_0, a_1, u_1, \ldots, u_{T-1})$, it is split into a current context $C_{t-1} = (u_0, \dots, u_{t-1})$ and an observed future $C_t^c = (a_t, u_t, \ldots, u_{T-1})$ at each turn $t \in [0, T-1]$. The processing then differs for each experimental setting (an illustrative sketch of the three sample formats follows this list):
    • RL (for Learn-to-Ask): The hindsight pipeline (Section 3.4) generates the ground-truth objective tuple $(I_t^*, s_t^*)$ from $C_{t-1}$ and $C_t^c$. This creates a training sample: $\langle \mathrm{input} = C_{t-1}, \mathrm{reward\ reference} = (I_t^*, s_t^*) \rangle$. For ablations:
      • w/o $R_s$: Samples with ground truth STOP are omitted.
      • Ground truth CONTINUE with $I^* = \emptyset$: These samples are also omitted, as there is no valid $R_a$.
    • SFT (Behavioral Cloning): The immediate next assistant utterance $a_t$ is used as the expected response. This creates a training sample: $\langle \mathrm{input} = C_{t-1}, \mathrm{response} = a_t \rangle$.
    • DPO (Direct Preference Optimization): The immediate next assistant utterance $a_t$ is designated as 'chosen'. An LLM generates an irrelevant utterance (irrelevant to any content in the trajectory) as 'rejected'. This forms a training sample: $\langle \mathrm{input} = C_{t-1}, \mathrm{chosen} = a_t, \mathrm{rejected} = \text{some irrelevant utterance} \rangle$.
  • Effectiveness: The dataset is effective for validating the method's performance as it represents real-world expert interactions, allowing the framework to learn from actual implicit expert-driven strategies and providing a grounded basis for evaluation.
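To make the three data formats concrete, the sketch below builds one RL, SFT, and DPO sample from a single decision point of a trajectory; field names are illustrative, the RL reward reference is assumed to come from the hindsight pipeline, and the DPO 'rejected' utterance from an LLM prompted to be irrelevant:

```python
# Illustrative construction of the three training-sample formats from one
# decision point of a trajectory. Field names are hypothetical; the RL reward
# reference (I_t^*, s_t^*) is assumed to come from the hindsight pipeline.
from typing import Dict, List

def make_samples(context: str, expert_utterance: str, target_info: List[str],
                 target_assessment: str, rejected_utterance: str) -> Dict[str, dict]:
    return {
        "rl":  {"input": context,
                "reward_reference": {"I_star": target_info,
                                     "s_star": target_assessment}},
        "sft": {"input": context, "response": expert_utterance},
        "dpo": {"input": context, "chosen": expert_utterance,
                "rejected": rejected_utterance},
    }
```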

5.2. Evaluation Metrics

The paper defines a suite of proxy metrics grounded in its hindsight framework to measure fine-grained alignment with expert strategy. These serve as strong indicators of task success.

  1. Strategic Questioning Quality (WA & WA-GH):

    • Conceptual Definition: These metrics measure what to ask. WA quantifies the average utility of generated questions, while WA-GH specifically measures the rate of generating perfectly targeted questions. They assess whether the agent targets the same critical information as the expert ($I_t^*$). High scores proxy Information Coverage.
    • WA (What-to-Ask): The average $R_a^*$ score on samples where the ground truth is $s^* = \mathsf{CONTINUE}$ and the policy also correctly chose to continue the questioning.
    • WA-GH (Good Hit rate): The proportion of generated questions that achieve a perfect score (i.e., $R_a^* = 1$).
    • Mathematical Formula (for WA-GH): $$ \mathrm{WA\text{-}GH} = \frac{\#\ \text{of correct CONTINUE samples with } R_a^* = 1}{\#\ \text{of correct CONTINUE samples}} $$
    • Symbol Explanation:
      • $\mathrm{WA\text{-}GH}$: Good Hit rate.
      • Numerator: The count of samples where the ground truth was CONTINUE, the model correctly predicted CONTINUE, and the generated question achieved a Micro-Reward of 1.0.
      • Denominator: The count of samples where the ground truth was CONTINUE and the model correctly predicted CONTINUE.
  2. Dialogue Termination Accuracy (WS):

    • Conceptual Definition: This metric measures when to stop. It reports the accuracy of the model's decision to terminate the dialogue (STOP) specifically when the information-gathering goal has been met (i.e., $I_t^* = \emptyset$). A high WS score is a direct proxy for Dialogue Efficiency and the ability to avoid user fatigue.
    • WS (When-to-Stop): The average $R_s^*$ score on samples where the ground truth is $s^* = \mathsf{STOP}$.
    • Mathematical Formula: Not explicitly provided in the paper, but inferred as the proportion of STOP decisions that match the ground truth when $s^* = \mathsf{STOP}$: $$ \mathrm{WS} = \frac{\#\ \text{of samples where } s_t = s_t^* = \mathsf{STOP}}{\#\ \text{of samples where } s_t^* = \mathsf{STOP}} $$
    • Symbol Explanation:
      • $\mathrm{WS}$: When-to-Stop accuracy.
      • Numerator: Count of samples where the agent's decision $s_t$ correctly matches the ground truth $s_t^* = \mathsf{STOP}$.
      • Denominator: Total count of samples where the ground truth decision was $\mathsf{STOP}$.
  3. Dialogue Continuation Accuracy (WC):

    • Conceptual Definition: This metric measures the accuracy of the model's decision to continue the dialogue.
    • WC (When-to-Continue): The average $R_s^*$ score on samples whose ground truth is $s^* = \mathsf{CONTINUE}$.
    • Mathematical Formula: Not explicitly provided, but inferred as the proportion of CONTINUE decisions that match the ground truth when $s^* = \mathsf{CONTINUE}$: $$ \mathrm{WC} = \frac{\#\ \text{of samples where } s_t = s_t^* = \mathsf{CONTINUE}}{\#\ \text{of samples where } s_t^* = \mathsf{CONTINUE}} $$
    • Symbol Explanation:
      • $\mathrm{WC}$: When-to-Continue accuracy.
      • Numerator: Count of samples where the agent's decision $s_t$ correctly matches the ground truth $s_t^* = \mathsf{CONTINUE}$.
      • Denominator: Total count of samples where the ground truth decision was $\mathsf{CONTINUE}$.
    • Note: The paper cautions that a high WC may sometimes indicate a policy trivially choosing to continue and being weak in termination assessment.
  4. Assessment Accuracy (AA):

    • Conceptual Definition: This is the overall accuracy of the agent's CONTINUE/STOP decisions across all turns.
    • AA: The average assessment score $R_s^*$ across all samples.
    • Mathematical Formula: Not explicitly provided, but inferred as the overall accuracy of the agent's termination decisions: $$ \mathrm{AA} = \frac{\#\ \text{of samples where } s_t = s_t^*}{\#\ \text{of all samples}} $$
    • Symbol Explanation:
      • $\mathrm{AA}$: Assessment Accuracy.
      • Numerator: Count of samples where the agent's decision $s_t$ matches the ground truth $s_t^*$.
      • Denominator: Total count of all evaluated samples.
  5. Format Correctness (FC):

    • Conceptual Definition: This metric evaluates how well the generated responses adhere to specified output format guidelines (e.g., number of questions, sentence structure).
    • FC: The average format score $\Omega$ across all samples.
    • Mathematical Formula: Not explicitly provided, but represents the average of the $\Omega$ term defined in the reward integration over all samples: $$ \mathrm{FC} = \frac{1}{N} \sum_{i=1}^{N} \Omega(a_{t,i}, s_{t,i}) $$
    • Symbol Explanation:
      • $\mathrm{FC}$: Format Correctness.
      • $N$: Total number of samples.
      • $\Omega(a_{t,i}, s_{t,i})$: The format penalty/reward for the $i$-th sample's action and state assessment.
  6. Total Reward (TR):

    • Conceptual Definition: This is the overall integrated reward score, reflecting combined performance on question utility, termination accuracy, and format correctness.
    • TR: The average overall reward score integrated by Equation 5 (from Section 3.6) across all samples.
    • Mathematical Formula: $$ \mathrm{TR} = \frac{1}{N} \sum_{i=1}^{N} R(a_{t,i}, s_{t,i}) = \frac{1}{N} \sum_{i=1}^{N} \left[ R_s(s_{t,i}; s_{t,i}^*) \cdot (1 + \beta \cdot R_a(a_{t,i}; I_{t,i}^*)) + \Omega(a_{t,i}, s_{t,i}) \right] $$
    • Symbol Explanation:
      • $\mathrm{TR}$: Total Reward.
      • $N$: Total number of samples.
      • $R(a_{t,i}, s_{t,i})$: The integrated reward for the $i$-th sample as defined in the Reward Integration section.
      • $R_s(s_{t,i}; s_{t,i}^*)$: Macro-Reward for the $i$-th sample.
      • $\beta$: Tunable hyperparameter.
      • $R_a(a_{t,i}; I_{t,i}^*)$: Micro-Reward for the $i$-th sample.
      • $\Omega(a_{t,i}, s_{t,i})$: Format reward/penalty for the $i$-th sample.
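The proxy metrics can be computed directly from per-sample records. The sketch below assumes each record stores the ground-truth assessment $s^*$, the predicted assessment, and (for correctly continued samples) the graded micro-reward $R_a^*$; the record format is hypothetical:

```python
# Sketch of the proxy metrics computed from per-sample records. Each record is
# assumed to hold the ground truth s*, the predicted assessment, and (for
# correctly continued samples) the graded micro-reward R_a*; the record format
# is hypothetical.
from typing import Dict, List

def compute_metrics(records: List[Dict]) -> Dict[str, float]:
    cont = [r for r in records if r["s_star"] == "CONTINUE"]
    stop = [r for r in records if r["s_star"] == "STOP"]
    correct_cont = [r for r in cont if r["s_pred"] == "CONTINUE"]
    return {
        "WA":    sum(r.get("r_a", 0.0) for r in correct_cont) / max(len(correct_cont), 1),
        "WA-GH": sum(r.get("r_a", 0.0) == 1.0 for r in correct_cont) / max(len(correct_cont), 1),
        "WC":    len(correct_cont) / max(len(cont), 1),
        "WS":    sum(r["s_pred"] == "STOP" for r in stop) / max(len(stop), 1),
        "AA":    sum(r["s_pred"] == r["s_star"] for r in records) / max(len(records), 1),
    }
```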

5.3. Baselines

The proposed Learn-to-Ask method is compared against the following baselines:

  1. Direct Prompting (Base): This uses the base LLM (e.g., Qwen2.5-7B/32B-Instruct) guided by a carefully engineered zero-shot prompt (i.e., without any examples in the prompt itself, relying solely on instructions).
  2. Behavioral Cloning (SFT): This involves standard supervised fine-tuning, where the LLM is directly trained to imitate the expert's next utterance as the structured tuple $(a_t, s_t^*)$.
  3. Direct Preference Optimization (DPO): For DPO, preference pairs are formed. The expert's response is considered 'chosen', and a generation from a base model (which is irrelevant to any information in the context) is used as 'rejected' (Rafailov et al., 2023). This baseline tests if a simple preference for expert actions is sufficient to guide the learning process.

5.4. Ablation Studies / Other RL Algorithms

To validate its design choices and explore extensibility, the paper conducts several ablation studies and evaluates Learn-to-Ask with other RL algorithms:

  • Ablation (w/o $R_a$): Removes the Micro-Reward (question utility) component. The model is trained on the full dataset, but the reward function ignores question quality, simplifying to $R(a_t, s_t) = R_s(s_t; s_t^*) + \Omega(a_t, s_t)$.
  • Ablation (w/o $R_s$): Removes the Macro-Reward (assessment accuracy) component. In this setting, the model is only trained on dialogue turns where the ground-truth action was CONTINUE. The system prompt is modified to only instruct question generation, removing any mention of the stopping condition. The reward simplifies to $R(a_t, s_t) = \beta \cdot R_a(a_t; I_t^*) + \Omega(a_t, s_t)$.
  • Ablation (Sum): Replaces the hierarchical multiplicative fusion of rewards with a simple additive summation. The reward function becomes $R(a_t, s_t) = R_s(s_t; s_t^*) + \beta \cdot R_a(a_t; I_t^*) + \Omega(a_t, s_t)$.
  • Other RL Algorithms: Learn-to-Ask is also evaluated with GSPO (Group Sequence Policy Optimization) (Zheng et al., 2025) and CISPO (Chen et al., 2025a) to compare its performance across different Reinforcement Fine-Tuning (RFT) algorithms.
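To make the reward variants above easier to compare, the following hedged sketch writes each one as a function of the per-sample quantities $R_s$, $R_a$, and $\Omega$; the $\beta$ default and the function names are illustrative, not the paper's code.

```python
def reward_full(r_s: float, r_a: float, omega: float, beta: float = 0.5) -> float:
    """Full Learn-to-Ask reward: hierarchical multiplicative fusion."""
    return r_s * (1.0 + beta * r_a) + omega

def reward_wo_ra(r_s: float, omega: float) -> float:
    """Ablation w/o R_a: the question-utility term is dropped."""
    return r_s + omega

def reward_wo_rs(r_a: float, omega: float, beta: float = 0.5) -> float:
    """Ablation w/o R_s: only CONTINUE turns are trained; assessment accuracy is ignored."""
    return beta * r_a + omega

def reward_sum(r_s: float, r_a: float, omega: float, beta: float = 0.5) -> float:
    """Ablation 'Sum': simple additive fusion instead of hierarchical gating."""
    return r_s + beta * r_a + omega
```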

5.5. Implementation Details

  • LLM Models: Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct models (Yang et al., 2024) are used as the base LLMs.
  • Hardware: Experiments are conducted on a cluster of up to 32 NVIDIA H20 GPUs.
  • Framework: The Trinity-RFT framework (Pan et al., 2025), a customizable RFT training library, is used for implementing the entire workflow (policy sampling, reward grading, optimization).
  • Hyperparameters: Primary hyperparameters are kept consistent across all methods and models for fair comparison (collected in the configuration sketch after this list):
    • Learning rate: $5 \times 10^{-7}$
    • Batch size: 64
    • Number of training epochs: 4
    • For group RL algorithms (GRPO, CISPO, GSPO), 5 repeats are used for each sample.
  • Prompt Calibration: The policy-sampler prompt, info-extractor prompt, and reward-grader prompt are all calibrated using the Auto-Prompt pipeline (Section 3.5) before the main training runs. The Qwen2.5-32B-Instruct model is specifically used as the backbone for the info-extractor and reward grader.
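For reference, the shared training settings listed above could be gathered into a small configuration object like the following sketch; the key names and structure are assumptions, and only values explicitly reported in this summary are filled in.

```python
# Hedged sketch of the shared training configuration (key names are illustrative).
TRAINING_CONFIG = {
    "base_models": ["Qwen2.5-7B-Instruct", "Qwen2.5-32B-Instruct"],
    "framework": "Trinity-RFT",
    "hardware": "up to 32x NVIDIA H20 GPUs",
    "learning_rate": 5e-7,
    "batch_size": 64,
    "num_epochs": 4,
    "group_rollouts_per_sample": 5,              # for GRPO / CISPO / GSPO
    "grader_backbone": "Qwen2.5-32B-Instruct",   # also used for the info-extractor
}

print(TRAINING_CONFIG)
```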

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that Learn-to-Ask is highly effective in teaching LLMs to be proactive, goal-oriented agents, excelling at both what to ask and when to stop.

The following are the results from Table 1 of the original paper:

Model: Qwen2.5-7B-Instruct

| Method | WA | WA-GH | WC | WS | AA | FC | TR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Base | 0.50 | 0.13 | 0.98 | 0.16 | 0.75 | 0.63 | 2.17 |
| SFT | 0.40 | 0.08 | 0.94 | 0.74 | 0.89 | 0.57 | 2.41 |
| DPO | 0.42 | 0.05 | 0.94 | 0.36 | 0.78 | 0.19 | 1.78 |
| Ours (GRPO) | 0.67 | 0.41 | 0.94 | 0.93 | 0.94 | 0.92 | 3.27 |
| w/o $R_a$ (ablation) | 0.63 | 0.34 | 1.00 | 0.02 | 0.73 | 0.70 | 2.35 |
| w/o $R_s$ (ablation) | 0.52 | 0.19 | 0.96 | 0.87 | 0.93 | 0.92 | 3.06 |
| Sum (ablation) | 0.64 | 0.38 | 0.92 | 0.95 | 0.93 | 0.91 | 3.20 |
| GSPO (other RL) | 0.61 | 0.31 | 0.93 | 0.94 | 0.93 | 0.91 | 3.16 |
| CISPO (other RL) | 0.71 | 0.47 | 0.95 | 0.94 | 0.95 | 0.93 | 3.36 |

Model: Qwen2.5-32B-Instruct

| Method | WA | WA-GH | WC | WS | AA | FC | TR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Base | 0.50 | 0.13 | 0.92 | 0.52 | 0.81 | 0.67 | 2.43 |
| SFT | 0.43 | 0.11 | 0.94 | 0.87 | 0.93 | 0.69 | 2.70 |
| DPO | 0.23 | 0.04 | 0.52 | 0.87 | 0.62 | 0.18 | 1.61 |
| Ours (GRPO) | 0.64 | 0.37 | 0.93 | 0.88 | 0.92 | 0.88 | 3.15 |
| w/o $R_a$ (ablation) | 0.57 | 0.26 | 0.97 | 0.33 | 0.79 | 0.74 | 2.52 |
| w/o $R_s$ (ablation) | 0.54 | 0.19 | 0.95 | 0.91 | 0.94 | 0.92 | 3.12 |
| Sum (ablation) | 0.65 | 0.37 | 0.94 | 0.88 | 0.92 | 0.90 | 3.19 |
| GSPO (other RL) | 0.62 | 0.32 | 0.95 | 0.86 | 0.93 | 0.89 | 3.12 |
| CISPO (other RL) | 0.70 | 0.49 | 0.94 | 0.89 | 0.93 | 0.92 | 3.29 |

6.1.1. Superiority of Learn-to-Ask

The primary finding from Table 1 is that Learn-to-Ask (labeled as Ours (GRPO)) significantly outperforms all baselines across both the Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct models, especially in strategic questioning quality and termination accuracy.

  • Strategic Questioning Quality (WA-GH): For the 7B model, WA-GH (Good Hit rate) skyrockets from 0.13 (Base) to 0.41 (+215% relative increase), indicating a massive improvement in generating perfectly targeted questions. Similarly, for the 32B model, WA-GH improves from 0.13 to 0.37 (+185% relative increase). This validates that Learn-to-Ask successfully teaches the model what to ask by targeting the critical information an expert would seek.
  • Dialogue Termination Accuracy (WS): For the 7B model, WS (When-to-Stop) accuracy jumps from 0.16 (Base) to 0.93. For the 32B model, it improves from 0.52 to 0.88. This demonstrates that the framework effectively teaches the model when to stop asking questions, leading to higher dialogue efficiency and avoiding user fatigue.
  • Total Reward (TR): Learn-to-Ask achieves the highest TR scores across both model sizes (3.27 for 7B and 3.15 for 32B), reflecting its superior overall performance in balancing question utility, termination accuracy, and format correctness.
  • Format Correctness (FC): Learn-to-Ask also achieves high FC (0.92 for 7B, 0.88 for 32B), indicating that the $\Omega$ term in the reward function effectively regulates the output format.
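For reference, the relative-increase figures quoted above follow directly from the Table 1 values:

$\frac{0.41 - 0.13}{0.13} \approx 2.15 \Rightarrow +215\%$ (7B), and $\frac{0.37 - 0.13}{0.13} \approx 1.85 \Rightarrow +185\%$ (32B).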

6.1.2. Limits of Baselines

The performance of the baselines underscores the difficulty of the task:

  • SFT (Behavioral Cloning): While SFT improves WS (e.g., from 0.16 to 0.74 for 7B), it sacrifices question quality (WA drops, WA-GH remains low), suggesting it myopically memorizes stopping behavior without truly understanding strategic questioning. Its Total Reward is better than Base but significantly lower than Learn-to-Ask.
  • DPO (Direct Preference Optimization): DPO performs poorly, especially on the 32B model, where its WA-GH is 0.04, WC is 0.52, and FC is 0.18, resulting in the lowest TR of 1.61. This indicates that its single binary preference signal is insufficient to guide the learning of dual objectives (what to ask and when to stop) in a complex sequential dialogue setting.

6.1.3. Nuances of Scale

  • Interestingly, in these offline experiments, Learn-to-Ask with the 7B model (TR 3.27) performs slightly better than with the 32B model (TR 3.15). The paper attributes this to the limited scale of the RealMedConv dataset, which may not be sufficient to fully leverage the larger model's capacity. This trend reverses in large-scale production environments, where the 32B model significantly outperforms the 7B model (see the real-world deployment discussion in Section 6.4), highlighting that larger models require ample, diverse data to unlock their full potential.

6.2. Ablation Studies / Parameter Analysis

The ablation studies validate the design choices of the Learn-to-Ask framework:

  • Necessity of Micro-Reward ($R_a$):
    • The w/o $R_a$ ablation causes WA-GH to drop (0.41 to 0.34 for 7B, 0.37 to 0.26 for 32B) and WS to collapse (0.93 to 0.02 for 7B, 0.88 to 0.33 for 32B). This confirms that without the question-utility reward, the model fails both to generate targeted questions and to learn when to stop.
  • Necessity of Macro-Reward ($R_s$):
    • The w/o $R_s$ ablation leads to a drop in WA-GH (0.41 to 0.19 for 7B, 0.37 to 0.19 for 32B) and a minor drop in WS for the 7B model (0.93 to 0.87). This shows that both reward components are crucial for comprehensive policy learning.
  • Multiplicative vs. Additive Reward Fusion (Sum):
    • The Sum ablation (additive reward fusion) yields a slightly lower TR than the multiplicative fusion of Ours (GRPO) for the 7B model (3.20 vs 3.27) and a marginally higher TR for the 32B model (3.19 vs 3.15). The two fusion schemes are thus close in these offline experiments, but the paper reports that the hierarchical gating of the multiplicative formulation provides an edge that is magnified in complex production environments.

6.2.1. Learn-to-Ask with other RL algorithms

The following figure (Figure 4 from the original paper) shows the reward growth curves of RL algorithms in training 7B (left) and 32B (right) models:

Figure 4: Reward growth curves of the RL algorithms (GRPO, CISPO, GSPO) when training the 7B (left) and 32B (right) models; the x-axis is the training step, the y-axis is the reward, and CISPO rises faster than the other algorithms.

As seen in Figure 4, CISPO (Chen et al., 2025a) consistently achieves higher reward growth rates during training compared to GRPO and GSPO (Zheng et al., 2025). The results in Table 1 corroborate this: CISPO (TR of 3.36 for 7B and 3.29 for 32B) outperforms GRPO (TR of 3.27 for 7B and 3.15 for 32B) in terms of Total Reward, especially showing stronger WA and WA-GH scores. This suggests that while GRPO is effective, there's potential for further performance improvement by integrating Learn-to-Ask with more efficient RFT algorithms.

6.2.2. Evaluation on General Capabilities Benchmarks

The following figure (Figure 5 from the original paper) shows radar chart comparisons of 7B and 32B parameter models on general capability benchmarks:

Figure 5: Evaluation results of the 7B- and 32B-parameter models on general-capability benchmarks, shown as radar charts covering domain ability, instruction following, reasoning, and risk/safety; the red regions mark the proposed models.

Figure 5 illustrates that the specialized training of Learn-to-Ask generally preserves the models' core competencies. Performance on domain-specific tasks (MedAgents, MedJourney) and instruction-following benchmarks (IFEval, StructFlow) remains stable or even slightly improves. Minor trade-offs are observed in safety-related metrics (e.g., a decrease in hallucination detection on MedHallu for the 7B model), which highlights the need for careful monitoring in real-world applications. Overall, the framework successfully imbues the model with proactive dialogue skills without significantly degrading its foundational capabilities.

6.3. Qualitative Analysis

The qualitative analysis (Figure 2) provides a clear illustration of Learn-to-Ask's superiority over SFT.

The following figure (Figure 2 from the original paper) shows a case study comparing dialogues generated by SFT and Learn-to-Ask models:

Figure 2: A case study comparing dialogues generated by the SFT and Learn-to-Ask models in a medical consultation, highlighting the relevant, targeted question asked by the Learn-to-Ask model versus the irrelevant question asked by the SFT model.

As seen in Figure 2:

  • The SFT model asks an irrelevant question ("What kind of pain do you have?") in a context where the user has already provided information about back pain, suggesting a lack of contextual understanding and strategic adaptation. This brittle mimicry implies SFT struggles with generalization to contexts not perfectly aligned with its training data.
  • In contrast, the Learn-to-Ask model demonstrates strategic adaptation: it correctly identifies the information already provided and proceeds with an insightful follow-up question ("Is there any numbness or tingling in your legs or feet?") that logically furthers the information-gathering goal. This highlights a shift from rote memorization to flexible, goal-oriented reasoning, reflecting its ability to learn a true sequential policy.

6.4. Real-World Deployment and Impact

The ultimate validation of Learn-to-Ask is its successful deployment in a live, large-scale online AI service.

  • Deployment Context: The model was deployed in a commercial Medication AI Assistant service, which serves thousands of users daily. Its goal is to proactively engage users to obtain a complete description of symptoms and recommend appropriate over-the-counter (OTC) medications.

  • Model Scale in Production: In the production environment, which involved a dataset 100x larger and 10x more medical conditions than RealMedConv, the 32B model significantly outperformed the 7B model. This confirms that for complex, larger-scale scenarios with ample data, the saturation trend observed with smaller datasets does not apply, and the full capacity of larger models becomes essential. The 32B model was therefore selected for production.

  • Role and Value of Auto-Prompt: While Auto-Prompt yielded marginal gains in offline academic tasks (e.g., TR on 32B increased slightly from 3.145 to 3.166 for policy sampler calibration), its value in production became indispensable. Its true strength lies in enabling maintainability and continuous improvement for the extractor and grader prompts. By periodically adding human-reviewed "margin examples" to anchor sets, the reward model can be recalibrated, and the policy retrained in a data-driven, semi-automated loop. This allows the agent to adapt to evolving user behaviors and new business needs (e.g., new safety guidelines) without costly manual prompt engineering. Auto-Prompt transforms a static training process into a dynamic, self-improving system.

    The following are the results from Table 2 of the original paper, showing the comparison of models trained with the original prompt and optimized prompt:

    Results on 7B Models

    | Method | WA | WA-GH | WC | WS | AA | FC | TR |
    | --- | --- | --- | --- | --- | --- | --- | --- |
    | Base | 0.501 | 0.132 | 0.975 | 0.155 | 0.751 | 0.629 | 2.174 |
    | Original | 0.665 | 0.413 | 0.944 | 0.926 | 0.939 | 0.915 | 3.272 |
    | Optimized | 0.641 | 0.399 | 0.949 | 0.910 | 0.938 | 0.894 | 3.214 |

    Results on 32B Models

    | Method | WA | WA-GH | WC | WS | AA | FC | TR |
    | --- | --- | --- | --- | --- | --- | --- | --- |
    | Base | 0.503 | 0.134 | 0.915 | 0.521 | 0.807 | 0.670 | 2.431 |
    | Original | 0.640 | 0.365 | 0.933 | 0.877 | 0.918 | 0.880 | 3.145 |
    | Optimized | 0.634 | 0.366 | 0.925 | 0.916 | 0.923 | 0.889 | 3.166 |

As shown in Table 2, the gains from Auto-Prompt are indeed marginal for the RealMedConv dataset. For the 32B model, the Total Reward increased from 3.145 (Original prompt) to 3.166 (Optimized prompt).
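The semi-automated maintenance loop described above for Auto-Prompt might look roughly like the following sketch; every function name here is a hypothetical stand-in for the corresponding pipeline component, not the actual implementation.

```python
def recalibration_loop(anchor_set, policy, grader_prompt, extractor_prompt,
                       collect_margin_examples, human_review, auto_prompt_search, retrain_policy):
    """Hypothetical periodic loop: refresh anchors -> recalibrate prompts -> retrain policy."""
    # 1. Collect borderline ("margin") cases from production logs.
    candidates = collect_margin_examples()
    # 2. A human reviews only these few cases and adds them to the anchor set.
    anchor_set.extend(human_review(candidates))
    # 3. Auto-Prompt re-searches the grader and extractor prompts against the updated anchors.
    grader_prompt = auto_prompt_search(grader_prompt, anchor_set)
    extractor_prompt = auto_prompt_search(extractor_prompt, anchor_set)
    # 4. The policy is retrained with the recalibrated reward model.
    policy = retrain_policy(policy, grader_prompt, extractor_prompt)
    return policy, grader_prompt, extractor_prompt
```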

  • Online Performance and Validation of Proxy Metrics: A four-week live A/B test was conducted, routing a significant portion of user traffic to the Learn-to-Ask-trained model.
    • The model achieved a 93% information completeness rate and an 88% good-question rate, which are the online analogs to the offline WS and WA metrics. These strong internal scores validate that the offline proxy metrics are effective for predicting end-to-end task success.
    • Crucially, the model produced a 1.87x lift in dialog-to-purchase conversion rate compared to historical data from a parallel human-based service. This provides powerful empirical evidence of the framework's ability to deliver tangible business impact and achieve super-human performance on key business metrics.

7. Conclusion & Reflections

7.1. Conclusion Summary

This work introduced Learn-to-Ask, a novel, general, and simulator-free framework designed to bridge the "reality gap" in training proactive LLMs. By reframing the intractable long-horizon offline Reinforcement Learning problem into a series of supervised learning tasks, Learn-to-Ask effectively learns a complete dialogue policy—encompassing both what to ask and when to stop—directly from offline expert conversation logs. The key insight lies in leveraging the observed future of each real trajectory to infer a dense and grounded reward signal, thereby bypassing the need for brittle user simulators.

Empirically, Learn-to-Ask demonstrated significant outperformance against strong baselines like SFT and DPO on a real-world medical dialogue dataset, showcasing its superior ability to learn nuanced, strategic questioning. The framework's true value was further validated through its successful deployment in a large-scale, commercial medical AI service, where the model achieved super-human performance on key business metrics (e.g., 1.87x lift in dialog-to-purchase conversion rate) and demonstrated task success rates comparable to or even superior to human experts. This deployment highlights that the offline proxy metrics directly translate into real-world impact.

7.2. Limitations & Future Work

The paper discusses several theoretical implications and future research directions:

Theoretical Implications:

  1. Value-Function-Free Offline RL: Learn-to-Ask can be seen as a stable, value-function-free offline RL algorithm. This raises questions about formally characterizing the sub-optimality gap of this hindsight-based policy compared to the true offline optimum, potentially depending on the quality and coverage of the expert data.
  2. Causal Intervention Heuristic: The framework serves as a heuristic for causal reasoning, learning an intervention policy. Future work could integrate do-calculus or counterfactual reasoning models to evolve from imitating optimal outcomes to predicting outcomes of novel, unseen interventions.
  3. Data-Driven Proxy for Information Gain: Learn-to-Ask provides a pragmatic, data-driven proxy for information gain. This suggests research into dynamically adjusting the reward function itself to explore lines of inquiry not present in expert data, but deemed valuable by a theoretical information model.
  4. Graph-Theoretic Model: While Learn-to-Ask learns a subgraph coverage policy robust to path variations, the information graph is currently implicit. Future work could involve explicitly learning this graph structure from data to serve as a powerful prior for policy learning, enabling the identification of "holes" (un-asked but valuable questions) for structured exploration.

Future Research Directions (from Imitation to Superhuman Intervention):

  1. Reward Shaping for Specific Goals: Instead of merely rewarding coverage of the expert's information set $I_t^{*}$, future work can explore reward functions that enforce desired superhuman behaviors. This could involve penalizing dialogues that conclude without asking critical safety questions (e.g., about allergies), even if human experts sometimes omit them, thereby encoding organizational knowledge or safety protocols (a hedged sketch of such a shaping term follows this list).
  2. Exploration in Semantic Space: To enable exploration without a live simulator, a generator model could propose alternative, plausible information goals $I_t^{\prime}$ beyond the observed $I_t^{*}$. An advanced reward model, potentially trained on broader medical knowledge, could then score these hypothetical goals, allowing the agent to learn inquiries not represented in the limited offline dataset.
  3. Hybrid Human-AI Policy Learning: The ultimate goal is human-AI augmentation. Future systems could use Learn-to-Ask in an online loop. If a human expert overrules an AI-proposed question and asks something different, this action and its future outcome could be immediately incorporated to refine the AI's policy, creating a symbiotic system where the AI continuously learns from and adapts to human strategies.
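As a concrete way to picture the reward-shaping idea in item 1, the hedged sketch below adds a penalty whenever the agent decides to stop before a set of hypothetical mandatory safety topics has been covered; the topic list, penalty value, and coverage representation are all illustrative assumptions.

```python
MANDATORY_SAFETY_TOPICS = {"allergies", "current_medications"}  # illustrative topic set

def shaped_reward(base_reward: float, action: str, covered_topics: set, penalty: float = 1.0) -> float:
    """Penalize a STOP decision taken before all mandatory safety topics are covered,
    even if the human expert in the log omitted them."""
    if action == "STOP" and not MANDATORY_SAFETY_TOPICS <= covered_topics:
        return base_reward - penalty
    return base_reward

# Illustrative usage: stopping without asking about allergies is penalized.
print(shaped_reward(1.2, "STOP", {"current_medications"}))   # 0.2
print(shaped_reward(1.2, "STOP", MANDATORY_SAFETY_TOPICS))   # 1.2
```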

7.3. Personal Insights & Critique

This paper presents a highly innovative and practical approach to a critical problem in LLM deployment: making models proactive and goal-oriented in high-stakes domains. The hindsight-driven reward inference is a particularly clever idea, effectively circumventing the reality gap of user simulators and the myopia of single-turn optimization. Its successful deployment in a commercial setting, achieving super-human performance, provides compelling evidence of its real-world viability and economic value.

Inspirations and Applications: The core methodology of Learn-to-Ask has broad applicability beyond medical dialogues. Any domain where offline expert-led conversations exist and a sequential information-gathering process is crucial could benefit. Examples include legal consultation, financial advisory, customer support, and even educational tutoring systems. The ability to learn stopping conditions is especially valuable for improving efficiency and user experience across these fields. The Auto-Prompt calibration is also a significant contribution, transforming prompt engineering from an art to a more systematic and maintainable process, critical for scaling and adapting LLM applications.

Potential Issues and Areas for Improvement:

  1. Inheritance of Expert Biases: While the paper acknowledges that the current model inherits human expert biases (e.g., preference for conversational brevity), this is a significant limitation in high-stakes domains. If expert data contains suboptimal strategies or implicit biases (e.g., overlooking certain patient demographics), the model will replicate these. The proposed reward shaping for superhuman behaviors is a good step but requires careful design and oversight to avoid introducing new biases or unintended consequences.

  2. Robustness of Info-Extractor and Reward Grader: The entire framework relies heavily on the LLM-based info-extractor ($\pi^{*}$) and reward grader. If these foundational LLMs are imperfect (e.g., misinterpret context, hallucinate information, or carry inherent biases), the inferred ground truth and reward signals will be noisy or skewed, directly impacting the quality of the learned policy. While Auto-Prompt helps with calibration, continuous monitoring and robust evaluation of these components are crucial, especially as user behaviors evolve.

  3. Generalization to Truly Novel Situations: The model learns from observed expert trajectories. While the information graph intuition suggests robustness to path variation, it might still struggle with situations entirely out of distribution from the expert data, where novel inquiries are needed. The proposed exploration in semantic space is an exciting direction to address this, but it adds another layer of complexity in ensuring meaningful and safe exploration.

  4. Interpretability: In high-stakes domains like healthcare, understanding why an AI agent asks certain questions or decides to stop is crucial for trust and accountability. While the framework provides insights into target information sets, the underlying LLM's reasoning for generating a specific question or its confidence in a stopping decision might still be a black box.

  5. Cost of Auto-Prompt: While Auto-Prompt saves manual prompt engineering effort, the iterative search process involving LLM generations and evaluations can be computationally intensive, especially for large LLMs and diverse calibration sets. The trade-off between calibration cost and policy performance needs to be carefully managed in practice.

Overall, Learn-to-Ask provides a powerful and practical blueprint for proactive LLMs, skillfully navigating several key challenges. Its strength lies in its grounded, data-driven approach, making LLMs more capable and impactful in real-world, goal-oriented applications.
