Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs
TL;DR Summary
Learn-to-Ask learns proactive LLMs from offline expert logs without simulators by leveraging observed future data to infer turn-by-turn rewards, decomposing long-horizon tasks for effective training and deployment in real-world high-stakes domains.
Abstract
Large Language Models (LLMs) excel as passive responders, but teaching them to be proactive, goal-oriented partners, a critical capability in high-stakes domains, remains a major challenge. Current paradigms either myopically optimize single-turn attributes or rely on brittle, high-cost user simulators, creating a persistent "reality gap". To bridge this gap, we introduce Learn-to-Ask, a general, simulator-free framework for learning and deploying proactive dialogue agents directly from offline expert data, bypassing the need to model complex user dynamics. Our key insight is to reframe the offline policy learning problem by leveraging the observed future of each expert trajectory. This allows us to infer a dense, turn-by-turn reward signal grounded in the expert's revealed strategy, decomposing the intractable long-horizon problem into a series of supervised learning tasks, and training a policy to output a structured (action, state_assessment) tuple, governing both what to ask and, crucially, when to stop. To ensure reward fidelity, our Automated Grader Calibration pipeline systematically purges noise from the LLM-based reward model with minimal human supervision. Empirically, we demonstrate the efficacy of Learn-to-Ask in a real-world medical dataset, using LLMs of varying sizes up to 32B. Our approach culminates in the successful deployment of LLMs into a live, large-scale online AI service. In rigorous in-house evaluations, our model was launched and achieved performance even superior to human experts, proving our framework's ability to translate offline data into tangible, real-world impact. We hope this work provides a practical and economically viable blueprint for transforming passive LLMs into proactive, goal-oriented LLM applications.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs
1.2. Authors
Fei Wei*, Daoyuan Chen*, Ce Wang†, Yilun Huang†, Yushuo Chen, Xuchen Pan, Yaliang Li‡, Bolin Ding‡. All authors are affiliated with Alibaba Group.
1.3. Journal/Conference
The paper is published on arXiv, a preprint server, indicating it has not yet undergone formal peer review for a specific conference or journal. The listed timestamp, "Published at (UTC): 2025-10-29T12:08:07.000Z", corresponds to its arXiv release. In the context of academic publishing, arXiv allows early dissemination of research findings.
1.4. Publication Year
2025
1.5. Abstract
Large Language Models (LLMs) currently excel at passive responses, but a significant challenge remains in training them to be proactive, goal-oriented partners, especially in critical, high-stakes fields. Existing methods either focus narrowly on single-turn attributes or rely on expensive and often inaccurate user simulators, leading to a persistent "reality gap." To address this, the paper introduces Learn-to-Ask, a general, simulator-free framework that learns and deploys proactive dialogue agents directly from existing offline expert data. This approach avoids modeling complex user behaviors. The core innovation is reframing the offline policy learning problem by utilizing the observed future of each expert conversation trajectory. This allows for the inference of a dense, turn-by-turn reward signal that is grounded in the expert's actual strategy. This decomposes the otherwise intractable long-horizon problem into a series of supervised learning tasks. The policy is trained to output a structured (action, state_assessment) tuple, which dictates both what to ask and, critically, when to stop. To maintain the accuracy of this reward signal, an Automated Grader Calibration pipeline is employed, systematically removing noise from the LLM-based reward model with minimal human oversight. The effectiveness of Learn-to-Ask is demonstrated empirically on a real-world medical dataset, utilizing LLMs up to 32 billion parameters. The method culminates in the successful deployment of the LLM into a live, large-scale online AI service, where rigorous in-house evaluations showed performance even surpassing human experts. The authors posit this framework offers a practical and economically viable blueprint for transforming passive LLMs into proactive, goal-oriented applications.
1.6. Original Source Link
https://arxiv.org/abs/2510.25441 PDF Link: https://arxiv.org/pdf/2510.25441v1.pdf Publication Status: Preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper addresses is the inability of current Large Language Models (LLMs) to act as proactive, goal-oriented partners, particularly in high-stakes domains such as healthcare, law, and finance. While LLMs are proficient as passive responders, there's a critical need for them to take initiative, gather information, and drive conversations towards a specific objective.
This problem is important because numerous daily goal-oriented conversations between human experts and clients generate a vast amount of valuable dialogue data (a "goldmine") that LLMs currently fail to harness effectively. The default passive behavior of LLMs severely limits their potential for truly collaborative and intelligent applications.
The specific challenges or gaps in prior research, which contribute to this reality gap, are two-fold:
- Attribute-based alignment: This approach focuses on optimizing single-turn qualities (e.g., clarity, relevance) using preference data. It is myopic, as it fails to learn a coherent, sequential policy that considers conversational flow and, crucially, lacks a principled mechanism to decide when to stop.
- Simulation-based optimization: This method uses user simulators to train agents for long-horizon rewards. However, creating high-fidelity simulators for complex, open-ended, expert-level domains is notoriously difficult, computationally expensive, and often results in policies that fail to generalize to real-world interactions due to the inherent combinatorial explosion of states.

The paper's entry point, or innovative idea, is to bypass these limitations by asking a fundamental question: can an effective, long-horizon questioning policy be learned directly from offline expert data, thereby eliminating the need for a simulator and bridging the reality gap? The Learn-to-Ask framework is proposed as an affirmative answer to this question.
2.2. Main Contributions / Findings
The paper makes three primary contributions:
- A Simulator-Free Policy Learning Framework (Learn-to-Ask): The paper introduces Learn-to-Ask, a novel framework that enables an LLM to learn a complete, sequential questioning policy, including a critical stopping condition, directly from offline expert logs. This provides a grounded, data-driven, and economically viable alternative to the brittle and costly user simulators traditionally used in Reinforcement Learning (RL) for dialogue agents.
- Hindsight-based Reward Inference with Automated Calibration: A key innovation is a method to infer dense, turn-by-turn reward signals by leveraging the observed future of expert trajectories. This reframes the intractable Reinforcement Learning problem into a series of supervised learning tasks. To ensure the fidelity and accuracy of these rewards, the framework includes an Automated Grader Calibration pipeline. This pipeline systematically purges noise from the LLM-based reward model with minimal human supervision, mitigating oracle noise and ensuring the learning signal is truly aligned with expert intent.
- Demonstrated Real-World Impact and Superhuman Performance: The framework's efficacy is validated not just through offline experiments on a real-world medical dataset (RealMedConv) but, more importantly, through its successful deployment in a live, large-scale commercial AI service. In rigorous in-house evaluations, the Learn-to-Ask-trained model achieved task-success rates exceeding those of human experts and demonstrated significant business impact (e.g., a 1.87x lift in dialog-to-purchase conversion rate). This practical deployment provides strong evidence that the offline learning paradigm directly translates to superior real-world performance, offering a practical blueprint for developing proactive LLM applications.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the paper, a beginner should be familiar with the following core concepts:
- Large Language Models (LLMs): These are advanced artificial intelligence models, typically based on the transformer architecture, trained on vast amounts of text data to understand, generate, and process human language. They excel at tasks like text generation, translation, summarization, and question answering. In this paper, LLMs are the base agents being taught to be proactive.
- Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent doesn't receive explicit instructions but learns through trial and error. Key components of RL are:
  - Agent: The learner or decision-maker.
  - Environment: The world the agent interacts with.
  - State: The current situation or context of the environment.
  - Action: A move made by the agent in a given state.
  - Reward: A feedback signal (positive or negative) from the environment, indicating the desirability of an action.
  - Policy (π): A strategy that maps states to actions, determining the agent's behavior.
  - Value Function: A prediction of the future reward starting from a given state or state-action pair.
- Offline Reinforcement Learning (Offline RL): A subfield of RL where the agent learns a policy solely from a fixed dataset of previously collected interactions (trajectories) without any further interaction with the environment. This is crucial for real-world applications where online interaction is expensive, dangerous, or impractical (e.g., healthcare, finance). A major challenge in offline RL is dealing with out-of-distribution (OOD) actions, where the learned policy might try to exploit actions not seen in the training data, leading to extrapolation error.
- Supervised Fine-Tuning (SFT) / Behavioral Cloning: A common technique to adapt a pre-trained LLM to a specific task. It involves training the LLM on a dataset of input-output pairs (e.g., conversation history paired with the expert's next utterance). The model learns to directly mimic the behavior shown in the training data. In the context of dialogue, it teaches the model to produce the next utterance given the current conversation history. It is myopic because it optimizes for single-step imitation rather than long-term goals.
- Direct Preference Optimization (DPO): A technique for aligning LLMs with human preferences without explicitly training a separate reward model. Instead, it directly optimizes the LLM's policy to prefer chosen responses over rejected responses based on pairwise human feedback. The loss function encourages the model to assign higher probabilities to preferred responses and lower probabilities to dispreferred ones. While effective for local preferences (e.g., clarity of a single turn), it struggles with complex sequential decision-making and long-horizon goals.
- Markov Decision Process (MDP): A mathematical framework for modeling sequential decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. An MDP is defined by:
  - A set of states S.
  - A set of actions A.
  - A transition function P(s' | s, a), which gives the probability of moving to state s' from state s after taking action a.
  - A reward function R(s, a, s'), which specifies the immediate reward received after transitioning from s to s' via action a.
  - A discount factor γ, which determines the present value of future rewards.
- Hindsight Experience Replay (HER): A technique used in Reinforcement Learning to make sparse rewards more dense and improve sample efficiency. When an agent fails to achieve its intended goal, HER re-labels the failed trajectory by pretending that the goal actually achieved at the end of the trajectory was the intended goal from the beginning. This generates useful training examples even from episodes that did not achieve the original goal, making learning from sparse rewards more tractable. This paper draws inspiration from HER's core idea of leveraging achieved outcomes (see the sketch below).
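To make the hindsight-relabeling idea concrete, here is a minimal, self-contained toy sketch (ours, not from the paper) that re-labels an episode with the goal it actually achieved, so the same transitions yield a useful positive learning signal:

```python
from dataclasses import dataclass, replace

@dataclass
class Transition:
    state: tuple
    action: int
    next_state: tuple
    goal: tuple    # the goal the agent was originally pursuing
    reward: float  # 1.0 if next_state equals the goal, else 0.0

def hindsight_relabel(episode):
    """Pretend the finally reached state was the intended goal all along,
    so at least the final transition receives a positive reward."""
    achieved_goal = episode[-1].next_state
    return [
        replace(tr,
                goal=achieved_goal,
                reward=1.0 if tr.next_state == achieved_goal else 0.0)
        for tr in episode
    ]
```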
3.2. Previous Works
The paper frames previous approaches to instilling proactivity in LLMs into two main paradigms, highlighting their limitations:
- Attribute-based alignment / Single-turn optimization:
  - Methods: This category includes Supervised Fine-Tuning (SFT) and preference optimization techniques like Direct Preference Optimization (DPO) (Rafailov et al., 2023). Early forms also included simple prompting (Deng et al., 2023b; Zhao & Dou, 2024) to elicit behaviors like clarifying questions.
  - Focus: These methods optimize for single-turn attributes such as clarity, relevance, or safety (Li et al., 2025b; Qian et al., 2023; Xu et al., 2025; Zhou et al., 2022).
  - Limitations: They are myopic and fail to learn a long-horizon, stateful policy. They don't account for temporal dependencies in a conversation and, critically, lack a mechanism for deciding when to stop. The paper notes that DPO's single binary preference signal is often insufficient for guiding learning of dual objectives (what to ask and when to stop).
- Simulation-based optimization:
  - Methods: This involves using Reinforcement Learning (RL) in simulated user environments (Wu et al., 2025; Xu et al., 2023). Story-related reasoning tasks using tree-based extensions for data simulation have also been explored (Zhou et al., 2025).
  - Focus: These approaches aim to tackle sequential decision-making and long-horizon rewards by interacting with a synthetic user.
  - Limitations (Reality Gap): Building a high-fidelity user simulator for complex, open-ended domains (like medical consultation) is notoriously difficult, computationally prohibitive, and suffers from a combinatorial explosion of states (Hao et al., 2024). Policies optimized in synthetic environments often fail to generalize to real human interactions, leading to poor performance in the real world.
- Offline RL from Human Data:
  - Alignment: The paper's work is philosophically aligned with offline RL from human-involved data (Shani et al., 2024; Zhou et al., 2024; Shi et al., 2024).
  - Distinction: Unlike standard offline RL, which assumes a fixed reward function, this paper's key challenge is to infer the reward signal itself from expert behavior. The methodology is distinct: long trajectories are decomposed into single-turn decisions, and fine-grained, turn-level rewards are inferred by using the observed future of the real conversation. This is seen as a novel adaptation and significant extension of the hindsight learning paradigm.
3.3. Technological Evolution
The evolution of teaching LLMs proactive behavior can be summarized as follows:
- Early Dialogue Systems (Pre-LLM): Rule-based or statistical methods were used for proactive behaviors, typically in narrow domains (Deng et al., 2023a; Ling et al., 2025). These lacked the general knowledge and adaptability of modern LLMs.
- LLMs with Simple Prompting: With the advent of powerful LLMs, initial efforts leveraged their vast world knowledge through prompt engineering. This involved crafting specific instructions to elicit proactive behaviors like asking clarifying questions (Deng et al., 2023b; Zhao & Dou, 2024). While straightforward, these methods lacked the ability to learn complex, domain-specific strategies from data.
- LLM Fine-tuning for Single-Turn Attributes: This represented a step towards data-driven learning. Techniques like SFT and DPO were used to align LLMs with desirable single-turn qualities (e.g., relevance, clarity, safety) by training on synthetic preference data (Li et al., 2025b). This improved local response quality but remained myopic and couldn't learn long-horizon sequential policies.
- LLM Fine-tuning with Simulation-based RL: To address sequential decision-making, Reinforcement Learning was applied, often interacting with user simulators (Wu et al., 2025). This aimed for long-horizon rewards but introduced the reality-gap problem: policies trained in synthetic environments often failed in real-world human interactions due to the difficulty of creating realistic simulators.
- Learn-to-Ask's Position: This paper's Learn-to-Ask framework is situated at a critical juncture in this evolution. It moves beyond myopic single-turn optimization and directly tackles the reality gap of simulation-based RL. By learning a sequential policy and a stopping condition directly from offline expert data using a novel hindsight-based reward inference, it offers a simulator-free and data-driven solution for truly proactive LLM agents. It aims to make LLMs collaborative partners by leveraging real-world expert strategies.
3.4. Differentiation Analysis
The Learn-to-Ask framework differentiates itself from existing methods primarily in its approach to learning a long-horizon, proactive dialogue policy from offline data.
Here's a comparison with the main methods in related work:
- Compared to Attribute-based Alignment (e.g., SFT, DPO):
  - Core Difference: Attribute-based methods are myopic, optimizing single-turn qualities (e.g., clarity, relevance of a question). Learn-to-Ask, in contrast, learns a complete, sequential policy that accounts for temporal dependencies and long-term conversational goals.
  - Innovation: Learn-to-Ask specifically addresses the crucial decision of when to stop asking questions, a component entirely absent in attribute-focused methods. SFT learns to mimic a single path, failing to generalize to alternative valid strategies, while DPO struggles with conflicting preference signals across diverse expert trajectories. Learn-to-Ask overcomes this by using a hindsight-driven objective that estimates the "remaining nodes to be covered", making it robust to path variations.
  - Reward Signal: SFT uses direct imitation as a signal; DPO uses binary preference pairs. Learn-to-Ask infers a dense, turn-by-turn reward signal directly from the observed future of expert trajectories, which is grounded in what experts actually did to achieve their goals, providing a more nuanced learning signal.
- Compared to Simulation-based Optimization (e.g., RL with user simulators):
  - Core Difference: Simulation-based methods rely on user simulators to generate interaction data for Reinforcement Learning. Learn-to-Ask is simulator-free.
  - Innovation: Learn-to-Ask bypasses the notorious reality-gap problem, where policies optimized in synthetic environments often fail to generalize to real-world human interactions. It learns directly from real-world offline expert data, ensuring real-world applicability and robustness. Building high-fidelity simulators for complex, open-ended domains is extremely difficult and computationally expensive, a challenge Learn-to-Ask completely avoids.
  - Stability: Simulation-based methods often face instability in offline value estimation due to extrapolation errors. Learn-to-Ask reframes the problem into supervised learning on hindsight-based objectives, avoiding the need to estimate a long-horizon, unstable value function altogether, leading to a much more stable and direct learning process.
- Compared to standard Offline RL from Human Data:
  - Core Difference: While philosophically aligned, standard offline RL often assumes a fixed reward function. Learn-to-Ask's primary contribution lies in its reward inference methodology.
  - Innovation: Instead of assuming a reward, Learn-to-Ask infers the reward signal itself from expert behavior. It decomposes long trajectories into single-turn decisions and infers fine-grained, turn-level rewards by using the observed future of the real conversation as a grounded source of truth. This makes policy learning more precise and data-efficient. It also adapts the Hindsight Experience Replay (HER) concept to the high-dimensional language space for learning a complete dialogue policy with an explicit stopping condition, which HER does not directly address.

In essence, Learn-to-Ask offers a practical, economically viable, and robust solution by learning directly from the rich, sequential structure of existing expert trajectories, transforming intractable long-horizon RL into tractable supervised learning tasks, and ensuring reward fidelity through automated calibration, all without the need for expensive and often unrealistic simulators.
4. Methodology
4.1. Principles
The core idea behind Learn-to-Ask is to transform the complex and often intractable problem of offline Reinforcement Learning (RL) for proactive dialogue policies into a series of tractable, single-step supervised learning tasks. This is achieved by leveraging the observed future of each expert trajectory as a grounded oracle. Instead of estimating a speculative long-horizon value function or relying on a user simulator, the framework directly infers dense, turn-by-turn reward signals that reflect the expert's revealed strategy. This hindsight-driven objective decomposition allows the policy to learn both what to ask (to gather target information) and when to stop (when the goal is met) in a data-driven and stable manner.
4.2. Core Methodology In-depth (Layer by Layer)
The Learn-to-Ask framework is designed to move beyond myopic imitation and address the challenges of offline RL in dialogue by decomposing the long-horizon problem into single-step supervised learning tasks. The overall workflow is illustrated in Figure 1.
The following figure (Figure 1 from the original paper) shows the overall workflow of the proposed Learn-to-Ask framework:
This image is Figure 1 of the paper, a schematic of the overall Learn-to-Ask workflow: by observing the future states of expert dialogue trajectories and using hierarchical reward modeling, the framework converts an intractable offline reinforcement learning problem into a series of trainable supervised learning tasks (the figure includes the accompanying equations).
As seen in Figure 1, the framework consists of three main parts:
- Part A: Offline RL Problem Formulation: The initial problem of learning a proactive dialogue policy from static offline data is formulated as an offline RL problem.
- Part B: Hindsight-driven Reward Pipeline: This is the core innovation, where the observed future of expert trajectories is analyzed to extract ground-truth objectives (a target information set and a stopping decision). This process is guided by an Automated Grader Calibration pipeline.
- Part C: Policy Optimization: Using the grounded objectives from Part B, the policy is trained through Reinforcement Fine-Tuning (RFT) to generate structured utterances (action, state_assessment) that align with the expert's strategy.
4.2.1. Problem Formulation: Proactive Dialogue As Offline RL
The task of proactive, goal-oriented dialogue is formulated as a sequential decision-making problem. The agent aims to learn a policy, π_θ, from a static, offline dataset of expert-led conversations, denoted as D.
Each trajectory represents a complete conversation:
$
\tau = ( u _ { 0 } , x _ { 1 } , u _ { 1 } , \dots , x _ { T - 1 } , u _ { T - 1 } )
$
where:
- u_t is the user's utterance at turn t.
- x_t is the agent's utterance at turn t.
- T is the total number of turns in the conversation.

At each turn t, the policy observes the conversation history up to that point, C_{t-1}, and generates a structured utterance tuple x_t = (a_t, s_t). Here:
- a_t is a natural language question aimed at gathering new information.
- s_t is a discrete state assessment indicating whether the agent believes the conversational goal has been met.

The policy is thus defined as π_θ(a_t, s_t | C_{t-1}). The objective is for the learned policy to mimic the expert's strategy for effective and efficient task completion (e.g., medical diagnosis).
This problem is modeled as an offline Markov Decision Process (MDP) with the following components:
- State: The conversation history C_{t-1}.
- Action: The agent's structured utterance x_t = (a_t, s_t).
- Transition Dynamics (P): The unknown user response dynamics, which govern the state transition P(C_t | C_{t-1}, a_t). The next state C_t is formed by appending the agent's question a_t and the user's subsequent utterance u_t to the history C_{t-1}. This is unknown in the offline setting.
- Reward Function (R): The unknown reward function that implicitly guided the expert's actions.

The central challenges addressed are operating in an offline setting (the transition dynamics P cannot be queried) and inferring the reward function R directly from expert trajectories.
4.2.2. Overview: Objective Decomposition via Hindsight
To overcome the challenges of standard offline RL (like the simulator gap and instability of value estimation), Learn-to-Ask introduces a novel objective decomposition inspired by Hindsight Learning. The core idea is to transform the intractable sequential decision problem into a sequence of tractable, single-step supervised learning tasks by using the observed future of each real trajectory as a grounded oracle.
Instead of estimating a long-horizon value, for each turn t, a Hindsight-driven Reward Pipeline (Part B in Figure 1) analyzes the future conversation segment F_t to extract a ground-truth tuple (I_t^*, s_t^*):
- I_t^*: The target information set that the expert went on to collect.
- s_t^*: The expert's implicit stopping decision (CONTINUE or STOP).

This process effectively creates a dataset of (state, hindsight-objective) pairs, which allows for stable policy optimization (Part C) to train a policy that aligns with this hindsight-derived objective. The decomposition grounds the entire learning process in demonstrated expert strategy, teaching the policy both what to ask (to cover I_t^*) and when to stop (to match s_t^*), as sketched below.
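The following minimal Python sketch (our illustration, not the authors' code) shows the turn-wise decomposition: each agent turn in an expert log yields a context C_{t-1} that the policy conditions on and an observed future F_t from which the hindsight objective is later derived.

```python
def split_trajectory(turns):
    """turns: list of (role, text) pairs, e.g. [("user", u0), ("agent", x1), ("user", u1), ...].

    For every agent turn t, emit the observed context C_{t-1} (everything before
    the agent's utterance) and the observed future F_t (everything from it onward)."""
    samples = []
    for i, (role, _text) in enumerate(turns):
        if role == "agent":
            samples.append({"context": turns[:i],   # C_{t-1}
                            "future": turns[i:]})   # F_t, the expert's revealed strategy
    return samples

# Toy example
dialogue = [("user", "I have had a headache for two days."),
            ("agent", "Do you also have a fever?"),
            ("user", "No fever."),
            ("agent", "Any history of migraines?"),
            ("user", "Yes, occasionally.")]
for s in split_trajectory(dialogue):
    print(len(s["context"]), "context turns |", len(s["future"]), "future turns")
```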
4.2.3. Ground Truth Extraction from Observed Trajectories
For each turn t in a successful dialogue trajectory (where the designated goal G was achieved by the end), a ground truth tuple (I_t^*, s_t^*) is extracted from the future context F_t. This process is guided by a powerful LLM, M, which acts as a noisy oracle for interpreting the expert's latent intent.
Micro-Goal (Target Information Set)
This represents the set of goal-relevant information that the expert sought and obtained in the subsequent turns within F_t. It is defined as the "information delta" that the expert successfully closed.
To extract this, the LLM M is used as an information extractor. For each turn t, M is prompted with the overall goal G, the current context C_{t-1}, and the future conversation F_t. The prompt instructs the LLM to identify and list only the critical new pieces of information present in the user's responses within F_t that were not already available in C_{t-1}.
The structured extraction process, governed by M, yields the target information set for turn t:
$
I_t^* = \begin{cases} \mathrm{Extract}_{M}(C_{t-1}, F_t, G), & t < T_C \\ \emptyset, & t = T_C \end{cases}
$
Where:
- I_t^*: The target information set for turn t.
- Extract_M(·): The information-extraction function executed by the LLM.
- C_{t-1}: The conversation history up to turn t-1.
- F_t: The future conversation segment from turn t onwards.
- G: The overall conversational goal.
- M: The powerful LLM (e.g., Qwen2.5-32B-Instruct) acting as the information extractor.
- T_C: The final turn of the conversation.
- I_{T_C}^* = ∅: At the final turn, there is no more future information to extract, so the target set is empty.

This process ensures that the micro-goal is grounded in the actual information-gathering path taken by a human expert. A critical step is to avoid extracting overly generic or context-independent information (e.g., common medical questions like "pregnancy status") to prevent reward hacking.
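As a hedged illustration of this extraction step (the prompt wording and the call_llm helper below are hypothetical stand-ins, not the paper's released prompts), the micro-goal can be approximated by asking a strong LLM for the information delta between the current context and the observed future:

```python
import json

EXTRACT_PROMPT = """Goal of the consultation: {goal}

Conversation so far:
{context}

Rest of the conversation (observed future):
{future}

List ONLY the critical NEW pieces of information revealed by the user in the
rest of the conversation that were NOT already present in the conversation so
far. Answer with a JSON list of short strings; return [] if nothing is new."""

def extract_target_info(context_lines, future_lines, goal, call_llm):
    """Approximate I_t^* for one turn. `call_llm(prompt) -> str` is assumed to be
    supplied by the caller and backed by a strong instruct model."""
    if not future_lines:                       # final turn: nothing left to gather
        return []
    prompt = EXTRACT_PROMPT.format(goal=goal,
                                   context="\n".join(context_lines),
                                   future="\n".join(future_lines))
    try:
        items = json.loads(call_llm(prompt))
    except (json.JSONDecodeError, TypeError):
        items = []
    # Filter overly generic items (illustrative list) to reduce reward hacking.
    too_generic = {"pregnancy status", "age"}
    return [x for x in items if x.strip().lower() not in too_generic]
```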
Macro-Goal (Target Situation Assessment)
This represents the ideal action (CONTINUE or STOP) at turn t, reflecting the expert's implicit decision. It is inferred based on whether there was still critical information to be gathered:
$
s_t^* = \begin{cases} \texttt{CONTINUE}, & \text{if } I_t^* \neq \emptyset \text{ and } t < T_C \\ \texttt{STOP}, & \text{if } I_t^* = \emptyset \text{ or } t = T_C \end{cases}
$
Where:
- s_t^*: The target situation assessment for turn t.
- CONTINUE: The expert's implicit decision is to continue the conversation.
- STOP: The expert's implicit decision is to stop the conversation.
- I_t^* ≠ ∅: The target information set is not empty, meaning there is still information to be gathered.
- t < T_C: The current turn is not the final turn.
- I_t^* = ∅: The target information set is empty, meaning all necessary information has been gathered.
- t = T_C: The current turn is the final turn, implying the conversation has concluded.

This mechanism directly learns an expert-aligned stopping policy from data, which is a component typically absent in attribute-focused methods.
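A minimal sketch of this macro-goal rule (assuming the reconstructed condition above) makes the derivation explicit:

```python
def derive_stop_label(target_info, turn_index, final_turn):
    """Infer the expert's implicit decision s_t^*: continue only while critical
    information remains and the dialogue has not yet reached its final turn."""
    if target_info and turn_index < final_turn:
        return "CONTINUE"
    return "STOP"

assert derive_stop_label(["symptom duration"], turn_index=2, final_turn=5) == "CONTINUE"
assert derive_stop_label([], turn_index=3, final_turn=5) == "STOP"
assert derive_stop_label(["dose taken"], turn_index=5, final_turn=5) == "STOP"
```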
4.2.4. Automated Prompt Calibration (Auto-Prompt)
The Learn-to-Ask framework heavily relies on LLMs for ground-truth extraction, reward grading, and policy sampling. The behavior of these LLMs is dictated by natural language prompts, making their alignment with true expert intent a crucial concern. An uncalibrated prompt can introduce systematic bias, leading the policy to pursue phantom goals or misinterpret its actions.
To ensure robustness, the Auto-Prompt pipeline is introduced for automatically calibrating all three types of prompts with minimal human supervision, creating a verifiable chain of fidelity.
The following algorithm (Algorithm 1 from the original paper) describes the automated prompt optimization process:
Algorithm 1: Automated Prompt Optimization
1: Input: Initial prompts P_0^k, calibration set D_cal, human-verified anchor set D_anchor, number of iterations N.
2: P_best^k ← P_0^k (for k ∈ {EXTRACT, GRADER, ROLLOUT}).
3: for i = 1, ..., N do
4:   Generate candidate prompts {P_j} from P_best^k.
5:   Execute the type-specific pipeline for each candidate: O_j = Pipeline_k(P_j, D_cal).
6:   Compute the consistency score against labels from D_anchor: score_j = Consistency_k(O_j, D_anchor).
7:   Update P_best^k to the candidate with the highest score (maximizing score_j).
8: end for
9: Output: Calibrated prompts P_best^k.
Where:
- P_0^k: The initial prompt for each prompt type k.
- D_cal: A calibration dataset used for executing the pipelines.
- D_anchor: A small, high-quality, human-verified anchor set used for scoring prompt consistency.
- N: The number of iterations for prompt optimization.
- P_best^k: The best-performing prompt found during optimization.
- {P_j}: A set of candidate prompt variations generated from P_best^k.
- O_j: The output of a pipeline execution for candidate prompt P_j.
- Pipeline_k(P_j, D_cal): Executes the pipeline specific to prompt type k (Extractor, Grader, or Rollout) using prompt P_j on D_cal.
- Consistency_k(O_j, D_anchor): Computes a consistency score for the output O_j against the human-curated anchor set for prompt type k.
The pipeline iteratively performs four key steps:
- Candidate Generation: A generator LLM proposes variations of the current best prompt (e.g., semantic paraphrasing, rule-based mutations).
- Type-specific Pipeline Execution on the Calibration Set (D_cal): Each candidate prompt is used in its respective pipeline (information extraction, reward grading, or policy rollout) on a flexible calibration dataset.
  - For the Info-Extractor: the information-set extraction pipeline is run, returning the extracted information sets as its output.
  - For the Reward Grader: the grader model returns reward scores for prepared rollouts in D_cal.
  - For the Policy Rollout: the policy model generates rollouts with the candidate prompt on D_cal, and a fixed grader computes their rewards.
- Consistency Scoring with Human Anchors (D_anchor): The quality of each candidate prompt is measured against a small, human-verified anchor set.
  - For the Info-Extractor: consistency is measured by accuracy (e.g., F1-score or exact match) between the extracted sets and human-annotated information sets. This ensures the prompt reproduces expert information extraction.
  - For the Reward Grader: consistency is measured by negative Mean Squared Error (MSE) between the predicted rewards and human-assigned graded scores (e.g., 0.0, 0.5, 1.0). This ensures the prompt's scoring logic mimics human judgment.
  - For the Policy Rollout: once the grader is calibrated, human anchors are not needed; the score is simply the average reward of the generated rollouts.
- Selection and Iteration: The candidate prompt with the highest consistency score is selected as the new best prompt for the next iteration. This loop runs automatically until performance converges.
This process ensures:
- Grounding the Objective: The Extractor Prompt aligns with human-verified goals.
- Grounding the Learning Signal: The Grader Prompt ensures reward scores mimic human judgment.
- Grounding the Exploration: The Policy Sampler Prompt generates a diverse and high-quality candidate action space.

A compact sketch of the calibration loop follows.
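Below is a minimal, generic hill-climbing sketch of this loop (our illustration; run_pipeline, score, and propose stand in for the type-specific pipeline, the anchor-based consistency metric, and the candidate-generating LLM, none of which are specified by the paper in code form):

```python
def auto_prompt(initial_prompt, calib_set, anchor_set,
                run_pipeline, score, propose, n_iters=5):
    """Iteratively replace the current best prompt with the candidate whose
    pipeline outputs agree most with the human-verified anchor set."""
    best_prompt = initial_prompt
    best_score = score(run_pipeline(best_prompt, calib_set), anchor_set)
    for _ in range(n_iters):
        for candidate in propose(best_prompt):           # paraphrases / mutations
            s = score(run_pipeline(candidate, calib_set), anchor_set)
            if s > best_score:                            # keep only improvements
                best_prompt, best_score = candidate, s
    return best_prompt
```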
4.2.5. Grounded Reward Formulation
With the calibrated reward model and the extracted ground truth (I_t^*, s_t^*), any candidate generation (a_t, s_t) produced by the policy can be scored. The reward function is designed to be grounded in the observable outcomes of the expert's dialogue path and is composed of two heads: a Micro-Reward and a Macro-Reward.
Micro-Reward (Question Utility) R_a
This component measures how effectively the generated question a_t targets the necessary information that the expert deemed critical to collect next. It uses a graded scoring system output by the calibrated grader, providing a more nuanced learning signal than simple binary preference.
The Micro-Reward is defined as:
$
R_a(a_t, I_t^*) = \begin{cases} 1.0, & \text{if } a_t \text{ precisely targets an element of } I_t^* \\ 0.5, & \text{if } a_t \text{ is relevant but not precise} \\ 0.0, & \text{if } a_t \text{ is irrelevant} \end{cases}
$
Where:
- a_t: The question generated by the agent at turn t.
- I_t^*: The target information set (micro-goal) for turn t.
- 1.0: Full reward for a question precisely targeting an element in I_t^*.
- 0.5: Partial reward for a question that is relevant but not precise.
- 0.0: No reward for an irrelevant question.

This graded structure helps mitigate the sparse reward problem by crediting partially correct attempts and incentivizing precision.
Macro-Reward (Assessment Accuracy) R_s
This component evaluates the correctness of the agent's decision to continue or stop (s_t) against the expert's implicit decision (s_t^*). It is a straightforward binary reward:
$
R_s(s_t, s_t^*) = \begin{cases} 1, & \text{if } s_t = s_t^* \\ 0, & \text{otherwise} \end{cases}
$
Where:
- s_t: The agent's state assessment (CONTINUE or STOP) at turn t.
- s_t^*: The target situation assessment (macro-goal) for turn t.
- 1: Full reward if the agent's decision matches the expert's decision.
- 0: No reward otherwise.

This is a critical component for learning an expert-aligned stopping policy.
Reward Integration
A hierarchical fusion function is used to integrate these two rewards, prioritizing the macro-decision (when to stop) over the micro-action (what to ask). The reward for asking a good question is only granted if the strategic decision to continue is correct.
The integrated reward function R(a_t, s_t) takes the form:
$
R(a_t, s_t) = R_s(s_t, s_t^*) \cdot \big( 1 + \alpha \cdot R_a(a_t, I_t^*) \big) + R_{\mathrm{fmt}}(a_t, s_t)
$
Where:
- R(a_t, s_t): The total reward for the agent's output.
- R_s(s_t, s_t^*): The Macro-Reward for assessment accuracy. This term acts as a hierarchical gate: if R_s = 0, the entire multiplicative term becomes zero, effectively nullifying the Micro-Reward.
- (1 + α · R_a): This term incorporates the Micro-Reward for question utility. The constant 1 ensures that the Macro-Reward R_s is still credited even when R_a = 0, preventing the Micro-Reward from entirely negating the Macro-Reward if the question is irrelevant but the stopping decision is correct.
- α: A tunable hyperparameter that balances the preference for generating good questions against the aggressiveness of the stopping decision. A higher α places more emphasis on question utility.
- R_fmt(a_t, s_t): A flexible reward or penalty term used to regulate other aspects of the output, such as format and length. For example, it can penalize or reward generating a specific number of questions or adhering to certain stylistic guidelines.

The multiplicative formulation enforces a lexicographical preference for the macro-decision, meaning the agent only gets credit for good questions (R_a) if the strategic decision to continue (R_s = 1) is correct. This prevents the agent from being rewarded for asking good questions at the wrong time (e.g., after the goal has been met).
The experiments use a case-specific definition of R_fmt, with separate values for CONTINUE and STOP outputs. This penalty term regulates the output format (e.g., requiring exactly one question per turn to avoid a shotgun effect). A small sketch of the fused reward follows.
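The gating behavior is easy to see in code; this small sketch assumes the functional form reconstructed above (the alpha value and format term here are illustrative, not the paper's tuned settings):

```python
def fused_reward(r_s, r_a, r_fmt, alpha=1.0):
    """Hierarchical fusion: the micro-reward r_a only counts when the macro
    decision is correct (r_s = 1); the format term r_fmt is added on top."""
    return r_s * (1.0 + alpha * r_a) + r_fmt

# Correct stop with no question asked: macro + format credit only.
print(fused_reward(r_s=1, r_a=0.0, r_fmt=1.0))  # 2.0
# Perfectly targeted question asked at the wrong time: only the format term survives.
print(fused_reward(r_s=0, r_a=1.0, r_fmt=1.0))  # 1.0
```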
4.2.6. Policy Optimization Via Reinforcement Finetuning
With a structured dataset derived from real logs and a well-defined, grounded reward function, the policy is trained using an offline reinforcement learning approach. The training dataset consists of tuples ⟨C_{t-1}, {(a_t^(i), s_t^(i))}, {R^(i)}⟩, where the (a_t^(i), s_t^(i)) are sampled responses to the context C_{t-1} and R^(i) is their calculated reward.
This formulation allows the method to be applied to various offline RFT (Reinforcement Fine-Tuning) algorithms without ad-hoc modifications. The paper primarily studies Group Relative Policy Optimization (GRPO) (Shao et al., 2024).
GRPO is chosen because:
- Unlike PPO (Proximal Policy Optimization), which requires a separate critic model to estimate advantages, GRPO estimates the advantage directly and efficiently from a group of sampled responses.
- Its group-optimization nature exploits Learn-to-Ask's ability to explore the space of possible questions.
- Its group-wise advantage estimation naturally handles the graded, non-binary nature of the rewards, as the normalization process dynamically adjusts the learning signal based on the quality distribution of sampled responses. This helps navigate the nuances of expert-level conversation.
- These features make GRPO more adaptive, stable, and less complex to implement, benefiting real-world deployment pipelines. The group-relative advantage computation is sketched below.
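As a minimal illustration of the group-relative idea (a sketch of the advantage normalization only, not a full GRPO trainer):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """For a group of responses sampled from the same context, standardize each
    response's reward by the group mean and standard deviation; no learned
    critic is needed to obtain these advantages."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Five sampled responses to one context, scored by the grounded reward:
print(group_relative_advantages([3.2, 2.0, 2.0, 1.0, 0.5]))
```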
5. Experimental Setup
5.1. Datasets
The experiments are conducted on the RealMedConv dataset.
- Source: It is built from anonymized logs of real-world interactions between licensed pharmacists and users seeking over-the-counter (OTC) medication advice.
- Scale: It contains 2,000 dialogues, with 1,600 used for training and 400 for evaluation.
- Characteristics: Each session has a clear goal: gather sufficient symptom information to make a safe and appropriate recommendation. Dialogues are typically 3-5 turns long, reflecting the efficient, goal-directed nature of expert interactions.
- Domain: Medical dialogue, specifically for OTC medication recommendations.
- Data Preparation: Each full dialogue trajectory is split into a current context C_{t-1} and an observed future F_t at each turn t. The processing then differs for each experimental setting (see the sketch after this list):
  - RL (for Learn-to-Ask): The hindsight pipeline (Section 3.4) generates the ground-truth objective tuple (I_t^*, s_t^*) from C_{t-1} and F_t. This creates a training sample ⟨input = C_{t-1}, reward reference = (I_t^*, s_t^*)⟩. For the w/o R_s ablation, samples whose ground truth is STOP are omitted; samples whose ground truth is CONTINUE but whose target information set is empty are also omitted, as there is no valid question target.
  - SFT (Behavioral Cloning): The immediate next assistant utterance a_t is used as the expected response. This creates a training sample ⟨input = C_{t-1}, response = a_t⟩.
  - DPO (Direct Preference Optimization): The immediate next assistant utterance a_t is designated as 'chosen', and an LLM generates an irrelevant utterance (irrelevant to any content in the trajectory) as 'rejected'. This forms a training sample ⟨input = C_{t-1}, chosen = a_t, rejected = some irrelevant utterance⟩.
- Effectiveness: The dataset is effective for validating the method's performance because it represents real-world expert interactions, allowing the framework to learn from actual implicit expert-driven strategies and providing a grounded basis for evaluation.
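To make the three data-preparation formats concrete, here is an illustrative constructor (the field names are ours, chosen for clarity; they are not the paper's schema):

```python
def build_samples_for_turn(context, expert_reply, target_info, stop_label, distractor):
    """One dialogue turn yields a different training sample per method:
    - Learn-to-Ask (RL): context plus the hindsight reward reference (I_t^*, s_t^*);
    - SFT: context paired with the expert's actual next utterance;
    - DPO: expert utterance as 'chosen' vs. an irrelevant LLM-generated 'rejected'."""
    rl_sample = {"input": context,
                 "reward_reference": {"target_info": target_info, "stop": stop_label}}
    sft_sample = {"input": context, "response": expert_reply}
    dpo_sample = {"input": context, "chosen": expert_reply, "rejected": distractor}
    return rl_sample, sft_sample, dpo_sample
```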
5.2. Evaluation Metrics
The paper defines a suite of proxy metrics grounded in its hindsight framework to measure fine-grained alignment with expert strategy. These serve as strong indicators of task success.
- Strategic Questioning Quality (WA & WA-GH):
  - Conceptual Definition: These metrics measure what to ask. WA quantifies the average utility of generated questions, while WA-GH specifically measures the rate of generating perfectly targeted questions. They assess whether the agent targets the same critical information as the expert (I_t^*). High scores proxy Information Coverage.
  - WA (What-to-Ask): The average micro-reward R_a on samples where the ground truth is CONTINUE and the policy also correctly chose to continue the questioning.
  - WA-GH (Good Hit rate): The proportion of those generated questions that achieve a perfect score (i.e., R_a = 1.0):
$
\mathrm{WA\text{-}GH} = \frac{\#\{\, s_t^* = \texttt{CONTINUE},\ s_t = \texttt{CONTINUE},\ R_a = 1.0 \,\}}{\#\{\, s_t^* = \texttt{CONTINUE},\ s_t = \texttt{CONTINUE} \,\}}
$
- Dialogue Termination Accuracy (WS):
  - Conceptual Definition: This metric measures when to stop. It reports the accuracy of the model's decision to terminate the dialogue (STOP) specifically when the information-gathering goal has been met (i.e., s_t^* = STOP). A high WS score is a direct proxy for Dialogue Efficiency and the ability to avoid user fatigue.
  - WS (When-to-Stop): The average assessment score on samples whose ground truth is s_t^* = STOP. No explicit formula is given in the paper, but it corresponds to the proportion of STOP-labeled turns on which the model also outputs STOP.
- Dialogue Continuation Accuracy (WC):
  - Conceptual Definition: This metric measures the accuracy of the model's decision to continue the dialogue.
  - WC (When-to-Continue): The average assessment score on samples whose ground truth is s_t^* = CONTINUE; inferred as the proportion of CONTINUE-labeled turns on which the model also outputs CONTINUE.
  - Note: The paper cautions that a high WC may sometimes indicate a policy that trivially chooses to continue and is weak in termination assessment.
- Assessment Accuracy (AA):
  - Conceptual Definition: This is the overall accuracy of the agent's CONTINUE/STOP decisions across all turns.
  - AA: The average assessment score across all samples; inferred as the fraction of all evaluated turns on which the agent's decision s_t matches the ground truth s_t^*.
- Format Correctness (FC):
  - Conceptual Definition: This metric evaluates how well the generated responses adhere to specified output format guidelines (e.g., number of questions, sentence structure).
  - FC: The average format score across all samples, i.e., the mean of the R_fmt term defined in the reward integration.
- Total Reward (TR):
  - Conceptual Definition: This is the overall integrated reward score, reflecting the combined performance on question utility, termination accuracy, and format correctness.
  - TR: The average of the integrated reward R(a_t, s_t) from Equation 5 (Section 3.6) across all evaluated samples, combining the macro-reward R_s, the α-weighted micro-reward R_a, and the format term R_fmt.

A short sketch of how these metrics can be computed from per-turn evaluation records follows.
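The following sketch (our illustration; the field names s_star, s, r_a, and r_fmt are assumptions) computes the proxy metrics from per-turn evaluation records:

```python
def compute_metrics(records):
    """records: list of dicts with ground truth 's_star' (CONTINUE/STOP),
    prediction 's', micro-reward 'r_a' in {0.0, 0.5, 1.0}, and format score 'r_fmt'."""
    cont = [r for r in records if r["s_star"] == "CONTINUE"]
    stop = [r for r in records if r["s_star"] == "STOP"]
    cont_ok = [r for r in cont if r["s"] == "CONTINUE"]

    def avg(xs):  # safe mean for possibly empty lists
        return sum(xs) / len(xs) if xs else 0.0

    return {
        "WA": avg([r["r_a"] for r in cont_ok]),
        "WA-GH": avg([1.0 if r["r_a"] == 1.0 else 0.0 for r in cont_ok]),
        "WC": len(cont_ok) / len(cont) if cont else 0.0,
        "WS": avg([1.0 if r["s"] == "STOP" else 0.0 for r in stop]),
        "AA": avg([1.0 if r["s"] == r["s_star"] else 0.0 for r in records]),
        "FC": avg([r["r_fmt"] for r in records]),
    }
```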
5.3. Baselines
The proposed Learn-to-Ask method is compared against the following baselines:
- Direct Prompting (Base): This uses the base LLM (e.g., Qwen2.5-7B/32B-Instruct) guided by a carefully engineered zero-shot prompt (i.e., without any examples in the prompt itself, relying solely on instructions).
- Behavioral Cloning (SFT): This involves standard supervised fine-tuning, where the LLM is directly trained to imitate the expert's next utterance, i.e., the structured tuple (a_t, s_t).
- Direct Preference Optimization (DPO): For DPO, preference pairs are formed. The expert's response is considered 'chosen', and a generation from a base model that is irrelevant to any information in the context is used as 'rejected' (Rafailov et al., 2023). This baseline tests whether a simple preference for expert actions is sufficient to guide the learning process.
5.4. Ablation Studies / Other RL Algorithms
To validate its design choices and explore extensibility, the paper conducts several ablation studies and evaluates Learn-to-Ask with other RL algorithms:
- Ablation (w/o R_a): Removes the Micro-Reward (question utility) component. The model is trained on the full dataset, but the reward function ignores question quality, keeping only the macro (assessment) and format terms.
- Ablation (w/o R_s): Removes the Macro-Reward (assessment accuracy) component. In this setting, the model is only trained on dialogue turns where the ground-truth action was CONTINUE. The system prompt is modified to only instruct question generation, removing any mention of the stopping condition. The reward keeps only the question-utility and format terms.
- Ablation (Sum): Replaces the hierarchical multiplicative fusion of rewards with a simple additive summation of the macro, micro, and format terms.
- Other RL Algorithms: Learn-to-Ask is also evaluated with GSPO (Group Sequence Policy Optimization) (Zheng et al., 2025) and CISPO (Chen et al., 2025a) to compare its performance across different Reinforcement Fine-Tuning (RFT) algorithms.
5.5. Implementation Details
- LLM Models: Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct (Yang et al., 2024) are used as the base LLMs.
- Hardware: Experiments are conducted on a cluster of up to 32 NVIDIA H20 GPUs.
- Framework: The Trinity-RFT framework (Pan et al., 2025), a customizable RFT training library, is used to implement the entire workflow (policy sampling, reward grading, optimization).
- Hyperparameters: Primary hyperparameters are kept consistent across all methods and models for fair comparison:
  - Learning rate:
  - Batch size: 64
  - Number of training epochs: 4
  - For group RL algorithms (GRPO, CISPO, GSPO), 5 repeats are used for each sample.
- Prompt Calibration: The policy-sampler prompt, info-extractor prompt, and reward-grader prompt are all calibrated using the Auto-Prompt pipeline (Section 3.5) before the main training runs. The Qwen2.5-32B-Instruct model is specifically used as the backbone for the info-extractor and reward grader.
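For reference, the shared settings above can be collected into a single configuration sketch (the dictionary layout is ours; the learning rate is left unset because its value is not reproduced in this summary):

```python
train_config = {
    "base_models": ["Qwen2.5-7B-Instruct", "Qwen2.5-32B-Instruct"],
    "framework": "Trinity-RFT",
    "batch_size": 64,
    "num_epochs": 4,
    "rollouts_per_sample": 5,   # for group RL algorithms (GRPO, CISPO, GSPO)
    "learning_rate": None,      # value not shown in this summary
    "grader_backbone": "Qwen2.5-32B-Instruct",
}
```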
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that Learn-to-Ask is highly effective in teaching LLMs to be proactive, goal-oriented agents, excelling at both what to ask and when to stop.
The following are the results from Table 1 of the original paper:
| Model | WA (7B) | WA-GH (7B) | WC (7B) | WS (7B) | AA (7B) | FC (7B) | TR (7B) | WA (32B) | WA-GH (32B) | WC (32B) | WS (32B) | AA (32B) | FC (32B) | TR (32B) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Base | 0.50 | 0.13 | 0.98 | 0.16 | 0.75 | 0.63 | 2.17 | 0.50 | 0.13 | 0.92 | 0.52 | 0.81 | 0.67 | 2.43 |
| SFT | 0.40 | 0.08 | 0.94 | 0.74 | 0.89 | 0.57 | 2.41 | 0.43 | 0.11 | 0.94 | 0.87 | 0.93 | 0.69 | 2.70 |
| DPO | 0.42 | 0.05 | 0.94 | 0.36 | 0.78 | 0.19 | 1.78 | 0.23 | 0.04 | 0.52 | 0.87 | 0.62 | 0.18 | 1.61 |
| Ours (GRPO) | 0.67 | 0.41 | 0.94 | 0.93 | 0.94 | 0.92 | 3.27 | 0.64 | 0.37 | 0.93 | 0.88 | 0.92 | 0.88 | 3.15 |
| Ablation Studies | ||||||||||||||
| w/o Ra | 0.63 | 0.34 | 1.00 | 0.02 | 0.73 | 0.70 | 2.35 | 0.57 | 0.26 | 0.97 | 0.33 | 0.79 | 0.74 | 2.52 |
| w/o Rs | 0.52 | 0.19 | 0.96 | 0.87 | 0.93 | 0.92 | 3.06 | 0.54 | 0.19 | 0.95 | 0.91 | 0.94 | 0.92 | 3.12 |
| Sum | 0.64 | 0.38 | 0.92 | 0.95 | 0.93 | 0.91 | 3.20 | 0.65 | 0.37 | 0.94 | 0.88 | 0.92 | 0.90 | 3.19 |
| Learn-to-Ask with other RL algorithms | ||||||||||||||
| GSPO | 0.61 | 0.31 | 0.93 | 0.94 | 0.93 | 0.91 | 3.16 | 0.62 | 0.32 | 0.95 | 0.86 | 0.93 | 0.89 | 3.12 |
| CISPO | 0.71 | 0.47 | 0.95 | 0.94 | 0.95 | 0.93 | 3.36 | 0.70 | 0.49 | 0.94 | 0.89 | 0.93 | 0.92 | 3.29 |
6.1.1. Superiority of Learn-to-Ask
The primary finding from Table 1 is that Learn-to-Ask (labeled as Ours (GRPO)) significantly outperforms all baselines across both the Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct models, especially in strategic questioning quality and termination accuracy.
- Strategic Questioning Quality (WA-GH): For the 7B model, WA-GH (Good Hit rate) skyrockets from 0.13 (Base) to 0.41 (a +215% relative increase), indicating a massive improvement in generating perfectly targeted questions. Similarly, for the 32B model, WA-GH improves from 0.13 to 0.37 (+185% relative increase). This validates that Learn-to-Ask successfully teaches the model what to ask by targeting the critical information an expert would seek.
- Dialogue Termination Accuracy (WS): For the 7B model, WS (When-to-Stop) accuracy jumps from 0.16 (Base) to 0.93. For the 32B model, it improves from 0.52 to 0.88. This demonstrates that the framework effectively teaches the model when to stop asking questions, leading to higher dialogue efficiency and avoiding user fatigue.
- Total Reward (TR): Learn-to-Ask achieves far higher TR than all baselines for both model sizes (3.27 vs. at most 2.41 for 7B; 3.15 vs. at most 2.70 for 32B), reflecting its superior overall performance in balancing question utility, termination accuracy, and format correctness.
- Format Correctness (FC): Learn-to-Ask also achieves high FC (0.92 for 7B, 0.88 for 32B), indicating that the R_fmt term in the reward function effectively regulates the output format.
6.1.2. Limits of Baselines
The performance of the baselines underscores the difficulty of the task:
- SFT (Behavioral Cloning): While SFT improves WS (e.g., from 0.16 to 0.74 for 7B), it sacrifices question quality (WA drops and WA-GH remains low), suggesting it myopically memorizes stopping behavior without truly understanding strategic questioning. Its Total Reward is better than Base but significantly lower than Learn-to-Ask's.
- DPO (Direct Preference Optimization): DPO performs poorly, especially on the 32B model, where its WA-GH is 0.04, WC is 0.52, and FC is 0.18, resulting in the lowest TR of 1.61. This indicates that its single binary preference signal is insufficient to guide the learning of dual objectives (what to ask and when to stop) in a complex sequential dialogue setting.
6.1.3. Nuances of Scale
- Interestingly, in these offline experiments, Learn-to-Ask with the 7B model (TR 3.27) shows slightly better performance than with the 32B model (TR 3.15). The paper attributes this to the limited data scale of the RealMedConv dataset, which might not be sufficient to fully leverage the larger model's capacity. However, this trend reverses in large-scale production environments, where the 32B model significantly outperforms the 7B model, as discussed in Section 5, highlighting that larger models require ample, diverse data to unlock their full potential.
6.2. Ablation Studies / Parameter Analysis
The ablation studies validate the design choices of the Learn-to-Ask framework:
- Necessity of the Micro-Reward (R_a): The w/o R_a ablation causes WA-GH to drop (0.41 to 0.34 for 7B, 0.37 to 0.26 for 32B) and WS to collapse (0.93 to 0.02 for 7B, 0.88 to 0.33 for 32B), with WC rising to a near-trivial 1.00 for the 7B model. This confirms that without the question-utility reward, the model fails to learn when to stop and to generate targeted questions.
- Necessity of the Macro-Reward (R_s): The w/o R_s ablation leads to a drop in WA-GH (0.41 to 0.19 for 7B, 0.37 to 0.19 for 32B) and a minor drop in WS for the 7B model (0.93 to 0.87). This shows that both reward components are crucial for comprehensive policy learning.
- Multiplicative vs. Additive Reward Fusion (Sum): The Sum ablation (additive reward fusion) yields a slightly lower TR on the 7B model (3.20 vs. 3.27) and a comparable TR on the 32B model (3.19 vs. 3.15). The paper argues that the hierarchical gating of the multiplicative formulation provides a slight but consistent edge, which is magnified in complex production environments.
6.2.1. Learn-to-Ask with other RL algorithms
The following figure (Figure 4 from the original paper) shows the reward growth curves of RL algorithms in training 7B (left) and 32B (right) models:
This image is a chart showing the reward growth curves of the three reinforcement learning algorithms, GRPO, CISPO, and GSPO, when training the 7B (left) and 32B (right) models; the x-axis is training steps and the y-axis is the reward value, with CISPO outperforming the other algorithms.
As seen in Figure 4, CISPO (Chen et al., 2025a) consistently achieves higher reward growth rates during training compared to GRPO and GSPO (Zheng et al., 2025). The results in Table 1 corroborate this: CISPO (TR of 3.36 for 7B and 3.29 for 32B) outperforms GRPO (TR of 3.27 for 7B and 3.15 for 32B) in terms of Total Reward, especially showing stronger WA and WA-GH scores. This suggests that while GRPO is effective, there's potential for further performance improvement by integrating Learn-to-Ask with more efficient RFT algorithms.
6.2.2. Evaluation on General Capabilities Benchmarks
The following figure (Figure 5 from the original paper) shows radar chart comparisons of 7B and 32B parameter models on general capability benchmarks:
This image is Figure 5, a radar-chart comparison of the 7B and 32B parameter models on general-capability benchmarks, covering domain ability, instruction following, reasoning performance, and risk/safety; the red region indicates the proposed model's performance.
Figure 5 illustrates that the specialized training of Learn-to-Ask generally preserves the models' core competencies. Performance on domain-specific tasks (MedAgents, MedJourney) and instruction-following benchmarks (IFEval, StructFlow) remains stable or even slightly improves. Minor trade-offs are observed in safety-related metrics (e.g., a decrease in hallucination detection on MedHallu for the 7B model), which highlights the need for careful monitoring in real-world applications. Overall, the framework successfully imbues the model with proactive dialogue skills without significantly degrading its foundational capabilities.
6.3. Qualitative Analysis
The qualitative analysis (Figure 2) provides a clear illustration of Learn-to-Ask's superiority over SFT.
The following figure (Figure 2 from the original paper) shows a case study comparing dialogues generated by SFT and Learn-to-Ask models:
This image is a side-by-side dialogue comparison (Figure 2) showing the difference between the questions asked by the SFT model and the Learn-to-Ask model in a medical consultation, highlighting the relevant, effective question posed by Learn-to-Ask versus the irrelevant question posed by SFT.
As seen in Figure 2:
- The SFT model asks an irrelevant question ("What kind of pain do you have?") in a context where the user has already provided information about back pain, suggesting a lack of contextual understanding and strategic adaptation. This brittle mimicry implies SFT struggles to generalize to contexts not perfectly aligned with its training data.
- In contrast, the Learn-to-Ask model demonstrates strategic adaptation: it correctly identifies the information already provided and proceeds with an insightful follow-up question ("Is there any numbness or tingling in your legs or feet?") that logically furthers the information-gathering goal. This highlights a shift from rote memorization to flexible, goal-oriented reasoning, reflecting its ability to learn a true sequential policy.
6.4. Real-World Deployment and Impact
The ultimate validation of Learn-to-Ask is its successful deployment in a live, large-scale online AI service.
- Deployment Context: The model was deployed in a commercial Medication AI Assistant service, which serves thousands of users daily. Its goal is to proactively engage users to obtain a complete description of symptoms and recommend appropriate over-the-counter (OTC) medications.
- Model Scale in Production: In the production environment, which involved a dataset 100x larger and 10x more medical conditions than RealMedConv, the 32B model significantly outperformed the 7B model. This confirms that for complex, larger-scale scenarios with ample data, the saturation trend observed with smaller datasets does not apply, and the full capacity of larger models becomes essential. The 32B model was therefore selected for production.
- Role and Value of Auto-Prompt: While Auto-Prompt yielded marginal gains in offline academic tasks (e.g., TR on 32B increased slightly from 3.145 to 3.166 with policy-sampler calibration), its value in production became indispensable. Its true strength lies in enabling maintainability and continuous improvement of the extractor and grader prompts. By periodically adding human-reviewed "margin examples" to the anchor sets, the reward model can be recalibrated and the policy retrained in a data-driven, semi-automated loop. This allows the agent to adapt to evolving user behaviors and new business needs (e.g., new safety guidelines) without costly manual prompt engineering. Auto-Prompt thus transforms a static training process into a dynamic, self-improving system.

The following are the results from Table 2 of the original paper, showing the comparison of models trained with the original prompt and the optimized prompt:
Results on 7B Models

| Method | WA | WA-GH | WC | WS | AA | FC | TR |
|--------|------|-------|------|------|------|------|------|
| Base | 0.501 | 0.132 | 0.975 | 0.155 | 0.751 | 0.629 | 2.174 |
| Original | 0.665 | 0.413 | 0.944 | 0.926 | 0.939 | 0.915 | 3.272 |
| Optimized | 0.641 | 0.399 | 0.949 | 0.910 | 0.938 | 0.894 | 3.214 |

Results on 32B Models

| Method | WA | WA-GH | WC | WS | AA | FC | TR |
|--------|------|-------|------|------|------|------|------|
| Base | 0.503 | 0.134 | 0.915 | 0.521 | 0.807 | 0.670 | 2.431 |
| Original | 0.640 | 0.365 | 0.933 | 0.877 | 0.918 | 0.880 | 3.145 |
| Optimized | 0.634 | 0.366 | 0.925 | 0.916 | 0.923 | 0.889 | 3.166 |
As shown in Table 2, the gains from Auto-Prompt are indeed marginal for the RealMedConv dataset. For the 32B model, the Total Reward increased from 3.145 (Original prompt) to 3.166 (Optimized prompt).
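The recalibration loop referenced above can be sketched as follows. This is a minimal illustration under assumed interfaces: measure_agreement and search_better_prompt are hypothetical stand-ins for the grader-vs-anchor evaluation and the Auto-Prompt search, not the paper's API.

```python
from typing import Callable, List, Tuple

# One illustrative round of the semi-automated grader recalibration loop.
# The two callables stand in for components the paper describes (agreement
# measurement against human anchors, Auto-Prompt-style prompt search); their
# signatures are assumptions made for this sketch.
def recalibrate_grader(
    anchor_set: List[Tuple[str, float]],       # human-labeled (dialogue, reward) pairs
    margin_examples: List[Tuple[str, float]],  # newly human-reviewed borderline cases
    grader_prompt: str,
    measure_agreement: Callable[[str, List[Tuple[str, float]]], float],
    search_better_prompt: Callable[[str, List[Tuple[str, float]]], str],
    agreement_threshold: float = 0.9,
) -> Tuple[List[Tuple[str, float]], str]:
    # 1. Fold the newly reviewed "margin examples" into the anchor set.
    anchor_set = anchor_set + margin_examples

    # 2. Check how well the current grader prompt agrees with human labels.
    agreement = measure_agreement(grader_prompt, anchor_set)

    # 3. If agreement has drifted, search for a better grader prompt.
    if agreement < agreement_threshold:
        grader_prompt = search_better_prompt(grader_prompt, anchor_set)

    # 4. The recalibrated grader would then re-score the offline logs and the
    #    policy would be retrained on the refreshed rewards (omitted here).
    return anchor_set, grader_prompt
```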
- Online Performance and Validation of Proxy Metrics: A four-week live A/B test was conducted, routing a significant portion of user traffic to the Learn-to-Ask-trained model.
  - The model achieved a 93% information completeness rate and an 88% good-question rate, the online analogs of the offline WS and WA metrics. These strong internal scores validate that the offline proxy metrics are effective predictors of end-to-end task success.
  - Crucially, the model produced a 1.87x lift in dialog-to-purchase conversion rate compared with historical data from a parallel human-based service. This provides powerful empirical evidence of the framework's ability to deliver tangible business impact and achieve super-human performance on key business metrics.
7. Conclusion & Reflections
7.1. Conclusion Summary
This work introduced Learn-to-Ask, a novel, general, and simulator-free framework designed to bridge the "reality gap" in training proactive LLMs. By reframing the intractable long-horizon offline Reinforcement Learning problem into a series of supervised learning tasks, Learn-to-Ask effectively learns a complete dialogue policy—encompassing both what to ask and when to stop—directly from offline expert conversation logs. The key insight lies in leveraging the observed future of each real trajectory to infer a dense and grounded reward signal, thereby bypassing the need for brittle user simulators.
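For concreteness, here is a minimal sketch of how the observed future of an expert trajectory could be decomposed into per-turn supervised tasks. The extract_info callable is a hypothetical stand-in for the paper's LLM-based info-extractor, and representing information items as a set of strings is an assumption of this sketch.

```python
from typing import Callable, List, Set, Tuple

Turn = Tuple[str, str]  # (user_message, expert_message)

def build_turn_level_tasks(
    trajectory: List[Turn],
    extract_info: Callable[[List[Turn]], Set[str]],  # stands in for the LLM info-extractor
) -> List[dict]:
    # For each turn t, the history up to t is the input context, and the
    # information items revealed in the *observed future* (turns after t)
    # define what the policy's next question should work toward.
    tasks = []
    for t in range(len(trajectory)):
        history = trajectory[: t + 1]
        future = trajectory[t + 1:]
        target_info = extract_info(future)  # what the expert went on to elicit
        tasks.append({
            "context": history,
            "target_info": target_info,
            # When nothing remains to elicit, the correct decision is to stop.
            "should_stop": len(target_info) == 0,
        })
    return tasks
```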
Empirically, Learn-to-Ask demonstrated significant outperformance against strong baselines like SFT and DPO on a real-world medical dialogue dataset, showcasing its superior ability to learn nuanced, strategic questioning. The framework's true value was further validated through its successful deployment in a large-scale, commercial medical AI service, where the model achieved super-human performance on key business metrics (e.g., 1.87x lift in dialog-to-purchase conversion rate) and demonstrated task success rates comparable to or even superior to human experts. This deployment highlights that the offline proxy metrics directly translate into real-world impact.
7.2. Limitations & Future Work
The paper discusses several theoretical implications and future research directions:
Theoretical Implications:
- Value-Function-Free Offline RL: Learn-to-Ask can be seen as a stable, value-function-free offline RL algorithm. This raises the question of formally characterizing the sub-optimality gap of this hindsight-based policy relative to the true offline optimum, which likely depends on the quality and coverage of the expert data.
- Causal Intervention Heuristic: The framework serves as a heuristic for causal reasoning, learning an intervention policy. Future work could integrate do-calculus or counterfactual reasoning models to evolve from imitating optimal outcomes to predicting the outcomes of novel, unseen interventions.
- Data-Driven Proxy for Information Gain: Learn-to-Ask provides a pragmatic, data-driven proxy for information gain. This suggests research into dynamically adjusting the reward function itself to explore lines of inquiry not present in the expert data but deemed valuable by a theoretical information model.
- Graph-Theoretic Model: While Learn-to-Ask learns a subgraph coverage policy that is robust to path variations, the information graph is currently implicit. Future work could explicitly learn this graph structure from data to serve as a powerful prior for policy learning, enabling the identification of "holes" (un-asked but valuable questions) for structured exploration (an illustrative sketch follows this list).
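To illustrate the "holes" idea referenced in the last bullet, the sketch below makes a toy information graph explicit and finds reachable-but-uncovered items. The graph contents and the find_holes helper are invented for illustration; the paper treats the graph as implicit.

```python
from typing import Dict, Set

# Toy information graph: nodes are information items, edges point to items
# that naturally become relevant once a node is covered.
info_graph: Dict[str, Set[str]] = {
    "chief_complaint": {"symptom_duration", "symptom_location"},
    "symptom_location": {"radiating_pain", "numbness_or_tingling"},
    "symptom_duration": {"prior_treatment"},
    "prior_treatment": set(),
    "radiating_pain": set(),
    "numbness_or_tingling": set(),
}

def find_holes(covered: Set[str], graph: Dict[str, Set[str]]) -> Set[str]:
    """Return reachable-but-uncovered nodes: candidate questions not yet asked."""
    frontier: Set[str] = set()
    for node in covered:
        frontier |= graph.get(node, set()) - covered
    return frontier

# Example: the dialogue so far covered the complaint and its location.
print(find_holes({"chief_complaint", "symptom_location"}, info_graph))
# e.g. {'symptom_duration', 'radiating_pain', 'numbness_or_tingling'} (set order varies)
```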
Future Research Directions (from Imitation to Superhuman Intervention):
- Reward Shaping for Specific Goals: Instead of merely rewarding coverage of the expert's information set, future work could explore reward functions that enforce desired superhuman behaviors. This could involve penalizing dialogues that conclude without asking critical safety questions (e.g., about allergies), even if human experts sometimes omit them, thereby encoding organizational knowledge or safety protocols (a minimal sketch of such a shaped reward follows this list).
- Exploration in Semantic Space: To enable exploration without a live simulator, a generator model could propose alternative, plausible information goals beyond those observed in the logs. An advanced reward model, potentially trained on broader medical knowledge, could then score these hypothetical goals, allowing the agent to learn inquiries not represented in the limited offline dataset.
- Hybrid Human-AI Policy Learning: The ultimate goal is human-AI augmentation. Future systems could use Learn-to-Ask in an online loop: if a human expert overrules an AI-proposed question and asks something different, that action and its future outcome could be immediately incorporated to refine the AI's policy, creating a symbiotic system in which the AI continuously learns from and adapts to human strategies.
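As a concrete illustration of the reward-shaping idea in the first bullet above, the sketch below adds a penalty when a dialogue ends without covering critical safety items. The item names and penalty weight are hypothetical choices, not taken from the paper.

```python
from typing import Set

# Critical safety items a dialogue should cover before it ends.
CRITICAL_SAFETY_ITEMS: Set[str] = {"drug_allergies", "pregnancy_status"}

def shaped_final_reward(base_reward: float,
                        covered_info: Set[str],
                        penalty_per_miss: float = 0.5) -> float:
    """Penalize ending a dialogue without asking critical safety questions.

    This layers organizational safety protocols on top of the coverage-based
    reward, even when human experts in the logs occasionally omitted them.
    """
    missed = CRITICAL_SAFETY_ITEMS - covered_info
    return base_reward - penalty_per_miss * len(missed)

# Example: allergies were asked about, pregnancy status was not.
print(shaped_final_reward(3.0, {"drug_allergies", "symptom_duration"}))  # -> 2.5
```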
7.3. Personal Insights & Critique
This paper presents a highly innovative and practical approach to a critical problem in LLM deployment: making models proactive and goal-oriented in high-stakes domains. The hindsight-driven reward inference is a particularly clever idea, effectively circumventing the reality gap of user simulators and the myopia of single-turn optimization. Its successful deployment in a commercial setting, achieving super-human performance, provides compelling evidence of its real-world viability and economic value.
Inspirations and Applications:
The core methodology of Learn-to-Ask has broad applicability beyond medical dialogues. Any domain where offline expert-led conversations exist and a sequential information-gathering process is crucial could benefit. Examples include legal consultation, financial advisory, customer support, and even educational tutoring systems. The ability to learn stopping conditions is especially valuable for improving efficiency and user experience across these fields. The Auto-Prompt calibration is also a significant contribution, transforming prompt engineering from an art to a more systematic and maintainable process, critical for scaling and adapting LLM applications.
Potential Issues and Areas for Improvement:
- Inheritance of Expert Biases: While the paper acknowledges that the current model inherits human expert biases (e.g., a preference for conversational brevity), this is a significant limitation in high-stakes domains. If the expert data contains suboptimal strategies or implicit biases (e.g., overlooking certain patient demographics), the model will replicate them. The proposed reward shaping for superhuman behaviors is a good step, but it requires careful design and oversight to avoid introducing new biases or unintended consequences.
- Robustness of the Info-Extractor and Reward Grader: The entire framework relies heavily on the LLM-based info-extractor and reward grader. If these foundational LLMs are imperfect (e.g., misinterpret context, hallucinate information, or carry inherent biases), the inferred ground truth and reward signals will be noisy or skewed, directly degrading the learned policy. While Auto-Prompt helps calibrate these components, continuous monitoring and robust evaluation of them remain crucial, especially as user behaviors evolve.
- Generalization to Truly Novel Situations: The model learns from observed expert trajectories. While the information-graph intuition suggests robustness to path variation, the model may still struggle in situations entirely out of distribution from the expert data, where novel inquiries are needed. The proposed exploration in semantic space is an exciting direction for addressing this, but it adds another layer of complexity in ensuring exploration remains meaningful and safe.
- Interpretability: In high-stakes domains like healthcare, understanding why an AI agent asks certain questions or decides to stop is crucial for trust and accountability. While the framework provides insight via target information sets, the underlying LLM's reasoning for generating a specific question, or its confidence in a stopping decision, may still be a black box.
- Cost of Auto-Prompt: While Auto-Prompt saves manual prompt-engineering effort, its iterative search over LLM generations and evaluations can be computationally intensive, especially for large LLMs and diverse calibration sets. The trade-off between calibration cost and policy performance needs to be managed carefully in practice.

Overall, Learn-to-Ask provides a powerful and practical blueprint for proactive LLMs, skillfully navigating several key challenges. Its strength lies in its grounded, data-driven approach, making LLMs more capable and impactful in real-world, goal-oriented applications.