Asking Clarifying Questions for Preference Elicitation With Large Language Models
TL;DR Summary
This paper presents a novel method that uses large language models to generate sequential clarifying questions for eliciting user preferences, particularly when user history is limited, significantly improving the model's effectiveness at preference elicitation.
Abstract
Large Language Models (LLMs) have made it possible for recommendation systems to interact with users in open-ended conversational interfaces. In order to personalize LLM responses, it is crucial to elicit user preferences, especially when there is limited user history. One way to get more information is to present clarifying questions to the user. However, generating effective sequential clarifying questions across various domains remains a challenge. To address this, we introduce a novel approach for training LLMs to ask sequential questions that reveal user preferences. Our method follows a two-stage process inspired by diffusion models. Starting from a user profile, the forward process generates clarifying questions to obtain answers and then removes those answers step by step, serving as a way to add "noise" to the user profile. The reverse process involves training a model to "denoise" the user profile by learning to ask effective clarifying questions. Our results show that our method significantly improves the LLM's proficiency in asking funnel questions and eliciting user preferences effectively.
In-depth Reading
1. Bibliographic Information
1.1. Title
Asking Clarifying Questions for Preference Elicitation With Large Language Models
1.2. Authors
- Ali Montazeralghaem (alimontazer@google.com), Google, Mountain View, CA, USA
- Guy Tennenholtz (guytenn@google.com), Google, Mountain View, CA, USA
- Craig Boutilier (cboutilier@google.com), Google, Mountain View, CA, USA
- Ofer Meshi (meshi@google.com), Google, Mountain View, CA, USA
All authors are affiliated with Google in Mountain View, CA, USA, indicating a strong industry research background, likely focusing on practical applications of AI and machine learning, particularly in areas relevant to Google's products (e.g., search, recommendations, conversational AI).
1.3. Journal/Conference
Published at GENNEXT@SIGIR'25. SIGIR (Special Interest Group on Information Retrieval) is a highly reputable and leading international conference in the field of information retrieval. GENNEXT@SIGIR'25 likely refers to a workshop or a co-located event, indicating that the work was presented at a venue associated with a top-tier conference, suggesting relevance and quality within the information retrieval and recommender systems community.
1.4. Publication Year
2025
1.5. Abstract
The paper addresses the challenge of eliciting user preferences in Conversational Recommendation Systems (CRS) augmented by Large Language Models (LLMs), especially when user history is limited. It proposes a novel two-stage method, inspired by diffusion models, to train LLMs to generate effective sequential clarifying questions. In the forward process, a complete user profile is 'corrupted' by generating clarifying questions and then removing the answers step by step, creating noisy user profiles. The reverse process trains an LLM to denoise these profiles by learning to ask effective clarifying questions to reconstruct the original profile. The results demonstrate that this method significantly improves the LLM's ability to ask funnel questions (moving from general to specific) and effectively elicit user preferences.
1.6. Original Source Link
https://arxiv.org/abs/2510.12015 (Preprint on arXiv)
1.7. PDF Link
https://arxiv.org/pdf/2510.12015v1.pdf (Preprint on arXiv)
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the effective elicitation of user preferences in Conversational Recommendation Systems (CRS) that leverage Large Language Models (LLMs). In recommendation systems (RSs), personalizing recommendations is crucial, but often challenging due to limited user interaction history (the cold-start problem) or privacy constraints that restrict access to past data. Additionally, current user preferences can be influenced by transient factors like mood or context, which passive observation cannot capture.
Traditional preference elicitation (PE) techniques exist, but with the advent of LLMs and conversational interfaces, there's an opportunity to perform PE through multi-turn dialogues. While simple prompting can make LLMs ask questions, the challenge lies in optimizing them to generate effective sequential clarifying questions across various domains. The existing gap is how to systematically train LLMs to ask high-quality, structured PE questions that efficiently reveal user preferences, especially in a funnel-like manner (starting general and becoming more specific).
The paper's entry point is an innovative idea: drawing inspiration from diffusion models to train LLMs for this task. By framing the problem as denoising a user profile through question-asking, they aim to generate a training methodology that naturally leads to structured and effective PE dialogues.
2.2. Main Contributions / Findings
The primary contributions of this paper are:
- Novel Diffusion-Inspired Training Approach: Introduction of a two-stage process, inspired by diffusion models, for training LLMs to ask sequential clarifying questions for user preference elicitation. This involves a forward process of corrupting a user profile by generating questions and removing their answers, and a reverse process of training an LLM to denoise the profile by asking effective questions.
- Generation of Funnel Questions: The method is designed to produce funnel questions, meaning the LLM learns to start with more general inquiries and gradually progress to more specific ones, mimicking a natural and efficient conversational flow for PE.
- Improved LLM Proficiency: Experimental results demonstrate that the proposed method significantly enhances the LLM's ability to ask effective clarifying questions, leading to improved reconstruction of the true user profile.
- Effective User Preference Elicitation: The fine-tuned LLM shows superior performance in gathering relevant user information and building a comprehensive user profile through sequential questioning, outperforming non-fine-tuned LLMs and less effective user simulators.
- User Simulator Fine-tuning: The paper also shows that fine-tuning a user simulator (another LLM) to provide accurate answers from a given profile significantly improves the overall interaction and Questioner performance.

The key conclusions are that a diffusion-model-inspired training paradigm can effectively optimize LLMs for the complex task of sequential preference elicitation, resulting in more human-like and efficient conversational RSs. These findings address the cold-start problem and enhance personalization by enabling LLMs to actively and strategically gather user preferences.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a beginner should be familiar with several core concepts:
- Large Language Models (LLMs): Advanced artificial intelligence models, such as GPT-3, Gemma, or Gemini, trained on vast amounts of text data. They can understand, generate, and process human language, and perform a variety of Natural Language Processing (NLP) tasks such as answering questions, summarizing text, and holding open-ended conversations. In this paper, LLMs are used both to generate questions and to simulate user responses.
- Recommendation Systems (RSs): Software tools that provide suggestions for items (e.g., movies, products, articles) to users. They typically work by analyzing user behavior, preferences, and item characteristics, helping users discover items they might like but have not explicitly searched for.
- Conversational Recommendation Systems (CRS): A type of RS in which the interaction with the user occurs through a dialogue interface. Instead of just presenting a list of recommendations, a CRS can ask questions, understand user feedback in natural language, and refine recommendations iteratively. LLMs are increasingly used to power the conversational aspects of CRS.
- Preference Elicitation (PE): The process of actively asking users questions to determine their tastes, needs, or priorities. In RSs, PE helps overcome the cold-start problem (when there is little or no historical data for a new user) and allows for dynamic adaptation to changing user preferences or contexts. The paper focuses on generating effective PE questions.
- Diffusion Models: Originating primarily in computer vision for tasks like image generation, diffusion models are a class of generative models that learn to reverse a diffusion process.
  - Forward (Corruption) Process: Gradually adds noise to data (e.g., an image) until it becomes pure random noise. For discrete data (like text), this involves operations such as inserting, deleting, or replacing tokens.
  - Reverse (Denoising) Process: Learns to iteratively denoise the corrupted data, starting from noise and gradually reconstructing the original clean data. This paper adapts this concept to user profiles, where "noise" means missing information, and "denoising" means reconstructing the full profile by asking questions.
- Fine-tuning: A technique used to adapt a pre-trained LLM (a model already trained on a large, general dataset) to a specific task or domain. By training the model on a smaller, task-specific dataset, it learns to perform that particular task more accurately and efficiently.
- Parameter-Efficient Fine-Tuning (PEFT): A family of techniques that fine-tune LLMs with significantly fewer trainable parameters than full fine-tuning, reducing computational cost and memory requirements.
- Low-Rank Adaptation (LoRA): A specific PEFT technique. LoRA injects trainable low-rank matrices into the transformer architecture of LLMs. During fine-tuning, only these newly added matrices are trained, while the original pre-trained model weights remain frozen, making fine-tuning much more efficient.
- JSON Format: JavaScript Object Notation is a lightweight data-interchange format. It is a human-readable way to represent structured data, consisting of key-value pairs and ordered lists. The paper uses JSON to represent user profiles in a structured manner, making it easier for LLMs to process and manipulate (a minimal illustrative example follows this list).
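For illustration, a small Python snippet showing a hypothetical movie-domain profile in this tag/content style (the tag names and values are examples in the spirit of Section 4.2.2, not data from the paper):

```python
import json

# A hypothetical JSON-style user profile (illustrative tags and contents).
profile = {
    "Genre": "action movies, sci-fi",
    "Director": "Christopher Nolan",
    "Film Era": "2000s and later",
}
print(json.dumps(profile, indent=2))
```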
3.2. Previous Works
The paper builds upon existing research in recommendation systems, preference elicitation, and the application of LLMs in conversational contexts.
- Autoregressive Text Generation (Background 2.1): The foundation for how LLMs generate text. The probability of an entire sequence $S = (s_0, s_1, \ldots, s_N)$ (e.g., a sentence or a question) is modeled as the product of conditional probabilities of each token given the preceding tokens:
  $ P(S) = \prod_{t=0}^{N} p(s_t \mid s_0, s_1, \ldots, s_{t-1}) $
  Here, $P(S)$ is the probability of the entire sequence $S$, $s_t$ is the token at position $t$, and $p(s_t \mid s_0, \ldots, s_{t-1})$ is the conditional probability of generating $s_t$ given all the tokens that came before it. This autoregressive property means each token is predicted based on the tokens already generated, moving from left to right (a toy numeric illustration follows this list).
- Diffusion Models (Background 2.3): The paper explicitly draws inspiration from diffusion models, particularly discrete diffusion models used in text generation. In computer vision, continuous diffusion models work by progressively adding Gaussian noise to an original image $x_0$ until it becomes pure noise $x_T$ in a forward (corruption) process. A model is then trained to reverse this, denoising $x_T$ back to $x_0$. The paper adapts this to user preferences, where a null state (empty profile) is iteratively refined by asking questions, similar to how a noisy image is refined into a clear one.
- Preference Elicitation (PE) in RSs: References [16, 18, 23, 24, 26] discuss traditional PE techniques, highlighting their importance in clarifying user preferences and improving recommendations. These works typically focus on various strategies for asking questions to infer user preferences, often using explicit feedback.
- Conversational Recommendation Systems (CRS) with LLMs: References [6, 11, 19-21, 31] explore the integration of LLMs into RSs to create conversational interfaces. These works demonstrate the potential of LLMs to augment RSs with multi-turn dialogue capabilities, enabling more natural PE. Simple prompting of LLMs to ask questions is a common starting point, but this paper aims to optimize this process.
- LLM-based Preference Elicitation and Clarifying Questions: Li et al. [17] introduced Generative Active Task Elicitation (GATE), where LLMs interact with users via free-form language to infer behavior. Andukuri et al. [1] proposed STaR-GATE, focusing on teaching LLMs to ask clarifying questions to handle ambiguity. Austin et al. [2] used Bayesian optimization with LLM-based acquisition functions for natural language PE, employing NLI and BO strategies. Piriyakulkij et al. [22] presented an algorithm for active preference inference using LMs and probabilistic reasoning to generate informative questions. Montazeralghaem et al. [19-21] also contributed to conversational search and recommendation, including using actor-critic frameworks for interactive CRS.
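To make the chain-rule factorization concrete, here is a tiny, self-contained Python illustration with made-up per-token probabilities (the numbers are hypothetical, not from the paper):

```python
import math

def sequence_log_prob(token_probs):
    """token_probs[t] is p(s_t | s_0, ..., s_{t-1}) as an LLM would assign it."""
    return sum(math.log(p) for p in token_probs)

# A 4-token sequence whose tokens were assigned these conditional probabilities:
print(math.exp(sequence_log_prob([0.9, 0.5, 0.7, 0.8])))  # P(S) = 0.252
```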
3.3. Technological Evolution
The field has evolved from traditional recommendation systems relying on implicit feedback and collaborative filtering to more interactive systems. Early PE involved structured questionnaires or simple feedback mechanisms. The rise of LLMs marks a significant shift, enabling RSs to engage users in natural language conversations. This allows for more nuanced preference elicitation than ever before. However, simply using LLMs out-of-the-box for PE can be suboptimal; they need to be guided to ask effective, sequential, and non-redundant questions. This paper represents a step in this evolution by proposing a systematic, diffusion-inspired training method to optimize LLMs for this specific conversational PE challenge. It moves beyond generic prompting to structured LLM fine-tuning for question generation.
3.4. Differentiation Analysis
Compared to other LLM-based PE approaches, the core innovation of this paper lies in its diffusion model-inspired training methodology for generating sequential clarifying questions.
- Structured Training for Sequentiality and Funneling: While other works explore LLMs asking clarifying questions (e.g., STaR-GATE [1]) or use Bayesian optimization for query generation (PEBOL [2]), this paper specifically designs a forward-reverse process to explicitly train the LLM to ask questions in a funnel-like manner (general to specific). This structured approach to sequential question generation is a key differentiator.
- Profile Corruption and Denoising Analogy: The direct analogy to diffusion models, in which a complete user profile is corrupted by removing information (answers to questions) and an LLM is then trained to denoise (reconstruct) it by asking those questions, is a novel application of this paradigm to preference elicitation. This contrasts with frameworks that infer preferences directly or optimize question selection via other means.
- Synthetic Data Generation: Generating the training data with a larger LLM (Gemini 2.0) to simulate the forward process and then fine-tuning a smaller LLM (Gemma 7B) in the reverse process is a practical and effective way to create supervised learning signals for a complex conversational task.
- Emphasis on Profile Reconstruction: The explicit objective is to maximize the probability of reconstructing the complete user profile through sequential question-answering, which provides a clear and measurable goal for the PE process.

In essence, this paper provides a principled, diffusion-inspired framework to instill strategic questioning behavior (funneling, non-repetition) into LLMs for preference elicitation, going beyond generic LLM capabilities.
4. Methodology
4.1. Principles
The core idea of the method is to optimize a Large Language Model (LLM) to ask effective clarifying questions for preference elicitation by drawing inspiration from diffusion models in discrete spaces. The theoretical basis is that the process of gradually revealing user preferences can be modeled as denoising a corrupted (incomplete) user profile.
The intuition is as follows: Imagine you have a complete picture of a user's preferences. If you gradually obscure parts of this picture, and at each step, you ask a question that would help reveal the obscured part, you are essentially performing a forward (corruption) process. Conversely, if you start with an obscured (empty) picture and learn to ask the right questions to fill in the missing details, you are performing a reverse (denoising) process. This reverse process is what the LLM is trained to do: reconstruct the user's full profile by asking a sequence of questions. A key principle is that these questions should follow a funnel pattern, starting with general inquiries and progressing to more specific ones, mirroring natural human conversation and efficient information gathering.
4.2. Core Methodology In-depth (Layer by Layer)
The proposed model involves two main phases: a forward process for profile corruption (generating training data) and a reverse process for profile reconstruction (fine-tuning the LLM to ask questions).
The overall framework is illustrated in Figure 1, depicting how corrupted user profiles are addressed and reconstructed through clarifying questions. The forward process generates questions and answers to create noisy profiles, and the reverse process learns to denoise by asking effective questions.
(Figure description) A schematic of the proposed pipeline for handling corrupted user profiles and reconstructing them through clarifying questions. It shows the two stages: the forward process generates a sequence of questions to obtain user feedback, while the reverse process learns to "denoise" the user profile by asking effective questions, improving personalization in the recommender system.
Figure 1: Our model for addressing corrupted user profiles and reconstruction through clarifying questions.
4.2.1. Reverse Process: Profile Reconstruction by Asking Questions
The goal in the reverse process is to train an LLM (referred to as the Questioner) to transform an initial empty user profile $P_0$ into a final, complete ground-truth profile $P_n$ by asking a sequence of questions and receiving answers. This process involves intermediate profiles $P_0, P_1, \ldots, P_n$. Each profile $P_t$ represents the state after $t$ question-answer interactions. Formally, $P_t = P_{t-1} \cup \{(Q_{t-1}, A_{t-1})\}$, where $Q_{t-1}$ is the question asked and $A_{t-1}$ is its corresponding answer.

The generative process aims to learn how the complete profile is formed. Using the chain rule, the probability of generating the entire series of profile versions is modeled as:

$ p(P) = \prod_{i=0}^{n} p(P_i \mid P_0, \dots, P_{i-1}) $

This general formula for a series of profile versions is from the Background section of the paper. For the reverse process, where we build $P_n$ starting from $P_0$, the probability of the complete profile (which is effectively the sequence of questions and answers leading to it) can be broken down as:

$ p_{\theta, \phi}(P_n) = \prod_{t=1}^{n} p(P_t \mid P_{t-1}; \theta, \phi) $

Here, $p_{\theta,\phi}(P_n)$ is the probability of obtaining the complete profile $P_n$ given the learned parameters $\theta$ and $\phi$. The product iterates from $t = 1$ to $n$, representing each step where the profile is updated, and $p(P_t \mid P_{t-1}; \theta, \phi)$ is the probability of transitioning from the partial profile $P_{t-1}$ to $P_t$. This transition probability is further decomposed into three components:

$ p(P_t \mid P_{t-1}; \theta, \phi) = p_{\theta}(Q_{t-1} \mid P_{t-1}) \times p_{\phi}(A_{t-1} \mid Q_{t-1}, P_{t-1}) \times p(P_t \mid P_{t-1}, Q_{t-1}, A_{t-1}) $
Let's break down each component:
- $p_{\theta}(Q_{t-1} \mid P_{t-1})$: the probability that the Questioner (an LLM parameterized by $\theta$) generates question $Q_{t-1}$ given the current partial profile $P_{t-1}$. This component reflects the Questioner's ability to choose an appropriate clarifying question based on what is already known about the user.
- $p_{\phi}(A_{t-1} \mid Q_{t-1}, P_{t-1})$: the probability that the user (or more precisely, a user simulator parameterized by $\phi$) provides answer $A_{t-1}$ to question $Q_{t-1}$, given the partial profile $P_{t-1}$. This component models the user simulator's response behavior.
- $p(P_t \mid P_{t-1}, Q_{t-1}, A_{t-1})$: the probability of generating the next state $P_t$ (the updated profile) given the previous state $P_{t-1}$, the question $Q_{t-1}$, and the answer $A_{t-1}$. This component is deterministic and has no learnable parameters. It is defined as:

$ p(P_t \mid P_{t-1}, Q_{t-1}, A_{t-1}) = \begin{cases} 1 & \text{if } P_t = P_{t-1} \cup \{(Q_{t-1}, A_{t-1})\} \\ 0 & \text{otherwise} \end{cases} $

This means the new profile is simply the previous profile with the new question-answer pair added to it. The paper notes that including questions along with answers helps the model avoid repetitive queries and improves performance.
The overall objective is to maximize the probability of generating the complete user profile across all users by asking effective clarifying questions. This is formalized as:
$ \max_{\theta, \phi} \sum_{i=1}^{|I|} \log\left( p_{\theta,\phi}(P_n^i) \right), $

where $P_n^i$ is the complete profile for user $i$, and $|I|$ is the total number of users. This objective is optimized by fine-tuning two LLMs: one for the Questioner (to learn $\theta$) and one for the user simulator (to learn $\phi$).
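To make the decomposition above concrete, the following is a minimal Python sketch of a single reverse-process step; `questioner_llm` and `simulator_llm` are hypothetical stand-ins for the two fine-tuned models ($p_\theta$ and $p_\phi$), not the paper's actual implementation:

```python
def questioner_llm(profile):
    # Hypothetical stand-in for the fine-tuned Gemma Questioner (p_theta).
    return "What movie genres do you enjoy?" if not profile else "Do you have a favorite director?"

def simulator_llm(question, profile):
    # Hypothetical stand-in for the fine-tuned user simulator (p_phi).
    return "Action and sci-fi" if "genres" in question else "I don't know"

def reverse_step(partial_profile):
    """One reverse-process step: P_{t-1} -> P_t."""
    question = questioner_llm(partial_profile)          # Q_{t-1} ~ p_theta(. | P_{t-1})
    answer = simulator_llm(question, partial_profile)   # A_{t-1} ~ p_phi(. | Q_{t-1}, P_{t-1})
    return partial_profile + [(question, answer)]       # P_t = P_{t-1} U {(Q_{t-1}, A_{t-1})}

profile = []                  # P_0: the empty profile
for _ in range(2):            # ask a couple of questions
    profile = reverse_step(profile)
print(profile)
```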
4.2.2. Forward Process: Profile Corruption (Generating Training Data)
The forward process is crucial for generating the training data used to fine-tune the Questioner and user simulator in the reverse process. It starts with a complete user profile and gradually corrupts it by removing information, analogous to adding noise in diffusion models.
- Structured User Profile Creation: Given a user profile $P^u$ in raw text format (e.g., "user likes action movies"), it is first converted into a structured JSON format using an LLM. This JSON format allows for easier manipulation and querying of specific pieces of information:
  $ JP^u := \mathrm{LLM}(P^u) $
  For example, $P^u$ could be "User likes action movies and sci-fi. Director Christopher Nolan is a favorite." This might be converted to a JSON profile with entries such as (Genre: action movies, sci-fi) and (Director: Christopher Nolan). The JSON profile $JP^u$ is thus represented as a set of tag-content pairs, where a tag $t_j$ (e.g., 'Genre', 'Director') is paired with its content (e.g., 'action movies', 'Christopher Nolan').
- Generating Funnel Questions: The core idea is to generate a sequence of questions that follow a funnel pattern, starting general and becoming specific. To achieve this, two constraints are imposed:
  - Generality Ordering: Questions should start with easier, more straightforward inquiries and progress to more specific ones.
  - Dependency Handling: Broader aspects should be asked before more specific ones (e.g., movie genre before a specific director).
  An LLM is used to rank the tags in $JP^u$ from general to specific. Then, another LLM is prompted to generate funnel questions and their corresponding answers based on this ranked JSON profile:
  $ (Q_0, A_0), \ldots, (Q_{n-1}, A_{n-1}) = \mathrm{LLM}(JP^u, \{t_1, t_2, t_3, \ldots, t_m\}) $
  Here, $Q_i$ is the $i$-th generated question and $A_i$ is its answer derived from $JP^u$ (e.g., $Q_0$ might be "What movie genres do you enjoy?" with $A_0$ = "Action movies and sci-fi"). A mapping $\mathcal{T}_i$ identifies the set of tag-content pairs from $JP^u$ that are addressed by $Q_i$ and $A_i$; for the example above, $\mathcal{T}_0$ might be {(Genre: action movies, sci-fi)}. The questions are ordered such that $Q_0$ is the most general and $Q_{n-1}$ the most specific, and $n$ is the total number of questions.
-
Iterative Profile Corruption (Data Generation): The
forward processthen iteratively removes information from the complete profile to createpartial profilesat each step. Since the questions are generated in afunnelmanner (from general to specific, i.e., is most general, is most specific), the corruption process starts by removing the information related to the most specific question () first. This way, when going in reverse (training theQuestioner), theLLMwill learn to ask the most general questions first to reconstruct the profile.The partial user profile at step is represented as , which is the complete profile minus the information covered by questions through . $ J P _ { t } ^ { u } = J P ^ { u } \setminus \bigcup _ { i = t } ^ { n - 1 } \mathcal { T } _ { i } $ The index ranges from down to
0.- If , the union is empty, so (the full profile).
- If , information from all questions has been removed, resulting in (the empty profile). represents the partial user profile available just before asking question .
The training data for user for the
reverse processconsists of pairs of (question, partial profile): $ D _ { u } = { ( Q _ { n - 1 } , J P _ { n - 1 } ^ { u } ) , ( Q _ { n - 2 } , J P _ { n - 2 } ^ { u } ) , . . . , ( Q _ { 0 } , J P _ { 0 } ^ { u } ) } . $ In each pair , the model is given as input and trained to generate as the target. The full dataset for fine-tuning is the union of for all users .Algorithm 1 summarizes this
forward process:Algorithm 1 Forward process: Profile Corruption
Input: A user profile in text format. Output: Training data , comprising question-partial user profile pairs for various partial profiles.
1: Convert into a JSON format . 2: Sort tags from based on notion of generality. 3: Generate Funnel Questions based on the extracted tags. 4: 5: while do 6: Create partial profile by using Equation (7). 7: 8: 9: end while 10: return
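A minimal Python sketch of the corruption loop in Algorithm 1, assuming the funnel questions and their tag coverage ($\mathcal{T}_i$) have already been produced by an LLM; the data layout and toy example are illustrative, not taken from the paper:

```python
def corrupt_profile(jp_profile, questions, coverage):
    """jp_profile: full profile JP^u as a dict (tag -> content).
    questions: funnel questions Q_0..Q_{n-1}, ordered general -> specific.
    coverage:  coverage[i] = T_i, the tags answered by Q_i.
    Returns D_u = [(Q_t, JP^u_t)], removing the most specific information first."""
    partial = dict(jp_profile)
    pairs = []
    for t in reversed(range(len(questions))):        # t = n-1, ..., 0
        for tag in coverage[t]:
            partial.pop(tag, None)                   # JP^u_t = JP^u \ union_{i>=t} T_i
        pairs.append((questions[t], dict(partial)))  # training pair (Q_t, JP^u_t)
    return pairs

# Toy usage with the movie example from Section 4.2.2:
profile = {"Genre": "action movies, sci-fi", "Director": "Christopher Nolan"}
qs = ["What movie genres do you enjoy?", "Do you have a favorite director?"]
cov = [["Genre"], ["Director"]]
print(corrupt_profile(profile, qs, cov))
# [('Do you have a favorite director?', {'Genre': 'action movies, sci-fi'}),
#  ('What movie genres do you enjoy?', {})]
```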
An example illustrating the full process (likely referring to the generation of questions and removal of information) is shown in Figure 2.
Figure 2: An example of the forward process generating preference elicitation questions (e.g., "Are you interested in musicals?" and "What movie genres do you like?") and removing the corresponding information from the user profile. The reverse process then reconstructs the profile by iteratively answering the elicitation questions.
4.2.3. User Simulation
For evaluating the trained Questioner, an environment is needed where it can interact with a user simulator. This simulator's role is to answer questions based on a ground-truth user profile.
The user simulator is also an LLM. Given a question $Q_t$ and the ground-truth user profile $JP^u$, it tries to find the corresponding answer $A_t$ in the profile. If an answer cannot be found in the profile, it responds with "I don't know", implying the user has no specific preference regarding that question.
To enhance the user simulator's ability to answer questions effectively, it is also fine-tuned. The training data for the user simulator is constructed from the forward process:
$ \hat{D}_u = \{ (\mathcal{T}_{n-1}, Q_{n-1}, JP^u), (\mathcal{T}_{n-2}, Q_{n-2}, JP^u), \ldots, (\mathcal{T}_0, Q_0, JP^u) \} $
Each tuple $(\mathcal{T}_t, Q_t, JP^u)$ serves as a training instance. Given question $Q_t$ and the full user profile $JP^u$, the model is trained to generate the corresponding answer (implicitly, by identifying the relevant tag-content pairs $\mathcal{T}_t$ and generating their content).
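A small sketch of how the simulator's training instances $\hat{D}_u$ might be assembled from the forward-process outputs; the dictionary layout and the "I don't know" fallback are assumptions for illustration, not the paper's exact data format:

```python
def build_simulator_data(jp_profile, qa_pairs, coverage):
    """jp_profile: full profile JP^u (tag -> content).
    qa_pairs[t]: (Q_t, A_t); coverage[t]: the tags T_t addressed by Q_t.
    Each instance pairs (question, full profile) with the answer content as target."""
    instances = []
    for (question, _answer), tags in zip(qa_pairs, coverage):
        target = ", ".join(jp_profile[t] for t in tags) if tags else "I don't know"
        instances.append({"question": question, "profile": jp_profile, "target": target})
    return instances
```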
This two-stage process of data generation (forward) and model fine-tuning (reverse for Questioner and user simulator) forms the complete methodology.
5. Experimental Setup
5.1. Datasets
- Movielens Dataset: A widely used public dataset in recommender systems research, consisting of movie ratings from users. The paper uses it as the domain for preference elicitation.
- User Profiles: The ground-truth user profiles used in the experiments are not taken directly from the raw Movielens data. Instead, they are derived from Jeong et al. [15] and Tennenholtz et al. [29, 30]. These profiles were generated using an LLM and the complete raw rating history of each user, and the original authors (Tennenholtz et al.) found them to be predictive of user ratings. Using LLM-generated profiles for training ensures consistency with the LLM-centric approach of the paper and provides rich, textual preference data.
5.2. Evaluation Metrics
The quality of the generated questions and the effectiveness of preference elicitation are measured using two standard Natural Language Processing (NLP) metrics: ROUGE and BLEU. These metrics assess the similarity between the generated user profile (reconstructed by the Questioner) and the target (ground-truth) user profile.
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
  - Conceptual Definition: ROUGE is a set of metrics used for evaluating automatic summarization and machine translation. It works by comparing an automatically produced summary or translation against a set of human-produced reference texts, primarily measuring the overlap of n-grams (sequences of words), word sequences, and word pairs between the generated text and the reference text. Higher ROUGE scores indicate greater similarity to the reference, typically implying higher quality. The "recall-oriented" aspect emphasizes how much of the reference information is captured by the generated text.
  - Mathematical Formula (ROUGE-N):
    $ \mathrm{ROUGE\text{-}N} = \frac{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\text{n-gram} \in S} \mathrm{Count}_{\mathrm{match}}(\text{n-gram})}{\sum_{S \in \{\text{Reference Summaries}\}} \sum_{\text{n-gram} \in S} \mathrm{Count}(\text{n-gram})} $
  - Symbol Explanation:
    - $\mathrm{ROUGE\text{-}N}$: the ROUGE score based on matching n-grams of length $N$.
    - $S$: a reference summary (in this case, the ground-truth user profile).
    - n-gram: a sequence of $N$ consecutive words.
    - $\mathrm{Count}_{\mathrm{match}}(\text{n-gram})$: the number of times a particular n-gram appears in both the generated profile and the reference profile.
    - $\mathrm{Count}(\text{n-gram})$: the total number of times a particular n-gram appears in the reference profile.
    The paper likely uses variants like ROUGE-1 (unigrams), ROUGE-2 (bigrams), or ROUGE-L (longest common subsequence), which are common in NLP tasks. The abstract mentions ROUGE generally, implying a composite score or one of these common variants.
- BLEU (Bilingual Evaluation Understudy):
  - Conceptual Definition: BLEU is an algorithm for evaluating the quality of machine-translated text. It measures the precision of n-grams between the candidate (generated) text and the reference text, and includes a brevity penalty to discourage very short outputs. Higher BLEU scores indicate closer resemblance to the reference, suggesting better quality.
  - Mathematical Formula:
    $ \mathrm{BLEU} = \mathrm{BP} \cdot \exp \left( \sum_{n=1}^{N} w_n \log p_n \right) $
    where
    $ p_n = \frac{\sum_{\text{n-gram} \in \text{candidate}} \mathrm{Count}_{\mathrm{clip}}(\text{n-gram})}{\sum_{\text{n-gram} \in \text{candidate}} \mathrm{Count}(\text{n-gram})} \qquad \mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases} $
  - Symbol Explanation:
    - $\mathrm{BLEU}$: the BLEU score.
    - $\mathrm{BP}$: the brevity penalty.
    - $N$: the maximum n-gram order considered (typically 4).
    - $w_n$: the weight for the n-gram precision of order $n$ (often uniform, $w_n = 1/N$).
    - $p_n$: the n-gram precision for n-grams of length $n$.
    - $\mathrm{Count}_{\mathrm{clip}}(\text{n-gram})$: the count of an n-gram in the candidate that also appears in the reference, clipped to its maximum count in any single reference sentence.
    - $\mathrm{Count}(\text{n-gram})$: the total count of an n-gram in the candidate text.
    - $c$: the length of the candidate (generated) text.
    - $r$: the effective reference corpus length (the reference length closest to the candidate length).
The evaluation process for these metrics involves the following steps, as detailed in Algorithm 2:
Algorithm 2 Evaluation Process
Input: A corrupted (empty) profile $P_0$, target profile $P^*$, parameters $\theta, \phi$, maximum question number $T$. Output: A sequence of questions and answers that transforms $P_0$ into a profile approximating $P^*$.
1: Initialize profiles: $P \leftarrow P_0$
2: Initialize question count: $t \leftarrow 0$
3: while $P \neq P^*$ and $t < T$ do
4: Generate question $Q_t$ using the fine-tuned model $p_{\theta}(\cdot \mid P)$ as the Questioner.
5: Query the user simulator model, which accesses the target profile $P^*$, to determine an answer $A_t$:
6: if an answer to $Q_t$ is found in $P^*$ then
7: Set $A_t$ to the corresponding value in $P^*$.
8: else
9: Set $A_t$ to "No Preference".
10: end if
11: Update: $P \leftarrow P \cup \{(Q_t, A_t)\}$
12: Record $(Q_t, A_t)$ in the output sequence.
13: Increment question count: $t \leftarrow t + 1$
14: end while
15: return $P$ and the sequence of questions and answers
This algorithm simulates an interaction session where the Questioner (trained in the reverse process) asks questions, and the user simulator provides answers based on the ground-truth profile $P^*$. The process continues until the current profile matches $P^*$ or a maximum number of questions (set to 10 in the experiments) is reached. The final profile is then compared to $P^*$ using ROUGE and BLEU.
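The following is a minimal sketch of the evaluation loop in Algorithm 2, with hypothetical `questioner` and `simulator` callables and a dict-based profile; it illustrates the control flow, not the authors' code:

```python
def evaluate_session(target_profile, questioner, simulator, max_questions=10):
    """Interact until the profile is recovered or the question budget is exhausted."""
    current, history = {}, []
    t = 0
    while current != target_profile and t < max_questions:
        question = questioner(current, history)              # fine-tuned Questioner
        tag, content = simulator(question, target_profile)   # simulator reads the ground truth
        if tag is None:
            history.append((question, "No Preference"))
        else:
            current[tag] = content                           # merge the elicited information
            history.append((question, content))
        t += 1
    return current, history
```

The reconstructed `current` profile would then be serialized to text and scored against the target profile with ROUGE and BLEU.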
5.3. Baselines
The paper compares its proposed fine-tuned Gemma model (as Questioner and user simulator) against several baselines to demonstrate the effectiveness of its approach:
- Non-fine-tuned Gemma model: serves as a baseline Questioner to show the impact of the proposed diffusion-inspired fine-tuning on question generation.
- Non-fine-tuned Gemini model as a user simulator: used to compare against the fine-tuned Gemma user simulator. Gemini 2.0 is a larger, more capable LLM, so using it in a non-fine-tuned capacity for simulation helps assess whether fine-tuning a smaller model can outperform a larger, general-purpose LLM for this specific task.

The comparisons are structured to evaluate the quality of the Questioner (fine-tuned vs. non-fine-tuned) and the user simulator (fine-tuned Gemma vs. non-fine-tuned Gemini), both individually and in combination.
5.4. Training Details
- LLMs:
  - Questioner and fine-tuned user simulator: Gemma (7B version, 28 layers), chosen for the public availability of its weights and strong performance for its size.
  - Forward-process data generation: Gemini 2.0, a larger, more capable LLM, used to ensure high quality of the synthetic training data.
- Fine-tuning technique: Parameter-Efficient Fine-Tuning (PEFT), specifically Low-Rank Adaptation (LoRA) [13], used to efficiently adapt the pre-trained Gemma model (an illustrative configuration sketch follows this list).
- Hyperparameters:
  - Batch size: 64
  - Learning rate: 0.001
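As a rough illustration of this setup, a LoRA fine-tuning sketch using the Hugging Face `transformers` and `peft` libraries might look like the following; Gemma 7B, batch size 64, and learning rate 0.001 come from the paper, while the LoRA rank, alpha, target modules, and dataset are assumptions the paper does not report:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from peft import LoraConfig, get_peft_model

model_name = "google/gemma-7b"                     # 7B Gemma, as in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA: train small low-rank adapters while the base weights stay frozen.
lora_config = LoraConfig(
    r=16,                                          # rank (assumed, not reported)
    lora_alpha=32,                                 # scaling (assumed)
    target_modules=["q_proj", "v_proj"],           # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

args = TrainingArguments(
    output_dir="questioner-lora",
    per_device_train_batch_size=64,                # batch size 64 (from the paper)
    learning_rate=1e-3,                            # learning rate 0.001 (from the paper)
    num_train_epochs=1,
)
# trainer = Trainer(model=model, args=args, train_dataset=question_profile_pairs)
# trainer.train()    # question_profile_pairs would hold the D_u pairs (hypothetical)
```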
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate the significant impact of the proposed diffusion-inspired fine-tuning on the LLM's ability to generate effective clarifying questions and the benefit of fine-tuning the user simulator.
6.1.1. Effect of Fine-Tuning (Questioner and User Simulator)
The following are the results from Figure 3(a) of the original paper:
Figure 3: (a) Average BLEU (blue) and ROUGE (red) scores for fine-tuned and non-fine-tuned Questioners paired with different user simulators. (b) Percentage of unanswered questions for the different model settings; the fully non-fine-tuned setting has the highest rate (0.46).
Figure 3(a) compares the performance (measured by ROUGE and BLEU scores) of different configurations of Questioners and user simulators.
- Impact on Questioner: Comparing non-finetuned questions to finetuned questions (assuming a finetuned simulation as the user simulator), there is a substantial improvement. The ROUGE score increases from approximately 0.4 to 0.68, and BLEU increases from 0.28 to 0.49. This indicates that the fine-tuning process, using the data generated by the forward process, makes the Questioner LLM significantly better at asking questions that lead to accurate profile reconstruction.
- Impact on User Simulator: Comparing finetuned questions with finetuned simulation (green bar) against finetuned questions with non-finetuned Gemini simulation (light green bar), the finetuned simulator leads to better performance. This suggests that a specially fine-tuned LLM acting as a user simulator, even if smaller (Gemma) than a general-purpose large LLM (Gemini), can provide more effective and accurate answers within the experimental setup. This highlights the importance of matching the simulation environment to the task.

The following are the results from Figure 3(b) of the original paper:
Figure 3(b) further clarifies these findings by showing the percentage of unanswered questions across different model settings.
- The non-finetuned questions combined with non-finetuned Gemini as a simulator result in the highest percentage of unanswered questions (0.46). This suggests that a general-purpose LLM (even a large one like Gemini) struggles to answer questions accurately from a profile without task-specific fine-tuning, and that a non-fine-tuned Questioner asks less precise questions.
- Fine-tuning the user simulator (comparing the first two bars to the last two bars) significantly reduces the unanswered questions, even for non-finetuned Questioners. This confirms that a well-tuned user simulator is crucial for reliable evaluation and effective interaction. The finetuned questions with finetuned simulation setting yields the lowest percentage of unanswered questions, indicating a highly effective interaction where questions are well-formed and answers are accurately retrieved.
6.1.2. Effect of Number of Questions
The following are the results from Figure 4 of the original paper:
(Figure description) Line charts of BLEU (left) and ROUGE (right) scores versus the number of questions asked; each curve corresponds to a different model and fine-tuning configuration, and scores generally rise as the number of questions increases.
Figure 4: BLEU (left) and ROUGE (right) scores vs. number of questions.
Figure 4 illustrates how BLEU and ROUGE scores evolve as the number of questions asked increases, for different Questioner-simulator combinations.
- Worst Performance: The non-finetuned Questioner with the Gemini simulator (blue lines) consistently performs the worst. This is attributed to both the Questioner asking ineffective questions and the Gemini simulator failing to provide relevant answers.
- Improved Simulator: Replacing the Gemini simulator with a finetuned simulator (orange lines, non-finetuned Q + finetuned sim) boosts performance. This shows that some of the questions asked by the non-fine-tuned model are useful, but their effectiveness is amplified when responded to by a more capable simulator.
- Improved Questioner: Using a finetuned Questioner with the Gemini simulator (green lines, finetuned Q + Gemini sim) also yields a performance boost compared to the non-finetuned Questioner with the Gemini simulator. This directly demonstrates the value of fine-tuning the Questioner to ask more effective clarifying questions.
- Best Performance and Funneling: The best model, the finetuned Questioner paired with the finetuned simulator (red lines), shows superior performance. Crucially, the curves indicate a funneling effect: the model gathers broader information in the initial ~5 turns, leading to a rapid increase in scores, and then shifts to more specific questions in later turns (6-7), further refining the profile. This behavior aligns with the design goal of funnel questions.
6.1.3. Effect of Adding Question History
The following are the results from Figure 5 of the original paper:
Figure 5(a): Comparing the overall performance of models (average BLEU and ROUGE scores) when question history (Q-H) is or is not integrated into user profiles, for fine-tuned and non-fine-tuned Questioners. Scores are generally higher for the fine-tuned model, especially when Q-H is used.
Figure 5 shows the impact of including the question history (Q-H) along with answers in the partial user profiles during the interaction.
- Fine-tuned Model: For the finetuned Questioner, adding question history (Finetune with Q-H) significantly increases performance compared to not including it (Finetune without Q-H). The history helps the model avoid asking repetitive questions and enables it to generate more effective follow-up questions based on the established context.
- Non-fine-tuned Model: In contrast, for the non-finetuned Questioner, adding question history (Non-Finetune with Q-H) decreases performance compared to Non-Finetune without Q-H. While Q-H might reduce repetitiveness, the non-fine-tuned model is not trained to generate effective non-repetitive questions; forced to avoid repetition without proper training, it may ask less relevant or less structured questions.

The following are the results from Figure 6 of the original paper:
(Figure description) A bar chart showing the percentage of repetitive questions under four conditions: non-fine-tuned without Q-H (0.37), non-fine-tuned with Q-H (0.44), fine-tuned without Q-H (0.21), and fine-tuned with Q-H (0.03).
Although this chart is not explicitly referenced in the text as Figure 6, it shows how adding Q-H reduces repetitive questions for the fine-tuned model: the Finetune with Q-H setting has by far the lowest percentage of repetitive questions (0.03), validating the claim that question history helps avoid repetition once the model is trained to exploit it.
6.1.4. Impact of Fine-Tuning Steps on Model Performance
The following are the results from Figure 7 of the original paper:
(Figure description) A bar chart of BLEU (blue) and ROUGE (red) scores for the Questioner model at different numbers of fine-tuning steps (non-fine-tuned, 4000, 28000, and 40000 steps); scores rise markedly as fine-tuning progresses.
Figure 6: BLEU and ROUGE scores of the Questioner model at different fine-tuning steps (0, 4000, 28000, and 40000).
Figure 6 illustrates how BLEU and ROUGE scores change with an increasing number of fine-tuning steps for the Questioner model.
- Improvement with Steps: Both BLEU and ROUGE scores show a clear upward trend as the fine-tuning steps increase from 0 (non-fine-tuned) to 4000, 28000, and 40000. This indicates that continued training helps the Questioner learn to ask better, more effective follow-up questions, which is essential for preference elicitation. The largest jump in performance typically occurs early in training (e.g., from 0 to 4000 steps), with continued but diminishing returns for further steps.
6.1.5. Analyzing the Questions Asked by the Model (Funneling)
The following are the results from Figure 8 of the original paper:
(Figure description) A bar chart of expected weighted-rank values for profile concepts (categories with weighted count >= 300), covering preferences such as escapism, nostalgia, humor, and visual effects. Higher WR values correspond to concepts typically addressed later in the conversation.
To confirm that the fine-tuned Questioner indeed asks questions in a funnel format, the paper analyzes the weighted rank (WR) of concepts (keywords from the JSON profile) across conversations. The WR for each concept is calculated as:
$ WR = \sum_{i=1}^{T} i \times p(i) $

where $T$ is the maximum number of questions the model can ask, and $p(i)$ is the probability that the concept appears at position $i$ (i.e., at turn $i$ of the conversation). A lower WR means the concept is typically asked earlier, indicating a more general concept. A higher WR means the concept is typically asked later, indicating a more specific concept.
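A tiny illustration of the weighted-rank computation with made-up turn counts (the numbers are hypothetical):

```python
def weighted_rank(turn_counts):
    """WR = sum_i i * p(i), where p(i) is the empirical probability that the
    concept is asked about at turn i; turn_counts[i] counts occurrences at turn i+1."""
    total = sum(turn_counts)
    return sum((i + 1) * (c / total) for i, c in enumerate(turn_counts))

print(weighted_rank([50, 30, 10, 5, 5]))   # ~1.85: asked early, a general concept
print(weighted_rank([5, 5, 10, 30, 50]))   # ~4.15: asked late, a specific concept
```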
Figure 7 shows the WR values for various concepts.
- Early Concepts (Lower WR): Concepts like Genre, Film Era, and Decade have lower WR values, indicating that the Questioner tends to ask about these broader aspects earlier in the conversation.
- Mid-Conversation Concepts (Intermediate WR): Concepts such as Directors, Visual Style, and Tone have intermediate WR values, suggesting they are introduced after the initial broad questions.
- Late Concepts (Higher WR): Highly detailed concepts like Special Effects, Humor, and Atmosphere exhibit higher WR values, meaning they are typically addressed later in the conversation.

This progression strongly confirms that the Questioner successfully learns to ask questions in a funnel-like manner, starting from general concepts and gradually moving to more specific details, consistent with the training strategy derived from the forward process.
6.2. Data Presentation (Tables)
The paper primarily presents its results through figures and descriptive text. No specific data tables are provided in the main body of the paper that require transcription using Markdown or HTML table formats. All numerical results are either embedded in the text (e.g., "Rouge from 0.4 to 0.68") or visually represented in the provided figures.
6.3. Ablation Studies / Parameter Analysis
The paper implicitly performs some ablation studies by comparing different configurations:
- Effect of Fine-Tuning: Comparing the non-fine-tuned vs. fine-tuned Questioner (Figures 3a, 4, and 6) shows the effectiveness of the fine-tuning process.
- Effect of User Simulator: Comparing the fine-tuned vs. non-fine-tuned user simulator (Figures 3a, 3b, and 4) highlights the importance of an effective simulation environment.
- Effect of Question History: The comparison of models with and without question history (Figure 5) serves as an ablation to understand the role of conversational context.
- Effect of Fine-Tuning Steps: Figure 6 explicitly analyzes the impact of the number of fine-tuning steps, demonstrating that more training improves performance.

These comparisons act as ablation studies by demonstrating the individual contributions and interactions of different components and training choices. They confirm that the fine-tuning approach, an effective user simulator, the inclusion of question history, and sufficient training steps are all crucial for the model's success in generating effective preference elicitation questions.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces a novel and effective method for training Large Language Models (LLMs) to ask sequential, funnel-shaped clarifying questions for preference elicitation in conversational recommendation systems. Inspired by diffusion models, the approach involves a forward process to corrupt complete user profiles by systematically removing information (answers to generated questions) and a reverse process where an LLM (Questioner) is fine-tuned to denoise (reconstruct) these profiles by learning to ask the appropriate questions. Key findings include a significant improvement in the LLM's ability to generate domain-specific and contextually relevant follow-up questions, leading to more accurate user profile reconstruction. The method successfully teaches the LLM to adopt a funneling strategy, starting with general questions and progressing to more specific ones, mimicking natural conversational flow. Additionally, fine-tuning an LLM to act as a user simulator was shown to be crucial for robust evaluation and improved interaction quality.
7.2. Limitations & Future Work
The authors highlight the potential of their model to "advance personalized user interactions and open new avenues for adaptive learning in LLMs." While the paper doesn't explicitly list limitations within a dedicated section, several implicit limitations and future work directions can be inferred:

- Domain Specificity: The experiments are conducted solely in the movie recommendation domain. While the authors claim applicability "across various domains," further validation would be needed to confirm this generalizability without significant re-training or adaptation.
- Synthetic User Simulation: The evaluation relies on an LLM-based user simulator rather than real human users. While this is a practical approach for research, LLM simulators might not perfectly capture the nuances, inconsistencies, or evolving nature of real human preferences and conversational behavior. This could lead to an optimistic evaluation.
- Data Generation Reliance: The forward process for generating training data depends on a powerful LLM (Gemini 2.0). The quality and biases of this initial LLM directly affect the quality of the generated training data and, consequently, the performance of the fine-tuned Questioner.
- Fixed Question Limit: The evaluation process sets a limit of 10 questions. While practical, real-world PE might require more or fewer questions depending on the complexity of preferences and user engagement.
- Definition of "Good Question": While ROUGE and BLEU measure textual similarity to a ground-truth profile, they don't fully capture all aspects of a "good" preference elicitation question, such as user satisfaction, perceived helpfulness, or efficiency in real user interactions.

Future work could involve:

- Conducting user studies with real users to validate the effectiveness and user experience of the Questioner in real-world scenarios.
- Exploring the generalizability of the method to diverse domains beyond movie recommendations.
- Investigating alternative noise mechanisms or diffusion processes for discrete data, potentially leading to more robust or efficient PE strategies.
- Integrating user feedback into the training loop, possibly through reinforcement learning from human feedback (RLHF), to make the LLM's questions even more aligned with human preferences.
- Developing more sophisticated user simulators that can model dynamic user behavior or uncertainty more accurately.
7.3. Personal Insights & Critique
This paper presents a highly inspiring and innovative application of diffusion models to a critical problem in conversational AI and recommender systems. The idea of framing preference elicitation as a denoising process is elegant and conceptually powerful.
Strengths:

- Novelty: The diffusion-inspired framework for sequential question generation is a significant methodological contribution, offering a structured way to train LLMs for a complex conversational task.
- Addressing Cold-Start: By enabling effective PE, the method directly tackles the cold-start problem and enhances personalization, a long-standing challenge in RSs.
- Structured Questioning: The funneling effect achieved by the Questioner is a major win. It reflects a more intuitive and efficient conversational strategy for gathering information, improving user experience.
- Practical Training Data Generation: Using a larger LLM to generate synthetic training data for a smaller LLM is a clever and scalable way to develop specialized models without requiring extensive human annotation.
- Impact of User Simulator: The clear demonstration of how a fine-tuned user simulator improves evaluation validity and model performance is an important finding for future research in dialogue systems.
Critique/Areas for Improvement:

- Real User Validation: The most significant missing piece is validation with real human users. While LLM simulators are useful, they operate within a defined ground truth and might not fully capture the variability, cognitive load, or subjective nature of human responses to clarifying questions. A user study would be crucial to confirm perceived helpfulness, satisfaction, and actual preference coverage.
- Complexity of Profile Representation: While JSON is structured, the conversion from natural language to JSON and the subsequent mapping could introduce ambiguities or lose nuanced information. The quality of this initial LLM-driven structuring is paramount.
- Bias Propagation: If the LLM used for initial profile generation or question generation (Gemini 2.0) carries biases, these could propagate through the training data and affect the fine-tuned Questioner's questioning strategy, potentially leading to biased preference elicitation.
- Interpretability of Weighted Rank: While the weighted rank analysis of funneling is insightful, further qualitative analysis (e.g., examples of early vs. late questions generated by the model) could strengthen the argument for human-like conversational flow.

Transferability: The core methodology of generating structured conversational training data via a forward (corruption) process and then fine-tuning an LLM in a reverse (denoising) process could be applied to various other dialogue-system tasks where sequential information gathering is crucial. Examples include:

- Diagnostic Chatbots: Training chatbots to ask sequential diagnostic questions for medical or technical troubleshooting.
- Customer Service Automation: Enabling LLMs to efficiently gather necessary information from customers before escalating to human agents.
- Educational Tutors: Creating LLM-based tutors that ask adaptive questions to assess a student's understanding and guide their learning path.

Overall, this paper provides a robust and promising framework that pushes the boundaries of LLM capabilities in conversational preference elicitation, opening exciting avenues for more intelligent and adaptive recommendation systems.