CRS-Que: A User-centric Evaluation Framework for Conversational Recommender Systems
TL;DR Summary
This paper presents CRS-Que, a user-centric evaluation framework for conversational recommender systems built on ResQue. It integrates conversation-related UX metrics and validates their effectiveness and reliability across different scenarios, highlighting the interaction between conversation and recommendation constructs in shaping the overall user experience.
Abstract
An increasing number of recommendation systems try to enhance the overall user experience by incorporating conversational interaction. However, evaluating conversational recommender systems (CRSs) from the user’s perspective remains elusive. This article presents our proposed unifying framework, CRS-Que, to evaluate the user experience of CRSs. This new evaluation framework is developed based on ResQue, a popular user-centric evaluation framework for recommender systems. Additionally, it includes user experience metrics of conversation (e.g., understanding, response quality, humanness) under two dimensions of ResQue (i.e., Perceived Qualities and User Beliefs). Following the psychometric modeling method, we validate our framework by evaluating two conversational recommender systems in different scenarios: music exploration and mobile phone purchase. The results of the two studies support the validity and reliability of the constructs in our framework and reveal how conversation constructs and recommendation constructs interact and influence the overall user experience of the CRS.
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of the paper is the development and validation of a user-centric evaluation framework for conversational recommender systems (CRSs), named CRS-Que.
1.2. Authors
The authors are Yucheng Jin, Li Chen, Wanling Cai, and Xianglin Zhao, all affiliated with Hong Kong Baptist University, China. Their research backgrounds likely involve recommender systems, human-computer interaction, natural language processing, and user experience evaluation.
1.3. Journal/Conference
The paper was published in ACM Trans. Recomm. Syst. 2, 1, Article 2 (March 2024). The ACM Transactions on Recommender Systems (TORS) is a highly reputable and influential journal in the field of recommender systems. Publishing here signifies that the research has undergone rigorous peer review and is considered a significant contribution to the academic community.
1.4. Publication Year
The publication year is 2024. However, the Published at (UTC) metadata indicates 2023-11-02T00:00:00.000Z, suggesting it was available online or accepted in late 2023 for a 2024 print/volume.
1.5. Abstract
This article addresses the challenge of evaluating conversational recommender systems (CRSs) from a user's perspective. It introduces CRS-Que, a novel, unifying framework designed for user experience (UX) evaluation of CRSs. CRS-Que is an extension of ResQue, a well-known user-centric framework for traditional recommender systems. The key innovation lies in its integration of user experience metrics specific to conversation (such as understanding, response quality, and humanness) into ResQue's Perceived Qualities and User Beliefs dimensions. The framework was validated using a psychometric modeling method through two user studies, conducted in distinct scenarios: music exploration and mobile phone purchase. The validation results confirm the reliability and validity of the framework's constructs and highlight the intricate interplay between conversational and recommendation constructs in shaping the overall user experience of a CRS.
1.6. Original Source Link
The original source link provided is /files/papers/691ee0472c2d75f725911eb5/paper.pdf. Given the publication in ACM Transactions on Recommender Systems, this link likely points to the author's version or a preprint, while the official version is available via the ACM Digital Library. The paper is officially published.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the elusive and often inconsistent evaluation of conversational recommender systems (CRSs) from a user's perspective. Traditional recommender systems often rely on graphical user interfaces (GUIs) and objective metrics (e.g., accuracy, recall), but CRSs introduce natural language interaction, fundamentally changing how users engage with recommendations.
This problem is important because CRSs are designed to enhance the overall user experience (UX) by enabling more natural and interactive ways for users to express preferences, receive explanations, and provide feedback. However, existing user-centric evaluation frameworks for traditional recommender systems primarily focus on recommendations and overlook the conversational aspect. While recent CRS evaluations often combine questions from general RS frameworks and conversational agent questionnaires, this ad-hoc approach lacks standardization, making it difficult to compare findings across different studies. There is a clear gap for a unified, standardized framework specific to dialogue-based CRSs.
The paper's entry point is to extend an existing, popular, and well-validated user-centric framework for recommender systems, ResQue, by systematically incorporating user experience metrics relevant to conversational interaction. This approach aims to provide a holistic view of CRS evaluation that considers both recommendation quality and conversational quality.
2.2. Main Contributions / Findings
The paper makes four primary contributions:
1. Development of a Consolidated and Unifying User-centric Evaluation Framework (CRS-Que): The authors developed CRS-Que by extending ResQue to specifically evaluate conversational recommender systems from users' perspectives. The new framework systematically integrates critical conversation constructs (e.g., CUI Understanding, CUI Response Quality, CUI Humanness) into ResQue's dimensions, revealing how conversational and recommendation constructs interact and influence the overall user experience.
2. Psychometric Validation through Two User Studies: The framework was rigorously validated using psychometric research methodology, involving two distinct user studies. These studies were conducted under different experimental conditions, scenarios (music exploration and mobile phone purchase), and platforms (desktop web and mobile application), demonstrating the framework's robustness.
3. Re-validation and Adaptation of ResQue: The evaluation results not only validate CRS-Que but also re-validate ResQue when recommendations are delivered via conversation. The studies reveal how the natural language interaction method changes and adapts the original ResQue framework, highlighting new relationships and influences between constructs.
4. Standardized User-centric Research and Evaluation Approach: CRS-Que provides a standardized methodology for researchers to conduct user-centric studies on conversational recommender systems and offers practitioners actionable insights for designing and evaluating CRSs based on user perceptions of both conversation and recommendation quality.

The key findings demonstrate that CRS-Que possesses good validity and reliability in assessing the UX of CRSs. Conversation constructs fit naturally into ResQue's Perceived Qualities and User Beliefs dimensions. Furthermore, the studies uncover how these conversation constructs interact with recommendation constructs and influence User Attitudes and Behavioral Intentions, underscoring the interconnectedness of conversational and recommendation aspects in shaping the overall user experience. For instance, Novelty and CUI Adaptability positively influence Perceived Usefulness and CUI Rapport, and CUI Rapport can directly influence Intention to Use.
3. Prerequisite Knowledge & Related Work
This section outlines the foundational concepts and prior research necessary for a comprehensive understanding of the CRS-Que framework.
3.1. Foundational Concepts
- Recommender Systems (RS): Software tools and techniques providing suggestions for items that might be of interest to a user. These systems analyze user preferences and behaviors to predict ratings or preferences for items. Traditional RS often rely on explicit (e.g., ratings) or implicit (e.g., clicks, purchases) feedback through graphical user interfaces (GUIs).
- Conversational Recommender Systems (CRS): An evolution of traditional recommender systems that enable users to interact with recommendations using natural human language (text or voice). Instead of just clicking buttons, users can express preferences, ask questions, and provide nuanced feedback in a multi-turn dialogue, making the interaction more natural and dynamic.
- User Experience (UX): A broad term encompassing all aspects of an end-user's interaction with a company, its services, and its products. In the context of software, it refers to a user's emotions, attitudes, and perceptions about using a particular system, including its usability, accessibility, and utility.
- Psychometric Modeling Method: A scientific approach to the measurement of psychological characteristics, such as user perceptions, attitudes, and beliefs. It involves developing questionnaires (scales), testing their reliability (consistency of measurement), and validity (whether the scale measures what it intends to measure).
- Structural Equation Modeling (SEM): A multivariate statistical technique that combines aspects of factor analysis and multiple regression to estimate a series of interrelated dependence relationships simultaneously. It is used to test complex theoretical models and examine causal relationships between observed (measured) variables and latent (unmeasured) variables.
- Confirmatory Factor Analysis (CFA): A statistical technique used to verify the factor structure of a set of observed variables. It is a subset of SEM that deals specifically with measurement models, confirming how observed variables (e.g., questionnaire items) load onto latent variables (e.g., Perceived Usefulness). CFA helps establish construct validity and reliability.
- Latent Variable: A variable that cannot be directly observed or measured (e.g., Trust, Perceived Usefulness) but is inferred from observed variables (e.g., responses to specific questionnaire items).
- Observed Variable (Indicator): A directly measurable variable (e.g., a specific question on a Likert scale) that is used to assess a latent variable.
- Likert Scale: A psychometric scale commonly used in questionnaires to measure attitudes or opinions. It typically offers a range of options (e.g., 1 to 7), from Strongly Disagree to Strongly Agree.
- Cronbach's Alpha: A measure of internal consistency, or how closely related a set of items are as a group. It is considered a measure of scale reliability. A higher value (typically above 0.7, though the paper uses 0.5 as a moderate level) indicates that the items consistently measure the same underlying construct. (A computational sketch of Cronbach's alpha and AVE follows this concept list.)
- Average Variance Extracted (AVE): A measure used to assess convergent validity. It quantifies the amount of variance captured by a construct relative to the amount of variance due to measurement error. An AVE value above 0.5 generally indicates good convergent validity.
- Factor Loading: The correlation between an observed variable and its corresponding latent factor. It indicates how strongly an item measures the intended construct. Higher factor loadings (typically above 0.4 or 0.5) are desired.
- Discriminant Validity: The extent to which a construct is distinct from other constructs. It is established if a construct shares more variance with its own measures than with other constructs in the model.
- Convergent Validity: The extent to which a measure correlates positively with other measures of the same construct. It is established if indicators that are theoretically supposed to measure the same construct indeed converge or load highly on that construct.
- Statistical Significance (p-value): A measure used to determine the probability of obtaining the observed results if the null hypothesis were true. A p-value less than a predefined threshold (e.g., 0.05, 0.01, 0.001) indicates that the results are unlikely to have occurred by chance, and thus the effect is considered statistically significant.
  - p < .001: Highly significant
  - p < .01: Very significant
  - p < .05: Significant
  - p < .1: Marginally significant (often denoted by †)
- Goodness-of-Fit Indices (for SEM): Metrics used to assess how well a hypothesized model fits the observed data.
  - χ² (Chi-Square): An absolute fit index. A non-significant p-value (ideally p > .05) indicates a good fit, but χ² is highly sensitive to sample size; the paper reports a significant χ² while noting this sensitivity.
  - TLI (Tucker-Lewis Index) / NNFI (Non-Normed Fit Index): A relative fit index. Values above 0.90 (ideally 0.95) indicate a good fit.
  - CFI (Comparative Fit Index): A non-centrality-based index that compares the fit of the hypothesized model to a baseline model. Values above 0.90 (ideally 0.95) indicate a good fit.
  - RMSEA (Root Mean Square Error of Approximation): An absolute fit index. Values below 0.08 (ideally 0.05 or 0.06) indicate a good fit, reported with a 90% confidence interval (CI).
- R² (Coefficient of Determination): In SEM, the R² of a dependent variable indicates the proportion of its variance explained by its predictor variables in the model; higher values mean better explanatory power.
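To make the two scale statistics used throughout this paper concrete, here is a minimal Python sketch (synthetic data for illustration only; this is not code from the paper) that computes Cronbach's alpha from raw Likert responses and AVE from standardized factor loadings:

```python
# Minimal sketch: Cronbach's alpha and AVE, assuming a (respondents x items)
# matrix of 7-point Likert answers and standardized loadings from a CFA.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def average_variance_extracted(loadings: np.ndarray) -> float:
    """AVE = mean of the squared standardized loadings of one construct's items."""
    return float(np.mean(loadings ** 2))

rng = np.random.default_rng(0)
signal = rng.integers(3, 8, size=(100, 1))                   # shared construct signal
responses = np.clip(signal + rng.integers(-1, 2, size=(100, 4)), 1, 7)
print(cronbach_alpha(responses))                             # high: items move together
print(average_variance_extracted(np.array([0.81, 0.85, 0.77])))  # ~0.657, above the 0.5 bar
```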
3.2. Previous Works
The paper builds upon and differentiates itself from several lines of prior research:
3.2.1. User-centric Evaluation Frameworks for Recommender Systems
- ResQue [95]: This is the foundational framework upon which CRS-Que is built. ResQue is a unifying evaluation framework developed from the Technology Acceptance Model (TAM) [25] and the Software Usability Measurement Inventory (SUMI) [62]. It evaluates traditional recommender systems across four dimensions: Perceived Qualities, User Beliefs, User Attitudes, and Behavioral Intentions. It includes constructs like Explanation, Interaction Adequacy, Recommendation Accuracy, Transparency, Trust, and Intention to Purchase. The original ResQue model (Figure 2) contains 15 constructs and 32 questions. The following figure (Figure 2 from the original paper) shows a structural equation model of ResQue:

  (Figure 2: A structural equation model relating perceived qualities, user beliefs, user attitudes, and behavioral intentions, with indicators and path coefficients such as recommendation accuracy, transparency, and trust & confidence influencing overall satisfaction and intention to use.)

- Knijnenburg et al. [68]: This framework provides a more holistic view of user behavior in recommender systems, explaining it through a set of constructs categorized into objective system aspects, subjective system aspects (perceived qualities), experience constructs, personal characteristics (e.g., demographics, domain knowledge), and situational characteristics (e.g., privacy concern, choice goal). This work highlights the importance of context and individual differences in UX.
3.2.2. UX Metrics of Conversational Agents (CA)
- PARADISE [125]: A popular general performance model for evaluating spoken dialogue agents, including subjective user satisfaction and objective metrics for dialogue efficiency, dialogue quality, and task success.
- Radziwill and Benton [97]: Proposed quantifying CA quality across four aspects: performance, humanity, affect, and accessibility, using an Analytic Hierarchy Process (AHP) for metric selection.
- Metrics for Embodied CAs [101]: Introduced additional metrics like likeability, entertainment, engagement, helpfulness, and naturalness, reflecting the quality of communication in agents with physical or visual presence.
- Kuligowska [70]: Proposed sophisticated metrics for commercial CAs, such as visual look, conversational abilities, language skills, and context sensitiveness.
- Response Quality [50]: Assessed by content informativeness and interaction fluency, which can influence perceived humanness [104].
- Quality of Service (QoS) and Quality of Experience (QoE) [35]: Metrics for task-oriented chatbots, assessing the impact of interaction strategies on user experience.
- Rapport Theory [116, 138]: Characterizes interaction into dimensions of rapport such as positivity, attentiveness, and coordination, based on communication theories.
- PEACE Model [113]: Identifies politeness, entertainment, attentive curiosity, and empathy as essential qualities for open-domain chatbots influencing user intention to use.
3.3. Technological Evolution
The evolution in this field has moved from traditional graphical user interface (GUI)-based recommender systems, which primarily focused on passive feedback (e.g., clicks, ratings) and objective algorithmic metrics, to more interactive and natural language-driven conversational recommender systems. Early CRSs might have used GUI widgets for critiquing [17, 77], but recent advancements in natural language processing (NLP) have enabled dialogue-based CRSs that support multi-turn natural language interactions [8, 60]. This shift necessitates a re-evaluation of how these systems are assessed, moving beyond purely objective performance metrics to comprehensive user-centric evaluations that account for the nuances of human-computer conversation. The paper's work fits into the current state by providing a much-needed standardized user-centric evaluation framework for these advanced dialogue-based CRSs.
3.4. Differentiation Analysis
Compared to the main methods in related work, the core differences and innovations of this paper's approach are:
- Holistic Integration: While previous user evaluations of CRSs often cobbled together questions from existing RS frameworks and CA questionnaires [9, 51, 90], CRS-Que offers a consolidated and unifying framework. It systematically integrates relevant conversational UX metrics directly into ResQue's established structure, rather than just appending them.
- Focus on Dialogue-based CRSs: Unlike earlier CRS work that might have included GUI-based critiquing systems, CRS-Que is specifically developed for and validated with text-based, natural-language-interaction CRSs.
- Psychometric Rigor: The framework development follows a rigorous psychometric research methodology, including Confirmatory Factor Analysis (CFA) and Structural Equation Modeling (SEM), to ensure the validity and reliability of its constructs and the hypothesized relationships. This scientific approach provides a robust foundation often missing in ad-hoc evaluation strategies.
- Revealing Interaction Effects: CRS-Que goes beyond simply listing metrics; it reveals how conversation constructs and recommendation constructs interact and influence the overall user experience. For example, it identifies paths showing how the Novelty of recommendations can influence CUI Rapport, or how Explainability can affect CUI Attentiveness.
- Standardization: By offering a validated, unified framework, the paper addresses the lack of standardization in CRS evaluation, enabling better comparison and generalization of findings across different studies and systems.

In essence, CRS-Que innovates by creating a theoretically grounded, empirically validated, and practically applicable framework that bridges the gap between traditional RS evaluation and conversational agent evaluation, specifically tailored for the unique challenges of conversational recommender systems.
4. Methodology
4.1. Principles
The core idea behind CRS-Que is to provide a comprehensive, user-centric evaluation framework for conversational recommender systems (CRSs) by extending an existing, widely accepted framework for traditional recommender systems, ResQue [95]. The theoretical basis and intuition are that a CRS's user experience is not solely determined by the quality of its recommendations but also significantly by the quality of its conversational interaction. Therefore, any evaluation framework must account for both aspects and their interplay. The development adheres to psychometric modeling principles to ensure the reliability and validity of the proposed constructs and their relationships.
4.2. Core Methodology In-depth (Layer by Layer)
The development of CRS-Que involved several key stages, starting from conceptual design, integration of new constructs, and rigorous empirical validation.
4.2.1. Framework Development: Integrating Conversation Constructs into ResQue
The CRS-Que framework is built upon the four dimensions of ResQue [95]: Perceived System Qualities, User Beliefs, User Attitudes, and Behavioral Intentions. The primary innovation is the inclusion of user experience metrics specific to conversation. The authors reviewed existing UX metrics for conversational agents and identified eight critical constructs. These were then integrated into ResQue, primarily within the Perceived Qualities and User Beliefs dimensions. A prefix CUI (Conversational User Interface) is used to distinguish conversation-related constructs.
The following figure (Figure 3 from the original paper) shows the general evaluation framework CRS-Que with hypothesized relationships:

Constructs of CRS-Que:
The framework contains constructs from ResQue (gray boxes in Figure 3) and newly introduced conversation-related constructs (blue boxes in Figure 3).
4.2.1.1. Perceived Qualities
This dimension measures how users perceive the significant characteristics of the system, including qualities of recommendations and qualities of conversation.
- Omitted ResQue constructs: Diversity, Interface Adequacy, and Information Sufficiency were omitted due to the unique characteristics of CRSs (e.g., CRSs often present single items, focus on natural language interaction over GUI elements, and CUI Response Quality can cover informativeness).
- Recommendation-focused constructs:
  - Accuracy: Measures users' perception of how well recommendations match their interests. It complements objective accuracy metrics.
  - Novelty: Measures the extent to which recommendations are new or unknown to users, supporting exploration and discovery. This is often discussed with "serendipity."
  - Interaction Adequacy: Measures the system's ability to elicit and refine user preferences through interaction. This is integral to CRSs, which often use dialogue for preference elicitation.
  - Explanation: Measures the system's ability to explain its recommendations, which improves trustworthiness and transparency.
- Conversation-focused constructs: These are derived from rapport theory [116] and other CA metrics.
  - CUI Positivity: The first component of rapport, measuring perceived mutual friendliness and caring in communication (e.g., tone, vocabulary).
  - CUI Attentiveness: The second component of rapport, measuring whether the system establishes a focused interaction by expressing mutual attention.
  - CUI Coordination: The third component of rapport, examining whether communication is synchronous and harmonious. It is more critical in later stages of communication.
  - CUI Adaptability: Measures the system's ability to adapt to users' behavior and preferences during conversation, often linked to personalization (e.g., adapting replies to emotions, historical behavior, or item preferences).
  - CUI Understanding: A key performance indicator for CAs, measuring the system's ability to comprehend user intents. The framework focuses on user-perceived understanding.
  - CUI Response Quality: Refers to the informativeness of content and the fluency/pace of interaction. This is frequently used to assess chatbot responses.
4.2.1.2. User Beliefs
This dimension reflects a higher level of user perception, often influenced by Perceived Qualities, and relates to the CRS's effectiveness in supporting tasks like decision-making. It includes ResQue constructs and two conversation-specific constructs.
- Recommendation-focused constructs:
  - Perceived Ease of Use: Measures how physically and mentally easy it is for users to use the CRS. Subjective questions are used.
  - Perceived Usefulness: Measures the system's competence in supporting users to perform tasks (e.g., decision-making).
  - User Control: Measures the level of controllability users perceive while interacting with the recommender.
  - Transparency: Enables users to understand the internal logic of the recommendation process, relating closely to User Control and Explanation.
- Conversation-focused constructs:
  - CUI Rapport: An overall measure of rapport perceived during communication with the CA, encompassing Positivity, Attentiveness, and Coordination.
  - CUI Humanness: An overall quality measure assessing the extent to which the agent behaves like a human. It is influenced by various design factors (e.g., anthropomorphic cues, conversational skills).
4.2.1.3. User Attitudes
This dimension assesses users' overall feelings towards the CRS, which are less likely to be influenced by short-term experience.
- Trust & Confidence: Trust significantly influences RS success and can be affected by recommendations, conversations, or both. Confidence measures the user's belief in accepting the recommendation.
- Satisfaction: An overall measure of users' attitudes and opinions toward the CRS.
4.2.1.4. Behavioral Intentions
This dimension relates to user loyalty and measures the likelihood of future use, purchase of recommendations, and recommending the system to others.
- Intention to Use: Measures the likelihood of users using the system again.
- Intention to Purchase: Measures the likelihood of users accepting/purchasing the recommended items.
4.2.2. Validation Approach
The framework was validated using a psychometric approach involving two user studies.
4.2.2.1. Measurements
All constructs in CRS-Que are measured subjectively using questionnaires.
- Questionnaire Design: Evaluation constructs were developed based on existing UX metrics for recommenders and conversational systems. To ensure robust measurement for CFA, at least three questions were designed per construct to provide minimum coverage of the construct's theoretical domain. Some self-composed questions are marked with * in Tables 2 and 4.
- Scale: All questions are rated on a 7-point Likert scale (from Strongly Disagree to Strongly Agree).
- Attention Checks: Questions designed to filter out inattentive responses (e.g., "Please respond to this question with '2'"); a filtering sketch follows this list.
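As a concrete illustration of these measurement rules, the following pandas sketch applies the attention-check filter described above, plus the straight-line filter reported in Study 2; all column names and data are hypothetical, not the paper's actual survey export:

```python
# Hedged sketch: filter out inattentive survey responses before CFA/SEM.
import pandas as pd

df = pd.DataFrame({
    "q_novelty_1": [6, 4, 5, 5],   # 7-point Likert items
    "q_novelty_2": [7, 4, 5, 6],
    "q_attention": [2, 2, 5, 2],   # instructed: "Please respond with '2'"
})

passed_attention = df["q_attention"] == 2
likert_cols = ["q_novelty_1", "q_novelty_2"]
varied_answers = df[likert_cols].nunique(axis=1) > 1   # drops straight-liners

valid = df[passed_attention & varied_answers]
print(valid)   # keeps respondents 0 and 3
```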
4.2.2.2. System Manipulation
To investigate the framework's ability to capture variations in UX, prominent design factors of CRSs were manipulated in the studies:
- Critiquing Initiative: User-initiated vs. System-suggested (Study 1).
- Explanation Display: True (with explanation) vs. False (without explanation) (Study 2).
- Humanization Level: Low vs. High (Study 2).

The manipulations were designed to isolate the effects of specific design factors, with one version often serving as a baseline.
4.2.2.3. Study Design
- Design Type: Between-subjects design was chosen to avoid carryover effects and reduce participant burden from repeated long questionnaires.
- Recruitment: Participants were recruited from Prolific, a platform popular for academic surveys.
- Pre-screening Criteria:
- Fluent in English.
- More than 100 previous submissions.
- Approval rate greater than 95%.
- Ethical Approval: The study was approved by the Research Ethics Committee (REC) of the university.
- Procedure:
- Participants sign a consent form (GDPR compliant).
- Read a brief introduction to the experimental CRS.
- Fill out a pre-study questionnaire.
- Try the system.
- Perform a specific task (e.g., create a music playlist, add mobile phones to a shopping cart).
- Fill out a post-study questionnaire based on CRS-Que.
4.2.2.4. Analysis Method
- Confirmatory Factor Analysis (CFA):
  - Purpose: To establish internal reliability, convergent validity, and discriminant validity of the constructs.
  - Process: Iterative adjustment of the model by:
    - Removing indicators (question items) with low factor loadings (e.g., if the AVE of a factor is less than 0.4).
    - Merging highly correlated constructs (correlation greater than 0.85) to ensure discriminant validity.
  - Requirements: A latent variable (construct) should have at least three indicators (questions).
  - Reliability Metrics: Cronbach's alpha (above 0.5 for a moderate level) and corrected item-total correlations (above 0.4) were used to measure internal reliability.
- Structural Equation Modeling (SEM):
- Purpose: To investigate the relationships (causal paths) between constructs within and across dimensions, validating the hypothesized model.
- Advantages of SEM:
- Estimating latent variables (variables that cannot be directly measured) via observed variables.
- Taking measurement error into account in the model.
- Validating multiple hypotheses simultaneously as a whole.
- Testing a model regarding its fit to the data. (A workflow sketch covering both the CFA and SEM steps follows this list.)
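The sketch below shows how this two-stage CFA + SEM analysis can be run with the open-source semopy library; the construct names mirror CRS-Que, but the data file, item columns, and the choice of semopy are assumptions — the paper does not state which SEM software the authors used:

```python
# Hedged sketch of the CFA + SEM workflow with semopy (pip install semopy).
import pandas as pd
import semopy

MODEL_DESC = """
# Measurement model (the CFA step): items load on latent constructs.
Novelty =~ nov1 + nov2 + nov3
PerceivedUsefulness =~ pu1 + pu2 + pu3
IntentionToUse =~ use1 + use2 + use3

# Structural model (the SEM step): hypothesized paths between constructs.
PerceivedUsefulness ~ Novelty
IntentionToUse ~ PerceivedUsefulness
"""

df = pd.read_csv("responses.csv")     # hypothetical file: one row per participant
model = semopy.Model(MODEL_DESC)
model.fit(df)

print(model.inspect())                # loadings, path estimates, p-values
print(semopy.calc_stats(model).T)     # chi2, CFI, TLI, RMSEA, etc.
```

The `=~` block encodes the measurement model that CFA validates, while the `~` block encodes the directional hypotheses that SEM tests simultaneously.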
4.2.2.5. Visual Presentation of SEM
SEM results are presented visually with:
- Boxes: Representing constructs (latent variables).
- Arrows: Representing significant relationships (causal paths).
- Single-headed Arrows: Indicate directional relationships. Each is associated with:
- Parameter β: The regression coefficient, indicating the amount of change in a dependent variable (y) attributed to a unit change in an independent variable (x). Large coefficient values (> 1) do not necessarily signal multicollinearity, as β is not standardized.
- Number in Parentheses: Standard error of the regression coefficient.
- Double-headed Arrows: Represent correlations between two variables, showing estimate and standard error of covariance.
- Significance Levels: Denoted by stars (*** for p < .001, ** for p < .01, * for p < .05, and † for p < .1).
- Color Coding:
- Orange: System design factors.
- Gray: Recommendation constructs.
- Blue: Conversation constructs.
- White: User attitudes and behavioral intentions constructs.
- R²: The proportion of variance explained by the model for endogenous constructs.
5. Experimental Setup
The paper conducted two user studies to validate CRS-Que across different scenarios and system manipulations.
5.1. Study 1: MusicBot for Music Exploration
5.1.1. Datasets
The scenario for Study 1 was music exploration, a task typically involving low user involvement in decision support. The recommendation component was powered by the Spotify recommendation service, which accesses a large database of music.
5.1.2. System Description
- System Name: MusicBot.
- Type: A critiquing-based conversational recommender system for music exploration. Critiquing involves users providing feedback on recommendations to refine preferences.
- Interface: Desktop web application, consisting of a rating widget (Figure 4(A)), the MusicBot dialogue window (Figure 4(B)), and an instruction panel (Figure 4(C)). The following figure (Figure 4 from the original paper) shows the user interface of MusicBot:

  (Figure 4: The MusicBot user interface, showing examples of user-initiated and system-suggested music recommendation interactions, along with an instruction panel explaining how to adjust recommendations. Users can personalize recommendations by rating a song's energy, danceability, and mood.)

- Interaction:
  - Users can critique songs using natural language (e.g., "I need a song with higher energy"; user-initiated critiquing, UC).
  - The system can suggest critiques (e.g., "Compared to the last played song, do you like the song of lower tempo?"; system-suggested critiquing, SC).
  - Buttons are provided for Like, Next, and Let bot suggest.
- Technical Details (a computational sketch follows this list):
  - Recommendation logic: Based on Multi-Attribute Utility Theory (MAUT) [136], with diversity calculated using Shannon's entropy [135].
  - Natural language understanding (NLU): Enabled by the Dialogflow ES (standard) API.
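To ground the two techniques named in the technical details, here is a minimal sketch under assumed attribute names (Spotify-style audio features); it illustrates MAUT scoring and entropy-based diversity generically, not MusicBot's actual implementation:

```python
# Hedged sketch: MAUT utility scoring and Shannon-entropy diversity.
import numpy as np

def maut_score(song: dict, weights: dict) -> float:
    """Weighted additive utility over normalized (0-1) song attributes."""
    return sum(weights[attr] * song[attr] for attr in weights)

def shannon_entropy(shares: np.ndarray) -> float:
    """Diversity of a playlist, e.g., over its distribution of genres."""
    p = shares[shares > 0]
    return float(-(p * np.log2(p)).sum())

weights = {"energy": 0.5, "danceability": 0.3, "valence": 0.2}  # elicited preferences
song = {"energy": 0.8, "danceability": 0.6, "valence": 0.4}
print(maut_score(song, weights))                      # 0.66: rank candidates by utility
print(shannon_entropy(np.array([0.5, 0.25, 0.25])))   # 1.5 bits for a 3-genre mix
```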
5.1.3. Manipulation
The study manipulated the critiquing initiative:
- User-initiated Critiquing (UC): Users freely type their preferences.
- System-suggested Critiquing (SC): The bot suggests critiques, which users can accept or reject.
5.1.4. Research Questions
- RQ1: How does the critiquing initiative (user-initiated vs. system-suggested) influence users' perceived qualities of recommendations and conversations?
- RQ2: How do the changes in Perceived Qualities influence the constructs of other dimensions (i.e., User Beliefs, User Attitudes, and Behavioral Intentions)?
5.1.5. Participants
- Total Recruited: 265 from Prolific.
- Filtered Out: 38 for extremely long duration, 54 for failing attention checks.
- Valid Participants: 173 (meeting the 5:1 subjects-to-observable-variables rule of thumb for CFA/SEM).

The following are the demographics from Table 1 of the original paper:

| Item | Category | Frequency | Percentage (%) |
|---|---|---|---|
| Age | 19-25 | 80 | 46.24 |
| | 26-30 | 35 | 20.23 |
| | 31-35 | 19 | 10.98 |
| | 41-50 | 13 | 7.51 |
| | 36-40 | 13 | 7.51 |
| | 51-60 | 9 | 5.20 |
| | >60 | 4 | 2.31 |
| Gender | Male | 90 | 52.02 |
| | Female | 80 | 46.24 |
| | Other | 3 | 1.73 |
| Nationality | UK | 41 | 23.70 |
| | USA | 38 | 21.97 |
| | Portugal | 18 | 10.40 |
| | Poland | 15 | 8.67 |
| | Italy | 13 | 7.51 |
| | Others | 48 | 27.73 |
5.1.6. Evaluation Metrics & Constructs (after CFA/SEM Adjustments)
The initial CFA process led to adjustments to ensure validity and reliability.
- Dropped Constructs: Constructs with only a single item (e.g., Accuracy, Explainability, CUI Attentiveness, CUI Engagingness) were dropped for Study 1, as a single item cannot assess measurement error or validate a scale in a new context.
- Merged Constructs: Strongly correlated constructs were merged to preserve discriminant validity:
  - CUI Positivity & CUI Rapport were merged into CUI Rapport.
  - CUI Adaptability & CUI Coordination were merged into CUI Adaptability.
  - Trust & Confidence were merged into Trust & Confidence.
- Validated Constructs (8 total):
  - Perceived Qualities: Novelty, Interaction Adequacy, CUI Adaptability, CUI Response Quality.
  - User Beliefs: Perceived Usefulness, CUI Rapport.
  - User Attitudes: Trust & Confidence.
  - Behavioral Intentions: Intention to Use.

The following are the reliability results for the latent factors (constructs) validated in Study 1, from Table 2 of the original paper:

| Construct / Question item | Items | Cronbach's α (≥0.5) | Item-total corr. (≥0.4) | Factor loading (R²) (≥0.4) | AVE (≥0.4) |
|---|---|---|---|---|---|
| **Perceived Qualities** | | | | | |
| 1. Novelty [76, 95] | 4 | 0.922 | | | 0.757 |
| The music chatbot helps me discover new songs. | | | 0.728 | 0.593 | |
| The music chatbot provides me with surprising recommendations that helped me discover new music that I wouldn't have found elsewhere. | | | 0.896 | 0.902 | |
| The music chatbot provides me with recommendations that I had not considered in the first place but turned out to be a positive and surprising discovery. | | | 0.816 | 0.726 | |
| The music chatbot provides me with recommendations that were a pleasant surprise to me because I would not have discovered them somewhere else. | | | 0.850 | 0.816 | |
| 2. Interaction Adequacy [95] | 3 | 0.784 | | | 0.560 |
| I find it easy to inform the music chatbot if I dislike/like the recommended song. | | | 0.592 | 0.549 | |
| The music chatbot allows me to tell what I like/dislike. | | | 0.571 | 0.455 | |
| I find it easy to tell the system what I like/dislike. | | | 0.722 | 0.717 | |
| 3. CUI Adaptability [116, 129] | 3 | 0.805 | | | 0.584 |
| I felt I was in sync with the music chatbot. | | | 0.628 | 0.605 | |
| The music chatbot adapts continuously to my preferences. | | | 0.642 | 0.545 | |
| I always have the feeling that this music chatbot learns my preferences. | | | 0.692 | 0.596 | |
| 4. CUI Response Quality [137] | 3 | 0.722 | | | 0.473 |
| The music chatbot's responses are readable and fluent. | | | 0.581 | 0.464 | |
| Most of the chatbot's responses make sense. | | | 0.560 | 0.475 | |
| The pace of interaction with the music chatbot is appropriate. | | | 0.503 | 0.479 | |
| **User Beliefs** | | | | | |
| 1. Perceived Usefulness [95] | 3 | 0.816 | | | 0.593 |
| The music chatbot helps me find the ideal item. | | | 0.694 | 0.555 | |
| Using the music chatbot to find what I like is easy. | | | 0.661 | 0.570 | |
| The music chatbot gives me good suggestions. | | | 0.653 | 0.659 | |
| 2. CUI Rapport [116] | 5 | 0.893 | | | 0.629 |
| The music chatbot is warm and caring. | | | 0.750 | 0.653 | |
| The music chatbot cares about me. | | | 0.803 | 0.761 | |
| I like and feel warm toward the music chatbot. | | | 0.764 | 0.715 | |
| I feel that I have no connection with the music chatbot. | | | 0.628 | 0.431 | |
| The music chatbot and I establish rapport. | | | 0.764 | 0.627 | |
| **User Attitudes** | | | | | |
| 1. Trust & Confidence [95] | 3 | 0.801 | | | 0.607 |
| This music chatbot can be trusted. | | | 0.528 | 0.400 | |
| I am convinced of the items recommended to me. | | | 0.731 | 0.758 | |
| I am confident I will like the items recommended to me. | | | 0.698 | 0.669 | |
| **Behavioral Intentions** | | | | | |
| 1. Intention to Use [95] | 3 | 0.922 | | | 0.798 |
| I will use this music chatbot again. | | | 0.843 | 0.824 | |
| I will use this music chatbot frequently. | | | 0.872 | 0.861 | |
| I will tell my friends about this music chatbot. | | | 0.812 | 0.720 | |
5.1.7. Task
The task for participants was to use MusicBot to discover new and diverse songs and create a playlist containing 20 songs fitting their music taste.
5.2. Study 2: PhoneBot for Purchase Decision-making
5.2.1. Datasets
The scenario for Study 2 was mobile phone purchase, a task requiring high user involvement in decision support due to the higher cost and importance of the item. The system's recommendation component used a mobile phone database from GSMArena.com.
5.2.2. System Description
- System Name: PhoneBot.
- Type: A conversational recommender system designed to help users purchase mobile phones.
- Interface: Mobile application, requiring participants to chat with the bot on their mobile devices (validating the framework on a different platform).
- Interaction: PhoneBot elicits preferences by asking questions about budget and specific phone attributes (e.g., display size, battery capacity, brand). It then presents a recommended phone. The following figure (Figure 6 from the original paper) shows the user interfaces of PhoneBot:

  (Figure 6: The PhoneBot user interfaces, showing dialogues under low humanization (A) and high humanization (B), and with the explanation display turned on and off (C and D). Each interface shows a phone-recommendation dialogue involving information such as budget, brand, and battery capacity.)

- Technical Details:
  - Recommendation logic: Based on Multi-Attribute Utility Theory (MAUT) [136].
  - Conversation component: Implemented using Dialogflow ES (standard) by defining intents for critiquing various mobile phone attributes.
5.2.3. Manipulation
The study used a between-subjects design manipulating two factors:
- Humanization Level:
  - Low Humanization: Standard chatbot interaction.
  - High Humanization: Included features like a human avatar, a human identity in the self-introduction, addressing users by name, and adaptive response speed (Figure 6(B)).
- Explanation Display:
  - With Explanation: The bot explains the recommendation by ranking it based on the attributes the user cares about (Figure 6(C)).
  - Without Explanation: No explanations are shown (Figure 6(D)).

A sketch of assigning participants across these four cells of the between-subjects design follows.
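The following hedged sketch illustrates one way to spread participants over the 2×2 cells; the seeding-by-ID scheme is an assumption for illustration, not the authors' recruitment tooling:

```python
# Hedged sketch: assign each participant to one cell of the 2x2 design
# (humanization level x explanation display) in a between-subjects study.
import itertools
import random

CONDITIONS = list(itertools.product(["low", "high"], [False, True]))
# -> [('low', False), ('low', True), ('high', False), ('high', True)]

def assign_condition(participant_id: str) -> tuple:
    """Deterministic pseudo-random assignment, seeded by the participant ID."""
    return random.Random(participant_id).choice(CONDITIONS)

print(assign_condition("prolific_12345"))   # e.g., ('high', False)
```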
5.2.4. Research Questions
- RQ1: How does the humanization level of the system influence user trust in the CRS?
- RQ2: How do the recommendation explanations influence user trust in the CRS?
- RQ3: What are the interaction effects of humanization level and recommendation explanations?
5.2.5. Participants
- Total Recruited: 256 from Prolific.
- Filtered Out: 6 for failing attention checks, 29 for straight-line answers, 5 for extremely long duration.
- Valid Participants: 216 (an acceptable sample size for SEM).

The following are the demographics from Table 3 of the original paper:

| Item | Category | Frequency | Percentage (%) |
|---|---|---|---|
| Age | 18-24 | 36 | 16.67 |
| | 25-34 | 74 | 34.26 |
| | 35-44 | 54 | 25.00 |
| | 45-54 | 27 | 12.50 |
| | 55-64 | 18 | 8.33 |
| | ≥65 | 7 | 3.24 |
| Gender | Male | 117 | 54.17 |
| | Female | 96 | 44.44 |
| | Other | 3 | 1.39 |
| Nationality | UK | 135 | 62.50 |
| | Canada | 15 | 6.94 |
| | USA | 15 | 6.94 |
| | Germany | 6 | 2.78 |
| | Netherlands | 5 | 2.31 |
| | Others | 40 | 18.52 |
5.2.6. Evaluation Metrics & Constructs (after CFA/SEM Adjustments)
Similar CFA adjustments were made. The additional constructs validated in Study 2 complement those from Study 1.
- Validated Constructs (11 total):
  - Perceived Qualities: Accuracy, Explainability, CUI Attentiveness, CUI Understanding.
  - User Beliefs: Transparency, Perceived Ease of Use, User Control, CUI Humanness.
  - User Attitudes: Trust & Confidence, Satisfaction.
  - Behavioral Intentions: Intention to Purchase.

The following are the reliability results for the latent factors (constructs) validated in Study 2, from Table 4 of the original paper:

| Construct / Question item | Items | Cronbach's α (≥0.5) | Item-total corr. (≥0.4) | Factor loading (R²) (≥0.4) | AVE (≥0.4) |
|---|---|---|---|---|---|
| **Perceived Qualities** | | | | | |
| 1. Accuracy [68, 95] | 3 | 0.805 | | | 0.600 |
| The recommended phones were well-chosen. | | | 0.717 | 0.680 | |
| The recommended phones were relevant. | | | 0.663 | 0.631 | |
| The recommended phones were interesting.* | | | 0.606 | 0.482 | |
| 2. Explainability [95] | 3 | 0.916 | | | 0.800 |
| The chatbot explained why the phones were recommended to me. | | | 0.893 | 0.937 | |
| The chatbot explained the logic of recommending phones.* | | | 0.750 | 0.607 | |
| The chatbot told me the reason why I received the recommended phones.* | | | 0.854 | 0.847 | |
| 3. CUI Attentiveness [116, 138] | 3 | 0.812 | | | 0.598 |
| The chatbot tried to know more about my needs. | | | 0.631 | 0.514 | |
| The chatbot paid attention to what I was saying.* | | | 0.708 | 0.700 | |
| The chatbot was respectful to me and considered my needs.* | | | 0.662 | 0.592 | |
| 4. CUI Understanding [5] | 3 | 0.930 | | | 0.822 |
| The chatbot understood what I said. | | | 0.852 | 0.797 | |
| I found that the chatbot understood what I wanted. | | | 0.899 | 0.904 | |
| I felt that the chatbot understood my intentions. | | | 0.823 | 0.767 | |
| **User Beliefs** | | | | | |
| 1. Transparency [40, 95] | 3 | 0.614 | | | |
| I understood why the phones were recommended to me. | | | 0.645 | 0.551 | |
| I understood how the system determined the quality of the phones. | | | 0.680 | 0.556 | |
| I understood how well the recommendations matched my preferences. | | | 0.711 | 0.720 | |
| 2. Perceived Ease of Use [95] | 4 | 0.808 | | | 0.785 |
| I could easily use the chatbot to find the phones of my interests.* | | | 0.865 | 0.799 | |
| Using the chatbot to find what I like was easy. | | | 0.871 | 0.809 | |
| Finding a phone to buy with the help of the chatbot was easy. | | | 0.844 | 0.763 | |
| It was easy to find what I liked by using the chatbot.* | | | 0.881 | 0.860 | |
| 3. User Control [95] | 3 | 0.913 | | | 0.645 |
| I felt in control of modifying my taste using this chatbot. | | | 0.857 | 0.861 | |
| I could control the recommendations the chatbot made for me.* | | | 0.761 | | |
| I felt in control of adjusting recommendations based on my preference.* | | | 0.859 | 0.855 | |
| 4. CUI Humanness [107] | 3 | 0.787 | | | |
| The chatbot behaved like a human. | | | 0.881 | 0.903 | |
| I felt like conversing with a real human when interacting with this chatbot. | | | 0.770 | 0.663 | |
| This chatbot system has human properties. | | | 0.841 | 0.823 | |
| **User Attitudes** | | | | | |
| 1. Trust & Confidence [29, 95] | 6 | 0.955 | | | 0.780 |
| The recommendations provided by the chatbot can be trusted.* | | | 0.758 | | |
| I can rely on the chatbot when I need to buy a mobile phone.* | | | 0.821 | | |
| I feel I could count on the chatbot to help me purchase the mobile phone I need. | | | 0.838 | | |
| I was convinced of the phones recommended to me. | | | 0.848 | 0.806 | |
| I was confident I would like the phones recommended to me. | | | 0.821 | | |
| I had confidence in accepting the phones recommended to me. | | | 0.865 | 0.815 | |
| 2. Satisfaction | 3 | 0.932 | | | 0.851 |
| I was satisfied with the recommendations made by the chatbot.* | | | 0.869 | | |
| The recommendations made by the chatbot were satisfying.* | | | 0.833 | 0.748 | |
| These recommendations made by the chatbot made me satisfied.* | | | 0.879 | 0.865 | |
| **Behavioral Intentions** | | | | | |
| 1. Intention to Purchase [37] | 3 | 0.937 | | | 0.831 |
| Given a chance, I predict that I would consider buying the phones recommended by the chatbot in the near future. | | | 0.873 | 0.859 | |
| I will likely buy the phones recommended by the chatbot in the near future. | | | 0.880 | 0.847 | |
| Given the opportunity, I intend to buy the phones recommended by the chatbot. | | | 0.855 | 0.788 | |
5.2.7. Task
The task for participants was to help a fictional character pick three mobile phones based on specified requirements (budget, battery, display size).
6. Results & Analysis
6.1. Core Results Analysis
The studies employed Structural Equation Modeling (SEM) to validate the hypothesized relationships within the CRS-Que framework. The results demonstrate the validity and reliability of the framework's constructs and reveal how conversation and recommendation constructs interact.
6.1.1. Study 1: MusicBot Results
The SEM model for Study 1, conducted with MusicBot for music exploration, showed an acceptable fit to the data based on standard indices:
- χ² (d.f. = 311): reported as significant, which the authors attribute to χ²'s sensitivity to sample size
- TLI (Tucker-Lewis Index) = 0.926 (above 0.90, a good fit)
- CFI (Comparative Fit Index) = 0.934 (above 0.90, a good fit)
- RMSEA (Root Mean Square Error of Approximation) = 0.062, 90% CI [0.052, 0.072] (within the acceptable range, ideally below 0.08)
- R² values for all constructs were larger than 0.40, indicating good explanatory power
The following figure (Figure 5 from the original paper) shows the Structural Equation Modeling (SEM) results of Study 1:

Key Observations from Study 1 SEM:
- No Significant Effect of Manipulation: The manipulated design factor (critiquing initiative: user-initiated vs. system-suggested) did not show a significant effect on any measured construct. This aligns with previous findings in conversational recommender systems, where this distinction may not always lead to perceived differences.
- Influence of Perceived Qualities on User Beliefs:
  - Novelty positively influenced Perceived Usefulness and CUI Rapport. This suggests that discovering new and surprising music makes the system feel more useful and fosters a better connection with the bot.
  - CUI Adaptability positively influenced Perceived Usefulness and CUI Rapport. A bot that adapts to user preferences is perceived as more useful and builds better rapport.
  - Interaction Adequacy positively influenced Perceived Usefulness. Ease of giving feedback during the interaction contributes to perceived usefulness.
- Influence of User Beliefs on User Attitudes and Behavioral Intentions:
  - Perceived Usefulness positively influenced Trust & Confidence. A useful system builds trust.
  - Trust & Confidence positively influenced Intention to Use. Trust is crucial for continued use.
  - CUI Rapport directly influenced Intention to Use. A good rapport with the bot directly encourages users to use it again.
- Relationships between Recommendation and Conversation Constructs:
  - Interaction Adequacy positively correlated with CUI Adaptability and CUI Response Quality. This suggests that when interaction is easy, users also perceive the bot as more adaptable and its responses as higher quality.
  - Novelty positively correlated with CUI Adaptability. Discovering new items can make the bot seem more adaptable.
6.1.2. Study 2: PhoneBot Results
The SEM model for Study 2, evaluating PhoneBot for mobile phone purchase decisions, also demonstrated a good fit:
- χ² (d.f. = 685)
- TLI = 0.947 (above 0.90, a good fit)
- CFI = 0.951 (above 0.90, a good fit)
- RMSEA = 0.049, 90% CI [0.049, 0.060] (an excellent fit; below 0.05 is ideal)
- R² values indicate good explanatory power
The following figure (Figure 7 from the original paper) shows the structural equation modeling (SEM) results of Study 2:

Key Observations from Study 2 SEM:
- Influence of Manipulated Design Factors:
  - The Explanation condition had a direct positive effect on Explainability, confirming that providing explanations makes users perceive the system as more explainable.
  - Humanization Level had a marginal positive effect on CUI Attentiveness; high humanization tends to make the bot seem more attentive.
- Influence of Perceived Qualities on User Beliefs:
  - Explainability positively influenced Transparency. Explanations lead to a better understanding of the system's logic.
  - Accuracy positively influenced Perceived Ease of Use and User Control. Accurate recommendations make the system easier to use and give users a sense of control.
  - CUI Understanding positively influenced Perceived Ease of Use and User Control. A bot that understands the user is perceived as easier to use and as giving more control.
- Influence of User Beliefs on User Attitudes and Behavioral Intentions:
  - Transparency did not significantly influence Trust & Confidence or Intention to Purchase directly in this model, which contrasts with some prior findings and suggests mediators or context-specific effects.
  - Perceived Ease of Use positively influenced Satisfaction.
  - CUI Humanness positively influenced Satisfaction and Trust & Confidence. Perceiving the bot as human-like increases satisfaction and trust.
- Mediated Effects on Intention to Purchase:
  - Explainability indirectly influenced Intention to Purchase via Transparency, which influences Trust & Confidence and, subsequently, Intention to Purchase.
  - Humanization Level indirectly influenced Intention to Purchase through CUI Attentiveness, then CUI Humanness, and finally Trust & Confidence and Satisfaction (both of which lead to Intention to Purchase).
- Relationships between Recommendation and Conversation Constructs:
  - Explainability positively correlated with CUI Attentiveness. Explaining recommendations makes the bot seem more attentive.
  - CUI Understanding positively correlated with Accuracy. When the bot understands the user, its recommendations are perceived as more accurate.
6.2. Data Presentation (Tables)
The tables presenting the demographics and the reliability results for latent factors in Study 1 and Study 2 are transcribed in full in the Experimental Setup section above.
6.3. Ablation Studies / Parameter Analysis
The paper did not present explicit ablation studies in the traditional sense (e.g., removing components of CRS-Que to see impact on overall performance). Instead, the two studies acted as a form of validation for the framework itself, by:
- Iterative Refinement: CFA was used to iteratively adjust the model by removing or merging constructs (e.g., dropping single-item constructs, merging highly correlated ones) until satisfactory validity and reliability were achieved (as detailed in Tables 2 and 4). This acts as an internal validation of the constructs within the framework.
- Manipulating Design Factors: The studies manipulated critiquing initiative (Study 1) and humanization level and explanation display (Study 2). These manipulations served to confirm that the CRS-Que framework can indeed capture variability in user responses due to different system designs. For example, the direct link between the Explanation condition and Explainability (Study 2) shows the framework's sensitivity to design choices.

The analysis focuses on the relationships between constructs as identified by SEM, rather than the impact of hyperparameters on a single model's performance. The "parameters" analyzed here are the β coefficients and p-values of the paths within the SEM models, which indicate the strength and significance of the relationships between the latent constructs; a sketch of reading these paths from a fitted model follows.
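To show how such β coefficients and p-values are read off a fitted model, the snippet below continues the semopy sketch from Section 4.2.2.4; the fitted `model` object and semopy's inspect() column names are assumptions, not the authors' analysis code:

```python
# Hedged sketch: extract significant structural paths from a fitted semopy model.
estimates = model.inspect()                   # one row per estimated parameter
paths = estimates[estimates["op"] == "~"]     # '~' rows are the structural paths
significant = paths[paths["p-value"] < 0.05]  # the starred paths in Figures 5 and 7
print(significant[["lval", "rval", "Estimate", "p-value"]])
```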
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully proposes CRS-Que, a unifying user-centric evaluation framework for conversational recommender systems (CRSs). Built upon the established ResQue framework, CRS-Que innovatively integrates critical user experience (UX) metrics related to conversational interaction (e.g., CUI Adaptability, CUI Understanding, CUI Humanness, CUI Rapport) into ResQue's four dimensions: Perceived System Qualities, User Beliefs, User Attitudes, and Behavioral Intentions. Through two rigorous online user studies conducted in diverse application domains (music exploration and mobile phone purchase) and across different platforms, the authors validated the reliability and validity of the framework's 18 constructs, comprising 64 question items. The Structural Equation Modeling (SEM) results not only confirm the hypothesized relationships within the framework but also reveal intricate influencing paths between conversation and recommendation constructs. This framework offers practitioners a systematic way to evaluate CRSs from a holistic user perspective, considering both conversational and recommendation quality, and provides a standardized approach for future research.
7.2. Limitations & Future Work
The authors acknowledge several limitations:
- Generalizability of Study Design: The study design might limit the framework's generalizability. For instance, Study 1 primarily focused on critiquing-based interaction, which represents only one specific method of acquiring user feedback in CRSs.
- Domain Impact: Evaluating two different systems in distinct domains makes it challenging to precisely isolate and examine the exact impact of the domain on the evaluation framework itself.
- Modality Restriction: The validation was performed exclusively with text-based conversational recommender systems. Its validity for voice-based systems, which introduce new UX factors related to voice quality and interaction, remains untested.
- Technical Limitations of the Systems: MusicBot and PhoneBot had technical limitations, such as predefined intents that might not cover all user expressions, and conversational skills not comparable to state-of-the-art Large Language Model (LLM)-powered agents.

Based on these limitations, the authors suggest several directions for future work:
- Involving a more diverse range of dialogue designs for recommendation scenarios.
- Identifying the domain independence of the framework by testing the same CRS in different domains.
- Conducting additional studies to assess the framework's validity in voice-based systems, incorporating relevant voice-interaction-quality constructs.
- Evaluating more advanced conversational agents (e.g., those based on ChatGPT or other LLMs) for recommendation scenarios.
- Continuously maintaining and tracking how CRS-Que is used by practitioners to evaluate different CRSs.
7.3. Personal Insights & Critique
This paper makes a crucial contribution by providing a well-structured and empirically validated framework for evaluating conversational recommender systems. The integration of conversational UX metrics into an existing RS evaluation framework (ResQue) is a particularly insightful approach, bridging two distinct but increasingly intertwined research areas. The rigorous psychometric validation, including CFA and SEM, lends strong credibility to CRS-Que and establishes a solid foundation for future user-centric studies in CRSs.
One of the most valuable aspects of this work is its emphasis on the interplay between recommendation and conversation constructs. The SEM results explicitly show that a system's Novelty can influence CUI Rapport, or that Explainability can enhance CUI Attentiveness. This highlights that in a CRS, the conversational interface is not merely a delivery mechanism but an active component that shapes how users perceive the recommendations and the system as a whole. This deeper understanding of cross-construct relationships moves beyond simply measuring individual UX factors to explaining how they influence each other, offering actionable insights for system designers. For instance, knowing that CUI Rapport can directly impact Intention to Use (Study 1) suggests that focusing on social aspects of the conversation can be as important as the pure algorithmic quality of recommendations.
The two-study validation strategy, covering different domains (low vs. high user involvement) and platforms (desktop vs. mobile), enhances the generalizability and robustness of CRS-Que. This makes the framework versatile and applicable to a wide range of CRS implementations. The provision of a comprehensive questionnaire (Appendix A) and a short version (Appendix B) further empowers researchers and practitioners to adopt and customize the framework for their specific needs, greatly aiding standardization efforts.
Critically, while the paper acknowledges technical limitations of the evaluated systems, the rapid advancement of Large Language Models (LLMs) poses an interesting challenge and opportunity. LLMs can exhibit highly sophisticated conversational abilities, potentially altering user perceptions of CUI Humanness, CUI Understanding, and CUI Response Quality in ways that older DialogFlow ES-based systems might not fully capture. Evaluating LLM-powered CRSs with CRS-Que could provide fascinating insights into how these advanced models influence user experience across the framework's dimensions.
Furthermore, the lack of significant impact from the critiquing initiative in Study 1, while noted, could warrant further investigation. It might suggest that for certain low-involvement tasks like music exploration, the method of critiquing (user-initiated vs. system-suggested) is less critical than the overall Interaction Adequacy or CUI Adaptability. However, for high-stakes decision-making, this finding might differ. The paper implicitly encourages such deeper contextual analysis by providing the framework itself.
Overall, CRS-Que is a highly valuable contribution that provides a much-needed standardized and rigorous tool for understanding the complex user experience of conversational recommender systems. Its methods and conclusions are highly transferable to various domains where natural language interaction is used to mediate complex information tasks, from customer service chatbots to educational tools.