SCREEN: A Benchmark for Situated Conversational Recommendation
TL;DR Summary
The paper introduces Situated Conversational Recommendation Systems (SCRS) and presents the SCREEN dataset, which contains over 20,000 dialogues generated by multimodal large language models, simulating user-recommender interactions. This resource enriches future research in situated conversational recommendation.
Abstract
Engaging in conversational recommendations within a specific scenario represents a promising paradigm in the real world. Scenario-relevant situations often affect conversations and recommendations from two closely related aspects: varying the appealingness of items to users, namely situated item representation, and shifting user interests in the targeted items, namely situated user preference. We highlight that considering those situational factors is crucial, as this aligns with the realistic conversational recommendation process in the physical world. However, it is challenging yet under-explored. In this work, we pioneer the effort to bridge this gap and introduce a novel setting: Situated Conversational Recommendation Systems (SCRS). We observe an emergent need for high-quality datasets, and building one from scratch requires tremendous human effort. To this end, we construct a new benchmark, named SCREEN, via a role-playing method based on multimodal large language models. We take two multimodal large language models to play the roles of a user and a recommender, simulating their interactions in a co-observed scene. Our SCREEN comprises over 20k dialogues across 1.5k diverse situations, providing a rich foundation for exploring situational influences on conversational recommendations. Based on SCREEN, we propose three worth-exploring subtasks and evaluate several representative baseline models. Our evaluations suggest that the benchmark is high quality, establishing a solid experimental basis for future research. The code and data are available at https://github.com/DongdingLin/SCREEN.
In-depth Reading
1. Bibliographic Information
1.1. Title
SCREEN: A Benchmark for Situated Conversational Recommendation
1.2. Authors
- Dongding Lin (The Hong Kong Polytechnic University, Hong Kong, China)
- Jian Wang (The Hong Kong Polytechnic University, Hong Kong, China)
- Chak Tou Leong (The Hong Kong Polytechnic University, Hong Kong, China)
- Wenjie Li (The Hong Kong Polytechnic University, Hong Kong, China)
1.3. Journal/Conference
The paper is published in the Proceedings of the 32nd ACM International Conference on Multimedia (MM '24), scheduled for October 28-November 1, 2024, in Melbourne, VIC, Australia. ACM Multimedia (ACM MM) is a highly reputable and influential conference in the field of multimedia computing, covering a broad range of topics including multimedia systems, applications, data, and human-computer interaction. Its publication in this venue indicates a significant contribution to the multimedia and AI communities, particularly in the intersection of conversational AI and recommender systems.
1.4. Publication Year
2024
1.5. Abstract
The paper introduces a novel paradigm called Situated Conversational Recommendation Systems (SCRS), which aims to integrate scenario-relevant situational factors into conversational recommendations. It highlights that these situational factors influence situated item representation (how appealing items are to users) and situated user preference (how user interests shift) in a realistic setting. Recognizing the lack of high-quality datasets for this challenging and underexplored area, the authors propose a new benchmark named SCREEN. This benchmark is constructed using a novel role-playing method involving two multimodal large language models (MM-LLMs) that simulate user and recommender interactions within a co-observed scene. SCREEN comprises over 20,000 dialogues across 1,500 diverse situations, providing a rich foundation for studying situational influences. Based on this benchmark, the paper defines three subtasks (system action prediction, situated recommendation, and system response generation) and evaluates several baseline models, demonstrating the benchmark's high quality and establishing a basis for future research.
1.6. Original Source Link
https://openreview.net/pdf?id=BfjHOCFvyf This paper is currently published on OpenReview, indicating it has undergone peer review (often associated with major conferences) and is publicly available.
2. Executive Summary
2.1. Background & Motivation
The paper addresses a critical gap in the field of Conversational Recommendation Systems (CRS) and Multimodal Conversational Recommendation Systems (Multimodal CRS). While existing Multimodal CRSs integrate textual and visual product information, they often fall short in capturing the dynamic nature of user preferences and the contextual variability of item appeal in real-world scenarios.
The core problem is that conventional CRSs and Multimodal CRSs typically overlook crucial situational factors. These factors, such as the product's location, current season, daily weather, and even the user's appearance or emotional state, can significantly:
- Vary the appealingness of items (situated item representation): An item's attractiveness is not static; it changes based on the environment (e.g., a swimsuit is more appealing in summer than in winter, and specific lighting might make an item look better).
- Shift user interests (situated user preference): A user's preference for an item can fluctuate based on their current situation or mood (e.g., wanting lighter clothes on a hot day, or a specific type of furniture for a brightly lit room).

The authors emphasize that considering these situational factors is crucial because it directly aligns with how real-world conversational recommendations occur in physical settings (e.g., a salesperson observing a customer's outfit and the store's ambiance). However, integrating situational context into CRSs is challenging and largely under-explored due to the lack of suitable datasets.
The paper's entry point is to bridge this gap by pioneering a novel setting called Situated Conversational Recommendation Systems (SCRS) and, more importantly, by constructing a high-quality benchmark to facilitate research in this new area, requiring minimal human effort.
2.2. Main Contributions / Findings
The paper makes several significant contributions:
- Introduces Situated Conversational Recommendation Systems (SCRS): It formally defines SCRS as a new and more realistic paradigm for conversational recommendation, which explicitly incorporates situational context to deliver more engaging and appropriate recommendations. This expands the scope beyond traditional multimodal CRSs.
- Constructs the SCREEN Benchmark: The authors develop SCREEN (Situated Conversational REcommENdation), the first comprehensive and high-quality benchmark dataset specifically designed for SCRS. This dataset is created using an innovative, efficient role-playing method powered by multimodal large language models (MM-LLMs), which significantly reduces human effort.
- Large-scale and Diverse Dataset: SCREEN comprises over 20,000 dialogues across 1,500 diverse situations, providing a rich foundation for exploring situational influences. It leverages visual features from scene snapshots and detailed item metadata, enriched with subjective descriptions.
- Proposes Three Essential Subtasks: The paper delineates three critical subtasks for comprehensively evaluating SCRS: system action prediction, situated recommendation, and system response generation. These tasks measure a system's ability to understand user intent, model situated preferences, and generate context-aware responses.
- Establishes Baseline Results: The authors evaluate several representative baseline models on the SCREEN benchmark, providing initial performance metrics and highlighting the challenges in situated recommendation, thereby establishing a solid experimental basis for future research. The findings suggest that even state-of-the-art MM-LLMs still have significant room for improvement in fully addressing situated recommendations.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a novice should be familiar with the following core concepts:
- Recommender Systems (RS): At a high level, these are information filtering systems that predict what a user might like. They are widely used in e-commerce, streaming services, etc., to suggest products, movies, or music.
- Conversational Recommender Systems (CRS): An evolution of traditional RS that interacts with users through natural language dialogues. Instead of static recommendations, CRSs can ask clarifying questions, understand nuanced preferences, and refine recommendations dynamically, much like a human salesperson.
- Multimodal Conversational Recommender Systems (Multimodal CRS): These CRSs go beyond text-only interactions by incorporating multiple modalities, primarily visual information (e.g., item images, scene snapshots) alongside textual dialogue history. This allows for a richer understanding of items and user preferences.
- Situated Conversational Recommendation Systems (SCRS): This is the novel paradigm introduced in the paper. It extends Multimodal CRSs by explicitly incorporating a broader range of situational context (e.g., time of day, weather, store environment, user's appearance) into the recommendation process. The goal is to make recommendations more relevant and natural by considering the "real-world" context of the interaction.
- Situated Item Representation: In SCRS, this refers to how the appealingness or relevance of an item dynamically changes based on the situational context. For example, a warm jacket has a different situated item representation in winter than in summer.
- Situated User Preference: In SCRS, this describes how a user's interests or preferences for items shift due to the situational context. For instance, a user might prefer lighter clothing on a hot day, even if their general preference is for darker colors.
- Large Language Models (LLMs): These are advanced artificial intelligence models trained on vast amounts of text data, enabling them to understand, generate, and process human language. They can perform various tasks, including conversation, summarization, translation, and even complex reasoning. Examples include GPT-3, GPT-4, Llama, etc.
- Multimodal Large Language Models (Multimodal LLMs): An extension of LLMs that can process and generate content across multiple modalities, such as text and images. For example, GPT-4V can take an image and text as input and generate a textual response based on both.
- Role-playing / Agent-based Simulation: A method where LLMs are prompted with specific personas (e.g., "You are a user," "You are a recommender") and instructions to simulate interactions or dialogues. This technique is used in the paper to automatically generate the dataset by having LLM agents play different roles.
- Big Five Personality Traits (OCEAN): A widely accepted model of personality in psychology, describing five broad dimensions:
- Openness (O): Inventive/curious vs. consistent/cautious.
- Conscientiousness (C): Efficient/organized vs. extravagant/careless.
- Extraversion (E): Outgoing/energetic vs. solitary/reserved.
- Agreeableness (A): Friendly/compassionate vs. challenging/detached.
- Neuroticism (N): Sensitive/nervous vs. secure/confident.
This model is used to imbue the user agent with diverse and realistic personalities.
- Set-of-Mark Technique: A method, particularly useful with Multimodal LLMs, to improve object recognition and grounding in images. It involves prompting the MM-LLM to identify and mark objects in an image before describing them, which helps in generating more precise and contextually rich descriptions.
3.2. Previous Works
The paper contextualizes its contribution by discussing prior research in two main areas: Conversational Recommendation Systems (CRS) and Situated Dialogues.
3.2.1. Conversational Recommendation Systems (CRS)
Early CRS datasets focused predominantly on textual information, relying on dialogue histories and item attributes.
- REDIAL [20]: A benchmark for CRS focusing on movie recommendations, collected through crowd workers. It primarily uses textual dialogue.
- TG-REDIAL [46]: Another movie recommendation dataset, similar to REDIAL in its textual nature and crowd-sourced collection.
- INSPIRED [11]: A dataset aimed at sociable recommendation dialogue systems, also primarily text-based.
- DuRecDial [23, 24]: A large-scale CRS dataset covering multiple domains and dialogue types, including rich user profiles, but still mainly textual in its conversational content.

To address the limitations of text-only CRSs, multimodal CRSs emerged, integrating visual information.

- MMD [33]: A significant advancement, this benchmark dataset introduced tasks catering to multimodal, domain-specific dialogues, often involving fashion items. It started integrating item images.
- MMConv [21]: Further expanded multimodal CRS by covering multiple domains beyond single-domain datasets.
- SURE [25]: Acknowledged the need to capture more diverse expressions of users' subjective preferences and recommendation behaviors in real-life scenarios, integrating textual and visual information. While it aims for subjective preferences, it does not explicitly focus on situational context.
- SIMMC-VR [36]: A task-oriented multimodal dialogue dataset with situated and immersive VR streams, enhancing the system's comprehension of spatial and temporal contexts. This dataset moves closer to situated dialogues but is still distinct from the paper's specific focus on situated conversational recommendation.
3.2.2. Situated Dialogues
Research in situated dialogues emphasizes grounding interactions within specific contextual situations.
- SIMMC [5] and SIMMC 2.0 [17]: The Situated Interactive Multi-Modal Conversational (SIMMC) dataset and its successor SIMMC 2.0 established a foundation for situational, interactive multimodal conversations. SIMMC 2.0 enhanced capabilities but primarily focused on immediate, local topics, limiting support for dynamic, forward-looking conversations. It provides a valuable base for multimodal and situated aspects, but not specifically for recommendation.
- SUGAR [31]: A dataset introduced to improve agents' proactive response selection in situated contexts, addressing limitations in SIMMC 2.0 regarding dynamic conversations.
3.3. Technological Evolution
The evolution of recommender systems has progressed through several stages:
- Traditional Recommender Systems: Early systems focused on collaborative filtering, content-based filtering, or hybrid approaches, often based on explicit ratings or implicit behaviors. These were largely static and non-interactive.
- Conversational Recommender Systems (CRSs): The introduction of CRSs marked a shift towards interactivity, allowing systems to engage with users via natural language to elicit preferences and refine recommendations. These were initially text-based.
- Multimodal CRSs: Recognizing the richness of real-world items, multimodal CRSs integrated visual information (e.g., product images) alongside text, enabling a deeper understanding of item attributes and user preferences.
- Situated Dialogues (Task-Oriented): A parallel development focused on dialogues grounded in a specific environment or scenario, often involving visual scenes and objects, but primarily for task completion (e.g., "find the red chair") rather than personalized recommendations based on shifting preferences.
- Situated Conversational Recommendation Systems (SCRS): This paper's work represents the latest evolution, combining the interactivity of CRSs, the richness of multimodal input, and the contextual awareness of situated dialogues, specifically tailored for recommendation. It bridges the gap by making recommendations truly "situated," considering the dynamic interplay between user, item, and environment.
3.4. Differentiation Analysis
The SCREEN benchmark and the SCRS paradigm differentiate from previous works in several key ways:
- Beyond Multimodal CRS: While datasets like MMD and SURE incorporate visual information, they primarily treat item representations and user preferences as static or based on general interests. SCREEN explicitly models situated item representation and situated user preference, meaning the appeal of an item and a user's interest in it are dynamic and context-dependent (e.g., weather, time of day, user's current appearance). This is a more realistic reflection of physical-world interactions.
- Beyond Situated Dialogues: Datasets like SIMMC 2.0 and SUGAR focus on task-oriented situated dialogues (e.g., finding an item, answering questions about a scene). SCREEN, however, centers on recommendation as the core task, where the system proactively suggests items based on complex situational understanding, user preferences, and dialogue history. It is not just about interacting in a scene but about making appropriate suggestions.
- Comprehensive Situational Context: SCREEN integrates a richer set of situational factors, including:
  - Co-observed scenario: Scene snapshots with precise item coordinates.
  - Spatiotemporal information: Time of day (morning, noon, afternoon, evening).
  - Environmental information: Climate (spring, summer, autumn, winter).
  - User's dynamic state: Emotional state, appearance (e.g., clothing shown in another snapshot), and Big Five personality traits.
  This level of granular situational context is unique among CRS datasets.
- LLM-driven Data Generation: The paper leverages a role-playing method using multimodal LLM agents (GPT-4V, GPT-4) for dataset construction. This approach allows for generating large-scale, high-quality dialogues with minimal human effort, addressing the significant challenge of creating such complex datasets from scratch. Previous datasets often relied heavily on manual crowd-sourcing.
- Subjective Item Descriptions: SCREEN enhances item metadata with situational attributes and subjective descriptions (e.g., "enthusiastic and bold" for clothes) generated by MM-LLMs, moving beyond intrinsic attributes (e.g., color, brand) to capture how items are perceived in context.

In essence, SCREEN pioneers a more holistic and realistic conversational recommendation experience by grounding it deeply in dynamic situational contexts, an aspect largely underexplored by previous CRS and situated dialogue benchmarks.
4. Methodology
The core idea behind the paper's methodology is to construct a high-quality Situated Conversational Recommendation System (SCRS) dataset with minimal human effort by leveraging the advanced human-mimicking capabilities of Large Language Models (LLMs). This is achieved through a role-playing approach where LLM agents simulate users and recommenders interacting within a shared, visually perceived scene.
4.1. Problem Formulation
The paper formulates the SCRS dataset as a collection of dialogues. Each dialogue $i$ is represented as a tuple $(S_i, V_i, U_i, C_i)$:

- $S_i$: Represents the situational information. This includes the user-system co-observed scenario (e.g., a scene snapshot), spatiotemporal information (e.g., time of day), and environmental information (e.g., climate).
- $V_i$: Denotes all items present in the specific situation $S_i$. Each item has its own attributes, which can be intrinsic (e.g., brand, price) and situational (e.g., appearance under certain lighting).
- $U_i$: Represents the user's personalized information for dialogue $i$, which includes user preferences, user profiles, and user personalities.
- $C_i = \{u_1, u_2, \ldots, u_T\}$: Denotes the dialogue context for dialogue $i$, where $T$ is the total number of turns in the conversation and each $u_t$ is a turn in the conversation.

Given the situational information $S_i$, all items $V_i$ in this situation, a set of user's personalized information $U_i$, and a dialogue context $C_i$, the objective of an SCRS is to:

- Select and recommend the most appropriate item from the scene to the user.
- Generate a natural language response that matches the scene content and situational context.

The key distinction from traditional CRS is the explicit requirement for recommended items and responses to be closely related to the situational context.
4.2. Overview of Dataset Construction Framework
The paper's automatic dataset construction framework for situated conversational recommendation is inspired by LLM agent-based role-playing. As depicted in Figure 2, the framework consists of two main components:

- Scene Information Pool Generation: This component enriches item metadata with situational attributes and subjective descriptions by leveraging multimodal LLMs to analyze scene snapshots. This addresses the limitation of traditional item descriptions that only provide intrinsic attributes.
- Role-Playing Environment: This environment simulates human-like interactions using three types of LLM agents: a user agent, a system agent, and a moderator agent. These agents follow meticulously designed instructions and operate within a global environmental description to generate recommendation-oriented dialogues.

The work leverages VR snapshots from the SIMMC 2.1 dataset, which includes diverse scenes from fashion and furniture stores, along with detailed metadata for each item (e.g., type, color, material, price).

The figure is a schematic of the role-playing-based automatic dataset construction framework, covering scene pool generation and the components of the role-playing environment, including the interaction flow between the user agent and the system agent and how simulated user preferences and situations drive the conversational recommendation.

Figure 2: Overview of the automatic dataset construction framework for situated conversational recommendation.
4.3. Scene Information Pool Generation
This component aims to create a more nuanced item representation by moving beyond intrinsic attributes to include situational attributes and subjective descriptions.

- Motivation: In real-world scenarios, users often prioritize an item's appearance and how it fits the situation (e.g., location, lighting) over its static attributes (e.g., brand, exact price). Traditional product databases lack these subjective descriptors (e.g., "clothes designed for young women" instead of "red clothes").
- Methodology:
  - Spatial Data: The SIMMC 2.1 dataset provides precise coordinates for products within scene snapshots.
  - Bounding Boxes and IDs: These coordinates are used to create bounding boxes around each item and assign unique identifiers.
  - Multimodal LLM Processing: The annotated snapshots (image with bounding boxes) are fed into GPT-4V (specifically the gpt-4-1106-vision-preview version).
  - Prompting with Set-of-Mark: The Set-of-Mark technique [39] is used to improve item recognition and description generation. GPT-4V is prompted to elucidate situational attributes (how an item appears in the given scene, influenced by lighting or placement) and subjective descriptions (e.g., "enthusiastic and bold" for a vibrant red dress).
  - Integration: The generated situational and subjective descriptions are then integrated into the existing product metadata to form a comprehensive scene information pool.

The instruction template for this generation is illustrated in Figure 3. It prompts the MM-LLM to act as a consumer and describe items in specific numbered boxes, focusing on their color, type, and pattern, given a scene screenshot.

The image is a schematic of the instruction template for scene information pool generation: it instructs a "consumer" viewing a scene screenshot to describe the clothing or furniture in each numbered box in terms of its color, type, and pattern.

Figure 3: Instruction template for the scene pool generation.
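The step can be pictured with a small sketch. The bounding-box drawing below uses Pillow, and `query_mm_llm` is a hypothetical stand-in for whatever GPT-4V call the authors use; neither the function name nor the exact prompt wording comes from the paper.

```python
from PIL import Image, ImageDraw

def annotate_snapshot(image_path, objects):
    """Draw numbered bounding boxes (Set-of-Mark style) on a scene snapshot.

    `objects` is a list of dicts such as {"id": 3, "bbox": (x0, y0, x1, y1)},
    mirroring the per-item coordinates provided by SIMMC 2.1.
    """
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for obj in objects:
        x0, y0, x1, y1 = obj["bbox"]
        draw.rectangle((x0, y0, x1, y1), outline="red", width=3)
        draw.text((x0 + 4, y0 + 4), str(obj["id"]), fill="red")
    return img

PROMPT = (
    "You are a consumer looking at this store scene. For each numbered box, "
    "describe the clothing or furniture item in it: its color, type, and pattern, "
    "how it appears under the current lighting and placement (situational attributes), "
    "and the impression it gives (subjective description)."
)

# annotated = annotate_snapshot("scene_001.png", scene_objects)
# descriptions = query_mm_llm(image=annotated, prompt=PROMPT)  # hypothetical GPT-4V wrapper
```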
4.4. Role-Playing Environment
The role-playing environment is designed to simulate realistic interactions between a user and a recommender.
4.4.1. Global Environment Description
To ensure a realistic and diverse setting, a global environment description is created and provided to all LLM agents. This description incorporates three main dimensions:

- Temporal phases: Morning, noon, afternoon, and evening.
- Spatial settings: Fashion and furniture retail spaces.
- Climate: Spring, summer, autumn, and winter.

ChatGPT (gpt-3.5-turbo version) is used to generate succinct narratives for each combination of these dimensions (e.g., "It is the afternoon, and you find yourself in a fashion store. A gentle breeze wafts through, heralding the arrival of spring."). These tailored descriptions are appended to the beginning of each agent's instructions.
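A minimal sketch of how such narratives could be enumerated is below. The prompt wording and the `query_llm` helper are illustrative assumptions, not the paper's actual instruction text.

```python
import itertools

TIMES = ["morning", "noon", "afternoon", "evening"]
SPACES = ["fashion store", "furniture store"]
CLIMATES = ["spring", "summer", "autumn", "winter"]

def environment_prompts():
    """Yield one narrative-generation prompt per (time, space, climate) combination."""
    for time, space, climate in itertools.product(TIMES, SPACES, CLIMATES):
        yield (
            f"Write one or two sentences setting the scene: it is {time}, "
            f"the conversation takes place in a {space}, and the season is {climate}."
        )

# for prompt in environment_prompts():
#     narrative = query_llm(prompt)  # hypothetical gpt-3.5-turbo call
#     # prepend `narrative` to every agent's instructions for this situation
```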
4.4.2. User Agent
The user agent simulates consumer shopping behavior, generating responses based on predefined preferences, profiles, and personalities.

- User Preference:
  - Attribute Cataloging: Attributes of all products in a given scenario are cataloged.
  - Random Allocation: User preferences (favor, aversion, or neutrality) are randomly allocated to each attribute. This ensures a wide range of personalized preferences.
  - Natural Language Refinement: ChatGPT refines this structured information into fluent natural language. For example, "You exhibit a preference for red, an aversion to white, and display no particular inclination towards purple...".
- User Profile:
  - Structured Pool: A structured pool of personal profile attributes (e.g., name, age, gender, profession) is developed using information from the DuRecDial dataset.
  - Emotional States and Appearance: Profiles are further enriched with emotional states (e.g., joy, cheerfulness) and appearance descriptions. The appearance (e.g., "Upper Body: White shirt; Lower Body: Jeans") is derived from another scene snapshot to mimic real-user scenarios, where a salesperson would observe a customer's attire. This appearance information is crucial, as it is passed to the system agent for making contextually relevant recommendations.
  - Natural Language Refinement: ChatGPT refines this structured information into fluent natural language, similar to user preferences.
- User Personality:
  - Big Five Personality Traits: To increase diversity, user personalities are simulated using the Big Five personality traits [9, 41]: Openness (O), Conscientiousness (C), Extraversion (E), Agreeableness (A), and Neuroticism (N). Positive and negative aspects are assigned along these dimensions.
  - Natural Language Refinement: ChatGPT refines this structured personality information into fluent natural language.

The complete instruction template for the user agent (shown in Figure 5) expresses these simulated user preferences, profiles, and personalities in natural language to prompt the user agent to play the role of a customer.

The image is a schematic of the instruction template for generating user preferences, user profiles, and user personalities. It provides guidance on converting the structured information into fluent natural language, with particular emphasis on length limits and the required opening phrasing.

Figure 4: Instruction template for the user preference, user profile, and user personality generation.
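The random-allocation step can be sketched as follows. The attribute list and function names are hypothetical; the sketch only illustrates the "favor / aversion / neutral" sampling and the Big Five assignment described above.

```python
import random

def sample_preferences(attributes):
    """Randomly assign favor / aversion / neutrality to every attribute
    appearing on items in the scene."""
    return {attr: random.choice(["favor", "aversion", "neutral"]) for attr in attributes}

def sample_personality():
    """Assign a positive or negative pole on each Big Five dimension."""
    dims = ["Openness", "Conscientiousness", "Extraversion", "Agreeableness", "Neuroticism"]
    return {dim: random.choice(["high", "low"]) for dim in dims}

prefs = sample_preferences(["red", "white", "purple", "leather", "floral pattern"])
persona = sample_personality()
# The structured dicts are then rewritten into fluent natural language by ChatGPT,
# e.g., "You exhibit a preference for red, an aversion to white, and display no
# particular inclination towards purple..."
```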
4.4.3. System Agent
The system agent acts as a human-like salesperson, aiming to recommend appropriate items based on user preferences and the conversational context. It operates with predefined actions and observations:

- Predefined Actions: The system agent is designed to determine and execute one of six actions in each interaction round:
  - Describe Item Information: Proactively offers comprehensive details (intrinsic, situational, and subjective attributes) of items.
  - Inquire About Preferences: Gathers user preferences by asking opinions on scene items or clarifying ambiguities in requests.
  - Address User Queries: Provides requested information upon user inquiries.
  - Topic Transfer: Strategically guides the conversation by introducing new items or delving deeper into a current selection once a user accepts a recommendation.
  - Make Recommendations: Decides which item to recommend once sufficient user preference information is gathered.
  - Add to Cart: Inquires whether the user wants to add an item to their cart after a recommendation is accepted.
- Observations: Similar to a real salesperson, the system agent can observe the user's appearance (e.g., from the user profile) but does not have access to private user information such as name or profession. This aids in inferring user preferences.
- Self-Augmented Instructions: To prevent forgetting item details during a multi-turn conversation, the system agent uses self-augmented instructions, where its prompts are repeated in each conversation round.

The specific system agent instruction template is shown in Figure 5.
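A minimal sketch of the action set and the self-augmented prompting idea is below. The exact wording and the `build_system_prompt` helper are assumptions; only the six action names and the repeat-every-round behavior come from the paper.

```python
from enum import Enum

class SystemAction(Enum):
    DESCRIBE_ITEM = "describe item information"
    INQUIRE_PREFERENCES = "inquire about preferences"
    ADDRESS_QUERIES = "address user queries"
    TOPIC_TRANSFER = "topic transfer"
    MAKE_RECOMMENDATION = "make recommendations"
    ADD_TO_CART = "add to cart"

def build_system_prompt(environment, scene_items, user_appearance, history):
    """Self-augmented instruction: the role description, item list, and observable
    user appearance are repeated every round so the agent does not forget them."""
    actions = ", ".join(a.value for a in SystemAction)
    return (
        f"{environment}\n"
        f"You are a salesperson in this store. Observable customer appearance: {user_appearance}.\n"
        f"Items in the scene: {scene_items}.\n"
        f"Choose exactly one action ({actions}) and reply accordingly.\n"
        f"Conversation so far:\n{history}"
    )
```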
4.4.4. Moderator Agent
The moderator agent is designed to oversee and manage the conversation flow between the user agent and the system agent.

- Role: It automatically decides when to terminate the conversation and tracks whether the user agent accepts or rejects recommended items based on its preset preferences.
- Termination Conditions: The conversation terminates under specific natural-language conditions:
  1. Successful Recommendation: The system agent completes a recommendation, the user agent accepts it, and the recommended item aligns with the user agent's predefined preferences. Additionally, the system action is not topic transfer. Conversations terminated under this condition are considered valid data.
  2. Repeated Rejection: The user agent rejects recommended items multiple times (e.g., more than three times).
  3. Maximum Turns: The conversation reaches a predefined maximum number of turns.

Conversations ending under conditions (2) and (3) are categorized as invalid and discarded.

The specific moderator agent instruction template is shown in Figure 5.

Note: The combined figure (Figure 5) with the instruction templates for the user agent, system agent, and moderator agent is not reproduced here; the templates are described in the accompanying text.
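The termination logic can be sketched as a simple check. The turn cap of 20 is an assumption (the paper only states "a predefined maximum number of turns"); the three branches mirror the conditions listed above.

```python
MAX_TURNS = 20        # assumed cap; the paper only states "a predefined maximum number of turns"
MAX_REJECTIONS = 3

def should_terminate(turn_count, rejections, last_action, accepted, matches_preference):
    """Return (terminate, valid) following the moderator's three conditions."""
    # (1) Successful recommendation: accepted, preference-consistent, and not a topic transfer.
    if accepted and matches_preference and last_action != "topic transfer":
        return True, True
    # (2) The user agent rejected recommended items too many times.
    if rejections > MAX_REJECTIONS:
        return True, False
    # (3) The conversation reached the maximum number of turns.
    if turn_count >= MAX_TURNS:
        return True, False
    return False, False
```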
4.5. Dataset Construction
- Multimodal Context: The conversational scenario integrates visual (scene snapshots) and textual (dialogue history, instructions) elements.
- LLM Agents: The user agent and system agent are powered by GPT-4V (gpt-4-1106-vision-preview version) because they need to "see" and process visual cues from the scene snapshots. The moderator agent, which does not require visual cues, uses GPT-4 (gpt-4-1106-preview version).
- Dialogue Flow: The dialogue initiates with the system agent greeting the user agent. Interactions continue through multiple rounds until the moderator agent intervenes based on its termination conditions.
- Framework: The role-playing framework is built upon the open-source library ChatArena [37].
- Generation Parameters:
  - Temperature: Standardized at 0.8 across all agents to balance creativity and coherence.
  - Maximum Generation Tokens: Tailored for each agent type: 120 for the system agent, 80 for the user agent, and 20 for the moderator agent. This ensures efficient and appropriately sized responses for each role.

This automated process allows for the rapid construction of large-scale, high-quality dialogues with significantly reduced human intervention.
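The overall loop can be sketched as follows. The paper builds on ChatArena [37]; this is a simplified, library-agnostic version in which `system_agent`, `user_agent`, and `moderator_agent` are hypothetical wrappers around the MM-LLM / LLM calls. Only the temperature and token limits are values reported in the paper.

```python
TEMPERATURE = 0.8                                           # reported in the paper
MAX_TOKENS = {"system": 120, "user": 80, "moderator": 20}   # reported in the paper

def run_episode(system_agent, user_agent, moderator_agent,
                opening="Hello! Welcome to our store."):
    """Alternate system/user turns until the moderator terminates the episode."""
    history = [("system", opening)]            # the system agent opens with a greeting
    while True:
        user_reply = user_agent.respond(history, temperature=TEMPERATURE,
                                        max_tokens=MAX_TOKENS["user"])
        history.append(("user", user_reply))
        verdict = moderator_agent.judge(history, temperature=TEMPERATURE,
                                        max_tokens=MAX_TOKENS["moderator"])
        if verdict["terminate"]:
            return history, verdict["valid"]   # valid episodes are kept as dataset dialogues
        system_reply = system_agent.respond(history, temperature=TEMPERATURE,
                                            max_tokens=MAX_TOKENS["system"])
        history.append(("system", system_reply))
```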
5. Experimental Setup
5.1. Datasets
The primary dataset used for experiments is the newly constructed SCREEN benchmark.
5.1.1. SCREEN Dataset Characteristics
The SCREEN dataset is designed to facilitate situated conversational recommendations.
- Source: Constructed via a role-playing method based on multimodal large language models, using VR snapshots from the SIMMC 2.1 dataset.
- Scale: Comprises over 20,000 dialogues across 1,500 diverse situations.
- Domain: Fashion and furniture retail spaces.
- Key Features:
  - Situational Context: Each dialogue is associated with unique situational information including the scene snapshot, spatiotemporal (time), and environmental (climate) details.
  - Personalized User Information: Incorporates user preferences, user profiles (age, gender, profession, emotional state, appearance), and Big Five personality traits.
  - Enriched Item Representation: Beyond intrinsic attributes, items have situational attributes and subjective descriptions generated by MM-LLMs.
  - Recommendation Candidates: Each dialogue is associated with a unique list of recommendation candidates from the conversational scene, unlike traditional CRS where all conversations might access a communal candidate list. This necessitates modeling item representation within the scene.
- Data Split: The dataset is divided into training, validation, and test sets with an 8:1:1 ratio.
- Total #dialogues (train/valid/test): 16,089 / 2,011 / 2,012
- Total #utterances (train/valid/test): 172,152 / 20,713 / 21,528
- Total #scene snapshots: 1,566
- Avg. #words per user turn: 15.7
- Avg. #words per assistant turn: 20
- Avg. #utterances per dialogue: 10.7
- Avg. #objects mentioned per dialogue: 4.3
- Avg. #objects in scene per dialogue: 19.7
The following are the results from Table 1 of the original paper:
| Dataset | Task | Modality | Participants | SB | SR | Domains | #Image | #Dialogue |
| REDIAL [20] | CRS | Textual | Crowd Workers | ✗ | ✗ | Movie | - | 10,006 |
| TG-REDIAL [46] | CRS | Textual | Crowd Workers | ✗ | ✗ | Movie | - | 10,000 |
| INSPIRED [11] | CRS | Textual | Crowd Workers | | | Movie | - | 1,001 |
| MMD [33] | Multimodal CRS | Textual+Visual | Crowd Workers | ✗ | | Fashion | 4,200* | 105,439 |
| SIMMC 2.0 [17] | Situated Dialogue | Textual+Visual | Crowd Workers | | | Fashion, Furniture | 1,566† | 11,244 |
| SURE [25] | Multimodal CRS | Textual+Visual | Crowd Workers | ✓ | ✗ | Fashion, Furniture | 1,566 | 12,180 |
| SCREEN | Situated CRS | Textual+Visual | LLM agents | ✓ | ✓ | Fashion, Furniture | 1,566 | 20,112 |
The following are the results from Table 2 of the original paper:
| Total #dialogue(train/valid/test) | 16,089/2,011/2,012 |
| Total #utterances(train/valid/test) | 172,152/20,713/21,528 |
| Total #scene snapshots | 1,566 |
| Avg. #words per user turns | 15.7 |
| Avg. #words per assistant turns | 20 |
| Avg. #utterances per dialog | 10.7 |
| Avg. #objects mentioned per dialog | 4.3 |
| Avg. #objects in scene per dialog | 19.7 |
5.1.2. Dataset Selection Rationale
SCREEN was chosen because it is explicitly designed to address the situated conversational recommendation problem, which is the focus of this work. It uniquely targets the integration of situational context to model situated item representations and situated user preferences. The use of LLM agents for construction makes it high-quality and large-scale, providing a robust experimental foundation for this new research direction.
5.2. Evaluation Metrics
The paper proposes three subtasks and uses a variety of metrics for evaluation, including both automatic and human evaluations.
5.2.1. System Action Prediction
This subtask measures the system's ability to predict the next system action (e.g., make recommendations, inquire about preferences) based on dialogue history and situational context.
- Conceptual Definition: These metrics quantify the accuracy of a model's classification performance.
  Precision measures the proportion of correctly predicted positive instances among all instances predicted as positive. Recall measures the proportion of correctly predicted positive instances among all actual positive instances. The F1-score is the harmonic mean of Precision and Recall, providing a single metric that balances both.
- Mathematical Formula:
  - Precision: $ \mathrm{Precision} = \frac{TP}{TP + FP} $
  - Recall: $ \mathrm{Recall} = \frac{TP}{TP + FN} $
  - F1-score: $ \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $
- Symbol Explanation:
  - TP: True Positives (correctly predicted positive instances).
  - FP: False Positives (incorrectly predicted positive instances).
  - FN: False Negatives (positive instances incorrectly predicted as negative).
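These metrics can be computed with a few lines of scikit-learn. The toy labels below and the macro averaging are assumptions for illustration; the paper does not state which averaging scheme it uses.

```python
from sklearn.metrics import precision_recall_fscore_support

# Gold vs. predicted system actions for a handful of test turns (toy example).
y_true = ["make recommendations", "inquire about preferences", "add to cart"]
y_pred = ["make recommendations", "describe item information", "add to cart"]

# The averaging scheme is an assumption; the paper does not state micro vs. macro.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}")
```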
5.2.2. Situated Recommendation
This subtask evaluates the system's ability to recommend the most appropriate item from the scene by aligning item attributes with the user's situated preferences, considering the scenario and dialogue history. Recommendations are only evaluated when the system action is explicitly "make recommendations."
- Conceptual Definition:
Recall@k(R@k) evaluates how often the ground truth item is present within the top items recommended by the system. A higherRecall@kindicates that the model is better at including the correct recommendation among its top suggestions. - Mathematical Formula: $ \mathrm{Recall@k} = \frac{\text{Number of users for whom the ground truth item is in the top-k recommendations}}{\text{Total number of users requiring a recommendation}} $
- Symbol Explanation:
- $k$: The number of top recommendations considered (e.g., $k \in \{1, 2, 3\}$, reported as R@1, R@2, and R@3).
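A direct implementation of this metric is a short counting loop; the item identifiers below are made up for illustration.

```python
def recall_at_k(ranked_items, ground_truth, k):
    """Fraction of recommendation turns whose gold item appears in the top-k list."""
    hits = sum(1 for ranking, gold in zip(ranked_items, ground_truth) if gold in ranking[:k])
    return hits / len(ground_truth)

rankings = [["item_7", "item_2", "item_9"], ["item_4", "item_1", "item_3"]]
golds = ["item_2", "item_8"]
print(recall_at_k(rankings, golds, k=2))  # 0.5 -- only the first turn's gold item is in the top 2
```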
5.2.3. System Response Generation
This subtask assesses the quality of the natural language responses generated by the system.
- Conceptual Definition:
- Perplexity (PPL): A measure of how well a probability distribution or language model predicts a sample. Lower perplexity indicates a higher degree of confidence in predicting the next word, which generally correlates with higher fluency and naturalness in generated text.
- BLEU (Bilingual Evaluation Understudy): A metric for evaluating the quality of machine-translated or, in this case, machine-generated text. It measures the similarity between the generated text and a set of reference texts, focusing on the overlap of n-grams (sequences of words). BLEU-2 and BLEU-3 consider 2-gram and 3-gram overlaps, respectively. Higher BLEU scores indicate greater similarity to human-written references.
- Distinct n-gram (DIST-n): Measures the diversity of generated responses. It calculates the ratio of unique n-grams to the total number of n-grams in the generated text. A higher DIST-n score indicates greater lexical diversity and less repetition in the generated responses. DIST-1 measures unique unigrams, and DIST-2 measures unique bigrams.
- Mathematical Formula:
- Perplexity (PPL): For a given test set of words $W = w_1, w_2, \ldots, w_M$ and a language model $P$: $ \mathrm{PPL}(W) = P(w_1, w_2, \ldots, w_M)^{-\frac{1}{M}} = \sqrt[M]{\frac{1}{P(w_1, w_2, \ldots, w_M)}} $, which can also be expressed using conditional probabilities: $ \mathrm{PPL}(W) = \left( \prod_{i=1}^{M} \frac{1}{P(w_i \mid w_1, \ldots, w_{i-1})} \right)^{\frac{1}{M}} $
- BLEU: The BLEU score is the geometric mean of modified n-gram precisions, multiplied by a brevity penalty (BP): $ \mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log P_n \right) $, where $P_n$ is the modified n-gram precision (with counts clipped against the references) and $w_n$ is the weight for each n-gram order (often uniform, $w_n = 1/N$). The brevity penalty is: $ \mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{(1-r/c)} & \text{if } c \le r \end{cases} $ (This paper reports BLEU-2 and BLEU-3, so $N$ is 2 or 3 and $w_n$ is $1/2$ or $1/3$, respectively.)
- Distinct n-gram (DIST-n): $ \mathrm{DIST\text{-}}n = \frac{\text{Number of unique } n\text{-grams in generated responses}}{\text{Total number of } n\text{-grams in generated responses}} $
- Symbol Explanation:
  - $W$: A sequence of words (the test set).
  - $M$: The number of words in the test set.
  - $P(w_i \mid w_1, \ldots, w_{i-1})$: The probability of word $w_i$ given the preceding words, as predicted by the language model.
  - $c$: Length of the candidate (generated) text.
  - $r$: Effective reference corpus length (sum of shortest reference lengths for each sentence).
  - $P_n$: Modified precision for n-grams.
  - $N$: Maximum n-gram order considered (e.g., 2 for BLEU-2, 3 for BLEU-3).
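BLEU-n and DIST-n can be computed as in the sketch below, which uses NLTK's sentence-level BLEU with smoothing; the example sentences and the choice of whitespace tokenization and smoothing method are assumptions, not the paper's exact evaluation script.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_n(reference, hypothesis, n):
    """Sentence-level BLEU-n with uniform n-gram weights (smoothed)."""
    weights = tuple([1.0 / n] * n)
    return sentence_bleu([reference.split()], hypothesis.split(),
                         weights=weights,
                         smoothing_function=SmoothingFunction().method1)

def distinct_n(responses, n):
    """Ratio of unique n-grams to total n-grams over all generated responses."""
    all_ngrams = []
    for resp in responses:
        tokens = resp.split()
        all_ngrams += [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(all_ngrams)) / max(len(all_ngrams), 1)

generated = ["I recommend this black short-sleeved T-shirt for basketball."]
reference = "I would recommend this black top with a flame pattern for basketball."
print(bleu_n(reference, generated[0], n=2), distinct_n(generated, 1), distinct_n(generated, 2))
```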
5.2.4. Human Evaluation Metrics
For human evaluation, three annotators assessed generated responses based on several criteria.
- Situated Relevance (SR):
- Conceptual Definition: This novel metric evaluates whether the system's responses accurately reference items in the scene and appropriately consider the user's appearance and climate conditions. It assesses how well the response is grounded in the current situation.
- Scoring: Scale from 0 to 2 (0: no relevance, 2: high relevance).
- Fluency:
- Conceptual Definition: Measures the grammatical correctness, naturalness, and readability of the generated responses.
- Scoring: Scale from 0 to 2 (0: no fluency, 2: smooth fluency).
- Informativeness:
- Conceptual Definition: Assesses whether the responses provide useful, rich, and relevant information to the user.
- Scoring: Scale from 0 to 2 (0: no informativeness, 2: rich information).
- Fleiss's Kappa ($\kappa$):
  - Conceptual Definition: A statistical measure for assessing the reliability of agreement between a fixed number of raters (annotators) when assigning categorical ratings to a number of items or classifications. It corrects for chance agreement, providing a more robust measure than simple percent agreement.
  - Mathematical Formula: $ \kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e} $, where $\bar{P}$ is the observed agreement among raters and $\bar{P}_e$ is the expected agreement by chance.
  - Symbol Explanation:
    - $\bar{P}$: The mean proportion of all ratings that are in agreement.
    - $\bar{P}_e$: The mean proportion of all ratings that would be in agreement by chance.
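Fleiss's Kappa for a three-annotator setup like this one can be computed with statsmodels; the ratings matrix below is a toy example, not data from the paper.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = evaluated responses, columns = the three annotators,
# values = the 0/1/2 rating for one criterion (e.g., Situated Relevance).
ratings = np.array([
    [2, 2, 1],
    [1, 1, 1],
    [0, 1, 0],
    [2, 2, 2],
])

table, _ = aggregate_raters(ratings)  # per-item counts for each rating category
print(fleiss_kappa(table))            # chance-corrected agreement
```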
5.3. Baselines
The paper evaluates several representative multimodal baseline models on the SCREEN dataset:
- SimpleTOD+MM [5]: An extension of the SimpleTOD model, adapted for multimodal inputs (e.g., from the SIMMC dataset). It frames system action prediction as a causal language modeling task and finetunes a pretrained GPT-2 model to generate both system actions and responses.
- Multi-Task Learning [18]: A GPT-2-based model that leverages multi-task learning techniques. This approach trains the model to perform multiple tasks simultaneously, often leading to improved generalization and performance across related tasks. It achieved robust performance on the SIMMC dataset.
- Encoder-Decoder [12]: An end-to-end encoder-decoder model built upon BART. BART is a denoising autoencoder that is particularly effective for generation tasks. This model achieved first place in the SIMMC competition, indicating its strong performance in situated interactive multimodal conversations.
- Reasoner [26]: This model employs a multi-step reasoning method to process information and generate responses. Its reasoning capabilities allowed it to perform exceptionally well in the SIMMC 2.0 competition.
- MiniGPT4 [47]: A widely used multimodal LLM. For SCREEN evaluation, the dialogue history and scene snapshot are concatenated as input. All three subtasks (system action prediction, situated recommendation, system response generation) are framed as response generation tasks for this model. It benefits from the LLM's advanced language understanding and generation.
- GPT-4o [30]: A state-of-the-art multimodal LLM developed by OpenAI. To ensure a fair comparison, it follows the same input concatenation strategy as MiniGPT4 (dialogue history + scene snapshot) and treats all subtasks as response generation. Official configurations are used during inference.
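For the LLM baselines, the "everything as response generation" framing can be pictured with a small prompt-construction sketch. The task wording and the `mm_llm.generate` wrapper are assumptions; the paper only states that the dialogue history and scene snapshot are concatenated as input.

```python
def build_eval_prompt(dialogue_history, subtask):
    """Frame a SCREEN subtask as plain text generation for an MM-LLM baseline."""
    task_instruction = {
        "action": "Predict the next system action.",
        "recommendation": "Which numbered item in the scene should be recommended next?",
        "response": "Generate the next system response.",
    }[subtask]
    return f"{dialogue_history}\n\n{task_instruction}"

# prompt = build_eval_prompt(history_text, "recommendation")
# output = mm_llm.generate(image=scene_snapshot, text=prompt)  # hypothetical wrapper
```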
6. Results & Analysis
6.1. Core Results Analysis
The paper presents both automatic and human evaluation results to validate the SCREEN benchmark and assess the performance of baseline models.
6.1.1. Automatic Evaluation
The automatic evaluation results for the three subtasks (System Action Prediction, Situated Recommendation, and System Response Generation) on the SCREEN dataset are presented in Table 3.
The following are the results from Table 3 of the original paper:
(Column groups: System Action Prediction covers Precision/Recall/F1; Situated Recommendation covers R@1/R@2/R@3; System Response Generation covers PPL, BLEU, and DIST.)

| Model | Precision | Recall | F1 | R@1 | R@2 | R@3 | PPL (↓) | BLEU-2 | BLEU-3 | DIST-1 | DIST-2 |
| SimpleTOD+MM [5] | 0.715 | 0.736 | 0.725 | 0.085 | 0.161 | 0.244 | 19.3 | 0.089 | 0.041 | 0.028 | 0.114 |
| Multi-Task Learning [18] | 0.727 | 0.753 | 0.740 | 0.107 | 0.199 | 0.298 | 17.5 | 0.105 | 0.054 | 0.031 | 0.112 |
| Encoder-Decoder [12] | 0.838 | 0.856 | 0.847 | 0.148 | 0.277 | 0.425 | 12.7 | 0.140 | 0.071 | 0.038 | 0.178 |
| Reasoner [26] | 0.902 | 0.925 | 0.913 | 0.190 | 0.395 | 0.588 | 10.2 | 0.181 | 0.078 | 0.043 | 0.192 |
| MiniGPT4 [47] | 0.946 | 0.951 | 0.948 | 0.234 | 0.498 | 0.697 | 4.31 | 0.252 | 0.117 | 0.081 | 0.310 |
| GPT-4o [13] | 0.951 | 0.974 | 0.962 | 0.284 | 0.557 | 0.751 | - | 0.276 | 0.132 | 0.107 | 0.337 |
Analysis of Automatic Evaluation Results:
- Overall Performance: GPT-4o consistently achieves the highest scores across all metrics for all three subtasks, which is expected for a state-of-the-art proprietary multimodal LLM.
- LLM Advantage: Among the open-source models, MiniGPT4 significantly outperforms the other baselines (SimpleTOD+MM, Multi-Task Learning, Encoder-Decoder, Reasoner). This highlights the superior language understanding and generation capabilities of LLM-based models in this complex situated conversational recommendation setting.
- Weaker Baselines: SimpleTOD+MM and Multi-Task Learning (both GPT-2 based) show the weakest performance, indicating that simpler architectures struggle to capture the nuances of situational context and complex conversational dynamics.
- Mid-tier Performance: The Encoder-Decoder and Reasoner models perform reasonably well, with Reasoner showing a slight edge due to its multi-step reasoning approach, particularly in system action prediction and situated recommendation.
- Challenge of Situated Recommendation: A notable observation is that all models, including GPT-4o, show relatively low Recall@k scores for Situated Recommendation compared to the other tasks. For instance, GPT-4o achieves an R@1 of 0.284, meaning it correctly recommends the top item only about 28.4% of the time when a recommendation is made. This underscores the inherent difficulty of accurately capturing dynamic situated user preferences and situated item representations to provide precise recommendations in complex scenarios.
- System Response Generation: MiniGPT4 and GPT-4o also excel in System Response Generation metrics (lower PPL, higher BLEU, higher DIST-n), indicating they produce more fluent, coherent, and diverse responses. The lack of a PPL value for GPT-4o is noted, likely because it is a closed-source model whose internal token probabilities are not directly accessible.

Overall, the automatic evaluation confirms that SCREEN presents a challenging benchmark, particularly for the situated recommendation task, and demonstrates the potential of advanced LLMs while also highlighting significant room for future improvements.
6.1.2. Human Evaluation
Human annotators evaluated system-generated responses based on Situated Relevance (SR), Fluency, and Informativeness. Fleiss's Kappa was calculated to assess inter-annotator agreement.
The following are the results from Table 4 of the original paper:
| Model | SR | κ | Fluency | κ | Inform. | κ |
| SimpleTOD+MM [5] | 0.74 | 0.42 | 1.31 | 0.41 | 0.89 | 0.48 |
| Multi-Task Learning [18] | 0.98 | 0.48 | 1.35 | 0.45 | 1.01 | 0.56 |
| Encoder-Decoder [12] | 1.04 | 0.51 | 1.57 | 0.47 | 1.17 | 0.51 |
| Reasoner [26] | 1.19 | 0.47 | 1.61 | 0.52 | 1.48 | 0.48 |
| MiniGPT4 [47] | 1.42 | 0.55 | 1.91 | 0.52 | 1.70 | 0.49 |
| GPT-4o [13] | 1.50 | 0.50 | 1.95 | 0.49 | 1.75 | 0.52 |
Analysis of Human Evaluation Results:
- Inter-Annotator Agreement: Fleiss's Kappa scores are generally within the [0.4, 0.6] range (e.g., 0.50 for GPT-4o on SR), indicating moderate agreement among annotators. This suggests the human evaluation is reasonably reliable.
- Model Performance Trends: The human evaluation results closely align with the automatic evaluations.
  - GPT-4o and MiniGPT4 again demonstrate superior performance, generating responses that are more situation-relevant, fluent, and informative. GPT-4o consistently achieves the highest average scores across all human metrics.
  - Reasoner and Encoder-Decoder show comparable levels of situation relevance and fluency. However, Reasoner's responses are generally more informative (average 1.48) due to its multi-step reasoning process, which allows it to gather and present necessary information more effectively.
  - SimpleTOD+MM and Multi-Task Learning again perform the weakest, especially in Situated Relevance and Informativeness.
- Validation of Subtasks: The strong correlation between human and automatic evaluations supports the effectiveness of the three designed subtasks in assessing SCRS performance.
6.1.3. Dataset Quality Verification (Human Evaluation)
To verify the reliability of the SCREEN dataset itself, a comparative human evaluation was conducted against SIMMC 2.1. Five human evaluators compared 50 dialogue pairs (one from SCREEN, one from SIMMC 2.1) based on four criteria.
The following are the results from Figure 6 of the original paper:

The image is a chart comparing SCREEN with SIMMC 2.1. Across the four evaluation dimensions, SCREEN performs strongly on situation relevance (52%) and personality (82%), and also shows an advantage on user state (80%) and recommendation appropriateness (48%).
Figure 6: Human evaluation results of dataset comparison. "Rec." denotes "Recommendation".
Analysis of Dataset Comparison Results:
- Higher Win Percentages for SCREEN: The comparative results, presented in Figure 6, show that the SCREEN dataset achieves higher "win percentages" across all evaluation criteria than the SIMMC 2.1 dataset.
  - Situation Relevance: SCREEN wins 52% of the time, SIMMC 2.1 wins 28%.
  - Personality: SCREEN wins 82%, SIMMC 2.1 wins 4%.
  - User State: SCREEN wins 80%, SIMMC 2.1 wins 4%.
  - Recommendation Appropriateness: SCREEN wins 48%, SIMMC 2.1 wins 20%.
- Reliability: This outcome demonstrates the higher perceived quality and reliability of the SCREEN dataset, particularly in reflecting user personality, user state, and situational relevance in recommendations, compared to a well-known crowd-sourced dataset. This suggests that the LLM role-playing approach can generate more natural and contextually appropriate dialogues for situated recommendations.
6.1.4. Illustrative Case Study
Figure 7 provides an illustrative example to qualitatively demonstrate the quality of responses generated by baseline models on the SCREEN dataset.
Context:

- Situation: Late afternoon, clothing store, relatively dry and hot air (summer).
- User Mood: Excited and joyful.
- User Appearance: Blue long-sleeved shirt and sports pants, crew-cut hairstyle.
- Conversation History:
  - User: "Hello, I want to buy a sportswear suitable for playing basketball. Do you have any recommendations?"
  - System: "Of course, do you prefer long sleeves or short sleeves?"
  - User: "I usually prefer to wear long sleeves, but the weather is too hot, so I want to buy a short-sleeved one this time."
Generated Responses and Analysis:

- SimpleTOD+MM: "I will recommend this clothing to you." (Fails to specify the clothing; lacks informativeness and situation relevance.)
- Multi-Task Learning: "This brown pair of pants looks very good." (Irrelevant to the user's request for sportswear suitable for basketball in hot weather, and does not explicitly recommend.)
- Encoder-Decoder: "Yes, it looks great." (Generic, uninformative, and does not address the user's need.)
- Reasoner: "The waters o ot. I belveyo ill le hs a." (The text is corrupted or an incoherent snippet, indicating a generation failure or an uncorrected typo in the paper's excerpt. Assuming this meant "The weather is hot. I believe you will like this," it still lacks a specific recommendation.)
- MiniGPT4: "Of course, I recommend this black short-sleeved T-shirt to you. It is very suitable for playing basketball." (Successfully uses contextual information, namely the hot weather and the basketball scenario, and makes a specific, appropriate recommendation, showing enhanced informativeness.)
- Groundtruth: "Yes, the weather is hot. I would recommend this black top with a flame pattern in the middle. It makes you look cool and is suitable for playing basketball." (Similar to MiniGPT4, but even more descriptive, mentioning a "flame pattern" for a "cool" look, showing stronger situated item representation and informativeness.)

Discussion: The illustrative case confirms that models like Reasoner and MiniGPT4 can leverage situational context (climate: summer) and conversational history (playing basketball, preference for short sleeves due to heat) for more appropriate recommendations. MiniGPT4 particularly excels in generating informative content. However, the comparison with the groundtruth reveals that even the best baselines still have significant room for improvement in fully addressing situated recommendations, particularly in generating rich, specific, and contextually nuanced item descriptions that truly enhance the user experience. This further supports the notion that SCRS remains a challenging area with considerable research potential.
6.2. Ablation Studies / Parameter Analysis
The paper does not explicitly detail ablation studies or specific parameter analyses of the proposed dataset construction framework (e.g., the impact of different LLM versions, temperature settings, or specific instruction components). The focus is primarily on presenting the benchmark and evaluating baseline models on it. The fixed parameters (e.g., temperature 0.8, specific token limits) are mentioned as part of the construction process.
7. Conclusion & Reflections
7.1. Conclusion Summary
This work introduces Situated Conversational Recommendation Systems (SCRS), a novel and highly promising paradigm that integrates situational factors into traditional multimodal conversational recommendations. The authors identify that dynamic aspects like situated item representation (item appeal varying with context) and situated user preference (user interests shifting with situation) are crucial for realistic recommendations. To accelerate research in this underexplored field, they constructed SCREEN, a comprehensive and high-quality benchmark. SCREEN was built using an innovative and efficient role-playing approach powered by multimodal large language models, simulating interactions between user and recommender agents in a co-observed scene. The benchmark comprises over 20,000 dialogues across 1,500 diverse situations. To evaluate SCRS, the paper proposes three essential subtasks: system action prediction, situated recommendation, and system response generation. Evaluations of several representative baseline models on SCREEN demonstrate the benchmark's quality and the significant challenges that remain, particularly in situated recommendation, thereby establishing a solid foundation for future research in bridging the gap between traditional and real-world conversational recommendations.
7.2. Limitations & Future Work
The authors acknowledge several limitations of their work, primarily stemming from the use of LLM agents for data generation:
- Hallucinations: LLMs can occasionally generate responses with hallucinations [1], meaning they might produce factually incorrect or nonsensical information. This compromises the quality and reliability of the generated dataset.
- Post-processing: To address hallucinations and enhance data quality, the authors suggest future work will include designing post-processing measures, such as verification and corrections by multiple moderators (potentially human or more sophisticated LLM agents).
- Ethical Considerations: Rigorous ethical considerations are paramount, particularly in preventing the generation of harmful content and ensuring that no sensitive or private information is involved in the dataset. Manual sampling inspection is proposed as a way to alleviate this to some extent.
- Room for Model Improvement: Despite the strong performance of LLM-based baselines like MiniGPT4 and GPT-4o, the results for situated recommendation indicate that there is still substantial room for improvement before models fully address the complexities of situated recommendations.
7.3. Personal Insights & Critique
This paper presents a highly relevant and forward-thinking approach to conversational recommendation. The explicit focus on situational context is a critical step towards making CRSs truly aligned with human-like interactions in the physical world.
- Novelty and Impact: The introduction of SCRS as a new paradigm and the SCREEN benchmark is a significant contribution. It redefines what constitutes a "good" recommendation by shifting focus from static preferences to dynamic, context-aware ones. This has immense potential for real-world applications in retail, tourism, and personalized services.
- LLM-driven Data Generation: The role-playing method using multimodal LLMs for dataset construction is a clever and efficient solution to the perennial problem of data scarcity in complex AI research areas. It demonstrates the power of LLMs not just for direct application but also as tools for research infrastructure. This approach can be transferred to other domains requiring large-scale, context-rich dialogue datasets, potentially reducing dependence on expensive and time-consuming human annotation.
- Challenge of Situated Understanding: The relatively low scores on the situated recommendation task across all baselines, even GPT-4o, highlight that situated understanding remains a profound challenge for current AI. It is not enough to recognize objects; systems must infer their situated appeal and align it with a user's situated preference, which requires deep contextual reasoning. This paper effectively sets a challenging new frontier for research.
- Critique/Areas for Improvement:
  - Transparency of LLM Prompts: While templates are shown, the exact, full prompts used for the LLM agents could be provided as supplementary material for reproducibility and deeper analysis of agent behavior.
  - Beyond Visual Situations: The current "situations" primarily rely on visual scenes, temporal, and climatic information. Future work could explore more abstract or social situational factors (e.g., the user's companion, social event context, or implicitly expressed budget constraints) to further enrich the SCRS paradigm.
  - User Feedback Loop: The current framework relies on pre-defined user preferences for the user agent. Incorporating a more dynamic user feedback loop during interaction, where the user agent can genuinely learn and adapt its preferences based on the interaction (rather than just adhering to presets), could lead to even more realistic dialogues.
  - Interpretability: Given the complexity of situated recommendations, future models trained on SCREEN should also aim for interpretability, explaining why a particular item is recommended in a given situation.

Overall, SCREEN is a well-designed benchmark that pushes the boundaries of conversational recommendation towards a more holistic and realistic understanding of user needs in dynamic environments, offering a rich platform for future advancements.