BlendScape: Enabling End-User Customization of Video-Conferencing Environments through Generative AI

Published: 10/11/2024

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

BlendScape enables personalized video-conferencing environments by blending physical and digital backgrounds using generative AI, guided by multimodal interaction. User studies show improved collaboration potential with needed controls for visual distractions.

Abstract

BlendScape: Enabling End-User Customization of Video-Conferencing Environments through Generative AI
Shwetha Rajaram∗ (Microsoft Research, United States, shwethar@umich.edu), Nels Numan∗ (Microsoft Research, United States, nels.numan@ucl.ac.uk), Balasaravanan Thoravi Kumaravel (Microsoft Research, United States, bala.kumaravel@microsoft.com), Nicolai Marquardt (Microsoft Research, United States, nicmarquardt@microsoft.com), Andrew D. Wilson (Microsoft Research, United States, awilson@microsoft.com)
Figure 1: Overview of BlendScape, a rendering and composition system for end-users to customize video-conference environments by leveraging AI image generation techniques.
ABSTRACT: Today’s video-conferencing tools support a rich range of professional and social activities, but their generic meeting environments cannot be dynamically adapted to align with distributed collaborators’ needs. To enable end-user customization, we developed BlendScape, a rendering and composition system for video-c

1. Bibliographic Information

Title: BlendScape: Enabling End-User Customization of Video-Conferencing Environments through Generative AI
Authors: Shwetha Rajaram, Nels Numan, Balasaravanan Thoravi Kumaravel, Nicolai Marquardt, and Andrew D. Wilson.
Affiliations: All authors are affiliated with Microsoft Research.
Journal/Conference: The 37th Annual ACM Symposium on User Interface Software and Technology (UIST '24). UIST is a premier international conference in the field of Human-Computer Interaction (HCI), known for showcasing technical and creative innovations in user interfaces.
Publication Year: 2024
Abstract: The paper addresses the limitation of generic, static environments in modern video-conferencing tools. To solve this, the authors developed BlendScape, a system that allows users to customize their meeting environments using generative AI. BlendScape's key features include blending users' physical or digital backgrounds into unified scenes and using multimodal interactions to guide the AI generation. An exploratory study with 15 users revealed that while they saw great potential for BlendScape in future collaborative activities, they also required more control to fix distracting or unrealistic visual artifacts. The paper demonstrates BlendScape's capabilities through various scenarios and suggests composition techniques to enhance the quality of the generated environments.
Original Source Link: /files/papers/68f2376925f61e44beef6019/paper.pdf. This is a formally published paper for the UIST '24 conference.

2. Executive Summary

Background & Motivation (Why):
- Core Problem: Video-conferencing has become integral to professional and social life, yet the environments within these tools are typically generic video grids or static virtual backgrounds. These "one-size-fits-all" spaces are not adaptable to the wide range of activities they host, leading to issues like "meeting fatigue," reduced user engagement, and a loss of important interpersonal communication cues.
- Importance & Gaps: While prior HCI research has identified effective design strategies for meeting environments (e.g., using spatial metaphors, conveying a shared context), there is a lack of tools that allow end-users—people without technical design skills—to implement these strategies dynamically and in real-time. Existing customization tools often require significant manual effort, making them unsuitable for adapting environments during a live meeting.
- Innovation: The paper introduces a novel approach by leveraging recent advances in generative AI. The core innovation is BlendScape, a system that empowers users to create custom meeting environments by blending their own physical or digital backgrounds into a single, unified, and thematically coherent space. This grounds the AI generation in the users' real context, making the experience more personal and relevant.
Main Contributions / Findings (What):
- The BlendScape System: The primary contribution is the design and implementation of BlendScape, a rendering and composition system that uses generative AI techniques (like inpainting and image-to-image translation) to enable real-time, end-user customization of video-conferencing environments.
- Exploratory User Study: The paper presents findings from a study with 15 end-users. The study found that participants could envision using such a system to facilitate collaboration, spark creativity, and set a social theme. However, it also highlighted a critical need for better user controls to mitigate visual flaws (e.g., distortions, unrealistic elements) produced by the AI.
- Demonstration of Expressiveness: Through several implemented scenarios (Design Brainstorming, Remote Education, Storytime), the authors demonstrate that BlendScape can support a majority of the effective environment design strategies identified from prior academic work.

3. Background & Related Work

This section explains the foundational concepts needed to understand the paper and situates BlendScape within the existing landscape of research.

Foundational Concepts:
- Generative AI: A branch of artificial intelligence where models learn patterns from existing data (like images, text, or sounds) to create new, original content. In this paper, the focus is on AI that generates images.
- Text-to-Image: An AI technique where a user provides a written description (a "prompt"), and the model generates an image that matches that description.
- Image-to-Image: An AI technique that modifies an existing input image based on a text prompt. It retains the structure of the original image while changing its style or adding new elements.
- Inpainting: A specific type of image-to-image technique where the AI fills in missing or selected parts of an image. The model generates content that seamlessly blends with the surrounding, untouched areas. BlendScape uses this to merge the backgrounds of different users.
- ControlNet: An auxiliary model that gives users more precise control over generative AI. It can guide image generation using additional inputs like depth maps (for 3D structure), edge detection (Canny), or human poses. BlendScape uses it to preserve the spatial layout of scenes.
- Large Language Models (LLMs): AI models, like GPT-3.5, trained on vast amounts of text data to understand and generate human-like language. BlendScape uses an LLM to automatically "enhance" a user's simple prompt with more descriptive keywords, leading to higher-quality images.
Previous Works & Design Strategies: The paper reviews prior research and commercial tools to identify three key roles that a video-conferencing environment can play. Image 10 provides a visual summary of these strategies.

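Image description: a chart of the taxonomy of environment design strategies (Figure 2 in the paper), analyzing how existing video-conferencing tools support distributed collaboration through three categories of strategies: shared context, spatial metaphors, and records of collaboration. Black boxes mark the design strategies that BlendScape supports.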
1. Establishing a Shared Context: Creating a common frame of reference.
  - Unified Meeting Space: Systems like Teams Together Mode or Waazam place users in a shared virtual room (e.g., an auditorium) to increase the sense of co-presence.
  - Conveying Task Space: Tools like MirrorBlender (for digital content) and ThingShare (for physical objects) allow collaborators to see and interact with the artifacts central to their task.
  - Setting a Theme: Systems like VideoPlay allow for playful interactions by compositing users into thematic environments, like a storybook.
2. Enabling Spatial Metaphors for Communication: Using virtual space to mimic real-world social cues.
  - Proxemic Metaphors: Gather allows users to start conversations by "walking" their avatars close to others. OpenMic lets users signal their intent to speak by moving to a "Virtual Floor."
  - Scaling Metaphors: Zoom's Speaker View makes the active speaker's video larger.
  - Consistent Layouts & Egocentric Views: Together Mode seats users around a virtual table to establish turn-taking patterns. Perspectives renders unique, egocentric viewpoints for each user to simulate eye contact.
3. Serving as a Record or Artifact of Collaboration:
  - History of Activities: MirrorVerse can record user movements and environment changes, allowing for later replay and analysis of the collaboration.
  - Environment as Artifact: The environment itself can be the product of collaboration, common in VR authoring tools like Quill. BlendScape explores this by having users collaboratively build a story's setting.
Differentiation: While prior work demonstrated what makes a good environment, BlendScape focuses on how to let end-users create these environments easily and in real-time. Unlike procedural systems (WordsEye) or static authoring tools (WorldSmith), BlendScape is designed for the dynamic context of live video calls. Its unique contribution is blending real user backgrounds, which grounds the AI-generated scenes in the collaborators' personal spaces, a feature not central to other systems.

4. Methodology (Core Technology & Implementation)

This section details the technical architecture and core features of the BlendScape system. Image 2 provides a high-level overview of the system's concept and interface.

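Image description: a schematic of the BlendScape system showing how users' backgrounds and task spaces are blended in a video-conferencing environment, with examples of the different composition modes, the automatic layout techniques, and the addition of custom objects. The bottom of the figure shows several themed environment scenes.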

System Requirements: The design of BlendScape was guided by three core requirements:
- R1: Expressing the meeting context: The system must allow users to align the environment's visuals and structure with the purpose of their meeting.
- R2: Supporting convincing illusions of a shared space: The system should composite users into the environment in a believable way to enhance co-presence, for example by preserving realistic lighting from their physical spaces.
- R3: Enabling coarse- and fine-grained customization: Users must be able to make both large-scale changes (generating a whole new scene) and small adjustments (editing a specific object).
BlendScape Interface & Tools: The user interface (UI) consists of a central canvas where the blended environment is displayed. Key tools include:
- Composition Modes: Users can choose between inpainting (blending their webcam backgrounds) or image-to-image (restyling the entire canvas, which may contain a user-uploaded image).
- Prompting: Users provide a Meeting Activity (e.g., "brainstorming") and a Meeting Theme (e.g., "treehouse") to guide the generation.
- Direct Manipulation: Users can click, drag, and pinch to reposition and resize their video feeds, which influences the final composition.
- Granular Editing: A selection tool allows users to circle a region of the scene and regenerate it with a new prompt (e.g., "add a chair").
- Layout and History: Automatic layout techniques can place users behind objects, and a history tool allows users to revisit and iterate on previous designs.
Generative AI Techniques: BlendScape employs two main generative strategies:
1. Blending Environments via Inpainting: This is the core "blending" feature. The system takes the video backgrounds of multiple users, preserves portions of them (the amount is user-controlled), and uses an inpainting model to generate a coherent scene that fills the gaps between them. This maintains realistic lighting and shadows from the users' actual environments, fulfilling R2.
2. Driving Environments via Image Priors (Image-to-Image): Users can upload a "prior" image (e.g., a photo of a library) or use a previously generated scene. The image-to-image technique then restyles this image according to the user's prompts (e.g., changing a modern library into a fantasy one) while preserving its main structural elements (like walls and tables).
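One way to picture the inpainting strategy is as a mask over the shared canvas: pixels inside the preserved portion of each user's background are kept, and everything else is handed to the model to generate. The sketch below is only an illustration under stated assumptions, not BlendScape's implementation. The `Rect` type, the `build_inpaint_mask` function, and the idea of modeling the user-controlled preservation amount as a `preserve` fraction that shrinks each background about its center are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned region of the canvas, in pixels (hypothetical helper)."""
    x: int
    y: int
    w: int
    h: int


def shrink(rect: Rect, preserve: float) -> Rect:
    """Keep only the central `preserve` fraction (0..1) of each dimension."""
    if not 0.0 <= preserve <= 1.0:
        raise ValueError("preserve must be between 0 and 1")
    new_w = round(rect.w * preserve)
    new_h = round(rect.h * preserve)
    return Rect(rect.x + (rect.w - new_w) // 2,
                rect.y + (rect.h - new_h) // 2,
                new_w, new_h)


def build_inpaint_mask(width, height, backgrounds, preserve):
    """Return a row-major grid: 1 = generate this pixel, 0 = keep it."""
    mask = [[1] * width for _ in range(height)]
    for bg in backgrounds:
        kept = shrink(bg, preserve)
        # Clamp to the canvas so off-canvas placements do not raise.
        for row in range(max(kept.y, 0), min(kept.y + kept.h, height)):
            for col in range(max(kept.x, 0), min(kept.x + kept.w, width)):
                mask[row][col] = 0
    return mask


if __name__ == "__main__":
    # Two users' backgrounds placed side by side on a 16x4 canvas.
    users = [Rect(0, 0, 6, 4), Rect(10, 0, 6, 4)]
    for row in build_inpaint_mask(16, 4, users, preserve=1.0):
        print("".join(str(v) for v in row))
```

With full preservation, only the gap between the two backgrounds is marked for generation. Lowering `preserve` widens the generated area, which trades fidelity to the users' real spaces for more freedom in the blended scene.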
Improving Composition Quality: To make the generated environments more useful and visually appealing, BlendScape includes several enhancement techniques:
- LLM-driven Prompt Padding: To help users who struggle with writing detailed prompts, BlendScape sends their basic Meeting Activity and Theme to a Large Language Model (GPT-3.5). The LLM returns a list of relevant objects and stylistic keywords (e.g., for "Hologram," it might suggest "Dynamic Lighting" and "Holographic Whiteboards"). These are automatically added to the prompt sent to the image generation model.
- Hidden Surface Removal (2.5D Scenes): To avoid the "floating heads" look, BlendScape creates the illusion of depth. It segments foreground objects (like tables or chairs) from the generated background and places them on a separate layer in front of the users' videos. This makes it look like users are sitting behind the furniture, creating a more immersive composition.
- Granular Scene Editing: Using the GLIGEN model, users can make targeted changes. They can select an area and use a prompt to add an object (e.g., "a whiteboard") or remove an unwanted artifact. GLIGEN is specifically designed to generate content that respects the specified location and scale.
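The prompt-padding idea can be sketched as a small merge step: take the user's Meeting Activity and Meeting Theme, ask a language model for related keywords, and append them to the prompt. The code below is a minimal sketch under assumptions. The prompt wording, the `pad_prompt` function, and the `fake_llm` stand-in are hypothetical, and no real LLM API is called; only the example keywords come from the description above.

```python
from typing import Callable, Iterable, List


def pad_prompt(activity: str, theme: str,
               suggest_keywords: Callable[[str, str], Iterable[str]],
               max_keywords: int = 8) -> str:
    """Append suggested keywords to a basic activity/theme prompt."""
    base = f"{theme} environment for {activity}"  # assumed prompt wording
    seen = set()
    keywords: List[str] = []
    for kw in suggest_keywords(activity, theme):
        if len(keywords) >= max_keywords:
            break
        cleaned = kw.strip()
        key = cleaned.lower()
        # Skip blanks and case-insensitive duplicates.
        if cleaned and key not in seen:
            seen.add(key)
            keywords.append(cleaned)
    return ", ".join([base] + keywords)


def fake_llm(activity: str, theme: str) -> List[str]:
    """Stand-in for the LLM call; returns canned suggestions."""
    return ["Dynamic Lighting", "Holographic Whiteboards",
            "dynamic lighting", " "]


if __name__ == "__main__":
    print(pad_prompt("brainstorming", "Hologram", fake_llm))
```

Passing the keyword source in as a callable keeps the merge logic testable without network access; a real deployment would replace `fake_llm` with a call to whichever model is in use.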
Implementation Architecture: Image 1 shows the system's workflow.

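Image description: a schematic of the BlendScape system workflow, showing how multiple modules cooperate to customize a video-conferencing environment, including segmentation of user frames, text prompt enhancement, and background completion, and highlighting how the BlendScape client and server interact.
- Client-Server Model: The system consists of a BlendScape client, built in Unity, that runs on the user's machine and a BlendScape server that handles the heavy AI computation.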
- Video Streaming: Video streams are received from Microsoft Teams via its NDI (Network Device Interface) capabilities.
- AI Pipeline on Server:
  1. Person Segmentation: Users are separated from their backgrounds. To handle the fact that this is done on static frames (not live video), the system inpaints the area behind the person to create a complete, static background.
  2. Prompt Enhancement: User prompts are sent to GPT-3.5.
  3. Environment Generation: The processed prompt and background images are sent to Stable Diffusion (using the Realistic Vision 2.0 checkpoint) for inpainting or image-to-image generation, guided by ControlNet.
  4. Granular Editing: Edits are handled by GLIGEN.
  5. Object Segmentation: The final environment is processed by PixelLib to identify foreground objects for layering.
- Performance: Generation times averaged around 10-25 seconds, which is near-real-time but not instantaneous. A key limitation noted is that generation operates on static frames from the video, not the live video stream itself.
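The layering that follows object segmentation (the Hidden Surface Removal technique described earlier) amounts to painting three layers back to front: the generated background, the users' video, and the segmented foreground objects. The sketch below illustrates that ordering on a tiny character grid. It is an assumption-laden illustration rather than the system's rendering code: the `composite` and `build_demo_layers` functions and the use of `None` for transparent pixels are hypothetical, and all layers are assumed to share the same dimensions.

```python
from typing import List, Optional

Layer = List[List[Optional[str]]]  # None marks a transparent pixel


def composite(layers: List[Layer]) -> List[List[str]]:
    """Paint layers back to front; opaque pixels hide what lies behind."""
    height, width = len(layers[0]), len(layers[0][0])
    out = [[" "] * width for _ in range(height)]
    for layer in layers:
        for y in range(height):
            for x in range(width):
                if layer[y][x] is not None:
                    out[y][x] = layer[y][x]
    return out


def build_demo_layers():
    """Build a 5x3 scene: background '.', user 'U', foreground table 'T'."""
    width, height = 5, 3
    background: Layer = [["."] * width for _ in range(height)]
    user: Layer = [[None] * width for _ in range(height)]
    table: Layer = [[None] * width for _ in range(height)]
    for y in range(height):
        for x in (1, 2):
            user[y][x] = "U"      # segmented person, all three rows
    for x in range(4):
        table[2][x] = "T"         # segmented furniture, bottom row only
    return background, user, table


if __name__ == "__main__":
    background, user, table = build_demo_layers()
    for row in composite([background, user, table]):
        print("".join(row))
```

Because the table layer is painted last, it covers the lower part of the user, which is what makes the user appear to sit behind the furniture. Reversing the last two layers would draw the user over the table and bring back the "floating" look.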

5. Experimental Setup

The authors conducted an exploratory study, not a traditional quantitative experiment, to understand user perceptions and preferences.

Study Goal: To investigate (1) whether and how end-users find value in using generative AI for environment customization and (2) to what extent BlendScape enables them to achieve their design intentions.
Participants: 15 individuals from the authors' organization with roles like UX Designer, UX Researcher, and Product Manager. Seven were frequent users of generative AI, while eight were not.
Method: The 1-hour remote study consisted of three tasks followed by a discussion.
- Task 1: Inpainting Walkthrough: Participants were introduced to the inpainting feature in a "Vacation Planning" scenario. They were shown examples with different levels of background preservation (as seen in Image 8) and asked to comment on their quality.
  
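  Image description: a schematic of blended scenes with different degrees of background preservation. It shows the original video backgrounds alongside maximum, medium, and minimum background preservation, illustrating how BlendScape adjusts background blending when customizing video-conferencing environments.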
- Task 2: Image-to-Image Walkthrough: Participants learned about the image-to-image feature in a "Game Stream" scenario, where an arcade was restyled in a Minecraft theme. They compared results with and without background preservation (as seen in Image 9).
  
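  Image description: a set of schematics comparing how a game stream, arcade game footage, and live video are combined in BlendScape with and without background preservation, highlighting environment customization and background blending in video conferencing.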
- Task 3: Progressive Meeting Scenario: This was the main creative task. Participants were asked to design a series of environments for a "Research Paper" meeting. The scenario evolved: (1) two students discussing, (2) a professor joining, and (3) everyone feeling stressed. This prompted participants to adapt the environment dynamically.
- Discussion: A semi-structured interview explored potential use cases, benefits, and drawbacks of a system like BlendScape.
Data Collection & Analysis: The researchers collected audio transcripts, screen recordings, and all images generated during the sessions. They used a thematic analysis approach to identify recurring patterns and themes in the participants' feedback and design choices.

6. Results & Analysis

The study yielded rich qualitative insights into user preferences and the system's strengths and weaknesses.

Environment Customization Preferences:
- Authentic Over Artificial, With a Caveat: Most participants preferred blending their real physical backgrounds, as it created a greater sense of realism and co-presence ("trick our minds to believe that we're more in the same place"). However, this realism came with higher expectations. Users were much more critical of AI-generated flaws, such as warped perspectives or inconsistent geometry, when their own space was involved. Image 6 shows examples of such flaws.
  
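  Image description: a schematic of imperfections in BlendScape's blended environments. Red dashed boxes mark distortions and unrealistic spatial geometry produced during background blending, which arise from video backgrounds with different viewpoints or very different content (P7 and P5).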
- Context-Dependent Thematic Strength: Participants wanted strong, overt themes for social or creative events (e.g., a "mushroom forest" for storytelling) but preferred subtle, professional themes for work meetings (e.g., a clean "design studio"). Image 5 shows the variety of "brainstorming" versus "de-stressing" environments created.
  
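  Image description: two sets of video-conferencing scenes, labeled "brainstorming environments" and "de-stressing environments," showing the varied virtual backgrounds customized by individual participants (e.g., P6, P8) and illustrating BlendScape's flexibility in environment design.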
- Structuring Collaboration with Spatial Layouts: Participants saw value in using the environment to structure activities. For example, in the Remote Education scenario (Image 3), they appreciated using virtual tables to organize breakout groups, leveraging spatial metaphors for collaboration.
  
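  Figure 10: Scenario 2: Remote Education. To establish room layouts for seminar discussions, a professor restyles images with small and large tables (a, b) to resemble a library setting (c, d… Image description: a schematic from the BlendScape paper (Figure 10) showing the Remote Education scenario, in which a professor restyles images of tables of different sizes so that the room layouts resemble a library, letting students hold discussions behind spatial landmarks.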
- Balancing Richness and Distraction: While participants enjoyed the creativity, they were concerned that overly complex or visually "busy" environments could be distracting. They valued scenes that were engaging without detracting from the meeting's primary goal.
Benefits and Limitations of BlendScape:
- Expressive Leverage (Benefit): Participants were impressed with how easily they could generate a wide variety of high-quality, thematic environments with simple prompts. The system successfully lowered the barrier to creative expression. Image 7 shows the diverse range of scenes generated.
  
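  Image description: Figure 14 in the paper, showing examples of generated video-conferencing environments. Different people appear in a variety of simulated meeting backgrounds, including a design studio, a factory, and a traditional Chinese wedding scene, illustrating the diversity and creativity of BlendScape's environment customization.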
- Time, Effort, and Control (Limitation): The main drawback was the lack of fine-grained control to fix the AI's mistakes. Participants expressed a need for tools to correct distracting or unrealistic elements without having to regenerate the entire scene. The trial-and-error process of prompt engineering was also seen as a hurdle to using it smoothly in a live meeting. The ~20-second generation time, while fast for AI, was still considered a potential interruption.
  
  The "Storytime with Family" scenario (Image 4) exemplifies the creative potential, where users could iteratively change the scene to match a narrative.
  
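  Image description: a schematic of Figure 11 in the BlendScape paper, showing Scenario 3, "Storytime with Family." A grandmother and her granddaughter use BlendScape to transform their video backgrounds from everyday settings into a fairy-tale ballroom and a mushroom forest, illustrating image-to-image background restyling.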

7. Conclusion & Reflections

Conclusion Summary: The paper introduces BlendScape, a novel system that successfully demonstrates the potential of generative AI for end-user customization of video-conferencing environments. The key innovation is the blending of users' real-world backgrounds to create shared, context-aware spaces. The user study confirmed that people see value in this for a range of collaborative activities, from professional brainstorming to social storytelling. However, for such systems to be practically adopted, future work must focus on giving users more robust controls to mitigate visual imperfections and reduce the effort required to achieve their desired outcome.
Limitations & Future Work:
- Author-Acknowledged Limitations: The paper notes that BlendScape currently does not support spatialized audio or egocentric viewing perspectives, which are known strategies for improving communication.
- Implied Limitations from Study: The primary limitation identified through the study is the "black box" nature of the AI. Users need more direct control to fix errors and steer the generation predictably. Furthermore, the system's reliance on static frames (not live video) for generation is a significant technical hurdle for a truly dynamic and interactive experience.
- Future Work: The authors propose improving the composition techniques to address visual quality and providing users with better controls to fix distracting elements.
Personal Insights & Critique:
- Novelty and Significance: The concept of blending real user backgrounds is a powerful idea. It bridges the gap between fully artificial virtual worlds and the sterile reality of current video calls. This grounding in physical space makes the technology feel more personal and less like a gimmick. This paper is an important step toward making video-conferencing a more human-centric and contextually rich medium.
- Practical Challenges: The biggest open question is performance. Generating high-quality images in 10-25 seconds is impressive but still too slow for seamless, live interaction. The reliance on static frames is a major compromise. The future of such systems depends on achieving video-rate generation.
- Social and Ethical Questions: This work opens up interesting social dynamics. Who gets to control the shared environment? What happens if one person's "creative" background is another's "distracting" mess? How do we prevent the generation of inappropriate or biased content in a professional setting? These are critical HCI questions that will need to be addressed as this technology matures.
- Overall: BlendScape is a forward-looking piece of research that provides both a functional system and valuable insights into the design of future communication tools. It successfully maps out the opportunities and challenges at the intersection of generative AI and collaborative systems.
