Imaginarium: Vision-guided High-Quality 3D Scene Layout Generation
TL;DR Summary
Proposes a vision-guided 3D layout system built on a large asset library: a fine-tuned image generation model expands text prompts into guide images, which are parsed into 3D layouts and optimized for coherence, significantly improving layout richness and quality over prior methods.
Abstract
Generating artistic and coherent 3D scene layouts is crucial in digital content creation. Traditional optimization-based methods are often constrained by cumbersome manual rules, while deep generative models face challenges in producing content with richness and diversity. Furthermore, approaches that utilize large language models frequently lack robustness and fail to accurately capture complex spatial relationships. To address these challenges, this paper presents a novel vision-guided 3D layout generation system. We first construct a high-quality asset library containing 2,037 scene assets and 147 3D scene layouts. Subsequently, we employ an image generation model to expand prompt representations into images, fine-tuning it to align with our asset library. We then develop a robust image parsing module to recover the 3D layout of scenes based on visual semantics and geometric information. Finally, we optimize the scene layout using scene graphs and overall visual semantics to ensure logical coherence and alignment with the images. Extensive user testing demonstrates that our algorithm significantly outperforms existing methods in terms of layout richness and quality. The code and dataset will be available at https://github.com/HiHiAllen/Imaginarium.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: Imaginarium: Vision-guided High-Quality 3D Scene Layout Generation
- Authors: Xiaoming Zhu, Xu Huang, Qinghongbing Xie, Zhi Deng, Junsheng Yu, Yirui Guan, Zhongyuan Liu, Lin Zhu, Qijun Zhao, Ligang Liu, and Long Zeng.
- Affiliations: The authors are affiliated with prominent academic and industrial research institutions, including Tsinghua University, Tencent, Southeast University, and the University of Science and Technology of China. This collaboration between academia and a major tech company (Tencent) often indicates research with strong practical applications, particularly in digital content and gaming.
- Journal/Conference: The paper is formatted for ACM Transactions on Graphics (TOG), a premier and highly competitive journal in the field of computer graphics. Publication in TOG is often associated with a presentation at the annual SIGGRAPH conference, which is the most prestigious venue for graphics research.
- Publication Year: 2025 (as indicated in the paper's reference format).
- Abstract: The paper introduces a novel system for generating high-quality 3D scene layouts from text prompts. The core problem is that existing methods are either too restrictive (rule-based), lack diversity (deep generative models on small 3D datasets), or are not robust in capturing spatial relationships (LLM-based methods). The proposed solution, "Imaginarium," uses a "vision-guided" approach. First, it expands a text prompt into a 2D image using a fine-tuned image generation model. This image then serves as a visual guide. A robust image parsing module analyzes the image to extract semantics and geometry (object locations, planes, etc.). Finally, it retrieves matching assets from a custom, high-quality 3D asset library and arranges them in 3D space, optimizing the layout to match the visual guide and ensure physical and logical coherence. User studies show the method significantly outperforms existing approaches in layout richness and quality.
- Original Source Link:
https://arxiv.org/abs/2510.15564v1. The link and arXiv ID could not be verified at the time of this analysis; the analysis proceeds from the content provided. The paper's status is a preprint.
2. Executive Summary
-
Background & Motivation (Why):
- Core Problem: Creating artistic, coherent, and diverse 3D scene layouts automatically is a major challenge in digital content creation (e.g., for games, films, and virtual reality).
- Existing Gaps:
- Traditional optimization methods rely on hand-crafted rules, which are tedious to create and limit creativity.
- Deep generative models trained directly on 3D data suffer from a lack of diversity and quality because high-quality 3D scene datasets are small and expensive to create.
- Large Language Model (LLM)-based methods can generate plausible object lists but lack the "spatial intuition" to arrange them realistically and aesthetically, often failing to capture complex geometric relationships.
- Fresh Angle: The paper's key innovation is to leverage the massive progress in 2D image generation for the 3D layout task. Instead of generating a 3D layout directly, it first generates a high-quality 2D "concept art" image and then "lifts" this 2D vision into a 3D scene. This elegantly sidesteps the scarcity of 3D data by tapping into the vast and diverse world of 2D images.
-
Main Contributions / Findings (What):
- An Innovative Vision-Guided System: The paper presents a complete pipeline that transforms a text prompt into a high-quality 3D scene layout by using a generated 2D image as an intermediate guide.
- A High-Quality 3D Scene Dataset: The authors curated a new dataset with 2,037 high-quality 3D assets and 147 professionally designed scene layouts. This dataset, which they plan to open-source, addresses quality and flexibility limitations in existing libraries like 3D-Future and Objaverse.
- A Robust Object Pose Estimation Algorithm: They developed a novel algorithm for estimating the 6D pose (3D rotation and translation) of objects. It integrates visual semantic features with geometric information and scene-level logic, making it more accurate and robust than previous methods, especially when dealing with discrepancies between the guiding image and the available 3D assets.
3. Prerequisite Knowledge & Related Work
-
Foundational Concepts:
- 3D Scene Layout Generation: The task of placing and orienting a collection of 3D objects within a defined space (like a room) to create a plausible and functional arrangement.
- Diffusion Models: A class of powerful deep generative models, such as Stable Diffusion and the paper's choice, Flux. They work by gradually adding noise to data during training and then learning to reverse this process, allowing them to generate new data from pure noise, often guided by text or images.
- Vision Foundation Models: Large-scale models pre-trained on enormous amounts of visual data. They can perform various vision tasks "out of the box" or with minimal fine-tuning. The paper uses several:
  - GPT-4o: A multimodal model that can understand both text and images.
  - grounding-dino1.5: An open-set object detector that can find objects in an image based on a text description.
  - SAM (Segment Anything Model): A model that can produce precise segmentation masks for any object in an image.
  - Depth Anything V2: A model that estimates a dense depth map from a single 2D image.
- Scene Graph: A data structure that represents a scene as a set of nodes (objects) and edges (their relationships). For example, a node "lamp" might be connected to a node "desk" with an "on" relationship. This provides logical structure to the scene.
- Pose Estimation: The process of determining an object's position (translation, i.e., x, y, z coordinates) and orientation (rotation) in 3D space. This is often called estimating the 6D pose.
- RANSAC (Random Sample Consensus): An iterative algorithm used to estimate parameters of a mathematical model from a set of observed data that contains outliers. The paper uses it to detect planes (floor, walls) from a 3D point cloud.
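To make the idea concrete, here is a minimal, self-contained NumPy sketch of RANSAC plane fitting of the kind used for floor/wall detection; the threshold and iteration count are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def ransac_plane(points, n_iters=500, threshold=0.02, rng=None):
    """Fit a single dominant plane: sample 3 points, build their plane,
    and keep the hypothesis with the most inliers within `threshold`."""
    rng = rng or np.random.default_rng(0)
    best_plane, best_inliers = None, np.array([], dtype=int)
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(normal) < 1e-9:      # degenerate (collinear) sample
            continue
        normal = normal / np.linalg.norm(normal)
        dist = np.abs((points - p0) @ normal)  # point-to-plane distances
        inliers = np.flatnonzero(dist < threshold)
        if len(inliers) > len(best_inliers):
            best_plane, best_inliers = (normal, p0), inliers
    return best_plane, best_inliers

# Toy data: a noisy horizontal "floor" plus a handful of outliers above it.
floor = np.column_stack([np.random.rand(500, 2), 0.005 * np.random.randn(500)])
pts = np.concatenate([floor, np.random.rand(20, 3) + 1.0])
plane, inliers = ransac_plane(pts)
print(len(inliers))  # most of the 500 floor points should be recovered
```

Detecting floor, walls, and ceiling amounts to running such a fit repeatedly, removing each detected plane's inliers before the next pass.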
-
Previous Works & Differentiation: The paper categorizes prior work into three main areas and differentiates its approach from each.
-
Data-Driven Scene Layout Generation:
- Classical Methods: These used graphical models and non-linear optimization based on hand-coded design rules (e.g., an object should be accessible, a chair should face a table). Limitation: Rigid, time-consuming to define rules, and not very expressive.
- Deep Learning Methods: These learn to generate layouts directly from 3D scene datasets using architectures like VAEs, GANs, and more recently, diffusion models. Limitation: Their performance is capped by the limited size and diversity of available 3D datasets, leading to overfitting and less creative results.
- This Paper's Approach: By using 2D images as guidance, Imaginarium taps into the near-infinite diversity of 2D generative models, breaking free from the constraints of 3D datasets.
-
Language-Driven Scene Layout Generation:
- LLM-based Methods: These use LLMs like GPT-3/4 to interpret a text prompt and generate a scene description, sometimes even code (e.g., Python scripts for Blender) to place objects. Examples include HOLODECK and LayoutGPT. Limitation: LLMs lack true geometric and spatial reasoning. They often produce unstable results with artifacts (e.g., objects intersecting or floating) and struggle with aesthetic composition.
- This Paper's Approach: Imaginarium replaces the LLM's weak spatial reasoning with a powerful vision-based system. The generated 2D image provides a strong, explicit spatial and aesthetic prior that is much richer than what text alone can convey.
-
Pose Estimation of Novel Objects:
- Existing Methods: Pose estimation is a classic computer vision problem. Recent methods like GigaPose use template matching with deep features to find the pose of an object in an image by comparing it against rendered views of a 3D model. Limitation: These methods can struggle when the object in the image (the query) is stylistically different from the 3D asset (the template) or when views are ambiguous.
- This Paper's Approach: The authors build on GigaPose but enhance it significantly. They add a "fine selection" step using homography and, crucially, incorporate geometric constraints from the initial 3D scene reconstruction (the OBBs and planes). This hybrid visual-geometric approach makes the pose estimation more robust and accurate.
4. Methodology (Core Technology & Implementation)
The core of Imaginarium is a multi-stage pipeline that translates a text prompt into a final 3D scene. The overall workflow is depicted in Image 12.
Figure caption: A schematic of the vision-guided high-quality 3D scene layout generation pipeline. Starting from a text description, the prompt is expanded by an image generation model, then passed through scene-graph analysis, 3D layout reconstruction, and layout refinement to obtain a final 3D scene layout that satisfies physical constraints.
-
Problem Statement: The goal is to create a function that, given a text prompt $p$ and a library of 3D assets $\mathcal{A}$, generates a scene $S = \{o_1, \dots, o_n\}$. Each object $o_i$ is defined by a tuple $(a_i, R_i, t_i, s_i)$, where:
- $a_i$ is a specific asset from the library $\mathcal{A}$.
- $R_i \in SO(3)$ is the rotation matrix in 3D.
- $t_i \in \mathbb{R}^3$ is the translation vector in 3D.
- $s_i \in \mathbb{R}^3$ is the scale vector in 3D.
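A minimal Python sketch of this scene representation (illustrative names such as `SceneObject` and `chair_042`, not the authors' code) could look as follows.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class SceneObject:
    asset_id: str            # which asset a_i from the library A is used (illustrative id)
    rotation: np.ndarray     # 3x3 rotation matrix R_i
    translation: np.ndarray  # 3-vector t_i
    scale: np.ndarray        # 3-vector s_i (per-axis scale)

@dataclass
class Scene:
    objects: list = field(default_factory=list)  # the set {o_1, ..., o_n}

# Example: one chair at the origin with identity orientation and unit scale.
scene = Scene([SceneObject("chair_042", np.eye(3), np.zeros(3), np.ones(3))])
```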
-
Steps & Procedures:
Stage 1: Prompt Expander with Predefined Assets (Sec 3.1) The first step is to convert the user's abstract text prompt (e.g., "a cozy musician's bedroom") into a concrete visual representation.
-
Dataset Creation: The authors first built a high-quality 3D asset library $\mathcal{A}$ and used it to create 147 professionally designed scenes. This dataset is crucial because it provides high-quality data for both fine-tuning and evaluation. Image 23 shows the diversity of assets and statistics comparing it favorably to the 3D-Future dataset.
-
Fine-tuning the Image Generator: They take a powerful text-to-image model, Flux, and fine-tune it on renderings of their custom scenes. Using a technique similar to DreamBooth, they associate a unique identifier [V] with their scene style. This teaches Flux to generate new images that are stylistically consistent with the assets in their library $\mathcal{A}$. As shown in Image 2, the fine-tuned model produces images that better match the available assets compared to the vanilla model.
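The following is a hedged sketch of the DreamBooth-style data preparation implied above: each rendering of a library scene is paired with a caption that carries the unique identifier [V], so the fine-tuned model learns to associate [V] with the library's visual style. The file names and caption wording are assumptions for illustration, not the paper's exact format.

```python
# Pair each scene rendering with a caption containing the style identifier "[V]".
renders = [
    ("render_001.png", "a cozy musician's bedroom"),
    ("render_002.png", "a modern dining room with a long table"),
]
training_pairs = [(path, f"a [V] style rendering of {desc}") for path, desc in renders]
print(training_pairs[0][1])  # "a [V] style rendering of a cozy musician's bedroom"
```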
Stage 2: Scene Image Analysis (Sec 3.2) Once a guiding image is generated, the system deconstructs it to understand its content.
-
Semantic Parsing:
- GPT-4o is prompted to list the objects visible in the image, using the asset categories from library $\mathcal{A}$ to guide its output.
- The text labels are fed to grounding-dino1.5 to get 2D bounding boxes for each object.
- These boxes are passed to SAM to get precise pixel-level masks for all foreground objects.
-
Geometric Analysis:
- Depth Anything V2 estimates the depth map for the image.
- The depth map is unprojected into a 3D point cloud.
- For each object mask, the corresponding points are extracted, and an Oriented Bounding Box (OBB) is fitted.
- For the background, RANSAC is used to find the main planes corresponding to the floor, walls, and ceiling.
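Below is a minimal sketch of this unproject-and-fit step, assuming pinhole camera intrinsics (which must be guessed or estimated for a generated image) and using Open3D for RANSAC plane segmentation and OBB fitting; it is not the authors' implementation.

```python
import numpy as np
import open3d as o3d

def unproject(depth, fx, fy, cx, cy):
    """Back-project a dense depth map (H x W, in metres) into an N x 3 point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

depth = 1.0 + np.random.rand(480, 640)   # stand-in for a Depth Anything V2 output
points = unproject(depth, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(points))

# Dominant background plane (e.g. the floor); walls and ceiling follow by removing
# the inliers and running the same RANSAC segmentation again.
plane_model, inliers = pcd.segment_plane(distance_threshold=0.02, ransac_n=3,
                                         num_iterations=1000)

# Oriented bounding box; in the pipeline this is fitted per object mask rather
# than to the whole cloud as done here.
obb = pcd.get_oriented_bounding_box()
print(plane_model, obb.extent)
```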
-
Scene Graph Construction:
-
GPT-4o analyzes the image again to infer logical relationships between objects, focusing on two key types shown in Image 31:
- Support Relationship: Object A is on, under, or inside Object B. This forms a support tree.
- Wall Proximity: An object is against a wall.
-
This scene graph provides crucial logical constraints for the final layout.
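As a concrete illustration of the two relationship types, a support tree plus wall-proximity flags can be represented with a structure like the following (illustrative only, not the paper's data format).

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraphNode:
    label: str
    against_wall: bool = False                    # wall-proximity relationship
    supports: list = field(default_factory=list)  # children resting on / inside this object

# A small bedroom fragment: floor -> bed -> pillow, with the bed flush against a wall,
# and floor -> desk -> lamp.
floor = SceneGraphNode("floor")
bed = SceneGraphNode("bed", against_wall=True, supports=[SceneGraphNode("pillow")])
desk = SceneGraphNode("desk", supports=[SceneGraphNode("lamp")])
floor.supports.extend([bed, desk])
```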

Stage 3: Scene Layout Reconstruction (Sec 3.3) This stage builds the initial 3D layout.
-
3D Asset Retrieval: For each segmented object mask in the image, the system searches the asset library to find the best match based on category, visual similarity (using deep features), and size compatibility.
-
Rotation Estimation: This is a key technical contribution, using a coarse-to-fine strategy:
- Visual Candidates (Coarse): The retrieved 3D asset is rendered from 162 pre-defined viewpoints. The system uses a feature extractor from GigaPose to find the rendered view that is most similar to the object in the input image. The top 10 most similar views are selected as coarse candidates.
- Homography-based Refinement (Fine): For the top 10 candidates, the system calculates the homography matrix $H$ between the rendered view and the input image patch. A homography describes how a plane in 3D projects to a 2D image. If the viewpoint is a perfect match, the rotational part of $H$ should be close to the identity matrix. The system therefore selects the top 4 views with the smallest score $\| U V^\top - I \|_F$, where $H = U \Sigma V^\top$ is the Singular Value Decomposition (SVD) of the homography and $\| \cdot \|_F$ is the Frobenius norm (see the sketch after this list). This step is illustrated in Image 32.
- Geometric Enhancement: The system also derives rotation candidates from the OBB calculated in Stage 2. It then adaptively combines the best visual candidate and the best geometric candidate based on their angular difference. If they are close, it prefers the geometric one (which is more accurate for boxy objects); if they are far apart, it trusts the visual one. This is shown in Image 33.
-
Translation and Scale Estimation: The initial translation is set to the center of the OBB. The scale is adjusted based on the object type (e.g., allowing a bookshelf to be scaled vertically but not a chair) to fit the OBB dimensions while preserving the asset's integrity.
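Below is a small NumPy sketch of the homography-based refinement criterion referenced above: the homography between a candidate rendering and the image patch is decomposed by SVD, and candidates are ranked by how far the rotation-like factor is from the identity. The exact normalization in the paper may differ; this is a plausible reconstruction, not the authors' code.

```python
import numpy as np

def rotation_consistency(H):
    """Frobenius-norm criterion: decompose H = U S V^T and measure how far the
    rotation-like factor U V^T deviates from the identity (smaller = better match)."""
    U, _, Vt = np.linalg.svd(H)
    return np.linalg.norm(U @ Vt - np.eye(3), ord="fro")

# Rank 10 coarse viewpoint candidates and keep the 4 most consistent ones.
rng = np.random.default_rng(0)
homographies = [np.eye(3) + 0.05 * rng.standard_normal((3, 3)) for _ in range(10)]
best4 = sorted(range(len(homographies)),
               key=lambda i: rotation_consistency(homographies[i]))[:4]
print(best4)
```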
Stage 4: Refinement of Scene Layout (Sec 3.4) The initial layout is coarse and may contain physical violations (e.g., intersections). This final stage cleans it up.
- Local Refinement: The scene graph from Stage 2 is used to enforce logic. Rotations are adjusted to align objects with their supporting surfaces (e.g., making the bottom of a vase parallel to the table it's on). Objects placed inside containers are scaled down to fit, as shown in Image 34.
- Global Optimization (Translation): A constrained optimization problem is solved to fine-tune the translations of all objects simultaneously. The objective function minimizes changes from the initial positions while also ensuring the final rendered masks match the input image's segmentation. This is subject to hard constraints:
- Objects do not intersect.
- Objects are properly supported by the floor, other objects (based on the support tree), or the ceiling.
- Objects are flush against walls if the scene graph dictates it. This problem is solved using simulated annealing (a generic sketch follows after this list).
- Physics Simulation: As a final step, a physics engine (from Blender) is used to settle objects realistically, like pillows on a bed or items in a pile.
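The generic annealing loop below is a hedged sketch of how such a translation refinement can be run, with a toy energy that penalizes deviation from the initial positions and object overlap; the paper's actual energy additionally matches rendered masks to the image segmentation and treats the support and wall relations as hard constraints (e.g. by rejecting violating proposals).

```python
import numpy as np

def anneal(x0, energy, propose, n_iters=5000, temp=1.0, cooling=0.999, seed=0):
    """Generic simulated annealing: always accept downhill moves, accept uphill moves
    with probability exp(-delta / temperature), and cool the temperature each step."""
    rng = np.random.default_rng(seed)
    x, e = x0.copy(), energy(x0)
    best_x, best_e = x.copy(), e
    for _ in range(n_iters):
        cand = propose(x, rng)
        ce = energy(cand)
        if ce < e or rng.random() < np.exp(-(ce - e) / max(temp, 1e-9)):
            x, e = cand, ce
            if e < best_e:
                best_x, best_e = x.copy(), e
        temp *= cooling
    return best_x

# Toy usage: keep two objects near their initial translations while staying >= 1 m apart.
init = np.array([[0.0, 0.0, 0.0], [0.4, 0.0, 0.0]])
def energy(t):
    overlap = max(0.0, 1.0 - np.linalg.norm(t[0] - t[1]))   # soft non-intersection penalty
    return float(np.sum((t - init) ** 2) + 10.0 * overlap)
def propose(t, rng):
    return t + 0.05 * rng.standard_normal(t.shape)
print(anneal(init, energy, propose))
```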
5. Experimental Setup
-
Datasets:
- Imaginarium Dataset (Custom): Comprises 147 high-quality scenes created by professional artists using a library of 2,037 assets. Used for fine-tuning the Flux model and for fidelity evaluation.
- 3DF-CLAPE: A new benchmark dataset created by the authors from the 3D-Future dataset for evaluating category-level and instance-level object pose estimation. It contains thousands of query-template pairs.
-
Evaluation Metrics:
- Preference Rate (%): In user studies, the percentage of time users preferred the output of Imaginarium over a baseline method based on "realism" or "aesthetics."
- Expert/GPT-4o Rating: A score from 1 to 5 given by human artists or GPT-4o on dimensions like composition, semantics, and aesthetics.
- Object Recovery Rate: The percentage of ground-truth objects in a scene that the system successfully detected and placed.
- Category Preservation Rate: Of the recovered objects, the percentage that were assigned the correct category.
- Rotation/Translation Accuracy (AUC@threshold): The Area Under the Curve of the success rate as a function of an error threshold. For example, AUC@60° for rotation measures the overall accuracy for rotation errors up to 60 degrees. A higher AUC is better (a small computation sketch follows after this list).
- Scene Graph Accuracy: The percentage of relationships (e.g., "on," "against wall") that were correctly identified compared to a ground truth.
- CLIP Similarity: A score measuring the semantic similarity between two images (e.g., the generated guide image and a rendering of the final 3D scene). A higher score means they are more similar in content.
- Conceptual Definition: CLIP (Contrastive Language-Image Pre-Training) is a model trained to understand images and text in a shared embedding space. The cosine similarity of the embeddings for two images indicates how semantically similar they are.
- mAP (mean Average Precision): A standard metric for evaluating the performance of object detection and retrieval systems. In this context, it is used to evaluate rotation estimation by treating different rotation error thresholds as criteria for a "correct" prediction.
- NN LPIPS (Nearest Neighbor LPIPS): A metric to test for overfitting. It measures the perceptual similarity between a generated image and its single most similar-looking image in the training set. A high value suggests the model is generating novel images, not just copying training examples.
- DIV (LPIPS) & Intra-set Scene Sim.: Metrics to measure the diversity of generated content. DIV (LPIPS) measures the average perceptual distance between pairs of images generated from the same prompt, while Intra-set Scene Sim. measures the diversity of the underlying 3D layouts.
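For concreteness, here is a small NumPy sketch of how an AUC@60° figure for rotation can be computed: the geodesic rotation error is evaluated per object, the success rate is computed for every threshold up to 60 degrees, and the resulting curve is averaged. The exact protocol in the paper may differ slightly; this is only illustrative.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic distance between two rotation matrices, in degrees."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def auc_at(errors_deg, max_thresh=60.0, steps=600):
    """Average of the success rate over thresholds in (0, max_thresh]."""
    thresholds = np.linspace(max_thresh / steps, max_thresh, steps)
    success = [(np.asarray(errors_deg) <= t).mean() for t in thresholds]
    return float(np.mean(success))

errors = [5.0, 12.0, 40.0, 75.0, 90.0]   # example per-object rotation errors in degrees
print(auc_at(errors))                     # higher is better; errors above 60° never count
```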
-
Baselines:
- LLM-guided: HOLODECK, LayoutGPT.
- Data-driven Generative Models: DiffuScene, InstructScene.
- Pose Estimation Methods: DINOv2, SPARC, DiffCAD, AENet, GigaPose, Orient Anything.
6. Results & Analysis
-
Core Results:
1. Quality Assessment (User Studies):
Imaginarium was overwhelmingly preferred by both art students and professional artists.
-
Table 1: In pairwise comparisons, Imaginarium was preferred over DiffuScene, HOLODECK, LayoutGPT, and InstructScene between ~61% and ~86% of the time for both realism and aesthetics. This is a very strong result.

(Manual transcription of Table 1; values are preference rates for "Ours" vs. each baseline)

| Our vs. | Realistic (Dining) | Realistic (Living) | Realistic (Bedroom) | Aesthetic (Dining) | Aesthetic (Living) | Aesthetic (Bedroom) |
|---|---|---|---|---|---|---|
| DiffuScene | 75.69 | 82.59 | 79.37 | 74.86 | 85.57 | 80.72 |
| Holodeck | 79.27 | 77.08 | 76.79 | 82.72 | 72.92 | 74.55 |
| LayoutGPT | 76.69 | 76.50 | 77.54 | 81.11 | — | — |
| InstructScene | 66.33 | 68.46 | 61.29 | 69.39 | 75.17 | 72.90 |
Table 2: Professional artists gave
Imaginarium an average overall score of 3.34 (on a 1-5 scale where 3.0 is the professional average), while all other methods scored below 3.0. This indicates the generated layouts are of professional quality. GPT-4o's ratings showed a similar trend.

(Manual transcription of Table 2; each cell reports artist rating / GPT-4o rating)

| Method | Composition | Semantic | Aesthetic | Overall |
|---|---|---|---|---|
| Ours | 3.35/3.16 | 3.29/2.86 | 3.37/3.16 | 3.34/3.06 |
| DiffuScene | 2.86/3.07 | 2.80/2.78 | 2.83/3.07 | 2.83/2.97 |
| HOLODECK | 2.71/2.91 | 2.56/2.55 | 2.80/2.86 | 2.69/2.77 |
| LayoutGPT | 2.42/2.97 | 2.26/2.83 | 2.35/2.97 | 2.34/2.92 |
| InstructScene | 2.91/3.07 | 2.75/2.83 | 2.89/3.08 | 2.85/2.99 |
A visual comparison is provided in Image 35.

2. Fidelity of Reconstruction:
-
Table 3: When reconstructing scenes from rendered images of its own dataset, the system showed high fidelity. It recovered 92.31% of primary objects with 95.83% category accuracy. Geometric accuracy was also strong (74.83% rotation AUC@60°, 84.32% translation AUC@0.5m).

(Manual transcription of Table 3)

| Fidelity Metric | Primary | Secondary |
|---|---|---|
| Object Recovery | 92.31% | 70.41% |
| Category Preservation | 95.83% | 91.67% |
| Rotation (AUC@60°) | 74.83% | 71.51% |
| Translation (AUC@0.5m) | 84.32% | 80.40% |
| Scene Graph Accuracy | 93.26% | — |

| Similarity Metric | Score |
|---|---|
| CLIP (Guide Image) | 27.03 |
| CLIP (Render Image) | 25.83 |
| GPT-4o Rating | 8.29/10 |
3. Rotation Estimation Performance:
-
Table 4 & Figure 9: The proposed rotation estimation method significantly outperformed all baselines on the
3DF-CLAPE benchmark. It achieved an AUC@60° of 70.06% at the category level, far ahead of the next best (OrientA at 56.07%). The performance was even better at the instance level (81.44%), highlighting the method's effectiveness.

(Manual transcription of Table 4)

| AUC@60° ↑ | DINOv2 | SPARC | DiffCAD | OrientA | GigaP | AENet | Ours |
|---|---|---|---|---|---|---|---|
| Category-level | 31.68% | 52.54% | 26.45% | 56.07% | 39.85% | 45.32% | 70.06% |
| Instance-level | 31.38% | 61.46% | 25.44% | 56.24% | 57.43% | 62.16% | 81.44% |
Figure caption: A performance comparison plot showing how average precision (AP) varies with the rotation error threshold for the different methods, at both the category level and the instance level; the paper's method ("Ours") outperforms the other methods at every threshold.
-
Ablation Studies:
1. Impact of Fine-tuning Flux: This was a critical design choice.
-
Retrieval Accuracy (Table 5): Fine-tuning massively improved asset retrieval.
Top-1 accuracy jumped from 48.57% to 68.70%, and Top-3 accuracy from 68.57% to 83.21%. This confirms that making the generated image "style-aware" of the asset library is highly beneficial.

(Manual transcription of Table 5)

| Metric | Vanilla Flux | Finetuned Flux |
|---|---|---|
| Top-1 Accuracy | 48.57% | 68.70% |
| Top-3 Accuracy | 68.57% | 83.21% |
Overfitting and Diversity (Table 6): The fine-tuned model did not overfit to the training layouts. The
NN LPIPS and Scene Sim. to Training scores were very close to the vanilla model. Furthermore, it maintained comparable generative diversity. This shows the fine-tuning successfully learned style without sacrificing creativity.

(Manual transcription of Table 6)

| Model (Overfitting) | NN LPIPS ↑ | Scene Sim. to Training ↓ |
|---|---|---|
| Vanilla Flux | 0.6375 | 0.3665 |
| Finetuned Flux | 0.5981 | 0.3899 |

| Model (Diversity) | DIV (LPIPS) ↑ | Intra-set Scene Sim. ↓ |
|---|---|---|
| Vanilla Flux | 0.5782 | 0.2974 |
| Finetuned Flux | 0.5901 | 0.3178 |
2. Ablation of Other Components: The paper states that ablation studies were also conducted for the rotation estimation module (homography and geometric information) and the layout refinement pipeline. While detailed tables are not provided in the excerpt, the text implies that each component provides a meaningful contribution to the final performance.
7. Conclusion & Reflections
-
Conclusion Summary: The paper presents
Imaginarium, a highly effective system for generating high-quality 3D scene layouts. Its core idea of using a fine-tuned 2D image generator as a "prompt expander" and visual guide is a powerful paradigm shift. By combining state-of-the-art vision models for analysis, a novel robust pose estimation algorithm, and a principled refinement pipeline, the system produces layouts that are not only logically and physically plausible but are also aesthetically superior to those from previous state-of-the-art methods. The creation and promised release of a new high-quality 3D scene dataset is also a significant contribution to the community. -
Limitations & Future Work (Inferred): The provided text ends before a formal limitations section, but we can infer several potential weaknesses:
- Error Propagation: The pipeline is a long cascade of models. An error in an early stage (e.g., a poor image from
Flux, a missed object detection, or an inaccurate depth map) will inevitably propagate and harm the final output.
- Asset Library Dependency: The system can only place objects that exist in its predefined asset library. It cannot create novel object geometries, which limits its creativity to combinatorial arrangements.
- Single Viewpoint Limitation: The entire 3D reconstruction is based on a single 2D image. This can lead to ambiguity, especially for objects that are heavily occluded or have symmetrical appearances. The back of the scene is entirely inferred, not seen.
- Computational Cost: The full pipeline involves multiple large models (
Flux, GPT-4o, SAM, etc.), rendering, and optimization, taking approximately 240 seconds per scene on an A100 GPU. This is not yet suitable for real-time interactive applications.
-
Personal Insights & Critique:
Imaginarium is an excellent piece of engineering, cleverly integrating the strengths of different AI domains (generative models, vision, language) to solve a difficult problem. The "2D-to-3D lift" is a pragmatic and powerful strategy that circumvents the 3D data bottleneck.
- The most impressive aspect is the focus on quality and aesthetics, which are often secondary in purely technical papers. The extensive user studies with artists and the high preference rates demonstrate that the system creates genuinely appealing content.
- The paper's methodology for pose estimation is particularly strong. The fusion of visual semantics with geometric constraints is a robust approach that is well-suited to the noisy, "in-the-wild" nature of generated images.
- A promising future direction would be to make the process more interactive. For example, allowing a user to re-paint a region of the guide image to change an object (as hinted at in Image 4's caption) or directly manipulate objects in the 3D scene and have the system re-optimize the layout. This would bridge the gap between fully automated generation and professional creative workflows.