Paper Reading AI Learner

FreeScene: Mixed Graph Diffusion for 3D Scene Synthesis from Free Prompts

2025-06-03 12:01:41
Tongyuan Bai, Wangyuanfan Bai, Dong Chen, Tieru Wu, Manyi Li, Rui Ma

Abstract

Controllability plays a crucial role in the practical applications of 3D indoor scene synthesis. Existing works either allow coarse language-based control, which is convenient but lacks fine-grained scene customization, or employ graph-based control, which offers better controllability but demands considerable knowledge for the cumbersome graph design process. To address these challenges, we present FreeScene, a user-friendly framework that enables both convenient and effective control for indoor scene synthesis. Specifically, FreeScene supports free-form user inputs, including text descriptions and/or reference images, allowing users to express versatile design intentions. The user inputs are analyzed and integrated into a graph representation by a VLM-based Graph Designer. We then propose MG-DiT, a Mixed Graph Diffusion Transformer, which performs graph-aware denoising to enhance scene generation. MG-DiT not only excels at preserving graph structure but also applies broadly to various tasks, including, but not limited to, text-to-scene, graph-to-scene, and rearrangement, all within a single model. Extensive experiments demonstrate that FreeScene provides an efficient and user-friendly solution that unifies text-based and graph-based scene synthesis, outperforming state-of-the-art methods in both generation quality and controllability across a range of applications.
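To make the graph-based control described in the abstract concrete, here is a minimal sketch of a scene graph as a set of object nodes linked by spatial relations. All names, fields, and relation labels below are hypothetical illustrations, not the paper's actual data structures or the MG-DiT interface.

```python
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    # One object in the scene: a semantic category plus a coarse placement.
    category: str
    position: tuple = (0.0, 0.0, 0.0)  # (x, y, z) center
    size: tuple = (1.0, 1.0, 1.0)      # (w, h, d) extents

@dataclass
class SceneGraph:
    nodes: list = field(default_factory=list)
    # Directed edges: (source node index, relation label, target node index).
    edges: list = field(default_factory=list)

    def add_object(self, category, **kwargs):
        self.nodes.append(SceneNode(category, **kwargs))
        return len(self.nodes) - 1

    def relate(self, src, relation, dst):
        self.edges.append((src, relation, dst))

# Example intent: "a bed with a nightstand to its left".
# A VLM-based designer would emit a graph like this from free-form input.
g = SceneGraph()
bed = g.add_object("bed")
nightstand = g.add_object("nightstand")
g.relate(nightstand, "left_of", bed)
```

A denoising model conditioned on such a graph would then place each node's position and size so that the stated relations hold, which is the kind of graph-aware generation the abstract attributes to MG-DiT.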

URL

https://arxiv.org/abs/2506.02781

PDF

https://arxiv.org/pdf/2506.02781.pdf
