CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graphs

2023-05-25 17:39:13
Guangyao Zhai, Evin Pinar Örnek, Shun-Cheng Wu, Yan Di, Federico Tombari, Nassir Navab, Benjamin Busam

Abstract

Controllable scene synthesis aims to create interactive environments for various industrial use cases. Scene graphs provide a highly suitable interface to facilitate these applications by abstracting the scene context in a compact manner. Existing methods, reliant on retrieval from extensive databases or pre-trained shape embeddings, often overlook scene-object and object-object relationships, leading to inconsistent results due to their limited generation capacity. To address this issue, we present CommonScenes, a fully generative model that converts scene graphs into corresponding controllable 3D scenes, which are semantically realistic and conform to commonsense. Our pipeline consists of two branches: one predicts the overall scene layout via a variational auto-encoder, and the other generates compatible shapes via latent diffusion, capturing global scene-object and local inter-object relationships while preserving shape diversity. The generated scenes can be manipulated by editing the input scene graph and sampling the noise in the diffusion model. Due to the lack of a scene graph dataset offering high-quality object-level meshes with relations, we also construct SG-FRONT, enriching the off-the-shelf indoor dataset 3D-FRONT with additional scene graph labels. Extensive experiments are conducted on SG-FRONT, where CommonScenes shows clear advantages over other methods regarding generation consistency, quality, and diversity. Code and the dataset will be released upon acceptance.
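To make the two-branch design concrete, the following is a minimal, self-contained PyTorch sketch of the idea described in the abstract: a scene-graph encoder whose per-object features condition (a) a variational layout branch predicting 3D bounding boxes and (b) a simplified conditional latent-diffusion branch denoising per-object shape latents. All module names, dimensions, and the toy DDPM-style training loss are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a scene-graph-conditioned two-branch generator (assumed architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SceneGraphEncoder(nn.Module):
    """Message passing over (subject, predicate, object) triplets of the scene graph."""

    def __init__(self, num_classes, num_relations, dim=128, rounds=3):
        super().__init__()
        self.obj_emb = nn.Embedding(num_classes, dim)
        self.rel_emb = nn.Embedding(num_relations, dim)
        self.msg = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.rounds = rounds

    def forward(self, obj_labels, triplets):
        x = self.obj_emb(obj_labels)                       # (N, dim) per-object features
        r = self.rel_emb(triplets[:, 1])                   # (E, dim) per-relation features
        for _ in range(self.rounds):
            s, o = triplets[:, 0], triplets[:, 2]
            m = self.msg(torch.cat([x[s], r, x[o]], dim=-1))
            x = x + torch.zeros_like(x).index_add_(0, o, m)  # aggregate messages at objects
        return x


class LayoutVAE(nn.Module):
    """Variational layout branch: per-object 3D box (size, translation, yaw) = 7 params."""

    def __init__(self, dim=128, z_dim=32):
        super().__init__()
        self.to_mu = nn.Linear(dim, z_dim)
        self.to_logvar = nn.Linear(dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(dim + z_dim, dim), nn.ReLU(), nn.Linear(dim, 7))

    def forward(self, feats):
        mu, logvar = self.to_mu(feats), self.to_logvar(feats)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization
        boxes = self.dec(torch.cat([feats, z], dim=-1))
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return boxes, kl


class ShapeDiffusion(nn.Module):
    """Toy conditional denoiser over per-object shape latents (stand-in for latent diffusion)."""

    def __init__(self, dim=128, latent_dim=64, steps=100):
        super().__init__()
        self.steps = steps
        betas = torch.linspace(1e-4, 0.02, steps)
        self.register_buffer("alphas_bar", torch.cumprod(1.0 - betas, dim=0))
        self.eps_net = nn.Sequential(
            nn.Linear(latent_dim + dim + 1, 256), nn.SiLU(), nn.Linear(256, latent_dim)
        )

    def loss(self, shape_latents, cond):
        # Standard noise-prediction objective, conditioned on graph-aware object features.
        t = torch.randint(0, self.steps, (shape_latents.size(0),), device=shape_latents.device)
        a = self.alphas_bar[t].unsqueeze(-1)
        noise = torch.randn_like(shape_latents)
        noisy = a.sqrt() * shape_latents + (1 - a).sqrt() * noise
        t_emb = (t.float() / self.steps).unsqueeze(-1)
        pred = self.eps_net(torch.cat([noisy, cond, t_emb], dim=-1))
        return F.mse_loss(pred, noise)


# Tiny smoke test: 4 objects connected by 3 relations.
enc = SceneGraphEncoder(num_classes=10, num_relations=5)
layout, diffusion = LayoutVAE(), ShapeDiffusion()
labels = torch.tensor([1, 2, 3, 4])
triplets = torch.tensor([[0, 1, 1], [1, 2, 2], [2, 0, 3]])  # (subject, predicate, object)
feats = enc(labels, triplets)
boxes, kl = layout(feats)
shape_loss = diffusion.loss(torch.randn(4, 64), feats)
print(boxes.shape, kl.item(), shape_loss.item())
```

Editing the input triplets (or resampling the diffusion noise) changes the conditioning features and hence the generated layout and shapes, which mirrors the scene-manipulation behavior the abstract describes.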

URL

https://arxiv.org/abs/2305.16283

PDF

https://arxiv.org/pdf/2305.16283.pdf

