
Compositional 3D Scene Synthesis with Scene Graph Guided Layout-Shape Generation

2024-03-19 15:54:48
Yao Wei, Martin Renqiang Min, George Vosselman, Li Erran Li, Michael Ying Yang

Abstract

Compositional 3D scene synthesis has diverse applications across industries such as robotics, film, and video games, as it closely mirrors the complexity of real-world multi-object environments. Early works typically employ shape-retrieval-based frameworks, which naturally suffer from limited shape diversity. Recent progress in shape generation with powerful generative models, such as diffusion models, has increased shape fidelity. However, these approaches treat 3D shape generation and layout generation separately, and the synthesized scenes are often hampered by layout collisions, indicating that scene-level fidelity remains under-explored. In this paper, we aim to generate realistic and plausible 3D scenes from scene graphs. To enrich the representation capability of the scene graph inputs, a large language model is utilized to explicitly aggregate global graph features with local relationship features. A unified graph convolutional network (GCN) then extracts graph features from scene graphs updated via a joint layout-shape distribution. During scene generation, an IoU-based regularization loss is introduced to constrain the predicted 3D layouts. Benchmarked on the SG-FRONT dataset, our method achieves better 3D scene synthesis, especially in terms of scene-level fidelity. The source code will be released after publication.
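The IoU-based regularization mentioned in the abstract can be pictured as a pairwise overlap penalty over the predicted layout boxes. Below is a minimal PyTorch sketch under the assumption of axis-aligned 3D boxes parameterized as (center, size); the paper does not specify its exact formulation, and the names `aabb_iou_3d` and `collision_loss` are illustrative, not from the authors' code.

```python
import torch

def aabb_iou_3d(boxes):
    """Pairwise IoU for axis-aligned 3D boxes given as (cx, cy, cz, w, h, d).

    Returns an (N, N) matrix of volume IoUs. Assumed parameterization,
    not necessarily the paper's.
    """
    mins = boxes[:, :3] - boxes[:, 3:] / 2                 # (N, 3) lower corners
    maxs = boxes[:, :3] + boxes[:, 3:] / 2                 # (N, 3) upper corners
    lo = torch.maximum(mins[:, None, :], mins[None, :, :])  # (N, N, 3)
    hi = torch.minimum(maxs[:, None, :], maxs[None, :, :])  # (N, N, 3)
    inter = (hi - lo).clamp(min=0).prod(dim=-1)             # intersection volumes
    vol = (maxs - mins).prod(dim=-1)                        # (N,) box volumes
    union = vol[:, None] + vol[None, :] - inter
    return inter / union.clamp(min=1e-8)

def collision_loss(pred_boxes):
    """Mean off-diagonal IoU: penalizes overlap between distinct objects."""
    n = pred_boxes.shape[0]
    if n < 2:
        return pred_boxes.new_zeros(())
    iou = aabb_iou_3d(pred_boxes)
    mask = ~torch.eye(n, dtype=torch.bool, device=iou.device)
    return iou[mask].mean()

# Example: penalty for 8 random boxes (illustrative only)
loss = collision_loss(torch.rand(8, 6))
```

Because the loss is differentiable in the box parameters, it can simply be added to the layout generator's training objective to push colliding boxes apart.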


URL

https://arxiv.org/abs/2403.12848

PDF

https://arxiv.org/pdf/2403.12848.pdf

