Paper Reading AI Learner

Urban Architect: Steerable 3D Urban Scene Generation with Layout Prior

2024-04-10 06:41:30
Fan Lu, Kwan-Yee Lin, Yan Xu, Hongsheng Li, Guang Chen, Changjun Jiang

Abstract

Text-to-3D generation has achieved remarkable success via large-scale text-to-image diffusion models. Nevertheless, no established paradigm exists for scaling the methodology up to urban scale. Urban scenes, characterized by numerous elements, intricate arrangement relationships, and vast scale, make ambiguous textual descriptions too weak a signal for effective model optimization. In this work, we surmount these limitations by introducing a compositional 3D layout representation into the text-to-3D paradigm, serving as an additional prior. It comprises a set of semantic primitives with simple geometric structures and explicit arrangement relationships, complementing textual descriptions and enabling steerable generation. Building on this, we propose two modifications -- (1) We introduce Layout-Guided Variational Score Distillation to address model optimization inadequacies. It conditions the score distillation sampling process on the geometric and semantic constraints of 3D layouts. (2) To handle the unbounded nature of urban scenes, we represent the 3D scene with a Scalable Hash Grid structure, incrementally adapting to the growing scale of urban scenes. Extensive experiments substantiate the capability of our framework to scale text-to-3D generation, for the first time, to large-scale urban scenes covering over 1000 m of driving distance. We also present various scene editing demonstrations, showing the power of steerable urban scene generation. Website: this https URL.
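The conditioning idea behind Layout-Guided Variational Score Distillation can be illustrated with a minimal sketch of one score-distillation step. Everything here is an assumption for illustration: the noise predictor is a toy stub (not the paper's diffusion model), `layout_cond` stands in for whatever geometric/semantic condition the layout provides, and the weighting `w = 1 - alpha_bar` is one common choice, not necessarily the authors'.

```python
import numpy as np

def sds_gradient(rendered, noise_pred_fn, layout_cond, t, alpha_bar, seed=0):
    """One score-distillation-sampling step (sketch, not the paper's method).

    rendered:      image rendered from the current 3D scene parameters
    noise_pred_fn: denoiser eps_phi(x_t, t, cond) -- a stand-in stub here
    layout_cond:   condition derived from the 3D layout (semantics/geometry)
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(rendered.shape)      # sampled ground-truth noise
    # forward-diffuse the rendering to timestep t
    x_t = np.sqrt(alpha_bar) * rendered + np.sqrt(1.0 - alpha_bar) * eps
    eps_hat = noise_pred_fn(x_t, t, layout_cond)   # layout-conditioned prediction
    w = 1.0 - alpha_bar                            # illustrative weighting choice
    # gradient w.r.t. the rendered image, to be backpropagated into the scene
    return w * (eps_hat - eps)

# toy stub denoiser: nudges x_t toward the layout condition (hypothetical)
stub = lambda x_t, t, cond: x_t - cond
img = np.zeros((4, 4))
cond = np.ones((4, 4))
g = sds_gradient(img, stub, cond, t=500, alpha_bar=0.5)
```

In a real pipeline this gradient would flow through a differentiable renderer into the scene representation; the sketch only shows where the layout condition enters the denoiser call.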

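The "incrementally adapting" behavior of the Scalable Hash Grid can be sketched as a sparse voxel-feature store that allocates cells lazily as the scene expands, rather than pre-allocating fixed bounds. This is a toy illustration under stated assumptions: cell size, feature dimension, and random-feature initialization are illustrative choices, not the paper's actual multi-resolution design.

```python
import numpy as np

class ScalableHashGrid:
    """Toy sparse voxel-feature store that grows with the visited scene.

    Unlike a fixed-bounds grid, a cell is allocated the first time a
    query lands in it, so memory scales with the explored area instead
    of a preset bounding box. (Sketch only; cell size and feature
    dimension here are illustrative, not the paper's values.)
    """

    def __init__(self, cell_size=1.0, feat_dim=8, seed=0):
        self.cell_size = cell_size
        self.feat_dim = feat_dim
        self.rng = np.random.default_rng(seed)
        self.cells = {}  # (ix, iy, iz) -> learnable feature vector

    def _key(self, xyz):
        # integer cell index of the point along each axis
        return tuple(int(np.floor(c / self.cell_size)) for c in xyz)

    def query(self, xyz):
        """Return the feature of the cell containing xyz, allocating on first touch."""
        k = self._key(xyz)
        if k not in self.cells:              # incremental growth step
            self.cells[k] = self.rng.standard_normal(self.feat_dim)
        return self.cells[k]

grid = ScalableHashGrid()
f1 = grid.query((0.2, 0.7, 0.1))
f2 = grid.query((0.9, 0.3, 0.5))    # same unit cell -> same feature
f3 = grid.query((120.0, 0.0, 0.0))  # far away: the grid simply grows
```

A production version would add multi-resolution levels, trilinear interpolation, and hash collisions à la Instant-NGP; the point of the sketch is only the lazy, unbounded growth.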

URL

https://arxiv.org/abs/2404.06780

PDF

https://arxiv.org/pdf/2404.06780.pdf

