MaGRITTe: Manipulative and Generative 3D Realization from Image, Topview and Text

2024-03-30 12:50:25
Takayuki Hara, Tatsuya Harada

Abstract

The generation of 3D scenes from user-specified conditions offers a promising avenue for alleviating the production burden in 3D applications. Previous studies required significant effort to realize the desired scene because the available control conditions were limited. We propose a method for controlling and generating 3D scenes under multimodal conditions: partial images, layout information represented in the top view, and text prompts. Combining these conditions to generate a 3D scene involves three significant difficulties: (1) creating large datasets, (2) reflecting the interactions among the multimodal conditions, and (3) the domain dependence of the layout conditions. We decompose 3D scene generation into two steps: generating a 2D image from the given conditions, and generating a 3D scene from that 2D image. 2D image generation is achieved by fine-tuning a pretrained text-to-image model on a small artificial dataset of partial images and layouts, and 3D scene generation is achieved by layout-conditioned depth estimation and neural radiance fields (NeRF), thereby avoiding the need for large datasets. Using 360-degree images as a common representation of spatial information allows the interactions among the multimodal conditions to be taken into account and reduces the domain dependence of the layout control. Experimental results qualitatively and quantitatively demonstrate that the proposed method can generate 3D scenes in diverse domains, from indoor to outdoor, according to multimodal conditions.
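The abstract's central device is a shared spatial representation: all three conditions (partial image, top-view layout, text) are brought into a single 360-degree equirectangular frame before generation. As a rough, self-contained illustration of that representation (not the authors' code; the function name, canvas size, and field of view below are assumptions), the sketch embeds a perspective partial image into an equirectangular canvas:

```python
# Minimal sketch of the shared 360-degree representation described in the
# abstract: a perspective "partial image" is pasted into an equirectangular
# canvas so that image, layout, and text conditions can later be handled in
# one spatial frame. Names and defaults are illustrative, not the paper's.
import numpy as np

def embed_partial_image(partial, canvas_hw=(512, 1024), fov_deg=90.0):
    """Paste a perspective image (h, w, 3) into an equirectangular canvas.

    Assumes the camera looks down the +z axis with the given horizontal
    field of view and square pixels.
    """
    H, W = canvas_hw
    ph, pw = partial.shape[:2]
    canvas = np.zeros((H, W, 3), dtype=partial.dtype)

    # Longitude in [-pi, pi) and latitude in (-pi/2, pi/2) per canvas pixel.
    lon = (np.arange(W) + 0.5) / W * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (np.arange(H) + 0.5) / H * np.pi
    lon, lat = np.meshgrid(lon, lat)

    # Unit view direction for every equirectangular pixel.
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)

    # Perspective projection onto the partial image's pixel grid.
    f = (pw / 2.0) / np.tan(np.deg2rad(fov_deg) / 2.0)  # focal length in px
    valid = z > 1e-6                                    # in front of camera
    u = f * x / np.maximum(z, 1e-6) + pw / 2.0
    v = -f * y / np.maximum(z, 1e-6) + ph / 2.0         # image v runs downward

    inside = valid & (u >= 0) & (u < pw) & (v >= 0) & (v < ph)
    canvas[inside] = partial[v[inside].astype(int), u[inside].astype(int)]
    return canvas

# A 90-degree-FoV partial image covers roughly the central quarter of the
# panorama's width; the rest is left empty for the generator to fill.
pano = embed_partial_image(np.full((256, 256, 3), 255, dtype=np.uint8))
```

In the pipeline the abstract outlines, the empty canvas regions would then be completed by the fine-tuned text-to-image model under the layout and text conditions, and the finished panorama passed to layout-conditioned depth estimation and NeRF fitting to obtain the 3D scene.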

URL

https://arxiv.org/abs/2404.00345

PDF

https://arxiv.org/pdf/2404.00345.pdf

