RealmDreamer: Text-Driven 3D Scene Generation with Inpainting and Depth Diffusion

2024-04-10 17:57:41
Jaidev Shriram, Alex Trevithick, Lingjie Liu, Ravi Ramamoorthi

Abstract

We introduce RealmDreamer, a technique for generating general forward-facing 3D scenes from text descriptions. Our technique optimizes a 3D Gaussian Splatting representation to match complex text prompts. We initialize these splats by using state-of-the-art text-to-image generators, lifting their samples into 3D, and computing the occlusion volume. We then optimize this representation across multiple views as a 3D inpainting task with image-conditional diffusion models. To learn correct geometric structure, we incorporate a depth diffusion model conditioned on the samples from the inpainting model. Finally, we finetune the model using sharpened samples from image generators. Notably, our technique does not require video or multi-view data and can synthesize a variety of high-quality 3D scenes in different styles, each consisting of multiple objects. Its generality additionally allows 3D synthesis from a single image.
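
The abstract does not say how image samples are "lifted into 3D". One common way to do this, shown here only as a minimal sketch and not as the authors' implementation, is to back-project each pixel through a pinhole camera using a per-pixel depth estimate. The function name `unproject_depth`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the toy depth map are all assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def unproject_depth(
    depth: List[List[float]], fx: float, fy: float, cx: float, cy: float
) -> List[Point]:
    """Back-project a depth map into camera-space 3D points (pinhole model).

    Hypothetical helper: the paper's actual lifting step may differ.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):        # v indexes image rows (y axis)
        for u, z in enumerate(row):        # u indexes image columns (x axis)
            if z <= 0:                     # skip pixels without valid depth
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Toy 2x2 depth map with one invalid pixel (assumed values, not real data).
depth = [[2.0, 2.0], [4.0, 0.0]]
points = unproject_depth(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

In a splat-based pipeline, points like these could serve as initial Gaussian centers; regions that no pixel covers from a new viewpoint are the occluded areas that the inpainting stage would then need to fill.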

URL

https://arxiv.org/abs/2404.07199

PDF

https://arxiv.org/pdf/2404.07199.pdf
