PanoDreamer: Consistent Text to 360-Degree Scene Generation

2025-04-07 14:57:01
Zhexiao Xiong, Zhang Chen, Zhong Li, Yi Xu, Nathan Jacobs

Abstract

Automatically generating a complete 3D scene from a text description, a reference image, or both has significant applications in fields like virtual reality and gaming. However, current methods often generate low-quality textures and inconsistent 3D structures. This is especially true when extrapolating significantly beyond the field of view of the reference image. To address these challenges, we propose PanoDreamer, a novel framework for consistent 3D scene generation with flexible text and image control. Our approach employs a large language model and a warp-refine pipeline, first generating an initial set of images and then compositing them into a 360-degree panorama. This panorama is then lifted into 3D to form an initial point cloud. We then use several approaches to generate additional images from different viewpoints that are consistent with the initial point cloud and that expand and refine it. Given the resulting set of images, we utilize 3D Gaussian Splatting to create the final 3D scene, which can then be rendered from different viewpoints.
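
The abstract describes lifting the generated 360-degree panorama into 3D to form an initial point cloud. As a rough illustration only, and not the paper's implementation, the sketch below shows how an equirectangular panorama with an assumed per-pixel depth map could be unprojected into a colored point cloud. The function name lift_panorama_to_points, the use of numpy, and the availability of a depth map are assumptions made for this example.

```python
import numpy as np

def lift_panorama_to_points(rgb, depth):
    """Unproject an equirectangular RGB panorama with per-pixel depth
    into a colored 3D point cloud (hypothetical helper, not the paper's code)."""
    h, w, _ = rgb.shape
    # Longitude (azimuth) and latitude (elevation) at each pixel center.
    lon = (np.arange(w) + 0.5) / w * 2 * np.pi - np.pi      # [-pi, pi)
    lat = np.pi / 2 - (np.arange(h) + 0.5) / h * np.pi      # (pi/2, -pi/2)
    lon, lat = np.meshgrid(lon, lat)                         # both (h, w)
    # Spherical-to-Cartesian conversion, scaled by the depth along each ray.
    x = depth * np.cos(lat) * np.sin(lon)
    y = depth * np.sin(lat)
    z = depth * np.cos(lat) * np.cos(lon)
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3)
    return points, colors

# Toy usage with random data standing in for the generated panorama and its depth.
rgb = np.random.rand(256, 512, 3)
depth = np.random.uniform(1.0, 10.0, size=(256, 512))
points, colors = lift_panorama_to_points(rgb, depth)
print(points.shape, colors.shape)  # (131072, 3) (131072, 3)
```

Such a point cloud could then serve as the geometric scaffold that additional viewpoint images are checked against and, eventually, as the initialization for 3D Gaussian Splatting.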

URL

https://arxiv.org/abs/2504.05152

PDF

https://arxiv.org/pdf/2504.05152.pdf

