Abstract
Automatically generating a complete 3D scene from a text description, a reference image, or both has significant applications in fields such as virtual reality and gaming. However, current methods often produce low-quality textures and inconsistent 3D structure, especially when extrapolating significantly beyond the field of view of the reference image. To address these challenges, we propose PanoDreamer, a novel framework for consistent 3D scene generation with flexible text and image control. Our approach employs a large language model and a warp-refine pipeline: it first generates an initial set of images and then composites them into a 360-degree panorama. This panorama is lifted into 3D to form an initial point cloud. We then apply several strategies to generate additional images from new viewpoints that are consistent with the initial point cloud, expanding and refining it. Given the resulting set of images, we use 3D Gaussian Splatting to create the final 3D scene, which can then be rendered from arbitrary viewpoints. Experiments demonstrate the effectiveness of PanoDreamer in generating high-quality, geometrically consistent 3D scenes.
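The abstract does not spell out the panorama-to-point-cloud lifting step, but it can be made concrete. The sketch below unprojects an equirectangular panorama with a per-pixel depth map into a colored point cloud, assuming a y-up camera frame and depth measured along each viewing ray; the function name and conventions are illustrative assumptions, not taken from the paper, and in practice `pano_depth` would come from a monocular depth estimator.

```python
import numpy as np

def panorama_to_point_cloud(pano_rgb: np.ndarray, pano_depth: np.ndarray):
    """Unproject an equirectangular panorama into a colored 3D point cloud.

    pano_rgb:   (H, W, 3) color image.
    pano_depth: (H, W) depth along each viewing ray (assumed, not from the paper).
    Returns points (H*W, 3) and colors (H*W, 3).
    """
    H, W = pano_depth.shape
    # Pixel grid -> spherical angles: columns span longitude [-pi, pi),
    # rows span latitude [pi/2, -pi/2] (top row of the image looks up).
    u = (np.arange(W) + 0.5) / W                 # in [0, 1)
    v = (np.arange(H) + 0.5) / H                 # in [0, 1)
    lon = (u - 0.5) * 2.0 * np.pi                # longitude theta
    lat = (0.5 - v) * np.pi                      # latitude phi
    lon, lat = np.meshgrid(lon, lat)             # both (H, W)

    # Unit ray directions in a y-up camera frame.
    dirs = np.stack([
        np.cos(lat) * np.sin(lon),               # x: right
        np.sin(lat),                              # y: up
        np.cos(lat) * np.cos(lon),               # z: forward
    ], axis=-1)                                   # (H, W, 3)

    points = dirs * pano_depth[..., None]         # scale rays by depth
    return points.reshape(-1, 3), pano_rgb.reshape(-1, 3)
```

From such a point cloud, novel pinhole views can be rendered by projecting the points into new cameras; the holes exposed under viewpoint change are what a warp-refine stage, as described in the abstract, would then inpaint.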
URL
https://arxiv.org/abs/2504.05152