Abstract
Text-driven 3D scene generation techniques have made rapid progress in recent years. Their success is mainly attributed to using existing generative models to iteratively perform image warping and inpainting to generate 3D scenes. However, these methods heavily rely on the outputs of existing models, leading to error accumulation in geometry and appearance that prevents them from being used in various scenarios (e.g., outdoor and unreal scenarios). To address this limitation, we generatively refine newly generated local views by querying and aggregating global 3D information, and then progressively generate the 3D scene. Specifically, we employ a tri-plane feature-based NeRF as a unified representation of the 3D scene to enforce global 3D consistency, and propose a generative refinement network that synthesizes new content of higher quality by exploiting the natural-image prior of a 2D diffusion model together with the global 3D information of the current scene. Our extensive experiments demonstrate that, compared to previous methods, our approach supports a wide variety of scene generation and arbitrary camera trajectories with improved visual quality and 3D consistency.
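To make the tri-plane representation concrete, below is a minimal sketch of how a tri-plane feature-based NeRF can be queried: a 3D point is projected onto three axis-aligned feature planes, the bilinearly sampled features are aggregated, and a small MLP decodes them into density and color. All sizes (plane resolution, feature width, decoder layers) and the feature aggregation by summation are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a tri-plane feature NeRF query (illustrative only;
# resolutions, dimensions, and the decoder are hypothetical choices).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriPlaneNeRF(nn.Module):
    def __init__(self, res=128, feat_dim=32):
        super().__init__()
        # Three learnable feature planes: XY, XZ, YZ.
        self.planes = nn.Parameter(torch.randn(3, feat_dim, res, res) * 0.01)
        # Small MLP decoding aggregated plane features into (sigma, r, g, b).
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 4),
        )

    def forward(self, xyz):
        # xyz: (N, 3) points assumed to lie in [-1, 1]^3.
        coords = torch.stack([
            xyz[:, [0, 1]],  # projection onto the XY plane
            xyz[:, [0, 2]],  # projection onto the XZ plane
            xyz[:, [1, 2]],  # projection onto the YZ plane
        ])                                                 # (3, N, 2)
        grid = coords.unsqueeze(2)                         # (3, N, 1, 2)
        feats = F.grid_sample(self.planes, grid,
                              align_corners=True)          # (3, C, N, 1)
        feats = feats.squeeze(-1).sum(dim=0).t()           # (N, C), summed over planes
        out = self.decoder(feats)
        sigma = F.softplus(out[:, :1])                     # non-negative density
        rgb = torch.sigmoid(out[:, 1:])                    # colors in [0, 1]
        return sigma, rgb

# Usage: query densities/colors for points sampled along camera rays.
model = TriPlaneNeRF()
pts = torch.rand(1024, 3) * 2 - 1
sigma, rgb = model(pts)
```

Queried this way, every rendered view reads from the same three shared planes, which is what lets a unified representation constrain global 3D consistency across the progressively generated scene.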
URL
https://arxiv.org/abs/2403.09439