Abstract
In this work, we present SceneDreamer, an unconditional generative model for unbounded 3D scenes, which synthesizes large-scale 3D landscapes from random noise. Our framework is learned from in-the-wild 2D image collections only, without any 3D annotations. At the core of SceneDreamer is a principled learning paradigm comprising 1) an efficient yet expressive 3D scene representation, 2) a generative scene parameterization, and 3) an effective renderer that can leverage the knowledge from 2D images. Our framework starts from an efficient bird's-eye-view (BEV) representation generated from simplex noise, which consists of a height field and a semantic field. The height field represents the surface elevation of 3D scenes, while the semantic field provides detailed scene semantics. This BEV scene representation enables 1) representing a 3D scene with quadratic complexity, 2) disentangled geometry and semantics, and 3) efficient training. Furthermore, we propose a novel generative neural hash grid to parameterize the latent space given 3D positions and the scene semantics, which aims to encode generalizable features across scenes. Lastly, a neural volumetric renderer, learned from 2D image collections through adversarial training, is employed to produce photorealistic images. Extensive experiments demonstrate the effectiveness of SceneDreamer and its superiority over state-of-the-art methods in generating vivid yet diverse unbounded 3D worlds.
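To make the BEV representation described above concrete, below is a minimal, illustrative sketch (not the authors' implementation) of sampling a height field and a semantic field on a 2D grid. Simplex noise is approximated here with multi-octave value noise so the example needs only NumPy and SciPy, and the label set and thresholds (water, sand, grass, forest, rock, snow) are assumptions made for illustration rather than values taken from the paper.

```python
import numpy as np
from scipy.ndimage import zoom

# Assumed, illustrative label set; the abstract does not specify the semantic classes.
LABELS = ["water", "sand", "grass", "forest", "rock", "snow"]

def value_noise(size: int, octaves: int = 5, seed: int = 0) -> np.ndarray:
    """Smooth 2D noise in [0, 1]: multi-octave value noise as a stand-in for simplex noise."""
    rng = np.random.default_rng(seed)
    field = np.zeros((size, size))
    amp, norm = 1.0, 0.0
    for o in range(octaves):
        res = 2 ** (o + 2)                          # coarser random grids for lower octaves
        coarse = rng.random((res, res))
        field += amp * zoom(coarse, size / res, order=3)[:size, :size]
        norm += amp
        amp *= 0.5                                  # halve the amplitude per octave
    field /= norm
    lo, hi = field.min(), field.max()
    return (field - lo) / (hi - lo + 1e-8)

def bev_scene(size: int = 256, seed: int = 0):
    """Sample one scene: returns (height_field, semantic_field), both of shape (size, size)."""
    height = value_noise(size, seed=seed)           # surface elevation in [0, 1]
    moisture = value_noise(size, seed=seed + 1)     # auxiliary channel for vegetation
    semantic = np.full((size, size), LABELS.index("grass"))
    semantic[height < 0.30] = LABELS.index("water")
    semantic[(height >= 0.30) & (height < 0.35)] = LABELS.index("sand")
    semantic[(height >= 0.35) & (moisture > 0.60)] = LABELS.index("forest")
    semantic[height >= 0.70] = LABELS.index("rock")
    semantic[height >= 0.85] = LABELS.index("snow")
    return height, semantic

if __name__ == "__main__":
    h, s = bev_scene(seed=42)
    print(h.shape, s.shape, [LABELS[i] for i in np.unique(s)])
```

In this reading, each (x, y) cell of the grid can be lifted to 3D by treating the height value as surface elevation and assigning the cell's semantic label to points near that surface, which is roughly the role the BEV representation plays before the generative hash grid and volumetric renderer described in the abstract take over.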
URL
https://arxiv.org/abs/2302.01330