Abstract
Scene-level 3D generation is a challenging research topic, with most existing methods generating only partial scenes and offering limited navigational freedom. We introduce WorldPrompter, a novel generative pipeline for synthesizing traversable 3D scenes from text prompts. We leverage panoramic videos as an intermediate representation to model the 360° details of a scene. WorldPrompter incorporates a conditional 360° panoramic video generator, capable of producing a 128-frame video that simulates a person walking through and capturing a virtual environment. The resulting video is then reconstructed as Gaussian splats by a fast feedforward 3D reconstructor, enabling a truly walkable experience within the 3D scene. Experiments demonstrate that our panoramic video generation model achieves convincing view consistency across frames, enabling high-quality panoramic Gaussian splat reconstruction and facilitating traversal over an area of the scene. Qualitative and quantitative results also show that it outperforms state-of-the-art 360° video generators and 3D scene generation models.
Abstract (translated)
Scene-level 3D generation is a challenging research topic; most existing methods can generate only partial scenes and offer limited freedom of navigation. We introduce WorldPrompter, a novel generative pipeline that synthesizes traversable 3D scenes from text prompts. We use panoramic video as an intermediate representation to model the 360° details of a scene. WorldPrompter includes a conditional 360° panoramic video generator capable of producing a 128-frame video that simulates a person walking through and capturing a virtual environment. The resulting video is then reconstructed into Gaussian splats by a fast feedforward 3D reconstruction method, enabling a truly walkable experience within the 3D scene. Experiments show that our panoramic video generation model achieves convincing cross-frame view consistency, which enables high-quality panoramic Gaussian splat reconstruction and supports traversal over a large area of the scene. Qualitative and quantitative results also show that it outperforms existing state-of-the-art 360° video generators and 3D scene generation models. By combining text-driven video generation with efficient 3D scene reconstruction, this work offers a new path toward highly interactive virtual reality experiences.
URL
https://arxiv.org/abs/2504.02045
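
For readers who want a concrete picture of the pipeline, below is a minimal Python sketch of the two-stage flow the abstract describes: text prompt to a 128-frame panoramic walk-through video, then feedforward reconstruction into Gaussian splats. All class and function names (GaussianSplatScene, generate_panoramic_video, reconstruct_gaussian_splats), signatures, and the frame resolution are illustrative assumptions, not the authors' released code or API.

```python
# Minimal, illustrative sketch of the two-stage text-to-3D-scene flow described
# above. All names, signatures, and default values here are assumptions for
# illustration only; they are not the authors' released code or API.

from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class GaussianSplatScene:
    """Minimal container for a reconstructed Gaussian-splat scene (assumed layout)."""
    means: np.ndarray        # (N, 3) splat centers
    covariances: np.ndarray  # (N, 3, 3) per-splat covariances
    colors: np.ndarray       # (N, 3) RGB colors
    opacities: np.ndarray    # (N,) per-splat opacities


def generate_panoramic_video(prompt: str, num_frames: int = 128,
                             height: int = 512, width: int = 1024) -> List[np.ndarray]:
    """Stage 1 (stand-in): conditional 360° panoramic video generation.

    In the paper this is a learned generator that simulates a person walking
    through the prompted scene; here it just returns blank equirectangular
    frames so the sketch runs end to end."""
    return [np.zeros((height, width, 3), dtype=np.uint8) for _ in range(num_frames)]


def reconstruct_gaussian_splats(frames: List[np.ndarray]) -> GaussianSplatScene:
    """Stage 2 (stand-in): fast feedforward reconstruction of the panoramic
    frames into Gaussian splats; here it returns an empty scene."""
    n = 0
    return GaussianSplatScene(
        means=np.zeros((n, 3)),
        covariances=np.zeros((n, 3, 3)),
        colors=np.zeros((n, 3)),
        opacities=np.zeros((n,)),
    )


def worldprompter(prompt: str) -> GaussianSplatScene:
    """Text prompt -> 128-frame panoramic walk-through video -> traversable scene."""
    frames = generate_panoramic_video(prompt, num_frames=128)
    return reconstruct_gaussian_splats(frames)


if __name__ == "__main__":
    scene = worldprompter("a sunlit modern living room with large windows")
    print(f"Reconstructed {scene.means.shape[0]} Gaussian splats from 128 panoramic frames.")
```

The two stages are deliberately decoupled in this sketch, mirroring the abstract's use of panoramic video as the intermediate representation: the generator and the feedforward reconstructor communicate only through the frame sequence.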