Abstract
Generating 3D worlds from text is a highly anticipated goal in computer vision. Existing works are limited in the degree of exploration they allow within a scene: they produce stretched-out and noisy artifacts when the camera moves beyond central or panoramic perspectives. To address this, we propose WorldExplorer, a novel method based on autoregressive video trajectory generation that builds fully navigable 3D scenes with consistent visual quality across a wide range of viewpoints. We initialize each scene by creating multi-view-consistent images corresponding to a 360-degree panorama, then expand it by leveraging video diffusion models in an iterative scene-generation pipeline. Concretely, we generate multiple videos along short, pre-defined trajectories that explore the scene in depth, including motion around objects. Our novel scene memory conditions each video on the most relevant prior views, while a collision-detection mechanism prevents degenerate results such as moving into objects. Finally, we fuse all generated views into a unified 3D representation via 3D Gaussian Splatting optimization. Compared to prior approaches, WorldExplorer produces high-quality scenes that remain stable under large camera motion, enabling, for the first time, realistic and unrestricted exploration. We believe this marks a significant step toward generating immersive and truly explorable virtual 3D environments.
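The abstract describes an iterative pipeline: panorama initialization, autoregressive video generation along short trajectories conditioned on a scene memory, collision checking, and a final fusion via 3D Gaussian Splatting. The following is a minimal Python sketch of that control flow only; every helper name (generate_panorama, collides, select_relevant_views, video_diffusion, fit_gaussian_splats) and the context size k_context are hypothetical placeholders standing in for components of the method, not the authors' actual code or API.

```python
# Minimal sketch of the exploration loop described in the abstract.
# All helpers are hypothetical placeholders, passed in as callables
# so the sketch stays agnostic to the concrete models.

from typing import Callable, List, Sequence

View = object        # placeholder for a posed RGB frame
Trajectory = object  # placeholder for a short, pre-defined camera path


def explore_scene(
    prompt: str,
    trajectories: Sequence[Trajectory],
    generate_panorama: Callable[[str], List[View]],
    collides: Callable[[Trajectory, List[View]], bool],
    select_relevant_views: Callable[[List[View], Trajectory, int], List[View]],
    video_diffusion: Callable[[List[View], Trajectory], List[View]],
    fit_gaussian_splats: Callable[[List[View]], object],
    k_context: int = 8,  # number of conditioning views (assumed value)
):
    # 1) Initialize the scene memory with multi-view-consistent
    #    images covering a 360-degree panorama.
    scene_memory: List[View] = generate_panorama(prompt)

    # 2) Autoregressively expand the scene along short trajectories.
    for trajectory in trajectories:
        # Collision detection: skip paths that would move into objects.
        if collides(trajectory, scene_memory):
            continue
        # Scene memory: condition each video on the most relevant prior views.
        context = select_relevant_views(scene_memory, trajectory, k_context)
        scene_memory.extend(video_diffusion(context, trajectory))

    # 3) Fuse all generated views into a unified 3D representation
    #    via 3D Gaussian Splatting optimization.
    return fit_gaussian_splats(scene_memory)
```

Passing the components as callables keeps this sketch self-contained; the actual system presumably couples these stages (e.g., view selection and diffusion conditioning) far more tightly than shown here.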
URL
https://arxiv.org/abs/2506.01799