Paper Reading AI Learner

WorldPrompter: Traversable Text-to-Scene Generation

2025-04-02 18:04:32
Zhaoyang Zhang, Yannick Hold-Geoffroy, Miloš Hašan, Chen Ziwen, Fujun Luan, Julie Dorsey, Yiwei Hu

Abstract

Scene-level 3D generation is a challenging research topic: most existing methods generate only partial scenes and offer limited navigational freedom. We introduce WorldPrompter, a novel generative pipeline for synthesizing traversable 3D scenes from text prompts. We leverage panoramic videos as an intermediate representation to model the 360° details of a scene. WorldPrompter incorporates a conditional 360° panoramic video generator capable of producing a 128-frame video that simulates a person walking through and capturing a virtual environment. The resulting video is then reconstructed into Gaussian splats by a fast feedforward 3D reconstructor, enabling a truly walkable experience within the 3D scene. Experiments demonstrate that our panoramic video generation model achieves convincing view consistency across frames, enabling high-quality panoramic Gaussian splat reconstruction and facilitating traversal over an area of the scene. Qualitative and quantitative results also show that it outperforms state-of-the-art 360° video generators and 3D scene generation models.
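The paper does not release implementation details here, but a common preprocessing step when feeding panoramic video frames to a perspective-based 3D reconstructor is to resample each equirectangular frame into pinhole-camera views. The sketch below (a hypothetical illustration, not the authors' code; all function and parameter names are our own) shows the standard spherical projection with nearest-neighbour sampling:

```python
import numpy as np

def equirect_to_perspective(pano, fov_deg, yaw_deg, pitch_deg, out_h, out_w):
    """Sample a pinhole-camera view from an equirectangular panorama.

    pano: (H, W, 3) array covering 360° (lon) x 180° (lat).
    yaw/pitch rotate the virtual camera; nearest-neighbour sampling
    keeps this sketch short (a real pipeline would interpolate).
    """
    H, W = pano.shape[:2]
    # Focal length (pixels) from the horizontal field of view.
    f = 0.5 * out_w / np.tan(0.5 * np.radians(fov_deg))
    # Pixel grid -> unit camera rays (z forward, x right, y down).
    xs = np.arange(out_w) - 0.5 * (out_w - 1)
    ys = np.arange(out_h) - 0.5 * (out_h - 1)
    x, y = np.meshgrid(xs, ys)
    rays = np.stack([x, y, np.full_like(x, f)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    # Rotate rays: pitch about x, then yaw about y.
    p, t = np.radians(pitch_deg), np.radians(yaw_deg)
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(p), -np.sin(p)],
                   [0, np.sin(p),  np.cos(p)]])
    Ry = np.array([[ np.cos(t), 0, np.sin(t)],
                   [0, 1, 0],
                   [-np.sin(t), 0, np.cos(t)]])
    rays = rays @ (Ry @ Rx).T
    # Ray direction -> spherical angles -> equirectangular pixel coords.
    lon = np.arctan2(rays[..., 0], rays[..., 2])      # [-pi, pi]
    lat = np.arcsin(np.clip(rays[..., 1], -1.0, 1.0)) # [-pi/2, pi/2]
    u = ((lon / np.pi + 1) * 0.5 * (W - 1)).round().astype(int) % W
    v = ((lat / (0.5 * np.pi) + 1) * 0.5 * (H - 1)).round().astype(int)
    return pano[v.clip(0, H - 1), u]
```

Calling this per frame with several yaw offsets yields overlapping perspective views, which is the kind of input a feedforward multi-view reconstructor typically expects.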

Abstract (translated)

Scene-level 3D generation is a challenging research topic: most existing methods generate only partial scenes and offer limited navigational freedom. We introduce WorldPrompter, a novel generative pipeline that synthesizes traversable 3D scenes from text prompts. We use panoramic video as an intermediate representation to model the 360° details of a scene. WorldPrompter incorporates a conditional 360° panoramic video generator capable of producing a 128-frame video that simulates a person walking through and capturing a virtual environment. The resulting video is then reconstructed into Gaussian splats by a fast feedforward 3D reconstructor, enabling a truly walkable experience within the 3D scene. Experiments show that our panoramic video generation model achieves convincing cross-frame view consistency, enabling high-quality panoramic Gaussian splat reconstruction and supporting traversal over a large area of the scene. Qualitative and quantitative results also show that it outperforms state-of-the-art 360° video generators and 3D scene generation models. By combining text-driven video generation with efficient 3D scene reconstruction, this work offers a new route toward highly interactive virtual-reality experiences.

URL

https://arxiv.org/abs/2504.02045

PDF

https://arxiv.org/pdf/2504.02045.pdf

