Abstract
The rapid advancement of diffusion models holds the promise of revolutionizing VR and AR applications, which typically require scene-level 4D assets to deliver immersive user experiences. However, existing diffusion models predominantly model static 3D scenes or object-level dynamics, which constrains their ability to provide truly immersive experiences. To address this issue, we propose HoloTime, a framework that integrates video diffusion models to generate panoramic videos from a single prompt or reference image, together with a 360-degree 4D scene reconstruction method that seamlessly transforms the generated panoramic video into 4D assets, enabling a fully immersive 4D experience for users. Specifically, to tame video diffusion models for generating high-fidelity panoramic videos, we introduce the 360World dataset, the first comprehensive collection of panoramic videos suitable for downstream 4D scene reconstruction tasks. With this curated dataset, we propose Panoramic Animator, a two-stage image-to-video diffusion model that converts panoramic images into high-quality panoramic videos. We then present Panoramic Space-Time Reconstruction, which applies space-time depth estimation to lift the generated panoramic videos into 4D point clouds, which in turn supervise the optimization of a holistic 4D Gaussian Splatting representation that reconstructs spatially and temporally consistent 4D scenes. To validate the efficacy of our method, we conducted a comparative analysis against existing approaches, demonstrating its superiority in both panoramic video generation and 4D scene reconstruction. These results show that our method creates more engaging and realistic immersive environments, thereby enhancing user experiences in VR and AR applications.
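The reconstruction stage hinges on lifting panoramic depth into 3D points. As a minimal sketch (not the authors' code), assuming a standard equirectangular parameterization and a per-frame metric depth map, the unprojection from panorama pixels to a space-time point cloud could look like the following; the function names and array shapes are illustrative assumptions.

```python
import numpy as np

def equirect_depth_to_points(depth: np.ndarray) -> np.ndarray:
    """Unproject an equirectangular depth map (H, W) to 3D points (H*W, 3).

    Assumes the standard equirectangular layout: columns map to longitude
    in [-pi, pi), rows map to latitude in [pi/2, -pi/2].
    """
    h, w = depth.shape
    # Pixel-center longitude/latitude grids.
    lon = (np.arange(w) + 0.5) / w * 2.0 * np.pi - np.pi  # (W,)
    lat = np.pi / 2.0 - (np.arange(h) + 0.5) / h * np.pi  # (H,)
    lon, lat = np.meshgrid(lon, lat)                      # each (H, W)
    # Unit ray directions on the sphere, scaled by per-pixel depth.
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    dirs = np.stack([x, y, z], axis=-1)                   # (H, W, 3)
    points = dirs * depth[..., None]                      # (H, W, 3)
    return points.reshape(-1, 3)

def depth_video_to_4d_points(depth_video: np.ndarray) -> np.ndarray:
    """Lift a depth video (T, H, W) to a 4D point cloud (T*H*W, 4).

    Each 3D point is stamped with its frame index, giving the space-time
    samples that a 4D Gaussian Splatting representation could be fit to.
    """
    t, h, w = depth_video.shape
    clouds = [equirect_depth_to_points(depth_video[i]) for i in range(t)]
    times = np.repeat(np.arange(t), h * w)[:, None].astype(np.float64)
    return np.concatenate([np.concatenate(clouds, axis=0), times], axis=1)
```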
URL
https://arxiv.org/abs/2504.21650