Abstract
In this paper, we propose Scene Splatter, a momentum-based paradigm for video diffusion to generate generic scenes from a single image. Existing methods that employ video generation models to synthesize novel views suffer from limited video length and scene inconsistency, leading to artifacts and distortions during further reconstruction. To address this issue, we construct noisy samples from the original features as momentum to enhance video details and maintain scene consistency. However, for latent features whose perception field spans both known and unknown regions, such latent-level momentum restricts the generative ability of video diffusion in unknown regions. Therefore, we further introduce the aforementioned consistent video as pixel-level momentum for a video generated directly without momentum, enabling better recovery of unseen regions. This cascaded momentum enables video diffusion models to generate novel views that are both high-fidelity and consistent. We then fine-tune the global Gaussian representation with the enhanced frames and render new frames for the momentum update in the next step. In this manner, we iteratively recover a 3D scene without being limited by video length. Extensive experiments demonstrate the generalization capability and superior performance of our method in high-fidelity and consistent scene generation.
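The cascaded momentum described above can be read as two blending steps: a latent-level blend that injects re-noised latents of the already known (rendered) content during denoising, followed by a pixel-level blend of the resulting consistent video with a free-running generation that is allowed to hallucinate unseen regions. The sketch below illustrates that reading in plain PyTorch under stated assumptions; every name (latent_momentum_step, pixel_momentum, the masks, and the weights sigma, beta, alpha) is hypothetical and this is not the authors' implementation.

```python
# Minimal sketch of one possible reading of the cascaded momentum;
# all names and shapes are illustrative assumptions, not the paper's code.
import torch

def latent_momentum_step(denoised, rendered_latent, known_mask, sigma, beta):
    """Latent-level momentum: blend the model's denoised latent with a
    re-noised latent of the already rendered (known) content, applied only
    where the receptive field covers known regions (known_mask == 1)."""
    noised_known = rendered_latent + sigma * torch.randn_like(rendered_latent)
    blended = beta * noised_known + (1.0 - beta) * denoised
    return known_mask * blended + (1.0 - known_mask) * denoised

def pixel_momentum(consistent_video, free_video, visibility, alpha):
    """Pixel-level momentum: keep the consistency-preserving video where the
    scene is visible from the input view, and let an unconstrained generation
    fill the unseen regions."""
    w = alpha * visibility
    return w * consistent_video + (1.0 - w) * free_video

# Toy example with small tensors.
denoised = torch.randn(1, 4, 8, 8)          # latent from the diffusion model
rendered_latent = torch.randn(1, 4, 8, 8)   # latent encoded from rendered frames
known_mask = (torch.rand(1, 1, 8, 8) > 0.5).float()
latent = latent_momentum_step(denoised, rendered_latent, known_mask,
                              sigma=0.1, beta=0.7)

consistent = torch.rand(1, 3, 64, 64)       # decoded video with latent momentum
free = torch.rand(1, 3, 64, 64)             # video generated without momentum
visibility = (torch.rand(1, 1, 64, 64) > 0.3).float()
fused = pixel_momentum(consistent, free, visibility, alpha=0.9)
```

In the iterative scheme the abstract describes, the fused frames would then be used to fine-tune the global Gaussian representation, whose renders supply the known content for the next momentum update.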
URL
https://arxiv.org/abs/2504.02764