Paper Reading AI Learner

Scene Splatter: Momentum 3D Scene Generation from Single Image with Video Diffusion Model

2025-04-03 17:00:44
Shengjun Zhang, Jinzhao Li, Xin Fei, Hao Liu, Yueqi Duan

Abstract

In this paper, we propose Scene Splatter, a momentum-based paradigm for video diffusion to generate generic scenes from a single image. Existing methods, which employ video generation models to synthesize novel views, suffer from limited video length and scene inconsistency, leading to artifacts and distortions during further reconstruction. To address this issue, we construct noisy samples from the original features as momentum to enhance video details and maintain scene consistency. However, for latent features whose perception field spans both known and unknown regions, such latent-level momentum restricts the generative ability of the video diffusion model in unknown regions. Therefore, we further inject the aforementioned consistent video as pixel-level momentum into a video generated directly without momentum, for better recovery of unseen regions. Our cascaded momentum enables video diffusion models to generate novel views that are both high-fidelity and consistent. We further finetune the global Gaussian representations with the enhanced frames and render new frames for the momentum update in the next step. In this manner, we can iteratively recover a 3D scene while avoiding the limitation of video length. Extensive experiments demonstrate the generalization capability and superior performance of our method in high-fidelity and consistent scene generation.
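
The cascaded momentum described above can be illustrated with a short, self-contained sketch. This is only a toy illustration under assumed conventions: add_noise, denoise_step, lam_latent, and lam_pixel are hypothetical placeholders rather than the authors' implementation or any real model's API, and the pixel-level fusion (which in the paper operates on decoded frames) is shown in the same toy latent space for brevity.

import torch

def add_noise(x, t, total_steps):
    # Diffuse the rendered (clean) latent x to noise level t; a simple
    # variance-preserving form used only for illustration.
    alpha = 1.0 - t / total_steps
    return alpha.sqrt() * x + (1.0 - alpha).sqrt() * torch.randn_like(x)

def denoise_step(z, t, total_steps):
    # Stand-in for one reverse step of a video diffusion model; a real model
    # would predict and remove noise conditioned on the input image.
    return z - 0.01 * torch.randn_like(z)

def generate_with_momentum(rendered_latent, total_steps=50,
                           lam_latent=0.3, lam_pixel=0.5):
    # Branch A: denoising with latent-level momentum. At every step, a noisy
    # sample built from the rendered latent is blended in, keeping the video
    # consistent with the current 3D scene in known regions.
    z_consistent = torch.randn_like(rendered_latent)
    # Branch B: plain generation without momentum, free to hallucinate
    # content in regions the rendered scene does not cover.
    z_free = torch.randn_like(rendered_latent)

    for t in torch.arange(total_steps, 0, -1, dtype=torch.float32):
        z_consistent = denoise_step(z_consistent, t, total_steps)
        momentum = add_noise(rendered_latent, t, total_steps)
        z_consistent = (1 - lam_latent) * z_consistent + lam_latent * momentum
        z_free = denoise_step(z_free, t, total_steps)

    # Pixel-level momentum: fuse the consistent branch into the freely
    # generated one, so unseen regions keep generative detail while known
    # regions stay anchored to the scene.
    return (1 - lam_pixel) * z_free + lam_pixel * z_consistent

# Toy usage: a latent "video" of 8 frames, 4 channels, 32x32.
video = generate_with_momentum(torch.randn(8, 4, 32, 32))
print(video.shape)

The point the sketch tries to convey is that the momentum branch keeps known regions anchored to the rendered scene at every denoising step, while the momentum-free branch retains full generative freedom for unseen regions, and the two are fused only once at the end.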

Abstract (translated)

In this paper, we propose Scene Splatter, a momentum-based paradigm for generating generic scenes from a single image. When existing methods use video generation models to synthesize novel views, they suffer from limited video length and scene inconsistency, which leads to artifacts and distortions during further reconstruction. To address this issue, we construct noisy samples based on the original features as momentum, enhancing video details and maintaining scene consistency. However, for latent features whose perception field covers both known and unknown regions, such latent-level momentum restricts the generative ability of the video diffusion model in unknown regions. Therefore, we further introduce the aforementioned consistent video as pixel-level momentum to improve the recovery of unseen regions in a video generated directly without momentum. Our cascaded momentum enables video diffusion models to generate novel views that are both high-fidelity and consistent. In addition, we finetune the global Gaussian representation and use the enhanced frames to render new frames for the momentum update in the next step. In this way, we can iteratively recover a 3D scene while avoiding the limitation of video length. Extensive experiments demonstrate the generalization capability and superior performance of our method in generating high-quality and consistent scenes.

URL

https://arxiv.org/abs/2504.02764

PDF

https://arxiv.org/pdf/2504.02764.pdf

