VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step

2025-04-02 17:59:21
Hanyang Wang, Fangfu Liu, Jiawei Chi, Yueqi Duan

Abstract

Recovering 3D scenes from sparse views is challenging because the problem is inherently ill-posed. Conventional methods have developed specialized solutions (e.g., geometry regularization or feed-forward deterministic models) to mitigate the issue. However, their performance still degrades when the input views overlap minimally and provide insufficient visual information. Fortunately, recent video generative models show promise in addressing this challenge, as they can generate video clips with plausible 3D structures. Powered by large pretrained video diffusion models, some pioneering studies have started to explore video generative priors and create 3D scenes from sparse views. Despite impressive improvements, they are limited by slow inference and the lack of 3D constraints, leading to inefficiency and reconstruction artifacts that do not align with real-world geometric structure. In this paper, we propose VideoScene, which distills the video diffusion model to generate 3D scenes in one step, aiming to build an efficient and effective tool that bridges the gap from video to 3D. Specifically, we design a 3D-aware leap flow distillation strategy to leap over time-consuming redundant information, and we train a dynamic denoising policy network to adaptively determine the optimal leap timestep during inference. Extensive experiments demonstrate that VideoScene generates 3D scenes faster and with better quality than previous video diffusion models, highlighting its potential as an efficient tool for future video-to-3D applications. Project Page: this https URL

URL

https://arxiv.org/abs/2504.01956

PDF

https://arxiv.org/pdf/2504.01956.pdf

