GaussVideoDreamer: 3D Scene Generation with Video Diffusion and Inconsistency-Aware Gaussian Splatting

2025-04-14 09:04:01
Junlin Hao, Peiheng Wang, Haoyang Wang, Xinggong Zhang, Zongming Guo

Abstract

Single-image 3D scene reconstruction is challenging because the problem is inherently ill-posed and a single input provides few constraints. Recent work has explored two promising directions: multiview generative models, which train on 3D-consistent datasets but struggle with out-of-distribution generalization, and 3D scene inpainting and completion frameworks, which suffer from cross-view inconsistency and suboptimal error handling because they depend exclusively on depth data or 3D smoothness, ultimately degrading output quality and computational performance. Building upon these approaches, we present GaussVideoDreamer, which advances generative multimedia approaches by bridging the gap between image, video, and 3D generation and integrating their strengths through two key innovations: (1) a progressive video inpainting strategy that harnesses temporal coherence for improved multiview consistency and faster convergence; (2) a 3D Gaussian Splatting consistency mask that guides the video diffusion with 3D-consistent multiview evidence. Our pipeline combines three core components: a geometry-aware initialization protocol, Inconsistency-Aware Gaussian Splatting, and a progressive video inpainting strategy. Experimental results demonstrate that our approach achieves 32% higher LLaVA-IQA scores and at least a 2x speedup compared to existing methods while maintaining robust performance across diverse scenes.

URL

https://arxiv.org/abs/2504.10001

PDF

https://arxiv.org/pdf/2504.10001.pdf

