Abstract
Single-image 3D scene reconstruction is inherently ill-posed: a single view provides too few constraints to recover a full scene. Recent work has explored two promising directions: multiview generative models, which are trained on 3D-consistent datasets but generalize poorly out of distribution, and 3D scene inpainting and completion frameworks, which rely solely on depth data or 3D smoothness priors and therefore suffer from cross-view inconsistency and poor error handling, degrading both output quality and computational efficiency. Building on these approaches, we present GaussVideoDreamer, which bridges image, video, and 3D generation and integrates their strengths through two key innovations: (1) a progressive video inpainting strategy that exploits temporal coherence for improved multiview consistency and faster convergence, and (2) a 3D Gaussian Splatting consistency mask that guides the video diffusion with 3D-consistent multiview evidence. Our pipeline combines three core components: a geometry-aware initialization protocol, Inconsistency-Aware Gaussian Splatting, and the progressive video inpainting strategy. Experiments show that our approach achieves 32% higher LLaVA-IQA scores and at least a 2x speedup over existing methods while remaining robust across diverse scenes.
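To make the three-stage pipeline concrete, below is a minimal control-flow sketch. Everything in it is an illustrative assumption: the function names (`init_gaussians_from_depth`, `consistency_mask`, `progressive_video_inpaint`), the pinhole focal length, the error threshold `tau`, and the chunk size are placeholders rather than the paper's actual API, and lightweight numpy stand-ins replace the Gaussian Splatting renderer and the video diffusion model.

```python
import numpy as np

def init_gaussians_from_depth(image, depth, f=None):
    """Geometry-aware initialization (assumed form): back-project pixels
    to 3D points with a pinhole model; focal length guessed if absent."""
    h, w = depth.shape
    f = f or 0.5 * w
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    pts = np.stack([(xs - w / 2) * depth / f,
                    (ys - h / 2) * depth / f,
                    depth], axis=-1).reshape(-1, 3)
    return pts, image.reshape(-1, 3)

def consistency_mask(render_error, tau=0.1):
    """Inconsistency-aware step (assumed rule): flag pixels whose per-view
    rendering error exceeds tau; these regions go to the video inpainter."""
    return render_error > tau

def progressive_video_inpaint(frames, masks, chunk=4):
    """Progressive strategy: inpaint the camera-path video a few frames at
    a time so already-consistent frames condition later ones. The diffusion
    call is faked here with a mean-fill placeholder."""
    frames = [fr.copy() for fr in frames]
    for start in range(0, len(frames), chunk):
        for i in range(start, min(start + chunk, len(frames))):
            m = masks[i]
            frames[i][m] = frames[i][~m].mean(axis=0) if (~m).any() else 0.5
    return frames

# Toy end-to-end run on random data, standing in for a real scene.
rng = np.random.default_rng(0)
img = rng.random((64, 64, 3), dtype=np.float32)
depth = 1.0 + rng.random((64, 64), dtype=np.float32)
points, colors = init_gaussians_from_depth(img, depth)
views = [rng.random((64, 64, 3), dtype=np.float32) for _ in range(8)]
errors = [rng.random((64, 64), dtype=np.float32) * 0.3 for _ in range(8)]
masks = [consistency_mask(e) for e in errors]
inpainted = progressive_video_inpaint(views, masks)
```

The key design point the sketch mirrors is the feedback loop: splatting produces 3D-consistent evidence plus an error mask, and the inpainter only rewrites the flagged regions, chunk by chunk, rather than regenerating every view from scratch.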
URL
https://arxiv.org/abs/2504.10001