Abstract
We consider the problem of novel view synthesis from unposed images in a single feed-forward pass. Our framework capitalizes on the fast speed, scalability, and high-quality 3D reconstruction and view synthesis capabilities of 3D Gaussian Splatting (3DGS), which we extend to offer a practical solution that relaxes common assumptions such as dense image views, accurate camera poses, and substantial image overlap. We achieve this by identifying and addressing the unique challenges that arise from the use of pixel-aligned 3DGS: misaligned 3D Gaussians across different views induce noisy or sparse gradients that destabilize training and hinder convergence, especially when the above assumptions are not met. To mitigate this, we employ pre-trained monocular depth estimation and visual correspondence models to obtain coarse alignments of the 3D Gaussians. We then introduce lightweight, learnable modules that refine the depth and pose estimates from these coarse alignments, improving the quality of 3D reconstruction and novel view synthesis. Furthermore, the refined estimates are leveraged to compute geometry confidence scores, which assess the reliability of the 3D Gaussian centers and condition the prediction of the Gaussian parameters accordingly. Extensive evaluations on large-scale real-world datasets demonstrate that PF3plat sets a new state-of-the-art across all benchmarks, supported by comprehensive ablation studies that validate our design choices.
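To make the described pipeline concrete, below is a minimal, hypothetical PyTorch sketch of the stages the abstract names: coarse alignment from frozen monocular depth and correspondence models (replaced here by random stand-in tensors), lightweight learnable refinement of depth and pose, and confidence-conditioned prediction of pixel-aligned Gaussian parameters. All module names, layer sizes, and the additive 6-DoF pose update are our own illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DepthPoseRefiner(nn.Module):
    """Lightweight learnable refinement of coarse depth and pose.

    The abstract only states that such modules exist; the layer sizes and
    the additive 6-DoF pose update used here are illustrative assumptions.
    """

    def __init__(self, feat_dim: int = 64):
        super().__init__()
        # Per-pixel depth residual predicted from image features + coarse depth.
        self.depth_head = nn.Sequential(
            nn.Conv2d(feat_dim + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )
        # Global 6-DoF pose update (axis-angle rotation + translation).
        self.pose_head = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 6),
        )

    def forward(self, feats, coarse_depth, coarse_pose):
        refined_depth = coarse_depth + self.depth_head(
            torch.cat([feats, coarse_depth], dim=1)
        )
        # A real implementation would compose SE(3) transforms; an additive
        # update on the 6-vector keeps this sketch self-contained.
        refined_pose = coarse_pose + self.pose_head(feats.mean(dim=(2, 3)))
        return refined_depth, refined_pose


class ConfidenceConditionedGaussianHead(nn.Module):
    """Predicts pixel-aligned 3DGS parameters, conditioned on a geometry
    confidence score derived from the refined depth."""

    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.conf_head = nn.Sequential(
            nn.Conv2d(feat_dim + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1), nn.Sigmoid(),
        )
        # 3 scale + 4 rotation (quaternion) + 1 opacity + 3 color = 11 channels.
        self.param_head = nn.Conv2d(feat_dim + 1, 11, 3, padding=1)

    def forward(self, feats, refined_depth):
        conf = self.conf_head(torch.cat([feats, refined_depth], dim=1))
        params = self.param_head(torch.cat([feats, conf], dim=1))
        scale, rot, opacity, color = params.split([3, 4, 1, 3], dim=1)
        return {
            "confidence": conf,                   # reliability of each Gaussian center
            "scale": scale.exp(),                 # strictly positive scales
            "rotation": F.normalize(rot, dim=1),  # unit quaternions
            "opacity": opacity.sigmoid(),
            "color": color.sigmoid(),
        }


if __name__ == "__main__":
    B, C, H, W = 2, 64, 32, 32
    feats = torch.randn(B, C, H, W)        # stand-in for backbone image features
    coarse_depth = torch.rand(B, 1, H, W)  # stand-in for a frozen monocular depth model
    coarse_pose = torch.zeros(B, 6)        # stand-in for pose from visual correspondences

    depth, pose = DepthPoseRefiner(C)(feats, coarse_depth, coarse_pose)
    gaussians = ConfidenceConditionedGaussianHead(C)(feats, depth)
    # Gaussian centers would come from unprojecting `depth` with `pose`.
    print({k: tuple(v.shape) for k, v in gaussians.items()})
```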
URL
https://arxiv.org/abs/2410.22128