Abstract
While recent foundation models have enabled significant breakthroughs in monocular depth estimation, a clear path towards safe and reliable deployment in the real world remains elusive. Metric depth estimation, which involves predicting absolute distances, poses particular challenges, as even the most advanced foundation models remain prone to critical errors. Since uncertainty quantification has emerged as a promising way to address these limitations and enable trustworthy deployment, we fuse five different uncertainty quantification methods with the current state-of-the-art DepthAnythingV2 foundation model. To cover a wide range of metric depth domains, we evaluate their performance on four diverse datasets. Our findings identify fine-tuning with the Gaussian Negative Log-Likelihood Loss (GNLL) as a particularly promising approach, offering reliable uncertainty estimates while maintaining predictive performance and computational efficiency on par with the baseline, in terms of both training and inference time. By fusing uncertainty quantification and foundation models within the context of monocular depth estimation, this paper lays a critical foundation for future research aimed at improving not only model performance but also its explainability. Extending this synthesis of uncertainty quantification and foundation models to other crucial tasks, such as semantic segmentation and pose estimation, presents exciting opportunities for safer and more reliable machine vision systems.
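The GNLL approach highlighted above trains the network to predict, for every pixel, both a depth mean and a variance, with the variance serving as the uncertainty estimate. As a minimal sketch (not the paper's implementation; the `mu`/`log_var` parameterization and names are illustrative assumptions), the per-pixel loss can be written as:

```python
import numpy as np

def gaussian_nll(mu, log_var, target):
    """Per-pixel Gaussian negative log-likelihood (up to an additive constant).

    The network predicts a depth mean `mu` and a log-variance `log_var`;
    minimizing this loss lets exp(log_var) act as a learned uncertainty
    estimate alongside the depth prediction. Predicting the log of the
    variance keeps the variance strictly positive without extra constraints.
    """
    var = np.exp(log_var)
    return 0.5 * (log_var + (target - mu) ** 2 / var)

# Toy illustration on three "pixels": a confident correct prediction,
# a confident wrong one, and an uncertain wrong one.
mu = np.array([2.0, 2.0, 5.0])
log_var = np.array([-2.0, -2.0, 2.0])
target = np.array([2.0, 5.0, 2.0])
loss = gaussian_nll(mu, log_var, target)
```

The loss rewards well-calibrated uncertainty: a confident wrong prediction is penalized far more than the same error reported with high variance, which is what makes the learned variance usable as a reliability signal at deployment time.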
URL
https://arxiv.org/abs/2501.08188