Paper Reading AI Learner

Depth Anything with Any Prior

2025-05-15 17:59:50
Zehan Wang, Siyu Chen, Lihe Yang, Jialei Wang, Ziang Zhang, Hengshuang Zhao, Zhou Zhao

Abstract

This work presents Prior Depth Anything, a framework that combines incomplete but precise metric information in depth measurement with relative but complete geometric structures in depth prediction, generating accurate, dense, and detailed metric depth maps for any scene. To this end, we design a coarse-to-fine pipeline to progressively integrate the two complementary depth sources. First, we introduce pixel-level metric alignment and distance-aware weighting to pre-fill diverse metric priors by explicitly using depth prediction. It effectively narrows the domain gap between prior patterns, enhancing generalization across varying scenarios. Second, we develop a conditioned monocular depth estimation (MDE) model to refine the inherent noise of depth priors. By conditioning on the normalized pre-filled prior and prediction, the model further implicitly merges the two complementary depth sources. Our model showcases impressive zero-shot generalization across depth completion, super-resolution, and inpainting over 7 real-world datasets, matching or even surpassing previous task-specific methods. More importantly, it performs well on challenging, unseen mixed priors and enables test-time improvements by switching prediction models, providing a flexible accuracy-efficiency trade-off while evolving with advancements in MDE models.
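The coarse stage described above (pixel-level metric alignment with distance-aware weighting) can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual code: the function name, the k-nearest-neighbor scheme, and the inverse-distance weights are assumptions made for clarity. Zeros in the sparse map are taken to mark missing measurements.

```python
import numpy as np

def prefill_metric_prior(sparse_metric, relative_pred, k=5):
    """Pre-fill a sparse metric depth prior using a dense relative prediction.

    Hypothetical sketch of the coarse alignment stage: for each missing
    pixel, the k nearest valid metric samples vote on a local scale for
    the relative prediction, weighted by inverse pixel distance
    (a simple form of distance-aware weighting).
    """
    valid = sparse_metric > 0
    ys, xs = np.nonzero(valid)
    # Per-sample scale mapping the relative prediction to metric units.
    scales = sparse_metric[ys, xs] / np.maximum(relative_pred[ys, xs], 1e-6)

    filled = sparse_metric.copy()
    miss_ys, miss_xs = np.nonzero(~valid)
    for y, x in zip(miss_ys, miss_xs):
        d2 = (ys - y) ** 2 + (xs - x) ** 2
        idx = np.argsort(d2)[:k]               # k nearest metric samples
        wgt = 1.0 / (np.sqrt(d2[idx]) + 1e-6)  # distance-aware weights
        local_scale = np.sum(wgt * scales[idx]) / np.sum(wgt)
        filled[y, x] = local_scale * relative_pred[y, x]
    return filled
```

The pre-filled map would then be normalized and fed, together with the prediction, as conditioning input to the refinement MDE model described in the second stage.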

URL

https://arxiv.org/abs/2505.10565

PDF

https://arxiv.org/pdf/2505.10565.pdf
