Abstract
Self-supervised monocular depth estimation aims to infer depth without relying on labeled data. However, the absence of supervision limits the model's representation ability, restricting how accurately it can capture fine scene details. Prior information can mitigate this issue by strengthening the model's understanding of scene structure and texture. Nevertheless, relying on a single type of prior often falls short in complex scenes, so generalization performance must be improved. To address these challenges, we introduce a novel self-supervised monocular depth estimation model that leverages multiple priors to bolster representation along the spatial, context, and semantic dimensions. Specifically, we employ a hybrid transformer and a lightweight pose network to obtain long-range spatial priors. A context prior attention is then designed to improve generalization, particularly in complex structures or textureless areas. In addition, semantic priors are introduced via a semantic boundary loss, and a semantic prior attention further refines the semantic features extracted by the decoder. Experiments on three diverse datasets demonstrate the effectiveness of the proposed model: integrating multiple priors comprehensively enhances representation ability, improving the accuracy and reliability of depth estimation. Code is available at: \url{this https URL}
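The abstract does not spell out the internals of the prior-attention modules. As a rough, framework-agnostic illustration of the general idea (decoder features refined by cross-attending to prior features), here is a minimal NumPy sketch; all function and argument names are hypothetical and not taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prior_attention(decoder_feats, prior_feats):
    """Hypothetical sketch: refine decoder features with prior features
    via scaled dot-product cross-attention and a residual connection.

    decoder_feats: (N, d) queries, e.g. flattened decoder tokens
    prior_feats:   (M, d) keys/values, e.g. context or semantic priors
    """
    d = decoder_feats.shape[-1]
    scores = decoder_feats @ prior_feats.T / np.sqrt(d)  # (N, M)
    attn = softmax(scores, axis=-1)                      # rows sum to 1
    return decoder_feats + attn @ prior_feats            # residual refinement
```

In a real model the queries, keys, and values would each pass through learned projections, and the prior features would come from the context or semantic branches described above; this sketch only shows the attention-plus-residual pattern.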
URL
https://arxiv.org/abs/2406.08928