Multiple Prior Representation Learning for Self-Supervised Monocular Depth Estimation via Hybrid Transformer

2024-06-13 08:51:57
Guodong Sun, Junjie Liu, Mingxuan Liu, Moyun Liu, Yang Zhang

Abstract

Self-supervised monocular depth estimation aims to infer depth without relying on labeled data. However, the absence of labels limits the model's representation ability and, in turn, how accurately it captures fine scene details. Prior information can mitigate this issue by improving the model's understanding of scene structure and texture. Nevertheless, a single type of prior often falls short in complex scenes, so generalization still needs to improve. To address these challenges, we introduce a novel self-supervised monocular depth estimation model that leverages multiple priors to strengthen representation across the spatial, context, and semantic dimensions. Specifically, we employ a hybrid transformer and a lightweight pose network to obtain long-range spatial priors. Context prior attention is then designed to improve generalization, particularly for complex structures and untextured areas. In addition, semantic priors are introduced through a semantic boundary loss, and semantic prior attention is added to further refine the semantic features extracted by the decoder. Experiments on three diverse datasets demonstrate the effectiveness of the proposed model: by integrating multiple priors, it enhances representation ability and improves the accuracy and reliability of depth estimation. Code is available at: this https URL

URL

https://arxiv.org/abs/2406.08928

PDF

https://arxiv.org/pdf/2406.08928.pdf

