Paper Reading AI Learner

Learning monocular depth estimation infusing traditional stereo knowledge

2019-04-08 15:59:07
Fabio Tosi, Filippo Aleotti, Matteo Poggi, Stefano Mattoccia

Abstract

Depth estimation from a single image represents a fascinating yet challenging problem with countless applications. Recent works have shown that this task can be learned without direct supervision from ground-truth labels by leveraging image synthesis on sequences or stereo pairs. Focusing on this second case, in this paper we leverage stereo matching to improve monocular depth estimation. To this aim we propose monoResMatch, a novel deep architecture designed to infer depth from a single input image by synthesizing features from a different point of view, horizontally aligned with the input image, and performing stereo matching between the two cues. In contrast to previous works sharing this rationale, our network is the first trained end-to-end from scratch. Moreover, we show how obtaining proxy ground-truth annotations through traditional stereo algorithms, such as Semi-Global Matching, enables more accurate monocular depth estimation while still avoiding the need for expensive depth labels, thus keeping the approach self-supervised. Exhaustive experimental results prove how the synergy between i) the proposed monoResMatch architecture and ii) proxy supervision attains state-of-the-art results for self-supervised monocular depth estimation. The code is publicly available at this https URL.
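The proxy labels come from a classical stereo algorithm; the paper names Semi-Global Matching. As a rough, hedged illustration of the matching-cost idea such algorithms build on (not the paper's actual pipeline, and a simplification: real SGM additionally aggregates costs along scanline paths and adds smoothness penalties), here is a minimal winner-takes-all SAD block matcher in NumPy on a synthetic rectified pair:

```python
import numpy as np

def block_match(left, right, max_disp=8, win=4):
    """Brute-force block matching: for each left-image pixel, pick the
    horizontal shift d that minimises the sum of absolute differences
    (SAD) between local windows in the two rectified views."""
    h, w = left.shape
    disp = np.zeros((h, w), dtype=np.int32)
    for y in range(win, h - win):
        for x in range(win + max_disp, w - win):
            patch = left[y - win:y + win + 1, x - win:x + win + 1]
            costs = [
                np.abs(patch - right[y - win:y + win + 1,
                                     x - d - win:x - d + win + 1]).sum()
                for d in range(max_disp + 1)
            ]
            disp[y, x] = int(np.argmin(costs))  # winner-takes-all
    return disp

rng = np.random.default_rng(0)
true_disp = 3
left = rng.random((32, 48))
# In a rectified pair, a left-image point at x appears at x - d in the
# right image; a horizontal shift of the whole image simulates this.
right = np.roll(left, -true_disp, axis=1)
disp = block_match(left, right)
# Median over the interior (away from borders and roll wrap-around)
# recovers the true shift of 3.
print(int(np.median(disp[8:-8, 16:-16])))
```

A full SGM implementation would replace the per-pixel `argmin` with cost volumes aggregated along multiple image paths, which is what makes its output dense and smooth enough to serve as proxy ground truth.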

Abstract (translated)

Depth estimation from a single image is a fascinating yet challenging problem with countless applications. Recent research has shown that this task can be learned without direct supervision from ground-truth labels by leveraging image synthesis on sequences or stereo pairs. Focusing on the second case, this paper leverages stereo matching to improve monocular depth estimation. To this end, we propose monoResMatch, a novel deep architecture that infers depth from a single input image by synthesizing features from a different viewpoint, horizontally aligned with the input image, and performing stereo matching between the two cues. In contrast to previous works sharing this rationale, our network is the first trained end-to-end from scratch. Moreover, we show how obtaining proxy ground-truth annotations through traditional stereo algorithms, such as Semi-Global Matching, enables more accurate monocular depth estimation while keeping a self-supervised approach that avoids the need for expensive depth labels. Exhaustive experimental results demonstrate how the synergy between i) the proposed monoResMatch architecture and ii) proxy supervision attains state-of-the-art performance in self-supervised monocular depth estimation. The code is publicly available at this https URL.

URL

https://arxiv.org/abs/1904.04144

PDF

https://arxiv.org/pdf/1904.04144.pdf
