Boosting Monocular Depth Estimation with Sparse Guided Points

2022-02-03 08:52:54
Guangkai Xu, Wei Yin, Hao Chen, Kai Cheng, Feng Zhao, Chunhua Shen

Abstract

Existing monocular depth estimation methods show excellent robustness in the wild, but their affine-invariant predictions must be globally aligned with the ground truth to be converted into metric depth. In this work, we first propose a modified locally weighted linear regression strategy that leverages sparse ground truth to generate a flexible depth transformation, correcting the coarse misalignment introduced by the global recovery strategy. Applying this strategy, we achieve significant improvements (up to more than 50%) over recent state-of-the-art methods on five zero-shot datasets. Moreover, we train a robust depth estimation model on 6.3 million data samples and analyze the training process by decoupling the inaccuracy into coarse-misalignment inaccuracy and detail-missing inaccuracy. As a result, with the help of our recovery strategy, our ResNet50-based model even outperforms the state-of-the-art DPT ViT-Large model. Beyond accuracy, consistency is also improved for simple per-frame video depth estimation. Compared with monocular depth estimation, robust video depth estimation, and depth completion methods, our pipeline achieves state-of-the-art performance on video depth estimation without any post-processing. Experiments on 3D scene reconstruction from consistent video depth are also conducted for intuitive comparison.
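
To make the contrast between global recovery and a locally weighted fit concrete, here is a minimal sketch in NumPy. It is not the authors' implementation: the abstract does not describe the details of their "modified" strategy, so this uses plain locally weighted linear regression with a Gaussian kernel on pixel distance. The function names, the `bandwidth` parameter, and the kernel choice are all assumptions made for illustration.

```python
import numpy as np


def global_scale_shift(pred_pts, gt_pts):
    """Global recovery: one least-squares scale s and shift t with s*pred + t ~= gt."""
    A = np.stack([pred_pts, np.ones_like(pred_pts)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt_pts, rcond=None)
    return s, t


def lwlr_align(pred_map, pts_yx, gt_pts, bandwidth=1.5, eps=1e-12):
    """Hypothetical sketch: per-pixel scale/shift from weighted least squares.

    pred_map : (H, W) affine-invariant depth prediction
    pts_yx   : (N, 2) integer pixel coordinates of the sparse guided points
    gt_pts   : (N,)   metric depth at those points
    bandwidth: Gaussian kernel width in pixels (an assumed hyperparameter)
    """
    h, w = pred_map.shape
    pred_pts = pred_map[pts_yx[:, 0], pts_yx[:, 1]]

    # Squared pixel distance from every query pixel to every sparse point: (H*W, N).
    ys, xs = np.mgrid[0:h, 0:w]
    query = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    d2 = ((query[:, None, :] - pts_yx[None, :, :].astype(float)) ** 2).sum(-1)

    # Gaussian weights; subtracting the per-pixel minimum distance rescales each
    # row by a constant, which avoids underflow without changing the fit.
    wgt = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / (2.0 * bandwidth ** 2))

    # Closed-form weighted least squares for the two parameters at each pixel.
    sw = wgt.sum(axis=1)
    sx = wgt @ pred_pts
    sy = wgt @ gt_pts
    sxx = wgt @ (pred_pts ** 2)
    sxy = wgt @ (pred_pts * gt_pts)
    det = sw * sxx - sx ** 2 + eps  # eps guards against a degenerate local fit
    scale = (sw * sxy - sx * sy) / det
    shift = (sy - scale * sx) / sw

    return (scale * pred_map.ravel() + shift).reshape(h, w)
```

When the true relation between prediction and metric depth is a single affine map, both approaches should agree; the local fit is only expected to help when the required scale and shift vary across the image, which is the coarse misalignment the abstract refers to.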

URL

https://arxiv.org/abs/2202.01470

PDF

https://arxiv.org/pdf/2202.01470.pdf
