
$\mathrm{F^2Depth}$: Self-supervised Indoor Monocular Depth Estimation via Optical Flow Consistency and Feature Map Synthesis

2024-03-27 11:00:33
Xiaotong Guo, Huijie Zhao, Shuwei Shao, Xudong Li, Baochang Zhang

Abstract

Self-supervised monocular depth estimation methods have received increasing attention because they do not require large labelled datasets. Such self-supervised methods depend on high-quality salient features and consequently suffer a severe performance drop in indoor scenes, where the low-textured regions that dominate the scene are almost indiscriminative. To address this issue, we propose a self-supervised indoor monocular depth estimation framework called $\mathrm{F^2Depth}$. A self-supervised optical flow estimation network is introduced to supervise depth learning. To improve optical flow estimation in low-textured areas, only patches of points with more discriminative features are adopted for finetuning, based on our patch-based photometric loss. The finetuned optical flow estimation network generates high-accuracy optical flow as a supervisory signal for depth estimation, and an optical flow consistency loss is designed accordingly. Multi-scale feature maps produced by the finetuned optical flow estimation network are warped to compute a feature map synthesis loss, which serves as another supervisory signal for depth learning. Experimental results on the NYU Depth V2 dataset demonstrate the effectiveness of the framework and of our proposed losses. To evaluate the generalization ability of $\mathrm{F^2Depth}$, we collect a Campus Indoor depth dataset composed of approximately 1500 points selected from 99 images in 18 scenes. Zero-shot generalization experiments on the 7-Scenes dataset and Campus Indoor achieve $\delta_1$ accuracies of 75.8% and 76.0%, respectively. These results show that our model generalizes well to monocular images captured in unknown indoor scenes.
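The abstract names two supervisory signals: an optical flow consistency loss and a feature map synthesis loss. The PyTorch sketch below shows one common way such losses are formulated: a rigid flow is induced from the predicted depth and relative camera pose and compared against the flow network's output, and source feature maps are warped into the target view with that flow. The helper names (`rigid_flow`, `warp_features`), tensor shapes, and the plain L1 formulations are illustrative assumptions, not the paper's exact losses; per the abstract, the multi-scale features would come from the finetuned flow network.

```python
# Minimal sketch under assumptions; not the paper's exact formulation.
import torch
import torch.nn.functional as F

def rigid_flow(depth, pose, K, K_inv):
    """Flow induced by predicted depth and relative camera pose (target -> source).

    depth: (B, 1, H, W) predicted depth of the target frame.
    pose:  (B, 4, 4)  relative pose from target to source camera.
    K, K_inv: (B, 3, 3) camera intrinsics and their inverse.
    Returns a (B, 2, H, W) pixel displacement field.
    """
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=depth.device, dtype=depth.dtype),
        torch.arange(W, device=depth.device, dtype=depth.dtype),
        indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).reshape(1, 3, -1)
    cam = (K_inv @ pix) * depth.reshape(B, 1, -1)           # back-project to 3D
    cam = torch.cat([cam, torch.ones_like(cam[:, :1])], 1)  # homogeneous coords
    proj = K @ (pose @ cam)[:, :3]                          # move and re-project
    proj = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    return (proj - pix[:, :2]).reshape(B, 2, H, W)

def flow_consistency_loss(depth, pose, K, K_inv, flow_teacher):
    """L1 gap between depth-induced flow and the (frozen) flow network's output."""
    return (rigid_flow(depth, pose, K, K_inv) - flow_teacher).abs().mean()

def warp_features(feat_src, flow):
    """Warp a source feature map into the target view with a dense flow field."""
    B, _, H, W = feat_src.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=flow.device, dtype=flow.dtype),
        torch.arange(W, device=flow.device, dtype=flow.dtype),
        indexing="ij")
    grid = torch.stack([xs, ys], 0).unsqueeze(0) + flow     # sampling positions
    grid = torch.stack([2 * grid[:, 0] / (W - 1) - 1,       # normalize to [-1, 1]
                        2 * grid[:, 1] / (H - 1) - 1], -1)  # -> (B, H, W, 2)
    return F.grid_sample(feat_src, grid, align_corners=True)

def feature_synthesis_loss(feats_tgt, feats_src, flow):
    """L1 over multi-scale feature maps, resizing (and rescaling) the flow per scale."""
    loss = 0.0
    for f_t, f_s in zip(feats_tgt, feats_src):
        h, w = f_t.shape[-2:]
        scale = torch.tensor([w / flow.shape[-1], h / flow.shape[-2]],
                             device=flow.device, dtype=flow.dtype).view(1, 2, 1, 1)
        flow_s = F.interpolate(flow, size=(h, w), mode="bilinear",
                               align_corners=True) * scale
        loss = loss + (warp_features(f_s, flow_s) - f_t).abs().mean()
    return loss / len(feats_tgt)
```

In such a setup the flow network would be finetuned first (the abstract's patch-based photometric loss) and then frozen, so gradients from both losses flow only into the depth and pose networks.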

URL

https://arxiv.org/abs/2403.18443

PDF

https://arxiv.org/pdf/2403.18443.pdf

