
Mining Supervision for Dynamic Regions in Self-Supervised Monocular Depth Estimation

2024-04-23 10:51:15
Hoang Chuong Nguyen, Tianyu Wang, Jose M. Alvarez, Miaomiao Liu

Abstract

This paper focuses on self-supervised monocular depth estimation in dynamic scenes, trained on monocular videos. Existing methods jointly estimate pixel-wise depth and motion, relying mainly on an image reconstruction loss. Dynamic regions remain a critical challenge for these methods due to the inherent ambiguity in depth and motion estimation, resulting in inaccurate depth estimation. This paper proposes a self-supervised training framework that exploits pseudo depth labels for dynamic regions in the training data. The key contribution of our framework is to decouple depth estimation for static and dynamic regions of images in the training data. We start with an unsupervised depth estimation approach, which provides reliable depth estimates for static regions and motion cues for dynamic regions, and allows us to extract moving object information at the instance level. In the next stage, we use an object network to estimate the depth of those moving objects, assuming rigid motions. Then, we propose a new scale alignment module to resolve the scale ambiguity between the estimated depths for static and dynamic regions. The generated depth labels can then be used to train an end-to-end depth estimation network and improve its performance. Extensive experiments on the Cityscapes and KITTI datasets show that our self-training strategy consistently outperforms existing self/unsupervised depth estimation methods.
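Two of the steps described above lend themselves to a short illustration: aligning the scale of a per-object depth estimate to the static-scene depth, and supervising dynamic regions with the resulting pseudo depth labels. The sketch below is a minimal, hypothetical illustration, not the authors' code: the median-ratio scale fit and the masked L1 pseudo-label loss are common heuristics assumed here for concreteness, the paper's actual scale alignment module may work differently, and all function names are invented.

```python
# Minimal illustrative sketch (not the authors' implementation).
# Assumed heuristics: a median-ratio scale fit on pixels where the static
# and object depth estimates should agree, and a masked L1 pseudo-depth loss.
import torch


def align_object_scale(object_depth: torch.Tensor,
                       static_depth: torch.Tensor,
                       overlap_mask: torch.Tensor) -> torch.Tensor:
    """Rescale a moving object's depth to match the static-scene depth.

    Monocular depths are only defined up to scale, so a single scalar is
    fitted on pixels (boolean `overlap_mask`) where the two estimates should
    agree, e.g. a band around the object's boundary assumed to lie on the
    static background.
    """
    ratios = static_depth[overlap_mask] / object_depth[overlap_mask].clamp(min=1e-6)
    return object_depth * ratios.median()  # median is robust to mask-edge outliers


def pseudo_depth_loss(pred_depth: torch.Tensor,
                      pseudo_depth: torch.Tensor,
                      dynamic_mask: torch.Tensor) -> torch.Tensor:
    """L1 supervision restricted to dynamic regions; static regions are
    assumed to keep the usual image reconstruction (photometric) loss."""
    mask = dynamic_mask.float()
    diff = (pred_depth - pseudo_depth).abs() * mask
    return diff.sum() / mask.sum().clamp(min=1.0)
```

Median-based scale fitting of this kind is also the standard protocol for comparing monocular depth predictions against ground truth, which makes it a natural first guess for how a scale alignment step could behave.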


URL

https://arxiv.org/abs/2404.14908

PDF

https://arxiv.org/pdf/2404.14908.pdf

