Abstract
This paper focuses on self-supervised monocular depth estimation in dynamic scenes, trained on monocular videos. Existing methods jointly estimate pixel-wise depth and motion, relying mainly on an image reconstruction loss. Dynamic regions remain a critical challenge for these methods due to the inherent ambiguity between depth and motion, which leads to inaccurate depth estimates. This paper proposes a self-supervised training framework that exploits pseudo depth labels for dynamic regions in the training data. The key contribution of our framework is to decouple depth estimation for the static and dynamic regions of training images. We start with an unsupervised depth estimation approach, which provides reliable depth estimates for static regions and motion cues for dynamic regions, and allows us to extract moving-object information at the instance level. In the next stage, we use an object network to estimate the depth of those moving objects, assuming rigid motion. We then propose a new scale alignment module to resolve the scale ambiguity between the estimated depths of static and dynamic regions. The generated depth labels can then be used to train an end-to-end depth estimation network and improve its performance. Extensive experiments on the Cityscapes and KITTI datasets show that our self-training strategy consistently outperforms existing self-/unsupervised depth estimation methods.
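The scale alignment idea can be illustrated with a generic median-ratio scheme: rescale an object's depth so its statistics agree with the static-region depth over a shared reference region. This is only a minimal sketch under that assumption; the paper's actual scale alignment module is not specified here, and the function name `median_scale_align` is hypothetical.

```python
import numpy as np

def median_scale_align(static_depth, object_depth, mask):
    """Rescale `object_depth` so that its median inside `mask`
    matches the median of `static_depth` inside the same region.

    A generic median-ratio alignment sketch (not the paper's exact
    module); `mask` is a boolean array marking the reference pixels.
    """
    scale = np.median(static_depth[mask]) / np.median(object_depth[mask])
    return scale * object_depth

# Toy example: the object depth is off by a global scale factor of 4.
static = np.full((4, 4), 2.0)   # reliable static-region depth
obj = np.full((4, 4), 8.0)      # object depth with wrong scale
mask = np.ones((4, 4), dtype=bool)
aligned = median_scale_align(static, obj, mask)  # scale = 0.25
```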
URL
https://arxiv.org/abs/2404.14908