Leveraging the Third Dimension in Contrastive Learning

Abstract

Self-Supervised Learning (SSL) methods operate on unlabeled data to learn robust representations useful for downstream tasks. Most SSL methods rely on augmentations obtained by transforming the 2D image pixel map. These augmentations ignore the fact that biological vision takes place in an immersive three-dimensional, temporally contiguous environment, and that low-level biological vision relies heavily on depth cues. Using a signal provided by a pretrained state-of-the-art monocular RGB-to-depth model (the Depth Prediction Transformer, Ranftl et al., 2021), we explore two distinct approaches to incorporating depth signals into the SSL framework. First, we evaluate contrastive learning using an RGB+depth input representation. Second, we use the depth signal to generate novel views from slightly different camera positions, thereby producing a 3D augmentation for contrastive learning. We evaluate these two approaches on three different SSL methods (BYOL, SimSiam, and SwAV) using the ImageNette (a 10-class subset of ImageNet), ImageNet-100, and ImageNet-1k datasets. We find that both approaches to incorporating depth signals improve the robustness and generalization of the baseline SSL methods, though the first approach (with depth-channel concatenation) is superior. For instance, BYOL with the additional depth channel leads to an increase in downstream classification accuracy from 85.3% to 88.0% on ImageNette and from 84.1% to 87.0% on ImageNet-C.
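The first approach, an RGB+depth input representation, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `concat_depth` and `random_crop_flip` are hypothetical, the depth map stands in for whatever a monocular depth model would predict, and the min-max normalisation of depth is an assumption made only so that the fourth channel sits on the same [0, 1] scale as the RGB channels.

```python
import numpy as np


def concat_depth(rgb, depth):
    """Stack an (H, W) depth map onto an (H, W, 3) RGB image, giving (H, W, 4).

    Assumption: depth is min-max normalised to [0, 1] before stacking.
    """
    if rgb.shape[:2] != depth.shape:
        raise ValueError("rgb and depth must share height and width")
    d_min, d_max = depth.min(), depth.max()
    # Guard against a constant depth map (zero range).
    depth_norm = (depth - d_min) / max(d_max - d_min, 1e-8)
    rgbd = np.concatenate([rgb, depth_norm[..., None]], axis=-1)
    return rgbd.astype(np.float32)


def random_crop_flip(rgbd, size, rng):
    """Apply one random crop and horizontal flip to all four channels at once.

    Because depth is already a channel of the array, the geometric
    transform stays aligned between the colour and depth signals.
    """
    h, w, _ = rgbd.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    view = rgbd[top:top + size, left:left + size]
    if rng.random() < 0.5:
        view = view[:, ::-1]
    return view


rng = np.random.default_rng(0)
rgb = rng.random((32, 32, 3))   # placeholder image with values in [0, 1]
depth = rng.random((32, 32))    # placeholder for a predicted depth map
rgbd = concat_depth(rgb, depth)

# Two independently augmented views of the same RGB+depth input, as a
# contrastive method such as BYOL, SimSiam, or SwAV would consume them.
view_a = random_crop_flip(rgbd, 24, rng)
view_b = random_crop_flip(rgbd, 24, rng)
```

An encoder trained on such inputs would need a first layer that accepts four input channels instead of three; how that layer is initialised is a design choice this sketch leaves open.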

Abstract (translated)

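Self-supervised learning (SSL) methods operate on unlabeled data to learn robust representations that are useful for downstream tasks. Most SSL methods rely on augmentations produced by transforming the 2D image pixel map. These augmentations overlook the fact that biological vision takes place in an immersive, three-dimensional, temporally continuous environment, and that low-level biological vision depends heavily on depth cues. Using the signal from a pretrained monocular RGB-to-depth model (the Depth Prediction Transformer, Ranftl et al., 2021), we explore two different ways of incorporating depth signals into the SSL framework. First, we evaluate contrastive learning with an RGB+depth input representation. Second, we use the depth signal to generate novel views from slightly different camera positions, which yields a 3D augmentation for contrastive learning. We evaluate both approaches on three SSL methods (BYOL, SimSiam, and SwAV) using the ImageNette (a 10-class subset of ImageNet), ImageNet-100, and ImageNet-1k datasets. We find that both approaches improve the robustness and generalization of the baseline SSL methods, though the first approach (depth-channel concatenation) is superior. For example, adding the depth channel to BYOL raises downstream classification accuracy from 85.3% to 88.0% on ImageNette and from 84.1% to 87.0% on ImageNet-C.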

URL

https://arxiv.org/abs/2301.11790

PDF

https://arxiv.org/pdf/2301.11790.pdf