Leveraging the Third Dimension in Contrastive Learning

2023-01-27 15:45:03
Sumukh Aithal, Anirudh Goyal, Alex Lamb, Yoshua Bengio, Michael Mozer

Abstract

Self-supervised learning (SSL) methods operate on unlabeled data to learn robust representations useful for downstream tasks. Most SSL methods rely on augmentations obtained by transforming the 2D image pixel map. These augmentations ignore the fact that biological vision takes place in an immersive, three-dimensional, temporally contiguous environment, and that low-level biological vision relies heavily on depth cues. Using a signal provided by a pretrained state-of-the-art monocular RGB-to-depth model (the Depth Prediction Transformer, Ranftl et al., 2021), we explore two distinct approaches to incorporating depth signals into the SSL framework. First, we evaluate contrastive learning using an RGB+depth input representation. Second, we use the depth signal to generate novel views from slightly different camera positions, thereby producing a 3D augmentation for contrastive learning. We evaluate these two approaches on three different SSL methods (BYOL, SimSiam, and SwAV) using the ImageNette (a 10-class subset of ImageNet), ImageNet-100, and ImageNet-1k datasets. We find that both approaches to incorporating depth signals improve the robustness and generalization of the baseline SSL methods, though the first approach (depth-channel concatenation) is superior. For instance, BYOL with the additional depth channel increases downstream classification accuracy from 85.3% to 88.0% on ImageNette and from 84.1% to 87.0% on ImageNet-C.
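
The first approach, an RGB+depth input representation, can be illustrated with a minimal sketch. This is not the authors' code: the function name `add_depth_channel`, the min-max normalization of the depth map, and the random arrays standing in for an image and a predicted depth map are all assumptions made for illustration.

```python
import numpy as np


def add_depth_channel(rgb, depth):
    """Stack a per-pixel depth map onto an RGB image as a fourth channel.

    rgb:   float array of shape (H, W, 3)
    depth: float array of shape (H, W), e.g. the output of a monocular
           depth model (here it is just a stand-in array)
    """
    if rgb.shape[:2] != depth.shape:
        raise ValueError("rgb and depth must share the same height and width")

    # Rescale depth to [0, 1] so its range is comparable to normalized RGB.
    # This normalization is an assumption of the sketch, not from the paper.
    d_min, d_max = depth.min(), depth.max()
    scale = d_max - d_min
    depth_norm = (depth - d_min) / scale if scale > 0 else np.zeros_like(depth)

    # Append depth as the last channel: (H, W, 3) + (H, W, 1) -> (H, W, 4).
    return np.concatenate([rgb, depth_norm[..., None]], axis=-1)


# Hypothetical inputs: random data in place of a real image and depth map.
rng = np.random.default_rng(0)
rgb = rng.random((4, 5, 3))
depth = rng.random((4, 5)) * 10.0

rgbd = add_depth_channel(rgb, depth)
print(rgbd.shape)  # (4, 5, 4)
```

An encoder consuming such an input would need a first layer that accepts four channels rather than three; how the paper handles that detail is not stated in the abstract.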

URL

https://arxiv.org/abs/2301.11790

PDF

https://arxiv.org/pdf/2301.11790.pdf

