Correlational Image Modeling for Self-Supervised Visual Pre-Training

2023-03-22 15:48:23
Wei Li, Jiahao Xie, Chen Change Loy

Abstract

We introduce Correlational Image Modeling (CIM), a novel and surprisingly effective approach to self-supervised visual pre-training. Our CIM performs a simple pretext task: we randomly crop image regions (exemplars) from an input image (context) and predict correlation maps between the exemplars and the context. Three key designs enable correlational image modeling as a nontrivial and meaningful self-supervisory task. First, to generate useful exemplar-context pairs, we consider cropping image regions with various scales, shapes, rotations, and transformations. Second, we employ a bootstrap learning framework that involves online and target encoders. During pre-training, the former takes exemplars as inputs while the latter converts the context. Third, we model the output correlation maps via a simple cross-attention block, within which the context serves as queries and the exemplars offer values and keys. We show that CIM performs on par with or better than the current state of the art on self-supervised and transfer benchmarks.
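
The cross-attention step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes single-head attention, pre-computed token features for the context and the exemplar, and a hypothetical linear head (w_out) followed by a sigmoid that turns the attended features into one correlation score per context location. The names init_params, cross_attention, and correlation_map, and all shapes, are chosen here for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(rng, dim):
    # Hypothetical projection weights; a real model would learn these.
    scale = 1.0 / np.sqrt(dim)
    return {
        "w_q": rng.normal(scale=scale, size=(dim, dim)),
        "w_k": rng.normal(scale=scale, size=(dim, dim)),
        "w_v": rng.normal(scale=scale, size=(dim, dim)),
        "w_out": rng.normal(scale=scale, size=(dim, 1)),
    }


def cross_attention(context_tokens, exemplar_tokens, params):
    # As in the abstract: the context supplies the queries,
    # the exemplar supplies the keys and values.
    q = context_tokens @ params["w_q"]    # (N, dim)
    k = exemplar_tokens @ params["w_k"]   # (M, dim)
    v = exemplar_tokens @ params["w_v"]   # (M, dim)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (N, M)
    return attn @ v, attn                 # (N, dim), (N, M)


def correlation_map(context_tokens, exemplar_tokens, params, grid_hw):
    # Assumption: one sigmoid score per context location,
    # reshaped to the context's spatial token grid.
    fused, _ = cross_attention(context_tokens, exemplar_tokens, params)
    logits = fused @ params["w_out"]      # (N, 1)
    scores = 1.0 / (1.0 + np.exp(-logits))
    return scores.reshape(grid_hw)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 8
    context = rng.normal(size=(16, dim))   # 4 x 4 grid of context tokens
    exemplar = rng.normal(size=(4, dim))   # 2 x 2 grid of exemplar tokens
    params = init_params(rng, dim)
    cmap = correlation_map(context, exemplar, params, (4, 4))
    print(cmap.shape)
```

In this sketch the output has one score per context token, so it could be compared against a map marking where the exemplar was cropped from. The abstract does not state the training target or loss, so that pairing is left out here.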

URL

https://arxiv.org/abs/2303.12670

PDF

https://arxiv.org/pdf/2303.12670.pdf

