Abstract
We introduce Correlational Image Modeling (CIM), a novel and surprisingly effective approach to self-supervised visual pre-training. Our CIM performs a simple pretext task: we randomly crop image regions (exemplars) from an input image (context) and predict correlation maps between the exemplars and the context. Three key designs enable correlational image modeling as a nontrivial and meaningful self-supervisory task. First, to generate useful exemplar-context pairs, we consider cropping image regions with various scales, shapes, rotations, and transformations. Second, we employ a bootstrap learning framework that involves online and target encoders. During pre-training, the former takes exemplars as inputs while the latter encodes the context. Third, we model the output correlation maps via a simple cross-attention block, within which the context serves as queries and the exemplars offer keys and values. We show that CIM performs on par with or better than the current state of the art on self-supervised and transfer benchmarks.
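The cross-attention step described above can be sketched as follows. This is a minimal single-head sketch in NumPy, not the authors' implementation: context tokens act as queries, exemplar tokens supply keys and values, and the attention weights form a correlation map between context and exemplar. All names, shapes, and the single-head simplification are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_correlation(context_tokens, exemplar_tokens, w_q, w_k, w_v):
    """Single-head cross-attention sketch (hypothetical, not the CIM code).

    Context tokens serve as queries; exemplar tokens provide keys and values.
    Returns the attention weights (a context-to-exemplar correlation map)
    and the attended output.
    """
    q = context_tokens @ w_q            # (n_ctx, d) queries from the context
    k = exemplar_tokens @ w_k           # (n_ex, d) keys from the exemplar
    v = exemplar_tokens @ w_v           # (n_ex, d) values from the exemplar
    scale = np.sqrt(q.shape[-1])
    attn = softmax(q @ k.T / scale, axis=-1)  # (n_ctx, n_ex) correlation map
    return attn, attn @ v

# Toy shapes: a 14x14 grid of context tokens, a 4x4 grid of exemplar tokens.
rng = np.random.default_rng(0)
d = 8
ctx = rng.standard_normal((196, d))
ex = rng.standard_normal((16, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
attn, out = cross_attention_correlation(ctx, ex, w_q, w_k, w_v)
```

Each row of `attn` sums to 1 and scores how strongly one context token correlates with every exemplar token, which is the sense in which the block produces a correlation map.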
URL
https://arxiv.org/abs/2303.12670