Abstract
In this paper, we seek to understand self-supervised pretraining by studying the capability of self-supervised representation pretraining methods to learn part-aware representations. The study is mainly motivated by the observation that the random views used in contrastive learning and the random masked (visible) patches used in masked image modeling often correspond to object parts. We explain that contrastive learning is a part-to-whole task: the projection layer hallucinates the whole-object representation from the object-part representation learned by the encoder. Masked image modeling, in contrast, is a part-to-part task: the masked patches of the object are hallucinated from the visible patches. This explanation suggests that the self-supervised pretrained encoder must understand object parts. We empirically compare off-the-shelf encoders pretrained with several representative methods on object-level recognition and part-level recognition. The results show that the fully-supervised model outperforms the self-supervised models on object-level recognition, while most self-supervised contrastive learning and masked image modeling methods outperform the fully-supervised method on part-level recognition. We also observe that combining contrastive learning and masked image modeling further improves performance.
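To make the part-to-whole vs. part-to-part framing concrete, below is a minimal, hypothetical PyTorch sketch of the two pretext tasks as the abstract describes them. All module names, shapes, and the toy losses are illustrative assumptions for exposition, not the paper's implementation or any pretraining library's API.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Toy encoder and projection head (stand-ins for a real backbone).
    encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
    projector = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 64))

    def contrastive_part_to_whole(crop_a, crop_b):
        # Part-to-whole: each random crop is an object part; the projection
        # head maps both part representations toward a shared whole-object
        # representation, matched here with an InfoNCE-style loss.
        z_a = F.normalize(projector(encoder(crop_a)), dim=-1)
        z_b = F.normalize(projector(encoder(crop_b)), dim=-1)
        logits = z_a @ z_b.t() / 0.1             # temperature = 0.1
        targets = torch.arange(z_a.size(0))      # matching pairs on the diagonal
        return F.cross_entropy(logits, targets)

    patch_dim = 16 * 16 * 3                      # flattened 16x16 RGB patch
    patch_encoder = nn.Linear(patch_dim, 128)
    decoder = nn.Linear(128, patch_dim)

    def mim_part_to_part(patches, masked):
        # Part-to-part: the visible patches (one part) are encoded, and the
        # decoder hallucinates the masked patches (another part of the object).
        context = patch_encoder(patches[~masked]).mean(dim=0)  # pooled visible context
        pred = decoder(context).expand(int(masked.sum()), -1)
        return F.mse_loss(pred, patches[masked])

For example, contrastive_part_to_whole(torch.randn(8, 3, 32, 32), torch.randn(8, 3, 32, 32)) and mim_part_to_part(torch.randn(196, 768), torch.rand(196) > 0.25) both return scalar losses. The point of the sketch is only that the encoder sees parts in both cases; what differs is whether the target is the whole object (contrastive) or another part (masked image modeling).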
URL
https://arxiv.org/abs/2301.11915