Seeing the Whole in the Parts in Self-Supervised Representation Learning

2025-01-06 09:08:59
Arthur Aubret, Céline Teulière, Jochen Triesch

Abstract

Recent successes in self-supervised learning (SSL) model spatial co-occurrences of visual features either by masking portions of an image or by aggressively cropping it. Here, we propose a new way to model spatial co-occurrences by aligning local representations (before pooling) with a global image representation. We present CO-SSL, a family of instance discrimination methods, and show that it outperforms previous methods on several datasets, including ImageNet-1K, where it achieves 71.5% Top-1 accuracy with 100 pre-training epochs. CO-SSL is also more robust to noise corruption, internal corruption, small adversarial attacks, and large training crop sizes. Our analysis further indicates that CO-SSL learns highly redundant local representations, which offers an explanation for its robustness. Overall, our work suggests that aligning local and global representations may be a powerful principle of unsupervised category learning.
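
To make the core idea concrete, here is a minimal sketch of aligning pre-pooling local vectors from one view with the pooled global vector of another view of the same image. It is an illustration only, not the CO-SSL objective: the abstract does not give the loss, so the mean pooling, the cosine-based alignment loss, the toy feature values, and all function names (mean_pool, cosine, local_global_alignment_loss) are assumptions made for this example.

```python
import math

def mean_pool(local_feats):
    """Average a list of D-dimensional local vectors into one global vector."""
    n = len(local_feats)
    dim = len(local_feats[0])
    return [sum(v[d] for v in local_feats) / n for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors (small epsilon avoids division by zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)

def local_global_alignment_loss(local_feats, global_feat):
    """Hypothetical loss: 1 minus the mean cosine similarity between
    every local vector and the global vector."""
    sims = [cosine(v, global_feat) for v in local_feats]
    return 1.0 - sum(sims) / len(sims)

# Toy pre-pooling feature map for view A: 4 spatial positions, 3 channels (made-up values).
local_a = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [1.0, 0.2, 0.1], [0.8, 0.0, 0.1]]
# Toy feature map for view B of the same image, pooled into a global representation.
local_b = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]]
global_b = mean_pool(local_b)

loss = local_global_alignment_loss(local_a, global_b)
print(f"alignment loss: {loss:.4f}")
```

The loss is close to 0 when every local vector points in the same direction as the global vector and approaches 1 when they are orthogonal. An instance discrimination method would additionally contrast against representations of other images, which this sketch omits.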

URL

https://arxiv.org/abs/2501.02860

PDF

https://arxiv.org/pdf/2501.02860.pdf

