Paper Reading AI Learner

Observation, Analysis, and Solution: Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training

2024-04-18 14:14:44
Jin Gao, Shubo Lin, Shaoru Wang, Yutong Kou, Zeming Li, Liang Li, Congxuan Zhang, Xiaoqin Zhang, Yizheng Wang, Weiming Hu

Abstract

Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) in computer vision has enabled promising downstream performance on top of the learned self-supervised ViT features. In this paper, we question whether extremely simple, small-scale ViTs can also benefit from this pre-training paradigm in their fine-tuning performance, a setting that remains considerably less studied than the well-established methodology of designing lightweight architectures with sophisticated components. By carefully adapting various typical MIM pre-training methods to this lightweight regime and comparing them with contrastive learning (CL) pre-training on various downstream image classification and dense prediction tasks, we systematically observe different behaviors between MIM and CL with respect to the downstream fine-tuning data scales. Furthermore, we analyze the frozen features under linear probing evaluation, as well as the layer representation similarities and attention maps across the obtained models, which clearly show that MIM pre-training learns the higher layers poorly, leading to unsatisfactory fine-tuning performance on data-insufficient downstream tasks. This finding naturally guides the choice of appropriate distillation strategies during pre-training to remedy the above deterioration. Extensive experiments on various vision tasks demonstrate the effectiveness of our observation-analysis-solution flow. In particular, our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical designs (5.7M/6.5M parameters) achieves 79.4%/78.9% top-1 accuracy on ImageNet-1K. It also enables SOTA performance on ADE20K semantic segmentation (42.8% mIoU) and LaSOT visual tracking (66.1% AUC) in the lightweight regime; the latter even surpasses all current SOTA lightweight CPU-realtime trackers.
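To make the recipe in the abstract concrete, below is a minimal, hypothetical sketch of MIM pre-training with an added feature-distillation term for a lightweight ViT student. The TinyViT module, the SimMIM-style random masking, the pixel-reconstruction head, and the choice of distilling last-layer features from a frozen teacher are all illustrative assumptions for this page, not the authors' released implementation.

```python
# Hedged sketch: MIM pre-training + feature distillation for a lightweight ViT.
# All module names, the masking strategy, and the distillation target are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def patchify(imgs, patch=16):
    """(B, 3, H, W) -> (B, N, patch*patch*3) raw-pixel reconstruction targets."""
    B, C, H, W = imgs.shape
    h, w = H // patch, W // patch
    x = imgs.reshape(B, C, h, patch, w, patch)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(B, h * w, patch * patch * C)


class TinyViT(nn.Module):
    """A deliberately small plain ViT encoder standing in for the lightweight student."""

    def __init__(self, img_size=224, patch=16, dim=192, depth=12, heads=3):
        super().__init__()
        self.num_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def tokens(self, imgs):
        return self.patch_embed(imgs).flatten(2).transpose(1, 2)

    def forward(self, imgs):
        return self.blocks(self.tokens(imgs) + self.pos_embed)


def mim_distill_step(student, teacher, rec_head, dist_proj, mask_token,
                     imgs, mask_ratio=0.6, alpha=1.0, patch=16):
    """One pre-training step: masked-pixel reconstruction + feature distillation."""
    B, N = imgs.size(0), student.num_patches
    # SimMIM-style random patch mask (the exact masking strategy is an assumption)
    mask = torch.rand(B, N, device=imgs.device) < mask_ratio          # True = masked

    tok = student.tokens(imgs)
    tok = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, -1), tok)
    feats = student.blocks(tok + student.pos_embed)                   # (B, N, dim)

    # 1) reconstruct raw pixels of the masked patches only
    target = patchify(imgs, patch)
    rec_loss = F.l1_loss(rec_head(feats)[mask], target[mask])

    # 2) distill last-layer features from a frozen teacher (target layer is an assumption)
    with torch.no_grad():
        t_feats = teacher(imgs)                                       # (B, N, t_dim)
    dist_loss = F.smooth_l1_loss(dist_proj(feats), t_feats)

    return rec_loss + alpha * dist_loss


if __name__ == "__main__":
    student = TinyViT(dim=192, depth=12, heads=3)
    teacher = TinyViT(dim=384, depth=12, heads=6).eval()              # stand-in teacher
    rec_head = nn.Linear(192, 16 * 16 * 3)
    dist_proj = nn.Linear(192, 384)
    mask_token = nn.Parameter(torch.zeros(1, 1, 192))
    loss = mim_distill_step(student, teacher, rec_head, dist_proj, mask_token,
                            torch.randn(2, 3, 224, 224))
    loss.backward()
    print(float(loss))
```

In this sketch the distillation term simply pulls the student's last-layer tokens toward a frozen teacher; the abstract's analysis suggests the higher layers are where MIM falls short, but the exact target layers, teacher model, and loss weighting are hyperparameters one would have to choose.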

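The layer-wise analysis mentioned in the abstract compares representations across the pre-trained models. A common way to quantify such layer similarity is linear centered kernel alignment (CKA); the helper below is a generic sketch under that assumption, not a restatement of the paper's exact metric.

```python
# Hedged sketch: linear CKA for comparing layer representations of two models.
import torch


def linear_cka(X, Y):
    """Linear centered kernel alignment between activation matrices of shape (n, d)."""
    X = X - X.mean(dim=0, keepdim=True)   # center each feature dimension
    Y = Y - Y.mean(dim=0, keepdim=True)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = (Y.t() @ X).norm() ** 2
    return float(numerator / ((X.t() @ X).norm() * (Y.t() @ Y).norm()))


if __name__ == "__main__":
    x = torch.randn(512, 192)
    print(linear_cka(x, x))                        # identical layers -> 1.0
    print(linear_cka(x, torch.randn(512, 384)))    # unrelated layers -> near 0
```

Applied layer by layer to token features from, say, a MIM-pre-trained and a CL-pre-trained model on the same batch, this yields the kind of similarity map the abstract refers to; the feature dimensions of the two layers need not match.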

URL

https://arxiv.org/abs/2404.12210

PDF

https://arxiv.org/pdf/2404.12210.pdf

