SCHEME: Scalable Channel Mixer for Vision Transformers

2023-12-01 08:22:34
Deepak Sridhar, Yunsheng Li, Nuno Vasconcelos


Vision Transformers have received significant attention due to their impressive performance in many vision tasks. While the token mixer or attention block has been studied in great detail, the channel mixer or feature mixing block (FFN or MLP) has not been explored in depth, even though it accounts for the bulk of the parameters and computation in a model. In this work, we study whether sparse feature mixing can replace the dense connections and confirm this with a block diagonal MLP structure that improves the accuracy by supporting larger expansion ratios. To improve the feature clusters formed by this structure and thereby further improve the accuracy, a lightweight, parameter-free, channel covariance attention (CCA) mechanism is introduced as a parallel branch during training. This design of CCA enables gradual feature mixing across channel groups during training, with a contribution that decays to zero as training progresses to convergence. This allows the CCA block to be discarded during inference, thus enabling enhanced performance with no additional computational cost. The resulting $\textit{Scalable CHannEl MixEr}$ (SCHEME) can be plugged into any ViT architecture to obtain a gamut of models with different trade-offs between complexity and performance by controlling the block diagonal structure size in the MLP. This is shown by the introduction of a new family of SCHEMEformer models. Experiments on image classification, object detection, and semantic segmentation, with different ViT backbones, consistently demonstrate substantial accuracy gains over existing designs, especially in lower-FLOP regimes. For example, the SCHEMEformer establishes a new SOTA of 79.7% accuracy for ViTs using pure attention mixers on ImageNet-1K at 1.77G FLOPs.
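
As a rough illustration of why a block diagonal MLP can support larger expansion ratios, the Python sketch below compares weight counts for a dense two-layer MLP and a block diagonal one, and applies a block diagonal linear map to a single feature vector. The function names, the contiguous split of channels into groups, the use of block diagonal weights in both layers, and the omission of biases are assumptions made for illustration; they are not taken from the paper.

```python
def dense_mlp_params(dim, expansion):
    """Weight count of a dense MLP: dim -> dim * expansion -> dim (biases ignored)."""
    hidden = dim * expansion
    return dim * hidden + hidden * dim


def block_diag_mlp_params(dim, expansion, groups):
    """Weight count when both layers are block diagonal with `groups` blocks.

    Assumption: channels are split into equal contiguous groups, and each
    group is mixed only with itself.
    """
    if dim % groups != 0:
        raise ValueError("dim must be divisible by groups")
    dim_per_group = dim // groups
    hidden_per_group = dim_per_group * expansion
    per_group = dim_per_group * hidden_per_group + hidden_per_group * dim_per_group
    return groups * per_group


def block_diag_linear(x, blocks):
    """Apply a block diagonal linear map to the vector x.

    blocks[g] is an (out_g x in_g) matrix stored as a list of rows.
    x is consumed in contiguous chunks, one chunk per block, so no
    feature in one group influences the output of another group.
    """
    out = []
    start = 0
    for weight in blocks:
        in_g = len(weight[0])
        chunk = x[start:start + in_g]
        for row in weight:
            out.append(sum(w * v for w, v in zip(row, chunk)))
        start += in_g
    if start != len(x):
        raise ValueError("block input sizes must sum to len(x)")
    return out


if __name__ == "__main__":
    dim, groups = 64, 4
    print("dense, expansion 4:     ", dense_mlp_params(dim, 4))
    print("block diag, expansion 4:", block_diag_mlp_params(dim, 4, groups))
    print("block diag, expansion 8:", block_diag_mlp_params(dim, 8, groups))
```

Under these assumptions, with 64 channels and 4 groups, the block diagonal MLP at expansion ratio 8 uses 16384 weights, half the 32768 of the dense MLP at expansion ratio 4. This is the kind of budget that lets the expansion ratio grow without increasing cost.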



