Abstract
Vision Transformers have received significant attention due to their impressive performance on many vision tasks. While the token mixer, or attention block, has been studied in great detail, the channel mixer, or feature mixing block (FFN or MLP), has not been explored in depth, even though it accounts for the bulk of the parameters and computation in a model. In this work, we study whether sparse feature mixing can replace dense connections and confirm this with a block diagonal MLP structure that improves accuracy by supporting larger expansion ratios. To improve the feature clusters formed by this structure, and thereby further improve accuracy, a lightweight, parameter-free, channel covariance attention (CCA) mechanism is introduced as a parallel branch during training. This design of CCA enables gradual feature mixing across channel groups during training, with a contribution that decays to zero as training converges. This allows the CCA block to be discarded during inference, enabling enhanced performance at no additional computational cost. The resulting $\textit{Scalable CHannEl MixEr}$ (SCHEME) can be plugged into any ViT architecture to obtain a gamut of models with different trade-offs between complexity and performance, by controlling the block diagonal structure size in the MLP. This is demonstrated with a new family of SCHEMEformer models. Experiments on image classification, object detection, and semantic segmentation, with different ViT backbones, consistently demonstrate substantial accuracy gains over existing designs, especially in low-FLOPs regimes. For example, the SCHEMEformer establishes a new SOTA of 79.7% accuracy for ViTs using pure attention mixers on ImageNet-1K at 1.77G FLOPs.
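The two ideas in the abstract can be sketched numerically. The snippet below is a minimal, illustrative NumPy sketch, not the authors' implementation: a block diagonal MLP splits the $d$ channels into $g$ groups with independent per-group weights (cutting MLP parameters by a factor of $g$, which frees budget for a larger expansion ratio $e$), and a parameter-free channel covariance attention branch mixes channels via a softmax over the channel covariance matrix. All function and variable names here are assumptions for illustration.

```python
import numpy as np

def block_diag_mlp(x, w1_blocks, w2_blocks):
    """Block diagonal channel mixer (sketch).
    x: (n_tokens, d); w1_blocks/w2_blocks: g per-group weight matrices."""
    g = len(w1_blocks)
    groups = np.split(x, g, axis=-1)                     # split channels into g groups
    hidden = [np.maximum(xg @ w1, 0.0)                   # per-group expansion + ReLU
              for xg, w1 in zip(groups, w1_blocks)]
    out = [h @ w2 for h, w2 in zip(hidden, w2_blocks)]   # per-group projection
    return np.concatenate(out, axis=-1)                  # back to (n_tokens, d)

def cca_branch(x):
    """Parameter-free channel covariance attention (sketch of the
    training-only parallel branch): channels attend to each other
    through a softmax over their covariance matrix."""
    xc = x - x.mean(axis=0, keepdims=True)
    cov = xc.T @ xc / x.shape[0]                         # (d, d) channel covariance
    z = cov - cov.max(axis=-1, keepdims=True)            # numerically stable softmax
    attn = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return x @ attn.T                                    # channel-mixed features

# Illustrative sizes: d channels, expansion e, g diagonal blocks, n tokens.
d, e, g, n = 64, 4, 4, 16
rng = np.random.default_rng(0)
w1 = [rng.standard_normal((d // g, e * d // g)) * 0.1 for _ in range(g)]
w2 = [rng.standard_normal((e * d // g, d // g)) * 0.1 for _ in range(g)]
x = rng.standard_normal((n, d))

y = block_diag_mlp(x, w1, w2) + cca_branch(x)            # parallel branches (training)
y_inference = block_diag_mlp(x, w1, w2)                  # CCA branch discarded

# Parameters: dense MLP has 2*e*d^2 weights; the block diagonal version has
# 2*g*(d/g)*(e*d/g) = 2*e*d^2/g, a g-fold reduction.
params_dense = 2 * e * d * d
params_block = 2 * g * (d // g) * (e * d // g)
```

At equal parameter budget, the $g$-fold saving is what lets SCHEME use a larger expansion ratio than a dense FFN of the same cost.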
URL
https://arxiv.org/abs/2312.00412