Abstract
The vision transformer (ViT) has advanced to the state of the art in visual recognition tasks. Recent studies claim that transformers are more robust than CNNs, attributing this robustness to ViT's self-attention mechanism. However, we find that these conclusions rest on unfair experimental settings and comparisons among only a few models, which cannot capture the full picture of robustness performance. In this study, we evaluate 58 state-of-the-art computer vision models in a unified training setup, covering not only attention-based and convolution-based networks but also networks that combine convolution and attention, sequence-based models, complementary search, and network-based methods. Our study shows that robustness depends on the training setup and the model type, and that performance varies with the type of out-of-distribution shift. Our work will help the community better understand and benchmark the robustness of computer vision models.
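To make the kind of unified evaluation the abstract describes concrete, here is a minimal sketch of a benchmarking loop that compares clean and out-of-distribution accuracy across architecture families. The paper's own tooling is not given here; the `timm` model zoo, the three model names, and the ImageNet/ImageNet-C-style directory paths are illustrative assumptions, not the authors' code.

```python
# Minimal sketch: evaluate several architectures in one unified pipeline
# and compare clean vs. out-of-distribution (OOD) accuracy.
# Assumptions (not from the paper): timm pretrained weights, ImageFolder-style
# datasets on disk, and the three example model names below.
import torch
import timm
from timm.data import resolve_data_config, create_transform
from torch.utils.data import DataLoader
from torchvision import datasets

# Stand-ins for the 58 models studied in the paper.
MODEL_NAMES = ["resnet50", "vit_base_patch16_224", "convnext_base"]

def evaluate(model_name: str, data_dir: str, device: str = "cuda") -> float:
    """Top-1 accuracy of a pretrained model on an ImageFolder dataset."""
    model = timm.create_model(model_name, pretrained=True).to(device).eval()
    # Use each model's own preprocessing so the comparison stays fair.
    config = resolve_data_config({}, model=model)
    transform = create_transform(**config)
    loader = DataLoader(datasets.ImageFolder(data_dir, transform), batch_size=64)
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images.to(device)).argmax(dim=1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

# Hypothetical dataset paths: clean validation set and one corruption/severity.
for name in MODEL_NAMES:
    clean = evaluate(name, "imagenet/val")
    ood = evaluate(name, "imagenet-c/gaussian_noise/3")
    print(f"{name}: clean={clean:.3f} ood={ood:.3f} drop={clean - ood:.3f}")
```

Resolving each model's own data configuration matters here: feeding every network the same preprocessing is one of the unfair experimental conditions the abstract criticizes.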
URL
https://arxiv.org/abs/2301.10750