Abstract
Vision Transformers (ViTs) have become one of the go-to deep-network architectures for computer vision tasks. Despite drawing inspiration from Convolutional Neural Networks (CNNs), ViTs remain sensitive to small shifts in the input image. To address this, we introduce novel designs for each of the modules in ViTs, including tokenization, self-attention, patch merging, and positional encoding. With our proposed modules, we achieve truly shift-equivariant versions of four well-established models, namely Swin, SwinV2, MViTv2, and CvT, both in theory and in practice. Empirically, we evaluate these models on image classification and semantic segmentation, achieving competitive performance across three different datasets while maintaining 100% shift consistency.
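To make the shift-consistency metric mentioned above concrete, below is a minimal sketch (not the authors' evaluation code) of how shift consistency can be measured for a PyTorch image classifier: circularly shift a batch of images and check whether the predicted class stays the same. The `model` and `images` arguments, the shift range, and the use of circular (wrap-around) shifts are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a shift-consistency check, assuming a PyTorch classifier
# that maps an image batch of shape (N, C, H, W) to class logits of shape (N, K).
import torch

def shift_consistency(model, images, max_shift=8):
    """Fraction of images whose predicted class is unchanged under a random circular shift."""
    model.eval()
    with torch.no_grad():
        base_pred = model(images).argmax(dim=1)
        # Draw one random 2D circular shift and apply it to the whole batch.
        dx = int(torch.randint(1, max_shift + 1, (1,)))
        dy = int(torch.randint(1, max_shift + 1, (1,)))
        shifted = torch.roll(images, shifts=(dy, dx), dims=(2, 3))
        shifted_pred = model(shifted).argmax(dim=1)
    return (base_pred == shifted_pred).float().mean().item()
```

Under this definition, a perfectly shift-equivariant classifier would score 1.0 on every batch, which is what the abstract's 100% shift consistency refers to.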
URL
https://arxiv.org/abs/2305.16316