Abstract
This work presents a systematic investigation into modernizing Vision Transformer backbones by leveraging architectural advancements from the past five years. While preserving the canonical Attention-FFN structure, we conduct a component-wise refinement involving normalization, activation functions, positional encoding, gating mechanisms, and learnable tokens. These updates form a new generation of Vision Transformers, which we call ViT-5. Extensive experiments demonstrate that ViT-5 consistently outperforms state-of-the-art plain Vision Transformers across both understanding and generation benchmarks. On ImageNet-1k classification, ViT-5-Base reaches 84.2% top-1 accuracy under comparable compute, exceeding DeiT-III-Base at 83.8%. ViT-5 also serves as a stronger backbone for generative modeling: when plugged into an SiT diffusion framework, it achieves 1.84 FID versus 2.06 with a vanilla ViT backbone. Beyond headline metrics, ViT-5 exhibits improved representation learning and favorable spatial reasoning behavior, and transfers reliably across tasks. With a design aligned with contemporary foundation-model practices, ViT-5 offers a simple drop-in upgrade over vanilla ViT for mid-2020s vision backbones.
URL
https://arxiv.org/abs/2602.08071