Paper Reading AI Learner

ViT-5: Vision Transformers for The Mid-2020s

2026-02-08 18:03:44
Feng Wang, Sucheng Ren, Tiezheng Zhang, Predrag Neskovic, Anand Bhattad, Cihang Xie, Alan Yuille

Abstract

This work presents a systematic investigation into modernizing Vision Transformer backbones by leveraging architectural advancements from the past five years. While preserving the canonical Attention-FFN structure, we conduct a component-wise refinement involving normalization, activation functions, positional encoding, gating mechanisms, and learnable tokens. These updates form a new generation of Vision Transformers, which we call ViT-5. Extensive experiments demonstrate that ViT-5 consistently outperforms state-of-the-art plain Vision Transformers across both understanding and generation benchmarks. On ImageNet-1k classification, ViT-5-Base reaches 84.2% top-1 accuracy under comparable compute, exceeding DeiT-III-Base at 83.8%. ViT-5 also serves as a stronger backbone for generative modeling: when plugged into an SiT diffusion framework, it achieves 1.84 FID versus 2.06 with a vanilla ViT backbone. Beyond headline metrics, ViT-5 exhibits improved representation learning and favorable spatial reasoning behavior, and transfers reliably across tasks. With a design aligned with contemporary foundation-model practices, ViT-5 offers a simple drop-in upgrade over vanilla ViT for mid-2020s vision backbones.
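The abstract lists the refined components (normalization, activation, gating) without naming specific choices. As a rough illustration only, here is a minimal NumPy sketch of one transformer block using two upgrades common in recent foundation models, RMSNorm and a SwiGLU feed-forward network; these are assumptions for illustration, not necessarily the components ViT-5 adopts.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # RMSNorm: rescale by the root mean square of each token's features
    # (no mean subtraction, no bias), a common LayerNorm replacement.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: a SiLU-gated linear unit in place of a GELU MLP.
    silu = lambda z: z / (1.0 + np.exp(-z))
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

# Toy dimensions: 4 tokens of width 8, hidden width 16.
d, d_ff = 8, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((4, d))
gain = np.ones(d)
w_gate = rng.standard_normal((d, d_ff))
w_up = rng.standard_normal((d, d_ff))
w_down = rng.standard_normal((d_ff, d))

# Pre-norm residual sub-block: x + FFN(RMSNorm(x)).
out = x + swiglu_ffn(rms_norm(x, gain), w_gate, w_up, w_down)
print(out.shape)  # → (4, 8)
```

The attention sub-block (and any rotary positional encoding or register tokens) is omitted; the sketch only shows how the normalization and gated-FFN swaps slot into the unchanged Attention-FFN skeleton.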


URL

https://arxiv.org/abs/2602.08071

PDF

https://arxiv.org/pdf/2602.08071.pdf

