
Adapting LLaMA Decoder to Vision Transformer

2024-04-10 06:30:08
Jiahao Wang, Wenqi Shao, Mengzhao Chen, Chengyue Wu, Yong Liu, Kaipeng Zhang, Songyang Zhang, Kai Chen, Ping Luo

Abstract

This work examines whether decoder-only Transformers such as LLaMA, which were originally designed for large language models (LLMs), can be adapted to the computer vision field. We first "LLaMAfy" a standard ViT step by step to align it with LLaMA's architecture, and find that directly applying a causal mask to the self-attention causes an attention collapse issue, resulting in failure of network training. We suggest repositioning the class token behind the image tokens with a post-sequence class token technique to overcome this challenge, enabling causal self-attention to efficiently capture the entire image's information. Additionally, we develop a soft mask strategy that gradually introduces a causal mask to the self-attention at the onset of training to ease optimization. The tailored model, dubbed image LLaMA (iLLaMA), is akin to LLaMA in architecture and enables direct supervised learning. Its causal self-attention boosts computational efficiency and learns complex representations by elevating attention map ranks. iLLaMA rivals the performance of its encoder-only counterparts, achieving 75.1% ImageNet top-1 accuracy with only 5.7M parameters. Scaling the model to ~310M parameters and pre-training on ImageNet-21K further raises the accuracy to 86.0%. Extensive experiments demonstrate iLLaMA's reliable properties: calibration, shape-texture bias, quantization compatibility, ADE20K segmentation, and CIFAR transfer learning. We hope our study can kindle fresh views on visual model design in the wave of LLMs. Pre-trained models and code are available here.
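The two mechanisms the abstract names, a post-sequence class token under causal self-attention and a soft mask that is tightened gradually early in training, can be illustrated with a small PyTorch sketch. This is an illustrative reconstruction based only on the abstract, not the authors' released implementation; the tensor sizes, the linear warm-up ramp, and the probability-space blending of the soft mask are assumptions.

```python
# Illustrative sketch only (assumed details, not the paper's released code):
# (1) causal self-attention with the class token appended AFTER the patch tokens,
#     so the final position can attend to the entire image;
# (2) a "soft" causal mask that is blended in gradually during early training.
import torch
import torch.nn.functional as F

def causal_self_attention(x: torch.Tensor) -> torch.Tensor:
    """Single-head causal self-attention over x of shape (batch, seq, dim)."""
    d = x.size(-1)
    scores = x @ x.transpose(-2, -1) / d ** 0.5                   # (B, N, N) logits
    n = scores.size(-1)
    future = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))            # token i only sees tokens <= i
    return F.softmax(scores, dim=-1) @ x

def soft_causal_weights(scores: torch.Tensor, alpha: float) -> torch.Tensor:
    """Blend bidirectional and causal attention; alpha ramps 0 -> 1 at the start of
    training (the linear ramp and probability-space blend are assumptions)."""
    n = scores.size(-1)
    causal = torch.tril(torch.ones(n, n, device=scores.device))   # 1 = visible under causality
    soft_mask = (1.0 - alpha) + alpha * causal                     # all ones at alpha = 0
    weights = F.softmax(scores, dim=-1) * soft_mask                # damp "future" positions
    return weights / weights.sum(dim=-1, keepdim=True)             # renormalize each row

# Post-sequence class token: append it behind the patch tokens instead of prepending it.
B, N, D = 2, 196, 64                                               # illustrative sizes
patch_tokens = torch.randn(B, N, D)
cls_token = torch.zeros(B, 1, D)                                   # learnable in a real model
x = torch.cat([patch_tokens, cls_token], dim=1)                    # class token sits LAST

cls_out = causal_self_attention(x)[:, -1]                          # read out the class position
print(cls_out.shape)                                               # torch.Size([2, 64])

# Soft mask example: 25% of the way through an assumed 1000-step warm-up.
alpha = min(250 / 1000, 1.0)
w = soft_causal_weights(torch.randn(B, N + 1, N + 1), alpha)
print(w.sum(dim=-1)[0, :3])                                        # rows sum to 1
```

Placing the class token last is what lets causal attention cover the whole image: a prepended class token under a causal mask could attend only to itself.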

Abstract (translated)

This paper explores whether decoder-only Transformers such as LLaMA, originally designed for large language models (LLMs), can be adapted to the computer vision field. We first "LLaMAfy" a standard ViT step by step to align it with LLaMA's architecture, and find that directly applying a causal mask to the self-attention causes an attention collapse issue that makes network training fail. We propose a post-sequence class token technique that places the class token behind the image tokens to overcome this challenge, enabling causal self-attention to efficiently capture the whole image's information. In addition, we develop a soft mask strategy that gradually introduces the causal mask to the self-attention at the start of training to ease optimization. The tailored model, called image LLaMA (iLLaMA), is similar to LLaMA in architecture and supports direct supervised learning. Its causal self-attention improves computational efficiency and learns complex representations by raising the rank of the attention maps. iLLaMA matches the performance of its encoder-only counterparts, reaching 75.1% ImageNet top-1 accuracy with only 5.7M parameters. Scaling the model to ~310M parameters and pre-training on ImageNet-21K further raises the accuracy to 86.0%. Extensive experiments demonstrate iLLaMA's reliable properties: calibration, shape-texture bias, quantization compatibility, ADE20K segmentation, and CIFAR transfer learning. We hope our study can spark fresh perspectives on visual model design in the wave of LLMs. Pre-trained models and code are available here.

URL

https://arxiv.org/abs/2404.06773

PDF

https://arxiv.org/pdf/2404.06773.pdf

