
Adapting LLaMA Decoder to Vision Transformer

2024-04-10 06:30:08
Jiahao Wang, Wenqi Shao, Mengzhao Chen, Chengyue Wu, Yong Liu, Kaipeng Zhang, Songyang Zhang, Kai Chen, Ping Luo

Abstract

This work examines whether decoder-only Transformers such as LLaMA, which were originally designed for large language models (LLMs), can be adapted to the computer vision field. We first "LLaMAfy" a standard ViT step by step to align it with LLaMA's architecture, and find that directly applying a causal mask to the self-attention causes an attention collapse issue, resulting in failure of network training. We suggest repositioning the class token behind the image tokens with a post-sequence class token technique to overcome this challenge, enabling causal self-attention to efficiently capture the entire image's information. Additionally, we develop a soft mask strategy that gradually introduces a causal mask to the self-attention at the onset of training to ease optimization. The tailored model, dubbed image LLaMA (iLLaMA), is akin to LLaMA in architecture and enables direct supervised learning. Its causal self-attention boosts computational efficiency and learns complex representations by elevating attention map ranks. iLLaMA rivals the performance of its encoder-only counterparts, achieving 75.1% ImageNet top-1 accuracy with only 5.7M parameters. Scaling the model to ~310M parameters and pre-training on ImageNet-21K further raises the accuracy to 86.0%. Extensive experiments demonstrate iLLaMA's reliable properties: calibration, shape-texture bias, quantization compatibility, ADE20K segmentation, and CIFAR transfer learning. We hope our study can kindle fresh views on visual model design in the wave of LLMs. Pre-trained models and code are available here.
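The two mechanisms the abstract names, a post-sequence class token under causal self-attention and a soft mask that is tightened gradually early in training, can be illustrated with a small PyTorch sketch. This is an illustrative reconstruction based only on the abstract, not the authors' released implementation; the tensor sizes, the linear warm-up ramp, and the probability-space blending of the soft mask are assumptions.

```python
# Illustrative sketch only (assumed details, not the paper's released code):
# (1) causal self-attention with the class token appended AFTER the patch tokens,
#     so the final position can attend to the entire image;
# (2) a "soft" causal mask that is blended in gradually during early training.
import torch
import torch.nn.functional as F

def causal_self_attention(x: torch.Tensor) -> torch.Tensor:
    """Single-head causal self-attention over x of shape (batch, seq, dim)."""
    d = x.size(-1)
    scores = x @ x.transpose(-2, -1) / d ** 0.5                   # (B, N, N) logits
    n = scores.size(-1)
    future = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))            # token i only sees tokens <= i
    return F.softmax(scores, dim=-1) @ x

def soft_causal_weights(scores: torch.Tensor, alpha: float) -> torch.Tensor:
    """Blend bidirectional and causal attention; alpha ramps 0 -> 1 at the start of
    training (the linear ramp and probability-space blend are assumptions)."""
    n = scores.size(-1)
    causal = torch.tril(torch.ones(n, n, device=scores.device))   # 1 = visible under causality
    soft_mask = (1.0 - alpha) + alpha * causal                     # all ones at alpha = 0
    weights = F.softmax(scores, dim=-1) * soft_mask                # damp "future" positions
    return weights / weights.sum(dim=-1, keepdim=True)             # renormalize each row

# Post-sequence class token: append it behind the patch tokens instead of prepending it.
B, N, D = 2, 196, 64                                               # illustrative sizes
patch_tokens = torch.randn(B, N, D)
cls_token = torch.zeros(B, 1, D)                                   # learnable in a real model
x = torch.cat([patch_tokens, cls_token], dim=1)                    # class token sits LAST

cls_out = causal_self_attention(x)[:, -1]                          # read out the class position
print(cls_out.shape)                                               # torch.Size([2, 64])

# Soft mask example: 25% of the way through an assumed 1000-step warm-up.
alpha = min(250 / 1000, 1.0)
w = soft_causal_weights(torch.randn(B, N + 1, N + 1), alpha)
print(w.sum(dim=-1)[0, :3])                                        # rows sum to 1
```

Placing the class token last is what lets causal attention cover the whole image: a prepended class token under a causal mask could attend only to itself.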

Abstract (translated)

This paper explores whether decoder-only Transformers such as LLaMA, originally designed for large language models (LLMs), can be adapted to the computer vision field. We first "LLaMAfy" a standard ViT step by step to align it with LLaMA's architecture, and find that directly applying a causal mask to the self-attention causes an attention collapse issue that makes network training fail. We propose a post-sequence class token technique that places the class token behind the image tokens to overcome this challenge, enabling causal self-attention to efficiently capture the whole image's information. In addition, we develop a soft mask strategy that gradually introduces the causal mask to the self-attention at the start of training to ease optimization. The tailored model, called image LLaMA (iLLaMA), is similar to LLaMA in architecture and supports direct supervised learning. Its causal self-attention improves computational efficiency and learns complex representations by raising the rank of the attention maps. iLLaMA matches the performance of its encoder-only counterparts, reaching 75.1% ImageNet top-1 accuracy with only 5.7M parameters. Scaling the model to ~310M parameters and pre-training on ImageNet-21K further raises the accuracy to 86.0%. Extensive experiments demonstrate iLLaMA's reliable properties: calibration, shape-texture bias, quantization compatibility, ADE20K segmentation, and CIFAR transfer learning. We hope our study can spark fresh perspectives on visual model design in the wave of LLMs. Pre-trained models and code are available here.

URL

https://arxiv.org/abs/2404.06773

PDF

https://arxiv.org/pdf/2404.06773.pdf

