Abstract
Large Language Models (LLMs) have so far impressed the world with unprecedented capabilities that emerge at large scale. On the vision side, transformer models (i.e., ViT) are following the same trend, achieving the best performance on challenging benchmarks. With the abundance of such unimodal models, a natural question arises: do we also need to follow this trend to tackle multimodal tasks? In this work, we propose instead to direct effort toward efficient adaptation of existing models, augmenting Language Models with perception. Existing approaches for adapting pretrained models to vision-language tasks still rely on several key components that hinder their efficiency. In particular, they still train a large number of parameters, rely on large-scale multimodal pretraining, use encoders (e.g., CLIP) trained on huge image-text datasets, and add significant inference overhead. In addition, most of these approaches focus on Zero-Shot and In-Context Learning, with little to no effort on direct finetuning. We investigate the minimal computational effort needed to adapt unimodal models to multimodal tasks and propose a new challenging setup, alongside different approaches, that efficiently adapts unimodal pretrained models. We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning across Image, Video, and Audio modalities, following the proposed setup. The code will be available here: this https URL.
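The parameter-efficient recipe the abstract describes (freeze both unimodal backbones, train only a linear projection and one prepended soft token) can be sketched as follows. This is a minimal illustration with toy stand-in modules, not the authors' implementation: the module names, dimensions, and the simple prefix-concatenation scheme are assumptions for clarity, whereas the real eP-ALM uses a pretrained ViT and a large pretrained language model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins for the pretrained unimodal backbones (sizes hypothetical);
# eP-ALM itself uses a pretrained ViT and a pretrained LLM.
VIS_DIM, LM_DIM = 768, 1024
vision_encoder = nn.Sequential(nn.Linear(VIS_DIM, VIS_DIM), nn.GELU(),
                               nn.Linear(VIS_DIM, VIS_DIM))
language_model = nn.Sequential(nn.Linear(LM_DIM, LM_DIM), nn.GELU(),
                               nn.Linear(LM_DIM, LM_DIM))

# Freeze both backbones: none of their parameters receive gradients.
for module in (vision_encoder, language_model):
    for p in module.parameters():
        p.requires_grad = False

# The only trainable pieces: one linear projection from the vision width
# to the LM width, and one prepended soft token.
projection = nn.Linear(VIS_DIM, LM_DIM)               # trainable
soft_token = nn.Parameter(torch.zeros(1, 1, LM_DIM))  # trainable

# Forward sketch: encode an image, project its tokens into the LM space,
# prepend the trainable token, and feed the sequence to the frozen LM.
batch, n_patches = 2, 10
with torch.no_grad():
    img_feats = vision_encoder(torch.randn(batch, n_patches, VIS_DIM))
visual_tokens = projection(img_feats)
prefix = soft_token.expand(batch, -1, -1)
out = language_model(torch.cat([prefix, visual_tokens], dim=1))

# With billion-parameter backbones, the projection plus the soft token
# amount to well under 1% of total parameters (the ">99% frozen" claim);
# with these toy sizes the ratio is of course much larger.
trainable = sum(p.numel() for p in projection.parameters()) + soft_token.numel()
```

Because gradients flow only through `projection` and `soft_token`, an optimizer in this setup would be constructed over just those two tensors, which is what keeps both training cost and checkpoint size small.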
URL
https://arxiv.org/abs/2303.11403