Paper Reading AI Learner

Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models

2023-05-24 11:06:15
Gen Luo, Yiyi Zhou, Tianhe Ren, Shengxin Chen, Xiaoshuai Sun, Rongrong Ji

Abstract

Recently, there has been growing interest in extending the multimodal capabilities of large language models (LLMs), e.g., to vision-language (VL) learning, which is regarded as the next milestone of artificial general intelligence. However, existing solutions are prohibitively expensive: they not only require optimizing an excessive number of parameters, but also demand another large-scale pre-training stage before VL instruction tuning. In this paper, we propose a novel and affordable solution for the effective VL adaptation of LLMs, called Mixture-of-Modality Adaptation (MMA). Instead of using large neural networks to connect the image encoder and the LLM, MMA adopts lightweight modules, i.e., adapters, to bridge the gap between LLMs and VL tasks, which also enables joint optimization of the image and language models. Meanwhile, MMA is equipped with a routing algorithm that helps the LLM shift automatically between single- and multi-modal instructions without compromising its natural language understanding ability. To validate MMA, we apply it to a recent LLM, LLaMA, and call the resulting large vision-language instructed model LaVIN. To evaluate MMA and LaVIN, we conduct extensive experiments under two setups: multimodal science question answering and multimodal dialogue. The experimental results not only demonstrate LaVIN's competitive performance and superior training efficiency over existing multimodal LLMs, but also confirm its great potential as a general-purpose chatbot. More importantly, LaVIN's actual training cost is remarkably low, e.g., only 1.4 training hours with 3.8M trainable parameters, which strongly confirms the effectiveness of MMA. Our project is released at this https URL.
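The abstract describes two mechanisms: lightweight adapters that bridge the image encoder and the LLM, and a routing algorithm that mixes modality-specific paths depending on the instruction. This page does not include the implementation, so the sketch below is only a minimal illustration of the idea in PyTorch; the class name MMAdapter, the bottleneck size, the mean-pooled routing rule, and the two-branch layout are all assumptions for illustration, not LaVIN's actual code.

```python
import torch
import torch.nn as nn

class MMAdapter(nn.Module):
    """Hypothetical sketch of a mixture-of-modality adapter: two lightweight
    bottleneck branches plus a learned router that weighs them per example."""

    def __init__(self, dim: int, bottleneck: int = 8):
        super().__init__()
        # Two parallel low-rank branches, one oriented to each modality.
        self.text_branch = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))
        self.image_branch = nn.Sequential(
            nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))
        # Router predicts mixing weights from the pooled input tokens.
        self.router = nn.Linear(dim, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); mean-pool tokens to route per example.
        weights = torch.softmax(self.router(x.mean(dim=1)), dim=-1)  # (batch, 2)
        w_text = weights[:, 0, None, None]
        w_img = weights[:, 1, None, None]
        # Residual connection leaves the frozen backbone's output intact
        # when the branches contribute little; only the adapter trains.
        return x + w_text * self.text_branch(x) + w_img * self.image_branch(x)

# Usage: wrap the hidden states of a frozen transformer layer.
adapter = MMAdapter(dim=4096)
hidden = torch.randn(2, 16, 4096)
print(adapter(hidden).shape)  # torch.Size([2, 16, 4096])
```

Because only the small branches and the router are trainable while the backbone stays frozen, a design along these lines keeps the trainable parameter count in the millions, consistent in spirit with the 3.8M figure quoted in the abstract.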

URL

https://arxiv.org/abs/2305.15023

PDF

https://arxiv.org/pdf/2305.15023.pdf

