Abstract
This paper introduces VLAP, a novel approach that bridges pretrained vision models and large language models (LLMs) to make frozen LLMs understand the visual world. VLAP transforms the embedding space of pretrained vision models into the LLMs' word embedding space using a single linear layer for efficient and general-purpose visual and language understanding. Specifically, we harness well-established word embeddings to bridge the two modality embedding spaces. The visual and text representations are simultaneously assigned to a set of word embeddings within pretrained LLMs by formulating the assignment procedure as an optimal transport problem. We predict the assignment of one modality from the representation of the other, enforcing consistent assignments for paired multimodal data. This allows vision and language representations to contain the same information, grounding the frozen LLMs' word embedding space in visual data. Moreover, a robust semantic taxonomy of LLMs can be preserved with visual data, since LLMs interpret and reason about linguistic information through correlations between word embeddings. Experimental results show that VLAP achieves substantial improvements over previous linear transformation-based approaches across a range of vision-language tasks, including image captioning, visual question answering, and cross-modal retrieval. We also demonstrate that the learned visual representations hold the semantic taxonomy of LLMs, making visual semantic arithmetic possible.
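The core mechanism described above (a single linear layer into the frozen LLM's word-embedding space, optimal-transport assignment of both modalities to word embeddings, and prediction of one modality's assignment from the other) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: all shapes, the Sinkhorn iteration count, and the temperature values are assumptions, and the paper's actual loss and transport formulation may differ in detail.

```python
# Hypothetical sketch of the VLAP idea: project visual features into a frozen
# LLM's word-embedding space with one linear layer, softly assign both
# modalities to word embeddings via Sinkhorn-style optimal transport, and
# train with swapped (cross-modal) assignment prediction.
import numpy as np

rng = np.random.default_rng(0)
V, D = 100, 32                             # assumed vocab size / embedding dim

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

word_emb = l2norm(rng.normal(size=(V, D)))  # stands in for frozen LLM embeddings

def sinkhorn(scores, n_iters=3, eps=0.05):
    """Sinkhorn normalization: turn a (batch x V) similarity matrix into a
    soft assignment over word embeddings (rows sum to 1 after the final step)."""
    Q = np.exp(scores / eps)
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True)   # normalize over words
        Q /= Q.sum(axis=0, keepdims=True)   # normalize over the batch
    return Q / Q.sum(axis=1, keepdims=True)

def log_softmax(s):
    m = s.max(axis=1, keepdims=True)
    return s - m - np.log(np.exp(s - m).sum(axis=1, keepdims=True))

def swapped_loss(vis_feat, txt_feat, W):
    """Predict each modality's assignment from the other modality's features."""
    z_v = l2norm(vis_feat @ W)              # the single trainable linear layer
    z_t = l2norm(txt_feat)                  # text features already in LLM space
    s_v, s_t = z_v @ word_emb.T, z_t @ word_emb.T   # cosine similarities
    q_v, q_t = sinkhorn(s_v), sinkhorn(s_t)          # OT assignment targets
    # cross-entropy with swapped targets: text target supervises vision and
    # vice versa, enforcing consistent assignments for paired data
    return -np.mean((q_t * log_softmax(s_v)).sum(axis=1)
                    + (q_v * log_softmax(s_t)).sum(axis=1))

B, Dv = 8, 48                               # assumed batch size / vision dim
W = rng.normal(size=(Dv, D)) * 0.1
loss = swapped_loss(rng.normal(size=(B, Dv)), rng.normal(size=(B, D)), W)
print(np.isfinite(loss) and loss > 0)       # a valid, positive cross-entropy
```

Because the word embeddings are frozen and only `W` is trained, the visual features inherit the LLM's embedding geometry, which is what makes the "visual semantic arithmetic" mentioned in the abstract plausible.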
URL
https://arxiv.org/abs/2404.09632