Grounding Language Models to Images for Multimodal Generation

Abstract

We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process and generate arbitrarily interleaved image-and-text data. Our method leverages the abilities that language models learn from large-scale text-only pretraining, such as in-context learning and free-form text generation. We keep the language model frozen and finetune only input and output linear layers to enable cross-modality interactions. This allows our model to process arbitrarily interleaved image-and-text inputs and to generate free-form text interleaved with retrieved images. We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities. Our approach works with any off-the-shelf language model and paves the way towards an effective, general solution for leveraging pretrained language models in visually grounded settings.
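
The data flow described in the abstract can be illustrated with a toy sketch: a frozen language model sits between a trainable input linear layer (image features into the model's embedding space) and a trainable output linear layer (the model's hidden state into a space used for image retrieval). This is a minimal illustration under stated assumptions, not the authors' implementation. The dimensions, the mean-pooling stand-in for the language model, the extra image-side projection `W_img`, and the cosine-similarity scoring are all hypothetical choices made so the example runs with the Python standard library alone. No training step is shown.

```python
import math
import random


def linear(weights, vec):
    """Apply a bias-free linear layer; `weights` is a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
VISUAL_DIM, LM_DIM, RETRIEVAL_DIM = 6, 8, 4  # hypothetical toy sizes

# The only parameters that would be finetuned in this sketch.
W_in = random_matrix(LM_DIM, VISUAL_DIM, rng)          # image features -> LM input space
W_out = random_matrix(RETRIEVAL_DIM, LM_DIM, rng)      # LM hidden state -> retrieval space
W_img = random_matrix(RETRIEVAL_DIM, VISUAL_DIM, rng)  # assumed image-side projection


def frozen_lm(embeddings):
    """Stand-in for the frozen language model (assumption): mean-pools its inputs."""
    return [sum(column) / len(embeddings) for column in zip(*embeddings)]


def encode_sequence(items):
    """Encode an interleaved list of ("text", embedding) and ("image", features)."""
    embeddings = []
    for kind, vec in items:
        # Images pass through the input layer; text embeddings are used directly.
        embeddings.append(linear(W_in, vec) if kind == "image" else vec)
    return frozen_lm(embeddings)


def retrieve_image(items, candidate_features):
    """Return the index of the candidate image that best matches the context."""
    query = linear(W_out, encode_sequence(items))
    scores = [cosine(query, linear(W_img, feat)) for feat in candidate_features]
    return max(range(len(scores)), key=scores.__getitem__)


text_token = [rng.gauss(0.0, 1.0) for _ in range(LM_DIM)]
image_feat = [rng.gauss(0.0, 1.0) for _ in range(VISUAL_DIM)]
candidates = [[rng.gauss(0.0, 1.0) for _ in range(VISUAL_DIM)] for _ in range(5)]

best = retrieve_image([("text", text_token), ("image", image_feat)], candidates)
print("retrieved candidate index:", best)
```

Because the weights here are random, the retrieved index carries no meaning. The sketch only shows where the two kinds of linear layers sit relative to the frozen model, and that text and image inputs can appear in any order in the same sequence.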

URL

https://arxiv.org/abs/2301.13823

PDF

https://arxiv.org/pdf/2301.13823.pdf