Abstract
A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, and visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). We also show that MLLMs can benefit from cross-modal transfer, i.e., transferring knowledge from language to multimodal tasks, and from multimodal tasks to language. In addition, we introduce a Raven IQ test dataset, which diagnoses the nonverbal reasoning capability of MLLMs.
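The abstract describes in-context (few-shot) learning over arbitrarily interleaved text and images. As a rough illustration only, the Python sketch below shows how such an interleaved few-shot prompt might be assembled before being fed to a multimodal model; the `ImageSegment`/`TextSegment` types, the example file paths, and the prompt layout are hypothetical assumptions for illustration, not the actual Kosmos-1 interface.

```python
# Hypothetical sketch of an interleaved image-text few-shot prompt for an
# MLLM. The segment types and file paths are illustrative assumptions,
# not the actual Kosmos-1 API.
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ImageSegment:
    path: str  # local path or URL of the image


@dataclass
class TextSegment:
    text: str


Prompt = List[Union[ImageSegment, TextSegment]]


def few_shot_prompt() -> Prompt:
    """Build a two-shot interleaved prompt that ends in an open query."""
    return [
        # Demonstration 1: an image followed by its worked answer.
        ImageSegment("examples/cat.jpg"),
        TextSegment("Question: What animal is shown? Answer: a cat."),
        # Demonstration 2.
        ImageSegment("examples/stop_sign.jpg"),
        TextSegment("Question: What does the sign say? Answer: STOP."),
        # Query: the model is expected to complete the answer in context,
        # without any gradient updates or finetuning.
        ImageSegment("query/unknown.jpg"),
        TextSegment("Question: What animal is shown? Answer:"),
    ]
```

The point of the sketch is only the prompt structure: demonstrations and the query are interleaved sequences of images and text, so few-shot behavior comes purely from conditioning on the context rather than from parameter updates.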
URL
https://arxiv.org/abs/2302.14045