Abstract
Building general-purpose models that can perceive diverse real-world modalities and solve various tasks is an appealing goal in artificial intelligence. In this paper, we present ChatBridge, a novel multimodal language model that leverages the expressive capabilities of language as the catalyst to bridge the gap between various modalities. We show that language-paired two-modality data alone is sufficient to connect all modalities. ChatBridge leverages recent large language models (LLMs) and extends their zero-shot capabilities to incorporate diverse multimodal inputs. ChatBridge undergoes a two-stage training. The first stage aligns each modality with language, which yields emergent multimodal correlation and collaboration abilities. The second stage instruction-finetunes ChatBridge to align it with user intent, using our newly proposed multimodal instruction-tuning dataset, MULTIS, which covers a wide range of 16 multimodal tasks across text, image, video, and audio modalities. We show strong quantitative and qualitative results on zero-shot multimodal tasks covering text, image, video, and audio modalities. All code, data, and models of ChatBridge will be open-sourced.
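To make the two-stage recipe concrete, below is a minimal sketch of the idea the abstract describes: per-modality features are bridged into a language model's embedding space, the bridges are trained first on language-paired two-modality data, and the same setup is then instruction-tuned. All module names, dimensions, and the use of simple linear projections are illustrative assumptions for this sketch, not the paper's actual ChatBridge architecture.

```python
# Toy illustration of bridging non-text modalities to a frozen language model.
# Everything here (class names, dims, the linear bridge) is an assumption made
# for illustration; the real ChatBridge design is described in the paper.
import torch
import torch.nn as nn

class ModalityBridge(nn.Module):
    """Projects features of one modality (image, video, or audio) into the
    LLM's token-embedding space; a linear layer stands in for whatever
    alignment module the real model uses."""
    def __init__(self, feat_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)

class ToyChatBridge(nn.Module):
    def __init__(self, llm_dim: int = 512):
        super().__init__()
        # Stand-in for a frozen LLM: a small Transformer over embeddings.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        for p in self.llm.parameters():
            p.requires_grad = False
        # One trainable bridge per non-text modality; feature dims are made up.
        self.bridges = nn.ModuleDict({
            "image": ModalityBridge(1024, llm_dim),
            "video": ModalityBridge(1024, llm_dim),
            "audio": ModalityBridge(768, llm_dim),
        })
        self.text_embed = nn.Embedding(32000, llm_dim)

    def forward(self, text_ids: torch.Tensor, modal_feats: dict) -> torch.Tensor:
        # Concatenate bridged modality tokens with text tokens, then run the LLM.
        parts = [self.bridges[m](f) for m, f in modal_feats.items()]
        parts.append(self.text_embed(text_ids))
        return self.llm(torch.cat(parts, dim=1))

# Stage 1 (alignment): train only the bridges on language-paired two-modality
# data (e.g. image-text or audio-text pairs). Stage 2 (instruction tuning on
# MULTIS-style data) would reuse the same loop with instruction-response pairs.
model = ToyChatBridge()
optim = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
feats = {"image": torch.randn(2, 32, 1024)}   # dummy pre-extracted image features
text = torch.randint(0, 32000, (2, 16))       # dummy caption token ids
out = model(text, feats)                      # shape: (2, 48, 512)
```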
URL
https://arxiv.org/abs/2305.16103