Abstract
We present Chameleon, a family of early-fusion, token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored for the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation. Chameleon demonstrates broad and general capabilities: it achieves state-of-the-art performance on image captioning tasks, outperforms Llama-2 on text-only tasks while remaining competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or the outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in the unified modeling of full multimodal documents.
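To make the early-fusion, token-based setup concrete, the sketch below shows one way such a model can fold discretized image tokens and text tokens into a single shared vocabulary and model the interleaved stream with one autoregressive transformer. This is a minimal illustration, not the paper's implementation: the vocabulary sizes, the tokens-per-image count, the `fuse` helper, and the `TinyMixedModalLM` module are all hypothetical, and positional encodings are omitted for brevity.

```python
# Minimal sketch (not the paper's implementation) of early-fusion,
# token-based mixed-modal modeling: a hypothetical discrete image
# tokenizer's codebook is offset into the text vocabulary so that one
# autoregressive transformer sees a single interleaved token stream.
import torch
import torch.nn as nn

TEXT_VOCAB = 65_536          # hypothetical text (BPE) vocabulary size
IMAGE_CODEBOOK = 8_192       # hypothetical VQ image codebook size
TOKENS_PER_IMAGE = 1_024     # hypothetical fixed token count per image
VOCAB = TEXT_VOCAB + IMAGE_CODEBOOK  # single fused vocabulary


def fuse(text_ids: torch.Tensor, image_codes: torch.Tensor) -> torch.Tensor:
    """Concatenate text tokens and offset image codes into one sequence."""
    image_ids = image_codes + TEXT_VOCAB  # shift image codes into fused id space
    return torch.cat([text_ids, image_ids], dim=-1)


class TinyMixedModalLM(nn.Module):
    """One causal transformer over the fused token stream (positional
    encodings omitted to keep the sketch short)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, VOCAB)  # next token may be text or image

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(ids)
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        return self.lm_head(self.blocks(x, mask=mask))


# Usage: a short caption followed by the quantized image it describes.
text_ids = torch.randint(0, TEXT_VOCAB, (1, 32))
image_codes = torch.randint(0, IMAGE_CODEBOOK, (1, TOKENS_PER_IMAGE))
logits = TinyMixedModalLM()(fuse(text_ids, image_codes))
print(logits.shape)  # (1, 32 + 1024, VOCAB)
```

Because generation and understanding share one vocabulary and one sequence model, the same network can, in principle, continue a document with either text tokens or image tokens, which is the property the abstract describes as mixed-modal generation in any arbitrary sequence.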
URL
https://arxiv.org/abs/2405.09818