
Chameleon: Mixed-Modal Early-Fusion Foundation Models

2024-05-16
Chameleon Team

Abstract

We present Chameleon, a family of early-fusion, token-based mixed-modal models capable of understanding and generating images and text in any arbitrary sequence. We outline a stable training approach from inception, an alignment recipe, and an architectural parameterization tailored to the early-fusion, token-based, mixed-modal setting. The models are evaluated on a comprehensive range of tasks, including visual question answering, image captioning, text generation, image generation, and long-form mixed-modal generation. Chameleon demonstrates broad and general capabilities: it achieves state-of-the-art performance on image captioning tasks, outperforms Llama-2 on text-only tasks while remaining competitive with models such as Mixtral 8x7B and Gemini-Pro, and performs non-trivial image generation, all in a single model. It also matches or exceeds the performance of much larger models, including Gemini Pro and GPT-4V, according to human judgments on a new long-form mixed-modal generation evaluation, where either the prompt or the outputs contain mixed sequences of both images and text. Chameleon marks a significant step forward in the unified modeling of full multimodal documents.
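The early-fusion design the abstract describes can be made concrete with a small sketch: an image is quantized into discrete codebook indices (as a VQ-style image tokenizer would produce), those indices are shifted into the same flat vocabulary as the text tokens, and the two modalities are interleaved into one sequence for a single autoregressive transformer. The vocabulary sizes, special tokens, and token ids below are illustrative assumptions, not the paper's actual tokenizer layout.

```python
# Minimal sketch of early fusion: images become discrete tokens that live in
# the same flat vocabulary as text tokens, so one transformer consumes a
# single interleaved sequence. All sizes and ids here are assumed for
# illustration, not Chameleon's real tokenizer.

TEXT_VOCAB_SIZE = 65_536      # assumed text (BPE) vocabulary size
IMAGE_CODEBOOK_SIZE = 8_192   # assumed VQ codebook size for image patches

# Hypothetical special tokens marking image spans inside the token stream.
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE      # "begin of image"
EOI = BOI + 1                                    # "end of image"

def image_codes_to_tokens(codes: list[int]) -> list[int]:
    """Shift VQ codebook indices past the text vocabulary so both
    modalities share one flat token space."""
    assert all(0 <= c < IMAGE_CODEBOOK_SIZE for c in codes)
    return [BOI] + [TEXT_VOCAB_SIZE + c for c in codes] + [EOI]

def build_mixed_modal_sequence(segments: list[tuple[str, list[int]]]) -> list[int]:
    """Interleave text-token and image-code segments, in any order, into
    the single sequence an early-fusion model would consume."""
    tokens: list[int] = []
    for kind, payload in segments:
        if kind == "text":
            tokens.extend(payload)                     # already text-token ids
        elif kind == "image":
            tokens.extend(image_codes_to_tokens(payload))
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return tokens

# Example: a caption, an image, then a follow-up question -- one stream.
sequence = build_mixed_modal_sequence([
    ("text",  [101, 2009, 2003]),      # placeholder text-token ids
    ("image", [5, 4096, 8191]),        # placeholder VQ codebook indices
    ("text",  [2054, 2003, 2008]),
])
print(sequence)
```

Because text and image tokens share one space under this scheme, the same decoder can condition on and emit either modality at any point in the sequence, which is what makes the interleaved, long-form mixed-modal generation evaluated above possible.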


URL

https://arxiv.org/abs/2405.09818

PDF

https://arxiv.org/pdf/2405.09818.pdf

