Paper Reading AI Learner

MoAI: Mixture of All Intelligence for Large Language and Vision Models

2024-03-12 10:44:13
Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro

Abstract

The rise of large language models (LLMs) and instruction tuning has led to the current trend of instruction-tuned large language and vision models (LLVMs). This trend involves either meticulously curating numerous instruction tuning datasets tailored to specific objectives or enlarging LLVMs to manage vast amounts of vision language (VL) data. However, current LLVMs have disregarded the detailed and comprehensive real-world scene understanding available from specialized computer vision (CV) models in visual perception tasks such as segmentation, detection, scene graph generation (SGG), and optical character recognition (OCR). Instead, the existing LLVMs rely mainly on the large capacity and emergent capabilities of their LLM backbones. Therefore, we present a new LLVM, Mixture of All Intelligence (MoAI), which leverages auxiliary visual information obtained from the outputs of external segmentation, detection, SGG, and OCR models. MoAI operates through two newly introduced modules: MoAI-Compressor and MoAI-Mixer. After verbalizing the outputs of the external CV models, the MoAI-Compressor aligns and condenses them to efficiently use relevant auxiliary visual information for VL tasks. MoAI-Mixer then blends three types of intelligence: (1) visual features, (2) auxiliary features from the external CV models, and (3) language features, by utilizing the concept of Mixture of Experts. Through this integration, MoAI significantly outperforms both open-source and closed-source LLVMs in numerous zero-shot VL tasks, particularly those related to real-world scene understanding such as object existence, positions, relations, and OCR, without enlarging the model size or curating extra visual instruction tuning datasets.
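The core idea of the MoAI-Mixer, blending visual, auxiliary, and language features via a Mixture-of-Experts-style gate, can be sketched minimally as follows. This is an illustrative simplification under assumed details, not the paper's implementation: the function names, the per-token softmax gate, and the choice to condition the gate on the language features are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mix_intelligence(visual, auxiliary, language, gate_w):
    """Blend the three intelligence sources per token.

    visual, auxiliary, language: (seq_len, dim) feature arrays.
    gate_w: (dim, 3) gating weights (hypothetical learned parameter).
    Returns a (seq_len, dim) convex combination of the three sources.
    """
    # Stack the three expert outputs: (seq_len, dim, 3)
    experts = np.stack([visual, auxiliary, language], axis=-1)
    # Gate scores conditioned on the language features: (seq_len, 3)
    weights = softmax(language @ gate_w, axis=-1)
    # Weighted sum over the expert axis: (seq_len, dim)
    return np.einsum('sde,se->sd', experts, weights)
```

Because the gate weights sum to 1 per token, each output token is a convex combination of the three feature streams, so the model can lean on CV-model evidence for perception-heavy tokens and on language features elsewhere.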

URL

https://arxiv.org/abs/2403.07508

PDF

https://arxiv.org/pdf/2403.07508.pdf

