Paper Reading AI Learner

MoAI: Mixture of All Intelligence for Large Language and Vision Models

2024-03-12 10:44:13
Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro

Abstract

The rise of large language models (LLMs) and instruction tuning has led to the current trend of instruction-tuned large language and vision models (LLVMs). This trend involves either meticulously curating numerous instruction tuning datasets tailored to specific objectives or enlarging LLVMs to manage vast amounts of vision language (VL) data. However, current LLVMs have disregarded the detailed and comprehensive real-world scene understanding available from specialized computer vision (CV) models in visual perception tasks such as segmentation, detection, scene graph generation (SGG), and optical character recognition (OCR). Instead, the existing LLVMs rely mainly on the large capacity and emergent capabilities of their LLM backbones. Therefore, we present a new LLVM, Mixture of All Intelligence (MoAI), which leverages auxiliary visual information obtained from the outputs of external segmentation, detection, SGG, and OCR models. MoAI operates through two newly introduced modules: MoAI-Compressor and MoAI-Mixer. After verbalizing the outputs of the external CV models, the MoAI-Compressor aligns and condenses them to efficiently use relevant auxiliary visual information for VL tasks. MoAI-Mixer then blends three types of intelligence: (1) visual features, (2) auxiliary features from the external CV models, and (3) language features, by utilizing the concept of Mixture of Experts. Through this integration, MoAI significantly outperforms both open-source and closed-source LLVMs in numerous zero-shot VL tasks, particularly those related to real-world scene understanding such as object existence, positions, relations, and OCR, without enlarging the model size or curating extra visual instruction tuning datasets.
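The core idea of the MoAI-Mixer, blending visual, auxiliary, and language features via a Mixture-of-Experts-style gate, can be sketched minimally as follows. This is an illustrative simplification under assumed details, not the paper's implementation: the function names, the per-token softmax gate, and the choice to condition the gate on the language features are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mix_intelligence(visual, auxiliary, language, gate_w):
    """Blend the three intelligence sources per token.

    visual, auxiliary, language: (seq_len, dim) feature arrays.
    gate_w: (dim, 3) gating weights (hypothetical learned parameter).
    Returns a (seq_len, dim) convex combination of the three sources.
    """
    # Stack the three expert outputs: (seq_len, dim, 3)
    experts = np.stack([visual, auxiliary, language], axis=-1)
    # Gate scores conditioned on the language features: (seq_len, 3)
    weights = softmax(language @ gate_w, axis=-1)
    # Weighted sum over the expert axis: (seq_len, dim)
    return np.einsum('sde,se->sd', experts, weights)
```

Because the gate weights sum to 1 per token, each output token is a convex combination of the three feature streams, so the model can lean on CV-model evidence for perception-heavy tokens and on language features elsewhere.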

URL

https://arxiv.org/abs/2403.07508

PDF

https://arxiv.org/pdf/2403.07508.pdf

