Abstract
In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM), to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model, InternViT-6B, boosting its visual understanding capabilities and enabling it to be transferred and reused across different LLMs. (2) Dynamic High-Resolution: we divide images into tiles of 448$\times$448 pixels, with the tile count ranging from 1 to 40 according to the aspect ratio and resolution of the input image, supporting inputs up to 4K resolution. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset covering common scenes and document images, annotated with English and Chinese question-answer pairs, which significantly enhances performance on OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks. Code has been released at this https URL.
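To make the dynamic high-resolution scheme concrete, here is a minimal Python sketch of how such tiling can work. The helper names (pick_tile_grid, tile_image), the log-ratio aspect metric, and the tie-breaking rule are illustrative assumptions for this sketch, not the paper's exact implementation: the idea is to search over candidate grids of up to 40 tiles and pick the one whose shape best matches the input image.

```python
import math
from PIL import Image  # pip install pillow

def pick_tile_grid(width, height, tile=448, max_tiles=40):
    """Choose a (cols, rows) tile grid for an image.

    Among all grids with 1..max_tiles tiles, minimize the aspect-ratio
    mismatch first, then prefer a tile count close to what the image
    needs at native resolution. (Illustrative criteria, assumed for
    this sketch rather than taken from the paper.)
    """
    aspect = width / height
    needed = min(math.ceil(width / tile) * math.ceil(height / tile), max_tiles)
    best, best_key = (1, 1), None
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            key = (abs(math.log(aspect) - math.log(cols / rows)),
                   abs(cols * rows - needed))
            if best_key is None or key < best_key:
                best, best_key = (cols, rows), key
    return best

def tile_image(img, tile=448, max_tiles=40):
    """Resize the image to fill the chosen grid, then cut 448x448 tiles."""
    cols, rows = pick_tile_grid(img.width, img.height, tile, max_tiles)
    resized = img.resize((cols * tile, rows * tile))
    return [resized.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
            for r in range(rows) for c in range(cols)]

# Example: a 3840x2160 (4K) input maps to a 7x4 grid, i.e. 28 tiles.
```

Under these assumed criteria, a square image yields a single 448$\times$448 tile, while a 4K landscape image fills a wide multi-tile grid, which matches the 1-to-40 tile range described above.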
URL
https://arxiv.org/abs/2404.16821