Abstract
In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM), to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model, InternViT-6B, boosting its visual understanding capabilities and enabling it to be transferred and reused across different LLMs. (2) Dynamic High-Resolution: we divide images into tiles of 448$\times$448 pixels, with the tile count ranging from 1 to 40 according to the aspect ratio and resolution of the input image, supporting inputs up to 4K resolution. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset covering common scenes and document images, annotated with English and Chinese question-answer pairs, which significantly enhances performance on OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks. Code has been released at this https URL.
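To make the dynamic high-resolution scheme concrete, here is a minimal Python sketch of how such tiling can work. The helper names (pick_tile_grid, tile_image), the log-ratio aspect metric, and the tie-breaking rule are illustrative assumptions for this sketch, not the paper's exact implementation: the idea is to search over candidate grids of up to 40 tiles and pick the one whose shape best matches the input image.

```python
import math
from PIL import Image  # pip install pillow

def pick_tile_grid(width, height, tile=448, max_tiles=40):
    """Choose a (cols, rows) tile grid for an image.

    Among all grids with 1..max_tiles tiles, minimize the aspect-ratio
    mismatch first, then prefer a tile count close to what the image
    needs at native resolution. (Illustrative criteria, assumed for
    this sketch rather than taken from the paper.)
    """
    aspect = width / height
    needed = min(math.ceil(width / tile) * math.ceil(height / tile), max_tiles)
    best, best_key = (1, 1), None
    for cols in range(1, max_tiles + 1):
        for rows in range(1, max_tiles // cols + 1):
            key = (abs(math.log(aspect) - math.log(cols / rows)),
                   abs(cols * rows - needed))
            if best_key is None or key < best_key:
                best, best_key = (cols, rows), key
    return best

def tile_image(img, tile=448, max_tiles=40):
    """Resize the image to fill the chosen grid, then cut 448x448 tiles."""
    cols, rows = pick_tile_grid(img.width, img.height, tile, max_tiles)
    resized = img.resize((cols * tile, rows * tile))
    return [resized.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
            for r in range(rows) for c in range(cols)]

# Example: a 3840x2160 (4K) input maps to a 7x4 grid, i.e. 28 tiles.
```

Under these assumed criteria, a square image yields a single 448$\times$448 tile, while a 4K landscape image fills a wide multi-tile grid, which matches the 1-to-40 tile range described above.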
URL
https://arxiv.org/abs/2404.16821