Abstract
The exponential growth of large language models (LLMs) has opened up numerous possibilities for multi-modal AGI systems. However, progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the large language model, using web-scale image-text data from various sources. This model can be broadly applied to, and achieves state-of-the-art performance on, visual perception tasks such as image-level or pixel-level recognition and vision-language tasks such as zero-shot image/video classification and zero-shot image/video-text retrieval, and it can be linked with LLMs to create multi-modal dialogue systems. We hope our research contributes to the development of multi-modal large models. Code and models are available at this https URL.
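Since the abstract names the zero-shot tasks without detail, here is a minimal sketch of how a contrastively aligned vision-language model of this kind is typically applied to zero-shot image classification. The encoder classes, embedding dimension, prompt handling, and temperature below are illustrative assumptions, not the released InternVL API.

```python
# Minimal sketch of CLIP-style zero-shot classification with a
# contrastively aligned vision/text encoder pair. The toy encoders and
# the temperature value are placeholders, not the actual InternVL code.
import torch
import torch.nn.functional as F

class ToyVisionEncoder(torch.nn.Module):
    """Stand-in for a large ViT; maps images to a shared embedding space."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = torch.nn.Linear(3 * 224 * 224, dim)

    def forward(self, images):                      # (B, 3, 224, 224)
        return self.proj(images.flatten(1))         # (B, dim)

class ToyTextEncoder(torch.nn.Module):
    """Stand-in for the aligned language tower; maps token ids to embeddings."""
    def __init__(self, vocab=32000, dim=512):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, dim)

    def forward(self, token_ids):                   # (N, L)
        return self.emb(token_ids).mean(dim=1)      # (N, dim), mean-pooled

@torch.no_grad()
def zero_shot_classify(image, class_prompt_ids, vis_enc, txt_enc):
    """Score one image against N class prompts by cosine similarity."""
    img = F.normalize(vis_enc(image), dim=-1)               # (1, dim)
    txt = F.normalize(txt_enc(class_prompt_ids), dim=-1)    # (N, dim)
    logits = 100.0 * img @ txt.t()                          # temperature-scaled
    return logits.softmax(dim=-1)                           # class probabilities

# Toy usage: random tensors stand in for a real image and tokenized
# prompts such as "a photo of a {class}".
vis_enc, txt_enc = ToyVisionEncoder(), ToyTextEncoder()
image = torch.randn(1, 3, 224, 224)
prompts = torch.randint(0, 32000, (5, 16))  # 5 classes, 16 tokens each
print(zero_shot_classify(image, prompts, vis_enc, txt_enc))
```

Zero-shot image-text retrieval follows the same pattern with the roles reversed: embed a text query once and rank a gallery of image embeddings by the same cosine similarity.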
URL
https://arxiv.org/abs/2312.14238