Abstract
As the key component of multimodal large language models (MLLMs), the visual encoder largely determines how well an MLLM understands diverse image content. Although large-scale pretrained vision encoders such as those in CLIP and DINOv2 deliver promising performance, we find that no single vision encoder dominates across all kinds of image content. For example, the CLIP vision encoder yields outstanding results on general image understanding but performs poorly on document or chart content. To alleviate the bias of the CLIP vision encoder, we first examine the inherent behavior of different pretrained vision encoders and then propose MoVA, a powerful and novel MLLM that adaptively routes and fuses task-specific vision experts with a coarse-to-fine mechanism. In the coarse-grained stage, we design a context-aware expert routing strategy that dynamically selects the most suitable vision experts according to the user instruction, the input image, and the expertise of each vision expert. This strategy relies on the strong ability of the large language model (LLM), equipped with expert-routing low-rank adaptation (LoRA), to understand what each model is good at. In the fine-grained stage, we design the mixture-of-vision-expert adapter (MoV-Adapter) to extract and fuse task-specific knowledge from the selected experts. This coarse-to-fine paradigm leverages expert representations based on multimodal context and model expertise, further enhancing generalization. We conduct extensive experiments to evaluate the effectiveness of the proposed approach. Without any bells and whistles, MoVA achieves significant performance gains over current state-of-the-art methods on a wide range of challenging multimodal benchmarks. Code and models will be made available.
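The coarse-to-fine idea described above can be illustrated with a toy sketch. This is not the MoVA implementation: in the paper the coarse stage is carried out by an LLM with expert-routing LoRA and the fine stage by the MoV-Adapter, whereas here plain numeric relevance scores and a softmax gate stand in for both. The function names (`route_experts`, `fuse_features`), the `top_k` parameter, and the expert labels are all hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List, Sequence


def route_experts(scores: Dict[str, float], top_k: int = 2) -> List[str]:
    """Coarse stage (toy stand-in): keep the top_k experts by relevance score.

    `scores` is a hypothetical mapping from expert name to a relevance score
    for the current instruction and image.
    """
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    return ranked[:top_k]


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of gate logits."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_features(
    features: Dict[str, List[float]],
    selected: List[str],
    gate_logits: Dict[str, float],
) -> List[float]:
    """Fine stage (toy stand-in): gate-weighted sum of the selected experts' features.

    All feature vectors are assumed to share the same length.
    """
    weights = softmax([gate_logits[name] for name in selected])
    fused = [0.0] * len(features[selected[0]])
    for weight, name in zip(weights, selected):
        for i, value in enumerate(features[name]):
            fused[i] += weight * value
    return fused


# Hypothetical example: a chart-reading question favors document and chart experts.
scores = {"general": 0.2, "document": 0.9, "chart": 0.7}
selected = route_experts(scores, top_k=2)
features = {"general": [1.0, 1.0], "document": [1.0, 0.0], "chart": [0.0, 1.0]}
fused = fuse_features(features, selected, {"document": 0.0, "chart": 0.0})
```

Routing first and fusing second means that only the selected experts contribute to the fused representation; experts judged irrelevant in the coarse stage are skipped entirely in this sketch.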
URL
https://arxiv.org/abs/2404.13046