Paper Reading AI Learner

MoVA: Adapting Mixture of Vision Experts to Multimodal Context

2024-04-19 17:59:48
Zhuofan Zong, Bingqi Ma, Dazhong Shen, Guanglu Song, Hao Shao, Dongzhi Jiang, Hongsheng Li, Yu Liu

Abstract

As the key component of multimodal large language models (MLLMs), the capability of the vision encoder greatly affects an MLLM's understanding of diverse image content. Although large-scale pretrained vision encoders such as those in CLIP and DINOv2 have delivered promising performance, we found that no single vision encoder dominates across all types of image content understanding: for example, the CLIP vision encoder yields outstanding results on general image understanding but performs poorly on document or chart content. To alleviate the bias of the CLIP vision encoder, we first delve into the inherent behavior of different pre-trained vision encoders and then propose MoVA, a powerful and novel MLLM that adaptively routes and fuses task-specific vision experts with a coarse-to-fine mechanism. In the coarse-grained stage, we design a context-aware expert-routing strategy that dynamically selects the most suitable vision experts according to the user instruction, the input image, and the expertise of each vision expert. This benefits from the powerful model-function understanding ability of the large language model (LLM) equipped with expert-routing low-rank adaptation (LoRA). In the fine-grained stage, we carefully design the mixture-of-vision-expert adapter (MoV-Adapter) to extract and fuse task-specific knowledge from the various experts. This coarse-to-fine paradigm effectively leverages expert representations based on the multimodal context and model expertise, further enhancing generalization. We conduct extensive experiments to evaluate the effectiveness of the proposed approach. Without any bells and whistles, MoVA achieves significant performance gains over current state-of-the-art methods on a wide range of challenging multimodal benchmarks. Code and models will be available at this https URL.
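The abstract's coarse-to-fine idea can be sketched in a few lines: a coarse stage picks the top-k vision experts for the current instruction, and a fine stage mixes the selected experts' features with gate weights. This is an illustrative toy only — the expert names (`clip`, `dinov2`, `doc_ocr`), the keyword heuristic standing in for the LoRA-equipped LLM router, and the random features and gate logits standing in for the MoV-Adapter are all assumptions, not the paper's implementation.

```python
# Hypothetical sketch of MoVA-style coarse-to-fine routing and fusion.
# All names, shapes, and heuristics below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Three pretrained vision experts, each producing a feature map for the
# same image (random stand-ins here, shape: 16 tokens x 64 dims).
EXPERTS = {
    "clip": lambda img: rng.standard_normal((16, 64)),
    "dinov2": lambda img: rng.standard_normal((16, 64)),
    "doc_ocr": lambda img: rng.standard_normal((16, 64)),
}

def coarse_route(instruction: str, k: int = 2) -> list:
    """Coarse stage: select the k most relevant experts for the task.
    The paper uses an LLM with expert-routing LoRA; this toy version
    uses a keyword heuristic purely for illustration."""
    scores = {name: 0.0 for name in EXPERTS}
    if any(w in instruction.lower() for w in ("chart", "document", "text")):
        scores["doc_ocr"] += 1.0
    scores["clip"] += 0.5  # general-purpose prior
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fine_fuse(image, selected):
    """Fine stage: a stand-in for the MoV-Adapter that mixes the
    selected experts' features with gate weights (random here; in the
    real model these would be predicted from the multimodal context)."""
    feats = np.stack([EXPERTS[name](image) for name in selected])  # (k, 16, 64)
    gate_logits = rng.standard_normal(len(selected))
    weights = softmax(gate_logits)                                  # sums to 1
    return np.tensordot(weights, feats, axes=1)                     # (16, 64)

selected = coarse_route("Summarize this chart")
fused = fine_fuse(image=None, selected=selected)
print(selected, fused.shape)
```

The keyword check routes the chart instruction to the document expert first, and the fused output keeps the per-expert feature shape, mirroring the abstract's claim that routing depends on the instruction while fusion happens at the feature level.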


URL

https://arxiv.org/abs/2404.13046

PDF

https://arxiv.org/pdf/2404.13046.pdf
