Abstract
Vision-language models are effective in general domains and show strong performance in diverse multi-modal applications such as visual question answering (VQA), but they struggle to maintain the same level of effectiveness in more specialized domains such as medicine. We propose a medical vision-language model that integrates large vision and language models adapted for the medical domain. The model is trained in three parameter-efficient stages on three separate biomedical and radiology multi-modal vision-and-text datasets. It achieves state-of-the-art performance on the SLAKE 1.0 medical VQA (MedVQA) dataset with an overall accuracy of 87.5%, and demonstrates strong performance on another MedVQA dataset, VQA-RAD, with an overall accuracy of 73.2%.
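The abstract mentions parameter-efficient training but does not specify the method here. A common choice for adapting large frozen models is low-rank adaptation (LoRA), sketched below in plain Python: the pretrained weight W stays frozen, and only two small low-rank factors A and B are trainable. The class and initialization values are illustrative assumptions, not the paper's implementation.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

class LoRALinear:
    """Minimal LoRA-style adapted linear layer: y = W x + scale * B (A x)."""

    def __init__(self, W, rank, scale=1.0):
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen pretrained weight; never updated during training
        # Trainable low-rank factors. B starts at zero so the adapted layer
        # initially computes exactly the same output as the frozen base layer.
        self.A = [[0.01] * d_in for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = scale

    def forward(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

layer = LoRALinear(W=[[1.0, 0.0], [0.0, 1.0]], rank=1)
print(layer.forward([2.0, 3.0]))  # B is zero, so this equals W @ x
```

With rank much smaller than the layer dimensions, only the A and B entries need gradients, which is what makes multi-stage fine-tuning on several datasets affordable.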
URL
https://arxiv.org/abs/2404.16192