Abstract
Medical visual question answering (Med-VQA) aims to automatically predict correct answers to questions about medical images, helping physicians reduce repetitive tasks and alleviating their workload. Existing approaches primarily focus on pre-training models on additional, comprehensive datasets and then fine-tuning them to enhance performance on downstream tasks. However, there is also significant value in mining existing models for clinically relevant information. In this paper, we propose the Latent Prompt Assist (LaPA) model for medical visual question answering. First, we design a latent prompt generation module that generates a latent prompt under the constraint of the target answer. Next, we propose a multi-modal fusion block with a latent prompt fusion module that uses the latent prompt to extract clinically relevant information from uni-modal and multi-modal features. We further introduce a prior knowledge fusion module that integrates the relationship between diseases and organs with this clinically relevant information. Finally, we combine the integrated information with image-language cross-modal information to predict the final answer. Experimental results on three publicly available Med-VQA datasets demonstrate that LaPA outperforms the state-of-the-art model ARL, with improvements of 1.83%, 0.63%, and 1.80% on VQA-RAD, SLAKE, and VQA-2019, respectively. The code is publicly available at this https URL.
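The dataflow described in the abstract (latent prompt generation, latent prompt fusion with uni-modal/multi-modal features, prior knowledge fusion, and final answer prediction) can be sketched as a simple function composition. Everything below is a hypothetical illustration of that dataflow only: the function names, the use of element-wise averaging as a stand-in for the learned fusion modules, and the toy feature vectors are all assumptions, not the authors' implementation.

```python
# Toy sketch of the LaPA pipeline's dataflow, per the abstract.
# "fuse" stands in for the learned fusion modules; real LaPA uses
# trained neural blocks, not element-wise averaging.

def fuse(a, b):
    # Placeholder fusion: element-wise mean of two feature vectors.
    return [(x + y) / 2 for x, y in zip(a, b)]

def lapa_forward(img_feat, txt_feat, latent_prompt, prior_knowledge):
    # Image-language cross-modal information.
    cross_modal = fuse(img_feat, txt_feat)
    # Latent prompt fusion: extract clinically relevant information
    # from the uni-modal image and text features.
    clinical = fuse(fuse(latent_prompt, img_feat),
                    fuse(latent_prompt, txt_feat))
    # Prior knowledge fusion: integrate the disease-organ relationship.
    integrated = fuse(clinical, prior_knowledge)
    # Combine integrated and cross-modal information for the answer head.
    return fuse(integrated, cross_modal)

# Toy 2-dimensional features for illustration.
out = lapa_forward([1.0, 1.0], [3.0, 3.0], [2.0, 2.0], [0.0, 0.0])
```

The sketch only shows where each module's output feeds into the next stage; in the paper each `fuse` step is a distinct learned module and the result is passed to an answer classifier.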
URL
https://arxiv.org/abs/2404.13039