Abstract
Several medical Multimodal Large Language Models (MLLMs) have been developed to address tasks involving visual images with textual instructions across various medical modalities, achieving impressive results. Most current medical generalist models are region-agnostic, treating the entire image as a holistic representation. However, they struggle to identify which specific regions they are focusing on when generating a response. To mimic the behavior of doctors, who typically begin by reviewing the entire image before concentrating on specific regions for a thorough evaluation, we aim to enhance the capability of medical MLLMs to understand anatomical regions within entire medical scans. To achieve this, we first formulate Region-Centric tasks and construct a large-scale dataset, MedRegInstruct, to incorporate regional information into training. Combining our collected dataset with other medical multimodal corpora for training, we propose a Region-Aware medical MLLM, MedRegA, the first bilingual generalist medical AI system to simultaneously handle image-level and region-level medical vision-language tasks across a broad range of modalities. MedRegA not only enables three region-centric tasks, but also achieves the best performance in visual question answering, report generation, and medical image classification over 8 modalities, showcasing significant versatility. Experiments demonstrate that our model not only delivers strong performance across various medical vision-language tasks in bilingual settings, but also recognizes and detects structures in multimodal medical scans, improving the interpretability and user interactivity of medical MLLMs. Our project page is this https URL.
URL
https://arxiv.org/abs/2410.18387