Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks

2024-10-24 02:55:41
Lehan Wang, Haonan Wang, Honglong Yang, Jiaji Mao, Zehong Yang, Jun Shen, Xiaomeng Li

Abstract

Several medical Multimodal Large Language Models (MLLMs) have been developed to address tasks involving visual images with textual instructions across various medical modalities, achieving impressive results. Most current medical generalist models are region-agnostic, treating the entire image as a holistic representation. However, they struggle to identify which specific regions they are focusing on when generating a sentence. To mimic the behavior of doctors, who typically begin by reviewing the entire image before concentrating on specific regions for a thorough evaluation, we aim to enhance the capability of medical MLLMs in understanding anatomical regions within entire medical scans. To achieve this, we first formulate Region-Centric tasks and construct a large-scale dataset, MedRegInstruct, to incorporate regional information into training. Combining our collected dataset with other medical multimodal corpora for training, we propose a Region-Aware medical MLLM, MedRegA, which is the first bilingual generalist medical AI system to simultaneously handle image-level and region-level medical vision-language tasks across a broad range of modalities. MedRegA not only enables three region-centric tasks, but also achieves the best performance for visual question answering, report generation, and medical image classification over 8 modalities, showcasing significant versatility. Experiments demonstrate that our model not only achieves strong performance across various medical vision-language tasks in bilingual settings, but also recognizes and detects structures in multimodal medical scans, boosting the interpretability and user interactivity of medical MLLMs. Our project page is this https URL.

URL

https://arxiv.org/abs/2410.18387

PDF

https://arxiv.org/pdf/2410.18387.pdf

