
RadVLM: A Multitask Conversational Vision-Language Model for Radiology

2025-02-05 16:27:02
Nicolas Deperrois, Hidetoshi Matsuo, Samuel Ruipérez-Campillo, Moritz Vandenhirtz, Sonia Laguna, Alain Ryser, Koji Fujimoto, Mizuho Nishio, Thomas M. Sutter, Julia E. Vogt, Jonas Kluckert, Thomas Frauenfelder, Christian Blüthgen, Farhad Nooralahzadeh, Michael Krauthammer

Abstract

The widespread use of chest X-rays (CXRs), coupled with a shortage of radiologists, has driven growing interest in automated CXR analysis and AI-assisted reporting. While existing vision-language models (VLMs) show promise in specific tasks such as report generation or abnormality detection, they often lack support for interactive diagnostic capabilities. In this work we present RadVLM, a compact, multitask conversational foundation model designed for CXR interpretation. To this end, we curate a large-scale instruction dataset comprising over 1 million image-instruction pairs containing both single-turn tasks -- such as report generation, abnormality classification, and visual grounding -- and multi-turn, multi-task conversational interactions. After fine-tuning RadVLM on this instruction dataset, we evaluate it across different tasks along with re-implemented baseline VLMs. Our results show that RadVLM achieves state-of-the-art performance in conversational capabilities and visual grounding while remaining competitive in other radiology tasks. Ablation studies further highlight the benefit of joint training across multiple tasks, particularly for scenarios with limited annotated data. Together, these findings highlight the potential of RadVLM as a clinically relevant AI assistant, providing structured CXR interpretation and conversational capabilities to support more effective and accessible diagnostic workflows.
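To make the dataset description above more concrete, here is a minimal sketch of how single-turn and multi-turn image-instruction pairs of the kind the abstract describes could be represented. The field names, file names, and answer text are illustrative assumptions for this listing, not the actual RadVLM schema.

```python
# Hypothetical structure for image-instruction pairs (not the RadVLM release format).

# Single-turn pair: one instruction, one answer, tied to one chest X-ray.
single_turn_example = {
    "image": "cxr_00123.jpg",  # path to a frontal chest X-ray (assumed file name)
    "task": "abnormality_classification",
    "conversation": [
        {"role": "user", "content": "Which abnormalities are visible in this chest X-ray?"},
        {"role": "assistant", "content": "Cardiomegaly and a small left pleural effusion."},
    ],
}

# Multi-turn, multi-task pair: several exchanges over the same image,
# mixing report generation and visual grounding in one dialogue.
multi_turn_example = {
    "image": "cxr_00456.jpg",
    "task": "conversation",
    "conversation": [
        {"role": "user", "content": "Write the findings section for this image."},
        {"role": "assistant", "content": "The lungs are clear. The cardiac silhouette is enlarged. A support device projects over the upper mediastinum."},
        {"role": "user", "content": "Where exactly is the support device located?"},
        {"role": "assistant", "content": "Approximate bounding box [0.42, 0.10, 0.58, 0.35] over the upper mediastinum."},
    ],
}
```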


URL

https://arxiv.org/abs/2502.03333

PDF

https://arxiv.org/pdf/2502.03333.pdf

