Abstract
The widespread use of chest X-rays (CXRs), coupled with a shortage of radiologists, has driven growing interest in automated CXR analysis and AI-assisted reporting. While existing vision-language models (VLMs) show promise in specific tasks such as report generation or abnormality detection, they often lack interactive diagnostic capabilities. In this work, we present RadVLM, a compact, multitask conversational foundation model designed for CXR interpretation. To this end, we curate a large-scale instruction dataset comprising over 1 million image-instruction pairs, covering both single-turn tasks -- such as report generation, abnormality classification, and visual grounding -- and multi-turn, multi-task conversational interactions. After fine-tuning RadVLM on this instruction dataset, we evaluate it across these tasks and compare it against re-implemented baseline VLMs. Our results show that RadVLM achieves state-of-the-art performance in conversational capabilities and visual grounding while remaining competitive in other radiology tasks. Ablation studies further highlight the benefit of joint training across multiple tasks, particularly in scenarios with limited annotated data. Together, these findings underscore the potential of RadVLM as a clinically relevant AI assistant, providing structured CXR interpretation and conversational capabilities to support more effective and accessible diagnostic workflows.
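To make the dataset description concrete, below is a minimal sketch of how single-turn and multi-turn image-instruction pairs might be structured for fine-tuning. The field names, file names, answers, and bounding-box format are hypothetical illustrations under assumed conventions, not the paper's actual data schema.

```python
# Hypothetical sketch (not the paper's format): one way image-instruction pairs
# for single-turn and multi-turn conversational fine-tuning could be represented.
single_turn_example = {
    "image": "cxr_00123.png",  # placeholder file name
    "conversation": [
        {"role": "user", "content": "Write the findings section for this chest X-ray."},
        {"role": "assistant", "content": "The lungs are clear. No pleural effusion or pneumothorax."},
    ],
}

multi_turn_example = {
    "image": "cxr_00456.png",  # placeholder file name
    "conversation": [
        {"role": "user", "content": "Is there any abnormality?"},
        {"role": "assistant", "content": "There is an opacity in the right lower lobe."},
        {"role": "user", "content": "Where exactly is it located?"},
        # Assumed grounding convention: normalized [x_min, y_min, x_max, y_max]
        {"role": "assistant", "content": "Approximate bounding box: [0.55, 0.60, 0.85, 0.90]."},
    ],
}

if __name__ == "__main__":
    # Quick sanity check: print the image name and number of dialogue turns.
    for ex in (single_turn_example, multi_turn_example):
        print(ex["image"], "-", len(ex["conversation"]) // 2, "turn(s)")
```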
URL
https://arxiv.org/abs/2502.03333