Paper Reading AI Learner

RSNA Large Language Model Benchmark Dataset for Chest Radiographs of Cardiothoracic Disease: Radiologist Evaluation and Validation Enhanced by AI Labels

2026-01-21 16:04:01
Yishu Wei, Adam E. Flanders, Errol Colak, John Mongan, Luciano M Prevedello, Po-Hao Chen, Henrique Min Ho Lee, Gilberto Szarf, Hamilton Shoji, Jason Sho, Katherine Andriole, Tessa Cook, Lisa C. Adams, Linda C. Chu, Maggie Chung, Geraldine Brusca-Augello, Djeven P. Deva, Navneet Singh, Felipe Sanchez Tijmes, Jeffrey B. Alpert, Elsie T. Nguyen, Drew A. Torigian, Kate Hanneman, Lauren K Groner, Alexander Phan, Ali Islam, Matias F. Callejas, Gustavo Borges da Silva Teles, Faisal Jamal, Maryam Vazirabad, Ali Tejani, Hari Trivedi, Paulo Kuriki, Rajesh Bhayana, Elana T. Benishay, Yi Lin, Yifan Peng, George Shih

Abstract

Multimodal large language models (LLMs) have demonstrated performance comparable to that of radiology trainees on multiple-choice board-style exams. However, developing clinically useful multimodal LLM tools requires high-quality benchmarks curated by domain experts. This work aimed to curate released and holdout datasets of 100 chest radiographic studies each and to propose an artificial intelligence (AI)-assisted expert labeling procedure that allows radiologists to label studies more efficiently. A total of 13,735 deidentified chest radiographs and their corresponding reports from the Medical Imaging and Data Resource Center (MIDRC) were used. GPT-4o extracted abnormal findings from the reports, which were then mapped to 12 benchmark labels by a locally hosted LLM (Phi-4-Reasoning). From these studies, 1,000 were sampled on the basis of the AI-suggested benchmark labels for expert review; the sampling algorithm ensured that the selected studies were clinically relevant and spanned a range of difficulty levels. Seventeen chest radiologists participated, marking "Agree all", "Agree mostly", or "Disagree" to indicate their assessment of the correctness of the LLM-suggested labels; each chest radiograph was evaluated by three experts. At least two radiologists selected "Agree all" for 381 radiographs. From this set, 200 were selected, prioritizing those with less common or multiple finding labels, and divided into 100 released radiographs and 100 reserved as the holdout dataset; the holdout dataset is used exclusively by RSNA to independently evaluate different models. The resulting benchmark of 200 chest radiographic studies with 12 benchmark labels, each radiograph verified by three radiologists, was made publicly available at this https URL. In addition, an AI-assisted labeling procedure was developed to help radiologists label at scale, minimize unnecessary omissions, and support a semicollaborative environment.
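The consensus rule described above (each study reviewed by three radiologists, retained only when at least two marked "Agree all") can be sketched as a small filter. This is an illustrative reconstruction, not the authors' code; the function name and rating strings are assumptions based on the abstract.

```python
from collections import Counter

def retained_for_benchmark(ratings, min_agree_all=2):
    """Return True if a study passes the consensus rule from the abstract.

    ratings: the three radiologists' assessments for one study, each one of
    "Agree all", "Agree mostly", or "Disagree".
    """
    counts = Counter(ratings)
    # Keep the study only if at least `min_agree_all` raters fully agreed
    # with the AI-suggested labels.
    return counts["Agree all"] >= min_agree_all

# Two of three raters fully agree -> retained.
print(retained_for_benchmark(["Agree all", "Agree all", "Disagree"]))        # True
# Only one full agreement -> excluded (the abstract reports 381 of the
# 1,000 reviewed studies met the criterion).
print(retained_for_benchmark(["Agree all", "Agree mostly", "Agree mostly"])) # False
```

From the 381 studies passing this filter, the authors then selected 200 (prioritizing rarer or multiple finding labels) and split them evenly into released and holdout sets.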

URL

https://arxiv.org/abs/2601.15129

PDF

https://arxiv.org/pdf/2601.15129.pdf

