CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios

Abstract

Medical Vision-Language Pretraining (Med-VLP) establishes a connection between the visual content of medical images and the corresponding textual descriptions. Existing Med-VLP methods primarily focus on 2D images depicting a single body part, notably chest X-rays. In this paper, we extend the scope of Med-VLP to 3D images, specifically targeting full-body scenarios, by using a multimodal dataset of CT images and reports. Compared with its 2D counterpart, 3D VLP must capture the essential semantics from the significantly sparser representations of 3D imaging. To this end, we introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning, aligning grounded visual features with precise diagnostic text. Additionally, we develop an abnormality dictionary to augment contrastive learning with diverse negative samples. Our method is trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, and it can identify organs and abnormalities in a zero-shot manner using natural language. The performance of CT-GLIP is validated on a separate test set of 1,130 patients, focusing on the 16 most frequent abnormalities across 7 organs. The experimental results show that our model outperforms the standard CLIP framework in both zero-shot and fine-tuning scenarios, using both CNN and ViT architectures.
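The abstract does not give the loss function, so the following is only a minimal sketch of the general idea, assuming a standard image-to-text InfoNCE objective. Each organ-level visual feature is paired with its own report text, and the set of negatives is enlarged with extra text embeddings standing in for entries drawn from an abnormality dictionary. The function names, the temperature value, and the toy 2-D embeddings are all hypothetical and are not taken from the paper.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(organ_feats, text_feats, extra_neg_feats, temperature=0.07):
    """Image-to-text InfoNCE loss over organ-level pairs (illustrative only).

    organ_feats[i] and text_feats[i] form a positive pair. Every other text in
    the batch, plus every entry of extra_neg_feats, acts as a negative.
    """
    organ_feats = [normalize(v) for v in organ_feats]
    candidates = [normalize(v) for v in list(text_feats) + list(extra_neg_feats)]
    total = 0.0
    for i, organ in enumerate(organ_feats):
        logits = [dot(organ, c) / temperature for c in candidates]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability of the matching text for this organ.
        total += log_denominator - logits[i]
    return total / len(organ_feats)


# Toy 2-D embeddings (hypothetical values, chosen only for illustration).
organ_feats = [[1.0, 0.0], [0.0, 1.0]]
text_feats = [[0.9, 0.1], [0.1, 0.9]]
dictionary_negatives = [[0.7, 0.7]]

loss_batch_only = contrastive_loss(organ_feats, text_feats, [])
loss_with_dictionary = contrastive_loss(organ_feats, text_feats, dictionary_negatives)
print(loss_batch_only, loss_with_dictionary)
```

In this toy setting the extra negative only adds a term to the softmax denominator, so the loss with dictionary negatives is strictly larger than the batch-only loss. That is the sense in which additional negatives make the contrastive task harder; how CT-GLIP actually samples and weights them is not specified in the abstract.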

URL

https://arxiv.org/abs/2404.15272

PDF

https://arxiv.org/pdf/2404.15272.pdf
