Abstract
Although the fusion of information from multiple mammographic views plays an important role in increasing the accuracy of breast cancer detection, developing multi-view mammogram-based computer-aided diagnosis (CAD) schemes still faces challenges, and no such CAD scheme has been used in clinical practice. To overcome these challenges, we investigate a new approach based on Contrastive Language-Image Pre-training (CLIP), which has sparked interest across various medical imaging tasks. By solving the challenges of (1) effectively adapting the single-view CLIP for multi-view feature fusion and (2) efficiently fine-tuning this parameter-dense model with limited samples and computational resources, we introduce Mammo-CLIP, the first multi-modal framework to process multi-view mammograms and corresponding simple texts. Mammo-CLIP uses an early feature fusion strategy to learn multi-view relationships among the four mammograms acquired from the CC and MLO views of the left and right breasts. To enhance learning efficiency, plug-and-play adapters are added to the CLIP image and text encoders, limiting fine-tuning updates to about 1% of the model's parameters. For framework evaluation, we retrospectively assembled two datasets. The first dataset, comprising 470 malignant and 479 benign cases, was used for few-shot fine-tuning and internal evaluation of the proposed Mammo-CLIP via 5-fold cross-validation. The second dataset, including 60 malignant and 294 benign cases, was used to test the generalizability of Mammo-CLIP. Study results show that Mammo-CLIP outperforms the state-of-the-art cross-view transformer in AUC on both datasets (0.841 vs. 0.817 and 0.837 vs. 0.807). It also surpasses two previous CLIP-based methods by 20.3% and 14.3%, respectively. This study highlights the potential of applying fine-tuned vision-language models to develop next-generation, image-text-based CAD schemes for breast cancer.
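The abstract names two mechanisms: plug-and-play adapters inside the CLIP encoders and early fusion of the four view embeddings. Below is a minimal PyTorch sketch of how such components are commonly structured; the class names, the bottleneck width of 64, and the concatenate-then-project fusion rule are illustrative assumptions, since the abstract does not specify the paper's exact design.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Plug-and-play bottleneck adapter: a small residual MLP whose
    weights are the only ones updated during fine-tuning (assumed design)."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # project down to the bottleneck
        self.up = nn.Linear(bottleneck, dim)    # project back to encoder width
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual update

class MultiViewFusion(nn.Module):
    """Early fusion of the four standard views (L-CC, L-MLO, R-CC, R-MLO):
    concatenate the per-view embeddings and project back to the joint
    image-text embedding dimension (assumed fusion rule)."""
    def __init__(self, dim: int, num_views: int = 4):
        super().__init__()
        self.proj = nn.Linear(num_views * dim, dim)

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (batch, num_views, dim) -> fused: (batch, dim)
        return self.proj(views.flatten(start_dim=1))

if __name__ == "__main__":
    dim = 512                            # e.g., CLIP ViT-B/32 embedding width
    view_feats = torch.randn(8, 4, dim)  # stand-in for frozen CLIP image features
    # In a full pipeline the pretrained CLIP backbone would be frozen, e.g.
    #   for p in clip_model.parameters(): p.requires_grad = False
    # so that only the adapter and fusion layers train.
    fused = MultiViewFusion(dim)(Adapter(dim)(view_feats))
    print(fused.shape)                   # torch.Size([8, 512])

Under these assumptions, the trainable parameters amount to roughly 2 * dim * bottleneck per adapter plus num_views * dim^2 for the fusion projection, which is small relative to the hundreds of millions of parameters in a full CLIP model and consistent in spirit with the reported ~1% update budget.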
Abstract (translated)
Although the fusion of information from multiple mammographic views plays an important role in improving the accuracy of breast cancer detection, developing computer-aided diagnosis (CAD) schemes based on multi-view mammograms still faces challenges, and no such CAD scheme has been applied in clinical practice. To overcome these challenges, we investigate a new approach based on Contrastive Language-Image Pre-training (CLIP), which has attracted attention across various medical imaging tasks. By solving the challenges of (1) effectively adapting the single-view CLIP for multi-view feature fusion and (2) efficiently fine-tuning this parameter-dense model with limited samples and computational resources, we introduce Mammo-CLIP, the first multi-modal framework to process multi-view mammograms together with corresponding simple texts. Mammo-CLIP uses an early feature fusion strategy to learn multi-view relationships among the four mammograms acquired from the CC and MLO views of the left and right breasts. To improve learning efficiency, adapters are added to the CLIP image and text encoders to fine-tune parameters while limiting updates to about 1% of the parameters. For framework evaluation, we retrospectively assembled two datasets. The first dataset, comprising 470 malignant and 479 benign cases, was used for fine-tuning and internal evaluation of the proposed Mammo-CLIP via 5-fold cross-validation. The second dataset, including 60 malignant and 294 benign cases, was used to test the generalizability of Mammo-CLIP. The results show that Mammo-CLIP outperforms the state-of-the-art cross-view Transformer in AUC on both datasets (0.841 vs. 0.817 and 0.837 vs. 0.807). It also surpasses two previous CLIP-based methods by 20.3% and 14.3%. This study highlights the potential of applying fine-tuned vision-language models to develop next-generation, image-text-based CAD schemes for breast cancer.
URL
https://arxiv.org/abs/2404.15946