Abstract
Although the fusion of information from multiple mammographic views plays an important role in increasing the accuracy of breast cancer detection, developing multi-view mammogram-based computer-aided diagnosis (CAD) schemes still faces challenges, and no such CAD scheme has been used in clinical practice. To overcome these challenges, we investigate a new approach based on Contrastive Language-Image Pre-training (CLIP), which has sparked interest across various medical imaging tasks. By solving the challenges of (1) effectively adapting the single-view CLIP for multi-view feature fusion and (2) efficiently fine-tuning this parameter-dense model with limited samples and computational resources, we introduce Mammo-CLIP, the first multi-modal framework to process multi-view mammograms and corresponding simple texts. Mammo-CLIP uses an early feature fusion strategy to learn multi-view relationships among the four mammograms acquired from the CC and MLO views of the left and right breasts. To enhance learning efficiency, plug-and-play adapters are added to the CLIP image and text encoders, limiting fine-tuning updates to about 1% of the model's parameters. For framework evaluation, we retrospectively assembled two datasets. The first dataset, comprising 470 malignant and 479 benign cases, was used for few-shot fine-tuning and internal evaluation of the proposed Mammo-CLIP via 5-fold cross-validation. The second dataset, including 60 malignant and 294 benign cases, was used to test the generalizability of Mammo-CLIP. Study results show that Mammo-CLIP outperforms the state-of-the-art cross-view transformer in AUC on both datasets (0.841 vs. 0.817 and 0.837 vs. 0.807). It also surpasses two previous CLIP-based methods by 20.3% and 14.3%, respectively. This study highlights the potential of applying fine-tuned vision-language models to develop next-generation, image-text-based CAD schemes for breast cancer.
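The abstract names two mechanisms: plug-and-play adapters inside the CLIP encoders and early fusion of the four view embeddings. Below is a minimal PyTorch sketch of how such components are commonly structured; the class names, the bottleneck width of 64, and the concatenate-then-project fusion rule are illustrative assumptions, since the abstract does not specify the paper's exact design.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Plug-and-play bottleneck adapter: a small residual MLP whose
    weights are the only ones updated during fine-tuning (assumed design)."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # project down to the bottleneck
        self.up = nn.Linear(bottleneck, dim)    # project back to encoder width
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))  # residual update

class MultiViewFusion(nn.Module):
    """Early fusion of the four standard views (L-CC, L-MLO, R-CC, R-MLO):
    concatenate the per-view embeddings and project back to the joint
    image-text embedding dimension (assumed fusion rule)."""
    def __init__(self, dim: int, num_views: int = 4):
        super().__init__()
        self.proj = nn.Linear(num_views * dim, dim)

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (batch, num_views, dim) -> fused: (batch, dim)
        return self.proj(views.flatten(start_dim=1))

if __name__ == "__main__":
    dim = 512                            # e.g., CLIP ViT-B/32 embedding width
    view_feats = torch.randn(8, 4, dim)  # stand-in for frozen CLIP image features
    # In a full pipeline the pretrained CLIP backbone would be frozen, e.g.
    #   for p in clip_model.parameters(): p.requires_grad = False
    # so that only the adapter and fusion layers train.
    fused = MultiViewFusion(dim)(Adapter(dim)(view_feats))
    print(fused.shape)                   # torch.Size([8, 512])

Under these assumptions, the trainable parameters amount to roughly 2 * dim * bottleneck per adapter plus num_views * dim^2 for the fusion projection, which is small relative to the hundreds of millions of parameters in a full CLIP model and consistent in spirit with the reported ~1% update budget.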
Abstract (translated)
Although the fusion of information from multiple mammographic views plays an important role in improving the accuracy of breast cancer detection, developing computer-aided diagnosis (CAD) schemes based on multi-view mammograms still faces challenges, and no such CAD scheme has been applied in clinical practice. To overcome these challenges, we investigate a new approach based on Contrastive Language-Image Pre-training (CLIP), which has attracted attention across various medical imaging tasks. By solving the challenges of (1) effectively adapting the single-view CLIP for multi-view feature fusion and (2) efficiently fine-tuning this parameter-dense model with limited samples and computational resources, we introduce Mammo-CLIP, the first multi-modal framework to process multi-view mammograms together with corresponding simple texts. Mammo-CLIP uses an early feature fusion strategy to learn multi-view relationships among the four mammograms acquired from the CC and MLO views of the left and right breasts. To improve learning efficiency, adapters are added to the CLIP image and text encoders to fine-tune parameters while limiting updates to about 1% of the parameters. For framework evaluation, we retrospectively assembled two datasets. The first dataset, comprising 470 malignant and 479 benign cases, was used for fine-tuning and internal evaluation of the proposed Mammo-CLIP via 5-fold cross-validation. The second dataset, including 60 malignant and 294 benign cases, was used to test the generalizability of Mammo-CLIP. The results show that Mammo-CLIP outperforms the state-of-the-art cross-view Transformer in AUC on both datasets (0.841 vs. 0.817 and 0.837 vs. 0.807). It also surpasses two previous CLIP-based methods by 20.3% and 14.3%. This study highlights the potential of applying fine-tuned vision-language models to develop next-generation, image-text-based CAD schemes for breast cancer.
URL
https://arxiv.org/abs/2404.15946