
Mammo-CLIP: Leveraging Contrastive Language-Image Pre-training for Enhanced Breast Cancer Diagnosis with Multi-view Mammography

2024-04-24 16:07:31
Xuxin Chen, Yuheng Li, Mingzhe Hu, Ella Salari, Xiaoqian Chen, Richard L.J. Qiu, Bin Zheng, Xiaofeng Yang

Abstract

Although fusion of information from multiple views of mammograms plays an important role in increasing the accuracy of breast cancer detection, developing multi-view mammogram-based computer-aided diagnosis (CAD) schemes still faces challenges, and no such schemes have been used in clinical practice. To overcome these challenges, we investigate a new approach based on Contrastive Language-Image Pre-training (CLIP), which has sparked interest across various medical imaging tasks. By addressing two challenges, (1) effectively adapting the single-view CLIP for multi-view feature fusion and (2) efficiently fine-tuning this parameter-dense model with limited samples and computational resources, we introduce Mammo-CLIP, the first multi-modal framework to process multi-view mammograms and corresponding simple texts. Mammo-CLIP uses an early feature fusion strategy to learn multi-view relationships among the four mammograms acquired from the CC and MLO views of the left and right breasts. To enhance learning efficiency, plug-and-play adapters are added to the CLIP image and text encoders, limiting updates to about 1% of the parameters during fine-tuning. For framework evaluation, we retrospectively assembled two datasets. The first dataset, comprising 470 malignant and 479 benign cases, was used for few-shot fine-tuning and internal evaluation of the proposed Mammo-CLIP via 5-fold cross-validation. The second dataset, including 60 malignant and 294 benign cases, was used to test the generalizability of Mammo-CLIP. Study results show that Mammo-CLIP outperforms the state-of-the-art cross-view transformer in AUC on both datasets (0.841 vs. 0.817 and 0.837 vs. 0.807, respectively). It also surpasses two previous CLIP-based methods by 20.3% and 14.3%. This study highlights the potential of applying fine-tuned vision-language models to develop next-generation, image-text-based CAD schemes for breast cancer.
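The abstract names two mechanisms, plug-and-play adapters for parameter-efficient fine-tuning and early fusion of the four standard mammographic views (L-CC, L-MLO, R-CC, R-MLO), without giving implementation details. The PyTorch sketch below shows one common way such a design is realized; the residual bottleneck adapter, the concat-then-project fusion, and all layer sizes are illustrative assumptions, not the authors' code, and the fusion here acts on pooled per-view features as a simplified stand-in for wherever Mammo-CLIP's early fusion actually occurs in the encoder.

```python
# Minimal sketch (assumed design, not the paper's released implementation):
# a frozen shared backbone, a trainable residual bottleneck adapter, and
# concat-then-project fusion of four mammographic views.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Residual bottleneck adapter; only these weights are trained."""

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.bottleneck = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.bottleneck(x)  # residual keeps the frozen path intact


class MultiViewFusion(nn.Module):
    """Encode four views with a shared frozen backbone, then fuse features."""

    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # freeze the pre-trained CLIP weights
        self.adapter = Adapter(feat_dim)
        self.fuse = nn.Linear(4 * feat_dim, feat_dim)  # concat-then-project

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (batch, 4, C, H, W) -> per-view features -> fused embedding
        b, v = views.shape[:2]
        feats = self.backbone(views.flatten(0, 1))  # (b*4, feat_dim)
        feats = self.adapter(feats).view(b, v, -1)  # (b, 4, feat_dim)
        return self.fuse(feats.flatten(1))          # (b, feat_dim)


if __name__ == "__main__":
    # Stand-in backbone so the sketch runs without CLIP installed.
    backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
    model = MultiViewFusion(backbone, feat_dim=512)
    out = model(torch.randn(2, 4, 3, 32, 32))
    print(out.shape)  # torch.Size([2, 512])
```

With the backbone frozen, only the adapter and fusion layers train: at feat_dim = 512 that is roughly 1.2M weights, which against a CLIP ViT-B backbone of roughly 150M parameters is on the order of 1%, consistent with the abstract's figure.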

URL

https://arxiv.org/abs/2404.15946

PDF

https://arxiv.org/pdf/2404.15946.pdf

