Paper Reading AI Learner

MedCLIP-SAM: Bridging Text and Image Towards Universal Medical Image Segmentation

2024-03-29 15:59:11
Taha Koleilat, Hojat Asgariandehkordi, Hassan Rivaz, Yiming Xiao

Abstract

Medical image segmentation of anatomical structures and pathology is crucial in modern clinical diagnosis, disease study, and treatment planning. To date, great progress has been made in deep learning-based segmentation techniques, but most methods still lack data efficiency, generalizability, and interactivity. Consequently, the development of new, precise segmentation methods that demand fewer labeled datasets is of utmost importance in medical image analysis. Recently, the emergence of foundation models such as CLIP and the Segment-Anything-Model (SAM), with comprehensive cross-domain representation, has opened the door to interactive and universal image segmentation. However, exploration of these models for data-efficient medical image segmentation remains limited, yet is highly necessary. In this paper, we propose a novel framework, called MedCLIP-SAM, that combines CLIP and SAM models to generate segmentations of clinical scans from text prompts in both zero-shot and weakly supervised settings. To achieve this, we employed a new Decoupled Hard Negative Noise Contrastive Estimation (DHN-NCE) loss to fine-tune the BiomedCLIP model, and used the recent gScoreCAM to generate prompts for obtaining segmentation masks from SAM in a zero-shot setting. Additionally, we explored the use of zero-shot segmentation labels in a weakly supervised paradigm to further improve segmentation quality. Through extensive testing on three diverse segmentation tasks and medical imaging modalities (breast tumor ultrasound, brain tumor MRI, and lung X-ray), our proposed framework demonstrated excellent accuracy.
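
The DHN-NCE loss named in the abstract combines two known contrastive-learning ideas: decoupling (removing the positive pair from the denominator of the InfoNCE objective) and hard-negative weighting (up-weighting negatives that are most similar to the anchor). The sketch below is a minimal, hedged NumPy illustration of that general recipe for one direction (image-to-text) of a CLIP-style batch; the exact weighting scheme, temperature, and symmetrization used in MedCLIP-SAM may differ, and the hyperparameter names `tau` and `beta` here are illustrative assumptions.

```python
import numpy as np

def dhn_nce_loss(img_emb, txt_emb, tau=0.07, beta=0.15):
    """Sketch of a decoupled, hard-negative-weighted contrastive loss.

    Decoupling: the positive pair is excluded from the denominator.
    Hard-negative weighting: each negative is weighted by a softmax over
    beta-scaled similarities, so harder negatives contribute more.
    Image-to-text direction only; a symmetric text-to-image term would
    normally be averaged in. Illustrative, not the paper's exact loss.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sim = img @ txt.T / tau                    # (B, B) similarity logits
    B = sim.shape[0]
    pos = np.diag(sim)                         # positive-pair logits

    mask = ~np.eye(B, dtype=bool)              # keep negatives only
    s_neg = sim[mask].reshape(B, B - 1)        # decoupled denominator terms

    # hard-negative weights: softmax over beta-scaled similarities,
    # rescaled so the weights in each row sum to B - 1 (mean weight 1)
    z = np.exp(beta * s_neg - (beta * s_neg).max(axis=1, keepdims=True))
    w = (B - 1) * z / z.sum(axis=1, keepdims=True)

    # loss_i = -s(i, i) + log( sum_j w_ij * exp(s(i, j)) ), j != i
    loss = -pos + np.log((w * np.exp(s_neg)).sum(axis=1))
    return loss.mean()
```

With perfectly aligned (orthonormal) image/text pairs this loss is low; shuffling the text embeddings so the pairing is wrong raises it, which is the behavior a contrastive fine-tuning objective needs.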

URL

https://arxiv.org/abs/2403.20253

PDF

https://arxiv.org/pdf/2403.20253.pdf

