
RFMedSAM 2: Automatic Prompt Refinement for Enhanced Volumetric Medical Image Segmentation with SAM 2

2025-02-04 22:03:23
Bin Xie, Hao Tang, Yan Yan, Gady Agam

Abstract

Segment Anything Model 2 (SAM 2), a prompt-driven foundation model extending SAM to both image and video domains, has shown superior zero-shot performance compared to its predecessor. Building on SAM's success in medical image segmentation, SAM 2 presents significant potential for further advancement. However, similar to SAM, SAM 2 is limited by its output of binary masks, inability to infer semantic labels, and dependence on precise prompts for the target object area. Additionally, direct application of SAM and SAM 2 to medical image segmentation tasks yields suboptimal results. In this paper, we explore the upper performance limit of SAM 2 using custom fine-tuning adapters, achieving a Dice Similarity Coefficient (DSC) of 92.30% on the BTCV dataset, surpassing the state-of-the-art nnUNet by 12%. Following this, we address the prompt dependency by investigating various prompt generators. We introduce a UNet to autonomously generate predicted masks and bounding boxes, which serve as input to SAM 2. Subsequent dual-stage refinements by SAM 2 further enhance performance. Extensive experiments show that our method achieves state-of-the-art results on the AMOS2022 dataset, with a Dice improvement of 2.9% compared to nnUNet, and outperforms nnUNet by 6.4% on the BTCV dataset.
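The pipeline sketched in the abstract reduces to two concrete pieces: deriving a bounding-box prompt from the UNet's coarse predicted mask, and scoring overlap with the Dice Similarity Coefficient, DSC = 2|P∩G| / (|P| + |G|). The snippet below is a minimal illustration of both steps, not the authors' implementation: `mask_to_box` and `dice` are hypothetical helper names, the toy mask is fabricated for demonstration, and the SAM 2 refinement call itself is omitted since it depends on the released SAM 2 API.

```python
# Illustrative sketch (not the paper's code) of the prompt-generation step:
# a UNet's coarse binary mask is converted into a box prompt for SAM 2,
# and segmentation quality is measured with the Dice Similarity Coefficient.
import numpy as np

def mask_to_box(mask: np.ndarray) -> tuple[int, int, int, int]:
    """Derive an (x_min, y_min, x_max, y_max) box prompt from a binary mask."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-6) -> float:
    """Dice Similarity Coefficient: 2|P∩G| / (|P| + |G|)."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + eps)

# Toy usage: a coarse UNet mask yields a box prompt; SAM 2 would then refine
# the mask given this box (refinement call omitted, API-dependent).
coarse = np.zeros((64, 64), dtype=bool)
coarse[20:40, 15:35] = True
print(mask_to_box(coarse))   # (15, 20, 34, 39)
print(dice(coarse, coarse))  # ~1.0 on identical masks
```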

URL

https://arxiv.org/abs/2502.02741

PDF

https://arxiv.org/pdf/2502.02741.pdf

