Abstract
Segment Anything Model 2 (SAM 2), a prompt-driven foundation model that extends SAM to both the image and video domains, has shown superior zero-shot performance compared to its predecessor. Given SAM's success in medical image segmentation, SAM 2 holds significant potential for further advances. However, like SAM, SAM 2 is limited to binary mask outputs, cannot infer semantic labels, and relies on precise prompts to localize the target region. Moreover, directly applying SAM and SAM 2 to medical image segmentation tasks yields suboptimal results. In this paper, we first explore the upper performance limit of SAM 2 using custom fine-tuning adapters, achieving a Dice Similarity Coefficient (DSC) of 92.30% on the BTCV dataset and surpassing the state-of-the-art nnUNet by 12%. We then address the prompt dependency by investigating several prompt generators and introduce a UNet that autonomously generates predicted masks and bounding boxes, which serve as prompts for SAM 2. Subsequent dual-stage refinement by SAM 2 further enhances performance. Extensive experiments show that our method achieves state-of-the-art results on the AMOS2022 dataset, with a Dice improvement of 2.9% over nnUNet, and outperforms nnUNet by 6.4% on the BTCV dataset.
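The abstract describes a two-part pipeline: a UNet prompt generator proposes a coarse mask and bounding box, which SAM 2 then refines in two stages. The sketch below is one plausible per-slice realisation of that flow, not the authors' implementation: the `unet` model, the `mask_to_bbox` helper, the grayscale-to-RGB pre-processing, and the `SAM2ImagePredictor`-style `set_image`/`predict` interface are all assumptions made for illustration.

```python
# Hedged sketch of a UNet-prompted, dual-stage SAM 2 refinement for one 2D slice.
# All interfaces below are assumed (illustrative only), not taken from the paper.
import numpy as np
import torch


def mask_to_bbox(mask: np.ndarray) -> np.ndarray:
    """Convert a binary mask (H, W) to an [x_min, y_min, x_max, y_max] box."""
    ys, xs = np.nonzero(mask)
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()], dtype=np.float32)


@torch.no_grad()
def segment_slice(image: np.ndarray, unet, sam2_predictor) -> np.ndarray:
    """Stage 0: UNet proposes a coarse mask; stages 1-2: SAM 2 refines it."""
    # Coarse prediction from the UNet prompt generator (assumed to output
    # per-pixel foreground logits for a single organ class).
    logits = unet(torch.from_numpy(image)[None, None].float())
    coarse = (logits.sigmoid()[0, 0].numpy() > 0.5).astype(np.uint8)
    if coarse.sum() == 0:  # nothing detected: return the empty mask unchanged
        return coarse

    box = mask_to_bbox(coarse)

    # SAM 2 expects an RGB uint8 image; replicate the (already window-normalised)
    # slice across three channels. Pre-processing details are assumed.
    rgb = np.repeat(image[..., None], 3, axis=-1).astype(np.uint8)
    sam2_predictor.set_image(rgb)

    # First refinement: prompt SAM 2 with the box derived from the UNet mask.
    masks, scores, low_res = sam2_predictor.predict(box=box, multimask_output=False)

    # Second refinement: feed the first-stage low-resolution mask back in together
    # with the box (the "dual-stage refinement" mentioned in the abstract).
    masks, scores, _ = sam2_predictor.predict(
        box=box, mask_input=low_res, multimask_output=False
    )
    return masks[0].astype(np.uint8)
```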
URL
https://arxiv.org/abs/2502.02741