Surgical-DeSAM: Decoupling SAM for Instrument Segmentation in Robotic Surgery

2024-04-22 09:53:55
Yuyang Sheng, Sophia Bano, Matthew J. Clarkson, Mobarakol Islam

Abstract

Purpose: The recent Segment Anything Model (SAM) has demonstrated impressive performance with point, text, or bounding box prompts in various applications. However, in safety-critical surgical tasks, prompting is not possible because (i) per-frame prompts for supervised learning are lacking, (ii) prompting frame by frame is unrealistic in a real-time tracking application, and (iii) annotating prompts for offline applications is expensive. Methods: We develop Surgical-DeSAM, which automatically generates bounding box prompts for a decoupled SAM to obtain instrument segmentation in real-time robotic surgery. We take a commonly used detection architecture, DETR, and fine-tune it to obtain bounding box prompts for the instruments. We then employ decoupling SAM (DeSAM), replacing the image encoder with the DETR encoder and fine-tuning the prompt encoder and mask decoder to obtain instance segmentation of the surgical instruments. To improve detection performance, we adopt the Swin-transformer for better feature representation. Results: The proposed method has been validated on two publicly available datasets from the MICCAI surgical instrument segmentation challenges, EndoVis 2017 and 2018. We also compare our method with SOTA instrument segmentation methods and demonstrate significant improvements, with Dice scores of 89.62 and 90.70 on EndoVis 2017 and 2018, respectively. Conclusion: Our extensive experiments and validations demonstrate that Surgical-DeSAM enables real-time instrument segmentation without any additional prompting and outperforms other SOTA segmentation methods.
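
For readers unfamiliar with the Dice metric quoted in the results, the following is a minimal sketch of the standard Dice coefficient, 2|A∩B| / (|A| + |B|), computed on two binary masks. It is not taken from the paper: the function name dice_score, the nested-list mask representation, and the choice to return 1.0 when both masks are empty are assumptions made for illustration, and the authors' exact evaluation protocol (per-frame versus per-dataset averaging, per-class handling) may differ.

```python
from typing import Sequence


def dice_score(pred: Sequence[Sequence[int]], target: Sequence[Sequence[int]]) -> float:
    """Standard Dice coefficient between two binary masks of equal shape.

    Hypothetical helper for illustration only; not the paper's evaluation code.
    """
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of rows")

    intersection = 0  # pixels that are foreground in both masks
    pred_area = 0     # foreground pixels in the prediction
    target_area = 0   # foreground pixels in the ground truth

    for pred_row, target_row in zip(pred, target):
        if len(pred_row) != len(target_row):
            raise ValueError("masks must have the same number of columns")
        for p, t in zip(pred_row, target_row):
            p_on, t_on = bool(p), bool(t)
            pred_area += p_on
            target_area += t_on
            intersection += p_on and t_on

    total = pred_area + target_area
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


if __name__ == "__main__":
    predicted = [[0, 1, 1],
                 [0, 1, 0]]
    ground_truth = [[0, 1, 0],
                    [0, 1, 1]]
    # Reported as a percentage, matching the scale used in the abstract.
    print(f"{100 * dice_score(predicted, ground_truth):.2f}")
```

In this toy case each mask has three foreground pixels and two of them overlap, so the score is 2·2 / (3 + 3), or about 66.67 on the percentage scale used in the abstract.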

URL

https://arxiv.org/abs/2404.14040

PDF

https://arxiv.org/pdf/2404.14040.pdf

