Paper Reading AI Learner

Unlocking Generalization in Polyp Segmentation with DINO Self-Attention 'keys'

2025-12-15 14:29:47
Carla Monteiro, Valentina Corbetta, Regina Beets-Tan, Luís F. Teixeira, Wilson Silva

Abstract

Automatic polyp segmentation is crucial for improving the clinical identification of colorectal cancer (CRC). While Deep Learning (DL) techniques have been extensively researched for this problem, current methods frequently struggle with generalization, particularly in data-constrained or challenging settings. Moreover, many existing polyp segmentation methods rely on complex, task-specific architectures. To address these limitations, we present a framework that leverages the intrinsic robustness of DINO self-attention "key" features for robust segmentation. Unlike traditional methods that extract tokens from the deepest layers of the Vision Transformer (ViT), our approach leverages the key features of the self-attention module with a simple convolutional decoder to predict polyp masks, resulting in enhanced performance and better generalizability. We validate our approach using a multi-center dataset under two rigorous protocols: Domain Generalization (DG) and Extreme Single Domain Generalization (ESDG). Our results, supported by a comprehensive statistical analysis, demonstrate that this pipeline achieves state-of-the-art (SOTA) performance, significantly enhancing generalization, particularly in data-scarce and challenging scenarios. While avoiding a polyp-specific architecture, we surpass well-established models like nnU-Net and UM-Net. Additionally, we provide a systematic benchmark of the DINO framework's evolution, quantifying the specific impact of architectural advancements on downstream polyp segmentation performance.
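To make the core idea concrete, here is a minimal NumPy sketch of the mechanism the abstract describes: instead of using a Vision Transformer's final output tokens, take the "key" projections from a self-attention block, reshape them onto the patch grid, and decode them into a dense mask with a lightweight convolutional head. All shapes, weights, and the decoder design below are illustrative placeholders, not the authors' implementation (which uses DINO features and a convolutional decoder at full scale).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: a 4x4 grid of patch tokens with embedding size 8.
H_P, W_P, D = 4, 4, 8
tokens = rng.standard_normal((H_P * W_P, D))  # patch tokens from a ViT block

# Self-attention "key" projection: K = X @ W_k (multi-head details omitted).
W_k = rng.standard_normal((D, D)) / np.sqrt(D)
keys = tokens @ W_k                           # (16, 8) key features

# Reshape the keys back onto the patch grid to form a dense feature map.
feat = keys.reshape(H_P, W_P, D)

# Stand-in for the "simple convolutional decoder": a 1x1 convolution
# (a per-location linear map) down to one channel, then nearest-neighbour
# upsampling to image resolution and a sigmoid to get mask probabilities.
w_dec = rng.standard_normal((D, 1)) / np.sqrt(D)
logits = (feat @ w_dec)[..., 0]               # (4, 4) per-patch logits

scale = 4                                     # upsample 4x4 -> 16x16
upsampled = np.kron(logits, np.ones((scale, scale)))
mask = 1.0 / (1.0 + np.exp(-upsampled))       # (16, 16) probability map

print(mask.shape)
```

The sketch only illustrates where the features are tapped: the keys live one projection before the attention computation, so decoding them sidesteps the mixing performed by the attention and MLP blocks that produce the deepest-layer tokens traditional pipelines use.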

Abstract (translated)

Automatic polyp segmentation is crucial for improving the clinical identification of colorectal cancer (CRC). Although Deep Learning (DL) techniques have been widely applied to this problem, current methods often struggle to generalize in data-constrained or challenging settings. Moreover, many existing polyp segmentation methods rely on complex, task-specific architectures. To overcome these limitations, we propose a framework that exploits the intrinsic robustness of DINO self-attention "key" features for robust segmentation. Unlike traditional methods that extract tokens from the deepest layers of the Vision Transformer (ViT), our approach combines the key features of the self-attention module with a simple convolutional decoder to predict polyp masks, improving performance and achieving better generalization. We validate this approach on a multi-center dataset under two rigorous protocols: Domain Generalization (DG) and Extreme Single Domain Generalization (ESDG). Our results, supported by a comprehensive statistical analysis, show that this pipeline achieves state-of-the-art (SOTA) performance, significantly enhancing generalization in data-scarce and challenging scenarios. While avoiding a polyp-specific architecture, we surpass well-established models such as nnU-Net and UM-Net. In addition, we systematically benchmark the evolution of the DINO framework and quantify the specific impact of architectural improvements on downstream polyp segmentation performance.

URL

https://arxiv.org/abs/2512.13376

PDF

https://arxiv.org/pdf/2512.13376.pdf

