Paper Reading AI Learner

Native Segmentation Vision Transformers

2025-05-22 17:55:20
Guillem Bras\'o, Aljo\v{s}a O\v{s}ep, Laura Leal-Taix\'e

Abstract

Uniform downsampling remains the de facto standard for reducing spatial resolution in vision backbones. In this work, we propose an alternative design built around a content-aware spatial grouping layer, that dynamically assigns tokens to a reduced set based on image boundaries and their semantic content. Stacking our grouping layer across consecutive backbone stages results in hierarchical segmentation that arises natively in the feature extraction process, resulting in our coined Native Segmentation Vision Transformer. We show that a careful design of our architecture enables the emergence of strong segmentation masks solely from grouping layers, that is, without additional segmentation-specific heads. This sets the foundation for a new paradigm of native, backbone-level segmentation, which enables strong zero-shot results without mask supervision, as well as a minimal and efficient standalone model design for downstream segmentation tasks. Our project page is this https URL.

Abstract (translated)

均匀下采样一直是降低视觉骨干网络空间分辨率的事实标准。在本工作中,我们提出了一种基于内容感知的空间分组层的替代设计方案,该设计可以根据图像边界及其语义内容动态地将标记分配给一个缩小的集合中。在整个连续骨干阶段堆叠我们的分组层会产生一种层次化分割,这种分割自然出现在特征提取过程中,从而形成了我们提出的原生分割视觉变换器(Native Segmentation Vision Transformer)。我们展示了对架构进行精心设计可以使仅通过分组层就能产生强大的分割掩码,而无需额外的特定于分割的头部。这为新的原生骨干级分割范式奠定了基础,该范式可以在没有掩码监督的情况下实现强大的零样本结果,并且对于下游分割任务具有最小和高效的独立模型设计。我们的项目页面在此 [URL]。 注:原文中的项目页面链接(https URL)未给出具体网址,在实际引用时需要提供完整的网址信息。

URL

https://arxiv.org/abs/2505.16993

PDF

https://arxiv.org/pdf/2505.16993.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Time_Series Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot