Abstract
Uniform downsampling remains the de facto standard for reducing spatial resolution in vision backbones. In this work, we propose an alternative design built around a content-aware spatial grouping layer that dynamically assigns tokens to a reduced set based on image boundaries and their semantic content. Stacking our grouping layer across consecutive backbone stages yields a hierarchical segmentation that arises natively in the feature extraction process, giving rise to what we coin the Native Segmentation Vision Transformer. We show that a careful design of our architecture enables strong segmentation masks to emerge solely from grouping layers, that is, without additional segmentation-specific heads. This lays the foundation for a new paradigm of native, backbone-level segmentation, which enables strong zero-shot results without mask supervision, as well as a minimal and efficient standalone model design for downstream segmentation tasks. Our project page is this https URL.
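The abstract does not specify how the grouping layer is implemented. As a rough illustration of the general idea, the sketch below shows a generic content-aware token grouping layer in PyTorch, where each spatial token is softly assigned to one of a smaller set of learned group tokens by feature similarity, and group tokens pool the features assigned to them. All names here (`ContentAwareGrouping`, `group_seeds`, `temperature`) and the attention-style soft assignment are assumptions made for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn


class ContentAwareGrouping(nn.Module):
    """Hypothetical sketch of a content-aware spatial grouping layer.

    Each input token is softly assigned to one of `num_groups` output
    tokens based on feature similarity, so the reduced token set can
    follow image content rather than a fixed uniform grid.
    """

    def __init__(self, dim: int, num_groups: int, temperature: float = 0.1):
        super().__init__()
        # Learned seeds that act as queries for the output groups.
        self.group_seeds = nn.Parameter(torch.randn(num_groups, dim) * 0.02)
        self.to_q = nn.Linear(dim, dim)  # projects group seeds (queries)
        self.to_k = nn.Linear(dim, dim)  # projects input tokens (keys)
        self.temperature = temperature

    def forward(self, x: torch.Tensor):
        # x: (batch, num_tokens, dim)
        b = x.size(0)
        q = self.to_q(self.group_seeds).expand(b, -1, -1)  # (b, M, d)
        k = self.to_k(x)                                   # (b, N, d)
        logits = q @ k.transpose(1, 2) / self.temperature  # (b, M, N)
        # Softmax over groups: each token distributes its assignment
        # across the M groups, yielding a soft segmentation of the input.
        assign = logits.softmax(dim=1)                     # (b, M, N)
        # Pool token features into group tokens, normalizing each group
        # by its total assignment mass.
        mass = assign.sum(dim=-1, keepdim=True).clamp(min=1e-6)  # (b, M, 1)
        grouped = (assign @ x) / mass                      # (b, M, d)
        return grouped, assign
```

Under this reading, stacking such layers across stages would shrink the token set while the per-token assignments compose into a hierarchy of masks, e.g.:

```python
layer = ContentAwareGrouping(dim=256, num_groups=64)
tokens = torch.randn(2, 196, 256)          # e.g. 14x14 patch tokens
group_tokens, assignment = layer(tokens)   # (2, 64, 256), (2, 64, 196)
masks = assignment.argmax(dim=1)           # hard group id per token, (2, 196)
```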
URL
https://arxiv.org/abs/2505.16993