Abstract
Attention mechanisms have become a core component of deep learning models, with channel attention and spatial attention being the two most representative architectures. Current research on how to fuse them falls largely into sequential and parallel paradigms, yet the choice between them remains mostly empirical, lacking systematic analysis and unified principles. We systematically compare channel-spatial attention combinations under a unified framework, building an evaluation suite of 18 topologies across four classes: sequential, parallel, multi-scale, and residual. Across two vision and nine medical datasets, we uncover a "data scale-method-performance" coupling law: (1) in few-shot tasks, the "channel, then multi-scale spatial" cascaded structure achieves optimal performance; (2) in medium-scale tasks, parallel architectures with learnable fusion demonstrate superior results; (3) in large-scale tasks, parallel structures with dynamic gating yield the best performance. Additionally, experiments indicate that the "spatial, then channel" order is more stable and effective for fine-grained classification, while residual connections mitigate vanishing gradients across data scales. We thus propose scenario-based guidelines for building future attention modules. Code is open-sourced at this https URL.
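To make the topologies concrete, the sketch below shows the simplest sequential composition the abstract discusses: a channel-attention step followed by a spatial-attention step. This is a minimal, parameter-free illustration in plain Python, not the paper's implementation; real modules (e.g. SE or CBAM) use learned MLPs and convolutions rather than the raw pooled statistics assumed here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(fmap):
    """Scale each channel by a gate from its global average pool.
    fmap: list of C channels, each an H x W grid of floats."""
    out = []
    for ch in fmap:
        pooled = sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        gate = sigmoid(pooled)  # illustrative gate; SE uses a learned MLP here
        out.append([[v * gate for v in row] for row in ch])
    return out

def spatial_attention(fmap):
    """Scale each spatial position by a gate from its cross-channel mean."""
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    mask = [[sigmoid(sum(fmap[c][i][j] for c in range(C)) / C)
             for j in range(W)] for i in range(H)]
    return [[[fmap[c][i][j] * mask[i][j] for j in range(W)]
             for i in range(H)] for c in range(C)]

def channel_then_spatial(fmap):
    """Sequential 'channel -> spatial' topology (CBAM-style ordering)."""
    return spatial_attention(channel_attention(fmap))
```

Swapping the call order inside `channel_then_spatial` gives the "spatial, then channel" variant the abstract reports as more stable for fine-grained classification; the parallel variants would instead run both branches on the same input and merge them with learned weights or a gate.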
URL
https://arxiv.org/abs/2601.07310