Abstract
In recent years, the long-range attention mechanism of vision transformers has driven significant performance breakthroughs across a wide range of computer vision tasks. However, the standard self-attention mechanism processes informative and non-informative tokens alike, which makes it both inefficient and prone to inaccuracy. Sparse attention mechanisms mitigate these issues by pruning the tokens involved in attention, but they typically lack context awareness: they either apply a uniform token-selection strategy across different inputs to enable batch training, or improve efficiency only at the inference stage. To overcome these limitations, we propose a novel algorithm: Select and Pack Attention (SPA). SPA dynamically selects informative tokens using a low-cost gating layer supervised by selection labels and packs the selected tokens into new batches, enabling a variable number of tokens to be used in parallelized GPU batch training and inference. Extensive experiments across diverse datasets and computer vision tasks demonstrate that SPA delivers superior performance and efficiency, including a 0.6 mAP improvement in object detection and a 16.4% reduction in computational cost.
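The following is a minimal sketch of the select-and-pack idea described above, not the authors' implementation. It assumes a linear gating head scoring each token, top-k selection by gate score (the paper may instead use a threshold or a label-driven rule), and packing of the kept tokens into a new dense batch; the class name SelectAndPack and the keep_ratio parameter are illustrative choices.

```python
# Hedged sketch of select-and-pack token selection; assumptions noted inline.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelectAndPack(nn.Module):
    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.gate = nn.Linear(dim, 1)   # low-cost gating layer scoring each token
        self.keep_ratio = keep_ratio    # fraction of tokens to keep (assumed policy)

    def forward(self, x: torch.Tensor, select_labels: torch.Tensor = None):
        # x: (B, N, D) token embeddings; select_labels: (B, N) 0/1 supervision (optional)
        B, N, D = x.shape
        scores = self.gate(x).squeeze(-1)  # (B, N) informativeness logits

        # Optional supervision of the gate with selection labels (BCE), following the
        # abstract's "gating layer supervised by selection labels"; the exact loss is assumed.
        gate_loss = None
        if select_labels is not None:
            gate_loss = F.binary_cross_entropy_with_logits(scores, select_labels.float())

        # Keep the top-k highest-scoring tokens per sample.
        k = max(1, int(N * self.keep_ratio))
        idx = scores.topk(k, dim=1).indices                                   # (B, k)
        packed = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, D))      # (B, k, D)

        # "Packing": the shortened sequences form a new, denser batch that standard
        # batched attention can consume; if k varied per sample, a padding mask
        # (or varlen packing) would be needed, which this sketch omits.
        return packed, idx, gate_loss


if __name__ == "__main__":
    spa = SelectAndPack(dim=64, keep_ratio=0.5)
    tokens = torch.randn(2, 16, 64)
    labels = (torch.rand(2, 16) > 0.5).float()
    packed, kept_idx, loss = spa(tokens, labels)
    print(packed.shape, kept_idx.shape, loss.item())  # torch.Size([2, 8, 64]) torch.Size([2, 8]) ...
```

In this sketch the selection step reduces the token count before attention, which is where the computational savings reported in the abstract would come from; how the real SPA forms batches from variable-length selections is not specified here and is left as an assumption.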
URL
https://arxiv.org/abs/2410.23608