Abstract
We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention. StructSA generates attention maps by recognizing space-time structures of key-query correlations via convolution and uses them to dynamically aggregate local contexts of value features. In doing so, it captures structural patterns in images and videos such as scene layouts, object motion, and inter-object relations. Using StructSA as a main building block, we develop the structural vision transformer (StructViT) and evaluate its effectiveness on both image and video classification tasks, achieving state-of-the-art results on ImageNet-1K, Kinetics-400, Something-Something V1 & V2, Diving-48, and FineGym.
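The mechanism described above can be illustrated with a minimal NumPy sketch. This is a simplified 1D toy written for this summary, not the authors' implementation: the function names (`toy_struct_sa`, `local_windows`), the single-head setup, the zero padding, and the square `kernel` that stands in for learned convolution weights are all assumptions. It only shows the two ideas named in the abstract: convolving over the key axis of the query-key correlation map to produce structure-aware attention, and using that attention to aggregate local windows of value features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_windows(x, size):
    # Zero-padded sliding windows along axis 0: (N, ...) -> (N, size, ...).
    pad = size // 2
    padded = np.pad(x, [(pad, pad)] + [(0, 0)] * (x.ndim - 1))
    return np.stack([padded[o:o + x.shape[0]] for o in range(size)], axis=1)

def toy_struct_sa(q, k, v, kernel):
    """Toy 1D structural self-attention (illustrative assumption only).

    q, k, v: arrays of shape (N, d).
    kernel:  hypothetical (M, M) convolution weights, M odd. Each column
             maps an M-long patch of correlations to one attention score.
    Returns (output of shape (N, d), attention of shape (N, N, M)).
    """
    n, d = q.shape
    m = kernel.shape[0]
    corr = q @ k.T / np.sqrt(d)                    # (N_query, N_key)
    # For each query i and key j, gather corr[i, j-pad : j+pad+1].
    patches = local_windows(corr.T, m).transpose(2, 0, 1)  # (N, N, M)
    # Convolution over the key axis, written as a patch-wise matmul.
    scores = patches @ kernel                      # (N, N, M)
    # Normalize jointly over key positions and the M kernel channels.
    attn = softmax(scores.reshape(n, -1), axis=-1).reshape(n, n, m)
    # Each channel weights one offset of the local value context.
    v_ctx = local_windows(v, m)                    # (N_key, M, d)
    out = np.einsum('ijm,jmd->id', attn, v_ctx)
    return out, attn

# Usage with random inputs (6 tokens, 4 channels, window of 3).
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
out, attn = toy_struct_sa(q, k, v, rng.normal(size=(3, 3)))
```

In this sketch, a plain dot-product attention would score each key from a single correlation value, whereas here each score depends on a small neighborhood of correlations around that key. The space-time version described in the abstract would apply the same idea over spatial and temporal axes rather than a single token axis.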
URL
https://arxiv.org/abs/2404.03924