Abstract
Current attention algorithms (e.g., self-attention) are stimulus-driven and highlight all the salient objects in an image. However, intelligent agents like humans often guide their attention based on the high-level task at hand, focusing only on task-related objects. This ability of task-guided top-down attention provides task-adaptive representations and helps the model generalize to various tasks. In this paper, we consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision. Prior work indicates a functional equivalence between visual attention and sparse reconstruction; we show that an AbS visual system that optimizes a similar sparse reconstruction objective modulated by a goal-directed top-down signal naturally simulates top-down attention. We further propose Analysis-by-Synthesis Vision Transformer (AbSViT), a top-down modulated ViT model that variationally approximates AbS and achieves controllable top-down attention. For real-world applications, AbSViT consistently improves over baselines on Vision-Language tasks such as VQA and zero-shot retrieval, where language guides the top-down attention. AbSViT can also serve as a general backbone, improving performance on classification, semantic segmentation, and model robustness.
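The core idea of a sparse reconstruction objective modulated by a top-down signal can be illustrated with a toy example. The sketch below is a minimal conceptual illustration, not the paper's AbSViT implementation: it runs ISTA-style sparse coding and lets a hypothetical per-atom top-down gain `g` rescale the sparsity penalty, so that task-relevant dictionary atoms are penalized less and survive in the representation (the function names and the specific modulation scheme are illustrative assumptions).

```python
import numpy as np

def soft_threshold(x, t):
    # Elementwise soft-thresholding, the proximal operator of the L1 norm.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def topdown_sparse_code(x, D, g, lam=0.1, steps=100):
    """Toy top-down modulated sparse reconstruction (illustrative only).

    Solves  min_z ||x - D z||^2 / 2 + lam * sum_i g_i |z_i|  with ISTA,
    where g is a hypothetical top-down gain vector: g_i < 1 lowers the
    sparsity penalty on atom i, biasing the code toward task-relevant atoms.
    """
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    for _ in range(steps):
        grad = D.T @ (D @ z - x)           # gradient of the reconstruction term
        z = soft_threshold(z - grad / L, lam * g / L)
    return z

# With a uniform gain, a weak input component is suppressed by the L1 penalty;
# lowering the gain on that atom (a "top-down" cue) lets it pass through.
D = np.eye(4)
x = np.array([1.0, 0.05, 0.0, 0.0])
z_bottom_up = topdown_sparse_code(x, D, np.ones(4))              # weak atom zeroed out
z_top_down = topdown_sparse_code(x, D, np.array([1.0, 0.1, 1.0, 1.0]))  # weak atom retained
```

This mirrors the abstract's claim only at a schematic level: stimulus-driven coding keeps whatever is strong in the input, while the goal-directed gain changes which components the reconstruction retains.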
URL
https://arxiv.org/abs/2303.13043