Abstract
Current attention algorithms (e.g., self-attention) are stimulus-driven and highlight all the salient objects in an image. However, intelligent agents like humans often guide their attention based on the high-level task at hand, focusing only on task-related objects. This ability of task-guided top-down attention provides task-adaptive representations and helps the model generalize to various tasks. In this paper, we consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision. Prior work indicates a functional equivalence between visual attention and sparse reconstruction; we show that an AbS visual system that optimizes a similar sparse reconstruction objective modulated by a goal-directed top-down signal naturally simulates top-down attention. We further propose Analysis-by-Synthesis Vision Transformer (AbSViT), a top-down modulated ViT model that variationally approximates AbS and achieves controllable top-down attention. For real-world applications, AbSViT consistently improves over baselines on Vision-Language tasks such as VQA and zero-shot retrieval, where language guides the top-down attention. AbSViT can also serve as a general backbone, improving performance on classification, semantic segmentation, and model robustness.
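The core idea of a sparse reconstruction objective modulated by a top-down signal can be illustrated with a toy example. The sketch below is a minimal conceptual illustration, not the paper's AbSViT implementation: it runs ISTA-style sparse coding and lets a hypothetical per-atom top-down gain `g` rescale the sparsity penalty, so that task-relevant dictionary atoms are penalized less and survive in the representation (the function names and the specific modulation scheme are illustrative assumptions).

```python
import numpy as np

def soft_threshold(x, t):
    # Elementwise soft-thresholding, the proximal operator of the L1 norm.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def topdown_sparse_code(x, D, g, lam=0.1, steps=100):
    """Toy top-down modulated sparse reconstruction (illustrative only).

    Solves  min_z ||x - D z||^2 / 2 + lam * sum_i g_i |z_i|  with ISTA,
    where g is a hypothetical top-down gain vector: g_i < 1 lowers the
    sparsity penalty on atom i, biasing the code toward task-relevant atoms.
    """
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    for _ in range(steps):
        grad = D.T @ (D @ z - x)           # gradient of the reconstruction term
        z = soft_threshold(z - grad / L, lam * g / L)
    return z

# With a uniform gain, a weak input component is suppressed by the L1 penalty;
# lowering the gain on that atom (a "top-down" cue) lets it pass through.
D = np.eye(4)
x = np.array([1.0, 0.05, 0.0, 0.0])
z_bottom_up = topdown_sparse_code(x, D, np.ones(4))              # weak atom zeroed out
z_top_down = topdown_sparse_code(x, D, np.array([1.0, 0.1, 1.0, 1.0]))  # weak atom retained
```

This mirrors the abstract's claim only at a schematic level: stimulus-driven coding keeps whatever is strong in the input, while the goal-directed gain changes which components the reconstruction retains.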
URL
https://arxiv.org/abs/2303.13043