Abstract
Attention networks in multimodal learning provide an efficient way to utilize given visual information selectively. However, the computational cost of learning attention distributions for every pair of multimodal input channels is prohibitively expensive. To solve this problem, co-attention builds two separate attention distributions, one for each modality, neglecting the interaction between multimodal inputs. In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to utilize given vision-language information seamlessly. BAN considers bilinear interactions between two groups of input channels, while low-rank bilinear pooling extracts the joint representations for each pair of channels. Furthermore, we propose a variant of multimodal residual networks to exploit the eight attention maps of BAN efficiently. We quantitatively and qualitatively evaluate our model on the visual question answering (VQA 2.0) and Flickr30k Entities datasets, showing that BAN significantly outperforms previous methods and achieves new state-of-the-art results on both datasets.
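The core idea above, a single attention distribution over all pairs of text and visual channels, computed through a low-rank bilinear form, can be sketched as follows. This is a minimal illustrative NumPy sketch, not the paper's actual parameterization: the dimensions, the factor matrices `U` and `V`, and the single-glimpse setup are all assumptions for clarity.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over a flat vector
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical sizes (illustrative only): rho text channels, phi visual
# channels, with a rank-d low-rank factorization of the bilinear weight.
rho, phi = 14, 36          # e.g. question tokens x detected objects
dx, dy, d = 512, 512, 256

rng = np.random.default_rng(0)
X = rng.standard_normal((rho, dx))   # text channel features
Y = rng.standard_normal((phi, dy))   # visual channel features

# Low-rank bilinear attention logits: logit_ij = (x_i^T U)(V^T y_j),
# i.e. a full bilinear interaction W ~ U V^T without materializing W.
U = rng.standard_normal((dx, d)) / np.sqrt(dx)
V = rng.standard_normal((dy, d)) / np.sqrt(dy)
logits = (X @ U) @ (Y @ V).T         # shape (rho, phi)

# One attention map over ALL channel pairs (unlike co-attention,
# which would normalize each modality separately).
A = softmax(logits.ravel()).reshape(rho, phi)

# Joint representation via attention-weighted low-rank bilinear pooling:
# f_k = sum_ij A_ij * (X_i U)_k * (Y_j V)_k
f = np.einsum('ij,ik,jk->k', A, X @ U, Y @ V)
print(f.shape)  # (d,)
```

Because the attention map normalizes over the full rho x phi grid of pairs, interactions between the two modalities are captured directly, which is the contrast with co-attention drawn in the abstract.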
URL
https://arxiv.org/abs/1805.07932