Abstract
Dialog is an effective way to exchange information, but subtle details and nuances are extremely important. While significant progress has paved a path toward addressing visual dialog with algorithms, details and nuances remain a challenge. Attention mechanisms have demonstrated compelling results in extracting details for visual question answering, and their interpretability and effectiveness also make them a convincing framework for visual dialog. However, the many data utilities that accompany visual dialog challenge existing attention techniques. We address this issue and develop a general attention mechanism for visual dialog that operates on any number of data utilities. To this end, we design a factor-graph-based attention mechanism that combines any number of utility representations. We illustrate the applicability of the proposed approach on the challenging and recently introduced VisDial datasets, outperforming recent state-of-the-art methods by 1.1% on VisDial0.9 and by 2% on VisDial1.0 in terms of MRR. Our ensemble model improves the MRR score on VisDial1.0 by more than 6%.
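The results above are reported in MRR (mean reciprocal rank). As a reading aid, the following is a minimal sketch of how that metric is commonly computed. It assumes each question comes with a ranked list of candidate answers and that the 1-based rank of the ground-truth answer is known. The function name `mean_reciprocal_rank` and the sample ranks are illustrative only and are not taken from the paper or its evaluation code.

```python
from typing import Sequence


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all questions.

    `ranks` holds, for each question, the 1-based position at which the
    model placed the ground-truth answer among the candidates
    (hypothetical input format, assumed for this sketch).
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    if any(r < 1 for r in ranks):
        raise ValueError("ranks are 1-based and must be >= 1")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Illustrative ranks for three questions: correct answer ranked 1st, 2nd, 4th.
example_ranks = [1, 2, 4]
example_mrr = mean_reciprocal_rank(example_ranks)
print(f"MRR = {example_mrr:.4f}")
```

MRR lies in (0, 1] and is often quoted as a percentage; a model that always ranks the correct answer first scores 1.0.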
URL
https://arxiv.org/abs/1904.05880