Paper Reading AI Learner

Visual Reference Resolution using Attention Memory for Visual Dialog

2017-11-20 18:09:07
Paul Hongsuck Seo, Andreas Lehrmann, Bohyung Han, Leonid Sigal

Abstract

Visual dialog is the task of answering a series of inter-dependent questions about an input image, and often requires resolving visual references across the questions. This problem differs from visual question answering (VQA), which relies on spatial attention (a.k.a. visual grounding) estimated from a single image and question pair. We propose a novel attention mechanism that exploits past visual attentions to resolve the current reference in the visual dialog scenario. The proposed model is equipped with an associative attention memory storing a sequence of previous (attention, key) pairs. From this memory, the model retrieves the previous attention most relevant to the current question, taking recency into account, in order to resolve potentially ambiguous references. The model then merges the retrieved attention with a tentative one to obtain the final attention for the current question; specifically, we use dynamic parameter prediction to combine the two attentions conditioned on the question. Through extensive experiments on a new synthetic visual dialog dataset, we show that our model significantly outperforms the state-of-the-art (by ~16 percentage points) in situations where visual reference resolution plays an important role. Moreover, the proposed model achieves superior performance (an improvement of ~2 percentage points) on the Visual Dialog dataset, despite having significantly fewer parameters than the baselines.
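
To make the mechanism concrete, below is a minimal PyTorch sketch of one reading of the abstract: an associative memory of (attention, key) pairs addressed by key similarity plus a recency bias, and a question-conditioned weighting that merges the tentative and retrieved attentions. All class names, dimensions, and the exact scoring and recency terms are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the attention-memory idea described in the abstract.
# Everything here (module names, sizes, the 0.1 recency bias, the
# two-weight combiner) is an assumption for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionMemory(nn.Module):
    """Associative memory of (attention, key) pairs from previous turns."""

    def __init__(self, key_dim):
        super().__init__()
        self.key_dim = key_dim
        self.keys = []        # one key embedding per past turn
        self.attentions = []  # one flattened spatial attention map per turn

    def write(self, attention, key):
        self.attentions.append(attention)
        self.keys.append(key)

    def read(self, query):
        # Address the memory by key/query similarity; the abstract also
        # mentions recency, approximated here by a small positional bias.
        keys = torch.stack(self.keys)                    # (T, key_dim)
        scores = keys @ query / self.key_dim ** 0.5      # (T,)
        recency = torch.arange(len(self.keys), dtype=torch.float32)
        weights = F.softmax(scores + 0.1 * recency, dim=0)
        return weights @ torch.stack(self.attentions)    # (H*W,)


class DynamicCombiner(nn.Module):
    """Merges the tentative and retrieved attentions with weights predicted
    from the question embedding (a stand-in for the paper's dynamic
    parameter prediction)."""

    def __init__(self, question_dim):
        super().__init__()
        self.fc = nn.Linear(question_dim, 2)  # one weight per attention map

    def forward(self, question, tentative_att, retrieved_att):
        w = F.softmax(self.fc(question), dim=-1)  # convex combination weights
        return w[0] * tentative_att + w[1] * retrieved_att


# Toy usage: a 7x7 grid flattened to 49 cells, 64-d question/key space.
memory = AttentionMemory(key_dim=64)
memory.write(F.softmax(torch.randn(49), dim=0), torch.randn(64))
memory.write(F.softmax(torch.randn(49), dim=0), torch.randn(64))

question = torch.randn(64)
tentative = F.softmax(torch.randn(49), dim=0)
combiner = DynamicCombiner(question_dim=64)
final_att = combiner(question, tentative, memory.read(question))
```

Since both input attentions are distributions over grid cells and the predicted weights are convex, the merged output remains a valid attention distribution.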

URL

https://arxiv.org/abs/1709.07992

PDF

https://arxiv.org/pdf/1709.07992.pdf

