Factor Graph Attention

2019-04-11 17:59:58
Idan Schwartz, Seunghak Yu, Tamir Hazan, Alexander Schwing

Abstract

Dialog is an effective way to exchange information, but subtle details and nuances are extremely important. While significant progress has paved a path to address visual dialog with algorithms, details and nuances remain a challenge. Attention mechanisms have demonstrated compelling results to extract details in visual question answering and also provide a convincing framework for visual dialog due to their interpretability and effectiveness. However, the many data utilities that accompany visual dialog challenge existing attention techniques. We address this issue and develop a general attention mechanism for visual dialog which operates on any number of data utilities. To this end, we design a factor graph based attention mechanism which combines any number of utility representations. We illustrate the applicability of the proposed approach on the challenging and recently introduced VisDial datasets, outperforming recent state-of-the-art methods by 1.1% for VisDial0.9 and by 2% for VisDial1.0 on MRR. Our ensemble model improved the MRR score on VisDial1.0 by more than 6%.
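The results above are reported in MRR (mean reciprocal rank). As a minimal sketch of how that metric is commonly computed, assuming each question comes with a ranked list of candidate answers and we know the 1-based rank the model gave the ground-truth answer: the function name and the example ranks below are hypothetical illustrations, not taken from the paper or the VisDial evaluation code.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all questions.

    ranks: 1-based rank assigned to the ground-truth answer for each
    question (1 means the correct answer was ranked first).
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    if any(r < 1 for r in ranks):
        raise ValueError("ranks are 1-based and must be >= 1")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical example: ground truth ranked 1st, 2nd and 4th on three questions.
example_ranks = [1, 2, 4]
mrr = mean_reciprocal_rank(example_ranks)  # (1 + 1/2 + 1/4) / 3
print(f"MRR = {mrr:.4f}")
```

A higher MRR means the correct answer tends to appear nearer the top of the ranked candidates; a model that always ranks it first scores 1.0.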

URL

https://arxiv.org/abs/1904.05880

PDF

https://arxiv.org/pdf/1904.05880.pdf

