
Supervising Neural Attention Models for Video Captioning by Human Gaze Data

2017-07-19 11:44:36
Youngjae Yu, Jongwook Choi, Yeonhwa Kim, Kyung Yoo, Sang-Hun Lee, Gunhee Kim

Abstract

The attention mechanisms in deep neural networks are inspired by human attention, which sequentially focuses on the most relevant parts of the information over time to generate predictions. The attention parameters in those models are trained implicitly in an end-to-end manner, yet few attempts have been made to explicitly supervise the attention models with human gaze tracking data. In this paper, we investigate whether attention models can benefit from explicit human gaze labels, especially for the task of video captioning. We collect a new dataset called VAS, consisting of movie clips paired with multiple descriptive sentences and human gaze tracking data. We propose a video captioning model named Gaze Encoding Attention Network (GEAN) that can leverage gaze tracking information to provide the spatial and temporal attention for sentence generation. Through evaluation with language similarity metrics and human assessment via Amazon Mechanical Turk, we demonstrate that spatial attention guided by human gaze data indeed improves the performance of multiple captioning methods. Moreover, we show that the proposed approach achieves state-of-the-art performance for both gaze prediction and video captioning, not only on our VAS dataset but also on standard datasets (e.g. LSMDC and Hollywood2).
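The paper's GEAN architecture is not reproduced here, but the core idea, supervising a soft spatial attention map with recorded human gaze, can be sketched. Below is a minimal PyTorch sketch assuming a grid of CNN frame features and a pooled per-cell fixation-density map; all class and function names, and the KL-divergence form of the supervision term, are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GazeSupervisedSpatialAttention(nn.Module):
    # Soft spatial attention over a grid of frame features; the attention
    # map can be pulled toward a human gaze map via the auxiliary loss below.
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)
        self.state_proj = nn.Linear(hidden_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, feats, state):
        # feats: (B, N, feat_dim), one vector per spatial grid cell of a frame
        # state: (B, hidden_dim), current caption-decoder hidden state
        energy = torch.tanh(self.feat_proj(feats)
                            + self.state_proj(state).unsqueeze(1))
        attn = F.softmax(self.score(energy).squeeze(-1), dim=-1)  # (B, N)
        context = torch.bmm(attn.unsqueeze(1), feats).squeeze(1)  # (B, feat_dim)
        return context, attn


def gaze_supervision_loss(attn, gaze_map, eps=1e-8):
    # KL(gaze || attn): penalizes attention mass placed away from fixated cells.
    gaze = gaze_map / (gaze_map.sum(dim=-1, keepdim=True) + eps)
    return F.kl_div((attn + eps).log(), gaze, reduction="batchmean")


# Usage sketch: the auxiliary term would be added, with some weight,
# to the usual word-level cross-entropy captioning loss.
attn_module = GazeSupervisedSpatialAttention(feat_dim=512, hidden_dim=256)
feats = torch.randn(4, 49, 512)   # e.g. a 7x7 CNN feature grid, batch of 4
state = torch.randn(4, 256)
gaze = torch.rand(4, 49)          # per-cell fixation density from eye tracking
context, attn = attn_module(feats, state)
aux_loss = gaze_supervision_loss(attn, gaze)
```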


URL

https://arxiv.org/abs/1707.06029

PDF

https://arxiv.org/pdf/1707.06029.pdf

