Abstract
The attention mechanisms in deep neural networks are inspired by human attention, which sequentially focuses on the most relevant parts of incoming information over time to generate predictions. The attention parameters in these models are trained implicitly in an end-to-end manner, yet there have been few attempts to explicitly supervise attention models with human gaze tracking. In this paper, we investigate whether attention models can benefit from explicit human gaze labels, especially for the task of video captioning. We collect a new dataset called VAS, consisting of movie clips and corresponding multiple descriptive sentences along with human gaze tracking data. We propose a video captioning model named Gaze Encoding Attention Network (GEAN) that can leverage gaze tracking information to provide spatial and temporal attention for sentence generation. Through evaluation with language similarity metrics and human assessment via Amazon Mechanical Turk, we demonstrate that spatial attention guided by human gaze data indeed improves the performance of multiple captioning methods. Moreover, we show that the proposed approach achieves state-of-the-art performance for both gaze prediction and video captioning, not only on our VAS dataset but also on standard datasets (e.g. LSMDC and Hollywood2).
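The core idea of gaze-guided spatial attention can be sketched as weighting a frame's feature map by a human-gaze saliency map before pooling it into a context vector for the caption decoder. The function below is a minimal illustrative sketch, not the paper's actual GEAN architecture; the names `gaze_weighted_pooling`, `features`, and `gaze_map` are hypothetical.

```python
import numpy as np

def gaze_weighted_pooling(features: np.ndarray, gaze_map: np.ndarray) -> np.ndarray:
    """Pool a spatial feature map into a single context vector,
    weighting each location by a gaze saliency map.

    features: (H, W, D) per-location frame features (e.g. from a CNN).
    gaze_map: (H, W) non-negative gaze saliency values (hypothetical
              output of a gaze-prediction module or human fixation data).
    """
    # Normalize the gaze map into a spatial attention distribution.
    attn = gaze_map / (gaze_map.sum() + 1e-8)
    # Attention-weighted sum over all H*W locations -> (D,) vector.
    return np.einsum("hw,hwd->d", attn, features)
```

With uniform gaze the pooling reduces to average pooling; a peaked gaze map instead emphasizes the fixated region, which is the behavior the gaze supervision is meant to encourage.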
URL
https://arxiv.org/abs/1707.06029