Paper Reading AI Learner

MTLE: A Multitask Learning Encoder of Visual Feature Representations for Video and Movie Description

2018-09-19 15:50:18
Oliver Nina, Washington Garcia, Scott Clouse, Alper Yilmaz

Abstract

Learning visual feature representations for video analysis is a daunting task that requires a large number of training samples and a proper generalization framework. Many current state-of-the-art methods for video captioning and movie description rely on simple encoding mechanisms through recurrent neural networks to encode temporal visual information extracted from video data. In this paper, we introduce a novel multitask encoder-decoder framework for automatic semantic description and captioning of video sequences. In contrast to current approaches, our method relies on distinct decoders that train a visual encoder in a multitask fashion. Our system does not depend on multiple labels per video and tolerates scarce training data, working even with datasets where only a single annotation is available per video. Our method improves on current state-of-the-art methods in several metrics on multi-caption and single-caption datasets. To the best of our knowledge, ours is the first method to use a multitask approach for encoding video features. Our method demonstrated its robustness at the Large Scale Movie Description Challenge (LSMDC) 2017, where it won the movie description task and its results were ranked, among all competitors, as the most helpful for the visually impaired.
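The core idea, a single visual encoder whose parameters receive gradients from several distinct decoders, can be illustrated with a minimal PyTorch sketch. All module names, dimensions, the choice of a feature-reconstruction auxiliary task, and the loss weight alpha are assumptions made for illustration here; they are not the authors' implementation.

```python
# Hypothetical sketch of a multitask-trained video encoder:
# one shared encoder, two distinct decoders (captioning + reconstruction).
# The auxiliary task and all hyperparameters are assumptions, not the paper's code.
import torch
import torch.nn as nn

class SharedVideoEncoder(nn.Module):
    """Encodes a sequence of per-frame CNN features into a single state."""
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frame_feats):              # (B, T, feat_dim)
        _, (h, _) = self.rnn(frame_feats)
        return h[-1]                              # (B, hidden_dim)

class CaptionDecoder(nn.Module):
    """Generates caption logits conditioned on the encoded video state."""
    def __init__(self, hidden_dim=512, vocab_size=10000, embed_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, video_state, captions):     # captions: (B, L)
        h0 = video_state.unsqueeze(0)             # (1, B, hidden_dim)
        c0 = torch.zeros_like(h0)
        x, _ = self.rnn(self.embed(captions), (h0, c0))
        return self.out(x)                        # (B, L, vocab_size)

class ReconstructionDecoder(nn.Module):
    """Assumed auxiliary task: reconstruct the mean frame feature."""
    def __init__(self, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, feat_dim)

    def forward(self, video_state):
        return self.fc(video_state)

def multitask_loss(frame_feats, captions, encoder, cap_dec, rec_dec, alpha=0.5):
    state = encoder(frame_feats)
    # Captioning loss: predict each next token from the previous tokens.
    logits = cap_dec(state, captions[:, :-1])
    cap_loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), captions[:, 1:].reshape(-1))
    # Auxiliary loss: gradients from a second decoder also shape the encoder.
    rec_loss = nn.functional.mse_loss(rec_dec(state), frame_feats.mean(dim=1))
    return cap_loss + alpha * rec_loss
```

The point of the second decoder in such a setup is that its gradients also flow into the shared encoder, regularizing the visual representation even when only one caption per clip is available.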

URL

https://arxiv.org/abs/1809.07257

PDF

https://arxiv.org/pdf/1809.07257.pdf

