Abstract
Building correspondences across different modalities, such as video and language, has recently become critical in many visual recognition applications, including video captioning. Inspired by machine translation, recent models tackle this task using an encoder-decoder strategy. The (video) encoder is traditionally a Convolutional Neural Network (CNN), while the decoding (for language generation) is done using a Recurrent Neural Network (RNN). Current state-of-the-art methods, however, train the encoder and decoder separately. CNNs are pre-trained on object and/or action recognition tasks and used to encode video-level features. The decoder is then optimised on such static features to generate the video's description. This disjoint setup is arguably sub-optimal for input (video) to output (description) mapping. In this work, we propose to optimise both encoder and decoder simultaneously in an end-to-end fashion. In a two-stage training setting, we first initialise our architecture using pre-trained encoders and decoders; the entire network is then trained end-to-end in a fine-tuning stage to learn the most relevant features for video caption generation. In our experiments, we use GoogLeNet and Inception-ResNet-v2 as encoders and an original Soft-Attention (SA-) LSTM as a decoder. Analogously to gains observed in other computer vision problems, we show that end-to-end training significantly improves over the traditional, disjoint training process. We evaluate our End-to-End (EtENet) Networks on the Microsoft Research Video Description (MSVD) and the MSR Video to Text (MSR-VTT) benchmark datasets, showing how EtENet achieves state-of-the-art performance across the board.
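For concreteness, below is a minimal PyTorch sketch of the two-stage scheme the abstract describes: train the caption decoder on frozen CNN features first, then unfreeze the encoder and fine-tune everything end-to-end. This is an illustrative assumption, not the authors' implementation: torchvision's GoogLeNet stands in for the encoder, a plain LSTM over a mean-pooled video code stands in for the SA-LSTM decoder, and all class names, dimensions, and learning rates are invented for the example.

```python
# Illustrative sketch only: module names, dimensions and learning rates are
# assumptions; a plain LSTM over mean-pooled frame features stands in for
# the paper's Soft-Attention (SA-) LSTM decoder.
import torch
import torch.nn as nn
from torchvision import models

class VideoCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        # Pre-trained GoogLeNet as the frame encoder; the classifier head is
        # replaced so the network emits 1024-d pooled features per frame.
        self.encoder = models.googlenet(weights="IMAGENET1K_V1", aux_logits=False)
        self.encoder.fc = nn.Identity()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim + 1024, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frames, captions):
        # frames: (B, T, 3, 224, 224) video clips; captions: (B, L) token ids.
        B, T = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1))        # (B*T, 1024)
        video = feats.view(B, T, -1).mean(dim=1)          # mean-pool over time
        words = self.embed(captions)                      # (B, L, embed_dim)
        ctx = video.unsqueeze(1).expand(-1, words.size(1), -1)
        hidden, _ = self.decoder(torch.cat([words, ctx], dim=-1))
        return self.out(hidden)                           # (B, L, vocab_size) logits

model = VideoCaptioner(vocab_size=10000)

# Stage 1: freeze the CNN and train only the decoder, mimicking the usual
# disjoint setup where captions are generated from static video features.
for p in model.encoder.parameters():
    p.requires_grad = False
stage1_opt = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4)

# Stage 2: unfreeze the encoder and fine-tune the whole network end-to-end,
# with a smaller learning rate on the pre-trained CNN so its ImageNet
# features are adapted gently rather than overwritten.
for p in model.encoder.parameters():
    p.requires_grad = True
stage2_opt = torch.optim.Adam([
    {"params": model.encoder.parameters(), "lr": 1e-5},
    {"params": model.decoder.parameters()},
    {"params": model.embed.parameters()},
    {"params": model.out.parameters()},
], lr=1e-4)
```

In the paper itself the decoder attends over per-frame features rather than a single mean-pooled vector; the point of the sketch is the staging logic, where the gradient is allowed to flow back into the encoder only in the second stage.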
URL
https://arxiv.org/abs/1904.02628