Frame- and Segment-Level Features and Candidate Pool Evaluation for Video Caption Generation

2016-08-17 13:30:06
Rakshith Shetty, Jorma Laaksonen

Abstract

We present our submission to the Microsoft Video to Language Challenge: generating short captions that describe videos in the challenge dataset. Our model is based on the encoder-decoder pipeline popular in image and video captioning systems. We propose to utilize two different kinds of video features: one captures the video content in terms of objects and attributes, and the other captures motion and action information. Using these diverse features, we train models specializing in two separate input sub-domains. We then train an evaluator model that picks the best caption from the pool of candidates generated by these domain-expert models. We argue that, due to the diversity of the dataset, this approach is better suited to the current video captioning task than using a single model. The efficacy of our method is demonstrated by the fact that it was rated best in the MSR Video to Language Challenge according to human evaluation; additionally, it was ranked second on the automatic evaluation metrics.
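The two-stage pipeline described in the abstract (domain-expert generators feeding a candidate pool, arbitrated by a trained evaluator) can be sketched in a few lines. The following is a minimal, hypothetical Python sketch, not the authors' implementation: every function name and the hash-based caption "embedding" are illustrative assumptions standing in for the paper's encoder-decoder LSTMs and learned evaluator.

    import numpy as np

    # Hypothetical stand-ins for the two domain-expert caption generators:
    # one conditioned on frame-level object/attribute features, the other
    # on segment-level motion features. In the paper these would be
    # encoder-decoder models producing candidates via beam search.
    def object_expert_captions(frame_features):
        return ["a man is playing a guitar", "a person is playing music"]

    def motion_expert_captions(segment_features):
        return ["a man is strumming a guitar", "someone performs on stage"]

    def evaluator_score(video_features, caption):
        # Hypothetical evaluator: embeds the caption into the same space
        # as the video features and returns their similarity. The
        # hash-seeded random projection below merely fakes a learned
        # caption embedding for the sake of a runnable example.
        rng = np.random.default_rng(abs(hash(caption)) % (2**32))
        caption_embedding = rng.standard_normal(video_features.shape[0])
        return float(np.dot(video_features, caption_embedding))

    def caption_video(frame_features, segment_features):
        # Pool candidates from both domain experts, then let the
        # evaluator pick the single best caption for this video.
        pool = (object_expert_captions(frame_features)
                + motion_expert_captions(segment_features))
        fused = np.concatenate([frame_features, segment_features])
        return max(pool, key=lambda c: evaluator_score(fused, c))

    if __name__ == "__main__":
        frame_feats = np.random.standard_normal(8)    # toy frame-level features
        segment_feats = np.random.standard_normal(8)  # toy segment-level features
        print(caption_video(frame_feats, segment_feats))

The design point this sketch illustrates is the separation of concerns: each expert can specialize in its own feature view of the video, while the evaluator, trained to rank captions against video content, arbitrates between their outputs instead of forcing one model to cover the whole input distribution.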

URL

https://arxiv.org/abs/1608.04959

PDF

https://arxiv.org/pdf/1608.04959.pdf

