Abstract
Learning a joint language-visual embedding has a number of very appealing properties and can result in a variety of practical applications, including natural language image/video annotation and search. In this work, we study three different joint language-visual neural network model architectures. We evaluate our models on the large-scale LSMDC16 movie dataset for two tasks: 1) standard ranking for video annotation and retrieval, and 2) our proposed movie multiple-choice test. This test facilitates automatic evaluation of visual-language models for natural language video annotation based on human activities. In addition to the original Audio Description (AD) captions provided as part of LSMDC16, we collected and will make available: a) manually generated re-phrasings of those captions obtained using Amazon MTurk, and b) automatically generated human-activity elements in "Predicate + Object" (PO) phrases based on "Knowlywood", an activity knowledge mining model. Our best model achieves Recall@10 of 19.2% on annotation and 18.9% on video retrieval tasks for a subset of 1000 samples. For the multiple-choice test, our best model achieves an accuracy of 58.11% over the whole LSMDC16 public test set.
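The Recall@10 figures above follow the standard cross-modal retrieval protocol: for each query, candidates are ranked by similarity in the joint embedding space, and the metric is the fraction of queries whose ground-truth match appears in the top 10. A minimal sketch of this metric, assuming a square similarity matrix in which entry (i, i) corresponds to the ground-truth caption-video pair (the matrix name and layout are illustrative, not from the paper):

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int = 10) -> float:
    """Fraction of queries (rows) whose ground-truth candidate
    (assumed to be column i for query i) ranks in the top k."""
    hits = 0
    for i, row in enumerate(sim):
        order = np.argsort(-row)                  # candidates sorted by similarity, descending
        rank = int(np.where(order == i)[0][0])    # position of the ground-truth candidate
        if rank < k:
            hits += 1
    return hits / len(sim)

# Toy example: an identity similarity matrix ranks every
# ground-truth pair first, so Recall@1 is 1.0.
print(recall_at_k(np.eye(5), k=1))  # 1.0
```

The same routine evaluates both directions of the task: rows as videos and columns as captions gives annotation, and the transposed matrix gives retrieval.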
URL
https://arxiv.org/abs/1609.08124