Abstract
This paper attacks the challenging problem of zero-example video retrieval. In this retrieval paradigm, an end user searches for unlabeled videos with ad-hoc queries expressed in natural language text, with no visual example provided. The majority of existing methods are concept based: they extract relevant concepts from queries and videos and use them to establish associations between the two modalities. In contrast, this paper follows a novel trend of concept-free, deep learning based encoding. To that end, we propose a dual deep encoding network that works on both the video and query sides. The network can be flexibly coupled with an existing common space learning module for video-text similarity computation. Experiments on three benchmarks, i.e., MSR-VTT and the TRECVID 2016 and 2017 Ad-hoc Video Search tasks, show that the proposed method establishes a new state of the art for zero-example video retrieval.
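To make the retrieval step concrete, the following is a minimal sketch of ranking videos against a text query once both have been mapped into a common space. It is not the paper's dual encoding network: the embeddings are hand-written toy vectors, cosine similarity is an assumed choice of similarity measure, and the names `cosine`, `rank_videos`, and the video identifiers are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_videos(query_vec, video_vecs):
    """Return video ids sorted by descending similarity to the query."""
    scores = {vid: cosine(query_vec, vec) for vid, vec in video_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy 3-d embeddings standing in for encoder outputs (assumed values).
query = [1.0, 0.0, 1.0]
videos = {
    "video_a": [0.9, 0.1, 0.8],
    "video_b": [0.0, 1.0, 0.0],
    "video_c": [0.5, 0.5, 0.5],
}

ranking = rank_videos(query, videos)
print(ranking)
```

In a real system the toy vectors would be replaced by the outputs of trained query and video encoders; the ranking logic itself would stay the same.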
URL
https://arxiv.org/abs/1809.06181