Abstract
Localizing moments in a longer video via natural language queries is a new, challenging task at the intersection of language and video understanding. Though moment localization with natural language is similar to other language and vision tasks, such as natural language object retrieval in images, it offers an interesting opportunity to model temporal dependencies and reasoning in text. We propose a new model that explicitly reasons about different temporal segments in a video, and show that temporal context is important for localizing phrases that include temporal language. To benchmark whether our model, and other recent video localization models, can effectively reason about temporal language, we collect the novel TEMPOral reasoning in video and language (TEMPO) dataset. Our dataset consists of two parts: a dataset with real videos and template sentences (TEMPO - Template Language), which allows for controlled studies on temporal language, and a human language dataset, which consists of temporal sentences annotated by humans (TEMPO - Human Language).
URL
https://arxiv.org/abs/1809.01337