Abstract
Several recent methods have been proposed for text-to-video moment retrieval using natural language queries, but they require full supervision during training. However, acquiring a large number of training videos with temporal boundary annotations for every text description is extremely time-consuming and often not scalable. To cope with this issue, in this work we introduce the problem of learning from weak labels for the task of text-to-video moment retrieval. The supervision is weak because, during training, we only have access to video-text pairs rather than the temporal extent of the video to which each text description relates. We propose a joint visual-semantic embedding based framework that learns the notion of relevant segments in a video using only video-level sentence descriptions. Specifically, our main idea is to exploit the latent alignment between video frames and sentence descriptions using Text-Guided Attention (TGA), which is then used at test time to retrieve relevant moments. Experiments on two benchmark datasets demonstrate that our method achieves comparable performance to state-of-the-art fully supervised approaches.
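As a rough illustration of the core idea, the following is a minimal PyTorch sketch of a text-guided attention step, not the authors' implementation: frame features are scored against a sentence embedding in a shared space, the scores are normalized with a softmax, and the resulting weights pool the frames into a text-conditioned video representation. All names and shapes here (frame_feats, sent_emb, the 256-d embedding) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def text_guided_attention(frame_feats, sent_emb):
    """Minimal sketch of Text-Guided Attention (TGA).

    frame_feats: (T, d) frame-level features projected into the
                 joint visual-semantic embedding space.
    sent_emb:    (d,)   sentence embedding in the same space.
    Returns attention weights over the T frames and the attended
    (text-conditioned) video representation.
    """
    # Cosine-similarity score between each frame and the sentence.
    scores = F.cosine_similarity(frame_feats, sent_emb.unsqueeze(0), dim=1)  # (T,)
    # Normalize scores into attention weights over the frames.
    weights = torch.softmax(scores, dim=0)                                   # (T,)
    # Weighted sum of frame features: a sentence-specific video embedding.
    video_emb = weights @ frame_feats                                        # (d,)
    return weights, video_emb

# Toy usage: 8 frames in a 256-d joint embedding space.
frames = torch.randn(8, 256)
sentence = torch.randn(256)
w, v = text_guided_attention(frames, sentence)
# At test time, high-weight frames would indicate the retrieved moment.
```

In this reading, the attention weights themselves carry the localization signal: training only needs matched video-text pairs, while the per-frame weights learned as a byproduct are what select the moment at test time.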
URL
https://arxiv.org/abs/1904.03282