Abstract
Video action detectors are usually trained on video datasets with fully supervised temporal annotations. Building such video datasets is a very expensive task. To alleviate this problem, recent algorithms leverage weak labeling, where videos are untrimmed and only a video-level label is available. In this paper, we propose RefineLoc, a new method for weakly-supervised temporal action localization. RefineLoc follows an iterative refinement approach, estimating and training on snippet-level pseudo ground truth at every iteration. We show the benefit of this iterative approach and present an extensive analysis of different pseudo ground truth generators. We demonstrate the effectiveness of our model on two standard action datasets, ActivityNet v1.2 and THUMOS14. RefineLoc equipped with a segment prediction-based pseudo ground truth generator improves the state of the art in weakly-supervised temporal localization on the challenging and large-scale ActivityNet dataset by 4.2% and achieves performance comparable to the state of the art on THUMOS14.
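The iterative refinement idea above can be sketched in a few lines of Python. This is only an illustrative toy, not the authors' implementation: the thresholding rule, the toy "training" update, and all function names are assumptions, standing in for the segment prediction-based pseudo ground truth generator and the model retraining step described in the abstract.

```python
# Hedged sketch of a RefineLoc-style refinement loop (illustrative only).

def generate_pseudo_gt(snippet_scores, threshold=0.5):
    """Segment prediction-based generator (assumed form): snippets whose
    predicted foreground score exceeds a threshold become positive labels."""
    return [1 if s >= threshold else 0 for s in snippet_scores]

def train_step(snippet_scores, pseudo_gt, lr=0.3):
    """Toy stand-in for retraining: nudge each snippet score toward
    its snippet-level pseudo label."""
    return [s + lr * (g - s) for s, g in zip(snippet_scores, pseudo_gt)]

def refine_loc(initial_scores, iterations=5):
    """Alternate between estimating pseudo ground truth and training on it."""
    scores = list(initial_scores)
    for _ in range(iterations):
        pseudo_gt = generate_pseudo_gt(scores)  # estimate snippet-level pseudo GT
        scores = train_step(scores, pseudo_gt)  # retrain on the pseudo GT
    return scores

# Example: per-snippet foreground scores sharpen toward 0/1 over iterations.
refined = refine_loc([0.6, 0.4, 0.9, 0.2])
print([round(s, 3) for s in refined])
```

In this toy, confident snippets are reinforced and uncertain ones suppressed each round, mimicking how iterating on pseudo ground truth can progressively sharpen snippet-level localization.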
URL
https://arxiv.org/abs/1904.00227