Abstract
Weakly-supervised temporal action localization aims to locate action regions and identify action categories in untrimmed videos, using only video-level labels as supervision. Pseudo-label generation is a promising strategy for addressing this challenging problem, but most existing methods are limited to employing snippet-wise classification results to guide the generation, ignoring that the natural temporal structure of the video can also provide rich information to assist this generation process. In this paper, we propose a novel weakly-supervised temporal action localization method based on inferring snippet-feature affinity. First, we design an affinity inference module that exploits the affinity relationship between temporally neighboring snippets to generate initial coarse pseudo labels. Then, we introduce an information interaction module that refines the coarse labels by enhancing the discriminative power of snippet features through exploring intra- and inter-video relationships. Finally, the high-fidelity pseudo labels generated by the information interaction module are used to supervise the training of the action localization network. Extensive experiments on two publicly available datasets, i.e., THUMOS14 and ActivityNet v1.3, demonstrate that our proposed method achieves significant improvements over state-of-the-art methods.
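To make the affinity idea concrete, below is a minimal sketch of how temporal-neighbor affinity between snippet features might be used to smooth per-snippet class scores into coarse pseudo labels. The function names, the cosine-affinity choice, and the neighbor-weighted blending rule are illustrative assumptions for exposition, not the paper's actual formulation.

```python
import numpy as np

def cosine_affinity(a, b):
    # Cosine similarity between two snippet feature vectors (illustrative).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def coarse_pseudo_labels(features, scores):
    """Blend each snippet's class scores with its temporal neighbors',
    weighted by feature affinity, then take the argmax as a coarse
    pseudo label. features: (T, D) array; scores: (T, C) array.
    This is a hypothetical smoothing rule, not the method in the paper.
    """
    T = len(features)
    smoothed = np.array(scores, dtype=float)
    for t in range(T):
        acc = np.array(scores[t], dtype=float)
        w_total = 1.0  # the snippet's own score has unit weight
        for n in (t - 1, t + 1):  # immediate temporal neighbors
            if 0 <= n < T:
                w = max(cosine_affinity(features[t], features[n]), 0.0)
                acc += w * scores[n]
                w_total += w
        smoothed[t] = acc / w_total
    return smoothed.argmax(axis=1)  # hard coarse pseudo label per snippet
```

In this toy setting, a snippet whose classifier score disagrees with a highly similar neighbor gets pulled toward the neighbor's prediction, which is the intuition behind letting temporal structure correct snippet-wise classification noise.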
URL
https://arxiv.org/abs/2303.12332