Abstract
Weakly-supervised temporal action localization aims to localize and recognize actions in untrimmed videos using only video-level category labels during training. Without instance-level annotations, most existing methods follow the Segment-based Multiple Instance Learning (S-MIL) framework, in which segment predictions are supervised by video labels. However, the objective of acquiring segment-level scores during training is inconsistent with the goal of acquiring proposal-level scores during testing, leading to suboptimal results. To address this inconsistency, we propose a novel Proposal-based Multiple Instance Learning (P-MIL) framework that directly classifies candidate proposals in both the training and testing stages, with three key designs: 1) a surrounding contrastive feature extraction module that suppresses discriminative short proposals by considering surrounding contrastive information, 2) a proposal completeness evaluation module that inhibits low-quality proposals under the guidance of completeness pseudo labels, and 3) an instance-level rank consistency loss that achieves robust detection by leveraging the complementarity of the RGB and optical flow modalities. Extensive experimental results on two challenging benchmarks, THUMOS14 and ActivityNet, demonstrate the superior performance of our method.
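The core ideas can be sketched in a few lines. The snippet below is an illustrative stand-in, not the paper's actual architecture: `proposal_scores` mimics direct proposal-level classification by mean-pooling per-segment class scores inside each candidate proposal, and `rank_consistency_loss` is a plausible realization of the instance-level rank consistency idea as a symmetric KL divergence between softmax-normalized RGB and flow proposal scores. All function names and the pooling choice are assumptions for illustration.

```python
import numpy as np

def proposal_scores(seg_scores, proposals):
    """Classify each candidate proposal directly (P-MIL style).

    seg_scores: (T, C) array of per-segment class scores.
    proposals:  list of (start, end) segment-index ranges.
    Mean-pooling is an illustrative stand-in for the paper's
    proposal classifier.
    """
    return np.stack([seg_scores[s:e].mean(axis=0) for s, e in proposals])

def rank_consistency_loss(rgb_scores, flow_scores):
    """Encourage RGB and flow streams to rank proposals consistently.

    Both inputs are 1-D arrays of proposal scores for one action class.
    Symmetric KL between their softmax distributions is an assumed
    formulation of the rank consistency constraint.
    """
    def softmax(x):
        z = np.exp(x - x.max())
        return z / z.sum()
    p, q = softmax(rgb_scores), softmax(flow_scores)
    kl = lambda a, b: float((a * np.log(a / b)).sum())
    return 0.5 * (kl(p, q) + kl(q, p))
```

Under this sketch, identically ranked streams incur zero loss, while disagreeing rankings are penalized, which pushes the two modalities toward agreement on which proposals are action instances.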
URL
https://arxiv.org/abs/2305.17861