Abstract
Weakly-supervised Temporal Action Localization (WSTAL) aims to localize actions in untrimmed videos using only video-level supervision. Recent WSTAL methods introduce a pseudo-label learning framework to bridge the gap between classification-based training and the localization target at inference, and achieve state-of-the-art results. In these frameworks, a classification-based model generates pseudo labels from which a regression-based student model learns. However, the quality of the pseudo labels, a key factor in the final result, has not been carefully studied. In this paper, we propose a set of simple yet efficient mechanisms for enhancing pseudo-label quality, which together form our FuSTAL framework. FuSTAL enhances pseudo-label quality at three stages: cross-video contrastive learning at the proposal Generation-Stage, prior-based filtering at the proposal Selection-Stage, and EMA-based distillation at the Training-Stage. Together, these designs help produce action proposals that are more informative, smoother, and less often false. With these designs at all stages, FuSTAL achieves an average mAP of 50.8% on THUMOS'14, outperforming the previous best method by 1.2% and becoming the first method to reach the 50% milestone.
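The abstract does not spell out how the EMA-based distillation is carried out. As a rough illustration of the general idea only, the sketch below shows a generic exponential-moving-average weight update, in which a slowly changing copy of the student's parameters is maintained. The function name `ema_update`, the dictionary-of-lists parameter layout, and the `decay` values are assumptions made for this example, not details taken from FuSTAL.

```python
from typing import Dict, List


def ema_update(
    teacher: Dict[str, List[float]],
    student: Dict[str, List[float]],
    decay: float = 0.999,
) -> None:
    """Update teacher weights in place as an exponential moving average.

    Hypothetical helper: teacher <- decay * teacher + (1 - decay) * student.
    Parameters are plain lists of floats keyed by name (an assumption made
    to keep the sketch dependency-free).
    """
    for name, student_values in student.items():
        teacher_values = teacher[name]
        for i, s in enumerate(student_values):
            teacher_values[i] = decay * teacher_values[i] + (1.0 - decay) * s


# Toy usage: one parameter tensor "w" with two entries.
teacher = {"w": [0.0, 1.0]}
student = {"w": [1.0, 1.0]}
ema_update(teacher, student, decay=0.9)
```

With a decay close to 1, the averaged weights change slowly, which is the usual reason EMA copies are associated with smoother predictions; how FuSTAL uses the averaged model is described in the paper itself.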
URL
https://arxiv.org/abs/2407.08971