Abstract
Weakly-supervised action localization aims to recognize and localize action instances in untrimmed videos using only video-level labels. Most existing models rely on multiple instance learning (MIL), where predictions for unlabeled instances are supervised by classifying labeled bags. MIL-based methods are relatively well studied and achieve strong classification performance, but weaker localization: they typically locate temporal regions via video-level classification and overlook the temporal variations of feature semantics. To address this problem, we propose a novel attention-based hierarchically-structured latent model that learns these temporal variations. Our model comprises two components: an unsupervised change-point detection module that learns latent representations of video features in a temporal hierarchy and detects change-points from their rates of change, and an attention-based classification model that selects foreground change-points as action boundaries. To evaluate the effectiveness of our model, we conduct extensive experiments on two benchmark datasets, THUMOS-14 and ActivityNet-v1.3. The experiments show that our method outperforms current state-of-the-art weakly-supervised methods and even achieves performance comparable to fully-supervised methods.
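The paper's detection module learns hierarchical latent representations, which the abstract does not detail. As a loose illustration of the rate-of-change idea only, here is a minimal NumPy sketch that flags time steps where per-snippet features change fastest; the function name, feature shapes, and threshold are illustrative assumptions, not from the paper:

```python
import numpy as np

def detect_change_points(features, threshold=1.0):
    """Flag time steps where feature semantics change fastest.

    features: (T, D) array of per-snippet video features.
    Returns indices t where the rate of change ||f[t] - f[t-1]||
    exceeds the (assumed) threshold.
    """
    # Rate of change between consecutive snippets: (T-1,) vector of norms.
    rates = np.linalg.norm(np.diff(features, axis=0), axis=1)
    # Shift by 1 so the returned index points at the snippet after the jump.
    return np.flatnonzero(rates > threshold) + 1

# Toy sequence: constant features, then an abrupt semantic shift at t=5.
feats = np.concatenate([np.zeros((5, 8)), np.ones((5, 8))])
print(detect_change_points(feats))  # → [5]
```

In the actual model, such change-points would be detected in learned latent space rather than raw features, and an attention-based classifier would then keep only those belonging to foreground actions as boundaries.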
URL
https://arxiv.org/abs/2308.09946