Paper Reading AI Learner

Weakly-Supervised Temporal Action Localization with Bidirectional Semantic Consistency Constraint

2023-04-25 07:20:33
Guozhang Li, De Cheng, Xinpeng Ding, Nannan Wang, Jie Li, Xinbo Gao

Abstract

Weakly Supervised Temporal Action Localization (WTAL) aims to classify actions and localize their temporal boundaries in videos, given only video-level category labels in the training data. Due to the lack of boundary information during training, existing approaches formulate WTAL as a classification problem, i.e., generating a temporal class activation map (T-CAM) for localization. However, with only a classification loss, the model is sub-optimized: action-related scenes alone are enough to distinguish different class labels. Regarding other actions occurring in action-related scenes (i.e., the same scenes as positive actions) as co-scene actions, this sub-optimized model misclassifies co-scene actions as positive actions. To address this misclassification, we propose a simple yet efficient method, named the bidirectional semantic consistency constraint (Bi-SCC), to discriminate positive actions from co-scene actions. Bi-SCC first applies a temporal context augmentation to generate an augmented video that breaks the correlation between positive actions and their co-scene actions across videos; then, a semantic consistency constraint (SCC) enforces consistency between the predictions on the original and augmented videos, thereby suppressing co-scene actions. However, we find that this augmentation destroys the original temporal context, so naively applying the consistency constraint harms the completeness of the localized positive actions. Hence, we apply the SCC bidirectionally, cross-supervising the original and augmented videos, to suppress co-scene actions while preserving the integrity of positive actions. Finally, our proposed Bi-SCC can be plugged into current WTAL approaches to improve their performance. Experimental results show that our approach outperforms state-of-the-art methods on THUMOS14 and ActivityNet.
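The pipeline described in the abstract, a temporal context augmentation followed by a bidirectional consistency loss between the two T-CAMs, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract gives no implementation details, so the mask-based context splicing, the MSE loss form, and all function names here are assumptions.

```python
# Illustrative sketch only; the splicing scheme and loss form are assumptions.

def temporal_context_augment(snippets, action_mask, donor_snippets):
    """Splice snippets from a different video into the non-action context,
    breaking the correlation between positive actions and their scene."""
    out, j = [], 0
    for snip, is_action in zip(snippets, action_mask):
        if is_action:
            out.append(snip)  # keep positive-action snippets unchanged
        else:
            out.append(donor_snippets[j % len(donor_snippets)])
            j += 1            # swap in foreign (inter-video) context
    return out

def mse(a, b):
    """Mean squared error between two equal-length score sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def bidirectional_scc_loss(tcam_orig, tcam_aug, alpha=0.5):
    """Cross-supervise snippet-level T-CAM scores of the two views.

    In a gradient-based framework each direction would detach its target
    (stop-gradient); both terms are written out explicitly here.
    Direction 1 lets the augmented view suppress co-scene activations in
    the original; direction 2 lets the original view preserve the
    completeness of true action instances in the augmented video.
    """
    loss_suppress = mse(tcam_orig, tcam_aug)  # augmented view as target
    loss_complete = mse(tcam_aug, tcam_orig)  # original view as target
    return alpha * loss_suppress + (1 - alpha) * loss_complete
```

Because the augmentation leaves action snippets in place, the two T-CAMs stay aligned snippet-for-snippet, which is what makes a pointwise consistency loss applicable.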


URL

https://arxiv.org/abs/2304.12616

PDF

https://arxiv.org/pdf/2304.12616.pdf

