Abstract
The crux of semi-supervised temporal action localization (SS-TAL) lies in extracting valuable information from abundant unlabeled videos. However, current approaches predominantly focus on building models that are robust to the error-prone target class (i.e., the predicted class with the highest confidence) while ignoring informative semantics within non-target classes. This paper approaches SS-TAL from a novel perspective by advocating learning from non-target classes rather than focusing solely on the target class. The proposed approach partitions the label space of the predicted class distribution into four distinct subspaces: the target class, positive classes, negative classes, and ambiguous classes. The aim is to mine both positive and negative semantics that are absent in the target class, while excluding ambiguous classes. To this end, we first devise strategies to adaptively select high-quality positive and negative classes from the label space, by modeling both the confidence and rank of a class in relation to those of the target class. Then, we introduce novel positive and negative losses designed to guide the learning process, pushing predictions closer to positive classes and away from negative classes. Finally, the positive and negative processes are integrated into a hybrid positive-negative learning framework, facilitating the utilization of non-target classes in both labeled and unlabeled videos. Experimental results on THUMOS14 and ActivityNet v1.3 demonstrate the superiority of the proposed method over prior state-of-the-art approaches.
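The abstract does not give the exact selection rules or loss definitions, so the following is only a minimal sketch of how the label-space partition and the two losses might look for a single predicted class distribution. The function names, the thresholds `pos_ratio`, `neg_ratio`, and `top_k`, and the log-likelihood form of both losses are all assumptions made for illustration, not the paper's actual formulation.

```python
import numpy as np

def partition_label_space(probs, pos_ratio=0.5, neg_ratio=0.05, top_k=3):
    """Split classes into target / positive / negative / ambiguous.

    Hypothetical rule (not from the paper): a non-target class is
    positive if its confidence is at least pos_ratio times the target
    confidence and it ranks within the top_k classes; it is negative if
    its confidence is at most neg_ratio times the target confidence and
    it ranks outside the top_k. Everything else is ambiguous.
    """
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(-probs, kind="stable")   # class indices, highest confidence first
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(probs))        # rank 0 is the target class
    target = int(order[0])
    rel = probs / probs[target]                 # confidence relative to the target
    non_target = np.arange(len(probs)) != target
    positive = non_target & (rel >= pos_ratio) & (ranks < top_k)
    negative = non_target & (rel <= neg_ratio) & (ranks >= top_k)
    ambiguous = non_target & ~positive & ~negative
    return target, positive, negative, ambiguous

def positive_negative_loss(probs, positive, negative, eps=1e-8):
    """Assumed loss forms: raise p on positive classes, lower p on negative ones."""
    probs = np.asarray(probs, dtype=float)
    pos_loss = -np.log(probs[positive] + eps).mean() if positive.any() else 0.0
    neg_loss = -np.log(1.0 - probs[negative] + eps).mean() if negative.any() else 0.0
    return float(pos_loss), float(neg_loss)

# Toy distribution over six action classes.
probs = [0.50, 0.30, 0.12, 0.05, 0.02, 0.01]
target, pos, neg, amb = partition_label_space(probs)
pos_loss, neg_loss = positive_negative_loss(probs, pos, neg)
```

With these assumed thresholds, class 0 is the target, class 1 is selected as positive, classes 4 and 5 as negative, and classes 2 and 3 are left out as ambiguous, so they contribute to neither loss.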
URL
https://arxiv.org/abs/2403.11189