Paper Reading AI Learner

Cross-Video Contextual Knowledge Exploration and Exploitation for Ambiguity Reduction in Weakly Supervised Temporal Action Localization


Abstract

Weakly supervised temporal action localization (WSTAL) aims to localize actions in untrimmed videos using only video-level labels. Despite recent advances, existing approaches mainly follow a localization-by-classification pipeline that processes each segment individually and therefore exploits only limited contextual information. As a result, the model lacks a comprehensive understanding of action patterns (e.g., their appearance and temporal structure), leading to ambiguity in both classification learning and temporal localization. Our work addresses this from a novel perspective: it explores and exploits the cross-video contextual knowledge within the dataset to recover the dataset-level semantic structure of action instances using weak labels only, thereby indirectly improving the holistic understanding of fine-grained action patterns and alleviating the aforementioned ambiguities. Specifically, we propose an end-to-end framework consisting of a Robust Memory-Guided Contrastive Learning (RMGCL) module and a Global Knowledge Summarization and Aggregation (GKSA) module. First, the RMGCL module exploits the contrast and consistency of cross-video action features, helping to learn a more structured and compact embedding space and thus reducing ambiguity in classification learning. Second, the GKSA module efficiently summarizes and propagates representative cross-video action knowledge in a learnable manner to promote a holistic understanding of action patterns, which in turn enables the generation of high-confidence pseudo-labels for self-learning and thus alleviates ambiguity in temporal localization. Extensive experiments on THUMOS14, ActivityNet1.3, and FineAction demonstrate that our method outperforms state-of-the-art methods and can be easily plugged into other WSTAL methods.
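The abstract describes RMGCL only at a high level. As a rough illustration of the general idea (a contrastive objective between segment embeddings and a cross-video class memory), here is a minimal PyTorch sketch. It is our assumption, not the paper's implementation: the function names (`memory_guided_contrastive_loss`, `update_memory`), the prototype-per-class memory layout, and the `temperature`/`momentum` values are all hypothetical.

```python
import torch
import torch.nn.functional as F

def memory_guided_contrastive_loss(features, labels, memory, temperature=0.07):
    """Hypothetical sketch of a memory-guided contrastive loss.

    features: (N, D) embeddings of candidate action segments
    labels:   (N,) class indices derived from video-level weak labels
    memory:   (C, D) class prototypes accumulated across videos
    """
    f = F.normalize(features, dim=1)   # (N, D) unit-norm segment embeddings
    m = F.normalize(memory, dim=1)     # (C, D) unit-norm class prototypes
    logits = f @ m.t() / temperature   # (N, C) cosine similarity to prototypes
    # Pull each segment toward its own class prototype, push from the others.
    return F.cross_entropy(logits, labels)

@torch.no_grad()
def update_memory(memory, features, labels, momentum=0.9):
    """EMA update of the class prototypes with the current batch (assumed scheme)."""
    for c in labels.unique():
        cls_mean = features[labels == c].mean(dim=0)
        memory[c] = momentum * memory[c] + (1 - momentum) * cls_mean
    return memory
```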
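Likewise, the following sketch hints at how summarized cross-video knowledge could be aggregated back into per-segment features and then used to mint high-confidence pseudo-labels for self-learning. The single attention step, the `threshold` value, and all names are illustrative assumptions on our part, not the paper's GKSA formulation.

```python
import torch
import torch.nn.functional as F

def aggregate_global_knowledge(segment_feats, memory):
    """Assumed sketch: inject dataset-level class prototypes into one video's
    segment features via a single scaled-dot-product attention step.

    segment_feats: (T, D) features of one video's segments
    memory:        (C, D) cross-video class prototypes
    """
    scale = memory.shape[1] ** 0.5
    attn = F.softmax(segment_feats @ memory.t() / scale, dim=1)  # (T, C)
    return segment_feats + attn @ memory  # residual aggregation, (T, D)

def make_pseudo_labels(cas, threshold=0.8):
    """Keep only high-confidence segment predictions as hard pseudo-labels.

    cas: (T, C) class activation scores after knowledge aggregation
    """
    conf, cls = cas.max(dim=1)
    pseudo = torch.full_like(cls, -100)  # -100: PyTorch's default ignore_index
    pseudo[conf > threshold] = cls[conf > threshold]
    return pseudo
```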

URL

https://arxiv.org/abs/2308.12609

PDF

https://arxiv.org/pdf/2308.12609.pdf

