Boosting Weakly-Supervised Temporal Action Localization with Text Information

Abstract

Due to the lack of temporal annotation, current Weakly-supervised Temporal Action Localization (WTAL) methods generally suffer from over-complete or incomplete localization. In this paper, we aim to leverage text information to boost WTAL from two aspects: (a) a discriminative objective that enlarges the inter-class difference, thus reducing over-complete localization; and (b) a generative objective that enhances intra-class integrity, thus finding more complete temporal boundaries. For the discriminative objective, we propose a Text-Segment Mining (TSM) mechanism, which constructs a text description based on the action class label and uses the text as a query to mine all class-related segments. Without temporal annotation of actions, TSM compares the text query with entire videos across the dataset to mine the best-matching segments while ignoring irrelevant ones. Because videos of different categories share sub-actions, applying TSM alone is so strict that it neglects semantically related segments, which results in incomplete localization. We therefore introduce a generative objective named Video-text Language Completion (VLC), which focuses on all semantically related segments in a video to complete the text sentence. We achieve state-of-the-art performance on THUMOS14 and ActivityNet1.3. Surprisingly, we also find that our proposed method can be seamlessly applied to existing methods and improves their performance by a clear margin. The code is available at this https URL.
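The text-as-query idea behind TSM can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt template, the cosine similarity, the softmax temperature, and the function names `build_text_query` and `mine_segments` are all assumptions made for illustration, and the text and segment embeddings are assumed to come from some encoder that is not shown. The sketch only shows how a text query could weight video segments so that well-matched segments dominate a video-level score while poorly matched ones are suppressed.

```python
import math


def build_text_query(class_label):
    # Hypothetical prompt template; the abstract only states that a
    # description is constructed from the action class label.
    return f"a video of {class_label}"


def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mine_segments(text_embedding, segment_features, temperature=0.1):
    # Score every segment against the text query, then turn the scores
    # into attention weights with a temperature-scaled softmax over time.
    if not segment_features:
        raise ValueError("segment_features must not be empty")
    sims = [cosine(text_embedding, f) for f in segment_features]
    scaled = [s / temperature for s in sims]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Video-level matching score: attention-weighted mean similarity.
    video_score = sum(w * s for w, s in zip(weights, sims))
    return weights, video_score
```

With only video-level labels, a score like `video_score` is the kind of quantity that could be supervised, while `weights` indicates which segments the query matched. A low temperature makes the weighting sharper, which mirrors the strictness the abstract attributes to using TSM alone.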

Abstract (translated)

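Owing to the lack of temporal annotation, current Weakly-supervised Temporal Action Localization (WTAL) methods tend to fall into over-complete or incomplete localization. In this paper, we aim to use text information to improve WTAL in two respects: (a) enlarging the difference between classes to reduce over-complete localization; and (b) strengthening intra-class integrity to find more complete temporal boundaries. For the discriminative objective, we propose a Text-Segment Mining (TSM) mechanism, which builds a text description from the action class label and treats the text as a query to mine all class-related segments. Without temporal annotation of actions, TSM compares the text query with videos across the entire dataset to find the best-matching segments and ignore irrelevant ones. Because videos of different classes share the same sub-actions, applying TSM alone is too strict and overlooks semantically related segments, which leads to incomplete localization. We also introduce a generative objective named Video-text Language Completion (VLC), which focuses on all semantically related segments in a video to complete the sentence. We achieve state-of-the-art performance on THUMOS14 and ActivityNet1.3. Surprisingly, we also find that our proposed method can be seamlessly applied to existing methods and improves their performance by a clear margin. The code is available at this https URL.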

URL

https://arxiv.org/abs/2305.00607

PDF

https://arxiv.org/pdf/2305.00607.pdf
