Abstract
Weakly supervised temporal action localization (WS-TAL) aims to localize complete action instances and categorize them using only video-level labels. Action-background ambiguity, caused primarily by background noise introduced during feature aggregation and by intra-action variation, remains a significant challenge for existing WS-TAL methods. In this paper, we introduce a hybrid multi-head attention (HMHA) module and a generalized uncertainty-based evidential fusion (GUEF) module to address this problem. The proposed HMHA enhances RGB and optical-flow features by filtering out redundant information and adjusting their feature distributions to better suit the WS-TAL task. The proposed GUEF adaptively suppresses the interference of background noise by fusing snippet-level evidence to refine the uncertainty measurement and select superior foreground features, which enables the model to concentrate on integral action instances and achieve better action localization and classification performance. Experimental results on the THUMOS14 dataset demonstrate that our method outperforms state-of-the-art methods. Our code is available at \url{this https URL}.
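The abstract does not spell out the GUEF formulation, but evidential fusion methods of this kind typically build on the subjective-logic / Dirichlet mapping from per-class evidence to belief and uncertainty masses. A minimal sketch of that standard snippet-level uncertainty computation (the exact fusion rule used in the paper is not given here; function name and shapes are illustrative):

```python
import numpy as np

def evidential_uncertainty(evidence):
    """Map non-negative per-class evidence for one snippet to
    (belief masses, uncertainty mass) via subjective logic."""
    K = evidence.shape[0]            # number of action classes
    alpha = evidence + 1.0           # Dirichlet parameters
    S = alpha.sum()                  # Dirichlet strength
    belief = evidence / S            # per-class belief mass
    u = K / S                        # uncertainty mass; belief.sum() + u == 1
    return belief, u

# A snippet with strong evidence for class 0 yields low uncertainty,
# so a fusion module can favor it as foreground.
belief, u = evidential_uncertainty(np.array([9.0, 1.0, 0.0]))
```

Under this mapping, snippets with little total evidence receive a large uncertainty mass `u`, which is what lets an uncertainty-based fusion scheme down-weight ambiguous background snippets.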
URL
https://arxiv.org/abs/2412.19418