Abstract
Temporal action localization (TAL), which involves recognizing and locating action instances, is a challenging task in video understanding. Most existing approaches directly predict action classes and regress boundary offsets, overlooking the differing importance of individual frames. In this paper, we propose an Action Sensitivity Learning framework (ASL) to tackle this task, which aims to assess the value of each frame and then leverage the generated action sensitivity to recalibrate the training procedure. We first introduce a lightweight Action Sensitivity Evaluator to learn action sensitivity at the class level and the instance level, respectively. The outputs of the two branches are combined to reweight the gradients of the two sub-tasks. Moreover, based on the action sensitivity of each frame, we design an Action Sensitive Contrastive Loss to enhance features, in which action-aware frames are sampled as positive pairs and pushed away from action-irrelevant frames. Extensive experiments on various action localization benchmarks (i.e., MultiThumos, Charades, Ego4D-Moment Queries v1.0, Epic-Kitchens 100, Thumos14 and ActivityNet1.3) show that ASL surpasses the state of the art in terms of average mAP across multiple types of scenarios, e.g., single-labeled, densely-labeled and egocentric.
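The abstract gives no equations, but the two key ideas can be sketched in minimal form. The sketch below is an illustrative assumption, not the paper's actual implementation: `weighted_task_loss` shows per-frame losses being reweighted by a learned sensitivity score, and `action_sensitive_contrastive_loss` shows an InfoNCE-style contrastive term where high-sensitivity (action-aware) frames are treated as positives and low-sensitivity frames as negatives. The threshold, temperature, and normalization choices are all hypothetical.

```python
import numpy as np

def weighted_task_loss(per_frame_losses, sensitivity):
    """Hypothetical sketch: reweight per-frame sub-task losses
    (classification or boundary regression) by learned action
    sensitivity, so informative frames dominate the gradient."""
    w = sensitivity / (sensitivity.sum() + 1e-8)  # normalized frame weights
    return float((w * per_frame_losses).sum())

def action_sensitive_contrastive_loss(features, sensitivity,
                                      threshold=0.5, tau=0.1):
    """Hypothetical InfoNCE-style contrastive loss: frames whose
    sensitivity exceeds `threshold` are positives (pulled together),
    the remaining frames are negatives (pushed away)."""
    # L2-normalize frame features so dot products are cosine similarities
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    pos = np.where(sensitivity > threshold)[0]
    neg = np.where(sensitivity <= threshold)[0]
    if len(pos) < 2 or len(neg) == 0:
        return 0.0  # not enough frames to form contrastive pairs
    losses = []
    for i in pos:
        others = [j for j in pos if j != i]
        pos_sim = np.exp(feats[i] @ feats[others].T / tau).sum()
        neg_sim = np.exp(feats[i] @ feats[neg].T / tau).sum()
        losses.append(-np.log(pos_sim / (pos_sim + neg_sim)))
    return float(np.mean(losses))
```

Under this reading, the contrastive term is always non-negative and shrinks as action-aware frames become more similar to each other than to action-irrelevant frames.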
URL
https://arxiv.org/abs/2305.15701