Abstract
Weakly-supervised temporal action localization aims to identify and localize action instances in untrimmed videos using only video-level action labels. When humans watch videos, we adapt our abstract-level knowledge about actions to different video scenarios and detect whether certain actions are occurring. In this paper, we mimic this human capability and offer a new perspective on locating and identifying multiple actions in a video. We propose VQK-Net, a network with video-specific query-key attention modeling that learns a unique query for each action category of each input video. The learned queries not only encode the actions' knowledge features at the abstract level but also adapt this knowledge to the target video scenario; they are then used to detect the presence of the corresponding actions along the temporal dimension. To better learn these action category queries, we exploit not only the features of the current input video but also the correlation between different videos, through a novel video-specific action category query learner trained with a query similarity loss. Finally, we conduct extensive experiments on three commonly used datasets (THUMOS14, ActivityNet1.2, and ActivityNet1.3) and achieve state-of-the-art performance.
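The core query-key scoring step described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name, tensor shapes, and the scaled dot-product form are assumptions standard to query-key attention.

```python
import numpy as np

def query_key_scores(queries, keys):
    """Score each learned action-category query against every temporal segment.

    queries: (C, D) array — one query vector per action category
             (in VQK-Net these are learned per input video).
    keys:    (T, D) array — per-segment video features along time.
    Returns: (C, T) array — per-category presence scores over the
             temporal dimension.
    """
    d = queries.shape[-1]
    # Scaled dot-product scoring, as in standard query-key attention.
    return queries @ keys.T / np.sqrt(d)

# Toy example: 2 action categories, 4 temporal segments, 8-dim features.
rng = np.random.default_rng(0)
q = rng.standard_normal((2, 8))
k = rng.standard_normal((4, 8))
scores = query_key_scores(q, k)
print(scores.shape)  # one temporal score sequence per category: (2, 4)
```

Thresholding or top-scoring each row of `scores` would then yield candidate temporal locations for that action category, which is the detection role the abstract assigns to the learned queries.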
URL
https://arxiv.org/abs/2305.04186