Video-Specific Query-Key Attention Modeling for Weakly-Supervised Temporal Action Localization

2023-05-07 04:18:22
Xijun Wang, Aggelos K. Katsaggelos

Abstract

Weakly-supervised temporal action localization aims to identify and localize action instances in untrimmed videos using only video-level action labels. When humans watch videos, we adapt our abstract-level knowledge about actions to different video scenarios and detect whether certain actions are occurring. In this paper, we mimic this human capability and bring a new perspective to locating and identifying multiple actions in a video. We propose a network named VQK-Net with video-specific query-key attention modeling that learns a unique query for each action category of each input video. The learned queries not only carry abstract-level knowledge of the actions but also adapt that knowledge to the target video scenario, and they are used to detect the presence of the corresponding action along the temporal dimension. To better learn these action category queries, we exploit not only the features of the current input video but also the correlations between different videos, through a novel video-specific action category query learner trained with a query similarity loss. Finally, we conduct extensive experiments on three commonly used datasets (THUMOS14, ActivityNet1.2, and ActivityNet1.3) and achieve state-of-the-art performance.
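
To make the mechanism concrete, below is a minimal, hypothetical PyTorch sketch of the idea the abstract describes: per-category queries are adapted to the current video and matched against temporal snippet features (keys) to score each action's presence over time, alongside a toy query similarity loss that pulls together the queries of categories shared by two videos. All names (QueryKeyAttention, video_adapter, query_similarity_loss) and design details here are assumptions for illustration, not the paper's actual VQK-Net implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryKeyAttention(nn.Module):
    """Illustrative sketch (not the paper's code) of video-specific
    query-key attention: a learnable abstract query per action category
    is adapted to the input video, then matched against temporal
    snippet features (keys) to score each action at every time step."""

    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        # Abstract-level query embedding per action category (assumed form).
        self.class_queries = nn.Parameter(torch.randn(num_classes, feat_dim))
        # Adapter that fits the abstract queries to the video's context.
        self.video_adapter = nn.Linear(feat_dim, feat_dim)
        self.key_proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, snippets: torch.Tensor):
        # snippets: (T, feat_dim) temporal features of one untrimmed video.
        video_context = snippets.mean(dim=0)                   # (feat_dim,)
        # Video-specific queries: abstract query + adapted video context.
        queries = self.class_queries + self.video_adapter(video_context)
        keys = self.key_proj(snippets)                          # (T, feat_dim)
        # Per-class temporal scores (class activation sequence), (C, T).
        scores = queries @ keys.t() / keys.shape[-1] ** 0.5
        return queries, scores

def query_similarity_loss(q_a: torch.Tensor,
                          q_b: torch.Tensor,
                          shared_mask: torch.Tensor) -> torch.Tensor:
    """Hedged guess at the query similarity loss: queries of action
    categories shared by two videos should be close (cosine similarity);
    other categories are left unconstrained."""
    sim = F.cosine_similarity(q_a, q_b, dim=-1)                 # (C,)
    return ((1.0 - sim) * shared_mask).sum() / shared_mask.sum().clamp(min=1)
```

In this reading, localization would follow by thresholding each row of `scores` over the temporal axis, while the video-level classification label supervises the aggregated scores; both choices are assumptions consistent with common weakly-supervised pipelines, not details confirmed by the abstract.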

URL

https://arxiv.org/abs/2305.04186

PDF

https://arxiv.org/pdf/2305.04186.pdf

