Weakly-Supervised Temporal Action Localization by Inferring Snippet-Feature Affinity

2023-03-22 06:08:34
Wulian Yun, Mengshi Qi, Chuanming Wang, Huadong Ma

Abstract

Weakly-supervised temporal action localization aims to locate action regions and identify action categories in untrimmed videos, using only video-level labels as supervision. Pseudo label generation is a promising strategy for this challenging problem, but most existing methods rely solely on snippet-wise classification results to guide the generation, ignoring that the natural temporal structure of the video can also provide rich information to assist the generation process. In this paper, we propose a novel weakly-supervised temporal action localization method by inferring snippet-feature affinity. First, we design an affinity inference module that exploits the affinity relationship between temporally neighboring snippets to generate initial coarse pseudo labels. Then, we introduce an information interaction module that refines the coarse labels by enhancing the discriminative nature of snippet features through exploring intra- and inter-video relationships. Finally, the high-fidelity pseudo labels generated from the information interaction module are used to supervise the training of the action localization network. Extensive experiments on two publicly available datasets, i.e., THUMOS14 and ActivityNet v1.3, demonstrate that our proposed method achieves significant improvements compared to state-of-the-art methods.
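
To make the idea of affinity between temporally neighboring snippets more concrete, here is a minimal sketch, not the paper's actual module. It assumes that affinity is measured as cosine similarity between adjacent snippet features and that a fixed threshold marks a likely boundary; the function names `neighbor_affinity` and `split_by_affinity`, the threshold value, and the toy features are all hypothetical choices for illustration. The abstract does not specify how affinity is computed or how coarse pseudo labels are derived from it.

```python
from __future__ import annotations

import numpy as np


def neighbor_affinity(features: np.ndarray) -> np.ndarray:
    """Cosine similarity between each snippet and its next temporal neighbor.

    features: (T, D) array of snippet features. Returns an array of shape (T - 1,).
    Assumption: cosine similarity stands in for the paper's affinity measure.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-8, None)  # avoid division by zero
    return np.sum(unit[:-1] * unit[1:], axis=1)


def split_by_affinity(affinity: np.ndarray, threshold: float = 0.5) -> list[tuple[int, int]]:
    """Cut the snippet sequence wherever neighbor affinity falls below threshold.

    Returns inclusive (start, end) snippet index ranges.
    Assumption: a fixed threshold is a simplification chosen for this sketch.
    """
    cuts = np.flatnonzero(affinity < threshold) + 1  # first index of each new segment
    starts = np.concatenate(([0], cuts))
    ends = np.concatenate((cuts - 1, [len(affinity)]))
    return [(int(s), int(e)) for s, e in zip(starts, ends)]


# Toy input: three similar snippets followed by two snippets of a different kind.
toy_features = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
toy_affinity = neighbor_affinity(toy_features)
toy_segments = split_by_affinity(toy_affinity)
```

In this toy input, the affinity drops only between the third and fourth snippets, so the sequence splits into two segments. Turning such segments into class-specific pseudo labels would require further steps that the abstract does not detail.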

URL

https://arxiv.org/abs/2303.12332

PDF

https://arxiv.org/pdf/2303.12332.pdf
