Paper Reading AI Learner

Full-Stage Pseudo Label Quality Enhancement for Weakly-supervised Temporal Action Localization

2024-07-12 03:53:55
Qianhan Feng, Wenshuo Li, Tong Lin, Xinghao Chen

Abstract

Weakly-supervised Temporal Action Localization (WSTAL) aims to localize actions in untrimmed videos using only video-level supervision. The latest WSTAL methods introduce a pseudo label learning framework to bridge the gap between classification-based training and localization-oriented inference targets, and achieve state-of-the-art results. In these frameworks, a classification-based model generates pseudo labels for a regression-based student model to learn from. However, the quality of the pseudo labels, a key factor in the final result, has not been carefully studied. In this paper, we propose a set of simple yet effective pseudo label quality enhancement mechanisms to build our FuSTAL framework. FuSTAL enhances pseudo label quality at three stages: cross-video contrastive learning at the proposal Generation-Stage, prior-based filtering at the proposal Selection-Stage, and EMA-based distillation at the Training-Stage. These designs improve pseudo label quality at different stages of the framework and help produce more informative, less erroneous, and smoother action proposals. With the help of these comprehensive designs at all stages, FuSTAL achieves an average mAP of 50.8% on THUMOS'14, outperforming the previous best method by 1.2%, and becomes the first method to reach the 50% milestone.
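
The EMA-based distillation mentioned above typically keeps a teacher whose weights are an exponential moving average of the student's weights, which smooths the pseudo labels the teacher produces over training. The following is a minimal sketch of that general idea, not the authors' actual implementation; the function name `ema_update` and the decay value 0.999 are illustrative assumptions.

```python
def ema_update(teacher_weights, student_weights, decay=0.999):
    """Move each teacher weight a small step toward the student weight.

    A decay close to 1.0 makes the teacher change slowly, so its
    predictions (and hence the pseudo labels) vary smoothly over time.
    """
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_weights, student_weights)]


# Toy usage: after repeated updates the teacher drifts toward the student.
teacher = [0.0, 0.0]
student = [1.0, 1.0]
for _ in range(3):
    teacher = ema_update(teacher, student)
```

In practice the same update is applied to every parameter tensor of the teacher network after each optimizer step on the student.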

Abstract (translated)

Weakly-supervised Temporal Action Localization (WSTAL) aims to localize actions in untrimmed videos using only video-level supervision. The most advanced WSTAL methods introduce a pseudo label learning framework to bridge the gap between classification-based training and inference targets, achieving the best results. In these frameworks, a classification-based model generates pseudo labels for a regression-based student model to learn from. However, the quality of the pseudo labels in the framework, a key factor in the final result, has not been carefully studied. In this paper, we propose a set of simple yet effective pseudo label quality enhancement mechanisms to build our FuSTAL framework. FuSTAL enhances pseudo label quality through cross-video contrastive learning at the proposal generation stage, prior-based filtering at the proposal selection stage, and EMA-based distillation at the training stage. These designs improve pseudo label quality at different stages of the framework and help produce more informative, more accurate, and smoother action proposals. With the help of these comprehensive designs at all stages, FuSTAL achieves an average mAP of 50.8% on THUMOS'14, outperforming the previous best method by 1.2% and becoming the first method to reach the 50% milestone.

URL

https://arxiv.org/abs/2407.08971

PDF

https://arxiv.org/pdf/2407.08971.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Time_Series Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot