Improving Weakly Supervised Temporal Action Localization by Bridging Train-Test Gap in Pseudo Labels

2023-04-17 03:47:41
Jingqiu Zhou, Linjiang Huang, Liang Wang, Si Liu, Hongsheng Li

Abstract

Weakly supervised temporal action localization aims to generate temporal boundaries for actions of interest while also classifying their categories. Pseudo-label-based methods have been widely studied recently as an effective solution. However, existing methods generate pseudo labels during training and make predictions during testing under different pipelines or settings, resulting in a gap between training and testing. In this paper, we propose to generate high-quality pseudo labels from the predicted action boundaries. Nevertheless, we note that existing post-processing, such as NMS, leads to information loss and is therefore insufficient for generating high-quality action boundaries. More importantly, transforming action boundaries into pseudo labels is challenging, since the predicted action instances generally overlap and have different confidence scores. In addition, the generated pseudo labels can be fluctuating and inaccurate at the early stage of training, and without a self-correction mechanism they may repeatedly reinforce false predictions. To tackle these issues, we propose an effective pipeline for learning better pseudo labels. First, we propose a Gaussian weighted fusion module to preserve the information of action instances and obtain high-quality action boundaries. Second, we formulate pseudo-label generation as an optimization problem under constraints on the confidence scores of action instances. Finally, we introduce the idea of $\Delta$ pseudo labels, which gives the model the ability to self-correct. Our method outperforms existing methods on two benchmarks, THUMOS14 and ActivityNet1.3, with gains of 1.9\% on THUMOS14 and 3.7\% on ActivityNet1.3 in average mAP.
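
To make the contrast with NMS concrete, the sketch below fuses overlapping temporal proposals instead of discarding them. It is only an illustration of the general idea and not the paper's module: the abstract does not specify the fusion rule, so the weighting (confidence score times a Gaussian of `1 - tIoU`), the function names `temporal_iou` and `gaussian_weighted_fusion`, and the parameters `iou_threshold` and `sigma` are all assumptions made for this example.

```python
import math


def temporal_iou(a, b):
    """Temporal IoU between two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def gaussian_weighted_fusion(proposals, iou_threshold=0.5, sigma=0.5):
    """Fuse overlapping (start, end, score) proposals of one action class.

    Hypothetical rule (an assumption, not taken from the paper): every
    proposal overlapping the current top-scoring one contributes to the
    fused boundaries with weight score * exp(-(1 - tIoU)^2 / (2 * sigma^2)),
    whereas hard NMS would simply drop it.
    """
    remaining = sorted(proposals, key=lambda p: p[2], reverse=True)
    fused = []
    while remaining:
        anchor = remaining[0]
        cluster = [(anchor, 1.0)]  # the anchor always belongs to its own cluster
        rest = []
        for p in remaining[1:]:
            iou = temporal_iou(anchor[:2], p[:2])
            if iou >= iou_threshold:
                cluster.append((p, iou))
            else:
                rest.append(p)
        weights = [
            p[2] * math.exp(-((1.0 - iou) ** 2) / (2.0 * sigma ** 2))
            for p, iou in cluster
        ]
        total = sum(weights)
        if total > 0:
            start = sum(w * p[0] for w, (p, _) in zip(weights, cluster)) / total
            end = sum(w * p[1] for w, (p, _) in zip(weights, cluster)) / total
        else:
            # All scores are zero: fall back to the anchor's own boundaries.
            start, end = anchor[0], anchor[1]
        fused.append((start, end, anchor[2]))
        remaining = rest
    return fused


proposals = [(10.0, 20.0, 0.9), (11.0, 21.0, 0.6), (40.0, 50.0, 0.8)]
fused = gaussian_weighted_fusion(proposals)
```

In this toy input the first two proposals overlap strongly, so they are merged into a single instance whose boundaries lie between the two originals, while the isolated third proposal passes through unchanged. Hard NMS would instead keep the first proposal as-is and discard the second entirely.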

URL

https://arxiv.org/abs/2304.07978

PDF

https://arxiv.org/pdf/2304.07978.pdf

