Boosting Semi-Supervised Temporal Action Localization by Learning from Non-Target Classes

2024-03-17 12:26:23
Kun Xia, Le Wang, Sanping Zhou, Gang Hua, Wei Tang

Abstract

The crux of semi-supervised temporal action localization (SS-TAL) lies in excavating valuable information from abundant unlabeled videos. However, current approaches predominantly focus on building models that are robust to the error-prone target class (i.e., the predicted class with the highest confidence) while ignoring informative semantics within non-target classes. This paper approaches SS-TAL from a novel perspective by advocating for learning from non-target classes, transcending the conventional focus solely on the target class. The proposed approach involves partitioning the label space of the predicted class distribution into distinct subspaces: target class, positive classes, negative classes, and ambiguous classes, aiming to mine both positive and negative semantics that are absent in the target class, while excluding ambiguous classes. To this end, we first devise innovative strategies to adaptively select high-quality positive and negative classes from the label space, by modeling both the confidence and rank of a class in relation to those of the target class. Then, we introduce novel positive and negative losses designed to guide the learning process, pushing predictions closer to positive classes and away from negative classes. Finally, the positive and negative processes are integrated into a hybrid positive-negative learning framework, facilitating the utilization of non-target classes in both labeled and unlabeled videos. Experimental results on THUMOS14 and ActivityNet v1.3 demonstrate the superiority of the proposed method over prior state-of-the-art approaches.
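
The partitioning idea in the abstract can be illustrated with a minimal sketch. The abstract only states that positive and negative classes are selected from each class's confidence and rank relative to the target class; the fixed thresholds `pos_ratio`, `neg_ratio`, and `pos_max_rank`, the function names, and the simple log-likelihood loss forms below are all assumptions made for illustration, not the selection rules or losses defined in the paper.

```python
import math


def partition_classes(probs, pos_ratio=0.5, neg_ratio=0.1, pos_max_rank=3):
    """Split class indices into target / positive / negative / ambiguous.

    Hypothetical rule: a non-target class is "positive" if its confidence
    relative to the target is high AND it ranks near the top, "negative" if
    its relative confidence is very low, and "ambiguous" otherwise.
    """
    # Class indices sorted by descending confidence; rank 1 is the target.
    order = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
    target = order[0]
    parts = {"target": [target], "positive": [], "negative": [], "ambiguous": []}
    for rank, c in enumerate(order[1:], start=2):
        rel = probs[c] / probs[target]  # confidence relative to the target
        if rel >= pos_ratio and rank <= pos_max_rank:
            parts["positive"].append(c)
        elif rel <= neg_ratio:
            parts["negative"].append(c)
        else:
            parts["ambiguous"].append(c)  # excluded from both losses
    return parts


def positive_negative_loss(probs, parts, eps=1e-8):
    """Assumed generic loss forms, not the paper's definitions.

    The positive term rewards probability mass on positive classes;
    the negative term penalizes probability mass on negative classes.
    """
    pos = -sum(math.log(probs[c] + eps) for c in parts["positive"])
    neg = -sum(math.log(1.0 - probs[c] + eps) for c in parts["negative"])
    return pos, neg


# Toy predicted class distribution for one action proposal.
probs = [0.50, 0.30, 0.15, 0.04, 0.01]
parts = partition_classes(probs)
pos_loss, neg_loss = positive_negative_loss(probs, parts)
print(parts)
print(pos_loss, neg_loss)
```

With these assumed thresholds, class 0 is the target, class 1 is selected as positive, classes 3 and 4 as negative, and class 2 is left ambiguous and contributes to neither loss term.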

URL

https://arxiv.org/abs/2403.11189

PDF

https://arxiv.org/pdf/2403.11189.pdf

