Paper Reading AI Learner

Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal Action Localization

2023-05-29 02:48:04
Huan Ren, Wenfei Yang, Tianzhu Zhang, Yongdong Zhang

Abstract

Weakly-supervised temporal action localization aims to localize and recognize actions in untrimmed videos with only video-level category labels during training. Without instance-level annotations, most existing methods follow the Segment-based Multiple Instance Learning (S-MIL) framework, where the predictions of segments are supervised by the labels of videos. However, the objective for acquiring segment-level scores during training is not consistent with the target for acquiring proposal-level scores during testing, leading to suboptimal results. To deal with this problem, we propose a novel Proposal-based Multiple Instance Learning (P-MIL) framework that directly classifies the candidate proposals in both the training and testing stages, which includes three key designs: 1) a surrounding contrastive feature extraction module to suppress the discriminative short proposals by considering the surrounding contrastive information, 2) a proposal completeness evaluation module to inhibit the low-quality proposals with the guidance of the completeness pseudo labels, and 3) an instance-level rank consistency loss to achieve robust detection by leveraging the complementarity of RGB and FLOW modalities. Extensive experimental results on two challenging benchmarks including THUMOS14 and ActivityNet demonstrate the superior performance of our method.
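The core train/test mismatch the abstract describes can be illustrated with a small numpy sketch. This is a conceptual illustration only, not the paper's implementation: the shapes, the top-k pooling choice, and the candidate proposal spans below are all hypothetical assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (shapes are illustrative assumptions, not from the paper):
# a video of T segments scored over C action classes.
T, C = 50, 4
segment_scores = rng.random((T, C))  # per-segment class activation scores

# --- S-MIL (existing methods): pool segment scores into a video-level ---
# --- score (here, top-k mean per class) and supervise that with the   ---
# --- video-level label; at test time, proposals are scored only       ---
# --- indirectly from these segment scores.                            ---
k = max(1, T // 8)
topk = np.sort(segment_scores, axis=0)[-k:]   # top-k segments per class
video_score_smil = topk.mean(axis=0)          # (C,) video-level score

# --- P-MIL (this paper's idea): classify candidate proposals directly ---
# --- in both training and testing. Each proposal is a (start, end)    ---
# --- span of segments; its score comes from the proposal itself.      ---
proposals = [(5, 15), (20, 24), (30, 45)]     # hypothetical candidates
proposal_scores = np.stack(
    [segment_scores[s:e].mean(axis=0) for s, e in proposals]
)                                             # (num_proposals, C)

# A video-level score for video-label supervision can still be
# aggregated from the proposal scores.
video_score_pmil = proposal_scores.max(axis=0)  # (C,)

print(video_score_smil.shape, proposal_scores.shape, video_score_pmil.shape)
```

The point of the contrast: in the S-MIL branch the quantity being trained (a pooled video score) is not the quantity evaluated at test time (per-proposal scores), whereas the P-MIL branch trains the per-proposal scores directly.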

URL

https://arxiv.org/abs/2305.17861

PDF

https://arxiv.org/pdf/2305.17861.pdf
