Paper Reading AI Learner

Revisiting Foreground and Background Separation in Weakly-supervised Temporal Action Localization: A Clustering-based Approach

2023-12-21 18:57:12
Qinying Liu, Zilei Wang, Shenghai Rong, Junjie Li, Yixin Zhang

Abstract

Weakly-supervised temporal action localization aims to localize action instances in videos with only video-level action labels. Existing methods mainly embrace a localization-by-classification pipeline that optimizes the snippet-level prediction with a video classification loss. However, this formulation suffers from the discrepancy between classification and detection, resulting in inaccurate separation of foreground and background (F&B) snippets. To alleviate this problem, we propose to explore the underlying structure among the snippets by resorting to unsupervised snippet clustering, rather than heavily relying on the video classification loss. Specifically, we propose a novel clustering-based F&B separation algorithm. It comprises two core components: a snippet clustering component that groups the snippets into multiple latent clusters and a cluster classification component that further classifies each cluster as foreground or background. As there are no ground-truth labels to train these two components, we introduce a unified self-labeling mechanism based on optimal transport to produce high-quality pseudo-labels that match several plausible prior distributions. This ensures that the cluster assignments of the snippets can be accurately associated with their F&B labels, thereby boosting the F&B separation. We evaluate our method on three benchmarks: THUMOS14, ActivityNet v1.2 and v1.3. Our method achieves promising performance on all three benchmarks while being significantly more lightweight than previous methods. Code is available at this https URL
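The abstract's optimal-transport self-labeling step can be sketched with a Sinkhorn-Knopp iteration: given snippet-to-cluster scores, it produces soft pseudo-labels whose marginals match chosen prior distributions. This is a minimal illustrative sketch, not the paper's actual implementation; the function name, the entropic temperature `eps`, and the choice of marginals are assumptions.

```python
import numpy as np

def sinkhorn_pseudo_labels(scores, row_marginal, col_marginal, eps=0.05, n_iters=50):
    """Hypothetical sketch of OT-based self-labeling via Sinkhorn-Knopp.

    scores:       (N, K) similarity of N snippets to K latent clusters.
    row_marginal: (N,) prior mass per snippet (typically uniform).
    col_marginal: (K,) prior mass per cluster (e.g. an assumed F&B ratio).
    Returns a (N, K) matrix of soft pseudo-labels, one distribution per snippet.
    """
    Q = np.exp(scores / eps)  # entropic kernel of the transport problem
    Q /= Q.sum()
    for _ in range(n_iters):
        # Alternately rescale rows and columns toward the prior marginals.
        Q *= (row_marginal / Q.sum(axis=1))[:, None]
        Q *= (col_marginal / Q.sum(axis=0))[None, :]
    # Normalize each row so every snippet gets a valid label distribution.
    return Q / Q.sum(axis=1, keepdims=True)

# Toy usage: 4 snippets, 2 clusters, uniform snippet prior, 50/50 cluster prior.
rng = np.random.default_rng(0)
S = rng.normal(size=(4, 2))
P = sinkhorn_pseudo_labels(S, np.full(4, 0.25), np.full(2, 0.5))
```

Enforcing the column marginal is what prevents the degenerate solution where all snippets collapse into one cluster, which is the role the abstract attributes to matching "plausible prior distributions".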


URL

https://arxiv.org/abs/2312.14138

PDF

https://arxiv.org/pdf/2312.14138.pdf

