Paper Reading AI Learner

Cross-Video Contextual Knowledge Exploration and Exploitation for Ambiguity Reduction in Weakly Supervised Temporal Action Localization


Abstract

Weakly supervised temporal action localization (WSTAL) aims to localize actions in untrimmed videos using only video-level labels. Despite recent advances, existing approaches mainly follow a localization-by-classification pipeline that processes each segment individually and therefore exploits only limited contextual information. As a result, the model lacks a comprehensive understanding of action patterns (e.g., their appearance and temporal structure), leading to ambiguity in both classification learning and temporal localization. Our work addresses this from a novel perspective: it explores and exploits the cross-video contextual knowledge within the dataset to recover the dataset-level semantic structure of action instances using weak labels only, thereby indirectly improving the holistic understanding of fine-grained action patterns and alleviating the aforementioned ambiguities. Specifically, we propose an end-to-end framework consisting of a Robust Memory-Guided Contrastive Learning (RMGCL) module and a Global Knowledge Summarization and Aggregation (GKSA) module. First, the RMGCL module exploits the contrast and consistency of cross-video action features, helping to learn a more structured and compact embedding space and thus reducing ambiguity in classification learning. Second, the GKSA module efficiently summarizes and propagates representative cross-video action knowledge in a learnable manner to promote a holistic understanding of action patterns, which in turn enables the generation of high-confidence pseudo-labels for self-learning and thus alleviates ambiguity in temporal localization. Extensive experiments on THUMOS14, ActivityNet1.3, and FineAction demonstrate that our method outperforms state-of-the-art methods and can be easily plugged into other WSTAL methods.
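The abstract describes RMGCL only at a high level. As a rough illustration of the general idea (a contrastive objective between segment embeddings and a cross-video class memory), here is a minimal PyTorch sketch. It is our assumption, not the paper's implementation: the function names (`memory_guided_contrastive_loss`, `update_memory`), the prototype-per-class memory layout, and the `temperature`/`momentum` values are all hypothetical.

```python
import torch
import torch.nn.functional as F

def memory_guided_contrastive_loss(features, labels, memory, temperature=0.07):
    """Hypothetical sketch of a memory-guided contrastive loss.

    features: (N, D) embeddings of candidate action segments
    labels:   (N,) class indices derived from video-level weak labels
    memory:   (C, D) class prototypes accumulated across videos
    """
    f = F.normalize(features, dim=1)   # (N, D) unit-norm segment embeddings
    m = F.normalize(memory, dim=1)     # (C, D) unit-norm class prototypes
    logits = f @ m.t() / temperature   # (N, C) cosine similarity to prototypes
    # Pull each segment toward its own class prototype, push from the others.
    return F.cross_entropy(logits, labels)

@torch.no_grad()
def update_memory(memory, features, labels, momentum=0.9):
    """EMA update of the class prototypes with the current batch (assumed scheme)."""
    for c in labels.unique():
        cls_mean = features[labels == c].mean(dim=0)
        memory[c] = momentum * memory[c] + (1 - momentum) * cls_mean
    return memory
```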
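Likewise, the following sketch hints at how summarized cross-video knowledge could be aggregated back into per-segment features and then used to mint high-confidence pseudo-labels for self-learning. The single attention step, the `threshold` value, and all names are illustrative assumptions on our part, not the paper's GKSA formulation.

```python
import torch
import torch.nn.functional as F

def aggregate_global_knowledge(segment_feats, memory):
    """Assumed sketch: inject dataset-level class prototypes into one video's
    segment features via a single scaled-dot-product attention step.

    segment_feats: (T, D) features of one video's segments
    memory:        (C, D) cross-video class prototypes
    """
    scale = memory.shape[1] ** 0.5
    attn = F.softmax(segment_feats @ memory.t() / scale, dim=1)  # (T, C)
    return segment_feats + attn @ memory  # residual aggregation, (T, D)

def make_pseudo_labels(cas, threshold=0.8):
    """Keep only high-confidence segment predictions as hard pseudo-labels.

    cas: (T, C) class activation scores after knowledge aggregation
    """
    conf, cls = cas.max(dim=1)
    pseudo = torch.full_like(cls, -100)  # -100: PyTorch's default ignore_index
    pseudo[conf > threshold] = cls[conf > threshold]
    return pseudo
```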

URL

https://arxiv.org/abs/2308.12609

PDF

https://arxiv.org/pdf/2308.12609.pdf

