Boosting Weakly-Supervised Temporal Action Localization with Text Information

2023-05-01 00:07:09
Guozhang Li, De Cheng, Xinpeng Ding, Nannan Wang, Xiaoyu Wang, Xinbo Gao

Abstract

Due to the lack of temporal annotation, current Weakly-supervised Temporal Action Localization (WTAL) methods generally suffer from over-complete or incomplete localization. In this paper, we leverage text information to boost WTAL in two ways: (a) a discriminative objective that enlarges the inter-class difference, thus reducing over-complete localization; and (b) a generative objective that enhances intra-class integrity, thus finding more complete temporal boundaries. For the discriminative objective, we propose a Text-Segment Mining (TSM) mechanism, which constructs a text description from the action class label and uses the text as a query to mine all class-related segments. Without temporal annotation of actions, TSM compares the text query with entire videos across the dataset to mine the best-matching segments while ignoring irrelevant ones. Because videos of different categories share sub-actions, applying TSM alone is too strict and neglects semantically related segments, which results in incomplete localization. We therefore introduce a generative objective named Video-text Language Completion (VLC), which focuses on all semantically related segments in a video to complete the text sentence. We achieve state-of-the-art performance on THUMOS14 and ActivityNet1.3. Surprisingly, we also find that the proposed method can be seamlessly applied to existing methods and improves their performance by a clear margin. The code is available at this https URL.
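
The text-as-query idea behind TSM can be illustrated with a small sketch. This is not the authors' implementation: the prompt template, the cosine-similarity matching, the top-k selection, and the names `build_prompt`, `mine_segments`, and `k` are all assumptions made for illustration, and the features are assumed to come from some pretrained video and text encoders that are not shown here.

```python
import numpy as np


def build_prompt(class_name: str) -> str:
    # Hypothetical template: the abstract only says that a text description
    # is constructed from the action class label.
    return f"a video of {class_name}"


def l2_normalize(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    # Scale each row to (approximately) unit length so dot products
    # behave like cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def mine_segments(segment_feats: np.ndarray, text_feats: np.ndarray, k: int):
    """Match text queries against video segments (illustrative only).

    segment_feats: (T, D) array, one feature per temporal segment.
    text_feats:    (C, D) array, one embedded text query per action class.
    Returns the (T, C) similarity matrix, the (C, k) indices of the
    best-matching segments per class, and a (C,) video-level score.
    """
    sim = l2_normalize(segment_feats) @ l2_normalize(text_feats).T  # (T, C)
    # Assumed mining rule: keep the k most similar segments for each class.
    topk_idx = np.argsort(-sim, axis=0)[:k]                         # (k, C)
    # Average the similarities of the mined segments per class.
    video_scores = np.take_along_axis(sim, topk_idx, axis=0).mean(axis=0)
    return sim, topk_idx.T, video_scores
```

In a weakly-supervised setting, a video-level score of this kind could be trained against the video-level class label, which is the only supervision the abstract assumes is available; segments with low similarity to every query would then be treated as irrelevant.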

URL

https://arxiv.org/abs/2305.00607

PDF

https://arxiv.org/pdf/2305.00607.pdf

