DDG-Net: Discriminability-Driven Graph Network for Weakly-supervised Temporal Action Localization

2023-07-31 05:48:39
Xiaojun Tang, Junsong Fan, Chuanchen Luo, Zhaoxiang Zhang, Man Zhang, Zongyuan Yang

Abstract

Weakly-supervised temporal action localization (WTAL) is a practical yet challenging task. Because the datasets are large, most existing methods extract features with a network pretrained on other datasets, and these features are not well suited to WTAL. To address this problem, researchers have designed several feature-enhancement modules that improve the performance of the localization module, especially by modeling the temporal relationships between snippets. However, all of them neglect the adverse effects of ambiguous information, which reduces the discriminability of other snippets. Considering this phenomenon, we propose the Discriminability-Driven Graph Network (DDG-Net), which explicitly models ambiguous snippets and discriminative snippets with well-designed connections, preventing the transmission of ambiguous information and enhancing the discriminability of snippet-level representations. Additionally, we propose a feature consistency loss to prevent the assimilation of features and drive the graph convolution network to generate more discriminative representations. Extensive experiments on the THUMOS14 and ActivityNet1.2 benchmarks demonstrate the effectiveness of DDG-Net, establishing new state-of-the-art results on both datasets. Source code is available at this https URL.
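
The idea of "preventing the transmission of ambiguous information" can be illustrated with a toy graph in which ambiguous snippets may receive messages from discriminative snippets but never send any. The sketch below is only an illustration under stated assumptions and is not the authors' implementation: the score thresholds, the cosine-similarity edge rule, and the function names (`split_snippets`, `build_adjacency`, `propagate`) are all hypothetical and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def split_snippets(scores, low=0.3, high=0.7):
    """Label a snippet 'ambiguous' when its action score is mid-range.

    The thresholds are assumptions chosen for illustration only.
    """
    return ["ambiguous" if low < s < high else "discriminative" for s in scores]


def build_adjacency(features, labels, sim_threshold=0.5):
    """Row-normalised adjacency; adj[i][j] is the weight of the message j -> i.

    Assumed rule: only discriminative snippets send messages, so the column
    of every ambiguous snippet is zero except for its own self-loop.
    """
    n = len(features)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):          # receiver
        for j in range(n):      # sender
            if i == j:
                adj[i][j] = 1.0  # self-loop keeps each snippet's own feature
            elif labels[j] != "ambiguous":
                sim = cosine(features[i], features[j])
                if sim > sim_threshold:
                    adj[i][j] = sim
    for row in adj:
        total = sum(row)        # at least 1.0 because of the self-loop
        for j in range(n):
            row[j] /= total
    return adj


def propagate(features, adj):
    """One step of feature aggregation: each snippet averages its in-neighbours."""
    n, dim = len(features), len(features[0])
    return [
        [sum(adj[i][j] * features[j][d] for j in range(n)) for d in range(dim)]
        for i in range(n)
    ]


# Three snippets: confident action, ambiguous, confident background.
scores = [0.9, 0.5, 0.1]
features = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
labels = split_snippets(scores)
adj = build_adjacency(features, labels)
enhanced = propagate(features, adj)
```

In this toy setting the two discriminative snippets keep their features unchanged, while the ambiguous snippet is pulled toward a mixture of its discriminative neighbours. The feature consistency loss mentioned in the abstract is not modeled here.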


URL

https://arxiv.org/abs/2307.16415

PDF

https://arxiv.org/pdf/2307.16415.pdf

