
RefineLoc: Iterative Refinement for Weakly-Supervised Action Localization

2019-03-30 14:51:44
Humam Alwassel, Fabian Caba Heilbron, Ali Thabet, Bernard Ghanem

Abstract

Video action detectors are usually trained on video datasets with fully supervised temporal annotations. Building such datasets is very expensive. To alleviate this problem, recent algorithms leverage weak labelling, where videos are untrimmed and only a video-level label is available. In this paper, we propose RefineLoc, a new method for weakly-supervised temporal action localization. RefineLoc takes an iterative refinement approach: at every iteration, it estimates snippet-level pseudo ground truth and trains on it. We show the benefit of this iterative approach and present an extensive analysis of different pseudo ground truth generators. We show the effectiveness of our model on two standard action datasets, ActivityNet v1.2 and THUMOS14. RefineLoc equipped with a segment prediction-based pseudo ground truth generator improves the state of the art in weakly-supervised temporal localization on the challenging, large-scale ActivityNet dataset by 4.2% and performs comparably to the state of the art on THUMOS14.
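
To make the idea of snippet-level pseudo ground truth more concrete, here is a minimal sketch of one way a segment prediction-based generator might turn predicted segments into per-snippet labels. It is not the authors' implementation: the function names, the fixed 0.5 threshold, and the toy scores are all assumptions made for illustration. In an iterative scheme, labels like these would be used as training targets in the next iteration.

```python
from typing import List, Tuple


def segments_from_scores(scores: List[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Group consecutive snippets scoring >= threshold into [start, end) segments.

    The threshold is a hypothetical choice, not a value from the paper.
    """
    segments = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i  # a new segment opens here
        elif score < threshold and start is not None:
            segments.append((start, i))  # the segment closes before snippet i
            start = None
    if start is not None:
        segments.append((start, len(scores)))  # segment runs to the end of the video
    return segments


def pseudo_labels(segments: List[Tuple[int, int]], num_snippets: int) -> List[int]:
    """Label snippets inside any predicted segment as foreground (1), others as background (0)."""
    labels = [0] * num_snippets
    for start, end in segments:
        for i in range(start, end):
            labels[i] = 1
    return labels


# Toy per-snippet foreground scores for one video (made-up numbers).
scores = [0.1, 0.7, 0.9, 0.4, 0.2, 0.8, 0.6]
segments = segments_from_scores(scores)
labels = pseudo_labels(segments, len(scores))
print(segments)
print(labels)
```

In this toy setup the model's own segment predictions decide which snippets count as foreground, which is the general spirit of training on pseudo ground truth when only a video-level label is available.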


URL

https://arxiv.org/abs/1904.00227

PDF

https://arxiv.org/pdf/1904.00227.pdf

