Paper Reading AI Learner

Improving Action Localization by Progressive Cross-stream Cooperation

2019-05-28 02:29:12
Rui Su, Wanli Ouyang, Luping Zhou, Dong Xu

Abstract

Spatio-temporal action localization consists of three levels of tasks: spatial localization, action classification, and temporal segmentation. In this work, we propose a new Progressive Cross-stream Cooperation (PCSC) framework that uses both region proposals and features from one stream (i.e., Flow/RGB) to help the other stream (i.e., RGB/Flow), iteratively improving action localization results and producing better bounding boxes. Specifically, we first generate a larger set of region proposals by combining the latest region proposals from both streams, from which we can readily obtain a larger set of labelled training samples and thus learn better action detection models. Second, we propose a new message passing approach that passes information from one stream to the other in order to learn better feature representations, which also leads to better action detection models. As a result, our iterative framework progressively improves action localization results at the frame level. To improve action localization results at the video level, we additionally propose a new strategy that trains class-specific actionness detectors for better temporal segmentation; these detectors can be readily learnt by focusing on "confusing" samples from the same action class. Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate the effectiveness of our newly proposed approaches for spatio-temporal action localization in realistic scenarios.
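
To make the cooperation scheme described in the abstract concrete, the sketch below outlines the frame-level PCSC loop in Python. It is only a minimal, conceptual rendering under stated assumptions: the feature extractor, message-passing function, and detector are hypothetical dummy stand-ins, not the authors' networks, and only the control flow (pooling proposals from both streams, exchanging messages between streams, and re-detecting over a few stages) follows the paper's description. The class-specific actionness detectors used for video-level temporal segmentation are not shown.

```python
# Conceptual sketch of the Progressive Cross-stream Cooperation (PCSC) loop.
# All functions below are hypothetical placeholders used only to illustrate
# the iterative cross-stream cooperation described in the abstract.
import numpy as np


def extract_features(frames, boxes):
    # Dummy per-box features; a real system would use RoI-pooled CNN features
    # from the RGB or optical-flow backbone.
    return np.random.rand(len(boxes), 256)


def message_passing(own_feat, other_feat, weight=0.5):
    # Toy cross-stream message passing: blend the helper stream's features
    # into the current stream's representation.
    return own_feat + weight * other_feat


def detect(features, boxes):
    # Dummy detector: returns (possibly refined) boxes and confidence scores.
    scores = np.random.rand(len(boxes))
    return boxes, scores


def pcsc_frame_level(rgb_frames, flow_frames, rgb_boxes, flow_boxes, stages=2):
    """Progressive cross-stream cooperation for one frame (conceptual only)."""
    for _ in range(stages):
        # 1) Enlarge the proposal set by pooling the latest proposals from
        #    both streams (a real system would also apply NMS/deduplication).
        union_boxes = np.concatenate([rgb_boxes, flow_boxes], axis=0)

        # 2) Extract per-stream features for the shared proposal set.
        rgb_feat = extract_features(rgb_frames, union_boxes)
        flow_feat = extract_features(flow_frames, union_boxes)

        # 3) Cross-stream message passing in both directions.
        rgb_feat_mp = message_passing(rgb_feat, flow_feat)
        flow_feat_mp = message_passing(flow_feat, rgb_feat)

        # 4) Re-detect with the improved features; the refined boxes become
        #    the "latest" proposals for the next cooperation stage.
        rgb_boxes, rgb_scores = detect(rgb_feat_mp, union_boxes)
        flow_boxes, flow_scores = detect(flow_feat_mp, union_boxes)

    return rgb_boxes, rgb_scores


if __name__ == "__main__":
    boxes0 = np.array([[10, 10, 50, 80], [20, 15, 60, 90]], dtype=float)
    final_boxes, final_scores = pcsc_frame_level(None, None, boxes0, boxes0.copy())
    print(final_boxes.shape, final_scores.shape)
```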

URL

https://arxiv.org/abs/1905.11575

PDF

https://arxiv.org/pdf/1905.11575.pdf

