
Learning Spatio-Temporal Representation with Local and Global Diffusion

2019-06-13 09:41:00
Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Xinmei Tian, Tao Mei

Abstract

Convolutional Neural Networks (CNNs) have been regarded as a powerful class of models for visual recognition problems. Nevertheless, the convolutional filters in these networks are local operations that ignore long-range dependencies. This drawback is even more severe for video recognition, since video is an information-intensive medium with complex temporal variations. In this paper, we present a novel framework that boosts spatio-temporal representation learning through Local and Global Diffusion (LGD). Specifically, we construct a novel neural network architecture that learns the local and global representations in parallel. The architecture is composed of LGD blocks, where each block updates local and global features by modeling the diffusions between these two representations. The diffusions let the two aspects of information, i.e., localized and holistic, interact effectively, giving a more powerful way of learning representations. Furthermore, a kernelized classifier is introduced to combine the representations from the two aspects for video recognition. Our LGD networks achieve clear improvements on the large-scale Kinetics-400 and Kinetics-600 video classification datasets against the best competitors, by 3.5% and 0.7% respectively. We further examine the generalization of both the global and local representations produced by our pre-trained LGD networks on four different benchmarks for video action recognition and spatio-temporal action detection tasks. Superior performance over several state-of-the-art techniques on these benchmarks is reported. Code is available at: https://github.com/ZhaofanQiu/local-and-global-diffusion-networks.
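
The abstract does not spell out the update rules inside an LGD block, so the following is only a minimal sketch of the idea of two paths exchanging information. It assumes the global feature is broadcast to every spatio-temporal position on the local path, and that the local features are average-pooled before being merged into the global path. All names (lgd_step, W_ll, W_lg, W_gl, W_gg) are hypothetical, the local transformation is reduced to a per-position linear map, and the weights are random rather than learned; the authors' repository linked above is the reference for the actual method.

```python
import random

def relu(v):
    return [max(x, 0.0) for x in v]

def matvec(m, v):
    # Multiply a (rows x cols) matrix by a vector of length cols.
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def mean_pool(positions):
    # Average the per-position channel vectors into one global vector.
    n = len(positions)
    return [sum(col) / n for col in zip(*positions)]

def lgd_step(local, global_vec, p):
    """One illustrative local/global update (assumed form, not the paper's exact block).

    local:      list of per-position channel vectors (flattened T*H*W positions)
    global_vec: one channel vector describing the whole clip
    """
    # Global -> local: project the global vector and add it at every position.
    g_to_l = matvec(p["W_lg"], global_vec)
    new_local = [relu(add(matvec(p["W_ll"], x), g_to_l)) for x in local]
    # Local -> global: pool the updated local features and merge with the old global vector.
    pooled = mean_pool(new_local)
    new_global = relu(add(matvec(p["W_gl"], pooled), matvec(p["W_gg"], global_vec)))
    return new_local, new_global

def init_params(channels, rng):
    # Random square matrices stand in for learned weights (assumption).
    scale = channels ** -0.5
    def mat():
        return [[rng.gauss(0.0, scale) for _ in range(channels)] for _ in range(channels)]
    return {name: mat() for name in ("W_ll", "W_lg", "W_gl", "W_gg")}

rng = random.Random(0)
channels, positions = 4, 6
local = [[rng.gauss(0.0, 1.0) for _ in range(channels)] for _ in range(positions)]
global_vec = mean_pool(local)  # initialise the global path from the local features
for _ in range(3):  # stack three blocks
    local, global_vec = lgd_step(local, global_vec, init_params(channels, rng))
print(len(local), len(local[0]), len(global_vec))
```

Stacking the step shows the point made in the abstract: after each block, every local position has seen clip-level context, and the global vector has absorbed a summary of the refined local features.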

URL

https://arxiv.org/abs/1906.05571

PDF

https://arxiv.org/pdf/1906.05571.pdf

