Learning Spatio-Temporal Representation with Local and Global Diffusion

Abstract

Convolutional Neural Networks (CNNs) are regarded as a powerful class of models for visual recognition problems. Nevertheless, the convolutional filters in these networks are local operations that ignore long-range dependencies. This drawback is even more severe for video recognition, since video is an information-intensive medium with complex temporal variations. In this paper, we present a novel framework that boosts spatio-temporal representation learning through Local and Global Diffusion (LGD). Specifically, we construct a novel neural network architecture that learns the local and global representations in parallel. The architecture is composed of LGD blocks, where each block updates the local and global features by modeling the diffusions between these two representations. The diffusions let the two aspects of information, i.e., localized and holistic, interact effectively, yielding a more powerful way of representation learning. Furthermore, a kernelized classifier is introduced to combine the representations from the two aspects for video recognition. Our LGD networks achieve clear improvements on the large-scale Kinetics-400 and Kinetics-600 video classification datasets, outperforming the best competitors by 3.5% and 0.7%, respectively. We further examine the generalization of both the global and local representations produced by our pre-trained LGD networks on four different benchmarks for video action recognition and spatio-temporal action detection tasks. Superior performance over several state-of-the-art techniques on these benchmarks is reported. Code is available at: https://github.com/ZhaofanQiu/local-and-global-diffusion-networks.
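
The abstract does not state the update rules of an LGD block, so the following is only a minimal sketch of the idea it describes, with two feature paths that exchange information in every block. Everything specific here is an assumption made for illustration: the function name `lgd_block`, the four weight matrices, mean pooling as the local-to-global step, broadcasting as the global-to-local step, and a per-position linear map standing in for the convolution. It uses only the Python standard library and is not the authors' implementation, which is available at the repository linked above.

```python
import random


def relu(v):
    """Element-wise ReLU on a vector."""
    return [max(0.0, a) for a in v]


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def vadd(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


def mean_pool(positions):
    """Average the per-position feature vectors into one holistic vector."""
    n = len(positions)
    return [sum(col) / n for col in zip(*positions)]


def lgd_block(local, glob, W_local, W_g2l, W_l2g, W_g2g):
    """One hypothetical local/global diffusion step.

    local: list of per-position feature vectors (flattened space-time grid)
    glob:  a single global feature vector
    """
    # Global -> local: project the global vector and add it at every position.
    g_msg = matvec(W_g2l, glob)
    new_local = [relu(vadd(matvec(W_local, x), g_msg)) for x in local]
    # Local -> global: pool the updated local features and mix them
    # with the previous global vector.
    pooled = mean_pool(new_local)
    new_glob = relu(vadd(matvec(W_l2g, pooled), matvec(W_g2g, glob)))
    return new_local, new_glob


def random_matrix(rows, cols, rng):
    """Small random weight matrix, used only for the demo below."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
C, N = 4, 6  # channels and number of space-time positions (toy sizes)
local = [[rng.uniform(-1.0, 1.0) for _ in range(C)] for _ in range(N)]
glob = mean_pool(local)  # assumed initialization of the global path
weights = [random_matrix(C, C, rng) for _ in range(4)]
new_local, new_glob = lgd_block(local, glob, *weights)
```

Stacking several such blocks would let every position see a summary of the whole clip after a single step, which is the long-range dependency that a purely local filter misses.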

URL

https://arxiv.org/abs/1906.05571

PDF

https://arxiv.org/pdf/1906.05571.pdf