Abstract
Motion is a salient cue for recognizing actions in video. Modern action recognition models leverage motion information either explicitly, by using optical flow as input, or implicitly, by means of 3D convolutional filters that simultaneously capture appearance and motion information. This paper proposes an alternative approach based on a learnable correlation operator that can be used to establish frame-to-frame matches over convolutional feature maps in the different layers of the network. The proposed architecture enables the fusion of this explicit temporal matching information with traditional appearance cues captured by 2D convolution. Our correlation network compares favorably with widely used 3D CNNs for video modeling, and achieves results competitive with the prominent two-stream network while being much faster to train. We empirically demonstrate that correlation networks produce strong results on a variety of video datasets, and outperform the state of the art on three popular benchmarks for action recognition: Kinetics, Something-Something and Diving48.
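The paper's operator is learnable and details differ, but the frame-to-frame matching idea it builds on can be illustrated with a plain local correlation between the feature maps of two consecutive frames, in the style popularized by optical-flow networks. The sketch below is a minimal NumPy illustration under assumed shapes; the function name, `max_disp` parameter, and normalization are illustrative, not taken from the paper:

```python
import numpy as np

def local_correlation(f1, f2, max_disp=1):
    """Local correlation between two frame feature maps (illustrative sketch).

    f1, f2: arrays of shape (C, H, W) -- features of consecutive frames.
    Returns an array of shape ((2*max_disp+1)**2, H, W): for each spatial
    position in f1, the channel-wise dot-product similarity with every
    displaced position of f2 within a (2*max_disp+1) x (2*max_disp+1) window.
    """
    C, H, W = f1.shape
    k = 2 * max_disp + 1
    # Zero-pad f2 so displaced lookups near the border are defined.
    f2p = np.pad(f2, ((0, 0), (max_disp, max_disp), (max_disp, max_disp)))
    out = np.empty((k * k, H, W), dtype=f1.dtype)
    idx = 0
    for dy in range(k):
        for dx in range(k):
            # Slice out f2 shifted by displacement (dy - max_disp, dx - max_disp).
            shifted = f2p[:, dy:dy + H, dx:dx + W]
            # Matching score: dot product over channels, normalized by C.
            out[idx] = (f1 * shifted).sum(axis=0) / C
            idx += 1
    return out
```

The output can be treated as a feature map with one channel per candidate displacement and fused with appearance features from 2D convolutions, which is the kind of combination the abstract describes.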
URL
https://arxiv.org/abs/1906.03349