Abstract
We address the problem of video representation learning without human-annotated labels. While previous efforts tackle the problem by designing novel self-supervised tasks on video data, the resulting features are learned on a frame-by-frame basis, making them ill-suited to the many video analysis tasks where spatio-temporal features prevail. In this paper we propose a novel self-supervised approach to learning spatio-temporal features for video representation. Inspired by the success of two-stream approaches in video classification, we propose to learn visual features by regressing both motion and appearance statistics along the spatial and temporal dimensions, given only the input video data. Specifically, we extract statistical concepts (the fast-motion region and its dominant direction, spatio-temporal color diversity, dominant color, etc.) from simple patterns in both the spatial and temporal domains. Unlike prior puzzle-style pretext tasks, which can be hard even for humans to solve, the proposed task is consistent with inherent human visual habits and therefore easy to answer. We conduct extensive experiments with C3D to validate the effectiveness of the proposed approach, and show that it significantly improves the performance of C3D on video classification tasks. Code is available at this https URL.
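To make the flavor of these regression targets concrete, the sketch below computes toy per-block "appearance/motion statistics" (dominant color, color diversity, and a crude motion-energy proxy used to pick the fast-motion region) from a raw clip. This is an illustrative assumption-laden example, not the authors' method: the paper derives its motion statistics from optical flow, whereas here simple frame differences stand in for flow, and the block grid and statistic definitions are hypothetical.

```python
import numpy as np

def clip_statistics(clip, grid=2):
    """Toy spatio-temporal statistics over a (T, H, W, 3) video clip.

    Illustrative sketch only: the paper's actual labels use optical-flow
    based motion statistics; here np.diff over time is a crude stand-in.
    Returns per-block (dominant_color, color_diversity, motion_energy)
    plus the index of the "fast-motion" block.
    """
    T, H, W, _ = clip.shape
    hs, ws = H // grid, W // grid
    stats = []
    for i in range(grid):          # blocks scanned row-major: (0,0), (0,1), ...
        for j in range(grid):
            block = clip[:, i * hs:(i + 1) * hs, j * ws:(j + 1) * ws, :]
            pixels = block.reshape(-1, 3)
            dominant_color = pixels.mean(axis=0)          # appearance statistic
            color_diversity = pixels.std(axis=0).mean()   # spatio-temporal diversity
            motion_energy = np.abs(np.diff(block, axis=0)).mean()  # motion proxy
            stats.append((dominant_color, color_diversity, motion_energy))
    # The block with the largest motion energy plays the "fast-motion region".
    fastest = int(np.argmax([s[2] for s in stats]))
    return stats, fastest
```

In the self-supervised setting, numbers like these (computed for free from the raw video) would serve as regression targets for a 3D CNN such as C3D, so no human annotation is needed.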
URL
https://arxiv.org/abs/1904.03597