Abstract
Appearance and motion are two key components for depicting and characterizing video content. Currently, two-stream models achieve state-of-the-art performance on video classification. However, extracting motion information, specifically in the form of optical flow features, is extremely computationally expensive, especially for large-scale video classification. In this paper, we propose a motion hallucination network, namely MoNet, that imagines the optical flow features from the appearance features, without relying on optical flow computation. Specifically, MoNet models the temporal relationships of the appearance features and exploits the contextual relationships of the optical flow features with concurrent connections. Extensive experimental results demonstrate that the proposed MoNet can effectively and efficiently hallucinate the optical flow features, which together with the appearance features consistently improve video classification performance. Moreover, MoNet cuts the computational and data-storage burdens of two-stream video classification almost in half. Our code is available at: https://github.com/YongyiTang92/MoNet-Features.
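To make the hallucination idea concrete, below is a minimal, hypothetical sketch of a network that maps a sequence of appearance features to predicted optical flow features, trained to regress precomputed flow features. The module structure, dimensions, and MSE loss are illustrative assumptions, not the authors' actual MoNet architecture; see the repository linked above for the real implementation.

```python
# Hypothetical sketch: hallucinate flow-stream features from appearance-stream
# features. All names and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn


class MotionHallucinator(nn.Module):
    """Predicts flow-stream features from appearance-stream features."""

    def __init__(self, feat_dim: int = 1024, hidden_dim: int = 512):
        super().__init__()
        # Model temporal relationships across the appearance features.
        self.temporal = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Project back to the dimensionality of the flow features.
        self.project = nn.Linear(hidden_dim, feat_dim)

    def forward(self, appearance: torch.Tensor) -> torch.Tensor:
        # appearance: (batch, time, feat_dim)
        hidden, _ = self.temporal(appearance)
        return self.project(hidden)  # (batch, time, feat_dim)


if __name__ == "__main__":
    model = MotionHallucinator()
    app_feats = torch.randn(2, 8, 1024)   # appearance features
    flow_feats = torch.randn(2, 8, 1024)  # precomputed flow features (training targets)
    pred = model(app_feats)
    loss = nn.functional.mse_loss(pred, flow_feats)  # regression to flow features
    loss.backward()
    print(pred.shape, float(loss))
```

At inference time, a model like this needs only RGB frames: the expensive optical flow extraction is used solely to supply training targets, which is what enables the roughly halved compute and storage costs reported in the abstract.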
URL
https://arxiv.org/abs/1905.11799