Abstract
In this work we present a new, efficient approach to Human Action Recognition called the Video Transformer Network (VTN). It leverages recent advances in Computer Vision and Natural Language Processing and applies them to video understanding. The proposed method allows us to create lightweight CNN models that achieve high accuracy at real-time speed using just a monocular RGB camera and a general-purpose CPU. Furthermore, we explain how to improve accuracy by distilling knowledge from multiple models trained on different modalities into a single model. We compare against state-of-the-art methods and show that our approach performs on par with most of them on popular Action Recognition datasets. We benchmark the inference time of the models using a modern inference framework and argue that our approach compares favorably with other methods in terms of the speed/accuracy trade-off, running at 56 FPS on CPU. The models and the training code are available.
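The multi-modal distillation mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is a hypothetical, generic formulation of knowledge distillation (averaged temperature-softened teacher outputs as the soft target, KL divergence as the loss), not the authors' exact recipe; all function names, the temperature value, and the "one teacher per modality" setup are assumptions for illustration.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T yields softer distributions.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_targets(teacher_logits_list, T=4.0):
    # Average the softened predictions of several teachers
    # (e.g. one model per modality: RGB, optical flow, depth)
    # into a single soft target for the student.
    probs = [softmax(l, T) for l in teacher_logits_list]
    return np.mean(probs, axis=0)

def kl_distillation_loss(student_logits, soft_targets, T=4.0):
    # KL divergence between the soft targets and the student's
    # softened output; the student minimizes this during training.
    p = soft_targets
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))
```

In practice this distillation term would typically be combined with the ordinary cross-entropy loss on the ground-truth action labels, so the single student model retains supervised accuracy while absorbing the teachers' multi-modal knowledge.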
URL
https://arxiv.org/abs/1905.08711