Abstract
In this paper, a novel two-stream architecture for temporal action proposal generation in long, untrimmed videos is presented. Inspired by recent advances in human action recognition that combine 3D convolutions with two-stream networks, and building on the Single-Stream Temporal Action Proposals (SST) architecture, four different two-stream architectures are investigated, each processing sequences of images on one stream and optical flow images on the other. The four architectures fuse the two streams at different depths in the model; for each of them, a broad range of parameters is investigated systematically and an optimal parametrization is determined empirically. Experiments on action and sports datasets show that all four two-stream architectures outperform the original single-stream SST and achieve state-of-the-art results. Additional experiments show that the improvements are not tied to a single method of computing optical flow: replacing the previously used method of Brox with FlowNet2 still yields improvements.
URL
https://arxiv.org/abs/1903.04176