Abstract
We propose a Spatiotemporal Sampling Network (STSN) that uses deformable convolutions across time for object detection in videos. Our STSN performs object detection in a video frame by learning to spatially sample features from adjacent frames. This naturally renders the approach robust to occlusion or motion blur in individual frames. Our framework does not require additional supervision, as it optimizes sampling locations directly with respect to object detection performance. Our STSN outperforms the state of the art on the ImageNet VID dataset, and compared to prior video object detection methods, it uses a simpler design and does not require optical flow data for training.
URL
https://arxiv.org/abs/1803.05549