Abstract
The state of the art in video understanding suffers from two problems: (1) Most reasoning is performed locally in the video, so it misses important relationships within actions that span several seconds. (2) Although some local methods offer fast per-frame processing, processing the whole video is inefficient, which hampers fast video retrieval and online classification of long-term activities. In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time. The architecture merges long-term content inside the network rather than in a post-hoc fusion step. Together with a sampling strategy that exploits the fact that neighboring frames are largely redundant, this yields high-quality action classification and video captioning at up to 230 videos per second, where each video can consist of a few hundred frames. The approach achieves competitive performance across all datasets while being 10x to 80x faster than state-of-the-art methods.
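The abstract does not spell out the sampling strategy, so the following is only an illustrative sketch of one way to exploit redundancy between neighboring frames: split the video into equal-length segments and draw a single frame from each. The function name `sample_frame_indices`, its parameters, and the choice of 16 segments are assumptions made for this example and are not taken from the paper.

```python
import random


def sample_frame_indices(num_frames, num_segments, rng=None):
    """Pick one frame index from each of `num_segments` equal-length spans.

    Hypothetical helper: illustrates sparse sampling in general, not the
    authors' exact procedure.
    """
    if num_frames <= 0 or num_segments <= 0:
        raise ValueError("num_frames and num_segments must be positive")
    rng = rng or random.Random()
    indices = []
    for s in range(num_segments):
        # Integer boundaries of segment s, covering [start, end).
        start = s * num_frames // num_segments
        end = (s + 1) * num_frames // num_segments
        if end <= start:
            # Video has fewer frames than segments: reuse the nearest frame.
            indices.append(min(start, num_frames - 1))
        else:
            indices.append(rng.randrange(start, end))
    return indices


# Example: a 300-frame video reduced to 16 frames spread over its full length.
frame_ids = sample_frame_indices(300, 16, random.Random(0))
```

Under this scheme the number of frames passed to the network is fixed by `num_segments` and does not grow with video length, which is one plausible reason per-video processing can stay fast even for videos of a few hundred frames.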
URL
https://arxiv.org/abs/1804.09066