Abstract
Generic event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks. Existing methods typically require video frames to be fully decoded before being fed into the network, which introduces significant spatio-temporal redundancy and demands considerable computation and storage. To remedy these issues, we propose a novel, fully end-to-end compressed-video representation learning method for event boundary detection that leverages the rich information in the compressed domain, i.e., RGB, motion vectors, residuals, and the internal group-of-pictures (GOP) structure, without fully decoding the video. Specifically, we use lightweight ConvNets to extract features of the P-frames in the GOPs, and a spatial-channel attention module (SCAM) with bidirectional information flow is designed to refine the P-frame feature representations based on the compressed information. To learn a representation suitable for boundary detection, we construct a local frame bag for each candidate frame and use a long short-term memory (LSTM) module to capture temporal relationships. We then compute frame differences with group similarities in the temporal domain. This module is applied only within a local window, which is critical for event boundary detection. Finally, a simple classifier determines the event boundaries of video sequences based on the learned feature representation. To remedy the ambiguity of the annotations and to speed up training, we preprocess the ground-truth event boundaries with a Gaussian kernel. Extensive experiments on the Kinetics-GEBD and TAPOS datasets demonstrate that the proposed method achieves considerable improvements over the previous end-to-end approach while running at the same speed. The code is available at this https URL.
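The Gaussian-kernel preprocessing of ground-truth boundaries can be sketched as follows. This is a minimal illustration of the general idea (turning hard 0/1 boundary annotations into soft temporal targets), not the paper's exact implementation; the kernel width `sigma` and the max-merging of overlapping kernels are assumptions.

```python
import numpy as np

def smooth_boundary_labels(boundary_indices, num_frames, sigma=1.0):
    """Replace hard 0/1 boundary annotations with soft targets by placing
    a Gaussian kernel at each annotated boundary frame.  Soft targets
    tolerate the ambiguity of human boundary annotations and give the
    classifier a smoother training signal."""
    labels = np.zeros(num_frames, dtype=np.float64)
    frames = np.arange(num_frames, dtype=np.float64)
    for b in boundary_indices:
        # Gaussian centered on the annotated boundary; overlapping
        # kernels are merged with an element-wise maximum (an assumption).
        kernel = np.exp(-((frames - b) ** 2) / (2.0 * sigma ** 2))
        labels = np.maximum(labels, kernel)
    return labels
```

For example, a boundary annotated at frame 5 of an 11-frame clip yields a target of 1.0 at frame 5 that decays symmetrically toward the neighboring frames, so near-misses are penalized less than distant predictions.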
URL
https://arxiv.org/abs/2309.15431