Local Compressed Video Stream Learning for Generic Event Boundary Detection

2023-09-27 06:49:40
Libo Zhang, Xin Gu, Congcong Li, Tiejian Luo, Heng Fan

Abstract

Generic event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks. Existing methods typically require video frames to be decoded before they are fed into the network; the decoded frames contain significant spatio-temporal redundancy, and decoding demands considerable computational power and storage space. To remedy these issues, we propose a novel, fully end-to-end compressed video representation learning method for event boundary detection that leverages the rich information in the compressed domain, i.e., RGB, motion vectors, residuals, and the internal group of pictures (GOP) structure, without fully decoding the video. Specifically, we use lightweight ConvNets to extract features of the P-frames in the GOPs, and a spatial-channel attention module (SCAM) is designed to refine the feature representations of the P-frames based on the compressed information with bidirectional information flow. To learn a suitable representation for boundary detection, we construct a local frames bag for each candidate frame and use a long short-term memory (LSTM) module to capture temporal relationships. We then compute frame differences with group similarities in the temporal domain. This module is applied only within a local window, which is critical for event boundary detection. Finally, a simple classifier is used to determine the event boundaries of video sequences based on the learned feature representation. To remedy the ambiguities of annotations and speed up the training process, we use a Gaussian kernel to preprocess the ground-truth event boundaries. Extensive experiments conducted on the Kinetics-GEBD and TAPOS datasets demonstrate that the proposed method achieves considerable improvements compared to previous end-to-end approaches while running at the same speed. The code is available at this https URL.
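
The abstract mentions preprocessing the ground-truth event boundaries with a Gaussian kernel to soften annotation ambiguity. The sketch below illustrates one common way such smoothing can be done; it is not taken from the authors' code. The function name `gaussian_soft_labels`, the `sigma` parameter, and the choice to combine overlapping boundaries with a maximum are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from typing import List, Sequence


def gaussian_soft_labels(
    num_frames: int,
    boundary_frames: Sequence[int],
    sigma: float = 1.0,
) -> List[float]:
    """Turn hard boundary indices into per-frame soft targets in [0, 1].

    Each frame receives the largest Gaussian response over all annotated
    boundaries, so a frame exactly on a boundary gets 1.0 and nearby
    frames decay smoothly. `sigma` (in frames) is a hypothetical
    hyperparameter, not a value reported in the abstract.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    labels = [0.0] * num_frames
    for t in range(num_frames):
        for b in boundary_frames:
            # Unnormalized Gaussian centered on the annotated boundary b.
            weight = math.exp(-((t - b) ** 2) / (2.0 * sigma ** 2))
            # Assumption: overlapping kernels are merged with max, not sum.
            labels[t] = max(labels[t], weight)
    return labels


if __name__ == "__main__":
    # Ten frames with annotated boundaries at frames 3 and 7.
    soft = gaussian_soft_labels(num_frames=10, boundary_frames=[3, 7], sigma=1.0)
    print([round(v, 3) for v in soft])
```

With soft targets of this kind, a prediction that lands one frame away from an annotated boundary is penalized less than one far from any boundary, which is one plausible reading of how the smoothing helps with ambiguous annotations.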

URL

https://arxiv.org/abs/2309.15431

PDF

https://arxiv.org/pdf/2309.15431.pdf

