Paper Reading AI Learner

SELF-VS: Self-supervised Encoding Learning For Video Summarization

2023-03-28 14:08:05
Hojjat Mokhtarabadi, Kave Bahraman, Mehrdad HosseinZadeh, Mahdi Eftekhari

Abstract

Despite its wide range of applications, video summarization is still held back by the scarcity of extensive datasets, largely due to the labor-intensive and costly nature of frame-level annotations. As a result, existing video summarization methods are prone to overfitting. To mitigate this challenge, we propose a novel self-supervised video representation learning method using knowledge distillation to pre-train a transformer encoder. Our method matches its semantic video representation, which is constructed with respect to frame importance scores, to a representation derived from a CNN trained on video classification. Empirical evaluations on correlation-based metrics, such as Kendall's $\tau$ and Spearman's $\rho$ demonstrate the superiority of our approach compared to existing state-of-the-art methods in assigning relative scores to the input frames.

Abstract (translated)

尽管视频摘要有许多应用,但由于帧级注释的劳动密集型和昂贵性质,仍然面临着缺乏大量数据的困境。因此,现有的视频摘要方法往往容易过拟合。为了缓解这一挑战,我们提出了一种全新的自监督视频表示学习方法,通过知识蒸馏来预先训练transformer编码器。我们的方法和基于视频分类训练的卷积神经网络的表示进行比较,以匹配其语义视频表示,该表示基于帧重要性得分构建。对基于相关性的度量指标(如卡德维的$\tau$和Spearman的$\rho$)的实证评估表明,我们的方法相对于现有的最先进的方法在给输入帧赋予相对评分方面具有优势。

URL

https://arxiv.org/abs/2303.15993

PDF

https://arxiv.org/pdf/2303.15993.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot