Abstract
Despite its wide range of applications, video summarization is still held back by the scarcity of extensive datasets, largely due to the labor-intensive and costly nature of frame-level annotations. As a result, existing video summarization methods are prone to overfitting. To mitigate this challenge, we propose a novel self-supervised video representation learning method that uses knowledge distillation to pre-train a transformer encoder. Our method matches its semantic video representation, which is constructed with respect to frame importance scores, to a representation derived from a CNN trained on video classification. Empirical evaluations on correlation-based metrics, such as Kendall's $\tau$ and Spearman's $\rho$, demonstrate the superiority of our approach over existing state-of-the-art methods in assigning relative scores to the input frames.
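The rank-correlation metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names (`average_ranks`, `pearson`, `spearman_rho`, `kendall_tau_a`) and the toy scores are assumptions made for illustration, and the Kendall variant shown is the simple tau-a, which does not correct for ties the way the tau-b variant found in common statistics libraries does.

```python
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(predicted, reference):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(predicted), average_ranks(reference))


def kendall_tau_a(predicted, reference):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(predicted)), 2):
        sign = (predicted[i] - predicted[j]) * (reference[i] - reference[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(predicted) * (len(predicted) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-frame importance scores for a five-frame clip.
predicted_scores = [0.1, 0.4, 0.35, 0.8, 0.6]
human_scores = [0.2, 0.3, 0.5, 0.9, 0.7]

print("Spearman rho:", spearman_rho(predicted_scores, human_scores))
print("Kendall tau-a:", kendall_tau_a(predicted_scores, human_scores))
```

In this toy example only one of the ten frame pairs is ordered differently by the two score lists, so both coefficients are close to 1. Both metrics depend only on the relative ordering of the scores, not on their absolute values, which is why they suit the task of assigning relative scores to input frames.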
URL
https://arxiv.org/abs/2303.15993