Abstract
The rapid proliferation of online video content necessitates effective video summarization techniques. Traditional methods, often relying on a single modality (typically visual), struggle to capture the full semantic richness of videos. This paper introduces MF2Summ, a novel video summarization model based on multimodal content understanding, integrating both visual and auditory information. MF2Summ employs a five-stage process: feature extraction, cross-modal attention interaction, feature fusion, segment prediction, and key shot selection. Visual features are extracted using a pre-trained GoogLeNet model, while auditory features are derived using SoundNet. The core of our fusion mechanism involves a cross-modal Transformer and an alignment-guided self-attention Transformer, designed to effectively model inter-modal dependencies and temporal correspondences. Segment importance, location, and center-ness are predicted, followed by key shot selection using Non-Maximum Suppression (NMS) and the Kernel Temporal Segmentation (KTS) algorithm. Experimental results on the SumMe and TVSum datasets demonstrate that MF2Summ achieves competitive performance, notably improving F1-scores by 1.9% and 0.6% respectively over the DSNet model, and performing favorably against other state-of-the-art methods.
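The sketch below illustrates, in rough outline, the fusion-and-prediction idea the abstract describes: each modality attends to the other via cross-modal attention, the fused sequence passes through a self-attention Transformer layer, and per-frame heads predict importance, center-ness, and segment boundary offsets. This is not the authors' implementation; the 1024-dimensional feature size, layer configuration, and head design are illustrative assumptions, and the NMS/KTS key-shot selection stage is omitted.

```python
# Minimal sketch (assumptions, not the paper's code) of cross-modal attention
# fusion followed by segment-prediction heads, using PyTorch building blocks.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        # Cross-modal attention: each modality queries the other.
        self.vis_from_aud = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.aud_from_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Self-attention Transformer layer over the fused, temporally aligned sequence.
        self.self_attn = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)
        # Per-frame heads: importance score, center-ness, and left/right
        # boundary offsets (head layout is a guess based on the abstract).
        self.importance = nn.Linear(dim, 1)
        self.centerness = nn.Linear(dim, 1)
        self.location = nn.Linear(dim, 2)

    def forward(self, vis, aud):
        # vis, aud: (batch, time, dim), assumed frame-aligned features
        # (e.g. GoogLeNet visual and SoundNet audio embeddings).
        v, _ = self.vis_from_aud(vis, aud, aud)   # visual attends to audio
        a, _ = self.aud_from_vis(aud, vis, vis)   # audio attends to visual
        fused = self.fuse(torch.cat([v, a], dim=-1))
        fused = self.self_attn(fused)
        return (self.importance(fused).squeeze(-1),
                self.centerness(fused).squeeze(-1),
                self.location(fused))


# Usage with random stand-ins for real frame features.
model = CrossModalFusion()
imp, ctr, loc = model(torch.randn(1, 120, 1024), torch.randn(1, 120, 1024))
print(imp.shape, ctr.shape, loc.shape)  # (1, 120) (1, 120) (1, 120, 2)
```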
URL
https://arxiv.org/abs/2506.10430