Abstract
Video summarization aims to create short, accurate, and cohesive summaries of longer videos. Despite the existence of various video summarization datasets, a notable limitation is their small number of source videos, which hampers effective fine-tuning of advanced large vision-language models (VLMs). Additionally, most existing datasets were created for video-to-video summarization, overlooking the contemporary need for multimodal summarization of video content. Recent efforts have expanded the task from unimodal to multimodal video summarization, categorizing it into three sub-tasks based on the summary's modality: video-to-video (V2V), video-to-text (V2T), and combined video and text summarization (V2VT). However, the textual summaries in previous multimodal datasets are inadequate. To address these issues, we introduce Instruct-V2Xum, a cross-modal video summarization dataset featuring 30,000 diverse videos sourced from YouTube, with lengths ranging from 40 to 940 seconds and an average summarization ratio of 16.39%. Each video summary in Instruct-V2Xum is paired with a textual summary that references specific frame indices, facilitating the generation of aligned video and textual summaries. In addition, we propose a new video summarization framework named V2Xum-LLM. V2Xum-LLM, instantiated as V2Xum-LLaMA in this study, is the first framework that unifies different video summarization tasks in the text decoder of a single large language model (LLM) and achieves task-controllable video summarization with temporal prompts and task instructions. Experiments show that V2Xum-LLaMA outperforms strong baseline models on multiple video summarization tasks. Furthermore, we propose an enhanced evaluation metric for the V2V and V2VT summarization tasks.
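To illustrate how a textual summary that references frame indices can yield an aligned video summary, here is a minimal, hypothetical sketch. The `<F12>`-style tag format and all function names are assumptions for illustration only, not the actual Instruct-V2Xum annotation scheme.

```python
import re

def parse_frame_references(text_summary: str) -> list[int]:
    # Extract frame indices referenced in a textual summary.
    # Assumes (hypothetically) references appear as tokens like "<F12>".
    return sorted({int(m) for m in re.findall(r"<F(\d+)>", text_summary)})

def summarization_ratio(selected_frames: list[int], total_frames: int) -> float:
    # Fraction of source frames kept in the video summary.
    return len(selected_frames) / total_frames

# Toy example: the selected frames define the video summary,
# while the surrounding text is the aligned textual summary.
summary = "A chef dices onions <F12>, sears them <F48>, and plates the dish <F120>."
frames = parse_frame_references(summary)  # -> [12, 48, 120]
```

Under this scheme, decoding both modalities from one text stream is natural for an LLM decoder: the model emits narrative text interleaved with frame references, and the video summary is recovered by collecting the referenced frames.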
URL
https://arxiv.org/abs/2404.12353