Abstract
Video summarization aims to create short, accurate, and cohesive summaries of longer videos. Despite the existence of various video summarization datasets, a notable limitation is their small number of source videos, which hampers effective fine-tuning of advanced large vision-language models (VLMs). Additionally, most existing datasets were created for video-to-video summarization, overlooking the contemporary need for multimodal summarization of video content. Recent efforts have expanded the task from unimodal to multimodal video summarization, categorizing it into three sub-tasks based on the summary's modality: video-to-video (V2V), video-to-text (V2T), and combined video and text summarization (V2VT). However, the textual summaries in previous multimodal datasets are inadequate. To address these issues, we introduce Instruct-V2Xum, a cross-modal video summarization dataset featuring 30,000 diverse videos sourced from YouTube, with lengths ranging from 40 to 940 seconds and an average summarization ratio of 16.39%. Each video summary in Instruct-V2Xum is paired with a textual summary that references specific frame indices, facilitating the generation of aligned video and textual summaries. In addition, we propose a new video summarization framework named V2Xum-LLM. V2Xum-LLM, instantiated as V2Xum-LLaMA in this study, is the first framework that unifies different video summarization tasks in the text decoder of a single large language model (LLM) and achieves task-controllable video summarization with temporal prompts and task instructions. Experiments show that V2Xum-LLaMA outperforms strong baseline models on multiple video summarization tasks. Furthermore, we propose an enhanced evaluation metric for the V2V and V2VT summarization tasks.
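To illustrate how a textual summary that references frame indices can yield an aligned video summary, here is a minimal, hypothetical sketch. The `<F12>`-style tag format and all function names are assumptions for illustration only, not the actual Instruct-V2Xum annotation scheme.

```python
import re

def parse_frame_references(text_summary: str) -> list[int]:
    # Extract frame indices referenced in a textual summary.
    # Assumes (hypothetically) references appear as tokens like "<F12>".
    return sorted({int(m) for m in re.findall(r"<F(\d+)>", text_summary)})

def summarization_ratio(selected_frames: list[int], total_frames: int) -> float:
    # Fraction of source frames kept in the video summary.
    return len(selected_frames) / total_frames

# Toy example: the selected frames define the video summary,
# while the surrounding text is the aligned textual summary.
summary = "A chef dices onions <F12>, sears them <F48>, and plates the dish <F120>."
frames = parse_frame_references(summary)  # -> [12, 48, 120]
```

Under this scheme, decoding both modalities from one text stream is natural for an LLM decoder: the model emits narrative text interleaved with frame references, and the video summary is recovered by collecting the referenced frames.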
URL
https://arxiv.org/abs/2404.12353