Paper Reading AI Learner

V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning

2024-04-18 17:32:46
Hang Hua, Yunlong Tang, Chenliang Xu, Jiebo Luo

Abstract

Video summarization aims to create short, accurate, and cohesive summaries of longer videos. Despite the existence of various video summarization datasets, a notable limitation is the small number of source videos they contain, which hampers the effective fine-tuning of advanced large vision-language models (VLMs). Additionally, most existing datasets are created for video-to-video summarization, overlooking the contemporary need for multimodal video content summarization. Recent efforts have been made to expand from unimodal to multimodal video summarization, categorizing the task into three sub-tasks based on the summary's modality: video-to-video (V2V), video-to-text (V2T), and a combination of video and text summarization (V2VT). However, the textual summaries in previous multimodal datasets are inadequate. To address these issues, we introduce Instruct-V2Xum, a cross-modal video summarization dataset featuring 30,000 diverse videos sourced from YouTube, with lengths ranging from 40 to 940 seconds and an average summarization ratio of 16.39%. Each video summary in Instruct-V2Xum is paired with a textual summary that references specific frame indices, facilitating the generation of aligned video and textual summaries. In addition, we propose a new video summarization framework named V2Xum-LLM. V2Xum-LLM, specifically V2Xum-LLaMA in this study, is the first framework that unifies different video summarization tasks into one large language model's (LLM) text decoder and achieves task-controllable video summarization with temporal prompts and task instructions. Experiments show that V2Xum-LLaMA outperforms strong baseline models on multiple video summarization tasks. Furthermore, we propose an enhanced evaluation metric for V2V and V2VT summarization tasks.
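The abstract describes textual summaries that reference specific frame indices, which lets a single text decoder emit both a textual summary and the frame selection for a video summary. A minimal sketch of how such aligned output might be post-processed, assuming a hypothetical `<frame_i>` token format (the dataset's actual annotation scheme is not specified here), along with the summarization-ratio computation (selected frames / total frames):

```python
import re

def parse_cross_modal_summary(text_summary: str, num_frames: int):
    """Split a frame-index-referencing textual summary into its two modalities.

    Returns the referenced frame indices (the video summary), the plain
    textual summary with index tokens stripped, and the summarization ratio.
    The <frame_i> token format is an illustrative assumption, not the
    paper's actual markup.
    """
    indices = sorted({int(i) for i in re.findall(r"<frame_(\d+)>", text_summary)
                      if int(i) < num_frames})
    clean_text = re.sub(r"\s*<frame_\d+>", "", text_summary).strip()
    ratio = len(indices) / num_frames
    return indices, clean_text, ratio

idx, text, ratio = parse_cross_modal_summary(
    "A chef dices onions <frame_12> and sears them in a pan <frame_87>.",
    num_frames=600)
# idx == [12, 87]; text is the summary with tokens removed
```

A real pipeline would map these indices back to timestamps (index / fps) to assemble the output video clip; a 16.39% average ratio means roughly one in six source frames survives into the summary.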


URL

https://arxiv.org/abs/2404.12353

PDF

https://arxiv.org/pdf/2404.12353.pdf

