Paper Reading AI Learner

Prompts to Summaries: Zero-Shot Language-Guided Video Summarization

2025-06-12 15:23:11
Mario Barbara, Alaa Maalouf

Abstract

The explosive growth of video data has intensified the need for flexible, user-controllable summarization tools that can operate without domain-specific training data. Existing methods either rely on datasets, which limits generalization, or cannot incorporate user intent expressed in natural language. We introduce Prompts-to-Summaries: the first zero-shot, text-queryable video summarizer that converts captions from off-the-shelf video-language models (VidLMs) into user-guided skims via large language model (LLM) judging, without using any training data, beating all unsupervised methods and matching supervised ones. Our pipeline (i) segments raw video footage into coherent scenes, (ii) generates rich scene-level descriptions through a memory-efficient, batch-style VidLM prompting scheme that scales to hours-long videos on a single GPU, (iii) uses an LLM as a judge to assign scene-level importance scores under a carefully crafted prompt, and finally (iv) propagates those scores to the short-segment level via two new metrics, consistency (temporal coherence) and uniqueness (novelty), yielding fine-grained frame importance. On SumMe and TVSum, our data-free approach surpasses all prior data-hungry unsupervised methods. It also performs competitively on the Query-Focused Video Summarization (QFVS) benchmark, despite using no training data, whereas the competing methods require supervised frame-level importance annotations. To spur further research, we release VidSum-Reason, a new query-driven dataset featuring long-tailed concepts and multi-step reasoning; our framework attains robust F1 scores and serves as the first challenging baseline. Overall, our results demonstrate that pretrained multimodal models, when orchestrated with principled prompting and score propagation, already provide a powerful foundation for universal, text-queryable video summarization.
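
The score-propagation step (iv) is the most algorithmically concrete part of the pipeline. Below is a minimal, runnable sketch of how such propagation could work, assuming segment captions are embedded as vectors and that consistency and uniqueness are computed from cosine similarities; the function names and these exact formulas are illustrative assumptions, not the authors' published formulation.

```python
# Hypothetical sketch of step (iv): spreading one LLM-assigned scene-level
# importance score to the scene's short segments. The cosine-based
# definitions of consistency and uniqueness below are assumptions made for
# illustration, not the paper's exact metrics.
import numpy as np

def _cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def propagate_scene_score(scene_score: float,
                          segment_embeds: np.ndarray) -> np.ndarray:
    """Turn one scene score into fine-grained per-segment importance.

    scene_score:    importance of the whole scene, e.g. in [0, 1],
                    as produced by the LLM judge in step (iii)
    segment_embeds: (n_segments, d) text embeddings of segment captions
    returns:        (n_segments,) segment-level importance scores
    """
    # Consistency (temporal coherence, assumed form): how well each
    # segment's caption agrees with the scene's overall content, taken
    # here to be the mean segment embedding.
    centroid = segment_embeds.mean(axis=0, keepdims=True)
    consistency = _cosine(segment_embeds, centroid).ravel()

    # Uniqueness (novelty, assumed form): one minus each segment's maximum
    # similarity to any *other* segment, so redundant segments score low.
    sims = _cosine(segment_embeds, segment_embeds)
    np.fill_diagonal(sims, -1.0)  # exclude self-similarity
    uniqueness = 1.0 - sims.max(axis=1)

    return scene_score * 0.5 * (consistency + uniqueness)

# Toy usage: one scene scored 0.8 by the judge, four segment captions
# embedded into 384-dimensional vectors (random stand-ins here).
rng = np.random.default_rng(0)
print(propagate_scene_score(0.8, rng.normal(size=(4, 384))))
```

Under the standard SumMe/TVSum evaluation protocol, such per-segment (and hence per-frame) scores would then be converted into the final skim by selecting top-scoring shots under a length budget, typically 15% of the video's duration.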

URL

https://arxiv.org/abs/2506.10807

PDF

https://arxiv.org/pdf/2506.10807.pdf

