Abstract
The explosive growth of video data has intensified the need for flexible, user-controllable summarization tools that can operate without domain-specific training data. Existing methods either rely on datasets, limiting generalization, or cannot incorporate user intent expressed in natural language. We introduce Prompts-to-Summaries: the first zero-shot, text-queryable video summarizer that converts captions from off-the-shelf video-language models (VidLMs) into user-guided skims via large language model (LLM) judging, without using any training data, outperforming all unsupervised methods and matching supervised ones. Our pipeline (i) segments raw video footage into coherent scenes, (ii) generates rich scene-level descriptions through a memory-efficient, batch-style VidLM prompting scheme that scales to hours-long videos on a single GPU, (iii) leverages an LLM as a judge to assign scene-level importance scores under a carefully crafted prompt, and finally (iv) propagates those scores to the short-segment level via two new metrics, consistency (temporal coherency) and uniqueness (novelty), yielding fine-grained frame importance. On SumMe and TVSum, our data-free approach surpasses all prior data-hungry unsupervised methods. It also performs competitively on the Query-Focused Video Summarization (QFVS) benchmark, despite using no training data while the competing methods require supervised frame-level importance annotations. To spur further research, we release VidSum-Reason, a new query-driven dataset featuring long-tailed concepts and multi-step reasoning; our framework attains robust F1 scores and serves as the first challenging baseline. Overall, our results demonstrate that pretrained multimodal models, when orchestrated with principled prompting and score propagation, already provide a powerful foundation for universal, text-queryable video summarization.
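To make the four-stage pipeline concrete, below is a minimal conceptual sketch in Python. Every function name, the stubbed VidLM/LLM calls, and the toy formulas standing in for the consistency and uniqueness metrics are assumptions for illustration only; they are not the paper's actual implementation.

```python
# Conceptual sketch of the Prompts-to-Summaries pipeline described in the abstract.
# All names, stubs, and scoring formulas are hypothetical placeholders.

from dataclasses import dataclass
from typing import List


@dataclass
class Scene:
    start: float              # scene start time (seconds)
    end: float                # scene end time (seconds)
    caption: str = ""         # VidLM-generated description
    importance: float = 0.0   # LLM-judged scene score


def segment_into_scenes(video_path: str) -> List[Scene]:
    """(i) Split raw footage into coherent scenes (e.g. shot-boundary detection).
    Stubbed here with fixed-length 10-second scenes."""
    return [Scene(start=float(t), end=float(t) + 10.0) for t in range(0, 60, 10)]


def caption_scene(video_path: str, scene: Scene) -> str:
    """(ii) Prompt a VidLM for a rich scene-level description.
    A real system would batch frames through the model; stubbed here."""
    return f"Placeholder caption for {scene.start:.0f}-{scene.end:.0f}s"


def judge_importance(captions: List[str], query: str) -> List[float]:
    """(iii) Ask an LLM judge to score each scene's relevance to the user query
    under a carefully crafted prompt. Stubbed with uniform scores."""
    return [1.0 for _ in captions]


def propagate_scores(scenes: List[Scene], segments_per_scene: int = 5) -> List[float]:
    """(iv) Spread scene scores down to short segments, modulated by toy
    stand-ins for consistency (temporal coherency) and uniqueness (novelty)."""
    mean_score = sum(s.importance for s in scenes) / len(scenes)
    segment_scores: List[float] = []
    for i, scene in enumerate(scenes):
        # Consistency: agreement with neighboring scenes' scores (toy version).
        neighbors = [s.importance for s in scenes[max(0, i - 1):i + 2]]
        consistency = sum(neighbors) / len(neighbors)
        # Uniqueness: how much this scene stands out from the average (toy version).
        uniqueness = abs(scene.importance - mean_score)
        for _ in range(segments_per_scene):
            segment_scores.append(scene.importance * (consistency + uniqueness))
    return segment_scores


if __name__ == "__main__":
    query = "moments showing the main subject outdoors"  # user intent in natural language
    scenes = segment_into_scenes("example_video.mp4")
    for scene in scenes:
        scene.caption = caption_scene("example_video.mp4", scene)
    for scene, score in zip(scenes, judge_importance([s.caption for s in scenes], query)):
        scene.importance = score
    frame_scores = propagate_scores(scenes)
    print(f"{len(frame_scores)} segment-level importance scores computed")
```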
URL
https://arxiv.org/abs/2506.10807