Abstract
Existing long video retrieval systems are trained and tested in the paragraph-to-video retrieval regime, where every long video is described by a single long paragraph. This neglects the richness and variety of valid descriptions of a video, which could be described in moment-by-moment detail, in a single summary phrase, or at any granularity in between. To evaluate the capabilities of long video retrieval systems more thoroughly, we propose a pipeline that leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos. We validate this pipeline's fidelity via rigorous human inspection. We then benchmark a representative set of video language models on these synthetic captions across several long video datasets, showing that they struggle with the transformed data, especially the shortest captions. We also propose a lightweight fine-tuning method that uses a contrastive loss to learn a hierarchical embedding, exploiting the differing levels of information among the various captions. Our method improves performance both on the downstream paragraph-to-video retrieval task (+1.1% R@1 on ActivityNet) and on the various long video retrieval metrics we compute using our synthetic data (+3.6% R@1 for short descriptions on ActivityNet). For data access and other details, please refer to our project website at this https URL.
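The hierarchical contrastive fine-tuning mentioned above could be sketched as follows. This is a minimal illustration, not the authors' exact formulation: the function names, the symmetric InfoNCE objective, and the per-granularity weighting scheme are all assumptions for the sketch.

```python
import numpy as np

def info_nce(video_emb, caption_emb, temperature=0.07):
    """Symmetric InfoNCE contrastive loss over a batch of paired embeddings.
    Rows at the same index are treated as positive pairs."""
    # L2-normalize so dot products are cosine similarities
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    c = caption_emb / np.linalg.norm(caption_emb, axis=1, keepdims=True)
    logits = v @ c.T / temperature
    labels = np.arange(len(v))

    def xent(l):
        # numerically stable cross-entropy with the diagonal as targets
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average of video-to-text and text-to-video directions
    return 0.5 * (xent(logits) + xent(logits.T))

def hierarchical_loss(video_emb, captions_by_level, level_weights=None):
    """Weighted sum of contrastive losses, one term per caption
    granularity (e.g. phrase summary, sentence, full paragraph).
    The weighting is a hypothetical choice for illustration."""
    if level_weights is None:
        level_weights = [1.0] * len(captions_by_level)
    return sum(w * info_nce(video_emb, c)
               for w, c in zip(level_weights, captions_by_level))
```

In this sketch, each caption granularity contributes its own contrastive term against the shared video embedding, so short summaries and long paragraphs both shape the embedding space rather than only the single-paragraph pairing used in standard paragraph-to-video training.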
URL
https://arxiv.org/abs/2312.00115