Abstract
Text-to-video retrieval (TVR) aims to find the most relevant video in a large video gallery given a query text. The intricate and abundant context of videos challenges both the performance and the efficiency of TVR. To handle serialized video content, existing methods typically select a subset of frames within a video to represent its content. How to select the most representative frames is a crucial issue: the selected frames must not only retain the semantic information of the video but also improve retrieval efficiency by excluding temporally redundant frames. In this paper, we present the first empirical study of frame selection for TVR. We systematically classify existing frame selection methods into text-free and text-guided ones, and under this taxonomy we analyze six different frame selection methods in detail in terms of effectiveness and efficiency. Two of these methods are first proposed in this paper. Based on a comprehensive analysis on multiple TVR benchmarks, we empirically conclude that TVR with proper frame selection can significantly improve retrieval efficiency without sacrificing retrieval performance.
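To make the text-free versus text-guided distinction concrete, the following is a minimal sketch of one possible selector from each family. It is not taken from the paper: the abstract does not name the six methods it studies. The function names `uniform_select` and `text_guided_select` are hypothetical, and cosine similarity over precomputed frame and text embeddings is an assumed scoring choice.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def uniform_select(num_frames: int, k: int) -> List[int]:
    """Text-free (assumed example): pick k frame indices spread evenly over the video."""
    if k >= num_frames:
        return list(range(num_frames))
    # Split the video into k equal-length segments and take each segment's centre frame.
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def text_guided_select(frame_feats: Sequence[Vector], text_feat: Vector, k: int) -> List[int]:
    """Text-guided (assumed example): keep the k frames most similar to the query text.

    The embeddings are assumed to live in a shared text-image space; how they
    are produced is outside the scope of this sketch.
    """
    ranked = sorted(
        range(len(frame_feats)),
        key=lambda i: cosine(frame_feats[i], text_feat),
        reverse=True,
    )
    # Return the chosen indices in temporal order.
    return sorted(ranked[:k])
```

In this sketch the text-free selector depends only on the video length, so its output can be computed once per video and reused for every query, whereas the text-guided selector must be rerun for each query. That difference is one reason the two families trade off effectiveness and efficiency differently.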
URL
https://arxiv.org/abs/2311.00298