Abstract
Composed video retrieval (CoVR) is a challenging computer vision problem that combines a visual query with modification text to enable more sophisticated video search in large databases. Existing works rely predominantly on the visual query paired with modification text to distinguish relevant videos; however, this strategy represents the target video with a visual embedding alone and therefore struggles to fully preserve the rich, query-specific context of the retrieved targets. We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information and learns discriminative vision-only, text-only, and vision-text embeddings for better alignment, enabling accurate retrieval of matched target videos. The proposed framework can be flexibly employed for both composed video retrieval (CoVR) and composed image retrieval (CoIR). Experiments on three datasets show that our approach achieves state-of-the-art performance on both CoVR and zero-shot CoIR, with gains of up to around 7% in recall@K=1. Our code, models, and detailed language descriptions for the WebVid-CoVR dataset are available at \url{this https URL}
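To make the retrieval setup concrete, here is a minimal, hypothetical sketch of ranking candidate targets when each target is represented by several embeddings (vision-only, text-only, vision-text), as the abstract describes. All names (`retrieve`, `targets`, the averaging of similarities) are illustrative assumptions, not the paper's actual method; the paper's fusion and training objective differ.

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query_emb, targets):
    """Rank candidate targets by the average similarity of the fused
    query embedding to each target's vision, text, and vision-text
    embeddings (a hypothetical scoring rule for illustration only)."""
    scored = []
    for tid, embs in targets.items():
        score = sum(cosine(query_emb, e) for e in embs.values()) / len(embs)
        scored.append((score, tid))
    scored.sort(reverse=True)
    return [tid for _, tid in scored]

# Toy example: 2-D embeddings, two candidate target videos.
query = [1.0, 0.1]
targets = {
    "video_a": {"vision": [1.0, 0.0], "text": [0.9, 0.1], "joint": [1.0, 0.2]},
    "video_b": {"vision": [0.0, 1.0], "text": [0.1, 0.9], "joint": [0.2, 1.0]},
}
ranking = retrieve(query, targets)
```

With embeddings this well-separated, `video_a` ranks first; a real system would use high-dimensional learned embeddings and recall@K over a large gallery.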
URL
https://arxiv.org/abs/2403.16997