Paper Reading AI Learner

Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement

2024-02-21 07:16:06
Danyang Hou, Liang Pang, Huawei Shen, Xueqi Cheng

Abstract

Video corpus moment retrieval (VCMR) is a new video retrieval task that aims to retrieve a relevant moment from a large corpus of untrimmed videos using a natural language query. The relevance between a video and the query is partial, mainly in two aspects: (1) Scope: an untrimmed video contains many information-rich frames, not all of which are relevant to the query; strong correlation is typically observed only within the relevant moment, which makes capturing the key content essential. (2) Modality: the query's relevance to different modalities varies; action descriptions align more with visual elements, while character conversations relate more to textual information. Recognizing and addressing these modality-specific nuances is crucial for effective retrieval in VCMR. However, existing methods often treat all video contents equally, leading to sub-optimal moment retrieval. We argue that effectively capturing the partial relevance between the query and the video is essential for the VCMR task. To this end, we propose a Partial Relevance Enhanced Model (PREM) to improve VCMR. VCMR involves two sub-tasks: video retrieval and moment localization. To align with their distinct objectives, we apply specialized partial relevance enhancement strategies to each. For video retrieval, we introduce a multi-modal collaborative video retriever that generates distinct query representations for different modalities via modality-specific pooling, ensuring a more effective match. For moment localization, we propose a focus-then-fuse moment localizer that uses modality-specific gates to capture essential content and then fuses the multi-modal information to localize the moment. Experimental results on the TVR and DiDeMo datasets show that the proposed model outperforms the baselines, achieving new state-of-the-art results on VCMR.
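The two mechanisms named in the abstract can be illustrated with a minimal NumPy sketch: modality-specific pooling builds a separate query vector per modality by attention-pooling the query tokens with a per-modality scoring vector, and a modality-specific gate mixes visual and textual features conditioned on the query. All names and shapes here are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def modality_pool(query_tokens, w_modality):
    """Attention-pool query tokens (L, d) into one d-dim query
    representation, using a modality-specific scoring vector (d,)."""
    scores = softmax(query_tokens @ w_modality)  # (L,) attention weights
    return scores @ query_tokens                 # (d,) pooled query vector

def gated_fuse(visual_feat, text_feat, q_vec, w_gate):
    """Fuse visual and textual features (d,) with a query-conditioned
    sigmoid gate; w_gate (d,) is a hypothetical gate parameter."""
    g = 1.0 / (1.0 + np.exp(-(w_gate @ q_vec)))  # scalar gate in (0, 1)
    return g * visual_feat + (1.0 - g) * text_feat

# Example: pool a 3-token query into visual- and text-oriented vectors,
# then fuse two modality features for localization scoring.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((3, 8))
q_visual = modality_pool(tokens, rng.standard_normal(8))
q_text = modality_pool(tokens, rng.standard_normal(8))
fused = gated_fuse(rng.standard_normal(8), rng.standard_normal(8),
                   q_visual, rng.standard_normal(8))
```

With zero scoring vectors the attention is uniform (mean pooling), and with a zero gate parameter the fusion is an even 50/50 mix, which makes the behavior easy to sanity-check.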

URL

https://arxiv.org/abs/2402.13576

PDF

https://arxiv.org/pdf/2402.13576.pdf

