Abstract
Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task focused on identifying a specific moment within a vast corpus of untrimmed videos using a natural language query. Existing methods for VCMR typically rely on frame-aware video retrieval, calculating similarities between the query and video frames and ranking videos by maximum frame similarity. However, this approach overlooks the semantic structure embedded in the information between frames, namely the event, a crucial element for human comprehension of videos. Motivated by this, we propose EventFormer, a model that explicitly utilizes events within videos as fundamental units for video retrieval. The model extracts event representations through event reasoning and hierarchical event encoding. The event reasoning module groups consecutive and visually similar frame representations into events, while hierarchical event encoding encodes information at both the frame and event levels. We also introduce anchor multi-head self-attention to encourage the Transformer to capture the relevance of adjacent content in the video. EventFormer is trained with two-branch contrastive learning and dual optimization for the two sub-tasks of VCMR. Extensive experiments on the TVR, ANetCaps, and DiDeMo benchmarks show the effectiveness and efficiency of EventFormer on VCMR, achieving new state-of-the-art results. Additionally, the effectiveness of EventFormer is also validated on the partially relevant video retrieval task.
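The abstract describes an event reasoning module that groups consecutive, visually similar frame representations into events. The paper's exact procedure is not given here, so the following is only an illustrative sketch of one plausible greedy scheme: frames are appended to the current event while their cosine similarity to the event's running centroid stays above a threshold, and a new event opens otherwise. The function name and the threshold value are assumptions, not the authors' implementation.

```python
import numpy as np

def group_frames_into_events(frames: np.ndarray, threshold: float = 0.8):
    """Greedily group consecutive, visually similar frame embeddings into events.

    frames: (T, D) array of frame representations.
    Returns a list of (start, end) index pairs, one per event (end exclusive).
    This is a hypothetical sketch, not the procedure used in EventFormer.
    """
    # Normalize rows so dot products are cosine similarities.
    normed = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    events = []
    start = 0
    for t in range(1, len(normed)):
        # Compare the current frame with the centroid of the open event.
        centroid = normed[start:t].mean(axis=0)
        centroid /= np.linalg.norm(centroid)
        if float(normed[t] @ centroid) < threshold:
            events.append((start, t))  # close the event, open a new one
            start = t
    events.append((start, len(normed)))
    return events
```

An event-level representation could then be obtained by pooling (e.g. mean-pooling) the frame embeddings within each returned span, giving the coarser units that the hierarchical encoder would consume alongside the frames.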
URL
https://arxiv.org/abs/2402.13566