Paper Reading AI Learner

Event-aware Video Corpus Moment Retrieval

2024-02-21 06:55:20
Danyang Hou, Liang Pang, Huawei Shen, Xueqi Cheng

Abstract

Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task focused on identifying a specific moment within a vast corpus of untrimmed videos using a natural language query. Existing methods for VCMR typically rely on frame-aware video retrieval, calculating similarities between the query and video frames to rank videos based on maximum frame similarity. However, this approach overlooks the semantic structure embedded in the information across frames, namely the event, a crucial element for human comprehension of videos. Motivated by this, we propose EventFormer, a model that explicitly utilizes events within videos as fundamental units for video retrieval. The model extracts event representations through event reasoning and hierarchical event encoding. The event reasoning module groups consecutive and visually similar frame representations into events, while hierarchical event encoding encodes information at both the frame and event levels. We also introduce anchor multi-head self-attention to encourage the Transformer to capture the relevance of adjacent content in the video. EventFormer is trained with two-branch contrastive learning and dual optimization for the two sub-tasks of VCMR. Extensive experiments on the TVR, ANetCaps, and DiDeMo benchmarks show the effectiveness and efficiency of EventFormer in VCMR, achieving new state-of-the-art results. Additionally, the effectiveness of EventFormer is also validated on the partially relevant video retrieval task.
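The abstract describes the event reasoning module as grouping consecutive, visually similar frame representations into events. A minimal sketch of that idea, assuming cosine similarity with a fixed threshold and mean pooling over member frames (both illustrative choices; the paper's actual module may differ):

```python
import numpy as np

def group_frames_into_events(frame_embs, sim_threshold=0.8):
    """Greedily merge consecutive, visually similar frame embeddings
    into events; each event is represented by the mean of its frames.
    Illustrative sketch only: the threshold and mean pooling are
    assumptions, not the paper's exact event-reasoning design."""
    # Normalize rows so dot products are cosine similarities.
    normed = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    events, current = [], [0]
    for i in range(1, len(frame_embs)):
        if float(normed[i] @ normed[i - 1]) >= sim_threshold:
            current.append(i)       # adjacent and similar: same event
        else:
            events.append(current)  # dissimilar: event boundary
            current = [i]
    events.append(current)
    # Event-level representations via mean pooling over member frames.
    event_reps = np.stack([frame_embs[idx].mean(axis=0) for idx in events])
    return events, event_reps
```

For example, four frames forming two visually distinct runs would yield two events, each pooled from its two member frames.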

URL

https://arxiv.org/abs/2402.13566

PDF

https://arxiv.org/pdf/2402.13566.pdf
