Paper Reading AI Learner

Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement

2024-02-21 07:16:06
Danyang Hou, Liang Pang, Huawei Shen, Xueqi Cheng

Abstract

Video corpus moment retrieval (VCMR) is a new video retrieval task that aims to retrieve a relevant moment from a large corpus of untrimmed videos using a natural language query. The relevance between a video and the query is partial, mainly in two aspects: (1) Scope: an untrimmed video contains many information-rich frames, not all of which are relevant to the query; strong correlation is typically observed only within the relevant moment, which makes capturing the key content essential. (2) Modality: the query's relevance to different modalities varies; action descriptions align more with visual elements, while character conversations relate more to textual information. Recognizing and addressing these modality-specific nuances is crucial for effective retrieval in VCMR. However, existing methods often treat all video contents equally, leading to sub-optimal moment retrieval. We argue that effectively capturing the partial relevance between the query and the video is essential for the VCMR task. To this end, we propose a Partial Relevance Enhanced Model (PREM) to improve VCMR. VCMR involves two sub-tasks: video retrieval and moment localization. To align with their distinct objectives, we apply specialized partial relevance enhancement strategies to each. For video retrieval, we introduce a multi-modal collaborative video retriever that generates distinct query representations for different modalities via modality-specific pooling, ensuring a more effective match. For moment localization, we propose a focus-then-fuse moment localizer that uses modality-specific gates to capture essential content and then fuses the multi-modal information to localize the moment. Experimental results on the TVR and DiDeMo datasets show that the proposed model outperforms the baselines, achieving new state-of-the-art results on VCMR.
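The two mechanisms named in the abstract can be illustrated with a minimal NumPy sketch: modality-specific pooling builds a separate query vector per modality by attention-pooling the query tokens with a per-modality scoring vector, and a modality-specific gate mixes visual and textual features conditioned on the query. All names and shapes here are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def modality_pool(query_tokens, w_modality):
    """Attention-pool query tokens (L, d) into one d-dim query
    representation, using a modality-specific scoring vector (d,)."""
    scores = softmax(query_tokens @ w_modality)  # (L,) attention weights
    return scores @ query_tokens                 # (d,) pooled query vector

def gated_fuse(visual_feat, text_feat, q_vec, w_gate):
    """Fuse visual and textual features (d,) with a query-conditioned
    sigmoid gate; w_gate (d,) is a hypothetical gate parameter."""
    g = 1.0 / (1.0 + np.exp(-(w_gate @ q_vec)))  # scalar gate in (0, 1)
    return g * visual_feat + (1.0 - g) * text_feat

# Example: pool a 3-token query into visual- and text-oriented vectors,
# then fuse two modality features for localization scoring.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((3, 8))
q_visual = modality_pool(tokens, rng.standard_normal(8))
q_text = modality_pool(tokens, rng.standard_normal(8))
fused = gated_fuse(rng.standard_normal(8), rng.standard_normal(8),
                   q_visual, rng.standard_normal(8))
```

With zero scoring vectors the attention is uniform (mean pooling), and with a zero gate parameter the fusion is an even 50/50 mix, which makes the behavior easy to sanity-check.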

URL

https://arxiv.org/abs/2402.13576

PDF

https://arxiv.org/pdf/2402.13576.pdf

