Paper Reading AI Learner

Grounding-MD: Grounded Video-language Pre-training for Open-World Moment Detection

2025-04-20 09:54:25
Weijun Zhuang, Qizhang Li, Xin Li, Ming Liu, Xiaopeng Hong, Feng Gao, Fan Yang, Wangmeng Zuo

Abstract

Temporal Action Detection and Moment Retrieval are two pivotal tasks in video understanding, both aimed at precisely localizing the temporal segments that correspond to specific actions or events. Recent work introduced Moment Detection to unify the two tasks, yet existing approaches remain confined to closed-set scenarios, limiting their applicability in open-world contexts. To bridge this gap, we present Grounding-MD, a grounded video-language pre-training framework tailored for open-world moment detection. The framework accepts an arbitrary number of open-ended natural language queries through a structured prompt mechanism, enabling flexible and scalable moment detection. Grounding-MD leverages a Cross-Modality Fusion Encoder and a Text-Guided Fusion Decoder to achieve comprehensive video-text alignment and effective cross-task collaboration. Through large-scale pre-training on temporal action detection and moment retrieval datasets, Grounding-MD learns strong semantic representations and handles diverse, complex query conditions. Comprehensive evaluations on four benchmarks (ActivityNet, THUMOS14, ActivityNet-Captions, and Charades-STA) show that Grounding-MD sets new state-of-the-art performance in both zero-shot and supervised open-world moment detection. All source code and trained models will be released.
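
The abstract names two components: a Cross-Modality Fusion Encoder that jointly attends over video and query tokens, and a Text-Guided Fusion Decoder in which the queries localize their temporal spans. The paper's code is not yet released, so the following is only a minimal sketch of how such a pipeline could be wired up with standard transformer blocks; every module name, dimension, and prediction head here is a hypothetical stand-in, not the authors' implementation.

```python
# Minimal sketch of the encoder/decoder pipeline described in the abstract.
# Assumptions: video clip features and query embeddings are precomputed by
# some backbone and text encoder; spans are predicted as (center, width).
import torch
import torch.nn as nn


class GroundingMDSketch(nn.Module):
    def __init__(self, dim=256, num_heads=8, enc_layers=4, dec_layers=4):
        super().__init__()
        # Cross-Modality Fusion Encoder: self-attention over the
        # concatenated sequence of video tokens and query tokens.
        enc_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.fusion_encoder = nn.TransformerEncoder(enc_layer, enc_layers)
        # Text-Guided Fusion Decoder: query embeddings cross-attend to the
        # fused video tokens to localize their temporal segments.
        dec_layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.fusion_decoder = nn.TransformerDecoder(dec_layer, dec_layers)
        self.span_head = nn.Linear(dim, 2)   # normalized (center, width) per query
        self.score_head = nn.Linear(dim, 1)  # query-segment matching confidence

    def forward(self, video_feats, query_feats):
        # video_feats: (B, T, dim) clip features from a video backbone
        # query_feats: (B, Q, dim) embeddings of Q open-ended queries packed
        # into one "structured prompt"; Q may vary from batch to batch.
        fused = self.fusion_encoder(torch.cat([video_feats, query_feats], dim=1))
        video_fused = fused[:, : video_feats.size(1)]
        decoded = self.fusion_decoder(query_feats, video_fused)
        spans = self.span_head(decoded).sigmoid()      # (B, Q, 2)
        scores = self.score_head(decoded).squeeze(-1)  # (B, Q)
        return spans, scores


# Toy usage: 2 videos of 128 clips each, queried with 5 open-ended prompts.
model = GroundingMDSketch()
spans, scores = model(torch.randn(2, 128, 256), torch.randn(2, 5, 256))
print(spans.shape, scores.shape)  # torch.Size([2, 5, 2]) torch.Size([2, 5])
```

Treating each natural-language query as a decoder query is what lets the number of queries vary freely at inference time, which matches the abstract's claim of handling "an arbitrary number of open-ended natural language queries".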


URL

https://arxiv.org/abs/2504.14553

PDF

https://arxiv.org/pdf/2504.14553.pdf

