Paper Reading AI Learner

Grounding-MD: Grounded Video-language Pre-training for Open-World Moment Detection

2025-04-20 09:54:25
Weijun Zhuang, Qizhang Li, Xin Li, Ming Liu, Xiaopeng Hong, Feng Gao, Fan Yang, Wangmeng Zuo

Abstract

Temporal Action Detection and Moment Retrieval are two pivotal tasks in video understanding, both aimed at precisely localizing the temporal segments that correspond to specific actions or events. Recent work introduced Moment Detection to unify the two tasks, yet existing approaches remain confined to closed-set scenarios, limiting their applicability in open-world contexts. To bridge this gap, we present Grounding-MD, a grounded video-language pre-training framework tailored for open-world moment detection. The framework accepts an arbitrary number of open-ended natural language queries through a structured prompt mechanism, enabling flexible and scalable moment detection. Grounding-MD leverages a Cross-Modality Fusion Encoder and a Text-Guided Fusion Decoder to achieve comprehensive video-text alignment and effective cross-task collaboration. Through large-scale pre-training on temporal action detection and moment retrieval datasets, Grounding-MD learns strong semantic representations and handles diverse, complex query conditions. Comprehensive evaluations on four benchmarks (ActivityNet, THUMOS14, ActivityNet-Captions, and Charades-STA) show that Grounding-MD sets new state-of-the-art performance in both zero-shot and supervised open-world moment detection. All source code and trained models will be released.
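
The abstract names two components: a Cross-Modality Fusion Encoder that jointly attends over video and query tokens, and a Text-Guided Fusion Decoder in which the queries localize their temporal spans. The paper's code is not yet released, so the following is only a minimal sketch of how such a pipeline could be wired up with standard transformer blocks; every module name, dimension, and prediction head here is a hypothetical stand-in, not the authors' implementation.

```python
# Minimal sketch of the encoder/decoder pipeline described in the abstract.
# Assumptions: video clip features and query embeddings are precomputed by
# some backbone and text encoder; spans are predicted as (center, width).
import torch
import torch.nn as nn


class GroundingMDSketch(nn.Module):
    def __init__(self, dim=256, num_heads=8, enc_layers=4, dec_layers=4):
        super().__init__()
        # Cross-Modality Fusion Encoder: self-attention over the
        # concatenated sequence of video tokens and query tokens.
        enc_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.fusion_encoder = nn.TransformerEncoder(enc_layer, enc_layers)
        # Text-Guided Fusion Decoder: query embeddings cross-attend to the
        # fused video tokens to localize their temporal segments.
        dec_layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.fusion_decoder = nn.TransformerDecoder(dec_layer, dec_layers)
        self.span_head = nn.Linear(dim, 2)   # normalized (center, width) per query
        self.score_head = nn.Linear(dim, 1)  # query-segment matching confidence

    def forward(self, video_feats, query_feats):
        # video_feats: (B, T, dim) clip features from a video backbone
        # query_feats: (B, Q, dim) embeddings of Q open-ended queries packed
        # into one "structured prompt"; Q may vary from batch to batch.
        fused = self.fusion_encoder(torch.cat([video_feats, query_feats], dim=1))
        video_fused = fused[:, : video_feats.size(1)]
        decoded = self.fusion_decoder(query_feats, video_fused)
        spans = self.span_head(decoded).sigmoid()      # (B, Q, 2)
        scores = self.score_head(decoded).squeeze(-1)  # (B, Q)
        return spans, scores


# Toy usage: 2 videos of 128 clips each, queried with 5 open-ended prompts.
model = GroundingMDSketch()
spans, scores = model(torch.randn(2, 128, 256), torch.randn(2, 5, 256))
print(spans.shape, scores.shape)  # torch.Size([2, 5, 2]) torch.Size([2, 5])
```

Treating each natural-language query as a decoder query is what lets the number of queries vary freely at inference time, which matches the abstract's claim of handling "an arbitrary number of open-ended natural language queries".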


URL

https://arxiv.org/abs/2504.14553

PDF

https://arxiv.org/pdf/2504.14553.pdf

