Abstract
Temporal grounding of text descriptions in videos is a central problem in vision-language learning and video understanding. Existing methods often prioritize accuracy over scalability -- they have been optimized for grounding only a few text queries within short videos, and fail to scale up to long videos with hundreds of queries. In this paper, we study the effect of cross-modal fusion on the scalability of video grounding models. Our analysis establishes late fusion as a more cost-effective fusion scheme for long-form videos with many text queries. Moreover, it leads us to a novel, video-centric sampling scheme for efficient training. Based on these findings, we present SnAG, a simple baseline for scalable and accurate video grounding. Without bells and whistles, SnAG is 43% more accurate and 1.5x faster than CONE, a state-of-the-art method for long-form video grounding on the challenging MAD dataset, while achieving highly competitive results on short videos.
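The abstract's central cost argument is that late fusion encodes the video once and lets every text query reuse those features, paying only for a lightweight per-query fusion step, whereas early fusion must re-run the joint backbone for each query. The minimal PyTorch sketch below illustrates this structure only; it is not the authors' SnAG implementation, and the module names (`LateFusionGrounder`), layer choices, and tensor shapes are illustrative assumptions.

```python
# A minimal sketch (not the authors' code) of why late fusion scales better
# than early fusion when grounding many text queries in one long video.
# All module names, layer sizes, and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class LateFusionGrounder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Unimodal encoders: the expensive video encoder runs once per video,
        # independent of how many text queries are grounded in it.
        self.video_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Late fusion: a lightweight cross-attention applied after encoding.
        self.fusion = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.head = nn.Linear(dim, 2)  # e.g., per-clip start/end logits

    def forward(self, video_feats, query_feats_list):
        # video_feats: (1, T, dim) clip features, encoded ONCE.
        v = self.video_encoder(video_feats)
        outputs = []
        for q in query_feats_list:  # one (1, L, dim) tensor per text query
            t = self.text_encoder(q)
            fused, _ = self.fusion(v, t, t)   # query-conditioned video features
            outputs.append(self.head(fused))  # (1, T, 2) grounding logits
        return outputs

# Usage: 100 queries share one pass of the video encoder; an early-fusion
# model would instead re-run its joint video+text backbone 100 times.
model = LateFusionGrounder()
video = torch.randn(1, 2048, 256)  # a long video represented as 2048 clips
queries = [torch.randn(1, 12, 256) for _ in range(100)]
logits = model(video, queries)
```

The same structure hints at the video-centric sampling the abstract mentions: because the expensive video pass is shared, sampling a video and training on many of its queries together amortizes that cost, unlike sampling independent (video, query) pairs.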
URL
https://arxiv.org/abs/2404.02257