SnAG: Scalable and Accurate Video Grounding

2024-04-02 19:25:04
Fangzhou Mu, Sicheng Mo, Yin Li

Abstract

Temporal grounding of text descriptions in videos is a central problem in vision-language learning and video understanding. Existing methods often prioritize accuracy over scalability -- they have been optimized for grounding only a few text queries within short videos, and fail to scale up to long videos with hundreds of queries. In this paper, we study the effect of cross-modal fusion on the scalability of video grounding models. Our analysis establishes late fusion as a more cost-effective fusion scheme for long-form videos with many text queries. Moreover, it leads us to a novel, video-centric sampling scheme for efficient training. Based on these findings, we present SnAG, a simple baseline for scalable and accurate video grounding. Without bells and whistles, SnAG is 43% more accurate and 1.5x faster than CONE, a state of the art for long-form video grounding on the challenging MAD dataset, while achieving highly competitive results on short videos.
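
The scalability argument for late fusion can be illustrated with a toy cost model. This is a minimal sketch under stated assumptions, not the authors' implementation. It assumes that early fusion re-runs the video backbone once per text query, while late fusion encodes the video once and applies only a lightweight fusion step per query. The function names and the unit costs (`video_cost`, `fusion_cost`) are hypothetical and chosen only for illustration.

```python
def early_fusion_cost(num_queries: int, video_cost: float, fusion_cost: float) -> float:
    """Assumed early-fusion cost: the video backbone is re-run for every query."""
    return num_queries * (video_cost + fusion_cost)


def late_fusion_cost(num_queries: int, video_cost: float, fusion_cost: float) -> float:
    """Assumed late-fusion cost: the video is encoded once and shared by all queries."""
    return video_cost + num_queries * fusion_cost


if __name__ == "__main__":
    # Hypothetical unit costs: encoding a long video is far more expensive than one fusion step.
    video_cost, fusion_cost = 10.0, 1.0
    for num_queries in (1, 10, 100):
        early = early_fusion_cost(num_queries, video_cost, fusion_cost)
        late = late_fusion_cost(num_queries, video_cost, fusion_cost)
        print(f"queries={num_queries:4d}  early={early:8.1f}  late={late:8.1f}")
```

Under these assumptions the two schemes cost the same for a single query, and the gap widens as the number of queries per video grows. The same sharing is what a video-centric sampling scheme would exploit during training, by grouping several queries from one video into a batch.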

URL

https://arxiv.org/abs/2404.02257

PDF

https://arxiv.org/pdf/2404.02257.pdf
