Paper Reading AI Learner

No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection

2023-07-20 04:12:10
Qi Zhang, Sipeng Zheng, Qin Jin

Abstract

Temporal video grounding (TVG) aims to retrieve the time interval of a language query from an untrimmed video. A significant challenge in TVG is the low "Semantic Noise Ratio (SNR)": the lower the SNR, the worse the grounding performance. Prior works have addressed this challenge with sophisticated techniques. In this paper, we propose a no-frills TVG model that consists of two core modules, namely multi-scale neighboring attention and zoom-in boundary detection. The multi-scale neighboring attention restricts each video token to aggregate visual contexts only from its neighbors, enabling the extraction of the most distinguishing information across multi-scale feature hierarchies despite the high ratio of noise. The zoom-in boundary detection then focuses on local discrimination among the selected top candidates for fine-grained grounding adjustment. With an end-to-end training strategy, our model achieves competitive performance on different TVG benchmarks, while its lightweight architecture also brings faster inference and fewer model parameters.
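
To make the abstract's first module more concrete, below is a minimal PyTorch sketch of "neighboring" (local-window) attention at several temporal scales: each video token attends only to tokens within a fixed window, and the computation is repeated for several window sizes to build multi-scale feature hierarchies. The class name, window sizes, and the simple averaging used to fuse scales are illustrative assumptions based on the abstract, not the authors' implementation; the zoom-in boundary detection module is not reproduced here.

```python
# Sketch of multi-scale neighboring attention as described in the abstract.
# Assumptions (not from the paper): window sizes (4, 8, 16), fusion by averaging.
import torch
import torch.nn as nn


class MultiScaleNeighboringAttention(nn.Module):
    def __init__(self, dim: int, window_sizes=(4, 8, 16)):
        super().__init__()
        self.window_sizes = window_sizes
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_video_tokens, dim)
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn_logits = torch.einsum("btd,bsd->bts", q, k) * self.scale

        # Temporal distance between every (query, key) pair of tokens.
        idx = torch.arange(T, device=x.device)
        dist = (idx[:, None] - idx[None, :]).abs()  # (T, T)

        outputs = []
        for w in self.window_sizes:
            # Mask out keys farther than w tokens away, so each token only
            # aggregates visual context from its temporal neighborhood.
            masked = attn_logits.masked_fill(dist[None] > w, float("-inf"))
            attn = masked.softmax(dim=-1)
            outputs.append(torch.einsum("bts,bsd->btd", attn, v))

        # Fuse the scales; plain averaging is an arbitrary choice for this sketch.
        out = torch.stack(outputs, dim=0).mean(dim=0)
        return self.proj(out)


if __name__ == "__main__":
    tokens = torch.randn(2, 64, 256)  # 2 clips, 64 video tokens, 256-dim features
    layer = MultiScaleNeighboringAttention(dim=256)
    print(layer(tokens).shape)  # torch.Size([2, 64, 256])
```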

URL

https://arxiv.org/abs/2307.10567

PDF

https://arxiv.org/pdf/2307.10567.pdf

