Abstract
Temporal video grounding (TVG) aims to retrieve the time interval in an untrimmed video that corresponds to a language query. A significant challenge in TVG is the low "semantic noise ratio (SNR)": the lower the SNR, the worse the grounding performance. Prior works have addressed this challenge with sophisticated techniques. In this paper, we propose a no-frills TVG model consisting of two core modules: multi-scale neighboring attention and zoom-in boundary detection. The multi-scale neighboring attention restricts each video token to aggregate visual context only from its neighbors, enabling the extraction of the most distinguishing information across multi-scale feature hierarchies despite heavy noise. The zoom-in boundary detection then performs local discrimination on the selected top candidates for fine-grained grounding adjustment. With an end-to-end training strategy, our model achieves competitive performance on diverse TVG benchmarks and, thanks to its lightweight architecture, offers faster inference and fewer model parameters.
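The abstract does not give implementation details, but the idea of neighboring attention (each token attending only within a local window, applied at several scales) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual architecture: the function names, the banded attention mask, and the choice of window sizes are all assumptions.

```python
import numpy as np

def neighboring_attention(x, window):
    """Hypothetical sketch of neighboring attention: each of the T video
    tokens attends only to tokens within `window` positions of itself."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)               # pairwise similarity scores
    idx = np.arange(T)
    # Banded mask: keep only positions within the local neighborhood.
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    scores = np.where(mask, scores, -np.inf)
    # Row-wise softmax over the unmasked neighbors.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def multi_scale(x, windows=(1, 2, 4)):
    """Stack neighboring attention with growing windows to build a
    multi-scale feature hierarchy (window sizes are illustrative)."""
    feats, h = [], x
    for w in windows:
        h = neighboring_attention(h, w)
        feats.append(h)
    return feats
```

With `window=0` the mask keeps only the diagonal, so each token attends to itself and the output equals the input; larger windows widen the receptive field at each level of the hierarchy.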
URL
https://arxiv.org/abs/2307.10567