Abstract
Temporal sentence localization in videos (TSLV) aims to retrieve the segment most relevant to a given sentence query from an untrimmed video. However, almost all existing TSLV approaches suffer from the same limitations: (1) they focus on either frame-level or object-level visual representation learning and the corresponding correlation reasoning, but fail to integrate the two; (2) they neglect to leverage rich semantic contexts to further benefit query reasoning. To address these issues, in this paper we propose a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN), which enables both visual- and semantic-aware query reasoning from the object level to the frame level. Specifically, we present a new graph memory mechanism to perform visual-semantic query reasoning: for visual reasoning, we design a visual graph memory to leverage the visual information of the video; for semantic reasoning, we introduce a semantic graph memory to explicitly leverage the semantic knowledge contained in the classes and attributes of video objects, and perform correlation reasoning in the semantic space. Experiments on three datasets demonstrate that our HVSARN achieves new state-of-the-art performance.
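The abstract's graph memory idea (relating video nodes to each other, then reasoning against the query) can be illustrated with a minimal sketch. This is not the authors' implementation; the similarity-based soft adjacency, the single propagation round, and all tensor shapes are assumptions made purely for illustration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def graph_memory_reason(nodes, query):
    """One round of query-conditioned graph-memory reasoning (illustrative).

    nodes: (N, d) node features, e.g. object- or frame-level visual
           or semantic embeddings; query: (d,) sentence embedding.
    """
    d = nodes.shape[1]
    # Build a soft adjacency matrix from pairwise node similarity.
    adj = softmax(nodes @ nodes.T / np.sqrt(d), axis=-1)
    # Propagate messages along the graph and update the node memory.
    nodes = nodes + adj @ nodes
    # Attend over the updated nodes with the sentence query.
    attn = softmax(nodes @ query / np.sqrt(d))
    return attn @ nodes  # (d,) query-aware video representation

rng = np.random.default_rng(0)
ctx = graph_memory_reason(rng.normal(size=(6, 8)), rng.normal(size=8))
print(ctx.shape)  # (8,)
```

In the paper's hierarchy, one such memory would operate on visual node features and another on semantic node features (object classes and attributes), with reasoning proceeding from the object level to the frame level.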
Abstract (translated)
Temporal sentence localization in videos (TSLV) aims to retrieve the segment most relevant to a given sentence query from an untrimmed video. However, almost all existing TSLV methods face the same limitations: (1) they focus only on frame-level or object-level visual representation learning and the corresponding correlation reasoning, but fail to integrate the two; (2) they neglect to exploit rich semantic contexts to further improve query reasoning. To address these issues, this paper proposes a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN), which performs both visual- and semantic-aware query reasoning from the object level to the frame level. Specifically, we propose a new graph memory mechanism to carry out visual-semantic query reasoning: for visual reasoning, we design a visual graph memory to exploit the visual information of the video; for semantic reasoning, we introduce a semantic graph memory to explicitly exploit the semantic knowledge contained in the classes and attributes of video objects, and perform correlation reasoning in the semantic space. Experiments on three datasets show that our HVSARN achieves new state-of-the-art performance.
URL
https://arxiv.org/abs/2303.01046