Paper Reading AI Learner

Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal Sentence Localization in Videos

2023-03-02 08:00:22
Daizong Liu, Pan Zhou

Abstract

Temporal sentence localization in videos (TSLV) aims to retrieve the segment of most interest from an untrimmed video according to a given sentence query. However, almost all existing TSLV approaches suffer from the same limitations: (1) they focus only on either frame-level or object-level visual representation learning and the corresponding correlation reasoning, but fail to integrate the two; (2) they neglect to leverage rich semantic contexts to further benefit query reasoning. To address these issues, in this paper we propose a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN), which enables both visual- and semantic-aware query reasoning from the object level to the frame level. Specifically, we present a new graph memory mechanism to perform visual-semantic query reasoning: for visual reasoning, we design a visual graph memory to leverage the visual information of the video; for semantic reasoning, a semantic graph memory is also introduced to explicitly leverage the semantic knowledge contained in the classes and attributes of video objects, and to perform correlation reasoning in the semantic space. Experiments on three datasets demonstrate that our HVSARN achieves new state-of-the-art performance.
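The abstract only names the graph memory mechanism without giving its update rules. As a rough illustration of what one reasoning step over such a memory could look like, here is a minimal NumPy sketch; the function name, the similarity-based adjacency, the attention read, and the exponential-moving-average write are all assumptions for illustration, not details taken from the paper:

```python
import numpy as np

def graph_memory_step(node_feats, memory, temperature=1.0):
    """One hypothetical graph-memory reasoning step (illustrative only).

    node_feats: (N, D) node features, e.g. frame- or object-level embeddings.
    memory:     (M, D) external memory slots.
    Returns updated node features and updated memory.
    """
    # Soft adjacency from pairwise node similarity (row-wise softmax).
    sim = node_feats @ node_feats.T / temperature           # (N, N)
    adj = np.exp(sim - sim.max(axis=1, keepdims=True))
    adj /= adj.sum(axis=1, keepdims=True)

    # Graph reasoning: aggregate information from correlated nodes.
    node_feats = adj @ node_feats                           # (N, D)

    # Memory read: each node attends over the memory slots.
    attn = node_feats @ memory.T                            # (N, M)
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    read = attn @ memory                                    # (N, D)

    # Memory write: slowly blend in the nodes attending to each slot.
    memory = 0.9 * memory + 0.1 * (attn.T @ node_feats)     # (M, D)

    return node_feats + read, memory
```

In the paper's framing, the same step could be instantiated twice: once over visual node features (visual graph memory) and once over embeddings of object classes and attributes (semantic graph memory).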

Abstract (translated)

Temporal sentence localization in videos (TSLV) aims to retrieve the segment of greatest interest from an untrimmed video according to a given sentence query. However, almost all existing TSLV methods face the same limitations: (1) they focus only on frame-level or object-level visual representation learning and the corresponding correlation reasoning, but fail to integrate the two; (2) they overlook the use of rich semantic context to further improve query reasoning. To address these issues, this paper proposes a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN), which performs visual- and semantic-aware query reasoning from the object level to the frame level. Specifically, we present a new graph memory mechanism for visual-semantic query reasoning: for visual reasoning, we design a visual graph memory that exploits the visual information of the video; for semantic reasoning, we introduce a semantic graph memory that explicitly exploits the semantic knowledge contained in the classes and attributes of video objects and performs correlation reasoning in the semantic space. Experiments on three datasets show that our HVSARN achieves new state-of-the-art performance.

URL

https://arxiv.org/abs/2303.01046

PDF

https://arxiv.org/pdf/2303.01046.pdf
