Paper Reading AI Learner

SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval

2024-04-22 10:23:59
Xuzheng Yu, Chen Jiang, Xingning Dong, Tian Gan, Ming Yang, Qingpei Guo

Abstract

The user base of short-video apps has grown at an unprecedented rate in recent years, creating significant demand for video content analysis. In particular, text-video retrieval, which aims to find the top-matching videos in a vast corpus given a text description, is an essential function whose primary challenge is bridging the modality gap. Nevertheless, most existing approaches treat texts merely as sequences of discrete tokens and neglect their syntax structure. Moreover, the abundant spatial and temporal cues in videos are often underutilized for lack of interaction with the text. To address these issues, we argue that it is beneficial to use texts as guidance to focus on the relevant temporal frames and spatial regions within videos. In this paper, we propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net) that exploits the inherent semantic and syntax hierarchy of texts to bridge the modality gap from two perspectives. First, to enable a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of a description, to guide the aggregation of visual representations. Second, to further enhance multi-modal interaction and alignment, we also use the syntax hierarchy to guide the similarity calculation. We evaluated our method on four public text-video retrieval datasets: MSR-VTT, MSVD, DiDeMo, and ActivityNet. The experimental results and ablation studies confirm the advantages of the proposed method.
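The abstract describes two uses of the text syntax hierarchy: guiding the aggregation of visual features and guiding the similarity computation. As a hedged illustration of the second idea only, the toy sketch below scores a video (a set of frame embeddings) against a caption decomposed into hypothetical syntax levels (sentence, phrases, words). The three-level decomposition, the max-over-frames matching, and the fixed level weights are all assumptions made for illustration; they are not the actual SHE-Net formulation, which the abstract does not specify.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def hierarchy_similarity(text_levels, video_frames, level_weights):
    """Toy syntax-hierarchy-guided text-video score (illustrative only).

    text_levels   : list of levels; each level is a list of text-node embeddings
                    (e.g. [sentence], [phrase_1, phrase_2], [word_1, ..., word_n])
    video_frames  : list of frame embeddings
    level_weights : one weight per level, combining the per-level scores
    """
    level_scores = []
    for nodes in text_levels:
        # Match every text node to its best-matching frame, then
        # average the node scores within the level.
        node_scores = [max(cosine(n, f) for f in video_frames) for n in nodes]
        level_scores.append(sum(node_scores) / len(node_scores))
    # Combine the levels of the hierarchy with fixed weights.
    return sum(w * s for w, s in zip(level_weights, level_scores))


# Random embeddings stand in for real text/video encoders.
rng = np.random.default_rng(0)
d = 8
sentence = [rng.normal(size=d)]                    # whole-caption node
phrases = [rng.normal(size=d) for _ in range(2)]   # hypothetical phrase nodes
words = [rng.normal(size=d) for _ in range(5)]     # word-level nodes
frames = [rng.normal(size=d) for _ in range(4)]    # video frame embeddings

score = hierarchy_similarity([sentence, phrases, words], frames, [0.5, 0.3, 0.2])
```

In a real retrieval system the node embeddings would come from a text encoder over a dependency or constituency parse and the frame embeddings from a video encoder; this sketch only shows how a hierarchy of text nodes, rather than a single pooled text vector, can shape the final similarity score.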

URL

https://arxiv.org/abs/2404.14066

PDF

https://arxiv.org/pdf/2404.14066.pdf

