
Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval

2023-01-30 03:53:19
Yizhen Chen, Jie Wang, Lijian Lin, Zhongang Qi, Jin Ma, Ying Shan

Abstract

Vision-language alignment learning for video-text retrieval has attracted considerable attention in recent years. Most existing methods either transfer the knowledge of image-text pretraining models to the video-text retrieval task without fully exploring the multi-modal information in videos, or simply fuse multi-modal features in a brute-force manner without explicit guidance. In this paper, we integrate multi-modal information explicitly by tagging, and use the tags as anchors for better video-text alignment. Various pretrained experts are utilized to extract the information of multiple modalities, including object, person, motion, audio, etc. To take full advantage of this information, we propose the TABLE (TAgging Before aLignmEnt) network, which consists of a visual encoder, a tag encoder, a text encoder, and a tag-guiding cross-modal encoder for jointly encoding multi-frame visual features and multi-modal tag information. Furthermore, to strengthen the interaction between video and text, we build a joint cross-modal encoder with the triplet input of [vision, tag, text] and perform two additional supervised tasks, Video Text Matching (VTM) and Masked Language Modeling (MLM). Extensive experimental results demonstrate that the TABLE model achieves State-Of-The-Art (SOTA) performance on various video-text retrieval benchmarks, including MSR-VTT, MSVD, LSMDC, and DiDeMo.
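
The paper's code is not reproduced here; as a rough illustration of the "tags as anchors" idea described above, the sketch below fuses multi-frame visual features with tag embeddings via cross-attention and aligns the pooled video representation with text under a symmetric contrastive loss. Everything in it — the module names `TagGuidedCrossModalEncoder` and `TableSketch`, the stand-in embedding-based tag/text encoders, and all dimensions — is an assumption for illustration, not the authors' implementation (which uses dedicated pretrained encoders and experts for tag extraction).

```python
# Illustrative sketch only -- not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TagGuidedCrossModalEncoder(nn.Module):
    """Fuses multi-frame visual features with multi-modal tag embeddings.

    Tag embeddings act as queries ("anchors") that attend over frame
    features, so the pooled video representation is steered by the tags.
    """
    def __init__(self, dim: int, num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(num_layers))

    def forward(self, frame_feats, tag_embeds):
        # frame_feats: (B, num_frames, dim), tag_embeds: (B, num_tags, dim)
        x = tag_embeds
        for attn, norm in zip(self.layers, self.norms):
            fused, _ = attn(query=x, key=frame_feats, value=frame_feats)
            x = norm(x + fused)          # residual connection + layer norm
        return x.mean(dim=1)             # (B, dim) tag-guided video embedding

class TableSketch(nn.Module):
    def __init__(self, dim=512, vocab_size=30522, num_tags=1000):
        super().__init__()
        self.tag_embed = nn.Embedding(num_tags, dim)        # stand-in tag encoder
        self.text_embed = nn.EmbeddingBag(vocab_size, dim)  # stand-in text encoder
        self.cross_encoder = TagGuidedCrossModalEncoder(dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # learnable temperature

    def forward(self, frame_feats, tag_ids, text_ids):
        video = self.cross_encoder(frame_feats, self.tag_embed(tag_ids))
        text = self.text_embed(text_ids)
        video = F.normalize(video, dim=-1)
        text = F.normalize(text, dim=-1)
        return self.logit_scale.exp() * video @ text.t()  # (B, B) similarities

def contrastive_loss(sim):
    """Symmetric InfoNCE over the in-batch video-text similarity matrix."""
    labels = torch.arange(sim.size(0), device=sim.device)
    return 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels))

# Example usage with random inputs (shapes only; real inputs would come from
# pretrained experts and tokenizers, which this sketch does not model):
model = TableSketch()
sim = model(torch.randn(4, 12, 512),           # 4 videos, 12 frames each
            torch.randint(0, 1000, (4, 5)),    # 5 tag ids per video
            torch.randint(0, 30522, (4, 16)))  # 16 token ids per caption
loss = contrastive_loss(sim)
```

In the full model, a joint cross-modal encoder over the [vision, tag, text] triplet would additionally feed a binary Video Text Matching head and a Masked Language Modeling head; those auxiliary objectives are omitted above for brevity.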

URL

https://arxiv.org/abs/2301.12644

PDF

https://arxiv.org/pdf/2301.12644.pdf

