Abstract
Vision-language alignment learning for video-text retrieval has attracted considerable attention in recent years. Most existing methods either transfer the knowledge of an image-text pretrained model to the video-text retrieval task without fully exploring the multi-modal information of videos, or simply fuse multi-modal features in a brute-force manner without explicit guidance. In this paper, we integrate multi-modal information in an explicit manner by tagging, and use the tags as anchors for better video-text alignment. Various pretrained experts are utilized to extract the information of multiple modalities, including object, person, motion, audio, etc. To take full advantage of this information, we propose the TABLE (TAgging Before aLignmEnt) network, which consists of a visual encoder, a tag encoder, a text encoder, and a tag-guiding cross-modal encoder for jointly encoding multi-frame visual features and multi-modal tag information. Furthermore, to strengthen the interaction between video and text, we build a joint cross-modal encoder with the triplet input of [vision, tag, text] and perform two additional supervised tasks, Video-Text Matching (VTM) and Masked Language Modeling (MLM). Extensive experimental results demonstrate that the TABLE model achieves state-of-the-art (SOTA) performance on various video-text retrieval benchmarks, including MSR-VTT, MSVD, LSMDC and DiDeMo.
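Since the abstract describes the architecture only at a high level, a short sketch may help picture the two core pieces: tag-guided cross-modal fusion and the video-text alignment objective. Below is a minimal PyTorch sketch; `TagGuidedFusion`, `contrastive_alignment_loss`, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the TABLE idea, assuming hypothetical module
# names and dimensions; an illustration, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TagGuidedFusion(nn.Module):
    """Tag-guiding cross-modal encoder: multi-modal tag embeddings
    attend over multi-frame visual features, so the tags act as
    anchors that steer the fused video representation."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tag_emb, frame_emb):
        # tag_emb: (B, N_tags, D) from the tag encoder;
        # frame_emb: (B, N_frames, D) from the visual encoder.
        fused, _ = self.cross_attn(query=tag_emb, key=frame_emb, value=frame_emb)
        return self.norm(fused + tag_emb)

def contrastive_alignment_loss(video_emb, text_emb, temperature=0.05):
    # Symmetric InfoNCE between pooled video and text embeddings,
    # the standard video-text alignment objective.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature
    labels = torch.arange(v.size(0), device=v.device)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

# Usage with random stand-in features (batch of 4, dim 512):
fusion = TagGuidedFusion()
frames = torch.randn(4, 12, 512)          # 12 sampled frames
tags = torch.randn(4, 8, 512)             # 8 tags (object, motion, audio, ...)
text = torch.randn(4, 512)                # pooled caption embedding
video = fusion(tags, frames).mean(dim=1)  # pool tag-guided video tokens
loss = contrastive_alignment_loss(video, text)
```

The joint cross-modal encoder over the [vision, tag, text] triplet would sit on top of a similar attention stack, adding a binary VTM classification head and an MLM head over masked caption tokens; both are omitted here for brevity.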
URL
https://arxiv.org/abs/2301.12644