Abstract
This paper strives to find, amidst a set of sentences, the one that best describes the content of a given image or video. Unlike existing works, which rely on a joint subspace for image and video caption retrieval, we propose to perform the retrieval exclusively in a visual space. Apart from this conceptual novelty, we contribute Word2VisualVec, a deep neural network architecture that learns to predict a visual feature representation from textual input. Example captions are encoded into a textual embedding based on multi-scale sentence vectorization and then transferred into a deep visual feature of choice via a simple multi-layer perceptron. We further generalize Word2VisualVec to video caption retrieval by predicting from text both 3-D convolutional neural network features and a visual-audio representation. Experiments on Flickr8k, Flickr30k, the Microsoft Video Description dataset and the very recent NIST TrecVid challenge for video caption retrieval detail Word2VisualVec's properties, its benefit over textual embeddings, its potential for multimodal query composition and its state-of-the-art results.
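The pipeline described above (sentence vectorization, a multi-layer perceptron that maps text into a visual feature space, and caption ranking carried out in that space) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the vocabulary, the dimensions, the random untrained weights, the use of cosine similarity for ranking, and the reading of "multi-scale sentence vectorization" as a bag-of-words vector concatenated with an averaged word vector. The names `encode_sentence`, `predict_visual` and `rank_captions` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and dimensions (not taken from the paper).
VOCAB = ["a", "dog", "runs", "on", "grass", "cat", "sleeps", "sofa"]
WORD_DIM, HIDDEN_DIM, VISUAL_DIM = 16, 24, 32

# Random stand-ins for pretrained word vectors.
word_vectors = {w: rng.normal(size=WORD_DIM) for w in VOCAB}


def encode_sentence(sentence):
    """Assumed multi-scale encoding: bag-of-words counts + mean word vector."""
    tokens = [t for t in sentence.lower().split() if t in word_vectors]
    bow = np.array([tokens.count(w) for w in VOCAB], dtype=float)
    if tokens:
        mean_vec = np.mean([word_vectors[t] for t in tokens], axis=0)
    else:
        mean_vec = np.zeros(WORD_DIM)
    return np.concatenate([bow, mean_vec])


# Untrained MLP weights; in practice these would be learned so that the
# prediction for a caption approaches the visual feature of its image.
TEXT_DIM = len(VOCAB) + WORD_DIM
W1 = rng.normal(scale=0.1, size=(TEXT_DIM, HIDDEN_DIM))
b1 = np.zeros(HIDDEN_DIM)
W2 = rng.normal(scale=0.1, size=(HIDDEN_DIM, VISUAL_DIM))
b2 = np.zeros(VISUAL_DIM)


def predict_visual(text_vec):
    """Map a text embedding to a predicted visual feature (one hidden layer)."""
    hidden = np.maximum(0.0, text_vec @ W1 + b1)  # ReLU
    return hidden @ W2 + b2


def rank_captions(image_feature, captions):
    """Rank captions by cosine similarity, computed in the visual space."""
    preds = np.stack([predict_visual(encode_sentence(c)) for c in captions])
    preds = preds / (np.linalg.norm(preds, axis=1, keepdims=True) + 1e-12)
    image = image_feature / (np.linalg.norm(image_feature) + 1e-12)
    scores = preds @ image
    order = np.argsort(-scores)
    return [(captions[i], float(scores[i])) for i in order]


captions = ["a dog runs on grass", "a cat sleeps on a sofa"]
# Stand-in for a real CNN feature: reuse the prediction for the second caption,
# so that caption should rank first with a similarity close to 1.
image_feature = predict_visual(encode_sentence(captions[1]))
ranked = rank_captions(image_feature, captions)
for caption, score in ranked:
    print(f"{score:.3f}  {caption}")
```

The point of the sketch is that no joint subspace is learned: the image side is left as a fixed visual feature, and only the text side is projected, so matching happens directly in the visual feature space.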
URL
https://arxiv.org/abs/1709.01362