Paper Reading AI Learner

Predicting Visual Features from Text for Image and Video Caption Retrieval

2018-01-29 12:32:19
Jianfeng Dong, Xirong Li, Cees G. M. Snoek

Abstract

This paper strives to find amidst a set of sentences the one best describing the content of a given image or video. Different from existing works, which rely on a joint subspace for their image and video caption retrieval, we propose to do so in a visual space exclusively. Apart from this conceptual novelty, we contribute Word2VisualVec, a deep neural network architecture that learns to predict a visual feature representation from textual input. Example captions are encoded into a textual embedding based on multi-scale sentence vectorization and further transferred into a deep visual feature of choice via a simple multi-layer perceptron. We further generalize Word2VisualVec for video caption retrieval, by predicting from text both 3-D convolutional neural network features as well as a visual-audio representation. Experiments on Flickr8k, Flickr30k, the Microsoft Video Description dataset and the very recent NIST TRECVID challenge for video caption retrieval detail Word2VisualVec's properties, its benefit over textual embeddings, the potential for multimodal query composition and its state-of-the-art results.
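
The core idea is compact enough to sketch in code. Below is a minimal, hypothetical PyTorch rendering of the Word2VisualVec recipe as the abstract describes it: a sentence vector is pushed through a multi-layer perceptron that regresses a precomputed deep visual feature, and retrieval then ranks candidate captions by cosine similarity to the image feature in the visual space. The bag-of-words encoding stands in for the paper's multi-scale sentence vectorization, and the layer sizes, feature dimension, and MSE loss are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Word2VisualVec(nn.Module):
    """Sketch: predict a visual feature vector from a sentence.

    A plain bag-of-words input stands in for the paper's multi-scale
    sentence vectorization; layer sizes are illustrative assumptions.
    """

    def __init__(self, vocab_size: int, visual_dim: int = 2048, hidden: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vocab_size, hidden),
            nn.ReLU(),
            nn.Linear(hidden, visual_dim),
        )

    def forward(self, bow: torch.Tensor) -> torch.Tensor:
        # Map the sentence vector into the visual feature space.
        return self.mlp(bow)

def rank_captions(model, image_feat, caption_bows):
    """Rank candidate captions for one image by cosine similarity
    between the image feature and each predicted visual vector."""
    with torch.no_grad():
        predicted = model(caption_bows)  # (n_captions, visual_dim)
        sims = F.cosine_similarity(predicted, image_feat.unsqueeze(0), dim=1)
    return sims.argsort(descending=True)

if __name__ == "__main__":
    vocab_size, visual_dim = 5000, 2048
    model = Word2VisualVec(vocab_size, visual_dim)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    # One toy training step on random tensors standing in for real
    # (caption, visual-feature) pairs; an MSE regression loss is an
    # assumed stand-in for the paper's training objective.
    bows = torch.rand(32, vocab_size)
    visual_feats = torch.rand(32, visual_dim)
    optimizer.zero_grad()
    loss = F.mse_loss(model(bows), visual_feats)
    loss.backward()
    optimizer.step()

    # Caption retrieval: pick the best of 10 candidate captions.
    best = rank_captions(model, torch.rand(visual_dim), torch.rand(10, vocab_size))[0]
    print("best caption index:", best.item())
```

Note that because the similarity is computed in the visual feature space rather than a learned joint subspace, the image side needs no training at all; any off-the-shelf visual feature (e.g., a CNN pooling layer, or 3-D CNN and audio features for video) can be plugged in as the regression target.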

URL

https://arxiv.org/abs/1709.01362

PDF

https://arxiv.org/pdf/1709.01362.pdf

