End-to-end Image Captioning Exploits Multimodal Distributional Similarity

2018-09-11 20:32:21
Pranava Madhyastha, Josiah Wang, Lucia Specia

Abstract

We hypothesize that end-to-end neural image captioning systems seem to work well because they exploit and learn 'distributional similarity' in a multimodal feature space, mapping a test image to similar training images in this space and generating a caption from that same space. To validate our hypothesis, we focus on the 'image' side of image captioning and vary the input image representation while keeping the RNN text generation component of a CNN-RNN model constant. Our analysis indicates that image captioning models (i) are capable of separating structure from noisy input representations; (ii) suffer virtually no significant performance loss when a high-dimensional representation is compressed to a lower-dimensional space; and (iii) cluster images with similar visual and linguistic information together. These findings indicate that our distributional similarity hypothesis holds. We conclude that, regardless of the image representation used, image captioning systems seem to match images and generate captions in a learned joint image-text semantic subspace.
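
The experimental design lends itself to a compact illustration. Below is a minimal sketch, assuming PyTorch, of a CNN-RNN captioner in which the LSTM decoder is held fixed while the input image representation (and its dimensionality) is swapped out; the layer sizes, names, and the single linear projection are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch (PyTorch): a fixed LSTM caption decoder whose only
# representation-dependent piece is a linear projection, so image
# features of any dimensionality can be plugged in. All sizes and
# names are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """LSTM caption generator conditioned on a precomputed image vector."""

    def __init__(self, image_dim, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Only this projection changes when the image representation
        # (e.g. a CNN fc-layer vs. a compressed or noisy variant) varies.
        self.project = nn.Linear(image_dim, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, captions):
        # Feed the projected image vector as the first "token",
        # followed by the embedded caption prefix.
        img = self.project(image_feats).unsqueeze(1)   # (B, 1, E)
        words = self.embed(captions)                   # (B, T, E)
        inputs = torch.cat([img, words], dim=1)        # (B, T+1, E)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                        # next-word logits

# Swapping representations: only image_dim changes; the decoder is constant.
feats_4096 = torch.randn(4, 4096)  # e.g. a high-dimensional CNN feature
feats_128 = torch.randn(4, 128)    # e.g. a low-dimensional compression
caps = torch.randint(0, 10000, (4, 12))
for feats in (feats_4096, feats_128):
    model = CaptionDecoder(image_dim=feats.shape[1])
    print(model(feats, caps).shape)  # torch.Size([4, 13, 10000])
```

Keeping the decoder constant in this way isolates the effect of the image representation, which is what lets the paper attribute its findings to distributional similarity in the joint space rather than to the text generator.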

URL

https://arxiv.org/abs/1809.04144

PDF

https://arxiv.org/pdf/1809.04144.pdf

