Abstract
We hypothesize that end-to-end neural image captioning systems seem to work well because they exploit and learn `distributional similarity' in a multimodal feature space: they map a test image to similar training images in this space and generate a caption from that same space. To validate this hypothesis, we focus on the `image' side of image captioning and vary the input image representation while keeping the RNN text-generation component of a CNN-RNN model constant. Our analysis indicates that image captioning models (i) are capable of separating structure from noisy input representations; (ii) suffer virtually no performance loss when a high-dimensional representation is compressed to a lower-dimensional space; (iii) cluster images with similar visual and linguistic information together. These findings indicate that our distributional similarity hypothesis holds. We conclude that, regardless of the image representation used, image captioning systems seem to match images and generate captions in a learned joint image-text semantic subspace.
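Finding (ii) — that compressing the image representation costs little performance — can be illustrated with a minimal sketch. The abstract does not specify the compression method; the PCA projection, feature dimensions, and random features below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

# Hypothetical sketch of point (ii): project high-dimensional image
# features (a stand-in for CNN activations) to a lower-dimensional
# space using a PCA basis learned from "training" features.
rng = np.random.default_rng(0)

train_feats = rng.standard_normal((200, 512))  # 200 images, 512-d features
test_feat = rng.standard_normal(512)           # one test image's features

# PCA via SVD of the mean-centred training matrix.
mean = train_feats.mean(axis=0)
_, _, vt = np.linalg.svd(train_feats - mean, full_matrices=False)
components = vt[:64]                           # keep the top 64 directions

compressed = (test_feat - mean) @ components.T # 512-d -> 64-d
print(compressed.shape)                        # (64,)
```

In such a setup, the compressed vector would replace the full feature vector as input to the caption generator, leaving the RNN component unchanged.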
URL
https://arxiv.org/abs/1809.04144