Abstract
We introduce a variety of models for grounding sentence representations: each is trained on a supervised image-captioning corpus to predict the image features associated with a given caption. We train a grounded sentence encoder that achieves good performance on COCO caption and image retrieval, and subsequently show that this encoder transfers successfully to a range of NLP tasks, improving over text-only models. Lastly, we analyze the contribution of grounding and show that word embeddings learned by this system outperform non-grounded ones.
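The core training signal described above — a sentence encoder regressing onto the image features paired with its caption — can be sketched with a deliberately minimal bag-of-words encoder and a squared-error objective. Everything below (the toy vocabulary, the 2-dimensional "image features", the mean-of-embeddings encoder) is an illustrative assumption, not the paper's actual architecture or data:

```python
import numpy as np

# Toy stand-ins for real captions and CNN image features (e.g. from COCO).
vocab = {"a": 0, "dog": 1, "cat": 2, "runs": 3, "sleeps": 4}
captions = [["a", "dog", "runs"], ["a", "cat", "sleeps"]]
image_feats = np.array([[1.0, 0.0], [0.0, 1.0]])  # pretend 2-d CNN features


def encode(caption, W):
    """Bag-of-words sentence encoding: mean of the caption's word embeddings."""
    idxs = [vocab[w] for w in caption]
    return W[idxs].mean(axis=0)


rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(len(vocab), 2))  # word embeddings = parameters

# Grounding objective: make each caption's encoding predict its image features.
lr = 0.5
for _ in range(500):
    for caption, target in zip(captions, image_feats):
        idxs = [vocab[w] for w in caption]
        pred = W[idxs].mean(axis=0)
        # d(||pred - target||^2)/dW[i] = 2 * (pred - target) / len(idxs)
        grad = 2.0 * (pred - target) / len(idxs)
        W[idxs] -= lr * grad

# After training, each caption's encoding sits close to its image features,
# and the learned rows of W are (toy) grounded word embeddings.
for caption, target in zip(captions, image_feats):
    print(caption, np.round(encode(caption, W), 3))
```

The same W can then be reused as input features for other tasks, which is the transfer setting the abstract evaluates.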
URL
https://arxiv.org/abs/1707.06320