Abstract
Generating natural questions from an image is a semantic task that requires using the visual and language modalities to learn multimodal representations. An image can have multiple visual and language contexts that are relevant for generating questions, namely places, captions, and tags. In this paper, we propose the use of exemplars for obtaining the relevant context. We obtain this context by using a Multimodal Differential Network to produce natural and engaging questions. The generated questions show a remarkable similarity to natural questions, as validated by a human study. Further, we observe that the proposed approach substantially improves over state-of-the-art benchmarks on the quantitative metrics (BLEU, METEOR, ROUGE, and CIDEr).
URL
https://arxiv.org/abs/1808.03986