Abstract
Visual Question Answering (VQA) has attracted much attention since it offers insight into the relationship between multi-modal analysis of images and natural language. Most current algorithms are incapable of answering open-domain questions that require reasoning beyond the image contents. To address this issue, we propose a novel framework that endows the model with the capability to answer more complex questions by leveraging massive external knowledge through dynamic memory networks. Specifically, a question and its corresponding image trigger a process that retrieves relevant information from external knowledge bases, which is embedded into a continuous vector space in a way that preserves the entity-relation structure. Afterwards, we employ dynamic memory networks to attend to the large body of facts in the knowledge graph and the image, and then reason over these facts to generate the corresponding answer. Extensive experiments demonstrate that our model not only achieves state-of-the-art performance on the visual question answering task, but can also answer open-domain questions effectively by leveraging external knowledge.
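The pipeline described above (embed retrieved facts so that entity-relation structure is preserved, then run several attention passes of a dynamic memory network over those facts conditioned on the question and image) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the fact encoding, the similarity-based scoring, and the simple averaging update for the memory are all assumptions made here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16
NUM_FACTS = 5

# Hypothetical inputs: embeddings of retrieved knowledge-graph facts,
# the question, and the image. In the paper these come from a structure-
# preserving KG embedding and from learned encoders; here they are random.
fact_vecs = rng.normal(size=(NUM_FACTS, DIM))  # candidate fact embeddings
q_vec = rng.normal(size=DIM)                   # question embedding
v_vec = rng.normal(size=DIM)                   # image embedding

def episodic_pass(memory, facts, question, image):
    """One memory-network pass: score each fact against the question,
    image, and current memory, attend, and update the memory."""
    # Crude relevance score: summed element-wise interactions (an
    # assumption; the real model uses a learned gating function).
    scores = (facts * question).sum(1) + (facts * image).sum(1) \
           + (facts * memory).sum(1)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax attention over facts
    episode = weights @ facts                  # attended fact summary
    return 0.5 * memory + 0.5 * episode        # simple memory update

# A few reasoning hops over the facts, starting from the question.
memory = q_vec.copy()
for _ in range(3):
    memory = episodic_pass(memory, fact_vecs, q_vec, v_vec)

print(memory.shape)  # final memory vector, fed to an answer decoder
```

The repeated passes are what let the model chain evidence: facts that become relevant only after earlier facts are absorbed into the memory receive higher attention on later hops.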
URL
https://arxiv.org/abs/1712.00733