Abstract
We propose the Neuro-Symbolic Concept Learner (NS-CL), a model that learns visual concepts, words, and semantic parsing of sentences without explicit supervision on any of them; instead, our model learns simply by looking at images and reading paired questions and answers. Our model builds an object-based scene representation and translates sentences into executable, symbolic programs. To bridge the learning of the two modules, we use a neuro-symbolic reasoning module that executes these programs on the latent scene representation. Analogous to human concept learning, the perception module learns visual concepts based on the language description of the object being referred to. Meanwhile, the learned visual concepts facilitate learning new words and parsing new sentences. We use curriculum learning to guide the search over the large compositional space of images and language. Extensive experiments demonstrate the accuracy and efficiency of our model in learning visual concepts, word representations, and semantic parsing of sentences. Further, our method generalizes easily to new object attributes, compositions, language concepts, scenes and questions, and even new program domains. It also empowers applications including visual question answering and bidirectional image-text retrieval.
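To make the "executable, symbolic programs" concrete, here is a minimal toy sketch of executing a symbolic program (e.g., for the question "How many red objects are there?") on an object-based scene representation. All function names and data shapes are illustrative assumptions for exposition, not the NS-CL implementation, which operates on learned latent representations rather than discrete attribute sets.

```python
# Toy sketch (assumed names, not the authors' code): a symbolic program is a
# sequence of operations executed over an object-based scene representation.

def filter_objects(objects, concept):
    """Keep objects whose attribute set contains the given concept."""
    return [obj for obj in objects if concept in obj["attributes"]]

def execute(program, scene):
    """Run a small symbolic program (a list of ops) on the scene objects."""
    result = scene
    for op, *args in program:
        if op == "filter":
            result = filter_objects(result, args[0])
        elif op == "count":
            result = len(result)
    return result

# A scene with three objects, each described by a set of attribute concepts.
scene = [
    {"attributes": {"red", "cube"}},
    {"attributes": {"blue", "sphere"}},
    {"attributes": {"red", "sphere"}},
]

# Program for "How many red objects are there?"
program = [("filter", "red"), ("count",)]
print(execute(program, scene))  # → 2
```

In NS-CL, both the concept classifiers (here, the hard set-membership test) and the mapping from question to program are learned jointly from question-answer pairs, with the executor bridging the two.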
URL
https://arxiv.org/abs/1904.12584