Abstract
In this paper, we consider the problem of open information extraction (OIE): extracting entity- and relation-level intermediate structures from open-domain sentences. We focus on four types of valuable intermediate structures (Relation, Attribute, Description, and Concept) and propose a unified knowledge expression form, SAOKE, to express them. We publicly release a data set that contains more than forty thousand sentences and the corresponding facts in the SAOKE format, labeled by crowdsourcing. To our knowledge, this is the largest publicly available human-labeled data set for open information extraction tasks. Using this labeled SAOKE data set, we train an end-to-end neural model based on the sequence-to-sequence paradigm, called Logician, to transform sentences into facts. Unlike existing algorithms, which generally extract each fact in isolation without considering other possible facts, Logician performs a global optimization over all facts involved in a sentence, in which facts not only compete with each other to attract the attention of words but also cooperate to share words. An experimental study on various types of open-domain relation extraction tasks shows that Logician consistently outperforms other state-of-the-art algorithms. The experiments verify the reasonableness of the SAOKE format, the value of the SAOKE data set, the effectiveness of the proposed Logician model, and the feasibility of applying the end-to-end learning paradigm to supervised data sets for the challenging tasks of open information extraction.
URL
https://arxiv.org/abs/1904.12535