Abstract
Understanding large, structured documents such as scholarly articles, requests for proposals, or business reports is a complex and difficult task. It involves discovering a document's overall purpose and subject(s), understanding the function and meaning of its sections and subsections, and extracting low-level entities and facts about them. In this research, we present a deep learning-based document ontology that captures general-purpose semantic structure and domain-specific semantic concepts from a large number of academic articles and business documents. The ontology describes the different functional parts of a document, which can be used to enhance semantic indexing so that both humans and machines can better understand the document. We evaluate our models through extensive experiments on datasets of scholarly articles from arXiv and Request for Proposal documents.
URL
https://arxiv.org/abs/1807.09842