Abstract
Extracting structured information from unstructured text is critical for many downstream NLP applications and is traditionally achieved by closed information extraction (cIE). However, existing approaches for cIE suffer from two limitations: (i) they are often pipelines, which makes them prone to error propagation, and/or (ii) they are restricted to the sentence level, which prevents them from capturing long-range dependencies and results in expensive inference. We address these limitations by proposing REXEL, a highly efficient and accurate model for the joint task of document-level cIE (DocIE). REXEL performs mention detection, entity typing, entity disambiguation, coreference resolution and document-level relation classification in a single forward pass to yield facts fully linked to a reference knowledge graph. In a similar setting, it is on average 11 times faster than competitive existing approaches, and it performs competitively both when optimised for any individual subtask and when optimised for a variety of combinations of joint tasks, surpassing the baselines by an average of more than 6 F1 points. This combination of speed and accuracy makes REXEL an accurate, cost-efficient system for extracting structured information at web scale. We also release an extension of the DocRED dataset to enable benchmarking of future work on DocIE.
URL
https://arxiv.org/abs/2404.12788