Abstract
In this work, we are interested in automated methods for knowledge graph creation (KGC) from input text. Progress on large language models (LLMs) has prompted a series of recent works applying them to KGC, e.g., via zero/few-shot prompting. Despite successes on small domain-specific datasets, these models face difficulties scaling up to text common in many real-world applications. A principal issue is that in prior methods, the KG schema has to be included in the LLM prompt to generate valid triplets; larger and more complex schemas easily exceed the LLMs' context window length. To address this problem, we propose a three-phase framework named Extract-Define-Canonicalize (EDC): open information extraction followed by schema definition and post-hoc canonicalization. EDC is flexible in that it can be applied both to settings where a pre-defined target schema is available and to settings where it is not; in the latter case, it constructs a schema automatically and applies self-canonicalization. To further improve performance, we introduce a trained component that retrieves schema elements relevant to the input text; this improves the LLMs' extraction performance in a retrieval-augmented-generation-like manner. We demonstrate on three KGC benchmarks that EDC is able to extract high-quality triplets without any parameter tuning and with significantly larger schemas compared to prior works.
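To make the three phases concrete, here is a minimal, hypothetical sketch of the EDC pipeline shape. The function names and the rule-based stand-ins are illustrative assumptions only: in the actual framework each phase is performed by prompting an LLM, and canonicalization uses embedding-based retrieval plus verification rather than a hard-coded alias table.

```python
# Hypothetical sketch of the EDC three-phase pipeline (assumed structure;
# real phases are LLM-driven, stubbed here with toy rules for illustration).

def extract(text):
    """Phase 1: open information extraction -> free-form (subj, rel, obj) triplets."""
    triplets = []
    # Toy stand-in for an LLM extraction prompt: handle one relation pattern.
    if " was born in " in text:
        subj, rest = text.split(" was born in ")
        triplets.append((subj, "was born in", rest.rstrip(".")))
    return triplets

def define(triplets):
    """Phase 2: schema definition -> a natural-language gloss for each
    extracted relation, used downstream to match it against schema elements."""
    return {rel: f"Relation '{rel}' links a subject entity to an object entity."
            for _, rel, _ in triplets}

def canonicalize(triplets, target_schema):
    """Phase 3: map free-form relations onto the target schema; triplets
    with no valid canonical relation are dropped."""
    alias = {"was born in": "birthPlace"}  # toy stand-in for retrieval + LLM verification
    out = []
    for s, r, o in triplets:
        canon = alias.get(r)
        if canon in target_schema:
            out.append((s, canon, o))
    return out

# Usage with a pre-defined target schema (the "schema available" setting):
schema = {"birthPlace", "occupation"}
raw = extract("Alan Turing was born in London.")
definitions = define(raw)
kg = canonicalize(raw, schema)
print(kg)  # [('Alan Turing', 'birthPlace', 'London')]
```

In the schema-free setting, phase 3 would instead accumulate the relation definitions produced in phase 2 into a growing schema and canonicalize new relations against it (self-canonicalization).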
URL
https://arxiv.org/abs/2404.03868