Abstract
In this work, we are interested in automated methods for knowledge graph creation (KGC) from input text. Progress on large language models (LLMs) has prompted a series of recent works applying them to KGC, e.g., via zero/few-shot prompting. Despite successes on small domain-specific datasets, these models face difficulties scaling up to text common in many real-world applications. A principal issue is that in prior methods, the KG schema has to be included in the LLM prompt to generate valid triplets; larger and more complex schemas easily exceed the LLMs' context window length. To address this problem, we propose a three-phase framework named Extract-Define-Canonicalize (EDC): open information extraction followed by schema definition and post-hoc canonicalization. EDC is flexible in that it can be applied both to settings where a pre-defined target schema is available and to settings where it is not; in the latter case, it constructs a schema automatically and applies self-canonicalization. To further improve performance, we introduce a trained component that retrieves schema elements relevant to the input text; this improves the LLMs' extraction performance in a retrieval-augmented-generation-like manner. We demonstrate on three KGC benchmarks that EDC is able to extract high-quality triplets without any parameter tuning and with significantly larger schemas compared to prior works.
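To make the three phases concrete, here is a minimal, hypothetical sketch of the EDC pipeline shape. The function names and the rule-based stand-ins are illustrative assumptions only: in the actual framework each phase is performed by prompting an LLM, and canonicalization uses embedding-based retrieval plus verification rather than a hard-coded alias table.

```python
# Hypothetical sketch of the EDC three-phase pipeline (assumed structure;
# real phases are LLM-driven, stubbed here with toy rules for illustration).

def extract(text):
    """Phase 1: open information extraction -> free-form (subj, rel, obj) triplets."""
    triplets = []
    # Toy stand-in for an LLM extraction prompt: handle one relation pattern.
    if " was born in " in text:
        subj, rest = text.split(" was born in ")
        triplets.append((subj, "was born in", rest.rstrip(".")))
    return triplets

def define(triplets):
    """Phase 2: schema definition -> a natural-language gloss for each
    extracted relation, used downstream to match it against schema elements."""
    return {rel: f"Relation '{rel}' links a subject entity to an object entity."
            for _, rel, _ in triplets}

def canonicalize(triplets, target_schema):
    """Phase 3: map free-form relations onto the target schema; triplets
    with no valid canonical relation are dropped."""
    alias = {"was born in": "birthPlace"}  # toy stand-in for retrieval + LLM verification
    out = []
    for s, r, o in triplets:
        canon = alias.get(r)
        if canon in target_schema:
            out.append((s, canon, o))
    return out

# Usage with a pre-defined target schema (the "schema available" setting):
schema = {"birthPlace", "occupation"}
raw = extract("Alan Turing was born in London.")
definitions = define(raw)
kg = canonicalize(raw, schema)
print(kg)  # [('Alan Turing', 'birthPlace', 'London')]
```

In the schema-free setting, phase 3 would instead accumulate the relation definitions produced in phase 2 into a growing schema and canonicalize new relations against it (self-canonicalization).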
URL
https://arxiv.org/abs/2404.03868