Paper Reading AI Learner

Extract, Define, Canonicalize: An LLM-based Framework for Knowledge Graph Construction

2024-04-05 02:53:51
Bowen Zhang, Harold Soh

Abstract

In this work, we are interested in automated methods for knowledge graph creation (KGC) from input text. Progress on large language models (LLMs) has prompted a series of recent works applying them to KGC, e.g., via zero/few-shot prompting. Despite successes on small domain-specific datasets, these models face difficulties scaling up to text common in many real-world applications. A principal issue is that in prior methods, the KG schema has to be included in the LLM prompt to generate valid triplets; larger and more complex schema easily exceed the LLMs' context window length. To address this problem, we propose a three-phase framework named Extract-Define-Canonicalize (EDC): open information extraction followed by schema definition and post-hoc canonicalization. EDC is flexible in that it can be applied to settings where a pre-defined target schema is available and when it is not; in the latter case, it constructs a schema automatically and applies self-canonicalization. To further improve performance, we introduce a trained component that retrieves schema elements relevant to the input text; this improves the LLMs' extraction performance in a retrieval-augmented generation-like manner. We demonstrate on three KGC benchmarks that EDC is able to extract high-quality triplets without any parameter tuning and with significantly larger schemas compared to prior works.
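The three phases described in the abstract can be sketched as a toy pipeline. This is an illustrative sketch only, not the authors' implementation: in EDC each phase is driven by LLM prompting, whereas here a trivial subject–verb–object splitter stands in for open information extraction, a template stands in for LLM-written relation definitions, and bag-of-words overlap between definitions stands in for the learned similarity used during canonicalization. All names (`Triplet`, `extract`, `define`, `canonicalize`) and the example schema are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triplet:
    subject: str
    relation: str
    obj: str

# --- Phase 1: Extract (open information extraction) ---------------------
# In EDC this is zero/few-shot LLM prompting with NO schema in context;
# here a toy "first word / middle / last word" splitter stands in for it.
def extract(text: str) -> list[Triplet]:
    triplets = []
    for sentence in text.strip(".").split(". "):
        words = sentence.split()
        triplets.append(Triplet(words[0], " ".join(words[1:-1]), words[-1]))
    return triplets

# --- Phase 2: Define -----------------------------------------------------
# EDC prompts the LLM to write a natural-language definition of each open
# relation it extracted; a fixed template fakes that step here.
def define(triplets: list[Triplet]) -> dict[str, str]:
    return {t.relation: f"holds when the subject {t.relation} the object"
            for t in triplets}

# --- Phase 3: Canonicalize ----------------------------------------------
# Map each open relation onto the most similar element of a target schema
# (when no schema is given, EDC instead self-canonicalizes by merging
# relations with near-identical definitions). Word overlap between
# definitions is a crude stand-in for embedding similarity.
def canonicalize(triplets: list[Triplet], definitions: dict[str, str],
                 schema: dict[str, str]) -> list[Triplet]:
    def overlap(a: str, b: str) -> int:
        return len(set(a.lower().split()) & set(b.lower().split()))
    return [
        Triplet(t.subject,
                max(schema, key=lambda r: overlap(definitions[t.relation],
                                                  schema[r])),
                t.obj)
        for t in triplets
    ]

# Hypothetical target schema: canonical relation -> description.
schema = {
    "founderOf": "holds when the subject founded or established the object",
    "employerOf": "holds when the subject employs the object",
}
text = "Alice founded Acme. Acme employs Bob."
open_triplets = extract(text)
canonical = canonicalize(open_triplets, define(open_triplets), schema)
```

With this toy input, the open relations "founded" and "employs" are mapped onto the canonical `founderOf` and `employerOf` schema elements, which mirrors how post-hoc canonicalization keeps the schema out of the extraction prompt entirely.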

URL

https://arxiv.org/abs/2404.03868

PDF

https://arxiv.org/pdf/2404.03868.pdf

