Unified Structured Knowledge Reasoning (USKR) aims to answer natural language questions (NLQs) by using structured sources such as tables, databases, and knowledge graphs in a unified way. Existing USKR methods rely either on task-specific strategies or on custom-defined representations, which struggle to leverage knowledge transfer between different SKR tasks or to align with the priors of LLMs, thereby limiting their performance. This paper proposes a novel USKR framework named \textsc{Pandora}, which takes advantage of \textsc{Python}'s \textsc{Pandas} API to construct a unified knowledge representation aligned with LLM pre-training. It employs an LLM to generate textual reasoning steps and executable Python code for each question. Demonstrations are drawn from a memory of training examples covering various SKR tasks, facilitating knowledge transfer. Extensive experiments on four benchmarks involving three SKR tasks demonstrate that \textsc{Pandora} outperforms existing unified frameworks and competes effectively with task-specific methods.
https://arxiv.org/abs/2504.12734
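To make the Pandas-as-representation idea concrete, below is a minimal Python sketch of the kind of reasoning-plus-code output such a pipeline might produce for a table question. The table, question, and column names are invented, and the memory-based demonstration retrieval is not shown.

```python
# A minimal sketch of the kind of pandas program an LLM might emit in a
# Pandora-style pipeline. Table, question, and column names are hypothetical.
import pandas as pd

# Structured knowledge loaded into a unified pandas representation.
df = pd.DataFrame({
    "city": ["Berlin", "Munich", "Hamburg"],
    "country": ["Germany", "Germany", "Germany"],
    "population": [3_755_000, 1_512_000, 1_945_000],
})

# NLQ: "Which German city has the largest population?"
# Step 1 (textual reasoning): filter rows for Germany, then take the max population.
german = df[df["country"] == "Germany"]
# Step 2 (executable code): select the city in the row with the maximum population.
answer = german.loc[german["population"].idxmax(), "city"]
print(answer)  # -> "Berlin"
```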
High-stakes domains like cyber operations need responsible and trustworthy AI methods. While large language models (LLMs) are becoming increasingly popular in these domains, they still suffer from hallucinations. This research paper provides learning outcomes from a case study with LinkQ, an open-source natural language interface that was developed to combat hallucinations by forcing an LLM to query a knowledge graph (KG) for ground-truth data during question-answering (QA). We conduct a quantitative evaluation of LinkQ using a well-known KGQA dataset, showing that the system outperforms GPT-4 but still struggles with certain question categories, suggesting that alternative query construction strategies will need to be investigated in future LLM querying systems. We discuss a qualitative study of LinkQ with two domain experts using a real-world cybersecurity KG, outlining these experts' feedback, suggestions, perceived limitations, and future opportunities for systems like LinkQ.
https://arxiv.org/abs/2504.12422
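As an illustration of grounding an answer in KG data rather than the model's parametric memory, the sketch below runs a hand-written SPARQL query against the public Wikidata endpoint. The query, entity, and endpoint choice are assumptions for the example and do not reproduce LinkQ's LLM-driven query construction loop.

```python
# A minimal sketch of the "query the KG for ground truth" step.
# Uses the public Wikidata SPARQL endpoint; requires the `requests` package.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"
# Q42 = Douglas Adams, P31 = "instance of"; the label service returns English labels.
query = """
SELECT ?typeLabel WHERE {
  wd:Q42 wdt:P31 ?type .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""
resp = requests.get(ENDPOINT,
                    params={"query": query, "format": "json"},
                    headers={"User-Agent": "kgqa-demo/0.1"})
for row in resp.json()["results"]["bindings"]:
    print(row["typeLabel"]["value"])  # -> "human"
```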
Retrieval-augmented generation (RAG) empowers large language models to access external and private corpora, enabling factually consistent responses in specific domains. By exploiting the inherent structure of the corpus, graph-based RAG methods further enrich this process by building a knowledge graph index and leveraging the structural nature of graphs. However, current graph-based RAG approaches seldom prioritize the design of graph structures. An inadequately designed graph not only impedes the seamless integration of diverse graph algorithms but also results in workflow inconsistencies and degraded performance. To further unleash the potential of graphs for RAG, we propose NodeRAG, a graph-centric framework introducing heterogeneous graph structures that enable the seamless and holistic integration of graph-based methodologies into the RAG workflow. By aligning closely with the capabilities of LLMs, this framework ensures a fully cohesive and efficient end-to-end process. Through extensive experiments, we demonstrate that NodeRAG exhibits performance advantages over previous methods, including GraphRAG and LightRAG, not only in indexing time, query time, and storage efficiency but also in delivering superior question-answering performance on multi-hop benchmarks and open-ended head-to-head evaluations with minimal retrieval tokens. Our GitHub repository is available at this https URL.
https://arxiv.org/abs/2504.11544
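As a generic illustration (not NodeRAG's actual node schema), the sketch below indexes a tiny corpus as a heterogeneous graph whose nodes carry explicit types, so retrieval can treat entities, text chunks, and summaries differently. The node types and the retrieval rule are hypothetical.

```python
# A generic heterogeneous-graph index with typed nodes and type-aware retrieval.
import networkx as nx

g = nx.Graph()
g.add_node("e:Marie_Curie", ntype="entity")
g.add_node("c:chunk_17", ntype="text_chunk",
           text="Marie Curie won Nobel Prizes in Physics and Chemistry.")
g.add_node("s:summary_3", ntype="summary",
           text="Biographical facts about Marie Curie.")
g.add_edge("e:Marie_Curie", "c:chunk_17", etype="mentioned_in")
g.add_edge("c:chunk_17", "s:summary_3", etype="summarized_by")

def retrieve(seed, want_types=("text_chunk", "summary")):
    """Type-aware expansion (up to 2 hops) from a seed entity node."""
    reachable = nx.single_source_shortest_path_length(g, seed, cutoff=2)
    return [n for n in reachable if g.nodes[n]["ntype"] in want_types]

print(retrieve("e:Marie_Curie"))  # -> ['c:chunk_17', 's:summary_3']
```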
This chapter investigates the concept of mutual understanding between humans and systems, positing that Neuro-symbolic Artificial Intelligence (NeSy AI) methods can significantly enhance this mutual understanding by combining explicit symbolic knowledge representations with data-driven learning models. We start by introducing three critical dimensions to characterize mutual understanding: sharing knowledge, exchanging knowledge, and governing knowledge. Sharing knowledge involves aligning the conceptual models of different agents to enable a shared understanding of the domain of interest. Exchanging knowledge relates to ensuring effective and accurate communication between agents. Governing knowledge concerns establishing rules and processes to regulate the interaction between agents. Then, we present several different use case scenarios that demonstrate the application of NeSy AI and Knowledge Graphs to aid meaningful exchanges between human, artificial, and robotic agents. These scenarios highlight both the potential and the challenges of combining top-down symbolic reasoning with bottom-up neural learning, guiding the discussion of the coverage provided by current solutions along the dimensions of sharing, exchanging, and governing knowledge. Concurrently, this analysis facilitates the identification of gaps and less developed aspects in mutual understanding to address in future research.
https://arxiv.org/abs/2504.11200
Recent advances in Large Language Models have demonstrated their capabilities across a variety of tasks. However, automatically extracting implicit knowledge from natural language remains a significant challenge, as machines lack active experience with the physical world. Given this scenario, semantic knowledge graphs can serve as conceptual spaces that guide the automated text generation reasoning process to achieve more efficient and explainable results. In this paper, we apply a logic-augmented generation (LAG) framework that leverages the explicit representation of a text through a semantic knowledge graph and applies it in combination with prompt heuristics to elicit implicit analogical connections. This method generates extended knowledge graph triples representing implicit meaning, enabling systems to reason on unlabeled multimodal data regardless of the domain. We validate our work through three metaphor detection and understanding tasks across four datasets, as they require deep analogical reasoning capabilities. The results show that this integrated approach surpasses current baselines, performs better than humans in understanding visual metaphors, and enables more explainable reasoning processes, though it still has inherent limitations in metaphor understanding, especially for domain-specific metaphors. Furthermore, we provide a thorough error analysis, discussing issues with metaphorical annotations and current evaluation methods.
https://arxiv.org/abs/2504.11190
Large language models (LLMs) perform well in medical QA, but their effectiveness in Japanese contexts is limited due to privacy constraints that prevent the use of commercial models like GPT-4 in clinical settings. As a result, recent efforts focus on instruction-tuning open-source LLMs, though the potential of combining them with retrieval-augmented generation (RAG) remains underexplored. To bridge this gap, we are the first to explore a knowledge graph-based (KG) RAG framework for Japanese medical QA with small-scale open-source LLMs. Experimental results show that KG-based RAG has only a limited impact on Japanese medical QA using small-scale open-source LLMs. Further case studies reveal that the effectiveness of the RAG is sensitive to the quality and relevance of the external retrieved content. These findings offer valuable insights into the challenges and potential of applying RAG in Japanese medical QA, while also serving as a reference for other low-resource languages.
https://arxiv.org/abs/2504.10982
Retrieval-Augmented Generation (RAG) is a promising technique for applying LLMs to proprietary domains. However, retrieved documents may contain sensitive knowledge, posing risks of privacy leakage in generative results. Thus, effectively erasing private information from retrieved documents is a key challenge for RAG. Unlike traditional text anonymization, RAG should consider: (1) the inherent multi-document reasoning may face de-anonymization attacks; (2) private knowledge varies by scenario, so users should be allowed to customize which information to erase; (3) preserving sufficient publicly available knowledge for generation tasks. This paper introduces the privacy erasure task for RAG and proposes Eraser4RAG, a private knowledge eraser which effectively removes user-defined private knowledge from documents while preserving sufficient public knowledge for generation. Specifically, we first construct a global knowledge graph to identify potential knowledge across documents, aiming to defend against de-anonymization attacks. Then we randomly split it into private and public sub-graphs, and fine-tune Flan-T5 to rewrite the retrieved documents excluding private triples. Finally, the PPO algorithm optimizes the rewriting model to minimize the retention of private triples and maximize the retention of public triples. Experiments on four QA datasets demonstrate that Eraser4RAG achieves superior erasure performance compared with GPT-4o.
https://arxiv.org/abs/2504.09910
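A toy sketch of the partitioning step described above: triples extracted from retrieved documents are split into a private sub-graph to be erased and a public sub-graph to be kept. The triples and the predicate-based notion of "private" are invented for illustration; the Flan-T5 rewriter and the PPO optimization are not shown.

```python
# Split extracted knowledge triples into private (erase) and public (keep) sub-graphs.
triples = [
    ("Alice", "works_at", "Acme Corp"),
    ("Alice", "diagnosed_with", "condition_X"),   # user-defined private fact
    ("Acme Corp", "headquartered_in", "Berlin"),
    ("Alice", "phone_number", "555-0199"),        # user-defined private fact
]

# A user-defined predicate list marks which knowledge counts as private.
private_predicates = {"diagnosed_with", "phone_number"}

private_subgraph = [t for t in triples if t[1] in private_predicates]
public_subgraph = [t for t in triples if t[1] not in private_predicates]

# A rewriting model would then be trained to reproduce the document while
# verbalizing only the public triples.
print("erase:", private_subgraph)
print("keep :", public_subgraph)
```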
Short technical support pages such as IBM Technotes are quite common in the technical support domain. These pages can be very useful as knowledge sources for technical support applications such as chatbots, search engines and question-answering (QA) systems. Information extracted from documents to drive technical support applications is often stored in the form of a Knowledge Graph (KG). Building KGs from a large corpus of documents poses a challenge of granularity because a large number of entities and actions are present on each page. The KG becomes virtually unusable if all entities and actions from these pages are stored in it. Therefore, only key entities and actions from each page are extracted and stored in the KG. This approach, however, leads to a loss of the knowledge represented by the entities and actions left out of the KG, as they are no longer available to graph search and reasoning functions. We propose a set of techniques to create a micro knowledge graph (micrograph) for each such web page. The micrograph stores all the entities and actions on a page and also takes advantage of the structure of the page to represent exactly in which part of the page these entities and actions appeared, and how they relate to each other. These micrographs can be used as additional knowledge sources by technical support applications. We define schemas for representing the semi-structured and plain-text knowledge present in technical support web pages. Solutions in the technical support domain include procedures made up of steps. We also propose a technique to extract procedures from these web pages and schemas to represent them in the micrographs. Finally, we discuss how technical support applications can take advantage of the micrographs.
https://arxiv.org/abs/2504.09877
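A hypothetical micrograph for a single support page, sketched as plain Python structures: all entities and actions are kept, anchored to the page section in which they appear, with an ordered step list for the procedure. The field names and page content are illustrative and do not reproduce the paper's schemas.

```python
# A toy micrograph: page-level entities/actions, their section anchors, and a procedure.
micrograph = {
    "page_id": "technote-0001",  # placeholder identifier, not a real Technote
    "sections": {
        "Problem":    {"entities": ["DB2 server", "error SQL1042C"]},
        "Resolution": {"entities": ["db2start command", "db2diag.log"]},
    },
    "procedure": [
        {"step": 1, "action": "stop",    "object": "DB2 server"},
        {"step": 2, "action": "inspect", "object": "db2diag.log"},
        {"step": 3, "action": "run",     "object": "db2start command"},
    ],
    # Edges record where entities appear on the page and how they relate.
    "edges": [
        ("error SQL1042C", "appears_in", "Problem"),
        ("db2start command", "appears_in", "Resolution"),
        ("db2start command", "resolves", "error SQL1042C"),
    ],
}

# A chatbot could answer "how do I fix SQL1042C?" by returning the ordered steps.
print([f'{s["action"]} {s["object"]}' for s in micrograph["procedure"]])
```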
This study addresses the challenge of ambiguity in knowledge graph question answering (KGQA). While recent KGQA systems have made significant progress, particularly with the integration of large language models (LLMs), they typically assume user queries are unambiguous, an assumption that rarely holds in real-world applications. To address these limitations, we propose a novel framework that dynamically handles both entity ambiguity (e.g., distinguishing between entities with similar names) and intent ambiguity (e.g., clarifying different interpretations of user queries) through interactive clarification. Our approach employs a Bayesian inference mechanism to quantify query ambiguity and guide LLMs in determining when and how to request clarification from users within a multi-turn dialogue framework. We further develop a two-agent interaction framework where an LLM-based user simulator enables iterative refinement of logical forms through simulated user feedback. Experimental results on the WebQSP and CWQ datasets demonstrate that our method significantly improves performance by effectively resolving semantic ambiguities. Additionally, we contribute a refined dataset of disambiguated queries, derived from interaction histories, to facilitate future research in this direction.
https://arxiv.org/abs/2504.09665
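A minimal sketch of using a Bayesian posterior over candidate interpretations to quantify ambiguity and decide when to ask a clarification question. The candidates, prior, likelihoods, and entropy threshold are fabricated; in the actual system these scores would come from an LLM within the multi-turn dialogue.

```python
# Bayesian ambiguity quantification: posterior over candidate entity interpretations.
import math

# Candidate entities for the mention "Jordan" with a prior (e.g., popularity).
prior = {"Michael Jordan (athlete)": 0.5,
         "Michael I. Jordan (scientist)": 0.3,
         "Jordan (country)": 0.2}

# Likelihood of the observed query context under each candidate (made up here).
likelihood = {"Michael Jordan (athlete)": 0.2,
              "Michael I. Jordan (scientist)": 0.6,
              "Jordan (country)": 0.1}

evidence = sum(prior[c] * likelihood[c] for c in prior)
posterior = {c: prior[c] * likelihood[c] / evidence for c in prior}

# Quantify ambiguity as the entropy of the posterior; clarify above a threshold.
entropy = -sum(p * math.log2(p) for p in posterior.values() if p > 0)
print(posterior, round(entropy, 3))
if entropy > 0.9:
    print("Ask: 'Do you mean the basketball player or the ML researcher?'")
```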
How do we update AI memory of user intent as intent changes? We consider how an AI interface may assist the integration of new information into a repository of natural language data. Inspired by software engineering concepts like impact analysis, we develop methods and a UI for managing semantic changes with non-local effects, which we call "semantic conflict resolution." The user commits new intent to a project -- makes a "semantic commit" -- and the AI helps the user detect and resolve semantic conflicts within a store of existing information representing their intent (an "intent specification"). We develop an interface, SemanticCommit, to better understand how users resolve conflicts when updating intent specifications such as Cursor Rules and game design documents. A knowledge graph-based RAG pipeline drives conflict detection, while LLMs assist in suggesting resolutions. We evaluate our technique on an initial benchmark. Then, we report a 12-user within-subjects study of SemanticCommit for two task domains -- game design documents, and AI agent memory in the style of ChatGPT memories -- where users integrated new information into an existing list. Half of our participants adopted a workflow of impact analysis, where they would first flag conflicts without AI revisions and then resolve them locally, despite having access to a global revision feature. We argue that AI agent interfaces, such as software IDEs like Cursor and Windsurf, should provide affordances for impact analysis and help users validate AI retrieval independently from generation. Our work speaks to how AI agent designers should think about updating memory as a process that involves human feedback and decision-making.
https://arxiv.org/abs/2504.09283
Knowledge graph embedding (KGE) models are extensively studied for knowledge graph completion, yet their evaluation remains constrained by unrealistic benchmarks. Commonly used datasets are either faulty or too small to reflect real-world data. Few studies examine the role of mediator nodes, which are essential for modeling n-ary relationships, or investigate model performance variation across domains. Standard evaluation metrics rely on the closed-world assumption, which penalizes models for correctly predicting missing triples, contradicting the fundamental goals of link prediction. These metrics often compress accuracy assessment into a single value, obscuring models' specific strengths and weaknesses. The prevailing evaluation protocol operates under the unrealistic assumption that an entity's properties, for which values are to be predicted, are known in advance. While alternative protocols such as property prediction, entity-pair ranking and triple classification address some of these limitations, they remain underutilized. This paper conducts a comprehensive evaluation of four representative KGE models on large-scale datasets FB-CVT-REV and FB+CVT-REV. Our analysis reveals critical insights, including substantial performance variations between small and large datasets, both in relative rankings and absolute metrics, systematic overestimation of model capabilities when n-ary relations are binarized, and fundamental limitations in current evaluation protocols and metrics.
https://arxiv.org/abs/2504.08970
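For reference, the scalar rank-based metrics the paper argues are too coarse can be computed as below. The ranks are fabricated for illustration; the point the paper makes is that these single numbers hide where a model fails (e.g., on n-ary/mediator patterns or in specific domains).

```python
# Standard rank-based link-prediction metrics compressed into single scalars.
def mrr(ranks):
    """Mean reciprocal rank of the held-out true entity."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of test triples whose true entity is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)

# Hypothetical rank of the true entity for each test triple.
ranks = [1, 3, 40, 2, 7, 1, 120, 5]

print("MRR     :", round(mrr(ranks), 3))
print("Hits@1  :", hits_at_k(ranks, 1))
print("Hits@10 :", hits_at_k(ranks, 10))
```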
In this document, we discuss a multi-step approach to the automated construction of a knowledge graph for structuring and representing domain-specific knowledge from large document corpora. We apply our method to build the first knowledge graph of nuclear fusion energy, a highly specialized field characterized by vast scope and heterogeneity. This is an ideal benchmark to test the key features of our pipeline, including automatic named entity recognition and entity resolution. We show how pre-trained large language models can be used to address these challenges and we evaluate their performance against Zipf's law, which characterizes human-generated natural language. Additionally, we develop a knowledge-graph retrieval-augmented generation system that combines large language models with a multi-prompt approach. This system provides contextually relevant answers to natural-language queries, including complex multi-hop questions that require reasoning across interconnected entities.
https://arxiv.org/abs/2504.07738
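A minimal sketch of the Zipf's-law check mentioned above: fit a power law to the rank-frequency curve of extracted entity mentions and inspect the exponent, which is close to 1 for Zipf-like, human-generated text. The mention counts below are synthetic, and the paper's actual evaluation protocol may differ.

```python
# Fit the Zipf exponent to a rank-frequency curve of entity mention counts.
import numpy as np

# Hypothetical mention counts for extracted entities, sorted in descending order.
counts = np.array([950, 480, 320, 240, 190, 160, 140, 120, 110, 95], dtype=float)
ranks = np.arange(1, len(counts) + 1, dtype=float)

# Linear fit in log-log space: log f = -s * log r + c, where s is the Zipf exponent.
slope, intercept = np.polyfit(np.log(ranks), np.log(counts), 1)
print("estimated Zipf exponent:", round(-slope, 2))
```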
Retrieval Augmented Generation (RAG) has enjoyed increased attention recently, and advances in Large Language Models (LLMs) have highlighted the importance of integrating world knowledge into these systems. Current RAG methodologies often modify the internal architecture of pre-trained language models (PLMs) or rely on textifying knowledge graphs (KGs), which is inefficient in terms of token usage. This paper introduces ConceptFormer, a new approach to augment LLMs with structured knowledge from KGs, such as Wikidata, without altering their internal structure or relying on textual input of KGs. ConceptFormer operates in the LLM embedding vector space, creating and injecting \emph{concept vectors} that encapsulate the information of the KG nodes directly. Trained in conjunction with a frozen LLM, ConceptFormer generates a comprehensive lookup table that maps KG nodes to their respective concept vectors. The approach aims to enhance the factual recall capabilities of LLMs by enabling them to process these concept vectors natively, thus enriching them with structured world knowledge in an efficient and scalable manner. Our experiments demonstrate that the addition of concept vectors to GPT-2 0.1B substantially increases its factual recall ability (Hit@10) by up to 272\% when tested on sentences from Wikipedia and up to 348\% on synthetically generated sentences. Even injecting only a single concept vector into the prompt increases factual recall ability (Hit@10) by up to 213\% on Wikipedia sentences, significantly outperforming RAG with graph textification while consuming 130x fewer input tokens.
https://arxiv.org/abs/2504.07624
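A rough sketch of the mechanics of injecting a vector directly into an LLM's embedding space, using GPT-2 from Hugging Face transformers. Here the "concept vector" is random and simply prepended to the prompt embeddings purely to show the `inputs_embeds` pathway; ConceptFormer instead learns these vectors via a trained lookup table, and its actual injection scheme may differ.

```python
# Prepend a stand-in concept vector to prompt embeddings and run a frozen GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
ids = tok(prompt, return_tensors="pt").input_ids
tok_emb = model.get_input_embeddings()(ids)            # (1, seq_len, n_embd)

# Stand-in concept vector (random here; a real one would come from a trained table).
concept = torch.randn(1, 1, model.config.n_embd)

inputs_embeds = torch.cat([concept, tok_emb], dim=1)   # inject in embedding space
with torch.no_grad():
    out = model(inputs_embeds=inputs_embeds)

next_id = out.logits[0, -1].argmax().item()
print(tok.decode(next_id))  # greedy next-token prediction
```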
Knowledge graphs have emerged as a popular method for injecting up-to-date, factual knowledge into large language models (LLMs). This is typically achieved by converting the knowledge graph into text that the LLM can process in context. While multiple methods of encoding knowledge graphs have been proposed, the impact of this textualization process on LLM performance remains under-explored. We introduce KG-LLM-Bench, a comprehensive and extensible benchmark spanning five knowledge graph understanding tasks, and evaluate how different encoding strategies affect performance across various base models. Our extensive experiments with seven language models and five textualization strategies provide insights for optimizing LLM performance on KG reasoning tasks.
https://arxiv.org/abs/2504.07087
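Two of the many possible textualization strategies such a benchmark can compare, sketched on toy triples: a flat one-triple-per-sentence rendering versus grouping by subject. The strategies actually used in KG-LLM-Bench may differ; these are illustrative.

```python
# Two simple ways to encode the same KG triples as LLM context.
triples = [
    ("Ada Lovelace", "field_of_work", "mathematics"),
    ("Ada Lovelace", "collaborated_with", "Charles Babbage"),
    ("Charles Babbage", "designed", "Analytical Engine"),
]

def as_sentences(ts):
    """One sentence per triple."""
    return "\n".join(f"{s} {p.replace('_', ' ')} {o}." for s, p, o in ts)

def grouped_by_subject(ts):
    """All facts about a subject collapsed into one line."""
    groups = {}
    for s, p, o in ts:
        groups.setdefault(s, []).append(f"{p.replace('_', ' ')}: {o}")
    return "\n".join(f"{s} -> " + "; ".join(facts) for s, facts in groups.items())

print(as_sentences(triples))
print(grouped_by_subject(triples))
# Either string is placed in the LLM context ahead of the question.
```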
The integration of tool learning with Large Language Models (LLMs) has expanded their capabilities in handling complex tasks by leveraging external tools. However, existing benchmarks for tool learning inadequately address critical real-world personalized scenarios, particularly those requiring multi-hop reasoning and inductive knowledge adaptation in dynamic environments. To bridge this gap, we introduce FamilyTool, a novel benchmark grounded in a family-based knowledge graph (KG) that simulates personalized, multi-hop tool use scenarios. FamilyTool challenges LLMs with queries spanning 1 to 3 relational hops (e.g., inferring familial connections and preferences) and incorporates an inductive KG setting where models must adapt to unseen user preferences and relationships without re-training, a common limitation in prior approaches that compromises generalization. We further propose KGETool: a simple KG-augmented evaluation pipeline to systematically assess LLMs' tool use ability in these settings. Experiments reveal significant performance gaps in state-of-the-art LLMs, with accuracy dropping sharply as hop complexity increases and inductive scenarios exposing severe generalization deficits. These findings underscore the limitations of current LLMs in handling personalized, evolving real-world contexts and highlight the urgent need for advancements in tool-learning frameworks. FamilyTool serves as a critical resource for evaluating and advancing LLM agents' reasoning, adaptability, and scalability in complex, dynamic environments. Code and dataset are available on GitHub.
https://arxiv.org/abs/2504.06766
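A toy version of the multi-hop, personalized reasoning FamilyTool probes: resolve a relational chain over a small family KG before invoking a tool. The family facts, preferences, and the stand-in tool are hypothetical and far simpler than the benchmark's KG.

```python
# Resolve a 2-hop relational query over a family KG, then call a stand-in tool.
kg = {
    "Dana":  {"mother": "Alice"},
    "Alice": {"mother": "Ruth", "brother": "Tom"},
}
preferences = {"Ruth": "Italian food", "Tom": "jazz"}

def resolve(person, *relations):
    """Follow a chain of relations (multi-hop) through the KG."""
    for rel in relations:
        person = kg[person][rel]
    return person

def recommend(person):
    """Stand-in for an external tool call using the resolved person's preference."""
    return f"suggest something related to {preferences[person]}"

# Query asked by Dana: "What should I get for my mother's brother?"
target = resolve("Dana", "mother", "brother")   # Dana -> Alice -> Tom
print(target, "->", recommend(target))
```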
In a conversational system, dynamically generating follow-up questions based on context can help users explore information and provide a better user experience. Humans are usually able to ask questions that involve some general life knowledge and demonstrate higher-order cognitive skills. However, the questions generated by existing methods are often limited to shallow contextual questions that are uninspiring and have a large gap to the human level. In this paper, we propose a three-stage external knowledge-enhanced follow-up question generation method, which identifies contextual topics, constructs a knowledge graph (KG) online, and finally combines these with a large language model to generate the final question. The model generates information-rich and exploratory follow-up questions by introducing external common sense knowledge and performing a knowledge fusion operation. Experiments show that compared to baseline models, our method generates questions that are more informative and closer to human questioning levels while maintaining contextual relevance.
https://arxiv.org/abs/2504.05801
Coreference resolution across multiple documents poses a significant challenge in natural language processing, particularly within the domain of knowledge graphs. This study introduces an innovative method aimed at identifying and resolving references to the same entities that appear across differing texts, thus enhancing the coherence and collaboration of information. Our method employs a dynamic linking mechanism that associates entities in the knowledge graph with their corresponding textual mentions. By utilizing contextual embeddings along with graph-based inference strategies, we effectively capture the relationships and interactions among entities, thereby improving the accuracy of coreference resolution. Rigorous evaluations on various benchmark datasets highlight notable advancements in our approach over traditional methodologies. The results showcase how the contextual information derived from knowledge graphs enhances the understanding of complex relationships across documents, leading to better entity linking and information extraction capabilities in applications driven by knowledge. Our technique demonstrates substantial improvements in both precision and recall, underscoring its effectiveness in the area of cross-document coreference resolution.
https://arxiv.org/abs/2504.05767
Large language models have shown remarkable language processing and reasoning ability but are prone to hallucinate when asked about private data. Retrieval-augmented generation (RAG) retrieves relevant data that fit into an LLM's context window and prompts the LLM for an answer. GraphRAG extends this approach to structured Knowledge Graphs (KGs) and questions regarding entities multiple hops away. The majority of recent GraphRAG methods either overlook the retrieval step or have ad hoc retrieval processes that are abstract or inefficient. This prevents them from being adopted when the KGs are stored in graph databases supporting graph query languages. In this work, we present GraphRAFT, a retrieve-and-reason framework that finetunes LLMs to generate provably correct Cypher queries to retrieve high-quality subgraph contexts and produce accurate answers. Our method is the first such solution that can be taken off-the-shelf and used on KGs stored in native graph DBs. Benchmarks suggest that our method is sample-efficient and scales with the availability of training data. Our method achieves significantly better results than all state-of-the-art models across all four standard metrics on two challenging Q\&A benchmarks over large text-attributed KGs.
https://arxiv.org/abs/2504.05478
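A sketch of the retrieve step only: run an (assumed LLM-generated) Cypher query against a native graph database via the official neo4j Python driver and collect the matched context for the LLM prompt. The connection details, schema, and query are hypothetical; GraphRAFT's fine-tuned query generation and correctness guarantees are not reproduced here.

```python
# Retrieve subgraph context from a native graph DB with a Cypher query.
from neo4j import GraphDatabase  # pip install neo4j

# Hypothetical local instance and credentials.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# The kind of 2-hop Cypher a fine-tuned model might emit for
# "Which papers cite work written by authors at MIT?" (schema is hypothetical).
cypher = """
MATCH (a:Author)-[:AFFILIATED_WITH]->(:Institution {name: $inst}),
      (a)-[:WROTE]->(p:Paper)<-[:CITES]-(citing:Paper)
RETURN DISTINCT citing.title AS title LIMIT 10
"""

with driver.session() as session:
    rows = session.run(cypher, inst="MIT")
    context = [record["title"] for record in rows]

print(context)  # retrieved context, to be placed in the LLM prompt
driver.close()
```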
Real-world data, such as news articles, social media posts, and chatbot conversations, is inherently dynamic and non-stationary, presenting significant challenges for constructing real-time structured representations through knowledge graphs (KGs). Relation Extraction (RE), a fundamental component of KG creation, often struggles to adapt to evolving data when traditional models rely on static, outdated datasets. Continual Relation Extraction (CRE) methods tackle this issue by incrementally learning new relations while preserving previously acquired knowledge. This study investigates the application of pre-trained language models (PLMs), specifically large language models (LLMs), to CRE, with a focus on leveraging memory replay to address catastrophic forgetting. We evaluate decoder-only models (e.g., Mistral-7B and Llama2-7B) and encoder-decoder models (e.g., Flan-T5 Base) on the TACRED and FewRel datasets. Task-incremental fine-tuning of LLMs demonstrates superior performance over earlier approaches using encoder-only models like BERT on TACRED, excelling in seen-task accuracy and overall performance (measured by whole and average accuracy), particularly with the Mistral and Flan-T5 models. Results on FewRel are similarly promising, achieving second place in whole and average accuracy metrics. This work underscores critical factors in knowledge transfer, language model architecture, and KG completeness, advancing CRE with LLMs and memory replay for dynamic, real-time relation extraction.
https://arxiv.org/abs/2504.05214
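A bare-bones sketch of episodic memory replay for continual relation extraction: after each task, a few exemplars are stored, and later training batches mix current-task examples with replayed ones to mitigate catastrophic forgetting. The buffer sizes and iteration counts are arbitrary, and the LLM fine-tuning update is abstracted into a placeholder.

```python
# Episodic memory replay for task-incremental relation extraction (skeleton only).
import random

memory = []  # exemplars from all previously seen relations

def train_step(batch):
    """Placeholder for one fine-tuning update of the relation-extraction LLM."""
    pass

def learn_task(task_examples, replay_ratio=0.5, exemplars_per_task=10, iters=100):
    for _ in range(iters):
        current = random.sample(task_examples, k=min(8, len(task_examples)))
        replayed = random.sample(memory,
                                 k=min(int(len(current) * replay_ratio), len(memory)))
        train_step(current + replayed)  # mixed batch of new and replayed examples
    # Store a few exemplars from this task for future replay.
    memory.extend(random.sample(task_examples,
                                k=min(exemplars_per_task, len(task_examples))))

# Tasks arrive incrementally, each introducing new relation types (toy data here).
task1 = [("Bill Gates founded Microsoft", "org:founded_by")] * 20
task2 = [("Paris is the capital of France", "loc:capital_of")] * 20
learn_task(task1)
learn_task(task2)
print("memory size:", len(memory))
```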
Knowledge Graph based Retrieval-Augmented Generation (KG-RAG) is a technique that enhances Large Language Model (LLM) inference in tasks like Question Answering (QA) by retrieving relevant information from knowledge graphs (KGs). However, real-world KGs are often incomplete, meaning that essential information for answering questions may be missing. Existing benchmarks do not adequately capture the impact of KG incompleteness on KG-RAG performance. In this paper, we systematically evaluate KG-RAG methods under incomplete KGs by removing triples using different methods and analyzing the resulting effects. We demonstrate that KG-RAG methods are sensitive to KG incompleteness, highlighting the need for more robust approaches in realistic settings.
https://arxiv.org/abs/2504.05163
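A small sketch of the perturbation behind such an evaluation: drop a fraction of KG triples (uniformly at random here, whereas the paper compares several removal methods) and re-run the KG-RAG pipeline on the degraded graph. The triples and the downstream evaluation call are placeholders.

```python
# Simulate KG incompleteness by removing a fraction of triples before evaluation.
import random

def remove_triples(triples, fraction, seed=0):
    """Return a copy of the KG with the given fraction of triples removed at random."""
    rng = random.Random(seed)
    return rng.sample(triples, k=int(len(triples) * (1 - fraction)))

kg = [("Marie Curie", "award", "Nobel Prize in Physics"),
      ("Marie Curie", "spouse", "Pierre Curie"),
      ("Pierre Curie", "field", "physics"),
      ("Nobel Prize in Physics", "awarded_by", "Royal Swedish Academy of Sciences")]

for frac in (0.0, 0.25, 0.5):
    degraded = remove_triples(kg, frac)
    print(f"{frac:.0%} removed -> {len(degraded)} triples left")
    # evaluate_kg_rag(degraded)  # hypothetical call to the QA pipeline under test
```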