Uncertainty quantification in Knowledge Graph Embedding (KGE) methods is crucial for ensuring the reliability of downstream applications. A recent work applies conformal prediction to KGE methods, providing uncertainty estimates by generating a set of answers that is guaranteed to include the true answer with a predefined confidence level. However, existing methods provide probabilistic guarantees averaged over a reference set of queries and answers (marginal coverage guarantee). In high-stakes applications such as medical diagnosis, a stronger guarantee is often required: the predicted sets must provide consistent coverage per query (conditional coverage guarantee). We propose CondKGCP, a novel method that approximates predicate-conditional coverage guarantees while maintaining compact prediction sets. CondKGCP merges predicates with similar vector representations and augments calibration with rank information. We prove the theoretical guarantees and demonstrate empirical effectiveness of CondKGCP by comprehensive evaluations.
知识图谱嵌入(KGE)方法中的不确定性量化对于确保下游应用的可靠性至关重要。最近的一项工作将符合预测应用于KGE方法,通过生成一组包含真实答案且保证达到预定义置信水平的答案集合来提供不确定性估计。然而,现有的方法仅提供了基于查询和答案参考集上的概率保证(边际覆盖保证)。在高风险应用场景中,如医学诊断,通常需要更强的保证:预测集必须为每个单独的查询提供一致的覆盖率(条件覆盖保证)。 我们提出了一种名为CondKGCP的新方法,该方法可以近似谓词条件下的覆盖保证,并同时保持紧凑的预测集合。CondKGCP通过合并具有相似向量表示的谓词并利用排名信息进行校准来实现这一目标。我们证明了CondKGCP的理论保证并通过全面评估展示了其在实际应用中的有效性。
https://arxiv.org/abs/2505.16877
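The difference between marginal and predicate-conditional coverage described in the abstract above can be illustrated with plain split conformal prediction. The sketch below is not CondKGCP itself: it omits predicate merging and rank information, and it simply calibrates one threshold per predicate. The function names, the toy predicates, and the nonconformity scores are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the threshold score
    if k > n:
        return math.inf  # too few calibration points: keep every candidate
    return sorted(scores)[k - 1]

def marginal_threshold(calib, alpha):
    """One threshold shared by all predicates (marginal coverage)."""
    return conformal_quantile([score for _, score in calib], alpha)

def per_predicate_thresholds(calib, alpha):
    """One threshold per predicate (predicate-conditional coverage)."""
    by_pred = defaultdict(list)
    for pred, score in calib:
        by_pred[pred].append(score)
    return {p: conformal_quantile(s, alpha) for p, s in by_pred.items()}

def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is within the threshold."""
    return {e for e, s in candidate_scores.items() if s <= threshold}

# Toy calibration data (hypothetical): (predicate, score of the true answer).
calib = [("treats", s) for s in (0.1, 0.2, 0.3, 0.4)] + \
        [("born_in", s) for s in (0.5, 0.6, 0.7, 0.8, 0.9)]
```

In this sketch a predicate with very few calibration examples receives an infinite threshold, so its prediction sets contain every candidate. That is one plausible reason to pool predicates with similar representations, as the abstract describes.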
When using Large Language Models (LLMs) to support Knowledge Graph Engineering (KGE), one of the first indications when searching for an appropriate model is its size. According to the scaling laws, larger models typically show higher capabilities. However, in practice, resource costs are also an important factor and thus it makes sense to consider the ratio between model performance and costs. The LLM-KG-Bench framework enables the comparison of LLMs in the context of KGE tasks and assesses their capabilities of understanding and producing KGs and KG queries. Based on a dataset created in an LLM-KG-Bench run covering 26 open state-of-the-art LLMs, we explore the model size scaling laws specific to KGE tasks. In our analyses, we assess how benchmark scores evolve between different model size categories. Additionally, we inspect how the general score development of single models and families of models correlates to their size. Our analyses revealed that, with a few exceptions, the model size scaling laws generally also apply to the selected KGE tasks. However, in some cases, plateau or ceiling effects occurred, i.e., the task performance did not change much between a model and the next larger model. In these cases, smaller models could be considered to achieve high cost-effectiveness. Regarding models of the same family, sometimes larger models performed worse than smaller models of the same family. These effects occurred only locally. Hence it is advisable to additionally test the next smallest and largest model of the same family.
在使用大型语言模型(LLMs)来支持知识图谱工程(KGE)时,寻找合适模型的第一个指标通常是其规模。根据扩展法则,更大的模型通常表现出更高的能力。然而,在实践中,资源成本也是一个重要因素,因此考虑性能与成本的比率是有意义的。LLM-KG-Bench框架使得在KGE任务背景下比较不同的大型语言模型,并评估它们理解和生成知识图谱及其查询的能力成为可能。基于一个由26种开放领域的最新大型语言模型组成的LLM-KG-Bench运行数据集,我们探讨了适用于KGE任务的具体规模扩展法则。 我们的分析包括评估不同规模类别之间基准得分的变化情况,以及单个模型和同一模型家族的总体评分发展趋势与其规模的相关性。研究发现,除了少数例外,规模扩展法则通常也适用于所选的知识图谱工程任务。然而,在某些情况下,会出现平台期或上限效应,即在某个模型与下一个更大规模模型之间,任务性能变化不大。在这种情形下,小型模型可能被认为具有较高的成本效益。 对于同一家族的模型而言,有时较大规模的模型表现不如较小规模的同族模型。这些影响仅限于局部。因此建议针对相同家族的模型进行额外测试,即测试该家族中最小和最大的两个模型。
https://arxiv.org/abs/2505.16276
Large language models (LLMs) have demonstrated remarkable capabilities, but still struggle with issues like hallucinations and outdated information. Retrieval-augmented generation (RAG) addresses these issues by grounding LLM outputs in external knowledge with an Information Retrieval (IR) system. Building on this foundation, graph-based RAG systems go a step further by retrieving subgraphs, which preserve the relationships between knowledge entities and provide more comprehensive context. However, graph RAG faces two challenges: (1) Retrieving relevant information introduces irrelevant nodes (especially in dense graph databases, where retrieval usually extends to adjacent nodes), and leads to overly lengthy inputs that hinder efficiency; (2) The representation gap between graph and language during generation with LLMs limits the ability to fully leverage graph structures for enhanced understanding. To address these limitations, we propose Align-GRAG, a novel reasoning-guided dual alignment framework for the post-retrieval phase. It first formulates a subgraph by retrieving nodes and edges. Then an Aligner is proposed to jointly optimize a graph encoder with LLM-summarized reasoning. It achieves dual alignment of graph nodes and representations by leveraging a KL divergence loss and a contrastive loss, facilitating efficient pruning of irrelevant knowledge and establishing a unified semantic space. The Generator integrates the aligned graph data with the LLM to produce coherent and accurate answers. Experiments on the GraphQA benchmark across three tasks (common sense reasoning, scene graph understanding, and knowledge graph reasoning) validate the effectiveness of our method. The code will be available upon acceptance.
大型语言模型(LLM)展现了显著的能力,但仍面临诸如幻觉和信息过时等问题。检索增强生成(RAG)通过利用信息检索系统将LLM的输出与外部知识联系起来来解决这些问题。在此基础上,基于图的RAG系统更进一步,通过检索子图来保留知识实体之间的关系,并提供更加全面的上下文。然而,图RAG面临着两大挑战:(1) 检索相关信息时会引入无关节点(特别是在密集图数据库中,检索通常扩展到相邻节点),从而导致输入过长,影响效率;(2) 在使用LLM生成过程中,图与语言之间的表征差距限制了充分利用图结构以增强理解的能力。为了应对这些局限性,我们提出了一种新的推理引导的双对齐框架——Align-GRAG,在检索后阶段应用这一框架。该框架首先通过检索节点和边来构建子图。然后设计了一个Aligner,它同时优化了图编码器与LLM摘要推理,并利用KL散度损失和对比损失实现图表节点与表示的双重对齐,从而有效地修剪无关知识并建立统一的语义空间。生成器将对齐后的图形数据与LLM结合使用以产生连贯且准确的答案。在GraphQA基准测试中的三个任务(包括常识推理、场景图理解以及知识图谱推理)上进行的实验验证了我们方法的有效性。代码将在接受后提供。
https://arxiv.org/abs/2505.16237
We propose a novel framework for integrating fragmented multi-modal data in Alzheimer's disease (AD) research using large language models (LLMs) and knowledge graphs. While traditional multimodal analysis requires matched patient IDs across datasets, our approach demonstrates population-level integration of MRI, gene expression, biomarkers, EEG, and clinical indicators from independent cohorts. Statistical analysis identified significant features in each modality, which were connected as nodes in a knowledge graph. LLMs then analyzed the graph to extract potential correlations and generate hypotheses in natural language. This approach revealed several novel relationships, including a potential pathway linking metabolic risk factors to tau protein abnormalities via neuroinflammation (r>0.6, p<0.001), and unexpected correlations between frontal EEG channels and specific gene expression profiles (r=0.42-0.58, p<0.01). Cross-validation with independent datasets confirmed the robustness of major findings, with consistent effect sizes across cohorts (variance <15%). The reproducibility of these findings was further supported by expert review (Cohen's κ=0.82) and computational validation. Our framework enables cross-modal integration at a conceptual level without requiring patient ID matching, offering new possibilities for understanding AD pathology through fragmented data reuse and generating testable hypotheses for future research.
我们提出了一种使用大型语言模型(LLM)和知识图谱将碎片化多模态数据整合到阿尔茨海默病(AD)研究中的新框架。传统多模式分析需要在不同数据集中匹配患者ID,而我们的方法展示了一个能够在不依赖于单一患者的情况下,在人口水平上集成MRI、基因表达、生物标志物、EEG和临床指标的独立队列的方法。统计分析确定了每个模态中的显著特征,并将这些特征作为节点连接到知识图谱中。随后,LLM分析该图以提取潜在的相关性并用自然语言生成假设。 这种方法揭示了几种新颖的关系,包括代谢风险因素通过神经炎症导致tau蛋白异常的潜在途径(r>0.6, p<0.001),以及前额EEG通道和特定基因表达谱之间的意外相关性(r=0.42-0.58, p<0.01)。独立数据集上的交叉验证确认了主要发现的稳健性,各队列的效果大小具有一致性(方差小于15%)。这些发现的可重复性还得到了专家评审的认可(Cohen's k=0.82)和计算验证的支持。 我们的框架能够在概念层面上实现跨模态整合而无需匹配患者ID,这为通过碎片化数据重用来理解AD病理学提供了新的可能性,并为未来研究生成了测试假设。
https://arxiv.org/abs/2505.15747
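The expert-review agreement in the abstract above is reported as a Cohen's kappa value. As a reminder of what that statistic measures, here is a minimal sketch of the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e). The function name and the toy labels are assumptions, and nothing here reproduces the paper's data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labelled independently at their own rates.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)  # undefined when p_e == 1 (single shared label)
```

Kappa is 1 for perfect agreement and 0 when the raters agree only as often as chance would predict.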
We introduce the concept of protoknowledge to formalize and measure how sequences of tokens encoding Knowledge Graphs are internalized during pretraining and utilized at inference time by Large Language Models (LLMs). Indeed, LLMs have demonstrated the ability to memorize vast amounts of token sequences during pretraining, and a central open question is how they leverage this memorization as reusable knowledge through generalization. We then categorize protoknowledge into lexical, hierarchical, and topological forms, varying on the type of knowledge that needs to be activated. We measure protoknowledge through Knowledge Activation Tasks (KATs), analyzing its general properties such as semantic bias. We then investigate the impact of protoknowledge on Text-to-SPARQL performance by varying prompting strategies depending on input conditions. To this end, we adopt a novel analysis framework that assesses whether model predictions align with the successful activation of the relevant protoknowledge for each query. This methodology provides a practical tool to explore Semantic-Level Data Contamination and serves as an effective strategy for Closed-Pretraining models.
我们引入了“proto知识”这一概念,以形式化和量化大型语言模型(LLM)在预训练过程中如何内化编码知识图谱的标记序列,并在推理时利用这些知识。实际上,LLMs已经展示了在其预训练阶段能够记住大量的标记序列的能力,而一个核心的开放问题是它们是如何通过泛化来利用这种记忆作为可重复使用的知识的。我们将proto知识分类为词汇、层次和拓扑形式,根据需要激活的知识类型不同而变化。我们通过知识激活任务(KATs)测量proto知识,并分析其一般属性,如语义偏差。 接下来,我们探讨了在文本到SPARQL性能中proto知识的影响,根据输入条件的变化采用不同的提示策略。为此,我们采用了新的分析框架,评估模型预测是否与每个查询相关联的适当proto知识的成功激活相一致。这种方法为探索语义级数据污染提供了实用工具,并且对于封闭预训练模型而言是一种有效的策略。
https://arxiv.org/abs/2505.15501
While large language models (LLMs) show great potential in temporal reasoning, most existing work focuses heavily on enhancing performance, often neglecting the explainable reasoning processes underlying the results. To address this gap, we introduce a comprehensive benchmark covering a wide range of temporal granularities, designed to systematically evaluate LLMs' capabilities in explainable temporal reasoning. Furthermore, our findings reveal that LLMs struggle to deliver convincing explanations when relying solely on textual information. To address this challenge, we propose GETER, a novel structure-aware generative framework that integrates Graph structures with text for Explainable TEmporal Reasoning. Specifically, we first leverage temporal knowledge graphs to develop a temporal encoder that captures structural information for the query. Subsequently, we introduce a structure-text prefix adapter to map graph structure features into the text embedding space. Finally, LLMs generate explanation text by seamlessly integrating the soft graph token with instruction-tuning prompt tokens. Experimental results indicate that GETER achieves state-of-the-art performance and shows strong generalization capabilities. Our dataset and code are available at this https URL.
虽然大型语言模型(LLM)在时间推理方面显示出巨大的潜力,但大多数现有的工作主要集中在提升性能上,往往忽视了支撑结果的可解释性推理过程。为了解决这一不足,我们引入了一个全面的基准测试集,涵盖了广泛的时间粒度范围,并旨在系统地评估LLM在可解释的时间推理方面的能力。此外,我们的研究发现表明,当仅依赖文本信息时,LLM难以提供令人信服的解释。为此,我们提出了GETER,这是一种新颖的结构感知生成框架,它将图结构与文本结合用于可解释的时间推理。具体来说,首先利用时间知识图来开发一个时间编码器,该编码器捕捉查询中的结构性信息。接着,引入了一个结构-文本前缀适配器以将图形结构特征映射到文本嵌入空间中。最后,LLM通过无缝地整合软图令牌与指令微调提示令牌生成解释性文本。实验结果表明,GETER在性能上达到了最先进的水平,并且展示了强大的泛化能力。我们的数据集和代码可通过原文提供的链接获取。
https://arxiv.org/abs/2505.15245
Knowledge graph-based retrieval-augmented generation seeks to mitigate hallucinations in Large Language Models (LLMs) caused by insufficient or outdated knowledge. However, existing methods often fail to fully exploit the prior knowledge embedded in knowledge graphs (KGs), particularly their structural information and explicit or implicit constraints. The former can enhance the faithfulness of LLMs' reasoning, while the latter can improve the reliability of response generation. Motivated by these, we propose a trustworthy reasoning framework, termed Deliberation over Priors (DP), which sufficiently utilizes the priors contained in KGs. Specifically, DP adopts a progressive knowledge distillation strategy that integrates structural priors into LLMs through a combination of supervised fine-tuning and Kahneman-Tversky optimization, thereby improving the faithfulness of relation path generation. Furthermore, our framework employs a reasoning-introspection strategy, which guides LLMs to perform refined reasoning verification based on extracted constraint priors, ensuring the reliability of response generation. Extensive experiments on three benchmark datasets demonstrate that DP achieves new state-of-the-art performance, especially a Hit@1 improvement of 13% on the ComplexWebQuestions dataset, and generates highly trustworthy responses. We also conduct various analyses to verify its flexibility and practicality. The code is available at this https URL.
基于知识图谱(KG)的检索增强生成旨在减轻大规模语言模型(LLM)由于知识不足或过时所导致的幻觉现象。然而,现有的方法往往未能充分挖掘嵌入在KG中的先验知识,尤其是其结构信息和明确或隐含约束。前者可以提升LLMs推理过程的忠实度,而后者则能增强响应生成的可靠性。受此启发,我们提出了一种名为“基于先验的审慎推理”(Deliberation over Priors, DP)的可信推理框架,该框架充分利用了KG中包含的先验知识。具体而言,DP采用了一种渐进式的知识蒸馏策略,通过结合监督微调和卡尼曼-特沃斯基优化的方式将结构化先验整合到LLMs中,从而提高了关系路径生成的忠实度。此外,该框架还采用了推理内省策略,指导LLMs基于提取的约束先验进行精细化推理验证,以确保响应生成的可靠性。在三个基准数据集上的广泛实验表明,DP达到了新的最先进的性能,在ComplexWebQuestions数据集上实现了13%的Hit@1提升,并且能够产生高度可信的回答。我们还进行了各种分析来验证其灵活性和实用性。代码可通过原文提供的链接获取。
https://arxiv.org/abs/2505.15210
Graph-structured data pervades domains such as social networks, biological systems, knowledge graphs, and recommender systems. While foundation models have transformed natural language processing, vision, and multimodal learning through large-scale pretraining and generalization, extending these capabilities to graphs -- characterized by non-Euclidean structures and complex relational semantics -- poses unique challenges and opens new opportunities. To this end, Graph Foundation Models (GFMs) aim to bring scalable, general-purpose intelligence to structured data, enabling broad transfer across graph-centric tasks and domains. This survey provides a comprehensive overview of GFMs, unifying diverse efforts under a modular framework comprising three key components: backbone architectures, pretraining strategies, and adaptation mechanisms. We categorize GFMs by their generalization scope -- universal, task-specific, and domain-specific -- and review representative methods, key innovations, and theoretical insights within each category. Beyond methodology, we examine theoretical foundations including transferability and emergent capabilities, and highlight key challenges such as structural alignment, heterogeneity, scalability, and evaluation. Positioned at the intersection of graph learning and general-purpose AI, GFMs are poised to become foundational infrastructure for open-ended reasoning over structured data. This survey consolidates current progress and outlines future directions to guide research in this rapidly evolving field. Resources are available at this https URL.
图结构数据在社交网络、生物系统、知识图谱和推荐系统等领域普遍存在。虽然基础模型通过大规模预训练和泛化能力革新了自然语言处理、视觉以及多模态学习,但将其扩展到具有非欧几里得结构和复杂关系语义的图中,则带来了独特的挑战,并开启了新的机遇。为此,图基础模型(GFMs)旨在为结构化数据带来可扩展性和通用性的智能,从而在以图为中心的任务和领域内实现广泛的迁移。本综述对GFMs提供了全面概述,在一个模块化的框架下统一了各种努力,该框架包括三个关键组成部分:骨干架构、预训练策略以及适应机制。我们按其泛化范围(通用、任务特定和领域特定)对GFMs进行分类,并在每一类中回顾代表性方法、关键技术创新和理论见解。 除了方法论之外,本综述还考察了可迁移性与涌现能力等理论基础,同时强调包括结构对齐、异质性、可扩展性和评估在内的关键挑战。位于图学习与通用人工智能交叉点的GFMs有望成为对结构化数据进行开放式推理的基础设施。本综述总结了当前的研究进展,并为这一快速发展领域提出了未来方向以指导研究。 相关资源可通过原文提供的链接获取。
https://arxiv.org/abs/2505.15116
Large Language Models (LLMs) have shown impressive capabilities in contextual understanding and reasoning. However, evaluating their performance across diverse scientific domains remains underexplored, as existing benchmarks primarily focus on general domains and fail to capture the intricate complexity of scientific data. To bridge this gap, we construct SciCUEval, a comprehensive benchmark dataset tailored to assess the scientific context understanding capability of LLMs. It comprises ten domain-specific sub-datasets spanning biology, chemistry, physics, biomedicine, and materials science, integrating diverse data modalities including structured tables, knowledge graphs, and unstructured texts. SciCUEval systematically evaluates four core competencies: Relevant information identification, Information-absence detection, Multi-source information integration, and Context-aware inference, through a variety of question formats. We conduct extensive evaluations of state-of-the-art LLMs on SciCUEval, providing a fine-grained analysis of their strengths and limitations in scientific context understanding, and offering valuable insights for the future development of scientific-domain LLMs.
大型语言模型(LLMs)在上下文理解和推理方面表现出色。然而,针对不同科学领域的性能评估仍处于探索阶段,因为现有的基准测试主要集中在通用领域上,并且无法捕捉到科学数据的复杂性。为弥补这一不足,我们构建了SciCUEval,这是一个全面的基准数据集,旨在评估大型语言模型在科学研究上下文理解方面的能力。该数据集包含十个特定领域的子数据集,涵盖了生物学、化学、物理学、生物医学和材料科学,并整合了多种类型的数据模式,包括结构化表格、知识图谱和非结构化文本。 SciCUEval系统地评估四大核心能力:相关信息识别、信息缺失检测、多源信息整合以及上下文感知推理,通过多样化的题目格式进行测试。我们对最先进的大型语言模型在SciCUEval上的表现进行了广泛的评估,并对其科学上下文理解的强项和局限性提供了细致的分析,为未来开发专门针对科学领域的大型语言模型提供了宝贵的见解。
https://arxiv.org/abs/2505.15094
When addressing complex questions that require new information, people often associate the question with existing knowledge to derive a sensible answer. For instance, when evaluating whether melatonin aids insomnia, one might associate "hormones helping mental disorders" with "melatonin being a hormone and insomnia a mental disorder" to complete the reasoning. Large Language Models (LLMs) also require such associative thinking, particularly in resolving scientific inquiries when retrieved knowledge is insufficient and does not directly answer the question. Graph Inspired Veracity Extrapolation (GIVE) addresses this by using a knowledge graph (KG) to extrapolate structured knowledge. However, it involves the construction and pruning of many hypothetical triplets, which limits efficiency and generalizability. We propose Self-GIVE, a retrieve-RL framework that enhances LLMs with automatic associative thinking through reinforcement learning. Self-GIVE extracts structured information and entity sets to assist the model in linking to the queried concepts. We address GIVE's key limitations: (1) extensive LLM calls and token overhead for knowledge extrapolation, (2) difficulty in deploying on smaller LLMs (3B or 7B) due to complex instructions, and (3) inaccurate knowledge from LLM pruning. Specifically, after fine-tuning using Self-GIVE with a 135-node UMLS KG, it improves the performance of the Qwen2.5 3B and 7B models by up to 28.5% → 71.4% and 78.6% → 90.5%, respectively, on unseen samples in challenging biomedical QA tasks. In particular, Self-GIVE allows the 7B model to match or outperform GPT3.5 turbo with GIVE, while cutting token usage by over 90%. Self-GIVE enhances the scalable integration of structured retrieval and reasoning with associative thinking.
当处理需要新信息的复杂问题时,人们通常会将这些问题与现有的知识关联起来以得出合理答案。例如,在评估褪黑激素是否有助于失眠治疗时,可能会把“荷尔蒙帮助精神障碍”的概念与“褪黑激素是一种荷尔蒙且失眠属于一种精神障碍”联系起来进行推理。大型语言模型(LLMs)也需要这种联想思维,尤其是在检索知识不足或无法直接回答问题的科学探究中。 图启发式真实性外推法(Graph Inspired Veracity Extrapolation, GIVE)通过使用知识图谱(KG)来扩展结构化知识以解决这类问题。然而,这种方法需要构造和修剪许多假设三元组,这限制了其效率和通用性。为此,我们提出了Self-GIVE框架,这是一个基于强化学习的检索-RL框架,它增强了LLMs的自动联想思维能力。Self-GIVE提取结构化的信息和实体集来帮助模型链接到查询概念。 我们解决了GIVE的关键局限:(1)知识外推时过度调用大型语言模型以及由此产生的大量令牌开销;(2)由于指令复杂性,在较小的语言模型(如3B或7B参数规模的模型)上部署困难;(3)LLMs修剪得到的知识准确性问题。具体来说,在使用Self-GIVE并结合具有135个节点的UMLS KG进行微调后,Qwen2.5 3B和7B模型在极具挑战性的生物医学问答任务中未见过样本上的性能分别从28.5%提升到71.4%,以及从78.6%提升到90.5%。特别地,Self-GIVE使7B模型能够与配备了GIVE的GPT3.5 turbo相匹敌甚至超越,并且减少了超过90%的令牌使用量。此外,Self-GIVE增强了结构化检索和推理与联想思维相结合的可扩展集成。
https://arxiv.org/abs/2505.15062
Recent advances in large language models (LLMs) and the abundance of food data have resulted in studies to improve food understanding using LLMs. Despite several recommendation systems utilizing LLMs and Knowledge Graphs (KGs), there has been limited research on integrating food related KGs with LLMs. We introduce KERL, a unified system that leverages food KGs and LLMs to provide personalized food recommendations and generates recipes with associated micro-nutritional information. Given a natural language question, KERL extracts entities, retrieves subgraphs from the KG, which are then fed into the LLM as context to select the recipes that satisfy the constraints. Next, our system generates the cooking steps and nutritional information for each recipe. To evaluate our approach, we also develop a benchmark dataset by curating recipe related questions, combined with constraints and personal preferences. Through extensive experiments, we show that our proposed KG-augmented LLM significantly outperforms existing approaches, offering a complete and coherent solution for food recommendation, recipe generation, and nutritional analysis. Our code and benchmark datasets are publicly available at this https URL.
最近在大型语言模型(LLM)和食品数据方面的进展已经导致了利用LLM来改善对食物的理解的研究。尽管已经有几个推荐系统结合使用了LLM和知识图谱(KG),但将食品相关的KG与LLM集成的研究仍然有限。我们引入了一种统一的系统KERL,该系统利用食品KG和LLM来提供个性化的食品推荐,并生成包含相关微量营养信息的食谱。给定自然语言问题后,KERL会提取实体,从KG中检索子图,然后将这些子图作为上下文输入到LLM中以选择符合约束条件的食谱。接下来,我们的系统为每个食谱生成烹饪步骤和营养信息。 为了评估我们提出的方法,我们也开发了一个基准数据集,该数据集通过策划与食谱相关的问题,并结合了约束条件和个人偏好来建立。通过广泛的实验,我们展示了我们的KG增强型LLM在食品推荐、菜谱生成以及营养分析方面显著优于现有的方法,提供了一种完整且连贯的解决方案。 我们的代码和基准数据集可通过原文提供的链接公开获取。
https://arxiv.org/abs/2505.14629
Recent advancements in Large Language Models (LLMs) have transformed code generation from natural language queries. However, despite their extensive knowledge and ability to produce high-quality code, LLMs often struggle with contextual accuracy, particularly in evolving codebases. Current code search and retrieval methods frequently lack robustness in both the quality and contextual relevance of retrieved results, leading to suboptimal code generation. This paper introduces a novel knowledge graph-based approach to improve code search and retrieval leading to better quality of code generation in the context of repository-level tasks. The proposed approach represents code repositories as graphs, capturing structural and relational information for enhanced context-aware code generation. Our framework employs a hybrid approach for code retrieval to improve contextual relevance, track inter-file modular dependencies, generate more robust code and ensure consistency with the existing codebase. We benchmark the proposed approach on the Evolutionary Code Benchmark (EvoCodeBench) dataset, a repository-level code generation benchmark, and demonstrate that our method significantly outperforms the baseline approach. These findings suggest that knowledge graph based code generation could advance robust, context-sensitive coding assistance tools.
近期,大型语言模型(LLM)在从自然语言查询生成代码方面取得了显著进展。然而,尽管这些模型拥有广泛的知识库并且能够生成高质量的代码,它们在处理上下文准确性时仍面临挑战,特别是在不断演化的代码库中。当前的代码搜索和检索方法通常缺乏对检索结果质量和相关性的稳健性,导致生成的代码质量不佳。本文提出了一种基于知识图谱的新方法来改进代码搜索与检索,在项目仓库级别的任务中提升代码生成的质量。该方法将代码仓库表示为图形结构,捕捉了结构性和关系性的信息,以增强上下文感知下的代码生成能力。 我们的框架采用混合策略进行代码检索,旨在提高相关性、跟踪文件间的模块依赖性、生成更稳健的代码,并确保与现有代码库的一致性。我们在Evolutionary Code Benchmark(EvoCodeBench)数据集上对所提出的方案进行了基准测试,这是一个项目仓库级别的代码生成评估工具。结果表明,我们的方法显著超越了基线模型。 这些发现暗示基于知识图谱的方法可能有助于开发出更加健壮且上下文敏感的编码辅助工具。
https://arxiv.org/abs/2505.14394
Large Language Models (LLMs) have inherent limitations of faithfulness and factuality, commonly referred to as hallucinations. Several benchmarks have been developed that provide a test bed for factuality evaluation within the context of English-centric datasets, while relying on supplementary informative context like web links or text passages but ignoring the available structured factual resources. To this end, Knowledge Graphs (KGs) have been identified as a useful aid for hallucination mitigation, as they provide a structured way to represent the facts about entities and their relations with minimal linguistic overhead. We address the lack of KG paths and multilinguality for factual language modeling within the existing hallucination evaluation benchmarks and propose a KG-based multilingual, multihop benchmark called MultiHal, framed for generative text evaluation. As part of our data collection pipeline, we mined 140k KG-paths from open-domain KGs, from which we pruned noisy KG-paths, curating a high-quality subset of 25.9k. Our baseline evaluation shows an absolute increase of approximately 0.12 to 0.36 points in the semantic similarity score for KG-RAG over vanilla QA across multiple languages and multiple models, demonstrating the potential of KG integration. We anticipate MultiHal will foster future research towards several graph-based hallucination mitigation and fact-checking tasks.
大型语言模型(LLMs)固有地存在忠实度和事实性的局限性,通常被称为“幻觉”。已经开发出了几个基准测试工具,在以英语为中心的数据集的背景下提供了一种评估事实准确性的方法,但这些方法依赖于诸如网页链接或文本段落等补充信息,而忽略了现有的结构化事实资源。为此,知识图谱(KGs)被认定为一种有用的辅助手段来减轻幻觉问题,因为它们能够以最小的语言开销方式来表示实体及其关系的事实。 为了弥补现有幻觉评估基准中的知识图路径和多语言性的不足,我们提出了一种基于知识图的多语言、多跳基准测试——**MultiHal**,用于生成文本的评估。作为数据收集流程的一部分,我们从开放领域的知识图中挖掘出了140,000条KG路径,并从中剔除了噪音路径,整理出一个高质量的小规模子集(25,900条)。我们的基线评估显示,在多个语言和模型下,KG-RAG在语义相似性得分方面相对于普通问答模式有大约0.12到0.36的绝对值提升,这表明了集成知识图谱的可能性。我们预计MultiHal将促进未来基于图形的幻觉缓解与事实核查任务的研究工作。
https://arxiv.org/abs/2505.14101
Temporal Knowledge Graphs (TKGs), as an extension of static Knowledge Graphs (KGs), incorporate the temporal feature to express the transience of knowledge by describing when facts occur. TKG extrapolation aims to infer possible future facts based on known history, which has garnered significant attention in recent years. Some existing methods treat a TKG as a sequence of independent subgraphs to model temporal evolution patterns, demonstrating impressive reasoning performance. However, they still have limitations: 1) In modeling subgraph semantic evolution, they usually neglect the internal structural interactions between subgraphs, which are actually crucial for encoding TKGs. 2) They overlook the potential smooth features that do not lead to semantic changes, which should be distinguished from the semantic evolution process. Therefore, we propose a novel Disentangled Multi-span Evolutionary Network (DiMNet) for TKG reasoning. Specifically, we design a multi-span evolution strategy that captures local neighbor features while perceiving historical neighbor semantic information, thus enabling internal interactions between subgraphs during the evolution process. To maximize the capture of semantic change patterns, we design a disentangle component that adaptively separates nodes' active and stable features, used to dynamically control the influence of historical semantics on future evolution. Extensive experiments conducted on four real-world TKG datasets show that DiMNet performs strongly in TKG reasoning and outperforms the state of the art by up to 22.7% in MRR.
时间知识图谱(TKGs)作为静态知识图谱(KGs)的扩展,通过描述事实发生的时间来引入了时间特性,以此表达知识的瞬时性。TKG推理旨在根据已知的历史信息推断可能的未来事实,在近年来受到了广泛关注。一些现有的方法将TKG视为一系列独立子图序列以建模时间演化模式,并展示了令人印象深刻的推理性能。然而,这些方法仍然存在局限:1)在建模子图语义演变时,它们通常忽略了子图之间的内部结构交互,而这对于编码TKGs实际上是至关重要的;2)它们忽视了那些不会导致语义变化的潜在平滑特性,而这些特性应该与语义演化过程区分开来。因此,我们提出了一个新颖的分解式多跨度进化网络(DiMNet),专门用于TKG推理。具体而言,我们设计了一种多跨度演进策略,它在捕捉局部邻居特征的同时感知历史邻居的语义信息,从而使得子图之间的内部交互能够在演化过程中得以实现。为了最大限度地捕获语义变化模式,我们设计了一个分解组件,可以自适应地区分节点的活跃和稳定特性,并用于动态控制历史语义对未来演化的影响力。 在四个真实世界TKG数据集上进行的广泛实验表明,DiMNet在TKG推理方面表现出显著性能,并且在MRR指标上的表现优于当前最佳方法高达22.7%。
https://arxiv.org/abs/2505.14020
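The DiMNet abstract above reports its gain in MRR, and an earlier abstract in this list reports Hit@1. Both are standard ranking metrics computed from the rank that a model assigns to the correct answer of each query. A minimal sketch follows; the function names and the example ranks are illustrative assumptions and do not come from either paper.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries (ranks are 1-based)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of queries whose correct answer is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical ranks of the true answer for three queries.
example_ranks = [1, 2, 4]
```

MRR rewards placing the correct answer near the top of the list, while Hits@k only checks whether it falls inside a fixed cutoff.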
Retrieval-Augmented Generation (RAG) systems empower large language models (LLMs) with external knowledge, yet struggle with efficiency-accuracy trade-offs when scaling to large knowledge graphs. Existing approaches often rely on monolithic graph retrieval, incurring unnecessary latency for simple queries and fragmented reasoning for complex multi-hop questions. To address these challenges, this paper proposes SPLIT-RAG, a multi-agent RAG framework built on question-driven semantic graph partitioning and collaborative subgraph retrieval. The framework first creates a Semantic Partitioning of Linked Information, then uses the Type-Specialized knowledge base to achieve Multi-Agent RAG. Attribute-aware graph segmentation divides knowledge graphs into semantically coherent subgraphs that align with different query types. Lightweight LLM agents are assigned to the partitioned subgraphs, and only relevant partitions are activated during retrieval, which reduces the search space and improves efficiency. Finally, a hierarchical merging module resolves inconsistencies across subgraph-derived answers through logical verification. Extensive experimental validation demonstrates considerable improvements compared to existing approaches.
检索增强生成(RAG)系统通过外部知识增强了大型语言模型(LLM)的能力,但在扩展到大规模知识图谱时,在效率和准确性的权衡上面临挑战。现有的方法往往依赖于单一的图形检索,对于简单查询会产生不必要的延迟,并且在处理复杂的多跳问题时会导致推理片段化。为了解决这些问题,本文提出了一种名为SPLIT-RAG的多代理RAG框架,该框架通过基于问题语义的知识图谱划分和协作子图检索来解决这些限制。 创新性的框架首先进行链接信息的语义分区(Semantic Partitioning of Linked Information),然后利用类型专门化的知识库实现多代理RAG。这种属性感知的图形分割能够将知识图谱划分为在语义上连贯的小图,确保小图与不同的查询类型相匹配,同时为划分后的子图分配轻量级的语言模型代理,并且仅激活相关分区进行检索,从而减少搜索空间并提高效率。 最后,一个分层合并模块通过逻辑验证解决来自各个子图答案之间的不一致性。广泛的实验验证显示,该方法相比现有方法具有显著的改进。
https://arxiv.org/abs/2505.13994
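The idea in the SPLIT-RAG abstract above of activating only the partitions relevant to a question can be sketched in a few lines. This is a heavily simplified stand-in and not the paper's method: partitions are formed by relation name rather than by question-driven semantic partitioning, routing uses keyword overlap instead of LLM agents, and the merging step is omitted. All names, triples, and keyword sets are hypothetical.

```python
from collections import defaultdict

def partition_by_relation(triples):
    """Group (head, relation, tail) triples into one subgraph per relation."""
    parts = defaultdict(list)
    for head, rel, tail in triples:
        parts[rel].append((head, rel, tail))
    return dict(parts)

def route(question, keywords):
    """Return the names of partitions whose keywords appear in the question."""
    words = set(question.lower().replace("?", "").split())
    return [name for name, kws in keywords.items() if words & kws]

def retrieve(question, partitions, keywords):
    """Search only the activated partitions for heads named in the question."""
    q = question.lower()
    hits = []
    for name in route(question, keywords):
        hits += [t for t in partitions[name] if t[0].lower() in q]
    return hits

# Hypothetical toy graph and routing keywords.
triples = [("Aspirin", "treats", "Headache"),
           ("Aspirin", "made_by", "Bayer"),
           ("Ibuprofen", "treats", "Fever")]
keywords = {"treats": {"treat", "treats", "cure"},
            "made_by": {"made", "makes", "manufacturer"}}
partitions = partition_by_relation(triples)
```

Because a question about treatment never touches the "made_by" partition, the search space shrinks with the number of partitions left inactive, which is the efficiency argument the abstract makes.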
Graphs serve as versatile data structures in numerous real-world domains, including social networks, molecular biology, and knowledge graphs, by capturing intricate relational information among entities. Among graph-based learning techniques, Graph Contrastive Learning (GCL) has gained significant attention for its ability to derive robust, self-supervised graph representations through the contrasting of positive and negative sample pairs. However, a critical challenge lies in ensuring high-quality positive pairs so that the intrinsic semantic and structural properties of the original graph are preserved rather than distorted. To address this issue, we propose SRGCL (Self-Reinforced Graph Contrastive Learning), a novel framework that leverages the model's own encoder to dynamically evaluate and select high-quality positive pairs. We designed a unified positive pair generator employing multiple augmentation strategies, and a selector guided by the manifold hypothesis to maintain the underlying geometry of the latent space. By adopting a probabilistic mechanism for selecting positive pairs, SRGCL iteratively refines its assessment of pair quality as the encoder's representational power improves. Extensive experiments on diverse graph-level classification tasks demonstrate that SRGCL, as a plug-in module, consistently outperforms state-of-the-art GCL methods, underscoring its adaptability and efficacy across various domains.
图在包括社交网络、分子生物学和知识图谱在内的众多现实世界领域中作为多功能数据结构,通过捕捉实体之间的复杂关系信息而发挥作用。在基于图的机器学习技术中,图对比学习(GCL)因其能够通过正负样本对的对比来获得鲁棒且自监督的图表示而备受关注。然而,一个关键挑战在于确保高质量的正样本对,以便保留原始图的基本语义和结构特性而不使其扭曲。为了解决这个问题,我们提出了SRGCL(自我强化图对比学习),这是一种新型框架,它利用模型自身的编码器来动态评估和选择高质量的正样本对。我们设计了一个统一的正样本生成器,采用多种增强策略,并使用流形假设指导的选样器来保持潜在空间的基本几何特性。通过采用概率机制来选择正样本对,SRGCL可以随着编码器表达能力的提高而迭代地改进其对配对质量的评估。在各种图级分类任务上的广泛实验表明,作为插件模块的SRGCL始终优于最先进的GCL方法,证明了其跨各个领域的适应性和有效性。
https://arxiv.org/abs/2505.13650
A mathematical knowledge graph (KG) presents knowledge within the field of mathematics in a structured manner. Constructing a math KG using natural language is an essential but challenging task. There are two major limitations of existing works: first, they are constrained by corpus completeness, often discarding or manually supplementing incomplete knowledge; second, they typically fail to fully automate the integration of diverse knowledge sources. This paper proposes AutoMathKG, a high-quality, wide-coverage, and multi-dimensional math KG capable of automatic updates. AutoMathKG regards mathematics as a vast directed graph composed of Definition, Theorem, and Problem entities, with their reference relationships as edges. It integrates knowledge from ProofWiki, textbooks, arXiv papers, and TheoremQA, enhancing entities and relationships with large language models (LLMs) via in-context learning for data augmentation. To search for similar entities, MathVD, a vector database, is built through two designed embedding strategies using SBERT. To support automatic updates, two mechanisms are proposed. For the knowledge completion mechanism, Math LLM is developed to interact with AutoMathKG, providing missing proofs or solutions. For the knowledge fusion mechanism, MathVD is used to retrieve similar entities, and an LLM is used to determine whether to merge with a candidate or add as a new entity. A wide range of experiments demonstrate the advanced performance and broad applicability of the AutoMathKG system, including superior reachability query results in MathVD compared to five baselines and robust mathematical reasoning capability in Math LLM.
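A mathematical knowledge graph (KG) presents knowledge in the field of mathematics in a structured way. Building a math KG using natural language is an important yet challenging task. Existing work has two main limitations: first, it is constrained by corpus completeness and often discards or manually supplements incomplete knowledge; second, it usually cannot fully automate the integration of different knowledge sources. This paper proposes AutoMathKG, a high-quality, wide-coverage, multi-dimensional math KG that supports automatic updates. AutoMathKG treats mathematics as a vast directed graph made up of Definition, Theorem, and Problem entities, with the reference relationships among these entities as edges. It integrates knowledge from ProofWiki, textbooks, arXiv papers, and TheoremQA, and uses large language models (LLMs) with in-context learning to augment the data and improve the quality of entities and their relationships. To search for similar entities, a vector database named MathVD is built using two embedding strategies designed around SBERT. Two mechanisms are proposed for automatic updates: for knowledge completion, a Math LLM is developed that interacts with AutoMathKG to supply missing proofs or solutions; for knowledge fusion, MathVD retrieves similar entities, and an LLM decides whether to merge the new entity into an existing candidate or add it as a new entity. Extensive experiments demonstrate the advanced performance and broad applicability of the AutoMathKG system, including reachability query results in MathVD that surpass five baselines and strong reasoning capability in Math LLM.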
https://arxiv.org/abs/2505.13406
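The knowledge fusion mechanism in the AutoMathKG entry above (retrieve similar entities from a vector store, then let an LLM decide between merging and adding) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the in-memory dictionary stands in for MathVD, the `should_merge` callback stands in for the LLM judgment, and `top_k`, `min_sim`, and all function names are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_similar(query: Vector, store: Dict[str, Vector],
                     top_k: int = 3, min_sim: float = 0.8) -> List[Tuple[str, float]]:
    """Return up to top_k stored entities whose similarity to query is >= min_sim."""
    scored = [(name, cosine(query, vec)) for name, vec in store.items()]
    scored = [pair for pair in scored if pair[1] >= min_sim]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def fuse(name: str, vec: Vector, store: Dict[str, Vector],
         should_merge: Callable[[str, str], bool]) -> str:
    """Merge the entity into a retrieved candidate, or add it as a new entity.

    should_merge(new_name, candidate_name) is a placeholder for the LLM decision.
    Returns the name of the entity that now represents the input.
    """
    for candidate, _similarity in retrieve_similar(vec, store):
        if should_merge(name, candidate):
            return candidate  # merged: the existing entity absorbs the new one
    store[name] = vec  # no acceptable candidate: add as a new entity
    return name


# Toy two-dimensional embeddings (assumed values, not SBERT output).
store = {"Pythagorean theorem": [1.0, 0.0], "Mean value theorem": [0.0, 1.0]}
always_merge = lambda new, old: True  # stub for the LLM judgment
merged_into = fuse("Pythagoras' theorem", [0.99, 0.05], store, always_merge)
added_as = fuse("Rolle's theorem", [0.7, 0.7], store, always_merge)
```

Here the first entity is close enough to an existing one to be merged, while the second falls below the similarity threshold for every stored entity and is added as new, so the store grows from two entries to three.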
Taxonomies are hierarchical knowledge graphs crucial for recommendation systems and web applications. As data grows, expanding taxonomies is essential, but existing methods face key challenges: (1) discriminative models struggle with representation limits and generalization, while (2) generative methods either process all candidates at once, introducing noise and exceeding context limits, or discard relevant entities by selecting noisy candidates. We propose LORex ($\textbf{L}$ineage-$\textbf{O}$riented $\textbf{Re}$asoning for Taxonomy E$\textbf{x}$pansion), a plug-and-play framework that combines discriminative ranking and generative reasoning for efficient taxonomy expansion. Unlike prior methods, LORex ranks candidate terms and chunks them into batches, filtering noise and iteratively refining selections by reasoning over the candidates' hierarchy to ensure contextual efficiency. Extensive experiments across four benchmarks and twelve baselines show that LORex improves accuracy by 12% and Wu & Palmer similarity by 5% over state-of-the-art methods.
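Taxonomies are hierarchical knowledge graphs that are vital to recommendation systems and web applications. As data grows, expanding taxonomies becomes essential, but existing methods face key challenges: (1) discriminative models are limited in representational capacity and generalization, while (2) generative methods either process all candidates at once, which introduces noise and exceeds context limits, or overlook relevant entities by selecting noisy candidates. We propose LORex ($\textbf{L}$ineage-$\textbf{O}$riented $\textbf{Re}$asoning for Taxonomy E$\textbf{x}$pansion), a plug-and-play framework that combines discriminative ranking with generative reasoning to expand taxonomies efficiently. Unlike earlier methods, LORex ranks candidate terms and groups them into batches, filters out noise, and iteratively refines its selections by considering the candidates' hierarchical relationships, which keeps the use of context efficient. Extensive experiments on four benchmarks against twelve baselines show that LORex improves accuracy by 12% and Wu & Palmer similarity by 5% over state-of-the-art methods.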
https://arxiv.org/abs/2505.13282
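The rank-and-chunk step in the LORex entry above can be sketched in a few lines. This is an illustration under assumptions rather than the published algorithm: the scores stand in for a discriminative ranker's output, the `pick_parent` callback stands in for the generative reasoning step, and `batch_size` and the function names are hypothetical. The point shown is that the generative step only ever sees one small, high-ranked batch at a time instead of every candidate at once.

```python
from typing import Callable, Dict, List, Optional


def rank_and_chunk(scores: Dict[str, float], batch_size: int) -> List[List[str]]:
    """Sort candidate terms by descending score and split them into batches."""
    ranked = sorted(scores, key=lambda term: scores[term], reverse=True)
    return [ranked[i:i + batch_size] for i in range(0, len(ranked), batch_size)]


def expand(query: str, scores: Dict[str, float], batch_size: int,
           pick_parent: Callable[[str, List[str]], Optional[str]]) -> Optional[str]:
    """Walk the batches from best to worst until the reasoner accepts a parent.

    pick_parent(query, batch) is a placeholder for the generative model; it
    returns a chosen parent term from the batch, or None to reject the batch.
    """
    for batch in rank_and_chunk(scores, batch_size):
        choice = pick_parent(query, batch)
        if choice is not None:
            return choice
    return None


# Toy ranker scores for attaching the query term "apple" (assumed values).
scores = {"fruit": 0.9, "vehicle": 0.1, "food": 0.7, "tool": 0.2}
batches = rank_and_chunk(scores, batch_size=2)

# Stub reasoner: accepts "fruit" whenever it appears in the batch.
accept_fruit = lambda query, batch: "fruit" if "fruit" in batch else None
parent = expand("apple", scores, 2, accept_fruit)
```

With a batch size of 2, the stub reasoner finds its answer in the first batch and the low-scoring candidates are never sent to it, which is how batching limits both noise and context length.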
The exponential growth of scientific literature presents significant challenges for researchers navigating the complex knowledge landscape. We propose "Agentic Publications", a novel LLM-driven framework that complements traditional publishing by transforming papers into interactive knowledge systems. Our architecture integrates structured data with unstructured content through retrieval-augmented generation and multi-agent verification. The framework offers interfaces for both humans and machines, combining narrative explanations with machine-readable outputs while addressing ethical considerations through automated validation and transparent governance. Key features include continuous knowledge updates, automatic integration of new findings, and customizable detail levels. Our proof-of-concept demonstrates multilingual interaction, API accessibility, and structured knowledge representation through vector databases, knowledge graphs, and verification agents. This approach enhances scientific communication across disciplines, improving efficiency and collaboration while preserving traditional publishing pathways. It is particularly valuable for interdisciplinary fields, where knowledge integration remains challenging.
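The exponential growth of scientific literature poses major challenges for researchers navigating a complex knowledge landscape. We propose "Agentic Publications", a novel LLM-driven framework that complements traditional publishing by turning papers into interactive knowledge systems. Our architecture combines structured data with unstructured content through retrieval-augmented generation and multi-agent verification. The framework provides interfaces for both humans and machines, pairing narrative explanations with machine-readable outputs, and addresses ethical concerns through automated validation and transparent governance. Key features include continuous knowledge updates, automatic integration of new findings, and customizable levels of detail. Our proof of concept demonstrates multilingual interaction, API accessibility, and structured knowledge representation through vector databases, knowledge graphs, and verification agents. This approach strengthens scientific communication across disciplines, improves efficiency and collaboration, and preserves traditional publishing pathways; it is especially valuable in interdisciplinary fields where knowledge integration remains challenging.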
https://arxiv.org/abs/2505.13246
Natural medicines, particularly Traditional Chinese Medicine (TCM), are gaining global recognition for their therapeutic potential in addressing human symptoms and diseases. TCM, with its systematic theories and extensive practical experience, provides abundant resources for healthcare. However, the effective application of TCM requires precise syndrome diagnosis, determination of treatment principles, and prescription formulation, which demand decades of clinical expertise. Despite advancements in TCM-based decision systems, machine learning, and deep learning research, limitations in data and single-objective constraints hinder their practical application. In recent years, large language models (LLMs) have demonstrated potential in complex tasks, but they lack specialization in TCM and face significant challenges, such as model scales too large to deploy and issues with hallucination. To address these challenges, we introduce Tianyi, a 7.6-billion-parameter LLM of appropriate scale designed specifically for TCM, pre-trained and fine-tuned on diverse TCM corpora, including classical texts, expert treatises, clinical records, and knowledge graphs. Tianyi is designed to assimilate interconnected and systematic TCM knowledge through progressive learning. Additionally, we establish TCMEval, a comprehensive evaluation benchmark, to assess LLMs in TCM examinations, clinical tasks, domain-specific question-answering, and real-world trials. The extensive evaluations demonstrate the significant potential of Tianyi as an AI assistant in TCM clinical practice and research, bridging the gap between TCM knowledge and practical application.
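Natural medicines, especially Traditional Chinese Medicine (TCM), are gradually gaining global recognition for their potential in treating human symptoms and diseases. With its systematic body of theory and rich practical experience, TCM offers abundant resources for healthcare. However, applying TCM effectively requires precise syndrome diagnosis, determination of treatment principles, and prescription formulation, all of which call for decades of accumulated clinical experience. Although TCM-based decision systems, machine learning, and deep learning research have advanced, insufficient data and single-objective constraints still limit their practical use. In recent years, large language models (LLMs) have shown great potential on complex tasks, but they lack specialized training in TCM and face problems such as model scales too large to deploy and hallucinated output. To address these problems, we introduce Tianyi, a 7.6-billion-parameter LLM designed specifically for TCM and pre-trained and fine-tuned on diverse TCM corpora, including classical texts, expert treatises, clinical records, and knowledge graphs. Through progressive learning, Tianyi absorbs and integrates systematic and interconnected TCM knowledge. In addition, we establish TCMEval, a comprehensive evaluation benchmark for assessing LLMs on TCM examinations, clinical tasks, domain-specific question answering, and real-world trials. Extensive evaluation of Tianyi shows its considerable potential as an AI assistant for TCM clinical practice and research, helping to bridge the gap between TCM knowledge and practical application.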
https://arxiv.org/abs/2505.13156