Knowledge Graph Completion (KGC) fundamentally hinges on the coherent fusion of pre-trained entity semantics with heterogeneous topological structures to facilitate robust relational reasoning. However, existing paradigms encounter a critical "structural resolution mismatch," failing to reconcile divergent representational demands across varying graph densities, which precipitates structural noise interference in dense clusters and catastrophic representation collapse in sparse regions. We present SynergyKGC, an adaptive framework that advances traditional neighbor aggregation to an active Cross-Modal Synergy Expert via relation-aware cross-attention and semantic-intent-driven gating. By coupling a density-dependent Identity Anchoring strategy with a Double-tower Coherent Consistency architecture, SynergyKGC effectively reconciles topological heterogeneity while ensuring representational stability across training and inference phases. Systematic evaluations on two public benchmarks validate the superiority of our method in significantly boosting KGC hit rates, providing empirical evidence for a generalized principle of resilient information integration in non-homogeneous structured data.
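The density-dependent gating idea can be sketched minimally. Below, a sigmoid over log-degree decides how much an entity trusts aggregated structure versus its pre-trained semantics; the weights `w`, `b` and the simple convex blend are illustrative assumptions, not SynergyKGC's actual relation-aware cross-attention mechanism.

```python
import math

def density_gate(degree: int, w: float = 1.0, b: float = -2.0) -> float:
    """Sigmoid gate on log-degree: low-degree entities lean on their
    pre-trained semantic embedding, high-degree ones admit more structure."""
    return 1.0 / (1.0 + math.exp(-(w * math.log1p(degree) + b)))

def fuse(semantic, structural, degree):
    """Convex blend of the semantic vector and the aggregated structural vector."""
    a = density_gate(degree)
    return [a * s + (1.0 - a) * e for e, s in zip(semantic, structural)]

# An isolated entity (degree 0) keeps most of its semantic embedding,
# while a hub (degree 1000) relies mostly on aggregated structure.
print(round(density_gate(0), 3), round(density_gate(1000), 3))
```

In this toy form, sparse entities are anchored to their identity (semantic) embedding, avoiding representation collapse, while dense hubs can still exploit structure.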
https://arxiv.org/abs/2602.10845
Software vulnerability detection (SVD) is a critical challenge in modern systems. Large language models (LLMs) offer natural-language explanations alongside predictions, but most work focuses on binary evaluation, and explanations often lack semantic consistency with Common Weakness Enumeration (CWE) categories. We propose VulReaD, a knowledge-graph-guided approach for vulnerability reasoning and detection that moves beyond binary classification toward CWE-level reasoning. VulReaD leverages a security knowledge graph (KG) as a semantic backbone and uses a strong teacher LLM to generate CWE-consistent contrastive reasoning supervision, enabling student model training without manual annotations. Student models are fine-tuned with Odds Ratio Preference Optimization (ORPO) to encourage taxonomy-aligned reasoning while suppressing unsupported explanations. Across three real-world datasets, VulReaD improves binary F1 by 8-10% and multi-class classification by 30% in Macro-F1 and 18% in Micro-F1 compared to state-of-the-art baselines. Results show that LLMs outperform deep learning baselines in binary detection and that KG-guided reasoning enhances CWE coverage and interpretability.
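ORPO's core preference term can be sketched in a few lines. Following the published ORPO formulation, the penalty is the negative log-sigmoid of the log odds ratio between the chosen and rejected responses; the scalar sequence probabilities below are illustrative stand-ins for model likelihoods.

```python
import math

def odds(p: float) -> float:
    """Odds of a sequence probability p under the policy."""
    return p / (1.0 - p)

def orpo_penalty(p_chosen: float, p_rejected: float) -> float:
    """ORPO's relative odds-ratio term: -log sigmoid(log odds ratio).
    Small when the chosen (taxonomy-aligned) explanation is far more
    likely than the rejected (unsupported) one."""
    log_or = math.log(odds(p_chosen)) - math.log(odds(p_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-log_or)))

# Preferring the CWE-consistent rationale (p=0.6) over an unsupported
# one (p=0.2) yields a small penalty; reversing them inflates it.
print(round(orpo_penalty(0.6, 0.2), 3), round(orpo_penalty(0.2, 0.6), 3))
```

In full ORPO this term is added to the standard supervised fine-tuning loss, so preference alignment and fitting the chosen response happen in one pass.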
https://arxiv.org/abs/2602.10787
Solid State Drives (SSDs) are critical to datacenters, consumer platforms, and mission-critical systems. Yet diagnosing their performance and reliability is difficult because data are fragmented and time-disjoint, and existing methods demand large datasets and expert input while offering only limited insights. Degradation arises not only from shifting workloads and evolving architectures but also from environmental factors such as temperature, humidity, and vibration. We present KORAL, a knowledge-driven reasoning framework that integrates Large Language Models (LLMs) with a structured Knowledge Graph (KG) to generate insights into SSD operations. Unlike traditional approaches that require extensive expert input and large datasets, KORAL generates a Data KG from fragmented telemetry and integrates a Literature KG that already organizes knowledge from literature, reports, and traces. This turns unstructured sources into a queryable graph and telemetry into structured knowledge, and both graphs guide the LLM to deliver evidence-based, explainable analysis aligned with the domain vocabulary and constraints. Evaluation using real production traces shows that KORAL delivers expert-level diagnosis and recommendations, supported by grounded explanations that improve reasoning transparency, guide operator decisions, reduce manual effort, and provide actionable insights to improve service quality. To our knowledge, this is the first end-to-end system that combines LLMs and KGs for full-spectrum SSD reasoning, including Descriptive, Predictive, Prescriptive, and What-if analysis. We release the generated SSD-specific KG to advance reproducible research in knowledge-based storage system analysis. GitHub Repository: this https URL
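Building a Data KG from fragmented telemetry amounts, at its simplest, to emitting (subject, predicate, object) triples per record. A minimal sketch, with hypothetical field names that are not KORAL's actual schema:

```python
def telemetry_to_triples(record: dict) -> list:
    """Flatten one SMART-style telemetry record into KG triples,
    keyed by the drive's serial number."""
    drive = record["serial"]
    return [(drive, attr, value)
            for attr, value in record.items() if attr != "serial"]

# Hypothetical telemetry record for illustration.
sample = {"serial": "SSD-042", "temperature_c": 61, "media_wearout": 0.83}
triples = telemetry_to_triples(sample)
print(triples)
```

Once telemetry lives as triples, it can be merged with a Literature KG and queried uniformly, which is the property the framework relies on to ground the LLM's analysis.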
https://arxiv.org/abs/2602.10246
Temporal Knowledge Graph (TKG) reasoning seeks to predict future missing facts from historical evidence. While diffusion models (DM) have recently gained attention for their ability to capture complex predictive distributions, two gaps remain: (i) the generative path is conditioned only on positive evidence, overlooking informative negative context, and (ii) training objectives are dominated by cross-entropy ranking, which improves candidate ordering but provides little supervision over the calibration of the denoised embedding. To bridge these gaps, we introduce the Negative-Aware Diffusion model for TKG Extrapolation (NADEx). Specifically, NADEx encodes subject-centric histories of entities, relations, and temporal intervals into sequential embeddings. NADEx perturbs the query object in the forward process and reconstructs it in reverse with a Transformer denoiser conditioned on the temporal-relational context. We further introduce a cosine-alignment regularizer derived from batch-wise negative prototypes, which tightens the decision boundary against implausible candidates. Comprehensive experiments on four public TKG benchmarks demonstrate that NADEx delivers state-of-the-art performance.
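The negative-prototype regularizer can be illustrated with a small sketch: take the mean of in-batch negative embeddings as a prototype and penalize cosine alignment of the denoised embedding with it. This is a simplified stand-in for NADEx's formulation, which operates on learned embeddings inside the diffusion loop.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def neg_prototype(negatives):
    """Batch-wise prototype: the mean of negative candidate embeddings."""
    n = len(negatives)
    return [sum(col) / n for col in zip(*negatives)]

def alignment_penalty(denoised, negatives):
    """Penalize the denoised query embedding for pointing toward the
    negative prototype; minimizing this pushes the decision boundary
    away from implausible candidates."""
    return max(0.0, cosine(denoised, neg_prototype(negatives)))

negs = [[1.0, 0.0], [0.8, 0.2]]
print(alignment_penalty([1.0, 0.1], negs) > alignment_penalty([-1.0, 0.5], negs))
```

An embedding that points toward the cluster of negatives pays a penalty; one that points away pays nothing, which is the extra supervision signal cross-entropy ranking alone does not provide.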
https://arxiv.org/abs/2602.08815
Vessel trajectory data from the Automatic Identification System (AIS) is used widely in maritime analytics. Yet, analysis is difficult for non-expert users due to the incompleteness and complexity of AIS data. We present CLEAR, a knowledge-centric vessel trajectory analysis platform that aims to overcome these barriers. By leveraging the reasoning and generative capabilities of Large Language Models (LLMs), CLEAR transforms raw AIS data into complete, interpretable, and easily explorable vessel trajectories through a Structured Data-derived Knowledge Graph (SD-KG). As part of the demo, participants can configure parameters to automatically download and process AIS data, observe how trajectories are completed and annotated, inspect both raw and imputed segments together with their SD-KG evidence, and interactively explore the SD-KG through a dedicated graph viewer, gaining an intuitive and transparent understanding of vessel movements.
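As a baseline intuition for trajectory completion, a missing fix between two AIS reports can be linearly interpolated in time. CLEAR itself uses LLM reasoning over the SD-KG rather than this naive scheme; the snippet is only a sketch of what "imputed segments" means.

```python
def impute_gap(p0, p1, t):
    """Linearly interpolate a missing AIS fix between two reports
    (lat, lon, timestamp) at query time t."""
    (lat0, lon0, t0), (lat1, lon1, t1) = p0, p1
    w = (t - t0) / (t1 - t0)
    return (lat0 + w * (lat1 - lat0), lon0 + w * (lon1 - lon0))

# A vessel reported at t=0s and t=3600s; estimate its position at t=1800s.
print(impute_gap((55.0, 12.0, 0), (55.2, 12.4, 3600), 1800))
```

Exposing both raw and imputed segments side by side, as the demo does, lets users judge exactly which parts of a trajectory are observed versus reconstructed.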
https://arxiv.org/abs/2602.08482
Graph-RAG improves LLM reasoning using structured knowledge, yet conventional designs rely on a centralized knowledge graph. In distributed and access-restricted settings (e.g., hospitals or multinational organizations), retrieval must select relevant domains and appropriate traversal depth without global graph visibility or exhaustive querying. To address this challenge, we introduce SCOUT-RAG (Scalable and COst-efficient Unifying Traversal), a distributed agentic Graph-RAG framework that performs progressive cross-domain retrieval guided by incremental utility goals. SCOUT-RAG employs four cooperative agents that: (i) estimate domain relevance, (ii) decide when to expand retrieval to additional domains, (iii) adapt traversal depth to avoid unnecessary graph exploration, and (iv) synthesize high-quality answers. The framework is designed to minimize retrieval regret, defined as missing useful domain information, while controlling latency and API cost. Across multi-domain knowledge settings, SCOUT-RAG achieves performance comparable to centralized baselines, including DRIFT and exhaustive domain traversal, while substantially reducing cross-domain calls, total tokens processed, and latency.
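The expand-or-stop decision can be sketched as a loop over relevance-ranked domains with a marginal-utility cutoff. The scores, the `epsilon` threshold, and the utility function below are placeholders; in SCOUT-RAG these quantities are estimated dynamically by the cooperating agents.

```python
def progressive_retrieve(domains, utility, epsilon=0.05):
    """Query domains in relevance order, expanding only while the
    estimated relevance of the next domain exceeds epsilon."""
    gathered, total = [], 0.0
    for name, score in sorted(domains.items(), key=lambda kv: -kv[1]):
        if score < epsilon and gathered:
            break  # stop: expected gain no longer justifies another call
        gathered.append(name)
        total += utility(name)
    return gathered, total

# Hypothetical relevance estimates for three hospital departments.
domains = {"cardiology": 0.8, "pharmacy": 0.3, "billing": 0.01}
picked, _ = progressive_retrieve(domains, lambda d: 1.0)
print(picked)
```

Stopping early on low-relevance domains is what trades a small risk of retrieval regret for large savings in cross-domain calls and tokens.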
https://arxiv.org/abs/2602.08400
Temporal knowledge graphs (TKGs) structurally preserve evolving human knowledge. Recent research has focused on designing models that learn the evolutionary nature of TKGs to predict future facts, achieving impressive results, such as Hits@10 scores over 0.9 on the YAGO dataset. However, we find that existing benchmarks inadvertently introduce a shortcut: near state-of-the-art performance can be achieved simply by counting co-occurrences, without using any temporal information. In this work, we examine the root cause of this issue, identifying inherent biases in current datasets and an oversimplified evaluation task that these biases can exploit. Through this analysis, we further uncover additional limitations of existing benchmarks, including unreasonable formatting of time-interval knowledge, neglect of knowledge obsolescence, and insufficient information for precisely understanding evolution, all of which can amplify the shortcut and hinder fair assessment. Therefore, we introduce the TKG evolution benchmark. It includes four bias-corrected datasets and two novel tasks closely aligned with the evolution process, promoting a more accurate understanding of the challenges in TKG evolution modeling. The benchmark is available at: this https URL.
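The shortcut the authors identify is easy to reproduce in miniature: rank candidate objects purely by how often they co-occurred with the query (subject, relation) pair in training, ignoring timestamps entirely. The toy quadruples below are illustrative.

```python
from collections import Counter

def cooccurrence_baseline(train_quads, query):
    """Rank candidate objects for (s, r, ?) purely by how often each
    object co-occurred with (s, r) in training -- no timestamps used."""
    s, r = query
    counts = Counter(o for (qs, qr, o, _t) in train_quads if (qs, qr) == (s, r))
    return [o for o, _ in counts.most_common()]

train = [("Obama", "visits", "France", 2014),
         ("Obama", "visits", "France", 2015),
         ("Obama", "visits", "Japan", 2016)]
print(cooccurrence_baseline(train, ("Obama", "visits")))
```

If a benchmark's test answers mostly repeat frequent training co-occurrences, this temporally blind baseline scores near the state of the art, which is exactly the bias the corrected datasets aim to remove.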
https://arxiv.org/abs/2602.08353
Large Language Models (LLMs) have shown strong potential in complex medical reasoning yet face diminishing gains under inference scaling laws. While existing studies augment LLMs with various knowledge types, it remains unclear how effectively the additional cost translates into accuracy gains. In this paper, we explore how the meta-cognition of LLMs, i.e., their self-awareness of their own knowledge states, can regulate the reasoning process. Specifically, we propose MedCoG, a Medical Meta-Cognition Agent with Knowledge Graph, where meta-cognitive assessments of task complexity, familiarity, and knowledge density dynamically regulate the utilization of procedural, episodic, and factual knowledge. This LLM-centric, on-demand reasoning aims to mitigate the scaling bottleneck by (1) reducing cost through avoiding indiscriminate scaling, and (2) improving accuracy through filtering out distracting knowledge. To validate this, we empirically characterize the scaling curve and introduce inference density, defined as the ratio of theoretically effective cost to actual cost, to quantify inference efficiency. Experiments demonstrate the effectiveness and efficiency of MedCoG on five hard sets of medical benchmarks, yielding a 5.5x inference density. Furthermore, an Oracle study highlights the significant potential of meta-cognitive regulation.
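The inference density metric is a simple ratio; the token counts below are made up purely to show how an on-demand policy compares against indiscriminate scaling.

```python
def inference_density(effective_cost: float, actual_cost: float) -> float:
    """Inference density: theoretically effective cost over actual cost.
    Higher means fewer wasted tokens per unit of accuracy-relevant work."""
    return effective_cost / actual_cost

# If only 1,100 of 6,000 spent tokens were theoretically necessary for a
# given accuracy target, indiscriminate scaling has density ~0.18, while
# an on-demand policy hitting the target at 1,100 tokens scores 1.0.
print(round(inference_density(1100, 6000), 2))
```

The paper's 5.5x figure is the relative improvement in this ratio, i.e., MedCoG wastes a correspondingly smaller fraction of its inference budget than the baselines.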
https://arxiv.org/abs/2602.07905
Large Language Models (LLMs) promise to accelerate discovery by reasoning across the expanding scientific landscape. Yet the challenge is no longer access to information but connecting it in meaningful, domain-spanning ways. In materials science, where innovation demands integrating concepts from molecular chemistry to mechanical performance, this is especially acute. Neither humans nor single-agent LLMs can fully contend with this torrent of information, with the latter often prone to hallucinations. To address this bottleneck, we introduce a multi-agent framework guided by large-scale knowledge graphs to find sustainable substitutes for per- and polyfluoroalkyl substances (PFAS), chemicals currently under intense regulatory scrutiny. Agents in the framework specialize in problem decomposition, evidence retrieval, design parameter extraction, and graph traversal, uncovering latent connections across distinct knowledge pockets to support hypothesis generation. Ablation studies show that the full multi-agent pipeline outperforms single-shot prompting, underscoring the value of distributed specialization and relational reasoning. We demonstrate that by tailoring graph traversal strategies, the system alternates between exploitative searches focusing on domain-critical outcomes and exploratory searches surfacing emergent cross-connections. Illustrated through the exemplar of biomedical tubing, the framework generates sustainable PFAS-free alternatives that balance tribological performance, thermal stability, chemical resistance, and biocompatibility. This work establishes a framework combining knowledge graphs with multi-agent reasoning to expand the materials design space, showcasing several initial design candidates to demonstrate the approach.
https://arxiv.org/abs/2602.07491
Predicting gene regulation responses to biological perturbations requires reasoning about underlying biological causalities. While large language models (LLMs) show promise for such tasks, they are often overwhelmed by the entangled nature of high-dimensional perturbation results. Moreover, recent works have primarily focused on genetic perturbations in single-cell experiments, leaving bulk-cell chemical perturbations, which are central to drug discovery, largely unexplored. Motivated by this, we present LINCSQA, a novel benchmark for predicting target gene regulation under complex chemical perturbations in bulk-cell environments. We further propose PBio-Agent, a multi-agent framework that integrates difficulty-aware task sequencing with iterative knowledge refinement. Our key insight is that genes affected by the same perturbation share causal structure, allowing confidently predicted genes to contextualize more challenging cases. The framework employs specialized agents enriched with biological knowledge graphs, while a synthesis agent integrates outputs and specialized judges ensure logical coherence. PBio-Agent outperforms existing baselines on both LINCSQA and PerturbQA, enabling even smaller models to predict and explain complex biological processes without additional training.
https://arxiv.org/abs/2602.07408
This study presents LIT-GRAPH (Literature Graph for Recommendation and Pedagogical Heuristics), a novel knowledge graph-based recommendation system designed to scaffold high school English teachers in selecting diverse, pedagogically aligned instructional literature. The system is built upon an ontology for English literature and addresses the challenge of curriculum stagnation. We compare four graph embedding paradigms: DeepWalk, Biased Random Walk (BRW), a Hybrid of concatenated DeepWalk and BRW vectors, and the deep Relational Graph Convolutional Network (R-GCN) model. Results reveal a critical divergence: while shallow models excelled in structural link prediction, R-GCN dominated semantic ranking. By leveraging relation-specific message passing, the deep model prioritizes pedagogical relevance over raw connectivity, resulting in superior, high-quality, domain-specific recommendations.
https://arxiv.org/abs/2602.07307
Despite initial successes and a variety of architectures, retrieval-augmented generation systems still struggle to reliably retrieve and connect the multi-step evidence required for complicated reasoning tasks. Most standard RAG frameworks treat all retrieved information as equally reliable, overlooking the varying credibility and interconnected nature of large textual corpora. GraphRAG approaches offer potential improvement to RAG systems by integrating knowledge graphs, which structure information into nodes and edges, capture entity relationships, and enable multi-step logical traversal. However, GraphRAG is not always an ideal solution, as it depends on high-quality graph representations of the corpus. Such representations usually rely on manually curated knowledge graphs, which are costly to construct and update, or on automated graph-construction pipelines that are often unreliable. Moreover, systems following this paradigm typically use large language models to guide graph traversal and evidence retrieval. In this paper, we propose a novel RAG framework that uses a spreading activation algorithm to retrieve information from a corpus of documents connected by an automatically constructed heterogeneous knowledge graph. This approach reduces reliance on semantic knowledge graphs, which are often incomplete due to information loss during information extraction, avoids LLM-guided graph traversal, and improves performance on multi-hop question answering. Experiments show that our method achieves better or comparable performance to several state-of-the-art RAG methods and can be integrated as a plug-and-play module with different iterative RAG pipelines. When combined with chain-of-thought iterative retrieval, it yields up to a 39% absolute improvement in answer correctness over naive RAG, while achieving these results with small open-weight language models.
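A generic spreading-activation pass fits in a few lines: seed nodes hold activation, each round pushes a decayed pulse to their neighbors, and nodes below a threshold are dropped. The paper's version runs over an automatically built heterogeneous KG; the graph, decay, and round count here are illustrative.

```python
def spread_activation(graph, seeds, decay=0.5, rounds=2, threshold=0.05):
    """Spreading activation: seed nodes fire, neighbors receive decayed
    activation, and the pulse propagates for a few rounds."""
    act = dict(seeds)
    for _ in range(rounds):
        pulse = {}
        for node, a in act.items():
            for nb in graph.get(node, []):
                pulse[nb] = pulse.get(nb, 0.0) + a * decay
        for nb, a in pulse.items():
            act[nb] = act.get(nb, 0.0) + a
    return {n: a for n, a in act.items() if a >= threshold}

graph = {"q": ["doc1", "doc2"], "doc1": ["doc3"], "doc2": [], "doc3": []}
result = spread_activation(graph, {"q": 1.0})
print(result)
```

Retrieval then ranks documents by final activation, so multi-hop evidence (here `doc3`, two hops from the query) is surfaced without an LLM steering the traversal.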
https://arxiv.org/abs/2512.15922
Knowledge Graph (KG) generation requires models to learn complex semantic dependencies between triples while maintaining domain validity constraints. Unlike link prediction, which scores triples independently, generative models must capture interdependencies across entire subgraphs to produce semantically coherent structures. We present ARK (Auto-Regressive Knowledge Graph Generation), a family of autoregressive models that generate KGs by treating graphs as sequences of (head, relation, tail) triples. ARK learns implicit semantic constraints directly from data, including type consistency, temporal validity, and relational patterns, without explicit rule supervision. On the IntelliGraphs benchmark, our models achieve 89.2% to 100.0% semantic validity across diverse datasets while generating novel graphs not seen during training. We also introduce SAIL, a variational extension of ARK that enables controlled generation through learned latent representations, supporting both unconditional sampling and conditional completion from partial graphs. Our analysis reveals that model capacity (hidden dimensionality >= 64) is more critical than architectural depth for KG generation, with recurrent architectures achieving comparable validity to transformer-based alternatives while offering substantial computational efficiency. These results demonstrate that autoregressive models provide an effective framework for KG generation, with practical applications in knowledge base completion and query answering.
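Treating a graph as a sequence of (head, relation, tail) triples is a simple serialization, which a round-trip sketch makes concrete; the token names and end marker below are illustrative, not ARK's actual vocabulary.

```python
def linearize(triples):
    """Serialize a subgraph into the flat token sequence an
    autoregressive model consumes: h r t h r t ... <eos>."""
    tokens = []
    for h, r, t in triples:
        tokens += [h, r, t]
    return tokens + ["<eos>"]

def delinearize(tokens):
    """Recover (head, relation, tail) triples from a generated sequence."""
    body = tokens[:tokens.index("<eos>")]
    return [tuple(body[i:i + 3]) for i in range(0, len(body), 3)]

g = [("mars", "orbits", "sun"), ("phobos", "orbits", "mars")]
assert delinearize(linearize(g)) == g
print(linearize(g))
```

Under this view, generating a semantically valid KG reduces to next-token prediction over such sequences, with type and temporal constraints absorbed implicitly from the training distribution.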
https://arxiv.org/abs/2602.06707
Recent progress in large-scale CLIP-like vision-language models (VLMs) has greatly advanced medical image analysis. However, most existing medical VLMs still rely on coarse image-text contrastive objectives and fail to capture the systematic visual knowledge encoded in well-defined medical phenotype ontologies. To address this gap, we construct PhenoKG, the first large-scale, phenotype-centric multimodal knowledge graph that encompasses over 520K high-quality image-text pairs linked to more than 3,000 phenotypes. Building upon PhenoKG, we propose PhenoLIP, a novel pretraining framework that explicitly incorporates structured phenotype knowledge into medical VLMs through a two-stage process. We first learn a knowledge-enhanced phenotype embedding space from textual ontology data and then distill this structured knowledge into multimodal pretraining via a teacher-guided knowledge distillation objective. To support evaluation, we further introduce PhenoBench, an expert-verified benchmark designed for phenotype recognition, comprising over 7,800 image-caption pairs covering more than 1,000 phenotypes. Extensive experiments demonstrate that PhenoLIP outperforms previous state-of-the-art baselines, improving upon BiomedCLIP in phenotype classification accuracy by 8.85% and upon BIOMEDICA in cross-modal retrieval by 15.03%, underscoring the value of integrating phenotype-centric priors into medical VLMs for structured and interpretable medical image understanding.
https://arxiv.org/abs/2602.06184
Temporal knowledge graph question answering (TKGQA) aims to answer time-sensitive questions by leveraging temporal knowledge bases. While Large Language Models (LLMs) demonstrate significant potential in TKGQA, current prompting strategies constrain their efficacy in two primary ways. First, they are prone to reasoning hallucinations under complex temporal constraints. Second, static prompting limits model autonomy and generalization, as it lacks optimization through dynamic interaction with temporal knowledge graph (TKG) environments. To address these limitations, we propose TKG-Thinker, a novel agent equipped with autonomous planning and adaptive retrieval capabilities for reasoning over TKGs. Specifically, TKG-Thinker performs in-depth temporal reasoning through dynamic multi-turn interactions with TKGs via a dual-training strategy. We first apply Supervised Fine-Tuning (SFT) with chain-of-thought data to instill core planning capabilities, followed by a Reinforcement Learning (RL) stage that leverages multi-dimensional rewards to refine reasoning policies under intricate temporal constraints. Experimental results on benchmark datasets with three open-source LLMs show that TKG-Thinker achieves state-of-the-art performance and exhibits strong generalization across complex TKGQA settings.
https://arxiv.org/abs/2602.05818
Large Language Models (LLMs) excel at language understanding but remain limited in knowledge-intensive domains due to hallucinations, outdated information, and limited explainability. Text-based retrieval-augmented generation (RAG) helps ground model outputs in external sources but struggles with multi-hop reasoning. Knowledge Graphs (KGs), in contrast, support precise, explainable querying, yet require knowledge of a query language. This work introduces an interactive framework in which LLMs generate and explain Cypher graph queries and users iteratively refine them through natural language. Applied to real-world KGs, the framework improves accessibility to complex datasets while preserving factual accuracy and semantic rigor, and provides insight into how model performance varies across domains. Our core quantitative evaluation is a 90-query benchmark on a synthetic movie KG that measures query explanation quality and fault detection across multiple LLMs, complemented by two smaller real-life query-generation experiments on a Hyena KG and the MaRDI (Mathematical Research Data Initiative) KG.
https://arxiv.org/abs/2602.05512
Knowledge graphs (KGs) have become a key ingredient supporting a variety of applications. Beyond the traditional triplet representation of facts where a relation connects two entities, modern KGs observe an increasing number of hyper-relational facts, where an arbitrary number of qualifiers associated with a triplet provide auxiliary information to further describe the rich semantics of the triplet, which can effectively boost the reasoning performance in link prediction tasks. However, existing link prediction techniques over such hyper-relational KGs (HKGs) mostly focus on a transductive setting, where KG embedding models are learned from the specific vocabulary of a given KG and subsequently can only make predictions within the same vocabulary, limiting their generalizability to previously unseen vocabularies. Against this background, we propose THOR, an inducTive link prediction technique for Hyper-relational knOwledge gRaphs. Specifically, we first introduce both relation and entity foundation graphs, modeling their fundamental inter- and intra-fact interactions in HKGs, which are agnostic to any specific relations and entities. Afterward, THOR is designed to learn from the two foundation graphs with two parallel graph encoders followed by a transformer decoder, which supports efficient masked training and fully-inductive inference. We conduct a thorough evaluation of THOR in hyper-relational link prediction tasks on 12 datasets with different settings. Results show that THOR outperforms a sizable collection of baselines, yielding 66.1%, 55.9%, and 20.4% improvement over the best-performing rule-based, semi-inductive, and fully-inductive techniques, respectively. A series of ablation studies also reveals our key design factors capturing the structural invariance transferable across HKGs for inductive tasks.
https://arxiv.org/abs/2602.05424
Retrieval augmented generation (RAG) has enhanced large language models by enabling access to external knowledge, with graph-based RAG emerging as a powerful paradigm for structured retrieval and reasoning. However, existing graph-based methods often over-rely on surface-level node matching and lack explicit causal modeling, leading to unfaithful or spurious answers. Prior attempts to incorporate causality are typically limited to local or single-document contexts and also suffer from information isolation that arises from modular graph structures, which hinders scalability and cross-module causal reasoning. To address these challenges, we propose HugRAG, a framework that rethinks knowledge organization for graph-based RAG through causal gating across hierarchical modules. HugRAG explicitly models causal relationships to suppress spurious correlations while enabling scalable reasoning over large-scale knowledge graphs. Extensive experiments demonstrate that HugRAG consistently outperforms competitive graph-based RAG baselines across multiple datasets and evaluation metrics. Our work establishes a principled foundation for structured, scalable, and causally grounded RAG systems.
https://arxiv.org/abs/2602.05143
Large language models (LLMs) rarely admit uncertainty, often producing fluent but misleading answers rather than abstaining (i.e., refusing to answer). This weakness is especially evident in temporal question answering, where models frequently ignore time-sensitive evidence and conflate facts across different time periods. In this paper, we present the first empirical study of training LLMs with an abstention ability while reasoning about temporal QA. Existing approaches such as calibration can be unreliable in capturing uncertainty in complex reasoning. We instead frame abstention as a teachable skill and introduce a pipeline that couples Chain-of-Thought (CoT) supervision with Reinforcement Learning (RL) guided by abstention-aware rewards. Our goal is to systematically analyze how different information types and training techniques affect temporal reasoning with abstention behavior in LLMs. Through extensive experiments across these methods, we find that RL yields strong empirical gains on reasoning: a model initialized from Qwen2.5-1.5B-Instruct surpasses GPT-4o by $3.46\%$ and $5.80\%$ in Exact Match on TimeQA-Easy and TimeQA-Hard, respectively. Moreover, it improves the True Positive rate on unanswerable questions by $20\%$ over a purely supervised fine-tuned (SFT) variant. Beyond performance, our analysis shows that SFT induces overconfidence and harms reliability, while RL improves prediction accuracy but exhibits similar risks. Finally, by comparing implicit reasoning cues (e.g., original context, temporal sub-context, knowledge graphs) with explicit CoT supervision, we find that implicit information provides limited benefit for reasoning with abstention. Our study provides new insights into how abstention and reasoning can be jointly optimized, providing a foundation for building more reliable LLMs.
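An abstention-aware reward of the kind used to guide RL here can be sketched as a simple scoring rule. The specific coefficients and the `<abstain>` token are illustrative assumptions, not values from the paper.

```python
def abstention_reward(prediction, gold, answerable,
                      r_correct=1.0, r_abstain=0.5,
                      p_wrong=-1.0, p_missed=-0.5):
    """Reward shaping sketch for abstention-aware RL:
    - correct answer to an answerable question  -> full reward
    - abstaining on an unanswerable question    -> partial reward
    - answering an unanswerable question, or
      a wrong answer                            -> penalty
    - abstaining when an answer exists          -> smaller penalty
    (all coefficients are hypothetical)."""
    abstained = prediction == "<abstain>"
    if answerable:
        if abstained:
            return p_missed
        return r_correct if prediction == gold else p_wrong
    return r_abstain if abstained else p_wrong
```

Setting `r_abstain` below `r_correct` but above the penalties is what makes abstention a fallback rather than a degenerate optimum: the policy is pushed to answer when it can and to refuse only when answering would likely be wrong.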
https://arxiv.org/abs/2602.04755
Transformers achieve remarkable performance across many domains, yet struggle with tasks requiring multi-hop relational reasoning over structured data. We analyze this limitation through circuit complexity: standard transformers are $\mathsf{TC}^0$-complete and cannot solve graph connectivity in constant depth, implying $\Omega(k)$ layers are necessary for $k$-hop reasoning regardless of model size or training data. We introduce RASA (Relation-Aware Sparse Attention), a minimal architectural modification that provides structural inductive bias for relational reasoning. RASA adds: (1) sparse adjacency masking that restricts attention to graph-connected positions, reducing the attention pattern search space from $O(2^{n^2})$ to $O(2^m)$ for graphs with $m$ edges; and (2) learnable edge-type biases that encode relation-specific attention preferences. While RASA does not circumvent asymptotic depth requirements, the exponential reduction in attention pattern space provides stronger inductive bias for learning graph-structured functions. Empirically, on the MetaQA knowledge graph QA benchmark, RASA achieves 97.7% accuracy on 3-hop questions, outperforming EmbedKGQA (94.8%) by 2.9 percentage points. Notably, RASA's advantage grows with reasoning depth, validating that structural inductive bias is most beneficial for complex multi-hop queries. Our results demonstrate that minimal architectural modifications, grounded in complexity-theoretic analysis, can substantially improve multi-hop reasoning.
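The two RASA components (sparse adjacency masking plus edge-type biases) can be sketched for a single attention head in plain Python. Shapes, the dict-based bias table, and the softmax-over-neighbors formulation are simplifying assumptions for illustration; the paper's implementation may differ.

```python
import math

def rasa_attention(scores, adjacency, edge_types, type_bias):
    """Relation-aware sparse attention sketch.
    scores[i][j]     : raw attention logit from position i to j
    adjacency[i][j]  : True iff j is graph-connected to i (sparse mask)
    edge_types[i][j] : relation type of edge (i, j), or None
    type_bias        : per-relation additive bias (learnable in practice;
                       a plain dict here)
    Returns row-normalised attention weights, zero outside the graph."""
    n = len(scores)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # restrict attention to graph-connected positions only
        logits = {}
        for j in range(n):
            if adjacency[i][j]:
                logits[j] = scores[i][j] + type_bias.get(edge_types[i][j], 0.0)
        if not logits:
            continue
        # numerically stable softmax over the surviving neighbours
        m = max(logits.values())
        exps = {j: math.exp(v - m) for j, v in logits.items()}
        z = sum(exps.values())
        for j, e in exps.items():
            out[i][j] = e / z
    return out
```

Because positions outside the adjacency mask receive exactly zero weight, the attention pattern is confined to the $O(2^m)$ space of edge subsets rather than all $O(2^{n^2})$ position pairs, which is the inductive bias the abstract describes.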
https://arxiv.org/abs/2602.02834