We present TRACE, a Temporal Rule-Anchored Chain-of-Evidence framework over knowledge graphs for interpretable stock movement prediction that unifies symbolic relational priors, dynamic graph exploration, and LLM-guided decision making in a single end-to-end pipeline. The approach performs rule-guided multi-hop exploration restricted to admissible relation sequences, grounds candidate reasoning chains in contemporaneous news, and aggregates fully grounded evidence into auditable \texttt{UP}/\texttt{DOWN} verdicts with human-readable paths connecting text and structure. On an S\&P~500 benchmark, the method achieves 55.1\% accuracy, 55.7\% precision, 71.5\% recall, and 60.8\% F1, surpassing strong baselines and improving recall and F1 over the best graph baseline under identical evaluation. The gains stem from (i) rule-guided exploration that focuses search on economically meaningful motifs rather than arbitrary walks, and (ii) text-grounded consolidation that selectively aggregates high-confidence, fully grounded hypotheses instead of uniformly pooling weak signals. Together, these choices yield higher sensitivity without sacrificing selectivity, delivering predictive lift with faithful, auditable, and interpretable explanations.
https://arxiv.org/abs/2603.12500
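The rule-guided exploration described above can be sketched as a pruned depth-first search in which a path is extended only while its relation sequence remains a prefix of an admissible rule. This is a minimal illustration under invented assumptions, not the paper's implementation: the toy graph, relation names, and the single two-hop rule are all made up.

```python
# Hypothetical sketch of rule-guided multi-hop exploration: extend a path only
# when its relation sequence is still a prefix of some admissible rule, so
# search stays on sanctioned motifs instead of arbitrary walks.

def rule_guided_paths(graph, start, rules, max_hops=3):
    """Enumerate paths whose relation sequences exactly match an admissible rule.

    graph: dict mapping entity -> list of (relation, neighbor) edges.
    rules: set of admissible relation sequences (tuples of relation names).
    """
    # Every proper and full prefix of every rule; used for early pruning.
    prefixes = {rule[:i] for rule in rules for i in range(1, len(rule) + 1)}
    results = []

    def dfs(node, rel_seq, path):
        if rel_seq in rules:                 # completed an admissible motif
            results.append(path)
        if len(rel_seq) >= max_hops:
            return
        for relation, neighbor in graph.get(node, []):
            new_seq = rel_seq + (relation,)
            if new_seq in prefixes:          # prune non-admissible branches
                dfs(neighbor, new_seq, path + [(relation, neighbor)])

    dfs(start, (), [])
    return results


# Invented toy graph and one admissible two-hop rule.
graph = {
    "AAPL": [("supplier_of", "FoxconnCo"), ("competitor_of", "MSFT")],
    "FoxconnCo": [("located_in", "Taiwan")],
    "MSFT": [("located_in", "USA")],
}
rules = {("supplier_of", "located_in")}
paths = rule_guided_paths(graph, "AAPL", rules)
```

The `competitor_of` branch is pruned immediately because no admissible rule begins with that relation, which is the efficiency point of restricting search to rule prefixes.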
While Large Language Models (LLMs) achieve expert-level performance on standard medical benchmarks through single-hop factual recall, they severely struggle with the complex, multi-hop diagnostic reasoning required in real-world clinical settings. A primary obstacle is "shortcut learning", where models exploit highly connected, generic hub nodes (e.g., "inflammation") in knowledge graphs to bypass authentic micro-pathological cascades. To address this, we introduce ShatterMed-QA, a bilingual benchmark of 10,558 multi-hop clinical questions designed to rigorously evaluate deep diagnostic reasoning. Our framework constructs a topology-regularized medical Knowledge Graph using a novel $k$-Shattering algorithm, which physically prunes generic hubs to explicitly sever logical shortcuts. We synthesize the evaluation vignettes by applying implicit bridge entity masking and topology-driven hard negative sampling, forcing models to navigate biologically plausible distractors without relying on superficial elimination. Comprehensive evaluations of 21 LLMs reveal massive performance degradation on our multi-hop tasks, particularly among domain-specific models. Crucially, restoring the masked evidence via Retrieval-Augmented Generation (RAG) triggers near-universal performance recovery, validating ShatterMed-QA's structural fidelity and proving its efficacy in diagnosing the fundamental reasoning deficits of current medical AI. Explore the dataset, interactive examples, and full leaderboards at our project website: this https URL
https://arxiv.org/abs/2603.12458
Online Video Large Language Models (VideoLLMs) play a critical role in supporting responsive, real-time interaction. Existing methods focus on streaming perception but lack a synchronized logical reasoning stream, and directly applying test-time scaling methods to add such reasoning incurs unacceptable response latency. To address this trade-off, we propose Video Streaming Thinking (VST), a novel paradigm for streaming video understanding. It supports a "thinking while watching" mechanism, which activates reasoning over incoming video clips during streaming. This design improves timely comprehension and coherent cognition while preserving real-time responsiveness by amortizing LLM reasoning latency over video playback. Furthermore, we introduce a comprehensive post-training pipeline that integrates VST-SFT, which structurally adapts the offline VideoLLM to causal streaming reasoning, and VST-RL, which provides end-to-end improvement through self-exploration in a multi-turn video interaction environment. Additionally, we devise an automated training-data synthesis pipeline that uses video knowledge graphs to generate high-quality streaming QA pairs, with an entity-relation-grounded streaming Chain-of-Thought to enforce multi-evidence reasoning and sustained attention to the video stream. Extensive evaluations show that VST-7B performs strongly on online benchmarks, e.g., 79.5% on StreamingBench and 59.3% on OVO-Bench. Meanwhile, VST remains competitive on offline long-form and reasoning benchmarks. Compared with Video-R1, VST responds 15.7 times faster and achieves a +5.4% improvement on VideoHolmes, demonstrating higher efficiency and strong generalization across diverse video understanding tasks. Code, data, and models will be released at this https URL.
https://arxiv.org/abs/2603.12262
This paper presents an in-depth analysis of Wikidata qualifiers, focusing on their semantics and actual usage, with the aim of developing a taxonomy that addresses the challenges of selecting appropriate qualifiers, querying the graph, and making logical inferences. The study evaluates qualifier importance based on frequency and diversity, using a modified Shannon entropy index to account for the "long tail" phenomenon. By analyzing a Wikidata dump, the top 300 qualifiers were selected and categorized into a refined taxonomy that includes contextual, epistemic/uncertainty, structural, and additional qualifiers. The taxonomy aims to guide contributors in creating and querying statements, improve qualifier recommendation systems, and enhance knowledge graph design methodologies. The results show that the taxonomy effectively covers the most important qualifiers and provides a structured approach to understanding and utilizing qualifiers in Wikidata.
https://arxiv.org/abs/2603.11767
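The frequency-and-diversity ranking described above can be illustrated with a normalized Shannon entropy over the properties a qualifier co-occurs with. The paper's exact long-tail modification is not given in the abstract, so this sketch uses the unmodified index, and the qualifier names and counts are invented.

```python
import math

# Illustrative diversity score for a qualifier: normalized Shannon entropy of
# the distribution of properties it is attached to. 0 = concentrated on one
# property, 1 = uniform over all observed properties. The paper uses a
# modified entropy index; this is the plain version for illustration only.

def shannon_entropy(counts):
    """Normalized Shannon entropy of a count distribution."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    if len(probs) <= 1:
        return 0.0
    h = -sum(p * math.log2(p) for p in probs)
    return h / math.log2(len(probs))

# Hypothetical co-occurrence counts across four properties.
point_in_time = [5000, 4800, 5100, 4900]   # spread widely: high diversity
determination = [9900, 50, 30, 20]         # dominated by one property: low
```

A qualifier like the first would rank as broadly applicable, while the second would be flagged as property-specific despite a similar raw frequency, which is the distinction the frequency/diversity analysis aims to capture.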
Graph-based Retrieval-Augmented Generation (GraphRAG) constructs a Knowledge Graph (KG) from external databases to enhance the timeliness and accuracy of Large Language Model (LLM) responses. However, this reliance on external data introduces new attack surfaces: attackers can inject poisoned texts into databases to manipulate LLMs into producing harmful target responses for attacker-chosen queries. Existing research primarily focuses on attacking conventional RAG systems; such methods are ineffective against GraphRAG. This robustness derives from the KG abstraction of GraphRAG, which reorganizes injected text into a graph before retrieval, thereby enabling the LLM to reason based on the restructured context instead of the raw poisoned texts. To expose latent security vulnerabilities in GraphRAG, we propose Knowledge Evolution Poison (KEPo), a novel poisoning attack method specifically designed for GraphRAG. For each target query, KEPo first generates a toxic event containing poisoned knowledge based on the target query. By fabricating event backgrounds and forging knowledge evolution paths from original facts to the toxic event, it then poisons the KG and misleads the LLM into treating the poisoned knowledge as the final answer. In multi-target attack scenarios, KEPo further connects multiple attack corpora, enabling their poisoned knowledge to mutually reinforce one another while expanding the scale of poisoned communities, thereby amplifying attack effectiveness. Experimental results across multiple datasets demonstrate that KEPo achieves state-of-the-art attack success rates for both single-target and multi-target attacks, significantly outperforming previous methods.
https://arxiv.org/abs/2603.11501
Retrieval-Augmented Generation (RAG) over Knowledge Graphs (KGs) suffers from the fact that indexing approaches may lose important contextual nuance when text is reduced to triples, degrading performance in downstream Question-Answering (QA) tasks, particularly multi-hop QA, which requires composing answers from multiple entities, facts, or relations. We propose a domain-agnostic, KG-based QA framework that covers both the indexing and retrieval/inference phases. A new indexing approach called Map-Disambiguate-Enrich-Reduce (MDER) generates context-derived triple descriptions and subsequently integrates them with entity-level summaries, thus avoiding the need for explicit traversal of edges in the graph during the QA retrieval phase. Complementing this, we introduce Decompose-Resolve (DR), a retrieval mechanism that decomposes user queries into resolvable triples and grounds them in the KG via iterative reasoning. Together, MDER and DR form an LLM-driven QA pipeline that is robust to sparse, incomplete, and complex relational data. Experiments show that on standard and domain-specific benchmarks, MDER-DR achieves substantial improvements over standard RAG baselines (up to 66%), while maintaining cross-lingual robustness. Our code is available at this https URL.
https://arxiv.org/abs/2603.11223
Medication errors pose a significant threat to patient safety, making pharmacist verification (PV) a critical, yet heavily burdened, final safeguard. The direct application of Large Language Models (LLMs) to this zero-tolerance domain is untenable due to their inherent factual unreliability, lack of traceability, and weakness in complex reasoning. To address these challenges, we introduce PharmGraph-Auditor, a novel system designed for safe and evidence-grounded prescription auditing. The core of our system is a trustworthy Hybrid Pharmaceutical Knowledge Base (HPKB), implemented under the Virtual Knowledge Graph (VKG) paradigm. This architecture strategically unifies a relational component for set constraint satisfaction and a graph component for topological reasoning via a rigorous mapping layer. To construct this HPKB, we propose the Iterative Schema Refinement (ISR) algorithm, a framework that enables the co-evolution of both graph and relational schemas from medical texts. For auditing, we introduce the KB-grounded Chain of Verification (CoV), a new reasoning paradigm that transforms the LLM from an unreliable generator into a transparent reasoning engine. CoV decomposes the audit task into a sequence of verifiable queries against the HPKB, generating hybrid query plans to retrieve evidence from the most appropriate data store. Experimental results demonstrate robust knowledge extraction capabilities and show the promise of PharmGraph-Auditor in enabling pharmacists to achieve safer and faster prescription verification.
https://arxiv.org/abs/2603.10891
Retrieval-Augmented Generation (RAG) systems typically treat documents as flat text, ignoring the structured metadata and linked relationships that knowledge graphs provide. In this paper, we investigate whether structured linked data, specifically Schema.org markup and dereferenceable entity pages served by a Linked Data Platform, can improve retrieval accuracy and answer quality in both standard and agentic RAG systems. We conduct a controlled experiment across four domains (editorial, legal, travel, e-commerce) using Vertex AI Vector Search 2.0 for retrieval and the Google Agent Development Kit (ADK) for agentic reasoning. Our experimental design tests seven conditions: three document representations (plain HTML, HTML with JSON-LD, and an enhanced agentic-optimized entity page) crossed with two retrieval modes (standard RAG and agentic RAG with multi-hop link traversal), plus an Enhanced+ condition that adds rich navigational affordances and entity interlinking. Our results reveal that while JSON-LD markup alone provides only modest improvements, our enhanced entity page format, incorporating this http URL-style agent instructions, breadcrumbs, and neural search capabilities, achieves substantial gains: +29.6% accuracy improvement for standard RAG and +29.8% for the full agentic pipeline. The Enhanced+ variant, with richer navigational affordances, achieves the highest absolute scores (accuracy: 4.85/5, completeness: 4.55/5), though the incremental gain over the base enhanced format is not statistically significant. We release our dataset, evaluation framework, and enhanced entity page templates to support reproducibility.
https://arxiv.org/abs/2603.10700
Table Question Answering (TableQA) enables natural language interaction with structured tabular data. However, existing large language model (LLM) approaches face critical limitations: context length constraints that restrict data handling capabilities, hallucination issues that compromise answer reliability, and single-agent architectures that struggle with complex reasoning scenarios involving semantic relationships and multi-hop logic. This paper introduces DataFactory, a multi-agent framework that addresses these limitations through specialized team coordination and automated knowledge transformation. The framework comprises a Data Leader employing the ReAct paradigm for reasoning orchestration, together with dedicated Database and Knowledge Graph teams, enabling the systematic decomposition of complex queries into structured and relational reasoning tasks. We formalize automated data-to-knowledge graph transformation via the mapping function $T : D \times S \times R \to G$, and implement natural language-based consultation that, unlike fixed-workflow multi-agent systems, enables flexible inter-agent deliberation and adaptive planning to improve coordination robustness. We also apply context engineering strategies that integrate historical patterns and domain knowledge to reduce hallucinations and improve query accuracy. Across TabFact, WikiTableQuestions, and FeTaQA, using eight LLMs from five providers, results show consistent gains. Our approach improves accuracy by 20.2% (TabFact) and 23.9% (WikiTQ) over baselines, with significant effects (Cohen's $d > 1$). Team coordination also outperforms single-team variants (+5.5% TabFact, +14.4% WikiTQ, +17.1% FeTaQA ROUGE-2). The framework offers design guidelines for multi-agent collaboration and a practical platform for enterprise data analysis through integrated structured querying and graph-based knowledge representation.
https://arxiv.org/abs/2603.09152
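The mapping function $T : D \times S \times R \to G$ can be illustrated with a minimal sketch in which rows ($D$) plus a schema naming the entity-key column ($S$) plus relation specifications linking columns ($R$) yield a set of triples ($G$). The column names and relations below are invented for illustration and are not from the paper.

```python
# Hypothetical sketch of a data-to-knowledge-graph mapping in the spirit of
# T : D x S x R -> G. Each row contributes (subject, relation, object) triples
# according to the declared relation specs; missing cells are skipped.

def to_knowledge_graph(rows, key_column, relations):
    """rows: list of dicts (D); key_column: entity-key column name (S);
    relations: list of (relation_name, object_column) pairs (R)."""
    triples = set()
    for row in rows:
        subject = row[key_column]
        for relation, column in relations:
            if row.get(column) is not None:   # skip missing values
                triples.add((subject, relation, row[column]))
    return triples


# Invented toy table and relation specs.
rows = [
    {"company": "Acme", "ceo": "J. Doe", "sector": "Retail"},
    {"company": "Globex", "ceo": "H. Roe", "sector": None},
]
graph = to_knowledge_graph(
    rows, "company", [("has_ceo", "ceo"), ("in_sector", "sector")]
)
```

Once tabular cells are lifted into triples like these, the Knowledge Graph team can answer relational and multi-hop queries that are awkward to express over the flat table.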
AI memory systems increasingly organize knowledge into graph structures -- knowledge graphs, entity relations, community hierarchies -- yet lack a principled mechanism for continuous resolution control: where do the qualitative boundaries between abstraction levels lie, and how should an agent navigate them? We introduce Semantic Level of Detail (SLoD), a framework that answers both questions by defining a continuous zoom operator via heat kernel diffusion on the Poincaré ball $\mathbb{B}^d$. At coarse scales ($\sigma \to \infty$), diffusion aggregates embeddings into high-level summaries; at fine scales ($\sigma \to 0$), local semantic detail is preserved. We prove hierarchical coherence with bounded approximation error $O(\sigma)$ and $(1+\varepsilon)$ distortion for tree-structured hierarchies under Sarkar embedding. Crucially, we show that spectral gaps in the graph Laplacian induce emergent scale boundaries -- scales where the representation undergoes qualitative transitions -- which can be detected automatically without manual resolution parameters. On synthetic hierarchies (HSBM), our boundary scanner recovers planted levels with ARI up to 1.00, with detection degrading gracefully near the information-theoretic Kesten-Stigum threshold. On the full WordNet noun hierarchy (82K synsets), detected boundaries align with true taxonomic depth ($\tau = 0.79$), demonstrating that the method discovers meaningful abstraction levels in real-world knowledge graphs without supervision.
https://arxiv.org/abs/2603.08965
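The spectral-gap intuition behind the emergent scale boundaries can be shown on a toy Euclidean graph: a large gap in the sorted Laplacian eigenvalues marks a qualitative transition, and its position estimates the number of clusters. This sketch does not reproduce the paper's heat-kernel machinery on the Poincaré ball; the two-triangle graph is invented.

```python
import numpy as np

# Illustrative spectral-gap detection: build the combinatorial Laplacian
# L = D - A of a small graph, sort its eigenvalues, and locate the largest
# gap. For a clustered graph, the gap index estimates the cluster count.

def laplacian_eigenvalues(adj):
    degree = np.diag(adj.sum(axis=1))
    return np.sort(np.linalg.eigvalsh(degree - adj))

def largest_gap_index(eigvals):
    """Index i such that the gap between eigenvalues i-1 and i is largest."""
    return int(np.argmax(np.diff(eigvals))) + 1

# Two triangles {0,1,2} and {3,4,5} joined by the single bridge edge (2, 3).
adj = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1.0

eigvals = laplacian_eigenvalues(adj)   # approx 0, 0.44, 3, 3, 3, 4.56
n_clusters = largest_gap_index(eigvals)
```

The dominant gap sits between the second and third eigenvalues, recovering the two planted communities without any resolution parameter, which is the automatic-boundary idea in miniature.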
The rapid emergence of open-source, locally hosted intelligent agents marks a critical inflection point in human-computer interaction. Systems such as OpenClaw demonstrate that Large Language Model (LLM)-based agents can autonomously operate local computing environments, orchestrate workflows, and integrate external tools. However, within the current paradigm, these agents remain conventional applications running on legacy operating systems originally designed for Graphical User Interfaces (GUIs) or Command Line Interfaces (CLIs). This architectural mismatch leads to fragmented interaction models, poorly structured permission management (often described as "Shadow AI"), and severe context fragmentation. This paper proposes a new paradigm: a Personal Agent Operating System (AgentOS). In AgentOS, traditional GUI desktops are replaced by a Natural User Interface (NUI) centered on a unified natural language or voice portal. The system core becomes an Agent Kernel that interprets user intent, decomposes tasks, and coordinates multiple agents, while traditional applications evolve into modular Skills-as-Modules enabling users to compose software through natural language rules. We argue that realizing AgentOS fundamentally becomes a Knowledge Discovery and Data Mining (KDD) problem. The Agent Kernel must operate as a real-time engine for intent mining and knowledge discovery. Viewed through this lens, the operating system becomes a continuous data mining pipeline involving sequential pattern mining for workflow automation, recommender systems for skill retrieval, and dynamically evolving personal knowledge graphs. These challenges define a new research agenda for the KDD community in building the next generation of intelligent computing systems.
https://arxiv.org/abs/2603.08938
Large language models (LLMs) show significant potential for clinical decision support (CDS), yet their black-box nature -- characterized by untraceable reasoning and probabilistic hallucinations -- poses severe challenges in acupuncture, a field demanding rigorous interpretability and safety. To address this, we propose CORE-Acu, a neuro-symbolic framework for acupuncture clinical decision support that integrates Structured Chain-of-Thought (S-CoT) with knowledge graph (KG) safety verification. First, we construct the first acupuncture Structured Reasoning Trace dataset and a schema-constrained fine-tuning framework. By enforcing an explicit causal chain from pattern identification to treatment principles, treatment plans, and acupoint selection, we transform implicit Traditional Chinese Medicine (TCM) reasoning into interpretable generation constraints, mitigating the opacity of LLM-based CDS. Furthermore, we construct a TCM safety knowledge graph and establish a ``Generate--Verify--Revise'' closed-loop inference system based on a Symbolic Veto Mechanism, employing deterministic rules to intercept hallucinations and enforce hard safety boundaries. Finally, we introduce the Lexicon-Matched Entity-Reweighted Loss (LMERL), which corrects terminology drift caused by the frequency--importance mismatch in general optimization by adaptively amplifying gradient contributions of high-risk entities during fine-tuning. Experiments on 1,000 held-out cases demonstrate CORE-Acu's superior entity fidelity and reasoning quality. Crucially, CORE-Acu achieved 0/1,000 observed safety violations (95\% CI: 0--0.37\%), whereas GPT-4o exhibited an 8.5\% violation rate under identical rules. These results establish CORE-Acu as a robust neuro-symbolic framework for acupuncture clinical decision support, guaranteeing both reasoning auditability and strict safety compliance.
https://arxiv.org/abs/2603.08321
Code evolution is inevitable in modern software development. Changes to third-party APIs frequently break existing code and complicate maintenance, posing practical challenges for developers. While large language models (LLMs) have shown promise in code generation, they struggle to reason without a structured representation of these evolving relationships, often leading them to produce outdated APIs or invalid outputs. In this work, we propose a knowledge graph-augmented framework that decomposes the migration task into two synergistic stages: evolution path retrieval and path-informed code generation. Our approach constructs static and dynamic API graphs to model intra-version structures and cross-version transitions, enabling structured reasoning over API evolution. Both modules are trained with synthetic supervision automatically derived from real-world API diffs, ensuring scalability and minimal human effort. Extensive experiments across single-package and multi-package benchmarks demonstrate that our framework significantly improves migration accuracy, controllability, and execution success over standard LLM baselines. The source code and datasets are available at: this https URL.
https://arxiv.org/abs/2603.07581
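The evolution path retrieval stage can be sketched as a breadth-first search over a cross-version transition graph: each deprecated symbol points to its successors, and the search stops at a symbol that still exists in the target version. The API names and transitions below are invented, not taken from the paper or any real package.

```python
from collections import deque

# Hypothetical sketch of evolution path retrieval over a dynamic API graph.
# transitions: old_symbol -> list of successor symbols across versions.
# current_api: the set of symbols valid in the migration target version.

def evolution_path(transitions, source, current_api):
    """Shortest chain of API transitions from `source` to a valid symbol,
    or None if no such chain exists."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] in current_api:
            return path
        for successor in transitions.get(path[-1], []):
            if successor not in seen:
                seen.add(successor)
                queue.append(path + [successor])
    return None


# Invented two-step deprecation chain with one dead-end branch.
transitions = {
    "pkg.load_cfg": ["pkg.config.load"],
    "pkg.config.load": ["pkg.config.read", "pkg.config.from_file"],
}
current_api = {"pkg.config.from_file"}
path = evolution_path(transitions, "pkg.load_cfg", current_api)
```

A retrieved chain like this can then be handed to the generation stage as structured context, so the model rewrites calls along a verified path instead of guessing a replacement API.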
Concept Bottleneck Models (CBMs) aim for ante-hoc interpretability by learning a bottleneck layer that predicts interpretable concepts before the decision. State-of-the-art approaches typically select which concepts to learn via human specification, open knowledge graphs, prompting an LLM, or using general CLIP concepts. However, concepts defined a-priori may not have sufficient predictive power for the task or even be learnable from the available data. As a result, these CBMs often significantly trail their black-box counterpart when controlling for information leakage. To address this, we introduce a novel CBM pipeline named Mechanistic CBM (M-CBM), which builds the bottleneck directly from a black-box model's own learned concepts. These concepts are extracted via Sparse Autoencoders (SAEs) and subsequently named and annotated on a selected subset of images using a Multimodal LLM. For fair comparison and leakage control, we also introduce the Number of Contributing Concepts (NCC), a decision-level sparsity metric that extends the recently proposed NEC metric. Across diverse datasets, we show that M-CBMs consistently surpass prior CBMs at matched sparsity, while improving concept predictions and providing concise explanations. Our code is available at this https URL.
https://arxiv.org/abs/2603.07343
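A decision-level sparsity count in the spirit of the NCC metric can be sketched as follows: for one prediction, count the concepts whose contribution (activation times class weight) to the winning logit is non-negligible. The exact definition in the paper may differ, and the activations, weights, and threshold below are invented.

```python
# Illustrative NCC-style count: how many concepts materially contribute to
# this decision? A concept counts if its share of the total absolute
# contribution exceeds `threshold`. This is a sketch, not the paper's metric.

def contributing_concepts(activations, class_weights, threshold=0.05):
    """Count concepts whose |activation * weight| exceeds `threshold` times
    the total absolute contribution for the decision."""
    contribs = [abs(a * w) for a, w in zip(activations, class_weights)]
    total = sum(contribs) or 1.0
    return sum(1 for c in contribs if c / total > threshold)


activations = [0.9, 0.0, 0.4, 0.01, 0.7]    # hypothetical bottleneck outputs
class_weights = [1.2, 3.0, -0.8, 0.1, 0.0]  # weights into the predicted class
ncc = contributing_concepts(activations, class_weights)
```

Here only two concepts carry the decision, so the explanation can be stated in two concept names rather than the full bottleneck, which is the conciseness that matched-sparsity comparisons control for.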
Mathematical text understanding is a challenging task due to the presence of specialized entities and complex relationships between them. This study formulates mathematical problem interpretation as a Mathematical Entity Relation Extraction (MERE) task, where operands are treated as entities and operators as their relationships. Transformer-based models are applied to automatically extract these relations from mathematical text, with Bidirectional Encoder Representations from Transformers (BERT) achieving the best performance, reaching an accuracy of 99.39%. To enhance transparency and trust in the model's predictions, Explainable Artificial Intelligence (XAI) is incorporated using Shapley Additive Explanations (SHAP). The explainability analysis reveals how specific textual and mathematical features influence relation prediction, providing insights into feature importance and model behavior. By combining transformer-based learning, a task-specific dataset, and explainable modeling, this work offers an effective and interpretable framework for MERE, supporting future applications in automated problem solving, knowledge graph construction, and intelligent educational systems.
https://arxiv.org/abs/2603.06348
Personal Artificial Intelligence is currently hindered by the fragmentation of user data across isolated silos. While Retrieval-Augmented Generation offers a partial remedy, its reliance on unstructured vector similarity fails to capture the latent semantic topology and temporal dependencies essential for holistic sensemaking. We introduce EpisTwin, a neuro-symbolic framework that grounds generative reasoning in a verifiable, user-centric Personal Knowledge Graph. EpisTwin leverages Multimodal Language Models to lift heterogeneous, cross-application data into semantic triples. At inference, EpisTwin enables complex reasoning over the personal semantic graph via an agentic coordinator that combines Graph Retrieval-Augmented Generation with Online Deep Visual Refinement, dynamically re-grounding symbolic entities in their raw visual context. We also introduce PersonalQA-71-100, a synthetic benchmark designed to simulate a realistic user's digital footprint and evaluate EpisTwin performance. Our framework demonstrates robust results across a suite of state-of-the-art judge models, offering a promising direction for trustworthy Personal AI.
https://arxiv.org/abs/2603.06290
Retrieval-Augmented Generation (RAG) was introduced to enhance the capabilities of Large Language Models (LLMs) beyond their encoded prior knowledge. This is achieved by providing LLMs with an external source of knowledge, which helps reduce factual hallucinations and enables access to new information not available during pretraining. However, inconsistent retrieved information can negatively affect LLM responses. The Retrieval-Augmented Generation Benchmark (RGB) was introduced to evaluate the robustness of RAG systems under such conditions. In this work, we use the RGB corpus to evaluate LLMs in four scenarios: noise robustness, information integration, negative rejection, and counterfactual robustness. We perform a comparative analysis between the RGB RAG baseline and GraphRAG, a knowledge graph based retrieval system. We test three GraphRAG customizations to improve robustness. Results show improvements over the RGB baseline and provide insights for designing more reliable RAG systems for real world scenarios.
https://arxiv.org/abs/2603.05698
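The noise-robustness scenario above can be sketched as a small evaluation harness: the retrieval context mixes relevant and noisy documents at a controlled noise ratio, and the system is scored on whether the gold answer survives. The documents and the containment check below are illustrative stand-ins, not RGB's actual data format or metric implementation.

```python
# Hedged sketch of an RGB-style "noise robustness" setup: build a
# retrieved context with a given fraction of noisy documents, then
# score an answer by gold-string containment. Data is illustrative.
import random

def build_context(positives, negatives, noise_ratio, k=4, seed=0):
    """Return k retrieved docs, roughly noise_ratio of them noisy."""
    rng = random.Random(seed)
    n_noise = round(k * noise_ratio)
    docs = rng.sample(positives, k - n_noise) + rng.sample(negatives, n_noise)
    rng.shuffle(docs)
    return docs

def exact_match(answer, gold):
    return gold.lower() in answer.lower()

positives = ["The 2022 Nobel Prize in Literature was awarded to Annie Ernaux."]
negatives = ["The 2021 prize went to Abdulrazak Gurnah.",
             "Nobel week takes place in Stockholm every December.",
             "The Literature prize was first awarded in 1901."]

ctx = build_context(positives, negatives, noise_ratio=0.75, k=4)
# A real evaluation would prompt an LLM with ctx and score its answer;
# here we only verify the gold fact is still present at this noise level.
retrieved_text = " ".join(ctx)
print(exact_match(retrieved_text, "Annie Ernaux"))
```

Sweeping `noise_ratio` from 0 to 1 and plotting accuracy is the usual way to expose how quickly a RAG pipeline degrades as retrieval quality drops.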
Retrieval-Augmented Generation (RAG) enhances large language models by incorporating external knowledge. However, existing vector-based methods often fail on global sensemaking tasks that require reasoning across many documents. GraphRAG addresses this by organizing documents into a knowledge graph with hierarchical communities that can be recursively summarized. Current GraphRAG approaches rely on Leiden clustering for community detection, but we prove that on sparse knowledge graphs, where average degree is constant and most nodes have low degree, modularity optimization admits exponentially many near-optimal partitions, making Leiden-based communities inherently non-reproducible. To address this, we propose replacing Leiden with k-core decomposition, which yields a deterministic, density-aware hierarchy in linear time. We introduce a set of lightweight heuristics that leverage the k-core hierarchy to construct size-bounded, connectivity-preserving communities for retrieval and summarization, along with a token-budget-aware sampling strategy that reduces LLM costs. We evaluate our methods on real-world datasets including financial earnings transcripts, news articles, and podcasts, using three LLMs for answer generation and five independent LLM judges for head-to-head evaluation. Across datasets and models, our approach consistently improves answer comprehensiveness and diversity while reducing token usage, demonstrating that k-core-based GraphRAG is an effective and efficient framework for global sensemaking.
https://arxiv.org/abs/2603.05207
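The k-core decomposition that replaces Leiden can be sketched with the classic peeling algorithm: repeatedly remove the minimum-degree node, assigning it the running maximum of degrees seen so far as its core number. This simple version is quadratic for clarity; the Batagelj-Zaversnik bucket variant achieves the linear time claimed above. The toy graph is illustrative.

```python
# k-core decomposition via iterative min-degree peeling: each node's
# core number is the largest k such that it belongs to a subgraph
# where every node has degree >= k. Deterministic, unlike modularity
# optimization. O(n^2) here for clarity; bucketed variants are linear.
def core_numbers(adj):
    """Return {node: core number} for an undirected adjacency dict."""
    deg = {v: len(ns) for v, ns in adj.items()}
    removed = set()
    core = {}
    k = 0
    while len(removed) < len(adj):
        # Peel the current minimum-degree node.
        v = min((u for u in adj if u not in removed), key=lambda u: deg[u])
        k = max(k, deg[v])  # core levels are non-decreasing along the peel
        core[v] = k
        removed.add(v)
        for u in adj[v]:
            if u not in removed:
                deg[u] -= 1
    return core

# Triangle {a, b, c} with a pendant node d attached to c.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
print(core_numbers(adj))  # {'d': 1, 'a': 2, 'b': 2, 'c': 2}
```

Grouping nodes by core level yields the density-aware hierarchy: the densest communities (highest k) sit at the bottom, and relaxing k merges them upward, which is what the size-bounded community heuristics then carve into retrieval units.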
Diagnosing hepatic diseases accurately and interpretably is critical, yet it remains challenging in real-world clinical settings. Existing AI approaches for clinical diagnosis often lack transparency, structured reasoning, and deployability. Recent efforts have leveraged large language models (LLMs), retrieval-augmented generation (RAG), and multi-agent collaboration. However, these approaches typically retrieve evidence from a single source and fail to support iterative, role-specialized deliberation grounded in structured clinical data. To address this, we propose MedCoRAG (i.e., Medical Collaborative RAG), an end-to-end framework that generates diagnostic hypotheses from standardized abnormal findings and constructs a patient-specific evidence package by jointly retrieving and pruning UMLS knowledge graph paths and clinical guidelines. It then performs Multi-Agent Collaborative Reasoning: a Router Agent dynamically dispatches Specialist Agents based on case complexity; these agents iteratively reason over the evidence and trigger targeted re-retrievals when needed, while a Generalist Agent synthesizes all deliberations into a traceable consensus diagnosis that emulates multidisciplinary consultation. Experimental results on hepatic disease cases from MIMIC-IV show that MedCoRAG outperforms existing methods and closed-source models in both diagnostic performance and reasoning interpretability.
https://arxiv.org/abs/2603.05129
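The Router-Specialist-Generalist flow described above can be sketched with plain functions standing in for the LLM-backed agents. The agent names, the length-based complexity heuristic, and the clinical findings are invented for illustration; in MedCoRAG the agents reason over the retrieved UMLS paths and guidelines rather than keyword matches.

```python
# Illustrative sketch of MedCoRAG's dispatch flow: a Router scores
# case complexity, dispatches Specialist "agents" (plain functions
# here), and a Generalist merges their findings into a consensus.
# Heuristics and findings below are invented stand-ins.
def hepatology_agent(findings):
    return ["cholestatic pattern"] if "elevated ALP" in findings else []

def infectious_disease_agent(findings):
    return ["viral hepatitis workup"] if "HBsAg positive" in findings else []

SPECIALISTS = [hepatology_agent, infectious_disease_agent]

def router(findings):
    """Dispatch more specialists as the abnormal-finding count grows."""
    complexity = len(findings)
    return SPECIALISTS if complexity >= 2 else SPECIALISTS[:1]

def generalist(findings):
    """Synthesize specialist outputs into a traceable consensus list."""
    hypotheses = []
    for agent in router(findings):
        hypotheses.extend(agent(findings))
    return sorted(set(hypotheses))

case = ["elevated ALP", "HBsAg positive"]
print(generalist(case))  # ['cholestatic pattern', 'viral hepatitis workup']
```

The key design point mirrored here is that dispatch is conditional on case complexity, so simple cases avoid the cost of convening every specialist, while complex cases get full multidisciplinary deliberation.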
Manipulative communication, such as gaslighting, guilt-tripping, and emotional coercion, is often difficult for individuals to recognize. Existing agentic AI systems lack the structured, longitudinal memory to track these subtle, context-dependent tactics, often failing due to limited context windows and catastrophic forgetting. We introduce EchoGuard, an agentic AI framework that addresses this gap by using a Knowledge Graph (KG) as the agent's core episodic and semantic memory. EchoGuard employs a structured Log-Analyze-Reflect loop: (1) users log interactions, which the agent structures as nodes and edges in a personal, episodic KG (capturing events, emotions, and speakers); (2) the system executes complex graph queries to detect six psychologically-grounded manipulation patterns (stored as a semantic KG); and (3) an LLM generates targeted Socratic prompts grounded by the subgraph of detected patterns, guiding users toward self-discovery. This framework demonstrates how the interplay between agentic architectures and Knowledge Graphs can empower individuals in recognizing manipulative communication while maintaining personal autonomy and safety. We present the theoretical foundation, framework design, a comprehensive evaluation strategy, and a vision to validate this approach.
https://arxiv.org/abs/2603.04815
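The "Analyze" step of the Log-Analyze-Reflect loop can be sketched as a graph query over an episodic KG stored as edge triples. The motif below, one speaker repeatedly denying events the user logged, is a gaslighting-like pattern chosen for illustration; the triple schema and the threshold of two are assumptions, not EchoGuard's actual pattern definitions.

```python
# Minimal sketch of a pattern query over an episodic KG: flag
# speakers who deny >= min_count user-logged events. Schema and
# threshold are illustrative assumptions.
from collections import Counter

# (head, relation, tail) edges of the episodic KG
edges = [
    ("event1", "logged_by", "user"),
    ("event1", "denied_by", "partner"),
    ("event2", "logged_by", "user"),
    ("event2", "denied_by", "partner"),
    ("event3", "logged_by", "user"),
]

def detect_repeated_denial(edges, min_count=2):
    """Return speakers denying at least min_count user-logged events."""
    logged = {h for h, r, t in edges if r == "logged_by" and t == "user"}
    deniers = Counter(t for h, r, t in edges
                      if r == "denied_by" and h in logged)
    return {speaker for speaker, n in deniers.items() if n >= min_count}

print(detect_repeated_denial(edges))  # {'partner'}
```

Because the detected pattern is itself a subgraph, it can be handed to the LLM as grounding for the Socratic prompts in step (3), which is what keeps the reflection tied to logged evidence rather than free-form speculation.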