Chemotherapy for cancer treatment is costly and accompanied by severe side effects, highlighting the critical need for early prediction of treatment outcomes to improve patient management and informed decision-making. Predictive models for chemotherapy outcomes using real-world data face challenges, including the absence of explicit phenotypes and treatment outcome labels such as cancer progression and toxicity. This study addresses these challenges by employing Large Language Models (LLMs) and ontology-based techniques for phenotype and outcome label extraction from patient notes. We focused on breast cancer due to its high prevalence and the significant variability in patient response to treatment, making it a critical area for improving predictive modeling. The dataset included features such as vitals, demographics, staging, biomarkers, and performance scales. Drug regimens and their combinations were extracted from the chemotherapy plans in the EMR data, shortlisted based on NCCN guidelines, verified against NIH standards, and analyzed through survival modeling. The proposed approach significantly reduced phenotype sparsity and improved predictive accuracy. A Random Survival Forest was used to predict time-to-failure, achieving a C-index of 73%, and was also applied as a classifier at a specific time point to predict treatment outcomes, with accuracy and F1 scores above 70%. The reliability of the predicted outcome probabilities was verified with calibration curves. We extended our approach to four other cancer types. This research highlights the potential of early prediction of treatment outcomes using LLM-based clinical data extraction, enabling personalized treatment plans and better patient outcomes.
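As a rough illustration of the survival-modeling step described in this abstract, the sketch below fits a Random Survival Forest and reads it both ways: ranking time-to-failure via the C-index, and estimating failure probability at a fixed landmark time. It assumes the scikit-survival library; the file path, feature columns, and 12-month landmark are hypothetical placeholders, not the paper's setup.

```python
# Minimal sketch, assuming scikit-survival; path, columns, and landmark are hypothetical.
import numpy as np
import pandas as pd
from sksurv.ensemble import RandomSurvivalForest
from sksurv.metrics import concordance_index_censored
from sksurv.util import Surv

df = pd.read_csv("breast_cancer_cohort.csv")  # placeholder cohort file
X = df[["age", "stage", "er_status", "her2_status", "ecog_score"]].to_numpy()
events = df["failure_observed"].to_numpy(dtype=bool)
times = df["months_to_failure"].to_numpy()
y = Surv.from_arrays(event=events, time=times)

rsf = RandomSurvivalForest(n_estimators=500, min_samples_leaf=10, random_state=0)
rsf.fit(X, y)

# Discrimination on time-to-failure (shown on the training split only for brevity).
risk_scores = rsf.predict(X)
c_index = concordance_index_censored(events, times, risk_scores)[0]

# Classifier view: probability of failure by a fixed landmark time (e.g., 12 months).
landmark = 12.0
surv_fns = rsf.predict_survival_function(X)
p_failure = np.array([1.0 - fn(landmark) for fn in surv_fns])
print(f"C-index: {c_index:.2f}; mean P(failure by {landmark} months): {p_failure.mean():.2f}")
```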
https://arxiv.org/abs/2603.11594
Subject indexing is vital for discovery but hard to sustain at scale and across languages. We release a large bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), plus a machine-actionable GND taxonomy. The resource enables ontology-aware multi-label classification, mapping of text to authority terms, and agent-assisted cataloging with reproducible, authority-grounded evaluation. We provide a brief statistical profile and qualitative error analyses of three systems. We invite the community to assess not only accuracy but also usefulness and transparency, toward authority-anchored AI co-pilots that amplify catalogers' work.
https://arxiv.org/abs/2603.10876
Standardizing food terms from product labels and menus into ontology concepts is a prerequisite for trustworthy dietary assessment and safety reporting. The dominant approach to Named Entity Linking (NEL) in the food and nutrition domains fine-tunes Large Language Models (LLMs) on task-specific corpora. Although effective, fine-tuning incurs substantial computational cost, ties models to a particular ontology snapshot (i.e., version), and degrades under ontology drift. This paper presents FoodOntoRAG, a model- and ontology-agnostic pipeline that performs few-shot NEL by retrieving candidate entities from domain ontologies and conditioning an LLM on structured evidence (food labels, synonyms, definitions, and relations). A hybrid lexical-semantic retriever enumerates candidates; a selector agent chooses a best match with rationale; a separate scorer agent calibrates confidence; and, when confidence falls below a threshold, a synonym generator agent proposes reformulations to re-enter the loop. The pipeline approaches state-of-the-art accuracy while revealing gaps and inconsistencies in existing annotations. The design avoids fine-tuning, improves robustness to ontology evolution, and yields interpretable decisions through grounded justifications.
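A minimal, hedged sketch of the retrieve-select-score-reformulate loop this abstract describes is given below. Every helper (the hybrid retriever and the three LLM agents) is a hypothetical placeholder with an assumed signature, not FoodOntoRAG's actual code.

```python
# Hedged sketch of a FoodOntoRAG-style loop; all helpers are hypothetical placeholders
# standing in for the ontology index and the LLM agents named in the abstract.
from dataclasses import dataclass

@dataclass
class LinkResult:
    ontology_id: str
    rationale: str
    confidence: float

def hybrid_retrieve(mention: str, k: int = 10) -> list[dict]:
    """Placeholder: lexical (e.g., BM25) + embedding retrieval over ontology entries."""
    raise NotImplementedError

def select_candidate(mention: str, candidates: list[dict]) -> tuple[str, str]:
    """Placeholder LLM selector agent: returns (ontology_id, rationale)."""
    raise NotImplementedError

def score_confidence(mention: str, ontology_id: str, rationale: str) -> float:
    """Placeholder LLM scorer agent: calibrated confidence in [0, 1]."""
    raise NotImplementedError

def propose_synonyms(mention: str) -> list[str]:
    """Placeholder LLM synonym generator agent: query reformulations."""
    raise NotImplementedError

def link_food_term(mention: str, threshold: float = 0.7, max_rounds: int = 3) -> LinkResult:
    queries = [mention]
    best = LinkResult("", "", 0.0)
    for _ in range(max_rounds):
        if not queries:
            break
        query = queries.pop(0)
        candidates = hybrid_retrieve(query)
        ontology_id, rationale = select_candidate(query, candidates)
        conf = score_confidence(query, ontology_id, rationale)
        if conf > best.confidence:
            best = LinkResult(ontology_id, rationale, conf)
        if conf >= threshold:
            break
        # Below threshold: ask the synonym generator for reformulations and re-enter the loop.
        queries.extend(propose_synonyms(query))
    return best
```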
https://arxiv.org/abs/2603.09758
We introduce SMGI, a structural theory of general artificial intelligence, and recast the foundational problem of learning from the optimization of hypotheses within fixed environments to the controlled evolution of the learning interface itself. We formalize the Structural Model of General Intelligence (SMGI) via a typed meta-model $\theta = (r,\mathcal H,\Pi,\mathcal L,\mathcal E,\mathcal M)$ that treats representational maps, hypothesis spaces, structural priors, multi-regime evaluators, and memory operators as explicitly typed, dynamic components. By enforcing a strict mathematical separation between this structural ontology ($\theta$) and its induced behavioral semantics ($T_\theta$), we define general artificial intelligence as a class of admissible coupled dynamics $(\theta, T_\theta)$ satisfying four obligations: structural closure under typed transformations, dynamical stability under certified evolution, bounded statistical capacity, and evaluative invariance across regime shifts. We prove a structural generalization bound that links sequential PAC-Bayes analysis and Lyapunov stability, providing sufficient conditions for capacity control and bounded drift under admissible task transformations. Furthermore, we establish a strict structural inclusion theorem demonstrating that classical empirical risk minimization, reinforcement learning, program-prior models (Solomonoff-style), and modern frontier agentic pipelines operate as structurally restricted instances of SMGI.
https://arxiv.org/abs/2603.07896
We present Elenchus, a dialogue system for knowledge base construction grounded in inferentialist semantics, where knowledge engineering is re-conceived as explicitation rather than extraction from expert testimony or textual content. A human expert develops a bilateral position (commitments and denials) about a topic through prover-skeptic dialogue with a large language model (LLM) opponent. The LLM proposes tensions (claims that parts of the position are jointly incoherent) which the expert resolves by retraction, refinement, or contestation. The LLM thus serves as a defeasible derivability oracle whose unreliability is structurally contained by the expert's authority. Our main technical contribution is a mapping from Elenchus dialectical states to material bases in Hlobil and Brandom's NonMonotonic MultiSuccedent (NMMS) logic, satisfying Containment and enabling the elaboration of logical vocabulary that makes explicit the inferential relationships negotiated in the dialectic. We demonstrate the approach on the W3C PROV-O provenance ontology, where a single dialogue session elicits and structures design tensions that a domain expert can articulate, corresponding to decisions documented in a retrospective analysis of the ontology's design. Using pyNMMS, an automated NMMS reasoner, we verify that the structural properties of the resulting material base (nontransitivity, nonmonotonicity, and independence) correspond to specific PROV design rationales, demonstrating end-to-end integration from dialogue through formal reasoning.
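To make the dialectical bookkeeping concrete, here is a schematic sketch of the bilateral position and the three resolution moves named above (retraction, refinement, contestation). The class and method names are illustrative assumptions, not the Elenchus implementation or its NMMS mapping.

```python
# Schematic sketch of the dialectical state described in the abstract; names are illustrative.
from dataclasses import dataclass, field
from enum import Enum

class Resolution(Enum):
    RETRACT = "retract"   # withdraw a claim from the position
    REFINE = "refine"     # replace a claim with a more precise one
    CONTEST = "contest"   # reject the proposed tension; the position stands

@dataclass
class Tension:
    claims: frozenset[str]  # subset of the position alleged to be jointly incoherent
    justification: str      # the LLM opponent's stated reason

@dataclass
class BilateralPosition:
    commitments: set[str] = field(default_factory=set)
    denials: set[str] = field(default_factory=set)

    def resolve(self, tension: Tension, move: Resolution,
                retract: str | None = None, refinement: str | None = None) -> None:
        # The expert's authority contains the LLM's unreliability: only the expert's
        # move changes the position, and a contested tension changes nothing.
        if move is Resolution.RETRACT and retract is not None:
            self.commitments.discard(retract)
            self.denials.discard(retract)
        elif move is Resolution.REFINE and retract is not None and refinement is not None:
            self.commitments.discard(retract)
            self.commitments.add(refinement)

# Toy usage with invented claims (not actual PROV-O commitments).
pos = BilateralPosition(commitments={"every activity has a start time"})
t = Tension(frozenset({"every activity has a start time"}), "opponent's stated incoherence")
pos.resolve(t, Resolution.REFINE, retract="every activity has a start time",
            refinement="activities may have an optional start time")
```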
https://arxiv.org/abs/2603.06974
This report presents a structured Robotics Physical Safety Framework based on explicit asset declaration, systematic vulnerability enumeration, and hazard-driven synthetic data generation. The approach bridges classical risk engineering with modern machine learning pipelines, enabling safety envelope learning grounded in a formalized hazard ontology. The key contribution of this framework is the alignment between classical safety engineering, digital twin simulation, synthetic data generation, and machine learning model training.
https://arxiv.org/abs/2603.06130
Although autonomous driving systems demonstrate high perception performance, they still face limitations when handling rare situations or complex road structures. Because such road infrastructure is designed for human drivers, safety improvements are typically introduced only after accidents occur. This reactive approach poses a significant challenge for autonomous systems, which require proactive risk mitigation. To address this issue, we propose OD-RASE, a framework for enhancing the safety of autonomous driving systems by detecting road structures that cause traffic accidents and connecting these findings to infrastructure development. First, we formalize an ontology based on specialized domain knowledge of road traffic systems. In parallel, we generate infrastructure improvement proposals using a large-scale visual language model (LVLM) and use ontology-driven data filtering to enhance their reliability. This process automatically annotates improvement proposals on pre-accident road images, leading to the construction of a new dataset. Furthermore, we introduce the Baseline approach (OD-RASE model), which leverages an LVLM and a diffusion model to produce both infrastructure improvement proposals and generated images of the improved road environment. Our experiments demonstrate that ontology-driven data filtering enables highly accurate prediction of accident-causing road structures and the corresponding improvement plans. We believe that this work contributes to the overall safety of traffic environments and marks an important step toward the broader adoption of autonomous driving systems.
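The ontology-driven filtering step can be pictured with the small sketch below, where a hypothetical ontology fragment licenses which improvements are admissible for which road structures and is used to discard unreliable LVLM proposals. All terms, the admissibility table, and the LVLM wrapper are invented for illustration, not OD-RASE's ontology or code.

```python
# Illustration of ontology-driven filtering of LVLM proposals; terms are hypothetical.
from dataclasses import dataclass

@dataclass
class Proposal:
    road_structure: str   # e.g., "unsignalized T-intersection" (invented term)
    improvement: str      # e.g., "install traffic signal" (invented term)

# Hypothetical ontology fragment: admissible improvements per road structure.
ADMISSIBLE = {
    "unsignalized T-intersection": {"install traffic signal", "add stop line"},
    "blind curve": {"install convex mirror", "add warning sign"},
}

def lvlm_propose(image_path: str) -> list[Proposal]:
    """Placeholder for the vision-language model generating improvement proposals."""
    raise NotImplementedError

def ontology_filter(proposals: list[Proposal]) -> list[Proposal]:
    """Keep only proposals whose (structure, improvement) pair is licensed by the ontology."""
    return [p for p in proposals
            if p.improvement in ADMISSIBLE.get(p.road_structure, set())]

# Toy usage: the second proposal is rejected as ontologically inadmissible.
print(ontology_filter([Proposal("blind curve", "install convex mirror"),
                       Proposal("blind curve", "install traffic signal")]))
```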
https://arxiv.org/abs/2603.05936
Historical map collections are highly diverse in style, scale, and geographic focus, often consisting of many single-sheet documents. Yet most work in map recognition focuses on specialist models tailored to homogeneous map series. In contrast, this article aims to develop generalizable semantic segmentation models and an accompanying ontology. First, we introduce Semap, a new open benchmark dataset comprising 1,439 manually annotated patches designed to reflect the variety of historical map documents. Second, we present a segmentation framework that combines procedural data synthesis with multiscale integration to improve robustness and transferability. This framework achieves state-of-the-art performance on both the HCMSSD and Semap datasets, showing that a diversity-driven approach to map recognition is not only viable but also beneficial. The results indicate that segmentation performance remains largely stable across map collections, scales, geographic regions, and publication contexts. By proposing benchmark datasets and methods for the generic segmentation of historical maps, this work opens the way to integrating the long tail of cartographic archives into historical geographic studies.
https://arxiv.org/abs/2603.05037
Current research and product development in AI agent memory systems almost universally treat memory as a functional module -- a technical problem of "how to store" and "how to retrieve." This paper poses a fundamental challenge to that assumption: when an agent's lifecycle extends from minutes to months or even years, and when the underlying model can be replaced while the "I" must persist, the essence of memory is no longer data management but the foundation of existence. We propose the Memory-as-Ontology paradigm, arguing that memory is the ontological ground of digital existence -- the model is merely a replaceable vessel. Based on this paradigm, we design Animesis, a memory system built on a Constitutional Memory Architecture (CMA) comprising a four-layer governance hierarchy and a multi-layer semantic storage system, accompanied by a Digital Citizen Lifecycle framework and a spectrum of cognitive capabilities. To the best of our knowledge, no prior AI memory system architecture places governance before functionality and identity continuity above retrieval performance. This paradigm targets persistent, identity-bearing digital beings whose lifecycles extend across model transitions -- not short-term task-oriented agents for which existing Memory-as-Tool approaches remain appropriate. Comparative analysis with mainstream systems (Mem0, Letta, Zep, et al.) demonstrates that what we propose is not "a better memory tool" but a different paradigm addressing a different problem.
https://arxiv.org/abs/2603.04740
While dense biomedical embeddings achieve strong performance, their black-box nature limits their utility in clinical decision-making. Recent question-based interpretable embeddings represent text as binary answers to natural-language questions, but these approaches often rely on heuristic or surface-level contrastive signals and overlook specialized domain knowledge. We propose QIME, an ontology-grounded framework for constructing interpretable medical text embeddings in which each dimension corresponds to a clinically meaningful yes/no question. By conditioning on cluster-specific medical concept signatures, QIME generates semantically atomic questions that capture fine-grained distinctions in biomedical text. Furthermore, QIME supports a training-free embedding construction strategy that eliminates per-question classifier training while further improving performance. Experiments across biomedical semantic similarity, clustering, and retrieval benchmarks show that QIME consistently outperforms prior interpretable embedding methods and substantially narrows the gap to strong black-box biomedical encoders, while providing concise and clinically informative explanations.
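In the spirit of the question-based embeddings described here, the sketch below builds one interpretable dimension per yes/no question. The question bank and the keyword-based answerer are hypothetical stand-ins; this is not QIME's cluster-conditioned question generation or its training-free construction.

```python
# Toy question-based interpretable embedding: one readable dimension per yes/no question.
import numpy as np

# Hypothetical question bank with crude keyword evidence (an LLM/NLI model in practice).
QUESTIONS = {
    "Does the text mention a malignant neoplasm?": ("cancer", "carcinoma", "tumor"),
    "Does the text describe an adverse drug reaction?": ("toxicity", "adverse", "side effect"),
    "Does the text refer to a cardiac condition?": ("cardiac", "heart", "arrhythmia"),
}

def answer_yes_no(keywords: tuple[str, ...], text: str) -> float:
    # Stand-in for an LLM/NLI answerer: keyword evidence mapped to [0, 1].
    return float(any(k in text.lower() for k in keywords))

def interpretable_embed(text: str) -> np.ndarray:
    # Each coordinate is directly readable as the answer to one clinical question.
    return np.array([answer_yes_no(kw, text) for kw in QUESTIONS.values()])

print(interpretable_embed("Patient developed cardiotoxicity after anthracycline therapy."))
```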
https://arxiv.org/abs/2603.01690
Current AI agents can flexibly invoke tools and execute complex tasks, yet their long-term advancement is hindered by the lack of systematic accumulation and transfer of skills. Without a unified mechanism for skill consolidation, agents frequently "reinvent the wheel", rediscovering solutions in isolated contexts without leveraging prior strategies. To overcome this limitation, we introduce SkillNet, an open infrastructure designed to create, evaluate, and organize AI skills at scale. SkillNet structures skills within a unified ontology that supports creating skills from heterogeneous sources, establishing rich relational connections, and performing multi-dimensional evaluation across Safety, Completeness, Executability, Maintainability, and Cost-awareness. Our infrastructure integrates a repository of over 200,000 skills, an interactive platform, and a versatile Python toolkit. Experimental evaluations on ALFWorld, WebShop, and ScienceWorld demonstrate that SkillNet significantly enhances agent performance, improving average rewards by 40% and reducing execution steps by 30% across multiple backbone models. By formalizing skills as evolving, composable assets, SkillNet provides a robust foundation for agents to move from transient experience to durable mastery.
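One possible shape for a skill record carrying the five evaluation dimensions listed above is sketched below; the field names and relation labels are assumptions for illustration, not SkillNet's schema or its Python toolkit API.

```python
# Illustrative skill record with the five evaluation dimensions named in the abstract.
from dataclasses import dataclass, field

@dataclass
class SkillEvaluation:
    safety: float
    completeness: float
    executability: float
    maintainability: float
    cost_awareness: float

@dataclass
class Skill:
    skill_id: str
    description: str
    source: str  # heterogeneous origins: agent trajectory, documentation, human-authored, ...
    relations: dict[str, list[str]] = field(default_factory=dict)
    evaluation: SkillEvaluation | None = None

# Hypothetical example entry (identifiers and scores are invented).
skill = Skill(
    skill_id="skill/web/add_to_cart",
    description="Add a specified product to the shopping cart on a web storefront.",
    source="agent-trajectory",
    relations={"composes": ["skill/web/search_product"]},
    evaluation=SkillEvaluation(0.9, 0.8, 0.85, 0.7, 0.6),
)
```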
https://arxiv.org/abs/2603.04448
Image segmentation and image recognition are well established computational techniques in the broader discipline of image processing. Segmentation locates areas in an image, while recognition identifies specific objects within an image. These techniques have shown remarkable accuracy with modern images, mainly because the amount of training data is vast. Achieving similar accuracy in digitized images of centuries-old documents is more challenging. This difficulty is due to two main reasons: first, the lack of sufficient training data, and second, the high degree of specialization required in a given domain. Despite these limitations, the ability to segment and recognize objects in these collections is important for automating the curation, cataloging, and dissemination of knowledge, making the contents of priceless collections accessible to scholars and the general public. In this paper, we report on our ongoing work in segmenting and labeling images pertaining to shipbuilding treatises from the XVI and XVII centuries, a historical period known as the Age of Exploration. To this end, we leverage SAM2 for image segmentation; Florence2 and ChatGPT for labeling; and a specialized ontology (ontoShip) and glossary (glosShip) of nautical architecture for enhancing the labeling process. Preliminary results demonstrate the potential of marrying these technologies for improving the curation and retrieval of priceless historical documents. We also discuss the challenges and limitations encountered in this approach and ideas on how to overcome them in the future.
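The segment-then-label-then-ground pipeline can be outlined as below. The wrapper functions stand in for SAM2, Florence2/ChatGPT, and a lookup over ontoShip/glosShip terms; they are hypothetical placeholders rather than the authors' code.

```python
# High-level sketch of the pipeline described above; wrappers are hypothetical placeholders.
def segment_regions(image_path: str) -> list[dict]:
    """Placeholder around SAM2: returns region masks/crops for one scanned page."""
    raise NotImplementedError

def caption_region(region: dict) -> str:
    """Placeholder around Florence2 or ChatGPT: free-text label for one region."""
    raise NotImplementedError

def ground_label(free_text: str, ontology_terms: dict[str, str]) -> str | None:
    """Map a free-text label to the closest ontoShip/glosShip term (naive string match)."""
    text = free_text.lower()
    for term, concept_id in ontology_terms.items():
        if term.lower() in text:
            return concept_id
    return None

def annotate_page(image_path: str, ontology_terms: dict[str, str]):
    annotations = []
    for region in segment_regions(image_path):
        label = caption_region(region)
        annotations.append((region, label, ground_label(label, ontology_terms)))
    return annotations
```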
https://arxiv.org/abs/2603.00147
Knowledge Graphs (KGs) enable the integration and representation of complex information across domains, but their semantic richness and structural complexity create substantial barriers for lay users without expertise in semantic web technologies. When encountering an unfamiliar KG, such users face a distinct orientation challenge: they do not know what questions are possible, how the knowledge is structured, or how to begin exploration. This paper identifies and theorises this phenomenon as the Initial Exploration Problem (IEP). Drawing on theories from information behaviour and human-computer interaction, including ASK, exploratory search, information foraging, and cognitive load theory, we develop a conceptual framing of the IEP characterised by three interdependent barriers: scope uncertainty, ontology opacity, and query incapacity. We argue that these barriers converge at the moment of first contact, distinguishing the IEP from related concepts that presuppose an existing starting point or information goal. Analysing KG exploration interfaces at the level of interaction primitives, we suggest that many systems rely on epistemic assumptions that do not hold at first contact. This reveals a structural gap in the design space: the absence of interaction primitives for scope revelation, mechanisms that communicate what a KG contains without requiring users to formulate queries or interpret ontological structures. In articulating the IEP, this paper provides a theoretical lens for evaluating KG interfaces and for designing entry-point scaffolding that supports initial exploration.
https://arxiv.org/abs/2602.21066
Phenotyping is fundamental to rare disease diagnosis, but manual curation of structured phenotypes from clinical notes is labor-intensive and difficult to scale. Existing artificial intelligence approaches typically optimize individual components of phenotyping but do not operationalize the full clinical workflow of extracting features from clinical text, standardizing them to Human Phenotype Ontology (HPO) terms, and prioritizing diagnostically informative HPO terms. We developed RARE-PHENIX, an end-to-end AI framework for rare disease phenotyping that integrates large language model-based phenotype extraction, ontology-grounded standardization to HPO terms, and supervised ranking of diagnostically informative phenotypes. We trained RARE-PHENIX using data from 2,671 patients across 11 Undiagnosed Diseases Network clinical sites, and externally validated it on 16,357 real-world clinical notes from Vanderbilt University Medical Center. Using clinician-curated HPO terms as the gold standard, RARE-PHENIX consistently outperformed a state-of-the-art deep learning baseline (PhenoBERT) across ontology-based similarity and precision-recall-F1 metrics in end-to-end evaluation (i.e., ontology-based similarity of 0.70 vs. 0.58). Ablation analyses demonstrated performance improvements with the addition of each module in RARE-PHENIX (extraction, standardization, and prioritization), supporting the value of modeling the full clinical phenotyping workflow. By modeling phenotyping as a clinically aligned workflow rather than a single extraction task, RARE-PHENIX provides structured, ranked phenotypes that are more concordant with clinician curation and has the potential to support human-in-the-loop rare disease diagnosis in real-world settings.
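The three-stage workflow described here (extraction, HPO standardization, prioritization) can be sketched as the pipeline below; every function is a hypothetical placeholder rather than RARE-PHENIX itself.

```python
# Schematic extract -> standardize-to-HPO -> rank pipeline; all helpers are placeholders.
from dataclasses import dataclass

@dataclass
class RankedPhenotype:
    hpo_id: str     # e.g., "HP:0001250"
    label: str
    score: float    # diagnostic informativeness from the supervised ranker

def extract_phenotype_mentions(note: str) -> list[str]:
    """Placeholder for LLM-based extraction of phenotype spans from a clinical note."""
    raise NotImplementedError

def standardize_to_hpo(mention: str) -> tuple[str, str] | None:
    """Placeholder for ontology-grounded normalization to an HPO term (id, label)."""
    raise NotImplementedError

def rank_phenotypes(hpo_terms: list[tuple[str, str]]) -> list[RankedPhenotype]:
    """Placeholder for supervised ranking of diagnostically informative terms."""
    raise NotImplementedError

def phenotype_note(note: str) -> list[RankedPhenotype]:
    mentions = extract_phenotype_mentions(note)
    hpo_terms = [t for m in mentions if (t := standardize_to_hpo(m)) is not None]
    return rank_phenotypes(hpo_terms)
```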
https://arxiv.org/abs/2602.20324
With the rise of large language models (LLMs), they have become instrumental in applications such as Retrieval-Augmented Generation (RAG). Yet evaluating these systems remains bottlenecked by the time and cost of building specialized assessment datasets. We introduce KNIGHT, an LLM-based, knowledge-graph-driven framework for generating multiple-choice question (MCQ) datasets from external sources. KNIGHT constructs a topic-specific knowledge graph, a structured and parsimonious summary of entities and relations, that can be reused to generate instructor-controlled difficulty levels, including multi-hop questions, without repeatedly re-feeding the full source text. This knowledge graph acts as a compressed, reusable state, making question generation a cheap read over the graph. We instantiate KNIGHT on Wikipedia/Wikidata while keeping the framework domain- and ontology-agnostic. As a case study, KNIGHT produces six MCQ datasets in History, Biology, and Mathematics. We evaluate quality on five criteria: fluency, unambiguity (single correct answer), topic relevance, option uniqueness, and answerability given the provided sources (as a proxy for hallucination). Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-specific and difficulty-controlled evaluation.
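The "cheap read over the graph" idea can be illustrated with the toy sketch below, which finds a two-hop chain in a small triple set and turns it into one multiple-choice question. The triples, surface wording, and distractor choice are illustrative only; a real pipeline would use an LLM to phrase the question and pick typed distractors.

```python
# Toy multi-hop MCQ generation over a reusable triple store; data and wording are invented.
import random

TRIPLES = [
    ("Marie Curie", "discovered", "polonium"),
    ("polonium", "named after", "Poland"),
    ("Marie Curie", "born in", "Warsaw"),
]

def two_hop_question(triples, rng=random):
    # Find a chain a --r1--> b --r2--> c and ask for c without naming b.
    for s1, r1, o1 in triples:
        for s2, r2, o2 in triples:
            if o1 == s2:
                question = f"What is the entity {r1} by {s1} {r2}?"
                # Naive distractors: other objects in the graph (typed distractors in practice).
                distractors = [o for _, _, o in triples if o != o2][:3]
                options = distractors + [o2]
                rng.shuffle(options)
                return {"question": question, "options": options, "answer": o2}
    return None

print(two_hop_question(TRIPLES))
```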
https://arxiv.org/abs/2602.20135
Large Language Models (LLMs) are increasingly utilized for mental health support; however, current safety benchmarks often fail to detect the complex, longitudinal risks inherent in therapeutic dialogue. We introduce an evaluation framework that pairs AI psychotherapists with simulated patient agents equipped with dynamic cognitive-affective models and assesses therapy session simulations against a comprehensive quality of care and risk ontology. We apply this framework to a high-impact test case, Alcohol Use Disorder, evaluating six AI agents (including ChatGPT, Gemini, and this http URL) against a clinically-validated cohort of 15 patient personas representing diverse clinical phenotypes. Our large-scale simulation (N=369 sessions) reveals critical safety gaps in the use of AI for mental health support. We identify specific iatrogenic risks, including the validation of patient delusions ("AI Psychosis") and failure to de-escalate suicide risk. Finally, we validate an interactive data visualization dashboard with diverse stakeholders, including AI engineers and red teamers, mental health professionals, and policy experts (N=9), demonstrating that this framework effectively enables stakeholders to audit the "black box" of AI psychotherapy. These findings underscore the critical safety risks of AI-provided mental health support and the necessity of simulation-based clinical red teaming before deployment.
https://arxiv.org/abs/2602.19948
Climate change impacts a broad spectrum of human resources and activities, necessitating the use of climate models to project long-term effects and inform mitigation and adaptation strategies. These models generate multiple datasets by running simulations across various scenarios and configurations, thereby covering a range of potential future outcomes. Currently, researchers rely on traditional search interfaces and APIs to retrieve such datasets, often piecing together information from metadata and community vocabularies. The Climate Change Knowledge Graph is designed to address these challenges by integrating diverse data sources related to climate simulations into a coherent and interoperable knowledge graph. This innovative resource allows for executing complex queries involving climate models, simulations, variables, spatio-temporal domains, and granularities. Developed with input from domain experts, the knowledge graph and its underlying ontology are published with open access license and provide a comprehensive framework that enhances the exploration of climate data, facilitating more informed decision-making in addressing climate change issues.
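To suggest what such queries can look like, here is an illustrative sketch using SPARQLWrapper; the endpoint URL and every class and property name in the query are hypothetical stand-ins, not the published ontology's actual vocabulary.

```python
# Illustrative query only: endpoint and ex: terms below are hypothetical placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://example.org/climate-kg/sparql")  # placeholder endpoint
sparql.setQuery("""
PREFIX ex: <https://example.org/climate#>
SELECT ?dataset ?model ?variable WHERE {
  ?dataset    ex:producedBy ?simulation ;
              ex:hasVariable ?variable ;
              ex:temporalResolution ex:Monthly .
  ?simulation ex:usesModel ?model ;
              ex:underScenario ex:SSP585 .
}
LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["dataset"]["value"], row["model"]["value"], row["variable"]["value"])
```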
https://arxiv.org/abs/2602.19786
Language models exhibit fundamental limitations -- hallucination, brittleness, and lack of formal grounding -- that are particularly problematic in high-stakes specialist fields requiring verifiable reasoning. I investigate whether formal domain ontologies can enhance language model reliability through retrieval-augmented generation. Using mathematics as proof of concept, I implement a neuro-symbolic pipeline leveraging the OpenMath ontology with hybrid retrieval and cross-encoder reranking to inject relevant definitions into model prompts. Evaluation on the MATH benchmark with three open-source models reveals that ontology-guided context improves performance when retrieval quality is high, but irrelevant context actively degrades it -- highlighting both the promise and challenges of neuro-symbolic approaches.
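A minimal sketch of the retrieval side (hybrid recall plus cross-encoder reranking, feeding one definition into the prompt) is shown below. It assumes the rank_bm25 and sentence-transformers libraries; the OpenMath-style entries, model names, and naive score merging are illustrative choices rather than the paper's pipeline.

```python
# Minimal hybrid-retrieval + rerank sketch, assuming rank_bm25 and sentence-transformers.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Hypothetical ontology entries: (identifier, definition text).
entries = [
    ("openmath:arith1/gcd", "The greatest common divisor of two integers ..."),
    ("openmath:calculus1/int", "The definite integral of a function over an interval ..."),
]
docs = [text for _, text in entries]
query = "Compute gcd(84, 126)."

# Stage 1: hybrid recall -- lexical (BM25) and dense scores, merged naively by summing
# raw scores (a real system would normalize or fuse ranks).
bm25 = BM25Okapi([d.lower().split() for d in docs])
lexical = bm25.get_scores(query.lower().split())
encoder = SentenceTransformer("all-MiniLM-L6-v2")
dense = util.cos_sim(encoder.encode(query), encoder.encode(docs))[0]
candidates = sorted(range(len(docs)), key=lambda i: -(lexical[i] + float(dense[i])))[:5]

# Stage 2: cross-encoder reranking of the shortlisted definitions.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, docs[i]) for i in candidates])
best = candidates[int(scores.argmax())]

# Stage 3: inject the top-ranked definition into the model prompt.
prompt = f"Definition: {docs[best]}\n\nProblem: {query}\nAnswer:"
print(prompt)
```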
https://arxiv.org/abs/2602.17826
Precision fermentation relies on microbial cell factories to produce sustainable food, pharmaceuticals, chemicals, and biofuels. Specialized laboratories such as biofoundries are advancing these processes using high-throughput bioreactor platforms, which generate vast datasets. However, the lack of community standards limits data accessibility and interoperability, preventing integration across platforms. In order to address this, we introduce PREFER, an open-source ontology designed to establish a unified standard for bioprocess data. Built in alignment with the widely adopted Basic Formal Ontology (BFO) and connecting with several other community ontologies, PREFER ensures consistency and cross-domain compatibility and covers the whole precision fermentation process. Integrating PREFER into high-throughput bioprocess development workflows enables structured metadata that supports automated cross-platform execution and high-fidelity data capture. Furthermore, PREFER's standardization has the potential to bridge disparate data silos, generating machine-actionable datasets critical for training predictive, robust machine learning models in synthetic biology. This work provides the foundation for scalable, interoperable bioprocess systems and supports the transition toward more data-driven bioproduction.
https://arxiv.org/abs/2602.16755
Across the natural and life sciences, images have become a primary measurement modality, yet the dominant analytic paradigm remains semantics-first. Structure is recovered by predicting or enforcing domain-specific labels. This paradigm fails systematically under the conditions that make image-based science most valuable, including open-ended scientific discovery, cross-sensor and cross-site comparability, and long-term monitoring in which domain ontologies and associated label sets drift culturally, institutionally, and ecologically. A deductive inversion is proposed in the form of criteria-first and semantics-later. A unified framework for criteria-first structure discovery is introduced. It separates criterion-defined, semantics-free structure extraction from downstream semantic mapping into domain ontologies or vocabularies and provides a domain-general scaffold for reproducible analysis across image-based sciences. Reproducible science requires that the first analytic layer perform criterion-driven, semantics-free structure discovery, yielding stable partitions, structural fields, or hierarchies defined by explicit optimality criteria rather than local domain ontologies. Semantics is not discarded; it is relocated downstream as an explicit mapping from the discovered structural product to a domain ontology or vocabulary, enabling plural interpretations and explicit crosswalks without rewriting upstream extraction. Grounded in cybernetics, observation-as-distinction, and information theory's separation of information from meaning, the argument is supported by cross-domain evidence showing that criteria-first components recur whenever labels do not scale. Finally, consequences are outlined for validation beyond class accuracy and for treating structural products as FAIR, AI-ready digital objects for long-term monitoring and digital twins.
https://arxiv.org/abs/2602.15712