Additive manufacturing (AM) relies critically on understanding and extrapolating process-property relationships; however, existing data-driven approaches remain limited by fragmented knowledge representations and unreliable extrapolation under sparse data conditions. In this study, we propose an ontology-guided, equation-centric framework that tightly integrates large language models (LLMs) with an additive manufacturing mathematical knowledge graph (AM-MKG) to enable reliable knowledge extraction and principled extrapolative modeling. By explicitly encoding equations, variables, assumptions, and their semantic relationships within a formal ontology, unstructured literature is transformed into machine-interpretable representations that support structured querying and reasoning. LLM-based equation generation is further conditioned on MKG-derived subgraphs, enforcing physically meaningful functional forms and mitigating non-physical or unstable extrapolation trends. To assess reliability beyond conventional predictive uncertainty, a confidence-aware extrapolation assessment is introduced, integrating extrapolation distance, statistical stability, and knowledge-graph-based physical consistency into a unified confidence score. Results demonstrate that ontology-guided extraction significantly improves the structural coherence and quantitative reliability of extracted knowledge, while subgraph-conditioned equation generation yields stable and physically consistent extrapolations compared to unguided LLM outputs. Overall, this work establishes a unified pipeline for ontology-driven knowledge representation, equation-centered reasoning, and confidence-based extrapolation assessment, highlighting the potential of knowledge-graph-augmented LLMs as reliable tools for extrapolative modeling in additive manufacturing.
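The unified confidence score described above can be sketched as a weighted combination of its three ingredients. The weights, the exponential distance decay, and the `length_scale` parameter below are illustrative assumptions, not the paper's actual formulation:

```python
import math

def extrapolation_confidence(distance, stability, consistency,
                             weights=(0.4, 0.3, 0.3), length_scale=1.0):
    """Combine three signals into a single confidence score in [0, 1].

    distance    -- distance of the query point from the training region (>= 0)
    stability   -- statistical stability of the fitted equation, in [0, 1]
    consistency -- knowledge-graph physical-consistency check, in [0, 1]
    """
    w_d, w_s, w_c = weights
    # Map extrapolation distance to [0, 1]: confidence decays with distance.
    distance_score = math.exp(-distance / length_scale)
    return w_d * distance_score + w_s * stability + w_c * consistency

# A point inside the training region, with a stable, physically consistent fit:
high = extrapolation_confidence(distance=0.0, stability=0.95, consistency=1.0)
# A far extrapolation with an unstable, physically inconsistent equation:
low = extrapolation_confidence(distance=5.0, stability=0.2, consistency=0.1)
```

Any monotone aggregation would do here; the point is that the score penalizes all three failure modes jointly rather than reporting predictive uncertainty alone.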
https://arxiv.org/abs/2601.05298
We address semantic 3D part segmentation: decomposing objects into parts with meaningful names. While datasets exist with part annotations, their definitions are inconsistent across datasets, limiting robust training. Previous methods produce unlabeled decompositions or retrieve single parts without complete shape annotations. We propose ALIGN-Parts, which formulates part naming as a direct set alignment task. Our method decomposes shapes into partlets - implicit 3D part representations - matched to part descriptions via bipartite assignment. We combine geometric cues from 3D part fields, appearance cues from multi-view vision features, and semantic knowledge from language-model-generated affordance descriptions. A text-alignment loss ensures partlets share an embedding space with text, enabling a theoretically open-vocabulary matching setup given sufficient data. Our efficient, novel one-shot 3D part segmentation and naming method finds applications in several downstream tasks, including serving as a scalable annotation engine. Because our model supports zero-shot matching to arbitrary descriptions and confidence-calibrated predictions for known categories, with human verification we create a unified ontology that aligns PartNet, 3DCoMPaT++, and Find3D, consisting of 1,794 unique 3D parts. We introduce two novel metrics appropriate for the named 3D part segmentation task. We also show examples from our newly created TexParts dataset.
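The bipartite assignment step can be illustrated with a small brute-force matcher. The cost values below are hypothetical (e.g. one minus a partlet-to-description embedding similarity), and a real system would use an efficient assignment solver rather than enumeration:

```python
from itertools import permutations

def match_partlets(cost):
    """Exhaustively find the minimum-cost one-to-one assignment of
    partlets (rows) to part descriptions (columns).

    cost[i][j] -- e.g. 1 - cosine similarity between partlet i's embedding
    and description j's text embedding. Assumes a square cost matrix.
    """
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return list(best_perm), best_cost

# Three partlets vs. descriptions ("seat", "leg", "backrest"); low cost = good match.
cost = [
    [0.1, 0.9, 0.8],   # partlet 0 resembles "seat"
    [0.8, 0.2, 0.9],   # partlet 1 resembles "leg"
    [0.7, 0.8, 0.1],   # partlet 2 resembles "backrest"
]
assignment, total = match_partlets(cost)
```

The enumeration is O(n!), fine for a toy example; the Hungarian algorithm solves the same problem in polynomial time for realistic part counts.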
https://arxiv.org/abs/2512.18003
The application of machine learning on healthcare data is often hindered by the lack of standardized and semantically explicit representation, leading to limited interoperability and reproducibility across datasets and experiments. The Medical Event Data Standard (MEDS) addresses these issues by introducing a minimal, event-centric data model designed for reproducible machine-learning workflows from health data. However, MEDS is defined as a data-format specification and does not natively provide integration with the Semantic Web ecosystem. In this article, we introduce MEDS-OWL, a lightweight OWL ontology that provides formal concepts and relations to enable representing MEDS datasets as RDF graphs. Additionally, we implemented meds2rdf, a Python conversion library that transforms MEDS events into RDF graphs, ensuring conformance with the ontology. We demonstrate the approach on a synthetic clinical dataset that describes patient care pathways for ruptured intracranial aneurysms and validate the resulting graph using SHACL constraints. The first release of MEDS-OWL comprises 13 classes, 10 object properties, 20 data properties, and 24 OWL axioms. Combined with meds2rdf, it enables data transformation into FAIR-aligned datasets, provenance-aware publishing, and interoperability of event-based clinical data. By bridging MEDS with the Semantic Web, this work contributes a reusable semantic layer for event-based clinical data and establishes a robust foundation for subsequent graph-based analytics.
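A minimal sketch of the event-to-RDF conversion, emitting plain N-Triples strings. The namespace and property names (`Event`, `hasSubject`, `hasCode`, `hasTime`) are placeholders for illustration, not the actual MEDS-OWL vocabulary or the meds2rdf API:

```python
def meds_event_to_ntriples(event, base="http://example.org/meds#"):
    """Serialize one MEDS-style event dict as N-Triples lines.

    The vocabulary below is a hypothetical stand-in; a real converter
    would use the ontology's own classes and properties.
    """
    s = f"<{base}event/{event['event_id']}>"
    lines = [
        f"{s} <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <{base}Event> .",
        f"{s} <{base}hasSubject> <{base}patient/{event['subject_id']}> .",
        f"{s} <{base}hasCode> \"{event['code']}\" .",
        f"{s} <{base}hasTime> \"{event['time']}\" .",
    ]
    return "\n".join(lines)

triples = meds_event_to_ntriples(
    {"event_id": "e1", "subject_id": "p42",
     "code": "ICD10:I60", "time": "2020-01-01T08:00:00"}
)
```

In practice a library such as rdflib would build the graph object directly, and SHACL shapes would then validate it as described above.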
https://arxiv.org/abs/2601.04164
Large language models (LLMs) excel at natural language tasks but remain brittle in domains requiring precise logical and symbolic reasoning. Chaotic dynamical systems provide an especially demanding test because chaos is deterministic yet often misinterpreted as randomness or complexity. We introduce ChaosBench-Logic, a benchmark that evaluates LLM reasoning across 30 diverse dynamical systems using a unified first-order logic (FOL) ontology. Each system is annotated with truth assignments for 11 semantic predicates, and 621 questions are generated across seven reasoning categories, including multi-hop implications, cross-system analogies, counterfactual reasoning, bias probes, and multi-turn dialogues. We define metrics for logical accuracy, implication consistency, dialogue coherence, and contradiction, and we release an open-source evaluation pipeline. Initial experiments show that frontier LLMs such as GPT-4, Claude 3.5 Sonnet, Gemini 2.5 Flash, and the open-source LLaMA-3 70B achieve 91-94% per-item accuracy, yet still score 0% on compositional items and exhibit fragile global coherence. Dialogue-level accuracy ranges from 53.1% (GPT-4 CoT) to 75.5% (LLaMA-3 zero-shot). ChaosBench-Logic provides a rigorous testbed for diagnosing such failures and a foundation for developing neuro-symbolic approaches that improve scientific reasoning in LLMs.
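The implication-consistency metric can be sketched as a check over annotated "P implies Q" pairs against a model's per-predicate answers. The predicate names and the exact scoring rule below are illustrative assumptions, not the benchmark's definitions:

```python
def implication_consistency(assertions, implications):
    """Fraction of annotated implications P -> Q that the model's answers respect.

    assertions   -- dict mapping predicate name to the model's True/False answer
    implications -- list of (p, q) pairs meaning "p implies q"
    An implication is violated only when p is answered True and q False.
    """
    checked = [(p, q) for p, q in implications
               if p in assertions and q in assertions]
    if not checked:
        return 1.0
    ok = sum(1 for p, q in checked
             if not (assertions[p] and not assertions[q]))
    return ok / len(checked)

# Chaos is deterministic, so answering "chaotic but not deterministic"
# violates the first implication while satisfying the second.
answers = {"chaotic": True, "deterministic": False, "bounded": True}
rules = [("chaotic", "deterministic"), ("chaotic", "bounded")]
score = implication_consistency(answers, rules)
```

This is exactly the kind of global-coherence check that per-item accuracy misses: a model can be 91-94% accurate item-by-item yet contradict its own implications.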
https://arxiv.org/abs/2601.01982
Large language models (LLMs) offer new opportunities for constructing knowledge graphs (KGs) from unstructured clinical narratives. However, existing approaches often rely on structured inputs and lack robust validation of factual accuracy and semantic consistency, limitations that are especially problematic in oncology. We introduce an end-to-end framework for clinical KG construction and evaluation directly from free text using multi-agent prompting and a schema-constrained Retrieval-Augmented Generation (KG-RAG) strategy. Our pipeline integrates (1) prompt-driven entity, attribute, and relation extraction; (2) entropy-based uncertainty scoring; (3) ontology-aligned RDF/OWL schema generation; and (4) multi-LLM consensus validation for hallucination detection and semantic refinement. Beyond static graph construction, the framework supports continuous refinement and self-supervised evaluation, enabling iterative improvement of graph quality. Applied to two oncology cohorts (PDAC and BRCA), our method produces interpretable, SPARQL-compatible, and clinically grounded knowledge graphs without relying on gold-standard annotations. Experimental results demonstrate consistent gains in precision, relevance, and ontology compliance over baseline methods.
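Step (2), entropy-based uncertainty scoring, can be sketched as follows. The threshold value and the policy of routing uncertain triples to multi-LLM consensus validation are illustrative assumptions, not the paper's exact settings:

```python
import math

def token_entropy(probs):
    """Shannon entropy (bits) of one token's probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def flag_uncertain(extractions, threshold=1.0):
    """Split extracted triples by mean token entropy: low-entropy triples are
    kept, high-entropy ones are routed to consensus validation.

    extractions -- list of (triple, [per-token probability distributions])
    """
    confident, uncertain = [], []
    for triple, dists in extractions:
        mean_h = sum(token_entropy(d) for d in dists) / len(dists)
        (confident if mean_h < threshold else uncertain).append(triple)
    return confident, uncertain

peaked = [0.97, 0.01, 0.01, 0.01]   # sharply peaked: model is confident
flat   = [0.25, 0.25, 0.25, 0.25]   # uniform over 4 tokens: 2 bits, a guess
confident, uncertain = flag_uncertain([
    (("tumor", "located_in", "pancreas"), [peaked]),
    (("tumor", "stage", "IV"), [flat]),
])
```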
https://arxiv.org/abs/2601.01844
Rule-based reasoning over natural language input arises in domains where decisions must be auditable and justifiable: clinical protocols specify eligibility criteria in prose, evidence rules define admissibility through textual conditions, and scientific standards dictate methodological requirements. Applying rules to such inputs demands both interpretive flexibility and formal guarantees. Large language models (LLMs) provide flexibility but cannot ensure consistent rule application; symbolic systems provide guarantees but require structured input. This paper presents an integration pattern that combines these strengths: LLMs serve as ontology population engines, translating unstructured text into ABox assertions according to expert-authored TBox specifications, while SWRL-based reasoners apply rules with deterministic guarantees. The framework decomposes reasoning into entity identification, assertion extraction, and symbolic verification, with task definitions grounded in OWL 2 ontologies. Experiments across three domains (legal hearsay determination, scientific method-task application, clinical trial eligibility) and eleven language models validate the approach. Structured decomposition achieves statistically significant improvements over few-shot prompting in aggregate, with gains observed across all three domains. An ablation study confirms that symbolic verification provides substantial benefit beyond structured prompting alone. The populated ABox integrates with standard semantic web tooling for inspection and querying, positioning the framework for richer inference patterns that simpler formalisms cannot express.
https://arxiv.org/abs/2601.01609
The paper presents our work on a cross-lingual ontology alignment system that uses embedding-based cosine similarity matching. The ontology entities are made contextually richer by creating descriptions using novel techniques. We use a fine-tuned transformer-based multilingual model to generate better embeddings. We use cosine similarity to find positive ontology entity pairs and then apply threshold filtering to retain only highly similar entities. We evaluated our work on the OAEI-2022 MultiFarm track. We achieve a 71% F1 score (78% recall and 65% precision) on the evaluation dataset, a 16% increase over the best baseline score. This suggests that our proposed alignment pipeline is able to capture subtle cross-lingual similarities.
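The matching-and-filtering stage reduces to a few lines. The toy 3-d vectors below stand in for the fine-tuned multilingual transformer embeddings, and the 0.8 threshold is an arbitrary illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def align(src, tgt, threshold=0.8):
    """Match each source entity to its most similar target entity,
    keeping only pairs whose cosine similarity clears the threshold.

    src, tgt -- dicts mapping entity label to an embedding vector.
    """
    pairs = []
    for s_label, s_vec in src.items():
        t_label, sim = max(((t, cosine(s_vec, v)) for t, v in tgt.items()),
                           key=lambda x: x[1])
        if sim >= threshold:
            pairs.append((s_label, t_label, round(sim, 3)))
    return pairs

# English vs. German ontology entities with near-parallel toy embeddings:
en = {"Person": [0.9, 0.1, 0.0], "Conference": [0.1, 0.9, 0.2]}
de = {"Person_de": [0.88, 0.12, 0.02], "Konferenz": [0.15, 0.85, 0.25]}
matches = align(en, de)
```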
https://arxiv.org/abs/2601.00814
The semantics and dynamics of 'attention' are closely related to promise-theoretic notions developed for autonomous agents and can thus easily be written down in the promise framework. In this way one may establish a bridge between vectorized Machine Learning and Knowledge Graph representations without relying implicitly on language models. Our expectations for knowledge presume a degree of statistical stability, i.e. average invariance under repeated observation, or 'trust' in the data. Both learning networks and knowledge graph representations can meaningfully coexist to preserve different aspects of data. While vectorized data are useful for probabilistic estimation, graphs preserve the intentionality of the source even under data fractionation. Using a Semantic Spacetime $\gamma(3,4)$ graph, one avoids complex ontologies in favour of classifying features by their roles in semantic processes, which favours an approach to reasoning under conditions of uncertainty. Appropriate attention to causal boundary conditions may yield orders-of-magnitude compression of the data needed for such context determination, as required in autonomous robotics, defence deployments, and ad hoc emergency services.
https://arxiv.org/abs/2512.19084
Modern experimental platforms such as particle accelerators, fusion devices, telescopes, and industrial process control systems expose tens to hundreds of thousands of control and diagnostic channels accumulated over decades of evolution. Operators and AI systems rely on informal expert knowledge, inconsistent naming conventions, and fragmented documentation to locate signals for monitoring, troubleshooting, and automated control, creating a persistent bottleneck for reliability, scalability, and language-model-driven interfaces. We formalize semantic channel finding (mapping natural-language intent to concrete control-system signals) as a general problem in complex experimental infrastructure, and introduce a four-paradigm framework to guide architecture selection across facility-specific data regimes. The paradigms span (i) direct in-context lookup over curated channel dictionaries, (ii) constrained hierarchical navigation through structured trees, (iii) interactive agent exploration using iterative reasoning and tool-based database queries, and (iv) ontology-grounded semantic search that decouples channel meaning from facility-specific naming conventions. We demonstrate each paradigm through proof-of-concept implementations at four operational facilities spanning two orders of magnitude in scale (from compact free-electron lasers to large synchrotron light sources) and diverse control-system architectures, from clean hierarchies to legacy environments. These implementations achieve 90-97% accuracy on expert-curated operational queries.
https://arxiv.org/abs/2512.18779
Traditional, centralized security tools often miss adaptive, multi-vector attacks. We present the Multi-Agent LLM Cyber Defense Framework (MALCDF), a practical setup where four large language model (LLM) agents (Detection, Intelligence, Response, and Analysis) work together in real time. Agents communicate over a Secure Communication Layer (SCL) with encrypted, ontology-aligned messages, and produce audit-friendly outputs (e.g., MITRE ATT&CK mappings). For evaluation, we keep the test simple and consistent: all reported metrics come from the same 50-record live stream derived from the CICIDS2017 feature schema. CICIDS2017 is used for configuration (fields/schema) and to train a practical ML baseline. The ML-IDS baseline is a Lightweight Random Forest IDS (LRF-IDS) trained on a subset of CICIDS2017 and tested on the 50-record stream, with no overlap between training and test records. In experiments, MALCDF reaches 90.0% detection accuracy, 85.7% F1-score, and 9.1% false-positive rate, with 6.8s average per-event latency. It outperforms the lightweight ML-IDS baseline and a single-LLM setup on accuracy while keeping end-to-end outputs consistent. Overall, this hands-on build suggests that coordinating simple LLM agents with secure, ontology-aligned messaging can improve practical, real-time cyber defense.
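The reported figures are standard confusion-matrix metrics. The counts below are one reconstruction consistent with the reported 90.0% accuracy, 85.7% F1, and 9.1% false-positive rate on a 50-record stream; the abstract does not give the raw confusion matrix, so they are illustrative only:

```python
def detection_metrics(tp, fp, tn, fn):
    """Accuracy, F1, and false-positive rate from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"accuracy": accuracy, "f1": f1, "fpr": fpr}

# Hypothetical counts summing to 50 records that reproduce the paper's numbers:
m = detection_metrics(tp=15, fp=3, tn=30, fn=2)
```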
https://arxiv.org/abs/2512.14846
OpenAI has recently argued that hallucinations in large language models result primarily from misaligned evaluation incentives that reward confident guessing rather than epistemic humility. On this view, hallucination is a contingent behavioral artifact, remediable through improved benchmarks and reward structures. In this paper, we challenge that interpretation. Drawing on previous work on structural hallucination and empirical experiments using a Licensing Oracle, we argue that hallucination is not an optimization failure but an architectural inevitability of the transformer model. Transformers do not represent the world; they model statistical associations among tokens. Their embedding spaces form a pseudo-ontology derived from linguistic co-occurrence rather than world-referential structure. At ontological boundary conditions - regions where training data is sparse or incoherent - the model necessarily interpolates fictional continuations in order to preserve coherence. No incentive mechanism can modify this structural dependence on pattern completion. Our empirical results demonstrate that hallucination can only be eliminated through external truth-validation and abstention modules, not through changes to incentives, prompting, or fine-tuning. The Licensing Oracle achieves perfect abstention precision across domains precisely because it supplies grounding that the transformer lacks. We conclude that hallucination is a structural property of generative architectures and that reliable AI requires hybrid systems that distinguish linguistic fluency from epistemic responsibility.
https://arxiv.org/abs/2512.14801
This paper explores the integration of Large Language Models (LLMs) in the engineering of a Parkinson's Disease (PD) monitoring and alerting ontology through four key methodologies: One Shot (OS) prompt techniques, Chain of Thought (CoT) prompts, X-HCOME, and SimX-HCOME+. The primary objective is to determine whether LLMs alone can create comprehensive ontologies and, if not, whether human-LLM collaboration can achieve this goal. Consequently, the paper assesses the effectiveness of LLMs in automated ontology development and the enhancement achieved through human-LLM collaboration. Initial ontology generation was performed using One Shot (OS) and Chain of Thought (CoT) prompts, demonstrating the capability of LLMs to autonomously construct ontologies for PD monitoring and alerting. However, these outputs were not comprehensive and required substantial human refinement to enhance their completeness and accuracy. X-HCOME, a hybrid ontology engineering approach that combines human expertise with LLM capabilities, showed significant improvements in ontology comprehensiveness. This methodology resulted in ontologies that are very similar to those constructed by experts. Further experimentation with SimX-HCOME+, another hybrid methodology emphasizing continuous human supervision and iterative refinement, highlighted the importance of ongoing human involvement. This approach led to the creation of more comprehensive and accurate ontologies. Overall, the paper underscores the potential of human-LLM collaboration in advancing ontology engineering, particularly in complex domains like PD. The results suggest promising directions for future research, including the development of specialized GPT models for ontology construction.
https://arxiv.org/abs/2512.14288
Traditional ontology design emphasizes disjoint and exhaustive top-level distinctions such as continuant vs. occurrent, abstract vs. concrete, or type vs. instance. These distinctions are used to structure unified hierarchies where every entity is classified under a single upper-level category. Wikidata, by contrast, does not enforce a singular foundational taxonomy. Instead, it accommodates multiple classification axes simultaneously under the shared root class entity. This paper analyzes the structural implications of Wikidata's polyhierarchical and multi-axial design. The Wikidata architecture enables a scalable and modular approach to ontology construction, especially suited to collaborative and evolving knowledge graphs.
https://arxiv.org/abs/2512.12260
Automated eligibility systems increasingly determine access to essential public benefits, but the explanations they generate often fail to reflect the legal rules that authorize those decisions. This thesis develops a legally grounded explainability framework that links system-generated decision justifications to the statutory constraints of CalFresh, California's Supplemental Nutrition Assistance Program. The framework combines a structured ontology of eligibility requirements derived from the state's Manual of Policies and Procedures (MPP), a rule extraction pipeline that expresses statutory logic in a verifiable formal representation, and a solver-based reasoning layer to evaluate whether the explanation aligns with governing law. Case evaluations demonstrate the framework's ability to detect legally inconsistent explanations, highlight violated eligibility rules, and support procedural accountability by making the basis of automated determinations traceable and contestable.
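The solver-based verification layer can be sketched as checking every rule an explanation cites against the case record and surfacing violations. The rule identifiers and thresholds below are illustrative, not actual MPP provisions:

```python
def check_explanation(case, cited_rules, rules):
    """Verify that each rule a system-generated explanation cites actually
    holds for the case; report any violated eligibility rules.

    rules -- dict mapping rule id to a predicate over the case record.
    """
    violated = [rid for rid in cited_rules if not rules[rid](case)]
    return {"consistent": not violated, "violated_rules": violated}

# Hypothetical formalized eligibility rules (not real MPP sections):
RULES = {
    "MPP-INCOME": lambda c: c["monthly_income"] <= c["income_limit"],
    "MPP-RESIDENCY": lambda c: c["state"] == "CA",
}

case = {"monthly_income": 2500, "income_limit": 2000, "state": "CA"}
# The system's explanation claims both rules support an eligibility finding:
report = check_explanation(case, ["MPP-INCOME", "MPP-RESIDENCY"], RULES)
```

Because the violated rule ids are returned explicitly, the determination is traceable to specific provisions and therefore contestable, which is the accountability property the framework targets.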
https://arxiv.org/abs/2512.12109
Large language models (LLMs) are increasingly touted as powerful tools for automating scientific information extraction. However, existing methods and tools often struggle with the realities of scientific literature: long-context documents, multi-modal content, and reconciling varied and inconsistent fine-grained information across multiple publications into standardized formats. These challenges are further compounded when the desired data schema or extraction ontology changes rapidly, making it difficult to re-architect or fine-tune existing systems. We present SciEx, a modular and composable framework that decouples key components including PDF parsing, multi-modal retrieval, extraction, and aggregation. This design streamlines on-demand data extraction while enabling extensibility and flexible integration of new models, prompting strategies, and reasoning mechanisms. We evaluate SciEx on datasets spanning three scientific topics for its ability to extract fine-grained information accurately and consistently. Our findings provide practical insights into both the strengths and limitations of current LLM-based pipelines.
https://arxiv.org/abs/2512.10004
Ontology-based knowledge graph (KG) construction is a core technology that enables multidimensional understanding and advanced reasoning over domain knowledge. Industrial standards, in particular, contain extensive technical information and complex rules presented in highly structured formats that combine tables, scopes of application, constraints, exceptions, and numerical calculations, making KG construction especially challenging. In this study, we propose a method that organizes such documents into a hierarchical semantic structure, decomposes sentences and tables into atomic propositions derived from conditional and numerical rules, and integrates them into an ontology-knowledge graph through LLM-based triple extraction. Our approach captures both the hierarchical and logical structures of documents, effectively representing domain-specific semantics that conventional methods fail to reflect. To verify its effectiveness, we constructed rule, table, and multi-hop QA datasets, as well as a toxic clause detection dataset, from industrial standards, and implemented an ontology-aware KG-RAG framework for comparative evaluation. Experimental results show that our method achieves significant performance improvements across all QA types compared to existing KG-RAG approaches. This study demonstrates that reliable and scalable knowledge representation is feasible even for industrial documents with intertwined conditions, constraints, and scopes, contributing to future domain-specific RAG development and intelligent document management.
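The decomposition of a conditional, numerical rule into atomic propositions can be sketched as triple emission. The predicate names (`appliesTo`, `hasCondition`, `hasConstraint`) are placeholders, not the paper's actual schema:

```python
def rule_to_triples(rule_id, condition, subject, predicate, value):
    """Decompose one conditional/numerical rule into atomic triples that
    can be loaded into an ontology-knowledge graph.

    The vocabulary here is a hypothetical stand-in for illustration.
    """
    return [
        (rule_id, "rdf:type", "std:Rule"),
        (rule_id, "std:appliesTo", subject),
        (rule_id, "std:hasCondition", condition),
        (rule_id, "std:hasConstraint", f"{predicate} {value}"),
    ]

# "If pipe diameter exceeds 100 mm, wall thickness must be at least 4 mm."
triples = rule_to_triples(
    "rule:17", "diameter > 100 mm", "std:Pipe", "wallThickness >=", "4 mm"
)
```

Keeping the condition, scope, and constraint as separate triples is what lets a downstream KG-RAG system retrieve exactly the applicable clause for a multi-hop or toxic-clause query.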
https://arxiv.org/abs/2512.08398
Ontologies are an important tool for structuring domain knowledge, but their development is a complex task that requires significant modelling and domain expertise. Ontology learning, aimed at automating this process, has seen advancements in the past decade with the improvement of Natural Language Processing techniques, and especially with the recent growth of Large Language Models (LLMs). This paper investigates the challenge of identifying axioms: fundamental ontology components that define logical relations between classes and properties. In this work, we introduce an Ontology Axiom Benchmark, OntoAxiom, and systematically test LLMs on that benchmark for axiom identification, evaluating different prompting strategies, ontologies, and axiom types. The benchmark consists of nine medium-sized ontologies comprising 17,118 triples and 2,771 axioms in total. We focus on subclass, disjoint, subproperty, domain, and range axioms. To evaluate LLM performance, we compare twelve LLMs with three shot settings and two prompting strategies: a Direct approach, where we query all axioms at once, versus an Axiom-by-Axiom (AbA) approach, where each prompt queries one axiom only. Our findings show that AbA prompting leads to higher F1 scores than the Direct approach. However, performance varies across axiom types, suggesting that certain axioms are more challenging to identify. The domain also influences performance: the FOAF ontology achieves a score of 0.642 for the subclass axiom, while the music ontology reaches only 0.218. Larger LLMs outperform smaller ones, but smaller models may still be viable for resource-constrained settings. Although overall performance is not high enough to fully automate axiom identification, LLMs can provide valuable candidate axioms to support ontology engineers in the development and refinement of ontologies.
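Scoring axiom identification reduces to set-based precision, recall, and F1 over predicted versus gold axioms. The axioms below are a toy example, not drawn from the benchmark:

```python
def axiom_f1(predicted, gold):
    """Precision, recall, and F1 for axiom identification, treating the
    LLM's output and the ontology's axioms as sets of triples."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("Agent", "subClassOf", "Thing"),
        ("Person", "subClassOf", "Agent"),
        ("Person", "disjointWith", "Organization")}
pred = {("Person", "subClassOf", "Agent"),
        ("Person", "disjointWith", "Organization"),
        ("Agent", "subClassOf", "Person")}   # reversed direction: a miss
p, r, f1 = axiom_f1(pred, gold)
```

Treating axioms as exact triples is deliberately strict: a subclass axiom with reversed direction counts as both a false positive and a false negative, as in the example above.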
https://arxiv.org/abs/2512.05594
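As a concrete illustration of the two prompting strategies compared in this paper, the following minimal Python sketch contrasts the Direct and Axiom-by-Axiom (AbA) approaches. The `ask_llm` function is a hypothetical stub standing in for a real LLM call, and the candidate axioms are invented examples, not taken from the benchmark.

```python
# Sketch of Direct vs. Axiom-by-Axiom (AbA) prompting for axiom identification.
# `ask_llm` is a stand-in for a real LLM API call; here it is stubbed so the
# control flow can run deterministically.

def ask_llm(prompt: str) -> str:
    # Hypothetical stub: a real implementation would call an LLM here.
    return "yes" if "subClassOf" in prompt and "Dog" in prompt else "no"

def aba_prompts(candidate_axioms):
    """AbA: one prompt per candidate axiom; the LLM answers yes/no each time."""
    accepted = []
    for s, p, o in candidate_axioms:
        prompt = f"Does the axiom ({s} {p} {o}) hold in this ontology? Answer yes or no."
        if ask_llm(prompt).strip().lower() == "yes":
            accepted.append((s, p, o))
    return accepted

def direct_prompt(candidate_axioms):
    """Direct: a single prompt listing all candidate axioms at once."""
    listing = "\n".join(f"({s} {p} {o})" for s, p, o in candidate_axioms)
    return f"Which of the following axioms hold in this ontology?\n{listing}"

candidates = [
    ("Dog", "subClassOf", "Animal"),
    ("Dog", "disjointWith", "Plant"),
]
print(aba_prompts(candidates))
print(direct_prompt(candidates))
```

The trade-off the paper measures is visible in the structure: AbA issues one focused query per axiom (more calls, higher F1 in their experiments), while Direct packs everything into one prompt.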
In the past decade, there has been a surge in the amount of electronic health record (EHR) data in the United States, attributed to a favorable policy environment created by the Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009 and the 21st Century Cures Act of 2016. Clinical notes for patients' assessments, diagnoses, and treatments are captured in these EHRs as free-form text by physicians, who spend a considerable amount of time entering and editing them. Manually writing clinical notes consumes a considerable amount of a doctor's valuable time, increasing the patient's waiting time and possibly delaying diagnoses. Large language models (LLMs) possess the ability to generate news articles that closely resemble human-written ones. We investigate the use of Chain-of-Thought (CoT) prompt engineering to improve the LLM's responses in clinical note generation. In our prompts, we use International Classification of Diseases (ICD) codes and basic patient information as input. We explore a strategy that combines traditional CoT with semantic search results to improve the quality of generated clinical notes. Additionally, we infuse a knowledge graph (KG) built from a clinical ontology to further enrich the domain-specific knowledge of the generated notes. We test our prompting technique on six clinical cases from the CodiEsp test dataset using GPT-4, and our results show that it outperforms the clinical notes generated by standard one-shot prompts.
Over the past decade, the volume of electronic health record (EHR) data in the United States has grown dramatically, largely thanks to the favorable policy environment created by the Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009 and the 21st Century Cures Act of 2016. Patients' assessments, diagnoses, and treatments are recorded by physicians in EHRs as free-form text, and physicians spend a great deal of time entering and editing this content. Manually writing clinical notes takes up valuable physician time, which increases patient waiting times and may delay diagnoses. Large language models (LLMs) are capable of generating news articles that closely resemble human-written ones. We study the use of Chain-of-Thought (CoT) prompt engineering to improve LLM performance in clinical note generation. Our prompts take International Classification of Diseases (ICD) codes and basic patient information as input. We investigate a strategy that combines traditional CoT with semantic search results to improve the quality of generated clinical notes. In addition, we incorporate a knowledge graph (KG) built from a clinical ontology to further enrich the domain-specific knowledge of the generated notes. We tested our prompting technique on six clinical cases from the CodiEsp test dataset using GPT-4, and the results show that it outperforms clinical notes generated by standard one-shot prompts.
https://arxiv.org/abs/2512.05256
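The prompt-construction strategy this paper describes (ICD codes and patient basics, enriched with semantic-search snippets and KG facts, followed by a CoT instruction) can be sketched as follows. The helper, field names, and retrieval inputs are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch of a CoT prompt enriched with semantic search results and
# clinical-ontology KG facts, as described above. The retrieval sources are
# hypothetical placeholders; a real system would query a search index and a KG.

def build_cot_prompt(icd_codes, patient, search_snippets, kg_facts):
    lines = [
        f"Patient: {patient['age']}-year-old {patient['sex']}.",
        "ICD codes: " + ", ".join(icd_codes),
        "Relevant literature snippets:",
        *[f"- {s}" for s in search_snippets],
        "Clinical ontology facts:",
        *[f"- {f}" for f in kg_facts],
        "Let's think step by step, then write the clinical note.",
    ]
    return "\n".join(lines)

prompt = build_cot_prompt(
    icd_codes=["J18.9"],
    patient={"age": 62, "sex": "male"},
    search_snippets=["Community-acquired pneumonia often presents with fever and cough."],
    kg_facts=["J18.9 is_a pneumonia, unspecified organism."],
)
print(prompt)
```

The resulting string would be sent to the LLM (GPT-4 in the paper) as a single user message; the final CoT cue is what distinguishes it from a plain one-shot prompt.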
We develop a general theory of semantic dynamics for large language models by formalizing them as Continuous State Machines (CSMs): smooth dynamical systems whose latent manifolds evolve under probabilistic transition operators. The associated transfer operator $P: L^2(M,\mu) \to L^2(M,\mu)$ encodes the propagation of semantic mass. Under mild regularity assumptions (compactness, ergodicity, bounded Jacobian), $P$ is compact with discrete spectrum. Within this setting, we prove the Semantic Characterization Theorem (SCT): the leading eigenfunctions of $P$ induce finitely many spectral basins of invariant meaning, each definable in an o-minimal structure over $\mathbb{R}$. Thus spectral lumpability and logical tameness coincide. This explains how discrete symbolic semantics can emerge from continuous computation: the continuous activation manifold collapses into a finite, logically interpretable ontology. We further extend the SCT to stochastic and adiabatic (time-inhomogeneous) settings, showing that slowly drifting kernels preserve compactness, spectral coherence, and basin structure.
We develop a general theory of semantic dynamics for large language models by formalizing them as Continuous State Machines (CSMs): smooth dynamical systems whose latent manifolds evolve under probabilistic transition operators. The associated transfer operator $P: L^2(M,\mu) \to L^2(M,\mu)$ encodes the propagation of semantic mass. Under mild regularity assumptions (compactness, ergodicity, bounded Jacobian), $P$ is compact with discrete spectrum. Within this framework, we prove the Semantic Characterization Theorem (SCT): the leading eigenfunctions of $P$ induce finitely many spectral basins of invariant meaning, each definable in an o-minimal structure over $\mathbb{R}$. Thus, in this setting, spectral lumpability and logical tameness coincide. This explains how discrete symbolic semantics can emerge from continuous computation: the continuous activation manifold collapses into a finite, logically interpretable ontology. We further extend the SCT to stochastic and adiabatic (time-inhomogeneous) settings, showing that slowly drifting kernels preserve compactness, spectral coherence, and basin structure.
https://arxiv.org/abs/2512.05162
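A finite-dimensional toy version of the spectral-basin idea can be computed directly: discretize the transfer operator $P$ as a row-stochastic matrix over a few latent states, then read basins of invariant meaning off the sign pattern of the subdominant eigenvector. The two-block chain below is an invented illustration, not the paper's construction.

```python
import numpy as np

# Two weakly coupled blocks of states (0,1) and (2,3): the chain mixes fast
# within each block and leaks only slowly between blocks, so the spectrum has
# a subdominant eigenvalue near 1 whose eigenvector separates the two basins.
P = np.array([
    [0.89, 0.10, 0.01, 0.00],
    [0.10, 0.89, 0.01, 0.00],
    [0.00, 0.01, 0.89, 0.10],
    [0.00, 0.01, 0.10, 0.89],
])

eigvals, eigvecs = np.linalg.eig(P)      # right eigenvectors of P
order = np.argsort(-eigvals.real)        # sort eigenvalues by real part
lam2 = eigvals.real[order[1]]            # subdominant eigenvalue (here 0.98)
phi2 = eigvecs[:, order[1]].real         # its eigenvector is block-signed

basins = (phi2 > 0).astype(int)          # sign pattern -> basin labels
print(lam2, basins)
```

In the paper's language, the gap between the fast within-block eigenvalues and the slow eigenvalue near 1 is what makes the lumping into finitely many basins well defined.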
Large language models (LLMs) are often deployed as powerful yet opaque systems, leaving open how their internal memory and "self-like" behavior should be governed in a principled and auditable way. The Artificial Age Score (AAS) was previously introduced and mathematically justified through three theorems that characterise it as a metric of artificial memory aging. Building on this foundation, the present work develops an engineering-oriented, clause-based architecture that imposes law-like constraints on LLM memory and control. Twenty selected monads from Leibniz's Monadology are grouped into six bundles: ontology, dynamics, representation and consciousness, harmony and reason, body and organisation, and teleology, and each bundle is realised as an executable specification on top of the AAS kernel. Across six minimal Python implementations, these clause families are instantiated in numerical experiments acting on channel-level quantities such as recall scores, redundancy, and weights. Each implementation follows a four-step pattern: inputs and setup, clause implementation, numerical results, and implications for LLM design, emphasising that the framework is not only philosophically motivated but also directly implementable. The experiments show that the clause system exhibits bounded and interpretable behavior: AAS trajectories remain continuous and rate-limited, contradictions and unsupported claims trigger explicit penalties, and hierarchical refinement reveals an organic structure in a controlled manner. Dual views and goal-action pairs are aligned by harmony terms, and windowed drift in perfection scores separates sustained improvement from sustained degradation. Overall, the monad-based clause framework uses AAS as a backbone and provides a transparent, code-level blueprint for constraining and analyzing internal dynamics in artificial agents.
Large language models (LLMs) are typically deployed as powerful but opaque systems, leaving many open questions about how to govern their internal memory and "self-like" behavior in a principled, auditable way. The Artificial Age Score (AAS) was previously introduced and mathematically justified through three theorems that characterize it as a metric of artificial memory aging. Building on that foundation, the present work develops an engineering-oriented, clause-based architecture that imposes law-like constraints on LLM memory and control. At its core, twenty selected monads from Leibniz's Monadology are grouped into six bundles: ontology, dynamics, representation and consciousness, harmony and reason, body and organisation, and teleology; each bundle is realized as an executable specification on top of the AAS kernel. Across six minimal Python implementations, these clause families are instantiated in numerical experiments operating on channel-level quantities such as recall scores, redundancy, and weights. Each implementation follows a four-step pattern: inputs and setup, clause implementation, numerical results, and implications for LLM design, emphasizing that the framework is not only philosophically motivated but also directly implementable. The experiments show that the clause system exhibits bounded, interpretable behavior: AAS trajectories remain continuous and rate-limited; contradictions and unsupported claims trigger explicit penalties; and hierarchical refinement reveals an organic structure in a controlled manner. Harmony terms align dual views with goal-action pairs, and windowed drift in perfection scores separates sustained improvement from sustained degradation. Overall, the monad-based clause framework uses AAS as a backbone and provides a transparent, code-level blueprint for constraining and analyzing the internal dynamics of artificial agents.
https://arxiv.org/abs/2512.11835
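Two of the clause behaviors reported in this paper (rate-limited, continuous AAS trajectories, and explicit penalties for flagged contradictions) can be sketched in a few lines of Python. The update rule and constants below are illustrative assumptions, not the paper's actual clause definitions.

```python
# Toy sketch of two clause behaviors: a rate-limited AAS trajectory and an
# explicit penalty applied when a contradiction is flagged. The constants
# (max_rate, penalty) are invented for illustration.

def step_aas(aas, target, contradiction=False, max_rate=0.1, penalty=0.5):
    """Move AAS toward `target`, never faster than `max_rate` per step;
    a flagged contradiction adds an explicit penalty on top."""
    delta = max(-max_rate, min(max_rate, target - aas))
    return aas + delta + (penalty if contradiction else 0.0)

trajectory = [0.0]
for t in range(5):
    trajectory.append(step_aas(trajectory[-1], target=1.0))
print(trajectory)  # rises by at most max_rate per step
```

The clamp on `delta` is what keeps the trajectory "continuous and rate-limited" in the paper's sense, while the additive `penalty` term makes violations explicit rather than silently absorbed.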