Knowledge Graph-to-Text (G2T) generation involves verbalizing structured knowledge graphs into natural language text. Recent advancements in Pretrained Language Models (PLMs) have improved G2T performance, but their effectiveness depends on datasets with precise graph-text alignment. However, the scarcity of high-quality, general-domain G2T datasets restricts progress in general-domain G2T generation research. To address this issue, we introduce the Wikipedia Ontology-Free Graph-text dataset (WikiOFGraph), a new large-scale G2T dataset generated with a novel method that leverages a Large Language Model (LLM) and Data-QuestEval. Our new dataset, which contains 5.85M general-domain graph-text pairs, offers high graph-text consistency without relying on external ontologies. Experimental results demonstrate that PLMs fine-tuned on WikiOFGraph outperform those trained on other datasets across various evaluation metrics. Our method proves to be a scalable and effective solution for generating high-quality G2T data, significantly advancing the field of G2T generation.
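A minimal sketch of the generate-then-filter recipe described above: an LLM extracts triples from a Wikipedia passage, and a Data-QuestEval-style consistency score keeps only well-aligned graph-text pairs. The two helper callables and the 0.5 threshold are illustrative assumptions, not the paper's actual prompts or configuration.

```python
from dataclasses import dataclass

@dataclass
class GraphTextPair:
    triples: list          # list of (subject, relation, object) tuples
    text: str
    consistency: float

def build_pairs(passages, llm_extract_triples, data_questeval_score, threshold=0.5):
    """Turn raw Wikipedia passages into graph-text pairs, keeping only pairs whose
    graph-text consistency score clears the threshold."""
    kept = []
    for text in passages:
        triples = llm_extract_triples(text)          # LLM extracts triples from the passage
        if not triples:
            continue
        score = data_questeval_score(triples, text)  # referenceless graph-text consistency
        if score >= threshold:
            kept.append(GraphTextPair(triples, text, score))
    return kept
```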
https://arxiv.org/abs/2409.07088
Vehicles in public traffic that are equipped with Automated Driving Systems are subject to a number of expectations: among other aspects, their behavior should be safe, conform to the rules of the road, and provide mobility to their users. This poses challenges for the developers of such systems, who are responsible for specifying this behavior, for example, in terms of requirements at system design time. As we discuss in this article, this specification always involves assumptions and trade-offs. As a result, insufficiencies in such a behavior specification can occur that can potentially lead to unsafe system behavior. In order to support the identification of specification insufficiencies, requirements and the respective assumptions need to be made explicit. In this article, we propose the Semantic Norm Behavior Analysis as an ontology-based approach to specifying the behavior of a vehicle equipped with an Automated Driving System. We use ontologies to formally represent the specified behavior for a targeted operational environment, and to establish traceability between the specified behavior and the addressed stakeholder needs. Furthermore, we illustrate the application of the Semantic Norm Behavior Analysis in two example scenarios and evaluate our results.
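A toy sketch of the traceability idea, using rdflib to link a specified behavior to a stakeholder need and to an explicit assumption. All class and property names below are invented placeholders, not the paper's actual Semantic Norm Behavior Analysis vocabulary.

```python
from rdflib import Graph, Namespace, Literal, RDF

# Hypothetical namespace; the paper's real ontology IRIs are not reproduced here.
SNBA = Namespace("http://example.org/snba#")
g = Graph()
g.bind("snba", SNBA)

behavior = SNBA.KeepSafeGapOnHighway
need = SNBA.OccupantSafety

g.add((behavior, RDF.type, SNBA.SpecifiedBehavior))
g.add((need, RDF.type, SNBA.StakeholderNeed))
g.add((behavior, SNBA.addresses, need))  # traceability: specified behavior -> stakeholder need
g.add((behavior, SNBA.assumes,
       Literal("Lead vehicle decelerates within physically plausible limits")))

print(g.serialize(format="turtle"))
```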
https://arxiv.org/abs/2409.06607
This paper presents an ontology design along with knowledge engineering, and multilingual semantic reasoning techniques to build an automated system for assimilating culinary information for Indian food in the form of a knowledge graph. The main focus is on designing intelligent methods to derive ontology designs and capture all-encompassing knowledge about food, recipes, ingredients, cooking characteristics, and most importantly, nutrition, at scale. We present our ongoing work in this workshop paper, describe in some detail the relevant challenges in curating knowledge of Indian food, and propose our high-level ontology design. We also present a novel workflow that uses AI, LLM, and language technology to curate information from recipe blog sites in the public domain to build knowledge graphs for Indian food. The methods for knowledge curation proposed in this paper are generic and can be replicated for any domain. The design is application-agnostic and can be used for AI-driven smart analysis, building recommendation systems for Personalized Digital Health, and complementing the knowledge graph for Indian food with contextual information such as user information, food biochemistry, geographic information, agricultural information, etc.
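To make the high-level ontology design more concrete, here is a small illustrative schema (plain Python dataclasses) relating recipes, ingredients, cooking characteristics, and nutrition. The field names and the example recipe are assumptions for illustration and do not reproduce the paper's ontology.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Nutrition:
    calories_kcal: float
    protein_g: float
    carbohydrates_g: float
    fat_g: float

@dataclass
class Ingredient:
    name: str
    nutrition_per_100g: Optional[Nutrition] = None

@dataclass
class Recipe:
    name: str
    cuisine: str                                                   # e.g. a regional Indian cuisine
    ingredients: list = field(default_factory=list)
    cooking_characteristics: list = field(default_factory=list)    # e.g. "pressure-cooked"
    source_url: Optional[str] = None                               # provenance: the public recipe blog

# One curated instance that could later be lifted into knowledge-graph triples.
dal_tadka = Recipe(
    name="Dal Tadka",
    cuisine="North Indian",
    ingredients=[Ingredient("toor dal"), Ingredient("ghee"), Ingredient("cumin seeds")],
    cooking_characteristics=["pressure-cooked", "tempered"],
)
```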
https://arxiv.org/abs/2409.00830
Detecting individuals' depression on social media has become increasingly important. Researchers have employed ML/DL or lexicon-based methods for automated depression detection. Lexicon-based methods, explainable and easy to implement, match words from user posts against a depression dictionary without considering context. While DL models can leverage contextual information, their black-box nature limits their adoption in the domain. Though surrogate models like LIME and SHAP can produce explanations for DL models, the explanations are suited to developers and of limited use to the end user. We propose a Knowledge-infused Neural Network (KiNN) that incorporates domain-specific knowledge from the DepressionFeature ontology (DFO) into a neural network to endow the model with user-level explainability in terms of concepts and processes that clinicians understand. Further, commonsense knowledge from the Commonsense Transformer (COMET) trained on ATOMIC is also infused to account for the generic emotional aspects of user posts in depression detection. The model is evaluated on three expertly curated datasets related to depression. We observed a statistically significant (p<0.1) boost in performance over the best domain-specific model, MentalBERT, on CLEF e-Risk (25% MCC increase, 12% F1 increase). A similar trend is observed on the PRIMATE dataset, where the proposed model performed better than MentalBERT (2.5% MCC increase, 19% F1 increase). The observations confirm that the generated explanations are more informative for MHPs than post hoc model explanations. Results demonstrate that the user-level explainability of KiNN also surpasses that of baseline models and can provide explanations where other baselines fall short. Infusing domain and commonsense knowledge into KiNN enhances the ability of models like GPT-3.5 to generate application-relevant explanations.
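A highly simplified sketch of the knowledge-infusion idea: a post embedding is concatenated with a vector of ontology-concept indicators (e.g., DFO depression features detected in the post) before classification. This is an assumption-laden toy model, not the paper's KiNN architecture or its COMET integration.

```python
import torch
import torch.nn as nn

class ToyKnowledgeInfusedClassifier(nn.Module):
    """Concatenates a text embedding with binary ontology-concept indicators."""

    def __init__(self, text_dim: int = 768, n_concepts: int = 32, n_classes: int = 2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + n_concepts, 128),
            nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, text_emb: torch.Tensor, concept_hits: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, text_dim) from any encoder (e.g., a BERT-style model)
        # concept_hits: (batch, n_concepts) 0/1 flags for ontology concepts found in the post
        return self.head(torch.cat([text_emb, concept_hits], dim=-1))

model = ToyKnowledgeInfusedClassifier()
logits = model(torch.randn(4, 768), torch.randint(0, 2, (4, 32)).float())
```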
https://arxiv.org/abs/2409.02122
Developing novel predictive models with complex biomedical information is challenging due to various idiosyncrasies related to heterogeneity, standardization or sparseness of the data. We previously introduced a person-centric ontology to organize information about individual patients, and a representation learning framework to extract person-centric knowledge graphs (PKGs) and to train Graph Neural Networks (GNNs). In this paper, we propose a systematic approach to examine the results of GNN models trained with both structured and unstructured information from the MIMIC-III dataset. Through ablation studies on different clinical, demographic, and social data, we show the robustness of this approach in identifying predictive features in PKGs for the task of readmission prediction.
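A minimal sketch of readmission prediction over a person-centric graph: one round of neighbor averaging (a plain-torch stand-in for a GNN layer) followed by graph-level pooling and a binary head. Dimensions and the adjacency normalization are illustrative assumptions, not the authors' model.

```python
import torch
import torch.nn as nn

class ToyPKGReadmissionModel(nn.Module):
    def __init__(self, in_dim: int = 16, hid_dim: int = 32):
        super().__init__()
        self.lin = nn.Linear(in_dim, hid_dim)
        self.out = nn.Linear(hid_dim, 1)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (n_nodes, in_dim) node features of one patient's PKG
        # adj: (n_nodes, n_nodes) adjacency with self-loops
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        h = torch.relu(self.lin((adj @ x) / deg))    # mean aggregation over neighbors
        graph_repr = h.mean(dim=0)                   # pool nodes into one patient vector
        return torch.sigmoid(self.out(graph_repr))   # readmission probability

model = ToyPKGReadmissionModel()
x = torch.randn(10, 16)
adj = ((torch.rand(10, 10) > 0.7).float() + torch.eye(10)).clamp(max=1.0)
prob = model(x, adj)
```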
https://arxiv.org/abs/2408.15294
We introduce semantic towers, an extrinsic knowledge representation method, and compare it to the intrinsic knowledge of large language models for ontology learning. Our experiments show a trade-off between performance and semantic grounding for extrinsic knowledge compared to a fine-tuned model's intrinsic knowledge. We report our findings on the Large Language Models for Ontology Learning (LLMs4OL) 2024 challenge.
https://arxiv.org/abs/2408.14236
This paper presents CodeRefine, a novel framework for automatically transforming research paper methodologies into functional code using Large Language Models (LLMs). Our multi-step approach first extracts and summarizes key text chunks from papers, analyzes their code relevance, and creates a knowledge graph using a predefined ontology. Code is then generated from this structured representation and enhanced through a proposed retrospective retrieval-augmented generation approach. CodeRefine addresses the challenge of bridging theoretical research and practical implementation, offering a more accurate alternative to LLM zero-shot prompting. Evaluations on diverse scientific papers demonstrate CodeRefine's ability to improve code implementation from the paper, potentially accelerating the adoption of cutting-edge algorithms in real-world applications.
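A schematic sketch of the multi-step flow described above, with each stage as a pluggable callable. The stage names and the retrospective retrieval step are paraphrased assumptions about the pipeline, not CodeRefine's actual implementation.

```python
def coderefine_like_pipeline(paper_text, summarize, is_code_relevant,
                             build_kg, generate_code, retrieve_context, refine_code):
    """Paper text -> relevant summaries -> knowledge graph -> code -> refined code."""
    chunks = [c.strip() for c in paper_text.split("\n\n") if c.strip()]
    summaries = [summarize(c) for c in chunks]                # 1. extract & summarize key chunks
    relevant = [s for s in summaries if is_code_relevant(s)]  # 2. keep code-relevant parts
    kg = build_kg(relevant)                                   # 3. ontology-guided knowledge graph
    draft = generate_code(kg)                                 # 4. code from the structured view
    evidence = retrieve_context(draft, chunks)                # 5. look back at the source paper
    return refine_code(draft, evidence)                       # 6. retrieval-augmented revision
```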
https://arxiv.org/abs/2408.13366
Explainable AI (XAI) can greatly enhance user trust and satisfaction in AI-assisted decision-making processes. Recent findings suggest that a single explainer may not meet the diverse needs of multiple users in an AI system; indeed, even individual users may require multiple explanations. This highlights the necessity for a "multi-shot" approach, employing a combination of explainers to form what we introduce as an "explanation strategy". Tailored to a specific user or user group, an "explanation experience" describes interactions with personalised strategies designed to enhance their AI decision-making processes. The iSee platform is designed for the intelligent sharing and reuse of explanation experiences, using Case-based Reasoning to advance best practices in XAI. The platform provides tools that enable AI system designers, i.e. design users, to design and iteratively revise the most suitable explanation strategy for their AI system to satisfy end-user needs. All knowledge generated within the iSee platform is formalised by the iSee ontology for interoperability. We use a summative mixed-methods study protocol to evaluate the usability and utility of the iSee platform with six design users across varying levels of AI and XAI expertise. Our findings confirm that the iSee platform generalises effectively across applications and has the potential to promote the adoption of XAI best practices.
https://arxiv.org/abs/2408.12941
Transcriptome foundation models (TFMs) hold great promise for deciphering the transcriptomic language that dictates diverse cell functions via self-supervised learning on large-scale single-cell gene expression data, ultimately unraveling the complex mechanisms of human diseases. However, current TFMs treat cells as independent samples and ignore the taxonomic relationships between cell types, which are available in cell ontology graphs. We argue that effectively leveraging this ontology information during TFM pre-training can improve the learning of biologically meaningful gene co-expression patterns while preserving the TFM as a general-purpose foundation model for downstream zero-shot and fine-tuning tasks. To this end, we present scCello, a single-cell, cell-ontology guided TFM. We introduce a cell-type coherence loss and an ontology alignment loss, which are minimized along with the masked gene expression prediction loss during pre-training. These novel loss components guide scCello to learn cell-type-specific representations and the structural relations between cell types from the cell ontology graph, respectively. We pre-trained scCello on 22 million cells from the CellxGene database, leveraging their cell-type labels mapped to the cell ontology graph from the Open Biological and Biomedical Ontology Foundry. Our TFM demonstrates competitive generalization and transferability over existing TFMs on biologically important tasks, including identifying novel cell types of unseen cells, predicting cell-type-specific marker genes, and predicting cancer drug responses.
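A rough sketch of how the three pre-training objectives could be combined. The coherence and alignment terms below are simple illustrative surrogates (supervised cell-type prediction and matching cell embeddings to ontology-derived type embeddings), not scCello's exact loss definitions, and the weights are placeholders.

```python
import torch
import torch.nn.functional as F

def toy_combined_loss(pred_expr, true_expr, mask,
                      cell_emb, type_logits, type_labels, type_ontology_emb,
                      w_coherence=1.0, w_align=1.0):
    # Masked gene expression prediction: only masked positions contribute.
    mge_loss = F.mse_loss(pred_expr[mask], true_expr[mask])

    # Cell-type coherence (surrogate): cells should be predictable as their annotated type.
    coherence_loss = F.cross_entropy(type_logits, type_labels)

    # Ontology alignment (surrogate): cell embeddings should stay close to the
    # embedding of their cell-ontology node.
    align_loss = 1.0 - F.cosine_similarity(
        cell_emb, type_ontology_emb[type_labels], dim=-1).mean()

    return mge_loss + w_coherence * coherence_loss + w_align * align_loss

# Shapes: 8 cells, 2000 genes, 64-dim embeddings, 10 cell-ontology types.
loss = toy_combined_loss(
    pred_expr=torch.randn(8, 2000), true_expr=torch.randn(8, 2000),
    mask=torch.rand(8, 2000) < 0.15,
    cell_emb=torch.randn(8, 64), type_logits=torch.randn(8, 10),
    type_labels=torch.randint(0, 10, (8,)), type_ontology_emb=torch.randn(10, 64),
)
```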
https://arxiv.org/abs/2408.12373
Anomaly detection is a fundamental yet challenging problem with practical applications in industry. Current approaches neglect the higher-order dependencies within the networks of interconnected sensors in high-dimensional time series (multi-sensor data) for anomaly detection. To this end, we present a self-adapting anomaly detection framework for jointly learning (a) a discrete hypergraph structure and (b) the temporal trends and spatial relations among the interdependent sensors, using a hierarchical encoder-decoder architecture to overcome these challenges. The hypergraph representation learning-based framework exploits the relational inductive biases in the hypergraph-structured data to learn pointwise single-step-ahead forecasts through a self-supervised autoregressive task and predicts anomalies based on the forecast error. Furthermore, our framework incentivizes learning the anomaly-diagnosis ontology through a differentiable approach. It derives anomaly-information-propagation-based computational hypergraphs for root cause analysis and provides recommendations through an offline, optimal predictive control policy to remedy an anomaly. We conduct extensive experiments on benchmark datasets for fair and rigorous comparison with popular baselines. The proposed method outperforms the baseline models and achieves SOTA performance. We report ablation studies to support the efficacy of the framework.
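The anomaly-scoring step itself is straightforward to illustrate: flag timesteps whose forecast error exceeds a threshold calibrated on validation residuals. This numpy sketch covers only that final scoring stage under assumed inputs, not the hypergraph structure learning or the encoder-decoder forecaster.

```python
import numpy as np

def forecast_error_anomalies(y_true, y_pred, val_errors, quantile=0.99):
    """Flag timesteps whose sensor-averaged forecast error is unusually large.

    y_true, y_pred: (timesteps, sensors) observations vs. single-step-ahead forecasts
    val_errors:     forecast errors collected on anomaly-free validation data
    """
    err = np.abs(y_true - y_pred).mean(axis=1)      # aggregate error over sensors
    threshold = np.quantile(val_errors, quantile)   # calibrate the threshold on clean data
    return err > threshold, err

rng = np.random.default_rng(0)
y_pred = rng.normal(size=(200, 8))
y_true = y_pred + rng.normal(scale=0.1, size=(200, 8))
y_true[150] += 3.0                                  # injected anomaly at timestep 150
val_errors = np.abs(rng.normal(scale=0.1, size=500))
flags, err = forecast_error_anomalies(y_true, y_pred, val_errors)
print(np.where(flags)[0])                           # expected to include timestep 150
```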
https://arxiv.org/abs/2408.11359
To address the challenge of automating knowledge discovery from a vast volume of literature, in this paper, we introduce a novel framework based on large language models (LLMs) that combines a progressive ontology prompting (POP) algorithm with a dual-agent system, named LLM-Duo, designed to enhance the automation of knowledge extraction from scientific articles. The POP algorithm utilizes a prioritized breadth-first search (BFS) across a predefined ontology to generate structured prompt templates and action orders, thereby guiding LLMs to discover knowledge in an automatic manner. Additionally, our LLM-Duo employs two specialized LLM agents: an explorer and an evaluator. These two agents work collaboratively and adversarially to enhance the reliability of the discovery and annotation processes. Experiments demonstrate that our method outperforms advanced baselines, enabling more accurate and complete annotations. To validate the effectiveness of our method in real-world scenarios, we employ our method in a case study of speech-language intervention discovery. Our method identifies 2,421 interventions from 64,177 research articles in the speech-language therapy domain. We curate these findings into a publicly accessible intervention knowledge base that holds significant potential to benefit the speech-language therapy community.
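A condensed sketch of a prioritized breadth-first traversal over an ontology that emits one prompt per visited concept. The priority function and prompt template are placeholders, not the paper's actual POP templates or the explorer/evaluator agent prompts.

```python
import heapq

def pop_prompts(ontology, root, priority, template):
    """Prioritized BFS over `ontology` (concept -> list of child concepts),
    yielding (concept, prompt) in priority order, each concept visited once."""
    visited = {root}
    frontier = [(priority(root), root)]
    while frontier:
        _, concept = heapq.heappop(frontier)
        yield concept, template.format(concept=concept)
        for child in ontology.get(concept, []):
            if child not in visited:
                visited.add(child)
                heapq.heappush(frontier, (priority(child), child))

toy_ontology = {"Intervention": ["BehaviouralIntervention", "DeviceIntervention"],
                "BehaviouralIntervention": ["SpeechTherapy"]}
for concept, prompt in pop_prompts(
        toy_ontology, "Intervention", priority=len,
        template="List evidence in the article for the concept '{concept}'."):
    print(prompt)
```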
https://arxiv.org/abs/2409.00054
The cutting edge of applying AI to science is the closed-loop automation of scientific research: robot scientists. We have previously developed two robot scientists: 'Adam' (for yeast functional biology) and 'Eve' (for early-stage drug design). We are now developing a next-generation robot scientist, Genesis. With Genesis we aim to demonstrate that an area of science can be investigated using robot scientists unambiguously faster, and at lower cost, than with human scientists. Here we report progress on the Genesis project. Genesis is designed to automatically improve systems biology models with thousands of interacting causal components. When complete, Genesis will be able to initiate and execute in parallel one thousand hypothesis-led closed-loop cycles of experiment per day. Here we describe the core Genesis hardware: the one thousand computer-controlled µ-bioreactors. For the integrated mass spectrometry platform we have developed AutonoMS, a system to automatically run, process, and analyse high-throughput experiments. We have also developed Genesis-DB, a database system designed to give software agents access to large quantities of structured domain information. We have developed RIMBO (Revisions for Improvements of Models in Biology Ontology) to describe the planned hundreds of thousands of changes to the models. We have demonstrated the utility of this infrastructure by developing two relational learning bioinformatics projects. Finally, we describe LGEM+, a relational learning system for the automated abductive improvement of genome-scale metabolic models.
https://arxiv.org/abs/2408.10689
The NFDI4DataScience (NFDI4DS) project aims to enhance the accessibility and interoperability of research data within Data Science (DS) and Artificial Intelligence (AI) by connecting digital artifacts and ensuring they adhere to FAIR (Findable, Accessible, Interoperable, and Reusable) principles. To this end, this poster introduces the NFDI4DS Ontology, which describes resources in DS and AI and models the structure of the NFDI4DS consortium. Built upon the NFDICore ontology and mapped to the Basic Formal Ontology (BFO), this ontology serves as the foundation for the NFDI4DS knowledge graph currently under development.
https://arxiv.org/abs/2408.08698
We introduce ontology-mediated planning, in which planning problems are combined with an ontology. Our formalism differs from existing ones in that we focus on a strong separation between the formalisms for describing planning problems and ontologies, which are only loosely coupled by an interface. Moreover, we present a black-box algorithm that supports the full expressive power of OWL DL. This goes beyond what existing approaches combining automated planning with ontologies can do, as they only support limited description logics such as DL-Lite and Horn description logics. Our main algorithm relies on rewritings of the ontology-mediated planning specifications into PDDL, so that existing planning systems can be used to solve them. The algorithm relies on justifications, which allows for a generic approach that is independent of the expressivity of the ontology language. However, dedicated optimizations for computing justifications need to be implemented to enable an efficient rewriting procedure. We evaluated our implementation on benchmark sets from several domains. The evaluation shows that our procedure works in practice and that tailoring the reasoning procedure has a significant impact on performance.
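A high-level sketch of the rewriting loop the abstract describes: justifications for relevant ontology consequences are compiled into plain PDDL, and an off-the-shelf planner is invoked on the rewritten problem. All four helpers are hypothetical placeholders; the actual rewriting in the paper is substantially more involved.

```python
def ontology_mediated_plan(planning_spec, ontology,
                           relevant_consequences, compute_justifications,
                           rewrite_to_pddl, run_planner):
    """Rewrite an ontology-mediated planning spec to plain PDDL, then plan."""
    # 1. Which ontology consequences can influence the planning problem?
    consequences = relevant_consequences(planning_spec, ontology)
    # 2. Explain each consequence by its justifications (minimal axiom sets).
    justifications = {c: compute_justifications(ontology, c) for c in consequences}
    # 3. Compile spec + justifications into standard PDDL domain/problem descriptions.
    domain_pddl, problem_pddl = rewrite_to_pddl(planning_spec, justifications)
    # 4. Any existing planner can now solve the rewritten problem.
    return run_planner(domain_pddl, problem_pddl)
```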
https://arxiv.org/abs/2408.07544
Ontology alignment is integral to achieving semantic interoperability as the number of available ontologies covering intersecting domains is increasing. This paper proposes OWL2Vec4OA, an extension of the ontology embedding system OWL2Vec*. While OWL2Vec* has emerged as a powerful technique for ontology embedding, it currently lacks a mechanism to tailor the embedding to the ontology alignment task. OWL2Vec4OA incorporates edge confidence values from seed mappings to guide the random walk strategy. We present the theoretical foundations, implementation details, and experimental evaluation of our proposed extension, demonstrating its potential effectiveness for ontology alignment tasks.
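A small sketch of a confidence-biased random walk: edges carrying higher confidence from seed mappings are chosen proportionally more often, which is the intuition behind steering the walks. The graph encoding and the proportional-sampling rule are assumptions; OWL2Vec4OA's actual walk strategy may differ.

```python
import random

def confidence_biased_walk(graph, start, length, rng=random):
    """graph: node -> list of (neighbor, confidence); returns a walk as a node list."""
    walk = [start]
    node = start
    for _ in range(length):
        edges = graph.get(node, [])
        if not edges:
            break
        neighbors = [n for n, _ in edges]
        weights = [max(c, 1e-6) for _, c in edges]   # confidence values from seed mappings
        node = rng.choices(neighbors, weights=weights, k=1)[0]
        walk.append(node)
    return walk

toy = {"ex:Heart": [("ex:Organ", 0.9), ("ex:Muscle", 0.3)],
       "ex:Organ": [("ex:AnatomicalStructure", 0.8)]}
print(confidence_biased_walk(toy, "ex:Heart", length=3, rng=random.Random(7)))
```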
https://arxiv.org/abs/2408.06310
Past ontology requirements engineering (ORE) has primarily relied on manual methods, such as interviews and collaborative forums, to gather user requirements from domain experts, especially in large projects. Current OntoChat offers a framework for ORE that utilises large language models (LLMs) to streamline the process through four key functions: user story creation, competency question (CQ) extraction, CQ filtration and analysis, and ontology testing support. In OntoChat, users are expected to prompt the chatbot to generate user stories. However, preliminary evaluations revealed that they struggle to do this effectively. To address this issue, we experimented with a research method called participatory prompting, which involves researcher-mediated interactions to help users without deep knowledge of LLMs use the chatbot more effectively. The participatory prompting user study produces pre-defined prompt templates based on user queries, focusing on creating and refining personas, goals, scenarios, sample data, and data resources for user stories. These refined user stories will subsequently be converted into CQs.
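For concreteness, here is one way a pre-defined user-story prompt template of the kind mentioned above could look; the wording and slots (persona, goal, scenario, sample data, data resources) are illustrative assumptions, not the study's actual templates.

```python
USER_STORY_TEMPLATE = (
    "You are helping an ontology engineer write a user story.\n"
    "Persona: {persona}\n"
    "Goal: {goal}\n"
    "Scenario: {scenario}\n"
    "Sample data: {sample_data}\n"
    "Data resources: {data_resources}\n"
    "Write a short user story from the persona's point of view that can later be "
    "turned into competency questions."
)

prompt = USER_STORY_TEMPLATE.format(
    persona="Maria, a musicologist with no ontology background",
    goal="compare how composers reuse melodic motifs",
    scenario="she studies a corpus of annotated folk tunes",
    sample_data="tune id, composer, motif label, year",
    data_resources="a public folk-music annotation dataset",
)
print(prompt)
```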
https://arxiv.org/abs/2408.15256
The explosive growth of data fuels data-driven research, facilitating progress across diverse domains. The FAIR principles have emerged as a guiding standard, aiming to enhance the findability, accessibility, interoperability, and reusability of data. However, current efforts primarily focus on manual data FAIRification, which can only handle targeted data and lacks efficiency. To address this issue, we propose AutoFAIR, an architecture designed to enhance data FAIRness automatically. First, we align each data and metadata operation with specific FAIR indicators to guide machine-executable actions. Then, we utilize Web Reader to automatically extract metadata based on language models, even in the absence of structured data webpage schemas. Subsequently, FAIR Alignment is employed to make metadata comply with the FAIR principles through ontology guidance and semantic matching. Finally, by applying AutoFAIR to various data, especially in the field of mountain hazards, we observe significant improvements in the findability, accessibility, interoperability, and reusability of the data. The FAIRness scores before and after applying AutoFAIR indicate enhanced data value.
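A toy sketch of the FAIR Alignment step described above, mapping extracted metadata keys onto a small controlled vocabulary by fuzzy string matching (standard-library difflib as a stand-in for the ontology-guided semantic matching). The vocabulary and cutoff are assumptions, not AutoFAIR's implementation.

```python
import difflib

FAIR_VOCABULARY = ["title", "creator", "license", "spatial_coverage", "temporal_coverage"]

def align_metadata(raw_metadata, vocabulary=FAIR_VOCABULARY, cutoff=0.6):
    """Map raw metadata keys onto controlled vocabulary terms (or keep them unmapped)."""
    aligned = {}
    for key, value in raw_metadata.items():
        matches = difflib.get_close_matches(key.lower(), vocabulary, n=1, cutoff=cutoff)
        aligned[matches[0] if matches else key] = value
    return aligned

extracted = {"Title": "Debris-flow events 2010-2020", "Licence": "CC-BY-4.0",
             "spatialCoverage": "Hengduan Mountains"}
print(align_metadata(extracted))
```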
https://arxiv.org/abs/2408.04673
While large language models learn sound statistical representations of language and the information therein, ontologies are symbolic knowledge representations that can ideally complement the former. Research at this critical intersection relies on datasets that intertwine ontologies and text corpora to enable training and comprehensive benchmarking of neurosymbolic models. We present the MaterioMiner dataset and the linked materials mechanics ontology, where ontological concepts from the mechanics-of-materials domain are associated with textual entities within the literature corpus. Another distinctive feature of the dataset is its exceptionally fine-grained annotation. Specifically, 179 distinct classes were manually annotated by three raters within four publications, amounting to a total of 2,191 annotated and curated entities. Conceptual work is presented for the symbolic representation of causal composition-process-microstructure-property relationships. We explore the annotation consistency between the three raters and fine-tune pre-trained models to showcase the feasibility of named-entity recognition model training. Reusing the dataset can foster training and benchmarking of materials language models, automated ontology construction, and knowledge graph generation from textual data.
https://arxiv.org/abs/2408.04661
State-of-the-art task-oriented dialogue systems typically rely on task-specific ontologies for fulfilling user queries. The majority of task-oriented dialogue data, such as customer service recordings, comes without an ontology or annotation. Such ontologies are normally built manually, limiting the application of specialised systems. Dialogue ontology construction is an approach for automating that process and typically consists of two steps: term extraction and relation extraction. In this work, we focus on relation extraction in a transfer learning set-up. To improve generalisation, we propose an extension to the decoding mechanism of large language models. We adapt Chain-of-Thought (CoT) decoding, recently developed for reasoning problems, to generative relation extraction. Here, we generate multiple branches in the decoding space and select the relations based on a confidence threshold. By constraining the decoding to ontology terms and relations, we aim to decrease the risk of hallucination. We conduct extensive experimentation on two widely used datasets and find improvements in performance on the target ontology for both source-fine-tuned and one-shot-prompted large language models.
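A compact sketch of the branch-and-threshold idea: several decoding branches each propose a (head, relation, tail) candidate with a confidence score, candidates are constrained to the ontology's term and relation sets, and only those above the threshold are kept. The branch generator is a hypothetical callable; the paper's adaptation of CoT decoding operates on token-level probabilities inside the LLM.

```python
def select_relations(generate_branches, dialogue, ontology_terms, ontology_relations,
                     n_branches=5, confidence_threshold=0.7):
    """generate_branches(dialogue, k) -> list of ((head, relation, tail), confidence)."""
    selected = set()
    for (head, rel, tail), confidence in generate_branches(dialogue, n_branches):
        in_ontology = (head in ontology_terms and tail in ontology_terms
                       and rel in ontology_relations)    # constrain to the known vocabulary
        if in_ontology and confidence >= confidence_threshold:
            selected.add((head, rel, tail))
    return selected
```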
https://arxiv.org/abs/2408.02361
Anomaly detection in manufacturing pipelines remains a critical challenge, intensified by the complexity and variability of industrial environments. This paper introduces AssemAI, an interpretable image-based anomaly detection system tailored for smart manufacturing pipelines. Our primary contributions include the creation of a tailored image dataset and the development of a custom object detection model, YOLO-FF, designed explicitly for anomaly detection in manufacturing assembly environments. Utilizing the preprocessed image dataset derived from an industry-focused rocket assembly pipeline, we address the challenge of imbalanced image data and demonstrate the importance of image-based methods in anomaly detection. The proposed approach leverages domain knowledge in data preparation, model development and reasoning. We compare our method against several baselines, including simple CNN and custom Visual Transformer (ViT) models, showcasing the effectiveness of our custom data preparation and pretrained CNN integration. Additionally, we incorporate explainability techniques at both user and model levels, utilizing ontology for user-friendly explanations and SCORE-CAM for in-depth feature and model analysis. Finally, the model was also deployed in a real-time setting. Our results include ablation studies on the baselines, providing a comprehensive evaluation of the proposed system. This work highlights the broader impact of advanced image-based anomaly detection in enhancing the reliability and efficiency of smart manufacturing processes.
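A small sketch of the user-level explanation step: a detected anomaly class from the object detector is looked up in an ontology-like mapping to produce a human-readable explanation. The class names and explanation text are invented for illustration and are not the AssemAI ontology.

```python
# Hypothetical anomaly-class -> explanation mapping standing in for the ontology lookup.
ANOMALY_ONTOLOGY = {
    "missing_nose_cone": "The nose cone was not detected at the expected assembly stage.",
    "misaligned_body_tube": "The body tube deviates from its nominal mounting position.",
}

def explain_detection(detected_class: str, confidence: float) -> str:
    description = ANOMALY_ONTOLOGY.get(detected_class, "Unrecognized anomaly type.")
    return f"{detected_class} (confidence {confidence:.2f}): {description}"

print(explain_detection("missing_nose_cone", 0.91))
```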
https://arxiv.org/abs/2408.02181