Using generative Artificial Intelligence (AI), we transformed a set of 1,000 scientific papers in the area of biological materials into detailed ontological knowledge graphs, revealing their inherently scale-free nature. Using graph traversal path detection between dissimilar concepts based on combinatorial ranking of node similarity and betweenness centrality, we reveal deep insights into unprecedented interdisciplinary relationships that can be used to answer queries, identify gaps in knowledge, and propose never-before-seen material designs and their behaviors. One comparison revealed detailed structural parallels between biological materials and Beethoven's 9th Symphony, highlighting shared patterns of complexity through isomorphic mapping. The algorithm further created an innovative hierarchical mycelium-based composite that incorporates joint synthesis of graph sampling with principles extracted from Kandinsky's Composition VII painting, where the resulting composite reflects a balance of chaos and order, with features like adjustable porosity, mechanical strength, and complex patterned chemical functionalization. We uncover other isomorphisms across physical, biological, and artistic spheres, revealing a nuanced ontology of immanence and material flux that resonates with postmodern philosophy, and positions these interconnections within a heterarchical framework. Our findings reveal the dynamic, context-dependent interplay of entities beyond traditional hierarchical paradigms, emphasizing the significant role of individual components and their fluctuative relationships within the system. Our predictions achieve a far higher degree of novelty, technical detail and explorative capacity than conventional generative AI methods. The approach establishes a widely useful framework for innovation by revealing hidden connections that facilitate discovery.
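The combinatorial ranking described above can be sketched on a toy graph. This is a minimal illustration, not the paper's algorithm: the concept names, the Jaccard similarity proxy, and the `bridge - sim` scoring rule are all assumptions made for this sketch.

```python
import networkx as nx

# Toy concept graph: a chain from a biological-materials concept to a
# musical one (node names are illustrative, not from the paper's KG).
G = nx.Graph()
G.add_edges_from([
    ("collagen", "hierarchical structure"),
    ("hierarchical structure", "toughness"),
    ("toughness", "energy dissipation"),
    ("energy dissipation", "musical dynamics"),
    ("musical dynamics", "symphony"),
])

bc = nx.betweenness_centrality(G)

def jaccard_similarity(g, u, v):
    """Neighborhood overlap as a crude node-similarity proxy."""
    nu, nv = set(g[u]), set(g[v])
    return len(nu & nv) / len(nu | nv) if nu | nv else 0.0

# Rank node pairs: prefer dissimilar endpoints bridged by central nodes.
pairs = []
for u in G:
    for v in G:
        if u < v:
            sim = jaccard_similarity(G, u, v)
            bridge = max(bc[n] for n in nx.shortest_path(G, u, v))
            pairs.append((bridge - sim, u, v))

score, u, v = max(pairs)
print(u, "->", v, ":", nx.shortest_path(G, u, v))
```

The traversal path between the top-ranked dissimilar pair is then the raw material handed to the generative model for analogy construction.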
https://arxiv.org/abs/2403.11996
Difficulties in replicating and reproducing empirical evidence in machine learning research have become a prominent topic in recent years. Ensuring that machine learning research results are sound and reliable requires reproducibility, which verifies the reliability of research findings using the same code and data. This promotes open and accessible research, robust experimental workflows, and the rapid integration of new findings. Evaluating the degree to which research publications support these different aspects of reproducibility is one goal of the present work. For this we introduce an ontology of reproducibility in machine learning and apply it to methods for graph neural networks. Building on these efforts we turn towards another critical challenge in machine learning, namely the curse of dimensionality, which poses challenges in data collection, representation, and analysis, making it harder to find representative data and impeding the training and inference processes. Using the closely linked concept of geometric intrinsic dimension, we investigate to what extent the machine learning models used are influenced by the intrinsic dimension of the data sets they are trained on.
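The abstract does not name a specific estimator, so as a hedged illustration of geometric intrinsic dimension, here is the Two-NN maximum-likelihood estimator (one common choice) applied to synthetic data whose intrinsic dimension is 3 by construction:

```python
import numpy as np

def two_nn_intrinsic_dim(X):
    """Two-NN MLE: d = n / sum(log(r2/r1)) over all points."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)      # exclude self-distances
    D.sort(axis=1)
    mu = D[:, 1] / D[:, 0]           # ratio of 2nd to 1st NN distance
    return len(mu) / np.sum(np.log(mu))

rng = np.random.default_rng(0)
# 3-D Gaussian data zero-padded into a 10-D ambient space.
X = np.hstack([rng.normal(size=(500, 3)), np.zeros((500, 7))])
print(round(two_nn_intrinsic_dim(X), 1))
```

The estimate lands near 3 despite the 10-D ambient representation, which is exactly the gap between ambient and intrinsic dimension the paper studies.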
https://arxiv.org/abs/2403.08438
The conventional process of building Ontologies and Knowledge Graphs (KGs) heavily relies on human domain experts to define entities and relationship types, establish hierarchies, maintain relevance to the domain, fill the ABox (or populate with instances), and ensure data quality (including amongst others accuracy and completeness). On the other hand, Large Language Models (LLMs) have recently gained popularity for their ability to understand and generate human-like natural language, offering promising ways to automate aspects of this process. This work explores the (semi-)automatic construction of KGs facilitated by open-source LLMs. Our pipeline involves formulating competency questions (CQs), developing an ontology (TBox) based on these CQs, constructing KGs using the developed ontology, and evaluating the resultant KG with minimal to no involvement of human experts. We showcase the feasibility of our semi-automated pipeline by creating a KG on deep learning methodologies by exploiting scholarly publications. To evaluate the answers generated via Retrieval-Augmented-Generation (RAG) as well as the KG concepts automatically extracted using LLMs, we design a judge LLM, which rates the generated content based on ground truth. Our findings suggest that employing LLMs could potentially reduce the human effort involved in the construction of KGs, although a human-in-the-loop approach is recommended to evaluate automatically generated KGs.
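A drastically simplified sketch of the CQ -> TBox -> KG pipeline described above; `call_llm` is a stub standing in for an open-source LLM, and the schema, prompt, and triple format are illustrative, not taken from the paper:

```python
def call_llm(prompt):
    # Stub: a real pipeline would call a local open-source model here.
    return "(DeepLearningMethod, usesArchitecture, Transformer)"

competency_questions = ["Which architecture does a deep learning method use?"]

# TBox derived from the CQs: classes and relation types only.
tbox = {"classes": {"DeepLearningMethod", "Architecture"},
        "relations": {"usesArchitecture"}}

# ABox population: extract triples from text, keep only on-schema ones.
def extract_triples(text):
    raw = call_llm(f"Extract (subject, relation, object) triples from: {text}")
    s, p, o = (t.strip() for t in raw.strip("()").split(","))
    return [(s, p, o)] if p in tbox["relations"] else []

kg = extract_triples("BERT is a deep learning method using the Transformer.")
print(kg)
```

Filtering extractions against the TBox is one simple way to keep the automatically populated ABox on-schema before the judge LLM scores it.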
https://arxiv.org/abs/2403.08345
As digital healthcare evolves, the security of electronic health records (EHR) becomes increasingly crucial. This study presents the GPT-Onto-CAABAC framework, integrating Generative Pretrained Transformer (GPT), medical-legal ontologies and Context-Aware Attribute-Based Access Control (CAABAC) to enhance EHR access security. Unlike traditional models, GPT-Onto-CAABAC dynamically interprets policies and adapts to changing healthcare and legal environments, offering customized access control solutions. Through empirical evaluation, this framework is shown to be effective in improving EHR security by accurately aligning access decisions with complex regulatory and situational requirements. The findings suggest its broader applicability in sectors where access control must meet stringent compliance and adaptability standards.
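The attribute-based half of the framework can be illustrated with a plain policy-evaluation function; the attributes, the break-glass rule, and the policy itself are hypothetical, and the GPT/ontology policy-interpretation step is omitted:

```python
def evaluate_policy(subject, resource, context):
    """Grant access only when role, resource type, and context all align."""
    rules = [
        subject["role"] in {"physician", "nurse"},
        resource["type"] == "EHR",
        subject["department"] == resource["department"]
        or context.get("emergency", False),   # break-glass override
    ]
    return all(rules)

request = {
    "subject": {"role": "physician", "department": "cardiology"},
    "resource": {"type": "EHR", "department": "oncology"},
    "context": {"emergency": True},
}
print(evaluate_policy(**request))  # emergency context overrides dept mismatch
```

In the full framework, the rule list would not be hard-coded: the GPT and ontology layers would interpret policy and regulation text to produce the context-dependent conditions evaluated here.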
https://arxiv.org/abs/2403.08264
Ontology engineering (OE) in large projects poses a number of challenges arising from the heterogeneous backgrounds of the various stakeholders, domain experts, and their complex interactions with ontology designers. This multi-party interaction often creates systematic ambiguities and biases from the elicitation of ontology requirements, which directly affect the design, evaluation and may jeopardise the target reuse. Meanwhile, current OE methodologies strongly rely on manual activities (e.g., interviews, discussion pages). After collecting evidence on the most crucial OE activities, we introduce OntoChat, a framework for conversational ontology engineering that supports requirement elicitation, analysis, and testing. By interacting with a conversational agent, users can steer the creation of user stories and the extraction of competency questions, while receiving computational support to analyse the overall requirements and test early versions of the resulting ontologies. We evaluate OntoChat by replicating the engineering of the Music Meta Ontology, and collecting preliminary metrics on the effectiveness of each component from users. We release all code at this https URL.
https://arxiv.org/abs/2403.05921
Deep phenotyping is the detailed description of patient signs and symptoms using concepts from an ontology. The deep phenotyping of the numerous physician notes in electronic health records requires high throughput methods. Over the past thirty years, progress has been made toward making high throughput phenotyping feasible. In this study, we demonstrate that a large language model and a hybrid NLP model (combining word vectors with a machine learning classifier) can perform high throughput phenotyping on physician notes with high accuracy. Large language models will likely emerge as the preferred method for high throughput deep phenotyping of physician notes.
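A minimal sketch of the hybrid idea (vector representation plus a classifier): snippets of physician notes are mapped to bag-of-words vectors and matched against per-concept centroids. Real systems use trained word embeddings and full HPO terms; the vocabulary, notes, and concept labels here are illustrative.

```python
import numpy as np

vocab = ["seizure", "tonic", "clonic", "murmur", "systolic", "gait"]
concepts = {"HP:0001250": "Seizure", "HP:0030148": "Heart murmur"}

def bow(text):
    toks = text.lower().split()
    return np.array([toks.count(w) for w in vocab], dtype=float)

# "Training": one centroid per phenotype concept (stand-in for a
# trained classifier over word vectors).
centroids = {
    "HP:0001250": bow("tonic clonic seizure activity noted"),
    "HP:0030148": bow("systolic murmur on auscultation"),
}

def phenotype(note):
    v = bow(note)
    scores = {c: float(v @ u) for c, u in centroids.items()}
    return max(scores, key=scores.get)

print(concepts[phenotype("patient had a brief tonic seizure this morning")])
```

Swapping the bag-of-words step for pretrained word vectors and the centroid match for a learned classifier gives the hybrid NLP model the abstract describes.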
https://arxiv.org/abs/2403.05920
Digital transformation in the built environment generates vast data for developing data-driven models to optimize building operations. This study presents an integrated solution utilizing edge computing, digital twins, and deep learning to enhance the understanding of climate in buildings. Parametric digital twins, created using an ontology, ensure consistent data representation across diverse service systems equipped by different buildings. Based on created digital twins and collected data, deep learning methods are employed to develop predictive models for identifying patterns in indoor climate and providing insights. Both the parametric digital twin and deep learning models are deployed on edge for low latency and privacy compliance. As a demonstration, a case study was conducted in a historic building in Östergötland, Sweden, to compare the performance of five deep learning architectures. The results indicate that the time-series dense encoder model exhibited strong competitiveness in performing multi-horizon forecasts of indoor temperature and relative humidity with low computational costs.
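As a hedged stand-in for the time-series dense encoder, a linear least-squares map from a past window to a multi-step horizon shows the shape of the multi-horizon forecasting task on synthetic indoor-temperature data:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(2000)
# Synthetic indoor temperature: daily-like cycle plus sensor noise.
series = 21 + 2 * np.sin(2 * np.pi * t / 144) + 0.1 * rng.normal(size=t.size)

lookback, horizon = 48, 12
n = len(series) - lookback - horizon
X = np.stack([series[i:i + lookback] for i in range(n)])
Y = np.stack([series[i + lookback:i + lookback + horizon] for i in range(n)])

W, *_ = np.linalg.lstsq(X, Y, rcond=None)  # one linear readout per step
pred = X[-1] @ W
mae = np.mean(np.abs(pred - Y[-1]))
print(f"MAE over {horizon}-step horizon: {mae:.2f} degC")
```

The dense-encoder model in the study replaces the linear map with stacked dense layers, but the window-to-horizon framing and the low computational cost argument carry over.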
https://arxiv.org/abs/2403.04326
Existing approaches on zero-shot event detection usually train models on datasets annotated with known event types, and prompt them with unseen event definitions. These approaches yield sporadic successes, yet generally fall short of expectations. In this work, we aim to improve zero-shot event detection by training models to better follow event definitions. We hypothesize that a diverse set of event types and definitions are the key for models to learn to follow event definitions, while existing event extraction datasets focus on annotating many high-quality examples for a few event types. To verify our hypothesis, we construct an automatically generated Diverse Event Definition (DivED) dataset and conduct comparative studies. Our experiments reveal that a large number of event types (200) and diverse event definitions can significantly boost event extraction performance; on the other hand, performance does not continue to scale beyond ten examples per event type. Beyond scaling, we incorporate event ontology information and hard-negative samples during training, further boosting the performance. Based on these findings, we fine-tuned a LLaMA-2-7B model on our DivED dataset, yielding performance that surpasses SOTA large language models like GPT-3.5 across three open benchmarks on zero-shot event detection.
https://arxiv.org/abs/2403.02586
Defeasible reasoning is a kind of reasoning where some generalisations may not be valid in all circumstances; that is, general conclusions may fail in some cases. Various formalisms have been developed to model this kind of reasoning, which is characteristic of common-sense contexts. However, it is not easy for a modeller to choose among these systems the one that best fits their domain from an ontological point of view. In this paper we first propose a framework based on the notions of exceptionality and defeasibility in order to be able to compare formalisms and reveal their ontological commitments. Then, we apply this framework to compare four systems, showing the differences that may occur from an ontological perspective.
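The notion of a defeated generalisation can be sketched with naive specificity-based overriding; this is one of many possible formalisms, chosen here only to illustrate exceptionality, not one of the four systems the paper compares:

```python
rules = [
    # (priority, condition, conclusion) -- higher priority = more specific
    (1, lambda kb: "bird" in kb, ("flies", True)),      # default: birds fly
    (2, lambda kb: "penguin" in kb, ("flies", False)),  # exception
]

def conclude(kb):
    applicable = [(p, c) for p, cond, c in rules if cond(kb)]
    _, conclusion = max(applicable)   # most specific applicable rule wins
    return conclusion

print(conclude({"bird"}))              # default applies
print(conclude({"bird", "penguin"}))   # exception defeats the default
```

How a formalism justifies the override (specificity, priorities, preferred models, etc.) is precisely where the ontological commitments the paper examines diverge.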
https://arxiv.org/abs/2403.00685
The previously introduced Modular Ontology Modeling methodology (MOMo) attempts to mimic the human analogical process by using modular patterns to assemble more complex concepts. To support this, MOMo organizes ontology design patterns into design libraries, which are programmatically queryable, to support accelerated ontology development for both human and automated processes. However, a major bottleneck to large-scale deployment of MOMo is the (to-date) limited availability of ready-to-use ontology design patterns. At the same time, Large Language Models have quickly become a source of common knowledge and, in some cases, a replacement for search engines when answering questions. In this paper, we thus present a collection of 104 ontology design patterns representing often occurring nouns, curated from the common-sense knowledge available in LLMs, organized into a fully-annotated modular ontology design library ready for use with MOMo.
https://arxiv.org/abs/2402.18715
Previous work on spoken language understanding (SLU) mainly focuses on single-intent settings, where each input utterance merely contains one user intent. This configuration significantly limits the surface form of user utterances and the capacity of output semantics. In this work, we first propose a Multi-Intent dataset which is collected from a realistic in-Vehicle dialogue System, called MIVS. The target semantic frame is organized in a 3-layer hierarchical structure to tackle the alignment and assignment problems in multi-intent cases. Accordingly, we devise a BiRGAT model to encode the hierarchy of ontology items, the backbone of which is a dual relational graph attention network. Coupled with the 3-way pointer-generator decoder, our method outperforms traditional sequence labeling and classification-based schemes by a large margin.
https://arxiv.org/abs/2402.18258
We investigate the task of inserting new concepts extracted from texts into an ontology using language models. We explore an approach with three steps: edge search, which finds a set of candidate locations to insert (i.e., subsumptions between concepts); edge formation and enrichment, which leverages the ontological structure to produce and enhance the edge candidates; and edge selection, which eventually determines the edge to be placed into the ontology. In all steps, we propose to leverage neural methods, where we apply embedding-based methods and contrastive learning with Pre-trained Language Models (PLMs) such as BERT for edge search, and adapt a BERT fine-tuning-based multi-label Edge-Cross-encoder, as well as Large Language Models (LLMs) such as the GPT series, FLAN-T5, and Llama 2, for edge selection. We evaluate the methods on recent datasets created using the SNOMED CT ontology and the MedMentions entity linking benchmark. The best settings in our framework use a fine-tuned PLM for search and a multi-label Cross-encoder for selection. Zero-shot prompting of LLMs is still not adequate for the task, and we propose explainable instruction tuning of LLMs for improved performance. Our study shows the advantages of PLMs and highlights the encouraging performance of LLMs that motivates future studies.
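The edge-search step can be sketched as ranking candidate parents by embedding similarity; the vectors below are random stand-ins for PLM embeddings, and the concept names are illustrative, not from SNOMED CT:

```python
import numpy as np

rng = np.random.default_rng(2)
ontology = ["disorder", "heart disorder", "lung disorder", "infection"]
emb = {c: rng.normal(size=8) for c in ontology}

new_concept = "viral infection"
# Pretend the PLM places the new concept near "infection".
emb[new_concept] = emb["infection"] + 0.05 * rng.normal(size=8)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Edge search: rank existing concepts as candidate subsumers.
candidates = sorted(ontology, key=lambda c: cosine(emb[c], emb[new_concept]),
                    reverse=True)
print("top candidate parent:", candidates[0])  # "infection" by construction
```

In the full pipeline, the ranked shortlist would then pass through edge formation/enrichment and finally the Cross-encoder or LLM selection stage.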
https://arxiv.org/abs/2402.17897
Recently, ontology embeddings representing entities in a low-dimensional space have been proposed for ontology completion. However, the ontology embeddings for concept subsumption prediction do not address the difficulties of similar and isolated entities and fail to extract the global information of annotation axioms from an ontology. In this paper, we propose a self-matching training method for the two ontology embedding models: Inverted-index Matrix Embedding (InME) and Co-occurrence Matrix Embedding (CoME). The two embeddings capture the global and local information in annotation axioms by means of the occurring locations of each word in a set of axioms and the co-occurrences of words in each axiom. The self-matching training method increases the robustness of the concept subsumption prediction when predicted superclasses are similar to subclasses and are isolated from other entities in an ontology. Our evaluation experiments show that the self-matching training method with InME outperforms the existing ontology embeddings for the GO and FoodOn ontologies and that the method with the concatenation of CoME and OWL2Vec* outperforms them for the HeLiS ontology.
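The co-occurrence idea behind CoME can be sketched by counting word co-occurrences within each annotation axiom; the axioms below are toy label strings, not taken from GO, FoodOn, or HeLiS:

```python
import numpy as np
from itertools import combinations

axioms = ["cell membrane part", "membrane transport", "cell transport part"]
vocab = sorted({w for a in axioms for w in a.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts: words sharing an axiom co-occur.
C = np.zeros((len(vocab), len(vocab)), dtype=int)
for a in axioms:
    for u, v in combinations(set(a.split()), 2):
        C[idx[u], idx[v]] += 1
        C[idx[v], idx[u]] += 1

# Each row is a local-context embedding of the corresponding word.
print(vocab)
print(C)
```

InME complements this local view with a global one, indexing where each word occurs across the whole set of axioms rather than within a single axiom.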
https://arxiv.org/abs/2402.16278
Patient-Centric Knowledge Graphs (PCKGs) represent an important shift in healthcare that focuses on individualized patient care by mapping the patient's health information in a holistic and multi-dimensional way. PCKGs integrate various types of health data to provide healthcare professionals with a comprehensive understanding of a patient's health, enabling more personalized and effective care. This literature review explores the methodologies, challenges, and opportunities associated with PCKGs, focusing on their role in integrating disparate healthcare data and enhancing patient care through a unified health perspective. In addition, this review also discusses the complexities of PCKG development, including ontology design, data integration techniques, knowledge extraction, and structured representation of knowledge. It highlights advanced techniques such as reasoning, semantic search, and inference mechanisms essential in constructing and evaluating PCKGs for actionable healthcare insights. We further explore the practical applications of PCKGs in personalized medicine, emphasizing their significance in improving disease prediction and formulating effective treatment plans. Overall, this review provides a foundational perspective on the current state-of-the-art and best practices of PCKGs, guiding future research and applications in this dynamic field.
https://arxiv.org/abs/2402.12608
We propose an ontology-enhanced model for sentence-based claim detection. We fuse ontology embeddings from a knowledge base with BERT sentence embeddings to perform claim detection for the ClaimBuster and the NewsClaims datasets. Our ontology-enhanced approach showed the best results on these small, unbalanced datasets, compared to other statistical and neural machine learning models. The experiments demonstrate that adding domain-specific features (either trained word embeddings or knowledge graph metadata) can improve traditional ML methods. In addition, adding domain knowledge in the form of ontology embeddings helps avoid the bias encountered in neural network based models, for example the pure BERT model's bias towards larger classes in our small corpus.
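The fusion step can be sketched as vector concatenation before classification; both embedding functions below are random stand-ins for real OWL2Vec*-style and BERT outputs, with assumed dimensions of 100 and 768:

```python
import numpy as np

rng = np.random.default_rng(3)

def bert_embedding(sentence):
    # Stand-in for a 768-d BERT sentence vector.
    return rng.normal(size=768)

def ontology_embedding(entities):
    # Stand-in for KG entity vectors (100-d each), mean-pooled.
    return rng.normal(size=(len(entities), 100)).mean(axis=0)

sentence = "The unemployment rate fell to 3.5 percent last year."
fused = np.concatenate([bert_embedding(sentence),
                        ontology_embedding(["unemployment_rate"])])
print(fused.shape)   # the claim classifier's input: (868,)
```

The fused vector then feeds a downstream claim/non-claim classifier; the ontology half supplies the domain signal that counteracts the majority-class bias noted above.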
https://arxiv.org/abs/2402.12282
Social Network Analysis (SNA) is a set of techniques developed in the field of social and behavioral sciences research in order to characterize and study the social relationships that are established among a set of individuals. When building a social network for performing an SNA analysis, an initial process of data gathering is carried out in order to extract the characteristics of the individuals and their relationships. This is usually done by completing a questionnaire containing different types of questions that will later be used to obtain the SNA measures needed to perform the study. There are, then, a great number of different possible network-generating questions, and also many possibilities for mapping the responses to the corresponding characteristics and relationships. Many variations may be introduced into these questions (the way they are posed, the weights given to each of the responses, etc.) that may have an effect on the resulting networks. All these different variations are difficult to explore manually, because the process is time-consuming and error-prone. The tool described in this paper uses semantic knowledge representation techniques in order to facilitate this kind of sensitivity study. The base of the tool is a conceptual structure, called an "ontology", that is able to represent the different concepts and their definitions. The tool is compared to other similar ones, and the advantages of the approach are highlighted, giving some particular examples from an ongoing SNA study about alcohol consumption habits in adolescents.
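The questionnaire-to-network mapping and its sensitivity to design choices can be sketched as follows; the name-generator question, the respondents, and the weights are invented for illustration:

```python
from collections import defaultdict

# "Which classmates do you drink with, and how often?" ->
# (respondent, named peer, reported frequency)
answers = [
    ("ana", "ben", 3), ("ana", "cal", 1),
    ("ben", "ana", 2), ("cal", "dia", 1), ("dia", "ben", 2),
]

# Sensitivity study: the same answers under weighted vs. binary mapping.
indeg_weighted = defaultdict(int)
indeg_binary = defaultdict(int)
for respondent, peer, freq in answers:
    indeg_weighted[peer] += freq   # frequency kept as edge weight
    indeg_binary[peer] += 1        # any mention counts as one tie

print(indeg_weighted["ben"], indeg_binary["ben"])  # 5 2
```

The same raw answers yield different centrality figures depending on the mapping, which is exactly the kind of variation the ontology-based tool automates.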
https://arxiv.org/abs/2402.12390
Alcohol Use Disorder (AUD) is a major concern for public health organizations worldwide, especially as regards the adolescent population. The consumption of alcohol in adolescents is known to be influenced by seeing friends and even parents drinking alcohol. Building on this fact, a number of studies into alcohol consumption among adolescents have made use of Social Network Analysis (SNA) techniques to study the different social networks (peers, friends, family, etc.) with whom the adolescent is involved. These kinds of studies need an initial phase of data gathering by means of questionnaires and a subsequent analysis phase using SNA techniques. The process involves a number of manual data handling stages that are time consuming and error-prone. The use of knowledge engineering techniques (including the construction of a domain ontology) to represent the information allows the automation of all the activities, from the initial data collection to the results of the SNA study. This paper shows how a knowledge model is constructed, and compares the results obtained using the traditional method with this fully automated model, detailing the main advantages of the latter. In the case of the SNA analysis, the validity of the results obtained with the knowledge engineering approach is compared to those obtained manually using UCINET, Cytoscape, Pajek and Gephi to test the accuracy of the knowledge model.
https://arxiv.org/abs/2402.10967
XAI (eXplainable AI) techniques that have the property of explaining the reasons for their conclusions, i.e. explainability or interpretability, are attracting attention. XAI is expected to be used in the development of forensic science and the justice system. In today's forensic and criminal investigation environment, experts face many challenges due to large amounts of data, small pieces of evidence in a chaotic and complex environment, traditional laboratory structures and sometimes inadequate knowledge. All these can lead to failed investigations and miscarriages of justice. In this paper, we describe the application of one logical approach to crime scene investigation. The subject of the application is ``The Adventure of the Speckled Band'' from the Sherlock Holmes short stories. The applied data is the knowledge graph created for the Knowledge Graph Reasoning Challenge. We tried to find the murderer by inferring, for each person, their motive, opportunity, and method. We created an ontology of motives and methods of murder from dictionaries, added it to the knowledge graph of ``The Adventure of the Speckled Band'', and applied scripts to determine motives, opportunities, and methods.
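The motive/opportunity/method inference can be sketched over a handful of triples; the facts below are a drastically simplified, hand-written stand-in for the challenge's knowledge graph, and the relation names are invented:

```python
facts = {
    ("Roylott", "loses_inheritance_if", "Julia_marries"),   # motive
    ("Roylott", "occupies_room_adjacent_to", "Julia"),      # opportunity
    ("Roylott", "keeps", "swamp_adder"),                    # method
    ("Armitage", "engaged_to", "Julia"),
}

def has(person, relation):
    return any(s == person and p == relation for s, p, o in facts)

def suspect(person):
    """Explainable check: all three elements must be derivable."""
    motive = has(person, "loses_inheritance_if")
    opportunity = has(person, "occupies_room_adjacent_to")
    method = has(person, "keeps")
    return motive and opportunity and method

people = {s for s, _, _ in facts}
print([p for p in people if suspect(p)])  # -> ['Roylott']
```

Because each conjunct is checked separately, the system can report *why* a person is (or is not) a suspect, which is the explainability property the abstract emphasizes.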
https://arxiv.org/abs/2402.08284
Metamodeling is a general approach to expressing knowledge about classes and properties in an ontology. It is a desirable modeling feature in multiple applications that simplifies the extension and reuse of ontologies. Nevertheless, allowing metamodeling without restrictions is problematic for several reasons, mainly due to undecidability issues. Practical languages, therefore, forbid classes to occur as instances of other classes or treat such occurrences as semantically different objects. Specifically, meta-querying in SPARQL under the Direct Semantic Entailment Regime (DSER) uses the latter approach, thereby effectively not supporting meta-queries. However, several extensions enabling different metamodeling features have been proposed over the last decade. This paper deals with the Metamodeling Semantics (MS) over OWL 2 QL and the Metamodeling Semantic Entailment Regime (MSER), as proposed in Lenzerini et al. (2015) and Lenzerini et al. (2020); Cima et al. (2017). A reduction from OWL 2 QL to Datalog for meta-querying was proposed in Cima et al. (2017). In this paper, we experiment with various logic programming tools that support Datalog querying to determine their suitability as back-ends to MSER query answering. These tools stem from different logic programming paradigms (Prolog, pure Datalog, Answer Set Programming, Hybrid Knowledge Bases). Our work shows that the Datalog approach to MSER querying is practical also for sizeable ontologies with limited resources (time and memory). This paper significantly extends Qureshi & Faber (2021) by a more detailed experimental analysis and more background. Under consideration in Theory and Practice of Logic Programming (TPLP).
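The flavour of the Datalog reduction for meta-querying can be sketched with a naive bottom-up evaluator in which classes occur as instances of other classes; the facts and the single rule are illustrative, not the paper's actual translation of OWL 2 QL:

```python
# Class membership and subclass axioms share one fact space, so queries
# can range over classes as first-class objects (metamodeling).
facts = {("isa", "eagle", "Species"), ("isa", "Species", "Taxon"),
         ("sub", "Species", "Class")}

rules = [
    # isa(X, Z) :- isa(X, Y), sub(Y, Z).
    lambda fs: {("isa", x, z) for (p1, x, y) in fs if p1 == "isa"
                for (p2, y2, z) in fs if p2 == "sub" and y2 == y},
]

# Naive bottom-up evaluation to a fixpoint.
derived = set(facts)
while True:
    new = set().union(*(r(derived) for r in rules)) - derived
    if not new:
        break
    derived |= new

# A meta-query: everything "eagle" is an instance of, classes included.
print(sorted(z for (p, x, z) in derived if p == "isa" and x == "eagle"))
```

Production back-ends (Prolog, pure Datalog, ASP, hybrid knowledge bases) replace this naive loop with semi-naive or top-down evaluation, which is what the paper benchmarks.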
https://arxiv.org/abs/2402.02978
The use of social network theory and methods of analysis have been applied to different domains in recent years, including public health. The complete procedure for carrying out a social network analysis (SNA) is a time-consuming task that entails a series of steps in which the expert in social network analysis could make mistakes. This research presents a multi-domain knowledge model capable of automatically gathering data and carrying out different social network analyses in different domains, without errors and obtaining the same conclusions that an expert in SNA would obtain. The model is represented in an ontology called OntoSNAQA, which is made up of classes, properties and rules representing the domains of People, Questionnaires and Social Network Analysis. Besides the ontology itself, different rules are represented by SWRL and SPARQL queries. A Knowledge Based System was created using OntoSNAQA and applied to a real case study in order to show the advantages of the approach. Finally, the results of an SNA analysis obtained through the model were compared to those obtained from some of the most widely used SNA applications: UCINET, Pajek, Cytoscape and Gephi, to test and confirm the validity of the model.
https://arxiv.org/abs/2402.02181