Ontology learning in complex domains, such as the life sciences, poses significant challenges for current Large Language Models (LLMs). Existing LLMs struggle to generate ontologies with multiple hierarchical levels, rich interconnections, and comprehensive class coverage, due to limits on the number of tokens they can generate and inadequate domain adaptation. To address these issues, we extend the NeOn-GPT pipeline for LLM-based ontology learning with advanced prompt engineering techniques and ontology reuse, enhancing the domain-specific reasoning and structural depth of the generated ontologies. Our work evaluates the capabilities of LLMs for ontology learning in highly specialized and complex domains such as the life sciences. To assess the logical consistency, completeness, and scalability of the generated ontologies, we use as a case study the AquaDiva ontology developed and used in the collaborative research center AquaDiva. Our evaluation shows the viability of LLMs for ontology learning in specialized domains, providing solutions to longstanding limitations in model performance and scalability.
https://arxiv.org/abs/2412.02035
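To make the idea concrete, here is a minimal sketch of an ontology-reuse prompt plus a cheap structural check on the LLM output. It is an illustration only: the prompt wording, the reused IRIs, and the competency questions are invented placeholders, not the actual NeOn-GPT prompts or the AquaDiva vocabulary.

```python
# Illustration only: prompt wording, reused IRIs, and competency questions are invented
# placeholders, not the actual NeOn-GPT prompts or the AquaDiva ontology.
from rdflib import Graph

REUSED_CLASSES = [
    "http://example.org/reused/WaterBody",       # hypothetical class reused from an existing ontology
    "http://example.org/reused/Microorganism",   # hypothetical class reused from an existing ontology
]

def build_prompt(domain: str, competency_questions: list[str]) -> str:
    """Assemble an ontology-learning prompt that requests Turtle and encourages reuse."""
    reuse_block = "\n".join(f"- {iri}" for iri in REUSED_CLASSES)
    cqs = "\n".join(f"- {q}" for q in competency_questions)
    return (
        f"You are an ontology engineer for the {domain} domain.\n"
        "Answer ONLY with valid Turtle, no prose.\n"
        f"Reuse these existing classes where appropriate:\n{reuse_block}\n"
        f"The ontology must answer these competency questions:\n{cqs}\n"
        "Include class hierarchies (rdfs:subClassOf) and object properties with domains and ranges."
    )

def parses_as_turtle(llm_output: str) -> bool:
    """Cheap structural check on the generated ontology (not a full consistency check)."""
    try:
        Graph().parse(data=llm_output, format="turtle")
        return True
    except Exception:
        return False

if __name__ == "__main__":
    print(build_prompt("groundwater ecology", ["Which microbes inhabit which aquifer zones?"]))
    print(parses_as_turtle("@prefix ex: <http://example.org/> . ex:Aquifer a ex:Habitat ."))
```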
We present SPILDL, a Scalable and Parallel Inductive Learner in Description Logic (DL). SPILDL builds on the DL-Learner (the state of the art in DL-based ILP learning). As a DL-based ILP learner, SPILDL targets the $\mathcal{ALCQI}^{\mathcal{(D)}}$ DL language and can learn DL hypotheses expressed as disjunctions of conjunctions (using the $\sqcup$ operator). SPILDL's hypothesis language also incorporates string concrete roles (known as string data properties in the Web Ontology Language, OWL); these powerful DL constructs enable SPILDL to learn expressive DL hypotheses that describe many complex real-world concepts. SPILDL employs a hybrid parallel approach that combines shared-memory and distributed-memory techniques to accelerate ILP learning, for both hypothesis search and evaluation. In our experiments, SPILDL's parallel search improved performance by up to $\sim$27.3-fold (best case). For hypothesis evaluation, SPILDL improved performance through HT-HEDL (our multi-core CPU + multi-GPU hypothesis evaluation engine) by up to 38-fold (best case). Combining parallel search and evaluation improved performance by up to $\sim$560-fold (best case). In the worst case, SPILDL's parallel search does not provide consistent speedups across all datasets and depends heavily on the nature of each dataset's search space: for some datasets, increasing the number of parallel search threads yields performance similar to or worse than the baseline. Some ILP datasets benefit from parallel search, while others do not (or the gains are negligible). Likewise, on small datasets, parallel evaluation yields performance similar to or worse than the baseline.
https://arxiv.org/abs/2412.00830
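For readers unfamiliar with ILP-style hypothesis evaluation, the sketch below shows the shape of shared-memory parallel evaluation: score each candidate hypothesis against positive and negative examples in a worker pool. It is a toy, with plain Python predicates standing in for $\mathcal{ALCQI}^{\mathcal{(D)}}$ class expressions; it is not the SPILDL or HT-HEDL implementation.

```python
# Simplified illustration of shared-memory parallel hypothesis evaluation, not SPILDL/HT-HEDL:
# hypotheses are plain Python predicates rather than DL class expressions.
from multiprocessing import Pool

POSITIVES = [{"age": 34, "smoker": True}, {"age": 61, "smoker": True}]
NEGATIVES = [{"age": 25, "smoker": False}, {"age": 40, "smoker": False}]

HYPOTHESES = {
    "smoker": lambda e: e["smoker"],
    "age>30": lambda e: e["age"] > 30,
    "smoker_and_age>30": lambda e: e["smoker"] and e["age"] > 30,
}

def evaluate(name: str) -> tuple[str, int, int]:
    """Score one hypothesis: how many positives and negatives it covers."""
    h = HYPOTHESES[name]
    tp = sum(h(e) for e in POSITIVES)
    fp = sum(h(e) for e in NEGATIVES)
    return name, tp, fp

if __name__ == "__main__":
    with Pool(processes=3) as pool:  # one worker per candidate hypothesis in this toy
        for name, tp, fp in pool.map(evaluate, HYPOTHESES):
            print(f"{name}: covers {tp}/{len(POSITIVES)} positives, {fp}/{len(NEGATIVES)} negatives")
```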
Extracting relevant and structured knowledge from large, complex technical documents in the Reliability and Maintainability (RAM) domain is labor-intensive and error-prone. Our work addresses this challenge by presenting OntoKGen, a genuine pipeline for ontology extraction and Knowledge Graph (KG) generation. OntoKGen leverages Large Language Models (LLMs) through an interactive user interface guided by our adaptive, iterative Chain of Thought (CoT) algorithm, ensuring that the ontology extraction process, and thus KG generation, aligns with user-specific requirements. Although KG generation follows a clear, structured path based on the confirmed ontology, there is no universally correct ontology, as it is inherently shaped by the user's preferences. OntoKGen recommends an ontology grounded in best practices, minimizing user effort and surfacing insights that might otherwise be overlooked, while giving the user complete control over the final ontology. Once the KG has been generated from the confirmed ontology, OntoKGen enables seamless integration into schemaless, non-relational databases such as Neo4j. This integration allows flexible storage and retrieval of knowledge from diverse, unstructured sources, facilitating advanced querying, analysis, and decision-making. Moreover, the generated KG serves as a robust foundation for future integration into Retrieval-Augmented Generation (RAG) systems, offering enhanced capabilities for developing domain-specific intelligent applications.
https://arxiv.org/abs/2412.00608
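The Neo4j integration step can be pictured with the official Python driver, as in the sketch below. The connection details, node labels, and triples are placeholders rather than OntoKGen's actual schema; the relationship type is stored as a property because Cypher does not allow parameterizing relationship types directly.

```python
# Minimal sketch of pushing extracted (subject, relation, object) triples into Neo4j.
# Connection details, labels, and triples are placeholders, not OntoKGen output.
from neo4j import GraphDatabase

TRIPLES = [
    ("Pump-101", "HAS_FAILURE_MODE", "Seal Leak"),
    ("Seal Leak", "MITIGATED_BY", "Preventive Inspection"),
]

def load_triples(uri: str, user: str, password: str) -> None:
    driver = GraphDatabase.driver(uri, auth=(user, password))
    query = (
        "MERGE (s:Entity {name: $s}) "
        "MERGE (o:Entity {name: $o}) "
        "MERGE (s)-[:REL {type: $p}]->(o)"   # relation name kept as a property
    )
    with driver.session() as session:
        for s, p, o in TRIPLES:
            session.run(query, s=s, p=p, o=o)
    driver.close()

if __name__ == "__main__":
    load_triples("bolt://localhost:7687", "neo4j", "password")  # adjust to your instance
```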
This paper presents a new ontology that implements the well-known Deontic Traditional Scheme in RDFs and SPARQL, fit to handle irresolvable conflicts, i.e., situations in which two or more statements prescribe conflicting obligations, prohibitions, or permissions, with none of them being "stronger" than the other one(s). In our view, this paper marks a significant advancement in standard theoretical research in formal Deontic Logic. Most contemporary approaches in this field are confined to the propositional level, mainly focus on the notion of obligation, and lack implementations. The proposed framework is encoded in RDF, which is not only a first-order language but also the most widely used knowledge representation language, as it forms the foundation of the Semantic Web. Moreover, the proposed computational ontology formalizes all deontic modalities defined in the Deontic Traditional Scheme, without specifically focusing on obligations, and offers constructs to model and reason with various types of irresolvable conflicts, violations, and the interaction between deontic modalities and contextual constraints in a given state of affairs. To the best of our knowledge, no existing approach in the literature addresses all these aspects within a unified integrated framework. All examples presented and discussed in this paper, together with Java code and clear instructions to re-execute them locally, are available at this https URL
https://arxiv.org/abs/2411.19918
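A toy version of conflict detection can be written with rdflib: assert an obligation and a prohibition on the same action and query for pairs where neither norm overrides the other. The ex: vocabulary below is invented for illustration and is not the ontology proposed in the paper.

```python
# Toy illustration of detecting an irresolvable conflict between an obligation and a
# prohibition on the same action. The ex: vocabulary is invented for this sketch and
# is NOT the computational ontology proposed in the paper.
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/deontic#")
g = Graph()
g.bind("ex", EX)

# Norm 1 obliges reporting the incident; Norm 2 forbids it; neither overrides the other.
g.add((EX.norm1, RDF.type, EX.Obligation))
g.add((EX.norm1, EX.onAction, EX.reportIncident))
g.add((EX.norm2, RDF.type, EX.Prohibition))
g.add((EX.norm2, EX.onAction, EX.reportIncident))

CONFLICTS = """
PREFIX ex: <http://example.org/deontic#>
SELECT ?o ?p WHERE {
  ?o a ex:Obligation ; ex:onAction ?a .
  ?p a ex:Prohibition ; ex:onAction ?a .
  FILTER NOT EXISTS { ?o ex:overrides ?p }
  FILTER NOT EXISTS { ?p ex:overrides ?o }
}
"""

for row in g.query(CONFLICTS):
    print(f"Irresolvable conflict between {row.o} and {row.p}")
```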
This paper proposes a novel approach to semantic ontology alignment using contextual descriptors. We develop a formalization that integrates essential and contextual descriptors to create a comprehensive knowledge model. We demonstrate the hierarchical structure of the semantic approach and the mathematical apparatus for analyzing potential conflicts between concepts, illustrated by the example of "Transparency" versus "Privacy" in the context of artificial intelligence. Experimental studies show a significant improvement in ontology alignment metrics after the introduction of contextual descriptors, especially in the areas of privacy, responsibility, and freedom & autonomy. Applying contextual descriptors yields an average overall improvement of approximately 4.36%. The results indicate the effectiveness of the proposed approach for more accurately reflecting the complexity of knowledge and its contextual dependence.
https://arxiv.org/abs/2411.19113
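As a purely illustrative reading of the idea, the sketch below combines an "essential" and a "contextual" similarity and flags concept pairs that agree in essence but diverge in context (as "Transparency" and "Privacy" might). The similarity function, weights, and thresholds are assumptions, not the paper's formalization.

```python
# Illustrative only: a toy way to combine essential and contextual descriptor similarity
# and flag potential conflicts. Weights, thresholds, and the similarity function are
# assumptions, not the paper's mathematical apparatus.
def jaccard(a: set[str], b: set[str]) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def alignment_score(essential_a, essential_b, context_a, context_b,
                    w_essential: float = 0.6, w_context: float = 0.4) -> float:
    return w_essential * jaccard(essential_a, essential_b) + w_context * jaccard(context_a, context_b)

def potential_conflict(essential_sim: float, context_sim: float,
                       hi: float = 0.5, lo: float = 0.2) -> bool:
    # Concepts alike in essence but strongly divergent in context are conflict candidates.
    return essential_sim >= hi and context_sim <= lo

transparency = ({"disclosure", "information", "access"}, {"accountability", "openness", "audit"})
privacy      = ({"disclosure", "information", "control"}, {"confidentiality", "data_protection"})

e_sim = jaccard(transparency[0], privacy[0])
c_sim = jaccard(transparency[1], privacy[1])
score = alignment_score(transparency[0], privacy[0], transparency[1], privacy[1])
print(f"score={score:.2f}, conflict={potential_conflict(e_sim, c_sim)}")
```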
Procedural Knowledge is the know-how expressed in the form of sequences of steps needed to perform some tasks. Procedures are usually described by means of natural language texts, such as recipes or maintenance manuals, possibly spread across different documents and systems, and their interpretation and subsequent execution is often left to the reader. Representing such procedures in a Knowledge Graph (KG) can be the basis to build digital tools to support those users who need to apply or execute them. In this paper, we leverage Large Language Model (LLM) capabilities and propose a prompt engineering approach to extract steps, actions, objects, equipment and temporal information from a textual procedure, in order to populate a Procedural KG according to a pre-defined ontology. We evaluate the KG extraction results by means of a user study, in order to qualitatively and quantitatively assess the perceived quality and usefulness of the LLM-extracted procedural knowledge. We show that LLMs can produce outputs of acceptable quality and we assess the subjective perception of AI by human evaluators.
https://arxiv.org/abs/2412.03589
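The extraction pattern can be illustrated as a prompt template whose JSON output is parsed into steps with actions, objects, equipment, and temporal information. The prompt wording, field names, and the canned call_llm stub below are placeholders, not the paper's prompts or its pre-defined ontology.

```python
# Sketch of the extraction pattern only: prompt wording, JSON field names, and `call_llm`
# are placeholders, not the paper's actual prompts or procedural ontology.
import json

PROMPT_TEMPLATE = """Extract the procedure below into JSON with a list of steps.
Each step must have: "action", "objects", "equipment", "temporal" (duration or ordering hints).
Return JSON only.

Procedure:
{procedure}
"""

def call_llm(prompt: str) -> str:
    # Placeholder for any chat-completion API; returns a canned answer for this demo.
    return json.dumps({"steps": [
        {"action": "loosen", "objects": ["drain bolt"], "equipment": ["17mm wrench"], "temporal": "step 1"},
        {"action": "drain",  "objects": ["engine oil"], "equipment": ["drain pan"],   "temporal": "about 10 minutes"},
    ]})

def extract_steps(procedure: str) -> list[dict]:
    raw = call_llm(PROMPT_TEMPLATE.format(procedure=procedure))
    return json.loads(raw)["steps"]

if __name__ == "__main__":
    for i, step in enumerate(extract_steps("Loosen the drain bolt, then drain the oil."), start=1):
        print(i, step["action"], step["objects"], step["equipment"], step["temporal"])
```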
The lack of a formal model of events hinders interoperability in distributed event-based systems. In this paper, we present a formal model of events, called Event-Model-F. The model is based on the foundational ontology DOLCE+DnS Ultralite (DUL) and provides comprehensive support to represent time and space, objects and persons, as well as mereological, causal, and correlative relationships between events. In addition, the Event-Model-F provides a flexible means for event composition, modeling event causality and event correlation, and representing different interpretations of the same event. The Event-Model-F is developed following the pattern-oriented approach of DUL, is modularized in different ontologies, and can be easily extended by domain specific ontologies.
https://arxiv.org/abs/2411.16609
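A flavour of the modeling style, using rdflib with an invented ex: namespace (not the actual Event-Model-F/DUL terms): a composite event with a sub-event, a causal link between events, and a time annotation.

```python
# Invented ex: terms for illustration only; the real Event-Model-F builds on DUL patterns
# and uses its own vocabulary for composition, causality, correlation, and interpretation.
from rdflib import Graph, Namespace, RDF, Literal
from rdflib.namespace import XSD

EX = Namespace("http://example.org/event#")
g = Graph()
g.bind("ex", EX)

# A composite "flood response" event with a sub-event and a causal relation between events.
g.add((EX.heavyRain, RDF.type, EX.Event))
g.add((EX.riverFlood, RDF.type, EX.Event))
g.add((EX.floodResponse, RDF.type, EX.Event))
g.add((EX.floodResponse, EX.hasSubEvent, EX.riverFlood))
g.add((EX.heavyRain, EX.causes, EX.riverFlood))
g.add((EX.riverFlood, EX.atTime, Literal("2024-06-01T14:00:00", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))
```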
Cyberattacks are becoming increasingly difficult to detect and prevent due to their sophistication. In response, Autonomous Intelligent Cyber-defense Agents (AICAs) are emerging as crucial solutions. One prominent AICA agent is the Intrusion Response System (IRS), which is critical for mitigating threats after detection. The IRS uses several Tactics, Techniques, and Procedures (TTPs) to mitigate attacks and restore the infrastructure to normal operations. Continuous monitoring of the enterprise infrastructure is an essential TTP the IRS uses. However, each monitored system serves a different purpose to meet operational needs, and integrating these disparate sources for continuous monitoring increases pre-processing complexity and limits automation, ultimately prolonging the critical response window that attackers can exploit. We propose a unified IRS Knowledge Graph ontology (IRSKG) that streamlines the onboarding of new enterprise systems as sources for AICAs. Our ontology can capture system monitoring logs and supplemental data, such as a rules repository containing the administrator-defined policies that dictate IRS responses. Moreover, our ontology accommodates dynamic changes to adapt to the evolving cyber-threat landscape. This robust yet concise design allows machine learning models to train effectively and to autonomously restore a compromised system to its desired state with explainability.
https://arxiv.org/abs/2411.15672
Large Language Models (LLMs) offer promising solutions for text summarization. However, some domains require specific information to be available in the summaries, and generating such domain-adapted summaries remains an open challenge. Similarly, hallucinations in generated content are a major drawback of current approaches, preventing their deployment. This study proposes a novel approach that leverages ontologies to create both structured and unstructured domain-adapted summaries. We employ an ontology-guided constrained decoding process to reduce hallucinations while improving relevance. Applied to the medical domain, our method shows potential for summarizing Electronic Health Records (EHRs) across different specialties, allowing doctors to focus on the information most relevant to their domain. Evaluation on the MIMIC-III dataset demonstrates improvements in generating domain-adapted summaries of clinical notes and in reducing hallucinations.
https://arxiv.org/abs/2411.15666
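A stripped-down picture of ontology-guided constrained decoding: at each step, candidate tokens outside an ontology-derived allow-list are masked before the greedy choice. The vocabulary and scores below are made up; this is not the paper's decoder, model, or medical ontology.

```python
# Toy illustration of ontology-guided constrained decoding: candidate tokens not licensed
# by the (tiny, made-up) ontology vocabulary are masked at each step. Not the paper's
# decoder, model, or MIMIC-III data.
ONTOLOGY_TERMS = {"hypertension", "metoprolol", "25mg", "daily", "<eos>"}

def constrained_greedy_decode(step_scores: list[dict[str, float]]) -> list[str]:
    output = []
    for scores in step_scores:
        allowed = {tok: s for tok, s in scores.items() if tok in ONTOLOGY_TERMS}
        if not allowed:            # fall back to the unconstrained choice if nothing is allowed
            allowed = scores
        token = max(allowed, key=allowed.get)
        if token == "<eos>":
            break
        output.append(token)
    return output

# Fake per-step token scores (log-probabilities) from a hypothetical summarizer.
steps = [
    {"hypertension": -0.2, "headache": -0.1, "metoprolol": -1.5},
    {"metoprolol": -0.3, "aspirin": -0.2, "25mg": -1.0},
    {"25mg": -0.4, "daily": -0.9, "<eos>": -2.0},
    {"daily": -0.5, "<eos>": -0.6},
]
print(" ".join(constrained_greedy_decode(steps)))
```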
Wikidata has a large ontology with classes at several orders. The Wikidata ontology has long been known to contain violations of class order, as well as class-order-related information that appears suspect. We evaluated SPARQL queries against Wikidata to determine the prevalence of several kinds of violations and suspect information, and analyzed the results. Some changes were made manually to Wikidata to remove some of these results, and the queries were rerun to show the effect of the changes. We provide suggestions on how the problems uncovered might be addressed, either through better tooling or through involvement of the Wikidata community.
https://arxiv.org/abs/2411.15550
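One example of such a query, run against the public endpoint with SPARQLWrapper: items that are simultaneously an instance of (P31) and a subclass of (P279) the same class, a commonly cited suspect pattern. The paper's actual violation queries may differ.

```python
# One example of a "suspect" pattern: items that are both an instance of (P31) and a
# subclass of (P279) the same class. The paper's actual queries may differ.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
SELECT ?item ?class WHERE {
  ?item wdt:P31 ?class .
  ?item wdt:P279 ?class .
}
LIMIT 10
"""

sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                       agent="class-order-check-example/0.1")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["item"]["value"], "is both instance and subclass of", row["class"]["value"])
```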
Knowledge Graph Embedding models, which represent entities and edges in a low-dimensional space, have been extremely successful at tasks related to completing and exploring Knowledge Graphs (KGs). A key aspect of training most of these models is teaching them to discriminate between true statements (positives) and false ones (negatives). However, defining negatives is not trivial: facts missing from the KG are not necessarily false, and a set of ground-truth negatives is hardly ever given. This makes synthetic negative generation a necessity. Different generation strategies can heavily affect the quality of the embeddings, making this a primary aspect to consider. We revamp a strategy that generates corruptions during training while respecting the domain and range of relations, extend its capabilities, and show that our methods bring substantial improvements: +10% MRR on standard benchmark datasets and over +150% MRR on a larger ontology-backed dataset.
https://arxiv.org/abs/2411.14858
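The core idea of domain/range-respecting corruption can be sketched in a few lines: a corrupted tail is drawn only from entities whose type matches the relation's range (and symmetrically for heads). The toy triples and type maps below are illustrative, not the paper's implementation or datasets.

```python
# Minimal sketch of type-aware negative sampling: a corrupted tail is drawn only from
# entities compatible with the relation's range (analogously for heads and domains).
# Toy data; not the paper's implementation or benchmark datasets.
import random

ENTITY_TYPES = {
    "Einstein": "Person", "Curie": "Person",
    "Germany": "Country", "Poland": "Country",
}
RELATION_RANGE = {"born_in": "Country", "colleague_of": "Person"}
TRIPLES = {("Einstein", "born_in", "Germany"), ("Curie", "colleague_of", "Einstein")}

def corrupt_tail(h: str, r: str, t: str) -> tuple[str, str, str]:
    candidates = [e for e, typ in ENTITY_TYPES.items()
                  if typ == RELATION_RANGE[r] and e != t
                  and (h, r, e) not in TRIPLES]          # avoid sampling a known true fact
    return (h, r, random.choice(candidates)) if candidates else (h, r, t)

random.seed(0)
for triple in sorted(TRIPLES):
    print(triple, "->", corrupt_tail(*triple))
```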
This paper showcases AdaptLIL, a real-time adaptive link-indented list ontology mapping visualization that uses eye gaze as the primary input source. Through a multimodal combination of real-time systems, deep learning, and web development applications, the system uniquely tailors graphical overlays (adaptations) of pairwise mappings in link-indented list ontology visualizations to individual users, based solely on their eye gaze.
https://arxiv.org/abs/2411.11768
The increasing integration of artificial intelligence into various domains, including design and creative processes, raises significant ethical questions. While AI ethics is often examined from the perspective of technology developers, less attention has been paid to the practical ethical considerations faced by technology users, particularly in design contexts. This paper introduces a framework for addressing ethical challenges in creative production processes, such as the Double Diamond design model. Drawing on six major ethical theories - virtue ethics, deontology, utilitarianism, contract theory, care ethics, and existentialism - we develop a "compass" to navigate and reflect on the ethical dimensions of AI in design. The framework highlights the importance of responsibility, anticipation, and reflection across both the AI lifecycle and each stage of the creative process. We argue that by adopting a playful and exploratory approach to AI, while remaining anchored in core ethical principles, designers can responsibly harness the potential of AI technologies without overburdening or compromising their creative processes.
https://arxiv.org/abs/2412.03579
Large Language Models bear the promise of significantly accelerating key Knowledge Graph and Ontology Engineering tasks, including ontology modeling, extension, modification, population, alignment, and entity disambiguation. We lay out LLM-based Knowledge Graph and Ontology Engineering as a new and emerging area of research, and argue that modular approaches to ontologies will be of central importance.
https://arxiv.org/abs/2411.09601
Sound event localization and detection (SELD) has seen substantial advances through learning-based methods. These systems, typically trained from scratch on specific datasets, have shown considerable generalization capabilities. Recently, deep neural networks trained on large-scale datasets have achieved remarkable success in sound event classification (SEC), prompting the open question of whether these advances can be extended to develop general-purpose SELD models. In this paper, leveraging the power of pre-trained SEC models, we propose pre-trained SELD networks (PSELDNets) trained on large-scale synthetic datasets. These synthetic datasets, generated by convolving sound events with simulated spatial room impulse responses (SRIRs), contain 1,167 hours of audio clips with an ontology of 170 sound classes. The PSELDNets are transferred to downstream SELD tasks. When adapting PSELDNets to specific scenarios, particularly in low-resource data cases, we introduce a data-efficient fine-tuning method, AdapterBit. PSELDNets are evaluated on a synthetic test set using SRIRs collected from the TAU Spatial Room Impulse Response Database (TAU-SRIR DB) and achieve satisfactory performance. We also conduct experiments to validate the transferability of PSELDNets to three publicly available datasets and to our own collected audio recordings. Results demonstrate that PSELDNets surpass state-of-the-art systems across all publicly available datasets. Because direction-of-arrival estimation is required, SELD generally relies on sufficient multi-channel audio clips. However, with AdapterBit, PSELDNets adapt more efficiently to various tasks using minimal multi-channel or even just monophonic audio clips, outperforming traditional fine-tuning approaches.
https://arxiv.org/abs/2411.06399
Electronic health records (EHRs) are stored in various database systems with different database models on heterogeneous storage architectures, such as relational databases, document stores, or graph databases. These different database models have a big impact on query complexity and performance. While this has been a known fact in database research, its implications for the growing number of Text-to-Query systems have surprisingly not been investigated so far. In this paper, we present SM3-Text-to-Query, the first multi-model medical Text-to-Query benchmark based on synthetic patient data from Synthea, following the SNOMED-CT taxonomy -- a widely used knowledge graph ontology covering medical terminology. SM3-Text-to-Query provides data representations for relational databases (PostgreSQL), document stores (MongoDB), and graph databases (Neo4j and GraphDB (RDF)), allowing the evaluation across four popular query languages, namely SQL, MQL, Cypher, and SPARQL. We systematically and manually develop 408 template questions, which we augment to construct a benchmark of 10K diverse natural language question/query pairs for these four query languages (40K pairs overall). On our dataset, we evaluate several common in-context-learning (ICL) approaches for a set of representative closed and open-source LLMs. Our evaluation sheds light on the trade-offs between database models and query languages for different ICL strategies and LLMs. Last, SM3-Text-to-Query is easily extendable to additional query languages or real, standard-based patient databases.
https://arxiv.org/abs/2411.05521
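To illustrate the multi-model setting, here is one natural-language question rendered in the four target query languages over a deliberately simplified, hypothetical patient schema; the benchmark's actual Synthea- and SNOMED-CT-based schemas differ.

```python
# The same question expressed in the four query languages targeted by the benchmark,
# over a HYPOTHETICAL simplified schema; SM3-Text-to-Query's actual schemas differ.
QUESTION = "Which patients have a diagnosis of diabetes?"

QUERIES = {
    "SQL (PostgreSQL)": """
        SELECT p.name
        FROM patient p JOIN condition c ON c.patient_id = p.id
        WHERE c.description = 'Diabetes';
    """,
    "MQL (MongoDB)": """
        db.patients.find({"conditions.description": "Diabetes"}, {"name": 1})
    """,
    "Cypher (Neo4j)": """
        MATCH (p:Patient)-[:HAS_CONDITION]->(c:Condition {description: 'Diabetes'})
        RETURN p.name
    """,
    "SPARQL (GraphDB/RDF)": """
        SELECT ?name WHERE {
          ?p a :Patient ; :name ?name ; :hasCondition ?c .
          ?c :description "Diabetes" .
        }
    """,
}

print(QUESTION)
for lang, q in QUERIES.items():
    print(f"--- {lang} ---\n{q.strip()}\n")
```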
A large amount of information in today's world is stored in knowledge bases. Named Entity Recognition (NER) is the process of extracting entities from raw text, disambiguating them, and linking them to insightful, structured knowledge bases. More concretely, it identifies and classifies entities in text, which is crucial for Information Extraction, Semantic Annotation, Question Answering, Ontology Population, and more. NER has evolved over the last three decades, since it first appeared in 1996. In this survey, we study the evolution of techniques employed for NER and compare their results, from supervised approaches to the developing unsupervised learning methods.
https://arxiv.org/abs/2411.05057
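A minimal example of the recognition-and-classification step using spaCy's pretrained pipeline (entity disambiguation and linking to a knowledge base are separate downstream steps not shown here); it assumes the en_core_web_sm model has been downloaded.

```python
# Minimal supervised-NER example with spaCy (recognition and classification only;
# linking entities to a knowledge base is a separate downstream step).
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Marie Curie won the Nobel Prize in Physics in 1903 in Stockholm.")

for ent in doc.ents:
    print(f"{ent.text:<20} {ent.label_}")
```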
Empirical evidence suggests that LLMs exhibit spontaneous cross-lingual alignment. Our findings indicate that although LLMs also demonstrate promising cross-lingual alignment in Information Extraction (IE), there remains a significant imbalance across languages, revealing an underlying deficiency in IE alignment. To address this issue, we propose AlignXIE, a powerful code-based LLM that significantly enhances cross-lingual IE alignment through two strategies. First, AlignXIE formulates IE in different languages, especially non-English ones, as code generation tasks, standardizing the representation of various schemas with Python classes to ensure consistency of the same ontology across languages and to align the schema. Second, it incorporates an IE cross-lingual alignment phase through a translated instance prediction task, proposed in this paper, to align the extraction process. This phase uses ParallelNER, a bilingual parallel IE dataset with 257,190 samples, generated by our LLM-based automatic pipeline for IE parallel data construction and manually annotated to ensure quality. Finally, we obtain AlignXIE through multilingual IE instruction tuning. Although not trained on the 9 unseen languages, AlignXIE surpasses ChatGPT by $30.17\%$ and the SoTA by $20.03\%$, demonstrating superior cross-lingual IE capabilities. Comprehensive evaluations on 63 IE benchmarks in Chinese and English under various settings demonstrate that AlignXIE significantly enhances cross-lingual and multilingual IE by boosting IE alignment.
https://arxiv.org/abs/2411.04794
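The "IE as code generation" formulation can be pictured as a language-agnostic Python-class schema that the model instantiates for parallel sentences. The class and field names below are illustrative, not AlignXIE's actual schema definitions or prompts.

```python
# Illustration of the "IE as code generation" idea: the schema is a Python class shared
# across languages, and the model is asked to emit instantiations of it. Class and field
# names here are illustrative, not AlignXIE's actual schema or prompts.
from dataclasses import dataclass

@dataclass
class PersonBirthplace:          # one relation type, identical for English and Chinese inputs
    person: str
    birthplace: str

# What the model would be expected to generate for two parallel sentences:
english_extraction = PersonBirthplace(person="Frédéric Chopin", birthplace="Żelazowa Wola")
chinese_extraction = PersonBirthplace(person="肖邦", birthplace="热拉佐瓦沃拉")

for extraction in (english_extraction, chinese_extraction):
    print(extraction)
```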
Automatic summarization has consistently attracted attention due to its versatility and wide application in various downstream tasks. Despite its popularity, we find that annotation efforts have largely been disjointed and have lacked common terminology. Consequently, it is challenging to discover existing resources or identify coherent research directions. To address this, we survey a large body of work spanning 133 datasets in over 100 languages, creating a novel ontology covering sample properties, collection methods, and distribution. With this ontology we make key observations, including the lack of accessible high-quality datasets for low-resource languages and the field's over-reliance on the news domain and on automatically collected distant supervision. Finally, we make available a web interface that allows users to interact with and explore our ontology and dataset collection, as well as a template for a summarization data card, which can be used to streamline future research into a more coherent body of work.
https://arxiv.org/abs/2411.04585
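A possible shape for such a summarization data card, with fields loosely following the ontology dimensions named above (sample properties, collection method, distribution); the field names are assumptions, not the authors' template.

```python
# One possible shape for a summarization "data card"; the fields loosely follow the
# ontology dimensions named in the abstract and are assumptions, not the authors' template.
from dataclasses import dataclass, field

@dataclass
class SummarizationDataCard:
    name: str
    languages: list[str]
    domain: str                      # e.g. news, scientific, dialogue
    collection_method: str           # e.g. manual annotation, distant supervision
    summary_style: str               # e.g. extractive, abstractive, headline
    size: int                        # number of document/summary pairs
    license: str = "unknown"
    notes: list[str] = field(default_factory=list)

card = SummarizationDataCard(
    name="toy-news-sum", languages=["sw"], domain="news",
    collection_method="distant supervision", summary_style="headline", size=12000,
)
print(card)
```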
The generic text preprocessing pipeline, comprising Tokenisation, Normalisation, Stop Words Removal, and Stemming/Lemmatisation, has been implemented in many ontology matching (OM) systems. However, the lack of standardisation in text preprocessing creates diversity in mapping results. In this paper, we investigate the effect of the text preprocessing pipeline on OM tasks at syntactic levels. Our experiments on 8 Ontology Alignment Evaluation Initiative (OAEI) track repositories with 49 distinct alignments indicate: (1) Tokenisation and Normalisation are currently more effective than Stop Words Removal and Stemming/Lemmatisation; and (2) The selection of Lemmatisation and Stemming is task-specific. We recommend standalone Lemmatisation or Stemming with post-hoc corrections. We find that (3) Porter Stemmer and Snowball Stemmer perform better than Lancaster Stemmer; and that (4) Part-of-Speech (POS) Tagging does not help Lemmatisation. To repair less effective Stop Words Removal and Stemming/Lemmatisation used in OM tasks, we propose a novel context-based pipeline repair approach that significantly improves matching correctness and overall matching performance. We also discuss the use of text preprocessing pipeline in the new era of large language models (LLMs).
https://arxiv.org/abs/2411.03962
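The four pipeline stages and the stemmer/lemmatizer variants compared in the paper can be reproduced on a toy label with NLTK, as below; this illustrates the stages only, not the OAEI experiments, and assumes the listed NLTK resources can be downloaded.

```python
# The generic four-stage pipeline on a toy label, comparing the three stemmers and the
# WordNet lemmatizer mentioned in the paper. Illustrates the stages only, not the OAEI
# experiments. Requires: pip install nltk (the downloads below fetch the needed resources).
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, WordNetLemmatizer

for pkg in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(pkg, quiet=True)

label = "Histories of the Infectious Diseases"

# 1. Tokenisation + 2. Normalisation (lower-casing)
tokens = [t.lower() for t in word_tokenize(label)]
# 3. Stop Words Removal
content = [t for t in tokens if t not in stopwords.words("english")]
# 4. Stemming / Lemmatisation
print("Porter:    ", [PorterStemmer().stem(t) for t in content])
print("Snowball:  ", [SnowballStemmer("english").stem(t) for t in content])
print("Lancaster: ", [LancasterStemmer().stem(t) for t in content])
print("Lemmatised:", [WordNetLemmatizer().lemmatize(t) for t in content])
```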