We consider the problem of Open-world Information Extraction (Open-world IE), which extracts comprehensive entity profiles from unstructured texts. Different from the conventional closed-world setting of Information Extraction (IE), Open-world IE considers a more general situation where entities and relations could be beyond a predefined ontology. More importantly, we seek to develop a large language model (LLM) that is able to perform Open-world IE to extract desirable entity profiles characterized by (possibly fine-grained) natural language instructions. We achieve this by finetuning LLMs using instruction tuning. In particular, we construct INSTRUCTOPENWIKI, a substantial instruction tuning dataset for Open-world IE enriched with a comprehensive corpus, extensive annotations, and diverse instructions. We finetune the pretrained BLOOM models on INSTRUCTOPENWIKI and obtain PIVOINE, an LLM for Open-world IE with strong instruction-following capabilities. Our experiments demonstrate that PIVOINE significantly outperforms traditional closed-world methods and other LLM baselines, displaying impressive generalization capabilities on both unseen instructions and out-of-ontology cases. Consequently, PIVOINE emerges as a promising solution to tackle the open-world challenge in IE effectively.
https://arxiv.org/abs/2305.14898
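For intuition, here is a minimal sketch of the kind of instruction–output pair such an instruction-tuned open-world IE model would consume and produce; the field names and JSON-like schema are illustrative, not INSTRUCTOPENWIKI's actual format:

```python
# Hypothetical instruction/output pair for instruction-tuned open-world IE.
# The schema below is illustrative; INSTRUCTOPENWIKI's actual format may differ.
instruction = (
    "Extract a profile of every person mentioned in the text, "
    "including entity type, aliases, and relations to other entities."
)
text = "Marie Curie, born Maria Sklodowska, discovered polonium with Pierre Curie."

expected_output = {
    "entities": [
        {
            "mention": "Marie Curie",
            "type": "person",                  # may lie outside any fixed ontology
            "aliases": ["Maria Sklodowska"],
            "relations": [
                {"relation": "discovered", "object": "polonium"},
                {"relation": "collaborated with", "object": "Pierre Curie"},
            ],
        }
    ]
}
```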
Existing event-centric NLP models often only apply to the pre-defined ontology, which significantly restricts their generalization capabilities. This paper presents CEO, a novel Corpus-based Event Ontology induction model to relax the restriction imposed by pre-defined event ontologies. Without direct supervision, CEO leverages distant supervision from available summary datasets to detect corpus-wise salient events and exploits external event knowledge to force events within a short distance to have close embeddings. Experiments on three popular event datasets show that the schema induced by CEO has better coverage and higher accuracy than previous methods. Moreover, CEO is the first event ontology induction model that can induce a hierarchical event ontology with meaningful names on eleven open-domain corpora, making the induced schema more trustworthy and easier to further curate.
https://arxiv.org/abs/2305.13521
NeSy4VRD is a multifaceted resource designed to support the development of neurosymbolic AI (NeSy) research. NeSy4VRD re-establishes public access to the images of the VRD dataset and couples them with an extensively revised, quality-improved version of the VRD visual relationship annotations. Crucially, NeSy4VRD provides a well-aligned, companion OWL ontology that describes the dataset. NeSy4VRD comes with open source infrastructure that provides comprehensive support for extensibility of the annotations (which, in turn, facilitates extensibility of the ontology), and open source code for loading the annotations to/from a knowledge graph. We are contributing NeSy4VRD to the computer vision, NeSy and Semantic Web communities to help foster more NeSy research using OWL-based knowledge graphs.
https://arxiv.org/abs/2305.13258
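As a rough illustration of the "annotations to/from a knowledge graph" step, a minimal rdflib sketch; the namespace, class, and property names here are invented, not NeSy4VRD's actual ontology vocabulary:

```python
# Sketch: load one visual-relationship annotation into an RDF knowledge graph
# with rdflib. The ontology IRI and class/property names are invented here;
# NeSy4VRD ships its own OWL ontology and loading code.
from rdflib import Graph, Namespace, RDF, Literal

EX = Namespace("http://example.org/vrd#")
g = Graph()
g.bind("ex", EX)

# One annotation: (person, rides, horse) in image 42.
g.add((EX.img42_person1, RDF.type, EX.Person))
g.add((EX.img42_horse1, RDF.type, EX.Horse))
g.add((EX.img42_person1, EX.rides, EX.img42_horse1))
g.add((EX.img42_person1, EX.inImage, Literal(42)))

print(g.serialize(format="turtle"))
```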
In this paper, we study a collection of (geospatial) ontologies of interest together as an ontology (a geospatial ontology) system, consisting of a set of the (geospatial) ontologies and a set of ontology operations. A homomorphism between two ontology systems is a function between their sets of ontologies that preserves these ontology operations. We view clustering a set of ontologies as partitioning the set, equivalently defining an equivalence relation on it, forming a quotient set of it, or obtaining a surjective image of it. Each ontology system homomorphism can be factored as a surjective clustering onto a quotient space, followed by an embedding. Ontology (merging) systems, natural partial orders on such systems, and ontology merging closures in them are then transformed under ontology system homomorphisms, given by quotients and embeddings.
https://arxiv.org/abs/2305.13135
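In symbols, a sketch of the definitions as we read them from the abstract (not the paper's exact notation):

```latex
% Sketch of the abstract's definitions; not the paper's exact notation.
% An ontology system is a set of ontologies with a set of operations:
\[
\mathcal{S} = (O, \{f_i\}_{i \in I}), \qquad f_i : O^{n_i} \to O .
\]
% A homomorphism between systems preserves the operations:
\[
h : \mathcal{S}_1 \to \mathcal{S}_2 \quad\text{with}\quad
h\bigl(f_i(o_1,\dots,o_{n_i})\bigr) = f_i\bigl(h(o_1),\dots,h(o_{n_i})\bigr).
\]
% Every homomorphism factors as a surjection onto a quotient (the clustering)
% followed by an embedding:
\[
h = \iota \circ q, \qquad
q : \mathcal{S}_1 \twoheadrightarrow \mathcal{S}_1/{\ker h}, \qquad
\iota : \mathcal{S}_1/{\ker h} \hookrightarrow \mathcal{S}_2 .
\]
```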
Humans have a natural ability to perform semantic associations with the surrounding objects in the environment. This allows them to create a mental map of the environment, which helps them navigate on demand when given a linguistic instruction. A natural goal in Vision Language Navigation (VLN) research is to impart similar capabilities to autonomous agents. The recently introduced VL Maps \cite{huang23vlmaps} take a step towards this goal by creating a semantic spatial map representation of the environment without any labelled data. However, their representations are of limited practical applicability, as they do not distinguish between different instances of the same object. In this work, we address this limitation by integrating instance-level information into the spatial map representation using a community detection algorithm, and by utilizing word ontology learned by large language models (LLMs) to perform open-set semantic associations in the mapping representation. The resulting map representation improves navigation performance two-fold (233\%) on realistic language commands with instance-specific descriptions compared to VL Maps. We validate the practicality and effectiveness of our approach through extensive qualitative and quantitative experiments.
https://arxiv.org/abs/2305.12363
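A toy sketch of the instance-grouping idea: community detection over a similarity graph of object observations. The features and distance threshold are invented; the paper's actual construction may differ:

```python
# Toy sketch: group individual object observations into instances via
# community detection on a similarity graph. Feature values and the
# distance threshold are invented for illustration.
import networkx as nx
from networkx.algorithms import community

# (observation id, class label, (x, y) map position)
obs = [
    ("o1", "chair", (1.0, 1.0)),
    ("o2", "chair", (1.2, 0.9)),   # same chair as o1
    ("o3", "chair", (5.0, 4.0)),   # a different chair
]

G = nx.Graph()
G.add_nodes_from(o[0] for o in obs)
for i in range(len(obs)):
    for j in range(i + 1, len(obs)):
        (a, ca, pa), (b, cb, pb) = obs[i], obs[j]
        dist = ((pa[0] - pb[0]) ** 2 + (pa[1] - pb[1]) ** 2) ** 0.5
        if ca == cb and dist < 1.0:          # same class, nearby: likely same instance
            G.add_edge(a, b)

instances = community.greedy_modularity_communities(G)
print([sorted(c) for c in instances])        # e.g. [['o1', 'o2'], ['o3']]
```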
Fine-grained entity typing (FET), which assigns entities in text with context-sensitive, fine-grained semantic types, will play an important role in natural language understanding. A supervised FET method, which typically relies on human-annotated corpora for training, is costly and difficult to scale. Recent studies leverage pre-trained language models (PLMs) to generate rich and context-aware weak supervision for FET. However, a PLM may still generate a mixture of rough and fine-grained types, or tokens unsuitable for typing. In this study, we envision that an ontology provides a semantics-rich, hierarchical structure, which will help select the best results generated by multiple PLM models and head words. Specifically, we propose a novel zero-shot, ontology-guided FET method, OntoType, which follows a type ontological structure, from coarse to fine, ensembles multiple PLM prompting results to generate a set of type candidates, and refines its type resolution under the local context with a natural language inference model. Our experiments on the Ontonotes, FIGER, and NYT datasets using their associated ontological structures demonstrate that our method outperforms the state-of-the-art zero-shot fine-grained entity typing methods. Our error analysis shows that refinement of the existing ontology structures will further improve fine-grained entity typing.
https://arxiv.org/abs/2305.12307
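A schematic of the coarse-to-fine resolution loop, assuming a toy type ontology; `ensemble_score` and `nli_entails` are simplified stand-ins for OntoType's PLM-prompt ensembling and NLI refinement:

```python
# Schematic coarse-to-fine type resolution over a type ontology.
# `ensemble_score` stands in for aggregating multiple PLM prompting results,
# and `nli_entails` for the natural-language-inference check; both are
# simplified stand-ins, not OntoType's actual components.
ONTOLOGY = {
    "entity": ["person", "organization"],
    "person": ["artist", "politician"],
    "artist": [],
    "politician": [],
    "organization": [],
}

def ensemble_score(mention, context, type_name):
    # Stand-in: in OntoType this would ensemble several PLM prompt outputs.
    toy_scores = {"person": 0.9, "organization": 0.1, "artist": 0.2, "politician": 0.8}
    return toy_scores.get(type_name, 0.0)

def nli_entails(context, hypothesis):
    return True  # Stand-in for an NLI model verifying the typed hypothesis.

def resolve_type(mention, context, node="entity"):
    children = ONTOLOGY[node]
    if not children:
        return node
    best = max(children, key=lambda t: ensemble_score(mention, context, t))
    hypothesis = f"{mention} is a {best}."
    if nli_entails(context, hypothesis):
        return resolve_type(mention, context, best)
    return node  # stop at the coarser level if refinement is not supported

print(resolve_type("Angela Merkel", "Angela Merkel addressed parliament."))
# -> 'politician' under the toy scores above
```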
Online social media platforms, such as Twitter, are one of the most valuable sources of information during disaster events. Therefore, humanitarian organizations, government agencies, and volunteers rely on a summary of this information, i.e., tweets, for effective disaster management. Although several supervised and unsupervised approaches exist for automated tweet summarization, they either require extensive labeled information or do not incorporate specific domain knowledge of disasters. Additionally, the most recent approaches to disaster summarization have proposed BERT-based models to enhance the summary quality. To improve performance further, we utilize domain-specific knowledge, without any human effort, to assess the importance (salience) of a tweet, which aids summary creation and improves summary quality. In this paper, we propose a disaster-specific tweet summarization framework, IKDSumm, which first identifies the crucial and important information in each tweet related to a disaster through key-phrases of that tweet. We identify these key-phrases by utilizing the domain knowledge of disasters (via an existing ontology) without any human intervention. We then utilize these key-phrases to automatically generate a summary of the tweets. Given tweets related to a disaster, IKDSumm thus ensures fulfillment of the key summarization objectives, such as information coverage, relevance, and diversity in the summary, without any human intervention. We evaluate the performance of IKDSumm against 8 state-of-the-art techniques on 12 disaster datasets. The evaluation results show that IKDSumm outperforms existing techniques by approximately 2-79% in terms of ROUGE-N F1-score.
https://arxiv.org/abs/2305.11592
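The salience-then-select idea can be sketched as follows; the key-phrase weights and the greedy coverage objective are simplifications of IKDSumm's ontology-driven scoring:

```python
# Sketch: score tweets by ontology-derived key-phrases, then greedily build
# a summary that maximizes key-phrase coverage. Weights are invented;
# IKDSumm derives its key-phrases from an existing disaster ontology.
KEYPHRASE_WEIGHT = {"earthquake": 3.0, "casualties": 2.5, "rescue": 2.0, "shelter": 1.5}

tweets = [
    "Massive earthquake hits the coast, casualties reported",
    "Rescue teams deployed to the affected area",
    "My cat ignored the earthquake completely",
]

def keyphrases(tweet):
    return {k for k in KEYPHRASE_WEIGHT if k in tweet.lower()}

def summarize(tweets, budget=2):
    covered, summary = set(), []
    for _ in range(budget):
        # Pick the tweet adding the most not-yet-covered key-phrase weight.
        best = max(tweets, key=lambda t: sum(KEYPHRASE_WEIGHT[k]
                                             for k in keyphrases(t) - covered))
        summary.append(best)
        covered |= keyphrases(best)
        tweets = [t for t in tweets if t != best]
    return summary

print(summarize(tweets))
```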
Geocoding is the task of converting location mentions in text into structured data that encodes the geospatial semantics. We propose a new architecture for geocoding, GeoNorm. GeoNorm first uses information retrieval techniques to generate a list of candidate entries from the geospatial ontology. Then it reranks the candidate entries using a transformer-based neural network that incorporates information from the ontology such as the entry's population. This generate-and-rerank process is applied twice: first to resolve the less ambiguous countries, states, and counties, and second to resolve the remaining location mentions, using the identified countries, states, and counties as context. Our proposed toponym resolution framework achieves state-of-the-art performance on multiple datasets. Code and models are available at \url{this https URL}.
https://arxiv.org/abs/2305.11315
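A toy reduction of the two-pass generate-and-rerank flow; the gazetteer and scoring functions below are placeholders for GeoNorm's retrieval step and transformer reranker:

```python
# Toy sketch of the generate-and-rerank flow. The gazetteer and scoring are
# placeholders; GeoNorm uses information retrieval for candidate generation
# and a transformer reranker with ontology features such as population.
GAZETTEER = [
    {"name": "Paris", "admin": "France", "population": 2_100_000},
    {"name": "Paris", "admin": "Texas, United States", "population": 25_000},
]

def generate_candidates(mention):
    return [e for e in GAZETTEER if e["name"].lower() == mention.lower()]

def rerank(candidates, context_regions):
    def score(e):
        s = e["population"]                      # ontology feature: population prior
        if any(r in e["admin"] for r in context_regions):
            s *= 100                             # boost entries matching resolved context
        return s
    return max(candidates, key=score)

# Pass 1: resolve the less ambiguous country/state mentions first.
context = ["France"]                             # e.g. resolved from "..., France" in the text
# Pass 2: resolve the remaining mentions using that context.
print(rerank(generate_candidates("Paris"), context)["admin"])   # -> 'France'
```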
Learning vectors that capture the meaning of concepts remains a fundamental challenge. Somewhat surprisingly, perhaps, pre-trained language models have thus far only enabled modest improvements to the quality of such concept embeddings. Current strategies for using language models typically represent a concept by averaging the contextualised representations of its mentions in some corpus. This is potentially sub-optimal for at least two reasons. First, contextualised word vectors have an unusual geometry, which hampers downstream tasks. Second, concept embeddings should capture the semantic properties of concepts, whereas contextualised word vectors are also affected by other factors. To address these issues, we propose two contrastive learning strategies, based on the view that whenever two sentences reveal similar properties, the corresponding contextualised vectors should also be similar. One strategy is fully unsupervised, estimating the properties which are expressed in a sentence from the neighbourhood structure of the contextualised word embeddings. The second strategy instead relies on a distant supervision signal from ConceptNet. Our experimental results show that the resulting vectors substantially outperform existing concept embeddings in predicting the semantic properties of concepts, with the ConceptNet-based strategy achieving the best results. These findings are furthermore confirmed in a clustering task and in the downstream task of ontology completion.
https://arxiv.org/abs/2305.09785
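A few lines showing the averaging baseline the paper starts from and the similarity its contrastive strategies act on; the vectors are random stand-ins for contextualised embeddings:

```python
# Sketch: a concept vector as the average of contextualised mention vectors,
# and the cosine similarity that the contrastive objectives act on.
# Vectors are random stand-ins for contextualised embeddings.
import numpy as np

rng = np.random.default_rng(0)
mention_vectors = rng.normal(size=(5, 768))     # 5 mentions of one concept

concept_vector = mention_vectors.mean(axis=0)   # the standard averaging baseline

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# A contrastive objective would maximize cosine between contextualised vectors
# of sentence pairs judged (by neighbourhood structure, or by ConceptNet) to
# express similar properties, and minimize it otherwise.
other = rng.normal(size=768)
print(cosine(concept_vector, other))
```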
We present a method for extracting general modules for ontologies formulated in the description logic ALC. A module for an ontology is ideally a substantially smaller ontology that preserves all entailments for a user-specified set of terms. As such, it has applications such as ontology reuse and ontology analysis. Different from classical modules, general modules may use axioms not explicitly present in the input ontology, which allows for additional conciseness. So far, general modules have only been investigated for lightweight description logics. We present the first work that considers the more expressive description logic ALC. In particular, our contribution is a new method based on uniform interpolation, supported by some new theoretical results. Our evaluation indicates that our general modules are often smaller than classical modules and uniform interpolants computed by the state of the art and, compared with uniform interpolants, can be computed in significantly shorter time. Moreover, our method can be used for, and in fact improves, the computation of uniform interpolants and classical modules.
https://arxiv.org/abs/2305.09503
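A minimal toy example (ours, not from the paper) of the extra conciseness general modules buy:

```latex
% Toy example, not from the paper. For the ontology
% O = { A \sqsubseteq \exists r.B,\; B \sqsubseteq C } and the signature {A, r, C},
% a classical module must keep both axioms: dropping B \sqsubseteq C loses the
% entailment below, yet B is not in the signature. A general module may
% instead use a single new axiom entailed by O:
\[
\mathcal{M} = \{\, A \sqsubseteq \exists r.C \,\},
\]
% which preserves the entailments of O over {A, r, C} with one axiom
% not explicitly present in O.
```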
We propose a method that allows two agents to develop shared understanding for the purpose of performing a task that requires cooperation. Our method focuses on efficiently establishing successful task-oriented communication in an open multi-agent system, where the agents do not know anything about each other and can only communicate via grounded interaction. The method aims to assist researchers who work on human-machine interaction or scenarios that require a human-in-the-loop, by defining interaction restrictions and efficiency metrics. To that end, we point out the challenges and limitations of such a (diverse) setup, while also defining restrictions and requirements which aim to ensure that high task performance truthfully reflects the extent to which the agents correctly understand each other. Furthermore, we demonstrate a use-case where our method can be applied to the task of cooperative query answering. We design the experiments by modifying an established ontology alignment benchmark. In this example, the agents want to query each other, while representing different databases defined in their own ontologies that contain different and incomplete knowledge. Grounded interaction here takes the form of examples consisting of common instances, for which the agents are expected to have similar knowledge. Our experiments demonstrate successful communication establishment under the required restrictions, and compare different agent policies that aim to solve the task in an efficient manner.
https://arxiv.org/abs/2305.09349
Event detection (ED) aims to identify the key trigger words in unstructured text and predict the event types accordingly. Traditional ED models are too data-hungry to accommodate real applications with scarce labeled data. Besides, typical ED models face context-bypassing and disabled generalization issues caused by the trigger bias stemming from ED datasets. Therefore, we focus on the true few-shot paradigm to satisfy low-resource scenarios. In particular, we propose a multi-step prompt learning model (MsPrompt) for debiasing few-shot event detection, which consists of the following three components: an under-sampling module targeting the construction of a novel training set that accommodates the true few-shot setting, a multi-step prompt module equipped with a knowledge-enhanced ontology to sufficiently leverage the event semantics and latent prior knowledge in the PLMs for tackling the context-bypassing problem, and a prototypical module compensating for the weakness of classifying events with sparse data and boosting the generalization performance. Experiments on two public datasets, ACE-2005 and FewEvent, show that MsPrompt can outperform the state-of-the-art models, especially in strict low-resource scenarios, reporting an 11.43% improvement in terms of weighted F1-score against the best-performing baseline and achieving an outstanding debiasing performance.
https://arxiv.org/abs/2305.09335
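In isolation, the prototypical component reduces to nearest-prototype classification; a generic sketch, not MsPrompt's full pipeline:

```python
# Generic prototypical-network-style classification: each event type is
# represented by the mean embedding (prototype) of its few labeled examples,
# and a query is assigned to the nearest prototype. A sketch of the idea
# only, not MsPrompt's full model.
import numpy as np

rng = np.random.default_rng(1)
support = {                                   # few-shot support set per event type
    "Attack":   rng.normal(loc=0.0, size=(5, 64)),
    "Transfer": rng.normal(loc=3.0, size=(5, 64)),
}
prototypes = {label: embs.mean(axis=0) for label, embs in support.items()}

def classify(query_embedding):
    return min(prototypes,
               key=lambda label: np.linalg.norm(query_embedding - prototypes[label]))

query = rng.normal(loc=3.0, size=64)          # stand-in for an encoded trigger
print(classify(query))                        # -> 'Transfer' with high probability
```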
We describe a translation from a fragment of SUMO (SUMO-K) into higher-order set theory. The translation provides a formal semantics for portions of SUMO which are beyond first-order and which have previously only had an informal interpretation. It also for the first time embeds a large common-sense ontology into a very secure interactive theorem proving system. We further extend our previous work in finding contradictions in SUMO from first order constructs to include a portion of SUMO's higher order constructs. Finally, using the translation, we can create problems that can be proven using higher-order interactive and automated theorem provers. This is tested in several systems and can be used to form a corpus of higher-order common-sense reasoning problems.
https://arxiv.org/abs/2305.07903
Machine learning with Semantic Web ontologies follows several strategies, one of which involves projecting ontologies into graph structures and applying graph embeddings or graph-based machine learning methods to the resulting graphs. Several methods have been developed that project ontology axioms into graphs. However, these methods are limited in the type of axioms they can project (totality), whether they are invertible (injectivity), and how they exploit semantic information. These limitations restrict the kind of tasks to which they can be applied. Category-theoretical semantics of logic languages formalizes interpretations using categories instead of sets, and categories have a graph-like structure. We developed CatE, which uses the category-theoretical formulation of the semantics of the Description Logic $\mathcal{ALC}$ to generate a graph representation for ontology axioms. The CatE projection is total and injective, and therefore overcomes limitations of other graph-based ontology embedding methods, which are generally not invertible. We apply CatE to a number of different tasks, including deductive and inductive reasoning, and we demonstrate that CatE improves over state-of-the-art ontology embedding methods. Furthermore, we show that CatE can also outperform model-theoretic ontology embedding methods in machine learning tasks in the biomedical domain.
https://arxiv.org/abs/2305.07163
Universal source separation (USS) is a fundamental research task for computational auditory scene analysis, which aims to separate mono recordings into individual source tracks. Three challenges await a solution to the audio source separation task. First, previous audio source separation systems mainly focus on separating one or a limited number of specific sources; there is a lack of research on building a unified system that can separate arbitrary sources via a single model. Second, most previous systems require clean source data to train a separator, while clean source data are scarce. Third, there is a lack of USS systems that can automatically detect and separate active sound classes at a hierarchical level. To use large-scale weakly labeled/unlabeled audio data for audio source separation, we propose a universal audio source separation framework containing: 1) an audio tagging model trained on weakly labeled data as a query net; and 2) a conditional source separation model that takes query net outputs as conditions to separate arbitrary sound sources. We investigate various query nets, source separation models, and training strategies, and propose a hierarchical USS strategy to automatically detect and separate sound classes from the AudioSet ontology. By solely leveraging the weakly labelled AudioSet, our USS system succeeds in separating a wide variety of sound classes, including sound event separation, music source separation, and speech enhancement. The USS system achieves an average signal-to-distortion ratio improvement (SDRi) of 5.57 dB over the 527 sound classes of AudioSet; 10.57 dB on the DCASE 2018 Task 2 dataset; 8.12 dB on the MUSDB18 dataset; an SDRi of 7.28 dB on the Slakh2100 dataset; and an SSNR of 9.00 dB on the voicebank-demand dataset. We release the source code at this https URL.
https://arxiv.org/abs/2305.07447
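The query-conditioning idea in miniature: a FiLM-style modulation sketch in PyTorch, with invented dimensions and architecture rather than the paper's actual separator:

```python
# Miniature sketch of query-conditioned separation: an audio-tagging "query
# net" output is mapped to per-channel scale/shift (FiLM-style) that steers a
# separator toward the requested source. Architecture and sizes are invented.
import torch
import torch.nn as nn

class ConditionalSeparator(nn.Module):
    def __init__(self, n_bins=257, cond_dim=527, hidden=256):
        super().__init__()
        self.encode = nn.Linear(n_bins, hidden)
        self.film = nn.Linear(cond_dim, 2 * hidden)   # scale and shift from the query
        self.mask = nn.Linear(hidden, n_bins)

    def forward(self, spec, condition):
        # spec: (batch, frames, n_bins); condition: (batch, cond_dim)
        h = torch.relu(self.encode(spec))
        scale, shift = self.film(condition).chunk(2, dim=-1)
        h = h * scale.unsqueeze(1) + shift.unsqueeze(1)
        return spec * torch.sigmoid(self.mask(h))     # masked spectrogram

model = ConditionalSeparator()
spec = torch.rand(2, 100, 257)
condition = torch.rand(2, 527)                        # e.g. AudioSet tag probabilities
print(model(spec, condition).shape)                   # torch.Size([2, 100, 257])
```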
This paper shines a light on the potential of definition-based semantic models for detecting idiomatic and semi-idiomatic multiword expressions (MWEs) in clinical terminology. Our study focuses on biomedical entities defined in the UMLS ontology and aims to help prioritize the translation efforts of these entities. In particular, we develop an effective tool for scoring the idiomaticity of biomedical MWEs based on the degree of similarity between the semantic representations of those MWEs and a weighted average of the representation of their constituents. We achieve this using a biomedical language model trained to produce similar representations for entity names and their definitions, called BioLORD. The importance of this definition-based approach is highlighted by comparing the BioLORD model to two other state-of-the-art biomedical language models based on Transformer: SapBERT and CODER. Our results show that the BioLORD model has a strong ability to identify idiomatic MWEs, not replicated in other models. Our corpus-free idiomaticity estimation helps ontology translators to focus on more challenging MWEs.
https://arxiv.org/abs/2305.06801
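The scoring rule itself is simple; a sketch with random stand-in embeddings (BioLORD would supply the actual vectors):

```python
# Sketch of the idiomaticity score: similarity between the embedding of a
# multiword expression and a weighted average of its constituents' embeddings.
# Low similarity suggests idiomatic (non-compositional) meaning. Embeddings
# here are random stand-ins for BioLORD representations.
import numpy as np

rng = np.random.default_rng(2)
embed = lambda term: rng.normal(size=300)     # stand-in for a BioLORD encoder

def compositionality(mwe_terms, weights=None):
    mwe_vec = embed(" ".join(mwe_terms))
    parts = np.stack([embed(t) for t in mwe_terms])
    weights = np.ones(len(mwe_terms)) / len(mwe_terms) if weights is None else weights
    avg = weights @ parts
    return float(mwe_vec @ avg / (np.linalg.norm(mwe_vec) * np.linalg.norm(avg)))

# High score suggests a compositional MWE; a low score flags a likely
# idiomatic one, to be prioritized for careful human translation.
print(compositionality(["heart", "failure"]))
```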
Knowledge graphs (KGs) are a popular way to organise information based on ontologies or schemas and have been used across a variety of scenarios from search to recommendation. Despite advances in KGs, representing knowledge remains a non-trivial task across industries and it is especially challenging in the biomedical and healthcare domains due to complex interdependent relations between entities, heterogeneity, lack of standardization, and sparseness of data. KGs are used to discover diagnoses or prioritize genes relevant to disease, but they often rely on schemas that are not centred around a node or entity of interest, such as a person. Entity-centric KGs are relatively unexplored but hold promise in representing important facets connected to a central node and unlocking downstream tasks beyond graph traversal and reasoning, such as generating graph embeddings and training graph neural networks for a wide range of predictive tasks. This paper presents an end-to-end representation learning framework to extract entity-centric KGs from structured and unstructured data. We introduce a star-shaped ontology to represent the multiple facets of a person and use it to guide KG creation. Compact representations of the graphs are created leveraging graph neural networks and experiments are conducted using different levels of heterogeneity or explicitness. A readmission prediction task is used to evaluate the results of the proposed framework, showing a stable system, robust to missing data, that outperforms a range of baseline machine learning classifiers. We highlight that this approach has several potential applications across domains and is open-sourced. Lastly, we discuss lessons learned, challenges, and next steps for the adoption of the framework in practice.
https://arxiv.org/abs/2305.05640
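The star-shaped, person-centric layout can be pictured with a toy graph; the facet names are invented:

```python
# Toy person-centric star graph: facets hang off a central person node, as in
# the star-shaped ontology described above. Facet names are invented here.
import networkx as nx

G = nx.MultiDiGraph()
person = "patient_001"
facets = {
    "demographics": {"age": 67},
    "diagnosis": {"code": "I50.9"},
    "medication": {"name": "furosemide"},
    "admission": {"ward": "cardiology"},
}
for facet, attrs in facets.items():
    node = f"{person}/{facet}"
    G.add_node(node, **attrs)
    G.add_edge(person, node, relation=f"has_{facet}")

print(G.number_of_nodes(), G.number_of_edges())   # 5 nodes, 4 edges
# A GNN over graphs like this yields the compact representations used for
# downstream prediction (e.g., readmission).
```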
We develop computational models to analyze court statements in order to assess judicial attitudes toward victims of sexual violence in the Israeli court system. The study examines the resonance of "rape myths" in the criminal justice system's response to sex crimes, in particular in judicial assessment of victims' credibility. We begin by formulating an ontology for evaluating judicial attitudes toward victims' credibility, with eight ordinal labels and binary categorizations. Second, we curate a manually annotated dataset for judicial assessments of victims' credibility in the Hebrew language, as well as a model that can extract credibility labels from court cases. The dataset consists of 855 verdict decision documents in sexual assault cases from 1990-2021, annotated with the help of legal experts and trained law students. The model uses a combined approach of syntactic and latent structures to find sentences that convey the judge's attitude towards the victim and classify them according to the credibility label set. Our ontology, data, and models will be made available upon request, in the hope that they spur future progress on this important judicial task.
https://arxiv.org/abs/2305.05302
The growing trend of Large Language Model (LLM) development has attracted significant attention, with models for various applications emerging consistently. However, combining Large Language Models with semantic technologies for reasoning and inference remains a challenging task. This paper analyzes how current advances in foundation LLMs, like ChatGPT, compare with specialized pretrained models, like REBEL, for joint entity and relation extraction. To evaluate this approach, we conducted several experiments using sustainability-related text as our use case. We created pipelines for the automatic creation of Knowledge Graphs from raw texts, and our findings indicate that using advanced LLM models can improve the accuracy of creating these graphs from unstructured text. Furthermore, we explored the potential of automatic ontology creation using foundation LLM models, which resulted in even more relevant and accurate knowledge graphs.
https://arxiv.org/abs/2305.04676
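The pipeline shape, with the LLM call abstracted away; `call_llm` is a hypothetical stand-in, and the prompt and line format are illustrative only:

```python
# Shape of a text-to-knowledge-graph pipeline with an LLM extractor.
# `call_llm` is a hypothetical stand-in for an actual model call (ChatGPT,
# REBEL, etc.); the prompt and the line format are illustrative only.
def call_llm(prompt: str) -> str:
    # Stand-in returning the kind of output the prompt asks for.
    return ("(wind turbines; reduce; carbon emissions)\n"
            "(wind turbines; located in; offshore farms)")

def extract_triples(text: str):
    prompt = (
        "Extract (subject; relation; object) triples from the text below, "
        "one per line, in parentheses.\n\n" + text
    )
    triples = []
    for line in call_llm(prompt).splitlines():
        parts = [p.strip() for p in line.strip("() ").split(";")]
        if len(parts) == 3:
            triples.append(tuple(parts))
    return triples

print(extract_triples("Wind turbines reduce carbon emissions ..."))
```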
Ontologies play a critical role in Semantic Web technologies by providing a structured and standardized way to represent knowledge, enabling machines to understand the meaning of data. Several taxonomies and ontologies have been generated, but each typically targets a single domain, and building them demands substantial time and manual effort. They also lack coverage of unconventional topics that would represent a more holistic and comprehensive view of the knowledge landscape and of interdisciplinary collaborations. Thus, there is a need for an ontology that covers Science and Technology and facilitates multidisciplinary research by connecting topics from different fields and domains that may be related or have commonalities. To address these issues, we present an automatic Science and Technology Ontology (S&TO) that covers unconventional topics in different science and technology domains. The proposed S&TO can promote the discovery of new research areas and collaborations across disciplines. The ontology is constructed by applying BERTopic to a dataset of 393,991 scientific articles collected from Semantic Scholar from October 2021 to August 2022, covering four fields of science. Currently, S&TO includes 5,153 topics and 13,155 semantic relations. The S&TO model can be updated by running BERTopic on more recent datasets.
https://arxiv.org/abs/2305.04055
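The core construction step uses BERTopic's standard fit/transform API; in the sketch below, 20 Newsgroups stands in for the 393,991 Semantic Scholar abstracts:

```python
# Core construction step with BERTopic's standard fit/transform API.
# Any sizable corpus of abstracts works; 20 Newsgroups stands in here for
# the Semantic Scholar articles used to build S&TO. Updating the ontology
# amounts to refitting on a newer corpus.
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())   # one row per induced topic
```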