Nowadays, increasingly more data are available as knowledge graphs (KGs). While this data model supports advanced reasoning and querying, KGs remain difficult to mine due to their size and complexity. Graph mining approaches can be used to extract patterns from KGs. However, this presents two main issues. First, graph mining approaches tend to extract too many patterns for a human analyst to interpret (pattern explosion). Second, real-life KGs tend to differ from the graphs usually treated in graph mining: they are multigraphs, their vertex degrees tend to follow a power law, and the way in which they model knowledge can produce spurious patterns. Recently, a graph mining approach named GraphMDL+ has been proposed to tackle the problem of pattern explosion, using the Minimum Description Length (MDL) principle. However, GraphMDL+, like other graph mining approaches, is not suited for KGs without adaptations. In this paper we propose KG-MDL, a graph pattern mining approach based on the MDL principle that, given a KG, generates a human-sized and descriptive set of graph patterns, and does so in a parameter-less and anytime way. We report on experiments on medium-sized KGs showing that our approach generates sets of patterns that are both small enough to be interpreted by humans and descriptive of the KG. We show that the extracted patterns highlight relevant characteristics of the data: both of the schema used to create the data, and of the concrete facts it contains. We also discuss the issues related to mining graph patterns on knowledge graphs, as opposed to other types of graph data.
https://arxiv.org/abs/2309.12908
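To make the MDL idea above concrete, here is a minimal sketch of a two-part description length L(M) + L(D|M) used to compare pattern sets. The encoding below (one bit per pattern edge, log2(|E|+1) bits per uncovered edge) is a toy assumption for illustration, not KG-MDL's actual encoding.

```python
import math

def description_length(pattern_set, graph_edges):
    """Toy two-part MDL score: L(M) + L(D|M).

    Assumed toy encoding: one bit per edge inside a pattern (model cost),
    log2(|E|+1) bits per edge left uncovered (data cost). KG-MDL's real
    encoding is considerably more refined.
    """
    model_bits = sum(len(p) for p in pattern_set)
    covered = set().union(*pattern_set) if pattern_set else set()
    residual = [e for e in graph_edges if e not in covered]
    return model_bits + len(residual) * math.log2(len(graph_edges) + 1)

edges = {(1, "type", "Person"), (1, "knows", 2), (2, "type", "Person")}
patterns = [frozenset({(1, "type", "Person"), (2, "type", "Person")})]
# A pattern set is worth keeping if it compresses the graph better
# than the empty model that encodes every edge individually.
print(description_length(patterns, edges) < description_length([], edges))  # True
```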
Human knowledge is subject to uncertainties, imprecision, incompleteness and inconsistencies. Moreover, the meaning of many everyday terms is dependent on the context. That poses a huge challenge for the Semantic Web. This paper introduces work on an intuitive notation and model for defeasible reasoning with imperfect knowledge, and relates it to previous work on argumentation theory. PKN is to N3 as defeasible reasoning is to deductive logic. Further work is needed on an intuitive syntax for describing reasoning strategies and tactics in declarative terms, drawing upon the AIF ontology for inspiration. The paper closes with observations on symbolic approaches in the era of large language models.
https://arxiv.org/abs/2309.12731
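The abstract does not give PKN's syntax, but the flavor of defeasible reasoning it targets can be shown with a deliberately tiny stand-in: a rule's conclusion holds by default unless a known exception defeats it. The rule format and predicate names below are hypothetical, not PKN notation.

```python
def defeasible_conclusions(facts, rules):
    """Toy defeasible inference: each rule is (premise, conclusion,
    exceptions); the conclusion is derived unless an exception is among
    the facts. Illustrative only, not PKN's actual semantics."""
    derived = set(facts)
    for premise, conclusion, exceptions in rules:
        if premise in derived and not exceptions & derived:
            derived.add(conclusion)
    return derived - set(facts)

rules = [("bird", "flies", {"penguin", "injured"})]
print(defeasible_conclusions({"bird"}, rules))             # {'flies'}
print(defeasible_conclusions({"bird", "penguin"}, rules))  # set()
```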
Many mathematical models have been leveraged to design embeddings for representing Knowledge Graph (KG) entities and relations for link prediction and many downstream tasks. These mathematically-inspired models are not only highly scalable for inference in large KGs, but also have many explainable advantages in modeling different relation patterns that can be validated through both formal proofs and empirical results. In this paper, we provide a comprehensive overview of the current state of research in KG completion. In particular, we focus on two main branches of KG embedding (KGE) design: 1) distance-based methods and 2) semantic matching-based methods. We discover the connections between recently proposed models and present an underlying trend that might help researchers invent novel and more effective models. Next, we delve into CompoundE and CompoundE3D, which draw inspiration from 2D and 3D affine operations, respectively. They encompass a broad spectrum of techniques, including distance-based and semantic-based methods. We also discuss an emerging approach for KG completion that leverages pre-trained language models (PLMs) and textual descriptions of entities and relations, and offer insights into the integration of KGE methods with PLMs for KG completion.
https://arxiv.org/abs/2309.12501
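The two branches surveyed here can be illustrated with the canonical scoring functions of TransE (distance-based) and DistMult (semantic matching). The formulas are the standard ones; the embeddings below are random stand-ins rather than trained vectors.

```python
import numpy as np

def transe_score(h, r, t):
    """Distance-based scoring: a triple is plausible when h + r lands
    close to t (negated L1 distance, higher is better)."""
    return -np.linalg.norm(h + r - t, ord=1)

def distmult_score(h, r, t):
    """Semantic-matching scoring: trilinear product <h, r, t>."""
    return float(np.sum(h * r * t))

rng = np.random.default_rng(0)
h, r = rng.normal(size=50), rng.normal(size=50)
t = h + r  # a triple TransE considers perfectly plausible
print(transe_score(h, r, t), distmult_score(h, r, t))
```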
Applying link prediction (LP) methods over knowledge graphs (KG) for tasks such as causal event prediction presents an exciting opportunity. However, typical LP models are ill-suited for this task as they are incapable of performing inductive link prediction for new, unseen event entities and they require retraining as knowledge is added or changed in the underlying KG. We introduce a case-based reasoning model, EvCBR, to predict properties about new consequent events based on similar cause-effect events present in the KG. EvCBR uses statistical measures to identify similar events and performs path-based predictions, requiring no training step. To generalize our methods beyond the domain of event prediction, we frame our task as a 2-hop LP task, where the first hop is a causal relation connecting a cause event to a new effect event and the second hop is a property about the new event which we wish to predict. The effectiveness of our method is demonstrated using a novel dataset of newsworthy events with causal relations curated from Wikidata, where EvCBR outperforms baselines including translational-distance-based, GNN-based, and rule-based LP models.
https://arxiv.org/abs/2309.12423
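A sketch of the case-based idea, loosely in the spirit of EvCBR: find past cause events similar to the new cause, follow the causal relation to their effect events, and vote on the target property's value. The dictionary KG format and the similarity-weighted voting are assumptions for this sketch; the paper's statistical measures and path-based predictions are richer.

```python
from collections import Counter

def predict_effect_property(kg, new_cause_props, causal_rel, target_prop):
    """Toy case-based prediction: similar past causes (shared properties)
    vote, weighted by similarity, for the target property of the new
    effect event. No training step is required."""
    votes = Counter()
    for cause, facts in kg.items():
        sim = len(set(facts.get("props", [])) & set(new_cause_props))
        if sim == 0:
            continue
        for effect in facts.get(causal_rel, []):
            for value in kg.get(effect, {}).get(target_prop, []):
                votes[value] += sim
    return votes.most_common()

kg = {
    "flood_2019": {"props": ["flood", "asia"], "hasEffect": ["evac_2019"]},
    "evac_2019": {"displaced": ["thousands"]},
}
print(predict_effect_property(kg, ["flood", "asia"], "hasEffect", "displaced"))
```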
The emergence of large language models (LLMs) presents an unprecedented opportunity to automate construction contract management, reducing human errors and saving significant time and costs. However, LLMs may produce convincing yet inaccurate and misleading content due to a lack of domain expertise. To address this issue, expert-driven contract knowledge can be represented in a structured manner to constrain the automatic contract management process. This paper introduces the Nested Contract Knowledge Graph (NCKG), a knowledge representation approach that captures the complexity of contract knowledge using a nested structure. It includes a nested knowledge representation framework, an NCKG ontology built on the framework, and an implementation method. Furthermore, we present an LLM-assisted contract review pipeline enhanced with external knowledge in NCKG. Our pipeline achieves promising performance in contract risk reviewing, shedding light on the combination of LLMs and KGs towards more reliable and interpretable contract management.
https://arxiv.org/abs/2309.12132
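The core representational idea, statements nested inside statements, can be sketched with a recursive triple type: an obligation clause whose object is itself a condition statement. The clause and property names below are hypothetical; the real NCKG ontology defines its own nesting vocabulary.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Triple:
    """A statement whose object may itself be a statement, so a clause
    like 'the contractor shall notify the owner if a delay occurs' can
    nest a condition inside an obligation (illustrative only)."""
    subject: str
    predicate: str
    obj: Union[str, "Triple"]

condition = Triple("Delay", "occursOn", "CriticalPath")
obligation = Triple("Contractor", "shallNotifyOwnerIf", condition)
print(obligation)
```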
Relation triple extraction (RTE) is an essential task in information extraction and knowledge graph construction. Despite recent advancements, existing methods still exhibit certain limitations. They simply employ generalized pre-trained models and do not consider the specificity of RTE tasks. Moreover, existing tagging-based approaches typically decompose the RTE task into two subtasks, initially identifying subjects and subsequently identifying objects and relations. They solely focus on extracting relational triples from subject to object, neglecting that once the extraction of a subject fails, all triples associated with that subject are lost. To address these issues, we propose BitCoin, an innovative joint relational triple extraction framework based on bidirectional tagging and supervised contrastive learning. Specifically, we design a supervised contrastive learning method that considers multiple positives per anchor rather than restricting it to just one positive. Furthermore, a penalty term is introduced to prevent excessive similarity between the subject and object. Our framework implements taggers in two directions, enabling triple extraction from subject to object and from object to subject. Experimental results show that BitCoin achieves state-of-the-art results on the benchmark datasets and significantly improves the F1 score on Normal, SEO, EPO, and multiple relation extraction tasks.
https://arxiv.org/abs/2309.11853
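A sketch of the two loss ingredients named above: a supervised contrastive loss with several positives per anchor (generic SupCon form) and a penalty on subject-object similarity. BitCoin's exact formulation may differ; the embeddings and labels here are random stand-ins.

```python
import numpy as np

def supcon_loss(z, labels, tau=0.1):
    """Supervised contrastive loss with multiple positives per anchor
    (generic SupCon form; the paper's variant may differ in detail)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    n, total = len(z), 0.0
    for i in range(n):
        others = np.arange(n) != i
        log_denom = np.log(np.exp(sim[i][others]).sum())
        pos = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if pos:
            total -= np.mean([sim[i, p] - log_denom for p in pos])
    return total / n

def subject_object_penalty(subj, obj):
    """Penalty term discouraging excessive subject/object similarity."""
    cos = subj @ obj / (np.linalg.norm(subj) * np.linalg.norm(obj))
    return max(0.0, float(cos))

z = np.random.default_rng(0).normal(size=(6, 8))
print(supcon_loss(z, labels=[0, 0, 1, 1, 2, 2]))
```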
Wikipedia articles are hierarchically organized through categories and lists, providing one of the most comprehensive and universal taxonomies, but their open creation is causing redundancies and inconsistencies. Assigning DBpedia classes to Wikipedia categories and lists can alleviate the problem, realizing a large knowledge graph that is essential for categorizing digital contents through entity linking and typing. However, the existing CaLiGraph approach produces incomplete and insufficiently fine-grained mappings. In this paper, we tackle the problem as ontology alignment, where structural information of knowledge graphs and lexical and semantic features of ontology class names are utilized to discover confident mappings, which are in turn utilized for finetuning pretrained language models in a distant supervision fashion. Our method SLHCat consists of two main parts: 1) automatically generating training data by leveraging knowledge graph structure, semantic similarities, and named entity typing; 2) finetuning and prompt-tuning the pre-trained language model BERT over the training data, to capture semantic and syntactic properties of class names. Our model SLHCat is evaluated over a benchmark dataset constructed by annotating 3,000 fine-grained CaLiGraph-DBpedia mapping pairs. SLHCat outperforms the baseline model by a large margin of 25% in accuracy, offering a practical solution for large-scale ontology mapping.
https://arxiv.org/abs/2309.11791
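The distant-supervision step can be illustrated with a stripped-down version of "confident mapping" generation: pair each category with its lexically closest class name and keep only high-scoring pairs for finetuning. This sketch keeps only the lexical signal; SLHCat additionally uses graph structure, semantic similarity, and entity typing, and the threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher

def confident_mappings(categories, classes, threshold=0.5):
    """Distant-supervision pairs from lexical similarity between
    Wikipedia category names and DBpedia class names (sketch only)."""
    pairs = []
    for cat in categories:
        ratio = lambda c: SequenceMatcher(None, cat.lower(), c.lower()).ratio()
        best = max(classes, key=ratio)
        if ratio(best) >= threshold:
            pairs.append((cat, best, ratio(best)))  # later used to finetune BERT
    return pairs

print(confident_mappings(["American rock bands"], ["RockBand", "Band", "City"]))
```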
Datasets that pair Knowledge Graphs (KG) and text together (KG-T) can be used to train forward and reverse neural models that generate text from KG and vice versa. However, models trained on datasets where KG and text pairs are not equivalent can suffer from more hallucination and poorer recall. In this paper, we verify this empirically by generating datasets with different levels of noise and find that noisier datasets do indeed lead to more hallucination. We argue that the ability of forward and reverse models trained on a dataset to cyclically regenerate the source KG or text is a proxy for the equivalence between the KG and the text in the dataset. Using cyclic evaluation, we find that the manually created WebNLG is much better than the automatically created TeKGen and T-REx. Guided by these observations, we construct a new, improved dataset called LAGRANGE using heuristics meant to improve equivalence between KG and text, and show the impact of each heuristic on cyclic evaluation. We also construct two synthetic datasets using large language models (LLMs), and observe that these are conducive to models that perform well on cyclic generation of text, but less so on cyclic generation of KGs, probably because of a lack of a consistent underlying ontology.
https://arxiv.org/abs/2309.11669
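The cyclic-evaluation idea reduces to a round trip: run the forward model (KG to text), feed the output to the reverse model (text to KG), and score the reconstruction against the source. The stand-in models and triple-F1 scorer below are assumptions for the sketch; the paper uses trained neural models and its own metrics.

```python
def triple_f1(predicted, gold):
    """F1 over sets of triples, scoring how well KG -> text -> KG
    reconstructs the source KG."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall) if tp else 0.0

def cyclic_kg_score(kg, forward_model, reverse_model):
    """Equivalence of a KG-text pair, proxied by round-trip fidelity."""
    text = forward_model(kg)       # KG -> text
    kg_back = reverse_model(text)  # text -> KG
    return triple_f1(kg_back, kg)

# Stand-in models (real ones would be trained seq2seq systems):
fwd = lambda kg: " . ".join(" ".join(t) for t in sorted(kg))
rev = lambda s: {tuple(part.split(" ", 2)) for part in s.split(" . ") if part}
kg = {("Berlin", "capitalOf", "Germany")}
print(cyclic_kg_score(kg, fwd, rev))  # 1.0 for a lossless round trip
```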
We present a comprehensive benchmark dataset for Knowledge Graph Question Answering in Materials Science (KGQA4MAT), with a focus on metal-organic frameworks (MOFs). A knowledge graph for metal-organic frameworks (MOF-KG) has been constructed by integrating structured databases and knowledge extracted from the literature. To enhance MOF-KG accessibility for domain experts, we aim to develop a natural language interface for querying the knowledge graph. We have developed a benchmark comprising 161 complex questions involving comparison, aggregation, and complicated graph structures. Each question is rephrased in three additional variations, resulting in 644 questions and 161 KG queries. To evaluate the benchmark, we have developed a systematic approach for utilizing ChatGPT to translate natural language questions into formal KG queries. We also apply the approach to the well-known QALD-9 dataset, demonstrating ChatGPT's potential in addressing KGQA issues for different platforms and query languages. The benchmark and the proposed approach aim to stimulate further research and development of user-friendly and efficient interfaces for querying domain-specific materials science knowledge graphs, thereby accelerating the discovery of novel materials.
https://arxiv.org/abs/2309.11361
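The translation step amounts to assembling a few-shot prompt for the LLM. The schema hint, example pairs, and prompt wording below are hypothetical placeholders; the paper's actual prompting strategy and the MOF-KG schema will differ.

```python
def nl_to_sparql_prompt(question, schema_hint, examples):
    """Assemble a few-shot prompt asking an LLM to translate a natural
    language question into a SPARQL query (sketch only)."""
    shots = "\n\n".join(f"Question: {q}\nSPARQL: {s}" for q, s in examples)
    return (
        f"You translate questions about a knowledge graph into SPARQL.\n"
        f"Schema: {schema_hint}\n\n{shots}\n\nQuestion: {question}\nSPARQL:"
    )

prompt = nl_to_sparql_prompt(
    "Which MOFs contain zinc and have surface area above 1000 m2/g?",
    ":MOF :hasMetal :Metal ; :surfaceArea xsd:decimal",
    [("Which MOFs contain copper?",
      "SELECT ?m WHERE { ?m :hasMetal :Copper }")],
)
print(prompt)  # the assembled prompt is then sent to the LLM
```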
Despite their competitive performance on knowledge-intensive tasks, large language models (LLMs) still have limitations in memorizing all world knowledge, especially long-tail knowledge. In this paper, we study the KG-augmented language model approach for solving the knowledge graph question answering (KGQA) task that requires rich world knowledge. Existing work has shown that retrieving KG knowledge to enhance LLM prompting can significantly improve LLM performance in KGQA. However, their approaches lack a well-formed verbalization of KG knowledge, i.e., they ignore the gap between KG representations and textual representations. To this end, we propose an answer-sensitive KG-to-Text approach that can transform KG knowledge into well-textualized statements most informative for KGQA. Based on this approach, we propose a KG-to-Text enhanced LLM framework for solving the KGQA task. Experiments on several KGQA benchmarks show that the proposed KG-to-Text augmented LLM approach outperforms previous KG-augmented LLM approaches regarding answer accuracy and usefulness of knowledge statements.
https://arxiv.org/abs/2309.11206
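To show where verbalization sits in such a pipeline, here is a template-based stand-in that turns retrieved triples into statements prepended to the prompt. The paper's verbalizer is learned and answer-sensitive; the templates and entity names below are assumptions.

```python
def verbalize_triples(triples, templates):
    """Turn retrieved KG triples into fluent statements for an LLM
    prompt (template-based stand-in for the learned verbalizer)."""
    lines = []
    for s, p, o in triples:
        template = templates.get(p, "{s} has {p} {o}.")
        lines.append(template.format(s=s, p=p, o=o))
    return " ".join(lines)

triples = [("Marie_Curie", "birthPlace", "Warsaw"),
           ("Marie_Curie", "award", "Nobel_Prize_in_Physics")]
templates = {"birthPlace": "{s} was born in {o}.",
             "award": "{s} received the {o}."}
context = verbalize_triples(triples, templates)
print(context + " Question: Where was Marie Curie born?")
```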
Inductive link prediction -- where entities during training and inference stages can be different -- has shown great potential for completing evolving knowledge graphs in an entity-independent manner. Many popular methods mainly focus on modeling graph-level features, while the edge-level interactions -- especially the semantic correlations between relations -- have been less explored. However, we notice that a desirable property of semantic correlations between relations is that they are inherently edge-level and entity-independent. This implies the great potential of the semantic correlations for the entity-independent inductive link prediction task. Inspired by this observation, we propose a novel subgraph-based method, namely TACO, to model Topology-Aware COrrelations between relations that are highly correlated to their topological structures within subgraphs. Specifically, we prove that semantic correlations between any two relations can be categorized into seven topological patterns, and then propose the Relational Correlation Network (RCN) to learn the importance of each pattern. To further exploit the potential of RCN, we propose the Complete Common Neighbor induced subgraph, which can effectively preserve complete topological patterns within the subgraph. Extensive experiments demonstrate that TACO effectively unifies the graph-level information and edge-level interactions to jointly perform reasoning, leading to a superior performance over existing state-of-the-art methods for the inductive link prediction task.
https://arxiv.org/abs/2309.11528
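The idea of classifying relation pairs by topology can be sketched by enumerating how two edges share endpoints. This simple enumeration is an assumption for illustration and does not reproduce the paper's specific set of seven proven patterns.

```python
def topo_pattern(e1, e2):
    """Classify how two directed edges share endpoints (illustrative
    enumeration; TACO proves its own set of seven patterns)."""
    (h1, t1), (h2, t2) = e1, e2
    if (h1, t1) == (h2, t2):
        return "parallel"
    if (h1, t1) == (t2, h2):
        return "inverse"
    if h1 == h2:
        return "head-head"
    if t1 == t2:
        return "tail-tail"
    if t1 == h2:
        return "tail-head (chain)"
    if h1 == t2:
        return "head-tail"
    return "disjoint"

print(topo_pattern(("a", "b"), ("b", "c")))  # tail-head (chain)
```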
Our research explores the use of natural language processing (NLP) methods to automatically classify entities for the purpose of knowledge graph population and integration with food system ontologies. We have created NLP models that can automatically classify organizations with respect to categories associated with environmental issues as well as Standard Industrial Classification (SIC) codes, which are used by the U.S. government to characterize business activities. As input, the NLP models are provided with text snippets retrieved by the Google search engine for each organization, which serves as a textual description of the organization that is used for learning. Our experimental results show that NLP models can achieve reasonably good performance for these two classification tasks, and they rely on a general framework that could be applied to many other classification problems as well. We believe that NLP models represent a promising approach for automatically harvesting information to populate knowledge graphs and aligning the information with existing ontologies through shared categories and concepts.
https://arxiv.org/abs/2309.10880
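A minimal stand-in for the described setup: search-engine snippets as the textual description of each organization, a category as the label, and a classical text classifier. The snippets, labels, and TF-IDF plus logistic regression pipeline are assumptions for the sketch; the paper's models and label sets (environmental categories, SIC codes) are richer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

snippets = [
    "nonprofit restoring wetlands and protecting watersheds",
    "manufacturer of industrial fertilizers and agrochemicals",
    "community group organizing river cleanups and tree planting",
    "chemical plant producing pesticides for large-scale farming",
]
labels = ["conservation", "agriculture-industry",
          "conservation", "agriculture-industry"]

# Snippet text in, organization category out.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(snippets, labels)
print(clf.predict(["volunteers replanting native forests"]))
```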
Cross-lingual entity alignment is the task of finding the same semantic entities across knowledge graphs in different languages. In this paper, we propose a simple and novel unsupervised method for cross-language entity alignment. We utilize a deep-learning multilingual encoder combined with a machine translator to encode knowledge graph text, which reduces the reliance on labeled data. Unlike traditional methods that only emphasize global or local alignment, our method simultaneously considers both alignment strategies. We first view the alignment task as a bipartite matching problem and then adopt the re-exchanging idea to accomplish alignment. Compared with the traditional bipartite matching algorithm that only gives one optimal solution, our algorithm generates ranked matching results, which enables many potential downstream tasks. Additionally, our method can adapt two different types of optimization (minimal and maximal) in the bipartite matching process, which provides more flexibility. Our evaluation shows Hits@1 rates of 0.966, 0.990, and 0.996 on the DBP15K dataset for the Chinese-, Japanese-, and French-to-English alignment tasks. We outperform the state-of-the-art methods in the unsupervised and semi-supervised categories. Compared with the state-of-the-art supervised method, our method outperforms it by 2.6% and 0.4% on the Ja-En and Fr-En alignment tasks, while being marginally lower by 0.2% on the Zh-En alignment task.
https://arxiv.org/abs/2309.10598
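The bipartite-matching view can be sketched with the standard Hungarian-algorithm solver over a cosine-similarity matrix of entity embeddings. This returns only the single optimal matching; the paper's re-exchanging algorithm additionally produces ranked results. The random embeddings are stand-ins for encoder outputs.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align(src_emb, tgt_emb, maximize=True):
    """One-to-one entity alignment as bipartite matching over a cosine
    similarity matrix (sketch; supports minimal or maximal objectives)."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T
    rows, cols = linear_sum_assignment(sim, maximize=maximize)
    return list(zip(rows, cols)), sim[rows, cols]

rng = np.random.default_rng(0)
tgt = rng.normal(size=(4, 16))
src = tgt[[2, 0, 3, 1]] + 0.01 * rng.normal(size=(4, 16))  # permuted copies
pairs, scores = align(src, tgt)
print(pairs)  # recovers the permutation: [(0, 2), (1, 0), (2, 3), (3, 1)]
```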
Although generative AI has been successful in many areas, its ability to model geospatial data is still underexplored. Urban flow, a typical kind of geospatial data, is critical for a wide range of urban applications. Existing studies mostly focus on predictive modeling of urban flow that predicts the future flow based on historical flow data, which may be unavailable in data-sparse areas or newly planned regions. Some other studies aim to predict origin-destination (OD) flow among regions, but they fail to model dynamic changes of urban flow over time. In this work, we study a new problem of urban flow generation that generates dynamic urban flow for regions without historical flow data. To capture the effect of multiple factors on urban flow, such as region features and urban environment, we employ a diffusion model to generate urban flow for regions under different conditions. We first construct an urban knowledge graph (UKG) to model the urban environment and relationships between regions, based on which we design a knowledge-enhanced spatio-temporal diffusion model (KSTDiff) to generate urban flow for each region. Specifically, to accurately generate urban flow for regions with different flow volumes, we design a novel diffusion process guided by a volume estimator, which is learnable and customized for each region. Moreover, we propose a knowledge-enhanced denoising network to capture the spatio-temporal dependencies of urban flow as well as the impact of urban environment in the denoising process. Extensive experiments on four real-world datasets validate the superiority of our model over state-of-the-art baselines in urban flow generation. Further in-depth studies demonstrate the utility of generated urban flow data and the ability of our model for long-term flow generation and urban flow prediction. Our code is released at: this https URL.
https://arxiv.org/abs/2309.10547
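For readers unfamiliar with diffusion models, the forward (noising) step is standard DDPM math, sketched below on a toy flow curve. KSTDiff's contributions, the volume-estimator guidance and the knowledge-enhanced denoiser, live in the reverse process and are omitted here; the schedule values are conventional defaults, not the paper's.

```python
import numpy as np

def forward_diffuse(x0, t, alpha_bar, seed=0):
    """Standard DDPM forward step q(x_t | x_0):
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = np.random.default_rng(seed).normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

T = 1000
betas = np.linspace(1e-4, 0.02, T)          # conventional linear schedule
alpha_bar = np.cumprod(1.0 - betas)
x0 = np.sin(np.linspace(0, 6.28, 48))       # a toy daily flow curve
x_noisy = forward_diffuse(x0, t=500, alpha_bar=alpha_bar)
print(x_noisy[:4])
```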
Aiming to populate generalizable engineering design knowledge, we propose a method to extract facts of the form <head entity, relationship, tail entity> from sentences found in patent documents. These facts could be combined within and across patent documents to form knowledge graphs that serve as schemes for representing as well as storing design knowledge. Existing methods in engineering design literature often utilise a set of predefined relationships to populate triples that are statistical approximations rather than facts. In our method, we train a tagger to identify both entities and relationships from a sentence. Given a pair of entities, we train another tagger to identify the specific relationship tokens. For training these taggers, we manually construct a dataset of 44,227 sentences and corresponding facts. We benchmark our method against two typically recommended approaches. We apply our method by extracting facts from sentences found in patents related to fan systems. We build a knowledge base using these facts to demonstrate how domain ontologies could be constructed and how contextualised knowledge of subsystems could be visualised. We then search the knowledge base for key issues prevailing in fan systems. We organize the responses into knowledge graphs and hold a comparative discussion against ChatGPT's opinions on the same key issues.
https://arxiv.org/abs/2307.06985
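The taggers themselves are trained models, but the decoding convention they produce can be shown directly: BIO tags over tokens are read back into entity and relationship spans. The tag scheme and the fan-system sentence below are assumptions for the sketch.

```python
def bio_spans(tokens, tags):
    """Decode BIO tags into (text, label) spans; this is only the
    read-out step, not the trained taggers described in the paper."""
    spans, buf, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if buf:
                spans.append((" ".join(buf), label))
            buf, label = [tok], tag[2:]
        elif tag.startswith("I-") and buf and tag[2:] == label:
            buf.append(tok)
        else:
            if buf:
                spans.append((" ".join(buf), label))
            buf, label = [], None
    if buf:
        spans.append((" ".join(buf), label))
    return spans

tokens = "the impeller draws air through the inlet duct".split()
tags = ["O", "B-ENT", "B-REL", "O", "O", "O", "B-ENT", "I-ENT"]
print(bio_spans(tokens, tags))
# [('impeller', 'ENT'), ('draws', 'REL'), ('inlet duct', 'ENT')]
```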
Commonsense reasoning is a critical aspect of human communication. Despite recent advances in conversational AI driven by large language models, commonsense reasoning remains a challenging task. In this work, we introduce SYNDICOM - a method for improving commonsense in dialogue response generation. SYNDICOM consists of two components. The first component is a dataset composed of commonsense dialogues created from a knowledge graph and synthesized into natural language. This dataset includes both valid and invalid responses to dialogue contexts, along with natural language feedback (NLF) for the invalid responses. The second contribution is a two-step procedure: training a model to predict natural language feedback (NLF) for invalid responses, and then training a response generation model conditioned on the predicted NLF, the invalid response, and the dialogue. SYNDICOM is scalable and does not require reinforcement learning. Empirical results on three tasks are evaluated using a broad range of metrics. SYNDICOM achieves a relative improvement of 53% over ChatGPT on ROUGE1, and human evaluators prefer SYNDICOM over ChatGPT 57% of the time. We will publicly release the code and the full dataset.
https://arxiv.org/abs/2309.10015
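The two-step procedure at inference time can be sketched with plain callables standing in for the two trained seq2seq models: one predicts natural language feedback (NLF) for a draft response, the other generates a corrected response conditioned on the dialogue, the draft, and the NLF. The marker tokens and stand-in functions are assumptions, not SYNDICOM's actual input format.

```python
def syndicom_infer(dialogue, draft, feedback_model, correction_model):
    """Two-step inference mirroring the described procedure: predict NLF
    for a possibly invalid draft, then generate a response conditioned
    on dialogue, draft, and NLF (stand-ins for trained models)."""
    nlf = feedback_model(f"{dialogue} [DRAFT] {draft}")
    return correction_model(f"{dialogue} [DRAFT] {draft} [NLF] {nlf}")

fb = lambda x: "The draft ignores that ice melts at room temperature."
fix = lambda x: "It would melt, so better keep it in the freezer."
print(syndicom_infer("A: Where should I leave the ice?", "On the table.", fb, fix))
```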
Subsampling is effective in Knowledge Graph Embedding (KGE) for reducing overfitting caused by the sparsity in Knowledge Graph (KG) datasets. However, current subsampling approaches consider only frequencies of queries that consist of entities and their relations. Thus, the existing subsampling potentially underestimates the appearance probabilities of infrequent queries even if the frequencies of their entities or relations are high. To address this problem, we propose Model-based Subsampling (MBS) and Mixed Subsampling (MIX) to estimate their appearance probabilities through predictions of KGE models. Evaluation results on the FB15k-237, WN18RR, and YAGO3-10 datasets showed that our proposed subsampling methods actually improved the KG completion performance of popular KGE models: RotatE, TransE, HAKE, ComplEx, and DistMult.
https://arxiv.org/abs/2309.09296
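The contrast between the two weighting schemes can be sketched directly: frequency-based subsampling weights each query by its inverse count, while the model-based idea swaps in appearance probabilities estimated by a trained KGE model. The power-law form with exponent alpha and the toy numbers are assumptions; the paper's estimators may differ.

```python
import numpy as np

def frequency_weights(counts, alpha=0.5):
    """Frequency-based subsampling: rarer queries get larger weights."""
    w = 1.0 / np.power(counts, alpha)
    return w / w.sum()

def model_based_weights(model_probs, alpha=0.5):
    """MBS idea (sketch): replace raw counts with appearance
    probabilities predicted by a trained KGE model, so that infrequent
    but plausible queries are no longer underestimated."""
    w = 1.0 / np.power(model_probs, alpha)
    return w / w.sum()

counts = np.array([100, 1, 1])        # observed (h, r, ?) query frequencies
probs = np.array([0.5, 0.3, 0.001])   # KGE-estimated appearance probabilities
print(frequency_weights(counts))      # treats both rare queries alike
print(model_based_weights(probs))     # separates plausible from implausible
```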
This paper presents a knowledge graph construction method for legal case documents and related laws, aiming to organize legal information efficiently and enhance various downstream tasks. Our approach consists of three main steps: data crawling, information extraction, and knowledge graph deployment. First, the data crawler collects a large corpus of legal case documents and related laws from various sources, providing a rich database for further processing. Next, the information extraction step employs natural language processing techniques to extract entities such as courts, cases, domains, and laws, as well as their relationships from the unstructured text. Finally, the knowledge graph is deployed, connecting these entities based on their extracted relationships, creating a heterogeneous graph that effectively represents legal information and caters to users such as lawyers, judges, and scholars. The established baseline model leverages unsupervised learning methods, and by incorporating the knowledge graph, it demonstrates the ability to identify relevant laws for a given legal case. This approach opens up opportunities for various applications in the legal domain, such as legal case analysis, legal recommendation, and decision support.
https://arxiv.org/abs/2309.09069
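The deployment step, connecting extracted entities into a heterogeneous graph, can be sketched with networkx. The entity names, node kinds, and relation labels below are hypothetical; extraction itself is the NLP step described in the abstract.

```python
import networkx as nx

# Extracted (head, relation, tail, node-type) records from the NLP step.
extracted = [
    ("Case_123", "heardBy", "Supreme_Court",
     {"Case_123": "Case", "Supreme_Court": "Court"}),
    ("Case_123", "cites", "Law_45/2019",
     {"Case_123": "Case", "Law_45/2019": "Law"}),
]

g = nx.MultiDiGraph()  # heterogeneous graph: typed nodes, typed edges
for head, rel, tail, types in extracted:
    g.add_node(head, kind=types[head])
    g.add_node(tail, kind=types[tail])
    g.add_edge(head, tail, relation=rel)

# e.g. find laws relevant to a case by following 'cites' edges
laws = [t for _, t, d in g.out_edges("Case_123", data=True)
        if d["relation"] == "cites"]
print(laws)  # ['Law_45/2019']
```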
In this paper, the problem of semantic information extraction for resource-constrained text data transmission is studied. In the considered model, a sequence of text data needs to be transmitted within a communication-resource-constrained network, which only allows limited data transmission. Thus, at the transmitter, semantic information is extracted from the original text data with natural language processing techniques. The extracted semantic information is then captured in a knowledge graph. An additional probability dimension is introduced in this graph to capture the importance of each piece of information. This semantic information extraction problem is posed as an optimization framework whose goal is to extract the most important semantic information for transmission. To find an optimal solution to this problem, a solution based on Floyd's algorithm, coupled with an efficient sorting mechanism, is proposed. Numerical results demonstrate the effectiveness of the proposed algorithm with regard to two novel performance metrics: semantic uncertainty and semantic similarity.
https://arxiv.org/abs/2309.08879
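Floyd's (Floyd-Warshall) all-pairs shortest-path algorithm is standard and shown below. In the described framework, edge weights would be derived from the importance probabilities attached to the semantic graph, and the result would feed the sorting mechanism that decides what to transmit; that coupling is the paper's contribution and is not reproduced here.

```python
def floyd_warshall(weights):
    """Plain Floyd-Warshall: all-pairs shortest path distances over a
    weight matrix (INF marks missing edges)."""
    n = len(weights)
    dist = [row[:] for row in weights]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

INF = float("inf")
w = [[0, 1, INF],
     [1, 0, 2],
     [INF, 2, 0]]
print(floyd_warshall(w))  # dist[0][2] becomes 3 via the middle node
```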
Large language models (LLMs) acquire extensive knowledge during pre-training, known as their parametric knowledge. However, in order to remain up-to-date and align with human instructions, LLMs inevitably require external knowledge during their interactions with users. This raises a crucial question: How will LLMs respond when external knowledge interferes with their parametric knowledge? To investigate this question, we propose a framework that systematically elicits LLM parametric knowledge and introduces external knowledge. Specifically, we uncover the impacts by constructing a parametric knowledge graph to reveal the different knowledge structures of LLMs, and introduce external knowledge through distractors of varying degrees, methods, positions, and formats. Our experiments on both black-box and open-source models demonstrate that LLMs tend to produce responses that deviate from their parametric knowledge, particularly when they encounter direct conflicts or confounding changes of information within detailed contexts. We also find that while LLMs are sensitive to the veracity of external knowledge, they can still be distracted by unrelated information. These findings highlight the risk of hallucination when integrating external knowledge, even indirectly, during interactions with current LLMs. All the data and results are publicly available.
https://arxiv.org/abs/2309.08594
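The probing setup can be sketched as prompt construction: external knowledge (supportive, conflicting, or unrelated) is injected at different positions relative to the question to test how far it sways the model from its parametric knowledge. The prompt wording, positions, and example fact below are assumptions; the paper also varies degree, method, and format.

```python
def build_probe(question, external_facts, position="before"):
    """Assemble a probe prompt with external knowledge placed before or
    after the question (sketch of the distractor-injection idea)."""
    ctx = " ".join(external_facts)
    if position == "before":
        return f"{ctx}\n\nQuestion: {question}"
    return f"Question: {question}\n\nContext: {ctx}"

print(build_probe(
    "What is the boiling point of water at sea level?",
    ["Recent reports claim water boils at 90 degrees Celsius at sea level."],
))  # a conflicting distractor placed before the question
```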