Abstract
Much of human knowledge in cybersecurity is encapsulated within the ever-growing volume of scientific papers. As this textual data continues to expand, document organization methods become increasingly crucial for extracting actionable insights hidden within large text datasets. Knowledge Graphs (KGs) store factual information in a structured manner, providing explicit, interpretable knowledge that includes domain-specific information from the cybersecurity scientific literature. One of the challenges in constructing a KG from scientific literature is extracting an ontology from unstructured text. In this paper, we address this challenge and introduce a method for building a multi-modal KG by extracting a structured ontology from scientific papers. We demonstrate this concept in the cybersecurity domain. One modality of the KG represents observable information from the papers, such as their authors or the categories in which they were published. The second modality captures latent (hidden) patterns in the text, such as named entities, topics or clusters, and keywords, extracted through hierarchical and semantic non-negative matrix factorization (NMF). We illustrate this concept by consolidating more than two million scientific papers uploaded to arXiv into the cyber domain using hierarchical and semantic NMF, and by building a cyber-domain-specific KG.
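The two modalities described above can be illustrated with a minimal sketch. It is not the paper's pipeline: it swaps in plain, flat NMF with multiplicative updates in place of hierarchical and semantic NMF, and the toy records, the relation names (`IN_CATEGORY`, `WRITTEN_BY`, `HAS_TOPIC`, `HAS_KEYWORD`), and all function names are assumptions made for illustration only.

```python
import random

# Hypothetical toy records; ids, authors, categories and text are invented.
papers = [
    {"id": "p1", "authors": ["A. Author"], "category": "cs.CR",
     "text": "malware detection malware classifier"},
    {"id": "p2", "authors": ["B. Author"], "category": "cs.CR",
     "text": "malware classifier detection features"},
    {"id": "p3", "authors": ["A. Author"], "category": "cs.NI",
     "text": "network intrusion traffic anomaly"},
    {"id": "p4", "authors": ["C. Author"], "category": "cs.NI",
     "text": "network traffic anomaly intrusion"},
]

def term_document_matrix(docs):
    """Return the vocabulary and a words-by-documents count matrix."""
    vocab = sorted({w for d in docs for w in d["text"].split()})
    X = [[d["text"].split().count(w) for d in docs] for w in vocab]
    return vocab, X

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf(X, k, iters=200, seed=0, eps=1e-9):
    """Plain NMF, X ~ W H, using multiplicative updates (not hierarchical)."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T X) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (X H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return W, H

def build_triples(docs, k=2, top_n=3):
    triples = []
    # Modality 1: observable metadata taken directly from each record.
    for d in docs:
        triples.append((d["id"], "IN_CATEGORY", d["category"]))
        triples += [(d["id"], "WRITTEN_BY", a) for a in d["authors"]]
    # Modality 2: latent topics and keywords recovered by NMF.
    vocab, X = term_document_matrix(docs)
    W, H = nmf(X, k)
    for j, d in enumerate(docs):
        topic = max(range(k), key=lambda t: H[t][j])  # dominant topic of doc j
        triples.append((d["id"], "HAS_TOPIC", f"topic_{topic}"))
    for t in range(k):
        ranked = sorted(range(len(vocab)), key=lambda i: W[i][t], reverse=True)
        triples += [(f"topic_{t}", "HAS_KEYWORD", vocab[i]) for i in ranked[:top_n]]
    return triples, W, H

triples, W, H = build_triples(papers)
for t in triples:
    print(t)
```

Each triple is a (subject, relation, object) edge that could be loaded into a graph store. In this sketch the columns of `W` weight words per topic and the rows of `H` weight topics per document, so keywords come from `W` and topic assignments from `H`.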
URL
https://arxiv.org/abs/2403.16222