Abstract
The metadata about scientific experiments published in online repositories have been shown to suffer from a high degree of representational heterogeneity: there are often many ways to represent the same type of information, such as a geographical location via its latitude and longitude. To harness the potential that metadata have for discovering scientific data, it is crucial that they be represented in a uniform way that can be queried effectively. One step toward uniformly represented metadata is to normalize the multiple, distinct field names used in metadata (e.g., lat lon, lat and long) to describe the same type of value. To that end, we present a new method based on clustering and embeddings (i.e., vector representations of words) to align metadata field names with ontology terms. We apply our method to biomedical metadata by generating embeddings for terms in biomedical ontologies from the BioPortal repository. We carried out a comparative study between our method and the NCBO Annotator, which revealed that our method yields more and substantially better alignments between metadata and ontology terms.
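The general idea of aligning field names with ontology terms through vector similarity can be illustrated with a minimal sketch. This is not the method presented in the paper: it replaces learned word embeddings with a toy bag of character trigrams, omits the clustering step entirely, and uses hypothetical names throughout (`normalize`, `embed`, `cosine`, `align`, and the `threshold` value are all assumptions made for illustration).

```python
import math
from collections import Counter


def normalize(name):
    # Lowercase and treat underscores and hyphens as word separators.
    return " ".join(name.lower().replace("_", " ").replace("-", " ").split())


def embed(text, n=3):
    # Toy stand-in for a learned embedding: a sparse vector of
    # character n-gram counts over the normalized, space-padded text.
    padded = f" {normalize(text)} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def align(field_names, ontology_terms, threshold=0.2):
    # Map each field name to its most similar ontology term,
    # or to None when no term reaches the (assumed) threshold.
    term_vectors = {term: embed(term) for term in ontology_terms}
    alignments = {}
    for field in field_names:
        vector = embed(field)
        best_term, best_score = max(
            ((term, cosine(vector, tv)) for term, tv in term_vectors.items()),
            key=lambda pair: pair[1],
        )
        alignments[field] = best_term if best_score >= threshold else None
    return alignments


result = align(
    ["Latitude", "geo_longitude", "xyz"],
    ["latitude", "longitude", "age"],
)
print(result)
```

In this sketch, a field name that shares no trigrams with any term is left unaligned rather than forced onto the nearest term. A realistic pipeline would presumably swap `embed` for vectors trained on ontology text, which is where semantically related but lexically different names could be matched.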
URL
https://arxiv.org/abs/1903.08206