Interactive Ontology Matching with Cost-Efficient Learning

Abstract

The creation of high-quality ontologies is crucial for data integration and knowledge-based reasoning, particularly in the context of the rising data economy. However, automatic ontology matchers are often bound to the heuristics they are based on, leaving many matches unidentified. Interactive ontology matching systems involving human experts have been introduced, but they do not solve the fundamental issue of flexibly finding additional matches outside the scope of the implemented heuristics, even though this capability is in high demand in industrial settings. Active machine learning methods appear to be a promising path towards a flexible interactive ontology matcher. However, off-the-shelf active learning mechanisms suffer from low query efficiency due to extreme class imbalance, resulting in a last-mile problem where high human effort is required to identify the remaining matches. To address the last-mile problem, this work introduces DualLoop, an active learning method tailored to ontology matching. DualLoop offers three main contributions: (1) an ensemble of tunable heuristic matchers, (2) a short-term learner with a novel query strategy adapted to highly imbalanced data, and (3) long-term learners that explore potential matches by creating and tuning new heuristics. We evaluated DualLoop on three datasets of varying sizes and domains. Compared to existing active learning methods, we consistently achieved better F1 scores and recall, reducing the expected query cost of finding 90% of all matches by over 50%. Compared to traditional interactive ontology matchers, we are able to find additional, last-mile matches. Finally, we detail the successful deployment of our approach within an actual product and report its operational performance results within the Architecture, Engineering, and Construction (AEC) industry sector, showcasing its practical value and efficiency.
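The abstract does not specify DualLoop's query strategy or its heuristics, so the following is only a generic illustration of the setting it describes: an ensemble of weighted heuristic matchers inside a human-in-the-loop query cycle. Every name here (`edit_similarity`, `token_jaccard`, `interactive_match`, the 0.5 vote threshold, the 1.1/0.9 reweighting factors) is a hypothetical choice made for this sketch, and the "query the highest-scoring pair first" rule is a simple stand-in for handling class imbalance, not the paper's method.

```python
from difflib import SequenceMatcher


def edit_similarity(a, b):
    """Character-level similarity in [0, 1] (hypothetical heuristic)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a, b):
    """Jaccard overlap of lowercase word tokens (hypothetical heuristic)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


HEURISTICS = [edit_similarity, token_jaccard]


def ensemble_score(pair, weights):
    """Weighted average of all heuristic scores for one candidate pair."""
    a, b = pair
    return sum(w * h(a, b) for w, h in zip(weights, HEURISTICS)) / sum(weights)


def interactive_match(candidates, oracle, budget, weights=(1.0, 1.0)):
    """Ask the expert (oracle) about at most `budget` pairs.

    Because true matches are rare, each round queries the pair the
    ensemble currently ranks highest instead of the most uncertain one.
    This is an assumption of the sketch, not DualLoop's strategy.
    """
    weights = list(weights)
    unlabeled = set(candidates)
    matches = []
    for _ in range(budget):
        if not unlabeled:
            break
        # Sort first so ties are broken deterministically.
        pair = max(sorted(unlabeled), key=lambda p: ensemble_score(p, weights))
        unlabeled.remove(pair)
        is_match = oracle(pair)
        if is_match:
            matches.append(pair)
        # Tune the ensemble: reward heuristics that agreed with the expert.
        for i, h in enumerate(HEURISTICS):
            voted_match = h(*pair) >= 0.5
            weights[i] *= 1.1 if voted_match == is_match else 0.9
    return matches, weights
```

In use, `candidates` would be an iterable of label pairs drawn from the two ontologies, and `oracle` a callable that returns `True` when the expert confirms a pair. The returned weights show which heuristics the expert's answers have favored so far.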

URL

https://arxiv.org/abs/2404.07663

PDF

https://arxiv.org/pdf/2404.07663.pdf
