Abstract
Generalized Entity Matching (GEM), which judges whether two records represented in different formats refer to the same real-world entity, is an essential task in data management. The prompt tuning paradigm for pre-trained language models (PLMs), including the recent PromptEM model, effectively addresses low-resource GEM in practical applications, offering a robust solution when labeled data is scarce. However, existing prompt tuning models for GEM still face two challenges: prompt design and the information gap. This paper introduces an augmented prompt tuning framework that addresses both challenges through two main improvements: an augmented contextualized soft token-based prompt tuning method that extracts a guiding soft token to benefit the PLM's prompt tuning, and a cost-effective information augmentation strategy leveraging large language models (LLMs). Our approach performs well on low-resource GEM tasks. Extensive experiments show that our basic model, without information augmentation, outperforms existing methods based on moderate-size PLMs (by 5.24% on average), and that our model with information augmentation achieves performance comparable to fine-tuned LLMs at less than 14% of the API cost.
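As background for the soft token idea mentioned above, the following is a minimal, generic sketch of soft-prompt tuning in PyTorch: trainable "soft token" embeddings are prepended to the frozen input embeddings of a PLM, so that only the soft tokens are updated during tuning. All names and dimensions here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class SoftPromptWrapper(nn.Module):
    """Prepend trainable soft-token embeddings to frozen PLM input embeddings.

    Generic illustration of soft-prompt tuning; not PromptEM's actual code.
    """

    def __init__(self, embed: nn.Embedding, n_soft_tokens: int = 4):
        super().__init__()
        self.embed = embed
        for p in self.embed.parameters():
            p.requires_grad = False  # the PLM's embedding table stays frozen
        dim = embed.embedding_dim
        # Only these soft-token vectors receive gradients during prompt tuning.
        self.soft = nn.Parameter(torch.randn(n_soft_tokens, dim) * 0.02)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        tok = self.embed(input_ids)                       # (B, L, D)
        soft = self.soft.unsqueeze(0).expand(tok.size(0), -1, -1)
        return torch.cat([soft, tok], dim=1)              # (B, n_soft + L, D)

# Usage with a toy vocabulary (real setups would wrap a PLM's embedding layer).
vocab = nn.Embedding(100, 16)
wrapper = SoftPromptWrapper(vocab, n_soft_tokens=4)
ids = torch.randint(0, 100, (2, 10))
out = wrapper(ids)
print(out.shape)
```

The downstream transformer then consumes the concatenated sequence as usual; at training time only `wrapper.soft` is passed to the optimizer.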
URL
https://arxiv.org/abs/2405.04820