Paper Reading AI Learner

Ensemble Transfer Learning for Multilingual Coreference Resolution

2023-01-22 18:22:55
Tuan Manh Lai, Heng Ji

Abstract

Entity coreference resolution is an important research problem with many applications, including information extraction and question answering. Coreference resolution for English has been studied extensively, but there is relatively little work on other languages. A problem that frequently arises when working with a non-English language is the scarcity of annotated training data. To overcome this challenge, we design a simple but effective ensemble-based framework that combines various transfer learning (TL) techniques. We first train several models using different TL methods. Then, during inference, we compute the unweighted average of the models' prediction scores to extract the final set of predicted clusters. Furthermore, we propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts. Leveraging the observation that coreferential links naturally exist between anchor texts pointing to the same article, our method builds a sizeable distantly supervised dataset for the target language consisting of tens of thousands of documents. We can pre-train a model on this pseudo-labeled dataset before fine-tuning it on the final target dataset. Experimental results on two benchmark datasets, OntoNotes and SemEval, confirm the effectiveness of our methods. Our best ensembles consistently outperform the baseline approach of simple training by up to 7.68% in F1 score. These ensembles also achieve new state-of-the-art results for three languages: Arabic, Dutch, and Spanish.
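The two core ideas in the abstract — unweighted averaging of the models' coreference scores, and mining pseudo-labels from Wikipedia anchor texts that point to the same article — can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the pairwise-score representation, function names, and the singleton-filtering step are assumptions for the sketch.

```python
from collections import defaultdict

def ensemble_scores(per_model_scores):
    """Unweighted average of pairwise coreference scores from several models.

    per_model_scores: list of dicts mapping a (mention_i, mention_j) pair
    to that model's coreference score. Every model is assumed to score
    the same set of mention pairs (an assumption of this sketch).
    """
    n = len(per_model_scores)
    avg = defaultdict(float)
    for scores in per_model_scores:
        for pair, score in scores.items():
            avg[pair] += score / n
    return dict(avg)

def pseudo_clusters(anchors):
    """Distant supervision from Wikipedia: anchor texts in one document
    that link to the same article are treated as one coreference cluster.

    anchors: list of (mention_span, target_article) pairs.
    """
    by_article = defaultdict(list)
    for span, article in anchors:
        by_article[article].append(span)
    # A singleton anchor carries no coreferential link, so drop it.
    return [spans for spans in by_article.values() if len(spans) > 1]
```

Final clusters would then be decoded from the averaged scores (for example, by linking each mention to its highest-scoring antecedent), and the pseudo-labeled documents would be used for pre-training before fine-tuning on the target dataset.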

URL

https://arxiv.org/abs/2301.09175

PDF

https://arxiv.org/pdf/2301.09175.pdf
