Abstract
In recent years there has been a surge of interest in applying distant supervision (DS) to automatically generate training data for relation extraction. However, despite extensive efforts on constructing advanced neural models, our experiments reveal that these neural models perform only comparably to (or even worse than) simple, feature-based methods. In this paper, we conduct a thorough analysis to answer the question: what other factors limit the performance of DS-trained neural models? Our results show that a shifted label distribution commonly exists in real-world DS datasets, and the impact of this issue is further validated on synthetic datasets for all models. Building on this insight, we develop a simple yet effective adaptation method for DS models, called bias adjustment, which updates a model learned on the source domain (i.e., the DS training set) with label distribution statistics estimated on the target domain (i.e., the evaluation set). Experiments demonstrate that bias adjustment achieves consistent performance gains for all methods, especially neural models, with up to a 22% relative F1 improvement.
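The core idea behind correcting a shifted label distribution can be sketched as standard prior-shift correction: re-weight a classifier's posteriors by the ratio of target-domain to source-domain class priors, then renormalize. The function and variable names below are illustrative, not the paper's implementation; this is a minimal sketch assuming the model outputs per-class probabilities and that both label distributions are available or estimable.

```python
import numpy as np

def bias_adjust(probs, p_src, p_tgt):
    """Re-weight posteriors by the target/source class-prior ratio,
    then renormalize each row to sum to 1 (prior-shift correction)."""
    adj = probs * (p_tgt / p_src)
    return adj / adj.sum(axis=1, keepdims=True)

# Hypothetical 3-class relation-extraction posteriors for two instances.
probs = np.array([[0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3]])
p_src = np.array([0.5, 0.3, 0.2])  # label distribution of the DS training set
p_tgt = np.array([0.2, 0.4, 0.4])  # label distribution estimated on the eval set

print(bias_adjust(probs, p_src, p_tgt))
```

Note that for the first instance the adjustment flips the predicted class: class 0 is over-represented in the source priors relative to the target, so its posterior is discounted and class 1 becomes the argmax.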
URL
https://arxiv.org/abs/1904.09331