Abstract
In cross-lingual named entity recognition (NER), self-training is commonly used to bridge the linguistic gap by training on pseudo-labeled target-language data. However, due to sub-optimal performance on target languages, the pseudo labels are often noisy and limit the overall performance. In this work, we aim to improve self-training for cross-lingual NER by combining representation learning and pseudo label refinement in one coherent framework. Our proposed method, namely ContProto, mainly comprises two components: (1) contrastive self-training and (2) prototype-based pseudo-labeling. Our contrastive self-training facilitates span classification by separating clusters of different classes, and enhances cross-lingual transferability by producing closely aligned representations between the source and target language. Meanwhile, prototype-based pseudo-labeling effectively improves the accuracy of pseudo labels during training. We evaluate ContProto on multiple transfer pairs, and experimental results show our method brings substantial improvements over current state-of-the-art methods.
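The abstract only names the two components, so as a rough illustration of prototype-based pseudo-labeling in general (not the paper's exact formulation), the minimal NumPy sketch below maintains one prototype per entity class as a moving average of span representations and relabels each pseudo-labeled span by its nearest prototype. All function names, the EMA update rule, and the cosine-similarity relabeling are our own assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim, num_spans = 4, 8, 16  # toy sizes, purely illustrative

# Toy stand-ins for encoder span representations; in practice these would
# come from the NER model. Vectors are kept unit-normalized so dot
# products equal cosine similarities.
prototypes = rng.normal(size=(num_classes, dim))
prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)
feats = rng.normal(size=(num_spans, dim))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
pseudo_labels = rng.integers(0, num_classes, size=num_spans)

def update_prototypes(prototypes, feats, labels, momentum=0.99):
    """Move each class prototype toward the representations currently
    assigned to that class (exponential moving average), then renormalize."""
    for f, y in zip(feats, labels):
        prototypes[y] = momentum * prototypes[y] + (1 - momentum) * f
        prototypes[y] /= np.linalg.norm(prototypes[y])
    return prototypes

def refine_pseudo_labels(prototypes, feats):
    """Relabel each span with the class of its nearest prototype."""
    sims = feats @ prototypes.T  # (num_spans, num_classes) cosine similarities
    return sims.argmax(axis=1)

prototypes = update_prototypes(prototypes, feats, pseudo_labels)
refined = refine_pseudo_labels(prototypes, feats)
print(refined)  # refined pseudo labels, one class index per span
```

With unit-normalized vectors, taking the argmax over dot products is equivalent to nearest-prototype assignment under cosine distance; the paper's actual refinement rule may differ in how prototypes are updated and how ties between noisy labels are resolved.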
URL
https://arxiv.org/abs/2305.13628