Abstract
Parallel Data Curation (PDC) techniques aim to filter out noisy parallel sentences from web-mined corpora. Prior research has demonstrated that ranking sentence pairs using similarity scores computed on sentence embeddings from Pre-trained Multilingual Language Models (multiPLMs), and training NMT systems on only the top-ranked samples, produces better NMT performance than training on the full dataset. However, it has also been shown that the choice of multiPLM significantly impacts the ranking quality. This paper investigates the reasons behind this disparity across multiPLMs. Using the web-mined corpora CCMatrix and CCAligned for En$\rightarrow$Si, En$\rightarrow$Ta and Si$\rightarrow$Ta, we show that different multiPLMs (LASER3, XLM-R, and LaBSE) are biased towards certain types of sentences, which allows noisy sentences to creep into the top-ranked samples. We show that this noise can be removed to a certain extent by employing a series of heuristics. Doing so improves the performance of NMT systems trained with web-mined corpora and reduces the disparity across multiPLMs.
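As context for the ranking step described above, the following is a minimal sketch (not the paper's implementation) of scoring and ranking sentence pairs by cosine similarity of embeddings from one of the multiPLMs mentioned (LaBSE), using the sentence-transformers library; the sentence lists and the top-k cutoff are placeholders.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Placeholder parallel data: aligned source/target sentence lists from a web-mined corpus.
src_sentences = ["Hello, how are you?", "The weather is nice today."]
tgt_sentences = ["<target sentence 1>", "<target sentence 2>"]

# LaBSE is one of the multiPLMs compared in the abstract; LASER3 or XLM-R could be used analogously.
model = SentenceTransformer("sentence-transformers/LaBSE")

# Encode both sides; normalized embeddings make the dot product equal to cosine similarity.
src_emb = model.encode(src_sentences, normalize_embeddings=True)
tgt_emb = model.encode(tgt_sentences, normalize_embeddings=True)

# Similarity score for each aligned pair.
scores = np.sum(src_emb * tgt_emb, axis=1)

# Rank pairs from most to least similar and keep the top-k for NMT training (k is arbitrary here).
top_k = 1
ranked = np.argsort(-scores)
selected = [(src_sentences[i], tgt_sentences[i], float(scores[i])) for i in ranked[:top_k]]
```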
URL
https://arxiv.org/abs/2502.19074