Paper Reading AI Learner

Improving the quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics

2025-02-26 11:56:43
Aloka Fernando, Surangika Ranathunga, Nisansa de Silva

Abstract

Parallel Data Curation (PDC) techniques aim to filter noisy parallel sentences out of web-mined corpora. Prior research has demonstrated that ranking sentence pairs by similarity scores on sentence embeddings derived from Pre-trained Multilingual Language Models (multiPLMs), and training NMT systems on the top-ranked samples, yields better NMT performance than training on the full dataset. However, previous research has also shown that the choice of multiPLM significantly impacts the ranking quality. This paper investigates the reasons behind this disparity across multiPLMs. Using the web-mined corpora CCMatrix and CCAligned for En$\rightarrow$Si, En$\rightarrow$Ta and Si$\rightarrow$Ta, we show that different multiPLMs (LASER3, XLM-R, and LaBSE) are biased towards certain types of sentences, which allows noisy sentences to creep into the top-ranked samples. We show that a series of heuristics can remove this noise to a certain extent. This improves the results of NMT systems trained with web-mined corpora and reduces the disparity across multiPLMs.
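The core PDC step the abstract describes — scoring sentence pairs by embedding similarity and keeping only the top-ranked ones — can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the source- and target-side embeddings have already been produced by some multiPLM (e.g., LaBSE), and uses random vectors as stand-ins; `rank_pairs` and `top_frac` are hypothetical names.

```python
import numpy as np

def rank_pairs(src_emb: np.ndarray, tgt_emb: np.ndarray, top_frac: float = 0.5):
    """Score each sentence pair by the cosine similarity of its source/target
    embeddings and return the indices of the top fraction (hypothetical
    PDC-style filter), plus the per-pair scores."""
    # L2-normalize rows so the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    scores = np.sum(src * tgt, axis=1)           # cosine similarity per pair
    order = np.argsort(-scores)                  # highest similarity first
    keep = order[: max(1, int(len(order) * top_frac))]
    return keep, scores

# Toy data: 4 pairs of 8-dim vectors standing in for multiPLM embeddings.
rng = np.random.default_rng(0)
src = rng.normal(size=(4, 8))
tgt = src + rng.normal(scale=0.1, size=(4, 8))   # pairs 0-2: near-parallel
tgt[3] = rng.normal(size=8)                      # pair 3: a noisy mismatch
keep, scores = rank_pairs(src, tgt, top_frac=0.75)
```

The paper's point is that this ranking alone is not enough: a biased multiPLM can assign high scores to certain kinds of noisy pairs, so the top-ranked sample still needs debiasing heuristics applied on top.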

URL

https://arxiv.org/abs/2502.19074

PDF

https://arxiv.org/pdf/2502.19074.pdf

