Abstract
Synthetic data generation is widely recognized as a way to enhance the quality of neural grammatical error correction (GEC) systems. However, current approaches often lack diversity or are too simplistic to generate the wide range of grammatical errors made by humans, especially for low-resource languages such as Arabic. In this paper, we develop an error tagging model and a synthetic data generation model to create a large synthetic dataset for Arabic grammatical error correction. The error tagging model uses DeBERTav3 to assign multiple error types to each correct sentence. The Arabic Error Type Annotation tool (ARETA) guides this multi-label classification task, in which each sentence is classified against 26 error tags. The synthetic data generation model is a back-translation-based model built on ARAT5: it generates incorrect sentences from correct sentences that are prefixed with the error tags produced by the error tagging model. On the QALB-14 and QALB-15 Test sets, the error tagging model achieved 94.42% F1, which is state-of-the-art in identifying error tags in clean sentences. Training a grammatical error correction system on our synthetic data yields a new state-of-the-art F1 score of 79.36% on the QALB-14 Test set. In total, we generate 30,219,310 synthetic sentence pairs with the synthetic data generation model.
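The tagging and prefixing steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The tag names (`ORTH`, `PUNCT`, `MORPH`), the angle-bracket tag format, the 0.5 threshold, and both function names are assumptions made for illustration. The per-tag scores stand in for the output of a multi-label classifier, and the resulting string is what a sequence-to-sequence model would receive as input.

```python
from typing import Dict, List


def select_tags(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Multi-label decision: keep every tag whose score reaches the threshold.

    `scores` is a hypothetical mapping from error tag to classifier probability.
    """
    return sorted(tag for tag, score in scores.items() if score >= threshold)


def build_generator_input(tags: List[str], clean_sentence: str) -> str:
    """Prepend the selected error tags to the clean sentence.

    The "<TAG>" surface form is an assumption; the abstract only states that
    error tags are placed before the correct sentence.
    """
    prefix = " ".join(f"<{tag}>" for tag in tags)
    return f"{prefix} {clean_sentence}".strip()


# Placeholder scores and tag names, not ARETA's actual tag inventory.
example_scores = {"ORTH": 0.91, "PUNCT": 0.62, "MORPH": 0.08}
example_tags = select_tags(example_scores)
example_input = build_generator_input(example_tags, "a clean sentence")
```

Because each tag is thresholded independently, a sentence can receive several tags or none at all, which is what distinguishes this multi-label setup from single-label classification.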
URL
https://arxiv.org/abs/2502.05312