Towards the Development of Balanced Synthetic Data for Correcting Grammatical Errors in Arabic: An Approach Based on Error Tagging Model and Synthetic Data Generating Model

2025-02-07 20:28:37
Ahlam Alrehili, Areej Alhothali

Abstract

Synthetic data generation is widely recognized as a way to enhance the quality of neural grammatical error correction (GEC) systems. However, current approaches often lack diversity or are too simplistic to generate the wide range of grammatical errors made by humans, especially for low-resource languages such as Arabic. In this paper, we develop an error tagging model and a synthetic data generation model to create a large synthetic dataset in Arabic for grammatical error correction. The error tagging model uses the DeBERTav3 model to classify each correct sentence into multiple error types. The Arabic Error Type Annotation tool (ARETA) guides this multi-label classification task, in which each sentence is classified over 26 error tags. The synthetic data generation model is a back-translation-based model that uses the ARAT5 model to generate incorrect sentences: the error tags produced by the error tagging model are prepended to the correct sentence, and the model outputs a corrupted version of it. On the QALB-14 and QALB-15 test sets, the error tagging model achieved 94.42% F1, which is state-of-the-art in identifying error tags in clean sentences. Training a grammatical error correction model on our synthetic data yields a new state-of-the-art F1 score of 79.36% on the QALB-14 test set. In total, the synthetic data generation model produces 30,219,310 synthetic sentence pairs.
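The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. It only shows the data flow: thresholding per-tag probabilities into a multi-label decision, then prepending the selected tags to the clean sentence to form the input of a corruption model. The tag names, the 0.5 threshold, the angle-bracket tag format, and the function names are all assumptions made for illustration; the abstract does not specify them, and no actual DeBERTav3 or ARAT5 model is called here.

```python
from typing import Dict, List


def select_tags(probs: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Multi-label decision: keep every tag whose probability meets the threshold.

    `probs` stands in for the per-tag outputs of an error tagging model.
    The threshold value is an assumption, not taken from the paper.
    """
    return sorted(tag for tag, p in probs.items() if p >= threshold)


def build_generator_input(tags: List[str], correct_sentence: str) -> str:
    """Prepend error tags to the clean sentence.

    The "<TAG>" format is hypothetical; it only illustrates the idea of
    conditioning a sequence-to-sequence model on the desired error types.
    """
    prefix = " ".join(f"<{t}>" for t in tags)
    return f"{prefix} {correct_sentence}".strip()


# Hypothetical tag names and probabilities (the paper uses 26 tags guided by ARETA).
probs = {"ORTH": 0.91, "PUNCT": 0.77, "MORPH": 0.12}
tags = select_tags(probs)
source = build_generator_input(tags, "ذهب الولد إلى المدرسة")
print(source)
```

In a full system, the string built here would be fed to the generation model, whose output (an erroneous sentence) is paired with the original clean sentence to form one synthetic training pair.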

URL

https://arxiv.org/abs/2502.05312

PDF

https://arxiv.org/pdf/2502.05312.pdf

