Paper Reading AI Learner

NagaNLP: Bootstrapping NLP for Low-Resource Nagamese Creole with Human-in-the-Loop Synthetic Data

2025-12-14 04:08:26
Agniva Maiti, Manya Pandey, Murari Mandal

Abstract

The vast majority of the world's languages, particularly creoles like Nagamese, remain severely under-resourced in Natural Language Processing (NLP), creating a significant barrier to their representation in digital technology. This paper introduces NagaNLP, a comprehensive open-source toolkit for Nagamese, bootstrapped through a novel methodology based on LLM-driven but human-validated synthetic data generation. We detail a multi-stage pipeline in which an expert-guided LLM (Gemini) generates a candidate corpus, which native speakers then refine and annotate. This synthetic-hybrid approach yielded a 10K-pair conversational dataset and a high-quality annotated corpus for foundational tasks. To assess the effectiveness of our methodology, we trained both discriminative and generative models. Our fine-tuned XLM-RoBERTa-base model establishes a new benchmark for Nagamese, achieving 93.81% accuracy (0.90 F1-Macro) on Part-of-Speech tagging and 0.75 F1-Macro on Named Entity Recognition, substantially outperforming strong zero-shot baselines. Furthermore, we fine-tuned a Llama-3.2-3B Instruct model, named NagaLLaMA, which demonstrates superior performance on conversational tasks, achieving a perplexity of 3.85, an order-of-magnitude improvement over its few-shot counterpart (96.76). We release the NagaNLP toolkit, including all datasets, models, and code, providing a foundational resource for a previously underserved language and a reproducible framework for reducing data scarcity in other low-resource contexts.
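For readers unfamiliar with the perplexity figures quoted above (3.85 fine-tuned vs. 96.76 few-shot): perplexity is the exponential of the average per-token negative log-likelihood, so lower values mean the model is less "surprised" by held-out text. A minimal sketch of the computation, with illustrative made-up log-probabilities (not the paper's evaluation code):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Sanity check: a model that assigns uniform probability 1/4 to every
# token has perplexity exactly 4, regardless of sequence length.
uniform_logps = [math.log(0.25)] * 10
print(round(perplexity(uniform_logps), 2))  # → 4.0
```

Intuitively, NagaLLaMA's perplexity of 3.85 means it predicts Nagamese text about as well as choosing among roughly four equally likely tokens at each step, versus nearly a hundred for the few-shot baseline.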


URL

https://arxiv.org/abs/2512.12537

PDF

https://arxiv.org/pdf/2512.12537.pdf
