Abstract
This paper introduces L-ReLF (Low-Resource Lexical Framework), a novel, reproducible methodology for creating high-quality, structured lexical datasets for underserved languages. The lack of standardized terminology, exemplified by Moroccan Darija, is a critical barrier to knowledge equity on platforms like Wikipedia, where editors are often forced to rely on inconsistent, ad hoc methods to coin new words in their language. We detail the technical pipeline developed to overcome these challenges. It systematically addresses the difficulties of working with low-resource data: identifying sources, applying Optical Character Recognition (OCR) despite its bias towards Modern Standard Arabic, and rigorously post-processing the output to correct errors and standardize the data model. The resulting structured dataset is fully compatible with Wikidata Lexemes and serves as a vital technical resource. The L-ReLF methodology is designed to generalize, offering other language communities a clear path to building foundational lexical data for downstream NLP applications such as machine translation and morphological analysis.
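The post-processing and data-model steps mentioned above could look roughly like the following minimal sketch. It is illustrative only and is not the paper's implementation: the function names, the specific cleanup rules (Unicode NFKC normalization, removal of Arabic diacritics and tatweel), and the placeholder item IDs are all assumptions. The record layout follows the general shape of a Wikidata Lexeme (lemmas keyed by language code, plus a language item and a lexical category item).

```python
import re
import unicodedata

# Assumed cleanup rule: strip Arabic diacritics (U+064B..U+0652) and tatweel (U+0640).
# Whether the paper's pipeline removes these marks is not stated in the abstract.
_MARKS = re.compile(r"[\u064B-\u0652\u0640]")


def normalize_ocr_token(text: str) -> str:
    """Clean one OCR'd term (hypothetical helper)."""
    # NFKC folds Arabic presentation forms, a common OCR artifact, into base letters.
    text = unicodedata.normalize("NFKC", text)
    text = _MARKS.sub("", text)
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())


def to_lexeme_record(lemma: str, lang_code: str,
                     language_item: str, category_item: str) -> dict:
    """Build a dict in the general shape of a Wikidata Lexeme (hypothetical helper)."""
    return {
        "type": "lexeme",
        "lemmas": {lang_code: {"language": lang_code, "value": lemma}},
        "language": language_item,         # Wikidata item ID of the language
        "lexicalCategory": category_item,  # Wikidata item ID of the category, e.g. noun
    }


# "ary" is the ISO 639-3 code for Moroccan Arabic; the two item IDs below are
# placeholders to be replaced with real Wikidata item IDs.
raw = "  \u0643\u0650\u062A\u064E\u0627\u0628  "  # OCR output with diacritics and padding
record = to_lexeme_record(normalize_ocr_token(raw), "ary",
                          "Q_LANGUAGE_PLACEHOLDER", "Q_CATEGORY_PLACEHOLDER")
```

Keeping normalization separate from record construction means the cleanup rules can be swapped for another language or script without changing the output data model.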
URL
https://arxiv.org/abs/2603.29346