
L-ReLF: A Framework for Lexical Dataset Creation

2026-03-31 07:19:00
Anass Sedrati, Mounir Afifi, Reda Benkhadra

Abstract

This paper introduces the L-ReLF (Low-Resource Lexical Framework), a novel, reproducible methodology for creating high-quality, structured lexical datasets for underserved languages. The lack of standardized terminology, exemplified by Moroccan Darija, poses a critical barrier to knowledge equity in platforms like Wikipedia, often forcing editors to rely on inconsistent, ad-hoc methods to create new words in their language. Our research details the technical pipeline developed to overcome these challenges. We systematically address the difficulties of working with low-resource data, including source identification, utilizing Optical Character Recognition (OCR) despite its bias towards Modern Standard Arabic, and rigorous post-processing to correct errors and standardize the data model. The resulting structured dataset is fully compatible with Wikidata Lexemes, serving as a vital technical resource. The L-ReLF methodology is designed for generalizability, offering other language communities a clear path to build foundational lexical data for downstream NLP applications, such as Machine Translation and morphological analysis.
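The post-processing and data-model steps mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (normalize_ocr_token, build_lexeme), the choice to strip diacritics and tatweel, and the exact record layout are assumptions made for illustration. The record only loosely mirrors the shape of a Wikidata Lexeme (language, lexical category, lemmas, senses with glosses), and the item IDs are supplied by the caller rather than assumed here.

```python
import re
import unicodedata

# Arabic short-vowel marks and related diacritics (U+064B to U+0652).
_DIACRITICS = re.compile(r"[\u064B-\u0652]")
# Tatweel (kashida), an elongation character that OCR often emits.
_TATWEEL = "\u0640"


def normalize_ocr_token(text: str) -> str:
    """Hypothetical cleanup of one OCR token (assumed policy, not the paper's)."""
    # NFKC folds Arabic presentation forms back to their base letters.
    text = unicodedata.normalize("NFKC", text)
    text = text.replace(_TATWEEL, "")
    text = _DIACRITICS.sub("", text)
    # Collapse runs of whitespace left behind by OCR.
    return " ".join(text.split())


def build_lexeme(lemma, lang_code, language_item, category_item,
                 gloss, gloss_lang="en"):
    """Build a dict loosely shaped like a Wikidata Lexeme record.

    language_item and category_item are Wikidata item IDs chosen by the
    caller; this sketch does not validate them.
    """
    return {
        "type": "lexeme",
        "language": language_item,
        "lexicalCategory": category_item,
        "lemmas": {
            lang_code: {"language": lang_code,
                        "value": normalize_ocr_token(lemma)},
        },
        "senses": [
            {"glosses": {gloss_lang: {"language": gloss_lang,
                                      "value": gloss}}},
        ],
    }
```

A caller would pass the OCR output for one headword together with the language code and item IDs appropriate to the target language; the returned dict would still need review and validation before any upload.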


URL

https://arxiv.org/abs/2603.29346

PDF

https://arxiv.org/pdf/2603.29346.pdf

