Abstract
The rapid evolution of Natural Language Processing (NLP) has favored major languages such as English, leaving a significant gap for many other languages that have limited resources. This is especially evident in data annotation, a task whose importance should not be underestimated, but which is time-consuming and costly. Thus, any dataset for a resource-poor language is precious, particularly when it is task-specific. Here, we explore the feasibility of repurposing existing datasets for a new NLP task: we repurpose the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA), to enable extractive QA (EQA) in the style of machine reading comprehension. We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA). We also present QA evaluation results for several monolingual and cross-lingual language pairs covering English, MSA, and five Arabic dialects. Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced. We also conduct a thorough analysis and share insights from the process, which we hope will contribute to a deeper understanding of the challenges and opportunities associated with task reformulation in NLP research.
URL
https://arxiv.org/abs/2404.17342