Can a Multichoice Dataset be Repurposed for Extractive Question Answering?

Abstract

The rapid evolution of Natural Language Processing (NLP) has favored major languages such as English, leaving a significant gap for many others due to limited resources. This is especially evident in the context of data annotation, a task whose importance should not be underestimated, but which is time-consuming and costly. Thus, any dataset for resource-poor languages is precious, in particular when it is task-specific. Here, we explore the feasibility of repurposing existing datasets for a new NLP task: we repurpose the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA), to enable extractive QA (EQA) in the style of machine reading comprehension. We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA). We also present QA evaluation results for several monolingual and cross-lingual QA pairs including English, MSA, and five Arabic dialects. Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced. We also conduct a thorough analysis and share our insights from the process, which we hope will contribute to a deeper understanding of the challenges and the opportunities associated with task reformulation in NLP research.

URL

https://arxiv.org/abs/2404.17342

PDF

https://arxiv.org/pdf/2404.17342.pdf
