Abstract
The absence of accessible annotated datasets explicitly tailored for educational purposes presents a notable obstacle for NLP tasks in languages with limited resources. This study initially explores the feasibility of using machine translation (MT) to convert an existing dataset into a Tigrinya dataset in SQuAD format. As a result, we present TIGQA, an expert-annotated educational dataset consisting of 2.68K question-answer pairs covering 122 diverse topics such as climate, water, and traffic. These pairs are drawn from 537 context paragraphs in publicly accessible Tigrinya and Biology books. Through comprehensive analyses, we demonstrate that the TIGQA dataset demands skills beyond simple word matching, including both single-sentence and multiple-sentence inference abilities. We conduct experiments using state-of-the-art machine reading comprehension (MRC) methods, marking the first exploration of such models on TIGQA. Additionally, we estimate human performance on the dataset and compare it with the results obtained from pretrained models. The notable gap between human performance and the best model performance underscores the room for further improvement on TIGQA through continued research. Our dataset is freely accessible via the provided link to encourage the research community to address the challenges in Tigrinya MRC.
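For readers unfamiliar with the SQuAD format mentioned above, the following is a minimal sketch of the publicly documented SQuAD v1.1 JSON layout, together with a check that each answer span matches its context. The title, context, question, and id are hypothetical English placeholders, not entries from TIGQA, and the helper `spans_are_consistent` is an illustrative name rather than part of any released tooling.

```python
import json

# Hypothetical record: the title, context, question, and id are placeholders,
# not taken from TIGQA.
context = "Water covers most of the surface of the Earth."
answer_text = "most of the surface of the Earth"

record = {
    "version": "v1.1",
    "data": [
        {
            "title": "water",
            "paragraphs": [
                {
                    "context": context,
                    "qas": [
                        {
                            "id": "example-0001",
                            "question": "How much of the Earth does water cover?",
                            "answers": [
                                {
                                    "text": answer_text,
                                    # Character offset of the answer in the context.
                                    "answer_start": context.index(answer_text),
                                }
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}


def spans_are_consistent(dataset: dict) -> bool:
    """Return True if every answer text matches the context at answer_start."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            ctx = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    if ctx[start:end] != answer["text"]:
                        return False
    return True


# ensure_ascii=False keeps non-Latin scripts readable in the output.
print(json.dumps(record, ensure_ascii=False, indent=2))
print(spans_are_consistent(record))
```

A span check of this kind is one plausible way to catch misaligned answers after machine translation, since translated answer text may no longer appear verbatim in the translated context.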
URL
https://arxiv.org/abs/2404.17194