Abstract
Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity of any continent. This includes 75 languages with at least one million speakers each. Yet little NLP research has been conducted on African languages. The availability of high-quality annotated datasets is crucial to enabling such research. In this paper, we introduce AfriSenti, a collection of 14 sentiment datasets comprising more than 110,000 tweets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yorùbá) from four language families, annotated by native speakers. The data is used in SemEval 2023 Task 12, the first Afro-centric SemEval shared task. We describe the data collection methodology, the annotation process, and the challenges encountered when curating each dataset. We conduct experiments with different sentiment classification baselines and discuss their usefulness. We hope AfriSenti enables new work on under-represented languages. The dataset is available online and can also be loaded as a Hugging Face dataset.
URL
https://arxiv.org/abs/2302.08956