Abstract
This paper presents Killkan, the first dataset for automatic speech recognition (ASR) in Kichwa, an indigenous language of Ecuador. Kichwa is an extremely low-resource, endangered language, and before Killkan no resources existed that would allow it to be incorporated into natural language processing applications. The dataset contains approximately 4 hours of audio with transcriptions, Spanish translations, and morphosyntactic annotation in the Universal Dependencies format. The audio data was retrieved from a publicly available radio program in Kichwa. This paper also provides corpus-linguistic analyses of the dataset, with a special focus on the agglutinative morphology of Kichwa and its frequent code-switching with Spanish. The experiments show that the dataset makes it possible to develop the first ASR system for Kichwa with reliable quality despite the dataset's small size. The dataset, the ASR model, and the code used to develop them will be publicly available. Our study thus showcases resource building and its applications for low-resource languages and their communities.
URL
https://arxiv.org/abs/2404.15501