Abstract
Automatic Speech Recognition (ASR) can play a crucial role in making spoken languages more accessible worldwide. In this paper, we build a set of ASR tools for Amharic, a language spoken by more than 50 million people, primarily in eastern Africa. Amharic is written in the Ge'ez script as a sequence of graphemes, with spaces marking word boundaries. This makes computational processing of Amharic challenging, since the placement of spaces can significantly change the meaning of a sentence. We find that existing benchmarks for Amharic ASR do not account for these spaces and measure only individual grapheme error rates, which significantly inflates estimates of in-the-wild performance. We first release corrected transcriptions of existing Amharic ASR test datasets, enabling the community to evaluate progress accurately. We then introduce a post-processing approach that uses a transformer encoder-decoder architecture to organize raw ASR outputs into grammatically complete and semantically meaningful Amharic sentences. In experiments on the corrected test dataset, our model improves the semantic correctness of Amharic speech recognition systems, achieving a Character Error Rate (CER) of 5.5% and a Word Error Rate (WER) of 23.3%.
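To illustrate why a grapheme-only metric can overstate quality when spacing is ignored, the following is a minimal sketch of CER and WER computed with a standard Levenshtein edit distance. It is not the paper's evaluation code: the function names (`edit_distance`, `cer`, `wer`), the `ignore_spaces` flag, and the Latin-letter toy strings standing in for Amharic text are all assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference item
                            curr[j - 1] + 1,      # insert a hypothesis item
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref, hyp, ignore_spaces=False):
    """Character error rate; optionally drop spaces before scoring."""
    if ignore_spaces:
        ref = ref.replace(" ", "")
        hyp = hyp.replace(" ", "")
    return edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate over whitespace-separated tokens."""
    ref_words = ref.split()
    hyp_words = hyp.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# Toy example (hypothetical strings): the hypothesis has every character
# right but is missing one word boundary.
reference = "the cat sat"
hypothesis = "thecat sat"

print(cer(reference, hypothesis, ignore_spaces=True))  # spacing-blind CER
print(cer(reference, hypothesis))                      # spacing-aware CER
print(wer(reference, hypothesis))                      # WER
```

In this toy case the spacing-blind CER is zero, the spacing-aware CER is 1/11, and the WER is 2/3, because a single missing space turns two reference words into one wrong word plus one deletion. This is the kind of gap between grapheme-level and word-level scores that the abstract describes.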
URL
https://arxiv.org/abs/2404.13362