
Semantically Corrected Amharic Automatic Speech Recognition

2024-04-20 12:08:00
Samuael Adnew, Paul Pu Liang


Automatic Speech Recognition (ASR) can play a crucial role in making spoken languages more accessible worldwide. In this paper, we build a set of ASR tools for Amharic, a language spoken by more than 50 million people, primarily in eastern Africa. Amharic is written in the Ge'ez script as a sequence of graphemes, with spaces marking word boundaries. This makes computational processing of Amharic challenging, since the placement of spaces can significantly change the meaning of a sentence. We find that existing benchmarks for Amharic ASR do not account for these spaces and measure only individual grapheme error rates, which significantly inflates estimates of in-the-wild performance. We first release corrected transcriptions of existing Amharic ASR test datasets, enabling the community to evaluate progress accurately. We then introduce a post-processing approach that uses a transformer encoder-decoder architecture to organize raw ASR outputs into grammatically complete and semantically meaningful Amharic sentences. In experiments on the corrected test dataset, our model improves the semantic correctness of Amharic speech recognition systems, achieving a Character Error Rate (CER) of 5.5% and a Word Error Rate (WER) of 23.3%.
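To make the point about spacing concrete, the sketch below computes CER and WER with the standard Levenshtein (edit-distance) definition. The abstract does not give the paper's exact metric definitions, so this is an assumption, not the authors' evaluation code. The function names (edit_distance, cer, wer), the count_spaces flag, and the Latin placeholder strings are all hypothetical and stand in for Amharic transcriptions.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str, count_spaces: bool = True) -> float:
    """Character error rate; optionally ignore spaces (assumed variant)."""
    if not count_spaces:
        ref, hyp = ref.replace(" ", ""), hyp.replace(" ", "")
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(list(ref), list(hyp)) / len(ref)


def wer(ref: str, hyp: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref_words = ref.split()
    if not ref_words:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


# Placeholder strings: same characters, one space in the wrong place.
reference = "ab cd ef"
hypothesis = "abc d ef"

print(cer(reference, hypothesis, count_spaces=False))  # 0.0
print(cer(reference, hypothesis))                      # 0.25
print(round(wer(reference, hypothesis), 3))            # 0.667
```

In this toy case a metric that ignores spaces reports no error at all, while a single misplaced space corrupts two of the three words. That is the kind of gap between grapheme-level and word-level scores the abstract describes.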



