
ASR advancements for indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa'ikhana

2024-04-12 10:12:38
Monica Romero, Sandra Gomez, Iván G. Torre

Abstract

Indigenous languages are a fundamental legacy in the development of human communication, embodying the unique identity and culture of local communities across the Americas. Track 1 of the Second AmericasNLP Competition at NeurIPS 2022 proposed developing automatic speech recognition (ASR) systems for five indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa'ikhana. In this paper, we propose a reliable ASR model for each target language by crawling speech corpora spanning diverse sources and applying data augmentation methods, which resulted in the winning approach in this competition. To achieve this, we systematically investigated the impact of different hyperparameters on model performance through a Bayesian search, focusing on two variants of the Wav2vec2.0 XLS-R model: 300M and 1B parameters. Moreover, we performed a global sensitivity analysis to assess the contribution of various hyperparameter configurations to the performance of our best models. Importantly, our results show that freeze fine-tuning updates and dropout rate are more vital parameters than the total number of epochs or the learning rate (lr). Additionally, we release our best models (no other ASR model has been reported until now for two of the languages, Wa'ikhana and Kotiria) together with the many experiments performed, to pave the way for other researchers to continue improving ASR in minority languages. This insight opens up interesting avenues for future work, allowing for the advancement of ASR techniques in the preservation of minority indigenous languages and acknowledging the complexities involved in this important endeavour.

URL

https://arxiv.org/abs/2404.08368

PDF

https://arxiv.org/pdf/2404.08368.pdf

