Abstract
This paper explores the use of Hybrid CTC/Attention encoder-decoder models trained with Intermediate CTC (InterCTC) for Irish (Gaelic) low-resource automatic speech recognition (ASR) and dialect identification (DID). Results are compared with the current best-performing models trained for ASR (TDNN-HMM) and DID (ECAPA-TDNN). An optimal InterCTC setting is first established using a Conformer encoder. This setting is then used to train a model with an E-branchformer encoder, and the performance of the two architectures is compared. A multi-task fine-tuning approach is adopted for language model (LM) shallow fusion. The experiments yielded an improvement in DID accuracy of 10.8% relative to a baseline ECAPA-TDNN, and a WER approaching that of the TDNN-HMM model. This multi-task approach emerges as a promising strategy for Irish low-resource ASR and DID.
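The core training objective combines an attention decoder loss with a CTC loss, where the CTC branch additionally averages auxiliary losses computed from intermediate encoder layers (InterCTC). The sketch below illustrates this combined loss in PyTorch under the standard hybrid CTC/attention and InterCTC formulations; the function name, tensor shapes, and the weights `ctc_weight` and `inter_weight` are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def hybrid_interctc_loss(
    att_log_probs,        # (B, U, V) decoder output log-probabilities
    target_ids,           # (B, U) target token ids, padded with -1
    target_lens,          # (B,) target lengths
    final_ctc_log_probs,  # (T, B, V) log-probs from the CTC head on the final encoder layer
    inter_ctc_log_probs,  # list of (T, B, V) log-probs from intermediate encoder layers
    input_lens,           # (B,) encoder output lengths
    ctc_weight=0.3,       # lambda: weight of the CTC branch vs. the attention branch (assumed value)
    inter_weight=0.5,     # w: share of the CTC branch assigned to intermediate layers (assumed value)
    blank_id=0,
):
    # Attention (decoder) branch: token-level negative log-likelihood.
    att_loss = F.nll_loss(
        att_log_probs.reshape(-1, att_log_probs.size(-1)),
        target_ids.reshape(-1),
        ignore_index=-1,
    )

    # CTC targets must be non-negative; pad positions are masked by target_lens.
    ctc_targets = target_ids.clamp(min=0)

    # CTC on the final encoder layer.
    ctc_loss = F.ctc_loss(final_ctc_log_probs, ctc_targets, input_lens,
                          target_lens, blank=blank_id, zero_infinity=True)

    # InterCTC: average the same CTC loss over the tapped intermediate layers.
    if inter_ctc_log_probs:
        inter_loss = torch.stack([
            F.ctc_loss(lp, ctc_targets, input_lens, target_lens,
                       blank=blank_id, zero_infinity=True)
            for lp in inter_ctc_log_probs
        ]).mean()
        ctc_branch = (1 - inter_weight) * ctc_loss + inter_weight * inter_loss
    else:
        ctc_branch = ctc_loss

    # Overall hybrid CTC/attention objective.
    return ctc_weight * ctc_branch + (1 - ctc_weight) * att_loss
```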
URL
https://arxiv.org/abs/2405.01293