This paper explores the use of Hybrid CTC/Attention encoder-decoder models trained with Intermediate CTC (InterCTC) for Irish (Gaelic) low-resource speech recognition (ASR) and dialect identification (DID). Results are compared to the current best-performing models trained for ASR (TDNN-HMM) and DID (ECAPA-TDNN). An optimal InterCTC setting is initially established using a Conformer encoder. This setting is then used to train a model with an E-branchformer encoder, and the performance of the two architectures is compared. A multi-task fine-tuning approach is adopted for language model (LM) shallow fusion. The experiments yielded an improvement in DID accuracy of 10.8% relative to a baseline ECAPA-TDNN, and WER performance approaching that of the TDNN-HMM model. This multi-task approach emerges as a promising strategy for Irish low-resource ASR and DID.
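As a rough illustration of the multi-task objective described above (not the authors' implementation), the PyTorch sketch below combines a final-layer and an intermediate-layer CTC loss with the attention-decoder cross-entropy and a dialect-ID term. The function name, loss weights, and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def hybrid_interctc_did_loss(final_ctc_logp, inter_ctc_logp, attn_logits, did_logits,
                             targets, target_lens, input_lens, did_labels,
                             ctc_weight=0.3, inter_weight=0.5, did_weight=0.3):
    """final_ctc_logp, inter_ctc_logp: (T, B, V) log-probs; attn_logits: (B, L, V)."""
    ctc = F.ctc_loss(final_ctc_logp, targets, input_lens, target_lens, blank=0, zero_infinity=True)
    inter = F.ctc_loss(inter_ctc_logp, targets, input_lens, target_lens, blank=0, zero_infinity=True)
    ctc_total = (1 - inter_weight) * ctc + inter_weight * inter   # InterCTC regularisation
    att = F.cross_entropy(attn_logits.transpose(1, 2), targets)   # attention-decoder CE
    did = F.cross_entropy(did_logits, did_labels)                 # dialect-ID CE
    return ctc_weight * ctc_total + (1 - ctc_weight) * att + did_weight * did

# toy shapes: 50 encoder frames, batch of 2, 30-token vocabulary, 10 target tokens, 4 dialects
T, B, V, L = 50, 2, 30, 10
loss = hybrid_interctc_did_loss(
    torch.randn(T, B, V).log_softmax(-1), torch.randn(T, B, V).log_softmax(-1),
    torch.randn(B, L, V), torch.randn(B, 4),
    torch.randint(1, V, (B, L)), torch.full((B,), L, dtype=torch.long),
    torch.full((B,), T, dtype=torch.long), torch.randint(0, 4, (B,)))
print(loss.item())
```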
https://arxiv.org/abs/2405.01293
Membership Inference (MI) poses a substantial privacy threat to the training data of Automatic Speech Recognition (ASR) systems, while also offering an opportunity to audit these models with regard to user data. This paper explores the effectiveness of loss-based features in combination with Gaussian and adversarial perturbations to perform MI in ASR models. To the best of our knowledge, this approach has not yet been investigated. We compare our proposed features with commonly used error-based features and find that the proposed features greatly enhance performance for sample-level MI. For speaker-level MI, these features improve results, though by a smaller margin, as error-based features already obtained a high performance for this task. Our findings emphasise the importance of considering different feature sets and levels of access to target models for effective MI in ASR systems, providing valuable insights for auditing such models.
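The sketch below is a minimal, hypothetical illustration of loss-based membership-inference features under Gaussian input perturbations as described above; `asr_loss` merely stands in for the target ASR model's per-utterance loss, and the "member"/"non-member" data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def asr_loss(audio):
    # stand-in for the target model's loss on (audio, transcript)
    return float(np.mean(audio ** 2))

def mi_features(audio, sigmas=(0.0, 0.01, 0.05)):
    # feature vector: loss on the clean input and under increasing Gaussian perturbation
    return [asr_loss(audio + rng.normal(0, s, audio.shape)) for s in sigmas]

# synthetic pools: "member" utterances get slightly lower loss than "non-members"
members = [rng.normal(0, 0.9, 16000) for _ in range(100)]
non_members = [rng.normal(0, 1.1, 16000) for _ in range(100)]
X = np.array([mi_features(a) for a in members + non_members])
y = np.array([1] * 100 + [0] * 100)

attack = LogisticRegression(max_iter=1000).fit(X, y)   # simple MI attack classifier
print("attack accuracy:", attack.score(X, y))
```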
https://arxiv.org/abs/2405.01207
Recent transformer-based ASR models have achieved word-error rates (WER) below 4%, surpassing human annotator accuracy, yet they demand extensive server resources, contributing to significant carbon footprints. The traditional server-based architecture of ASR also presents privacy concerns, alongside reliability and latency issues due to network dependencies. In contrast, on-device (edge) ASR enhances privacy, boosts performance, and promotes sustainability by effectively balancing energy use and accuracy for specific applications. This study examines the effects of quantization, memory demands, and energy consumption on the inference performance of various ASR models on the NVIDIA Jetson Orin Nano. By analyzing WER and transcription speed across models using FP32, FP16, and INT8 quantization on clean and noisy datasets, we highlight the crucial trade-offs between accuracy, speed, quantization, energy efficiency, and memory needs. We found that changing precision from FP32 to FP16 halves the energy consumption for audio transcription across different models, with minimal performance degradation. A larger model size and parameter count neither guarantees better resilience to noise nor predicts the energy consumption for a given transcription load. These findings, along with several others, offer novel insights for optimizing ASR systems within energy- and memory-limited environments, crucial for the development of efficient on-device ASR solutions. The code and input data needed to reproduce the results in this article are open-sourced and available at [this https URL].
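A rough sketch (not the paper's benchmark harness) of the precision settings compared above: a toy encoder timed in FP32, FP16 (GPU only), and dynamically quantized INT8. Energy on the Jetson would be read from its on-board power monitors (e.g., via tegrastats), which this sketch does not attempt.

```python
import time
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(80, 512), nn.ReLU(), nn.Linear(512, 512),
                        nn.ReLU(), nn.Linear(512, 5000))   # toy acoustic encoder
features = torch.randn(1, 3000, 80)                        # ~30 s of log-mel frames

def timed(model, x):
    start = time.perf_counter()
    with torch.no_grad():
        model(x)
    return time.perf_counter() - start

print("fp32 s:", timed(encoder, features))

if torch.cuda.is_available():                              # FP16 halves weight/activation storage
    print("fp16 s:", timed(encoder.half().cuda(), features.half().cuda()))

int8 = torch.ao.quantization.quantize_dynamic(encoder.float().cpu(), {nn.Linear}, dtype=torch.qint8)
print("int8 s:", timed(int8, features))
```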
https://arxiv.org/abs/2405.01004
Whisper is a multitask and multilingual speech model covering 99 languages. It yields commendable automatic speech recognition (ASR) results in a subset of its covered languages, but the model still underperforms on a non-negligible number of under-represented languages, a problem exacerbated in smaller model versions. In this work, we examine its limitations, demonstrating the presence of speaker-related (gender, age) and model-related (resourcefulness and model size) bias. Despite that, we show that only model-related biases are amplified by quantization, impacting more low-resource languages and smaller models. Searching for a better compression approach, we propose DistilWhisper, an approach that is able to bridge the performance gap in ASR for these languages while retaining the advantages of multitask and multilingual capabilities. Our approach involves two key strategies: lightweight modular ASR fine-tuning of whisper-small using language-specific experts, and knowledge distillation from whisper-large-v2. This dual approach allows us to effectively boost ASR performance while keeping the robustness inherited from the multitask and multilingual pre-training. Results demonstrate that our approach is more effective than standard fine-tuning or LoRA adapters, boosting performance in the targeted languages for both in- and out-of-domain test sets, while introducing only a negligible parameter overhead at inference.
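As a minimal sketch of the second strategy above (knowledge distillation; this is not the DistilWhisper release), the loss below mixes supervised cross-entropy on the student with a temperature-scaled KL term towards the teacher's token distribution. Shapes, weights, and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def distil_loss(student_logits, teacher_logits, labels, kd_weight=0.5, temperature=2.0):
    """student/teacher_logits: (B, L, V); labels: (B, L) token ids."""
    ce = F.cross_entropy(student_logits.transpose(1, 2), labels)       # supervised term
    kd = F.kl_div(                                                     # distillation term
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return (1 - kd_weight) * ce + kd_weight * kd

B, L, V = 2, 12, 51865   # Whisper-sized vocabulary, toy batch
loss = distil_loss(torch.randn(B, L, V), torch.randn(B, L, V), torch.randint(0, V, (B, L)))
print(loss.item())
```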
https://arxiv.org/abs/2405.00966
Speech emotion recognition (SER) has garnered increasing attention due to its wide range of applications in various fields, including human-machine interaction, virtual assistants, and mental health assistance. However, existing SER methods often overlook the information gap between the pre-training speech recognition task and the downstream SER task, resulting in sub-optimal performance. Moreover, current methods require much time for fine-tuning on each specific speech dataset, such as IEMOCAP, which limits their effectiveness in real-world scenarios with large-scale noisy data. To address these issues, we propose an active learning (AL)-based fine-tuning framework for SER, called \textsc{After}, that leverages task adaptation pre-training (TAPT) and AL methods to enhance performance and efficiency. Specifically, we first use TAPT to minimize the information gap between the pre-training speech recognition task and the downstream speech emotion recognition task. Then, AL methods are employed to iteratively select a subset of the most informative and diverse samples for fine-tuning, thereby reducing time consumption. Experiments demonstrate that our proposed method \textsc{After}, using only 20\% of samples, improves accuracy by 8.45\% and reduces time consumption by 79\%. The additional extension of \textsc{After} and ablation studies further confirm its effectiveness and applicability to various real-world scenarios. Our source code is available on Github for reproducibility. (this https URL).
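Below is a simplified sketch of the active-learning selection step described above: pick the most informative (high-entropy) and diverse samples from an unlabeled pool. The pool posteriors, embeddings, clustering heuristic, and budget are synthetic assumptions, not the \textsc{After} code.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
pool_probs = rng.dirichlet(np.ones(4), size=500)    # emotion posteriors for 500 unlabeled clips
pool_embed = rng.normal(size=(500, 64))             # utterance embeddings for the same clips
budget = 100                                        # label 20% of the pool

# informativeness: predictive entropy of the current model's posteriors
entropy = -(pool_probs * np.log(pool_probs + 1e-12)).sum(axis=1)
candidates = np.argsort(-entropy)[: budget * 2]     # keep the most uncertain candidates

# diversity: one representative per cluster of the candidate embeddings
clusters = KMeans(n_clusters=budget, n_init=10, random_state=0).fit_predict(pool_embed[candidates])
selected = [int(candidates[np.where(clusters == c)[0][0]]) for c in range(budget)]
print(len(selected), "samples selected for the next fine-tuning round")
```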
https://arxiv.org/abs/2405.00307
Whisper is a state-of-the-art automatic speech recognition (ASR) model (Radford et al., 2022). Although Swiss German dialects are allegedly not part of Whisper's training data, preliminary experiments showed that Whisper can transcribe Swiss German quite well, with the output being a speech translation into Standard German. To gain a better understanding of Whisper's performance on Swiss German, we systematically evaluate it using automatic, qualitative, and human evaluation. We test its performance on three existing test sets: SwissDial (Dogan-Schönberger et al., 2021), STT4SG-350 (Plüss et al., 2023), and Swiss Parliaments Corpus (Plüss et al., 2021). In addition, we create a new test set for this work, based on short mock clinical interviews. For automatic evaluation, we used word error rate (WER) and BLEU. In the qualitative analysis, we discuss Whisper's strengths and weaknesses and analyze some output examples. For the human evaluation, we conducted a survey with 28 participants who were asked to evaluate Whisper's performance. All of our evaluations suggest that Whisper is a viable ASR system for Swiss German, so long as the Standard German output is desired.
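A small sketch of the automatic metrics used above: WER with jiwer and BLEU with sacrebleu, computed on toy Standard German reference/hypothesis pairs standing in for Whisper output.

```python
import jiwer
import sacrebleu

references = ["ich gehe morgen zum arzt", "das wetter ist heute schön"]
hypotheses = ["ich gehe morgen zum arzt", "das wetter ist heute sehr schön"]

print("WER :", jiwer.wer(references, hypotheses))
print("BLEU:", sacrebleu.corpus_bleu(hypotheses, [references]).score)
```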
https://arxiv.org/abs/2404.19310
In recent years, Transformer networks have shown remarkable performance in speech recognition tasks. However, their deployment poses challenges due to high computational and storage resource requirements. To address this issue, a lightweight model called EfficientASR is proposed in this paper, aiming to enhance the versatility of Transformer models. EfficientASR employs two primary modules: Shared Residual Multi-Head Attention (SRMHA) and Chunk-Level Feedforward Networks (CFFN). The SRMHA module effectively reduces redundant computations in the network, while the CFFN module captures spatial knowledge and reduces the number of parameters. The effectiveness of the EfficientASR model is validated on two public datasets, namely Aishell-1 and HKUST. Experimental results demonstrate a 36% reduction in parameters compared to the baseline Transformer network, along with improvements of 0.3% and 0.2% in Character Error Rate (CER) on the Aishell-1 and HKUST datasets, respectively.
https://arxiv.org/abs/2404.19214
Similar to humans, animals make extensive use of verbal and non-verbal forms of communication, including a large range of audio signals. In this paper, we address dog vocalizations and explore the use of self-supervised speech representation models pre-trained on human speech to address dog bark classification tasks that find parallels in human-centered tasks in speech recognition. We specifically address four tasks: dog recognition, breed identification, gender classification, and context grounding. We show that using speech embedding representations significantly improves over simpler classification baselines. Further, we also find that models pre-trained on large human speech acoustics can provide additional performance boosts on several tasks.
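A rough sketch of the general recipe above (not the authors' pipeline): mean-pooled embeddings from a speech SSL model, here torchaudio's WAV2VEC2_BASE as a stand-in, feeding a linear classifier for a bark task such as breed identification. The audio is synthetic and the pretrained weights are downloaded on first use.

```python
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
ssl_model = bundle.get_model().eval()

def embed(waveform_16k):
    with torch.no_grad():
        features, _ = ssl_model.extract_features(waveform_16k)
    return features[-1].mean(dim=1)            # (1, 768) utterance-level embedding

classifier = torch.nn.Linear(768, 10)          # e.g. 10 breeds
bark = torch.randn(1, 16000)                   # 1 s of (synthetic) audio at 16 kHz
logits = classifier(embed(bark))
print(logits.shape)
```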
https://arxiv.org/abs/2404.18739
Automated Speech Recognition shows superhuman performance for adult English speech on a range of benchmarks, but disappoints when fed children's speech. This has long stood in the way of child-robot interaction. Recent evolutions in data-driven speech recognition, including the availability of Transformer architectures and unprecedented volumes of training data, might mean a breakthrough for child speech recognition and social robot applications aimed at children. We revisit a study on child speech recognition from 2017 and show that indeed performance has increased, with newcomer OpenAI Whisper doing markedly better than leading commercial cloud services. While transcription is not perfect yet, the best model recognises 60.3% of sentences correctly barring small grammatical differences, with sub-second transcription time running on a local GPU, showing potential for usable autonomous child-robot speech interactions.
https://arxiv.org/abs/2404.17394
Word error rate (WER) is a metric used to evaluate the quality of transcriptions produced by Automatic Speech Recognition (ASR) systems. In many applications, it is of interest to estimate WER given a pair of a speech utterance and a transcript. Previous work on WER estimation focused on building models that are trained with a specific ASR system in mind (referred to as ASR system-dependent). These are also domain-dependent and inflexible in real-world applications. In this paper, a hypothesis generation method for ASR System-Independent WER estimation (SIWE) is proposed. In contrast to prior work, the WER estimators are trained using data that simulates ASR system output. Hypotheses are generated using phonetically similar or linguistically more likely alternative words. In WER estimation experiments, the proposed method reaches a similar performance to ASR system-dependent WER estimators on in-domain data and achieves state-of-the-art performance on out-of-domain data. On the out-of-domain data, the SIWE model outperformed the baseline estimators in root mean square error and Pearson correlation coefficient by relative 17.58% and 18.21%, respectively, on Switchboard and CALLHOME. The performance was further improved when the WER of the training set was close to the WER of the evaluation dataset.
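An illustrative sketch (assumptions, not the paper's generator) of producing ASR-like pseudo-hypotheses from reference transcripts at a target error rate, so a WER estimator can be trained without access to any specific ASR system. The paper draws substitutions from phonetically similar or linguistically likely words; here a toy vocabulary is used instead.

```python
import random

random.seed(0)
VOCAB = ["speech", "recognition", "word", "error", "rate", "model", "data"]

def pseudo_hypothesis(reference, target_wer=0.2):
    words = reference.split()
    hyp = []
    for w in words:
        r = random.random()
        if r < target_wer / 3:
            continue                                   # deletion
        elif r < 2 * target_wer / 3:
            hyp.append(random.choice(VOCAB))           # substitution (ideally a similar word)
        elif r < target_wer:
            hyp.extend([w, random.choice(VOCAB)])      # insertion
        else:
            hyp.append(w)                              # correct word
    return " ".join(hyp)

ref = "word error rate is a metric for speech recognition"
print(pseudo_hypothesis(ref))
```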
https://arxiv.org/abs/2404.16743
This paper is concerned with automatic continuous speech recognition using trainable systems. The aim of this work is to build acoustic models for spoken Swedish. This is done employing hidden Markov models and using the SpeechDat database to train their parameters. Acoustic modeling has been worked out at a phonetic level, allowing general speech recognition applications, even though a simplified task (digits and natural number recognition) has been considered for model evaluation. Different kinds of phone models have been tested, including context independent models and two variations of context dependent models. Furthermore many experiments have been done with bigram language models to tune some of the system parameters. System performance over various speaker subsets with different sex, age and dialect has also been examined. Results are compared to previous similar studies showing a remarkable improvement.
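As a tiny sketch of the bigram language model component mentioned above: maximum-likelihood bigram probabilities with add-one smoothing over a toy digit-string corpus. Vocabulary and data are illustrative, not the SpeechDat material.

```python
from collections import Counter

corpus = [["<s>", "two", "five", "zero", "</s>"],
          ["<s>", "five", "five", "one", "</s>"]]
vocab = {w for sent in corpus for w in sent}
history_counts = Counter(w for sent in corpus for w in sent[:-1])
bigram_counts = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))

def p_bigram(prev, word):
    # add-one smoothed P(word | prev)
    return (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + len(vocab))

print(p_bigram("five", "five"))
```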
https://arxiv.org/abs/2404.16547
Scale has opened new frontiers in natural language processing, but at a high cost. In response, by learning to activate only a subset of parameters in training and inference, Mixture-of-Experts (MoE) architectures have been proposed as an energy-efficient path to even larger and more capable language models, and this shift towards a new generation of foundation models is gaining momentum, particularly within the field of Automatic Speech Recognition (ASR). Recent works incorporating MoE into ASR models have complex designs such as routing frames via a supplementary embedding network, improving multilingual ability for the experts, and utilizing dedicated auxiliary losses for either expert load balancing or specific language handling. We found that such delicate designs are not necessary; an embarrassingly simple substitution of MoE layers for all Feed-Forward Network (FFN) layers is sufficient for the ASR task. To be more specific, we benchmark our proposed model on a large-scale inner-source dataset (160k hours); the results show that we can scale our baseline Conformer (Dense-225M) to its MoE counterpart (MoE-1B) and achieve Dense-1B-level Word Error Rate (WER) while maintaining a Dense-225M-level Real Time Factor (RTF). Furthermore, by applying the Unified 2-pass framework with bidirectional attention decoders (U2++), we achieve streaming and non-streaming decoding modes in a single MoE-based model, which we call U2++ MoE. We hope that our study can facilitate research on scaling speech foundation models without sacrificing deployment efficiency.
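A bare-bones sketch of the substitution described above: a top-k routed Mixture-of-Experts layer standing in for a Conformer feed-forward block. Sizes, the routing scheme, the loop-based dispatch, and the absence of auxiliary losses are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=256, d_ff=1024, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                    # x: (batch, time, d_model)
        gate = F.softmax(self.router(x), dim=-1)
        weights, idx = gate.topk(self.top_k, dim=-1)         # route each frame to k experts
        weights = weights / weights.sum(-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):                          # naive dispatch, for clarity only
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)
                out = out + mask * weights[..., k:k + 1] * expert(x)
        return out

layer = MoEFeedForward()
print(layer(torch.randn(2, 50, 256)).shape)                  # torch.Size([2, 50, 256])
```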
https://arxiv.org/abs/2404.16407
Sequence modeling is a crucial area across various domains, including Natural Language Processing (NLP), speech recognition, time series forecasting, music generation, and bioinformatics. Recurrent Neural Networks (RNNs) and Long Short Term Memory Networks (LSTMs) have historically dominated sequence modeling tasks like Machine Translation, Named Entity Recognition (NER), etc. However, the advancement of transformers has led to a shift in this paradigm, given their superior performance. Yet, transformers suffer from $O(N^2)$ attention complexity and challenges in handling inductive bias. Several variations that use spectral networks or convolutions have been proposed to address these issues and have performed well on a range of tasks. However, they still have difficulty in dealing with long sequences. State Space Models (SSMs) have emerged as promising alternatives for sequence modeling paradigms in this context, especially with the advent of S4 and its variants, such as S4nd, Hippo, Hyena, Diagonal State Spaces (DSS), Gated State Spaces (GSS), Linear Recurrent Unit (LRU), Liquid-S4, Mamba, etc. In this survey, we categorize the foundational SSMs based on three paradigms namely, Gating architectures, Structural architectures, and Recurrent architectures. This survey also highlights diverse applications of SSMs across domains such as vision, video, audio, speech, language (especially long sequence modeling), medical (including genomics), chemical (like drug design), recommendation systems, and time series analysis, including tabular data. Moreover, we consolidate the performance of SSMs on benchmark datasets like Long Range Arena (LRA), WikiText, Glue, Pile, ImageNet, Kinetics-400, sstv2, as well as video datasets such as Breakfast, COIN, LVU, and various time series datasets. The project page for the Mamba-360 work is available at \url{this https URL}.
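A minimal sketch of the discrete linear state-space recurrence underlying the SSM family surveyed above, $x_k = A x_{k-1} + B u_k$, $y_k = C x_k$, with random toy matrices. Real models such as S4 or Mamba parameterize, discretize, and structure these matrices far more carefully.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, seq_len = 4, 10
A = 0.9 * np.eye(state_dim) + 0.01 * rng.normal(size=(state_dim, state_dim))
B = rng.normal(size=(state_dim, 1))
C = rng.normal(size=(1, state_dim))

u = rng.normal(size=(seq_len, 1))      # scalar input sequence
x = np.zeros((state_dim, 1))
outputs = []
for k in range(seq_len):
    x = A @ x + B * u[k]               # state update
    outputs.append(float(C @ x))       # readout
print(outputs)
```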
https://arxiv.org/abs/2404.16112
This paper presents Killkan, the first dataset for automatic speech recognition (ASR) in the Kichwa language, an indigenous language of Ecuador. Kichwa is an extremely low-resource endangered language, and there have been no resources before Killkan for Kichwa to be incorporated in applications of natural language processing. The dataset contains approximately 4 hours of audio with transcription, translation into Spanish, and morphosyntactic annotation in the format of Universal Dependencies. The audio data was retrieved from a publicly available radio program in Kichwa. This paper also provides corpus-linguistic analyses of the dataset with a special focus on the agglutinative morphology of Kichwa and frequent code-switching with Spanish. The experiments show that the dataset makes it possible to develop the first ASR system for Kichwa with reliable quality despite its small dataset size. This dataset, the ASR model, and the code used to develop them will be publicly available. Thus, our study positively showcases resource building and its applications for low-resource languages and their community.
https://arxiv.org/abs/2404.15501
It is challenging to improve automatic speech recognition (ASR) performance in noisy conditions with a single-channel speech enhancement (SE) front-end. This is generally attributed to the processing distortions caused by the nonlinear processing of single-channel SE front-ends. However, the causes of such degraded ASR performance have not been fully investigated. How to design single-channel SE front-ends in a way that significantly improves ASR performance remains an open research question. In this study, we investigate a signal-level numerical metric that can explain the cause of degradation in ASR performance. To this end, we propose a novel analysis scheme based on the orthogonal projection-based decomposition of SE errors. This scheme manually modifies the ratio of the decomposed interference, noise, and artifact errors, and it enables us to directly evaluate the impact of each error type on ASR performance. Our analysis reveals the particularly detrimental effect of artifact errors on ASR performance compared to the other types of errors. This provides us with a more principled definition of processing distortions that cause the ASR performance degradation. Then, we study two practical approaches for reducing the impact of artifact errors. First, we prove that the simple observation adding (OA) post-processing (i.e., interpolating the enhanced and observed signals) can monotonically improve the signal-to-artifact ratio. Second, we propose a novel training objective, called artifact-boosted signal-to-distortion ratio (AB-SDR), which forces the model to estimate the enhanced signals with fewer artifact errors. Through experiments, we confirm that both the OA and AB-SDR approaches are effective in decreasing artifact errors caused by single-channel SE front-ends, allowing them to significantly improve ASR performance.
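The snippet below is a simplified numerical sketch of two ideas above: decomposing an enhanced signal into components explained by the target and noise subspaces (with the unexplained residual treated as the artifact error), and the observation-adding (OA) interpolation. It uses whole-signal least-squares projections on synthetic data; the paper's orthogonal-projection-based decomposition (with a separate interference term) is more careful.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16000
target = rng.normal(size=n)
noise = 0.5 * rng.normal(size=n)
observed = target + noise
enhanced = 0.8 * target + 0.1 * noise + 0.05 * rng.normal(size=n)   # toy SE output

def project(x, basis):
    """Least-squares projection of x onto the span of the rows of `basis`."""
    coeffs, *_ = np.linalg.lstsq(basis.T, x, rcond=None)
    return basis.T @ coeffs

e_target = project(enhanced, target[None, :])
e_noise = project(enhanced, np.stack([target, noise])) - e_target   # noise-error component
artifact = enhanced - e_target - e_noise                            # what neither subspace explains
sar = 10 * np.log10(np.sum((e_target + e_noise) ** 2) / np.sum(artifact ** 2))
print("signal-to-artifact ratio (dB):", round(float(sar), 2))

# observation adding (OA) post-processing: interpolate enhanced and observed signals
alpha = 0.7
oa_output = alpha * enhanced + (1 - alpha) * observed
```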
https://arxiv.org/abs/2404.14860
Understanding cognitive processes in the brain demands sophisticated models capable of replicating neural dynamics at large scales. We present a physiologically inspired speech recognition architecture, compatible and scalable with deep learning frameworks, and demonstrate that end-to-end gradient descent training leads to the emergence of neural oscillations in the central spiking neural network. Significant cross-frequency couplings, indicative of these oscillations, are measured within and across network layers during speech processing, whereas no such interactions are observed when handling background noise inputs. Furthermore, our findings highlight the crucial inhibitory role of feedback mechanisms, such as spike frequency adaptation and recurrent connections, in regulating and synchronising neural activity to improve recognition performance. Overall, on top of developing our understanding of synchronisation phenomena notably observed in the human auditory pathway, our architecture exhibits dynamic and efficient information processing, with relevance to neuromorphic technology.
https://arxiv.org/abs/2404.14024
Automatic Speech Recognition (ASR) can play a crucial role in enhancing the accessibility of spoken languages worldwide. In this paper, we build a set of ASR tools for Amharic, a language spoken by more than 50 million people primarily in eastern Africa. Amharic is written in the Ge'ez script, a sequence of graphemes with spacings denoting word boundaries. This makes computational processing of Amharic challenging since the location of spacings can significantly impact the meaning of formed sentences. We find that existing benchmarks for Amharic ASR do not account for these spacings and only measure individual grapheme error rates, leading to significantly inflated measurements of in-the-wild performance. In this paper, we first release corrected transcriptions of existing Amharic ASR test datasets, enabling the community to accurately evaluate progress. Furthermore, we introduce a post-processing approach using a transformer encoder-decoder architecture to organize raw ASR outputs into a grammatically complete and semantically meaningful Amharic sentence. Through experiments on the corrected test dataset, our model enhances the semantic correctness of Amharic speech recognition systems, achieving a Character Error Rate (CER) of 5.5\% and a Word Error Rate (WER) of 23.3\%.
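A short sketch of why word spacing matters for the evaluation discussed above: the same Ge'ez-script hypothesis can have a zero character error rate yet a high word error rate when a space is misplaced. The strings are toy examples, not from the dataset.

```python
def edit_distance(a, b):
    # Levenshtein distance over characters or word lists
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

ref = "ሰላም ለዓለም ነው"                 # reference with correct word boundaries
hyp = "ሰላም ለዓለምነው"                  # same graphemes, one space dropped

cer = edit_distance(ref.replace(" ", ""), hyp.replace(" ", "")) / len(ref.replace(" ", ""))
wer = edit_distance(ref.split(), hyp.split()) / len(ref.split())
print(f"CER = {cer:.2f}, WER = {wer:.2f}")    # CER stays 0.00 while WER is 0.67
```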
https://arxiv.org/abs/2404.13362
Speech-driven facial animation methods usually fall into two main classes, 3D and 2D talking face, both of which have attracted considerable research attention in recent years. However, to the best of our knowledge, research on 3D talking faces has not gone as deep as that on 2D talking faces in terms of lip-synchronization (lip-sync) and speech perception. To bridge the gap between the two sub-fields, we propose a learning framework named Learn2Talk, which constructs a better 3D talking face network by exploiting two points of expertise from the field of 2D talking face. Firstly, inspired by the audio-video sync network, a 3D sync-lip expert model is devised for the pursuit of lip-sync between audio and 3D facial motion. Secondly, a teacher model selected from 2D talking face methods is used to guide the training of the audio-to-3D-motion regression network to yield more accurate 3D vertices. Extensive experiments show the advantages of the proposed framework in terms of lip-sync, vertex accuracy and speech perception, compared with the state of the art. Finally, we show two applications of the proposed framework: audio-visual speech recognition and speech-driven 3D Gaussian Splatting based avatar animation.
https://arxiv.org/abs/2404.12888
Self-supervised learned (SSL) models such as Wav2vec and HuBERT yield state-of-the-art results on speech-related tasks. Given the effectiveness of such models, it is advantageous to use them in conventional ASR systems. While some approaches suggest incorporating these models as a trainable encoder or a learnable frontend, training such systems is extremely slow and requires a lot of computation cycles. In this work, we propose two simple approaches that use (1) framewise addition and (2) cross-attention mechanisms to efficiently incorporate the representations from the SSL model(s) into the ASR architecture, resulting in models that are comparable in size with standard encoder-decoder conformer systems while also avoiding the usage of SSL models during training. Our approach results in faster training and yields significant performance gains on the Librispeech and Tedlium datasets compared to baselines. We further provide detailed analysis and ablation studies that demonstrate the effectiveness of our approach.
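A compact sketch of the two fusion mechanisms described above, with toy dimensions: (1) framewise addition of projected SSL features to encoder features, and (2) cross-attention from encoder frames to SSL frames. The dimensions and module choices are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

d_enc, d_ssl, T = 256, 768, 100
enc_feats = torch.randn(2, T, d_enc)      # conformer encoder features
ssl_feats = torch.randn(2, T, d_ssl)      # precomputed SSL representations

# (1) framewise addition after projecting SSL features to the encoder dimension
proj = nn.Linear(d_ssl, d_enc)
fused_add = enc_feats + proj(ssl_feats)

# (2) cross-attention: encoder frames attend to SSL frames
attn = nn.MultiheadAttention(embed_dim=d_enc, num_heads=4, kdim=d_ssl, vdim=d_ssl, batch_first=True)
fused_xattn, _ = attn(query=enc_feats, key=ssl_feats, value=ssl_feats)
print(fused_add.shape, fused_xattn.shape)
```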
https://arxiv.org/abs/2404.12628
Voice-based applications dominate the era of automation because speech carries many factors that convey a speaker's identity as well as the spoken content. Modern Automatic Speech Recognition (ASR) is a blessing in the field of Human-Computer Interaction (HCI), enabling efficient communication between humans and devices using Artificial Intelligence technology. Speech is one of the easiest mediums of communication, and it contains many features that differ from speaker to speaker. Nowadays it is possible to determine a speaker's identity from their speech, i.e., speaker recognition. In this paper, we present a method that identifies a speaker's geographical origin within a certain region from continuous Bengali speech. We consider eight different divisions of Bangladesh as the geographical regions. We applied Mel Frequency Cepstral Coefficient (MFCC) and Delta features with an Artificial Neural Network to classify a speaker's division. Before feature extraction, we performed preprocessing tasks such as noise reduction and segmenting the raw audio into 8-10 second clips. We used our own dataset of more than 45 hours of audio from 633 individual male and female speakers, and recorded a highest accuracy of 85.44%.
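A small sketch of the feature pipeline described above: MFCC and delta features pooled per segment and fed to a feed-forward classifier. The audio here is synthetic noise and the eight classes merely stand in for the divisions (the real data are ~45 hours from 633 speakers).

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def segment_features(y, sr=16000):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    delta = librosa.feature.delta(mfcc)
    feats = np.vstack([mfcc, delta])                   # (26, frames)
    return feats.mean(axis=1)                          # segment-level average

# synthetic 8-second "segments", 5 per division, 8 divisions
X = np.array([segment_features(rng.normal(size=16000 * 8).astype(np.float32)) for _ in range(40)])
y = np.repeat(np.arange(8), 5)
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```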
https://arxiv.org/abs/2404.15168