Word error rate (WER) is a metric used to evaluate the quality of transcriptions produced by Automatic Speech Recognition (ASR) systems. In many applications, it is of interest to estimate the WER of a speech utterance paired with a transcript. Previous work on WER estimation focused on building models trained with a specific ASR system in mind (referred to as ASR system-dependent); such estimators are also domain-dependent and inflexible in real-world applications. In this paper, a hypothesis generation method for ASR System-Independent WER estimation (SIWE) is proposed. In contrast to prior work, the WER estimators are trained on data that simulates ASR system output: hypotheses are generated using phonetically similar or linguistically more likely alternative words. In WER estimation experiments, the proposed method reaches performance similar to ASR system-dependent WER estimators on in-domain data and achieves state-of-the-art performance on out-of-domain data. On the out-of-domain data, the SIWE model outperformed the baseline estimators in root mean square error and Pearson correlation coefficient by a relative 17.58% and 18.21%, respectively, on Switchboard and CALLHOME. Performance improved further when the WER of the training set was close to the WER of the evaluation set.
https://arxiv.org/abs/2404.16743
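Since WER is the quantity being estimated throughout this entry, a minimal reference implementation may help; this is the standard word-level Levenshtein formulation, not code from the paper:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words,
    computed via word-level Levenshtein alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33
```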
This paper is concerned with automatic continuous speech recognition using trainable systems. The aim of this work is to build acoustic models for spoken Swedish. This is done employing hidden Markov models, with the SpeechDat database used to train their parameters. Acoustic modeling has been carried out at the phonetic level, allowing general speech recognition applications, even though a simplified task (digit and natural-number recognition) was used for model evaluation. Different kinds of phone models have been tested, including context-independent models and two variations of context-dependent models. Furthermore, many experiments have been carried out with bigram language models to tune some of the system parameters. System performance over speaker subsets of different sex, age, and dialect has also been examined. Results are compared to previous similar studies, showing a remarkable improvement.
https://arxiv.org/abs/2404.16547
Scale has opened new frontiers in natural language processing, but at a high cost. In response, Mixture-of-Experts (MoE) models, which learn to activate only a subset of parameters during training and inference, have been proposed as an energy-efficient path to even larger and more capable language models, and this shift towards a new generation of foundation models is gaining momentum, particularly within the field of Automatic Speech Recognition (ASR). Recent works incorporating MoE into ASR models have complex designs, such as routing frames via a supplementary embedding network, improving multilingual ability for the experts, and utilizing dedicated auxiliary losses for either expert load balancing or specific language handling. We found that such delicate designs are not necessary: an embarrassingly simple substitution of MoE layers for all Feed-Forward Network (FFN) layers is competent for the ASR task. More specifically, we benchmark our proposed model on a large-scale in-house dataset (160k hours); the results show that we can scale our baseline Conformer (Dense-225M) to its MoE counterpart (MoE-1B) and achieve Dense-1B-level Word Error Rate (WER) while maintaining a Dense-225M-level Real-Time Factor (RTF). Furthermore, by applying the Unified 2-pass framework with bidirectional attention decoders (U2++), we achieve both streaming and non-streaming decoding modes in a single MoE-based model, which we call U2++ MoE. We hope that our study can facilitate research on scaling speech foundation models without sacrificing deployment efficiency.
https://arxiv.org/abs/2404.16407
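The "embarrassingly simple substitution" the abstract describes amounts to swapping each FFN block for a routed mixture of FFN experts. Below is a minimal PyTorch sketch of such a layer; the expert count, top-2 routing, and layer sizes are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Illustrative drop-in replacement for a Conformer FFN block: a router
    assigns each frame to its top-k experts, and only those experts run."""
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model); every frame is routed independently.
        gate = F.softmax(self.router(x), dim=-1)        # (B, T, num_experts)
        top_w, top_i = gate.topk(self.top_k, dim=-1)    # (B, T, top_k)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_i[..., slot] == e            # frames sent to expert e
                if mask.any():
                    out[mask] += top_w[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

Because only top_k experts run per frame, the parameter count grows with num_experts while per-frame compute stays near the dense FFN's, which is the mechanism behind the claimed Dense-225M-level RTF at MoE-1B scale.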
Sequence modeling is a crucial area across various domains, including Natural Language Processing (NLP), speech recognition, time-series forecasting, music generation, and bioinformatics. Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) have historically dominated sequence modeling tasks such as Machine Translation and Named Entity Recognition (NER). However, the advance of transformers has shifted this paradigm, given their superior performance. Yet transformers suffer from $O(N^2)$ attention complexity and challenges in handling inductive bias. Several variations using spectral networks or convolutions have been proposed to address these issues and have performed well on a range of tasks, but they still have difficulty dealing with long sequences. State Space Models (SSMs) have emerged as promising alternatives for sequence modeling in this context, especially with the advent of S4 and its variants, such as S4nd, HiPPO, Hyena, Diagonal State Spaces (DSS), Gated State Spaces (GSS), the Linear Recurrent Unit (LRU), Liquid-S4, and Mamba. In this survey, we categorize the foundational SSMs based on three paradigms: gating architectures, structural architectures, and recurrent architectures. The survey also highlights diverse applications of SSMs across domains such as vision, video, audio, speech, language (especially long-sequence modeling), medicine (including genomics), chemistry (such as drug design), recommendation systems, and time-series analysis, including tabular data. Moreover, we consolidate the performance of SSMs on benchmark datasets such as Long Range Arena (LRA), WikiText, GLUE, the Pile, ImageNet, Kinetics-400, and SSv2, as well as video datasets such as Breakfast, COIN, and LVU, and various time-series datasets. The project page for the Mamba-360 work is available at this https URL.
https://arxiv.org/abs/2404.16112
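For readers new to SSMs, the recurrence that S4, Mamba, and the other surveyed models build on is the discrete linear state-space update x_k = A x_{k-1} + B u_k, y_k = C x_k. A toy NumPy sketch, with the matrices chosen arbitrarily for illustration:

```python
import numpy as np

def ssm_scan(A: np.ndarray, B: np.ndarray, C: np.ndarray, u: np.ndarray) -> np.ndarray:
    """Run the discrete linear state-space recurrence
        x_k = A x_{k-1} + B u_k,   y_k = C x_k
    over a scalar input sequence u of shape (T,)."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_k in u:
        x = A @ x + B * u_k
        ys.append(C @ x)
    return np.array(ys)

# Tiny example: a 2-state system driven by a unit impulse.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([1.0, 0.5])
C = np.array([1.0, 0.0])
u = np.zeros(16); u[0] = 1.0
print(ssm_scan(A, B, C, u))  # exponentially decaying impulse response
```

The surveyed architectures differ in how A, B, and C are parameterized (e.g. HiPPO initialization in S4, input-dependent matrices in Mamba) and in how this linear recurrence is computed efficiently over long sequences.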
This paper presents Killkan, the first dataset for automatic speech recognition (ASR) in the Kichwa language, an indigenous language of Ecuador. Kichwa is an extremely low-resource, endangered language, and before Killkan there were no resources enabling Kichwa to be incorporated into natural language processing applications. The dataset contains approximately 4 hours of audio with transcription, translation into Spanish, and morphosyntactic annotation in the Universal Dependencies format. The audio data was retrieved from a publicly available radio program in Kichwa. This paper also provides corpus-linguistic analyses of the dataset, with a special focus on the agglutinative morphology of Kichwa and frequent code-switching with Spanish. The experiments show that the dataset makes it possible to develop the first ASR system for Kichwa with reliable quality despite its small size. The dataset, the ASR model, and the code used to develop them will be publicly available. Our study thus positively showcases resource building and its applications for low-resource languages and their communities.
https://arxiv.org/abs/2404.15501
It is challenging to improve automatic speech recognition (ASR) performance in noisy conditions with a single-channel speech enhancement (SE) front-end. This is generally attributed to the processing distortions caused by the nonlinear processing of single-channel SE front-ends. However, the causes of such degraded ASR performance have not been fully investigated. How to design single-channel SE front-ends in a way that significantly improves ASR performance remains an open research question. In this study, we investigate a signal-level numerical metric that can explain the cause of degradation in ASR performance. To this end, we propose a novel analysis scheme based on the orthogonal projection-based decomposition of SE errors. This scheme manually modifies the ratio of the decomposed interference, noise, and artifact errors, and it enables us to directly evaluate the impact of each error type on ASR performance. Our analysis reveals the particularly detrimental effect of artifact errors on ASR performance compared to the other types of errors. This provides us with a more principled definition of processing distortions that cause the ASR performance degradation. Then, we study two practical approaches for reducing the impact of artifact errors. First, we prove that the simple observation adding (OA) post-processing (i.e., interpolating the enhanced and observed signals) can monotonically improve the signal-to-artifact ratio. Second, we propose a novel training objective, called artifact-boosted signal-to-distortion ratio (AB-SDR), which forces the model to estimate the enhanced signals with fewer artifact errors. Through experiments, we confirm that both the OA and AB-SDR approaches are effective in decreasing artifact errors caused by single-channel SE front-ends, allowing them to significantly improve ASR performance.
https://arxiv.org/abs/2404.14860
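The OA post-processing the abstract mentions is a one-line interpolation between the enhanced output and the raw observation; the mixing weight below is an illustrative assumption:

```python
import numpy as np

def observation_adding(enhanced: np.ndarray, observed: np.ndarray, alpha: float = 0.3) -> np.ndarray:
    """Interpolate the SE front-end output with the unprocessed observation.
    Adding back some of the observation trades residual noise for fewer
    artifact errors, which the paper identifies as the ASR-critical error type."""
    return (1.0 - alpha) * enhanced + alpha * observed
```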
Understanding cognitive processes in the brain demands sophisticated models capable of replicating neural dynamics at large scales. We present a physiologically inspired speech recognition architecture, compatible and scalable with deep learning frameworks, and demonstrate that end-to-end gradient descent training leads to the emergence of neural oscillations in the central spiking neural network. Significant cross-frequency couplings, indicative of these oscillations, are measured within and across network layers during speech processing, whereas no such interactions are observed when handling background noise inputs. Furthermore, our findings highlight the crucial inhibitory role of feedback mechanisms, such as spike frequency adaptation and recurrent connections, in regulating and synchronising neural activity to improve recognition performance. Overall, beyond deepening our understanding of the synchronisation phenomena notably observed in the human auditory pathway, our architecture exhibits dynamic and efficient information processing, with relevance to neuromorphic technology.
https://arxiv.org/abs/2404.14024
Automatic Speech Recognition (ASR) can play a crucial role in enhancing the accessibility of spoken languages worldwide. In this paper, we build a set of ASR tools for Amharic, a language spoken by more than 50 million people primarily in eastern Africa. Amharic is written in the Ge'ez script, a sequence of graphemes with spacings denoting word boundaries. This makes computational processing of Amharic challenging since the location of spacings can significantly impact the meaning of formed sentences. We find that existing benchmarks for Amharic ASR do not account for these spacings and only measure individual grapheme error rates, leading to significantly inflated measurements of in-the-wild performance. In this paper, we first release corrected transcriptions of existing Amharic ASR test datasets, enabling the community to accurately evaluate progress. Furthermore, we introduce a post-processing approach using a transformer encoder-decoder architecture to organize raw ASR outputs into a grammatically complete and semantically meaningful Amharic sentence. Through experiments on the corrected test dataset, our model enhances the semantic correctness of Amharic speech recognition systems, achieving a Character Error Rate (CER) of 5.5% and a Word Error Rate (WER) of 23.3%.
https://arxiv.org/abs/2404.13362
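The inflation the authors describe is easy to reproduce: a hypothesis whose graphemes are all correct but whose spaces are misplaced scores a near-zero grapheme error rate while most words are wrong. A toy illustration with a hypothetical Amharic phrase (the specific strings are made up for this sketch):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance over any pair of sequences (characters or word lists)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j - 1] + 1, d[j] + 1, prev + (r != h))
    return d[-1]

ref = "በቤት ውስጥ ነው"
hyp = "በቤ ትውስጥ ነው"   # same graphemes, one space shifted
cer = edit_distance(ref.replace(" ", ""), hyp.replace(" ", "")) / len(ref.replace(" ", ""))
wer = edit_distance(ref.split(), hyp.split()) / len(ref.split())
print(cer, wer)  # 0.0 at the grapheme level, yet 2 of 3 words are wrong
```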
Speech-driven facial animation methods usually fall into two main classes, 3D and 2D talking face, both of which have attracted considerable research attention in recent years. However, to the best of our knowledge, research on 3D talking face has not gone as deep as that on 2D talking face with respect to lip-synchronization (lip-sync) and speech perception. To bridge the gap between the two sub-fields, we propose a learning framework named Learn2Talk, which constructs a better 3D talking-face network by exploiting two areas of expertise from the field of 2D talking face. First, inspired by the audio-video sync network, a 3D sync-lip expert model is devised to pursue lip-sync between audio and 3D facial motion. Second, a teacher model selected from among 2D talking-face methods is used to guide the training of the audio-to-3D-motion regression network, yielding higher 3D vertex accuracy. Extensive experiments show the advantages of the proposed framework in terms of lip-sync, vertex accuracy, and speech perception compared with the state of the art. Finally, we show two applications of the proposed framework: audio-visual speech recognition and speech-driven 3D Gaussian Splatting-based avatar animation.
https://arxiv.org/abs/2404.12888
Self-supervised learned (SSL) models such as Wav2vec and HuBERT yield state-of-the-art results on speech-related tasks. Given the effectiveness of such models, it is advantageous to use them in conventional ASR systems. While some approaches suggest incorporating these models as a trainable encoder or a learnable frontend, training such systems is extremely slow and requires a lot of computation cycles. In this work, we propose two simple approaches that use (1) framewise addition and (2) cross-attention mechanisms to efficiently incorporate the representations from the SSL model(s) into the ASR architecture, resulting in models that are comparable in size with standard encoder-decoder conformer systems while also avoiding the usage of SSL models during training. Our approach results in faster training and yields significant performance gains on the Librispeech and Tedlium datasets compared to baselines. We further provide detailed analysis and ablation studies that demonstrate the effectiveness of our approach.
https://arxiv.org/abs/2404.12628
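Of the two proposed mechanisms, framewise addition is the simpler: pre-computed SSL features are linearly projected to the encoder dimension and summed with the encoder frames. A minimal sketch under the assumption that the two feature streams are already time-aligned; the module and argument names are mine:

```python
import torch
import torch.nn as nn

class FramewiseSSLFusion(nn.Module):
    """Illustrative framewise-addition fusion: project pre-extracted SSL
    features to the encoder dimension and add them to the encoder frames."""
    def __init__(self, ssl_dim: int, encoder_dim: int):
        super().__init__()
        self.proj = nn.Linear(ssl_dim, encoder_dim)

    def forward(self, encoder_frames: torch.Tensor, ssl_features: torch.Tensor) -> torch.Tensor:
        # Both inputs are (batch, time, dim); the time axes are assumed to be
        # aligned (e.g. by matching frame rates beforehand).
        return encoder_frames + self.proj(ssl_features)

# Usage with pre-computed features, so the SSL model itself is never run
# inside the ASR training loop:
fusion = FramewiseSSLFusion(ssl_dim=768, encoder_dim=256)
fused = fusion(torch.randn(4, 100, 256), torch.randn(4, 100, 768))
```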
Voice-based applications dominate the current era of automation, because speech carries many factors that reveal a speaker's information as well as the speech content itself. Modern Automatic Speech Recognition (ASR) is a boon to Human-Computer Interaction (HCI), enabling efficient communication between humans and devices using Artificial Intelligence technology. Speech is one of the easiest mediums of communication, and it carries many speaker-specific features; nowadays it is possible to determine a speaker's identity from their speech through speaker recognition. In this paper, we present a method that infers a speaker's geographical identity within a certain region from continuous Bengali speech. We consider the eight divisions of Bangladesh as the geographical regions. We apply Mel-Frequency Cepstral Coefficient (MFCC) and delta features to an Artificial Neural Network to classify a speaker's division. Before feature extraction, we perform preprocessing steps such as noise reduction and segmentation of the raw audio into 8-10 second clips. We use our own dataset of more than 45 hours of audio from 633 individual male and female speakers. The highest accuracy we recorded is 85.44%.
https://arxiv.org/abs/2404.15168
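The feature pipeline described (MFCC plus delta features over 8-10 s segments) can be reproduced with librosa; the file path, sampling rate, and 13-coefficient choice here are assumptions for illustration:

```python
import librosa
import numpy as np

# Load one pre-segmented, noise-reduced clip (path and 16 kHz rate are illustrative).
y, sr = librosa.load("speaker_clip.wav", sr=16000)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, n_frames)
delta = librosa.feature.delta(mfcc)                  # first-order deltas, same shape
features = np.vstack([mfcc, delta])                  # (26, n_frames), fed to the ANN classifier
```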
Recent advancements in language modeling have led to the emergence of Large Language Models (LLMs) capable of various natural language processing tasks. Despite their success in text-based tasks, applying LLMs to the speech domain remains limited and challenging. This paper presents BLOOMZMMS, a novel model that integrates a multilingual LLM with a multilingual speech encoder, aiming to harness the capabilities of LLMs for speech recognition and beyond. Utilizing a multi-instructional training approach, we demonstrate the transferability of linguistic knowledge from the text to the speech modality. Our experiments, conducted on 1900 hours of transcribed data from 139 languages, establish that a multilingual speech representation can be effectively learned and aligned with a multilingual LLM. While this learned representation initially shows limitations in task generalization, we address this issue by generating synthetic targets in a multi-instructional style. Our zero-shot evaluation results confirm the robustness of our approach across multiple tasks, including speech translation and multilingual spoken language understanding, thereby opening new avenues for applying LLMs in the speech domain.
https://arxiv.org/abs/2404.10922
This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system, designed to meet the requirements of large-scale, multilingual ASR serving various application needs. Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages. We provide a detailed description of our model architecture, consisting of a full-context 600M-parameter Conformer encoder pre-trained with BEST-RQ and an RNN-T decoder fine-tuned jointly with the encoder. Our extensive evaluation demonstrates competitive word error rates (WERs) against larger and more computationally expensive models, such as Whisper large and Canary-1B. Furthermore, our architectural choices yield several key advantages, including an improved code-switching capability, a 5x inference speedup compared to an optimized Whisper baseline, a 30% reduction in hallucination rate on speech data, and a 90% reduction in hallucinations on ambient noise compared to Whisper, along with significantly improved time-stamp accuracy. Throughout this work, we adopt a system-centric approach to analyzing various aspects of fully-fledged ASR models to gain practically relevant insights useful for real-world services operating at scale.
https://arxiv.org/abs/2404.09841
In the rapidly advancing domain of natural language processing (NLP), large language models (LLMs) have emerged as powerful tools for interpreting human commands and generating text across various tasks. Nonetheless, the resilience of LLMs in handling text containing inherent errors, stemming from human interactions and collaborative systems, has not been thoroughly explored. Our study investigates the resilience of LLMs against five common types of disruption: 1) ASR (Automatic Speech Recognition) errors, 2) OCR (Optical Character Recognition) errors, 3) grammatical mistakes, 4) typographical errors, and 5) distractive content. We investigate how these models react by deliberately embedding these errors into instructions. Our findings reveal that while some LLMs show a degree of resistance to certain types of noise, their overall performance suffers significantly. This emphasizes the importance of further investigation into enhancing model resilience. In response to the observed decline in performance, our study also evaluates a "re-pass" strategy, designed to purify the instructions of noise before the LLMs process them. Our analysis indicates that correcting noisy instructions, particularly for open-source LLMs, presents significant challenges.
https://arxiv.org/abs/2404.09754
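One of the five disruption types, typographical errors, can be simulated with a simple perturbation like the adjacent-character swap below; this is an illustrative stand-in, not the paper's exact corruption procedure:

```python
import random

def inject_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly swap adjacent alphabetic characters to simulate typographical
    errors before feeding the instruction to an LLM under test."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

print(inject_typos("Summarize the following meeting transcript.", rate=0.2))
```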
Intention-based Human-Robot Interaction (HRI) systems allow robots to perceive and interpret user actions to proactively interact with humans and adapt to their behavior. Therefore, intention prediction is pivotal in creating a natural interactive collaboration between humans and robots. In this paper, we examine the use of Large Language Models (LLMs) for inferring human intention during a collaborative object categorization task with a physical robot. We introduce a hierarchical approach for interpreting user non-verbal cues, like hand gestures, body poses, and facial expressions and combining them with environment states and user verbal cues captured using an existing Automatic Speech Recognition (ASR) system. Our evaluation demonstrates the potential of LLMs to interpret non-verbal cues and to combine them with their context-understanding capabilities and real-world knowledge to support intention prediction during human-robot interaction.
https://arxiv.org/abs/2404.08424
Indigenous languages are a fundamental legacy in the development of human communication, embodying the unique identity and culture of local communities of the Americas. Track 1 of the Second AmericasNLP Competition at NeurIPS 2022 proposed developing automatic speech recognition (ASR) systems for five indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa'ikhana. In this paper, we propose a reliable ASR model for each target language by crawling speech corpora spanning diverse sources and applying data augmentation methods, which resulted in the winning approach in this competition. To achieve this, we systematically investigated the impact of different hyperparameters, via a Bayesian search, on the performance of the language models, specifically focusing on two variants of the Wav2vec2.0 XLS-R model: 300M and 1B parameters. Moreover, we performed a global sensitivity analysis to assess the contribution of various hyperparameter configurations to the performance of our best models. Importantly, our results show that the freeze fine-tuning updates and the dropout rate are more vital parameters than the total number of epochs or the learning rate. Additionally, we release our best models, the first ASR models reported to date for Wa'ikhana and Kotiria, along with the many experiments performed, to pave the way for other researchers to continue improving ASR in minority languages. This insight opens up interesting avenues for future work, allowing for the advancement of ASR techniques in the preservation of minority indigenous languages and acknowledging the complexities involved in this important endeavour.
https://arxiv.org/abs/2404.08368
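A Bayesian search over the hyperparameters the paper singles out (freeze-fine-tuning updates, dropout, learning rate) might look like the following Optuna sketch; the search ranges are assumptions, and the synthetic objective stands in for an actual XLS-R fine-tuning run:

```python
import optuna

def train_and_eval_wer(dropout: float, freeze_updates: int, lr: float) -> float:
    """Placeholder for a fine-tuning run returning dev-set WER; a synthetic
    bowl-shaped surface is used here so the sketch runs end to end."""
    return (dropout - 0.1) ** 2 + (freeze_updates / 10_000 - 0.5) ** 2 + (1_000 * lr - 0.1) ** 2

def objective(trial: optuna.Trial) -> float:
    dropout = trial.suggest_float("dropout", 0.0, 0.3)
    freeze_updates = trial.suggest_int("freeze_finetune_updates", 0, 10_000, step=1_000)
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    return train_and_eval_wer(dropout, freeze_updates, lr)

study = optuna.create_study(direction="minimize")  # minimize WER
study.optimize(objective, n_trials=30)
print(study.best_params)
```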
Automated speaking assessment (ASA) typically involves automatic speech recognition (ASR) and hand-crafted feature extraction from the ASR transcript of a learner's speech. Recently, self-supervised learning (SSL) has shown stellar performance compared to traditional methods. However, SSL-based ASA systems are faced with at least three data-related challenges: limited annotated data, uneven distribution of learner proficiency levels and non-uniform score intervals between different CEFR proficiency levels. To address these challenges, we explore the use of two novel modeling strategies: metric-based classification and loss reweighting, leveraging distinct SSL-based embedding features. Extensive experimental results on the ICNALE benchmark dataset suggest that our approach can outperform existing strong baselines by a sizable margin, achieving a significant improvement of more than 10% in CEFR prediction accuracy.
https://arxiv.org/abs/2404.07575
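Of the two strategies, loss reweighting directly targets the uneven proficiency-level distribution; a common form (assumed here, since the abstract does not spell out the weighting) scales cross-entropy by inverse class frequency:

```python
import torch
import torch.nn as nn

# Hypothetical per-CEFR-level training counts; weight each class
# inversely to its frequency so rare levels are not drowned out.
class_counts = torch.tensor([320.0, 540.0, 410.0, 150.0, 60.0])
weights = class_counts.sum() / (len(class_counts) * class_counts)

criterion = nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(8, 5)            # model outputs for a batch of 8 learners
labels = torch.randint(0, 5, (8,))    # gold CEFR levels
loss = criterion(logits, labels)
```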
This paper presents Conformer-1, an end-to-end Automatic Speech Recognition (ASR) model trained on an extensive dataset of 570k hours of speech audio data, 91% of which was acquired from publicly available sources. To achieve this, we perform Noisy Student Training after generating pseudo-labels for the unlabeled public data using a strong Conformer RNN-T baseline model. The addition of these pseudo-labeled data results in remarkable relative Word Error Rate (WER) improvements of 11.5% and 24.3% for our asynchronous and realtime models, respectively. Additionally, the model is more robust to background noise owing to the addition of these data. The results obtained in this study demonstrate that the incorporation of pseudo-labeled publicly available data is a highly effective strategy for improving ASR accuracy and noise robustness.
https://arxiv.org/abs/2404.07341
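The Noisy Student step boils down to: transcribe the unlabeled public audio with the baseline model, keep the confident hypotheses as training targets, and retrain. A schematic sketch; the confidence filter and its threshold are common practice rather than details stated in the abstract:

```python
from typing import Callable, Iterable

def generate_pseudo_labels(
    transcribe: Callable[[str], tuple[str, float]],  # teacher: audio path -> (hypothesis, confidence)
    unlabeled_audio: Iterable[str],
    confidence_threshold: float = 0.9,               # illustrative cutoff
) -> list[tuple[str, str]]:
    """Keep (audio, hypothesis) pairs whose teacher confidence clears the
    threshold; the retained pairs are mixed into the supervised training set."""
    kept = []
    for path in unlabeled_audio:
        hypothesis, confidence = transcribe(path)
        if confidence >= confidence_threshold:
            kept.append((path, hypothesis))
    return kept
```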
Deep learning denotes a category of machine learning algorithms capable of combining raw inputs into layers of intermediate features. These deep learning algorithms have demonstrated great results in different fields. Deep learning has, in particular, achieved human-level performance across a number of domains in computer vision and pattern recognition. To achieve state-of-the-art performance in diverse domains, deep learning employs different architectures, and these architectures use activation functions to perform various computations between the hidden and output layers. This paper presents a survey of the existing studies of deep learning in the field of handwriting recognition. Although recent progress indicates that deep learning methods provide valuable means for speeding up handwriting recognition and improving its accuracy, the extensive literature survey conducted here finds that deep learning has yet to fully revolutionize the field and must still resolve many of its most pressing challenges, even though promising advances have been made over the prior state of the art. Additionally, the limited availability of labelled training data presents problems in this domain. Nevertheless, the present handwriting recognition survey foresees deep learning enabling changes at both bench and bedside, with the potential to transform several domains, such as image processing, speech recognition, computer vision, machine translation, robotics and control, medical imaging, medical information processing, bio-informatics, natural language processing, cyber security, and many others.
https://arxiv.org/abs/2404.08011
Discrete speech tokens have become more and more popular in multiple speech processing fields, including automatic speech recognition (ASR), text-to-speech (TTS), and singing voice synthesis (SVS). In this paper, we describe the systems developed by the SJTU X-LANCE group for the TTS (acoustic + vocoder), SVS, and ASR tracks of the Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge. Notably, we achieved first rank on the TTS-track leaderboard both with the whole training set and with only 1 h of training data, along with the lowest bitrate among all submissions.
https://arxiv.org/abs/2404.06079