Code-switching, the alternation of languages within a single discourse, presents a significant challenge for Automatic Speech Recognition. Despite the unique nature of the task, performance is commonly measured with established metrics such as Word-Error-Rate (WER). However, in this paper, we question whether these general metrics accurately assess performance on code-switching. Specifically, using both Connectionist-Temporal-Classification and Encoder-Decoder models, we show that fine-tuning on non-code-switched data from both the matrix and the embedded language improves classical metrics on code-switching test sets, even though performance on the actual code-switched words degrades (as expected). Therefore, we propose Point-of-Interest Error Rate (PIER), a variant of WER that focuses only on specific words of interest. We instantiate PIER on code-switched utterances and show that it describes code-switching performance more accurately, revealing substantial room for improvement in future work. This focused evaluation allows for a more precise assessment of model performance, particularly in challenging aspects such as inter-word and intra-word code-switching.
Code-switching, i.e. alternating between multiple languages within a single discourse, poses a major challenge for automatic speech recognition. Despite the unique nature of the task, performance is usually still measured with established metrics such as Word-Error-Rate (WER). In this paper, however, we question whether these general-purpose metrics can accurately assess code-switching performance. Specifically, using both Connectionist Temporal Classification and encoder-decoder models, we show that fine-tuning on non-code-switched data from the two languages involved (the matrix and the embedded language) improves the classical metrics on code-switching test sets, while performance on the actual code-switched words becomes worse (as expected). We therefore propose a WER variant, the Point-of-Interest Error Rate (PIER), which evaluates only specific words of interest. We apply PIER to the code-switched words in an utterance and show that it describes code-switching performance more accurately, pointing to substantial room for improvement in future work. This targeted evaluation allows model performance to be measured more precisely, especially for challenging aspects such as inter-word and intra-word code-switching.
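The abstract defines PIER only informally. As a rough illustration of one plausible reading (an error rate computed over tagged reference words only, counting substitutions and deletions and ignoring insertions; the function names and the example sentence are hypothetical), a minimal sketch in Python:

```python
from typing import List, Set

def align(ref: List[str], hyp: List[str]):
    """Levenshtein alignment between reference and hypothesis words.
    Returns a list of (op, ref_idx, hyp_idx) with op in {"ok", "sub", "del", "ins"}."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]          # edit-distance DP table
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,              # deletion
                          d[i][j - 1] + 1,              # insertion
                          d[i - 1][j - 1] + cost)       # match / substitution
    ops, i, j = [], n, m                                # backtrace
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1):
            ops.append(("ok" if ref[i - 1] == hyp[j - 1] else "sub", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("del", i - 1, None))
            i -= 1
        else:
            ops.append(("ins", None, j - 1))
            j -= 1
    return list(reversed(ops))

def pier(ref: List[str], hyp: List[str], poi: Set[int]) -> float:
    """Error rate restricted to the reference words whose indices are points of interest."""
    errors = sum(1 for op, ri, _ in align(ref, hyp)
                 if ri is not None and ri in poi and op in ("sub", "del"))
    return errors / max(len(poi), 1)

# Toy example: German matrix sentence with one English code-switched word (index 3).
ref = "ich habe das meeting heute verschoben".split()
hyp = "ich habe das mieten heute verschoben".split()
print(pier(ref, hyp, poi={3}))  # 1.0: the code-switched word "meeting" was misrecognized
```

In this toy example the overall WER is only 1/6, while the error rate over the single code-switched word is 1.0, which is exactly the kind of gap such a focused metric is meant to expose.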
https://arxiv.org/abs/2501.09512
This paper presents an efficient decoding approach for end-to-end automatic speech recognition (E2E-ASR) with large language models (LLMs). Although shallow fusion is the most common approach to incorporate language models into E2E-ASR decoding, we face two practical problems with LLMs. (1) LLM inference is computationally costly. (2) There may be a vocabulary mismatch between the ASR model and the LLM. To resolve this mismatch, we need to retrain the ASR model and/or the LLM, which is at best time-consuming and in many cases not feasible. We propose "delayed fusion," which applies LLM scores to ASR hypotheses with a delay during decoding and enables easier use of pre-trained LLMs in ASR tasks. This method can reduce not only the number of hypotheses scored by the LLM but also the number of LLM inference calls. It also allows re-tokenization of ASR hypotheses during decoding if the ASR model and the LLM employ different tokenizations. We demonstrate that delayed fusion provides improved decoding speed and accuracy compared to shallow fusion and N-best rescoring using the LibriHeavy ASR corpus and three public LLMs, OpenLLaMA 3B & 7B and Mistral 7B.
This paper proposes an efficient decoding approach for combining end-to-end automatic speech recognition (E2E-ASR) with large language models (LLMs). Although shallow fusion is the most common way to incorporate a language model into E2E-ASR decoding, two practical problems arise with LLMs: (1) LLM inference is computationally expensive, and (2) there may be a vocabulary mismatch between the ASR model and the LLM. Resolving this vocabulary mismatch requires retraining the ASR model and/or the LLM, which is at best time-consuming and in many cases infeasible. We therefore propose "delayed fusion," which applies LLM scores to ASR hypotheses with a delay during decoding, making it easier to use pre-trained LLMs for ASR tasks. The method reduces not only the number of hypotheses that need to be scored by the LLM but also the number of LLM inference calls. In addition, if the ASR model and the LLM use different tokenizations, delayed fusion can re-tokenize the ASR hypotheses during decoding. We show that, compared with shallow fusion and N-best rescoring, delayed fusion provides faster and more accurate decoding, validated on the LibriHeavy ASR corpus with three public LLMs (OpenLLaMA 3B & 7B and Mistral 7B).
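The abstract does not spell out the scoring rule, so the following is only a schematic contrast in my own notation: shallow fusion adds the LM term to every partial hypothesis at every step, whereas delayed fusion defers the LLM term to occasional synchronization points on re-tokenized surviving hypotheses.

```latex
% Shallow fusion: the LM term is added to every partial hypothesis at every decoding step
\log p_{\mathrm{SF}}(y_{1:t} \mid X) \;=\; \log p_{\mathrm{ASR}}(y_{1:t} \mid X) \;+\; \lambda \, \log p_{\mathrm{LLM}}(y_{1:t})

% Delayed fusion (schematic): beam expansion is driven by the ASR score alone; the LLM term
% is added only at occasional synchronization points, after re-tokenizing the surviving
% hypotheses y_{1:t} into the LLM vocabulary as \tilde{y}_{1:t'}
\log p_{\mathrm{DF}}(y_{1:t} \mid X) \;=\; \log p_{\mathrm{ASR}}(y_{1:t} \mid X) \;+\; \lambda \, \log p_{\mathrm{LLM}}(\tilde{y}_{1:t'})
```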
https://arxiv.org/abs/2501.09258
Data augmentation (DA) is ubiquitously used in training of Automatic Speech Recognition (ASR) models. DA offers increased data variability, robustness and generalization against different acoustic distortions. Recently, personalization of ASR models on mobile devices has been shown to reduce Word Error Rate (WER). This paper evaluates data augmentation in this context and proposes persoDA, a DA method driven by the user's data and utilized to personalize ASR. persoDA aims to augment training with data specifically tuned towards the acoustic characteristics of the end-user, as opposed to standard augmentation based on Multi-Condition Training (MCT), which applies random reverberation and noises. Our evaluation with a conformer-based ASR baseline trained on Librispeech and personalized for VOICES shows that persoDA achieves a 13.9% relative WER reduction over standard data augmentation (random noise and reverberation). Furthermore, persoDA shows 16% to 20% faster convergence over MCT.
Data augmentation (DA) is ubiquitously used in training automatic speech recognition (ASR) models. DA increases data variability, robustness, and generalization against different acoustic distortions. Recent work has shown that personalizing ASR models on mobile devices reduces the Word Error Rate (WER). This paper evaluates data augmentation in that context and proposes persoDA, a DA method driven by user-specific data to personalize ASR. Unlike the standard augmentation of Multi-Condition Training (MCT), which applies random reverberation and noise, persoDA augments the training data specifically towards the acoustic characteristics of the end-user. Our evaluation with a Conformer-based ASR baseline trained on Librispeech and personalized on VOICES shows that persoDA achieves a 13.9% relative WER reduction over standard data augmentation (random noise and reverberation). Furthermore, persoDA converges 16% to 20% faster than MCT.
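The paper's augmentation pipeline is not detailed in the abstract; as a minimal sketch of the underlying idea, assuming the end-user's device can supply a room impulse response and a noise recording (all signals below are synthetic placeholders), one could replace random MCT draws with the user's own acoustic conditions:

```python
import numpy as np
from scipy.signal import fftconvolve

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech at a target signal-to-noise ratio (in dB)."""
    noise = np.resize(noise, speech.shape)                 # loop/trim noise to match length
    p_speech = np.mean(speech ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def perso_augment(speech, user_rir, user_noise, rng):
    """Personalized augmentation: the end-user's reverberation and noise, not random MCT draws."""
    reverbed = fftconvolve(speech, user_rir)[: len(speech)]    # convolve with the user's RIR
    return add_noise(reverbed, user_noise, snr_db=rng.uniform(5, 20))

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)                                        # 1 s of dummy speech at 16 kHz
user_rir = rng.standard_normal(2000) * np.exp(-np.linspace(0, 8, 2000))    # toy decaying impulse response
user_noise = rng.standard_normal(16000)                                    # noise captured on the user's device
augmented = perso_augment(speech, user_rir, user_noise, rng)
```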
https://arxiv.org/abs/2501.09113
In this paper, we take a step towards jointly modeling automatic speech recognition (STT) and speech synthesis (TTS) in a fully non-autoregressive way. We develop a novel multimodal framework capable of handling the speech and text modalities as input either individually or together. The proposed model can also be trained with unpaired speech or text data owing to its multimodal nature. We further propose an iterative refinement strategy to improve the STT and TTS performance of our model such that the partial hypothesis at the output can be fed back to the input of our model, thus iteratively improving both STT and TTS predictions. We show that our joint model can effectively perform both STT and TTS tasks, outperforming the STT-specific baseline in all tasks and performing competitively with the TTS-specific baseline across a wide range of evaluation metrics.
In this paper, we take a step towards jointly modeling automatic speech recognition (STT) and speech synthesis (TTS) in a fully non-autoregressive manner. We develop a novel multimodal framework that can handle the speech and text modalities as input either individually or together. Owing to its multimodal nature, the proposed model can also be trained with unpaired speech or text data. We further propose an iterative refinement strategy to improve the STT and TTS performance of the model, whereby the partial hypothesis at the output is fed back to the input, iteratively improving the predictions for both tasks. The results show that our joint model effectively performs both STT and TTS, outperforming the STT-specific baseline on all tasks and performing competitively with the TTS-specific baseline across a wide range of evaluation metrics.
https://arxiv.org/abs/2501.09104
We collect novel data in the public service domain to evaluate the capability of state-of-the-art automatic speech recognition (ASR) models in capturing regional differences in accents in the United Kingdom (UK), specifically focusing on two accents from Scotland with distinct dialects. This study addresses real-world problems where biased ASR models can lead to miscommunication in public services, disadvantaging individuals with regional accents, particularly those in vulnerable populations. We first examine the out-of-the-box performance of the Whisper large-v3 model on a baseline dataset and on our data. We then explore the impact of fine-tuning Whisper on performance in the two UK regions and investigate the effectiveness of existing model evaluation techniques for our real-world application through manual inspection of model errors. We observe that the Whisper model has a higher word error rate (WER) on our test datasets than on the baseline data, and that fine-tuning on a given dataset improves performance on test data from the same domain and accent. The fine-tuned models also appear to show improved performance when applied to test data from outside the region they were trained on, suggesting that fine-tuned models may be transferable within parts of the UK. Our manual analysis of model outputs reveals the benefits and drawbacks of using WER as an evaluation metric and of fine-tuning to adapt to regional dialects.
We collect new data in the public service domain to evaluate the ability of state-of-the-art automatic speech recognition (ASR) models to capture regional accent differences in the United Kingdom (UK), focusing on two Scottish accents with distinct dialects. The study addresses a real-world problem: biased ASR models can lead to miscommunication in public services, disadvantaging speakers with regional accents, especially those in vulnerable populations. We first examine the out-of-the-box performance of the Whisper large-v3 model on a baseline dataset and on our data. We then explore the impact of fine-tuning Whisper on performance in the two UK regions, and investigate the effectiveness of existing model evaluation techniques for our real-world application through manual inspection of model errors. We observe that the Whisper model has a higher word error rate (WER) on our test datasets than on the baseline data, and that fine-tuning on a given dataset improves performance on test data with the same domain and accent. The fine-tuned models also appear to perform better on test data from outside the region they were trained on, suggesting that fine-tuned models may be transferable within parts of the UK. Our manual analysis of model outputs reveals the benefits and drawbacks of using WER as an evaluation metric and of fine-tuning to adapt to regional dialects.
https://arxiv.org/abs/2501.08502
While Speech Foundation Models (SFMs) excel in various speech tasks, their performance for low-resource tasks such as child Automatic Speech Recognition (ASR) is hampered by limited pretraining data. To address this, we explore different model merging techniques to leverage knowledge from models trained on larger, more diverse speech corpora. This paper also introduces Selective Attention (SA) Merge, a novel method that selectively merges task vectors from attention matrices to enhance SFM performance on low-resource tasks. Experiments on the MyST database show significant relative word error rate reductions of up to 14%, outperforming existing model merging and data augmentation techniques. By combining data augmentation techniques with SA Merge, we achieve a new state-of-the-art WER of 8.69 on the MyST database for the Whisper-small model, highlighting the potential of SA Merge for improving low-resource ASR.
While Speech Foundation Models (SFMs) excel at a variety of speech tasks, their performance on low-resource tasks such as child automatic speech recognition (ASR) is limited by scarce pretraining data. To address this, we explore different model merging techniques to exploit knowledge from models trained on larger and more diverse speech corpora. The paper also introduces Selective Attention (SA) Merge, a novel method that selectively merges task vectors from the attention matrices to enhance SFM performance on low-resource tasks. Experiments on the MyST database show relative word error rate reductions of up to 14%, outperforming existing model merging and data augmentation techniques. By combining data augmentation with SA Merge, we achieve a new state-of-the-art WER of 8.69 on the MyST database for the Whisper-small model, highlighting the potential of SA Merge for improving low-resource ASR.
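SA Merge is only described at a high level here. The sketch below shows the general flavor with PyTorch state dicts: build task vectors (fine-tuned minus base weights) and add them back only for attention-projection parameters. The name filter and the scaling factor alpha are assumptions for illustration, not the paper's selection rule.

```python
import torch

def selective_attention_merge(base_sd: dict, ft_sd: dict, alpha: float = 0.5) -> dict:
    """Merge a fine-tuned model into a base model, but only for attention parameters."""
    attn_keys = ("q_proj", "k_proj", "v_proj", "out_proj")     # assumed naming convention
    merged = {}
    for name, base_w in base_sd.items():
        task_vector = ft_sd[name] - base_w                     # task vector for this parameter
        if any(k in name for k in attn_keys):
            merged[name] = base_w + alpha * task_vector        # selectively apply to attention
        else:
            merged[name] = base_w.clone()                      # keep everything else at base values
    return merged

# Toy usage with randomly initialized "models" of identical structure
base = {"layer0.q_proj.weight": torch.randn(4, 4), "layer0.ffn.weight": torch.randn(4, 4)}
ft = {k: v + 0.1 * torch.randn_like(v) for k, v in base.items()}
merged = selective_attention_merge(base, ft, alpha=0.5)
```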
https://arxiv.org/abs/2501.08468
In this paper we propose a robust loudspeaker beamforming algorithm which is used to enhance the performance of voice-driven applications in scenarios where the loudspeakers introduce the majority of the noise, e.g. when music is playing loudly. The loudspeaker beamformer modifies the loudspeaker playback signals to create a low-acoustic-energy region around the device that implements automatic speech recognition for a voice-driven application (VDA). The algorithm utilises a distortion measure based on human auditory perception to limit the distortion perceived by human listeners. Simulations and real-world experiments show that the proposed loudspeaker beamformer improves the speech recognition performance in all tested scenarios. Moreover, the algorithm allows the acoustic energy around the VDA device to be reduced further, at the expense of reduced objective audio quality at the listener's location.
In this paper we propose a robust loudspeaker beamforming algorithm that enhances the performance of voice-driven applications in scenarios where the loudspeakers introduce most of the noise, for example when music is playing loudly. The loudspeaker beamformer modifies the loudspeaker playback signals so as to create a low-acoustic-energy region around the device that runs automatic speech recognition for the voice-driven application (VDA). The algorithm uses a distortion measure based on human auditory perception to limit the distortion perceived by human listeners. Simulations and real-world experiments show that the proposed loudspeaker beamformer improves speech recognition performance in all tested scenarios. Moreover, the algorithm can further reduce the acoustic energy around the VDA device at the expense of reduced objective audio quality at the listener's location.
https://arxiv.org/abs/2501.08104
Spoken language understanding (SLU) is a structure prediction task in the field of speech. Recently, many works on SLU that treat it as a sequence-to-sequence task have achieved great success. However, this method is not suitable for simultaneous speech recognition and understanding. In this paper, we propose a joint speech recognition and structure learning framework (JSRSL), a span-based end-to-end SLU model which can accurately transcribe speech and extract structured content simultaneously. We conduct experiments on named entity recognition and intent classification using the Chinese dataset AISHELL-NER and the English dataset SLURP. The results show that our proposed method not only outperforms the traditional sequence-to-sequence method in both transcription and extraction capabilities but also achieves state-of-the-art performance on the two datasets.
Spoken language understanding (SLU) is a structured prediction task in the speech domain. Recently, many works that treat SLU as a sequence-to-sequence task have achieved great success. However, this approach is not suitable for simultaneous speech recognition and understanding. In this paper, we propose a joint speech recognition and structure learning framework (JSRSL), a span-based end-to-end SLU model that can accurately transcribe speech and extract structured content at the same time. We conduct experiments on named entity recognition and intent classification with the Chinese dataset AISHELL-NER and the English dataset SLURP. The results show that our method not only outperforms the traditional sequence-to-sequence approach in both transcription and extraction, but also achieves state-of-the-art performance on the two datasets.
https://arxiv.org/abs/2501.07329
Large Audio-Language Models (LALMs) have demonstrated remarkable performance in tasks involving audio perception and understanding, such as speech recognition and audio captioning. However, their reasoning capabilities - critical for solving complex real-world problems - remain underexplored. In this work, we conduct the first exploration into integrating Chain-of-Thought (CoT) reasoning into LALMs to enhance their reasoning ability across auditory modalities. We evaluate representative CoT methods, analyzing their performance in both information extraction and reasoning tasks across sound, music, and speech domains. Our findings reveal that CoT methods significantly improve performance on easy and medium tasks but encounter challenges with hard tasks, where reasoning chains can confuse the model rather than improve accuracy. Additionally, we identify a positive correlation between reasoning path length and accuracy, demonstrating the potential of scaling inference for advanced instruction-following and reasoning. This study not only highlights the promise of CoT in enhancing LALM reasoning capabilities but also identifies key limitations and provides actionable directions for future research.
Large Audio-Language Models (LALMs) have shown remarkable performance on tasks involving audio perception and understanding, such as speech recognition and audio captioning. However, their reasoning abilities, which are critical for solving complex real-world problems, remain underexplored. In this work, we conduct the first exploration of integrating Chain-of-Thought (CoT) reasoning into LALMs to strengthen their reasoning across auditory modalities. We evaluate several representative CoT methods and analyze their performance on information extraction and reasoning tasks across the sound, music, and speech domains. Our findings show that CoT methods significantly improve performance on easy and medium tasks but struggle on hard tasks, where the reasoning chain can confuse the model rather than improve accuracy. In addition, we identify a positive correlation between reasoning-path length and accuracy, demonstrating the potential of scaling inference for advanced instruction following and reasoning. This study not only highlights the promise of CoT for enhancing LALM reasoning capabilities but also identifies key limitations and provides actionable directions for future research.
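No prompts are quoted in the abstract; the pair below is an invented illustration of the kind of contrast being evaluated, a direct instruction versus a chain-of-thought variant for an audio question sent to an audio-language model:

```python
# Hypothetical prompts for an audio QA query; the audio clip itself is passed to the model separately.
direct_prompt = (
    "Listen to the clip and answer: which instrument enters second? "
    "Answer with the instrument name only."
)
cot_prompt = (
    "Listen to the clip and answer: which instrument enters second? "
    "First describe the events you hear in order, with rough timestamps, "
    "then reason step by step about the order of entries, "
    "and finally state the instrument name on the last line."
)
```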
https://arxiv.org/abs/2501.07246
Objective: Speech tests aim to estimate discrimination loss or speech recognition threshold (SRT). This paper investigates the potential to estimate SRTs from clinical data that are aimed at characterizing the discrimination loss. Knowledge about the relationship between the speech test outcome variables, which are conceptually linked via the psychometric function, is important for integrating data from different databases. Design: Depending on the available data, different SRT estimation procedures were compared and evaluated. A novel, model-based SRT estimation procedure was proposed that deals with incomplete patient data. Interpretations of supra-threshold deficits were assessed for the two interpretation modes. Study sample: Data for 27009 patients with Freiburg monosyllabic speech test (FMST) and audiogram (AG) results from the same day were included in the retrospective analysis. Results: The model-based SRT estimation procedure provided accurate SRTs, but with large deviations in the estimated slope. Supra-threshold hearing loss components differed between the two interpretation modes. Conclusions: The model-based procedure can be used for SRT estimation, and its properties relate to data availability for individual patients. All SRT procedures are influenced by the uncertainty of the word recognition scores. In the future, the proposed approach can be used to assess additional differences between speech tests.
Objective: Speech tests aim to estimate discrimination loss or the speech recognition threshold (SRT). This paper investigates the possibility of estimating SRTs from clinical data that were collected to characterize discrimination loss. Knowledge about the relationship between speech test outcome variables, which are conceptually linked via the psychometric function, is important for integrating data from different databases. Design: Depending on the available data, different SRT estimation procedures were compared and evaluated. A novel, model-based SRT estimation procedure was proposed that can handle incomplete patient data. Supra-threshold hearing loss components were assessed under the two interpretation modes. Study sample: The retrospective analysis included data from 27009 patients with Freiburg monosyllabic speech test (FMST) and audiogram (AG) results from the same day. Results: The model-based SRT estimation procedure provided accurate SRTs, but with large deviations in the estimated slope. The supra-threshold hearing loss components differed between the two interpretation modes. Conclusions: The model-based procedure can be used for SRT estimation, and its properties depend on the data available for individual patients. All SRT procedures are affected by the uncertainty of the word recognition scores. In the future, the proposed approach can be used to assess further differences between speech tests.
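The abstract refers to the psychometric function linking speech test outcomes; as a generic illustration (not the paper's model-based procedure), one can fit a logistic psychometric function to level/score pairs and read off the SRT as the level of 50% intelligibility. The data points below are made up:

```python
import numpy as np
from scipy.optimize import curve_fit

def psychometric(level_db, srt, slope):
    """Logistic psychometric function: proportion correct as a function of presentation level."""
    return 1.0 / (1.0 + np.exp(-slope * (level_db - srt)))

# Made-up word recognition scores (proportion correct) at a few presentation levels
levels = np.array([20.0, 35.0, 50.0, 65.0, 80.0])
scores = np.array([0.05, 0.20, 0.55, 0.85, 0.95])

(srt_est, slope_est), _ = curve_fit(psychometric, levels, scores, p0=[50.0, 0.1])
print(f"estimated SRT = {srt_est:.1f} dB, slope = {slope_est:.3f} per dB")  # SRT: level at 50% correct
```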
https://arxiv.org/abs/2501.08921
Intra-sentential code-switching (CS) refers to the alternation between languages that happens within a single utterance and is a significant challenge for Automatic Speech Recognition (ASR) systems, for example when a Vietnamese speaker uses foreign proper names or specialized terms within their speech. ASR systems often struggle to accurately transcribe intra-sentential CS due to their training on monolingual data and the unpredictable nature of CS. This issue is even more pronounced for low-resource languages, where limited data availability hinders the development of robust models. In this study, we propose AdaCS, a normalization model that integrates an adaptive bias attention module (BAM) into an encoder-decoder network. This novel approach provides a robust solution to CS ASR in unseen domains, thereby significantly enhancing our contribution to the field. By utilizing BAM to both identify and normalize CS phrases, AdaCS enhances its adaptive capabilities with a bias list of words provided during inference. Our method demonstrates impressive performance and the ability to handle unseen CS phrases across various domains. Experiments show that AdaCS outperforms the previous state-of-the-art method on Vietnamese CS ASR normalization, with considerable WER reductions of 56.2% and 36.8% on the two proposed test sets.
Intra-sentential code-switching (CS) refers to alternating between languages within a single utterance, which poses a major challenge for automatic speech recognition (ASR) systems, for example when a Vietnamese speaker embeds foreign proper names or specialized terms in their speech. Because ASR systems are usually trained on monolingual data and code-switching is unpredictable, they struggle to transcribe intra-sentential CS accurately. The problem is even more pronounced for low-resource languages, where limited data availability hinders the development of robust models. In this study, we propose AdaCS, a normalization model that integrates an adaptive bias attention module (BAM) into an encoder-decoder network. This novel approach provides a robust solution for CS ASR in unseen domains, thereby substantially strengthening our contribution to the field. By using BAM to both identify and normalize CS phrases, and by supplying a bias word list at inference time, AdaCS enhances its adaptive capability. Our method demonstrates impressive performance and the ability to handle unseen CS phrases across various domains. Experiments show that AdaCS outperforms the previous state-of-the-art method on Vietnamese CS ASR normalization, with considerable WER reductions of 56.2% and 36.8% on the two proposed test sets.
https://arxiv.org/abs/2501.07102
Self-supervised speech models (S3Ms) have become a common tool for the speech processing community, leveraging representations for downstream tasks. Clustering S3M representations yields discrete speech units (DSUs), which serve as compact representations for speech signals. DSUs are typically obtained by k-means clustering. Using DSUs often leads to strong performance in various tasks, including automatic speech recognition (ASR). However, despite the high dimensionality and redundancy of S3M representations, preprocessing them for better clustering remains unexplored, even though it can affect the quality of DSUs. In this paper, we investigate the potential of linear preprocessing methods for extracting DSUs. We evaluate standardization, principal component analysis, whitening, and independent component analysis (ICA) on DSU-based ASR benchmarks and demonstrate their effectiveness as preprocessing for k-means. We also conduct extensive analyses of their behavior, such as the orthogonality or interpretability of individual ICA components.
Self-supervised speech models (S3Ms) have become a common tool in the speech processing community, providing representations for downstream tasks. Clustering S3M representations yields discrete speech units (DSUs), which serve as compact representations of speech signals; DSUs are typically obtained with k-means clustering. Using DSUs often leads to strong performance on various tasks, including automatic speech recognition (ASR). However, despite the high dimensionality and redundancy of S3M representations, preprocessing them for better clustering remains unexplored, even though it can affect the quality of the DSUs. In this paper, we investigate the potential of linear preprocessing methods for extracting DSUs. We evaluate standardization, principal component analysis, whitening, and independent component analysis (ICA) on DSU-based ASR benchmarks and demonstrate their effectiveness as preprocessing for k-means. We also conduct extensive analyses of their behavior, such as the orthogonality and interpretability of individual ICA components. In short, this study not only validates the importance of several linear preprocessing methods for DSU extraction but also discusses their potential value and impact in automatic speech recognition tasks.
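The evaluated preprocessing variants are all standard; a minimal sketch with scikit-learn, using random features as a stand-in for real S3M frame representations and toy cluster counts:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA, FastICA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
feats = rng.standard_normal((2000, 512))     # stand-in for S3M frame representations

preprocessors = {
    "none": None,
    "standardize": StandardScaler(),
    "pca_whiten": PCA(n_components=128, whiten=True),
    "ica": FastICA(n_components=128, max_iter=500, random_state=0),
}

for name, prep in preprocessors.items():
    x = feats if prep is None else prep.fit_transform(feats)
    km = KMeans(n_clusters=50, n_init=4, random_state=0).fit(x)   # DSUs = cluster indices
    dsu = km.labels_
    print(name, "first frame DSUs:", dsu[:8])
```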
https://arxiv.org/abs/2501.06562
Spoken language datasets are vital for advancing linguistic research, Natural Language Processing, and speech technology. However, resources dedicated to Italian, a linguistically rich and diverse Romance language, remain underexplored compared to major languages like English or Mandarin. This survey provides a comprehensive analysis of 66 spoken Italian datasets, highlighting their characteristics, methodologies, and applications. The datasets are categorized by speech type, source and context, and demographic and linguistic features, with a focus on their utility in fields such as Automatic Speech Recognition, emotion detection, and education. Challenges related to dataset scarcity, representativeness, and accessibility are discussed alongside recommendations for enhancing dataset creation and utilization. The full dataset inventory is publicly accessible via GitHub and archived on Zenodo, serving as a valuable resource for researchers and developers. By addressing current gaps and proposing future directions, this work aims to support the advancement of Italian speech technologies and linguistic research.
Spoken language datasets are vital for advancing linguistic research, natural language processing, and speech technology. However, compared with major languages such as English or Mandarin, the resources available for Italian, a linguistically rich and diverse Romance language, remain relatively scarce and underexplored. This survey provides a comprehensive analysis of 66 spoken Italian datasets, highlighting their characteristics, methodologies, and applications in fields such as automatic speech recognition, emotion detection, and education. The datasets are categorized by speech type, source and context, and demographic and linguistic features. Challenges related to dataset scarcity, representativeness, and accessibility are discussed, along with recommendations for improving dataset creation and utilization. The full dataset inventory is publicly accessible via GitHub and archived on Zenodo, providing a valuable resource for researchers and developers. By addressing current gaps and proposing future directions, this work aims to support the advancement of Italian speech technologies and linguistic research.
https://arxiv.org/abs/2501.06557
We develop automatic speech recognition (ASR) systems for stories told by Afrikaans and isiXhosa preschool children. Oral narratives provide a way to assess children's language development before they learn to read. We consider a range of prior child-speech ASR strategies to determine which is best suited to this unique setting. Using Whisper and only 5 minutes of transcribed in-domain child speech, we find that additional in-domain adult data (adult speech matching the story domain) provides the biggest improvement, especially when coupled with voice conversion. Semi-supervised learning also helps for both languages, while parameter-efficient fine-tuning helps on Afrikaans but not on isiXhosa (which is under-represented in the Whisper model). Few child-speech studies look at non-English data, and even fewer at the preschool ages of 4 and 5. Our work therefore represents a unique validation of a wide range of previous child-speech ASR strategies in an under-explored setting.
We develop automatic speech recognition (ASR) systems for stories told by Afrikaans- and isiXhosa-speaking preschool children. Oral narratives provide a way to assess children's language development before they learn to read. We consider a range of prior child-speech ASR strategies to determine which is best suited to this unique setting. Using Whisper and only 5 minutes of transcribed in-domain child speech, we find that additional in-domain adult data (adult speech matching the story domain) gives the largest improvement, especially when combined with voice conversion. Semi-supervised learning also helps for both languages, while parameter-efficient fine-tuning helps on Afrikaans but not on isiXhosa (which is under-represented in the Whisper model). Few child-speech studies consider non-English data, and even fewer the preschool ages of 4 and 5. Our work therefore provides a unique validation of a wide range of previous child-speech ASR strategies in an under-explored setting.
https://arxiv.org/abs/2501.06478
This work introduces TTS-Transducer, a novel architecture for text-to-speech that leverages the strengths of audio codec models and neural transducers. Transducers, renowned for their superior quality and robustness in speech recognition, are employed to learn monotonic alignments, avoiding the need for explicit duration predictors. Neural audio codecs efficiently compress audio into discrete codes, revealing the possibility of applying text modeling approaches to speech generation. However, the complexity of predicting multiple tokens per frame from several codebooks, as necessitated by audio codec models with residual quantizers, poses a significant challenge. The proposed system first uses a transducer architecture to learn monotonic alignments between tokenized text and speech codec tokens for the first codebook. Next, a non-autoregressive Transformer predicts the remaining codes using the alignment extracted from the transducer loss. The proposed system is trained end-to-end. We show that TTS-Transducer is a competitive and robust alternative to contemporary TTS systems.
This work introduces TTS-Transducer, a novel text-to-speech architecture that leverages the strengths of audio codec models and neural transducers. Transducers, renowned for their quality and robustness in speech recognition, are used to learn monotonic alignments, avoiding the need for explicit duration predictors. Neural audio codecs efficiently compress audio into discrete codes, opening up the possibility of applying text modeling approaches to speech generation. However, having to predict multiple tokens per frame from several codebooks, as required by audio codec models with residual quantizers, poses a significant challenge. The proposed system first uses a transducer architecture to learn monotonic alignments between tokenized text and the speech codec tokens of the first codebook. A non-autoregressive Transformer then predicts the remaining codes using the alignment extracted from the transducer loss. The system is trained end to end. We show that TTS-Transducer is a competitive and robust alternative to contemporary TTS systems.
https://arxiv.org/abs/2501.06320
General-purpose automatic speech recognition (ASR) systems do not always perform well in goal-oriented dialogue. Existing ASR correction methods rely on prior user data or named entities. We extend correction to tasks that have no prior user data and exhibit linguistic flexibility such as lexical and syntactic variations. We propose a novel context augmentation with a large language model and a ranking strategy that incorporates contextual information from the dialogue states of a goal-oriented conversational AI and its tasks. Our method ranks (1) n-best ASR hypotheses by their lexical and semantic similarity with the context and (2) the context by phonetic correspondence with the ASR hypotheses. Evaluated in the home improvement and cooking domains with real-world users, our method improves the recall and F1 of correction by 34% and 16%, respectively, while maintaining precision and the false positive rate. Users rated the system 0.8-1 point (out of 5) higher when our correction method worked properly, with no decrease due to false positives.
General-purpose automatic speech recognition (ASR) systems do not always perform well in goal-oriented dialogue. Existing ASR correction methods rely on prior user data or named entities. We extend correction to tasks that have no prior user data and that exhibit linguistic flexibility, such as lexical and syntactic variation. We propose a novel context augmentation with a large language model and a ranking strategy that incorporates contextual information from the dialogue states of a goal-oriented conversational AI and its tasks. Our method ranks (1) the n-best ASR hypotheses by their lexical and semantic similarity to the context, and (2) the context by its phonetic correspondence with the ASR hypotheses. Evaluated with real-world users in the home improvement and cooking domains, our method improves the recall and F1 of correction by 34% and 16%, respectively, while maintaining precision and the false positive rate. Users rated the system 0.8 to 1 point (out of 5) higher when our correction method worked properly, with no decrease due to false positives.
https://arxiv.org/abs/2501.06129
While recent multilingual automatic speech recognition models claim to support thousands of languages, ASR for low-resource languages remains highly unreliable due to limited bimodal speech and text training data. Better multilingual spoken language understanding (SLU) can massively strengthen the robustness of multilingual ASR by leveraging language semantics to compensate for scarce training data, such as disambiguating utterances via context or exploiting semantic similarities across languages. Even more so, SLU is indispensable for inclusive speech technology in roughly half of all living languages that lack a formal writing system. However, the evaluation of multilingual SLU remains limited to shallower tasks such as intent classification or language identification. To address this, we present Fleurs-SLU, a multilingual SLU benchmark that encompasses topical speech classification in 102 languages and multiple-choice question answering through listening comprehension in 92 languages. We extensively evaluate both end-to-end speech classification models and cascaded systems that combine speech-to-text transcription with subsequent classification by large language models on Fleurs-SLU. Our results show that cascaded systems exhibit greater robustness in multilingual SLU tasks, though speech encoders can achieve competitive performance in topical speech classification when appropriately pre-trained. We further find a strong correlation between robust multilingual ASR, effective speech-to-text translation, and strong multilingual SLU, highlighting the mutual benefits between acoustic and semantic speech representations.
Although recent multilingual automatic speech recognition (ASR) models claim to support thousands of languages, ASR for low-resource languages remains highly unreliable due to limited bimodal speech and text training data. Better multilingual spoken language understanding (SLU) could massively strengthen the robustness of multilingual ASR by leveraging language semantics to compensate for scarce training data, for instance by disambiguating utterances via context or exploiting semantic similarities across languages. Moreover, SLU is indispensable for inclusive speech technology in roughly half of all living languages, which lack a formal writing system. However, the evaluation of multilingual SLU remains limited to shallower tasks such as intent classification or language identification. To address this, we present Fleurs-SLU, a multilingual SLU benchmark covering topical speech classification in 102 languages and multiple-choice question answering via listening comprehension in 92 languages. On Fleurs-SLU, we extensively evaluate both end-to-end speech classification models and cascaded systems that combine speech-to-text transcription with subsequent classification by large language models. Our results show that cascaded systems are more robust on multilingual SLU tasks, although appropriately pre-trained speech encoders can achieve competitive performance on topical speech classification. We further find a strong correlation between robust multilingual ASR, effective speech-to-text translation, and strong multilingual SLU, highlighting the mutual benefits between acoustic and semantic speech representations.
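As a rough sketch of the cascaded setup (not the paper's exact models), one can chain an off-the-shelf speech-to-text pipeline with a zero-shot text classifier standing in for the LLM classifier used in the paper; the model names, topic labels, and file path below are placeholders, and the NLI classifier shown is English-only:

```python
from transformers import pipeline

# Cascade: speech-to-text, then topic classification on the transcript.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

topics = ["health", "politics", "science and technology", "sports", "travel"]

def classify_utterance(wav_path: str) -> str:
    transcript = asr(wav_path)["text"]                        # speech -> text
    result = classifier(transcript, candidate_labels=topics)  # text -> topic
    return result["labels"][0]

# print(classify_utterance("clip.wav"))  # path to a mono audio file
```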
https://arxiv.org/abs/2501.06117
Rotary Position Embedding (RoPE) encodes relative and absolute positional information in Transformer-based models through rotation matrices applied to input vectors within sequences. While RoPE has demonstrated superior performance compared to other positional embedding technologies in natural language processing tasks, its effectiveness in speech processing applications remains understudied. In this work, we conduct a comprehensive evaluation of RoPE across diverse automatic speech recognition (ASR) tasks. Our experimental results demonstrate that for ASR tasks, RoPE consistently achieves lower error rates compared to the currently widely used relative positional embedding. To facilitate further research, we release the implementation and all experimental recipes through the SpeechBrain toolkit.
Rotary Position Embedding (RoPE) encodes relative and absolute positional information in Transformer-based models by applying rotation matrices to the input vectors within a sequence. Although RoPE has shown better performance than other positional embedding techniques in natural language processing tasks, its effectiveness in speech processing applications remains understudied. In this work, we conduct a comprehensive evaluation of RoPE across diverse automatic speech recognition (ASR) tasks. Our experimental results show that, for ASR tasks, RoPE consistently achieves lower error rates than the currently widely used relative positional embedding. To facilitate further research, we release the implementation and all experimental recipes through the SpeechBrain toolkit.
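RoPE itself is well documented; a minimal NumPy sketch of applying the rotation to a sequence of query/key vectors, pairing adjacent dimensions and using base 10000 as in the original formulation:

```python
import numpy as np

def apply_rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply Rotary Position Embedding to x of shape (seq_len, dim), with dim even.
    Adjacent dimension pairs (2i, 2i+1) are rotated by position-dependent angles."""
    seq_len, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-np.arange(half) / half)        # one frequency per dimension pair
    angles = np.outer(np.arange(seq_len), inv_freq)     # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                     # even / odd dimensions
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                  # 2-D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Rotated queries/keys make the attention logits depend on relative positions.
rng = np.random.default_rng(0)
q, k = rng.standard_normal((8, 64)), rng.standard_normal((8, 64))
scores = apply_rope(q) @ apply_rope(k).T
```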
https://arxiv.org/abs/2501.06051
This paper introduces an all-neural text formatting (TF) model designed for commercial automatic speech recognition (ASR) systems, encompassing punctuation restoration (PR), truecasing, and inverse text normalization (ITN). Unlike traditional rule-based or hybrid approaches, this method leverages a two-stage neural architecture comprising a multi-objective token classifier and a sequence-to-sequence (seq2seq) model. This design minimizes computational costs and reduces hallucinations while ensuring flexibility and robustness across diverse linguistic entities and text domains. Developed as part of the Universal-2 ASR system, the proposed method demonstrates superior performance in TF accuracy, computational efficiency, and perceptual quality, as validated through comprehensive evaluations using both objective and subjective methods. This work underscores the importance of holistic TF models in enhancing ASR usability in practical settings.
This paper introduces an all-neural text formatting (TF) model designed for commercial automatic speech recognition (ASR) systems, covering punctuation restoration (PR), truecasing, and inverse text normalization (ITN). Unlike traditional rule-based or hybrid approaches, the method uses a two-stage neural architecture consisting of a multi-objective token classifier and a sequence-to-sequence (seq2seq) model. This design minimizes computational cost and reduces hallucinations while remaining flexible and robust across diverse linguistic entities and text domains. Developed as part of the Universal-2 ASR system, the proposed method demonstrates superior performance in TF accuracy, computational efficiency, and perceptual quality, as validated through comprehensive evaluations with both objective and subjective methods. This work underscores the importance of holistic TF models for improving ASR usability in practical settings.
https://arxiv.org/abs/2501.05948
Current time-synchronous sequence-to-sequence automatic speech recognition (ASR) models are trained by using a sequence-level cross-entropy that sums over all alignments. Due to the discriminative formulation, incorporating the right (future) label context into the training criterion's gradient causes normalization problems and is not mathematically well-defined. The classic hybrid neural network hidden Markov model (NN-HMM), with its inherent generative formulation, enables conditioning on the right label context. However, due to the HMM state-tying, the identity of the right label context is never modeled explicitly. In this work, we propose a factored loss with auxiliary left and right label contexts that sums over all alignments. We show that the inclusion of the right label context is particularly beneficial when training data resources are limited. Moreover, we also show that it is possible to build a factored hybrid HMM system by relying exclusively on the full-sum criterion. Experiments were conducted on Switchboard 300h and LibriSpeech 960h.
Current time-synchronous sequence-to-sequence automatic speech recognition (ASR) models are trained with a sequence-level cross-entropy that sums over all alignments. Due to the discriminative formulation, incorporating the right (future) label context into the gradient of the training criterion causes normalization problems and is not mathematically well-defined. The classic hybrid neural network hidden Markov model (NN-HMM), with its inherently generative formulation, makes it possible to condition on the right label context. However, because of HMM state-tying, the identity of the right label context is never modeled explicitly. In this work, we propose a factored loss with auxiliary left and right label contexts that sums over all alignments. We show that including the right label context is particularly beneficial when training data resources are limited. Moreover, we show that it is possible to build a factored hybrid HMM system relying exclusively on the full-sum criterion. Experiments were conducted on the Switchboard 300h and LibriSpeech 960h corpora.
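In schematic notation of my own (not the paper's exact criterion), a full-sum loss marginalizes over all alignments of the label sequence, and one plausible factored variant scores each frame with auxiliary left- and right-context factors:

```latex
% Full-sum criterion (schematic): marginalize over all alignments a of label sequence y to the T frames
\mathcal{L} \;=\; -\log \sum_{a \in \mathcal{A}(y)} \; \prod_{t=1}^{T} p\left(a_t \mid x_t\right)

% One plausible factored frame score with auxiliary left context l_t and right context r_t
p\left(a_t, l_t, r_t \mid x_t\right) \;=\; p\left(a_t \mid x_t\right)\, p\left(l_t \mid a_t, x_t\right)\, p\left(r_t \mid a_t, x_t\right)
```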
https://arxiv.org/abs/2501.04521