Self-supervised learning (SSL) based speech pre-training has attracted much attention for its ability to learn rich representations from massive unlabeled data. In contrast, the use of weakly-supervised data is less explored for speech pre-training. To fill this gap, we propose a weakly-supervised speech pre-training method based on speaker-aware speech data. It adopts a training procedure similar to the widely-used masked speech prediction SSL framework, while incorporating additional target-speaker enrollment information as an auxiliary input. In this way, the learned representation is steered towards the target speaker even in the presence of highly overlapping interference, enabling potential applications such as target speech recognition. Our experiments on the Libri2Mix and WSJ0-2mix datasets show that the proposed model achieves significantly better ASR performance than WavLM, the state-of-the-art SSL model with denoising capability.
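To make the idea concrete, below is a minimal sketch (not the authors' implementation) of how a target-speaker enrollment embedding can condition a masked speech prediction model: the enrollment vector is projected and added to every frame before the encoder, so the masked-frame targets are biased toward the enrolled speaker. All module choices, sizes, and the 256-dimensional speaker embedding are assumptions.

```python
import torch
import torch.nn as nn

class SpeakerAwareMaskedPredictor(nn.Module):
    def __init__(self, feat_dim=80, spk_dim=256, d_model=512, num_targets=504):
        super().__init__()
        self.frame_proj = nn.Linear(feat_dim, d_model)
        self.spk_proj = nn.Linear(spk_dim, d_model)     # enrollment embedding -> model dimension
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(d_model, num_targets)     # predicts pseudo-labels at masked frames

    def forward(self, mixture_feats, mask, spk_embedding):
        # mixture_feats: (batch, time, feat_dim); mask: (batch, time) bool; spk_embedding: (batch, spk_dim)
        x = self.frame_proj(mixture_feats)
        x = torch.where(mask.unsqueeze(-1), torch.zeros_like(x), x)  # blank out masked frames
        x = x + self.spk_proj(spk_embedding).unsqueeze(1)            # steer toward the enrolled speaker
        return self.head(self.encoder(x))                            # per-frame target logits
```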
https://arxiv.org/abs/2305.16286
Multi-talker overlapped speech poses a significant challenge for speech recognition and diarization. Recent research indicates that these two tasks are inter-dependent and complementary, motivating us to explore a unified modeling method that addresses them jointly in the context of overlapped speech. A recent study proposed a cost-effective method to convert a single-talker automatic speech recognition (ASR) system into a multi-talker one by inserting a Sidecar separator into a frozen, well-trained ASR model. Building on this, we incorporate a diarization branch into the Sidecar, allowing unified modeling of both ASR and diarization with a negligible overhead of only 768 additional parameters. The proposed method yields better ASR results than the baseline on the LibriMix and LibriSpeechMix datasets. Moreover, without sophisticated customization for the diarization task, our method achieves acceptable diarization results on the two-speaker subset of CALLHOME with only a few adaptation steps.
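For intuition only, here is a conceptual sketch of the general shape of such a setup, not the actual Sidecar or its 768-parameter diarization branch: a small trainable separator sits between frozen encoder blocks of a single-talker ASR model, and a lightweight head on the separated streams predicts per-speaker activity. The split point, module types, and sizes are all assumptions.

```python
import torch.nn as nn

class SidecarWithDiarization(nn.Module):
    def __init__(self, frozen_front, frozen_rest, d_model=256, num_speakers=2):
        super().__init__()
        self.front, self.rest = frozen_front, frozen_rest
        for p in list(self.front.parameters()) + list(self.rest.parameters()):
            p.requires_grad = False                          # the well-trained ASR stays frozen
        self.separator = nn.Conv1d(d_model, d_model * num_speakers, kernel_size=3, padding=1)
        self.diar_head = nn.Linear(d_model, 1)               # lightweight speech-activity head
        self.num_speakers = num_speakers

    def forward(self, feats):                                # feats: (batch, time, d_model)
        h = self.front(feats)                                # frozen lower encoder blocks
        b, t, d = h.shape
        streams = self.separator(h.transpose(1, 2)).transpose(1, 2)
        streams = streams.reshape(b, t, self.num_speakers, d)          # one stream per speaker
        diarization = self.diar_head(streams).squeeze(-1)               # (batch, time, speakers) logits
        asr_outputs = [self.rest(streams[:, :, s]) for s in range(self.num_speakers)]
        return asr_outputs, diarization
```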
https://arxiv.org/abs/2305.16263
Recent research shows a big convergence in model architecture, training objectives, and inference methods across various tasks for different modalities. In this paper, we propose VioLA, a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text, such as speech-to-text, text-to-text, text-to-speech, and speech-to-speech tasks, as a conditional codec language model task via a multi-task learning framework. To accomplish this, we first convert all speech utterances to discrete tokens (similar to textual data) using an offline neural codec encoder. In this way, all these tasks are converted to token-based sequence conversion problems, which can be naturally handled with one conditional language model. We further integrate task IDs (TID) and language IDs (LID) into the proposed model to enhance its capability of handling different languages and tasks. Experimental results demonstrate that the proposed VioLA model supports both single-modal and cross-modal tasks well, and the decoder-only model achieves comparable or even better performance than the strong baselines.
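As a rough illustration of the token-based view (the exact prompt layout and vocabularies below are assumptions, not VioLA's): every task becomes a single token sequence in which task and language IDs tell the decoder-only model which conversion to perform.

```python
# Rough illustration only: speech is represented by codec token ids from the offline
# neural codec encoder, text by subword ids, and task/language ids steer the model.
def build_sequence(task_id, src_lang_id, tgt_lang_id, source_tokens, target_tokens):
    # ASR: source_tokens = codec units, target_tokens = text tokens;
    # TTS swaps the two modalities; MT uses text tokens on both sides.
    return [task_id, src_lang_id] + source_tokens + [tgt_lang_id] + target_tokens
```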
https://arxiv.org/abs/2305.16107
In Speech Emotion Recognition (SER), textual data is often used alongside audio signals to address their inherent variability. However, the reliance on human-annotated text in most research hinders the development of practical SER systems. To overcome this challenge, we investigate how Automatic Speech Recognition (ASR) performs on emotional speech by analyzing ASR performance on emotion corpora and examining the distribution of word errors and confidence scores in ASR transcripts to gain insight into how emotion affects ASR. We utilize four ASR systems, namely Kaldi ASR, wav2vec, Conformer, and Whisper, and three corpora (IEMOCAP, MOSI, and MELD) to ensure generalizability. Additionally, we conduct text-based SER on ASR transcripts with increasing word error rates to investigate how ASR affects SER. The objective of this study is to uncover the relationship and mutual impact of ASR and SER, in order to facilitate ASR adaptation to emotional speech and the use of SER in the real world.
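As a small example of the word-error analysis involved, word error rate can be computed with the jiwer package; the sentences below are made up for illustration, not drawn from the corpora studied.

```python
import jiwer

reference = "i am so happy to see you"
hypothesis = "i am so happy to sea you"      # one substitution out of seven words
print(jiwer.wer(reference, hypothesis))      # ~0.143, i.e. a 1/7 word error rate
```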
https://arxiv.org/abs/2305.16065
India is the second largest English-speaking country in the world, with a speaker base of roughly 130 million. Thus, it is imperative that automatic speech recognition (ASR) systems for English be evaluated on Indian accents. Unfortunately, Indian speakers are very poorly represented in existing English ASR benchmarks such as LibriSpeech, Switchboard, and the Speech Accent Archive. In this work, we address this gap by creating Svarah, a benchmark that contains 9.6 hours of transcribed English audio from 117 speakers across 65 geographic locations throughout India, resulting in a diverse range of accents. Svarah comprises both read speech and spontaneous conversational data, covering various domains such as history, culture, and tourism, ensuring a diverse vocabulary. We evaluate 6 open-source ASR models and 2 commercial ASR systems on Svarah and show that there is clear scope for improvement on Indian accents. Svarah as well as all our code will be publicly available.
https://arxiv.org/abs/2305.15760
End-to-end models with large capacity have significantly improved multilingual automatic speech recognition, but their computation cost poses challenges for on-device applications. We propose a streaming, truly multilingual Conformer incorporating mixture-of-expert (MoE) layers that learn to activate only a subset of parameters in training and inference. Each MoE layer consists of a softmax gate which chooses the best two experts among many during forward propagation. The proposed MoE layer offers efficient inference by activating a fixed number of parameters as the number of experts increases. We evaluate the proposed model on a set of 12 languages and achieve an average 11.9% relative improvement in WER over the baseline. Compared to an adapter model using ground-truth information, our MoE model achieves similar WER and activates a similar number of parameters, but without any language information. We further show around 3% relative WER improvement from multilingual shallow fusion.
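To illustrate the routing mechanism in isolation, here is a simplified top-2 softmax-gated MoE feed-forward layer; this is a sketch with assumed sizes, not the paper's streaming Conformer code, and the dense per-expert loop is written for clarity rather than efficiency.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopTwoMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                          # x: (batch, time, d_model)
        scores = F.softmax(self.gate(x), dim=-1)   # softmax gate over experts
        top_w, top_i = scores.topk(2, dim=-1)      # keep only the best two experts per frame
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(2):                         # dense loop for clarity, not efficiency
            for e, expert in enumerate(self.experts):
                chosen = (top_i[..., k] == e).unsqueeze(-1)          # frames routed to expert e
                out = out + chosen * top_w[..., k:k + 1] * expert(x)
        return out
```

Because only two experts contribute per frame regardless of the total expert count, the activated parameter budget stays roughly constant as experts are added.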
https://arxiv.org/abs/2305.15663
We propose an unsupervised speech-to-speech translation (S2ST) system that does not rely on parallel data between the source and target languages. Our approach maps source and target language speech signals into automatically discovered, discrete units and reformulates the problem as unsupervised unit-to-unit machine translation. We develop a three-step training procedure that involves (a) pre-training a unit-based encoder-decoder language model with a denoising objective, (b) training it with word-by-word translated utterance pairs created by aligning monolingual text embedding spaces, and (c) running unsupervised backtranslation bootstrapping off the initial translation model. Our approach avoids mapping the speech signal into text and uses speech-to-unit and unit-to-speech models instead of automatic speech recognition and text-to-speech models. We evaluate our model on synthetic-speaker Europarl-ST English-German and German-English evaluation sets, finding that unit-based translation is feasible under this constrained scenario, achieving 9.29 ASR-BLEU for German to English and 8.07 for English to German.
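The inference pipeline can be pictured as three stages; the sketch below shows only the data flow, with all three components standing in as placeholder callables rather than the paper's models.

```python
def translate_speech(source_wav, speech_to_unit, unit_translator, unit_to_speech):
    src_units = speech_to_unit(source_wav)       # discretize source speech into unit ids
    tgt_units = unit_translator(src_units)       # unsupervised unit-to-unit translation model
    return unit_to_speech(tgt_units)             # synthesize target-language speech from units
```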
https://arxiv.org/abs/2305.15405
Many existing works on voice conversion (VC) use automatic speech recognition (ASR) models to ensure linguistic consistency between source and converted samples. However, for low-data resource domains, training a high-quality ASR model remains a challenging task. In this work, we propose a novel iterative way of improving both the ASR and VC models. We first train an ASR model, which is used to ensure content preservation while training a VC model. In the next iteration, the VC model is used as a data augmentation method to further fine-tune the ASR model and generalize it to diverse speakers. By iteratively leveraging the improved ASR model to train the VC model and vice versa, we experimentally show improvement in both models. Our proposed framework outperforms the ASR and one-shot VC baseline models on English singing and Hindi speech domains in subjective and objective evaluations in low-data resource settings.
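Schematically, the alternation could look like the loop below; the functions are placeholders, not the authors' training code, and the number of rounds is an assumption.

```python
def iterative_asr_vc(train_asr, train_vc, augment_with_vc, labeled_data, rounds=2):
    asr = train_asr(labeled_data)                              # initial ASR for content preservation
    vc = None
    for _ in range(rounds):
        vc = train_vc(content_model=asr)                       # ASR enforces linguistic consistency
        asr = train_asr(labeled_data + augment_with_vc(vc))    # VC outputs augment ASR training data
    return asr, vc
```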
https://arxiv.org/abs/2305.15055
Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech with the help of additional visual information such as lip videos, and has been shown to be more effective than audio-only speech enhancement. This paper proposes further incorporating ultrasound tongue images to improve the performance of lip-based AV-SE systems. Knowledge distillation is employed at the training stage to address the challenge of acquiring ultrasound tongue images during inference, enabling an audio-lip speech enhancement student model to learn from a pre-trained audio-lip-tongue speech enhancement teacher model. Experimental results demonstrate significant improvements in the quality and intelligibility of the speech enhanced by the proposed method compared to traditional audio-lip speech enhancement baselines. Further analysis using phone error rates (PER) from automatic speech recognition (ASR) shows that palatal and velar consonants benefit most from the introduction of ultrasound tongue images.
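The training-time distillation can be summarized by a combined loss of the following general form; the L1 losses and the 0.5 weighting are assumptions for illustration, and the paper's exact objective may differ.

```python
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, clean_speech, alpha=0.5):
    supervised = F.l1_loss(student_out, clean_speech)        # ordinary enhancement target
    distill = F.l1_loss(student_out, teacher_out.detach())   # match the frozen tongue-aware teacher
    return supervised + alpha * distill                      # alpha is an assumed weight
```

At inference only the audio-lip student runs, so no ultrasound tongue imaging is required.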
https://arxiv.org/abs/2305.14933
Automatic speech recognition (ASR) systems play a key role in applications involving human-machine interaction. Despite their importance, ASR models for the Portuguese language proposed in the last decade have limitations regarding the correct identification of punctuation marks in automatic transcriptions, which hinder the use of transcriptions by other systems, models, and even by humans. Recently, however, OpenAI proposed Whisper ASR, a general-purpose speech recognition model that has generated great expectations for dealing with such limitations. This chapter presents the first study of Whisper's performance for punctuation prediction in the Portuguese language. We present an experimental evaluation considering both theoretical aspects involving pausing points (comma) and complete ideas (exclamation, question, and full stop), as well as practical aspects involving transcript-based topic modeling, an application whose performance depends on punctuation marks. We analyze experimental results from videos of the Museum of the Person, a virtual museum that aims to tell and preserve people's life histories, and discuss the pros and cons of Whisper in a real-world scenario. Although our experiments indicate that Whisper achieves state-of-the-art results, we conclude that some punctuation marks require improvement, such as the exclamation mark, semicolon, and colon.
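For reference, obtaining a punctuated Portuguese transcript with the open-source whisper package looks roughly like this; the model size and file name are placeholders, not the chapter's actual setup.

```python
import whisper

model = whisper.load_model("medium")
result = model.transcribe("interview.wav", language="pt")
print(result["text"])    # transcript including Whisper's predicted punctuation
```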
https://arxiv.org/abs/2305.14580
Large self-supervised pre-trained speech models have achieved remarkable success across various speech-processing tasks. The self-supervised training of these models leads to universal speech representations that can be used for different downstream tasks, ranging from automatic speech recognition (ASR) to speaker identification. Recently, Whisper, a transformer-based model, was proposed and trained on a large amount of weakly supervised data for ASR; it outperformed several state-of-the-art self-supervised models. Given the superiority of Whisper for ASR, in this paper we explore the transferability of its representations to four other speech tasks in the SUPERB benchmark. Moreover, we explore the robustness of the Whisper representation for "in the wild" tasks where speech is corrupted by environmental noise and room reverberation. Experimental results show that Whisper achieves promising results across tasks and environmental conditions, showing potential for cross-task real-world deployment.
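A representation-probing setup of the kind described might extract Whisper encoder features as follows, using the open-source whisper package; the model size, file path, and downstream probe are assumptions.

```python
import whisper

model = whisper.load_model("base")
audio = whisper.pad_or_trim(whisper.load_audio("utterance.wav"))     # 30 s window expected by Whisper
mel = whisper.log_mel_spectrogram(audio).to(model.device).unsqueeze(0)
features = model.encoder(mel)      # (1, frames, d_model), fed to a downstream task head
print(features.shape)
```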
https://arxiv.org/abs/2305.14546
Conversational AI systems (e.g. Alexa, Siri, Google Assistant, etc.) need to understand queries with defects to ensure robust conversational understanding and reduce user friction. Defective queries are often induced by user ambiguities and mistakes, or by errors in automatic speech recognition (ASR) and natural language understanding (NLU). Personalized query rewriting (personalized QR) targets reducing defects in the torso and tail of user query traffic, and it typically relies on an index of past successful user interactions with the conversational AI. This paper presents our "Collaborative Query Rewriting" approach, which focuses on rewriting novel user interactions unseen in the user's history. The approach builds a user Feedback Interaction Graph (FIG) of historical user-entity interactions and leverages multi-hop customer affinity to enrich each user's index (the Collaborative User Index) so that it covers future unseen defective queries. To counteract the precision degradation from the enlarged index, we introduce additional transformer layers into the L1 retrieval model and add multi-hop affinity and guardrail features to the L2 re-ranking model. Given the production constraints of storage cost and runtime retrieval latency, managing the size of the Collaborative User Index is important. As the user index can be pre-computed, we explore using a Large Language Model (LLM) for multi-hop customer affinity retrieval in the Video/Music domains; in particular, we study the Dolly-V2 7B model. Given a limited user index size, we found that the user index derived from fine-tuned Dolly-V2 generation significantly enhanced coverage of unseen user interactions, and consequently boosted QR performance on unseen interactions compared to the graph-traversal-based user index.
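The multi-hop enrichment can be pictured as the graph walk below. This is a hedged sketch only: the real FIG construction, affinity scoring, and guardrail features are not shown, and the data structures are assumptions.

```python
from collections import Counter

def two_hop_affinity(user, user_to_entities, entity_to_users, top_k=20):
    own = set(user_to_entities[user])
    scores = Counter()
    for entity in own:                               # hop 1: the user's own interacted entities
        for neighbor in entity_to_users[entity]:     # hop 2: other users who share that entity
            if neighbor == user:
                continue
            for candidate in user_to_entities[neighbor]:
                if candidate not in own:             # new entities to add to the user's index
                    scores[candidate] += 1
    return [entity for entity, _ in scores.most_common(top_k)]
```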
https://arxiv.org/abs/2305.14449
Attention-based encoder-decoder (AED) models have shown impressive performance in ASR. However, most existing AED methods neglect to simultaneously leverage both acoustic and semantic features in the decoder, which is crucial for generating more accurate and informative semantic states. In this paper, we propose an Acoustic and Semantic Cooperative Decoder (ASCD) for ASR. In particular, unlike vanilla decoders that process acoustic and semantic features in two separate stages, ASCD integrates them cooperatively. To prevent information leakage during training, we design a Causal Multimodal Mask. Moreover, a variant, Semi-ASCD, is proposed to balance accuracy and computational cost. Our proposal is evaluated on the publicly available AISHELL-1 and aidatatang_200zh datasets using Transformer, Conformer, and Branchformer as encoders, respectively. The experimental results show that ASCD significantly improves performance by leveraging the acoustic and semantic information cooperatively.
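For intuition, the snippet below builds an ordinary causal (lower-triangular) attention mask of the kind used to stop a decoding step from peeking at future positions; the paper's Causal Multimodal Mask extends this idea to joint acoustic-semantic features, and its exact layout is not reproduced here.

```python
import torch

def causal_mask(seq_len):
    # True = the query position may attend to that key position; future positions are blocked
    return torch.ones(seq_len, seq_len).tril().bool()

print(causal_mask(4))
```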
https://arxiv.org/abs/2305.14049
We propose SE-Bridge, a novel method for speech enhancement (SE). Following the recent application of diffusion models to speech enhancement, enhancement can be achieved by solving a stochastic differential equation (SDE). Each SDE corresponds to a probability flow ordinary differential equation (PF-ODE), and the trajectory of the PF-ODE solution consists of the speech states at different moments. Our approach is based on a consistency model that ensures any speech states on the same PF-ODE trajectory correspond to the same initial state. By integrating the Brownian bridge process, the model is able to generate high-intelligibility speech samples without adversarial training. This is the first attempt to apply consistency models to the SE task, achieving state-of-the-art results on several metrics while requiring 15x less sampling time than the diffusion-based baseline. Our experiments on multiple datasets demonstrate the effectiveness of SE-Bridge for SE. Furthermore, we show through extensive experiments on downstream tasks, including Automatic Speech Recognition (ASR) and Speaker Verification (SV), that SE-Bridge can effectively support multiple downstream tasks.
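The property being exploited is the standard consistency-model constraint, restated below in generic notation (not necessarily SE-Bridge's own symbols): every state along one PF-ODE trajectory maps back to that trajectory's initial, clean-speech state.

```latex
% Self-consistency along a single PF-ODE trajectory:
f_\theta(x_t, t) = f_\theta(x_{t'}, t') = x_\epsilon
  \quad \text{for all } t, t' \in [\epsilon, T] \text{ on the same trajectory},
\qquad \text{with boundary condition } f_\theta(x_\epsilon, \epsilon) = x_\epsilon .
```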
https://arxiv.org/abs/2305.13796
Streaming Automatic Speech Recognition (ASR) in voice assistants can utilize prefetching to partially hide the latency of response generation. Prefetching involves passing a preliminary ASR hypothesis to downstream systems in order to prefetch and cache a response. If the final ASR hypothesis after endpoint detection matches the preliminary one, the cached response can be delivered to the user, thus saving latency. In this paper, we extend this idea by introducing predictive automatic speech recognition, where we predict the full utterance from a partially observed utterance, and prefetch the response based on the predicted utterance. We introduce two personalization approaches and investigate the tradeoff between potential latency gains from successful predictions and the cost increase from failed predictions. We evaluate our methods on an internal voice assistant dataset as well as the public SLURP dataset.
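The control flow can be sketched as follows; `predict_full_utterance` and `generate_response` are placeholder functions standing in for the utterance-completion model and the downstream NLU/response pipeline, not the production system.

```python
def streaming_prefetch(partial_hypotheses, final_hypothesis,
                       predict_full_utterance, generate_response):
    cache = {}
    for partial in partial_hypotheses:            # emitted while the user is still speaking
        predicted = predict_full_utterance(partial)
        if predicted not in cache:                # prefetch and cache the downstream response
            cache[predicted] = generate_response(predicted)
    if final_hypothesis in cache:                 # prediction succeeded: serve with low latency
        return cache[final_hypothesis], True
    return generate_response(final_hypothesis), False   # fall back to the normal path
```

The trade-off studied in the paper is visible here: a cache hit hides the response-generation latency, while every miss costs an extra, wasted `generate_response` call.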
https://arxiv.org/abs/2305.13794
The recently proposed serialized output training (SOT) simplifies multi-talker automatic speech recognition (ASR) by generating speaker transcriptions separated by a special token. However, frequent speaker changes can make speaker change prediction difficult. To address this, we propose boundary-aware serialized output training (BA-SOT), which explicitly incorporates boundary knowledge into the decoder via a speaker change detection task and boundary constraint loss. We also introduce a two-stage connectionist temporal classification (CTC) strategy that incorporates token-level SOT CTC to restore temporal context information. Besides typical character error rate (CER), we introduce utterance-dependent character error rate (UD-CER) to further measure the precision of speaker change prediction. Compared to original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%, and leveraging a pre-trained ASR model for BA-SOT model initialization further reduces CER/UD-CER by 8.4%/19.9%.
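For background, serialized output training flattens overlapping transcripts into one target sequence with a speaker-change token; the toy function below illustrates that serialization only. The token name and the start-time ordering rule are assumptions, and BA-SOT's boundary constraint loss and two-stage CTC are not shown.

```python
def build_sot_target(segments, sc_token="<sc>"):
    """segments: list of (start_time, speaker_id, text) for possibly overlapping utterances."""
    ordered = sorted(segments, key=lambda s: s[0])      # serialize by start time
    pieces, prev_spk = [], None
    for _, spk, text in ordered:
        if prev_spk is not None and spk != prev_spk:
            pieces.append(sc_token)                     # boundary the model must learn to predict
        pieces.append(text)
        prev_spk = spk
    return " ".join(pieces)

print(build_sot_target([(0.0, "A", "hello there"), (1.2, "B", "hi"), (2.5, "A", "how are you")]))
# -> "hello there <sc> hi <sc> how are you"
```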
https://arxiv.org/abs/2305.13716
Voice technology has become ubiquitous recently. However, the accuracy, and hence the user experience, varies significantly across languages, which makes the technology not equally inclusive. The availability of data for different languages is one of the key factors affecting accuracy, especially in training all-neural end-to-end automatic speech recognition systems. Cross-lingual knowledge transfer and iterative pseudo-labeling are two techniques that have been shown to be successful for improving the accuracy of ASR systems, in particular for low-resource languages like Ukrainian. Our goal is to train an all-neural Transducer-based ASR system to replace a DNN-HMM hybrid system, using no manually annotated training data. We show that a Transducer system trained on transcripts produced by the hybrid system achieves an 18% reduction in word error rate. However, by combining cross-lingual knowledge transfer from related languages with iterative pseudo-labeling, we are able to achieve a 35% reduction in the error rate.
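The pseudo-labeling half of the recipe follows the usual loop sketched below; the functions are placeholders, and the confidence filtering and number of rounds are assumptions.

```python
def iterative_pseudo_labeling(seed_model, unlabeled_audio, train, transcribe, filter_confident, rounds=3):
    model = seed_model                               # e.g. seeded via cross-lingual transfer
    for _ in range(rounds):
        pseudo = [(audio, transcribe(model, audio)) for audio in unlabeled_audio]
        model = train(filter_confident(pseudo))      # retrain on the confident pseudo-labels
    return model
```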
https://arxiv.org/abs/2305.13652
Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages, a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task. The main ingredients are a new dataset based on readings of publicly available religious texts and the effective use of self-supervised learning. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages. Experiments show that our multilingual speech recognition model more than halves the word error rate of Whisper on 54 languages of the FLEURS benchmark while being trained on a small fraction of the labeled data.
https://arxiv.org/abs/2305.13516
Speech data from different domains has distinct acoustic and linguistic characteristics. It is common to train a single multidomain model, such as a Conformer transducer for speech recognition, on a mixture of data from all domains. However, changing data in one domain or adding a new domain would require the multidomain model to be retrained. To this end, we propose a framework called modular domain adaptation (MDA) that enables a single model to process multidomain data while keeping all parameters domain-specific, i.e., each parameter is only trained by data from one domain. On a streaming Conformer transducer trained only on video caption data, experimental results show that an MDA-based model can reach performance similar to the multidomain model on other domains, such as voice search and dictation, by adding per-domain adapters and per-domain feed-forward networks in the Conformer encoder.
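A per-domain adapter of the kind added to each encoder block can be as small as the residual bottleneck below; this is a minimal sketch with assumed domain names and sizes, not the paper's exact module.

```python
import torch.nn as nn

class PerDomainAdapter(nn.Module):
    def __init__(self, d_model=512, bottleneck=64,
                 domains=("video_caption", "voice_search", "dictation")):
        super().__init__()
        self.adapters = nn.ModuleDict({
            d: nn.Sequential(nn.LayerNorm(d_model),
                             nn.Linear(d_model, bottleneck),
                             nn.ReLU(),
                             nn.Linear(bottleneck, d_model))
            for d in domains
        })

    def forward(self, x, domain):
        return x + self.adapters[domain](x)   # residual adapter; only this domain's weights see its data
```

Adding a domain then amounts to registering one more adapter and training it on that domain's data alone, leaving all other parameters untouched.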
https://arxiv.org/abs/2305.13408
Automatic speech recognition (ASR) systems based on deep learning are mainly trained under empirical risk minimization (ERM). Since ERM optimizes the average performance over all data samples regardless of group membership, such as healthy or dysarthric speakers, ASR systems are unaware of performance disparities across groups. This results in biased ASR systems with severe performance differences among groups. In this study, we aim to improve the group robustness of the ASR system for dysarthric speakers. To achieve this goal, we present a novel approach, sample reweighting with a sample affinity test (Re-SAT). Re-SAT systematically measures the debiasing helpfulness of each data sample and then mitigates the bias by reweighting samples according to this helpfulness. Experimental results demonstrate that Re-SAT improves ASR performance on dysarthric speech without degrading performance on healthy speech.
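Conceptually, the reweighting step amounts to something like the snippet below, where `helpfulness` stands in for the score produced by the sample affinity test (which is not reproduced here) and the softmax weighting is an assumption for illustration.

```python
import torch

def reweighted_loss(per_sample_loss, helpfulness, temperature=1.0):
    # per_sample_loss and helpfulness are 1-D tensors over a batch of samples
    weights = torch.softmax(helpfulness / temperature, dim=0)   # emphasize debiasing-helpful samples
    return (weights * per_sample_loss).sum()
```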
https://arxiv.org/abs/2305.13108