Conformers have recently been proposed as a promising modelling approach for automatic speech recognition (ASR), outperforming recurrent neural network-based approaches and transformers. Nevertheless, in general, the performance of these end-to-end models, especially attention-based models, is particularly degraded in the case of long utterances. To address this limitation, we propose adding a fully-differentiable memory-augmented neural network between the encoder and decoder of a conformer. This external memory can enrich the generalization for longer utterances since it allows the system to store and retrieve more information recurrently. Notably, we explore the neural Turing machine (NTM) that results in our proposed Conformer-NTM model architecture for ASR. Experimental results using Librispeech train-clean-100 and train-960 sets show that the proposed system outperforms the baseline conformer without memory for long utterances.
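As a rough illustration of the external-memory mechanism described above, the following numpy sketch implements generic NTM-style content-based addressing with read and erase/add write operations. All dimensions are arbitrary placeholders; this shows only the standard memory primitive the Conformer-NTM builds on, not the authors' implementation.

```python
# Toy NTM-style memory: content-based addressing, read, and erase/add write.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def address(memory, key, beta=5.0):
    """Content-based addressing: cosine similarity sharpened by beta."""
    sim = memory @ key / (np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    return softmax(beta * sim)                      # attention weights over memory slots

def read(memory, weights):
    return weights @ memory                         # weighted sum of slot contents

def write(memory, weights, erase, add):
    """NTM write: erase then add, both gated by the addressing weights."""
    memory = memory * (1 - np.outer(weights, erase))
    return memory + np.outer(weights, add)

rng = np.random.default_rng(0)
M = rng.normal(size=(128, 64))                      # 128 slots, 64-dim content
h = rng.normal(size=64)                             # encoder state used as the key
w = address(M, h)
M = write(M, w, erase=np.ones(64) * 0.5, add=h)     # store information
r = read(M, address(M, h))                          # retrieve it for the decoder
print(r.shape)                                      # (64,)
```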
https://arxiv.org/abs/2309.13029
Neural network pruning offers an effective method for compressing a multilingual automatic speech recognition (ASR) model with minimal performance loss. However, it entails several rounds of pruning and re-training for each language. In this work, we propose the use of an adaptive masking approach in two scenarios for pruning a multilingual ASR model efficiently, resulting either in sparse monolingual models or in a sparse multilingual model (named Dynamic ASR Pathways). Our approach dynamically adapts the sub-network, avoiding premature decisions about a fixed sub-network structure. We show that our approach outperforms existing pruning methods when targeting sparse monolingual models. Further, we illustrate that Dynamic ASR Pathways jointly discovers and trains better sub-networks (pathways) of a single multilingual model by adapting from different sub-network initializations, thereby reducing the need for language-specific pruning.
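To make the adaptive-masking idea concrete, here is a minimal numpy sketch in which a magnitude-based sparsity mask is re-derived at every adaptation step rather than frozen after an initial pruning round. The layer shape, sparsity level, and fake gradient updates are placeholders, not the paper's training recipe.

```python
# Toy adaptive magnitude masking: the mask is recomputed as the weights change.
import numpy as np

def adapt_mask(weights, sparsity):
    """Keep the largest-magnitude entries; mask the rest to zero."""
    k = int(weights.size * (1 - sparsity))
    thresh = np.sort(np.abs(weights), axis=None)[-k]
    return (np.abs(weights) >= thresh).astype(weights.dtype)

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))              # stand-in for one ASR layer
for step in range(3):                        # pretend training steps
    W += 0.01 * rng.normal(size=W.shape)     # stand-in for a gradient update
    mask = adapt_mask(W, sparsity=0.7)       # mask re-evaluated, not fixed
    W = W * mask                             # the active sub-network can change
print(f"final sparsity: {(W == 0).mean():.2f}")   # ~0.70
```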
https://arxiv.org/abs/2309.13018
In this work, we investigate two popular end-to-end automatic speech recognition (ASR) models, namely Connectionist Temporal Classification (CTC) and RNN-Transducer (RNN-T), for offline recognition of voice search queries, with up to 2B model parameters. The encoders of our models use the neural architecture of Google's universal speech model (USM), with additional funnel pooling layers to significantly reduce the frame rate and speed up training and inference. We perform extensive studies on vocabulary size, time reduction strategy, and its generalization performance on long-form test sets. Despite the speculation that, as the model size increases, CTC can be as good as RNN-T, which builds label dependency into the prediction, we observe that a 900M RNN-T clearly outperforms a 1.8B CTC and is more tolerant to severe time reduction, although the WER gap can be largely removed by LM shallow fusion.
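The frame-rate reduction performed by funnel pooling can be illustrated with a toy numpy example that averages consecutive encoder frames along time; the stride and feature size below are arbitrary stand-ins for the actual encoder configuration.

```python
# Toy time reduction: average-pool frames so downstream layers see fewer of them.
import numpy as np

def funnel_pool(frames, stride=2):
    """Average consecutive frames along time, reducing the frame rate by `stride`."""
    T, D = frames.shape
    T = (T // stride) * stride                 # drop any ragged tail
    return frames[:T].reshape(T // stride, stride, D).mean(axis=1)

x = np.random.randn(1000, 512)                 # 1000 frames of 512-d features
y = funnel_pool(funnel_pool(x))                # two 2x reductions -> 4x fewer frames
print(x.shape, "->", y.shape)                  # (1000, 512) -> (250, 512)
```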
https://arxiv.org/abs/2309.12963
Keyword spotting (KWS) refers to the task of identifying a set of predefined words in audio streams. With the advances seen recently with deep neural networks, it has become a popular technology to activate and control small devices, such as voice assistants. Relying on such models for edge devices, however, can be challenging due to hardware constraints. Moreover, as adversarial attacks have increased against voice-based technologies, developing solutions robust to such attacks has become crucial. In this work, we propose VIC-KD, a robust distillation recipe for model compression and adversarial robustness. Using self-supervised speech representations, we show that imposing geometric priors to the latent representations of both Teacher and Student models leads to more robust target models. Experiments on the Google Speech Commands datasets show that the proposed methodology improves upon current state-of-the-art robust distillation methods, such as ARD and RSLAD, by 12% and 8% in robust accuracy, respectively.
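The abstract does not spell out the geometric priors; one plausible reading, assumed here purely for illustration, is a VICReg-style variance/invariance/covariance regularizer applied to teacher and student embeddings. The numpy sketch below computes those three terms on random embeddings and should not be taken as VIC-KD's actual loss.

```python
# Assumed VICReg-style terms on latent representations (illustrative only).
import numpy as np

def vic_terms(student, teacher, eps=1e-4):
    invariance = np.mean((student - teacher) ** 2)             # match the teacher
    std = np.sqrt(student.var(axis=0) + eps)
    variance = np.mean(np.maximum(0.0, 1.0 - std))             # keep dimensions spread out
    centered = student - student.mean(axis=0)
    cov = (centered.T @ centered) / (len(student) - 1)
    off_diag = cov - np.diag(np.diag(cov))
    covariance = (off_diag ** 2).sum() / student.shape[1]      # decorrelate dimensions
    return invariance, variance, covariance

rng = np.random.default_rng(0)
s, t = rng.normal(size=(32, 128)), rng.normal(size=(32, 128))  # batch of embeddings
print(vic_terms(s, t))
```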
https://arxiv.org/abs/2309.12914
Affect recognition, encompassing emotions, moods, and feelings, plays a pivotal role in human communication. In the realm of conversational artificial intelligence (AI), the ability to discern and respond to human affective cues is a critical factor for creating engaging and empathetic interactions. This study delves into the capacity of large language models (LLMs) to recognise human affect in conversations, with a focus on both open-domain chit-chat dialogues and task-oriented dialogues. Leveraging three diverse datasets, namely IEMOCAP, EmoWOZ, and DAIC-WOZ, covering a spectrum of dialogues from casual conversations to clinical interviews, we evaluated and compared LLMs' performance in affect recognition. Our investigation explores the zero-shot and few-shot capabilities of LLMs through in-context learning (ICL) as well as their model capacities through task-specific fine-tuning. Additionally, this study takes into account the potential impact of automatic speech recognition (ASR) errors on LLM predictions. With this work, we aim to shed light on the extent to which LLMs can replicate human-like affect recognition capabilities in conversations.
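A minimal sketch of the in-context learning (ICL) setup: a handful of labelled (utterance, emotion) demonstrations are prepended before the test utterance. The label set and wording below are hypothetical, not the prompts used in the study.

```python
# Building a few-shot affect-recognition prompt for an LLM (illustrative labels/wording).
FEW_SHOT = [
    ("I can't believe you did that again.", "anger"),
    ("That's wonderful news, congratulations!", "happiness"),
]
LABELS = ["anger", "happiness", "sadness", "neutral"]

def build_prompt(utterance, examples=FEW_SHOT):
    lines = [f"Classify the speaker's emotion as one of: {', '.join(LABELS)}."]
    for text, label in examples:                 # k-shot demonstrations
        lines.append(f'Utterance: "{text}"\nEmotion: {label}')
    lines.append(f'Utterance: "{utterance}"\nEmotion:')
    return "\n\n".join(lines)

print(build_prompt("I just don't see the point anymore."))
```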
https://arxiv.org/abs/2309.12881
To train transcription models that produce robust results, a large and diverse labeled dataset is required. Finding such data with the necessary characteristics is a challenging task, especially for languages less popular than English. Moreover, producing such data requires significant effort and often money. Therefore, a strategy to mitigate this problem is the use of data augmentation techniques. In this work, we propose a data augmentation framework based on deepfake audio. To validate the proposed framework, experiments were conducted using existing deepfake and transcription models. A voice cloner and a dataset produced by Indian speakers (in English) were selected, ensuring the presence of a single accent in the dataset. Subsequently, the augmented data was used to train speech-to-text models in various scenarios.
https://arxiv.org/abs/2309.12802
This paper introduces an improved duration informed attention neural network (DurIAN-E) for expressive and high-fidelity text-to-speech (TTS) synthesis. Inherited from the original DurIAN model, an auto-regressive model structure is adopted in which the alignments between the input linguistic information and the output acoustic features are inferred from a duration model. Meanwhile, the proposed DurIAN-E utilizes multiple stacked SwishRNN-based Transformer blocks as linguistic encoders. Style-Adaptive Instance Normalization (SAIN) layers are incorporated into the frame-level encoders to improve the modeling of expressiveness. A denoiser combining a denoising diffusion probabilistic model (DDPM) for mel-spectrograms with SAIN modules is employed to further improve the synthetic speech quality and expressiveness. Experimental results show that the proposed expressive TTS model achieves better performance than state-of-the-art approaches in both subjective mean opinion score (MOS) and preference tests.
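The style-adaptive instance normalization idea can be sketched as ordinary per-channel instance normalization whose scale and shift are predicted from a style embedding. The linear style predictor below is a random stand-in, so only the normalization pattern, not DurIAN-E's actual layer, is shown.

```python
# Sketch of style-adaptive instance normalization over frame-level features.
import numpy as np

def style_adaptive_in(x, style, w_scale, w_shift, eps=1e-5):
    """x: (T, C) frame-level features; style: (S,) utterance-level style embedding."""
    mean = x.mean(axis=0, keepdims=True)
    std = x.std(axis=0, keepdims=True)
    normed = (x - mean) / (std + eps)            # instance normalization over time
    gamma = style @ w_scale                      # (C,) style-dependent scale
    beta = style @ w_shift                       # (C,) style-dependent shift
    return normed * (1.0 + gamma) + beta

rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 80))              # 200 frames, 80 channels
style = rng.normal(size=16)                      # style vector (random stand-in)
out = style_adaptive_in(frames, style,
                        rng.normal(size=(16, 80)) * 0.1,
                        rng.normal(size=(16, 80)) * 0.1)
print(out.shape)                                 # (200, 80)
```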
https://arxiv.org/abs/2309.12792
This research introduces an enhanced version of the multi-objective speech assessment model, called MOSA-Net+, by leveraging the acoustic features from large pre-trained weakly supervised models, namely Whisper, to create embedding features. The first part of this study investigates how the embedding features of Whisper and of two self-supervised learning (SSL) models correlate with subjective quality and intelligibility scores. The second part evaluates the effectiveness of Whisper in deploying a more robust speech assessment model. Third, the possibility of combining representations from Whisper and SSL models while deploying MOSA-Net+ is analyzed. The experimental results reveal that Whisper's embedding features correlate more strongly with subjective quality and intelligibility than the SSL models' embedding features, contributing to the more accurate predictions achieved by MOSA-Net+. Moreover, combining the embedding features from Whisper and the SSL models leads to only marginal improvement. Compared to MOSA-Net and other SSL-based speech assessment models, MOSA-Net+ yields notable improvements in estimating subjective quality and intelligibility scores across all evaluation metrics. We further tested MOSA-Net+ on Track 3 of the VoiceMOS Challenge 2023 and obtained the top-ranked performance.
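A minimal sketch of the correlation analysis: pool frame-level embeddings into utterance-level vectors and measure linear (Pearson) and rank (Spearman) correlation against subjective scores. The embeddings and MOS values below are random placeholders standing in for Whisper/SSL features and listener ratings.

```python
# Correlating pooled embedding features with subjective scores (placeholder data).
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
frame_embeddings = [rng.normal(size=(rng.integers(100, 300), 512)) for _ in range(50)]
mos = rng.uniform(1, 5, size=50)                            # subjective quality scores

X = np.stack([e.mean(axis=0) for e in frame_embeddings])    # utterance-level pooling
per_dim_r = np.array([pearsonr(X[:, d], mos)[0] for d in range(X.shape[1])])
best = int(np.argmax(np.abs(per_dim_r)))
print(f"strongest single-dimension Pearson r = {per_dim_r[best]:.2f}")
print("rank correlation of that dimension:", spearmanr(X[:, best], mos)[0])
```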
https://arxiv.org/abs/2309.12766
Self-supervised representation learning (SSRL) has improved performance on downstream phoneme recognition compared with supervised models. Training SSRL models requires a large amount of pre-training data, which poses a challenge for low-resource languages. A common approach is to transfer knowledge from other languages. Instead, we propose to use audio augmentation to pre-train SSRL models in a low-resource condition and evaluate phoneme recognition as the downstream task. We performed a systematic comparison of augmentation techniques, namely pitch variation, noise addition, accented target-language speech, and other-language speech. We found that combined augmentation (noise/pitch) was the best strategy, outperforming accent and other-language knowledge transfer. We compared performance with various quantities and types of pre-training data, and examined the scaling factor of augmented data needed to achieve performance equivalent to models pre-trained with target-domain speech. Our findings suggest that, for resource-constrained languages, in-domain synthetic augmentation can outperform knowledge transfer from accented or other-language speech.
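One of the augmentations compared above, additive noise mixed at a chosen SNR, is easy to sketch. Pitch shifting and accented or other-language speech are not reproduced here, and the waveform and noise below are synthetic placeholders.

```python
# Additive-noise augmentation at a target signal-to-noise ratio (dB).
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix noise into speech at the requested SNR."""
    noise = np.resize(noise, speech.shape)               # loop/trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)   # 1 s placeholder tone
noise = rng.normal(size=8000)
augmented = add_noise(speech, noise, snr_db=10)
print(augmented.shape)
```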
https://arxiv.org/abs/2309.12763
Speech Emotion Recognition (SER) plays a pivotal role in enhancing human-computer interaction by enabling a deeper understanding of emotional states across a wide range of applications, contributing to more empathetic and effective communication. This study proposes an approach that integrates self-supervised feature extraction with supervised classification for emotion recognition from small audio segments. In the preprocessing step, to eliminate the need for hand-crafted audio features, we employ a self-supervised feature extractor based on the Wav2Vec model to capture acoustic features from audio data. The output feature maps of the preprocessing step are then fed to a custom-designed Convolutional Neural Network (CNN)-based model to perform emotion classification. Using the ShEMO dataset as our testing ground, the proposed method surpasses two baseline methods, i.e., a support vector machine classifier and transfer learning of a pretrained CNN. Comparing the proposed method to state-of-the-art SER methods further indicates its superiority. Our findings underscore the pivotal role of deep unsupervised feature learning in advancing SER, offering enhanced emotional comprehension in human-computer interactions.
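A hedged sketch of the two-stage pipeline: frozen self-supervised features (random tensors standing in for Wav2Vec outputs) are classified by a small 1-D CNN. The layer sizes and the number of emotion classes are illustrative, not the paper's configuration.

```python
# Self-supervised features -> small CNN emotion classifier (illustrative sizes).
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    def __init__(self, feat_dim=768, n_classes=6):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                     # pool over time
        )
        self.head = nn.Linear(128, n_classes)

    def forward(self, feats):                            # feats: (B, T, feat_dim)
        x = self.conv(feats.transpose(1, 2)).squeeze(-1) # (B, 128)
        return self.head(x)

features = torch.randn(4, 200, 768)                      # stand-in for SSL feature maps
logits = EmotionCNN()(features)
print(logits.shape)                                      # torch.Size([4, 6])
```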
https://arxiv.org/abs/2309.12714
Recent progress in Automatic Speech Recognition (ASR) has been coupled with a substantial increase in model sizes, which may now contain billions of parameters, leading to slow inference even with adapted hardware. In this context, several ASR models exist in various sizes, with different inference costs leading to different performance levels. Based on the observation that smaller models perform optimally on large parts of testing corpora, we propose to train a decision module that, given an audio sample, selects the smallest model sufficient to produce a good transcription. We apply our approach to two Whisper models of different sizes. By keeping the decision process computationally efficient, we build a decision module that allows substantial computational savings with only limited performance drops.
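A toy sketch of the routing idea: a cheap decision module inspects an utterance and picks either the small or the large recognizer. The features, linear router, and threshold are placeholders; the paper's actual decision module and the Whisper models themselves are not reproduced.

```python
# Lightweight routing between a small and a large ASR model (placeholder features).
import numpy as np

def decision_features(audio, sr=16000):
    """Cheap utterance-level features a router could use (duration, energy)."""
    return np.array([len(audio) / sr, float(np.mean(audio ** 2))])

def route(audio, router_w, threshold=0.0):
    score = decision_features(audio) @ router_w         # tiny linear decision module
    return "small_model" if score < threshold else "large_model"

rng = np.random.default_rng(0)
router_w = np.array([0.5, -2.0])                         # would be learned in practice
for _ in range(3):
    utt = rng.normal(size=rng.integers(16000, 160000)) * 0.1
    print(route(utt, router_w))
```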
https://arxiv.org/abs/2309.12712
This paper details our speaker diarization system designed for multi-domain, multi-microphone casual conversations. The proposed diarization pipeline uses weighted prediction error (WPE)-based dereverberation as a front end, then applies end-to-end neural diarization with vector clustering (EEND-VC) to each channel separately. It integrates the diarization result obtained from each channel using diarization output voting error reduction plus overlap (DOVER-LAP). To harness the knowledge from the target domain and results integrated across all channels, we apply self-supervised adaptation for each session by retraining the EEND-VC with pseudo-labels derived from DOVER-LAP. The proposed system was incorporated into NTT's submission for the distant automatic speech recognition task in the CHiME-7 challenge. Our system achieved 65 % and 62 % relative improvements on development and eval sets compared to the organizer-provided VC-based baseline diarization system, securing third place in diarization performance.
https://arxiv.org/abs/2309.12656
Dual-path is a popular architecture for speech separation models (e.g., Sepformer). It splits long sequences into overlapping chunks so that intra-blocks model intra-chunk local features while inter-blocks model inter-chunk global relationships. However, it has been found that inter-blocks, which comprise half of a dual-path model's parameters, contribute minimally to performance. Thus, we propose the Single-Path Global Modulation (SPGM) block to replace inter-blocks. SPGM is named after its structure, consisting of a parameter-free global pooling module followed by a modulation module comprising only 2% of the model's total parameters. The SPGM block allows all transformer layers in the model to be dedicated to local feature modelling, making the overall model single-path. SPGM achieves 22.1 dB SI-SDRi on WSJ0-2Mix and 20.4 dB SI-SDRi on Libri2Mix, exceeding the performance of Sepformer by 0.5 dB and 0.3 dB respectively, and matches the performance of recent SOTA models with up to 8 times fewer parameters.
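A minimal numpy sketch of the SPGM pattern: a parameter-free pooling over all chunks produces a global summary that then modulates each chunk's features. The single sigmoid gate below stands in for the small modulation module; all dimensions are arbitrary.

```python
# Global pooling + feature modulation over chunked features (illustrative gate).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spgm_like(chunks, w_gate):
    """chunks: (n_chunks, chunk_len, d). Parameter-free pooling, then modulation."""
    global_summary = chunks.mean(axis=(0, 1))            # (d,) summary of all chunks
    gate = sigmoid(global_summary @ w_gate)              # (d,) modulation weights
    return chunks * gate                                 # broadcast over chunks/frames

rng = np.random.default_rng(0)
x = rng.normal(size=(40, 250, 256))                      # 40 overlapping chunks
out = spgm_like(x, rng.normal(size=(256, 256)) * 0.05)
print(out.shape)                                         # (40, 250, 256)
```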
https://arxiv.org/abs/2309.12608
The ICASSP 2023 Acoustic Echo Cancellation Challenge is intended to stimulate research in acoustic echo cancellation (AEC), which is an important area of speech enhancement and is still a top issue in audio communication. This is the fourth AEC challenge and it is enhanced by adding a second track for personalized acoustic echo cancellation, reducing the algorithmic + buffering latency to 20ms, as well as including a full-band version of AECMOS. We open source two large datasets to train AEC models under both single talk and double talk scenarios. These datasets consist of recordings from more than 10,000 real audio devices and human speakers in real environments, as well as a synthetic dataset. We open source an online subjective test framework and provide an objective metric for researchers to quickly test their results. The winners of this challenge were selected based on the average mean opinion score (MOS) achieved across all scenarios and the word accuracy (WAcc) rate.
https://arxiv.org/abs/2309.12553
The goal of this work is Active Speaker Detection (ASD), a task to determine whether a person is speaking or not in a series of video frames. Previous works have dealt with the task by exploring network architectures while learning effective representations has been less explored. In this work, we propose TalkNCE, a novel talk-aware contrastive loss. The loss is only applied to part of the full segments where a person on the screen is actually speaking. This encourages the model to learn effective representations through the natural correspondence of speech and facial movements. Our loss can be jointly optimized with the existing objectives for training ASD models without the need for additional supervision or training data. The experiments demonstrate that our loss can be easily integrated into the existing ASD frameworks, improving their performance. Our method achieves state-of-the-art performances on AVA-ActiveSpeaker and ASW datasets.
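The talk-aware contrastive idea can be sketched as a frame-wise InfoNCE loss between audio and visual embeddings that is averaged only over frames where the on-screen person is speaking. The embeddings and the speaking mask below are random placeholders.

```python
# Frame-wise InfoNCE restricted to actively speaking frames (toy data).
import numpy as np

def talk_nce(audio, visual, speaking_mask, temperature=0.07):
    """audio, visual: (T, D) L2-normalised embeddings; speaking_mask: (T,) bool."""
    sim = audio @ visual.T / temperature                 # (T, T) similarity matrix
    sim -= sim.max(axis=1, keepdims=True)                # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    per_frame = -np.diag(log_prob)                       # matched frame is the positive
    return per_frame[speaking_mask].mean()               # loss only on speaking frames

rng = np.random.default_rng(0)
T, D = 64, 128
a = rng.normal(size=(T, D)); a /= np.linalg.norm(a, axis=1, keepdims=True)
v = rng.normal(size=(T, D)); v /= np.linalg.norm(v, axis=1, keepdims=True)
mask = rng.random(T) > 0.5                               # frames with active speech
print(talk_nce(a, v, mask))
```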
https://arxiv.org/abs/2309.12306
In this study, we present synchronous bilingual Connectionist Temporal Classification (CTC), an innovative framework that leverages dual CTC to bridge the gaps of both modality and language in the speech translation (ST) task. Utilizing transcript and translation as concurrent objectives for CTC, our model bridges the gap between audio and text as well as between source and target languages. Building upon the recent advances in CTC application, we develop an enhanced variant, BiL-CTC+, that establishes new state-of-the-art performances on the MuST-C ST benchmarks under resource-constrained scenarios. Intriguingly, our method also yields significant improvements in speech recognition performance, revealing the effect of cross-lingual learning on transcription and demonstrating its broad applicability. The source code is available at this https URL.
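A simplified sketch of the dual-CTC objective: the encoder output is scored with one CTC loss against the transcript and another against the translation, and the two are combined. Sharing a single output projection and the 0.5/0.5 weighting are simplifications made here; the paper's exact head layout is not reproduced.

```python
# Two CTC losses (transcript + translation) combined into one training objective.
import torch
import torch.nn.functional as F

T, B, V = 120, 2, 1000                                   # frames, batch, joint vocab
log_probs = F.log_softmax(torch.randn(T, B, V), dim=-1)  # stand-in encoder outputs
transcript = torch.randint(1, V, (B, 20))                # source-language labels
translation = torch.randint(1, V, (B, 25))               # target-language labels
in_lens = torch.full((B,), T, dtype=torch.long)

ctc_src = F.ctc_loss(log_probs, transcript, in_lens,
                     torch.full((B,), 20, dtype=torch.long), blank=0)
ctc_tgt = F.ctc_loss(log_probs, translation, in_lens,
                     torch.full((B,), 25, dtype=torch.long), blank=0)
loss = 0.5 * ctc_src + 0.5 * ctc_tgt                     # joint bilingual objective
print(float(loss))
```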
https://arxiv.org/abs/2309.12234
Neural network approaches to single-channel speech enhancement have received much recent attention. In particular, mask-based architectures have achieved significant performance improvements over conventional methods. This paper proposes a multiscale autoencoder (MSAE) for mask-based end-to-end neural network speech enhancement. The MSAE performs spectral decomposition of an input waveform within separate band-limited branches, each operating with a different rate and scale, to extract a sequence of multiscale embeddings. The proposed framework features intuitive parameterization of the autoencoder, including a flexible spectral band design based on the Constant-Q transform. Additionally, the MSAE is constructed entirely of differentiable operators, allowing it to be implemented within an end-to-end neural network, and be discriminatively trained. The MSAE draws motivation both from recent multiscale network topologies and from traditional multiresolution transforms in speech processing. Experimental results show the MSAE to provide clear performance benefits relative to conventional single-branch autoencoders. Additionally, the proposed framework is shown to outperform a variety of state-of-the-art enhancement systems, both in terms of objective speech quality metrics and automatic speech recognition accuracy.
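The multiscale decomposition can be illustrated by analysing one waveform with several branches at different time-frequency resolutions (here, plain STFT magnitudes with different window sizes). The learnable encoder, the Constant-Q band design, and the decoder are not reproduced.

```python
# Multi-rate, multi-scale analysis of one waveform with different window sizes.
import numpy as np

def stft_mag(x, win, hop):
    frames = np.lib.stride_tricks.sliding_window_view(x, win)[::hop]
    return np.abs(np.fft.rfft(frames * np.hanning(win), axis=1))

x = np.random.randn(16000)                               # 1 s of audio at 16 kHz
branches = [stft_mag(x, win, win // 4) for win in (128, 512, 2048)]
for b in branches:
    print(b.shape)                                        # different rates and scales
```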
https://arxiv.org/abs/2309.12121
This study investigates mask-based beamformers (BFs), which estimate filters to extract target speech using time-frequency masks. Although several BF methods have been proposed, the following aspects are yet to be comprehensively investigated. 1) Which BF can provide the best extraction performance in terms of the closeness of the BF output to the target speech? 2) Is the optimal mask for the best performance common for all BFs? 3) Is the ideal ratio mask (IRM) identical to the optimal mask? Accordingly, we investigate these issues considering four mask-based BFs: the maximum signal-to-noise ratio BF, two variants of this, and the multichannel Wiener filter (MWF) BF. To obtain the optimal mask corresponding to the peak performance for each BF, we employ an approach that minimizes the mean square error between the BF output and target speech for each utterance. Via the experiments with the CHiME-3 dataset, we verify that the four BFs have the same peak performance as the upper bound provided by the ideal MWF BF, whereas the optimal mask depends on the adopted BF and differs from the IRM. These observations differ from the conventional idea that the optimal mask is common for all BFs and that peak performance differs for each BF. Hence, this study contributes to the design of mask-based BFs.
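For reference, a mask-based multichannel Wiener filter in a single frequency bin looks roughly like this: the time-frequency mask weights the spatial covariance estimates, and the resulting filter extracts the target at a reference channel. Signals and masks here are random placeholders, and a real system would loop over all frequency bins.

```python
# Mask-based multichannel Wiener filter for one frequency bin (toy data).
import numpy as np

rng = np.random.default_rng(0)
M, T = 6, 400                                            # microphones, frames
X = rng.normal(size=(M, T)) + 1j * rng.normal(size=(M, T))   # STFT bin over time
mask = rng.random(T)                                     # speech presence per frame

Phi_s = (mask * X) @ X.conj().T / mask.sum()             # speech spatial covariance
Phi_x = X @ X.conj().T / T                               # mixture spatial covariance
ref = np.eye(M)[0]                                       # reference microphone
w = np.linalg.solve(Phi_x, Phi_s) @ ref                  # MWF filter for channel 0
y = w.conj() @ X                                         # beamformer output, (T,)
print(y.shape)
```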
https://arxiv.org/abs/2309.12065
Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speaker's voice without adaptation parameters. By quantizing speech waveform into discrete acoustic tokens and modeling these tokens with the language model, recent language model-based TTS models show zero-shot speaker adaptation capabilities with only a 3-second acoustic prompt of an unseen speaker. However, they are limited by the length of the acoustic prompt, which makes it difficult to clone personal speaking style. In this paper, we propose a novel zero-shot TTS model with the multi-scale acoustic prompts based on a neural codec language model VALL-E. A speaker-aware text encoder is proposed to learn the personal speaking style at the phoneme-level from the style prompt consisting of multiple sentences. Following that, a VALL-E based acoustic decoder is utilized to model the timbre from the timbre prompt at the frame-level and generate speech. The experimental results show that our proposed method outperforms baselines in terms of naturalness and speaker similarity, and can achieve better performance by scaling out to a longer style prompt.
https://arxiv.org/abs/2309.11977
Previous methods for predicting room acoustic parameters and speech quality metrics have focused on the single-channel case, where room acoustics and Mean Opinion Score (MOS) are predicted for a single recording device. However, quality-based device selection for rooms with multiple recording devices may benefit from a multi-channel approach where the descriptive metrics are predicted for multiple devices in parallel. Following our hypothesis that a model may benefit from multi-channel training, we develop a multi-channel model for joint MOS and room acoustics prediction (MOSRA) for five channels in parallel. The lack of multi-channel audio data with ground truth labels necessitated the creation of simulated data using an acoustic simulator with room acoustic labels extracted from the generated impulse responses and labels for MOS generated in a student-teacher setup using a wav2vec2-based MOS prediction model. Our experiments show that the multi-channel model improves the prediction of the direct-to-reverberation ratio, clarity, and speech transmission index over the single-channel model with roughly 5$\times$ less computation while suffering minimal losses in the performance of the other metrics.
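A hedged sketch of a joint multi-channel predictor: a shared encoder processes each of the five channels and two heads emit a per-channel MOS and per-channel room-acoustic parameters. All layer sizes, and the choice of three acoustic targets, are assumptions made for illustration only.

```python
# Shared encoder with per-channel MOS and room-acoustics heads (illustrative sizes).
import torch
import torch.nn as nn

class MultiChannelPredictor(nn.Module):
    def __init__(self, feat_dim=64, n_acoustic=3):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, 128, batch_first=True)
        self.mos_head = nn.Linear(128, 1)
        self.acoustic_head = nn.Linear(128, n_acoustic)

    def forward(self, feats):                      # feats: (B, n_channels, T, feat_dim)
        B, C, T, D = feats.shape
        _, h = self.encoder(feats.reshape(B * C, T, D))
        h = h[-1].reshape(B, C, -1)                # one summary vector per channel
        return self.mos_head(h).squeeze(-1), self.acoustic_head(h)

mos, acoustics = MultiChannelPredictor()(torch.randn(2, 5, 300, 64))
print(mos.shape, acoustics.shape)                  # (2, 5) and (2, 5, 3)
```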
https://arxiv.org/abs/2309.11976