In TV services, dialogue level personalization is key to meeting user preferences and needs. When dialogue and background sounds are not separately available from the production stage, Dialogue Separation (DS) can estimate them to enable personalization. DS was shown to provide clear benefits for the end user. Still, the estimated signals are not perfect, and some leakage can be introduced. This is undesired, especially during passages without dialogue. We propose to combine DS and Voice Activity Detection (VAD), both recently proposed for TV audio. When their combination suggests dialogue inactivity, background components leaking in the dialogue estimate are reassigned to the background estimate. A clear improvement of the audio quality is shown for dialogue-free signals, without performance drops when dialogue is active. A post-processed VAD estimate with improved detection accuracy is also generated. It is concluded that DS and VAD can improve each other and are better used together.
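The reassignment step described above can be illustrated with a minimal sketch. It assumes frame-wise dialogue and background estimates and a per-frame dialogue-activity probability; the function name `reassign_leakage`, the hard threshold of 0.5, and the toy frames are hypothetical and are not taken from the paper, which bases the decision on the combination of DS and VAD rather than on a VAD probability alone.

```python
def reassign_leakage(dialogue, background, vad_prob, threshold=0.5):
    """Move the dialogue estimate into the background estimate on dialogue-free frames.

    dialogue, background: lists of frames, each frame a list of samples.
    vad_prob: per-frame dialogue-activity probability in [0, 1].
    threshold: hypothetical decision threshold (an assumption of this sketch).
    """
    new_dialogue, new_background = [], []
    for d_frame, b_frame, p in zip(dialogue, background, vad_prob):
        if p < threshold:
            # Dialogue judged inactive: treat the whole dialogue estimate as leakage.
            new_background.append([d + b for d, b in zip(d_frame, b_frame)])
            new_dialogue.append([0.0] * len(d_frame))
        else:
            # Dialogue judged active: leave both estimates untouched.
            new_dialogue.append(list(d_frame))
            new_background.append(list(b_frame))
    return new_dialogue, new_background


# Toy input: two frames of two samples each; the second frame is dialogue-free.
dialogue = [[0.5, -0.25], [0.125, 0.0]]
background = [[0.25, 0.25], [0.5, -0.5]]
vad_prob = [0.9, 0.1]
new_d, new_b = reassign_leakage(dialogue, background, vad_prob)
```

Because leaked samples are added to the background rather than discarded, the sum of the two estimates is unchanged on every frame, so the mixture can still be reconstructed after the reassignment.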
Generative AI has demonstrated impressive performance in various fields, among which speech synthesis is an interesting direction. With the diffusion model as the most popular generative model, numerous works have attempted two active tasks: text-to-speech and speech enhancement. This work conducts a survey of audio diffusion models, which is complementary to existing surveys that either lack the recent progress of diffusion-based speech synthesis or highlight an overall picture of applying diffusion models in multiple fields. Specifically, this work first briefly introduces the background of audio and diffusion models. As for the text-to-speech task, we divide the methods into three categories based on the stage where the diffusion model is adopted: acoustic model, vocoder, and end-to-end framework. Moreover, we categorize various speech enhancement tasks by whether certain signals are removed from or added to the input speech. Comparisons of experimental results and discussions are also covered in this survey.
Instrument playing technique (IPT) is a key element of musical presentation. However, most existing work on IPT detection concerns only monophonic music signals, and little has been done to detect IPTs in polyphonic instrumental solo pieces with overlapping or mixed IPTs. In this paper, we formulate IPT detection as a frame-level multi-label classification problem and apply it to the Guzheng, a Chinese plucked string instrument. We create a new dataset, Guzheng_Tech99, containing Guzheng recordings and onset, offset, pitch, and IPT annotations of each note. Because different IPTs vary widely in length, we propose a new method that addresses this problem using a multi-scale network and self-attention. The multi-scale network extracts features at different scales, and the self-attention mechanism applied to the feature maps at the coarsest scale further enhances long-range feature extraction. Our approach outperforms existing works by a large margin, indicating its effectiveness in IPT detection.
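To make the frame-level multi-label formulation concrete, the sketch below converts note-level annotations (onset, offset, technique) into one binary label vector per frame, so that overlapping techniques simply set several entries of the same frame. The technique names, the hop size of 0.25 s, and the rule that a frame is labelled when its centre falls inside the note are illustrative assumptions, not details of Guzheng_Tech99.

```python
def frame_labels(notes, techniques, num_frames, hop_seconds):
    """Build frame-level multi-label targets from note annotations.

    notes: list of (onset_seconds, offset_seconds, technique_name).
    techniques: ordered list of technique names (hypothetical label set).
    Returns num_frames rows, each a 0/1 list with one entry per technique.
    """
    index = {name: i for i, name in enumerate(techniques)}
    labels = [[0] * len(techniques) for _ in range(num_frames)]
    for t in range(num_frames):
        centre = (t + 0.5) * hop_seconds  # time at the middle of frame t
        for onset, offset, name in notes:
            if onset <= centre < offset:
                labels[t][index[name]] = 1  # several techniques may be active at once
    return labels


techniques = ["vibrato", "glissando", "tremolo"]
notes = [(0.0, 0.5, "vibrato"), (0.25, 1.0, "glissando")]
labels = frame_labels(notes, techniques, num_frames=4, hop_seconds=0.25)
```

In this toy case the second frame carries two active labels, which is exactly the situation a single-label (monophonic) formulation cannot express.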
As one of the major branches of automatic speech recognition, attention-based models greatly improve the feature representation ability of the model. In particular, the multi-head mechanism is employed in the attention, in the hope of learning more aspects of speech features in different attention subspaces. For speech recognition of complex languages, on the one hand, a small number of heads leads to an obvious shortage of learnable aspects. On the other hand, increasing the number of heads requires reducing the dimension of each subspace to keep the size of the overall feature space unchanged, which significantly weakens the ability of each subspace to represent features. Therefore, this paper explores how to use a small attention subspace to represent complete speech features while retaining many heads. In this work, we propose a novel neural network architecture, namely, pyramid multi-branch fusion DCNN with multi-head self-attention. The proposed architecture is inspired by Dilated Convolution Neural Networks (DCNN): it uses multiple DCNN branches to extract features of the input speech under different receptive fields. To reduce the number of parameters, every two branches are merged until all the branches are merged into one. Thus, its shape is like a pyramid rotated 90 degrees. We demonstrate that on Aishell-1, a widely used Mandarin speech dataset, our model achieves a character error rate (CER) of 6.45% on the test set.
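Two points in the abstract above lend themselves to a small worked example: the per-head subspace shrinks as heads are added at fixed model width, and the branches are merged pairwise until one remains. The sketch below is only an illustration; the model width of 256, the branch names, and the string-based `merge` function are assumptions chosen to make the structure visible, not the paper's configuration.

```python
def head_dim(model_dim, num_heads):
    """Dimension of each attention subspace when the total width is fixed."""
    if model_dim % num_heads:
        raise ValueError("model_dim must be divisible by num_heads")
    return model_dim // num_heads


def pyramid_merge(branches, merge):
    """Merge neighbouring branches pairwise until a single branch remains.

    Returns every level of the pyramid, starting with the input branches.
    """
    levels = [list(branches)]
    while len(levels[-1]) > 1:
        current = levels[-1]
        merged = [merge(current[i], current[i + 1])
                  for i in range(0, len(current) - 1, 2)]
        if len(current) % 2:  # an unpaired branch is carried to the next level
            merged.append(current[-1])
        levels.append(merged)
    return levels


# Doubling the number of heads halves each subspace (width 256 is an assumption).
dims = [head_dim(256, h) for h in (4, 8, 16)]

# Four hypothetical branches, named after their dilation rates.
levels = pyramid_merge(["d1", "d2", "d4", "d8"], lambda a, b: f"({a}+{b})")
```

With four branches the pyramid has three levels (4, 2, 1 branches), which matches the "pyramid rotated 90 degrees" picture.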
Transformer-based models have recently made significant achievements in end-to-end (E2E) automatic speech recognition (ASR), and they make it possible to deploy E2E ASR systems on smart devices. However, these models still have the disadvantage of requiring a large number of parameters. To overcome this drawback of universal Transformer models for ASR on edge devices, we propose a solution that reuses blocks in Transformer models for small-footprint ASR systems, which meets the objective of accommodating resource limitations without compromising recognition accuracy. Specifically, we design a novel block-reusing strategy for speech Transformer (BRST) to enhance the effectiveness of parameters and propose an adapter module (ADM) that can produce a compact and adaptable model with only a few additional trainable parameters accompanying each reused block. We conducted an experiment with the proposed method on the public AISHELL-1 corpus, and the results show that the proposed approach achieves a character error rate (CER) of 9.3%/6.63% with only 7.6M/8.3M parameters without and with the ADM, respectively. In addition, we provide a deeper analysis to show the effect of the ADM in the general block-reusing method.
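The parameter saving from block reuse can be sketched with simple arithmetic: a stack that applies a few unique blocks many times stores only the unique blocks, plus one small adapter per layer if adapters are used. The function below and all numbers in the example (3M parameters per block, 12 layers, 2 unique blocks, 50k parameters per adapter) are hypothetical and are not the configuration reported in the paper.

```python
def count_parameters(block_params, num_layers, num_unique_blocks, adapter_params=0):
    """Parameters stored by a stack that reuses blocks across layers.

    block_params: parameters of one Transformer block.
    num_layers: how many times a block is applied in the stack.
    num_unique_blocks: how many distinct blocks are actually stored.
    adapter_params: parameters of the per-layer adapter (0 means no adapter).
    """
    shared = block_params * num_unique_blocks
    adapters = adapter_params * num_layers  # each layer keeps its own small adapter
    return shared + adapters


# Hypothetical sizes, chosen only to show the trend.
baseline = count_parameters(3_000_000, num_layers=12, num_unique_blocks=12)
reused = count_parameters(3_000_000, num_layers=12, num_unique_blocks=2)
reused_with_adapters = count_parameters(3_000_000, 12, 2, adapter_params=50_000)
```

The adapters add a small per-layer cost, which is the trade-off the abstract describes: reused layers share weights, while the adapters restore some layer-specific capacity.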
Two sound field reproduction methods, weighted pressure matching and weighted mode matching, are theoretically and experimentally compared. Weighted pressure matching and weighted mode matching are generalizations of conventional pressure matching and mode matching, respectively. Both methods are derived by introducing a weighting matrix into pressure and mode matching. The weighting matrix in weighted pressure matching is defined on the basis of the kernel interpolation of the sound field from the pressure at a discrete set of control points. In weighted mode matching, the weighting matrix is defined by a regional integration of spherical wavefunctions. It is theoretically shown that weighted pressure matching is a special case of weighted mode matching when infinite-dimensional harmonic analysis is used for estimating expansion coefficients from pressure observations. The differences between the two methods are discussed through experiments.
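As a rough illustration of what introducing a weighting matrix means, weighted pressure matching can be written as a regularized weighted least-squares problem. The notation here is an assumption of this sketch rather than the paper's: $\mathbf{G}$ is the transfer matrix from loudspeaker driving signals $\mathbf{d}$ to the control points, $\mathbf{p}^{\mathrm{des}}$ is the desired pressure at those points, $\mathbf{W}$ is a Hermitian positive semidefinite weighting matrix, and $\lambda > 0$ is a regularization parameter.

```latex
\begin{align}
J(\mathbf{d}) &= \left(\mathbf{G}\mathbf{d} - \mathbf{p}^{\mathrm{des}}\right)^{\mathsf{H}}
                 \mathbf{W}
                 \left(\mathbf{G}\mathbf{d} - \mathbf{p}^{\mathrm{des}}\right)
                 + \lambda \lVert \mathbf{d} \rVert^{2}, \\
\hat{\mathbf{d}} &= \operatorname*{arg\,min}_{\mathbf{d}} J(\mathbf{d})
                  = \left(\mathbf{G}^{\mathsf{H}} \mathbf{W} \mathbf{G} + \lambda \mathbf{I}\right)^{-1}
                    \mathbf{G}^{\mathsf{H}} \mathbf{W}\, \mathbf{p}^{\mathrm{des}}.
\end{align}
```

Setting $\mathbf{W} = \mathbf{I}$ recovers conventional (regularized) pressure matching, which is the sense in which the weighted form is a generalization. Weighted mode matching has the same structure, with pressures replaced by expansion coefficients of the sound field.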
We introduce LMCodec, a causal neural speech codec that provides high quality audio at very low bitrates. The backbone of the system is a causal convolutional codec that encodes audio into a hierarchy of coarse-to-fine tokens using residual vector quantization. LMCodec trains a Transformer language model to predict the fine tokens from the coarse ones in a generative fashion, allowing for the transmission of fewer codes. A second Transformer predicts the uncertainty of the next codes given the past transmitted codes, and is used to perform conditional entropy coding. A MUSHRA subjective test shows that the quality is comparable to that of reference codecs at higher bitrates. Example audio is available at this https URL.
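The coarse-to-fine token hierarchy produced by residual vector quantization can be sketched in a few lines. Each stage quantizes what the previous stages left unexplained, so the first token is coarse and later tokens add detail. The two tiny codebooks and the input vector below are made up for illustration, and the sketch covers only the quantizer: in LMCodec the fine tokens are predicted by a language model rather than transmitted, which is not modelled here.

```python
def nearest(codebook, vector):
    """Index of the codebook entry closest to vector (squared Euclidean distance)."""
    def sq_dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vector))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(vector, codebooks):
    """Quantize vector stage by stage; each stage encodes the remaining residual."""
    residual = list(vector)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Sum the selected entries; passing fewer tokens gives a coarser reconstruction."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical two-stage quantizer: a coarse codebook followed by a fine one.
coarse = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
fine = [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25], [-0.25, 0.25]]
codebooks = [coarse, fine]

tokens = rvq_encode([1.25, 1.0], codebooks)
full = rvq_decode(tokens, codebooks)
coarse_only = rvq_decode(tokens[:1], codebooks)
```

Decoding from the first token alone already gives a usable approximation, which is why transmitting only the coarse tokens and regenerating the fine ones at the receiver can lower the bitrate.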
We show that training a multi-headed self-attention-based deep network to predict deleted, information-dense 2-8 Hz speech modulations over a 1.5-second section of a speech utterance is an effective way to make machines learn to extract speech modulations using time-domain contextual information. Our work shows that, once trained on large volumes of unlabelled data, the outputs of the self-attention layers vary in time with a modulation peak at 4 Hz. These pre-trained layers can be used to initialize parts of an Automatic Speech Recognition system to greatly reduce its reliance on labeled speech data.
Sound is a fundamental and rich source of information, playing a key role in many areas from humanities and social sciences through to engineering and mathematics. Sound is more than just data 'signals'. It encapsulates physical, sensorial and emotional, as well as social, cultural and environmental factors. Sound contributes to the transformation of our experiences, environments and beliefs. Sound is all around us. Hence, it should come as no surprise that sound is a complex multicomponent entity with a vast assortment of characteristics and applications. An important question is: what is the best way to store and represent sound digitally to capture these characteristics? What model or method is best for manipulating, extracting and filtering sounds? There are a large number of representations and models; however, one approach that has yet to be used with sound is dual-quaternions. Dual-quaternions have established themselves in many fields of science and computing as an efficient mathematical model, providing an unambiguous, un-cumbersome, computationally effective means of representing multi-component data, yet sound is one area that has yet to explore and reap their benefits (using sound and audio-related dual-quaternion models). This article aims to explore the exciting potential and possibilities dual-quaternions offer when applied to and combined with sound-based models (including but not limited to applications, tools, machine-learning, statistical and computational sound-related algorithms).
In this paper, we introduce a new approach, called "Posthoc Interpretation via Quantization (PIQ)", for interpreting decisions made by trained classifiers. Our method utilizes vector quantization to transform the representations of a classifier into a discrete, class-specific latent space. The class-specific codebooks act as a bottleneck that forces the interpreter to focus on the parts of the input data deemed relevant by the classifier for making a prediction. We evaluated our method through quantitative and qualitative studies and found that PIQ generates interpretations that participants in our user studies understand more easily than those of several other interpretation methods in the literature.
Recent work shows that speech separation guided diarization (SSGD) is an increasingly promising direction, mainly thanks to the recent progress in speech separation. It performs diarization by first separating the speakers and then applying voice activity detection (VAD) on each separated stream. In this work we conduct an in-depth study of SSGD in the conversational telephone speech (CTS) domain, focusing mainly on low-latency streaming diarization applications. We consider three state-of-the-art speech separation (SSep) algorithms and study their performance both in online and offline scenarios, considering non-causal and causal implementations as well as continuous SSep (CSS) windowed inference. We compare different SSGD algorithms on two widely used CTS datasets, CALLHOME and Fisher Corpus (Parts 1 and 2), and evaluate both separation and diarization performance. To improve performance, a novel, causal and computationally efficient leakage removal algorithm is proposed, which significantly decreases false alarms. We also explore, for the first time, fully end-to-end SSGD integration between SSep and VAD modules. Crucially, this enables fine-tuning on real-world data for which oracle speaker sources are not available. In particular, our best model achieves 8.8% DER on CALLHOME, which outperforms the current state-of-the-art end-to-end neural diarization model, despite being trained on an order of magnitude less data and having significantly lower latency, i.e., 0.1 vs. 1 seconds. Finally, we show that the separated signals can also be readily used for automatic speech recognition, reaching performance close to that obtained with oracle sources in some configurations.
Personalized TTS is an exciting and highly desired application that allows users to train their TTS voice using only a few recordings. However, TTS training typically requires many hours of recording and a large model, making it unsuitable for deployment on mobile devices. To overcome this limitation, related works typically fine-tune a pre-trained TTS model to preserve its ability to generate high-quality audio samples while adapting to the target speaker's voice. This process is commonly referred to as "voice cloning." Although related works have achieved significant success in changing the TTS model's voice, they still need to fine-tune from a large pre-trained model, which results in a large voice-cloned model. In this paper, we propose applying trainable structured pruning to voice cloning. By training the structured pruning masks with voice-cloning data, we can produce a unique pruned model for each target speaker. Our experiments demonstrate that, using learnable structured pruning, we can make the model 7 times smaller while achieving comparable voice-cloning performance.
Deep learning based methods have become a paradigm for cover song identification (CSI) in recent years, where the ByteCover systems have achieved state-of-the-art results on all the mainstream CSI datasets. However, with the rapid growth of short videos, many real-world applications require matching short music excerpts to full-length music tracks in the database, a task that is still under-explored and awaits an industrial-level solution. In this paper, we upgrade the previous ByteCover systems to ByteCover3, which utilizes local features to further improve the identification performance for short music queries. ByteCover3 is designed with a local alignment loss (LAL) module and a two-stage feature retrieval pipeline, allowing the system to perform CSI in a more precise and efficient way. We evaluated ByteCover3 on multiple datasets with different benchmark settings, where ByteCover3 beat all the compared methods, including its previous versions.
The remarkable success of transformers in the field of natural language processing has sparked the interest of the speech-processing community, leading to an exploration of their potential for modeling long-range dependencies within speech sequences. Recently, transformers have gained prominence across various speech-related domains, including automatic speech recognition, speech synthesis, speech translation, speech para-linguistics, speech enhancement, spoken dialogue systems, and numerous multimodal applications. In this paper, we present a comprehensive survey that aims to bridge research studies from diverse subfields within speech technology. By consolidating findings from across the speech technology landscape, we provide a valuable resource for researchers interested in harnessing the power of transformers to advance the field. We identify the challenges encountered by transformers in speech processing while also offering insights into potential solutions to address these issues.
The Deep Speech Enhancement Challenge is the 5th edition of the deep noise suppression (DNS) challenges, organized at the ICASSP 2023 Signal Processing Grand Challenges. DNS challenges were organized during 2019-2023 to stimulate research in deep speech enhancement (DSE). Previous DNS challenges were organized at INTERSPEECH 2020, ICASSP 2021, INTERSPEECH 2021, and ICASSP 2022. From prior editions, we learnt that improving signal quality (SIG) is challenging, particularly in the presence of simultaneously active interfering talkers and noise. This challenge aims to develop models for joint denoising, dereverberation, and suppression of interfering talkers. When the primary talker wears a headphone, certain acoustic properties of their speech, such as the direct-to-reverberant ratio (DRR) and the signal-to-noise ratio (SNR), make it possible to suppress neighboring talkers even without enrollment data for the primary talker. This motivated us to create two tracks for this challenge: (i) Track-1 Headset; (ii) Track-2 Speakerphone. Both tracks have fullband (48 kHz) training data and test sets, and each test clip has corresponding enrollment data (10-30 s duration) for the primary talker. Each track invited submissions of personalized and non-personalized models, all of which were evaluated through the same subjective evaluation. Most models submitted to the challenge were personalized models. The same team won both tracks, with the best models improving the challenge's Score by 0.145 and 0.141 over the noisy blind test set.
This paper proposes a methodology for discovering meaningful properties in data by exploring the latent space of unsupervised deep generative models. We combine manipulation of individual latent variables to extreme values outside the training range with methods inspired by causal inference into an approach we call causal disentanglement with extreme values (CDEV) and show that this approach yields insights for model interpretability. Using this technique, we can infer what properties of unknown data the model encodes as meaningful. We apply the methodology to test what is meaningful in the communication system of sperm whales, one of the most intriguing and understudied animal communication systems. We train a network that has been shown to learn meaningful representations of speech and test whether we can leverage such unsupervised learning to decipher the properties of another vocal communication system for which we have no ground truth. The proposed technique suggests that sperm whales encode information using the number of clicks in a sequence, the regularity of their timing, and audio properties such as the spectral mean and the acoustic regularity of the sequences. Some of these findings are consistent with existing hypotheses, while others are proposed for the first time. We also argue that our models uncover rules that govern the structure of communication units in the sperm whale communication system and apply them while generating innovative data not shown during training. This paper suggests that an interpretation of the outputs of deep neural networks with causal methodology can be a viable strategy for approaching data about which little is known and presents another case of how deep learning can limit the hypothesis space. Finally, the proposed approach combining latent space manipulation and causal inference can be extended to other architectures and arbitrary datasets.
Although large foundation models pre-trained by self-supervised learning have achieved state-of-the-art performance in many tasks including automatic speech recognition (ASR), knowledge distillation (KD) is often required in practice to transfer the knowledge learned by large teacher models into much smaller student models with affordable computation and memory costs. This paper proposes a novel two-stage KD framework to distil the knowledge from multiple speech foundation models as teachers into a single student neural transducer model for ASR. In the first stage, the student model encoder is pre-trained using the embeddings extracted from multiple teacher models. In the second stage, the student encoder is fine-tuned with the audio-text pairs based on the ASR task. Experiments on the LibriSpeech 100-hour subset show that the proposed KD framework improves the performance of both streaming and non-streaming student models when using only one teacher. The performance of the student model can be further enhanced when multiple teachers are used jointly, achieving word error rate reductions (WERRs) of 17.5% and 10.6%. Our proposed framework can be combined with other existing KD methods to achieve further improvements. Further WERRs were obtained by incorporating extra unlabelled data during encoder pre-training, leading to a total relative WERR of 55.0% on the non-streaming student model.
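For readers unfamiliar with the metric, a relative word error rate reduction (WERR) expresses the improvement as a fraction of the baseline WER rather than as an absolute difference. The helper below is a minimal sketch of that standard definition; the function name and the WER values in the example are made up and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: how much of the baseline error was removed."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical example: a student improving from 10.0% to 8.0% WER.
werr = relative_wer_reduction(10.0, 8.0)
```

Here an absolute improvement of 2 points corresponds to a relative reduction of 20%, which is why relative figures can look large when the baseline WER is small.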
Social ambiance describes the context in which social interactions happen, and can be measured using speech audio by counting the number of concurrent speakers. This measurement has enabled various mental health tracking and human-centric IoT applications. While on-device Social Ambiance Measure (SAM) is highly desirable to ensure user privacy and thus facilitate wide adoption of the aforementioned applications, the computational complexity required by state-of-the-art SAM solutions powered by deep neural networks (DNNs) stands at odds with the often constrained resources on mobile devices. Furthermore, only limited labeled data is available or practical when it comes to SAM under clinical settings due to various privacy constraints and the required human effort, further challenging the achievable accuracy of on-device SAM solutions. To this end, we propose a dedicated neural architecture search framework for Energy-efficient and Real-time SAM (ERSAM). Specifically, our ERSAM framework can automatically search for DNNs that push forward the achievable accuracy vs. hardware efficiency frontier of mobile SAM solutions. For example, ERSAM-delivered DNNs consume only 40 mW x 12 h energy and 0.05 seconds processing latency for a 5 seconds audio segment on a Pixel 3 phone, while achieving an error rate of only 14.3% on a social ambiance dataset generated from LibriSpeech. We expect that our ERSAM framework can pave the way for ubiquitous on-device SAM solutions, which are in growing demand.
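The efficiency figures quoted above can be put into more familiar units with a short calculation. The sketch assumes that "40 mW x 12 h" denotes an average power draw of 40 mW sustained over 12 hours; the variable names are illustrative.

```python
# Figures taken from the abstract; the interpretation as average power is an assumption.
power_mw = 40          # average power in milliwatts
duration_h = 12        # duration in hours
latency_s = 0.05       # processing time per segment in seconds
segment_s = 5          # audio segment length in seconds

energy_mwh = power_mw * duration_h                    # milliwatt-hours
energy_joules = power_mw * duration_h * 3600 / 1000   # 1 mWh = 3.6 J
real_time_factor = latency_s / segment_s              # below 1 means faster than real time
```

Under this reading, 12 hours of monitoring costs 480 mWh (1728 J), and each segment is processed about 100 times faster than real time.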
Voice-enabled technology is quickly becoming ubiquitous, and is built from machine learning (ML)-enabled components such as speech recognition and voice activity detection. However, these systems don't yet work well for everyone. They exhibit bias - the systematic and unfair discrimination against individuals or cohorts of individuals in favour of others (Friedman & Nissenbaum, 1996) - across axes such as age, gender and accent. ML is reliant on large datasets for training. Dataset documentation is designed to give ML Practitioners (MLPs) a better understanding of a dataset's characteristics. However, there is a lack of empirical research on voice dataset documentation specifically. Additionally, while MLPs are frequent participants in fairness research, little work focuses on those who work with voice data. Our work makes an empirical contribution to this gap. Here, we combine two methods to form an exploratory study. First, we undertake 13 semi-structured interviews, exploring multiple perspectives of voice dataset documentation practice. Using open and axial coding methods, we explore MLPs' practices through the lenses of roles and trade-offs. Drawing from this work, we then purposively sample voice dataset documents (VDDs) for 9 voice datasets. Our findings then triangulate these two methods, using the lenses of MLP roles and trade-offs. We find that current VDD practices are inchoate, inadequate and incommensurate. The characteristics of voice datasets are codified in fragmented, disjoint ways that often do not meet the needs of MLPs. Moreover, they cannot be readily compared, presenting a barrier to practitioners' bias reduction efforts. We then discuss the implications of these findings for bias practices in voice data and speech technologies. We conclude by setting out a program of future work to address these findings -- that is, how we may "right the docs".
Multi-modal contrastive learning techniques in the audio-text domain have quickly become a highly active area of research. Most works are evaluated with standard audio retrieval and classification benchmarks, assuming that (i) these models are capable of leveraging the rich information contained in natural language, and (ii) current benchmarks are able to capture the nuances of such information. In this work, we show that state-of-the-art audio-text models do not yet really understand natural language, especially contextual concepts such as sequential or concurrent ordering of sound events. Our results suggest that existing benchmarks are not sufficient to assess these models' capabilities to match complex contexts from the audio and text modalities. We propose a Transformer-based architecture and show that, unlike prior work, it is capable of modeling the sequential relationship between sound events in the text and audio, given appropriate benchmark data. We advocate for the collection or generation of additional, diverse data to allow future research to fully leverage natural language for audio-text modeling.