In crowded settings, the human brain can focus on speech from a target speaker, given prior knowledge of how they sound. We introduce a novel intelligent hearable system that achieves this capability, enabling target speech hearing: the system ignores all interfering speech and noise and passes through only the target speaker. A naive approach is to require a clean speech example to enroll the target speaker. However, this is not well aligned with the hearable application domain, since obtaining a clean example is challenging in real-world scenarios, creating a unique user-interface problem. We present the first enrollment interface in which the wearer looks at the target speaker for a few seconds to capture a single, short, highly noisy binaural example of that speaker. This noisy example is used for enrollment and for subsequent speech extraction in the presence of interfering speakers and noise. Our system achieves a signal-quality improvement of 7.01 dB using less than 5 seconds of noisy enrollment audio and can process an 8 ms audio chunk in 6.24 ms on an embedded CPU. Our user studies demonstrate generalization to real-world static and mobile target speakers in previously unseen indoor and outdoor multipath environments. Finally, our enrollment interface for noisy examples does not degrade performance compared to clean examples, while being convenient and user-friendly. Taking a step back, this paper takes an important step towards enhancing human auditory perception with artificial intelligence. We provide code and data at: this https URL.
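As a rough illustration of the two-stage flow described above (the paper's actual network design is not reproduced here), the following PyTorch sketch uses hypothetical stand-in modules: a noisy binaural enrollment clip is embedded once into a speaker vector, which then conditions a small causal extraction network that processes the mixture in short chunks.

```python
# Minimal conceptual sketch (not the paper's architecture): a noisy binaural
# enrollment clip is embedded once, then an extraction network is conditioned
# on that embedding while processing short chunks in a stream.
# EnrollmentNet / ExtractionNet are hypothetical stand-ins.
import torch
import torch.nn as nn

class EnrollmentNet(nn.Module):
    """Maps a noisy binaural enrollment clip to a fixed speaker embedding."""
    def __init__(self, emb_dim=256):
        super().__init__()
        self.encoder = nn.Conv1d(2, emb_dim, kernel_size=400, stride=160)  # 2 = binaural channels
        self.proj = nn.Linear(emb_dim, emb_dim)

    def forward(self, enroll_wav):                  # (batch, 2, samples)
        feats = torch.relu(self.encoder(enroll_wav))
        return self.proj(feats.mean(dim=-1))        # temporal mean pool -> (batch, emb_dim)

class ExtractionNet(nn.Module):
    """Extracts the target speaker from a mixture chunk, conditioned on the embedding."""
    def __init__(self, emb_dim=256, hidden=256):
        super().__init__()
        self.encoder = nn.Conv1d(2, hidden, kernel_size=32, stride=16)
        self.film = nn.Linear(emb_dim, hidden)      # simple conditioning via feature scaling
        self.decoder = nn.ConvTranspose1d(hidden, 2, kernel_size=32, stride=16)

    def forward(self, mix_chunk, spk_emb):
        h = torch.relu(self.encoder(mix_chunk))
        h = h * torch.sigmoid(self.film(spk_emb)).unsqueeze(-1)
        return self.decoder(h)

if __name__ == "__main__":
    sr = 16000
    enroll = torch.randn(1, 2, 5 * sr)              # <5 s of noisy binaural enrollment audio
    chunk = torch.randn(1, 2, int(0.008 * sr))      # one 8 ms mixture chunk
    spk_emb = EnrollmentNet()(enroll)
    target = ExtractionNet()(chunk, spk_emb)
    print(target.shape)                             # same length as the input chunk
```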
https://arxiv.org/abs/2405.06289
Automatic speech recognition (ASR) systems, increasingly prevalent in education, healthcare, employment, and mobile technology, face significant challenges in inclusivity, particularly for the 80-million-strong global community of people who stutter. These systems often fail to accurately interpret speech patterns that deviate from typical fluency, leading to critical usability issues and misinterpretations. This study evaluates six leading ASRs, analyzing their performance on both a real-world dataset of speech samples from individuals who stutter and a synthetic dataset derived from the widely used LibriSpeech benchmark. The synthetic dataset, uniquely designed to incorporate various stuttering events, enables an in-depth analysis of each ASR's handling of disfluent speech. Our comprehensive assessment includes metrics such as word error rate (WER), character error rate (CER), and semantic accuracy of the transcripts. The results reveal a consistent and statistically significant accuracy bias across all ASRs against disfluent speech, manifesting as substantial syntactic and semantic inaccuracies in the transcriptions. These findings highlight a critical gap in current ASR technologies, underscoring the need for effective bias mitigation strategies. Addressing this bias is imperative not only to improve the technology's usability for people who stutter but also to ensure their equitable and inclusive participation in the rapidly evolving digital landscape.
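For reference, both reported error metrics reduce to a Levenshtein edit distance normalized by the reference length, over words for WER and over characters for CER; a minimal sketch:

```python
# Minimal sketch of the reported error metrics: word error rate (WER) and
# character error rate (CER) are Levenshtein edit distances normalized by the
# reference length, computed over words and characters respectively.
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (full DP table)."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)]

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

def cer(reference: str, hypothesis: str) -> float:
    ref, hyp = list(reference.replace(" ", "")), list(hypothesis.replace(" ", ""))
    return edit_distance(ref, hyp) / max(len(ref), 1)

# Repeated words in a disfluent utterance inflate both metrics even when a
# human listener would fully recover the intended content.
print(wer("please call stella", "please please call call stella"))  # 0.666...
print(cer("please call stella", "please please call call stella"))
```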
https://arxiv.org/abs/2405.06150
Recent developments in large speech foundation models like Whisper have led to their widespread use in many automatic speech recognition (ASR) applications. These systems incorporate `special tokens' in their vocabulary, such as $\texttt{<endoftext>}$, to guide their language generation process. However, we demonstrate that these tokens can be exploited by adversarial attacks to manipulate the model's behavior. We propose a simple yet effective method to learn a universal acoustic realization of Whisper's $\texttt{<endoftext>}$ token which, when prepended to any speech signal, encourages the model to ignore the speech and transcribe only the special token, effectively `muting' the model. Our experiments demonstrate that the same universal 0.64-second adversarial audio segment can successfully mute a target Whisper ASR model for over 97\% of speech samples. Moreover, we find that this universal adversarial audio segment often transfers to new datasets and tasks. Overall, this work demonstrates the vulnerability of Whisper models to `muting' adversarial attacks, which can pose both risks and potential benefits in real-world settings: for example, the attack can be used to bypass speech moderation systems or, conversely, to protect private speech data.
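At its core, the attack described above is a gradient-based optimization of a fixed-length audio prefix toward the <endoftext> target. The toy sketch below illustrates that loop with a stand-in differentiable model and a hypothetical token id rather than an actual Whisper checkpoint, so it should not be read as the paper's exact recipe.

```python
# Toy sketch of the attack idea (not the paper's recipe, and using a stand-in
# differentiable model rather than Whisper): learn one universal 0.64 s
# waveform prefix so that, for any speech input, the first decoded token is
# the <endoftext> token.
import torch
import torch.nn as nn

SR = 16000
PREFIX_LEN = int(0.64 * SR)          # 0.64 s universal adversarial segment
EOT_ID = 0                           # hypothetical id of the <endoftext> token

# Stand-in "ASR": maps a waveform to logits over a tiny vocabulary for the
# first decoding step. A real attack would backpropagate through Whisper.
asr_first_token_logits = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=400, stride=160), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, 8),
)

prefix = torch.zeros(1, 1, PREFIX_LEN, requires_grad=True)
opt = torch.optim.Adam([prefix], lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(50):
    speech = torch.randn(4, 1, 2 * SR)                       # a batch of arbitrary speech
    adv = torch.cat([prefix.expand(4, -1, -1), speech], -1)  # prepend the universal prefix
    logits = asr_first_token_logits(adv)
    loss = loss_fn(logits, torch.full((4,), EOT_ID, dtype=torch.long))  # push first token -> <endoftext>
    opt.zero_grad()
    loss.backward()
    opt.step()
```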
https://arxiv.org/abs/2405.06134
Audio-driven talking head generation is advancing from 2D to 3D content. Notably, the Neural Radiance Field (NeRF) is in the spotlight as a means to synthesize high-quality 3D talking head outputs. Unfortunately, NeRF-based approaches typically require a large amount of paired audio-visual data for each identity, limiting their scalability. Although there have been attempts to generate audio-driven 3D talking head animations from a single image, the results are often unsatisfactory due to insufficient information on obscured regions of the image. In this paper, we focus on the overlooked aspect of 3D consistency in the one-shot, audio-driven domain, where facial animations are synthesized primarily in front-facing perspectives. We propose a novel method, NeRFFaceSpeech, which enables the generation of high-quality, 3D-aware talking heads. Using prior knowledge of generative models combined with NeRF, our method can craft a 3D-consistent facial feature space corresponding to a single image. Our spatial synchronization method employs audio-correlated vertex dynamics of a parametric face model to transform static image features into dynamic visuals through ray deformation, ensuring realistic 3D facial motion. Moreover, we introduce LipaintNet, which can replenish the lacking information in the inner-mouth area that cannot be obtained from a single given image. The network is trained in a self-supervised manner by utilizing the generative capabilities without additional data. Comprehensive experiments demonstrate the superiority of our method in generating audio-driven talking heads from a single image with enhanced 3D consistency compared to previous approaches. In addition, we introduce a quantitative way of measuring a model's robustness against pose changes for the first time, which previously had been possible only qualitatively.
https://arxiv.org/abs/2405.05749
This paper presents our system submission for the In-Car Multi-Channel Automatic Speech Recognition (ICMC-ASR) Challenge, which focuses on speaker diarization and speech recognition in complex multi-speaker scenarios. To address these challenges, we develop end-to-end speaker diarization models that notably decrease the diarization error rate (DER) by 49.58\% compared to the official baseline on the development set. For speech recognition, we utilize self-supervised learning representations to train end-to-end ASR models. By integrating these models, we achieve a character error rate (CER) of 16.93\% on the track 1 evaluation set, and a concatenated minimum permutation character error rate (cpCER) of 25.88\% on the track 2 evaluation set.
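For clarity, the Track 2 metric can be sketched as follows, assuming the common definition of cpCER: concatenate each speaker's utterances, score the CER under every assignment of hypothesis speakers to reference speakers, and keep the permutation with the lowest error.

```python
# Sketch of the Track 2 metric as commonly defined: concatenated minimum-
# permutation character error rate (cpCER). Each speaker's utterances are
# concatenated, CER is evaluated under every assignment of hypothesis speakers
# to reference speakers, and the best permutation counts.
from itertools import permutations

def _edit_distance(a, b):
    """Levenshtein distance with a single rolling DP row."""
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
    return d[len(b)]

def cpcer(ref_by_spk, hyp_by_spk):
    """ref_by_spk / hyp_by_spk: dicts mapping speaker id -> list of utterances."""
    refs = ["".join(ref_by_spk[s]) for s in sorted(ref_by_spk)]
    hyps = ["".join(hyp_by_spk[s]) for s in sorted(hyp_by_spk)]
    total_ref_len = sum(len(r) for r in refs)
    best = min(
        sum(_edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best / max(total_ref_len, 1)

# The hypothesis speaker labels need not match the reference labels.
print(cpcer({"A": ["你好"], "B": ["再见"]}, {"s1": ["再见"], "s2": ["你好"]}))  # 0.0
```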
https://arxiv.org/abs/2405.05498
The trustworthiness of AI applications has been the subject of recent research and is also addressed in the EU's recently adopted AI Regulation. The currently emerging foundation models in the field of text, speech and image processing offer completely new possibilities for developing AI applications. This whitepaper shows how the trustworthiness of an AI application developed with foundation models can be evaluated and ensured. For this purpose, the application-specific, risk-based approach for testing and ensuring the trustworthiness of AI applications, as developed in the 'AI Assessment Catalog - Guideline for Trustworthy Artificial Intelligence' by Fraunhofer IAIS, is transferred to the context of foundation models. Special consideration is given to the fact that specific risks of foundation models can have an impact on the AI application and must also be taken into account when checking trustworthiness. Chapter 1 of the white paper explains the fundamental relationship between foundation models and AI applications based on them in terms of trustworthiness. Chapter 2 provides an introduction to the technical construction of foundation models and Chapter 3 shows how AI applications can be developed based on them. Chapter 4 provides an overview of the resulting risks regarding trustworthiness. Chapter 5 shows which requirements for AI applications and foundation models are to be expected according to the draft of the European Union's AI Regulation and Chapter 6 finally shows the system and procedure for meeting trustworthiness requirements.
https://arxiv.org/abs/2405.04937
Audio-visual target speaker extraction (AV-TSE) aims to extract a specific person's speech from an audio mixture given auxiliary visual cues. Previous methods usually search for the target voice through speech-lip synchronization. However, this strategy mainly focuses on the existence of target speech, while ignoring variations in the noise characteristics. That may result in extracting noisy signals from the incorrect sound source in challenging acoustic situations. To this end, we propose a novel reverse selective auditory attention mechanism, which can suppress interfering speakers and non-speech signals to avoid incorrect speaker extraction. By estimating and utilizing the undesired noisy signal through this mechanism, we design an AV-TSE framework named the Subtraction-and-ExtrAction network (SEANet) to suppress the noisy signals. We conduct extensive experiments, re-implementing three popular AV-TSE methods as baselines and using nine evaluation metrics. The experimental results show that our proposed SEANet achieves state-of-the-art results and performs well on all five datasets. We will release the code, models, and data logs.
https://arxiv.org/abs/2404.18501
The prevalence of mobile technology offers unique opportunities for addressing healthcare challenges, especially for individuals with visual impairments. This paper explores the development and implementation of a deep learning-based mobile application designed to assist blind and visually impaired individuals with real-time pill identification. Utilizing the YOLO framework, the application aims to accurately recognize and differentiate between various pill types through real-time image processing on mobile devices. The system incorporates Text-to-Speech (TTS) to provide immediate auditory feedback, enhancing usability and independence for visually impaired users. Our study evaluates the application's effectiveness in terms of detection accuracy and user experience, highlighting its potential to improve medication management and safety among the visually impaired community.
Keywords: Deep Learning; YOLO Framework; Mobile Application; Visual Impairment; Pill Identification; Healthcare
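A minimal sketch of such a detect-then-speak pipeline is shown below, assuming the ultralytics YOLO API and the offline pyttsx3 TTS engine; the weights file `pill_yolo.pt` and the class names are hypothetical placeholders, since the paper's trained model is not described here.

```python
# Minimal sketch of the described pipeline: run a YOLO detector on a camera
# frame and speak the result aloud. "pill_yolo.pt" is a hypothetical
# placeholder for a fine-tuned pill-detection model.
import cv2
import pyttsx3
from ultralytics import YOLO

model = YOLO("pill_yolo.pt")          # hypothetical fine-tuned pill detector
tts = pyttsx3.init()                  # offline text-to-speech engine

cap = cv2.VideoCapture(0)
ok, frame = cap.read()                # grab one frame from the camera
cap.release()

if ok:
    result = model(frame)[0]          # run detection on the frame
    if len(result.boxes) == 0:
        message = "No pill detected."
    else:
        # Report the highest-confidence detection.
        best = max(result.boxes, key=lambda b: float(b.conf))
        name = result.names[int(best.cls)]
        message = f"Detected {name} with {float(best.conf):.0%} confidence."
    tts.say(message)
    tts.runAndWait()
```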
https://arxiv.org/abs/2405.05983
Recently, the use of speech self-supervised learning (SSL) models for downstream tasks has been drawing a lot of attention. While large pre-trained models commonly outperform smaller models trained from scratch, questions regarding the optimal fine-tuning strategies remain prevalent. In this paper, we explore fine-tuning strategies for the WavLM Large model on the speech emotion recognition task using the MSP Podcast Corpus. More specifically, we perform a series of experiments focusing on using gender and semantic information from utterances. We then sum up our findings and describe the final model we used for submission to the Speech Emotion Recognition Challenge 2024.
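As context, a minimal baseline for this setup (not the submission's final recipe, which additionally uses gender and semantic information) pools WavLM Large features over time and attaches a small classification head via Hugging Face transformers:

```python
# Minimal baseline sketch (not the submission's final recipe): WavLM Large as
# an encoder, mean-pooled over time, with a small emotion classification head.
import torch
import torch.nn as nn
from transformers import WavLMModel

class WavLMEmotionClassifier(nn.Module):
    def __init__(self, num_emotions=8, checkpoint="microsoft/wavlm-large"):
        super().__init__()
        self.encoder = WavLMModel.from_pretrained(checkpoint)
        hidden = self.encoder.config.hidden_size
        self.head = nn.Sequential(nn.Linear(hidden, 256), nn.ReLU(),
                                  nn.Linear(256, num_emotions))

    def forward(self, input_values):                  # (batch, samples) raw 16 kHz audio
        hidden_states = self.encoder(input_values).last_hidden_state
        pooled = hidden_states.mean(dim=1)            # average over time frames
        return self.head(pooled)

model = WavLMEmotionClassifier()
logits = model(torch.randn(2, 16000))                 # two 1-second dummy waveforms
print(logits.shape)                                   # torch.Size([2, 8])
```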
https://arxiv.org/abs/2405.04485
Room acoustic parameters (RAPs) and room physical parameters (RPPs) are essential metrics for parameterizing the room acoustical characteristics (RAC) of the sound field around a listener's local environment, offering comprehensive indications for various applications. Current RAP and RPP estimation methods either fall short of covering broad real-world acoustic environments with real background noise or lack a universal framework for blindly estimating RAPs and RPPs from noisy single-channel speech signals, particularly sound source distances, direction-of-arrival (DOA) of sound sources, and occupancy levels. To address this, we propose a novel universal blind estimation framework called the blind estimator of room acoustical and physical parameters (BERP). We introduce a new stochastic room impulse response (RIR) model, the sparse stochastic impulse response (SSIR) model, and endow BERP with a unified encoder and multiple separate predictors that estimate RPPs and SSIR parameters in parallel. This framework enables computationally efficient and universal estimation of room parameters using only noisy single-channel speech signals. Finally, all the RAPs can be simultaneously derived from the RIRs synthesized from the SSIR model with the estimated parameters. To evaluate the effectiveness of the proposed BERP and SSIR models, we compile a task-specific dataset from several publicly available datasets. The results reveal that BERP achieves state-of-the-art (SOTA) performance, and the evaluation results for the SSIR RIR model also demonstrate its efficacy. The code is available on GitHub.
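To make the RIR-centric idea concrete, here is an illustrative sketch only (the paper's SSIR parameterization and estimation procedure are more involved): an RIR is synthesized as sparse impulses under an exponential energy decay, and one standard RAP (RT60) is then derived from it by Schroeder backward integration.

```python
# Illustrative sketch only (the paper's SSIR model differs in detail):
# synthesize a room impulse response as sparse random impulses under an
# exponential decay, then derive one RAP (RT60) via Schroeder backward
# integration and a linear fit on the -5..-25 dB range.
import numpy as np

def synth_sparse_rir(fs=16000, length_s=0.6, t60=0.4, density_per_s=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = int(fs * length_s)
    t = np.arange(n) / fs
    rir = np.zeros(n)
    idx = rng.choice(n, size=int(density_per_s * length_s), replace=False)  # sparse arrival times
    rir[idx] = rng.standard_normal(idx.size)                                # random reflection gains
    rir *= 10 ** (-3.0 * t / t60)           # exponential decay: -60 dB at t = T60
    rir[0] = 1.0                            # direct path
    return rir, fs

def rt60_schroeder(rir, fs):
    edc = np.cumsum(rir[::-1] ** 2)[::-1]                # backward energy integration
    edc_db = 10 * np.log10(edc / edc[0] + 1e-12)
    t = np.arange(len(rir)) / fs
    mask = (edc_db <= -5) & (edc_db >= -25)              # fit the -5..-25 dB region
    slope, _ = np.polyfit(t[mask], edc_db[mask], 1)      # decay rate in dB per second
    return -60.0 / slope

rir, fs = synth_sparse_rir(t60=0.4)
print(f"estimated RT60 ~ {rt60_schroeder(rir, fs):.2f} s")  # close to the 0.4 s target
```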
https://arxiv.org/abs/2405.04476
Artificial neural networks (ANNs) perform extraordinarily well on numerous tasks, including classification and prediction, e.g., speech processing and image classification. These new functions are based on a computational model that is free to select all necessary internal model parameters, as long as it eventually delivers the functionality it is supposed to exhibit. Here, we review the connection between the model parameter selection in machine learning (ML) algorithms running on ANNs and the epistemological theory of neopragmatism, focusing on the theory's utility and anti-representationalist aspects. To understand the consequences of the model parameter selection of an ANN, we suggest using neopragmatist theories, whose implications are well studied. Incidentally, neopragmatism's notion of optimization is also based on utility considerations. Applying this approach therefore elegantly reveals the inherent connections between optimization in ML, where it takes the form of a numerical method during the learning phase, and optimization in the ethical theory of consequentialism, where it occurs as a maxim of action. We suggest that these connections originate from the way relevance is calculated in ML systems. This could ultimately reveal a tendency for specific actions in ML systems.
https://arxiv.org/abs/2405.04386
In the task of talking face generation, the objective is to generate a face video with lips synchronized to the corresponding audio while preserving visual details and identity information. Current methods face the challenge of learning accurate lip synchronization while avoiding detrimental effects on visual quality, as well as robustly evaluating such synchronization. To tackle these problems, we propose utilizing an audio-visual speech representation expert (AV-HuBERT) for calculating lip synchronization loss during training. Moreover, leveraging AV-HuBERT's features, we introduce three novel lip synchronization evaluation metrics, aiming to provide a comprehensive assessment of lip synchronization performance. Experimental results, along with a detailed ablation study, demonstrate the effectiveness of our approach and the utility of the proposed evaluation metrics.
https://arxiv.org/abs/2405.04327
Self-Supervised Learning (SSL) has proven to be useful in various speech tasks. However, these methods are generally very demanding in terms of data, memory, and computational resources. BERT-based Speech pre-Training with Random-projection Quantizer (BEST-RQ) is an SSL method that has shown great performance on Automatic Speech Recognition (ASR) while being simpler than other SSL methods, such as wav2vec 2.0. Despite BEST-RQ's great performance, details are lacking in the original paper, such as the number of GPU/TPU hours used for pre-training, and there is no official easy-to-use open-source implementation. Furthermore, BEST-RQ has not been evaluated on downstream tasks other than ASR and speech translation. In this work, we describe a re-implementation of a random-projection quantizer and perform a preliminary study comparing it to wav2vec 2.0 on four downstream tasks. We discuss the details and differences of our implementation. We show that a random-projection quantizer can achieve downstream performance similar to wav2vec 2.0 while reducing training time by more than a factor of two.
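The quantizer itself is simple enough to sketch: a frozen random projection followed by a nearest-neighbor lookup in a frozen random codebook, whose indices serve as discrete targets for BERT-style masked prediction. The NumPy sketch below is a simplified version and omits details such as per-utterance feature normalization.

```python
# Simplified sketch of a random-projection quantizer in the spirit of BEST-RQ:
# a frozen random projection maps each speech frame to a low-dimensional space,
# and the index of the nearest entry in a frozen random codebook becomes the
# discrete target for masked prediction. Nothing here is trained.
import numpy as np

class RandomProjectionQuantizer:
    def __init__(self, feat_dim=80, proj_dim=16, codebook_size=8192, seed=0):
        rng = np.random.default_rng(seed)
        self.proj = rng.standard_normal((feat_dim, proj_dim))      # frozen random projection
        codebook = rng.standard_normal((codebook_size, proj_dim))  # frozen random codebook
        self.codebook = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)

    def __call__(self, frames):                 # frames: (num_frames, feat_dim)
        z = frames @ self.proj                  # project each frame
        z = z / np.linalg.norm(z, axis=1, keepdims=True)
        # nearest codebook entry (normalized / cosine distance) -> discrete target id
        return np.argmax(z @ self.codebook.T, axis=1)

quantizer = RandomProjectionQuantizer()
mel_frames = np.random.randn(200, 80)           # e.g., 2 s of log-mel features
targets = quantizer(mel_frames)                 # (200,) integer labels in [0, 8192)
print(targets[:10])
```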
https://arxiv.org/abs/2405.04296
In this paper, we present an unsupervised single-channel method for joint blind dereverberation and room impulse response estimation, based on posterior sampling with diffusion models. We parameterize the reverberation operator using a filter with exponential decay for each frequency subband, and iteratively estimate the corresponding parameters as the speech utterance gets refined along the reverse diffusion trajectory. A measurement consistency criterion enforces the fidelity of the generated speech with the reverberant measurement, while an unconditional diffusion model implements a strong prior for clean speech generation. Without any knowledge of the room impulse response nor any coupled reverberant-anechoic data, we can successfully perform dereverberation in various acoustic scenarios. Our method significantly outperforms previous blind unsupervised baselines, and we demonstrate its increased robustness to unseen acoustic conditions in comparison to blind supervised methods. Audio samples and code are available online.
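A toy version of the reverberation operator described above can be written directly in the STFT domain: each frequency subband is convolved along time with an exponentially decaying filter defined by a per-band weight and decay rate. The sketch below shows only this forward operator; the diffusion-based posterior sampling that estimates these parameters jointly with the clean speech is not reproduced.

```python
# Toy sketch of the reverberation operator: in the STFT domain, each frequency
# subband k is convolved along time with a filter that decays exponentially,
# parameterized by a per-band weight w_k and decay rate a_k.
import numpy as np

def subband_reverb(spec, weights, decays, kernel_frames=50):
    """spec: magnitude (or complex) STFT, shape (freq_bins, time_frames)."""
    n_freq, n_time = spec.shape
    out = np.zeros_like(spec)
    taps = np.arange(kernel_frames)
    for k in range(n_freq):
        h_k = weights[k] * np.exp(-decays[k] * taps)    # exponential-decay filter for band k
        out[k] = np.convolve(spec[k], h_k)[:n_time]     # causal convolution along time
    return out

rng = np.random.default_rng(0)
dry = np.abs(rng.standard_normal((257, 300)))           # stand-in magnitude spectrogram
weights = np.full(257, 0.9)
decays = np.linspace(0.05, 0.3, 257)                    # higher bands decay faster
wet = subband_reverb(dry, weights, decays)
print(wet.shape)                                        # (257, 300), now "reverberant"
```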
https://arxiv.org/abs/2405.04272
When comparing speech sounds across languages, scholars often make use of feature representations of individual sounds in order to determine fine-grained sound similarities. Although binary feature systems for large numbers of speech sounds have been proposed, large-scale computational applications often face the challenge that the proposed feature systems -- even if they list features for several thousand sounds -- cover only a small part of the numerous speech sounds reflected in actual cross-linguistic data. In order to address the problem of missing data for attested speech sounds, we propose a new approach that can create binary feature vectors dynamically for all sounds that can be represented in the standardized version of the International Phonetic Alphabet proposed by the Cross-Linguistic Transcription Systems (CLTS) reference catalog. Since CLTS is actively used in large data collections covering more than 2,000 distinct language varieties, our procedure for the generation of binary feature vectors provides immediate access to a very large collection of multilingual wordlists. Testing our feature system in different ways on different datasets shows that the system is not only useful for providing a straightforward means to compare the similarity of speech sounds, but also has the potential to be used in future cross-linguistic machine learning applications.
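The dynamic idea can be illustrated with a deliberately tiny, hypothetical feature inventory (the actual system covers the full CLTS sound catalog and a much richer feature set): a sound's binary vector is composed on the fly from its base segment plus the modifications contributed by its diacritics.

```python
# Toy illustration of the dynamic idea (the real system covers the full CLTS
# catalog and a much richer feature set): a sound's binary feature vector is
# composed on the fly from its base segment plus diacritic modifications, so
# unseen combinations like aspirated or long consonants still receive vectors.
FEATURES = ["consonant", "voiced", "nasal", "labial", "aspirated", "long"]

BASE_SOUNDS = {
    "p": {"consonant": 1, "voiced": 0, "nasal": 0, "labial": 1},
    "b": {"consonant": 1, "voiced": 1, "nasal": 0, "labial": 1},
    "m": {"consonant": 1, "voiced": 1, "nasal": 1, "labial": 1},
}
DIACRITICS = {"ʰ": {"aspirated": 1}, "ː": {"long": 1}}

def feature_vector(sound: str):
    spec = dict.fromkeys(FEATURES, 0)
    spec.update(BASE_SOUNDS[sound[0]])            # base segment
    for mark in sound[1:]:                        # apply diacritics dynamically
        spec.update(DIACRITICS[mark])
    return [spec[f] for f in FEATURES]

def hamming_similarity(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

# "pʰ" never has to be listed explicitly; its vector is derived on demand.
print(feature_vector("pʰ"))                                          # [1, 0, 0, 1, 1, 0]
print(hamming_similarity(feature_vector("p"), feature_vector("b")))  # 5/6
```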
https://arxiv.org/abs/2405.04271
Suicide and suicidal behaviors remain significant challenges for public policy and healthcare. In response, psychological support hotlines have been established worldwide to provide immediate help to individuals in mental crises. The effectiveness of these hotlines largely depends on accurately identifying callers' emotional states, particularly underlying negative emotions indicative of increased suicide risk. However, the high demand for psychological interventions often results in a shortage of professional operators, highlighting the need for an effective speech emotion recognition model. Such a model would automatically detect and analyze callers' emotions, facilitating integration into hotline services. Additionally, it would enable large-scale analysis of psychological support hotline interactions to explore psychological phenomena and behaviors across populations. Our study utilizes data from the Beijing psychological support hotline, the largest suicide hotline in China. We analyzed speech data from 105 callers containing 20,630 segments and categorized them into 11 types of negative emotions. We developed a negative emotion recognition model and a fine-grained multi-label classification model using a large-scale pre-trained model. Our experiments indicate that the negative emotion recognition model achieves a maximum F1-score of 76.96%. However, it shows limited efficacy in the fine-grained multi-label classification task, with the best model achieving only a 41.74% weighted F1-score. We conducted an error analysis for this task, discussed potential future improvements, and considered the clinical application possibilities of our study. All code is publicly available.
https://arxiv.org/abs/2405.04128
State-of-the-art deep learning systems for speaker verification are commonly based on speaker embedding extractors. These architectures are usually composed of a feature extractor front-end together with a pooling layer that encodes variable-length utterances into fixed-length speaker vectors. The authors have recently proposed the use of Double Multi-Head Self-Attention pooling for speaker recognition, placed between a CNN-based front-end and a set of fully connected layers. This has been shown to be an excellent approach for efficiently selecting the most relevant features that the front-end captures from the speech signal. In this paper we show excellent experimental results by adapting this architecture to other speaker characterization tasks, such as emotion recognition, sex classification, and COVID-19 detection.
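A minimal sketch of single-stage multi-head self-attention pooling is given below; the "double" variant of the cited work stacks a second attention stage, which is omitted here.

```python
# Minimal sketch of multi-head self-attention pooling: each head learns
# attention weights over time frames, and the per-head weighted averages are
# concatenated into one fixed-length utterance vector. The "double" variant
# adds a second attention stage on top; that detail is omitted here.
import torch
import torch.nn as nn

class MultiHeadAttentionPooling(nn.Module):
    def __init__(self, feat_dim=256, num_heads=4):
        super().__init__()
        assert feat_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = feat_dim // num_heads
        self.query = nn.Parameter(torch.randn(num_heads, self.head_dim) * 0.01)  # one query per head

    def forward(self, x):                        # x: (batch, time, feat_dim)
        b, t, _ = x.shape
        x = x.view(b, t, self.num_heads, self.head_dim)
        scores = torch.einsum("bthd,hd->bth", x, self.query)
        w = torch.softmax(scores, dim=1).unsqueeze(-1)   # attention over time, per head
        pooled = (w * x).sum(dim=1)                       # (batch, heads, head_dim)
        return pooled.reshape(b, -1)                      # fixed-length utterance vector

frames = torch.randn(8, 300, 256)                # CNN front-end output: 300 frames, 256-dim
vector = MultiHeadAttentionPooling()(frames)
print(vector.shape)                              # torch.Size([8, 256])
```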
https://arxiv.org/abs/2405.04096
Graph representation learning has become a hot research topic due to its powerful nonlinear fitting capability in extracting representative node embeddings. However, for sequential data such as speech signals, most traditional methods merely focus on a static graph created within a sequence and largely overlook the intrinsic evolving patterns of these data. This may reduce the efficiency of graph representation learning for sequential data. For this reason, we propose an adaptive graph representation learning method based on dynamically evolved graphs, which are constructed consecutively on a series of subsequences segmented by a sliding window. This better captures local and global context information within a long sequence. Moreover, we introduce a weighted approach to updating the node representation, rather than the conventional averaging, where the weights are calculated by a novel matrix computation based on the degree of neighboring nodes. Finally, we construct a learnable graph convolutional layer that combines the graph structure loss and classification loss to optimize the graph structure. To verify the effectiveness of the proposed method, we conducted speech emotion recognition experiments on the IEMOCAP and RAVDESS datasets. Experimental results show that the proposed method outperforms the latest (non-)graph-based models.
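A hedged NumPy sketch of the two stated ingredients follows: graphs built consecutively over sliding-window subsequences, with edges from pairwise similarity, and a node update that weights neighbors by their degree instead of averaging them uniformly. The learnable graph convolutional layer and the joint structure/classification loss are not reproduced.

```python
# Hedged sketch: (1) graphs constructed consecutively over sliding-window
# subsequences of a feature sequence, with edges from pairwise similarity;
# (2) a node update that averages neighbors weighted by their degree rather
# than uniformly. The paper's learnable layer and losses are not shown.
import numpy as np

def sliding_window_graphs(seq, win=20, hop=10, sim_threshold=0.0):
    """seq: (num_frames, feat_dim). Returns a list of (nodes, adjacency) pairs."""
    graphs = []
    for start in range(0, len(seq) - win + 1, hop):
        nodes = seq[start:start + win]
        normed = nodes / (np.linalg.norm(nodes, axis=1, keepdims=True) + 1e-8)
        adj = (normed @ normed.T > sim_threshold).astype(float)  # connect positively correlated frames
        np.fill_diagonal(adj, 0.0)                               # no self-loops
        graphs.append((nodes, adj))
    return graphs

def degree_weighted_update(nodes, adj):
    """Aggregate neighbors weighted by their degree, rather than a plain mean."""
    degree = adj.sum(axis=1)
    weights = adj * degree[None, :]                   # weight each neighbor by its degree
    weights /= weights.sum(axis=1, keepdims=True) + 1e-8
    return weights @ nodes

rng = np.random.default_rng(0)
frames = rng.standard_normal((120, 40))               # e.g., 120 frames of 40-dim acoustic features
graphs = sliding_window_graphs(frames)
updated = [degree_weighted_update(nodes, adj) for nodes, adj in graphs]
print(len(graphs), updated[0].shape)                  # 11 graphs that evolve as the window slides
```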
https://arxiv.org/abs/2405.03956
Automatically detecting Alzheimer's Disease (AD) from spontaneous speech plays an important role in its early diagnosis. Recent approaches rely heavily on Transformer architectures due to their efficiency in modelling long-range context dependencies. However, the quadratic growth of self-attention's computational complexity with audio length poses a challenge when deploying such models on edge devices. In this context, we construct a novel framework, the Hierarchical Attention-Free Transformer (HAFFormer), to better deal with long speech for AD detection. Specifically, we replace self-attention with an attention-free Multi-Scale Depthwise Convolution module, avoiding this expensive computation, and replace the feedforward layer with a GELU-based Gated Linear Unit, aiming to automatically filter out redundant information. Moreover, we design a hierarchical structure that forces the model to learn information at a variety of granularities, from the frame level to the dialogue level. In extensive experiments on the ADReSS-M dataset, the introduced HAFFormer achieves results (82.6% accuracy) competitive with other recent work, but with a significant reduction in computational complexity and model size compared to the standard Transformer. This shows the efficiency of HAFFormer in dealing with long audio for AD detection.
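A rough sketch of the two stated replacements is given below (layer counts, dimensions, and the hierarchical frame-to-dialogue structure are assumptions, not the paper's configuration): token mixing by multi-scale depthwise convolutions, and a GELU-based gated linear unit in place of the feedforward layer.

```python
# Hedged sketch of the two stated replacements: self-attention is replaced by
# multi-scale depthwise convolutions (several kernel sizes, outputs averaged),
# and the feedforward layer by a GELU-based gated linear unit. Layer counts
# and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn

class MultiScaleDepthwiseConv(nn.Module):
    def __init__(self, dim=256, kernel_sizes=(3, 7, 15)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(dim, dim, k, padding=k // 2, groups=dim)   # depthwise: one filter per channel
             for k in kernel_sizes]
        )

    def forward(self, x):                    # x: (batch, time, dim)
        x = x.transpose(1, 2)                # -> (batch, dim, time) for Conv1d
        x = torch.stack([conv(x) for conv in self.convs]).mean(dim=0)
        return x.transpose(1, 2)

class GELUGatedLinearUnit(nn.Module):
    def __init__(self, dim=256, hidden=512):
        super().__init__()
        self.value = nn.Linear(dim, hidden)
        self.gate = nn.Linear(dim, hidden)
        self.out = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.out(self.value(x) * torch.nn.functional.gelu(self.gate(x)))

class AttentionFreeBlock(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mixer = MultiScaleDepthwiseConv(dim)
        self.ffn = GELUGatedLinearUnit(dim)

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))    # linear-cost token mixing instead of self-attention
        return x + self.ffn(self.norm2(x))

x = torch.randn(2, 1000, 256)                # a long utterance: 1000 frames
print(AttentionFreeBlock()(x).shape)         # torch.Size([2, 1000, 256])
```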
https://arxiv.org/abs/2405.03952
This paper introduces, to the best of the authors' knowledge, the first fine-grained temporal sparsity-aware keyword spotting (KWS) IC leveraging temporal similarities between neighboring feature vectors extracted from input frames and network hidden states, eliminating unnecessary operations and memory accesses. This KWS IC, featuring a bio-inspired delta-gated recurrent neural network ({\Delta}RNN) classifier, achieves an 11-class Google Speech Command Dataset (GSCD) KWS accuracy of 90.5% and energy consumption of 36nJ/decision. At 87% temporal sparsity, computing latency and energy per inference are reduced by 2.4$\times$/3.4$\times$, respectively. The 65nm design occupies 0.78mm$^2$ and features two additional blocks, a compact 0.084mm$^2$ digital infinite-impulse-response (IIR)-based band-pass filter (BPF) audio feature extractor (FEx) and a 24kB 0.6V near-Vth weight SRAM with 6.6$\times$ lower read power compared to the standard SRAM.
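Purely as a software illustration of the delta-gating idea exploited by the IC (the chip implements this in hardware, and its exact recurrent cell may differ): a hidden-state update only spends multiply-accumulates on input and state dimensions whose change since their last transmitted value exceeds a threshold, so temporally similar neighboring frames skip most of the work.

```python
# Conceptual sketch of delta gating (software illustration only; the IC
# exploits this sparsity in hardware and its recurrent cell may differ):
# skip multiply-accumulates for dimensions whose change since their last
# transmitted value is below a threshold.
import numpy as np

def delta_rnn_layer(inputs, in_dim, hidden_dim, threshold=0.1, seed=0):
    rng = np.random.default_rng(seed)
    W_x = rng.standard_normal((hidden_dim, in_dim)) * 0.1
    W_h = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1
    x_ref = np.zeros(in_dim)          # last transmitted input values
    h_ref = np.zeros(hidden_dim)      # last transmitted hidden values
    z = np.zeros(hidden_dim)          # accumulated pre-activation
    h = np.zeros(hidden_dim)
    skipped = total = 0
    for x_t in inputs:
        dx, dh = x_t - x_ref, h - h_ref
        ax, ah = np.abs(dx) > threshold, np.abs(dh) > threshold
        # Only the "active" columns cost multiply-accumulates.
        z += W_x[:, ax] @ dx[ax] + W_h[:, ah] @ dh[ah]
        x_ref[ax], h_ref[ah] = x_t[ax], h[ah]     # update references only where active
        h = np.tanh(z)
        skipped += (ax.size - ax.sum()) + (ah.size - ah.sum())
        total += ax.size + ah.size
    return h, skipped / total

# Slowly varying input: each random frame is repeated five times.
frames = np.repeat(np.random.default_rng(1).standard_normal((20, 40)), 5, axis=0)
_, sparsity = delta_rnn_layer(frames, in_dim=40, hidden_dim=64)
print(f"temporal sparsity: {sparsity:.0%}")       # repeated frames yield many skipped updates
```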
https://arxiv.org/abs/2405.03905