Benchmarking plays a pivotal role in assessing and enhancing the performance of compact deep learning models designed for execution on resource-constrained devices, such as microcontrollers. Our study introduces a novel, entirely artificially generated benchmarking dataset tailored for speech recognition, representing a core challenge in the field of tiny deep learning. SpokeN-100 consists of the numbers 0 to 99 spoken by 32 different speakers in four different languages, namely English, Mandarin, German and French, resulting in 12,800 audio samples. We extract auditory features and use UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction) as a dimensionality reduction method to show the diversity and richness of the dataset. To highlight the use case of the dataset, we introduce two benchmark tasks: given an audio sample, classify (i) the spoken language and/or (ii) the spoken number. We optimized state-of-the-art deep neural networks and performed an evolutionary neural architecture search to find tiny architectures optimized for the 32-bit ARM Cortex-M4 nRF52840 microcontroller. Our results represent the first benchmark figures reported for SpokeN-100.
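As a quick illustration of the feature-plus-UMAP step, the following sketch extracts MFCC summaries with librosa and projects them with umap-learn; the MFCC choice, the directory layout, and all parameters are assumptions for illustration, not the paper's exact pipeline.

```python
# Sketch: extract auditory features and project them with UMAP,
# assuming MFCC statistics as a stand-in for the auditory features used in the paper.
import glob
import numpy as np
import librosa
import umap  # pip install umap-learn

def mfcc_embedding(path, sr=16000, n_mfcc=20):
    """Load an audio file and summarize it as a fixed-length MFCC vector."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Hypothetical layout: spoken_numbers/<language>/<speaker>/<number>.wav
files = sorted(glob.glob("spoken_numbers/*/*/*.wav"))
X = np.stack([mfcc_embedding(f) for f in files])

# 2-D projection to visualize dataset diversity (languages, speakers).
reducer = umap.UMAP(n_components=2, random_state=0)
X_2d = reducer.fit_transform(X)
```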
https://arxiv.org/abs/2403.09753
This paper addresses the challenges and advancements in speech recognition for singing, a domain distinctly different from standard speech recognition. Singing encompasses unique challenges, including extensive pitch variations, diverse vocal styles, and background music interference. We explore key areas such as phoneme recognition, language identification in songs, keyword spotting, and full lyrics transcription. I will describe some of my own experiences when performing research on these tasks just as they were starting to gain traction, but will also show how recent developments in deep learning and large-scale datasets have propelled progress in this field. My goal is to illuminate the complexities of applying speech recognition to singing, evaluate current capabilities, and outline future research directions.
https://arxiv.org/abs/2403.09298
Conformer-based attention models have become the de facto backbone for Automatic Speech Recognition tasks. A blank symbol is usually introduced to align the input and output sequences for CTC or RNN-T models. Unfortunately, long input sequences inflate the computational budget and memory consumption of the attention mechanism, which scale quadratically with sequence length. In this work, we propose a "Skip-and-Recover" Conformer architecture, named Skipformer, to shrink the input sequence length dynamically and inhomogeneously. Skipformer uses an intermediate CTC output as the criterion to split frames into three groups: crucial, skipping and ignoring. The crucial group is fed into the subsequent Conformer blocks, and its output is joined with the skipping group in the original temporal order to form the final encoder output. Experiments show that our model reduces the input sequence length by a factor of 31 on Aishell-1 and 22 on the Librispeech corpus. Meanwhile, the model achieves better recognition accuracy and faster inference than recent baseline models. Our code is open-sourced and available online.
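A minimal sketch of the skip-and-recover idea, assuming the intermediate CTC blank probability is the splitting criterion and using illustrative thresholds; the paper's exact grouping rule and block wiring may differ.

```python
import torch

def skip_and_recover(frames, ctc_logprobs, conformer_blocks, blank_id=0,
                     keep_thresh=0.5, drop_thresh=0.99):
    """Sketch of Skipformer-style frame regrouping (thresholds are illustrative).

    frames:       (T, D) encoder frames at an intermediate layer
    ctc_logprobs: (T, V) intermediate CTC log-probabilities for those frames
    """
    blank_prob = ctc_logprobs.exp()[:, blank_id]      # P(blank) per frame
    crucial = blank_prob < keep_thresh                 # likely to carry a token
    ignoring = blank_prob > drop_thresh                # almost surely blank: drop
    skipping = ~crucial & ~ignoring                    # keep as-is, bypass attention

    # Only crucial frames pass through the remaining (quadratic-cost) blocks.
    out = frames.clone()
    out[crucial] = conformer_blocks(frames[crucial].unsqueeze(0)).squeeze(0)

    # Recover: crucial + skipping frames, in the original temporal order.
    return out[crucial | skipping]
```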
https://arxiv.org/abs/2403.08258
In the wake of the surging tide of deep learning over the past decade, Automatic Speech Recognition (ASR) has garnered substantial attention, leading to the emergence of numerous publicly accessible ASR systems that are actively being integrated into our daily lives. Nonetheless, the impartial and replicable evaluation of these ASR systems encounters challenges due to various crucial subtleties. In this paper, we introduce the SpeechColab Leaderboard, a general-purpose, open-source platform designed for ASR evaluation. With this platform: (i) We report a comprehensive benchmark, unveiling the current state-of-the-art panorama for ASR systems, covering both open-source models and industrial commercial services. (ii) We quantify how distinct nuances in the scoring pipeline influence the final benchmark outcomes. These include nuances related to capitalization, punctuation, interjections, contractions, synonym usage, compound words, etc. These issues have gained prominence in the context of the transition towards an End-to-End future. (iii) We propose a practical modification to the conventional Token-Error-Rate (TER) evaluation metric, with inspirations from Kolmogorov complexity and Normalized Information Distance (NID). This adaptation, called modified-TER (mTER), achieves proper normalization and symmetrical treatment of reference and hypothesis. By leveraging this platform as a large-scale testing ground, this study demonstrates the robustness and backward compatibility of mTER when compared to TER. The SpeechColab Leaderboard is accessible at this https URL
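The asymmetry the authors target is easy to see in code: conventional TER normalizes the edit distance by the reference length only, so swapping reference and hypothesis changes the score and the value can exceed 1. The sketch below contrasts plain TER with an NID-style normalization by the longer of the two sequences; this normalization is only an illustration of the idea, and the paper should be consulted for the exact mTER formula.

```python
def edit_distance(ref, hyp):
    """Token-level Levenshtein distance via dynamic programming."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1]

def ter(ref, hyp):
    # Conventional TER: asymmetric, and can exceed 1.0 for long hypotheses.
    return edit_distance(ref, hyp) / len(ref)

def mter_like(ref, hyp):
    # Illustrative NID-style normalization: symmetric and bounded in [0, 1].
    # (Consult the paper for the actual mTER definition.)
    return edit_distance(ref, hyp) / max(len(ref), len(hyp))

ref = "the cat sat on the mat".split()
hyp = "the cat sat on on the mat today".split()
print(ter(ref, hyp), ter(hyp, ref))              # differ: asymmetric
print(mter_like(ref, hyp), mter_like(hyp, ref))  # identical: symmetric
```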
https://arxiv.org/abs/2403.08196
This study presents an automatic speech recognition (ASR) model designed to diagnose pronunciation issues in children with speech sound disorders (SSDs) to replace manual transcriptions in clinical procedures. Since ASR models trained for general purposes primarily map input speech to real words, employing a well-known high-performance ASR model for evaluating pronunciation in children with SSDs is impractical. We fine-tuned the wav2vec 2.0 XLS-R model to recognize speech as pronounced rather than as existing words. The model was fine-tuned with a speech dataset from 137 children with inadequate speech production pronouncing 73 Korean words selected for actual clinical diagnosis. The model's predictions of the pronunciations of the words matched the human annotations with about 90% accuracy. While the model still requires improvement in recognizing unclear pronunciation, this study demonstrates that ASR models can streamline complex pronunciation error diagnostic procedures in clinical fields.
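A hedged sketch of this kind of fine-tuning with the Hugging Face transformers API: a CTC head over a phone-level vocabulary on top of a public XLS-R checkpoint, so the model outputs what was pronounced rather than the nearest real word. The checkpoint name, the vocab_phones.json file, and the placeholder data are illustrative assumptions, not the authors' exact setup.

```python
import numpy as np
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor, Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor

tokenizer = Wav2Vec2CTCTokenizer("vocab_phones.json",   # phone/jamo inventory, not word vocabulary
                                 unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|")
processor = Wav2Vec2Processor(
    feature_extractor=Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                               padding_value=0.0, return_attention_mask=True),
    tokenizer=tokenizer)

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m",                      # one public XLS-R checkpoint
    ctc_loss_reduction="mean",
    pad_token_id=tokenizer.pad_token_id,
    vocab_size=len(tokenizer))
model.freeze_feature_encoder()                           # keep the CNN front-end frozen

# One illustrative training step on a (waveform, transcript-as-pronounced) pair.
waveform = np.zeros(16000, dtype=np.float32)             # placeholder 1-second waveform
pronounced_transcript = "b a d a"                        # placeholder phone-level transcript
inputs = processor(waveform, sampling_rate=16000, return_tensors="pt")
labels = tokenizer(pronounced_transcript, return_tensors="pt").input_ids
loss = model(input_values=inputs.input_values, labels=labels).loss
loss.backward()
```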
https://arxiv.org/abs/2403.08187
An important and difficult task in code-switched speech recognition is to recognize the language, as many words in the two languages can sound similar, especially in some accents. We focus on improving the performance of end-to-end Automatic Speech Recognition models by conditioning transformer layers on the language ID of words and characters in the output, in a per-layer supervised manner. To this end, we propose two methods of introducing language-specific parameters and explainability in the multi-head attention mechanism, and implement a Temporal Loss that helps maintain continuity in input alignment. Despite being unable to reduce WER significantly, our method shows promise in predicting the correct language from spoken data alone. We introduce regularization in the language prediction by dropping the LID in the sequence, which helps align long repeated output sequences.
https://arxiv.org/abs/2403.08011
Speech technology is a field that encompasses various techniques and tools used to enable machines to interact with speech, such as automatic speech recognition (ASR), spoken dialog systems, and others, allowing a device to capture spoken words through a microphone from a human speaker. End-to-end approaches such as Connectionist Temporal Classification (CTC) and attention-based methods are the most used for the development of ASR systems. However, these techniques have commonly been used for research and development on high-resourced languages with large amounts of speech data for training and evaluation, leaving low-resource languages relatively underdeveloped. While the CTC method has been successfully used for other languages, its effectiveness for the Sepedi language remains uncertain. In this study, we present the evaluation of a Sepedi-English code-switched automatic speech recognition system. This end-to-end system was developed using the Sepedi Prompted Code Switching corpus and the CTC approach. The performance of the system was evaluated using both the NCHLT Sepedi test corpus and the Sepedi Prompted Code Switching corpus. The model produced its lowest WER of 41.9%; however, it struggled to recognize Sepedi-only text.
https://arxiv.org/abs/2403.07947
Emergency Medical Services (EMS) responders often operate under time-sensitive conditions, facing cognitive overload and inherent risks, requiring essential skills in critical thinking and rapid decision-making. This paper presents CognitiveEMS, an end-to-end wearable cognitive assistant system that can act as a collaborative virtual partner engaging in the real-time acquisition and analysis of multimodal data from an emergency scene and interacting with EMS responders through Augmented Reality (AR) smart glasses. CognitiveEMS processes the continuous streams of data in real-time and leverages edge computing to provide assistance in EMS protocol selection and intervention recognition. We address key technical challenges in real-time cognitive assistance by introducing three novel components: (i) a Speech Recognition model that is fine-tuned for real-world medical emergency conversations using simulated EMS audio recordings, augmented with synthetic data generated by large language models (LLMs); (ii) an EMS Protocol Prediction model that combines state-of-the-art (SOTA) tiny language models with EMS domain knowledge using graph-based attention mechanisms; (iii) an EMS Action Recognition module which leverages multimodal audio and video data and protocol predictions to infer the intervention/treatment actions taken by the responders at the incident scene. Our results show that for speech recognition we achieve superior performance compared to SOTA (WER of 0.290 vs. 0.618) on conversational data. Our protocol prediction component also significantly outperforms SOTA (top-3 accuracy of 0.800 vs. 0.200) and the action recognition achieves an accuracy of 0.727, while maintaining an end-to-end latency of 3.78s for protocol prediction on the edge and 0.31s on the server.
https://arxiv.org/abs/2403.06734
It has been shown that the intelligibility of noisy speech can be improved by speech enhancement (SE) algorithms. However, monaural SE has not been established as an effective frontend for automatic speech recognition (ASR) in noisy conditions compared to an ASR model trained on noisy speech directly. The divide between SE and ASR impedes the progress of robust ASR systems, especially as SE has made major advances in recent years. This paper focuses on eliminating this divide with an ARN (attentive recurrent network) time-domain enhancement model and a CrossNet time-frequency domain enhancement model. The proposed systems fully decouple the frontend enhancement from the backend ASR, which is trained only on clean speech. Results on the WSJ, CHiME-2, LibriSpeech, and CHiME-4 corpora demonstrate that ARN- and CrossNet-enhanced speech both translate to improved ASR results in noisy and reverberant environments, and generalize well to real acoustic scenarios. The proposed system outperforms the baselines trained on corrupted speech directly. Furthermore, it cuts the previous best word error rate (WER) on CHiME-2 by $28.4\%$ relatively with a $5.57\%$ WER, and achieves $3.32/4.44\%$ WER on single-channel CHiME-4 simulated/real test data without training on CHiME-4.
https://arxiv.org/abs/2403.06387
There is a growing interest in cost-effective self-supervised fine-tuning (SSFT) of self-supervised learning (SSL)-based speech models to obtain task-specific representations. These task-specific representations are used for robust performance on various downstream tasks by fine-tuning on the labelled data. This work presents a cost-effective SSFT method named Self-supervised Correspondence (SCORE) fine-tuning to adapt the SSL speech representations for content-related tasks. The proposed method uses a correspondence training strategy, aiming to learn similar representations from perturbed speech and original speech. Commonly used data augmentation techniques for content-related tasks (ASR) are applied to obtain perturbed speech. SCORE-fine-tuned HuBERT outperforms vanilla HuBERT on the SUPERB benchmark with only a few hours of fine-tuning (< 5 hrs) on a single GPU for automatic speech recognition, phoneme recognition, and query-by-example tasks, with relative improvements of 1.09%, 3.58%, and 12.65%, respectively. SCORE provides competitive results with the recently proposed SSFT method SPIN, using only 1/3 of the processed speech compared to SPIN.
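A minimal sketch of the correspondence idea: pull frame-level representations of perturbed speech toward those of the original utterance. The cosine objective, the additive-noise perturbation, and the HuBERT checkpoint are assumptions for illustration; SCORE's actual loss and augmentation pipeline are described in the paper.

```python
import torch
from transformers import HubertModel

model = HubertModel.from_pretrained("facebook/hubert-base-ls960")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def perturb(wave, noise_std=0.01):
    """Illustrative stand-in for the common ASR-style augmentations used in the paper."""
    return wave + noise_std * torch.randn_like(wave)

def correspondence_loss(wave):
    clean = model(wave).last_hidden_state          # (1, T, D) frame representations
    noisy = model(perturb(wave)).last_hidden_state
    cos = torch.nn.functional.cosine_similarity(clean, noisy, dim=-1)
    return (1.0 - cos).mean()                      # pull perturbed frames toward the originals

wave = torch.randn(1, 16000)                       # placeholder 1-second waveform at 16 kHz
loss = correspondence_loss(wave)
loss.backward()
optimizer.step()
```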
https://arxiv.org/abs/2403.06260
As Automatic Speech Recognition (ASR) models become ever more pervasive, it is important to ensure that they make reliable predictions under corruptions present in the physical and digital world. We propose Speech Robust Bench (SRB), a comprehensive benchmark for evaluating the robustness of ASR models to diverse corruptions. SRB is composed of 69 input perturbations which are intended to simulate various corruptions that ASR models may encounter in the physical and digital world. We use SRB to evaluate the robustness of several state-of-the-art ASR models and observe that model size and certain modeling choices, such as discrete representations and self-training, appear to be conducive to robustness. We extend this analysis to measure the robustness of ASR models on data from various demographic subgroups, namely English and Spanish speakers, and males and females, and observe noticeable disparities in robustness across subgroups. We believe that SRB will facilitate future research towards robust ASR models by making it easier to conduct comprehensive and comparable robustness evaluations.
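The basic unit of such a benchmark is easy to sketch: corrupt an utterance, transcribe it, and measure the WER degradation. The snippet below does this for additive white noise at a few SNRs using Whisper and jiwer; the model, the perturbation, the threshold values, and the file names are illustrative, and SRB itself covers 69 perturbations across many models.

```python
# Sketch: measure how one corruption (white noise at a target SNR) degrades WER.
import numpy as np
import jiwer
import whisper  # pip install openai-whisper

def add_noise(audio, snr_db):
    """Mix white noise into a waveform at a target signal-to-noise ratio."""
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return audio + np.sqrt(noise_power) * np.random.randn(*audio.shape)

model = whisper.load_model("base")
audio = whisper.load_audio("utterance.wav")          # placeholder test utterance
reference = "the reference transcript goes here"     # placeholder reference text

for snr in [20, 10, 0]:
    corrupted = add_noise(audio, snr).astype(np.float32)
    hypothesis = model.transcribe(corrupted)["text"]
    print(f"SNR {snr:>3} dB  WER = {jiwer.wer(reference, hypothesis):.3f}")
```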
https://arxiv.org/abs/2403.07937
Since the foundational work of William Labov on the social stratification of language (Labov, 1964), linguistics has made concentrated efforts to explore the links between sociodemographic characteristics and language production and perception. But while there is strong evidence for socio-demographic characteristics in language, they are infrequently used in Natural Language Processing (NLP). Age and gender are somewhat well represented, but Labov's original target, socioeconomic status, is noticeably absent. And yet it matters. We show empirically that NLP disadvantages less-privileged socioeconomic groups. We annotate a corpus of 95K utterances from movies with social class, ethnicity and geographical language variety and measure the performance of NLP systems on three tasks: language modelling, automatic speech recognition, and grammar error correction. We find significant performance disparities that can be attributed to socioeconomic status as well as ethnicity and geographical differences. With NLP technologies becoming ever more ubiquitous and quotidian, they must accommodate all language varieties to avoid disadvantaging already marginalised groups. We argue for the inclusion of socioeconomic class in future language technologies.
https://arxiv.org/abs/2403.04445
This work is an attempt to introduce a comprehensive benchmark for Arabic speech recognition, specifically tailored to address the challenges of telephone conversations in Arabic language. Arabic, characterized by its rich dialectal diversity and phonetic complexity, presents a number of unique challenges for automatic speech recognition (ASR) systems. These challenges are further amplified in the domain of telephone calls, where audio quality, background noise, and conversational speech styles negatively affect recognition accuracy. Our work aims to establish a robust benchmark that not only encompasses the broad spectrum of Arabic dialects but also emulates the real-world conditions of call-based communications. By incorporating diverse dialectical expressions and accounting for the variable quality of call recordings, this benchmark seeks to provide a rigorous testing ground for the development and evaluation of ASR systems capable of navigating the complexities of Arabic speech in telephonic contexts. This work also attempts to establish a baseline performance evaluation using state-of-the-art ASR technologies.
https://arxiv.org/abs/2403.04280
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames, performing even worse than single-modality models. While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input. In this paper, we investigate this contrasting phenomenon from the perspective of modality bias and reveal that an excessive modality bias on the audio caused by dropout is the underlying reason. Moreover, we present the Modality Bias Hypothesis (MBH) to systematically describe the relationship between modality bias and robustness against missing modality in multimodal systems. Building on these findings, we propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality and to maintain performance and robustness simultaneously. Finally, to address an entirely missing modality, we adopt adapters to dynamically switch decision strategies. The effectiveness of our proposed approach is evaluated and validated through a series of comprehensive experiments using the MISP2021 and MISP2022 datasets. Our code is available at this https URL
https://arxiv.org/abs/2403.04245
Radio advertising remains an integral part of modern marketing strategies, with its appeal and potential for targeted reach undeniably effective. However, the dynamic nature of radio airtime and the rising trend of multiple radio spots necessitate an efficient system for monitoring advertisement broadcasts. This study investigates RadIA, a novel automated radio advertisement detection technique incorporating advanced speech recognition and text classification algorithms. RadIA's approach surpasses traditional methods by eliminating the need for prior knowledge of the broadcast content. This contribution allows for detecting impromptu and newly introduced advertisements, providing a comprehensive solution for advertisement detection in radio broadcasting. Experimental results show that the resulting model, trained on carefully segmented and tagged text data, achieves an F1-macro score of 87.76 against a theoretical maximum of 89.33. This paper provides insights into the choice of hyperparameters and their impact on the model's performance. This study demonstrates the system's potential to ensure compliance with advertising broadcast contracts and to offer competitive surveillance. This groundbreaking research could fundamentally change how radio advertising is monitored and open new doors for marketing optimization.
https://arxiv.org/abs/2403.03538
Non-verbal signals in speech are encoded by prosody and carry information that ranges from conversation action to attitude and emotion. Despite its importance, the principles that govern prosodic structure are not yet adequately understood. This paper offers an analytical schema and a technological proof-of-concept for the categorization of prosodic signals and their association with meaning. The schema interprets surface-representations of multi-layered prosodic events. As a first step towards implementation, we present a classification process that disentangles prosodic phenomena of three orders. It relies on fine-tuning a pre-trained speech recognition model, enabling the simultaneous multi-class/multi-label detection. It generalizes over a large variety of spontaneous data, performing on a par with, or superior to, human annotation. In addition to a standardized formalization of prosody, disentangling prosodic patterns can direct a theory of communication and speech organization. A welcome by-product is an interpretation of prosody that will enhance speech- and language-related technologies.
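One plausible way to realize simultaneous multi-class/multi-label detection is a sigmoid head with a binary cross-entropy loss on top of a pretrained speech encoder, as sketched below; the encoder checkpoint and the label set are illustrative stand-ins for the paper's three orders of prosodic phenomena.

```python
# Sketch: multi-label prosodic event detection via a sigmoid head on a speech encoder.
import torch
from transformers import Wav2Vec2Model

LABELS = ["boundary", "prominence", "question_rise", "emphatic_fall"]  # illustrative label set

class ProsodyTagger(torch.nn.Module):
    def __init__(self, n_labels=len(LABELS)):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, n_labels)

    def forward(self, wave):
        hidden = self.encoder(wave).last_hidden_state.mean(dim=1)  # utterance-level pooling
        return self.head(hidden)                                   # one logit per label

model = ProsodyTagger()
wave = torch.randn(1, 16000)                        # placeholder waveform
targets = torch.tensor([[1., 0., 1., 0.]])          # several labels can be active at once
loss = torch.nn.functional.binary_cross_entropy_with_logits(model(wave), targets)
loss.backward()
```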
https://arxiv.org/abs/2403.03522
Since humans can listen to audio and watch videos at faster speeds than actually observed, we often listen to or watch these pieces of content at higher playback speeds to increase the time efficiency of content comprehension. To further exploit this capability, systems have been developed that automatically adjust the playback speed according to the user's condition and the type of content, assisting more efficient comprehension of time-series content. However, there is still room for these systems to further extend human speed-listening ability by generating speech with playback speed optimized for even finer time units and providing it to humans. In this study, we determine whether humans can hear the optimized speech and propose a system that automatically adjusts playback speed at units as small as phonemes while ensuring speech intelligibility. The system uses the speech recognizer score as a proxy for how well a human can hear a certain unit of speech and maximizes the speech playback speed to the extent that a human can hear. This method can be used to produce fast but intelligible speech. In the evaluation experiment, we compared speech played back at a constant fast speed with the flexibly sped-up speech generated by the proposed method in a blind test and confirmed that the proposed method produced speech that was easier to listen to.
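A hedged sketch of the core loop: for each phoneme-sized segment, pick the largest speed-up whose recognizer-based score stays above an intelligibility threshold. The time-stretching method, the candidate rates, the threshold, and the score_fn interface are assumptions; the paper uses a speech recognizer score as the intelligibility proxy.

```python
import numpy as np
import librosa

def fastest_intelligible(segment, sr, score_fn, threshold=0.8,
                         rates=(3.0, 2.5, 2.0, 1.5, 1.2, 1.0)):
    """Return the segment stretched at the highest rate whose recognizer score stays above threshold."""
    for rate in rates:                              # try the most aggressive speed-up first
        stretched = librosa.effects.time_stretch(segment, rate=rate)
        if score_fn(stretched, sr) >= threshold:
            return stretched, rate
    return segment, 1.0                             # fall back to the original speed

def speed_up_utterance(audio, sr, phoneme_boundaries, score_fn):
    """phoneme_boundaries: list of (start_sec, end_sec), e.g. from a forced aligner (assumed given)."""
    pieces = []
    for start, end in phoneme_boundaries:
        seg = audio[int(start * sr):int(end * sr)]
        stretched, _ = fastest_intelligible(seg, sr, score_fn)
        pieces.append(stretched)
    return np.concatenate(pieces)
```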
https://arxiv.org/abs/2403.02938
The paper reports on a series of experiments aiming at probing LeBenchmark, a pretrained acoustic model trained on 7k hours of spoken French, for syntactic information. Pretrained acoustic models are increasingly used for downstream speech tasks such as automatic speech recognition, speech translation, spoken language understanding or speech parsing. They are trained on very low-level information (the raw speech signal), and do not have explicit lexical knowledge. Despite that, they obtained reasonable results on tasks that require higher-level linguistic knowledge. As a result, an emerging question is whether these models encode syntactic information. We probe each representation layer of LeBenchmark for syntax, using the Orféo treebank, and observe that it has learnt some syntactic information. Our results show that syntactic information is more easily extractable from the middle layers of the network, after which a very sharp decrease is observed.
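A minimal layer-wise probing sketch, assuming utterance-level syntactic labels and a simple logistic-regression probe; the checkpoint name, the pooling, and the probing target are illustrative simplifications of the Orféo-based setup.

```python
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from transformers import Wav2Vec2Model

encoder = Wav2Vec2Model.from_pretrained("LeBenchmark/wav2vec2-FR-7K-base")  # checkpoint name assumed

def layer_features(waveforms):
    """Return mean-pooled features for every layer: one (n_utterances, hidden) array per layer."""
    with torch.no_grad():
        out = encoder(torch.stack(waveforms), output_hidden_states=True)
    return [h.mean(dim=1).numpy() for h in out.hidden_states]

# waveforms: list of fixed-length 1-D tensors; labels: one syntactic class per utterance (assumed given)
features_per_layer = layer_features(waveforms)
for layer, X in enumerate(features_per_layer):
    probe_acc = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5).mean()
    print(f"layer {layer:2d}: probing accuracy {probe_acc:.3f}")
```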
https://arxiv.org/abs/2403.02173
Multi-talker automatic speech recognition plays a crucial role in scenarios involving multi-party interactions, such as meetings and conversations. Due to its inherent complexity, this task has been receiving increasing attention. Notably, serialized output training (SOT) stands out among various approaches because of its simple architecture and exceptional performance. However, the frequent speaker changes in token-level SOT (t-SOT) present challenges for the autoregressive decoder in effectively utilizing context to predict output sequences. To address this issue, we introduce a masked t-SOT label, which serves as the cornerstone of an auxiliary training loss. Additionally, we utilize a speaker similarity matrix to refine the self-attention mechanism of the decoder. This strategic adjustment enhances contextual relationships within the same speaker's tokens while minimizing interactions between different speakers' tokens. We denote our method as speaker-aware SOT (SA-SOT). Experiments on the Librispeech datasets demonstrate that our SA-SOT obtains a relative cpWER reduction ranging from 12.75% to 22.03% on the multi-talker test sets. Furthermore, with more extensive training, our method achieves an impressive cpWER of 3.41%, establishing a new state-of-the-art result on the LibrispeechMix dataset.
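One way to picture the speaker-similarity refinement is as an additive bias on the decoder's self-attention logits, so tokens of the same speaker attend to each other more strongly than tokens of different speakers; the log-space bias and the toy similarity matrix below are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def speaker_aware_attention(q, k, v, speaker_sim, eps=1e-6):
    """
    q, k, v:     (B, H, L, d) decoder self-attention projections
    speaker_sim: (B, L, L) similarity in [0, 1] between the speakers of output tokens i and j
    """
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5              # (B, H, L, L)
    bias = torch.log(speaker_sim + eps).unsqueeze(1)         # ~0 for same speaker, very negative otherwise
    attn = torch.softmax(scores + bias, dim=-1)
    return attn @ v

B, H, L, d = 2, 4, 10, 64
q = k = v = torch.randn(B, H, L, d)
same_speaker = torch.randint(0, 2, (B, L))                   # toy speaker assignment per token
speaker_sim = (same_speaker.unsqueeze(2) == same_speaker.unsqueeze(1)).float()
out = speaker_aware_attention(q, k, v, speaker_sim)
print(out.shape)                                             # torch.Size([2, 4, 10, 64])
```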
https://arxiv.org/abs/2403.02010
Kurdish, an Indo-European language spoken by over 30 million speakers, is considered a dialect continuum and known for its diversity in language varieties. Previous studies addressing language and speech technology for Kurdish handle it in a monolithic way as a macro-language, resulting in disparities for dialects and varieties for which there are few resources and tools available. In this paper, we take a step towards developing resources for language and speech technology for varieties of Central Kurdish, creating a corpus by transcribing movies and TV series as an alternative to fieldwork. Additionally, we report the performance of machine translation, automatic speech recognition, and language identification as downstream tasks evaluated on Central Kurdish varieties. Data and models are publicly available under an open license at this https URL.
https://arxiv.org/abs/2403.01983