Due to the subjective nature of current clinical evaluation, the need for automatic severity evaluation in dysarthric speech has emerged. DNN models outperform ML models but lack user-friendly explainability. ML models offer explainable results at a feature level, but their performance is comparatively lower. Current ML models extract various features from raw waveforms to predict severity. However, existing methods do not encompass all dysarthric features used in clinical evaluation. To address this gap, we propose a feature extraction method that minimizes information loss. We introduce ASR transcription as a novel feature extraction source. We fine-tune the ASR model for dysarthric speech, then use this model to transcribe dysarthric speech and extract word segment boundary information. This enables capturing finer pronunciation and broader prosodic features. These features demonstrated improved severity prediction performance over existing features, with a balanced accuracy of 83.72%.
https://arxiv.org/abs/2412.03784
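A minimal sketch of the kind of feature extraction described above, assuming the fine-tuned ASR model already returns word-level segments as (word, start, end) tuples. The specific pause/rate features and the RandomForestClassifier are illustrative choices, not the authors' exact feature set or classifier.

```python
from dataclasses import dataclass
from typing import List
import numpy as np
from sklearn.ensemble import RandomForestClassifier

@dataclass
class WordSegment:
    word: str
    start: float  # seconds
    end: float    # seconds

def boundary_features(segments: List[WordSegment]) -> np.ndarray:
    """Derive prosody/pronunciation proxies from ASR word boundaries."""
    durations = np.array([s.end - s.start for s in segments])
    pauses = np.array([b.start - a.end for a, b in zip(segments, segments[1:])])
    total_time = segments[-1].end - segments[0].start
    return np.array([
        len(segments) / total_time,                  # speech rate (words/sec)
        durations.mean(), durations.std(),           # word-duration statistics
        pauses.mean() if len(pauses) else 0.0,
        pauses.max() if len(pauses) else 0.0,
        (pauses > 0.3).sum() if len(pauses) else 0,  # count of long pauses
    ])

# Hypothetical usage: X stacks per-utterance features, y holds severity labels.
# X = np.stack([boundary_features(asr_transcribe(wav)) for wav in wavs])
# clf = RandomForestClassifier().fit(X, y)
```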
Automatic Speech Recognition (ASR) is a fundamental and important task in the field of speech and natural language processing. It is an inherent building block of many applications such as voice assistants, speech translation, etc. Despite the advancement of ASR technologies in recent years, modern ASR systems still inevitably produce a substantial number of recognition errors due to environmental noise, ambiguity, etc. Therefore, error correction in ASR is crucial. Motivated by this, this paper studies ASR error correction in the Chinese language, which is one of the most popular languages and has a large number of users worldwide. We first create a benchmark dataset named \emph{ASR-EC} that contains a wide spectrum of ASR errors generated by industry-grade ASR systems. To the best of our knowledge, it is the first Chinese ASR error correction benchmark. Then, inspired by the recent advances in \emph{large language models (LLMs)}, we investigate how to harness the power of LLMs to correct ASR errors. We apply LLMs to ASR error correction in three paradigms. The first paradigm is prompting, which is further categorized into zero-shot, few-shot, and multi-step variants. The second paradigm is finetuning, which finetunes LLMs with ASR error correction data. The third paradigm is multi-modal augmentation, which collectively utilizes the audio and ASR transcripts for error correction. Extensive experiments reveal that prompting is not effective for ASR error correction. Finetuning is effective only for a portion of LLMs. Multi-modal augmentation is the most effective method for error correction and achieves state-of-the-art performance.
https://arxiv.org/abs/2412.03075
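The prompting paradigm above can be sketched as follows. `llm_generate` is a stand-in for whatever LLM completion call is used, and the prompt wording is illustrative rather than taken from the paper.

```python
from typing import Callable, List, Tuple

def zero_shot_prompt(hypothesis: str) -> str:
    return ("The following is a Chinese ASR transcript that may contain "
            f"recognition errors. Output the corrected text only.\n{hypothesis}")

def few_shot_prompt(hypothesis: str, examples: List[Tuple[str, str]]) -> str:
    demos = "\n".join(f"ASR: {h}\nCorrected: {r}" for h, r in examples)
    return f"{demos}\nASR: {hypothesis}\nCorrected:"

def multi_step_correct(hypothesis: str, llm_generate: Callable[[str], str]) -> str:
    # Step 1: ask the model to locate likely errors; step 2: ask it to rewrite.
    errors = llm_generate(f"List likely ASR errors in: {hypothesis}")
    return llm_generate(
        f"Transcript: {hypothesis}\nSuspected errors: {errors}\n"
        "Rewrite the transcript with these errors fixed, output only the text.")
```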
We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken chatbot. It supports both Chinese and English, engages in real-time voice conversations, and varies vocal nuances such as emotion, intonation, speech rate, and dialect according to user instructions. GLM-4-Voice uses an ultra-low bitrate (175bps), single-codebook speech tokenizer with 12.5Hz frame rate derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. To efficiently transfer knowledge from text to speech modalities, we synthesize speech-text interleaved data from existing text pre-training corpora using a text-to-token model. We continue pre-training from the pre-trained text language model GLM-4-9B with a combination of unsupervised speech data, interleaved speech-text data, and supervised speech-text data, scaling up to 1 trillion tokens, achieving state-of-the-art performance in both speech language modeling and spoken question answering. We then fine-tune the pre-trained model with high-quality conversational speech data, achieving superior performance compared to existing baselines in both conversational ability and speech quality. The open models can be accessed through this https URL and this https URL.
https://arxiv.org/abs/2412.02612
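A minimal PyTorch sketch of the idea of a vector-quantized bottleneck inside an ASR encoder, which yields a single-codebook speech tokenizer. The codebook size, dimensions, loss weights, and the straight-through estimator are generic VQ-VAE-style choices, not GLM-4-Voice's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQBottleneck(nn.Module):
    """Single-codebook vector quantizer with a straight-through estimator."""
    def __init__(self, codebook_size: int = 16384, dim: int = 512, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        self.beta = beta

    def forward(self, h: torch.Tensor):
        # h: (batch, frames, dim) encoder states, e.g. downsampled to a 12.5 Hz rate
        w = self.codebook.weight
        d = (h.pow(2).sum(-1, keepdim=True)       # squared distances to all codes
             - 2 * h @ w.t() + w.pow(2).sum(-1))
        ids = d.argmin(dim=-1)                    # discrete speech tokens (batch, frames)
        q = self.codebook(ids)
        # VQ-VAE-style codebook + commitment losses, straight-through gradient.
        loss = F.mse_loss(q, h.detach()) + self.beta * F.mse_loss(h, q.detach())
        q = h + (q - h).detach()
        return q, ids, loss

# q, ids, vq_loss = VQBottleneck()(torch.randn(2, 25, 512))
```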
This work introduces the first framework for reconstructing surgical dialogue from unstructured real-world recordings, which is crucial for characterizing teaching tasks. In surgical training, the formative verbal feedback that trainers provide to trainees during live surgeries is crucial for ensuring safety, correcting behavior immediately, and facilitating long-term skill acquisition. However, analyzing and quantifying this feedback is challenging due to its unstructured and specialized nature. Automated systems are essential to manage these complexities at scale, allowing for the creation of structured datasets that enhance feedback analysis and improve surgical education. Our framework integrates voice activity detection, speaker diarization, and automated speech recognition, with a novel enhancement that 1) removes hallucinations (non-existent utterances generated during speech recognition, fueled by noise in the operating room) and 2) separates speech from trainers and trainees using few-shot voice samples. These aspects are vital for reconstructing accurate surgical dialogues and understanding the roles of operating room participants. Using data from 33 real-world surgeries, we demonstrated the system's capability to reconstruct surgical teaching dialogues and detect feedback instances effectively (F1 score of 0.79+/-0.07). Moreover, our hallucination removal step improves feedback detection performance by ~14%. Evaluation on the downstream clinically relevant tasks of predicting Behavioral Adjustment of trainees and classifying Technical feedback showed performance comparable to manual annotations, with F1 scores of 0.82+/-0.03 and 0.81+/-0.03 respectively. These results highlight the effectiveness of our framework in supporting clinically relevant tasks and improving over manual methods.
https://arxiv.org/abs/2412.00760
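One plausible form of the hallucination-removal step described above is to drop ASR segments that no detected voice activity supports. The segment format, coverage criterion, and threshold below are assumptions, not the paper's exact rule.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]   # (start_s, end_s, text) from the ASR system
Interval = Tuple[float, float]       # (start_s, end_s) from a VAD

def overlap(a: Interval, b: Interval) -> float:
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def drop_hallucinations(asr_segments: List[Segment],
                        vad_intervals: List[Interval],
                        min_coverage: float = 0.5) -> List[Segment]:
    """Keep only ASR segments whose duration is sufficiently covered by detected speech."""
    kept = []
    for start, end, text in asr_segments:
        dur = max(end - start, 1e-6)
        covered = sum(overlap((start, end), v) for v in vad_intervals)
        if covered / dur >= min_coverage:
            kept.append((start, end, text))
    return kept
```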
We explore diverse representations of speech audio and their effect on the performance of a late-fusion ensemble of E-Branchformer models applied to the Automatic Speech Recognition (ASR) task. Although it is generally known that ensemble methods often improve system performance, even for speech recognition, it is very interesting to explore how ensembles of complex state-of-the-art models, such as medium-sized and large E-Branchformers, cope in this setting when their base models are trained on diverse representations of the input speech audio. The results are evaluated on four widely used benchmark datasets: \textit{Librispeech, Aishell, Gigaspeech}, and \textit{TEDLIUMv2}, and show that improvements of $1\% - 14\%$ can still be achieved over state-of-the-art models trained using comparable techniques on these datasets. A noteworthy observation is that such an ensemble offers improvements even when language models are used, although the gap is closing.
https://arxiv.org/abs/2412.01861
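A toy sketch of one common form of late fusion: averaging the output distributions of the ensemble members before decoding. Real E-Branchformer ensembling typically operates on hypothesis scores during beam search, so this only illustrates the core idea; the member log-probabilities are assumed to be frame-aligned.

```python
from typing import List, Optional
import numpy as np

def late_fusion(log_probs: List[np.ndarray],
                weights: Optional[List[float]] = None) -> np.ndarray:
    """Combine (frames, vocab) log-probability matrices from ensemble members."""
    if weights is None:
        weights = [1.0 / len(log_probs)] * len(log_probs)
    probs = sum(w * np.exp(lp) for w, lp in zip(weights, log_probs))
    return np.log(probs + 1e-12)          # fused log-probabilities for decoding

# Hypothetical usage: each member is trained on a different input representation
# (e.g. fbank vs. learned features) but emits frame-aligned output distributions.
# fused = late_fusion([member_a_logprobs, member_b_logprobs])
# tokens = fused.argmax(axis=-1)
```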
Large Language Models (LLMs) have showcased exceptional performance across diverse NLP tasks, and their integration with speech encoders is rapidly emerging as a dominant trend in the Automatic Speech Recognition (ASR) field. Previous works mainly concentrated on leveraging LLMs for speech recognition in English and Chinese. However, their potential for addressing speech recognition challenges in low-resource settings remains underexplored. Hence, in this work, we aim to explore the capability of LLMs in low-resource ASR and Mandarin-English code-switching ASR. We also evaluate and compare the recognition performance of LLM-based ASR systems against the Whisper model. Extensive experiments demonstrate that LLM-based ASR yields a relative gain of 12.8\% over the Whisper model in low-resource ASR, while Whisper performs better in Mandarin-English code-switching ASR. We hope that this study can shed light on ASR for low-resource scenarios.
https://arxiv.org/abs/2412.00721
Data augmentation is a widely adopted technique used to improve the robustness of automatic speech recognition (ASR). Employing a fixed data augmentation strategy for all training data is common practice. However, factors such as background noise and speech rate can vary among different samples within a single training batch. With a fixed augmentation strategy, there is a risk that the model may reach a suboptimal state. Beyond this risk, the model's capabilities may also differ across training stages. To address these issues, this paper proposes sample-adaptive data augmentation with progressive scheduling (PS-SapAug). The proposed method applies dynamic data augmentation in a two-stage training approach. It employs hybrid normalization to compute sample-specific augmentation parameters based on each sample's loss. Additionally, the probability of augmentation gradually increases throughout the training progression. Our method is evaluated on popular ASR benchmark datasets, including Aishell-1 and Librispeech-100h, achieving up to 8.13% WER reduction on LibriSpeech-100h test-clean, 6.23% on test-other, and 5.26% on the AISHELL-1 test set, demonstrating the efficacy of our approach in enhancing performance and minimizing errors.
https://arxiv.org/abs/2412.00415
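A rough sketch of loss-driven, progressively scheduled augmentation in the spirit of the method above. The min-max scaling stands in for the paper's hybrid normalization, the linear probability ramp is a guess at the schedule, and mapping higher loss to stronger augmentation is an assumption rather than the paper's stated rule.

```python
import torch

def augmentation_params(per_sample_loss: torch.Tensor,
                        step: int, total_steps: int,
                        p_min: float = 0.2, p_max: float = 0.9,
                        max_strength: float = 1.0) -> torch.Tensor:
    """Per-sample augmentation strengths gated by a progressively scheduled probability."""
    losses = per_sample_loss.detach()
    # Min-max scaling approximates "hybrid normalization"; the direction of the
    # loss-to-strength mapping is an assumption.
    norm = (losses - losses.min()) / (losses.max() - losses.min() + 1e-8)
    strength = max_strength * norm                      # (batch,)
    # The probability of applying augmentation grows linearly over training.
    p_aug = p_min + (p_max - p_min) * step / max(total_steps, 1)
    apply = (torch.rand_like(strength) < p_aug).float()
    return strength * apply                             # 0 where augmentation is skipped
```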
In today's digital age, video content is prevalent, serving as a primary source of information, education, and entertainment. However, the Deaf and Hard of Hearing (DHH) community often faces significant challenges in accessing video content due to the inadequacy of automatic speech recognition (ASR) systems in providing accurate and reliable captions. This paper addresses the urgent need to improve video caption quality by leveraging Large Language Models (LLMs). We present a comprehensive study that explores the integration of LLMs to enhance the accuracy and context-awareness of captions generated by ASR systems. Our methodology involves a novel pipeline that corrects ASR-generated captions using advanced LLMs. It explicitly focuses on models like GPT-3.5 and Llama2-13B due to their robust performance in language comprehension and generation tasks. We introduce a dataset representative of real-world challenges the DHH community faces to evaluate our proposed pipeline. Our results indicate that LLM-enhanced captions significantly improve accuracy, as evidenced by the notably lower Word Error Rate (WER) achieved by ChatGPT-3.5 (WER: 9.75%) compared to the original ASR captions (WER: 23.07%); ChatGPT-3.5 shows an approximately 57.72% improvement in WER over the original ASR captions.
https://arxiv.org/abs/2412.00342
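The reported relative improvement follows directly from the two WERs; the small difference from the quoted 57.72% comes from rounding of the reported WER figures.

```python
wer_asr = 23.07   # original ASR captions (%)
wer_llm = 9.75    # ChatGPT-3.5-enhanced captions (%)
relative_improvement = (wer_asr - wer_llm) / wer_asr * 100
print(f"{relative_improvement:.1f}%")   # ~57.7%, consistent with the reported ~57.72%
```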
Speech recognition is a key challenge in natural language processing, requiring low latency, efficient computation, and strong generalization for real-time applications. While software-based artificial neural networks (ANNs) excel at this task, they are computationally intensive and depend heavily on data pre-processing. Neuromorphic computing, with its low-latency and energy-efficient advantages, holds promise for audio classification. Memristive nanowire networks, combined with pre-processing techniques like Mel-Frequency Cepstrum Coefficient extraction, have been widely used for associative learning, but such pre-processing can be power-intensive, undermining latency benefits. This study pioneers the use of memristive and spatio-temporal properties of nanowire networks for audio signal classification without pre-processing. A nanowire network simulation is paired with three linear classifiers for 10-class MNIST audio classification and binary speaker generalization tests. The hybrid system achieves significant benefits: excellent data compression with only 3% of nanowire output utilized, a 10-fold reduction in computational latency, and up to 28.5% improved classification accuracy (using a logistic regression classifier). Precision and recall improve by 10% and 17% for multispeaker datasets, and by 24% and 17% for individual speaker datasets, compared to raw data. This work provides a foundational proof of concept for utilizing memristive nanowire networks (NWN) in edge-computing devices, showcasing their potential for efficient, real-time audio signal processing with reduced computational overhead and power consumption, and enabling the development of advanced neuromorphic computing solutions.
https://arxiv.org/abs/2411.19611
Brain-Computer Interface (BCI) research aims to support communication-impaired patients by translating neural signals into speech. A notable research topic in BCI involves Electroencephalography (EEG) signals that measure the electrical activity in the brain. While significant advancements have been made in BCI EEG research, a major limitation still exists: the scarcity of publicly available EEG datasets for non-English languages, such as Arabic. To address this gap, we introduce in this paper the ArEEG_Words dataset, a novel EEG dataset recorded from 22 participants with a mean age of 22 years (5 female, 17 male) using a 14-channel Emotiv Epoc X device. The participants were asked to abstain from anything affecting the nervous system, such as coffee, alcohol, and cigarettes, for 8 hours before recording. They were asked to stay calm in a calm room while imagining one of 16 Arabic words for 10 seconds. The 16 words are commonly used words such as up, down, left, and right. A total of 352 EEG recordings were collected; each recording was then divided into multiple 250 ms signals, resulting in a total of 15,360 EEG signals. To the best of our knowledge, ArEEG_Words is the first dataset of its kind in the Arabic EEG domain. Moreover, it is publicly available for researchers, and we hope it will fill the gap in Arabic EEG research.
https://arxiv.org/abs/2411.18888
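A generic sketch of slicing a multichannel EEG recording into 250 ms windows, as in the segmentation step above. The sampling rate and the non-overlapping windowing are assumptions and are not claimed to reproduce the reported signal count exactly.

```python
import numpy as np

def segment_eeg(recording: np.ndarray, fs: int, window_ms: int = 250) -> np.ndarray:
    """Split a (channels, samples) EEG recording into fixed 250 ms windows."""
    win = int(fs * window_ms / 1000)
    n = recording.shape[1] // win
    trimmed = recording[:, : n * win]
    return trimmed.reshape(recording.shape[0], n, win).swapaxes(0, 1)

# Hypothetical usage: a 10 s, 14-channel recording at an assumed 128 Hz sampling
# rate yields windows of shape (n_windows, 14, 32).
# windows = segment_eeg(np.random.randn(14, 1280), fs=128)
```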
Spontaneous or conversational multilingual speech presents many challenges for state-of-the-art automatic speech recognition (ASR) systems. In this work, we present a new technique AMPS that augments a multilingual multimodal ASR system with paraphrase-based supervision for improved conversational ASR in multiple languages, including Hindi, Marathi, Malayalam, Kannada, and Nyanja. We use paraphrases of the reference transcriptions as additional supervision while training the multimodal ASR model and selectively invoke this paraphrase objective for utterances with poor ASR performance. Using AMPS with a state-of-the-art multimodal model SeamlessM4T, we obtain significant relative reductions in word error rates (WERs) of up to 5%. We present detailed analyses of our system using both objective and human evaluation metrics.
https://arxiv.org/abs/2411.18368
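A sketch of the selective paraphrase supervision described above: a paraphrase-based ASR loss term is added only for utterances the model handles poorly. Using a per-utterance WER estimate as the gating criterion, and the threshold and weighting values, are assumptions.

```python
import torch

def amps_style_loss(loss_ref: torch.Tensor,    # (batch,) ASR loss w.r.t. the reference
                    loss_para: torch.Tensor,   # (batch,) ASR loss w.r.t. a paraphrase
                    quality: torch.Tensor,     # (batch,) e.g. estimated per-utterance WER
                    wer_threshold: float = 0.3,
                    para_weight: float = 0.5) -> torch.Tensor:
    """Invoke the paraphrase objective only for utterances with poor ASR performance."""
    gate = (quality > wer_threshold).float()   # 1 where ASR performance is poor
    return (loss_ref + para_weight * gate * loss_para).mean()
```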
Continual learning for automatic speech recognition (ASR) systems poses a challenge, especially with the need to avoid catastrophic forgetting while maintaining performance on previously learned tasks. This paper introduces a novel approach leveraging the machine speech chain framework to enable continual learning in ASR using gradient episodic memory (GEM). By incorporating a text-to-speech (TTS) component within the machine speech chain, we support the replay mechanism essential for GEM, allowing the ASR model to learn new tasks sequentially without significant performance degradation on earlier tasks. Our experiments, conducted on the LJ Speech dataset, demonstrate that our method outperforms traditional fine-tuning and multitask learning approaches, achieving a substantial error rate reduction while maintaining high performance across varying noise conditions. These results show the potential of our semi-supervised machine speech chain approach for effective and efficient continual learning in speech recognition.
https://arxiv.org/abs/2411.18320
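For reference, GEM's core projection step (here in its single-constraint form, as used in A-GEM) is independent of the speech-chain specifics: if the new-task gradient conflicts with the gradient computed on replayed memory data, the conflicting component is removed so the update does not increase the loss on earlier tasks.

```python
import numpy as np

def project_gradient(g: np.ndarray, g_mem: np.ndarray) -> np.ndarray:
    """Single-constraint GEM/A-GEM projection of the new-task gradient.

    g:     flattened gradient on the current task batch
    g_mem: flattened gradient on a batch replayed from episodic memory
    """
    dot = float(g @ g_mem)
    if dot >= 0.0:
        return g                                      # no interference, keep as-is
    return g - (dot / float(g_mem @ g_mem)) * g_mem   # drop the conflicting component

# In the machine speech chain setting, g_mem would be computed on TTS-generated
# replay utterances for previously learned tasks.
```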
This paper investigates a novel approach to end-to-end speech translation (ST) based on aligning frozen pre-trained automatic speech recognition (ASR) and machine translation (MT) models via a small connector module (Q-Former, our Subsampler-Transformer Encoder). This connector bridges the gap between the speech and text modalities, transforming ASR encoder embeddings into the latent representation space of the MT encoder while being the only part of the system optimized during training. Experiments are conducted on the How2 English-Portuguese dataset as we investigate the alignment approach in a small-scale scenario focusing on ST. While keeping the size of the connector module constant and comparatively small (< 5% of the size of the larger aligned models), increasing the size and capability of the foundation ASR and MT models universally improves translation results. We also find that the connectors can serve as domain adapters for the foundation MT models, significantly improving translation performance in the aligned ST setting. We conclude that this represents a viable and scalable approach to training end-to-end ST systems.
https://arxiv.org/abs/2411.18294
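A generic PyTorch sketch of a small connector that subsamples frozen ASR encoder states and maps them into the MT encoder's embedding space, with only the connector trained. Dimensions, stride, and layer counts are placeholders, and the authors' Q-Former/Subsampler-Transformer variants differ in detail.

```python
import torch
import torch.nn as nn

class SubsamplerTransformerConnector(nn.Module):
    """Trainable bridge between a frozen ASR encoder and a frozen MT encoder."""
    def __init__(self, asr_dim: int = 768, mt_dim: int = 1024,
                 stride: int = 4, n_layers: int = 2, n_heads: int = 8):
        super().__init__()
        self.subsample = nn.Conv1d(asr_dim, mt_dim, kernel_size=stride, stride=stride)
        layer = nn.TransformerEncoderLayer(d_model=mt_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, asr_states: torch.Tensor) -> torch.Tensor:
        # asr_states: (batch, frames, asr_dim) from the frozen ASR encoder
        x = self.subsample(asr_states.transpose(1, 2)).transpose(1, 2)
        return self.encoder(x)   # (batch, frames // stride, mt_dim) for the MT encoder

# Only the connector is optimized; the ASR and MT models stay frozen, e.g.:
# for p in asr_model.parameters(): p.requires_grad_(False)
# for p in mt_model.parameters():  p.requires_grad_(False)
```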
Speech Self-Supervised Learning (SSL) models achieve impressive performance on Automatic Speech Recognition (ASR). However, in low-resource language ASR, they encounter the domain mismatch problem between the pre-trained languages and low-resource languages. Typical solutions like fine-tuning the SSL model suffer from high computation costs, while using frozen SSL models as feature extractors yields poor performance. To handle these issues, we extend a conventional efficient fine-tuning scheme based on adapters. We add an extra intermediate adaptation step to warm up the adapter and the downstream model initialization. Remarkably, we update only 1-5% of the total model parameters to achieve the adaptation. Experimental results on the ML-SUPERB dataset show that our solution outperforms conventional efficient fine-tuning, achieving up to a 28% relative improvement in the Character/Phoneme error rate when adapting to unseen languages.
https://arxiv.org/abs/2411.18217
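A standard bottleneck adapter and a trainable-parameter check, as background for the adapter-based scheme above. The Houlsby-style residual design is the usual convention, not necessarily the paper's exact variant, and the intermediate-adaptation warm-up stage is not shown.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter inserted alongside a frozen transformer block."""
    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))   # residual bottleneck

def trainable_fraction(model: nn.Module) -> float:
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return 100.0 * trainable / total

# With the SSL backbone frozen and only adapters (plus the downstream head)
# trainable, trainable_fraction(model) should land in the reported 1-5% range.
```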
Speaker-attributed automatic speech recognition (SA-ASR) aims to transcribe speech while assigning transcripts to the corresponding speakers accurately. Existing methods often rely on complex modular systems or require extensive fine-tuning of joint modules, limiting their adaptability and general efficiency. This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions, using only standard monolingual ASR datasets. Our method involves training a speaker module to predict speaker embeddings based on weak labels without requiring additional ASR model modifications. Despite being trained exclusively with non-overlapping monolingual data, our approach effectively extracts speaker attributes across diverse multilingual datasets, including those with overlapping speech. Experimental results demonstrate competitive performance compared to strong baselines, highlighting the model's robustness and potential for practical applications.
https://arxiv.org/abs/2411.18152
Full-duplex multimodal large language models (LLMs) provide a unified framework for addressing diverse speech understanding and generation tasks, enabling more natural and seamless human-machine conversations. Unlike traditional modularised conversational AI systems, which separate speech recognition, understanding, and text-to-speech generation into distinct components, multimodal LLMs operate as single end-to-end models. This streamlined design eliminates error propagation across components and fully leverages the rich non-verbal information embedded in input speech signals. We introduce SALMONN-omni, a codec-free, full-duplex speech understanding and generation model capable of simultaneously listening to its own generated speech and background sounds while speaking. To support this capability, we propose a novel duplex spoken dialogue framework incorporating a ``thinking'' mechanism that facilitates asynchronous text and speech generation relying on embeddings instead of codecs (quantized speech and audio tokens). Experimental results demonstrate SALMONN-omni's versatility across a broad range of streaming speech tasks, including speech recognition, speech enhancement, and spoken question answering. Additionally, SALMONN-omni excels at managing turn-taking, barge-in, and echo cancellation scenarios, establishing its potential as a robust prototype for full-duplex conversational AI systems. To the best of our knowledge, SALMONN-omni is the first codec-free model of its kind. A full technical report along with model checkpoints will be released soon.
https://arxiv.org/abs/2411.18138
Self-supervised learning (SSL) models have shown exceptional capabilities across various speech-processing tasks. Continuous SSL representations are effective but suffer from high computational and storage demands. On the other hand, discrete SSL representations, although with degraded performance, reduce transmission and storage costs, and improve input sequence efficiency through de-duplication and subword-modeling. To boost the performance of discrete representations for ASR, we introduce a novel fusion mechanism that integrates two discrete representations. The fusion mechanism preserves all the benefits of discrete representation while enhancing the model's performance by integrating complementary information. Additionally, we explore ``self-augmented'' discrete representations, which apply transformations to a single continuous SSL representation, eliminating the fusion mechanism's dependency on multiple SSL models and further decreasing its inference costs. Experimental results on benchmarks, including LibriSpeech and ML-SUPERB, indicate up to 19% and 24% relative character error rate improvement compared with the non-fusion baseline, validating the effectiveness of our proposed methods.
https://arxiv.org/abs/2411.18107
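One simple way to fuse two discrete token streams, in the spirit of the fusion mechanism above: embed each stream separately and add the embeddings before the ASR encoder. The vocabulary sizes, the additive fusion operator, and the assumption that the streams are time-aligned are placeholders, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DiscreteFusion(nn.Module):
    """Fuse two discrete SSL token streams into a single encoder input."""
    def __init__(self, vocab_a: int = 2000, vocab_b: int = 2000, dim: int = 256):
        super().__init__()
        self.embed_a = nn.Embedding(vocab_a, dim)
        self.embed_b = nn.Embedding(vocab_b, dim)

    def forward(self, tokens_a: torch.Tensor, tokens_b: torch.Tensor) -> torch.Tensor:
        # tokens_*: (batch, frames) discrete units, assumed time-aligned
        return self.embed_a(tokens_a) + self.embed_b(tokens_b)

# A "self-augmented" variant would derive tokens_b from the same continuous SSL
# representation as tokens_a (e.g. by clustering a transformed copy of it),
# removing the need for a second SSL model.
```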
Speech language models (SpeechLMs) accept speech input and produce speech output, allowing for more natural human-computer interaction compared to text-based large language models (LLMs). Traditional approaches for developing SpeechLMs are constrained by the limited availability of unsupervised speech data and parallel speech-text data, which are significantly less abundant than text pre-training data, thereby limiting their scalability compared to text-based LLMs. We propose a novel approach to scaling speech-text pre-training by leveraging large-scale synthetic interleaved data derived from text corpora, eliminating the need for parallel speech-text datasets. Our method efficiently constructs speech-text interleaved data by sampling text spans from existing text corpora and synthesizing corresponding speech spans using a text-to-token model, bypassing the need to generate actual speech. We also employ a supervised speech tokenizer derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. This supervised training approach results in discrete speech tokens with strong semantic preservation even at lower frame rates (e.g. 12.5Hz), while still maintaining speech reconstruction quality. Starting from a pre-trained language model and scaling our pre-training to 1 trillion tokens (with 600B synthetic interleaved speech-text data), we achieve state-of-the-art performance in speech language modeling and spoken question answering, improving performance on spoken question answering tasks from the previous SOTA of 13% (Moshi) to 31%. We further demonstrate that by fine-tuning the pre-trained model with speech dialogue data, we can develop an end-to-end spoken chatbot that achieves competitive performance comparable to existing baselines in both conversational abilities and speech quality, even operating exclusively in the speech domain.
https://arxiv.org/abs/2411.17607
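A toy version of constructing speech-text interleaved sequences by replacing sampled text spans with speech tokens produced by a text-to-token model. `text_to_tokens` is a stand-in for that model, and the span sampling scheme is simplified relative to whatever the authors use.

```python
import random
from typing import Callable, List, Union

def interleave(words: List[str],
               text_to_tokens: Callable[[str], List[int]],
               span_len: int = 8, speech_ratio: float = 0.5) -> List[Union[str, int]]:
    """Replace random word spans with synthetic speech tokens (ints)."""
    out: List[Union[str, int]] = []
    i = 0
    while i < len(words):
        if random.random() < speech_ratio:
            span = words[i:i + span_len]
            out.extend(text_to_tokens(" ".join(span)))   # synthesized speech-token span
            i += span_len
        else:
            out.append(words[i])                         # keep the text token
            i += 1
    return out

# Hypothetical usage with a dummy text-to-token model:
# mixed = interleave("the cat sat on the mat".split(),
#                    lambda s: [hash(w) % 16384 for w in s.split()])
```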
Recent techniques for speech deepfake detection often rely on pre-trained self-supervised models. These systems, initially developed for Automatic Speech Recognition (ASR), have proved their ability to offer a meaningful representation of speech signals, which can benefit various tasks, including deepfake detection. In this context, pre-trained models serve as feature extractors and are used to extract embeddings from input speech, which are then fed to a binary speech deepfake detector. The remarkable accuracy achieved through this approach underscores a potential relationship between ASR and speech deepfake detection. However, this connection is not yet entirely clear, and we do not know whether improved performance in ASR corresponds to higher speech deepfake detection capabilities. In this paper, we address this question through a systematic analysis. We consider two different pre-trained self-supervised ASR models, Whisper and Wav2Vec 2.0, and adapt them for the speech deepfake detection task. These models have been released in multiple versions, with increasing number of parameters and enhanced ASR performance. We investigate whether performance improvements in ASR correlate with improvements in speech deepfake detection. Our results provide insights into the relationship between these two tasks and offer valuable guidance for the development of more effective speech deepfake detectors.
https://arxiv.org/abs/2411.17349
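A minimal sketch of the feature-extractor setup described above, using a frozen Wav2Vec 2.0 checkpoint from Hugging Face transformers and a linear head as the binary deepfake detector. The mean pooling and single-layer head are illustrative choices, not the detectors studied in the paper.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class DeepfakeDetector(nn.Module):
    def __init__(self, backbone: str = "facebook/wav2vec2-base-960h"):
        super().__init__()
        self.ssl = Wav2Vec2Model.from_pretrained(backbone)
        for p in self.ssl.parameters():          # frozen feature extractor
            p.requires_grad_(False)
        self.head = nn.Linear(self.ssl.config.hidden_size, 1)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) at 16 kHz
        feats = self.ssl(waveform).last_hidden_state     # (batch, frames, dim)
        return self.head(feats.mean(dim=1)).squeeze(-1)  # bona fide / spoof logit

# logits = DeepfakeDetector()(torch.randn(2, 16000))
# loss = nn.BCEWithLogitsLoss()(logits, torch.tensor([0.0, 1.0]))
```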
Automatic Speech Recognition (ASR) systems in the clinical domain face significant challenges, notably the need to recognise specialised medical vocabulary accurately and meet stringent precision requirements. We introduce United-MedASR, a novel architecture that addresses these challenges by integrating synthetic data generation, precision ASR fine-tuning, and advanced semantic enhancement techniques. United-MedASR constructs a specialised medical vocabulary by synthesising data from authoritative sources such as ICD-10 (International Classification of Diseases, 10th Revision), MIMS (Monthly Index of Medical Specialties), and FDA databases. This enriched vocabulary helps finetune the Whisper ASR model to better cater to clinical needs. To enhance processing speed, we incorporate Faster Whisper, ensuring streamlined and high-speed ASR performance. Additionally, we employ a customised BART-based semantic enhancer to handle intricate medical terminology, thereby increasing accuracy efficiently. Our layered approach establishes new benchmarks in ASR performance, achieving a Word Error Rate (WER) of 0.985% on LibriSpeech test-clean, 0.26% on Europarl-ASR EN Guest-test, and demonstrating robust performance on Tedlium (0.29% WER) and FLEURS (0.336% WER). Furthermore, we present an adaptable architecture that can be replicated across different domains, making it a versatile solution for domain-specific ASR systems.
https://arxiv.org/abs/2412.00055
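A pipeline sketch in the spirit of the architecture above: Faster-Whisper for transcription followed by a BART-style seq2seq semantic enhancer. The checkpoints named here are off-the-shelf placeholders, not United-MedASR's fine-tuned components, which would be trained on the synthesized ICD-10/MIMS/FDA vocabulary.

```python
from faster_whisper import WhisperModel
from transformers import BartForConditionalGeneration, BartTokenizer

asr = WhisperModel("small", device="cpu", compute_type="int8")
tok = BartTokenizer.from_pretrained("facebook/bart-base")
enhancer = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def transcribe_and_enhance(audio_path: str) -> str:
    """Transcribe audio, then pass the transcript through the semantic enhancer."""
    segments, _ = asr.transcribe(audio_path, beam_size=5)
    raw = " ".join(seg.text.strip() for seg in segments)
    inputs = tok(raw, return_tensors="pt", truncation=True)
    ids = enhancer.generate(**inputs, max_new_tokens=256)
    return tok.decode(ids[0], skip_special_tokens=True)
```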