For many Automatic Speech Recognition (ASR) tasks, audio features such as spectrograms show better results than Mel-frequency Cepstral Coefficients (MFCC), but in practice they are hard to use due to the high dimensionality of the feature space. The following paper presents an alternative approach to generating a compressed spectrogram representation, based on Convolutional Variational Autoencoders (VAE). A Convolutional VAE model was trained on a subsample of the LibriSpeech dataset to reconstruct short fragments of audio spectrograms (25 ms) from a 13-dimensional embedding. A model trained for a 40-dimensional (300 ms) embedding was then used to generate features for a corpus of spoken commands from the GoogleSpeechCommands dataset. Using the generated features, an ASR system was built and compared to a model with MFCC features.
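As a rough illustration of the idea (not the authors' implementation), the sketch below shows a convolutional VAE in PyTorch that compresses a small spectrogram patch into a 13-dimensional latent vector and reconstructs it; the patch size (64 x 8), channel widths, and loss weighting are assumptions chosen only to make the example self-contained.

```python
# Minimal convolutional VAE sketch: spectrogram patch -> 13-dim embedding -> reconstruction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvVAE(nn.Module):
    def __init__(self, latent_dim=13):
        super().__init__()
        # Encoder: (1, 64, 8) spectrogram patch -> flattened conv features
        self.enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # -> (16, 32, 4)
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # -> (32, 16, 2)
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(32 * 16 * 2, latent_dim)
        self.fc_logvar = nn.Linear(32 * 16 * 2, latent_dim)
        # Decoder mirrors the encoder
        self.fc_dec = nn.Linear(latent_dim, 32 * 16 * 2)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1),
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        x_hat = self.dec(self.fc_dec(z).view(-1, 32, 16, 2))
        return x_hat, mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = F.mse_loss(x_hat, x, reduction="sum")                  # reconstruction term
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL regularizer
    return recon + kld

x = torch.randn(4, 1, 64, 8)           # a batch of short spectrogram patches
x_hat, mu, logvar = ConvVAE()(x)
print(vae_loss(x, x_hat, mu, logvar))  # the compressed feature is mu (4 x 13)
```

The trained encoder's mean vector `mu` would then serve as the compressed feature in place of MFCCs.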
https://arxiv.org/abs/2410.02560
This paper presents an overview of a rule-based system for automatic accentuation and phonemic transcription of Russian texts for speech-related tasks, such as Automatic Speech Recognition (ASR). The two parts of the developed system, accentuation and transcription, use different approaches to achieve correct phonemic representations of input phrases. Accentuation is based on the "Grammatical Dictionary of the Russian Language" by A.A. Zaliznyak and a Wiktionary corpus. To distinguish homographs, the accentuation system also utilises morphological information about the sentences obtained from Recurrent Neural Networks (RNN). The transcription algorithms apply the rules presented in the monograph of B.M. Lobanov and L.I. Tsirulnik, "Computer Synthesis and Voice Cloning". The rules described in the present paper are implemented in an open-source module, which can be of use to any scientific study connected with ASR or Speech To Text (STT) tasks. Automatically marked-up text annotations of the Russian Voxforge database were used as training data for an acoustic model in CMU Sphinx. The resulting acoustic model was evaluated with cross-validation, with a mean Word Accuracy of 71.2%. The developed toolkit is written in Python and is available on GitHub to any interested researcher.
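A toy sketch (not the released toolkit's API) of how dictionary-based accent placement with a homograph fallback can look; the word list, the "+" stress-marker convention, and the external disambiguator are illustrative assumptions.

```python
# Dictionary lookup returns stress variants; an external morphological tagger
# (e.g. an RNN, as in the paper) would pick among homograph variants.
ACCENT_DICT = {
    "молоко": ["молок+о"],           # unambiguous: a single stress variant
    "замок": ["з+амок", "зам+ок"],   # homograph: castle vs. lock
}

def accentuate(word, variant_index=0):
    """variant_index would come from a morphological disambiguator, not this sketch."""
    variants = ACCENT_DICT.get(word.lower())
    if variants is None:
        return word                   # out-of-vocabulary: leave the word unmarked
    return variants[min(variant_index, len(variants) - 1)]

print(accentuate("молоко"))                   # молок+о
print(accentuate("замок", variant_index=1))   # зам+ок, chosen by the (external) tagger
```

The phonemic transcription stage would then apply letter-to-sound rules to the accentuated form.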
https://arxiv.org/abs/2410.02538
Utterances by L2 speakers can be unintelligible due to mispronunciation and improper prosody. In computer-aided language learning systems, textual feedback is often provided using a speech recognition engine. However, an ideal form of feedback for L2 speakers should be fine-grained enough to enable them to detect and diagnose the unintelligible parts of their utterances. Inspired by language teachers who correct students' pronunciation through a voice-to-voice process, this pilot study utilizes a unique semi-parallel dataset composed of non-native speakers' (L2) reading aloud, shadowing of native speakers (L1), and their script-shadowing utterances. We explore the technical possibility of replicating the process of an L1 speaker shadowing L2 speech using Voice Conversion (VC) techniques, to create a virtual shadower system. Experimental results demonstrate the feasibility of the VC system in simulating L1 shadowing behavior. The output of the virtual shadower system shows a reasonable similarity to real L1 shadowing utterances in both linguistic and acoustic aspects.
https://arxiv.org/abs/2410.02239
Spoken language assessment (SLA) systems restrict themselves to evaluating the pronunciation and oral fluency of a speaker by analysing read and spontaneous spoken utterances, respectively. The assessment of language grammar or vocabulary is relegated to written language assessment (WLA) systems. Most WLA systems present a set of sentences drawn from a curated, finite-size database, thereby making it possible to anticipate the test questions and train oneself. In this paper, we propose a novel end-to-end SLA system to assess language grammar from spoken utterances, thus making WLA systems redundant; additionally, we make the assessment largely unteachable by employing a large language model (LLM) to bring in variations in the test. We further demonstrate that a hybrid automatic speech recognition (ASR) system with a custom-built language model outperforms the state-of-the-art ASR engine for spoken grammar assessment.
https://arxiv.org/abs/2410.01579
The rise of foundation models (FMs), coupled with regulatory efforts addressing their risks and impacts, has sparked significant interest in open-source models. However, existing speech FMs (SFMs) fall short of full compliance with the open-source principles, even if claimed otherwise, as no existing SFM has model weights, code, and training data publicly available under open-source terms. In this work, we take the first step toward filling this gap by focusing on the 24 official languages of the European Union (EU). We collect suitable training data by surveying automatic speech recognition datasets and unlabeled speech corpora under open-source compliant licenses, for a total of 950k hours. Additionally, we release automatic transcripts for 441k hours of unlabeled data under the permissive CC-BY license, thereby facilitating the creation of open-source SFMs for the EU languages.
https://arxiv.org/abs/2410.01036
Image-based multimodal automatic speech recognition (ASR) models enhance speech recognition performance by incorporating audio-related images. However, some works suggest that introducing image information to the model does not help improve ASR performance. In this paper, we propose a novel approach that effectively utilizes audio-related image information and introduce VHASR, a multimodal speech recognition system that uses vision as hotwords to strengthen the model's speech recognition capability. Our system utilizes a dual-stream architecture, which first transcribes the text on the two streams separately and then combines the outputs. We evaluate the proposed model on four datasets: Flickr8k, ADE20k, COCO, and OpenImages. The experimental results show that VHASR can effectively utilize key information in images to enhance the model's speech recognition ability. Its performance not only surpasses unimodal ASR, but also achieves SOTA among existing image-based multimodal ASR systems.
https://arxiv.org/abs/2410.00822
We present a cost-effective approach for developing Automatic Speech Recognition (ASR) models for low-resource languages like Ika. We fine-tune the pretrained wav2vec 2.0 Massively Multilingual Speech Models on a high-quality speech dataset compiled from New Testament Bible translations in Ika. Our results show that fine-tuning multilingual pretrained models achieves a Word Error Rate (WER) of 0.5377 and Character Error Rate (CER) of 0.2651 with just over 1 hour of training data. The larger 1 billion parameter model outperforms the smaller 300 million parameter model due to its greater complexity and ability to store richer speech representations. However, we observe overfitting to the small training dataset, reducing generalizability. Our findings demonstrate the potential of leveraging multilingual pretrained models for low-resource languages. Future work should focus on expanding the dataset and exploring techniques to mitigate overfitting.
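For readers who want to reproduce the kind of error metrics quoted above, here is a small sketch of computing WER and CER with the jiwer package; the reference/hypothesis strings are made-up examples, not data from the paper.

```python
# Word Error Rate and Character Error Rate over a list of utterances using jiwer.
import jiwer

references = ["we present a cost effective approach", "speech in the ika language"]
hypotheses = ["we present a cost effective approach", "speech in the ika languish"]

wer = jiwer.wer(references, hypotheses)  # word-level edit distance / reference words
cer = jiwer.cer(references, hypotheses)  # character-level analogue
print(f"WER={wer:.4f}  CER={cer:.4f}")
```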
https://arxiv.org/abs/2410.00940
A hybrid autoregressive transducer (HAT) is a variant of neural transducer that models blank and non-blank posterior distributions separately. In this paper, we propose a novel internal acoustic model (IAM) training strategy to enhance HAT-based speech recognition. IAM consists of encoder and joint networks, which are fully shared and jointly trained with HAT. This joint training not only enhances the HAT training efficiency but also encourages IAM and HAT to emit blanks synchronously, which skips the more expensive non-blank computation and results in more effective blank thresholding for faster decoding. Experiments demonstrate that the relative error reductions of the HAT with IAM compared to the vanilla HAT are statistically significant. Moreover, we introduce dual blank thresholding, which combines both HAT- and IAM-blank thresholding and a compatible decoding algorithm. This results in a 42-75% decoding speed-up with no major performance degradation.
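To make the blank-thresholding idea concrete, here is a much-simplified greedy decoding sketch (not the HAT/IAM implementation): a cheap blank posterior is computed first, and the more expensive non-blank label computation runs only when that posterior falls below a threshold. The layer shapes, the threshold value, and the omission of the prediction network and inner label loop are all simplifications.

```python
import torch

def greedy_decode_with_blank_threshold(enc_out, blank_head, label_head, threshold=0.95):
    hyp = []
    for h_t in enc_out:                                   # one encoder frame at a time
        p_blank = torch.sigmoid(blank_head(h_t)).item()   # cheap scalar blank posterior
        if p_blank > threshold:
            continue                                      # treat the frame as blank, skip labels
        hyp.append(int(torch.argmax(label_head(h_t))))    # expensive step, run only when needed
    return hyp

# toy usage: random linear layers stand in for the joint networks
enc_out = torch.randn(50, 256)
blank_head = torch.nn.Linear(256, 1)
label_head = torch.nn.Linear(256, 500)
print(greedy_decode_with_blank_threshold(enc_out, blank_head, label_head))
```

Dual thresholding, as described above, would consult both the HAT and IAM blank posteriors before skipping the non-blank step.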
https://arxiv.org/abs/2409.20313
Extending the RNN Transducer (RNNT) to recognize multi-talker speech is essential for wider automatic speech recognition (ASR) applications. Multi-talker RNNT (MT-RNNT) aims to achieve recognition without relying on costly front-end source separation. MT-RNNT is conventionally implemented using architectures with multiple encoders or decoders, or by serializing all speakers' transcriptions into a single output stream. The first approach is computationally expensive, particularly due to the need for multiple encoder processing. In contrast, the second approach involves a complex label generation process, requiring accurate timestamps of all words spoken by all speakers in the mixture, obtained from an external ASR system. In this paper, we propose a novel alignment-free training scheme for the MT-RNNT (MT-RNNT-AFT) that adopts the standard RNNT architecture. The target labels are created by appending a prompt token corresponding to each speaker at the beginning of the transcription, reflecting the order of each speaker's appearance in the mixtures. Thus, MT-RNNT-AFT can be trained without relying on accurate alignments, and it can recognize all speakers' speech with just one round of encoder processing. Experiments show that MT-RNNT-AFT achieves performance comparable to that of the state-of-the-art alternatives, while greatly simplifying the training process.
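A minimal sketch of the alignment-free target construction described above: each speaker's transcription is prefixed with a speaker-order prompt token and the results are concatenated in order of appearance. The token names are illustrative, not the paper's vocabulary.

```python
def make_mt_rnnt_aft_targets(transcripts_in_order_of_appearance):
    """Serialize multi-talker transcripts without word-level timestamps."""
    target = []
    for i, words in enumerate(transcripts_in_order_of_appearance, start=1):
        target.append(f"<spk{i}>")   # prompt token marking the i-th speaker to appear
        target.extend(words)
    return target

mix = [["hello", "there"], ["how", "are", "you"]]
print(make_mt_rnnt_aft_targets(mix))
# ['<spk1>', 'hello', 'there', '<spk2>', 'how', 'are', 'you']
```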
https://arxiv.org/abs/2409.20301
This paper works on streaming automatic speech recognition (ASR). Mamba, a recently proposed state space model, has demonstrated the ability to match or surpass Transformers in various tasks while benefiting from a linear complexity advantage. We explore the efficiency of Mamba encoder for streaming ASR and propose an associated lookahead mechanism for leveraging controllable future information. Additionally, a streaming-style unimodal aggregation (UMA) method is implemented, which automatically detects token activity and streamingly triggers token output, and meanwhile aggregates feature frames for better learning token representation. Based on UMA, an early termination (ET) method is proposed to further reduce recognition latency. Experiments conducted on two Mandarin Chinese datasets demonstrate that the proposed model achieves competitive ASR performance in terms of both recognition accuracy and latency.
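The sketch below is one schematic reading of streaming unimodal aggregation, not the paper's implementation: scalar frame weights are assumed to rise and fall once per token, a token is triggered when the weights pass through a local valley (using one frame of lookahead), and the frames in the finished segment are averaged with their weights.

```python
import numpy as np

def uma_stream(frames, weights, eps=1e-8):
    tokens, seg_feats, seg_w = [], [], []
    for t in range(len(frames)):
        seg_feats.append(frames[t])
        seg_w.append(weights[t])
        # valley detected: weight stopped falling and starts rising again (1-frame lookahead)
        if 1 <= t < len(frames) - 1 and weights[t] <= weights[t - 1] and weights[t] < weights[t + 1]:
            w = np.array(seg_w)[:, None]
            tokens.append((np.array(seg_feats) * w).sum(0) / (w.sum() + eps))
            seg_feats, seg_w = [], []            # start aggregating the next token
    if seg_feats:                                # flush the final segment
        w = np.array(seg_w)[:, None]
        tokens.append((np.array(seg_feats) * w).sum(0) / (w.sum() + eps))
    return tokens

frames = np.random.randn(12, 4)                  # 12 frames, 4-dim features
weights = np.array([.1, .6, .9, .5, .1, .7, .8, .3, .1, .5, .9, .2])
print(len(uma_stream(frames, weights)), "aggregated token representations")
```

Early termination, in this view, would stop the scan once no further token activity is expected.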
https://arxiv.org/abs/2410.00070
In this work, we present AfriHuBERT, an extension of mHuBERT-147, a state-of-the-art (SOTA) and compact self-supervised learning (SSL) model originally pretrained on 147 languages. While mHuBERT-147 was pretrained on 16 African languages, we expand this to cover 39 African languages through continued pretraining on 6,500+ hours of speech data aggregated from diverse sources, including 23 newly added languages. We evaluate AfriHuBERT on two key speech tasks, Language Identification (LID) and Automatic Speech Recognition (ASR), using the FLEURS dataset. Our results show a +4% F1 score improvement on average for LID and a -1.2% average Word Error Rate (WER) reduction for ASR. Further analysis shows that ASR models trained on AfriHuBERT exhibit improved cross-corpus generalization. Additionally, the analysis indicates that FLEURS has data quality limitations that may affect its suitability for evaluating low-resource African languages, suggesting the need for better evaluation benchmarks for these languages.
https://arxiv.org/abs/2409.20201
Effective spoken dialog systems should facilitate natural interactions with quick and rhythmic timing, mirroring human communication patterns. To reduce response times, previous efforts have focused on minimizing the latency in automatic speech recognition (ASR) to optimize system efficiency. However, this approach requires waiting for ASR to complete processing until a speaker has finished speaking, which limits the time available for natural language processing (NLP) to formulate accurate responses. As humans, we continuously anticipate and prepare responses even while the other party is still speaking. This allows us to respond appropriately without missing the optimal time to speak. In this work, as a pioneering study toward a conversational system that simulates such human anticipatory behavior, we aim to realize a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance (EOU), using the middle portion of an utterance. To achieve this, we propose a training strategy for an encoder-decoder-based ASR system, which involves masking future segments of an utterance and prompting the decoder to predict the words in the masked audio. Additionally, we develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information to accurately detect the EOU. The experimental results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300ms prior to the actual EOU. Moreover, the proposed training strategy exhibits general improvements in ASR performance.
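A hedged sketch of the masking idea from the training strategy above: the trailing portion of an utterance's input features is zeroed out while the full transcript remains the decoder target, so the model must predict the words in the masked (future) audio. The mask ratio and the zero fill value are assumptions.

```python
import torch

def mask_future_audio(features, mask_ratio=0.4):
    """features: (T, F) acoustic frames for one utterance."""
    T = features.size(0)
    keep = int(T * (1.0 - mask_ratio))       # frames the model is allowed to see
    masked = features.clone()
    masked[keep:] = 0.0                      # hide the future segment
    return masked, keep                      # `keep` marks the visible/masked boundary

feats = torch.randn(120, 80)                 # 120 frames of 80-dim features
masked_feats, boundary = mask_future_audio(feats)
print(masked_feats.shape, "visible frames:", boundary)
```

The EOU detector described above would additionally combine the decoder's cross-attention pattern with these acoustic cues to estimate how far away the end of the utterance is.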
https://arxiv.org/abs/2409.19990
Recent advancements in integrating Large Language Models (LLM) with automatic speech recognition (ASR) have performed remarkably well in general domains. While supervised fine-tuning (SFT) of all model parameters is often employed to adapt pre-trained LLM-based ASR models to specific domains, it imposes high computational costs and notably reduces their performance in general domains. In this paper, we propose a novel parameter-efficient multi-domain fine-tuning method, named \textit{HDMoLE}, for adapting pre-trained LLM-based ASR models to multi-accent domains without catastrophic forgetting. It leverages hierarchical routing and dynamic thresholds based on combining low-rank adaptation (LoRA) with a mixture of experts (MoE) and can be generalized to any linear layer. Hierarchical routing establishes a clear correspondence between LoRA experts and accent domains, improving cross-domain collaboration among the LoRA experts. Unlike the static Top-K strategy for activating LoRA experts, dynamic thresholds can adaptively activate varying numbers of LoRA experts at each MoE layer. Experiments on multi-accent and standard Mandarin datasets demonstrate the efficacy of HDMoLE. Applying HDMoLE to the projector module of an LLM-based ASR model achieves performance similar to full fine-tuning in the target multi-accent domains, while using only 9.6% of the trainable parameters required for full fine-tuning and incurring minimal degradation in the source general domain.
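The sketch below is a schematic LoRA mixture-of-experts wrapped around a frozen linear layer, with dynamic-threshold routing instead of a fixed Top-K; the rank, expert count, and threshold value are illustrative, and the paper's hierarchical routing is reduced to a single flat router here.

```python
import torch
import torch.nn as nn

class LoRAMoELinear(nn.Module):
    def __init__(self, d_in, d_out, n_experts=4, rank=8, threshold=0.2):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)    # frozen pretrained weight
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(n_experts, rank, d_in) * 0.01)  # LoRA down-projections
        self.B = nn.Parameter(torch.zeros(n_experts, d_out, rank))        # LoRA up-projections
        self.router = nn.Linear(d_in, n_experts)
        self.threshold = threshold

    def forward(self, x):                                      # x: (batch, d_in)
        gates = torch.softmax(self.router(x), dim=-1)          # (batch, n_experts)
        gates = gates * (gates > self.threshold)               # dynamic activation, not Top-K
        low = torch.einsum("erd,bd->ber", self.A, x)           # per-expert low-rank features
        delta = torch.einsum("eor,ber->beo", self.B, low)      # per-expert output deltas
        return self.base(x) + (gates.unsqueeze(-1) * delta).sum(dim=1)

layer = LoRAMoELinear(256, 256)
print(layer(torch.randn(2, 256)).shape)                        # torch.Size([2, 256])
```

Because only the router and the A/B matrices are trainable, the adapter adds a small fraction of the parameters of full fine-tuning, in the spirit of the 9.6% figure reported above.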
https://arxiv.org/abs/2409.19878
This paper enhances dysarthric and dysphonic speech recognition by fine-tuning pretrained automatic speech recognition (ASR) models on the 2023-10-05 data package of the Speech Accessibility Project (SAP), which contains the speech of 253 people with Parkinson's disease. Experiments tested methods that have been effective for Cerebral Palsy, including the use of speaker clustering and severity-dependent models, weighted fine-tuning, and multi-task learning. Best results were obtained using a multi-task learning model, in which the ASR is trained to produce an estimate of the speaker's impairment severity as an auxiliary output. The resulting word error rates are considerably improved relative to a baseline model fine-tuned using only Librispeech data, with word error rate improvements of 37.62\% and 26.97\% compared to fine-tuning on 100h and 960h of LibriSpeech data, respectively.
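A minimal sketch of the multi-task setup described above: an ASR encoder with a CTC head plus an auxiliary head that regresses the speaker's impairment severity; the layer sizes, severity scale, and loss weighting are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiTaskASR(nn.Module):
    def __init__(self, d_feat=80, d_model=256, vocab=32):
        super().__init__()
        self.encoder = nn.GRU(d_feat, d_model, num_layers=2, batch_first=True)
        self.ctc_head = nn.Linear(d_model, vocab)        # main task: transcription
        self.sev_head = nn.Linear(d_model, 1)            # auxiliary task: severity estimate

    def forward(self, x):                                # x: (batch, T, d_feat)
        h, _ = self.encoder(x)
        logits = self.ctc_head(h).log_softmax(-1)        # (batch, T, vocab) for CTC loss
        severity = self.sev_head(h.mean(dim=1)).squeeze(-1)  # one score per utterance
        return logits, severity

model = MultiTaskASR()
logits, severity = model(torch.randn(2, 100, 80))
# total loss would be: ctc_loss(logits, ...) + lambda * mse_loss(severity, severity_target)
print(logits.shape, severity.shape)
```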
https://arxiv.org/abs/2409.19818
We propose a novel approach to end-to-end automatic speech recognition (ASR) to achieve efficient speech in-context learning (SICL) for (i) long-form speech decoding, (ii) test-time speaker adaptation, and (iii) test-time contextual biasing. Specifically, we introduce an attention-based encoder-decoder (AED) model with SICL capability (referred to as SICL-AED), where the decoder utilizes an utterance-level cross-attention to integrate information from the encoder's output efficiently, and a document-level self-attention to learn contextual information. Evaluated on the benchmark TEDLIUM3 dataset, SICL-AED achieves an 8.64% relative word error rate (WER) reduction compared to a baseline utterance-level AED model by leveraging previously decoded outputs as in-context examples. It also demonstrates comparable performance to conventional long-form AED systems with significantly reduced runtime and memory complexity. Additionally, we introduce an in-context fine-tuning (ICFT) technique that further enhances SICL effectiveness during inference. Experiments on speaker adaptation and contextual biasing highlight the general speech in-context learning capabilities of our system, achieving effective results with the provided contexts. Without specific fine-tuning, SICL-AED matches the performance of supervised AED baselines for speaker adaptation and improves entity recall by 64% for the contextual biasing task.
https://arxiv.org/abs/2409.19757
In the field of spoken language processing, audio-visual speech processing is receiving increasing research attention. Key components of this research include tasks such as lip reading, audio-visual speech recognition, and visual-to-speech synthesis. Although significant success has been achieved, theoretical analysis is still insufficient for audio-visual tasks. This paper presents a quantitative analysis based on information theory, focusing on information intersection between different modalities. Our results show that this analysis is valuable for understanding the difficulties of audio-visual processing tasks as well as the benefits that could be obtained by modality integration.
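A small sketch of the kind of quantity such an analysis measures: the mutual information (information intersection) between two discretized modality representations, here estimated with scikit-learn on toy cluster labels rather than real audio-visual units.

```python
from sklearn.metrics import mutual_info_score
import numpy as np

rng = np.random.default_rng(0)
audio_units = rng.integers(0, 8, size=1000)                 # e.g. quantized audio units
# a visual stream that partially copies the audio stream -> nonzero shared information
visual_units = np.where(rng.random(1000) < 0.6,
                        audio_units,
                        rng.integers(0, 8, size=1000))

print("I(audio; visual) ≈", mutual_info_score(audio_units, visual_units), "nats")
```

A larger estimated intersection would suggest more benefit from integrating the two modalities, which is the intuition the paper formalizes.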
https://arxiv.org/abs/2409.19575
Speech Language Models (SLMs) have demonstrated impressive performance on speech translation tasks. However, existing research primarily focuses on direct instruction fine-tuning and often overlooks the inherent reasoning capabilities of SLMs. In this paper, we introduce a three-stage training framework designed to activate the chain-of-thought (CoT) capabilities of SLMs. We propose CoT-ST, a speech translation model that utilizes multimodal CoT to decompose speech translation into sequential steps of speech recognition and translation. We validated the effectiveness of our method on two datasets: the CoVoST-2 dataset and MuST-C dataset. The experimental results demonstrate that CoT-ST outperforms previous state-of-the-art methods, achieving higher BLEU scores (CoVoST-2 en-ja: 30.5->30.8, en-zh: 45.2->47.7, MuST-C en-zh: 19.6->21.2). This work is open sourced at this https URL .
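A schematic sketch of the chain-of-thought decomposition: the translation is produced in two explicit steps, transcription then translation, instead of a single direct mapping. The `recognize` and `translate` functions are hypothetical placeholders standing in for the model's two reasoning steps, not the CoT-ST API.

```python
def recognize(audio):
    return "this is a placeholder transcript"           # step 1: speech recognition

def translate(text, target_lang="zh"):
    return f"[{target_lang}] translation of: {text}"    # step 2: text translation

def cot_speech_translation(audio, target_lang="zh"):
    transcript = recognize(audio)                       # intermediate chain-of-thought step
    return {"transcript": transcript,
            "translation": translate(transcript, target_lang)}

print(cot_speech_translation(audio=None))
```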
https://arxiv.org/abs/2409.19510
Speech signal processing is a cornerstone of modern communication technologies, tasked with improving the clarity and comprehensibility of audio data in noisy environments. The primary challenge in this field is the effective separation and recognition of speech from background noise, crucial for applications ranging from voice-activated assistants to automated transcription services. The quality of speech recognition directly impacts user experience and accessibility in technology-driven communication. This review paper explores advanced clustering techniques, particularly focusing on the Kernel Fuzzy C-Means (KFCM) method, to address these challenges. Our findings indicate that KFCM, compared to traditional methods like K-Means (KM) and Fuzzy C-Means (FCM), provides superior performance in handling non-linear and non-stationary noise conditions in speech signals. The most notable outcome of this review is the adaptability of KFCM to various noisy environments, making it a robust choice for speech enhancement applications. Additionally, the paper identifies gaps in current methodologies, such as the need for more dynamic clustering algorithms that can adapt in real time to changing noise conditions without compromising speech recognition quality. Key contributions include a detailed comparative analysis of current clustering algorithms and suggestions for further integrating hybrid models that combine KFCM with neural networks to enhance speech recognition accuracy. Through this review, we advocate for a shift towards more sophisticated, adaptive clustering techniques that can significantly improve speech enhancement and pave the way for more resilient speech processing systems.
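For readers unfamiliar with KFCM, here is a compact generic implementation of Kernel Fuzzy C-Means with a Gaussian kernel (the common formulation that keeps prototypes in input space); the toy data, sigma, and fuzzifier m are illustrative, and this is not code from any of the reviewed papers.

```python
import numpy as np

def kfcm(X, c=2, m=2.0, sigma=1.0, iters=50, eps=1e-9):
    n = X.shape[0]
    V = X[np.random.default_rng(0).choice(n, c, replace=False)]   # initial prototypes
    for _ in range(iters):
        K = np.exp(-np.sum((X[:, None] - V[None]) ** 2, -1) / (2 * sigma ** 2))  # (n, c)
        d = 1.0 - K + eps                                          # kernel-induced distance
        U = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (1 / (m - 1)), axis=2)  # memberships
        W = (U ** m) * K                                           # (n, c) update weights
        V = (W.T @ X) / (W.sum(axis=0)[:, None] + eps)             # prototype update
    return U, V

X = np.vstack([np.random.randn(50, 2) + 3, np.random.randn(50, 2) - 3])
U, V = kfcm(X)
print("membership matrix shape:", U.shape, "prototypes:\n", V.round(2))
```

In a speech-enhancement setting, the rows of X would be frame-level spectral features, and the soft memberships would separate speech-dominated from noise-dominated regions.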
https://arxiv.org/abs/2409.19448
The increasing administrative burden of medical documentation, particularly through Electronic Health Records (EHR), significantly reduces the time available for direct patient care and contributes to physician burnout. To address this issue, we propose MediNotes, an advanced generative AI framework designed to automate the creation of SOAP (Subjective, Objective, Assessment, Plan) notes from medical conversations. MediNotes integrates Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and Automatic Speech Recognition (ASR) to capture and process both text and voice inputs in real time or from recorded audio, generating structured and contextually accurate medical notes. The framework also incorporates advanced techniques like Quantized Low-Rank Adaptation (QLoRA) and Parameter-Efficient Fine-Tuning (PEFT) for efficient model fine-tuning in resource-constrained environments. Additionally, MediNotes offers a query-based retrieval system, allowing healthcare providers and patients to access relevant medical information quickly and accurately. Evaluations using the ACI-BENCH dataset demonstrate that MediNotes significantly improves the accuracy, efficiency, and usability of automated medical documentation, offering a robust solution to reduce the administrative burden on healthcare professionals while improving the quality of clinical workflows.
https://arxiv.org/abs/2410.01841
Current automatic speech recognition systems struggle to model long speech sequences due to the quadratic complexity of Transformer-based models. Selective state space models such as Mamba have performed well on long-sequence modeling in natural language processing and computer vision tasks. However, such research endeavors remain under-explored for speech technology tasks. We propose Speech-Mamba, which incorporates selective state space modeling into a Transformer neural architecture. The long-sequence representations from the selective state space models in Speech-Mamba are complemented with lower-level representations from Transformer-based modeling. Speech-Mamba achieves better capacity to model long-range dependencies, as it scales near-linearly with sequence length.
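The toy sketch below shows the selective state-space recurrence that Mamba-style models build on: the discretization step and the input/output projections depend on the input at each step (the "selective" part), and the state update is a per-step linear recurrence, so the cost grows linearly with sequence length. The dimensions and parameterization are simplified illustrations, not the Speech-Mamba architecture.

```python
import torch
import torch.nn as nn

class TinySelectiveSSM(nn.Module):
    def __init__(self, d_model=16, d_state=8):
        super().__init__()
        self.A = nn.Parameter(-torch.rand(d_state))          # stable diagonal state matrix
        self.to_B = nn.Linear(d_model, d_state)              # input-dependent input projection
        self.to_C = nn.Linear(d_model, d_state)              # input-dependent output projection
        self.to_dt = nn.Linear(d_model, 1)                   # input-dependent step size

    def forward(self, u):                                    # u: (T, d_model)
        h = torch.zeros(self.A.shape[0])
        ys = []
        for u_t in u:                                        # linear-time sequential scan
            dt = torch.nn.functional.softplus(self.to_dt(u_t))      # step size > 0
            h = torch.exp(dt * self.A) * h + dt * self.to_B(u_t)    # discretized state update
            ys.append(torch.dot(self.to_C(u_t), h))                 # scalar readout per step
        return torch.stack(ys)                               # (T,)

print(TinySelectiveSSM()(torch.randn(100, 16)).shape)        # torch.Size([100])
```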
https://arxiv.org/abs/2409.18654