We present a geometry-driven method for normalizing dysarthric speech using local Lie group transformations of spectrograms. Time, frequency, and amplitude distortions are modeled as smooth, invertible deformations, parameterized by scalar fields and applied via exponential maps. A neural network is trained to infer these fields from synthetic distortions of typical speech, without using any pathological data. At test time, the model applies an approximate inverse to real dysarthric inputs. Despite this zero-shot setting, we observe substantial ASR gains, including up to 16 percentage points of WER reduction on challenging TORGO samples, with no degradation on clean speech. This work introduces a principled, interpretable approach to robust speech recognition under motor speech disorders.
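To make the deformation model concrete, here is a minimal sketch (not the authors' code) of a smooth, invertible time warp applied to a spectrogram: a 1-D scalar velocity field is integrated by scaling-and-squaring as a stand-in for the exponential map, and the spectrogram is resampled along the warped time axis. The field shape, interpolation order, and function names are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def exp_map_1d(velocity, n_steps=6):
    """Approximate the exponential map of a 1-D velocity field by
    scaling-and-squaring: start from a small displacement and repeatedly
    compose it with itself (each composition doubles the integration time)."""
    disp = velocity / (2 ** n_steps)               # small initial displacement
    grid = np.arange(len(velocity), dtype=float)
    for _ in range(n_steps):
        # phi(x) = x + d(x);  (phi o phi)(x) = x + d(x) + d(x + d(x))
        disp = disp + np.interp(grid + disp, grid, disp)
    return grid + disp                             # warped time coordinates

def warp_time(spec, velocity):
    """Apply a smooth, invertible time warp to a (freq, time) spectrogram."""
    warped_t = exp_map_1d(velocity)                # shape (T,)
    f_idx = np.arange(spec.shape[0])
    coords = np.stack(np.meshgrid(f_idx, warped_t, indexing="ij"))
    return map_coordinates(spec, coords, order=1, mode="nearest")

# toy usage: a gentle sinusoidal tempo perturbation on a stand-in log-mel spectrogram
spec = np.random.rand(80, 200)
velocity = 3.0 * np.sin(np.linspace(0, 2 * np.pi, 200))
spec_warped = warp_time(spec, velocity)
```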
https://arxiv.org/abs/2504.12279
Automatic speech recognition (ASR) is crucial for human-machine interaction in diverse applications like conversational agents, industrial robotics, call center automation, and automated subtitling. However, developing high-performance ASR models remains challenging, particularly for low-resource languages like Arabic, due to the scarcity of large, labeled speech datasets, which are costly and labor-intensive to produce. In this work, we employ weakly supervised learning to train an Arabic ASR model using the Conformer architecture. Our model is trained from scratch on 15,000 hours of weakly annotated speech data covering both Modern Standard Arabic (MSA) and Dialectal Arabic (DA), eliminating the need for costly manual transcriptions. Despite the absence of human-verified labels, our approach attains state-of-the-art (SOTA) performance, exceeding all previous efforts in the field of Arabic ASR on the standard benchmarks. By demonstrating the effectiveness of weak supervision as a scalable, cost-efficient alternative to traditional supervised approaches, this work paves the way for improved ASR systems in low-resource settings.
https://arxiv.org/abs/2504.12254
Rich-text captions are essential for communication by Deaf and hard-of-hearing (DHH) people, second-language learners, and people with autism spectrum disorder (ASD). They also preserve nuances when converting speech to text, enhancing the realism of presentation scripts and conversation or speech logs. However, current real-time captioning systems lack the capability to alter text attributes (e.g., capitalization, size, and font) at the word level, hindering the accurate conveyance of speaker intent expressed in the tone or intonation of the speech. For example, ''YOU should do this'' is typically read with ''YOU'' as the focus of the sentence, whereas ''You should do THIS'' places the focus on ''THIS''. This paper proposes a solution that changes text decorations at the word level in real time. As a prototype, we developed an application that adjusts word size based on the loudness of each spoken word. Feedback from users suggests that this system helped convey the speaker's intent, offering a more engaging and accessible captioning experience.
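As an illustration of the prototype's core idea, the sketch below maps each word's RMS loudness to a font size, given word-level timestamps from a captioning ASR; the dB normalization and point-size range are arbitrary assumptions, not values from the paper.

```python
import numpy as np

def word_font_sizes(audio, sr, word_times, base_pt=16, max_pt=40):
    """Map each word's RMS loudness (in dB) to a font size in points.
    word_times: list of (word, start_sec, end_sec) from a captioning ASR."""
    dbs = []
    for _, start, end in word_times:
        seg = audio[int(start * sr):int(end * sr)]
        rms = np.sqrt(np.mean(seg ** 2) + 1e-12)
        dbs.append(20 * np.log10(rms + 1e-12))
    lo, hi = min(dbs), max(dbs)
    sizes = []
    for (word, _, _), db in zip(word_times, dbs):
        scale = 0.0 if hi == lo else (db - lo) / (hi - lo)   # 0..1 within the utterance
        sizes.append((word, round(base_pt + scale * (max_pt - base_pt))))
    return sizes

# toy usage with synthetic audio and word timestamps
sr = 16000
audio = np.concatenate([0.05 * np.random.randn(sr),   # quiet "you"
                        0.30 * np.random.randn(sr),   # loud "should"
                        0.10 * np.random.randn(sr)])  # medium "do"
print(word_font_sizes(audio, sr, [("you", 0.0, 1.0), ("should", 1.0, 2.0), ("do", 2.0, 3.0)]))
```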
https://arxiv.org/abs/2504.10849
Integrating spatial context into large language models (LLMs) has the potential to revolutionize human-computer interaction, particularly in wearable devices. In this work, we present a novel system architecture that incorporates spatial speech understanding into LLMs, enabling contextually aware and adaptive applications for wearable technologies. Our approach leverages microstructure-based spatial sensing to extract precise Direction of Arrival (DoA) information using a monaural microphone. To address the lack of existing datasets for microstructure-assisted speech recordings, we synthetically create a dataset called OmniTalk from the LibriSpeech corpus. This spatial information is fused with linguistic embeddings from OpenAI's Whisper model, allowing each modality to learn complementary contextual representations. The fused embeddings are aligned with the input space of the LLaMA-3.2 3B model and fine-tuned with the lightweight adaptation technique LoRA to optimize for on-device processing. Our system, SING, supports spatially-aware automatic speech recognition (ASR), achieving a mean DoA error of $25.72^\circ$ (a substantial improvement over the $88.52^\circ$ median error in existing work) with a word error rate (WER) of 5.3. SING also supports soundscaping, for example inferring how many people are talking and from which directions, for up to 5 speakers with a median DoA error of $16^\circ$. Our system demonstrates superior performance in spatial speech understanding while addressing the challenges of power efficiency, privacy, and hardware constraints, paving the way for advanced applications in augmented reality, accessibility, and immersive experiences.
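A hedged sketch of how DoA information might be fused with acoustic embeddings before entering the LLM input space is shown below; the dimensions (including the 3072-wide embedding size assumed for LLaMA-3.2 3B), the (sin, cos) angle encoding, and the module names are assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class SpatialSpeechFusion(nn.Module):
    """Illustrative fusion head: project Whisper-style acoustic features and a
    DoA angle into a shared space, then map to the LLM's embedding width."""
    def __init__(self, speech_dim=768, fuse_dim=256, llm_dim=3072):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, fuse_dim)
        self.doa_proj = nn.Sequential(nn.Linear(2, fuse_dim), nn.GELU())  # (sin, cos) of DoA
        self.to_llm = nn.Linear(2 * fuse_dim, llm_dim)

    def forward(self, speech_feats, doa_deg):
        # speech_feats: (batch, frames, speech_dim); doa_deg: (batch,)
        rad = torch.deg2rad(doa_deg)
        doa = self.doa_proj(torch.stack([torch.sin(rad), torch.cos(rad)], dim=-1))
        doa = doa.unsqueeze(1).expand(-1, speech_feats.size(1), -1)      # broadcast over frames
        fused = torch.cat([self.speech_proj(speech_feats), doa], dim=-1)
        return self.to_llm(fused)        # ready to be treated as LLM input embeddings

fusion = SpatialSpeechFusion()
tokens = fusion(torch.randn(2, 50, 768), torch.tensor([25.7, 310.0]))
print(tokens.shape)                       # torch.Size([2, 50, 3072])
```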
https://arxiv.org/abs/2504.08907
Speech summarization has become an essential tool for efficiently managing and accessing the growing volume of spoken and audiovisual content. However, despite its increasing importance, speech summarization is still not clearly defined and intersects with several research areas, including speech recognition, text summarization, and specific applications like meeting summarization. This survey not only examines existing datasets and evaluation methodologies, which are crucial for assessing the effectiveness of summarization approaches, but also synthesizes recent developments in the field, highlighting the shift from traditional systems to advanced models like fine-tuned cascaded architectures and end-to-end solutions.
https://arxiv.org/abs/2504.08024
Humans have the ability to utilize visual cues, such as lip movements and visual scenes, to enhance auditory perception, particularly in noisy environments. However, current Automatic Speech Recognition (ASR) or Audio-Visual Speech Recognition (AVSR) models often struggle in noisy scenarios. To address this, we propose a model that improves transcription by correlating noise sources with visual cues. Unlike works that rely on lip motion and require the speaker's visibility, we exploit broader visual information from the environment. This allows our model to naturally filter speech from noise and improve transcription, much like humans do in noisy scenarios. Our method re-purposes pretrained speech and visual encoders, linking them with multi-headed attention. This approach enables the transcription of speech and the prediction of noise labels in video inputs. We introduce a scalable pipeline to develop audio-visual datasets in which visual cues correlate with noise in the audio. We show significant improvements over existing audio-only models in noisy scenarios. Results also highlight that visual cues play a vital role in improving transcription accuracy.
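The sketch below illustrates the general pattern of linking pretrained speech and visual encoders with multi-headed attention: speech frames attend to visual tokens, and a small head predicts noise labels. The dimensions, residual fusion, and pooling choice are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    """Sketch of linking frozen speech and visual encoder outputs with
    multi-headed attention: speech frames attend to visual scene tokens that
    describe likely noise sources, and the result feeds a transcription decoder."""
    def __init__(self, d_speech=512, d_visual=768, d_model=512, n_heads=8, n_noise_labels=50):
        super().__init__()
        self.v_proj = nn.Linear(d_visual, d_model)
        self.s_proj = nn.Linear(d_speech, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.noise_head = nn.Linear(d_model, n_noise_labels)   # predicts noise labels from visual cues

    def forward(self, speech_feats, visual_tokens):
        q = self.s_proj(speech_feats)                   # (B, T, d_model)
        kv = self.v_proj(visual_tokens)                 # (B, N, d_model)
        attended, _ = self.cross_attn(q, kv, kv)        # speech conditioned on visual context
        noise_logits = self.noise_head(kv.mean(dim=1))  # pooled visual tokens -> noise label
        return q + attended, noise_logits               # residual fusion for the ASR decoder

model = AudioVisualFusion()
fused, noise_logits = model(torch.randn(2, 100, 512), torch.randn(2, 49, 768))
print(fused.shape, noise_logits.shape)
```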
https://arxiv.org/abs/2504.07229
Training speech recognition systems on noisy transcripts is a significant challenge in industrial pipelines, where datasets are enormous and ensuring accurate transcription for every instance is difficult. In this work, we introduce novel loss functions to mitigate the impact of transcription errors in RNN-Transducer models. Our Star-Transducer loss addresses deletion errors by incorporating "skip frame" transitions in the loss lattice, restoring over 90% of the system's performance compared to models trained with accurate transcripts. The Bypass-Transducer loss uses "skip token" transitions to tackle insertion errors, recovering more than 60% of the quality. Finally, the Target-Robust Transducer loss merges these approaches, offering robust performance against arbitrary errors. Experimental results demonstrate that the Target-Robust Transducer loss significantly improves RNN-T performance on noisy data by restoring over 70% of the quality compared to well-transcribed data.
https://arxiv.org/abs/2504.06963
Document-level context is crucial for handling discourse challenges in text-to-text document-level machine translation (MT). Despite the increased discourse challenges introduced by noise from automatic speech recognition (ASR), the integration of document-level context in speech translation (ST) remains insufficiently explored. In this paper, we develop DoCIA, an online framework that enhances ST performance by incorporating document-level context. DoCIA decomposes the ST pipeline into four stages. Document-level context is integrated into the ASR refinement, MT, and MT refinement stages through auxiliary LLM (large language model)-based modules. Furthermore, DoCIA leverages document-level information in a multi-level manner while minimizing computational overhead. Additionally, a simple yet effective determination mechanism is introduced to prevent hallucinations from excessive refinement, ensuring the reliability of the final results. Experimental results show that DoCIA significantly outperforms traditional ST baselines in both sentence and discourse metrics across four LLMs, demonstrating its effectiveness in improving ST performance.
https://arxiv.org/abs/2504.05122
Extensive research has shown that Automatic Speech Recognition (ASR) systems are vulnerable to audio adversarial attacks. Current attacks mainly focus on single-source scenarios, ignoring dual-source scenarios where two people are speaking simultaneously. To bridge this gap, we propose a Selective Masking Adversarial attack, namely the SMA attack, which ensures that one audio source is selected for recognition while the other is muted in dual-source scenarios. To better adapt to the dual-source scenario, the SMA attack constructs the normal dual-source audio from the muted and selected audio sources. It initializes the adversarial perturbation with small Gaussian noise and iteratively optimizes it using a selective masking optimization algorithm. Extensive experiments demonstrate that the SMA attack can generate effective and imperceptible audio adversarial examples in the dual-source scenario, achieving an average attack success rate of 100% and a signal-to-noise ratio of 37.15 dB on Conformer-CTC, outperforming the baselines.
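A minimal, hedged sketch of the kind of perturbation optimization described above (Gaussian initialization plus iterative updates under an imperceptibility constraint) follows; `asr_ctc_loss` is a placeholder for a differentiable loss against a victim model such as Conformer-CTC, and the optimizer, step count, and clipping bound are illustrative choices rather than the paper's full selective masking algorithm.

```python
import torch

def sma_style_attack(selected, muted, asr_ctc_loss, target_text, steps=200, lr=1e-3, eps=0.01):
    """Illustrative optimization loop in the spirit of the SMA attack: shape a
    perturbation so the two-speaker mixture is recognized as `target_text`
    (the selected source) while the other source is effectively muted."""
    delta = 0.001 * torch.randn_like(selected)          # small Gaussian init
    delta.requires_grad_(True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        mixture = selected + muted + delta              # dual-source audio plus perturbation
        loss = asr_ctc_loss(mixture, target_text)       # push recognition toward the selected source
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                     # keep the perturbation imperceptible
    return delta.detach()
```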
https://arxiv.org/abs/2504.04394
Robot audition, encompassing Sound Source Localization (SSL), Sound Source Separation (SSS), and Automatic Speech Recognition (ASR), enables robots and smart devices to acquire auditory capabilities similar to human hearing. Despite their wide applicability, processing multi-channel audio signals from microphone arrays for SSL involves computationally intensive matrix operations, which can hinder efficient deployment on Central Processing Units (CPUs), particularly in embedded systems with limited CPU resources. This paper introduces a GPU-based implementation of SSL for robot audition, utilizing Generalized Singular Value Decomposition-based Multiple Signal Classification (GSVD-MUSIC), a noise-robust algorithm, within the HARK platform, an open-source software suite. For a 60-channel microphone array, the proposed implementation achieves significant performance improvements. On the Jetson AGX Orin, an embedded device powered by an NVIDIA GPU and ARM Cortex-A78AE v8.2 64-bit CPUs, we observe speedups of 4645.1x for the GSVD calculation and 8.8x for the SSL module; on a server configured with an NVIDIA A100 GPU and AMD EPYC 7352 CPUs, the speedups are 2223.4x for the GSVD calculation and 8.95x for the entire SSL module. These gains make real-time processing feasible for large-scale microphone arrays and leave ample capacity for real-time processing of subsequent machine learning or deep learning tasks.
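For readers unfamiliar with MUSIC-style localization, the sketch below shows the plain eigendecomposition variant of the algorithm for a uniform linear array; GSVD-MUSIC as used in HARK additionally incorporates a noise correlation matrix for robustness, and the array geometry and frequency here are arbitrary assumptions.

```python
import numpy as np

def music_spectrum(X, n_sources, mic_spacing=0.03, freq=1000.0, c=343.0,
                   angles=np.arange(-90, 91)):
    """Minimal narrowband MUSIC sketch for a uniform linear array.
    X: (n_mics, n_snapshots) complex STFT snapshots at one frequency bin."""
    n_mics = X.shape[0]
    R = X @ X.conj().T / X.shape[1]                    # spatial correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)               # eigenvalues in ascending order
    En = eigvecs[:, : n_mics - n_sources]              # noise subspace
    spectrum = []
    for theta in np.deg2rad(angles):
        delay = mic_spacing * np.arange(n_mics) * np.sin(theta) / c
        a = np.exp(-2j * np.pi * freq * delay)         # steering vector
        denom = np.linalg.norm(En.conj().T @ a) ** 2
        spectrum.append(1.0 / (denom + 1e-12))         # peaks at source directions
    return angles, np.array(spectrum)

# toy usage: 8-mic array, random snapshots
angles, p = music_spectrum(np.random.randn(8, 256) + 1j * np.random.randn(8, 256), n_sources=2)
print(angles[np.argmax(p)])
```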
https://arxiv.org/abs/2504.03373
Recent developments in Artificial Intelligence (AI) and Machine Learning (ML) are creating new opportunities for Human-Autonomy Teaming (HAT) in tasks, missions, and continuous coordinated activities. A major challenge is enabling humans to maintain awareness and control over autonomous assets, while also building trust and supporting shared contextual understanding. To address this, we present a real-time Human Digital Twin (HDT) architecture that integrates Large Language Models (LLMs) for knowledge reporting, answering, and recommendation, embodied in a visual interface. The system applies a metacognitive approach to enable personalized, context-aware responses aligned with the human teammate's expectations. The HDT acts as a visually and behaviorally realistic team member, integrated throughout the mission lifecycle, from training to deployment to after-action review. Our architecture includes speech recognition, context processing, AI-driven dialogue, emotion modeling, lip-syncing, and multimodal feedback. We describe the system design, performance metrics, and future development directions for more adaptive and realistic HAT systems.
https://arxiv.org/abs/2504.03147
Developing Automatic Speech Recognition (ASR) systems for Tunisian Arabic Dialect is challenging due to the dialect's linguistic complexity and the scarcity of annotated speech datasets. To address these challenges, we propose the LinTO audio and textual datasets -- comprehensive resources that capture phonological and lexical features of Tunisian Arabic Dialect. These datasets include a variety of texts from numerous sources and real-world audio samples featuring diverse speakers and code-switching between Tunisian Arabic Dialect and English or French. By providing high-quality audio paired with precise transcriptions, the LinTO audio and textual datasets aim to supply the material needed to build and benchmark ASR systems for the Tunisian Arabic Dialect. Keywords -- Tunisian Arabic Dialect, Speech-to-Text, Low-Resource Languages, Audio Data Augmentation
https://arxiv.org/abs/2504.02604
Hornbills, an iconic species of Malaysia's biodiversity, face threats from habitat loss, poaching, and environmental changes, necessitating accurate and real-time population monitoring that is traditionally challenging and resource intensive. The emergence of Tiny Machine Learning (TinyML) offers a chance to transform wildlife monitoring by enabling efficient, real-time data analysis directly on edge devices. Addressing the challenge of wildlife conservation, this research paper explores the pivotal role of machine learning, specifically TinyML, in the classification and monitoring of hornbill calls in Malaysia. Leveraging audio data from the Xeno-canto database, the study aims to develop a speech recognition system capable of identifying and classifying hornbill vocalizations. The proposed methodology involves pre-processing the audio data, extracting features using Mel-Frequency Energy (MFE), and deploying the model on an Arduino Nano 33 BLE, which is adept at edge computing. The research encompasses foundational work, including a comprehensive introduction, literature review, and methodology. The model is trained using Edge Impulse and validated through real-world tests, achieving high accuracy in hornbill species identification. The project underscores the potential of TinyML for environmental monitoring and its broader application in ecological conservation efforts, contributing to both the field of TinyML and wildlife conservation.
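A rough stand-in for the MFE feature extraction step is sketched below using librosa's mel spectrogram; the frame sizes, mel-band count, and normalization are assumptions and may differ from the Edge Impulse MFE block used in the study.

```python
import numpy as np
import librosa

def mfe_features(path, sr=16000, n_mels=40, frame_len=0.02, frame_stride=0.01):
    """Rough stand-in for Mel-Frequency Energy (MFE) features: a log mel
    spectrogram over short frames, normalized so a small on-device classifier
    sees a stable input range."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=int(frame_len * sr),
        hop_length=int(frame_stride * sr),
        n_mels=n_mels,
    )
    log_mel = librosa.power_to_db(mel, ref=np.max)
    log_mel = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-9)
    return log_mel.astype(np.float32)

# features = mfe_features("hornbill_call.wav")   # e.g. a Xeno-canto recording (hypothetical path)
```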
https://arxiv.org/abs/2504.12272
We present F5R-TTS, a novel text-to-speech (TTS) system that integrates Group Relative Policy Optimization (GRPO) into a flow-matching based architecture. By reformulating the deterministic outputs of flow-matching TTS into probabilistic Gaussian distributions, our approach enables seamless integration of reinforcement learning algorithms. During pretraining, we train a probabilistically reformulated flow-matching based model, derived from F5-TTS, on an open-source dataset. In the subsequent reinforcement learning (RL) phase, we employ a GRPO-driven enhancement stage that leverages dual reward metrics: word error rate (WER) computed via automatic speech recognition and speaker similarity (SIM) assessed by verification models. Experimental results on zero-shot voice cloning demonstrate that F5R-TTS achieves significant improvements in both speech intelligibility (a relative 29.5\% WER reduction) and speaker similarity (a relative 4.6\% SIM score increase) compared to conventional flow-matching based TTS systems. Audio samples are available at this https URL.
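To illustrate the dual reward signal described above, the sketch below combines a (negative) word error rate with a cosine speaker-similarity score; the weighting and normalization are assumptions, and a real pipeline would obtain the hypothesis from an ASR model and the embeddings from a speaker verification model.

```python
import numpy as np

def wer(ref, hyp):
    """Word error rate via edit distance between word sequences."""
    r, h = ref.split(), hyp.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
    return d[len(r), len(h)] / max(len(r), 1)

def dual_reward(ref_text, asr_hyp, spk_emb_gen, spk_emb_ref, w_wer=1.0, w_sim=1.0):
    """Illustrative combination of the two reward signals: intelligibility via
    (negative) WER and speaker similarity via cosine similarity of embeddings.
    The weights and normalization are assumptions, not the paper's values."""
    sim = float(np.dot(spk_emb_gen, spk_emb_ref) /
                (np.linalg.norm(spk_emb_gen) * np.linalg.norm(spk_emb_ref) + 1e-9))
    return w_wer * (1.0 - wer(ref_text, asr_hyp)) + w_sim * sim

print(dual_reward("the cat sat", "the cat sat", np.ones(192), np.ones(192)))   # 2.0
```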
https://arxiv.org/abs/2504.02407
Full-text error correction with Large Language Models (LLMs) for Automatic Speech Recognition (ASR) has gained increased attention due to its potential to correct errors across long contexts and address a broader spectrum of error types, including punctuation restoration and inverse text normalization. Nevertheless, many challenges persist, including issues related to stability, controllability, completeness, and fluency. To mitigate these challenges, this paper proposes the Chain of Correction (CoC) for full-text error correction with LLMs, which corrects errors segment by segment using pre-recognized text as guidance within a regular multi-turn chat format. The CoC also uses the pre-recognized full text for context, allowing the model to better grasp global semantics and maintain a comprehensive overview of the entire content. Utilizing the open-sourced full-text error correction dataset ChFT, we fine-tune a pre-trained LLM to evaluate the performance of the CoC framework. Experimental results demonstrate that the CoC effectively corrects errors in full-text ASR outputs, significantly outperforming baseline and benchmark systems. We further analyze how to set the correction threshold to balance under-correction and over-rephrasing, extrapolate the CoC model to extremely long ASR outputs, and investigate whether other types of information can be employed to guide the error correction process.
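A minimal sketch of the segment-by-segment, multi-turn correction loop described above follows; `chat(messages)` is a placeholder for any chat-style LLM call (not a real API), and the prompts are illustrative rather than the paper's.

```python
def chain_of_correction(asr_full_text, segments, chat):
    """Sketch of a CoC-style loop: the full pre-recognized transcript is given
    once as context, then each turn asks the model to correct one segment while
    the conversation history preserves earlier corrections.
    `segments` is the pre-recognized text split into chunks; `chat` is any
    callable taking a list of chat messages and returning the assistant reply."""
    messages = [{
        "role": "system",
        "content": "You correct ASR errors. Full pre-recognized transcript for context:\n"
                   + asr_full_text,
    }]
    corrected = []
    for seg in segments:
        messages.append({"role": "user",
                         "content": f"Correct this segment, changing as little as possible:\n{seg}"})
        reply = chat(messages)                      # one multi-turn chat call per segment
        messages.append({"role": "assistant", "content": reply})
        corrected.append(reply)
    return " ".join(corrected)
```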
https://arxiv.org/abs/2504.01519
The widespread application of automatic speech recognition (ASR) supports large-scale voice surveillance, raising privacy concerns among users. In this paper, we concentrate on using adversarial examples to thwart unauthorized disclosure of speech privacy to potential eavesdroppers in speech communications. While audio adversarial examples have demonstrated the capability to mislead ASR models or evade ASR surveillance, they are typically constructed through time-intensive offline optimization, restricting their practicality in real-time voice communication. Recent work overcame this limitation by generating universal adversarial perturbations (UAPs) and enhancing their transferability for black-box scenarios. However, these methods introduce excessive noise that significantly degrades audio quality and affects human perception, thereby limiting their effectiveness in practical scenarios. To address this limitation and protect live users' speech against ASR systems, we propose a novel framework, AudioShield. Central to this framework is the concept of Transferable Universal Adversarial Perturbations in the Latent Space (LS-TUAP). By transferring the perturbations to the latent space, the audio quality is preserved to a large extent. Additionally, we propose target feature adaptation to enhance the transferability of UAPs by embedding target text features into the perturbations. Comprehensive evaluation on four commercial ASR APIs (Google, Amazon, iFlytek, and Alibaba), three voice assistants, two LLM-powered ASR systems, and one NN-based ASR system demonstrates the protection superiority of AudioShield over existing competitors, and both objective and subjective evaluations indicate that AudioShield significantly improves the audio quality. Moreover, AudioShield also shows high effectiveness in real-time end-to-end scenarios and demonstrates strong resilience against adaptive countermeasures.
https://arxiv.org/abs/2504.00858
The detection of telecom fraud faces significant challenges due to the lack of high-quality multimodal training data that integrates audio signals with reasoning-oriented textual analysis. To address this gap, we present TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset specifically designed for automated telecom fraud analysis. Our dataset is constructed through three strategies: (1) Privacy-preserved text-truth sample generation using automatic speech recognition (ASR)-transcribed call recordings (with anonymized original audio), ensuring real-world consistency through text-to-speech (TTS) model regeneration; (2) Semantic enhancement via large language model (LLM)-based self-instruction sampling on authentic ASR outputs to expand scenario coverage; (3) Multi-agent adversarial synthesis that simulates emerging fraud tactics through predefined communication scenarios and fraud typologies. The generated dataset contains 28,511 rigorously processed speech-text pairs, complete with detailed annotations for fraud reasoning. The dataset covers three tasks: scenario classification, fraud detection, and fraud type classification. Furthermore, we construct TeleAntiFraud-Bench, a standardized evaluation benchmark comprising proportionally sampled instances from the dataset, to facilitate systematic testing of model performance on telecom fraud detection tasks. We also contribute a production-optimized supervised fine-tuning (SFT) model trained on hybrid real/synthetic data, while open-sourcing the data processing framework to enable community-driven dataset expansion. This work establishes a foundational framework for multimodal anti-fraud research while addressing critical challenges in data privacy and scenario diversity. The project will be released at this https URL.
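The sketch below outlines strategy (1), the privacy-preserved text-truth generation step, with placeholder callables for the ASR, anonymization, and TTS components; none of the names or the label fields shown come from the TeleAntiFraud release.

```python
def build_text_truth_sample(call_audio, asr, anonymize, tts):
    """Sketch of strategy (1) with placeholder callables: transcribe a real
    call, strip personally identifying details from the text, then regenerate
    audio with TTS so no original voice is released.  `asr`, `anonymize`, and
    `tts` stand in for whatever models a pipeline uses."""
    transcript = asr(call_audio)                 # ASR-transcribed call recording
    safe_text = anonymize(transcript)            # e.g. replace names, numbers, addresses
    synthetic_audio = tts(safe_text)             # TTS regeneration keeps real-world phrasing
    return {"audio": synthetic_audio,
            "text": safe_text,
            "labels": {"scenario": None, "is_fraud": None, "fraud_type": None}}  # to be annotated
```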
https://arxiv.org/abs/2503.24115
Code-switching, the act of alternating between languages, has emerged as a prevalent global phenomenon that needs to be addressed for building user-friendly language technologies. A main bottleneck in this pursuit is data scarcity, motivating research in the direction of code-switched data augmentation. However, the current literature lacks comprehensive studies that enable us to understand the relation between the quality of synthetic data and improvements on NLP tasks. We extend previous research conducted in this direction on machine translation (MT) with results on automatic speech recognition (ASR) and cascaded speech translation (ST) to test the generalizability of findings. Our experiments involve a wide range of augmentation techniques, covering lexical replacements, linguistic theories, and back-translation. Based on the results of MT, ASR, and ST, we draw conclusions and insights regarding the efficacy of various augmentation techniques and the impact of quality on performance.
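As a toy example of the simplest augmentation family mentioned above (lexical replacement), the sketch below swaps individual words for dictionary translations; real pipelines add linguistic constraints and back-translation, and the dictionary entries here are purely illustrative.

```python
import random

def lexical_codeswitch(sentence, bilingual_dict, switch_prob=0.3, seed=0):
    """Toy lexical-replacement augmentation: swap individual source-language
    words for dictionary translations to synthesize code-switched text."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        key = word.lower()
        if key in bilingual_dict and rng.random() < switch_prob:
            out.append(bilingual_dict[key])
        else:
            out.append(word)
    return " ".join(out)

# toy usage with an illustrative English-Arabic dictionary
print(lexical_codeswitch("I will call the doctor tomorrow",
                         {"doctor": "طبيب", "tomorrow": "بكرة"}, switch_prob=1.0))
```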
https://arxiv.org/abs/2503.23576
Automatic speech recognition systems have undoubtedly advanced with the integration of multilingual and multitask models such as Whisper, which have shown a promising ability to understand and process speech across a wide range of languages. Despite their robustness, these models often fall short in handling the linguistic distinctions of minority languages. This study addresses this gap by integrating traditional and novel language models with fine-tuned Whisper models to improve their performance on less commonly studied languages. Through rigorous fine-tuning and evaluation across multiple datasets, we demonstrate substantial improvements in word error rate, particularly in low-resource scenarios. Our approach not only takes advantage of the extensive data Whisper was pre-trained on, but also complements its linguistic adaptability by incorporating language models. We obtained improvements of up to 51\% for in-distribution datasets and up to 34\% for out-of-distribution sentences using statistical language models, while large language models provided moderate but consistently robust improvements across diverse linguistic contexts. The findings reveal that, while the integration reliably benefits all model sizes, the extent of improvement varies, highlighting the importance of optimized language model parameters. Finally, we emphasize the importance of selecting appropriate evaluation parameters when reporting results with transformer-based ASR models. In summary, this research clears the way for more inclusive ASR technologies that perform better across languages by enriching their linguistic knowledge. For further implementation details of this study, the technical documentation and source code are available at this http URL.
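A minimal sketch of one way to combine an ASR model's hypotheses with an external language model (n-best rescoring) is shown below; the interpolation weight, scoring interface, and toy LM are assumptions, not the configuration used in the paper.

```python
import math

def rescore_nbest(nbest, lm_logprob, lm_weight=0.5, length_penalty=0.0):
    """Minimal n-best rescoring sketch: combine each ASR hypothesis score with
    an external language-model score.  `nbest` is a list of
    (hypothesis_text, asr_logprob) pairs; `lm_logprob` is any callable that
    scores text (statistical n-gram or LLM)."""
    def combined(item):
        text, asr_lp = item
        n_words = max(len(text.split()), 1)
        return asr_lp + lm_weight * lm_logprob(text) + length_penalty * n_words
    return max(nbest, key=combined)

# toy usage with a fake unigram LM that assigns probability 0.1 to every word
toy_lm = lambda s: sum(math.log(0.1) for _ in s.split())
best = rescore_nbest([("recognize speech", -4.2), ("wreck a nice beach", -4.0)], toy_lm)
print(best)   # ('recognize speech', -4.2)
```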
https://arxiv.org/abs/2503.23542
Large language models (LLMs) have shown exceptional versatility in natural language processing, prompting recent efforts to extend their multimodal capabilities to speech processing through the development of audio large language models (Audio LLMs). While Audio LLMs excel in tasks such as speech recognition and synthesis, it remains unclear how they perform when faced with the auditory cognitive challenges posed by real-world environments, such as audio comprehension and listening recall, particularly in the presence of background noise or overlapping speech. Unlike text-based LLMs, which have access to vast amounts of text data for pre-training, retraining Audio LLMs with diverse auditory cognitive scenes is difficult due to the limited datasets that simulate real-world auditory cognitive scenarios and the challenge of acquiring auditory cognitive labels for training. While test-time compute (TTC) methods have been shown to enhance the capabilities of text-based LLMs during inference, a key challenge lies in designing these TTC methods to improve the auditory capabilities of Audio LLMs. This study aims to address these two research gaps by: i) exploring the auditory cognitive capabilities of Audio LLMs, and ii) enhancing their capabilities using TTC approaches. We investigate five different Audio LLMs for auditory cognition using a \textit{self-collected} database and propose five TTC approaches to enhance auditory cognitive capabilities during inference. Our findings reveal that the performance of Audio LLMs decreases on more challenging auditory cognitive tasks. The proposed TTC approaches significantly enhance cognitive auditory capabilities, advancing the development of more adaptable and resilient Audio LLMs for practical applications such as assistive listening devices, voice-based AI assistants, and communication technologies.
https://arxiv.org/abs/2503.23395