Deep learning models trained in a supervised setting have revolutionized audio and speech processing. However, their performance inherently depends on the quantity of human-annotated data, making them costly to scale and prone to poor generalization under unseen conditions. To address these challenges, Self-Supervised Learning (SSL) has emerged as a promising paradigm, leveraging vast amounts of unlabeled data to learn relevant representations. The application of SSL for Automatic Speech Recognition (ASR) has been extensively studied, but research on other downstream tasks, notably Speaker Recognition (SR), remains in its early stages. This work describes major SSL instance-invariance frameworks (e.g., SimCLR, MoCo, and DINO), initially developed for computer vision, along with their adaptation to SR. Various SSL methods for SR, proposed in the literature and built upon these frameworks, are also presented. An extensive review of these approaches is then conducted: (1) the effect of the main hyperparameters of SSL frameworks is investigated; (2) the role of SSL components is studied (e.g., data-augmentation, projector, positive sampling); and (3) SSL frameworks are evaluated on SR with in-domain and out-of-domain data, using a consistent experimental setup, and a comprehensive comparison of SSL methods from the literature is provided. Specifically, DINO achieves the best downstream performance and effectively models intra-speaker variability, although it is highly sensitive to hyperparameters and training conditions, while SimCLR and MoCo provide robust alternatives that effectively capture inter-speaker variability and are less prone to collapse. This work aims to highlight recent trends and advancements, identifying current challenges in the field.
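Since the positive-sampling and loss details vary across the SR adaptations surveyed here, a minimal sketch of a SimCLR-style NT-Xent objective over speaker embeddings may help fix ideas; the temperature, embedding size, and the choice of two augmented segments per utterance as positives are illustrative assumptions, not any single paper's recipe.

```python
# Minimal sketch of an NT-Xent (SimCLR-style) contrastive loss over speaker
# embeddings: two augmented segments of the same utterance form a positive
# pair, all other segments in the batch act as negatives. Values are illustrative.
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """z1, z2: (batch, dim) embeddings of two augmented views of the same utterances."""
    batch = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)          # (2B, dim)
    sim = z @ z.t() / temperature                               # scaled cosine similarities
    mask = torch.eye(2 * batch, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                  # exclude self-similarities
    # the positive of sample i is its other view, located B positions away
    targets = (torch.arange(2 * batch, device=z.device) + batch) % (2 * batch)
    return F.cross_entropy(sim, targets)

# usage: embeddings of two augmentations (e.g. noise / reverberation) of the same batch
z1, z2 = torch.randn(8, 192), torch.randn(8, 192)
print(nt_xent_loss(z1, z2))
```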
https://arxiv.org/abs/2602.10829
Vietnamese has a phonetic orthography, where each grapheme corresponds to at most one phoneme and vice versa. Exploiting this high grapheme-phoneme transparency, we propose ViSpeechFormer (\textbf{Vi}etnamese \textbf{Speech} Trans\textbf{Former}), a phoneme-based approach for Vietnamese Automatic Speech Recognition (ASR). To the best of our knowledge, this is the first Vietnamese ASR framework that explicitly models phonemic representations. Experiments on two publicly available Vietnamese ASR datasets show that ViSpeechFormer achieves strong performance, generalizes better to out-of-vocabulary words, and is less affected by training bias. This phoneme-based paradigm is also promising for other languages with phonetic orthographies. The code will be released upon acceptance of this paper.
https://arxiv.org/abs/2602.10003
This paper introduces the first standardized benchmark for evaluating Automatic Speech Recognition (ASR) in the Bambara language, utilizing one hour of professionally recorded Malian constitutional text. Designed as a controlled reference set under near-optimal acoustic and linguistic conditions, the benchmark was used to evaluate 37 models, ranging from Bambara-trained systems to large-scale commercial models. Our findings reveal that current ASR performance remains significantly below deployment standards even in this narrow formal domain: the system with the best Word Error Rate (WER) achieved 46.76\%, the best Character Error Rate (CER) of 13.00\% was obtained by a different model, and several prominent multilingual models exceeded 100\% WER. These results suggest that multilingual pre-training and model scaling alone are insufficient for underrepresented languages. Furthermore, because this dataset represents a best-case scenario of the most simplified and formal form of spoken Bambara, these figures have yet to be tested against practical, real-world settings. We provide the benchmark and an accompanying public leaderboard to facilitate transparent evaluation and future research in Bambara speech technology.
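For reference, the WER and CER figures quoted above are Levenshtein edit distances normalized by reference length. The sketch below shows the standard computation; the tokenization and text-normalization choices, which strongly affect reported numbers, as well as the example strings, are illustrative only.

```python
# Minimal sketch of WER/CER scoring: edit (Levenshtein) distance between the
# reference and hypothesis, normalized by reference length.
def edit_distance(ref: list, hyp: list) -> int:
    d = list(range(len(hyp) + 1))          # DP row for the empty reference prefix
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(d[j] + 1,             # deletion (hypothesis misses a reference token)
                      d[j - 1] + 1,         # insertion (hypothesis has an extra token)
                      prev + (r != h))      # substitution or match
            prev, d[j] = d[j], cur
    return d[len(hyp)]

def wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / max(len(r), 1)

def cer(ref: str, hyp: str) -> float:
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

print(wer("an ka taa so", "an ka ta so"), cer("an ka taa so", "an ka ta so"))
```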
https://arxiv.org/abs/2602.09785
Dysarthric speech exhibits high variability and limited labeled data, posing major challenges for both automatic speech recognition (ASR) and assistive speech technologies. Existing approaches rely on synthetic data augmentation or speech reconstruction, yet often entangle speaker identity with pathological articulation, limiting controllability and robustness. In this paper, we propose ProtoDisent-TTS, a prototype-based disentanglement TTS framework built on a pre-trained text-to-speech backbone that factorizes speaker timbre and dysarthric articulation within a unified latent space. A pathology prototype codebook provides interpretable and controllable representations of healthy and dysarthric speech patterns, while a dual-classifier objective with a gradient reversal layer enforces invariance of speaker embeddings to pathological attributes. Experiments on the TORGO dataset demonstrate that this design enables bidirectional transformation between healthy and dysarthric speech, leading to consistent ASR performance gains and robust, speaker-aware speech reconstruction.
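The gradient reversal layer mentioned above is the standard adversarial-training trick: identity in the forward pass, negated gradient in the backward pass, so the speaker embedding is pushed to become uninformative about pathology. A minimal PyTorch sketch follows; module names and sizes are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a gradient reversal layer (GRL) for adversarial disentanglement:
# forward pass is identity, backward pass flips (and scales) the gradient, so the
# pathology classifier drives speaker embeddings toward pathology invariance.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class PathologyAdversary(nn.Module):
    """Illustrative adversary: predicts pathology from speaker embeddings through a GRL."""
    def __init__(self, dim: int = 256, n_classes: int = 2, lam: float = 1.0):
        super().__init__()
        self.lam = lam
        self.head = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, n_classes))

    def forward(self, speaker_emb: torch.Tensor) -> torch.Tensor:
        return self.head(GradReverse.apply(speaker_emb, self.lam))
```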
https://arxiv.org/abs/2602.08696
Spoken dialogue is a primary source of information in videos; therefore, accurately identifying who spoke what and when is essential for deep video understanding. We introduce D-ORCA, a \textbf{d}ialogue-centric \textbf{o}mni-modal large language model optimized for \textbf{r}obust audio-visual \textbf{ca}ptioning. We further curate DVD, a large-scale, high-quality bilingual dataset comprising nearly 40,000 multi-party dialogue videos for training and 2000 videos for evaluation in English and Mandarin, addressing a critical gap in the open-source ecosystem. To ensure fine-grained captioning accuracy, we adopt group relative policy optimization with three novel reward functions that assess speaker attribution accuracy, global speech content accuracy, and sentence-level temporal boundary alignment. These rewards are derived from evaluation metrics widely used in speech processing and, to our knowledge, are applied for the first time as reinforcement learning objectives for audio-visual captioning. Extensive experiments demonstrate that D-ORCA substantially outperforms existing open-source models in speaker identification, speech recognition, and temporal grounding. Notably, despite having only 8 billion parameters, D-ORCA achieves performance competitive with Qwen3-Omni across several general-purpose audio-visual understanding benchmarks. Demos are available at \href{this https URL}{this https URL}. Our code, data, and checkpoints will be available at \href{this https URL}{this https URL}.
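The exact reward definitions are not reproduced here; as a hedged illustration, the sketch below shows two metric-style rewards of the kind described, speaker-attribution accuracy and sentence-level temporal IoU. These are assumptions standing in for the paper's formulas, not the actual objectives.

```python
# Illustrative metric-derived rewards for caption reinforcement learning.
def speaker_attribution_reward(ref_speakers: list[str], hyp_speakers: list[str]) -> float:
    """Fraction of captioned sentences assigned to the correct speaker."""
    if not ref_speakers:
        return 0.0
    n = min(len(ref_speakers), len(hyp_speakers))
    correct = sum(r == h for r, h in zip(ref_speakers[:n], hyp_speakers[:n]))
    return correct / len(ref_speakers)

def temporal_iou_reward(ref_span: tuple[float, float], hyp_span: tuple[float, float]) -> float:
    """Intersection-over-union of a predicted and a reference sentence time span."""
    inter = max(0.0, min(ref_span[1], hyp_span[1]) - max(ref_span[0], hyp_span[0]))
    union = (ref_span[1] - ref_span[0]) + (hyp_span[1] - hyp_span[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou_reward((1.0, 3.0), (1.5, 3.5)))  # 0.6
```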
https://arxiv.org/abs/2602.07960
Recent studies have demonstrated that prompting large language models (LLMs) with audio encodings enables effective speech understanding capabilities. However, most speech LLMs are trained on single-channel, single-talker data, which makes it challenging to apply them directly to multi-talker, multi-channel speech understanding tasks. In this work, we present a comprehensive investigation of how to enable directional multi-talker speech understanding capabilities for LLMs, specifically in the smart-glasses use case. We propose two novel approaches to integrate directivity into LLMs: (1) a cascaded system that leverages a source separation front-end module, and (2) an end-to-end system that utilizes serialized output training. Both approaches use a multi-microphone array embedded in smart glasses to optimize directivity interpretation and processing in a streaming manner. Experimental results demonstrate the efficacy of our proposed methods in endowing LLMs with directional speech understanding capabilities, achieving strong performance on both speech recognition and speech translation tasks.
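Serialized output training, used in the end-to-end variant, concatenates the transcripts of all talkers into a single target sequence. The sketch below shows the usual convention (ordering by start time, a speaker-change token between talkers); the token name is illustrative, and any directional tags the paper may add are omitted.

```python
# Minimal sketch of serialized output training (SOT) target construction:
# transcripts of overlapping talkers are concatenated in start-time order,
# separated by a speaker-change token, so one decoder emits all talkers' words.
SC = "<sc>"  # speaker-change token (illustrative name)

def serialize(utterances: list[dict]) -> str:
    """utterances: [{'start': float, 'text': str}, ...] for one multi-talker segment."""
    ordered = sorted(utterances, key=lambda u: u["start"])
    return f" {SC} ".join(u["text"].strip() for u in ordered)

print(serialize([
    {"start": 1.2, "text": "sure, go ahead"},
    {"start": 0.0, "text": "can I ask you something"},
]))
# -> "can I ask you something <sc> sure, go ahead"
```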
https://arxiv.org/abs/2602.07211
The lack of impaired-speech data hinders the development of inclusive speech technologies, particularly in low-resource languages such as Akan. To address this gap, this study presents a curated corpus of speech samples from native Akan speakers with speech impairments. The dataset comprises 50.01 hours of audio recordings spanning four classes of impaired speech: stammering, cerebral palsy, cleft palate, and stroke-induced speech disorder. Recordings were made in controlled, supervised environments where participants described pre-selected images in their own words. The resulting dataset is a collection of audio recordings, transcriptions, and associated metadata on speaker demographics, class of impairment, recording environment, and recording device. The dataset is intended to support research on low-resource automatic disordered-speech recognition systems and assistive speech technology.
https://arxiv.org/abs/2602.05406
Automatic speech recognition (ASR) for conversational speech remains challenging due to the limited availability of large-scale, well-annotated multi-speaker dialogue data and the complex temporal dynamics of natural interactions. Speaker-aware simulated conversations (SASC) offer an effective data augmentation strategy by transforming single-speaker recordings into realistic multi-speaker dialogues. However, prior work has primarily focused on English data, leaving questions about the applicability to lower-resource languages. In this paper, we adapt and implement the SASC framework for Hungarian conversational ASR. We further propose C-SASC, an extended variant that incorporates pause modeling conditioned on utterance duration, enabling a more faithful representation of local temporal dependencies observed in human conversation while retaining the simplicity and efficiency of the original approach. We generate synthetic Hungarian dialogues from the BEA-Large corpus and combine them with real conversational data for ASR training. Both SASC and C-SASC are evaluated extensively under a wide range of simulation configurations, using conversational statistics derived from CallHome, BEA-Dialogue, and GRASS corpora. Experimental results show that speaker-aware conversational simulation consistently improves recognition performance over naive concatenation-based augmentation. While the additional duration conditioning in C-SASC yields modest but systematic gains--most notably in character-level error rates--its effectiveness depends on the match between source conversational statistics and the target domain. Overall, our findings confirm the robustness of speaker-aware conversational simulation for Hungarian ASR and highlight the benefits and limitations of increasingly detailed temporal modeling in synthetic dialogue generation.
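As a hedged illustration of the duration-conditioned pause modeling in C-SASC, the sketch below samples inter-turn pauses from duration-bucketed statistics. The bucket edges and pause values are placeholders, not statistics estimated from CallHome, BEA-Dialogue, or GRASS.

```python
# Illustrative duration-conditioned pause sampling for conversation simulation:
# pauses between simulated turns are drawn from per-bucket samples, where the
# bucket is chosen by the duration of the preceding utterance.
import random
from bisect import bisect_right

BUCKET_EDGES = [1.0, 3.0, 6.0]          # seconds of preceding utterance (assumed edges)
PAUSE_SAMPLES = {                       # placeholder pause values per bucket (seconds)
    0: [0.12, 0.20, 0.35],              # after very short utterances
    1: [0.25, 0.40, 0.70],
    2: [0.40, 0.80, 1.10],
    3: [0.60, 1.20, 1.80],              # after long utterances
}

def sample_pause(prev_utt_duration: float) -> float:
    bucket = bisect_right(BUCKET_EDGES, prev_utt_duration)
    return random.choice(PAUSE_SAMPLES[bucket])

print(sample_pause(4.2))  # draws a pause typical after a ~4 s utterance
```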
https://arxiv.org/abs/2602.04776
Automatic speech recognition (ASR) models are normally trained to operate over single utterances, each with a short duration of less than 30 seconds. This choice has been made in part due to computational constraints, but it also reflects a common, and often inaccurate, modelling assumption that treats utterances as independent and identically distributed samples. To use such systems with long-format audio recordings, the recordings must first be segmented into short utterances that are processed independently. In this work, we show that due to recent algorithmic and hardware advances, this is no longer necessary, and current attention-based approaches can be used to train ASR systems that operate on sequences of over an hour in length. To gain a better understanding of the relationship between the training/evaluation sequence length and performance, we train ASR models on large-scale data using 10 different sequence lengths from 10 seconds up to 1 hour. The results show a benefit from using up to 21.8 minutes of context, with up to a 14.2% relative improvement over a short-context baseline in our primary experiments. Through modifying various architectural components, we find that the method of encoding positional information and the model's width/depth are important factors when working with long sequences. Finally, a series of evaluations using synthetic data are constructed to help analyse the model's use of context. From these results, it is clear that both linguistic and acoustic aspects of the distant context are being used by the model.
https://arxiv.org/abs/2602.09044
Pre-trained models for automatic speech recognition (ASR) and speech enhancement (SE) have exhibited remarkable capabilities under matched noise and channel conditions. However, these models often suffer from severe performance degradation when confronted with domain shifts, particularly in the presence of unseen noise and channel distortions. In view of this, we present URSA-GAN, a unified and domain-aware generative framework specifically designed to mitigate mismatches in both noise and channel conditions. URSA-GAN leverages a dual-embedding architecture that consists of a noise encoder and a channel encoder, each pre-trained with limited in-domain data to capture domain-relevant representations. These embeddings condition a GAN-based speech generator, facilitating the synthesis of speech that is acoustically aligned with the target domain while preserving phonetic content. To further enhance generalization, we propose dynamic stochastic perturbation, a novel regularization technique that introduces controlled variability into the embeddings during generation, promoting robustness to unseen domains. Empirical results demonstrate that URSA-GAN effectively reduces character error rates in ASR and improves perceptual metrics in SE across diverse noisy and mismatched channel scenarios. Notably, evaluations on compound test conditions with both channel and noise degradations confirm the generalization ability of URSA-GAN, yielding relative improvements of 16.16% in ASR performance and 15.58% in SE metrics.
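The dynamic stochastic perturbation idea can be illustrated with a small sketch: Gaussian noise with a controlled, training-dependent scale is added to the noise/channel embeddings that condition the generator. The linear decay schedule and noise scale below are assumptions, not the paper's settings.

```python
# Illustrative perturbation of domain embeddings during generation: zero-mean
# Gaussian noise whose standard deviation decays linearly over training steps.
import torch

def perturb_embedding(emb: torch.Tensor, step: int, max_steps: int,
                      base_scale: float = 0.1) -> torch.Tensor:
    """Add controlled Gaussian noise to a conditioning embedding."""
    scale = base_scale * (1.0 - step / max_steps)
    return emb + scale * torch.randn_like(emb)

noise_emb = torch.randn(4, 128)                 # e.g. output of the noise encoder
print(perturb_embedding(noise_emb, step=100, max_steps=1000).shape)
```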
https://arxiv.org/abs/2602.04307
Self-supervised learning (SSL) has advanced speech processing but suffers from quadratic complexity due to self-attention. To address this, SummaryMixing (SM) has been proposed as a linear-time alternative that summarizes entire utterances using mean pooling but lacks sufficient local context. In this work, we introduce Windowed SummaryMixing (WSM), which enhances SM by integrating local neighborhood summaries alongside the global summary, maintaining efficiency while improving temporal dependencies. Additionally, we introduce a selective fine-tuning approach, replacing self-attention layers in SSL models with WSM blocks and fine-tuning only these blocks in low-resource settings. Our approach improves ASR performance while reducing peak VRAM usage by 40\% in the SSL models. WSM blocks have linear-time complexity with enhanced context awareness. Selectively replacing some attention layers reduces compute, memory, and latency, making it ideal for low-resource speech recognition.
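A hedged sketch of a WSM-style block is given below: each frame is combined with a global mean summary and a windowed local mean, keeping linear-time complexity in the sequence length. Layer sizes, the window length, and the combiner are illustrative choices, not the published architecture.

```python
# Illustrative Windowed SummaryMixing (WSM) style block: per-frame transform plus
# a global mean summary plus a local windowed mean, all linear in sequence length.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowedSummaryMixing(nn.Module):
    def __init__(self, dim: int, window: int = 15):
        super().__init__()
        self.local = nn.Linear(dim, dim)      # per-frame transform
        self.summary = nn.Linear(dim, dim)    # transform feeding the summaries
        self.window = window
        self.combine = nn.Linear(3 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, T, D)
        loc = self.local(x)
        s = self.summary(x)
        global_sum = s.mean(dim=1, keepdim=True).expand_as(s)
        # windowed mean via average pooling over time (still linear in T)
        win = F.avg_pool1d(s.transpose(1, 2), kernel_size=self.window,
                           stride=1, padding=self.window // 2,
                           count_include_pad=False).transpose(1, 2)
        return self.combine(torch.cat([loc, global_sum, win], dim=-1))

x = torch.randn(2, 100, 256)
print(WindowedSummaryMixing(256)(x).shape)   # torch.Size([2, 100, 256])
```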
https://arxiv.org/abs/2602.09043
Discretized representations of speech signals are efficient alternatives to continuous features for various speech applications, including automatic speech recognition (ASR) and speech language models. However, these representations, such as semantic or phonetic tokens derived from clustering outputs of self-supervised learning (SSL) speech models, are susceptible to environmental noise, which can degrade backend task performance. In this work, we introduce a frontend system that estimates clean speech tokens from noisy speech and evaluate it on an ASR backend using semantic tokens. We consider four types of enhancement models based on their input/output domains: wave-to-wave, token-to-token, continuous SSL features-to-token, and wave-to-token. These models are trained independently of ASR backends. Experiments on the CHiME-4 dataset demonstrate that wave-to-token enhancement achieves the best performance among the frontends. Moreover, it mostly outperforms the ASR system based on continuous SSL features.
https://arxiv.org/abs/2602.04217
As Generative AI (GenAI), particularly inference, rapidly emerges as a dominant workload category, the Kubernetes ecosystem is proactively evolving to natively support its unique demands. This industry paper demonstrates how emerging Kubernetes-native projects can be combined to deliver the benefits of container orchestration, such as scalability and resource efficiency, to complex AI workflows. We implement and evaluate an illustrative, multi-stage use case consisting of automatic speech recognition and summarization. First, we address batch inference by using Kueue to manage jobs that transcribe audio files with Whisper models and Dynamic Accelerator Slicer (DAS) to increase parallel job execution. Second, we address a discrete online inference scenario by feeding the transcripts to a Large Language Model for summarization hosted using llm-d, a novel solution utilizing the recent developments around the Kubernetes Gateway API Inference Extension (GAIE) for optimized routing of inference requests. Our findings illustrate that these complementary components (Kueue, DAS, and GAIE) form a cohesive, high-performance platform, proving Kubernetes' capability to serve as a unified foundation for demanding GenAI workloads: Kueue reduced total makespan by up to 15%; DAS shortened mean job completion time by 36%; and GAIE improved Time to First Token by 82%.
https://arxiv.org/abs/2602.04900
This paper documents our efforts in releasing the printed and audio book of the translation of the famous novel The Little Prince into the Chakavian dialect as a computer-readable, AI-ready dataset, with the textual and audio components of the two releases now aligned at the level of each written and spoken word. Our motivation for this release is threefold. The first is our wish to preserve this highly valuable and specific content beyond the small editions of the printed and audio book. With the dataset published in the this http URL repository, this content is from now on at the fingertips of any interested individual. The second motivation is to make the data available for various artificial-intelligence usage scenarios, such as the one we already follow up on in this paper -- adapting the Whisper-large-v3 open automatic speech recognition model, which performs decently on standard Croatian, to Chakavian dialectal speech. We can happily report that by adapting the model, the word error rate on the selected test data has been reduced by half, while up to two thirds of the character-level error has been removed. We envision many more uses of this dataset beyond the experiments we have already performed, both in artificial intelligence research and applications and in dialectal research. The third motivation for this release is our hope that this now highly structured dataset will be transformed into a digital online edition of the work, allowing individuals beyond the research and technology communities to enjoy the beauty of the message of the little boy in the desert, told through the spectacular prism of the Chakavian dialect.
https://arxiv.org/abs/2602.03245
The advancement of speech technology has predominantly favored high-resource languages, creating a significant digital divide for speakers of most Sub-Saharan African languages. To address this gap, we introduce WAXAL, a large-scale, openly accessible speech dataset for 21 languages representing over 100 million speakers. The collection consists of two main components: an Automated Speech Recognition (ASR) dataset containing approximately 1,250 hours of transcribed, natural speech from a diverse range of speakers, and a Text-to-Speech (TTS) dataset with over 180 hours of high-quality, single-speaker recordings reading phonetically balanced scripts. This paper details our methodology for data collection, annotation, and quality control, which involved partnerships with four African academic and community organizations. We provide a detailed statistical overview of the dataset and discuss its potential limitations and ethical considerations. The WAXAL datasets are released at this https URL under the permissive CC-BY-4.0 license to catalyze research, enable the development of inclusive technologies, and serve as a vital resource for the digital preservation of these languages.
https://arxiv.org/abs/2602.02734
Accented speech remains a persistent challenge for automatic speech recognition (ASR), as most models are trained on data dominated by a few high-resource English varieties, leading to substantial performance degradation for other accents. Accent-agnostic approaches improve robustness yet struggle with heavily accented or unseen varieties, while accent-specific methods rely on limited and often noisy labels. We introduce Moe-Ctc, a Mixture-of-Experts architecture with intermediate CTC supervision that jointly promotes expert specialization and generalization. During training, accent-aware routing encourages experts to capture accent-specific patterns, which gradually transitions to label-free routing for inference. Each expert is equipped with its own CTC head to align routing with transcription quality, and a routing-augmented loss further stabilizes optimization. Experiments on the Mcv-Accent benchmark demonstrate consistent gains across both seen and unseen accents in low- and high-resource conditions, achieving up to 29.3% relative WER reduction over strong FastConformer baselines.
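As a hedged sketch of accent-aware routing, the block below mixes expert outputs with router weights and adds a routing loss only when accent labels are available (training), falling back to label-free routing at inference. The per-expert CTC heads and intermediate-CTC supervision are omitted for brevity, and all module shapes are illustrative assumptions rather than the paper's architecture.

```python
# Illustrative accent-aware mixture-of-experts layer with a label-supervised router.
from typing import Optional
import torch
import torch.nn as nn
import torch.nn.functional as F

class AccentMoE(nn.Module):
    def __init__(self, dim: int = 256, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))

    def forward(self, x: torch.Tensor, accent_labels: Optional[torch.Tensor] = None):
        # x: (B, T, D); route on an utterance-level mean of the encoder features
        logits = self.router(x.mean(dim=1))                               # (B, E)
        weights = F.softmax(logits, dim=-1)
        outs = torch.stack([expert(x) for expert in self.experts], dim=1) # (B, E, T, D)
        mixed = (weights[:, :, None, None] * outs).sum(dim=1)             # (B, T, D)
        # accent-aware routing supervision, applied only when labels exist (training);
        # at inference the softmax routing alone is used (label-free)
        route_loss = F.cross_entropy(logits, accent_labels) if accent_labels is not None else None
        return mixed, route_loss

x = torch.randn(2, 50, 256)
mixed, loss = AccentMoE()(x, accent_labels=torch.tensor([0, 2]))
print(mixed.shape, loss.item())
```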
https://arxiv.org/abs/2602.01967
Multilingual automatic speech recognition (ASR) requires tokenization that efficiently covers many writing systems. Byte-level BPE (BBPE) using UTF-8 is widely adopted for its language-agnostic design and full Unicode coverage, but its variable-length encoding inflates token sequences for non-Latin scripts, such as Chinese, Japanese, and Korean (CJK). Longer sequences increase computational load and memory use. We propose BBPE16, a UTF-16-based BBPE tokenizer that represents most modern scripts with a uniform 2-byte code unit. BBPE16 preserves BBPE's language-agnostic properties while substantially improving cross-lingual token sharing. Across monolingual, bilingual, and trilingual ASR, and in a multilingual continual-learning setup, BBPE16 attains comparable or better accuracy; for Chinese, it reduces token counts by up to 10.4% and lowers decoding iterations by up to 10.3%. These reductions speed up fine-tuning and inference and decrease memory usage, making BBPE16 a practical tokenization choice for multilingual ASR.
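The motivation is easy to verify directly: most CJK characters occupy three bytes in UTF-8 but a single two-byte code unit in UTF-16, so byte-level sequences shrink accordingly. A small illustration with arbitrary example strings:

```python
# Compare UTF-8 byte counts (one BBPE symbol per byte) with UTF-16 code-unit
# counts (one BBPE16-style symbol per 2-byte unit) for a few scripts.
samples = {"English": "speech recognition", "Chinese": "语音识别", "Korean": "음성 인식"}
for lang, text in samples.items():
    utf8_units = len(text.encode("utf-8"))                 # 1 symbol per byte
    utf16_units = len(text.encode("utf-16-le")) // 2       # 1 symbol per 2-byte code unit
    print(f"{lang:8s} chars={len(text):2d} utf8_bytes={utf8_units:2d} utf16_units={utf16_units:2d}")
# e.g. the Chinese string has 4 characters -> 12 UTF-8 bytes but 4 UTF-16 code units.
```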
https://arxiv.org/abs/2602.01717
This work presents EmoAra, an end-to-end emotion-preserving pipeline for cross-lingual spoken communication, motivated by banking customer service where emotional context affects service quality. EmoAra integrates Speech Emotion Recognition, Automatic Speech Recognition, Machine Translation, and Text-to-Speech to process English speech and deliver an Arabic spoken output while retaining emotional nuance. The system uses a CNN-based emotion classifier, Whisper for English transcription, a fine-tuned MarianMT model for English-to-Arabic translation, and MMS-TTS-Ara for Arabic speech synthesis. Experiments report an F1-score of 94% for emotion classification, translation performance of BLEU 56 and BERTScore F1 88.7%, and an average human evaluation score of 81% on banking-domain translations. The implementation and resources are available at the accompanying GitHub repository.
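A hedged sketch of the cascade with publicly available stand-ins is shown below. The paper's own CNN emotion classifier and fine-tuned MarianMT checkpoint are not reproduced; the Hugging Face model IDs (openai/whisper-small, Helsinki-NLP/opus-mt-en-ar, facebook/mms-tts-ara) and the input file path are assumptions, and the emotion signal is only indicated as a comment.

```python
# Illustrative English-speech -> Arabic-speech cascade using off-the-shelf models.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
mt = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ar")
tts = pipeline("text-to-speech", model="facebook/mms-tts-ara")

def english_speech_to_arabic_speech(wav_path: str) -> dict:
    english_text = asr(wav_path)["text"]
    arabic_text = mt(english_text)[0]["translation_text"]
    # An emotion tag predicted from the input audio would be attached here to
    # steer synthesis; the generic MMS-TTS-Ara model used in this sketch does
    # not itself take an emotion input.
    return tts(arabic_text)          # {"audio": ndarray, "sampling_rate": int}

out = english_speech_to_arabic_speech("customer_call.wav")  # illustrative path
print(out["sampling_rate"])
```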
https://arxiv.org/abs/2602.01170
Recent speech foundation models excel at multilingual automatic speech recognition (ASR) for high-resource languages, but adapting them to low-resource languages remains challenging due to data scarcity and efficiency constraints. Full-model fine-tuning is computationally expensive and prone to overfitting, while parameter-efficient methods like LoRA apply adaptation uniformly across layers, overlooking how internal representations differ with depth and thus compromising both effectiveness and efficiency. We analyze multilingual ASR models and reveal a U-shaped adaptability pattern: early and late layers are language-specific and require more adaptation, while intermediate layers retain shared semantics and need less. Building on this observation, we propose DAMA, a Depth-Aware Model Adaptation framework that allocates adaptation capacity according to each layer's role. DAMA also introduces Singular Value Decomposition (SVD)-based initialization to constrain adaptation and preserve the U-shaped pattern, as well as a frozen middle-layer basis for further efficiency. Evaluated on 18 low-resource languages across two benchmark datasets, DAMA matches or surpasses state-of-the-art accuracy with 80% fewer trainable parameters, achieves a 29% error reduction under extreme data scarcity, and significantly improves memory, training time, and computational efficiency over baselines. These results highlight the benefits of structure-aware adaptation for efficient, scalable multilingual ASR.
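The U-shaped allocation can be illustrated with a small sketch that assigns larger adapter ranks to early and late layers and smaller ranks to intermediate ones; the rank bounds and the linear shape are assumptions, and the SVD-based initialization and frozen middle-layer basis are not shown.

```python
# Illustrative depth-aware rank allocation: adapter capacity follows a U shape
# over layer depth, peaking at the first and last layers.
def u_shaped_ranks(n_layers: int, r_min: int = 2, r_max: int = 16) -> list[int]:
    ranks = []
    for i in range(n_layers):
        # distance from the middle of the stack, normalized to [0, 1]
        d = abs(i - (n_layers - 1) / 2) / ((n_layers - 1) / 2)
        ranks.append(round(r_min + d * (r_max - r_min)))
    return ranks

print(u_shaped_ranks(12))  # [16, 13, 11, 8, 6, 3, 3, 6, 8, 11, 13, 16]
```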
https://arxiv.org/abs/2602.01008
Spoken question-answering (SQA) systems relying on automatic speech recognition (ASR) often struggle with accurately recognizing medical terminology. To this end, we propose MedSpeak, a novel knowledge graph-aided ASR error correction framework that refines noisy transcripts and improves downstream answer prediction by leveraging both semantic relationships and phonetic information encoded in a medical knowledge graph, together with the reasoning power of LLMs. Comprehensive experimental results on benchmarks demonstrate that MedSpeak significantly improves the accuracy of medical term recognition and overall medical SQA performance, establishing MedSpeak as a state-of-the-art solution for medical SQA. The code is available at this https URL.
https://arxiv.org/abs/2602.00981