Edge devices operate in constrained and varying resource settings, requiring dynamic architectures that can adapt to the limitations of available resources. To meet such demands, the layer dropping ($\mathcal{LD}$) approach is typically used to transform static models into dynamic ones by skipping parts of the network, thereby reducing overall computational complexity. However, existing $\mathcal{LD}$ methods greatly degrade the dynamic model's performance in both low and high dropping cases, deteriorating the performance-computation trade-off. To this end, we propose a distillation-based layer dropping (DLD) framework that effectively combines the capabilities of knowledge distillation and $\mathcal{LD}$ in an end-to-end fashion, thereby achieving state-of-the-art performance for dynamic speech networks. Comprehensive experimentation with well-known speech recognition models, including the Conformer and WavLM, on three public benchmarks demonstrates the effectiveness of our framework, reducing the word error rate by $9.32\%$ and $2.25\%$ for the high and no dropping cases, with a $33.3\%$ reduction in training time.
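As a rough illustration of the two ingredients this abstract combines, here is a minimal sketch of stochastic layer dropping plus a distillation term. This is not the paper's implementation: the toy scalar "layers", the squared-error losses, and the `alpha` blend weight are all assumptions for exposition.

```python
import random

def dynamic_forward(layers, x, keep_prob, rng=random):
    """Layer dropping: apply each layer, skipping it with probability 1 - keep_prob."""
    for layer in layers:
        if rng.random() < keep_prob:
            x = layer(x)
    return x

def dld_loss(student_out, teacher_out, target, alpha=0.5):
    """Blend a task loss against the target with a distillation loss against the
    full-depth teacher output (simple squared errors stand in for the real losses)."""
    task = (student_out - target) ** 2
    distill = (student_out - teacher_out) ** 2
    return (1 - alpha) * task + alpha * distill
```

With `keep_prob=1.0` the forward pass runs the full stack (the "no dropping" teacher path); lower values yield the cheaper dynamic student whose output is pulled toward the teacher's by the distillation term.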
https://arxiv.org/abs/2601.16117
We investigate the accessibility of intelligent personal assistants (IPAs) for deaf and hard of hearing (DHH) people who can use their voice in everyday communication. The inability of IPAs to understand diverse accents, including deaf speech, renders them largely inaccessible to non-signing, speaking DHH individuals. Using an Echo Show, we conduct a mixed-methods study comparing the usability of natural language input via spoken English, both with Alexa's automatic speech recognition and in a Wizard-of-Oz setting with a trained facilitator re-speaking commands, against that of a large language model (LLM)-assisted touch interface. The touch method was navigated through an LLM-powered "task prompter," which integrated the user's history and smart environment to suggest contextually appropriate commands. Quantitative results showed no significant differences between the spoken English conditions and LLM-assisted touch. Qualitative results showed variability in opinions on the usability of each method. Ultimately, it will be necessary for IPAs to natively and robustly recognize deaf-accented speech.
https://arxiv.org/abs/2601.15209
Code understanding is a foundational capability in software engineering tools and developer workflows. However, most existing systems are designed for English-speaking users interacting via keyboards, which limits accessibility in multilingual and voice-first settings, particularly in regions like India. Voice-based interfaces offer a more inclusive modality, but spoken queries involving code present unique challenges due to non-standard English usage, domain-specific vocabulary, and custom identifiers such as variable and function names, often combined with code-mixed expressions. In this work, we develop a multilingual speech-driven framework for code understanding that accepts spoken queries in a user's native language, transcribes them using Automatic Speech Recognition (ASR), applies code-aware ASR output refinement using Large Language Models (LLMs), and interfaces with code models to perform tasks such as code question answering and code retrieval, evaluated on benchmarks such as CodeSearchNet, CoRNStack, and CodeQA. Focusing on four widely spoken Indic languages and English, we systematically characterize how transcription errors impact downstream task performance. We also identify key failure modes in ASR for code and demonstrate that LLM-guided refinement significantly improves performance across both the transcription and code understanding stages. Our findings underscore the need for code-sensitive adaptations in speech interfaces and offer a practical solution for building robust, multilingual voice-driven programming tools.
https://arxiv.org/abs/2601.15339
Large encoder-decoder models like Whisper achieve strong offline transcription but remain impractical for streaming applications due to high latency. However, owing to the accessibility of pre-trained checkpoints, the open Thai ASR landscape remains dominated by these offline architectures, leaving a critical gap in efficient streaming solutions. We present Typhoon ASR Real-time, a 115M-parameter FastConformer-Transducer model for low-latency Thai speech recognition. We demonstrate that rigorous text normalization can match the impact of model scaling: our compact model achieves a 45x reduction in computational cost compared to Whisper Large-v3 while delivering comparable accuracy. Our normalization pipeline resolves systemic ambiguities in Thai transcription, including context-dependent number verbalization and repetition markers (mai yamok), creating consistent training targets. We further introduce a two-stage curriculum learning approach for Isan (north-eastern) dialect adaptation that preserves Central Thai performance. To address reproducibility challenges in Thai ASR, we release the Typhoon ASR Benchmark, a gold-standard, human-labeled dataset with transcriptions following established Thai linguistic conventions, providing standardized evaluation protocols for the research community.
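To make the mai yamok ambiguity concrete: the Thai character "ๆ" instructs the reader to repeat the preceding word, so the same speech can be transcribed either with the marker or with the word spelled out twice. A toy normalizer (an illustration only, not the paper's pipeline, and it assumes the marker follows a contiguous word) can expand the marker to a single consistent target form:

```python
import re

def expand_mai_yamok(text):
    """Expand the Thai repetition marker "ๆ" (mai yamok) by duplicating the
    preceding word, e.g. "มากๆ" or "มาก ๆ" -> "มาก มาก", so training targets
    always use the spelled-out form. Simplified: real Thai text is unsegmented,
    so a production normalizer needs word segmentation first."""
    return re.sub(r'(\S+?)\s*ๆ', r'\1 \1', text)
```

Whichever convention is chosen, applying it uniformly to all transcripts removes one systematic source of reference/hypothesis mismatch in WER computation.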
https://arxiv.org/abs/2601.13044
Parameter-efficient fine-tuning (PEFT) is a scalable approach for adapting large speech foundation models to new domains. While methods such as LoRA and its state-of-the-art variants reduce adaptation costs, they typically allocate parameters uniformly across model subspaces, which limits their efficiency and scalability in speech applications. Building on our prior work, this paper introduces SSVD-Outer (SSVD-O), an extension of the structured SVD-guided (SSVD) fine-tuning method. SSVD-O combines input acoustic feature space-associated inner transformations with output semantic feature space-associated outer transformations to enable scalable and balanced adaptation. We conduct the first systematic analysis of parameter budget allocation across model subspaces in PEFT for automatic speech recognition (ASR), and investigate the trade-off between learning and forgetting under constrained resources. SSVD-O is benchmarked against LoRA, DoRA, PiSSA, and SSVD on domain-shifted ASR tasks, including child speech and regional accents, across model scales from 0.1B to 2B within the ESPnet framework. Experimental results show that SSVD-O consistently narrows the performance gap to full fine-tuning while improving generalization and mitigating catastrophic forgetting.
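For readers unfamiliar with the baseline this work extends, a minimal pure-Python sketch of the generic LoRA update, the starting point whose uniform budget allocation SSVD-style methods rearrange: the effective weight is $W + \frac{\alpha}{r} BA$ with low-rank factors $B$ and $A$. The function names and toy shapes are illustrative, not from the paper.

```python
def matmul(X, Y):
    """Naive matrix product of lists-of-rows X and Y."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_update(W, A, B, alpha, r):
    """Effective weight W + (alpha / r) * B @ A: only the small factors
    B (d x r) and A (r x d) are trained, while W stays frozen."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]
```

In this framing, the budget-allocation question the paper studies is how many rank-`r` units (and hence trainable parameters) each weight matrix or subspace receives, rather than giving every matrix the same rank.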
https://arxiv.org/abs/2601.12600
Audio-visual speech recognition (AVSR) typically improves recognition accuracy in noisy environments by integrating noise-immune visual cues with audio signals. Nevertheless, high-noise audio inputs are prone to introducing adverse interference into the feature fusion process. To mitigate this, recent AVSR methods often adopt mask-based strategies to filter audio noise during feature interaction and fusion, yet such methods risk discarding semantically relevant information alongside noise. In this work, we propose an end-to-end noise-robust AVSR framework coupled with speech enhancement, eliminating the need for explicit noise mask generation. This framework leverages a Conformer-based bottleneck fusion module to implicitly refine noisy audio features with video assistance. By reducing modality redundancy and enhancing inter-modal interactions, our method preserves speech semantic integrity to achieve robust recognition performance. Experimental evaluations on the public LRS3 benchmark suggest that our method outperforms prior advanced mask-based baselines under noisy conditions.
https://arxiv.org/abs/2601.12436
This paper proposes a Dialect Identification (DID) approach inspired by the Connectionist Temporal Classification (CTC) loss function as used in Automatic Speech Recognition (ASR). CTC-DID frames the dialect identification task as a limited-vocabulary ASR system, where dialect tags are treated as a sequence of labels for a given utterance. For training, the repetition of dialect tags in transcriptions is estimated either using a proposed Language-Agnostic Heuristic (LAH) approach or a pre-trained ASR model. The method is evaluated on the low-resource Arabic Dialect Identification (ADI) task, with experimental results demonstrating that an SSL-based CTC-DID model, trained on a limited dataset, outperforms both fine-tuned Whisper and ECAPA-TDNN models. Notably, CTC-DID also surpasses these models in zero-shot evaluation on the Casablanca dataset. The proposed approach is found to be more robust to shorter utterances and is shown to be easily adaptable for streaming, real-time applications, with minimal performance degradation.
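The key reformulation here is that a whole utterance gets a CTC target consisting of one dialect tag repeated some number of times. The paper estimates the repetition count with its Language-Agnostic Heuristic or a pre-trained ASR model; the sketch below substitutes an assumed duration-proportional heuristic purely to show the shape of the targets.

```python
def did_targets(dialect_tag, duration_sec, tags_per_sec=1.0):
    """Build a CTC label sequence for dialect ID: the utterance's dialect tag
    repeated roughly once per 1/tags_per_sec seconds of speech (illustrative
    heuristic; the paper's LAH or an ASR model supplies the real count)."""
    n = max(1, round(duration_sec * tags_per_sec))
    return [dialect_tag] * n
```

Framed this way, a standard CTC-trained "ASR" model with a vocabulary of dialect tags emits a tag stream, which also explains why the approach adapts naturally to streaming: partial hypotheses are available as audio arrives.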
https://arxiv.org/abs/2601.12199
Speech processing for low-resource dialects remains a fundamental challenge in developing inclusive and robust speech technologies. Despite its linguistic significance and large speaker population, the Wu dialect of Chinese has long been hindered by the lack of large-scale speech data, standardized evaluation benchmarks, and publicly available models. In this work, we present WenetSpeech-Wu, the first large-scale, multi-dimensionally annotated open-source speech corpus for the Wu dialect, comprising approximately 8,000 hours of diverse speech data. Building upon this dataset, we introduce WenetSpeech-Wu-Bench, the first standardized and publicly accessible benchmark for systematic evaluation of Wu dialect speech processing, covering automatic speech recognition (ASR), Wu-to-Mandarin translation, speaker attribute prediction, speech emotion recognition, text-to-speech (TTS) synthesis, and instruction-following TTS (instruct TTS). Furthermore, we release a suite of strong open-source models trained on WenetSpeech-Wu, establishing competitive performance across multiple tasks and empirically validating the effectiveness of the proposed dataset. Together, these contributions lay the foundation for a comprehensive Wu dialect speech processing ecosystem, and we open-source the proposed datasets, benchmarks, and models to support future research on dialectal speech intelligence.
https://arxiv.org/abs/2601.11027
Objective: Surface electromyography (EMG) is a non-invasive sensing modality widely used in biomechanics, rehabilitation, prosthetic control, and human-machine interfaces. Despite decades of use, achieving robust generalization across subjects, recording systems, and acquisition protocols remains challenging. While foundation models (FMs) are gaining traction for EMG, existing approaches remain limited to single downstream tasks and lack deployability on embedded platforms. This work addresses these limitations. Methods: We present TinyMyo, a lightweight FM based on a Transformer encoder architecture. The model is pre-trained in a self-supervised manner using masked reconstruction on publicly available datasets. With only 3.6M parameters, TinyMyo is designed to support multiple downstream tasks through minimal task-specific head adaptations. Results: We demonstrate generalization across hand gesture classification, hand kinematic regression, speech production and speech recognition, with performance comparable to or surpassing the state of the art (SoA), and model size below 5M parameters. We achieve SoA results compared to previous FM-based works on the NinaPro DB5 (89.4%), UCI-EMG (97.56%), and EPN-612 (96.74%) datasets. We demonstrate the first deployment of an EMG FM on an ultra-low power microcontroller (GAP9), with an inference time of 0.785 s, energy of 44.91 mJ, and a power envelope of 57.18 mW. Conclusion: TinyMyo demonstrates that a compact, self-supervised EMG FM can deliver strong generalization across multiple downstream tasks while remaining compatible with low-power edge devices. Significance: TinyMyo is the first EMG FM for ultra-low power edge devices, enabling scalable and energy-efficient sensing for motor intent decoding, neuromuscular assessment, and biosignal-driven human-machine interaction.
https://arxiv.org/abs/2512.15729
Traditional speech systems typically rely on separate, task-specific models for text-to-speech (TTS), automatic speech recognition (ASR), and voice conversion (VC), resulting in fragmented pipelines that limit scalability, efficiency, and cross-task generalization. In this paper, we present General-Purpose Audio (GPA), a unified audio foundation model that integrates multiple core speech tasks within a single large language model (LLM) architecture. GPA operates on a shared discrete audio token space and supports instruction-driven task induction, enabling a single autoregressive model to flexibly perform TTS, ASR, and VC without architectural modifications. This unified design combines a fully autoregressive formulation over discrete speech tokens, joint multi-task training across speech domains, and a scalable inference pipeline that achieves high concurrency and throughput. The resulting model family supports efficient multi-scale deployment, including a lightweight 0.3B-parameter variant optimized for edge and resource-constrained environments. Together, these design choices demonstrate that a unified autoregressive architecture can achieve competitive performance across diverse speech tasks while remaining viable for low-latency, practical deployment.
https://arxiv.org/abs/2601.10770
In recent years, foundation models have become very popular due to their exceptional performance, mainly in natural language processing (NLP) tasks, where they were first introduced. These models usually consist of hundreds of millions, or even billions, of parameters, making them resource-intensive during training and in production systems, leading to increased costs. This paper focuses on reducing a foundation model's size when applied to music information retrieval (MIR) tasks. Our research combines the Branchformer architecture with SummaryMixing, both first applied in speech recognition, along with a random quantization process. To facilitate reproducibility, we conduct pre-training on publicly available datasets, complemented by a proprietary dataset comparable in scale to other private datasets reported in the literature. We ensure robust evaluation by using a framework consisting of a variety of downstream MIR tasks. Our results show that our architecture achieves competitive performance when compared with other state-of-the-art models that use multi-head self-attention, while reducing the model size by 8.5% to 12.3%.
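The efficiency gain over multi-head self-attention comes from SummaryMixing's core idea: each time step interacts with a single shared summary of the sequence instead of with every other time step, so cost grows linearly rather than quadratically in sequence length. The caricature below conveys only that idea; the real method uses learned local and summary branches, which are omitted here.

```python
def summary_mixing(tokens):
    """Linear-time mixing: compute one mean summary vector over time and add it
    to every token. A stand-in for SummaryMixing's learned local/summary
    branches; tokens is a list of equal-length feature vectors."""
    dim = len(tokens[0])
    summary = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[t[d] + summary[d] for d in range(dim)] for t in tokens]
```

Because the summary is computed once per layer, doubling the sequence length doubles the work, whereas pairwise self-attention would quadruple it.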
https://arxiv.org/abs/2601.09603
We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself versus when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance, as the model can be easily misled by noisy hypotheses. To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable reflection primitive proves effective in preventing the model from being derailed by flawed external candidates. We show that this agentic action mechanism generalizes naturally from speech recognition to complex, multiple-choice audio reasoning. Across the OpenASR leaderboard, Speech-Hands consistently outperforms strong baselines by 12.1% WER on seven benchmarks. The model also achieves 77.37% accuracy and high F1 on audio QA decisions, showing robust generalization and reliability across diverse audio question answering datasets. By unifying perception and decision-making, our work offers a practical path toward more reliable and resilient audio intelligence.
https://arxiv.org/abs/2601.09413
The recent surge in open-source Multimodal Large Language Model (MLLM) frameworks, such as LLaVA, provides a convenient starting point for artificial intelligence developers and researchers. However, most MLLM frameworks take vision as the main input modality and provide limited in-depth support for the speech, audio, and music modalities. This situation hinders the development of audio-language models and forces researchers to spend substantial effort on code writing and hyperparameter tuning. We present SLAM-LLM, an open-source deep learning framework designed to train customized MLLMs, focused on speech, language, audio, and music processing. SLAM-LLM provides a modular configuration of different encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins. SLAM-LLM also includes detailed training and inference recipes for mainstream tasks, along with high-performance checkpoints for tasks like LLM-based Automatic Speech Recognition (ASR), Automated Audio Captioning (AAC), and Music Captioning (MC). Some of these recipes have already reached or are nearing state-of-the-art performance, and several of the underlying techniques have been published in academic papers. We hope SLAM-LLM will accelerate iteration, development, data engineering, and model training for researchers. We are committed to continually pushing forward audio-based MLLMs through this open-source framework, and call on the community to contribute to LLM-based speech, audio, and music processing.
https://arxiv.org/abs/2601.09385
With the rapid advancement of Multimodal Large Language Models (MLLMs), their potential has garnered significant attention in Chinese Classical Studies (CCS). While existing research has primarily focused on the text and visual modalities, the audio corpus within this domain remains largely underexplored. To bridge this gap, we propose the Multi-task Classical Chinese Literary Genre Audio Corpus (MCGA). It encompasses a diverse range of literary genres across six tasks: Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech Emotion Captioning (SEC), Spoken Question Answering (SQA), Speech Understanding (SU), and Speech Reasoning (SR). Through an evaluation of ten MLLMs, our experimental results demonstrate that current models still face substantial challenges when evaluated on the MCGA test set. Furthermore, we introduce an evaluation metric for SEC and a metric that measures the consistency between the speech and text capabilities of MLLMs. We release MCGA and our code to the public to facilitate the development of MLLMs with more robust, multidimensional audio capabilities in CCS. MCGA Corpus: this https URL
https://arxiv.org/abs/2601.09270
CAPTCHAs are widely used by websites to block bots and spam by presenting challenges that are easy for humans but difficult for automated programs to solve. To improve accessibility, audio CAPTCHAs are designed to complement visual ones. However, the robustness of audio CAPTCHAs against advanced Large Audio Language Models (LALMs) and Automatic Speech Recognition (ASR) models remains unclear. In this paper, we introduce AI-CAPTCHA, a unified framework that offers (i) an evaluation framework, ACEval, which includes advanced LALM- and ASR-based solvers, and (ii) a novel audio CAPTCHA approach, IllusionAudio, leveraging audio illusions. Through extensive evaluations of seven widely deployed audio CAPTCHAs, we show that most existing methods can be solved with high success rates by advanced LALMs and ASR models, exposing critical security weaknesses. To address these vulnerabilities, IllusionAudio exploits perceptual illusion cues rooted in human auditory mechanisms. Extensive experiments demonstrate that our method defeats all tested LALM- and ASR-based attacks while achieving a 100% human pass rate, significantly outperforming existing audio CAPTCHA methods.
https://arxiv.org/abs/2601.08516
Global frameworks increasingly advocate for Responsible Artificial Intelligence (AI) in education, yet they provide limited guidance on how ethical, culturally responsive, and curriculum-aligned AI can be operationalized within functioning teacher education systems, particularly in the Global South. This study addresses this gap through the design and evaluation of GenAITEd Ghana, a context-aware, region-specific conversational AI prototype developed to support teacher education in Ghana. Guided by a Design Science Research approach, the system was developed as a school-mimetic digital infrastructure aligned with the organizational logic of Ghanaian Colleges of Education and the National Council for Curriculum and Assessment (NaCCA) framework. GenAITEd Ghana operates as a multi-agent, retrieval-augmented conversational AI that coordinates multiple models for curriculum-grounded dialogue, automatic speech recognition, voice synthesis, and multimedia interaction. Two complementary prompt pathways were embedded: system-level prompts that enforce curriculum boundaries, ethical constraints, and teacher-in-the-loop oversight, and interaction-level semi-automated prompts that structure live pedagogical dialogue through clarification, confirmation, and guided response generation. Evaluation findings show that the system effectively enacted key Responsible AI principles, including transparency, accountability, cultural responsiveness, privacy, and human oversight. Human expert evaluations further indicated that GenAITEd Ghana is pedagogically appropriate for Ghanaian teacher education, promoting student engagement while preserving educators' professional authority. Identified challenges highlight the need for continued model integration, professional development, and critical AI literacy to mitigate risks of over-reliance.
https://arxiv.org/abs/2601.06093
Despite having hundreds of millions of speakers, Chinese dialects lag behind Mandarin in speech and language technologies. Most varieties are primarily spoken, making dialect-to-Mandarin speech-LLMs (large language models) more practical than dialect LLMs. Building dialect-to-Mandarin speech-LLMs requires speech representations with cross-dialect semantic alignment between Chinese dialects and Mandarin. In this paper, we achieve such a cross-dialect semantic alignment by training a speech encoder with ASR (automatic speech recognition)-only data, as demonstrated by speech-to-speech retrieval on a new benchmark of spoken Chinese varieties that we contribute. Our speech encoder further demonstrates state-of-the-art ASR performance on Chinese dialects. Together, our Chinese dialect benchmark, semantically aligned speech representations, and speech-to-speech retrieval evaluation lay the groundwork for future Chinese dialect speech-LLMs. We release the benchmark at this https URL.
https://arxiv.org/abs/2601.07274
The development of resource-constrained approaches to automatic speech recognition (ASR) is of great interest due to its broad applicability to many low-resource languages for which there is scant usable data. Existing approaches to many low-resource natural language processing tasks leverage additional data from higher-resource languages that are closely related to a target low-resource language. One increasingly popular approach uses task arithmetic to combine models trained on different tasks to create a model for a task where there is little to no training data. In this paper, we consider training on a particular language to be a task, and we generate task vectors by fine-tuning variants of the Whisper ASR system. For pairings of high- and low-resource languages, we merge task vectors via a linear combination, optimizing the weights of the linear combination on the downstream word error rate on the low-resource target language's validation set. We find that this approach consistently improves performance on the target languages.
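The merging step described above, a linear combination of task vectors whose weights are tuned on the target language's validation WER, reduces to a few lines. The sketch below operates on flat parameter lists for clarity; names are illustrative, and in practice each vector spans every Whisper weight tensor.

```python
def task_vector(finetuned, base):
    """tau = theta_finetuned - theta_base, per parameter."""
    return [f - b for f, b in zip(finetuned, base)]

def merge_task_vectors(base, taus, weights):
    """theta_merged = theta_base + sum_i w_i * tau_i. The weights w_i are the
    linear-combination coefficients optimized against validation-set WER
    on the low-resource target language."""
    return [b + sum(w * tau[i] for w, tau in zip(weights, taus))
            for i, b in enumerate(base)]
```

A simple grid or random search over the (few) weights is enough here, since each candidate combination only requires re-evaluating WER, not re-training.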
https://arxiv.org/abs/2601.07038
In speech language modeling, two architectures dominate the frontier: the Transformer and the Conformer. However, it remains unknown whether their comparable performance stems from convergent processing strategies or from distinct architectural inductive biases. We introduce Architectural Fingerprinting, a probing framework that isolates the effect of architecture on representation, and apply it to a controlled suite of 24 pre-trained encoders (39M-3.3B parameters). Our analysis reveals divergent hierarchies: Conformers implement a "Categorize Early" strategy, resolving phoneme categories 29% earlier in relative depth and speaker gender by 16% of depth. In contrast, Transformers "Integrate Late," deferring phoneme, accent, and duration encoding to deep layers (49-57% of depth). These fingerprints suggest design heuristics: Conformers' front-loaded categorization may benefit low-latency streaming, while Transformers' deep integration may favor tasks requiring rich context and cross-utterance normalization.
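One way to operationalize "a property is resolved at X% of depth" is with layer-wise probes: train a small classifier on each layer's representations and report the earliest relative depth whose probe accuracy clears a threshold. The sketch below is an assumption about such a metric, not the paper's exact definition.

```python
def resolution_depth(layer_accuracies, threshold=0.9):
    """Relative depth in (0, 1] at which a probed property (e.g. phoneme
    category) is first 'resolved': the earliest layer whose probe accuracy
    reaches the threshold, divided by total depth. Returns 1.0 if the
    property is never resolved."""
    n = len(layer_accuracies)
    for i, acc in enumerate(layer_accuracies, start=1):
        if acc >= threshold:
            return i / n
    return 1.0
```

Comparing this depth across matched Conformer and Transformer checkpoints then yields the "earlier/later" fingerprints the abstract reports.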
https://arxiv.org/abs/2601.06972
Although many Automatic Speech Recognition (ASR) systems have been developed for Modern Standard Arabic (MSA) and Dialectal Arabic (DA), few studies have focused on dialect-specific implementations, particularly for low-resource Arabic dialects such as Sudanese. This paper presents a comprehensive study of data augmentation techniques for fine-tuning OpenAI Whisper models and establishes the first benchmark for the Sudanese dialect. Two augmentation strategies are investigated: (1) self-training with pseudo-labels generated from unlabeled speech, and (2) TTS-based augmentation using synthetic speech from the Klaam TTS system. The best-performing model, Whisper-Medium fine-tuned with combined self-training and TTS augmentation (28.4 hours), achieves a Word Error Rate (WER) of 57.1% on the evaluation set and 51.6% on an out-of-domain holdout set, substantially outperforming zero-shot multilingual Whisper (78.8% WER) and MSA-specialized Arabic models (73.8-123% WER). All experiments used low-cost resources (the Kaggle free tier and this http URL trial), demonstrating that strategic data augmentation can overcome resource limitations for low-resource dialects and providing a practical roadmap for developing ASR systems for low-resource Arabic dialects and other marginalized language varieties. The models, evaluation benchmarks, and reproducible training pipelines are publicly released to facilitate future research on low-resource Arabic ASR.
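Self-training with pseudo-labels typically keeps only the model's more trustworthy transcriptions of unlabeled audio. The filter below shows the usual confidence-threshold ingredient of such loops; it is an assumption for illustration, as the paper's exact selection criterion is not given in the abstract.

```python
def select_pseudo_labels(hypotheses, min_confidence=0.8):
    """From (utterance_id, transcript, confidence) triples produced by a seed
    ASR model on unlabeled speech, keep the pairs whose confidence clears the
    threshold; the survivors are added to the fine-tuning data."""
    return [(uid, text) for uid, text, conf in hypotheses
            if conf >= min_confidence]
```

The retained pairs are then mixed with the labeled and TTS-synthesized data for the next fine-tuning round, and the threshold trades pseudo-label quality against quantity.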
https://arxiv.org/abs/2601.06802