Recent advances in deep learning have encouraged the development of large automatic speech recognition (ASR) models that achieve promising results while ignoring computational and memory constraints. However, deploying such models on low-resource devices is impractical despite their favorable performance. Existing approaches (pruning, distillation, layer skipping, etc.) either transform the large models into smaller ones at the cost of significant performance degradation or require prolonged training of smaller models to reach better performance. To address these issues, we introduce an effective two-step representation-learning-based approach capable of producing several small models from a single large model while ensuring considerably better performance within a limited number of epochs. Comprehensive experimentation on ASR benchmarks reveals the efficacy of our approach, achieving a three-fold training speed-up and up to 12.54% word error rate improvement.
https://arxiv.org/abs/2505.16991
Recent advances in Automatic Speech Recognition (ASR) have been largely fueled by massive speech corpora. However, extending coverage to diverse languages with limited resources remains a formidable challenge. This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech via off-the-shelf text-to-speech (TTS) models. We demonstrate that just tens of hours of real transcribed speech can effectively train TTS models to generate synthetic speech at hundreds of times the original volume while maintaining high quality. To evaluate synthetic speech quality, we develop an intelligibility-based assessment framework and establish clear thresholds for when synthetic data benefits ASR training. Using Speech Back-Translation, we generate more than 500,000 hours of synthetic speech in ten languages and continue pre-training Whisper-large-v3, achieving average transcription error reductions of over 30%. These results highlight the scalability and effectiveness of Speech Back-Translation for enhancing multilingual ASR systems.
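To make the pipeline concrete, here is a minimal sketch of a speech back-translation loop, assuming an off-the-shelf TTS model and an ASR model used as an intelligibility scorer; the function names (`tts_synthesize`, `asr_transcribe`) and the CER threshold are illustrative placeholders, not the paper's actual components or calibrated values.

```python
# Hypothetical sketch of speech back-translation: text -> TTS -> intelligibility filter -> ASR training data.
from jiwer import cer  # character error rate, used here as a rough intelligibility proxy

def back_translate(text_corpus, tts_synthesize, asr_transcribe, cer_threshold=0.2):
    """Convert text into (audio, transcript) pairs, keeping only intelligible synthetic speech."""
    synthetic_pairs = []
    for text in text_corpus:
        audio = tts_synthesize(text)                 # off-the-shelf TTS trained on tens of hours of real speech
        hypothesis = asr_transcribe(audio)           # a reference ASR model scores intelligibility
        if cer(text, hypothesis) <= cer_threshold:   # discard unintelligible synthesis (threshold is illustrative)
            synthetic_pairs.append((audio, text))
    return synthetic_pairs                           # used to continue pre-training the multilingual ASR model
```

In the paper, the threshold is what the intelligibility-based assessment framework determines; the jiwer-based check above is only a stand-in for that framework.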
https://arxiv.org/abs/2505.16972
The integration of artificial intelligence in sports analytics has transformed soccer video understanding, enabling real-time, automated insights into complex game dynamics. Traditional approaches rely on isolated data streams, limiting their effectiveness in capturing the full context of a match. To address this, we introduce SoccerChat, a multimodal conversational AI framework that integrates visual and textual data for enhanced soccer video comprehension. Leveraging the extensive SoccerNet dataset, enriched with jersey color annotations and automatic speech recognition (ASR) transcripts, SoccerChat is fine-tuned on a structured video instruction dataset to facilitate accurate game understanding, event classification, and referee decision making. We benchmark SoccerChat on action classification and referee decision-making tasks, demonstrating its performance in general soccer event comprehension while maintaining competitive accuracy in referee decision making. Our findings highlight the importance of multimodal integration in advancing soccer analytics, paving the way for more interactive and explainable AI-driven sports analysis. this https URL
https://arxiv.org/abs/2505.16630
We introduce X-ARES (eXtensive Audio Representation and Evaluation Suite), a novel open-source benchmark designed to systematically assess audio encoder performance across diverse domains. By encompassing tasks spanning speech, environmental sounds, and music, X-ARES provides two approaches for evaluating audio representations: linear fine-tuning and unparameterized evaluation. The framework includes 22 distinct tasks that cover essential aspects of audio processing, from speech recognition and emotion detection to sound event classification and music genre identification. Our extensive evaluation of state-of-the-art audio encoders reveals significant performance variations across different tasks and domains, highlighting the complexity of general audio representation learning.
https://arxiv.org/abs/2505.16369
Automatic Speech Recognition (ASR) has recently shown remarkable progress, but accurately transcribing children's speech remains a significant challenge. Recent developments in Large Language Models (LLMs) have shown promise in improving ASR transcriptions. However, their application to child speech, including conversational scenarios, remains underexplored. In this study, we explore the use of LLMs in correcting ASR errors for conversational child speech. We demonstrate the promise and challenges of LLMs through experiments on two children's conversational speech datasets with both zero-shot and fine-tuned ASR outputs. We find that while LLMs are helpful in correcting zero-shot ASR outputs and fine-tuned CTC-based ASR outputs, it remains challenging for LLMs to improve ASR performance when incorporating contextual information or when using fine-tuned autoregressive ASR (e.g., Whisper) outputs.
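As a rough illustration of the correction setup (not the authors' exact prompt, model, or pipeline), an LLM can be asked to post-edit an ASR hypothesis, optionally with preceding dialogue turns as context:

```python
# Illustrative sketch of LLM-based ASR error correction; the prompt wording is an assumption.
def build_correction_prompt(asr_hypothesis, context_turns=None):
    """Ask an LLM to post-edit an ASR hypothesis, optionally with preceding conversational turns."""
    prompt = ("The following is an automatic transcript of a child's utterance and may contain "
              "recognition errors. Return the corrected transcript only.\n")
    if context_turns:  # optional dialogue context; the paper finds gains are hardest in this setting
        prompt += "Previous turns:\n" + "\n".join(context_turns) + "\n"
    prompt += f"Transcript: {asr_hypothesis}\nCorrected:"
    return prompt
```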
https://arxiv.org/abs/2505.16212
Recent studies have highlighted the potential of discrete tokens derived from self-supervised learning (SSL) models for various speech-related tasks. These tokens serve not only as substitutes for text in language modeling but also as intermediate representations for tasks such as automatic speech recognition (ASR). However, discrete tokens are typically obtained via k-means clustering of SSL features independently of downstream tasks, making them suboptimal for specific applications. This paper proposes the use of differentiable k-means, enabling the joint optimization of tokenization and downstream tasks. This approach enables the fine-tuning of the SSL parameters and the learning of weights for outputs from multiple SSL layers. Experiments were conducted with ASR as the downstream task, and ASR accuracy improved owing to the optimized tokens. The acquired tokens also exhibited greater purity of phonetic information, which proved useful even in speech resynthesis.
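Below is a minimal PyTorch sketch of what a differentiable k-means tokenizer can look like; the temperature and the straight-through trick are generic choices for making cluster assignment trainable, not necessarily the paper's exact recipe.

```python
# Minimal sketch of a differentiable k-means tokenizer over SSL features (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentiableKMeans(nn.Module):
    def __init__(self, num_clusters: int, feature_dim: int, temperature: float = 0.1):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(num_clusters, feature_dim))
        self.temperature = temperature

    def forward(self, features: torch.Tensor):
        # features: (batch, time, feature_dim), e.g. a learned weighted sum of SSL layer outputs
        dists = torch.cdist(features, self.centroids.unsqueeze(0).expand(features.size(0), -1, -1))
        soft_assign = F.softmax(-dists / self.temperature, dim=-1)          # differentiable assignment
        hard_assign = F.one_hot(soft_assign.argmax(-1), self.centroids.size(0)).float()
        # Straight-through estimator: discrete tokens on the forward pass, soft gradients backward.
        assign = hard_assign + soft_assign - soft_assign.detach()
        token_embeddings = assign @ self.centroids                           # fed to the downstream ASR model
        return soft_assign.argmax(-1), token_embeddings
```

Because the assignment is differentiable, gradients from the downstream ASR loss can flow back into both the centroids and the SSL parameters, which is the joint optimization the abstract describes.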
https://arxiv.org/abs/2505.16207
Recently, a method was proposed for synthesizing foreign-accented speech using only native speech data, based on discrete tokens obtained from self-supervised learning (SSL) models. Considering the limited availability of accented speech data, this method is expected to make it much easier to simulate foreign accents. By using the synthesized accented speech as listening material for humans or as training data for automatic speech recognition (ASR), both can acquire higher robustness against foreign accents. However, the previous method has a fatal flaw: it cannot reproduce duration-related accents. Durational accents are commonly seen when L2 speakers, whose native language has syllable-timed or mora-timed rhythm, speak stress-timed languages such as English. In this paper, we integrate duration modification into the previous method to simulate foreign accents more accurately. Experiments show that the proposed method successfully replicates durational accents seen in real L2 speech.
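One simple way to realize duration modification on a discrete token sequence is to stretch each run of identical tokens by a target ratio before resynthesis; the sketch below is illustrative, and the source of the ratio (e.g., L1 vs. L2 duration statistics) and the rounding scheme are assumptions, not the paper's exact method.

```python
# Illustrative duration stretching of a run-length-encoded discrete SSL token sequence.
from itertools import groupby

def stretch_token_durations(tokens, ratio):
    """Repeat run-length-encoded tokens so the overall duration changes by `ratio`."""
    stretched = []
    for token, run in groupby(tokens):
        length = len(list(run))
        new_length = max(1, round(length * ratio))  # ratio > 1 simulates durational lengthening
        stretched.extend([token] * new_length)
    return stretched

print(stretch_token_durations([7, 7, 7, 12, 12, 3], ratio=1.5))  # -> [7, 7, 7, 7, 12, 12, 12, 3, 3]
```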
https://arxiv.org/abs/2505.16191
Although multilingual automatic speech recognition (ASR) systems have significantly advanced, enabling a single model to handle multiple languages, inherent linguistic differences and data imbalances challenge SOTA performance across all languages. While language identification (LID) models can route speech to the appropriate ASR model, they incur high costs from invoking SOTA commercial models and suffer from inaccuracies due to misclassification. To overcome these, we propose SIMA, a selective invocation for multilingual ASR that adapts to the difficulty level of the input speech. Built on a spoken large language model (SLLM), SIMA evaluates whether the input is simple enough for direct transcription or requires the invocation of a SOTA ASR model. Our approach reduces word error rates by 18.7% compared to the SLLM and halves invocation costs compared to LID-based methods. Tests on three datasets show that SIMA is a scalable, cost-effective solution for multilingual ASR applications.
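Conceptually, the selective invocation reduces to a difficulty-gated fallback. The sketch below is a hypothetical rendering in which the method names and the confidence threshold are illustrative; in the paper the difficulty judgment is made by the SLLM itself rather than by an external score.

```python
# Hypothetical sketch of difficulty-aware selective invocation for multilingual ASR.
def transcribe_with_selective_invocation(audio, sllm, sota_asr, confidence_threshold=0.9):
    """Let the spoken LLM handle easy inputs and fall back to the expensive SOTA ASR otherwise."""
    hypothesis, confidence = sllm.transcribe_with_confidence(audio)
    if confidence >= confidence_threshold:   # input judged simple enough for direct transcription
        return hypothesis                    # no external invocation cost
    return sota_asr.transcribe(audio)        # hard input: pay for the stronger commercial model
```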
https://arxiv.org/abs/2505.16168
We introduce a data-driven approach for enabling word-level timestamp prediction in the Canary model. Accurate timestamp information is crucial for a variety of downstream tasks such as speech content retrieval and timed subtitles. While traditional hybrid systems and end-to-end (E2E) models may employ external modules for timestamp prediction, our approach eliminates the need for separate alignment mechanisms. By leveraging the NeMo Forced Aligner (NFA) as a teacher model, we generate word-level timestamps and train the Canary model to predict timestamps directly. We introduce a new <|timestamp|> token, enabling the Canary model to predict start and end timestamps for each word. Our method demonstrates precision and recall rates between 80% and 90%, with timestamp prediction errors ranging from 20 to 120 ms across four languages, while incurring minimal WER degradation. Additionally, we extend our system to automatic speech translation (AST) tasks, achieving timestamp prediction errors of around 200 milliseconds.
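One plausible way to render such word-level targets, shown purely as an illustrative assumption (the exact token layout used by Canary is not specified here), is to interleave <|timestamp|> tokens with each aligned word:

```python
# Hypothetical rendering of a word-level training target with timestamp tokens.
def build_timestamp_target(words):
    """words: list of (word, start_sec, end_sec) produced by the NFA teacher alignment."""
    pieces = []
    for word, start, end in words:
        pieces.append(f"<|timestamp|>{start:.2f}<|timestamp|>{end:.2f} {word}")
    return " ".join(pieces)

print(build_timestamp_target([("hello", 0.12, 0.48), ("world", 0.55, 0.97)]))
# <|timestamp|>0.12<|timestamp|>0.48 hello <|timestamp|>0.55<|timestamp|>0.97 world
```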
https://arxiv.org/abs/2505.15646
Human listeners readily adjust to unfamiliar speakers and language varieties through exposure, but do these adaptation benefits extend to state-of-the-art spoken language models? We introduce a scalable framework that allows for in-context learning (ICL) in Phi-4 Multimodal using interleaved task prompts and audio-text pairs, and find that as few as 12 example utterances (~50 seconds) at inference time reduce word error rates by a relative 19.7% (1.2 pp.) on average across diverse English corpora. These improvements are most pronounced in low-resource varieties, when the context and target speaker match, and when more examples are provided--though scaling our procedure yields diminishing marginal returns to context length. Overall, we find that our novel ICL adaptation scheme (1) reveals a similar performance profile to human listeners, and (2) demonstrates consistent improvements to automatic speech recognition (ASR) robustness across diverse speakers and language backgrounds. While adaptation succeeds broadly, significant gaps remain for certain varieties, revealing where current models still fall short of human flexibility. We release our prompts and code on GitHub.
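The ICL setup can be pictured as interleaving audio-text example pairs before the target utterance. The message structure below is an illustrative sketch, not Phi-4 Multimodal's actual API; the keys and instruction text are assumptions.

```python
# Sketch of building an interleaved in-context adaptation prompt for a multimodal model.
def build_icl_prompt(example_pairs, target_audio, max_examples=12):
    """example_pairs: list of (audio, transcript) drawn from the same speaker or language variety."""
    messages = []
    for audio, transcript in example_pairs[:max_examples]:   # ~50 seconds of speech in the paper's setup
        messages.append({"role": "user", "content": ["Transcribe this audio.", audio]})
        messages.append({"role": "assistant", "content": transcript})
    messages.append({"role": "user", "content": ["Transcribe this audio.", target_audio]})
    return messages
```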
https://arxiv.org/abs/2505.14887
Automatic speech recognition (ASR) for dysarthric speech remains challenging due to data scarcity, particularly in non-English languages. To address this, we fine-tune a voice conversion model on English dysarthric speech (UASpeech) to encode both speaker characteristics and prosodic distortions, then apply it to convert healthy non-English speech (FLEURS) into non-English dysarthric-like speech. The generated data is then used to fine-tune a multilingual ASR model, Massively Multilingual Speech (MMS), for improved dysarthric speech recognition. Evaluation on PC-GITA (Spanish), EasyCall (Italian), and SSNCE (Tamil) demonstrates that VC with both speaker and prosody conversion significantly outperforms the off-the-shelf MMS performance and conventional augmentation techniques such as speed and tempo perturbation. Objective and subjective analyses of the generated data further confirm that the generated speech simulates dysarthric characteristics.
https://arxiv.org/abs/2505.14874
We introduce Vox-Profile, a comprehensive benchmark to characterize rich speaker and speech traits using speech foundation models. Unlike existing works that focus on a single dimension of speaker traits, Vox-Profile provides holistic and multi-dimensional profiles that reflect both static speaker traits (e.g., age, sex, accent) and dynamic speech properties (e.g., emotion, speech flow). This benchmark is grounded in speech science and linguistics, developed with domain experts to accurately index speaker and speech characteristics. We report benchmark experiments using over 15 publicly available speech datasets and several widely used speech foundation models that target various static and dynamic speaker and speech properties. In addition to benchmark experiments, we showcase several downstream applications supported by Vox-Profile. First, we show that Vox-Profile can augment existing speech recognition datasets to analyze ASR performance variability. Vox-Profile is also used as a tool to evaluate the performance of speech generation systems. Finally, we assess the quality of our automated profiles through comparison with human evaluation and show convergent validity. Vox-Profile is publicly available at: this https URL.
https://arxiv.org/abs/2505.14648
Deep neural networks have achieved state-of-the-art results in a wide range of applications, from natural language processing and computer vision to speech recognition. However, as tasks become increasingly complex, model sizes continue to grow, posing challenges in latency and memory efficiency. To meet these constraints, post-training quantization has emerged as a promising solution. In this paper, we propose a novel hardware-efficient quantization and inference scheme that exploits hardware advantages with minimal accuracy degradation. Specifically, we introduce a W4A8 scheme, where weights are quantized and stored using 4-bit integer precision, and inference computations are performed using 8-bit floating-point arithmetic, demonstrating significant speedups and improved memory utilization compared to 16-bit operations, applicable on various modern accelerators. To mitigate accuracy loss, we develop a novel quantization algorithm, dubbed Dual Precision Quantization (DPQ), that leverages the unique structure of our scheme without introducing additional inference overhead. Experimental results demonstrate improved performance (i.e., increased throughput) while maintaining tolerable accuracy degradation relative to the full-precision model.
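For intuition, a generic W4A8 scheme stores weights as 4-bit integers with per-channel scales and dequantizes them at matmul time. The NumPy sketch below simulates this in float32 for runnability (real deployments perform the compute in FP8 on supporting hardware) and is not an implementation of the paper's DPQ algorithm.

```python
# Generic W4A8 illustration: int4 weight storage with per-output-channel scales, dequantized at matmul time.
import numpy as np

def quantize_w4(weight):
    """Symmetric per-output-channel int4 quantization: values clipped to [-8, 7]."""
    scale = np.abs(weight).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(weight / scale), -8, 7).astype(np.int8)  # stored in 4 bits (packed in practice)
    return q, scale

def w4a8_matmul(activations, q_weight, scale):
    """Dequantize weights and multiply; activations would be cast to FP8 on real hardware."""
    dequantized = q_weight.astype(np.float32) * scale
    return activations @ dequantized.T

w = np.random.randn(16, 64).astype(np.float32)
x = np.random.randn(2, 64).astype(np.float32)
q, s = quantize_w4(w)
print(np.abs(w4a8_matmul(x, q, s) - x @ w.T).max())  # quantization error stays small
```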
https://arxiv.org/abs/2505.14638
Despite significant progress in neural spoken dialog systems, personality-aware conversation agents -- capable of adapting behavior based on personalities -- remain underexplored due to the absence of personality annotations in speech datasets. We propose a pipeline that preprocesses raw audio recordings to create a dialogue dataset annotated with timestamps, response types, and emotion/sentiment labels. We employ an automatic speech recognition (ASR) system to extract transcripts and timestamps, then generate conversation-level annotations. Leveraging these annotations, we design a system that employs large language models to predict conversational personality. Human evaluators were engaged to identify conversational characteristics and assign personality labels. Our analysis demonstrates that the proposed system achieves stronger alignment with human judgments compared to existing approaches.
https://arxiv.org/abs/2505.14356
Audio-Visual Speech Recognition (AVSR) enhances robustness in noisy environments by integrating visual cues. While recent advances integrate Large Language Models (LLMs) into AVSR, their high computational cost hinders deployment in resource-constrained settings. To address this, we propose Llama-SMoP, an efficient Multimodal LLM that employs a Sparse Mixture of Projectors (SMoP) module to scale model capacity without increasing inference costs. By incorporating sparsely-gated mixture-of-experts (MoE) projectors, Llama-SMoP enables the use of smaller LLMs while maintaining strong performance. We explore three SMoP configurations and show that Llama-SMoP DEDR (Disjoint-Experts, Disjoint-Routers), which uses modality-specific routers and experts, achieves superior performance on ASR, VSR, and AVSR tasks. Ablation studies confirm its effectiveness in expert activation, scalability, and noise robustness.
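A minimal sketch of a sparsely-gated mixture-of-projectors layer mapping encoder features into the LLM embedding space is shown below; the expert count, top-k value, and routing details are illustrative rather than the exact Llama-SMoP configuration.

```python
# Minimal PyTorch sketch of a sparse mixture-of-projectors (illustrative configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMixtureOfProjectors(nn.Module):
    def __init__(self, in_dim: int, llm_dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(in_dim, llm_dim) for _ in range(num_experts)])
        self.router = nn.Linear(in_dim, num_experts)   # modality-specific router in the DEDR variant
        self.top_k = top_k

    def forward(self, features: torch.Tensor):         # features: (batch, time, in_dim)
        logits = self.router(features)
        weights, indices = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)            # renormalize over the selected experts
        expert_outputs = torch.stack([e(features) for e in self.experts], dim=-2)  # (B, T, E, llm_dim)
        selected = torch.gather(
            expert_outputs, -2,
            indices.unsqueeze(-1).expand(*indices.shape, expert_outputs.size(-1)))
        return (weights.unsqueeze(-1) * selected).sum(dim=-2)   # (batch, time, llm_dim)
```

For clarity the sketch computes every expert and then gathers the selected ones; an efficiency-oriented implementation would dispatch tokens only to the experts chosen by the router.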
https://arxiv.org/abs/2505.14336
Hausa Natural Language Processing (NLP) has gained increasing attention in recent years, yet remains understudied as a low-resource language despite having over 120 million first-language (L1) and 80 million second-language (L2) speakers worldwide. While significant advances have been made in high-resource languages, Hausa NLP faces persistent challenges, including limited open-source datasets and inadequate model representation. This paper presents an overview of the current state of Hausa NLP, systematically examining existing resources, research contributions, and gaps across fundamental NLP tasks: text classification, machine translation, named entity recognition, speech recognition, and question answering. We introduce HausaNLP (this https URL), a curated catalog that aggregates datasets, tools, and research works to enhance accessibility and drive further development. Furthermore, we discuss challenges in integrating Hausa into large language models (LLMs), addressing issues of suboptimal tokenization and dialectal variation. Finally, we propose strategic research directions emphasizing dataset expansion, improved language modeling approaches, and strengthened community collaboration to advance Hausa NLP. Our work provides both a foundation for accelerating Hausa NLP progress and valuable insights for broader multilingual NLP research.
https://arxiv.org/abs/2505.14311
Meetings are a valuable yet challenging scenario for speech applications due to complex acoustic conditions. This paper summarizes the outcomes of the MISP 2025 Challenge, hosted at Interspeech 2025, which focuses on multi-modal, multi-device meeting transcription by incorporating video modality alongside audio. The tasks include Audio-Visual Speaker Diarization (AVSD), Audio-Visual Speech Recognition (AVSR), and Audio-Visual Diarization and Recognition (AVDR). We present the challenge's objectives, tasks, dataset, baseline systems, and solutions proposed by participants. The best-performing systems achieved significant improvements over the baseline: the top AVSD model achieved a Diarization Error Rate (DER) of 8.09%, improving by 7.43%; the top AVSR system achieved a Character Error Rate (CER) of 9.48%, improving by 10.62%; and the best AVDR system achieved a concatenated minimum-permutation Character Error Rate (cpCER) of 11.56%, improving by 72.49%.
https://arxiv.org/abs/2505.13971
Sign Language Recognition (SLR) systems primarily focus on manual gestures, but non-manual features such as mouth movements, specifically mouthing, provide valuable linguistic information. This work directly classifies mouthing instances to their corresponding words in the spoken language while exploring the potential of transfer learning from Visual Speech Recognition (VSR) to mouthing recognition in German Sign Language. We leverage three VSR datasets: one in English, one in German with unrelated words and one in German containing the same target words as the mouthing dataset, to investigate the impact of task similarity in this setting. Our results demonstrate that multi-task learning improves both mouthing recognition and VSR accuracy as well as model robustness, suggesting that mouthing recognition should be treated as a distinct but related task to VSR. This research contributes to the field of SLR by proposing knowledge transfer from VSR to SLR datasets with limited mouthing annotations.
https://arxiv.org/abs/2505.13784
Multi-task and multilingual approaches benefit large models, yet speech processing for low-resource languages remains underexplored due to data scarcity. To address this, we present Granary, a large-scale collection of speech datasets for recognition and translation across 25 European languages. This is the first open-source effort at this scale covering both transcription and translation. We enhance data quality using a pseudo-labeling pipeline with segmentation, two-pass inference, hallucination filtering, and punctuation restoration. We further generate translation pairs from pseudo-labeled transcriptions using EuroLLM, followed by a data filtration pipeline. Designed for efficiency, our pipeline processes vast amounts of data within hours. We assess models trained on the processed data by comparing their performance on previously curated datasets for both high- and low-resource languages. Our findings show that these models achieve similar performance using approximately 50% less data. The dataset will be made available at this https URL
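Schematically, the pseudo-labeling pipeline can be pictured as the sequence of stages below. The function names, the hallucination heuristic, and the prompt-conditioned second decoding pass are assumptions made for illustration, not Granary's actual implementation.

```python
# Schematic pseudo-labeling pipeline: segmentation -> two-pass inference -> filtering -> punctuation.
def looks_like_hallucination(transcript: str) -> bool:
    """Crude repetition heuristic standing in for the paper's hallucination filter."""
    words = transcript.split()
    return bool(words) and len(set(words)) < 0.3 * len(words)

def pseudo_label(audio_file, segmenter, asr_model, punctuation_model):
    records = []
    for segment in segmenter(audio_file):                                  # 1. segmentation
        first_pass = asr_model.transcribe(segment)                         # 2. two-pass inference:
        transcript = asr_model.transcribe(segment, prompt=first_pass)      #    refine with a second pass
        if looks_like_hallucination(transcript):                           # 3. hallucination filtering
            continue
        records.append((segment, punctuation_model.restore(transcript)))   # 4. punctuation restoration
    return records
```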
https://arxiv.org/abs/2505.13404
Transferring linguistic knowledge from a pretrained language model (PLM) to acoustic feature learning has proven effective in enhancing end-to-end automatic speech recognition (E2E-ASR). However, aligning representations between linguistic and acoustic modalities remains a challenge due to inherent modality gaps. Optimal transport (OT) has shown promise in mitigating these gaps by minimizing the Wasserstein distance (WD) between linguistic and acoustic feature distributions. However, previous OT-based methods overlook structural relationships, treating feature vectors as unordered sets. To address this, we propose Graph Matching Optimal Transport (GM-OT), which models linguistic and acoustic sequences as structured graphs. Nodes represent feature embeddings, while edges capture temporal and sequential relationships. GM-OT minimizes both WD (between nodes) and Gromov-Wasserstein distance (GWD) (between edges), leading to a fused Gromov-Wasserstein distance (FGWD) formulation. This enables structured alignment and more efficient knowledge transfer compared to existing OT-based approaches. Theoretical analysis further shows that prior OT-based methods in linguistic knowledge transfer can be viewed as a special case within our GM-OT framework. We evaluate GM-OT on Mandarin ASR using a CTC-based E2E-ASR system with a PLM for knowledge transfer. Experimental results demonstrate significant performance gains over state-of-the-art models, validating the effectiveness of our approach.
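For reference, a standard fused Gromov-Wasserstein objective combining node-level and edge-level costs takes the form below (generic notation; the trade-off weight and cost functions are not necessarily the paper's exact choices):

$$
\mathrm{FGW}_{\alpha}(\mu,\nu)
= \min_{\pi \in \Pi(\mu,\nu)}
\;(1-\alpha)\sum_{i,j} d\!\left(x_i, y_j\right)\pi_{ij}
\;+\;\alpha \sum_{i,j,k,l} \bigl| C^{x}_{ik} - C^{y}_{jl} \bigr|^{2}\, \pi_{ij}\,\pi_{kl},
$$

where $x_i$ and $y_j$ are node embeddings (here, linguistic and acoustic features), $C^{x}$ and $C^{y}$ encode edge (structural) distances within each graph, $\pi$ is the transport plan, and $\alpha$ balances the Wasserstein (node) and Gromov-Wasserstein (edge) terms. Setting $\alpha = 0$ recovers plain optimal transport over node features, which is consistent with the observation that prior OT-based transfer methods are a special case of GM-OT.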
https://arxiv.org/abs/2505.13079