Generative AI has demonstrated impressive performance in various fields, among which speech synthesis is an interesting direction. With the diffusion model being the most popular generative model, numerous works have attempted two active tasks: text-to-speech and speech enhancement. This work surveys audio diffusion models, complementing existing surveys that either lack the recent progress of diffusion-based speech synthesis or only give an overall picture of applying diffusion models across multiple fields. Specifically, this work first briefly introduces the background of audio and diffusion models. For the text-to-speech task, we divide methods into three categories based on the stage at which the diffusion model is adopted: acoustic model, vocoder, and end-to-end framework. Moreover, we categorize various speech enhancement tasks by whether certain signals are removed from or added to the input speech. Comparisons of experimental results and discussions are also covered in this survey.
As one of the major branches of automatic speech recognition, attention-based models greatly improve the feature representation ability of the model. In particular, a multi-head mechanism is employed in the attention, in the hope of learning speech features from more aspects in different attention subspaces. For speech recognition of complex languages, on the one hand, a small number of heads leads to an obvious shortage of learnable aspects; on the other hand, increasing the number of heads requires reducing the dimension of each subspace to keep the size of the overall feature space unchanged, which significantly weakens the representation ability of each subspace. Therefore, this paper explores how to use small attention subspaces to represent complete speech features while retaining many heads. We propose a novel neural network architecture, namely a pyramid multi-branch fusion DCNN with multi-head self-attention. The proposed architecture is inspired by Dilated Convolutional Neural Networks (DCNN): it uses multiple DCNN branches to extract features of the input speech under different receptive fields. To reduce the number of parameters, every two branches are merged until all branches are merged into one, so the architecture's shape resembles a pyramid rotated by 90 degrees. We demonstrate that on AISHELL-1, a widely used Mandarin speech dataset, our model achieves a character error rate (CER) of 6.45% on the test set.
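The head-count versus subspace-dimension trade-off discussed in this abstract can be made concrete with a short sketch (NumPy; the sequence length and model dimension below are hypothetical, not taken from the paper):

```python
import numpy as np

def split_heads(x, num_heads):
    """Split a feature tensor of shape (seq_len, d_model) into
    (num_heads, seq_len, d_head) with d_head = d_model // num_heads."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    return x.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

x = np.zeros((100, 256))                 # hypothetical: 100 frames, d_model=256
print(split_heads(x, 4).shape)           # (4, 100, 64): few heads, wide subspaces
print(split_heads(x, 16).shape)          # (16, 100, 16): many heads, narrow subspaces
```

With the total feature dimension fixed, every extra head shrinks each subspace, which is exactly the representation bottleneck the paper targets.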
Transformer-based models have recently made significant achievements in end-to-end (E2E) automatic speech recognition (ASR), making it possible to deploy E2E ASR systems on smart devices. However, these models still have the disadvantage of requiring a large number of parameters. To overcome this drawback of universal Transformer models for ASR on edge devices, we propose a solution that reuses blocks in Transformer models for small-footprint ASR systems, meeting the objective of accommodating resource limitations without compromising recognition accuracy. Specifically, we design a novel block-reusing strategy for the speech Transformer (BRST) to enhance parameter efficiency and propose an adapter module (ADM) that produces a compact and adaptable model, with only a few additional trainable parameters accompanying each reused block. We conducted experiments with the proposed method on the public AISHELL-1 corpus, and the results show that the proposed approach achieves character error rates (CER) of 9.3% and 6.63% with only 7.6M and 8.3M parameters without and with the ADM, respectively. In addition, we provide a deeper analysis of the effect of the ADM in the general block-reusing method.
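A rough, back-of-the-envelope parameter count illustrates why reusing a single Transformer block with small per-reuse adapters shrinks the footprint. All sizes below are hypothetical and the per-block formula is a simplification; this is not the BRST/ADM implementation:

```python
def transformer_params(d_model, d_ff):
    # crude per-block count: 4 attention projections + 2 FFN matrices (no biases)
    return 4 * d_model * d_model + 2 * d_model * d_ff

def total_params(num_layers, d_model=256, d_ff=1024, reuse=False, adapter_dim=32):
    block = transformer_params(d_model, d_ff)
    if not reuse:
        return num_layers * block
    # one shared block + a small bottleneck adapter (down + up projection) per reuse
    adapter = 2 * d_model * adapter_dim
    return block + num_layers * adapter

baseline = total_params(12)               # 12 independent blocks
reused = total_params(12, reuse=True)     # 1 shared block + 12 tiny adapters
print(baseline, reused)
```

Even in this toy accounting, the shared-block model keeps roughly one block's worth of weights plus a small adapter tax, mirroring the paper's 7.6M vs. 8.3M trade-off in spirit.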
Gesture synthesis has gained significant attention as a critical research area, focusing on producing contextually appropriate and natural gestures corresponding to speech or textual input. Although deep learning-based approaches have achieved remarkable progress, they often overlook the rich semantic information present in the text, leading to less expressive and meaningful gestures. We propose GesGPT, a novel approach to gesture generation that leverages the semantic analysis capabilities of Large Language Models (LLMs), such as GPT. By capitalizing on the strengths of LLMs for text analysis, we design prompts to extract gesture-related information from textual input. Our method entails developing prompt principles that transform gesture generation into an intention classification problem based on GPT, and utilizing a curated gesture library and integration module to produce semantically rich co-speech gestures. Experimental results demonstrate that GesGPT effectively generates contextually appropriate and expressive gestures, offering a new perspective on semantic co-speech gesture generation.
We introduce LMCodec, a causal neural speech codec that provides high quality audio at very low bitrates. The backbone of the system is a causal convolutional codec that encodes audio into a hierarchy of coarse-to-fine tokens using residual vector quantization. LMCodec trains a Transformer language model to predict the fine tokens from the coarse ones in a generative fashion, allowing for the transmission of fewer codes. A second Transformer predicts the uncertainty of the next codes given the past transmitted codes, and is used to perform conditional entropy coding. A MUSHRA subjective test was conducted and shows that the quality is comparable to reference codecs at higher bitrates. Example audio is available at this https URL.
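The coarse-to-fine residual vector quantization that LMCodec builds on can be sketched as follows (NumPy; the codebook sizes and vector dimension are illustrative, not the codec's actual configuration):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual VQ: each stage quantizes the residual left by the previous one.
    x: (dim,) vector; codebooks: list of (num_codes, dim) arrays."""
    residual = x.copy()
    indices, quantized = [], np.zeros_like(x)
    for cb in codebooks:
        idx = np.argmin(np.sum((cb - residual) ** 2, axis=1))  # nearest codeword
        indices.append(int(idx))
        quantized += cb[idx]
        residual -= cb[idx]
    return indices, quantized

rng = np.random.default_rng(0)
x = rng.normal(size=8)
# later (finer) codebooks are scaled down, as they only refine the residual
codebooks = [rng.normal(size=(16, 8)) * (0.5 ** i) for i in range(4)]
idx, xq = rvq_encode(x, codebooks)
print(idx, np.linalg.norm(x - xq))
```

The first indices are the "coarse" tokens and the rest are "fine" tokens; LMCodec's language model then predicts the fine ones from the coarse ones so that fewer codes need to be transmitted.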
We show that training a multi-headed self-attention-based deep network to predict deleted, information-dense 2-8 Hz speech modulations over a 1.5-second section of a speech utterance is an effective way to make machines learn to extract speech modulations using time-domain contextual information. Our work shows that, once trained on large volumes of unlabelled data, the outputs of the self-attention layers vary in time with a modulation peak at 4 Hz. These pre-trained layers can be used to initialize parts of an automatic speech recognition system to greatly reduce its reliance on labeled speech data.
This work focuses on sign language retrieval, a recently proposed task for sign language understanding. Sign language retrieval consists of two sub-tasks: text-to-sign-video (T2V) retrieval and sign-video-to-text (V2T) retrieval. Different from traditional video-text retrieval, sign language videos not only contain visual signals but also carry abundant semantic meanings themselves, since sign languages are also natural languages. Considering this characteristic, we formulate sign language retrieval as a cross-lingual retrieval problem as well as a video-text retrieval task. Concretely, we take into account the linguistic properties of both sign languages and natural languages, and simultaneously identify the fine-grained cross-lingual (i.e., sign-to-word) mappings while contrasting the texts and the sign videos in a joint embedding space. This process is termed cross-lingual contrastive learning. Another challenge is raised by the data scarcity issue: sign language datasets are orders of magnitude smaller in scale than those for speech recognition. We alleviate this issue by transferring a domain-agnostic sign encoder pre-trained on large-scale sign videos to the target domain via pseudo-labeling. Our framework, termed domain-aware sign language retrieval via Cross-lingual Contrastive learning (CiCo for short), outperforms the pioneering method by large margins on various datasets, e.g., +22.4 T2V and +28.0 V2T R@1 improvements on the How2Sign dataset, and +13.7 T2V and +17.1 V2T R@1 improvements on the PHOENIX-2014T dataset. Code and models are available at: this https URL.
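The joint-embedding contrastive objective can be illustrated with a symmetric InfoNCE loss over paired embeddings (a generic sketch in NumPy, not CiCo's exact loss; the batch size and embedding dimension are made up):

```python
import numpy as np

def contrastive_loss(sign_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired sign-video/text embeddings.
    Matched pairs sit on the diagonal of the similarity matrix."""
    sign = sign_emb / np.linalg.norm(sign_emb, axis=1, keepdims=True)
    text = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = sign @ text.T / temperature
    labels = np.arange(len(logits))
    # cross-entropy in both retrieval directions (sign->text and text->sign)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_s2t = -log_probs[labels, labels].mean()
    log_probs_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    loss_t2s = -log_probs_t[labels, labels].mean()
    return (loss_s2t + loss_t2s) / 2

rng = np.random.default_rng(1)
emb = rng.normal(size=(4, 32))
aligned = contrastive_loss(emb, emb)          # matched pairs are identical
shuffled = contrastive_loss(emb, emb[::-1])   # pairs deliberately mismatched
print(aligned, shuffled)
```

The loss is low only when each sign video is closest to its own text, which is the pressure that pulls the two languages into one embedding space.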
The advancement of speech technologies has been remarkable, yet their integration with African languages remains limited due to the scarcity of African speech corpora. To address this issue, we present AfroDigits, a minimalist, community-driven dataset of spoken digits for African languages, currently covering 38 African languages. As a demonstration of the practical applications of AfroDigits, we conduct audio digit classification experiments on six African languages [Igbo (ibo), Yoruba (yor), Rundi (run), Oshiwambo (kua), Shona (sna), and Oromo (gax)] using the Wav2Vec2.0-Large and XLS-R models. Our experiments reveal a useful insight into the effect of mixing African speech corpora during finetuning. AfroDigits is the first published audio digit dataset for African languages, and we believe it will, among other things, pave the way for Afro-centric speech applications such as the recognition of telephone numbers and street numbers. We release the dataset and platform publicly at this https URL and this https URL respectively.
Video captioning aims to describe the content of videos using natural language. Although significant progress has been made, there is still much room to improve performance for real-world applications, mainly due to the long-tail words challenge. In this paper, we propose a text with knowledge graph augmented transformer (TextKG) for video captioning. Notably, TextKG is a two-stream transformer, formed by an external stream and an internal stream. The external stream is designed to absorb additional knowledge: it models the interactions between the additional knowledge, e.g., a pre-built knowledge graph, and the built-in information of videos, e.g., the salient object regions, speech transcripts, and video captions, to mitigate the long-tail words challenge. Meanwhile, the internal stream is designed to exploit the multi-modality information in videos (e.g., the appearance of video frames, speech transcripts, and video captions) to ensure the quality of caption results. In addition, a cross-attention mechanism is used between the two streams for sharing information. In this way, the two streams can help each other produce more accurate results. Extensive experiments conducted on four challenging video captioning datasets, i.e., YouCookII, ActivityNet Captions, MSRVTT, and MSVD, demonstrate that the proposed method performs favorably against the state-of-the-art methods. Specifically, the proposed TextKG method outperforms the best published results, improving absolute CIDEr scores by 18.7% on the YouCookII dataset.
In recent years, end-to-end speech recognition technology based on deep learning has developed rapidly. Due to the lack of Turkish speech data, the performance of Turkish speech recognition systems is poor. Firstly, this paper studies a series of speech recognition tuning techniques. The results show that model performance is best when data augmentation combining speed perturbation with noise addition is adopted and the beam search width is set to 16. Secondly, to maximize the use of effective feature information and improve the accuracy of feature extraction, this paper proposes a new feature extractor, LSPC. LSPC and a LiGRU network are combined to form a shared encoder structure, and model compression is realized. The results show that the performance of LSPC is better than that of MSPC and VGGnet when only Fbank features are used, improving the WER by 1.01% and 2.53%, respectively. Finally, based on the above two points, a new multi-feature fusion network is proposed as the main structure of the encoder. The results show that the proposed feature fusion network based on LSPC improves the WER by a further 0.82% and 1.94% compared with single-feature extraction (Fbank features and Spectrogram features, respectively) using LSPC. Our model achieves performance comparable to that of advanced end-to-end models.
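Since the results above are reported as WER, here is a minimal token-level error-rate computation via Levenshtein distance (a standard formulation, not specific to this paper):

```python
def error_rate(reference, hypothesis):
    """Levenshtein distance over tokens, normalized by reference length.
    Pass strings for CER or word lists for WER."""
    r, h = list(reference), list(hypothesis)
    # dp[i][j] = minimum edits to turn r[:i] into h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(r)

print(error_rate("speech", "spech"))  # 1 deletion / 6 chars ≈ 0.167
```

The same function computes the CER figures quoted elsewhere in this collection when called on character sequences.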
Recent works show that speech separation guided diarization (SSGD) is an increasingly promising direction, mainly thanks to recent progress in speech separation. It performs diarization by first separating the speakers and then applying voice activity detection (VAD) on each separated stream. In this work we conduct an in-depth study of SSGD in the conversational telephone speech (CTS) domain, focusing mainly on low-latency streaming diarization applications. We consider three state-of-the-art speech separation (SSep) algorithms and study their performance in both online and offline scenarios, considering non-causal and causal implementations as well as continuous SSep (CSS) windowed inference. We compare different SSGD algorithms on two widely used CTS datasets, CALLHOME and the Fisher Corpus (Parts 1 and 2), and evaluate both separation and diarization performance. To improve performance, a novel, causal and computationally efficient leakage removal algorithm is proposed, which significantly decreases false alarms. We also explore, for the first time, fully end-to-end SSGD integration between the SSep and VAD modules. Crucially, this enables fine-tuning on real-world data for which oracle speaker sources are not available. In particular, our best model achieves 8.8% DER on CALLHOME, outperforming the current state-of-the-art end-to-end neural diarization model despite being trained on an order of magnitude less data and having significantly lower latency, i.e., 0.1 vs. 1 seconds. Finally, we show that the separated signals can also be readily used for automatic speech recognition, reaching performance close to using oracle sources in some configurations.
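The separate-then-VAD pipeline at the heart of SSGD can be sketched with a toy energy-based VAD over already-separated streams (the frame size and threshold below are hypothetical; real systems use learned, causal VAD modules):

```python
import numpy as np

def ssgd_diarize(separated, frame=160, threshold=0.01):
    """SSGD second stage sketch: energy-based VAD on each separated stream.
    Returns per-speaker boolean activity per frame."""
    activity = []
    for stream in separated:
        n = len(stream) // frame
        frames = stream[: n * frame].reshape(n, frame)
        energy = (frames ** 2).mean(axis=1)
        activity.append(energy > threshold)
    return np.array(activity)

# toy input: speaker 1 talks first, speaker 2 talks second
rng = np.random.default_rng(0)
spk1 = np.concatenate([rng.normal(0, 0.3, 1600), np.zeros(1600)])
spk2 = np.concatenate([np.zeros(1600), rng.normal(0, 0.3, 1600)])
act = ssgd_diarize([spk1, spk2])
print(act.astype(int))
```

Diarization output follows directly: each row of `act` is one speaker's activity track, and overlapped speech simply shows up as multiple rows active in the same frame.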
Personalized TTS is an exciting and highly desired application that allows users to train their TTS voice using only a few recordings. However, TTS training typically requires many hours of recording and a large model, making it unsuitable for deployment on mobile devices. To overcome this limitation, related works typically require fine-tuning a pre-trained TTS model to preserve its ability to generate high-quality audio samples while adapting to the target speaker's voice. This process is commonly referred to as ``voice cloning.'' Although related works have achieved significant success in changing the TTS model's voice, they are still required to fine-tune from a large pre-trained model, resulting in a significant size for the voice-cloned model. In this paper, we propose applying trainable structured pruning to voice cloning. By training the structured pruning masks with voice-cloning data, we can produce a unique pruned model for each target speaker. Our experiments demonstrate that using learnable structured pruning, we can compress the model size to 7 times smaller while achieving comparable voice-cloning performance.
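As a simplified illustration of structured pruning, the sketch below drops whole output channels of a weight matrix by L1-norm magnitude. Note the paper learns its pruning masks from voice-cloning data, whereas this uses a plain magnitude heuristic with made-up sizes:

```python
import numpy as np

def prune_channels(weight, keep_ratio):
    """Structured pruning sketch: rank output channels of a (out, in) weight
    matrix by L1 norm and keep only the strongest fraction."""
    norms = np.abs(weight).sum(axis=1)
    k = max(1, int(len(norms) * keep_ratio))
    keep = np.sort(np.argsort(norms)[::-1][:k])  # indices of surviving channels
    return weight[keep], keep

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128))
w_pruned, kept = prune_channels(w, keep_ratio=0.25)
print(w.size, w_pruned.size)  # 8192 -> 2048
```

Because entire rows are removed rather than individual weights, the pruned matrix is genuinely smaller and faster, which is what makes structured (as opposed to unstructured) pruning attractive for on-device deployment.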
The remarkable success of transformers in the field of natural language processing has sparked the interest of the speech-processing community, leading to an exploration of their potential for modeling long-range dependencies within speech sequences. Recently, transformers have gained prominence across various speech-related domains, including automatic speech recognition, speech synthesis, speech translation, speech para-linguistics, speech enhancement, spoken dialogue systems, and numerous multimodal applications. In this paper, we present a comprehensive survey that aims to bridge research studies from diverse subfields within speech technology. By consolidating findings from across the speech technology landscape, we provide a valuable resource for researchers interested in harnessing the power of transformers to advance the field. We identify the challenges encountered by transformers in speech processing while also offering insights into potential solutions to address these issues.
The Deep Speech Enhancement Challenge is the 5th edition of the deep noise suppression (DNS) challenges, organized at the ICASSP 2023 Signal Processing Grand Challenges. DNS challenges were organized during 2019-2023 to stimulate research in deep speech enhancement (DSE). Previous DNS challenges were organized at INTERSPEECH 2020, ICASSP 2021, INTERSPEECH 2021, and ICASSP 2022. From prior editions, we learned that improving signal quality (SIG) is challenging, particularly in the presence of simultaneously active interfering talkers and noise. This challenge aims to develop models for joint denoising, dereverberation, and suppression of interfering talkers. When the primary talker wears a headphone, certain acoustic properties of their speech, such as the direct-to-reverberation ratio (DRR) and signal-to-noise ratio (SNR), make it possible to suppress neighboring talkers even without enrollment data for the primary talker. This motivated us to create two tracks for this challenge: (i) Track-1 Headset; (ii) Track-2 Speakerphone. Both tracks have fullband (48 kHz) training data and test sets, and each test clip has corresponding enrollment data (10-30 s duration) for the primary talker. Each track invited submissions of personalized and non-personalized models, all of which were evaluated through the same subjective evaluation. Most models submitted to the challenge were personalized; the same team won both tracks, with the best models improving the challenge's score by 0.145 and 0.141 compared to the noisy blind test set.
Self-supervised learning leverages unlabeled data effectively, improving label efficiency and generalization to domains without labeled data. While recent work has studied generalization to more acoustic/linguistic domains, languages, and modalities, these investigations are limited to single-source speech with one primary speaker in the recording. This paper presents Cocktail HuBERT, a self-supervised learning framework that generalizes to mixture speech using a masked pseudo source separation objective. This objective encourages the model to identify the number of sources, separate and understand the context, and infer the content of masked regions represented as discovered units. Cocktail HuBERT outperforms state-of-the-art results with 69% lower WER on multi-speaker ASR, 31% lower DER on diarization, and is competitive on single- and multi-speaker tasks from SUPERB.
Speech-driven 3D face animation aims to generate realistic facial expressions that match the speech content and emotion. However, existing methods often neglect emotional facial expressions or fail to disentangle them from speech content. To address this issue, this paper proposes an end-to-end neural network to disentangle different emotions in speech so as to generate rich 3D facial expressions. Specifically, we introduce the emotion disentangling encoder (EDE) to disentangle the emotion and content in the speech by cross-reconstructed speech signals with different emotion labels. Then an emotion-guided feature fusion decoder is employed to generate a 3D talking face with enhanced emotion. The decoder is driven by the disentangled identity, emotional, and content embeddings so as to generate controllable personal and emotional styles. Finally, considering the scarcity of the 3D emotional talking face data, we resort to the supervision of facial blendshapes, which enables the reconstruction of plausible 3D faces from 2D emotional data, and contribute a large-scale 3D emotional talking face dataset (3D-ETF) to train the network. Our experiments and user studies demonstrate that our approach outperforms state-of-the-art methods and exhibits more diverse facial movements. We recommend watching the supplementary video: this https URL
The time-delay neural network (TDNN) is one of the state-of-the-art models for text-independent speaker verification. However, it is difficult for a conventional TDNN to capture global context, which has been proven critical for robust speaker representations and long-duration speaker verification in many recent works. Besides, the common solutions, e.g., self-attention, have quadratic complexity in the number of input tokens, which makes them computationally unaffordable when applied to the large feature maps in a TDNN. To address these issues, we propose the Global Filter for TDNN, which applies log-linear-complexity FFT/IFFT and a set of differentiable frequency-domain filters to efficiently model the long-term dependencies in speech. Besides, a dynamic filtering strategy and a sparse regularization method are specially designed to enhance the performance of the global filter and prevent it from overfitting. Furthermore, we construct a dual-stream TDNN (DS-TDNN), which splits the basic channels for complexity reduction and employs the global filter to increase recognition performance. Experiments on the VoxCeleb and SITW databases show that the DS-TDNN achieves an approximately 10% improvement while reducing complexity and parameter count by over 28% and 15%, respectively, compared with the ECAPA-TDNN. Besides, it has the best trade-off between efficiency and effectiveness compared with other popular baseline systems when facing long-duration speech. Finally, visualizations and a detailed ablation study further reveal the advantages of the DS-TDNN.
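The core idea of the global filter, frequency-domain filtering at log-linear FFT/IFFT cost instead of quadratic self-attention, can be sketched as follows (NumPy; the shapes and the identity filter are illustrative, and a real model would learn the filter weights):

```python
import numpy as np

def global_filter(x, freq_filter):
    """Frequency-domain filtering in O(n log n): FFT along time, multiply by
    a (learnable, here fixed) complex filter, then inverse FFT."""
    spec = np.fft.rfft(x, axis=-1)
    return np.fft.irfft(spec * freq_filter, n=x.shape[-1], axis=-1)

x = np.random.default_rng(0).normal(size=(4, 256))  # (channels, time frames)
identity = np.ones(129)                             # rfft of length 256 -> 129 bins
y = global_filter(x, identity)
print(np.allclose(x, y))  # True: an all-ones filter passes the input through
```

Because the elementwise multiply in the frequency domain corresponds to a circular convolution over the whole sequence, every output frame depends on every input frame, giving the global receptive field that self-attention provides, at far lower cost.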
Code-switching speech refers to a means of expression that mixes two or more languages within a single utterance. Automatic Speech Recognition (ASR) with End-to-End (E2E) modeling for such speech can be a challenging task due to the lack of data. In this study, we investigate text generation and injection for improving the performance of a streaming model commonly used in industry, the Transformer-Transducer (T-T), in Mandarin-English code-switching speech recognition. We first propose a strategy to generate code-switching text data and then investigate injecting the generated text into the T-T model, either explicitly via Text-To-Speech (TTS) conversion or implicitly by tying the speech and text latent spaces. Experimental results on a T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to injecting generated code-switching text significantly boost the performance of T-T models, i.e., a 16% relative Token-based Error Rate (TER) reduction averaged over three evaluation sets, and the approach of tying speech and text latent spaces is superior to TTS conversion on the evaluation set that contains data more homogeneous with the training set.
End-to-end speech recognition models are improved by incorporating external text sources, typically by fusion with an external language model. Such language models have to be retrained whenever the corpus of interest changes. Furthermore, since they store the entire corpus in their parameters, rare words can be challenging to recall. In this work, we propose augmenting a transducer-based ASR model with a retrieval language model, which directly retrieves from an external text corpus plausible completions for a partial ASR hypothesis. These completions are then integrated into subsequent predictions by an adapter, which is trained once, so that the corpus of interest can be switched without incurring the computational overhead of retraining. Our experiments show that the proposed model significantly improves the performance of a transducer baseline on a pair of question-answering datasets. Further, it outperforms shallow fusion on recognition of named entities by about 7% relative; when the two are combined, the relative improvement increases to 13%.
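A toy version of the retrieval step, finding plausible completions for a partial hypothesis in an external corpus, might look like the sketch below. This is simple prefix matching over a made-up corpus; the actual system's retrieval and adapter integration are more sophisticated:

```python
def retrieve_completions(partial_hypothesis, corpus, k=2):
    """Toy retrieval step: find corpus sentences extending a partial ASR
    hypothesis and return their continuations as candidate completions."""
    prefix = partial_hypothesis.lower()
    hits = [s for s in corpus if s.lower().startswith(prefix)]
    return [s[len(partial_hypothesis):].strip() for s in hits[:k]]

corpus = [
    "the eiffel tower is in paris",
    "the eiffel tower was completed in 1889",
    "the louvre is in paris",
]
print(retrieve_completions("the eiffel tower", corpus))
# ['is in paris', 'was completed in 1889']
```

Because completions come straight from the corpus rather than from model parameters, swapping the corpus changes the available continuations immediately, which is the retraining-free property the abstract highlights.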
This paper proposes a methodology for discovering meaningful properties in data by exploring the latent space of unsupervised deep generative models. We combine manipulation of individual latent variables to extreme values outside the training range with methods inspired by causal inference into an approach we call causal disentanglement with extreme values (CDEV) and show that this approach yields insights for model interpretability. Using this technique, we can infer what properties of unknown data the model encodes as meaningful. We apply the methodology to test what is meaningful in the communication system of sperm whales, one of the most intriguing and understudied animal communication systems. We train a network that has been shown to learn meaningful representations of speech and test whether we can leverage such unsupervised learning to decipher the properties of another vocal communication system for which we have no ground truth. The proposed technique suggests that sperm whales encode information using the number of clicks in a sequence, the regularity of their timing, and audio properties such as the spectral mean and the acoustic regularity of the sequences. Some of these findings are consistent with existing hypotheses, while others are proposed for the first time. We also argue that our models uncover rules that govern the structure of communication units in the sperm whale communication system and apply them while generating innovative data not shown during training. This paper suggests that an interpretation of the outputs of deep neural networks with causal methodology can be a viable strategy for approaching data about which little is known and presents another case of how deep learning can limit the hypothesis space. Finally, the proposed approach combining latent space manipulation and causal inference can be extended to other architectures and arbitrary datasets.
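The latent-probing half of CDEV, sweeping one latent dimension to values outside the training range and observing the decoder's output, can be sketched with a toy linear decoder (everything below is hypothetical; the paper's models are deep generative networks trained on whale vocalizations):

```python
import numpy as np

def probe_latent(decoder, z, dim, values):
    """Sweep one latent dimension across values (including values far outside
    the training range) and record the decoder outputs for comparison."""
    outputs = []
    for v in values:
        z_probe = z.copy()
        z_probe[dim] = v
        outputs.append(decoder(z_probe))
    return outputs

# toy linear "decoder": latent dim 1 drives the second output coordinate with gain 10
decoder = lambda z: z @ np.array([[1.0, 0.0], [0.0, 10.0]])
z = np.zeros(2)
outs = probe_latent(decoder, z, dim=1, values=[-5.0, 0.0, 5.0])
print(outs)
```

If an output property changes systematically as one latent is pushed to extreme values while others stay fixed, CDEV treats that as evidence the model encodes the property as meaningful; the causal-inference part of the method then formalizes that comparison.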