Recent advances in generative AI have made the creation of speech deepfakes widely accessible, posing serious challenges to digital trust. To counter this, various speech deepfake detection strategies have been proposed, including Person-of-Interest (POI) approaches, which focus on identifying impersonations of specific individuals by modeling and analyzing their unique vocal traits. Despite their excellent performance, existing methods offer limited granularity and lack interpretability. In this work, we propose a POI-based speech deepfake detection method that operates at the phoneme level. Our approach decomposes reference audio into phonemes to construct a detailed speaker profile. At inference, phonemes from a test sample are individually compared against this profile, enabling fine-grained detection of synthetic artifacts. The proposed method achieves accuracy comparable to traditional approaches while offering superior robustness and interpretability, key aspects in multimedia forensics. By focusing on phoneme analysis, this work explores a novel direction for explainable, speaker-centric deepfake detection.
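A minimal sketch of the phoneme-level matching idea, assuming a phoneme segmenter and an embedding extractor are already available; the mean-embedding profile and cosine scoring below are illustrative choices, not the paper's exact implementation:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def build_profile(reference_segments):
    """reference_segments: dict phoneme -> list of embedding vectors from the POI's reference audio.
    Returns a per-phoneme profile (here simply the mean embedding)."""
    return {ph: np.mean(np.stack(vecs), axis=0) for ph, vecs in reference_segments.items()}

def score_test_sample(test_segments, profile):
    """test_segments: list of (phoneme_label, embedding) extracted from the test utterance.
    Returns per-phoneme similarities and an utterance-level score."""
    per_phoneme = []
    for ph, emb in test_segments:
        if ph in profile:                      # only phonemes covered by the reference profile
            per_phoneme.append((ph, cosine(emb, profile[ph])))
    overall = np.mean([s for _, s in per_phoneme]) if per_phoneme else 0.0
    return per_phoneme, overall                # low per-phoneme scores flag localized artifacts
```

The per-phoneme scores are what make the decision interpretable: an utterance can be flagged because specific phonemes deviate from the speaker profile rather than on a single opaque score.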
https://arxiv.org/abs/2507.08626
Environmental sound recordings often contain intelligible speech, raising privacy concerns that limit analysis, sharing and reuse of data. In this paper, we introduce a method that renders speech unintelligible while preserving both the integrity of the acoustic scene and the overall audio quality. Our approach involves reversing waveform segments to distort speech content. This process is enhanced through a voice activity detection and speech separation pipeline, which allows for more precise targeting of speech. In order to demonstrate the effectiveness of the proposed approach, we consider a three-part evaluation protocol that assesses: 1) speech intelligibility using Word Error Rate (WER), 2) sound source detectability using Sound source Classification Accuracy-Drop (SCAD) from a widely used pre-trained model, and 3) audio quality using the Fréchet Audio Distance (FAD), computed with our reference dataset that contains unaltered speech. Experiments on this simulated evaluation dataset, which consists of linear mixtures of speech and environmental sound scenes, show that our method achieves satisfactory speech intelligibility reduction (97.9% WER), minimal degradation of sound source detectability (2.7% SCAD), and high perceptual quality (FAD of 1.40). An ablation study further highlights the contribution of each component of the pipeline. We also show that incorporating random splicing into our speech content privacy enforcement method can enhance the algorithm's robustness to attempts to recover the clean speech, at a slight cost in audio quality.
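A rough illustration of the core segment-reversal step, assuming the speech regions have already been located by a VAD or a speech-separation front-end (the paper's full pipeline adds separation and, optionally, random splicing):

```python
import numpy as np

def reverse_speech_segments(waveform, vad_segments, sr):
    """Render speech unintelligible by time-reversing the regions flagged as speech.

    waveform: 1-D float array containing the full acoustic scene.
    vad_segments: list of (start_sec, end_sec) tuples from a VAD or separated speech stem.
    """
    out = waveform.copy()
    for start, end in vad_segments:
        i, j = int(start * sr), int(end * sr)
        out[i:j] = out[i:j][::-1]      # reversal distorts content but keeps energy and spectrum
    return out

# Toy usage: 3 s of background noise with "speech" assumed between 1.0 s and 2.0 s
sr = 16000
x = np.random.randn(3 * sr).astype(np.float32) * 0.01
y = reverse_speech_segments(x, [(1.0, 2.0)], sr)
```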
https://arxiv.org/abs/2507.08412
The construction of high-quality datasets is a cornerstone of modern text-to-speech (TTS) systems. However, the increasing scale of available data poses significant challenges, including storage constraints. To address these issues, we propose a TTS corpus construction method based on active learning. Unlike traditional feed-forward and model-agnostic corpus construction approaches, our method iteratively alternates between data collection and model training, thereby focusing on acquiring data that is more informative for model improvement. This approach enables the construction of a data-efficient corpus. Experimental results demonstrate that the corpus constructed using our method enables higher-quality speech synthesis than corpora of the same size.
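The abstract describes the method only at a high level; a generic active-learning corpus-construction loop of the kind it implies might look like the following, where `train_tts` and `acquisition_score` are placeholders for the paper's model training and informativeness criterion:

```python
def build_corpus_actively(candidate_pool, init_corpus, train_tts, acquisition_score,
                          rounds=5, batch_size=100):
    """Alternate between model training and data acquisition instead of collecting data up front.

    candidate_pool: list of candidate (text, audio) items not yet in the corpus.
    train_tts: callable(corpus) -> model.
    acquisition_score: callable(model, item) -> float, higher = more informative for the model.
    """
    corpus = list(init_corpus)
    pool = list(candidate_pool)
    for _ in range(rounds):
        model = train_tts(corpus)                                   # (re)train on the current corpus
        pool.sort(key=lambda item: acquisition_score(model, item), reverse=True)
        corpus.extend(pool[:batch_size])                            # keep only the most informative items
        pool = pool[batch_size:]
    return corpus, train_tts(corpus)
```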
https://arxiv.org/abs/2507.08319
The BirdCLEF+ 2025 challenge requires classifying 206 species, including birds, mammals, insects, and amphibians, from soundscape recordings under a strict 90-minute CPU-only inference deadline, making many state-of-the-art deep learning approaches impractical. To address this constraint, the DS@GT BirdCLEF team explored two strategies. First, we establish competitive baselines by optimizing pre-trained models from the Bioacoustics Model Zoo for CPU inference. Using TFLite, we achieved a nearly 10x inference speedup for the Perch model, enabling it to run in approximately 16 minutes and achieve a final ROC-AUC score of 0.729 on the public leaderboard post-competition and 0.711 on the private leaderboard. The best model from the zoo was BirdSetEfficientNetB1, with a public score of 0.810 and a private score of 0.778. Second, we introduce a novel, lightweight pipeline named Spectrogram Token Skip-Gram (STSG) that treats bioacoustics as a sequence modeling task. This method converts audio into discrete "spectrogram tokens" by clustering Mel-spectrograms using Faiss K-means and then learns high-quality contextual embeddings for these tokens in an unsupervised manner with a Word2Vec skip-gram model. For classification, embeddings within a 5-second window are averaged and passed to a linear model. With a projected inference time of 6 minutes for a 700-minute test set, the STSG approach achieved a final ROC-AUC public score of 0.559 and a private score of 0.520, demonstrating the viability of fast tokenization approaches with static embeddings for bioacoustic classification. Supporting code for this paper can be found at this https URL.
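A compact sketch of the STSG pipeline using off-the-shelf substitutes (scikit-learn K-means in place of Faiss, a single-label logistic regression standing in for the linear classifier); shapes and hyperparameters are illustrative, not the team's settings:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from gensim.models import Word2Vec

def train_stsg(mel_frames_per_clip, labels, n_tokens=256, dim=64, frames_per_window=160):
    """mel_frames_per_clip: list of arrays, each (n_frames, n_mels); labels: one class id per clip."""
    all_frames = np.concatenate(mel_frames_per_clip)
    km = KMeans(n_clusters=n_tokens, n_init=4).fit(all_frames)          # Faiss K-means in the paper

    # Discrete "spectrogram tokens": one cluster id per frame, treated as a word
    token_seqs = [[str(t) for t in km.predict(m)] for m in mel_frames_per_clip]
    w2v = Word2Vec(sentences=token_seqs, vector_size=dim, window=5, sg=1, min_count=1)  # skip-gram

    def clip_vector(tokens):
        window = tokens[:frames_per_window]          # number of frames in a 5 s window (hop-dependent)
        return np.mean([w2v.wv[t] for t in window], axis=0)

    X = np.stack([clip_vector(seq) for seq in token_seqs])
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    return km, w2v, clf
```

Because the tokenizer, embeddings, and classifier are all static and lightweight, inference is dominated by the mel-spectrogram and K-means lookup, which is what makes the projected 6-minute CPU budget plausible.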
https://arxiv.org/abs/2507.08236
Automatic speaker verification (ASV) systems are often affected by spoofing attacks. Recent transformer-based models have improved anti-spoofing performance by learning strong feature representations. However, these models usually need high computing power. To address this, we introduce RawTFNet, a lightweight CNN model designed for audio signals. RawTFNet separates feature processing along the time and frequency dimensions, which helps to capture the fine-grained details of synthetic speech. We tested RawTFNet on the ASVspoof 2021 LA and DF evaluation datasets. The results show that RawTFNet reaches performance comparable to that of state-of-the-art models while using fewer computing resources. The code and models will be made publicly available.
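The abstract does not detail RawTFNet's layers; a generic block that separates processing along the time and frequency axes, as described, could be sketched as follows (a hypothetical illustration, not the published architecture):

```python
import torch
import torch.nn as nn

class TimeFreqSeparableBlock(nn.Module):
    """Processes a (batch, channels, freq, time) feature map along time and frequency separately."""
    def __init__(self, channels, kernel=7):
        super().__init__()
        pad = kernel // 2
        self.time_conv = nn.Conv2d(channels, channels, kernel_size=(1, kernel), padding=(0, pad))
        self.freq_conv = nn.Conv2d(channels, channels, kernel_size=(kernel, 1), padding=(pad, 0))
        self.act = nn.ReLU()

    def forward(self, x):
        x = self.act(self.time_conv(x))   # temporal fine structure
        x = self.act(self.freq_conv(x))   # spectral fine structure
        return x

feats = torch.randn(2, 16, 128, 400)      # e.g. 16 channels, 128 frequency bins, 400 frames
out = TimeFreqSeparableBlock(16)(feats)
```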
https://arxiv.org/abs/2507.08227
Room Impulse Responses (RIRs) accurately characterize acoustic properties of indoor environments and play a crucial role in applications such as speech enhancement, speech recognition, and audio rendering in augmented reality (AR) and virtual reality (VR). Existing blind estimation methods struggle to achieve practical accuracy. To overcome this challenge, we propose the dynamic audio-room acoustic synthesis (DARAS) model, a novel deep learning framework that is explicitly designed for blind RIR estimation from monaural reverberant speech signals. First, a dedicated deep audio encoder effectively extracts relevant nonlinear latent space features. Second, the Mamba-based self-supervised blind room parameter estimation (MASS-BRPE) module, utilizing the efficient Mamba state space model (SSM), accurately estimates key room acoustic parameters and features. Third, the system incorporates a hybrid-path cross-attention feature fusion module, enhancing deep integration between audio and room acoustic features. Finally, our proposed dynamic acoustic tuning (DAT) decoder adaptively segments early reflections and late reverberation to improve the realism of synthesized RIRs. Experimental results, including a MUSHRA-based subjective listening study, demonstrate that DARAS substantially outperforms existing baseline models, providing a robust and effective solution for practical blind RIR estimation in real-world acoustic environments.
https://arxiv.org/abs/2507.08135
Deep learning-based machine listening is broadening the scope of industrial acoustic analysis for applications like anomaly detection and predictive maintenance, thereby improving manufacturing efficiency and reliability. Nevertheless, its reliance on large, task-specific annotated datasets for every new task limits widespread implementation on shop floors. While emerging sound foundation models aim to alleviate data dependency, they are too large and computationally expensive, requiring cloud infrastructure or high-end hardware that is impractical for on-site, real-time deployment. We address this gap with LISTEN (Lightweight Industrial Sound-representable Transformer for Edge Notification), a kilobyte-sized industrial sound foundation model. Using knowledge distillation, LISTEN runs in real time on low-cost edge devices. On benchmark downstream tasks, it performs nearly identically to its much larger parent model, even when fine-tuned with minimal datasets and training resources. Beyond the model itself, we demonstrate its real-world utility by integrating LISTEN into a complete machine monitoring framework on an edge device with an Industrial Internet of Things (IIoT) sensor and system, validating its performance and generalization capabilities on a live manufacturing shop floor.
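The distillation objective is not specified in the abstract; the standard logit-distillation recipe that such a student-teacher setup typically uses looks like this:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Generic logit distillation: soft targets from the teacher plus the usual hard-label loss.
    Shown as the common recipe, not necessarily the paper's exact objective."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # T^2 keeps gradient magnitudes comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```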
https://arxiv.org/abs/2507.07879
Recent advances in Automatic Speech Recognition (ASR) have demonstrated remarkable accuracy and robustness in diverse audio applications, such as live transcription and voice command processing. However, deploying these models on resource-constrained edge devices (e.g., IoT devices, wearables) still presents substantial challenges due to strict limits on memory, compute, and power. Quantization, particularly Post-Training Quantization (PTQ), offers an effective way to reduce model size and inference cost without retraining. Despite its importance, the performance implications of various advanced quantization methods and bit-width configurations on ASR models remain unclear. In this work, we present a comprehensive benchmark of eight state-of-the-art (SOTA) PTQ methods applied to two leading edge-ASR model families, Whisper and Moonshine. We systematically evaluate model performance (i.e., accuracy, memory I/O, and bit operations) across seven diverse datasets from the open ASR leaderboard, analyzing the impact of quantization and various configurations on both weights and activations. Built on an extension of the LLM compression toolkit, our framework integrates edge-ASR models, diverse advanced quantization algorithms, a unified calibration and evaluation data pipeline, and detailed analysis tools. Our results characterize the trade-offs between efficiency and accuracy, demonstrating that even 3-bit quantization can succeed on high-capacity models when using advanced PTQ techniques. These findings provide valuable insights for optimizing ASR models on low-power, always-on edge devices.
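The benchmarked PTQ methods are more sophisticated than this, but the basic round-to-nearest weight quantization they improve upon reduces to a simple scale-and-clamp:

```python
import torch

def quantize_weight_symmetric(w: torch.Tensor, n_bits: int = 8):
    """Per-tensor symmetric round-to-nearest PTQ of a weight tensor.
    Returns the integer codes and the scale needed to dequantize."""
    qmax = 2 ** (n_bits - 1) - 1                  # e.g. 127 for int8, 3 for 3-bit
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int32), scale

w = torch.randn(256, 256)
q, scale = quantize_weight_symmetric(w, n_bits=3)
w_hat = q.float() * scale                         # dequantized weights used at inference
print((w - w_hat).abs().mean())                   # quantization error grows as bit-width shrinks
```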
https://arxiv.org/abs/2507.07877
Neural audio codecs and autoencoders have emerged as versatile models for audio compression, transmission, feature-extraction, and latent-space generation. However, a key limitation is that most are trained to maximize reconstruction fidelity, often neglecting the specific latent structure necessary for optimal performance in diverse downstream applications. We propose a simple, post-hoc framework to address this by modifying the bottleneck of a pre-trained autoencoder. Our method introduces a "Re-Bottleneck", an inner bottleneck trained exclusively through latent space losses to instill user-defined structure. We demonstrate the framework's effectiveness in three experiments. First, we enforce an ordering on latent channels without sacrificing reconstruction quality. Second, we align latents with semantic embeddings, analyzing the impact on downstream diffusion modeling. Third, we introduce equivariance, ensuring that a filtering operation on the input waveform directly corresponds to a specific transformation in the latent space. Ultimately, our Re-Bottleneck framework offers a flexible and efficient way to tailor representations of neural audio models, enabling them to seamlessly meet the varied demands of different applications with minimal additional training.
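As an illustration only (not the paper's architecture), an inner bottleneck trained purely with latent-space losses, here with a nested-dropout-style term standing in for the channel-ordering experiment, might look like:

```python
import torch
import torch.nn as nn

class ReBottleneck(nn.Module):
    """Inner bottleneck wrapped around a frozen autoencoder's latent.
    Only this module is trained; the base codec is never updated."""
    def __init__(self, latent_dim, inner_dim):
        super().__init__()
        self.enc = nn.Linear(latent_dim, inner_dim)
        self.dec = nn.Linear(inner_dim, latent_dim)

    def forward(self, z):
        h = self.enc(z)
        return h, self.dec(h)

def latent_losses(rb, z, keep):
    """Reconstruction of the frozen latent plus an ordering term: reconstructing from only the
    first `keep` inner channels should already work, pushing important information to the front."""
    h, z_hat = rb(z)
    recon = (z_hat - z).pow(2).mean()
    h_trunc = torch.cat([h[..., :keep], torch.zeros_like(h[..., keep:])], dim=-1)
    order = (rb.dec(h_trunc) - z).pow(2).mean()
    return recon + order
```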
https://arxiv.org/abs/2507.07867
Emotion and intent recognition from speech is essential and has been widely investigated in human-computer interaction. The rapid development of social media platforms, chatbots, and other technologies has led to a large volume of speech data streaming from users. Nevertheless, annotating such data manually is expensive, making it challenging to train machine learning models for recognition purposes. To this end, we propose applying semi-supervised learning to incorporate a large amount of unlabelled data alongside a relatively small set of labelled data. We train end-to-end acoustic and linguistic models, each employing multi-task learning for emotion and intent recognition. Two semi-supervised learning approaches, fix-match learning and full-match learning, are compared. The experimental results demonstrate that the semi-supervised learning approaches improve model performance in speech emotion and intent recognition from both acoustic and text data. The late fusion of the best models outperforms the acoustic and text baselines by joint recognition balance metrics of 12.3% and 10.4%, respectively.
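The abstract names fix-match learning; a FixMatch-style confidence-thresholded pseudo-labelling step, written generically, is sketched below (the full-match variant and the multi-task heads are omitted):

```python
import torch
import torch.nn.functional as F

def fixmatch_step(model, x_lab, y_lab, x_unlab_weak, x_unlab_strong, tau=0.95, lam=1.0):
    """One FixMatch-style training step: supervised loss on labelled data plus a pseudo-label
    loss on unlabelled data, applied only where the weak-augmentation prediction is confident."""
    sup = F.cross_entropy(model(x_lab), y_lab)

    with torch.no_grad():
        probs = F.softmax(model(x_unlab_weak), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        mask = (conf >= tau).float()              # keep only confident pseudo-labels

    unsup = (F.cross_entropy(model(x_unlab_strong), pseudo, reduction="none") * mask).mean()
    return sup + lam * unsup
```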
https://arxiv.org/abs/2507.07806
Given the increasing privacy concerns from identity theft and the re-identification of speakers through content in the speech field, this paper proposes a prompt-based speech generation pipeline that ensures dual anonymization of both speaker identity and spoken content. This is achieved by 1) generating a speaker identity unlinkable to the source speaker, controlled by descriptors, and 2) replacing sensitive content within the original text using a named entity recognition model and a large language model. The pipeline utilizes the anonymized speaker identity and text to generate high-fidelity, privacy-friendly speech via a text-to-speech synthesis model. Experimental results demonstrate significant privacy protection while maintaining a decent level of content retention and audio quality. This paper also investigates the impact of varying speaker descriptions on the utility and privacy of generated speech to determine potential biases.
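A toy version of the content-anonymization step, with the NER output passed in as character spans and a rule-based substitution standing in for the large language model (names and labels below are illustrative):

```python
def anonymize_text(text, entities, replace_entity):
    """Replace sensitive spans found by an NER model before speech synthesis.

    entities: list of (start, end, label) character spans from any NER model.
    replace_entity: callable(surface_form, label) -> replacement string; in the paper this role
    is played by a large language model, here it can be any substitution function.
    """
    out, last = [], 0
    for start, end, label in sorted(entities):
        out.append(text[last:start])
        out.append(replace_entity(text[start:end], label))
        last = end
    out.append(text[last:])
    return "".join(out)

# Toy usage with a rule-based stand-in for the LLM
generic = {"PERSON": "a colleague", "GPE": "another city", "ORG": "a company"}
print(anonymize_text(
    "Alice met Bob at Google in Zurich.",
    [(0, 5, "PERSON"), (10, 13, "PERSON"), (17, 23, "ORG"), (27, 33, "GPE")],
    lambda s, lab: generic.get(lab, "something"),
))
```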
https://arxiv.org/abs/2507.07799
Psychoacoustical so-called "timbre spaces" map perceptual similarity ratings of instrument sounds onto low-dimensional embeddings via multidimensional scaling, but suffer from scalability issues and are incapable of generalization. Recent results from audio (music and speech) quality assessment as well as image similarity have shown that deep learning is able to produce embeddings that align well with human perception while being largely free from these constraints. Although the existing human-rated timbre similarity data is not large enough to train deep neural networks (2,614 pairwise ratings on 334 audio samples), it can serve as test-only data for audio models. In this paper, we introduce metrics to assess the alignment of diverse audio representations with human judgments of timbre similarity by comparing both the absolute values and the rankings of embedding distances to human similarity ratings. Our evaluation involves three signal-processing-based representations, twelve representations extracted from pre-trained models, and three representations extracted from a novel sound matching model. Among them, the style embeddings inspired by image style transfer, extracted from the CLAP model and the sound matching model, remarkably outperform the others, showing their potential in modeling timbre similarity.
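The alignment metrics can be approximated with standard correlation statistics: Pearson correlation for absolute agreement between embedding distances and ratings, Spearman for agreement in ranking (the paper's exact formulation may differ):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from scipy.spatial.distance import cosine

def timbre_alignment(embeddings, pairs, human_similarity):
    """embeddings: dict sound_id -> vector; pairs: list of (id_a, id_b);
    human_similarity: array of human ratings for those pairs (higher = more similar).

    Distances should correlate negatively with similarity, so the sign is flipped."""
    dists = np.array([cosine(embeddings[a], embeddings[b]) for a, b in pairs])
    pear, _ = pearsonr(-dists, human_similarity)     # absolute-value agreement
    spear, _ = spearmanr(-dists, human_similarity)   # rank agreement
    return pear, spear
```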
https://arxiv.org/abs/2507.07764
In this work, we address the voice conversion (VC) task using a vector-based interface. To align audio embeddings between speakers, we employ discrete optimal transport mapping. Our evaluation results demonstrate the high quality and effectiveness of this method. Additionally, we show that applying discrete optimal transport as a post-processing step in audio generation can lead to the incorrect classification of synthetic audio as real.
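With equal-size embedding sets and uniform weights, discrete optimal transport reduces to an optimal assignment; a minimal version of such a mapping (an illustration under those assumptions, not the paper's code) is:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def ot_map(source_embs, target_embs):
    """Discrete optimal transport between two equal-size sets of audio embeddings with uniform
    weights; each source vector is replaced by the target vector it is matched with."""
    cost = cdist(source_embs, target_embs, metric="sqeuclidean")
    rows, cols = linear_sum_assignment(cost)          # minimizes total transport cost
    mapped = np.empty_like(source_embs)
    mapped[rows] = target_embs[cols]
    return mapped

src = np.random.randn(128, 256)                       # source-speaker embeddings
tgt = np.random.randn(128, 256)                       # target-speaker embeddings
converted = ot_map(src, tgt)
```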
https://arxiv.org/abs/2505.04382
Single-channel speech enhancement is utilized in various tasks to mitigate the effect of interfering signals. Conventionally, speech enhancement has had to be tuned for each task to ensure optimal performance. Thus, generalizing speech enhancement models to unknown downstream tasks has been challenging. This study aims to construct a generic speech enhancement front-end that can improve the performance of back-ends on multiple downstream tasks. To this end, we propose a novel training criterion that minimizes the distance between the enhanced signal and the ground-truth clean signal in the feature representation domain of self-supervised learning models. Since self-supervised learning feature representations effectively express high-level speech information useful for solving various downstream tasks, the proposed criterion is expected to make speech enhancement models preserve such information. Experimental validation demonstrates that the proposal improves the performance of multiple speech tasks while maintaining the perceptual quality of the enhanced signal.
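The proposed criterion can be written in a few lines once a frozen SSL feature extractor is available; `ssl_extract` and the L1 distance below are placeholders, since the abstract does not fix the SSL model or the distance measure:

```python
import torch
import torch.nn.functional as F

def ssl_feature_loss(enhanced, clean, ssl_extract):
    """Distance between enhanced and clean signals in an SSL model's feature space.
    ssl_extract: any frozen feature extractor mapping a waveform batch to (batch, frames, dim)."""
    with torch.no_grad():
        target = ssl_extract(clean)          # clean-signal features are a fixed target
    est = ssl_extract(enhanced)              # gradients flow back through the enhanced signal
    return F.l1_loss(est, target)
```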
https://arxiv.org/abs/2507.07631
Decoding speech from brain signals is a challenging research problem. Although existing technologies have made progress in reconstructing the mel spectrograms of auditory stimuli at the word or letter level, there remain core challenges in the precise reconstruction of minute-level continuous imagined speech: traditional models struggle to balance the efficiency of temporal dependency modeling and information retention in long-sequence decoding. To address this issue, this paper proposes the Dynamic Multiscale Fusion Network (DMF2Mel), which consists of four core components: the Dynamic Contrastive Feature Aggregation Module (DC-FAM), the Hierarchical Attention-Guided Multi-Scale Network (HAMS-Net), the SplineMap attention mechanism, and the bidirectional state space module (convMamba). Specifically, the DC-FAM separates speech-related "foreground features" from noisy "background features" through local convolution and global attention mechanisms, effectively suppressing interference and enhancing the representation of transient signals. HAMS-Net, based on the U-Net framework, achieves cross-scale fusion of high-level semantics and low-level details. The SplineMap attention mechanism integrates the Adaptive Gated Kolmogorov-Arnold Network (AGKAN) to combine global context modeling with spline-based local fitting. The convMamba captures long-range temporal dependencies with linear complexity and enhances nonlinear dynamic modeling capabilities. Results on the SparrKULee dataset show that DMF2Mel achieves a Pearson correlation coefficient of 0.074 in mel spectrogram reconstruction for known subjects (a 48% improvement over the baseline) and 0.048 for unknown subjects (a 35% improvement over the baseline). Code is available at: this https URL.
https://arxiv.org/abs/2507.07526
Room impulse response (RIR) estimation is essential for tasks like speech dereverberation, which improves automatic speech recognition. Most existing methods rely on either statistical signal processing or deep neural networks designed to replicate signal processing principles. However, combining statistical and physical modeling for RIR estimation remains largely unexplored. This paper proposes a novel approach integrating both aspects through a theoretically grounded model. The RIR is decomposed into interpretable parameters: white Gaussian noise filtered by a frequency-dependent exponential decay (e.g., modeling wall absorption) and an autoregressive filter (e.g., modeling microphone response). A variational free-energy cost function enables practical parameter estimation. As a proof of concept, we show that, given dry and reverberant speech signals, the proposed method outperforms classical deconvolution in noisy environments, as validated by objective metrics.
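A toy generative version of the described RIR model, with illustrative decay and AR parameters: subband noise with frequency-dependent exponential decay for the absorption term, followed by an all-pole (autoregressive) filter:

```python
import numpy as np
from scipy.signal import stft, istft, lfilter

def synthesize_rir(sr=16000, length_s=0.5, rt60_by_band=(0.6, 0.4, 0.25), ar_coeffs=(1.0, -0.3)):
    """Sample an RIR in the spirit of the paper's model: white Gaussian noise whose subbands
    decay exponentially at frequency-dependent rates (wall absorption), then an autoregressive
    filter (microphone response). Parameter values here are illustrative only."""
    n = int(sr * length_s)
    noise = np.random.randn(n)

    f, t, Z = stft(noise, fs=sr, nperseg=256)
    bands = np.array_split(np.arange(len(f)), len(rt60_by_band))
    for band, rt60 in zip(bands, rt60_by_band):
        decay = np.exp(-6.9078 * t / rt60)        # ln(10^3) ~= 6.9078 gives -60 dB at t = RT60
        Z[band, :] *= decay[None, :]
    _, h = istft(Z, fs=sr, nperseg=256)

    return lfilter([1.0], list(ar_coeffs), h)     # AR part: all-pole filtering

rir = synthesize_rir()
```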
https://arxiv.org/abs/2507.08051
Spiking Neural Networks (SNNs), inspired by biological neural mechanisms, represent a promising neuromorphic computing paradigm that offers energy-efficient alternatives to traditional Artificial Neural Networks (ANNs). Despite proven effectiveness, SNN architectures have struggled to achieve competitive performance on large-scale speech processing tasks. Two key challenges hinder progress: (1) the high computational overhead during training caused by multi-timestep spike firing, and (2) the absence of large-scale SNN architectures tailored to speech processing tasks. To overcome these issues, we introduce the Input-aware Multi-Level Spikeformer, i.e., IML-Spikeformer, a spiking Transformer architecture specifically designed for large-scale speech processing. Central to our design is the Input-aware Multi-Level Spike (IMLS) mechanism, which simulates multi-timestep spike firing within a single timestep using an adaptive, input-aware thresholding scheme. IML-Spikeformer further integrates a Reparameterized Spiking Self-Attention (RepSSA) module with a Hierarchical Decay Mask (HDM), forming the HD-RepSSA module. This module enhances the precision of attention maps and enables modeling of multi-scale temporal dependencies in speech signals. Experiments demonstrate that IML-Spikeformer achieves word error rates of 6.0% on AiShell-1 and 3.4% on Librispeech-960, comparable to conventional ANN transformers while reducing theoretical inference energy consumption by 4.64× and 4.32×, respectively. IML-Spikeformer marks an advance in scalable SNN architectures for large-scale speech processing in both task performance and energy efficiency.
https://arxiv.org/abs/2507.07396
Audio-visual sound source localization (AV-SSL) identifies the position of a sound source by exploiting the complementary strengths of auditory and visual signals. However, existing AV-SSL methods encounter three major challenges: 1) inability to selectively isolate the target sound source in multi-source scenarios, 2) misalignment between semantic visual features and spatial acoustic features, and 3) overreliance on paired audio-visual data. To overcome these limitations, we introduce Cross-Instance Audio-Visual Localization (CI-AVL), a novel task that leverages images from different instances of the same sound event category to localize target sound sources, thereby reducing dependence on paired data while enhancing generalization capabilities. Our proposed VP-SelDoA tackles this challenging task through a semantic-level modality fusion and employs a Frequency-Temporal ConMamba architecture to generate target-selective masks for sound isolation. We further develop a Semantic-Spatial Matching mechanism that aligns the heterogeneous semantic and spatial features via integrated cross- and self-attention mechanisms. To facilitate the CI-AVL research, we construct a large-scale dataset named VGG-SSL, comprising 13,981 spatial audio clips across 296 sound event categories. Extensive experiments show that our proposed method outperforms state-of-the-art audio-visual localization methods, achieving a mean absolute error (MAE) of 12.04 and an accuracy (ACC) of 78.23%.
https://arxiv.org/abs/2507.07384
Integration of information from non-auditory cues can significantly improve the performance of speech-separation models. Such models often use deep modality-specific networks to obtain unimodal features, and risk being either too costly or, if lightweight, lacking capacity. In this work, we present an iterative representation refinement approach called Bottleneck Iterative Network (BIN), a technique that repeatedly passes through a lightweight fusion block while bottlenecking the fusion representations with fusion tokens. This improves the capacity of the model while avoiding a major increase in model size, balancing model performance against training cost. We test BIN on challenging noisy audio-visual speech separation tasks, and show that our approach consistently outperforms state-of-the-art benchmark models with respect to SI-SDRi on the NTCD-TIMIT and LRS3+WHAM! datasets, while achieving a reduction of more than 50% in training and GPU inference time across nearly all settings.
https://arxiv.org/abs/2507.07270
Nowadays, speech emotion recognition (SER) plays a vital role in the field of human-computer interaction (HCI) and the evolution of artificial intelligence (AI). Our proposed DCRF-BiLSTM model is used to recognize seven emotions: neutral, happy, sad, angry, fear, disgust, and surprise, and is trained on five datasets: RAVDESS (R), TESS (T), SAVEE (S), EmoDB (E), and Crema-D (C). The model achieves high accuracy on individual datasets, including 97.83% on RAVDESS, 97.02% on SAVEE, 95.10% on Crema-D, and a perfect 100% on both TESS and EmoDB. For the combined (R+T+S) datasets, it achieves 98.82% accuracy, outperforming previously reported results. To our knowledge, no existing study has evaluated a single SER model across all five benchmark datasets (i.e., R+T+S+C+E) simultaneously. In our work, we introduce this comprehensive combination and achieve a remarkable overall accuracy of 93.76%. These results confirm the robustness and generalizability of our DCRF-BiLSTM framework across diverse datasets.
https://arxiv.org/abs/2507.07046