Deep learning models define the state-of-the-art in Automatic Drum Transcription (ADT), yet their performance is contingent upon large-scale, paired audio-MIDI datasets, which are scarce. Existing workarounds that use synthetic data often introduce a significant domain gap, as they typically rely on low-fidelity SoundFont libraries that lack acoustic diversity. While high-quality one-shot samples offer a better alternative, they are not available in a standardized, large-scale format suitable for training. This paper introduces a new paradigm for ADT that circumvents the need for paired audio-MIDI training data. Our primary contribution is a semi-supervised method to automatically curate a large and diverse corpus of one-shot drum samples from unlabeled audio sources. We then use this corpus to synthesize a high-quality dataset from MIDI files alone, which we use to train a sequence-to-sequence transcription model. We evaluate our model on the ENST and MDB test sets, where it achieves new state-of-the-art results, significantly outperforming both fully supervised methods and previous synthetic-data approaches. The code for reproducing our experiments is publicly available at this https URL
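The synthesis step described above — rendering drum audio by placing one-shot samples at MIDI onset times — can be sketched as below. Everything here (the sample rate, the toy kick/snare waveforms, the event format) is an illustrative assumption, not the paper's actual pipeline.

```python
import math

def render_drums(events, one_shots, sr=16000, length_s=2.0):
    """Mix one-shot drum samples into an output buffer at MIDI-style onsets.

    events: list of (onset_seconds, drum_name, velocity in [0, 1]).
    one_shots: dict mapping drum_name -> list of float samples.
    """
    out = [0.0] * int(sr * length_s)
    for onset, drum, vel in events:
        start = int(onset * sr)
        for i, s in enumerate(one_shots[drum]):
            if start + i < len(out):
                out[start + i] += vel * s
    return out

# Toy stand-ins for curated one-shot samples: a decaying 60 Hz sine ("kick")
# and decaying pseudo-noise ("snare").
one_shots = {
    "kick": [math.exp(-i / 200) * math.sin(2 * math.pi * 60 * i / 16000)
             for i in range(2000)],
    "snare": [math.exp(-i / 100) * ((i * 2654435761 % 1000) / 500 - 1)
              for i in range(1500)],
}
mix = render_drums([(0.0, "kick", 1.0), (0.5, "snare", 0.8), (1.0, "kick", 0.9)],
                   one_shots)
```

Swapping the toy waveforms for real curated one-shots and the event list for notes parsed from a MIDI file gives the kind of paired audio-MIDI training data the abstract describes.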
https://arxiv.org/abs/2601.09520
Signal prediction is widely used in, e.g., economic forecasting, echo cancellation, and data compression, particularly in predictive coding of speech and music. Predictive coding algorithms reduce the bit-rate required for data transmission or storage by signal prediction. The prediction gain is a classic measure of predictor quality in applied signal coding, as it links the mean-squared prediction error to the signal-to-quantization-noise ratio of predictive coders. To evaluate predictor models, knowledge of the maximum achievable prediction gain, independent of any particular predictor model, is desirable. In this manuscript, Nadaraya-Watson kernel regression (NWKR) and an information-theoretic upper bound are applied to analyze the upper bound of the prediction gain on a newly recorded dataset of sustained speech/phonemes. It was found that for unvoiced speech a linear predictor always achieves the maximum prediction gain to within at most 0.3 dB. On voiced speech, the optimum one-tap predictor was found to be linear, but starting with two taps, the maximum achievable prediction gain was found to be about 2 dB to 6 dB above that of the linear predictor. Significant differences between speakers/subjects were observed. The created dataset as well as the code can be obtained for research purposes upon request.
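The prediction gain itself is straightforward to compute: for a predictor producing error $e = x - \hat{x}$, it is $10\log_{10}(\sigma_x^2/\sigma_e^2)$. A minimal sketch for the optimal one-tap linear predictor on a synthetic AR(1) signal (the signal model and numbers are illustrative, not the paper's dataset):

```python
import math
import random

def one_tap_prediction_gain_db(x):
    """Gain of the least-squares one-tap predictor x_hat[n] = a * x[n-1]:
    10 * log10( var(x) / var(x - x_hat) )."""
    num = sum(x[n] * x[n - 1] for n in range(1, len(x)))
    den = sum(x[n - 1] ** 2 for n in range(1, len(x)))
    a = num / den                                # optimal coefficient
    err = [x[n] - a * x[n - 1] for n in range(1, len(x))]
    var_x = sum(v * v for v in x) / len(x)
    var_e = sum(e * e for e in err) / len(err)
    return 10 * math.log10(var_x / var_e)

# Synthetic AR(1) process x[n] = 0.9 x[n-1] + w[n]: the theoretical one-tap
# gain is 10*log10(1 / (1 - 0.9**2)), roughly 7.2 dB.
random.seed(0)
x, prev = [], 0.0
for _ in range(20000):
    prev = 0.9 * prev + random.gauss(0.0, 1.0)
    x.append(prev)

gain_db = one_tap_prediction_gain_db(x)
print(round(gain_db, 2))
```

The paper's question is what this number could be at best, over all predictors (not just linear ones), which is where the kernel-regression and information-theoretic bounds come in.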
https://arxiv.org/abs/2601.09461
Conventional audio equalization is a static process that requires manual and cumbersome adjustments to adapt to changing listening contexts (e.g., mood, location, or social setting). In this paper, we introduce a Large Language Model (LLM)-based alternative that maps natural language text prompts to equalization settings. This enables a conversational approach to sound system control. By utilizing data collected from a controlled listening experiment, our models exploit in-context learning and parameter-efficient fine-tuning techniques to reliably align with population-preferred equalization settings. Our evaluation methods, which leverage distributional metrics that capture users' varied preferences, show statistically significant improvements in distributional alignment over random sampling and static preset baselines. These results indicate that LLMs could function as "artificial equalizers," contributing to the development of more accessible, context-aware, and expert-level audio tuning methods.
https://arxiv.org/abs/2601.09448
We propose a timbre conversion model based on the Diffusion architecture designed to precisely translate music played by various instruments into piano versions. The model employs a Pitch Encoder and Loudness Encoder to extract pitch and loudness features of the music, which serve as conditional inputs to the Diffusion Model's decoder, generating high-quality piano timbres. Case analysis results show that the model performs excellently in terms of pitch accuracy and timbral similarity, maintaining stable conversion across different musical styles (classical, jazz, pop) and lengths (from short clips to full pieces). Particularly, the model maintains high sound quality and accuracy even when dealing with rapidly changing notes and complex musical structures, demonstrating good generalization capability. Additionally, the model has the potential for real-time musical conversion and is suitable for live performances and digital music creation tools. Future research will focus on enhancing the handling of loudness dynamics and incorporating additional musical features (such as timbral variations and rhythmic complexity) to improve the model's adaptability and expressiveness. We plan to explore the model's application potential in other timbre conversion tasks, such as converting vocals to instrumental sounds or integration with MIDI digital pianos, further expanding the application scope of the Diffusion-based timbre conversion model in the field of music generation.
https://arxiv.org/abs/2601.09333
Speech tokenizers serve as the cornerstone of discrete Speech Large Language Models (Speech LLMs). Existing tokenizers either prioritize semantic encoding, fuse semantic content with acoustic style inseparably, or achieve incomplete semantic-acoustic disentanglement. To achieve better disentanglement, we propose DSA-Tokenizer, which explicitly disentangles speech into discrete semantic and acoustic tokens via distinct optimization constraints. Specifically, semantic tokens are supervised by ASR to capture linguistic content, while acoustic tokens focus on mel-spectrogram restoration to encode style. To eliminate rigid length constraints between the two sequences, we introduce a hierarchical Flow-Matching decoder that further improves speech generation quality and flexibility. We employ a joint reconstruction-recombination training strategy to enforce this separation. DSA-Tokenizer enables high-fidelity reconstruction and flexible recombination through robust disentanglement, facilitating controllable generation in speech LLMs. Our analysis highlights disentangled tokenization as a pivotal paradigm for future speech modeling. Audio samples are available at this https URL. The code and model will be made publicly available after the paper has been accepted.
https://arxiv.org/abs/2601.09239
Generative recommendation systems have achieved significant advances by leveraging semantic IDs to represent items. However, existing approaches that tokenize each modality independently face two critical limitations: (1) redundancy across modalities that reduces efficiency, and (2) failure to capture inter-modal interactions that limits item representation. We introduce FusID, a modality-fused semantic ID framework that addresses these limitations through three key components: (i) multimodal fusion that learns unified representations by jointly encoding information across modalities, (ii) representation learning that brings frequently co-occurring item embeddings closer while maintaining distinctiveness and preventing feature redundancy, and (iii) product quantization that converts the fused continuous embeddings into multiple discrete tokens to mitigate ID conflict. Evaluated on a multimodal next-song recommendation (i.e., playlist continuation) benchmark, FusID achieves zero ID conflicts, ensuring that each token sequence maps to exactly one song, mitigates codebook underutilization, and outperforms baselines in terms of MRR and Recall@k (k = 1, 5, 10, 20).
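Product quantization (component iii) is easy to illustrate: split the fused embedding into subvectors and map each to its nearest codeword, giving one discrete token per subspace. The codebooks below are random toys with made-up dimensions; FusID learns its codebooks from data.

```python
import random

def product_quantize(vec, codebooks):
    """Split `vec` into len(codebooks) equal subvectors; for each, return the
    index of the nearest codeword -> a sequence of discrete tokens."""
    m = len(codebooks)
    d = len(vec) // m
    tokens = []
    for i, book in enumerate(codebooks):
        sub = vec[i * d:(i + 1) * d]
        dists = [sum((a - b) ** 2 for a, b in zip(sub, word)) for word in book]
        tokens.append(min(range(len(book)), key=dists.__getitem__))
    return tokens

random.seed(1)
# 2 subspaces x 4 codewords each -> up to 4 * 4 = 16 distinct token pairs.
codebooks = [[[random.uniform(-1, 1) for _ in range(2)] for _ in range(4)]
             for _ in range(2)]
item_embedding = [0.3, -0.7, 0.9, 0.1]  # toy 4-d fused item embedding
tokens = product_quantize(item_embedding, codebooks)
print(tokens)
```

Because the token combinations multiply across subspaces, even small codebooks yield many distinct ID sequences, which is what lets the full system avoid ID conflicts.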
https://arxiv.org/abs/2601.08764
CAPTCHAs are widely used by websites to block bots and spam by presenting challenges that are easy for humans but difficult for automated programs to solve. To improve accessibility, audio CAPTCHAs are designed to complement visual ones. However, the robustness of audio CAPTCHAs against advanced Large Audio Language Models (LALMs) and Automatic Speech Recognition (ASR) models remains unclear. In this paper, we introduce AI-CAPTCHA, a unified framework that offers (i) an evaluation framework, ACEval, which includes advanced LALM- and ASR-based solvers, and (ii) a novel audio CAPTCHA approach, IllusionAudio, leveraging audio illusions. Through extensive evaluations of seven widely deployed audio CAPTCHAs, we show that most existing methods can be solved with high success rates by advanced LALMs and ASR models, exposing critical security weaknesses. To address these vulnerabilities, IllusionAudio exploits perceptual illusion cues rooted in human auditory mechanisms. Extensive experiments demonstrate that our method defeats all tested LALM- and ASR-based attacks while achieving a 100% human pass rate, significantly outperforming existing audio CAPTCHA methods.
https://arxiv.org/abs/2601.08516
Increasing levels of anthropogenic noise from ships contribute significantly to underwater sound pollution, posing risks to marine ecosystems. This makes monitoring crucial to understand and quantify the impact of the ship radiated noise. Passive Acoustic Monitoring (PAM) systems are widely deployed for this purpose, generating years of underwater recordings across diverse soundscapes. Manual analysis of such large-scale data is impractical, motivating the need for automated approaches based on machine learning. Recent advances in automatic Underwater Acoustic Target Recognition (UATR) have largely relied on supervised learning, which is constrained by the scarcity of labeled data. Transfer Learning (TL) offers a promising alternative to mitigate this limitation. In this work, we conduct the first empirical comparative study of transfer learning for UATR, evaluating multiple pretrained audio models originating from diverse audio domains. The pretrained model weights are frozen, and the resulting embeddings are analyzed through classification, clustering, and similarity-based evaluations. The analysis shows that the geometrical structure of the embedding space is largely dominated by recording-specific characteristics. However, a simple linear probe can effectively suppress this recording-specific information and isolate ship-type features from these embeddings. As a result, linear probing enables effective automatic UATR using pretrained audio models at low computational cost, significantly reducing the need for large amounts of high-quality labeled ship recordings.
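The linear probe in question is just a logistic-regression classifier trained on top of frozen embeddings; a self-contained sketch on synthetic two-dimensional "embeddings" (the data, dimensions, and hyperparameters here are illustrative stand-ins for the pretrained-model features):

```python
import math
import random

def train_linear_probe(X, y, lr=0.5, epochs=200):
    """SGD logistic regression: only the probe weights (w, b) are trained;
    the 'embeddings' X stay frozen."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))
            g = p - yi                      # gradient of the logistic loss
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, xi):
    return 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else 0

# Two toy "ship classes" separated along the first embedding dimension.
random.seed(0)
X = ([[random.gauss(-1, 0.3), random.gauss(0, 1)] for _ in range(50)]
     + [[random.gauss(+1, 0.3), random.gauss(0, 1)] for _ in range(50)])
y = [0] * 50 + [1] * 50
w, b = train_linear_probe(X, y)
acc = sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(X)
print(acc)
```

Because only a weight vector per class is fitted, the probe is cheap to train and, as the paper observes, can project away nuisance directions (here, the second dimension) that dominate the raw embedding geometry.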
https://arxiv.org/abs/2601.08358
The impossibility of a transposable 12-semitone tuning of the octave arises from the mathematical fact that $2 \times 2^{7/12} \neq 3$, i.e., the second harmonic of the fifth cannot exactly match the third harmonic of the fundamental. This, in turn, stems from the whole-number harmonic structure of western music, and the consequent fundamental character of the octave interval as multiples of 2 in frequency, a property inherited by our music system from the physics of instruments whose vibrating elements are, to a good approximation, one-dimensional. In the current era of electronic music, one can relax the above assumptions to construct an analogous music system in which all the structural properties of the standard music system are preserved, but where harmonics are not whole-number multiples of the fundamental frequency, and the octave is no longer a factor of 2 in frequency. This allows one to construct a transposable 12-semitone music system where the second harmonic of the fifth exactly matches the third harmonic of the fundamental. The enhanced harmonic qualities of this system recover, to a good approximation, the musical qualities of Just Intonation, while retaining by construction all the versatility and modulating ability of 12TET.
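The size of the mismatch is easy to quantify: the error per fifth in 12TET is $1200\log_2 3 - 1900 \approx 1.955$ cents, i.e., the Pythagorean comma spread over twelve fifths.

```python
import math

tet_fifth = 2 ** (7 / 12)          # the 12TET fifth, ~1.4983
print(2 * tet_fifth)               # ~2.9966: the fifth's 2nd harmonic misses 3

# Error per fifth in cents: 1200 * log2(3 / (2 * 2**(7/12)))
error_cents = 1200 * math.log2(3 / (2 * tet_fifth))
print(round(error_cents, 3))       # ~1.955 cents, the Pythagorean comma
                                   # (~23.46 cents) divided over 12 fifths
```

Just under 2 cents is below most listeners' discrimination threshold for melodic intervals, which is why 12TET works at all; the paper's point is that relaxing the octave-as-factor-of-2 assumption removes even this residual error.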
https://arxiv.org/abs/2601.08074
In this work, we present a novel perspective on cognitive impairment classification from speech by integrating speech foundation models that explicitly recognize speech dialects. Our motivation is based on the observation that individuals with Alzheimer's Disease (AD) or mild cognitive impairment (MCI) often produce measurable speech characteristics, such as slower articulation rate and lengthened sounds, in a manner similar to dialectal phonetic variations seen in speech. Building on this idea, we introduce VoxCog, an end-to-end framework that uses pre-trained dialect models to detect AD or MCI without relying on additional modalities such as text or images. Through experiments on multiple multilingual datasets for AD and MCI detection, we demonstrate that model initialization with a dialect classifier on top of speech foundation models consistently improves the predictive performance of AD or MCI. Our trained models yield similar or often better performance compared to previous approaches that ensembled several computational methods using different signal modalities. Particularly, our end-to-end speech-based model achieves 87.5% and 85.9% accuracy on the ADReSS 2020 challenge and ADReSSo 2021 challenge test sets, outperforming existing solutions that use multimodal ensemble-based computation or LLMs.
https://arxiv.org/abs/2601.07999
This study investigates the use of computational audio analysis to examine ideological narratives in Nazi propaganda films. Employing a three-step pipeline (speaker diarization, audio transcription, and psycholinguistic analysis), it reveals ideological patterns in characters. Despite current issues with speaker diarization, the methodology provides insights into character traits and propaganda narratives, suggesting scalable applications.
https://arxiv.org/abs/2601.08879
With the recent advancements in reasoning capabilities, tool calling using MCP servers, and Audio Language Models (ALMs), the development and integration of multi-modal agents (with voice and text support) has come to the industry forefront. Cascading pipelines for voice agents still play a central role in the industry owing to their superior reasoning capabilities facilitated by LLMs. However, cascading pipelines are prone to error propagation through the pipeline. We propose a framework, FOCAL, to benchmark end-to-end reasoning, component-wise error propagation, and error analysis for automated as well as human-assisted testing of multi-modal agents (voice-to-voice + text input). We also introduce two novel metrics, the Reasoning and Semantic scores, to evaluate the efficacy of the agent in having meaningful conversations in voice mode.
https://arxiv.org/abs/2601.07367
Large Audio Language Models (LALMs) have been widely applied in real-time scenarios, such as in-car assistants and online meeting comprehension. In practice, audio inputs are often corrupted by device and environmental noise, leading to performance degradation. However, existing LALM studies on noise lack quantitative analysis and rely mainly on intuition and empirical observation, thus failing to understand practical robustness. To address this issue, we introduce Signal Embedding Energy (SEE), a method for quantifying the impact of noise intensity on LALM inputs, enabling the differentiation of LALM robustness in real-world deployments. SEE introduces a perspective based on structured activation subspaces derived from the model's internal representations, which more accurately captures its perception of noise than raw audio features. Across experiments, SEE exhibits a strong correlation with LALM performance, achieving a correlation of 0.98. Surprisingly, traditional audio denoising methods are only marginally effective for LALMs, and, in some cases, even increase SEE and impair performance. This suggests a mismatch between speech-centric denoising objectives and the noise sensitivity of modern LALMs. Therefore, we propose a mitigation strategy derived from SEE to denoise LALM inputs, outperforming existing denoising methods. This paper introduces a novel metric for noise quantification in LALMs, providing guidance for robustness improvements in real-world deployments.
https://arxiv.org/abs/2601.07331
Audio recorded in real-world environments often contains a mixture of foreground speech and background environmental sounds. With rapid advances in text-to-speech, voice conversion, and other generation models, either component can now be modified independently. Such component-level manipulations are harder to detect, as the remaining unaltered component can mislead systems designed for wholly deepfaked audio, and they often sound more natural to human listeners. To address this gap, we propose the CompSpoofV2 dataset and a separation-enhanced joint learning framework. CompSpoofV2 is a large-scale curated dataset designed for component-level audio anti-spoofing, containing over 250k audio samples with a total duration of approximately 283 hours. Based on CompSpoofV2 and the separation-enhanced joint learning framework, we launch the Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2), focusing on component-level spoofing, where both speech and environmental sounds may be manipulated or synthesized, creating a more challenging and realistic detection scenario. The challenge will be held in conjunction with the IEEE International Conference on Multimedia and Expo 2026 (ICME 2026).
https://arxiv.org/abs/2601.07303
This paper summarizes the ICASSP 2026 Automatic Song Aesthetics Evaluation (ASAE) Challenge, which focuses on predicting the subjective aesthetic scores of AI-generated songs. The challenge consists of two tracks: Track 1 targets the prediction of the overall musicality score, while Track 2 focuses on predicting five fine-grained aesthetic scores. The challenge attracted strong interest from the research community and received numerous submissions from both academia and industry. Top-performing systems significantly surpassed the official baseline, demonstrating substantial progress in aligning objective metrics with human aesthetic preferences. The outcomes establish a standardized benchmark and advance human-aligned evaluation methodologies for modern music generation systems.
https://arxiv.org/abs/2601.07237
Balancing dialogue, music, and sound effects with accompanying video is crucial for immersive storytelling, yet current audio mixing workflows remain largely manual and labor-intensive. While recent advancements have introduced the visually guided acoustic highlighting task, which implicitly rebalances audio sources using multimodal guidance, it remains unclear which visual aspects are most effective as conditioning signals. We address this gap through a systematic study of whether deep video understanding improves audio remixing. Using textual descriptions as a proxy for visual analysis, we prompt large vision-language models to extract six types of visual-semantic aspects, including object and character appearance, emotion, camera focus, tone, scene background, and inferred sound-related cues. Through extensive experiments, camera focus, tone, and scene background consistently yield the largest improvements in perceptual mix quality over state-of-the-art baselines. Our findings (i) identify which visual-semantic cues most strongly support coherent and visually aligned audio remixing, and (ii) outline a practical path toward automating cinema-grade sound design using lightweight guidance derived from large vision-language models.
https://arxiv.org/abs/2601.08871
Selective fixed-filter active noise control (SFANC) is a novel approach capable of mitigating noise with varying frequency characteristics. It offers faster response and greater computational efficiency compared to traditional adaptive algorithms. However, spatial factors, particularly the influence of the noise source location, are often overlooked. Some existing studies have explored the impact of the direction-of-arrival (DoA) of the noise source on ANC performance, but they are mostly limited to free-field conditions and do not consider the more complex indoor reverberant environments. To address this gap, this paper proposes a learning-based directional SFANC method that incorporates the DoA of the noise source in reverberant environments. In this framework, multiple reference signals are processed by a convolutional neural network (CNN) to estimate the azimuth and elevation angles of the noise source, as well as to identify the most appropriate control filter for effective noise cancellation. Compared to traditional adaptive algorithms, the proposed approach achieves superior noise reduction with shorter response times, even in the presence of reverberations.
https://arxiv.org/abs/2601.06981
Recent advances in generative models have enabled modern Text-to-Audio (TTA) systems to synthesize audio with high perceptual quality. However, TTA systems often struggle to maintain semantic consistency with the input text, leading to mismatches in sound events, temporal structures, or contextual relationships. Evaluating semantic fidelity in TTA remains a significant challenge. Traditional methods primarily rely on subjective human listening tests, which are time-consuming. To solve this, we propose an objective evaluator based on a Mixture of Experts (MoE) architecture with Sequential Cross-Attention (SeqCoAttn). Our model achieves the first rank in the XACLE Challenge, with an SRCC of 0.6402 (an improvement of 30.6% over the challenge baseline) on the test dataset. Code is available at: this https URL.
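For reference, SRCC (Spearman rank correlation, the challenge's ranking metric) is the Pearson correlation of the rank sequences; a minimal pure-Python version, ignoring tied ranks, looks like:

```python
def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    (Ties are ignored here; the standard definition uses average ranks.)"""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0.0] * len(v)
        for rank, idx in enumerate(order):
            r[idx] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Perfectly monotone (even if nonlinear) predictions give SRCC ~ 1.
print(srcc([3.1, 2.5, 4.8, 1.2], [0.60, 0.40, 0.95, 0.10]))
```

Because only the ordering matters, SRCC rewards an evaluator that ranks generated songs the way human raters do, regardless of the absolute scale of its scores.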
https://arxiv.org/abs/2601.06829
This work introduces a robust single-channel inverse filter for dereverberation of non-ideal recordings, validated on real audio. The method computes and modifies a discrete impulse response in order to filter out the characteristics of a known digital single-channel recording setup and of the room, such as early reflections and reverberation. The aim is a drier and clearer signal reconstruction, which ideally would be the direct-path signal. The time-domain impulse response is calculated from the cepstral domain and faded by means of a frequency-bin-specific exponential decay in the spectrum. The decay rates are obtained from blind estimates of the reverberation-time ratio between the recorded output and test signals for each frequency bin. The modified impulse response then filters a recorded audio signal by deconvolution. The blind estimation is well known and stands out for its robustness to noise and non-idealities. Estimation of the direct-path signal is key to many applications.
https://arxiv.org/abs/2601.06662
A binaural rendering framework for personal sound zones (PSZs) is proposed to enable multiple head-tracked listeners to receive fully independent stereo audio programs. Current PSZ systems typically rely on monophonic rendering and therefore cannot control the left and right ears separately, which limits the quality and accuracy of spatial imaging. The proposed method employs a Binaural Spatially Adaptive Neural Network (BSANN) to generate ear-optimized loudspeaker filters that reconstruct the desired acoustic field at each ear of multiple listeners. The framework integrates anechoically measured loudspeaker frequency responses, analytically modeled transducer directivity, and rigid-sphere head-related transfer functions (HRTFs) to enhance acoustic accuracy and spatial rendering fidelity. An explicit active crosstalk cancellation (XTC) stage further improves three-dimensional spatial perception. Experiments show significant gains in measured objective performance metrics, including inter-zone isolation (IZI), inter-program isolation (IPI), and crosstalk cancellation (XTC), with log-frequency-weighted values of 10.23/10.03 dB (IZI), 11.11/9.16 dB (IPI), and 10.55/11.13 dB (XTC), respectively, over 100-20,000 Hz. The combined use of ear-wise control, accurate acoustic modeling, and integrated active XTC produces a unified rendering method that delivers greater isolation performance, increased robustness to room asymmetry, and more faithful spatial reproduction in real acoustic environments.
https://arxiv.org/abs/2601.06621