Recent approaches in music generation rely on disentangled representations, often labeled as structure and timbre or local and global, to enable controllable synthesis. Yet the underlying properties of these embeddings remain underexplored. In this work, we evaluate such disentangled representations in a set of music audio models for controllable generation using a probing-based framework that goes beyond standard downstream tasks. The selected models reflect diverse unsupervised disentanglement strategies, including inductive biases, data augmentations, adversarial objectives, and staged training procedures. We further isolate specific strategies to analyze their effect. Our analysis spans four key axes: informativeness, equivariance, invariance, and disentanglement, which are assessed across datasets, tasks, and controlled transformations. Our findings reveal inconsistencies between intended and actual semantics of the embeddings, suggesting that current strategies fall short of producing truly disentangled representations, and prompting a re-examination of how controllability is approached in music generation.
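The probing framework can be illustrated with a minimal, self-contained sketch (the synthetic embeddings, the "factor" dimension, and the least-squares probe are our own illustrative choices, not the paper's actual protocol): informativeness is read off as the accuracy of a linear probe on frozen embeddings, and destroying a factor dimension should send that accuracy toward chance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "frozen embeddings": dimension 0 carries a binary factor
# (e.g. a timbre class), the remaining dimensions are noise.
n, d = 400, 16
y = rng.integers(0, 2, n)            # ground-truth factor labels
Z = rng.normal(0, 1, (n, d))
Z[:, 0] += 3.0 * y                   # make the factor linearly decodable

def probe_accuracy(Z, y):
    """Least-squares linear probe; its accuracy is an informativeness proxy."""
    X = np.hstack([Z, np.ones((len(Z), 1))])   # append a bias term
    w, *_ = np.linalg.lstsq(X, 2.0 * y - 1.0, rcond=None)
    return np.mean((X @ w > 0) == (y == 1))

acc = probe_accuracy(Z, y)           # high: the embedding is informative

shuffled = Z.copy()
rng.shuffle(shuffled[:, 0])          # destroy the factor dimension
acc_shuffled = probe_accuracy(shuffled, y)   # drops toward chance
```

The same probe re-run on transformed inputs would give the invariance/equivariance readings the abstract describes.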
https://arxiv.org/abs/2602.10058
Passive acoustic monitoring has become a key strategy in biodiversity assessment, conservation, and behavioral ecology, especially as Internet-of-Things (IoT) devices enable continuous in situ audio collection at scale. While recent self-supervised learning (SSL)-based audio encoders, such as BEATs and AVES, have shown strong performance in bioacoustic tasks, their computational cost and limited robustness to unseen environments hinder deployment on resource-constrained platforms. In this work, we introduce BioME, a resource-efficient audio encoder designed for bioacoustic applications. BioME is trained via layer-to-layer distillation from a high-capacity teacher model, enabling strong representational transfer while reducing the parameter count by 75%. To further improve ecological generalization, the model is pretrained on multi-domain data spanning speech, environmental sounds, and animal vocalizations. A key contribution is the integration of modulation-aware acoustic features via FiLM conditioning, injecting a DSP-inspired inductive bias that enhances feature disentanglement in low-capacity regimes. Across multiple bioacoustic tasks, BioME matches or surpasses the performance of larger models, including its teacher, while being suitable for resource-constrained IoT deployments. For reproducibility, code and pretrained checkpoints are publicly available.
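FiLM conditioning, the mechanism named in the abstract, applies a feature-wise affine transform gamma * h + beta whose coefficients are predicted from a conditioning input. A minimal numpy sketch (the shapes, weight matrices, and conditioning vector are illustrative assumptions, not BioME's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(1)

def film(h, cond, W_gamma, W_beta):
    """FiLM: feature-wise linear modulation, gamma * h + beta, with
    gamma and beta predicted per-channel from the conditioning vector."""
    gamma = cond @ W_gamma                   # (batch, channels)
    beta = cond @ W_beta                     # (batch, channels)
    return gamma[:, :, None] * h + beta[:, :, None]   # broadcast over time

batch, channels, time, cond_dim = 2, 8, 50, 4
h = rng.normal(size=(batch, channels, time))   # encoder feature maps
cond = rng.normal(size=(batch, cond_dim))      # e.g. modulation-aware features
W_gamma = rng.normal(size=(cond_dim, channels))
W_beta = rng.normal(size=(cond_dim, channels))

out = film(h, cond, W_gamma, W_beta)
```

In the paper's setting, `cond` would carry the DSP-inspired modulation features that inject the inductive bias.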
https://arxiv.org/abs/2602.09970
Music stem generation, the task of producing musically synchronized and isolated instrument audio clips, offers the potential for greater user control and better alignment with musician workflows than conventional text-to-music models. Existing stem generation approaches, however, either rely on fixed architectures that output a predefined set of stems in parallel, or generate only one stem at a time, resulting in slow inference despite flexibility in stem combination. We propose Stemphonic, a diffusion-/flow-based framework that overcomes this trade-off and generates a variable set of synchronized stems in one inference pass. During training, we treat each stem as a batch element, group synchronized stems in a batch, and apply a shared noise latent to each group. At inference time, we use a shared initial noise latent and stem-specific text inputs to generate synchronized multi-stem outputs in one pass. We further extend our approach to enable one-pass conditional multi-stem generation and stem-wise activity controls, empowering users to iteratively generate and orchestrate the temporal layering of a mix. We benchmark our results on multiple open-source stem evaluation sets and show that Stemphonic produces higher-quality outputs while accelerating full-mix generation by 25 to 50%. Demos at: this https URL.
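The shared-noise grouping described above can be sketched in a few lines (the group sizes and latent shape are illustrative; Stemphonic's actual latents are diffusion/flow states):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_batch_noise(group_ids, latent_shape, rng):
    """One noise latent per group of synchronized stems; every stem
    (batch element) in a group reuses its group's latent."""
    noise = {}
    batch = []
    for g in group_ids:
        if g not in noise:
            noise[g] = rng.normal(size=latent_shape)
        batch.append(noise[g])
    return np.stack(batch)

# Two songs in the batch: song 0 contributes 3 stems, song 1 contributes 2.
group_ids = [0, 0, 0, 1, 1]
z = make_batch_noise(group_ids, latent_shape=(4, 16), rng=rng)
```

Stems within a group start from identical noise, which (together with stem-specific text inputs) is what keeps the generated stems temporally synchronized.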
https://arxiv.org/abs/2602.09891
Acoustic room modes and the Green's function mode expansion are well-known for rectangular rooms with perfectly reflecting walls. First-order approximations also exist for nearly rigid boundaries; however, current analytical methods fail to accommodate more general boundary conditions, e.g., when wall absorption is significant. In this work, we present a comprehensive analysis that extends previous studies by including additional first-order asymptotics that account for soft-wall boundaries. In addition, we introduce a semi-analytical, efficient, and reliable method for computing the Green's function in rectangular rooms, which is described and validated through numerical tests. With a sufficiently large truncation order, the resulting error becomes negligible, making the method suitable as a benchmark for numerical simulations. Additional aspects regarding the spectral basis orthogonality and completeness are also addressed, providing a general framework for the validity of the proposed approach.
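For the rigid-wall baseline case the abstract starts from, the mode eigenfrequencies of a rectangular room follow the classical formula f = (c/2) * sqrt((nx/Lx)^2 + (ny/Ly)^2 + (nz/Lz)^2), which is easy to compute directly (the room dimensions below are arbitrary examples):

```python
import numpy as np

def mode_frequency(nx, ny, nz, Lx, Ly, Lz, c=343.0):
    """Eigenfrequency of mode (nx, ny, nz) of a rectangular room with
    perfectly rigid walls: f = (c / 2) * sqrt(sum((n_i / L_i)^2))."""
    return 0.5 * c * np.sqrt((nx / Lx) ** 2 + (ny / Ly) ** 2 + (nz / Lz) ** 2)

# Axial, tangential, and oblique examples for a 5 m x 4 m x 3 m room.
f_axial = mode_frequency(1, 0, 0, 5.0, 4.0, 3.0)       # 343 / (2 * 5) = 34.3 Hz
f_tangential = mode_frequency(1, 1, 0, 5.0, 4.0, 3.0)
f_oblique = mode_frequency(1, 1, 1, 5.0, 4.0, 3.0)
```

These rigid-wall frequencies are exactly what the paper's first-order asymptotics and semi-analytical Green's function method correct for absorbing (soft-wall) boundaries.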
https://arxiv.org/abs/2602.09594
Environmental sound classification (ESC) has gained significant attention due to its diverse applications in smart city monitoring, fault detection, acoustic surveillance, and manufacturing quality control. To enhance CNN performance, feature stacking techniques have been explored to aggregate complementary acoustic descriptors into richer input representations. In this paper, we investigate CNN-based models employing various stacked feature combinations, including Log-Mel Spectrogram (LM), Spectral Contrast (SPC), Chroma (CH), Tonnetz (TZ), Mel-Frequency Cepstral Coefficients (MFCCs), and Gammatone Cepstral Coefficients (GTCC). Experiments are conducted on the widely used ESC-50 and UrbanSound8K datasets under different training regimes, including pretraining on ESC-50, fine-tuning on UrbanSound8K, and comparison with Audio Spectrogram Transformer (AST) models pretrained on large-scale corpora such as AudioSet. This experimental design enables an analysis of how feature-stacked CNNs compare with transformer-based models under varying levels of training data and pretraining diversity. The results indicate that feature-stacked CNNs offer a more computationally and data-efficient alternative when large-scale pretraining or extensive training data are unavailable, making them particularly well suited for resource-constrained and edge-level sound classification scenarios.
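Feature stacking itself is simple concatenation along the feature axis; a sketch with placeholder arrays (in a real pipeline each block would be extracted from audio with a feature library, and the per-feature dimensions below are typical defaults, not necessarily the paper's exact setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames = 200

# Placeholder per-frame descriptors; in practice these would be computed
# from the waveform (log-mel, spectral contrast, chroma, tonnetz, MFCC, ...).
features = {
    "log_mel": rng.normal(size=(64, n_frames)),
    "spectral_contrast": rng.normal(size=(7, n_frames)),
    "chroma": rng.normal(size=(12, n_frames)),
    "tonnetz": rng.normal(size=(6, n_frames)),
    "mfcc": rng.normal(size=(20, n_frames)),
}

# Stack along the feature axis to form one richer CNN input "image".
stacked = np.vstack(list(features.values()))   # shape: (109, n_frames)
```

The stacked array is then treated as a single-channel (or multi-channel) input to the CNN, which is the aggregation the paper evaluates against transformer baselines.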
https://arxiv.org/abs/2602.09321
This work presents the largest curation of Southern Resident Killer Whale (SRKW) acoustic data to date, also containing other marine mammals in their environment. We systematically search all available public archival hydrophone data within the SRKW habitat (over 30 years of audio data). The search consists of a weakly-supervised, positive-unlabelled, active learning strategy to identify all instances of marine mammals. The resulting transformer-based detectors outperform state-of-the-art detectors on the DEEPAL, DCLDE-2026, and two newly introduced expert-annotated datasets in terms of accuracy, energy efficiency, and speed. The detection model has a specificity of 0-28.8% at 95% sensitivity. Our multiclass species classifier obtains a top-1 accuracy of 42.1% (11 train classes, 4 test classes) and our ecotype classifier obtains a top-1 accuracy of 43.0% (4 train classes, 5 test classes) on the DCLDE-2026 dataset. The curation yields 919 hours of SRKW data, 230 hours of Bigg's orca data, 1374 hours of orca data from unlabelled ecotypes, 1501 hours of humpback data, 88 hours of sea lion data, 246 hours of Pacific white-sided dolphin data, and over 784 hours of unspecified marine mammal data. This SRKW dataset is larger than DCLDE-2026, Ocean Networks Canada, and OrcaSound combined. The curated species labels are available under a CC-BY 4.0 license, and the corresponding audio data are available under the licenses of the original owners. The comprehensive nature of this dataset makes it suitable for unsupervised machine translation, habitat usage surveys, and conservation endeavours for this critically endangered ecotype.
https://arxiv.org/abs/2602.09295
Blind room impulse response (RIR) estimation is a core task for capturing and transferring acoustic properties; yet existing methods often suffer from limited modeling capability and degraded performance under unseen conditions. Moreover, emerging generative audio applications call for more flexible impulse response generation methods. We propose Gencho, a diffusion-transformer-based model that predicts complex spectrogram RIRs from reverberant speech. A structure-aware encoder leverages isolation between early and late reflections to encode the input audio into a robust representation for conditioning, while the diffusion decoder generates diverse and perceptually realistic impulse responses from it. Gencho integrates modularly with standard speech processing pipelines for acoustic matching. Results show richer generated RIRs than non-generative baselines while maintaining strong performance in standard RIR metrics. We further demonstrate its application to text-conditioned RIR generation, highlighting Gencho's versatility for controllable acoustic simulation and generative audio tasks.
https://arxiv.org/abs/2602.09233
This research applies artificial intelligence (AI) to separate, cluster, and analyze cardiorespiratory sounds. We recorded a new dataset (HLS-CMDS) and developed several AI models, including generative AI methods based on large language models (LLMs) for guided separation, explainable AI (XAI) techniques to interpret latent representations, variational autoencoders (VAEs) for waveform separation, a chemistry-inspired non-negative matrix factorization (NMF) algorithm for clustering, and a quantum convolutional neural network (QCNN) designed to detect abnormal physiological patterns. The performance of these AI models depends on the quality of the recorded signals. Therefore, this thesis also reviews the biosensing technologies used to capture biomedical data. It summarizes developments in microelectromechanical systems (MEMS) acoustic sensors and quantum biosensors, such as quantum dots and nitrogen-vacancy centers. It further outlines the transition from electronic integrated circuits (EICs) to photonic integrated circuits (PICs) and early progress toward integrated quantum photonics (IQP) for chip-based biosensing. Together, these studies show how AI and next-generation sensors can support more intelligent diagnostic systems for future healthcare.
https://arxiv.org/abs/2602.09210
Open-vocabulary keyword spotting (OV-KWS) enables personalized device control via arbitrary voice commands. Recently, researchers have explored using audio-text joint embeddings, allowing users to enroll phrases with text, and proposed techniques to disambiguate similar utterances. We find that existing OV-KWS solutions often overly bias the beginning phonemes of an enrollment, causing false triggers when negative enrollment-query pairs share a prefix ("turn the volume up" vs. "turn the volume down"). We trace this to two factors: training data bias and position-biased cross-modal scoring. To address these limitations, we introduce the Partial Overlap Benchmark (POB) with two datasets, POB-Spark and POB-LibriPhrase (POB-LP), containing mismatched audio-text pairs with shared prefixes, and propose Equal-weighting Position Scoring (EPS), a lightweight decision layer. Using EPS alone reduces EER on POB-Spark from 64.4% to 29.3% and improves POB-LP accuracy from 87.6% to 96.8%, while maintaining performance on LibriPhrase and Google Speech Commands (GSC). With POB data added in training, our work achieves the best POB benchmark results while incurring the least degradation on prior metrics among baselines. This degradation is most pronounced on GSC, which contains only one-word commands. We leave mitigating this trade-off to future work.
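The position-bias problem and the intuition behind equal weighting can be illustrated with toy per-position match scores (the numbers and the front-biased weighting below are fabricated for illustration; the paper's actual EPS layer may differ):

```python
import numpy as np

def weighted_score(pos_scores, weights):
    """Aggregate per-position audio-text match scores with given weights."""
    weights = np.asarray(weights, dtype=float)
    return float(np.dot(pos_scores, weights / weights.sum()))

# Per-position match scores for the negative pair
# audio "turn the volume up" vs. text "turn the volume down":
# the shared prefix matches well; only the last position disagrees.
pos_scores = np.array([0.95, 0.95, 0.95, 0.95, 0.05])

front_biased = weighted_score(pos_scores, [5, 4, 3, 2, 1])  # prefix dominates
equal = weighted_score(pos_scores, [1, 1, 1, 1, 1])         # suffix counts too
```

Under front-biased weighting the negative pair still scores highly (a false trigger); equal weighting lets the mismatched suffix pull the score down, which is the behavior EPS is designed to restore.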
https://arxiv.org/abs/2602.08930
While deep learning has advanced speech enhancement (SE), effective phase modeling remains challenging: conventional networks typically operate in a flat Euclidean feature space, which makes it hard to model the underlying circular topology of phase. To address this, we propose a manifold-aware magnitude-phase dual-stream framework that aligns the phase stream with its intrinsic circular geometry by enforcing a Global Rotation Equivariance (GRE) property. Specifically, we introduce a Magnitude-Phase Interactive Convolutional Module (MPICM) for modulus-based information exchange and a Hybrid-Attention Dual-FFN (HADF) bottleneck for unified feature fusion, both of which are designed to preserve GRE in the phase stream. Comprehensive evaluations are conducted across phase retrieval, denoising, dereverberation, and bandwidth extension tasks to validate the superiority of the proposed method over multiple advanced baselines. Notably, the proposed architecture reduces Phase Distance by over 20% in the phase retrieval task and improves PESQ by more than 0.1 in zero-shot cross-corpus denoising evaluations. The overall superiority is also established in universal SE tasks involving mixed distortions. Qualitative analysis further reveals that the learned phase features exhibit distinct periodic patterns, which are consistent with the intrinsic circular nature of the phase. The source code is available at this https URL.
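Global Rotation Equivariance can be checked directly on a complex spectrogram: rotating the input by a global phase leaves magnitudes untouched and shifts every phase by the same angle. A small numpy verification (toy data, not the paper's learned features):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy complex spectrogram (freq x time) and a global phase rotation.
S = rng.normal(size=(128, 40)) + 1j * rng.normal(size=(128, 40))
theta = 0.7
S_rot = S * np.exp(1j * theta)       # globally rotated input

# Magnitude is invariant under the rotation ...
mag_equal = np.allclose(np.abs(S), np.abs(S_rot))

# ... while phase is equivariant: every bin shifts by exactly theta.
# (Computing the angle of S_rot * conj(S) avoids 2*pi wrap-around issues.)
phase_shift = np.angle(S_rot * np.conj(S))
```

A GRE-preserving phase stream is one whose features transform like `phase_shift` here: a constant input rotation produces the same constant rotation in feature space.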
https://arxiv.org/abs/2602.08556
Synthesizing coherent soundtracks for long-form videos remains a formidable challenge, currently stalled by three critical impediments: computational scalability, temporal coherence, and, most critically, a pervasive semantic blindness to evolving narrative logic. To bridge these gaps, we propose NarraScore, a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic. Uniquely, we repurpose frozen Vision-Language Models (VLMs) as continuous affective sensors, distilling high-dimensional visual streams into dense, narrative-aware Valence-Arousal trajectories. Mechanistically, NarraScore employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism: a Global Semantic Anchor ensures stylistic stability, while a surgical Token-Level Affective Adapter modulates local tension via direct element-wise residual injection. This minimalist design bypasses the bottlenecks of dense attention and architectural cloning, effectively mitigating the overfitting risks associated with data scarcity. Experiments demonstrate that NarraScore achieves state-of-the-art consistency and narrative alignment with negligible computational overhead, establishing a fully autonomous paradigm for long-video soundtrack generation.
https://arxiv.org/abs/2602.09070
Current audio formats present a fundamental trade-off between file size and functionality: lossless formats like FLAC preserve quality but lack adaptability, while lossy formats reduce size at the cost of fidelity and offer no stem-level control. We introduce the Stem-Native Codec (SNC), a novel audio container format that stores music as independently encoded stems plus a low-energy mastering residual. By exploiting the lower information entropy of separated stems compared to mixed audio, SNC achieves a 38.2% file size reduction versus FLAC (7.76 MB vs. 12.55 MB for a 2:18 test track) while maintaining perceptual transparency (STOI = 0.996). Unlike existing formats, SNC enables context-aware adaptive playback, spatial audio rendering, and user-controlled remixing without requiring additional storage. Our experimental validation demonstrates that the stems-plus-residual architecture successfully decouples the conflicting requirements of compression efficiency and feature richness, offering a practical path toward next-generation audio distribution systems.
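The stems-plus-residual arithmetic is straightforward: the residual is defined as the mastered mix minus the sum of the stems, so summing decoded stems and residual reconstructs the mix by construction. A toy sketch (the signal statistics are invented; SNC's actual per-stem encoders are not modeled):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 48_000  # one second at 48 kHz

# Stems (independently encodable) and a mastered mix with master-bus effects.
stems = {name: rng.normal(0, 0.1, n) for name in ("drums", "bass", "vocals")}
mastering_fx = rng.normal(0, 0.01, n)    # stand-in for master-bus processing
mix = sum(stems.values()) + mastering_fx

# The low-energy mastering residual captures whatever the stems miss.
residual = mix - sum(stems.values())

# Decoding: sum of stems plus residual reproduces the mix exactly.
reconstruction = sum(stems.values()) + residual
```

Because the residual carries only the (small) mastering difference, it compresses cheaply, while the stems remain individually addressable for remixing and adaptive playback.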
https://arxiv.org/abs/2602.08148
Realistic sound propagation is essential for immersion in a virtual scene, yet physically accurate wave-based simulations remain computationally prohibitive for real-time applications. Wave coding methods address this limitation by precomputing and compressing impulse responses of a given scene into a set of scalar acoustic parameters, which can reach unmanageable sizes in large environments with many source-receiver pairs. We introduce Reciprocal Latent Fields (RLF), a memory-efficient framework for encoding and predicting these acoustic parameters. The RLF framework employs a volumetric grid of trainable latent embeddings decoded with a symmetric function, ensuring acoustic reciprocity. We study a variety of decoders and show that leveraging Riemannian metric learning leads to a better reproduction of acoustic phenomena in complex scenes. Experimental validation demonstrates that RLF maintains replication quality while reducing the memory footprint by several orders of magnitude. Furthermore, a MUSHRA-like subjective listening test indicates that sound rendered via RLF is perceptually indistinguishable from ground-truth simulations.
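Acoustic reciprocity (swapping source and receiver leaves the response unchanged) can be guaranteed by construction when the decoder is a symmetric function of the two latent embeddings, e.g. one that only sees their sum. A toy sketch (the grid, decoder, and parameter count are illustrative, not the RLF architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Trainable latent embeddings on a volumetric grid (here: a flat toy grid).
latents = rng.normal(size=(10, 8))   # 10 grid cells, 8-dim embeddings
W = rng.normal(size=(8, 3))          # toy decoder weights

def decode(i, j):
    """Symmetric decoder: depends only on z_i + z_j, so
    decode(i, j) == decode(j, i) holds for any weights.
    Output: a toy vector of scalar acoustic parameters."""
    return np.tanh((latents[i] + latents[j]) @ W)

params_ij = decode(2, 7)
params_ji = decode(7, 2)   # reciprocity holds by construction
```

Any decoder built on a symmetric combination of the two embeddings inherits this property, so reciprocity never needs to be learned from data.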
https://arxiv.org/abs/2602.06937
Spatial audio is crucial for creating compelling immersive 360-degree video experiences. However, generating realistic spatial audio, such as first-order ambisonics (FOA), from 360-degree videos in complex acoustic scenes remains challenging. Existing methods often overlook the dynamic nature and acoustic complexity of 360-degree scenes, fail to fully account for dynamic sound sources, and neglect complex environmental effects such as occlusion, reflections, and reverberation, which are influenced by scene geometries and materials. We propose DynFOA, a framework based on dynamic acoustic perception and conditional diffusion, for generating high-fidelity FOA from 360-degree videos. DynFOA first performs visual processing via a video encoder, which detects and localizes multiple dynamic sound sources, estimates their depth and semantics, and reconstructs the scene geometry and materials using 3D Gaussian Splatting. This reconstruction technique accurately models occlusion, reflections, and reverberation based on the geometries and materials of the reconstructed 3D scene and the listener's viewpoint. The audio encoder then captures the spatial motion and temporal 4D sound source trajectories to fine-tune the diffusion-based FOA generator. The fine-tuned FOA generator adjusts spatial cues in real time, ensuring consistent directional fidelity during listener head rotation and complex environmental changes. Extensive evaluations demonstrate that DynFOA consistently outperforms existing methods across metrics such as spatial accuracy, acoustic fidelity, and distribution matching, while also improving the user experience. Therefore, DynFOA provides a robust and scalable approach to rendering realistic dynamic spatial audio for VR and immersive media applications.
https://arxiv.org/abs/2602.06846
Complex activities in real-world audio unfold over extended durations and exhibit hierarchical structure, yet most prior work focuses on short clips and isolated events. To bridge this gap, we introduce MultiAct, a new dataset and benchmark for multi-level structured understanding of human activities from long-form audio. MultiAct comprises long-duration kitchen recordings annotated at three semantic levels (activities, sub-activities and events) and paired with fine-grained captions and high-level summaries. We further propose a unified hierarchical model that jointly performs classification, detection, sequence prediction and multi-resolution captioning. Experiments on MultiAct establish strong baselines and reveal key challenges in modelling hierarchical and compositional structure of long-form audio. A promising direction for future work is the exploration of methods better suited to capturing the complex, long-range relationships in long-form audio.
https://arxiv.org/abs/2602.06765
Surface electromyography (EMG) is a promising modality for silent speech interfaces, but its effectiveness depends heavily on sensor placement and channel availability. In this work, we investigate the contribution of individual and combined EMG channels to speech reconstruction performance. Our findings reveal that while certain EMG channels are individually more informative, the highest performance arises from subsets that leverage complementary relationships among channels. We also analyzed phoneme classification accuracy under channel ablations and observed interpretable patterns reflecting the anatomical roles of the underlying muscles. To address performance degradation from channel reduction, we pretrained models on full 8-channel data using random channel dropout and fine-tuned them on reduced-channel subsets. Fine-tuning consistently outperformed training from scratch for 4 to 6 channel settings, with the best dropout strategy depending on the number of channels. These results suggest that performance degradation from sensor reduction can be mitigated through pretraining and channel-aware design, supporting the development of lightweight and practical EMG-based silent speech systems.
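Random channel dropout during pretraining amounts to zeroing a random subset of channels per example; a sketch (the keep-at-least-one rule and drop probability are our assumptions, not necessarily the paper's exact strategy):

```python
import numpy as np

rng = np.random.default_rng(0)

def channel_dropout(x, drop_prob, rng):
    """Zero out each EMG channel independently with probability drop_prob,
    always keeping at least one channel active."""
    n_ch = x.shape[0]
    keep = rng.random(n_ch) >= drop_prob
    if not keep.any():
        keep[rng.integers(n_ch)] = True   # never drop everything
    return x * keep[:, None], keep

x = rng.normal(size=(8, 1000))            # 8 EMG channels x 1000 samples
x_aug, keep = channel_dropout(x, drop_prob=0.5, rng=rng)
```

Training on such masked inputs exposes the model to many channel subsets, which is what makes later fine-tuning on reduced-channel hardware effective.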
https://arxiv.org/abs/2602.06460
Misophonia is a disorder characterized by a decreased tolerance to specific everyday sounds (trigger sounds) that can evoke intense negative emotional responses such as anger, panic, or anxiety. These reactions can substantially impair daily functioning and quality of life. Assistive technologies that selectively detect trigger sounds could help reduce distress and improve well-being. In this study, we investigate sound event detection (SED) to localize intervals of trigger sounds in continuous environmental audio as a foundational step toward such assistive support. Motivated by the scarcity of real-world misophonia data, we generate synthetic soundscapes tailored to misophonia trigger sound detection using audio synthesis techniques. Then, we perform trigger sound detection tasks using hybrid CNN-based models. The models combine feature extraction using a frozen pre-trained CNN backbone with a trainable time-series module such as gated recurrent units (GRUs), long short-term memories (LSTMs), echo state networks (ESNs), and their bidirectional variants. The detection performance is evaluated using common SED metrics, including Polyphonic Sound Detection Score 1 (PSDS1). On the multi-class trigger SED task, bidirectional temporal modeling consistently improves detection performance, with Bidirectional GRU (BiGRU) achieving the best overall accuracy. Notably, the Bidirectional ESN (BiESN) attains competitive performance while requiring orders of magnitude fewer trainable parameters by optimizing only the readout. We further simulate user personalization via a few-shot "eating sound" detection task with at most five support clips, in which BiGRU and BiESN are compared. In this strict adaptation setting, BiESN shows robust and stable performance, suggesting that lightweight temporal modules are promising for personalized misophonia trigger SED.
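The parameter efficiency of the ESN variants comes from keeping the input and reservoir weights fixed and training only a linear readout, which has a closed-form least-squares solution. A minimal unidirectional ESN sketch (sizes, spectral radius, and the toy target are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res, T = 4, 50, 200

# Fixed (never trained) reservoir: only the readout is optimized.
W_in = rng.normal(0, 0.5, (n_res, n_in))
W_res = rng.normal(0, 1.0, (n_res, n_res))
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))  # spectral radius 0.9

u = rng.normal(size=(T, n_in))          # input feature sequence
states = np.zeros((T, n_res))
h = np.zeros(n_res)
for t in range(T):
    h = np.tanh(W_in @ u[t] + W_res @ h)   # leaky-free reservoir update
    states[t] = h

# Train the readout only, in closed form (least squares).
target = states @ rng.normal(size=(n_res, 2))  # toy linearly-decodable target
W_out, *_ = np.linalg.lstsq(states, target, rcond=None)
```

Because only `W_out` is fit, adaptation to a few support clips (the few-shot personalization setting above) touches orders of magnitude fewer parameters than fine-tuning a GRU or LSTM.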
https://arxiv.org/abs/2602.06271
Advances in AIGC technologies have enabled the synthesis of highly realistic audio deepfakes capable of deceiving human auditory perception. Although numerous audio deepfake detection (ADD) methods have been developed, most rely on local temporal/spectral features or pairwise relations, overlooking high-order interactions (HOIs). HOIs capture discriminative patterns that emerge from multiple feature components beyond their individual contributions. We propose HyperPotter, a hypergraph-based framework that explicitly models these synergistic HOIs through clustering-based hyperedges with class-aware prototype initialization. Extensive experiments demonstrate that HyperPotter surpasses its baseline by an average relative gain of 22.15% across 11 datasets and outperforms state-of-the-art methods by 13.96% on 4 challenging cross-domain datasets, demonstrating superior generalization to diverse attacks and speakers.
https://arxiv.org/abs/2602.05670
We propose WaveTrainerFit, a neural vocoder that performs high-quality waveform generation from data-driven features such as SSL features. WaveTrainerFit builds upon the WaveFit vocoder, which integrates a diffusion model with a generative adversarial network. The proposed method incorporates two key improvements: (1) by introducing trainable priors, the inference process starts from noise close to the target speech instead of Gaussian noise; (2) reference-aware gain adjustment is performed by constraining the trainable prior to match the speech energy. These improvements are expected to reduce the complexity of waveform modeling from data-driven features, enabling high-quality waveform generation with fewer inference steps. Through experiments, we show that WaveTrainerFit can generate highly natural waveforms with improved speaker similarity from data-driven features, while requiring fewer iterations than WaveFit. Moreover, we show that the proposed method works robustly with respect to the depth at which SSL features are extracted. Code and pre-trained models are available from this https URL.
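The reference-aware gain adjustment (improvement 2) can be illustrated as scaling the prior so its RMS energy matches the reference speech; a sketch (the plain RMS matching below is our reading of the energy constraint, not necessarily the paper's exact formulation):

```python
import numpy as np

rng = np.random.default_rng(0)

def match_gain(prior, reference, eps=1e-12):
    """Scale the prior so its RMS energy matches the reference signal."""
    gain = np.sqrt((np.mean(reference ** 2) + eps) / (np.mean(prior ** 2) + eps))
    return gain * prior

reference = 0.3 * rng.normal(size=16_000)  # target-speech energy reference
prior = rng.normal(size=16_000)            # trainable-prior sample (unit-ish RMS)
adjusted = match_gain(prior, reference)
```

Starting denoising from `adjusted` rather than unit-variance Gaussian noise means the iterative refinement no longer has to correct a gross energy mismatch, which is one way fewer inference steps can suffice.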
https://arxiv.org/abs/2602.05443
The lack of impaired speech data hinders the development of inclusive speech technologies, particularly in low-resource languages such as Akan. To address this gap, this study presents a curated corpus of speech samples from native Akan speakers with speech impairment. The dataset comprises 50.01 hours of audio recordings spanning four classes of impaired speech: stammering, cerebral palsy, cleft palate, and stroke-induced speech disorder. Recordings were made in controlled, supervised environments where participants described pre-selected images in their own words. The resulting dataset is a collection of audio recordings, transcriptions, and associated metadata on speaker demographics, class of impairment, recording environment, and device. The dataset is intended to support research on low-resource automatic disordered speech recognition systems and assistive speech technology.
https://arxiv.org/abs/2602.05406