When detecting anomalous sounds in complex environments, one of the main difficulties is that trained models must be sensitive to subtle differences in monitored target signals, while many practical applications also require them to be insensitive to changes in acoustic domains. Examples of such domain shifts include changing the type of microphone or the location of acoustic sensors, which can have a much stronger impact on the acoustic signal than subtle anomalies themselves. Moreover, users typically aim to train a model only on source domain data, which they may have a relatively large collection of, and they hope that such a trained model will be able to generalize well to an unseen target domain by providing only a minimal number of samples to characterize the acoustic signals in that domain. In this work, we review and discuss recent publications focusing on this domain generalization problem for anomalous sound detection in the context of the DCASE challenges on acoustic machine condition monitoring.
https://arxiv.org/abs/2503.10435
Singing voice beat tracking is a challenging task due to the lack of musical accompaniment, which often contains robust rhythmic and harmonic patterns that most existing beat tracking systems rely on and that can be essential for estimating beats. In this paper, a novel temporal convolutional network-based beat-tracking approach featuring self-supervised learning (SSL) representations and adapter tuning is proposed to track the beat and downbeat of singing voices jointly. The SSL DistilHuBERT representations are utilized to capture the semantic information of singing voices and are further fused with generic spectral features to facilitate beat estimation. Sources of variability that are particularly prominent in the non-homogeneous singing voice data are reduced by efficient adapter tuning. Extensive experiments show that feature fusion and adapter tuning each improve performance individually, and the combination of both leads to significantly better performance than the un-adapted baseline system, with up to 31.6% and 42.4% absolute F1-score improvements on beat and downbeat tracking, respectively.
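A minimal sketch of the fusion idea described above, assuming pre-extracted DistilHuBERT frame embeddings and a spectral feature sequence at the same frame rate; the layer sizes, dilation pattern, and output heads are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FusionTCN(nn.Module):
    """Toy sketch: fuse SSL and spectral features, then apply a dilated TCN.
    Dimensions are illustrative; the paper's actual layers may differ."""
    def __init__(self, ssl_dim=768, spec_dim=128, hidden=64, levels=4):
        super().__init__()
        self.proj = nn.Linear(ssl_dim + spec_dim, hidden)     # fuse by concatenation + projection
        layers = []
        for i in range(levels):
            d = 2 ** i                                        # exponentially growing dilation
            layers += [nn.Conv1d(hidden, hidden, kernel_size=5,
                                 padding=2 * d, dilation=d),
                       nn.ELU()]
        self.tcn = nn.Sequential(*layers)
        self.head = nn.Conv1d(hidden, 2, kernel_size=1)       # beat / downbeat activations

    def forward(self, ssl_feats, spec_feats):
        # ssl_feats: (B, T, ssl_dim), spec_feats: (B, T, spec_dim), same frame rate
        x = torch.cat([ssl_feats, spec_feats], dim=-1)
        x = self.proj(x).transpose(1, 2)                      # (B, hidden, T)
        return torch.sigmoid(self.head(self.tcn(x)))          # (B, 2, T)

# Dummy usage: 10 s of features at 50 frames per second.
model = FusionTCN()
beats = model(torch.randn(1, 500, 768), torch.randn(1, 500, 128))
print(beats.shape)  # torch.Size([1, 2, 500])
```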
https://arxiv.org/abs/2503.10086
Automatic Speech Recognition (ASR) is widely used within consumer devices such as mobile phones. Recently, personalization, or on-device model fine-tuning, has shown that adapting ASR models towards target user speech improves their performance on rare words or accented speech. Despite these gains, fine-tuning on user data (target domain) risks the personalized model forgetting knowledge about its original training distribution (source domain), i.e., catastrophic forgetting, leading to subpar general ASR performance. A simple and efficient approach to combat catastrophic forgetting is to measure forgetting via a validation set that represents the source domain distribution. However, such validation sets are large and impractical for mobile devices. To this end, we propose a novel method to subsample a substantially large validation set into a smaller one while maintaining the ability to estimate forgetting. We demonstrate the efficacy of such a dataset in mitigating forgetting by utilizing it to dynamically determine the number of ideal fine-tuning epochs. When measuring the deviations in per-user fine-tuning epochs against a 50x larger validation set (oracle), our method achieves a lower mean absolute error (3.39) compared to randomly selected subsets of the same size (3.78-8.65). Unlike random baselines, our method consistently tracks the oracle's behaviour across three different forgetting thresholds.
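A hedged sketch of how a small source-domain validation set can drive the epoch decision: stop fine-tuning once forgetting (here, the WER increase on the subsampled validation set) crosses a threshold. The helpers `train_step` and `wer_fn`, the threshold, and the epoch budget are hypothetical, not the paper's values.

```python
def personalize(model, user_data, subsampled_val, train_step, wer_fn,
                max_epochs=10, forgetting_threshold=0.02):
    """Fine-tune an ASR model on user data, stopping once forgetting measured on
    the small source-domain validation set exceeds a threshold.

    `train_step(model, data)` and `wer_fn(model, data)` are hypothetical helpers;
    the threshold and epoch budget are illustrative only.
    """
    baseline_wer = wer_fn(model, subsampled_val)      # general-domain WER before adaptation
    for epoch in range(1, max_epochs + 1):
        train_step(model, user_data)                  # one epoch of on-device fine-tuning
        forgetting = wer_fn(model, subsampled_val) - baseline_wer
        if forgetting > forgetting_threshold:         # general ASR has degraded too much
            return epoch - 1                          # report the last acceptable epoch count
    return max_epochs
```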
https://arxiv.org/abs/2503.09906
Correlation-based auditory attention decoding (AAD) algorithms exploit neural tracking mechanisms to determine listener attention among competing speech sources via, e.g., electroencephalography signals. The correlation coefficients between the decoded neural responses and the encoded speech stimuli of the different speakers then serve as AAD decision variables. A critical trade-off exists between the temporal resolution (the decision window length used to compute these correlations) and the AAD accuracy. This trade-off is typically characterized by evaluating AAD accuracy across multiple window lengths, yielding a performance curve. We propose a novel method to model this trade-off curve using labeled correlations from only a single decision window length. Our approach models the (un)attended correlations with a normal distribution after applying the Fisher transformation, enabling accurate AAD accuracy prediction across different window lengths. We validate the method on two distinct AAD implementations: a linear decoder and the non-linear VLAAI deep neural network, evaluated on separate datasets. Results show consistently low modeling errors of approximately 2 percentage points, with 94% of true accuracies falling within the estimated 95% confidence intervals. The proposed method enables efficient performance-curve modeling without extensive multi-window-length evaluation, facilitating practical applications such as performance tracking in neuro-steered hearing devices to continuously adapt the system parameters over time.
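One way to operationalise the statistical idea, as a rough sketch: after the Fisher transform, the sampling variance of a correlation computed from N samples scales roughly as 1/(N-3), so accuracies at other window lengths can be predicted as the probability that the attended Fisher-transformed correlation exceeds the unattended one. The exact formulation in the paper may differ; the symbols and variance decomposition below are assumptions.

```python
import numpy as np
from scipy.stats import norm

def predict_accuracy(r_att, r_unatt, n_ref, n_targets):
    """Estimate AAD accuracy at other decision-window lengths from correlations
    labeled at a single reference window length. Sketch of the idea only."""
    z_att, z_unatt = np.arctanh(r_att), np.arctanh(r_unatt)    # Fisher transform
    mu = z_att.mean() - z_unatt.mean()
    # Remove the sampling variance ~1/(n_ref - 3) expected for each correlation,
    # keeping only the window-independent "structural" spread (assumption).
    var_struct = max(z_att.var() + z_unatt.var() - 2.0 / (n_ref - 3), 0.0)
    acc = []
    for n in n_targets:
        var_n = var_struct + 2.0 / (n - 3)                     # re-add window-dependent variance
        acc.append(norm.cdf(mu / np.sqrt(var_n)))              # P(z_att > z_unatt)
    return np.array(acc)

# Synthetic correlations for illustration.
rng = np.random.default_rng(0)
r_att = np.tanh(rng.normal(0.25, 0.08, 200))
r_unatt = np.tanh(rng.normal(0.05, 0.08, 200))
print(predict_accuracy(r_att, r_unatt, n_ref=640, n_targets=[320, 640, 1920]))
```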
https://arxiv.org/abs/2503.09349
This paper outlines how to leverage the Web MIDI API and web technologies to convert numerical data in JavaScript to Most Significant Byte and Least Significant Byte combos, stage the data as dual concurrent CC messages, use WebSockets to send it to multiple endpoints, and wire the browser to other music software. This method allows users to control their own native applications via 14-bit MIDI messaging, and even applications hosted on a remote source. Because the technology utilizes WebSockets, it is not reliant on local networks for connectivity and opens up the possibility of remote software control and collaboration anywhere in the world. While there is no shortage of options for controlling music software from the web, the Web MIDI API allows for a more streamlined end-user experience as it links seamlessly to core OS MIDI functionality. The paper shares a use case of transmitting high-resolution MIDI through the browser and translating it to control voltage data for use with a modular synthesizer.
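To illustrate the MSB/LSB split (shown here in Python rather than the paper's JavaScript/Web MIDI setting): a 14-bit value is carried by two 7-bit Control Change messages, with the convention that controllers 0-31 carry the MSB and the same controller number plus 32 carries the LSB. The controller and channel numbers below are just example values.

```python
def to_14bit_cc(value, cc_msb=1, channel=0):
    """Split a 14-bit value (0-16383) into the two 7-bit CC messages that carry it:
    controller `cc_msb` holds the MSB, `cc_msb + 32` the LSB. Returned as
    (status, controller, data) byte triples ready to send over any transport."""
    if not 0 <= value <= 0x3FFF:
        raise ValueError("14-bit CC values must be in 0..16383")
    msb = (value >> 7) & 0x7F          # upper 7 bits
    lsb = value & 0x7F                 # lower 7 bits
    status = 0xB0 | (channel & 0x0F)   # Control Change on the given channel
    return (status, cc_msb, msb), (status, cc_msb + 32, lsb)

print(to_14bit_cc(12345))  # ((176, 1, 96), (176, 33, 57))
```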
https://arxiv.org/abs/2503.09055
Analog-digital hybrid electronic music systems once existed out of necessity, in order to provide a flexible working environment for the creation of live computer music. As computational power increased with the development of faster microprocessors, the need to pair digital control with analog sound production decreased, with the computer becoming capable of handling both tasks. Given the exclusivity of these systems and the relatively short time they were in use, their possibilities were hardly explored. The work of José Vicente Asuar best demonstrated a push for the accessibility of such systems, but he never received the support of any institution that would have brought his machine widespread attention. Modeled after his approach, this paper aims to fashion a system, built around a Commodore 64 (or a freely available OS emulator) and analog modular hardware, that is accessible, affordable, easy to use, educational, and musically rich.
https://arxiv.org/abs/2503.09053
Sound effect model design commonly uses digital signal processing techniques that offer full controllability, but it is difficult to achieve realism with a limited number of parameters. Recently, neural sound effect synthesis methods have emerged as a promising approach for generating high-quality and realistic sounds, but the process of synthesizing a desired sound poses difficulties in terms of control. This paper presents a real-time neural synthesis model guided by a physically inspired model, enabling the generation of high-quality sounds while inheriting the control interface of the physically inspired model. We showcase the superior performance of our model in terms of sound quality and control.
https://arxiv.org/abs/2503.08806
In this paper, we investigate a novel approach for Target Speech Extraction (TSE), which relies solely on textual context to extract the target speech. We refer to this task as Contextual Speech Extraction (CSE). Unlike traditional TSE methods that rely on pre-recorded enrollment utterances, video of the target speaker's face, spatial information, or other explicit cues to identify the target stream, our proposed method requires only a few turns of previous dialogue (or monologue) history. This approach is naturally feasible in mobile messaging environments, where voice recordings are typically preceded by textual dialogue that can be leveraged implicitly. We present three CSE models and analyze their performance on three datasets. Through our experiments, we demonstrate that even when the model relies purely on dialogue history, it can achieve over 90% accuracy in identifying the correct target stream with only two previous dialogue turns. Furthermore, we show that by leveraging both textual context and enrollment utterances as cues during training, we further enhance our model's flexibility and effectiveness, allowing us to use either cue during inference, or to combine both for improved performance. Samples and code are available at this https URL.
https://arxiv.org/abs/2503.08798
Leitmotifs are musical phrases that are reprised in various forms throughout a piece. Due to diverse variations and instrumentation, detecting the occurrence of leitmotifs from audio recordings is a highly challenging task. Leitmotif detection may be handled as a subcategory of audio event detection, where leitmotif activity is predicted at the frame level. However, as leitmotifs embody distinct, coherent musical structures, a more holistic approach akin to bounding box regression in visual object detection can be helpful. This method captures the entirety of a motif rather than fragmenting it into individual frames, thereby preserving its musical integrity and producing more useful predictions. We present our experimental results on tackling leitmotif detection as a boundary regression task.
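Treating each motif occurrence as a 1-D "bounding box" (onset, offset), matching and evaluation under a boundary-regression view typically rely on temporal intersection-over-union. A minimal helper for that computation (not from the paper's code):

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two 1-D intervals given as (onset, offset) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A predicted motif span overlapping the annotated one by 2 of 3 seconds.
print(temporal_iou((12.0, 15.0), (13.0, 16.0)))  # 0.5
```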
https://arxiv.org/abs/2503.07977
Music source separation is the task of separating a mixture of instruments into constituent tracks. Music source separation models are typically trained using only audio data, although additional information can be used to improve the model's separation capability. In this paper, we propose two ways of using musical scores to aid music source separation: a score-informed model where the score is concatenated with the magnitude spectrogram of the audio mixture as the input of the model, and a model where we use only the score to calculate the separation mask. We train our models on synthetic data in the SynthSOD dataset and evaluate our methods on the URMP and Aalto anechoic orchestra datasets, comprised of real recordings. The score-informed model improves separation results compared to a baseline approach, but struggles to generalize from synthetic to real data, whereas the score-only model shows a clear improvement in synthetic-to-real generalization.
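A hedged sketch of the score-informed input construction: a piano-roll score, time-aligned to the spectrogram frames, is stacked with the mixture magnitude spectrogram before being fed to the model. The shapes are illustrative, and the paper's model may combine the two differently (e.g., as separate input channels).

```python
import numpy as np

def score_informed_input(mag_spec, piano_roll):
    """Stack a magnitude spectrogram (freq_bins x frames) with a frame-aligned
    piano roll (pitches x frames) along the feature axis. Illustrative only."""
    assert mag_spec.shape[1] == piano_roll.shape[1], "score must be time-aligned to the audio frames"
    return np.concatenate([mag_spec, piano_roll.astype(mag_spec.dtype)], axis=0)

# Dummy mixture spectrogram and binary piano roll.
x = score_informed_input(np.abs(np.random.randn(513, 400)), np.random.rand(128, 400) > 0.95)
print(x.shape)  # (641, 400)
```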
https://arxiv.org/abs/2503.07352
Real-time computer-based accompaniment for human musical performances entails three critical tasks: identifying what the performer is playing, locating their position within the score, and synchronously playing the accompanying parts. Among these, the second task (score following) has been addressed through methods such as dynamic programming on string sequences, Hidden Markov Models (HMMs), and Online Time Warping (OLTW). Yet, the remarkably successful techniques of Deep Learning (DL) have not been directly applied to this problem. Therefore, we introduce HeurMiT, a novel DL-based score-following framework that utilizes a neural architecture designed to learn compressed latent representations, enabling precise performer tracking despite deviations from the score. In parallel, we implement a real-time MIDI data augmentation toolkit aimed at enhancing the robustness of these learned representations. Additionally, we integrate the overall system with simple heuristic rules to create a comprehensive framework that can interface seamlessly with existing transcription and accompaniment technologies. However, thorough experimentation reveals that despite its impressive computational efficiency, HeurMiT's underlying limitations prevent it from being practical in real-world score-following scenarios. Consequently, we present our work as an introductory exploration into the world of DL-based score followers, while highlighting some promising avenues to encourage future research towards robust, state-of-the-art neural score-following systems.
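A rough sketch of how learned latent representations plus a simple heuristic could be combined for score following: compare the latent of the current performance window with the latents of score windows, restricting the search to a neighbourhood of the previous position. This is not HeurMiT's actual procedure; the function names and search radius are assumptions.

```python
import numpy as np

def follow(score_latents, perf_latent, prev_pos, search_radius=20):
    """One score-following step: pick the score frame whose latent is most similar
    (cosine similarity) to the current performance latent, restricted by a simple
    heuristic to a window around the previous position. Illustrative only."""
    lo = max(0, prev_pos - search_radius)
    hi = min(len(score_latents), prev_pos + search_radius + 1)
    cand = score_latents[lo:hi]
    sims = cand @ perf_latent / (np.linalg.norm(cand, axis=1) * np.linalg.norm(perf_latent) + 1e-9)
    return lo + int(np.argmax(sims))

rng = np.random.default_rng(0)
score = rng.standard_normal((1000, 64))                    # stand-in score latents
noisy = score[412] + 0.1 * rng.standard_normal(64)         # stand-in performance latent
print(follow(score, noisy, prev_pos=410))                  # expected to land near frame 412
```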
https://arxiv.org/abs/2503.06348
Generative systems of musical accompaniments are rapidly growing, yet there are no standardized metrics to evaluate how well generations align with the conditional audio prompt. We introduce a distribution-based measure called "Accompaniment Prompt Adherence" (APA), and validate it through objective experiments on synthetic data perturbations, and human listening tests. Results show that APA aligns well with human judgments of adherence and is discriminative to transformations that degrade adherence. We release a Python implementation of the metric using the widely adopted pre-trained CLAP embedding model, offering a valuable tool for evaluating and comparing accompaniment generation systems.
https://arxiv.org/abs/2503.06346
Array-geometry-agnostic speech separation (AGA-SS) aims to develop an effective separation method regardless of the microphone array geometry. Conventional methods rely on permutation-free operations, such as summation or attention mechanisms, to capture spatial information. However, these approaches often incur high computational costs or disrupt the effective use of spatial information during intra- and inter-channel interactions, leading to suboptimal performance. To address these issues, we propose UniArray, a novel approach that abandons the conventional interleaving manner. UniArray consists of three key components: a virtual microphone estimation (VME) module, a feature extraction and fusion module, and a hierarchical dual-path separator. The VME ensures robust performance across arrays with varying channel numbers. The feature extraction and fusion module leverages a spectral feature extraction module and a spatial dictionary learning (SDL) module to extract and fuse frequency-bin-level features, allowing the separator to focus on using the fused features. The hierarchical dual-path separator models feature dependencies along the time and frequency axes while maintaining computational efficiency. Experimental results show that UniArray outperforms state-of-the-art methods in SI-SDRi, WB-PESQ, NB-PESQ, and STOI across both seen and unseen array geometries.
https://arxiv.org/abs/2503.05110
This work integrates language AI-based voice communication understanding with collision risk assessment. The proposed framework consists of two major parts: (a) Automatic Speech Recognition (ASR); (b) surface collision risk modeling. The ASR module generates information tables by processing voice communication transcripts, which serve as references for producing potential taxi plans and calculating the surface movement collision risk. For ASR, we collect and annotate our own Named Entity Recognition (NER) dataset based on open-source video recordings and safety investigation reports. Additionally, we refer to FAA Order JO 7110.65W and FAA Order JO 7340.2N to obtain the list of heuristic rules and phrase contractions used in communication between the pilot and the Air Traffic Controller (ATCo) in daily aviation operations. Then, we propose the novel ATC Rule-Enhanced NER method, which integrates the heuristic rules into the model training and inference stages, resulting in a hybrid rule-based NER model. We show the effectiveness of this hybrid approach by comparing different setups with different token-level embedding models. For the risk modeling, we adopt the node-link airport layout graph from NASA FACET, model the aircraft taxi speed at each link as a log-normal distribution, and derive the total taxi time distribution. Then, we propose a spatiotemporal formulation of the risk probability of two aircraft moving across potential collision nodes during ground movement. We show the effectiveness of our approach by simulating two case studies: (a) the Haneda Airport runway collision that happened in January 2024; (b) the KATL taxiway collision that happened in September 2024. We show that, by understanding pilot-ATC communication transcripts and analyzing surface movement patterns, the proposed model improves airport safety by providing timely risk assessment.
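A hedged Monte Carlo sketch of the surface-risk idea described above: per-link taxi speeds drawn from a log-normal distribution yield a total taxi time distribution, and the probability that two aircraft reach a shared node within a small time margin is estimated by sampling. The link lengths, log-speed parameters, and margin are illustrative, not the paper's values.

```python
import numpy as np

rng = np.random.default_rng(0)

def arrival_time_samples(link_lengths_m, mu, sigma, n=10_000):
    """Sample the time to traverse a taxi route whose per-link speeds are
    log-normal (mu, sigma are parameters of log-speed in m/s). Illustrative only."""
    speeds = rng.lognormal(mu, sigma, size=(n, len(link_lengths_m)))   # (samples, links)
    return (np.asarray(link_lengths_m) / speeds).sum(axis=1)           # seconds over all links

def conflict_probability(route_a, route_b, margin_s=10.0, **kw):
    """P(|T_a - T_b| < margin) at a shared node reached at the end of both routes."""
    t_a = arrival_time_samples(route_a, **kw)
    t_b = arrival_time_samples(route_b, **kw)
    return np.mean(np.abs(t_a - t_b) < margin_s)

# Two aircraft converging on the same intersection over different link sequences.
p = conflict_probability([300, 450, 200], [500, 400], mu=np.log(8.0), sigma=0.3)
print(f"estimated conflict probability: {p:.3f}")
```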
https://arxiv.org/abs/2503.04974
Humans exhibit a remarkable ability to focus auditory attention in complex acoustic environments, such as cocktail parties. Auditory attention detection (AAD) aims to identify the attended speaker by analyzing brain signals, such as electroencephalography (EEG) data. Existing AAD algorithms often leverage deep learning's powerful nonlinear modeling capabilities, but few consider the neural mechanisms underlying auditory processing in the brain. In this paper, we propose SincAlignNet, a novel network based on an improved SincNet and contrastive learning, designed to align audio and EEG features for auditory attention detection. The SincNet component simulates the brain's processing of audio during auditory attention, while contrastive learning guides the model to learn the relationship between EEG signals and attended speech. During inference, we calculate the cosine similarity between EEG and audio features and also explore direct inference of the attended speaker using EEG data. Cross-trial evaluation results demonstrate that SincAlignNet outperforms state-of-the-art AAD methods on two publicly available datasets, KUL and DTU, achieving average accuracies of 78.3% and 92.2%, respectively, with a 1-second decision window. The model exhibits strong interpretability, revealing that the left and right temporal lobes are more active during both male and female speaker scenarios. Furthermore, we found that using data from only six electrodes near the temporal lobes maintains similar or even better performance compared to using 64 electrodes. These findings indicate that efficient low-density EEG online decoding is achievable, marking an important step toward the practical implementation of neuro-guided hearing aids in real-world applications. Code is available at: this https URL.
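A small sketch of the cosine-similarity inference step mentioned above: embed the EEG segment and each candidate speech stream, then pick the speaker whose audio embedding is most similar to the EEG embedding. The encoders are stand-ins (random tensors here), not SincAlignNet itself.

```python
import torch
import torch.nn.functional as F

def decode_attention(eeg_embedding, audio_embeddings):
    """Pick the attended speaker as the audio embedding with the highest cosine
    similarity to the EEG embedding.

    eeg_embedding    : (D,) tensor from an EEG encoder (stand-in here).
    audio_embeddings : (num_speakers, D) tensor from an audio encoder.
    """
    sims = F.cosine_similarity(eeg_embedding.unsqueeze(0), audio_embeddings, dim=-1)
    return int(sims.argmax()), sims

attended, sims = decode_attention(torch.randn(128), torch.randn(2, 128))
print(attended, sims.tolist())
```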
https://arxiv.org/abs/2503.04156
Two algorithms for combined acoustic echo cancellation (AEC) and noise reduction (NR) are analysed, namely the generalised echo and interference canceller (GEIC) and the extended multichannel Wiener filter (MWFext). Previously, these algorithms have been examined for linear echo paths, and assuming access to voice activity detectors (VADs) that separately detect desired speech and echo activity. However, algorithms implementing VADs may introduce detection errors. Therefore, in this paper, the previous analyses are extended by 1) modelling general nonlinear echo paths by means of the generalised Bussgang decomposition, and 2) modelling VAD error effects in each specific algorithm, thereby also allowing specific VAD assumptions to be modelled. It is found, and verified with simulations, that the MWFext generally achieves a higher NR performance, while the GEIC achieves a more robust AEC performance.
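For reference, a Bussgang-type decomposition splits the output of a nonlinear echo path driven by a (zero-mean Gaussian) loudspeaker signal into a linearly scaled copy of the input plus a distortion term uncorrelated with it; the notation below is mine, not necessarily the paper's.

```latex
% Bussgang-type decomposition of a nonlinear echo path f(\cdot)
% driven by the loudspeaker signal x[n]:
y[n] = f\big(x[n]\big) = \alpha\, x[n] + d[n],
\qquad
\alpha = \frac{\mathbb{E}\{ x[n]\, f(x[n]) \}}{\mathbb{E}\{ x[n]^2 \}},
\qquad
\mathbb{E}\{ x[n]\, d[n] \} = 0 .
```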
https://arxiv.org/abs/2503.03593
Prior approaches to lead instrument detection primarily analyze mixture audio, are limited to coarse classifications, and lack generalization ability. This paper presents a novel approach to lead instrument detection in multitrack music audio by crafting expertly annotated datasets and designing a novel framework that integrates a self-supervised learning model with a track-wise, frame-level attention-based classifier. This attention mechanism dynamically extracts and aggregates track-specific features based on their auditory importance, enabling precise detection across varied instrument types and combinations. Enhanced by track classification and permutation augmentation, our model substantially outperforms existing SVM and CRNN models, showing robustness on unseen instruments and in out-of-domain testing. We believe our exploration provides valuable insights for future research on audio content analysis in multitrack music settings.
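A hedged torch sketch of track-wise, frame-level attention: per-track frame embeddings are scored, the scores are softmax-normalised across tracks at each frame, and the weighted sum feeds a classifier. The dimensions and heads are illustrative, not the paper's model.

```python
import torch
import torch.nn as nn

class TrackAttention(nn.Module):
    """Frame-level attention over tracks: at every frame, tracks are weighted by an
    estimated importance score before classification. Sizes are illustrative."""
    def __init__(self, dim=256, num_classes=12):
        super().__init__()
        self.score = nn.Linear(dim, 1)              # per-track, per-frame importance score
        self.classify = nn.Linear(dim, num_classes)

    def forward(self, track_feats):
        # track_feats: (batch, tracks, frames, dim)
        weights = torch.softmax(self.score(track_feats), dim=1)   # normalise across tracks
        pooled = (weights * track_feats).sum(dim=1)               # (batch, frames, dim)
        return self.classify(pooled), weights.squeeze(-1)         # logits + attention map

logits, attn = TrackAttention()(torch.randn(2, 5, 100, 256))
print(logits.shape, attn.shape)  # torch.Size([2, 100, 12]) torch.Size([2, 5, 100])
```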
https://arxiv.org/abs/2503.03232
Implicit neural representations (INR) have gained prominence for efficiently encoding multimedia data, yet their applications in audio signals remain limited. This study introduces the Kolmogorov-Arnold Network (KAN), a novel architecture using learnable activation functions, as an effective INR model for audio representation. KAN demonstrates superior perceptual performance over previous INRs, achieving the lowest Log-Spectral Distance of 1.29 and the highest Perceptual Evaluation of Speech Quality of 3.57 for 1.5 s audio. To extend KAN's utility, we propose FewSound, a hypernetwork-based architecture that enhances INR parameter updates. FewSound outperforms the state-of-the-art HyperSound, with a 33.3% improvement in MSE and 60.87% in SI-SNR. These results show KAN to be a robust and adaptable audio representation with the potential for scalability and integration into various hypernetwork frameworks. The source code can be accessed at this https URL.
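A simplified sketch of the INR setup: a coordinate network maps a normalised time coordinate to a sample value, and every input-output edge gets its own learnable activation. Here the activations are parameterised over a fixed Gaussian basis as a stand-in; the paper's KAN may use B-splines or a different parameterisation, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class RBFKANLayer(nn.Module):
    """Simplified KAN-style layer: each input-output edge has a learnable activation,
    parameterised by coefficients over a fixed Gaussian basis (a common simplification)."""
    def __init__(self, in_dim, out_dim, num_basis=16, grid=(-1.0, 1.0)):
        super().__init__()
        self.register_buffer("centers", torch.linspace(*grid, num_basis))
        self.width = (grid[1] - grid[0]) / num_basis
        self.coeff = nn.Parameter(torch.randn(in_dim, out_dim, num_basis) * 0.1)

    def forward(self, x):                           # x: (batch, in_dim)
        # Gaussian basis responses per input coordinate: (batch, in_dim, num_basis)
        phi = torch.exp(-((x.unsqueeze(-1) - self.centers) / self.width) ** 2)
        # Sum the learnable edge activations over inputs and basis functions.
        return torch.einsum("bik,iok->bo", phi, self.coeff)

# INR setup: map a normalised time coordinate to an audio sample value.
inr = nn.Sequential(RBFKANLayer(1, 64), RBFKANLayer(64, 64), RBFKANLayer(64, 1))
t = torch.linspace(-1, 1, 24_000).unsqueeze(-1)     # 1.5 s at 16 kHz
audio = inr(t)
print(audio.shape)  # torch.Size([24000, 1])
```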
https://arxiv.org/abs/2503.02585
The vast amounts of audio data collected in Sound Event Detection (SED) applications require efficient annotation strategies to enable supervised learning. Manual labeling is expensive and time-consuming, making Active Learning (AL) a promising approach for reducing annotation effort. We introduce Top K Entropy, a novel uncertainty aggregation strategy for AL that prioritizes the most uncertain segments within an audio recording, instead of averaging uncertainty across all segments. This approach enables the selection of entire recordings for annotation, improving efficiency in sparse data scenarios. We compare Top K Entropy to random sampling and Mean Entropy, and show that fewer labels can lead to the same model performance, particularly in datasets with sparse sound events. Evaluations are conducted on audio mixtures of sound recordings from parks with meerkat, dog, and baby crying sound events, representing real-world bioacoustic monitoring scenarios. Using Top K Entropy for active learning, we can achieve comparable performance to training on the fully labeled dataset with only 8% of the labels. Top K Entropy outperforms Mean Entropy, suggesting that it is best to let the most uncertain segments represent the uncertainty of an audio file. The findings highlight the potential of AL for scalable annotation in audio and time-series applications, including bioacoustics.
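A minimal numpy sketch of the aggregation idea: per-segment predictive entropies within a recording are averaged over only the K most uncertain segments, and recordings are ranked by that score for annotation. K, the class layout, and the synthetic data are illustrative.

```python
import numpy as np

def segment_entropy(probs, eps=1e-12):
    """Predictive entropy per segment; probs has shape (num_segments, num_classes)."""
    return -(probs * np.log(probs + eps)).sum(axis=1)

def top_k_entropy(probs, k=5):
    """Aggregate a recording's uncertainty as the mean entropy of its K most
    uncertain segments (instead of the mean over all segments)."""
    ent = segment_entropy(probs)
    k = min(k, len(ent))
    return np.sort(ent)[-k:].mean()

def rank_recordings(recordings, k=5):
    """Return recording indices ordered from most to least informative."""
    scores = [top_k_entropy(p, k) for p in recordings]
    return np.argsort(scores)[::-1]

# Two recordings of segment-wise class probabilities; the second contains a few
# maximally uncertain segments and should be selected first.
rng = np.random.default_rng(0)
recs = [rng.dirichlet([5, 1, 1], size=50),
        np.vstack([rng.dirichlet([5, 1, 1], size=45), np.full((5, 3), 1 / 3)])]
print(rank_recordings(recs))  # [1 0]
```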
https://arxiv.org/abs/2503.02422
We propose a method for accurately detecting bioacoustic sound events that is robust to overlapping events, a common issue in domains such as ethology, ecology and conservation. While standard methods employ a frame-based, multi-label approach, we introduce an onset-based detection method which we name Voxaboxen. It takes inspiration from object detection methods in computer vision, but simultaneously takes advantage of recent advances in self-supervised audio encoders. For each time window, Voxaboxen predicts whether it contains the start of a vocalization and how long the vocalization is. It also does the same in reverse, predicting whether each window contains the end of a vocalization, and how long ago it started. The two resulting sets of bounding boxes are then fused using a graph-matching algorithm. We also release a new dataset designed to measure performance on detecting overlapping vocalizations. This consists of recordings of zebra finches annotated with temporally strong labels and showing frequent overlaps. We test Voxaboxen on seven existing datasets and on our new dataset. We compare Voxaboxen to natural baselines and existing sound event detection methods and demonstrate state-of-the-art (SotA) results. Further experiments show that the improvements are robust to frequent vocalization overlap.
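A hedged sketch of the fusion step: the onset-anchored and offset-anchored boxes are matched one-to-one by maximising temporal IoU with the Hungarian algorithm, and matched pairs are merged. This is the general idea only, not the released Voxaboxen code; the IoU threshold and merging rule are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def interval_iou(a, b):
    """IoU of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def fuse_boxes(forward, backward, min_iou=0.5):
    """Match onset-anchored (`forward`) and offset-anchored (`backward`) boxes and
    merge matched pairs; unmatched or poorly overlapping boxes are dropped here."""
    if not forward or not backward:
        return forward or backward
    iou = np.array([[interval_iou(f, b) for b in backward] for f in forward])
    rows, cols = linear_sum_assignment(-iou)            # maximise total IoU
    fused = []
    for r, c in zip(rows, cols):
        if iou[r, c] >= min_iou:                        # average the two matched estimates
            fused.append(((forward[r][0] + backward[c][0]) / 2,
                          (forward[r][1] + backward[c][1]) / 2))
    return fused

print(fuse_boxes([(0.9, 2.1), (4.0, 5.0)], [(1.0, 2.0), (4.1, 5.2)]))
```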
https://arxiv.org/abs/2503.02389