Estimating the sound absorption in situ relies on accurately describing the measured sound field. Evidence suggests that modeling the reflection of impinging spherical waves is important, especially for compact measurement systems. This article proposes a method for estimating the sound absorption coefficient of a material sample by mapping the sound pressure, measured by a microphone array, to a distribution of monopoles along a line in the complex plane. The proposed method is compared to modeling the sound field as a superposition of two sources (a monopole and an image source). The obtained inverse problems are solved with Tikhonov regularization, with automatic choice of the regularization parameter by the L-curve criterion. The sound absorption measurement is tested with simulations of the sound field above infinite and finite porous absorbers. The approaches are compared to the plane-wave absorption coefficient and the one obtained by spherical wave incidence. Experimental analysis of two porous samples and one resonant absorber is also carried out in situ. Four arrays were tested with an increasing aperture and number of sensors. It was demonstrated that measurements are feasible even with an array with only a few microphones. The discretization of the integral equation led to a more accurate reconstruction of the sound pressure and particle velocity at the sample's surface. The resulting absorption coefficient agrees with the one obtained for spherical wave incidence, indicating that including more monopoles along the complex line is an essential feature of the sound field.
原位估计声吸收依赖于对测量声场的准确描述。有证据表明,对入射球面波反射的建模非常重要,尤其对于紧凑型测量系统。本文提出了一种估计材料样本声吸收系数的方法:将麦克风阵列测得的声压映射到复平面中一条直线上的单极子分布。该方法与将声场建模为两个声源(一个单极子和一个镜像源)叠加的方法进行了比较。所得到的逆问题采用Tikhonov正则化求解,并通过L曲线准则自动选择正则化参数。通过模拟无限和有限多孔吸收材料上方的声场,对该声吸收测量方法进行了测试,并将各方法与平面波吸收系数以及球面波入射下获得的吸收系数进行了比较。此外,还对两个多孔样本和一个共振吸声体进行了原位实验分析,测试了孔径和传感器数量依次增加的四个阵列。结果表明,即使阵列只有少量麦克风,测量也是可行的。对积分方程进行离散化可以更准确地重建样本表面的声压和质点速度。得到的吸收系数与球面波入射下的吸收系数一致,表明沿复平面中的直线引入更多单极子是该声场的一个重要特征。
https://arxiv.org/abs/2404.11399
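The inverse problem above is solved with Tikhonov regularization and an L-curve choice of the regularization parameter. A minimal sketch of that machinery, with a synthetic transfer matrix and noisy measurements standing in for the array model (all values illustrative, not the paper's setup):

```python
import numpy as np

# Hedged sketch: Tikhonov-regularized least squares with an L-curve
# parameter sweep. The transfer matrix H and measured pressures p are
# synthetic stand-ins for the monopole-distribution model.

def tikhonov_solve(H, p, lam):
    """Solve min ||H q - p||^2 + lam^2 ||q||^2 via the normal equations."""
    A = H.conj().T @ H + lam**2 * np.eye(H.shape[1])
    return np.linalg.solve(A, H.conj().T @ p)

def l_curve_corner(H, p, lams):
    """Pick the lambda at the point of maximum curvature of the
    (log residual norm, log solution norm) curve."""
    rho = np.array([np.linalg.norm(H @ tikhonov_solve(H, p, l) - p) for l in lams])
    eta = np.array([np.linalg.norm(tikhonov_solve(H, p, l)) for l in lams])
    x, y = np.log(rho), np.log(eta)
    dx, dy = np.gradient(x), np.gradient(y)          # discrete curvature
    ddx, ddy = np.gradient(dx), np.gradient(dy)
    kappa = np.abs(dx * ddy - dy * ddx) / (dx**2 + dy**2) ** 1.5
    return lams[np.argmax(kappa)]

rng = np.random.default_rng(0)
H = rng.standard_normal((32, 64))                 # array-to-source transfer matrix
q_true = rng.standard_normal(64)
p = H @ q_true + 0.05 * rng.standard_normal(32)   # noisy measurements
lam = l_curve_corner(H, p, np.logspace(-4, 1, 50))
q = tikhonov_solve(H, p, lam)
```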
In short video and live broadcasts, speech, singing voice, and background music often overlap and obscure each other. This complexity creates difficulties in structuring and recognizing the audio content, which may impair subsequent ASR and music understanding applications. This paper proposes a multi-task audio source separation (MTASS) based ASR model called JRSV, which Jointly Recognizes Speech and singing Voices. Specifically, the MTASS module separates the mixed audio into distinct speech and singing voice tracks while removing background music. The CTC/attention hybrid recognition module recognizes both tracks. Online distillation is proposed to improve the robustness of recognition further. To evaluate the proposed methods, a benchmark dataset is constructed and released. Experimental results demonstrate that JRSV can significantly improve recognition accuracy on each track of the mixed audio.
在短视频和直播中,语音、歌声和背景音乐经常相互重叠、相互掩盖。这种复杂性给音频内容的结构化和识别带来了困难,并可能影响后续的ASR和音乐理解应用。本文提出了一种基于多任务音频源分离(MTASS)的ASR模型,称为JRSV,可联合识别语音和歌声。具体来说,MTASS模块将混合音频分离为独立的语音和歌声音轨,同时去除背景音乐;CTC/attention混合识别模块对两条音轨分别进行识别。为进一步提高识别的鲁棒性,还提出了在线蒸馏方法。为了评估所提出的方法,我们构建并发布了一个基准数据集。实验结果表明,JRSV可以显著提高混合音频中每条音轨的识别准确率。
https://arxiv.org/abs/2404.11275
This paper presents a computationally efficient and distributed speaker diarization framework for networked IoT-style audio devices. The work proposes a Federated Learning model which can identify the participants in a conversation without the requirement of a large audio database for training. An unsupervised online update mechanism is proposed for the Federated Learning model which depends on cosine similarity of speaker embeddings. Moreover, the proposed diarization system solves the problem of speaker change detection via unsupervised segmentation techniques using Hotelling's t-squared statistic and the Bayesian Information Criterion. In this new approach, speaker change detection is biased around detected quasi-silences, which reduces the severity of the trade-off between the missed detection and false detection rates. Additionally, the computational overhead due to frame-by-frame identification of speakers is reduced via unsupervised clustering of speech segments. The results demonstrate the effectiveness of the proposed training method in the presence of non-IID speech data. It also shows a considerable improvement in the reduction of false and missed detections at the segmentation stage, while reducing the computational overhead. Improved accuracy and reduced computational cost make the mechanism suitable for real-time speaker diarization across a distributed IoT audio network.
本文为联网的IoT式音频设备提出了一种计算高效的分布式说话人日志(speaker diarization)框架。该工作提出了一种联邦学习模型,无需大型音频数据库进行训练即可识别对话中的参与者。针对该联邦学习模型,提出了一种基于说话人嵌入余弦相似度的无监督在线更新机制。此外,所提出的日志系统利用基于Hotelling t平方统计量和贝叶斯信息准则的无监督分割技术来解决说话人切换检测问题。在这种新方法中,说话人切换检测偏向于检测到的准静音段附近,从而缓解了漏检率与误检率之间权衡的严重程度。同时,通过对语音段进行无监督聚类,降低了逐帧识别说话人带来的计算开销。结果表明,所提出的训练方法在非独立同分布(non-IID)语音数据下依然有效,并且在分割阶段显著减少了误检和漏检,同时降低了计算开销。更高的准确率和更低的计算成本使该机制适用于分布式IoT音频网络中的实时说话人日志。
https://arxiv.org/abs/2404.10842
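The unsupervised online update described above keys on cosine similarity between speaker embeddings. A minimal sketch of such a centroid update rule, where the threshold and momentum values are illustrative assumptions rather than the paper's settings:

```python
import numpy as np

# Hedged sketch: a new embedding is merged into the closest enrolled
# speaker centroid when cosine similarity exceeds a threshold; otherwise
# it starts a new speaker. Threshold and update rule are assumptions.

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def online_update(centroids, emb, threshold=0.7, momentum=0.9):
    if centroids:
        sims = [cosine(c, emb) for c in centroids]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            # exponential moving-average update of the matched centroid
            centroids[best] = momentum * centroids[best] + (1 - momentum) * emb
            return best
    centroids.append(emb.copy())        # enroll a new speaker
    return len(centroids) - 1

centroids = []
spk0 = online_update(centroids, np.array([1.0, 0.0, 0.0]))
spk1 = online_update(centroids, np.array([0.0, 1.0, 0.0]))   # dissimilar -> new speaker
again = online_update(centroids, np.array([0.9, 0.1, 0.0]))  # similar -> speaker 0
```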
Which visual descriptors are suitable for multi-modal interaction, and how can they be integrated via real-time video data analysis into a corpus-based concatenative synthesis sound system?
哪些视觉描述符适合多模态交互?如何通过实时视频数据分析将它们集成到基于语料库的拼接合成(concatenative synthesis)声音系统中?
https://arxiv.org/abs/2404.10578
We present an algorithm for detecting and tracking underwater mobile objects using active acoustic transmission of broadband chirp signals whose reflections are received by a hydrophone array. The method overcomes the problem of high false alarm rate by applying a track-before-detect approach to the sequence of received reflections. A 2D time-space matrix is created for the reverberations received from each transmitted probe signal by performing delay-and-sum beamforming and pulse compression. The result is filtered by a 2D constant false alarm rate (CFAR) detector to identify reflection patterns corresponding to potential targets. Closely spaced signals for multiple probe transmissions are combined into blobs to avoid multiple detections of a single object. A track-before-detect method using a Nearly Constant Velocity (NCV) model is employed to track multiple objects. The position and velocity are estimated by the debiased converted measurement Kalman filter. Results are analyzed for simulated scenarios and for experiments at sea, where GPS-tagged gilt-head seabream fish were tracked. Compared to two benchmark schemes, the results show favorable track continuity and accuracy that are robust to the choice of detection threshold.
我们提出了一种检测和跟踪水下移动目标的算法:主动发射宽带线性调频(chirp)信号,其反射由水听器阵列接收。该方法通过对接收到的反射序列应用检测前跟踪(track-before-detect)方法,克服了高虚警率的问题。对每个发射的探测信号,通过延迟求和波束形成和脉冲压缩,为接收到的混响构建一个二维时间-空间矩阵,再经二维恒虚警率(CFAR)检测器滤波,以识别对应潜在目标的反射模式。多次探测发射中相邻的信号被合并为团块(blob),以避免对单个目标的多次检测。采用基于近似匀速(Nearly Constant Velocity, NCV)模型的检测前跟踪方法来跟踪多个目标,并通过去偏转换测量卡尔曼滤波器估计位置和速度。我们在仿真场景和海上实验中分析了结果,其中跟踪了带有GPS标签的金头鲷。与两个基准方案相比,结果显示该方法具有更好的跟踪连续性和精度,并且对检测阈值的选择具有鲁棒性。
https://arxiv.org/abs/2404.10316
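The detection stage above applies a CFAR detector to the beamformed time-space matrix. A 1D cell-averaging CFAR along the delay axis shows the principle (the paper uses a 2D variant; guard/training sizes and the scale factor here are illustrative assumptions):

```python
import numpy as np

# Hedged sketch of a cell-averaging CFAR detector: each cell is compared
# against a threshold set from the mean of surrounding training cells,
# skipping guard cells next to the cell under test.

def ca_cfar(power, n_train=8, n_guard=2, scale=4.0):
    n = len(power)
    hits = np.zeros(n, dtype=bool)
    for i in range(n):
        lo = list(range(max(0, i - n_guard - n_train), max(0, i - n_guard)))
        hi = list(range(min(n, i + n_guard + 1), min(n, i + n_guard + 1 + n_train)))
        train = np.r_[power[lo], power[hi]]
        if train.size and power[i] > scale * train.mean():
            hits[i] = True
    return hits

rng = np.random.default_rng(1)
power = rng.exponential(1.0, 256)   # reverberation-like noise floor
power[100] += 40.0                  # strong echo from a target
hits = ca_cfar(power)
```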
Audio-based generative models for music have seen great strides recently, but so far have not managed to produce full-length music tracks with coherent musical structure. We show that by training a generative model on long temporal contexts it is possible to produce long-form music of up to 4m45s. Our model consists of a diffusion-transformer operating on a highly downsampled continuous latent representation (latent rate of 21.5Hz). It obtains state-of-the-art generations according to metrics on audio quality and prompt alignment, and subjective tests reveal that it produces full-length music with coherent structure.
近年来,基于音频的音乐生成模型取得了长足进步,但迄今为止仍无法生成具有连贯音乐结构的完整长度音乐曲目。我们证明,通过在长时间上下文上训练生成模型,可以生成长达4分45秒的长篇音乐。我们的模型由一个扩散Transformer组成,它在高度降采样的连续潜在表示上运行(潜在表示速率为21.5Hz)。在音频质量和提示对齐的指标上,其生成结果达到了最先进水平;主观测试也表明,它能够生成结构连贯的完整长度音乐。
https://arxiv.org/abs/2404.10301
The neural semi-Markov Conditional Random Field (semi-CRF) framework has demonstrated promise for event-based piano transcription. In this framework, all events (notes or pedals) are represented as closed intervals tied to specific event types. The neural semi-CRF approach requires an interval scoring matrix that assigns a score for every candidate interval. However, designing an efficient and expressive architecture for scoring intervals is not trivial. In this paper, we introduce a simple method for scoring intervals using scaled inner product operations that resemble how attention scoring is done in transformers. We show theoretically that, due to the special structure from encoding the non-overlapping intervals, under a mild condition, the inner product operations are expressive enough to represent an ideal scoring matrix that can yield the correct transcription result. We then demonstrate that an encoder-only non-hierarchical transformer backbone, operating only on a low-time-resolution feature map, is capable of transcribing piano notes and pedals with high accuracy and time precision. The experiment shows that our approach achieves the new state-of-the-art performance across all subtasks in terms of the F1 measure on the Maestro dataset.
神经半马尔可夫条件随机场(semi-CRF)框架在基于事件的钢琴转录方面已展现出潜力。在该框架中,所有事件(音符或踏板)都表示为与特定事件类型绑定的闭区间。神经semi-CRF方法需要一个区间评分矩阵,为每个候选区间分配分数;然而,为区间评分设计高效且富有表现力的架构并不容易。在本文中,我们提出了一种使用缩放内积进行区间评分的简单方法,类似于Transformer中注意力分数的计算方式。我们在理论上证明,得益于非重叠区间编码所带来的特殊结构,在温和条件下,内积运算具有足够的表现力来表示一个能产生正确转录结果的理想评分矩阵。随后我们展示,一个仅含编码器的非层次化Transformer骨干网络,仅在低时间分辨率特征图上运行,就能够以高准确率和高时间精度转录钢琴音符和踏板。实验表明,在Maestro数据集上,我们的方法在所有子任务的F1指标上均达到了新的最先进水平。
https://arxiv.org/abs/2404.09466
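The interval scoring above resembles attention scoring: a scaled inner product between per-frame vectors. A minimal sketch, where the "start"/"end" feature split and all dimensions are illustrative assumptions:

```python
import numpy as np

# Hedged sketch of attention-style interval scoring: each frame gets a
# "start" and an "end" vector, and the score of a candidate interval
# [i, j] is their scaled dot product, as in transformer attention.

rng = np.random.default_rng(0)
T, d = 100, 32                          # frames, feature dimension
starts = rng.standard_normal((T, d))    # stand-ins for learned projections
ends = rng.standard_normal((T, d))

scores = starts @ ends.T / np.sqrt(d)   # scores[i, j] scores interval [i, j]

# only i <= j are valid candidate intervals
valid = np.triu(np.ones((T, T), dtype=bool))
best = np.unravel_index(np.argmax(np.where(valid, scores, -np.inf)), scores.shape)
```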
The advancements of technology have led to the use of multimodal systems in various real-world applications. Among them, the audio-visual systems are one of the widely used multimodal systems. In the recent years, associating face and voice of a person has gained attention due to presence of unique correlation between them. The Face-voice Association in Multilingual Environments (FAME) Challenge 2024 focuses on exploring face-voice association under a unique condition of multilingual scenario. This condition is inspired from the fact that half of the world's population is bilingual and most often people communicate under multilingual scenario. The challenge uses a dataset namely, Multilingual Audio-Visual (MAV-Celeb) for exploring face-voice association in multilingual environments. This report provides the details of the challenge, dataset, baselines and task details for the FAME Challenge.
技术的进步推动了多模态系统在各种现实应用中的使用,其中音频-视觉系统是应用最广泛的多模态系统之一。近年来,由于人脸和语音之间存在独特的关联性,将二者关联起来的研究受到了关注。多语言环境下的人脸-语音关联(FAME)挑战赛2024专注于在多语言这一独特场景下探索人脸-语音关联。这一设定的灵感来自这样一个事实:世界上一半人口会说两种语言,并且人们经常在多语言场景下交流。该挑战赛使用名为多语言音频-视觉(MAV-Celeb)的数据集来探索多语言环境中的人脸-语音关联。本报告提供了FAME挑战赛的详细信息、数据集、基线和任务细节。
https://arxiv.org/abs/2404.09342
Self-supervised learning has emerged as a powerful way to pre-train generalizable machine learning models on large amounts of unlabeled data. It is particularly compelling in the music domain, where obtaining labeled data is time-consuming, error-prone, and ambiguous. During the self-supervised process, models are trained on pretext tasks, with the primary objective of acquiring robust and informative features that can later be fine-tuned for specific downstream tasks. The choice of the pretext task is critical as it guides the model to shape the feature space with meaningful constraints for information encoding. In the context of music, most works have relied on contrastive learning or masking techniques. In this study, we expand the scope of pretext tasks applied to music by investigating and comparing the performance of new self-supervised methods for music tagging. We open-source a simple ResNet model trained on a diverse catalog of millions of tracks. Our results demonstrate that, although most of these pre-training methods result in similar downstream results, contrastive learning consistently results in better downstream performance compared to other self-supervised pre-training methods. This holds true in a limited-data downstream context.
自监督学习已成为在大量无标注数据上预训练可泛化机器学习模型的强大方法。它在音乐领域尤其具有吸引力,因为获取标注数据耗时、易出错且存在歧义。在自监督过程中,模型通过前置任务(pretext task)进行训练,主要目标是获得稳健且信息丰富的特征,以便后续针对特定下游任务进行微调。前置任务的选择至关重要,因为它引导模型以有意义的信息编码约束来塑造特征空间。在音乐领域,大多数工作依赖对比学习或掩码技术。在本研究中,我们通过研究并比较用于音乐标注的新自监督方法的性能,扩展了应用于音乐的前置任务的范围。我们开源了一个在包含数百万首曲目的多样化曲库上训练的简单ResNet模型。结果表明,尽管这些预训练方法的下游结果大多相近,但对比学习的下游性能始终优于其他自监督预训练方法;在下游数据有限的情况下,这一结论同样成立。
https://arxiv.org/abs/2404.09177
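Contrastive pre-training of the kind compared above typically optimizes an InfoNCE-style objective between two augmented views of the same track. A minimal numpy sketch (the temperature and batch shapes are illustrative, not the study's configuration):

```python
import numpy as np

# Hedged sketch of an NT-Xent/InfoNCE-style contrastive loss: row i of
# z1 and row i of z2 are positives; all other rows act as negatives.

def info_nce(z1, z2, tau=0.1):
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                        # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # cross-entropy on the diagonal

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 16))
aligned = info_nce(z, z + 0.01 * rng.standard_normal((8, 16)))  # matched views
random_pairs = info_nce(z, rng.standard_normal((8, 16)))        # unrelated views
```

Matched views yield a much lower loss than unrelated ones, which is the gradient signal that shapes the embedding space.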
Sonification can provide valuable insights about data but most existing approaches are not designed to be controlled by the user in an interactive fashion. Interactions enable the designer of the sonification to more rapidly experiment with sound design and allow the sonification to be modified in real-time by interacting with various control parameters. In this paper, we describe two case studies of interactive sonification that utilize publicly available datasets that have been described recently in the International Conference on Auditory Display (ICAD). They are from the health and energy domains: electroencephalogram (EEG) alpha wave data and air pollutant data consisting of nitrogen dioxide, sulfur dioxide, carbon monoxide, and ozone. We show how these sonifications can be recreated to support interaction utilizing a general interactive sonification framework built using ChucK, Unity, and Chunity. In addition to supporting typical sonification methods that are common in existing sonification toolkits, our framework introduces novel methods such as supporting discrete events, interleaved playback of multiple data streams for comparison, and using frequency modulation (FM) synthesis in terms of one data attribute modulating another. We also describe how these new functionalities can be used to improve the sonification experience of the two datasets we have investigated.
声音化(sonification)可以提供有关数据的宝贵见解,但大多数现有方法并非为用户的交互式控制而设计。交互使声音化设计者能够更快地进行声音设计实验,并允许通过操纵各种控制参数来实时修改声音化。在本文中,我们描述了两个交互式声音化案例研究,它们使用了近期在国际听觉显示会议(ICAD)上介绍过的公开数据集,分别来自健康和能源领域:脑电图(EEG)α波数据,以及由二氧化氮、二氧化硫、一氧化碳和臭氧组成的空气污染物数据。我们展示了如何使用基于ChucK、Unity和Chunity构建的通用交互式声音化框架重建这些声音化以支持交互。除了支持现有声音化工具包中常见的方法外,我们的框架还引入了一些新方法,例如支持离散事件、交替播放多个数据流以便比较,以及用一个数据属性调制另一个数据属性的频率调制(FM)合成。我们还描述了如何利用这些新功能来改善所研究的两个数据集的声音化体验。
https://arxiv.org/abs/2404.08813
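The FM mapping described above, one data attribute modulating another, can be sketched in a few lines. The pollutant-to-parameter mappings, sample rate, and frequencies below are illustrative assumptions, not the paper's design:

```python
import numpy as np

# Hedged sketch of data-driven FM synthesis: one data stream sets the
# carrier frequency while a second drives the modulation depth, so
# changes in the second attribute are heard as timbre changes.

SR = 22050  # sample rate (Hz)

def fm_sonify(carrier_hz, mod_hz, mod_index, dur=0.5):
    t = np.linspace(0, dur, int(SR * dur), endpoint=False)
    # classic phase-modulation form of FM synthesis
    return np.sin(2 * np.pi * carrier_hz * t
                  + mod_index * np.sin(2 * np.pi * mod_hz * t))

# e.g. NO2 level -> carrier pitch, O3 level -> modulation depth
no2, o3 = 0.6, 0.3                      # normalized pollutant readings
tone = fm_sonify(220 + 440 * no2, 110, 8 * o3)
```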
Infinite impulse response filters are an essential building block of many time-varying audio systems, such as audio effects and synthesisers. However, their recursive structure impedes end-to-end training of these systems using automatic differentiation. Although non-recursive filter approximations like frequency sampling and frame-based processing have been proposed and widely used in previous works, they cannot accurately reflect the gradient of the original system. We alleviate this difficulty by re-expressing a time-varying all-pole filter to backpropagate the gradients through itself, so the filter implementation is not bound to the technical limitations of automatic differentiation frameworks. This implementation can be employed within any audio system containing filters with poles for efficient gradient evaluation. We demonstrate its training efficiency and expressive capabilities for modelling real-world dynamic audio systems on a phaser, time-varying subtractive synthesiser, and feed-forward compressor. We make our code available and provide the trained audio effect and synth models in a VST plugin at this https URL.
无限冲激响应滤波器是许多时变音频系统(如音频效果器和合成器)的基本构建模块。然而,其递归结构阻碍了使用自动微分对这些系统进行端到端训练。尽管此前的工作已提出并广泛使用了频率采样和基于帧的处理等非递归滤波器近似,但它们无法准确反映原始系统的梯度。我们通过重新表达时变全极点滤波器、使梯度能够通过滤波器自身反向传播来缓解这一困难,从而使滤波器的实现不再受自动微分框架技术限制的束缚。该实现可用于任何包含有极点滤波器的音频系统,以进行高效的梯度计算。我们在移相器(phaser)、时变减法合成器和前馈压缩器上展示了它在建模真实世界动态音频系统方面的训练效率和表现力。我们公开了代码,并在文中链接处以VST插件形式提供训练好的音频效果器和合成器模型。
https://arxiv.org/abs/2404.07970
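The recursive structure at issue above is the all-pole difference equation. A sketch of just the forward recursion with per-sample coefficients (the paper's contribution is an efficient backward pass for this recursion, which is not reproduced here; shapes and values are illustrative):

```python
import numpy as np

# Hedged sketch of a time-varying all-pole filter forward pass:
#   y[n] = x[n] - sum_k a_k[n] * y[n-k]
# with coefficients that change at every sample.

def time_varying_allpole(x, a):
    """x: (N,) input signal; a: (N, K) pole coefficients per sample."""
    N, K = a.shape
    y = np.zeros(N)
    for n in range(N):
        acc = x[n]
        for k in range(1, K + 1):
            if n - k >= 0:
                acc -= a[n, k - 1] * y[n - k]
        y[n] = acc
    return y

x = np.zeros(64); x[0] = 1.0         # unit impulse
a = np.tile([-0.9], (64, 1))         # a single fixed pole at 0.9
h = time_varying_allpole(x, a)       # impulse response: 0.9**n
```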
Isolating the desired speaker's voice amidst multiple speakers in a noisy acoustic context is a challenging task. Personalized speech enhancement (PSE) endeavours to achieve this by leveraging prior knowledge of the speaker's voice. Recent research efforts have yielded promising PSE models, albeit often accompanied by computationally intensive architectures, unsuitable for resource-constrained embedded devices. In this paper, we introduce a novel method to personalize a lightweight dual-stage Speech Enhancement (SE) model and implement it within DeepFilterNet2, a SE model renowned for its state-of-the-art performance. We seek an optimal integration of speaker information within the model, exploring different positions for the integration of the speaker embeddings within the dual-stage enhancement architecture. We also investigate a tailored training strategy when adapting DeepFilterNet2 to a PSE task. We show that our personalization method greatly improves the performances of DeepFilterNet2 while preserving minimal computational overhead.
在嘈杂的声学环境中,从多个说话人中分离出目标说话人的语音是一项具有挑战性的任务。个性化语音增强(PSE)试图通过利用关于说话人声音的先验知识来实现这一目标。近期的研究已经产生了有前景的PSE模型,但它们往往伴随着计算密集的架构,不适合资源受限的嵌入式设备。在本文中,我们提出了一种对轻量级两阶段语音增强(SE)模型进行个性化的新方法,并在以最先进性能著称的SE模型DeepFilterNet2中加以实现。我们寻求说话人信息在模型中的最优集成方式,探索了说话人嵌入在两阶段增强架构中不同的集成位置。我们还研究了将DeepFilterNet2适配到PSE任务时的定制训练策略。结果表明,我们的个性化方法显著提高了DeepFilterNet2的性能,同时仅带来极小的计算开销。
https://arxiv.org/abs/2404.08022
In this study, we introduce a method for estimating sound fields in reverberant environments using a conditional invertible neural network (CINN). Sound field reconstruction can be hindered by experimental errors, limited spatial data, model mismatches, and long inference times, leading to potentially flawed and prolonged characterizations. Further, the complexity of managing inherent uncertainties often escalates computational demands or is neglected in models. Our approach seeks to balance accuracy and computational efficiency, while incorporating uncertainty estimates to tailor reconstructions to specific needs. By training a CINN with Monte Carlo simulations of random wave fields, our method reduces the dependency on extensive datasets and enables inference from sparse experimental data. The CINN proves versatile at reconstructing Room Impulse Responses (RIRs), by acting either as a likelihood model for maximum a posteriori estimation or as an approximate posterior distribution through amortized Bayesian inference. Compared to traditional Bayesian methods, the CINN achieves similar accuracy with greater efficiency and without requiring its adaptation to distinct sound field conditions.
在这项研究中,我们提出了一种使用条件可逆神经网络(CINN)估计混响环境中声场的方法。声场重建可能受到实验误差、有限的空间数据、模型失配和长推理时间的阻碍,导致刻画结果可能有缺陷且耗时。此外,处理固有不确定性的复杂性往往会加重计算负担,或干脆在模型中被忽略。我们的方法力求在准确性和计算效率之间取得平衡,同时引入不确定性估计,使重建能够针对特定需求进行定制。通过使用随机波场的蒙特卡洛模拟来训练CINN,我们的方法减少了对大规模数据集的依赖,并能够从稀疏的实验数据中进行推理。CINN在重建房间脉冲响应(RIR)方面用途广泛:它既可以作为最大后验估计的似然模型,也可以通过摊销贝叶斯推理充当近似后验分布。与传统贝叶斯方法相比,CINN以更高的效率达到了相近的精度,并且无需针对不同的声场条件进行调整。
https://arxiv.org/abs/2404.06928
In recent years, advancements in neural network designs and the availability of large-scale labeled datasets have led to significant improvements in the accuracy of piano transcription models. However, most previous work focused on high-performance offline transcription, neglecting deliberate consideration of model size. The goal of this work is to implement real-time inference for piano transcription while ensuring both high performance and lightweight. To this end, we propose novel architectures for convolutional recurrent neural networks, redesigning an existing autoregressive piano transcription model. First, we extend the acoustic module by adding a frequency-conditioned FiLM layer to the CNN module to adapt the convolutional filters on the frequency axis. Second, we improve note-state sequence modeling by using a pitchwise LSTM that focuses on note-state transitions within a note. In addition, we augment the autoregressive connection with an enhanced recursive context. Using these components, we propose two types of models; one for high performance and the other for high compactness. Through extensive experiments, we show that the proposed models are comparable to state-of-the-art models in terms of note accuracy on the MAESTRO dataset. We also investigate the effective model size and real-time inference latency by gradually streamlining the architecture. Finally, we conduct cross-data evaluation on unseen piano datasets and in-depth analysis to elucidate the effect of the proposed components in the view of note length and pitch range.
近年来,神经网络设计的进步和大规模有标注数据集的出现使钢琴转录模型的准确性得到了显著提升。然而,之前的工作大多专注于高性能的离线转录,而没有刻意考虑模型大小。本研究的目标是在保证高性能和轻量化的同时,实现钢琴转录的实时推理。为此,我们提出了新颖的卷积循环神经网络架构,重新设计了一个现有的自回归钢琴转录模型。首先,我们在CNN模块中添加了频率条件化的FiLM层来扩展声学模块,使卷积滤波器能够沿频率轴自适应。其次,我们使用关注单个音符内音符状态转移的逐音高(pitchwise)LSTM来改进音符状态序列建模。此外,我们还用增强的递归上下文强化了自回归连接。利用这些组件,我们提出了两类模型:一类追求高性能,另一类追求高紧凑性。通过大量实验,我们证明所提出的模型在MAESTRO数据集上的音符准确率可与最先进模型相媲美。我们还通过逐步精简架构,研究了有效的模型大小和实时推理延迟。最后,我们在未见过的钢琴数据集上进行了跨数据评估,并从音符长度和音高范围的角度深入分析,阐明了所提组件的作用。
https://arxiv.org/abs/2404.06818
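The frequency-conditioned FiLM layer mentioned above modulates a CNN feature map with per-frequency scale and shift vectors. A minimal sketch; the shapes and the source of the conditioning vectors are illustrative assumptions:

```python
import numpy as np

# Hedged sketch of a frequency-conditioned FiLM layer: gamma/beta vectors
# indexed by channel and frequency bin modulate the feature map so the
# convolutional filters effectively adapt along the frequency axis.

def freq_film(features, gamma, beta):
    """features: (channels, freq, time); gamma, beta: (channels, freq)."""
    return gamma[:, :, None] * features + beta[:, :, None]

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 88, 100))    # channels x freq bins x frames
gamma = rng.standard_normal((16, 88))        # stand-ins for learned conditioning
beta = rng.standard_normal((16, 88))
out = freq_film(feat, gamma, beta)
```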
There is increasing interest in the use of the LEArnable Front-end (LEAF) in a variety of speech processing systems. However, there is a dearth of analyses of what is actually learnt and the relative importance of training the different components of the front-end. In this paper, we investigate this question on keyword spotting, speech-based emotion recognition and language identification tasks and find that the filters for spectral decomposition and the low pass filter used to estimate spectral energy variations exhibit no learning and the per-channel energy normalisation (PCEN) is the key component that is learnt. Following this, we explore the potential of adapting only the PCEN layer with a small amount of noisy data to enable it to learn appropriate dynamic range compression that better suits the noise conditions. This in turn enables a system trained on clean speech to work more accurately on noisy test data as demonstrated by the experimental results reported in this paper.
在各种语音处理系统中,可学习前端(LEArnable Front-end, LEAF)的使用日益受到关注。然而,对于前端实际学到了什么,以及训练其各个组件的相对重要性,目前还缺乏分析。在本文中,我们在关键词检测、基于语音的情感识别和语言识别任务上研究了这一问题,发现用于谱分解的滤波器和用于估计谱能量变化的低通滤波器几乎没有学习,而逐通道能量归一化(PCEN)才是真正被学习的关键组件。在此基础上,我们探索了仅用少量含噪数据来调整PCEN层的潜力,使其学习到更适合噪声条件的动态范围压缩。如本文报告的实验结果所示,这使得在干净语音上训练的系统能够在含噪测试数据上更准确地工作。
https://arxiv.org/abs/2404.06702
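PCEN, the component the study finds actually learns, has a compact closed form. A sketch with commonly used default parameter values (not the learned values from the paper):

```python
import numpy as np

# Hedged sketch of per-channel energy normalisation (PCEN):
#   PCEN = (E / (eps + M)^alpha + delta)^r - delta^r
# where M is a first-order IIR smoother of E over time. The smoothing
# coefficient s, alpha, delta, and r are the (trainable) parameters.

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """E: (freq, time) filterbank energies."""
    M = np.zeros_like(E)
    M[:, 0] = E[:, 0]
    for t in range(1, E.shape[1]):
        M[:, t] = (1 - s) * M[:, t - 1] + s * E[:, t]   # temporal smoothing
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r

rng = np.random.default_rng(0)
E = rng.random((40, 200)) + 1e-3     # synthetic positive energies
out = pcen(E)
```

The automatic gain control (division by the smoothed energy M) and the root compression together make the representation robust to loudness and channel variation, which is why adapting these few parameters on noisy data can be effective.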
To achieve a flexible recommendation and retrieval system, it is desirable to calculate music similarity by focusing on multiple partial elements of musical pieces and allowing the users to select the element they want to focus on. A previous study proposed using multiple individual networks for calculating music similarity based on each instrumental sound, but it is impractical to use each signal as a query in search systems. Using separated instrumental sounds alternatively resulted in less accuracy due to artifacts. In this paper, we propose a method to compute similarities focusing on each instrumental sound with a single network that takes mixed sounds as input instead of individual instrumental sounds. Specifically, we design a single similarity embedding space with disentangled dimensions for each instrument, extracted by Conditional Similarity Networks, which is trained by the triplet loss using masks. Experimental results have shown that (1) the proposed method can obtain more accurate feature representation than using individual networks using separated sounds as input, (2) each sub-embedding space can hold the characteristics of the corresponding instrument, and (3) the selection of similar musical pieces focusing on each instrumental sound by the proposed method can obtain human consent, especially in drums and guitar.
为了实现灵活的推荐和检索系统,理想的做法是通过关注乐曲的多个局部元素来计算音乐相似度,并允许用户选择他们想要关注的元素。此前的研究提出基于每种乐器声使用多个独立网络来计算音乐相似度,但在检索系统中将每个乐器信号作为查询并不现实;而改用分离出的乐器声又会因伪影导致精度下降。在本文中,我们提出了一种方法,用一个以混合声音(而非单独乐器声)为输入的单一网络来计算聚焦于各乐器声的相似度。具体来说,我们设计了一个为每种乐器解耦维度的单一相似度嵌入空间,由条件相似网络(Conditional Similarity Networks)提取,并使用带掩码的三元组损失进行训练。实验结果表明:(1)与以分离声音为输入的多个独立网络相比,所提方法可以获得更准确的特征表示;(2)每个子嵌入空间都能保留对应乐器的特性;(3)所提方法聚焦于各乐器声所挑选的相似乐曲能够获得人们的认可,尤其是在鼓和吉他方面。
https://arxiv.org/abs/2404.06682
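The masked triplet loss above restricts the distance computation to the sub-space of the conditioning instrument. A minimal sketch; the embedding dimension, margin, and four-instrument split are illustrative assumptions:

```python
import numpy as np

# Hedged sketch of a masked triplet loss: one embedding is split into
# per-instrument sub-spaces, and a binary mask selects the sub-space of
# the conditioning instrument before computing the margin loss.

def masked_triplet_loss(anchor, positive, negative, mask, margin=0.2):
    d_pos = np.sum(mask * (anchor - positive) ** 2)
    d_neg = np.sum(mask * (anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

dim, n_inst = 32, 4                     # embedding dim, number of instruments
sub = dim // n_inst
mask = np.zeros(dim)
mask[0:sub] = 1.0                       # condition on instrument 0 (e.g. drums)

rng = np.random.default_rng(0)
a = rng.standard_normal(dim)
p = a + 0.01 * rng.standard_normal(dim)  # similar track
n = rng.standard_normal(dim)             # dissimilar track
loss = masked_triplet_loss(a, p, n, mask)
```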
Existing research on music recommendation systems primarily focuses on recommending similar music, thereby often neglecting diverse and distinctive musical recordings. Musical outliers can provide valuable insights due to the inherent diversity of music itself. In this paper, we explore music outliers, investigating their potential usefulness for music discovery and recommendation systems. We argue that not all outliers should be treated as noise, as they can offer interesting perspectives and contribute to a richer understanding of an artist's work. We introduce the concept of 'Genuine' music outliers and provide a definition for them. These genuine outliers can reveal unique aspects of an artist's repertoire and hold the potential to enhance music discovery by exposing listeners to novel and diverse musical experiences.
现有的音乐推荐系统研究主要集中在推荐相似的音乐,因而常常忽略多样而独特的音乐录音。由于音乐本身固有的多样性,音乐离群点可以提供宝贵的见解。在本文中,我们探讨音乐离群点,研究它们对音乐发现和推荐系统的潜在用处。我们认为,并非所有离群点都应被视为噪声,因为它们可以提供有趣的视角,并有助于更全面地理解艺术家的作品。我们引入了“真正的(Genuine)”音乐离群点这一概念并给出其定义。这些真正的离群点可以揭示艺术家曲目库中独特的一面,并有望通过让听众接触新颖多样的音乐体验来增强音乐发现。
https://arxiv.org/abs/2404.06103
Self-supervised learning (SSL) using masked prediction has made great strides in general-purpose audio representation. This study proposes Masked Modeling Duo (M2D), an improved masked prediction SSL, which learns by predicting representations of masked input signals that serve as training signals. Unlike conventional methods, M2D obtains a training signal by encoding only the masked part, encouraging the two networks in M2D to model the input. While M2D improves general-purpose audio representations, a specialized representation is essential for real-world applications, such as in industrial and medical domains. The often confidential and proprietary data in such domains is typically limited in size and has a different distribution from that in pre-training datasets. Therefore, we propose M2D for X (M2D-X), which extends M2D to enable the pre-training of specialized representations for an application X. M2D-X learns from M2D and an additional task and inputs background noise. We make the additional task configurable to serve diverse applications, while the background noise helps learn on small data and forms a denoising task that makes representation robust. With these design choices, M2D-X should learn a representation specialized to serve various application needs. Our experiments confirmed that the representations for general-purpose audio, specialized for the highly competitive AudioSet and speech domain, and a small-data medical task achieve top-level performance, demonstrating the potential of using our models as a universal audio pre-training framework. Our code is available online for future studies at this https URL
使用掩码预测的自监督学习(SSL)在通用音频表示方面取得了长足进步。本研究提出了掩码建模二重奏(Masked Modeling Duo, M2D),一种改进的掩码预测SSL,它通过预测被掩盖输入信号的表示作为训练信号来学习。与传统方法不同,M2D仅对被掩盖部分进行编码来获得训练信号,促使M2D中的两个网络共同对输入进行建模。虽然M2D改进了通用音频表示,但现实应用(如工业和医疗领域)往往需要专门的表示。这类领域的数据通常是保密且专有的,规模有限,并且与预训练数据集的分布不同。因此,我们提出了面向应用X的M2D(M2D-X),将M2D扩展为能够为应用X预训练专门的表示。M2D-X在M2D的基础上学习一个附加任务,并在输入中加入背景噪声。附加任务可以根据需要配置,以服务于多样化的应用;背景噪声则有助于在小数据上学习,并构成一个使表示更加稳健的去噪任务。凭借这些设计选择,M2D-X应能学习到满足各种应用需求的专门表示。我们的实验证实,通用音频表示、针对竞争激烈的AudioSet和语音领域的专门表示,以及小数据医疗任务的表示均达到了顶级性能,展示了将我们的模型用作通用音频预训练框架的潜力。我们的代码已在文中链接处公开,以供后续研究使用。
https://arxiv.org/abs/2404.06095
Previous works on depression detection use datasets collected in similar environments to train and test the models. In practice, however, the train and test distributions cannot be guaranteed to be identical. Distribution shifts can be introduced due to variations such as recording environment (e.g., background noise) and demographics (e.g., gender, age, etc). Such distributional shifts can surprisingly lead to severe performance degradation of the depression detection models. In this paper, we analyze the application of test-time training (TTT) to improve robustness of models trained for depression detection. When compared to regular testing of the models, we find TTT can significantly improve the robustness of the model under a variety of distributional shifts introduced due to: (a) background-noise, (b) gender-bias, and (c) data collection and curation procedure (i.e., train and test samples are from separate datasets).
以往的抑郁检测工作使用在相似环境中收集的数据集来训练和测试模型。然而在实践中,无法保证训练和测试分布完全相同。录音环境(如背景噪声)和人口统计特征(如性别、年龄等)的差异都可能引入分布偏移,而这些分布偏移可能出人意料地导致抑郁检测模型的性能严重下降。在本文中,我们分析了应用测试时训练(TTT)来提高抑郁检测模型鲁棒性的效果。与对模型的常规测试相比,我们发现TTT可以在由以下因素引入的多种分布偏移下显著提高模型的鲁棒性:(a)背景噪声,(b)性别偏差,以及(c)数据收集和整理流程(即训练和测试样本来自不同的数据集)。
https://arxiv.org/abs/2404.05071
We detail the mathematical formulation of the line of "functional quantizer" modules developed by the Mathematics and Music Lab (MML) at Michigan Technological University, for the VCV Rack software modular synthesizer platform, which allow synthesizer players to tune oscillators to new musical scales based on mathematical functions. For example, we describe the recently-released MML Logarithmic Quantizer (LOG QNT) module that tunes synthesizer oscillators to the non-Pythagorean musical scale introduced by pop band The Apples in Stereo.
我们详细介绍了密歇根理工大学数学与音乐实验室(MML)为VCV Rack软件模块化合成器平台开发的一系列“功能量化器”模块的数学表述,这些模块允许合成器演奏者将振荡器调谐到基于数学函数的新音阶。例如,我们描述了最近发布的MML对数量化器(LOG QNT)模块,它将合成器振荡器调谐到由流行乐队The Apples in Stereo引入的非毕达哥拉斯音阶。
https://arxiv.org/abs/2404.04739
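The quantizer idea above, snapping an incoming pitch to the nearest degree of a function-generated scale, can be sketched briefly. The logarithm-based spacing below is only in the spirit of the module; the actual MML/Apples in Stereo tuning is not reproduced here:

```python
import numpy as np

# Hedged sketch of a "functional quantizer": scale degrees are generated
# by a mathematical function (here an illustrative natural-log spacing,
# NOT the actual LOG QNT tuning), and an incoming frequency is snapped
# to the nearest degree.

def log_scale(base_hz=261.63, degrees=12):
    # illustrative: degree n at base * (1 + ln(n)), n = 1..degrees
    return np.array([base_hz * (1 + np.log(n)) for n in range(1, degrees + 1)])

def quantize(freq_hz, scale):
    """Snap an input frequency to the nearest scale degree."""
    return float(scale[np.argmin(np.abs(scale - freq_hz))])

scale = log_scale()
snapped = quantize(440.0, scale)   # nearest degree to concert A
```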