We propose Lodge, a network capable of generating extremely long dance sequences conditioned on given music. We design Lodge as a two-stage coarse-to-fine diffusion architecture, and propose characteristic dance primitives with significant expressiveness as intermediate representations between the two diffusion models. The first stage is a global diffusion model that focuses on comprehending the coarse-level music-dance correlation and producing characteristic dance primitives. The second stage is a local diffusion model that generates detailed motion sequences in parallel under the guidance of the dance primitives and choreographic rules. In addition, we propose a Foot Refine Block to optimize the contact between the feet and the ground, enhancing the physical realism of the motion. Our approach can generate extremely long dance sequences in parallel, striking a balance between global choreographic patterns and local motion quality and expressiveness. Extensive experiments validate the efficacy of our method.
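To make the two-stage structure concrete, here is a toy sketch of the coarse-to-fine pipeline: a global stage produces one characteristic primitive per segment, and a local stage fills in each segment independently (and therefore in parallel), guided by its primitive. The diffusion models themselves are replaced by random placeholders; all names and sizes are illustrative, not Lodge's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
POSE_DIM, SEG_LEN, N_SEGMENTS = 24, 64, 8     # toy sizes

def global_diffusion(music_feats):
    """Stand-in for stage 1: one characteristic dance primitive (a key pose) per segment."""
    return rng.normal(size=(N_SEGMENTS, POSE_DIM))

def local_diffusion(segment_music, primitive):
    """Stand-in for stage 2: a short motion segment that stays close to its primitive."""
    return rng.normal(scale=0.1, size=(SEG_LEN, POSE_DIM)) + primitive

music = rng.normal(size=(N_SEGMENTS, SEG_LEN, 32))        # per-segment music features
primitives = global_diffusion(music)
# The local stage is independent per segment, so all segments can be generated in parallel.
dance = np.concatenate([local_diffusion(music[i], primitives[i]) for i in range(N_SEGMENTS)])
print(dance.shape)   # (512, 24): a long sequence assembled from independently generated segments
```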
https://arxiv.org/abs/2403.10518
Diffusion-based audio and music generation models commonly generate music by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16-24 kHz), which limits their effectiveness. We propose MusicHiFi -- an efficient high-fidelity stereophonic vocoder. Our method employs a cascade of three generative adversarial networks (GANs) that converts low-resolution mel-spectrograms to audio, upsamples to high-resolution audio via bandwidth expansion, and upmixes to stereophonic audio. Compared to previous work, we propose 1) a unified GAN-based generator and discriminator architecture and training procedure for each stage of our cascade, 2) a new fast, near downsampling-compatible bandwidth extension module, and 3) a new fast downmix-compatible mono-to-stereo upmixer that ensures the preservation of monophonic content in the output. We evaluate our approach using both objective and subjective listening tests and find that it yields comparable or better audio quality, better spatialization control, and significantly faster inference speed compared to past work. Sound examples are at this https URL.
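A minimal structural sketch of the three-stage cascade (vocoder, bandwidth extension, mono-to-stereo upmix) is shown below. The modules are toy stand-ins, not the MusicHiFi networks; the mid/side construction in the last stage only illustrates what downmix-compatibility means, i.e., averaging the two output channels recovers the mono input.

```python
import torch
import torch.nn as nn

class ToyVocoder(nn.Module):
    """Stand-in for stage 1: low-resolution mel-spectrogram -> low-resolution mono audio."""
    def __init__(self, n_mels=80, hop=256):
        super().__init__()
        self.hop = hop
        self.net = nn.Conv1d(n_mels, 1, kernel_size=7, padding=3)
    def forward(self, mel):                      # mel: (batch, n_mels, frames)
        audio = self.net(mel)                    # (batch, 1, frames)
        return nn.functional.interpolate(audio, scale_factor=float(self.hop))  # crude upsample to samples

class ToyBandwidthExtender(nn.Module):
    """Stand-in for stage 2: 2x bandwidth expansion (e.g., 22.05 kHz -> 44.1 kHz)."""
    def __init__(self):
        super().__init__()
        self.net = nn.ConvTranspose1d(1, 1, kernel_size=4, stride=2, padding=1)
    def forward(self, audio):
        return self.net(audio)

class ToyMonoToStereo(nn.Module):
    """Stand-in for stage 3: predict a side channel and keep the input as the mid channel,
    so the downmix (L+R)/2 recovers the mono input by construction."""
    def __init__(self):
        super().__init__()
        self.side = nn.Conv1d(1, 1, kernel_size=7, padding=3)
    def forward(self, mono):
        side = self.side(mono)
        left, right = mono + side, mono - side
        return torch.cat([left, right], dim=1)   # (batch, 2, samples)

mel = torch.randn(1, 80, 32)
stereo = ToyMonoToStereo()(ToyBandwidthExtender()(ToyVocoder()(mel)))
print(stereo.shape)  # torch.Size([1, 2, 16384])
```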
https://arxiv.org/abs/2403.10493
Audiovisual emotion recognition (ER) in videos has immense potential over unimodal approaches, as it effectively leverages the inter- and intra-modal dependencies between the visual and auditory modalities. This work proposes a novel audio-visual emotion recognition system utilizing a joint multimodal transformer architecture with key-based cross-attention. The framework aims to exploit the complementary nature of audio and visual cues (facial expressions and vocal patterns) in videos, leading to superior performance compared to relying on a single modality alone. The proposed model uses separate backbones to capture intra-modal temporal dependencies within each modality (audio and visual). Subsequently, a joint multimodal transformer architecture integrates the individual modality embeddings, enabling the model to effectively capture inter-modal (between audio and visual) and intra-modal (within each modality) relationships. Extensive evaluations on the challenging Affwild2 dataset demonstrate that the proposed model significantly outperforms baseline and state-of-the-art methods on ER tasks.
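Cross-attention fusion of this kind can be sketched generically with PyTorch's built-in multi-head attention: audio tokens attend over visual tokens and vice versa before a joint classification head. Dimensions, pooling, and the seven-class output are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Audio queries attend over visual keys/values and vice versa; the two attended
    streams are concatenated and pooled for emotion classification."""
    def __init__(self, dim=256, heads=4, n_classes=7):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, n_classes)

    def forward(self, audio, visual):                # (batch, T_a, dim), (batch, T_v, dim)
        a_ctx, _ = self.a2v(audio, visual, visual)   # audio attends to visual
        v_ctx, _ = self.v2a(visual, audio, audio)    # visual attends to audio
        fused = torch.cat([a_ctx.mean(1), v_ctx.mean(1)], dim=-1)
        return self.head(fused)                      # (batch, n_classes) emotion logits

logits = CrossModalFusion()(torch.randn(2, 50, 256), torch.randn(2, 30, 256))
print(logits.shape)  # torch.Size([2, 7])
```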
https://arxiv.org/abs/2403.10488
Deep learning (DL) models have emerged as a powerful tool in avian bioacoustics to diagnose environmental health and biodiversity. However, inconsistencies in research pose notable challenges hindering progress in this domain. Reliable DL models need to analyze bird calls flexibly across various species and environments to fully harness the potential of bioacoustics in a cost-effective passive acoustic monitoring scenario. Data fragmentation and opacity across studies complicate a comprehensive evaluation of general model performance. To overcome these challenges, we present the BirdSet benchmark, a unified framework consolidating research efforts with a holistic approach for classifying bird vocalizations in avian bioacoustics. BirdSet harmonizes open-source bird recordings into a curated dataset collection. This unified approach provides an in-depth understanding of model performance and identifies potential shortcomings across different tasks. By establishing baseline results of current models, BirdSet aims to facilitate comparability, guide subsequent data collection, and increase accessibility for newcomers to avian bioacoustics.
https://arxiv.org/abs/2403.10380
In this work, we consider the problem of localizing multiple signal sources based on time-difference of arrival (TDOA) measurements. In the blind setting, in which the source signals are not known, the localization task is challenging due to the data association problem. That is, it is not known which of the TDOA measurements correspond to the same source. Herein, we propose to perform joint localization and data association by means of an optimal transport formulation. The method operates by finding optimal groupings of TDOA measurements and associating these with candidate source locations. To allow for computationally feasible localization in three-dimensional space, an efficient set of candidate locations is constructed using a minimal multilateration solver based on minimal sets of receiver pairs. In numerical simulations, we demonstrate that the proposed method is robust both to measurement noise and TDOA detection errors. Furthermore, it is shown that the data association provided by the proposed method allows for statistically efficient estimates of the source locations.
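A toy sketch of the data-association idea: TDOAs predicted for a set of candidate locations are matched against the measured TDOA groups. A simple linear assignment (scipy) stands in for the paper's optimal transport formulation, and the candidates are generated randomly rather than by minimal multilateration solvers.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def tdoa(src, receivers, c=343.0):
    """TDOAs of one source relative to receiver 0 (seconds)."""
    d = np.linalg.norm(receivers - src, axis=1)
    return (d - d[0]) / c

rng = np.random.default_rng(0)
receivers = rng.uniform(0, 5, size=(6, 3))              # 6 microphones in a 5 m room
true_sources = np.array([[1.0, 1.0, 1.5], [4.0, 3.0, 1.2]])
measured = np.array([tdoa(s, receivers) for s in true_sources])
measured += rng.normal(scale=1e-5, size=measured.shape)  # measurement noise

# Candidate locations (in the paper these come from minimal multilateration solvers;
# here the true sources are simply mixed with random decoys).
candidates = np.vstack([true_sources, rng.uniform(0, 5, size=(8, 3))])
predicted = np.array([tdoa(c_, receivers) for c_ in candidates])

# Cost between each measured TDOA group and each candidate's predicted TDOAs,
# solved as a linear assignment (a simplified stand-in for the OT formulation).
cost = np.linalg.norm(measured[:, None, :] - predicted[None, :, :], axis=-1)
rows, cols = linear_sum_assignment(cost)
print(dict(zip(rows.tolist(), cols.tolist())))           # e.g. {0: 0, 1: 1}
```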
https://arxiv.org/abs/2403.10329
Audio-text retrieval (ATR), which retrieves a relevant caption given an audio clip (A2T) and vice versa (T2A), has recently attracted much research attention. Existing methods typically aggregate information from each modality into a single vector for matching, but this sacrifices local details and can hardly capture intricate relationships within and between modalities. Furthermore, current ATR datasets lack comprehensive alignment information, and simple binary contrastive learning labels overlook the measurement of fine-grained semantic differences between samples. To counter these challenges, we present a novel ATR framework that comprehensively captures the matching relationships of multimodal information from different perspectives and finer granularities. Specifically, a fine-grained alignment method is introduced, achieving a more detail-oriented matching through a multiscale process from local to global levels to capture meticulous cross-modal relationships. In addition, we pioneer the application of cross-modal similarity consistency, leveraging intra-modal similarity relationships as soft supervision to boost more intricate alignment. Extensive experiments validate the effectiveness of our approach, outperforming previous methods by significant margins of at least 3.9% (T2A) / 6.9% (A2T) R@1 on the AudioCaps dataset and 2.9% (T2A) / 5.4% (A2T) R@1 on the Clotho dataset.
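The cross-modal similarity-consistency idea can be sketched as follows: intra-modal similarities act as a soft target distribution for the cross-modal similarities. The temperature, the KL form, and the choice of text-text similarity as the soft teacher are assumptions for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def similarity_consistency_loss(audio_emb, text_emb, tau=0.05):
    """Soft supervision: push each anchor's cross-modal similarity distribution
    towards its intra-modal (text-text) similarity distribution."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    cross = a @ t.T / tau                 # audio-to-text similarities (batch x batch)
    intra = t @ t.T / tau                 # text-to-text similarities used as soft targets
    target = intra.softmax(dim=-1).detach()
    return F.kl_div(cross.log_softmax(dim=-1), target, reduction="batchmean")

loss = similarity_consistency_loss(torch.randn(8, 512), torch.randn(8, 512))
print(float(loss))
```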
https://arxiv.org/abs/2403.10146
This paper presents enhancements to the MT3 model, a state-of-the-art (SOTA) token-based multi-instrument automatic music transcription (AMT) model. Despite SOTA performance, MT3 suffers from instrument leakage, where transcriptions are fragmented across different instruments. To mitigate this, we propose MR-MT3, with enhancements including a memory retention mechanism, prior token sampling, and token shuffling. These methods are evaluated on the Slakh2100 dataset, demonstrating improved onset F1 scores and reduced instrument leakage. In addition to the conventional multi-instrument transcription F1 score, new metrics such as the instrument leakage ratio and the instrument detection F1 score are introduced for a more comprehensive assessment of transcription quality. The study also explores the issue of domain overfitting by evaluating MT3 on single-instrument monophonic datasets such as ComMU and NSynth. The findings, along with the source code, are shared to facilitate future work aimed at refining token-based multi-instrument AMT models.
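The abstract does not define the instrument leakage ratio precisely; one plausible reading, used purely for illustration below, is the fraction of predicted notes assigned to instruments that never appear in the reference.

```python
def instrument_leakage_ratio(pred_notes, ref_instruments):
    """One plausible reading of 'instrument leakage': the fraction of predicted notes
    assigned to instruments that do not appear in the reference at all.
    pred_notes: list of (onset, pitch, instrument); ref_instruments: set of instruments."""
    if not pred_notes:
        return 0.0
    leaked = sum(1 for _, _, inst in pred_notes if inst not in ref_instruments)
    return leaked / len(pred_notes)

pred = [(0.0, 60, "piano"), (0.5, 64, "piano"), (1.0, 67, "guitar")]
print(instrument_leakage_ratio(pred, {"piano"}))  # 0.333... (one guitar note leaked)
```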
https://arxiv.org/abs/2403.10024
There are many packages in Python which allow one to perform real-time processing on audio data. Unfortunately, due to the synchronous nature of the language, there is no framework that allows for distributed parallel processing of the data without a large programming overhead and in which data acquisition is not blocked by subsequent processing operations. This work improves on packages used for audio data collection with a light-weight backend and a simple interface that allows for distributed processing through a socket-based structure. It is intended for real-time audio machine learning and data processing in Python with a quick deployment of multiple parallel operations on the same data, allowing users to spend less time debugging and more time developing.
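A minimal sketch of the decoupling idea follows: acquisition runs in its own thread and streams raw frames over a socket, so downstream processing never blocks capture and any number of consumers can subscribe. A synthetic producer and socket.socketpair replace the real audio callback and network sockets to keep the example self-contained; none of this is the package's actual API.

```python
import socket
import threading
import time
import numpy as np

FRAME = 1024                                 # samples per frame
acq_sock, proc_sock = socket.socketpair()    # stand-in for a network socket

def acquire(n_frames=50):
    """Producer: emulate non-blocking audio capture by pushing raw frames to the socket."""
    for _ in range(n_frames):
        frame = np.random.randn(FRAME).astype(np.float32)   # fake microphone frame
        acq_sock.sendall(frame.tobytes())
        time.sleep(0.001)                                    # small pause between frames
    acq_sock.shutdown(socket.SHUT_WR)

def process():
    """Consumer: any number of these could run in parallel processes or machines."""
    buf = b""
    while True:
        chunk = proc_sock.recv(4 * FRAME)
        if not chunk:
            break
        buf += chunk
        while len(buf) >= 4 * FRAME:
            frame, buf = np.frombuffer(buf[:4 * FRAME], np.float32), buf[4 * FRAME:]
            print(f"rms={np.sqrt(np.mean(frame ** 2)):.3f}")  # placeholder DSP/ML step

t = threading.Thread(target=acquire)
t.start()
process()
t.join()
```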
https://arxiv.org/abs/2403.09789
Multi-label imbalanced classification poses a significant challenge in machine learning, particularly evident in bioacoustics where animal sounds often co-occur, and certain sounds are much less frequent than others. This paper focuses on the specific case of classifying anuran species sounds using the AnuraSet dataset, which contains both class imbalance and multi-label examples. To address these challenges, we introduce Mixture of Mixups (Mix2), a framework that leverages the mixing regularization methods Mixup, Manifold Mixup, and MultiMix. Experimental results show that these methods, applied individually, may lead to suboptimal results; however, when applied randomly, with one selected at each training iteration, they prove effective in addressing the mentioned challenges, particularly for rare classes with few occurrences. Further analysis reveals that Mix2 is also proficient in classifying sounds across various levels of class co-occurrence.
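The core of Mix2, picking one mixing regularizer at random per training iteration, can be sketched as below. The input-space Mixup is standard; the MultiMix branch is a simplified Dirichlet-weighted variant, and Manifold Mixup (which mixes hidden features rather than inputs) is only indicated in a comment.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x, y, alpha=0.2):
    """Standard input-space Mixup for multi-label targets."""
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]

def multimix(x, y, alpha=0.2):
    """Mix every sample in the batch with Dirichlet weights (MultiMix-style, simplified)."""
    w = rng.dirichlet([alpha] * len(x), size=len(x))   # (batch, batch) mixing weights
    return w @ x, w @ y

METHODS = [mixup, multimix]     # Manifold Mixup would act on hidden features instead

x = rng.normal(size=(8, 64))                          # batch of (flattened) spectrogram patches
y = rng.integers(0, 2, size=(8, 10)).astype(float)    # multi-label targets

for step in range(3):           # one randomly chosen mixing method per training iteration
    method = METHODS[rng.integers(len(METHODS))]
    x_mix, y_mix = method(x, y)
    print(step, method.__name__, x_mix.shape, y_mix.shape)
```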
https://arxiv.org/abs/2403.09598
Masked Autoencoders (MAEs) learn rich low-level representations from unlabeled data but require substantial labeled data to effectively adapt to downstream tasks. Conversely, Instance Discrimination (ID) emphasizes high-level semantics, offering a potential solution to alleviate annotation requirements in MAEs. Although combining these two approaches can address downstream tasks with limited labeled data, naively integrating ID into MAEs leads to extended training times and high computational costs. To address this challenge, we introduce uaMix-MAE, an efficient ID tuning strategy that leverages unsupervised audio mixtures. Utilizing contrastive tuning, uaMix-MAE aligns the representations of pretrained MAEs, thereby facilitating effective adaptation to task-specific semantics. To optimize the model with small amounts of unlabeled data, we propose an audio mixing technique that manipulates audio samples in both the input and virtual label spaces. Experiments in low/few-shot settings demonstrate that uaMix-MAE achieves 4-6% accuracy improvements over various benchmarks when tuned with limited unlabeled data, such as AudioSet-20K. Code is available at this https URL.
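The "mixing in input and virtual label spaces" idea can be illustrated with a generic i-Mix-style sketch: inputs are mixed with coefficient lam, and the contrastive targets (virtual instance labels) are mixed with the same coefficient. The encoder, temperature, and loss form are placeholders, not the uaMix-MAE recipe.

```python
import torch
import torch.nn.functional as F

def mixed_contrastive_loss(anchor_emb, positive_emb, lam, perm, tau=0.1):
    """Contrastive loss whose soft targets ('virtual labels') are mixed with the same
    coefficients used to mix the anchors in input space."""
    n = anchor_emb.shape[0]
    logits = F.normalize(anchor_emb, dim=-1) @ F.normalize(positive_emb, dim=-1).T / tau
    targets = lam * torch.eye(n) + (1 - lam) * torch.eye(n)[perm]   # mixed virtual labels
    return torch.sum(-targets * F.log_softmax(logits, dim=-1), dim=-1).mean()

# Input-space mixing of a batch of audio clips (random tensors as stand-ins).
x = torch.randn(8, 16000)
lam, perm = 0.7, torch.randperm(8)
x_mixed = lam * x + (1 - lam) * x[perm]                  # mixed inputs fed to the encoder

encoder = torch.nn.Sequential(torch.nn.Linear(16000, 128))   # placeholder encoder
loss = mixed_contrastive_loss(encoder(x_mixed), encoder(x), lam, perm)
print(float(loss))
```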
https://arxiv.org/abs/2403.09579
Steered Response Power (SRP) is a widely used method for sound source localization with microphone arrays, showing satisfactory localization performance in many practical scenarios. However, its performance is diminished in highly reverberant environments. Although Deep Neural Networks (DNNs) have previously been proposed to overcome this limitation, most are trained for a specific number of microphones with fixed spatial coordinates. This restricts their practical application in scenarios frequently observed in wireless acoustic sensor networks, where each deployment has an ad-hoc microphone topology. We propose Neural-SRP, a DNN which combines the flexibility of SRP with the performance gains of DNNs. We train our network using simulated data and transfer learning, and evaluate our approach on recorded and simulated data. Results verify that Neural-SRP's localization performance significantly outperforms the baselines.
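SRP itself is a classical frequency-domain method; a compact SRP-PHAT sketch over a grid of candidate positions is shown below (Neural-SRP replaces this hand-crafted map with a learned one). Random noise stands in for the microphone snapshot, so the arg-max location here is arbitrary.

```python
import numpy as np

def srp_phat(frames, mic_pos, grid, fs=16000, c=343.0):
    """frames: (n_mics, n_samples) snapshot; mic_pos: (n_mics, 3); grid: (n_points, 3)
    candidate source positions. Returns the SRP-PHAT power for every grid point."""
    n_mics, n = frames.shape
    X = np.fft.rfft(frames, axis=1)
    freqs = np.fft.rfftfreq(n, 1 / fs)
    power = np.zeros(len(grid))
    for g, p in enumerate(grid):
        delays = np.linalg.norm(mic_pos - p, axis=1) / c              # propagation delays
        steer = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
        aligned = X * np.conj(steer)                                  # advance each channel by its delay
        for i in range(n_mics):
            for j in range(i + 1, n_mics):
                cross = aligned[i] * np.conj(aligned[j])
                cross /= np.abs(cross) + 1e-12                        # PHAT weighting
                power[g] += np.real(np.sum(cross))
    return power

rng = np.random.default_rng(0)
mic_pos = rng.uniform(0, 3, size=(4, 3))
grid = rng.uniform(0, 3, size=(50, 3))
frames = rng.normal(size=(4, 512))
print(grid[np.argmax(srp_phat(frames, mic_pos, grid))])   # position with the highest steered power
```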
https://arxiv.org/abs/2403.09455
This paper summarizes the spectrogram and gives a practical application of the spectrogram in signal processing. For the analysis, finger snapping is recorded at sampling rates of 441000 Hz and 96000 Hz. The effects of the number of segments on the Power Spectral Density (PSD) and the spectrogram are analyzed and visualized.
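The segment-count effect on the Welch PSD and the spectrogram can be reproduced directly with scipy; a synthetic tone in noise stands in for the recorded finger snaps.

```python
import numpy as np
from scipy import signal

fs = 96000                                    # one of the sampling rates used in the paper
t = np.arange(0, 0.5, 1 / fs)
x = np.sin(2 * np.pi * 1000 * t) + 0.1 * np.random.randn(t.size)   # synthetic stand-in

for nperseg in (256, 1024, 4096):             # fewer, longer segments trade variance for resolution
    f_psd, psd = signal.welch(x, fs=fs, nperseg=nperseg)
    f_spec, t_spec, sxx = signal.spectrogram(x, fs=fs, nperseg=nperseg)
    print(f"nperseg={nperseg:5d}  PSD bins={psd.size:5d}  "
          f"spectrogram shape={sxx.shape}  freq resolution={fs / nperseg:.1f} Hz")
```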
https://arxiv.org/abs/2403.09321
Benchmarking plays a pivotal role in assessing and enhancing the performance of compact deep learning models designed for execution on resource-constrained devices, such as microcontrollers. Our study introduces a novel, entirely artificially generated benchmarking dataset tailored for speech recognition, representing a core challenge in the field of tiny deep learning. SpokeN-100 consists of spoken numbers from 0 to 99 spoken by 32 different speakers in four different languages, namely English, Mandarin, German and French, resulting in 12,800 audio samples. We determine auditory features and use UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction) as a dimensionality reduction method to show the diversity and richness of the dataset. To highlight the use case of the dataset, we introduce two benchmark tasks: given an audio sample, classify (i) the used language and/or (ii) the spoken number. We optimized state-of-the-art deep neural networks and performed an evolutionary neural architecture search to find tiny architectures optimized for the 32-bit ARM Cortex-M4 nRF52840 microcontroller. Our results represent the first benchmark data achieved for SpokeN-100.
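The dataset-visualization step can be sketched with MFCC features and umap-learn; random noise stands in for the SpokeN-100 utterances, and the choice of mean MFCCs as auditory features is an assumption.

```python
import numpy as np
import librosa
import umap   # pip install umap-learn

rng = np.random.default_rng(0)
sr = 16000
# Random one-second clips stand in for SpokeN-100 utterances.
clips = [rng.normal(size=sr).astype(np.float32) for _ in range(64)]

# Per-clip feature vector: mean MFCCs over time (the paper's exact features are not specified here).
feats = np.stack([librosa.feature.mfcc(y=c, sr=sr, n_mfcc=20).mean(axis=1) for c in clips])

embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(feats)
print(embedding.shape)   # (64, 2) points ready for a scatter plot per language/number
```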
https://arxiv.org/abs/2403.09753
This study aimed to develop a deep learning model for the classification of bearing faults in wind turbine generators from acoustic signals. A convolutional LSTM model was constructed and trained on audio data from five predefined fault types for both training and validation. To create the dataset, raw audio signal data was collected and processed in frames to capture time and frequency domain information. The model exhibited outstanding accuracy on the training samples and demonstrated excellent generalization ability during validation. On the test samples, the model achieved remarkable classification performance, with an overall accuracy exceeding 99.5% and a false positive rate of less than 1% for the normal status. The findings of this study provide essential support for the diagnosis and maintenance of bearing faults in wind turbine generators, with the potential to enhance the reliability and efficiency of wind power generation.
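A generic CNN-followed-by-LSTM classifier over framed spectrogram features is sketched below as a stand-in for the paper's convolutional LSTM; layer sizes and the five-class output are illustrative only.

```python
import torch
import torch.nn as nn

class ConvLSTMClassifier(nn.Module):
    """Per-frame convolutional features followed by an LSTM over time, then a class head."""
    def __init__(self, n_mels=64, hidden=128, n_classes=5):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, spec):                    # spec: (batch, n_mels, frames)
        h = self.conv(spec).transpose(1, 2)     # (batch, frames, 64)
        _, (h_n, _) = self.lstm(h)              # last hidden state summarizes the clip
        return self.head(h_n[-1])               # (batch, n_classes) fault logits

logits = ConvLSTMClassifier()(torch.randn(4, 64, 200))
print(logits.shape)  # torch.Size([4, 5])
```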
https://arxiv.org/abs/2403.09030
Self-supervised speech representation learning enables the extraction of meaningful features from raw waveforms. These features can then be efficiently used across multiple downstream tasks. However, two significant issues arise when considering the deployment of such methods "in the wild": (i) their large size, which can be prohibitive for edge applications; and (ii) their robustness to detrimental factors, such as noise and/or reverberation, which can heavily degrade the performance of such systems. In this work, we propose RobustDistiller, a novel knowledge distillation mechanism that tackles both problems jointly. Alongside the distillation recipe, we apply a multi-task learning objective to encourage the network to learn noise-invariant representations by denoising the input. The proposed mechanism is evaluated on twelve different downstream tasks and outperforms several benchmarks regardless of noise type, or noise and reverberation levels. Experimental results show that the new Student model with 23M parameters can achieve results comparable to the Teacher model with 95M parameters. Lastly, we show that the proposed recipe can be applied to other distillation methodologies, such as the recent DPWavLM. For reproducibility, code and model checkpoints will be made available at this https URL.
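The joint objective can be sketched as a weighted sum of a representation-distillation term and a denoising reconstruction term; the shapes, loss choices, and weighting below are assumptions, not the RobustDistiller recipe.

```python
import torch
import torch.nn.functional as F

def robust_distillation_loss(student_out, teacher_out, denoised, clean, w_denoise=1.0):
    """Distill teacher representations into the student while also asking the student's
    decoder head to reconstruct the clean signal from the noisy input (multi-task)."""
    distill = F.mse_loss(student_out, teacher_out)   # match teacher features
    denoise = F.l1_loss(denoised, clean)             # noise-invariance via denoising
    return distill + w_denoise * denoise

# Dummy tensors: frame-level representations and waveforms.
student_out, teacher_out = torch.randn(2, 100, 256), torch.randn(2, 100, 256)
denoised, clean = torch.randn(2, 16000), torch.randn(2, 16000)
print(float(robust_distillation_loss(student_out, teacher_out, denoised, clean)))
```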
https://arxiv.org/abs/2403.08654
This paper describes a data-driven approach to creating real-time neural network models of guitar amplifiers, recreating the amplifiers' sonic response to arbitrary inputs at the full range of controls present on the physical device. While the focus of the paper is on the data collection pipeline, we demonstrate the effectiveness of this conditioned black-box approach by training an LSTM model on the task and comparing its performance to an offline white-box SPICE circuit simulation. Our listening test results demonstrate that the neural amplifier modeling approach can match the subjective performance of a high-quality SPICE model, all while using an automated, non-intrusive data collection process and an end-to-end trainable, real-time feasible neural network model.
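A conditioned black-box model of this kind is typically an LSTM whose per-sample input is the dry signal concatenated with the (normalized) control settings; a minimal sketch with illustrative sizes follows.

```python
import torch
import torch.nn as nn

class ConditionedAmpLSTM(nn.Module):
    """Sample-level LSTM whose input is the dry guitar sample concatenated with the
    amplifier control settings (gain, tone, ...), broadcast over time."""
    def __init__(self, n_controls=3, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(1 + n_controls, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, audio, controls):               # audio: (B, T), controls: (B, n_controls)
        cond = controls[:, None, :].expand(-1, audio.shape[1], -1)
        x = torch.cat([audio[..., None], cond], dim=-1)
        h, _ = self.lstm(x)
        return audio + self.out(h).squeeze(-1)        # residual connection around the LSTM

dry = torch.randn(2, 4096)                            # dry input audio
knobs = torch.tensor([[0.8, 0.5, 0.3], [0.2, 0.9, 0.7]])
wet = ConditionedAmpLSTM()(dry, knobs)
print(wet.shape)  # torch.Size([2, 4096])
```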
https://arxiv.org/abs/2403.08559
In this work we propose an audio recording segmentation method based on adaptive change point detection (A-CPD) for machine-guided weak-label annotation of audio recording segments. The goal is to maximize the amount of information gained about the temporal activations of the target sounds. For each unlabeled audio recording, we use a prediction model to derive a probability curve used to guide annotation. The prediction model is initially pre-trained on available annotated sound event data with classes that are disjoint from the classes in the unlabeled dataset. The prediction model then gradually adapts to the annotations provided by the annotator in an active learning loop. The queries used to guide the weak-label annotator towards strong labels are derived using change point detection on these probabilities. We show that it is possible to derive strong labels of high quality even with a limited annotation budget, and show favorable results for A-CPD when compared to two baseline query strategies.
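A much-simplified sketch of turning the per-frame probability curve into annotation queries: boundaries are placed at the largest changes of the curve, yielding segments for the annotator to label weakly. The paper's actual change-point detector and budget handling are more involved.

```python
import numpy as np

def query_boundaries(prob, n_queries=4):
    """Place query boundaries at the largest absolute changes of the probability curve,
    yielding n_queries segments for the annotator to label weakly (sound present / absent)."""
    change = np.abs(np.diff(prob))
    cuts = np.sort(np.argsort(change)[-(n_queries - 1):] + 1)   # top change points
    return np.concatenate([[0], cuts, [len(prob)]])

# Synthetic probability curve from a prediction model: low, high, low activation.
prob = np.concatenate([np.full(40, 0.1), np.full(30, 0.9), np.full(50, 0.15)])
prob += np.random.default_rng(0).normal(scale=0.02, size=prob.size)
bounds = query_boundaries(prob, n_queries=3)
print(bounds)                        # e.g. [  0  40  70 120] -> three segments to query
for a, b in zip(bounds[:-1], bounds[1:]):
    print(f"segment [{a:3d}, {b:3d})  mean prob = {prob[a:b].mean():.2f}")
```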
https://arxiv.org/abs/2403.08525
Recently, deep learning-based Text-to-Speech (TTS) systems have achieved high-quality speech synthesis results. Recurrent neural networks have become a standard modeling technique for sequential data in TTS systems and are widely used. However, training a TTS model that includes RNN components requires powerful GPU performance and takes a long time. In contrast, CNN-based sequence synthesis techniques can significantly reduce the parameters and training time of a TTS model while guaranteeing a certain level of performance thanks to their high parallelism, which alleviates the economic costs of training. In this paper, we propose a lightweight TTS system based on deep convolutional neural networks: a two-stage, end-to-end TTS model that does not employ any recurrent units. Our model consists of two stages, Text2Spectrum and SSRN: the former encodes phonemes into a coarse mel spectrogram, and the latter synthesizes the complete spectrum from the coarse mel spectrogram. Meanwhile, we improve the robustness of our model through a series of data augmentations, such as noise suppression, time warping, frequency masking and time masking, to address the low-resource Mongolian problem. Experiments show that our model can reduce training time and parameters while ensuring the quality and naturalness of the synthesized speech compared to mainstream TTS models. Our method uses the NCMMSC2022-MTTSC Challenge dataset for validation, which significantly reduces training time while maintaining a certain level of accuracy.
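The frequency- and time-masking augmentations mentioned above follow the familiar SpecAugment recipe; a minimal numpy version with illustrative mask widths is sketched below.

```python
import numpy as np

def freq_mask(spec, max_width=8, rng=np.random.default_rng()):
    """Zero out a random band of mel channels."""
    f = rng.integers(1, max_width + 1)
    f0 = rng.integers(0, spec.shape[0] - f)
    out = spec.copy()
    out[f0:f0 + f, :] = 0.0
    return out

def time_mask(spec, max_width=20, rng=np.random.default_rng()):
    """Zero out a random span of frames."""
    t = rng.integers(1, max_width + 1)
    t0 = rng.integers(0, spec.shape[1] - t)
    out = spec.copy()
    out[:, t0:t0 + t] = 0.0
    return out

mel = np.random.rand(80, 400)                 # (mel bins, frames) coarse mel spectrogram
augmented = time_mask(freq_mask(mel))
print(np.isclose(augmented, 0).sum(), "masked cells")
```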
https://arxiv.org/abs/2403.08164
Modelling musical structure is vital yet challenging for artificial intelligence systems that generate symbolic music compositions. This literature review dissects the evolution of techniques for incorporating coherent structure, from symbolic approaches to foundational and transformative deep learning methods that harness the power of computation and data across a wide variety of training paradigms. In the later stages, we review an emerging technique which we refer to as "sub-task decomposition" that involves decomposing music generation into separate high-level structural planning and content creation stages. Such systems incorporate some form of musical knowledge or neuro-symbolic methods by extracting melodic skeletons or structural templates to guide the generation. Progress is evident in capturing motifs and repetitions across all three eras reviewed, yet modelling the nuanced development of themes across extended compositions in the style of human composers remains difficult. We outline several key future directions to realize the synergistic benefits of combining approaches from all eras examined.
https://arxiv.org/abs/2403.07995
Keyword spotting systems for always-on TinyML-constrained applications require on-site tuning to boost the accuracy of offline-trained classifiers when deployed in unseen inference conditions. Adapting to the speech peculiarities of target users requires many in-domain samples, often unavailable in real-world scenarios. Furthermore, current on-device learning techniques rely on computationally intensive and memory-hungry backbone update schemes, unfit for always-on, battery-powered devices. In this work, we propose a novel on-device learning architecture, composed of a pretrained backbone and a user-aware embedding that learns the user's speech characteristics. The generated features are fused and used to classify the input utterance. For domain shifts generated by unseen speakers, we measure error rate reductions of up to 19% (from 30.1% to 24.3%) on the 35-class problem of the Google Speech Commands dataset, through the inexpensive update of the user projections. We moreover demonstrate the few-shot learning capabilities of our proposed architecture in sample- and class-scarce learning conditions. With 23.7k parameters and 1 MFLOP per epoch required for on-device training, our system is feasible for TinyML applications aimed at battery-powered microcontrollers.
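The split between a frozen pretrained backbone and a small trainable user-aware branch can be sketched as follows; only the user projection (and, in this toy version, the classification head) is updated on-device. Sizes and the concatenation-based fusion are assumptions.

```python
import torch
import torch.nn as nn

class UserAwareKWS(nn.Module):
    def __init__(self, feat_dim=64, emb_dim=16, n_classes=35):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 64))
        self.user_proj = nn.Linear(feat_dim, emb_dim)     # the user-aware, on-device-trainable part
        self.head = nn.Linear(64 + emb_dim, n_classes)
        for p in self.backbone.parameters():              # the pretrained backbone stays frozen
            p.requires_grad = False

    def forward(self, x):                                 # x: (batch, feat_dim) utterance features
        fused = torch.cat([self.backbone(x), self.user_proj(x)], dim=-1)
        return self.head(fused)

model = UserAwareKWS()
trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.SGD(trainable, lr=1e-2)                 # cheap update of the user projection (+ head)
x, y = torch.randn(8, 64), torch.randint(0, 35, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
opt.step()
print(sum(p.numel() for p in trainable), "trainable parameters")
```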
https://arxiv.org/abs/2403.07802