Several attempts have been made to handle multiple source separation tasks such as speech enhancement, speech separation, sound event separation, music source separation (MSS), or cinematic audio source separation (CASS) with a single model. These models are trained on large-scale data including speech, instruments, or sound events and can often successfully separate a wide range of sources. However, it is still challenging for such models to cover all separation tasks because some of them are contradictory (e.g., musical instruments are separated in MSS while they have to be grouped in CASS). To overcome this issue and support all the major separation tasks, we propose a task-aware unified source separation (TUSS) model. The model uses a variable number of learnable prompts to specify which source to separate, and changes its behavior depending on the given prompts, enabling it to handle all the major separation tasks including contradictory ones. Experimental results demonstrate that the proposed TUSS model successfully handles the five major separation tasks mentioned earlier. We also provide some audio examples, including both synthetic mixtures and real recordings, to demonstrate how flexibly the TUSS model changes its behavior at inference depending on the prompts.
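To make the prompt-conditioning idea concrete, here is a minimal PyTorch sketch in which a variable set of learnable prompt embeddings attends over an encoded mixture and each selected prompt yields one separated output; all module names, dimensions, and the cross-attention layout are illustrative assumptions, not the actual TUSS architecture.

```python
# Minimal sketch of prompt-conditioned separation in the spirit of TUSS.
# All module names and sizes are illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class PromptConditionedSeparator(nn.Module):
    def __init__(self, n_prompts=8, dim=256, n_freq=257):
        super().__init__()
        # One learnable prompt per source concept (e.g. "speech", "drums", "sfx").
        self.prompts = nn.Embedding(n_prompts, dim)
        self.encoder = nn.Conv1d(n_freq, dim, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mask_head = nn.Linear(dim, n_freq)

    def forward(self, mix_spec, prompt_ids):
        # mix_spec: (batch, n_freq, frames); prompt_ids: (n_selected,) indices
        h = self.encoder(mix_spec).transpose(1, 2)          # (batch, frames, dim)
        q = self.prompts(prompt_ids).unsqueeze(0).expand(h.size(0), -1, -1)
        # Each selected prompt attends over the mixture and yields one output mask.
        ctx, _ = self.attn(q, h, h)                          # (batch, n_selected, dim)
        masks = torch.sigmoid(self.mask_head(ctx))           # (batch, n_selected, n_freq)
        return masks.unsqueeze(-1) * mix_spec.unsqueeze(1)   # one separated spec per prompt

sep = PromptConditionedSeparator()
out = sep(torch.randn(2, 257, 100), torch.tensor([0, 3]))   # e.g. "speech" + "music" prompts
print(out.shape)  # torch.Size([2, 2, 257, 100])
```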
https://arxiv.org/abs/2410.23987
XyloAudio is a line of ultra-low-power audio inference chips, designed for in- and near-microphone analysis of audio in real-time, energy-constrained scenarios. Xylo is designed around a highly efficient integer-logic processor which simulates parameter- and activity-sparse spiking neural networks (SNNs) using a leaky integrate-and-fire (LIF) neuron model. Neurons on Xylo are quantised integer devices operating in synchronous digital CMOS, with neuron and synapse state quantised to 16 bits and weight parameters quantised to 8 bits. Xylo is tailored for real-time streaming operation, as opposed to the accelerated-time operation of an inference accelerator. XyloAudio includes a low-power audio encoding interface for direct connection to a microphone, designed for sparse encoding of incident audio for further processing by the inference core. In this report we present the results of deploying the DCASE 2020 acoustic scene classification benchmark dataset to XyloAudio 2. We describe the benchmark dataset; the audio preprocessing approach; and the network architecture and training approach. We present the performance of the trained model, and the results of power and latency measurements performed on the XyloAudio 2 development kit. This benchmark is conducted as part of the Neurobench project.
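The LIF dynamics described above can be illustrated with a toy integer simulation; the decay shift, threshold, and reset scheme below are assumptions chosen for illustration and do not reflect Xylo's actual configuration.

```python
# Toy simulation of quantised LIF dynamics: 8-bit weights, 16-bit neuron state,
# leak and fire-reset in integer arithmetic. Parameter values are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, T = 16, 4, 100
w = rng.integers(-128, 128, size=(n_in, n_out), dtype=np.int8)      # 8-bit weights
v = np.zeros(n_out, dtype=np.int16)                                  # 16-bit membrane state
threshold, decay_shift = np.int16(2000), 4                           # leak: v -= v >> 4

spikes_in = (rng.random((T, n_in)) < 0.1)                            # sparse input events
out_spikes = np.zeros((T, n_out), dtype=bool)

for t in range(T):
    i_syn = spikes_in[t].astype(np.int32) @ w.astype(np.int32)       # synaptic input current
    v32 = v.astype(np.int32) - (v.astype(np.int32) >> decay_shift) + i_syn
    v = np.clip(v32, -32768, 32767).astype(np.int16)                 # keep state in 16 bits
    fired = v >= threshold
    out_spikes[t] = fired
    v[fired] -= threshold                                            # subtractive reset

print("output spike counts:", out_spikes.sum(axis=0))
```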
https://arxiv.org/abs/2410.23776
Virtual reality (VR) environments are frequently used in auditory and cognitive research to imitate real-life scenarios, presumably offering an advantage over state-of-the-art approaches based on traditional computer screens. However, the effects of different display technologies on audiovisual processing remain underexplored. This study investigated how VR presented via a head-mounted display (HMD) affects serial recall performance compared to traditional computer monitors, focusing on their effects on audiovisual processing in cognitive tasks. To this end, we conducted two experiments with both an HMD and a computer monitor as display devices and two types of audiovisual incongruence: angle (Exp. 1) and voice (Exp. 2) incongruence. To quantify cognitive performance, an audiovisual verbal serial recall (avVSR) task was developed in which an embodied conversational agent (ECA) was animated to speak the target digit sequence. Even though subjective evaluations showed a higher sense of presence in the HMD condition, we found no effect of the display device on the proportion of correctly recalled digits. For the extreme angle-incongruence conditions in the computer monitor presentation, the proportion of correctly recalled digits increased marginally, presumably due to raised attention, but the effect is likely too small to be meaningful. Response times were not affected by incongruences for either display device across both experiments. These findings suggest that the avVSR task is robust against angular and voice audiovisual incongruences, irrespective of the display device, at least for the conditions studied here. Hence, this study introduces the avVSR task in VR and contributes to the understanding of audiovisual integration.
https://arxiv.org/abs/2410.23015
Developing new machine learning applications often requires the collection of new datasets. However, existing datasets may already contain relevant information for training models for new purposes. We propose SoundCollage: a framework to discover new classes within audio datasets by incorporating (1) an audio pre-processing pipeline to decompose different sounds in audio samples and (2) an automated model-based annotation mechanism to identify the discovered classes. Furthermore, we introduce a clarity measure to assess the coherence of the discovered classes for better training of new downstream applications. Our evaluations show that the accuracy of downstream audio classifiers on discovered class samples and held-out datasets improves over the baseline by up to 34.7% and 4.5%, respectively, highlighting the potential of SoundCollage in making datasets reusable by labeling them with newly discovered classes. To encourage further research in this area, we open-source our code at this https URL.
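The abstract does not define the clarity measure, so the sketch below uses a hypothetical stand-in: a silhouette-style score that contrasts intra-class embedding similarity with similarity to the rest of the data.

```python
# Hypothetical stand-in for a "clarity" score of a discovered class: how tightly its
# audio embeddings cluster relative to the rest of the data (not the paper's definition).
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def clarity(class_emb: np.ndarray, other_emb: np.ndarray) -> float:
    """Higher is better: intra-class similarity minus similarity to other samples."""
    intra = cosine_similarity(class_emb)
    n = len(class_emb)
    intra_mean = (intra.sum() - n) / (n * (n - 1))       # average over off-diagonal pairs
    inter_mean = cosine_similarity(class_emb, other_emb).mean()
    return float(intra_mean - inter_mean)

rng = np.random.default_rng(1)
coherent = rng.normal(loc=1.0, scale=0.1, size=(20, 64))   # tight cluster of embeddings
rest = rng.normal(loc=0.0, scale=1.0, size=(200, 64))
print(f"clarity of discovered class: {clarity(coherent, rest):.3f}")
```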
https://arxiv.org/abs/2410.23008
Building upon Diff-A-Riff, a latent diffusion model for musical instrument accompaniment generation, we present a series of improvements targeting quality, diversity, inference speed, and text-driven control. First, we upgrade the underlying autoencoder to a stereo-capable model with superior fidelity and replace the latent U-Net with a Diffusion Transformer. Additionally, we refine text prompting by training a cross-modality predictive network to translate text-derived CLAP embeddings to audio-derived CLAP embeddings. Finally, we improve inference speed by training the latent model using a consistency framework, achieving competitive quality with fewer denoising steps. Our model is evaluated against the original Diff-A-Riff variant using objective metrics in ablation experiments, demonstrating promising advancements in all targeted areas. Sound examples are available at: this https URL.
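The cross-modality predictive network can be pictured as a small regression model from text-derived to audio-derived CLAP embeddings; the architecture, dimensions, and cosine loss below are illustrative assumptions, not the paper's exact design.

```python
# Sketch of the cross-modality predictive idea: map a text-derived CLAP embedding to
# the corresponding audio-derived CLAP embedding. Dimensions and loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextToAudioCLAP(nn.Module):
    def __init__(self, dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, text_emb):
        # Predict a unit-norm audio embedding from a unit-norm text embedding.
        return F.normalize(self.net(F.normalize(text_emb, dim=-1)), dim=-1)

model = TextToAudioCLAP()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
text_emb, audio_emb = torch.randn(8, 512), torch.randn(8, 512)   # paired CLAP embeddings
pred = model(text_emb)
loss = 1.0 - F.cosine_similarity(pred, F.normalize(audio_emb, dim=-1)).mean()
loss.backward(); opt.step()
print(f"cosine loss: {loss.item():.3f}")
```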
https://arxiv.org/abs/2410.23005
This paper presents a system developed for submission to Poleval 2024, Task 3: Polish Automatic Speech Recognition Challenge. We describe a Voicebox-based speech synthesis pipeline and use it to augment Conformer and Whisper speech recognition models with synthetic data. We show that adding synthetic speech to the training data significantly improves the achieved results. We also present the final results achieved by our models in the competition.
https://arxiv.org/abs/2410.22903
This paper proposes a novel neural audio codec, named APCodec+, which is an improved version of APCodec. APCodec+ takes the audio amplitude and phase spectra as the coding objects and employs an adversarial training strategy. Innovatively, we propose a two-stage joint-individual training paradigm for APCodec+. In the joint training stage, the encoder, quantizer, decoder, and discriminator are jointly trained with complete spectral loss, quantization loss, and adversarial loss. In the individual training stage, the parameters of the encoder and quantizer are frozen, and they provide high-quality training data for the decoder and discriminator. The decoder and discriminator are then trained individually from scratch, without the quantization loss. The purpose of introducing individual training is to reduce the learning difficulty of the decoder, thereby further improving the fidelity of the decoded audio. Experimental results confirm that the proposed APCodec+ at low bitrates achieves performance comparable to baseline codecs at higher bitrates, thanks to the proposed staged training paradigm.
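A schematic of the two-stage joint-individual paradigm, with toy linear modules standing in for the real encoder, quantizer, decoder, and discriminator; the losses are simplified placeholders, and only the staging and parameter freezing are the point here.

```python
# Schematic of joint-individual training: stage 1 trains everything jointly,
# stage 2 freezes the front-end and retrains decoder/discriminator from scratch.
import torch
import torch.nn as nn
import torch.nn.functional as F

enc, quant, dec = nn.Linear(64, 16), nn.Linear(16, 16), nn.Linear(16, 64)
disc = nn.Linear(64, 1)
spec = torch.randn(8, 64)                               # stand-in for amplitude/phase spectra

# Stage 1: joint training of encoder, quantizer, decoder and discriminator.
opt = torch.optim.Adam(
    list(enc.parameters()) + list(quant.parameters())
    + list(dec.parameters()) + list(disc.parameters()), lr=1e-3)
z = quant(enc(spec))
recon = dec(z)
loss = (recon - spec).pow(2).mean()                     # spectral loss
loss = loss + (z - z.detach().round()).pow(2).mean()    # toy quantization loss
loss = loss + F.softplus(-disc(recon)).mean()           # toy adversarial term
opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: freeze encoder and quantizer, re-initialise decoder and discriminator,
# and train them individually without the quantization loss.
for p in list(enc.parameters()) + list(quant.parameters()):
    p.requires_grad_(False)
dec, disc = nn.Linear(16, 64), nn.Linear(64, 1)
opt2 = torch.optim.Adam(list(dec.parameters()) + list(disc.parameters()), lr=1e-3)
with torch.no_grad():
    z = quant(enc(spec))                                # frozen front-end supplies targets
recon = dec(z)
loss2 = (recon - spec).pow(2).mean() + F.softplus(-disc(recon)).mean()
opt2.zero_grad(); loss2.backward(); opt2.step()
print(f"stage-1 loss {loss.item():.3f}, stage-2 loss {loss2.item():.3f}")
```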
https://arxiv.org/abs/2410.22807
Speech enhancement is crucial in human-computer interaction, especially for ubiquitous devices. Ultrasound-based speech enhancement has emerged as an attractive choice because of its superior ubiquity and performance. However, inevitable interference from unexpected and unintended sources during audio-ultrasound data acquisition makes existing solutions rely heavily on human effort for data collection and processing. This leads to significant data scarcity that limits the full potential of ultrasound-based speech enhancement. To address this, we propose USpeech, a cross-modal ultrasound synthesis framework for speech enhancement with minimal human effort. At its core is a two-stage framework that establishes correspondence between visual and ultrasonic modalities by leveraging audible audio as a bridge. This approach overcomes challenges from the lack of paired video-ultrasound datasets and the inherent heterogeneity between video and ultrasound data. Our framework incorporates contrastive video-audio pre-training to project modalities into a shared semantic space and employs an audio-ultrasound encoder-decoder for ultrasound synthesis. We then present a speech enhancement network that enhances speech in the time-frequency domain and recovers the clean speech waveform via a neural vocoder. Comprehensive experiments show USpeech achieves remarkable performance using synthetic ultrasound data comparable to physical data, significantly outperforming state-of-the-art ultrasound-based speech enhancement baselines. USpeech is open-sourced at this https URL.
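The contrastive video-audio pre-training step can be sketched as a symmetric InfoNCE (CLIP-style) loss that pulls paired video and audio embeddings together in a shared space; the projection heads, dimensions, and temperature below are placeholders, not USpeech's actual encoders.

```python
# Sketch of contrastive video-audio pre-training: a symmetric InfoNCE loss over
# paired clips projected into a shared space. Encoders and dimensions are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

video_proj = nn.Linear(1024, 256)     # stand-in for the video encoder head
audio_proj = nn.Linear(768, 256)      # stand-in for the audio encoder head

video_feat, audio_feat = torch.randn(16, 1024), torch.randn(16, 768)   # paired clips
v = F.normalize(video_proj(video_feat), dim=-1)
a = F.normalize(audio_proj(audio_feat), dim=-1)

logits = v @ a.t() / 0.07                         # similarity matrix, temperature 0.07
targets = torch.arange(len(v))                    # i-th video matches i-th audio
loss = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
loss.backward()
print(f"contrastive loss: {loss.item():.3f}")
```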
https://arxiv.org/abs/2410.22076
This paper proposes a method for unsupervised anomalous sound detection (UASD) that also captions the reason for detection. While an existing method captions the difference between given normal and anomalous sound pairs, it is trained and used separately from the UASD model, so the obtained caption can be irrelevant to the differences that the UASD model captured. In addition, it requires many caption labels representing differences between anomalous and normal sounds for model training. The proposed method employs a retrieval-augmented approach to captioning anomalous sounds. Performing difference captioning in the embedding space output by the pre-trained CLAP (contrastive language-audio pre-training) model keeps the anomalous sound detection results consistent with the captions and requires no additional training. Experiments based on subjective evaluation and a sample-wise analysis of the output captions demonstrate the effectiveness of the proposed method.
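A minimal sketch of retrieval-augmented difference captioning in a CLAP-like embedding space: the shift from the normal centroid to the anomalous clip is matched against a bank of pre-embedded candidate captions. The retrieval rule and the random placeholder embeddings are assumptions; the paper's exact scheme may differ.

```python
# Retrieval-augmented difference captioning sketch: score candidate captions by the
# cosine similarity between their (placeholder) text embeddings and the anomaly direction.
import numpy as np

rng = np.random.default_rng(0)
dim = 512
normal_audio_emb = rng.normal(size=(50, dim))              # embeddings of normal clips
anomalous_emb = rng.normal(size=dim)                       # embedding of the detected clip

captions = ["high-pitched whine appears", "low-frequency rumble appears",
            "rattling impulsive noise appears", "sound level drops"]
caption_emb = rng.normal(size=(len(captions), dim))        # CLAP text embeddings (placeholder)

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

diff = unit(anomalous_emb - normal_audio_emb.mean(axis=0)) # direction of the anomaly
scores = unit(caption_emb) @ diff                          # cosine similarity to each caption
print("retrieved caption:", captions[int(np.argmax(scores))])
```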
https://arxiv.org/abs/2410.22056
Chord progressions encapsulate important information about music, pertaining to its structure and conveyed emotions. They serve as the backbone of musical composition, and in many cases, they are the sole information required for a musician to play along and follow the music. Despite their importance, chord progressions as a data domain remain underexplored. There is a lack of large-scale datasets suitable for deep learning applications, and limited research exploring chord progressions as an input modality. In this work, we present Chordonomicon, a dataset of over 666,000 songs and their chord progressions, annotated with structural parts, genre, and release date, created by scraping various sources of user-generated progressions and associated metadata. We demonstrate the practical utility of the Chordonomicon dataset for classification and generation tasks, and discuss its potential to provide valuable insights to the research community. Chord progressions are unique in their ability to be represented in multiple formats (e.g., text, graph) and in the wealth of information chords convey in a given context, such as their harmonic function. These characteristics make the Chordonomicon an ideal testbed for exploring advanced machine learning techniques, including transformers, graph machine learning, and hybrid systems that combine knowledge representation and machine learning.
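As a small illustration of the multiple representations mentioned above, the snippet below renders a made-up progression both as a token sequence (for transformer-style models) and as a weighted transition graph (for graph machine learning); the example progression is not dataset content.

```python
# Two views of one chord progression: a token sequence and a transition graph
# whose edge weights are bigram counts.
from collections import Counter

progression = ["C", "G", "Am", "F", "C", "G", "F", "C"]

# Text / sequence view: suitable for transformer-style sequence models.
text_view = " ".join(progression)

# Graph view: nodes are chords, weighted edges are observed transitions.
edges = Counter(zip(progression[:-1], progression[1:]))

print(text_view)
for (src, dst), count in sorted(edges.items()):
    print(f"{src} -> {dst}: {count}")
```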
https://arxiv.org/abs/2410.22046
This paper proposes a framework of explaining anomalous machine sounds in the context of anomalous sound detection (ASD). While ASD has been extensively explored, identifying how anomalous sounds differ from normal sounds is also beneficial for machine condition monitoring. However, existing sound difference captioning methods require anomalous sounds for training, which is impractical in typical machine condition monitoring settings where such sounds are unavailable. To solve this issue, we propose a new strategy for explaining anomalous differences that does not require anomalous sounds for training. Specifically, we introduce a framework that explains differences in predefined timbre attributes instead of using free-form text captions. Objective metrics of timbre attributes can be computed using timbral models developed through psycho-acoustical research, enabling the estimation of how and what timbre attributes have changed from normal sounds without training machine learning models. Additionally, to accurately determine timbre differences regardless of variations in normal training data, we developed a method that jointly conducts anomalous sound detection and timbre difference estimation based on a k-nearest neighbors method in an audio embedding space. Evaluation using the MIMII DG dataset demonstrated the effectiveness of the proposed method.
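The joint detection-and-explanation idea can be sketched as a k-nearest-neighbour search in an embedding space: the distance to the nearest normal clips gives the anomaly score, and the timbre-attribute difference is read off against the same neighbours. The embeddings and timbre attributes below are random placeholders for learned embeddings and psycho-acoustic timbral model outputs.

```python
# kNN sketch: anomaly score from embedding distances, timbre difference against
# the same k nearest normal clips. All data here is random placeholder material.
import numpy as np

rng = np.random.default_rng(0)
k, dim = 5, 128
normal_emb = rng.normal(size=(200, dim))                    # normal training clips
normal_timbre = rng.normal(size=(200, 3))                   # e.g. [sharpness, roughness, boominess]

query_emb = rng.normal(size=dim) + 2.0                      # embedding of the test clip
query_timbre = np.array([1.5, 0.2, -0.3])

dists = np.linalg.norm(normal_emb - query_emb, axis=1)
nn_idx = np.argsort(dists)[:k]
anomaly_score = dists[nn_idx].mean()                        # large = far from normal data
timbre_diff = query_timbre - normal_timbre[nn_idx].mean(axis=0)

print(f"anomaly score: {anomaly_score:.2f}")
print("timbre change vs. nearest normals:", np.round(timbre_diff, 2))
```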
https://arxiv.org/abs/2410.22033
Audio fingerprinting techniques have seen great advances in recent years, enabling accurate and fast audio retrieval even in conditions when the queried audio sample has been highly deteriorated or recorded in noisy conditions. Expectedly, most of the existing work is centered around music, with popular music identification services such as Apple's Shazam or Google's Now Playing designed for individual audio recognition on mobile devices. However, the spectral content of speech differs from that of music, necessitating modifications to current audio fingerprinting approaches. This paper offers fresh insights into adapting existing techniques to address the specialized challenge of speech retrieval in telecommunications and cloud communications platforms. The focus is on achieving rapid and accurate audio retrieval in batch processing instead of facilitating single requests, typically on a centralized server. Moreover, the paper demonstrates how this approach can be utilized to support audio clustering based on speech transcripts without undergoing actual speech-to-text conversion. This optimization enables significantly faster processing without the need for GPU computing, a requirement for real-time operation that is typically associated with state-of-the-art speech-to-text tools.
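For readers unfamiliar with the underlying technique, here is a generic landmark-style fingerprinting sketch (spectral peaks paired into hashes), the family of methods the paper adapts to speech; the parameters and hashing scheme are illustrative and not those of any specific system discussed in the paper.

```python
# Generic landmark-style audio fingerprint: one spectral peak per frame, peaks paired
# into hashes. Robustness is shown by the hash overlap between a clean and a noisy copy.
import numpy as np

def fingerprint(audio, n_fft=512, hop=256, fan_out=3):
    frames = np.lib.stride_tricks.sliding_window_view(audio, n_fft)[::hop]
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1))
    peaks = spec.argmax(axis=1)                       # crude: one peak bin per frame
    hashes = set()
    for t1 in range(len(peaks)):
        for dt in range(1, fan_out + 1):
            if t1 + dt < len(peaks):
                hashes.add(hash((int(peaks[t1]), int(peaks[t1 + dt]), dt)))
    return hashes

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)
noisy = clean + 0.05 * rng.normal(size=clean.shape)
fp1, fp2 = fingerprint(clean), fingerprint(noisy)
print(f"hash overlap: {len(fp1 & fp2) / len(fp1):.2f}")    # high overlap despite noise
```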
https://arxiv.org/abs/2410.21876
The detection of anomalous sounds in machinery operation presents a significant challenge due to the difficulty in generalizing anomalous acoustic patterns. This task is typically approached as an unsupervised learning or novelty detection problem, given the complexities associated with the acquisition of comprehensive anomalous acoustic data. Conventional methodologies for training anomalous sound detection systems primarily employ auto-encoder architectures or representational learning with auxiliary tasks. However, both approaches have inherent limitations. Auto-encoder structures are constrained to utilizing only the target machine's operational sounds, while training with auxiliary tasks, although capable of incorporating diverse acoustic inputs, may yield representations that lack correlation with the characteristic acoustic signatures of anomalous conditions. We propose a training method based on the source separation model (CMGAN) that aims to isolate non-target machine sounds from a mixture of target and non-target class acoustic signals. This approach enables the effective utilization of diverse machine sounds and facilitates the training of complex neural network architectures with limited sample sizes. Our experimental results demonstrate that the proposed method yields better performance compared to both conventional auto-encoder training approaches and source separation techniques that focus on isolating target machine signals. Moreover, our experimental results demonstrate that the proposed method exhibits the potential for enhanced representation learning as the quantity of non-target data increases, even while maintaining a constant volume of target class data.
https://arxiv.org/abs/2410.21797
Modern-day audio signal classification techniques lack the ability to classify low-feature audio signals in the form of spectrographic temporal-frequency data representations. Additionally, currently utilized techniques rely on full, diverse datasets that are often not representative of real-world distributions. This paper derives several first-of-their-kind machine learning methodologies to analyze these low-feature audio spectrograms given data distributions that may have normalized, skewed, or even limited training sets. In particular, this paper proposes several novel customized convolutional architectures to extract identifying features using binary, one-class, and Siamese approaches to identify the spectrographic signature of a given audio signal. Utilizing these novel convolutional architectures as well as the proposed classification methods, these experiments demonstrate state-of-the-art classification accuracy and improved efficiency compared to traditional audio classification methods.
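The Siamese variant can be sketched as a shared small CNN that embeds two spectrograms, with a contrastive loss on the embedding distance; the layer sizes below are illustrative, not the paper's architecture.

```python
# Siamese spectrogram classifier sketch: a shared CNN embedder plus a contrastive loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpectrogramEmbedder(nn.Module):
    def __init__(self, emb_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(32, emb_dim)

    def forward(self, x):                      # x: (batch, 1, freq, time)
        return self.fc(self.conv(x).flatten(1))

def contrastive_loss(e1, e2, same_class, margin=1.0):
    d = F.pairwise_distance(e1, e2)
    return (same_class * d.pow(2) + (1 - same_class) * F.relu(margin - d).pow(2)).mean()

net = SpectrogramEmbedder()
a, b = torch.randn(8, 1, 128, 64), torch.randn(8, 1, 128, 64)
labels = torch.randint(0, 2, (8,)).float()     # 1 = same signature, 0 = different
loss = contrastive_loss(net(a), net(b), labels)
loss.backward()
print(f"contrastive loss: {loss.item():.3f}")
```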
https://arxiv.org/abs/2410.21561
Sonar-based audio classification techniques are a growing area of research in the field of underwater acoustics. Usually, underwater noise picked up by passive sonar transducers contains all types of signals that travel through the ocean and is transformed into spectrographic images. As a result, the corresponding spectrograms intended to display the temporal-frequency data of a certain object often include the tonal regions of abundant extraneous noise that can effectively interfere with a 'contact'. Thus, a majority of spectrographic samples extracted from underwater audio signals are rendered unusable due to their clutter and lack the distinguishability required to separate different objects. With limited clean, true data for supervised training, creating classification models for these audio signals is severely bottlenecked. This paper derives several new techniques to combat this problem by developing a novel Score-CAM-based denoiser to extract an object's signature from noisy spectrographic data without being given any ground-truth data. In particular, this paper proposes a novel generative adversarial network architecture for learning and producing spectrographic training data in distributions similar to low-feature spectrogram inputs. In addition, this paper also proposes a generalizable class activation mapping based denoiser for different distributions of acoustic data, even real-world data distributions. Utilizing these novel architectures and the proposed denoising techniques, these experiments demonstrate state-of-the-art noise reduction accuracy and improved classification accuracy over current audio classification standards. As such, this approach has applications not only to audio data but to countless data distributions used all around the world for machine learning.
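The Score-CAM-based denoising idea can be sketched as follows: each activation map of a classifier masks the input, the class score of the masked input weights that map, and the combined saliency map gates the noisy spectrogram. The tiny untrained CNN below is purely for illustration and is not the paper's denoiser.

```python
# Score-CAM-style saliency used as a spectrogram mask: activation maps are scored by
# how much they preserve the target-class logit, then combined into a gating map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyClassifier(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                                      nn.Conv2d(8, 8, 3, padding=1), nn.ReLU())
        self.head = nn.Linear(8, n_classes)

    def forward(self, x):
        f = self.features(x)
        return self.head(f.mean(dim=(2, 3))), f

def score_cam_mask(model, spec, target_class):
    with torch.no_grad():
        _, acts = model(spec)                              # (1, C, F, T) activation maps
        weights = []
        for c in range(acts.shape[1]):
            m = acts[:, c:c+1]
            m = (m - m.min()) / (m.max() - m.min() + 1e-8) # normalise each map to [0, 1]
            logits, _ = model(spec * m)                    # score of the masked input
            weights.append(logits[0, target_class])
        w = torch.softmax(torch.stack(weights), dim=0)
        cam = F.relu((w.view(1, -1, 1, 1) * acts).sum(dim=1, keepdim=True))
        return cam / (cam.max() + 1e-8)

model = TinyClassifier()
noisy_spec = torch.rand(1, 1, 64, 64)
mask = score_cam_mask(model, noisy_spec, target_class=0)
denoised = noisy_spec * mask                               # keep only class-salient regions
print(denoised.shape)
```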
https://arxiv.org/abs/2410.21557
This study introduces a refined approach to Text-to-Speech (TTS) generation that significantly enhances sampling stability across languages, with a particular focus on Hebrew. By leveraging discrete semantic units with higher phonetic correlation obtained from a self-supervised model, our method addresses the inherent instability often encountered in TTS systems, especially those dealing with non-diacriticized scripts like Hebrew. Utilizing HuBERT codes, our model generates discrete representations that are optimized for TTS tasks, thereby reducing the dependency on diacritic-based text processing. This advancement not only simplifies the language modeling process but also improves the robustness and controllability of the speech output, owing to the disentanglement properties of the semantic units. The inclusion of a speaker embedding in the vocoder further aids in capturing the unique vocal characteristics of the speaker, contributing to the naturalness of the synthesized speech. Our experimental results demonstrate that this approach not only maintains high performance in Hebrew but also shows adaptability to English, underscoring its effectiveness in enhancing stability in TTS systems universally. Our method, named LOTHM (Language of The Hebrew Man), outperforms existing methods in terms of stability while achieving naturalness and speaker similarity on par with previous methods, making it a compelling choice for future speech synthesis applications. Samples can be found on our page: this http URL.
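HuBERT codes are conventionally obtained by clustering continuous self-supervised features with k-means and replacing each frame with its cluster index; the sketch below illustrates that standard pipeline with random stand-in features and is not specific to LOTHM.

```python
# Standard recipe for discrete "HuBERT codes": k-means over frame-level features,
# then collapse consecutive repeats. Features here are random stand-ins.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
hubert_features = rng.normal(size=(500, 768))        # (frames, feature_dim) from one utterance

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(hubert_features)
units = kmeans.predict(hubert_features)              # one discrete unit per frame

# Collapse consecutive repeats, as is common before feeding units to a language model.
deduped = [int(units[0])] + [int(u) for prev, u in zip(units, units[1:]) if u != prev]
print("first 20 units:", deduped[:20])
```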
https://arxiv.org/abs/2410.21502
Audio production style transfer is the task of processing an input to impart stylistic elements from a reference recording. Existing approaches often train a neural network to estimate control parameters for a set of audio effects. However, these approaches are limited in that they can only control a fixed set of effects, where the effects must be differentiable or otherwise employ specialized training techniques. In this work, we introduce ST-ITO, Style Transfer with Inference-Time Optimization, an approach that instead searches the parameter space of an audio effect chain at inference. This method enables control of arbitrary audio effect chains, including unseen and non-differentiable effects. Our approach employs a learned metric of audio production style, which we train through a simple and scalable self-supervised pretraining strategy, along with a gradient-free optimizer. Due to the limited existing evaluation methods for audio production style transfer, we introduce a multi-part benchmark to evaluate audio production style metrics and style transfer systems. This evaluation demonstrates that our audio representation better captures attributes related to audio production and enables expressive style transfer via control of arbitrary audio effects.
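A minimal sketch of the inference-time search: a gradient-free optimizer (plain random search here, whereas the paper uses a learned style metric and a stronger optimizer) tunes the parameters of an arbitrary, non-differentiable effect chain so that the processed input matches the reference's style embedding. The crude spectral "style embedding" below exists only to keep the example self-contained.

```python
# Inference-time optimization sketch: random search over effect-chain parameters
# to minimise the distance between style embeddings of the output and a reference.
import numpy as np

rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
input_audio = np.sin(2 * np.pi * 220 * t)
reference = 0.3 * np.tanh(3.0 * np.sin(2 * np.pi * 220 * t))      # quieter, saturated "style"

def effect_chain(x, gain, drive):
    return gain * np.tanh(drive * x)                               # non-differentiable is fine here

def style_embedding(x):
    mag = np.abs(np.fft.rfft(x))[:4000]
    bands = mag.reshape(40, -1).mean(axis=1)                       # crude band-energy summary
    return bands / (np.linalg.norm(bands) + 1e-8)

target = style_embedding(reference)
best_params, best_loss = None, np.inf
for _ in range(500):                                               # gradient-free random search
    gain, drive = rng.uniform(0.05, 1.0), rng.uniform(0.5, 8.0)
    loss = np.linalg.norm(style_embedding(effect_chain(input_audio, gain, drive)) - target)
    if loss < best_loss:
        best_params, best_loss = (gain, drive), loss
print(f"best (gain, drive) = ({best_params[0]:.2f}, {best_params[1]:.2f}), loss = {best_loss:.4f}")
```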
https://arxiv.org/abs/2410.21233
Deep learning-based single-channel speaker separation has improved significantly in recent years largely due to the introduction of the transformer-based attention mechanism. However, these improvements come at the expense of intense computational demands, precluding their use in many practical applications. As a computationally efficient alternative with similar modeling capabilities, Mamba was recently introduced. We propose SepMamba, a U-Net-based architecture composed primarily of bidirectional Mamba layers. We find that our approach outperforms similarly-sized prominent models - including transformer-based models - on the WSJ0 2-speaker dataset while enjoying a significant reduction in computational cost, memory usage, and forward pass time. We additionally report strong results for causal variants of SepMamba. Our approach provides a computationally favorable alternative to transformer-based architectures for deep speech separation.
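The bidirectional use of a causal sequence layer can be sketched by running the layer on the sequence and on its time reversal and combining the two streams; a unidirectional GRU stands in for a Mamba block below so the example stays dependency-free, and this is not SepMamba's exact block.

```python
# Generic "bidirectionalisation" of a causal sequence layer: forward pass plus a pass
# over the time-reversed input, then a projection of the concatenated streams.
import torch
import torch.nn as nn

class BidirectionalWrapper(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fwd = nn.GRU(dim, dim, batch_first=True)   # stand-in for a causal Mamba block
        self.bwd = nn.GRU(dim, dim, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x):                               # x: (batch, time, dim)
        y_fwd, _ = self.fwd(x)
        y_bwd, _ = self.bwd(torch.flip(x, dims=[1]))
        y_bwd = torch.flip(y_bwd, dims=[1])             # re-align to forward time order
        return self.proj(torch.cat([y_fwd, y_bwd], dim=-1))

layer = BidirectionalWrapper(dim=64)
print(layer(torch.randn(2, 100, 64)).shape)             # torch.Size([2, 100, 64])
```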
https://arxiv.org/abs/2410.20997
Atrial fibrillation (AF) is characterized by irregular electrical impulses originating in the atria, which can lead to severe complications and even death. Due to the intermittent nature of AF, early and timely monitoring is critical for patients to prevent further exacerbation of the condition. Although ambulatory ECG Holter monitors provide accurate monitoring, the high cost of these devices hinders their wider adoption. Current mobile-based AF detection systems offer a portable solution; however, these systems have various applicability issues, such as being easily affected by environmental factors and requiring significant user effort. To overcome the above limitations, we present MobileAF, a novel smartphone-based AF detection system using speakers and microphones. In order to capture minute cardiac activities, we propose a multi-channel pulse wave probing method. In addition, we enhance the signal quality by introducing a three-stage pulse wave purification pipeline. Moreover, a ResNet-based network model is built to enable accurate and reliable AF detection. We collected data from 23 participants using our data collection application on a smartphone. Extensive experimental results demonstrate the superior performance of our system, with 97.9% accuracy, 96.8% precision, 97.2% recall, 98.3% specificity, and a 97.0% F1 score.
https://arxiv.org/abs/2410.20852
The goal of the acoustic scene classification (ASC) task is to classify recordings into one of the predefined acoustic scene classes. However, in real-world scenarios, ASC systems often encounter challenges such as recording-device mismatch, low-complexity constraints, and the limited availability of labeled data. To alleviate these issues, in this paper, a data-efficient and low-complexity ASC system is built with a new model architecture and better training strategies. Specifically, we first design a new low-complexity architecture named Rep-Mobile by integrating multi-convolution branches which can be reparameterized at inference. Compared to other models, it achieves better performance with lower computational complexity. Then we apply a knowledge distillation strategy and compare the data efficiency of teacher models with different architectures. Finally, we propose a progressive pruning strategy, which prunes the model multiple times in small amounts, resulting in better performance than single-step pruning. Experiments are conducted on the TAU dataset. With Rep-Mobile and these training strategies, our proposed ASC system achieves state-of-the-art (SOTA) results, while also winning first place by a significant margin in the DCASE2024 Challenge.
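The reparameterization trick behind Rep-Mobile can be illustrated with the textbook case of folding parallel 3x3 and 1x1 convolution branches (without BatchNorm) into a single 3x3 convolution for inference; this is the generic structural-reparameterization idea, not the exact Rep-Mobile block.

```python
# Fold parallel conv branches (3x3 + 1x1) into one 3x3 conv: the multi-branch training
# form and the merged inference form produce numerically identical outputs.
import torch
import torch.nn as nn

c = 8
conv3, conv1 = nn.Conv2d(c, c, 3, padding=1), nn.Conv2d(c, c, 1)

x = torch.randn(2, c, 32, 32)
y_train = conv3(x) + conv1(x)                      # multi-branch form used in training

# Fold the 1x1 kernel into the centre of a 3x3 kernel and sum weights and biases.
merged = nn.Conv2d(c, c, 3, padding=1)
with torch.no_grad():
    w = conv3.weight.clone()
    w[:, :, 1:2, 1:2] += conv1.weight              # 1x1 kernel sits at the 3x3 centre
    merged.weight.copy_(w)
    merged.bias.copy_(conv3.bias + conv1.bias)

y_infer = merged(x)                                # single-branch form used at inference
print("max deviation:", (y_train - y_infer).abs().max().item())   # ~1e-6, numerically identical
```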
https://arxiv.org/abs/2410.20775