This paper presents the multi-speaker, multi-lingual few-shot voice cloning system developed by the THU-HCSI team for the LIMMITS'24 Challenge. To achieve high speaker similarity and naturalness in both mono-lingual and cross-lingual scenarios, we build the system upon YourTTS and add several enhancements. To further improve speaker similarity and speech quality, we introduce a speaker-aware text encoder and a flow-based decoder with Transformer blocks. In addition, we denoise the few-shot data, mix it with the pre-training data, and adopt a speaker-balanced sampling strategy to guarantee effective fine-tuning for the target speakers. The official evaluations in track 1 show that our system achieves the best speaker-similarity MOS of 4.25 and a considerable naturalness MOS of 3.97.
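As a hedged illustration of the speaker-balanced sampling strategy above, the sketch below draws each batch item by first choosing a speaker uniformly at random and only then one of that speaker's utterances, so the few-shot target speakers are not swamped by the much larger pre-training data; the data structure and function names are hypothetical, not the team's implementation.

```python
import random

def speaker_balanced_batch(utts_by_speaker, batch_size):
    """utts_by_speaker: dict mapping speaker id -> list of utterance paths."""
    speakers = list(utts_by_speaker)
    batch = []
    for _ in range(batch_size):
        spk = random.choice(speakers)                       # uniform over speakers
        batch.append(random.choice(utts_by_speaker[spk]))   # then over utterances
    return batch
```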
https://arxiv.org/abs/2404.16619
Existing works have made strides in video generation, but the lack of sound effects (SFX) and background music (BGM) hinders a complete and immersive viewer experience. We introduce a novel semantically consistent video-to-audio generation framework, namely SVA, which automatically generates audio semantically consistent with the given video content. The framework harnesses the power of a multimodal large language model (MLLM) to understand video semantics from a key frame and generate creative audio schemes, which are then used as prompts for text-to-audio models, resulting in video-to-audio generation with natural language as an interface. We show the satisfactory performance of SVA through a case study and discuss its limitations along with future research directions. The project page is available at this https URL.
https://arxiv.org/abs/2404.16305
We analyze the concept of virtuosity as a collective attribute in music and its relationship with entropy, based on an experiment comparing two sets of digital signals played by composer-performer electric guitarists. Using an interdisciplinary approach rooted in complex systems, we computed the spectra of the signals, identified the statistical distributions that best describe them, and measured the Shannon entropy to establish their diversity. The findings suggest that virtuosity may be related to a range of entropy values that identify levels of diversity in the frequency components of audio signals. Despite the presence of different entropy values in the two sets of signals, they are statistically similar. Therefore, entropy values can be interpreted as levels of virtuosity in music.
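As a minimal sketch of the core measurement, the snippet below computes the Shannon entropy of a signal's normalized power spectrum; this is generic spectral entropy under the stated assumptions (a mono float array), not the authors' exact pipeline.

```python
import numpy as np

def spectral_shannon_entropy(signal):
    """Shannon entropy (bits) of the normalized power spectrum of a signal."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2   # power spectrum
    p = spectrum / spectrum.sum()                 # normalize to a distribution
    p = p[p > 0]                                  # drop empty bins (0 log 0 := 0)
    return float(-np.sum(p * np.log2(p)))

# A pure tone concentrates energy in one bin (low entropy); white noise
# spreads it across all bins (high entropy).
t = np.linspace(0, 1, 44100, endpoint=False)
print(spectral_shannon_entropy(np.sin(2 * np.pi * 440 * t)))                       # low
print(spectral_shannon_entropy(np.random.default_rng(0).standard_normal(44100)))  # high
```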
https://arxiv.org/abs/2404.16259
The increasing prevalence of audio deepfakes poses significant security threats, necessitating robust detection methods. While existing detection systems show promise, their robustness against malicious audio manipulations remains underexplored. To bridge this gap, we undertake the first comprehensive study of the susceptibility of the most widely adopted audio deepfake detectors to manipulation attacks. Surprisingly, even manipulations as simple as volume control can significantly bypass detection without affecting human perception. To address this, we propose CLAD (Contrastive Learning-based Audio deepfake Detector) to enhance robustness against manipulation attacks. The key idea is to incorporate contrastive learning to minimize the variations introduced by manipulations, thereby enhancing detection robustness. Additionally, we incorporate a length loss, aiming to improve detection accuracy by clustering real audio samples more closely in the feature space. We comprehensively evaluated the most widely adopted audio deepfake detection models and our proposed CLAD against various manipulation attacks. The detection models exhibited vulnerabilities, with FAR rising to 36.69%, 31.23%, and 51.28% under volume control, fading, and noise injection, respectively. CLAD enhanced robustness, reducing the FAR to 0.81% under noise injection and consistently maintaining an FAR below 1.63% across all tests. Our source code and documentation are available in the artifact repository (this https URL).
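For concreteness, here is a hedged sketch of two of the manipulations evaluated, volume control and noise injection at a target SNR; the exact parameterization is an assumption, not the paper's setup.

```python
import numpy as np

def volume_control(x, gain_db):
    """Scale the waveform by a gain given in decibels."""
    return x * 10 ** (gain_db / 20)

def noise_injection(x, snr_db, rng=np.random.default_rng(0)):
    """Add white Gaussian noise so the result has the requested SNR (dB)."""
    noise = rng.standard_normal(x.shape)
    scale = np.sqrt(np.mean(x ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10)))
    return x + scale * noise
```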
https://arxiv.org/abs/2404.15854
Single-model systems often suffer from deficiencies in tasks such as speaker verification (SV) and image classification, relying heavily on partial prior knowledge during decision-making and thus yielding suboptimal performance. Although multi-model fusion (MMF) can mitigate some of these issues, redundancy in the learned representations may limit improvements. To this end, we propose an adversarial complementary representation learning (ACoRL) framework that enables newly trained models to avoid previously acquired knowledge, allowing each individual component model to learn maximally distinct, complementary representations. We provide three detailed explanations of why this works, and experimental results demonstrate that our method improves performance more efficiently than traditional MMF. Furthermore, attribution analysis validates that the model trained under ACoRL acquires more complementary knowledge, highlighting the efficacy of our approach in enhancing efficiency and robustness across tasks.
https://arxiv.org/abs/2404.15704
We introduce HybridVC, a voice conversion (VC) framework built upon a pre-trained conditional variational autoencoder (CVAE) that combines the strengths of a latent model with contrastive learning. HybridVC supports text and audio prompts, enabling more flexible voice style conversion. HybridVC models a latent distribution conditioned on speaker embeddings acquired by a pre-trained speaker encoder and, in parallel, optimises style text embeddings to align with the speaker style information through contrastive learning. Therefore, HybridVC can be efficiently trained under limited computational resources. Our experiments demonstrate HybridVC's superior training efficiency and its capability for advanced multi-modal voice style conversion. This underscores its potential for widespread applications such as user-defined personalised voice on various social media platforms. A comprehensive ablation study further validates the effectiveness of our method.
https://arxiv.org/abs/2404.15637
This paper presents software for describing voices with a continuous Voice Femininity Percentage (VFP). The system is intended for transgender speakers during their voice transition and for the voice therapists supporting them in this process. A corpus of 41 French cis- and transgender speakers was recorded, and a perceptual evaluation allowed 57 participants to estimate the VFP of each voice. Binary gender classification models were trained on external gender-balanced data and applied to overlapping windows to obtain averaged gender prediction estimates, which were calibrated to predict the VFP and achieved higher accuracy than $F_0$- or vocal-tract-length-based models. The speaking style of the training data and the DNN architecture were shown to impact VFP estimation, and model accuracy was affected by speakers' age. This highlights the importance of style, age, and the conception of gender as binary or not when building adequate statistical representations of cultural concepts.
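The windowed-averaging step can be sketched as below, assuming a hypothetical `classifier` callable that returns the probability of a female voice for a window of samples; the window length, hop, and linear calibration map are illustrative assumptions.

```python
import numpy as np

def estimate_vfp(audio, sr, classifier, win_s=1.0, hop_s=0.25,
                 calibrate=lambda p: 100.0 * p):
    """Average binary gender predictions over overlapping windows, then
    calibrate the mean probability into a Voice Femininity Percentage."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    scores = [classifier(audio[i:i + win])
              for i in range(0, len(audio) - win + 1, hop)]
    return calibrate(float(np.mean(scores)))
```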
https://arxiv.org/abs/2404.15176
Deepfake speech represents a real and growing threat to systems and society. Many detectors have been created to aid in defense against speech deepfakes. While these detectors implement myriad methodologies, many rely on low-level fragments of the speech generation process. We hypothesize that breath, a higher-level component of speech, is a key element of natural speech, and thus that its improper generation in deepfake speech is a performant discriminator. To evaluate this, we create a breath detector and leverage it against a custom dataset of online news article audio to discriminate between real and deepfake speech. Additionally, we make this custom dataset publicly available to facilitate comparison in future work. Applying our simple breath detector as a deepfake speech discriminator on in-the-wild samples allows for accurate classification (perfect 1.0 AUPRC and 0.0 EER on test data) across 33.6 hours of audio. We compare our model with the state-of-the-art SSL-wav2vec model and show that this complex deep learning model completely fails to classify the same in-the-wild samples (0.72 AUPRC and 0.99 EER).
https://arxiv.org/abs/2404.15143
Self-Supervised Learning (SSL) frameworks have become the standard for learning robust class representations by benefiting from large unlabeled datasets. For Speaker Verification (SV), most SSL systems rely on contrastive loss functions. We explore different ways to improve the performance of these techniques by revisiting the NT-Xent contrastive loss. Our main contribution is the definition of the NT-Xent-AM loss and the study of the importance of the Additive Margin (AM) in the SimCLR and MoCo SSL methods for further separating positive from negative pairs. Despite class collisions, we show that AM enhances the compactness of same-speaker embeddings and reduces the number of false negatives and false positives on SV. Additionally, we demonstrate the effectiveness of the symmetric contrastive loss, which provides more supervision for the SSL task. Implementing these two modifications to SimCLR improves performance and results in 7.85% EER on VoxCeleb1-O, outperforming other equivalent methods.
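A minimal PyTorch sketch of an NT-Xent loss with an additive margin on the positive pairs, including the symmetric supervision mentioned above; the temperature, margin value, and negative set are assumptions rather than the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def nt_xent_am(z1, z2, temperature=0.07, margin=0.1):
    """z1, z2: (N, D) embeddings of two views of the same N speakers."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    sim = z1 @ z2.t()                              # (N, N) cosine similarities
    eye = torch.eye(sim.size(0), device=sim.device)
    logits = (sim - margin * eye) / temperature    # additive margin on positives
    labels = torch.arange(sim.size(0), device=sim.device)
    # Symmetric contrastive loss: supervise both view orders.
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))
```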
https://arxiv.org/abs/2404.14913
In multi-sample keyword spotting, each keyword class is represented by multiple spoken instances, called samples. A naïve approach to detecting keywords in a target sequence consists of querying all samples of all classes using sub-sequence dynamic time warping. However, the resulting processing time increases linearly with the number of samples belonging to each class. Alternatively, only a single Fréchet mean can be queried for each class, resulting in reduced processing time but usually also in worse detection performance, as the variability of the query samples is not captured sufficiently well. In this work, multi-sample dynamic time warping is proposed to compute class-specific cost tensors that include the variability of all query samples. To significantly reduce the computational complexity during inference, these cost tensors are converted to cost matrices before applying dynamic time warping. In experimental evaluations for few-shot keyword spotting, this method is shown to yield performance very similar to using all individual query samples as templates, while its runtime is only slightly slower than when using Fréchet means.
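One way to picture the tensor-to-matrix conversion is sketched below: stack per-sample frame-distance matrices into a class-specific cost tensor, then collapse the sample axis before running sub-sequence DTW. The element-wise minimum and the equal-length assumption on query samples are simplifications; the paper's exact conversion may differ.

```python
import numpy as np
from scipy.spatial.distance import cdist

def class_cost_matrix(target_feats, query_samples):
    """target_feats: (T, D) array; query_samples: list of (Q, D) arrays,
    assumed time-aligned to a common length Q for this sketch."""
    # (T, Q, S) cost tensor capturing the variability of all query samples ...
    tensor = np.stack([cdist(target_feats, q) for q in query_samples], axis=-1)
    # ... collapsed to a single (T, Q) cost matrix before DTW.
    return tensor.min(axis=-1)
```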
https://arxiv.org/abs/2404.14903
It is challenging to improve automatic speech recognition (ASR) performance in noisy conditions with a single-channel speech enhancement (SE) front-end. This is generally attributed to the processing distortions caused by the nonlinear processing of single-channel SE front-ends. However, the causes of such degraded ASR performance have not been fully investigated. How to design single-channel SE front-ends in a way that significantly improves ASR performance remains an open research question. In this study, we investigate a signal-level numerical metric that can explain the cause of degradation in ASR performance. To this end, we propose a novel analysis scheme based on the orthogonal projection-based decomposition of SE errors. This scheme manually modifies the ratio of the decomposed interference, noise, and artifact errors, and it enables us to directly evaluate the impact of each error type on ASR performance. Our analysis reveals the particularly detrimental effect of artifact errors on ASR performance compared to the other types of errors. This provides us with a more principled definition of processing distortions that cause the ASR performance degradation. Then, we study two practical approaches for reducing the impact of artifact errors. First, we prove that the simple observation adding (OA) post-processing (i.e., interpolating the enhanced and observed signals) can monotonically improve the signal-to-artifact ratio. Second, we propose a novel training objective, called artifact-boosted signal-to-distortion ratio (AB-SDR), which forces the model to estimate the enhanced signals with fewer artifact errors. Through experiments, we confirm that both the OA and AB-SDR approaches are effective in decreasing artifact errors caused by single-channel SE front-ends, allowing them to significantly improve ASR performance.
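The observation adding (OA) post-processing amounts to a one-line interpolation, sketched below; the fixed mixing weight is an illustrative assumption, since choosing the weight is its own design question.

```python
import numpy as np

def observation_adding(enhanced, observed, alpha=0.7):
    """Interpolate the SE output with the unprocessed observation.

    Mixing part of the raw observation back in trades artifact errors
    (particularly harmful to ASR) for noise errors (comparatively benign),
    monotonically improving the signal-to-artifact ratio.
    """
    return alpha * np.asarray(enhanced) + (1.0 - alpha) * np.asarray(observed)
```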
https://arxiv.org/abs/2404.14860
One key aspect differentiating data-driven single- and multi-channel speech enhancement and dereverberation methods is that both the problem formulation and the complexity of the solutions are considerably more challenging in the latter case. Additionally, with limited computational resources, it is cumbersome to train models that require the management of larger datasets or that have more complex designs. In this scenario, the unverified hypothesis that single-channel methods can be adapted to multi-channel scenarios simply by processing each channel independently holds significant implications: it boosts compatibility between sound scene capture and system input-output formats, while also allowing modern research to focus on other challenging aspects, such as full-bandwidth audio enhancement, competitive noise suppression, and unsupervised learning. This study verifies this hypothesis by comparing the enhancement achieved by a basic single-channel speech enhancement and dereverberation model with that of two multi-channel models tailored to separate clean speech from noisy 3D mixes. A direction-of-arrival estimation model was used to objectively evaluate the capacity to preserve spatial information by comparing the output signals with ground-truth coordinate values. Consequently, a trade-off emerges: the more straightforward single-channel solution preserves spatial information at the cost of lower gains in intelligibility scores.
https://arxiv.org/abs/2404.14564
Evolutionary Algorithms and Generative Deep Learning have been two of the most powerful tools for sound generation tasks. However, they have limitations: Evolutionary Algorithms require complicated designs, posing challenges in control and achieving realistic sound generation. Generative Deep Learning models often copy from the dataset and lack creativity. In this paper, we propose LVNS-RAVE, a method to combine Evolutionary Algorithms and Generative Deep Learning to produce realistic and novel sounds. We use the RAVE model as the sound generator and the VGGish model as a novelty evaluator in the Latent Vector Novelty Search (LVNS) algorithm. The reported experiments show that the method can successfully generate diversified, novel audio samples under different mutation setups using different pre-trained RAVE models. The characteristics of the generation process can be easily controlled with the mutation parameters. The proposed algorithm can be a creative tool for sound artists and musicians.
https://arxiv.org/abs/2404.14063
The availability of smart devices leads to an exponential increase in multimedia content. However, the rapid advancements in deep learning have given rise to sophisticated algorithms capable of manipulating or creating multimedia fake content, known as Deepfake. Audio Deepfakes pose a significant threat by producing highly realistic voices, thus facilitating the spread of misinformation. To address this issue, numerous audio anti-spoofing detection challenges have been organized to foster the development of anti-spoofing countermeasures. This survey paper presents a comprehensive review of every component within the detection pipeline, including algorithm architectures, optimization techniques, application generalizability, evaluation metrics, performance comparisons, available datasets, and open-source availability. For each aspect, we conduct a systematic evaluation of the recent advancements, along with discussions on existing challenges. Additionally, we also explore emerging research topics on audio anti-spoofing, including partial spoofing detection, cross-dataset evaluation, and adversarial attack defence, while proposing some promising research directions for future work. This survey paper not only identifies the current state-of-the-art to establish strong baselines for future experiments but also guides future researchers on a clear path for understanding and enhancing the audio anti-spoofing detection mechanisms.
https://arxiv.org/abs/2404.13914
Current research in robotic sounds generally focuses on either masking the consequential sound produced by the robot or on sonifying data about the robot to create a synthetic robot sound. We propose to capture, modify, and utilise rather than mask the sounds that robots are already producing. In short, this approach relies on capturing a robot's sounds, processing them according to contextual information (e.g., collaborators' proximity or particular work sequences), and playing back the modified sound. Previous research indicates the usefulness of non-semantic, and even mechanical, sounds as a communication tool for conveying robotic affect and function. Adding to this, this paper presents a novel approach which makes two key contributions: (1) a technique for real-time capture and processing of consequential robot sounds, and (2) an approach to explore these sounds through direct human-robot interaction. Drawing on methodologies from design, human-robot interaction, and creative practice, the resulting 'Robotic Blended Sonification' is a concept which transforms the consequential robot sounds into a creative material that can be explored artistically and within application-based studies.
https://arxiv.org/abs/2404.13821
This article discusses the application of single vector hydrophones in underwater acoustic signal processing for Direction of Arrival (DOA) estimation. Addressing the limitations of traditional DOA estimation methods in multi-source environments and under noise interference, this study introduces a Vector Signal Reconstruction Sparse and Parametric Approach (VSRSPA). This method reconstructs the signal model of a single vector hydrophone, converting its covariance matrix into a Toeplitz structure suitable for the Sparse and Parametric Approach (SPA) algorithm, which is then applied to achieve more accurate DOA estimation. Through detailed simulation analysis, this research confirms the performance of the proposed algorithm in single- and dual-target DOA estimation scenarios, especially under various signal-to-noise ratio (SNR) conditions. The simulation results show that, compared to traditional DOA estimation methods, this algorithm has significant advantages in estimation accuracy and resolution, particularly for multi-source signals and in low-SNR environments. The contribution of this study lies in providing an effective new method for DOA estimation with single vector hydrophones in complex environments, introducing new research directions and solutions in the field of vector hydrophone signal processing.
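A hedged sketch of the Toeplitz step: averaging a sample covariance along its diagonals yields the Toeplitz structure that SPA-style methods expect. This diagonal averaging is a standard construction and not necessarily the paper's exact reconstruction.

```python
import numpy as np
from scipy.linalg import toeplitz

def toeplitzify(R):
    """Project a (Hermitian) sample covariance onto Toeplitz structure
    by averaging each diagonal."""
    n = R.shape[0]
    first_col = np.array([np.mean(np.diag(R, -k)) for k in range(n)])
    return toeplitz(first_col, first_col.conj())
```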
https://arxiv.org/abs/2404.15160
Word embedding has become an essential means for text-based information retrieval. Typically, word embeddings are learned from large quantities of general and unstructured text data. However, in the domain of music, the word embedding may have difficulty understanding musical contexts or recognizing music-related entities like artists and tracks. To address this issue, we propose a new approach called Musical Word Embedding (MWE), which involves learning from various types of texts, including both everyday and music-related vocabulary. We integrate MWE into an audio-word joint representation framework for tagging and retrieving music, using words like tag, artist, and track that have different levels of musical specificity. Our experiments show that using a more specific musical word like track results in better retrieval performance, while using a less specific term like tag leads to better tagging performance. To balance this compromise, we suggest multi-prototype training that uses words with different levels of musical specificity jointly. We evaluate both word embedding and audio-word joint embedding on four tasks (tag rank prediction, music tagging, query-by-tag, and query-by-track) across two datasets (Million Song Dataset and MTG-Jamendo). Our findings show that the suggested MWE is more efficient and robust than the conventional word embedding.
https://arxiv.org/abs/2404.13569
This study investigates the application of single vector hydrophones in underwater acoustic signal processing for Direction of Arrival (DOA) estimation. Addressing the limitations of traditional DOA estimation methods in multi-source environments and under noise interference, this research proposes a Vector Signal Reconstruction (VSR) technique. This technique transforms the covariance matrix of single vector hydrophone signals into a Toeplitz structure suitable for gridless sparse methods through complex calculations and vector signal reconstruction. Furthermore, two sparse DOA estimation algorithms based on vector signal reconstruction are introduced. Theoretical analysis and simulation experiments demonstrate that the proposed algorithms significantly improve the accuracy and resolution of DOA estimation in multi-source signals and low Signal-to-Noise Ratio (SNR) environments compared to traditional algorithms. The contribution of this study lies in providing an effective new method for DOA estimation with single vector hydrophones in complex environments, introducing new research directions and solutions in the field of vector hydrophone signal processing.
https://arxiv.org/abs/2404.13568
Recent research has successfully adapted vision-based convolutional neural network (CNN) architectures to audio recognition tasks using Mel-spectrograms. However, these CNNs have high computational costs and memory requirements, limiting their deployment on low-end edge devices. Motivated by the success of efficient vision models like InceptionNeXt and ConvNeXt, we propose AudioRepInceptionNeXt, a single-stream architecture. Its basic building block breaks down the parallel multi-branch depth-wise convolutions with descending scales of k x k kernels into a cascade of two multi-branch depth-wise convolutions. The first multi-branch consists of parallel multi-scale 1 x k depth-wise convolutional layers, followed by a similar multi-branch employing parallel multi-scale k x 1 depth-wise convolutional layers. This reduces the computational and memory footprint while separating the time and frequency processing of Mel-spectrograms. The large kernels capture global frequencies and long activities, while the small kernels capture local frequencies and short activities. We also reparameterize the multi-branch design during inference to further boost speed without losing accuracy. Experiments show that AudioRepInceptionNeXt reduces parameters and computations by more than 50% and improves inference speed by 1.28x over state-of-the-art CNNs such as Slow-Fast, while maintaining comparable accuracy. It also learns robustly across a variety of audio recognition tasks. Code is available at this https URL.
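A minimal PyTorch sketch of the described cascade, with a parallel multi-scale 1 x k depth-wise stage followed by a k x 1 stage; the kernel sizes and the summation used to merge branches are assumptions, and the released code should be treated as authoritative.

```python
import torch
import torch.nn as nn

class CascadedMultiBranchDW(nn.Module):
    """1 x k multi-branch depth-wise stage cascaded with a k x 1 stage."""
    def __init__(self, channels, kernel_sizes=(11, 7, 3)):
        super().__init__()
        self.stage1 = nn.ModuleList(
            nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels)
            for k in kernel_sizes)
        self.stage2 = nn.ModuleList(
            nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels)
            for k in kernel_sizes)

    def forward(self, x):  # x: (batch, C, mel_bins, frames) Mel-spectrogram
        x = sum(branch(x) for branch in self.stage1)      # convolve along one axis
        return sum(branch(x) for branch in self.stage2)   # then along the other
```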
https://arxiv.org/abs/2404.13551
In the composition process, selecting appropriate single-instrumental music sequences and assigning their track-role is an indispensable task. However, manually determining the track-role for a myriad of music samples can be time-consuming and labor-intensive. This study introduces a deep learning model designed to automatically predict the track-role of single-instrumental music sequences. Our evaluations show a prediction accuracy of 87% in the symbolic domain and 84% in the audio domain. The proposed track-role prediction methods hold promise for future applications in AI music generation and analysis.
https://arxiv.org/abs/2404.13286