In this work, we investigate two popular end-to-end automatic speech recognition (ASR) models, namely Connectionist Temporal Classification (CTC) and RNN-Transducer (RNN-T), for offline recognition of voice search queries, with up to 2B model parameters. The encoders of our models use the neural architecture of Google's universal speech model (USM), with additional funnel pooling layers to significantly reduce the frame rate and speed up training and inference. We perform extensive studies on vocabulary size, time reduction strategy, and the models' generalization performance on long-form test sets. Despite the speculation that, as the model size increases, CTC can be as good as RNN-T, which builds label dependency into the prediction, we observe that a 900M RNN-T clearly outperforms a 1.8B CTC and is more tolerant to severe time reduction, although the WER gap can be largely removed by LM shallow fusion.
https://arxiv.org/abs/2309.12963
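To make the frame-rate reduction above concrete, here is a minimal sketch of funnel-style time reduction as strided pooling over encoder frames; the 4x factor and layer placement are assumptions for illustration, not the USM implementation.

```python
import torch
import torch.nn as nn

class TimeReduction(nn.Module):
    """Minimal sketch: average-pool encoder frames along time to cut the frame
    rate before the CTC / RNN-T decoder (reduction factor is an assumption)."""
    def __init__(self, reduction: int = 4):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=reduction, stride=reduction)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, dim) -> (batch, time // reduction, dim)
        return self.pool(frames.transpose(1, 2)).transpose(1, 2)

# Example: a 100 Hz encoder output pooled to 25 Hz.
x = torch.randn(2, 400, 512)
print(TimeReduction(4)(x).shape)  # torch.Size([2, 100, 512])
```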
Keyword spotting (KWS) refers to the task of identifying a set of predefined words in audio streams. With the advances seen recently with deep neural networks, it has become a popular technology to activate and control small devices, such as voice assistants. Relying on such models for edge devices, however, can be challenging due to hardware constraints. Moreover, as adversarial attacks have increased against voice-based technologies, developing solutions robust to such attacks has become crucial. In this work, we propose VIC-KD, a robust distillation recipe for model compression and adversarial robustness. Using self-supervised speech representations, we show that imposing geometric priors to the latent representations of both Teacher and Student models leads to more robust target models. Experiments on the Google Speech Commands datasets show that the proposed methodology improves upon current state-of-the-art robust distillation methods, such as ARD and RSLAD, by 12% and 8% in robust accuracy, respectively.
https://arxiv.org/abs/2309.12914
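The abstract does not spell out the distillation loss, so the sketch below assumes a VICReg-style variance-invariance-covariance regularization between projected teacher and student latents as one plausible reading of the "geometric priors"; it is not the authors' VIC-KD code.

```python
import torch
import torch.nn.functional as F

def vic_terms(z_s: torch.Tensor, z_t: torch.Tensor, eps: float = 1e-4):
    """Assumed variance-invariance-covariance terms between student (z_s) and
    teacher (z_t) latents of shape (batch, dim)."""
    inv = F.mse_loss(z_s, z_t)                     # invariance: match the teacher
    def var(z):                                    # variance: keep each dimension spread out
        return torch.relu(1.0 - torch.sqrt(z.var(dim=0) + eps)).mean()
    def cov(z):                                    # covariance: decorrelate dimensions
        z = z - z.mean(dim=0)
        c = (z.T @ z) / (z.shape[0] - 1)
        off_diag = c - torch.diag(torch.diag(c))
        return (off_diag ** 2).sum() / z.shape[1]
    return inv + 0.5 * (var(z_s) + var(z_t)) + 0.5 * (cov(z_s) + cov(z_t))
```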
To train transcription models that produce robust results, a large and diverse labeled dataset is required. Finding data with the necessary characteristics is a challenging task, especially for languages less widely spoken than English. Moreover, producing such data requires significant effort and often money. Therefore, a strategy to mitigate this problem is the use of data augmentation techniques. In this work, we propose a framework for data augmentation based on deepfake audio. To validate the framework, experiments were conducted using existing deepfake and transcription models. A voice cloner and an English-language dataset recorded by Indian speakers were selected, ensuring the presence of a single accent in the data. Subsequently, the augmented data was used to train speech-to-text models in various scenarios.
https://arxiv.org/abs/2309.12802
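A rough sketch of the augmentation loop described above: synthesize accented speech for existing transcripts with a voice cloner and mix it into the ASR training set. `clone_voice` is a hypothetical stand-in for whatever cloning model is used, and the mixing ratio is an arbitrary choice.

```python
import random

def augment_with_deepfakes(transcripts, reference_voices, clone_voice, ratio=0.5):
    """Return (waveform, transcript) pairs to append to the real training data."""
    augmented = []
    for text in transcripts:
        if random.random() < ratio:
            speaker_wav = random.choice(reference_voices)   # keep a single accent
            augmented.append((clone_voice(text, speaker_wav), text))
    return augmented
```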
This paper introduces an improved duration informed attention neural network (DurIAN-E) for expressive and high-fidelity text-to-speech (TTS) synthesis. Inherited from the original DurIAN model, DurIAN-E adopts an auto-regressive structure in which the alignments between the input linguistic information and the output acoustic features are inferred from a duration model. Meanwhile, the proposed DurIAN-E utilizes multiple stacked SwishRNN-based Transformer blocks as linguistic encoders. Style-Adaptive Instance Normalization (SAIN) layers are incorporated into frame-level encoders to improve the modeling of expressiveness. A denoiser combining a denoising diffusion probabilistic model (DDPM) for mel-spectrograms with SAIN modules is employed to further improve the synthetic speech quality and expressiveness. Experimental results show that the proposed expressive TTS model achieves better performance than state-of-the-art approaches in both subjective mean opinion score (MOS) and preference tests.
https://arxiv.org/abs/2309.12792
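The SAIN layer can be pictured as instance normalization whose scale and shift are predicted from a style embedding; a minimal sketch under that assumption (dimensions and the exact affine form are not taken from the paper):

```python
import torch
import torch.nn as nn

class SAIN(nn.Module):
    """Sketch of Style-Adaptive Instance Normalization: instance-normalize
    frame-level features, then modulate them with style-predicted parameters."""
    def __init__(self, channels: int, style_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        self.affine = nn.Linear(style_dim, 2 * channels)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames), style: (batch, style_dim)
        gamma, beta = self.affine(style).chunk(2, dim=-1)
        return self.norm(x) * (1 + gamma.unsqueeze(-1)) + beta.unsqueeze(-1)
```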
This research introduces an enhanced version of the multi-objective speech assessment model, called MOSA-Net+, by leveraging the acoustic features from Whisper, a large pre-trained weakly supervised model, to create embedding features. The first part of this study investigates the correlation of the embedding features of Whisper and of two self-supervised learning (SSL) models with subjective quality and intelligibility scores. The second part evaluates the effectiveness of Whisper in deploying a more robust speech assessment model. Third, the possibility of combining representations from Whisper and SSL models while deploying MOSA-Net+ is analyzed. The experimental results reveal that Whisper's embedding features correlate more strongly with subjective quality and intelligibility than the SSL models' embedding features, contributing to the more accurate predictions achieved by MOSA-Net+. Moreover, combining the embedding features from Whisper and the SSL models leads to only marginal improvement. Compared to MOSA-Net and other SSL-based speech assessment models, MOSA-Net+ yields notable improvements in estimating subjective quality and intelligibility scores across all evaluation metrics. We further tested MOSA-Net+ on Track 3 of the VoiceMOS Challenge 2023 and obtained the top-ranked performance.
https://arxiv.org/abs/2309.12766
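A minimal sketch of the first analysis step: pool Whisper encoder states into one embedding per utterance and correlate a simple summary statistic with subjective MOS labels. The checkpoint name, mean pooling, and the scalar proxy are assumptions, not the MOSA-Net+ recipe.

```python
import torch
from scipy.stats import pearsonr
from transformers import WhisperFeatureExtractor, WhisperModel

fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = WhisperModel.from_pretrained("openai/whisper-base").eval()

def whisper_embedding(waveform_16k):
    feats = fe(waveform_16k, sampling_rate=16000, return_tensors="pt").input_features
    with torch.no_grad():
        enc = model.encoder(feats).last_hidden_state   # (1, frames, dim)
    return enc.mean(dim=1).squeeze(0)                  # utterance-level embedding

# e.g. correlate a crude scalar summary of the embedding with MOS labels:
# scores = [whisper_embedding(w).norm().item() for w in waveforms]
# r, _ = pearsonr(scores, mos_labels)
```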
Speech Emotion Recognition (SER) plays a pivotal role in enhancing human-computer interaction by enabling a deeper understanding of emotional states across a wide range of applications, contributing to more empathetic and effective communication. This study proposes an innovative approach that integrates self-supervised feature extraction with supervised classification for emotion recognition from small audio segments. In the preprocessing step, to eliminate the need for hand-crafted audio features, we employ a self-supervised feature extractor based on the Wav2Vec model to capture acoustic features from audio data. The output feature maps of the preprocessing step are then fed to a custom-designed Convolutional Neural Network (CNN)-based model to perform emotion classification. Using the ShEMO dataset as our testing ground, the proposed method surpasses two baseline methods, i.e., a support vector machine classifier and transfer learning of a pretrained CNN. Comparing the proposed method to state-of-the-art methods on the SER task further indicates its superiority. Our findings underscore the pivotal role of deep unsupervised feature learning in advancing SER, offering enhanced emotional comprehension in the realm of human-computer interaction.
https://arxiv.org/abs/2309.12714
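The two-stage recipe can be sketched as frozen Wav2Vec 2.0 features feeding a small CNN classifier; the checkpoint name and layer sizes below are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SERClassifier(nn.Module):
    def __init__(self, num_emotions: int, feat_dim: int = 768):
        super().__init__()
        self.backbone = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.backbone.requires_grad_(False)   # self-supervised features, no fine-tuning
        self.cnn = nn.Sequential(
            nn.Conv1d(feat_dim, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(128, num_emotions)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) at 16 kHz
        feats = self.backbone(waveform).last_hidden_state.transpose(1, 2)
        return self.head(self.cnn(feats).squeeze(-1))
```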
Recent progress in Automatic Speech Recognition (ASR) has been coupled with a substantial increase in model sizes, which may now contain billions of parameters, leading to slow inference even with adapted hardware. In this context, ASR models exist in various sizes, with different inference costs leading to different performance levels. Based on the observation that smaller models perform optimally on large parts of testing corpora, we propose to train a decision module that, given an audio sample, selects the smallest model sufficient for a good transcription. We apply our approach to two Whisper models of different sizes. By keeping the decision process computationally efficient, we build a decision module that allows substantial computational savings with only small drops in performance.
https://arxiv.org/abs/2309.12712
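The cascade can be sketched as follows: transcribe with the small model, let a lightweight decision module judge whether that output suffices, and only then fall back to the large model. `DecisionModule` and its `score` method are hypothetical placeholders; the model sizes and threshold are assumptions.

```python
import whisper

small = whisper.load_model("base")
large = whisper.load_model("large-v2")

def transcribe(audio_path, decision_module, threshold=0.5):
    first = small.transcribe(audio_path)
    if decision_module.score(audio_path, first["text"]) >= threshold:
        return first["text"]                      # small model deemed sufficient
    return large.transcribe(audio_path)["text"]   # escalate only when needed
```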
It is challenging to build a multi-singer, high-fidelity singing voice synthesis system with cross-lingual ability using only monolingual singers in the training stage. In this paper, we propose CrossSinger, a cross-lingual singing voice synthesizer based on Xiaoicesing2. Specifically, we utilize the International Phonetic Alphabet to unify the representation of all languages in the training data. Moreover, we leverage conditional layer normalization to incorporate language information into the model for better pronunciation when singers encounter unseen languages. Additionally, a gradient reversal layer (GRL) is utilized to remove singer biases included in the lyrics: since all singers are monolingual, each singer's identity is implicitly associated with the text. The experiments are conducted on a combination of three singing voice datasets: the Japanese Kiritan dataset, the English NUS-48E dataset, and an internal Chinese dataset. The results show that CrossSinger can synthesize high-fidelity songs for various singers with cross-lingual ability, including code-switching cases.
https://arxiv.org/abs/2309.12672
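The gradient reversal layer mentioned above is a standard construct: identity in the forward pass, negated (scaled) gradients in the backward pass, so the lyric encoder is pushed to discard singer identity. A minimal PyTorch version:

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd: float = 1.0):
        ctx.lambd = lambd
        return x.view_as(x)          # identity forward

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None   # reversed, scaled gradient

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)
```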
This paper details our speaker diarization system designed for multi-domain, multi-microphone casual conversations. The proposed diarization pipeline uses weighted prediction error (WPE)-based dereverberation as a front end, then applies end-to-end neural diarization with vector clustering (EEND-VC) to each channel separately. It integrates the diarization results obtained from each channel using diarization output voting error reduction plus overlap (DOVER-LAP). To harness knowledge from the target domain and the results integrated across all channels, we apply self-supervised adaptation for each session by retraining the EEND-VC with pseudo-labels derived from DOVER-LAP. The proposed system was incorporated into NTT's submission for the distant automatic speech recognition task in the CHiME-7 challenge. Our system achieved 65% and 62% relative improvements on the development and evaluation sets compared to the organizer-provided VC-based baseline diarization system, securing third place in diarization performance.
https://arxiv.org/abs/2309.12656
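A high-level sketch of the per-session adaptation loop; `wpe_dereverb`, `eend_vc`, and `dover_lap` are hypothetical handles for the front end, per-channel diarizer, and fusion step named in the abstract, and the single-step loop is an assumption.

```python
def adapt_session(channels, eend_vc, wpe_dereverb, dover_lap, steps=1):
    for _ in range(steps):
        per_channel = [eend_vc(wpe_dereverb(ch)) for ch in channels]   # first pass
        pseudo_labels = dover_lap(per_channel)                          # fuse channels
        eend_vc.retrain(channels, pseudo_labels)                        # self-supervised update
    return dover_lap([eend_vc(wpe_dereverb(ch)) for ch in channels])
```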
Dual-path is a popular architecture for speech separation models (e.g. Sepformer) which splits long sequences into overlapping chunks for its intra- and inter-blocks that separately model intra-chunk local features and inter-chunk global relationships. However, it has been found that inter-blocks, which comprise half a dual-path model's parameters, contribute minimally to performance. Thus, we propose the Single-Path Global Modulation (SPGM) block to replace inter-blocks. SPGM is named after its structure consisting of a parameter-free global pooling module followed by a modulation module comprising only 2% of the model's total parameters. The SPGM block allows all transformer layers in the model to be dedicated to local feature modelling, making the overall model single-path. SPGM achieves 22.1 dB SI-SDRi on WSJ0-2Mix and 20.4 dB SI-SDRi on Libri2Mix, exceeding the performance of Sepformer by 0.5 dB and 0.3 dB respectively and matches the performance of recent SOTA models with up to 8 times fewer parameters.
https://arxiv.org/abs/2309.12608
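One way to picture the SPGM block: a parameter-free global pooling step (mean over time) followed by a tiny modulation layer applied back to the chunk-local features. The gating form below is an assumption for illustration, not the published block.

```python
import torch
import torch.nn as nn

class SPGM(nn.Module):
    """Sketch of Single-Path Global Modulation: parameter-free global pooling
    plus a small learned modulation of the local features."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); mean over time = parameter-free global summary
        g = self.gate(x.mean(dim=1, keepdim=True))   # (batch, 1, dim)
        return x * g                                  # broadcast global modulation
```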
This paper proposes a universal sound separation (USS) method capable of handling untrained sampling frequencies (SFs). USS aims at separating arbitrary sources of different types and can be the key technique for realizing a source separator that can be universally used as a preprocessor for any downstream task. To realize a universal source separator, two essential properties are required: universality with respect to source types and universality with respect to recording conditions. The former has been studied in the USS literature, greatly increasing the number of source types that a single neural network can handle. However, the latter (e.g., the SF) has received less attention despite its necessity. Since the SF varies widely depending on the downstream task, a universal source separator must handle a wide variety of SFs. In this paper, to encompass both properties, we propose an SF-independent (SFI) extension of a computationally efficient USS network, SuDoRM-RF. The proposed network uses our previously proposed SFI convolutional layers, which can handle various SFs by generating convolutional kernels in accordance with the input SF. Experiments show that signal resampling can degrade USS performance and that the proposed method works more consistently than signal-resampling-based methods across various SFs.
https://arxiv.org/abs/2309.12581
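To convey the general idea of SF-dependent kernel generation only, here is an illustrative sketch in which a small hypernetwork maps the input sampling frequency to convolution weights; this is an assumption-laden simplification, not the authors' SFI layer.

```python
import torch
import torch.nn as nn

class SFIConv1d(nn.Module):
    """Illustrative sampling-frequency-conditioned convolution (not the paper's design)."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 16):
        super().__init__()
        self.shape = (out_ch, in_ch, kernel_size)
        self.hyper = nn.Sequential(
            nn.Linear(1, 64), nn.ReLU(),
            nn.Linear(64, out_ch * in_ch * kernel_size),
        )

    def forward(self, x: torch.Tensor, sf_hz: float) -> torch.Tensor:
        # x: (batch, in_ch, samples); kernel is generated from the normalized SF
        sf = torch.tensor([[sf_hz / 48000.0]], dtype=x.dtype, device=x.device)
        kernel = self.hyper(sf).view(self.shape)
        return nn.functional.conv1d(x, kernel, padding=self.shape[-1] // 2)
```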
The ICASSP 2023 Acoustic Echo Cancellation Challenge is intended to stimulate research in acoustic echo cancellation (AEC), which is an important area of speech enhancement and is still a top issue in audio communication. This is the fourth AEC challenge and it is enhanced by adding a second track for personalized acoustic echo cancellation, reducing the algorithmic + buffering latency to 20ms, as well as including a full-band version of AECMOS. We open source two large datasets to train AEC models under both single talk and double talk scenarios. These datasets consist of recordings from more than 10,000 real audio devices and human speakers in real environments, as well as a synthetic dataset. We open source an online subjective test framework and provide an objective metric for researchers to quickly test their results. The winners of this challenge were selected based on the average mean opinion score (MOS) achieved across all scenarios and the word accuracy (WAcc) rate.
https://arxiv.org/abs/2309.12553
Target-Speaker Voice Activity Detection (TS-VAD) utilizes a set of speaker profiles alongside an input audio signal to perform speaker diarization. While its superiority over conventional methods has been demonstrated, the method can suffer from errors in speaker profiles, as those profiles are typically obtained by running a traditional clustering-based diarization method over the input signal. This paper proposes an extension to TS-VAD, called Profile-Error-Tolerant TS-VAD (PET-TSVAD), which is robust to such speaker profile errors. This is achieved by employing transformer-based TS-VAD that can handle a variable number of speakers and further introducing a set of additional pseudo-speaker profiles to handle speakers undetected during the first pass diarization. During training, we use speaker profiles estimated by multiple different clustering algorithms to reduce the mismatch between the training and testing conditions regarding speaker profiles. Experimental results show that PET-TSVAD consistently outperforms the existing TS-VAD method on both the VoxConverse and DIHARD-I datasets.
https://arxiv.org/abs/2309.12521
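The pseudo-speaker idea can be sketched as appending a few learnable profile vectors to the detected speaker profiles before they enter the transformer-based TS-VAD; the dimensions and number of pseudo profiles below are assumptions.

```python
import torch
import torch.nn as nn

class PseudoProfileAugmenter(nn.Module):
    """Sketch: learnable pseudo-profiles standing in for speakers missed by the
    first-pass clustering-based diarization."""
    def __init__(self, profile_dim: int = 256, num_pseudo: int = 3):
        super().__init__()
        self.pseudo = nn.Parameter(torch.randn(num_pseudo, profile_dim) * 0.01)

    def forward(self, profiles: torch.Tensor) -> torch.Tensor:
        # profiles: (num_detected_speakers, profile_dim)
        return torch.cat([profiles, self.pseudo.to(profiles.dtype)], dim=0)
```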
Generating multi-instrument music from symbolic music representations is an important task in Music Information Retrieval (MIR). A central but still largely unsolved problem in this context is musically and acoustically informed control of the generation process. As the main contribution of this work, we propose enhancing control of multi-instrument synthesis by conditioning a generative model on a specific performance and recording environment, thus allowing for better guidance of timbre and style. Building on state-of-the-art diffusion-based music generative models, we introduce performance conditioning - a simple tool that instructs the generative model to synthesize music with the style and timbre of specific instruments taken from specific performances. Our prototype is evaluated on uncurated performances with diverse instrumentation and achieves state-of-the-art FAD realism scores while allowing novel timbre and style control. Our project page, including samples and demonstrations, is available at this http URL
https://arxiv.org/abs/2309.12283
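One simple way to realize such conditioning is to map a performance/recording-environment ID to an embedding that is added to the diffusion model's conditioning signal; the sketch below assumes that mechanism and is not necessarily the authors' exact design.

```python
import torch
import torch.nn as nn

class PerformanceConditioning(nn.Module):
    """Sketch: add a learned performance-ID embedding to the conditioning vector."""
    def __init__(self, num_performances: int, cond_dim: int):
        super().__init__()
        self.table = nn.Embedding(num_performances, cond_dim)

    def forward(self, cond: torch.Tensor, perf_id: torch.Tensor) -> torch.Tensor:
        # cond: (batch, cond_dim), perf_id: (batch,) integer performance indices
        return cond + self.table(perf_id)
```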
This work investigates a case study of using physical-based sonification of Quadratic Unconstrained Binary Optimization (QUBO) problems, optimized by the Variational Quantum Eigensolver (VQE) algorithm. The VQE approximates the solution of the problem by using an iterative loop between the quantum computer and a classical optimization routine. This work explores the intermediary statevectors found in each VQE iteration as the means of sonifying the optimization process itself. The implementation was realised in the form of a musical interface prototype named Variational Quantum Harmonizer (VQH), providing potential design strategies for musical applications, focusing on chords, chord progressions, and arpeggios. The VQH can be used both to enhance data visualization or to create artistic pieces. The methodology is also relevant in terms of how an artist would gain intuition towards achieving a desired musical sound by carefully designing QUBO cost functions. Flexible mapping strategies could supply a broad portfolio of sounds for QUBO and quantum-inspired musical compositions, as demonstrated in a case study composition, "Dependent Origination" by Peter Thomas and Paulo Itaborai.
https://arxiv.org/abs/2309.12254
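One possible mapping in the spirit of the VQH chord-based strategies: basis states whose probability exceeds a threshold become notes of a chord, and the sequence of VQE iterations becomes a chord progression. The scale and threshold below are arbitrary illustrative choices.

```python
import numpy as np

C_MAJOR_MIDI = [60, 62, 64, 65, 67, 69, 71, 72]   # one note per 3-qubit basis state

def statevector_to_chord(statevector, threshold=0.15):
    """Return MIDI notes for basis states with probability >= threshold."""
    probs = np.abs(np.asarray(statevector)) ** 2
    return [C_MAJOR_MIDI[i % len(C_MAJOR_MIDI)]
            for i, p in enumerate(probs) if p >= threshold]

# Playing the chord from each intermediate statevector in order sonifies the
# VQE optimization trajectory.
```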
In recent years, datasets of paired audio and captions have enabled remarkable success in automatically generating descriptions for audio clips, namely Automated Audio Captioning (AAC). However, it is labor-intensive and time-consuming to collect a sufficient number of paired audio and captions. Motivated by the recent advances in Contrastive Language-Audio Pretraining (CLAP), we propose a weakly-supervised approach to train an AAC model assuming only text data and a pre-trained CLAP model, alleviating the need for paired target data. Our approach leverages the similarity between audio and text embeddings in CLAP. During training, we learn to reconstruct the text from the CLAP text embedding, and during inference, we decode using the audio embeddings. To mitigate the modality gap between the audio and text embeddings we employ strategies to bridge the gap during training and inference stages. We evaluate our proposed method on Clotho and AudioCaps datasets demonstrating its ability to achieve a relative performance of up to approximately 83% compared to fully supervised approaches trained with paired target data.
https://arxiv.org/abs/2309.12242
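A compact sketch of the recipe: train a caption decoder to reconstruct text from (noised) CLAP text embeddings, then swap in CLAP audio embeddings at inference. `clap_text_embed`, `clap_audio_embed`, and `decoder` are hypothetical handles, and Gaussian noise injection is only one of several possible gap-bridging strategies.

```python
import torch

def train_step(captions, clap_text_embed, decoder, noise_std=0.1):
    with torch.no_grad():
        z = clap_text_embed(captions)                 # (batch, dim), frozen CLAP encoder
    z = z + noise_std * torch.randn_like(z)           # noise helps bridge the modality gap
    return decoder.loss(z, captions)                  # teacher-forced reconstruction loss

def caption_audio(waveform, clap_audio_embed, decoder):
    with torch.no_grad():
        z = clap_audio_embed(waveform)                # audio embedding replaces text at test time
    return decoder.generate(z)
```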
Presentation attack (spoofing) detection (PAD) typically operates alongside biometric verification to improve reliability in the face of spoofing attacks. Even though the two sub-systems operate in tandem to solve the single task of reliable biometric verification, they address different detection tasks and are hence typically evaluated separately. Evidence shows that this approach is suboptimal. We introduce a new metric for the joint evaluation of PAD solutions operating in situ with biometric verification. In contrast to the tandem detection cost function proposed recently, the new tandem equal error rate (t-EER) is parameter free. The combination of two classifiers nonetheless leads to a set of operating points at which the false alarm and miss rates are equal and also dependent upon the prevalence of attacks. We therefore introduce the concurrent t-EER, a unique operating point which is invariant to the prevalence of attacks. Using both modality (and even application) agnostic simulated scores, as well as real scores from a voice biometrics application, we demonstrate application of the t-EER to a wide range of biometric system evaluations under attack. The proposed approach is a strong candidate metric for the tandem evaluation of PAD systems and biometric comparators.
https://arxiv.org/abs/2309.12237
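As a building block for the discussion above, here is a plain single-system equal error rate computed by sweeping a threshold over scores; the t-EER extends this idea to the PAD-plus-verification tandem, which this sketch does not cover.

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Threshold-sweep EER for one detector (higher score = more target-like)."""
    target_scores = np.asarray(target_scores)
    nontarget_scores = np.asarray(nontarget_scores)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    miss = np.array([(target_scores < t).mean() for t in thresholds])
    fa = np.array([(nontarget_scores >= t).mean() for t in thresholds])
    idx = np.argmin(np.abs(miss - fa))
    return (miss[idx] + fa[idx]) / 2.0
```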
A range of applications of multi-modal music information retrieval is centred around the problem of connecting large collections of sheet music (images) to corresponding audio recordings, that is, identifying pairs of audio and score excerpts that refer to the same musical content. One of the typical and most recent approaches to this task employs cross-modal deep learning architectures to learn joint embedding spaces that link the two distinct modalities - audio and sheet music images. While there has been steady improvement on this front over the past years, a number of open problems still prevent large-scale employment of this methodology. In this article we attempt to provide an insightful examination of the current developments on audio-sheet music retrieval via deep learning methods. We first identify a set of main challenges on the road towards robust and large-scale cross-modal music retrieval in real scenarios. We then highlight the steps we have taken so far to address some of these challenges, documenting step-by-step improvement along several dimensions. We conclude by analysing the remaining challenges and present ideas for solving these, in order to pave the way to a unified and robust methodology for cross-modal music retrieval.
https://arxiv.org/abs/2309.12158
Linking sheet music images to audio recordings remains a key problem for the development of efficient cross-modal music retrieval systems. One of the fundamental approaches toward this task is to learn a cross-modal embedding space via deep neural networks that is able to connect short snippets of audio and sheet music. However, the scarcity of annotated data from real musical content affects the capability of such methods to generalize to real retrieval scenarios. In this work, we investigate whether we can mitigate this limitation with self-supervised contrastive learning, by exposing a network to a large amount of real music data as a pre-training step, by contrasting randomly augmented views of snippets of both modalities, namely audio and sheet images. Through a number of experiments on synthetic and real piano data, we show that pre-trained models are able to retrieve snippets with better precision in all scenarios and pre-training configurations. Encouraged by these results, we employ the snippet embeddings in the higher-level task of cross-modal piece identification and conduct more experiments on several retrieval configurations. In this task, we observe that the retrieval quality improves from 30% up to 100% when real music data is present. We then conclude by arguing for the potential of self-supervised contrastive learning for alleviating the annotated data scarcity in multi-modal music retrieval models.
https://arxiv.org/abs/2309.12134
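The self-supervised pre-training step can be sketched with a standard InfoNCE-style loss over embeddings of randomly augmented audio and sheet-image snippets of the same excerpt; the encoders, augmentations, and temperature are placeholders, not the paper's configuration.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, sheet_emb, temperature=0.07):
    """audio_emb, sheet_emb: (batch, dim); row i of each modality is the same excerpt."""
    a = F.normalize(audio_emb, dim=-1)
    s = F.normalize(sheet_emb, dim=-1)
    logits = a @ s.T / temperature                     # cosine similarity matrix
    labels = torch.arange(a.shape[0], device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```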
Neural network approaches to single-channel speech enhancement have received much recent attention. In particular, mask-based architectures have achieved significant performance improvements over conventional methods. This paper proposes a multiscale autoencoder (MSAE) for mask-based end-to-end neural network speech enhancement. The MSAE performs spectral decomposition of an input waveform within separate band-limited branches, each operating with a different rate and scale, to extract a sequence of multiscale embeddings. The proposed framework features intuitive parameterization of the autoencoder, including a flexible spectral band design based on the Constant-Q transform. Additionally, the MSAE is constructed entirely of differentiable operators, allowing it to be implemented within an end-to-end neural network, and be discriminatively trained. The MSAE draws motivation both from recent multiscale network topologies and from traditional multiresolution transforms in speech processing. Experimental results show the MSAE to provide clear performance benefits relative to conventional single-branch autoencoders. Additionally, the proposed framework is shown to outperform a variety of state-of-the-art enhancement systems, both in terms of objective speech quality metrics and automatic speech recognition accuracy.
https://arxiv.org/abs/2309.12121
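The multiscale idea can be pictured as parallel branches analyzing the waveform with different window sizes and hop lengths, whose outputs form a multiscale embedding; the branch settings below are assumptions and the sketch omits the Constant-Q-based band design.

```python
import torch
import torch.nn as nn

class MultiscaleEncoder(nn.Module):
    """Sketch: parallel learned analysis branches at different rates and scales."""
    def __init__(self, scales=((32, 8), (64, 16), (128, 32)), channels: int = 64):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(1, channels, kernel_size=k, stride=h, padding=k // 2)
            for k, h in scales
        )

    def forward(self, wav: torch.Tensor):
        # wav: (batch, 1, samples) -> list of (batch, channels, frames_at_scale_i)
        return [torch.relu(b(wav)) for b in self.branches]
```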