Abstract
Text-to-audio (TTA) systems have recently gained attention for their ability to synthesize general audio from text descriptions. However, previous TTA studies suffered from limited generation quality and high computational cost. In this study, we propose AudioLDM, a TTA system built on a latent space to learn continuous audio representations from contrastive language-audio pretraining (CLAP) latents. The pretrained CLAP models enable us to train LDMs with audio embeddings while providing text embeddings as the condition during sampling. By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM gains advantages in both generation quality and computational efficiency. Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance as measured by both objective and subjective metrics (e.g., Fréchet distance). Moreover, AudioLDM is the first TTA system to enable various text-guided audio manipulations (e.g., style transfer) in a zero-shot fashion. Our implementation and demos are available at this https URL.
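The key idea above — train the latent diffusion model conditioned on CLAP *audio* embeddings, then swap in the CLAP *text* embedding of a prompt at sampling time — can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the encoders, the linear "denoiser", and all dimensions are placeholders; a real CLAP model maps audio and text into a shared embedding space, which is what makes the swap valid.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(z_noisy, cond, weights):
    # Toy linear "denoiser": predicts noise from the noisy latent plus a
    # condition embedding. The real model is a conditional U-Net.
    return weights @ np.concatenate([z_noisy, cond])

weights = rng.normal(size=(8, 12))  # hypothetical denoiser parameters

# Training: condition on the CLAP audio embedding of the target clip,
# so no text captions are required for the diffusion training itself.
audio_emb = rng.normal(size=4)      # placeholder for a CLAP audio embedding
z = rng.normal(size=8)              # latent representation of the audio
noise = rng.normal(size=8)
pred_noise = denoise_step(z + noise, audio_emb, weights)

# Sampling: condition on the CLAP text embedding of the prompt instead.
# Because CLAP aligns the two modalities, the same denoiser applies.
text_emb = rng.normal(size=4)       # placeholder for a CLAP text embedding
sample = denoise_step(rng.normal(size=8), text_emb, weights)
```

In practice the audio-only conditioning during training is what lets AudioLDM learn from large unlabeled audio corpora, with cross-modal grounding supplied entirely by the frozen CLAP encoders.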
URL
https://arxiv.org/abs/2301.12503