Synthesizing Diverse, High-Quality Audio Textures

Abstract

Texture synthesis techniques based on matching the Gram matrix of feature activations in neural networks have achieved spectacular success in the image domain. In this paper, we extend these techniques to the audio domain. We demonstrate that synthesizing diverse audio textures is challenging, and argue that this is because audio data is relatively low-dimensional. We therefore introduce two new terms to the original Grammian loss: an autocorrelation term that preserves rhythm, and a diversity term that encourages the optimization procedure to synthesize unique textures. We quantitatively study the impact of our design choices on the quality of the synthesized audio by introducing an audio analogue to the Inception loss, which we term the VGGish loss. We show that, with this technique, there is a trade-off between the diversity and the quality of the synthesized audio. We additionally perform a number of experiments to qualitatively study how these design choices impact the quality of the synthesized audio. Finally, we describe the implications of these results for the problem of audio style transfer.
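
To make the loss terms concrete, here is a minimal NumPy sketch of a Gram-matrix loss and one possible autocorrelation term. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not give the exact form of either new term, so the circular FFT-based autocorrelation, the `(channels, time)` feature layout, and all function names below are assumptions. The diversity term is left out because its form cannot be inferred from the abstract.

```python
import numpy as np


def gram_matrix(features):
    """Gram matrix of feature activations.

    features: array of shape (channels, time), an assumed layout.
    Returns a (channels, channels) matrix averaged over time.
    """
    _, steps = features.shape
    return features @ features.T / steps


def gram_loss(target, synth):
    """Sum of squared differences between the two Gram matrices."""
    diff = gram_matrix(target) - gram_matrix(synth)
    return float(np.sum(diff ** 2))


def autocorrelation(features):
    """Circular autocorrelation of each channel along time, computed
    with the FFT (Wiener-Khinchin). An assumed stand-in for the
    paper's autocorrelation term."""
    spectrum = np.fft.rfft(features, axis=1)
    power = (spectrum * np.conj(spectrum)).real
    return np.fft.irfft(power, n=features.shape[1], axis=1)


def autocorr_loss(target, synth):
    """Sum of squared differences between the two autocorrelations."""
    diff = autocorrelation(target) - autocorrelation(synth)
    return float(np.sum(diff ** 2))


# Two toy one-channel "rhythms" with the same energy but different timing.
target = np.array([[1.0, 1.0, 0.0, 0.0]])
synth = np.array([[1.0, 0.0, 1.0, 0.0]])

print(gram_loss(target, synth))      # 0.0: the Gram matrix ignores timing
print(autocorr_loss(target, synth))  # about 6.0: the timing difference is visible
```

The toy example shows why a Gram term alone may not preserve rhythm: the Gram matrix averages over time, so reordering time steps leaves it unchanged, whereas the autocorrelation depends on the spacing between events.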

URL

https://arxiv.org/abs/1806.08002

PDF

https://arxiv.org/pdf/1806.08002.pdf