Synthesizing Diverse, High-Quality Audio Textures

2018-06-20 21:51:32
Joseph Antognini, Matt Hoffman, Ron J. Weiss

Abstract

Texture synthesis techniques based on matching the Gram matrix of feature activations in neural networks have achieved spectacular success in the image domain. In this paper we extend these techniques to the audio domain. We demonstrate that synthesizing diverse audio textures is challenging, and argue that this is because audio data is relatively low-dimensional. We therefore introduce two new terms to the original Grammian loss: an autocorrelation term that preserves rhythm, and a diversity term that encourages the optimization procedure to synthesize unique textures. We quantitatively study the impact of our design choices on the quality of the synthesized audio by introducing an audio analogue to the Inception loss, which we term the VGGish loss. We show that there is a trade-off between the diversity and quality of the synthesized audio using this technique. We additionally perform a number of experiments to qualitatively study how these design choices impact the quality of the synthesized audio. Finally, we describe the implications of these results for the problem of audio style transfer.
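
The Gram-matrix matching that the abstract builds on can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `gram_matrix` and `gram_loss`, the time-averaging, and the normalization by the target's squared Gram norm are all assumptions made for illustration, and the autocorrelation and diversity terms are not shown because the abstract does not define them.

```python
def gram_matrix(features):
    """Time-averaged Gram matrix of feature activations.

    features: list of C channels, each a list of T activations
    (hypothetical layout; a real network would supply these).
    Returns a C x C nested list of channel-by-channel inner products.
    """
    n_steps = len(features[0])
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) / n_steps for row_j in features]
        for row_i in features
    ]


def gram_loss(target, synth):
    """Squared difference between Gram matrices, divided by the target's
    squared Gram norm (the normalization is an assumption of this sketch).
    Raises ZeroDivisionError if the target activations are all zero."""
    g_t = gram_matrix(target)
    g_s = gram_matrix(synth)
    num = sum((a - b) ** 2 for r_t, r_s in zip(g_t, g_s) for a, b in zip(r_t, r_s))
    den = sum(a ** 2 for r_t in g_t for a in r_t)
    return num / den


# Toy activations: 2 channels, 2 time steps.
target = [[1.0, 2.0], [3.0, 4.0]]
# Same activations with the time steps swapped.
shuffled = [[2.0, 1.0], [4.0, 3.0]]

print(gram_matrix(target))
print(gram_loss(target, shuffled))
```

In this sketch the loss between `target` and `shuffled` is zero, because a Gram matrix sums over time and so ignores the order of the time steps. That is consistent with the abstract's motivation for a separate autocorrelation term to preserve rhythm.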

URL

https://arxiv.org/abs/1806.08002

PDF

https://arxiv.org/pdf/1806.08002.pdf

