Paper Reading AI Learner

Continuous descriptor-based control for deep audio synthesis

2023-02-27 06:40:11
Ninon Devis, Nils Demerlé, Sarah Nabi, David Genova, Philippe Esling

Abstract

Despite significant advances in deep models for music generation, the use of these techniques remains restricted to expert users. Before being democratized among musicians, generative models must first provide expressive control over the generation, as this conditions the integration of deep generative models in creative workflows. In this paper, we tackle this issue by introducing a deep generative audio model providing expressive and continuous descriptor-based control, while remaining lightweight enough to be embedded in a hardware synthesizer. We enforce the controllability of real-time generation by explicitly removing salient musical features in the latent space using an adversarial confusion criterion. User-specified features are then reintroduced as additional conditioning information, allowing for continuous control of the generation, akin to a synthesizer knob. We assess the performance of our method on a wide variety of sounds including instrumental, percussive and speech recordings, while providing both timbre and attribute transfer, allowing new ways of generating sounds.
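The core mechanism in the abstract — adversarially confusing the latent so it stops encoding a descriptor, then re-injecting that descriptor as a conditioning "knob" — can be illustrated with a toy, hypothetical sketch. Everything below (the scalar encoder/decoder/discriminator, the weights `w_e`, `w_c`, `w_d1`, `w_d2`, the loss weight `lam`) is an illustrative assumption, not the paper's actual architecture; it only shows the shape of the training signal: the discriminator descends its loss, while the encoder receives that gradient reversed.

```python
import random

random.seed(0)

# Toy 1-D sketch (NOT the paper's model): encoder z = w_e * x,
# discriminator d_hat = w_c * z, decoder x_hat = w_d1 * z + w_d2 * d.
# The descriptor d is re-injected as conditioning, while the encoder is
# trained *against* the discriminator so z stops carrying d
# (the adversarial confusion criterion, via gradient reversal).
xs = [random.uniform(-1.0, 1.0) for _ in range(64)]
ds = xs[:]  # in this toy, the descriptor fully determines the sample

w_e, w_c, w_d1, w_d2 = 0.1, 0.1, 0.1, 0.1
lr, lam = 0.02, 0.05  # lam weights the confusion term in the encoder update

def recon_loss():
    return sum((w_d1 * w_e * x + w_d2 * d - x) ** 2
               for x, d in zip(xs, ds)) / len(xs)

initial = recon_loss()
for _ in range(300):
    g_e = g_c = g_d1 = g_d2 = 0.0
    for x, d in zip(xs, ds):
        z = w_e * x
        x_hat = w_d1 * z + w_d2 * d
        d_hat = w_c * z
        r = 2 * (x_hat - x)   # d(reconstruction loss)/d(x_hat)
        c = 2 * (d_hat - d)   # d(confusion loss)/d(d_hat)
        g_d1 += r * z
        g_d2 += r * d
        g_c += c * z                              # discriminator: minimize
        g_e += r * w_d1 * x - lam * c * w_c * x   # encoder: reversed gradient
    n = len(xs)
    w_d1 -= lr * g_d1 / n
    w_d2 -= lr * g_d2 / n
    w_c  -= lr * g_c / n
    w_e  -= lr * g_e / n

final = recon_loss()
print(round(initial, 4), round(final, 4))
```

Because reconstruction can lean on the conditioning path (`w_d2 * d`), the decoder keeps working even as the latent is pushed to be uninformative about `d` — which is what makes `d` behave like a continuous synthesizer knob at generation time.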

URL

https://arxiv.org/abs/2302.13542

PDF

https://arxiv.org/pdf/2302.13542.pdf

