Abstract
Despite significant advances in deep models for music generation, the use of these techniques remains restricted to expert users. Before they can be democratized among musicians, generative models must first provide expressive control over the generation, as this is a prerequisite for integrating deep generative models into creative workflows. In this paper, we tackle this issue by introducing a deep generative audio model that provides expressive and continuous descriptor-based control while remaining lightweight enough to be embedded in a hardware synthesizer. We enforce the controllability of real-time generation by explicitly removing salient musical features from the latent space using an adversarial confusion criterion. User-specified features are then reintroduced as additional conditioning information, allowing continuous control of the generation, akin to a synthesizer knob. We assess the performance of our method on a wide variety of sounds, including instrumental, percussive, and speech recordings, and demonstrate both timbre and attribute transfer, opening up new ways of generating sounds.
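The core idea described above (strip an attribute from the latent space adversarially, then feed it back as explicit conditioning) can be illustrated with a minimal sketch. All names here are hypothetical and not from the paper; the paper's confusion criterion uses a trained neural discriminator, whereas this simplification uses a linear probe whose recovery error the encoder is pushed to maximize.

```python
import numpy as np

def attribute_confusion_loss(latents, attributes, probe_weights):
    """Adversarial confusion term (sketch): measure how well a linear
    probe recovers the attribute from the latent code, and return its
    negated error. Minimizing this loss (i.e. maximizing probe error)
    pushes the encoder to remove the attribute from the latent space."""
    preds = latents @ probe_weights              # probe's attribute estimate
    probe_error = np.mean((preds - attributes) ** 2)
    return -probe_error                          # encoder maximizes probe error

def decoder_input(latents, attributes):
    """Reintroduce the user-specified attribute as conditioning by
    concatenating it to the (attribute-free) latent code, so it acts
    like a continuous synthesizer knob at generation time."""
    return np.concatenate([latents, attributes[:, None]], axis=1)

# Toy usage with random data (batch of 8, latent dimension 16).
z = np.random.randn(8, 16)
a = np.random.randn(8)        # e.g. a loudness or centroid descriptor
w = np.random.randn(16)
loss = attribute_confusion_loss(z, a, w)
conditioned = decoder_input(z, a)
```

In the full model, the decoder would be trained on `conditioned` inputs while the encoder receives the confusion term as an extra loss, so at inference time the user can vary the attribute independently of the latent code.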
URL
https://arxiv.org/abs/2302.13542