Abstract
Recent approaches to music generation rely on disentangled representations, often labeled as structure and timbre or local and global, to enable controllable synthesis. Yet the underlying properties of these embeddings remain underexplored. In this work, we evaluate such disentangled representations in a set of music audio models for controllable generation, using a probing-based framework that goes beyond standard downstream tasks. The selected models reflect diverse unsupervised disentanglement strategies, including inductive biases, data augmentations, adversarial objectives, and staged training procedures. We further isolate specific strategies to analyze their effect. Our analysis spans four key axes: informativeness, equivariance, invariance, and disentanglement, each assessed across datasets, tasks, and controlled transformations. Our findings reveal inconsistencies between the intended and the actual semantics of the embeddings, suggesting that current strategies fall short of producing truly disentangled representations and prompting a re-examination of how controllability is approached in music generation.
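The abstract does not say which probes or metrics the framework uses, so the following is only a minimal sketch of what probing an embedding can look like, not the authors' method. It assumes a nearest-centroid classifier as the probe (informativeness) and mean cosine similarity under a transformation as an invariance score. The toy "timbre" embedding, the pitch-flip transformation, and all function names are hypothetical and exist only for illustration.

```python
import math
import random


def nearest_centroid_probe(train_x, train_y, test_x, test_y):
    """Fit a nearest-centroid probe and return its accuracy on the test split."""
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}

    def predict(x):
        # Assign the label whose centroid is closest in Euclidean distance.
        return min(centroids, key=lambda y: math.dist(x, centroids[y]))

    correct = sum(predict(x) == y for x, y in zip(test_x, test_y))
    return correct / len(test_y)


def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb)


def invariance_score(emb_original, emb_transformed):
    """Mean cosine similarity between embeddings before and after a transformation."""
    sims = [cosine(a, b) for a, b in zip(emb_original, emb_transformed)]
    return sum(sims) / len(sims)


# Hypothetical toy data: a 2-D "timbre" embedding that is meant to encode
# the instrument only, but whose second dimension leaks pitch information.
rng = random.Random(0)


def toy_timbre_embedding(instrument, pitch_class):
    return [
        (1.0 if instrument else -1.0) + rng.gauss(0, 0.1),
        (1.0 if pitch_class else -1.0) + rng.gauss(0, 0.1),
    ]


labels = [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(200)]
emb = [toy_timbre_embedding(i, p) for i, p in labels]
instrument = [i for i, _ in labels]
pitch = [p for _, p in labels]

split = 150  # first 150 samples train the probe, the rest evaluate it
acc_instrument = nearest_centroid_probe(
    emb[:split], instrument[:split], emb[split:], instrument[split:]
)
acc_pitch = nearest_centroid_probe(
    emb[:split], pitch[:split], emb[split:], pitch[split:]
)

# Assumed effect of a pitch transformation on this toy embedding:
# the leaking dimension flips sign, the instrument dimension is unchanged.
shifted = [[a, -b] for a, b in emb]
timbre_invariance = invariance_score(emb, shifted)

print("instrument probe accuracy:", acc_instrument)
print("pitch probe accuracy:", acc_pitch)
print("invariance to pitch shift:", timbre_invariance)
```

In this toy setup, a high pitch-probe accuracy together with a low invariance score would indicate that the embedding labeled "timbre" also carries pitch information. That is the kind of mismatch between intended and actual semantics the abstract describes, shown here on synthetic data only.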
URL
https://arxiv.org/abs/2602.10058