Paper Reading AI Learner

Evaluating Disentangled Representations for Controllable Music Generation

2026-02-10 18:25:04
Laura Ibáñez-Martínez, Chukwuemeka Nkama, Andrea Poltronieri, Xavier Serra, Martín Rocamora

Abstract

Recent approaches in music generation rely on disentangled representations, often labeled as structure and timbre or local and global, to enable controllable synthesis. Yet the underlying properties of these embeddings remain underexplored. In this work, we evaluate such disentangled representations in a set of music audio models for controllable generation using a probing-based framework that goes beyond standard downstream tasks. The selected models reflect diverse unsupervised disentanglement strategies, including inductive biases, data augmentations, adversarial objectives, and staged training procedures. We further isolate specific strategies to analyze their effect. Our analysis spans four key axes: informativeness, equivariance, invariance, and disentanglement, which are assessed across datasets, tasks, and controlled transformations. Our findings reveal inconsistencies between intended and actual semantics of the embeddings, suggesting that current strategies fall short of producing truly disentangled representations, and prompting a re-examination of how controllability is approached in music generation.
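
As a rough, hedged illustration of what probing such embeddings can look like, the sketch below fits a linear classifier on a nominally "timbre" branch and a nominally "structure" branch to predict an instrument label, a simple proxy for the informativeness axis. The embedding sizes, the synthetic data, the instrument labels, and the use of scikit-learn are assumptions made purely for illustration, not the evaluation protocol of the paper.

# Minimal probing sketch (illustrative only; all data here is synthetic).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Stand-ins for embeddings from a two-branch model: a "structure" branch and a
# "timbre" branch, each 64-dimensional, for 2000 audio excerpts.
n, d = 2000, 64
structure_emb = rng.normal(size=(n, d))
timbre_emb = rng.normal(size=(n, d))

# Stand-in labels: instrument class, a nominally timbre-related attribute.
instrument = rng.integers(0, 10, size=n)

def probe_accuracy(embeddings, labels):
    # Fit a linear probe on the embeddings and report held-out accuracy.
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, probe.predict(X_te))

# With truly disentangled branches, instrument identity should be predictable
# from the timbre branch but not from the structure branch.
print("instrument from timbre branch:   ", probe_accuracy(timbre_emb, instrument))
print("instrument from structure branch:", probe_accuracy(structure_emb, instrument))

In the same spirit, invariance and equivariance can be assessed by comparing embeddings of an excerpt before and after a controlled transformation (e.g., a pitch shift), and disentanglement by contrasting probe scores across branches; the paper evaluates all four axes across datasets, tasks, and controlled transformations.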


URL

https://arxiv.org/abs/2602.10058

PDF

https://arxiv.org/pdf/2602.10058.pdf

