Paper Reading AI Learner

Diff-TONE: Timestep Optimization for iNstrument Editing in Text-to-Music Diffusion Models

2025-06-18 15:01:25
Teysir Baoueb, Xiaoyu Bie, Xi Wang, Gaël Richard

Abstract

Breakthroughs in text-to-music generation models are transforming the creative landscape, equipping musicians with innovative tools for composition and experimentation like never before. However, controlling the generation process to achieve a specific desired outcome remains a significant challenge. Even a minor change in the text prompt, combined with the same random seed, can drastically alter the generated piece. In this paper, we explore the application of existing text-to-music diffusion models for instrument editing. Specifically, for an existing audio track, we aim to leverage a pretrained text-to-music diffusion model to edit the instrument while preserving the underlying content. Based on the insight that the model first focuses on the overall structure or content of the audio, then adds instrument information, and finally refines the quality, we show that selecting a well-chosen intermediate timestep, identified through an instrument classifier, yields a balance between preserving the original piece's content and achieving the desired timbre. Our method does not require additional training of the text-to-music diffusion model, nor does it compromise the generation process's speed.
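The timestep-selection idea in the abstract can be sketched as a simple search: re-noise the source audio to a candidate intermediate timestep t, denoise it with the target-instrument prompt, and let an instrument classifier score the result; the chosen t is the one the classifier rates highest. The sketch below is a minimal illustration of that selection loop only, assuming hypothetical stand-in functions (`edit_fn`, `classifier_score`) in place of the actual diffusion model and classifier, which are not specified in this abstract.

```python
def select_timestep(candidate_timesteps, edit_fn, classifier_score):
    """Pick the timestep whose edited output the instrument classifier
    scores highest (hypothetical sketch of the selection criterion)."""
    best_t, best_score = None, float("-inf")
    for t in candidate_timesteps:
        edited = edit_fn(t)               # stand-in: noise to t, denoise with new prompt
        score = classifier_score(edited)  # stand-in: target-instrument confidence
        if score > best_score:
            best_t, best_score = t, score
    return best_t


if __name__ == "__main__":
    # Toy demo mirroring the paper's insight: very small t changes too little
    # (timbre unchanged), very large t destroys the content, so the classifier
    # plausibly peaks at an intermediate timestep. Here we mock that peak at 500.
    timesteps = [100, 300, 500, 700, 900]
    mock_edit = lambda t: t                # stand-in for the edited audio
    mock_score = lambda x: -abs(x - 500)   # mock classifier peaking at t = 500
    print(select_timestep(timesteps, mock_edit, mock_score))  # -> 500
```

Because the search only reuses forward passes of a pretrained model plus a classifier, it adds no training and, as the abstract notes, does not slow the generation process itself.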

URL

https://arxiv.org/abs/2506.15530

PDF

https://arxiv.org/pdf/2506.15530.pdf

