
D3RM: A Discrete Denoising Diffusion Refinement Model for Piano Transcription

2025-01-09 08:44:06
Hounsu Kim, Taegyun Kwon, Juhan Nam

Abstract

Diffusion models have been widely used in the generative domain due to their convincing performance in modeling complex data distributions. Moreover, they have shown competitive results on discriminative tasks, such as image segmentation. While diffusion models have also been explored for automatic music transcription, their performance has yet to reach a competitive level. In this paper, we focus on the refinement capabilities of discrete diffusion models and present a novel architecture for piano transcription. Our model utilizes Neighborhood Attention layers as the denoising module, gradually predicting the target high-resolution piano roll, conditioned on the finetuned features of a pretrained acoustic model. To further enhance refinement, we devise a novel strategy that applies distinct transition states during the training and inference stages of discrete diffusion models. Experiments on the MAESTRO dataset show that our approach outperforms previous diffusion-based piano transcription models and the baseline model in terms of F1 score. Our code is available at this https URL.
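To make the idea of discrete diffusion on a piano roll more concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes an absorbing-state (mask) corruption process, a linear schedule, and a hypothetical set of cell states (`OFF`, `ONSET`, `SUSTAIN`, `MASK`); none of these are specified in the abstract. The `predict` argument stands in for the denoising module, and the paper's strategy of using distinct transition states for training and inference is not reproduced here.

```python
import random

# Hypothetical state labels for one piano-roll cell (frame x pitch).
# The abstract does not specify the actual state set; this is an assumption.
OFF, ONSET, SUSTAIN, MASK = 0, 1, 2, 3


def keep_probability(t, num_steps):
    """Assumed linear schedule: chance that a cell is still clean at step t."""
    return 1.0 - t / num_steps


def corrupt(roll, t, num_steps, rng):
    """Forward process: each cell is independently replaced by MASK."""
    keep = keep_probability(t, num_steps)
    return [[cell if rng.random() < keep else MASK for cell in frame]
            for frame in roll]


def refine(noisy, predict, num_steps, rng):
    """Reverse process: step by step, fill masked cells with predictions.

    `predict(current, t)` is a placeholder for a learned denoiser and must
    return a roll of the same shape with a state for every cell.
    """
    current = [list(frame) for frame in noisy]
    for t in range(num_steps, 0, -1):
        predicted = predict(current, t)
        # Reveal a growing fraction of the remaining masked cells so that
        # every cell is revealed by the final step (t == 1).
        reveal = 1.0 / t
        for i, frame in enumerate(current):
            for j, cell in enumerate(frame):
                if cell == MASK and rng.random() < reveal:
                    frame[j] = predicted[i][j]
    return current


if __name__ == "__main__":
    rng = random.Random(0)
    target = [[OFF, ONSET, SUSTAIN], [OFF, OFF, SUSTAIN]]
    fully_masked = corrupt(target, 10, 10, rng)
    # An oracle denoiser that always returns the target, for illustration only.
    restored = refine(fully_masked, lambda cur, t: target, 10, rng)
    print(restored == target)
```

In a real system the oracle would be replaced by a network conditioned on acoustic features, and cells already revealed could be revisited in later steps, which is where refinement would come in.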

URL

https://arxiv.org/abs/2501.05068

PDF

https://arxiv.org/pdf/2501.05068.pdf

