Abstract
Diffusion models have been widely used in the generative domain due to their convincing performance in modeling complex data distributions. Moreover, they have shown competitive results on discriminative tasks, such as image segmentation. While diffusion models have also been explored for automatic music transcription, their performance has yet to reach a competitive level. In this paper, we focus on the refinement capabilities of discrete diffusion models and present a novel architecture for piano transcription. Our model utilizes Neighborhood Attention layers as the denoising module, gradually predicting the target high-resolution piano roll, conditioned on the fine-tuned features of a pretrained acoustic model. To further enhance refinement, we devise a novel strategy that applies distinct transition states during the training and inference stages of discrete diffusion models. Experiments on the MAESTRO dataset show that our approach outperforms previous diffusion-based piano transcription models and the baseline model in terms of F1 score. Our code is available at this https URL.
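The abstract does not specify the paper's transition schedule, so as a rough illustration, here is a minimal sketch of the forward (noising) process of a generic discrete diffusion model applied to a binary piano roll, using an assumed uniform transition matrix; the function names and the 0.05 noise rate are hypothetical, not taken from the paper:

```python
import numpy as np

def uniform_transition_matrix(beta: float, num_classes: int = 2) -> np.ndarray:
    """One-step transition Q: keep the state with prob. 1-beta,
    otherwise resample uniformly over the classes (rows sum to 1)."""
    Q = np.full((num_classes, num_classes), beta / num_classes)
    Q += (1.0 - beta) * np.eye(num_classes)
    return Q

def forward_noise(roll: np.ndarray, Q_bar: np.ndarray,
                  rng: np.random.Generator) -> np.ndarray:
    """Sample x_t ~ Cat(Q_bar[x_0]) independently for each cell
    of a {0,1} piano roll (time x pitch)."""
    probs = Q_bar[roll]                      # (time, pitch, 2): per-cell class probs
    u = rng.random(roll.shape)
    return (u < probs[..., 1]).astype(np.int64)

rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, size=(128, 88))      # 128 frames x 88 piano keys
Q = uniform_transition_matrix(beta=0.05)
Q_bar = np.linalg.matrix_power(Q, 50)        # cumulative transition over 50 steps
xt = forward_noise(x0, Q_bar, rng)           # noised roll the denoiser must refine
```

A denoising network (in the paper, Neighborhood Attention layers conditioned on acoustic-model features) would then be trained to invert this corruption step by step; the paper's contribution of distinct train/inference transition states would modify how `Q_bar` is chosen in each phase.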
Abstract (translated)
Diffusion models have been widely applied in the generative domain owing to their convincing performance in modeling complex data distributions. They have also shown competitive results on discriminative tasks such as image segmentation. Although diffusion models have been explored for automatic music transcription, their performance has not yet reached a competitive level. In this paper, we focus on the refinement capabilities of discrete diffusion models and propose a novel architecture for piano transcription. Our model uses Neighborhood Attention layers as the denoising module, gradually predicting the target high-resolution piano roll, conditioned on the fine-tuned features of a pretrained acoustic model. To further enhance refinement, we design a new strategy that applies distinct transition states during the training and inference stages of discrete diffusion models. Experiments on the MAESTRO dataset show that our method outperforms previous diffusion-based piano transcription models and the baseline model in terms of F1 score. Our code is available at this https URL.
URL
https://arxiv.org/abs/2501.05068