Abstract
In recent years, advances in neural network design and the availability of large-scale labeled datasets have led to significant improvements in the accuracy of piano transcription models. However, most previous work has focused on high-performance offline transcription, without deliberate consideration of model size. The goal of this work is to implement real-time inference for piano transcription while ensuring both high performance and a lightweight design. To this end, we propose novel convolutional recurrent neural network architectures, redesigning an existing autoregressive piano transcription model. First, we extend the acoustic module by adding a frequency-conditioned FiLM layer to the CNN module to adapt the convolutional filters along the frequency axis. Second, we improve note-state sequence modeling with a pitchwise LSTM that focuses on note-state transitions within a note. In addition, we augment the autoregressive connection with an enhanced recursive context. Using these components, we propose two types of models: one for high performance and the other for high compactness. Through extensive experiments, we show that the proposed models are comparable to state-of-the-art models in terms of note accuracy on the MAESTRO dataset. We also investigate the effective model size and real-time inference latency by gradually streamlining the architecture. Finally, we conduct cross-dataset evaluation on unseen piano datasets and an in-depth analysis to elucidate the effect of the proposed components with respect to note length and pitch range.
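To make the frequency-conditioned FiLM idea concrete, the following is a minimal sketch in plain Python, not the paper's implementation: FiLM (feature-wise linear modulation) applies a per-condition affine transform `gamma * x + beta`; conditioning on the frequency bin gives each bin its own scale and shift, so filters shared across frequency are adapted along that axis. All names and the toy shapes here are illustrative assumptions; in the actual model `gamma` and `beta` would be learned from a frequency embedding inside the CNN.

```python
def frequency_film(x, gamma, beta):
    """FiLM modulation conditioned on the frequency axis (illustrative sketch).

    x:     feature map as nested lists, shape (channels, freq_bins, time_frames)
    gamma: per-(channel, freq_bin) scale, shape (channels, freq_bins)
    beta:  per-(channel, freq_bin) shift, shape (channels, freq_bins)

    Each frequency bin gets its own affine transform, so features produced by
    frequency-shared convolutional filters are adapted per frequency bin.
    """
    return [
        [
            [gamma[c][f] * v + beta[c][f] for v in x[c][f]]
            for f in range(len(x[c]))
        ]
        for c in range(len(x))
    ]

# Toy example: 1 channel, 3 frequency bins, 2 time frames of all-ones features.
x = [[[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]]
gamma = [[0.5, 1.0, 2.0]]   # frequency-dependent scales
beta = [[0.0, 0.1, 0.2]]    # frequency-dependent shifts
y = frequency_film(x, gamma, beta)
# each frequency bin of the constant input is now scaled and shifted differently
```

The same per-bin affine form is what lets a small shared CNN behave differently in low and high pitch ranges without duplicating its filters.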
URL
https://arxiv.org/abs/2404.06818