Abstract
The neural semi-Markov Conditional Random Field (semi-CRF) framework has demonstrated promise for event-based piano transcription. In this framework, all events (notes or pedals) are represented as closed intervals tied to specific event types. The neural semi-CRF approach requires an interval scoring matrix that assigns a score to every candidate interval. However, designing an efficient and expressive architecture for scoring intervals is not trivial. In this paper, we introduce a simple method for scoring intervals using scaled inner product operations that resemble attention scoring in transformers. We show theoretically that, owing to the special structure that arises from encoding non-overlapping intervals, the inner product operations are, under a mild condition, expressive enough to represent an ideal scoring matrix that yields the correct transcription result. We then demonstrate that an encoder-only, non-hierarchical transformer backbone, operating only on a low-time-resolution feature map, is capable of transcribing piano notes and pedals with high accuracy and time precision. Experiments show that our approach achieves new state-of-the-art performance across all subtasks, in terms of the F1 measure, on the Maestro dataset.
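To make the scoring idea concrete, here is a minimal sketch in plain Python, not the authors' implementation. It assumes each time step has a hypothetical "query" vector (used when the step is an interval start) and a "key" vector (used when it is an interval end), and scores the closed interval [i, j] as their inner product scaled by the square root of the dimension, as in attention. The decoder is a simplified stand-in for semi-CRF inference: it assumes a single event type, intervals that share no frames, and zero score for frames left uncovered. The function names interval_scores and decode are illustrative only.

```python
import math


def interval_scores(queries, keys):
    """Score every closed interval [i, j] with i <= j as <q_i, k_j> / sqrt(d).

    queries[i] is the vector for frame i acting as an interval start,
    keys[j] is the vector for frame j acting as an interval end.
    Entries with i > j are invalid intervals and are set to -inf.
    """
    num_frames = len(queries)
    dim = len(queries[0])
    scores = [[float("-inf")] * num_frames for _ in range(num_frames)]
    for i in range(num_frames):
        for j in range(i, num_frames):
            dot = sum(a * b for a, b in zip(queries[i], keys[j]))
            scores[i][j] = dot / math.sqrt(dim)
    return scores


def decode(scores):
    """Return the highest-scoring set of frame-disjoint closed intervals.

    Simplifying assumption: frames not covered by any interval contribute 0.
    best[t] holds the best total score using only frames 0..t-1.
    """
    num_frames = len(scores)
    best = [0.0] * (num_frames + 1)
    back = [None] * (num_frames + 1)
    for t in range(1, num_frames + 1):
        # Option 1: no interval ends at frame t-1.
        best[t], back[t] = best[t - 1], None
        # Option 2: an interval [i, t-1] ends at frame t-1.
        for i in range(t):
            candidate = best[i] + scores[i][t - 1]
            if candidate > best[t]:
                best[t], back[t] = candidate, i
    # Walk the back-pointers to recover the chosen intervals.
    intervals = []
    t = num_frames
    while t > 0:
        if back[t] is None:
            t -= 1
        else:
            intervals.append((back[t], t - 1))
            t = back[t]
    return intervals[::-1]
```

In this sketch the full score matrix costs one inner product per candidate interval, which is why a matrix product between start and end embeddings is an attractive way to fill it; a real system would compute it batched per event type on the transformer's output features.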
URL
https://arxiv.org/abs/2404.09466