Scoring Intervals using Non-hierarchical Transformer For Automatic Piano Transcription

2024-04-15 05:35:09
Yujia Yan, Zhiyao Duan

Abstract

The neural semi-Markov Conditional Random Field (semi-CRF) framework has demonstrated promise for event-based piano transcription. In this framework, all events (notes or pedals) are represented as closed intervals tied to specific event types. The neural semi-CRF approach requires an interval scoring matrix that assigns a score for every candidate interval. However, designing an efficient and expressive architecture for scoring intervals is not trivial. In this paper, we introduce a simple method for scoring intervals using scaled inner product operations that resemble how attention scoring is done in transformers. We show theoretically that, due to the special structure from encoding the non-overlapping intervals, under a mild condition, the inner product operations are expressive enough to represent an ideal scoring matrix that can yield the correct transcription result. We then demonstrate that an encoder-only non-hierarchical transformer backbone, operating only on a low-time-resolution feature map, is capable of transcribing piano notes and pedals with high accuracy and time precision. The experiment shows that our approach achieves the new state-of-the-art performance across all subtasks in terms of the F1 measure on the Maestro dataset.
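
The interval-scoring idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single event type, one query vector and one key vector per frame (names `queries`, `keys`, `score_intervals`, and `decode_intervals` are hypothetical), strictly disjoint intervals, and a score of zero for frames left uncovered. Under those assumptions, the score of a candidate closed interval [i, j] is the scaled inner product of the query at frame i and the key at frame j, and the best set of non-overlapping intervals is found by dynamic programming.

```python
import math


def score_intervals(queries, keys):
    """Build a T x T matrix S with S[i][j] = <q_i, k_j> / sqrt(d) for i <= j.

    Entries with i > j are not valid intervals and are set to -inf.
    Assumption: one query and one key vector per frame, one event type.
    """
    T = len(queries)
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    S = [[-math.inf] * T for _ in range(T)]
    for i in range(T):
        for j in range(i, T):
            S[i][j] = scale * sum(q * k for q, k in zip(queries[i], keys[j]))
    return S


def decode_intervals(S):
    """Pick disjoint closed intervals [i, j] that maximize the total score.

    Assumption: frames not covered by any interval contribute a score of 0.
    best[t] is the best total score achievable using frames 0 .. t-1.
    """
    T = len(S)
    best = [0.0] * (T + 1)
    back = [None] * (T + 1)  # start frame of the interval ending at t-1, if any
    for t in range(1, T + 1):
        best[t] = best[t - 1]  # option: leave frame t-1 uncovered
        j = t - 1
        for i in range(t):  # option: an interval [i, j] ends at frame j
            candidate = best[i] + S[i][j]
            if candidate > best[t]:
                best[t] = candidate
                back[t] = i
    # Walk the back-pointers to recover the chosen intervals.
    intervals = []
    t = T
    while t > 0:
        if back[t] is None:
            t -= 1
        else:
            intervals.append((back[t], t - 1))
            t = back[t]
    intervals.reverse()
    return intervals, best[T]


# Toy input (made up for illustration): with queries equal to 2 * identity
# and d = 4, the scale factor is 1/2, so S[i][j] equals keys[j][i].
queries = [[2 if a == b else 0 for b in range(4)] for a in range(4)]
keys = [[-1] * 4 for _ in range(4)]
keys[1][0] = 3  # favors interval [0, 1]
keys[3][2] = 2  # favors interval [2, 3]

S = score_intervals(queries, keys)
intervals, total = decode_intervals(S)
```

The scoring step costs O(T² d) and the decoding step O(T²) for T frames. The abstract's claim that inner products are expressive enough for an ideal scoring matrix concerns the scoring step only; the decoding shown here is a generic dynamic program chosen for brevity, not the paper's semi-CRF inference.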

URL

https://arxiv.org/abs/2404.09466

PDF

https://arxiv.org/pdf/2404.09466.pdf

