Paper Reading AI Learner

Towards Efficient and Real-Time Piano Transcription Using Neural Autoregressive Models

2024-04-10 08:06:15
Taegyun Kwon, Dasaem Jeong, Juhan Nam

Abstract

In recent years, advancements in neural network designs and the availability of large-scale labeled datasets have led to significant improvements in the accuracy of piano transcription models. However, most previous work has focused on high-performance offline transcription, paying little attention to model size. The goal of this work is to enable real-time inference for piano transcription while keeping the model both accurate and lightweight. To this end, we propose novel architectures for convolutional recurrent neural networks, redesigning an existing autoregressive piano transcription model. First, we extend the acoustic module by adding a frequency-conditioned FiLM layer to the CNN module to adapt the convolutional filters along the frequency axis. Second, we improve note-state sequence modeling by using a pitchwise LSTM that focuses on note-state transitions within a note. In addition, we augment the autoregressive connection with an enhanced recursive context. Using these components, we propose two types of models: one optimized for performance and the other for compactness. Through extensive experiments, we show that the proposed models are comparable to state-of-the-art models in terms of note accuracy on the MAESTRO dataset. We also investigate the effective model size and real-time inference latency by gradually streamlining the architecture. Finally, we conduct cross-data evaluation on unseen piano datasets and an in-depth analysis to elucidate the effect of the proposed components with respect to note length and pitch range.
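To illustrate the idea of frequency-conditioned FiLM described above, here is a minimal sketch (hypothetical, not the authors' exact implementation): a FiLM layer applies an affine transform `gamma * x + beta` to feature maps, and conditioning on frequency means each frequency bin gets its own scale and shift, letting shared convolutional filters behave differently across the frequency axis.

```python
# Hypothetical sketch of a frequency-conditioned FiLM operation.
# `features` is a CNN feature map laid out as [frequency_bin][channel];
# gamma[f] and beta[f] are the per-bin affine parameters (in the paper
# they would be learned; here they are fixed for illustration).
def film_frequency(features, gamma, beta):
    return [
        [gamma[f] * x + beta[f] for x in channel_values]
        for f, channel_values in enumerate(features)
    ]

feats = [[1.0, 2.0], [3.0, 4.0]]   # 2 frequency bins, 2 channels each
gamma = [2.0, 0.5]                 # per-bin scale
beta = [0.0, 1.0]                  # per-bin shift
out = film_frequency(feats, gamma, beta)
```

Each row of `out` is modulated by its own bin's parameters: the first bin is scaled by 2.0, the second halved and shifted by 1.0, so identical convolutional features can be adapted per frequency region.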

URL

https://arxiv.org/abs/2404.06818

PDF

https://arxiv.org/pdf/2404.06818.pdf
