An Comparative Analysis of Different Pitch and Metrical Grid Encoding Methods in the Task of Sequential Music Generation

2023-01-31 03:19:50
Yuqiang Li, Shengchen Li, George Fazekas

Abstract

Pitch and meter are two fundamental musical features for symbolic music generation tasks, where researchers usually choose different encoding methods depending on their specific goals. However, the advantages and drawbacks of different encoding methods have rarely been discussed. This paper presents an integrated analysis of the influence of two low-level features, pitch and meter, on the performance of a token-based sequential music generation model. First, the commonly used MIDI number encoding is compared with a less commonly used class-octave encoding. Second, a dense intra-bar metric grid is imposed on the encoded sequence as auxiliary features, and grids of different complexity and resolution are compared. For complexity, the single-token approach and the multiple-token approach are compared; for grid resolution, 0 (ablation), 1 (bar-level), 4 (downbeat-level), 12 (8th-triplet-level), up to 64 (64th-note-grid-level) are compared; for duration resolution, 4, 8, 12 and 16 subdivisions per beat are compared. All encodings are tested on separately trained Transformer-XL models for a melody generation task. In terms of how closely the distributions of several objective evaluation metrics match those of the test dataset, the results suggest that the class-octave encoding significantly outperforms the taken-for-granted MIDI encoding on pitch-related metrics; finer grids and multiple-token grids improve rhythmic quality, but also suffer from over-fitting at an early training stage. The results show a general phenomenon of over-fitting in two respects: the pitch embedding space and the test loss of the single-token grid encoding. From a practical perspective, we demonstrate the feasibility of using smaller networks and lower embedding dimensions on the generation task, while also raising the concern that such models over-fit easily. The findings can also contribute to future models in terms of feature engineering.
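To make the two pitch encodings and the metric grid concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the token names (`Pitch_`, `Class_`, `Octave_`, `Grid_`), the function names, the octave convention (MIDI 60 = C4), and the reading of "grid resolution" as the number of grid points per bar are all hypothetical choices made for this example.

```python
from fractions import Fraction
from typing import List, Optional


def midi_token(pitch: int) -> List[str]:
    """MIDI number encoding: one token per distinct pitch (0..127)."""
    if not 0 <= pitch <= 127:
        raise ValueError("MIDI pitch must be in 0..127")
    return [f"Pitch_{pitch}"]


def class_octave_tokens(pitch: int) -> List[str]:
    """Class-octave encoding: split a pitch into pitch class and octave.

    Assumption: octave numbering follows the convention MIDI 60 = C4,
    so octave = pitch // 12 - 1.
    """
    if not 0 <= pitch <= 127:
        raise ValueError("MIDI pitch must be in 0..127")
    pitch_class = pitch % 12      # 0 = C, 1 = C#, ..., 11 = B
    octave = pitch // 12 - 1      # -1 .. 9 under the assumed convention
    return [f"Class_{pitch_class}", f"Octave_{octave}"]


def grid_token(position_in_bar: Fraction, resolution: int) -> Optional[str]:
    """Map a position inside a bar (0 <= position < 1) to a grid token.

    Assumption: `resolution` is the number of equally spaced grid points
    per bar; resolution 0 means no grid token (the ablation case).
    """
    if resolution == 0:
        return None
    if not 0 <= position_in_bar < 1:
        raise ValueError("position_in_bar must be in [0, 1)")
    index = int(position_in_bar * resolution)  # floor for non-negative values
    return f"Grid_{index}"


if __name__ == "__main__":
    print(midi_token(60))                   # single token for middle C
    print(class_octave_tokens(60))          # pitch class and octave tokens
    print(grid_token(Fraction(1, 2), 4))    # halfway through the bar, 4 points
    print(grid_token(Fraction(1, 2), 0))    # ablation: no grid token
```

Under these assumptions the MIDI encoding needs 128 pitch tokens, while the class-octave split needs 12 class tokens plus 11 octave tokens, at the cost of emitting two tokens per note.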

URL

https://arxiv.org/abs/2301.13383

PDF

https://arxiv.org/pdf/2301.13383.pdf

