Abstract
Pitch and meter are two fundamental music features in symbolic music generation tasks, where researchers usually choose different encoding methods depending on their specific goals. However, the advantages and drawbacks of different encoding methods have rarely been discussed. This paper presents an integrated analysis of the influence of two low-level features, pitch and meter, on the performance of a token-based sequential music generation model. First, the commonly used MIDI-number encoding is compared with a less common class-octave encoding. Second, a dense intra-bar metric grid is imposed on the encoded sequence as auxiliary features, and metric grids of different complexities and resolutions are compared. For complexity, a single-token approach is compared with a multiple-token approach; for grid resolution, values of 0 (ablation), 1 (bar level), 4 (downbeat level), 12 (8th-triplet level), up to 64 (64th-note level) are compared; for duration resolution, 4, 8, 12, and 16 subdivisions per beat are compared. All encodings are tested on separately trained Transformer-XL models on a melody generation task. Measuring the distribution similarity of several objective evaluation metrics against the test dataset, the results suggest that the class-octave encoding significantly outperforms the commonly assumed MIDI encoding on pitch-related metrics, and that finer grids and multiple-token grids improve rhythmic quality but also suffer from over-fitting at an early training stage. The results exhibit a general over-fitting phenomenon in two respects: the pitch embedding space and the test loss of the single-token grid encoding. From a practical perspective, we both demonstrate the feasibility of using smaller networks and lower embedding dimensions for the generation task and raise concerns about their susceptibility to over-fitting. The findings can also inform feature engineering for future models.
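The two pitch encodings and the intra-bar metric grid can be illustrated with a minimal sketch. This is not the paper's implementation; the token names (`Pitch_*`, `Class_*`, `Octave_*`, `Grid_*`) are hypothetical placeholders chosen for illustration, assuming the standard MIDI convention that note 60 is C4:

```python
# Sketch of the encodings compared in the paper (assumed token vocabulary).

def midi_encoding(midi_pitch: int) -> list[str]:
    """MIDI-number encoding: one token carrying the raw MIDI note (0-127)."""
    return [f"Pitch_{midi_pitch}"]

def class_octave_encoding(midi_pitch: int) -> list[str]:
    """Class-octave encoding: separate pitch-class (0-11) and octave tokens."""
    pitch_class = midi_pitch % 12
    octave = midi_pitch // 12 - 1  # MIDI convention: note 60 -> C4
    return [f"Class_{pitch_class}", f"Octave_{octave}"]

def metric_grid_token(position_in_bar: float, resolution: int) -> str:
    """Quantize a bar-relative onset (0 <= position_in_bar < 1) onto a grid
    with `resolution` slots per bar, e.g. 1 (bar level), 4 (downbeat level),
    up to 64 (64th-note level)."""
    slot = int(position_in_bar * resolution)
    return f"Grid_{slot}"

# Middle C at the second downbeat of a bar, with a downbeat-level grid:
assert midi_encoding(60) == ["Pitch_60"]
assert class_octave_encoding(60) == ["Class_0", "Octave_4"]
assert metric_grid_token(0.25, resolution=4) == "Grid_1"
```

Under this framing, the paper's single-token vs. multiple-token comparison amounts to fusing the grid information into one token per event versus emitting it as an additional token, and grid resolution controls how many `Grid_*` values exist.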
URL
https://arxiv.org/abs/2301.13383