Aria-MIDI: A Dataset of Piano MIDI Files for Symbolic Music Modeling

Abstract

We introduce an extensive new dataset of MIDI files, created by transcribing audio recordings of piano performances into their constituent notes. The data pipeline we use is multi-stage, employing a language model to autonomously crawl and score audio recordings from the internet based on their metadata, followed by a stage of pruning and segmentation using an audio classifier. The resulting dataset contains over one million distinct MIDI files, comprising roughly 100,000 hours of transcribed audio. We provide an in-depth analysis of our techniques, offering statistical insights, and investigate the content by extracting metadata tags, which we also provide. Dataset available at this https URL.
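The two filtering stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Recording` type, the `keyword_scorer` stand-in for the language model, the per-frame classifier scores, and all thresholds are hypothetical and chosen only to show the shape of a metadata-scoring step followed by classifier-based pruning and segmentation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Recording:
    """Hypothetical record for one crawled audio file."""
    title: str
    description: str
    # Assumed per-frame "solo piano" probabilities from an audio classifier.
    frame_scores: List[float]


def keyword_scorer(rec: Recording) -> float:
    """Toy stand-in for the language-model metadata score (assumption)."""
    text = f"{rec.title} {rec.description}".lower()
    return 1.0 if "piano" in text else 0.0


def select_candidates(
    recordings: List[Recording],
    scorer: Callable[[Recording], float],
    min_score: float,
) -> List[Recording]:
    """Stage 1: keep recordings whose metadata score reaches min_score."""
    return [rec for rec in recordings if scorer(rec) >= min_score]


def piano_segments(
    frame_scores: List[float], threshold: float, min_frames: int
) -> List[Tuple[int, int]]:
    """Stage 2: return half-open (start, end) frame ranges where the
    classifier score stays at or above threshold for at least min_frames."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, score in enumerate(frame_scores):
        if score >= threshold:
            if start is None:
                start = i  # a new candidate segment begins here
        else:
            if start is not None and i - start >= min_frames:
                segments.append((start, i))
            start = None  # drop runs that are too short
    # Close a segment that runs to the end of the recording.
    if start is not None and len(frame_scores) - start >= min_frames:
        segments.append((start, len(frame_scores)))
    return segments


recordings = [
    Recording("Chopin Nocturne (piano)", "live recital", [0.9, 0.9, 0.1, 0.8, 0.8, 0.8]),
    Recording("Drum solo", "rock concert", [0.1, 0.2, 0.1]),
]
kept = select_candidates(recordings, keyword_scorer, min_score=0.5)
segments = piano_segments(kept[0].frame_scores, threshold=0.5, min_frames=2)
```

In this toy run only the first recording passes the metadata stage, and its frame scores are split into two segments around the low-scoring frame. In a real pipeline each retained segment would then be transcribed to MIDI.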

URL

https://arxiv.org/abs/2504.15071

PDF

https://arxiv.org/pdf/2504.15071.pdf
