Abstract
We introduce an extensive new dataset of MIDI files, created by transcribing audio recordings of piano performances into their constituent notes. Our data pipeline is multi-stage: a language model autonomously crawls and scores audio recordings from the internet based on their metadata, and an audio classifier then prunes and segments them. The resulting dataset contains over one million distinct MIDI files, comprising roughly 100,000 hours of transcribed audio. We provide an in-depth analysis of our techniques with statistical insights, and we investigate the content by extracting metadata tags, which we also provide. Dataset available at this https URL.
URL
https://arxiv.org/abs/2504.15071