Abstract
We present Sleeping-DISCO 9M, a large-scale pre-training dataset for music and song. To the best of our knowledge, there is no open-source, high-quality dataset representing popular and well-known songs for generative music modeling tasks such as text-to-music, music captioning, singing-voice synthesis, melody reconstruction and cross-modal retrieval. Past contributions focused on isolated and constrained factors, with the core aim of creating synthetic or re-recorded music corpora (e.g. GTSinger, M4Singer), while arbitrarily large-scale audio datasets (e.g. DISCO-10M and LAIONDISCO-12M) have been another focus for the community. Unfortunately, adoption of these datasets in the generative music community has been limited, as they fail to reflect real-world music and its flavour. Our dataset changes this narrative: it is constructed from actual popular music by world-renowned artists.
URL
https://arxiv.org/abs/2506.14293