Abstract
The limited diversity of standardized benchmarks for evaluating audio representation learning (ARL) methods may hinder systematic comparison of current methods' capabilities. We present ARCH, a comprehensive benchmark for evaluating ARL methods on diverse audio classification domains, covering acoustic events, music, and speech. ARCH comprises 12 datasets that allow us to thoroughly assess pre-trained SSL models of different sizes. ARCH streamlines the benchmarking of ARL techniques through its unified access to a wide range of domains and its ability to readily incorporate new datasets and models. To address the current lack of open-source, pre-trained models for non-speech audio, we also release new pre-trained models that demonstrate strong performance on non-speech datasets. We argue that the presented wide-ranging evaluation provides valuable insights into state-of-the-art ARL methods, and is useful for pinpointing promising research directions.
URL
https://arxiv.org/abs/2405.00934