Abstract
Chord recognition is a central task in music information retrieval, owing to the abstract and descriptive role chords play in music analysis. While audio chord recognition systems have achieved significant accuracy for small vocabularies (e.g., major/minor chords), large-vocabulary chord recognition remains a challenging problem. This difficulty stems in part from the inherent long-tail distribution of chords, where rare chord types are underrepresented in most datasets, leaving too few training samples. Effective chord recognition requires leveraging contextual information from audio sequences, yet existing models, such as combinations of convolutional neural networks, bidirectional long short-term memory networks, and bidirectional transformers, face limitations in capturing long-term dependencies and exhibit suboptimal performance on large-vocabulary chord recognition tasks. This work proposes ChordFormer, a novel conformer-based architecture designed to tackle structural chord recognition (e.g., triads, bass, sevenths) for large vocabularies. ChordFormer leverages conformer blocks that integrate convolutional neural networks with transformers, enabling the model to capture both local patterns and global dependencies effectively. By addressing challenges such as class imbalance through a reweighted loss function and structured chord representations, ChordFormer outperforms state-of-the-art models, achieving a 2% improvement in frame-wise accuracy and a 6% increase in class-wise accuracy on large-vocabulary chord datasets. Furthermore, ChordFormer excels at handling class imbalance, providing robust and balanced recognition across chord types. This approach bridges the gap between theoretical music knowledge and practical applications, advancing the field of large-vocabulary chord recognition.
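The abstract mentions a reweighted loss function for the long-tail chord distribution but does not specify its form. A common scheme for such class imbalance is to scale each class's cross-entropy term by a weight derived from its training frequency, e.g., the "effective number of samples" reweighting of Cui et al. (2019). The sketch below is a minimal NumPy illustration of that general idea under those assumptions, not ChordFormer's actual formulation:

```python
import numpy as np

def class_weights(labels, num_classes, beta=0.999):
    """Per-class weights via the 'effective number of samples' scheme
    (Cui et al., 2019), a common reweighting for long-tail data.
    Rare classes get larger weights; beta -> 0 gives uniform weights,
    beta -> 1 approaches inverse-frequency weighting."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    effective = 1.0 - np.power(beta, counts)
    weights = (1.0 - beta) / np.maximum(effective, 1e-12)
    return weights * num_classes / weights.sum()  # normalize to mean 1

def reweighted_cross_entropy(logits, labels, weights):
    """Frame-wise cross-entropy where each frame's loss term is scaled
    by the weight of its ground-truth chord class."""
    z = logits - logits.max(axis=1, keepdims=True)  # stable softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    per_frame = -log_probs[np.arange(len(labels)), labels]
    return float((weights[labels] * per_frame).mean())
```

With this weighting, frames labeled with rare chord qualities (e.g., sevenths) contribute more to the gradient than frames of abundant major/minor chords, which is one way to obtain the balanced class-wise accuracy the abstract reports.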
URL
https://arxiv.org/abs/2502.11840