
ChordFormer: A Conformer-Based Architecture for Large-Vocabulary Audio Chord Recognition

2025-02-17 14:35:16
Muhammad Waseem Akram, Stefano Dettori, Valentina Colla, Giorgio Carlo Buttazzo

Abstract

Chord recognition is a critical task in music information retrieval because chords provide an abstract, descriptive summary of musical content. Audio chord recognition systems achieve high accuracy on small vocabularies (e.g., major/minor chords), but large-vocabulary chord recognition remains challenging. Part of the difficulty stems from the long-tail distribution of chords: rare chord types are underrepresented in most datasets, leaving too few training samples for them. Effective chord recognition also requires contextual information from the audio sequence, yet existing models, such as combinations of convolutional neural networks, bidirectional long short-term memory networks, and bidirectional transformers, are limited in capturing long-term dependencies and perform suboptimally on large-vocabulary tasks. This work proposes ChordFormer, a conformer-based architecture for structural chord recognition (e.g., triads, bass, sevenths) over large vocabularies. Its conformer blocks integrate convolutional neural networks with transformers, enabling the model to capture both local patterns and global dependencies. By addressing class imbalance through a reweighted loss function and structured chord representations, ChordFormer outperforms state-of-the-art models, achieving a 2% improvement in frame-wise accuracy and a 6% increase in class-wise accuracy on large-vocabulary chord datasets, and it provides robust, balanced recognition across chord types. This approach bridges the gap between theoretical music knowledge and practical applications, advancing the field of large-vocabulary chord recognition.
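
The abstract reports both frame-wise and class-wise accuracy and mentions a reweighted loss for class imbalance, but gives no formulas. The following is a minimal sketch of how these quantities are commonly defined, not the authors' implementation. The function names, the inverse-frequency weighting rule, and the parameters `gamma` and `w_max` are all assumptions made for illustration.

```python
import math
from collections import Counter


def class_weights(labels, gamma=0.5, w_max=10.0):
    """Assumed weighting rule: w_c = min((n_max / n_c) ** gamma, w_max).

    Rare classes receive larger weights; gamma and w_max are hypothetical
    knobs that soften and cap the reweighting.
    """
    counts = Counter(labels)
    n_max = max(counts.values())
    return {c: min((n_max / n) ** gamma, w_max) for c, n in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Mean over frames of -w[y] * log p(y).

    probs is a list of dicts mapping class name -> predicted probability.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total += -weights[y] * math.log(p[y])
    return total / len(labels)


def frame_accuracy(pred, true):
    """Fraction of frames whose predicted label matches the reference."""
    return sum(p == t for p, t in zip(pred, true)) / len(true)


def class_accuracy(pred, true):
    """Per-class recall averaged over classes, so rare classes count equally."""
    per_class = {}
    for p, t in zip(pred, true):
        hits, n = per_class.get(t, (0, 0))
        per_class[t] = (hits + (p == t), n + 1)
    return sum(h / n for h, n in per_class.values()) / len(per_class)


# Toy example: a model that always predicts the common chord.
true = ["C:maj"] * 8 + ["C:maj7"] * 2
pred = ["C:maj"] * 10
print(frame_accuracy(pred, true))   # 0.8
print(class_accuracy(pred, true))   # 0.5
print(class_weights(true))          # the rare chord gets the larger weight
```

In this toy case, ignoring the rare chord still yields high frame-wise accuracy but halves the class-wise accuracy, which is why a long-tailed vocabulary is usually evaluated with both metrics.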


URL

https://arxiv.org/abs/2502.11840

PDF

https://arxiv.org/pdf/2502.11840.pdf

