Exploiting Music Source Separation for Automatic Lyrics Transcription with Whisper

2025-06-18 14:48:50
Jaza Syed, Ivan Meresman Higgs, Ondřej Cífka, Mark Sandler

Abstract

Automatic lyrics transcription (ALT) remains a challenging task in the field of music information retrieval, despite great advances in automatic speech recognition (ASR) brought about by transformer-based architectures in recent years. One of the major challenges in ALT is the high amplitude of interfering audio signals relative to conventional ASR due to musical accompaniment. Recent advances in music source separation have enabled automatic extraction of high-quality separated vocals, which could potentially improve ALT performance. However, the effect of source separation has not been systematically investigated in order to establish best practices for its use. This work examines the impact of source separation on ALT using Whisper, a state-of-the-art open source ASR model. We evaluate Whisper's performance on original audio, separated vocals, and vocal stems across short-form and long-form transcription tasks. For short-form, we suggest a concatenation method that results in a consistent reduction in Word Error Rate (WER). For long-form, we propose an algorithm using source separation as a vocal activity detector to derive segment boundaries, which results in a consistent reduction in WER relative to Whisper's native long-form algorithm. Our approach achieves state-of-the-art results for an open source system on the Jam-ALT long-form ALT benchmark, without any training or fine-tuning. We also publish MUSDB-ALT, the first dataset of long-form lyric transcripts following the Jam-ALT guidelines for which vocal stems are publicly available.
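The abstract does not spell out the long-form algorithm, so the following is only a rough sketch of the general idea of using a separated vocal track as a vocal activity detector. It assumes a mono vocal signal as a list of float samples and uses a simple frame-energy threshold; the function names, the threshold, the gap-merging rule, and the default values are all illustrative assumptions, not the authors' method. The 30-second cap reflects the fixed window length Whisper processes at a time.

```python
import math
from typing import List, Sequence, Tuple


def frame_rms(samples: Sequence[float], frame_len: int) -> List[float]:
    """RMS energy of consecutive non-overlapping frames (trailing partial frame dropped)."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies


def vocal_segments(
    vocals: Sequence[float],
    sample_rate: int,
    frame_sec: float = 0.05,    # hypothetical analysis frame length
    threshold: float = 0.01,    # hypothetical RMS threshold for "vocals active"
    min_gap_sec: float = 0.5,   # hypothetical: shorter pauses are bridged
    max_len_sec: float = 30.0,  # Whisper's input window length
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds where the separated vocals are active."""
    frame_len = max(1, int(round(frame_sec * sample_rate)))
    active = [e > threshold for e in frame_rms(vocals, frame_len)]

    # Collect runs of consecutive active frames and convert them to seconds.
    runs = []
    start = None
    for i, is_active in enumerate(active + [False]):  # sentinel closes a final run
        if is_active and start is None:
            start = i
        elif not is_active and start is not None:
            runs.append((start * frame_len / sample_rate, i * frame_len / sample_rate))
            start = None

    # Bridge short pauses, as long as the merged segment still fits one window.
    merged: List[Tuple[float, float]] = []
    for seg_start, seg_end in runs:
        if merged:
            prev_start, prev_end = merged[-1]
            if seg_start - prev_end < min_gap_sec and seg_end - prev_start <= max_len_sec:
                merged[-1] = (prev_start, seg_end)
                continue
        merged.append((seg_start, seg_end))

    # Split any run that is still longer than one window.
    segments = []
    for seg_start, seg_end in merged:
        while seg_end - seg_start > max_len_sec:
            segments.append((seg_start, seg_start + max_len_sec))
            seg_start += max_len_sec
        segments.append((seg_start, seg_end))
    return segments
```

In a pipeline of this kind, each returned segment would be cut from the audio and transcribed separately; the source separator and the transcription call are deliberately left out here because the abstract does not say how they are wired together.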

URL

https://arxiv.org/abs/2506.15514

PDF

https://arxiv.org/pdf/2506.15514.pdf

