Abstract
The Automated Audio Captioning (AAC) task aims to describe an audio signal using natural language. To evaluate machine-generated captions, a metric should take into account audio events, acoustic scenes, paralinguistics, signal characteristics, and other audio information. Traditional AAC evaluation relies on natural language generation metrics such as ROUGE and BLEU, image captioning metrics such as SPICE and CIDEr, or Sentence-BERT embedding similarity. However, these metrics only compare generated captions to human references, overlooking the audio signal itself. In this work, we propose MACE (Multimodal Audio-Caption Evaluation), a novel metric that integrates both the audio and the reference captions for comprehensive audio caption evaluation. MACE combines information from the audio signal with information from the predicted and reference captions, and weights the resulting score with a fluency penalty. Our experiments demonstrate MACE's superior performance in predicting human quality judgments compared to traditional metrics: it achieves relative accuracy improvements of 3.28% and 4.36% over the FENSE metric on the AudioCaps-Eval and Clotho-Eval datasets, respectively, and it significantly outperforms all previous metrics on the audio captioning evaluation task. The metric is open-sourced at this https URL
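The combination the abstract describes (an audio–caption similarity and a caption–caption similarity, down-weighted by a fluency penalty) can be sketched as below. This is a minimal illustration, not the paper's actual formulation: the function name `mace_score`, the mixing weight `alpha`, the `penalty` factor, and the 0.5 fluency-error threshold are all illustrative assumptions, and the embeddings would in practice come from an audio-text model (for the audio) and a sentence encoder (for the captions).

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def mace_score(audio_emb, cand_emb, ref_embs, fluency_error_prob,
               alpha=0.5, penalty=0.9):
    """Hypothetical sketch of a MACE-style score (values are illustrative).

    audio_emb          -- embedding of the audio clip
    cand_emb           -- embedding of the candidate caption
    ref_embs           -- embeddings of the human reference captions
    fluency_error_prob -- probability that the caption is disfluent
    """
    # Audio-caption term: how well the candidate matches the audio itself.
    s_audio = cosine(audio_emb, cand_emb)
    # Caption-caption term: best match against the human references.
    s_text = max(cosine(cand_emb, r) for r in ref_embs)
    # Mix the two modalities (alpha is an assumed weight, not from the paper).
    score = alpha * s_audio + (1 - alpha) * s_text
    # FENSE-style fluency penalty: scale down likely-disfluent captions.
    if fluency_error_prob > 0.5:
        score *= (1 - penalty)
    return score
```

The key difference from reference-only metrics is the `s_audio` term, which consults the audio signal directly rather than only comparing the candidate to human captions.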
URL
https://arxiv.org/abs/2411.00321