Paper Reading AI Learner

MACE: Leveraging Audio for Evaluating Audio Captioning Systems

2024-11-01 02:41:33
Satvik Dixit, Soham Deshmukh, Bhiksha Raj

Abstract

The Automated Audio Captioning (AAC) task aims to describe an audio signal using natural language. To evaluate machine-generated captions, the metrics should take into account audio events, acoustic scenes, paralinguistics, signal characteristics, and other audio information. Traditional AAC evaluation relies on natural language generation metrics like ROUGE and BLEU, image captioning metrics such as SPICE and CIDEr, or Sentence-BERT embedding similarity. However, these metrics only compare generated captions to human references, overlooking the audio signal itself. In this work, we propose MACE (Multimodal Audio-Caption Evaluation), a novel metric that integrates both audio and reference captions for comprehensive audio caption evaluation. MACE incorporates information from the audio signal as well as from the predicted and reference captions, and weights the resulting score with a fluency penalty. Our experiments demonstrate MACE's superior performance in predicting human quality judgments compared to traditional metrics. Specifically, MACE achieves a 3.28% and 4.36% relative accuracy improvement over the FENSE metric on the AudioCaps-Eval and Clotho-Eval datasets respectively. Moreover, it significantly outperforms all previous metrics on the audio captioning evaluation task. The metric is open-sourced at this https URL
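The abstract describes MACE as combining audio-grounded and reference-grounded evidence and weighting the result with a fluency penalty. Below is a minimal sketch of how such a metric could be assembled, assuming precomputed embeddings from an audio-text encoder (e.g. a CLAP-style model) and a sentence-embedding model, plus a separate fluency-error classifier. The weighting scheme (`alpha`), the max over references, and the penalty form (`penalty`) are illustrative assumptions, not the published MACE formulation.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mace_like_score(
    audio_emb: np.ndarray,        # audio embedding from an audio-text model (assumed CLAP-style)
    cand_emb: np.ndarray,         # text embedding of the machine-generated caption
    ref_embs: list[np.ndarray],   # text embeddings of the human reference captions
    fluency_error_prob: float,    # probability the candidate is disfluent, from a separate classifier
    alpha: float = 0.5,           # assumed weight balancing audio-text vs. text-text similarity
    penalty: float = 0.9,         # assumed strength of the fluency penalty
) -> float:
    """Illustrative audio-plus-reference caption score with a fluency penalty.

    Not the official MACE formula; shows only the general structure the
    abstract describes.
    """
    audio_text = cosine_sim(audio_emb, cand_emb)                 # candidate vs. the audio itself
    text_text = max(cosine_sim(cand_emb, r) for r in ref_embs)   # candidate vs. closest reference
    score = alpha * audio_text + (1.0 - alpha) * text_text
    # Down-weight captions flagged as disfluent, in the spirit of FENSE-style penalties.
    return score * (1.0 - penalty * fluency_error_prob)

# Toy usage with random vectors standing in for real CLAP / Sentence-BERT embeddings.
rng = np.random.default_rng(0)
audio = rng.normal(size=512)
cand = rng.normal(size=512)
refs = [rng.normal(size=512) for _ in range(5)]
print(mace_like_score(audio, cand, refs, fluency_error_prob=0.1))
```

The key difference from reference-only metrics is the `audio_text` term, which lets the score reward captions that match the audio even when they diverge in wording from the human references.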

URL

https://arxiv.org/abs/2411.00321

PDF

https://arxiv.org/pdf/2411.00321.pdf

