A Comprehensive Survey of Automated Audio Captioning

Abstract

Automated audio captioning, a task that mimics human perception and links audio processing with natural language processing, has seen much progress over the last few years. Audio captioning requires recognizing the acoustic scene, the primary audio events, and sometimes the spatial and temporal relationships between events in an audio clip. It also requires describing these elements in a fluent and vivid sentence. Deep learning-based approaches are widely adopted to tackle this problem. This paper is a comprehensive review covering the benchmark datasets, existing deep learning techniques, and evaluation metrics in automated audio captioning.

URL

https://arxiv.org/abs/2205.05357

PDF

https://arxiv.org/pdf/2205.05357.pdf