Effects of Word-frequency based Pre- and Post- Processings for Audio Captioning

2020-09-24 01:07:33

Daiki Takeuchi, Yuma Koizumi, Yasunori Ohishi, Noboru Harada, Kunio Kashino

arXiv_CL

Abstract
Abstract (translated)
URL
PDF

Abstract

The system we used for Task 6 (Automated Audio Captioning)of the Detection and Classification of Acoustic Scenes and Events(DCASE) 2020 Challenge combines three elements, namely, dataaugmentation, multi-task learning, and post-processing, for audiocaptioning. The system received the highest evaluation scores, butwhich of the individual elements most fully contributed to its perfor-mance has not yet been clarified. Here, to asses their contributions,we first conducted an element-wise ablation study on our systemto estimate to what extent each element is effective. We then con-ducted a detailed module-wise ablation study to further clarify thekey processing modules for improving accuracy. The results showthat data augmentation and post-processing significantly improvethe score in our system. In particular, mix-up data augmentationand beam search in post-processing improve SPIDEr by 0.8 and 1.6points, respectively.

Abstract (translated)

URL

https://arxiv.org/abs/2009.11436

PDF

https://arxiv.org/pdf/2009.11436.pdf