City classification from multiple real-world sound scenes

2019-05-02 22:01:10
Helen L. Bear, Toni Heittola, Annamaria Mesaros, Emmanouil Benetos, Tuomas Virtanen

Abstract

The majority of sound scene analysis work focuses on one of two clearly defined tasks: acoustic scene classification or sound event detection. Whilst this separation of tasks is useful for problem definition, it inherently ignores some subtleties of the real world, in particular how humans vary in how they describe a scene. Some will describe the weather and features within it, others will use a holistic descriptor like 'park', and others still will use unique identifiers such as cities or names. In this paper, we undertake the task of automatic city classification, asking whether we can recognize a city from a set of sound scenes. In this problem, each city has recordings from multiple scenes. We test a series of methods for this novel task and show that a simple convolutional neural network (CNN) achieves an accuracy of 50%, below the acoustic scene classification baseline of the DCASE 2018 ASC challenge on the same data. With a simple adaptation of the class labels to paired city labels with grouped scenes, accuracy increases to 52%, closer to that of the simpler scene classification task. Finally, we also formulate the problem in a multi-task learning framework and achieve an accuracy of 56%, outperforming the aforementioned approaches.
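
The abstract describes a multi-task setup in which city and scene labels are predicted jointly. Below is a minimal sketch of that idea, assuming PyTorch and log-mel spectrogram input: a shared CNN trunk feeding two classification heads trained with a joint loss. The layer sizes, the counts of 6 cities and 10 scenes (the DCASE 2018 development data), and the equal loss weighting are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class MultiTaskSceneCityCNN(nn.Module):
    """Sketch of a multi-task CNN: shared trunk, one head per task."""

    def __init__(self, n_cities=6, n_scenes=10):
        super().__init__()
        # Shared convolutional trunk over (batch, 1, mel_bins, frames) input.
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> fixed-size embedding
            nn.Flatten(),
        )
        # Task-specific heads share the trunk's 64-dim embedding.
        self.city_head = nn.Linear(64, n_cities)
        self.scene_head = nn.Linear(64, n_scenes)

    def forward(self, x):
        h = self.trunk(x)
        return self.city_head(h), self.scene_head(h)

model = MultiTaskSceneCityCNN()
criterion = nn.CrossEntropyLoss()

# One training step on a dummy batch of log-mel patches; equal weighting
# of the two task losses is an assumption made for this sketch.
x = torch.randn(8, 1, 40, 500)
city_y = torch.randint(0, 6, (8,))
scene_y = torch.randint(0, 10, (8,))
city_logits, scene_logits = model(x)
loss = criterion(city_logits, city_y) + criterion(scene_logits, scene_y)
loss.backward()
```

The paired-label variant mentioned in the abstract could be approximated in the same framework by training a single head over city-scene label pairs and, at test time, summing the predicted probabilities over scenes to obtain a per-city score.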

URL

https://arxiv.org/abs/1905.00979

PDF

https://arxiv.org/pdf/1905.00979.pdf

