City classification from multiple real-world sound scenes

2019-05-02 22:01:10
Helen L. Bear, Toni Heittola, Annamaria Mesaros, Emmanouil Benetos, Tuomas Virtanen

Abstract

The majority of sound scene analysis work focuses on one of two clearly defined tasks: acoustic scene classification or sound event detection. Whilst this separation of tasks is useful for problem definition, it inherently ignores some subtleties of the real world, in particular how humans vary in how they describe a scene. Some will describe the weather and features within it, others will use a holistic descriptor like 'park', and others still will use unique identifiers such as cities or names. In this paper, we undertake the task of automatic city classification, asking whether we can recognize a city from a set of sound scenes. In this problem, each city has recordings from multiple scenes. We test a series of methods for this novel task and show that a simple convolutional neural network (CNN) achieves an accuracy of 50%, below the acoustic scene classification baseline of the DCASE 2018 ASC challenge on the same data. With a simple adaptation of the class labels to paired city labels with grouped scenes, accuracy increases to 52%, closer to that of the simpler scene classification task. Finally, we also formulate the problem in a multi-task learning framework and achieve an accuracy of 56%, outperforming the aforementioned approaches.
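
The abstract describes a multi-task setup in which city and scene labels are predicted jointly. Below is a minimal sketch of that idea, assuming PyTorch and log-mel spectrogram input: a shared CNN trunk feeding two classification heads trained with a joint loss. The layer sizes, the counts of 6 cities and 10 scenes (the DCASE 2018 development data), and the equal loss weighting are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class MultiTaskSceneCityCNN(nn.Module):
    """Sketch of a multi-task CNN: shared trunk, one head per task."""

    def __init__(self, n_cities=6, n_scenes=10):
        super().__init__()
        # Shared convolutional trunk over (batch, 1, mel_bins, frames) input.
        self.trunk = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling -> fixed-size embedding
            nn.Flatten(),
        )
        # Task-specific heads share the trunk's 64-dim embedding.
        self.city_head = nn.Linear(64, n_cities)
        self.scene_head = nn.Linear(64, n_scenes)

    def forward(self, x):
        h = self.trunk(x)
        return self.city_head(h), self.scene_head(h)

model = MultiTaskSceneCityCNN()
criterion = nn.CrossEntropyLoss()

# One training step on a dummy batch of log-mel patches; equal weighting
# of the two task losses is an assumption made for this sketch.
x = torch.randn(8, 1, 40, 500)
city_y = torch.randint(0, 6, (8,))
scene_y = torch.randint(0, 10, (8,))
city_logits, scene_logits = model(x)
loss = criterion(city_logits, city_y) + criterion(scene_logits, scene_y)
loss.backward()
```

The paired-label variant mentioned in the abstract could be approximated in the same framework by training a single head over city-scene label pairs and, at test time, summing the predicted probabilities over scenes to obtain a per-city score.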

URL

https://arxiv.org/abs/1905.00979

PDF

https://arxiv.org/pdf/1905.00979.pdf

