Paper Reading AI Learner

Joint Analysis of Acoustic Event and Scene Based on Multitask Learning

2019-04-27 11:22:18
Keisuke Imoto, Masahiro Niitsuma, Ryosuke Yamanishi, Yoichi Yamashita

Abstract

Acoustic event detection and scene classification are major research tasks in environmental sound analysis, and many methods based on neural networks have been proposed. Conventional methods have addressed these tasks separately; however, acoustic events and scenes are closely related to each other. For example, in the acoustic scene ``office'', the acoustic events ``mouse clicking'' and ``keyboard typing'' are likely to occur. In this paper, we propose multitask learning for joint analysis of acoustic events and scenes, which shares the parts of the networks holding information on acoustic events and scenes in common. By integrating the two networks, we expect that information on acoustic scenes will improve the performance of acoustic event detection. Experimental results obtained using TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016 datasets indicate that the proposed method improves the performance of acoustic event detection by 10.66 percentage points in terms of the F-score, compared with a conventional method based on a convolutional recurrent neural network.

Abstract (translated)

声事件检测和场景分类是环境声分析的主要研究任务,提出了许多基于神经网络的方法。传统的方法已经分别处理了这些任务;然而,声学事件和场景是密切相关的。例如,在声学场景“office”中,可能会发生声学事件“鼠标单击”和“键盘键入”。本文提出了一种多任务学习方法,用于声事件和场景的联合分析,该方法共享网络中包含声事件和场景信息的部分。通过两个网络的集成,我们期望声学场景的信息能够提高声学事件检测的性能。利用2016/2017年TUT声音事件和2016年TUT声学场景数据集获得的实验结果表明,与基于卷积循环神经网络的传统方法相比,该方法提高了10.66个百分点的声学事件检测性能。

URL

https://arxiv.org/abs/1904.12146

PDF

https://arxiv.org/pdf/1904.12146.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot