Paper Reading AI Learner

Audio Event-Relational Graph Representation Learning for Acoustic Scene Classification

2023-10-05 20:48:59
Yuanbo Hou, Siyang Song, Chuang Yu, Wenwu Wang, Dick Botteldooren

Abstract

Most deep learning-based acoustic scene classification (ASC) approaches identify scenes based on acoustic features converted from audio clips containing mixed information entangled by polyphonic audio events (AEs). However, these approaches have difficulty explaining which cues they use to identify scenes. This paper presents the first study to disclose the relationship between real-life acoustic scenes and semantic embeddings of the most relevant AEs. Specifically, we propose an event-relational graph representation learning (ERGL) framework for ASC that classifies scenes while clearly and directly indicating which cues are used in classification. In the event-relational graph, the embedding of each event is treated as a node, while relationship cues derived from each pair of nodes are described by multi-dimensional edge features. Experiments on a real-life ASC dataset show that the proposed ERGL achieves competitive ASC performance by learning embeddings of only a limited number of AEs. The results demonstrate the feasibility of recognizing diverse acoustic scenes based on the audio event-relational graph. Visualizations of graph representations learned by ERGL are available here (this https URL).
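To make the graph construction concrete, below is a minimal sketch (not the authors' code) of the idea described in the abstract: pretrained audio-event embeddings serve as graph nodes, a multi-dimensional edge feature is derived from each pair of nodes, and a graph-level readout feeds a scene classifier. All names and sizes here (`num_events`, `embed_dim`, `edge_dim`, the MLP edge/node updates, mean aggregation) are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of an event-relational graph for ASC, assuming
# per-clip audio-event embeddings from a pretrained audio tagger.
import torch
import torch.nn as nn

class EventRelationalGraph(nn.Module):
    def __init__(self, num_events=25, embed_dim=128, edge_dim=16, num_scenes=10):
        super().__init__()
        # Derive a multi-dimensional edge feature from each ordered node pair.
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, edge_dim), nn.ReLU())
        # Update each node from its own embedding plus aggregated edge features.
        self.node_mlp = nn.Sequential(
            nn.Linear(embed_dim + edge_dim, embed_dim), nn.ReLU())
        self.classifier = nn.Linear(embed_dim, num_scenes)

    def forward(self, node_embeds):
        # node_embeds: (batch, num_events, embed_dim) audio-event embeddings.
        b, n, d = node_embeds.shape
        src = node_embeds.unsqueeze(2).expand(b, n, n, d)      # sender nodes
        dst = node_embeds.unsqueeze(1).expand(b, n, n, d)      # receiver nodes
        edges = self.edge_mlp(torch.cat([src, dst], dim=-1))   # (b, n, n, edge_dim)
        agg = edges.mean(dim=2)                                # aggregate edges per node
        nodes = self.node_mlp(torch.cat([node_embeds, agg], dim=-1))
        graph_repr = nodes.mean(dim=1)                         # readout over all events
        return self.classifier(graph_repr)

# Usage: 25 event embeddings per clip, batch of 4, 10 candidate scenes.
model = EventRelationalGraph()
logits = model(torch.randn(4, 25, 128))                       # shape: (4, 10)
```

Because scene evidence lives on the edges rather than in a single pooled feature vector, inspecting the learned edge features for a clip indicates which event pairs drove the scene decision, which is the interpretability property the abstract emphasizes.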


URL

https://arxiv.org/abs/2310.03889

PDF

https://arxiv.org/pdf/2310.03889.pdf

