Abstract
Most deep learning-based acoustic scene classification (ASC) approaches identify scenes from acoustic features extracted from audio clips, in which information from polyphonic audio events (AEs) is entangled. However, these approaches have difficulty explaining which cues they use to identify scenes. This paper presents the first study disclosing the relationship between real-life acoustic scenes and semantic embeddings of the most relevant AEs. Specifically, we propose an event-relational graph representation learning (ERGL) framework for ASC that classifies scenes while clearly and directly indicating which cues are used in the classification. In the event-relational graph, the embedding of each event is treated as a node, while the relationship cues derived from each pair of nodes are described by multi-dimensional edge features. Experiments on a real-life ASC dataset show that the proposed ERGL achieves competitive performance by learning embeddings of only a limited number of AEs. The results demonstrate the feasibility of recognizing diverse acoustic scenes from an audio event-relational graph. Visualizations of the graph representations learned by ERGL are available here (this https URL).
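The graph construction described above (event embeddings as nodes, per-pair relationship cues as multi-dimensional edge features) can be sketched as follows. This is a minimal illustration with assumed shapes and an assumed pairwise projection for the edge features; it is not the authors' implementation.

```python
import numpy as np

# Assumed sizes (illustrative only, not from the paper).
num_events, emb_dim, edge_dim = 5, 8, 4
rng = np.random.default_rng(0)

# Node features: one semantic embedding per detected audio event (AE).
nodes = rng.normal(size=(num_events, emb_dim))

# Multi-dimensional edge features: here, each directed pair of nodes is
# concatenated and projected to edge_dim values. The projection W stands in
# for whatever learned edge function the framework uses (an assumption).
W = rng.normal(size=(2 * emb_dim, edge_dim))
edges = np.empty((num_events, num_events, edge_dim))
for i in range(num_events):
    for j in range(num_events):
        pair = np.concatenate([nodes[i], nodes[j]])
        edges[i, j] = pair @ W

print(nodes.shape, edges.shape)  # (5, 8) (5, 5, 4)
```

A graph neural network for scene classification would then consume `nodes` as node features and `edges` as edge attributes.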
URL
https://arxiv.org/abs/2310.03889