Aerial scene classification, which aims to semantically label remote sensing images with a set of predefined classes (e.g., agricultural, beach, and harbor), is a very challenging task in remote sensing due to high intra-class variability and the different scales and orientations of the objects present in the dataset images. In the remote sensing area, the use of CNN architectures as an alternative solution is also a reality for scene classification tasks. Generally, these CNNs are used to perform the traditional image classification task. However, a less commonly used way to classify remote sensing images is to use deep metric learning (DML) approaches. In this sense, this work proposes to employ six DML approaches for aerial scene classification tasks, analyzing their behavior with four different pre-trained CNNs as well as combining them through the use of an evolutionary computation algorithm (UMDA). In the performed experiments, it is possible to observe that the DML approaches can achieve the best classification results when compared to traditional pre-trained CNNs on three well-known remote sensing aerial scene datasets. In addition, the UMDA algorithm proved to be a promising strategy for combining DML approaches when there is diversity among them, improving classification accuracy by at least 5.6% while using almost 50% of the available classifiers to build the final ensemble.
https://arxiv.org/abs/2303.11389
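As a rough illustration of the UMDA-based ensemble selection described above, the sketch below evolves a binary mask over a pool of classifiers by repeatedly re-estimating per-classifier inclusion probabilities from the fittest individuals. The fitness function (majority-vote accuracy on a validation split), population sizes, and probability clipping are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np

def umda_select(classifier_preds, y_val, pop_size=50, n_best=10, n_gens=30, rng=None):
    """Select a subset of classifiers with a Univariate Marginal Distribution
    Algorithm (UMDA). `classifier_preds` is an (n_classifiers, n_samples)
    array of predicted labels on a validation set."""
    rng = rng or np.random.default_rng(0)
    n_clf = classifier_preds.shape[0]
    probs = np.full(n_clf, 0.5)               # marginal inclusion probabilities

    def fitness(mask):
        if mask.sum() == 0:
            return 0.0
        votes = classifier_preds[mask.astype(bool)]
        # majority vote over the selected classifiers
        maj = np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)
        return (maj == y_val).mean()

    for _ in range(n_gens):
        pop = (rng.random((pop_size, n_clf)) < probs).astype(int)
        scores = np.array([fitness(ind) for ind in pop])
        elite = pop[np.argsort(scores)[-n_best:]]
        # UMDA update: marginals re-estimated from the elite individuals
        probs = np.clip(elite.mean(axis=0), 0.05, 0.95)

    final = (rng.random((pop_size, n_clf)) < probs).astype(int)
    return final[np.argmax([fitness(ind) for ind in final])]
```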
In this paper, we propose a method for incremental learning of two distinct tasks over time: acoustic scene classification (ASC) and audio tagging (AT). We use a simple convolutional neural network (CNN) model as an incremental learner to solve the tasks. Generally, incremental learning methods catastrophically forget the previous task when sequentially trained on a new task. To alleviate this problem, we use independent learning and knowledge distillation (KD) between the time steps in learning. Experiments are performed on the TUT 2016/2017 dataset, containing 4 acoustic scene classes and 25 sound event classes. The proposed incremental learner solves the AT task with an F1 score of 54.4% and the ASC task with an accuracy of 88.9% in an incremental time step, outperforming a multi-task system which solves ASC and AT at the same time. The ASC task performance degrades by only 5.1% from the initial ASC accuracy of 94.0%.
https://arxiv.org/abs/2302.14815
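The distillation component above can be pictured with the standard temperature-scaled KD loss between the frozen previous-step model and the current learner. A minimal sketch, where the temperature, the weight `lambda_kd`, and the function names are illustrative rather than taken from the paper:

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=2.0):
    """Knowledge-distillation term: KL divergence between the softened
    outputs of the frozen previous-step model (teacher) and the current
    model (student), scaled by T^2 as is conventional."""
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

def incremental_loss(new_task_logits, targets, old_logits_student,
                     old_logits_teacher, lambda_kd=1.0):
    ce = F.cross_entropy(new_task_logits, targets)        # learn the new task
    kd = kd_loss(old_logits_student, old_logits_teacher)  # remember the old one
    return ce + lambda_kd * kd
```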
Remote sensing scene classification has been extensively studied for its critical roles in geological survey, oil exploration, traffic management, earthquake prediction, wildfire monitoring, and intelligence monitoring. In the past, Machine Learning (ML) methods for performing the task mainly used backbones pretrained in the manner of supervised learning (SL). As Masked Image Modeling (MIM), a self-supervised learning (SSL) technique, has been shown to be a better way of learning visual feature representations, it presents a new opportunity for improving ML performance on the scene classification task. This research aims to explore the potential of MIM pretrained backbones on four well-known classification datasets: Merced, AID, NWPU-RESISC45, and Optimal-31. Compared to published benchmarks, we show that MIM pretrained Vision Transformer (ViT) backbones outperform other alternatives (up to 18% in top-1 accuracy) and that the MIM technique can learn better feature representations than its supervised learning counterparts (up to 5% in top-1 accuracy). Moreover, we show that general-purpose MIM-pretrained ViTs can achieve performance competitive with the specially designed yet complicated Transformer for Remote Sensing (TRS) framework. Our experimental results also provide a performance baseline for future studies.
https://arxiv.org/abs/2302.14256
Indoor scene classification has become an important task in perception modules and has been widely used in various applications. However, problems such as intra-category variability and inter-category similarity have been holding back the models' performance, which leads to the need for new types of features to obtain a more meaningful scene representation. A semantic segmentation mask provides pixel-level information about the objects present in the scene, which makes it a promising source of information for obtaining a more meaningful local representation of the scene. Therefore, in this work, a novel approach that uses a semantic segmentation mask to obtain a 2D spatial layout of the object categories across the scene, designated segmentation-based semantic features (SSFs), is proposed. These features represent, per object category, the pixel count as well as the 2D average position and the respective standard deviation values. Moreover, a two-branch network, GS2F2App, that exploits CNN-based global features extracted from RGB images and the segmentation-based features extracted from the proposed SSFs, is also proposed. GS2F2App was evaluated on two indoor scene benchmark datasets, SUN RGB-D and NYU Depth V2, achieving state-of-the-art results on both.
https://arxiv.org/abs/2302.06432
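The SSFs described above are simple enough to sketch: per object category, count the mask pixels and take the mean and standard deviation of their 2D coordinates. The normalization by image size below is an assumption for illustration; the paper may scale the features differently.

```python
import numpy as np

def segmentation_semantic_features(mask, n_categories):
    """Per-category pixel count, 2D mean position, and positional standard
    deviation computed from a semantic segmentation mask (H x W, int labels).
    Positions are normalized by image size (an illustrative choice)."""
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.zeros((n_categories, 5), dtype=np.float32)
    for c in range(n_categories):
        sel = mask == c
        count = sel.sum()
        if count == 0:
            continue                            # absent category -> zero row
        y, x = ys[sel] / h, xs[sel] / w
        feats[c] = [count / (h * w), x.mean(), y.mean(), x.std(), y.std()]
    return feats.ravel()   # fed to the classifier alongside CNN global features
```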
The real-time processing of time series signals is a critical issue for many real-life applications. The idea of real-time processing is especially important in the audio domain, as human perception of sound is sensitive to any kind of disturbance in perceived signals, especially lag between the auditory and visual modalities. The rise of deep learning (DL) models has complicated the landscape of signal processing: although they often deliver superior quality compared to standard DSP methods, this advantage is diminished by higher latency. In this work we propose a novel method for minimizing inference latency and memory consumption, called Short-Term Memory Convolution (STMC), together with its transposed counterpart. The main advantage of STMC is a low latency comparable to long short-term memory (LSTM) networks. Furthermore, the training of STMC-based models is faster and more stable, as the method is based solely on convolutional neural networks (CNNs). In this study we demonstrate an application of this solution to a U-Net model for a speech separation task and to a GhostNet model for an acoustic scene classification (ASC) task. In the case of speech separation we achieved a 5-fold reduction in inference time and a 2-fold reduction in latency without affecting output quality. The inference time for the ASC task was up to 4 times faster while preserving the original accuracy.
https://arxiv.org/abs/2302.04331
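The core trick behind low-latency streaming convolutions can be illustrated with a cached causal Conv1d: the tail of each processed chunk is stored and prepended to the next one, so chunked inference reproduces the offline output without recomputing past frames. This is a generic sketch of the cached-convolution idea, not the paper's exact STMC layer (which also covers pooling and transposed convolutions):

```python
import torch
import torch.nn as nn

class CachedConv1d(nn.Module):
    """Streaming (chunk-by-chunk) causal convolution: the last kernel_size-1
    frames of each chunk are cached and prepended to the next one, so the
    concatenated chunk outputs match the offline output of the same
    convolution. Assumes kernel_size >= 2."""
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size)
        self.cache = None
        self.pad = kernel_size - 1

    def forward(self, x):                      # x: (batch, channels, frames)
        if self.cache is None:                 # first chunk: zero left-padding
            self.cache = x.new_zeros(x.shape[0], x.shape[1], self.pad)
        x = torch.cat([self.cache, x], dim=2)
        self.cache = x[:, :, -self.pad:].detach()
        return self.conv(x)
```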
Transferring ImageNet pre-trained weights to various remote sensing tasks has produced acceptable results and reduced the need for labeled samples. However, the domain differences between ground imagery and remote sensing images cause the performance of such transfer learning to be limited. Recent research has demonstrated that self-supervised learning methods capture visual features that are more discriminative and transferable than supervised ImageNet weights. Motivated by these facts, we pre-train in-domain representations of remote sensing imagery using contrastive self-supervised learning and transfer the learned features to other related remote sensing datasets. Specifically, we used the SimSiam algorithm to pre-train in-domain knowledge of remote sensing datasets and then transferred the obtained weights to other scene classification datasets. Thus, we have obtained state-of-the-art results on five land cover classification datasets with varying numbers of classes and spatial resolutions. In addition, by conducting appropriate experiments, including feature pre-training using datasets with different attributes, we have identified the most influential factors that make a dataset a good choice for obtaining in-domain features. We have transferred the features obtained by pre-training SimSiam on remote sensing datasets to various downstream tasks and used them as initial weights for fine-tuning. Moreover, we have linearly evaluated the obtained representations in cases where the number of samples per class is limited. Our experiments have demonstrated that using a higher-resolution dataset during the self-supervised pre-training stage results in learning more discriminative and general representations.
https://arxiv.org/abs/2302.01793
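The SimSiam objective used for in-domain pre-training above is compact enough to state directly: a symmetric negative cosine similarity with a stop-gradient on the target branch. A minimal sketch (variable names are illustrative):

```python
import torch.nn.functional as F

def simsiam_loss(p1, p2, z1, z2):
    """Negative cosine similarity with stop-gradient on the target branch,
    the core of SimSiam. p1/p2 are predictor outputs and z1/z2 projector
    outputs for two augmented views of the same remote sensing image."""
    def neg_cos(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=1).mean()  # stop-grad on z
    return 0.5 * neg_cos(p1, z2) + 0.5 * neg_cos(p2, z1)
```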
The domain adaptation (DA) approaches available to date are usually not well suited for practical DA scenarios of remote sensing image classification, since these methods (such as unsupervised DA) rely on rich prior knowledge about the relationship between the label sets of source and target domains, and source data are often not accessible due to privacy or confidentiality issues. To this end, we propose a practical universal domain adaptation setting for remote sensing image scene classification that requires no prior knowledge on the label sets. Furthermore, a novel universal domain adaptation method without source data is proposed for cases when the source data is unavailable. The architecture of the model is divided into two parts: the source data generation stage and the model adaptation stage. The first stage estimates the conditional distribution of the source data from the pre-trained model using the knowledge of class separability in the source domain and then synthesizes the source data. With this synthetic source data in hand, it becomes a universal DA task to classify a target sample correctly if it belongs to any category in the source label set, or mark it as "unknown" otherwise. In the second stage, a novel transferable weight that distinguishes the shared and private label sets in each domain promotes the adaptation in the automatically discovered shared label set and recognizes the "unknown" samples successfully. Empirical results show that the proposed model is effective and practical for remote sensing image scene classification, regardless of whether the source data is available or not. The code is available at this https URL.
https://arxiv.org/abs/2301.11387
Deep convolutional neural networks have been widely used in scene classification of remotely sensed images. In this work, we propose a robust learning method for the task that is secure against partially incorrect categorization of images. Specifically, we remove and correct errors in the labels progressively by iterative multi-view voting and entropy ranking. At each time step, we first divide the training data into disjoint parts for separate training and voting. The unanimity in the voting reveals the correctness of the labels, so that we can train a strong model with only the images with unanimous votes. In addition, we adopt entropy as an effective measure for prediction uncertainty, in order to partially recover labeling errors by ranking and selection. We empirically demonstrate the superiority of the proposed method on the WHU-RS19 dataset and the AID dataset.
https://arxiv.org/abs/2301.05858
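A schematic of one cleaning step might look like the following: labels confirmed by unanimous votes across views are trusted, and a confident (low-entropy) fraction of the remaining samples is relabeled with the model consensus. The split logic, `keep_frac`, and the relabeling rule are assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

def entropy(probs, eps=1e-12):
    """Prediction entropy as an uncertainty measure (higher = less certain)."""
    return -(probs * np.log(probs + eps)).sum(axis=1)

def clean_labels(view_preds, labels, view_probs, keep_frac=0.5):
    """One cleaning step. `view_preds` is a list of per-view predicted-label
    arrays; `view_probs` are the averaged soft outputs (samples x classes)."""
    preds = np.stack(view_preds)                       # (views, samples)
    unanimous = (preds == preds[0]).all(axis=0) & (preds[0] == labels)
    uncertain = ~unanimous
    h = entropy(view_probs[uncertain])
    order = np.argsort(h)[: int(keep_frac * len(h))]   # most confident first
    idx = np.where(uncertain)[0][order]
    labels = labels.copy()
    labels[idx] = preds[0][idx]                        # recover likely label errors
    return labels, unanimous                           # train on unanimous samples
```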
Due to their ability to offer more comprehensive information than data from a single view, multi-view (multi-source, multi-modal, multi-perspective, etc.) data are being used more frequently in remote sensing tasks. However, as the number of views grows, the issue of data quality becomes more apparent, limiting the potential benefits of multi-view data. Although recent deep neural network (DNN) based models can learn the weight of each view adaptively, the lack of research on explicitly quantifying the data quality of each view when fusing them renders these models inexplicable, and leaves them performing unsatisfactorily and inflexibly in downstream remote sensing tasks. To fill this gap, in this paper, evidential deep learning is introduced to the task of aerial-ground dual-view remote sensing scene classification to model the credibility of each view. Specifically, the theory of evidence is used to calculate an uncertainty value which describes the decision-making risk of each view. Based on this uncertainty, a novel decision-level fusion strategy is proposed to ensure that the view with lower risk obtains more weight, making the classification more credible. On two well-known, publicly available datasets of aerial-ground dual-view remote sensing images, the proposed approach achieves state-of-the-art results, demonstrating its effectiveness. The code and datasets of this article are available at the following address: this https URL.
https://arxiv.org/abs/2301.00622
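The per-view credibility can be sketched with the standard subjective-logic recipe used in evidential deep learning: softplus evidence, a Dirichlet strength, and an uncertainty u = K / S. The fusion rule below (weights 1 - u) is a simplified stand-in for the paper's decision-level strategy, not its exact formulation:

```python
import torch
import torch.nn.functional as F

def evidential_uncertainty(logits):
    """Subjective-logic uncertainty from Dirichlet evidence: alpha = e + 1,
    u = K / sum(alpha). Lower u means a more credible view."""
    evidence = F.softplus(logits)               # non-negative evidence
    alpha = evidence + 1.0
    k = alpha.shape[1]
    strength = alpha.sum(dim=1, keepdim=True)
    belief = evidence / strength                # per-class belief masses
    u = k / strength                            # (batch, 1) uncertainty
    return belief, u

def fuse_two_views(logits_air, logits_ground):
    """Uncertainty-weighted decision-level fusion: the view with lower
    decision risk contributes more to the final prediction."""
    b1, u1 = evidential_uncertainty(logits_air)
    b2, u2 = evidential_uncertainty(logits_ground)
    w1, w2 = 1.0 - u1, 1.0 - u2
    fused = (w1 * b1 + w2 * b2) / (w1 + w2)
    return fused.argmax(dim=1)
```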
Audio-visual scene understanding is a challenging problem due to the unstructured spatial-temporal relations that exist in the audio signals and the spatial layouts of different objects and various texture patterns in the visual images. Recently, many studies have focused on abstracting features from convolutional neural networks, while the learning of explicit semantically relevant frames of sound signals and visual images has been overlooked. To this end, we present an end-to-end framework, namely the attentional graph convolutional network (AGCN), for structure-aware audio-visual scene representation. First, the spectrogram of the sound and the input image are processed by a backbone network for feature extraction. Then, to build multi-scale hierarchical information of the input features, we utilize an attention fusion mechanism to aggregate features from multiple layers of the backbone network. Notably, to well represent the salient regions and contextual information of the audio-visual inputs, the salient acoustic graph (SAG) and contextual acoustic graph (CAG), as well as the salient visual graph (SVG) and contextual visual graph (CVG), are constructed for the audio-visual scene representation. Finally, the constructed graphs pass through a graph convolutional network for structure-aware audio-visual scene recognition. Extensive experimental results on audio, visual and audio-visual scene recognition datasets show that promising results have been achieved by the AGCN methods. Visualizations of the graphs on spectrograms and images are presented to show the effectiveness of the proposed CAG/SAG and CVG/SVG, which can focus on the salient and semantically relevant regions.
https://arxiv.org/abs/2301.00145
Recent advances in artificial intelligence (AI) have significantly intensified research in the geoscience and remote sensing (RS) field. AI algorithms, especially deep learning-based ones, have been developed and applied widely to RS data analysis. The successful application of AI covers almost all aspects of Earth observation (EO) missions, from low-level vision tasks like super-resolution, denoising, and inpainting, to high-level vision tasks like scene classification, object detection, and semantic segmentation. While AI techniques enable researchers to observe and understand the Earth more accurately, the vulnerability and uncertainty of AI models deserve further attention, considering that many geoscience and RS tasks are highly safety-critical. This paper reviews the current development of AI security in the geoscience and RS field, covering the following five important aspects: adversarial attack, backdoor attack, federated learning, uncertainty, and explainability. Moreover, the potential opportunities and trends are discussed to provide insights for future research. To the best of the authors' knowledge, this paper is the first attempt to provide a systematic review of AI security-related research in the geoscience and RS community. Available code and datasets are also listed in the paper to move this vibrant field of research forward.
https://arxiv.org/abs/2212.09360
Reliance on vast annotations to achieve leading performance severely restricts the practicality of large-scale point cloud semantic segmentation. For the purpose of reducing data annotation costs, effective labeling schemes are developed and contribute to attaining competitive results under a weak supervision strategy. Revisiting current weak label forms, we introduce One Class One Click (OCOC), a low cost yet informative quasi scene-level label which encapsulates point-level and scene-level annotations. An active weakly supervised framework is proposed to leverage scarce labels by involving weak supervision from global and local perspectives. Contextual constraints are imposed by an auxiliary scene classification task, based respectively on global feature embedding and point-wise prediction aggregation, which restricts the model prediction merely to OCOC labels. Furthermore, we design a context-aware pseudo labeling strategy, which effectively supplements the point-level supervisory signals. Finally, an active learning scheme with an uncertainty measure, temporal output discrepancy, is integrated to examine informative samples and provide guidance on sub-cloud queries, which is conducive to quickly attaining desirable OCOC annotations and reduces the labeling cost to an extremely low extent. Extensive experimental analysis using three LiDAR benchmarks collected from airborne, mobile and ground platforms demonstrates that our proposed method achieves very promising results despite the scarce labels. It considerably outperforms genuine scene-level weakly supervised methods by up to 25% in terms of average F1 score and achieves competitive results against full supervision schemes. On the terrestrial LiDAR dataset Semantic3D, using approximately 0.02% of the labels, our method achieves an average F1 score of 85.2%, an increase of 11.58% over the baseline model.
https://arxiv.org/abs/2211.12657
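The uncertainty measure named above, temporal output discrepancy, can be read as the change in a sample's predicted distribution between two training time steps; a minimal sketch under that reading (the distance metric is an assumption):

```python
import torch

def temporal_output_discrepancy(logits_t, logits_prev):
    """Uncertainty as the distance between a point's predictions at two
    training time steps; large discrepancy marks informative sub-clouds
    to query for OCOC annotation."""
    p_t = torch.softmax(logits_t, dim=-1)
    p_prev = torch.softmax(logits_prev, dim=-1)
    return (p_t - p_prev).norm(dim=-1)    # per-point discrepancy score
```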
Pattern recognition from audio signals is an active research topic encompassing audio tagging, acoustic scene classification, music classification, and other areas. Spectrograms and mel-frequency cepstral coefficients (MFCC) are among the most commonly used features for audio signal analysis and classification. Recently, deep convolutional neural networks (CNNs) have been successfully used for audio classification problems using spectrogram-based 2D features. In this paper, we present SpectNet, an integrated front-end layer that extracts spectrogram features within a CNN architecture and can be used for audio pattern recognition tasks. The front-end layer utilizes learnable gammatone filters that are initialized using mel-scale filters. The proposed layer outputs a 2D spectrogram image which can be fed into a 2D CNN for classification. The parameters of the entire network, including the front-end filterbank, can be updated via back-propagation. This training scheme allows for fine-tuning the spectrogram-image features according to the target audio dataset. The proposed method is evaluated on two different audio signal classification tasks: heart sound anomaly detection and acoustic scene classification. It shows a significant 1.02% improvement in MACC for the heart sound classification task and a 2.11% improvement in accuracy for the acoustic scene classification task compared to classical spectrogram image features. The source code of our experiments can be found at this https URL.
https://arxiv.org/abs/2211.09352
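One possible reading of the front-end layer above: a Conv1d filterbank whose kernels are initialized as gammatone impulse responses with mel-spaced center frequencies, then updated by back-propagation like any other layer. The ERB bandwidths, normalization, and log compression below are illustrative choices, not the paper's exact design:

```python
import numpy as np
import torch
import torch.nn as nn

def gammatone_kernels(n_filters, kernel_size, sr, fmin=50.0, order=4):
    """Gammatone impulse responses with center frequencies spaced on the
    mel scale; used to initialize a learnable Conv1d filterbank."""
    mel = np.linspace(2595 * np.log10(1 + fmin / 700),
                      2595 * np.log10(1 + (sr / 2) / 700), n_filters)
    fc = 700 * (10 ** (mel / 2595) - 1)                 # mel -> Hz
    t = np.arange(kernel_size) / sr
    erb = 24.7 * (4.37 * fc / 1000 + 1)                 # Glasberg-Moore bandwidths
    kernels = (t ** (order - 1)
               * np.exp(-2 * np.pi * 1.019 * erb[:, None] * t)
               * np.cos(2 * np.pi * fc[:, None] * t))
    kernels /= np.abs(kernels).max(axis=1, keepdims=True)
    return torch.tensor(kernels, dtype=torch.float32).unsqueeze(1)

class LearnableFrontEnd(nn.Module):
    """Conv1d filterbank initialized with gammatone filters; its weights are
    updated by back-propagation together with the rest of the network."""
    def __init__(self, n_filters=64, kernel_size=512, sr=16000, hop=256):
        super().__init__()
        self.fb = nn.Conv1d(1, n_filters, kernel_size, stride=hop, bias=False)
        with torch.no_grad():
            self.fb.weight.copy_(gammatone_kernels(n_filters, kernel_size, sr))

    def forward(self, wav):                      # wav: (batch, 1, samples)
        return torch.log1p(self.fb(wav).abs())   # 2D spectrogram-like image
```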
Recent years have witnessed the great success of deep learning algorithms in the geoscience and remote sensing realm. Nevertheless, the security and robustness of deep learning models deserve special attention when addressing safety-critical remote sensing tasks. In this paper, we provide a systematic analysis of backdoor attacks for remote sensing data, where both scene classification and semantic segmentation tasks are considered. While most of the existing backdoor attack algorithms rely on visible triggers like squared patches with well-designed patterns, we propose a novel wavelet transform-based attack (WABA) method, which can achieve invisible attacks by injecting the trigger image into the poisoned image in the low-frequency domain. In this way, the high-frequency information in the trigger image can be filtered out in the attack, resulting in stealthy data poisoning. Despite its simplicity, the proposed method can significantly cheat the current state-of-the-art deep learning models with a high attack success rate. We further analyze how different trigger images and the hyper-parameters in the wavelet transform would influence the performance of the proposed method. Extensive experiments on four benchmark remote sensing datasets demonstrate the effectiveness of the proposed method for both scene classification and semantic segmentation tasks and thus highlight the importance of designing advanced backdoor defense algorithms to address this threat in remote sensing scenarios. The code will be available online at this https URL.
https://arxiv.org/abs/2211.08044
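The low-frequency injection admits a compact sketch with a 2D discrete wavelet transform: decompose both images, blend only the coarsest approximation coefficients, and reconstruct. The blending weight `alpha`, the wavelet, and the decomposition level are assumptions, not the paper's reported settings:

```python
import numpy as np
import pywt

def waba_poison(image, trigger, alpha=0.1, wavelet="haar", level=2):
    """Blend the trigger into the low-frequency wavelet band of the image:
    only the approximation coefficients are mixed, so high-frequency trigger
    detail is filtered out and the poisoned image looks clean."""
    out = np.empty_like(image, dtype=np.float64)
    for c in range(image.shape[2]):                       # per channel
        coeffs_img = pywt.wavedec2(image[..., c], wavelet, level=level)
        coeffs_trg = pywt.wavedec2(trigger[..., c], wavelet, level=level)
        # mix only the coarsest approximation sub-band
        coeffs_img[0] = (1 - alpha) * coeffs_img[0] + alpha * coeffs_trg[0]
        rec = pywt.waverec2(coeffs_img, wavelet)
        out[..., c] = rec[: image.shape[0], : image.shape[1]]
    return np.clip(out, 0, 255).astype(image.dtype)
```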
Road-safety inspection is an indispensable instrument for reducing road-accident fatalities attributed to road infrastructure. Recent work formalizes road-safety assessment in terms of carefully selected risk factors that are also known as road-safety attributes. In current practice, these attributes are manually annotated in geo-referenced monocular video for each road segment. We propose to reduce dependency on tedious human labor by automating recognition with a two-stage neural architecture. The first stage predicts more than forty road-safety attributes by observing a local spatio-temporal context. Our design leverages an efficient convolutional pipeline, which benefits from pre-training on semantic segmentation of street scenes. The second stage enhances predictions through sequential integration across a larger temporal window. Our design leverages per-attribute instances of a lightweight bidirectional LSTM architecture. Both stages alleviate extreme class imbalance by incorporating a multi-task variant of recall-based dynamic loss weighting. We perform experiments on the iRAP-BH dataset, which involves fully labeled geo-referenced video along 2,300 km of public roads in Bosnia and Herzegovina. We also validate our approach by comparing it with related work on two road-scene classification datasets from the literature: Honda Scenes and FM3m. Experimental evaluation confirms the value of our contributions on all three datasets.
https://arxiv.org/abs/2211.04165
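The recall-based dynamic loss weighting mentioned above can be pictured as a cross-entropy whose class weights grow as a class's running recall drops; a minimal sketch under that reading (the `eps` floor and the normalization are illustrative assumptions):

```python
import torch.nn.functional as F

def recall_weighted_ce(logits, targets, class_recall, eps=0.05):
    """Cross-entropy whose per-class weights grow as running recall drops,
    so poorly recognized attribute classes receive larger gradients.
    `class_recall` is a tensor of per-class recall estimates in [0, 1]."""
    weights = (1.0 - class_recall) + eps          # low recall -> high weight
    weights = weights / weights.sum() * weights.numel()
    return F.cross_entropy(logits, targets, weight=weights)
```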
This paper describes a pipeline for collecting acoustic scene data by using crowdsourcing. The detailed process of crowdsourcing is explained, including planning, validation criteria, and actual user interfaces. As a result of data collection, we present CochlScene, a novel dataset for acoustic scene classification. Our dataset consists of 76k samples collected from 831 participants in 13 acoustic scenes. We also propose a manual data split of training, validation, and test sets to increase the reliability of the evaluation results. Finally, we provide a baseline system for future research.
https://arxiv.org/abs/2211.02289
MobileNet is widely used for Acoustic Scene Classification (ASC) in embedded systems. Existing works reduce the complexity of ASC algorithms by pruning some components, e.g. pruning channels in the convolutional layers. In practice, the maximum proportion of channels that can be pruned, defined as the Ratio of Prunable Channels ($R_\textit{PC}$), is often decided empirically. This paper proposes a method that determines $R_\textit{PC}$ using simple linear regression models related to the Sparsity of Channels ($S_C$) in the convolutional layers. In the experiments, $R_\textit{PC}$ is measured by removing inactive channels until reaching a knee point in the performance decrease. Simple methods for calculating the $S_C$ of trained models and the resulting $R_\textit{PC}$ are proposed. The experimental results demonstrate that 1) the choice of $R_\textit{PC}$ is linearly dependent on $S_C$, and the hyper-parameters have little impact on this relationship; 2) MobileNet shows high sensitivity and stability under the proposed method.
https://arxiv.org/abs/2210.15960
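The central relationship above is simple enough to sketch: measure a channel-sparsity statistic per layer, then fit a linear model mapping it to the empirically found knee-point prunable ratio. The sparsity definition below (fraction of near-zero batch-norm scale factors) and the toy numbers are illustrative assumptions, not the paper's exact measure:

```python
import numpy as np

def channel_sparsity(bn_gammas, thresh=1e-2):
    """Sparsity of channels S_C for one convolutional layer, here taken as
    the fraction of near-zero batch-norm scale factors (one plausible
    definition; the paper's exact measure may differ)."""
    g = np.abs(bn_gammas)
    return (g < thresh * g.max()).mean()

def fit_rpc_model(sparsities, knee_ratios):
    """Least-squares fit R_PC = a * S_C + b from layers whose knee-point
    prunable ratios were measured empirically."""
    a, b = np.polyfit(sparsities, knee_ratios, deg=1)
    return lambda s_c: float(np.clip(a * s_c + b, 0.0, 1.0))

# hypothetical toy numbers, just to show the calling convention
predict_rpc = fit_rpc_model(np.array([0.1, 0.3, 0.5]),
                            np.array([0.15, 0.35, 0.55]))
```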
Most existing deep learning-based acoustic scene classification (ASC) approaches directly utilize representations extracted from spectrograms to identify target scenes. However, these approaches pay little attention to the audio events occurring in the scene, even though they provide crucial semantic information. This paper conducts the first study to investigate whether real-life acoustic scenes can be reliably recognized based only on features that describe a limited number of audio events. To model the task-specific relationships between coarse-grained acoustic scenes and fine-grained audio events, we propose an event relational graph representation learning (ERGL) framework for ASC. Specifically, ERGL learns a graph representation of an acoustic scene from the input audio, where the embedding of each event is treated as a node, while the relationship cues derived from each pair of event embeddings are described by a learned multi-dimensional edge feature. Experiments on a polyphonic acoustic scene dataset show that the proposed ERGL achieves competitive performance on ASC by using only a limited number of audio event embeddings, without any data augmentation. The validity of the proposed ERGL framework proves the feasibility of recognizing diverse acoustic scenes based on the event relational graph. Our code is available on the project homepage (this https URL).
https://arxiv.org/abs/2210.15366
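A simplified reading of the graph construction above: treat each audio-event embedding as a node, and let a small MLP map every ordered pair of node embeddings to a learned multi-dimensional edge feature. The sketch omits the downstream graph network and classifier, and the module names are illustrative:

```python
import torch
import torch.nn as nn

class EventRelationalGraph(nn.Module):
    """Build a graph over audio-event embeddings: each event embedding is a
    node; a learned MLP maps every ordered pair of node embeddings to a
    multi-dimensional edge feature."""
    def __init__(self, dim, edge_dim):
        super().__init__()
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * dim, edge_dim), nn.ReLU(),
            nn.Linear(edge_dim, edge_dim))

    def forward(self, nodes):                  # nodes: (batch, n_events, dim)
        b, n, d = nodes.shape
        src = nodes.unsqueeze(2).expand(b, n, n, d)   # sender of each edge
        dst = nodes.unsqueeze(1).expand(b, n, n, d)   # receiver of each edge
        # (batch, n_events, n_events, edge_dim) edge-feature tensor
        return self.edge_mlp(torch.cat([src, dst], dim=-1))
```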
Convolutional neural networks (CNNs) have shown great success in various applications. However, the computational complexity and memory storage of CNNs are a bottleneck for their deployment on resource-constrained devices. Recent efforts towards reducing the computation cost and memory overhead of CNNs involve similarity-based passive filter pruning methods, which compute a pairwise similarity matrix for the filters and eliminate a few similar filters to obtain a small pruned CNN. However, the computational complexity of computing the pairwise similarity matrix is high, particularly when a convolutional layer has many filters. To reduce this cost, we propose an efficient method in which the complete pairwise similarity matrix is approximated from only a few of its columns using a Nyström approximation. The proposed efficient similarity-based passive filter pruning method is 3 times faster and gives the same accuracy at the same reduction in computations, compared to the similarity-based pruning method that computes the complete pairwise similarity matrix. Apart from this, the proposed efficient similarity-based pruning method performs similarly to or better than existing norm-based pruning methods. The efficacy of the proposed pruning method is evaluated on CNNs such as the DCASE 2021 Task 1A baseline network and a VGGish network designed for acoustic scene classification.
https://arxiv.org/abs/2210.17416
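The Nyström step itself is a few lines: sample m landmark filters, compute only their similarity columns, and reconstruct the full matrix as C @ pinv(W_mm) @ C.T. A minimal sketch with cosine similarity (the paper's similarity measure and sampling scheme may differ):

```python
import numpy as np

def nystroem_similarity(filters, m, rng=None):
    """Approximate the full pairwise cosine-similarity matrix of `filters`
    (n x d flattened filter weights) from only m sampled columns."""
    rng = rng or np.random.default_rng(0)
    f = filters / np.linalg.norm(filters, axis=1, keepdims=True)
    idx = rng.choice(len(f), size=m, replace=False)
    C = f @ f[idx].T                  # n x m: similarities to the m landmarks
    W_mm = C[idx]                     # m x m block between the landmarks
    return C @ np.linalg.pinv(W_mm) @ C.T
```

The most similar filter pairs are then read off the approximate matrix, and redundant filters are removed as in the full-matrix method.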
In this technical report, the systems we submitted for subtask 1B of the DCASE 2021 challenge, regarding audiovisual scene classification, are described in detail. They are essentially multi-source transformers employing a combination of auditory and visual features to make predictions. These models are evaluated utilizing the macro-averaged multi-class cross-entropy and accuracy metrics. In terms of the macro-averaged multi-class cross-entropy, our best model achieved a score of 0.620 on the validation data. This is slightly better than the performance of the baseline system (0.658). With regard to the accuracy measure, our best model achieved a score of 77.1% on the validation data, which is about the same as the performance obtained by the baseline system (77.0%).
https://arxiv.org/abs/2210.10212