Vision Transformer (ViT) models have recently emerged as powerful and versatile models for various visual tasks. Recently, a work called PMF achieved promising results in few-shot image classification by utilizing pre-trained vision transformer models. However, PMF employs full fine-tuning for learning downstream tasks, leading to significant overfitting and storage issues, especially in the remote sensing domain. To tackle these issues, we turn to recently proposed parameter-efficient tuning methods, such as VPT, which update only the newly added prompt parameters while keeping the pre-trained backbone frozen. Inspired by VPT, we propose the Meta Visual Prompt Tuning (MVP) method. Specifically, we integrate VPT into the meta-learning framework and tailor it to the remote sensing domain, resulting in an efficient framework for Few-Shot Remote Sensing Scene Classification (FS-RSSC). Furthermore, we introduce a novel data augmentation strategy based on patch-embedding recombination to enhance the representation and diversity of scenes. Experimental results on the FS-RSSC benchmark demonstrate the superior performance of MVP over existing methods in various settings, including various-way-various-shot, various-way-one-shot, and cross-domain adaptation.
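Below is a minimal, illustrative sketch of the VPT-style tuning that MVP builds on: the pre-trained backbone is frozen and only the newly added prompt tokens (plus a light head) are updated. It is not the authors' code; `vit_backbone` is assumed to be a stack of transformer blocks operating on a token sequence (patch embedding is done outside), and names such as `num_prompts` are illustrative.

```python
import torch
import torch.nn as nn

class PromptTunedViT(nn.Module):
    """Sketch of VPT-style tuning: the ViT backbone stays frozen and only the
    prepended prompt tokens (and a light classifier head) are trained."""

    def __init__(self, vit_backbone, embed_dim=768, num_prompts=10, num_classes=45):
        super().__init__()
        self.backbone = vit_backbone
        for p in self.backbone.parameters():          # freeze pre-trained weights
            p.requires_grad = False
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, embed_dim) * 0.02)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_tokens):                  # patch_tokens: (B, N, D) patch embeddings
        b = patch_tokens.size(0)
        prompts = self.prompts.expand(b, -1, -1)
        tokens = torch.cat([prompts, patch_tokens], dim=1)   # prepend learnable prompts
        feats = self.backbone(tokens)                 # frozen transformer blocks
        return self.head(feats.mean(dim=1))           # pool and classify
```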
https://arxiv.org/abs/2309.09276
Recent studies focus on developing efficient systems for acoustic scene classification (ASC) using convolutional neural networks (CNNs), which typically consist of consecutive kernels. This paper highlights the benefits of using separate kernels as a more powerful and efficient design approach for ASC tasks. Inspired by the time-frequency nature of audio signals, we propose TF-SepNet, a CNN architecture that separates feature processing along the time and frequency dimensions. Features resulting from the separate paths are then merged by channel and forwarded directly to the classifier. Instead of conventional two-dimensional (2D) kernels, TF-SepNet incorporates one-dimensional (1D) kernels to reduce computational cost. Experiments were conducted on the TAU Urban Acoustic Scene 2022 Mobile development dataset. The results show that TF-SepNet outperforms similar state-of-the-art models that use consecutive kernels. A further investigation reveals that the separate kernels lead to a larger effective receptive field (ERF), which enables TF-SepNet to capture more time-frequency features.
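A minimal sketch of the separate-kernel idea described above (not the released TF-SepNet code): one path convolves only along frequency, the other only along time, and the two paths are merged along the channel axis. Layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class TFSepBlock(nn.Module):
    """Sketch: process a (B, C, F, T) spectrogram with separate frequency-wise
    and time-wise 1D kernels, then merge the two paths along the channel axis."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        half = out_ch // 2
        self.freq_path = nn.Sequential(               # kernel spans frequency only
            nn.Conv2d(in_ch, half, kernel_size=(k, 1), padding=(k // 2, 0)),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True))
        self.time_path = nn.Sequential(               # kernel spans time only
            nn.Conv2d(in_ch, half, kernel_size=(1, k), padding=(0, k // 2)),
            nn.BatchNorm2d(half), nn.ReLU(inplace=True))

    def forward(self, x):
        return torch.cat([self.freq_path(x), self.time_path(x)], dim=1)

block = TFSepBlock(in_ch=1, out_ch=32)
out = block(torch.randn(4, 1, 128, 64))               # (batch, channels, freq, time)
print(out.shape)                                      # torch.Size([4, 32, 128, 64])
```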
https://arxiv.org/abs/2309.08200
The increasing availability of multi-sensor data sparks interest in multimodal self-supervised learning. However, most existing approaches learn only common representations across modalities while ignoring intra-modal training and modality-unique representations. We propose Decoupling Common and Unique Representations (DeCUR), a simple yet effective method for multimodal self-supervised learning. By distinguishing inter- and intra-modal embeddings, DeCUR is trained to integrate complementary information across different modalities. We evaluate DeCUR in three common multimodal scenarios (radar-optical, RGB-elevation, and RGB-depth), and demonstrate its consistent benefits on scene classification and semantic segmentation downstream tasks. Notably, we obtain straightforward improvements by transferring our pretrained backbones to state-of-the-art supervised multimodal methods without any hyperparameter tuning. Furthermore, we conduct a comprehensive explainability analysis to shed light on the interpretation of common and unique features in our multimodal approach. Code is available at \url{this https URL}.
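The following is one illustrative reading of the common/unique decoupling, not the published DeCUR loss: a cross-correlation-style objective is used here purely as an example, and the intra-modal terms the method also relies on are omitted for brevity.

```python
import torch

def decur_style_losses(z1, z2, common_dim):
    """Sketch: split each modality's embedding into a 'common' part, aligned
    across modalities, and a 'unique' part whose cross-modal correlation is
    suppressed. z1, z2: (B, D) batch-normalized embeddings from two modalities."""
    c1, u1 = z1[:, :common_dim], z1[:, common_dim:]
    c2, u2 = z2[:, :common_dim], z2[:, common_dim:]
    n = z1.size(0)

    cc = (c1.T @ c2) / n                              # cross-correlation of common parts
    align = ((torch.diagonal(cc) - 1) ** 2).sum()     # pull the diagonal toward 1
    off = (cc - torch.diag(torch.diagonal(cc))) ** 2  # push off-diagonal terms toward 0
    common_loss = align + 0.005 * off.sum()

    cu = (u1.T @ u2) / n                              # unique parts: no diagonal target,
    unique_loss = (cu ** 2).sum()                     # everything pushed toward zero
    return common_loss, unique_loss
```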
https://arxiv.org/abs/2309.05300
Deep learning models have a risk of utilizing spurious clues to make predictions, such as recognizing actions based on the background scene. This issue can severely degrade the open-set action recognition performance when the testing samples have different scene distributions from the training samples. To mitigate this problem, we propose a novel method, called Scene-debiasing Open-set Action Recognition (SOAR), which features an adversarial scene reconstruction module and an adaptive adversarial scene classification module. The former prevents the decoder from reconstructing the video background given video features, and thus helps reduce the background information in feature learning. The latter aims to confuse scene type classification given video features, with a specific emphasis on the action foreground, and helps to learn scene-invariant information. In addition, we design an experiment to quantify the scene bias. The results indicate that the current open-set action recognizers are biased toward the scene, and our proposed SOAR method better mitigates such bias. Furthermore, our extensive experiments demonstrate that our method outperforms state-of-the-art methods, and the ablation studies confirm the effectiveness of our proposed modules.
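The adversarial modules can be realized with a gradient reversal layer, as sketched below; this is a common way to implement such objectives and only illustrates SOAR's scene-debiasing idea, not the authors' exact formulation of the reconstruction and adaptive classification modules.

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; reverses (and scales) gradients in the backward
    pass, so the feature extractor is trained to confuse the auxiliary scene head."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage sketch inside a training step (scene_head is an auxiliary scene classifier):
#   scene_logits = scene_head(grad_reverse(video_features))
#   loss = action_loss + scene_ce(scene_logits, scene_labels)
```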
https://arxiv.org/abs/2309.01265
We tackle the problem of class incremental learning (CIL) for land-cover classification from optical remote sensing (RS) images. The CIL paradigm has recently gained much prominence given that data for real-world phenomena are generally obtained in a sequential manner. However, CIL has not yet been extensively considered in the RS domain, despite the fact that satellites continually discover new classes at different geographical locations over time. With this motivation, we propose a novel CIL framework inspired by the recent success of replay-memory-based approaches that tackles two of their shortcomings. To reduce catastrophic forgetting of the old classes when a new stream arrives, we learn a curriculum over the new classes based on their similarity to the old classes, which is found to limit the degree of forgetting substantially. Next, while constructing the replay memory, instead of randomly selecting samples from the old streams, we propose a sample selection strategy that ensures the selection of highly confident samples so as to reduce the effects of noise. We observe a sharp improvement in CIL performance with the proposed components. Experimental results on the benchmark NWPU-RESISC45, PatternNet, and EuroSAT datasets confirm that our method offers a better stability-plasticity trade-off than the literature.
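A sketch of the confidence-based exemplar selection (the second component) is given below; it is illustrative rather than the authors' implementation. The similarity-based curriculum over new classes would be applied separately when ordering the incoming stream.

```python
import torch

@torch.no_grad()
def select_replay_exemplars(model, loader, per_class, device="cpu"):
    """Sketch: keep the most confidently (and correctly) predicted old-class
    samples as replay exemplars, instead of sampling them at random."""
    model.eval()
    scored = {}                                       # class_id -> list of (confidence, sample)
    for x, y in loader:
        probs = torch.softmax(model(x.to(device)), dim=1).cpu()
        conf, pred = probs.max(dim=1)
        for xi, yi, ci, pi in zip(x, y, conf, pred):
            if pi == yi:                              # only correctly classified samples
                scored.setdefault(int(yi), []).append((float(ci), xi))
    memory = {}
    for cls, items in scored.items():
        items.sort(key=lambda t: t[0], reverse=True)  # highest confidence first
        memory[cls] = [s for _, s in items[:per_class]]
    return memory
```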
https://arxiv.org/abs/2309.01050
Deep neural networks (DNNs) have achieved tremendous success in many remote sensing (RS) applications. However, their vulnerability to adversarial perturbations should not be neglected. Unfortunately, current adversarial defense approaches in RS studies usually suffer from performance fluctuation and unnecessary re-training costs because they require prior knowledge of the adversarial perturbations in the RS data. To circumvent these challenges, we propose a universal adversarial defense approach for RS imagery (UAD-RS) that uses pre-trained diffusion models to defend common DNNs against multiple unknown adversarial attacks. Specifically, generative diffusion models are first pre-trained on different RS datasets to learn generalized representations in various data domains. After that, a universal adversarial purification framework is developed that uses the forward and reverse processes of the pre-trained diffusion models to purify the perturbations from adversarial samples. Furthermore, an adaptive noise level selection (ANLS) mechanism is built to find the noise level of the diffusion model that yields purification results closest to the clean samples, as measured by the Frechet Inception Distance (FID) in deep feature space. As a result, only a single pre-trained diffusion model is needed for the universal purification of adversarial samples on each dataset, which significantly alleviates re-training efforts for each attack setting and maintains high performance without prior knowledge of the adversarial perturbations. Experiments on four heterogeneous RS datasets covering scene classification and semantic segmentation verify that UAD-RS outperforms state-of-the-art adversarial purification approaches while providing a universal defense against seven common adversarial perturbations.
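A schematic of the ANLS loop is sketched below; `purify` and `compute_fid` are hypothetical placeholder helpers standing in for the diffusion forward/reverse pass and the FID computation, not a library API.

```python
def adaptive_noise_level_selection(diffusion, adv_images, clean_reference_stats,
                                   candidate_levels, purify, compute_fid):
    """Sketch of ANLS: try several diffusion noise levels, purify the adversarial
    batch at each level, and keep the level whose output is closest to clean data
    in deep feature space (lower FID is better).
    `purify` and `compute_fid` are hypothetical helpers, not a library API."""
    best_level, best_fid, best_output = None, float("inf"), None
    for t in candidate_levels:
        purified = purify(diffusion, adv_images, noise_level=t)   # forward + reverse pass
        fid = compute_fid(purified, clean_reference_stats)
        if fid < best_fid:
            best_level, best_fid, best_output = t, fid, purified
    return best_level, best_output
```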
https://arxiv.org/abs/2307.16865
Visual saliency refers to the innate human mechanism of focusing on and extracting important features from the observed environment. Recently, there has been a notable surge of interest in automotive research regarding the estimation of visual saliency. While operating a vehicle, drivers naturally direct their attention towards specific objects, employing brain-driven saliency mechanisms that prioritize certain elements over others. In this investigation, we present an intelligent system that combines a drowsiness detection system for drivers with a saliency-based scene comprehension pipeline. To achieve this, we implemented a specialized 3D deep network for semantic segmentation, which has been pretrained and tailored to process the frames captured by an automotive-grade external camera. The proposed pipeline is hosted on an embedded platform based on the STA1295 core, featuring ARM A7 dual cores and a hardware accelerator. Additionally, we employ an innovative biosensor embedded in the car steering wheel to monitor driver drowsiness by gathering the driver's PhotoPlethysmoGraphy (PPG) signal. A dedicated 1D temporal deep convolutional network has been devised to classify the collected PPG time series, enabling us to assess the driver's level of attentiveness. Ultimately, we compare the driver's estimated attention level with the corresponding saliency-based scene classification to evaluate the overall safety level. The efficacy of the proposed pipeline has been validated through extensive experimental results.
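A minimal sketch of the kind of 1D temporal CNN described for PPG classification is shown below; the layer configuration and class count are illustrative assumptions, not the deployed network.

```python
import torch
import torch.nn as nn

class PPGDrowsinessNet(nn.Module):
    """Sketch of a small 1D temporal CNN that maps a PPG window
    (B, 1, num_samples) to an attentiveness / drowsiness class."""

    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).squeeze(-1))

logits = PPGDrowsinessNet()(torch.randn(8, 1, 1024))   # 8 PPG windows of 1024 samples
```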
https://arxiv.org/abs/2308.03770
Scene recognition based on deep-learning has made significant progress, but there are still limitations in its performance due to challenges posed by inter-class similarities and intra-class dissimilarities. Furthermore, prior research has primarily focused on improving classification accuracy, yet it has given less attention to achieving interpretable, precise scene classification. Therefore, we are motivated to propose EnTri, an ensemble scene recognition framework that employs ensemble learning using a hierarchy of visual features. EnTri represents features at three distinct levels of detail: pixel-level, semantic segmentation-level, and object class and frequency level. By incorporating distinct feature encoding schemes of differing complexity and leveraging ensemble strategies, our approach aims to improve classification accuracy while enhancing transparency and interpretability via visual and textual explanations. To achieve interpretability, we devised an extension algorithm that generates both visual and textual explanations highlighting various properties of a given scene that contribute to the final prediction of its category. This includes information about objects, statistics, spatial layout, and textural details. Through experiments on benchmark scene classification datasets, EnTri has demonstrated superiority in terms of recognition accuracy, achieving competitive performance compared to state-of-the-art approaches, with an accuracy of 87.69%, 75.56%, and 99.17% on the MIT67, SUN397, and UIUC8 datasets, respectively.
https://arxiv.org/abs/2307.12442
The field of Explainable Artificial Intelligence (XAI) aims to improve the interpretability of black-box machine learning models. Building a heatmap from the importance values of input features is a popular method for explaining how such models produce their predictions. Heatmaps are generally understandable to humans, yet they are not without flaws. Non-expert users, for example, may not fully understand the logic of heatmaps (the logic by which pixels relevant to the model's prediction are highlighted with different intensities or colors). Additionally, objects and regions of the input image that are relevant to the model's prediction are frequently not entirely differentiated by heatmaps. In this paper, we propose a framework called TbExplain that employs XAI techniques and a pre-trained object detector to present text-based explanations of scene classification models. Moreover, TbExplain incorporates a novel method to correct predictions and explain them textually, based on the statistics of objects in the input image, when the initial prediction is unreliable. To assess the trustworthiness and validity of the text-based explanations, we conducted a qualitative experiment, and the findings indicated that these explanations are sufficiently reliable. Furthermore, our quantitative and qualitative experiments on TbExplain with scene classification datasets reveal an improvement in classification accuracy over ResNet variants.
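One plausible reading of the correction-and-explanation step is sketched below; `class_object_priors`, the matching rule, and the explanation template are illustrative assumptions, not the TbExplain implementation.

```python
def tbexplain_style_decision(scene_probs, detections, class_object_priors, conf_threshold=0.5):
    """Sketch: if the scene classifier is not confident, re-score classes by how well
    the detected objects match per-class object statistics, and build a short textual
    explanation from those objects.
    scene_probs: {class: probability}; detections: {object_name: count};
    class_object_priors: {class: {object_name: expected frequency}} (hypothetical)."""
    top_class = max(scene_probs, key=scene_probs.get)
    if scene_probs[top_class] < conf_threshold:       # initial prediction unreliable
        def match(cls):
            prior = class_object_priors.get(cls, {})
            return sum(prior.get(obj, 0.0) * count for obj, count in detections.items())
        top_class = max(class_object_priors, key=match)
    objects = ", ".join(f"{count} x {obj}" for obj, count in detections.items())
    explanation = f"Predicted '{top_class}' because the image contains {objects}."
    return top_class, explanation
```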
https://arxiv.org/abs/2307.10003
Self-supervised learning (SSL) has emerged as a promising approach for remote sensing image classification due to its ability to leverage large amounts of unlabeled data. In contrast to traditional supervised learning, SSL aims to learn representations of data without the need for explicit labels. This is achieved by formulating auxiliary tasks that can be used to create pseudo-labels for the unlabeled data and learn pre-trained models. The pre-trained models can then be fine-tuned on downstream tasks such as remote sensing image scene classification. This paper analyzes the effectiveness of SSL pre-training on Million AID, a large unlabeled remote sensing dataset, with various remote sensing image scene classification datasets as downstream tasks. More specifically, we evaluate the effectiveness of SSL pre-training using the iBOT framework coupled with Vision Transformers (ViT), in contrast to supervised pre-training of ViT on the ImageNet dataset. The comprehensive experimental work across 14 datasets with diverse properties reveals that in-domain SSL leads to improved predictive performance of models compared to their supervised counterparts.
https://arxiv.org/abs/2307.01645
Varying conditions between the data seen at training and at application time remain a major challenge for machine learning. We study this problem in the context of Acoustic Scene Classification (ASC) with mismatching recording devices. Previous works successfully employed frequency-wise normalization of inputs and hidden layer activations in convolutional neural networks to reduce the recording device discrepancy. The main objective of this work was to adopt frequency-wise normalization for Audio Spectrogram Transformers (ASTs), which have recently become the dominant model architecture in ASC. To this end, we first investigate how recording device characteristics are encoded in the hidden layer activations of ASTs. We find that recording device information is initially encoded in the frequency dimension; however, after the first self-attention block, it is largely transformed into the token dimension. Based on this observation, we conjecture that suppressing recording device characteristics in the input spectrogram is the most effective. We propose a frequency-centering operation for spectrograms that improves the ASC performance on unseen recording devices on average by up to 18.2 percentage points.
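The frequency-centering operation can be sketched as below; this assumes centering means subtracting the per-frequency-bin mean over time of each spectrogram, which may differ in detail from the paper's exact formulation.

```python
import torch

def frequency_center(spec):
    """Sketch of frequency-wise centering: remove the per-frequency-bin mean
    (computed over the time axis) so device-dependent spectral offsets are
    suppressed before the spectrogram enters the transformer.
    spec: (B, C, F, T) log-mel spectrogram."""
    mean = spec.mean(dim=-1, keepdim=True)            # average over the time axis
    return spec - mean

centered = frequency_center(torch.randn(4, 1, 128, 100))
```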
https://arxiv.org/abs/2306.11764
Domain shift is considered a challenge in machine learning as it causes significant degradation of model performance. In the acoustic scene classification (ASC) task, domain shift is mainly caused by different recording devices. Several studies have already targeted domain generalization to improve the performance of ASC models on unseen domains, such as new devices. Recently, the Controllable Gate Adapter (ConGater) has been proposed in natural language processing to address the problem of biased training data. ConGater allows the debiasing process to be controlled at inference time; its main advantage is the continuous and selective debiasing of a trained model during inference. In this work, we adapt ConGater to the audio spectrogram transformer for an acoustic scene classification task. We show that ConGater can be used to selectively adapt the learned representations to be invariant to device domain shifts such as recording devices. Our analysis shows that ConGater can progressively remove device information from the learned representations and improve model generalization, especially under domain shift conditions (e.g. unseen devices). We show that information removal can be extended to both the device and location domains. Finally, we demonstrate ConGater's ability to enhance the performance on specific devices without further training.
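One plausible way to realize an inference-time controllable gate is sketched below; it only illustrates the idea of a continuously adjustable debiasing strength and is not the ConGater architecture itself.

```python
import torch
import torch.nn as nn

class ControllableGateAdapter(nn.Module):
    """Sketch of the controllable-gate idea as applied here: a small adapter whose
    contribution is scaled by a gate value chosen at inference time
    (gate = 0 keeps the original model behaviour, gate = 1 applies the full
    device-invariant adaptation). Illustrative only."""

    def __init__(self, dim, hidden=64):
        super().__init__()
        self.adapter = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x, gate=0.0):
        return x + gate * self.adapter(x)             # gate is a user-set scalar in [0, 1]
```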
https://arxiv.org/abs/2306.08010
Scene analysis is essential for enabling autonomous systems, such as mobile robots, to operate in real-world environments. However, obtaining a comprehensive understanding of the scene requires solving multiple tasks, such as panoptic segmentation, instance orientation estimation, and scene classification. Solving these tasks given the limited computing and battery capabilities of mobile platforms is challenging. To address this challenge, we introduce an efficient multi-task scene analysis approach, called EMSAFormer, that uses an RGB-D Transformer-based encoder to simultaneously perform the aforementioned tasks. Our approach builds upon the previously published EMSANet. However, we show that the dual CNN-based encoder of EMSANet can be replaced with a single Transformer-based encoder. To achieve this, we investigate how information from both RGB and depth data can be effectively incorporated into a single encoder. To accelerate inference on robotic hardware, we provide a custom NVIDIA TensorRT extension enabling highly optimized inference for our EMSAFormer approach. Through extensive experiments on the commonly used indoor datasets NYUv2, SUNRGB-D, and ScanNet, we show that our approach achieves state-of-the-art performance while still enabling inference at up to 39.1 FPS on an NVIDIA Jetson AGX Orin 32 GB.
https://arxiv.org/abs/2306.05242
Zero-shot classification of image scenes, which recognizes image scenes not seen during training, holds great promise for lowering the dependence on large numbers of labeled samples. To address zero-shot image scene classification, cross-modal feature alignment methods have been proposed in recent years. These methods mainly focus on matching the visual features of each image scene with their corresponding semantic descriptors in the latent space, while paying less attention to the contrastive relationships between different image scenes and different semantic descriptors. Given the large intra-class differences and inter-class similarities among image scenes, as well as potential noisy samples, these methods are susceptible to the influence of instances that lie far from those of the same class and close to those of other classes. In this work, we propose a multi-level cross-modal feature alignment method via contrastive learning for zero-shot classification of remote sensing image scenes. While promoting the single-instance-level positive alignment between each image scene and its corresponding semantic descriptors, the proposed method takes cross-instance contrastive relationships into consideration and learns to keep the visual and semantic features of different classes apart from each other in the latent space. Extensive experiments have been conducted to evaluate the performance of the proposed method. The results show that our method outperforms state-of-the-art methods for zero-shot remote sensing image scene classification. All code and data are available on GitHub at this https URL.
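A single-level sketch of the cross-modal contrastive alignment is given below; it treats the class semantic descriptors as the contrast set and is illustrative only, not the paper's full multi-level objective.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(visual, semantic, labels, tau=0.1):
    """Sketch of cross-instance contrastive alignment: each image-scene feature is
    pulled toward the semantic descriptor of its own class and pushed away from the
    descriptors of other classes.
    visual: (B, D) image features; semantic: (C, D) class descriptors; labels: (B,)."""
    v = F.normalize(visual, dim=1)
    s = F.normalize(semantic, dim=1)
    logits = v @ s.t() / tau                          # (B, C) similarities to all descriptors
    return F.cross_entropy(logits, labels)
```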
https://arxiv.org/abs/2306.06066
Lifelong audio feature extraction involves learning new sound classes incrementally, which is essential for adapting to new data distributions over time. However, optimizing the model only on new data can lead to catastrophic forgetting of previously learned tasks, which undermines the model's ability to perform well over the long term. This paper introduces a new approach to continual audio representation learning called DeCoR. Unlike other methods that store previous data, features, or models, DeCoR indirectly distills knowledge from an earlier model to the latest by predicting quantization indices from a delayed codebook. We demonstrate that DeCoR improves acoustic scene classification accuracy and integrates well with continual self-supervised representation learning. Our approach introduces minimal storage and computation overhead, making it a lightweight and efficient solution for continual learning.
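One plausible reading of the delayed-codebook distillation is sketched below; the helper names and the exact choice of quantization targets are assumptions, not the published DeCoR procedure.

```python
import torch
import torch.nn.functional as F

def delayed_codebook_targets(features, delayed_codebook):
    """Sketch: quantize current features against a codebook kept from an earlier
    training stage and use the resulting indices as distillation targets, so no
    past data or model copy has to be stored.
    features: (B, D); delayed_codebook: (K, D)."""
    d = torch.cdist(features, delayed_codebook)       # (B, K) distances to old codewords
    return d.argmin(dim=1)                            # index of the nearest delayed codeword

def decor_style_loss(index_logits, features, delayed_codebook):
    """index_logits: (B, K) predictions from a small head on the current model."""
    targets = delayed_codebook_targets(features.detach(), delayed_codebook)
    return F.cross_entropy(index_logits, targets)     # predictor head guesses the indices
```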
https://arxiv.org/abs/2305.18441
In this technical report, a low-complexity deep learning system for acoustic scene classification (ASC) is presented. The proposed system comprises two main phases: (Phase I) training a teacher network, and (Phase II) training a student network using knowledge distilled from the teacher. In the first phase, the teacher, a large-footprint model, is trained. After training the teacher, the embeddings, i.e., the feature map of its second-to-last layer, are extracted. In the second phase, the student network, a low-complexity model, is trained with the embeddings extracted from the teacher. Our experiments on the DCASE 2023 Task 1 development dataset fulfill the low-complexity requirement and achieve a best classification accuracy of 57.4%, improving on the DCASE baseline by 14.5%.
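A minimal sketch of Phase II is shown below; the (embedding, logits) interface and the loss weighting are illustrative assumptions rather than the system's actual training code.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, x, y, alpha=0.5):
    """Sketch of Phase II: the low-complexity student is trained to match the
    teacher's penultimate-layer embedding while also minimizing the usual
    classification loss. Assumes both models return (embedding, logits)."""
    with torch.no_grad():
        t_emb, _ = teacher(x)                         # frozen large-footprint teacher
    s_emb, s_logits = student(x)
    loss_embed = F.mse_loss(s_emb, t_emb)             # match the teacher embeddings
    loss_cls = F.cross_entropy(s_logits, y)           # standard supervised loss
    return alpha * loss_embed + (1 - alpha) * loss_cls
```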
https://arxiv.org/abs/2305.09463
The ability to generalize to a wide range of recording devices is a crucial performance factor for audio classification models. The characteristics of different types of microphones introduce distributional shifts in the digitized audio signals due to their varying frequency responses. If this domain shift is not taken into account during training, the model's performance could degrade severely when it is applied to signals recorded by unseen devices. In particular, training a model on audio signals recorded with a small number of different microphones can make generalization to unseen devices difficult. To tackle this problem, we convolve audio signals in the training set with pre-recorded device impulse responses (DIRs) to artificially increase the diversity of recording devices. We systematically study the effect of DIR augmentation on the task of Acoustic Scene Classification using CNNs and Audio Spectrogram Transformers. The results show that DIR augmentation in isolation performs similarly to the state-of-the-art method Freq-MixStyle. However, we also show that DIR augmentation and Freq-MixStyle are complementary, achieving a new state-of-the-art performance on signals recorded by devices unseen during training.
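DIR augmentation itself amounts to a convolution with a recorded impulse response, as sketched below; the normalization and impulse-response selection details are illustrative assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def dir_augment(waveform, impulse_responses, rng=None):
    """Sketch of device impulse response (DIR) augmentation: convolve a training
    waveform with a randomly chosen pre-recorded device impulse response so the
    clip sounds as if it were captured by a different microphone."""
    if rng is None:
        rng = np.random.default_rng()
    ir = impulse_responses[rng.integers(len(impulse_responses))]
    augmented = fftconvolve(waveform, ir, mode="full")[: len(waveform)]
    peak = np.max(np.abs(augmented)) + 1e-9
    return augmented / peak * np.max(np.abs(waveform))   # keep a comparable signal level
```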
https://arxiv.org/abs/2305.07499
The remarkable achievements of ChatGPT and GPT-4 have sparked a wave of interest and research in the field of large language models for Artificial General Intelligence (AGI). These models provide us with intelligent solutions that are more similar to human thinking, enabling us to use general artificial intelligence to solve problems in various applications. However, in the field of remote sensing, the scientific literature on the implementation of AGI remains relatively scant. Existing AI-related research primarily focuses on visual understanding tasks while neglecting the semantic understanding of the objects and their relationships. This is where vision-language models excel, as they enable reasoning about images and their associated textual descriptions, allowing for a deeper understanding of the underlying semantics. Vision-language models can go beyond recognizing the objects in an image and can infer the relationships between them, as well as generate natural language descriptions of the image. This makes them better suited for tasks that require both visual and textual understanding, such as image captioning, text-based image retrieval, and visual question answering. This paper provides a comprehensive review of the research on vision-language models in remote sensing, summarizing the latest progress, highlighting the current challenges, and identifying potential research opportunities. Specifically, we review the application of vision-language models in several mainstream remote sensing tasks, including image captioning, text-based image generation, text-based image retrieval, visual question answering, scene classification, semantic segmentation, and object detection. For each task, we briefly describe the task background and review some representative works. Finally, we summarize the limitations of existing work and provide some possible directions for future development.
https://arxiv.org/abs/2305.05726
Convolutional neural networks (CNNs) are commonplace in high-performing solutions to many real-world problems, such as audio classification. CNNs have many parameters and filters, with some having a larger impact on performance than others. This means that networks may contain many unnecessary filters, increasing a CNN's computation and memory requirements while providing limited performance benefits. To make CNNs more efficient, we propose a pruning framework that eliminates filters with the highest "commonality". We measure this commonality using the graph-theoretic concept of "centrality". We hypothesise that a filter with high centrality should be eliminated, as it represents commonality and can be replaced by other filters without much affecting the performance of the network. An experimental evaluation of the proposed framework is performed on acoustic scene classification and audio tagging. On the DCASE 2021 Task 1A baseline network, our proposed method reduces computations per inference by 71% with 50% fewer parameters, at less than a two-percentage-point drop in accuracy compared to the original network. For large-scale CNNs such as PANNs designed for audio tagging, our method reduces computations per inference by 24% with 41% fewer parameters, while slightly improving performance.
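A sketch of centrality-based filter ranking is given below; the graph construction (correlation threshold) and the use of degree centrality are illustrative choices, and the paper's specific centrality measure may differ.

```python
import numpy as np
import networkx as nx

def rank_filters_by_centrality(conv_weight):
    """Sketch: build a graph whose nodes are the filters of one conv layer,
    connect filters whose flattened weights are highly correlated, and rank
    filters by degree centrality; the most 'common' (central) filters are
    candidates for pruning. conv_weight: (out_ch, in_ch, kh, kw) array."""
    flat = conv_weight.reshape(conv_weight.shape[0], -1)
    sim = np.corrcoef(flat)                           # pairwise filter similarity
    g = nx.Graph()
    g.add_nodes_from(range(flat.shape[0]))
    thresh = 0.5                                      # illustrative threshold
    for i in range(flat.shape[0]):
        for j in range(i + 1, flat.shape[0]):
            if abs(sim[i, j]) > thresh:
                g.add_edge(i, j)
    centrality = nx.degree_centrality(g)
    return sorted(centrality, key=centrality.get, reverse=True)   # prune from the top
```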
https://arxiv.org/abs/2305.03391
Recent advances in using language models to obtain cross-modal audio-text representations have overcome the limitations of conventional training approaches that use predefined labels. This has allowed the community to make progress in tasks like zero-shot classification, which would otherwise not be possible. However, learning such representations requires a large amount of human-annotated audio-text pairs. In this paper, we study unsupervised approaches to improve the learning framework of such representations with unpaired text and audio. We explore domain-unspecific and domain-specific curation methods to create audio-text pairs that we use to further improve the model. We also show that when domain-specific curation is used in conjunction with a soft-labeled contrastive loss, we are able to obtain significant improvement in terms of zero-shot classification performance on downstream sound event classification or acoustic scene classification tasks.
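A sketch of a soft-labeled contrastive loss is shown below; how the soft targets are derived from the curated audio-text pairs is an assumption here, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def soft_labeled_contrastive_loss(audio_emb, text_emb, soft_targets, tau=0.07):
    """Sketch of a soft-labeled contrastive objective for curated audio-text pairs:
    instead of a one-hot match per audio clip, the target over the text batch is a
    soft distribution (e.g. derived from similarity among the curated captions).
    audio_emb, text_emb: (B, D); soft_targets: (B, B) with rows summing to 1."""
    a = F.normalize(audio_emb, dim=1)
    t = F.normalize(text_emb, dim=1)
    log_probs = F.log_softmax(a @ t.t() / tau, dim=1)
    return -(soft_targets * log_probs).sum(dim=1).mean()
```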
https://arxiv.org/abs/2305.01864