The excellent performance of recent self-supervised learning methods on various downstream tasks has attracted great attention from academia and industry. Some recent research efforts have been devoted to self-supervised music representation learning. Nevertheless, most of them learn to represent equally-sized music clips in the waveform or a spectrogram. Despite being effective in some tasks, learning music representations in such a manner largely neglects the inherent part-whole hierarchies of music. Given the hierarchical nature of the auditory cortex [24], understanding the bottom-up structure of music, i.e., how different parts constitute the whole at different levels, is essential for music understanding and representation learning. This work pursues hierarchical music representation learning and introduces the Music-PAW framework, which enables feature interactions of cropped music clips with part-whole hierarchies. From a technical perspective, we propose a transformer-based part-whole interaction module to progressively reason about the structural relationships between part-whole music clips at adjacent levels. In addition, to create a multi-hierarchy representation space, we devise a hierarchical contrastive learning objective to align part-whole music representations in adjacent hierarchies. The merits of audio representation learning from part-whole hierarchies have been validated on various downstream tasks, including music classification (single-label and multi-label), cover song identification and acoustic scene classification.
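To make the contrastive objective concrete, here is a minimal sketch (not the authors' implementation) of an InfoNCE-style loss that pulls each part-clip embedding toward the embedding of the whole clip it was cropped from; the batch layout and temperature are illustrative assumptions:

    import torch
    import torch.nn.functional as F

    def part_whole_contrastive_loss(part_emb, whole_emb, temperature=0.1):
        # part_emb, whole_emb: (B, D); row i of part_emb was cropped from
        # the longer clip embedded in row i of whole_emb.
        part = F.normalize(part_emb, dim=-1)
        whole = F.normalize(whole_emb, dim=-1)
        logits = part @ whole.t() / temperature  # (B, B) cosine similarities
        targets = torch.arange(part.size(0))     # positives on the diagonal
        return F.cross_entropy(logits, targets)

    loss = part_whole_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))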
https://arxiv.org/abs/2312.06197
Recent advancements in Large Vision-Language Models (VLMs) have shown great promise in natural image domains, allowing users to hold a dialogue about given visual content. However, such general-domain VLMs perform poorly in Remote Sensing (RS) scenarios, producing inaccurate or fabricated information when presented with RS domain-specific queries. Such behavior emerges due to the unique challenges introduced by RS imagery. For example, to handle high-resolution RS imagery with diverse scale changes across categories and many small objects, region-level reasoning is necessary alongside holistic scene interpretation. Furthermore, the lack of domain-specific multimodal instruction-following data as well as strong backbone models for RS makes it hard for models to align their behavior with user queries. To address these limitations, we propose GeoChat, the first versatile remote sensing VLM that offers multitask conversational capabilities with high-resolution RS images. Specifically, GeoChat not only answers image-level queries but also accepts region inputs to hold region-specific dialogue. Furthermore, it can visually ground objects in its responses by referring to their spatial coordinates. To address the lack of domain-specific datasets, we generate a novel RS multimodal instruction-following dataset by extending image-text pairs from existing diverse RS datasets. We establish a comprehensive benchmark for RS multitask conversations and compare against a number of baseline methods. GeoChat demonstrates robust zero-shot performance on various RS tasks, e.g., image and region captioning, visual question answering, scene classification, visually grounded conversations and referring detection. Our code is available at this https URL.
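As a rough illustration of region-level inputs, the sketch below serializes a bounding box into the text prompt; the <roi> token format and the 0-100 coordinate normalization are hypothetical stand-ins, not GeoChat's actual scheme:

    def region_prompt(question, box, image_wh):
        # Encode a region as normalized integer coordinates inside a
        # special token so the language model can condition on it.
        w, h = image_wh
        x1, y1, x2, y2 = box
        c = [round(100 * v) for v in (x1 / w, y1 / h, x2 / w, y2 / h)]
        return f"{question} <roi>{{{c[0]},{c[1]},{c[2]},{c[3]}}}</roi>"

    print(region_prompt("What is in this region?", (120, 40, 480, 300), (1024, 1024)))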
https://arxiv.org/abs/2311.15826
Lack of interpretability of deep convolutional neural networks (DCNN) is a well-known problem, particularly in the medical domain, as clinicians want trustworthy automated decisions. One way to improve trust is to demonstrate the localisation of feature representations with respect to expert-labelled regions of interest. In this work, we investigate the localisation of features learned via two different learning paradigms and demonstrate the superiority of one learning approach with respect to localisation. Our analysis on medical and natural datasets shows that the traditional end-to-end (E2E) learning strategy has a limited ability to localise discriminative features across multiple network layers. We show that a layer-wise learning strategy, namely cascade learning (CL), results in more localised features. Considering localisation accuracy, we not only show that CL outperforms E2E but also that it is a promising method for predicting regions. On the YOLO object detection framework, our best result shows that CL outperforms the E2E scheme by 2% in mAP.
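The sketch below conveys the layer-wise idea behind cascade learning, assuming a toy stack of conv blocks, throwaway auxiliary heads, and a handful of training steps per stage (all illustrative; the paper's exact training recipe differs):

    import torch
    import torch.nn as nn

    blocks = nn.ModuleList([
        nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()),
        nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU()),
    ])
    x, y = torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))

    feat = x
    for block, c in zip(blocks, (16, 32)):
        head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(c, 10))
        opt = torch.optim.Adam(list(block.parameters()) + list(head.parameters()))
        for _ in range(10):                 # train this stage only
            loss = nn.functional.cross_entropy(head(block(feat)), y)
            opt.zero_grad(); loss.backward(); opt.step()
        for p in block.parameters():        # freeze before the next stage
            p.requires_grad_(False)
        feat = block(feat).detach()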
https://arxiv.org/abs/2311.12704
Foundation models have recently garnered significant attention due to their potential to revolutionize the field of visual representation learning in a self-supervised manner. While most foundation models are tailored to effectively process RGB images for various visual tasks, there is a noticeable gap in research focused on spectral data, which offers valuable information for scene understanding, especially in remote sensing (RS) applications. To fill this gap, we created the first universal RS foundation model, named SpectralGPT, which is purpose-built to handle spectral RS images using a novel 3D generative pretrained transformer (GPT). Compared to existing foundation models, SpectralGPT 1) accommodates input images with varying sizes, resolutions, time series, and regions in a progressive training fashion, enabling full utilization of extensive RS big data; 2) leverages 3D token generation for spatial-spectral coupling; 3) captures spectrally sequential patterns via multi-target reconstruction; and 4) trains on one million spectral RS images, yielding models with over 600 million parameters. Our evaluation highlights significant performance improvements with pretrained SpectralGPT models, signifying substantial potential in advancing spectral RS big data applications within the field of geoscience across four downstream tasks: single- and multi-label scene classification, semantic segmentation, and change detection.
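A toy version of the 3D tokenization behind the spatial-spectral coupling: a spectral cube is split into (spectral x spatial x spatial) sub-cubes, each flattened into one token (sizes are illustrative; SpectralGPT's actual tokenizer is learned and more involved):

    import torch

    S, H, W = 12, 96, 96          # bands, height, width (toy sizes)
    s, p = 3, 8                   # spectral and spatial patch sizes
    cube = torch.randn(S, H, W)

    tokens = (cube.unfold(0, s, s)   # (S/s, H, W, s)
                  .unfold(1, p, p)   # (S/s, H/p, W, s, p)
                  .unfold(2, p, p))  # (S/s, H/p, W/p, s, p, p)
    tokens = tokens.reshape(-1, s * p * p)
    print(tokens.shape)              # torch.Size([576, 192])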
https://arxiv.org/abs/2311.07113
The recent development of deep learning methods applied to vision has enabled their increasing integration into real-world applications to perform complex Computer Vision (CV) tasks. However, image acquisition conditions have a major impact on the performance of high-level image processing. A possible solution to overcome these limitations is to artificially augment the training databases or to design deep learning models that are robust to signal distortions. We opt here for the first solution by enriching the database with complex and realistic distortions which were ignored until now in the existing databases. To this end, we built a new versatile database derived from the well-known MS-COCO database to which we applied local and global photo-realistic distortions. The new local distortions are generated by considering the scene context of the images: they exploit the depth information of the objects in the scene as well as their semantics, which guarantees a high level of photo-realism and makes it possible to explore real scenarios ignored in conventional databases dedicated to various CV applications. Our versatile database offers an efficient solution to improve the robustness of various CV tasks such as Object Detection (OD), scene segmentation, and distortion-type classification methods. The image database, scene classification index, and distortion generation codes are publicly available (this https URL).
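As an example of a depth-guided local distortion, the sketch below blends a sharp and a blurred copy of an image according to normalized depth, producing a defocus-like effect; this is one plausible distortion in the spirit of the paper, not its generation code:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def depth_aware_blur(image, depth, max_sigma=4.0):
        # Blur strength grows with normalized scene depth (defocus-like).
        d = (depth - depth.min()) / (np.ptp(depth) + 1e-8)
        blurred = gaussian_filter(image, sigma=(max_sigma, max_sigma, 0))
        return (1 - d[..., None]) * image + d[..., None] * blurred

    img = np.random.rand(64, 64, 3)
    depth = np.tile(np.linspace(0.0, 1.0, 64), (64, 1))   # toy depth ramp
    out = depth_aware_blur(img, depth)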
https://arxiv.org/abs/2311.06976
Current audio classification models have small class vocabularies relative to the large number of sound event classes of interest in the real world. Thus, they provide a limited view of the world that may miss important yet unexpected or unknown sound events. To address this issue, open-set audio classification techniques have been developed to detect sound events from unknown classes. Although these methods have been applied to a multi-class context in audio, such as sound scene classification, they have yet to be investigated for polyphonic audio in which sound events overlap, requiring the use of multi-label models. In this study, we establish the problem of multi-label open-set audio classification by creating a dataset with varying unknown class distributions and evaluating baseline approaches built upon existing techniques.
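One simple score-threshold baseline for this setting (an illustrative assumption, not necessarily one of the paper's baselines) keeps the usual per-class sigmoid decisions and flags a clip as containing unknown events when no known class is even weakly active:

    import numpy as np

    def flag_unknown(sigmoid_scores, known_thresh=0.5, unknown_thresh=0.2):
        labels = sigmoid_scores >= known_thresh           # multi-label decisions
        has_unknown = sigmoid_scores.max(axis=1) < unknown_thresh
        return labels, has_unknown

    scores = np.array([[0.90, 0.10, 0.70],
                       [0.05, 0.10, 0.08]])               # 2 clips, 3 known classes
    print(flag_unknown(scores))                           # clip 2 flagged unknown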
https://arxiv.org/abs/2310.13759
Deep Learning models like Convolutional Neural Networks (CNN) are powerful image classifiers, but what factors determine whether they attend to similar image areas as humans do? While previous studies have focused on technological factors, little is known about the role of factors that affect human attention. In the present study, we investigated how the tasks used to elicit human attention maps interact with image characteristics in modulating the similarity between human and CNN attention. We varied the intentionality of the human tasks, ranging from spontaneous gaze during categorization, over intentional gaze-pointing, to manual area selection. Moreover, we varied the type of image to be categorized, using either singular salient objects, indoor scenes consisting of object arrangements, or landscapes without distinct objects defining the category. The human attention maps generated in this way were compared to the CNN attention maps revealed by explainable artificial intelligence (Grad-CAM). The influence of the human task strongly depended on image type: for objects, human manual selection produced maps that were most similar to CNN, while the specific eye movement task had little impact; for indoor scenes, spontaneous gaze produced the least similarity; and for landscapes, similarity was equally low across all human tasks. To better understand these results, we also compared the different human attention maps to each other. Our results highlight the importance of taking human factors into account when comparing the attention of humans and CNNs.
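Comparing a human attention map to a Grad-CAM map boils down to a similarity score between two 2D arrays; a minimal sketch using pixelwise Pearson correlation (one common choice, assumed here for illustration):

    import numpy as np

    def map_similarity(human_map, cnn_map):
        h = human_map.ravel().astype(float)
        c = cnn_map.ravel().astype(float)
        h = (h - h.mean()) / (h.std() + 1e-8)   # z-score both maps
        c = (c - c.mean()) / (c.std() + 1e-8)
        return float((h * c).mean())            # Pearson correlation

    print(map_similarity(np.random.rand(14, 14), np.random.rand(14, 14)))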
https://arxiv.org/abs/2307.13345
Most deep learning-based acoustic scene classification (ASC) approaches identify scenes based on acoustic features converted from audio clips containing mixed information entangled by polyphonic audio events (AEs). However, these approaches have difficulty explaining what cues they use to identify scenes. This paper conducts the first study on disclosing the relationship between real-life acoustic scenes and semantic embeddings from the most relevant AEs. Specifically, we propose an event-relational graph representation learning (ERGL) framework for ASC that classifies scenes while answering clearly and directly which cues are used in the classification. In the event-relational graph, embeddings of each event are treated as nodes, while relationship cues derived from each pair of nodes are described by multi-dimensional edge features. Experiments on a real-life ASC dataset show that the proposed ERGL achieves competitive performance on ASC by learning embeddings of only a limited number of AEs. The results show the feasibility of recognizing diverse acoustic scenes based on the audio event-relational graph. Visualizations of graph representations learned by ERGL are available here (this https URL).
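A toy construction of such a graph, where node features are event embeddings and each directed edge carries a multi-dimensional feature derived from its endpoint pair (difference and product are illustrative choices, not necessarily ERGL's edge definition):

    import torch

    N, D = 5, 16                         # events and embedding size
    nodes = torch.randn(N, D)            # one embedding per detected AE
    src, dst = torch.meshgrid(torch.arange(N), torch.arange(N), indexing="ij")
    src, dst = src.ravel(), dst.ravel()
    edge_feat = torch.cat([nodes[src] - nodes[dst],
                           nodes[src] * nodes[dst]], dim=-1)
    print(edge_feat.shape)               # torch.Size([25, 32])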
https://arxiv.org/abs/2310.03889
The correlation between the sharpness of loss minima and generalisation in the context of deep neural networks has been subject to discussion for a long time. Whilst mostly investigated in the context of selected benchmark data sets in the area of computer vision, we explore this aspect for the audio scene classification task of the DCASE2020 challenge data. Our analysis is based on two-dimensional filter-normalised visualisations and a derived sharpness measure. Our exploratory analysis shows that sharper minima tend to show better generalisation than flat minima (even more so for out-of-domain data recorded from previously unseen devices), thus adding to the dispute about the better generalisation capabilities of flat minima. We further find that, in particular, the choice of optimiser is a main driver of the sharpness of minima, and we discuss the resulting limitations with respect to comparability. Our code, trained model states and loss landscape visualisations are publicly available.
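A minimal sketch of probing sharpness along a filter-normalised random direction (in the spirit of Li et al.'s landscape visualisations, with a toy model standing in for the ASC networks):

    import copy
    import torch
    import torch.nn as nn

    def filter_normalized_direction(model):
        # Random direction, each filter rescaled to its weight's norm.
        direction = []
        for p in model.parameters():
            d = torch.randn_like(p)
            if p.dim() > 1:
                scale = p.flatten(1).norm(dim=1) / (d.flatten(1).norm(dim=1) + 1e-10)
                d = d * scale.view(-1, *[1] * (p.dim() - 1))
            direction.append(d)
        return direction

    def loss_at(model, direction, alpha, loss_fn):
        probe = copy.deepcopy(model)
        with torch.no_grad():
            for p, d in zip(probe.parameters(), direction):
                p.add_(alpha * d)
        return loss_fn(probe)

    model = nn.Linear(10, 2)
    x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
    loss_fn = lambda m: nn.functional.cross_entropy(m(x), y).item()
    d = filter_normalized_direction(model)
    print([loss_at(model, d, a, loss_fn) for a in (-1.0, -0.5, 0.0, 0.5, 1.0)])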
https://arxiv.org/abs/2309.16369
We present a locality-aware method for interpreting the latent space of wavelet-based Generative Adversarial Networks (GANs) that can capture the large spatial and spectral variability characteristic of satellite imagery. By focusing on preserving locality, the proposed method is able to decompose the weight space of pre-trained GANs and recover interpretable directions that correspond to high-level semantic concepts (such as urbanization, structure density, flora presence), which can subsequently be used for guided synthesis of satellite imagery. In contrast to typically used approaches that focus on capturing the variability of the weight space in a reduced-dimensionality space (i.e., based on Principal Component Analysis, PCA), we show that preserving locality leads to vectors with different angles that are more robust to artifacts and can better preserve class information. Via a set of quantitative and qualitative examples, we further show that the proposed approach can outperform both baseline geometric augmentations and global, PCA-based approaches for data synthesis in the context of data augmentation for satellite scene classification.
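For contrast, the global PCA baseline mentioned above is easy to sketch: candidate directions are the top principal components of latent codes pushed through (here) a stand-in weight matrix; the random linear map below is purely illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(512, 512))        # stand-in for a GAN weight matrix
    z = rng.normal(size=(10000, 512))      # sampled latent codes
    feats = z @ W.T
    feats -= feats.mean(axis=0)
    _, _, vt = np.linalg.svd(feats, full_matrices=False)
    directions = vt[:10]                   # top-10 global candidate directions
    edited = z[:1] + 3.0 * directions[0]   # move one sample along a direction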
https://arxiv.org/abs/2309.14883
Vision Transformer (ViT) models have recently emerged as powerful and versatile models for various visual tasks. Recently, a work called PMF achieved promising results in few-shot image classification by utilizing pre-trained vision transformer models. However, PMF employs full fine-tuning to learn the downstream tasks, leading to significant overfitting and storage issues, especially in the remote sensing domain. To tackle these issues, we turn to the recently proposed parameter-efficient tuning methods, such as VPT, which updates only the newly added prompt parameters while keeping the pre-trained backbone frozen. Inspired by VPT, we propose the Meta Visual Prompt Tuning (MVP) method. Specifically, we integrate the VPT method into the meta-learning framework and tailor it to the remote sensing domain, resulting in an efficient framework for Few-Shot Remote Sensing Scene Classification (FS-RSSC). Furthermore, we introduce a novel data augmentation strategy based on patch embedding recombination to enhance the representation and diversity of scenes for classification purposes. Experimental results on the FS-RSSC benchmark demonstrate the superior performance of the proposed MVP over existing methods in various settings, such as various-way-various-shot, various-way-one-shot, and cross-domain adaptation.
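A minimal sketch of the prompt-tuning mechanism MVP builds on: learnable prompt tokens are prepended to the patch sequence while the backbone stays frozen (the tiny two-layer transformer below is a stand-in for a pre-trained ViT):

    import torch
    import torch.nn as nn

    class PromptedEncoder(nn.Module):
        def __init__(self, backbone, num_prompts=8, dim=768):
            super().__init__()
            self.backbone = backbone
            for p in self.backbone.parameters():
                p.requires_grad_(False)      # only the prompts train
            self.prompts = nn.Parameter(torch.zeros(1, num_prompts, dim))
            nn.init.trunc_normal_(self.prompts, std=0.02)

        def forward(self, patch_tokens):     # (B, N, dim)
            b = patch_tokens.size(0)
            x = torch.cat([self.prompts.expand(b, -1, -1), patch_tokens], dim=1)
            return self.backbone(x)

    layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
    model = PromptedEncoder(nn.TransformerEncoder(layer, num_layers=2))
    out = model(torch.randn(2, 196, 768))    # (2, 204, 768)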
https://arxiv.org/abs/2309.09276
Recent studies focus on developing efficient systems for acoustic scene classification (ASC) using convolutional neural networks (CNNs), which typically consist of consecutive kernels. This paper highlights the benefits of using separate kernels as a more powerful and efficient design approach in ASC tasks. Inspired by the time-frequency nature of audio signals, we propose TF-SepNet, a CNN architecture that separates feature processing along the time and frequency dimensions. Features resulting from the separate paths are then merged along the channel dimension and forwarded directly to the classifier. Instead of conventional two-dimensional (2D) kernels, TF-SepNet incorporates one-dimensional (1D) kernels to reduce computational costs. Experiments were conducted using the TAU Urban Acoustic Scene 2022 Mobile development dataset. The results show that TF-SepNet outperforms similar state-of-the-art models that use consecutive kernels. A further investigation reveals that the separate kernels lead to a larger effective receptive field (ERF), which enables TF-SepNet to capture more time-frequency features.
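The core design reduces to two parallel 1D convolutions whose outputs are concatenated by channels; a minimal block in that spirit (sizes are illustrative, not the published architecture):

    import torch
    import torch.nn as nn

    class TFSepBlock(nn.Module):
        # Separate kernels along frequency and time instead of one 3x3.
        def __init__(self, in_ch, out_ch):
            super().__init__()
            half = out_ch // 2
            self.freq_path = nn.Conv2d(in_ch, half, (3, 1), padding=(1, 0))
            self.time_path = nn.Conv2d(in_ch, half, (1, 3), padding=(0, 1))

        def forward(self, x):                # x: (B, C, freq, time)
            return torch.cat([self.freq_path(x), self.time_path(x)], dim=1)

    spec = torch.randn(4, 32, 128, 100)      # log-mel spectrogram batch
    print(TFSepBlock(32, 64)(spec).shape)    # torch.Size([4, 64, 128, 100])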
https://arxiv.org/abs/2309.08200
The increasing availability of multi-sensor data sparks interest in multimodal self-supervised learning. However, most existing approaches learn only common representations across modalities while ignoring intra-modal training and modality-unique representations. We propose Decoupling Common and Unique Representations (DeCUR), a simple yet effective method for multimodal self-supervised learning. By distinguishing inter- and intra-modal embeddings, DeCUR is trained to integrate complementary information across different modalities. We evaluate DeCUR in three common multimodal scenarios (radar-optical, RGB-elevation, and RGB-depth) and demonstrate its consistent benefits on scene classification and semantic segmentation downstream tasks. Notably, we obtain straightforward improvements by transferring our pretrained backbones to state-of-the-art supervised multimodal methods without any hyperparameter tuning. Furthermore, we conduct a comprehensive explainability analysis to shed light on the interpretation of common and unique features in our multimodal approach. Code is available at this https URL.
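A much-simplified sketch of the decoupling idea: each modality's embedding is split into common and unique parts, and only the common parts are aligned across modalities (the split point and cosine alignment are illustrative stand-ins for DeCUR's actual objective):

    import torch
    import torch.nn.functional as F

    def decoupled_alignment_loss(emb_a, emb_b, common=128):
        ca, ua = emb_a[:, :common], emb_a[:, common:]   # common / unique
        cb, ub = emb_b[:, :common], emb_b[:, common:]
        align = 1 - F.cosine_similarity(ca, cb, dim=-1).mean()
        return align, (ua, ub)   # unique parts are left to intra-modal terms

    a, b = torch.randn(16, 256), torch.randn(16, 256)   # two modalities
    loss, _ = decoupled_alignment_loss(a, b)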
https://arxiv.org/abs/2309.05300
Deep learning models run the risk of utilizing spurious clues to make predictions, such as recognizing actions based on the background scene. This issue can severely degrade open-set action recognition performance when the testing samples have scene distributions different from those of the training samples. To mitigate this problem, we propose a novel method, called Scene-debiasing Open-set Action Recognition (SOAR), which features an adversarial scene reconstruction module and an adaptive adversarial scene classification module. The former prevents the decoder from reconstructing the video background given video features, and thus helps reduce the background information in feature learning. The latter aims to confuse scene type classification given video features, with a specific emphasis on the action foreground, and helps to learn scene-invariant information. In addition, we design an experiment to quantify the scene bias. The results indicate that current open-set action recognizers are biased toward the scene, and that our proposed SOAR method better mitigates such bias. Furthermore, our extensive experiments demonstrate that our method outperforms state-of-the-art methods, and ablation studies confirm the effectiveness of our proposed modules.
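One standard way to implement such adversarial confusion is a gradient-reversal layer between the features and the scene classifier; the sketch below shows that mechanism (SOAR's adaptive variant goes further):

    import torch
    import torch.nn as nn

    class GradReverse(torch.autograd.Function):
        # Identity forward; negated gradient backward, so the scene head
        # trains normally while the features learn to confuse it.
        @staticmethod
        def forward(ctx, x, lam=1.0):
            ctx.lam = lam
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_out):
            return -ctx.lam * grad_out, None

    feats = torch.randn(4, 128, requires_grad=True)
    scene_logits = nn.Linear(128, 10)(GradReverse.apply(feats))
    scene_logits.sum().backward()   # feats.grad carries the reversed signal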
https://arxiv.org/abs/2309.01265
In this paper, we tackle the problem of class incremental learning (CIL) in the realm of land-cover classification from optical remote sensing (RS) images. The paradigm of CIL has recently gained much prominence given that data are generally obtained in a sequential manner for real-world phenomena. However, CIL has not yet been extensively considered in the RS domain, despite the fact that satellites tend to discover new classes at different geographical locations over time. With this motivation, we propose a novel CIL framework, inspired by the recent success of replay-memory based approaches, that tackles two of their shortcomings. To reduce the catastrophic forgetting of old classes when a new stream arrives, we learn a curriculum of the new classes based on their similarity with the old classes; this is found to limit the degree of forgetting substantially. Next, while constructing the replay memory, instead of randomly selecting samples from the old streams, we propose a sample selection strategy which ensures the selection of highly confident samples, so as to reduce the effects of noise. We observe a sharp improvement in CIL performance with the proposed components. Experimental results on the benchmark NWPU-RESISC45, PatternNet, and EuroSAT datasets confirm that our method offers an improved stability-plasticity trade-off compared to the literature.
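A minimal sketch of confidence-aware replay selection in that spirit (the per-class budget and selection rule are illustrative assumptions): keep the old-class samples that are predicted correctly with the highest confidence rather than a random subset:

    import numpy as np

    def select_replay(probs, labels, per_class=20):
        memory = []
        for c in np.unique(labels):
            idx = np.where((labels == c) & (probs.argmax(1) == c))[0]
            order = np.argsort(-probs[idx, c])     # most confident first
            memory.extend(idx[order][:per_class].tolist())
        return memory

    probs = np.random.dirichlet(np.ones(5), size=200)   # toy predictions
    labels = np.random.randint(0, 5, 200)
    print(len(select_replay(probs, labels)))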
https://arxiv.org/abs/2309.01050
Deep neural networks (DNNs) have achieved tremendous success in many remote sensing (RS) applications. However, their vulnerability to adversarial perturbations should not be neglected. Unfortunately, current adversarial defense approaches in RS studies usually suffer from performance fluctuations and unnecessary re-training costs due to the need for prior knowledge of the adversarial perturbations among RS data. To circumvent these challenges, we propose a universal adversarial defense approach for RS imagery (UAD-RS) that uses pre-trained diffusion models to defend common DNNs against multiple unknown adversarial attacks. Specifically, generative diffusion models are first pre-trained on different RS datasets to learn generalized representations in various data domains. After that, a universal adversarial purification framework is developed using the forward and reverse processes of the pre-trained diffusion models to purify the perturbations from adversarial samples. Furthermore, an adaptive noise level selection (ANLS) mechanism is built to find the noise level of the diffusion model that achieves purification results closest to the clean samples, as measured by the Frechet Inception Distance (FID) in deep feature space. As a result, only a single pre-trained diffusion model is needed for the universal purification of adversarial samples on each dataset, which significantly alleviates the re-training effort for each attack setting and maintains high performance without prior knowledge of the adversarial perturbations. Experiments on four heterogeneous RS datasets covering scene classification and semantic segmentation verify that UAD-RS outperforms state-of-the-art adversarial purification approaches, providing a universal defense against seven common adversarial perturbations.
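The purification step itself is compact; a sketch with a toy noise schedule and an identity function standing in for the pre-trained diffusion model's reverse process (in UAD-RS the noise level t* would be chosen by ANLS via FID):

    import torch

    def purify(x_adv, denoise_fn, t_star, alphas_cumprod):
        # Forward-noise to t* (drowning the perturbation), then denoise.
        a_bar = alphas_cumprod[t_star]
        noised = a_bar.sqrt() * x_adv + (1 - a_bar).sqrt() * torch.randn_like(x_adv)
        return denoise_fn(noised, t_star)

    alphas_cumprod = torch.cumprod(1 - torch.linspace(1e-4, 0.02, 1000), dim=0)
    x_adv = torch.rand(1, 3, 64, 64)
    x_pure = purify(x_adv, lambda x, t: x, t_star=200, alphas_cumprod=alphas_cumprod)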
https://arxiv.org/abs/2307.16865
Visual saliency refers to the innate human mechanism of focusing on and extracting important features from the observed environment. Recently, there has been a notable surge of interest in automotive research regarding the estimation of visual saliency. While operating a vehicle, drivers naturally direct their attention towards specific objects, employing brain-driven saliency mechanisms that prioritize certain elements over others. In this investigation, we present an intelligent system that combines a driver drowsiness detection system with a saliency-based scene comprehension pipeline. To achieve this, we have implemented a specialized 3D deep network for semantic segmentation, pretrained and tailored for processing the frames captured by an automotive-grade external camera. The proposed pipeline is hosted on an embedded platform built around the STA1295 core, featuring dual ARM A7 cores and a hardware accelerator. Additionally, we employ an innovative biosensor embedded in the car steering wheel to monitor driver drowsiness by gathering the driver's PhotoPlethysmoGraphy (PPG) signal. A dedicated 1D temporal deep convolutional network has been devised to classify the collected PPG time series, enabling us to assess the driver's level of attentiveness. Ultimately, we compare the determined attention level of the driver with the corresponding saliency-based scene classification to evaluate the overall safety level. The efficacy of the proposed pipeline has been validated through extensive experimental results.
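A small 1D temporal CNN of the kind described for the PPG branch (layer sizes, window length and the two-class head are illustrative, not the deployed network):

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv1d(1, 16, kernel_size=7, padding=3), nn.ReLU(),
        nn.MaxPool1d(4),
        nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
        nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        nn.Linear(32, 2),                # drowsy vs. attentive
    )
    ppg = torch.randn(8, 1, 1024)        # batch of 1D PPG windows
    print(model(ppg).shape)              # torch.Size([8, 2])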
https://arxiv.org/abs/2308.03770
Scene recognition based on deep learning has made significant progress, but its performance is still limited by the challenges posed by inter-class similarities and intra-class dissimilarities. Furthermore, prior research has primarily focused on improving classification accuracy, giving less attention to achieving interpretable, precise scene classification. Therefore, we propose EnTri, an ensemble scene recognition framework that employs ensemble learning over a hierarchy of visual features. EnTri represents features at three distinct levels of detail: pixel level, semantic segmentation level, and object class and frequency level. By incorporating distinct feature encoding schemes of differing complexity and leveraging ensemble strategies, our approach aims to improve classification accuracy while enhancing transparency and interpretability via visual and textual explanations. To achieve interpretability, we devised an extension algorithm that generates both visual and textual explanations highlighting the various properties of a given scene that contribute to the final prediction of its category, including information about objects, statistics, spatial layout, and textural details. Through experiments on benchmark scene classification datasets, EnTri has demonstrated superior recognition accuracy, achieving competitive performance compared to state-of-the-art approaches, with accuracies of 87.69%, 75.56%, and 99.17% on the MIT67, SUN397, and UIUC8 datasets, respectively.
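At prediction time, an ensemble over the three branches can be as simple as a weighted average of per-branch class probabilities; a sketch of such late fusion (one plausible ensemble strategy, not necessarily EnTri's exact rule):

    import numpy as np

    def ensemble_predict(pixel_p, seg_p, obj_p, weights=(1.0, 1.0, 1.0)):
        w = np.asarray(weights)[:, None, None]
        stacked = np.stack([pixel_p, seg_p, obj_p])   # (3, B, num_classes)
        return (w * stacked).sum(0).argmax(-1)        # fused class per sample

    p = np.random.dirichlet(np.ones(10), size=(3, 4)) # 3 branches x 4 samples
    print(ensemble_predict(p[0], p[1], p[2]))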
https://arxiv.org/abs/2307.12442
The field of Explainable Artificial Intelligence (XAI) aims to improve the interpretability of black-box machine learning models. Building a heatmap based on the importance values of input features is a popular method for explaining how such models produce their predictions. Heatmaps are largely understandable to humans, yet they are not without flaws. Non-expert users, for example, may not fully understand the logic of heatmaps (the logic by which pixels relevant to the model's prediction are highlighted with different intensities or colors). Additionally, objects and regions of the input image that are relevant to the model's prediction are frequently not entirely differentiated by heatmaps. In this paper, we propose a framework called TbExplain that employs XAI techniques and a pre-trained object detector to present text-based explanations of scene classification models. Moreover, TbExplain incorporates a novel method that, when the initial prediction is unreliable, corrects it and explains it textually based on the statistics of objects in the input image. To assess the trustworthiness and validity of the text-based explanations, we conducted a qualitative experiment, and the findings indicated that these explanations are sufficiently reliable. Furthermore, our quantitative and qualitative experiments on TbExplain with scene classification datasets reveal an improvement in classification accuracy over ResNet variants.
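A toy example of turning detector output into a textual rationale (the threshold and sentence template are illustrative; TbExplain's generation is richer):

    from collections import Counter

    def textual_explanation(scene, detections, thresh=0.5):
        counts = Counter(lbl for lbl, score in detections if score > thresh)
        objects = ", ".join(f"{n} {lbl}(s)" for lbl, n in counts.most_common())
        return f"Predicted '{scene}' because the image contains {objects}."

    dets = [("bed", 0.92), ("lamp", 0.81), ("lamp", 0.77), ("pillow", 0.44)]
    print(textual_explanation("bedroom", dets))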
https://arxiv.org/abs/2307.10003
Self-supervised learning (SSL) has emerged as a promising approach for remote sensing image classification due to its ability to leverage large amounts of unlabeled data. In contrast to traditional supervised learning, SSL learns representations of the data without the need for explicit labels. This is achieved by formulating auxiliary tasks that create pseudo-labels for the unlabeled data and yield pre-trained models. The pre-trained models can then be fine-tuned on downstream tasks such as remote sensing image scene classification. This paper analyzes the effectiveness of SSL pre-training on Million AID, a large unlabeled remote sensing dataset, with various remote sensing image scene classification datasets as downstream tasks. More specifically, we evaluate SSL pre-training using the iBOT framework coupled with Vision Transformers (ViT), in contrast to supervised pre-training of ViT using the ImageNet dataset. Comprehensive experiments across 14 datasets with diverse properties reveal that in-domain SSL leads to improved predictive performance compared to supervised counterparts.
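The downstream recipe is standard fine-tuning; a sketch using a torchvision ViT with ImageNet weights standing in for an iBOT/Million-AID checkpoint (the 45-class head matches, e.g., NWPU-RESISC45 and is an illustrative choice):

    import torch
    import torch.nn as nn
    from torchvision.models import vit_b_16

    model = vit_b_16(weights="IMAGENET1K_V1")         # stand-in checkpoint
    model.heads.head = nn.Linear(model.heads.head.in_features, 45)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

    x, y = torch.randn(2, 3, 224, 224), torch.randint(0, 45, (2,))
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward(); opt.step()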
https://arxiv.org/abs/2307.01645