In this technical report, we describe the SNTL-NTU team's submission for Task 1, Data-Efficient Low-Complexity Acoustic Scene Classification, of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 challenge. Three systems are introduced to tackle training splits of different sizes. For the small training splits, we explored reducing the complexity of the provided baseline model by reducing the number of base channels, and we introduce data augmentation in the form of mixup to increase the diversity of training samples. For the larger training splits, we use FocusNet to provide confusing-class information to an ensemble of multiple Patchout faSt Spectrogram Transformer (PaSST) models and baseline models trained at the original sampling rate of 44.1 kHz, and we use knowledge distillation to distill the ensemble into the baseline student model. Training the systems on the TAU Urban Acoustic Scene 2022 Mobile development dataset yielded the highest average testing accuracies of (62.21, 59.82, 56.81, 53.03, 47.97)% on the (100, 50, 25, 10, 5)% splits, respectively, over the three systems.
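As a hedged illustration of the mixup augmentation mentioned above, here is a minimal PyTorch sketch; the Beta parameter alpha=0.3 is an assumption for illustration, not the report's setting.

```python
import torch

def mixup(x, y, alpha=0.3):
    """Mix a batch of log-mel spectrograms and their one-hot labels.

    x: (batch, channels, mels, frames) tensor; y: (batch, classes) one-hot.
    alpha is illustrative; the report does not specify its value here.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return x_mix, y_mix

# usage: x_mix, y_mix = mixup(spectrograms, one_hot_labels)
```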
https://arxiv.org/abs/2409.11964
This paper proposes a method for unsupervised whole-image clustering of a target dataset of remote sensing scenes with no labels. The method consists of three main steps: (1) fine-tuning a pretrained deep neural network (DINOv2) on a labelled source remote sensing imagery dataset and using it to extract a feature vector from each image in the target dataset, (2) reducing the dimension of these deep features via manifold projection into a low-dimensional Euclidean space, and (3) clustering the embedded features using a Bayesian nonparametric technique to infer the number and membership of clusters simultaneously. The method takes advantage of heterogeneous transfer learning to cluster unseen data with different feature and label distributions. We demonstrate that this approach outperforms state-of-the-art zero-shot classification methods on several remote sensing scene classification datasets.
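A minimal sketch of the three-step pipeline, assuming PyTorch with the public DINOv2 hub weights, umap-learn for the manifold projection, and scikit-learn's Dirichlet-process mixture standing in for the Bayesian nonparametric clusterer; the backbone variant (dinov2_vitb14), the 8-dimensional projection, and the cluster cap are assumptions, and the paper's exact components may differ.

```python
import torch
import umap                      # umap-learn
from sklearn.mixture import BayesianGaussianMixture

# 1) Feature extraction with a (fine-tuned) DINOv2 backbone.
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
model.eval()

@torch.no_grad()
def extract_features(images):            # images: (N, 3, 224, 224), normalized
    return model(images).cpu().numpy()   # (N, 768) CLS embeddings

# 2) Manifold projection into a low-dimensional Euclidean space.
def embed(features, dim=8):
    return umap.UMAP(n_components=dim).fit_transform(features)

# 3) A Dirichlet-process mixture infers cluster count and membership together.
def cluster(embedded, max_clusters=50):
    dpgmm = BayesianGaussianMixture(
        n_components=max_clusters,
        weight_concentration_prior_type='dirichlet_process')
    return dpgmm.fit_predict(embedded)
```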
https://arxiv.org/abs/2409.03938
Vision-Language Models for remote sensing have shown promising uses thanks to their extensive pretraining. However, their conventional usage in zero-shot scene classification methods still involves dividing large images into patches and making independent predictions, i.e., inductive inference, thereby limiting their effectiveness by ignoring valuable contextual information. Our approach tackles this issue by utilizing initial predictions based on text prompting and patch affinity relationships from the image encoder to enhance zero-shot capabilities through transductive inference, all without the need for supervision and at a minor computational cost. Experiments on 10 remote sensing datasets with state-of-the-art Vision-Language Models demonstrate significant accuracy improvements over inductive zero-shot classification. Our source code is publicly available on GitHub: this https URL
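The transductive step can be pictured as label propagation over a patch-affinity graph; the sketch below is a generic formulation under that assumption, not the paper's exact solver, and the iteration count and smoothing weight are illustrative.

```python
import numpy as np

def transductive_refine(text_logits, patch_feats, n_iters=10, alpha=0.5):
    """Refine per-patch zero-shot predictions by propagating them over an
    affinity graph built from image-encoder patch features.

    text_logits: (P, C) initial class scores from text prompting.
    patch_feats: (P, D) L2-normalized patch embeddings.
    """
    sims = patch_feats @ patch_feats.T                  # cosine affinities
    np.fill_diagonal(sims, 0.0)
    w = np.clip(sims, 0.0, None)
    w /= w.sum(axis=1, keepdims=True) + 1e-8            # row-normalize
    probs = np.exp(text_logits) / np.exp(text_logits).sum(1, keepdims=True)
    z = probs.copy()
    for _ in range(n_iters):                            # smooth over neighbors
        z = alpha * (w @ z) + (1.0 - alpha) * probs
    return z.argmax(axis=1)                             # refined patch labels
```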
https://arxiv.org/abs/2409.00698
Acoustic scene classification (ASC) predominantly relies on supervised approaches. However, acquiring labeled data for training ASC models is often costly and time-consuming. Recently, self-supervised learning (SSL) has emerged as a powerful method for extracting features from unlabeled audio data, benefiting many downstream audio tasks. This paper proposes a data-efficient and low-complexity ASC system by leveraging self-supervised audio representations extracted from general-purpose audio datasets. We introduce BEATs, an audio SSL pre-trained model, to extract the general representations from AudioSet. Through extensive experiments, it has been demonstrated that the self-supervised audio representations can help to achieve high ASC accuracy with limited labeled fine-tuning data. Furthermore, we find that ensembling the SSL models fine-tuned with different strategies contributes to a further performance improvement. To meet low-complexity requirements, we use knowledge distillation to transfer the self-supervised knowledge from large teacher models to an efficient student model. The experimental results suggest that the self-supervised teachers effectively improve the classification accuracy of the student model. Our best-performing system obtains an average accuracy of 56.7%.
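The distillation step can be sketched as the standard soft-label KD objective, averaging the ensemble of fine-tuned teachers; the temperature and loss weight below are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits_list, labels,
                      T=2.0, kd_weight=0.5):
    """Soft-label distillation from an ensemble of teachers.

    teacher_logits_list: logits from each fine-tuned teacher model.
    """
    teacher_logits = torch.stack(teacher_logits_list).mean(0)  # ensemble
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction='batchmean') * T * T
    hard = F.cross_entropy(student_logits, labels)
    return kd_weight * soft + (1.0 - kd_weight) * hard
```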
https://arxiv.org/abs/2408.14862
Neural network models for audio tasks, such as automatic speech recognition (ASR) and acoustic scene classification (ASC), are susceptible to noise contamination in real-life applications. To improve audio quality, an enhancement module, which can be developed independently, is explicitly used at the front end of the target audio application. In this paper, we present an end-to-end learning solution to jointly optimise the models for audio enhancement (AE) and the subsequent applications. To guide the optimisation of the AE module towards a target application, and especially to overcome difficult samples, we use the sample-wise performance measure as an indication of sample importance. In experiments, we consider four representative applications to evaluate our training paradigm: ASR, speech command recognition (SCR), speech emotion recognition (SER), and ASC. These applications span speech and non-speech tasks concerning semantic and non-semantic features and transient and global information. The experimental results indicate that our proposed approach can considerably boost the noise robustness of the models, especially at low signal-to-noise ratios (SNRs), for a wide range of computer audition tasks in everyday-life noisy environments.
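One hedged reading of the sample-importance idea: weight each sample's joint loss by its downstream-task loss so that difficult samples dominate the gradient. A minimal sketch, assuming PyTorch and per-sample (unreduced) losses; the softmax weighting and the beta parameter are assumptions, as the paper's exact measure may differ.

```python
import torch

def weighted_joint_loss(ae_loss_per_sample, task_loss_per_sample, beta=1.0):
    """Joint AE + downstream-task objective with sample-wise importance.

    Samples on which the downstream model performs poorly receive larger
    weights (softmax over per-sample task losses).
    """
    with torch.no_grad():
        weights = torch.softmax(beta * task_loss_per_sample, dim=0)
        weights = weights * task_loss_per_sample.numel()   # keep scale ~1
    joint = ae_loss_per_sample + task_loss_per_sample
    return (weights * joint).mean()
```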
https://arxiv.org/abs/2408.06264
Unsupervised domain adaptation (UDA) techniques, extensively studied in hyperspectral image (HSI) classification, aim to use labeled source-domain data and unlabeled target-domain data to learn domain-invariant features for cross-scene classification. Compared to natural images, the numerous spectral bands of HSIs provide abundant semantic information, but they also increase the domain shift significantly. In most existing methods, both explicit and implicit alignment simply align feature distributions, ignoring domain information in the spectrum. We note that when the spectral channels of the source and target domains differ markedly, the transfer performance of these methods tends to deteriorate. Additionally, their performance fluctuates greatly owing to the varying domain shifts across datasets. To address these problems, a novel shift-sensitive spatial-spectral disentangling learning (S4DL) approach is proposed. In S4DL, gradient-guided spatial-spectral decomposition is designed to separate domain-specific and domain-invariant representations by generating tailored masks under the guidance of the gradient from domain classification. A shift-sensitive adaptive monitor is defined to adjust the intensity of disentangling according to the magnitude of the domain shift. Furthermore, a reversible neural network is constructed to retain domain information that lies not only in the semantic content but also in shallow-level details. Extensive experimental results on several cross-scene HSI datasets consistently verify that S4DL outperforms state-of-the-art UDA methods. Our source code will be available at this https URL.
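A hedged sketch of the gradient-guided decomposition idea: channels whose features receive the largest domain-classification gradient are treated as domain-specific, the rest as domain-invariant. The fixed selection ratio is an assumption for illustration, not the paper's tailored-mask mechanism.

```python
import torch

def gradient_guided_masks(features, domain_logits, domain_labels, ratio=0.3):
    """Split features into domain-specific and domain-invariant parts.

    features: (B, C, H, W), part of the autograd graph.
    """
    loss = torch.nn.functional.cross_entropy(domain_logits, domain_labels)
    grads, = torch.autograd.grad(loss, features, retain_graph=True)
    score = grads.abs().mean(dim=(0, 2, 3))           # (C,) channel saliency
    k = max(1, int(ratio * score.numel()))
    idx = score.topk(k).indices
    spec_mask = torch.zeros_like(score)
    spec_mask[idx] = 1.0                               # domain-specific channels
    spec_mask = spec_mask.view(1, -1, 1, 1)
    return features * spec_mask, features * (1 - spec_mask)
```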
https://arxiv.org/abs/2408.15263
Artificial Intelligence (AI) technologies have profoundly transformed the field of remote sensing, revolutionizing data collection, processing, and analysis. Traditionally reliant on manual interpretation and task-specific models, remote sensing has been significantly enhanced by the advent of foundation models--large-scale, pre-trained AI models capable of performing a wide array of tasks with unprecedented accuracy and efficiency. This paper provides a comprehensive survey of foundation models in the remote sensing domain, covering models released between June 2021 and June 2024. We categorize these models based on their applications in computer vision and domain-specific tasks, offering insights into their architectures, pre-training datasets, and methodologies. Through detailed performance comparisons, we highlight emerging trends and the significant advancements achieved by these foundation models. Additionally, we discuss the technical challenges, practical implications, and future research directions, addressing the need for high-quality data, computational resources, and improved model generalization. Our research also finds that pre-training methods, particularly self-supervised learning techniques like contrastive learning and masked autoencoders, significantly enhance the performance and robustness of foundation models in remote sensing tasks such as scene classification, object detection, and other applications. This survey aims to serve as a resource for researchers and practitioners by providing a panorama of advances and promising pathways for continued development and application of foundation models in remote sensing.
https://arxiv.org/abs/2408.03464
Scene understanding plays an important role in several high-level computer vision applications, such as autonomous vehicles, intelligent video surveillance, or robotics. However, too few solutions have been proposed for indoor/outdoor scene classification to ensure scene context adaptability for computer vision frameworks. We propose the first Lightweight Hybrid Graph Convolutional Neural Network (LH-GCNN)-CNN framework as an add-on to object detection models. The proposed approach uses the output of the CNN object detection model to predict the observed scene type by generating a coherent GCNN representing the semantic and geometric content of the observed scene. This new method, applied to natural scenes, achieves an efficiency of over 90% for scene classification in a COCO-derived dataset containing a large number of different scenes, while requiring fewer parameters than traditional CNN methods. For the benefit of the scientific community, we will make the source code publicly available: this https URL.
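A minimal sketch of the idea, assuming detector outputs as input: nodes carry semantic (class probabilities) and geometric (box) features, nearby objects are linked, and one GCN layer plus pooling yields scene logits. The node layout, adjacency rule, and weight matrices w1, w2 are illustrative assumptions, not the LH-GCNN design.

```python
import torch

def scene_graph_logits(boxes, class_probs, w1, w2, dist_thresh=0.25):
    """Classify a scene from detected objects with one GCN layer.

    boxes: (N, 4) normalized [cx, cy, w, h]; class_probs: (N, C);
    w1: (C + 4, H) and w2: (H, S) are learnable weight matrices.
    """
    x = torch.cat([class_probs, boxes], dim=1)           # semantic + geometric
    centers = boxes[:, :2]
    dists = torch.cdist(centers, centers)
    adj = (dists < dist_thresh).float()                  # nearby objects linked
    deg = adj.sum(1, keepdim=True).clamp(min=1)
    h = torch.relu(((adj @ x) / deg) @ w1)               # one GCN layer
    return h.mean(0) @ w2                                # pooled scene logits
```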
https://arxiv.org/abs/2407.14658
Olfaction, often overlooked in cultural heritage studies, holds profound significance in shaping human experiences and identities. Examining historical depictions of olfactory scenes can offer valuable insights into the role of smells in history. We show that a transfer-learning approach using weakly labeled training data can remarkably improve the classification of fragrant spaces and, more generally, artistic scene depictions. We fine-tune Places365-pre-trained models by querying two cultural heritage data sources and using the search terms as supervision signal. The models are evaluated on two manually corrected test splits. This work lays a foundation for further exploration of fragrant spaces recognition and artistic scene classification. All images and labels are released as the ArtPlaces dataset at this https URL.
https://arxiv.org/abs/2407.11701
In this paper, we propose a method for online domain-incremental learning of acoustic scene classification from a sequence of different locations. Simply training a deep learning model on a sequence of different locations leads to forgetting of previously learned knowledge. In this work, we only correct the statistics of the Batch Normalization layers of a model using a few samples to learn the acoustic scenes from a new location without any excessive training. Experiments are performed on acoustic scenes from 11 different locations, with an initial task containing acoustic scenes from 6 locations and the remaining 5 incremental tasks each representing the acoustic scenes from a different location. The proposed approach outperforms fine-tuning based methods and achieves an average accuracy of 48.8% after learning the last task in sequence without forgetting acoustic scenes from the previously learned locations.
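A sketch of the statistics-only adaptation, assuming a PyTorch model: BatchNorm running statistics are re-estimated on a few samples from the new location while all learned parameters stay frozen. The momentum value is illustrative.

```python
import torch

def adapt_bn_statistics(model, few_shot_loader, momentum=0.1):
    """Correct BN running statistics on a few target-location samples;
    no parameters are trained."""
    model.train()                                    # BN updates running stats
    bn_types = (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d,
                torch.nn.BatchNorm3d)
    for m in model.modules():
        if isinstance(m, bn_types):
            m.momentum = momentum
    for p in model.parameters():
        p.requires_grad_(False)                      # nothing is learned
    with torch.no_grad():
        for x, _ in few_shot_loader:
            model(x)                                 # forward passes only
    model.eval()
    return model
```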
https://arxiv.org/abs/2406.13386
This work presents an improved version of the system that we submitted to Task 1 of the DCASE 2023 challenge. We propose a low-complexity acoustic scene classification method based on a parallel attention-convolution network consisting of four modules: pre-processing, fusion, and global and local contextual information extraction. The proposed network is computationally efficient in capturing global and local contextual information from each audio clip. In addition, we integrate other techniques into our method, such as knowledge distillation, data augmentation, and adaptive residual normalization. When evaluated on the official dataset of the DCASE 2023 challenge, our method obtains the highest accuracy of 56.10% with 5.21 K parameters and 1.44 million multiply-accumulate operations. It exceeds the top two systems of the DCASE 2023 challenge in both accuracy and complexity, achieving a state-of-the-art result. Code is at: this https URL.
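A hedged sketch of a parallel attention-convolution block: a convolution branch captures local context, a self-attention branch captures global context, and the two are fused by addition. Channel and head counts are illustrative, and the paper's exact module design may differ; the real system also includes the pre-processing and fusion modules omitted here.

```python
import torch
import torch.nn as nn

class ParallelAttnConv(nn.Module):
    """Parallel local (conv) and global (self-attention) context extraction."""
    def __init__(self, channels=32, heads=4):
        super().__init__()
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU())
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, F, T) spectrogram map
        b, c, f, t = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, F*T, C)
        glob, _ = self.attn(tokens, tokens, tokens)
        glob = glob.transpose(1, 2).reshape(b, c, f, t)
        return self.local(x) + glob            # fuse local and global context
```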
https://arxiv.org/abs/2406.08119
We propose noise-robust voice conversion (VC) which takes into account the recording quality and environment of noisy source speech. Conventional denoising training improves the noise robustness of a VC model by learning a noisy-to-clean VC process. However, the naturalness of the converted speech is limited when the noise of the source speech is unseen during training. To this end, our proposed training conditions a VC model on two latent variables representing the recording quality and environment of the source speech. These latent variables are derived from deep neural networks pre-trained on recording quality assessment and acoustic scene classification and are calculated in an utterance-wise or frame-wise manner. As a result, the trained VC model can explicitly learn information about speech degradation during training. Objective and subjective evaluations show that our training improves the quality of the converted speech compared to the conventional training.
https://arxiv.org/abs/2406.07280
An increasing number of models have achieved strong performance on remote sensing tasks with the recent development of Large Language Models (LLMs) and Visual Language Models (VLMs). However, these models are constrained to basic vision-and-language instruction-tuning tasks and face challenges in complex remote sensing applications. Additionally, they lack specialized expertise in professional domains. To address these limitations, we propose an LLM-driven remote sensing intelligent agent named RS-Agent. Firstly, RS-Agent is powered by a large language model (LLM) that acts as its "Central Controller," enabling it to understand and respond to various problems intelligently. Secondly, our RS-Agent integrates many high-performance remote sensing image processing tools, facilitating multi-tool and multi-turn conversations. Thirdly, our RS-Agent can answer professional questions by leveraging robust knowledge documents. We conducted experiments using several datasets, e.g., RSSDIVCS, RSVQA, and DOTAv1. The experimental results demonstrate that RS-Agent delivers outstanding performance on many tasks, including scene classification, visual question answering, and object counting.
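A minimal sketch of the "Central Controller" pattern: the LLM picks a tool by name, a registry dispatches it, and the LLM composes the final answer from the tool output. The llm() callable and both placeholder tools are assumptions for illustration, not RS-Agent's actual API.

```python
from typing import Callable, Dict

# Tool registry: each entry is a callable remote sensing tool (placeholders).
TOOLS: Dict[str, Callable[[str], str]] = {
    "scene_classification": lambda img: f"classified({img})",
    "object_counting": lambda img: f"counted({img})",
}

def agent_step(user_query: str, image_path: str,
               llm: Callable[[str], str]) -> str:
    """One controller turn: choose a tool, run it, compose the answer."""
    choice = llm(f"Pick one tool from {list(TOOLS)} for: {user_query}").strip()
    tool_output = TOOLS.get(choice, TOOLS["scene_classification"])(image_path)
    return llm(f"Question: {user_query}\nTool result: {tool_output}\nAnswer:")
```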
https://arxiv.org/abs/2406.07089
Since the launch of the Sentinel-2 (S2) satellites, many ML models have used the data for diverse applications. The scene classification layer (SCL) inside the S2 product provides rich information for training, such as filtering out images with high cloud coverage. However, there is more potential in it. We propose a technique to assess the clean optical coverage of a region, expressed as a satellite image time series (SITS) and calculated from the S2-based SCL data. With a manual threshold and specific labels in the SCL, the proposed technique assigns a percentage of spatial and temporal coverage across the time series and a high/low assessment. By evaluating the AI4EO challenge for Enhanced Agriculture, we show that the assessment is correlated with the predictive results of ML models: classification results in a region with low spatial and temporal coverage are worse than in a region with high coverage. Finally, we applied the technique across all continents of the global dataset LandCoverNet.
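A sketch of the coverage assessment over an SCL time series, assuming NumPy. The set of "clean" SCL classes and the 0.7 threshold are assumptions standing in for the paper's manual threshold and specific labels (ESA SCL codes: 4 vegetation, 5 not vegetated, 6 water).

```python
import numpy as np

CLEAN_CLASSES = (4, 5, 6)   # assumed "clean" SCL labels

def coverage_assessment(scl_series: np.ndarray, threshold=0.7):
    """scl_series: (T, H, W) integer SCL maps across a time series (SITS).

    Returns per-timestep spatial coverage, overall temporal coverage, and
    a high/low assessment.
    """
    clean = np.isin(scl_series, CLEAN_CLASSES)
    spatial = clean.mean(axis=(1, 2))                # clean fraction per date
    temporal = float((spatial >= threshold).mean())  # fraction of clean dates
    return spatial, temporal, "high" if temporal >= threshold else "low"
```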
https://arxiv.org/abs/2406.18584
Current datasets for vehicular applications are mostly collected in North America or Europe. Models trained or evaluated on these datasets might suffer from geographical bias when deployed in other regions. Specifically, for scene classification, a highway in a Latin American country differs drastically from an Autobahn, for example, both in design and maintenance levels. We propose VWise, a novel benchmark for road-type classification and scene classification tasks, in addition to tasks focused on external contexts related to vehicular applications in LatAm. We collected over 520 video clips covering diverse urban and rural environments across Latin American countries, annotated with six classes of road types. We also evaluated several state-of-the-art classification models in baseline experiments, obtaining over 84% accuracy. With this dataset, we aim to enhance research on vehicular tasks in Latin America.
https://arxiv.org/abs/2406.03273
A continual learning (CL) model is desired for remote sensing image analysis because of varying camera parameters, spectral ranges, resolutions, etc. There have been some recent initiatives to develop CL techniques in this domain, but they still depend on massive labelled samples, which does not fully fit remote sensing applications because ground truths are often obtained via field-based surveys. This paper addresses this problem by proposing an unsupervised flat-wide learning approach (UNISA) for unsupervised few-shot continual learning of remote sensing image scene classification that does not depend on any labelled samples for its model updates. UNISA is built on the idea of prototype scattering and positive sampling for learning representations, while the catastrophic forgetting problem is tackled with the flat-wide learning approach combined with a ball generator to address the data scarcity problem. Our numerical study with remote sensing image scene datasets and a hyperspectral dataset confirms the advantages of our solution. The source code of UNISA is shared publicly at this https URL to allow convenient future studies and reproduction of our numerical results.
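A hedged sketch of one plausible ball generator: synthetic feature points drawn uniformly inside an epsilon-ball around each class prototype to counter data scarcity. The radius and sample count are illustrative assumptions, not UNISA's actual parameters.

```python
import numpy as np

def ball_generator(prototypes: np.ndarray, n_per_proto=20, radius=0.1):
    """Sample synthetic features uniformly inside a ball around prototypes.

    prototypes: (P, D) class prototype vectors in feature space.
    """
    protos = np.repeat(prototypes, n_per_proto, axis=0)        # (P*n, D)
    d = prototypes.shape[1]
    directions = np.random.randn(len(protos), d)
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # r^(1/d) scaling gives a uniform distribution inside the ball.
    radii = radius * np.random.rand(len(protos), 1) ** (1.0 / d)
    labels = np.repeat(np.arange(len(prototypes)), n_per_proto)
    return protos + radii * directions, labels
```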
https://arxiv.org/abs/2406.18574
In this research, we propose the first approach for integrating the Kolmogorov-Arnold Network (KAN) with various pre-trained Convolutional Neural Network (CNN) models for remote sensing (RS) scene classification tasks using the EuroSAT dataset. Our novel methodology, named KCN, aims to replace traditional Multi-Layer Perceptrons (MLPs) with KAN to enhance classification performance. We employed multiple CNN-based models, including VGG16, MobileNetV2, EfficientNet, ConvNeXt, ResNet101, and Vision Transformer (ViT), and evaluated their performance when paired with KAN. Our experiments demonstrated that KAN achieved high accuracy with fewer training epochs and parameters. Specifically, ConvNeXt paired with KAN showed the best performance, achieving 94% accuracy in the first epoch, which increased to 96% and remained consistent across subsequent epochs. The results indicated that KAN and MLP both achieved similar accuracy, with KAN performing slightly better in later epochs. By utilizing the EuroSAT dataset, we provided a robust testbed to investigate whether KAN is suitable for remote sensing classification tasks. Given that KAN is a novel algorithm, there is substantial capacity for further development and optimization, suggesting that KCN offers a promising alternative for efficient image analysis in the RS field.
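A hedged sketch of swapping a CNN's MLP head for a KAN-style layer. For brevity, the learnable 1-D edge functions below use radial basis functions rather than the splines of the original KAN, so this is a simplified stand-in, not the paper's KCN implementation; the input/output dimensions in the usage note are hypothetical.

```python
import torch
import torch.nn as nn

class SimpleKANLayer(nn.Module):
    """Simplified KAN layer: each input feature passes through a learnable
    1-D function (a weighted RBF expansion), and per-edge contributions are
    summed for each output unit."""
    def __init__(self, in_dim, out_dim, n_basis=8):
        super().__init__()
        self.grid = nn.Parameter(torch.linspace(-2, 2, n_basis),
                                 requires_grad=False)   # fixed RBF centers
        self.coef = nn.Parameter(torch.randn(in_dim, out_dim, n_basis) * 0.1)

    def forward(self, x):                               # x: (B, in_dim)
        basis = torch.exp(-((x.unsqueeze(-1) - self.grid) ** 2))  # (B, in, K)
        return torch.einsum('bik,iok->bo', basis, self.coef)

# usage sketch (hypothetical dimensions): replace a backbone's classifier,
# e.g. backbone.classifier = SimpleKANLayer(in_dim=768, out_dim=10)
```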
https://arxiv.org/abs/2406.00600
The development of supervised deep learning-based methods for multi-label scene classification (MLC) is one of the prominent research directions in remote sensing (RS). Yet, collecting annotations for large RS image archives is time-consuming and costly. To address this issue, several data augmentation methods have been introduced in RS. Among others, the data augmentation technique CutMix, which combines parts of two existing training images to generate an augmented image, stands out as a particularly effective approach. However, the direct application of CutMix in RS MLC can lead to the erasure or addition of class labels (i.e., label noise) in the augmented (i.e., combined) training image. To address this problem, we introduce a label propagation (LP) strategy that allows the effective application of CutMix in the context of MLC problems in RS without being affected by label noise. To this end, our proposed LP strategy exploits pixel-level class positional information to update the multi-label of the augmented training image. We propose to access such class positional information from reference maps associated with each training image (e.g., thematic products) or, if no reference maps are available, from class explanation masks provided by an explanation method. Similarly to pairing two training images, our LP strategy carries out a pairing operation on the associated pixel-level class positional information to derive the updated multi-label for the augmented image. Experimental results show the effectiveness of our LP strategy in general, and its robustness under various simulated and real scenarios with noisy class positional information in particular.
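A minimal sketch of CutMix with the LP idea, assuming NumPy and integer pixel-level class maps: the pasted box takes image B's pixels and class map, and the multi-label is rebuilt from the classes actually present in the combined map. Box sampling follows standard CutMix; the details are illustrative, not the paper's exact procedure.

```python
import numpy as np

def cutmix_with_label_propagation(img_a, map_a, img_b, map_b, n_classes):
    """CutMix two images and update the multi-label from their paired
    pixel-level class maps (reference maps or explanation masks).

    img_*: (H, W, C) images; map_*: (H, W) maps with valid class indices.
    """
    h, w = map_a.shape
    lam = np.random.beta(1.0, 1.0)
    ch, cw = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    y0, x0 = np.random.randint(0, h - ch + 1), np.random.randint(0, w - cw + 1)

    img = img_a.copy()
    img[y0:y0 + ch, x0:x0 + cw] = img_b[y0:y0 + ch, x0:x0 + cw]
    cls_map = map_a.copy()
    cls_map[y0:y0 + ch, x0:x0 + cw] = map_b[y0:y0 + ch, x0:x0 + cw]

    label = np.zeros(n_classes, dtype=np.float32)
    label[np.unique(cls_map)] = 1.0          # classes actually present
    return img, label
```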
https://arxiv.org/abs/2405.13451
This article describes the Data-Efficient Low-Complexity Acoustic Scene Classification Task in the DCASE 2024 Challenge and the corresponding baseline system. The task setup is a continuation of previous editions (2022 and 2023), which focused on recording device mismatches and low-complexity constraints. This year's edition introduces an additional real-world problem: participants must develop data-efficient systems for five scenarios, which progressively limit the available training data. The provided baseline system is based on an efficient, factorized CNN architecture constructed from inverted residual blocks and uses Freq-MixStyle to tackle the device mismatch problem. The baseline system's accuracy ranges from 42.40% on the smallest to 56.99% on the largest training set.
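A hedged sketch of Freq-MixStyle, assuming PyTorch: per-frequency-bin statistics are mixed between random pairs in the batch to blur device-specific characteristics. The alpha and application probability follow common settings but are assumptions here, and the baseline's exact implementation may differ.

```python
import torch

def freq_mixstyle(x, alpha=0.3, p=0.7):
    """Mix per-frequency statistics across the batch.

    x: (B, C, F, T) spectrograms.
    """
    if torch.rand(1).item() > p:
        return x
    mu = x.mean(dim=(1, 3), keepdim=True)          # stats per frequency bin
    sig = x.std(dim=(1, 3), keepdim=True) + 1e-6
    x_norm = (x - mu) / sig
    lam = torch.distributions.Beta(alpha, alpha).sample((x.size(0), 1, 1, 1))
    perm = torch.randperm(x.size(0))
    mu_mix = lam * mu + (1 - lam) * mu[perm]
    sig_mix = lam * sig + (1 - lam) * sig[perm]
    return x_norm * sig_mix + mu_mix
```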
https://arxiv.org/abs/2405.10018
Portrait images typically consist of a salient person against diverse backgrounds. With the development of mobile devices and image processing techniques, users can conveniently capture portrait images anytime and anywhere. However, the quality of these portraits may suffer from degradation caused by unfavorable environmental conditions, subpar photography techniques, and inferior capturing devices. In this paper, we introduce a dual-branch network for portrait image quality assessment (PIQA), which can effectively address how the salient person and the background of a portrait image influence its visual quality. Specifically, we utilize two backbone networks (i.e., Swin Transformer-B) to extract quality-aware features from the entire portrait image and the facial image cropped from it. To enhance the quality-aware feature representation of the backbones, we pre-train them on the large-scale video quality assessment dataset LSVQ and the large-scale facial image quality assessment dataset GFIQA. Additionally, we leverage LIQE, an image scene classification and quality assessment model, to capture quality-aware and scene-specific features as auxiliary features. Finally, we concatenate these features and regress them into quality scores via a multilayer perceptron (MLP). We employ the fidelity loss to train the model in a learning-to-rank manner to mitigate inconsistencies in quality scores in the portrait image quality assessment dataset PIQ. Experimental results demonstrate that the proposed model achieves superior performance on the PIQ dataset, validating its effectiveness. The code is available at this https URL.
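A hedged sketch of the pairwise fidelity loss used for learning-to-rank training. Deriving preference probabilities from score differences with a sigmoid is an assumption for illustration; some formulations use a Gaussian CDF instead.

```python
import torch

def fidelity_loss(pred_i, pred_j, mos_i, mos_j, eps=1e-8):
    """Pairwise fidelity loss over batches of predicted scores and
    ground-truth mean opinion scores (MOS)."""
    p_hat = torch.sigmoid(pred_i - pred_j)      # predicted P(i ranked above j)
    p = (mos_i > mos_j).float()                 # ground-truth preference
    return (1 - torch.sqrt(p * p_hat + eps)
              - torch.sqrt((1 - p) * (1 - p_hat) + eps)).mean()
```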
https://arxiv.org/abs/2405.08555