Remote sensing scene classification (RSSC) is a critical task with diverse applications in land use and resource management. While unimodal image-based approaches show promise, they often struggle with limitations such as high intra-class variance and inter-class similarity. Incorporating textual information can enhance classification by providing additional context and semantic understanding, but manual text annotation is labor-intensive and costly. In this work, we propose a novel RSSC framework that integrates text descriptions generated by large vision-language models (VLMs) as an auxiliary modality without incurring expensive manual annotation costs. To fully leverage the latent complementarities between visual and textual data, we propose a dual cross-attention-based network to fuse these modalities into a unified representation. Extensive experiments with both quantitative and qualitative evaluation across five RSSC datasets demonstrate that our framework consistently outperforms baseline models. We also verify the effectiveness of VLM-generated text descriptions compared to human-annotated descriptions. Additionally, we design a zero-shot classification scenario to show that the learned multimodal representation can be effectively utilized for unseen class classification. This research opens new opportunities for leveraging textual information in RSSC tasks and provides a promising multimodal fusion structure, offering insights and inspiration for future studies. Code is available at: this https URL
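As an illustration of the dual cross-attention fusion described above, here is a minimal sketch of how image and VLM-generated text features could be fused; the module layout, token counts, and the 45-class head are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class DualCrossAttentionFusion(nn.Module):
    """Image tokens attend to text tokens and vice versa; the two attended
    streams are mean-pooled and concatenated for scene classification."""
    def __init__(self, dim=512, num_heads=8, num_classes=45):
        super().__init__()
        self.img_to_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, img_tokens, txt_tokens):
        # image queries attend over text keys/values, and the reverse
        img_ctx, _ = self.img_to_txt(img_tokens, txt_tokens, txt_tokens)
        txt_ctx, _ = self.txt_to_img(txt_tokens, img_tokens, img_tokens)
        fused = torch.cat([img_ctx.mean(dim=1), txt_ctx.mean(dim=1)], dim=-1)
        return self.classifier(fused)

# toy usage: 196 visual tokens and 32 text tokens per image, batch of 4
logits = DualCrossAttentionFusion()(torch.randn(4, 196, 512), torch.randn(4, 32, 512))
```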
https://arxiv.org/abs/2412.02531
Countries in South Asia experience many catastrophic flooding events regularly. Through image classification, it is possible to expedite search and rescue initiatives by classifying flood zones, including houses and humans. We create a new dataset collecting aerial imagery of flooding events across South Asian countries. For the classification, we propose a fine-tuned Compact Convolutional Transformer (CCT)-based approach and evaluate several other cutting-edge transformer-based and Convolutional Neural Network (CNN)-based architectures. We also implement the YOLOv8 object detection model to detect houses and humans within the imagery of our proposed dataset, and then compare its performance with our classification-based approach. Since the countries of South Asia share similar topography, housing structures, flood-water color, and vegetation, this work is particularly applicable to this region rather than the rest of the world. The images are divided evenly into four classes: 'flood', 'flood with domicile', 'flood with humans', and 'no flood'. On our proposed dataset, our fine-tuned CCT model, which has comparatively fewer weight parameters than many other transformer-based architectures designed for computer vision, achieves an accuracy of 98.62% and a macro-average precision of 98.50%. The other transformer-based architectures that we implement are the Vision Transformer (ViT), Swin Transformer, and External Attention Transformer (EANet), which achieve accuracies of 88.66%, 84.74%, and 66.56%, respectively. We also implement DCECNN (Deep Custom Ensembled Convolutional Neural Network), a custom ensemble model created by combining MobileNet, InceptionV3, and EfficientNetB0, and obtain an accuracy of 98.78%. The architectures we implement are fine-tuned to achieve optimal performance on our dataset.
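A rough sketch of an averaging ensemble in the spirit of DCECNN follows; the torchvision backbones (the paper used Keras, and MobileNetV3 stands in for its MobileNet variant), the probability-averaging fusion, and the 299x299 input size are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
from torchvision import models

def make_backbone(name, num_classes=4):
    """Swap the final layer of a torchvision backbone for the 4 flood classes.
    In practice ImageNet-pretrained weights would be loaded before fine-tuning."""
    if name == "mobilenet_v3":
        m = models.mobilenet_v3_large(weights=None)
        m.classifier[-1] = nn.Linear(m.classifier[-1].in_features, num_classes)
    elif name == "efficientnet_b0":
        m = models.efficientnet_b0(weights=None)
        m.classifier[-1] = nn.Linear(m.classifier[-1].in_features, num_classes)
    else:  # "inception_v3"
        m = models.inception_v3(weights=None, aux_logits=False, init_weights=True)
        m.fc = nn.Linear(m.fc.in_features, num_classes)
    return m

class FloodEnsemble(nn.Module):
    """Averages the class probabilities of the three members at inference time."""
    def __init__(self):
        super().__init__()
        names = ("mobilenet_v3", "efficientnet_b0", "inception_v3")
        self.members = nn.ModuleList([make_backbone(n) for n in names])

    def forward(self, x):
        probs = [torch.softmax(m(x), dim=-1) for m in self.members]
        return torch.stack(probs).mean(dim=0)

# usage: 299x299 inputs satisfy all three backbones; run in eval mode for inference
ensemble = FloodEnsemble().eval()
with torch.no_grad():
    preds = ensemble(torch.randn(2, 3, 299, 299)).argmax(dim=-1)
```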
https://arxiv.org/abs/2411.00169
XyloAudio is a line of ultra-low-power audio inference chips, designed for in- and near-microphone analysis of audio in real-time, energy-constrained scenarios. Xylo is designed around a highly efficient integer-logic processor which simulates parameter- and activity-sparse spiking neural networks (SNNs) using a leaky integrate-and-fire (LIF) neuron model. Neurons on Xylo are quantised integer devices operating in synchronous digital CMOS, with neuron and synapse state quantised to 16 bit and weight parameters quantised to 8 bit. Xylo is tailored for real-time streaming operation, as opposed to the accelerated-time operation of an inference accelerator. XyloAudio includes a low-power audio encoding interface for direct connection to a microphone, designed for sparse encoding of incident audio for further processing by the inference core. In this report we present the results of deploying the DCASE 2020 acoustic scene classification benchmark dataset to XyloAudio 2. We describe the benchmark dataset, the audio preprocessing approach, and the network architecture and training approach. We present the performance of the trained model, along with power and latency measurements performed on the XyloAudio 2 development kit. This benchmark was conducted as part of the Neurobench project.
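To make the neuron model concrete, here is a minimal integer LIF update in the spirit described above; the decay-by-division scheme, threshold value, and bit widths are illustrative assumptions rather than Xylo's exact on-chip arithmetic.

```python
import numpy as np

def lif_step(v, syn, spikes_in, w, tau_mem=20, tau_syn=10, threshold=1 << 12):
    """One synchronous timestep of an integer LIF layer, loosely mimicking
    Xylo-style fixed-point arithmetic (illustrative, not bit-exact).
    v, syn: int32 state vectors of length n_out; w: int8 (n_out, n_in) weights;
    spikes_in: 0/1 integer vector of length n_in."""
    syn = syn - syn // tau_syn + w.astype(np.int32) @ spikes_in   # leaky synaptic current + input spikes
    v = v - v // tau_mem + syn                                    # leaky membrane potential + current
    spikes_out = (v >= threshold).astype(np.int32)
    v = np.where(spikes_out == 1, v - threshold, v)               # subtract threshold on spike
    # clip state to the signed 16-bit range of on-chip state registers
    v = np.clip(v, -(1 << 15), (1 << 15) - 1).astype(np.int32)
    syn = np.clip(syn, -(1 << 15), (1 << 15) - 1).astype(np.int32)
    return v, syn, spikes_out

# toy usage: 16 input channels, 8 neurons, random int8 weights
rng = np.random.default_rng(0)
w = rng.integers(-128, 127, size=(8, 16), dtype=np.int8)
v = np.zeros(8, dtype=np.int32)
syn = np.zeros(8, dtype=np.int32)
v, syn, out = lif_step(v, syn, rng.integers(0, 2, size=16), w)
```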
https://arxiv.org/abs/2410.23776
The goal of the acoustic scene classification (ASC) task is to classify recordings into one of the predefined acoustic scene classes. However, in real-world scenarios, ASC systems often encounter challenges such as recording device mismatch, low-complexity constraints, and the limited availability of labeled data. To alleviate these issues, in this paper, a data-efficient and low-complexity ASC system is built with a new model architecture and better training strategies. Specifically, we first design a new low-complexity architecture named Rep-Mobile by integrating multi-convolution branches which can be reparameterized at inference. Compared to other models, it achieves better performance with lower computational complexity. Then we apply a knowledge distillation strategy and compare the data efficiency of teacher models with different architectures. Finally, we propose a progressive pruning strategy, which prunes the model multiple times in small amounts and yields better performance than single-step pruning. Experiments are conducted on the TAU dataset. With Rep-Mobile and these training strategies, our proposed ASC system achieves state-of-the-art (SOTA) results, while also winning first place by a significant margin in the DCASE2024 Challenge.
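The core idea of an inference-time reparameterisable multi-convolution block can be sketched as follows; this is a generic RepVGG-style merge offered as an illustration, and Rep-Mobile's actual branch design is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiBranchConv(nn.Module):
    """Training-time block with parallel 3x3, 1x1, and identity branches that
    can be folded into a single 3x3 convolution for inference."""
    def __init__(self, ch):
        super().__init__()
        self.conv3 = nn.Conv2d(ch, ch, 3, padding=1, bias=True)
        self.conv1 = nn.Conv2d(ch, ch, 1, bias=True)

    def forward(self, x):
        return F.relu(self.conv3(x) + self.conv1(x) + x)

    def reparameterize(self):
        """Fold the 1x1 branch and the identity into the 3x3 kernel."""
        w = self.conv3.weight.data.clone()
        b = self.conv3.bias.data.clone()
        w += F.pad(self.conv1.weight.data, [1, 1, 1, 1])   # 1x1 branch padded to 3x3
        b += self.conv1.bias.data
        ch = w.shape[0]
        ident = torch.zeros_like(w)                         # identity branch as a 3x3 kernel
        ident[torch.arange(ch), torch.arange(ch), 1, 1] = 1.0
        w += ident
        merged = nn.Conv2d(ch, ch, 3, padding=1, bias=True)
        merged.weight.data, merged.bias.data = w, b
        return merged                                       # identical pre-activation outputs

# quick check: the merged conv matches the multi-branch block before the ReLU
block = MultiBranchConv(8).eval()
x = torch.randn(1, 8, 32, 32)
merged = block.reparameterize()
assert torch.allclose(block.conv3(x) + block.conv1(x) + x, merged(x), atol=1e-5)
```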
https://arxiv.org/abs/2410.20775
Large vision and language assistants have enabled new capabilities for interpreting natural images. These approaches have recently been adapted to earth observation data, but they are only able to handle single image inputs, limiting their use for many real-world tasks. In this work, we develop a new vision and language assistant called TEOChat that can engage in conversations about temporal sequences of earth observation data. To train TEOChat, we curate an instruction-following dataset composed of many single image and temporal tasks including building change and damage assessment, semantic change detection, and temporal scene classification. We show that TEOChat can perform a wide variety of spatial and temporal reasoning tasks, substantially outperforming previous vision and language assistants, and even achieving comparable or better performance than specialist models trained to perform these specific tasks. Furthermore, TEOChat achieves impressive zero-shot performance on a change detection and change question answering dataset, outperforms GPT-4o and Gemini 1.5 Pro on multiple temporal tasks, and exhibits stronger single image capabilities than a comparable single EO image instruction-following model. We publicly release our data, models, and code at this https URL .
https://arxiv.org/abs/2410.06234
Scene recognition, particularly for aerial and underwater images, often suffers from various types of degradation, such as blurring or overexposure. Previous works based on convolutional neural networks have been shown to extract panoramic semantic features and perform well on scene recognition tasks. However, low-quality images still impede model performance due to the inappropriate use of high-level semantic features. To address these challenges, we propose an adaptive selection mechanism to identify the most important and robust regions of high-level features. Thus, the model can learn from these regions and avoid interference. We implement a learnable mask in the neural network, which filters high-level features by assigning weights to different regions of the feature matrix. We also introduce a regularization term to further enhance the significance of key high-level feature regions. Different from previous methods, our learnable matrix pays extra attention to regions that are important to multiple categories but may cause misclassification, and sets constraints to reduce the influence of such regions. This is a plug-and-play architecture that can be easily extended to other methods. Additionally, we construct an Underwater Geological Scene Classification dataset to assess the effectiveness of our model. Extensive experimental results demonstrate the superiority and robustness of our proposed method over state-of-the-art techniques on two datasets.
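A minimal sketch of a learnable mask over high-level features with an accompanying regularisation term, in the spirit of the description above; the sigmoid parameterisation, L1-style penalty, and mean-pooled head are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class MaskedFeatureHead(nn.Module):
    """Illustrative plug-in head: a learnable spatial mask re-weights the
    high-level feature map before pooling, and an L1 penalty on the mask
    discourages reliance on unreliable regions."""
    def __init__(self, channels, h, w, num_classes, reg_weight=1e-3):
        super().__init__()
        self.mask_logits = nn.Parameter(torch.zeros(1, 1, h, w))
        self.fc = nn.Linear(channels, num_classes)
        self.reg_weight = reg_weight

    def forward(self, feat):                      # feat: (B, C, H, W) high-level features
        mask = torch.sigmoid(self.mask_logits)    # weight in (0, 1) per spatial location
        pooled = (feat * mask).mean(dim=(2, 3))   # masked global average pooling
        reg = self.reg_weight * mask.abs().mean() # regularisation term added to the task loss
        return self.fc(pooled), reg

# toy usage on a 7x7 backbone feature map with 512 channels and 10 classes
logits, reg = MaskedFeatureHead(512, 7, 7, 10)(torch.randn(2, 512, 7, 7))
```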
https://arxiv.org/abs/2409.14741
In this technical report, we describe the SNTL-NTU team's submission for Task 1, Data-Efficient Low-Complexity Acoustic Scene Classification, of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 challenge. Three systems are introduced to tackle training splits of different sizes. For small training splits, we explored reducing the complexity of the provided baseline model by reducing the number of base channels. We introduce data augmentation in the form of mixup to increase the diversity of training samples. For the larger training splits, we use FocusNet to provide confusing-class information to an ensemble of multiple Patchout faSt Spectrogram Transformer (PaSST) models and baseline models trained at the original sampling rate of 44.1 kHz. We use knowledge distillation to distill the ensemble model into the baseline student model. Training the systems on the TAU Urban Acoustic Scene 2022 Mobile development dataset yielded the highest average testing accuracies of (62.21, 59.82, 56.81, 53.03, 47.97)% on the (100, 50, 25, 10, 5)% splits, respectively, across the three systems.
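For the two training ingredients named above, the following is a compact sketch of mixup augmentation and soft-target knowledge distillation; the temperature, weighting, and combination with hard labels are illustrative defaults, and FocusNet's confusing-class mechanism is not reproduced here.

```python
import torch
import torch.nn.functional as F

def mixup(x, y, alpha=0.3, num_classes=10):
    """Mixup: convex combinations of pairs of samples and their one-hot labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[idx]
    y_onehot = F.one_hot(y, num_classes).float()
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[idx]
    return x_mix, y_mix

def distillation_loss(student_logits, teacher_logits, y_mix, T=2.0, kd_weight=0.5):
    """Soft-target distillation from an (ensemble) teacher plus a hard-label term."""
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    ce = -(y_mix * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()
    return kd_weight * kd + (1 - kd_weight) * ce
```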
https://arxiv.org/abs/2409.11964
This paper proposes a method for unsupervised whole-image clustering of a target dataset of remote sensing scenes with no labels. The method consists of three main steps: (1) fine-tuning a pretrained deep neural network (DINOv2) on a labelled source remote sensing imagery dataset and using it to extract a feature vector from each image in the target dataset, (2) reducing the dimension of these deep features via manifold projection into a low-dimensional Euclidean space, and (3) clustering the embedded features using a Bayesian nonparametric technique to infer the number and membership of clusters simultaneously. The method takes advantage of heterogeneous transfer learning to cluster unseen data with different feature and label distributions. We demonstrate that this approach outperforms state-of-the-art zero-shot classification methods on several remote sensing scene classification datasets.
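Steps (2) and (3) of the pipeline could look roughly like the sketch below, assuming UMAP for the manifold projection and a Dirichlet-process Gaussian mixture for the Bayesian nonparametric clustering; the specific libraries and hyperparameters are stand-ins, not necessarily the authors' choices.

```python
import numpy as np
import umap                                   # pip install umap-learn
from sklearn.mixture import BayesianGaussianMixture

def cluster_scene_features(features: np.ndarray, max_clusters: int = 50):
    """Project deep features to a low-dimensional space, then fit a
    Dirichlet-process mixture that infers the effective number of clusters."""
    embedded = umap.UMAP(n_components=10, metric="cosine").fit_transform(features)
    dpgmm = BayesianGaussianMixture(
        n_components=max_clusters,                        # upper bound on clusters
        weight_concentration_prior_type="dirichlet_process",
        covariance_type="full",
        max_iter=500,
    ).fit(embedded)
    labels = dpgmm.predict(embedded)
    return labels, len(np.unique(labels))                 # inferred cluster count <= max_clusters
```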
https://arxiv.org/abs/2409.03938
Vision-Language Models for remote sensing have shown promising uses thanks to their extensive pretraining. However, their conventional usage in zero-shot scene classification methods still involves dividing large images into patches and making independent predictions, i.e., inductive inference, thereby limiting their effectiveness by ignoring valuable contextual information. Our approach tackles this issue by utilizing initial predictions based on text prompting and patch affinity relationships from the image encoder to enhance zero-shot capabilities through transductive inference, all without the need for supervision and at a minor computational cost. Experiments on 10 remote sensing datasets with state-of-the-art Vision-Language Models demonstrate significant accuracy improvements over inductive zero-shot classification. Our source code is publicly available on Github: this https URL
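The transductive idea can be illustrated with a generic label-propagation sketch: initial text-prompt probabilities for each patch are smoothed over a kNN affinity graph built from patch embeddings. This is a simplified stand-in, not the authors' exact algorithm or hyperparameters.

```python
import torch

def transductive_refine(patch_feats, text_probs, num_iters=10, alpha=0.7, k=8):
    """patch_feats: (N, D) image-encoder patch embeddings;
    text_probs: (N, C) initial zero-shot probabilities from text prompting.
    Returns refined (N, C) probabilities after propagation over the affinity graph."""
    f = torch.nn.functional.normalize(patch_feats, dim=-1)
    sim = f @ f.t()                                          # patch-to-patch cosine affinities
    topk = sim.topk(k, dim=-1)
    w = torch.zeros_like(sim).scatter_(1, topk.indices, topk.values.clamp(min=0))
    w = w / w.sum(dim=-1, keepdim=True).clamp(min=1e-8)      # row-normalised kNN affinity matrix
    probs = text_probs.clone()
    for _ in range(num_iters):
        probs = alpha * (w @ probs) + (1 - alpha) * text_probs  # smooth towards neighbours
    return probs / probs.sum(dim=-1, keepdim=True)
```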
https://arxiv.org/abs/2409.00698
Acoustic scene classification (ASC) predominantly relies on supervised approaches. However, acquiring labeled data for training ASC models is often costly and time-consuming. Recently, self-supervised learning (SSL) has emerged as a powerful method for extracting features from unlabeled audio data, benefiting many downstream audio tasks. This paper proposes a data-efficient and low-complexity ASC system by leveraging self-supervised audio representations extracted from general-purpose audio datasets. We introduce BEATs, an audio SSL pre-trained model, to extract the general representations from AudioSet. Through extensive experiments, it has been demonstrated that the self-supervised audio representations can help to achieve high ASC accuracy with limited labeled fine-tuning data. Furthermore, we find that ensembling the SSL models fine-tuned with different strategies contributes to a further performance improvement. To meet low-complexity requirements, we use knowledge distillation to transfer the self-supervised knowledge from large teacher models to an efficient student model. The experimental results suggest that the self-supervised teachers effectively improve the classification accuracy of the student model. Our best-performing system obtains an average accuracy of 56.7%.
https://arxiv.org/abs/2408.14862
Neural network models for audio tasks, such as automatic speech recognition (ASR) and acoustic scene classification (ASC), are susceptible to noise contamination for real-life applications. To improve audio quality, an enhancement module, which can be developed independently, is explicitly used at the front-end of the target audio applications. In this paper, we present an end-to-end learning solution to jointly optimise the models for audio enhancement (AE) and the subsequent applications. To guide the optimisation of the AE module towards a target application, and especially to overcome difficult samples, we make use of the sample-wise performance measure as an indication of sample importance. In experiments, we consider four representative applications to evaluate our training paradigm, i.e., ASR, speech command recognition (SCR), speech emotion recognition (SER), and ASC. These applications are associated with speech and non-speech tasks concerning semantic and non-semantic features, transient and global information, and the experimental results indicate that our proposed approach can considerably boost the noise robustness of the models, especially at low signal-to-noise ratios (SNRs), for a wide range of computer audition tasks in everyday-life noisy environments.
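One way such a sample-importance-weighted joint objective could be written is sketched below; the softmax weighting of per-sample losses is an assumption chosen for illustration, not the paper's formula.

```python
import torch

def joint_loss(enhancer, downstream, noisy_x, targets, task_loss_fn, gamma=1.0):
    """Jointly optimise the enhancement front-end and the downstream model,
    weighting each sample's task loss by its (detached) difficulty.
    task_loss_fn should return per-sample losses, e.g. CrossEntropyLoss(reduction='none')."""
    enhanced = enhancer(noisy_x)
    logits = downstream(enhanced)
    per_sample = task_loss_fn(logits, targets)               # shape (B,)
    with torch.no_grad():
        weights = torch.softmax(gamma * per_sample, dim=0)   # harder samples get larger weight
    return (weights * per_sample).sum()
```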
https://arxiv.org/abs/2408.06264
Unsupervised domain adaptation techniques, extensively studied in hyperspectral image (HSI) classification, aim to use labeled source domain data and unlabeled target domain data to learn domain-invariant features for cross-scene classification. Compared to natural images, the numerous spectral bands of HSIs provide abundant semantic information, but they also increase the domain shift significantly. In most existing methods, both explicit alignment and implicit alignment simply align feature distributions, ignoring domain information in the spectrum. We observed that when the spectral channels of the source and target domains differ markedly, the transfer performance of these methods tends to deteriorate. Additionally, their performance fluctuates greatly owing to the varying domain shifts across different datasets. To address these problems, a novel shift-sensitive spatial-spectral disentangling learning (S4DL) approach is proposed. In S4DL, gradient-guided spatial-spectral decomposition is designed to separate domain-specific and domain-invariant representations by generating tailored masks under the guidance of the gradient from domain classification. A shift-sensitive adaptive monitor is defined to adjust the intensity of disentangling according to the magnitude of domain shift. Furthermore, a reversible neural network is constructed to retain domain information that lies not only in the semantic features but also in shallow-level detail. Extensive experimental results on several cross-scene HSI datasets consistently verify that S4DL outperforms state-of-the-art UDA methods. Our source code will be available at this https URL.
https://arxiv.org/abs/2408.15263
Artificial Intelligence (AI) technologies have profoundly transformed the field of remote sensing, revolutionizing data collection, processing, and analysis. Traditionally reliant on manual interpretation and task-specific models, remote sensing has been significantly enhanced by the advent of foundation models--large-scale, pre-trained AI models capable of performing a wide array of tasks with unprecedented accuracy and efficiency. This paper provides a comprehensive survey of foundation models in the remote sensing domain, covering models released between June 2021 and June 2024. We categorize these models based on their applications in computer vision and domain-specific tasks, offering insights into their architectures, pre-training datasets, and methodologies. Through detailed performance comparisons, we highlight emerging trends and the significant advancements achieved by these foundation models. Additionally, we discuss the technical challenges, practical implications, and future research directions, addressing the need for high-quality data, computational resources, and improved model generalization. Our research also finds that pre-training methods, particularly self-supervised learning techniques like contrastive learning and masked autoencoders, significantly enhance the performance and robustness of foundation models in remote sensing tasks such as scene classification, object detection, and other applications. This survey aims to serve as a resource for researchers and practitioners by providing a panorama of advances and promising pathways for continued development and application of foundation models in remote sensing.
https://arxiv.org/abs/2408.03464
Scene understanding plays an important role in several high-level computer vision applications, such as autonomous vehicles, intelligent video surveillance, or robotics. However, too few solutions have been proposed for indoor/outdoor scene classification to ensure scene context adaptability for computer vision frameworks. We propose the first Lightweight Hybrid Graph Convolutional Neural Network (LH-GCNN)-CNN framework as an add-on to object detection models. The proposed approach uses the output of the CNN object detection model to predict the observed scene type by generating a coherent GCNN representing the semantic and geometric content of the observed scene. This new method, applied to natural scenes, achieves an efficiency of over 90% for scene classification in a COCO-derived dataset containing a large number of different scenes, while requiring fewer parameters than traditional CNN methods. For the benefit of the scientific community, we will make the source code publicly available: this https URL.
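To give a feel for the detection-to-graph idea, here is a hand-rolled sketch in which detections become graph nodes, spatially close objects are connected, and two simple graph-convolution layers pool into a scene prediction; this is a generic illustration, not the LH-GCNN architecture itself.

```python
import torch
import torch.nn as nn

class SimpleSceneGCN(nn.Module):
    """Nodes: per-detection features (e.g. class one-hot + normalised box);
    adj: (N, N) adjacency built from spatial proximity of detections."""
    def __init__(self, node_dim, hidden=64, num_scenes=2):
        super().__init__()
        self.lin1 = nn.Linear(node_dim, hidden)
        self.lin2 = nn.Linear(hidden, hidden)
        self.head = nn.Linear(hidden, num_scenes)

    def forward(self, nodes, adj):                               # nodes: (N, node_dim)
        a = adj + torch.eye(adj.size(0), device=adj.device)      # add self-loops
        d = a.sum(dim=-1, keepdim=True).clamp(min=1)             # degree normalisation
        h = torch.relu(self.lin1(a @ nodes / d))
        h = torch.relu(self.lin2(a @ h / d))
        return self.head(h.mean(dim=0))                          # mean-pool nodes -> scene logits

# toy usage: 5 detected objects, 84-dim node features (80-class one-hot + box)
scene_logits = SimpleSceneGCN(84)(torch.randn(5, 84), torch.ones(5, 5))
```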
https://arxiv.org/abs/2407.14658
Olfaction, often overlooked in cultural heritage studies, holds profound significance in shaping human experiences and identities. Examining historical depictions of olfactory scenes can offer valuable insights into the role of smells in history. We show that a transfer-learning approach using weakly labeled training data can remarkably improve the classification of fragrant spaces and, more generally, artistic scene depictions. We fine-tune Places365-pre-trained models by querying two cultural heritage data sources and using the search terms as supervision signal. The models are evaluated on two manually corrected test splits. This work lays a foundation for further exploration of fragrant spaces recognition and artistic scene classification. All images and labels are released as the ArtPlaces dataset at this https URL.
https://arxiv.org/abs/2407.11701
In this paper, we propose a method for online domain-incremental learning of acoustic scene classification from a sequence of different locations. Simply training a deep learning model on a sequence of different locations leads to forgetting of previously learned knowledge. In this work, we only correct the statistics of the Batch Normalization layers of a model using a few samples to learn the acoustic scenes from a new location without any excessive training. Experiments are performed on acoustic scenes from 11 different locations, with an initial task containing acoustic scenes from 6 locations and the remaining 5 incremental tasks each representing the acoustic scenes from a different location. The proposed approach outperforms fine-tuning based methods and achieves an average accuracy of 48.8% after learning the last task in sequence without forgetting acoustic scenes from the previously learned locations.
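Since the adaptation step touches only the Batch Normalization statistics, it can be sketched directly; the cumulative-average momentum handling and the frozen-weights loop below are a minimal illustration of statistics-only adaptation, not necessarily the exact recipe used in the paper.

```python
import torch

def adapt_bn_statistics(model, adaptation_loader, device="cpu"):
    """Re-estimate only the BatchNorm running mean/variance on a handful of
    samples from a new location, with all learnable weights frozen."""
    model.to(device).train()                 # train mode so BN updates running stats
    for p in model.parameters():
        p.requires_grad_(False)              # no gradient updates, statistics only
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None                # None -> cumulative moving average over batches
    with torch.no_grad():
        for x, _ in adaptation_loader:
            model(x.to(device))              # forward passes refresh BN statistics
    model.eval()
    return model
```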
https://arxiv.org/abs/2406.13386
This work is an improved version of the system that we submitted to Task 1 of the DCASE2023 challenge. We propose a low-complexity acoustic scene classification method based on a parallel attention-convolution network consisting of four modules: pre-processing, fusion, and global and local contextual information extraction. The proposed network captures global and local contextual information from each audio clip in a computationally efficient manner. In addition, we integrate other techniques into our method, such as knowledge distillation, data augmentation, and adaptive residual normalization. When evaluated on the official dataset of the DCASE2023 challenge, our method obtains the highest accuracy of 56.10% with only 5.21 K parameters and 1.44 M multiply-accumulate operations. It exceeds the top two systems of the DCASE2023 challenge in both accuracy and complexity, achieving a state-of-the-art result. Code is at: this https URL.
https://arxiv.org/abs/2406.08119
We propose noise-robust voice conversion (VC) which takes into account the recording quality and environment of noisy source speech. Conventional denoising training improves the noise robustness of a VC model by learning a noisy-to-clean VC process. However, the naturalness of the converted speech is limited when the noise of the source speech is unseen during training. To this end, our proposed training conditions a VC model on two latent variables representing the recording quality and environment of the source speech. These latent variables are derived from deep neural networks pre-trained on recording quality assessment and acoustic scene classification and are calculated in an utterance-wise or frame-wise manner. As a result, the trained VC model can explicitly learn information about speech degradation during training. Objective and subjective evaluations show that our training improves the quality of the converted speech compared to conventional training.
https://arxiv.org/abs/2406.07280
An increasing number of models have achieved great performance in remote sensing tasks with the recent development of Large Language Models (LLMs) and Visual Language Models (VLMs). However, these models are constrained to basic vision and language instruction-tuning tasks, facing challenges in complex remote sensing applications. Additionally, these models lack specialized expertise in professional domains. To address these limitations, we propose an LLM-driven remote sensing intelligent agent named RS-Agent. Firstly, RS-Agent is powered by a large language model (LLM) that acts as its "Central Controller," enabling it to understand and respond to various problems intelligently. Secondly, our RS-Agent integrates many high-performance remote sensing image processing tools, facilitating multi-tool and multi-turn conversations. Thirdly, our RS-Agent can answer professional questions by leveraging robust knowledge documents. We conducted experiments using several datasets, e.g., RSSDIVCS, RSVQA, and DOTAv1. The experimental results demonstrate that our RS-Agent delivers outstanding performance in many tasks, namely scene classification, visual question answering, and object counting.
https://arxiv.org/abs/2406.07089
Since the launch of the Sentinel-2 (S2) satellites, many ML models have used the data for diverse applications. The scene classification layer (SCL) inside the S2 product provides rich information for training, such as filtering images with high cloud coverage. However, there is more potential in this layer. We propose a technique to assess the clean optical coverage of a region, expressed as a satellite image time series (SITS) and calculated from the S2-based SCL data. Using a manual threshold and specific SCL labels, the proposed technique assigns a percentage of spatial and temporal coverage across the time series together with a high/low assessment. By evaluating the AI4EO challenge for Enhanced Agriculture, we show that this assessment is correlated with the predictive results of ML models. The classification results in a region with low spatial and temporal coverage are worse than in a region with high coverage. Finally, we applied the technique across all continents of the global dataset LandCoverNet.
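A coverage assessment of this kind can be sketched as a simple aggregation over the SCL time series; the set of SCL classes treated as "clean" and the high/low threshold below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def clean_coverage(scl_series: np.ndarray, clean_labels=(4, 5, 6), threshold=0.7):
    """scl_series: stacked Sentinel-2 SCL maps for one region, shape (T, H, W).
    Computes the fraction of clean observations per pixel and per date, then
    summarises the region and flags high/low coverage."""
    clean = np.isin(scl_series, clean_labels)        # (T, H, W) boolean clean mask
    temporal_cov = clean.mean(axis=0)                # per-pixel fraction of clean dates
    spatial_cov = clean.mean(axis=(1, 2))            # per-date fraction of clean pixels
    region_score = float(temporal_cov.mean())        # overall clean optical coverage
    return {
        "temporal_coverage": region_score,
        "spatial_coverage": float(spatial_cov.mean()),
        "assessment": "high" if region_score >= threshold else "low",
    }

# toy usage: 12 dates of a 100x100 SCL crop with random class labels 0..11
report = clean_coverage(np.random.randint(0, 12, size=(12, 100, 100)))
```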
https://arxiv.org/abs/2406.18584