In this work, we propose a novel variational Bayesian adaptive learning approach for cross-domain knowledge transfer to address acoustic mismatches between training and testing conditions, such as recording devices and environmental noise. Unlike traditional Bayesian approaches, which impose uncertainties on model parameters and thus risk the curse of dimensionality due to the huge number of parameters, we focus on estimating a manageable number of latent variables in deep neural models. Knowledge learned from a source domain is thus encoded in prior distributions of deep latent variables and optimally combined, in a Bayesian sense, with a small set of adaptation data from a target domain to approximate the corresponding posterior distributions. Two different strategies are proposed and investigated to estimate the posterior distributions: Gaussian mean-field variational inference, and empirical Bayes. These strategies address the presence or absence of parallel data in the source and target domains. Furthermore, structural relationship modeling is investigated to enhance the approximation. We evaluated our proposed approaches on two acoustic adaptation tasks: 1) device adaptation for acoustic scene classification, and 2) noise adaptation for spoken command recognition. Experimental results show that the proposed variational Bayesian adaptive learning approach obtains good improvements on target-domain data and consistently outperforms state-of-the-art knowledge transfer methods.
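As a concrete illustration of the mean-field strategy, the following PyTorch sketch adapts a single Gaussian latent vector to a small target-domain set by minimizing a negative ELBO with a KL term toward a source-domain prior. The additive-shift placement of the latent variable, the prior values, and all module names are assumptions for illustration, not the paper's exact architecture.

```python
# A minimal sketch, assuming the latent variable acts as an additive shift on
# frozen-backbone features; module names and the ELBO weighting are
# illustrative, not the authors' exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentAdapter(nn.Module):
    """Mean-field Gaussian posterior q(z) = N(mu, diag(exp(logvar))) over a latent vector."""
    def __init__(self, prior_mu, prior_logvar):
        super().__init__()
        self.mu = nn.Parameter(prior_mu.clone())           # initialized from the source-domain prior
        self.logvar = nn.Parameter(prior_logvar.clone())
        self.register_buffer("prior_mu", prior_mu)
        self.register_buffer("prior_logvar", prior_logvar)

    def sample(self):
        eps = torch.randn_like(self.mu)
        return self.mu + torch.exp(0.5 * self.logvar) * eps    # reparameterization trick

    def kl(self):
        # KL( q(z) || p(z) ) between diagonal Gaussians, summed over dimensions
        var, pvar = self.logvar.exp(), self.prior_logvar.exp()
        return 0.5 * torch.sum(self.prior_logvar - self.logvar
                               + (var + (self.mu - self.prior_mu) ** 2) / pvar - 1.0)

dim, n_classes = 64, 10
prior_mu, prior_logvar = torch.zeros(dim), torch.zeros(dim)   # assumed to come from source training
adapter = LatentAdapter(prior_mu, prior_logvar)
head = nn.Linear(dim, n_classes)                              # stands in for the frozen classifier head
opt = torch.optim.Adam(adapter.parameters(), lr=1e-2)         # only the few latent parameters adapt

features = torch.randn(32, dim)                 # frozen-backbone features of the small adaptation set
labels = torch.randint(0, n_classes, (32,))
for _ in range(100):
    z = adapter.sample()
    logits = head(features + z)                 # latent variable shifts the representation
    loss = F.cross_entropy(logits, labels, reduction="sum") + adapter.kl()   # negative ELBO
    opt.zero_grad(); loss.backward(); opt.step()
```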
https://arxiv.org/abs/2501.15496
Scene understanding in remote sensing often faces challenges in generating accurate representations for complex environments such as various land use areas or coastal regions, which may also include snow, clouds, or haze. To address this, we present a vision-language framework named Spectral LLaVA, which integrates multispectral data with vision-language alignment techniques to enhance scene representation and description. Using the BigEarthNet v2 dataset from Sentinel-2, we establish a baseline with RGB-based scene descriptions and further demonstrate substantial improvements through the incorporation of multispectral information. Our framework optimizes a lightweight linear projection layer for alignment while keeping the vision backbone of SpectralGPT frozen. Our experiments encompass scene classification using linear probing and language modeling for jointly performing scene classification and description generation. Our results highlight Spectral LLaVA's ability to produce detailed and accurate descriptions, particularly for scenarios where RGB data alone proves inadequate, while also enhancing classification performance by refining SpectralGPT features into semantically meaningful representations.
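A minimal sketch of the alignment recipe described above: only a linear projection from frozen vision features into the language model's embedding space is trained. The dimensions, token counts, and the way projected tokens are prepended are illustrative assumptions.

```python
# Hedged sketch: only a linear projection over frozen SpectralGPT-style features
# is trained; all sizes below are placeholder assumptions.
import torch
import torch.nn as nn

vision_dim, text_dim, n_vis_tokens = 768, 4096, 196

class SpectralProjector(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)   # the only trainable component

    def forward(self, vis_tokens):                    # (B, n_vis_tokens, vision_dim)
        return self.proj(vis_tokens)                  # (B, n_vis_tokens, text_dim)

projector = SpectralProjector()
frozen_vis_tokens = torch.randn(2, n_vis_tokens, vision_dim)  # frozen vision-backbone output
text_embeds = torch.randn(2, 32, text_dim)                    # embedded caption tokens

# projected visual tokens are prepended to the caption embeddings and fed to the
# (frozen) language model; only `projector` receives gradients from the LM loss
lm_input = torch.cat([projector(frozen_vis_tokens), text_embeds], dim=1)
print(lm_input.shape)   # torch.Size([2, 228, 4096])
```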
https://arxiv.org/abs/2501.10144
Remote sensing (RS) visual tasks have gained significant academic and practical importance. However, they encounter numerous challenges that hinder effective feature extraction, including the detection and recognition of multiple objects exhibiting substantial variations in scale within a single image. While prior dual-branch or multi-branch architectural strategies have been effective in managing these object variances, they have concurrently resulted in considerable increases in computational demands and parameter counts. Consequently, these architectures are rendered less viable for deployment on resource-constrained devices. Contemporary lightweight backbone networks, designed primarily for natural images, frequently encounter difficulties in effectively extracting features from multi-scale objects, which compromises their efficacy in RS visual tasks. This article introduces LWGANet, a specialized lightweight backbone network tailored for RS visual tasks, incorporating a novel lightweight group attention (LWGA) module designed to address these specific challenges. The LWGA module, tailored for RS imagery, adeptly harnesses redundant features to extract a wide range of spatial information, from local to global scales, without introducing additional complexity or computational overhead. This facilitates precise feature extraction across multiple scales within an efficient framework. LWGANet was rigorously evaluated across twelve datasets, which span four crucial RS visual tasks: scene classification, oriented object detection, semantic segmentation, and change detection. The results confirm LWGANet's widespread applicability and its ability to maintain an optimal balance between high performance and low complexity, achieving SOTA results across diverse datasets. LWGANet emerges as a novel solution for resource-limited scenarios requiring robust RS image processing capabilities.
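The abstract does not spell out the LWGA design, so the sketch below is a hypothetical grouped multi-scale module in the same spirit: channel groups are routed through point-wise, depthwise, dilated, and global-gating branches and recombined at constant channel width. All layer choices are assumptions.

```python
# Hypothetical sketch of a lightweight group-attention block in the spirit of
# LWGA; this illustrates the idea of local-to-global branches over channel
# groups and is not the published architecture.
import torch
import torch.nn as nn

class GroupMultiScaleAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        assert channels % 4 == 0
        c = channels // 4
        self.local = nn.Conv2d(c, c, 1)                                   # point-wise branch
        self.mid = nn.Conv2d(c, c, 3, padding=1, groups=c)                # depthwise 3x3 branch
        self.large = nn.Conv2d(c, c, 3, padding=2, dilation=2, groups=c)  # dilated depthwise branch
        self.global_gate = nn.Sequential(                                 # global-context gating branch
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        g1, g2, g3, g4 = torch.chunk(x, 4, dim=1)   # split channels into four groups
        out = torch.cat([
            self.local(g1),
            self.mid(g2),
            self.large(g3),
            g4 * self.global_gate(g4),
        ], dim=1)
        return x + self.fuse(out)                   # residual keeps the block easy to stack

x = torch.randn(1, 64, 56, 56)
print(GroupMultiScaleAttention(64)(x).shape)        # torch.Size([1, 64, 56, 56])
```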
https://arxiv.org/abs/2501.10040
The proliferation of Internet of Things (IoT) devices equipped with acoustic sensors necessitates robust acoustic scene classification (ASC) capabilities, even in noisy and data-limited environments. Traditional machine learning methods often struggle to generalize effectively under such conditions. To address this, we introduce Q-ASC, a novel Quantum-Inspired Acoustic Scene Classifier that leverages the power of quantum-inspired transformers. By integrating quantum concepts like superposition and entanglement, Q-ASC achieves superior feature learning and enhanced noise resilience compared to classical models. Furthermore, we introduce a Quantum Variational Autoencoder (QVAE) based data augmentation technique to mitigate the challenge of limited labeled data in IoT deployments. Extensive evaluations on the Tampere University of Technology (TUT) Acoustic Scenes 2016 benchmark dataset demonstrate that Q-ASC achieves remarkable accuracy between 68.3% and 88.5% under challenging conditions, outperforming state-of-the-art methods by over 5% in the best case. This research paves the way for deploying intelligent acoustic sensing in IoT networks, with potential applications in smart homes, industrial monitoring, and environmental surveillance, even in adverse acoustic environments.
https://arxiv.org/abs/2501.09394
Satellite imagery is a cornerstone for numerous Remote Sensing (RS) applications; however, limited spatial resolution frequently hinders the precision of such systems, especially in multi-label scene classification tasks, which require a higher level of detail and feature differentiation. In this study, we explore the efficacy of image Super-Resolution (SR) as a pre-processing step to enhance the quality of satellite images and thus improve downstream classification performance. We investigate four SR models - SRResNet, HAT, SeeSR, and RealESRGAN - and evaluate their impact on multi-label scene classification across various CNN architectures, including ResNet-50, ResNet-101, ResNet-152, and Inception-v4. Our results show that applying SR significantly improves downstream classification performance across various metrics, demonstrating its ability to preserve spatial details critical for multi-label tasks. Overall, this work offers valuable insights into the selection of SR techniques for multi-label prediction in remote sensing and presents an easy-to-integrate framework to improve existing RS systems.
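A minimal sketch of the evaluated pipeline: super-resolve a low-resolution patch, then train a multi-label classifier on the upscaled image. A bicubic upsampler stands in for SRResNet/HAT/SeeSR/RealESRGAN so the example stays self-contained, and the label count is illustrative.

```python
# Sketch of SR-as-preprocessing for multi-label classification; the SR step is
# a placeholder interpolation, not one of the learned SR models from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

def super_resolve(lr_img, scale=4):
    # placeholder for a learned SR model
    return F.interpolate(lr_img, scale_factor=scale, mode="bicubic", align_corners=False)

n_labels = 19                                   # illustrative multi-label setup
backbone = resnet50(weights=None)
backbone.fc = nn.Linear(backbone.fc.in_features, n_labels)

lr_patch = torch.rand(2, 3, 56, 56)             # low-resolution satellite patch
sr_patch = super_resolve(lr_patch)              # (2, 3, 224, 224)
logits = backbone(sr_patch)
targets = torch.randint(0, 2, (2, n_labels)).float()
loss = F.binary_cross_entropy_with_logits(logits, targets)   # multi-label objective
loss.backward()
```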
https://arxiv.org/abs/2501.06720
Scene understanding is essential for enhancing driver safety, generating human-centric explanations for Automated Vehicle (AV) decisions, and leveraging Artificial Intelligence (AI) for retrospective driving video analysis. This study developed a dynamic scene retrieval system using Contrastive Language-Image Pretraining (CLIP) models, which can be optimized for real-time deployment on edge devices. The proposed system outperforms state-of-the-art in-context learning methods, including the zero-shot capabilities of GPT-4o, particularly in complex scenarios. By conducting frame-level analysis on the Honda Scenes Dataset, which contains a collection of about 80 hours of annotated driving videos capturing diverse real-world road and weather conditions, our study highlights the robustness of CLIP models in learning visual concepts from natural language supervision. Results also showed that fine-tuning the CLIP models, such as ViT-L/14 and ViT-B/32, significantly improved scene classification, achieving a top F1 score of 91.1%. These results demonstrate the ability of the system to deliver rapid and precise scene recognition, which can be used to meet the critical requirements of Advanced Driver Assistance Systems (ADAS). This study shows the potential of CLIP models to provide scalable and efficient frameworks for dynamic scene understanding and classification. Furthermore, this work lays the groundwork for advanced autonomous vehicle technologies by fostering a deeper understanding of driver behavior, road conditions, and safety-critical scenarios, marking a significant step toward smarter, safer, and more context-aware autonomous driving systems.
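A small sketch of the underlying CLIP-based scene matching using the Hugging Face API; the prompt list is illustrative and not the Honda Scenes label taxonomy, and fine-tuning of the encoders is omitted.

```python
# Hedged sketch of CLIP-based scene classification over video frames.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

scene_prompts = [
    "a photo of driving on a highway in clear weather",
    "a photo of driving on a city street at night",
    "a photo of driving in heavy rain",
    "a photo of a construction zone on the road",
]

frame = Image.new("RGB", (224, 224))            # stands in for a decoded video frame
inputs = processor(text=scene_prompts, images=frame, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)    # similarity of the frame to each prompt
print(scene_prompts[int(probs.argmax())])
```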
https://arxiv.org/abs/2501.05566
In remote sensing scene classification, leveraging transfer methods with well-trained optical models is an efficient way to overcome label scarcity. However, cloud contamination leads to optical information loss and significant impacts on feature distribution, challenging the reliability and stability of transferred target models. Common solutions include cloud removal for optical data or directly using synthetic aperture radar (SAR) data in the target domain. However, cloud removal requires substantial auxiliary data for support and pre-training, while directly using SAR disregards the unobstructed portions of optical data. This study presents a scene classification transfer method that synergistically combines multi-modality data, which aims to transfer the source domain model trained on cloud-free optical data to the target domain that includes both cloudy optical and SAR data at low cost. Specifically, the framework incorporates two parts: (1) the collaborative transfer strategy, based on knowledge distillation, enables efficient prior knowledge transfer across heterogeneous data; (2) the information regulation mechanism (IRM) is proposed to address the modality imbalance issue during transfer. It employs auxiliary models to measure the contribution discrepancy of each modality, and automatically balances the information utilization of modalities at the sample level during the target model learning process. The transfer experiments were conducted on simulated and real cloud datasets, demonstrating the superior performance of the proposed method compared to other solutions in cloud-covered scenarios. We also verified the importance and limitations of IRM, and further discussed and visualized the modality imbalance problem during the model transfer. Codes are available at this https URL
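A hedged sketch of the two ingredients: distilling a teacher trained on cloud-free optical data into a two-branch student (cloudy optical + SAR), with per-sample modality weights derived from auxiliary models. The specific weighting rule is an illustrative stand-in for the paper's IRM, and all models are linear placeholders.

```python
# Illustrative KD + per-sample modality weighting; not the paper's exact IRM.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_classes, feat_dim = 10, 128
teacher = nn.Linear(feat_dim, n_classes)      # source model trained on cloud-free optical data
opt_branch = nn.Linear(feat_dim, n_classes)   # target-student branch: cloudy optical
sar_branch = nn.Linear(feat_dim, n_classes)   # target-student branch: SAR
aux_opt = nn.Linear(feat_dim, n_classes)      # auxiliary per-modality models
aux_sar = nn.Linear(feat_dim, n_classes)

opt_feat, sar_feat = torch.randn(8, feat_dim), torch.randn(8, feat_dim)
labels = torch.randint(0, n_classes, (8,))
with torch.no_grad():
    teacher_logits = teacher(opt_feat)        # source-trained teacher applied to the target optical input

# per-sample modality weights from the auxiliary models' confidence on the true class,
# so a cloud-corrupted optical sample leans more on SAR (illustrative stand-in for IRM)
conf_opt = F.softmax(aux_opt(opt_feat), dim=1).gather(1, labels[:, None])
conf_sar = F.softmax(aux_sar(sar_feat), dim=1).gather(1, labels[:, None])
w_opt = conf_opt / (conf_opt + conf_sar + 1e-8)
w_sar = 1.0 - w_opt

student_logits = w_opt * opt_branch(opt_feat) + w_sar * sar_branch(sar_feat)
T = 4.0
kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
              F.softmax(teacher_logits / T, dim=1), reduction="batchmean") * T * T
loss = F.cross_entropy(student_logits, labels) + kd
loss.backward()
```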
https://arxiv.org/abs/2501.04283
Remote sensing data is often distributed across multiple institutions, and due to privacy concerns and data-sharing restrictions, leveraging large-scale datasets in a centralized training framework is challenging. Federated learning offers a promising solution by enabling collaborative model training across distributed data sources without requiring data centralization. However, current Vision-Language Models (VLMs), which typically contain billions of parameters, pose significant communication challenges for traditional federated learning approaches based on model parameter updates, as they would incur substantial communication costs. In this paper, we propose FedRSCLIP, the first federated learning framework designed for remote sensing image classification based on a VLM, specifically CLIP. FedRSCLIP addresses the challenges of data heterogeneity and large-scale model transmission in federated environments by introducing Prompt Learning, which optimizes only a small set of tunable parameters. The framework introduces a dual-prompt mechanism, comprising Shared Prompts for global knowledge sharing and Private Prompts for client-specific adaptation. To maintain semantic coherence between shared and private prompts, we propose the Dual Prompt Alignment Constraint to balance global consistency and local adaptability across diverse client distributions. Additionally, to enhance cross-modal representation learning, we introduce the Cross-Modal Feature Alignment Constraint to align multimodal features between text and image prompts. To validate the effectiveness of our proposed model, we construct a Fed-RSIC dataset based on three existing remote sensing image classification datasets, specifically designed to simulate various federated learning configurations. Experimental results demonstrate the effectiveness and superiority of FedRSCLIP in remote sensing image classification.
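A hedged sketch of the dual-prompt layout: each client keeps a private prompt, a small shared prompt is the only quantity aggregated by the server, and an alignment penalty ties the two together. Prompt sizes and the cosine form of the constraint are assumptions, and the CLIP task loss itself is only indicated in a comment.

```python
# Illustrative shared/private prompt bookkeeping with a FedAvg round over the
# shared prompt only; not FedRSCLIP's exact losses.
import torch
import torch.nn as nn
import torch.nn.functional as F

prompt_len, dim, n_clients = 8, 512, 4
global_shared = torch.randn(prompt_len, dim) * 0.02

# each client holds a trainable copy of the shared prompt plus a private prompt;
# the frozen CLIP encoders would consume [shared ; private ; class-name tokens]
clients = [{"shared": nn.Parameter(global_shared.clone()),
            "private": nn.Parameter(torch.randn(prompt_len, dim) * 0.02)}
           for _ in range(n_clients)]

def alignment_loss(c):
    # illustrative cosine form of the dual-prompt alignment idea:
    # keep the private prompt semantically close to the shared one
    return 1.0 - F.cosine_similarity(c["shared"].mean(0), c["private"].mean(0), dim=0)

# during local training each client would minimize: CLIP task loss + lambda * alignment_loss(c)

# server round: only the small shared prompts are averaged (FedAvg); the VLM weights never move
with torch.no_grad():
    global_shared = torch.stack([c["shared"] for c in clients]).mean(0)
print(global_shared.shape, alignment_loss(clients[0]).item())   # torch.Size([8, 512]) and a scalar
```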
https://arxiv.org/abs/2501.02461
The domain gap between remote sensing imagery and natural images has recently received widespread attention and Vision-Language Models (VLMs) have demonstrated excellent generalization performance in remote sensing multimodal tasks. However, current research is still limited in exploring how remote sensing VLMs handle different types of visual inputs. To bridge this gap, we introduce UniRS, the first vision-language model unifying multi-temporal remote sensing tasks across various types of visual input. UniRS supports single images, dual-time image pairs, and videos as input, enabling comprehensive remote sensing temporal analysis within a unified framework. We adopt a unified visual representation approach, enabling the model to accept various visual inputs. For dual-time image pair tasks, we customize a change extraction module to further enhance the extraction of spatiotemporal features. Additionally, we design a prompt augmentation mechanism tailored to the model's reasoning process, utilizing the prior knowledge of the general-purpose VLM to provide clues for UniRS. To promote multi-task knowledge sharing, the model is jointly fine-tuned on a mixed dataset. Experimental results show that UniRS achieves state-of-the-art performance across diverse tasks, including visual question answering, change captioning, and video scene classification, highlighting its versatility and effectiveness in unifying these multi-temporal remote sensing tasks. Our code and dataset will be released soon.
https://arxiv.org/abs/2412.20742
Acoustic Scene Classification (ASC) identifies an environment based on an audio signal. This paper explores ASC in low-resource conditions and proposes a novel model, DS-FlexiNet, which combines depthwise separable convolutions from MobileNetV2 with ResNet-inspired residual connections for a balance of efficiency and accuracy. To address hardware limitations and device heterogeneity, DS-FlexiNet employs Quantization Aware Training (QAT) for model compression and data augmentation methods like Auto Device Impulse Response (ADIR) and Freq-MixStyle (FMS) to improve cross-device generalization. Knowledge Distillation (KD) from twelve teacher models further enhances performance on unseen devices. The architecture includes a custom Residual Normalization layer to handle domain differences across devices, and depthwise separable convolutions reduce computational overhead without sacrificing feature representation. Experimental results show that DS-FlexiNet excels in both adaptability and performance under resource-constrained conditions.
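A minimal sketch of the core building block named above: a MobileNetV2-style depthwise separable (inverted-bottleneck) convolution wrapped in a ResNet-style residual connection. Channel counts, expansion factor, and activations are illustrative.

```python
# Depthwise-separable residual block in the spirit of DS-FlexiNet's description;
# QAT, ADIR/FMS augmentation, and KD are orthogonal to this structural sketch.
import torch
import torch.nn as nn

class DSResidualBlock(nn.Module):
    def __init__(self, channels, expansion=4):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),                      # expand
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),  # depthwise 3x3
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, channels, 1, bias=False),                      # project (linear)
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)      # ResNet-style residual connection

x = torch.randn(1, 32, 64, 64)        # e.g. a log-mel spectrogram feature map
print(DSResidualBlock(32)(x).shape)   # torch.Size([1, 32, 64, 64])
```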
https://arxiv.org/abs/2412.20722
Uncertainty quantification (UQ) is essential for assessing the reliability of Earth observation (EO) products. However, the extensive use of machine learning models in EO introduces an additional layer of complexity, as those models themselves are inherently uncertain. While various UQ methods do exist for machine learning models, their performance on EO datasets remains largely unevaluated. A key challenge in the community is the absence of the ground truth for uncertainty, i.e. how certain the uncertainty estimates are, apart from the labels for the image/signal. This article fills this gap by introducing three benchmark datasets specifically designed for UQ in EO machine learning models. These datasets address three common problem types in EO: regression, image segmentation, and scene classification. They enable a transparent comparison of different UQ methods for EO machine learning models. We describe the creation and characteristics of each dataset, including data sources, preprocessing steps, and label generation, with a particular focus on calculating the reference uncertainty. We also showcase baseline performance of several machine learning models on each dataset, highlighting the utility of these benchmarks for model development and comparison. Overall, this article offers a valuable resource for researchers and practitioners working in artificial intelligence for EO, promoting a more accurate and reliable quality measure of the outputs of machine learning models. The dataset and code are accessible via this https URL.
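For context, a tiny sketch of one baseline UQ method such a benchmark can score against its reference uncertainty: Monte Carlo dropout with predictive entropy as the per-sample uncertainty. The classifier is a placeholder and is not tied to the released datasets.

```python
# Illustrative MC-dropout uncertainty estimation for a scene classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 10))

def mc_dropout_predict(x, n_samples=30):
    model.train()                      # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=1) for _ in range(n_samples)])
    mean_probs = probs.mean(0)                                          # predictive distribution
    entropy = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(1)  # uncertainty score
    return mean_probs, entropy

features = torch.randn(4, 128)         # e.g. embeddings of EO scene patches
probs, uncertainty = mc_dropout_predict(features)
print(probs.argmax(1), uncertainty)    # predictions and their estimated uncertainty
```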
https://arxiv.org/abs/2412.06451
Remote sensing scene classification (RSSC) is a critical task with diverse applications in land use and resource management. While unimodal image-based approaches show promise, they often struggle with limitations such as high intra-class variance and inter-class similarity. Incorporating textual information can enhance classification by providing additional context and semantic understanding, but manual text annotation is labor-intensive and costly. In this work, we propose a novel RSSC framework that integrates text descriptions generated by large vision-language models (VLMs) as an auxiliary modality without incurring expensive manual annotation costs. To fully leverage the latent complementarities between visual and textual data, we propose a dual cross-attention-based network to fuse these modalities into a unified representation. Extensive experiments with both quantitative and qualitative evaluation across five RSSC datasets demonstrate that our framework consistently outperforms baseline models. We also verify the effectiveness of VLM-generated text descriptions compared to human-annotated descriptions. Additionally, we design a zero-shot classification scenario to show that the learned multimodal representation can be effectively utilized for unseen class classification. This research opens new opportunities for leveraging textual information in RSSC tasks and provides a promising multimodal fusion structure, offering insights and inspiration for future studies. Code is available at: this https URL
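A hedged sketch of a dual cross-attention fusion block of the kind described: image tokens attend to the generated text tokens and vice versa, and the pooled outputs are concatenated for classification. Dimensions, pooling, and the class count are assumptions.

```python
# Illustrative dual cross-attention fusion of visual tokens and VLM-generated
# text tokens; not the paper's exact module.
import torch
import torch.nn as nn

class DualCrossAttentionFusion(nn.Module):
    def __init__(self, dim=512, heads=8, n_classes=45):
        super().__init__()
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, n_classes)

    def forward(self, img_tokens, txt_tokens):
        a, _ = self.img2txt(img_tokens, txt_tokens, txt_tokens)   # image queries, text keys/values
        b, _ = self.txt2img(txt_tokens, img_tokens, img_tokens)   # text queries, image keys/values
        fused = torch.cat([a.mean(1), b.mean(1)], dim=-1)         # pooled joint representation
        return self.head(fused)

img_tokens = torch.randn(2, 196, 512)   # visual tokens of a scene image
txt_tokens = torch.randn(2, 32, 512)    # embedded VLM-generated description
print(DualCrossAttentionFusion()(img_tokens, txt_tokens).shape)   # torch.Size([2, 45])
```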
https://arxiv.org/abs/2412.02531
Countries in South Asia experience many catastrophic flooding events regularly. Through image classification, it is possible to expedite search and rescue initiatives by classifying flood zones, including houses and humans. We create a new dataset collecting aerial imagery of flooding events across South Asian countries. For the classification, we propose a fine-tuned Compact Convolutional Transformer (CCT) based approach along with other cutting-edge transformer-based and Convolutional Neural Network (CNN) based architectures. We also implement the YOLOv8 object detection model and detect houses and humans within the imagery of our proposed dataset, and then compare the performance with our classification-based approach. Since the countries in South Asia have similar topography, housing structures, flood-water color, and vegetation, this work is more applicable to such a region as opposed to the rest of the world. The images are divided evenly into four classes: 'flood', 'flood with domicile', 'flood with humans', and 'no flood'. On our proposed dataset, the fine-tuned CCT model, which has comparatively fewer weight parameters than many other transformer-based architectures designed for computer vision, achieves an accuracy of 98.62% and a macro-average precision of 98.50%. The other transformer-based architectures that we implement are the Vision Transformer (ViT), Swin Transformer, and External Attention Transformer (EANet), which achieve accuracies of 88.66%, 84.74%, and 66.56%, respectively. We also implement DCECNN (Deep Custom Ensembled Convolutional Neural Network), a custom ensemble model that we create by combining MobileNet, InceptionV3, and EfficientNetB0, and obtain an accuracy of 98.78%. The architectures we implement are fine-tuned to achieve optimal performance on our dataset.
https://arxiv.org/abs/2411.00169
XyloAudio is a line of ultra-low-power audio inference chips, designed for in- and near-microphone analysis of audio in real-time energy-constrained scenarios. Xylo is designed around a highly efficient integer-logic processor which simulates parameter- and activity-sparse spiking neural networks (SNNs) using a leaky integrate-and-fire (LIF) neuron model. Neurons on Xylo are quantised integer devices operating in synchronous digital CMOS, with neuron and synapse state quantised to 16 bit, and weight parameters quantised to 8 bit. Xylo is tailored for real-time streaming operation, as opposed to accelerated-time operation in the case of an inference accelerator. XyloAudio includes a low-power audio encoding interface for direct connection to a microphone, designed for sparse encoding of incident audio for further processing by the inference core. In this report we present the results of DCASE 2020 acoustic scene classification audio benchmark dataset deployed to XyloAudio 2. We describe the benchmark dataset; the audio preprocessing approach; and the network architecture and training approach. We present the performance of the trained model, and the results of power and latency measurements performed on the XyloAudio 2 development kit. This benchmark is conducted as part of the Neurobench project.
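A minimal sketch of the leaky integrate-and-fire dynamics Xylo simulates, with 8-bit weights and 16-bit state as described; the decay, threshold, and reset details are illustrative, and plain NumPy stands in for the chip's synchronous digital logic.

```python
# Illustrative quantized LIF simulation; parameter values are assumptions.
import numpy as np

n_in, n_neurons, T = 16, 8, 100
rng = np.random.default_rng(0)
w = rng.integers(-128, 128, size=(n_in, n_neurons)).astype(np.int32)   # 8-bit weights
v = np.zeros(n_neurons, dtype=np.int32)                                # 16-bit membrane state
threshold, decay_shift = 20000, 4                                      # leak implemented as a bit-shift

spikes_out = np.zeros((T, n_neurons), dtype=np.int8)
for t in range(T):
    in_spikes = (rng.random(n_in) < 0.05).astype(np.int32)    # sparse input events
    v = v - (v >> decay_shift) + in_spikes @ w                # leak + integrate
    v = np.clip(v, -32768, 32767)                             # keep state within 16 bits
    fired = v >= threshold
    spikes_out[t] = fired
    v[fired] -= threshold                                     # reset by subtraction
print(spikes_out.sum(axis=0))                                  # spike counts per neuron
```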
https://arxiv.org/abs/2410.23776
The goal of the acoustic scene classification (ASC) task is to classify recordings into one of the predefined acoustic scene classes. However, in real-world scenarios, ASC systems often encounter challenges such as recording device mismatch, low-complexity constraints, and the limited availability of labeled data. To alleviate these issues, in this paper, a data-efficient and low-complexity ASC system is built with a new model architecture and better training strategies. Specifically, we first design a new low-complexity architecture named Rep-Mobile by integrating multi-convolution branches that can be reparameterized at inference. Compared to other models, it achieves better performance with lower computational complexity. Then we apply the knowledge distillation strategy and provide a comparison of the data efficiency of teacher models with different architectures. Finally, we propose a progressive pruning strategy, which prunes the model multiple times in small amounts, resulting in better performance than single-step pruning. Experiments are conducted on the TAU dataset. With Rep-Mobile and these training strategies, our proposed ASC system achieves state-of-the-art (SOTA) results and won first place by a significant margin in the DCASE2024 Challenge.
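A sketch of the reparameterization idea behind Rep-Mobile: parallel 3x3 and 1x1 branches used in training are fused into a single 3x3 convolution at inference. The branch set is illustrative and BatchNorm folding is omitted for brevity.

```python
# Illustrative structural reparameterization: training-time multi-branch block
# fused into one inference-time convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, 1)

    def forward(self, x):                       # training-time multi-branch form
        return self.conv3(x) + self.conv1(x)

    def fuse(self):                             # inference-time single-branch form
        fused = nn.Conv2d(self.conv3.in_channels, self.conv3.out_channels, 3, padding=1)
        k1_as_k3 = F.pad(self.conv1.weight, [1, 1, 1, 1])      # place the 1x1 kernel at the 3x3 centre
        fused.weight.data = self.conv3.weight.data + k1_as_k3
        fused.bias.data = self.conv3.bias.data + self.conv1.bias.data
        return fused

block = RepBlock(8)
x = torch.randn(1, 8, 32, 32)
print(torch.allclose(block(x), block.fuse()(x), atol=1e-5))    # True: same function, one conv
```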
https://arxiv.org/abs/2410.20775
Large vision and language assistants have enabled new capabilities for interpreting natural images. These approaches have recently been adapted to earth observation data, but they are only able to handle single image inputs, limiting their use for many real-world tasks. In this work, we develop a new vision and language assistant called TEOChat that can engage in conversations about temporal sequences of earth observation data. To train TEOChat, we curate an instruction-following dataset composed of many single image and temporal tasks including building change and damage assessment, semantic change detection, and temporal scene classification. We show that TEOChat can perform a wide variety of spatial and temporal reasoning tasks, substantially outperforming previous vision and language assistants, and even achieving comparable or better performance than specialist models trained to perform these specific tasks. Furthermore, TEOChat achieves impressive zero-shot performance on a change detection and change question answering dataset, outperforms GPT-4o and Gemini 1.5 Pro on multiple temporal tasks, and exhibits stronger single image capabilities than a comparable single EO image instruction-following model. We publicly release our data, models, and code at this https URL .
https://arxiv.org/abs/2410.06234
Scene recognition, particularly for aerial and underwater images, often suffers from various types of degradation, such as blurring or overexposure. Previous works focusing on convolutional neural networks have been shown to extract panoramic semantic features and perform well on scene recognition tasks. However, low-quality images still impede model performance due to the inappropriate use of high-level semantic features. To address these challenges, we propose an adaptive selection mechanism to identify the most important and robust regions with high-level features, so that the model can learn from these regions and avoid interference. We implement a learnable mask in the neural network, which can filter high-level features by assigning weights to different regions of the feature matrix. We also introduce a regularization term to further enhance the significance of key high-level feature regions. Different from previous methods, our learnable matrix pays extra attention to regions that are important to multiple categories but may cause misclassification, and sets constraints to reduce the influence of such regions. This is a plug-and-play architecture that can be easily extended to other methods. Additionally, we construct an Underwater Geological Scene Classification dataset to assess the effectiveness of our model. Extensive experimental results demonstrate the superiority and robustness of our proposed method over state-of-the-art techniques on two datasets.
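A hedged sketch of the adaptive selection idea: a learnable spatial mask reweights high-level feature regions before pooling, with a sparsity term standing in for the paper's regularizer. The exact mask parameterization and constraint may differ from the published method.

```python
# Illustrative learnable region mask over high-level features with a simple
# regularization term; not the paper's exact formulation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionMask(nn.Module):
    def __init__(self, height, width):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(1, 1, height, width))   # one weight per spatial region

    def forward(self, feat):                # feat: (B, C, H, W) high-level features
        mask = torch.sigmoid(self.logits)   # region weights in (0, 1)
        return feat * mask, mask

feat = torch.randn(4, 256, 7, 7)            # backbone output on a possibly degraded image
masker = RegionMask(7, 7)
head = nn.Linear(256, 12)                   # illustrative number of scene classes

masked, mask = masker(feat)
logits = head(masked.mean(dim=(2, 3)))      # global average pooling over the reweighted regions
labels = torch.randint(0, 12, (4,))
loss = F.cross_entropy(logits, labels) + 1e-3 * mask.sum()   # sparsity-style regularization term
loss.backward()
```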
https://arxiv.org/abs/2409.14741
In this technical report, we describe the SNTL-NTU team's submission for Task 1, Data-Efficient Low-Complexity Acoustic Scene Classification, of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 challenge. Three systems are introduced to tackle training splits of different sizes. For small training splits, we explored reducing the complexity of the provided baseline model by reducing the number of base channels. We introduce data augmentation in the form of mixup to increase the diversity of training samples. For the larger training splits, we use FocusNet to provide confusing class information to an ensemble of multiple Patchout faSt Spectrogram Transformer (PaSST) models and baseline models trained on the original sampling rate of 44.1 kHz. We use knowledge distillation to distill the ensemble model into the baseline student model. Training the systems on the TAU Urban Acoustic Scenes 2022 Mobile development dataset yielded the highest average testing accuracies of (62.21, 59.82, 56.81, 53.03, 47.97)% on the (100, 50, 25, 10, 5)% splits, respectively, across the three systems.
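A sketch of the two training ingredients mentioned: mixup on the small splits and distillation of an ensemble's averaged logits into the baseline student. The models here are linear stand-ins, and the temperature and loss weights are illustrative.

```python
# Illustrative mixup augmentation plus ensemble-to-student knowledge distillation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mixup(x, y, n_classes, alpha=0.3):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    idx = torch.randperm(x.size(0))
    y1 = F.one_hot(y, n_classes).float()
    return lam * x + (1 - lam) * x[idx], lam * y1 + (1 - lam) * y1[idx]

n_classes = 10
student = nn.Linear(256, n_classes)                      # stands in for the baseline student model
teachers = [nn.Linear(256, n_classes) for _ in range(3)] # stands in for the PaSST/baseline ensemble

x = torch.randn(16, 256)                                 # spectrogram embeddings of a batch
y = torch.randint(0, n_classes, (16,))
x_mix, y_mix = mixup(x, y, n_classes)

with torch.no_grad():
    ens_logits = torch.stack([t(x_mix) for t in teachers]).mean(0)

T = 2.0
kd = F.kl_div(F.log_softmax(student(x_mix) / T, dim=1),
              F.softmax(ens_logits / T, dim=1), reduction="batchmean") * T * T
ce = -(y_mix * F.log_softmax(student(x_mix), dim=1)).sum(1).mean()   # soft-target cross-entropy
loss = 0.5 * ce + 0.5 * kd
loss.backward()
```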
https://arxiv.org/abs/2409.11964
This paper proposes a method for unsupervised whole-image clustering of a target dataset of remote sensing scenes with no labels. The method consists of three main steps: (1) finetuning a pretrained deep neural network (DINOv2) on a labelled source remote sensing imagery dataset and using it to extract a feature vector from each image in the target dataset, (2) reducing the dimension of these deep features via manifold projection into a low-dimensional Euclidean space, and (3) clustering the embedded features using a Bayesian nonparametric technique to infer the number and membership of clusters simultaneously. The method takes advantage of heterogeneous transfer learning to cluster unseen data with different feature and label distributions. We demonstrate the performance of this approach outperforming state-of-the-art zero-shot classification methods on several remote sensing scene classification datasets.
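A sketch of steps (2) and (3), with PCA standing in for the paper's manifold projection and random vectors standing in for the fine-tuned DINOv2 features; the Dirichlet-process mixture infers the number of clusters up to the stated cap.

```python
# Illustrative dimensionality reduction + Bayesian nonparametric clustering.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import BayesianGaussianMixture

features = np.random.randn(500, 768)        # placeholder for one feature vector per target image

embedded = PCA(n_components=16).fit_transform(features)   # stands in for the manifold projection step

dpgmm = BayesianGaussianMixture(
    n_components=30,                        # an upper bound; unused components get near-zero weight
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="full",
    max_iter=200,
)
labels = dpgmm.fit_predict(embedded)
print("clusters found:", len(np.unique(labels)))
```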
https://arxiv.org/abs/2409.03938
Vision-Language Models for remote sensing have shown promising uses thanks to their extensive pretraining. However, their conventional usage in zero-shot scene classification methods still involves dividing large images into patches and making independent predictions, i.e., inductive inference, thereby limiting their effectiveness by ignoring valuable contextual information. Our approach tackles this issue by utilizing initial predictions based on text prompting and patch affinity relationships from the image encoder to enhance zero-shot capabilities through transductive inference, all without the need for supervision and at a minor computational cost. Experiments on 10 remote sensing datasets with state-of-the-art Vision-Language Models demonstrate significant accuracy improvements over inductive zero-shot classification. Our source code is publicly available on Github: this https URL
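A hedged sketch of the transductive refinement: zero-shot text-prompt scores for every patch are propagated over a patch-affinity graph built from the image-encoder features. The propagation rule below is a generic label-propagation step, not necessarily the paper's objective.

```python
# Illustrative transductive refinement of zero-shot patch predictions.
import torch
import torch.nn.functional as F

n_patches, dim, n_classes = 64, 512, 10
patch_feats = F.normalize(torch.randn(n_patches, dim), dim=1)   # from the VLM image encoder
text_feats = F.normalize(torch.randn(n_classes, dim), dim=1)    # embedded class prompts

logits = 100.0 * patch_feats @ text_feats.t()      # inductive zero-shot scores per patch
probs = logits.softmax(dim=1)

affinity = (patch_feats @ patch_feats.t()).clamp_min(0)         # patch-to-patch similarity
affinity = affinity / affinity.sum(dim=1, keepdim=True)         # row-normalize

alpha = 0.7                                        # how much neighbours influence each patch
for _ in range(20):                                # simple label-propagation iterations
    probs = (1 - alpha) * logits.softmax(dim=1) + alpha * affinity @ probs

print(probs.argmax(dim=1))                         # transductively refined patch labels
```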
https://arxiv.org/abs/2409.00698