As AI workloads increase in scope, generalization becomes challenging for small task-specific models, and their demand for large amounts of labeled training samples grows. In contrast, Foundation Models (FMs) are trained with internet-scale unlabeled data via self-supervised learning and have been shown to adapt to various tasks with minimal fine-tuning. Although large FMs have demonstrated significant impact in natural language processing and computer vision, efforts toward FMs for geospatial applications have been restricted to smaller models, as pretraining larger models requires very large computing resources equipped with state-of-the-art hardware accelerators. Current satellite constellations collect 100+ TB of data a day, resulting in images that are billions of pixels and multimodal in nature. Such geospatial data poses unique challenges and opens up new opportunities to develop FMs. We investigate billion-scale FMs and HPC training profiles for geospatial applications by pretraining on publicly available data, and we study end to end how scaling the model size affects performance. Our larger 3B-parameter model achieves up to a 30% improvement in top-1 scene classification accuracy compared with a 100M-parameter model. Moreover, we detail performance experiments on the Frontier supercomputer, America's first exascale system, where we study different model and data parallel approaches using PyTorch's Fully Sharded Data Parallel library. Specifically, we study variants of the Vision Transformer architecture (ViT), conducting performance analysis for ViT models of up to 15B parameters. By discussing throughput and performance bottlenecks under different parallelism configurations, we offer insights on how to leverage such leadership-class HPC resources when developing large models for geospatial imagery applications.
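A minimal sketch (not the paper's code) of the core mechanism studied here: wrapping a ViT with PyTorch's Fully Sharded Data Parallel so that parameters, gradients, and optimizer state are sharded across ranks. The torchvision ViT-L/16 backbone, wrapping threshold, and hyperparameters are illustrative stand-ins; the paper scales custom ViT variants to 15B parameters on Frontier.

```python
# Hedged FSDP sketch; launch with e.g. `torchrun --nproc_per_node=8 train.py`.
import functools
import torch
import torch.distributed as dist
import torchvision
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

def main():
    dist.init_process_group("nccl")  # one process per GPU
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    # Illustrative backbone; the paper trains much larger custom ViT variants.
    model = torchvision.models.vit_l_16().cuda()

    # Submodules above ~10M parameters become their own FSDP shard units;
    # parameters, gradients, and optimizer state are sharded across ranks.
    policy = functools.partial(size_based_auto_wrap_policy,
                               min_num_params=10_000_000)
    model = FSDP(model, auto_wrap_policy=policy)

    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
    images = torch.randn(8, 3, 224, 224, device="cuda")  # dummy batch
    model(images).sum().backward()                       # dummy loss
    optim.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Between collective operations each rank holds only its shard of the model, which is what makes multi-billion-parameter pretraining fit in device memory.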
https://arxiv.org/abs/2404.11706
Indoor scenes are usually characterized by scattered objects and their relationships, which makes indoor scene classification a challenging computer vision task. Despite the significant performance boost that deep-learning-based methods have brought to classification tasks in recent years, limitations such as inter-category ambiguity and intra-category variation have been holding back their performance. To overcome such issues, gathering semantic information has been shown to be a promising route toward a more complete and discriminative feature representation of indoor scenes. Therefore, the work described in this paper uses semantic information obtained from both object detection and semantic segmentation techniques. While object detection techniques provide the 2D locations of objects, allowing spatial distributions between objects to be obtained, semantic segmentation techniques provide pixel-level information from which the spatial distribution and shape-related features of the segmentation categories can be derived. Hence, a novel approach that uses a semantic segmentation mask to provide a Hu-moments-based shape characterization of the segmentation categories, designated Segmentation-based Hu-Moments Features (SHMFs), is proposed. Moreover, a three-branch network, designated GOS$^2$F$^2$App, that exploits deep-learning-based global features, object-based features, and semantic-segmentation-based features is also proposed. GOS$^2$F$^2$App was evaluated on two indoor scene benchmark datasets, SUN RGB-D and NYU Depth V2, achieving, to the best of our knowledge, state-of-the-art results on both, which provides evidence of the effectiveness of the proposed approach.
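The central SHMF idea lends itself to a short sketch: compute one Hu-moment shape descriptor per segmentation category from its binary mask. This is a hedged reconstruction from the abstract, not the authors' code; the category ids and the log-scaling step are illustrative.

```python
import cv2
import numpy as np

def hu_moment_features(seg_mask: np.ndarray, category_ids) -> np.ndarray:
    """Return one 7-dim Hu-moment shape descriptor per segmentation category."""
    feats = []
    for cid in category_ids:
        binary = (seg_mask == cid).astype(np.uint8)   # mask of this category
        m = cv2.moments(binary, binaryImage=True)
        hu = cv2.HuMoments(m).flatten()               # 7 invariant moments
        # Log-scale for numerical stability, preserving sign (common practice).
        hu = -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)
        feats.append(hu)
    return np.stack(feats)  # shape: (num_categories, 7)

# Usage on a toy 2-category mask (category ids are hypothetical):
mask = np.zeros((64, 64), np.uint8)
mask[10:30, 10:50] = 1   # e.g. "bed"
mask[40:60, 5:25] = 2    # e.g. "table"
print(hu_moment_features(mask, category_ids=[1, 2]).shape)  # (2, 7)
```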
https://arxiv.org/abs/2404.07739
In the realm of geospatial analysis, the diversity of remote sensors, encompassing both optical and microwave technologies, offers a wealth of distinct observational capabilities. Recognizing this, we present msGFM, a multisensor geospatial foundation model that effectively unifies data from four key sensor modalities. This integration spans an expansive dataset of two million multisensor images. msGFM is uniquely adept at handling both paired and unpaired sensor data. For data originating from identical geolocations, our model employs an innovative cross-sensor pretraining approach in masked image modeling, enabling the synthesis of joint representations from diverse sensors. msGFM, incorporating four remote sensors, upholds strong performance, forming a comprehensive model adaptable to various sensor types. msGFM has demonstrated enhanced proficiency in a range of both single-sensor and multisensor downstream tasks. These include scene classification, segmentation, cloud removal, and pan-sharpening. A key discovery of our research is that representations derived from natural images are not always compatible with the distinct characteristics of geospatial remote sensors, underscoring the limitations of existing representations in this field. Our work can serve as a guide for developing multisensor geospatial pretraining models, paving the way for more advanced geospatial capabilities.
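One plausible form of the cross-sensor masked-image-modeling pretraining described above, sketched under assumptions (this is not the msGFM implementation): patches from one sensor are masked and reconstructed while co-located patches from a second sensor stay visible, forcing a joint representation. Module sizes and the masking ratio are illustrative.

```python
import torch
import torch.nn as nn

class CrossSensorMIM(nn.Module):
    def __init__(self, dim=256, patch_dim=768):
        super().__init__()
        self.embed_a = nn.Linear(patch_dim, dim)   # e.g. optical patches
        self.embed_b = nn.Linear(patch_dim, dim)   # e.g. SAR patches
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.decode = nn.Linear(dim, patch_dim)    # reconstruct raw patches

    def forward(self, patches_a, patches_b, mask):
        # mask: (B, N) bool, True where sensor-A patches are hidden.
        tok_a = self.embed_a(patches_a)
        tok_a = torch.where(mask.unsqueeze(-1),
                            self.mask_token.expand_as(tok_a), tok_a)
        tok_b = self.embed_b(patches_b)            # second sensor stays visible
        z = self.encoder(torch.cat([tok_a, tok_b], dim=1))
        recon = self.decode(z[:, : tok_a.shape[1]])
        return ((recon - patches_a) ** 2)[mask].mean()  # loss on hidden patches

model = CrossSensorMIM()
B, N, D = 2, 196, 768
mask = torch.rand(B, N) < 0.75                     # hide 75% of sensor A
loss = model(torch.randn(B, N, D), torch.randn(B, N, D), mask)
loss.backward()
```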
https://arxiv.org/abs/2404.01260
Remote sensing image classification forms the foundation of various understanding tasks, serving a crucial function in remote sensing image interpretation. The recent advancements of Convolutional Neural Networks (CNNs) and Transformers have markedly enhanced classification accuracy. Nonetheless, remote sensing scene classification remains a significant challenge, especially given the complexity and diversity of remote sensing scenarios and the variability of spatiotemporal resolutions. The capacity for whole-image understanding can provide more precise semantic cues for scene discrimination. In this paper, we introduce RSMamba, a novel architecture for remote sensing image classification. RSMamba is based on the State Space Model (SSM) and incorporates an efficient, hardware-aware design known as the Mamba. It integrates the advantages of both a global receptive field and linear modeling complexity. To overcome the limitation of the vanilla Mamba, which can only model causal sequences and is not adaptable to two-dimensional image data, we propose a dynamic multi-path activation mechanism to augment Mamba's capacity to model non-causal data. Notably, RSMamba maintains the inherent modeling mechanism of the vanilla Mamba, yet exhibits superior performance across multiple remote sensing image classification datasets. This indicates that RSMamba holds significant potential to function as the backbone of future visual foundation models. The code will be available at \url{this https URL}.
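A hedged illustration of the multi-path activation idea for non-causal image tokens: the patch sequence is scanned along several paths (forward, backward, shuffled), and the path outputs are dynamically re-weighted. An `nn.GRU` stands in for the Mamba SSM block purely so the sketch runs without extra dependencies; this is not RSMamba's actual block.

```python
import torch
import torch.nn as nn

class MultiPathScan(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.seq_model = nn.GRU(dim, dim, batch_first=True)  # stand-in for Mamba
        self.gate = nn.Linear(dim, 3)  # dynamic weights over the three paths

    def forward(self, tokens):                     # tokens: (B, N, D) patches
        B, N, D = tokens.shape
        perm = torch.randperm(N, device=tokens.device)
        paths = [
            tokens,                                # forward raster scan
            tokens.flip(1),                        # backward scan
            tokens[:, perm],                       # random-shuffle path
        ]
        outs = [self.seq_model(p)[0] for p in paths]
        outs[1] = outs[1].flip(1)                  # undo the reversal
        outs[2] = outs[2][:, torch.argsort(perm)]  # undo the shuffle
        w = torch.softmax(self.gate(tokens.mean(1)), dim=-1)  # (B, 3)
        return sum(w[:, i, None, None] * outs[i] for i in range(3))

x = torch.randn(2, 196, 128)
print(MultiPathScan()(x).shape)  # torch.Size([2, 196, 128])
```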
https://arxiv.org/abs/2403.19654
In the realm of Federated Learning (FL) applied to remote sensing image classification, this study introduces and assesses several innovative communication strategies. Our exploration includes feature-centric communication, pseudo-weight amalgamation, and a combined method utilizing both weights and features. Experiments conducted on two public scene classification datasets unveil the effectiveness of these strategies, showcasing accelerated convergence, heightened privacy, and reduced network information exchange. This research provides valuable insights into the implications of feature-centric communication in FL, offering potential applications tailored for remote sensing scenarios.
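A sketch of what feature-centric communication can look like in such a federated round, under assumptions (the paper's exact protocol may differ): each client shares per-class feature prototypes instead of full model weights, and the server averages them. Shapes and the uniform aggregation rule are illustrative.

```python
import torch

def client_prototypes(features, labels, num_classes):
    """features: (n, d) encoder outputs; returns (num_classes, d) prototypes."""
    protos = torch.zeros(num_classes, features.shape[1])
    for c in range(num_classes):
        sel = features[labels == c]
        if len(sel):
            protos[c] = sel.mean(0)
    return protos

def server_aggregate(all_protos):
    """Average prototypes across clients (uniform weighting for simplicity)."""
    return torch.stack(all_protos).mean(0)

# Two toy clients, 4 scene classes, 16-dim features:
c1 = client_prototypes(torch.randn(32, 16), torch.randint(0, 4, (32,)), 4)
c2 = client_prototypes(torch.randn(32, 16), torch.randint(0, 4, (32,)), 4)
global_protos = server_aggregate([c1, c2])  # (4, 16)
```

The appeal is visible in the last line: a prototype matrix is orders of magnitude smaller than a modern encoder's weights, which is one way to reduce network information exchange.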
https://arxiv.org/abs/2403.13575
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks. Pretraining is an active research topic, encompassing supervised and self-supervised learning methods to initialize model weights effectively. However, transferring the pretrained models to downstream tasks may encounter task discrepancy due to their formulation of pretraining as image classification or object discrimination tasks. In this study, we explore the Multi-Task Pretraining (MTP) paradigm for RS foundation models to address this issue. Using a shared encoder and task-specific decoder architecture, we conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection. MTP supports both convolutional neural networks and vision transformer foundation models with over 300 million parameters. The pretrained models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection. Extensive experiments across 14 datasets demonstrate the superiority of our models over existing ones of similar size and their competitive performance compared to larger state-of-the-art models, thus validating the effectiveness of MTP.
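A minimal sketch of the shared-encoder, task-specific-decoder layout MTP describes (not the authors' code; the tiny backbone and heads are placeholders for the >300M-parameter CNN/ViT backbones used in the paper):

```python
import torch
import torch.nn as nn

class MultiTaskPretrainNet(nn.Module):
    def __init__(self, dim=64, num_classes=16):
        super().__init__()
        self.encoder = nn.Sequential(            # shared representation
            nn.Conv2d(3, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
        )
        self.seg_head = nn.Conv2d(dim, num_classes, 1)    # semantic segmentation
        self.det_head = nn.Conv2d(dim, 5 + 1, 1)          # (cx,cy,w,h,angle)+score
        self.inst_head = nn.Conv2d(dim, num_classes, 1)   # instance masks

    def forward(self, x):
        z = self.encoder(x)
        return self.seg_head(z), self.det_head(z), self.inst_head(z)

model = MultiTaskPretrainNet()
seg, det, inst = model(torch.randn(2, 3, 128, 128))
# Pretraining sums the per-task losses; only the shared encoder is
# transferred to downstream tasks.
```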
https://arxiv.org/abs/2403.13430
Crime in the 21st century is split between a virtual and a real world, and the former has become a global menace to people's well-being and security in the latter. The challenges it presents must be faced with unified global cooperation, and we must rely more than ever on automated yet trustworthy tools to combat the ever-growing nature of online offenses. Over 10 million child sexual abuse reports are submitted to the US National Center for Missing & Exploited Children every year, and over 80% originate from online sources. Therefore, investigation centers and clearinghouses cannot manually process and correctly investigate all imagery. In light of that, reliable automated tools that can securely and efficiently deal with this data are paramount. In this sense, the scene recognition task looks for contextual cues in the environment, making it possible to group and classify child sexual abuse data without requiring training on sensitive material. The scarcity and limitations of working with child sexual abuse images lead us to self-supervised learning, a machine-learning methodology that leverages unlabeled data to produce powerful representations that can be more easily transferred to target tasks. This work shows that self-supervised deep learning models pre-trained on scene-centric data can reach 71.6% balanced accuracy on our indoor scene classification task and, on average, perform 2.2 percentage points better than a fully supervised version. We cooperate with Brazilian Federal Police experts to evaluate our indoor classification model on actual child abuse material. The results demonstrate a notable discrepancy between the features observed in widely used scene datasets and those depicted in sensitive materials.
https://arxiv.org/abs/2403.01183
Most state-of-the-art computer vision models heavily depend on data. However, many datasets exhibit extreme class imbalance, which has been shown to negatively impact model performance. Among the training-time and data-generation solutions that have been explored, one subset that leverages existing data is importance sampling. A good deal of this work focuses primarily on the CIFAR-10 and CIFAR-100 datasets, which fail to be representative of the scale, composition, and complexity of current state-of-the-art datasets. In this work, we explore and compare three techniques that derive from importance sampling: loss reweighting, undersampling, and oversampling. Specifically, we compare the effect of these techniques on the performance of two encoders on an impactful satellite imagery dataset, Planet's Amazon Rainforest dataset, in preparation for another work. Furthermore, we perform supplemental experimentation on a scene classification dataset, ADE20K, to test on a contrasting domain and clarify our results. Across both types of encoders, we find that loss up-weighting and undersampling have a negligible effect on performance for underrepresented classes. Additionally, our results suggest oversampling generally improves performance for the same underrepresented classes. Interestingly, our findings also indicate that there may exist some redundancy in the data in the Planet dataset. Our work aims to provide a foundation for further work on the Planet dataset and similar domain-specific datasets. We open-source our code at this https URL for future work on other satellite imagery datasets as well.
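The three compared techniques are standard enough to sketch in PyTorch (this is illustrative, not the paper's code; the toy label vector stands in for a real dataset):

```python
import torch
from torch.utils.data import WeightedRandomSampler

labels = torch.tensor([0] * 90 + [1] * 10)          # extreme class imbalance
counts = torch.bincount(labels).float()

# 1) Loss reweighting: weight each class inversely to its frequency.
class_weights = counts.sum() / (len(counts) * counts)
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)

# 2) Oversampling: draw minority-class samples more often (with replacement).
sample_weights = class_weights[labels]
oversampler = WeightedRandomSampler(sample_weights, num_samples=len(labels),
                                    replacement=True)

# 3) Undersampling: keep only as many majority samples as the minority has.
n_min = int(counts.min())
keep = torch.cat([torch.where(labels == c)[0][:n_min]
                  for c in range(len(counts))])
```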
https://arxiv.org/abs/2402.18742
Multi-modal sensor data fusion takes advantage of complementary or reinforcing information from each sensor and can boost overall performance in applications such as scene classification and target detection. This paper presents a new method for fusing multi-modal and multi-resolution remote sensor data without requiring pixel-level training labels, which can be difficult to obtain. Previously, we developed a Multiple Instance Multi-Resolution Fusion (MIMRF) framework that addresses label uncertainty for fusion, but it can be slow to train due to the large search space for the fuzzy measures used to integrate sensor data sources. We propose a new method based on binary fuzzy measures, which reduces the search space and significantly improves the efficiency of the MIMRF framework. We present experimental results on synthetic data and a real-world remote sensing detection task and show that the proposed MIMRF-BFM algorithm can effectively and efficiently perform multi-resolution fusion given remote sensing data with uncertainty.
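The fusion step MIMRF-style methods build on is the Choquet integral with respect to a fuzzy measure; restricting the measure to binary values is what shrinks the search space. A hedged sketch with an illustrative measure (not the authors' implementation):

```python
import itertools

def choquet_integral(values, measure):
    """values: {source: confidence}; measure: {frozenset of sources: g in [0,1]}."""
    items = sorted(values.items(), key=lambda kv: kv[1], reverse=True)
    total, prev_g, subset = 0.0, 0.0, set()
    for src, h in items:                       # walk sources in decreasing order
        subset.add(src)
        g = measure[frozenset(subset)]
        total += h * (g - prev_g)              # pay for the increment in measure
        prev_g = g
    return total

sources = ["lidar", "hyperspectral"]
# Binary measure (illustrative): any subset containing lidar is fully trusted.
measure = {frozenset(s): float("lidar" in s)
           for r in range(1, len(sources) + 1)
           for s in itertools.combinations(sources, r)}
print(choquet_integral({"lidar": 0.4, "hyperspectral": 0.9}, measure))  # 0.4
```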
https://arxiv.org/abs/2402.05045
Acoustic scene classification (ASC) is a crucial research problem in computational auditory scene analysis, and it aims to recognize the unique acoustic characteristics of an environment. One of the challenges of the ASC task is the domain shift caused by a distribution gap between training and testing data. Since 2018, ASC challenges have focused on the generalization of ASC models across different recording devices. Although substantial progress in device generalization has been achieved in recent years, the challenge of domain shift between different regions, involving characteristics such as time, space, culture, and language, remains insufficiently explored. In addition, considering the abundance of unlabeled acoustic scene data in the real world, it is important to study possible ways to utilize these unlabeled data. Therefore, we introduce the task Semi-supervised Acoustic Scene Classification under Domain Shift in the ICME 2024 Grand Challenge. We encourage participants to innovate with semi-supervised learning techniques, aiming to develop more robust ASC models under domain shift.
https://arxiv.org/abs/2402.02694
Deep neural networks have achieved promising progress in remote sensing (RS) image classification, whose training process requires abundant samples for each class. However, it is time-consuming and unrealistic to annotate labels for each RS category, given that the RS target database is growing dynamically. Zero-shot learning (ZSL) allows for identifying novel classes that are not seen during training, which provides a promising solution to the aforementioned problem. However, previous ZSL models mainly depend on manually labeled attributes or word embeddings extracted from language models to transfer knowledge from seen classes to novel classes. Besides, pioneering ZSL models use convolutional neural networks pre-trained on ImageNet, which focus on the main objects appearing in each image, neglecting the background context that also matters in RS scene classification. To address the above problems, we propose to collect visually detectable attributes automatically. We predict attributes for each class by depicting the semantic-visual similarity between attributes and images. In this way, the attribute annotation process is accomplished by machine rather than by hand, as in other methods. Moreover, we propose a Deep Semantic-Visual Alignment (DSVA) model that takes advantage of the self-attention mechanism in the transformer to associate local image regions together, integrating background context information for prediction. The DSVA model further utilizes attribute attention maps to focus on the informative image regions that are essential for knowledge transfer in ZSL, and maps the visual images into attribute space to perform ZSL classification. With extensive experiments, we show that our model outperforms other state-of-the-art models by a large margin on a challenging large-scale RS scene classification benchmark.
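The final ZSL classification step described above reduces to a similarity search in attribute space, sketched here with random placeholder attribute vectors (not the authors' code):

```python
import torch
import torch.nn.functional as F

num_attrs, num_unseen = 32, 5
class_attrs = torch.rand(num_unseen, num_attrs)      # per-class attribute vectors
image_attrs = torch.rand(8, num_attrs)               # model-predicted attributes

sim = F.cosine_similarity(image_attrs.unsqueeze(1),  # (8, 1, A)
                          class_attrs.unsqueeze(0),  # (1, C, A)
                          dim=-1)                    # (8, C) similarity matrix
pred = sim.argmax(dim=1)                             # zero-shot class per image
```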
https://arxiv.org/abs/2402.02094
Although neural models have achieved remarkable performance, they still face skepticism due to their lack of transparency. To this end, model prediction explanation is attracting more and more attention. However, current methods rarely incorporate external knowledge and still suffer from three limitations: (1) Neglecting concept completeness: merely selecting concepts may not be sufficient for prediction. (2) Lacking concept fusion: failing to merge semantically equivalent concepts. (3) Difficulty in manipulating model behavior: explanations are not verified against the original model. To address these issues, we propose a novel knowledge-aware neuron interpretation framework to explain model predictions for image scene classification. Specifically, for concept completeness, we derive the core concepts of a scene from a knowledge graph, ConceptNet, to gauge the completeness of concepts. Our method, incorporating complete concepts, provides better prediction explanations than the baselines. Furthermore, for concept fusion, we introduce a knowledge-graph-based method known as Concept Filtering, which yields a gain of over 23 percentage points on neuron behaviors for neuron interpretation. Finally, we propose Model Manipulation, which studies whether the core concepts based on ConceptNet can be employed to manipulate model behavior. The results show that core concepts can improve the performance of the original model by over 26%.
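ConceptNet is queryable through its public REST API, which is presumably the raw material for the core-concept step; how the paper filters and weights the retrieved edges is its own contribution. A minimal lookup sketch (requires network access):

```python
import requests

def related_concepts(term, limit=10):
    """Fetch edges touching an English-language ConceptNet node."""
    url = f"http://api.conceptnet.io/c/en/{term}"
    edges = requests.get(url, params={"limit": limit}).json()["edges"]
    return [(e["rel"]["label"], e["start"]["label"], e["end"]["label"])
            for e in edges]

for rel, start, end in related_concepts("bedroom"):
    print(f"{start} --{rel}--> {end}")
```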
https://arxiv.org/abs/2401.15820
In this work, we aim to establish a Bayesian adaptive learning framework by focusing on estimating latent variables in deep neural network (DNN) models. Latent variables encode both transferable distributional information and structural relationships. Thus the distributions of the source latent variables (prior) can be combined with the knowledge learned from the target data (likelihood) to yield the distributions of the target latent variables (posterior), with the goal of addressing acoustic mismatches between training and testing conditions. The prior knowledge transfer is accomplished through Variational Bayes (VB). In addition, we also investigate Maximum a Posteriori (MAP) based Bayesian adaptation. Experimental results on device adaptation in acoustic scene classification show that our proposed approaches obtain good improvements on target devices and consistently outperform other cutting-edge algorithms.
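A hedged numeric sketch of the MAP flavor of such adaptation for a latent Gaussian mean (not the paper's exact model): the source-domain prior is combined with target-device statistics, so scarce target data shifts the estimate only partially. The prior strength `tau` is illustrative.

```python
import numpy as np

def map_adapt_mean(prior_mean, target_data, tau=10.0):
    """Posterior mean interpolates the prior and the target sample mean."""
    n = len(target_data)
    return (tau * prior_mean + n * target_data.mean(0)) / (tau + n)

rng = np.random.default_rng(0)
prior_mean = np.zeros(4)                      # learned on source devices
target = rng.normal(1.0, 1.0, size=(5, 4))    # few target-device samples
print(map_adapt_mean(prior_mean, target))     # pulled part-way toward 1.0
```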
https://arxiv.org/abs/2401.13766
Computer-based scene understanding has influenced fields ranging from urban planning to autonomous vehicle performance, yet little is known about how well these technologies work across social differences. We investigate the biases of deep convolutional neural networks (dCNNs) in scene classification, using nearly one million images from global and US sources, including user-submitted home photographs and Airbnb listings. We applied statistical models to quantify the impact of socioeconomic indicators such as family income, Human Development Index (HDI), and demographic factors from public data sources (CIA and US Census) on dCNN performance. Our analyses revealed significant socioeconomic bias, where pretrained dCNNs demonstrated lower classification accuracy, lower classification confidence, and a higher tendency to assign labels that could be offensive when applied to homes (e.g., "ruin", "slum"), especially in images from homes with lower socioeconomic status (SES). This trend is consistent across two datasets of international images and within the diverse economic and racial landscapes of the United States. This research contributes to understanding biases in computer vision, emphasizing the need for more inclusive and representative training datasets. By mitigating the bias in the computer vision pipelines, we can ensure fairer and more equitable outcomes for applied computer vision, including home valuation and smart home security systems. There is urgency in addressing these biases, which can significantly impact critical decisions in urban development and resource allocation. Our findings also motivate the development of AI systems that better understand and serve diverse communities, moving towards technology that equitably benefits all sectors of society.
https://arxiv.org/abs/2401.13097
An image summary, an abridged version of the original visual content, can be used to represent the scene. Thus, tasks such as scene classification, identification, and indexing can be performed efficiently using the unique summary. Saliency is the most commonly used technique for generating the relevant image summary. However, the definition of saliency is subjective in nature and depends upon the application. Existing saliency detection methods using RGB-D data mainly focus on color, texture, and depth features. Consequently, the generated summary contains either foreground objects or non-stationary objects. However, unlike state-of-the-art methods, applications such as scene identification require the stationary characteristics of the scene. This paper proposes a novel volumetric saliency-guided framework for indoor scene classification. The results highlight the efficacy of the proposed method.
https://arxiv.org/abs/2401.16227
Deep learning models are essential for scene classification, change detection, land cover segmentation, and other remote sensing image understanding tasks. The backbones of most existing remote sensing deep learning models are initialized with pre-trained weights obtained from ImageNet pre-training (IMP). However, domain gaps exist between remote sensing images and natural images (e.g., ImageNet), so deep learning models initialized with IMP weights perform poorly for remote sensing image understanding. Although some pre-training methods have been studied in the remote sensing community, current remote sensing pre-training methods suffer from vague generalization because they use remote sensing images only. In this paper, we propose a novel remote sensing pre-training framework, Generic Knowledge Boosted Remote Sensing Pre-training (GeRSP), to learn robust representations from remote sensing and natural images for remote sensing understanding tasks. GeRSP contains two pre-training branches: (1) a self-supervised pre-training branch that learns domain-related representations from unlabeled remote sensing images, and (2) a supervised pre-training branch integrated into GeRSP for general knowledge learning from labeled natural images. Moreover, GeRSP combines the two pre-training branches using a teacher-student architecture to simultaneously learn representations with general and specialized knowledge, which generates a powerful pre-trained model for deep learning model initialization. Finally, we evaluate GeRSP and other remote sensing pre-training methods on three downstream tasks, i.e., object detection, semantic segmentation, and scene classification. The extensive experimental results consistently demonstrate that GeRSP can effectively learn robust representations in a unified manner, improving the performance of remote sensing downstream tasks.
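A hedged sketch of the two-branch, teacher-student combination (not the authors' code): a self-supervised loss on unlabeled remote sensing images plus a supervised loss on labeled natural images, with the teacher updated as an EMA of the student. The tiny encoder, loss choices, and momentum value are illustrative; in practice the SSL branch would compare differently augmented views.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))
teacher = copy.deepcopy(student)              # momentum (EMA) encoder
classifier = nn.Linear(128, 1000)             # supervised head, natural images

rs_imgs = torch.randn(8, 3, 32, 32)           # unlabeled remote sensing batch
nat_imgs = torch.randn(8, 3, 32, 32)          # labeled natural-image batch
nat_labels = torch.randint(0, 1000, (8,))

# Self-supervised branch: student matches the (frozen) teacher embedding.
with torch.no_grad():
    target = teacher(rs_imgs)
ssl_loss = 1 - F.cosine_similarity(student(rs_imgs), target).mean()

# Supervised branch: ordinary classification on natural images.
sup_loss = F.cross_entropy(classifier(student(nat_imgs)), nat_labels)

(ssl_loss + sup_loss).backward()

# EMA teacher update after each optimizer step:
with torch.no_grad():
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(0.99).add_(ps, alpha=0.01)
```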
https://arxiv.org/abs/2401.04614
Remote sensing scene classification is a challenging and valuable research topic in which Convolutional Neural Networks (CNNs) have played a crucial role. A CNN can extract hierarchical convolutional features from remote sensing imagery, and feature fusion across different layers can enhance its performance. Two successful feature fusion methods, Add and Concat, are employed in certain state-of-the-art CNN algorithms. In this paper, we propose a novel feature fusion algorithm that unifies the aforementioned methods using the Kronecker Product (KPFF), and we discuss the backpropagation procedure associated with this algorithm. To validate the efficacy of the proposed method, a series of experiments are designed and conducted. The results demonstrate its effectiveness in enhancing CNN accuracy for remote sensing scene classification.
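The unification claim is easy to illustrate: with suitable selector vectors, a sum of Kronecker products reproduces both Add and Concat. A minimal sketch (illustrative, not the paper's exact formulation):

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])   # features from one CNN layer
b = torch.tensor([4.0, 5.0, 6.0])   # features from another layer

# kron([1], x) is x itself, so this reduces to elementwise Add.
add_fused = torch.kron(torch.ones(1), a) + torch.kron(torch.ones(1), b)

# Selector vectors place each feature in its own half, i.e. Concat.
concat_fused = (torch.kron(torch.tensor([1.0, 0.0]), a)
                + torch.kron(torch.tensor([0.0, 1.0]), b))

print(add_fused)     # tensor([5., 7., 9.])             == a + b
print(concat_fused)  # tensor([1., 2., 3., 4., 5., 6.]) == cat(a, b)
```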
https://arxiv.org/abs/2402.00036
Remote sensing imagery, despite its broad applications in helping achieve Sustainable Development Goals and tackle climate change, has not yet benefited from the recent advancements of versatile, task-agnostic vision language models (VLMs). A key reason is that the large-scale, semantically diverse image-text dataset required for developing VLMs is still absent for remote sensing images. Unlike natural images, remote sensing images and their associated text descriptions cannot be efficiently collected from the public Internet at scale. In this work, we bridge this gap by using geo-coordinates to automatically connect open, unlabeled remote sensing images with rich semantics covered in OpenStreetMap, and thus construct SkyScript, a comprehensive vision-language dataset for remote sensing images, comprising 2.6 million image-text pairs covering 29K distinct semantic tags. With continual pre-training on this dataset, we obtain a VLM that surpasses baseline models with a 6.2% average accuracy gain in zero-shot scene classification across seven benchmark datasets. It also demonstrates the ability of zero-shot transfer for fine-grained object attribute classification and cross-modal retrieval. We hope this dataset can support the advancement of VLMs for various multi-modal tasks in remote sensing, such as open-vocabulary classification, retrieval, captioning, and text-to-image synthesis.
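Zero-shot scene classification with a CLIP-style VLM, the evaluation reported above, can be sketched as follows; a generic OpenCLIP checkpoint and a hypothetical `tile.png` stand in for the paper's continually pretrained model and data:

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")   # stand-in checkpoint
tokenizer = open_clip.get_tokenizer("ViT-B-32")

classes = ["airport", "forest", "harbor", "residential area"]
text = tokenizer([f"a satellite image of a {c}" for c in classes])
image = preprocess(Image.open("tile.png")).unsqueeze(0)  # hypothetical file

with torch.no_grad():
    img_f = model.encode_image(image)
    txt_f = model.encode_text(text)
    img_f = img_f / img_f.norm(dim=-1, keepdim=True)
    txt_f = txt_f / txt_f.norm(dim=-1, keepdim=True)
    probs = (100 * img_f @ txt_f.T).softmax(dim=-1)  # class probabilities
print(classes[probs.argmax().item()])
```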
https://arxiv.org/abs/2312.12856
The scale and quality of point cloud datasets constrain the advancement of point cloud learning. Recently, with the development of multi-modal learning, the incorporation of domain-agnostic prior knowledge from other modalities, such as images and text, to assist in point cloud feature learning has been considered a promising avenue. Existing methods have demonstrated the effectiveness of multi-modal contrastive training and feature distillation on point clouds. However, challenges remain, including the requirement for paired triplet data, redundancy and ambiguity in supervised features, and the disruption of the original priors. In this paper, we propose a language-assisted approach to point cloud feature learning (LAST-PCL), enriching semantic concepts through LLM-based text enrichment. We achieve de-redundancy and feature dimensionality reduction without compromising textual priors through statistics-based, training-free significant feature selection. Furthermore, we delve into an in-depth analysis of the impact of text contrastive training on the point cloud. Extensive experiments validate that the proposed method learns semantically meaningful point cloud features and achieves state-of-the-art or comparable performance in 3D semantic segmentation, 3D object detection, and 3D scene classification tasks. The source code is available at this https URL.
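One plausible form of statistics-based, training-free feature selection (hedged; the paper's actual criterion may differ): keep the text-embedding dimensions with the highest variance across the LLM-enriched descriptions.

```python
import numpy as np

rng = np.random.default_rng(0)
text_embeds = rng.normal(size=(500, 768))   # e.g. LLM-enriched descriptions

k = 128
variances = text_embeds.var(axis=0)         # per-dimension spread
keep = np.argsort(variances)[-k:]           # top-k most variable dimensions
reduced = text_embeds[:, keep]              # (500, 128), no training needed
```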
https://arxiv.org/abs/2312.11451
Maps are a fundamental medium for visualizing and representing the real world in a simple and philosophical way. The emergence of the third wave of information has made it possible for a large proportion of maps to be generated ubiquitously, which would significantly enrich the dimensions and perspectives from which to understand the characteristics of the real world. However, a majority of map datasets have never been discovered, acquired, or effectively used, and the map data used in many applications might not be completely fitted to the authentic demands of those applications. This challenge emerges from the lack of numerous well-labelled benchmark datasets for applying deep learning approaches to identifying complicated map content. Thus, we develop a large-scale benchmark dataset that includes well-labelled datasets for map text annotation recognition, map scene classification, map super-resolution reconstruction, and map style transferring. Furthermore, these well-labelled datasets will facilitate state-of-the-art machine intelligence technologies in conducting map feature detection, map pattern recognition, and map content retrieval. We hope our efforts will be useful for AI-enhanced cartographical applications.
https://arxiv.org/abs/2312.08600