Crime in the 21st century is split between a virtual and a real world. However, the former has become a global menace to people's well-being and security in the latter. The challenges it presents must be faced with unified global cooperation, and we must rely more than ever on automated yet trustworthy tools to combat the ever-growing volume of online offenses. Over 10 million child sexual abuse reports are submitted to the US National Center for Missing & Exploited Children every year, and over 80% originate from online sources. Therefore, investigation centers and clearinghouses cannot manually process and correctly investigate all imagery. In light of that, reliable automated tools that can securely and efficiently deal with this data are paramount. In this sense, the scene recognition task looks for contextual cues in the environment, enabling child sexual abuse data to be grouped and classified without requiring training on sensitive material. The scarcity of and restrictions on child sexual abuse images motivate self-supervised learning, a machine-learning methodology that leverages unlabeled data to produce powerful representations that can be more easily transferred to target tasks. This work shows that self-supervised deep learning models pre-trained on scene-centric data can reach 71.6% balanced accuracy on our indoor scene classification task, on average 2.2 percentage points better than a fully supervised version. We cooperate with Brazilian Federal Police experts to evaluate our indoor classification model on actual child abuse material. The results demonstrate a notable discrepancy between the features observed in widely used scene datasets and those depicted in sensitive materials.
https://arxiv.org/abs/2403.01183
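As a side note on the metric above: balanced accuracy is the mean of per-class recalls, which prevents a dominant class from masking failures on rare ones. A minimal sketch (the toy labels are illustrative; sklearn's balanced_accuracy_score computes the same quantity):

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls; insensitive to class imbalance."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy 3-class example: per-class recalls are 2/3, 1/2, and 1/1.
print(balanced_accuracy([0, 0, 0, 1, 1, 2],
                        [0, 0, 1, 1, 0, 2]))  # ~0.722
```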
Most state-of-the-art computer vision models heavily depend on data. However, many datasets exhibit extreme class imbalance, which has been shown to negatively impact model performance. Among the training-time and data-generation solutions that have been explored, one subset that leverages existing data is importance sampling. Much of this work focuses primarily on the CIFAR-10 and CIFAR-100 datasets, which are not representative of the scale, composition, and complexity of current state-of-the-art datasets. In this work, we explore and compare three techniques derived from importance sampling: loss reweighting, undersampling, and oversampling. Specifically, we compare the effect of these techniques on the performance of two encoders on an impactful satellite imagery dataset, Planet's Amazon Rainforest dataset, in preparation for another work. Furthermore, we perform supplemental experiments on a scene classification dataset, ADE20K, to test on a contrasting domain and clarify our results. Across both types of encoders, we find that up-weighting the loss for underrepresented classes and undersampling have a negligible effect on performance for those classes. Additionally, our results suggest oversampling generally improves performance for the same underrepresented classes. Interestingly, our findings also indicate that there may exist some redundancy in the data in the Planet dataset. Our work aims to provide a foundation for further work on the Planet dataset and similar domain-specific datasets. We open-source our code at this https URL for future work on other satellite imagery datasets as well.
https://arxiv.org/abs/2402.18742
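For concreteness, here is a minimal PyTorch sketch of the three techniques compared above; the label list is a stand-in for a real dataset, and the exact weighting schemes used in the paper may differ:

```python
import torch
from torch.utils.data import WeightedRandomSampler
from collections import Counter

labels = [0, 0, 0, 0, 1, 2, 2]            # hypothetical class ids
counts = Counter(labels)
n_classes = len(counts)

# (1) Loss reweighting: scale each class inversely to its frequency.
class_weights = torch.tensor(
    [len(labels) / (n_classes * counts[c]) for c in range(n_classes)])
criterion = torch.nn.CrossEntropyLoss(weight=class_weights)

# (2) Oversampling / (3) undersampling: draw examples with weights
# inversely proportional to their class frequency. Sampling with
# replacement up-samples rare classes; setting num_samples below the
# dataset size effectively down-samples common ones.
example_weights = [1.0 / counts[y] for y in labels]
sampler = WeightedRandomSampler(example_weights,
                                num_samples=len(labels),
                                replacement=True)
# loader = DataLoader(dataset, batch_size=32, sampler=sampler)
```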
Multi-modal sensor data fusion takes advantage of complementary or reinforcing information from each sensor and can boost overall performance in applications such as scene classification and target detection. This paper presents a new method for fusing multi-modal and multi-resolution remote sensor data without requiring pixel-level training labels, which can be difficult to obtain. Previously, we developed a Multiple Instance Multi-Resolution Fusion (MIMRF) framework that addresses label uncertainty for fusion, but it can be slow to train due to the large search space for the fuzzy measures used to integrate sensor data sources. We propose a new method based on binary fuzzy measures, which reduces the search space and significantly improves the efficiency of the MIMRF framework. We present experimental results on synthetic data and a real-world remote sensing detection task and show that the proposed MIMRF-BFM algorithm can effectively and efficiently perform multi-resolution fusion given remote sensing data with uncertainty.
https://arxiv.org/abs/2402.05045
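To make the fusion operator concrete: MIMRF-style fusion aggregates per-source confidences with a Choquet integral over a fuzzy measure, and restricting the measure to binary (0/1) values is what shrinks the search space. Below is a minimal sketch of the integral itself, assuming two sources and a hand-picked measure; the paper learns the measure under multiple-instance labels, which this omits:

```python
import numpy as np

def choquet(h, g):
    """Choquet integral of source confidences h w.r.t. fuzzy measure g,
    a dict from frozensets of source indices to [0, 1] with
    g[frozenset()] = 0 and g[all sources] = 1."""
    order = np.argsort(h)[::-1]                # most confident first
    total, prev, subset = 0.0, 0.0, frozenset()
    for i in order:
        subset = subset | {int(i)}
        total += h[i] * (g[subset] - prev)     # marginal worth of adding i
        prev = g[subset]
    return total

# Binary measure where only the full coalition counts: fusion then
# requires both sources to agree, i.e. it behaves like min().
g_and = {frozenset(): 0.0, frozenset({0}): 0.0,
         frozenset({1}): 0.0, frozenset({0, 1}): 1.0}
print(choquet(np.array([0.9, 0.4]), g_and))    # 0.4
```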
Acoustic scene classification (ASC) is a crucial research problem in computational auditory scene analysis, and it aims to recognize the unique acoustic characteristics of an environment. One of the challenges of the ASC task is domain shift caused by a distribution gap between training and testing data. Since 2018, ASC challenges have focused on the generalization of ASC models across different recording devices. Although substantial progress on device generalization has been made in recent years, the challenge of domain shift between different regions, involving characteristics such as time, space, culture, and language, remains insufficiently explored. In addition, considering the abundance of unlabeled acoustic scene data in the real world, it is important to study possible ways to utilize these unlabeled data. Therefore, we introduce the task Semi-supervised Acoustic Scene Classification under Domain Shift in the ICME 2024 Grand Challenge. We encourage participants to innovate with semi-supervised learning techniques, aiming to develop more robust ASC models under domain shift.
https://arxiv.org/abs/2402.02694
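As one example of the kind of semi-supervised technique the challenge invites (not a prescribed baseline), confidence-thresholded pseudo-labeling in the style of FixMatch is a common starting point; the model and tensors below are placeholders:

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss(model, x_unlabeled, threshold=0.95):
    """Unlabeled-data loss: keep only predictions the model is already
    confident about and train on them as if they were ground truth."""
    with torch.no_grad():
        probs = F.softmax(model(x_unlabeled), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        mask = conf >= threshold
    if not mask.any():
        return torch.tensor(0.0)
    logits = model(x_unlabeled[mask])  # ideally a strongly augmented view
    return F.cross_entropy(logits, pseudo[mask])
```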
Deep neural networks have achieved promising progress in remote sensing (RS) image classification, for which the training process requires abundant samples for each class. However, it is time-consuming and unrealistic to annotate labels for each RS category, given that the RS target database is growing dynamically. Zero-shot learning (ZSL) allows for identifying novel classes that are not seen during training, which provides a promising solution to the aforementioned problem. However, previous ZSL models mainly depend on manually labeled attributes or word embeddings extracted from language models to transfer knowledge from seen classes to novel classes. Moreover, pioneering ZSL models use convolutional neural networks pre-trained on ImageNet, which focus on the main objects appearing in each image, neglecting the background context that also matters in RS scene classification. To address the above problems, we propose to collect visually detectable attributes automatically. We predict attributes for each class by depicting the semantic-visual similarity between attributes and images. In this way, the attribute annotation process is accomplished by machine rather than by hand, as in other methods. Moreover, we propose a Deep Semantic-Visual Alignment (DSVA) model that takes advantage of the self-attention mechanism in the transformer to associate local image regions together, integrating the background context information for prediction. The DSVA model further utilizes the attribute attention maps to focus on the informative image regions that are essential for knowledge transfer in ZSL, and maps the visual images into attribute space to perform ZSL classification. With extensive experiments, we show that our model outperforms other state-of-the-art models by a large margin on a challenging large-scale RS scene classification benchmark.
https://arxiv.org/abs/2402.02094
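The final classification step described above, mapping images into attribute space and matching them against per-class attribute signatures, reduces to a nearest-neighbor rule. A minimal sketch with made-up attribute vectors (DSVA's learned mapping and attention are omitted):

```python
import numpy as np

def zsl_predict(img_attrs, class_attrs):
    """img_attrs: (n, a) images projected into attribute space.
    class_attrs: (c, a) machine-annotated per-class signatures.
    Returns the index of the most similar class for each image."""
    img = img_attrs / np.linalg.norm(img_attrs, axis=1, keepdims=True)
    cls = class_attrs / np.linalg.norm(class_attrs, axis=1, keepdims=True)
    return (img @ cls.T).argmax(axis=1)        # cosine similarity

# Two unseen classes over 3 attributes, e.g. (water, trees, buildings).
class_attrs = np.array([[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]])
img_attrs = np.array([[0.8, 0.2, 0.1], [0.1, 0.1, 0.7]])
print(zsl_predict(img_attrs, class_attrs))     # [0 1]
```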
Although neural models have achieved remarkable performance, they still face skepticism due to their lack of transparency. To this end, model prediction explanation is attracting more and more attention. However, current methods rarely incorporate external knowledge and still suffer from three limitations: (1) Neglecting concept completeness: merely selecting concepts may not be sufficient for prediction. (2) Lacking concept fusion: semantically equivalent concepts are not merged. (3) Difficulty in manipulating model behavior: explanations are not verified on the original model. To address these issues, we propose a novel knowledge-aware neuron interpretation framework to explain model predictions for image scene classification. Specifically, for concept completeness, we present core concepts of a scene based on the ConceptNet knowledge graph to gauge the completeness of concepts. Our method, incorporating complete concepts, provides better prediction explanations than baselines. Furthermore, for concept fusion, we introduce a knowledge-graph-based method known as Concept Filtering, which yields a gain of over 23 percentage points on neuron behaviors for neuron interpretation. Finally, we propose Model Manipulation, which studies whether the core concepts based on ConceptNet can be employed to manipulate model behavior. The results show that core concepts can effectively improve the performance of the original model by over 26%.
https://arxiv.org/abs/2401.15820
In this work, we aim to establish a Bayesian adaptive learning framework by focusing on estimating latent variables in deep neural network (DNN) models. Latent variables encode both transferable distributional information and structural relationships. Thus the distributions of the source latent variables (prior) can be combined with the knowledge learned from the target data (likelihood) to yield the distributions of the target latent variables (posterior), with the goal of addressing acoustic mismatches between training and testing conditions. The prior knowledge transfer is accomplished through Variational Bayes (VB). In addition, we also investigate Maximum a Posteriori (MAP) based Bayesian adaptation. Experimental results on device adaptation in acoustic scene classification show that our proposed approaches obtain good improvements on target devices and consistently outperform other cutting-edge algorithms.
https://arxiv.org/abs/2401.13766
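To illustrate the MAP variant in its simplest form: with a Gaussian prior centred on the source model, the posterior objective is the target-data loss plus an L2 pull toward the source parameters. This sketch works at the parameter level for brevity; the paper's VB treatment instead places distributions over latent variables:

```python
import torch

def map_adaptation_loss(nll_target, params, source_params, tau=1e-3):
    """MAP objective: negative log-likelihood on target-device data plus
    a Gaussian prior centred on the source model's parameters, which
    keeps the adapted model from drifting away from source knowledge."""
    prior = sum(((p - p0) ** 2).sum()
                for p, p0 in zip(params, source_params))
    return nll_target + tau * prior
```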
Computer-based scene understanding has influenced fields ranging from urban planning to autonomous vehicle performance, yet little is known about how well these technologies work across social differences. We investigate the biases of deep convolutional neural networks (dCNNs) in scene classification, using nearly one million images from global and US sources, including user-submitted home photographs and Airbnb listings. We applied statistical models to quantify the impact of socioeconomic indicators such as family income, Human Development Index (HDI), and demographic factors from public data sources (CIA and US Census) on dCNN performance. Our analyses revealed significant socioeconomic bias, where pretrained dCNNs demonstrated lower classification accuracy, lower classification confidence, and a higher tendency to assign labels that could be offensive when applied to homes (e.g., "ruin", "slum"), especially in images from homes with lower socioeconomic status (SES). This trend is consistent across two datasets of international images and within the diverse economic and racial landscapes of the United States. This research contributes to understanding biases in computer vision, emphasizing the need for more inclusive and representative training datasets. By mitigating the bias in the computer vision pipelines, we can ensure fairer and more equitable outcomes for applied computer vision, including home valuation and smart home security systems. There is urgency in addressing these biases, which can significantly impact critical decisions in urban development and resource allocation. Our findings also motivate the development of AI systems that better understand and serve diverse communities, moving towards technology that equitably benefits all sectors of society.
https://arxiv.org/abs/2401.13097
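The statistical-modeling step above can be pictured as regressing per-image correctness on socioeconomic indicators. A toy sketch with synthetic data (the study's actual models and covariates are richer):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic records: did the dCNN label the home scene correctly (0/1),
# given the household's log income? A positive, significant income
# coefficient would mirror the SES bias reported above.
rng = np.random.default_rng(0)
log_income = rng.normal(10.0, 1.0, 500)
p_correct = 1.0 / (1.0 + np.exp(-0.6 * (log_income - 10.0)))
correct = rng.binomial(1, p_correct)

model = sm.Logit(correct, sm.add_constant(log_income)).fit(disp=0)
print(model.params)   # [intercept, income effect]
```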
Image summary, an abridged version of the original visual content, can be used to represent the scene. Thus, tasks such as scene classification, identification, and indexing can be performed efficiently using a unique summary. Saliency is the most commonly used technique for generating the relevant image summary. However, the definition of saliency is subjective in nature and depends upon the application. Existing saliency detection methods using RGB-D data mainly focus on color, texture, and depth features. Consequently, the generated summary contains either foreground objects or non-stationary objects. However, applications such as scene identification require the stationary characteristics of the scene, which state-of-the-art methods do not capture. This paper proposes a novel volumetric saliency-guided framework for indoor scene classification. The results highlight the efficacy of the proposed method.
https://arxiv.org/abs/2401.16227
Deep learning models are essential for scene classification, change detection, land cover segmentation, and other remote sensing image understanding tasks. Most backbones of existing remote sensing deep learning models are typically initialized with pre-trained weights obtained from ImageNet pre-training (IMP). However, domain gaps exist between remote sensing images and natural images (e.g., ImageNet), making deep learning models initialized with IMP weights perform poorly for remote sensing image understanding. Although some pre-training methods have been studied in the remote sensing community, current remote sensing pre-training methods suffer from limited generalization because they use only remote sensing images. In this paper, we propose a novel remote sensing pre-training framework, Generic Knowledge Boosted Remote Sensing Pre-training (GeRSP), to learn robust representations from remote sensing and natural images for remote sensing understanding tasks. GeRSP contains two pre-training branches: (1) a self-supervised pre-training branch that learns domain-related representations from unlabeled remote sensing images, and (2) a supervised pre-training branch that learns general knowledge from labeled natural images. Moreover, GeRSP combines the two pre-training branches using a teacher-student architecture to simultaneously learn representations with general and domain-specific knowledge, which generates a powerful pre-trained model for deep learning model initialization. Finally, we evaluate GeRSP and other remote sensing pre-training methods on three downstream tasks, i.e., object detection, semantic segmentation, and scene classification. The extensive experimental results consistently demonstrate that GeRSP can effectively learn robust representations in a unified manner, improving the performance of remote sensing downstream tasks.
https://arxiv.org/abs/2401.04614
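The two-branch recipe above can be summarized in a few lines: a weighted sum of the self-supervised and supervised losses trains the student, while the teacher tracks it as an exponential moving average. The weighting and momentum below are illustrative, not the paper's values:

```python
import torch

def gersp_loss(loss_ssl_rs, loss_sup_nat, alpha=0.5):
    """Combined objective: self-supervised loss on unlabeled remote
    sensing images plus supervised loss on labeled natural images."""
    return alpha * loss_ssl_rs + (1.0 - alpha) * loss_sup_nat

@torch.no_grad()
def ema_update(teacher, student, momentum=0.99):
    """Teacher-student coupling: the teacher's weights follow the
    student as an exponential moving average after each step."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)
```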
Remote sensing scene classification is a challenging and valuable research topic, in which the Convolutional Neural Network (CNN) has played a crucial role. A CNN can extract hierarchical convolutional features from remote sensing imagery, and feature fusion across different layers can enhance its performance. Two successful feature fusion methods, Add and Concat, are employed in certain state-of-the-art CNN algorithms. In this paper, we propose a novel feature fusion algorithm that unifies these methods using the Kronecker product (KPFF), and we discuss the backpropagation procedure associated with this algorithm. To validate the efficacy of the proposed method, we design and conduct a series of experiments. The results demonstrate its effectiveness in enhancing CNN accuracy for remote sensing scene classification.
https://arxiv.org/abs/2402.00036
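To see why the Kronecker product can subsume Add- and Concat-style fusion, note that appending a constant 1 to each feature vector makes the product carry both the original coordinates and all cross terms; how KPFF parameterizes this exactly is detailed in the paper:

```python
import numpy as np

f1 = np.array([1.0, 2.0])        # features from one CNN layer
f2 = np.array([3.0, 4.0, 5.0])   # features from another layer

# Plain Kronecker product: every pairwise product f1[i] * f2[j].
print(np.kron(f1, f2))           # [ 3.  4.  5.  6.  8. 10.]

# Appending a 1 to each vector keeps copies of f1 and f2 themselves
# (Concat-like terms) alongside the multiplicative interactions.
fused = np.kron(np.append(f1, 1.0), np.append(f2, 1.0))
print(fused)                     # [3. 4. 5. 1. 6. 8. 10. 2. 3. 4. 5. 1.]
```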
Remote sensing imagery, despite its broad applications in helping achieve Sustainable Development Goals and tackle climate change, has not yet benefited from the recent advancements of versatile, task-agnostic vision language models (VLMs). A key reason is that the large-scale, semantically diverse image-text dataset required for developing VLMs is still absent for remote sensing images. Unlike natural images, remote sensing images and their associated text descriptions cannot be efficiently collected from the public Internet at scale. In this work, we bridge this gap by using geo-coordinates to automatically connect open, unlabeled remote sensing images with rich semantics covered in OpenStreetMap, and thus construct SkyScript, a comprehensive vision-language dataset for remote sensing images, comprising 2.6 million image-text pairs covering 29K distinct semantic tags. With continual pre-training on this dataset, we obtain a VLM that surpasses baseline models with a 6.2% average accuracy gain in zero-shot scene classification across seven benchmark datasets. It also demonstrates the ability of zero-shot transfer for fine-grained object attribute classification and cross-modal retrieval. We hope this dataset can support the advancement of VLMs for various multi-modal tasks in remote sensing, such as open-vocabulary classification, retrieval, captioning, and text-to-image synthesis.
https://arxiv.org/abs/2312.12856
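Zero-shot scene classification with such a VLM follows the usual contrastive recipe: embed one text prompt per class and pick the best-matching class for the image embedding. The encoders below are placeholders for SkyScript-pretrained ones:

```python
import numpy as np

def zero_shot_classify(image_emb, class_names, encode_text):
    """Score an image embedding against a prompt per class and return
    the best match. `encode_text` stands in for the VLM text encoder."""
    prompts = [f"a satellite image of a {c}" for c in class_names]
    text = np.stack([encode_text(p) for p in prompts])
    text /= np.linalg.norm(text, axis=1, keepdims=True)
    image_emb = image_emb / np.linalg.norm(image_emb)
    return class_names[int((text @ image_emb).argmax())]
```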
The scale and quality of point cloud datasets constrain the advancement of point cloud learning. Recently, with the development of multi-modal learning, the incorporation of domain-agnostic prior knowledge from other modalities, such as images and text, to assist point cloud feature learning has been considered a promising avenue. Existing methods have demonstrated the effectiveness of multi-modal contrastive training and feature distillation on point clouds. However, challenges remain, including the requirement for paired triplet data, redundancy and ambiguity in supervised features, and the disruption of the original priors. In this paper, we propose a language-assisted approach to point cloud feature learning (LAST-PCL), enriching semantic concepts through LLM-based text enrichment. We achieve de-redundancy and feature dimensionality reduction without compromising textual priors through statistics-based, training-free significant-feature selection. Furthermore, we provide an in-depth analysis of the impact of text contrastive training on the point cloud. Extensive experiments validate that the proposed method learns semantically meaningful point cloud features and achieves state-of-the-art or comparable performance in 3D semantic segmentation, 3D object detection, and 3D scene classification tasks. The source code is available at this https URL.
https://arxiv.org/abs/2312.11451
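One simple instance of statistics-based, training-free feature selection, not necessarily the paper's exact criterion, is to keep the embedding dimensions with the highest variance across the enriched text corpus:

```python
import numpy as np

def select_significant_dims(text_embs, k):
    """Training-free de-redundancy: rank embedding dimensions by their
    variance over the corpus and keep the k most informative ones."""
    keep = np.argsort(text_embs.var(axis=0))[::-1][:k]
    return text_embs[:, keep], keep

# 100 enriched text embeddings of dimension 512, reduced to 64 dims.
embs = np.random.default_rng(0).normal(size=(100, 512))
reduced, kept = select_significant_dims(embs, 64)
print(reduced.shape)  # (100, 64)
```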
Maps are a fundamental medium for visualizing and representing the real world in a simple and philosophical way. The emergence of the third wave of information has made it possible to generate a large proportion of maps ubiquitously, which significantly enriches the dimensions and perspectives from which to understand the characteristics of the real world. However, a majority of map datasets have never been discovered, acquired, or effectively used, and the map data used in many applications might not be completely fitted to the authentic demands of those applications. This challenge arises from the lack of well-labelled benchmark datasets for applying deep learning approaches to identifying complicated map content. Thus, we develop a large-scale benchmark that includes well-labelled datasets for map text annotation recognition, map scene classification, map super-resolution reconstruction, and map style transfer. Furthermore, these well-labelled datasets will facilitate state-of-the-art machine intelligence technologies for map feature detection, map pattern recognition, and map content retrieval. We hope our efforts will be useful for AI-enhanced cartographical applications.
https://arxiv.org/abs/2312.08600
The excellent performance of recent self-supervised learning methods on various downstream tasks has attracted great attention from academia and industry. Some recent research efforts have been devoted to self-supervised music representation learning. Nevertheless, most of them learn to represent equally-sized music clips in the waveform or a spectrogram. Despite being effective in some tasks, learning music representations in such a manner largely neglects the inherent part-whole hierarchies of music. Due to the hierarchical nature of the auditory cortex [24], understanding the bottom-up structure of music, i.e., how different parts constitute the whole at different levels, is essential for music understanding and representation learning. This work pursues hierarchical music representation learning and introduces the Music-PAW framework, which enables feature interactions of cropped music clips with part-whole hierarchies. From a technical perspective, we propose a transformer-based part-whole interaction module to progressively reason about the structural relationships between part-whole music clips at adjacent levels. Besides, to create a multi-hierarchy representation space, we devise a hierarchical contrastive learning objective to align part-whole music representations in adjacent hierarchies. The merits of audio representation learning from part-whole hierarchies have been validated on various downstream tasks, including music classification (single-label and multi-label), cover song identification and acoustic scene classification.
https://arxiv.org/abs/2312.06197
Recent advancements in Large Vision-Language Models (VLMs) have shown great promise in natural image domains, allowing users to hold a dialogue about given visual content. However, such general-domain VLMs perform poorly for Remote Sensing (RS) scenarios, leading to inaccurate or fabricated information when presented with RS domain-specific queries. Such behavior emerges due to the unique challenges introduced by RS imagery. For example, to handle high-resolution RS imagery with diverse scale changes across categories and many small objects, region-level reasoning is necessary alongside holistic scene interpretation. Furthermore, the lack of domain-specific multimodal instruction-following data as well as strong backbone models for RS makes it hard for the models to align their behavior with user queries. To address these limitations, we propose GeoChat - the first versatile remote sensing VLM that offers multitask conversational capabilities with high-resolution RS images. Specifically, GeoChat can not only answer image-level queries but also accepts region inputs to hold region-specific dialogue. Furthermore, it can visually ground objects in its responses by referring to their spatial coordinates. To address the lack of domain-specific datasets, we generate a novel RS multimodal instruction-following dataset by extending image-text pairs from existing diverse RS datasets. We establish a comprehensive benchmark for RS multitask conversations and compare with a number of baseline methods. GeoChat demonstrates robust zero-shot performance on various RS tasks, e.g., image and region captioning, visual question answering, scene classification, visually grounded conversations and referring detection. Our code is available at this https URL.
https://arxiv.org/abs/2311.15826
Lack of interpretability of deep convolutional neural networks (DCNN) is a well-known problem particularly in the medical domain as clinicians want trustworthy automated decisions. One way to improve trust is to demonstrate the localisation of feature representations with respect to expert labeled regions of interest. In this work, we investigate the localisation of features learned via two varied learning paradigms and demonstrate the superiority of one learning approach with respect to localisation. Our analysis on medical and natural datasets show that the traditional end-to-end (E2E) learning strategy has a limited ability to localise discriminative features across multiple network layers. We show that a layer-wise learning strategy, namely cascade learning (CL), results in more localised features. Considering localisation accuracy, we not only show that CL outperforms E2E but that it is a promising method of predicting regions. On the YOLO object detection framework, our best result shows that CL outperforms the E2E scheme by $2\%$ in mAP.
https://arxiv.org/abs/2311.12704
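The contrast drawn above is between end-to-end backpropagation and training one layer at a time. A schematic of the cascade procedure, with `make_head` and `train_one` as stand-ins for an auxiliary classifier and a standard training loop:

```python
import torch.nn as nn

def cascade_train(blocks, make_head, train_one, data):
    """Cascade learning: optimise one block at a time behind a temporary
    auxiliary head, freeze it, and grow the network, instead of
    backpropagating through all layers end-to-end."""
    trained = nn.Sequential()
    for block in blocks:
        train_one(nn.Sequential(trained, block, make_head()), data)
        for p in block.parameters():
            p.requires_grad_(False)            # freeze the new block
        trained = nn.Sequential(*trained, block)
    return trained
```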
The foundation model has recently garnered significant attention due to its potential to revolutionize the field of visual representation learning in a self-supervised manner. While most foundation models are tailored to effectively process RGB images for various visual tasks, there is a noticeable gap in research focused on spectral data, which offers valuable information for scene understanding, especially in remote sensing (RS) applications. To fill this gap, we created for the first time a universal RS foundation model, named SpectralGPT, which is purpose-built to handle spectral RS images using a novel 3D generative pretrained transformer (GPT). Compared to existing foundation models, SpectralGPT 1) accommodates input images with varying sizes, resolutions, time series, and regions in a progressive training fashion, enabling full utilization of extensive RS big data; 2) leverages 3D token generation for spatial-spectral coupling; 3) captures spectrally sequential patterns via multi-target reconstruction; 4) trains on one million spectral RS images, yielding models with over 600 million parameters. Our evaluation highlights significant performance improvements with pretrained SpectralGPT models, signifying substantial potential in advancing spectral RS big data applications within the field of geoscience across four downstream tasks: single/multi-label scene classification, semantic segmentation, and change detection.
https://arxiv.org/abs/2311.07113
The recent development of deep learning methods applied to vision has enabled their increasing integration into real-world applications to perform complex Computer Vision (CV) tasks. However, image acquisition conditions have a major impact on the performance of high-level image processing. A possible solution to overcome these limitations is to artificially augment the training databases or to design deep learning models that are robust to signal distortions. We opt here for the first solution by enriching the database with complex and realistic distortions which have been ignored in existing databases until now. To this end, we built a new versatile database derived from the well-known MS-COCO database, to which we applied local and global photo-realistic distortions. The new local distortions are generated by considering the scene context of the images and by exploiting the depth information and semantics of the objects in the scene, which guarantees a high level of photo-realism and makes it possible to explore real scenarios ignored in conventional databases dedicated to various CV applications. Our versatile database offers an efficient solution to improve the robustness of various CV tasks such as Object Detection (OD), scene segmentation, and distortion-type classification methods. The image database, scene classification index, and distortion generation codes are publicly available \footnote{\url{this https URL}}
https://arxiv.org/abs/2311.06976
Current audio classification models have small class vocabularies relative to the large number of sound event classes of interest in the real world. Thus, they provide a limited view of the world that may miss important yet unexpected or unknown sound events. To address this issue, open-set audio classification techniques have been developed to detect sound events from unknown classes. Although these methods have been applied to a multi-class context in audio, such as sound scene classification, they have yet to be investigated for polyphonic audio in which sound events overlap, requiring the use of multi-label models. In this study, we establish the problem of multi-label open-set audio classification by creating a dataset with varying unknown class distributions and evaluating baseline approaches built upon existing techniques.
https://arxiv.org/abs/2310.13759
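A minimal decision rule for the problem set up above (one simple baseline, not the paper's exact approach): threshold each class's sigmoid score, and flag a clip as unknown when nothing is even moderately active:

```python
def multilabel_open_set(probs, known_thresh=0.5, unknown_thresh=0.3):
    """Emit every known class whose probability clears `known_thresh`;
    if no class is active and none comes close, the clip likely
    contains only unknown sound events."""
    active = [i for i, p in enumerate(probs) if p >= known_thresh]
    if not active and max(probs) < unknown_thresh:
        return ["unknown"]
    return active

print(multilabel_open_set([0.9, 0.1, 0.7]))   # [0, 2]
print(multilabel_open_set([0.1, 0.05, 0.2]))  # ['unknown']
```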