Unsupervised semantic segmentation aims to automatically partition images into semantically meaningful regions by identifying global categories within an image corpus without any form of annotation. Building upon recent advances in self-supervised representation learning, we focus on how to leverage these large pre-trained models for the downstream task of unsupervised segmentation. We present PriMaPs - Principal Mask Proposals - decomposing images into semantically meaningful masks based on their feature representation. This allows us to realize unsupervised semantic segmentation by fitting class prototypes to PriMaPs with a stochastic expectation-maximization algorithm, PriMaPs-EM. Despite its conceptual simplicity, PriMaPs-EM leads to competitive results across various pre-trained backbone models, including DINO and DINOv2, and across datasets, such as Cityscapes, COCO-Stuff, and Potsdam-3. Importantly, PriMaPs-EM is able to boost results when applied orthogonally to current state-of-the-art unsupervised semantic segmentation pipelines.
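To make the prototype-fitting step concrete, here is a minimal NumPy sketch of hard-EM over mask-level features: each mask proposal is assigned to its nearest class prototype (E-step) and prototypes are re-estimated from their members (M-step). The function name, dimensions, and the plain non-stochastic update are illustrative assumptions; the paper's PriMaPs-EM is a stochastic EM fitted to PriMaPs decompositions, not this toy.

```python
# Toy sketch of EM-style prototype fitting over mask-level features (not the
# authors' implementation; PriMaPs-EM uses a stochastic update).
import numpy as np

def fit_prototypes(mask_feats, n_classes, n_iters=20, seed=0):
    """mask_feats: (N, D) L2-normalized mean feature per mask proposal."""
    rng = np.random.default_rng(seed)
    protos = mask_feats[rng.choice(len(mask_feats), n_classes, replace=False)]
    for _ in range(n_iters):
        sim = mask_feats @ protos.T          # cosine similarity (features normalized)
        assign = sim.argmax(axis=1)          # E-step: hard-assign masks to prototypes
        for k in range(n_classes):           # M-step: re-estimate prototypes
            members = mask_feats[assign == k]
            if len(members):
                p = members.mean(axis=0)
                protos[k] = p / np.linalg.norm(p)
    return protos, assign

feats = np.random.randn(256, 64)
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
protos, labels = fit_prototypes(feats, n_classes=27)   # 27 is a hypothetical class count
```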
https://arxiv.org/abs/2404.16818
Multi-scale learning is central to semantic segmentation. We visualize the effective receptive field (ERF) of canonical multi-scale representations and point out two risks in learning them: scale inadequacy and field inactivation. A novel multi-scale learner, varying window attention (VWA), is presented to address these issues. VWA builds on local window attention (LWA) and disentangles LWA into the query window and context window, allowing the context's scale to vary so that the query learns representations at multiple scales. However, enlarging the context to large-scale windows (enlarging ratio R) significantly increases the memory footprint and computation cost (R^2 times larger than LWA). We propose a simple yet effective re-scaling strategy that zeros out this extra cost without compromising performance. Consequently, VWA overcomes the receptive limitation of the local window at the same cost as LWA. Furthermore, building on VWA and employing various MLPs, we introduce a multi-scale decoder (MSD), VWFormer, to improve multi-scale representations for semantic segmentation. VWFormer achieves efficiency competitive with the most compute-friendly MSDs, such as the FPN and MLP decoders, while performing much better than any of them. For instance, using nearly half of UPerNet's computation, VWFormer outperforms it by 1.0%-2.5% mIoU on ADE20K. With little extra overhead, ~10G FLOPs, Mask2Former armed with VWFormer improves by 1.0%-1.3%.
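The following PyTorch sketch illustrates the core idea as we read it: queries come from a local window, while keys and values come from an R-times-larger context window that is pooled back down to the window size, so the attention cost matches LWA. Multi-head projections, relative position bias, and the paper's exact re-scaling strategy are omitted; all names and shapes are assumptions, not the authors' code.

```python
# Hedged single-head sketch of varying-window attention (H, W divisible by win).
import torch
import torch.nn.functional as F

def vwa_block(x, win=8, R=2):
    """x: (B, C, H, W); hypothetical VWA-style layer."""
    B, C, H, W = x.shape
    q = F.unfold(x, win, stride=win)                        # local query windows
    ctx = F.unfold(x, win * R, stride=win,                  # R-times larger context
                   padding=(win * R - win) // 2)
    n = q.shape[-1]                                         # number of windows
    q = q.reshape(B, C, win * win, n)
    ctx = ctx.reshape(B, C, win * R, win * R, n)
    ctx = ctx.permute(0, 4, 1, 2, 3).reshape(-1, C, win * R, win * R)
    ctx = F.adaptive_avg_pool2d(ctx, win)                   # re-scale context to win
    ctx = ctx.reshape(B, n, C, win * win)
    q = q.permute(0, 3, 2, 1)                               # (B, n, win*win, C)
    k = v = ctx.permute(0, 1, 3, 2)                         # (B, n, win*win, C)
    attn = torch.softmax(q @ k.transpose(-1, -2) / C ** 0.5, dim=-1)
    return attn @ v                                         # same cost as LWA

out = vwa_block(torch.randn(2, 32, 64, 64))                 # (2, 64, 64, 32)
```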
https://arxiv.org/abs/2404.16573
In this paper, we address the challenging source-free unsupervised domain adaptation (SFUDA) setting for pinhole-to-panoramic semantic segmentation, given only a pinhole image pre-trained model (i.e., source) and unlabeled panoramic images (i.e., target). Tackling this problem is non-trivial due to three critical challenges: 1) semantic mismatches from the distinct Field-of-View (FoV) between domains, 2) style discrepancies inherent in the UDA problem, and 3) inevitable distortion of the panoramic images. To tackle these problems, we propose 360SFUDA++, which effectively extracts knowledge from the source pinhole model with only unlabeled panoramic images and transfers the reliable knowledge to the target panoramic domain. Specifically, we first utilize Tangent Projection (TP), as it suffers less distortion, and meanwhile split the equirectangular projection (ERP) into patches with fixed FoV projection (FFP) to mimic pinhole images. Both projections are shown to be effective in extracting knowledge from the source model. However, as the distinct projections make it difficult to directly transfer knowledge between domains, we then propose the Reliable Panoramic Prototype Adaptation Module (RP2AM) to transfer knowledge at both the prediction and prototype levels. RP2AM selects confident knowledge and integrates panoramic prototypes for reliable knowledge adaptation. Moreover, we introduce the Cross-projection Dual Attention Module (CDAM), which better aligns the spatial and channel characteristics across projections at the feature level between domains. Both knowledge extraction and transfer processes are synchronously updated to reach the best performance. Extensive experiments on synthetic and real-world benchmarks, including outdoor and indoor scenarios, demonstrate that our 360SFUDA++ achieves significantly better performance than prior SFUDA methods.
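As a rough stand-in for the prototype-level adaptation, the sketch below keeps only pixels with confident predictions and updates per-class prototypes with an exponential moving average. The threshold, momentum, and function names are assumptions; RP2AM's actual selection and integration rules are defined in the paper.

```python
# Hedged sketch of confidence-filtered prototype adaptation (names hypothetical).
import torch

@torch.no_grad()
def update_prototypes(protos, feats, logits, thresh=0.9, momentum=0.99):
    """feats: (N, D) pixel features; logits: (N, K) predictions on target views."""
    probs = logits.softmax(dim=1)
    conf, pseudo = probs.max(dim=1)
    keep = conf > thresh                     # keep only reliable pixels
    for k in range(protos.shape[0]):
        sel = feats[keep & (pseudo == k)]
        if len(sel):
            protos[k] = momentum * protos[k] + (1 - momentum) * sel.mean(0)
    return protos

protos = torch.zeros(19, 64)
protos = update_prototypes(protos, torch.randn(4096, 64), torch.randn(4096, 19))
```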
https://arxiv.org/abs/2404.16501
Despite the remarkable success of deep learning in medical imaging analysis, medical image segmentation remains challenging due to the scarcity of high-quality labeled images for supervision. Further, the significant domain gap between natural and medical images in general and ultrasound images in particular hinders fine-tuning models trained on natural images to the task at hand. In this work, we address the performance degradation of segmentation models in low-data regimes and propose a prompt-less segmentation method harnessing the ability of segmentation foundation models to segment abstract shapes. We do that via our novel prompt point generation algorithm which uses coarse semantic segmentation masks as input and a zero-shot prompt-able foundation model as an optimization target. We demonstrate our method on a segmentation findings task (pathologic anomalies) in ultrasound images. Our method's advantages are brought to light in varying degrees of low-data regime experiments on a small-scale musculoskeletal ultrasound images dataset, yielding a larger performance gain as the training set size decreases.
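One plausible way to derive prompt points from a coarse mask is sketched below: pick interior points far from the mask boundary via a distance transform and hand them to a promptable model (e.g., SAM's point prompts). This is a simplified heuristic; the paper's algorithm treats the foundation model as an optimization target rather than using fixed points.

```python
# Hedged sketch: prompt points from a coarse mask (the suppression radius and
# point count are assumptions).
import numpy as np
from scipy import ndimage

def prompt_points_from_mask(coarse_mask, n_points=3):
    """coarse_mask: (H, W) binary array from a weak/coarse segmentation."""
    dist = ndimage.distance_transform_edt(coarse_mask)   # depth inside the mask
    pts = []
    for _ in range(n_points):
        y, x = np.unravel_index(dist.argmax(), dist.shape)
        pts.append((x, y))                               # (x, y) for point prompts
        yy, xx = np.ogrid[:dist.shape[0], :dist.shape[1]]
        dist[(yy - y) ** 2 + (xx - x) ** 2 < 100] = 0    # suppress a neighborhood
    return np.array(pts)

mask = np.zeros((128, 128), bool)
mask[40:90, 30:100] = True
print(prompt_points_from_mask(mask))  # e.g. feed to a SAM-style point prompt API
```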
https://arxiv.org/abs/2404.16325
Unsupervised Domain Adaptation (UDA) refers to methods that utilize annotated source domain data and unlabeled target domain data to train a model capable of generalizing to the target domain. Domain discrepancy leads to a significant performance drop when general network models trained on source domain data are applied to the target domain. We introduce a straightforward approach to mitigate this discrepancy, which requires no additional parameter computation and integrates seamlessly with self-training-based UDA methods. By transferring the target domain style to the source domain in the latent feature space, the model is trained to prioritize the target domain style during decision making. We tackle the problem at both the image level and the shallow feature map level by transferring style information from the target domain to the source domain data. As a result, we obtain a model that exhibits superior performance on the target domain. Our method yields remarkable improvements in state-of-the-art performance for synthetic-to-real UDA tasks. For example, our proposed method attains a noteworthy UDA performance of 76.93 mIoU on GTA->Cityscapes, an improvement of +1.03 percentage points over the previous state-of-the-art results.
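Feature-space style transfer of this kind is commonly realized by channel-statistics matching, as in AdaIN: re-normalize source features with the target's per-channel mean and standard deviation. The sketch below shows that generic recipe; the paper's exact transfer mechanism may differ.

```python
# AdaIN-style moment matching as a generic latent style-transfer sketch.
import torch

def transfer_style(src_feat, tgt_feat, eps=1e-5):
    """Give source features the target's channel statistics.
    src_feat, tgt_feat: (B, C, H, W)."""
    s_mu = src_feat.mean(dim=(2, 3), keepdim=True)
    s_sd = src_feat.std(dim=(2, 3), keepdim=True) + eps
    t_mu = tgt_feat.mean(dim=(2, 3), keepdim=True)
    t_sd = tgt_feat.std(dim=(2, 3), keepdim=True) + eps
    return (src_feat - s_mu) / s_sd * t_sd + t_mu

styled = transfer_style(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
```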
https://arxiv.org/abs/2404.16301
As one of the emerging challenges in Automated Machine Learning, Hardware-aware Neural Architecture Search (HW-NAS) tasks can be treated as black-box multi-objective optimization problems (MOPs). An important application of HW-NAS is real-time semantic segmentation, which plays a pivotal role in autonomous driving scenarios. HW-NAS for real-time semantic segmentation inherently needs to balance multiple optimization objectives, including model accuracy, inference speed, and hardware-specific considerations. Despite its importance, no benchmark has yet been developed to frame such a challenging task as multi-objective optimization. To bridge the gap, we introduce a tailored pipeline to transform the task of HW-NAS for real-time semantic segmentation into standard MOPs. Building on this pipeline, we present a benchmark test suite, CitySeg/MOP, comprising fifteen MOPs derived from the Cityscapes dataset. The CitySeg/MOP test suite is integrated into the EvoXBench platform to provide seamless interfaces with various programming languages (e.g., Python and MATLAB) for instant fitness evaluations. We comprehensively assess the CitySeg/MOP test suite with various multi-objective evolutionary algorithms, showcasing its versatility and practicality. Source codes are available at this https URL.
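The MOP framing itself is simple to state in code: an architecture is a decision vector, and fitness is a tuple of objectives to minimize. The framework-free sketch below uses invented proxy formulas purely to show the interface shape; it is not the CitySeg/MOP or EvoXBench API.

```python
# Hedged, framework-free sketch of HW-NAS as a black-box MOP. All encodings and
# objective proxies are invented for illustration.
import random

def evaluate(arch):
    """arch: list of ints choosing depth/width per stage (toy search space)."""
    depth, width = sum(arch), max(arch) + 1
    error = 1.0 / (1 + 0.3 * depth * width) + random.gauss(0, 0.01)  # accuracy proxy
    latency_ms = 2.0 * depth * width                                 # hardware proxy
    params_m = 0.5 * depth * width ** 2
    return error, latency_ms, params_m       # minimize all three objectives

pop = [[random.randrange(4) for _ in range(5)] for _ in range(8)]
fits = [evaluate(a) for a in pop]            # hand to any MOEA, e.g. NSGA-II
```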
https://arxiv.org/abs/2404.16266
Patellofemoral joint (PFJ) issues affect one in four people, with 20% experiencing chronic knee pain despite treatment. Poor outcomes and pain after knee replacement surgery are often linked to patellar mal-tracking. Traditional imaging methods like CT and MRI face challenges, including cost and metal artefacts, and there is currently no ideal way to observe joint motion free of issues such as soft tissue artefacts or radiation exposure. A new system to monitor joint motion could significantly improve understanding of PFJ dynamics, aiding better patient care and outcomes. Combining 2D ultrasound with motion tracking, semantic segmentation, and position registration for 3D reconstruction of the joint can be a solution. However, the need for expensive external infrastructure to estimate the scanner's trajectory remains the main limitation to implementing 3D bone reconstruction from handheld ultrasound scanning clinically. We propose Visual-Inertial Odometry (VIO) and deep learning-based inertial-only odometry as alternatives to motion capture for tracking a handheld ultrasound scanner. The 3D reconstructions generated by these methods demonstrate potential for assessing the PFJ and for further measurements from free-hand ultrasound scans. The results show that the VIO method performs as well as the motion capture method, with average reconstruction errors of 1.25 mm and 1.21 mm, respectively. The VIO method is the first infrastructure-free method for 3D reconstruction of bone from wireless handheld ultrasound scanning, with an accuracy comparable to methods that require external infrastructure.
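The geometric core of freehand 3D reconstruction, regardless of whether poses come from VIO, inertial odometry, or motion capture, is mapping each 2D ultrasound pixel into world coordinates through the tracked probe pose. A naive compounding sketch, with an assumed pixel size and world-from-image pose convention:

```python
# Hedged sketch of freehand-3D compounding from tracked poses.
import numpy as np

def compound_volume(frames, poses, pixel_size=0.0005):
    """frames: list of (H, W) ultrasound images; poses: list of (4, 4)
    world-from-image transforms (e.g. from VIO). Returns a labeled point cloud."""
    pts, vals = [], []
    for img, T in zip(frames, poses):
        H, W = img.shape
        v, u = np.mgrid[:H, :W]
        local = np.stack([u * pixel_size, v * pixel_size,
                          np.zeros_like(u, float), np.ones_like(u, float)], -1)
        world = local.reshape(-1, 4) @ T.T       # pixel -> world coordinates
        pts.append(world[:, :3]); vals.append(img.ravel())
    return np.concatenate(pts), np.concatenate(vals)

pts, vals = compound_volume([np.random.rand(64, 64)] * 3, [np.eye(4)] * 3)
```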
https://arxiv.org/abs/2404.15847
Unsupervised domain adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain. The most recent UDA methods typically resort to adversarial training to yield state-of-the-art results, and a dominant share of existing UDA methods employ convolutional neural networks (CNNs) as feature extractors to learn domain-invariant features. The vision transformer (ViT) has attracted tremendous attention since its emergence and has been widely used in various computer vision tasks, such as image classification, object detection, and semantic segmentation, yet its potential in adversarial domain adaptation has never been investigated. In this paper, we fill this gap by employing the ViT as the feature extractor in adversarial domain adaptation. Moreover, we empirically demonstrate that the ViT can be a plug-and-play component: directly replacing the CNN-based feature extractor in existing UDA methods with a ViT-based one readily yields performance improvements. The code is available at this https URL.
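The plug-and-play swap can be illustrated with a DANN-style setup: a ViT backbone (here from the timm library, assumed installed) feeds both a task classifier and a domain discriminator behind a gradient reversal layer. The head sizes are hypothetical; only the backbone choice changes relative to a CNN-based pipeline.

```python
# Sketch of swapping a ViT backbone into a DANN-style adversarial UDA setup.
import timm
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)
    @staticmethod
    def backward(ctx, g):
        return -ctx.lamb * g, None            # reverse gradients into the backbone

backbone = timm.create_model("vit_small_patch16_224", pretrained=False, num_classes=0)
feat_dim = backbone.num_features              # 384 for vit_small
classifier = nn.Linear(feat_dim, 31)          # task head (e.g. Office-31, assumed)
domain_disc = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))

x = torch.randn(4, 3, 224, 224)
f = backbone(x)                               # (4, 384) pooled features
cls_logits = classifier(f)
dom_logits = domain_disc(GradReverse.apply(f, 1.0))  # reversed grads confuse domains
```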
https://arxiv.org/abs/2404.15817
We introduce a self-supervised pretraining method, called OccFeat, for camera-only Bird's-Eye-View (BEV) segmentation networks. With OccFeat, we pretrain a BEV network via occupancy prediction and feature distillation tasks. Occupancy prediction provides the model with a 3D geometric understanding of the scene. However, the geometry learned is class-agnostic. Hence, we add semantic information to the model in the 3D space through distillation from a self-supervised pretrained image foundation model. Models pretrained with our method exhibit improved BEV semantic segmentation performance, particularly in low-data scenarios. Moreover, empirical results affirm the efficacy of integrating feature distillation with 3D occupancy prediction in our pretraining approach.
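A minimal sketch of the two pretraining objectives, assuming voxelized outputs: a binary cross-entropy occupancy loss plus a cosine feature-distillation loss against a frozen image foundation model's features lifted to 3D. Shapes, weighting, and names are assumptions.

```python
import torch
import torch.nn.functional as F

def occfeat_losses(occ_logits, occ_gt, pred_feat, teacher_feat):
    """occ_logits/occ_gt: (B, X, Y, Z) voxel occupancy; pred_feat/teacher_feat:
    (B, D, X, Y, Z) 3D features, teacher from a frozen image foundation model."""
    l_occ = F.binary_cross_entropy_with_logits(occ_logits, occ_gt)
    l_dist = 1 - F.cosine_similarity(pred_feat, teacher_feat, dim=1).mean()
    return l_occ + l_dist                     # equal weighting assumed

loss = occfeat_losses(torch.randn(2, 16, 16, 4),
                      torch.randint(0, 2, (2, 16, 16, 4)).float(),
                      torch.randn(2, 32, 16, 16, 4), torch.randn(2, 32, 16, 16, 4))
```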
https://arxiv.org/abs/2404.14027
Domain generalized semantic segmentation is an essential computer vision task in which models leverage only source data to learn generalized semantic segmentation for unseen target domains. Previous works typically address this challenge via global style randomization or feature regularization. In this paper, we argue that, given the observation that different local semantic regions exhibit different visual characteristics from the source domain to the target domain, methods focusing on global operations struggle to capture such regional discrepancies and thus fail to construct domain-invariant representations that are consistent from the local to the global level. Therefore, we propose Semantic-Rearrangement-based Multi-Level Alignment (SRMA) to overcome this problem. SRMA first incorporates a Semantic Rearrangement Module (SRM), which conducts semantic region randomization to sufficiently enhance the diversity of the source domain. A Multi-Level Alignment module (MLA) is subsequently proposed, with the help of such diversity, to establish global-regional-local consistent domain-invariant representations. By aligning features across randomized samples with domain-neutral knowledge at multiple levels, SRMA provides a more robust way to handle the source-target domain gap. Extensive experiments demonstrate the superiority of SRMA over current state-of-the-art works on various benchmarks.
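As a toy stand-in for semantic region randomization, the sketch below perturbs each labeled region's statistics independently, so that augmentation diversity is injected per semantic region rather than globally. The jitter model is an assumption; the actual SRM is defined in the paper.

```python
import torch

def semantic_region_randomize(img, label, jitter=0.2):
    """Perturb each semantic region's statistics independently.
    img: (C, H, W); label: (H, W) class ids."""
    out = img.clone()
    for k in label.unique():
        m = label == k
        gain = 1 + jitter * torch.randn(img.shape[0], 1)   # per-channel gain
        bias = jitter * torch.randn(img.shape[0], 1)
        out[:, m] = img[:, m] * gain + bias                # region-wise style jitter
    return out

aug = semantic_region_randomize(torch.rand(3, 64, 64), torch.randint(0, 5, (64, 64)))
```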
https://arxiv.org/abs/2404.13701
Photovoltaic (PV) systems allow us to tap into abundant solar energy; however, they require regular maintenance for high efficiency and to prevent degradation. Traditional manual health checks, using electroluminescence (EL) imaging, are expensive and logistically challenging, making automated defect detection essential. Current automation approaches require extensive manual expert labeling, which is time-consuming, expensive, and prone to errors. We propose PV-S3 (Photovoltaic Semi-Supervised Segmentation), a semi-supervised learning approach for semantic segmentation of defects in EL images that reduces reliance on extensive labeling. PV-S3 is a deep learning model trained using a few labeled images along with numerous unlabeled images. We introduce a novel Semi Cross-Entropy loss function to train PV-S3, addressing challenges specific to automated PV defect detection, such as diverse defect types and class imbalance. We evaluate PV-S3 on multiple datasets and demonstrate its effectiveness and adaptability. With merely 20% labeled samples, we achieve absolute improvements of 9.7% in IoU, 29.9% in precision, 12.75% in recall, and 20.42% in F1-score over the prior state-of-the-art supervised method (which uses 100% labeled samples) on the UCF-EL dataset (the largest dataset available for semantic segmentation of EL images), improving performance while reducing annotation costs by 80%.
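The generic shape of such a loss, supervised cross-entropy on labeled pixels plus cross-entropy against confident pseudo-labels on unlabeled pixels, is sketched below. The paper's Semi Cross-Entropy adds defect-specific weighting for class imbalance; the threshold and equal weighting here are assumptions.

```python
import torch
import torch.nn.functional as F

def semi_supervised_ce(logits_l, target_l, logits_u, thresh=0.8):
    """logits_l: (B, K, H, W) on labeled images with targets target_l: (B, H, W);
    logits_u: (B, K, H, W) on unlabeled images."""
    loss_l = F.cross_entropy(logits_l, target_l)
    probs = logits_u.softmax(dim=1)
    conf, pseudo = probs.max(dim=1)
    mask = (conf > thresh).float()            # trust only confident pseudo-labels
    loss_u = (F.cross_entropy(logits_u, pseudo, reduction="none") * mask).mean()
    return loss_l + loss_u

loss = semi_supervised_ce(torch.randn(2, 4, 32, 32), torch.randint(0, 4, (2, 32, 32)),
                          torch.randn(2, 4, 32, 32))
```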
https://arxiv.org/abs/2404.13693
Corrosion, a naturally occurring process leading to the deterioration of metallic materials, demands diligent detection for quality control and the preservation of metal-based objects, especially within industrial contexts. Traditional techniques for corrosion identification, including ultrasonic testing, radiographic testing, and magnetic flux leakage, necessitate deploying expensive and bulky equipment on-site for effective data acquisition. An unexplored alternative involves employing lightweight, conventional camera systems and state-of-the-art computer vision methods for its identification. In this work, we propose a complete system for semi-automated corrosion identification and mapping in industrial environments. We leverage recent advances in LiDAR-based methods for localization and mapping, together with vision-based semantic segmentation deep learning techniques, to build semantic-geometric maps of industrial environments. Unlike previous corrosion identification systems in the literature, our multi-modal system is low-cost, portable, semi-autonomous, and allows untrained personnel to collect large datasets. A set of experiments in an indoor laboratory environment quantitatively demonstrates the high accuracy of the employed LiDAR-based 3D mapping and localization system, with average absolute and relative pose errors below 0.05 m and 0.02 m, respectively. Also, our data-driven semantic segmentation model achieves around 70% precision when trained with our pixel-wise manually annotated dataset.
https://arxiv.org/abs/2404.13691
Despite the rapid evolution of semantic segmentation for land cover classification in high-resolution remote sensing imagery, integrating multiple data modalities such as the Digital Surface Model (DSM), RGB, and Near-infrared (NIR) remains a challenge. Current methods often process only two types of data, missing out on the rich information that additional modalities can provide. Addressing this gap, we propose a novel Lightweight Multimodal data Fusion Network (LMFNet) to accomplish the tasks of fusion and semantic segmentation of multimodal remote sensing images. LMFNet uniquely accommodates various data types simultaneously, including RGB, NirRG, and DSM, through a weight-sharing, multi-branch vision transformer that minimizes parameter count while ensuring robust feature extraction. Our proposed multimodal fusion module integrates a Multimodal Feature Fusion Reconstruction Layer and a Multimodal Feature Self-Attention Fusion Layer, which can reconstruct and fuse multimodal features. Extensive testing on public datasets such as US3D, ISPRS Potsdam, and ISPRS Vaihingen demonstrates the effectiveness of LMFNet. Specifically, it achieves a mean Intersection over Union (mIoU) of 85.09% on the US3D dataset, marking a significant improvement over existing methods. Compared to unimodal approaches, LMFNet shows a 10% enhancement in mIoU with only a 0.5M increase in parameter count. Furthermore, against bimodal methods, our approach with trilateral inputs enhances mIoU by 0.46 percentage points.
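A toy version of self-attention fusion across modalities is sketched below: per spatial location, the per-modality feature vectors are treated as tokens and attended over before being collapsed. This only conveys the idea; LMFNet's actual reconstruction and fusion layers are specified in the paper, and all dimensions here are assumptions.

```python
import torch
import torch.nn as nn

class ModalityAttentionFusion(nn.Module):
    """Toy multimodal self-attention fusion over modality tokens (RGB, NirRG, DSM)."""
    def __init__(self, d=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, feats):                  # feats: (B, M, D, H, W)
        B, M, D, H, W = feats.shape
        tok = feats.permute(0, 3, 4, 1, 2).reshape(B * H * W, M, D)
        fused, _ = self.attn(tok, tok, tok)    # attention across modalities
        fused = fused.mean(dim=1)              # collapse the modality axis
        return fused.reshape(B, H, W, D).permute(0, 3, 1, 2)

out = ModalityAttentionFusion()(torch.randn(2, 3, 64, 16, 16))  # (2, 64, 16, 16)
```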
https://arxiv.org/abs/2404.13659
The advancement of deep learning has driven notable progress in remote sensing semantic segmentation. Attention mechanisms, while enabling global modeling and the use of contextual information, incur high computational costs and require window-based operations that weaken the capture of long-range dependencies, hindering their effectiveness for remote sensing image processing. In this letter, we propose AMMUNet, a UNet-based framework that employs multi-scale attention map merging, comprising two key innovations: the granular multi-head self-attention (GMSA) module and the attention map merging mechanism (AMMM). GMSA efficiently acquires global information while substantially mitigating computational costs in contrast to the global multi-head self-attention mechanism. This is accomplished through the strategic utilization of dimension correspondence to align granularity and the reduction of relative position bias parameters, thereby optimizing computational efficiency. The proposed AMMM effectively combines multi-scale attention maps into a unified representation using a fixed mask template, enabling the modeling of a global attention mechanism. Experimental evaluations highlight the superior performance of our approach, achieving remarkable mean intersection over union (mIoU) scores of 75.48% on the challenging Vaihingen dataset and an exceptional 77.90% on the Potsdam dataset, demonstrating the superiority of our method in precise remote sensing semantic segmentation. Codes are available at this https URL.
https://arxiv.org/abs/2404.13408
Semantic segmentation plays a crucial role in enabling comprehensive scene understanding for robotic systems. However, generating annotations is challenging, requiring labels for every pixel in an image. In scenarios like autonomous driving, there is a need to progressively incorporate new classes as the operating environment of the deployed agent becomes more complex. For enhanced annotation efficiency, ideally, only pixels belonging to new classes would be annotated. This approach is known as Continual Semantic Segmentation (CSS). Besides the classical problem of catastrophic forgetting in the continual learning setting, CSS suffers from the inherent ambiguity of the background, a phenomenon we refer to as the "background shift", since pixels labeled as background could correspond to future classes (forward background shift) or previous classes (backward background shift). As a result, continual learning approaches tend to fail. This paper proposes a Backward Background Shift Detector (BACS) to detect previously observed classes based on their distance in the latent space from the foreground centroids of previous steps. Moreover, we propose a modified version of the cross-entropy loss function, incorporating the BACS detector to down-weight background pixels associated with formerly observed classes. To combat catastrophic forgetting, we employ masked feature distillation alongside dark experience replay. Additionally, our approach includes a transformer decoder capable of adjusting to new classes without necessitating an additional classification head. We validate BACS's superior performance over existing state-of-the-art methods on standard CSS benchmarks.
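The detector's core computation can be sketched as follows: for pixels currently labeled background, measure the latent distance to the stored foreground centroids of previous steps and convert it into a per-pixel weight that shrinks toward zero near an old class. The sigmoid mapping and the temperature tau are assumptions; the paper's weighting enters its modified cross-entropy.

```python
import torch

def bacs_weights(feats, centroids, tau=1.0):
    """feats: (N, D) features of pixels labeled 'background'; centroids: (K, D)
    foreground centroids from previous steps. Pixels near an old-class centroid
    get a small weight, so the background CE term does not punish old classes."""
    d = torch.cdist(feats, centroids).min(dim=1).values   # nearest old centroid
    return torch.sigmoid(d - tau)                          # ~0 if very close

w = bacs_weights(torch.randn(1024, 64), torch.randn(10, 64))
# use w as per-pixel weights on the background term of the cross-entropy
```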
https://arxiv.org/abs/2404.13148
Current point cloud semantic segmentation has achieved great advances given sufficient labels. However, dense annotation of LiDAR point clouds remains prohibitively expensive and time-consuming, unable to keep up with the continuously growing volume of data. In this paper, we propose annotating images with scattered points, followed by utilizing SAM (a foundation model) to generate semantic segmentation labels for the images. Finally, by mapping the segmentation labels of the images to the LiDAR space using the intrinsic and extrinsic parameters of the camera and LiDAR, we obtain labels for point cloud semantic segmentation, and we release Scatter-KITTI and Scatter-nuScenes, the first works to utilize image-segmentation-based SAM for weakly supervised point cloud semantic segmentation. Furthermore, to mitigate the influence of erroneous pseudo-labels obtained from sparse annotations on point cloud features, we propose a multi-modal weakly supervised network for LiDAR semantic segmentation, called MM-ScatterNet. This network combines features from both point cloud and image modalities, enhancing the representation learning of point clouds by introducing consistency constraints between multi-modal features and point cloud features. On the SemanticKITTI dataset, we achieve 66% of fully supervised performance using only 0.02% of annotated data, and on the nuScenes dataset, we achieve 95% of fully supervised performance using only 0.1% labeled points.
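The image-to-LiDAR label transfer is a standard pinhole projection: transform LiDAR points into the camera frame with the extrinsics, project with the intrinsics, and read off the class id at each projected pixel. The matrix values in the example below are hypothetical KITTI-like numbers.

```python
import numpy as np

def labels_to_points(points, seg, K, T_cam_lidar):
    """Transfer per-pixel class ids to LiDAR points. points: (N, 3) xyz in the
    LiDAR frame; seg: (H, W) class ids; K: (3, 3) camera intrinsics;
    T_cam_lidar: (4, 4) extrinsics mapping LiDAR to camera coordinates."""
    pts_h = np.c_[points, np.ones(len(points))]
    cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    labels = np.full(len(points), -1)            # -1 = no label transferred
    front = cam[:, 2] > 1e-6                     # keep points in front of camera
    uv = (K @ cam[front].T).T
    u = (uv[:, 0] / uv[:, 2]).round().astype(int)
    v = (uv[:, 1] / uv[:, 2]).round().astype(int)
    H, W = seg.shape
    ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    labels[np.flatnonzero(front)[ok]] = seg[v[ok], u[ok]]
    return labels

K = np.array([[700.0, 0, 621], [0, 700, 187], [0, 0, 1]])   # assumed intrinsics
pts = np.random.randn(1000, 3) * 5 + [0, 0, 10]
lab = labels_to_points(pts, np.random.randint(0, 20, (375, 1242)), K, np.eye(4))
```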
https://arxiv.org/abs/2404.12861
Deep learning is dramatically transforming the field of medical imaging and radiology, enabling the identification of pathologies in medical images, including computed tomography (CT) and X-ray scans. However, the performance of deep learning models, particularly in segmentation tasks, is often limited by the need for extensive annotated datasets. To address this challenge, the capabilities of weakly supervised semantic segmentation are explored through the lens of Explainable AI and the generation of counterfactual explanations. The scope of this research is the development of a novel counterfactual inpainting approach (COIN) that flips the predicted classification label from abnormal to normal using a generative model. For instance, if the classifier deems an input medical image X abnormal, indicating the presence of a pathology, the generative model aims to inpaint the abnormal region, thus reversing the classifier's original prediction. The approach enables us to produce precise segmentations of pathologies without depending on pre-existing segmentation masks. Crucially, it relies on image-level labels, which are substantially easier to acquire than detailed segmentation masks. The effectiveness of the method is demonstrated by segmenting synthetic targets and actual kidney tumors from CT images acquired from Tartu University Hospital in Estonia. The findings indicate that COIN greatly surpasses established attribution methods, such as RISE, ScoreCAM, and LayerCAM, as well as an alternative counterfactual explanation method introduced by Singla et al. This evidence suggests that COIN is a promising approach for semantic segmentation of tumors in CT images and a step toward making deep learning applications more accessible and effective in healthcare, where annotated data is scarce.
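At inference time, the counterfactual-inpainting idea reduces to: if the classifier calls the image abnormal, generate its "normal" counterfactual and take the edited region as the segmentation. The sketch below uses stand-in networks and an assumed difference threshold; it shows the loop, not the paper's trained generator.

```python
import torch

@torch.no_grad()
def counterfactual_mask(x, classifier, inpainter, thresh=0.1):
    """If the classifier calls x abnormal, let the generator produce a 'normal'
    counterfactual; the changed region becomes the pathology mask."""
    if classifier(x).sigmoid().item() < 0.5:          # already normal
        return torch.zeros_like(x[:, :1])
    x_cf = inpainter(x)                               # counterfactual image
    diff = (x - x_cf).abs().mean(dim=1, keepdim=True) # where the generator edited
    return (diff > thresh).float()                    # pseudo segmentation mask

classifier = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 1))
inpainter = torch.nn.Identity()                       # stand-in generator
m = counterfactual_mask(torch.rand(1, 3, 64, 64), classifier, inpainter)
```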
https://arxiv.org/abs/2404.12832
Annotating large numbers of 3D medical images to train segmentation models is time-consuming. The goal of weakly supervised semantic segmentation is to train segmentation models without using any ground truth segmentation masks. Our work addresses the case where only image-level categorical labels, indicating the presence or absence of a particular region of interest (such as tumours or lesions), are available. Most existing methods rely on class activation mapping (CAM). We propose a novel approach, ToNNO, based on the Tomographic reconstruction of a Neural Network's Output. Our technique extracts stacks of slices at different angles from the input 3D volume, feeds these slices to a 2D encoder, and applies the inverse Radon transform to reconstruct a 3D heatmap of the encoder's predictions. This generic method allows dense prediction tasks to be performed on 3D volumes using any 2D image encoder. We apply it to weakly supervised medical image segmentation by training the 2D encoder to output high values for slices containing the regions of interest. We test it on four large-scale medical image datasets, where it outperforms 2D CAM methods. We then extend ToNNO by combining tomographic reconstruction with CAM methods, proposing Averaged CAM and Tomographic CAM, which obtain even better results.
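A 2D toy analogue makes the mechanism tangible: for each angle, score every parallel "slice" (here, a row of the rotated image) with the encoder, stack the scores into a sinogram, and invert with the inverse Radon transform. The encoder stand-in below is np.sum; in ToNNO it is a trained 2D network, and the real method operates on 3D volumes.

```python
import numpy as np
from scipy.ndimage import rotate
from skimage.transform import iradon

def tonno_2d(volume_2d, encoder, angles):
    """Collect one encoder score per slice per angle, then reconstruct."""
    sino = np.stack([
        [encoder(row) for row in rotate(volume_2d, a, reshape=False, order=1)]
        for a in angles
    ], axis=1)                                      # (n_slices, n_angles) sinogram
    return iradon(sino, theta=angles, filter_name="ramp")

img = np.zeros((64, 64)); img[20:40, 25:45] = 1.0   # toy 'region of interest'
angles = np.linspace(0.0, 180.0, 60, endpoint=False)
heatmap = tonno_2d(img, encoder=np.sum, angles=angles)  # peaks near the square
```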
https://arxiv.org/abs/2404.13103
Multi-task networks can potentially improve performance and computational efficiency compared to single-task networks, facilitating online deployment. However, current multi-task architectures in point cloud perception combine multiple task-specific point cloud representations, each requiring a separate feature encoder and making the network structures bulky and slow. We propose PAttFormer, an efficient multi-task architecture for joint semantic segmentation and object detection in point clouds that relies only on a point-based representation. The network builds on transformer-based feature encoders using neighborhood attention and grid pooling, and a query-based detection decoder with a novel 3D deformable-attention detection head design. Unlike other LiDAR-based multi-task architectures, our proposed PAttFormer does not require separate feature encoders for multiple task-specific point cloud representations, resulting in a network that is 3x smaller and 1.4x faster while achieving competitive performance on the nuScenes and KITTI benchmarks for autonomous driving perception. Our extensive evaluations show substantial gains from multi-task learning, improving LiDAR semantic segmentation by +1.7% in mIoU and 3D object detection by +1.7% in mAP on the nuScenes benchmark compared to the single-task models.
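The layout, one shared point-based encoder feeding a per-point segmentation head and a query-based detection decoder, can be sketched in a few lines. This is a deliberately tiny stand-in: PAttFormer's encoder uses neighborhood attention and grid pooling and its decoder uses 3D deformable attention, neither of which is reproduced here.

```python
import torch
import torch.nn as nn

class MultiTaskPointNet(nn.Module):
    """Toy shared-encoder multi-task layout (dimensions are assumptions)."""
    def __init__(self, d=64, n_cls=16, n_query=100):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(3, d), nn.ReLU(), nn.Linear(d, d))
        self.seg_head = nn.Linear(d, n_cls)              # per-point semantics
        self.det_query = nn.Parameter(torch.randn(n_query, d))
        self.det_attn = nn.MultiheadAttention(d, 4, batch_first=True)
        self.box_head = nn.Linear(d, 7 + n_cls)          # box params + class

    def forward(self, pts):                              # pts: (B, N, 3)
        f = self.encoder(pts)                            # shared point features
        q = self.det_query.expand(pts.shape[0], -1, -1)
        det, _ = self.det_attn(q, f, f)                  # query-based decoding
        return self.seg_head(f), self.box_head(det)

seg, det = MultiTaskPointNet()(torch.randn(2, 1024, 3))
```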
https://arxiv.org/abs/2404.12798
Land-cover mapping is one of the vital applications in Earth observation, aiming at classifying each pixel's land-cover type in remote-sensing images. As natural and human activities change the landscape, the land-cover map needs to be rapidly updated. However, discovering newly appeared land-cover types in existing classification systems is still a non-trivial task, hindered by the various scales of complex land objects and insufficient labeled data over a wide-span geographic area. In this paper, we propose a generalized few-shot segmentation-based framework, named SegLand, to update novel classes in high-resolution land-cover mapping. Specifically, the proposed framework is designed in three parts: (a) data pre-processing: the base training set and the few-shot support sets of novel classes are analyzed and augmented; (b) hybrid segmentation structure: multiple base learners and a modified Projection onto Orthogonal Prototypes (POP) network are combined to enhance base-class recognition and to mine novel classes from insufficient labeled data; (c) ultimate fusion: the semantic segmentation results of the base learners and the POP network are reasonably fused, as sketched below. The proposed framework won first place on the leaderboard of the OpenEarthMap Land Cover Mapping Few-Shot Challenge. Experiments demonstrate the superiority of the framework for automatically updating novel land-cover classes with limited labeled data.
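One simple realization of the fusion step: average the base learners' probability maps, blend in the POP network's output, and trust POP alone on the novel classes it was designed to recover. The blend weight and override rule are assumptions, not the paper's exact fusion.

```python
import torch

def fuse_predictions(base_probs, pop_probs, novel_ids, alpha=0.5):
    """base_probs: list of (K, H, W) softmax maps from base learners;
    pop_probs: (K, H, W) from the POP network; novel_ids: novel class indices."""
    base = torch.stack(base_probs).mean(dim=0)      # ensemble of base learners
    fused = (1 - alpha) * base + alpha * pop_probs
    fused[novel_ids] = pop_probs[novel_ids]         # novel classes come from POP
    return fused.argmax(dim=0)                      # (H, W) label map

pred = fuse_predictions([torch.rand(12, 64, 64) for _ in range(3)],
                        torch.rand(12, 64, 64), novel_ids=[8, 9, 10, 11])
```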
https://arxiv.org/abs/2404.12721