LiDAR is a crucial sensor in autonomous driving, commonly used alongside cameras. By exploiting this camera-LiDAR setup and recent advances in image representation learning, prior studies have shown the promising potential of image-to-LiDAR distillation. These prior works focus on designing their own losses to effectively distill pre-trained 2D image representations into a 3D model, yet the remaining design choices have been surprisingly unexplored. We find that fundamental design elements overlooked in prior work, e.g., the LiDAR coordinate system, the quantization imposed by the existing input interface, and data utilization, are more critical than developing loss functions. In this work, we show that simple fixes to these designs notably outperform existing methods, improving downstream performance by 16% in 3D semantic segmentation on the nuScenes dataset and by 13% in 3D object detection on the KITTI dataset. We focus on overlooked design choices along the spatial and temporal axes. Spatially, prior work has used cylindrical coordinates and voxel sizes without considering the side effects they produce with the commonly deployed sparse convolution input interface, leading to spatial quantization errors in 3D models. Temporally, existing work has avoided cumbersome data curation by discarding unsynced data, limiting training to the small portion of data that is temporally synced across sensors. We analyze these effects and propose simple solutions for each overlooked aspect.
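To make the quantization concern concrete, here is a small sketch (with assumed voxel sizes, not the paper's configuration) that measures how far each LiDAR point moves when snapped to an integer voxel grid, as a sparse-convolution input interface typically requires, in cylindrical versus Cartesian coordinates; with a fixed angular bin, the cylindrical error grows with range.

```python
import numpy as np

def cylindrical_voxel_error(points, voxel_size=(0.1, np.pi / 180.0, 0.1)):
    """Quantize (x, y, z) LiDAR points in cylindrical coordinates (rho, phi, z)
    and return the Cartesian distance between each point and its voxel centre.
    The fixed angular bin width turns into a metric error that grows with range."""
    rho = np.linalg.norm(points[:, :2], axis=1)
    phi = np.arctan2(points[:, 1], points[:, 0])
    cyl = np.stack([rho, phi, points[:, 2]], axis=1)
    size = np.asarray(voxel_size)
    idx = np.floor(cyl / size)                       # integer voxel indices
    centers = (idx + 0.5) * size                     # voxel centres, cylindrical
    recon = np.stack([centers[:, 0] * np.cos(centers[:, 1]),
                      centers[:, 0] * np.sin(centers[:, 1]),
                      centers[:, 2]], axis=1)        # back to Cartesian
    return np.linalg.norm(points - recon, axis=1)

def cartesian_voxel_error(points, voxel_size=(0.1, 0.1, 0.1)):
    """Same measurement on a plain Cartesian grid: the error stays bounded by
    the half-diagonal of a voxel regardless of range."""
    size = np.asarray(voxel_size)
    centers = (np.floor(points / size) + 0.5) * size
    return np.linalg.norm(points - centers, axis=1)

if __name__ == "__main__":
    pts = np.random.uniform(-50.0, 50.0, size=(100_000, 3))
    print("cylindrical mean error (m):", cylindrical_voxel_error(pts).mean())
    print("cartesian mean error (m):  ", cartesian_voxel_error(pts).mean())
```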
https://arxiv.org/abs/2501.09485
Foundation models have revolutionized computer vision by achieving vastly superior performance across diverse tasks through large-scale pretraining on extensive datasets. However, their application in surgical computer vision has been limited. This study addresses this gap by introducing SurgeNetXL, a novel surgical foundation model that sets a new benchmark in surgical computer vision. Trained on the largest reported surgical dataset to date, comprising over 4.7 million video frames, SurgeNetXL achieves consistent top-tier performance across six datasets spanning four surgical procedures and three tasks, including semantic segmentation, phase recognition, and critical view of safety (CVS) classification. Compared with the best-performing surgical foundation models, SurgeNetXL shows mean improvements of 2.4, 9.0, and 12.6 percent for semantic segmentation, phase recognition, and CVS classification, respectively. Additionally, SurgeNetXL outperforms the best-performing ImageNet-based variants by 14.4, 4.0, and 1.6 percent in the respective tasks. In addition to advancing model performance, this study provides key insights into scaling pretraining datasets, extending training durations, and optimizing model architectures specifically for surgical computer vision. These findings pave the way for improved generalizability and robustness in data-scarce scenarios, offering a comprehensive framework for future research in this domain. All models and a subset of the SurgeNetXL dataset, including over 2 million video frames, are publicly available at: this https URL.
https://arxiv.org/abs/2501.09436
Semantic segmentation is essential for comprehending images, but it requires a substantial amount of detailed pixel-level annotations, which can be costly to acquire in the real world. Unsupervised domain adaptation (UDA) for semantic segmentation is a technique that uses labeled virtual data to train a model and adapts it to unlabeled real data. Some recent works use contrastive learning, a powerful method for self-supervised learning, to support this adaptation. However, these works do not take into account the diversity of features within each class when using contrastive learning, which leads to errors in class prediction. We analyze the limitations of these works and propose a novel framework called Pseudo-label Guided Pixel Contrast (PGPC), which overcomes the disadvantages of previous methods. We also investigate how to use more information from target images without adding noise from pseudo-labels. We test our method on two standard UDA benchmarks and show that it outperforms existing methods. Specifically, we achieve relative improvements of 5.1% mIoU and 4.6% mIoU on the Grand Theft Auto V (GTA5) to Cityscapes and SYNTHIA to Cityscapes tasks based on DAFormer, respectively. Furthermore, our approach can enhance the performance of other UDA approaches without increasing model complexity. Code is available at this https URL
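As an illustration of the pixel-contrast idea, the sketch below (an assumed formulation, not the exact PGPC loss) builds class prototypes from pseudo-labels and applies an InfoNCE-style loss that pulls each pixel embedding toward the prototype of its pseudo-labelled class; the temperature and ignore index are placeholder values.

```python
import torch
import torch.nn.functional as F

def pixel_contrast_loss(feats, pseudo_labels, num_classes, temperature=0.1,
                        ignore_index=255):
    """feats: (B, C, H, W) pixel embeddings; pseudo_labels: (B, H, W) int64."""
    b, c, h, w = feats.shape
    feats = F.normalize(feats, dim=1).permute(0, 2, 3, 1).reshape(-1, c)
    labels = pseudo_labels.reshape(-1)
    valid = labels != ignore_index
    feats, labels = feats[valid], labels[valid]

    # Class prototypes: mean embedding of the pixels sharing each pseudo-label.
    protos = torch.zeros(num_classes, c, dtype=feats.dtype, device=feats.device)
    counts = torch.zeros(num_classes, dtype=feats.dtype, device=feats.device)
    protos.index_add_(0, labels, feats)
    counts.index_add_(0, labels, torch.ones_like(labels, dtype=feats.dtype))
    protos = F.normalize(protos / counts.clamp(min=1).unsqueeze(1), dim=1)

    # InfoNCE-style term: each pixel should be closest to its own class prototype.
    logits = feats @ protos.t() / temperature
    return F.cross_entropy(logits, labels)
```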
https://arxiv.org/abs/2501.09040
Remote sensing imagery is dense with objects and contextual visual information. A recent trend is to combine paired satellite images and text captions to pretrain performant encoders for downstream tasks. However, while contrastive image-text methods like CLIP enable vision-language alignment and zero-shot classification, vision-only downstream performance tends to degrade compared to image-only pretraining such as MAE. In this paper, we propose FLAVARS, a pretraining method that combines the best of both contrastive learning and masked modeling, along with geospatial alignment via contrastive location encoding. We find that FLAVARS significantly outperforms a SkyCLIP baseline on vision-only tasks such as KNN classification and semantic segmentation, +6% mIoU on SpaceNet1, while retaining the ability to perform zero-shot classification, unlike MAE-pretrained methods.
https://arxiv.org/abs/2501.08490
Semantic future prediction is important for autonomous systems navigating dynamic environments. This paper introduces FUTURIST, a method for multimodal future semantic prediction that uses a unified and efficient visual sequence transformer architecture. Our approach incorporates a multimodal masked visual modeling objective and a novel masking mechanism designed for multimodal training. This allows the model to effectively integrate visible information from various modalities, improving prediction accuracy. Additionally, we propose a VAE-free hierarchical tokenization process, which reduces computational complexity, streamlines the training pipeline, and enables end-to-end training with high-resolution, multimodal inputs. We validate FUTURIST on the Cityscapes dataset, demonstrating state-of-the-art performance in future semantic segmentation for both short- and mid-term forecasting. We provide the implementation code at this https URL.
https://arxiv.org/abs/2501.08303
While recent foundation models have enabled significant breakthroughs in monocular depth estimation, a clear path towards safe and reliable deployment in the real world remains elusive. Metric depth estimation, which involves predicting absolute distances, poses particular challenges, as even the most advanced foundation models remain prone to critical errors. Since uncertainty quantification has emerged as a promising way to address these limitations and enable trustworthy deployment, we fuse five different uncertainty quantification methods with the current state-of-the-art DepthAnythingV2 foundation model. To cover a wide range of metric depth domains, we evaluate their performance on four diverse datasets. Our findings identify fine-tuning with the Gaussian Negative Log-Likelihood Loss (GNLL) as a particularly promising approach, offering reliable uncertainty estimates while maintaining predictive performance and computational efficiency on par with the baseline in both training and inference time. By fusing uncertainty quantification and foundation models within the context of monocular depth estimation, this paper lays a critical foundation for future research aimed at improving not only model performance but also its explainability. Extending this synthesis of uncertainty quantification and foundation models to other crucial tasks, such as semantic segmentation and pose estimation, presents exciting opportunities for safer and more reliable machine vision systems.
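For concreteness, a minimal sketch of the GNLL objective mentioned above: the depth head predicts a per-pixel mean and log-variance, and the predicted variance doubles as the uncertainty estimate. The tensor shapes and the log-variance parameterization are assumptions on our part, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def gaussian_nll_loss(pred_mean, pred_log_var, target, valid_mask):
    """pred_mean, pred_log_var, target: (B, 1, H, W); valid_mask: bool tensor.
    Predicting the log-variance keeps the variance positive and is stable to
    optimize; constant terms of the Gaussian likelihood are dropped."""
    var = torch.exp(pred_log_var)
    nll = 0.5 * (pred_log_var + (target - pred_mean) ** 2 / var)
    return nll[valid_mask].mean()

# Roughly equivalent via the built-in criterion:
# loss = F.gaussian_nll_loss(pred_mean[valid_mask], target[valid_mask],
#                            torch.exp(pred_log_var)[valid_mask])
```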
https://arxiv.org/abs/2501.08188
Semantic segmentation of remote sensing images is essential for various applications, including vegetation monitoring, disaster management, and urban planning. Previous studies have demonstrated that the self-attention mechanism (SA) is an effective approach for designing segmentation networks that can capture long-range pixel dependencies. SA enables the network to model the global dependencies between the input features, resulting in improved segmentation outcomes. However, the high density of attentional feature maps used in this mechanism causes exponential increases in computational complexity. Additionally, it introduces redundant information that negatively impacts the feature representation. Inspired by traditional threshold segmentation algorithms, we propose a novel threshold attention mechanism (TAM). This mechanism significantly reduces computational effort while also better modeling the correlation between different regions of the feature map. Based on TAM, we present a threshold attention network (TANet) for semantic segmentation. TANet consists of an attentional feature enhancement module (AFEM) for global feature enhancement of shallow features and a threshold attention pyramid pooling module (TAPP) for acquiring feature information at different scales for deep features. We have conducted extensive experiments on the ISPRS Vaihingen and Potsdam datasets. The results demonstrate the validity and superiority of our proposed TANet compared to state-of-the-art models.
https://arxiv.org/abs/2501.07984
Knowledge distillation has been widely adopted in computer vision tasks, since it can effectively enhance the performance of lightweight student networks by leveraging the knowledge transferred from cumbersome teacher networks. Most existing knowledge distillation methods utilize Kullback-Leibler divergence to mimic the logit output probabilities between the teacher network and the student network. Nonetheless, these methods may neglect the negative parts of the teacher's "dark knowledge" because the divergence calculation may ignore the effect of the minute probabilities in the teacher's logit output. This deficiency may lead to suboptimal logit mimicry during the distillation process and an imbalance in the information acquired by the student network. In this paper, we investigate the impact of this imbalance and propose a novel method named Balance Divergence Distillation. By introducing a compensatory operation using reverse Kullback-Leibler divergence, our method improves the modeling of the extremely small values in the negative part of the teacher's output while preserving the learning capacity for the positive part. Furthermore, we examine the impact of different temperature coefficient adjustments, which can be applied to further balance knowledge transfer. We evaluate the proposed method on several computer vision tasks, including image classification and semantic segmentation. The evaluation results show that our method achieves an accuracy improvement of 1%-3% for lightweight students on both the CIFAR-100 and ImageNet datasets, and a 4.55% improvement in mIoU for PSP-ResNet18 on the Cityscapes dataset. The experiments show that our method is a simple yet highly effective solution that can be smoothly applied to different knowledge distillation methods.
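The balancing idea can be sketched as follows (an assumed form, not the authors' exact loss): a forward KL term on temperature-softened logits is complemented by a reverse KL term, which is more sensitive to the teacher's very small "negative" probabilities. The temperatures and the weighting coefficient below are illustrative.

```python
import torch
import torch.nn.functional as F

def balance_divergence_loss(student_logits, teacher_logits,
                            t_fwd=4.0, t_rev=2.0, alpha=0.5):
    """student_logits, teacher_logits: (N, num_classes)."""
    # Forward KL(teacher || student) on softened distributions.
    p_t = F.softmax(teacher_logits / t_fwd, dim=1)
    log_p_s = F.log_softmax(student_logits / t_fwd, dim=1)
    kl_fwd = F.kl_div(log_p_s, p_t, reduction="batchmean") * t_fwd ** 2

    # Reverse KL(student || teacher), compensating for the tiny negatives.
    p_s = F.softmax(student_logits / t_rev, dim=1)
    log_p_t = F.log_softmax(teacher_logits / t_rev, dim=1)
    kl_rev = F.kl_div(log_p_t, p_s, reduction="batchmean") * t_rev ** 2

    return alpha * kl_fwd + (1.0 - alpha) * kl_rev
```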
https://arxiv.org/abs/2501.07804
Semantic segmentation plays a crucial role in remote sensing applications, where the accurate extraction and representation of features are essential for high-quality results. Despite the widespread use of encoder-decoder architectures, existing methods often struggle to fully utilize the high-dimensional features extracted by the encoder and to efficiently recover detailed information during decoding. To address these problems, we propose a novel semantic segmentation network, namely DeepKANSeg, including two key innovations based on the emerging Kolmogorov-Arnold Network (KAN). Notably, the advantage of KAN lies in its ability to decompose high-dimensional complex functions into univariate transformations, enabling efficient and flexible representation of intricate relationships in data. First, we introduce a KAN-based deep feature refinement module, namely DeepKAN, to effectively capture complex spatial and rich semantic relationships from high-dimensional features. Second, we replace the traditional multi-layer perceptron (MLP) layers in the global-local combined decoder with KAN-based linear layers, namely GLKAN. This module enhances the decoder's ability to capture fine-grained details during decoding. To evaluate the effectiveness of the proposed method, experiments are conducted on two well-known fine-resolution remote sensing benchmark datasets, namely ISPRS Vaihingen and ISPRS Potsdam. The results demonstrate that the KAN-enhanced segmentation model achieves superior accuracy compared to state-of-the-art methods. These results highlight the potential of KANs as a powerful alternative to traditional architectures in semantic segmentation tasks. Moreover, the explicit univariate decomposition provides improved interpretability, which is particularly beneficial for applications requiring explainable learning in remote sensing.
https://arxiv.org/abs/2501.07390
3D semantic scene completion is critical for multiple downstream tasks in autonomous systems. It estimates missing geometric and semantic information in the acquired scene data. Due to challenging real-world conditions, this task usually demands complex models that process multi-modal data to achieve acceptable performance. We propose a unique neural model, leveraging advances from state space and diffusion generative modeling, to achieve remarkable 3D semantic scene completion performance with monocular image input. Our technique processes the data in the conditioned latent space of a variational autoencoder, where diffusion modeling is carried out with an innovative state space technique. A key component of our neural network is the proposed Skimba (Skip Mamba) denoiser, which is adept at efficiently processing long-sequence data. The Skimba diffusion model is integral to our 3D scene completion network, incorporating a triple Mamba structure, dimensional decomposition residuals and varying dilations along three directions. We also adopt a variant of this network for the subsequent semantic segmentation stage of our method. Extensive evaluation on the standard SemanticKITTI and SSCBench-KITTI360 datasets shows that our approach not only outperforms other monocular techniques by a large margin, but also achieves competitive performance against stereo methods. The code is available at this https URL
https://arxiv.org/abs/2501.07260
Scaling up the vocabulary of semantic segmentation models is extremely challenging because annotating large-scale mask labels is labour-intensive and time-consuming. Recently, language-guided segmentation models have been proposed to address this challenge. However, their performance drops significantly when applied to out-of-distribution categories. In this paper, we propose a new large-vocabulary semantic segmentation framework, called LarvSeg. Different from previous works, LarvSeg leverages image classification data to scale the vocabulary of semantic segmentation models, as large-vocabulary classification datasets usually contain balanced categories and are much easier to obtain. However, for classification tasks, the category is image-level, while for segmentation we need to predict the label at the pixel level. To address this issue, we first propose a general baseline framework to incorporate image-level supervision into the training process of a pixel-level segmentation model, making the trained network perform semantic segmentation on categories newly introduced by the classification data. We then observe that a model trained on segmentation data can group pixel features of categories beyond the training vocabulary. Inspired by this finding, we design a category-wise attentive classifier to apply supervision to the precise regions of the corresponding categories to improve model performance. Extensive experiments demonstrate that LarvSeg significantly improves large-vocabulary semantic segmentation performance, especially for categories without mask labels. For the first time, we provide a 21K-category semantic segmentation model with the help of ImageNet21K. The code is available at this https URL.
https://arxiv.org/abs/2501.06862
This paper addresses the domain adaptation challenge for semantic segmentation in medical imaging. Despite the impressive performance of recent foundational segmentation models like SAM on natural images, they struggle with medical domain images. Beyond this, recent approaches that perform end-to-end fine-tuning of models are simply not computationally tractable. To address this, we propose a novel SAM adapter approach that minimizes the number of trainable parameters while achieving performance comparable to full fine-tuning. The proposed SAM adapter is strategically placed in the mask decoder, offering excellent and broad generalization capabilities and improved segmentation across both fully supervised and test-time domain adaptation tasks. Extensive validation on four datasets showcases the adapter's efficacy, outperforming existing methods while training less than 1% of SAM's total parameters.
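A hedged sketch of the general adapter pattern described here: a small residual bottleneck module is inserted into an otherwise frozen decoder so that only the adapter weights are trained. The module shape, initialization, and placement are illustrative; the paper's actual adapter design and its integration into SAM's mask decoder may differ.

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Residual bottleneck adapter: down-project, non-linearity, up-project."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

def train_adapters_only(model, adapters):
    """Freeze every parameter of the base model, then re-enable the adapters."""
    for p in model.parameters():
        p.requires_grad = False
    for a in adapters:
        for p in a.parameters():
            p.requires_grad = True
```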
https://arxiv.org/abs/2501.06836
This paper addresses the challenge of parking space detection in urban areas, focusing on the city of Granada. Utilizing aerial imagery, we develop and apply semantic segmentation techniques to accurately identify parked cars, moving cars and roads. A significant aspect of our research is the creation of a proprietary dataset specific to Granada, which is instrumental in training our neural network model. We employ Fully Convolutional Networks, Pyramid Networks and Dilated Convolutions, demonstrating their effectiveness in urban semantic segmentation. Our approach involves comparative analysis and optimization of various models, including Dynamic U-Net, PSPNet and DeepLabV3+, tailored for the segmentation of aerial images. The study includes a thorough experimentation phase, using datasets such as UDD5 and UAVid, alongside our custom Granada dataset. We evaluate our models using metrics like Foreground Accuracy, Dice Coefficient and Jaccard Index. Our results indicate that DeepLabV3+ offers the most promising performance. We conclude with future directions, emphasizing the need for a dedicated neural network for parked car detection and the potential for application in other urban environments. This work contributes to the fields of urban planning and traffic management, providing insights into efficient utilization of parking spaces through advanced image processing techniques.
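For reference, minimal implementations of the two overlap metrics mentioned above, the Dice coefficient and the Jaccard index, for binary masks; this is a common formulation and not necessarily the exact evaluation code used in the study.

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """pred, target: binary masks of the same shape."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def jaccard_index(pred, target, eps=1e-7):
    """Intersection over union of two binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return (inter + eps) / (union + eps)
```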
https://arxiv.org/abs/2501.06651
Semantic segmentation for autonomous driving becomes even more challenging under adverse driving conditions. Standard models trained on data recorded under ideal conditions show deteriorated performance in unfavorable weather or illumination conditions. Fine-tuning on the new task or condition would overwrite the previously learned information, resulting in catastrophic forgetting. Adapting to the new conditions through traditional domain adaptation methods improves the performance on the target domain at the expense of the source domain. Addressing these issues, we propose an architecture-based domain-incremental learning approach called Progressive Semantic Segmentation (PSS). PSS is a task-agnostic, dynamically growing collection of domain-specific segmentation models. The task of inferring the domain and subsequently selecting the appropriate module for segmentation is carried out using a collection of convolutional autoencoders. We extensively evaluate our proposed approach on several datasets at varying levels of granularity in the categorization of adverse driving conditions. Furthermore, we demonstrate the generalization of the proposed approach to similar and unseen domains.
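A minimal sketch of the domain-selection step as described: one convolutional autoencoder per seen domain, with the incoming image routed to the segmentation module whose autoencoder reconstructs it best. The autoencoder architecture and the use of mean-squared reconstruction error are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class ConvAE(nn.Module):
    """A small convolutional autoencoder used only to score domain membership."""
    def __init__(self, ch=3):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, ch, 4, stride=2, padding=1))

    def forward(self, x):
        return self.dec(self.enc(x))

@torch.no_grad()
def select_domain(image, autoencoders):
    """Index of the domain whose autoencoder reconstructs the image best."""
    errors = [torch.mean((ae(image) - image) ** 2).item() for ae in autoencoders]
    return min(range(len(errors)), key=errors.__getitem__)

# prediction = domain_experts[select_domain(img, autoencoders)](img)
```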
https://arxiv.org/abs/2501.05246
The pre-training and fine-tuning paradigm has revolutionized satellite remote sensing applications. However, this approach remains largely underexplored for airborne laser scanning (ALS), an important technology for applications such as forest management and urban planning. In this study, we address this gap by constructing a large-scale ALS point cloud dataset and evaluating its impact on downstream applications. Our dataset comprises ALS point clouds collected across the contiguous United States, provided by the United States Geological Survey's 3D Elevation Program. To ensure efficient data collection while capturing diverse land cover and terrain types, we introduce a geospatial sampling method that selects point cloud tiles based on land cover maps and digital elevation models. As a baseline self-supervised learning model, we adopt BEV-MAE, a state-of-the-art masked autoencoder for 3D outdoor point clouds, and pre-train it on the constructed dataset. The pre-trained models are subsequently fine-tuned for downstream tasks, including tree species classification, terrain scene recognition, and point cloud semantic segmentation. Our results show that the pre-trained models significantly outperform their from-scratch counterparts across all downstream tasks, demonstrating the transferability of the representations learned from the proposed dataset. Furthermore, we observe that scaling the dataset using our geospatial sampling method consistently enhances performance, whereas pre-training on datasets constructed with random sampling fails to achieve similar improvements. These findings highlight the utility of the constructed dataset and the effectiveness of our sampling strategy in the pre-training and fine-tuning paradigm. The source code and pre-trained models will be made publicly available at this https URL.
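The geospatial sampling idea can be pictured as stratified sampling over land-cover and elevation strata; the dictionary keys, bin edges, and round-robin selection in the sketch below are assumptions for illustration, not the authors' exact procedure.

```python
import random
from collections import defaultdict

def stratified_tile_sample(tiles, n_samples, elev_edges=(200, 500, 1000, 2000)):
    """tiles: iterable of dicts with (hypothetical) 'id', 'land_cover', and
    'elev_range' keys; returns tile ids sampled evenly across strata."""
    buckets = defaultdict(list)
    for t in tiles:
        elev_bin = sum(t["elev_range"] > e for e in elev_edges)
        buckets[(t["land_cover"], elev_bin)].append(t["id"])

    strata = list(buckets.values())
    for s in strata:
        random.shuffle(s)

    # Round-robin over strata until the requested number of tiles is reached.
    selected = []
    while len(selected) < n_samples and any(strata):
        for s in strata:
            if s and len(selected) < n_samples:
                selected.append(s.pop())
    return selected
```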
https://arxiv.org/abs/2501.05095
We present Seg-TTO, a novel framework for zero-shot, open-vocabulary semantic segmentation (OVSS), designed to excel in specialized domain tasks. While current open-vocabulary approaches show impressive performance on standard segmentation benchmarks under zero-shot settings, they fall short of supervised counterparts on highly domain-specific datasets. We focus on segmentation-specific test-time optimization to address this gap. Segmentation requires an understanding of multiple concepts within a single image while retaining the locality and spatial structure of representations. We propose a novel self-supervised objective adhering to these requirements and use it to align the model parameters with input images at test time. In the textual modality, we learn multiple embeddings for each category to capture diverse concepts within an image, while in the visual modality, we calculate pixel-level losses followed by embedding aggregation operations designed to preserve spatial structure. The resulting framework, termed Seg-TTO, is a plug-and-play module. We integrate Seg-TTO with three state-of-the-art OVSS approaches and evaluate it across 22 challenging OVSS tasks covering a range of specialized domains. Seg-TTO demonstrates clear performance improvements across these tasks, establishing a new state of the art. Code: this https URL.
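A minimal sketch of segmentation-specific test-time optimization in the spirit described above (an assumed form, not the Seg-TTO algorithm itself): a copy of the model takes a few gradient steps on a self-supervised, pixel-level objective for each test image before predicting. The optimizer, step count, learning rate, and the `self_sup_loss` callable are placeholders.

```python
import copy
import torch

def test_time_optimize(model, image, self_sup_loss, steps=10, lr=1e-4):
    """Adapt a throwaway copy of the model to a single image, then predict."""
    adapted = copy.deepcopy(model)            # keep the source weights intact
    adapted.train()
    opt = torch.optim.AdamW(adapted.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = self_sup_loss(adapted, image)  # e.g. a pixel-level alignment objective
        loss.backward()
        opt.step()
    adapted.eval()
    with torch.no_grad():
        return adapted(image)
```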
https://arxiv.org/abs/2501.04696
With the rapid advancement of deep learning, computational pathology has made significant progress in cancer diagnosis and subtyping. Tissue segmentation is a core challenge, essential for prognosis and treatment decisions. Weakly supervised semantic segmentation (WSSS) reduces the annotation requirement by using image-level labels instead of pixel-level ones. However, Class Activation Map (CAM)-based methods still suffer from low spatial resolution and unclear boundaries. To address these issues, we propose a multi-level superpixel correction algorithm that refines CAM boundaries using superpixel clustering and floodfill. Experimental results show that our method achieves strong performance on a breast cancer segmentation dataset, with an mIoU of 71.08%, significantly improving tumor microenvironment boundary delineation.
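As an illustration of the superpixel-refinement idea (not the authors' exact multi-level algorithm, and omitting the floodfill step), the sketch below averages CAM scores within SLIC superpixels so that activation boundaries snap to image boundaries; the segment count and compactness are assumed values.

```python
import numpy as np
from skimage.segmentation import slic

def refine_cam_with_superpixels(image, cam, n_segments=600):
    """image: (H, W, 3) float array in [0, 1]; cam: (H, W) activation map.
    Averages CAM scores inside each superpixel so boundaries follow the image."""
    segments = slic(image, n_segments=n_segments, compactness=10, start_label=0)
    refined = np.zeros_like(cam, dtype=np.float64)
    for s in np.unique(segments):
        mask = segments == s
        refined[mask] = cam[mask].mean()
    return refined
```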
https://arxiv.org/abs/2501.03891
This study explores the potential of graph neural networks (GNNs) to enhance semantic segmentation across diverse image modalities. We evaluate the effectiveness of a novel GNN-based U-Net architecture on three distinct datasets: PascalVOC, a standard benchmark for natural image segmentation; WoodScape, a challenging dataset of fisheye images commonly used in autonomous driving that introduces significant geometric distortions; and ISIC2016, a dataset of dermoscopic images for skin lesion segmentation. We compare our proposed UNet-GNN model against established convolutional neural network (CNN)-based segmentation models, including U-Net and U-Net++, as well as the transformer-based SwinUNet. Unlike these methods, which primarily rely on local convolutional operations or global self-attention, GNNs explicitly model relationships between image regions by constructing and operating on a graph representation of the image features. This approach allows the model to capture long-range dependencies and complex spatial relationships, which we hypothesize will be particularly beneficial for handling the geometric distortions present in fisheye imagery and capturing intricate boundaries in medical images. Our analysis demonstrates the versatility of GNNs in addressing diverse segmentation challenges and highlights their potential to improve segmentation accuracy in various applications, including autonomous driving and medical image analysis.
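To make the graph construction concrete, here is one hedged way a feature map can be turned into a graph for a GNN block: each spatial location becomes a node, and edges connect its k nearest neighbours in feature space, letting message passing relate distant but similar regions. This is an illustration, not the paper's exact construction, and the pairwise distance matrix is quadratic in the number of locations, so it would typically be applied to downsampled maps.

```python
import torch

def knn_graph_from_features(feat_map, k=8):
    """feat_map: (C, H, W) -> node features (H*W, C) and an edge index (2, H*W*k)
    connecting each location to its k nearest neighbours in feature space."""
    c, h, w = feat_map.shape
    nodes = feat_map.reshape(c, h * w).t()                  # (N, C)
    dists = torch.cdist(nodes, nodes)                       # (N, N) pairwise distances
    knn = dists.topk(k + 1, largest=False).indices[:, 1:]   # drop the self-match
    src = torch.arange(h * w).repeat_interleave(k)
    dst = knn.reshape(-1)
    return nodes, torch.stack([src, dst], dim=0)
```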
https://arxiv.org/abs/2501.03765
Semantic segmentation of LiDAR points has significant value for autonomous driving and mobile robot systems. Most approaches explore the spatio-temporal information of multiple scans to identify the semantic classes and motion states of each point. However, these methods often overlook segmentation consistency in space and time, which may result in points within the same object being predicted as different categories. To handle this issue, our core idea is to generate cluster labels across multiple frames that reflect the complete spatial structure and temporal information of objects. These labels serve as explicit guidance for our dual-branch network, 4D-CS, which integrates point-based and cluster-based branches to enable more consistent segmentation. Specifically, in the point-based branch, we leverage historical knowledge to enrich the current features through temporal fusion over multiple views. In the cluster-based branch, we propose a new strategy to produce cluster labels of foreground objects and apply them to gather point-wise information to derive cluster features. We then merge neighboring clusters across multiple scans to restore features missing due to occlusion. Finally, in the point-cluster fusion stage, we adaptively fuse the information from the two branches to optimize the segmentation results. Extensive experiments confirm the effectiveness of the proposed method, and we achieve state-of-the-art results on multi-scan semantic and moving object segmentation on the SemanticKITTI and nuScenes datasets. The code will be available at this https URL.
https://arxiv.org/abs/2501.02937
3D semantic segmentation is one of the most crucial tasks in driving perception. The ability of a learning-based model to accurately perceive dense 3D surroundings often ensures the safe operation of autonomous vehicles. However, existing LiDAR-based 3D semantic segmentation databases consist of sequentially acquired LiDAR scans that are long-tailed and lack training diversity. In this report, we introduce MixSeg3D, a sophisticated combination of a strong point cloud segmentation model with advanced 3D data mixing strategies. Specifically, our approach integrates the MinkUNet family with LaserMix and PolarMix, two scene-scale data augmentation methods that blend LiDAR point clouds along the ego-scene's inclination and azimuth directions. Through empirical experiments, we demonstrate the superiority of MixSeg3D over the baseline and prior art. Our team achieved 2nd place in the 3D semantic segmentation track of the 2024 Waymo Open Dataset Challenge.
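A simplified sketch of scene-scale LiDAR mixing in the spirit of LaserMix, partitioning two scans into inclination bands and swapping alternate bands between them; the band count and the even/odd swap below are illustrative and not the official LaserMix/PolarMix implementations used by MixSeg3D.

```python
import numpy as np

def lasermix_style_mix(points_a, labels_a, points_b, labels_b, n_bands=6):
    """Swap alternating inclination bands between two scans.
    points_*: (N, 3+) arrays with x, y, z in the first three columns."""
    def inclination(p):
        return np.arctan2(p[:, 2], np.linalg.norm(p[:, :2], axis=1))

    inc_a, inc_b = inclination(points_a), inclination(points_b)
    lo = min(inc_a.min(), inc_b.min())
    hi = max(inc_a.max(), inc_b.max())
    edges = np.linspace(lo, hi + 1e-6, n_bands + 1)
    band_a = np.digitize(inc_a, edges) - 1
    band_b = np.digitize(inc_b, edges) - 1

    keep_a = band_a % 2 == 0          # even bands come from scan A ...
    keep_b = band_b % 2 == 1          # ... odd bands come from scan B
    mixed_pts = np.concatenate([points_a[keep_a], points_b[keep_b]], axis=0)
    mixed_lbl = np.concatenate([labels_a[keep_a], labels_b[keep_b]], axis=0)
    return mixed_pts, mixed_lbl
```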
https://arxiv.org/abs/2501.05472