Semantic segmentation relies on large quantities of dense pixel-wise annotations to achieve the best performance, but owing to the difficulty of obtaining accurate annotations for real-world data, practitioners train on large-scale synthetic datasets. Unpaired image translation is one method used to address the ensuing domain gap by generating more realistic training data in low-data regimes. Current methods for unpaired image translation train generative adversarial networks (GANs) to perform the translation and enforce pixel-level semantic matching through cycle consistency. These methods do not guarantee that the semantic matching holds, posing a problem for semantic segmentation, where performance is sensitive to noisy pixel labels. We propose a novel image translation method, Domain Adversarial Kernel Prediction Network (DA-KPN), that guarantees semantic matching between the synthetic label and the translation. DA-KPN estimates pixel-wise input transformation parameters of a lightweight and simple translation function. To ensure the pixel-wise transformation is realistic, DA-KPN uses multi-scale discriminators to distinguish between translated and target samples. We show that DA-KPN outperforms previous GAN-based methods on syn2real benchmarks for semantic segmentation with limited access to real image labels and achieves comparable performance on face parsing.
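The abstract does not specify the exact form of the translation function; the sketch below (PyTorch; the per-pixel affine form, layer sizes, and all names are assumptions) illustrates the general structure such a parameter-prediction approach implies: a small network predicts per-pixel parameters that a lightweight function applies to the input image.

```python
# Hedged sketch: a lightweight per-pixel translation function whose parameters are
# predicted by a small network. The affine form and names are assumptions.
import torch
import torch.nn as nn

class PixelwiseTranslation(nn.Module):
    """Predicts per-pixel scale and shift and applies them to the input image."""
    def __init__(self, in_ch=3, hidden=32):
        super().__init__()
        # Small fully-convolutional parameter estimator (assumption: 2 params per channel).
        self.param_net = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 2 * in_ch, 3, padding=1),
        )

    def forward(self, x):
        params = self.param_net(x)                    # (B, 2C, H, W)
        scale, shift = params.chunk(2, dim=1)         # per-pixel affine parameters
        return torch.sigmoid(scale) * x + torch.tanh(shift)  # translated image, same size

x = torch.rand(1, 3, 64, 64)
print(PixelwiseTranslation()(x).shape)  # torch.Size([1, 3, 64, 64])
```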
https://arxiv.org/abs/2507.08554
Low-level enhancement and high-level visual understanding in low-light vision have traditionally been treated separately. Low-light enhancement improves image quality for downstream tasks, but existing methods rely on physical or geometric priors, limiting generalization. Evaluation mainly focuses on visual quality rather than downstream performance. Low-light visual understanding, constrained by scarce labeled data, primarily uses task-specific domain adaptation, which lacks scalability. To address these challenges, we build a generalized bridge between low-light enhancement and low-light understanding, which we term Generalized Enhancement For Understanding (GEFU). This paradigm improves both generalization and scalability. To address the diverse causes of low-light degradation, we leverage pretrained generative diffusion models to optimize images, achieving zero-shot generalization performance. Building on this, we propose Semantically Consistent Unsupervised Fine-tuning (SCUF). Specifically, to overcome text prompt limitations, we introduce an illumination-aware image prompt to explicitly guide image generation and propose a cycle-attention adapter to maximize its semantic potential. To mitigate semantic degradation in unsupervised training, we propose caption and reflectance consistency to learn high-level semantics and image-level spatial semantics. Extensive experiments demonstrate that our proposed method outperforms current state-of-the-art methods in traditional image quality and GEFU tasks including classification, detection, and semantic segmentation.
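The paper does not spell out the reflectance-consistency term; as a rough illustration under a Retinex-style assumption (reflectance approximated as the image divided by a smoothed illumination estimate), a minimal sketch might look like the following, where the decomposition choice, kernel size, and function names are assumptions.

```python
# Hedged sketch of a reflectance-consistency loss under a Retinex-style assumption:
# the enhanced output is encouraged to keep the reflectance of the low-light input.
import torch
import torch.nn.functional as F

def estimate_reflectance(img, kernel_size=31, eps=1e-4):
    """img: (B, 3, H, W) in [0, 1]. Illumination ~ local mean of the max channel (assumption)."""
    illum = img.max(dim=1, keepdim=True).values
    illum = F.avg_pool2d(illum, kernel_size, stride=1, padding=kernel_size // 2)
    return img / (illum + eps)

def reflectance_consistency_loss(low_light, enhanced):
    return F.l1_loss(estimate_reflectance(enhanced), estimate_reflectance(low_light))

low, enh = torch.rand(2, 3, 128, 128), torch.rand(2, 3, 128, 128)
print(reflectance_consistency_loss(low, enh).item())
```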
https://arxiv.org/abs/2507.08380
Weakly-supervised semantic segmentation aims to assign category labels to each pixel using weak annotations, significantly reducing manual annotation costs. Although existing methods have achieved remarkable progress in well-lit scenarios, their performance degrades significantly in low-light environments due to two fundamental limitations: severe image quality degradation (e.g., low contrast, noise, and color distortion) and the inherent constraints of weak supervision. These factors collectively lead to unreliable class activation maps and semantically ambiguous pseudo-labels, ultimately compromising the model's ability to learn discriminative feature representations. To address these problems, we propose Diffusion-Guided Knowledge Distillation for Weakly-Supervised Low-light Semantic Segmentation (DGKD-WLSS), a novel framework that synergistically combines Diffusion-Guided Knowledge Distillation (DGKD) with Depth-Guided Feature Fusion (DGF2). DGKD aligns normal-light and low-light features via diffusion-based denoising and knowledge distillation, while DGF2 integrates depth maps as illumination-invariant geometric priors to enhance structural feature learning. Extensive experiments demonstrate the effectiveness of DGKD-WLSS, which achieves state-of-the-art performance in weakly supervised semantic segmentation under low-light conditions. The source code has been released at: this https URL.
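A minimal sketch of the distillation idea described above, with stand-in single-layer encoders: a frozen teacher sees the normal-light image, the student sees the low-light image, and a loss pulls the student's features toward the teacher's. The diffusion-based denoising step is omitted, and all names and the MSE matching are assumptions rather than the authors' implementation.

```python
# Hedged sketch of normal-light -> low-light feature distillation (encoders are stand-ins).
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Conv2d(3, 64, 3, padding=1).eval()   # stand-in for a normal-light encoder
student = nn.Conv2d(3, 64, 3, padding=1)          # stand-in for the low-light encoder

def distillation_loss(low_light, normal_light):
    with torch.no_grad():
        t_feat = teacher(normal_light)             # teacher features, no gradient
    s_feat = student(low_light)
    # Channel-wise normalization before matching, a common distillation choice (assumption).
    return F.mse_loss(F.normalize(s_feat, dim=1), F.normalize(t_feat, dim=1))

low, normal = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
print(distillation_loss(low, normal).item())
```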
https://arxiv.org/abs/2507.07578
Weakly Supervised Semantic Segmentation (WSSS) is a challenging problem that has been extensively studied in recent years. Traditional approaches often rely on external modules like Class Activation Maps to highlight regions of interest and generate pseudo segmentation masks. In this work, we propose an end-to-end method that directly utilizes the attention maps learned by a Vision Transformer (ViT) for WSSS. We propose training a sparse ViT with multiple [CLS] tokens (one for each class), using a random masking strategy to promote [CLS] token-to-class assignment. At inference time, we aggregate the self-attention maps of the [CLS] tokens corresponding to the predicted labels to generate pseudo segmentation masks. Our proposed approach enhances the interpretability of self-attention maps and ensures accurate class assignments. Extensive experiments on two standard benchmarks and three specialized datasets demonstrate that our method generates accurate pseudo-masks, outperforming related works. These pseudo-masks can be used to train a segmentation model that achieves results comparable to fully supervised models, significantly reducing the need for fine-grained labeled data.
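A minimal sketch of the inference-time aggregation, assuming the per-class [CLS] attention maps over the patch grid are already extracted: keep only the maps of the predicted classes and take a per-pixel argmax with a background threshold. The normalization, threshold, and background handling are assumptions.

```python
# Hedged sketch: build a pseudo-mask from per-class [CLS] attention maps.
import torch

def pseudo_mask(attn_maps, predicted, bg_thresh=0.4):
    """attn_maps: (C, H, W) attention of each class's [CLS] token over the patch grid,
    predicted: list of predicted class indices for the image."""
    C, H, W = attn_maps.shape
    scores = torch.zeros(C, H, W)
    for c in predicted:
        a = attn_maps[c]
        scores[c] = (a - a.min()) / (a.max() - a.min() + 1e-8)  # normalize per class
    mask = scores.argmax(dim=0) + 1                  # 1..C for foreground classes
    mask[scores.max(dim=0).values < bg_thresh] = 0   # 0 = background
    return mask

maps = torch.rand(20, 32, 32)                        # e.g. 20 classes, 32x32 patch grid
print(pseudo_mask(maps, predicted=[3, 7]).shape)     # torch.Size([32, 32])
```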
https://arxiv.org/abs/2507.06848
This paper proposes an adaptive margin contrastive learning method for 3D semantic segmentation on point clouds. Most existing methods use equally penalized objectives, which ignore the per-point ambiguities and less discriminative features stemming from transition regions. However, as highly ambiguous points may be indistinguishable even for humans, their manually annotated labels are less reliable, and hard constraints over these points would lead to sub-optimal models. To address this, we first design AMContrast3D, a method that incorporates contrastive learning into an ambiguity estimation framework and tailors adaptive objectives to individual points based on their ambiguity levels. As a result, our method promotes model training that ensures the correctness of low-ambiguity points while tolerating mistakes on high-ambiguity points. Because ambiguities are formulated from position discrepancies across labels, optimization during inference is constrained by the assumption that all unlabeled points are uniformly unambiguous, lacking ambiguity awareness. Inspired by the insight of joint training, we further propose AMContrast3D++, which integrates two branches trained in parallel, where a novel ambiguity prediction module concurrently learns point ambiguities from the generated embeddings. To this end, we design a masked refinement mechanism that leverages the predicted ambiguities to make ambiguous embeddings more reliable, thereby boosting segmentation performance and enhancing robustness. Experimental results on the 3D indoor scene datasets S3DIS and ScanNet demonstrate the effectiveness of the proposed method. Code is available at this https URL.
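A minimal sketch of an adaptive-margin contrastive objective of the kind described: each point's positive similarity must exceed its hardest negative by a margin that shrinks as the point's ambiguity grows, so highly ambiguous points are penalized less. The margin schedule and loss form are assumptions, not the paper's exact formulation.

```python
# Hedged sketch: per-point margin scaled down by ambiguity in a triplet-style contrastive loss.
import torch
import torch.nn.functional as F

def adaptive_margin_contrastive(feat, pos, neg, ambiguity, base_margin=0.3):
    """feat, pos: (N, D) embeddings of points and their positives,
    neg: (N, K, D) negatives, ambiguity: (N,) in [0, 1]."""
    feat, pos, neg = F.normalize(feat, dim=-1), F.normalize(pos, dim=-1), F.normalize(neg, dim=-1)
    pos_sim = (feat * pos).sum(-1)                                   # (N,)
    neg_sim = torch.einsum('nd,nkd->nk', feat, neg).max(-1).values   # hardest negative
    margin = base_margin * (1.0 - ambiguity)                         # smaller margin when ambiguous
    return F.relu(neg_sim - pos_sim + margin).mean()

N, K, D = 1024, 8, 32
loss = adaptive_margin_contrastive(torch.randn(N, D), torch.randn(N, D),
                                   torch.randn(N, K, D), torch.rand(N))
print(loss.item())
```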
https://arxiv.org/abs/2507.06592
Conservation and decision-making regarding forest resources necessitate regular forest inventory. Light detection and ranging (LiDAR) in laser scanning systems has gained significant attention over the past two decades as a remote and non-destructive solution to streamline the labor-intensive and time-consuming procedure of forest inventory. Advanced multispectral (MS) LiDAR systems simultaneously acquire three-dimensional (3D) spatial and spectral information across multiple wavelengths of the electromagnetic spectrum. Consequently, MS-LiDAR technology enables the estimation of both the biochemical and biophysical characteristics of forests. Forest component segmentation is crucial for forest inventory. The synergistic use of spatial and spectral laser information has proven to be beneficial for achieving precise forest semantic segmentation. Thus, this study investigates the potential of MS-LiDAR data captured by the HeliALS system, which provides high-density multispectral point clouds, to segment forests into six components: ground, low vegetation, trunks, branches, foliage, and woody debris. Three point-wise 3D deep learning models, namely kernel point convolution (KPConv), superpoint transformer, and point transformer V3, and one machine learning model, random forest, are implemented. Our experiments confirm the superior accuracy of the KPConv model. Additionally, various geometric and spectral feature vector scenarios are examined. The highest accuracy is achieved by feeding all three wavelengths (1550 nm, 905 nm, and 532 nm) as the initial features into the deep learning model, yielding improvements of 33.73% in mean intersection over union (mIoU) and 32.35% in mean accuracy (mAcc). This study highlights the excellent potential of multispectral LiDAR for improving the accuracy of fully automated forest component segmentation.
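A minimal sketch of the best-performing input configuration described above: per-point coordinates plus reflectance at the three wavelengths stacked as the initial feature matrix handed to a point-based network. Array names and shapes are illustrative only.

```python
# Hedged sketch: assembling multispectral per-point input features (illustrative data).
import numpy as np

n_points = 100_000
xyz = np.random.rand(n_points, 3).astype(np.float32)          # point coordinates
refl_1550 = np.random.rand(n_points, 1).astype(np.float32)    # 1550 nm channel
refl_905 = np.random.rand(n_points, 1).astype(np.float32)     # 905 nm channel
refl_532 = np.random.rand(n_points, 1).astype(np.float32)     # 532 nm channel

features = np.concatenate([refl_1550, refl_905, refl_532], axis=1)  # (N, 3) spectral input
print(xyz.shape, features.shape)  # (100000, 3) (100000, 3)
```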
https://arxiv.org/abs/2507.08025
Unsupervised domain adaptation (UDA) involves learning class semantics from labeled data within a source domain that generalize to an unseen target domain. UDA methods are particularly impactful for semantic segmentation, where annotations are more difficult to collect than in image classification. Despite recent advances in large-scale vision-language representation learning, UDA methods for segmentation have not taken advantage of the domain-agnostic properties of text. To address this, we present a novel Covariance-based Pixel-Text loss, CoPT, that uses domain-agnostic text embeddings to learn domain-invariant features in an image segmentation encoder. The text embeddings are generated through our LLM Domain Template process, in which an LLM generates source- and target-domain descriptions that are fed to a frozen CLIP model and combined. In experiments on four benchmarks, we show that a model trained using CoPT achieves new state-of-the-art performance on UDA for segmentation. The code can be found at this https URL.
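A minimal sketch of the LLM Domain Template step: for each class, a source-domain and a target-domain description are encoded with a frozen CLIP text encoder and combined (here by averaging) into a domain-agnostic class embedding. The prompts, the averaging, and the use of the open-source clip package are assumptions; the covariance-based pixel-text loss itself is not reproduced.

```python
# Hedged sketch: combining source/target class descriptions with a frozen CLIP text encoder.
import torch
import clip  # https://github.com/openai/CLIP (choice of package is an assumption)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

class_name = "traffic sign"
prompts = [
    f"a photo of a {class_name} in a synthetic driving simulator",  # source description
    f"a photo of a {class_name} on a real city street",             # target description
]
with torch.no_grad():
    tokens = clip.tokenize(prompts).to(device)
    text_emb = model.encode_text(tokens)                  # (2, 512)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    domain_agnostic = text_emb.mean(dim=0)                # combined class embedding
print(domain_agnostic.shape)  # torch.Size([512])
```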
https://arxiv.org/abs/2507.07125
Collecting and annotating images for the purpose of training segmentation models is often cost prohibitive. In the domain of wildland fire science, this challenge is further compounded by the scarcity of reliable public datasets with labeled ground truth. This paper presents the Centralized Copy-Paste Data Augmentation (CCPDA) method for assisting with the training of deep-learning multiclass segmentation models, with a special focus on improving segmentation outcomes for the fire class. CCPDA has three main steps: (i) identify fire clusters in the source image, (ii) apply a centralization technique to focus on the core of the fire area, and (iii) paste the refined fire clusters onto a target image. This method increases dataset diversity while preserving the essential characteristics of the fire class. The effectiveness of this augmentation technique is demonstrated via numerical analysis and comparison against various other augmentation methods using a weighted sum-based multi-objective optimization approach. This approach helps elevate segmentation performance metrics specific to the fire class, which carries far greater operational significance than the other classes (fuel, ash, or background). Numerical performance assessment validates the efficacy of the presented CCPDA method in alleviating the difficulties associated with small, manually labeled training datasets. It also shows that CCPDA outperforms other augmentation strategies in the application scenario considered, particularly in improving fire-class segmentation performance.
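A minimal sketch of the three steps, where "centralization" is approximated by binary erosion toward each cluster's core; that approximation, the fire class id, and the erosion radius are assumptions about the paper's technique.

```python
# Hedged sketch of the CCPDA steps: (i) find fire clusters via connected components,
# (ii) keep each cluster's core (erosion stands in for centralization, an assumption),
# (iii) paste the retained pixels and labels onto a target image.
import numpy as np
from scipy import ndimage

FIRE_ID = 1  # assumed class id of the fire class

def ccpda(src_img, src_mask, tgt_img, tgt_mask, erode_iters=3):
    clusters, n = ndimage.label(src_mask == FIRE_ID)           # (i) fire clusters
    out_img, out_mask = tgt_img.copy(), tgt_mask.copy()
    for k in range(1, n + 1):
        core = ndimage.binary_erosion(clusters == k, iterations=erode_iters)  # (ii) core
        if core.any():
            out_img[core] = src_img[core]                      # (iii) paste pixels
            out_mask[core] = FIRE_ID                           # and their labels
    return out_img, out_mask

src_img, tgt_img = np.random.rand(256, 256, 3), np.random.rand(256, 256, 3)
src_mask = np.zeros((256, 256), dtype=np.uint8)
src_mask[100:140, 100:140] = FIRE_ID                           # fake fire region
tgt_mask = np.zeros((256, 256), dtype=np.uint8)
aug_img, aug_mask = ccpda(src_img, src_mask, tgt_img, tgt_mask)
print(aug_mask.sum())                                          # pasted fire pixels
```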
https://arxiv.org/abs/2507.06321
Recent advancements in robotic grasping have led to its integration as a core module in many manipulation systems. For instance, language-driven semantic segmentation enables the grasping of any designated object or object part. However, existing methods often struggle to generate feasible grasp poses for small objects or delicate components, potentially causing the entire pipeline to fail. To address this issue, we propose a novel grasping method, FineGrasp, which introduces improvements in three key aspects. First, we introduce multiple network modifications to enhance the network's ability to handle delicate regions. Second, we address the issue of label imbalance and propose a refined graspness label normalization strategy. Third, we introduce a new simulated grasp dataset and show that mixed sim-to-real training further improves grasp performance. Experimental results show significant improvements, especially in grasping small objects, and confirm the effectiveness of our system in semantic grasping.
https://arxiv.org/abs/2507.05978
The annotation bottleneck in semantic segmentation has driven significant interest in few-shot segmentation (FSS), which aims to develop segmentation models capable of generalizing rapidly to novel classes using minimal exemplars. Conventional training paradigms typically generate query prior maps by extracting masked-area features from support images, then make predictions guided by these prior maps. However, current approaches remain constrained by two critical limitations stemming from inter- and intra-image discrepancies, both of which significantly degrade segmentation performance: 1) the semantic gap between support and query images results in mismatched features and inaccurate prior maps; 2) visually similar yet semantically distinct regions within support or query images lead to false negative or false positive predictions. We propose a novel FSS method called I$^2$R that addresses these issues in two ways: 1) it uses category-specific high-level representations that aggregate global semantic cues from support and query images, enabling more precise inter-image region localization and addressing the first limitation; 2) it applies a directional masking strategy that suppresses inconsistent support-query pixel pairs exhibiting high feature similarity but conflicting masks, mitigating the second issue. Experiments demonstrate that our method outperforms state-of-the-art approaches, achieving improvements of 1.9% and 2.1% in mIoU under the 1-shot setting on the PASCAL-5$^i$ and COCO-20$^i$ benchmarks, respectively.
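A minimal sketch of the directional masking idea: compute cosine similarity between support and query pixel features and suppress pairs whose similarity is high while their foreground/background assignments conflict. The threshold and how the masked similarities are used downstream are assumptions.

```python
# Hedged sketch: suppress similar-but-conflicting support-query pixel pairs.
import torch
import torch.nn.functional as F

def directional_mask(sup_feat, qry_feat, sup_fg, qry_fg, thresh=0.7):
    """sup_feat, qry_feat: (Ns, D) and (Nq, D) pixel features,
    sup_fg, qry_fg: (Ns,), (Nq,) boolean foreground masks (query mask = current estimate)."""
    sim = F.normalize(sup_feat, dim=-1) @ F.normalize(qry_feat, dim=-1).t()  # (Ns, Nq)
    conflict = sup_fg[:, None] != qry_fg[None, :]         # label disagreement
    keep = ~((sim > thresh) & conflict)                   # drop similar-but-conflicting pairs
    return sim * keep                                     # masked similarity for later use

sim = directional_mask(torch.randn(500, 64), torch.randn(400, 64),
                       torch.rand(500) > 0.5, torch.rand(400) > 0.5)
print(sim.shape)  # torch.Size([500, 400])
```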
https://arxiv.org/abs/2507.05838
As critical transportation infrastructure, bridges face escalating challenges from aging and deterioration, while traditional manual inspection methods suffer from low efficiency. Although 3D point cloud technology provides a new data-driven paradigm, its application potential is often constrained by the incompleteness of real-world data, which results from missing labels and scanning occlusions. To overcome the bottleneck of insufficient generalization in existing synthetic data methods, this paper proposes a systematic framework for generating 3D bridge data. This framework can automatically generate complete point clouds featuring component-level instance annotations, high-fidelity color, and precise normal vectors. It can be further extended to simulate the creation of diverse and physically realistic incomplete point clouds, designed to support the training of segmentation and completion networks, respectively. Experiments demonstrate that a PointNet++ model trained with our synthetic data achieves a mean Intersection over Union (mIoU) of 84.2% in real-world bridge semantic segmentation. Concurrently, a fine-tuned KT-Net exhibits superior performance on the component completion task. This research offers an innovative methodology and a foundational dataset for the 3D visual analysis of bridge structures, holding significant implications for advancing the automated management and maintenance of infrastructure.
https://arxiv.org/abs/2507.05814
LiDAR representation learning aims to extract rich structural and semantic information from large-scale, readily available datasets, reducing reliance on costly human annotations. However, existing LiDAR representation strategies often overlook the inherent spatiotemporal cues in LiDAR sequences, limiting their effectiveness. In this work, we propose LiMA, a novel long-term image-to-LiDAR Memory Aggregation framework that explicitly captures longer range temporal correlations to enhance LiDAR representation learning. LiMA comprises three key components: 1) a Cross-View Aggregation module that aligns and fuses overlapping regions across neighboring camera views, constructing a more unified and redundancy-free memory bank; 2) a Long-Term Feature Propagation mechanism that efficiently aligns and integrates multi-frame image features, reinforcing temporal coherence during LiDAR representation learning; and 3) a Cross-Sequence Memory Alignment strategy that enforces consistency across driving sequences, improving generalization to unseen environments. LiMA maintains high pretraining efficiency and incurs no additional computational overhead during downstream tasks. Extensive experiments on mainstream LiDAR-based perception benchmarks demonstrate that LiMA significantly improves both LiDAR semantic segmentation and 3D object detection. We hope this work inspires more effective pretraining paradigms for autonomous driving. The code has been made publicly accessible for future research.
https://arxiv.org/abs/2507.05260
We present MOSU, a novel autonomous long-range navigation system that enhances global navigation for mobile robots through multimodal perception and on-road scene understanding. MOSU addresses the outdoor robot navigation challenge by integrating geometric, semantic, and contextual information to ensure comprehensive scene understanding. The system combines GPS and QGIS map-based routing for high-level global path planning and multi-modal trajectory generation for local navigation refinement. For trajectory generation, MOSU leverages multi-modalities: LiDAR-based geometric data for precise obstacle avoidance, image-based semantic segmentation for traversability assessment, and Vision-Language Models (VLMs) to capture social context and enable the robot to adhere to social norms in complex environments. This multi-modal integration improves scene understanding and enhances traversability, allowing the robot to adapt to diverse outdoor conditions. We evaluate our system in real-world on-road environments and benchmark it on the GND dataset, achieving a 10% improvement in traversability on navigable terrains while maintaining a comparable navigation distance to existing global navigation methods.
https://arxiv.org/abs/2507.04686
In recent years, cities have increasingly reduced speed limits from 50 km/h to 30 km/h to enhance road safety, reduce noise pollution, and promote sustainable modes of transportation. However, achieving compliance with these new limits remains a key challenge for urban planners. This study investigates drivers' compliance with the 30 km/h speed limit in Milan and examines how street characteristics influence driving behavior. Our findings suggest that the mere introduction of lower speed limits is not sufficient to reduce driving speeds effectively, highlighting the need to understand how street design can improve speed limit adherence. To comprehend this relationship, we apply computer vision-based semantic segmentation models to Google Street View images. A large-scale analysis reveals that narrower streets and densely built environments are associated with lower speeds, whereas roads with greater visibility and larger sky views encourage faster driving. To evaluate the influence of the local context on speeding behaviour, we apply the developed methodological framework to two additional cities: Amsterdam, which, similar to Milan, is a historic European city not originally developed for cars, and Dubai, which instead has developed in recent decades with a more car-centric design. The results of the analyses largely confirm the findings obtained in Milan, which demonstrates the broad applicability of the road design guidelines for driver speed compliance identified in this paper. Finally, we develop a machine learning model to predict driving speeds based on street characteristics. We showcase the model's predictive power by estimating the compliance with speed limits in Milan if the city were to adopt a 30 km/h speed limit city-wide. The tool provides actionable insights for urban planners, supporting the design of interventions to improve speed limit compliance.
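A minimal sketch of the final modelling step, assuming per-image segmentation masks and measured mean speeds are available: turn each mask into simple composition features (fractions of sky, building, road, vegetation) and fit a regressor that predicts driving speed. The feature set, class ids, and the choice of a random forest are assumptions, not the paper's exact model.

```python
# Hedged sketch: street-view segmentation masks -> composition features -> speed regressor.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

SKY, BUILDING, ROAD, VEGETATION = 0, 1, 2, 3   # assumed class ids

def mask_features(mask):
    return [np.mean(mask == c) for c in (SKY, BUILDING, ROAD, VEGETATION)]

# Fake data standing in for (segmentation mask, measured mean speed) pairs.
rng = np.random.default_rng(0)
masks = rng.integers(0, 4, size=(200, 64, 64))
speeds = rng.uniform(20, 60, size=200)

X = np.array([mask_features(m) for m in masks])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, speeds)
print(model.predict(X[:3]))   # predicted speeds for the first three streets
```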
https://arxiv.org/abs/2507.04434
Understanding surgical scenes can provide better healthcare quality for patients, especially given the vast amount of video data generated during minimally invasive surgery (MIS). Processing these videos generates valuable assets for training sophisticated models. In this paper, we introduce CLIP-RL, a novel contrastive language-image pre-training model tailored for semantic segmentation of surgical scenes. CLIP-RL presents a new segmentation approach that combines reinforcement learning and curriculum learning, enabling continuous refinement of the segmentation masks throughout the full training pipeline. Our model has shown robust performance under challenging optical conditions, such as occlusions, texture variations, and dynamic lighting. The CLIP model serves as a powerful feature extractor, capturing rich semantic context that enhances the distinction between instruments and tissues. The RL module plays a pivotal role in dynamically refining predictions through iterative action-space adjustments. We evaluated CLIP-RL on the EndoVis 2018 and EndoVis 2017 datasets. CLIP-RL achieved a mean IoU of 81% on EndoVis 2018, outperforming state-of-the-art models, and a mean IoU of 74.12% on EndoVis 2017. This superior performance stems from the combination of contrastive learning with reinforcement learning and curriculum learning.
https://arxiv.org/abs/2507.04317
Holistic surgical scene segmentation in robot-assisted surgery (RAS) enables surgical residents to identify various anatomical tissues, articulated tools, and critical structures, such as veins and vessels. Given the strict intraoperative time constraints, it is challenging for surgeons to provide detailed real-time explanations of the operative field for trainees. This challenge is compounded by the scarcity of expert surgeons relative to trainees, making the unambiguous delineation of go and no-go zones difficult. Therefore, high-performance semantic segmentation models offer a solution by providing clear postoperative analyses of surgical procedures. However, recent advanced segmentation models rely on user-generated prompts, rendering them impractical for lengthy surgical videos that commonly exceed an hour. To address this challenge, we introduce Surg-SegFormer, a novel prompt-free model that outperforms current state-of-the-art techniques. Surg-SegFormer attained a mean Intersection over Union (mIoU) of 0.80 on the EndoVis2018 dataset and 0.54 on the EndoVis2017 dataset. By providing robust and automated surgical scene comprehension, this model significantly reduces the tutoring burden on expert surgeons, empowering residents to independently and effectively understand complex surgical environments.
https://arxiv.org/abs/2507.04304
Ray tracing is a widely used deterministic method for radio propagation simulations, capable of producing physically accurate multipath components. The accuracy depends on the quality of the environment model and its electromagnetic properties. Recent advances in computer vision and machine learning have made it possible to reconstruct detailed environment models augmented with semantic segmentation labels. In this letter, we propose a differentiable ray tracing-based radio propagation simulator that operates directly on point clouds. We showcase the efficiency of our method by simulating multi-bounce propagation paths with up to five interactions with specular reflections and diffuse scattering in two indoor scenarios, each completing in less than 90 ms. Lastly, we demonstrate how the differentiability of electromagnetic computations can be combined with segmentation labels to learn the electromagnetic properties of the environment.
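A minimal sketch of the final idea (learning electromagnetic properties from segmentation labels), under a drastically simplified single-bounce model: each path's bounce point carries a semantic class, each class has a learnable reflection coefficient, and the coefficients are optimized by gradient descent so simulated path gains match measurements. The propagation model and all numbers are illustrative, not the paper's simulator.

```python
# Hedged sketch: fit per-class reflection coefficients through a differentiable toy model.
import torch

n_classes, n_paths = 4, 256
point_class = torch.randint(0, n_classes, (n_paths,))          # class of the bounce point
path_length = torch.rand(n_paths) * 50 + 5                     # meters
true_coeff = torch.tensor([0.8, 0.5, 0.3, 0.1])
measured = true_coeff[point_class] / path_length**2            # synthetic "measurements"

coeff = torch.full((n_classes,), 0.5, requires_grad=True)      # learnable per-class coefficient
opt = torch.optim.Adam([coeff], lr=0.05)
for step in range(300):
    simulated = coeff[point_class] / path_length**2            # differentiable forward model
    loss = torch.nn.functional.mse_loss(simulated, measured)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(coeff.detach())   # should approach [0.8, 0.5, 0.3, 0.1]
```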
https://arxiv.org/abs/2507.04021
Birds' Eye View (BEV) semantic segmentation is an indispensable perception task in end-to-end autonomous driving systems. Unsupervised and semi-supervised learning for BEV tasks, though pivotal for real-world applications, underperforms due to the homogeneous distribution of the labeled data. In this work, we explore the potential of synthetic data from driving world models to enhance the diversity of labeled data for robustifying BEV segmentation. Yet, our preliminary findings reveal that generation noise in synthetic data compromises efficient BEV model learning. To fully harness the potential of synthetic data from world models, this paper proposes NRSeg, a noise-resilient learning framework for BEV semantic segmentation. Specifically, a Perspective-Geometry Consistency Metric (PGCM) is proposed to quantitatively evaluate how well the generated data can guide model learning. This metric is derived from the alignment between the perspective road mask of the generated data and the mask projected from the BEV labels. Moreover, a Bi-Distribution Parallel Prediction (BiDPP) is designed to enhance the inherent robustness of the model, where the learning process is constrained through parallel prediction of multinomial and Dirichlet distributions. The former efficiently predicts semantic probabilities, whereas the latter adopts evidential deep learning to realize uncertainty quantification. Furthermore, a Hierarchical Local Semantic Exclusion (HLSE) module is designed to address the non-mutual exclusivity inherent in BEV semantic segmentation tasks. Experimental results demonstrate that NRSeg achieves state-of-the-art performance, yielding the highest improvements in mIoU of 13.8% and 11.4% in unsupervised and semi-supervised BEV segmentation tasks, respectively. The source code will be made publicly available at this https URL.
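A minimal sketch of the Bi-Distribution Parallel Prediction idea: one head predicts a multinomial (softmax) distribution over classes while a parallel head predicts Dirichlet evidence, from which per-pixel uncertainty follows as K divided by the Dirichlet strength, as in standard evidential deep learning. Layer sizes and the softplus evidence function are assumptions.

```python
# Hedged sketch: parallel multinomial and Dirichlet (evidential) prediction heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiDistributionHead(nn.Module):
    def __init__(self, in_ch=64, n_classes=6):
        super().__init__()
        self.softmax_head = nn.Conv2d(in_ch, n_classes, 1)
        self.evidence_head = nn.Conv2d(in_ch, n_classes, 1)
        self.n_classes = n_classes

    def forward(self, feat):
        probs = F.softmax(self.softmax_head(feat), dim=1)        # multinomial branch
        alpha = F.softplus(self.evidence_head(feat)) + 1.0       # Dirichlet parameters
        strength = alpha.sum(dim=1, keepdim=True)
        dirichlet_mean = alpha / strength                        # expected class probabilities
        uncertainty = self.n_classes / strength                  # evidential uncertainty
        return probs, dirichlet_mean, uncertainty

feat = torch.randn(1, 64, 50, 50)                                # BEV feature map
probs, dmean, unc = BiDistributionHead()(feat)
print(probs.shape, unc.shape)  # torch.Size([1, 6, 50, 50]) torch.Size([1, 1, 50, 50])
```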
https://arxiv.org/abs/2507.04002
Effective Out-of-Distribution (OOD) detection is critical for ensuring the reliability of semantic segmentation models, particularly in complex road environments where safety and accuracy are paramount. Despite recent advancements in large language models (LLMs), notably GPT-4, which significantly enhanced multimodal reasoning through Chain-of-Thought (CoT) prompting, the application of CoT-based visual reasoning for OOD semantic segmentation remains largely unexplored. In this paper, through extensive analyses of the road scene anomalies, we identify three challenging scenarios where current state-of-the-art OOD segmentation methods consistently struggle: (1) densely packed and overlapping objects, (2) distant scenes with small objects, and (3) large foreground-dominant objects. To address the presented challenges, we propose a novel CoT-based framework targeting OOD detection in road anomaly scenes. Our method leverages the extensive knowledge and reasoning capabilities of foundation models, such as GPT-4, to enhance OOD detection through improved image understanding and prompt-based reasoning aligned with observed problematic scene attributes. Extensive experiments show that our framework consistently outperforms state-of-the-art methods on both standard benchmarks and our newly defined challenging subset of the RoadAnomaly dataset, offering a robust and interpretable solution for OOD semantic segmentation in complex driving environments.
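A minimal sketch of how a Chain-of-Thought prompt could be assembled around the three failure scenarios identified above; the wording is illustrative only and is not the paper's actual prompt, and the call to the multimodal model is left abstract.

```python
# Hedged sketch: building a CoT-style prompt around the three identified failure scenarios.
SCENARIOS = [
    "densely packed and overlapping objects",
    "distant scenes with small objects",
    "large foreground-dominant objects",
]

def build_cot_prompt(candidate_regions):
    steps = "\n".join(
        f"{i + 1}. Check whether the scene involves {s}; if so, explain how that could "
        f"hide or mimic an out-of-distribution object." for i, s in enumerate(SCENARIOS)
    )
    return (
        "You are analysing a road-scene image for out-of-distribution objects.\n"
        "Reason step by step:\n"
        f"{steps}\n"
        f"4. For each candidate region in {candidate_regions}, state whether it is "
        "in-distribution or anomalous, and why.\n"
        "Finish with a list of anomalous region ids."
    )

print(build_cot_prompt(["region_0", "region_1"]))
```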
https://arxiv.org/abs/2507.03984
Event cameras have recently been introduced into image semantic segmentation, owing to their high temporal resolution and other advantageous properties. However, existing event-based semantic segmentation methods often fail to fully exploit the complementary information provided by frames and events, resulting in complex training strategies and increased computational costs. To address these challenges, we propose an efficient hybrid framework for image semantic segmentation, comprising a Spiking Neural Network branch for events and an Artificial Neural Network branch for frames. Specifically, we introduce three specialized modules to facilitate the interaction between these two branches: the Adaptive Temporal Weighting (ATW) Injector, the Event-Driven Sparse (EDS) Injector, and the Channel Selection Fusion (CSF) module. The ATW Injector dynamically integrates temporal features from event data into frame features, enhancing segmentation accuracy by leveraging critical dynamic temporal information. The EDS Injector effectively combines sparse event data with rich frame features, ensuring precise temporal and spatial information alignment. The CSF module selectively merges these features to optimize segmentation performance. Experimental results demonstrate that our framework not only achieves state-of-the-art accuracy across the DDD17-Seg, DSEC-Semantic, and M3ED-Semantic datasets but also significantly reduces energy consumption, achieving a 65% reduction on the DSEC-Semantic dataset.
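A minimal sketch of the Adaptive Temporal Weighting idea: score each temporal bin of the event features, softmax the scores into weights, collapse the temporal axis with those weights, and add the result to the frame features. The scoring network, the additive fusion, and all shapes are assumptions about the module rather than its actual design.

```python
# Hedged sketch: adaptively weight temporal event features and inject them into frame features.
import torch
import torch.nn as nn

class ATWInjector(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.scorer = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                    nn.Linear(channels, 1))

    def forward(self, event_feat, frame_feat):
        """event_feat: (B, T, C, H, W) per-time-bin event features,
        frame_feat: (B, C, H, W) frame features."""
        B, T, C, H, W = event_feat.shape
        scores = self.scorer(event_feat.reshape(B * T, C, H, W)).reshape(B, T)
        weights = torch.softmax(scores, dim=1).reshape(B, T, 1, 1, 1)
        temporal = (weights * event_feat).sum(dim=1)     # adaptively weighted aggregate
        return frame_feat + temporal                     # inject into the frame branch

events, frames = torch.randn(2, 5, 64, 32, 32), torch.randn(2, 64, 32, 32)
print(ATWInjector()(events, frames).shape)  # torch.Size([2, 64, 32, 32])
```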
https://arxiv.org/abs/2507.03765