Interpreting the mineralogical aspects of rock thin sections is an important task for oil and gas reservoir evaluation. However, human analysis tends to be subjective and laborious. Technologies such as QEMSCAN(R) are designed to automate the mineralogical mapping process, but they also suffer from limitations such as high monetary cost and time-consuming analysis. This work proposes a Convolutional Neural Network model for automatic mineralogical segmentation of thin section images of carbonate rocks, able to mimic the QEMSCAN mapping itself in a low-cost, generalized and efficient manner. To this end, the U-Net semantic segmentation architecture is trained on plane- and cross-polarized thin section images using the corresponding QEMSCAN maps as targets, an approach not widely explored. The model is trained to differentiate occurrences of Calcite, Dolomite, Mg-Clay Minerals, Quartz, Pores and the remaining mineral phases, grouped into a single class named "Others", and is validated on rock facies both seen and unseen during training in order to assess its generalization capability. Since the images and maps are provided at different resolutions, image registration was applied to align them spatially. The study reveals that segmentation quality depends strongly on these resolution differences and on the variety of learnable rock textures. Nevertheless, the results are promising, especially with regard to the proper delineation of mineral boundaries on solid textures and the precise estimation of mineral distributions, which follow a nearly linear relationship between expected and predicted values, with a coefficient of determination (R^2) above 0.97 for seen facies and 0.88 for unseen ones.
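As a minimal illustration of how such a model is evaluated, the sketch below reduces a predicted label map and a registered QEMSCAN map to per-class mineral area fractions and compares them with a coefficient of determination. The class list follows the abstract; the function names and the toy data are assumptions, not code from the paper.

```python
import numpy as np

# Class set taken from the abstract; the index assignment is an assumption.
CLASSES = ["Calcite", "Dolomite", "Mg-Clay", "Quartz", "Pores", "Others"]

def class_fractions(label_map: np.ndarray, n_classes: int = 6) -> np.ndarray:
    """Area fraction of each class in a label map (H x W array of integer ids)."""
    counts = np.bincount(label_map.ravel(), minlength=n_classes)
    return counts / counts.sum()

def r2_score(expected: np.ndarray, predicted: np.ndarray) -> float:
    """Coefficient of determination between expected and predicted fractions."""
    ss_res = np.sum((expected - predicted) ** 2)
    ss_tot = np.sum((expected - expected.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy example: compare a predicted map against a registered QEMSCAN map.
rng = np.random.default_rng(0)
qemscan = rng.integers(0, 6, size=(256, 256))        # stand-in for the QEMSCAN target
predicted = np.where(rng.random((256, 256)) < 0.9,   # 90% agreement, for illustration only
                     qemscan, rng.integers(0, 6, size=(256, 256)))

expected_frac = class_fractions(qemscan)
predicted_frac = class_fractions(predicted)
print(dict(zip(CLASSES, np.round(predicted_frac, 3))))
print("R^2 =", round(r2_score(expected_frac, predicted_frac), 4))
```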
https://arxiv.org/abs/2505.17008
Open-Vocabulary Segmentation (OVS) has drawn increasing attention for its capacity to generalize segmentation beyond predefined categories. However, existing methods typically predict segmentation masks with simple forward inference, lacking explicit reasoning and interpretability. This makes it challenging for OVS models to distinguish similar categories in open-world settings due to the lack of contextual understanding and discriminative visual cues. To address this limitation, we propose a step-by-step visual reasoning framework for open-vocabulary segmentation, named OpenSeg-R. OpenSeg-R leverages Large Multimodal Models (LMMs) to perform hierarchical visual reasoning before segmentation. Specifically, we generate both generic and image-specific reasoning for each image, forming structured triplets that explain the visual rationale for objects in a coarse-to-fine manner. Based on these reasoning steps, we compose detailed description prompts and feed them to the segmentor to produce more accurate segmentation masks. To the best of our knowledge, OpenSeg-R is the first framework to introduce explicit step-by-step visual reasoning into OVS. Experimental results demonstrate that OpenSeg-R significantly outperforms state-of-the-art methods on open-vocabulary semantic segmentation across five benchmark datasets. Moreover, it achieves consistent gains across all metrics on open-vocabulary panoptic segmentation. Qualitative results further highlight the effectiveness of our reasoning-guided framework in improving both segmentation precision and interpretability. Our code is publicly available at this https URL.
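A hedged sketch of the prompt-composition step described above: structured reasoning triplets, assumed here to hold a category, a generic reason, and an image-specific reason, are flattened into description prompts for the segmentor. The field names, prompt template, and example content are illustrative, not taken from OpenSeg-R.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ReasoningTriplet:
    # Assumed structure: (category, generic reasoning, image-specific reasoning).
    category: str
    generic_reason: str
    image_specific_reason: str

def compose_prompt(triplet: ReasoningTriplet) -> str:
    """Turn one coarse-to-fine triplet into a description prompt for the segmentor."""
    return (f"{triplet.category}: {triplet.generic_reason}; "
            f"in this image, {triplet.image_specific_reason}")

def build_prompts(triplets: List[ReasoningTriplet]) -> List[str]:
    return [compose_prompt(t) for t in triplets]

# Toy example with hand-written reasoning standing in for LMM output.
triplets = [
    ReasoningTriplet("sea lion", "a marine mammal with flippers and a sleek body",
                     "it rests on a wet rock near the waterline"),
    ReasoningTriplet("seal", "a marine mammal with a rounded head and no ear flaps",
                     "it is partially submerged with only the head visible"),
]
for prompt in build_prompts(triplets):
    print(prompt)
```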
https://arxiv.org/abs/2505.16974
Artificial Intelligence (AI) is accelerating the transformation of scientific research paradigms, not only enhancing research efficiency but also driving innovation. We introduce NovelSeek, a unified closed-loop multi-agent framework for conducting Autonomous Scientific Research (ASR) across various scientific fields, enabling researchers to tackle complicated problems in these fields with unprecedented speed and precision. NovelSeek highlights three key advantages: 1) Scalability: NovelSeek has demonstrated its versatility across 12 scientific research tasks, generating innovative ideas that enhance the performance of baseline code. 2) Interactivity: NovelSeek provides an interface for human expert feedback and multi-agent interaction within automated end-to-end processes, allowing seamless integration of domain expert knowledge. 3) Efficiency: NovelSeek has achieved promising performance gains in several scientific fields at a significantly lower time cost than human effort. For instance, in reaction yield prediction, performance increased from 27.6% to 35.4% in just 12 hours; in enhancer activity prediction, accuracy rose from 0.52 to 0.79 with only 4 hours of processing; and in 2D semantic segmentation, precision advanced from 78.8% to 81.0% in a mere 30 hours.
https://arxiv.org/abs/2505.16938
Segment Anything Models (SAM) have achieved remarkable success in object segmentation tasks across diverse datasets. However, these models are predominantly trained on large-scale semantic segmentation datasets, which introduce a bias toward object shape rather than texture cues in the image. This limitation is critical in domains such as medical imaging, material classification, and remote sensing, where texture changes define object boundaries. In this study, we investigate SAM's bias toward semantics over textures and introduce a new texture-aware foundation model, TextureSAM, which performs superior segmentation in texture-dominant scenarios. To achieve this, we employ a novel fine-tuning approach that incorporates texture augmentation techniques, incrementally modifying training images to emphasize texture features. By leveraging a novel texture-altered version of the ADE20K dataset, we guide TextureSAM to prioritize texture-defined regions, thereby mitigating the inherent shape bias present in the original SAM model. Our extensive experiments demonstrate that TextureSAM significantly outperforms SAM-2 on both natural (+0.2 mIoU) and synthetic (+0.18 mIoU) texture-based segmentation datasets. The code and texture-augmented dataset will be publicly available.
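The sketch below illustrates one way such an incremental texture-emphasizing augmentation could be scheduled: blend each training image with a texture-modified view whose weight grows over training. The texture source here is a placeholder sinusoidal grating and the schedule is an assumption; TextureSAM instead relies on its texture-altered ADE20K data.

```python
import numpy as np

def synthetic_texture(shape, period: float = 8.0) -> np.ndarray:
    """Placeholder texture field (a sinusoidal grating); any texture source,
    such as the paper's texture-altered ADE20K imagery, could slot in here."""
    h, w = shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    grating = 0.5 + 0.5 * np.sin(2 * np.pi * (xx + yy) / period)
    return np.repeat(grating[..., None], shape[2], axis=2)

def texture_augment(image: np.ndarray, step: int, total_steps: int) -> np.ndarray:
    """Incrementally emphasize texture: blend in more of the textured view over training."""
    alpha = 0.5 * min(1.0, step / max(1, total_steps))   # schedule is an assumption
    textured = image * (0.5 + 0.5 * synthetic_texture(image.shape))
    return (1.0 - alpha) * image + alpha * textured

img = np.random.default_rng(1).random((64, 64, 3)).astype(np.float32)
aug = texture_augment(img, step=500, total_steps=1000)
print(aug.shape, float(aug.min()), float(aug.max()))
```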
https://arxiv.org/abs/2505.16540
Semantic segmentation models trained on synthetic data often perform poorly on real-world images due to domain gaps, particularly in adverse conditions where labeled data is scarce. Yet recent foundation models make it possible to generate realistic images without any training. This paper proposes to leverage such diffusion models to improve the performance of vision models trained on synthetic data. We introduce two novel techniques for semantically consistent style transfer using diffusion models: Class-wise Adaptive Instance Normalization and Cross-Attention (CACTI) and its extension with selective Attention Filtering (CACTIF). CACTI applies statistical normalization selectively based on semantic classes, while CACTIF further filters cross-attention maps based on feature similarity, preventing artifacts in regions with weak cross-attention correspondences. Our methods transfer style characteristics while preserving semantic boundaries and structural coherence, unlike approaches that apply global transformations or generate content without constraints. Experiments using GTA5 as the source domain and Cityscapes/ACDC as target domains show that our approach produces higher-quality images with lower FID scores and better content preservation. Our work demonstrates that class-aware diffusion-based style transfer effectively bridges the synthetic-to-real domain gap even with minimal target domain data, advancing robust perception systems for challenging real-world applications. The source code is available at: this https URL.
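A minimal sketch of the class-wise normalization idea (the AdaIN half only): for every semantic class, the content features of that class are shifted and rescaled toward the style statistics computed over the same class in the style image. The tensor shapes and fallback behaviour are assumptions; the cross-attention and filtering components of CACTI/CACTIF are not shown.

```python
import torch

def classwise_adain(content: torch.Tensor, style: torch.Tensor,
                    content_mask: torch.Tensor, style_mask: torch.Tensor,
                    eps: float = 1e-5) -> torch.Tensor:
    """Class-wise AdaIN sketch: for each semantic class, shift/scale the content
    features of that class toward the style statistics of the same class.
    content, style: (C, H, W) feature maps; masks: (H, W) integer class ids."""
    out = content.clone()
    for cls in torch.unique(content_mask):
        c_sel = content_mask == cls
        s_sel = style_mask == cls
        if c_sel.sum() < 2 or s_sel.sum() < 2:
            continue  # keep content features when a class is (nearly) absent in the style image
        c_feat = content[:, c_sel]                 # (C, Nc)
        s_feat = style[:, s_sel]                   # (C, Ns)
        c_mu, c_std = c_feat.mean(1, keepdim=True), c_feat.std(1, keepdim=True) + eps
        s_mu, s_std = s_feat.mean(1, keepdim=True), s_feat.std(1, keepdim=True) + eps
        out[:, c_sel] = (c_feat - c_mu) / c_std * s_std + s_mu
    return out

# Toy example with random features and two classes.
torch.manual_seed(0)
content, style = torch.randn(8, 32, 32), torch.randn(8, 32, 32)
cmask = (torch.rand(32, 32) > 0.5).long()
smask = (torch.rand(32, 32) > 0.5).long()
print(classwise_adain(content, style, cmask, smask).shape)
```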
https://arxiv.org/abs/2505.16360
Large-scale pretrained vision backbones have transformed computer vision by providing powerful feature extractors that enable various downstream tasks, including training-free approaches like visual prompting for semantic segmentation. Despite their success in generic scenarios, these models often fall short when applied to specialized technical domains where the visual features differ significantly from their training distribution. To bridge this gap, we introduce VP Lab, a comprehensive iterative framework that enhances visual prompting for robust segmentation model development. At the core of VP Lab lies E-PEFT, a novel ensemble of parameter-efficient fine-tuning techniques specifically designed to adapt our visual prompting pipeline to specific domains in a manner that is both parameter- and data-efficient. Our approach not only surpasses the state of the art in parameter-efficient fine-tuning for the Segment Anything Model (SAM), but also facilitates an interactive, near-real-time loop, allowing users to observe progressively improving results as they experiment within the framework. By integrating E-PEFT with visual prompting, we demonstrate a remarkable 50% increase in semantic segmentation mIoU performance across various technical datasets using only 5 validated images, establishing a new paradigm for fast, efficient, and interactive model deployment in new, challenging domains. This work comes in the form of a demonstration.
https://arxiv.org/abs/2505.15592
3D semantic segmentation plays a pivotal role in autonomous driving and road infrastructure analysis, yet state-of-the-art 3D models are prone to severe domain shift when deployed across different datasets. We propose a novel multi-view projection framework that excels in both domain generalization (DG) and unsupervised domain adaptation (UDA). Our approach first aligns Lidar scans into coherent 3D scenes and renders them from multiple virtual camera poses to create a large-scale synthetic 2D dataset (PC2D). We then use it to train a 2D segmentation model in-domain. During inference, the model processes hundreds of views per scene; the resulting logits are back-projected to 3D with an occlusion-aware voting scheme to generate final point-wise labels. Our framework is modular and enables extensive exploration of key design parameters, such as view generation optimization (VGO), visualization modality optimization (MODO), and 2D model choice. We evaluate on the nuScenes and SemanticKITTI datasets under both the DG and UDA settings. We achieve state-of-the-art results in UDA and close to state-of-the-art in DG, with particularly large gains on large, static classes. Our code and dataset generation tools will be publicly available at this https URL.
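A hedged sketch of the occlusion-aware voting step: each point accumulates the 2D logits from every view in which it is visible, and the final label is the argmax of the accumulated votes. The projection, visibility test, and uniform weighting below are simplifications standing in for the paper's actual scheme.

```python
import numpy as np

def backproject_votes(points: np.ndarray, views: list, n_classes: int) -> np.ndarray:
    """Accumulate per-point class logits over many rendered views.
    points: (N, 3); each view is a dict with
      'pix'     : (N, 2) integer pixel coordinates of every point in that view,
      'visible' : (N,) bool occlusion mask (True if the point is seen),
      'logits'  : (H, W, n_classes) output of the 2D segmentation model."""
    votes = np.zeros((points.shape[0], n_classes), dtype=np.float64)
    for view in views:
        vis = view["visible"]
        u, v = view["pix"][vis, 0], view["pix"][vis, 1]
        votes[vis] += view["logits"][v, u]      # only visible points receive this view's vote
    return votes.argmax(axis=1)                 # final point-wise labels

# Toy example: 1000 points, 3 classes, 5 synthetic views.
rng = np.random.default_rng(0)
pts = rng.random((1000, 3))
views = [{
    "pix": rng.integers(0, 64, size=(1000, 2)),
    "visible": rng.random(1000) > 0.3,
    "logits": rng.random((64, 64, 3)),
} for _ in range(5)]
labels = backproject_votes(pts, views, n_classes=3)
print(labels.shape, np.bincount(labels))
```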
https://arxiv.org/abs/2505.15545
Semantic segmentation relying solely on RGB data often struggles in challenging conditions such as low illumination and obscured views, limiting its reliability in critical applications like autonomous driving. To address this, integrating additional thermal radiation data with RGB images demonstrates enhanced performance and robustness. However, how to effectively reconcile the modality discrepancies and fuse the RGB and thermal features remains a well-known challenge. In this work, we address this challenge from a novel spectral perspective. We observe that the multi-modal features can be categorized into two spectral components: low-frequency features that provide broad scene context, including color variations and smooth areas, and high-frequency features that capture modality-specific details such as edges and textures. Inspired by this, we propose the Spectral-aware Global Fusion Network (SGFNet) to effectively enhance and fuse the multi-modal features by explicitly modeling the interactions between the high-frequency, modality-specific features. Our experimental results demonstrate that SGFNet outperforms the state-of-the-art methods on the MFNet and PST900 datasets.
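The sketch below shows the spectral decomposition that motivates SGFNet: a feature map is split into a low-frequency component (broad context) and a high-frequency residual (modality-specific detail) with an FFT low-pass mask, after which the two modalities can be combined. The cut-off radius and the placeholder fusion rule are assumptions, not SGFNet's actual modules.

```python
import torch

def spectral_split(feat: torch.Tensor, radius: float = 0.25):
    """Split (B, C, H, W) features into low- and high-frequency parts via a
    circular low-pass mask in the 2D Fourier domain."""
    B, C, H, W = feat.shape
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    yy = torch.linspace(-0.5, 0.5, H).view(H, 1).expand(H, W)
    xx = torch.linspace(-0.5, 0.5, W).view(1, W).expand(H, W)
    lowpass = ((xx ** 2 + yy ** 2).sqrt() <= radius).to(feat.dtype)
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * lowpass, dim=(-2, -1))).real
    high = feat - low        # modality-specific detail (edges, textures)
    return low, high

# Toy fusion of RGB and thermal features: share the low-frequency context,
# let the high-frequency parts interact (here a simple sum as a placeholder).
rgb_low, rgb_high = spectral_split(torch.randn(2, 16, 32, 32))
th_low, th_high = spectral_split(torch.randn(2, 16, 32, 32))
fused = 0.5 * (rgb_low + th_low) + rgb_high + th_high
print(fused.shape)
```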
https://arxiv.org/abs/2505.15491
Remote sensing images (RSIs) capture both natural and human-induced changes on the Earth's surface, serving as essential data for environmental monitoring, urban planning, and resource management. Semantic segmentation (SS) of RSIs enables the fine-grained interpretation of surface features, making it a critical task in remote sensing analysis. With the increasing diversity and volume of RSIs collected by sensors on various platforms, traditional processing methods struggle to maintain efficiency and accuracy. In response, deep learning (DL) has emerged as a transformative approach, enabling substantial advances in remote sensing image semantic segmentation (RSISS) by automating feature extraction and improving segmentation accuracy across diverse modalities. This paper revisits the evolution of DL-based RSISS by categorizing existing approaches into four stages: the early pixel-based methods, the prevailing patch-based and tile-based techniques, and the emerging image-based strategies enabled by foundation models. We analyze these developments from the perspective of feature extraction and learning strategies, revealing the field's progression from pixel-level to tile-level and from unimodal to multimodal segmentation. Furthermore, we conduct a comprehensive evaluation of nearly 40 advanced techniques on a unified dataset to quantitatively characterize their performance and applicability. This review offers a holistic view of DL-based SS for RS, highlighting key advancements, comparative insights, and open challenges to guide future research.
https://arxiv.org/abs/2505.15147
Autonomous robots must reason about the physical consequences of their actions to operate effectively in unstructured, real-world environments. We present Scan, Materialize, Simulate (SMS), a unified framework that combines 3D Gaussian Splatting for accurate scene reconstruction, visual foundation models for semantic segmentation, vision-language models for material property inference, and physics simulation for reliable prediction of action outcomes. By integrating these components, SMS enables generalizable physical reasoning and object-centric planning without the need to re-learn foundational physical dynamics. We empirically validate SMS in a billiards-inspired manipulation task and a challenging quadrotor landing scenario, demonstrating robust performance on both simulated domain transfer and real-world experiments. Our results highlight the potential of bridging differentiable rendering for scene reconstruction, foundation models for semantic understanding, and physics-based simulation to achieve physically grounded robot planning across diverse settings.
https://arxiv.org/abs/2505.14938
Recently proposed neural network architectures like PointNet [QSMG16] and PointNet++ [QYSG17] have made it possible to apply Deep Learning to 3D point sets. The feature representations of shapes learned by these two networks enabled training classifiers for Semantic Segmentation, and more recently for Instance Segmentation via the Similarity Group Proposal Network (SGPN) [WYHN17]. One area of improvement highlighted by SGPN's authors pertains to the use of memory-intensive similarity matrices, which occupy memory quadratic in the number of points. In this report, we attempt to tackle this issue through two sampling-based methods, which compute Instance Segmentation on a sub-sampled point set and then extrapolate labels to the complete set using a nearest-neighbour approach. While both approaches perform equally well on large sub-samples, the random sampling strategy gives the largest improvements in terms of speed and memory usage.
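A minimal sketch of the random-sampling strategy: run instance segmentation on a random subset of points and propagate its labels to every remaining point through a nearest-neighbour lookup. The toy segmenter stands in for SGPN, and the k-d tree is one reasonable choice for the nearest-neighbour search.

```python
import numpy as np
from scipy.spatial import cKDTree

def extrapolate_labels(points: np.ndarray, sample_ratio: float,
                       segment_fn, seed: int = 0) -> np.ndarray:
    """Random-sampling strategy: segment a subset, then assign each remaining
    point the label of its nearest sampled neighbour."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    idx = rng.choice(n, size=max(1, int(sample_ratio * n)), replace=False)
    sub_labels = segment_fn(points[idx])          # instance labels on the subsample
    tree = cKDTree(points[idx])
    _, nearest = tree.query(points, k=1)          # nearest sampled point for every point
    return sub_labels[nearest]

# Toy "segmenter": cluster by x-coordinate sign, standing in for SGPN.
toy_segment = lambda pts: (pts[:, 0] > 0).astype(np.int64)
pts = np.random.default_rng(1).normal(size=(10000, 3))
labels = extrapolate_labels(pts, sample_ratio=0.1, segment_fn=toy_segment)
print(labels.shape, np.bincount(labels))
```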
https://arxiv.org/abs/2505.14583
This paper introduces ReservoirTTA, a novel plug-in framework designed for prolonged test-time adaptation (TTA) in scenarios where the test domain continuously shifts over time, including cases where domains recur or evolve gradually. At its core, ReservoirTTA maintains a reservoir of domain-specialized models -- an adaptive test-time model ensemble -- that detects new domains via online clustering over style features of incoming samples and routes each sample to the appropriate specialized model, thereby enabling domain-specific adaptation. This multi-model strategy overcomes key limitations of single-model adaptation, such as catastrophic forgetting, inter-domain interference, and error accumulation, ensuring robust and stable performance on sustained non-stationary test distributions. Our theoretical analysis reveals key components that bound parameter variance and prevent model collapse, while our plug-in TTA module mitigates catastrophic forgetting of previously encountered domains. Extensive experiments on the classification corruption benchmarks, including ImageNet-C and CIFAR-10/100-C, as well as the Cityscapes→ACDC semantic segmentation task, covering recurring and continuously evolving domain shifts, demonstrate that ReservoirTTA significantly improves adaptation accuracy and maintains stable performance across prolonged, recurring shifts, outperforming state-of-the-art methods.
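A compact sketch of the routing mechanism described above: incoming batches are summarized by style features (channel-wise means and standard deviations), assigned to the nearest domain centroid maintained online, and a new specialized model is spawned when no centroid is close enough. The distance threshold, centroid momentum, and capacity limit are assumptions for illustration.

```python
import numpy as np

class DomainReservoir:
    """Minimal sketch of reservoir routing: online clustering over style features,
    one model state per detected domain. Thresholds and momentum are assumptions."""
    def __init__(self, make_model, max_domains: int = 8, threshold: float = 2.0):
        self.make_model, self.max_domains, self.threshold = make_model, max_domains, threshold
        self.centroids, self.models = [], []

    @staticmethod
    def style_features(images: np.ndarray) -> np.ndarray:
        # images: (B, C, H, W) -> per-channel mean and std, averaged over the batch
        return np.concatenate([images.mean(axis=(0, 2, 3)), images.std(axis=(0, 2, 3))])

    def route(self, images: np.ndarray):
        z = self.style_features(images)
        if not self.centroids:
            return self._spawn(z)
        dists = [np.linalg.norm(z - c) for c in self.centroids]
        k = int(np.argmin(dists))
        if dists[k] > self.threshold and len(self.centroids) < self.max_domains:
            return self._spawn(z)
        self.centroids[k] = 0.9 * self.centroids[k] + 0.1 * z   # online centroid update
        return self.models[k]

    def _spawn(self, z):
        self.centroids.append(z)
        self.models.append(self.make_model())
        return self.models[-1]

# Toy usage: the "model" is just a dict standing in for a specialized network copy.
reservoir = DomainReservoir(make_model=lambda: {"adapted_steps": 0})
for t in range(5):
    batch = np.random.default_rng(t).normal(loc=3.0 * (t % 2), size=(4, 3, 32, 32))  # two recurring styles
    model = reservoir.route(batch)
    model["adapted_steps"] += 1              # stand-in for a TTA update on this domain's model
print("domains detected:", len(reservoir.models))
```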
https://arxiv.org/abs/2505.14511
Knowledge distillation (KD) is a valuable technique for compressing large deep learning models into smaller, edge-suitable networks. However, conventional KD frameworks rely on pre-trained high-capacity teacher networks, which introduce significant challenges such as increased memory/storage requirements, additional training costs, and ambiguity in selecting an appropriate teacher for a given student model. Although teacher-free distillation (self-distillation) has emerged as a promising alternative, many existing approaches still rely on architectural modifications or complex training procedures, which limit their generality and efficiency. To address these limitations, we propose a novel teacher-free distillation framework that operates with a single student network and no auxiliary components, architectural modifications, or additional learnable parameters. Our approach is built on a simple yet highly effective augmentation, called intra-class patch swap augmentation. This augmentation simulates a teacher-student dynamic within a single model by generating pairs of intra-class samples with varying confidence levels, and then applying instance-to-instance distillation to align their predictive distributions. Our method is conceptually simple, model-agnostic, and easy to implement, requiring only a single augmentation function. Extensive experiments across image classification, semantic segmentation, and object detection show that our method consistently outperforms both existing self-distillation baselines and conventional teacher-based KD approaches. These results suggest that the success of self-distillation could hinge on the design of the augmentation itself. Our code is available at this https URL.
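A hedged sketch of the augmentation and the instance-to-instance term: a random patch is swapped between two images assumed to share a class, and the two resulting views are pulled toward each other's predictive distribution with a temperature-scaled KL divergence. The patch size, the symmetric form of the loss, and the temperature are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def intra_class_patch_swap(x1: torch.Tensor, x2: torch.Tensor, patch: int = 8):
    """Swap one random patch between two same-class images of shape (C, H, W)."""
    _, H, W = x1.shape
    y = torch.randint(0, H - patch + 1, (1,)).item()
    x = torch.randint(0, W - patch + 1, (1,)).item()
    a, b = x1.clone(), x2.clone()
    a[:, y:y + patch, x:x + patch] = x2[:, y:y + patch, x:x + patch]
    b[:, y:y + patch, x:x + patch] = x1[:, y:y + patch, x:x + patch]
    return a, b

def swap_distillation_loss(model, x1, x2, temperature: float = 4.0):
    """Instance-to-instance distillation between the two swapped views."""
    a, b = intra_class_patch_swap(x1, x2)
    pa = model(a.unsqueeze(0)) / temperature
    pb = model(b.unsqueeze(0)) / temperature
    # Symmetric KL so each view is aligned with the other's predictive distribution.
    kl = F.kl_div(F.log_softmax(pa, dim=1), F.softmax(pb, dim=1), reduction="batchmean")
    kl += F.kl_div(F.log_softmax(pb, dim=1), F.softmax(pa, dim=1), reduction="batchmean")
    return 0.5 * kl

# Toy usage with a tiny classifier standing in for the single student network.
torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x1, x2 = torch.rand(3, 32, 32), torch.rand(3, 32, 32)   # assumed to share a class label
print(float(swap_distillation_loss(model, x1, x2)))
```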
https://arxiv.org/abs/2505.14124
Three-dimensional reconstruction of buildings, particularly at Level of Detail 1 (LOD1), plays a crucial role in applications such as urban planning, urban environmental studies, and the design of optimized transportation networks. This study assesses the potential of LiDAR data for accurate 3D building reconstruction at LOD1 and for extracting morphological features from these models. Four deep semantic segmentation models (U-Net, Attention U-Net, U-Net3+, and DeepLabV3+) were applied with transfer learning to extract building footprints from LiDAR data. The results showed that U-Net3+ and Attention U-Net outperformed the others, achieving IoU scores of 0.833 and 0.814, respectively. Several statistical measures, including the maximum, range, mode, median, and 90th percentile, were used to estimate building heights, resulting in the generation of 3D models at LOD1. As the main contribution of the research, the impact of segmentation accuracy on the quality of 3D building modeling and on the accuracy of morphological features such as building area and external wall surface area was investigated. The results showed that the accuracy of building identification (segmentation performance) significantly affects the 3D model quality and the estimation of morphological features, depending on the height calculation method. Overall, the U-Net3+ method, combined with the 90th percentile and median measures, leads to accurate building height estimation and reliable extraction of morphological features.
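A short sketch of the height-estimation and LOD1 feature step: the LiDAR returns inside one predicted footprint are reduced to the statistics listed above, and a prismatic LOD1 model then yields the building area and external wall area. The ground-height handling and the flat-roof prism assumption are simplifications for illustration.

```python
import numpy as np

def height_statistics(z: np.ndarray, ground: float) -> dict:
    """Candidate building heights from the LiDAR points inside one footprint."""
    rel = z - ground
    values, counts = np.unique(np.round(rel, 1), return_counts=True)
    return {
        "max": rel.max(),
        "range": rel.max() - rel.min(),
        "mode": values[counts.argmax()],
        "median": np.median(rel),
        "p90": np.percentile(rel, 90),
    }

def lod1_features(footprint_area: float, perimeter: float, height: float) -> dict:
    """LOD1 block-model features for one building (prism with a flat roof)."""
    return {
        "building_area": footprint_area,
        "external_wall_area": perimeter * height,
        "volume": footprint_area * height,
    }

# Toy example: roof points about 12 m above ground for a 20 m x 10 m footprint.
z = 100.0 + 12.0 + np.random.default_rng(0).normal(0, 0.3, size=500)
stats = height_statistics(z, ground=100.0)
print({k: round(float(v), 2) for k, v in stats.items()})
print(lod1_features(footprint_area=200.0, perimeter=60.0, height=stats["p90"]))
```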
https://arxiv.org/abs/2505.14747
We introduce Land-MoE, a novel approach for multispectral land cover classification (MLCC). Spectral shift, which emerges from disparities in sensors and geospatial conditions, poses a significant challenge in this domain. Existing methods predominantly rely on domain adaptation and generalization strategies, often utilizing small-scale models that exhibit limited performance. In contrast, Land-MoE addresses these issues by hierarchically inserting a Frequency-aware Mixture of Low-rank Token Experts, to fine-tune Vision Foundation Models (VFMs) in a parameter-efficient manner. Specifically, Land-MoE comprises two key modules: the mixture of low-rank token experts (MoLTE) and frequency-aware filters (FAF). MoLTE leverages rank-differentiated tokens to generate diverse feature adjustments for individual instances within multispectral images. By dynamically combining learnable low-rank token experts of varying ranks, it enhances the robustness against spectral shifts. Meanwhile, FAF conducts frequency-domain modulation on the refined features. This process enables the model to effectively capture frequency band information that is strongly correlated with semantic essence, while simultaneously suppressing frequency noise irrelevant to the task. Comprehensive experiments on MLCC tasks involving cross-sensor and cross-geospatial setups demonstrate that Land-MoE outperforms existing methods by a large margin. Additionally, the proposed approach has also achieved state-of-the-art performance in domain generalization semantic segmentation tasks of RGB remote sensing images.
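A speculative sketch of the mixture-of-low-rank-token-experts idea: each expert is a low-rank adapter with a different rank, a lightweight router mixes their per-token outputs, and the result is added to the frozen backbone tokens as a residual adjustment. The ranks, router design, and residual formulation are assumptions; the frequency-aware filters are not shown.

```python
import torch
import torch.nn as nn

class MoLTESketch(nn.Module):
    """Hedged sketch of a mixture of low-rank token experts: rank-differentiated
    adapters whose outputs are mixed per token by a learned router."""
    def __init__(self, dim: int, ranks=(2, 4, 8)):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, r, bias=False), nn.Linear(r, dim, bias=False))
            for r in ranks)
        self.router = nn.Linear(dim, len(ranks))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:   # tokens: (B, N, dim)
        weights = torch.softmax(self.router(tokens), dim=-1)   # (B, N, E)
        expert_out = torch.stack([e(tokens) for e in self.experts], dim=-1)  # (B, N, dim, E)
        adjust = (expert_out * weights.unsqueeze(2)).sum(-1)    # per-token feature adjustment
        return tokens + adjust                                  # residual tuning of frozen VFM tokens

x = torch.randn(2, 196, 64)            # toy token sequence from a frozen backbone
print(MoLTESketch(64)(x).shape)
```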
https://arxiv.org/abs/2505.14088
Vision Mamba has recently emerged as a promising alternative to Transformer-based architectures, offering linear complexity in sequence length while maintaining strong modeling capacity. However, its adaptation to visual inputs is hindered by challenges in 2D-to-1D patch serialization and weak scalability across input resolutions. Existing serialization strategies such as raster scanning disrupt local spatial continuity and limit the model's ability to generalize across scales. In this paper, we propose FractalMamba++, a robust vision backbone that leverages fractal-based patch serialization via Hilbert curves to preserve spatial locality and enable seamless resolution adaptability. To address long-range dependency fading in high-resolution inputs, we further introduce a Cross-State Routing (CSR) mechanism that enhances global context propagation through selective state reuse. Additionally, we propose a Positional-Relation Capture (PRC) module to recover local adjacency disrupted by curve inflection points. Extensive experiments on image classification, semantic segmentation, object detection, and change detection demonstrate that FractalMamba++ consistently outperforms previous Mamba-based backbones, particularly under high-resolution settings.
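The locality-preserving serialization can be illustrated with the standard Hilbert-curve index: each patch position on an n x n grid (n a power of two) is mapped to its position along the curve, and tokens are emitted in that order so that neighbours in the 1D sequence remain spatially adjacent. This is a generic sketch of Hilbert ordering, not FractalMamba++'s implementation.

```python
import numpy as np

def hilbert_index(n: int, x: int, y: int) -> int:
    """Map grid cell (x, y) on an n x n grid (n a power of two) to its position
    along the Hilbert curve (standard xy-to-d conversion)."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

def serialize_patches(patches: np.ndarray) -> np.ndarray:
    """Reorder an (n, n, D) grid of patch tokens along the Hilbert curve so that
    neighbouring tokens in the 1D sequence stay spatially adjacent."""
    n = patches.shape[0]
    order = sorted(((hilbert_index(n, x, y), y, x)
                    for y in range(n) for x in range(n)))
    return np.stack([patches[y, x] for _, y, x in order])

tokens = np.arange(8 * 8 * 4).reshape(8, 8, 4).astype(np.float32)   # toy 8x8 patch grid
print(serialize_patches(tokens).shape)    # (64, 4), Hilbert-ordered token sequence
```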
https://arxiv.org/abs/2505.14062
Recent efforts have explored multimodal semantic segmentation using various backbone architectures. However, while most methods aim to improve accuracy, their computational efficiency remains underexplored. To address this, we propose EGFormer, an efficient multimodal semantic segmentation framework that flexibly integrates an arbitrary number of modalities while significantly reducing model parameters and inference time without sacrificing performance. Our framework introduces two novel modules. First, the Any-modal Scoring Module (ASM) assigns importance scores to each modality independently, enabling dynamic ranking based on their feature maps. Second, the Modal Dropping Module (MDM) filters out less informative modalities at each stage, selectively preserving and aggregating only the most valuable features. This design allows the model to leverage useful information from all available modalities while discarding redundancy, thus ensuring high segmentation quality. In addition to efficiency, we evaluate EGFormer on a synthetic-to-real transfer task to demonstrate its generalizability. Extensive experiments show that EGFormer achieves competitive performance with up to 88 percent reduction in parameters and 50 percent fewer GFLOPs. Under unsupervised domain adaptation settings, it further achieves state-of-the-art transfer performance compared to existing methods.
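A hedged sketch of the scoring-then-dropping idea: each modality's feature map receives a scalar importance score, only the top-scoring modalities are kept at a given stage, and the survivors are fused with a score-weighted sum. The scoring function, the number of kept modalities, and the aggregation rule are assumptions.

```python
import torch

def score_modalities(feats: dict) -> dict:
    """Any-modal scoring sketch: one scalar importance score per modality,
    here the mean activation magnitude of its (B, C, H, W) feature map."""
    return {name: f.abs().mean().item() for name, f in feats.items()}

def drop_and_aggregate(feats: dict, keep: int = 2) -> torch.Tensor:
    """Modal-dropping sketch: keep the `keep` highest-scoring modalities and
    fuse them with a score-weighted sum."""
    scores = score_modalities(feats)
    kept = sorted(scores, key=scores.get, reverse=True)[:keep]
    weights = torch.softmax(torch.tensor([scores[n] for n in kept]), dim=0)
    return sum(w * feats[n] for w, n in zip(weights, kept))

torch.manual_seed(0)
features = {                      # arbitrary modality names for illustration
    "rgb": torch.randn(1, 16, 32, 32),
    "depth": torch.randn(1, 16, 32, 32) * 0.1,    # weak signal, likely dropped
    "lidar": torch.randn(1, 16, 32, 32),
    "event": torch.randn(1, 16, 32, 32) * 0.05,   # weak signal, likely dropped
}
fused = drop_and_aggregate(features, keep=2)
print(score_modalities(features))
print(fused.shape)
```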
https://arxiv.org/abs/2505.14014
Supervised learning demands large amounts of precisely annotated data to achieve promising results. Such data curation is labor-intensive and imposes significant overhead regarding time and costs. Self-supervised learning (SSL) partially overcomes these limitations by exploiting vast amounts of unlabeled data and creating surrogate (pretext or proxy) tasks to learn useful representations without manual labeling. As a result, SSL has become a powerful machine learning (ML) paradigm for solving several practical downstream computer vision problems, such as classification, detection, and segmentation. Image segmentation is the cornerstone of many high-level visual perception applications, including medical imaging, intelligent transportation, agriculture, and surveillance. Although there is substantial research potential for developing advanced algorithms for SSL-based semantic segmentation, a comprehensive study of existing methodologies is essential to trace advances and guide emerging researchers. This survey thoroughly investigates over 150 recent image segmentation articles, particularly focusing on SSL. It provides a practical categorization of pretext tasks, downstream tasks, and commonly used benchmark datasets for image segmentation research. It concludes with key observations distilled from a large body of literature and offers future directions to make this research field more accessible and comprehensible for readers.
https://arxiv.org/abs/2505.13584
Semantic segmentation stands as a pivotal research focus in computer vision. In the context of industrial image inspection, conventional semantic segmentation models fail to maintain the segmentation consistency of fixed components across varying contextual environments due to a lack of perception of object contours. Given the real-time constraints and limited computing capability of industrial image detection machines, it is also necessary to create efficient models that reduce computational complexity. In this work, a Shape-Aware Efficient Network (SPENet) is proposed, which focuses on the shapes of objects and achieves excellent segmentation consistency by separately supervising the extraction of boundary and body information from images. In SPENet, a novel method for describing fuzzy boundaries, named the Variable Boundary Domain (VBD), is introduced to better adapt to real-world scenarios. Additionally, a new metric, the Consistency Mean Square Error (CMSE), is proposed to measure segmentation consistency for fixed components. Our approach attains the best segmentation accuracy and competitive speed on our dataset, showing significant advantages in CMSE among numerous state-of-the-art real-time segmentation networks and achieving a reduction of over 50% compared to the previously top-performing models.
https://arxiv.org/abs/2505.14718
Multi-modal semantic segmentation (MMSS) faces significant challenges in real-world scenarios due to dynamic environments, sensor failures, and noise interference, creating a gap between theoretical models and practical performance. To address this, we propose a two-stage framework called RobustSeg, which enhances multi-modal robustness through two key components: the Hybrid Prototype Distillation Module (HPDM) and the Representation Regularization Module (RRM). In the first stage, RobustSeg pre-trains a multi-modal teacher model using complete modalities. In the second stage, a student model is trained with random modality dropout while learning from the teacher via HPDM and RRM. HPDM transforms features into compact prototypes, enabling cross-modal hybrid knowledge distillation and mitigating bias from missing modalities. RRM reduces representation discrepancies between the teacher and student by optimizing functional entropy through the log-Sobolev inequality. Extensive experiments on three public benchmarks demonstrate that RobustSeg outperforms previous state-of-the-art methods, achieving improvements of +2.76%, +4.56%, and +0.98%, respectively. Code is available at: this https URL.
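A minimal sketch of the prototype-distillation idea behind HPDM: teacher and student features are pooled into per-class prototypes using the label map, and the student's prototypes are pulled toward the teacher's. The masked average pooling and the cosine-based loss are assumptions; the representation-regularization (RRM) term is not shown.

```python
import torch
import torch.nn.functional as F

def class_prototypes(feats: torch.Tensor, labels: torch.Tensor, n_classes: int) -> torch.Tensor:
    """Masked average pooling: (B, C, H, W) features + (B, H, W) labels -> (n_classes, C)."""
    B, C, H, W = feats.shape
    flat_f = feats.permute(0, 2, 3, 1).reshape(-1, C)     # (BHW, C)
    flat_l = labels.reshape(-1)                           # (BHW,)
    protos = torch.zeros(n_classes, C)
    for cls in range(n_classes):
        sel = flat_l == cls
        if sel.any():
            protos[cls] = flat_f[sel].mean(0)
    return protos

def prototype_distillation_loss(student_f, teacher_f, labels, n_classes=19):
    ps = class_prototypes(student_f, labels, n_classes)
    pt = class_prototypes(teacher_f, labels, n_classes)
    return 1.0 - F.cosine_similarity(ps, pt, dim=1).mean()   # pull student prototypes toward teacher's

torch.manual_seed(0)
labels = torch.randint(0, 19, (2, 32, 32))
student_feats = torch.randn(2, 64, 32, 32)
teacher_feats = torch.randn(2, 64, 32, 32)
print(float(prototype_distillation_loss(student_feats, teacher_feats, labels)))
```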
https://arxiv.org/abs/2505.12861