Large-scale models are typically adapted to meet the diverse requirements of model owners and users. However, maintaining multiple specialized versions of a model is inefficient. In response, we propose AIM, a novel model modulation paradigm that enables a single model to exhibit diverse behaviors to meet specific end requirements. AIM enables two key modulation modes: utility and focus modulation. The former gives model owners dynamic control over output quality to deliver varying utility levels, while the latter offers users precise control to shift the model's focus across input features. AIM introduces a logits redistribution strategy that operates in a training-data-agnostic and retraining-free manner. We establish a formal foundation for AIM's regulation capability based on the statistical properties of logits ordering via joint probability distributions. Our evaluation confirms AIM's practicality and versatility for AI model modulation, with tasks spanning image classification, semantic segmentation, and text generation, and prevalent architectures including ResNet, SegFormer, and Llama.
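The abstract does not spell out the redistribution rule, but the idea of retraining-free utility modulation can be sketched as a post-hoc transform on a model's logits. In this hypothetical Python illustration (a stand-in for AIM's actual strategy, not the paper's method), an `alpha` knob interpolates the logits toward a uniform vector, shrinking the margins that downstream decoding relies on while leaving the class ranking intact:

```python
import numpy as np

def modulate_utility(logits, alpha):
    """Hypothetical utility modulation: interpolate the logits toward
    a uniform vector. alpha=1.0 preserves the original logits (full
    utility); smaller alpha shrinks the margins between classes,
    degrading output quality without retraining. This is a stand-in
    for AIM's logits redistribution, whose exact rule the abstract
    does not give."""
    logits = np.asarray(logits, dtype=float)
    uniform = np.full_like(logits, logits.mean())
    return alpha * logits + (1.0 - alpha) * uniform

logits = np.array([2.0, 0.5, -1.0])
full = modulate_utility(logits, 1.0)      # margins unchanged
degraded = modulate_utility(logits, 0.2)  # margins shrink toward uniform
```

With `alpha = 1.0` the outputs are untouched; smaller values deliver a controllably lower utility level without modifying weights or touching training data.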
https://arxiv.org/abs/2603.12755
Synthetic Aperture Radar (SAR) enables global, all-weather earth observation. However, owing to diverse imaging mechanisms, domain shifts across sensors and regions severely hinder its semantic generalization. To address this, we present CrossEarth-SAR, the first billion-scale SAR vision foundation model, built upon a novel physics-guided sparse mixture-of-experts (MoE) architecture incorporating physical descriptors and explicitly designed for cross-domain semantic segmentation. To facilitate large-scale pre-training, we develop CrossEarth-SAR-200K, a weakly and fully supervised dataset that unifies public and private SAR imagery. We also introduce a benchmark suite comprising 22 sub-benchmarks across 8 distinct domain gaps, establishing the first unified standard for domain-generalization semantic segmentation on SAR imagery. Extensive experiments demonstrate that CrossEarth-SAR achieves state-of-the-art results on 20 benchmarks, surpassing previous methods by over 10% mIoU on some benchmarks under multi-gap transfer. All code, benchmarks, and datasets will be publicly available.
https://arxiv.org/abs/2603.12008
Self-supervised visual pre-training methods face an inherent tension: contrastive learning (CL) captures global semantics but loses fine-grained detail, while masked image modeling (MIM) preserves local textures but suffers from "attention drift" due to semantically-agnostic random masking. We propose C2FMAE, a coarse-to-fine masked autoencoder that resolves this tension by explicitly learning hierarchical visual representations across three data granularities: semantic masks (scene-level), instance masks (object-level), and RGB images (pixel-level). Two synergistic innovations enforce a strict top-down learning principle. First, a cascaded decoder sequentially reconstructs from scene semantics to object instances to pixel details, establishing explicit cross-granularity dependencies that parallel decoders cannot capture. Second, a progressive masking curriculum dynamically shifts the training focus from semantic-guided to instance-guided and finally to random masking, creating a structured learning path from global context to local features. To support this framework, we construct a large-scale multi-granular dataset with high-quality pseudo-labels for all 1.28M ImageNet-1K images. Extensive experiments show that C2FMAE achieves significant performance gains on image classification, object detection, and semantic segmentation, validating the effectiveness of our hierarchical design in learning more robust and generalizable representations.
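The progressive masking curriculum lends itself to a small sketch. Assuming linear ramps (the abstract does not give the exact schedule), the mixture weights over the three masking modes could evolve with training progress like this:

```python
def masking_weights(progress):
    """Hypothetical coarse-to-fine curriculum: early training favors
    semantic-guided masking, mid training shifts to instance-guided
    masking, and late training approaches pure random masking.
    `progress` runs from 0.0 (start) to 1.0 (end); linear ramps are
    an assumption for illustration."""
    assert 0.0 <= progress <= 1.0
    semantic = max(0.0, 1.0 - 2.0 * progress)   # fades out by mid-training
    random_ = max(0.0, 2.0 * progress - 1.0)    # fades in after mid-training
    instance = 1.0 - semantic - random_         # peaks at mid-training
    return semantic, instance, random_
```

At each step, the chosen weights would decide which masking strategy supplies the tokens to hide, producing the global-to-local learning path the abstract describes.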
https://arxiv.org/abs/2603.09955
Deep learning models benefit from increasing data diversity and volume, motivating synthetic data augmentation to improve existing datasets. However, existing evaluation metrics for synthetic data typically calculate latent feature similarity, which is difficult to interpret and does not always correlate with the contribution to downstream tasks. We propose a vision-language grounded framework for interpretable synthetic data augmentation and evaluation in remote sensing. Our approach combines generative models, semantic segmentation and image captioning with vision and language models. Based on this framework, we introduce ARAS400k: A large-scale Remote sensing dataset Augmented with Synthetic data for segmentation and captioning, containing 100k real images and 300k synthetic images, each paired with segmentation maps and descriptions. ARAS400k enables the automated evaluation of synthetic data by analyzing semantic composition, minimizing caption redundancy, and verifying cross-modal consistency between visual structures and language descriptions. Experimental results indicate that while models trained exclusively on synthetic data reach competitive performance levels, those trained with augmented data (a combination of real and synthetic images) consistently outperform real-data baselines. Consequently, this work establishes a scalable benchmark for remote sensing tasks, specifically in semantic segmentation and image captioning. The dataset is available at this http URL and the code base at this http URL.
https://arxiv.org/abs/2603.09625
Autonomous space operations such as on-orbit servicing and active debris removal demand robust part-level semantic understanding and precise relative navigation of target spacecraft, yet collecting large-scale real data in orbit remains impractical due to cost and access constraints. Existing synthetic datasets, moreover, suffer from limited target diversity, single-modality sensing, and incomplete ground-truth annotations. We present SpaceSense-Bench, a large-scale multi-modal benchmark for spacecraft perception encompassing 136 satellite models with approximately 70 GB of data. Each frame provides time-synchronized 1024×1024 RGB images, millimeter-precision depth maps, and 256-beam LiDAR point clouds, together with dense 7-class part-level semantic labels at both the pixel and point level as well as accurate 6-DoF pose ground truth. The dataset is generated through a high-fidelity space simulation built in Unreal Engine 5 and a fully automated pipeline covering data acquisition, multi-stage quality control, and conversion to mainstream formats. We benchmark five representative tasks (object detection, 2D semantic segmentation, RGB-LiDAR fusion-based 3D point cloud segmentation, monocular depth estimation, and orientation estimation) and identify two key findings: (i) perceiving small-scale components (e.g., thrusters and omni-antennas) and generalizing to entirely unseen spacecraft in a zero-shot setting remain critical bottlenecks for current methods, and (ii) scaling up the number of training satellites yields substantial performance gains on novel targets, underscoring the value of large-scale, diverse datasets for space perception research. The dataset, code, and toolkit are publicly available at this https URL.
https://arxiv.org/abs/2603.09320
RGB-Thermal (RGB-T) semantic segmentation is essential for robotic systems operating in low-light or dark environments. However, traditional approaches often overemphasize modality balance, resulting in limited robustness and severe performance degradation when sensor signals are partially missing. Recent advances such as cross-modal knowledge distillation and modality-adaptive fine-tuning attempt to enhance cross-modal interaction, but they typically decouple modality fusion and modality adaptation, requiring multi-stage training with frozen models or teacher-student frameworks. We present RTFDNet, a three-branch encoder-decoder that unifies fusion and decoupling for robust RGB-T segmentation. Synergistic Feature Fusion (SFF) performs channel-wise gated exchange and lightweight spatial attention to inject complementary cues. Cross-Modal Decouple Regularization (CMDR) isolates modality-specific components from the fused representation and supervises unimodal decoders via stop-gradient targets. Region Decouple Regularization (RDR) enforces class-selective prediction consistency in confident regions while blocking gradients to the fusion branch. This feedback loop strengthens unimodal paths without degrading the fused stream, enabling efficient standalone inference at test time. Extensive experiments demonstrate the effectiveness of RTFDNet, showing consistent performance across varying modality conditions. Our source code is publicly available at this https URL to facilitate further research.
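As a rough illustration of what a channel-wise gated exchange between the RGB and thermal streams might look like (the gate here is a free parameter vector rather than SFF's network-predicted gate), consider:

```python
import numpy as np

def gated_channel_exchange(rgb_feat, thermal_feat, gate_logits):
    """Illustrative channel-wise gated exchange: a per-channel sigmoid
    gate decides how much of each modality's channel is kept versus
    replaced by the complementary modality. Here the gate is a plain
    parameter vector; in SFF it would be predicted by the network.
    Features have shape (C, H, W); gate_logits has shape (C,)."""
    gate = 1.0 / (1.0 + np.exp(-gate_logits))[:, None, None]   # (C, 1, 1)
    rgb_out = gate * rgb_feat + (1.0 - gate) * thermal_feat
    thermal_out = gate * thermal_feat + (1.0 - gate) * rgb_feat
    return rgb_out, thermal_out
```

A saturated gate keeps a channel purely unimodal, while intermediate values blend in the complementary cue, which is the behavior the SFF description suggests.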
https://arxiv.org/abs/2603.09149
Rotation equivariance constitutes one of the most general and crucial structural priors for visual data, yet it remains notably absent from current Mamba-based vision architectures. Despite the success of Mamba in natural language processing and its growing adoption in computer vision, existing visual Mamba models fail to account for rotational symmetry in their design. This omission renders them inherently sensitive to image rotations, thereby constraining their robustness and cross-task generalization. To address this limitation, we propose to incorporate rotation symmetry, a universal and fundamental geometric prior in images, into Mamba-based architectures. Specifically, we introduce EQ-VMamba, the first rotation-equivariant visual Mamba architecture for vision tasks. The core components of EQ-VMamba include a carefully designed rotation-equivariant cross-scan strategy and group Mamba blocks. Moreover, we provide a rigorous theoretical analysis of the intrinsic equivariance error, demonstrating that the proposed architecture enforces end-to-end rotation equivariance throughout the network. Extensive experiments across multiple benchmarks, including high-level image classification, mid-level semantic segmentation, and low-level image super-resolution, demonstrate that EQ-VMamba achieves superior or competitive performance compared to non-equivariant baselines, while requiring approximately 50% fewer parameters. These results indicate that embedding rotation equivariance not only effectively bolsters the robustness of visual Mamba models against rotation transformations, but also enhances overall performance with significantly improved parameter efficiency. Code is available at this https URL.
https://arxiv.org/abs/2603.09138
Background and objectives: Colorectal cancer histopathological grading depends on accurate segmentation of glandular structures. Current deep learning approaches rely on large-scale pixel-level annotations that are labor-intensive and difficult to obtain in routine clinical practice. Weakly supervised semantic segmentation offers a promising alternative. However, class activation map-based methods often produce incomplete pseudo-masks that emphasize highly discriminative regions and fail to supervise unannotated glandular structures. We propose a weakly supervised teacher-student framework that leverages sparse pathologist annotations and an Exponential Moving Average (EMA)-stabilized teacher network to generate refined pseudo-masks. Methods: The framework integrates confidence-based filtering, adaptive fusion of teacher predictions with limited ground truth, and curriculum-guided refinement to progressively segment unannotated glandular regions. The method was evaluated on an institutional colorectal cancer cohort from The Ohio State University Wexner Medical Center consisting of 60 hematoxylin and eosin stained whole-slide images and on public datasets including the Gland Segmentation dataset, TCGA-COAD, TCGA-READ, and SPIDER. Results: On the Gland Segmentation dataset the framework achieved a mean Intersection over Union of 80.10 and a mean Dice coefficient of 89.10. Cross-cohort evaluation demonstrated robust generalization on TCGA-COAD and TCGA-READ without additional annotations, while reduced performance on SPIDER reflected domain shift. Conclusions: The proposed framework provides an annotation-efficient and generalizable approach for gland segmentation in colorectal histopathology.
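The EMA-stabilized teacher and confidence-based filtering are standard enough to sketch. This hypothetical snippet shows the two pieces: the teacher's weights track the student via an exponential moving average, and only high-confidence teacher predictions survive into the pseudo-mask (the threshold, decay, and ignore-label values are assumptions):

```python
import numpy as np

def ema_update(teacher_w, student_w, decay=0.99):
    """EMA-stabilized teacher: the teacher's weights track the student
    as an exponential moving average (decay value is an assumption)."""
    return decay * teacher_w + (1.0 - decay) * student_w

def confidence_filtered_mask(teacher_probs, threshold=0.9, ignore_index=255):
    """Confidence-based filtering of teacher predictions.
    teacher_probs: per-class probabilities, shape (classes, H, W).
    Pixels whose top-class probability falls below the threshold get
    an ignore label and contribute no gradient to the student."""
    pseudo = teacher_probs.argmax(axis=0)
    confident = teacher_probs.max(axis=0) >= threshold
    return np.where(confident, pseudo, ignore_index)
```

In a curriculum-guided setup like the one described, the threshold would typically be relaxed over training so the pseudo-mask gradually covers the unannotated glandular regions.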
https://arxiv.org/abs/2603.08605
Camera-based 3D Semantic Scene Completion (SSC) is a critical task for autonomous driving and robotic scene understanding. It aims to infer a complete 3D volumetric representation of both semantics and geometry from a single image. Existing methods typically focus on end-to-end 2D-to-3D feature lifting and voxel completion. However, they often overlook the interference between high-confidence visible-region perception and low-confidence occluded-region reasoning caused by single-image input, which can lead to feature dilution and error propagation. To address these challenges, we introduce an offline Visible Region Label Extraction (VRLE) strategy that explicitly separates and extracts voxel-level supervision for visible regions from dense 3D ground truth. This strategy purifies the supervisory space for two complementary sub-tasks: visible-region perception and occluded-region reasoning. Building on this idea, we propose the Visible-Occluded Interactive Completion Network (VOIC), a novel dual-decoder framework that explicitly decouples SSC into visible-region semantic perception and occluded-region scene completion. VOIC first constructs a base 3D voxel representation by fusing image features with depth-derived occupancy. The visible decoder focuses on generating high-fidelity geometric and semantic priors, while the occlusion decoder leverages these priors together with cross-modal interaction to perform coherent global scene reasoning. Extensive experiments on the SemanticKITTI and SSCBench-KITTI360 benchmarks demonstrate that VOIC outperforms existing monocular SSC methods in both geometric completion and semantic segmentation accuracy, achieving state-of-the-art performance.
https://arxiv.org/abs/2512.18954
Scene understanding plays a critical role in enabling intelligence and autonomy in robotic systems. Traditional approaches often face challenges, including occlusions, ambiguous boundaries, and the inability to adapt attention based on task-specific requirements and sample variations. To address these limitations, this paper presents an efficient RGB-D scene understanding model that performs a range of tasks, including semantic segmentation, instance segmentation, orientation estimation, panoptic segmentation, and scene classification. The proposed model incorporates an enhanced fusion encoder, which effectively leverages redundant information from both RGB and depth inputs. For semantic segmentation, we introduce normalized focus channel layers and a context feature interaction layer, designed to mitigate issues such as shallow feature misguidance and insufficient local-global feature representation. The instance segmentation task benefits from a non-bottleneck 1D structure, which achieves superior contour representation with fewer parameters. Additionally, we propose a multi-task adaptive loss function that dynamically adjusts the learning strategy for different tasks based on scene variations. Extensive experiments on the NYUv2, SUN RGB-D, and Cityscapes datasets demonstrate that our approach outperforms existing methods in both segmentation accuracy and processing speed.
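The abstract does not define its multi-task adaptive loss, so as a stand-in, here is one widely used way to adapt per-task weights, homoscedastic-uncertainty weighting, where each task's loss is scaled by a learned precision term:

```python
import math

def adaptive_multitask_loss(task_losses, log_vars):
    """Homoscedastic-uncertainty weighting (a common stand-in, not
    necessarily the paper's scheme): each task loss is scaled by a
    learned precision exp(-s), and s itself is added as a regularizer
    so the weights cannot collapse to zero."""
    total = 0.0
    for loss, s in zip(task_losses, log_vars):
        total += math.exp(-s) * loss + s
    return total
```

Letting the `log_vars` be trainable lets the optimizer down-weight noisy or hard tasks per batch, which is one plausible reading of "dynamically adjusts the learning strategy for different tasks based on scene variations."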
https://arxiv.org/abs/2603.07570
Pretraining and fine-tuning have emerged as a new paradigm in remote sensing image interpretation. Among them, Masked Autoencoder (MAE)-based pretraining stands out for its strong capability to learn general feature representations via reconstructing masked image regions. However, applying MAE to multispectral remote sensing images remains challenging due to complex backgrounds, indistinct targets, and the lack of semantic guidance during masking, which hinders the learning of underlying structures and meaningful spatial-spectral features. To address this, we propose a simple yet effective approach, Spectral Index-Guided MAE (SIGMAE), for multispectral image pretraining. The core idea is to incorporate domain-specific spectral indices as prior knowledge to guide dynamic token masking toward informative regions. SIGMAE introduces Semantic Saliency-Guided Dynamic Token Masking (SSDTM), a curriculum-style strategy that quantifies each patch's semantic richness and internal heterogeneity to adaptively select the most informative tokens during training. By prioritizing semantically salient regions and progressively increasing sample difficulty, SSDTM enhances spectrally rich and structurally aware representation learning, mitigates overfitting, and reduces redundant computation compared with random masking. Extensive experiments on five widely used datasets covering various downstream tasks, including scene classification, semantic segmentation, object extraction and change detection, demonstrate that SIGMAE outperforms other pretrained geospatial foundation models. Moreover, it exhibits strong spatial-spectral reconstruction capability, even with a 90% mask ratio, and improves complex target recognition under limited labeled data. The source codes and model weights will be released at this https URL.
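A minimal sketch of saliency-guided dynamic token masking, under the assumption (not stated in the abstract) that the guided score and uniform random masking are blended by a curriculum coefficient:

```python
import numpy as np

def select_masked_tokens(saliency, mask_ratio, progress, seed=0):
    """Hypothetical saliency-guided dynamic token masking: rank patches
    by a semantic-saliency score (e.g., derived from spectral indices),
    bias the mask toward the most informative tokens early on, and
    anneal toward uniform random masking as `progress` goes from 0 to 1.
    The linear blend rule is an assumption for illustration."""
    n = saliency.size
    n_mask = int(round(mask_ratio * n))
    rng = np.random.default_rng(seed)
    score = (1.0 - progress) * saliency + progress * rng.random(n)
    return np.argsort(score)[::-1][:n_mask]  # indices of tokens to mask
```

Early in training this hides the semantically richest patches, forcing the encoder to reconstruct informative structure; late in training it approaches the plain random masking of a standard MAE.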
https://arxiv.org/abs/2603.07463
We present Rewis3d, a framework that leverages recent advances in feed-forward 3D reconstruction to significantly improve weakly supervised semantic segmentation on 2D images. Obtaining dense, pixel-level annotations remains a costly bottleneck for training segmentation models. Sparse annotations alleviate this issue and offer an efficient weakly supervised alternative, but they still incur a performance gap. To address this, we introduce a novel approach that leverages 3D scene reconstruction as an auxiliary supervisory signal. Our key insight is that the 3D geometric structure recovered from 2D videos provides strong cues that can propagate sparse annotations across entire scenes. Specifically, a dual student-teacher architecture enforces semantic consistency between 2D images and reconstructed 3D point clouds, using state-of-the-art feed-forward reconstruction to generate reliable geometric supervision. Extensive experiments demonstrate that Rewis3d achieves state-of-the-art performance under sparse supervision, outperforming existing approaches by 2-7% without requiring additional labels or inference overhead.
https://arxiv.org/abs/2603.06374
Current semantic segmentation approaches for point cloud scenes heavily rely on manual labeling, while research on unsupervised semantic segmentation methods specifically for raw point clouds is still in its early stages. Unsupervised point cloud learning poses significant challenges due to the absence of annotation information and the lack of pre-training, making the development of effective strategies crucial. In this paper, we propose a novel prototype-library-driven unsupervised point cloud semantic segmentation strategy that utilizes Structure Learning and Consistent Reasoning (P-SLCR). First, we propose Consistent Structure Learning, which establishes structural feature learning between consistent points and the library of consistent prototypes by selecting high-quality features. Second, we propose Semantic Relation Consistent Reasoning, which constructs a prototype inter-relation matrix between the consistent and ambiguous prototype libraries. This process preserves semantic consistency by imposing constraints on both libraries through the prototype inter-relation matrix. Finally, our method was extensively evaluated on the S3DIS, SemanticKITTI, and ScanNet datasets, achieving the best performance among unsupervised methods. Specifically, it reaches 47.1% mIoU on Area 5 of the S3DIS dataset, surpassing the classical fully supervised method PointNet by 2.5%.
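The prototype inter-relation matrix can be illustrated with cosine similarity between the two libraries (the paper's actual affinity measure may differ):

```python
import numpy as np

def prototype_relation_matrix(protos_a, protos_b, eps=1e-8):
    """Sketch of an inter-relation matrix between two prototype
    libraries (e.g., consistent vs. ambiguous): pairwise cosine
    similarity of L2-normalized prototype vectors. Cosine similarity
    is an assumption for illustration."""
    a = protos_a / (np.linalg.norm(protos_a, axis=1, keepdims=True) + eps)
    b = protos_b / (np.linalg.norm(protos_b, axis=1, keepdims=True) + eps)
    return a @ b.T  # shape: (num_a, num_b)
```

A consistency constraint in the spirit of the abstract would then penalize changes to this matrix, so that semantic relations between the consistent and ambiguous libraries are preserved as training proceeds.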
https://arxiv.org/abs/2603.06321
As powerful generative models, text-to-image diffusion models have recently been explored for discriminative tasks. A line of research focuses on adapting a pre-trained diffusion model to semantic segmentation without any further training, leading to what are known as training-free diffusion segmentors. These methods typically rely on cross-attention maps from the model's attention layers, which are assumed to capture semantic relationships between image pixels and text tokens. Ideally, such approaches should benefit from more powerful diffusion models, i.e., stronger generative capability should lead to better segmentation. However, we observe that existing methods often fail to scale accordingly. To understand this issue, we identify two underlying gaps: (i) cross-attention is computed across multiple heads and layers, but there exists a discrepancy between these individual attention maps and a unified global representation; (ii) even when a global map is available, it does not directly translate to accurate semantic correlation for segmentation, due to score imbalances among different text tokens. To bridge these gaps, we propose two techniques, auto aggregation and per-pixel rescaling, which together enable training-free segmentation to better leverage generative capability. We evaluate our approach on standard semantic segmentation benchmarks and further integrate it into a generative technique, demonstrating both improved performance and broad applicability. Code is available at this https URL.
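To make gaps (i) and (ii) concrete, here is a deliberately simplified stand-in: attention maps are averaged over layers and heads into one global map, then each pixel's scores are renormalized across text tokens so token-level score imbalance cannot dominate everywhere. The paper's auto aggregation and per-pixel rescaling are more refined operators than these fixed ones:

```python
import numpy as np

def aggregate_and_rescale(attn, eps=1e-8):
    """Simplified stand-in for the two proposed fixes.
    attn: cross-attention scores, shape (layers, heads, pixels, tokens).
    Step 1 collapses heads and layers into a single global map;
    step 2 renormalizes each pixel's scores across text tokens so
    that globally over-scored tokens cannot win at every pixel."""
    global_map = attn.mean(axis=(0, 1))  # (pixels, tokens)
    return global_map / (global_map.sum(axis=1, keepdims=True) + eps)
```

Per-pixel segmentation would then assign each pixel the argmax token of the rescaled map.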
https://arxiv.org/abs/2603.06178
Semantic segmentation across visual modalities such as 3D point clouds and panoramic images remains a challenging task, primarily due to the scarcity of annotated data and the limited adaptability of fixed-label models. In this paper, we present JOPP-3D, an open-vocabulary semantic segmentation framework that jointly leverages panoramic and point cloud data to enable language-driven scene understanding. We convert RGB-D panoramic images into their corresponding tangential perspective images and 3D point clouds, then use these modalities to extract and align foundational vision-language features. This allows natural language querying to generate semantic masks on both input modalities. Experimental evaluation on the Stanford-2D-3D-s and ToF-360 datasets demonstrates the capability of JOPP-3D to produce coherent and semantically meaningful segmentations across panoramic and 3D domains. Our proposed method achieves a significant improvement compared to the SOTA in open and closed vocabulary 2D and 3D semantic segmentation.
https://arxiv.org/abs/2603.06168
Historical map collections are highly diverse in style, scale, and geographic focus, often consisting of many single-sheet documents. Yet most work in map recognition focuses on specialist models tailored to homogeneous map series. In contrast, this article aims to develop generalizable semantic segmentation models and an accompanying ontology. First, we introduce Semap, a new open benchmark dataset comprising 1,439 manually annotated patches designed to reflect the variety of historical map documents. Second, we present a segmentation framework that combines procedural data synthesis with multiscale integration to improve robustness and transferability. This framework achieves state-of-the-art performance on both the HCMSSD and Semap datasets, showing that a diversity-driven approach to map recognition is not only viable but also beneficial. The results indicate that segmentation performance remains largely stable across map collections, scales, geographic regions, and publication contexts. By proposing benchmark datasets and methods for the generic segmentation of historical maps, this work opens the way to integrating the long tail of cartographic archives into historical geographic studies.
https://arxiv.org/abs/2603.05037
Distribution shifts between training and testing data are a critical bottleneck limiting the practical utility of models, especially in real-world test-time scenarios. To adapt models when the source domain is unknown and the target domain is unlabeled, previous works constructed pseudo-source domains via data generation and translation, then aligned the target domain with them. However, significant discrepancies exist between the pseudo-source and the original source domain, leading to potential divergence when correcting the target directly. From this perspective, we propose a Stepwise Semantic Alignment (SSA) method, viewing the pseudo-source as a semantic bridge connecting the source and target, rather than a direct substitute for the source. Specifically, we leverage easily accessible universal semantics to rectify the semantic features of the pseudo-source, and then align the target domain using the corrected pseudo-source semantics. Additionally, we introduce a Hierarchical Feature Aggregation (HFA) module and a Confidence-Aware Complementary Learning (CACL) strategy to enhance the semantic quality of the SSA process in the absence of source data and target-domain ground truth. We evaluated our approach on tasks such as semantic segmentation and image classification, achieving a 5.2% performance boost on GTA2Cityscapes over the state of the art.
https://arxiv.org/abs/2603.03844
Accurate sea ice mapping is essential for safe maritime navigation in polar regions, where rapidly changing ice conditions require timely and reliable information. While Sentinel-1 Synthetic Aperture Radar (SAR) provides high-resolution, all-weather observations of sea ice, conventional ground-based processing is limited by downlink bandwidth, latency, and energy costs associated with transmitting large volumes of raw data. On-board processing, enabled by dedicated inference chips integrated directly within the satellite payload, offers a transformative alternative by generating actionable sea ice products in orbit. In this context, we present TinyIceNet, a compact semantic segmentation network co-designed for on-board Stage of Development (SOD) mapping from dual-polarized Sentinel-1 SAR imagery under strict hardware and power constraints. Trained on the AI4Arctic dataset, TinyIceNet combines SAR-aware architectural simplifications with low-precision quantization to balance accuracy and efficiency. The model is synthesized using High-Level Synthesis and deployed on a Xilinx Zynq UltraScale+ FPGA platform, demonstrating near-real-time inference with significantly reduced energy consumption. Experimental results show that TinyIceNet achieves 75.216% F1 score on SOD segmentation while reducing energy consumption by 2x compared to full-precision GPU baselines, underscoring the potential of chip-level hardware-algorithm co-design for future spaceborne and edge AI systems.
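The low-precision quantization that TinyIceNet relies on can be illustrated with generic symmetric per-tensor int8 weight quantization. This is a textbook sketch under our own assumptions, not the paper's FPGA/HLS pipeline; the function names are illustrative.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: w ≈ scale * q, q ∈ [-127, 127]."""
    scale = np.abs(weights).max() / 127.0
    if scale == 0:
        return np.zeros(weights.shape, dtype=np.int8), 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from quantized weights."""
    return q.astype(np.float32) * scale
```

With round-to-nearest, the per-weight reconstruction error is bounded by half a quantization step (`scale / 2`), which is the trade-off such on-board deployments accept in exchange for smaller memory footprints and cheaper integer arithmetic.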
https://arxiv.org/abs/2603.03075
Video Diffusion Transformers (DiTs) can synthesize high-fidelity video from text descriptions involving motion. However, our understanding of how Video DiTs convert motion words into video remains limited. Furthermore, while prior studies on interpretable saliency maps primarily target objects, motion-related behavior in Video DiTs remains largely unexplored. In this paper, we investigate concrete motion features that specify when and which object moves for a given motion concept. First, for spatial localization, we introduce GramCol, which adaptively produces per-frame saliency maps for any text concept, whether motion-related or not. Second, we propose a motion-feature selection algorithm that yields an Interpretable Motion-Attentive Map (IMAP), localizing motion both spatially and temporally. Our method discovers concept saliency maps without any gradient computation or parameter update. Experimentally, it shows strong performance on motion localization and zero-shot video semantic segmentation, providing clearer, more interpretable saliency maps for both motion and non-motion concepts.
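The abstract does not spell out GramCol's mechanism, so the following is only a generic, assumed sketch of gradient-free per-frame concept saliency: extract cross-attention weights for a concept token, average over heads, and min-max normalize per frame. The function name and tensor layout are our own assumptions and may differ from the actual method.

```python
import numpy as np

def per_frame_saliency(attn, token_idx):
    """Gradient-free per-frame saliency from cross-attention weights.

    attn: (frames, heads, H*W, tokens) attention weights from a video model.
    Returns a (frames, H*W) map, min-max normalized independently per frame
    so each frame's saliency adapts to its own attention range.
    """
    maps = attn[..., token_idx].mean(axis=1)          # average over heads
    lo = maps.min(axis=1, keepdims=True)
    hi = maps.max(axis=1, keepdims=True)
    return (maps - lo) / np.maximum(hi - lo, 1e-8)    # per-frame normalization
```

Because only stored attention weights are read, no gradient computation or parameter update is needed, matching the property the abstract highlights.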
https://arxiv.org/abs/2603.02919
Unifying visual representation learning and text-to-image (T2I) generation within a single model remains a central challenge in multimodal learning. We introduce DREAM, a unified framework that jointly optimizes discriminative and generative objectives while learning strong visual representations. DREAM is built on two key techniques. During training, Masking Warmup, a progressive masking schedule, begins with minimal masking to establish the contrastive alignment necessary for representation learning, then gradually transitions to full masking for stable generative training. At inference, DREAM employs Semantically Aligned Decoding to align partially masked image candidates with the target text and select the best candidate for further decoding, improving text-image fidelity (+6.3%) without external rerankers. Trained solely on CC12M, DREAM achieves 72.7% ImageNet linear-probing accuracy (+1.1% over CLIP) and an FID of 4.25 (+6.2% over FLUID), with consistent gains in few-shot classification, semantic segmentation, and depth estimation. These results demonstrate that discriminative and generative objectives can be synergistic, allowing unified multimodal models that excel at both visual understanding and generation.
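A progressive masking schedule of the Masking Warmup kind can be sketched as a linear ramp of the masking ratio over a warmup window. The specific start ratio, end ratio, and warmup fraction below are illustrative assumptions, not values from the paper.

```python
def masking_ratio(step, total_steps, warmup_frac=0.2, start=0.05, end=1.0):
    """Progressive masking schedule: near-zero masking early (so contrastive
    alignment can form), ramping linearly to full masking for stable
    generative training, then held constant."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    t = min(step / warmup_steps, 1.0)   # progress through the warmup window
    return start + t * (end - start)
```

Any monotone ramp (cosine, stepwise) would fit the same description; linear is simply the easiest to read.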
https://arxiv.org/abs/2603.02667