Deep neural networks, especially transformer-based architectures, have achieved remarkable success in semantic segmentation for environmental perception. However, existing models process video frames independently, thus failing to leverage temporal consistency, which could significantly improve both accuracy and stability in dynamic scenes. In this work, we propose a Spatio-Temporal Attention (STA) mechanism that extends transformer attention blocks to incorporate multi-frame context, enabling robust temporal feature representations for video semantic segmentation. Our approach modifies standard self-attention to process spatio-temporal feature sequences while maintaining computational efficiency and requiring minimal changes to existing architectures. STA demonstrates broad applicability across diverse transformer architectures and remains effective across both lightweight and larger-scale models. A comprehensive evaluation on the Cityscapes and BDD100k datasets shows substantial improvements of 9.20 percentage points in temporal consistency metrics and up to 1.76 percentage points in mean intersection over union compared to single-frame baselines. These results demonstrate STA as an effective architectural enhancement for video-based semantic segmentation applications.
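The core idea — letting attention mix tokens across frames as well as across pixels — can be illustrated with a minimal NumPy sketch. This is a simplified stand-in, not the paper's exact STA block: the projections are identities and multi-head structure is omitted.

```python
import numpy as np

def spatio_temporal_attention(frames):
    """Toy spatio-temporal self-attention (hypothetical sketch): tokens from
    T frames are flattened into one sequence so attention can aggregate
    context across both space and time."""
    T, H, W, C = frames.shape
    x = frames.reshape(T * H * W, C)          # (N, C) spatio-temporal tokens
    # Identity projections for brevity; a real block has learned Wq, Wk, Wv.
    q, k, v = x, x, x
    scores = q @ k.T / np.sqrt(C)             # (N, N) attention logits
    scores -= scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)  # row-wise softmax
    out = attn @ v                            # features mixed across frames
    return out.reshape(T, H, W, C)
```

Because the sequence simply grows from H·W to T·H·W tokens, a standard self-attention implementation needs only a reshape to gain multi-frame context, which matches the abstract's claim of minimal architectural changes.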
https://arxiv.org/abs/2602.10052
Multimodal Large Language Models (MLLMs) have recently achieved remarkable success in visual-language understanding, demonstrating superior high-level semantic alignment within their vision encoders. An important question thus arises: Can these encoders serve as versatile vision backbones, capable of reliably performing classic vision-centric tasks as well? To address the question, we make the following contributions: (i) we identify that the vision encoders within MLLMs exhibit deficiencies in their dense feature representations, as evidenced by their suboptimal performance on dense prediction tasks (e.g., semantic segmentation, depth estimation); (ii) we propose VersaViT, a well-rounded vision transformer that instantiates a novel multi-task framework for collaborative post-training. This framework facilitates the optimization of the vision backbone via lightweight task heads with multi-granularity supervision; (iii) extensive experiments across various downstream tasks demonstrate the effectiveness of our method, yielding a versatile vision backbone suited for both language-mediated reasoning and pixel-level understanding.
https://arxiv.org/abs/2602.09934
In Domain Generalized Video Semantic Segmentation (DGVSS), a model is trained on a single labeled driving domain and deployed directly on unseen domains, without target labels or test-time adaptation, while maintaining temporally consistent predictions over video streams. In practice, both domain shift and temporal-sampling shift break correspondence-based propagation and fixed-stride temporal aggregation, causing severe frame-to-frame flicker even in label-stable regions. We propose Time2General, a DGVSS framework built on Stability Queries. Time2General introduces a Spatio-Temporal Memory Decoder that aggregates multi-frame context into a clip-level spatio-temporal memory and decodes temporally consistent per-frame masks without explicit correspondence propagation. To further suppress flicker and improve robustness to varying sampling rates, we propose a Masked Temporal Consistency Loss that regularizes temporal prediction discrepancies across different strides, and we randomize training strides to expose the model to diverse temporal gaps. Extensive experiments on multiple driving benchmarks show that Time2General achieves a substantial improvement in cross-domain accuracy and temporal stability over prior DGSS and VSS baselines while running at up to 18 FPS. Code will be released after the review process.
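The stride-robust consistency objective can be sketched as follows; the function names, the L1 formulation, and the stride sampler are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def masked_temporal_consistency(probs_a, probs_b, mask):
    """Hypothetical sketch of a masked temporal consistency penalty:
    probs_a / probs_b are per-pixel class probabilities predicted for the
    same frame under two different temporal strides; the mask keeps only
    label-stable pixels so genuine motion is not penalized."""
    diff = np.abs(probs_a - probs_b).sum(axis=-1)  # per-pixel L1 gap
    masked = diff * mask
    denom = max(mask.sum(), 1.0)                   # avoid divide-by-zero
    return masked.sum() / denom

def random_stride(max_stride, rng):
    """Randomized training stride exposes the model to diverse temporal gaps."""
    return int(rng.integers(1, max_stride + 1))
```

During training one would draw two strides per clip, run the model on both samplings, and add the masked discrepancy to the segmentation loss.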
https://arxiv.org/abs/2602.09648
Synthetic data provide low-cost, accurately annotated samples for geometry-sensitive vision tasks, but appearance and imaging differences between synthetic and real domains cause severe domain shift and degrade downstream performance. Unpaired synthetic-to-real translation can reduce this gap without paired supervision, yet existing methods often face a trade-off between photorealism and structural stability: unconstrained generation may introduce deformation or spurious textures, while overly rigid constraints limit adaptation to real-domain statistics. We propose FD-DB, a frequency-decoupled dual-branch model that separates appearance transfer into low-frequency interpretable editing and high-frequency residual compensation. The interpretable branch predicts physically meaningful editing parameters (white balance, exposure, contrast, saturation, blur, and grain) to build a stable low-frequency appearance base with strong content preservation. The free branch complements fine details through residual generation, and a gated fusion mechanism combines the two branches under explicit frequency constraints to limit low-frequency drift. We further adopt a two-stage training schedule that first stabilizes the editing branch and then releases the residual branch to improve optimization stability. Experiments on the YCB-V dataset show that FD-DB improves real-domain appearance consistency and significantly boosts downstream semantic segmentation performance while preserving geometric and semantic structures.
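A toy illustration of the frequency split and gated fusion, under the assumption of a simple box-filter low-pass; the actual model predicts learned, physically meaningful editing parameters rather than using a fixed filter.

```python
import numpy as np

def box_lowpass(img, k=5):
    """Crude low-pass via a k x k box filter: a stand-in for the model's
    frequency decoupling (single-channel image, edge padding)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def gated_fusion(edited_low, residual_high, gate):
    """A gate in [0, 1] limits how much of the free residual branch is
    admitted, keeping the low-frequency appearance base stable."""
    return edited_low + gate * residual_high
```

Setting the gate to zero recovers the interpretable branch alone, which mirrors the two-stage schedule in which the editing branch is stabilized before the residual branch is released.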
https://arxiv.org/abs/2602.09476
Open-vocabulary semantic segmentation has emerged as a promising research direction in remote sensing, enabling the recognition of diverse land-cover types beyond pre-defined category sets. However, existing methods predominantly rely on the passive mapping of visual features and textual embeddings. This "appearance-based" paradigm lacks geospatial contextual awareness, leading to severe semantic ambiguity and misclassification when encountering land-cover classes with similar spectral features but distinct semantic attributes. To address this, we propose a Geospatial Reasoning Chain-of-Thought (GR-CoT) framework designed to enhance the scene understanding capabilities of Multimodal Large Language Models (MLLMs), thereby guiding open-vocabulary segmentation models toward precise mapping. The framework comprises two collaborative components: an offline knowledge distillation stream and an online instance reasoning stream. The offline stream establishes fine-grained category interpretation standards to resolve semantic conflicts between similar land-cover types. During online inference, the framework executes a sequential reasoning process involving macro-scenario anchoring, visual feature decoupling, and knowledge-driven decision synthesis. This process generates an image-adaptive vocabulary that guides downstream models to achieve pixel-level alignment with correct geographical semantics. Extensive experiments on the LoveDA and GID5 benchmarks demonstrate the superiority of our approach.
https://arxiv.org/abs/2602.08206
Semantic segmentation in high-resolution agricultural imagery demands models that strike a careful balance between accuracy and computational efficiency to enable deployment in practical systems. In this work, we propose DAS-SK, a novel lightweight architecture that retrofits selective kernel convolution (SK-Conv) into the dual atrous separable convolution (DAS-Conv) module to strengthen multi-scale feature learning. The model further enhances the atrous spatial pyramid pooling (ASPP) module, enabling the capture of fine-grained local structures alongside global contextual information. Built upon a modified DeepLabV3 framework with two complementary backbones (MobileNetV3-Large and EfficientNet-B3), the DAS-SK model mitigates limitations associated with large dataset requirements, limited spectral generalization, and the high computational cost that typically restricts deployment on UAVs and other edge devices. Comprehensive experiments across three benchmarks (this http URL, VDD, and PhenoBench) demonstrate that DAS-SK consistently achieves state-of-the-art performance, while being more efficient than CNN-, transformer-, and hybrid-based competitors. Notably, DAS-SK requires up to 21x fewer parameters and 19x fewer GFLOPs than top-performing transformer models. These findings establish DAS-SK as a robust, efficient, and scalable solution for real-time agricultural robotics and high-resolution remote sensing, with strong potential for broader deployment in other vision domains.
https://arxiv.org/abs/2602.08168
This work proposes MeCSAFNet, a multi-branch encoder-decoder architecture for land cover segmentation in multispectral imagery. The model separately processes visible and non-visible channels through dual ConvNeXt encoders, followed by individual decoders that reconstruct spatial information. A dedicated fusion decoder integrates intermediate features at multiple scales, combining fine spatial cues with high-level spectral representations. The feature fusion is further enhanced with CBAM attention, and the ASAU activation function contributes to stable and efficient optimization. The model is designed to process different spectral configurations, including a 4-channel (4c) input combining RGB and NIR bands, as well as a 6-channel (6c) input incorporating NDVI and NDWI indices. Experiments on the Five-Billion-Pixels (FBP) and Potsdam datasets demonstrate significant performance gains. On FBP, MeCSAFNet-base (6c) surpasses U-Net (4c) by +19.21%, U-Net (6c) by +14.72%, SegFormer (4c) by +19.62%, and SegFormer (6c) by +14.74% in mIoU. On Potsdam, MeCSAFNet-large (4c) improves over DeepLabV3+ (4c) by +6.48%, DeepLabV3+ (6c) by +5.85%, SegFormer (4c) by +9.11%, and SegFormer (6c) by +4.80% in mIoU. The model also achieves consistent gains over several recent state-of-the-art approaches. Moreover, compact variants of MeCSAFNet deliver notable performance with lower training time and reduced inference cost, supporting their deployment in resource-constrained environments.
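The two spectral indices used in the 6-channel configuration have standard closed forms, sketched below; the channel stacking order and the `eps` stabilizer are illustrative assumptions, not specified in the abstract.

```python
import numpy as np

def ndvi(nir, red, eps=1e-6):
    """Normalized Difference Vegetation Index: (NIR - R) / (NIR + R)."""
    return (nir - red) / (nir + red + eps)

def ndwi(green, nir, eps=1e-6):
    """Normalized Difference Water Index (McFeeters): (G - NIR) / (G + NIR)."""
    return (green - nir) / (green + nir + eps)

def six_channel_input(rgb, nir):
    """Stack RGB + NIR with NDVI and NDWI to form the 6-channel (6c) input."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return np.stack([r, g, b, nir, ndvi(nir, r), ndwi(g, nir)], axis=-1)
```

The 4-channel (4c) input is simply the first four planes of this stack, so both configurations can share one preprocessing path.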
https://arxiv.org/abs/2602.10137
Semantic segmentation and lane detection are crucial tasks in autonomous driving systems. Conventional approaches predominantly rely on deep neural networks (DNNs), which incur high energy costs due to extensive analog-to-digital conversions and large-scale image computations required for low-latency, real-time responses. Diffractive optical neural networks (DONNs) have shown promising energy-efficiency advantages over conventional DNNs implemented on digital or optoelectronic computing platforms. By performing all-optical image processing via light diffraction at the speed of light, DONNs save computation energy costs while reducing the overhead associated with analog-to-digital conversions through all-optical encoding and computing. In this work, we propose a novel all-optical computing framework for RGB image segmentation and lane detection in autonomous driving applications. Our experimental results demonstrate the effectiveness of the DONN system for image segmentation on the CityScapes dataset. Additionally, we conduct case studies on lane detection using a customized indoor track dataset and simulated driving scenarios in CARLA, where we further evaluate the model's generalizability under diverse environmental conditions.
https://arxiv.org/abs/2602.07717
Recent self-supervised Vision Transformers (ViTs), such as DINOv3, provide rich feature representations for dense vision tasks. This study investigates the intrinsic few-shot semantic segmentation (FSS) capabilities of frozen DINOv3 features through a training-free baseline, FSSDINO, utilizing class-specific prototypes and Gram-matrix refinement. Our results across binary, multi-class, and cross-domain (CDFSS) benchmarks demonstrate that this minimal approach, applied to the final backbone layer, is highly competitive with specialized methods involving complex decoders or test-time adaptation. Crucially, we conduct an Oracle-guided layer analysis, identifying a significant performance gap between the standard last-layer features and globally optimal intermediate representations. We reveal a "Safest vs. Optimal" dilemma: while the Oracle proves higher performance is attainable, matching the results of compute-intensive adaptation methods, current unsupervised and support-guided selection metrics consistently yield lower performance than the last-layer baseline. This characterizes a "Semantic Selection Gap" in Foundation Models, a disconnect where traditional heuristics fail to reliably identify high-fidelity features. Our work establishes the "Last-Layer" as a deceptively strong baseline and provides a rigorous diagnostic of the latent semantic potentials in this http URL. Code is publicly available at this https URL.
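Prototype matching of the kind this baseline builds on can be sketched in a few lines; the 0.5 cosine threshold and function names are illustrative, and the Gram-matrix refinement step is omitted.

```python
import numpy as np

def l2norm(x, eps=1e-8):
    """Row-wise L2 normalization."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def prototype_segment(support_feats, support_mask, query_feats):
    """Training-free prototype matching sketch: average masked support
    tokens into a class prototype, then score every query token by cosine
    similarity against it (details are simplified vs. the paper)."""
    fg = support_feats[support_mask.astype(bool)]   # (Nf, C) foreground tokens
    proto = l2norm(fg.mean(axis=0, keepdims=True))  # (1, C) class prototype
    sims = l2norm(query_feats) @ proto.T            # (Nq, 1) cosine scores
    return (sims[:, 0] > 0.5).astype(int)           # crude threshold mask
```

Swapping which backbone layer supplies `support_feats` and `query_feats` is exactly the knob the Oracle-guided layer analysis turns, with everything else frozen.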
https://arxiv.org/abs/2602.07550
Robust semantic segmentation of road scenes under adverse illumination, lighting, and shadow conditions remains a core challenge for autonomous driving applications. RGB-Thermal fusion is a standard approach, yet existing methods apply static fusion strategies uniformly across all conditions, allowing modality-specific noise to propagate throughout the network. Hence, we propose CLARITY, which dynamically adapts its fusion strategy to the detected scene condition. Guided by vision-language model (VLM) priors, the network learns to modulate each modality's contribution based on the illumination state while leveraging object embeddings for segmentation, rather than applying a fixed fusion policy. We further introduce two mechanisms: one preserves valid dark-object semantics that prior noise-suppression methods incorrectly discard, and a hierarchical decoder enforces structural consistency across scales to sharpen boundaries on thin objects. Experiments on the MFNet dataset demonstrate that CLARITY establishes a new state-of-the-art (SOTA), achieving 62.3% mIoU and 77.5% mAcc.
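Illumination-conditioned fusion can be illustrated with a toy gate; the linear brightness ramp below is a hypothetical stand-in for the paper's VLM-guided modulation.

```python
import numpy as np

def illumination_gate(mean_brightness, low=0.25, high=0.75):
    """Map scene brightness in [0, 1] to an RGB weight: dark scenes lean on
    thermal, bright scenes on RGB (illustrative thresholds, not the paper's)."""
    return float(np.clip((mean_brightness - low) / (high - low), 0.0, 1.0))

def adaptive_fusion(rgb_feat, thermal_feat, mean_brightness):
    """Condition-dependent convex combination of the two modality features."""
    w = illumination_gate(mean_brightness)
    return w * rgb_feat + (1.0 - w) * thermal_feat
```

A static-fusion baseline corresponds to freezing `w` at one value for all scenes, which is exactly the uniformity the abstract argues lets modality-specific noise through.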
https://arxiv.org/abs/2602.07343
Fully unsupervised segmentation pipelines naively seek the most salient object, should this be present. As a result, most of the methods reported in the literature deliver non-deterministic partitions that are sensitive to initialization, seed order, and threshold heuristics. We propose PANC, a weakly supervised spectral segmentation framework that uses a minimal set of annotated visual tokens to produce stable, controllable, and reproducible object masks. Starting from the TokenCut approach, we augment the token-token affinity graph with a handful of priors coupled to anchor nodes. By manipulating the graph topology, we bias the spectral eigenspace toward partitions that are consistent with the annotations. Our approach preserves the global grouping enforced by dense self-supervised visual features, trading annotated tokens for significant gains in reproducibility, user control, and segmentation quality. Using 5 to 30 annotations per dataset, our training-free method achieves state-of-the-art performance among weakly supervised and unsupervised approaches on standard benchmarks (e.g., DUTS-TE, ECSSD, MS COCO). Moreover, it excels in domains where dense labels are costly or intra-class differences are subtle. We report strong and reliable results on homogeneous, fine-grained, and texture-limited domains, achieving 96.8% (+14.43% over SotA), 78.0% (+0.2%), and 78.8% (+0.37%) average mean intersection-over-union (mIoU) on the CrackForest (CFD), CUB-200-2011, and HAM10000 datasets, respectively. For multi-object benchmarks, the framework showcases explicit, user-controllable semantic segmentation.
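Anchor-biased spectral partitioning in the spirit of the method can be sketched as follows. This is a simplification: TokenCut-style normalized cuts over self-supervised feature affinities are replaced here by a plain Laplacian bisection, and the anchor-coupling scheme is an illustrative assumption.

```python
import numpy as np

def anchored_fiedler_partition(affinity, fg_anchors, bg_anchors, boost=5.0):
    """Bias a spectral bisection with annotated anchor tokens: strengthen
    edges among same-label anchors, cut edges between opposite-label
    anchors, then split on the sign of the Fiedler vector."""
    W = affinity.astype(float).copy()
    for i in fg_anchors:                  # tie foreground anchors together
        for j in fg_anchors:
            if i != j:
                W[i, j] = W[j, i] = boost
    for i in fg_anchors:                  # separate fg anchors from bg anchors
        for j in bg_anchors:
            W[i, j] = W[j, i] = 0.0
    d = W.sum(axis=1)
    L = np.diag(d) - W                    # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    fiedler = vecs[:, 1]                  # second-smallest eigenvector
    labels = (fiedler > 0).astype(int)
    if labels[fg_anchors[0]] == 0:        # orient so fg anchors map to 1
        labels = 1 - labels
    return labels
```

Because the anchors deterministically reshape the graph before the eigendecomposition, the same annotations always yield the same partition, which is the reproducibility property the abstract emphasizes.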
https://arxiv.org/abs/2602.06912
Adapting pre-trained vision models using parameter-efficient fine-tuning (PEFT) remains challenging, as it aims to achieve performance comparable to full fine-tuning using a minimal number of trainable parameters. When applied to complex dense prediction tasks, existing methods exhibit limitations, including input-agnostic modeling and redundant cross-layer representations. To this end, we propose AdaRoute, a new adapter-style method featuring a simple mixture-of-experts (MoE) architecture. Specifically, we introduce shared expert centers, where each expert is a trainable parameter matrix. During a feedforward pass, each AdaRoute module in the network dynamically generates weight matrices tailored for the current module via a simple dynamic parameter routing mechanism, which selectively aggregates parameter matrices in the corresponding expert center. Dynamic weight matrices in AdaRoute modules facilitate low-rank adaptation in an input-dependent manner, thus generating more customized and powerful feature representations. Moreover, since AdaRoute modules across multiple network layers share the same expert center, they improve feature diversity by promoting implicit cross-layer feature interaction. Extensive experiments demonstrate the superiority of AdaRoute on diverse vision tasks, including semantic segmentation, object detection and instance segmentation, and panoptic segmentation. Code will be available at: this https URL.
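Dynamic parameter routing over a shared expert center can be sketched as below; the router parameterization and the pooled token summary used as its input are illustrative assumptions, not details from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def route_adapter_weight(token_summary, router, experts):
    """Sketch of dynamic parameter routing: a router scores the current
    input, and the module's low-rank adapter weight becomes a convex
    combination of the shared expert matrices."""
    logits = router @ token_summary               # (E,) input-dependent scores
    gates = softmax(logits)
    # experts: (E, r, C) shared parameter matrices; blend into one (r, C).
    return np.tensordot(gates, experts, axes=(0, 0))
```

Several modules calling this with the same `experts` array but different inputs captures the cross-layer sharing the abstract describes: each layer gets an input-dependent weight, yet all draw from one expert center.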
https://arxiv.org/abs/2602.06862
Vision Foundation Models (VFMs) have achieved remarkable success when applied to various downstream 2D tasks. Despite their effectiveness, they often exhibit a critical lack of 3D awareness. To this end, we introduce Splat and Distill, a framework that instills robust 3D awareness into 2D VFMs by augmenting the teacher model with a fast, feed-forward 3D reconstruction pipeline. Given 2D features produced by a teacher model, our method first lifts these features into an explicit 3D Gaussian representation, in a feedforward manner. These 3D features are then "splatted" onto novel viewpoints, producing a set of novel 2D feature maps used to supervise the student model, "distilling" geometrically grounded knowledge. By replacing slow per-scene optimization of prior work with our feed-forward lifting approach, our framework avoids feature-averaging artifacts, creating a dynamic learning process where the teacher's consistency improves alongside that of the student. We conduct a comprehensive evaluation on a suite of downstream tasks, including monocular depth estimation, surface normal estimation, multi-view correspondence, and semantic segmentation. Our method significantly outperforms prior works, not only achieving substantial gains in 3D awareness but also enhancing the underlying semantic richness of 2D features. Project page is available at this https URL
https://arxiv.org/abs/2602.06032
Distribution shift is a common challenge in medical images obtained from different clinical centers, significantly hindering the deployment of pre-trained semantic segmentation models in real-world applications across multiple domains. Continual Test-Time Adaptation (CTTA) has emerged as a promising approach to address cross-domain shifts during continually evolving target domains. Most existing CTTA methods rely on incrementally updating model parameters, which inevitably suffer from error accumulation and catastrophic forgetting, especially in long-term adaptation. Recent prompt-tuning-based works have shown potential to mitigate the two issues above by updating only visual prompts. While these approaches have demonstrated promising performance, several limitations remain: 1) lacking multi-scale prompt diversity, 2) inadequate incorporation of instance-specific knowledge, and 3) risk of privacy leakage. To overcome these limitations, we propose Multi-scale Global-Instance Prompt Tuning (MGIPT), to enhance scale diversity of prompts and capture both global- and instance-level knowledge for robust CTTA. Specifically, MGIPT consists of an Adaptive-scale Instance Prompt (AIP) and a Multi-scale Global-level Prompt (MGP). AIP dynamically learns lightweight and instance-specific prompts to mitigate error accumulation with an adaptive optimal-scale selection mechanism. MGP captures domain-level knowledge across different scales to ensure robust adaptation with anti-forgetting capabilities. These complementary components are combined through a weighted ensemble approach, enabling effective dual-level adaptation that integrates both global and local information. Extensive experiments on medical image segmentation benchmarks demonstrate that our MGIPT outperforms state-of-the-art methods, achieving robust adaptation across continually changing target domains.
https://arxiv.org/abs/2602.05937
Open-vocabulary semantic segmentation (OVSS) extends traditional closed-set segmentation by enabling pixel-wise annotation for both seen and unseen categories using arbitrary textual descriptions. While existing methods leverage vision-language models (VLMs) like CLIP, their reliance on image-level pretraining often results in imprecise spatial alignment, leading to mismatched segmentations in ambiguous or cluttered scenes. Moreover, most existing approaches lack strong object priors and region-level constraints, which can lead to object hallucination or missed detections, further degrading performance. To address these challenges, we propose LoGoSeg, an efficient single-stage framework that integrates three key innovations: (i) an object existence prior that dynamically weights relevant categories through global image-text similarity, effectively reducing hallucinations; (ii) a region-aware alignment module that establishes precise region-level visual-textual correspondences; and (iii) a dual-stream fusion mechanism that optimally combines local structural information with global semantic context. Unlike prior works, LoGoSeg eliminates the need for external mask proposals, additional backbones, or extra datasets, ensuring efficiency. Extensive experiments on six benchmarks (A-847, PC-459, A-150, PC-59, PAS-20, and PAS-20b) demonstrate its competitive performance and strong generalization in open-vocabulary settings.
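The object existence prior — weighting categories by global image-text similarity — can be sketched as follows; the temperature value and function names are illustrative, not taken from the paper.

```python
import numpy as np

def object_existence_prior(image_embed, text_embeds, temperature=0.07):
    """Sketch of an object existence prior: cosine similarity between the
    global image embedding and each category's text embedding, softmax
    normalized, down-weights categories unlikely to be in the scene."""
    img = image_embed / np.linalg.norm(image_embed)
    txt = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    logits = (txt @ img) / temperature
    logits -= logits.max()
    w = np.exp(logits)
    return w / w.sum()

def reweight_masks(class_logits, prior):
    """Scale per-pixel class logits (H, W, K) by the K-dimensional prior."""
    return class_logits * prior
```

Categories whose text embedding is far from the global image embedding receive small weights everywhere, which is the mechanism the abstract credits with suppressing hallucinated classes.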
https://arxiv.org/abs/2602.05578
Balancing accuracy and latency on high-resolution images is a critical challenge for lightweight models, particularly for Transformer-based architectures that often suffer from excessive latency. To address this issue, we introduce ReGLA, a series of lightweight hybrid networks, which integrates efficient convolutions for local feature extraction with ReLU-based gated linear attention for global modeling. The design incorporates three key innovations: the Efficient Large Receptive Field (ELRF) module for enhancing convolutional efficiency while preserving a large receptive field; the ReLU Gated Modulated Attention (RGMA) module for maintaining linear complexity while enhancing local feature representation; and a multi-teacher distillation strategy to boost performance on downstream tasks. Extensive experiments validate the superiority of ReGLA; in particular, ReGLA-M achieves 80.85% Top-1 accuracy on ImageNet-1K at 224px, with only 4.98 ms latency at 512px. Furthermore, ReGLA outperforms similarly scaled iFormer models in downstream tasks, achieving gains of 3.1% AP on COCO object detection and 3.6% mIoU on ADE20K semantic segmentation, establishing it as a state-of-the-art solution for high-resolution visual applications.
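ReLU-based linear attention, the ingredient that keeps complexity linear in sequence length, can be sketched as below; the gating and modulation of the actual RGMA module are omitted, so this is only the generic linear-attention core.

```python
import numpy as np

def relu_linear_attention(q, k, v, eps=1e-6):
    """Linear attention with phi(x) = relu(x): computing phi(K)^T V once
    gives a (d, d) summary, so the whole pass costs O(N d^2) rather than
    the O(N^2 d) of softmax attention on N tokens."""
    qf = np.maximum(q, 0.0)               # phi(Q), shape (N, d)
    kf = np.maximum(k, 0.0)               # phi(K), shape (N, d)
    kv = kf.T @ v                         # (d, d) key-value summary
    z = qf @ kf.sum(axis=0)               # (N,) per-token normalizer
    return (qf @ kv) / (z[:, None] + eps)
```

Because the quadratic N x N attention matrix is never materialized, latency grows gently with resolution, which is why such blocks suit 512px inputs.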
https://arxiv.org/abs/2602.05262
Urban design profoundly impacts public spaces and community engagement. Traditional top-down methods often overlook public input, creating a gap between design aspirations and reality. Recent advancements in digital tools, like City Information Modelling and augmented reality, have enabled a more participatory process involving more stakeholders in urban design. Further, deep learning and latent diffusion models have lowered barriers for design generation, providing even more opportunities for participatory urban design. Combining state-of-the-art latent diffusion models with interactive semantic segmentation, we propose RECITYGEN, a novel tool that allows users to interactively create variational street view images of urban environments using text prompts. In a pilot project in Beijing, users employed RECITYGEN to suggest improvements for an ongoing Urban Regeneration project. Despite some limitations, RECITYGEN has shown significant potential in aligning with public preferences, indicating a shift towards more dynamic and inclusive urban planning methods. The source code for the project can be found at RECITYGEN GitHub.
城市设计对公共空间和社区参与有着深远的影响。传统的自上而下的方法往往忽视了公众的意见,导致设计理念与现实之间存在差距。近年来,数字工具(如城市信息建模和增强现实)的发展使更多的利益相关者能够参与到城市设计中来,从而促成了更加参与式的过程。此外,深度学习和潜在扩散模型降低了生成设计方案的门槛,为更为参与式的城市设计提供了更多机会。 我们提出了一种新型工具——RECityGen,结合了最新的潜在扩散模型与互动语义分割技术,使用户能够使用文本指令交互地创建城市的街道视图图像。在北京的一个试点项目中,参与者利用RECityGen提出了关于正在进行的城市再生项目的改进建议。尽管存在一些限制,但RECityGen已经展现出了在满足公众偏好方面的巨大潜力,这表明城市规划方法正朝着更加动态和包容的方向转变。 该项目的源代码可以在RECityGen的GitHub页面上找到。
https://arxiv.org/abs/2602.07057
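The interaction pattern described above, where a user selects a semantic region and regenerates only that region from a text prompt, can be sketched abstractly. This is a hypothetical stub, not RECITYGEN's code: `edit_street_view` and `inpaint` are illustrative names, with `inpaint` standing in for a latent-diffusion inpainting call.

```python
def edit_street_view(image, seg_map, target_class, prompt, inpaint):
    """Segmentation-guided interactive editing (illustrative sketch).

    The user picks a semantic class in the segmentation map; a binary
    mask selects those pixels, and only the masked region is
    regenerated from the text prompt by the inpainting backend.
    """
    # Build a boolean mask from the chosen semantic class.
    mask = [[px == target_class for px in row] for row in seg_map]
    # Delegate generation to the (stubbed) diffusion inpainting call.
    return inpaint(image, mask, prompt)
```

In the real tool the `inpaint` backend would be a latent diffusion model; the point of the sketch is that the semantic segmentation map is what makes the text prompt spatially targeted.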
Semantic segmentation of high-resolution remote-sensing imagery is critical for urban mapping and land-cover monitoring, yet training data typically exhibits severe long-tailed pixel imbalance. In the LoveDA dataset, this challenge is compounded by an explicit Urban/Rural split with distinct appearance and inconsistent class-frequency statistics across domains. We present a prompt-controlled diffusion augmentation framework that synthesizes paired label--image samples with explicit control over both domain and semantic composition. Stage~A uses a domain-aware, masked ratio-conditioned discrete diffusion model to generate layouts that satisfy user-specified class-ratio targets while respecting learned co-occurrence structure. Stage~B translates these layouts into photorealistic, domain-consistent images using Stable Diffusion with ControlNet guidance. Mixing the resulting ratio- and domain-controlled synthetic pairs with real data yields consistent improvements across multiple segmentation backbones, with gains concentrated on minority classes and improved Urban and Rural generalization, demonstrating controllable augmentation as a practical mechanism for mitigating long-tail bias in remote-sensing segmentation. Source code, pretrained models, and synthetic datasets are available at \href{this https URL}{GitHub}.
https://arxiv.org/abs/2602.04749
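The class-ratio target that Stage~A conditions on is, in essence, a per-class pixel-frequency vector computed from a label map. A minimal sketch under stated assumptions (seven valid LoveDA-style classes labelled 0..6; function names are hypothetical, not the released code):

```python
import numpy as np

NUM_CLASSES = 7  # assumed: seven land-cover classes, labels 0..6

def class_ratio_vector(label_map, num_classes=NUM_CLASSES):
    """Per-class pixel-frequency vector usable as a conditioning target.

    A generated layout can then be asked to match a user-specified
    ratio vector, e.g. upweighting minority classes.
    """
    counts = np.bincount(label_map.ravel(), minlength=num_classes)
    return counts[:num_classes] / label_map.size

def ratio_l1_gap(generated_labels, target_ratios):
    """L1 distance between a layout's class ratios and the user target."""
    return float(np.abs(class_ratio_vector(generated_labels) - target_ratios).sum())
```

A gap metric like `ratio_l1_gap` is one plausible way to check how closely a sampled layout honours the requested composition.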
Deep neural networks for visual perception are highly susceptible to domain shift, which poses a critical challenge for real-world deployment under conditions that differ from the training data. To address this domain generalization challenge, we propose a cross-modal framework under the learning using privileged information (LUPI) paradigm for training a robust, single-modality RGB model. We leverage event cameras as a source of privileged information, available only during training. The two modalities exhibit complementary characteristics: the RGB stream is semantically dense but domain-dependent, whereas the event stream is sparse yet more domain-invariant. Direct feature alignment between them is therefore suboptimal, as it forces the RGB encoder to mimic the sparse event representation, thereby losing semantic detail. To overcome this, we introduce Privileged Event-based Predictive Regularization (PEPR), which reframes LUPI as a predictive problem in a shared latent space. Instead of enforcing direct cross-modal alignment, we train the RGB encoder with PEPR to predict event-based latent features, distilling robustness without sacrificing semantic richness. The resulting standalone RGB model consistently improves robustness to day-to-night and other domain shifts, outperforming alignment-based baselines across object detection and semantic segmentation.
https://arxiv.org/abs/2602.04583
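The distinction PEPR draws, predicting the event latent rather than directly aligning with it, amounts to inserting a predictor head between the two representations and treating the event latent as a fixed target. A minimal NumPy sketch of that loss (a linear predictor and all names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def pepr_loss(rgb_latent, event_latent, W_pred):
    """Predictive regularisation in a shared latent space (sketch).

    Instead of forcing rgb_latent == event_latent (direct alignment,
    which would make RGB mimic the sparse event features), a
    lightweight predictor maps the RGB latent toward the event latent,
    which acts as a fixed target (stop-gradient in a real framework).
    The event branch exists only at training time (LUPI).
    """
    predicted = rgb_latent @ W_pred   # predictor head on the RGB side
    target = event_latent             # privileged, train-time-only signal
    return float(np.mean((predicted - target) ** 2))
    # Total training objective would be: task_loss + lambda * pepr_loss
```

Because only the predictor absorbs the cross-modal mapping, the RGB encoder is free to keep its dense semantic detail while still being pushed toward the more domain-invariant event statistics.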
We present SeeingThroughClutter, a method for reconstructing structured 3D representations from single images by segmenting and modeling objects individually. Prior approaches rely on intermediate tasks such as semantic segmentation and depth estimation, which often underperform in complex scenes, particularly in the presence of occlusion and clutter. We address this by introducing an iterative object removal and reconstruction pipeline that decomposes complex scenes into a sequence of simpler subtasks. With VLMs acting as orchestrators, foreground objects are removed one at a time via detection, segmentation, object removal, and 3D fitting. We show that removing objects allows for cleaner segmentation of subsequent objects, even in highly occluded scenes. Our method requires no task-specific training and benefits directly from ongoing advances in foundation models. We demonstrate state-of-the-art robustness on the 3D-Front and ADE20K datasets. Project Page: this https URL
https://arxiv.org/abs/2602.04053
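The iterative detect-segment-remove-fit loop above can be sketched as a plain control-flow skeleton. Each stage function here is a stub standing in for a foundation-model call (the real system uses VLMs, segmentation models, and inpainting); the names are illustrative, not the paper's API.

```python
def reconstruct_scene(image, detect, segment, remove, fit_3d, max_objects=50):
    """Iterative scene decomposition (illustrative skeleton).

    The key idea from the abstract: inpainting each foreground object
    away simplifies the scene, so the *next* object gets a cleaner
    segmentation even under heavy occlusion.
    """
    models = []
    for _ in range(max_objects):
        obj = detect(image)            # orchestrator picks next foreground object
        if obj is None:                # nothing left but background
            break
        mask = segment(image, obj)     # mask is cleaner after earlier removals
        models.append(fit_3d(image, mask))
        image = remove(image, mask)    # inpaint it away before the next pass
    return models
```

Because each iteration only sees a progressively simpler scene, the pipeline trades one hard joint problem for a sequence of easier per-object ones, and any stage can be swapped for a newer foundation model without retraining.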