In recent years, learning-based underwater image enhancement (UIE) techniques have evolved rapidly. However, distribution shifts between high-quality enhanced outputs and natural images can hinder semantic cue extraction for downstream vision tasks, thereby limiting the adaptability of existing enhancement models. To address this challenge, this work proposes a new learning mechanism that leverages Vision-Language Models (VLMs) to endow UIE models with semantic-sensitive capabilities. Concretely, our strategy first generates textual descriptions of key objects in a degraded image via VLMs. A text-image alignment model then remaps these descriptions back onto the image to produce a spatial semantic guidance map. This map steers the UIE network through a dual-guidance mechanism that combines cross-attention with an explicit alignment loss, forcing the network to focus its restorative power on semantic-sensitive regions during reconstruction rather than pursuing globally uniform improvement, and thereby ensuring faithful restoration of key object features. Experiments confirm that when our strategy is applied to different UIE baselines, it significantly boosts their performance on perceptual quality metrics as well as on detection and segmentation tasks, validating its effectiveness and adaptability.
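A minimal PyTorch sketch of the dual-guidance idea described above, assuming a one-channel guidance map in [0, 1]; the module and loss names (SemanticCrossAttention, alignment_loss) are illustrative, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticCrossAttention(nn.Module):
    """Cross-attention that lets image features attend to a spatial
    semantic guidance map (e.g., from a text-image alignment model)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj_guide = nn.Linear(1, dim)  # lift 1-channel map to feature dim

    def forward(self, feats, guidance_map):
        # feats: (B, C, H, W); guidance_map: (B, 1, H, W) in [0, 1]
        B, C, H, W = feats.shape
        q = feats.flatten(2).transpose(1, 2)         # (B, HW, C)
        g = guidance_map.flatten(2).transpose(1, 2)  # (B, HW, 1)
        kv = self.proj_guide(g)                      # (B, HW, C)
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2).reshape(B, C, H, W) + feats

def alignment_loss(enhanced, reference, guidance_map):
    """Explicit alignment term: weight reconstruction error by semantic
    saliency so restoration concentrates on semantic-sensitive regions."""
    per_pixel = F.l1_loss(enhanced, reference, reduction="none").mean(1, keepdim=True)
    return (per_pixel * guidance_map).mean()
```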
https://arxiv.org/abs/2603.12773
Regular monitoring of glycemic status is essential for diabetes management, yet conventional blood-based testing can be burdensome for frequent assessment. The sclera contains superficial microvasculature that may exhibit diabetes-related alterations and is readily visible on the ocular surface. We propose ScleraGluNet, a multiview deep-learning framework for three-class metabolic status classification (normal, controlled diabetes, and high-glucose diabetes) and continuous fasting plasma glucose (FPG) estimation from multidirectional scleral vessel images. The dataset comprised 445 participants (150/140/155) and 2,225 anterior-segment images acquired from five gaze directions per participant. After vascular enhancement, features were extracted using parallel convolutional branches, refined with Manta Ray Foraging Optimization (MRFO), and fused via transformer-based cross-view attention. Performance was evaluated using subject-wise five-fold cross-validation, with all images from each participant assigned to the same fold. ScleraGluNet achieved 93.8% overall accuracy, with one-vs-rest AUCs of 0.971, 0.956, and 0.982 for normal, controlled diabetes, and high-glucose diabetes, respectively. For FPG estimation, the model achieved MAE = 6.42 mg/dL and RMSE = 7.91 mg/dL, with strong correlation to laboratory measurements (r = 0.983; R² = 0.966). Bland-Altman analysis showed a mean bias of +1.45 mg/dL with 95% limits of agreement from -8.33 to +11.23 mg/dL. These results support multidirectional scleral vessel imaging with multiview learning as a promising noninvasive approach for glycemic assessment, warranting multicenter validation before clinical deployment.
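A brief sketch of the two evaluation protocols named above, assuming image-level feature arrays and per-image subject IDs: scikit-learn's GroupKFold enforces the subject-wise split (all images of a participant land in one fold), and the Bland-Altman statistics follow the standard bias ± 1.96·SD definition:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

def subject_wise_folds(X, y, subject_ids, n_splits=5):
    """Subject-wise CV: GroupKFold guarantees that all images sharing a
    subject id are assigned to the same fold, preventing identity leakage."""
    gkf = GroupKFold(n_splits=n_splits)
    return list(gkf.split(X, y, groups=subject_ids))

def bland_altman(pred, ref):
    """Mean bias and 95% limits of agreement (bias ± 1.96 * SD of diffs)."""
    diff = np.asarray(pred) - np.asarray(ref)
    bias = diff.mean()
    loa = 1.96 * diff.std(ddof=1)
    return bias, bias - loa, bias + loa
```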
https://arxiv.org/abs/2603.12715
Infrared-visible (IR-VIS) feature matching plays an essential role in cross-modality visual localization, navigation, and perception. Alongside the rapid development of deep learning techniques, a number of representative image matching methods have been proposed. However, cross-modal feature matching remains challenging due to significant appearance differences between modalities, and research in this area is further hampered by the absence of standardized benchmarks and evaluation metrics. In this paper, we introduce CM-Bench, a comprehensive cross-modal feature matching benchmark encompassing 30 feature matching algorithms across diverse cross-modal datasets. Specifically, state-of-the-art traditional and deep learning-based methods are first summarized and categorized into sparse, semi-dense, and dense methods. These methods are evaluated on different tasks, including homography estimation, relative pose estimation, and feature-matching-based geo-localization. In addition, we introduce a classification-network-based adaptive preprocessing front-end that automatically selects suitable enhancement strategies before matching. We also present a novel infrared-satellite cross-modal dataset with manually annotated ground-truth correspondences for practical geo-localization evaluation. The dataset and resources will be available at: this https URL.
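As a concrete example of one evaluation task, here is a common OpenCV-based homography protocol (estimate H from putative matches with RANSAC, then score by corner reprojection error against the ground-truth H); this is a generic sketch, not necessarily CM-Bench's exact metric:

```python
import cv2
import numpy as np

def homography_corner_error(pts_src, pts_dst, H_gt, img_w, img_h):
    """Estimate a homography from matched keypoints and score it by the
    mean reprojection error of the four image corners under H_gt."""
    H_est, _ = cv2.findHomography(pts_src, pts_dst, cv2.RANSAC, 3.0)
    if H_est is None:
        return np.inf  # matching failed outright
    corners = np.float32([[0, 0], [img_w, 0], [img_w, img_h], [0, img_h]]
                         ).reshape(-1, 1, 2)
    proj_est = cv2.perspectiveTransform(corners, H_est)
    proj_gt = cv2.perspectiveTransform(corners, H_gt)
    return float(np.linalg.norm(proj_est - proj_gt, axis=2).mean())
```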
https://arxiv.org/abs/2603.12690
This paper focuses on the inconsistency of salient regions between RGB and thermal images. To address this issue, we propose the Region-guided Selective Optimization Network (RSONet) for RGB-T salient object detection, which consists of a region guidance stage and a saliency generation stage. In the region guidance stage, three parallel branches with the same encoder-decoder structure, equipped with the context interaction (CI) module and spatial-aware fusion (SF) module, generate guidance maps that are used to calculate similarity scores. In the saliency generation stage, the selective optimization (SO) module then fuses RGB and thermal features based on these similarity values to mitigate the impact of inconsistently distributed salient targets across the two modalities. To produce a high-quality detection result, the dense detail enhancement (DDE) module, which adopts multiple dense connections and visual state space blocks, is applied to low-level features to refine detail information. In addition, the mutual interaction semantic (MIS) module is placed on the high-level features to mine location cues through a mutual fusion strategy. We conduct extensive experiments on RGB-T datasets, and the results demonstrate that the proposed RSONet achieves competitive performance against 27 state-of-the-art SOD methods.
https://arxiv.org/abs/2603.12685
Remote sensing images captured from aerial perspectives often exhibit significant scale variations and complex backgrounds, posing challenges for salient object detection (SOD). Existing methods typically extract multi-level features at a single scale using uniform attention mechanisms, leading to suboptimal representations and incomplete detection results. To address these issues, we propose a GeoGran-Aware Hierarchical Feature Fusion Network (G2HFNet) that fully exploits geometric and granular cues in optical remote sensing images. Specifically, G2HFNet adopts Swin Transformer as the backbone to extract multi-level features and integrates three key modules: the multi-scale detail enhancement (MDE) module to handle object scale variations and enrich fine details, the dual-branch geo-gran complementary (DGC) module to jointly capture fine-grained details and positional information in mid-level features, and the deep semantic perception (DSP) module to refine high-level positional cues via self-attention. Additionally, a local-global guidance fusion (LGF) module is introduced to replace traditional convolutions for effective multi-level feature integration. Extensive experiments demonstrate that G2HFNet achieves high-quality saliency maps and significantly improves detection performance in challenging remote sensing scenarios.
https://arxiv.org/abs/2603.12680
Scalable Embodied AI faces fundamental constraints due to prohibitive costs and safety risks of real-world interaction. While Embodied World Models (EWMs) offer promise through imagined rollouts, existing approaches suffer from geometric hallucinations and lack unified optimization frameworks for practical policy improvement. We introduce RoboStereo, a symmetric dual-tower 4D world model that employs bidirectional cross-modal enhancement to ensure spatiotemporal geometric consistency and alleviate physics hallucinations. Building upon this high-fidelity 4D simulator, we present the first unified framework for world-model-based policy optimization: (1) Test-Time Policy Augmentation (TTPA) for pre-execution verification, (2) Imitative-Evolutionary Policy Learning (IEPL) leveraging visual perceptual rewards to learn from expert demonstrations, and (3) Open-Exploration Policy Learning (OEPL) enabling autonomous skill discovery and self-correction. Comprehensive experiments demonstrate RoboStereo achieves state-of-the-art generation quality, with our unified framework delivering >97% average relative improvement on fine-grained manipulation tasks.
https://arxiv.org/abs/2603.12639
Cross-view geo-localization (CVGL) aims to accurately localize street-view images through retrieval of corresponding geo-tagged satellite images. While prior works have achieved nearly perfect performance on certain standard datasets, their robustness in real-world corrupted environments remains under-explored. This oversight causes severe performance degradation or outright failure when images are affected by corruption such as blur or weather, significantly limiting practical deployment. To address this critical gap, we introduce MRGeo, the first systematic method designed for robust CVGL under corruption. MRGeo employs a hierarchical defense strategy that first enhances the intrinsic quality of features and then enforces a robust geometric prior. Its core is the Spatial-Channel Enhancement Block, which contains: (1) a Spatial Adaptive Representation Module that models global and local features in parallel and uses a dynamic gating mechanism to arbitrate their fusion based on feature reliability; and (2) a Channel Calibration Module that performs compensatory adjustments by modeling multi-granularity channel dependencies to counteract information loss. To prevent spatial misalignment under severe corruption, a Region-level Geometric Alignment Module imposes a geometric structure on the final descriptors, ensuring coarse-grained consistency. Comprehensive experiments on both robustness benchmarks and standard datasets demonstrate that MRGeo not only achieves an average R@1 improvement of 2.92% across three comprehensive robustness benchmarks (CVUSA-C-ALL, CVACT_val-C-ALL, and CVACT_test-C-ALL) but also establishes superior performance in cross-area evaluation, demonstrating its robustness and generalization capability.
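A minimal PyTorch sketch of the dynamic gating idea in the Spatial Adaptive Representation Module: a learned gate arbitrates between parallel global and local features. The class name and the squeeze-style gate design are illustrative assumptions, not MRGeo's exact architecture:

```python
import torch
import torch.nn as nn

class DynamicGateFusion(nn.Module):
    """Arbitrate between parallel global and local feature branches with a
    learned per-channel gate derived from both inputs."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),      # squeeze to per-channel statistics
            nn.Conv2d(2 * dim, dim, 1),
            nn.Sigmoid(),                 # reliability weight in (0, 1)
        )

    def forward(self, f_global, f_local):
        g = self.gate(torch.cat([f_global, f_local], dim=1))
        return g * f_global + (1 - g) * f_local
```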
https://arxiv.org/abs/2603.12587
Salient object detection (SOD) in remote sensing images faces significant challenges due to large variations in object sizes, the computational cost of self-attention mechanisms, and the limitations of CNN-based extractors in capturing global context and long-range dependencies. Existing methods that rely on fixed convolution kernels often struggle to adapt to diverse object scales, leading to detail loss or irrelevant feature aggregation. To address these issues, this work aims to enhance robustness to scale variations and achieve precise object localization. We propose the Region Proportion-Aware Dynamic Adaptive Salient Object Detection Network (RDNet), which replaces the CNN backbone with the Swin Transformer for global context modeling and introduces three key modules: (1) the Dynamic Adaptive Detail-aware (DAD) module, which applies varied convolution kernels guided by object region proportions; (2) the Frequency-matching Context Enhancement (FCE) module, which enriches contextual information through wavelet interactions and attention; and (3) the Region Proportion-aware Localization (RPL) module, which employs cross-attention to highlight semantic details and integrates a Proportion Guidance (PG) block to assist the DAD module. By combining these modules, RDNet achieves robustness against scale variations and accurate localization, delivering superior detection performance compared with state-of-the-art methods.
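To illustrate how region proportions could guide kernel selection in a DAD-style module, here is a hedged PyTorch sketch that softly mixes 3x3/5x5/7x7 convolutions from a per-image proportion scalar; the actual module is more elaborate and this is not the authors' code:

```python
import torch
import torch.nn as nn

class ProportionGuidedConv(nn.Module):
    """Blend convolutions of different kernel sizes with weights derived
    from an estimated object-region proportion (one scalar per image)."""
    def __init__(self, dim):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(dim, dim, k, padding=k // 2) for k in (3, 5, 7)]
        )
        self.to_weights = nn.Linear(1, 3)  # proportion -> kernel mixture

    def forward(self, x, region_proportion):
        # region_proportion: (B, 1), e.g., fraction of salient pixels
        w = torch.softmax(self.to_weights(region_proportion), dim=-1)  # (B, 3)
        outs = torch.stack([c(x) for c in self.convs], dim=1)  # (B, 3, C, H, W)
        return (w[:, :, None, None, None] * outs).sum(dim=1)
```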
https://arxiv.org/abs/2603.12215
RGB-NIR image registration plays an important role in sensor fusion, image enhancement, and off-road autonomy. In this work, we evaluate both classical and deep learning (DL) based image registration techniques to assess their suitability for off-road forestry applications. NeMAR, trained under six different configurations, demonstrates partial success; however, its GAN loss instability suggests challenges in preserving geometric consistency. MURF, when tested on off-road forestry data, shows promising large-scale feature alignment during shared information extraction but struggles with fine details in dense vegetation. Although this is only a preliminary evaluation, our study indicates that further refinement is needed to achieve robust, multi-scale registration for off-road forest applications.
https://arxiv.org/abs/2603.11952
The integration of dynamic, sparse structures like Mixture-of-Experts (MoE) with parameter-efficient adapters (e.g., LoRA) is a powerful technique for enhancing Large Language Models (LLMs). However, this architectural enhancement comes at a steep cost: despite minimal increases in computational load, inference latency often skyrockets, slowing decoding by more than 2.5x. Through a fine-grained performance analysis, we pinpoint the primary bottleneck not in the computation itself, but in the severe overhead of the fragmented, sequential CUDA kernel launches required for conventional dynamic routing. To address this challenge, we introduce AdaFuse, a framework built on a tight co-design between the algorithm and the underlying hardware system to enable efficient dynamic adapter execution. Departing from conventional layer-wise or block-wise routing, AdaFuse employs a token-level pre-gating strategy, which makes a single, global routing decision for all adapter layers before a token is processed. This "decide-once, apply-everywhere" approach effectively staticizes the execution path for each token, creating an opportunity for holistic optimization. We capitalize on this by developing a custom CUDA kernel that performs a fused switching operation, merging the parameters of all selected LoRA adapters into the backbone model in a single, efficient pass. Experimental results on popular open-source LLMs show that AdaFuse achieves accuracy on par with state-of-the-art dynamic adapters while cutting decoding latency by more than 2.4x, bridging the gap between model capability and inference efficiency.
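The fused switching operation itself is a custom CUDA kernel; the merge semantics, however, reduce to folding the selected rank-r LoRA deltas into the backbone weight. A plain-PyTorch sketch under that assumption (the pre-gating router that produces `selected` is taken as given):

```python
import torch

@torch.no_grad()
def merge_selected_loras(W, lora_bank, selected, scale=1.0):
    """Decide-once, apply-everywhere: given adapter indices chosen by a
    token-level pre-gate, fold the selected LoRA deltas (B @ A) into the
    backbone weight in one pass, so decoding runs on a static path with
    no per-layer routing.

    W:          (out, in) backbone weight
    lora_bank:  list of (A, B) pairs, A: (r, in), B: (out, r)
    selected:   indices produced by the pre-gating router (assumed given)
    """
    W_merged = W.clone()
    for i in selected:
        A, B = lora_bank[i]
        W_merged += scale * (B @ A)  # rank-r update folded into the weight
    return W_merged
```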
https://arxiv.org/abs/2603.11873
The differentiation between tumor recurrence and radiation-induced contrast enhancements in post-treatment glioblastoma patients remains a major clinical challenge. Existing approaches rely on diffusion MRI, which is only sparsely available in clinical practice, or do not consider radiation dose maps, which are gaining increasing interest in tumor boards for this differentiation. We introduce RICE-NET, a multimodal 3D deep learning model that integrates longitudinal conventional T1-weighted MRI data with radiotherapy dose distributions for automated lesion classification. Using a cohort of 92 patients, the model achieved an F1 score of 0.92 on an independent test set. In extensive ablation experiments, we quantified the contribution of each timepoint and modality and showed that reliable classification largely depends on the radiation map. Occlusion-based interpretability analyses further confirmed the model's focus on clinically relevant regions. These findings highlight the potential of multimodal deep learning to enhance diagnostic accuracy and support clinical decision-making in neuro-oncology.
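A minimal NumPy sketch of occlusion-based interpretability for a 3D volume, assuming a `model_score(volume) -> float` callable (an illustrative stand-in, not RICE-NET's pipeline); regions whose occlusion drops the score most are the ones the model relies on:

```python
import numpy as np

def occlusion_map_3d(volume, model_score, patch=8, fill=0.0):
    """Occlusion sensitivity: slide a cube over the 3D input, replace it
    with a baseline value, and record how much the class score drops."""
    base = model_score(volume)
    D, H, W = volume.shape
    heat = np.zeros_like(volume, dtype=np.float32)
    for z in range(0, D, patch):
        for y in range(0, H, patch):
            for x in range(0, W, patch):
                occluded = volume.copy()
                occluded[z:z+patch, y:y+patch, x:x+patch] = fill
                heat[z:z+patch, y:y+patch, x:x+patch] = base - model_score(occluded)
    return heat  # high values mark clinically influential regions
```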
https://arxiv.org/abs/2603.11827
Frontier Multimodal Large Language Models (MLLMs) exhibit remarkable capabilities in Visual-Language Comprehension (VLC) tasks. However, they are often deployed as zero-shot solutions to new tasks in a black-box manner, so validating and understanding their behavior becomes important before applying them to a new task. We propose an Explicit Logic Channel (ELC), operating in parallel with the black-box model channel, to perform explicit logical reasoning for model validation, selection, and enhancement. The frontier MLLM, encapsulating latent vision-language knowledge, can be considered an Implicit Logic Channel. The proposed Explicit Logic Channel mimics human logical reasoning by incorporating an LLM, a VFM, and logical reasoning with probabilistic inference for factual, counterfactual, and relational reasoning over explicit visual evidence. A Consistency Rate (CR) is proposed for cross-channel validation and model selection, even without ground-truth annotations. Additionally, cross-channel integration further improves performance on zero-shot tasks over MLLMs alone, grounded in explicit visual evidence to enhance trustworthiness. Comprehensive experiments are conducted on two representative VLC tasks, i.e., MC-VQA and HC-REC, over three challenging benchmarks, with 11 recent open-source MLLMs from 4 frontier families. Our systematic evaluations demonstrate the effectiveness of the proposed ELC and CR for model validation, selection, and improvement of MLLMs with enhanced explainability and trustworthiness.
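If CR is read as simple cross-channel agreement (an assumption; the paper may define it more finely), a minimal sketch is:

```python
def consistency_rate(implicit_preds, explicit_preds):
    """Fraction of items on which the black-box MLLM channel and the
    explicit logic channel agree; usable without ground-truth labels."""
    assert len(implicit_preds) == len(explicit_preds)
    agree = sum(a == b for a, b in zip(implicit_preds, explicit_preds))
    return agree / len(implicit_preds)
```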
https://arxiv.org/abs/2603.11689
Image aesthetic enhancement aims to perceive aesthetic deficiencies in images and perform corresponding editing operations, which is highly challenging and requires the model to possess creativity and aesthetic perception capabilities. Although recent advances in image editing models have significantly enhanced their controllability and flexibility, they struggle with enhancing image aesthetics. The primary challenges are twofold: first, following editing instructions with aesthetic perception is difficult, and second, "perfectly-paired" images that have consistent content but distinct aesthetic qualities are scarce. In this paper, we propose Dual-supervised Image Aesthetic Enhancement (DIAE), a diffusion-based generative model with multimodal aesthetic perception. First, DIAE incorporates Multimodal Aesthetic Perception (MAP) to convert ambiguous aesthetic instructions into explicit guidance by (i) employing detailed, standardized aesthetic instructions across multiple aesthetic attributes, and (ii) utilizing multimodal control signals derived from text-image pairs that maintain consistency within the same aesthetic attribute. Second, to mitigate the lack of "perfectly-paired" images, we collect an "imperfectly-paired" dataset called IIAEData, consisting of images with varying aesthetic qualities while sharing identical semantics. To better leverage the weak matching characteristics of IIAEData during training, a dual-branch supervision framework is also introduced for weakly supervised image aesthetic enhancement. Experimental results demonstrate that DIAE outperforms the baselines and obtains superior image aesthetic scores and image content consistency scores.
https://arxiv.org/abs/2603.11556
3D Gaussian Splatting (3DGS) has emerged as a powerful representation for high-quality rendering across a wide range of applications. However, its high computational demands and large storage costs pose significant challenges for deployment on mobile devices. In this work, we propose a mobile-tailored real-time Gaussian Splatting method, dubbed Mobile-GS, enabling efficient inference of Gaussian Splatting on edge devices. Specifically, we first identify alpha blending as the primary computational bottleneck, since it relies on the time-consuming Gaussian depth sorting process. To solve this issue, we propose a depth-aware order-independent rendering scheme that eliminates the need for sorting, thereby substantially accelerating rendering. Although this order-independent rendering improves rendering speed, it may introduce transparency artifacts in regions with overlapping geometry due to the absence of a deterministic rendering order. To address this problem, we propose a neural view-dependent enhancement strategy, enabling more accurate modeling of view-dependent effects conditioned on viewing direction, 3D Gaussian geometry, and appearance attributes. In this way, Mobile-GS achieves both high-quality and real-time rendering. Furthermore, to facilitate deployment on memory-constrained mobile platforms, we also introduce first-order spherical harmonics distillation, a neural vector quantization technique, and a contribution-based pruning strategy to reduce the number of Gaussian primitives and compress the 3D Gaussian representation with the assistance of neural networks. Extensive experiments demonstrate that our proposed Mobile-GS achieves real-time rendering and a compact model size while preserving high visual quality, making it well-suited for mobile applications.
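One classic way to remove the sort, in the spirit of weighted blended order-independent transparency, is to weight each splat's contribution by a monotone function of depth and normalize; this NumPy sketch illustrates the general idea only, since the abstract does not specify Mobile-GS's exact weighting:

```python
import numpy as np

def depth_weighted_oit(colors, alphas, depths, k=8.0):
    """Order-independent compositing for one pixel: instead of sorting
    splats by depth, weight each contribution by a decaying function of
    depth and normalize.
    colors: (N, 3), alphas: (N,), depths: (N,) with smaller = closer."""
    w = alphas * np.exp(-k * depths)           # nearer splats weigh more
    accum = (colors * w[:, None]).sum(axis=0)  # weighted color sum
    total = w.sum() + 1e-8
    return accum / total                       # no per-pixel sort needed
```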
https://arxiv.org/abs/2603.11531
In medical image segmentation tasks, the domain gap caused by differences in data collection between training and testing data seriously hinders the deployment of pre-trained models in clinical practice. Continual Test-Time Adaptation (CTTA) aims to enable pre-trained models to adapt to continuously changing unlabeled domains, providing an effective approach to this problem. However, existing CTTA methods often rely on unreliable supervisory signals, igniting a self-reinforcing cycle of error accumulation that culminates in catastrophic performance degradation. To overcome these challenges, we propose Semantic-Prompt-Enhanced Graph Clustering (SPEGC), a CTTA method for medical image segmentation. First, we design a semantic prompt feature enhancement mechanism that utilizes decoupled commonality and heterogeneity prompt pools to inject global contextual information into local features, alleviating their susceptibility to noise interference under domain shift. Second, based on these enhanced features, we design a differentiable graph clustering solver. This solver reframes global edge sparsification as an optimal transport problem, allowing it to distill a raw similarity matrix into a refined, high-order structural representation in an end-to-end manner. Finally, this robust structural representation is used to guide model adaptation, ensuring that predictions are consistent at the cluster level and dynamically adjusting decision boundaries. Extensive experiments demonstrate that SPEGC outperforms other state-of-the-art CTTA methods on two medical image segmentation benchmarks. The source code is available at this https URL.
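The optimal-transport view of edge sparsification can be made differentiable with entropic-regularized Sinkhorn iterations, as in this minimal sketch (uniform marginals assumed; SPEGC's solver may differ in its constraints):

```python
import torch

def sinkhorn(sim, n_iters=20, eps=0.05):
    """Distill a raw similarity matrix into a near-doubly-stochastic,
    sparser transport plan; smaller eps yields sharper (sparser) edges,
    and the loop is differentiable end-to-end."""
    K = torch.exp(sim / eps)  # (N, N) positive kernel from similarities
    u = torch.ones(K.shape[0], device=K.device)
    for _ in range(n_iters):
        v = 1.0 / (K.t() @ u)  # column scaling
        u = 1.0 / (K @ v)      # row scaling
    return u[:, None] * K * v[None, :]  # transport plan P
```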
https://arxiv.org/abs/2603.11492
Deep learning models can predict protein properties with unprecedented accuracy but rarely offer mechanistic insight or actionable guidance for engineering improved variants. When a model flags an antibody as unstable, the protein engineer is left without recourse: which mutations would rescue stability while preserving function? We introduce Manifold-Constrained Counterfactual Optimization for Proteins (MCCOP), a framework that computes minimal, biologically plausible sequence edits that flip a model's prediction to a desired target state. MCCOP operates in a continuous joint sequence-structure latent space and employs a pretrained diffusion model as a manifold prior, balancing three objectives: validity (achieving the target property), proximity (minimizing mutations), and plausibility (producing foldable proteins). We evaluate MCCOP on three protein engineering tasks - GFP fluorescence rescue, thermodynamic stability enhancement, and E3 ligase activity recovery - and show that it generates sparser, more plausible counterfactuals than both discrete and continuous baselines. The recovered mutations align with known biophysical mechanisms, including chromophore packing and hydrophobic core consolidation, establishing MCCOP as a tool for both model interpretation and hypothesis-driven protein design. Our code is publicly available at this http URL.
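A hedged sketch of one optimization step balancing the three objectives, with `predictor` (a property classifier on the latent) and `decoder_logp` (a stand-in for the diffusion manifold prior's log-likelihood) assumed as given callables; MCCOP's actual optimizer and prior are more involved:

```python
import torch

def mccop_step(z, z0, predictor, decoder_logp, target, lr=0.05,
               lam_prox=1.0, lam_plaus=0.1):
    """One gradient step on a latent counterfactual: validity pulls the
    classifier toward the target property, proximity keeps the edit
    minimal, and plausibility keeps the decoded protein on-manifold."""
    z = z.detach().requires_grad_(True)
    validity = torch.nn.functional.cross_entropy(predictor(z), target)
    proximity = (z - z0).pow(2).sum()
    plausibility = -decoder_logp(z)  # higher log-prob = more plausible
    loss = validity + lam_prox * proximity + lam_plaus * plausibility
    loss.backward()
    with torch.no_grad():
        return z - lr * z.grad
```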
https://arxiv.org/abs/2603.10811
When MLLMs fail at Science, Technology, Engineering, and Mathematics (STEM) visual reasoning, a fundamental question arises: is it due to perceptual deficiencies or reasoning limitations? Through systematic scaling analysis that independently scales perception and reasoning components, we uncover a critical insight: scaling perception consistently outperforms scaling reasoning. This reveals perception as the true bottleneck in current STEM visual reasoning. Motivated by this insight, our work focuses on systematically enhancing the perception capabilities of MLLMs by establishing code as a powerful perceptual medium: executable code provides precise semantics that naturally align with the structured nature of STEM visuals. Specifically, we construct ICC-1M, a large-scale dataset comprising 1M Image-Caption-Code triplets that materializes this code-as-perception paradigm through two complementary approaches: (1) Code-Grounded Caption Generation treats executable code as ground truth for image captions, eliminating the hallucinations inherent in existing knowledge distillation methods; (2) STEM Image-to-Code Translation prompts models to generate reconstruction code, mitigating the ambiguity of natural language for perception enhancement. To validate this paradigm, we further introduce STEM2Code-Eval, a novel benchmark that directly evaluates visual perception in STEM domains. Unlike existing work that relies on problem-solving accuracy as a proxy, which only measures problem-relevant understanding, our benchmark requires comprehensive visual comprehension through executable code generation for image reconstruction, providing a deterministic and verifiable assessment. Code is available at this https URL.
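A toy illustration of the code-as-perception idea: the code is executed to render the image, so a caption derived from the code cannot hallucinate content. The triplet layout here is an assumption for illustration, not the ICC-1M format:

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

# The code string is executed to produce the image, making it the
# ground truth for whatever the caption claims about the image.
code = (
    "fig, ax = plt.subplots()\n"
    "ax.plot([0, 1, 2], [0, 1, 4], marker='o')\n"
    "ax.set_title('y = x^2 (sampled)')\n"
)
caption = ("A line plot with circular markers through (0,0), (1,1), "
           "(2,4), titled 'y = x^2 (sampled)'.")

namespace = {"plt": plt}
exec(code, namespace)  # executable code as ground truth
namespace["fig"].savefig("triplet_image.png")
triplet = {"image": "triplet_image.png", "caption": caption, "code": code}
```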
https://arxiv.org/abs/2603.10757
Generative models are widely employed to enhance the photorealism of synthetic data for training computer vision algorithms. However, they often introduce visual artifacts that degrade the accuracy of these algorithms and require substantial computational resources, limiting their applicability in real-time training or evaluation scenarios. In this paper, we propose the Hybrid Patch Enhanced Realism Generative Adversarial Network (HyPER-GAN), a lightweight image-to-image translation method based on a U-Net-style generator designed for real-time inference. The model is trained on paired synthetic and photorealism-enhanced images, complemented by a hybrid training strategy that incorporates matched patches from real-world data to improve visual realism and semantic consistency. Experimental results demonstrate that HyPER-GAN outperforms state-of-the-art paired image-to-image translation methods in terms of inference latency, visual realism, and semantic robustness. Moreover, we show that the hybrid training strategy indeed improves visual quality and semantic consistency compared to training the model solely on paired synthetic and photorealism-enhanced images. Code and pretrained models are publicly available for download at: this https URL
https://arxiv.org/abs/2603.10604
Gaussian splatting has emerged as a competitive explicit representation for image and video reconstruction. In this work, we present P-GSVC, the first layered progressive 2D Gaussian splatting framework, providing a unified solution for scalable Gaussian representation of both images and videos. P-GSVC organizes 2D Gaussian splats into a base layer and successive enhancement layers, enabling coarse-to-fine reconstruction. To effectively optimize this layered representation, we propose a joint training strategy that simultaneously updates Gaussians across layers, aligning their optimization trajectories to ensure inter-layer compatibility and stable progressive reconstruction. P-GSVC supports scalability in terms of both quality and resolution. Our experiments show that the joint training strategy yields PSNR gains of up to 1.9 dB for video and 2.6 dB for images compared to methods that perform sequential layer-wise training. Project page: this https URL
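A minimal sketch of such a joint training objective, assuming a `render_prefix(layers[:k])` rasterizer is given: every cumulative prefix (base layer, base + first enhancement layer, ...) is supervised simultaneously so truncated decodes remain valid and layer trajectories stay compatible. This is an illustrative reading of the strategy, not the authors' loss:

```python
import torch

def progressive_joint_loss(render_prefix, layers, target):
    """Jointly optimize all layers by supervising every cumulative
    prefix of the layered Gaussian representation against the target."""
    loss = 0.0
    for k in range(1, len(layers) + 1):
        recon = render_prefix(layers[:k])  # coarse-to-fine reconstruction
        loss = loss + torch.nn.functional.mse_loss(recon, target)
    return loss / len(layers)
```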
https://arxiv.org/abs/2603.10551
The construction of high-quality health indicators (HIs) is crucial for effective prognostics and health management. Although deep learning has significantly advanced HI modeling, existing approaches often struggle with distribution mismatches resulting from varying operating conditions. While domain adaptation is typically employed to mitigate these shifts, two critical challenges remain: (1) the misalignment of degradation stages during random mini-batch sampling, which produces misleading discrepancy losses, and (2) the structural limitations of small-kernel 1D-CNNs in capturing long-range temporal dependencies within complex vibration signals. To address these issues, we propose a domain-adaptive framework comprising degradation stage synchronized batch sampling (DSSBS) and the cross-domain aligned fusion large autoencoder (CAFLAE). DSSBS utilizes kernel change-point detection to segment degradation stages, ensuring that source and target mini-batches are synchronized by failure phase during alignment. Complementing this, CAFLAE integrates large-kernel temporal feature extraction with cross-attention mechanisms to learn superior domain-invariant representations. The proposed framework was rigorously validated on a Korean defense system dataset and the XJTU-SY bearing dataset, achieving an average performance improvement of 24.1% over state-of-the-art methods. These results demonstrate that DSSBS improves cross-domain alignment through stage-consistent sampling, whereas CAFLAE offers a high-performance backbone for long-term industrial condition monitoring.
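A hedged sketch of stage-synchronized sampling using the `ruptures` library for kernel change-point detection (a plausible implementation choice; the paper does not name its library), with feature sequences as NumPy arrays:

```python
import numpy as np
import ruptures as rpt  # kernel change-point detection

def stage_synchronized_batches(src_feats, tgt_feats, n_stages=3, batch=32):
    """Segment each run-to-failure sequence into degradation stages via
    kernel change-point detection, then pair source/target mini-batches
    drawn from the SAME stage so discrepancy losses compare like with like."""
    def stages(x):
        bkps = rpt.KernelCPD(kernel="rbf").fit(x).predict(n_bkps=n_stages - 1)
        bounds = [0] + bkps  # bkps is sorted and ends at len(x)
        return [np.arange(bounds[i], bounds[i + 1]) for i in range(n_stages)]

    src_stages, tgt_stages = stages(src_feats), stages(tgt_feats)
    for s, t in zip(src_stages, tgt_stages):
        i = np.random.choice(s, size=min(batch, len(s)), replace=False)
        j = np.random.choice(t, size=min(batch, len(t)), replace=False)
        yield src_feats[i], tgt_feats[j]  # stage-aligned mini-batch pair
```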
https://arxiv.org/abs/2603.10430