All-in-one image restoration aims to recover clear images from various degradation types and levels with a unified model. However, the significant variations among degradation types make it challenging to train a universal model, often resulting in task interference, where the gradient update directions of different tasks may diverge due to shared parameters. To address this issue, and motivated by routing strategies, we propose DFPIR, a novel all-in-one image restorer that introduces Degradation-aware Feature Perturbations (DFP) to adjust the feature space so that it aligns with the unified parameter space. In this paper, the feature perturbations primarily include channel-wise perturbations and attention-wise perturbations. Specifically, channel-wise perturbations are implemented by shuffling the channels in a high-dimensional space guided by degradation types, while attention-wise perturbations are achieved through selective masking in the attention space. To this end, we propose a Degradation-Guided Perturbation Block (DGPB), positioned between the encoding and decoding stages of the encoder-decoder architecture, to implement these two functions. Extensive experimental results demonstrate that DFPIR achieves state-of-the-art performance on several all-in-one image restoration tasks, including image denoising, image dehazing, image deraining, motion deblurring, and low-light image enhancement. Our code is available at this https URL.
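As a rough, hypothetical illustration of the channel-wise perturbation idea (this is not the authors' DGPB; the grouped shuffle and the learned per-degradation gate are assumptions), a degradation-conditioned shuffle-and-gate layer in PyTorch could look like the following sketch.

import torch
import torch.nn as nn

class ChannelPerturbation(nn.Module):
    # Hypothetical sketch of degradation-guided channel-wise perturbation,
    # not the paper's DGPB: a grouped channel shuffle followed by a
    # per-degradation learned channel gate.
    def __init__(self, channels, num_degradations, groups=4):
        super().__init__()
        self.groups = groups
        self.gates = nn.Embedding(num_degradations, channels)

    def forward(self, x, deg_id):
        b, c, h, w = x.shape
        # ShuffleNet-style grouped channel shuffle
        x = x.view(b, self.groups, c // self.groups, h, w)
        x = x.transpose(1, 2).reshape(b, c, h, w)
        # degradation-type-conditioned channel gating
        gate = torch.sigmoid(self.gates(deg_id)).view(b, c, 1, 1)
        return x * gate

feats = torch.randn(2, 64, 32, 32)
deg = torch.tensor([0, 3])  # hypothetical degradation ids (e.g. noise, haze)
print(ChannelPerturbation(64, num_degradations=5)(feats, deg).shape)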
https://arxiv.org/abs/2505.12630
Image enhancement methods often prioritize pixel-level information while overlooking semantic features. We propose a novel, unsupervised, fuzzy-inspired image enhancement framework guided by the NSGA-II algorithm that optimizes image brightness, contrast, and gamma parameters to achieve a balance between visual quality and semantic fidelity. Central to our proposed method is the use of a pre-trained deep neural network as a feature extractor. To find the best enhancement settings, we use a GPU-accelerated NSGA-II algorithm that balances multiple objectives, namely increasing image entropy, improving perceptual similarity, and maintaining appropriate brightness. We further improve the results by applying a local search phase to fine-tune the top candidates from the genetic algorithm. Our approach operates entirely without paired training data, making it broadly applicable across domains with limited or noisy labels. Quantitatively, our model achieves average BRISQUE and NIQE scores of 19.82 and 3.652, respectively, across all unpaired datasets. Qualitatively, images enhanced by our model exhibit significantly improved visibility in shadowed regions, a natural balance of contrast, and richer fine detail without introducing noticeable artifacts. This work opens new directions for unsupervised image enhancement where semantic consistency is critical.
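To make the search loop concrete, here is a heavily simplified stand-in: a scalarised random search over (brightness, contrast, gamma) with entropy and mean-brightness objectives. It is not NSGA-II (no non-dominated sorting or crowding distance) and it omits the deep-feature perceptual term; the parameter ranges are assumptions.

import numpy as np

def enhance(img, brightness, contrast, gamma):
    # img in [0, 1]; simple brightness/contrast/gamma point transform
    return np.clip(contrast * (img ** gamma) + brightness, 0.0, 1.0)

def entropy(img, bins=256):
    hist, _ = np.histogram(img, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

rng = np.random.default_rng(0)
img = rng.random((64, 64)) * 0.3          # stand-in for a dark input image
best = None
for _ in range(500):
    b = rng.uniform(-0.2, 0.2)            # brightness offset
    c = rng.uniform(0.8, 1.5)             # contrast scale
    g = rng.uniform(0.4, 1.2)             # gamma
    out = enhance(img, b, c, g)
    # scalarised objectives: maximise entropy, keep mean brightness near 0.5
    score = entropy(out) - 4.0 * abs(out.mean() - 0.5)
    if best is None or score > best[0]:
        best = (score, b, c, g)
print("best (brightness, contrast, gamma):", best[1:])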
https://arxiv.org/abs/2505.11246
This study introduces an enhanced approach to video super-resolution by extending the standard Super-Resolution Generative Adversarial Network (SRGAN), originally designed for single-image super-resolution (SISR), to handle spatio-temporal data. While SRGAN has proven effective for single-image enhancement, its design does not account for the temporal continuity required in video processing. To address this, a modified framework that incorporates 3D Non-Local Blocks is proposed, enabling the model to capture relationships across both spatial and temporal dimensions. An experimental training pipeline is developed, based on patch-wise learning and advanced data degradation techniques, to simulate real-world video conditions and to learn from both local and global structures and details. This helps the model generalize better and remain stable across varying video content while preserving overall structure as well as pixel-wise correctness. Two model variants, one larger and one more lightweight, are presented to explore the trade-off between performance and efficiency. The results demonstrate improved temporal coherence, sharper textures, and fewer visual artifacts compared to traditional single-image methods. This work contributes to the development of practical, learning-based solutions for video enhancement tasks, with potential applications in streaming, gaming, and digital restoration.
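A minimal 3D non-local block (embedded-Gaussian form) is sketched below to illustrate the kind of spatio-temporal attention being added; the channel widths and the placement inside the SRGAN generator are assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn

class NonLocalBlock3D(nn.Module):
    # Minimal 3D non-local block over (frames, height, width): every position
    # attends to every other space-time position, then a residual is added.
    def __init__(self, channels):
        super().__init__()
        self.inter = channels // 2
        self.theta = nn.Conv3d(channels, self.inter, 1)
        self.phi = nn.Conv3d(channels, self.inter, 1)
        self.g = nn.Conv3d(channels, self.inter, 1)
        self.out = nn.Conv3d(self.inter, channels, 1)

    def forward(self, x):
        b, c, t, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (b, t*h*w, c')
        k = self.phi(x).flatten(2)                     # (b, c', t*h*w)
        v = self.g(x).flatten(2).transpose(1, 2)       # (b, t*h*w, c')
        attn = torch.softmax(q @ k / self.inter ** 0.5, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, self.inter, t, h, w)
        return x + self.out(y)                         # residual connection

clip = torch.randn(1, 32, 5, 16, 16)    # (batch, channels, frames, H, W)
print(NonLocalBlock3D(32)(clip).shape)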
https://arxiv.org/abs/2505.10589
Low-light image enhancement (LLIE) is a fundamental task in computational photography, aiming to improve illumination, reduce noise, and enhance image quality. While recent advancements focus on designing increasingly complex neural network models, we observe a peculiar phenomenon: resetting certain parameters to random values unexpectedly improves enhancement performance for some images. Drawing inspiration from biological genes, we term this phenomenon the gene effect. The gene effect limits enhancement performance, as even random parameters can sometimes outperform learned ones, preventing models from fully utilizing their capacity. In this paper, we investigate the reason and propose a solution. Based on our observations, we attribute the gene effect to static parameters, analogous to how fixed genetic configurations become maladaptive when environments change. Inspired by biological evolution, where adaptation to new environments relies on gene mutation and recombination, we propose parameter dynamic evolution (PDE) to adapt to different images and mitigate the gene effect. PDE employs a parameter orthogonal generation technique and the corresponding generated parameters to simulate gene recombination and gene mutation, respectively. Experiments validate the effectiveness of our techniques. The code will be released to the public.
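The abstract does not spell out the orthogonal generation operator, so the following is only a guess at the flavour of the idea: draw a random perturbation, project out its component along the existing weights so that it is orthogonal to them, and blend it in. It is a hypothetical stand-in, not the actual PDE operator.

import torch

def orthogonal_perturbation(w, strength=0.1):
    # Random direction, Gram-Schmidt step against w, rescale to w's norm, blend.
    flat = w.flatten()
    rand = torch.randn_like(flat)
    rand = rand - (rand @ flat) / (flat @ flat + 1e-12) * flat
    rand = rand / (rand.norm() + 1e-12) * flat.norm()
    return (flat + strength * rand).view_as(w)

w = torch.randn(64, 64)
w_new = orthogonal_perturbation(w)
delta = (w_new - w).flatten()
print(torch.dot(delta, w.flatten()))   # ~0: the injected component is orthogonal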
https://arxiv.org/abs/2505.09196
In low-light environments, the performance of computer vision algorithms often deteriorates significantly, adversely affecting key vision tasks such as segmentation, detection, and classification. With the rapid advancement of deep learning, its application to low-light image processing has attracted widespread attention and seen significant progress in recent years. However, there remains a lack of comprehensive surveys that systematically examine how recent deep-learning-based low-light image enhancement methods function and evaluate their effectiveness in enhancing downstream vision tasks. To address this gap, this review provides a detailed elaboration of how various recent approaches (from 2020 onward) operate and of their enhancement mechanisms, supplemented with clear illustrations. It also investigates the impact of different enhancement techniques on subsequent vision tasks, critically analyzing their strengths and limitations. Additionally, it proposes future research directions. This review serves as a useful reference for selecting low-light image enhancement techniques and optimizing vision task performance in low-light conditions.
https://arxiv.org/abs/2505.05759
Alzheimer's Disease (AD) is a neurodegenerative disorder characterized by amyloid-beta plaques and tau neurofibrillary tangles, which serve as key histopathological features. The identification and segmentation of these lesions are crucial for understanding AD progression but remain challenging due to the lack of large-scale annotated datasets and the impact of staining variations on automated image analysis. Deep learning has emerged as a powerful tool for pathology image segmentation; however, model performance is significantly influenced by variations in staining characteristics, necessitating effective stain normalization and enhancement techniques. In this study, we address these challenges by introducing an open-source dataset (ADNP-15) of neuritic plaques (i.e., amyloid deposits combined with a crown of dystrophic tau-positive neurites) in human brain whole slide images. We establish a comprehensive benchmark by evaluating five widely adopted deep learning models across four stain normalization techniques, providing deeper insights into their influence on neuritic plaque segmentation. Additionally, we propose a novel image enhancement method that improves segmentation accuracy, particularly in complex tissue structures, by enhancing structural details and mitigating staining inconsistencies. Our experimental results demonstrate that this enhancement strategy significantly boosts model generalization and segmentation accuracy. All datasets and code are open-source, ensuring transparency and reproducibility while enabling further advancements in the field.
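Since the benchmark evaluates stain normalization techniques, a minimal sketch of one widely used baseline, Reinhard-style LAB statistics matching, is shown below for orientation. It is not the enhancement method proposed in the paper, and the random arrays stand in for real whole-slide-image tiles.

import numpy as np
from skimage import color

def reinhard_normalize(src_rgb, ref_rgb):
    # Match per-channel LAB mean/std of a source tile to a reference tile,
    # i.e. the classic Reinhard colour-transfer baseline.
    src, ref = color.rgb2lab(src_rgb), color.rgb2lab(ref_rgb)
    out = np.empty_like(src)
    for ch in range(3):
        s_mu, s_sd = src[..., ch].mean(), src[..., ch].std() + 1e-6
        r_mu, r_sd = ref[..., ch].mean(), ref[..., ch].std()
        out[..., ch] = (src[..., ch] - s_mu) / s_sd * r_sd + r_mu
    return np.clip(color.lab2rgb(out), 0.0, 1.0)

rng = np.random.default_rng(0)
tile = rng.random((128, 128, 3))        # stand-ins for real WSI tiles
reference = rng.random((128, 128, 3))
print(reinhard_normalize(tile, reference).shape)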
https://arxiv.org/abs/2505.05041
This paper presents a novel Two-Stage Diffusion Model (TS-Diff) for enhancing extremely low-light RAW images. In the pre-training stage, TS-Diff synthesizes noisy images by constructing multiple virtual cameras based on a noise space. Camera Feature Integration (CFI) modules are then designed to enable the model to learn generalizable features across diverse virtual cameras. During the aligning stage, CFIs are averaged to create a target-specific CFI$^T$, which is fine-tuned using a small amount of real RAW data to adapt to the noise characteristics of specific cameras. A structural reparameterization technique further simplifies CFI$^T$ for efficient deployment. To address color shifts during the diffusion process, a color corrector is introduced to ensure color consistency by dynamically adjusting global color distributions. Additionally, a novel dataset, QID, is constructed, featuring quantifiable illumination levels and a wide dynamic range, providing a comprehensive benchmark for training and evaluation under extreme low-light conditions. Experimental results demonstrate that TS-Diff achieves state-of-the-art performance on multiple datasets, including QID, SID, and ELD, excelling in denoising, generalization, and color consistency across various cameras and illumination levels. These findings highlight the robustness and versatility of TS-Diff, making it a practical solution for low-light imaging applications. Source codes and models are available at this https URL
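A tiny sketch of the aligning-stage idea of averaging per-camera modules into a single target module is given below; the placeholder convolutional block is hypothetical, and only the parameter-averaging step is illustrated, not the subsequent fine-tuning or structural reparameterization.

import copy
import torch
import torch.nn as nn

def average_modules(modules):
    # Average the parameters of several per-camera CFI modules into one target
    # module (the CFI^T that is then fine-tuned on a little real RAW data).
    target = copy.deepcopy(modules[0])
    with torch.no_grad():
        for name, p in target.named_parameters():
            stacked = torch.stack([dict(m.named_parameters())[name] for m in modules])
            p.copy_(stacked.mean(dim=0))
    return target

# placeholder per-virtual-camera blocks operating on 4-channel RAW features
cfis = [nn.Sequential(nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 4, 3, padding=1)) for _ in range(3)]
cfi_t = average_modules(cfis)
print(sum(p.numel() for p in cfi_t.parameters()))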
https://arxiv.org/abs/2505.04281
Accurate identification of agricultural pests is essential for crop protection but remains challenging due to the large intra-class variance and fine-grained differences among pest species. While deep learning has advanced pest detection, most existing approaches rely solely on low-level visual features and lack effective multi-modal integration, leading to limited accuracy and poor interpretability. Moreover, the scarcity of high-quality multi-modal agricultural datasets further restricts progress in this field. To address these issues, we construct two novel multi-modal benchmarks, CTIP102 and STIP102, based on the widely used IP102 dataset, and introduce a Multi-scale Cross-Modal Fusion Network (MSFNet-CPD) for robust pest detection. Our approach enhances visual quality via a super-resolution reconstruction module and feeds both the original and reconstructed images into the network to improve clarity and detection performance. To better exploit semantic cues, we propose an Image-Text Fusion (ITF) module for joint modeling of visual and textual features, and an Image-Text Converter (ITC) that reconstructs fine-grained details across multiple scales to handle challenging backgrounds. Furthermore, we introduce an Arbitrary Combination Image Enhancement (ACIE) strategy to generate a more complex and diverse pest detection dataset, MTIP102, improving the model's generalization to real-world scenarios. Extensive experiments demonstrate that MSFNet-CPD consistently outperforms state-of-the-art methods on multiple pest detection benchmarks. All code and datasets will be made publicly available at: this https URL.
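The abstract gives no op list for ACIE, so the sketch below is only a loose reading of "arbitrary combination": randomly sample a subset of standard enhancement/augmentation transforms and compose them. The specific torchvision ops and ranges are assumptions.

import random
import torch
from torchvision import transforms

def acie_like_pipeline(seed=None):
    # Hypothetical stand-in for ACIE: pick 1-3 ops at random and compose them.
    ops = [
        transforms.ColorJitter(brightness=0.4),
        transforms.ColorJitter(contrast=0.4),
        transforms.ColorJitter(saturation=0.4),
        transforms.RandomHorizontalFlip(p=1.0),
        transforms.GaussianBlur(kernel_size=3),
    ]
    rng = random.Random(seed)
    return transforms.Compose(rng.sample(ops, k=rng.randint(1, 3)))

pipeline = acie_like_pipeline(seed=0)
augmented = pipeline(torch.rand(3, 224, 224))   # tensor image in [0, 1]
print(pipeline, augmented.shape)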
https://arxiv.org/abs/2505.02441
Developing effective approaches to generate enhanced results that align well with human visual preferences for high-quality, well-lit images remains a challenge in low-light image enhancement (LLIE). In this paper, we propose HiLLIE, a human-in-the-loop LLIE training framework that improves the visual quality of unsupervised LLIE model outputs through iterative training stages. At each stage, we introduce human guidance into the training process through efficient visual quality annotations of enhanced outputs. Subsequently, we employ a tailored image quality assessment (IQA) model to learn the human visual preferences encoded in the acquired labels, which is then utilized to guide the training process of an enhancement model. With only a small amount of pairwise ranking annotations required at each stage, our approach continually improves the IQA model's capability to simulate human visual assessment of enhanced outputs, thus leading to visually appealing LLIE results. Extensive experiments demonstrate that our approach significantly improves unsupervised LLIE models both quantitatively and qualitatively. The code and collected ranking dataset will be available at this https URL.
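One common way to learn a scorer from pairwise ranking annotations is a margin ranking loss; the minimal training step below illustrates that mechanism with a toy CNN scorer, which is an assumption rather than the paper's tailored IQA model.

import torch
import torch.nn as nn

# Toy IQA scorer trained from "image A preferred over image B" annotations.
scorer = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                       nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))
rank_loss = nn.MarginRankingLoss(margin=0.5)
opt = torch.optim.Adam(scorer.parameters(), lr=1e-4)

img_a = torch.rand(8, 3, 64, 64)   # preferred enhanced outputs
img_b = torch.rand(8, 3, 64, 64)   # less-preferred outputs
target = torch.ones(8)             # +1 means "A should score higher than B"

loss = rank_loss(scorer(img_a).squeeze(1), scorer(img_b).squeeze(1), target)
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))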
https://arxiv.org/abs/2505.02134
High-quality fundus images provide essential anatomical information for clinical screening and ophthalmic disease diagnosis. Yet, due to hardware limitations, operational variability, and patient compliance, fundus images often suffer from low resolution and signal-to-noise ratio. Recent years have witnessed promising progress in fundus image enhancement. However, existing works usually focus on restoring structural details or global characteristics of fundus images, lacking a unified image enhancement framework to recover comprehensive multi-scale information. Moreover, few methods pinpoint the target of image enhancement, e.g., lesions, which is crucial for medical image-based diagnosis. To address these challenges, we propose a multi-scale target-aware representation learning framework (MTRL-FIE) for efficient fundus image enhancement. Specifically, we propose a multi-scale feature encoder (MFE) that employs wavelet decomposition to embed both low-frequency structural information and high-frequency details. Next, we design a structure-preserving hierarchical decoder (SHD) to fuse multi-scale feature embeddings for real fundus image restoration. SHD integrates hierarchical fusion and group attention mechanisms to achieve adaptive feature fusion while retaining local structural smoothness. Meanwhile, a target-aware feature aggregation (TFA) module is used to enhance pathological regions and reduce artifacts. Experimental results on multiple fundus image datasets demonstrate the effectiveness and generalizability of MTRL-FIE for fundus image enhancement. Compared to state-of-the-art methods, MTRL-FIE achieves superior enhancement performance with a more lightweight architecture. Furthermore, our approach generalizes to other ophthalmic image processing tasks without supervised fine-tuning, highlighting its potential for clinical applications.
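To make the wavelet front-end concrete, the snippet below shows a single-level 2D DWT splitting an image into a low-frequency approximation and three high-frequency detail bands, the kind of decomposition MFE builds on; the wavelet family and the single decomposition level are assumptions.

import numpy as np
import pywt

rng = np.random.default_rng(0)
img = rng.random((256, 256))                 # stand-in for a grayscale fundus image
cA, (cH, cV, cD) = pywt.dwt2(img, "haar")    # approximation + detail sub-bands
low_freq = cA                                # low-frequency structural information
high_freq = np.stack([cH, cV, cD], axis=0)   # high-frequency details / edges
print(low_freq.shape, high_freq.shape)       # (128, 128) (3, 128, 128)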
https://arxiv.org/abs/2505.01831
Underwater image enhancement (UIE) is a critical preprocessing step for marine vision applications, where wavelength-dependent attenuation causes severe content degradation and color distortion. While recent state space models such as Mamba show potential for long-range dependency modeling, their unfolding operations and fixed scan paths on 1D sequences fail to adapt to local object semantics and global relation modeling, limiting their efficacy in complex underwater environments. To address this, we enhance conventional Mamba with a sorting-based scanning mechanism that dynamically reorders scanning sequences based on the statistical distribution of the spatial correlation of all pixels. In this way, it encourages the network to prioritize the most informative components, namely structural and semantic features. Building on this mechanism, we devise a Visually Self-adaptive State Block (VSSB) that harmonizes the dynamic sorting of Mamba with input-dependent dynamic convolution, enabling coherent integration of global context and local relational cues. This design helps eliminate global focus bias, especially for widely distributed content, which greatly weakens the statistical frequency. For robust feature extraction and refinement, we design a cross-feature bridge (CFB) to adaptively fuse multi-scale representations. These efforts compose the novel relation-driven Mamba framework for effective UIE (RD-UIE). Extensive experiments on underwater enhancement benchmarks demonstrate that RD-UIE outperforms the state-of-the-art approach WMamba in both quantitative metrics and visual fidelity, achieving an average gain of 0.55 dB across the three benchmarks. Our code is available at this https URL
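The sorting-based scanning can be pictured as: score each pixel token, reorder the sequence by that score before the 1D scan, then undo the ordering. The toy below uses a feature-norm score and an identity "scan" as stand-ins; the real criterion is the spatial-correlation statistic and the real scan is a Mamba block.

import torch

def sorted_scan(feat, seq_model):
    # Reorder tokens by a per-pixel score, run a 1D sequence model, restore order.
    b, c, h, w = feat.shape
    tokens = feat.flatten(2).transpose(1, 2)          # (b, h*w, c)
    score = tokens.norm(dim=-1)                       # toy per-pixel importance proxy
    order = score.argsort(dim=1, descending=True)     # most informative tokens first
    gathered = torch.gather(tokens, 1, order.unsqueeze(-1).expand(-1, -1, c))
    scanned = seq_model(gathered)                     # any 1D sequence model
    restored = torch.empty_like(scanned)
    restored.scatter_(1, order.unsqueeze(-1).expand(-1, -1, c), scanned)
    return restored.transpose(1, 2).reshape(b, c, h, w)

feat = torch.randn(2, 16, 8, 8)
identity = torch.nn.Identity()   # placeholder for a Mamba-style scan block
print(sorted_scan(feat, identity).shape)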
https://arxiv.org/abs/2505.01224
Diabetic retinopathy is a severe eye condition caused by diabetes in which the retinal blood vessels are damaged; if untreated, it can lead to vision loss and blindness. Early and accurate detection is key to intervention and to halting disease progression. To address this, this paper presents a comprehensive approach to automated diabetic retinopathy detection by proposing a new hybrid deep learning model called VR-FuseNet. Because diabetic retinopathy is a major eye disease and a leading cause of blindness, especially among diabetic patients, accurate and efficient automated detection methods are required. To address the limitations of existing methods, including dataset imbalance, diversity, and generalization issues, this paper presents a hybrid dataset created from five publicly available diabetic retinopathy datasets. Essential preprocessing techniques, such as SMOTE for class balancing and CLAHE for image enhancement, are applied systematically to improve the robustness and generalizability of the dataset. The proposed VR-FuseNet model combines the strengths of two state-of-the-art convolutional neural networks: VGG19, which captures fine-grained spatial features, and ResNet50V2, which is known for its deep hierarchical feature extraction. This fusion improves diagnostic performance and achieves an accuracy of 91.824%. The model outperforms the individual architectures on all performance metrics, demonstrating the effectiveness of hybrid feature extraction in diabetic retinopathy classification tasks. To make the proposed model more clinically useful and interpretable, this paper incorporates multiple XAI techniques. These techniques generate visual explanations that clearly indicate the retinal features affecting the model's predictions, such as microaneurysms, hemorrhages, and exudates, so that clinicians can interpret and validate the results.
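For reference, CLAHE as used in the preprocessing stage is typically applied to the lightness channel of a LAB-converted image; the OpenCV snippet below shows this standard recipe (the clip limit and tile size are common defaults, not values reported in the paper).

import cv2
import numpy as np

def apply_clahe(bgr):
    # Contrast-limited adaptive histogram equalisation on the L channel only,
    # so colour is preserved while local contrast is boosted.
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    merged = cv2.merge((clahe.apply(l), a, b))
    return cv2.cvtColor(merged, cv2.COLOR_LAB2BGR)

img = (np.random.rand(256, 256, 3) * 255).astype(np.uint8)  # stand-in fundus image
print(apply_clahe(img).shape)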
https://arxiv.org/abs/2504.21464
The advent of Deep Neural Networks (DNNs) has driven remarkable progress in low-light image enhancement (LLIE), with diverse architectures (e.g., CNNs and Transformers) and color spaces (e.g., sRGB, HSV, HVI) yielding impressive results. Recent efforts have sought to leverage the complementary strengths of these paradigms, offering promising solutions to enhance performance across varying degradation scenarios. However, existing fusion strategies are hindered by challenges such as parameter explosion, optimization instability, and feature misalignment, limiting further improvements. To overcome these issues, we introduce FusionNet, a novel multi-model linear fusion framework that operates in parallel to effectively capture global and local features across diverse color spaces. By incorporating a linear fusion strategy underpinned by Hilbert space theoretical guarantees, FusionNet mitigates network collapse and reduces excessive training costs. Our method achieved 1st place in the CVPR2025 NTIRE Low Light Enhancement Challenge. Extensive experiments conducted on synthetic and real-world benchmark datasets demonstrate that the proposed method significantly outperforms state-of-the-art methods in terms of both quantitative and qualitative results, delivering robust enhancement under diverse low-light conditions.
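A bare-bones picture of multi-model linear fusion is a learned convex combination of the branch outputs, as sketched below; FusionNet's actual formulation (and the Hilbert-space guarantees behind it) is considerably richer, so treat this only as a conceptual stand-in.

import torch
import torch.nn as nn

class LinearFusion(nn.Module):
    # Convex combination of several backbone outputs via softmax weights.
    def __init__(self, num_models):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_models))

    def forward(self, outputs):                 # list of (b, 3, h, w) tensors
        w = torch.softmax(self.logits, dim=0)   # weights sum to 1
        return sum(wi * oi for wi, oi in zip(w, outputs))

# hypothetical branch outputs, e.g. CNN / Transformer / different colour spaces
preds = [torch.rand(1, 3, 64, 64) for _ in range(3)]
print(LinearFusion(3)(preds).shape)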
https://arxiv.org/abs/2504.19295
Recently, learning-based Underwater Image Enhancement (UIE) methods have demonstrated promising performance. However, existing learning-based methods still face two challenges. 1) They rarely consider the inconsistent degradation levels in different spatial regions and spectral bands simultaneously. 2) They treat all regions equally, ignoring that the regions with high-frequency details are more difficult to reconstruct. To address these challenges, we propose a novel UIE method based on spatial-spectral dual-domain adaptive learning, termed SS-UIE. Specifically, we first introduce a spatial-wise Multi-scale Cycle Selective Scan (MCSS) module and a Spectral-Wise Self-Attention (SWSA) module, both with linear complexity, and combine them in parallel to form a basic Spatial-Spectral block (SS-block). Benefiting from the global receptive field of MCSS and SWSA, SS-block can effectively model the degradation levels of different spatial regions and spectral bands, thereby enabling degradation level-based dual-domain adaptive UIE. By stacking multiple SS-blocks, we build our SS-UIE network. Additionally, a Frequency-Wise Loss (FWL) is introduced to narrow the frequency-wise discrepancy and reinforce the model's attention on the regions with high-frequency details. Extensive experiments validate that the SS-UIE technique outperforms state-of-the-art UIE methods while requiring cheaper computational and memory costs.
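A frequency-wise loss is commonly implemented as a distance between Fourier spectra; the sketch below compares amplitude spectra with an L1 penalty as a plausible reading of FWL (the paper's exact formulation and weighting may differ).

import torch

def frequency_loss(pred, target):
    # L1 distance between 2D Fourier amplitude spectra of prediction and target.
    pred_f = torch.fft.fft2(pred, norm="ortho")
    target_f = torch.fft.fft2(target, norm="ortho")
    return (pred_f.abs() - target_f.abs()).abs().mean()

pred = torch.rand(2, 3, 64, 64, requires_grad=True)
target = torch.rand(2, 3, 64, 64)
loss = frequency_loss(pred, target)
loss.backward()
print(float(loss))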
https://arxiv.org/abs/2504.19198
There has long been a belief that learning high-level semantics can benefit various downstream computer vision tasks. However, in the low-light image enhancement (LLIE) community, existing methods learn a brute-force mapping between the low-light and normal-light domains without considering the semantic information of different regions, especially in extremely dark regions that suffer from severe information loss. To address this issue, we propose a new deep semantic prior-guided framework (DeepSPG) based on Retinex image decomposition for LLIE, which explores informative semantic knowledge via a pre-trained semantic segmentation model and multimodal learning. Notably, we incorporate both an image-level semantic prior and a text-level semantic prior, and thus formulate a multimodal learning framework with combinatorial deep semantic prior guidance for LLIE. Specifically, we incorporate semantic knowledge to guide the enhancement process via three designs: image-level semantic prior guidance that leverages hierarchical semantic features from a pre-trained semantic segmentation model; text-level semantic prior guidance that integrates natural language semantic constraints via a pre-trained vision-language model; and a multi-scale semantic-aware structure that facilitates effective semantic feature incorporation. Our proposed DeepSPG demonstrates superior performance compared to state-of-the-art methods across five benchmark datasets. The implementation details and code are publicly available at this https URL.
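For context, a crude classical Retinex split (I = R * L) is sketched below: illumination is estimated from a blurred max-over-channels map and reflectance is the quotient. DeepSPG's decomposition is learned and further guided by the semantic priors; this snippet only illustrates the starting point.

import torch
import torch.nn.functional as F

def retinex_decompose(img, eps=1e-4):
    # Estimate illumination L as a smoothed max-over-channels map and recover
    # reflectance R = I / L. A classical baseline split, not DeepSPG's learned one.
    lum = img.max(dim=1, keepdim=True).values          # (b, 1, h, w)
    kernel = torch.ones(1, 1, 7, 7) / 49.0             # box blur as a cheap smoother
    illum = F.conv2d(lum, kernel, padding=3).clamp_min(eps)
    reflect = (img / illum).clamp(0.0, 1.0)
    return reflect, illum

img = torch.rand(1, 3, 64, 64) * 0.2    # simulated low-light input
reflectance, illumination = retinex_decompose(img)
print(reflectance.shape, illumination.shape)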
https://arxiv.org/abs/2504.19127
Underwater observation systems typically integrate optical cameras and imaging sonar systems. When underwater visibility is insufficient, only sonar systems can provide stable data, which necessitates exploration of the underwater acoustic object tracking (UAOT) task. Previous studies have explored traditional methods and Siamese networks for UAOT. However, the absence of a unified evaluation benchmark has significantly constrained the value of these methods. To alleviate this limitation, we propose the first large-scale UAOT benchmark, SonarT165, comprising 165 square sequences, 165 fan sequences, and 205K high-quality annotations. Experimental results demonstrate that SonarT165 reveals limitations in current state-of-the-art SOT trackers. To address these limitations, we propose STFTrack, an efficient framework for acoustic object tracking. It includes two novel modules: a multi-view template fusion module (MTFM) and an optimal trajectory correction module (OTCM). The MTFM integrates multi-view features of both the original image and the binary image of the dynamic template, and introduces a cross-attention-like layer to fuse the spatio-temporal target representations. The OTCM introduces the acoustic-response-equivalent pixel property and proposes normalized pixel brightness response scores, thereby suppressing suboptimal matches caused by inaccurate Kalman filter prediction boxes. To further improve the model's features, STFTrack introduces an acoustic image enhancement method and a Frequency Enhancement Module (FEM) into its tracking pipeline. Comprehensive experiments show that the proposed STFTrack achieves state-of-the-art performance on the proposed benchmark. The code is available at this https URL.
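The abstract does not define the normalized pixel brightness response precisely, so the snippet below is only a guessed illustration: score a candidate box by its mean sonar intensity normalized by the frame mean, so that dim Kalman-predicted boxes can be down-weighted. The scoring rule is an assumption, not the paper's OTCM.

import numpy as np

def brightness_response_score(sonar_img, box):
    # Mean intensity inside the box relative to the whole frame (hypothetical).
    x, y, w, h = box
    patch = sonar_img[y:y + h, x:x + w]
    return float(patch.mean() / (sonar_img.mean() + 1e-6))

frame = np.zeros((200, 200), dtype=np.float32)
frame[80:120, 80:120] = 1.0                                  # bright acoustic return
print(brightness_response_score(frame, (80, 80, 40, 40)))    # high score (on target)
print(brightness_response_score(frame, (0, 0, 40, 40)))      # low score (background)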
https://arxiv.org/abs/2504.15609
Image enhancement finds wide-ranging applications in real-world scenarios due to complex environments and the inherent limitations of imaging devices. Recent diffusion-based methods yield promising outcomes but necessitate prolonged and computationally intensive iterative sampling. In response, we propose InstaRevive, a straightforward yet powerful image enhancement framework that employs score-based diffusion distillation to harness potent generative capability and minimize the sampling steps. To fully exploit the potential of the pre-trained diffusion model, we devise a practical and effective diffusion distillation pipeline using dynamic control to address inaccuracies in updating direction during score matching. Our control strategy enables a dynamic diffusing scope, facilitating precise learning of denoising trajectories within the diffusion model and ensuring accurate distribution matching gradients during training. Additionally, to enrich guidance for the generative power, we incorporate textual prompts via image captioning as auxiliary conditions, fostering further exploration of the diffusion model. Extensive experiments substantiate the efficacy of our framework across a diverse array of challenging tasks and datasets, unveiling the compelling efficacy and efficiency of InstaRevive in delivering high-quality and visually appealing results. Code is available at this https URL.
https://arxiv.org/abs/2504.15513
While the diffusion transformer (DiT) has become a focal point of interest in recent years, its application to low-light image enhancement remains an unexplored area. Current methods recover details from low-light images while inevitably amplifying the noise in the images, resulting in poor visual quality. In this paper, we first introduce DiT into the low-light enhancement task and design a novel Structure-guided Diffusion Transformer based Low-light image enhancement (SDTL) framework. We compress features via a wavelet transform to improve the inference efficiency of the model and to capture multi-directional frequency bands. We then propose a Structure Enhancement Module (SEM) that uses structural priors to enhance texture and leverages an adaptive fusion strategy to achieve a more accurate enhancement effect. In addition, we propose a Structure-guided Attention Block (SAB) that pays more attention to texture-rich tokens and avoids interference from noisy areas during noise prediction. Extensive qualitative and quantitative experiments demonstrate that our method achieves SOTA performance on several popular datasets, validating the effectiveness of SDTL in improving image quality and the potential of DiT in low-light enhancement tasks.
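One simple way to bias attention toward texture-rich tokens is to add a structure score to the attention logits; the single-head sketch below illustrates that mechanism as a simplified reading of SAB, not its exact formulation (the bias scale and the edge-based score are assumptions).

import torch

def structure_guided_attention(q, k, v, structure, alpha=1.0):
    # Attention whose logits are biased toward structure-rich key tokens, so
    # texture regions receive more weight and flat noisy regions less.
    scale = q.shape[-1] ** -0.5
    logits = q @ k.transpose(-2, -1) * scale          # (b, n, n)
    logits = logits + alpha * structure.unsqueeze(1)  # bias every query row
    attn = torch.softmax(logits, dim=-1)
    return attn @ v

b, n, d = 2, 64, 32
q, k, v = (torch.randn(b, n, d) for _ in range(3))
edge_strength = torch.rand(b, n)                      # per-token structure prior
print(structure_guided_attention(q, k, v, edge_strength).shape)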
https://arxiv.org/abs/2504.15054
Current Low-light Image Enhancement (LLIE) techniques predominantly rely on either direct Low-Light (LL) to Normal-Light (NL) mappings or on guidance from semantic features or illumination maps. Nonetheless, the intrinsic ill-posedness of LLIE and the difficulty of retrieving robust semantics from heavily corrupted images hinder their effectiveness in extremely low-light environments. To tackle this challenge, we present SG-LLIE, a new multi-scale CNN-Transformer hybrid framework guided by structure priors. Instead of employing pre-trained models to extract semantics or illumination maps, we choose to extract robust structure priors based on illumination-invariant edge detectors. Moreover, we develop a CNN-Transformer Hybrid Structure-Guided Feature Extractor (HSGFE) module at each scale within the UNet encoder-decoder architecture. Besides the CNN blocks, which excel in multi-scale feature extraction and fusion, we introduce a Structure-Guided Transformer Block (SGTB) in each HSGFE that incorporates structural priors to modulate the enhancement process. Extensive experiments show that our method achieves state-of-the-art performance on several LLIE benchmarks in both quantitative metrics and visual quality. Our solution ranks second in the NTIRE 2025 Low-Light Enhancement Challenge. Code is released at this https URL.
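An illumination-invariant structure prior can be approximated by taking gradients of the log image, since smooth multiplicative illumination largely cancels there; the Sobel-based sketch below is one such detector and is an assumption about the flavour of the prior, not the paper's exact choice.

import torch
import torch.nn.functional as F

def illumination_invariant_edges(img, eps=1e-3):
    # Sobel gradient magnitude of the log-intensity image: gradients of log(I)
    # are largely unaffected by smooth multiplicative illumination changes.
    gray = img.mean(dim=1, keepdim=True)
    logi = torch.log(gray + eps)
    sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    sobel_y = sobel_x.transpose(2, 3)
    gx = F.conv2d(logi, sobel_x, padding=1)
    gy = F.conv2d(logi, sobel_y, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + eps)

low_light = torch.rand(1, 3, 64, 64) * 0.1
print(illumination_invariant_edges(low_light).shape)   # (1, 1, 64, 64)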
https://arxiv.org/abs/2504.14075
Deep neural networks (DNNs) have recently become the leading approach to low-light image enhancement (LLIE). However, despite significant progress, their outputs may still exhibit issues such as amplified noise, incorrect white balance, or unnatural enhancements when deployed in real-world applications. A key challenge is the lack of diverse, large-scale training data that captures the complexities of low-light conditions and imaging pipelines. In this paper, we propose a novel image signal processing (ISP) driven data synthesis pipeline that addresses these challenges by generating unlimited paired training data. Specifically, our pipeline begins with easily collected high-quality normal-light images, which are first converted back into the RAW format using a reverse ISP. We then synthesize low-light degradations directly in the RAW domain. The resulting data are subsequently processed through a series of ISP stages, including white balance adjustment, color space conversion, tone mapping, and gamma correction, with controlled variations introduced at each stage. This broadens the degradation space and enhances the diversity of the training data, enabling the generated data to capture a wide range of degradations and the complexities inherent in the ISP pipeline. To demonstrate the effectiveness of our synthesis pipeline, we conduct extensive experiments using a vanilla UNet model consisting solely of convolutional layers, group normalization, GeLU activation, and convolutional block attention modules (CBAMs). Extensive testing across multiple datasets reveals that the vanilla UNet model trained with our data synthesis pipeline delivers high-fidelity, visually appealing enhancement results, surpassing state-of-the-art (SOTA) methods both quantitatively and qualitatively.
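A toy version of the synthesis idea is sketched below: darken and add noise in a RAW-like domain, then push both the degraded and the clean signal through the same randomly perturbed forward ISP (white balance, a simple tone curve, gamma). All ranges and the tone curve are illustrative assumptions, and the reverse-ISP step is omitted.

import numpy as np

def synthesize_low_light_pair(raw, rng):
    # Degrade in a RAW-like domain, then apply a shared, randomly perturbed ISP.
    dark = raw * rng.uniform(0.05, 0.3)                             # exposure reduction
    noisy = np.clip(dark + rng.normal(0.0, 0.01, raw.shape), 0, 1)  # sensor noise
    wb = rng.uniform(0.8, 1.2, size=3)                              # white-balance gains
    gamma = rng.uniform(2.0, 2.4)                                   # per-sample gamma

    def forward_isp(x):
        x = np.clip(x * wb, 0, 1)          # white balance
        x = x / (x + 0.25)                 # toy global tone curve
        return x ** (1.0 / gamma)          # gamma correction

    return forward_isp(noisy), forward_isp(raw)   # (low-light input, normal-light target)

rng = np.random.default_rng(0)
raw = rng.random((64, 64, 3))              # stand-in for a RAW image from a reverse ISP
low, normal = synthesize_low_light_pair(raw, rng)
print(low.shape, normal.shape, low.mean() < normal.mean())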
https://arxiv.org/abs/2504.12204