Significant progress has been made in video restoration under rainy conditions over the past decade, largely propelled by advancements in deep learning. Nevertheless, existing methods that depend on paired data struggle to generalize effectively to real-world scenarios, primarily due to the disparity between synthetic and authentic rain effects. To address these limitations, we propose a dual-branch spatio-temporal state-space model to enhance rain streak removal in video sequences. Specifically, we design spatial and temporal state-space model layers to extract spatial features and incorporate temporal dependencies across frames, respectively. To improve multi-frame feature fusion, we derive a dynamic stacking filter, which adaptively approximates statistical filters for superior pixel-wise feature refinement. Moreover, we develop a median stacking loss to enable semi-supervised learning by generating pseudo-clean patches based on the sparsity prior of rain. To further explore the capacity of deraining models in supporting other vision-based tasks in rainy environments, we introduce a novel real-world benchmark focused on object detection and tracking in rainy conditions. Our method is extensively evaluated across multiple benchmarks containing numerous synthetic and real-world rainy videos, consistently demonstrating its superiority in quantitative metrics, visual quality, efficiency, and its utility for downstream tasks.
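The rain-sparsity prior behind the median stacking loss is easy to picture: rain streaks rarely hit the same pixel in consecutive aligned frames, so a temporal median over a short frame window yields a pseudo-clean target. The sketch below is a minimal illustration of that idea under assumed inputs (pre-aligned frames, a generic L1 loss); it is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def pseudo_clean_median(frames: torch.Tensor) -> torch.Tensor:
    """Approximate a rain-free frame with a temporal median.

    frames: (T, C, H, W) window of aligned consecutive frames.
    Because rain streaks are sparse in time, the per-pixel median
    over the window mostly falls on rain-free observations.
    """
    return frames.median(dim=0).values

def median_stacking_loss(pred: torch.Tensor, frames: torch.Tensor) -> torch.Tensor:
    """L1 distance between the derained prediction for the middle frame
    and the pseudo-clean median target (one plausible form of the loss
    described in the abstract, used here as an unpaired supervision signal)."""
    target = pseudo_clean_median(frames).detach()
    return F.l1_loss(pred, target)
```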
https://arxiv.org/abs/2505.16811
Restoring nighttime images affected by multiple adverse weather conditions is a practical yet under-explored research problem, as multiple weather conditions often coexist in the real world alongside various lighting effects at night. This paper first explores the challenging multi-weather nighttime image restoration task, where various types of weather degradations are intertwined with flare effects. To support the research, we contribute the AllWeatherNight dataset, featuring large-scale high-quality nighttime images with diverse compositional degradations, synthesized using our introduced illumination-aware degradation generation. Moreover, we present ClearNight, a unified nighttime image restoration framework, which effectively removes complex degradations in one go. Specifically, ClearNight extracts Retinex-based dual priors and explicitly guides the network to focus on uneven illumination regions and intrinsic texture contents respectively, thereby enhancing restoration effectiveness in nighttime scenarios. In order to better represent the common and unique characteristics of multiple weather degradations, we introduce a weather-aware dynamic specific-commonality collaboration method, which identifies weather degradations and adaptively selects optimal candidate units associated with specific weather types. Our ClearNight achieves state-of-the-art performance on both synthetic and real-world images. Comprehensive ablation experiments validate the necessity of the AllWeatherNight dataset as well as the effectiveness of ClearNight. Project page: this https URL
https://arxiv.org/abs/2505.16479
Recent advancements in video restoration have focused on recovering high-quality video frames from low-quality inputs. Compared with static images, the performance of video restoration significantly depends on efficient exploitation of temporal correlations among successive video frames. Numerous techniques make use of temporal information via flow-based strategies or recurrent architectures. However, these methods often encounter difficulties in preserving temporal consistency as they utilize degraded input video frames. To resolve this issue, we propose a novel video restoration framework named Joint Flow and Feature Refinement using Attention (JFFRA). The proposed JFFRA is based on the key philosophy of iteratively enhancing data through the synergistic collaboration of flow (alignment) and restoration. By leveraging previously enhanced features to refine flow and vice versa, JFFRA enables efficient feature enhancement using temporal information. This interplay between flow and restoration is executed at multiple scales, reducing the dependence on precise flow estimation. Moreover, we incorporate an occlusion-aware temporal loss function to enhance the network's capability in eliminating flickering artifacts. Comprehensive experiments validate the versatility of JFFRA across various restoration tasks such as denoising, deblurring, and super-resolution. Our method demonstrates a remarkable performance improvement of up to 1.62 dB compared to state-of-the-art approaches.
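One plausible form of an occlusion-aware temporal loss is sketched below, assuming bidirectional optical flow between the restored frames is available: the previous output is warped to the current frame, and the photometric difference is penalized only where a forward-backward consistency check marks the pixel as visible. The function names and the consistency threshold are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def warp(img: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp img (N, C, H, W) with a flow field (N, 2, H, W) in pixels."""
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=img.device, dtype=img.dtype),
        torch.arange(w, device=img.device, dtype=img.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]            # sample locations in pixels
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    grid = torch.stack(
        (2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0), dim=-1
    )                                                # normalize to [-1, 1] for grid_sample
    return F.grid_sample(img, grid, align_corners=True)

def visibility_mask(flow_c2p: torch.Tensor, flow_p2c: torch.Tensor,
                    thresh: float = 1.0) -> torch.Tensor:
    """Forward-backward consistency: a pixel is visible if the round-trip
    flow nearly cancels (the threshold is an assumption)."""
    flow_p2c_warped = warp(flow_p2c, flow_c2p)
    err = (flow_c2p + flow_p2c_warped).norm(dim=1, keepdim=True)
    return (err < thresh).float()

def temporal_loss(curr: torch.Tensor, prev: torch.Tensor,
                  flow_c2p: torch.Tensor, flow_p2c: torch.Tensor) -> torch.Tensor:
    """Penalize flicker between consecutive restored frames only in
    non-occluded regions."""
    mask = visibility_mask(flow_c2p, flow_p2c)
    prev_warped = warp(prev, flow_c2p)
    return (mask * (curr - prev_warped).abs()).sum() / (mask.sum() + 1e-6)
```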
https://arxiv.org/abs/2505.16434
This paper reports on the NTIRE 2025 challenge on Text to Image (T2I) generation model quality assessment, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2025. The aim of this challenge is to address the fine-grained quality assessment of text-to-image generation models. This challenge evaluates text-to-image models from two aspects, image-text alignment and image structural distortion detection, and is divided into the alignment track and the structural track. The alignment track uses EvalMuse-40K, which contains around 40K AI-Generated Images (AIGIs) generated by 20 popular generative models. The alignment track has a total of 371 registered participants. A total of 1,883 submissions were received in the development phase, and 507 submissions were received in the test phase. Finally, 12 participating teams submitted their models and fact sheets. The structural track uses EvalMuse-Structure, which contains 10,000 AI-Generated Images (AIGIs) with corresponding structural distortion masks. A total of 211 participants registered in the structural track. A total of 1,155 submissions were received in the development phase, and 487 submissions were received in the test phase. Finally, 8 participating teams submitted their models and fact sheets. Almost all methods achieved better results than the baseline methods, and the winning methods in both tracks demonstrated superior prediction performance on T2I model quality assessment.
https://arxiv.org/abs/2505.16314
Diffusion models have demonstrated promising performance in real-world video super-resolution (VSR). However, the dozens of sampling steps they require make inference extremely slow. Sampling acceleration techniques, particularly single-step sampling, provide a potential solution. Nonetheless, achieving one-step sampling in VSR remains challenging due to the high training overhead on video data and stringent fidelity demands. To tackle the above issues, we propose DOVE, an efficient one-step diffusion model for real-world VSR. DOVE is obtained by fine-tuning a pretrained video diffusion model (*i.e.*, CogVideoX). To effectively train DOVE, we introduce the latent-pixel training strategy. The strategy employs a two-stage scheme to gradually adapt the model to the video super-resolution task. Meanwhile, we design a video processing pipeline to construct a high-quality dataset tailored for VSR, termed HQ-VSR. Fine-tuning on this dataset further enhances the restoration capability of DOVE. Extensive experiments show that DOVE exhibits comparable or superior performance to multi-step diffusion-based VSR methods. It also offers outstanding inference efficiency, achieving up to a **28$\times$** speed-up over existing methods such as MGLD-VSR. Code is available at: this https URL.
https://arxiv.org/abs/2505.16239
Ultra-high-definition (UHD) image restoration aims to specifically solve the problem of quality degradation in ultra-high-resolution images. Recent advancements in this field are predominantly driven by deep learning-based innovations, including enhancements in dataset construction, network architecture, sampling strategies, prior knowledge integration, and loss functions. In this paper, we systematically review recent progress in UHD image restoration, covering various aspects ranging from dataset construction to algorithm design. This serves as a valuable resource for understanding state-of-the-art developments in the field. We begin by summarizing degradation models for various image restoration subproblems, such as super-resolution, low-light enhancement, deblurring, dehazing, deraining, and desnowing, and emphasizing the unique challenges of their application to UHD image restoration. We then highlight existing UHD benchmark datasets and organize the literature according to degradation types and dataset construction methods. Following this, we showcase major milestones in deep learning-driven UHD image restoration, reviewing the progression of restoration tasks, technological developments, and evaluations of existing methods. We further propose a classification framework based on network architectures and sampling strategies, helping to clearly organize existing methods. Finally, we share insights into the current research landscape and propose directions for further advancements. A related repository is available at this https URL.
https://arxiv.org/abs/2505.16161
Transformer-based models have made remarkable progress in image restoration (IR) tasks. However, the quadratic complexity of self-attention in Transformer hinders its applicability to high-resolution images. Existing methods mitigate this issue with sparse or window-based attention, yet inherently limit global context modeling. Linear attention, a variant of softmax attention, demonstrates promise in global context modeling while maintaining linear complexity, offering a potential solution to the above challenge. Despite its efficiency benefits, vanilla linear attention suffers from a significant performance drop in IR, largely due to the low-rank nature of its attention map. To counter this, we propose Rank Enhanced Linear Attention (RELA), a simple yet effective method that enriches feature representations by integrating a lightweight depthwise convolution. Building upon RELA, we propose an efficient and effective image restoration Transformer, named LAformer. LAformer achieves effective global perception by integrating linear attention and channel attention, while also enhancing local fitting capabilities through a convolutional gated feed-forward network. Notably, LAformer eliminates hardware-inefficient operations such as softmax and window shifting, enabling efficient processing of high-resolution images. Extensive experiments across 7 IR tasks and 21 benchmarks demonstrate that LAformer outperforms SOTA methods and offers significant computational advantages.
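A minimal sketch of the rank-enhanced linear attention idea: kernelized linear attention (here with the common elu+1 feature map) whose output is augmented by a lightweight depthwise convolution over the value tensor, the ingredient the abstract credits with counteracting the low-rank attention map. The shapes, feature map, and module layout are assumptions; this is not the released LAformer code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RankEnhancedLinearAttention(nn.Module):
    """Linear attention plus a depthwise-conv branch on V (a sketch of the RELA idea)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.heads = heads
        self.to_qkv = nn.Conv2d(dim, dim * 3, kernel_size=1, bias=False)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)  # lightweight local branch
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=1)

        def split(t):  # (B, C, H, W) -> (B, heads, H*W, C // heads)
            return t.reshape(b, self.heads, c // self.heads, h * w).transpose(-1, -2)

        q, k, v_seq = map(split, (q, k, v))
        q, k = F.elu(q) + 1, F.elu(k) + 1                        # positive feature map -> linear attention
        kv = torch.einsum("bhnd,bhne->bhde", k, v_seq)           # O(N d^2) key-value summary
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
        out = out.transpose(-1, -2).reshape(b, c, h, w)
        out = out + self.dwconv(v)                               # rank-enhancing depthwise branch
        return self.proj(out)
```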
https://arxiv.org/abs/2505.16157
With the increasing size of Large Vision-Language Models (LVLMs), network pruning techniques aimed at compressing models for deployment in resource-constrained environments have garnered significant attention. However, we observe that pruning often leads to a degradation in safety performance. To address this issue, we present a novel and lightweight approach, termed Hierarchical Safety Realignment (HSR). HSR operates by first quantifying the contribution of each attention head to safety, identifying the most critical ones, and then selectively restoring neurons directly within these attention heads that play a pivotal role in maintaining safety. This process hierarchically realigns the safety of pruned LVLMs, progressing from the attention head level to the neuron level. We validate HSR across various models and pruning strategies, consistently achieving notable improvements in safety performance. To our knowledge, this is the first work explicitly focused on restoring safety in LVLMs post-pruning.
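The hierarchical procedure can be summarized in a few lines: score every attention head for its contribution to safety, keep the most critical heads, and inside those heads turn pruned neurons back on. The sketch below assumes precomputed head scores and neuron saliencies and an externally supplied pruning mask; all names and selection heuristics are illustrative, not the paper's exact criteria.

```python
import torch

def hierarchical_safety_realignment(
    prune_mask: dict,          # layer name -> (num_heads, head_dim) binary mask, 0 = pruned
    head_safety_score: dict,   # layer name -> (num_heads,) safety contribution per head (precomputed)
    neuron_saliency: dict,     # layer name -> (num_heads, head_dim) per-neuron safety saliency (precomputed)
    top_heads: int = 4,
    neurons_per_head: int = 16,
) -> dict:
    """Restore a small set of safety-critical neurons inside the most
    safety-critical attention heads of a pruned model (a hedged sketch)."""
    realigned = {}
    for layer, mask in prune_mask.items():
        mask = mask.clone()
        heads = torch.topk(head_safety_score[layer], k=top_heads).indices      # head level
        for h in heads:
            pruned = (mask[h] == 0).nonzero(as_tuple=True)[0]                  # neuron level
            if pruned.numel() == 0:
                continue
            scores = neuron_saliency[layer][h, pruned]
            k = min(neurons_per_head, pruned.numel())
            keep = pruned[torch.topk(scores, k=k).indices]
            mask[h, keep] = 1.0            # un-prune the safety-critical neurons
        realigned[layer] = mask
    return realigned
```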
https://arxiv.org/abs/2505.16104
Reconstructing high-fidelity underwater scenes remains a challenging task due to light absorption, scattering, and limited visibility inherent in aquatic environments. This paper presents an enhanced Gaussian Splatting-based framework that improves both the visual quality and geometric accuracy of deep underwater rendering. We propose decoupled learning for RGB channels, guided by the physics of underwater attenuation, to enable more accurate colour restoration. To address sparse-view limitations and improve view consistency, we introduce a frame interpolation strategy with a novel adaptive weighting scheme. Additionally, we introduce a new loss function aimed at reducing noise while preserving edges, which is essential for deep-sea content. We also release a newly collected dataset, Submerged3D, captured specifically in deep-sea environments. Experimental results demonstrate that our framework consistently outperforms state-of-the-art methods with PSNR gains of up to 1.90 dB, delivering superior perceptual quality and robustness, and offering promising directions for marine robotics and underwater visual analytics.
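The physics motivating the per-channel decoupling is the standard underwater image formation model, in which each RGB channel attenuates with its own coefficient over scene depth. A small sketch of that forward model follows; the coefficients are illustrative placeholders (red attenuates fastest in water), and the paper's integration of this model into Gaussian Splatting is more involved.

```python
import numpy as np

def underwater_formation(clean: np.ndarray, depth: np.ndarray,
                         beta=(0.45, 0.12, 0.06),
                         backscatter=(0.02, 0.08, 0.12)) -> np.ndarray:
    """Simplified underwater image formation (Beer-Lambert style):
    I_c = J_c * exp(-beta_c * d) + B_c * (1 - exp(-beta_c * d)),
    applied independently per RGB channel, which is why learning the
    channels in a decoupled way is physically motivated.

    clean: (H, W, 3) scene radiance in [0, 1]; depth: (H, W) in meters.
    beta / backscatter values here are illustrative, not calibrated.
    """
    out = np.empty_like(clean)
    for c in range(3):
        t = np.exp(-beta[c] * depth)                 # per-channel transmission
        out[..., c] = clean[..., c] * t + backscatter[c] * (1.0 - t)
    return out
```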
https://arxiv.org/abs/2505.15737
We propose a speech enhancement system that combines speaker-agnostic speech restoration with voice conversion (VC) to obtain a studio-level quality speech signal. While voice conversion models are typically used to change speaker characteristics, they can also serve as a means of speech restoration when the target speaker is the same as the source speaker. However, since VC models are vulnerable to noisy conditions, we have included a generative speech restoration (GSR) model at the front end of our proposed system. The GSR model performs noise suppression and restores speech damage incurred during that process without knowledge about the target speaker. The VC stage then uses guidance from clean speaker embeddings to further restore the output speech. By employing this two-stage approach, we have achieved speech quality objective metric scores comparable to state-of-the-art (SOTA) methods across multiple datasets.
https://arxiv.org/abs/2505.15254
Recently, continuous representation methods have emerged as novel paradigms that characterize the intrinsic structures of real-world data through function representations that map positional coordinates to their corresponding values in the continuous space. Compared with the traditional discrete framework, the continuous framework demonstrates inherent superiority for data representation and reconstruction (e.g., image restoration, novel view synthesis, and waveform inversion) by offering advantages including resolution flexibility, cross-modal adaptability, inherent smoothness, and parameter efficiency. In this review, we systematically examine recent advancements in continuous representation frameworks, focusing on three aspects: (i) continuous representation method designs such as basis function representation, statistical modeling, tensor function decomposition, and implicit neural representation; (ii) theoretical foundations of continuous representations such as approximation error analysis, convergence properties, and implicit regularization; (iii) real-world applications of continuous representations drawn from computer vision, graphics, bioinformatics, and remote sensing. Furthermore, we outline future directions and perspectives to inspire further exploration and deepen insights into continuous representation methods, theories, and applications. All referenced works are summarized in our open-source repository: this https URL.
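As a concrete instance of the function-representation idea, the snippet below fits a tiny coordinate MLP (an implicit neural representation) that maps normalized (x, y) positions to RGB values; querying the fitted model on a denser coordinate grid afterwards is what gives the resolution flexibility discussed above. The architecture and training settings are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

class TinyINR(nn.Module):
    """Map continuous (x, y) coordinates in [-1, 1]^2 to an RGB value."""
    def __init__(self, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        return self.net(coords)

def fit_image(image: torch.Tensor, steps: int = 2000) -> TinyINR:
    """Fit the INR to one (H, W, 3) image; afterwards the model can be
    queried at any coordinate grid, independent of the training resolution."""
    h, w, _ = image.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
    target = image.reshape(-1, 3)
    model = TinyINR()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((model(coords) - target) ** 2).mean()
        loss.backward()
        opt.step()
    return model
```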
https://arxiv.org/abs/2505.15222
We consider the problem of sampling from distributions stemming from non-convex potentials with the Unadjusted Langevin Algorithm (ULA). We prove the stability of the discrete-time ULA to drift approximations under the assumption that the potential is strongly convex at infinity. In many contexts, e.g., imaging inverse problems, potentials are non-convex and non-smooth. The Proximal Stochastic Gradient Langevin Algorithm (PSGLA) is a popular algorithm to handle such potentials. It combines the forward-backward optimization algorithm with a ULA step. Our main stability result, combined with properties of the Moreau envelope, allows us to derive the first proof of convergence of PSGLA for non-convex potentials. We empirically validate our methodology on synthetic data and in the context of imaging inverse problems. In particular, we observe that PSGLA exhibits faster convergence rates than the Stochastic Gradient Langevin Algorithm for posterior sampling while preserving its restoration properties.
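For readers unfamiliar with the algorithm, PSGLA alternates a gradient step on the smooth part of the potential, injection of Langevin noise, and a proximal step on the non-smooth part. The sketch below instantiates it for a potential f(x) + λ‖x‖₁ (so the prox is soft-thresholding), using a full rather than stochastic gradient for brevity; the step size and toy problem are placeholders, not the paper's imaging setup.

```python
import numpy as np

def soft_threshold(x: np.ndarray, tau: float) -> np.ndarray:
    """Proximal operator of tau * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def psgla(grad_f, x0: np.ndarray, step: float, lam: float, n_iter: int = 5000, seed: int = 0):
    """Proximal (Stochastic) Gradient Langevin Algorithm for U(x) = f(x) + lam*||x||_1:
    gradient step on f, Gaussian noise injection, then the prox of the
    non-smooth term (a generic sketch)."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    samples = []
    for _ in range(n_iter):
        noise = np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
        x = soft_threshold(x - step * grad_f(x) + noise, step * lam)
        samples.append(x.copy())
    return np.asarray(samples)

# Toy example: posterior with a Gaussian likelihood and a Laplace prior.
A = np.array([[1.0, 0.3], [0.3, 1.0]])
y = np.array([1.0, -0.5])
grad_f = lambda x: A.T @ (A @ x - y)          # gradient of the smooth data term
chain = psgla(grad_f, x0=np.zeros(2), step=0.05, lam=0.5)
print(chain.mean(axis=0))                     # posterior mean estimate
```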
https://arxiv.org/abs/2505.14177
In this paper, we propose an efficient visual transformer framework for ultra-high-definition (UHD) image dehazing that addresses the key challenges of slow training speed and high memory consumption in existing methods. Our approach introduces two key innovations: 1) an **a**daptive **n**ormalization mechanism inspired by the nGPT architecture that enables ultra-fast and stable training of a network with a restricted range of parameter expressions; and 2) an atmospheric scattering-aware KV caching mechanism that dynamically optimizes feature preservation based on the physical haze formation model. The proposed architecture improves the training convergence speed by **5×** while reducing memory overhead, enabling real-time processing of 50 high-resolution images per second on an RTX 4090 GPU. Experimental results show that our approach maintains state-of-the-art dehazing quality while significantly improving computational efficiency for 4K/8K image restoration tasks. Furthermore, we provide a new interpretability method for dehazing with the help of an integrated gradient attribution map. Our code can be found here: this https URL.
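A rough sketch of what an nGPT-inspired restriction on parameter expressions can look like: weight rows and activations are kept on the unit hypersphere, so every projection reduces to bounded cosine similarities and training stays numerically stable. The module below is a guess at the general mechanism, not the paper's actual normalization layer, and the scattering-aware KV caching component is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnitNormLinear(nn.Module):
    """Linear layer whose weight rows are re-normalized to unit L2 norm,
    so its outputs are bounded cosine similarities scaled by a learnable gain
    (an nGPT-style restriction on what the parameters can express)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) / in_dim ** 0.5)
        self.scale = nn.Parameter(torch.ones(out_dim))     # learnable per-output gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = F.normalize(self.weight, dim=1)                # keep weight rows on the unit sphere
        x = F.normalize(x, dim=-1)                         # keep activations on the sphere too
        return self.scale * F.linear(x, w)
```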
https://arxiv.org/abs/2505.14010
Ultrasound imaging is widely applied in clinical practice, yet ultrasound videos often suffer from low signal-to-noise ratios (SNR) and limited resolutions, posing challenges for diagnosis and analysis. Variations in equipment and acquisition settings can further exacerbate differences in data distribution and noise levels, reducing the generalizability of pre-trained models. This work presents a self-supervised ultrasound video super-resolution algorithm called Deep Ultrasound Prior (DUP). DUP employs a video-adaptive optimization process of a neural network that enhances the resolution of given ultrasound videos without requiring paired training data while simultaneously removing noise. Quantitative and visual evaluations demonstrate that DUP outperforms existing super-resolution algorithms, leading to substantial improvements for downstream applications.
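The video-adaptive optimization in DUP is in the spirit of deep-prior methods: a randomly initialized network is optimized on the test data itself so that its output, once degraded back to the observation space, matches the observed low-resolution, noisy frames, with no paired training data involved. The loop below is a per-frame caricature of that idea with an assumed 2x scale factor and a generic CNN; the actual DUP model and degradation operator are more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def self_supervised_sr(lr_frame: torch.Tensor, scale: int = 2, steps: int = 1000) -> torch.Tensor:
    """Fit a small CNN to a single low-resolution frame (1, C, h, w) so that
    downsampling its high-resolution output reproduces the observation."""
    channels = lr_frame.shape[1]
    net = nn.Sequential(
        nn.Conv2d(channels, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, channels, 3, padding=1),
    )
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    upsampled = F.interpolate(lr_frame, scale_factor=scale, mode="bilinear", align_corners=False)
    for _ in range(steps):
        opt.zero_grad()
        hr = net(upsampled) + upsampled                         # residual over the upsampled input
        redegraded = F.interpolate(hr, size=lr_frame.shape[-2:], mode="bilinear", align_corners=False)
        loss = F.mse_loss(redegraded, lr_frame)                 # consistency with the observed frame
        loss.backward()
        opt.step()
    with torch.no_grad():
        return net(upsampled) + upsampled
```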
https://arxiv.org/abs/2505.13915
Multi-task and multilingual approaches benefit large models, yet speech processing for low-resource languages remains underexplored due to data scarcity. To address this, we present Granary, a large-scale collection of speech datasets for recognition and translation across 25 European languages. This is the first open-source effort at this scale for both transcription and translation. We enhance data quality using a pseudo-labeling pipeline with segmentation, two-pass inference, hallucination filtering, and punctuation restoration. We further generate translation pairs from pseudo-labeled transcriptions using EuroLLM, followed by a data filtration pipeline. Designed for efficiency, our pipeline processes vast amounts of data within hours. We assess models trained on the processed data by comparing their performance on previously curated datasets for both high- and low-resource languages. Our findings show that these models achieve similar performance using approximately 50% less data. The dataset will be made available at this https URL
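Two of the filtering steps in such a pseudo-labeling pipeline are easy to illustrate: dropping segments whose transcript length is implausible for the audio duration, and dropping segments where the model loops on the same n-gram, a common hallucination symptom. The thresholds and heuristics below are illustrative assumptions, not Granary's published criteria.

```python
from collections import Counter

def plausible_length(text: str, audio_seconds: float,
                     min_cps: float = 2.0, max_cps: float = 30.0) -> bool:
    """Reject transcripts whose characters-per-second rate is implausible."""
    cps = len(text) / max(audio_seconds, 1e-3)
    return min_cps <= cps <= max_cps

def has_ngram_loop(text: str, n: int = 3, max_repeat: int = 4) -> bool:
    """Flag transcripts in which a single n-gram dominates (hallucinated loops)."""
    words = text.split()
    if len(words) < n:
        return False
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts.most_common(1)[0][1] > max_repeat

def keep_segment(text: str, audio_seconds: float) -> bool:
    """Combined filter applied to each pseudo-labeled segment."""
    return plausible_length(text, audio_seconds) and not has_ngram_loop(text)
```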
https://arxiv.org/abs/2505.13404
One of the major challenges in the field of computer vision, especially for detection, segmentation, recognition, monitoring, and automated solutions, is the quality of images. Image degradation, often caused by factors such as rain, fog, and lighting, has a negative impact on automated tasks. Currently, several image restoration solutions exist, including restoration models for single degradation and restoration models for multiple degradations. However, these solutions are not suitable for real-time processing. In this study, the aim was to develop a real-time image restoration solution for video surveillance. To achieve this, using transfer learning with ResNet_50, we developed a model for automatically identifying the types of degradation present in an image in order to select the necessary treatment(s) for image restoration. Our solution has the advantage of being flexible and scalable.
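The degradation-identification model described here follows a standard transfer-learning recipe, sketched below: load an ImageNet-pretrained ResNet-50, swap its classifier head for one over degradation types, and fine-tune. The label set, single-label formulation, and training details are assumptions for illustration; the paper may handle co-occurring degradations differently.

```python
import torch
import torch.nn as nn
from torchvision import models

DEGRADATIONS = ["clean", "rain", "fog", "low_light"]   # illustrative label set

def build_degradation_classifier(freeze_backbone: bool = True) -> nn.Module:
    """ResNet-50 transfer learning for degradation-type identification."""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    if freeze_backbone:
        for p in model.parameters():
            p.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, len(DEGRADATIONS))   # new classification head
    return model

# One training step on a batch: images (N, 3, H, W), labels (N,)
model = build_degradation_classifier()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```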
https://arxiv.org/abs/2505.13130
Polarization images provide rich physical information that is fundamentally absent from standard RGB images, benefiting a wide range of computer vision applications such as reflection separation and material classification. However, the acquisition of polarization images typically requires additional optical components, which increases both the cost and the complexity of the applications. To bridge this gap, we introduce a new task: RGB-to-polarization image estimation, which aims to infer polarization information directly from RGB images. In this work, we establish the first comprehensive benchmark for this task by leveraging existing polarization datasets and evaluating a diverse set of state-of-the-art deep learning models, including both restoration-oriented and generative architectures. Through extensive quantitative and qualitative analysis, our benchmark not only establishes the current performance ceiling of RGB-to-polarization estimation, but also systematically reveals the respective strengths and limitations of different model families -- such as direct reconstruction versus generative synthesis, and task-specific training versus large-scale pre-training. In addition, we provide some potential directions for future research on polarization estimation. This benchmark is intended to serve as a foundational resource to facilitate the design and evaluation of future methods for polarization estimation from standard RGB inputs.
https://arxiv.org/abs/2505.13050
There is a growing interest in the use of latent diffusion models (LDMs) for image restoration (IR) tasks due to their ability to model effectively the distribution of natural images. While significant progress has been made, there are still key challenges that need to be addressed. First, many approaches depend on a predefined degradation operator, making them ill-suited for complex or unknown degradations that deviate from standard analytical models. Second, many methods struggle to provide stable guidance in the latent space. Finally, most methods convert latent representations back to the pixel domain for guidance at every sampling iteration, which significantly increases computational and memory overhead. To overcome these limitations, we introduce a wavelet-inspired invertible neural network (INN) that simulates degradations through a forward transform and reconstructs lost details via the inverse transform. We further integrate this design into a latent diffusion pipeline through two proposed approaches: LatentINDIGO-PixelINN, which operates in the pixel domain, and LatentINDIGO-LatentINN, which stays fully in the latent space to reduce complexity. Both approaches alternate between updating intermediate latent variables under the guidance of our INN and refining the INN forward model to handle unknown degradations. In addition, a regularization step preserves the proximity of latent variables to the natural image manifold. Experiments demonstrate that our algorithm achieves state-of-the-art performance on synthetic and real-world low-quality images, and can be readily adapted to arbitrary output sizes.
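Invertibility of the kind described here can be obtained with a lifting-scheme coupling: the input is split into two halves (analogous to coarse and detail wavelet bands), and each half is additively updated by a function of the other, so the inverse simply replays the updates with opposite signs. The block below is a generic sketch of that construction, not the proposed INN.

```python
import torch
import torch.nn as nn

class LiftingCouplingBlock(nn.Module):
    """Wavelet-style invertible block: split channels, predict, update."""

    def __init__(self, channels: int):
        super().__init__()
        assert channels % 2 == 0
        half = channels // 2

        def small_cnn():
            return nn.Sequential(
                nn.Conv2d(half, half, 3, padding=1), nn.ReLU(),
                nn.Conv2d(half, half, 3, padding=1),
            )

        self.predict = small_cnn()   # predicts the "detail" half from the "coarse" half
        self.update = small_cnn()    # refines the coarse half with the predicted details

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=1)
        y1 = x1 + self.predict(x2)
        y2 = x2 + self.update(y1)
        return torch.cat([y1, y2], dim=1)

    def inverse(self, y: torch.Tensor) -> torch.Tensor:
        y1, y2 = y.chunk(2, dim=1)
        x2 = y2 - self.update(y1)
        x1 = y1 - self.predict(x2)
        return torch.cat([x1, x2], dim=1)

# Round-trip check: inverse(forward(x)) reconstructs x up to float error.
block = LiftingCouplingBlock(8)
x = torch.randn(1, 8, 16, 16)
assert torch.allclose(block.inverse(block(x)), x, atol=1e-5)
```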
https://arxiv.org/abs/2505.12935
Image degradation synthesis is highly desirable in a wide variety of applications ranging from image restoration to simulating artistic effects. Existing models are designed to generate one specific degradation or a narrow set of degradations, and often require user-provided degradation parameters. As a result, they lack the generalizability to synthesize degradations beyond their initial design or adapt to other applications. Here we propose the first universal degradation model that can synthesize a broad spectrum of complex and realistic degradations containing both homogeneous (global) and inhomogeneous (spatially varying) components. Our model automatically extracts and disentangles homogeneous and inhomogeneous degradation features, which are later used for degradation synthesis without user intervention. A disentangle-by-compression method is proposed to separate degradation information from images. Two novel modules for extracting and incorporating inhomogeneous degradations are created to model inhomogeneous components in complex degradations. We demonstrate the model's accuracy and adaptability in film-grain simulation and blind image restoration tasks. The demo video, code, and dataset of this project will be released upon publication at this http URL.
https://arxiv.org/abs/2505.12860
All-in-one image restoration aims to recover clear images from various degradation types and levels with a unified model. Nonetheless, the significant variations among degradation types present challenges for training a universal model, often resulting in task interference, where the gradient update directions of different tasks may diverge due to shared parameters. To address this issue, motivated by the routing strategy, we propose DFPIR, a novel all-in-one image restorer that introduces Degradation-aware Feature Perturbations (DFP) to adjust the feature space to align with the unified parameter space. In this paper, the feature perturbations primarily include channel-wise perturbations and attention-wise perturbations. Specifically, channel-wise perturbations are implemented by shuffling the channels in high-dimensional space guided by degradation types, while attention-wise perturbations are achieved through selective masking in the attention space. To achieve these goals, we propose a Degradation-Guided Perturbation Block (DGPB) to implement these two functions, positioned between the encoding and decoding stages of the encoder-decoder architecture. Extensive experimental results demonstrate that DFPIR achieves state-of-the-art performance on several all-in-one image restoration tasks including image denoising, image dehazing, image deraining, motion deblurring, and low-light image enhancement. Our codes are available at this https URL.
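The channel-wise perturbation can be pictured as a degradation-conditioned permutation of feature channels, and the attention-wise perturbation as masking a degradation-dependent subset of attention logits, as sketched below. The seeding scheme and mask ratio are illustrative guesses, not DFPIR's exact design.

```python
import torch

DEGRADATION_IDS = {"noise": 0, "haze": 1, "rain": 2, "blur": 3, "low_light": 4}

def channel_perturbation(feat: torch.Tensor, degradation: str) -> torch.Tensor:
    """Shuffle channels of feat (N, C, H, W) with a permutation that is fixed
    per degradation type, nudging each task toward its own region of the
    shared feature space."""
    g = torch.Generator().manual_seed(DEGRADATION_IDS[degradation])
    perm = torch.randperm(feat.shape[1], generator=g).to(feat.device)
    return feat[:, perm]

def attention_perturbation(attn: torch.Tensor, degradation: str,
                           drop_ratio: float = 0.1) -> torch.Tensor:
    """Selectively mask a degradation-dependent subset of attention logits
    (set to -inf before the softmax)."""
    g = torch.Generator().manual_seed(100 + DEGRADATION_IDS[degradation])
    mask = (torch.rand(attn.shape[-1], generator=g) < drop_ratio).to(attn.device)
    return attn.masked_fill(mask, float("-inf"))
```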
https://arxiv.org/abs/2505.12630