Recent advancements in video restoration have focused on recovering high-quality video frames from low-quality inputs. Compared with static images, the performance of video restoration depends significantly on efficient exploitation of temporal correlations among successive video frames. Numerous techniques exploit this temporal information via flow-based strategies or recurrent architectures. However, these methods often struggle to preserve temporal consistency because they operate on degraded input video frames. To resolve this issue, we propose a novel video restoration framework named Joint Flow and Feature Refinement using Attention (JFFRA). JFFRA is based on the key philosophy of iteratively enhancing data through the synergistic collaboration of flow (alignment) and restoration. By leveraging previously enhanced features to refine flow and vice versa, JFFRA enables efficient feature enhancement using temporal information. This interplay between flow and restoration is executed at multiple scales, reducing the dependence on precise flow estimation. Moreover, we incorporate an occlusion-aware temporal loss function to enhance the network's capability to eliminate flickering artifacts. Comprehensive experiments validate the versatility of JFFRA across various restoration tasks such as denoising, deblurring, and super-resolution. Our method demonstrates a remarkable performance improvement of up to 1.62 dB over state-of-the-art approaches.
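To make the flow↔restoration interplay concrete, here is a minimal single-scale sketch of one mutual-refinement step; `flow_refiner`, `restore_block`, and the warping details are hypothetical stand-ins rather than the authors' actual modules.

```python
import torch
import torch.nn.functional as F

def warp(feat, flow):
    # Backward-warp features (N,C,H,W) by an optical flow field (N,2,H,W) via grid_sample.
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feat.device)   # (2,H,W), x then y
    coords = grid.unsqueeze(0) + flow                             # displaced sampling positions
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0                       # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return F.grid_sample(feat, torch.stack((gx, gy), dim=-1), align_corners=True)

def jffra_step(prev_feat, cur_feat, flow, flow_refiner, restore_block):
    """One flow<->feature refinement iteration at a single scale (illustrative)."""
    # Refine the flow using the already-enhanced features (flow_refiner outputs a flow residual).
    flow = flow + flow_refiner(torch.cat([prev_feat, cur_feat, flow], dim=1))
    # Align the previously enhanced features, then refine the current features with them.
    aligned_prev = warp(prev_feat, flow)
    cur_feat = cur_feat + restore_block(torch.cat([cur_feat, aligned_prev], dim=1))
    return cur_feat, flow
```

Per the abstract, this loop is executed at multiple scales, which is what reduces the dependence on precise flow estimation.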
https://arxiv.org/abs/2505.16434
As vision-based machine learning models are increasingly integrated into autonomous and cyber-physical systems, concerns about (physical) adversarial patch attacks are growing. While state-of-the-art defenses can achieve certified robustness with minimal impact on utility against highly concentrated localized patch attacks, they fall short in two important areas: (i) state-of-the-art methods are vulnerable to low-noise distributed patches, where perturbations are subtly dispersed to evade detection or masking, as shown recently by the DorPatch attack; (ii) achieving high robustness with state-of-the-art methods is extremely time- and resource-consuming, rendering them impractical for latency-sensitive applications in many cyber-physical systems. To address both robustness and latency issues, this paper proposes SuperPure, a new defense strategy against adversarial patch attacks. The key novelty is a pixel-wise masking scheme that is robust against both distributed and localized patches. The masking leverages a GAN-based super-resolution scheme to gradually purify the image of adversarial patches. Our extensive evaluations using ImageNet and two standard classifiers, ResNet and EfficientNet, show that SuperPure advances the state-of-the-art in three major directions: (i) it improves robustness against conventional localized patches by more than 20% on average, while also improving top-1 clean accuracy by almost 10%; (ii) it achieves 58% robustness against distributed patch attacks (as opposed to 0% for the state-of-the-art method, PatchCleanser); (iii) it decreases the defense's end-to-end latency by over 98% compared to PatchCleanser. Our further analysis shows that SuperPure is robust against white-box attacks and different patch sizes. Our code is open-source.
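The abstract gives only the high-level recipe (iterative GAN-based super-resolution plus pixel-wise masking), so the following is a heavily hedged sketch of that recipe; the down/up-sampling loop, the residual-quantile masking rule, and `sr_model` are my own illustrative choices, not SuperPure's actual procedure.

```python
import torch
import torch.nn.functional as F

def purify(img, sr_model, num_rounds=5, mask_quantile=0.99):
    """Gradually purify an image by passing it through a GAN-based SR round-trip
    and masking the pixels the reconstruction disagrees with most (sketch only)."""
    x = img.clone()                                            # (1,3,H,W) in [0,1]
    for _ in range(num_rounds):
        low = F.interpolate(x, scale_factor=0.5, mode="bilinear", align_corners=False)
        rec = sr_model(low)                                    # super-resolve back up
        rec = F.interpolate(rec, size=x.shape[-2:], mode="bilinear", align_corners=False)
        residual = (rec - x).abs().mean(dim=1, keepdim=True)   # per-pixel disagreement
        mask = (residual > torch.quantile(residual, mask_quantile)).float()
        x = mask * rec + (1 - mask) * x                        # overwrite only suspicious pixels
    return x
```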
https://arxiv.org/abs/2505.16318
Diffusion models have demonstrated promising performance in real-world video super-resolution (VSR). However, the dozens of sampling steps they require make inference extremely slow. Sampling acceleration techniques, particularly single-step ones, provide a potential solution. Nonetheless, achieving one-step sampling in VSR remains challenging, due to the high training overhead on video data and stringent fidelity demands. To tackle these issues, we propose DOVE, an efficient one-step diffusion model for real-world VSR. DOVE is obtained by fine-tuning a pretrained video diffusion model (*i.e.*, CogVideoX). To train DOVE effectively, we introduce a latent-pixel training strategy, which employs a two-stage scheme to gradually adapt the model to the video super-resolution task. Meanwhile, we design a video processing pipeline to construct a high-quality dataset tailored for VSR, termed HQ-VSR. Fine-tuning on this dataset further enhances the restoration capability of DOVE. Extensive experiments show that DOVE exhibits comparable or superior performance to multi-step diffusion-based VSR methods. It also offers outstanding inference efficiency, achieving up to a **28$\times$** speed-up over existing methods such as MGLD-VSR. Code is available at: this https URL.
https://arxiv.org/abs/2505.16239
Ultra-high-definition (UHD) image restoration aims to specifically solve the problem of quality degradation in ultra-high-resolution images. Recent advancements in this field are predominantly driven by deep learning-based innovations, including enhancements in dataset construction, network architecture, sampling strategies, prior knowledge integration, and loss functions. In this paper, we systematically review recent progress in UHD image restoration, covering various aspects ranging from dataset construction to algorithm design. This serves as a valuable resource for understanding state-of-the-art developments in the field. We begin by summarizing degradation models for various image restoration subproblems, such as super-resolution, low-light enhancement, deblurring, dehazing, deraining, and desnowing, and emphasizing the unique challenges of their application to UHD image restoration. We then highlight existing UHD benchmark datasets and organize the literature according to degradation types and dataset construction methods. Following this, we showcase major milestones in deep learning-driven UHD image restoration, reviewing the progression of restoration tasks, technological developments, and evaluations of existing methods. We further propose a classification framework based on network architectures and sampling strategies, helping to clearly organize existing methods. Finally, we share insights into the current research landscape and propose directions for further advancements. A related repository is available at this https URL.
https://arxiv.org/abs/2505.16161
We consider the limits of super-resolution using imaging constraints. Due to various theoretical and practical limitations, reconstruction-based methods have been largely restricted to small increases in resolution. In addition, motion-blur is usually seen as a nuisance that impedes super-resolution. We show that by using high-precision motion information, sparse image priors, and convex optimization, it is possible to increase resolution by large factors. A key operation in super-resolution is deconvolution with a box. In general, convolution with a box is not invertible. However, we obtain perfect reconstructions of sparse signals using convex optimization. We also show that motion blur can be helpful for super-resolution. We demonstrate that using pseudo-random motion it is possible to reconstruct a high-resolution target using a single low-resolution image. We present numerical experiments with simulated data and results with real data captured by a camera mounted on a computer controlled stage.
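As an illustration of the claim that sparse signals survive a non-invertible box convolution, here is a small self-contained basis-pursuit demo (my own toy example, not the authors' code); with a handful of spikes, the $\ell_1$-minimal solution typically matches the ground truth.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, box_len, n_spikes = 200, 8, 5            # box_len divides n -> circular box blur is singular

# Circulant box-blur operator: column i is a box of ones starting at sample i.
c = np.zeros(n)
c[:box_len] = 1.0 / box_len
A = np.array([np.roll(c, i) for i in range(n)]).T
print("rank of blur operator:", np.linalg.matrix_rank(A))   # strictly less than n

# Sparse ground-truth signal and its (non-invertible) blurred observation.
x_true = np.zeros(n)
x_true[rng.choice(n, n_spikes, replace=False)] = rng.standard_normal(n_spikes) + 2.0
y = A @ x_true

# Basis pursuit: l1-minimal signal exactly consistent with the blurred data.
x = cp.Variable(n)
cp.Problem(cp.Minimize(cp.norm1(x)), [A @ x == y]).solve()
print("max abs reconstruction error:", np.max(np.abs(x.value - x_true)))
```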
https://arxiv.org/abs/2505.15961
With the widespread application of super-resolution (SR) in various fields, researchers have begun to investigate its security. Previous studies have demonstrated that SR models can also be subjected to backdoor attacks through data poisoning, affecting downstream tasks. A backdoor SR model generates an attacker-predefined target image when given a triggered image while producing a normal high-resolution (HR) output for clean images. However, prior backdoor attacks on SR models have primarily focused on the stealthiness of poisoned low-resolution (LR) images while ignoring the stealthiness of poisoned HR images, making it easy for users to detect anomalous data. To address this problem, we propose BadSR, which improves the stealthiness of poisoned HR images. The key idea of BadSR is to approximate the clean HR image and the pre-defined target image in the feature space while ensuring that modifications to the clean HR image remain within a constrained range. The poisoned HR images generated by BadSR can be integrated with existing triggers. To further improve the effectiveness of BadSR, we design an adversarially optimized trigger and a backdoor gradient-driven poisoned sample selection method based on a genetic algorithm. The experimental results show that BadSR achieves a high attack success rate in various models and data sets, significantly affecting downstream tasks.
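A hedged sketch of the stated key idea for crafting poisoned HR images: push the clean HR image toward the predefined target in feature space while keeping the pixel-space change inside a small $L_\infty$ ball. The PGD-style loop, the feature extractor, and the budget values are illustrative assumptions, not BadSR's exact procedure.

```python
import torch
import torch.nn.functional as F

def craft_poisoned_hr(hr_clean, hr_target, feat_extractor, eps=4/255, steps=50, lr=1/255):
    """Move the clean HR image toward the target in feature space while constraining
    the pixel-space modification to an l_inf ball of radius eps (sketch)."""
    delta = torch.zeros_like(hr_clean, requires_grad=True)
    with torch.no_grad():
        target_feat = feat_extractor(hr_target)
    for _ in range(steps):
        feat = feat_extractor(hr_clean + delta)
        loss = F.mse_loss(feat, target_feat)          # feature-space proximity to the target
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()           # signed gradient step toward the target
            delta.clamp_(-eps, eps)                   # stealthiness constraint on the HR pixels
            delta.grad.zero_()
    return (hr_clean + delta).clamp(0, 1).detach()
```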
https://arxiv.org/abs/2505.15308
We propose an OCT super-resolution framework based on a plug-and-play diffusion model (PnP-DM) to reconstruct high-quality images from sparse measurements (OCT B-mode corneal images). Our method formulates reconstruction as an inverse problem, combining a diffusion prior with Markov chain Monte Carlo sampling for efficient posterior inference. We collect high-speed under-sampled B-mode corneal images and apply a deep learning-based up-sampling pipeline to build realistic training pairs. Evaluations on in vivo and ex vivo fish-eye corneal models show that PnP-DM outperforms conventional 2D-UNet baselines, producing sharper structures and better noise suppression. This approach advances high-fidelity OCT imaging in high-speed acquisition for clinical applications.
https://arxiv.org/abs/2505.14916
DeepSeek-R1 has demonstrated remarkable effectiveness in incentivizing reasoning and generalization capabilities of large language models (LLMs) through reinforcement learning. Nevertheless, the potential of reasoning-induced computational modeling has not been thoroughly explored in the context of image quality assessment (IQA), a task critically dependent on visual reasoning. In this paper, we introduce VisualQuality-R1, a reasoning-induced no-reference IQA (NR-IQA) model, and we train it with reinforcement learning to rank, a learning algorithm tailored to the intrinsically relative nature of visual quality. Specifically, for a pair of images, we employ group relative policy optimization to generate multiple quality scores for each image. These estimates are then used to compute comparative probabilities of one image having higher quality than the other under the Thurstone model. Rewards for each quality estimate are defined using continuous fidelity measures rather than discretized binary labels. Extensive experiments show that the proposed VisualQuality-R1 consistently outperforms discriminative deep learning-based NR-IQA models as well as a recent reasoning-induced quality regression method. Moreover, VisualQuality-R1 is capable of generating contextually rich, human-aligned quality descriptions, and supports multi-dataset training without requiring perceptual scale realignment. These features make VisualQuality-R1 especially well-suited for reliably measuring progress in a wide range of image processing tasks like super-resolution and image generation.
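The ranking formulation is concrete enough for a small sketch: sampled scores for two images are turned into a Thurstone-style preference probability, which is then rewarded with a continuous fidelity measure. The unit-variance/$\sqrt{2}$ scaling and the exact fidelity formula below are common conventions and should be read as assumptions, not the paper's exact definitions.

```python
import torch

def thurstone_prob(scores_a, scores_b):
    """P(image A has higher quality than image B) under a Thurstone-style model,
    using the mean of several sampled quality scores per image."""
    mu_a, mu_b = scores_a.mean(), scores_b.mean()
    normal = torch.distributions.Normal(0.0, 1.0)
    return normal.cdf((mu_a - mu_b) / (2.0 ** 0.5))   # unit-variance comparison noise assumed

def fidelity_reward(p_pred, p_true):
    """Continuous fidelity between predicted and ground-truth preference probabilities,
    used instead of a discretized binary reward."""
    return torch.sqrt(p_pred * p_true) + torch.sqrt((1 - p_pred) * (1 - p_true))

# Toy usage: four sampled scores per image, with a human preference of A over B.
scores_a = torch.tensor([3.9, 4.2, 4.0, 4.1])
scores_b = torch.tensor([3.1, 3.4, 3.0, 3.3])
p = thurstone_prob(scores_a, scores_b)
print(float(p), float(fidelity_reward(p, torch.tensor(1.0))))
```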
https://arxiv.org/abs/2505.14460
Intelligent game creation represents a transformative advancement in game development, utilizing generative artificial intelligence to dynamically generate and enhance game content. Despite notable progress in generative models, the comprehensive synthesis of high-quality game assets, including both images and videos, remains a challenging frontier. To create high-fidelity game content that simultaneously aligns with player preferences and significantly boosts designer efficiency, we present Hunyuan-Game, an innovative project designed to revolutionize intelligent game production. Hunyuan-Game encompasses two primary branches: image generation and video generation. The image generation component is built upon a vast dataset comprising billions of game images, leading to the development of a group of customized image generation models tailored for game scenarios: (1) General Text-to-Image Generation. (2) Game Visual Effects Generation, involving text-to-effect and reference image-based game visual effect generation. (3) Transparent Image Generation for characters, scenes, and game visual effects. (4) Game Character Generation based on sketches, black-and-white images, and white models. The video generation component is built upon a comprehensive dataset of millions of game and anime videos, leading to the development of five core algorithmic models, each targeting critical pain points in game development and having robust adaptation to diverse game video scenarios: (1) Image-to-Video Generation. (2) 360 A/T Pose Avatar Video Synthesis. (3) Dynamic Illustration Generation. (4) Generative Video Super-Resolution. (5) Interactive Game Video Generation. These image and video generation models not only exhibit high-level aesthetic expression but also deeply integrate domain-specific knowledge, establishing a systematic understanding of diverse game and anime art styles.
https://arxiv.org/abs/2505.14135
This paper introduces a comprehensive end-to-end pipeline for Optical Character Recognition (OCR) on Urdu newspapers. In our approach, we address the unique challenges of complex multi-column layouts, low-resolution archival scans, and diverse font styles. Our process decomposes the OCR task into four key modules: (1) article segmentation, (2) image super-resolution, (3) column segmentation, and (4) text recognition. For article segmentation, we fine-tune and evaluate YOLOv11x to identify and separate individual articles from cluttered layouts; our model achieves a precision of 0.963 and mAP@50 of 0.975. For super-resolution, we fine-tune and benchmark the SwinIR model (reaching 32.71 dB PSNR) to enhance the quality of degraded newspaper scans. For column segmentation, we again use YOLOv11x to separate text columns, further improving performance; this model reaches a precision of 0.970 and mAP@50 of 0.975. In the text recognition stage, we benchmark a range of LLMs from different families, including Gemini, GPT, Llama, and Claude. The lowest WER of 0.133 is achieved by Gemini-2.5-Pro.
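Read as a pipeline contract, the four modules compose straightforwardly; the skeleton below is a sketch in which every callable (article detector, SwinIR enhancer, column detector, LLM recognizer) is a hypothetical stand-in to be swapped for the fine-tuned models described above.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PageOCRPipeline:
    segment_articles: Callable   # stand-in for the fine-tuned YOLOv11x article detector
    enhance: Callable            # stand-in for the fine-tuned SwinIR super-resolution model
    segment_columns: Callable    # stand-in for the fine-tuned YOLOv11x column detector
    recognize_text: Callable     # stand-in for an LLM recognizer, e.g. Gemini-2.5-Pro

    def run(self, page_image) -> List[str]:
        articles = []
        for article_crop in self.segment_articles(page_image):       # (1) article segmentation
            article_crop = self.enhance(article_crop)                 # (2) image super-resolution
            columns = self.segment_columns(article_crop)              # (3) column segmentation
            text = "\n".join(self.recognize_text(col) for col in columns)  # (4) text recognition
            articles.append(text)
        return articles
```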
https://arxiv.org/abs/2505.13943
Ultrasound imaging is widely applied in clinical practice, yet ultrasound videos often suffer from low signal-to-noise ratios (SNR) and limited resolutions, posing challenges for diagnosis and analysis. Variations in equipment and acquisition settings can further exacerbate differences in data distribution and noise levels, reducing the generalizability of pre-trained models. This work presents a self-supervised ultrasound video super-resolution algorithm called Deep Ultrasound Prior (DUP). DUP employs a video-adaptive optimization process of a neural network that enhances the resolution of given ultrasound videos without requiring paired training data while simultaneously removing noise. Quantitative and visual evaluations demonstrate that DUP outperforms existing super-resolution algorithms, leading to substantial improvements for downstream applications.
https://arxiv.org/abs/2505.13915
Diffusion MRI (dMRI) is essential for studying brain microstructure, but high-resolution imaging remains challenging due to the inherent trade-offs between acquisition time and signal-to-noise ratio (SNR). Conventional methods often optimize only the diffusion-weighted images (DWIs) without considering their relationship with the non-diffusion-weighted (b=0) reference images. However, calculating diffusion metrics, such as the apparent diffusion coefficient (ADC) and the diffusion tensor with its derived metrics like fractional anisotropy (FA) and mean diffusivity (MD), relies on the ratio between each DWI and the b=0 image, which is crucial for clinical observation and diagnostics. In this study, we demonstrate that solely enhancing DWIs using a conventional pixel-wise mean squared error (MSE) loss is insufficient, as the error in the ratio between the generated DWIs and the b=0 image diverges. We propose a novel ratio loss, defined as the MSE loss between the predicted and ground-truth logarithms of the DWI/b=0 ratio. Our results show that incorporating the ratio loss significantly improves the convergence of this ratio error, achieving lower ratio MSE and slightly enhancing the peak signal-to-noise ratio (PSNR) of generated DWIs. This leads to improved dMRI super-resolution and better preservation of b=0 ratio-based features for the derivation of diffusion metrics.
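Since the ratio loss is stated explicitly, it can be sketched directly; the epsilon clamp and the way it is combined with a standard pixel-wise MSE term below are my additions for illustration.

```python
import torch
import torch.nn.functional as F

def ratio_loss(pred_dwi, gt_dwi, b0, eps=1e-6):
    """MSE between predicted and ground-truth log(DWI / b0) ratios.
    All inputs are non-negative magnitude images of matching shape."""
    log_ratio_pred = torch.log(pred_dwi.clamp_min(eps) / b0.clamp_min(eps))
    log_ratio_gt = torch.log(gt_dwi.clamp_min(eps) / b0.clamp_min(eps))
    return F.mse_loss(log_ratio_pred, log_ratio_gt)

def total_loss(pred_dwi, gt_dwi, b0, lam=1.0):
    # Conventional pixel-wise MSE plus the ratio term (weighting lam is illustrative).
    return F.mse_loss(pred_dwi, gt_dwi) + lam * ratio_loss(pred_dwi, gt_dwi, b0)
```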
https://arxiv.org/abs/2505.12978
This work introduces CLIP-aware Domain-Adaptive Super-Resolution (CDASR), a novel framework that addresses the critical challenge of domain generalization in single image super-resolution. By leveraging the semantic capabilities of CLIP (Contrastive Language-Image Pre-training), CDASR achieves unprecedented performance across diverse domains and extreme scaling factors. The proposed method integrates a CLIP-guided feature alignment mechanism with a meta-learning-inspired few-shot adaptation strategy, enabling efficient knowledge transfer and rapid adaptation to target domains. A custom domain-adaptive module processes CLIP features alongside super-resolution features through a multi-stage transformation process comprising CLIP feature processing, spatial feature generation, and feature fusion. This process ensures effective incorporation of semantic information into the super-resolution pipeline. Additionally, CDASR employs a multi-component loss function that combines pixel-wise reconstruction, perceptual similarity, and semantic consistency. Extensive experiments on benchmark datasets demonstrate CDASR's superiority, particularly in challenging scenarios. On the Urban100 dataset at $\times$8 scaling, CDASR achieves a significant PSNR gain of 0.15 dB over existing methods, with even larger improvements of up to 0.30 dB observed at $\times$16 scaling.
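A sketch of the described three-term objective; the particular loss used for each term, the placeholder `perceptual_net` and `clip_encoder`, and the weights are illustrative assumptions rather than CDASR's exact configuration.

```python
import torch
import torch.nn.functional as F

def cdasr_loss(sr, hr, perceptual_net, clip_encoder, w_pix=1.0, w_perc=0.1, w_sem=0.01):
    """Multi-component loss: pixel-wise reconstruction + perceptual similarity
    + semantic (CLIP-feature) consistency. Weights are illustrative."""
    pixel = F.l1_loss(sr, hr)
    perceptual = F.mse_loss(perceptual_net(sr), perceptual_net(hr))
    sem_sr = F.normalize(clip_encoder(sr), dim=-1)
    sem_hr = F.normalize(clip_encoder(hr), dim=-1)
    semantic = 1.0 - (sem_sr * sem_hr).sum(dim=-1).mean()   # cosine-distance consistency
    return w_pix * pixel + w_perc * perceptual + w_sem * semantic
```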
https://arxiv.org/abs/2505.12391
Diffusion models have gained attention for their success in modeling complex distributions, achieving impressive perceptual quality in super-resolution (SR) tasks. However, existing diffusion-based SR methods often suffer from high computational costs, requiring numerous iterative steps for training and inference. Existing acceleration techniques, such as distillation and solver optimization, are generally task-agnostic and do not fully leverage the specific characteristics of low-level tasks like SR. In this study, we analyze the frequency- and spatial-domain properties of diffusion-based SR methods, revealing key insights into the temporal and spatial dependencies of high-frequency signal recovery. Specifically, high-frequency details benefit from concentrated optimization during early and late diffusion iterations, while spatially textured regions demand adaptive denoising strategies. Building on these observations, we propose the Time-Spatial-aware Sampling strategy (TSS) for accelerating diffusion SR without any extra training cost. TSS combines Time Dynamic Sampling (TDS), which allocates more iterations to refining textures, and Spatial Dynamic Sampling (SDS), which dynamically adjusts strategies based on image content. Extensive evaluations across multiple benchmarks demonstrate that TSS achieves state-of-the-art (SOTA) performance with significantly fewer iterations, improving MUSIQ scores by 0.2 - 3.0 and outperforming current acceleration methods with only half the number of steps.
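To illustrate the time-dynamic half of the idea, the sketch below builds a step schedule that spends more of a fixed budget near the beginning and end of the denoising trajectory, where the abstract says high-frequency recovery benefits most; the sinusoidal warp is my own choice, not the paper's schedule.

```python
import numpy as np

def time_dynamic_steps(total_train_steps=1000, budget=20, edge_bias=0.8):
    """Choose `budget` timesteps out of `total_train_steps`, denser near the start
    and end of the trajectory (illustrative schedule, not TDS itself)."""
    u = np.linspace(0.0, 1.0, budget)
    # Monotone warp whose derivative is small near u=0 and u=1, so samples cluster there.
    warped = u - edge_bias * np.sin(2.0 * np.pi * u) / (2.0 * np.pi)
    steps = np.round(warped * (total_train_steps - 1)).astype(int)
    return np.unique(steps)[::-1]        # descending order for the sampling loop

print(time_dynamic_steps())
```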
https://arxiv.org/abs/2505.12048
Neural operators have emerged as powerful tools for learning mappings between function spaces. Among them, the kernel integral operator has been widely validated as a universal approximator of various operators. Although recent advances following this definition have produced effective modules that better approximate the kernel function defined on the original domain (with $d$ dimensions, $d=1, 2, 3...$), the unclarified evolution mechanism in the embedding spaces obstructs the design of neural operators that can fully capture the target system's evolution. Drawing on recent breakthroughs in quantum simulation of partial differential equations (PDEs), we elucidate the linear evolution process in neural operators. Based on that, we redefine neural operators on a new $d+1$ dimensional domain. Within this framework, we implement our proposed Schrödingerised Kernel Neural Operator (SKNO), which aligns better with the $d+1$ dimensional evolution. In experiments, our $d+1$ dimensional evolving linear block performs far better than alternatives. We also verify SKNO's SOTA performance on various benchmark tests as well as on the zero-shot super-resolution task. In addition, we analyse the impact of different lifting and recovering operators on the prediction within the redefined NO framework, reflecting the alignment between our model and the underlying $d+1$ dimensional evolution.
https://arxiv.org/abs/2505.11766
Recent advances in data-centric deep generative models have led to significant progress in solving inverse imaging problems. However, these models (e.g., diffusion models (DMs)) typically require large amounts of fully sampled (clean) training data, which is often impractical in medical and scientific settings such as dynamic imaging. On the other hand, training-data-free approaches like the Deep Image Prior (DIP) do not require clean ground-truth images but suffer from noise overfitting and can be computationally expensive as the network parameters need to be optimized for each measurement set independently. Moreover, DIP-based methods often overlook the potential of learning a prior using a small number of sub-sampled measurements (or degraded images) available during training. In this paper, we propose UGoDIT, an Unsupervised Group DIP via Transferable weights, designed for the low-data regime where only a very small number, M, of sub-sampled measurement vectors are available during training. Our method learns a set of transferable weights by optimizing a shared encoder and M disentangled decoders. At test time, we reconstruct the unseen degraded image using a DIP network, where part of the parameters are fixed to the learned weights, while the remaining are optimized to enforce measurement consistency. We evaluate UGoDIT on both medical (multi-coil MRI) and natural (super resolution and non-linear deblurring) image recovery tasks under various settings. Compared to recent standalone DIP methods, UGoDIT provides accelerated convergence and notable improvement in reconstruction quality. Furthermore, our method achieves performance competitive with SOTA DM-based and supervised approaches, despite not requiring large amounts of clean training data.
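A sketch of the test-time stage as described: the transferable weights stay frozen while the remaining parameters are optimized for measurement consistency on the unseen degraded image. The encoder/decoder split, the DIP-style fixed input `z`, and the forward operator are placeholders, not UGoDIT's exact interfaces.

```python
import torch
import torch.nn.functional as F

def ugodit_test_time(encoder, decoder, forward_op, y, z, steps=500, lr=1e-3):
    """Reconstruct from an unseen measurement y: freeze the learned (transferable)
    encoder, optimize only the decoder for measurement consistency (sketch)."""
    for p in encoder.parameters():
        p.requires_grad_(False)                       # learned transferable weights stay fixed
    opt = torch.optim.Adam(decoder.parameters(), lr=lr)
    for _ in range(steps):
        x_hat = decoder(encoder(z))                   # z: fixed input, DIP-style
        loss = F.mse_loss(forward_op(x_hat), y)       # enforce measurement consistency
        opt.zero_grad()
        loss.backward()
        opt.step()
    return decoder(encoder(z)).detach()
```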
https://arxiv.org/abs/2505.11720
In recent years, implicit neural representations (INRs) have gained popularity in the computer vision community, mainly due to their strong performance on many computer vision tasks. These networks can extract a continuous signal representation from a discrete signal representation. Previous studies have repeatedly shown that INR performance correlates strongly with the activation functions used in their multilayer perceptrons. Although numerous competitive activation functions have been proposed, they share a common set of challenges: spectral bias (a lack of sensitivity to high-frequency content in signals), limited robustness to signal noise, difficulty in simultaneously capturing both local and global features, and the need for manual parameter tuning. To address these issues, we introduce a novel activation function, Band Shifted Raised Cosine Activated Implicit Neural Networks (BandRC), tailored to further enhance signal representation capacity. We also incorporate deep prior knowledge extracted from the signal to adjust the activation functions through a task-specific model. Through a mathematical analysis and a series of experiments, including image reconstruction (with a +8.93 dB PSNR improvement over the nearest counterpart), denoising (with a +0.46 dB increase in PSNR), super-resolution (with a +1.03 dB improvement over the nearest state-of-the-art (SOTA) method for 6X super-resolution), inpainting, and 3D shape reconstruction, we demonstrate the dominance of BandRC over existing state-of-the-art activation functions.
https://arxiv.org/abs/2505.11640
Hyperspectral image (HSI) representation is fundamentally challenged by pervasive non-uniformity, where spectral dependencies, spatial continuity, and feature efficiency exhibit complex and often conflicting behaviors. Most existing models rely on a unified processing paradigm that assumes homogeneity across dimensions, leading to suboptimal performance and biased representations. To address this, we propose FairHyp, a fairness-directed framework that explicitly disentangles and resolves the threefold non-uniformity through cooperative yet specialized modules. We introduce a Runge-Kutta-inspired spatial variability adapter to restore spatial coherence under resolution discrepancies, a multi-receptive field convolution module with sparse-aware refinement to enhance discriminative features while respecting inherent sparsity, and a spectral-context state space model that captures stable and long-range spectral dependencies via bidirectional Mamba scanning and statistical aggregation. Unlike one-size-fits-all solutions, FairHyp achieves dimension-specific adaptation while preserving global consistency and mutual reinforcement. This design is grounded in the view that non-uniformity arises from the intrinsic structure of HSI representations, rather than any particular task setting. To validate this, we apply FairHyp across four representative tasks including classification, denoising, super-resolution, and inpainting, demonstrating its effectiveness in modeling a shared structural flaw. Extensive experiments show that FairHyp consistently outperforms state-of-the-art methods under varied imaging conditions. Our findings redefine fairness as a structural necessity in HSI modeling and offer a new paradigm for balancing adaptability, efficiency, and fidelity in high-dimensional vision tasks.
https://arxiv.org/abs/2505.11267
Single hyperspectral image super-resolution (SHSR) aims to restore high-resolution images from low-resolution hyperspectral images. Recently, the Visual Mamba model has achieved an impressive balance between performance and computational efficiency. However, due to its 1D scanning paradigm, the model may suffer from potential artifacts during image generation. To address this issue, we propose HSRMamba. While maintaining the computational efficiency of Visual Mamba, we introduce a strip-based scanning scheme to effectively reduce artifacts from global unidirectional scanning. Additionally, HSRMamba uses wavelet decomposition to alleviate modal conflicts between high-frequency spatial features and low-frequency spectral features, further improving super-resolution performance. Extensive experiments show that HSRMamba not only excels in reducing computational load and model size but also outperforms existing methods, achieving state-of-the-art results.
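The wavelet split can be illustrated with a standard 2D DWT: the approximation sub-band carries the low-frequency (spectrally smooth) content, while the detail sub-bands carry high-frequency spatial structure. The Haar wavelet and the per-band loop below are illustrative, not HSRMamba's exact design.

```python
import numpy as np
import pywt

def wavelet_split(hsi):
    """Split each spectral band of an HSI cube (H, W, C) into a low-frequency
    approximation and stacked high-frequency detail sub-bands via a 2D Haar DWT."""
    lows, highs = [], []
    for b in range(hsi.shape[-1]):
        cA, (cH, cV, cD) = pywt.dwt2(hsi[..., b], "haar")
        lows.append(cA)                                 # low-frequency content
        highs.append(np.stack([cH, cV, cD], axis=-1))   # high-frequency spatial detail
    return np.stack(lows, axis=-1), np.stack(highs, axis=-1)

low, high = wavelet_split(np.random.rand(64, 64, 31))
print(low.shape, high.shape)   # (32, 32, 31) (32, 32, 3, 31)
```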
https://arxiv.org/abs/2505.11062
This work presents Prior Depth Anything, a framework that combines incomplete but precise metric information in depth measurement with relative but complete geometric structures in depth prediction, generating accurate, dense, and detailed metric depth maps for any scene. To this end, we design a coarse-to-fine pipeline to progressively integrate the two complementary depth sources. First, we introduce pixel-level metric alignment and distance-aware weighting to pre-fill diverse metric priors by explicitly using depth prediction. It effectively narrows the domain gap between prior patterns, enhancing generalization across varying scenarios. Second, we develop a conditioned monocular depth estimation (MDE) model to refine the inherent noise of depth priors. By conditioning on the normalized pre-filled prior and prediction, the model further implicitly merges the two complementary depth sources. Our model showcases impressive zero-shot generalization across depth completion, super-resolution, and inpainting over 7 real-world datasets, matching or even surpassing previous task-specific methods. More importantly, it performs well on challenging, unseen mixed priors and enables test-time improvements by switching prediction models, providing a flexible accuracy-efficiency trade-off while evolving with advancements in MDE models.
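The pixel-level metric alignment with distance-aware weighting can be sketched roughly as: fit a scale/shift mapping the relative prediction onto the sparse metric prior, then blend the two, trusting the prior more near measured pixels. The global (rather than local) scale/shift fit and the exponential weighting are simplifying assumptions on my part.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def prefill_metric_prior(pred_rel, prior_metric, valid_mask, tau=16.0):
    """Align a relative depth prediction to a sparse metric prior and pre-fill it.
    pred_rel: (H, W) relative depth; prior_metric: (H, W) sparse metric depth;
    valid_mask: (H, W) bool, True where the metric prior is available."""
    # Least-squares scale/shift so that s * pred_rel + t matches the prior where valid.
    p = pred_rel[valid_mask]
    q = prior_metric[valid_mask]
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, q, rcond=None)
    aligned = s * pred_rel + t

    # Distance-aware weighting: trust the metric prior near measured pixels,
    # fall back to the aligned prediction far from them.
    dist = distance_transform_edt(~valid_mask)       # distance to nearest measurement
    w = np.exp(-dist / tau)
    dense_prior = np.where(valid_mask, prior_metric, aligned)
    return w * dense_prior + (1.0 - w) * aligned
```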
https://arxiv.org/abs/2505.10565