With advances in artificial intelligence, image processing has gained significant interest. Image super-resolution is a vital technology closely tied to real-world applications, as it enhances the quality of existing images. Since recovering fine details is crucial for the super-resolution task, pixels that contribute high-frequency information should be emphasized. This paper proposes two methods to enhance high-frequency details in super-resolution images: a Laplacian pyramid-based detail loss and a repeated upscaling-and-downscaling process. The total loss, which incorporates our detail loss, guides the model by separately generating and controlling super-resolution and detail images. This allows the model to focus more effectively on high-frequency components, yielding improved super-resolution images. Additionally, repeated upscaling and downscaling amplifies the effectiveness of the detail loss by extracting diverse information from multiple low-resolution features. We conduct two types of experiments. First, we design a CNN-based model incorporating our methods; it achieves state-of-the-art results, surpassing all currently available CNN-based models and even some attention-based ones. Second, we apply our methods to existing attention-based models at small scale. In all our experiments, attention-based models augmented with our detail loss improve over their originals. These results demonstrate that our approaches effectively enhance super-resolution images across different model structures.
https://arxiv.org/abs/2601.09410
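As a rough illustration of how a Laplacian pyramid-based detail loss of this kind can be computed, here is a minimal NumPy sketch. The box-filter pyramid, the three-level depth, and the L1 distance are our illustrative assumptions, not necessarily the paper's exact formulation.

```python
import numpy as np

def downsample(img):
    # 2x box-filter downsample (a cheap stand-in for Gaussian blur + decimation)
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    # nearest-neighbour 2x upsample
    return img.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_bands(img, levels=3):
    # high-frequency residuals (image minus its upsampled low-pass) at each scale
    bands, cur = [], img
    for _ in range(levels):
        low = downsample(cur)
        bands.append(cur - upsample(low))
        cur = low
    return bands

def detail_loss(sr, hr, levels=3):
    # L1 distance between the detail (Laplacian) bands of the SR and HR images
    return sum(np.abs(a - b).mean()
               for a, b in zip(laplacian_bands(sr, levels),
                               laplacian_bands(hr, levels)))
```

The loss is zero only when every high-frequency band matches, which is what lets a total loss combining it with a standard pixel loss steer the model toward fine detail.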
Tracklet quality is often treated as an afterthought in most person re-identification (ReID) methods, with the majority of research presenting architectural modifications to foundational models. Such approaches neglect an important limitation, posing challenges when deploying ReID systems in real-world, difficult scenarios. In this paper, we introduce S3-CLIP, a video super-resolution-based CLIP-ReID framework developed for the VReID-XFD challenge at WACV 2026. The proposed method integrates recent advances in super-resolution networks with task-driven super-resolution pipelines, adapting them to the video-based person re-identification setting. To the best of our knowledge, this work represents the first systematic investigation of video super-resolution as a means of enhancing tracklet quality for person ReID, particularly under challenging cross-view conditions. Experimental results demonstrate performance competitive with the baseline, achieving 37.52% mAP in aerial-to-ground and 29.16% mAP in ground-to-aerial scenarios. In the ground-to-aerial setting, S3-CLIP achieves substantial gains in ranking accuracy, improving Rank-1, Rank-5, and Rank-10 performance by 11.24%, 13.48%, and 17.98%, respectively.
https://arxiv.org/abs/2601.08807
Single Image Super-Resolution (SISR) is a fundamental computer vision task that aims to reconstruct a high-resolution (HR) image from a low-resolution (LR) input. Transformer-based methods have achieved remarkable performance by modeling long-range dependencies in degraded images. However, their feature-intensive attention computation incurs high computational cost. To improve efficiency, most existing approaches partition images into fixed groups and restrict attention within each group. Such group-wise attention overlooks the inherent asymmetry in token similarities, thereby failing to enable flexible and token-adaptive attention computation. To address this limitation, we propose the Individualized Exploratory Transformer (IET), which introduces a novel Individualized Exploratory Attention (IEA) mechanism that allows each token to adaptively select its own content-aware and independent attention candidates. This token-adaptive and asymmetric design enables more precise information aggregation while maintaining computational efficiency. Extensive experiments on standard SR benchmarks demonstrate that IET achieves state-of-the-art performance under comparable computational complexity.
https://arxiv.org/abs/2601.08341
Understanding plant root systems is critical for advancing research in soil-plant interactions, nutrient uptake, and overall plant health. However, accurate imaging of roots in subterranean environments remains a persistent challenge due to adverse conditions such as occlusion, varying soil moisture, and inherently low contrast, which limit the effectiveness of conventional vision-based approaches. In this work, we propose a novel underground imaging system that captures multiple overlapping views of plant roots and integrates a deep learning-based Multi-Image Super Resolution (MISR) framework designed to enhance root visibility and detail. To train and evaluate our approach, we construct a synthetic dataset that simulates realistic underground imaging scenarios, incorporating key environmental factors that affect image quality. Our proposed MISR algorithm leverages spatial redundancy across views to reconstruct high-resolution images with improved structural fidelity and visual clarity. Quantitative evaluations show that our approach outperforms state-of-the-art super resolution baselines, achieving a 2.3 percent reduction in BRISQUE, indicating improved image quality with the same CLIP-IQA score, thereby enabling enhanced phenotypic analysis of root systems. This, in turn, facilitates accurate estimation of critical root traits, including root hair count and root hair density. The proposed framework presents a promising direction for robust automatic underground plant root imaging and trait quantification for agricultural and ecological research.
https://arxiv.org/abs/2601.05482
Infrared video has been of great interest for visual tasks in challenging environments, but it often suffers from severe atmospheric turbulence and compression degradation. Existing video super-resolution (VSR) methods either neglect the inherent modality gap between infrared and visible images or fail to restore turbulence-induced distortions. Directly cascading turbulence mitigation (TM) algorithms with VSR methods leads to error propagation and accumulation due to the decoupled modeling of turbulence and resolution degradation. We introduce HATIR, a Heat-Aware Diffusion for Turbulent InfraRed Video Super-Resolution, which injects heat-aware deformation priors into the diffusion sampling path to jointly model the inverse process of turbulent degradation and structural detail loss. Specifically, HATIR constructs a Phasor-Guided Flow Estimator, rooted in the physical principle that thermally active regions exhibit consistent phasor responses over time, enabling reliable turbulence-aware flow to guide the reverse diffusion process. To ensure the fidelity of structural recovery under nonuniform distortions, a Turbulence-Aware Decoder is proposed to selectively suppress unstable temporal cues and enhance edge-aware feature aggregation via turbulence gating and structure-aware attention. We build FLIR-IVSR, the first dataset for turbulent infrared VSR, comprising paired LR-HR sequences from a FLIR T1050sc camera (1024 × 768) spanning 640 diverse scenes with varying camera and object motion conditions. This encourages future research in infrared VSR. Project page: this https URL
https://arxiv.org/abs/2601.04682
Neural networks commonly employ the McCulloch-Pitts neuron model, which is a linear model followed by a point-wise non-linear activation. Various researchers have already advanced inherently non-linear neuron models, such as quadratic neurons, generalized operational neurons, generative neurons, and super neurons, which offer stronger non-linearity compared to point-wise activation functions. In this paper, we introduce a novel and better non-linear neuron model called Padé neurons (Paons), inspired by Padé approximants. Paons offer several advantages, such as diversity of non-linearity, since each Paon learns a different non-linear function of its inputs, and layer efficiency, since Paons provide stronger non-linearity in much fewer layers compared to piecewise linear approximation. Furthermore, Paons include all previously proposed neuron models as special cases, thus any neuron model in any network can be replaced by Paons. We note that there has been a proposal to employ the Padé approximation as a generalized point-wise activation function, which is fundamentally different from our model. To validate the efficacy of Paons, in our experiments, we replace classic neurons in some well-known neural image super-resolution, compression, and classification models based on the ResNet architecture with Paons. Our comprehensive experimental results and analyses demonstrate that neural models built by Paons provide better or equal performance than their classic counterparts with a smaller number of layers. The PyTorch implementation code for Paon is open-sourced at this https URL.
https://arxiv.org/abs/2601.04005
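For intuition, a Padé-style neuron replaces the point-wise activation with a ratio of two learned polynomials of the pre-activation. The sketch below is a minimal illustration under our own assumptions: the 1 + |Q| denominator is a common pole-avoidance trick, and the function names and interface are hypothetical, not the paper's actual Paon parameterization.

```python
import numpy as np

def pade_activation(x, p_coef, q_coef):
    # Rational (Padé-style) non-linearity: P(x) / (1 + |Q(x)|).
    # The absolute value in the denominator keeps the ratio pole-free.
    return np.polyval(p_coef, x) / (1.0 + np.abs(np.polyval(q_coef, x)))

def paon_forward(x, w, b, p_coef, q_coef):
    # Linear projection followed by the rational non-linearity, mirroring
    # how a classic McCulloch-Pitts neuron pairs a linear map with an
    # activation, but with a learnable non-linearity per neuron.
    return pade_activation(x @ w + b, p_coef, q_coef)
```

With P(x) = x and Q = 0 this degenerates to a plain linear neuron, which is one way to see how such a model can include simpler neuron models as special cases.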
Distributing government relief efforts after a flood is challenging. In India, crops are widely affected by floods; rapid and accurate crop-damage assessment is therefore crucial for effective post-disaster agricultural management. Traditional manual surveys are slow and biased, while current satellite-based methods face challenges such as cloud cover and low spatial resolution. To bridge this gap, this paper introduces FLNet, a novel deep-learning architecture that uses super-resolution to enhance the 10 m spatial resolution of Sentinel-2 satellite images to 3 m before classifying damage. We test our model on the Bihar Flood Impacted Croplands Dataset (BFCD-22), and the results show the critical "Full Damage" F1-score improving from 0.83 to 0.89, matching the 0.89 score of commercial high-resolution imagery. This work presents a cost-effective and scalable solution, paving the way for a nationwide shift from manual to automated, high-fidelity damage assessment.
https://arxiv.org/abs/2601.03884
Optics-guided thermal UAV image super-resolution has attracted significant research interest due to its potential in all-weather monitoring applications. However, existing methods typically compress optical features to match thermal feature dimensions for cross-modal alignment and fusion, which not only causes the loss of high-frequency information that is beneficial for thermal super-resolution, but also introduces physically inconsistent artifacts such as texture distortions and edge blurring by overlooking differences in the imaging physics between modalities. To address these challenges, we propose PCNet to achieve cross-resolution mutual enhancement between optical and thermal modalities, while physically constraining the optical guidance process via thermal conduction to enable robust thermal UAV image super-resolution. In particular, we design a Cross-Resolution Mutual Enhancement Module (CRME) to jointly optimize thermal image super-resolution and optical-to-thermal modality conversion, facilitating effective bidirectional feature interaction across resolutions while preserving high-frequency optical priors. Moreover, we propose a Physics-Driven Thermal Conduction Module (PDTM) that incorporates two-dimensional heat conduction into optical guidance, modeling spatially-varying heat conduction properties to prevent inconsistent artifacts. In addition, we introduce a temperature consistency loss that enforces regional distribution consistency and boundary gradient smoothness to ensure generated thermal images align with real-world thermal radiation principles. Extensive experiments on VGTSR2.0 and DroneVehicle datasets demonstrate that PCNet significantly outperforms state-of-the-art methods on both reconstruction quality and downstream tasks including semantic segmentation and object detection.
https://arxiv.org/abs/2601.03526
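The two-dimensional heat conduction that PDTM builds on can be illustrated with one explicit finite-difference step of the heat equation. This is a textbook sketch, not the module itself; the diffusivity value and the replicate-border handling are illustrative choices.

```python
import numpy as np

def heat_step(u, alpha=0.1):
    # One explicit step of u_t = alpha * (u_xx + u_yy) on a pixel grid.
    # Edge padding gives zero-flux (Neumann-like) borders, so total heat
    # is conserved while spatial variation diffuses away. Stability of the
    # explicit scheme requires alpha < 0.25 on a unit grid.
    p = np.pad(u, 1, mode='edge')
    lap = (p[:-2, 1:-1] + p[2:, 1:-1] +
           p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * u)
    return u + alpha * lap
```

Making alpha spatially varying, as the abstract describes, would amount to modeling per-region conduction properties rather than a single global constant.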
Generative adversarial networks (GANs) and diffusion models have recently achieved state-of-the-art performance in audio super-resolution (ADSR), producing perceptually convincing wideband audio from narrowband inputs. However, existing evaluations primarily rely on signal-level or perceptual metrics, leaving open the question of how closely the distributions of synthetic super-resolved and real wideband audio match. Here we address this problem by analyzing the separability of real and super-resolved audio in various embedding spaces. We consider both middle-band ($4\to 16$~kHz) and full-band ($16\to 48$~kHz) upsampling tasks for speech and music, training linear classifiers to distinguish real from synthetic samples based on multiple types of audio embeddings. Comparisons with objective metrics and subjective listening tests reveal that embedding-based classifiers achieve near-perfect separation, even when the generated audio attains high perceptual quality and state-of-the-art metric scores. This behavior is consistent across datasets and models, including recent diffusion-based approaches, highlighting a persistent gap between perceptual quality and true distributional fidelity in ADSR models.
https://arxiv.org/abs/2601.03443
This paper addresses low-light video super-resolution (LVSR), aiming to restore high-resolution videos from low-light, low-resolution (LR) inputs. Existing LVSR methods often struggle to recover fine details due to limited contrast and insufficient high-frequency information. To overcome these challenges, we present RetinexEVSR, the first event-driven LVSR framework that leverages high-contrast event signals and Retinex-inspired priors to enhance video quality under low-light scenarios. Unlike previous approaches that directly fuse degraded signals, RetinexEVSR introduces a novel bidirectional cross-modal fusion strategy to extract and integrate meaningful cues from noisy event data and degraded RGB frames. Specifically, an illumination-guided event enhancement module is designed to progressively refine event features using illumination maps derived from the Retinex model, thereby suppressing low-light artifacts while preserving high-contrast details. Furthermore, we propose an event-guided reflectance enhancement module that utilizes the enhanced event features to dynamically recover reflectance details via a multi-scale fusion mechanism. Experimental results show that our RetinexEVSR achieves state-of-the-art performance on three datasets. Notably, on the SDSD benchmark, our method achieves up to a 2.95 dB gain while reducing runtime by 65% compared to prior event-based methods. Code: this https URL.
https://arxiv.org/abs/2601.02206
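The Retinex prior leaned on above decomposes an image into reflectance and illumination, I = R · L. A minimal sketch using a box blur as the illumination estimate follows; real pipelines use learned or edge-aware estimators, and the kernel size here is an arbitrary illustrative choice.

```python
import numpy as np

def box_smooth(img, k=5):
    # crude k x k box blur via shifted-copy averaging
    pad = k // 2
    p = np.pad(img, pad, mode='edge')
    h, w = img.shape
    acc = np.zeros((h, w))
    for dy in range(k):
        for dx in range(k):
            acc += p[dy:dy + h, dx:dx + w]
    return acc / (k * k)

def retinex_decompose(img, eps=1e-6):
    # illumination L: smooth, low-frequency estimate of the scene lighting
    L = np.maximum(box_smooth(img), eps)
    # reflectance R: the detail-carrying ratio, so that img == R * L
    R = img / L
    return R, L
```

An illumination map of this kind is the kind of signal the abstract's illumination-guided module uses to tell low-light artifacts apart from genuine high-contrast detail.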
Face super-resolution aims to recover high-quality facial images from severely degraded low-resolution inputs, but remains challenging due to the loss of fine structural details and identity-specific features. This work introduces SwinIFS, a landmark-guided super-resolution framework that integrates structural priors with hierarchical attention mechanisms to achieve identity-preserving reconstruction at both moderate and extreme upscaling factors. The method incorporates dense Gaussian heatmaps of key facial landmarks into the input representation, enabling the network to focus on semantically important facial regions from the earliest stages of processing. A compact Swin Transformer backbone is employed to capture long-range contextual information while preserving local geometry, allowing the model to restore subtle facial textures and maintain global structural consistency. Extensive experiments on the CelebA benchmark demonstrate that SwinIFS achieves superior perceptual quality, sharper reconstructions, and improved identity retention; it consistently produces more photorealistic results and exhibits strong performance even under 8x magnification, where most methods fail to recover meaningful structure. SwinIFS also provides an advantageous balance between reconstruction accuracy and computational efficiency, making it suitable for real-world applications in facial enhancement, surveillance, and digital restoration. Our code, model weights, and results are available at this https URL.
https://arxiv.org/abs/2601.01406
Single Image Super-Resolution (SISR) aims to recover high-resolution images from low-resolution inputs. Unlike SISR, Reference-based Super-Resolution (RefSR) leverages an additional high-resolution reference image to facilitate the recovery of high-frequency textures. However, existing research mainly focuses on backdoor attacks targeting RefSR, while RefSR's vulnerability to adversarial attacks has not been fully explored. To fill this research gap, we propose RefSR-Adv, an adversarial attack that degrades SR outputs by perturbing only the reference image. By maximizing the difference between adversarial and clean outputs, RefSR-Adv induces significant performance degradation and generates severe artifacts across CNN, Transformer, and Mamba architectures on the CUFED5, WR-SR, and DRefSR datasets. Importantly, experiments confirm a positive correlation between attack effectiveness and the similarity of the low-resolution input to the reference image, revealing that the model's over-reliance on reference features is a key security flaw. This study reveals a security vulnerability in RefSR systems and aims to urge researchers to pay attention to the robustness of RefSR.
https://arxiv.org/abs/2601.01202
With the advent of Generative AI, Single Image Super-Resolution (SISR) quality has seen substantial improvement, as the strong priors learned by Text-2-Image Diffusion (T2IDiff) Foundation Models (FMs) can bridge the gap between High-Resolution (HR) and Low-Resolution (LR) images. However, flagship smartphone cameras have been slow to adopt generative models because strong generation can lead to undesirable hallucinations. For the substantially degraded LR images common in academic benchmarks, strong generation is required and hallucinations are more tolerable because of the wide gap between LR and HR images. In contrast, in consumer photography the LR image has substantially higher fidelity, requiring only minimal, hallucination-free generation. We hypothesize that generation in SISR is controlled by the stringency and richness of the FM's conditioning feature. First, text features are high-level features, which often cannot describe subtle textures in an image. Additionally, smartphone LR images are at least $12MP$, whereas SISR networks built on T2IDiff FMs are designed to perform inference on much smaller images ($<1MP$). As a result, SISR inference has to be performed on small patches, which often cannot be accurately described by text features. To address these shortcomings, we introduce an SISR network built on an FM with lower-level feature conditioning, specifically DINOv2 features, which we call a Feature-to-Image Diffusion (F2IDiff) FM. Lower-level features provide stricter conditioning while remaining rich descriptors of even small patches.
https://arxiv.org/abs/2512.24473
Diffusion-based video super-resolution (VSR) methods achieve strong perceptual quality but remain impractical for latency-sensitive settings due to reliance on future frames and expensive multi-step denoising. We propose Stream-DiffVSR, a causally conditioned diffusion framework for efficient online VSR. Operating strictly on past frames, it combines a four-step distilled denoiser for fast inference, an Auto-regressive Temporal Guidance (ARTG) module that injects motion-aligned cues during latent denoising, and a lightweight temporal-aware decoder with a Temporal Processor Module (TPM) that enhances detail and temporal coherence. Stream-DiffVSR processes 720p frames in 0.328 seconds on an RTX4090 GPU and significantly outperforms prior diffusion-based methods. Compared with the online SOTA TMP, it boosts perceptual quality (LPIPS +0.095) while reducing latency by over 130x. Stream-DiffVSR achieves the lowest latency reported for diffusion-based VSR, reducing initial delay from over 4600 seconds to 0.328 seconds, thereby making it the first diffusion VSR method suitable for low-latency online deployment. Project page: this https URL
https://arxiv.org/abs/2512.23709
Diffusion models have become a leading paradigm for image super-resolution (SR), but existing methods struggle to guarantee both the high-frequency perceptual quality and the low-frequency structural fidelity of generated images. Although inference-time scaling can theoretically improve this trade-off by allocating more computation, existing strategies remain suboptimal: reward-driven particle optimization often causes perceptual over-smoothing, while optimal-path search tends to lose structural consistency. To overcome these difficulties, we propose Iterative Diffusion Inference-Time Scaling with Adaptive Frequency Steering (IAFS), a training-free framework that jointly leverages iterative refinement and frequency-aware particle fusion. IAFS addresses the challenge of balancing perceptual quality and structural fidelity by progressively refining the generated image through iterative correction of structural deviations. Simultaneously, it ensures effective frequency fusion by adaptively integrating high-frequency perceptual cues with low-frequency structural information, allowing for a more accurate and balanced reconstruction across different image details. Extensive experiments across multiple diffusion-based SR models show that IAFS effectively resolves the perception-fidelity conflict, yielding consistently improved perceptual detail and structural accuracy, and outperforming existing inference-time scaling methods.
https://arxiv.org/abs/2512.23532
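The frequency fusion idea above can be pictured as combining the low-frequency band of a structure-faithful candidate with the high-frequency band of a perceptually sharp one. A minimal FFT-mask sketch follows; the fixed cutoff stands in for the adaptive steering, and the function name and interface are illustrative assumptions.

```python
import numpy as np

def frequency_fuse(structural, perceptual, cutoff=0.2):
    # Take low frequencies (global structure) from one candidate and high
    # frequencies (fine detail) from the other, via a radial FFT mask.
    h, w = structural.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    low_mask = (np.sqrt(fy**2 + fx**2) <= cutoff).astype(float)
    fused = (low_mask * np.fft.fft2(structural) +
             (1.0 - low_mask) * np.fft.fft2(perceptual))
    return np.real(np.fft.ifft2(fused))
```

Because the two masks sum to one everywhere, fusing an image with itself returns the image unchanged, and the DC component (overall brightness) always comes from the structural candidate.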
Image-to-Image (I2I) translation involves converting an image from one domain to another. Deterministic I2I translation, such as in image super-resolution, extends this concept by guaranteeing that each input generates a consistent and predictable output, closely matching the ground truth (GT) with high fidelity. In this paper, we propose a denoising Brownian bridge model with dual approximators (Dual-approx Bridge), a novel generative model that exploits the Brownian bridge dynamics and two neural network-based approximators (one for forward and one for reverse process) to produce faithful output with negligible variance and high image quality in I2I translations. Our extensive experiments on benchmark datasets including image generation and super-resolution demonstrate the consistent and superior performance of Dual-approx Bridge in terms of image quality and faithfulness to GT when compared to both stochastic and deterministic baselines. Project page and code: this https URL
https://arxiv.org/abs/2512.23463
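For intuition, the marginal of a Brownian bridge pinned at the two domain endpoints can be sampled in closed form; the variance vanishing at t = 0 and t = 1 is what makes the endpoints, and hence the translation, deterministic. A minimal sketch (sigma and the sampling interface are our assumptions):

```python
import numpy as np

def brownian_bridge_sample(x0, x1, t, sigma=1.0, rng=None):
    # Bridge marginal at time t in [0, 1], pinned at x0 (t=0) and x1 (t=1):
    # the mean interpolates linearly, while the variance
    # sigma^2 * t * (1 - t) shrinks to zero at both endpoints.
    rng = np.random.default_rng() if rng is None else rng
    mean = (1.0 - t) * x0 + t * x1
    std = sigma * np.sqrt(t * (1.0 - t))
    return mean + std * rng.standard_normal(x0.shape)
```

A generative bridge model learns to traverse this process between the input-domain and target-domain images, which is why its outputs can match the ground truth with negligible variance.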
Diffusion-based remote sensing (RS) generative foundation models are crucial for downstream tasks. However, these models rely on large amounts of globally representative data, which often contain redundancy, noise, and class imbalance, reducing training efficiency and preventing convergence. Existing RS diffusion foundation models typically aggregate multiple classification datasets or apply simplistic deduplication, overlooking the distributional requirements of generative modeling and the heterogeneity of RS imagery. To address these limitations, we propose a training-free, two-stage data pruning approach that quickly selects a high-quality subset under high pruning ratios, enabling a preliminary foundation model to converge rapidly and serve as a versatile backbone for generation, downstream fine-tuning, and other applications. Our method jointly considers local information content with global scene-level diversity and representativeness. First, an entropy-based criterion efficiently removes low-information samples. Next, leveraging RS scene classification datasets as reference benchmarks, we perform scene-aware clustering with stratified sampling to improve clustering effectiveness while reducing computational costs on large-scale unlabeled data. Finally, by balancing cluster-level uniformity and sample representativeness, the method enables fine-grained selection under high pruning ratios while preserving overall diversity and representativeness. Experiments show that, even after pruning 85% of the training data, our method significantly improves convergence and generation quality. Furthermore, diffusion foundation models trained with our method consistently achieve state-of-the-art performance across downstream tasks, including super-resolution and semantic image synthesis. This data pruning paradigm offers practical guidance for developing RS generative foundation models.
https://arxiv.org/abs/2512.23239
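The entropy-based first stage described above can be sketched as a histogram-entropy filter that discards low-information images. All names and the keep-ratio interface are illustrative; the paper's actual criterion and thresholds may differ.

```python
import numpy as np

def image_entropy(img, bins=256):
    """Shannon entropy (in bits) of an image's intensity histogram."""
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]  # ignore empty bins; 0 * log 0 is taken as 0
    return float(-(p * np.log2(p)).sum())

def entropy_prune(images, keep_ratio):
    """Keep indices of the top `keep_ratio` fraction of images by entropy."""
    scores = np.array([image_entropy(im) for im in images])
    k = max(1, int(round(keep_ratio * len(images))))
    keep = np.argsort(scores)[::-1][:k]  # highest-entropy samples first
    return sorted(keep.tolist())
```

A uniform (e.g. all-black) tile scores zero entropy and is pruned first, while textured scenes survive; the paper's later scene-aware clustering stage would then operate on this reduced pool.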
The highly nonlinear degradation process, complex physical interactions, and various sources of uncertainty render single-image super-resolution (SR) a particularly challenging task. Existing interpretable SR approaches, whether based on prior learning or deep unfolding optimization frameworks, typically rely on black-box deep networks to model latent variables, which leaves the degradation process largely unknown and uncontrollable. Inspired by the Kolmogorov-Arnold theorem (KAT), we propose, for the first time, a novel interpretable operator, termed the Kolmogorov-Arnold Neural Operator (KANO), and apply it to image SR. KANO provides a transparent and structured representation of the latent degradation fitting process. Specifically, we employ an additive structure composed of a finite number of B-spline functions to approximate continuous spectral curves in a piecewise fashion. By learning and optimizing the shape parameters of these spline functions within defined intervals, our KANO accurately captures key spectral characteristics, such as local linear trends and the peak-valley structures at nonlinear inflection points, thereby endowing SR results with physical interpretability. Furthermore, through theoretical modeling and experimental evaluations across natural images, aerial photographs, and satellite remote sensing data, we systematically compare multilayer perceptrons (MLPs) and Kolmogorov-Arnold networks (KANs) on complex sequence fitting tasks. This comparative study elucidates the respective advantages and limitations of these models in characterizing intricate degradation mechanisms, offering valuable insights for the development of interpretable SR techniques.
https://arxiv.org/abs/2512.22822
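The additive spline structure can be illustrated with its simplest instance: degree-1 ("hat") B-splines on a uniform knot grid, whose coefficients play the role of learnable shape parameters and are fitted here by least squares to a curve with a peak-valley structure. This is a hedged sketch, not the paper's KANO; the test curve and knot count are made up for illustration.

```python
import numpy as np

def hat_basis(x, knots):
    """Degree-1 B-spline ("hat") basis on a uniform knot grid.

    Returns a (len(x), len(knots)) design matrix; each column is a piecewise
    linear bump centred on one knot, so a weighted sum of columns is a
    piecewise-linear spline -- the simplest additive spline structure.
    """
    x = np.asarray(x, dtype=float)[:, None]
    k = np.asarray(knots, dtype=float)[None, :]
    h = np.diff(knots).mean()  # assumes uniform knot spacing
    return np.clip(1.0 - np.abs(x - k) / h, 0.0, None)

def fit_spline_curve(x, y, knots):
    """Least-squares fit of the spline coefficients ("shape parameters")."""
    B = hat_basis(x, knots)
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)
    return coef

# Hypothetical spectral-like curve with a peak and a valley.
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) * np.exp(-x)
knots = np.linspace(0.0, 1.0, 21)
coef = fit_spline_curve(x, y, knots)
y_hat = hat_basis(x, knots) @ coef  # piecewise approximation of the curve
```

Higher-degree B-splines (as KANs use) replace the hat bumps with smoother local bases, but the principle is the same: a finite additive family of local functions whose shape parameters are optimized over defined intervals.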
Reinforcement Learning with Human Feedback (RLHF), guided by reward models, has proven effective at aligning image generation with human preferences. Motivated by this, adapting RLHF to Image Super-Resolution (ISR) has shown promise for optimizing perceptual quality, with Image Quality Assessment (IQA) models serving as reward models. However, traditional IQA models usually output a single global score, which is insensitive to local and fine-grained distortions. This insensitivity allows ISR models to produce perceptually undesirable artifacts that nevertheless earn spuriously high scores, misaligning the optimization objective with perceptual quality and resulting in reward hacking. To address this, we propose a Fine-grained Perceptual Reward Model (FinPercep-RM) based on an encoder-decoder architecture. Alongside a global quality score, it generates a Perceptual Degradation Map that spatially localizes and quantifies local defects. To train this model, we introduce the FGR-30k dataset, consisting of diverse, subtle distortions produced by real-world super-resolution models. Despite the success of FinPercep-RM, its complexity makes generator policy learning challenging and training unstable. To address this, we propose a Co-evolutionary Curriculum Learning (CCL) mechanism in which the reward model and the ISR model follow synchronized curricula: the reward model progressively increases in complexity, while the ISR model starts from a simpler global reward for rapid convergence and gradually transitions to the more complex reward outputs. This easy-to-hard strategy enables stable training while suppressing reward hacking. Experiments validate the effectiveness of our method across ISR models, improving both global quality and local realism over existing RLHF methods.
https://arxiv.org/abs/2512.22647
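The easy-to-hard blending of a simple global reward with the fine-grained degradation map might be scheduled as follows. The warm-up fraction, the linear ramp, and the reduction of the map to a scalar are assumptions for illustration, not the paper's exact CCL design.

```python
import numpy as np

def curriculum_weight(step, total_steps, warmup_frac=0.3):
    """Easy-to-hard schedule returning alpha in [0, 1].

    alpha = 0 during warm-up (global reward only), then ramps linearly to 1
    (fine-grained reward only) over the remaining training steps.
    """
    warmup = warmup_frac * total_steps
    if step <= warmup:
        return 0.0
    return min(1.0, (step - warmup) / (total_steps - warmup))

def blended_reward(global_score, degradation_map, step, total_steps):
    """Blend a scalar global score with a degradation-map-based score."""
    alpha = curriculum_weight(step, total_steps)
    # Lower average degradation -> higher fine-grained reward.
    fine_score = 1.0 - float(np.mean(degradation_map))
    return (1.0 - alpha) * global_score + alpha * fine_score
```

Early in training the policy only sees the stable global signal; as alpha grows, spatially localized defects start dominating the reward, which is the mechanism that makes locally artifact-ridden outputs stop scoring well.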
Background: High-resolution MRI is critical for diagnosis, but long acquisition times limit clinical use. Super-resolution (SR) can enhance resolution post-scan, yet existing deep learning methods face fidelity-efficiency trade-offs. Purpose: To develop a computationally efficient and accurate deep learning framework for MRI SR that preserves anatomical detail for clinical integration. Materials and Methods: We propose a novel SR framework combining multi-head selective state-space models (MHSSM) with a lightweight channel MLP. The model uses 2D patch extraction with hybrid scanning to capture long-range dependencies. Each MambaFormer block integrates MHSSM, depthwise convolutions, and gated channel mixing. Evaluation used 7T brain T1 MP2RAGE maps (n=142) and 1.5T prostate T2w MRI (n=334). Comparisons included Bicubic interpolation, GANs (CycleGAN, Pix2pix, SPSR), transformers (SwinIR), Mamba (MambaIR), and diffusion models (I2SB, Res-SRDiff). Results: Our model achieved superior performance with exceptional efficiency. For 7T brain data: SSIM=0.951±0.021, PSNR=26.90±1.41 dB, LPIPS=0.076±0.022, GMSD=0.083±0.017, significantly outperforming all baselines (p<0.001). For prostate data: SSIM=0.770±0.049, PSNR=27.15±2.19 dB, LPIPS=0.190±0.095, GMSD=0.087±0.013. The framework used only 0.9M parameters and 57 GFLOPs, reducing parameters by 99.8% and computation by 97.5% versus Res-SRDiff, while outperforming SwinIR and MambaIR in accuracy and efficiency. Conclusion: The proposed framework provides an efficient, accurate MRI SR solution, delivering enhanced anatomical detail across datasets. Its low computational demand and state-of-the-art performance show strong potential for clinical translation.
https://arxiv.org/abs/2512.19676
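The gated channel mixing inside each MambaFormer block can be sketched under the assumption of a standard sigmoid-gated channel MLP; the paper's exact layer widths, activation, and normalization are not specified in the abstract, so all sizes below are illustrative.

```python
import numpy as np

def gated_channel_mlp(x, w_in, w_gate, w_out):
    """Gated channel mixing on (tokens, channels) features.

    Two parallel linear maps act over the channel axis; one branch is passed
    through a sigmoid and used as a multiplicative gate on the other, then a
    final linear map projects back to the input width.
    """
    a = x @ w_in                              # value branch
    g = 1.0 / (1.0 + np.exp(-(x @ w_gate)))  # gate branch in (0, 1)
    return (a * g) @ w_out

rng = np.random.default_rng(0)
tokens, c, hidden = 16, 8, 32
x = rng.standard_normal((tokens, c))
w_in = rng.standard_normal((c, hidden)) / np.sqrt(c)
w_gate = rng.standard_normal((c, hidden)) / np.sqrt(c)
w_out = rng.standard_normal((hidden, c)) / np.sqrt(hidden)
y = gated_channel_mlp(x, w_in, w_gate, w_out)  # same shape as x
```

The gate lets the block suppress or pass individual hidden channels per token, which is the cheap per-channel selectivity that complements the long-range mixing done by the selective state-space branch.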