In medical image segmentation, convolutional neural networks (CNNs) and transformers are dominant. For CNNs, given the local receptive fields of convolutional layers, long-range spatial correlations are captured through consecutive convolutions and pooling. However, as the computational cost and memory footprint can be prohibitively large, 3D models can only afford fewer layers than 2D models, with reduced receptive fields and abstraction levels. For transformers, although long-range correlations can be captured by multi-head attention, their quadratic complexity with respect to input size is computationally demanding. Therefore, either model may require input size reduction to allow more filters and layers for better segmentation. Nevertheless, given their discrete nature, models trained on image patches or downsampled images may produce suboptimal results when applied to higher resolutions. To address this issue, we propose the resolution-robust HNOSeg-XS architecture. We model image segmentation as learnable partial differential equations through the Fourier neural operator, which has the zero-shot super-resolution property. By replacing the Fourier transform with the Hartley transform and reformulating the problem in the frequency domain, we created the HNOSeg-XS model, which is resolution robust, fast, memory efficient, and extremely parameter efficient. When tested on the BraTS'23, KiTS'23, and MVSeg'23 datasets with a Tesla V100 GPU, HNOSeg-XS showed its superior resolution robustness with fewer than 34.7k model parameters. It also achieved the best overall inference time (< 0.24 s) and memory efficiency (< 1.8 GiB) among the tested CNN and transformer models.
https://arxiv.org/abs/2507.08205
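To make the frequency-domain idea concrete, here is a minimal NumPy sketch of a 1D spectral layer that uses the Hartley transform (computed from the FFT) with learnable weights on a fixed set of low-frequency modes. The channel mixing, mode count, and the simplification of the exact Hartley convolution theorem are illustrative assumptions, not the HNOSeg-XS implementation.

```python
import numpy as np

def dht(x):
    # Discrete Hartley transform via the FFT: DHT{x} = Re(FFT(x)) - Im(FFT(x))
    X = np.fft.fft(x, axis=-1)
    return X.real - X.imag

def idht(h):
    # The DHT is its own inverse up to a 1/N factor
    return dht(h) / h.shape[-1]

def hartley_spectral_layer(x, weights, n_modes):
    """One FNO-style spectral layer with real-valued Hartley coefficients.
    x       : (c_in, length) real feature map
    weights : (c_in, c_out, n_modes) learnable, real
    n_modes : retained low-frequency modes (fixed, independent of resolution)
    """
    h = dht(x)
    out = np.zeros((weights.shape[1], x.shape[-1]))
    # Per-mode channel mixing on low frequencies only (this ignores the
    # even/odd coupling of the exact Hartley convolution theorem).
    out[:, :n_modes] = np.einsum("ik,iok->ok", h[:, :n_modes], weights)
    return idht(out)

# The weights live on modes, not grid points, so the same layer runs on a
# finer grid without retraining (the zero-shot super-resolution property).
w = 0.01 * np.random.randn(8, 8, 16)
print(hartley_spectral_layer(np.random.randn(8, 64), w, 16).shape)   # (8, 64)
print(hartley_spectral_layer(np.random.randn(8, 128), w, 16).shape)  # (8, 128)
```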
We present 4KAgent, a unified agentic super-resolution generalist system designed to universally upscale any image to 4K resolution (and even higher, if applied iteratively). Our system can transform images from extremely low resolutions with severe degradations, for example, highly distorted inputs at 256x256, into crystal-clear, photorealistic 4K outputs. 4KAgent comprises three core components: (1) Profiling, a module that customizes the 4KAgent pipeline based on bespoke use cases; (2) A Perception Agent, which leverages vision-language models alongside image quality assessment experts to analyze the input image and make a tailored restoration plan; and (3) A Restoration Agent, which executes the plan, following a recursive execution-reflection paradigm, guided by a quality-driven mixture-of-expert policy to select the optimal output for each step. Additionally, 4KAgent embeds a specialized face restoration pipeline, significantly enhancing facial details in portrait and selfie photos. We rigorously evaluate our 4KAgent across 11 distinct task categories encompassing a total of 26 diverse benchmarks, setting new state-of-the-art on a broad spectrum of imaging domains. Our evaluations cover natural images, portrait photos, AI-generated content, satellite imagery, fluorescence microscopy, and medical imaging like fundoscopy, ultrasound, and X-ray, demonstrating superior performance in terms of both perceptual (e.g., NIQE, MUSIQ) and fidelity (e.g., PSNR) metrics. By establishing a novel agentic paradigm for low-level vision tasks, we aim to catalyze broader interest and innovation within vision-centric autonomous agents across diverse research communities. We will release all the code, models, and results at: this https URL.
https://arxiv.org/abs/2507.07105
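The execution-reflection loop described above can be pictured with a small, purely hypothetical sketch; the expert list, quality scorer, and stopping rule below are placeholders and not the released 4KAgent pipeline.

```python
def restore(image, experts, score, max_rounds=3):
    """Hypothetical execution-reflection loop: each round runs every candidate
    expert (execution), scores the outputs with a no-reference quality metric
    (reflection), keeps the best one, and stops when no expert improves it."""
    current = image
    for _ in range(max_rounds):
        candidates = [expert(current) for expert in experts]
        best = max(candidates, key=score)
        if score(best) <= score(current):
            break
        current = best
    return current

# Usage with placeholder callables (all hypothetical):
# out = restore(img, experts=[denoise, deblur, upscale_x2], score=no_ref_iqa)
```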
Diffusion models show promise for image restoration, but existing methods often struggle with inconsistent fidelity and undesirable artifacts. To address this, we introduce Kernel Density Steering (KDS), a novel inference-time framework promoting robust, high-fidelity outputs through explicit local mode-seeking. KDS employs an N-particle ensemble of diffusion samples, computing patch-wise kernel density estimation gradients from their collective outputs. These gradients steer patches in each particle towards shared, higher-density regions identified within the ensemble. This collective local mode-seeking mechanism, acting as "collective wisdom", steers samples away from spurious, artifact-prone modes that arise from independent sampling or model imperfections, and towards more robust, high-fidelity structures. This allows us to obtain higher-quality samples at the cost of additional compute by sampling multiple particles simultaneously. As a plug-and-play framework, KDS requires no retraining or external verifiers, seamlessly integrating with various diffusion samplers. Extensive numerical validations demonstrate that KDS substantially improves both quantitative and qualitative performance on challenging real-world super-resolution and image inpainting tasks.
https://arxiv.org/abs/2507.05604
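A minimal NumPy sketch of the patch-wise steering idea, assuming a Gaussian kernel and a single shared patch location across the ensemble; the bandwidth, step size, and the way the update is folded into the diffusion sampler are illustrative choices rather than the paper's settings.

```python
import numpy as np

def kde_steering_step(patches, bandwidth=0.5, step=0.1):
    """One mean-shift-style update toward the local density mode of the ensemble.
    patches : (N, D) the same patch location taken from N diffusion particles,
              flattened to D pixels.
    """
    diffs = patches[:, None, :] - patches[None, :, :]      # (N, N, D)
    sq_dist = (diffs ** 2).sum(-1)                         # (N, N)
    w = np.exp(-sq_dist / (2.0 * bandwidth ** 2))          # Gaussian kernel weights
    w /= w.sum(axis=1, keepdims=True)
    local_mean = w @ patches                               # density-weighted mean
    return patches + step * (local_mean - patches)         # KDE-gradient (mean-shift) step

# Toy usage: 8 particles, 4x4 patches
steered = kde_steering_step(np.random.randn(8, 16))
```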
Earth System Models (ESMs) are our main tool for projecting the impacts of climate change. However, running these models at sufficient resolution for local-scale risk assessments is not computationally feasible. Deep learning-based super-resolution models offer a promising solution to downscale ESM outputs to higher resolutions by learning from data. Yet, due to regional variations in climatic processes, these models typically require retraining for each geographical area, demanding high-resolution observational data that is unevenly available across the globe. This highlights the need to assess how well these models generalize across geographic regions. To address this, we introduce RainShift, a dataset and benchmark for evaluating downscaling under geographic distribution shifts. We evaluate state-of-the-art downscaling approaches, including GANs and diffusion models, in generalizing across data gaps between the Global North and Global South. Our findings reveal substantial performance drops in out-of-distribution regions, depending on model and geographic area. While expanding the training domain generally improves generalization, it is insufficient to overcome shifts between geographically distinct regions. We show that addressing these shifts through, for example, data alignment can improve spatial generalization. Our work advances the global applicability of downscaling methods and represents a step toward reducing inequities in access to high-resolution climate information.
https://arxiv.org/abs/2507.04930
Recent advancements in diffusion models have demonstrated remarkable success in various image generation tasks. Building upon these achievements, diffusion models have also been effectively adapted to image restoration tasks, e.g., super-resolution and deblurring, aiming to recover high-quality images from degraded inputs. Although existing zero-shot approaches enable pretrained diffusion models to perform restoration tasks without additional fine-tuning, these methods often suffer from prolonged iteration times in the denoising process. To address this limitation, we propose a Quick Bypass Mechanism (QBM), a strategy that significantly accelerates the denoising process by initializing from an intermediate approximation, effectively bypassing early denoising steps. Furthermore, recognizing that the approximation may introduce inconsistencies, we introduce a Revised Reverse Process (RRP), which adjusts the weighting of random noise to enhance stochasticity and mitigate potential disharmony. We validate the proposed methods on ImageNet-1K and CelebA-HQ across multiple image restoration tasks, e.g., super-resolution, deblurring, and compressed sensing. Our experimental results show that the proposed methods can effectively accelerate existing methods while maintaining original performance.
https://arxiv.org/abs/2507.04207
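The bypass idea can be sketched with the standard forward-noising formula: start the reverse process at an intermediate step from a cheap approximation of the clean image instead of from pure noise at step T. The snippet below assumes a DDPM-style alpha-bar schedule and does not reproduce the paper's Revised Reverse Process.

```python
import torch

def bypass_initialization(x0_approx, alphas_cumprod, t_start):
    """Start reverse diffusion at step t_start from a rough clean-image estimate
    (e.g., a bicubic-upsampled input), skipping the earliest denoising steps.
    x0_approx      : (B, C, H, W) approximation of the restored image
    alphas_cumprod : (T,) cumulative alpha-bar schedule of the sampler
    t_start        : intermediate step index (t_start < T) where sampling resumes
    """
    a_bar = alphas_cumprod[t_start]
    noise = torch.randn_like(x0_approx)
    # Standard forward-noising of the approximation to the chosen step
    x_t = a_bar.sqrt() * x0_approx + (1.0 - a_bar).sqrt() * noise
    return x_t  # hand this to the sampler and iterate t_start - 1, ..., 0
```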
Although the lightweight Vision Transformer has significantly advanced image super-resolution (SR), it faces the inherent challenge of a limited receptive field due to the window-based self-attention modeling. The quadratic computational complexity relative to window size restricts its ability to use a large window size for expanding the receptive field while maintaining low computational costs. To address this challenge, we propose PromptSR, a novel prompt-empowered lightweight image SR method. The core component is the proposed cascade prompting block (CPB), which enhances global information access and local refinement via three cascaded prompting layers: a global anchor prompting layer (GAPL) and two local prompting layers (LPLs). The GAPL leverages downscaled features as anchors to construct low-dimensional anchor prompts (APs) through cross-scale attention, significantly reducing computational costs. These APs, with enhanced global perception, are then used to provide global prompts, efficiently facilitating long-range token connections. The two LPLs subsequently combine category-based self-attention and window-based self-attention to refine the representation in a coarse-to-fine manner. They leverage attention maps from the GAPL as additional global prompts, enabling them to perceive features globally at different granularities for adaptive local refinement. In this way, the proposed CPB effectively combines global priors and local details, significantly enlarging the receptive field while maintaining the low computational costs of our PromptSR. The experimental results demonstrate the superiority of our method, which outperforms state-of-the-art lightweight SR methods in quantitative, qualitative, and complexity evaluations. Our code will be released at this https URL.
https://arxiv.org/abs/2507.04118
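A toy PyTorch sketch of the anchor-prompt idea: a small set of pooled anchors first aggregates global context from all tokens, and every token then reads that context back, so attention cost stays linear in the token count. Learned projections, multi-head structure, and the category-based local layers of the actual CPB are omitted; the shapes and the pooling choice are assumptions.

```python
import torch
import torch.nn.functional as F

def anchor_prompting(tokens, num_anchors=16):
    """Toy single-head anchor attention: (1) pooled anchors attend to all tokens
    to form anchor prompts; (2) tokens attend to the prompts to regain global
    context. Cost is O(N * M) with M << N, never O(N^2).
    tokens : (B, N, C)
    """
    B, N, C = tokens.shape
    anchors = F.adaptive_avg_pool1d(tokens.transpose(1, 2), num_anchors).transpose(1, 2)
    attn_a = torch.softmax(anchors @ tokens.transpose(1, 2) / C ** 0.5, dim=-1)
    anchor_prompts = attn_a @ tokens                      # (B, M, C) global prompts
    attn_t = torch.softmax(tokens @ anchor_prompts.transpose(1, 2) / C ** 0.5, dim=-1)
    return tokens + attn_t @ anchor_prompts               # globally informed tokens

out = anchor_prompting(torch.randn(2, 4096, 64))          # (2, 4096, 64)
```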
Interactive time-varying volume visualization is challenging due to its complex spatiotemporal features and the sheer size of the datasets. Recent works transform the original discrete time-varying volumetric data into continuous Implicit Neural Representations (INRs) to address the issues of compression, rendering, and super-resolution in both spatial and temporal domains. However, training the INR takes a long time to converge, especially when handling large-scale time-varying volumetric datasets. In this work, we propose F-Hash, a novel feature-based multi-resolution Tesseract encoding architecture that greatly enhances convergence speed compared with existing input encoding methods for modeling time-varying volumetric data. The proposed design incorporates multi-level collision-free hash functions that map dynamic 4D multi-resolution embedding grids without bucket waste, achieving high encoding capacity with compact encoding parameters. Our encoding method is agnostic to time-varying feature detection methods, making it a unified encoding solution for feature tracking and evolution visualization. Experiments show that F-Hash achieves state-of-the-art convergence speed in training various time-varying volumetric datasets for diverse features. We also propose an adaptive ray marching algorithm to optimize sample streaming for faster rendering of the time-varying neural representation.
https://arxiv.org/abs/2507.03836
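For intuition, here is a NumPy sketch of a multi-resolution hashed lookup over 4D (x, y, z, t) samples in the style of Instant-NGP, using the common XOR spatial hash; the paper's multi-level collision-free hash functions and its interpolation scheme are not reproduced.

```python
import numpy as np

PRIMES = np.array([1, 2654435761, 805459861, 3674653429], dtype=np.uint64)

def hash_encode_4d(coords, tables, base_res=8, growth=2.0):
    """Multi-resolution hashed feature lookup for (x, y, z, t) samples.
    coords : (N, 4) coordinates in [0, 1]^4
    tables : list of (T, F) learnable feature tables, one per resolution level
    Nearest-vertex lookup only; a real encoder interpolates over the 16
    corners of each 4D cell.
    """
    feats = []
    for level, table in enumerate(tables):
        res = int(base_res * growth ** level)
        grid = np.clip(coords * res, 0, res - 1).astype(np.uint64)   # (N, 4) voxel ids
        idx = np.zeros(len(coords), dtype=np.uint64)
        for d in range(4):                          # XOR spatial hash per dimension
            idx ^= grid[:, d] * PRIMES[d]
        feats.append(table[idx % np.uint64(len(table))])
    return np.concatenate(feats, axis=1)            # (N, levels * F)

tables = [np.random.randn(2 ** 14, 2).astype(np.float32) for _ in range(4)]
enc = hash_encode_4d(np.random.rand(1024, 4), tables)                # (1024, 8)
```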
Real-world video super-resolution (VSR) presents significant challenges due to complex and unpredictable degradations. Although some recent methods utilize image diffusion models for VSR and have shown improved detail generation capabilities, they still struggle to produce temporally consistent frames. We attempt to use Stable Video Diffusion (SVD) combined with ControlNet to address this issue. However, due to the intrinsic image-animation characteristics of SVD, it is challenging to generate fine details using only low-quality videos. To tackle this problem, we propose DAM-VSR, an appearance and motion disentanglement framework for VSR. This framework disentangles VSR into appearance enhancement and motion control problems. Specifically, appearance enhancement is achieved through reference image super-resolution, while motion control is achieved through video ControlNet. This disentanglement fully leverages the generative prior of video diffusion models and the detail generation capabilities of image super-resolution models. Furthermore, equipped with the proposed motion-aligned bidirectional sampling strategy, DAM-VSR can conduct VSR on longer input videos. DAM-VSR achieves state-of-the-art performance on real-world data and AIGC data, demonstrating its powerful detail generation capabilities.
https://arxiv.org/abs/2507.01012
Speech super-resolution (SSR) enhances low-resolution speech by increasing the sampling rate. While most SSR methods focus on magnitude reconstruction, recent research highlights the importance of phase reconstruction for improved perceptual quality. Therefore, we introduce CTFT-Net, a Complex Time-Frequency Transformation Network that reconstructs both magnitude and phase in complex domains for improved SSR tasks. It incorporates a complex global attention block to model inter-phoneme and inter-frequency dependencies and a complex conformer to capture long-range and local features, improving frequency reconstruction and noise robustness. CTFT-Net employs time-domain and multi-resolution frequency-domain loss functions for better generalization. Experiments show CTFT-Net outperforms state-of-the-art models (NU-Wave, WSRGlow, NVSR, AERO) on the VCTK dataset, particularly for extreme upsampling (2 kHz to 48 kHz), reconstructing high frequencies effectively without noisy artifacts.
https://arxiv.org/abs/2507.00229
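The combination of time-domain and multi-resolution frequency-domain objectives mentioned above is commonly implemented as below (PyTorch); the FFT sizes, hop lengths, and the specific spectral terms are generic choices, not necessarily those of CTFT-Net.

```python
import torch

def mr_stft_plus_time_loss(pred, target, ffts=(512, 1024, 2048)):
    """Time-domain L1 plus a multi-resolution STFT magnitude loss, a common
    recipe for speech SR objectives (the exact CTFT-Net losses may differ).
    pred, target : (B, T) waveforms at the full target sampling rate.
    """
    loss = torch.mean(torch.abs(pred - target))            # time-domain term
    for n_fft in ffts:
        win = torch.hann_window(n_fft, device=pred.device)
        sp = torch.stft(pred, n_fft, hop_length=n_fft // 4,
                        window=win, return_complex=True)
        st = torch.stft(target, n_fft, hop_length=n_fft // 4,
                        window=win, return_complex=True)
        # Spectral convergence + log-magnitude terms at each resolution
        loss = loss + torch.norm(st.abs() - sp.abs()) / (torch.norm(st.abs()) + 1e-8)
        loss = loss + torch.mean(torch.abs(torch.log(st.abs() + 1e-7)
                                           - torch.log(sp.abs() + 1e-7)))
    return loss

loss = mr_stft_plus_time_loss(torch.randn(2, 48000), torch.randn(2, 48000))
```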
High-resolution imaging is crucial for enhancing visual clarity and enabling precise computer-assisted guidance in minimally invasive surgery (MIS). Despite the increasing adoption of 4K endoscopic systems, there remains a significant gap in publicly available native 4K datasets tailored specifically for robotic-assisted MIS. We introduce SurgiSR4K, the first publicly accessible surgical imaging and video dataset captured at a native 4K resolution, representing realistic conditions of robotic-assisted procedures. SurgiSR4K comprises diverse visual scenarios including specular reflections, tool occlusions, bleeding, and soft tissue deformations, meticulously designed to reflect common challenges faced during laparoscopic and robotic surgeries. This dataset opens up possibilities for a broad range of computer vision tasks that might benefit from high resolution data, such as super resolution (SR), smoke removal, surgical instrument detection, 3D tissue reconstruction, monocular depth estimation, instance segmentation, novel view synthesis, and vision-language model (VLM) development. SurgiSR4K provides a robust foundation for advancing research in high-resolution surgical imaging and fosters the development of intelligent imaging technologies aimed at enhancing performance, safety, and usability in image-guided robotic surgeries.
https://arxiv.org/abs/2507.00209
Super-resolution (SR) techniques can enhance the spatial resolution of remote sensing images by utilizing low-resolution (LR) images to reconstruct high-resolution (HR) images, enabling more efficient large-scale earth observation applications. While single-image super-resolution (SISR) methods have shown progress, reference-based super-resolution (RefSR) offers superior performance by incorporating historical HR images alongside current LR observations. However, existing RefSR methods struggle with real-world complexities, such as cross-sensor resolution gaps and significant land cover changes, often leading to under-generation or over-reliance on the reference image. To address these challenges, we propose CRefDiff, a novel controllable reference-based diffusion model for real-world remote sensing image SR. To address the under-generation problem, CRefDiff is built upon the pretrained Stable Diffusion model, leveraging its powerful generative prior to produce accurate structures and textures. To mitigate over-reliance on the reference, we introduce a dual-branch fusion mechanism that adaptively integrates both local and global information from the reference image. Moreover, this novel dual-branch design enables reference strength control during inference, enhancing the interactivity and flexibility of the model. Finally, a strategy named Better Start is proposed to significantly reduce the number of denoising steps, thereby accelerating the inference process. To support further research, we introduce Real-RefRSSRD, a new real-world RefSR dataset for remote sensing images, consisting of HR NAIP and LR Sentinel-2 image pairs with diverse land cover changes and significant temporal gaps. Extensive experiments on Real-RefRSSRD show that CRefDiff achieves state-of-the-art performance across various metrics and improves downstream tasks such as scene classification and semantic segmentation.
https://arxiv.org/abs/2506.23801
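Reference strength control at inference can be as simple as scaling the contribution of the reference-derived branches before they are added back to the diffusion features; the sketch below is a hypothetical reading of that knob, with shapes and gating assumed rather than taken from the paper.

```python
def controllable_fusion(x, local_feat, global_feat, strength=1.0):
    """Hypothetical inference-time reference-strength knob: the two reference
    branches are scaled by `strength` before being added to the diffusion
    features. strength = 0 ignores the reference; strength = 1 uses it fully.
    x, local_feat, global_feat : tensors of identical shape (B, C, H, W)
    """
    return x + strength * (local_feat + global_feat)
```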
Diffusion-based generative models have demonstrated exceptional promise in the video super-resolution (VSR) task, achieving a substantial advancement in detail generation relative to prior methods. However, these approaches face significant computational efficiency challenges. For instance, current techniques may require tens of minutes to super-resolve a mere 2-second, 1080p video. In this paper, we present TurboVSR, an ultra-efficient diffusion-based video super-resolution model. Our core design comprises three key aspects: (1) We employ an autoencoder with a high compression ratio of 32x32x8 to reduce the number of tokens. (2) Highly compressed latents pose substantial challenges for training. We introduce factorized conditioning to mitigate the learning complexity: we first learn to super-resolve the initial frame; subsequently, we condition the super-resolution of the remaining frames on the high-resolution initial frame and the low-resolution subsequent frames. (3) We convert the pre-trained diffusion model to a shortcut model to enable fewer sampling steps, further accelerating inference. As a result, TurboVSR performs on par with state-of-the-art VSR methods while being 100+ times faster, taking only 7 seconds to process a 2-second-long 1080p video. TurboVSR also supports image super-resolution by treating an image as a one-frame video. Our efficient design makes SR beyond 1080p possible; results on 4K (3648x2048) image SR show surprisingly fine details.
https://arxiv.org/abs/2506.23618
The acquisition of high-resolution satellite imagery is often constrained by the spatial and temporal limitations of satellite sensors, as well as the high costs associated with frequent observations. These challenges hinder applications such as environmental monitoring, disaster response, and agricultural management, which require fine-grained and high-resolution data. In this paper, we propose MWT-Diff, an innovative framework for satellite image super-resolution (SR) that combines latent diffusion models with wavelet transforms to address these challenges. At the core of the framework is a novel metadata-, wavelet-, and time-aware encoder (MWT-Encoder), which generates embeddings that capture metadata attributes, multi-scale frequency information, and temporal relationships. The embedded feature representations steer the hierarchical diffusion dynamics, through which the model progressively reconstructs high-resolution satellite imagery from low-resolution inputs. This process preserves critical spatial characteristics including textural patterns, boundary discontinuities, and high-frequency spectral components essential for detailed remote sensing analysis. The comparative analysis of MWT-Diff across multiple datasets demonstrated favorable performance compared to recent approaches, as measured by standard perceptual quality metrics including FID and LPIPS.
https://arxiv.org/abs/2506.23566
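A hand-crafted stand-in for the kind of conditioning vector the MWT-Encoder learns: per-level wavelet sub-band energies, normalized metadata scalars, and a cyclic encoding of acquisition time (PyWavelets; all feature choices below are assumptions, since the real encoder is learned end-to-end).

```python
import numpy as np
import pywt

def mwt_embedding(lr_image, metadata, timestamp, level=3):
    """Toy metadata-, wavelet-, and time-aware conditioning vector.
    lr_image : (H, W) single-band low-resolution image
    metadata : dict of already-normalized scalar acquisition attributes
    timestamp: acquisition time as a fraction of the year in [0, 1]
    """
    coeffs = pywt.wavedec2(lr_image, "haar", level=level)
    bands = [coeffs[0]] + [b for lvl in coeffs[1:] for b in lvl]
    wavelet_feats = np.array([np.sqrt(np.mean(b ** 2)) for b in bands])   # per-band energy
    meta_feats = np.array(list(metadata.values()), dtype=np.float32)
    time_feats = np.array([np.sin(2 * np.pi * timestamp),
                           np.cos(2 * np.pi * timestamp)])                # cyclic time encoding
    return np.concatenate([wavelet_feats, meta_feats, time_feats])

emb = mwt_embedding(np.random.rand(64, 64),
                    {"cloud_cover": 0.1, "off_nadir": 0.3}, 0.42)
```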
Diffusion-model-based image super-resolution techniques often face a trade-off between realistic image generation and computational efficiency. This issue is exacerbated when inference time is reduced by decreasing the number of sampling steps, resulting in less realistic and hazy images. To overcome this challenge, we introduce a novel diffusion model named PixelBoost that underscores the significance of embracing the stochastic nature of Brownian motion in advancing image super-resolution, resulting in a high degree of realism, particularly in texture and edge definition. By integrating controlled stochasticity into the training regimen, our proposed model avoids convergence to local optima, effectively capturing and reproducing the inherent uncertainty of image textures and patterns. Our proposed model demonstrates superior objective results in terms of learned perceptual image patch similarity (LPIPS), lightness order error (LOE), peak signal-to-noise ratio (PSNR), and structural similarity index measure (SSIM), as well as visual quality. To assess edge enhancement, we evaluated gradient magnitudes and pixel values, and our proposed model exhibited better edge reconstruction capability. Additionally, our model demonstrates adaptive learning capabilities by effectively adjusting to Brownian noise patterns and introduces a sigmoidal noise sequencing method that simplifies training, resulting in faster inference speeds.
https://arxiv.org/abs/2506.23254
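The abstract does not specify the sigmoidal noise sequencing; one common way to realize a sigmoid-shaped schedule for the cumulative signal level is sketched below (the parameterization is an assumption, not the paper's).

```python
import numpy as np

def sigmoidal_alpha_bar(num_steps, start=-6.0, end=6.0):
    """Sigmoid-shaped cumulative signal level (alpha-bar) over the diffusion
    trajectory: slow change at both ends, fast change in the middle.
    """
    t = np.linspace(start, end, num_steps)
    s = 1.0 / (1.0 + np.exp(t))                  # decreasing sigmoid in (0, 1)
    # Rescale so alpha-bar runs from 1 (clean) down to 0 (pure noise)
    return (s - s[-1]) / (s[0] - s[-1])

alpha_bar = sigmoidal_alpha_bar(1000)
betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]     # per-step noise rates
```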
The paper proposes a statistical learning approach to the problem of estimating missing pixels of images, crucial for image inpainting and super-resolution problems. One of the main novelties of the method is that it also provides uncertainty quantification together with the estimated values. Our core assumption is that the underlying data-generating function comes from a Reproducing Kernel Hilbert Space (RKHS). Special emphasis is placed on band-limited functions, central to signal processing, which form Paley-Wiener type RKHSs. The proposed method, which we call Simultaneously Guaranteed Kernel Interpolation (SGKI), is an extension and refinement of a recently developed kernel method. An advantage of SGKI is that it not only estimates the missing pixels, but also builds non-asymptotic confidence bands for the unobserved values, which are simultaneously guaranteed for all missing pixels. We also show how to compute these bands efficiently using Schur complements, discuss a generalization to vector-valued functions, and present a series of numerical experiments on datasets containing both synthetically generated and benchmark images.
https://arxiv.org/abs/2506.23221
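The estimator-plus-Schur-complement machinery can be sketched in a few lines with the sinc kernel of a Paley-Wiener RKHS; note that SGKI's simultaneous, non-asymptotic bands are constructed differently from the illustrative pointwise half-widths shown here.

```python
import numpy as np

def sgki_style_interpolation(coords_obs, y_obs, coords_mis, band=4.0, reg=1e-8):
    """Kernel interpolation of missing pixels with a Schur-complement covariance,
    using a separable sinc kernel (a Paley-Wiener / band-limited RKHS kernel).
    coords_obs : (n, d) locations of observed pixels
    y_obs      : (n,)   observed values
    coords_mis : (m, d) locations of missing pixels
    """
    def k(a, b):
        d = a[:, None, :] - b[None, :, :]
        return np.prod(np.sinc(band * d), axis=-1)

    K_oo = k(coords_obs, coords_obs) + reg * np.eye(len(coords_obs))
    K_mo = k(coords_mis, coords_obs)
    K_mm = k(coords_mis, coords_mis)
    sol = np.linalg.solve(K_oo, y_obs)
    mean = K_mo @ sol                                         # interpolated pixels
    cov = K_mm - K_mo @ np.linalg.solve(K_oo, K_mo.T)         # Schur complement
    halfwidth = 2.0 * np.sqrt(np.clip(np.diag(cov), 0.0, None))  # illustrative band only
    return mean, halfwidth

obs = np.random.rand(50, 2)
mean, half = sgki_style_interpolation(obs, np.sin(6 * obs[:, 0]), np.random.rand(20, 2))
```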
Video super-resolution remains a major challenge in low-level vision tasks. To date, CNN- and Transformer-based methods have delivered impressive results. However, CNNs are limited by local receptive fields, while Transformers struggle with quadratic complexity, posing challenges for processing long sequences in VSR. Recently, Mamba has drawn attention for its long-sequence modeling, linear complexity, and large receptive fields. In this work, we propose VSRM, a novel Video Super-Resolution framework that leverages the power of Mamba. VSRM introduces Spatial-to-Temporal Mamba and Temporal-to-Spatial Mamba blocks to extract long-range spatio-temporal features and enhance receptive fields efficiently. To better align adjacent frames, we propose a Deformable Cross-Mamba Alignment module. This module utilizes a deformable cross-mamba mechanism to make the compensation stage more dynamic and flexible, preventing feature distortions. Finally, we minimize the frequency-domain gaps between reconstructed and ground-truth frames by proposing a simple yet effective Frequency Charbonnier-like loss that better preserves high-frequency content and enhances visual quality. Through extensive experiments, VSRM achieves state-of-the-art results on diverse benchmarks, establishing itself as a solid foundation for future research.
https://arxiv.org/abs/2506.22762
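One plausible reading of the Frequency Charbonnier-like loss is a Charbonnier penalty on the complex Fourier residual between reconstructed and ground-truth frames; the exact transform, normalization, and weighting used in VSRM may differ.

```python
import torch

def frequency_charbonnier_loss(pred, target, eps=1e-3):
    """Charbonnier penalty applied to the 2D Fourier spectra of the frames.
    pred, target : (B, C, H, W) frames in [0, 1]
    """
    fp = torch.fft.rfft2(pred, norm="ortho")
    ft = torch.fft.rfft2(target, norm="ortho")
    diff = fp - ft                                   # complex residual
    # Charbonnier on the residual magnitude, averaged over all frequency bins
    return torch.mean(torch.sqrt(diff.real ** 2 + diff.imag ** 2 + eps ** 2))

loss = frequency_charbonnier_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64))
```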
Implicit degradation estimation-based blind super-resolution (IDE-BSR) hinges on extracting the implicit degradation representation (IDR) of the LR image and adapting it to LR image features to guide HR detail restoration. Although IDE-BSR has shown potential in dealing with noise interference and complex degradations, existing methods ignore the importance of IDR discriminability for BSR and instead over-complicate the adaptation process to improve effectiveness, resulting in a significant increase in the model's parameters and computations. In this paper, we focus on the discriminability optimization of IDR and propose a new powerful and lightweight BSR model termed LightBSR. Specifically, we employ a knowledge distillation-based learning framework. We first introduce a well-designed degradation-prior-constrained contrastive learning technique during the teacher stage to make the model more focused on distinguishing different degradation types. Then we utilize a feature alignment technique to transfer the degradation-related knowledge acquired by the teacher to the student for practical inference. Extensive experiments demonstrate the effectiveness of IDR discriminability-driven BSR model design. The proposed LightBSR can achieve outstanding performance with minimal complexity across a range of blind SR tasks. Our code is accessible at: this https URL.
https://arxiv.org/abs/2506.22710
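A generic stand-in for the degradation-prior-constrained contrastive objective used in the teacher stage: an InfoNCE loss in which IDRs of patches sharing the same degradation are positives and all other batch entries are negatives (the temperature and sampling scheme are assumptions).

```python
import torch
import torch.nn.functional as F

def degradation_contrastive_loss(idr_a, idr_b, temperature=0.1):
    """InfoNCE-style loss that pulls together implicit degradation
    representations (IDRs) of two patches with the same degradation and
    pushes apart IDRs from other degradations in the batch.
    idr_a, idr_b : (B, D) IDRs of two patches per image; row i of both
                   tensors share one degradation.
    """
    a = F.normalize(idr_a, dim=1)
    b = F.normalize(idr_b, dim=1)
    logits = a @ b.t() / temperature            # (B, B) cosine similarities
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)      # positives lie on the diagonal

loss = degradation_contrastive_loss(torch.randn(32, 128), torch.randn(32, 128))
```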
Background: Conventional cardiovascular magnetic resonance (CMR) in paediatric and congenital heart disease uses 2D, breath-hold, balanced steady state free precession (bSSFP) cine imaging for assessment of function and cardiac-gated, respiratory-navigated, static 3D bSSFP whole-heart imaging for anatomical assessment. Our aim is to concatenate a stack of 2D free-breathing real-time cines and use Deep Learning (DL) to create an isotropic, fully segmented 3D cine dataset from these images. Methods: Four DL models were trained on open-source data to perform: a) interslice contrast correction; b) interslice respiratory motion correction; c) super-resolution (slice direction); and d) segmentation of the right and left atria and ventricles (RA, LA, RV, and LV), thoracic aorta (Ao), and pulmonary arteries (PA). In 10 patients undergoing routine cardiovascular examination, our method was validated on prospectively acquired sagittal stacks of real-time cine images. Quantitative metrics (ventricular volumes and vessel diameters) and image quality of the 3D cines were compared to conventional breath-hold cine and whole-heart imaging. Results: All real-time data were successfully transformed into 3D cines, with a total post-processing time of <1 min in all cases. There were no significant biases in any LV or RV metrics, with reasonable limits of agreement and correlation. There was also reasonable agreement for all vessel diameters, although there was a small but significant overestimation of RPA diameter. Conclusion: We have demonstrated the potential of creating 3D cine data from concatenated 2D real-time cine images using a series of DL models. Our method has short acquisition and reconstruction times, with fully segmented data being available within 2 minutes. The good agreement with conventional imaging suggests that our method could help to significantly speed up CMR in clinical practice.
https://arxiv.org/abs/2506.22532
Image restoration is a key task in low-level computer vision that aims to reconstruct high-quality images from degraded inputs. The emergence of Vision Mamba, which draws inspiration from the advanced state space model Mamba, marks a significant advancement in this field. Vision Mamba demonstrates excellence in modeling long-range dependencies with linear complexity, a crucial advantage for image restoration tasks. Despite its strengths, Vision Mamba encounters challenges in low-level vision tasks, including computational complexity that scales with the number of scanning sequences and local pixel forgetting. To address these limitations, this study introduces Efficient All-Around Mamba (EAMamba), an enhanced framework that incorporates a Multi-Head Selective Scan Module (MHSSM) with an all-around scanning mechanism. MHSSM efficiently aggregates multiple scanning sequences, which avoids increases in computational complexity and parameter count. The all-around scanning strategy implements multiple patterns to capture holistic information and resolves the local pixel forgetting issue. Our experimental evaluations validate these innovations across several restoration tasks, including super-resolution, denoising, deblurring, and dehazing. The results show that EAMamba achieves a significant 31-89% reduction in FLOPs while maintaining favorable performance compared to existing low-level Vision Mamba methods.
https://arxiv.org/abs/2506.22246
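The all-around scanning idea, reduced to its simplest form: run a 1D sequence operator over several flattenings of the feature map and merge the results, so each pixel receives context from multiple directions. The scan patterns and the toy sequence operator below are placeholders for the paper's multi-head selective scan.

```python
import numpy as np

def all_around_scan(feat, scan_fn):
    """Apply a 1D sequence model over several flattenings of a 2D feature map
    and average the results.
    feat    : (H, W, C) feature map
    scan_fn : callable mapping an (L, C) sequence to an (L, C) sequence
    """
    H, W, C = feat.shape
    outputs = []
    for seq, restore in [
        (feat.reshape(-1, C),       lambda s: s.reshape(H, W, C)),        # row-major
        (feat.reshape(-1, C)[::-1], lambda s: s[::-1].reshape(H, W, C)),  # reversed
        (feat.transpose(1, 0, 2).reshape(-1, C),
         lambda s: s.reshape(W, H, C).transpose(1, 0, 2)),                # column-major
    ]:
        outputs.append(restore(scan_fn(seq)))
    return np.mean(outputs, axis=0)

# Toy stand-in for a selective scan: a causal running mean along the sequence
toy_scan = lambda s: np.cumsum(s, axis=0) / np.arange(1, len(s) + 1)[:, None]
out = all_around_scan(np.random.randn(8, 8, 4), toy_scan)   # (8, 8, 4)
```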
Super-resolution (SR) is an ill-posed inverse problem with many feasible solutions consistent with a given low-resolution image. On one hand, regressive SR models aim to balance fidelity and perceptual quality to yield a single solution, but this trade-off often introduces artifacts that create ambiguity in information-critical applications such as recognizing digits or letters. On the other hand, diffusion models generate a diverse set of SR images, but selecting the most trustworthy solution from this set remains a challenge. This paper introduces a robust, automated framework for identifying the most trustworthy SR sample from a diffusion-generated set by leveraging the semantic reasoning capabilities of vision-language models (VLMs). Specifically, VLMs such as BLIP-2, GPT-4o, and their variants are prompted with structured queries to assess semantic correctness, visual quality, and artifact presence. The top-ranked SR candidates are then ensembled to yield a single trustworthy output in a cost-effective manner. To rigorously assess the validity of VLM-selected samples, we propose a novel Trustworthiness Score (TWS), a hybrid metric that quantifies SR reliability based on three complementary components: semantic similarity via CLIP embeddings, structural integrity using SSIM on edge maps, and artifact sensitivity through multi-level wavelet decomposition. We empirically show that TWS correlates strongly with human preference on both ambiguous and natural images, and that VLM-guided selections consistently yield high TWS values. Compared to conventional metrics such as PSNR and LPIPS, which fail to reflect information fidelity, our approach offers a principled, scalable, and generalizable solution for navigating the uncertainty of the diffusion SR space. By aligning outputs with human expectations and semantic correctness, this work sets a new benchmark for trustworthiness in generative SR.
https://arxiv.org/abs/2506.20832
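An illustrative composition of the three TWS ingredients named above: CLIP-style semantic similarity, SSIM on edge maps, and a wavelet-based artifact term. The `embed` callable, the normalization of each term, and the equal weights are assumptions; the paper defines the exact score.

```python
import numpy as np
import pywt
from skimage.filters import sobel
from skimage.metrics import structural_similarity as ssim

def trustworthiness_score(sr, ref, embed, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Toy TWS: weighted mix of semantic similarity, edge-map SSIM, and a
    wavelet artifact term. `embed` stands in for a CLIP image encoder;
    sr and ref are (H, W) grayscale images in [0, 1]."""
    # 1) semantic similarity: cosine between embeddings
    e1, e2 = embed(sr), embed(ref)
    semantic = float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2) + 1e-8))
    # 2) structural integrity: SSIM computed on Sobel edge maps
    es, er = sobel(sr), sobel(ref)
    dr = float(max(es.max(), er.max()) - min(es.min(), er.min())) or 1.0
    structure = ssim(es, er, data_range=dr)
    # 3) artifact sensitivity: mismatch of high-frequency wavelet coefficients
    def hf(img):
        detail = pywt.wavedec2(img, "haar", level=3)[1:]
        return np.concatenate([np.ravel(b) for lvl in detail for b in lvl])
    artifact = 1.0 - np.tanh(float(np.mean(np.abs(hf(sr) - hf(ref)))))
    return float(np.asarray(weights) @ np.array([semantic, structure, artifact]))

# Any image-to-vector function works as a placeholder for a real CLIP encoder:
score = trustworthiness_score(np.random.rand(64, 64), np.random.rand(64, 64),
                              embed=lambda im: im.ravel())
```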