Single-image super-resolution (SISR) remains challenging due to the inherent difficulty of recovering fine-grained details and preserving perceptual quality from low-resolution inputs. Existing methods often rely on limited image priors, leading to suboptimal results. We propose a novel approach that leverages the rich contextual information available in multiple modalities -- including depth, segmentation, edges, and text prompts -- to learn a powerful generative prior for SISR within a diffusion model framework. We introduce a flexible network architecture that effectively fuses multimodal information, accommodating an arbitrary number of input modalities without requiring significant modifications to the diffusion process. Crucially, we mitigate hallucinations, often introduced by text prompts, by using spatial information from other modalities to guide regional text-based conditioning. Each modality's guidance strength can also be controlled independently, allowing outputs to be steered in different directions, such as increasing bokeh through depth or adjusting object prominence via segmentation. Extensive experiments demonstrate that our model surpasses state-of-the-art generative SISR methods, achieving superior visual quality and fidelity. See project page at this https URL.
https://arxiv.org/abs/2503.14503
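The independent per-modality guidance control described above can be illustrated with a compositional, classifier-free-style guidance rule in which each modality contributes its own correction term with its own scale. This is a minimal sketch under that assumption; the function, weights, and tensor shapes are ours, not the paper's actual conditioning mechanism.

```python
import torch

def multimodal_guidance(eps_uncond, eps_per_modality, weights):
    """Combine per-modality noise predictions with independent guidance scales.

    eps_uncond:        (B, C, H, W) unconditional noise prediction
    eps_per_modality:  dict name -> (B, C, H, W) prediction conditioned on that modality
    weights:           dict name -> float guidance scale for that modality
    """
    eps = eps_uncond.clone()
    for name, eps_cond in eps_per_modality.items():
        eps = eps + weights[name] * (eps_cond - eps_uncond)
    return eps

# toy usage: stronger depth guidance than segmentation
B, C, H, W = 1, 4, 32, 32
eps_u = torch.randn(B, C, H, W)
eps_mods = {"depth": torch.randn(B, C, H, W), "segmentation": torch.randn(B, C, H, W)}
eps = multimodal_guidance(eps_u, eps_mods, {"depth": 3.0, "segmentation": 1.0})
print(eps.shape)
```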
The computer vision community has developed numerous techniques for digitally restoring true scene information from single-view degraded photographs, an important yet extremely ill-posed task. In this work, we tackle image restoration from a different perspective by jointly denoising multiple photographs of the same scene. Our core hypothesis is that degraded images capturing a shared scene contain complementary information that, when combined, better constrains the restoration problem. To this end, we implement a powerful multi-view diffusion model that jointly generates uncorrupted views by extracting rich information from multi-view relationships. Our experiments show that our multi-view approach outperforms existing single-view image and even video-based methods on image deblurring and super-resolution tasks. Critically, our model is trained to output 3D consistent images, making it a promising tool for applications requiring robust multi-view integration, such as 3D reconstruction or pose estimation.
https://arxiv.org/abs/2503.14463
Real-world image super-resolution is a critical image processing task, where two key evaluation criteria are the fidelity to the original image and the visual realness of the generated results. Although existing methods based on diffusion models excel in visual realness by leveraging strong priors, they often struggle to achieve an effective balance between fidelity and realness. In our preliminary experiments, we observe that a linear combination of multiple models outperforms individual models, motivating us to harness the strengths of different models for a more effective trade-off. Based on this insight, we propose a distillation-based approach that leverages the geometric decomposition of both fidelity and realness, alongside the performance advantages of multiple teacher models, to strike a more balanced trade-off. Furthermore, we explore the controllability of this trade-off, enabling a flexible and adjustable super-resolution process, which we call CTSR (Controllable Trade-off Super-Resolution). Experiments conducted on several real-world image super-resolution benchmarks demonstrate that our method surpasses existing state-of-the-art approaches, achieving superior performance across both fidelity and realness metrics.
https://arxiv.org/abs/2503.14272
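The preliminary observation motivating CTSR, that a linear combination of multiple models can outperform each individual model, is easy to sketch. The blend below is purely illustrative; the paper's actual method distills this trade-off into a single controllable model rather than averaging teacher outputs at test time.

```python
import torch

def blend_sr_outputs(sr_fidelity, sr_realness, alpha=0.5):
    """Linearly combine two SR predictions; alpha trades fidelity against realness."""
    return alpha * sr_fidelity + (1.0 - alpha) * sr_realness

# toy usage: two hypothetical teacher outputs for the same LR input
sr_a = torch.rand(1, 3, 256, 256)   # fidelity-oriented teacher
sr_b = torch.rand(1, 3, 256, 256)   # realness-oriented teacher
for alpha in (0.2, 0.5, 0.8):
    out = blend_sr_outputs(sr_a, sr_b, alpha)
    print(alpha, out.shape)
```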
Video Super-Resolution (VSR) reconstructs high-resolution videos from low-resolution inputs to restore fine details and improve visual clarity. While deep learning-based VSR methods achieve impressive results, their centralized nature raises serious privacy concerns, particularly in applications with strict privacy requirements. Federated Learning (FL) offers an alternative approach, but existing FL methods struggle with low-level vision tasks, leading to suboptimal reconstructions. To address this, we propose FedVSR, a novel, architecture-independent, and stateless FL framework for VSR. Our approach introduces a lightweight loss term that improves local optimization and guides global aggregation with minimal computational overhead. To the best of our knowledge, this is the first attempt at federated VSR. Extensive experiments show that FedVSR outperforms general FL methods by an average of 0.85 dB in PSNR, highlighting its effectiveness. The code is available at: this https URL
https://arxiv.org/abs/2503.13745
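The abstract says a lightweight loss term guides global aggregation but not how, so the sketch below shows one assumed scheme: FedAvg weights that mix each client's dataset size with a softmax over reported training losses. Treat the weighting rule, `beta`, and the helper names as hypothetical.

```python
import copy
import torch

def aggregate(client_states, client_sizes, client_losses, beta=1.0):
    """Aggregate client model states; weights mix dataset size with a loss-based term.
    Clients reporting a lower training loss receive slightly higher weight (assumed scheme)."""
    sizes = torch.tensor(client_sizes, dtype=torch.float32)
    losses = torch.tensor(client_losses, dtype=torch.float32)
    weights = (sizes / sizes.sum()) * torch.softmax(-beta * losses, dim=0)
    weights = weights / weights.sum()

    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        global_state[key] = sum(w * s[key] for w, s in zip(weights, client_states))
    return global_state

# toy usage with two clients sharing a tiny model
model = torch.nn.Linear(4, 4)
states = [copy.deepcopy(model.state_dict()) for _ in range(2)]
new_state = aggregate(states, client_sizes=[100, 300], client_losses=[0.8, 0.5])
model.load_state_dict(new_state)
```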
In recent years, attention mechanisms have been exploited in single image super-resolution (SISR), achieving impressive reconstruction results. However, these advancements are still limited by the reliance on simple training strategies and network architectures designed for discrete up-sampling scales, which hinder the model's ability to effectively capture information across multiple scales. To address these limitations, we propose a novel framework, \textbf{C2D-ISR}, for optimizing attention-based image super-resolution models from both performance and complexity perspectives. Our approach is based on a two-stage training methodology and a hierarchical encoding mechanism. The new training methodology involves continuous-scale training for discrete scale models, enabling the learning of inter-scale correlations and multi-scale feature representation. In addition, we generalize the hierarchical encoding mechanism with existing attention-based network structures, which can achieve improved spatial feature fusion, cross-scale information aggregation, and more importantly, much faster inference. We have evaluated the C2D-ISR framework based on three efficient attention-based backbones, SwinIR-L, SRFormer-L and MambaIRv2-L, and demonstrated significant improvements over the other existing optimization framework, HiT, in terms of super-resolution performance (up to 0.2dB) and computational complexity reduction (up to 11%). The source code will be made publicly available at this http URL.
https://arxiv.org/abs/2503.13740
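Continuous-scale training can be sketched on the data side: each iteration draws a non-integer scale and builds an LR/HR pair by bicubic resampling. This shows only the pair generation under an assumed scale range; how C2D-ISR couples such pairs to discrete-scale backbones is not captured here.

```python
import random
import torch
import torch.nn.functional as F

def continuous_scale_pair(hr, min_scale=1.5, max_scale=4.0):
    """Create an (LR, HR) training pair at a randomly sampled, non-integer scale.
    hr: (B, C, H, W) high-resolution crop."""
    s = random.uniform(min_scale, max_scale)
    h, w = hr.shape[-2:]
    lr = F.interpolate(hr, size=(max(1, round(h / s)), max(1, round(w / s))),
                       mode="bicubic", align_corners=False)
    return lr.clamp(0, 1), hr, s

hr = torch.rand(2, 3, 192, 192)
lr, hr, s = continuous_scale_pair(hr)
print(f"scale {s:.2f}: LR {tuple(lr.shape[-2:])} -> HR {tuple(hr.shape[-2:])}")
```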
Diffusion models for super-resolution (SR) produce high-quality visual results but are computationally expensive. Despite the development of several methods to accelerate diffusion-based SR models, some (e.g., SinSR) fail to produce realistic perceptual details, while others (e.g., OSEDiff) may hallucinate non-existent structures. To overcome these issues, we present RSD, a new distillation method for ResShift, one of the top diffusion-based SR models. Our method trains the student network to produce images such that a new fake ResShift model trained on them coincides with the teacher model. RSD achieves single-step restoration and outperforms the teacher by a large margin. We show that our distillation method can surpass the other distillation-based method for ResShift, SinSR, making it on par with state-of-the-art diffusion-based SR distillation methods. Compared to SR methods based on pre-trained text-to-image models, RSD produces competitive perceptual quality, provides images with better alignment to degraded input images, and requires fewer parameters and GPU memory. We provide experimental results on various real-world and synthetic datasets, including RealSR, RealSet65, DRealSR, ImageNet, and DIV2K.
https://arxiv.org/abs/2503.13358
To this day, accurately simulating local-scale precipitation and reliably reproducing its distribution remains a challenging task. The limited horizontal resolution of Global Climate Models is among the primary factors undermining their skill in this context. The physical mechanisms driving the onset and development of precipitation, especially in extreme events, operate at spatio-temporal scales smaller than those numerically resolved, and are thus difficult to capture accurately. In order to circumvent this limitation, several downscaling approaches have been developed over the last decades to address the discrepancy between the spatial resolution of model output and the resolution required by local-scale applications. In this paper, we introduce RainScaleGAN, a conditional deep convolutional Generative Adversarial Network (GAN) for precipitation downscaling. GANs have been effectively used in image super-resolution, an approach highly relevant for downscaling tasks. RainScaleGAN's capabilities are tested in a perfect-model setup, where the spatial resolution of a precipitation dataset is artificially degraded from 0.25$^{\circ}\times$0.25$^{\circ}$ to 2$^{\circ}\times$2$^\circ$, and RainScaleGAN is used to restore it. The developed model outperforms one of the leading precipitation downscaling methods found in the literature. RainScaleGAN not only generates a synthetic dataset featuring plausible high-resolution spatial patterns and intensities, but also produces a precipitation distribution with statistics closely mirroring those of the ground-truth dataset. Given that RainScaleGAN's approach is agnostic with respect to the underlying physics, the method has the potential to be applied to other physical variables such as surface winds or temperature.
https://arxiv.org/abs/2503.13316
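The perfect-model setup degrades a 0.25°×0.25° field to 2°×2°, i.e. a factor of 8 per axis. A plain block-averaging coarsener is sketched below; the paper does not state its exact degradation operator, so averaging is an assumption.

```python
import numpy as np

def coarsen(precip_hr, factor=8):
    """Block-average a high-resolution precipitation field (e.g. 0.25 deg -> 2 deg).
    precip_hr: 2D array whose dimensions are divisible by `factor`."""
    ny, nx = precip_hr.shape
    assert ny % factor == 0 and nx % factor == 0
    return precip_hr.reshape(ny // factor, factor, nx // factor, factor).mean(axis=(1, 3))

field = np.random.gamma(shape=0.5, scale=2.0, size=(128, 256))  # toy 0.25-degree field
coarse = coarsen(field, factor=8)                                # 2-degree field
print(field.shape, "->", coarse.shape)
```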
This study proposes a lightweight method for building image super-resolution using a Dilated Contextual Feature Modulation Network (DCFMN). The process includes obtaining high-resolution images, down-sampling them to low-resolution, enhancing the low-resolution images, constructing and training a lightweight network model, and generating super-resolution outputs. To address challenges such as regular textures and long-range dependencies in building images, the DCFMN integrates a dilated separable modulation unit and a local feature enhancement module. The former employs multiple dilated convolutions, equivalent to a large kernel, to efficiently aggregate multi-scale features while leveraging a simple attention mechanism for adaptivity. The latter encodes local features, mixes channel information, and ensures no additional computational burden during inference through reparameterization. This approach resolves the limitations of existing lightweight super-resolution networks in modeling long-range dependencies, achieving accurate and efficient global feature modeling without increasing computational costs and significantly improving both reconstruction quality and lightweight efficiency for building image super-resolution.
https://arxiv.org/abs/2503.13179
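A rough PyTorch sketch of the two components named above: several dilated depthwise convolutions whose summed responses approximate a large kernel, modulated by a simple gating attention. Channel counts, dilation rates, and the module name are illustrative, and the reparameterized local-feature branch is omitted.

```python
import torch
import torch.nn as nn

class DilatedModulation(nn.Module):
    """Aggregate multi-dilation depthwise convs (large effective kernel) and apply
    a simple gating attention. Layout and dilation rates are illustrative only."""
    def __init__(self, channels, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d, groups=channels)
            for d in dilations
        ])
        self.point = nn.Conv2d(channels, channels, 1)
        self.gate = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        agg = sum(branch(x) for branch in self.branches)   # multi-scale aggregation
        attn = torch.sigmoid(self.gate(agg))                # simple attention
        return x + self.point(agg * attn)                   # residual modulation

x = torch.rand(1, 32, 64, 64)
print(DilatedModulation(32)(x).shape)
```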
While recent image super-resolution (SR) techniques keep improving the perceptual quality of their outputs, they often fail in quantitative evaluations. This inconsistency leads to a growing distrust in existing image metrics for SR evaluations. Though image evaluation depends on both the metric and the reference ground truth (GT), researchers typically do not inspect the role of GTs, as they are generally accepted as `perfect' references. However, because such data were collected in the early years without controlling for other types of distortions, we point out that GTs in existing SR datasets can exhibit relatively poor quality, which leads to biased evaluations. Following this observation, in this paper, we are interested in the following questions: Are GT images in existing SR datasets 100\% trustworthy for model evaluations? How does GT quality affect this evaluation? And how to make fair evaluations if there exist imperfect GTs? To answer these questions, this paper presents two main contributions. First, by systematically analyzing seven state-of-the-art SR models across three real-world SR datasets, we show that SR performances can be consistently affected across models by low-quality GTs, and models can perform quite differently when GT quality is controlled. Second, we propose a novel perceptual quality metric, Relative Quality Index (RQI), that measures the relative quality discrepancy of image pairs, thus mitigating the biased evaluations caused by unreliable GTs. Our proposed metric achieves significantly better consistency with human opinions. We expect our work to provide insights for the SR community on how future datasets, models, and metrics should be developed.
https://arxiv.org/abs/2503.13074
Deep learning-based super-resolution (SR) methods often perform pixel-wise computations uniformly across entire images, even in homogeneous regions where high-resolution refinement is redundant. We propose the Quadtree Diffusion Model (QDM), a region-adaptive diffusion framework that leverages a quadtree structure to selectively enhance detail-rich regions while reducing computations in homogeneous areas. By guiding the diffusion with a quadtree derived from the low-quality input, QDM identifies key regions, represented by leaf nodes, where fine detail is essential and applies minimal refinement elsewhere. This mask-guided, two-stream architecture adaptively balances quality and efficiency, producing high-fidelity outputs with low computational redundancy. Experiments demonstrate QDM's effectiveness in high-resolution SR tasks across diverse image types, particularly in medical imaging (e.g., CT scans), where large homogeneous regions are prevalent. Furthermore, QDM outperforms or is comparable to state-of-the-art SR methods on standard benchmarks while significantly reducing computational costs, highlighting its efficiency and suitability for resource-limited environments. Our code is available at this https URL.
https://arxiv.org/abs/2503.12015
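The quadtree that drives region-adaptive refinement can be sketched as a recursive split on local variance: homogeneous blocks stay as large leaves, detail-rich blocks keep splitting. The threshold and minimum block size below are arbitrary illustrative choices, not the paper's settings.

```python
import numpy as np

def quadtree_leaves(img, x=0, y=0, size=None, var_thresh=1e-3, min_size=8):
    """Return leaf blocks (x, y, size) of a variance-driven quadtree over a square 2D image.
    Blocks with variance above `var_thresh` are split until `min_size` is reached."""
    if size is None:
        size = img.shape[0]
    block = img[y:y + size, x:x + size]
    if size <= min_size or block.var() <= var_thresh:
        return [(x, y, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += quadtree_leaves(img, x + dx, y + dy, half, var_thresh, min_size)
    return leaves

img = np.random.rand(128, 128).astype(np.float32)
img[:64, :64] = 0.5            # a homogeneous quadrant -> fewer, larger leaves there
print(len(quadtree_leaves(img)))
```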
Text-to-image synthesis has witnessed remarkable advancements in recent years. Many attempts have been made to adopt text-to-image models to support multiple tasks. However, existing approaches typically require resource-intensive re-training or additional parameters to accommodate new tasks, which makes the model inefficient for on-device deployment. We propose Multi-Task Upcycling (MTU), a simple yet effective recipe that extends the capabilities of a pre-trained text-to-image diffusion model to support a variety of image-to-image generation tasks. MTU replaces Feed-Forward Network (FFN) layers in the diffusion model with smaller FFNs, referred to as experts, and combines them with a dynamic routing mechanism. To the best of our knowledge, MTU is the first multi-task diffusion modeling approach that seamlessly blends multi-tasking with on-device compatibility by mitigating the issue of parameter inflation. We show that the performance of MTU is on par with the single-task fine-tuned diffusion models across several tasks including image editing, super-resolution, and inpainting, while maintaining similar latency and computational load (GFLOPs) as the single-task fine-tuned models.
https://arxiv.org/abs/2503.11905
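The core recipe, replacing an FFN with several smaller expert FFNs selected by a router, resembles a mixture-of-experts layer. Below is a minimal soft-routing sketch; the expert count, hidden width, and dense (rather than sparse) routing are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class UpcycledFFN(nn.Module):
    """Several small expert FFNs plus a router; sizes and soft routing are illustrative."""
    def __init__(self, dim, hidden, num_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                      # x: (B, N, dim) tokens
        gates = torch.softmax(self.router(x), dim=-1)          # (B, N, E) routing weights
        out = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, N, dim, E)
        return (out * gates.unsqueeze(-2)).sum(-1)

x = torch.rand(2, 16, 64)
print(UpcycledFFN(dim=64, hidden=32)(x).shape)
```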
In clinical imaging, magnetic resonance (MR) image volumes are often acquired as stacks of 2D slices, permitting decreased scan times, improved signal-to-noise ratio, and image contrasts unique to 2D MR pulse sequences. While this is sufficient for clinical evaluation, automated algorithms designed for 3D analysis perform sub-optimally on 2D-acquired scans, especially those with thick slices and gaps between slices. Super-resolution (SR) methods aim to address this problem, but previous methods do not address all of the following: slice profile shape estimation, slice gap, domain shift, and non-integer / arbitrary upsampling factors. In this paper, we propose ECLARE (Efficient Cross-planar Learning for Anisotropic Resolution Enhancement), a self-SR method that addresses each of these factors. ECLARE estimates the slice profile from the 2D-acquired multi-slice MR volume, trains a network to learn the mapping from low-resolution to high-resolution in-plane patches from the same volume, and performs SR with anti-aliasing. We compared ECLARE to cubic B-spline interpolation, SMORE, and other contemporary SR methods. We used realistic and representative simulations so that quantitative performance against a ground truth could be computed, and ECLARE outperformed all other methods in both signal recovery and downstream tasks. On real data for which there is no ground truth, ECLARE demonstrated qualitative superiority over other methods as well. Importantly, as ECLARE does not use external training data, it cannot suffer from domain shift between training and testing. Our code is open-source and available at this https URL.
https://arxiv.org/abs/2503.11787
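Self-SR methods of this kind typically build training pairs from the volume itself: in-plane HR patches are blurred along one axis with the estimated slice profile and subsampled with a gap. The sketch below substitutes a Gaussian profile and a fixed spacing for the estimated profile and the true thickness/gap, so it is only a stand-in for ECLARE's degradation model.

```python
import numpy as np

def simulate_through_plane(hr_2d, fwhm=3.0, spacing=4):
    """Blur rows of an in-plane HR patch with a Gaussian slice profile and subsample
    every `spacing` rows, mimicking thick slices with gaps (profile shape assumed)."""
    sigma = fwhm / (2 * np.sqrt(2 * np.log(2)))
    radius = int(np.ceil(3 * sigma))
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    kernel /= kernel.sum()
    blurred = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 0, hr_2d)
    return blurred[::spacing, :]

hr_patch = np.random.rand(64, 64).astype(np.float32)
lr_patch = simulate_through_plane(hr_patch)     # paired with hr_patch for training
print(hr_patch.shape, "->", lr_patch.shape)
```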
Camera control has been actively studied in text or image conditioned video generation tasks. However, altering camera trajectories of a given video remains under-explored, despite its importance in the field of video creation. It is non-trivial due to the extra constraints of maintaining multi-frame appearance consistency and dynamic synchronization. To address this, we present ReCamMaster, a camera-controlled generative video re-rendering framework that reproduces the dynamic scene of an input video at novel camera trajectories. The core innovation lies in harnessing the generative capabilities of pre-trained text-to-video models through a simple yet powerful video conditioning mechanism -- a capability often overlooked in current research. To overcome the scarcity of qualified training data, we construct a comprehensive multi-camera synchronized video dataset using Unreal Engine 5, which is carefully curated to follow real-world filming characteristics, covering diverse scenes and camera movements. It helps the model generalize to in-the-wild videos. Lastly, we further improve the robustness to diverse inputs through a meticulously designed training strategy. Extensive experiments show that our method substantially outperforms existing state-of-the-art approaches and strong baselines. Our method also finds promising applications in video stabilization, super-resolution, and outpainting. Project page: this https URL
https://arxiv.org/abs/2503.11647
Full-reference image quality assessment (FR-IQA) generally assumes that reference images are of perfect quality. However, this assumption is flawed due to the sensor and optical limitations of modern imaging systems. Moreover, recent generative enhancement methods are capable of producing images of higher quality than their originals. All of these challenge the effectiveness and applicability of current FR-IQA models. To relax the assumption of perfect reference image quality, we build a large-scale IQA database, namely DiffIQA, containing approximately 180,000 images generated by a diffusion-based image enhancer with adjustable hyper-parameters. Each image is annotated by human subjects as either worse, similar, or better quality compared to its reference. Building on this, we present a generalized FR-IQA model, namely Adaptive Fidelity-Naturalness Evaluator (A-FINE), to accurately assess and adaptively combine the fidelity and naturalness of a test image. A-FINE aligns well with standard FR-IQA when the reference image is much more natural than the test image. We demonstrate by extensive experiments that A-FINE surpasses standard FR-IQA models on well-established IQA datasets and our newly created DiffIQA. To further validate A-FINE, we additionally construct a super-resolution IQA benchmark (SRIQA-Bench), encompassing test images derived from ten state-of-the-art SR methods with reliable human quality annotations. Tests on SRIQA-Bench re-affirm the advantages of A-FINE. The code and dataset are available at this https URL.
https://arxiv.org/abs/2503.11221
By leveraging the generative priors from pre-trained text-to-image diffusion models, significant progress has been made in real-world image super-resolution (Real-ISR). However, these methods tend to generate inaccurate and unnatural reconstructions in complex and/or heavily degraded scenes, primarily due to their limited perception and understanding capability of the input low-quality image. To address these limitations, we propose, for the first time to our knowledge, to adapt a pre-trained autoregressive multimodal model such as Lumina-mGPT into a robust Real-ISR model, namely PURE, which Perceives and Understands the input low-quality image, then REstores its high-quality counterpart. Specifically, we implement instruction tuning on Lumina-mGPT to perceive the image degradation level and the relationships between previously generated image tokens and the next token, understand the image content by generating image semantic descriptions, and consequently restore the image by generating high-quality image tokens autoregressively with the collected information. In addition, we reveal that the image token entropy reflects the image structure and present an entropy-based Top-k sampling strategy to optimize the local structure of the image during inference. Experimental results demonstrate that PURE preserves image content while generating realistic details, especially in complex scenes with multiple objects, showcasing the potential of autoregressive multimodal generative models for robust Real-ISR. The model and code will be available at this https URL.
https://arxiv.org/abs/2503.11073
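The entropy-based Top-k idea can be sketched as scaling k with the normalized entropy of the next-token distribution: peaked distributions keep a small k, flat ones allow a larger k. The linear schedule and the bounds below are assumptions; the abstract does not give PURE's exact rule.

```python
import torch

def entropy_topk_sample(logits, k_min=32, k_max=2048):
    """Sample a token with k scaled by the normalized entropy of the distribution
    (the exact k schedule is an assumption; PURE's rule may differ)."""
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(-1)
    max_entropy = torch.log(torch.tensor(float(logits.shape[-1])))
    frac = (entropy / max_entropy).item()              # 0 = peaked, 1 = uniform
    k = int(k_min + frac * (k_max - k_min))
    top_p, top_idx = probs.topk(k, dim=-1)
    choice = torch.multinomial(top_p / top_p.sum(), 1)
    return top_idx.gather(-1, choice)

logits = torch.randn(8192)          # toy next-image-token logits
print(entropy_topk_sample(logits).item())
```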
This study aims to develop surrogate models for accelerating decision making processes associated with carbon capture and storage (CCS) technologies. Selection of sub-surface $CO_2$ storage sites often necessitates expensive and involved simulations of $CO_2$ flow fields. Here, we develop a Fourier Neural Operator (FNO) based model for real-time, high-resolution simulation of $CO_2$ plume migration. The model is trained on a comprehensive dataset generated from realistic subsurface parameters and offers $O(10^5)$ computational acceleration with minimal sacrifice in prediction accuracy. We also explore super-resolution experiments to reduce the computational cost of training the FNO-based models. Additionally, we present various strategies for improving the reliability of predictions from the model, which is crucial while assessing actual geological sites. This novel framework, based on NVIDIA's Modulus library, will allow rapid screening of sites for CCS. The discussed workflows and strategies can be applied to other energy solutions like geothermal reservoir modeling and hydrogen storage. Our work scales scientific machine learning models to realistic 3D systems that are more consistent with real-life subsurface aquifers/reservoirs, paving the way for next-generation digital twins for subsurface CCS applications.
https://arxiv.org/abs/2503.11031
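The core of an FNO is a spectral convolution: transform to the Fourier domain, multiply a learned complex weight over a truncated set of low modes, and transform back. The sketch below keeps only one corner of low modes and uses arbitrary channel/mode counts, so it is a simplified illustration rather than the paper's surrogate architecture.

```python
import torch
import torch.nn as nn

class SpectralConv2d(nn.Module):
    """Core FNO operation: pointwise multiplication of learned weights with the
    lowest Fourier modes of the input (mode counts here are illustrative)."""
    def __init__(self, in_ch, out_ch, modes1=12, modes2=12):
        super().__init__()
        self.modes1, self.modes2 = modes1, modes2
        scale = 1.0 / (in_ch * out_ch)
        self.w = nn.Parameter(scale * torch.randn(in_ch, out_ch, modes1, modes2, dtype=torch.cfloat))

    def forward(self, x):                       # x: (B, in_ch, H, W)
        x_ft = torch.fft.rfft2(x)
        out_ft = torch.zeros(x.size(0), self.w.size(1), x.size(-2), x.size(-1) // 2 + 1,
                             dtype=torch.cfloat, device=x.device)
        out_ft[:, :, :self.modes1, :self.modes2] = torch.einsum(
            "bixy,ioxy->boxy", x_ft[:, :, :self.modes1, :self.modes2], self.w)
        return torch.fft.irfft2(out_ft, s=x.shape[-2:])

x = torch.rand(2, 3, 64, 64)                    # e.g. toy subsurface property channels
print(SpectralConv2d(3, 16)(x).shape)           # (2, 16, 64, 64)
```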
Quality control (QC) has long been considered essential to guarantee the reliability of neuroimaging studies. It is particularly important for fetal brain MRI, where acquisitions and image processing techniques are less standardized than in adult imaging. In this work, we focus on automated quality control of super-resolution reconstruction (SRR) volumes of fetal brain MRI, an important processing step where multiple stacks of thick 2D slices are registered together and combined to build a single, isotropic and artifact-free T2-weighted volume. We propose FetMRQC$_{SR}$, a machine-learning method that extracts more than 100 image quality metrics to predict image quality scores using a random forest model. This approach is well suited to a problem that is high dimensional, with highly heterogeneous data and small datasets. We validate FetMRQC$_{SR}$ in an out-of-domain (OOD) setting and report high performance (ROC AUC = 0.89), even when faced with data from an unknown site or SRR method. We also investigate failure cases and show that they occur in $45\%$ of the images due to ambiguous configurations for which the rating from the expert is arguable. These results are encouraging and illustrate how a non-deep-learning method like FetMRQC$_{SR}$ is well suited to this multifaceted problem. Our tool, along with all the code used to generate, train and evaluate the model, will be released upon acceptance of the paper.
https://arxiv.org/abs/2503.10156
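The prediction stage, a random forest over roughly 100 image quality metrics, can be sketched with scikit-learn. The feature table and quality scores below are synthetic stand-ins; the real FetMRQC$_{SR}$ features and expert ratings come from the authors' pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# toy stand-in for the real feature table: rows = SRR volumes, cols = ~100 IQA metrics
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))                              # image quality metrics
y = X[:, :5].mean(axis=1) + 0.1 * rng.normal(size=200)       # synthetic quality scores

model = RandomForestRegressor(n_estimators=300, random_state=0)
print(cross_val_score(model, X, y, cv=5, scoring="r2").mean())
```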
Lightweight image super-resolution (SR) aims to reconstruct high-resolution images from low-resolution images with limited computational costs. We find existing frequency-based SR methods cannot balance the reconstruction of overall structures and high-frequency parts. Meanwhile, these methods are inefficient for handling frequency features and unsuitable for lightweight SR. In this paper, we show that introducing both wavelet and Fourier information allows our model to consider both high-frequency features and overall SR structure reconstruction while reducing costs. Specifically, we propose a dual-domain modulation network that utilizes a wavelet-domain modulation self-Transformer (WMT) plus Fourier supervision to modulate frequency features in addition to spatial domain modulation. Compared to existing frequency-based SR modules, our WMT is more suitable for frequency learning in lightweight SR. Experimental results show that our method achieves PSNR comparable to SRFormer and MambaIR with less than 50% and 60% of their FLOPs and with 15.4x and 5.4x faster inference, respectively, demonstrating the effectiveness of our method in both SR quality and efficiency. Codes will be released upon acceptance.
https://arxiv.org/abs/2503.10047
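One plausible form of the Fourier supervision mentioned above is an L1 penalty on Fourier amplitudes added to the usual spatial loss. The formulation and the weighting factor below are assumptions, not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def fourier_l1(sr, hr):
    """L1 distance between Fourier amplitudes of prediction and target (one plausible
    form of Fourier supervision; the paper's exact formulation may differ)."""
    return (torch.fft.rfft2(sr).abs() - torch.fft.rfft2(hr).abs()).abs().mean()

def total_loss(sr, hr, lam=0.05):
    return F.l1_loss(sr, hr) + lam * fourier_l1(sr, hr)

sr, hr = torch.rand(2, 3, 96, 96), torch.rand(2, 3, 96, 96)
print(total_loss(sr, hr).item())
```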
Image super-resolution (SR) aims to recover high-resolution images from low-resolution inputs, and improving SR efficiency is a high-profile challenge. However, commonly used units in SR, like convolutions and window-based Transformers, have limited receptive fields, making it challenging to apply them to improve SR under extremely limited computational cost. To address this issue, inspired by modeling the convolution theorem through token mixing, we propose a Fourier token-based plugin called FourierSR to improve SR uniformly, which avoids the instability or inefficiency of existing token-mixing techniques when applied as plug-ins. Furthermore, compared to convolutions and window-based Transformers, our FourierSR only utilizes Fourier transform and multiplication operations, greatly reducing complexity while having global receptive fields. Experimental results show that our FourierSR as a plug-and-play unit brings an average PSNR gain of 0.34dB for existing efficient SR methods on the Manga109 test set at the scale of x4, while the average increase in the number of Params and FLOPs is only 0.6% and 1.5% of the original sizes. We will release our codes upon acceptance.
https://arxiv.org/abs/2503.10043
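A token-mixing unit built only from Fourier transforms and elementwise multiplication can look like the sketch below, where a learned per-frequency filter gives a global receptive field at low cost. The fixed-resolution filter and the residual placement are assumptions about how such a plug-in could be wired, not FourierSR's published design.

```python
import torch
import torch.nn as nn

class FourierMix(nn.Module):
    """Global token mixing with only FFT and elementwise multiplication; the learned
    per-frequency filter shape is an assumption about how such a plug-in could look."""
    def __init__(self, channels, height, width):
        super().__init__()
        self.filter = nn.Parameter(
            torch.randn(channels, height, width // 2 + 1, dtype=torch.cfloat) * 0.02)

    def forward(self, x):                       # x: (B, C, H, W)
        freq = torch.fft.rfft2(x, norm="ortho")
        out = torch.fft.irfft2(freq * self.filter, s=x.shape[-2:], norm="ortho")
        return x + out                           # residual plug-in behaviour

x = torch.rand(1, 48, 64, 64)
print(FourierMix(48, 64, 64)(x).shape)
```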
Deep learning has significantly advanced medical imaging analysis, yet variations in image resolution remain an overlooked challenge. Most methods address this by resampling images, leading to either information loss or computational inefficiencies. While solutions exist for specific tasks, no unified approach has been proposed. We introduce a resolution-invariant autoencoder that adapts spatial resizing at each layer in the network via a learned variable resizing process, replacing fixed spatial down/upsampling at the traditional factor of 2. This ensures a consistent latent space resolution, regardless of input or output resolution. Our model enables various downstream tasks to be performed on an image latent whilst maintaining performance across different resolutions, overcoming the shortfalls of traditional methods. We demonstrate its effectiveness in uncertainty-aware super-resolution, classification, and generative modelling tasks and show how our method outperforms conventional baselines with minimal performance loss across resolutions.
https://arxiv.org/abs/2503.09828
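The key property, a latent grid whose resolution stays constant regardless of input resolution, can be illustrated by a stage that resizes to a target grid instead of downsampling by a fixed factor of 2. The sketch replaces the paper's learned variable resizing with plain bilinear interpolation, so it demonstrates the invariance only, not the learned component.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariableResizeStage(nn.Module):
    """Conv stage that resizes by whatever factor is needed to reach its target grid,
    rather than a fixed factor of 2 (a simplified stand-in for learned resizing)."""
    def __init__(self, in_ch, out_ch, target_size):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.target_size = target_size

    def forward(self, x):
        x = F.interpolate(x, size=self.target_size, mode="bilinear", align_corners=False)
        return torch.relu(self.conv(x))

stage = VariableResizeStage(1, 16, target_size=(32, 32))
for hw in [(96, 96), (141, 113), (256, 192)]:          # different input resolutions
    print(stage(torch.rand(1, 1, *hw)).shape)           # latent grid is always 32x32
```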