Remote sensing image super-resolution (SR) is a crucial task to restore high-resolution (HR) images from low-resolution (LR) observations. Recently, the Denoising Diffusion Probabilistic Model (DDPM) has shown promising performance in image reconstruction by overcoming problems inherent in generative models, such as over-smoothing and mode collapse. However, the high-frequency details generated by DDPM often suffer from misalignment with HR images due to the model's tendency to overlook long-range semantic contexts. This is attributed to the widely used U-Net decoder in the conditional noise predictor, which tends to overemphasize local information, leading to noise predictions with large variance. To address these issues, an adaptive semantic-enhanced DDPM (ASDDPM) is proposed to enhance the detail-preserving capability of the DDPM by incorporating low-frequency semantic information provided by the Transformer. Specifically, a novel adaptive diffusion Transformer decoder (ADTD) is developed to bridge the semantic gap between the encoder and decoder by regulating the noise prediction with global contextual relationships and long-range dependencies in the diffusion process. Additionally, a residual feature fusion strategy establishes information exchange between the two decoders at multiple levels. As a result, the noise predicted by our approach closely approximates the real noise distribution. Extensive experiments on two SR and two semantic segmentation datasets confirm the superior performance of the proposed ASDDPM in both SR and the subsequent downstream applications. The source code will be available at this https URL.
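For reference, conditional DDPM-based SR models of this kind are trained with the standard simplified noise-prediction objective, with the LR image $x_{LR}$ as the condition (this is the generic DDPM formulation, not ASDDPM's specific loss):

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\, t,\, \epsilon \sim \mathcal{N}(0,\mathbf{I})}\Big[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t,\; x_{LR}\big)\big\|_2^2\Big]$$

ASDDPM's contribution lies in the architecture of $\epsilon_\theta$: the ADTD regulates the prediction with global context so that the learned noise better matches the true noise distribution.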
https://arxiv.org/abs/2403.11078
Flow-based super-resolution (SR) models have demonstrated astonishing capabilities in generating high-quality images. However, these methods encounter several challenges during image generation, such as grid artifacts, exploding inverses, and suboptimal results due to a fixed sampling temperature. To overcome these issues, this work introduces a learned conditional prior into the inference phase of a flow-based SR model. This prior is a latent code predicted by our proposed latent module conditioned on the low-resolution image, which is then transformed by the flow model into an SR image. Our framework is designed to seamlessly integrate with any contemporary flow-based SR model without modifying its architecture or pre-trained weights. We evaluate the effectiveness of our proposed framework through extensive experiments and ablation analyses. The proposed framework successfully addresses all the inherent issues in flow-based SR models and enhances their performance in various SR scenarios. Our code is available at: this https URL
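A minimal sketch of the inference-time idea under stated assumptions: `LatentModule` and `ToyFlow` are hypothetical stand-ins (a real deployment would plug the latent module into a frozen pretrained flow such as SRFlow), but the control flow — predict the latent from the LR input instead of sampling it at a fixed temperature — follows the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentModule(nn.Module):
    """Predicts a latent code z from the LR image, replacing the usual
    fixed-temperature Gaussian sample z ~ N(0, tau^2 I)."""
    def __init__(self, in_ch: int = 3, scale: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, in_ch * scale * scale, 3, padding=1),
        )

    def forward(self, lr: torch.Tensor) -> torch.Tensor:
        return self.net(lr)

class ToyFlow(nn.Module):
    """Stand-in for a frozen pretrained conditional flow; the inverse
    pass maps a latent code (given LR conditioning) to an SR image."""
    def inverse(self, z: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        return F.pixel_shuffle(z, 4)  # placeholder for the real inverse pass

lr = torch.randn(1, 3, 32, 32)
flow, latent_module = ToyFlow(), LatentModule()
z = latent_module(lr)            # learned conditional prior
sr = flow.inverse(z, cond=lr)    # (1, 3, 128, 128); no temperature tuning
```

Because only the latent module is trained, the flow's architecture and pre-trained weights stay untouched, which is what makes the framework drop-in for existing flow-based SR models.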
https://arxiv.org/abs/2403.10988
Scale-arbitrary super-resolution based on implicit image functions has gained increasing popularity, since it can better represent the visual world in a continuous manner. However, existing scale-arbitrary methods are trained and evaluated on simulated datasets, where low-resolution images are generated from their ground truths by the simplest bicubic downsampling. These models exhibit limited generalization to real-world scenarios due to the greater complexity of real-world degradations. To address this issue, we build the RealArbiSR dataset, a new real-world super-resolution benchmark with both integer and non-integer scaling factors for the training and evaluation of real-world scale-arbitrary super-resolution. Moreover, we propose a Dual-level Deformable Implicit Representation (DDIR) to solve real-world scale-arbitrary super-resolution. Specifically, we design an appearance embedding and a deformation field to handle both image-level and pixel-level deformations caused by real-world degradations. The appearance embedding models the characteristics of low-resolution inputs to deal with photometric variations at different scales, and the pixel-based deformation field learns RGB differences that result from the deviations between real-world and simulated degradations at arbitrary coordinates. Extensive experiments show our trained model achieves state-of-the-art performance on the RealArbiSR and RealSR benchmarks for real-world scale-arbitrary super-resolution. Our dataset as well as source code will be publicly available.
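Methods in this family decode RGB at continuous coordinates from a latent feature map; the simplified LIIF-style query below sketches that mechanism (DDIR's appearance embedding and pixel-level deformation field would further condition such a query; names here are illustrative, and conditioning on the absolute coordinate simplifies LIIF's relative offsets).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def query_rgb(feat: torch.Tensor, mlp: nn.Module, coord: torch.Tensor,
              cell: torch.Tensor) -> torch.Tensor:
    """Decode RGB at continuous coords from a feature map.
    feat: (B,C,H,W); coord, cell: (B,Q,2) in normalized [-1,1] space."""
    # Fetch the nearest latent code at each query coordinate
    # (grid_sample expects (x, y) order, hence the flip).
    z = F.grid_sample(feat, coord.flip(-1).unsqueeze(1),
                      mode='nearest', align_corners=False)   # (B,C,1,Q)
    z = z.squeeze(2).transpose(1, 2)                         # (B,Q,C)
    # Condition the MLP on the query coordinate and its cell size,
    # so one network serves every target scale.
    return mlp(torch.cat([z, coord, cell], dim=-1))          # (B,Q,3)

mlp = nn.Sequential(nn.Linear(64 + 4, 256), nn.ReLU(), nn.Linear(256, 3))
feat = torch.randn(1, 64, 48, 48)              # encoder output for one image
coord = torch.rand(1, 1000, 2) * 2 - 1         # arbitrary query locations
cell = torch.full((1, 1000, 2), 2 / 96)        # pixel size of a 96x96 target
rgb = query_rgb(feat, mlp, coord, cell)
```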
https://arxiv.org/abs/2403.10925
Deep features are a cornerstone of computer vision research, capturing image semantics and enabling the community to solve downstream tasks even in the zero- or few-shot regime. However, these features often lack the spatial resolution to directly perform dense prediction tasks like segmentation and depth prediction because models aggressively pool information over large areas. In this work, we introduce FeatUp, a task- and model-agnostic framework to restore lost spatial information in deep features. We introduce two variants of FeatUp: one that guides features with high-resolution signal in a single forward pass, and one that fits an implicit model to a single image to reconstruct features at any resolution. Both approaches use a multi-view consistency loss with deep analogies to NeRFs. Our features retain their original semantics and can be swapped into existing applications to yield resolution and performance gains even without re-training. We show that FeatUp significantly outperforms other feature upsampling and image super-resolution approaches in class activation map generation, transfer learning for segmentation and depth prediction, and end-to-end training for semantic segmentation.
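A rough sketch of the multi-view consistency idea, under simplifying assumptions (pixel-shift views only, average pooling as the downsampler, hypothetical `upsampler`/`backbone` callables); FeatUp's actual loss uses a richer family of jitters and a learned downsampler.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def multiview_consistency_loss(upsampler, backbone, img, n_views: int = 4):
    """High-res features, when jittered and pooled back down, should
    reproduce the backbone's own low-res features for each view."""
    hr_feats = upsampler(backbone(img), img)         # guided upsampling
    loss = 0.0
    for _ in range(n_views):
        shift = torch.randint(-2, 3, (2,)).tolist()  # small random jitter
        view = torch.roll(img, shifts=shift, dims=(-2, -1))
        view_hr = torch.roll(hr_feats, shifts=shift, dims=(-2, -1))
        lr_view = backbone(view)                     # (B,C,h,w)
        pooled = F.adaptive_avg_pool2d(view_hr, lr_view.shape[-2:])
        loss = loss + F.mse_loss(pooled, lr_view)
    return loss / n_views

# Smoke test with stand-in modules:
backbone = nn.Sequential(nn.Conv2d(3, 8, 8, stride=8))  # 8x spatial pooling
upsampler = lambda f, im: F.interpolate(f, size=im.shape[-2:],
                                        mode='bilinear', align_corners=False)
loss = multiview_consistency_loss(upsampler, backbone, torch.randn(2, 3, 64, 64))
```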
https://arxiv.org/abs/2403.10516
Generative Adversarial Networks (GANs) have shown great performance on super-resolution problems since they can generate more visually realistic images and video frames. However, these models often introduce side effects into the outputs, such as unexpected artifacts and noise. To reduce these artifacts and enhance the perceptual quality of the results, in this paper, we propose a general method that can be effectively used in most GAN-based super-resolution (SR) models by introducing essential spatial information into the training process. We extract spatial information from the input data and incorporate it into the training loss, making the corresponding loss spatially adaptive (SA). After that, we utilize it to guide the training process. We show that the proposed approach is independent of both the method used to extract the spatial information and the specific SR task and model. This method consistently guides the training process towards generating visually pleasing SR images and video frames, substantially mitigating artifacts and noise, ultimately leading to enhanced perceptual quality.
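A minimal sketch of what a spatially adaptive loss can look like, assuming an edge-strength map as the spatial information (the paper is deliberately agnostic about the extractor, so the Sobel filter here is just one admissible choice):

```python
import torch
import torch.nn.functional as F

def spatially_adaptive_l1(sr, hr, lr):
    """Weight the per-pixel reconstruction error by a spatial map
    extracted from the input, emphasizing structure-rich regions."""
    gray = lr.mean(dim=1, keepdim=True)
    kx = torch.tensor([[[[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]]])
    ky = kx.transpose(-1, -2)
    grad = (F.conv2d(gray, kx, padding=1).abs()
            + F.conv2d(gray, ky, padding=1).abs())   # Sobel edge strength
    w = 1.0 + F.interpolate(grad, size=sr.shape[-2:], mode='bilinear',
                            align_corners=False)     # lift map to SR size
    return (w * (sr - hr).abs()).mean()
```

In a GAN-based SR setup, a term like this would simply replace (or augment) the plain pixel loss in the generator's objective; the adversarial part of training is unchanged.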
https://arxiv.org/abs/2403.10589
Solving image inverse problems (e.g., super-resolution and inpainting) requires generating a high-fidelity image that matches the given input (the low-resolution image or the masked image). By using the input image as guidance, we can leverage a pretrained diffusion generative model to solve a wide range of image inverse tasks without task-specific model fine-tuning. To precisely estimate the guidance score function of the input image, we propose Diffusion Policy Gradient (DPG), a tractable computation method that views the intermediate noisy images as policies and the target image as the states selected by the policy. Experiments show that our method is robust to both Gaussian and Poisson noise degradation on multiple linear and non-linear inverse tasks, resulting in higher image restoration quality on the FFHQ, ImageNet, and LSUN datasets.
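The quantity DPG makes tractable is the guidance term in the standard posterior-score decomposition used by training-free guided diffusion:

$$\nabla_{x_t} \log p(x_t \mid y) \;=\; \nabla_{x_t} \log p(x_t) \;+\; \nabla_{x_t} \log p(y \mid x_t),$$

where $y$ is the given measurement (the LR or masked image). The pretrained diffusion model supplies the first term; DPG's policy view of the intermediate noisy images $x_t$ yields an estimate of the second.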
https://arxiv.org/abs/2403.10585
Super-resolution (SR) and image generation are important tasks in computer vision and are widely adopted in real-world applications. Most existing methods, however, generate images only at fixed-scale magnification and suffer from over-smoothing and artifacts. Additionally, they do not offer enough diversity of output images nor image consistency at different scales. The most relevant prior work applied Implicit Neural Representation (INR) to the denoising diffusion model to obtain continuous-resolution yet diverse and high-quality SR results. Since that model operates in image space, however, producing images at larger resolutions requires more memory and inference time, and it also does not maintain scale-specific consistency. We propose a novel pipeline that can super-resolve an input image, or generate a novel image from random noise, at arbitrary scales. The method consists of a pretrained auto-encoder, a latent diffusion model, an implicit neural decoder, and their learning strategies. The proposed method adopts diffusion processes in a latent space, and is thus efficient while remaining aligned with the output image space decoded by MLPs at arbitrary scales. More specifically, our arbitrary-scale decoder is composed of the pretrained auto-encoder's symmetric decoder without up-scaling, followed in series by a Local Implicit Image Function (LIIF). The latent diffusion process is learned jointly through the denoising and alignment losses. Errors in output images are backpropagated via the fixed decoder, improving the quality of output images. In extensive experiments using multiple public benchmarks on the two tasks, i.e., image super-resolution and novel image generation at arbitrary scales, the proposed method outperforms relevant methods in metrics of image quality, diversity, and scale consistency. It is also significantly better than the relevant prior art in inference speed and memory usage.
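In outline, inference reads as below; this is a sketch with hypothetical module names, not the authors' code, but it tracks the pipeline just described.

```python
def super_resolve(ae, diffusion, liif, lr, coords, cell):
    """Latent diffusion + implicit neural decoder. `ae`, `diffusion`,
    and `liif` stand in for the pretrained auto-encoder, the latent
    diffusion model, and the LIIF decoder named above."""
    z_lr = ae.encode(lr)                  # LR image -> latent space
    z_hr = diffusion.sample(cond=z_lr)    # efficient denoising in latents
    feat = ae.decode_features(z_hr)       # symmetric decoder, no up-scaling
    return liif(feat, coords, cell)       # MLP renders any target scale
```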
https://arxiv.org/abs/2403.10255
Diffusion models (DMs) have shown remarkable promise in image super-resolution (SR). However, most of them are tailored to solving non-blind inverse problems with fixed, known degradation settings, limiting their adaptability to real-world applications that involve complex unknown degradations. In this work, we propose BlindDiff, a DM-based blind SR method to tackle the blind degradation settings in SISR. BlindDiff seamlessly integrates MAP-based optimization into DMs: it constructs a joint distribution of the low-resolution (LR) observation, high-resolution (HR) data, and degradation kernels for the data and kernel priors, and solves the blind SR problem by unfolding the MAP approach along the reverse process. Unlike most DMs, BlindDiff first presents a modulated conditional transformer (MCFormer) that is pre-trained with noise and kernel constraints, further serving as a posterior sampler to provide both priors simultaneously. Then, we plug a simple yet effective kernel-aware gradient term between adjacent sampling iterations that guides the diffusion model to learn degradation consistency knowledge. This also enables joint refinement of the degradation model and the HR image by observing the previously denoised sample. With the MAP-based reverse diffusion process, we show that BlindDiff performs alternating optimization of blur kernel estimation and HR image restoration in a mutually reinforcing manner. Experiments on both synthetic and real-world datasets show that BlindDiff achieves state-of-the-art performance with a significant reduction in model complexity compared to recent DM-based methods. Code will be available at this https URL.
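The generic MAP objective being unfolded is (with $\otimes$ denoting convolution, $\downarrow_s$ $s$-fold downsampling, and $\phi$, $\psi$ the image and kernel priors — the terms BlindDiff supplies with diffusion-learned ones):

$$\min_{x,\,k}\;\tfrac{1}{2}\big\|\,y - (x \otimes k)\!\downarrow_s\big\|_2^2 \;+\; \lambda_x\,\phi(x) \;+\; \lambda_k\,\psi(k)$$

Unfolding the alternation over $x$ and $k$ along the reverse diffusion process is what yields the mutually reinforcing kernel-estimation and restoration steps.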
https://arxiv.org/abs/2403.10211
With the development of neural radiance fields and generative models, numerous methods have been proposed for learning 3D human generation from 2D images. These methods allow control over the pose of the generated 3D human and enable rendering from different viewpoints. However, none of these methods explores semantic disentanglement in human image synthesis, i.e., they cannot disentangle the generation of different semantic parts, such as the body, tops, and bottoms. Furthermore, existing methods are limited to synthesizing images at $512^2$ resolution due to the high computational cost of neural radiance fields. To address these limitations, we introduce SemanticHuman-HD, the first method to achieve semantically disentangled human image synthesis. Notably, SemanticHuman-HD is also the first method to achieve 3D-aware image synthesis at $1024^2$ resolution, benefiting from our proposed 3D-aware super-resolution module. By leveraging depth maps and semantic masks as guidance for the 3D-aware super-resolution, we significantly reduce the number of sampling points during volume rendering, thereby reducing the computational cost. Our comparative experiments demonstrate the superiority of our method. The effectiveness of each proposed component is also verified through ablation studies. Moreover, our method opens up exciting possibilities for various applications, including 3D garment generation, semantic-aware image synthesis, controllable image synthesis, and out-of-domain image synthesis.
https://arxiv.org/abs/2403.10166
In recent years, the fusion of a high-spatial-resolution multispectral image (HR-MSI) and a low-spatial-resolution hyperspectral image (LR-HSI) has been recognized as an effective method for HSI super-resolution (HSI-SR). However, both the HSI and the MSI may be acquired under extreme conditions, such as night-time or poorly illuminated scenes, which may cause different exposure levels and thereby seriously degrade the resulting HSI-SR. In contrast to most existing methods, which apply separate low-light image enhancement (LLIE) to the MSI and HSI followed by their fusion, a deep unfolding HSI super-resolution method with automatic exposure correction (UHSR-AEC) is proposed that can effectively generate a high-quality fused HSI-SR result (in texture and features) even under very imbalanced exposures, thanks to taking the correlation between LLIE and HSI-SR into account. Extensive experiments are provided to demonstrate the state-of-the-art overall performance of the proposed UHSR-AEC, including comparisons with benchmark peer methods.
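For context, HSI-MSI fusion is conventionally posed with the linear observation model below (standard in the fusion literature; the imbalanced exposures addressed here add further per-image illumination factors that UHSR-AEC corrects):

$$Y_h = X\,B\,S, \qquad Y_m = R\,X,$$

where $X$ is the target HR-HSI (bands × pixels), $B$ and $S$ are spatial blurring and downsampling operators, and $R$ is the MSI sensor's spectral response.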
https://arxiv.org/abs/2403.09096
Recent developments in face restoration have achieved remarkable results in producing high-quality and lifelike outputs. These stunning results, however, often fail to be faithful to the identity of the person, as the models lack the necessary context. In this paper, we explore the potential of personalized face restoration with diffusion models. In our approach, a restoration model is personalized using a few images of the identity, leading to restoration tailored to that identity while retaining fine-grained details. By using independent trainable blocks for personalization, the rich prior of a base restoration model can be exploited to its fullest. To avoid the model relying on parts of the identity left in the conditioning low-quality images, a generative regularizer is employed. With a learnable parameter, the model learns to balance between the details generated from the input image and the degree of personalization. Moreover, we improve the training pipeline of face restoration models to enable an alignment-free approach. We showcase the robust capabilities of our approach in several real-world scenarios with multiple identities, demonstrating our method's ability to generate fine-grained details with faithful restoration. In a user study evaluating the perceptual quality and faithfulness of the generated details, our method was voted best 61% of the time, compared to 25% of the votes for the second-best method.
https://arxiv.org/abs/2403.08436
The prevalence of convolutional neural networks (CNNs) and vision transformers (ViTs) has markedly revolutionized the area of single-image super-resolution (SISR). To further boost SR performance, several techniques, such as residual learning and attention mechanisms, have been introduced; their gains can be largely attributed to a wider activated area, that is, more input pixels that strongly influence the SR results. However, the possibility of further improving SR performance through another versatile vision backbone remains an unresolved challenge. To address this issue, in this paper we unleash the representation potential of a modern state space model, Vision Mamba (Vim), in the context of SISR. Specifically, we present three recipes for better utilization of Vim-based models: 1) integration into a MetaFormer-style block; 2) pre-training on a larger and broader dataset; 3) employing a complementary attention mechanism, upon which we introduce our network, MMA. The resulting MMA is capable of finding the most relevant and representative input pixels to reconstruct the corresponding high-resolution images. Comprehensive experimental analysis reveals that MMA not only achieves competitive or even superior performance compared to state-of-the-art SISR methods but also maintains relatively low memory and computational overheads (e.g., a +0.5 dB PSNR gain on the Manga109 dataset with 19.8M parameters at scale ×2). Furthermore, MMA proves its versatility in lightweight SR applications. Through this work, we aim to illuminate the potential applications of state space models in the broader realm of image processing beyond SISR, encouraging further exploration in this innovative direction.
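Recipe 1 is concrete enough to sketch: wrap a token mixer in the MetaFormer layout, with the mixer slot taking a Mamba-style state space module. This illustrates the recipe, not MMA's exact block; the Vim layer itself is assumed to be available.

```python
import torch
import torch.nn as nn

class MetaFormerBlock(nn.Module):
    """MetaFormer layout: norm -> token mixer -> residual, then
    norm -> MLP -> residual. The mixer is pluggable (e.g. a Vim layer)."""
    def __init__(self, dim: int, token_mixer: nn.Module, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = token_mixer
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, dim)
        x = x + self.mixer(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x

block = MetaFormerBlock(96, token_mixer=nn.Identity())  # smoke-test mixer
out = block(torch.randn(2, 256, 96))
```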
https://arxiv.org/abs/2403.08330
Previous approaches for blind image super-resolution (SR) have relied on degradation estimation to restore high-resolution (HR) images from their low-resolution (LR) counterparts. However, accurate degradation estimation poses significant challenges. The SR model's incompatibility with degradation estimation methods, particularly the Correction Filter, may significantly impair performance as a result of correction errors. In this paper, we introduce a novel blind SR approach that focuses on Learning Correction Errors (LCE). Our method employs a lightweight Corrector to obtain a corrected low-resolution (CLR) image. Subsequently, within an SR network, we jointly optimize SR performance by utilizing both the original LR image and the frequency learning of the CLR image. Additionally, we propose a new Frequency-Self Attention block (FSAB) that enhances the global information utilization ability of Transformer. This block integrates both self-attention and frequency spatial attention mechanisms. Extensive ablation and comparison experiments conducted across various settings demonstrate the superiority of our method in terms of visual quality and accuracy. Our approach effectively addresses the challenges associated with degradation estimation and correction errors, paving the way for more accurate blind image SR.
https://arxiv.org/abs/2403.07390
While diffusion-based image restoration (IR) methods have achieved remarkable success, they are still limited by low inference speed, attributable to the necessity of executing hundreds or even thousands of sampling steps. Existing accelerated sampling techniques, though seeking to expedite the process, inevitably sacrifice performance to some extent, resulting in overly blurry restored outcomes. To address this issue, this study proposes a novel and efficient diffusion model for IR that significantly reduces the required number of diffusion steps. Our method avoids the need for post-acceleration during inference, thereby avoiding the associated performance deterioration. Specifically, our proposed method establishes a Markov chain that facilitates the transitions between the high-quality and low-quality images by shifting their residuals, substantially improving the transition efficiency. A carefully formulated noise schedule is devised to flexibly control the shifting speed and the noise strength during the diffusion process. Extensive experimental evaluations demonstrate that the proposed method achieves superior or comparable performance to current state-of-the-art methods on three classical IR tasks, namely image super-resolution, image inpainting, and blind face restoration, even with only four sampling steps. Our code and model are publicly available at this https URL.
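One concrete way to write such a residual-shifting chain (following the abstract's description; the exact schedule is the paper's) is a marginal that moves the HR image $x_0$ toward the LR-derived image $y$ as $t$ grows:

$$q(x_t \mid x_0, y) \;=\; \mathcal{N}\!\big(x_t;\; x_0 + \eta_t\,(y - x_0),\; \kappa^2\,\eta_t\,\mathbf{I}\big),$$

with a monotone schedule $\eta_1 \approx 0 \rightarrow \eta_T \approx 1$ controlling the shifting speed and $\kappa$ the noise strength, so the chain starts near the HR image and reaches the noisy LR one after only a handful of steps.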
https://arxiv.org/abs/2403.07319
Color information is the most commonly used prior knowledge for depth map super-resolution (DSR), as it can provide high-frequency boundary guidance for detail restoration. However, its role and functionality in DSR have not been fully developed. In this paper, we rethink the utilization of color information and propose a hierarchical color guidance network for DSR. On the one hand, a low-level detail embedding module is designed to supplement the high-frequency color information of depth features in a residual-mask manner at the low-level stages. On the other hand, a high-level abstract guidance module is proposed to maintain semantic consistency in the reconstruction process by using a semantic mask that encodes the global guidance information. In this way, color information at these two levels serves the front and back ends of the attention-based feature projection (AFP) module in a more comprehensive form. Simultaneously, the AFP module integrates a multi-scale content enhancement block and an adaptive attention projection block to make full use of multi-scale information and to adaptively project critical restoration information in an attention manner for DSR. Compared with state-of-the-art methods on four benchmark datasets, our method achieves more competitive performance both qualitatively and quantitatively.
https://arxiv.org/abs/2403.07290
Recently, methods based on implicit neural representations have shown excellent capabilities for arbitrary-scale super-resolution (ASSR). Although these methods represent the features of an image by generating latent codes, these latent codes are difficult to adapt to different super-resolution magnification factors, which seriously affects performance. To address this, we design the Multi-Scale Implicit Transformer (MSIT), consisting of a Multi-Scale Neural Operator (MSNO) and Multi-Scale Self-Attention (MSSA). The MSNO obtains multi-scale latent codes through feature enhancement, multi-scale characteristics extraction, and multi-scale characteristics merging. The MSSA further enhances the multi-scale characteristics of the latent codes, resulting in better performance. Furthermore, to improve the performance of the network, we propose a Re-Interaction Module (RIM) combined with a cumulative training strategy to increase the diversity of the information the network learns. We are the first to systematically introduce multi-scale characteristics into ASSR; extensive experiments validate the effectiveness of MSIT, and our method achieves state-of-the-art performance in arbitrary-scale super-resolution tasks.
https://arxiv.org/abs/2403.06536
Conditional diffusion models have gained recognition for their effectiveness in image restoration tasks, yet their iterative denoising process, starting from Gaussian noise, often leads to slow inference speeds. As a promising alternative, the Image-to-Image Schrödinger Bridge (I2SB) initializes the generative process from corrupted images and integrates training techniques from conditional diffusion models. In this study, we extend the I2SB method by introducing the Implicit Image-to-Image Schrödinger Bridge (I3SB), which turns the generative process into a non-Markovian one by incorporating the corrupted image in each generative step. This enhancement empowers I3SB to generate images with better texture restoration using a small number of generative steps. The proposed method was validated on CT super-resolution and denoising tasks and outperformed existing methods, including the conditional denoising diffusion probabilistic model (cDDPM) and I2SB, in both visual quality and quantitative metrics. These findings underscore the potential of I3SB in improving medical image restoration by providing fast and accurate generative modeling.
https://arxiv.org/abs/2403.06069
Diffusion models have recently gained traction as a powerful class of deep generative priors, excelling in a wide range of image restoration tasks due to their exceptional ability to model data distributions. To solve image restoration problems, many existing techniques achieve data consistency by incorporating additional likelihood gradient steps into the reverse sampling process of diffusion models. However, the additional gradient steps pose a challenge for real-world practical applications as they incur a large computational overhead, thereby increasing inference time. They also present additional difficulties when using accelerated diffusion model samplers, as the number of data consistency steps is limited by the number of reverse sampling steps. In this work, we propose a novel diffusion-based image restoration solver that addresses these issues by decoupling the reverse process from the data consistency steps. Our method involves alternating between a reconstruction phase to maintain data consistency and a refinement phase that enforces the prior via diffusion purification. Our approach demonstrates versatility, making it highly adaptable for efficient problem-solving in latent space. Additionally, it reduces the necessity for numerous sampling steps through the integration of consistency models. The efficacy of our approach is validated through comprehensive experiments across various image restoration tasks, including image denoising, deblurring, inpainting, and super-resolution.
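A compact sketch of the decoupled alternation, with `forward_op` standing in for the measurement operator $A$ (e.g. blur + downsample) and `purify` wrapping a short forward-then-reverse diffusion pass; both names are illustrative.

```python
import torch

def restore(x_init, y, forward_op, purify, n_iters: int = 10, step: float = 1.0):
    """Alternate a gradient-based reconstruction step (data consistency)
    with a diffusion-purification step (prior enforcement)."""
    x = x_init.clone()
    for _ in range(n_iters):
        # Reconstruction phase: move x toward agreement with y = A(x).
        x = x.detach().requires_grad_(True)
        loss = 0.5 * (forward_op(x) - y).pow(2).sum()
        grad = torch.autograd.grad(loss, x)[0]
        x = (x - step * grad).detach()
        # Refinement phase: project the iterate back onto the learned prior.
        x = purify(x)
    return x
```

Because the number of data-consistency steps is no longer tied to the number of reverse sampling steps, a scheme like this composes naturally with consistency models and latent-space diffusion, as the abstract notes.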
https://arxiv.org/abs/2403.06054
Pre-trained diffusion models for image generation encapsulate a substantial reservoir of prior knowledge pertaining to intricate textures. Harnessing this prior knowledge for image super-resolution presents a compelling avenue. Nonetheless, prevailing diffusion-based methodologies presently overlook the constraints imposed by degradation information on the diffusion process. Furthermore, these methods fail to consider the spatial variability inherent in the estimated blur kernel, stemming from factors such as motion jitter and out-of-focus elements in open-environment scenarios. This oversight results in a notable deviation of the image super-resolution effect from fundamental realities. To address these concerns, we introduce a framework known as Adaptive Multi-modal Fusion of Spatially Variant Kernel Refinement with Diffusion Model for Blind Image Super-Resolution (SSR). Within the SSR framework, we propose a Spatially Variant Kernel Refinement (SVKR) module. SVKR estimates a depth-informed kernel, which takes depth information into account and is spatially variant. Additionally, SVKR enhances the accuracy of the depth information acquired from LR images, allowing for mutual enhancement between the depth map and blur kernel estimates. Finally, we introduce an Adaptive Multi-Modal Fusion (AMF) module to align the information from three modalities: low-resolution images, depth maps, and blur kernels. This alignment can constrain the diffusion model to generate more authentic SR results. Quantitative and qualitative experiments affirm the superiority of our approach, while ablation experiments corroborate the effectiveness of the modules we have proposed.
https://arxiv.org/abs/2403.05808
Recent advancements in text-to-image generative systems have been largely driven by diffusion models. However, single-stage text-to-image diffusion models still face challenges in terms of computational efficiency and the refinement of image details. To tackle these issues, we propose CogView3, an innovative cascaded framework that enhances the performance of text-to-image diffusion. CogView3 is the first model implementing relay diffusion in the realm of text-to-image generation, executing the task by first creating low-resolution images and subsequently applying relay-based super-resolution. This methodology not only yields competitive text-to-image outputs but also greatly reduces both training and inference costs. Our experimental results demonstrate that CogView3 outperforms SDXL, the current state-of-the-art open-source text-to-image diffusion model, by 77.0% in human evaluations, while requiring only about half the inference time. The distilled variant of CogView3 achieves comparable performance while using only 1/10 of SDXL's inference time.
https://arxiv.org/abs/2403.05121