We present VIDIM, a generative model for video interpolation, which creates short videos given a start and end frame. In order to achieve high fidelity and generate motions unseen in the input data, VIDIM uses cascaded diffusion models to first generate the target video at low resolution, and then generate the high-resolution video conditioned on the low-resolution generated video. We compare VIDIM to previous state-of-the-art methods on video interpolation, and demonstrate how such works fail in most settings where the underlying motion is complex, nonlinear, or ambiguous, while VIDIM can easily handle such cases. We additionally demonstrate how classifier-free guidance on the start and end frames, and conditioning the super-resolution model on the original high-resolution frames without additional parameters, unlock high-fidelity results. VIDIM is fast to sample from as it jointly denoises all the frames to be generated, requires fewer than a billion parameters per diffusion model to produce compelling results, and still enjoys scalability and improved quality at larger parameter counts.
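The abstract does not give implementation details, but the classifier-free guidance it mentions has a standard form; the sketch below shows, under that assumption, how a noise prediction conditioned on the start and end frames could be combined with an unconditional one at each denoising step. The names `eps_model`, its arguments, and the guidance weight are illustrative, not from the paper.

```python
import torch

def cfg_epsilon(eps_model, x_t: torch.Tensor, t: torch.Tensor,
                start_frame: torch.Tensor, end_frame: torch.Tensor,
                guidance_weight: float = 2.0) -> torch.Tensor:
    # eps_model(x_t, t, cond) is assumed to predict noise for the noisy video x_t;
    # cond=None stands in for dropping the conditioning frames (the null condition).
    eps_cond = eps_model(x_t, t, cond=(start_frame, end_frame))
    eps_uncond = eps_model(x_t, t, cond=None)
    # Standard classifier-free guidance: extrapolate away from the unconditional estimate.
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)
```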
https://arxiv.org/abs/2404.01203
We present DiSR-NeRF, a diffusion-guided framework for view-consistent super-resolution (SR) NeRF. Unlike prior works, we circumvent the requirement for high-resolution (HR) reference images by leveraging existing powerful 2D super-resolution models. Nonetheless, independent SR 2D images are often inconsistent across different views. We thus propose Iterative 3D Synchronization (I3DS) to mitigate the inconsistency problem via the inherent multi-view consistency property of NeRF. Specifically, our I3DS alternates between upscaling low-resolution (LR) rendered images with diffusion models, and updating the underlying 3D representation with standard NeRF training. We further introduce Renoised Score Distillation (RSD), a novel score-distillation objective for 2D image super-resolution. Our RSD combines features from ancestral sampling and Score Distillation Sampling (SDS) to generate sharp images that are also LR-consistent. Qualitative and quantitative results on both synthetic and real-world datasets demonstrate that our DiSR-NeRF can achieve better results on NeRF super-resolution compared with existing works. Code and video results available at the project website.
https://arxiv.org/abs/2404.00874
In recent years, Vision Transformer-based applications to low-level vision tasks have achieved widespread success. Unlike CNN-based models, Transformers are more adept at capturing long-range dependencies, enabling the reconstruction of images utilizing information from non-local areas. In the domain of super-resolution, Swin-transformer-based approaches have become mainstream due to their capacity to capture global spatial information and their shifting-window attention mechanism that facilitates the interchange of information between different windows. Many researchers have enhanced image quality and network efficiency by expanding the receptive field or designing complex networks, yielding commendable results. However, we observed that spatial information tends to diminish during forward propagation as network depth increases, which in turn limits the model's potential. To address this, we propose the Dense-residual-connected Transformer (DRCT), aimed at mitigating the loss of spatial information through dense-residual connections between layers, thereby unleashing the model's potential and enhancing performance. Experimental results indicate that our approach is not only straightforward but also achieves remarkable efficiency, surpassing state-of-the-art methods and performing commendably at NTIRE2024.
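As a rough illustration of the dense-residual idea (the exact DRCT block is not specified in the abstract), the sketch below feeds every layer the concatenation of all earlier features and adds the block input back at the end, so spatial information is carried through depth; the channel sizes are placeholders.

```python
import torch
import torch.nn as nn

class DenseResidualBlock(nn.Module):
    """Minimal dense-residual block sketch, not the exact DRCT design."""

    def __init__(self, channels=64, growth=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels + i * growth, growth, 3, padding=1),
                nn.GELU(),
            )
            for i in range(n_layers)
        ])
        self.fuse = nn.Conv2d(channels + n_layers * growth, channels, 1)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            # Each layer sees all earlier features (dense connection).
            feats.append(layer(torch.cat(feats, dim=1)))
        # Residual connection back to the block input.
        return x + self.fuse(torch.cat(feats, dim=1))
```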
https://arxiv.org/abs/2404.00722
Diffusion models, known for their powerful generative capabilities, play a crucial role in addressing real-world super-resolution challenges. However, these models often focus on improving local textures while neglecting the impacts of global degradation, which can significantly reduce semantic fidelity and lead to inaccurate reconstructions and suboptimal super-resolution performance. To address this issue, we introduce a novel two-stage, degradation-aware framework that enhances the diffusion model's ability to recognize content and degradation in low-resolution images. In the first stage, we employ unsupervised contrastive learning to obtain representations of image degradations. In the second stage, we integrate a degradation-aware module into a simplified ControlNet, enabling flexible adaptation to various degradations based on the learned representations. Furthermore, we decompose the degradation-aware features into global semantics and local details branches, which are then injected into the diffusion denoising module to modulate the target generation. Our method effectively recovers semantically precise and photorealistic details, particularly under significant degradation conditions, demonstrating state-of-the-art performance across various benchmarks. Codes will be released at this https URL.
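For the first stage, a common way to learn degradation representations with unsupervised contrastive learning is an InfoNCE objective over two crops of the same low-resolution image, which share a degradation; the sketch below assumes such a setup, with `encoder` standing in for any embedding network. It illustrates the generic technique rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def degradation_contrastive_loss(encoder, crops_a, crops_b, temperature=0.07):
    """InfoNCE sketch: (crops_a[i], crops_b[i]) come from the same LR image
    and form a positive pair; crops from other images act as negatives."""
    z_a = F.normalize(encoder(crops_a), dim=1)            # (B, D)
    z_b = F.normalize(encoder(crops_b), dim=1)            # (B, D)
    logits = z_a @ z_b.t() / temperature                   # (B, B) similarity matrix
    labels = torch.arange(z_a.size(0), device=z_a.device)  # positives on the diagonal
    return F.cross_entropy(logits, labels)
```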
https://arxiv.org/abs/2404.00661
Cross-spectral image guided denoising has shown great potential in recovering clean images with rich details, such as using the near-infrared image to guide the denoising process of the visible one. To obtain such image pairs, a feasible and economical way is to employ a stereo system, which is widely used on mobile devices. Current works attempt to generate an aligned guidance image to handle the disparity between two images. However, due to occlusion, spectral differences and noise degradation, the aligned guidance image generally exhibits ghosting and artifacts, leading to unsatisfactory denoised results. To address this issue, we propose a one-stage transformer-based architecture, named SGDFormer, for cross-spectral Stereo image Guided Denoising. The architecture integrates the correspondence modeling and feature fusion of stereo images into a unified network. Our transformer block contains a noise-robust cross-attention (NRCA) module and a spatially variant feature fusion (SVFF) module. The NRCA module captures the long-range correspondence of two images in a coarse-to-fine manner to alleviate the interference of noise. The SVFF module further enhances salient structures and suppresses harmful artifacts through dynamically selecting useful information. Thanks to the above design, our SGDFormer can restore artifact-free images with fine structures, and achieves state-of-the-art performance on various datasets. Additionally, our SGDFormer can be extended to handle other unaligned cross-modal guided restoration tasks such as guided depth super-resolution.
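The abstract does not detail the NRCA module, so the sketch below only shows the underlying mechanism it builds on: plain single-head cross-attention in which the noisy visible-image tokens query the near-infrared guidance tokens. The coarse-to-fine and noise-robust aspects are omitted, and the dimensions are placeholders.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Generic cross-attention sketch: queries from the noisy target features,
    keys/values from the guidance features."""

    def __init__(self, dim=64):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, target_tokens, guide_tokens):
        # target_tokens: (B, N, C) noisy visible tokens; guide_tokens: (B, M, C) NIR tokens.
        attn = (self.q(target_tokens) @ self.k(guide_tokens).transpose(1, 2)) * self.scale
        attn = attn.softmax(dim=-1)
        return attn @ self.v(guide_tokens)
```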
https://arxiv.org/abs/2404.00349
Recent advances in self-supervised learning, predominantly studied in high-level visual tasks, have been explored in low-level image processing. This paper introduces a novel self-supervised constraint for single image super-resolution, termed SSC-SR. SSC-SR uniquely addresses the divergence in image complexity by employing a dual asymmetric paradigm and a target model updated via exponential moving average to enhance stability. The proposed SSC-SR framework works as a plug-and-play paradigm and can be easily applied to existing SR models. Empirical evaluations reveal that our SSC-SR framework delivers substantial enhancements on a variety of benchmark datasets, achieving an average increase of 0.1 dB over EDSR and 0.06 dB over SwinIR. In addition, extensive ablation studies corroborate the effectiveness of each constituent in our SSC-SR framework. Codes are available at this https URL.
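The target model updated via exponential moving average presumably follows the usual EMA rule; a minimal sketch, with the decay value chosen arbitrarily:

```python
import torch

@torch.no_grad()
def ema_update(online_model, target_model, decay=0.999):
    """EMA sketch: the target network's weights track a slow average of the
    online SR model's weights after every optimizer step."""
    for p_t, p_o in zip(target_model.parameters(), online_model.parameters()):
        p_t.mul_(decay).add_(p_o, alpha=1.0 - decay)

# Hypothetical usage inside a training loop:
#   target_model = copy.deepcopy(sr_model)   # initialize once
#   ...                                       # after each optimizer step:
#   ema_update(sr_model, target_model)
```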
https://arxiv.org/abs/2404.00260
Electrocardiogram (ECG) signals play a pivotal role in cardiovascular diagnostics, providing essential information on the electrical activity of the heart. However, the inherent noise and limited resolution in ECG recordings can hinder accurate interpretation and diagnosis. In this paper, we propose a novel model for ECG super-resolution (SR) that uses a denoising autoencoder (DNAE) to enhance temporal and frequency information inside ECG signals. Our approach addresses the limitations of traditional ECG signal processing techniques. Our model takes as input 5-second ECG windows sampled at 50 Hz (very low resolution) and reconstructs a denoised super-resolution signal at a 10x upsampling rate (sampled at 500 Hz). We trained the proposed DCAE-SR on publicly available myocardial infarction ECG signals. Our method demonstrates superior performance in reconstructing high-resolution ECG signals from very low-resolution signals with a sampling rate of 50 Hz. We compared our results with current deep-learning approaches for ECG super-resolution and with reproducible non-deep-learning methods that can perform both super-resolution and denoising. We obtained state-of-the-art performance in super-resolution of very low-resolution ECG signals frequently corrupted by ECG artifacts, achieving a signal-to-noise ratio of 12.20 dB (outperforming the previous 4.68 dB), a mean squared error of 0.0044 (outperforming the previous 0.0154), and a root mean squared error of 4.86% (outperforming the previous 12.40%). In conclusion, our DCAE-SR model offers a robust (to artifact presence), versatile and explainable solution to enhance the quality of ECG signals. This advancement holds promise for advancing the field of cardiovascular diagnostics, paving the way for improved patient care and high-quality clinical decisions.
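As a toy illustration of the stated input/output geometry (a 5-second window at 50 Hz, i.e. 250 samples, mapped to 500 Hz, i.e. 2500 samples), the sketch below uses a small 1D convolutional denoising autoencoder with a 10x transposed-convolution decoder; it is not the authors' DCAE-SR architecture, and the lead count is a placeholder.

```python
import torch
import torch.nn as nn

class ECGUpsampler(nn.Module):
    """Toy 1D denoising autoencoder mapping (B, leads, 250) -> (B, leads, 2500)."""

    def __init__(self, leads=12):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(leads, 64, 9, padding=4), nn.ReLU(),
            nn.Conv1d(64, 128, 9, padding=4), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(128, 64, kernel_size=10, stride=10),  # 250 -> 2500 samples
            nn.ReLU(),
            nn.Conv1d(64, leads, 9, padding=4),
        )

    def forward(self, x):                      # x: (B, leads, 250) at 50 Hz
        return self.decoder(self.encoder(x))   # (B, leads, 2500) at 500 Hz
```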
https://arxiv.org/abs/2404.15307
While burst LR images are useful for improving the SR image quality compared with a single LR image, prior SR networks accepting the burst LR images are trained in a deterministic manner, which is known to produce a blurry SR image. In addition, it is difficult to perfectly align the burst LR images, making the SR image more blurry. Since such blurry images are perceptually degraded, we aim to reconstruct the sharp high-fidelity boundaries. Such high-fidelity images can be reconstructed by diffusion models. However, prior SR methods using the diffusion model are not properly optimized for the burst SR task. Specifically, the reverse process starting from a random sample is not optimized for image enhancement and restoration methods, including burst SR. In our proposed method, on the other hand, burst LR features are used to reconstruct the initial burst SR image that is fed into an intermediate step in the diffusion model. This reverse process from the intermediate step 1) skips diffusion steps for reconstructing the global structure of the image and 2) focuses on steps for refining detailed textures. Our experimental results demonstrate that our method can improve the scores of the perceptual quality metrics. Code: this https URL
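The key step, noising an initial burst-SR estimate to an intermediate timestep and resuming the reverse process from there, follows directly from the standard DDPM forward marginal q(x_t | x_0); a minimal sketch, with `alphas_cumprod` and `t_mid` assumed to come from the usual diffusion noise schedule:

```python
import torch

def noise_to_intermediate_step(x0_init, alphas_cumprod, t_mid):
    """Noise an initial SR estimate x0_init to timestep t_mid so the reverse
    process can start there instead of from pure noise, skipping the steps that
    would only rebuild global structure."""
    a_bar = alphas_cumprod[t_mid]
    noise = torch.randn_like(x0_init)
    x_t = a_bar.sqrt() * x0_init + (1.0 - a_bar).sqrt() * noise
    return x_t  # continue the usual reverse (denoising) process from t = t_mid
```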
https://arxiv.org/abs/2403.19428
In recent years, there has been an explosion of 2D vision models for numerous tasks such as semantic segmentation, style transfer or scene editing, enabled by large-scale 2D image datasets. At the same time, there has been renewed interest in 3D scene representations such as neural radiance fields from multi-view images. However, the availability of 3D or multiview data is still substantially limited compared to 2D image datasets, making extending 2D vision models to 3D data highly desirable but also very challenging. Indeed, extending a single 2D vision operator like scene editing to 3D typically requires a highly creative method specialized to that task and often requires per-scene optimization. In this paper, we ask the question of whether any 2D vision model can be lifted to make 3D consistent predictions. We answer this question in the affirmative; our new Lift3D method trains to predict unseen views on feature spaces generated by a few visual models (i.e. DINO and CLIP), but then generalizes to novel vision operators and tasks, such as style transfer, super-resolution, open vocabulary segmentation and image colorization; for some of these tasks, there is no comparable previous 3D method. In many cases, we even outperform state-of-the-art methods specialized for the task in question. Moreover, Lift3D is a zero-shot method, in the sense that it requires no task-specific training, nor scene-specific optimization.
https://arxiv.org/abs/2403.18922
In recent years, remarkable advancements have been achieved in the field of image generation, primarily driven by the escalating demand for high-quality outcomes across various image generation subtasks, such as inpainting, denoising, and super resolution. A major effort is devoted to exploring the application of super-resolution techniques to enhance the quality of low-resolution images. In this context, our method explores in depth the problem of ship image super resolution, which is crucial for coastal and port surveillance. We investigate the opportunity given by the growing interest in text-to-image diffusion models, taking advantage of the prior knowledge that such foundation models have already learned. In particular, we present a diffusion-model-based architecture that leverages text conditioning during training while being class-aware, to best preserve the crucial details of the ships during the generation of the super-resolved image. Given the specificity of this task and the scarce availability of off-the-shelf data, we also introduce a large labeled ship dataset scraped from online ship images, mostly from the ShipSpotting\footnote{\url{this http URL}} website. Our method achieves more robust results than other deep learning models previously employed for super resolution, as proven by the multiple experiments performed. Moreover, we investigate how this model can benefit downstream tasks, such as classification and object detection, thus emphasizing practical implementation in a real-world scenario. Experimental results show flexibility, reliability, and impressive performance of the proposed framework over state-of-the-art methods for different tasks. The code is available at: this https URL .
https://arxiv.org/abs/2403.18370
Human activities accelerate the consumption of fossil fuels and produce greenhouse gases, resulting in today's urgent issues of global warming and climate change. These indirectly cause severe natural disasters, widespread human suffering, and huge agricultural losses. To mitigate the impacts on our lands, scientists are developing renewable, reusable, and clean energies, and climatologists are trying to predict extremes. Meanwhile, governments are publicizing resource-saving policies for a more eco-friendly society and raising environmental awareness. One of the most influential factors is precipitation, which brings condensed water vapor onto land. Water resources are among the most significant and basic needs of society, supporting not only our daily lives but also the economy. In Taiwan, although the average annual precipitation is up to 2,500 millimeters (mm), the water available per person is lower than the global average due to drastic changes in geographical elevation and uneven distribution throughout the year. Thus, it is crucial to track and predict rainfall in order to make the most of it and to prevent floods. However, climate models have limited resolution and require intensive computational power for local-scale use. Therefore, we propose a deep convolutional neural network with skip connections, attention blocks, and auxiliary data concatenation to downscale low-resolution precipitation data into high-resolution data. Finally, we compare against other climate downscaling methods and show better performance in terms of Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Pearson correlation, structural similarity index (SSIM), and forecast indicators.
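The abstract names the ingredients (skip connections, attention blocks, auxiliary data concatenation) without specifying the architecture, so the sketch below is only a schematic arrangement of those ingredients for precipitation downscaling; the layer sizes, the auxiliary field (e.g. elevation), and the scale factor are placeholders.

```python
import torch
import torch.nn as nn

class ResAttnBlock(nn.Module):
    """Residual block with squeeze-and-excitation style channel attention."""

    def __init__(self, width):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1))
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(width, width, 1), nn.Sigmoid())

    def forward(self, x):
        y = self.body(x)
        return x + y * self.attn(y)            # skip connection around the block


class PrecipDownscaler(nn.Module):
    """Schematic downscaler: concatenate auxiliary data, apply residual
    attention blocks, then upsample the coarse precipitation field."""

    def __init__(self, aux_channels=1, scale=4, width=64):
        super().__init__()
        self.head = nn.Conv2d(1 + aux_channels, width, 3, padding=1)
        self.body = nn.Sequential(*[ResAttnBlock(width) for _ in range(4)])
        self.tail = nn.Sequential(
            nn.Conv2d(width, width * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
            nn.Conv2d(width, 1, 3, padding=1))

    def forward(self, coarse_precip, aux):
        # Auxiliary data (e.g. elevation) is concatenated with the coarse field.
        x = self.head(torch.cat([coarse_precip, aux], dim=1))
        return self.tail(self.body(x))
```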
https://arxiv.org/abs/2403.17847
The data bottleneck has emerged as a fundamental challenge in learning based image restoration methods. Researchers have attempted to generate synthesized training data using paired or unpaired samples to address this challenge. This study proposes SeNM-VAE, a semi-supervised noise modeling method that leverages both paired and unpaired datasets to generate realistic degraded data. Our approach is based on modeling the conditional distribution of degraded and clean images with a specially designed graphical model. Under the variational inference framework, we develop an objective function for handling both paired and unpaired data. We employ our method to generate paired training samples for real-world image denoising and super-resolution tasks. Our approach excels in the quality of synthetic degraded images compared to other unpaired and paired noise modeling methods. Furthermore, our approach demonstrates remarkable performance in downstream image restoration tasks, even with limited paired data. With more paired data, our method achieves the best performance on the SIDD dataset.
https://arxiv.org/abs/2403.17502
Reference-based super-resolution (RefSR) has the potential to build bridges across spatial and temporal resolutions of remote sensing images. However, existing RefSR methods are limited by the faithfulness of content reconstruction and the effectiveness of texture transfer in large scaling factors. Conditional diffusion models have opened up new opportunities for generating realistic high-resolution images, but effectively utilizing reference images within these models remains an area for further exploration. Furthermore, content fidelity is difficult to guarantee in areas without relevant reference information. To solve these issues, we propose a change-aware diffusion model named Ref-Diff for RefSR, using the land cover change priors to guide the denoising process explicitly. Specifically, we inject the priors into the denoising model to improve the utilization of reference information in unchanged areas and regulate the reconstruction of semantically relevant content in changed areas. With this powerful guidance, we decouple the semantics-guided denoising and reference texture-guided denoising processes to improve the model performance. Extensive experiments demonstrate the superior effectiveness and robustness of the proposed method compared with state-of-the-art RefSR methods in both quantitative and qualitative evaluations. The code and data are available at this https URL.
https://arxiv.org/abs/2403.17460
In image Super-Resolution (SR), relying on large datasets for training is a double-edged sword. While offering rich training material, they also demand substantial computational and storage resources. In this work, we analyze dataset pruning as a solution to these challenges. We introduce a novel approach that reduces a dataset to a core-set of training samples, selected based on their loss values as determined by a simple pre-trained SR model. By focusing the training on just 50% of the original dataset, specifically on the samples characterized by the highest loss values, we achieve results comparable to or even surpassing those obtained from training on the entire dataset. Interestingly, our analysis reveals that the top 5% of samples with the highest loss values negatively affect the training process. Excluding these samples and adjusting the selection to favor easier samples further enhances training outcomes. Our work opens new perspectives to the untapped potential of dataset pruning in image SR. It suggests that careful selection of training data based on loss-value metrics can lead to better SR models, challenging the conventional wisdom that more data inevitably leads to better performance.
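The selection procedure as described reduces to scoring samples with a pre-trained SR model's loss, discarding the hardest few percent, and keeping the highest-loss half of the rest; a minimal sketch, with the dataset assumed to yield (LR, HR) tensor pairs and the fractions taken from the abstract:

```python
import torch

@torch.no_grad()
def select_core_set(sr_model, dataset, loss_fn, keep_frac=0.50, drop_top_frac=0.05):
    """Return indices of the training core-set: drop the top 5% highest-loss
    samples, then keep the next highest-loss 50% of the dataset."""
    losses = []
    for idx in range(len(dataset)):
        lr, hr = dataset[idx]
        pred = sr_model(lr.unsqueeze(0))
        losses.append((loss_fn(pred, hr.unsqueeze(0)).item(), idx))
    losses.sort(reverse=True)                     # hardest (highest loss) first
    n_drop = int(drop_top_frac * len(losses))     # discard harmful outliers
    n_keep = int(keep_frac * len(losses))
    return [idx for _, idx in losses[n_drop:n_drop + n_keep]]
```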
https://arxiv.org/abs/2403.17083
Diffusion models are just at a tipping point for the image super-resolution task. Nevertheless, it is not trivial to capitalize on diffusion models for video super-resolution, which necessitates not only the preservation of visual appearance from low-resolution to high-resolution videos, but also the temporal consistency across video frames. In this paper, we propose a novel approach, pursuing Spatial Adaptation and Temporal Coherence (SATeCo), for video super-resolution. SATeCo pivots on learning spatial-temporal guidance from low-resolution videos to calibrate both latent-space high-resolution video denoising and pixel-space video reconstruction. Technically, SATeCo freezes all the parameters of the pre-trained UNet and VAE, and only optimizes two deliberately-designed spatial feature adaptation (SFA) and temporal feature alignment (TFA) modules, in the decoder of UNet and VAE. SFA modulates frame features via adaptively estimating affine parameters for each pixel, guaranteeing pixel-wise guidance for high-resolution frame synthesis. TFA delves into feature interaction within a 3D local window (tubelet) through self-attention, and executes cross-attention between the tubelet and its low-resolution counterpart to guide temporal feature alignment. Extensive experiments conducted on the REDS4 and Vid4 datasets demonstrate the effectiveness of our approach.
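The SFA module's "adaptively estimating affine parameters for each pixel" reads like an SFT-style modulation; the sketch below shows that generic pattern, predicting per-pixel (gamma, beta) from guidance features and applying them to the frozen decoder's frame features, with illustrative channel sizes rather than the paper's.

```python
import torch
import torch.nn as nn

class SpatialFeatureAdaptation(nn.Module):
    """SFT-style sketch of per-pixel affine modulation from guidance features."""

    def __init__(self, feat_dim=320, guide_dim=64):
        super().__init__()
        self.to_affine = nn.Sequential(
            nn.Conv2d(guide_dim, feat_dim, 3, padding=1), nn.SiLU(),
            nn.Conv2d(feat_dim, feat_dim * 2, 3, padding=1),
        )

    def forward(self, frame_feat, guide_feat):
        # guide_feat is assumed to be resized to frame_feat's spatial size.
        gamma, beta = self.to_affine(guide_feat).chunk(2, dim=1)
        return frame_feat * (1 + gamma) + beta   # pixel-wise guidance
```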
https://arxiv.org/abs/2403.17000
The use of fluorescent molecules to create long sequences of low-density, diffraction-limited images enables highly-precise molecule localization. However, this methodology requires lengthy imaging times, which limits the ability to view dynamic interactions of live cells on short time scales. Many techniques have been developed to reduce the number of frames needed for localization, from classic iterative optimization to deep neural networks. In particular, deep algorithm unrolling utilizes both the structure of iterative sparse recovery algorithms and the performance gains of supervised deep learning. However, the robustness of this approach is highly dependent on having sufficient training data. In this paper we introduce deep unrolled self-supervised learning, which alleviates the need for such data by training a sequence-specific, model-based autoencoder that learns only from given measurements. Our proposed method exceeds the performance of its supervised counterparts, thus allowing for robust, dynamic imaging well below the diffraction limit without any labeled training samples. Furthermore, the suggested model-based autoencoder scheme can be utilized to enhance generalization in any sparse recovery framework, without the need for external training data.
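Deep algorithm unrolling as referenced here typically turns each iteration of a sparse recovery algorithm such as ISTA into a layer with learnable parameters; a minimal LISTA-style sketch (the measurement matrix A, step sizes, and iteration count are assumptions, and the paper's model-based autoencoder adds more on top of this pattern):

```python
import torch
import torch.nn as nn

def soft_threshold(x, lam):
    return torch.sign(x) * torch.clamp(x.abs() - lam, min=0.0)

class UnrolledISTA(nn.Module):
    """Each ISTA iteration becomes a layer with a learnable step size and
    threshold, keeping the structure of the iterative algorithm while being
    trainable end to end."""

    def __init__(self, A, n_iters=10):
        super().__init__()
        self.register_buffer("A", A)                       # measurement matrix (M, N)
        self.step = nn.Parameter(torch.full((n_iters,), 0.1))
        self.thresh = nn.Parameter(torch.full((n_iters,), 0.01))

    def forward(self, y):                                  # y: (B, M) measurements
        x = torch.zeros(y.size(0), self.A.size(1), device=y.device)
        for k in range(len(self.step)):
            residual = y - x @ self.A.t()                  # (B, M) data mismatch
            # Gradient step followed by soft-thresholding (sparsity prior).
            x = soft_threshold(x + self.step[k] * residual @ self.A, self.thresh[k])
        return x
```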
https://arxiv.org/abs/2403.16974
Artifact-free super-resolution (SR) aims to translate low-resolution images into their high-resolution counterparts with a strict integrity of the original content, eliminating any distortions or synthetic details. While traditional diffusion-based SR techniques have demonstrated remarkable abilities to enhance image detail, they are prone to artifact introduction during iterative procedures. Such artifacts, ranging from trivial noise to unauthentic textures, deviate from the true structure of the source image, thus challenging the integrity of the super-resolution process. In this work, we propose Self-Adaptive Reality-Guided Diffusion (SARGD), a training-free method that delves into the latent space to effectively identify and mitigate the propagation of artifacts. Our SARGD begins by using an artifact detector to identify implausible pixels, creating a binary mask that highlights artifacts. Following this, the Reality Guidance Refinement (RGR) process refines artifacts by integrating this mask with realistic latent representations, improving alignment with the original image. Nonetheless, initial realistic-latent representations from lower-quality images result in over-smoothing in the final output. To address this, we introduce a Self-Adaptive Guidance (SAG) mechanism. It dynamically computes a reality score, enhancing the sharpness of the realistic latent. These alternating mechanisms collectively achieve artifact-free super-resolution. Extensive experiments demonstrate the superiority of our method, delivering detailed artifact-free high-resolution images while reducing sampling steps by 2X. We release our code at this https URL.
https://arxiv.org/abs/2403.16643
Ultrasound imaging is crucial for evaluating organ morphology and function, yet depth adjustment can degrade image quality and field-of-view, presenting a depth-dependent dilemma. Traditional interpolation-based zoom-in techniques often sacrifice detail and introduce artifacts. Motivated by the potential of arbitrary-scale super-resolution to naturally address these inherent challenges, we present the Residual Dense Swin Transformer Network (RDSTN), designed to capture the non-local characteristics and long-range dependencies intrinsic to ultrasound images. It comprises a linear embedding module for feature enhancement, an encoder with shifted-window attention for modeling non-locality, and an MLP decoder for continuous detail reconstruction. This strategy streamlines balancing image quality and field-of-view, which offers superior textures over traditional methods. Experimentally, RDSTN outperforms existing approaches while requiring fewer parameters. In conclusion, RDSTN shows promising potential for ultrasound image enhancement by overcoming the limitations of conventional interpolation-based methods and achieving depth-independent imaging.
https://arxiv.org/abs/2403.16384
Transformer-based models have revolutionized the field of image super-resolution (SR) by harnessing their inherent ability to capture complex contextual features. The overlapping rectangular shifted window technique used in today's transformer architectures is a common practice in super-resolution models to improve the quality and robustness of image upscaling. However, it suffers from distortion at the boundaries and has limited unique shifting modes. To overcome these weaknesses, we propose a non-overlapping triangular window technique that synchronously works with the rectangular one to mitigate boundary-level distortion and allows the model to access more unique shifting modes. In this paper, we propose a Composite Fusion Attention Transformer (CFAT) that incorporates triangular-rectangular window-based local attention with a channel-based global attention technique in image super-resolution. As a result, CFAT enables attention mechanisms to be activated on more image pixels and captures long-range, multi-scale features to improve SR performance. The extensive experimental results and ablation study demonstrate the effectiveness of CFAT in the SR domain. Our proposed model shows a significant 0.7 dB performance improvement over other state-of-the-art SR architectures.
https://arxiv.org/abs/2403.16143
One-shot talking-head generation learns to synthesize a talking-head video from a single source portrait image, driven by a video of the same or a different identity. Usually these methods require plane-based pixel transformations via Jacobian matrices or facial image warps to generate novel poses. The constraints of using a single image source and pixel displacements often compromise the clarity of the synthesized images. Some methods try to improve the quality of synthesized videos by introducing additional super-resolution modules, but this will undoubtedly increase computational consumption and destroy the original data distribution. In this work, we propose an adaptive high-quality talking-head video generation method, which synthesizes high-resolution video without additional pre-trained modules. Specifically, inspired by existing super-resolution methods, we down-sample the one-shot source image, and then adaptively reconstruct high-frequency details via an encoder-decoder module, resulting in enhanced video clarity. Our method consistently improves the quality of generated videos through a straightforward yet effective strategy, as substantiated by quantitative and qualitative evaluations. The code and demo video are available on: \url{this https URL}.
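A rough sketch of the stated strategy, down-sampling the source image and training an encoder-decoder to restore the lost high-frequency detail; the architecture and down-sampling factor below are placeholders, not the paper's, and the code assumes even spatial dimensions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetailReconstructor(nn.Module):
    """Toy encoder-decoder that predicts the high-frequency residual of a
    down-sampled portrait, which can then be reused to sharpen synthesized frames."""

    def __init__(self, width=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(width, width, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(width, 3, 3, padding=1))

    def forward(self, source, down_factor=4):
        # Simulate the low-resolution input by down- and re-upsampling the source.
        lr = F.interpolate(source, scale_factor=1 / down_factor,
                           mode="bilinear", align_corners=False)
        lr = F.interpolate(lr, size=source.shape[-2:],
                           mode="bilinear", align_corners=False)
        high_freq = self.decoder(self.encoder(lr))   # predicted detail residual
        return lr + high_freq                        # sharpened reconstruction
```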
https://arxiv.org/abs/2403.15944