This paper tackles the challenging task of object removal for updating radiance fields built on 3D Gaussian Splatting. The main difficulties lie in preserving geometric consistency and maintaining texture coherence given the highly discrete nature of Gaussian primitives. We introduce a robust framework specifically designed to overcome these obstacles. The key insight of our approach is to enhance information exchange between visible and invisible areas, facilitating content restoration in terms of both geometry and texture. Our methodology begins by optimizing the positions of Gaussian primitives to improve geometric consistency across both removed and visible areas, guided by an online registration process informed by monocular depth estimation. Following this, we employ a novel feature-propagation mechanism to bolster texture coherence, leveraging a cross-attention design that bridges Gaussians sampled from uncertain and certain areas. This design significantly refines texture coherence in the final radiance field. Extensive experiments validate that our method not only elevates the quality of novel view synthesis for scenes undergoing object removal but also delivers notable efficiency gains in training and rendering speeds.
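To make the cross-attention idea concrete, below is a minimal PyTorch sketch of how Gaussians sampled from the uncertain (removed) region could query texture features from Gaussians in certain (visible) regions; the module name, shapes, and residual update are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class GaussianCrossAttention(nn.Module):
    """Hypothetical feature propagation from visible to removed-region Gaussians."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, uncertain_feats, certain_feats):
        # uncertain_feats: (B, Nu, C) features of Gaussians in the removed area
        # certain_feats:   (B, Nc, C) features of Gaussians in visible areas
        q = self.norm(uncertain_feats)
        out, _ = self.attn(q, certain_feats, certain_feats)
        # residual update: propagate visible-area texture into the hole
        return uncertain_feats + out

# usage
feats_hole = torch.randn(1, 256, 64)   # sampled Gaussians inside the removed region
feats_vis = torch.randn(1, 2048, 64)   # sampled Gaussians from visible regions
updated = GaussianCrossAttention(dim=64)(feats_hole, feats_vis)  # (1, 256, 64)
```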
https://arxiv.org/abs/2404.13679
Real-world low-quality face videos exhibit multiple coupled, complex degradations. Blind video face restoration is therefore a highly challenging ill-posed problem, requiring not only hallucinating high-fidelity details but also enhancing temporal coherence across diverse pose variations. Naively restoring each frame independently inevitably introduces temporal incoherence and artifacts from pose changes and keypoint localization errors. To address this, we propose the first blind video face restoration approach with a novel parsing-guided temporal-coherent transformer (PGTFormer) that requires no pre-alignment. PGTFormer leverages semantic parsing guidance to select optimal face priors for generating temporally coherent, artifact-free results. Specifically, we pre-train a temporal-spatial vector-quantized auto-encoder on high-quality video face datasets to extract expressive, context-rich priors. Then, the temporal parse-guided codebook predictor (TPCP) restores faces in different poses based on face parsing context cues, without performing face pre-alignment. This strategy reduces artifacts and mitigates jitter caused by cumulative errors from face pre-alignment. Finally, the temporal fidelity regulator (TFR) enhances fidelity through temporal feature interaction and improves video temporal consistency. Extensive experiments on face videos show that our method outperforms previous face restoration baselines. The code will be released on \href{this https URL}{this https URL}.
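As background for the codebook prior, here is a minimal, hypothetical sketch of the vector-quantization step a VQ auto-encoder relies on; PGTFormer's temporal-spatial codebook and parsing-guided code prediction are considerably richer than this nearest-neighbour lookup.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Snap encoder features to their nearest codebook entries (illustrative)."""
    def __init__(self, num_codes: int = 1024, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        # z: (B, N, C) encoder features
        d = torch.cdist(z, self.codebook.weight[None].expand(z.size(0), -1, -1))
        idx = d.argmin(dim=-1)                 # (B, N) selected code indices
        z_q = self.codebook(idx)               # quantized features
        # straight-through estimator so gradients still reach the encoder
        return z + (z_q - z).detach(), idx
```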
https://arxiv.org/abs/2404.13640
Tackling image degradation due to atmospheric turbulence, particularly in dynamic environments, remains a challenge for long-range imaging systems. Existing techniques have been designed primarily for static scenes or scenes with small motion. This paper presents the first segment-then-restore pipeline for restoring videos of dynamic scenes in turbulent environments. We leverage mean optical flow with an unsupervised motion segmentation method to separate dynamic and static scene components prior to restoration. After camera-shake compensation and segmentation, we introduce foreground/background enhancement that leverages statistics of the turbulence strength and a transformer model trained on a novel noise-based procedural turbulence generator for fast dataset augmentation. Benchmarked against existing restoration methods, our approach removes most of the geometric distortion and enhances video sharpness. We make our code, simulator, and data publicly available to advance the field of video restoration from turbulence: this http URL
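A minimal sketch of the segmentation idea, under the assumption that turbulence-induced jitter averages toward zero over time while true scene motion does not; the per-frame flow fields are assumed to come from an external estimator, and the threshold rule is illustrative.

```python
import numpy as np

def segment_dynamic(flows: np.ndarray, k: float = 2.0) -> np.ndarray:
    """flows: (T, H, W, 2) optical flow per frame. Returns a boolean (H, W)
    mask that is True for pixels dominated by scene motion rather than
    turbulence jitter (which is pseudo-random and averages toward zero)."""
    mean_flow = flows.mean(axis=0)             # (H, W, 2) temporal mean
    mag = np.linalg.norm(mean_flow, axis=-1)   # (H, W) mean-flow magnitude
    # adaptive threshold: k standard deviations above the median magnitude
    thr = np.median(mag) + k * mag.std()
    return mag > thr
```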
https://arxiv.org/abs/2404.13605
In real-world scenarios, a series of image degradations makes obtaining high-quality, clear photos challenging. While significant progress has been made in synthesizing high-quality images, previous methods for image restoration and enhancement often overlook the characteristics of different degradations: they apply the same structure to address various degradation types, resulting in less-than-ideal restoration outcomes. Inspired by the notion that high- and low-frequency information suit different degradations, we introduce HLNet, a Bracketing Image Restoration and Enhancement method based on high-low frequency decomposition. Specifically, we employ two kinds of modules for feature extraction: shared-weight modules and non-shared-weight modules. In the shared-weight modules, we use SCConv to extract features common to different degradations. In the non-shared-weight modules, we introduce the High-Low Frequency Decomposition Block (HLFDB), which handles high- and low-frequency information with different methods, enabling the model to address different degradations more effectively. Compared to other networks, our method takes the characteristics of different degradations into account, thus achieving higher-quality image restoration.
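A plausible (assumed, not the paper's exact) high-low frequency split of the kind the HLFDB builds on: a depthwise Gaussian low-pass yields the low-frequency branch, and the residual yields the high-frequency branch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def gaussian_kernel(size: int = 5, sigma: float = 1.0) -> torch.Tensor:
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return k / k.sum()

class FreqDecompose(nn.Module):
    """Split features into low/high frequency branches via a fixed blur."""
    def __init__(self, channels: int, size: int = 5, sigma: float = 1.0):
        super().__init__()
        k = gaussian_kernel(size, sigma).repeat(channels, 1, 1, 1)
        self.register_buffer("kernel", k)   # (C, 1, k, k) depthwise weights
        self.pad = size // 2
        self.channels = channels

    def forward(self, x):
        low = F.conv2d(x, self.kernel, padding=self.pad, groups=self.channels)
        high = x - low                      # residual carries edges/texture
        return low, high
```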
https://arxiv.org/abs/2404.13537
The deep learning revolution has strongly impacted low-level image processing tasks such as style/domain transfer, enhancement/restoration, and visual quality assessments. Despite often being treated separately, the aforementioned tasks share a common theme of understanding, editing, or enhancing the appearance of input images without modifying the underlying content. We leverage this observation to develop a novel disentangled representation learning method that decomposes inputs into content and appearance features. The model is trained in a self-supervised manner and we use the learned features to develop a new quality prediction model named DisQUE. We demonstrate through extensive evaluations that DisQUE achieves state-of-the-art accuracy across quality prediction tasks and distortion types. Moreover, we demonstrate that the same features may also be used for image processing tasks such as HDR tone mapping, where the desired output characteristics may be tuned using example input-output pairs.
https://arxiv.org/abs/2404.13484
Unsupervised anomaly detection using only normal samples is of great significance for quality inspection in industrial manufacturing. Although existing reconstruction-based methods have achieved promising results, they still face two problems: poorly distinguishable information in image reconstruction, and faithful regeneration of abnormal regions caused by the model's over-generalization. To overcome these issues, we convert image reconstruction into a combination of parallel feature restorations and propose MFRNet, a multi-feature reconstruction network using crossed-mask restoration. Specifically, a multi-scale feature aggregator is first developed to generate more discriminative hierarchical representations of the input images from a pre-trained model. Subsequently, a crossed-mask generator randomly covers the extracted feature map, followed by a restoration network based on the transformer structure for high-quality repair of the missing regions. Finally, a hybrid loss guides model training and anomaly estimation, considering both pixel and structural similarity. Extensive experiments show that our method is highly competitive with, or significantly outperforms, other state-of-the-art methods on four publicly available datasets and one self-made dataset.
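A hypothetical sketch of the crossed-mask idea: two complementary coarse-grid masks hide disjoint regions of the feature map, so every location is repaired from context by at least one of the two restorations; the grid size and 50% masking ratio are assumptions.

```python
import torch

def crossed_masks(h: int, w: int, grid: int = 8):
    """Return two complementary (1, 1, h, w) binary masks on a coarse grid."""
    coarse = torch.rand(1, 1, h // grid, w // grid) > 0.5
    m1 = coarse.float().repeat_interleave(grid, 2).repeat_interleave(grid, 3)
    return m1, 1.0 - m1

feat = torch.randn(1, 256, 64, 64)    # features from the pre-trained extractor
m1, m2 = crossed_masks(64, 64)
masked_pair = (feat * m1, feat * m2)  # each copy hides what the other keeps
# a transformer-based restorer would repair each masked copy; the anomaly
# score then compares the restorations with the original features.
```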
https://arxiv.org/abs/2404.13273
Recent advances in spatial transcriptomics (ST) make it possible to characterize spatial gene expression within tissue for discovery research. However, current ST platforms suffer from low resolution, hindering an in-depth understanding of spatial gene expression. Super-resolution approaches promise to enhance ST maps by integrating histology images with the gene expression of profiled tissue spots, but current super-resolution methods are limited by restoration uncertainty and mode collapse. Although diffusion models have shown promise in capturing complex interactions between multi-modal conditions, integrating histology images and gene expression for super-resolved ST maps remains a challenge. This paper proposes a cross-modal conditional diffusion model for super-resolving ST maps under the guidance of histology images. Specifically, we design a multi-modal disentangling network with cross-modal adaptive modulation to utilize complementary information from histology images and spatial gene expression. Moreover, we propose a dynamic cross-attention modelling strategy to extract hierarchical cell-to-tissue information from histology images. Lastly, we propose a co-expression-based gene-correlation graph network to model the co-expression relationships of multiple genes. Experiments show that our method outperforms other state-of-the-art methods in ST super-resolution on three public datasets.
https://arxiv.org/abs/2404.12973
The Frozen Section (FS) technique is a rapid and efficient method, taking only 15-30 minutes to prepare slides for pathologists' evaluation during surgery and enabling immediate decisions on further surgical interventions. However, the FS process often introduces artifacts and distortions such as folds and ice-crystal effects. These artifacts and distortions are absent in the higher-quality formalin-fixed paraffin-embedded (FFPE) slides, which require 2-3 days to prepare. While Generative Adversarial Network (GAN)-based methods have been used to translate FS to FFPE images (F2F), they may leave morphological inaccuracies with remaining FS artifacts or introduce new artifacts, reducing the quality of these translations for clinical assessments. In this study, we benchmark recent generative models, focusing on GANs and Latent Diffusion Models (LDMs), to overcome these limitations. We introduce a novel approach that combines LDMs with histopathology pre-trained embeddings to enhance the restoration of FS images. Our framework leverages LDMs conditioned on both text and pre-trained embeddings to learn meaningful features of FS and FFPE histopathology images. Through diffusion and denoising techniques, our approach not only preserves essential diagnostic attributes such as color staining and tissue morphology but also proposes an embedding-translation mechanism to better predict the targeted FFPE representation of input FS images. As a result, this work achieves a significant improvement in classification performance, with the Area Under the Curve rising from 81.99% to 94.64%, accompanied by an advantageous CaseFD. This work establishes a new benchmark for FS-to-FFPE image translation quality, promising enhanced reliability and accuracy in histopathology FS image analysis. Our work is available at this https URL.
https://arxiv.org/abs/2404.12650
Recent advances in image deraining have focused on training powerful models on mixed datasets comprising diverse rain types and backgrounds. However, this approach tends to overlook the inherent differences among rainy images, leading to suboptimal results. To overcome this limitation, we address diverse rainy images by delving into meaningful representations that encapsulate both the rain and background components. Leveraging these representations as instructive guidance, we put forth a Context-based Instance-level Modulation (CoI-M) mechanism adept at efficiently modulating CNN- or Transformer-based models. Furthermore, we devise a rain-/detail-aware contrastive learning strategy to help extract joint rain-/detail-aware representations. By integrating CoI-M with rain-/detail-aware contrastive learning, we develop CoIC, an innovative and potent algorithm tailored for training models on mixed datasets. Moreover, CoIC offers insight into modeling relationships among datasets, quantitatively assessing the impact of rain and details on restoration, and unveiling the distinct behaviors of models given diverse inputs. Extensive experiments validate the efficacy of CoIC in boosting the deraining ability of CNN and Transformer models. CoIC also enhances deraining prowess remarkably when a real-world dataset is included.
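The modulation mechanism can be pictured as a FiLM-style per-channel scale and shift predicted from an instance-level context vector; the sketch below is an assumption about the general shape of CoI-M, not its exact design.

```python
import torch
import torch.nn as nn

class InstanceModulation(nn.Module):
    """Modulate backbone features with a per-image context vector (illustrative)."""
    def __init__(self, ctx_dim: int, channels: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(ctx_dim, 2 * channels)

    def forward(self, feat, ctx):
        # feat: (B, C, H, W) backbone features
        # ctx:  (B, ctx_dim) rain-/detail-aware representation of this image
        gamma, beta = self.to_gamma_beta(ctx).chunk(2, dim=-1)
        return feat * (1 + gamma[..., None, None]) + beta[..., None, None]
```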
https://arxiv.org/abs/2404.12091
Reconstructing degraded images is a critical task in image processing. Although CNN- and Transformer-based models are prevalent in this field, they exhibit inherent limitations, such as inadequate long-range dependency modeling and high computational costs. To overcome these issues, we introduce the Channel-Aware U-Shaped Mamba (CU-Mamba) model, which incorporates a dual State Space Model (SSM) framework into the U-Net architecture. CU-Mamba employs a Spatial SSM module for global context encoding and a Channel SSM component to preserve channel correlation features, both with computational complexity linear in the feature-map size. Extensive experimental results validate CU-Mamba's superiority over existing state-of-the-art methods, underscoring the importance of integrating both spatial and channel contexts in image restoration.
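For intuition, here is a toy diagonal state-space scan showing the linear-time recurrence that Spatial/Channel SSM blocks build on; CU-Mamba's selective scan is substantially richer (input-dependent parameters, parallel hardware-aware kernels), so treat this purely as a sketch.

```python
import torch
import torch.nn as nn

class DiagonalSSM(nn.Module):
    """Toy linear-time SSM: h_t = decay * h_{t-1} + B x_t, y_t = <C, h_t>."""
    def __init__(self, dim: int, state: int = 16):
        super().__init__()
        self.A = nn.Parameter(-torch.rand(dim, state))  # negative -> stable decay
        self.B = nn.Parameter(torch.randn(dim, state) * 0.1)
        self.C = nn.Parameter(torch.randn(dim, state) * 0.1)

    def forward(self, x):
        # x: (B, L, D) -- a flattened spatial sequence (Spatial SSM) or the
        # channel axis treated as the sequence (Channel SSM)
        bsz, seq_len, dim = x.shape
        decay = torch.exp(self.A)                       # elementwise, in (0, 1]
        h = x.new_zeros(bsz, dim, self.A.size(1))
        ys = []
        for t in range(seq_len):                        # O(L) sequential scan
            h = h * decay + x[:, t, :, None] * self.B
            ys.append((h * self.C).sum(-1))             # read out: (B, D)
        return torch.stack(ys, dim=1)                   # (B, L, D)
```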
https://arxiv.org/abs/2404.11778
Existing image restoration approaches typically employ extensive networks specifically trained for designated degradations. Despite being effective, such methods inevitably entail considerable storage costs and computational overheads due to the reliance on task-specific networks. In this work, we go beyond this well-established framework and exploit the inherent commonalities among image restoration tasks. The primary objective is to identify components that are shareable across restoration tasks and augment the shared components with modules specifically trained for individual tasks. Towards this goal, we propose AdaIR, a novel framework that enables low storage cost and efficient training without sacrificing performance. Specifically, a generic restoration network is first constructed through self-supervised pre-training using synthetic degradations. Subsequent to the pre-training phase, adapters are trained to adapt the pre-trained network to specific degradations. AdaIR requires solely the training of lightweight, task-specific modules, ensuring a more efficient storage and training regimen. We have conducted extensive experiments to validate the effectiveness of AdaIR and analyze the influence of the pre-training strategy on discovering shareable components. Extensive experimental results show that AdaIR achieves outstanding results on multi-task restoration while utilizing significantly fewer parameters (1.9 MB) and less training time (7 hours) for each restoration task. The source codes and trained models will be released.
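A minimal sketch of the adapter recipe: freeze the pre-trained generic restoration network and train only small bottleneck adapters per degradation; the bottleneck design, zero initialization, and insertion points are assumptions rather than AdaIR's exact modules.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Lightweight residual bottleneck trained per degradation (illustrative)."""
    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)        # adapter starts as identity
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

def freeze_backbone_train_adapters(model: nn.Module):
    """Keep the pre-trained backbone frozen; update only adapter weights."""
    for name, p in model.named_parameters():
        p.requires_grad = "adapter" in name
```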
https://arxiv.org/abs/2404.11475
Video Frame Interpolation (VFI) is a crucial technique in applications such as slow-motion generation, frame-rate conversion, and video frame restoration. This paper introduces an efficient video frame interpolation framework that aims to strike a favorable balance between efficiency and quality. Our framework follows the general paradigm of a flow estimator plus a refinement module, while incorporating carefully designed components. First, we adopt depth-wise convolutions with large kernels in the flow estimator, which simultaneously reduce the parameter count and enlarge the receptive field for encoding rich context and handling complex motion. Second, diverging from the common UNet-style (encoder-decoder) design for the refinement module, which we find redundant, our decoder-only refinement module directly enhances the result from coarse to fine features, offering a more efficient process. In addition, to address the challenge of handling high-definition frames, we introduce an innovative HD-aware augmentation strategy during training, leading to consistent improvements on HD images. Extensive experiments are conducted on diverse datasets: Vimeo90K, UCF101, Xiph, and SNU-FILM. The results demonstrate that our approach achieves state-of-the-art performance with clear improvements while requiring far fewer FLOPs and parameters, reaching a better operating point in the efficiency-quality trade-off.
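The first component can be illustrated with a small sketch: a large-kernel depthwise convolution (few parameters, wide receptive field) followed by a 1x1 pointwise mix; the kernel size below is an assumption for illustration.

```python
import torch.nn as nn

def dw_large_kernel_block(dim: int, k: int = 31) -> nn.Sequential:
    """Large-kernel depthwise conv + pointwise mixing (illustrative sketch)."""
    return nn.Sequential(
        nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim),  # depthwise, k x k
        nn.Conv2d(dim, dim, 1),                              # pointwise mixing
        nn.GELU(),
    )
```

A k x k depthwise layer costs only k*k*dim weights versus k*k*dim*dim for a dense convolution, which is why large kernels stay affordable here.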
https://arxiv.org/abs/2404.11108
In this paper, we address the Bracket Image Restoration and Enhancement (BracketIRE) task, which requires restoring a high-quality high-dynamic-range (HDR) image from a sequence of noisy, blurred, low-dynamic-range (LDR) multi-exposure RAW inputs. To overcome this challenge, we present IREANet, which improves multi-exposure alignment and aggregation with a Flow-guide Feature Alignment Module (FFAM) and an Enhanced Feature Aggregation Module (EFAM). Specifically, the proposed FFAM incorporates inter-frame optical flow as guidance to facilitate the deformable-alignment and spatial-attention modules for better feature alignment. The EFAM further employs the proposed Enhanced Residual Block (ERB) as a foundational component, wherein a unidirectional recurrent network aggregates the aligned temporal features to better reconstruct the results. To improve model generalization and performance, we additionally employ the Bayer-preserving augmentation (BayerAug) strategy to augment the multi-exposure RAW inputs. Our experimental evaluations demonstrate that the proposed IREANet achieves state-of-the-art performance compared with previous methods.
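A minimal sketch of flow-guided alignment: warp a neighbouring frame's features toward the reference frame with inter-frame optical flow via grid sampling; the deformable-offset and spatial-attention refinements FFAM adds on top are omitted here.

```python
import torch
import torch.nn.functional as F

def flow_warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, H, W); flow: (B, 2, H, W) in pixels, ordered (dx, dy)."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=feat.device, dtype=feat.dtype),
        torch.arange(W, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    x_new = xs[None] + flow[:, 0]
    y_new = ys[None] + flow[:, 1]
    # normalize sample coordinates to [-1, 1] as grid_sample expects
    grid = torch.stack(
        (2.0 * x_new / (W - 1) - 1.0, 2.0 * y_new / (H - 1) - 1.0), dim=-1
    )
    return F.grid_sample(feat, grid, align_corners=True)
```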
https://arxiv.org/abs/2404.10358
In reality, images often exhibit multiple degradations, such as rain and fog at night (triple degradations). However, in many cases individuals may not want to remove all of them; for instance, a blurry lens may reveal a beautiful snowy landscape (double degradations), and people may only desire to deblur. These situations and requirements shed light on a new challenge in image restoration, where a model must perceive and remove the specific degradation types specified by human commands in images with multiple degradations. We term this task Referring Flexible Image Restoration (RFIR). To address it, we first construct a large-scale synthetic dataset called RFIR, comprising 153,423 samples, each consisting of a degraded image, a text prompt for specific degradation removal, and the restored image. RFIR covers five basic degradation types (blur, rain, haze, low light, and snow) and six main sub-categories for varying degrees of degradation removal. To tackle the challenge, we propose a novel transformer-based multi-task model named TransRFIR, which simultaneously perceives the degradation types in the degraded image and removes the specific degradation indicated by the text prompt. TransRFIR is built on two devised attention modules, Multi-Head Agent Self-Attention (MHASA) and Multi-Head Agent Cross Attention (MHACA), which introduce agent tokens to reach linear complexity, achieving lower computational cost than vanilla self-attention and cross-attention while obtaining competitive performance. Our TransRFIR achieves state-of-the-art performance compared with its counterparts and proves to be an effective architecture for image restoration. We release our project at this https URL.
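A simplified sketch of agent attention with linear complexity in the token count: a small set of learned agent tokens first pools the keys and values, then each query reads from the pooled agents, giving O(N*n) cost instead of O(N^2); head splitting and the MHACA cross-attention variant are omitted, and all names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AgentSelfAttention(nn.Module):
    """Single-head agent attention sketch: n agent tokens, n << N tokens."""
    def __init__(self, dim: int, num_agents: int = 49):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.agents = nn.Parameter(torch.randn(num_agents, dim) * 0.02)
        self.scale = dim ** -0.5

    def forward(self, x):
        # x: (B, N, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        a = self.agents[None].expand(x.size(0), -1, -1)           # (B, n, C)
        # agents pool the full sequence: (B, n, N) @ (B, N, C) -> (B, n, C)
        pooled = torch.softmax(a @ k.transpose(1, 2) * self.scale, -1) @ v
        # queries read from the n agents: (B, N, n) @ (B, n, C) -> (B, N, C)
        return torch.softmax(q @ a.transpose(1, 2) * self.scale, -1) @ pooled
```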
https://arxiv.org/abs/2404.10342
Omnidirectional images (ODIs) are commonly used in real-world visual tasks, and high-resolution ODIs help improve the performance of related visual tasks. Most existing super-resolution methods for ODIs use end-to-end learning strategies, resulting in generated images of inferior realness and a lack of effective out-of-domain generalization. Image generation methods, exemplified by diffusion models, provide strong priors for visual tasks and have proven effective when applied to image restoration. Leveraging the image priors of the Stable Diffusion (SD) model, we achieve omnidirectional image super-resolution with both fidelity and realness, dubbed OmniSSR. First, we transform equirectangular projection (ERP) images into tangent projection (TP) images, whose distribution approximates the planar image domain. Then, we use SD to iteratively sample initial high-resolution results. At each denoising iteration, we further correct and update the initial results using the proposed Octadecaplex Tangent Information Interaction (OTII) and Gradient Decomposition (GD) techniques to ensure better consistency. Finally, the TP images are transformed back to obtain the final high-resolution results. Our method is zero-shot, requiring no training or fine-tuning. Experiments on two benchmark datasets demonstrate the effectiveness of our proposed method.
https://arxiv.org/abs/2404.10312
Image restoration, which aims to recover high-quality images from their corrupted counterparts, often faces the challenge of being an ill-posed problem that admits multiple solutions for a single input. However, most deep-learning-based works simply employ an l1 loss to train their networks in a deterministic way, resulting in over-smoothed predictions with inferior perceptual quality. In this work, we propose a novel method that shifts the focus from a deterministic pixel-by-pixel comparison to a statistical perspective, emphasizing the learning of distributions rather than individual pixel values. The core idea is to introduce spatial entropy into the loss function to measure the distribution difference between predictions and targets. To make this spatial entropy differentiable, we employ kernel density estimation (KDE) to approximate the probabilities of specific intensity values at each pixel from its neighboring areas. Specifically, we equip diffusion models with this entropy and aim for superior accuracy and enhanced perceptual quality over the l1-based noise-matching loss. In the experiments, we evaluate the proposed method for low-light enhancement on two datasets and the NTIRE 2024 challenge. All these results illustrate the effectiveness of our statistics-based entropy loss. Code is available at this https URL.
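A minimal sketch of such a differentiable spatial-entropy loss: a Gaussian kernel softly assigns each pixel to intensity bins, local averaging approximates the neighbourhood KDE, and the resulting per-pixel entropies of prediction and target are compared; bin count, bandwidth, window size, and the final comparison are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def spatial_entropy(img, bins: int = 32, bw: float = 0.05, win: int = 7):
    """img: (B, 1, H, W) in [0, 1]. Returns per-pixel entropy, (B, 1, H, W)."""
    centers = torch.linspace(0, 1, bins, device=img.device)
    # soft (differentiable) Gaussian-kernel assignment of pixels to bins
    w = torch.exp(-((img - centers.view(1, bins, 1, 1)) ** 2) / (2 * bw ** 2))
    # averaging kernel weights over the window ~ KDE of the neighbourhood
    p = F.avg_pool2d(w, win, stride=1, padding=win // 2)
    p = p / (p.sum(dim=1, keepdim=True) + 1e-8)       # normalize to a pmf
    return -(p * torch.log(p + 1e-8)).sum(dim=1, keepdim=True)

def entropy_loss(pred, target):
    return (spatial_entropy(pred) - spatial_entropy(target)).abs().mean()
```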
https://arxiv.org/abs/2404.09735
Though diffusion models have been successfully applied to various image restoration (IR) tasks, their performance is sensitive to the choice of training datasets. Typically, diffusion models trained on specific datasets fail to recover images with out-of-distribution degradations. To address this problem, this work leverages a capable vision-language model and a synthetic degradation pipeline to learn image restoration in the wild (wild IR). More specifically, all low-quality images are simulated with a synthetic degradation pipeline containing multiple common degradations such as blur, resize, noise, and JPEG compression. We then introduce robust training for a degradation-aware CLIP model to extract enriched image-content features that assist high-quality image restoration. Our base diffusion model is the image restoration SDE (IR-SDE). Built upon it, we further present a posterior sampling strategy for fast, noise-free image generation. We evaluate our model on both synthetic and real-world degradation datasets. Moreover, experiments on the unified image restoration task illustrate that the proposed posterior sampling improves image generation quality for various degradations.
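A hypothetical sketch of such a degradation pipeline composed of the four named operations (blur, resize, noise, JPEG); the parameter ranges and application probabilities are assumptions, and practical pipelines are typically more elaborate.

```python
import io
import random
import numpy as np
from PIL import Image, ImageFilter

def degrade(img: Image.Image) -> Image.Image:
    """Simulate a low-quality image from a clean one (illustrative ranges)."""
    w, h = img.size
    if random.random() < 0.8:                                 # blur
        img = img.filter(ImageFilter.GaussianBlur(random.uniform(0.5, 3.0)))
    if random.random() < 0.8:                                 # resize down/up
        s = random.uniform(0.25, 0.75)
        img = img.resize((int(w * s), int(h * s))).resize((w, h))
    arr = np.asarray(img).astype(np.float32)                  # additive noise
    arr += np.random.normal(0, random.uniform(1, 15), arr.shape)
    img = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    buf = io.BytesIO()                                        # JPEG compression
    img.save(buf, format="JPEG", quality=random.randint(30, 90))
    return Image.open(io.BytesIO(buf.getvalue()))
```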
https://arxiv.org/abs/2404.09732
Human beings construct their perception of space by integrating sparse observations across massively interconnected synapses and neurons, offering superior parallelism and efficiency. Replicating this capability in AI finds wide applications in medical imaging, AR/VR, and embodied AI, where input data is often sparse and computing resources are limited. However, traditional signal reconstruction methods on digital computers face both software and hardware challenges. On the software front, difficulties arise from storage inefficiencies in conventional explicit signal representations. Hardware obstacles include the von Neumann bottleneck, which limits data transfer between the CPU and memory, and the limitations of CMOS circuits in supporting parallel processing. We propose a systematic approach with software-hardware co-optimization for signal reconstruction from sparse inputs. Software-wise, we employ a neural field to implicitly represent signals via neural networks, further compressed using low-rank decomposition and structured pruning. Hardware-wise, we design a resistive-memory-based computing-in-memory (CIM) platform, featuring a Gaussian Encoder (GE) and an MLP Processing Engine (PE). The GE harnesses the intrinsic stochasticity of resistive memory for efficient input encoding, while the PE achieves precise weight mapping through a Hardware-Aware Quantization (HAQ) circuit. We demonstrate the system's efficacy on a 40nm 256Kb resistive-memory-based in-memory computing macro, achieving large improvements in energy efficiency and parallelism without compromising reconstruction quality in tasks such as 3D CT sparse reconstruction, novel view synthesis, and novel view synthesis for dynamic scenes. This work advances AI-driven signal restoration technology and paves the way for future efficient and robust medical AI and 3D vision applications.
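The low-rank compression on the software side can be sketched as replacing a dense layer of the implicit-field MLP with a rank-r factorization; the rank below is an assumption for illustration.

```python
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Rank-r factorized layer: weights drop from in*out to r*(in+out)."""
    def __init__(self, in_f: int, out_f: int, rank: int = 8):
        super().__init__()
        self.u = nn.Linear(in_f, rank, bias=False)   # W ~ V @ U
        self.v = nn.Linear(rank, out_f)

    def forward(self, x):
        return self.v(self.u(x))

# e.g. a 256x256 layer: 65,536 weights -> 8 * (256 + 256) = 4,096 weights
```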
https://arxiv.org/abs/2404.09613
In medical imaging, accurate image segmentation is crucial for quantifying diseases, assessing prognosis, and evaluating treatment outcomes. However, existing methods lack an in-depth integration of global and local features, failing to pay special attention to abnormal regions and boundary details in medical images. To this end, we present a novel deep learning-based approach, MIPC-Net, for precise boundary segmentation in medical images. Our approach, inspired by radiologists' working patterns, features two distinct modules: (i) \textbf{Mutual Inclusion of Position and Channel Attention (MIPC) module}: To enhance the precision of boundary segmentation in medical images, we introduce the MIPC module, which enhances the focus on channel information when extracting position features and vice versa; (ii) \textbf{GL-MIPC-Residue}: To improve the restoration of medical images, we propose the GL-MIPC-Residue, a global residual connection that enhances the integration of the encoder and decoder by filtering out invalid information and restoring the most effective information lost during the feature extraction process. We evaluate the performance of the proposed model using metrics such as Dice coefficient (DSC) and Hausdorff Distance (HD) on three publicly accessible datasets: Synapse, ISIC2018-Task, and Segpc. Our ablation study shows that each module contributes to improving the quality of segmentation results. Furthermore, with the assistance of both modules, our approach outperforms state-of-the-art methods across all metrics on the benchmark datasets, notably achieving a 2.23mm reduction in HD on the Synapse dataset, strongly evidencing our model's enhanced capability for precise image boundary segmentation. Codes will be available at this https URL.
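A rough sketch of the mutual-inclusion idea: a channel gate conditions the position (spatial) branch while a spatial gate conditions the channel branch, so each attention pathway attends to the other's information; the module internals below are assumptions, not MIPC-Net's exact design.

```python
import torch
import torch.nn as nn

class MutualPositionChannel(nn.Module):
    """Illustrative mutual conditioning of channel and position attention."""
    def __init__(self, c: int):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c, 1), nn.Sigmoid()
        )
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(c, 1, 7, padding=3), nn.Sigmoid()
        )

    def forward(self, x):
        pos = x * self.channel_gate(x)   # channel info informs position branch
        chan = x * self.spatial_gate(x)  # position info informs channel branch
        return pos + chan
```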
https://arxiv.org/abs/2404.08201
Imaging through scattering media is a fundamental and pervasive challenge in fields ranging from medical diagnostics to astronomy. A promising strategy to overcome this challenge is wavefront modulation, which induces measurement diversity during image acquisition. Despite its importance, designing optimal wavefront modulations to image through scattering remains under-explored. This paper introduces a novel learning-based framework to address the gap. Our approach jointly optimizes wavefront modulations and a computationally lightweight feedforward "proxy" reconstruction network. This network is trained to recover scenes obscured by scattering, using measurements that are modified by these modulations. The learned modulations produced by our framework generalize effectively to unseen scattering scenarios and exhibit remarkable versatility. During deployment, the learned modulations can be decoupled from the proxy network to augment other more computationally expensive restoration algorithms. Through extensive experiments, we demonstrate our approach significantly advances the state of the art in imaging through scattering media. Our project webpage is at this https URL.
https://arxiv.org/abs/2404.07985