Visible watermarks pose significant challenges for image restoration techniques, especially when the target background is unknown. Toward this end, we present MorphoMod, a novel method for automated visible watermark removal that operates in a blind setting -- without requiring target images. Unlike existing methods, MorphoMod effectively removes opaque and transparent watermarks while preserving semantic content, making it well-suited for real-world applications. Evaluations on benchmark datasets, including the Colored Large-scale Watermark Dataset (CLWD), LOGO-series, and the newly introduced Alpha1 datasets, demonstrate that MorphoMod achieves up to a 50.8% improvement in watermark removal effectiveness compared to state-of-the-art methods. Ablation studies highlight the impact of prompts used for inpainting, pre-removal filling strategies, and inpainting model performance on watermark removal. Additionally, a case study on steganographic disorientation reveals broader applications for watermark removal in disrupting high-level hidden messages. MorphoMod offers a robust, adaptable solution for watermark removal and opens avenues for further advancements in image restoration and adversarial manipulation.
https://arxiv.org/abs/2502.02676
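Since the abstract stays high-level, here is a minimal sketch of the blind mask-then-inpaint idea it describes: morphologically grow an estimated watermark mask, pre-fill the masked region, then inpaint. It assumes a binary watermark mask is already available, uses classical OpenCV inpainting as a stand-in for the paper's prompt-guided diffusion inpainting, and the file names, fill value, and kernel size are placeholders.

```python
import cv2
import numpy as np

# Load the watermarked image and an estimated binary watermark mask
# (both file names are placeholders).
image = cv2.imread("watermarked.png")
mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)

# Morphologically dilate the mask so it fully covers watermark edges
# and anti-aliased fringes before inpainting.
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
mask_bin = (mask > 127).astype(np.uint8) * 255
mask_dilated = cv2.dilate(mask_bin, kernel, iterations=2)

# Pre-removal fill: overwrite the masked pixels with a neutral value so the
# inpainting step is not biased by residual watermark colors.
prefilled = image.copy()
prefilled[mask_dilated > 0] = 127

# Classical inpainting as a stand-in for a prompt-guided diffusion inpainter.
restored = cv2.inpaint(prefilled, (mask_dilated > 0).astype(np.uint8), 5, cv2.INPAINT_TELEA)
cv2.imwrite("restored.png", restored)
```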
Diffusion priors have been used for blind face restoration (BFR) by fine-tuning diffusion models (DMs) on restoration datasets to recover low-quality images. However, the naive application of DMs presents several key limitations: (i) the diffusion prior has inferior semantic consistency (e.g., in ID, structure, and color), increasing the difficulty of optimizing the BFR model; (ii) reliance on hundreds of denoising iterations prevents effective cooperation with perceptual losses, which is crucial for faithful restoration. Observing that the latent consistency model (LCM) learns consistency noise-to-data mappings on the ODE trajectory and therefore shows more semantic consistency in subject identity, structural information, and color preservation, we propose InterLCM, which leverages the LCM's superior semantic consistency and efficiency to counter the above issues. Treating low-quality images as an intermediate state of the LCM, InterLCM achieves a balance between fidelity and quality by starting from earlier LCM steps. The LCM also allows the integration of perceptual loss during training, leading to improved restoration quality, particularly in real-world scenarios. To mitigate structural and semantic uncertainties, InterLCM incorporates a Visual Module to extract visual features and a Spatial Encoder to capture spatial details, enhancing the fidelity of restored images. Extensive experiments demonstrate that InterLCM outperforms existing approaches on both synthetic and real-world datasets while also achieving faster inference speed.
https://arxiv.org/abs/2502.02215
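A purely illustrative toy of the core idea above: treat a low-quality latent as an intermediate state and start consistency denoising from an earlier step instead of from pure noise. The network, timestep schedule, and noising rule below are placeholder assumptions (ToyConsistencyModel and restore_from_intermediate are hypothetical names); the paper's Visual Module, Spatial Encoder, and perceptual losses are omitted.

```python
import torch
from torch import nn

class ToyConsistencyModel(nn.Module):
    """Stand-in for a latent consistency model: maps (noisy latent, t) -> clean latent."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, z_t, t):
        t_embed = t.expand(z_t.shape[0], 1)
        return self.net(torch.cat([z_t, t_embed], dim=-1))

def restore_from_intermediate(model, z_lq, start_t=0.4, steps=4):
    """Treat the low-quality latent as an intermediate state: re-noise it to an
    early timestep, then take a few consistency steps toward the data manifold."""
    z = z_lq
    ts = torch.linspace(start_t, 0.0, steps + 1)[:-1]
    for t in ts:
        # Mix in noise for the current timestep, then jump back to a clean
        # estimate with one consistency evaluation.
        noise = torch.randn_like(z)
        z_t = (1 - t) * z + t * noise
        z = model(z_t, t.view(1, 1))
    return z

model = ToyConsistencyModel()
z_lq = torch.randn(1, 64)          # placeholder "low-quality" latent
z_restored = restore_from_intermediate(model, z_lq)
print(z_restored.shape)
```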
Generative diffusion models trained on large-scale datasets have achieved remarkable progress in image synthesis. Owing to their ability to supplement missing details and generate aesthetically pleasing content, recent works have applied them to image deblurring by training an adapter on blurry-sharp image pairs to provide structural conditions for restoration. However, acquiring substantial amounts of realistic paired data is challenging and costly in real-world scenarios. On the other hand, relying solely on synthetic data often results in overfitting, leading to unsatisfactory performance when confronted with unseen blur patterns. To tackle this issue, we propose BD-Diff, a generative-diffusion-based model designed to enhance deblurring performance on unknown domains by decoupling structural features and blur patterns through joint training on three specially designed tasks. We employ two Q-Formers as separate extractors of structural representations and blur patterns. The features they extract are used simultaneously for a supervised deblurring task on synthetic data and an unsupervised blur-transfer task that leverages unpaired blurred images from the target domain. Furthermore, we introduce a reconstruction task to make the structural features and blur patterns complementary. This blur-decoupled learning process enhances the generalization of BD-Diff to unknown domain blur patterns. Experiments on real-world datasets demonstrate that BD-Diff outperforms existing state-of-the-art methods in blur removal and structural preservation across various challenging scenarios. The code will be released at this https URL
https://arxiv.org/abs/2502.01522
Human body restoration, as a specific application of image restoration, is widely applied in practice and plays a vital role across diverse fields. However, thorough research remains difficult, particularly due to the lack of benchmark datasets. In this study, we propose an automated high-quality dataset cropping and filtering (HQ-ACF) pipeline. This pipeline leverages existing object detection datasets and other unlabeled images to automatically crop and filter high-quality human images. Using this pipeline, we constructed PERSONA, a person-based restoration dataset with sophisticated objects and natural activities, which includes training, validation, and test sets. The dataset significantly surpasses other human-related datasets in both quality and content richness. Finally, we propose OSDHuman, a novel one-step diffusion model for human body restoration. Specifically, we propose a high-fidelity image embedder (HFIE) as the prompt generator to better guide the model with low-quality human image information, effectively avoiding misleading prompts. Experimental results show that OSDHuman outperforms existing methods in both visual quality and quantitative metrics. The dataset and code will be available at this https URL.
https://arxiv.org/abs/2502.01411
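A rough sketch of the crop-and-filter idea behind a pipeline like HQ-ACF: run an off-the-shelf detector over unlabeled images and keep only large, high-confidence person crops. The detector choice, class id, score threshold, minimum crop size, and the input path are assumptions for illustration; the paper's additional quality filtering is not shown.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

# Off-the-shelf detector as a stand-in for the detection models/datasets the
# pipeline builds on; "photo.jpg" is a placeholder path.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

img = convert_image_dtype(read_image("photo.jpg"), torch.float)
with torch.no_grad():
    out = model([img])[0]

PERSON_CLASS = 1          # COCO label id for "person"
MIN_SIDE = 256            # reject crops too small to be useful for restoration

crops = []
for box, label, score in zip(out["boxes"], out["labels"], out["scores"]):
    if label.item() != PERSON_CLASS or score.item() < 0.9:
        continue
    x1, y1, x2, y2 = [int(v) for v in box.tolist()]
    if min(x2 - x1, y2 - y1) < MIN_SIDE:
        continue
    crops.append(img[:, y1:y2, x1:x2])
print(f"kept {len(crops)} high-confidence person crops")
```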
Learning diffusion bridge models is easy; making them fast and practical is an art. Diffusion bridge models (DBMs) are a promising extension of diffusion models for applications in image-to-image translation. However, like many modern diffusion and flow models, DBMs suffer from slow inference. To address this, we propose a novel distillation technique based on the inverse bridge matching formulation and derive a tractable objective to solve it in practice. Unlike previously developed DBM distillation techniques, the proposed method can distill both conditional and unconditional DBMs, distill models into a one-step generator, and use only corrupted images for training. We evaluate our approach for both conditional and unconditional bridge matching on a wide set of setups, including super-resolution, JPEG restoration, sketch-to-image, and other tasks, and show that our distillation technique accelerates DBM inference by 4x to 100x and, depending on the setup, even provides better generation quality than the teacher model.
https://arxiv.org/abs/2502.01362
We present a novel generative approach based on Denoising Diffusion Models (DDMs), which produces high-quality image samples along with their losslessly compressed bit-stream representations. This is obtained by replacing the standard Gaussian noise sampling in the reverse diffusion with a selection of noise samples from pre-defined codebooks of fixed i.i.d. Gaussian vectors. Surprisingly, we find that our method, termed Denoising Diffusion Codebook Model (DDCM), retains the sample quality and diversity of standard DDMs, even for extremely small codebooks. We leverage DDCM and pick the noises from the codebooks that best match a given image, converting our generative model into a highly effective lossy image codec that achieves state-of-the-art perceptual image compression results. More generally, by setting other noise selection rules, we extend our compression method to any conditional image generation task (e.g., image restoration), where the generated images are produced jointly with their condensed bit-stream representations. Our work is accompanied by a mathematical interpretation of the proposed compressed conditional generation schemes, establishing a connection with score-based approximations of posterior samplers for the tasks considered.
https://arxiv.org/abs/2502.01189
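A dimensionally tiny toy of the codebook-selection idea: reverse steps draw noise from a fixed, pre-shared Gaussian codebook instead of fresh samples, and the chosen indices form the bit-stream. The denoiser, step size, and selection rule below are placeholders, not the actual DDCM sampler.

```python
import torch

torch.manual_seed(0)
D, K, T = 16, 64, 10                    # latent dim, codebook size, reverse steps

# Fixed i.i.d. Gaussian codebooks, one per step; with a shared seed on both
# ends, only the chosen indices need to be transmitted.
codebooks = torch.randn(T, K, D)

def toy_denoiser(x_t, t):
    # Stand-in for a trained noise predictor; here it just shrinks the state.
    return 0.9 * x_t

def ddcm_like_compress(target, codebooks):
    """Pick, at every reverse step, the codebook noise that moves the sample
    toward the target; the index sequence is the compressed bit-stream."""
    x = torch.zeros(D)
    indices = []
    for t in range(T):
        x_pred = toy_denoiser(x, t)
        residual = target - x_pred
        # Selection rule: codebook vector best aligned with the residual.
        scores = codebooks[t] @ residual
        k = int(torch.argmax(scores))
        indices.append(k)
        x = x_pred + 0.3 * codebooks[t, k]
    return x, indices

target = torch.randn(D)
x_hat, bitstream = ddcm_like_compress(target, codebooks)
print(bitstream, float(torch.norm(x_hat - target)))
```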
Forests are the most significant land-based carbon storage mechanism. The forest carbon sink can effectively decrease the atmospheric CO2 concentration and mitigate climate change. Remote sensing estimation not only ensures high data accuracy but also enables observation over large areas. Optical images make long-term monitoring possible, which remains an open issue for future carbon storage estimation research. We chose Huize County, Qujing City, Yunnan Province, China as the study area, used GF-1 WFV satellite imagery as the data, introduced the KD-VGG module to extract initial features, and proposed an improved implicit diffusion model (IIDM). The results showed that: (1) the knowledge-distilled VGG-19 module performs the initial feature extraction, reducing inference time and improving accuracy while using fewer model parameters; (2) an Attention + MLP module was added for feature fusion to capture the relationship between global and local features and to restore high-fidelity images over a continuous range of scales; (3) the proposed IIDM model achieved the highest estimation accuracy, with an RMSE of 28.68, which is 13.16 lower than that of the regression model, an improvement of about 31.45%. In carbon storage estimation, the generative model can extract deeper features, and its performance was significantly better than that of other models. This demonstrates the feasibility of artificial-intelligence-generated content (AIGC) in quantitative remote sensing and provides valuable insights for the study of carbon neutrality. Combined with the actual characteristics of the forest, the regional carbon storage estimate at 16-meter resolution provides an important theoretical basis for formulating forest carbon sink regulation.
https://arxiv.org/abs/2502.00783
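A quick arithmetic check of how the reported numbers fit together, assuming the 31.45% figure is the RMSE reduction relative to the implied regression baseline:

```python
iidm_rmse = 28.68
delta = 13.16
regression_rmse = iidm_rmse + delta        # implied baseline RMSE = 41.84
improvement = delta / regression_rmse      # about 0.3145, i.e. roughly 31.45%
print(regression_rmse, round(100 * improvement, 2))
```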
We propose "Shape from Semantics", which is able to create 3D models whose geometry and appearance match given semantics when observed from different views. Traditional "Shape from X" tasks usually use visual input (e.g., RGB images or depth maps) to reconstruct geometry, imposing strict constraints that limit creative explorations. As applications, works like Shadow Art and Wire Art often struggle to grasp the embedded semantics of their design through direct observation and rely heavily on specific setups for proper display. To address these limitations, our framework uses semantics as input, greatly expanding the design space to create objects that integrate multiple semantic elements and are easily discernible by observers. Considering that this task requires a rich imagination, we adopt various generative models and structure-to-detail pipelines. Specifically, we adopt multi-semantics Score Distillation Sampling (SDS) to distill 3D geometry and appearance from 2D diffusion models, ensuring that the initial shape is consistent with the semantic input. We then use image restoration and video generation models to add more details as supervision. Finally, we introduce neural signed distance field (SDF) representation to achieve detailed shape reconstruction. Our framework generates meshes with complex details, well-structured geometry, coherent textures, and smooth transitions, resulting in visually appealing and eye-catching designs. Project page: this https URL
https://arxiv.org/abs/2502.00360
Colorization is a traditional computer vision task that plays an important role in many time-consuming applications, such as old film restoration. Existing methods suffer from unsaturated colors and temporal inconsistency. In this paper, we propose a novel pipeline to overcome these challenges. We regard colorization as a generative task and introduce Stable Video Diffusion (SVD) as our base model. We design a palette-based color guider to assist the model in generating vivid and consistent colors. The color context introduced by the palette not only provides guidance for color generation but also enhances the stability of the generated colors through a unified color context across multiple sequences. Experiments demonstrate that the proposed method provides vivid and stable colors for videos, surpassing previous methods.
https://arxiv.org/abs/2501.19331
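A crude sketch of what a palette as a shared color context can look like: cluster the colors of one reference frame into a small palette and apply it to every frame. The k-means palette and nearest-color snapping are stand-ins for the learned palette-based color guider; image sizes and the cluster count are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_palette(frame_rgb, n_colors=8):
    """Cluster pixel colors of a reference frame into a small palette."""
    pixels = frame_rgb.reshape(-1, 3).astype(np.float32)
    km = KMeans(n_clusters=n_colors, n_init=10, random_state=0).fit(pixels)
    return km.cluster_centers_                      # (n_colors, 3)

def snap_to_palette(frame_rgb, palette):
    """Replace each pixel with its nearest palette color, a crude stand-in for
    conditioning every frame on the same palette-derived color context."""
    pixels = frame_rgb.reshape(-1, 3).astype(np.float32)
    dists = np.linalg.norm(pixels[:, None, :] - palette[None, :, :], axis=-1)
    snapped = palette[dists.argmin(axis=1)]
    return snapped.reshape(frame_rgb.shape).astype(np.uint8)

# Toy frames standing in for a video colorized frame by frame.
reference = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)
frame = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)
palette = extract_palette(reference)
print(snap_to_palette(frame, palette).shape)
```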
Under Display Camera (UDC) is an advanced imaging system that places a digital camera lens underneath a display panel, effectively concealing the camera. However, the display panel significantly degrades captured images or videos, introducing low transmittance, blur, noise, and flare. Tackling these issues is challenging because of the complex degradation of UDCs, including diverse flare patterns. Despite extensive research on UDC images and their restoration models, studies on videos remain largely unexplored. While two UDC video datasets exist, they primarily focus on unrealistic or synthetic UDC degradation rather than real-world UDC degradation. In this paper, we propose a real-world UDC video dataset called UDC-VIT. Unlike existing datasets, UDC-VIT is the only one that includes human motions targeting facial recognition. We propose a video-capturing system to simultaneously acquire non-degraded and UDC-degraded videos of the same scene, and we then align each pair of captured videos frame by frame using the discrete Fourier transform (DFT). We compare UDC-VIT with six representative UDC still-image datasets and two existing UDC video datasets. Using six deep-learning models, we compare UDC-VIT with an existing synthetic UDC video dataset. The results indicate the ineffectiveness of models trained on earlier synthetic UDC video datasets, as they do not reflect the actual characteristics of UDC-degraded videos. We also demonstrate the importance of effective UDC restoration by evaluating face recognition accuracy in relation to PSNR, SSIM, and LPIPS scores. UDC-VIT enables further exploration of UDC video restoration and offers better insight into the challenge. UDC-VIT is available at our project site.
https://arxiv.org/abs/2501.18545
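Phase correlation is one standard DFT-based way to align a pair of frames by a translation, sketched below; the paper's exact alignment procedure may differ and could include sub-pixel refinement.

```python
import numpy as np

def phase_correlation_shift(ref, moving):
    """Estimate the integer shift (dy, dx) such that np.roll(ref, (dy, dx),
    axis=(0, 1)) best matches `moving`, via the normalized cross-power
    spectrum in the Fourier domain."""
    F1 = np.fft.fft2(ref)
    F2 = np.fft.fft2(moving)
    cross_power = F2 * np.conj(F1)
    cross_power /= np.abs(cross_power) + 1e-8
    corr = np.fft.ifft2(cross_power).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Wrap shifts larger than half the frame size to negative offsets.
    if dy > ref.shape[0] // 2:
        dy -= ref.shape[0]
    if dx > ref.shape[1] // 2:
        dx -= ref.shape[1]
    return dy, dx

ref = np.zeros((64, 64))
ref[20:30, 20:30] = 1.0
moving = np.roll(ref, shift=(3, -5), axis=(0, 1))
print(phase_correlation_shift(ref, moving))   # expect roughly (3, -5)
```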
An Under-Display Camera (UDC) houses a digital camera lens under a display panel. However, UDC introduces complex degradations such as noise, blur, decreased transmittance, and flare. Despite remarkable progress, previous research on UDC mainly focuses on eliminating diffraction in the spatial domain and rarely explores its potential in the frequency domain. It is essential to consider both the spatial and frequency domains effectively. For example, degradations such as noise and blur can be addressed with local information (e.g., CNN kernels in the spatial domain), while tackling flares may require leveraging global information (e.g., the frequency domain). In this paper, we revisit UDC degradations in the Fourier space and identify intrinsic frequency priors that indicate the presence of flares. Based on this observation, we propose a novel multi-level DNN architecture called SFIM. It efficiently restores UDC-distorted images by integrating local and global (the collective contribution of all points in the image) information. The architecture exploits CNNs to capture local information and FFT-based models to capture global information. SFIM comprises a Spatial Domain Block (SDB), a Frequency Domain Block (FDB), and an Attention-based Multi-level Integration Block (AMIB). Specifically, the SDB focuses more on detailed textures such as noise and blur, the FDB emphasizes irregular texture loss over extensive areas such as flare, and the AMIB enables effective cross-domain interaction. SFIM's superior performance over state-of-the-art approaches is demonstrated through rigorous quantitative and qualitative assessments across three UDC benchmarks.
https://arxiv.org/abs/2501.18517
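A toy dual-domain block illustrating why pairing a convolutional branch with an FFT branch helps: the conv path stays local, while a learnable filter applied in the Fourier domain mixes information from the whole image. This is not the SFIM architecture (the SDB, FDB, and AMIB are more elaborate); the module structure and parameter shapes are assumptions.

```python
import torch
from torch import nn

class DualDomainBlock(nn.Module):
    """Toy spatial + frequency block: a conv branch for local detail and an
    FFT branch whose learnable complex-valued filter acts globally (every
    output pixel depends on every input pixel)."""
    def __init__(self, channels, size):
        super().__init__()
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # One learnable filter per channel over the half-spectrum of rfft2.
        self.freq_filter = nn.Parameter(torch.ones(channels, size, size // 2 + 1, 2))
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        local = self.spatial(x)
        spec = torch.fft.rfft2(x, norm="ortho")
        weight = torch.view_as_complex(self.freq_filter)
        global_branch = torch.fft.irfft2(spec * weight, s=x.shape[-2:], norm="ortho")
        return x + self.fuse(torch.cat([local, global_branch], dim=1))

block = DualDomainBlock(channels=8, size=32)
print(block(torch.randn(2, 8, 32, 32)).shape)   # torch.Size([2, 8, 32, 32])
```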
Mamba has demonstrated exceptional performance in visual tasks due to its powerful global modeling capabilities and linear computational complexity, offering considerable potential for hyperspectral image super-resolution (HSISR). However, in HSISR, Mamba faces challenges because transforming images into 1D sequences neglects the spatial-spectral structural relationships between locally adjacent pixels, and its performance is highly sensitive to the input order, which affects the restoration of both spatial and spectral details. In this paper, we propose HSRMamba, a contextual spatial-spectral modeling state space model for HSISR, to address these issues both locally and globally. Specifically, a local spatial-spectral partitioning mechanism is designed to establish patch-wise causal relationships among adjacent pixels in 3D features, mitigating the local forgetting issue. Furthermore, a global spectral reordering strategy based on spectral similarity is employed to enhance the causal representation of similar pixels across both spatial and spectral dimensions. Experimental results demonstrate that HSRMamba outperforms state-of-the-art methods in both quantitative quality and visual results. Code will be available soon.
https://arxiv.org/abs/2501.18500
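One plausible, much-simplified instantiation of similarity-based reordering: greedily order hyperspectral bands so that consecutive bands in the new sequence are maximally correlated, making similar content adjacent in the 1D scan. The greedy rule and correlation measure here are assumptions; the paper's strategy reorders across both spatial and spectral dimensions.

```python
import numpy as np

def spectral_similarity_order(cube):
    """Greedy band reordering: start from band 0 and repeatedly append the most
    correlated unvisited band, so neighbors in the sequence are similar."""
    bands = cube.reshape(cube.shape[0], -1)            # (B, H*W)
    bands = (bands - bands.mean(1, keepdims=True)) / (bands.std(1, keepdims=True) + 1e-8)
    sim = bands @ bands.T / bands.shape[1]             # band-to-band correlation
    order, visited = [0], {0}
    while len(order) < cube.shape[0]:
        last = order[-1]
        candidates = [(sim[last, j], j) for j in range(cube.shape[0]) if j not in visited]
        _, nxt = max(candidates)
        order.append(nxt)
        visited.add(nxt)
    return order

cube = np.random.rand(31, 16, 16)                      # toy HSI cube: (bands, H, W)
print(spectral_similarity_order(cube)[:10])
```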
In recent years, Transformer-based models have made significant progress in image restoration by leveraging their inherent ability to capture complex contextual features. More recently, Mamba models have attracted considerable attention in computer vision due to their ability to handle long-range dependencies and their significant computational efficiency compared to Transformers. However, Mamba currently lags behind Transformers in contextual learning capability. To overcome the limitations of these two models, we propose a Mamba-Transformer hybrid image restoration model called MatIR. Specifically, MatIR cross-cycles blocks of the Transformer layer and the Mamba layer to extract features, thereby taking full advantage of both architectures. In the Mamba module, we introduce the Image Inpainting State Space (IRSS) module, which traverses along four scan paths to achieve efficient processing of long-sequence data. In the Transformer module, we combine triangular-window-based local attention with channel-based global attention to effectively activate the attention mechanism over a wider range of image pixels. Extensive experimental results and ablation studies demonstrate the effectiveness of our approach.
https://arxiv.org/abs/2501.18401
Image restoration aims to recover details and enhance contrast in degraded images. With the growing demand for high-quality imaging (e.g., 4K and 8K), achieving a balance between restoration quality and computational efficiency has become increasingly critical. Existing methods, primarily based on CNNs, Transformers, or their hybrid approaches, apply uniform deep representation extraction across the image. However, these methods often struggle to effectively model long-range dependencies and largely overlook the spatial characteristics of image degradation (regions with richer textures tend to suffer more severe damage), making it hard to achieve the best trade-off between restoration quality and efficiency. To address these issues, we propose a novel texture-aware image restoration method, TAMambaIR, which simultaneously perceives image textures and achieves a trade-off between performance and efficiency. Specifically, we introduce a novel Texture-Aware State Space Model, which enhances texture awareness and improves efficiency by modulating the transition matrix of the state-space equation and focusing on regions with complex textures. Additionally, we design a Multi-Directional Perception Block to improve multi-directional receptive fields while maintaining low computational overhead. Extensive experiments on benchmarks for image super-resolution, deraining, and low-light image enhancement demonstrate that TAMambaIR achieves state-of-the-art performance with significantly improved efficiency, establishing it as a robust and efficient framework for image restoration.
https://arxiv.org/abs/2501.16583
Physical and optical factors interacting with sensor characteristics create complex image degradation patterns. Despite advances in deep learning-based super-resolution, existing methods overlook the causal nature of degradation by adopting simplistic black-box mappings. This paper formulates super-resolution using structural causal models to reason about image degradation processes. We establish a mathematical foundation that unifies principles from causal inference, deriving necessary conditions for identifying latent degradation mechanisms and corresponding propagation. We propose a novel counterfactual learning strategy that leverages semantic guidance to reason about hypothetical degradation scenarios, leading to theoretically-grounded representations that capture invariant features across different degradation conditions. The framework incorporates an adaptive intervention mechanism with provable bounds on treatment effects, allowing precise manipulation of degradation factors while maintaining semantic consistency. Through extensive empirical validation, we demonstrate that our approach achieves significant improvements over state-of-the-art methods, particularly in challenging scenarios with compound degradations. On standard benchmarks, our method consistently outperforms existing approaches by significant margins (0.86-1.21dB PSNR), while providing interpretable insights into the restoration process. The theoretical framework and empirical results demonstrate the fundamental importance of causal reasoning in understanding image restoration systems.
https://arxiv.org/abs/2501.15852
This paper proposes Degradation Classification Pre-Training (DCPT), which enables models to learn to classify the degradation type of input images, as a pre-training scheme for universal image restoration. Unlike existing self-supervised pre-training methods, DCPT uses the degradation type of the input image as an extremely weak supervision signal, which can be effortlessly obtained and is intrinsic to all image restoration datasets. DCPT comprises two primary stages. First, image features are extracted from the encoder. Then, a lightweight decoder, such as ResNet18, classifies the degradation type of the input image solely from the features extracted in the first stage, without access to the input image itself. The encoder pre-trained with this straightforward yet potent DCPT is then used for universal image restoration and achieves outstanding performance. Following DCPT, both convolutional neural networks (CNNs) and transformers demonstrate performance improvements, with gains of up to 2.55 dB on the 10D all-in-one restoration task and 6.53 dB in mixed degradation scenarios. Moreover, previous self-supervised pre-training methods, such as masked image modeling, discard the decoder after pre-training, while our DCPT uses the pre-trained parameters more effectively. This advantage arises from the degradation classifier acquired during DCPT, which facilitates transfer learning between models of identical architecture trained on diverse degradation types. Source code and models are available at this https URL.
https://arxiv.org/abs/2501.15510
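A minimal sketch of the pre-training step described above: a lightweight head classifies the degradation type from encoder features alone, and the cross-entropy gradient pre-trains the encoder. The tiny encoder, head, and hyperparameters are placeholders; the real setup uses a restoration backbone and a ResNet18-style decoder.

```python
import torch
from torch import nn

NUM_DEGRADATIONS = 10   # e.g. noise, blur, rain, haze, ...; labels come with the dataset

class TinyEncoder(nn.Module):
    """Stand-in for the restoration backbone to be pre-trained."""
    def __init__(self, dim=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.GELU(),
        )
    def forward(self, x):
        return self.body(x)

# Lightweight classification head playing the role of the decoder: it sees
# only encoder features, never the degraded input itself.
encoder = TinyEncoder()
classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, NUM_DEGRADATIONS))
opt = torch.optim.AdamW(list(encoder.parameters()) + list(classifier.parameters()), lr=2e-4)

def dcpt_style_step(degraded, degradation_label):
    feats = encoder(degraded)
    logits = classifier(feats)
    loss = nn.functional.cross_entropy(logits, degradation_label)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# One toy iteration with random data in place of a real degraded image batch.
x = torch.randn(8, 3, 64, 64)
y = torch.randint(0, NUM_DEGRADATIONS, (8,))
print(dcpt_style_step(x, y))
```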
Blind face restoration (BFR) is a highly challenging problem due to the uncertainty of data degradation patterns. Current BFR methods produce reasonable restorations but suffer from inherent neural degradations that limit real-world generalization in complicated scenarios. In this paper, we propose InfoBFR, a plug-and-play framework that tackles neural degradations such as prior bias, topological distortion, textural distortion, and artifact residues, achieving highly generalizable face restoration in diverse wild and heterogeneous scenes. Specifically, starting from the results of pre-trained BFR models, InfoBFR performs information optimization through information compression with a manifold information bottleneck (MIB) and information compensation with an efficient diffusion LoRA. InfoBFR effectively synthesizes high-fidelity faces without attribute or identity distortions. Comprehensive experimental results demonstrate the superiority of InfoBFR over state-of-the-art GAN-based and diffusion-based BFR methods, with around 70 ms of overhead, 16M trainable parameters, and a BFR boost of nearly 85%. InfoBFR promises to be the first plug-and-play restorer that can be universally employed by diverse BFR models to overcome neural degradations.
https://arxiv.org/abs/2501.15443
Knowledge Graph (KG)-augmented Large Language Models (LLMs) have recently propelled significant advances in complex reasoning tasks, thanks to their broad domain knowledge and contextual awareness. Unfortunately, current methods often assume KGs to be complete, which is impractical given the inherent limitations of KG construction and the potential loss of contextual cues when converting unstructured text into entity-relation triples. In response, this paper proposes the Triple Context Restoration and Query-driven Feedback (TCR-QF) framework, which reconstructs the textual context underlying each triple to mitigate information loss, while dynamically refining the KG structure by iteratively incorporating query-relevant missing knowledge. Experiments on five benchmark question-answering datasets substantiate the effectiveness of TCR-QF in KG and LLM integration, where it achieves a 29.1% improvement in Exact Match and a 15.5% improvement in F1 over its state-of-the-art GraphRAG competitors.
https://arxiv.org/abs/2501.15378
Plug-and-play approaches to solving inverse problems such as restoration and super-resolution have recently benefited from diffusion-based generative priors for natural as well as medical images. However, solutions often use the standard, albeit computationally intensive, route of training and inferring with the whole image on the diffusion prior. While patch-based approaches to evaluating diffusion priors in plug-and-play methods have received some interest, they remain an open area of study. In this work, we explore the feasibility of using patches for training and inference of a diffusion prior on MRI images. We examine the minor adaptations necessary for artifact avoidance, the performance and memory efficiency of patch-based methods, and the adaptability of whole-image training to patch-based evaluation, evaluating across multiple plug-and-play methods, tasks, and datasets.
https://arxiv.org/abs/2501.15309
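A small sketch of patch-wise evaluation of a prior: split the image into overlapping patches, apply a per-patch operator, and blend the overlaps by averaging, which is one simple way to suppress seam artifacts. The identity "prior", patch size, and stride are placeholders for a patch-trained diffusion prior step.

```python
import torch
import torch.nn.functional as F

def apply_prior_patchwise(image, patch_prior, patch=32, stride=16):
    """Run a per-patch prior over overlapping patches and blend the overlaps
    by averaging (fold/unfold handle the bookkeeping)."""
    b, c, h, w = image.shape
    patches = F.unfold(image, kernel_size=patch, stride=stride)      # (B, C*p*p, L)
    L = patches.shape[-1]
    patches = patches.transpose(1, 2).reshape(b * L, c, patch, patch)
    processed = patch_prior(patches)
    processed = processed.reshape(b, L, c * patch * patch).transpose(1, 2)
    ones = torch.ones_like(processed)
    out = F.fold(processed, (h, w), kernel_size=patch, stride=stride)
    norm = F.fold(ones, (h, w), kernel_size=patch, stride=stride)
    return out / norm

# Identity "prior" standing in for a patch-trained diffusion denoising step.
prior = lambda p: p
img = torch.randn(1, 1, 128, 128)        # toy single-channel image, MRI-like
print(apply_prior_patchwise(img, prior).shape)
```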
We aim to redefine robust ego-motion estimation and photorealistic 3D reconstruction by addressing a critical limitation: the reliance on noise-free data in existing models. While such sanitized conditions simplify evaluation, they fail to capture the unpredictable, noisy complexities of real-world environments. Dynamic motion, sensor imperfections, and synchronization perturbations lead to sharp performance declines when these models are deployed in practice, revealing an urgent need for frameworks that embrace and excel under real-world noise. To bridge this gap, we tackle three core challenges: scalable data generation, comprehensive benchmarking, and model robustness enhancement. First, we introduce a scalable noisy data synthesis pipeline that generates diverse datasets simulating complex motion, sensor imperfections, and synchronization errors. Second, we leverage this pipeline to create Robust-Ego3D, a benchmark rigorously designed to expose noise-induced performance degradation, highlighting the limitations of current learning-based methods in ego-motion accuracy and 3D reconstruction quality. Third, we propose Correspondence-guided Gaussian Splatting (CorrGS), a novel test-time adaptation method that progressively refines an internal clean 3D representation by aligning noisy observations with RGB-D frames rendered from the clean 3D map, enhancing geometric alignment and appearance restoration through visual correspondence. Extensive experiments on synthetic and real-world data demonstrate that CorrGS consistently outperforms prior state-of-the-art methods, particularly in scenarios involving rapid motion and dynamic illumination.
https://arxiv.org/abs/2501.14319