This paper reports on the NTIRE 2024 Quality Assessment of AI-Generated Content Challenge, held in conjunction with the New Trends in Image Restoration and Enhancement (NTIRE) workshop at CVPR 2024. The challenge addresses a major problem in image and video processing: Image Quality Assessment (IQA) and Video Quality Assessment (VQA) for AI-Generated Content (AIGC). It is divided into an image track and a video track. The image track uses the AIGIQA-20K dataset, which contains 20,000 AI-Generated Images (AIGIs) produced by 15 popular generative models, and has 318 registered participants; 1,646 submissions were received in the development phase and 221 in the test phase, and 16 participating teams ultimately submitted their models and fact sheets. The video track uses the T2VQA-DB dataset, which contains 10,000 AI-Generated Videos (AIGVs) produced by 9 popular Text-to-Video (T2V) models, and has 196 registered participants; 991 submissions were received in the development phase and 185 in the test phase, and 12 participating teams ultimately submitted their models and fact sheets. Several methods achieved better results than the baselines, and the winning methods in both tracks demonstrated superior prediction performance on AIGC.
https://arxiv.org/abs/2404.16687
State space models (SSMs) with selection mechanisms and hardware-aware architectures, namely Mamba, have recently demonstrated significant promise in long-sequence modeling. Since the self-attention mechanism in Transformers scales quadratically with image size, leading to rapidly growing computational demands, researchers are now exploring how to adapt Mamba to computer vision tasks. This paper is the first comprehensive survey aiming to provide an in-depth analysis of Mamba models in the field of computer vision. It begins by exploring the foundational concepts contributing to Mamba's success, including the state space model framework, selection mechanisms, and hardware-aware design. Next, we review visual Mamba models, categorizing them into foundational models and those enhanced with techniques such as convolution, recurrence, and attention. We further delve into the widespread applications of Mamba in vision tasks, including its use as a backbone at various levels of vision processing. This encompasses general visual tasks, medical visual tasks (e.g., 2D/3D segmentation, classification, and image registration), and remote-sensing visual tasks. In particular, we present general visual tasks at two levels: high/mid-level vision (e.g., object detection, segmentation, and video classification) and low-level vision (e.g., image super-resolution, image restoration, and visual generation). We hope this endeavor will spark additional interest within the community to address current challenges and further apply Mamba models in computer vision.
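For readers new to the state space model framework that the survey builds on, the standard continuous-time SSM and its zero-order-hold discretization (the formulation used by S4 and Mamba, where the selection mechanism additionally makes Δ, B, and C input-dependent) can be summarized as follows:

```latex
% Continuous-time linear SSM and its zero-order-hold (ZOH) discretization.
% h is the hidden state, x the input sequence, y the output; A, B, C are the
% SSM parameters and \Delta is the step size.
\begin{aligned}
h'(t) &= A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)\\[2pt]
\bar{A} &= \exp(\Delta A), \qquad
\bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B\\[2pt]
h_t &= \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t
\end{aligned}
```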
https://arxiv.org/abs/2404.15956
This paper addresses the challenges associated with hyperspectral image (HSI) reconstruction for miniaturized satellites, which often suffer from stripe effects and are limited in computational resources. We propose a Real-Time Compressed Sensing (RTCS) network designed to be lightweight and to require relatively few training samples for efficient and robust HSI reconstruction in the presence of the stripe effect and under noisy transmission conditions. The RTCS network features a simplified architecture that reduces the required training samples and allows easy implementation on integer-8-based encoders, facilitating rapid compressed sensing of stripe-like HSI, which matches the modest hardware of miniaturized satellites built on the push-broom scanning mechanism. This contrasts with optimization-based models that demand high-precision floating-point operations, making them difficult to deploy on edge devices. Our encoder employs an integer-8-compatible linear projection for stripe-like HSI data transmission, ensuring real-time compressed sensing. Furthermore, based on a novel two-stream architecture, an efficient HSI restoration decoder is proposed for the receiver side, allowing reconstruction on edge devices without needing a sophisticated central server. This is particularly crucial as the increasing number of miniaturized satellites would otherwise demand significant computing resources at the ground station. Extensive experiments validate the superior performance of our approach, offering new and vital capabilities for existing miniaturized satellite systems.
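The abstract does not spell out the encoder's internals; as a rough sketch of what an integer-8-compatible linear projection over one push-broom stripe might look like, the snippet below uses a random sensing matrix, symmetric int8 quantization, and int32 accumulation, all of which are illustrative assumptions rather than the authors' exact design:

```python
import numpy as np

def quantize_int8(w, scale):
    """Symmetric int8 quantization of a floating-point sensing matrix."""
    return np.clip(np.round(w / scale), -127, 127).astype(np.int8)

def int8_stripe_encoder(stripe, w_q):
    """Compress one push-broom stripe (pixels x bands) with an int8 linear projection.
    Accumulation is done in int32, as on typical integer-8 hardware."""
    return stripe.astype(np.int32) @ w_q.astype(np.int32)

# Toy usage: a 256-pixel stripe with 200 spectral bands compressed to 50 measurements.
rng = np.random.default_rng(0)
w = rng.standard_normal((200, 50)) / np.sqrt(200)   # random sensing matrix (assumption)
scale = np.abs(w).max() / 127.0
w_q = quantize_int8(w, scale)
stripe = rng.integers(0, 4096, size=(256, 200))     # 12-bit HSI samples for one scan line
measurements = int8_stripe_encoder(stripe, w_q)
print(measurements.shape)                            # (256, 50); `scale` is carried to the decoder
```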
https://arxiv.org/abs/2404.15781
Hazy images degrade visual quality, and dehazing is a crucial prerequisite for subsequent processing tasks. Most current dehazing methods rely on neural networks and face challenges such as high computational and parameter pressure and weak generalization capabilities. This paper introduces PriorNet, a novel, lightweight, and highly applicable dehazing network designed to significantly improve the clarity and visual quality of hazy images while avoiding excessive detail-extraction issues. The core of PriorNet is the original Multi-Dimensional Interactive Attention (MIA) mechanism, which effectively captures a wide range of haze characteristics, substantially reducing the computational load and generalization difficulties associated with complex systems. By utilizing a uniform convolutional kernel size and incorporating skip connections, we have streamlined the feature extraction process. Simplifying the number of layers and the architecture not only enhances dehazing efficiency but also facilitates deployment on edge devices. Extensive testing across multiple datasets has demonstrated PriorNet's exceptional performance in dehazing and clarity restoration, maintaining image detail and color fidelity in single-image dehazing tasks. Notably, with a model size of just 18 KB, PriorNet showcases superior dehazing generalization capabilities compared to other methods. Our research makes a significant contribution to advancing image dehazing technology, providing new perspectives and tools for the field and related domains, particularly emphasizing the importance of improving universality and deployability.
https://arxiv.org/abs/2404.15638
Convolutional Neural Networks (CNNs) and Transformers have recently attracted much attention for video post-processing (VPP). However, the interaction between CNN and Transformer in existing VPP methods is not fully explored, leading to inefficient communication between locally and globally extracted features. In this paper, we explore the interaction between CNN and Transformer in the VPP task and propose a novel Spatial and Channel Hybrid-Attention Video Post-Processing Network (SC-HVPPNet), which cooperatively exploits image priors in both the spatial and channel domains. Specifically, in the spatial domain, a novel spatial attention fusion module is designed, in which two attention weights are generated to fuse the local and global representations collaboratively. In the channel domain, a novel channel attention fusion module is developed, which dynamically blends the deep representations along the channel dimension. Extensive experiments show that SC-HVPPNet notably boosts video restoration quality, with average bitrate savings of 5.29%, 12.42%, and 13.09% for the Y, U, and V components in the VTM-11.0-NNVC RA configuration.
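As a hedged illustration of the spatial attention fusion idea described above (two attention weights that fuse a local CNN feature and a global Transformer feature), one minimal PyTorch sketch could look like the following; the layer sizes and the softmax-normalized weighting are assumptions, not the paper's exact module:

```python
import torch
import torch.nn as nn

class SpatialAttentionFusion(nn.Module):
    """Fuse a local (CNN) and a global (Transformer) feature map with two
    spatially varying attention weights. Illustrative sketch only."""
    def __init__(self, channels):
        super().__init__()
        # predict two single-channel attention maps from the concatenated features
        self.attn = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2, 1),
        )

    def forward(self, local_feat, global_feat):
        w = torch.softmax(self.attn(torch.cat([local_feat, global_feat], dim=1)), dim=1)
        return w[:, 0:1] * local_feat + w[:, 1:2] * global_feat

# usage: fuse 64-channel features from a CNN branch and a Transformer branch
fusion = SpatialAttentionFusion(64)
out = fusion(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```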
https://arxiv.org/abs/2404.14709
With the popularity of social media platforms such as Instagram and TikTok, and the widespread availability and convenience of retouching tools, an increasing number of individuals are utilizing these tools to beautify their facial photographs. This poses challenges for fields that place high demands on the authenticity of photographs, such as identity verification and social media. By altering facial images, users can easily create deceptive images, leading to the dissemination of false information. This may pose challenges to the reliability of identity verification systems and social media, and even lead to online fraud. To address this issue, some work has proposed makeup removal methods, but they still lack the ability to restore images involving geometric deformations caused by retouching. To tackle the problem of facial retouching restoration, we propose a framework, dubbed Face2Face, which consists of three components: a facial retouching detector, an image restoration model named FaceR, and a color correction module called Hierarchical Adaptive Instance Normalization (H-AdaIN). Firstly, the facial retouching detector predicts a retouching label containing three integers, indicating the retouching methods and their corresponding degrees. Then FaceR restores the retouched image based on the predicted retouching label. Finally, H-AdaIN is applied to address the issue of color shift arising from diffusion models. Extensive experiments demonstrate the effectiveness of our framework and each module.
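As a rough sketch of the color-correction idea behind H-AdaIN, the snippet below applies a plain (non-hierarchical) AdaIN-style re-normalization that matches the restored image's channel statistics to a reference; the hierarchical/windowed variant proposed in the paper is not reproduced here:

```python
import torch

def adain_color_correction(restored, reference, eps=1e-5):
    """AdaIN-style color correction: re-normalize each channel of the restored
    image to the channel statistics of a reference image. A plain sketch of the
    idea behind H-AdaIN, not the paper's hierarchical module."""
    b, c = restored.shape[:2]
    r = restored.reshape(b, c, -1)
    ref = reference.reshape(b, c, -1)
    r_mean, r_std = r.mean(-1, keepdim=True), r.std(-1, keepdim=True) + eps
    ref_mean, ref_std = ref.mean(-1, keepdim=True), ref.std(-1, keepdim=True) + eps
    out = (r - r_mean) / r_std * ref_std + ref_mean
    return out.reshape_as(restored)

# usage: correct the color shift of a diffusion output toward a reference image
corrected = adain_color_correction(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256))
```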
https://arxiv.org/abs/2404.14177
In real-world scenarios, captured images often suffer from blurring, noise, and other forms of degradation, and due to sensor limitations, people usually can only obtain low dynamic range images. To achieve high-quality images, researchers have attempted various image restoration and enhancement operations on photographs, including denoising, deblurring, and high dynamic range imaging. However, performing only a single type of enhancement still cannot yield satisfactory images. In this paper, to deal with this challenge, we propose the Composite Refinement Network (CRNet), which addresses it using multiple exposure images. By fully integrating the information-rich multi-exposure inputs, CRNet can perform unified image restoration and enhancement. To improve the quality of image details, CRNet explicitly separates and strengthens high- and low-frequency information through pooling layers, using specially designed Multi-Branch Blocks for effective fusion of these frequencies. To increase the receptive field and fully integrate input features, CRNet employs a High-Frequency Enhancement Module, which includes large-kernel convolutions and an inverted-bottleneck ConvFFN. Our model secured third place in the first track of the Bracketing Image Restoration and Enhancement Challenge, surpassing previous SOTA models in both testing metrics and visual quality.
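A minimal sketch of separating high- and low-frequency information with pooling, as the CRNet description suggests (the pooling size and interpolation mode are illustrative choices, not the paper's exact configuration):

```python
import torch
import torch.nn.functional as F

def split_high_low_frequency(feat, pool_size=2):
    """Split a feature map into low- and high-frequency parts using average pooling.
    The low-frequency part is a pooled-and-upsampled copy; the high-frequency part
    is the residual."""
    low = F.avg_pool2d(feat, pool_size)
    low = F.interpolate(low, size=feat.shape[-2:], mode="bilinear", align_corners=False)
    high = feat - low
    return low, high

low, high = split_high_low_frequency(torch.randn(1, 48, 64, 64))
# each branch can then be processed separately (e.g., by a Multi-Branch Block) before re-fusion
```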
https://arxiv.org/abs/2404.14132
This paper tackles the intricate challenge of object removal when updating a radiance field represented with 3D Gaussian Splatting. The main challenges of this task lie in preserving geometric consistency and maintaining texture coherence given the substantially discrete nature of Gaussian primitives. We introduce a robust framework specifically designed to overcome these obstacles. The key insight of our approach is to enhance information exchange between the visible and invisible areas, facilitating content restoration in terms of both geometry and texture. Our methodology begins with optimizing the positioning of Gaussian primitives to improve geometric consistency across both removed and visible areas, guided by an online registration process informed by monocular depth estimation. Following this, we employ a novel feature propagation mechanism to bolster texture coherence, leveraging a cross-attention design that bridges Gaussians sampled from the uncertain and certain areas. This approach significantly refines the texture coherence of the final radiance field. Extensive experiments validate that our method not only elevates the quality of novel view synthesis for scenes undergoing object removal but also shows notable efficiency gains in training and rendering speed.
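As a hedged sketch of the cross-attention design that bridges Gaussians sampled from uncertain and certain areas, the snippet below lets features of uncertain-region Gaussians query features of visible-region Gaussians; the feature dimension, head count, and residual update are assumptions:

```python
import torch
import torch.nn as nn

class GaussianCrossAttention(nn.Module):
    """Cross-attention from Gaussians sampled in the uncertain (removed) region to
    Gaussians in the certain (visible) region. Illustrative configuration only."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, uncertain_feats, certain_feats):
        # query: uncertain-region Gaussian features; key/value: certain-region features
        propagated, _ = self.attn(uncertain_feats, certain_feats, certain_feats)
        return uncertain_feats + propagated  # residual update of the uncertain Gaussians

xattn = GaussianCrossAttention()
out = xattn(torch.randn(1, 512, 64), torch.randn(1, 4096, 64))
print(out.shape)  # torch.Size([1, 512, 64])
```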
https://arxiv.org/abs/2404.13679
Multiple complex degradations are coupled in low-quality video faces in the real world. Therefore, blind video face restoration is a highly challenging ill-posed problem, requiring not only hallucinating high-fidelity details but also enhancing temporal coherence across diverse pose variations. Restoring each frame independently in a naive manner inevitably introduces temporal incoherence and artifacts from pose changes and keypoint localization errors. To address this, we propose the first blind video face restoration approach with a novel parsing-guided temporal-coherent transformer (PGTFormer) without pre-alignment. PGTFormer leverages semantic parsing guidance to select optimal face priors for generating temporally coherent artifact-free results. Specifically, we pre-train a temporal-spatial vector quantized auto-encoder on high-quality video face datasets to extract expressive context-rich priors. Then, the temporal parse-guided codebook predictor (TPCP) restores faces in different poses based on face parsing context cues without performing face pre-alignment. This strategy reduces artifacts and mitigates jitter caused by cumulative errors from face pre-alignment. Finally, the temporal fidelity regulator (TFR) enhances fidelity through temporal feature interaction and improves video temporal consistency. Extensive experiments on face videos show that our method outperforms previous face restoration baselines. The code will be released at this https URL.
https://arxiv.org/abs/2404.13640
Tackling image degradation due to atmospheric turbulence, particularly in dynamic environments, remains a challenge for long-range imaging systems. Existing techniques have been primarily designed for static scenes or scenes with small motion. This paper presents the first segment-then-restore pipeline for restoring videos of dynamic scenes in turbulent environments. We leverage mean optical flow with an unsupervised motion segmentation method to separate dynamic and static scene components prior to restoration. After camera-shake compensation and segmentation, we introduce foreground/background enhancement leveraging the statistics of turbulence strength and a transformer model trained on a novel noise-based procedural turbulence generator for fast dataset augmentation. Benchmarked against existing restoration methods, our approach restores most of the geometric distortion and enhances the sharpness of the videos. We make our code, simulator, and data publicly available to advance the field of video restoration from turbulence: this http URL
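A minimal sketch of unsupervised motion segmentation from mean optical flow, in the spirit of the pipeline above: the temporally averaged flow magnitude is thresholded to separate dynamic from static pixels. The relative-to-median threshold is an illustrative choice, not the paper's exact rule:

```python
import numpy as np

def segment_dynamic_regions(flows, rel_threshold=1.5):
    """Separate dynamic from static pixels by thresholding the mean optical flow
    magnitude over time. `flows` has shape (T, H, W, 2); the median magnitude is
    used as a rough proxy for residual turbulence jitter."""
    mean_flow = flows.mean(axis=0)                   # (H, W, 2) temporal mean flow
    magnitude = np.linalg.norm(mean_flow, axis=-1)   # (H, W) per-pixel mean motion
    threshold = rel_threshold * np.median(magnitude)
    return magnitude > threshold                      # True where the scene is dynamic

mask = segment_dynamic_regions(np.random.randn(30, 240, 320, 2).astype(np.float32))
```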
https://arxiv.org/abs/2404.13605
In real-world scenarios, due to a series of image degradations, obtaining high-quality, clear content photos is challenging. While significant progress has been made in synthesizing high-quality images, previous methods for image restoration and enhancement often overlooked the characteristics of different degradations. They applied the same structure to address various types of degradation, resulting in less-than-ideal restoration outcomes. Inspired by the notion that high/low frequency information is applicable to different degradations, we introduce HLNet, a Bracketing Image Restoration and Enhancement method based on high-low frequency decomposition. Specifically, we employ two modules for feature extraction: shared weight modules and non-shared weight modules. In the shared weight modules, we use SCConv to extract common features from different degradations. In the non-shared weight modules, we introduce the High-Low Frequency Decomposition Block (HLFDB), which employs different methods to handle high-low frequency information, enabling the model to address different degradations more effectively. Compared to other networks, our method takes into account the characteristics of different degradations, thus achieving higher-quality image restoration.
https://arxiv.org/abs/2404.13537
The deep learning revolution has strongly impacted low-level image processing tasks such as style/domain transfer, enhancement/restoration, and visual quality assessments. Despite often being treated separately, the aforementioned tasks share a common theme of understanding, editing, or enhancing the appearance of input images without modifying the underlying content. We leverage this observation to develop a novel disentangled representation learning method that decomposes inputs into content and appearance features. The model is trained in a self-supervised manner and we use the learned features to develop a new quality prediction model named DisQUE. We demonstrate through extensive evaluations that DisQUE achieves state-of-the-art accuracy across quality prediction tasks and distortion types. Moreover, we demonstrate that the same features may also be used for image processing tasks such as HDR tone mapping, where the desired output characteristics may be tuned using example input-output pairs.
https://arxiv.org/abs/2404.13484
Unsupervised anomaly detection using only normal samples is of great significance for quality inspection in industrial manufacturing. Although existing reconstruction-based methods have achieved promising results, they still face two problems: poorly distinguishable information in image reconstruction, and unwanted reconstruction of abnormal regions caused by the model's over-generalization. To overcome these issues, we convert image reconstruction into a combination of parallel feature restorations and propose a multi-feature reconstruction network, MFRNet, using crossed-mask restoration. Specifically, a multi-scale feature aggregator is first developed to generate more discriminative hierarchical representations of the input images from a pre-trained model. Subsequently, a crossed-mask generator is adopted to randomly cover the extracted feature map, followed by a restoration network based on the transformer structure for high-quality repair of the missing regions. Finally, a hybrid loss guides model training and anomaly estimation, considering both pixel and structural similarity. Extensive experiments show that our method is highly competitive with, or significantly outperforms, other state-of-the-art methods on four publicly available datasets and one self-made dataset.
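As a hedged sketch of the crossed-mask idea, the snippet below generates a random patch-wise mask and its complement over a feature map; each masked copy would then be repaired by the transformer-based restoration network, with the restoration error serving as the anomaly score. The patch size and 50% masking ratio are assumptions:

```python
import torch
import torch.nn.functional as F

def crossed_masks(feat, patch=4):
    """Generate a random patch-wise binary mask and its complement ("crossed" masks)
    for a feature map of shape (B, C, H, W)."""
    b, _, h, w = feat.shape
    grid = (torch.rand(b, 1, h // patch, w // patch, device=feat.device) > 0.5).float()
    mask = F.interpolate(grid, size=(h, w), mode="nearest")
    return mask, 1.0 - mask

feat = torch.randn(2, 256, 64, 64)   # hierarchical features from a frozen backbone
m1, m2 = crossed_masks(feat)
# each masked copy (feat * m) is restored separately, and the per-location restoration
# error is aggregated into the anomaly map at test time
```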
https://arxiv.org/abs/2404.13273
The recent advancement of spatial transcriptomics (ST) makes it possible to characterize spatial gene expression within tissue for discovery research. However, current ST platforms suffer from low resolution, hindering an in-depth understanding of spatial gene expression. Super-resolution approaches promise to enhance ST maps by integrating histology images with the gene expression of profiled tissue spots. However, current super-resolution methods are limited by restoration uncertainty and mode collapse. Although diffusion models have shown promise in capturing complex interactions between multi-modal conditions, it remains a challenge to integrate histology images and gene expression for super-resolved ST maps. This paper proposes a cross-modal conditional diffusion model for super-resolving ST maps with the guidance of histology images. Specifically, we design a multi-modal disentangling network with cross-modal adaptive modulation to utilize complementary information from histology images and spatial gene expression. Moreover, we propose a dynamic cross-attention modelling strategy to extract hierarchical cell-to-tissue information from histology images. Lastly, we propose a co-expression-based gene-correlation graph network to model the co-expression relationships of multiple genes. Experiments show that our method outperforms other state-of-the-art methods in ST super-resolution on three public datasets.
https://arxiv.org/abs/2404.12973
The Frozen Section (FS) technique is a rapid and efficient method, taking only 15-30 minutes to prepare slides for pathologists' evaluation during surgery, enabling immediate decisions on further surgical interventions. However, the FS process often introduces artifacts and distortions like folds and ice-crystal effects. In contrast, these artifacts and distortions are absent in the higher-quality formalin-fixed paraffin-embedded (FFPE) slides, which require 2-3 days to prepare. While Generative Adversarial Network (GAN)-based methods have been used to translate FS to FFPE images (F2F), they may leave morphological inaccuracies due to remaining FS artifacts or introduce new artifacts, reducing the quality of these translations for clinical assessments. In this study, we benchmark recent generative models, focusing on GANs and Latent Diffusion Models (LDMs), to overcome these limitations. We introduce a novel approach that combines LDMs with Histopathology Pre-Trained Embeddings to enhance the restoration of FS images. Our framework leverages LDMs conditioned on both text and pre-trained embeddings to learn meaningful features of FS and FFPE histopathology images. Through diffusion and denoising techniques, our approach not only preserves essential diagnostic attributes like color staining and tissue morphology but also proposes an embedding translation mechanism to better predict the targeted FFPE representation of input FS images. As a result, this work achieves a significant improvement in classification performance, with the Area Under the Curve rising from 81.99% to 94.64%, accompanied by an advantageous CaseFD. This work establishes a new benchmark for FS to FFPE image translation quality, promising enhanced reliability and accuracy in histopathology FS image analysis. Our work is available at this https URL.
https://arxiv.org/abs/2404.12650
Recent advances in image deraining have focused on training powerful models on multiple mixed datasets comprising diverse rain types and backgrounds. However, this approach tends to overlook the inherent differences among rainy images, leading to suboptimal results. To overcome this limitation, we focus on addressing various rainy images by delving into meaningful representations that encapsulate both the rain and background components. Leveraging these representations as instructive guidance, we put forth a Context-based Instance-level Modulation (CoI-M) mechanism adept at efficiently modulating CNN- or Transformer-based models. Furthermore, we devise a rain-/detail-aware contrastive learning strategy to help extract joint rain-/detail-aware representations. By integrating CoI-M with rain-/detail-aware contrastive learning, we develop CoIC, an innovative and potent algorithm tailored for training models on mixed datasets. Moreover, CoIC offers insight into modeling the relationships among datasets, quantitatively assessing the impact of rain and details on restoration, and unveiling the distinct behaviors of models given diverse inputs. Extensive experiments validate the efficacy of CoIC in boosting the deraining ability of CNN and Transformer models. CoIC also enhances deraining prowess remarkably when a real-world dataset is included.
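A minimal, FiLM-style sketch of what context-based instance-level modulation could look like: a per-image embedding (e.g., the learned rain-/detail-aware representation) predicts a channel-wise scale and shift applied to intermediate backbone features. The conditioning layer and its sizes are assumptions, not the exact CoI-M design:

```python
import torch
import torch.nn as nn

class InstanceLevelModulation(nn.Module):
    """FiLM-style instance-level modulation: a per-image embedding predicts a
    channel-wise scale and shift for intermediate features. Illustrative sketch only."""
    def __init__(self, embed_dim, channels):
        super().__init__()
        self.to_scale_shift = nn.Linear(embed_dim, 2 * channels)

    def forward(self, feat, embedding):
        scale, shift = self.to_scale_shift(embedding).chunk(2, dim=-1)
        scale = scale.unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        shift = shift.unsqueeze(-1).unsqueeze(-1)
        return feat * (1 + scale) + shift

mod = InstanceLevelModulation(embed_dim=128, channels=64)
out = mod(torch.randn(4, 64, 32, 32), torch.randn(4, 128))
```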
https://arxiv.org/abs/2404.12091
Reconstructing degraded images is a critical task in image processing. Although CNN- and Transformer-based models are prevalent in this field, they exhibit inherent limitations, such as inadequate long-range dependency modeling and high computational costs. To overcome these issues, we introduce the Channel-Aware U-Shaped Mamba (CU-Mamba) model, which incorporates a dual State Space Model (SSM) framework into the U-Net architecture. CU-Mamba employs a Spatial SSM module for global context encoding and a Channel SSM component to preserve channel correlation features, both with linear computational complexity in the feature map size. Extensive experimental results validate CU-Mamba's superiority over existing state-of-the-art methods, underscoring the importance of integrating both spatial and channel contexts in image restoration.
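As a heavily simplified sketch of the dual-SSM idea, the snippet below runs a toy diagonal state-space scan once over the flattened pixel sequence (spatial SSM) and once over pooled channel descriptors (channel SSM), then gates the spatial output with the channel scan; the real CU-Mamba blocks use hardware-aware selective scans and are considerably richer:

```python
import torch
import torch.nn as nn

class ToySSMScan(nn.Module):
    """Toy diagonal linear state-space scan h_t = a*h_{t-1} + b*x_t, y_t = c*h_t,
    over the sequence dimension L of an input (B, L, D). Stands in for Mamba's
    selective-scan kernels; illustrative only."""
    def __init__(self, dim):
        super().__init__()
        self.a = nn.Parameter(torch.full((dim,), 0.9))
        self.b = nn.Parameter(torch.ones(dim))
        self.c = nn.Parameter(torch.ones(dim))

    def forward(self, x):                                   # (B, L, D)
        h = torch.zeros(x.shape[0], x.shape[-1], device=x.device)
        out = []
        for t in range(x.shape[1]):
            h = self.a * h + self.b * x[:, t]
            out.append(self.c * h)
        return torch.stack(out, dim=1)

feat = torch.randn(1, 32, 16, 16)                           # (B, C, H, W)
spatial_ssm = ToySSMScan(dim=32)                            # scan over the H*W pixel sequence
channel_ssm = ToySSMScan(dim=1)                             # scan over the C channel sequence

spatial_out = spatial_ssm(feat.flatten(2).transpose(1, 2))  # (B, H*W, C)
spatial_out = spatial_out.transpose(1, 2).reshape_as(feat)

channel_desc = feat.mean(dim=(2, 3)).unsqueeze(-1)          # (B, C, 1) pooled channel descriptors
channel_out = channel_ssm(channel_desc)                     # channel-correlation scan
fused = spatial_out * (1 + channel_out.unsqueeze(-1))       # broadcast channel gates over space
```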
https://arxiv.org/abs/2404.11778
Existing image restoration approaches typically employ extensive networks specifically trained for designated degradations. Despite being effective, such methods inevitably entail considerable storage costs and computational overheads due to the reliance on task-specific networks. In this work, we go beyond this well-established framework and exploit the inherent commonalities among image restoration tasks. The primary objective is to identify components that are shareable across restoration tasks and augment the shared components with modules specifically trained for individual tasks. Towards this goal, we propose AdaIR, a novel framework that enables low storage cost and efficient training without sacrificing performance. Specifically, a generic restoration network is first constructed through self-supervised pre-training using synthetic degradations. Subsequent to the pre-training phase, adapters are trained to adapt the pre-trained network to specific degradations. AdaIR requires solely the training of lightweight, task-specific modules, ensuring a more efficient storage and training regimen. We have conducted extensive experiments to validate the effectiveness of AdaIR and analyze the influence of the pre-training strategy on discovering shareable components. Extensive experimental results show that AdaIR achieves outstanding results on multi-task restoration while utilizing significantly fewer parameters (1.9 MB) and less training time (7 hours) for each restoration task. The source codes and trained models will be released.
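A minimal sketch of the adapter idea: a lightweight bottleneck module, initialized as an identity mapping, is trained per degradation while the shared pre-trained restoration backbone stays frozen. The bottleneck width and placement are assumptions, and `build_pretrained_restoration_net` is a hypothetical helper:

```python
import torch
import torch.nn as nn

class ConvAdapter(nn.Module):
    """Lightweight bottleneck adapter added after a frozen backbone block; only the
    adapter is trained for each degradation. Bottleneck width is an illustrative choice."""
    def __init__(self, channels, bottleneck=8):
        super().__init__()
        self.down = nn.Conv2d(channels, bottleneck, 1)
        self.act = nn.GELU()
        self.up = nn.Conv2d(bottleneck, channels, 1)
        nn.init.zeros_(self.up.weight)   # start as an identity so pre-training is preserved
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

# usage sketch: freeze the shared pre-trained restoration network, train adapters only
# backbone = build_pretrained_restoration_net()   # hypothetical helper
# for p in backbone.parameters():
#     p.requires_grad_(False)
adapter = ConvAdapter(channels=64)
out = adapter(torch.randn(1, 64, 48, 48))
```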
https://arxiv.org/abs/2404.11475
Video Frame Interpolation (VFI) is a crucial technique in various applications such as slow-motion generation, frame rate conversion, and video frame restoration. This paper introduces an efficient video frame interpolation framework that aims to strike a favorable balance between efficiency and quality. Our framework follows a general paradigm consisting of a flow estimator and a refinement module, while incorporating carefully designed components. First of all, we adopt depth-wise convolution with large kernels in the flow estimator, which simultaneously reduces the parameters and enhances the receptive field for encoding rich context and handling complex motion. Secondly, diverging from the common UNet-style (encoder-decoder) design for the refinement module, which we find redundant, our decoder-only refinement module directly enhances the result from coarse to fine features, offering a more efficient process. In addition, to address the challenge of handling high-definition frames, we introduce an innovative HD-aware augmentation strategy during training, leading to consistent enhancement on HD images. Extensive experiments are conducted on diverse datasets, including Vimeo90K, UCF101, Xiph, and SNU-FILM. The results demonstrate that our approach achieves state-of-the-art performance with clear improvements while requiring far fewer FLOPs and parameters, reaching a better balance between efficiency and quality.
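A small sketch of the large-kernel depth-wise convolution used to enlarge the receptive field at low parameter cost; the 7x7 kernel size and the residual point-wise projection are illustrative choices rather than the paper's exact block:

```python
import torch
import torch.nn as nn

class LargeKernelDWBlock(nn.Module):
    """Depth-wise convolution with a large kernel followed by a point-wise projection.
    Parameters grow as k*k*C (depth-wise) + C*C (point-wise) instead of k*k*C*C for a
    dense convolution."""
    def __init__(self, channels, kernel_size=7):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, kernel_size,
                            padding=kernel_size // 2, groups=channels)
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return x + self.pw(self.dw(x))

block = LargeKernelDWBlock(64)
print(sum(p.numel() for p in block.parameters()))  # far fewer than a dense 7x7 convolution
```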
https://arxiv.org/abs/2404.11108
In this paper, we propose a novel framework for the Bracket Image Restoration and Enhancement (BracketIRE) task, which requires restoring a high-quality high dynamic range (HDR) image from a sequence of noisy, blurred, and low dynamic range (LDR) multi-exposure RAW inputs. To overcome this challenge, we present IREANet, which improves multiple-exposure alignment and aggregation with a Flow-guide Feature Alignment Module (FFAM) and an Enhanced Feature Aggregation Module (EFAM). Specifically, the proposed FFAM incorporates the inter-frame optical flow as guidance to facilitate the deformable alignment and spatial attention modules for better feature alignment. The EFAM further employs the proposed Enhanced Residual Block (ERB) as a foundational component, wherein a unidirectional recurrent network aggregates the aligned temporal features to better reconstruct the results. To improve model generalization and performance, we additionally employ a Bayer-preserving augmentation (BayerAug) strategy to augment the multi-exposure RAW inputs. Our experimental evaluations demonstrate that the proposed IREANet achieves state-of-the-art performance compared with previous methods.
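As a hedged sketch of Bayer-preserving augmentation, the snippet below horizontally flips a RAW mosaic and then crops one column on each side so the CFA phase (e.g., RGGB) is unchanged; this crop-based variant is a common trick and not necessarily the exact BayerAug used in the paper:

```python
import numpy as np

def bayer_preserving_hflip(raw):
    """Horizontally flip a Bayer RAW frame while keeping the CFA phase (e.g., RGGB).
    A plain flip swaps the even/odd column phase, so one column is dropped on each
    side of the flipped frame to restore the original phase (at a 2-pixel width cost)."""
    flipped = raw[:, ::-1]
    return flipped[:, 1:-1]

raw = np.random.randint(0, 1024, size=(256, 256), dtype=np.uint16)  # toy 10-bit RGGB mosaic
aug = bayer_preserving_hflip(raw)
print(aug.shape)  # (256, 254)
```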
https://arxiv.org/abs/2404.10358