The detection and recognition of text in camera-captured images and videos remains a highly challenging research problem. Despite advances achieving high accuracy, current methods still require substantial improvement to be applicable in practical scenarios. Unlike generic text detection in images and videos, this paper addresses text detection in license plates by combining multiple frames captured from distinct viewpoints. For each viewpoint, the proposed method extracts descriptive features characterizing the text components of the license plate, specifically corner points and area. Concretely, we present three viewpoints, view-1, view-2, and view-3, to identify the nearest neighboring components, enabling the restoration of text components belonging to the same license plate line based on similarity estimates and distance metrics. Subsequently, we employ the CnOCR method for text recognition within license plates. Experimental results on the self-collected dataset (PTITPlates), comprising pairs of images in various scenarios, and the publicly available Stanford Cars Dataset demonstrate the superiority of the proposed method over existing approaches.
https://arxiv.org/abs/2309.12972
In surveillance, accurate license plate recognition is hindered by the often low quality and small size of plate regions. Despite advances in AI-based image super-resolution, methods such as Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs) still fall short in enhancing license plate images. This study leverages the cutting-edge diffusion model, which has consistently outperformed other deep learning techniques in image restoration. By training this model on a curated dataset of Saudi license plates at both low and high resolutions, we found the diffusion model to be markedly more effective. The method achieves a 12.55% and 37.32% improvement in Peak Signal-to-Noise Ratio (PSNR) over SwinIR and ESRGAN, respectively. Moreover, it surpasses these techniques in terms of the Structural Similarity Index (SSIM), registering a 4.89% and 17.66% improvement over SwinIR and ESRGAN, respectively. Furthermore, 92% of human evaluators preferred our images over those from the other algorithms. In essence, this research presents a pioneering solution for license plate super-resolution, with tangible potential for surveillance systems.
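As a reference for the reported gains, a minimal sketch of how PSNR and a relative improvement could be computed; the image values and numbers below are placeholders, not figures from the paper.

```python
import numpy as np

def psnr(reference: np.ndarray, restored: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB between two images of the same shape."""
    mse = np.mean((reference.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((max_val ** 2) / mse)

def relative_improvement(ours: float, baseline: float) -> float:
    """Percentage improvement of `ours` over `baseline`, the form of gain quoted above."""
    return 100.0 * (ours - baseline) / baseline

# Hypothetical values, for illustration only.
print(relative_improvement(ours=28.4, baseline=25.2))
```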
https://arxiv.org/abs/2309.12506
Segment Anything (SAM), an advanced universal image segmentation model trained on an expansive visual dataset, has set a new benchmark in image segmentation and computer vision. However, it faces challenges in distinguishing shadows from their backgrounds. To address this, we develop Deshadow-Anything: taking into account the generalization afforded by large-scale datasets, we fine-tune on large-scale data to achieve image shadow removal. The diffusion model can diffuse along the edges and textures of an image, helping to remove shadows while preserving image details. Furthermore, we design Multi-Self-Attention Guidance (MSAG) and adaptive input perturbation (DDPM-AIP) to accelerate the iterative training of the diffusion model. Experiments on shadow removal tasks demonstrate that these methods effectively improve image restoration performance.
https://arxiv.org/abs/2309.11715
Exploiting pre-trained diffusion models for restoration has recently become a favored alternative to the traditional task-specific training approach. Previous works have achieved noteworthy success by limiting the solution space using explicit degradation models. However, these methods often fall short when faced with complex degradations, as these generally cannot be modeled precisely. In this paper, we propose PGDiff, which introduces partial guidance, a fresh perspective that adapts to real-world degradations better than existing works. Rather than specifically defining the degradation process, our approach models the desired properties, such as the image structure and color statistics of high-quality images, and applies this guidance during the reverse diffusion process. These properties are readily available and make no assumptions about the degradation process. When combined with a diffusion prior, this partial guidance delivers appealing results across a range of restoration tasks. Additionally, PGDiff can be extended to handle composite tasks by consolidating multiple high-quality image properties, achieved by integrating the guidance from the respective tasks. Experimental results demonstrate that our method not only outperforms existing diffusion-prior-based approaches but also competes favorably with task-specific models.
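To make the idea of partial guidance concrete, here is a hedged, generic sketch of one guidance update inside reverse diffusion, using channel-wise color statistics as the guided property. The denoiser interface, schedule value, and guidance weight are assumptions, and the standard denoising update that would follow is omitted; this is not PGDiff's actual implementation.

```python
import torch

def apply_partial_guidance(x_t, t, denoiser, alpha_bar_t, target_mean, weight=1.0):
    """Nudge the current noisy sample x_t toward a desired property of its
    predicted clean image. `denoiser(x_t, t)` predicts noise (placeholder),
    `alpha_bar_t` is the cumulative schedule product (float), `target_mean`
    is the desired per-channel color mean."""
    x_t = x_t.detach().requires_grad_(True)
    eps = denoiser(x_t, t)
    x0_pred = (x_t - (1 - alpha_bar_t) ** 0.5 * eps) / (alpha_bar_t ** 0.5)
    # Guidance loss defined on a *property* of the predicted clean image,
    # with no explicit degradation model involved.
    prop_loss = ((x0_pred.mean(dim=(-2, -1)) - target_mean) ** 2).sum()
    grad = torch.autograd.grad(prop_loss, x_t)[0]
    return (x_t - weight * grad).detach()  # the usual reverse-diffusion step follows
```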
https://arxiv.org/abs/2309.10810
Image denoising is a fundamental and challenging task in the field of computer vision. Most supervised denoising methods learn to reconstruct clean images from noisy inputs; such methods have an intrinsic spectral bias and tend to produce over-smoothed, blurry images. Recently, researchers have explored diffusion models to generate high-frequency details in image restoration tasks, but these models do not guarantee that the generated texture aligns with real images, leading to undesirable artifacts. To address the trade-off between visual appeal and fidelity of high-frequency details in denoising, we propose a novel approach called the Reconstruct-and-Generate Diffusion Model (RnG). Our method leverages a reconstructive denoising network to recover the majority of the underlying clean signal, which serves as the initial estimation for subsequent steps to maintain fidelity. It then employs a diffusion algorithm to generate the residual high-frequency details, thereby enhancing visual quality. We further introduce a two-stage training scheme to ensure effective collaboration between the reconstructive and generative modules of RnG. To reduce undesirable texture introduced by the diffusion model, we also propose an adaptive step controller that regulates the number of inverse steps applied by the diffusion model, allowing control over the level of high-frequency detail added to each patch while also reducing inference cost. Through the proposed RnG, we achieve a better balance between perception and distortion. We conducted extensive experiments on both synthetic and real denoising datasets, validating the superiority of the proposed approach.
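A hypothetical sketch of what an adaptive step controller could look like: allocate more reverse diffusion steps to patches whose reconstructive estimate still lacks high-frequency energy. The Laplacian-based measure, reference level, and step budget are illustrative assumptions, not the controller defined in the paper.

```python
import torch
import torch.nn.functional as F

def steps_per_patch(recon_patch: torch.Tensor, max_steps: int = 20) -> int:
    """recon_patch: (C, H, W) initial reconstructive estimate of one patch."""
    # Estimate remaining high-frequency energy via a Laplacian response.
    kernel = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)
    gray = recon_patch.mean(dim=0, keepdim=True).unsqueeze(0)  # (1, 1, H, W)
    hf_energy = F.conv2d(gray, kernel, padding=1).abs().mean().item()
    # Patches that are already detailed receive fewer generative steps.
    scale = min(1.0, hf_energy / 0.1)  # 0.1 is an arbitrary reference level
    return max(1, int(round(max_steps * (1.0 - scale))))
```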
https://arxiv.org/abs/2309.10714
Most existing sandstorm image enhancement methods are based on traditional theory and prior knowledge, which often restricts their applicability in real-world scenarios. In addition, these approaches often adopt a strategy of color correction followed by dust removal, which makes the algorithm structure overly complex. To solve this issue, we introduce a novel image restoration model, the all-in-one sandstorm removal network (AOSR-Net). This model is developed from a re-formulated sandstorm scattering model, which directly establishes the image mapping relationship by integrating the intermediate parameters. Such an integration scheme effectively addresses the problems of over-enhancement and weak generalization in the field of sand-dust image enhancement. Experimental results on synthetic and real-world sandstorm images demonstrate the superiority of the proposed AOSR-Net over state-of-the-art (SOTA) algorithms.
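For background, the classical sand-dust/haze scattering model that such methods typically start from is I(x) = J(x)·t(x) + A·(1 − t(x)); AOSR-Net's re-formulation folds the intermediate parameters into a direct image-to-image mapping. The sketch below only inverts the classical model and is not the proposed network.

```python
import numpy as np

def invert_scattering(I: np.ndarray, t: np.ndarray, A: np.ndarray,
                      t_min: float = 0.1) -> np.ndarray:
    """Recover scene radiance J from observation I (H, W, 3), transmission
    t (H, W), and airlight A (3,), using the classical scattering model."""
    t = np.clip(t, t_min, 1.0)[..., None]  # lower-bound t to avoid amplification
    return (I - A) / t + A
```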
https://arxiv.org/abs/2309.08838
Remote sensing images are essential for many earth science applications, but their quality can be degraded due to limitations in sensor technology and complex imaging environments. To address this, various remote sensing image deblurring methods have been developed to restore sharp, high-quality images from degraded observational data. However, most traditional model-based deblurring methods require predefined hand-crafted prior assumptions, which are difficult to handle in complex applications, while most deep learning-based deblurring methods are designed as black boxes, lacking transparency and interpretability. In this work, we propose a novel blind deblurring learning framework based on alternating iterations of shrinkage thresholding, alternately updating blur kernels and images, with the network design grounded in this theoretical foundation. Additionally, we propose a learnable blur kernel proximal mapping module to improve blur kernel evaluation in the kernel domain. We then propose a deep proximal mapping module in the image domain, which combines a generalized shrinkage threshold operator and a multi-scale prior feature extraction block. This module also introduces an attention mechanism to adaptively adjust the importance of the prior, thereby avoiding the drawbacks of hand-crafted image prior terms. The resulting multi-scale generalized shrinkage threshold network (MGSTNet) is designed to focus specifically on learning deep geometric prior features to enhance image restoration. Experiments demonstrate the superiority of our MGSTNet framework on remote sensing image datasets compared to existing deblurring methods.
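The shrinkage-threshold building block referred to above reduces, in its classical form, to soft-thresholding (the proximal operator of the L1 norm); MGSTNet generalizes it with learnable, multi-scale thresholds. A minimal sketch of the textbook operator:

```python
import torch

def soft_threshold(x: torch.Tensor, theta: float) -> torch.Tensor:
    """prox_{theta * ||.||_1}(x) = sign(x) * max(|x| - theta, 0)."""
    return torch.sign(x) * torch.clamp(x.abs() - theta, min=0.0)
```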
https://arxiv.org/abs/2309.07524
Talking Face Generation (TFG) aims to reconstruct facial movements from audio and facial features, which are inherently correlated, so as to achieve highly natural lip motion. Existing TFG methods have made significant progress in producing natural and realistic images, yet most rarely take visual quality into consideration. It is challenging to ensure lip synchronization while avoiding visual quality degradation in cross-modal generation methods. To address this issue, we propose a universal High-Definition Teeth Restoration Network, dubbed HDTR-Net, for arbitrary TFG methods. HDTR-Net can enhance teeth regions at an extremely fast speed while maintaining synchronization and temporal consistency. In particular, we propose a Fine-Grained Feature Fusion (FGFF) module to effectively capture fine texture information around the teeth and surrounding regions, and use these features to refine the feature map and enhance the clarity of the teeth. Extensive experiments show that our method can be adapted to arbitrary TFG methods without degrading lip synchronization or frame coherence. Another advantage of HDTR-Net is its real-time generation ability. For high-definition restoration of synthesized talking-face video, its inference is also $300\%$ faster than the current state-of-the-art super-resolution-based face restoration.
https://arxiv.org/abs/2309.07495
Image reconstruction-based anomaly detection models are widely explored in industrial visual inspection. However, existing models usually suffer from a trade-off between normal reconstruction fidelity and abnormal reconstruction distinguishability, which hurts performance. In this paper, we find that this trade-off can be better mitigated by leveraging the distinct frequency biases between normal and abnormal reconstruction errors. To this end, we propose Frequency-aware Image Restoration (FAIR), a novel self-supervised image restoration task that restores images from their high-frequency components. It enables precise reconstruction of normal patterns while mitigating unfavorable generalization to anomalies. Using only a simple vanilla UNet, FAIR achieves state-of-the-art performance with higher efficiency on various defect detection datasets. Code: this https URL.
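A rough sketch of the self-supervised setup described above: obtain a high-frequency input by subtracting a Gaussian low-pass copy, then train a vanilla UNet (not shown) to restore the original image from it. The kernel size and frequency split are assumptions rather than the paper's specification.

```python
import torch
import torch.nn.functional as F

def high_frequency(img: torch.Tensor, ksize: int = 11, sigma: float = 3.0) -> torch.Tensor:
    """img: (B, C, H, W) in [0, 1]; returns the image minus its Gaussian low-pass."""
    coords = torch.arange(ksize, dtype=torch.float32) - ksize // 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    g = (g / g.sum()).to(img.device)
    kernel = (g[:, None] * g[None, :]).expand(img.shape[1], 1, ksize, ksize).contiguous()
    low = F.conv2d(img, kernel, padding=ksize // 2, groups=img.shape[1])  # depthwise blur
    return img - low  # high-frequency residual used as the restoration input
```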
https://arxiv.org/abs/2309.07068
Video deblurring methods, which aim to recover consecutive sharp frames from a given blurry video, usually assume that the input video consists of consecutively blurry frames. However, in real-world blurry videos taken by modern imaging devices, sharp frames usually do appear, making long-term temporal sharp features available to facilitate the restoration of a blurry frame. In this work, we propose a video deblurring method that leverages both neighboring frames and the sharp frames present in the video, using hybrid Transformers for feature aggregation. Specifically, we first train a blur-aware detector to distinguish between sharp and blurry frames. Then, a window-based local Transformer is employed to exploit features from neighboring frames, where cross-attention is beneficial for aggregating features from neighboring frames without explicit spatial alignment. To aggregate long-term sharp features from the detected sharp frames, we utilize a global Transformer with multi-scale matching capability. Moreover, our method can easily be extended to event-driven video deblurring by incorporating an event fusion module into the global Transformer. Extensive experiments on benchmark datasets demonstrate that our proposed method outperforms state-of-the-art video deblurring methods as well as event-driven video deblurring methods in terms of quantitative metrics and visual quality. The source code and trained models are available at this https URL.
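A schematic sketch of the cross-attention aggregation idea: tokens of the blurry target frame attend to tokens of a neighboring frame so that useful content is borrowed without explicit spatial alignment. Shapes, dimensions, and the residual fusion are placeholders; the actual windowed local and multi-scale global Transformers in the paper differ.

```python
import torch
import torch.nn as nn

dim = 64
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

target_feat = torch.randn(1, 1024, dim)    # (B, tokens, C) of the blurry target frame
neighbor_feat = torch.randn(1, 1024, dim)  # tokens of a neighboring (possibly sharp) frame

# Query from the blurry frame, key/value from the neighbor: no alignment needed.
aggregated, _ = cross_attn(query=target_feat, key=neighbor_feat, value=neighbor_feat)
fused = target_feat + aggregated           # residual fusion of borrowed details
```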
https://arxiv.org/abs/2309.07054
Restoring degraded music signals is essential to enhance audio quality for downstream music manipulation. Recent diffusion-based music restoration methods have demonstrated impressive performance, and among them, diffusion posterior sampling (DPS) stands out given its intrinsic properties, making it versatile across various restoration tasks. In this paper, we identify potential issues that degrade the performance of current DPS-based methods and introduce ways to mitigate them, inspired by diverse diffusion guidance techniques including the RePaint (RP) strategy and Pseudoinverse-Guided Diffusion Models ($\Pi$GDM). We demonstrate our methods on the vocal declipping and bandwidth extension tasks under various levels of distortion and cutoff frequency, respectively. In both tasks, our methods outperform the current DPS-based music restoration benchmarks. We refer to \url{this http URL} for examples of the restored audio samples.
https://arxiv.org/abs/2309.06934
Document shadow is a common issue that arises when capturing documents with mobile devices, and it significantly impacts readability. Current methods encounter various challenges, including inaccurate detection of shadow masks and inaccurate estimation of illumination. In this paper, we propose ShaDocFormer, a Transformer-based architecture that integrates traditional methodologies and deep learning techniques to tackle the problem of document shadow removal. The ShaDocFormer architecture comprises two components: the Shadow-attentive Threshold Detector (STD) and the Cascaded Fusion Refiner (CFR). The STD module employs a traditional thresholding technique and leverages the attention mechanism of the Transformer to gather global information, thereby enabling precise detection of shadow masks. The cascaded and aggregative structure of the CFR module facilitates a coarse-to-fine restoration process for the entire image. As a result, ShaDocFormer excels at accurately detecting and capturing variations in both shadow and illumination, thereby enabling effective removal of shadows. Extensive experiments demonstrate that ShaDocFormer outperforms current state-of-the-art methods in both qualitative and quantitative measurements.
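As an illustration of the "traditional thresholding technique" ingredient, a small sketch using Otsu's method to produce an initial shadow mask; choosing Otsu here is an assumption, and the Transformer attention that refines the mask in STD is not reproduced.

```python
import numpy as np

def otsu_shadow_mask(gray: np.ndarray) -> np.ndarray:
    """gray: uint8 grayscale document image; returns a boolean mask of darker regions."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                    # cumulative class probability
    mu = np.cumsum(prob * np.arange(256))      # cumulative class mean * probability
    mu_t = mu[-1]
    # Between-class variance for every candidate threshold.
    sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1 - omega) + 1e-12)
    threshold = int(np.argmax(sigma_b))
    return gray < threshold                    # darker-than-threshold pixels as shadow
```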
https://arxiv.org/abs/2309.06670
The Yongle Palace murals, as valuable cultural heritage, have suffered varying degrees of damage, making their restoration of significant importance. However, the giant size and unique data of the Yongle Palace murals present challenges for existing deep-learning-based restoration methods: 1) The distinctive style introduces domain bias in traditional transfer-learning-based restoration methods, while the scarcity of mural data further limits the applicability of these methods. 2) Additionally, the giant size of these murals results in a wider range of defect types and sizes, necessitating models with greater adaptability. Consequently, deep-learning-based restoration of the unique giant murals of Yongle Palace has received little attention. Here, a 3M-Hybrid model is proposed to address these challenges. First, based on the characteristic that the murals' information is prominently distributed across low- and high-frequency features, high- and low-frequency features are abstracted separately for complementary learning. Furthermore, we integrate a pre-trained Vision Transformer (ViT) into the CNN module, allowing us to leverage the benefits of a large model while mitigating domain bias. Second, we mitigate the seam and structural distortion issues that arise when restoring large defects by employing a multi-scale and multi-perspective strategy, including data segmentation and fusion. Experimental results demonstrate the efficacy of our proposed model. In regular-sized mural restoration, it improves SSIM and PSNR by 14.61% and 4.73%, respectively, compared to the best of four representative CNN models. Additionally, it achieves favorable results in the final restoration of the giant murals.
https://arxiv.org/abs/2309.06194
Contrastive learning has emerged as a prevailing paradigm for high-level vision tasks and, by introducing properly chosen negative samples, has also been exploited for low-level vision tasks to achieve a compact optimization space that accounts for their ill-posed nature. However, existing methods rely on manually predefined, task-oriented negatives, which often exhibit pronounced task-specific biases. In this paper, we propose an innovative approach for the adaptive generation of negative samples directly from the target model itself, called ``learning from history''. We introduce the Self-Prior guided Negative loss for image restoration (SPNIR) to enable this approach. Our approach is task-agnostic and generic, making it compatible with any existing image restoration method or task. We demonstrate the effectiveness of our approach by retraining existing models with SPNIR. The results show significant improvements in image restoration across various tasks and architectures. For example, models retrained with SPNIR outperform the original FFANet and DehazeFormer by 3.41 dB and 0.57 dB on the RESIDE indoor dataset for image dehazing. Similarly, they achieve notable improvements of 0.47 dB over IDT on SPA-Data for image deraining and 0.12 dB over lightweight SwinIR on Manga109 for 4x super-resolution. Code and retrained models are available at this https URL.
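A hedged sketch of the ``learning from history'' idea: restorations from an earlier snapshot of the model serve as negatives, the ground truth as the positive, and a ratio-style penalty pulls the current output toward the target while pushing it away from the model's own past output. The feature extractor and loss form are illustrative assumptions, not the definition of SPNIR.

```python
import copy
import torch
import torch.nn.functional as F

def self_prior_negative_loss(model, history_model, degraded, clean, feat_fn):
    """Contrastive-style restoration loss with a historical model as negative provider."""
    output = model(degraded)
    with torch.no_grad():
        negative = history_model(degraded)     # restoration from an earlier checkpoint
    f_out, f_pos, f_neg = feat_fn(output), feat_fn(clean), feat_fn(negative)
    # Pull toward the clean target, push away from the model's own past output.
    return F.l1_loss(f_out, f_pos) / (F.l1_loss(f_out, f_neg) + 1e-6)

# The negative provider can be refreshed periodically from current weights, e.g.:
# history_model = copy.deepcopy(model).eval()
```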
https://arxiv.org/abs/2309.06023
Depth estimation plays an important role in robotic perception systems. The self-supervised monocular paradigm has gained significant attention since it frees training from reliance on depth annotations. Despite recent advancements, existing self-supervised methods still underutilize the available training data, limiting their generalization ability. In this paper, we adopt two data augmentation techniques, namely Resizing-Cropping and Splitting-Permuting, to fully exploit the potential of training datasets. Specifically, the original image and the two generated augmented images are fed into the training pipeline simultaneously and we leverage them to conduct self-distillation. Additionally, we introduce a detail-enhanced DepthNet with an extra full-scale branch in the encoder and a grid decoder to enhance the restoration of fine details in depth maps. Experimental results demonstrate that our method achieves state-of-the-art performance on the KITTI benchmark, with both raw ground truth and improved ground truth. Moreover, our models also show superior generalization performance when transferred to the Make3D and NYUv2 datasets. Our code is available at this https URL.
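A rough sketch of the two augmentations named above applied to a single image tensor; the exact scale factor, crop position, and permutation scheme are assumptions (and the splitting sketch assumes even height and width).

```python
import torch
import torch.nn.functional as F

def resize_crop(img: torch.Tensor, scale: float = 1.2) -> torch.Tensor:
    """img: (C, H, W). Upscale, then center-crop back to the original size."""
    c, h, w = img.shape
    up = F.interpolate(img.unsqueeze(0), scale_factor=scale, mode="bilinear",
                       align_corners=False).squeeze(0)
    top, left = (up.shape[1] - h) // 2, (up.shape[2] - w) // 2
    return up[:, top:top + h, left:left + w]

def split_permute(img: torch.Tensor) -> torch.Tensor:
    """Split into four quadrants and recombine them in a shuffled order."""
    c, h, w = img.shape
    h2, w2 = h // 2, w // 2
    tiles = [img[:, :h2, :w2], img[:, :h2, w2:w2 * 2],
             img[:, h2:h2 * 2, :w2], img[:, h2:h2 * 2, w2:w2 * 2]]
    order = torch.randperm(4).tolist()
    top = torch.cat([tiles[order[0]], tiles[order[1]]], dim=2)
    bottom = torch.cat([tiles[order[2]], tiles[order[3]]], dim=2)
    return torch.cat([top, bottom], dim=1)
```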
https://arxiv.org/abs/2309.05254
Transformer-based methods have shown impressive performance in image restoration tasks such as image super-resolution and denoising. However, through attribution analysis we find that these networks can only utilize a limited spatial range of input information. This implies that the potential of the Transformer is still not fully exploited in existing networks. In order to activate more input pixels for better restoration, we propose a new Hybrid Attention Transformer (HAT). It combines both channel attention and window-based self-attention schemes, thus making use of their complementary advantages. Moreover, to better aggregate cross-window information, we introduce an overlapping cross-attention module to enhance the interaction between neighboring window features. In the training stage, we additionally adopt a same-task pre-training strategy to further exploit the potential of the model. Extensive experiments have demonstrated the effectiveness of the proposed modules. We further scale up the model to show that the performance of the SR task can be greatly improved. In addition, we extend HAT to more image restoration applications, including real-world image super-resolution, Gaussian image denoising and reduction of image compression artifacts. Experiments on benchmark and real-world datasets demonstrate that our HAT achieves state-of-the-art performance both quantitatively and qualitatively. Code and models are publicly available at this https URL.
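For the channel-attention half of the hybrid scheme, a minimal squeeze-and-excitation-style block is sketched below; the window-based self-attention and overlapping cross-attention modules are omitted, and the reduction ratio is an assumption.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # squeeze: global average pool
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                   # per-channel gate in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.fc(x)                               # re-weight feature channels

feat = torch.randn(1, 64, 32, 32)
out = ChannelAttention(64)(feat)
```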
https://arxiv.org/abs/2309.05239
Images and videos captured by an Under-Display Camera (UDC) suffer from severe degradation, such as saturation degeneration and color shift. While restoration for UDC has been a critical task, existing work on UDC restoration focuses only on images; UDC video restoration (UDC-VR) has not yet been explored by the community. In this work, we first propose a GAN-based generation pipeline to simulate the realistic UDC degradation process. With this pipeline, we build the first large-scale UDC video restoration dataset, PexelsUDC, which includes two subsets, PexelsUDC-T and PexelsUDC-P, corresponding to different displays for UDC. Using the proposed dataset, we conduct extensive benchmark studies on existing video restoration methods and observe their limitations on the UDC-VR task. To this end, we propose a novel Transformer-based baseline method that adaptively enhances degraded videos. The key components of the method are a spatial branch with local-aware Transformers, a temporal branch with embedded temporal Transformers, and a spatial-temporal fusion module. These components drive the model to fully exploit spatial and temporal information for UDC-VR. Extensive experiments show that our method achieves state-of-the-art performance on PexelsUDC. The benchmark and the baseline method, which will be made public, are expected to promote the progress of UDC-VR in the community.
https://arxiv.org/abs/2309.04752
Image restoration aims to recover high-quality images from their degraded observations. Since most existing methods are dedicated to removing a single type of degradation, they may not yield optimal results on other degradation types, which limits their usefulness in real-world scenarios. In this paper, we propose a novel data-ingredient-oriented approach that leverages prompt-based learning to enable a single model to efficiently tackle multiple image degradation tasks. Specifically, we utilize an encoder to capture features and introduce prompts carrying degradation-specific information to guide the decoder in adaptively recovering images affected by various degradations. In order to model local invariant properties and non-local information for high-quality image restoration, we combine CNN operations and Transformers. Simultaneously, we make several key design choices in the Transformer blocks (multi-head rearranged attention with prompts and a simple-gate feed-forward network) to reduce computational requirements and to selectively determine what information should be preserved, facilitating efficient recovery of potentially sharp images. Furthermore, we incorporate a feature fusion mechanism that further explores multi-scale information to improve the aggregated features. Despite being designed to handle different types of degradation, the resulting tightly interlinked hierarchical architecture, named CAPTNet, performs competitively with task-specific algorithms, as extensive experiments demonstrate.
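A schematic sketch of prompt-based conditioning in this spirit: a small bank of learned prompts is weighted by a global degradation descriptor and injected into decoder features. Prompt count, dimensions, and the injection point are illustrative assumptions, not CAPTNet's design.

```python
import torch
import torch.nn as nn

class DegradationPrompt(nn.Module):
    def __init__(self, num_prompts: int = 5, dim: int = 64):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim))  # learned prompt bank
        self.to_weights = nn.Linear(dim, num_prompts)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        """feat: (B, C, H, W) decoder features with C == dim."""
        descriptor = feat.mean(dim=(-2, -1))                      # (B, C) degradation cue
        weights = self.to_weights(descriptor).softmax(dim=-1)     # (B, num_prompts)
        prompt = weights @ self.prompts                           # (B, C) blended prompt
        return feat + prompt[:, :, None, None]                    # inject into features
```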
https://arxiv.org/abs/2309.03063
Image deblurring is a critical task in the field of image restoration, aiming to eliminate blurring artifacts. However, the challenge of addressing non-uniform blurring leads to an ill-posed problem, which limits the generalization performance of existing deblurring models. To solve this problem, we propose SAM-Deblur, a framework that, for the first time, integrates prior knowledge from the Segment Anything Model (SAM) into the deblurring task. SAM-Deblur consists of three stages. First, we preprocess the blurred images, obtain image masks via SAM, and propose a mask dropout method during training to enhance model robustness. Then, to fully leverage the structural priors generated by SAM, we propose a Mask Average Pooling (MAP) unit specifically designed to average SAM-generated segmented areas, serving as a plug-and-play component that can be seamlessly integrated into existing deblurring networks. Finally, we feed the fused features generated by the MAP unit into the deblurring model to obtain a sharp image. Experimental results on the RealBlurJ, ReloBlur, and REDS datasets reveal that incorporating our methods improves NAFNet's PSNR by 0.05, 0.96, and 7.03, respectively. Code will be available at \href{this https URL}{SAM-Deblur}.
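A minimal sketch of a Mask Average Pooling style operation: features inside each SAM-produced segment are replaced by that segment's mean feature, yielding a structure-aware prior map. Tensor shapes are assumptions for illustration.

```python
import torch

def mask_average_pooling(feat: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """feat: (C, H, W) image features; masks: (N, H, W) boolean segments from SAM."""
    pooled = torch.zeros_like(feat)
    for m in masks:
        area = m.sum()
        if area == 0:
            continue
        mean = (feat * m).sum(dim=(1, 2), keepdim=True) / area   # (C, 1, 1) segment mean
        pooled = torch.where(m.unsqueeze(0), mean.expand_as(feat), pooled)
    return pooled
```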
https://arxiv.org/abs/2309.02270
Underwater image restoration has been a challenging problem for decades, since the advent of underwater photography. Most solutions focus on shallow-water scenarios, where the scene is uniformly illuminated by sunlight. However, the vast majority of uncharted underwater terrain lies beyond 200 meters depth, where natural light is scarce and artificial illumination is needed. In such cases, light sources co-moving with the camera dynamically change the scene appearance, which makes shallow-water restoration methods inadequate. In particular, for multi-light-source systems (nowadays composed of dozens of LEDs), calibrating each light is time-consuming, error-prone and tedious, and we observe that only the integrated illumination within the viewing volume of the camera is critical, rather than the individual light sources. The key idea of this paper is therefore to exploit the appearance changes of objects or of the seafloor as they traverse the camera's viewing frustum. Through new constraints assuming Lambertian surfaces, corresponding image pixels constrain the light field in front of the camera, and for each voxel a signal factor and a backscatter value are stored in a volumetric grid. This grid enables very efficient image restoration for camera-light platforms, which facilitates consistently texturing large 3D models and maps that would otherwise be dominated by lighting and medium artifacts. To validate the effectiveness of our approach, we conducted extensive experiments on simulated and real-world datasets. The results demonstrate the robustness of our approach in restoring the true albedo of objects while mitigating the influence of lighting and medium effects. Furthermore, we demonstrate that our approach can be readily extended to other scenarios, including in-air imaging with artificial illumination and other similar cases.
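A simplified sketch of the per-pixel correction implied above: with a signal factor s and a backscatter value B sampled from the volumetric grid, the observation is corrected as J = (I − B) / s. The Lambertian-constraint estimation of the grid itself is not reproduced here.

```python
import numpy as np

def restore_pixel(I: np.ndarray, signal: np.ndarray, backscatter: np.ndarray,
                  eps: float = 1e-6) -> np.ndarray:
    """I, signal, backscatter: per-pixel RGB arrays sampled from the voxel grid;
    returns the estimated albedo/scene radiance J = (I - B) / s."""
    return (I - backscatter) / np.maximum(signal, eps)
```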
https://arxiv.org/abs/2309.02217