Most video restoration networks are slow, have a high computational load, and can't be used for real-time video enhancement. In this work, we design an efficient and fast framework to perform real-time video enhancement for practical use-cases like live video calls and video streams. Our proposed method, called Recurrent Bottleneck Mixer Network (ReBotNet), employs a dual-branch framework. The first branch learns spatio-temporal features by tokenizing the input frames along the spatial and temporal dimensions using a ConvNext-based encoder and processing these abstract tokens using a bottleneck mixer. To further improve temporal consistency, the second branch employs a mixer directly on tokens extracted from individual frames. A common decoder then merges the features from the two branches to predict the enhanced frame. In addition, we propose a recurrent training approach where the last frame's prediction is leveraged to efficiently enhance the current frame while improving temporal consistency. To evaluate our method, we curate two new datasets that emulate real-world video call and streaming scenarios, and show extensive results on multiple datasets where ReBotNet outperforms existing approaches with lower computational cost, reduced memory requirements, and faster inference time.
https://arxiv.org/abs/2303.13504
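To make the recurrent training idea above concrete, here is a minimal PyTorch sketch in which the previous frame's prediction is concatenated with the current degraded frame. `EnhancerNet` is a hypothetical stand-in for ReBotNet's actual dual-branch architecture, and the truncated-backprop detach is an assumption rather than a detail from the abstract.

```python
import torch
import torch.nn as nn

class EnhancerNet(nn.Module):
    """Hypothetical stand-in: 6 input channels = degraded frame + previous prediction."""
    def __init__(self, ch=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(6, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 3, 3, padding=1),
        )

    def forward(self, frame, prev_pred):
        return self.body(torch.cat([frame, prev_pred], dim=1))

def enhance_clip(net, frames):
    """frames: (T, 3, H, W) degraded clip -> list of (1, 3, H, W) enhanced frames."""
    prev = torch.zeros_like(frames[0:1])            # no prediction before frame 0
    outputs = []
    for t in range(frames.shape[0]):
        prev = net(frames[t:t + 1], prev.detach())  # detach: truncated backprop through time
        outputs.append(prev)
    return outputs
```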
Transformer-based methods have achieved significant performance in image deraining, as they can model the non-local information which is vital for high-quality image reconstruction. In this paper, we find that most existing Transformers usually use all similarities of the tokens from the query-key pairs for feature aggregation. However, if the tokens from the query differ from those of the key, the self-attention values estimated from these tokens are also involved in feature aggregation, which accordingly interferes with clear image restoration. To overcome this problem, we propose an effective DeRaining network, Sparse Transformer (DRSformer), that can adaptively keep the most useful self-attention values for feature aggregation so that the aggregated features better facilitate high-quality image reconstruction. Specifically, we develop a learnable top-k selection operator to adaptively retain the most crucial attention scores from the keys for each query for better feature aggregation. Simultaneously, as the naive feed-forward network in Transformers does not model the multi-scale information that is important for latent clear image restoration, we develop an effective mixed-scale feed-forward network to generate better features for image deraining. To learn an enriched set of hybrid features that combine the local context from CNN operators, we equip our model with a mixture-of-experts feature compensator to present a cooperative refinement deraining scheme. Extensive experimental results on commonly used benchmarks demonstrate that the proposed method achieves favorable performance against state-of-the-art approaches. The source code and trained models are available at this https URL.
https://arxiv.org/abs/2303.11950
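The top-k selection idea lends itself to a short sketch. The snippet below keeps only the k largest query-key similarities per query before the softmax; in the paper the selection is learnable, whereas `keep` here is a fixed hyperparameter, so treat this as a simplified illustration rather than DRSformer's actual operator.

```python
import torch
import torch.nn.functional as F

def topk_attention(q, k, v, keep=16):
    # q, k, v: (B, heads, N, d)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (B, heads, N, N)
    topv, _ = scores.topk(keep, dim=-1)                     # k largest scores per query
    mask = scores < topv[..., -1:]                          # True = discard this key
    scores = scores.masked_fill(mask, float('-inf'))        # sparsify before softmax
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 4, 64, 32)
out = topk_attention(q, k, v, keep=8)   # (1, 4, 64, 32)
```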
Soft robotics technology can aid in achieving United Nations Sustainable Development Goals (SDGs) and the Paris Climate Agreement through the development of autonomous, environmentally responsible machines powered by renewable energy. By utilizing soft robotics, we can mitigate the detrimental effects of climate change on human society and the natural world by fostering adaptation, restoration, and remediation. Moreover, the implementation of soft robotics can lead to groundbreaking discoveries in material science, biology, control systems, energy efficiency, and sustainable manufacturing processes. However, to achieve these goals, we need further improvements in understanding the biological principles at the basis of embodied and physical intelligence, environment-friendly materials, and energy-saving strategies to design and manufacture self-piloting and field-ready soft robots. This paper provides insights into how soft robotics can address the pressing issue of environmental sustainability. Sustainable manufacturing of soft robots at a large scale, exploring the potential of biodegradable and bioinspired materials, and integrating onboard renewable energy sources to promote autonomy and intelligence are some of the urgent challenges of this field that we discuss in this paper. Specifically, we present field-ready soft robots that address targeted productive applications in urban farming, healthcare, land and ocean preservation, disaster remediation, and clean and affordable energy, thus supporting some of the SDGs. By embracing soft robotics as a solution, we can concretely support economic growth and sustainable industry, drive solutions for environmental protection and clean energy, and improve overall health and well-being.
https://arxiv.org/abs/2303.11931
Recent deep-learning-based video compression methods have brought coding gains over conventional codecs such as AVC and HEVC. However, learning-based codecs generally require considerable computation time and model complexity. In this paper, we propose a new lightweight hybrid video codec consisting of a conventional video codec (HEVC/VVC), a lossless image codec, and our new restoration network. Precisely, our encoder consists of the conventional video encoder and a lossless image encoder, transmitting a lossy-compressed video bitstream along with a losslessly-compressed reference frame. The decoder is constructed with the corresponding video/image decoders and a new restoration network, which enhances the compressed video in a two-step process. In the first step, a network trained on a large video dataset restores the details lost by the conventional encoder. Then, we further boost the video quality with the guidance of a reference image, which is a losslessly compressed video frame. The reference image provides video-specific information, which can be utilized to better restore the details of a compressed video. Experimental results show that the proposed method achieves comparable performance to top-tier methods, even when applied to HEVC. Nevertheless, our method has lower complexity, a faster run time, and can be easily integrated into existing conventional codecs.
https://arxiv.org/abs/2303.11592
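A schematic sketch of the decoder-side pipeline may help fix the two-step order: generic restoration first, then reference-guided enhancement. All callables (`video_decoder`, `restore_net`, and so on) are hypothetical placeholders, not names from the paper.

```python
def decode_and_restore(video_bitstream, ref_bitstream,
                       video_decoder, image_decoder,
                       restore_net, guided_net):
    frames = video_decoder(video_bitstream)            # lossy HEVC/VVC frames
    reference = image_decoder(ref_bitstream)           # lossless reference frame
    step1 = [restore_net(f) for f in frames]           # step 1: restore generic codec losses
    return [guided_net(f, reference) for f in step1]   # step 2: video-specific detail boost
```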
Inversion by Direct Iteration (InDI) is a new formulation for supervised image restoration that avoids the so-called "regression to the mean" effect and produces more realistic and detailed images than existing regression-based methods. It does this by gradually improving image quality in small steps, similar to generative denoising diffusion models. Image restoration is an ill-posed problem where multiple high-quality images are plausible reconstructions of a given low-quality input. Therefore, the outcome of a single-step regression model is typically an aggregate of all possible explanations and thus lacks details and realism. The main advantage of InDI is that it does not try to predict the clean target image in a single step but instead gradually improves the image in small steps, resulting in better perceptual quality. While generative denoising diffusion models also work in small steps, our formulation is distinct in that it does not require knowledge of any analytic form of the degradation process. Instead, we directly learn an iterative restoration process from low-quality and high-quality paired examples. InDI can be applied to virtually any image degradation, given paired training data. In conditional denoising diffusion image restoration, the denoising network generates the restored image by repeatedly denoising an initial image of pure noise, conditioned on the degraded input. Contrary to conditional denoising formulations, InDI proceeds by directly and iteratively restoring the input low-quality image, producing high-quality results on a variety of image restoration tasks, including motion and out-of-focus deblurring, super-resolution, compression artifact removal, and denoising.
https://arxiv.org/abs/2303.11435
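The small-step iteration can be sketched in a few lines. The update below follows the interpolation view x_t = (1 - t) x + t y suggested by the formulation, blending the current iterate with the network's clean-image prediction while t shrinks to zero; the network signature and the uniform step schedule are assumptions, not the authors' code.

```python
import torch

@torch.no_grad()
def indi_restore(net, y, steps=10):
    """y: low-quality input; net(x, t) predicts the clean image from iterate x at time t."""
    x = y.clone()
    t = 1.0
    delta = 1.0 / steps
    for _ in range(steps):
        x_hat = net(x, torch.tensor([t]))                # clean-image prediction
        x = (delta / t) * x_hat + (1.0 - delta / t) * x  # small step toward the prediction
        t -= delta
    return x
```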
In this paper, we propose to regularize ill-posed inverse problems using a deep hierarchical variational autoencoder (HVAE) as an image prior. The proposed method synthesizes the advantages of i) denoiser-based Plug-and-Play (PnP) approaches and ii) generative-model-based approaches to inverse problems. First, we exploit VAE properties to design an efficient algorithm that benefits from the convergence guarantees of PnP methods. Second, our approach is not restricted to specialized datasets, and the proposed PnP-HVAE model is able to solve image restoration problems on natural images of any size. Our experiments show that the proposed PnP-HVAE method is competitive with both SOTA denoiser-based PnP approaches and other SOTA restoration methods based on generative models.
https://arxiv.org/abs/2303.11217
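For readers unfamiliar with PnP, a textbook half-quadratic-splitting-style loop looks like the sketch below: a gradient step on the data term alternates with a call to a learned prior (here a generic denoiser-like callable stands in for the HVAE prior). This is generic PnP, not the paper's specific PnP-HVAE algorithm or its convergence conditions.

```python
def pnp_restore(y, A, At, prior, iters=50, step=1.0, sigma=0.05):
    """y: measurements; A/At: forward operator and its adjoint; prior: learned denoiser-like model."""
    x = At(y)                           # crude initialization from the adjoint
    for _ in range(iters):
        x = x - step * At(A(x) - y)     # data-fidelity gradient step
        x = prior(x, sigma)             # prior (proximal) step via the learned model
    return x
```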
Compared to other severe-weather image restoration tasks, single image desnowing is more challenging. This is mainly due to the diversity and irregularity of snow shapes, which makes it extremely difficult to restore images of snowy scenes. Moreover, snow particles also have a veiling effect similar to haze or mist. Although current works can effectively remove snow particles of various shapes, they also bring distortion to the restored image. To address these issues, we propose a novel single image desnowing network called Star-Net. First, we design a Star-type Skip Connection (SSC) to establish information channels between all different-scale features, which can deal with the complex shapes of snow particles. Second, we present a Multi-Stage Interactive Transformer (MIT) as the base module of Star-Net, which is designed to better understand snow particle shapes and to address image distortion by explicitly modeling a variety of important image recovery features. Finally, we propose a Degenerate Filter Module (DFM) to filter the snow particle and snow fog residuals in the SSC over the spatial and channel domains. Extensive experiments show that our Star-Net achieves state-of-the-art snow removal performance on three standard snow removal datasets and retains the original sharpness of the images.
https://arxiv.org/abs/2303.09988
Functional electrical stimulation (FES) has been increasingly integrated with other rehabilitation devices, including robots. FES cycling is one of the common FES applications in rehabilitation, which is performed by stimulating leg muscles in a certain pattern. The appropriate pattern varies across individuals and requires manual tuning, which can be time-consuming and challenging for the individual user. Here, we present an AI-based method for finding the patterns, which requires no extra hardware or sensors. Our method has two phases, starting with finding model-based patterns using reinforcement learning and detailed musculoskeletal models. The models, built using open-source software, can be customised through our automated script and can therefore be used by non-technical individuals without extra cost. Next, our method fine-tunes the pattern using real cycling data. We test our method both in simulation and experimentally on a stationary tricycle. In the simulation test, our method can robustly deliver model-based patterns for different cycling configurations. The experimental evaluation shows that our method can find a model-based pattern that induces a higher cycling speed than an EMG-based pattern. Using just 100 seconds of cycling data, our method can deliver a fine-tuned pattern that gives better cycling performance. Beyond FES cycling, this work is a showcase displaying the feasibility and potential of human-in-the-loop AI in real-world rehabilitation.
https://arxiv.org/abs/2303.09986
The diffusion model (DM) has achieved SOTA performance by modeling the image synthesis process as a sequential application of a denoising network. However, unlike image synthesis, which generates each pixel from scratch, most pixels in image restoration (IR) are given. Thus, for IR, running traditional DMs for massive iterations on a large model to estimate whole images or feature maps is inefficient. To address this issue, we propose an efficient DM for IR (DiffIR), which consists of a compact IR prior extraction network (CPEN), a dynamic IR transformer (DIRformer), and a denoising network. Specifically, DiffIR has two training stages: pretraining and training the DM. In pretraining, we input ground-truth images into CPEN$_{S1}$ to capture a compact IR prior representation (IPR) to guide the DIRformer. In the second stage, we train the DM to directly estimate the same IPR as the pretrained CPEN$_{S1}$ using only LQ images. We observe that since the IPR is only a compact vector, DiffIR can use fewer iterations than traditional DMs to obtain accurate estimations and generate more stable and realistic results. Since the iterations are few, our DiffIR can adopt a joint optimization of CPEN$_{S2}$, the DIRformer, and the denoising network, which can further reduce the influence of estimation error. We conduct extensive experiments on several IR tasks and achieve SOTA performance while consuming less computational cost.
https://arxiv.org/abs/2303.09472
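The two training stages can be summarized in schematic PyTorch, with every module a hypothetical placeholder: stage one distills a compact prior vector from the ground truth to guide the restorer, and stage two trains the diffusion model to predict that same vector from the LQ input alone.

```python
def stage1_step(cpen_s1, dirformer, lq, gt, loss_fn):
    ipr = cpen_s1(gt)                  # compact IR prior representation from the GT
    pred = dirformer(lq, ipr)          # restoration guided by the prior
    return loss_fn(pred, gt)

def stage2_step(cpen_s1, diffusion, lq, gt, loss_fn):
    target_ipr = cpen_s1(gt).detach()  # the vector the DM must learn to estimate
    est_ipr = diffusion(lq)            # few-step diffusion over a compact vector
    return loss_fn(est_ipr, target_ipr)
```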
Shadow removal in a single image has received increasing attention in recent years. However, removing shadows in dynamic scenes remains largely under-explored. In this paper, we propose the first data-driven video shadow removal model, termed PSTNet, by exploiting three essential characteristics of video shadows, i.e., physical property, spatial relation, and temporal coherence. Specifically, we establish a dedicated physical branch to conduct local illumination estimation, which is more applicable for scenes with complex lighting and textures, and then enhance the physical features via a mask-guided attention strategy. Then, we develop a progressive aggregation module to enhance the spatial and temporal characteristics of feature maps and effectively integrate the three kinds of features. Furthermore, to tackle the lack of datasets of paired shadow videos, we synthesize a dataset (SVSRD-85) with the aid of the popular game GTAV by controlling the switch of the shadow renderer. Experiments against 9 state-of-the-art models, including image shadow removers and image/video restoration methods, show that our method improves on the best SOTA in terms of RMSE for the shadow area by 14.7. In addition, we develop a lightweight model adaptation strategy to make our synthetic-driven model effective in real-world scenes. The visual comparison on the public SBU-TimeLapse dataset verifies the generalization ability of our model in real scenes.
https://arxiv.org/abs/2303.09370
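One plausible reading of the mask-guided attention strategy is sketched below: the shadow mask is concatenated with the features and turned into a gate that re-weights them. The exact design is not specified in the abstract, so this is an illustrative assumption, not PSTNet's module.

```python
import torch
import torch.nn as nn

class MaskGuidedAttention(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(ch + 1, ch, 3, padding=1), nn.Sigmoid()
        )

    def forward(self, feats, shadow_mask):
        # feats: (B, C, H, W); shadow_mask: (B, 1, H, W)
        attn = self.gate(torch.cat([feats, shadow_mask], dim=1))
        return feats * (1.0 + attn)     # emphasize shadow-relevant responses
```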
Object detection and single image super-resolution are classic problems in computer vision (CV). The object detection task aims to recognize the objects in input images, while the image restoration task aims to reconstruct high-quality images from given low-quality images. In this paper, a two-stage framework for object detection and image restoration is proposed. The first stage uses YOLO-series algorithms to complete the object detection and then performs image cropping. In the second stage, this work improves the Swin Transformer and uses a newly proposed algorithm to connect Swin Transformer layers, designing a new neural network architecture. We name the newly proposed network for image restoration SwinOIR. This work compares the performance of different versions of the YOLO detection algorithm on the MS COCO and Pascal VOC datasets, demonstrating the suitability of different YOLO network models for the first stage of the framework in different scenarios. For the image super-resolution task, it compares the performance of different methods of connecting Swin Transformer layers and designs different sizes of SwinOIR for use in different real-life scenarios. Our implementation code is released at this https URL.
https://arxiv.org/abs/2303.09190
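The two-stage framework reduces to a short pipeline sketch: a YOLO-style detector produces boxes, the image is cropped per box, and the SR network restores each crop. `detector` and `sr_model` are placeholder callables, and numpy-style array indexing is assumed.

```python
def detect_then_restore(image, detector, sr_model):
    """image: (H, W, 3) array; detector returns (x1, y1, x2, y2) boxes."""
    restored_crops = []
    for (x1, y1, x2, y2) in detector(image):   # stage 1: object detection
        crop = image[y1:y2, x1:x2]             # crop the detected region
        restored_crops.append(sr_model(crop))  # stage 2: super-resolve the crop
    return restored_crops
```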
Despite the remarkable achievements of recent underwater image restoration techniques, the lack of labeled data has become a major hurdle for further progress. In this work, we propose a mean-teacher based Semi-supervised Underwater Image Restoration (Semi-UIR) framework to incorporate unlabeled data into network training. However, the naive mean-teacher method suffers from two main problems: (1) The consistency loss used in training might become ineffective when the teacher's prediction is wrong. (2) Using the L1 distance may cause the network to overfit wrong labels, resulting in confirmation bias. To address the above problems, we first introduce a reliable bank to store the "best-ever" outputs as pseudo ground truth. To assess the quality of outputs, we conduct an empirical analysis based on the monotonicity property to select the most trustworthy NR-IQA method. Besides, in view of the confirmation bias problem, we incorporate contrastive regularization to prevent overfitting on wrong labels. Experimental results on both full-reference and non-reference underwater benchmarks demonstrate that our algorithm has obvious improvement over SOTA methods quantitatively and qualitatively. Code has been released at this https URL.
https://arxiv.org/abs/2303.09101
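The reliable bank is easy to sketch: for each unlabeled sample, keep the best-ever teacher output as the pseudo ground truth, judged by a no-reference IQA score. `nr_iqa` below is a placeholder for whichever NR-IQA method the monotonicity analysis selects, and `teacher_out` is assumed to be a tensor.

```python
def update_reliable_bank(bank, sample_id, teacher_out, nr_iqa):
    """Keep the best-ever teacher output per sample as pseudo ground truth."""
    score = nr_iqa(teacher_out)
    best = bank.get(sample_id)
    if best is None or score > best["score"]:
        bank[sample_id] = {"pseudo_gt": teacher_out.detach(), "score": score}
    return bank[sample_id]["pseudo_gt"]
```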
Adapting the Diffusion Probabilistic Model (DPM) for direct image super-resolution is wasteful, given that a simple Convolutional Neural Network (CNN) can recover the main low-frequency content. Therefore, we present ResDiff, a novel Diffusion Probabilistic Model based on a residual structure for Single Image Super-Resolution (SISR). ResDiff utilizes a combination of a CNN, which restores the primary low-frequency components, and a DPM, which predicts the residual between the ground-truth image and the CNN-predicted image. In contrast to common diffusion-based methods that directly use LR images to guide the noise towards HR space, ResDiff utilizes the CNN's initial prediction to direct the noise towards the residual space between the HR space and the CNN-predicted space, which not only accelerates the generation process but also yields superior sample quality. Additionally, a frequency-domain-based loss function for the CNN is introduced to facilitate its restoration, and a frequency-domain-guided diffusion is designed for the DPM to predict high-frequency details. Extensive experiments on multiple benchmark datasets demonstrate that ResDiff outperforms previous diffusion-based methods in terms of shorter model convergence time, superior generation quality, and more diverse samples.
https://arxiv.org/abs/2303.08714
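At sampling time the division of labor is simple enough to sketch: the CNN supplies the low-frequency base, and the diffusion model samples only the residual on top of it. `ddpm_sample` stands in for any standard conditional DDPM sampler; this is a schematic, not the released code.

```python
def resdiff_super_resolve(cnn, ddpm_sample, lr_image):
    base = cnn(lr_image)                    # low-frequency prediction from the CNN
    residual = ddpm_sample(condition=base)  # diffusion sampling in residual space
    return base + residual                  # final high-resolution estimate
```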
Blind face restoration usually synthesizes degraded low-quality data with a pre-defined degradation model for training, while more complex cases can occur in the real world. This gap between the assumed and actual degradation hurts restoration performance, and artifacts are often observed in the output. However, it is expensive and infeasible to include every type of degradation in the training data to cover real-world cases. To tackle this robustness issue, we propose the Diffusion-based Robust Degradation Remover (DR2), which first transforms the degraded image into a coarse but degradation-invariant prediction, then employs an enhancement module to restore the coarse prediction to a high-quality image. By leveraging a well-performing denoising diffusion probabilistic model, our DR2 diffuses input images to a noisy state where various types of degradation give way to Gaussian noise, and then captures semantic information through iterative denoising steps. As a result, DR2 is robust against common degradations (e.g. blur, resizing, noise, and compression) and compatible with different designs of enhancement modules. Experiments in various settings show that our framework outperforms state-of-the-art methods on heavily degraded synthetic and real-world datasets.
https://arxiv.org/abs/2303.06885
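The diffuse-then-denoise step can be sketched with standard DDPM notation: forward-diffuse the degraded input to an intermediate timestep tau, where the degradation is submerged in Gaussian noise, then run the reverse process from there. Schedules, the choice of tau, and the enhancement module are all omitted; `reverse_step` is a placeholder.

```python
import torch

def dr2_coarse_restore(x_degraded, alphas_cumprod, reverse_step, tau):
    """alphas_cumprod: (T,) tensor of DDPM cumulative alphas; reverse_step(x, t) -> x at t-1."""
    a = alphas_cumprod[tau]
    noise = torch.randn_like(x_degraded)
    x = a.sqrt() * x_degraded + (1.0 - a).sqrt() * noise   # forward-diffuse to step tau
    for t in range(tau, -1, -1):                           # iterative denoising back to t = 0
        x = reverse_step(x, t)
    return x                                               # coarse, degradation-invariant output
```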
This work presents an effective depth-consistency self-prompt Transformer for image dehazing. It is motivated by the observation that the estimated depths of an image with haze residuals and of its clear counterpart differ. Enforcing the depth consistency of dehazed images with clear ones is therefore essential for dehazing. For this purpose, we develop a prompt based on the features of the depth difference between hazy input images and their corresponding clear counterparts, which can guide dehazing models toward better restoration. Specifically, we first apply deep features extracted from the input images to the depth difference features to generate the prompt, which contains the haze residual information of the input. Then we propose a prompt embedding module designed to perceive the haze residuals by linearly adding the prompt to the deep features. Further, we develop an effective prompt attention module to pay more attention to haze residuals for better removal. By incorporating the prompt, prompt embedding, and prompt attention into an encoder-decoder network based on VQGAN, we can achieve better perceptual quality. As the depths of clear images are not available at inference time, and dehazed images produced by a single feed-forward pass may still contain a portion of haze residuals, we propose a new continuous self-prompt inference that can iteratively correct the dehazing model toward better haze-free image generation. Extensive experiments show that our method performs favorably against state-of-the-art approaches on both synthetic and real-world datasets in terms of perception metrics including NIQE, PI, and PIQE.
https://arxiv.org/abs/2303.07033
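The prompt embedding module, described as linearly adding the prompt to the deep features, admits a very small sketch; the 1x1 projection used to align the prompt with the feature space is an illustrative assumption.

```python
import torch.nn as nn

class PromptEmbed(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.proj = nn.Conv2d(ch, ch, 1)   # align the prompt with the feature space

    def forward(self, feats, prompt):
        return feats + self.proj(prompt)   # linear addition, as described in the text
```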
In supervised image restoration tasks, one key issue is how to obtain aligned high-quality (HQ) and low-quality (LQ) training image pairs. Unfortunately, such HQ-LQ training pairs are hard to capture in practice and hard to synthesize due to the complex unknown degradations in the wild. While several sophisticated degradation models have been manually designed to synthesize LQ images from their HQ counterparts, the distribution gap between the synthesized and real-world LQ images remains large. We propose a new approach to synthesizing realistic image restoration training pairs using the emerging denoising diffusion probabilistic model (DDPM). First, we train a DDPM, which can convert a noisy input into the desired LQ image, on a large set of collected LQ images that define the target data distribution. Then, for a given HQ image, we synthesize an initial LQ image using an off-the-shelf degradation model and iteratively add proper Gaussian noise to it. Finally, we denoise the noisy LQ image using the pre-trained DDPM to obtain the final LQ image, which falls into the target distribution of real-world LQ images. Thanks to the strong capability of the DDPM in distribution approximation, the synthesized HQ-LQ image pairs can be used to train robust models for real-world image restoration tasks, such as blind face image restoration and blind image super-resolution. Experiments demonstrate the superiority of our proposed approach over existing degradation models. Code and data will be released.
https://arxiv.org/abs/2303.06994
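The synthesis pipeline can be sketched in standard DDPM notation: degrade the HQ image with any off-the-shelf model, push the result part-way into the forward diffusion process, and let a DDPM pretrained on real LQ images denoise it back toward the real-world LQ distribution. The helper names and the fixed noise level `tau` are assumptions.

```python
import torch

def synthesize_lq(hq, degrade, ddpm_denoise, alphas_cumprod, tau):
    """degrade: off-the-shelf degradation model; ddpm_denoise: DDPM trained on real LQ images."""
    lq0 = degrade(hq)                                       # initial synthetic LQ image
    a = alphas_cumprod[tau]
    noisy = a.sqrt() * lq0 + (1.0 - a).sqrt() * torch.randn_like(lq0)
    return ddpm_denoise(noisy, start_t=tau)                 # denoise toward the real-LQ manifold
```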
The Light Field (LF) deblurring task is a challenging problem, as blur can be caused by different factors such as camera shake and object motion. Single image deblurring methods are one possible way to solve this problem. However, since they deal with each view independently and cannot effectively utilize and maintain the LF structure, the restoration effect is usually not ideal. Besides, LF blur is more complex because its degree is affected by view and depth. Therefore, we carefully design a novel LF deblurring network based on these LF blur characteristics. On the one hand, since the blur degree varies a lot across views, we design a novel view-adaptive spatial convolution to deblur blurred LFs, which calculates an exclusive convolution kernel for each view. On the other hand, because the blur degree also varies with the depth of the object, a depth-perception view attention is designed to deblur different depth areas by selectively integrating information from different views. Besides, we introduce an angular position embedding to better maintain the LF structure, which ensures the model correctly restores the view information. Quantitative and qualitative experimental results on synthetic and real images show that the deblurring effect of our method is better than that of other state-of-the-art methods.
https://arxiv.org/abs/2303.06860
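A view-adaptive spatial convolution can be sketched with dynamic filtering: a tiny head predicts a separate k x k kernel for every LF view, which is then applied via unfold and a weighted sum. This is purely illustrative; the paper's actual module is not specified at this level of detail in the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewAdaptiveConv(nn.Module):
    def __init__(self, ch, k=3):
        super().__init__()
        self.k = k
        self.kernel_head = nn.Sequential(          # predicts one k*k kernel per view
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(ch, k * k)
        )

    def forward(self, views):                      # views: (V, C, H, W)
        V, C, H, W = views.shape
        kernels = self.kernel_head(views).softmax(-1)            # (V, k*k)
        patches = F.unfold(views, self.k, padding=self.k // 2)   # (V, C*k*k, H*W)
        patches = patches.view(V, C, self.k * self.k, H * W)
        out = (patches * kernels[:, None, :, None]).sum(2)       # apply per-view kernel
        return out.view(V, C, H, W)
```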
In recent years, we have witnessed the great advancement of deep neural networks (DNNs) in image restoration. However, a critical limitation is that they cannot generalize well to real-world degradations of different degrees or types. In this paper, we are the first to propose a novel training strategy for image restoration from the causality perspective, to improve the generalization ability of DNNs for unknown degradations. Our method, termed Distortion Invariant representation Learning (DIL), treats each distortion type and degree as one specific confounder, and learns the distortion-invariant representation by eliminating the harmful confounding effect of each degradation. We derive our DIL with the back-door criterion in causality by modeling the interventions of different distortions from the optimization perspective. In particular, we introduce counterfactual distortion augmentation to simulate virtual distortion types and degrees as the confounders. Then, we instantiate the intervention of each distortion with a virtual model update based on the corresponding distorted images, and eliminate them from the meta-learning perspective. Extensive experiments demonstrate the effectiveness of our DIL on generalization to unseen distortion types and degrees. Our code will be available at this https URL.
https://arxiv.org/abs/2303.06859
Normalizing flow models using invertible neural networks (INNs) have been widely investigated for successful generative image super-resolution (SR) by learning the transformation between the normal distribution of a latent variable $z$ and the conditional distribution of high-resolution (HR) images given a low-resolution (LR) input. Recently, image rescaling models like IRN utilize the bidirectional nature of INNs to push the performance limit of image upscaling by optimizing the downscaling and upscaling steps jointly. While random sampling of the latent variable $z$ is useful for generating diverse photo-realistic images, it is not desirable for image rescaling when accurate restoration of the HR image is more important. Hence, in place of random sampling of $z$, we propose auxiliary encoding modules to further push the limit of image rescaling performance. Two options for storing the encoded latent variables in downscaled LR images, both readily supported by existing image file formats, are proposed. One is saved as the alpha channel, the other as meta-data in the image header, and the corresponding modules are denoted with the suffixes -A and -M respectively. Optimal network architectural changes are investigated for both options to demonstrate their effectiveness in raising the rescaling performance limit on different baseline models including IRN and DLV-IRN.
https://arxiv.org/abs/2303.06747
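The "-A" storage option is concrete enough to sketch: quantize the encoded latent to 8 bits and write it as the alpha channel of the downscaled RGB image, so a single RGBA PNG carries everything the upscaler needs. The quantization scheme below is an illustrative assumption, not the paper's exact encoding.

```python
import numpy as np
from PIL import Image

def save_lr_with_latent(lr_rgb, latent, path):
    """lr_rgb: (H, W, 3) uint8; latent: (H, W) float in [0, 1]; path should be a .png file."""
    alpha = np.clip(latent * 255.0, 0, 255).astype(np.uint8)   # 8-bit quantization
    rgba = np.dstack([lr_rgb, alpha])                          # latent rides in the alpha channel
    Image.fromarray(rgba, mode="RGBA").save(path)

def load_lr_with_latent(path):
    rgba = np.asarray(Image.open(path).convert("RGBA"))
    return rgba[..., :3], rgba[..., 3].astype(np.float32) / 255.0
```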
Diffusion models have recently received a surge of interest due to their impressive performance for image restoration, especially in terms of noise robustness. However, existing diffusion-based methods are trained on a large amount of training data and perform very well in-distribution, but can be quite susceptible to distribution shift. This is especially inappropriate for data-starved hyperspectral image (HSI) restoration. To tackle this problem, this work puts forth a self-supervised diffusion model for HSI restoration, namely the Denoising Diffusion Spatio-Spectral Model (DDS2M), which works by inferring the parameters of the proposed Variational Spatio-Spectral Module (VS2M) during the reverse diffusion process, solely using the degraded HSI without any extra training data. In VS2M, a variational-inference-based loss function is customized to enable the untrained spatial and spectral networks to learn the posterior distribution, which serves as the transitions of the sampling chain to help reverse the diffusion process. Benefiting from its self-supervised nature and the diffusion process, DDS2M enjoys stronger generalization ability to various HSIs compared to existing diffusion-based methods and superior robustness to noise compared to existing HSI restoration methods. Extensive experiments on HSI denoising, noisy HSI completion and super-resolution on a variety of HSIs demonstrate DDS2M's superiority over the existing task-specific state-of-the-art.
https://arxiv.org/abs/2303.06682