How to generate the ground-truth (GT) image is a critical issue for training realistic image super-resolution (Real-ISR) models. Existing methods mostly take a set of high-resolution (HR) images as GTs and apply various degradations to simulate their low-resolution (LR) counterparts. Though great progress has been achieved, such an LR-HR pair generation scheme has several limitations. First, the perceptual quality of HR images may not be high enough, limiting the quality of Real-ISR outputs. Second, existing schemes do not consider much human perception in GT generation, and the trained models tend to produce over-smoothed results or unpleasant artifacts. With the above considerations, we propose a human guided GT generation scheme. We first elaborately train multiple image enhancement models to improve the perceptual quality of HR images, enabling one LR image to have multiple HR counterparts. Human subjects are then involved to annotate the high-quality regions among the enhanced HR images as GTs and to label the regions with unpleasant artifacts as negative samples. A human guided GT image dataset with both positive and negative samples is then constructed, and a loss function is proposed to train the Real-ISR models. Experiments show that the Real-ISR models trained on our dataset can produce perceptually more realistic results with fewer artifacts. The dataset and code can be found at this https URL.
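The abstract mentions a loss function built on human-annotated positive and negative regions but does not spell it out. Below is a minimal, hypothetical PyTorch sketch of how such a region-masked loss could be composed; the function name, the hinge term, and all weights are assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def human_guided_loss(sr, gt_pos, neg, pos_mask, neg_mask, margin=0.05, w_neg=0.1):
    """Illustrative loss combining positive (high-quality) and negative (artifact) regions.

    sr       : (B, C, H, W) Real-ISR network output
    gt_pos   : (B, C, H, W) human-selected high-quality GT
    neg      : (B, C, H, W) enhanced version flagged as containing artifacts
    pos_mask : (B, 1, H, W) 1 where the region was annotated as high quality
    neg_mask : (B, 1, H, W) 1 where the region was annotated as artifact-prone
    """
    c = sr.shape[1]
    # pull the output toward the human-approved GT in positive regions
    l_pos = (pos_mask * (sr - gt_pos).abs()).sum() / (pos_mask.sum() * c + 1e-8)
    # push the output away from the artifact pattern in negative regions (hinge on L1 distance)
    d_neg = (neg_mask * (sr - neg).abs()).sum() / (neg_mask.sum() * c + 1e-8)
    l_neg = F.relu(margin - d_neg)
    return l_pos + w_neg * l_neg
```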
https://arxiv.org/abs/2303.13069
While the performance of deep convolutional neural networks for image super-resolution (SR) has improved significantly, the rapid increase of memory and computation requirements hinders their deployment on resource-constrained devices. Quantized networks, especially binary neural networks (BNN) for SR, have been proposed to significantly improve the model inference efficiency but suffer from large performance degradation. We observe that the activation distribution of SR networks demonstrates very large pixel-to-pixel, channel-to-channel, and image-to-image variation, which is important for high-performance SR but gets lost during binarization. To address the problem, we propose two effective methods, including spatial re-scaling as well as channel-wise shifting and re-scaling, which augment binary convolutions by retaining more spatial and channel-wise information. Our proposed models, dubbed EBSR, demonstrate superior performance over prior-art methods both quantitatively and qualitatively across different datasets and different model sizes. Specifically, for x4 SR on Set5 and Urban100, EBSRlight improves the PSNR by 0.31 dB and 0.28 dB compared to SRResNet-E2FIF, respectively, while EBSR outperforms EDSR-E2FIF by 0.29 dB and 0.32 dB PSNR, respectively.
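As a rough illustration of the channel-wise shifting and re-scaling idea, the sketch below augments an XNOR-style binary convolution with learnable per-channel shift and scale parameters applied before binarization; the layer structure and names are assumptions and a real BNN would also need a straight-through estimator for training, so this is only a sketch of the idea, not the EBSR block itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinaryConvCSR(nn.Module):
    """Binary 3x3 conv augmented with channel-wise shift and re-scale (illustrative only)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.01)
        # learnable channel-wise shift and scale restore per-channel statistics
        # that a plain sign() binarization would discard
        self.shift = nn.Parameter(torch.zeros(1, in_ch, 1, 1))
        self.scale = nn.Parameter(torch.ones(1, in_ch, 1, 1))

    def forward(self, x):
        x = (x + self.shift) * self.scale            # channel-wise shifting and re-scaling
        xb = torch.sign(x)                           # binarized activations (STE needed for training)
        wb = torch.sign(self.weight)                 # binarized weights
        # per-filter scaling factor (mean absolute weight), as in XNOR-style BNNs
        alpha = self.weight.abs().mean(dim=(1, 2, 3)).view(1, -1, 1, 1)
        return F.conv2d(xb, wb, padding=1) * alpha
```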
https://arxiv.org/abs/2303.12270
We present a novel approach to synthesise high-resolution isotropic 3D abdominal MR images from anisotropic 3D images in an unpaired fashion. Using a modified CycleGAN architecture with a gradient mapping loss, we leverage disjoint patches from the high-resolution (in-plane) data of an anisotropic volume to enforce the network generator to increase the resolution of the low-resolution (through-plane) slices. This will enable accelerated whole-abdomen scanning with high-resolution isotropic images within short breath-hold times.
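A plausible building block for the mentioned gradient mapping loss is a finite-difference gradient map of the synthesised slices, sketched below; the exact definition of the loss in the paper may differ, and this helper is only a generic ingredient.

```python
import torch

def gradient_magnitude(x):
    """Finite-difference gradient magnitude of a 2-D slice batch (B, C, H, W).

    A generic building block for gradient-based losses, not the paper's exact loss.
    """
    gx = x[:, :, :, 1:] - x[:, :, :, :-1]     # horizontal differences
    gy = x[:, :, 1:, :] - x[:, :, :-1, :]     # vertical differences
    # crop to a common size before combining the two directional differences
    return torch.sqrt(gx[:, :, :-1, :] ** 2 + gy[:, :, :, :-1] ** 2 + 1e-12)
```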
https://arxiv.org/abs/2303.11831
Lightweight neural networks for single-image super-resolution (SISR) tasks have made substantial breakthroughs in recent years. Compared to low-frequency information, high-frequency detail is much more difficult to reconstruct. Most SISR models allocate equal computational resources to low-frequency and high-frequency information, which leads to redundant processing of simple low-frequency information and inadequate recovery of more challenging high-frequency information. We propose a novel High-Frequency Focused Network (HFFN) through High-Frequency Focused Blocks (HFFBs) that selectively enhance high-frequency information while minimizing redundant feature computation on low-frequency information. The HFFB effectively allocates more computational resources to the more challenging reconstruction of high-frequency information. Moreover, we propose a Local Feature Fusion Block (LFFB) that effectively fuses features from multiple HFFBs in a local region, utilizing complementary information across layers to enhance feature representativeness and reduce artifacts in reconstructed images. We assess the efficacy of the proposed HFFN on five benchmark datasets and show that it significantly enhances the super-resolution performance of the network. Our experimental results demonstrate state-of-the-art performance in reconstructing high-frequency information while using a low number of parameters.
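To make the low-/high-frequency split concrete, here is an illustrative decomposition of a feature map into low- and high-frequency parts with a simple box blur; the HFFB itself relies on learned operations, so this is only a sketch of the motivation, with the kernel size as an assumed parameter.

```python
import torch
import torch.nn.functional as F

def split_frequencies(x, ksize=5):
    """Split (B, C, H, W) features into low- and high-frequency parts via a box blur."""
    pad = ksize // 2
    # depthwise averaging kernel, one per channel
    kernel = torch.ones(x.shape[1], 1, ksize, ksize, device=x.device) / (ksize * ksize)
    low = F.conv2d(F.pad(x, (pad,) * 4, mode='reflect'), kernel, groups=x.shape[1])
    high = x - low                 # the part that is hardest to reconstruct
    return low, high
```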
https://arxiv.org/abs/2303.11701
In this paper, we propose a scribble-based video colorization network with temporal aggregation called SVCNet. It can colorize monochrome videos based on different user-given color scribbles. It addresses three common issues in the scribble-based video colorization area: colorization vividness, temporal consistency, and color bleeding. To improve the colorization quality and strengthen the temporal consistency, we adopt two sequential sub-networks in SVCNet for precise colorization and temporal smoothing, respectively. The first stage includes a pyramid feature encoder to incorporate color scribbles with a grayscale frame, and a semantic feature encoder to extract semantics. The second stage finetunes the output from the first stage by aggregating the information of neighboring colorized frames (as short-range connections) and the first colorized frame (as a long-range connection). To alleviate the color bleeding artifacts, we learn video colorization and segmentation simultaneously. Furthermore, we perform the majority of operations at a fixed small image resolution and use a Super-resolution Module at the tail of SVCNet to recover original sizes. This allows SVCNet to handle different image resolutions at inference time. Finally, we evaluate the proposed SVCNet on the DAVIS and Videvo benchmarks. The experimental results demonstrate that SVCNet produces both higher-quality and more temporally consistent videos than other well-known video colorization approaches. The code and models can be found at this https URL.
https://arxiv.org/abs/2303.11591
Inversion by Direct Iteration (InDI) is a new formulation for supervised image restoration that avoids the so-called "regression to the mean" effect and produces more realistic and detailed images than existing regression-based methods. It does this by gradually improving image quality in small steps, similar to generative denoising diffusion models. Image restoration is an ill-posed problem where multiple high-quality images are plausible reconstructions of a given low-quality input. Therefore, the outcome of a single-step regression model is typically an aggregate of all possible explanations and thus lacks details and realism. The main advantage of InDI is that it does not try to predict the clean target image in a single step but instead gradually improves the image in small steps, resulting in better perceptual quality. While generative denoising diffusion models also work in small steps, our formulation is distinct in that it does not require knowledge of any analytic form of the degradation process. Instead, we directly learn an iterative restoration process from low-quality and high-quality paired examples. InDI can be applied to virtually any image degradation, given paired training data. In conditional denoising diffusion image restoration, the denoising network generates the restored image by repeatedly denoising an initial image of pure noise, conditioned on the degraded input. Contrary to conditional denoising formulations, InDI directly proceeds by iteratively restoring the input low-quality image, producing high-quality results on a variety of image restoration tasks, including motion and out-of-focus deblurring, super-resolution, compression artifact removal, and denoising.
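A minimal sketch of the small-step inference loop described above is given below, assuming a restoration network `model(x, t)` that predicts the clean image at degradation level t; the blending rule follows the gradual-improvement idea stated in the abstract and may differ in detail from the paper's exact schedule.

```python
import torch

@torch.no_grad()
def indi_restore(model, y, steps=20):
    """Iteratively restore a low-quality input y by blending in the model's prediction.

    Starts from the degraded image (t = 1) and walks t down to 0 in small steps,
    each step mixing the current estimate with the predicted clean image.
    """
    x = y.clone()
    ts = torch.linspace(1.0, 0.0, steps + 1)
    for i in range(steps):
        t, t_next = ts[i].item(), ts[i + 1].item()
        d = t - t_next                                            # step size
        t_batch = torch.full((x.shape[0],), t, device=x.device)   # per-sample level
        pred = model(x, t_batch)                                  # predicted clean image
        x = (d / t) * pred + (1.0 - d / t) * x                    # small blended step
    return x
```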
https://arxiv.org/abs/2303.11435
The spatial resolution of images of living samples obtained by fluorescence microscopes is physically limited due to the diffraction of visible light, which makes the study of entities of size less than the diffraction barrier (around 200 nm in the x-y plane) very challenging. To overcome this limitation, several deconvolution and super-resolution techniques have been proposed. Within the framework of inverse problems, modern approaches in fluorescence microscopy reconstruct a super-resolved image from a temporal stack of frames by carefully designing suitable hand-crafted sparsity-promoting regularisers. Numerically, such approaches are solved by proximal gradient-based iterative schemes. Aiming at obtaining a reconstruction more adapted to sample geometries (e.g. thin filaments), we adopt a plug-and-play denoising approach with convergence guarantees and replace the proximity operator associated with the explicit image regulariser with an image denoiser (i.e. a pre-trained network) which, upon appropriate training, mimics the action of an implicit prior. To account for the independence of the fluctuations between molecules, the model relies on second-order statistics. The denoiser is then trained on covariance images coming from data representing sequences of fluctuating fluorescent molecules with filament structure. The method is evaluated on both simulated and real fluorescence microscopy images, showing its ability to correctly reconstruct filament structures with high values of peak signal-to-noise ratio (PSNR).
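The plug-and-play scheme described here replaces the proximity operator of an explicit regulariser with a pre-trained denoiser inside a proximal (forward-backward) iteration. A generic sketch, assuming user-supplied `denoiser` and `grad_fidelity` callables; step size and iteration count are placeholders.

```python
import torch

@torch.no_grad()
def pnp_forward_backward(denoiser, grad_fidelity, x0, step=1e-2, iters=200):
    """Plug-and-play forward-backward iteration (illustrative).

    denoiser(x)      : pre-trained network acting as an implicit prior
    grad_fidelity(x) : gradient of the data-fidelity term at x
    """
    x = x0.clone()
    for _ in range(iters):
        x = x - step * grad_fidelity(x)   # gradient step on the fidelity term
        x = denoiser(x)                   # denoiser replaces the proximity operator
    return x
```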
https://arxiv.org/abs/2303.11212
The channel attention mechanism is a useful technique widely employed in deep convolutional neural networks to boost performance on image processing tasks, e.g., image classification and image super-resolution. It is usually designed as a parameterized sub-network and embedded into the convolutional layers of the network to learn more powerful feature representations. However, current channel attention introduces more parameters and therefore leads to higher computational costs. To deal with this issue, in this work, we propose a Parameter-Free Channel Attention (PFCA) module to boost the performance of popular image classification and image super-resolution networks while completely avoiding the parameter growth of channel attention. Experiments on CIFAR-100, ImageNet, and DIV2K validate that our PFCA module improves the performance of ResNet on image classification and of MSRResNet on image super-resolution, while bringing little growth in parameters and FLOPs.
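The abstract does not give the PFCA formula, so the sketch below is only a hypothetical example of a parameter-free channel attention: it re-weights channels using statistics of the feature map itself, with no learnable parameters. The actual PFCA formulation in the paper may differ.

```python
import torch
import torch.nn as nn

class ParamFreeChannelAttention(nn.Module):
    """Channel re-weighting from feature statistics alone (illustrative, no parameters)."""
    def forward(self, x):
        # per-channel variance over spatial positions
        var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        # channels with larger spatial variation get higher attention weights
        score = var / (var.mean(dim=1, keepdim=True) + 1e-6)
        return x * torch.sigmoid(score)
```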
https://arxiv.org/abs/2303.11055
Facial action unit detection has emerged as an important task within facial expression analysis, aimed at detecting specific pre-defined, objective facial expressions, such as lip tightening and cheek raising. This paper presents our submission to the Affective Behavior Analysis in-the-wild (ABAW) 2023 Competition for AU detection. We propose a multi-modal method for facial action unit detection with visual, acoustic, and lexical features extracted from large pre-trained models. To provide high-quality details for visual feature extraction, we apply super-resolution and face alignment to the training data and show a potential performance gain. Our approach achieves an F1 score of 52.3% on the official validation set of the 5th ABAW Challenge.
https://arxiv.org/abs/2303.10590
A computer vision system using low-resolution image sensors can provide intelligent services (e.g., activity recognition) while withholding unnecessary visual privacy information at the hardware level. However, preserving visual privacy and enabling accurate machine recognition place conflicting demands on image resolution. Modeling the trade-off of privacy preservation and machine recognition performance can guide future privacy-preserving computer vision systems using low-resolution image sensors. In this paper, using at-home activities of daily living (ADLs) as the scenario, we first obtained the most important visual privacy features through a user survey. Then we quantified and analyzed the effects of image resolution on human and machine recognition performance in activity recognition and privacy awareness tasks. We also investigated how modern image super-resolution techniques influence these effects. Based on the results, we proposed a method for modeling the trade-off of privacy preservation and activity recognition on low-resolution images.
https://arxiv.org/abs/2303.10435
Super-resolution, which aims to reconstruct high-resolution images from low-resolution images, has drawn considerable attention and has been intensively studied in computer vision and remote sensing communities. The super-resolution technology is especially beneficial for Unmanned Aerial Vehicles (UAV), as the amount and resolution of images captured by UAV are highly limited by physical constraints such as flight altitude and load capacity. In the wake of the successful application of deep learning methods in the super-resolution task, in recent years, a series of super-resolution algorithms have been developed. In this paper, for the super-resolution of UAV images, a novel network based on the state-of-the-art Swin Transformer is proposed with better efficiency and competitive accuracy. Meanwhile, as one of the essential applications of the UAV is land cover and land use monitoring, simple image quality assessments such as the Peak-Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM) are not enough to comprehensively measure the performance of an algorithm. Therefore, we further investigate the effectiveness of super-resolution methods using the accuracy of semantic segmentation. The code will be available at this https URL.
https://arxiv.org/abs/2303.10232
Gaze tracking is a valuable tool with a broad range of applications in various fields, including medicine, psychology, virtual reality, marketing, and safety. Therefore, it is essential to have gaze tracking software that is cost-efficient and high-performing. Accurately predicting gaze remains a difficult task, particularly in real-world situations where images are affected by motion blur, video compression, and noise. Super-resolution has been shown to improve image quality from a visual perspective. This work examines the usefulness of super-resolution for improving appearance-based gaze tracking. We show that not all SR models preserve the gaze direction. We propose a two-step framework based on the SwinIR super-resolution model. The proposed method consistently outperforms the state-of-the-art, particularly in scenarios involving low-resolution or degraded images. Furthermore, we examine the use of super-resolution through the lens of self-supervised learning for gaze prediction. Self-supervised learning aims to learn from unlabelled data to reduce the amount of required labeled data for downstream tasks. We propose a novel architecture called SuperVision by fusing an SR backbone network to a ResNet18 (with some skip connections). The proposed SuperVision method uses 5x less labeled data and yet outperforms, by 15%, the state-of-the-art method GazeTR, which uses 100% of the training data.
https://arxiv.org/abs/2303.10151
Existing real-world video super-resolution (VSR) methods focus on designing a general degradation pipeline for open-domain videos while ignoring intrinsic data characteristics, which strongly limits their performance when applied to specific domains (e.g., animation videos). In this paper, we thoroughly explore the characteristics of animation videos and leverage the rich priors in real-world animation data for a more practical animation VSR model. In particular, we propose a multi-scale Vector-Quantized Degradation model for animation video Super-Resolution (VQD-SR) to decompose the local details from global structures and transfer the degradation priors in real-world animation videos to a learned vector-quantized codebook for degradation modeling. A rich-content Real Animation Low-quality (RAL) video dataset is collected for extracting the priors. We further propose a data enhancement strategy for high-resolution (HR) training videos based on our observation that existing HR videos are mostly collected from the Web and contain conspicuous compression artifacts. The proposed strategy effectively lifts the upper bound of animation VSR performance, regardless of the specific VSR model. Experimental results demonstrate the superiority of the proposed VQD-SR over state-of-the-art methods, through extensive quantitative and qualitative evaluations on the latest animation video super-resolution benchmark.
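At the core of VQD-SR is a learned vector-quantized codebook of degradation priors. The generic nearest-neighbour lookup below illustrates the vector-quantization step itself; the multi-scale structure and the training of the actual codebook are not shown, and the shapes are assumptions for illustration.

```python
import torch

def vq_lookup(z, codebook):
    """Quantize features to their nearest codebook entries (generic VQ step).

    z        : (N, D) feature vectors
    codebook : (K, D) learned code entries (e.g., degradation priors)
    returns  : quantized features (N, D) and their code indices (N,)
    """
    d = torch.cdist(z, codebook)   # pairwise distances to every codebook entry
    idx = d.argmin(dim=1)          # nearest code for each feature vector
    return codebook[idx], idx
```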
https://arxiv.org/abs/2303.09826
Previous works have shown that increasing the window size for Transformer-based image super-resolution models (e.g., SwinIR) can significantly improve the model performance, but the computation overhead is also considerable. In this paper, we present SRFormer, a simple but novel method that can enjoy the benefit of large window self-attention but introduces even less computational burden. The core of our SRFormer is the permuted self-attention (PSA), which strikes an appropriate balance between the channel and spatial information for self-attention. Our PSA is simple and can be easily applied to existing super-resolution networks based on window self-attention. Without any bells and whistles, we show that our SRFormer achieves a 33.86 dB PSNR score on the Urban100 dataset, which is 0.46 dB higher than that of SwinIR but uses fewer parameters and computations. We hope our simple and effective approach can serve as a useful tool for future research in super-resolution model design.
https://arxiv.org/abs/2303.09735
The field of image super-resolution (SR) has witnessed extensive neural network designs from CNN to transformer architectures. However, prevailing SR models suffer from prohibitive memory footprint and intensive computations, which limits further deployment on computational-constrained platforms. In this work, we investigate the potential of network pruning for super-resolution to take advantage of off-the-shelf network designs and reduce the underlying computational overhead. Two main challenges remain in applying pruning methods for SR. First, the widely-used filter pruning technique reflects limited granularity and restricted adaptability to diverse network structures. Second, existing pruning methods generally operate upon a pre-trained network for the sparse structure determination, failing to get rid of dense model training in the traditional SR paradigm. To address these challenges, we adopt unstructured pruning with sparse models directly trained from scratch. Specifically, we propose a novel Iterative Soft Shrinkage-Percentage (ISS-P) method by optimizing the sparse structure of a randomly initialized network at each iteration and tweaking unimportant weights with a small amount proportional to the magnitude scale on-the-fly. We observe that the proposed ISS-P could dynamically learn sparse structures adapting to the optimization process and preserve the sparse model's trainability by yielding a more regularized gradient throughput. Experiments on benchmark datasets demonstrate the effectiveness of the proposed ISS-P compared with state-of-the-art methods over diverse network architectures.
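A hedged sketch of one ISS-P-style update is shown below: at each training iteration, the weights below a magnitude threshold are shrunk by a small factor rather than zeroed out, which is the "soft shrinkage" idea; the exact shrinkage schedule and percentage policy in the paper may differ.

```python
import torch

def iterative_soft_shrink(weights, sparsity=0.9, shrink=0.1):
    """Softly shrink the least important weights instead of hard-pruning them.

    weights  : any weight tensor of a randomly initialized, from-scratch-trained network
    sparsity : fraction of weights treated as unimportant
    shrink   : small shrinkage factor applied to those weights, keeping them trainable
    """
    flat = weights.abs().flatten()
    k = int(sparsity * flat.numel())
    if k == 0:
        return weights
    threshold = flat.kthvalue(k).values           # magnitude cut-off for "unimportant"
    unimportant = weights.abs() <= threshold
    out = weights.clone()
    out[unimportant] *= (1.0 - shrink)            # soft shrinkage rather than zeroing
    return out
```

In a training loop this would be applied on-the-fly to each layer's weights after the optimizer step, so the sparse structure can keep adapting to the optimization process.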
https://arxiv.org/abs/2303.09650
We propose a novel multi-stage depth super-resolution network, which progressively reconstructs high-resolution depth maps from explicit and implicit high-frequency features. The former are extracted by an efficient transformer processing both local and global contexts, while the latter are obtained by projecting color images into the frequency domain. Both are combined together with depth features by means of a fusion strategy within a multi-stage and multi-scale framework. Experiments on the main benchmarks, such as NYUv2, Middlebury, DIML and RGBDD, show that our approach outperforms existing methods by a large margin (~20% on NYUv2 and DIML against the contemporary work DADA, with 16x upsampling), establishing a new state-of-the-art in the guided depth super-resolution task.
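The implicit high-frequency features are said to be obtained by projecting the color image into the frequency domain. The sketch below shows one simple way to perform such a projection with an FFT and a high-pass mask; the actual projection and cutoff used in the paper are not specified here and are assumptions.

```python
import torch

def high_frequency_projection(rgb, cutoff=0.25):
    """Extract a high-frequency version of a (B, C, H, W) guidance image via an FFT mask."""
    spec = torch.fft.fftshift(torch.fft.fft2(rgb, norm='ortho'), dim=(-2, -1))
    h, w = rgb.shape[-2:]
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing='ij')
    # radial high-pass mask: keep only frequencies beyond the cutoff radius
    mask = ((yy ** 2 + xx ** 2).sqrt() > cutoff).float().to(spec.device)
    high = spec * mask
    return torch.fft.ifft2(torch.fft.ifftshift(high, dim=(-2, -1)), norm='ortho').real
```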
https://arxiv.org/abs/2303.09307
Object detection and single image super-resolution are classic problems in computer vision (CV). The object detection task aims to recognize the objects in input images, while the image restoration task aims to reconstruct high-quality images from given low-quality images. In this paper, a two-stage framework for object detection and image restoration is proposed. The first stage uses YOLO series algorithms to complete the object detection and then performs image cropping. In the second stage, this work improves the Swin Transformer and uses the newly proposed algorithm to connect Swin Transformer layers, designing a new neural network architecture. We name the newly proposed network for image restoration SwinOIR. This work compares the performance of different versions of YOLO detection algorithms on the MS COCO and Pascal VOC datasets, demonstrating the suitability of different YOLO network models for the first stage of the framework in different scenarios. For the image super-resolution task, it compares the performance of different methods of connecting Swin Transformer layers and of designing different sizes of SwinOIR for different real-life scenarios. Our implementation code is released at this https URL.
https://arxiv.org/abs/2303.09190
Generating images with both photorealism and multiview 3D consistency is crucial for 3D-aware GANs, yet existing methods struggle to achieve both simultaneously. Improving the photorealism via CNN-based 2D super-resolution can break the strict 3D consistency, while keeping the 3D consistency by learning high-resolution 3D representations for direct rendering often compromises image quality. In this paper, we propose a novel learning strategy, namely 3D-to-2D imitation, which enables a 3D-aware GAN to generate high-quality images while maintaining their strict 3D consistency, by letting the images synthesized by the generator's 3D rendering branch mimic those generated by its 2D super-resolution branch. We also introduce 3D-aware convolutions into the generator for better 3D representation learning, which further improves the image generation quality. With the above strategies, our method reaches FID scores of 5.4 and 4.3 on FFHQ and AFHQ-v2 Cats, respectively, at 512x512 resolution, largely outperforming existing 3D-aware GANs using direct 3D rendering and coming very close to the previous state-of-the-art method that leverages 2D super-resolution. Project website: this https URL.
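The 3D-to-2D imitation strategy boils down to making the 3D rendering branch mimic the 2D super-resolution branch. A minimal sketch of such an imitation term is given below, assuming an L1 form; the paper's actual objective may use other distances or feature-space terms.

```python
import torch

def imitation_loss(img_3d_branch, img_2d_sr_branch):
    """Push the direct 3D-rendered image toward the (detached) 2D super-resolved image.

    Detaching the 2D branch means the realism cue flows into the 3D branch only,
    so strict 3D consistency is kept while image quality is imitated.
    """
    return (img_3d_branch - img_2d_sr_branch.detach()).abs().mean()
```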
https://arxiv.org/abs/2303.09036
Adapting the Diffusion Probabilistic Model (DPM) for direct image super-resolution is wasteful, given that a simple Convolutional Neural Network (CNN) can recover the main low-frequency content. Therefore, we present ResDiff, a novel Diffusion Probabilistic Model based on a residual structure for Single Image Super-Resolution (SISR). ResDiff utilizes a combination of a CNN, which restores the primary low-frequency components, and a DPM, which predicts the residual between the ground-truth image and the CNN-predicted image. In contrast to the common diffusion-based methods that directly use LR images to guide the noise towards HR space, ResDiff utilizes the CNN's initial prediction to direct the noise towards the residual space between the HR space and the CNN-predicted space, which not only accelerates the generation process but also yields superior sample quality. Additionally, a frequency-domain-based loss function for the CNN is introduced to facilitate its restoration, and a frequency-domain-guided diffusion is designed for the DPM to predict high-frequency details. The extensive experiments on multiple benchmark datasets demonstrate that ResDiff outperforms previous diffusion-based methods in terms of shorter model convergence time, superior generation quality, and more diverse samples.
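The residual formulation can be summarized in a few lines: the CNN provides a coarse low-frequency prediction and the diffusion model is trained on the remaining residual. A sketch under the assumption of a generic `cnn` module; the training details of the DPM itself are not shown.

```python
import torch

def residual_diffusion_target(hr, lr, cnn):
    """Build the ResDiff-style training target for the diffusion model.

    hr  : ground-truth high-resolution image
    lr  : low-resolution input
    cnn : pre-trained CNN that restores the primary low-frequency content
    """
    with torch.no_grad():
        coarse = cnn(lr)            # coarse low-frequency prediction
    residual = hr - coarse          # the DPM learns to generate this residual
    return coarse, residual
```

At inference, the residual sampled by the DPM would be added back to the CNN prediction to obtain the final HR image.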
https://arxiv.org/abs/2303.08714
Recent years have witnessed impressive progress in super-resolution (SR) processing. However, its real-time inference requirement sets a challenge not only for the model design but also for the on-chip implementation. In this paper, we implement a full-stack SR acceleration framework on embedded GPU devices. The special dictionary learning algorithm used in SR models is analyzed in detail and accelerated via a novel dictionary selection strategy. In addition, the hardware programming architecture together with the model structure is analyzed to guide the optimal design of computation kernels that minimize the inference latency under the resource constraints. With these novel techniques, the communication and computation bottlenecks in deep dictionary-learning-based SR models are effectively tackled. Experiments on the embedded NVIDIA NX and the 2080Ti show that our method significantly outperforms the state-of-the-art NVIDIA TensorRT and can achieve real-time performance.
https://arxiv.org/abs/2303.08999