Compressed sensing (CS) has emerged to overcome the inefficiency of Nyquist sampling. However, traditional optimization-based reconstruction is slow and cannot yield an exact image in practice. Deep learning-based reconstruction has become a promising alternative to optimization-based reconstruction, outperforming it in accuracy and computation speed. Finding an efficient sampling method with deep learning-based reconstruction, especially for Fourier CS, remains a challenge. Existing works on joint optimization of sampling and reconstruction (H1) optimize the sampling mask but have limited potential because the mask is not adaptive to each data point. Adaptive sampling (H2) also has disadvantages: difficult optimization and Pareto sub-optimality. Here, we propose a novel adaptive selection of sampling-reconstruction (H1.5) framework that selects the best sampling mask and reconstruction network for each input data point. We provide theorems showing that our method has a higher potential than H1 and effectively solves the Pareto sub-optimality problem in sampling-reconstruction by using separate reconstruction networks for different sampling masks. To select the best sampling mask, we propose to quantify the high-frequency Bayesian uncertainty of the input using a super-resolution space generation model. Our method outperforms joint optimization of sampling-reconstruction (H1) and adaptive sampling (H2), achieving significant improvements on several Fourier CS problems.
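The measurement model behind Fourier CS is compact enough to sketch: a binary mask selects which k-space frequencies are acquired, and zero-filling the unsampled frequencies gives the naive baseline that learned reconstructions improve on. A toy NumPy sketch (function names and the corner low-frequency mask are our own illustration, not the paper's setup):

```python
import numpy as np

def fourier_cs_sample(image, mask):
    """Apply a binary sampling mask in k-space (Fourier compressed sensing).

    Only frequencies where mask == 1 are measured; zero-filling the rest
    and inverting gives the naive baseline reconstruction.
    """
    kspace = np.fft.fft2(image)                # full k-space of the image
    measured = kspace * mask                   # keep only sampled frequencies
    zero_filled = np.fft.ifft2(measured).real  # naive reconstruction
    return measured, zero_filled

# Toy 8x8 image; keep only the four corner blocks of the unshifted FFT,
# which hold the lowest frequencies.
img = np.outer(np.hanning(8), np.hanning(8))
mask = np.zeros((8, 8))
mask[:2, :2] = mask[:2, -2:] = mask[-2:, :2] = mask[-2:, -2:] = 1
_, recon = fourier_cs_sample(img, mask)
```

With a full mask the zero-filled reconstruction is exact; the interesting regime is a sparse mask, where the choice of which frequencies to keep (the subject of H1/H1.5/H2 above) determines how much the reconstruction network must recover.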
https://arxiv.org/abs/2409.11738
In recent years, Light Detection and Ranging (LiDAR) technology, a critical sensor in robotics and autonomous systems, has seen significant advancements. These improvements include enhanced point cloud resolution and the capability to provide 360° low-resolution images. These images encode various data, such as depth, reflectivity, and near-infrared light, within their pixels. However, an excessive density of points and conventional point cloud sampling can be counterproductive, particularly in applications such as LiDAR odometry, where misleading points and degraded geometry information may induce drift errors. Currently, extensive research efforts are being directed towards leveraging LiDAR-generated images to improve situational awareness. This paper presents a comprehensive review of current deep learning (DL) techniques, including colorization and super-resolution, which are traditionally utilized in conventional computer vision tasks. These techniques are applied to LiDAR-generated images and analyzed qualitatively. Based on this analysis, we have developed a novel approach that selectively integrates the best-suited colorization and super-resolution methods with LiDAR imagery to sample reliable points from the LiDAR point cloud. This approach aims not only to improve the accuracy of point cloud registration but also to avoid mismatching caused by a lack of geometry information, thereby augmenting the utility and precision of LiDAR systems in practical applications. In our evaluation, the proposed approach demonstrates superior performance compared to our previous work, achieving lower translation and rotation errors with a reduced number of points.
https://arxiv.org/abs/2409.11532
Present state-of-the-art (SotA) Image Super-Resolution (ISR) methods employ Deep Learning (DL) techniques trained on large amounts of image data. The primary limitation to extending the existing SotA ISR works to real-world instances is their computational and time complexity. In this paper, contrary to the existing methods, we present a novel and computationally efficient ISR algorithm that learns the ISR task independently of any image dataset. The proposed algorithm reformulates the ISR task from generating Super-Resolved (SR) images to computing the inverse of the kernels that span the degradation space. We introduce Deep Identity Learning, which exploits the identity relation between the degradation and inverse-degradation models. The proposed approach relies neither on an ISR dataset nor on a single input low-resolution (LR) image (as the self-supervised method ZSSR does) to model the ISR task. Hence we term our model Null-Shot Super-Resolution Using Deep Identity Learning (NSSR-DIL). The proposed NSSR-DIL model requires at least an order of magnitude fewer computational resources and demonstrates competitive performance on benchmark ISR datasets. Another salient aspect of our proposition is that the NSSR-DIL framework bypasses retraining and remains unchanged for varying scale factors such as X2, X3, and X4. This makes our highly efficient ISR model more suitable for real-world applications.
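The identity relation the abstract exploits — that the inverse-degradation model composed with the degradation should act as the identity — has a simple linear analogue. The sketch below is a Fourier-domain regularized pseudo-inverse (Wiener-style damping); all names are ours, and the actual NSSR-DIL learns the inverse with a network rather than computing it in closed form. It only illustrates the target condition k * k_inv ≈ δ:

```python
import numpy as np

def inverse_kernel(kernel, size, eps=1e-3):
    """Fourier-domain inverse filter k_inv with k * k_inv ~= delta.

    eps damps near-zero frequencies so the inverse stays bounded.
    """
    K = np.fft.fft2(kernel, s=(size, size))        # zero-padded spectrum
    K_inv = np.conj(K) / (np.abs(K) ** 2 + eps)    # regularized inverse
    return np.fft.ifft2(K_inv).real

# A 3x3 box-blur kernel and its approximate inverse on a 16x16 grid.
blur = np.ones((3, 3)) / 9.0
k_inv = inverse_kernel(blur, 16)
# Composing blur and inverse in Fourier space should approximate a delta.
delta = np.fft.ifft2(np.fft.fft2(blur, s=(16, 16)) *
                     np.fft.fft2(k_inv)).real
```

The residual spread of `delta` away from the origin shows exactly what the regularization gives up at frequencies the blur nearly annihilates.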
https://arxiv.org/abs/2409.12165
Implicit Neural Representation (INR), leveraging a neural network to transform coordinate input into corresponding attributes, has recently driven significant advances in several vision-related domains. However, the performance of INR is heavily influenced by the choice of the nonlinear activation function used in its multilayer perceptron (MLP) architecture. Multiple nonlinearities have been investigated; yet, current INRs face limitations in capturing high-frequency components, diverse signal types, and handling inverse problems. We have identified that these problems can be greatly alleviated by introducing a paradigm shift in INRs. We find that an architecture with learnable activations in initial layers can represent fine details in the underlying signals. Specifically, we propose SL$^{2}$A-INR, a hybrid network for INR with a single-layer learnable activation function, boosting the effectiveness of traditional ReLU-based MLPs. Our method achieves superior performance across diverse tasks, including image representation, 3D shape reconstruction, inpainting, single image super-resolution, CT reconstruction, and novel view synthesis. Through comprehensive experiments, SL$^{2}$A-INR sets new benchmarks in accuracy, quality, and convergence rates for INR.
https://arxiv.org/abs/2409.10836
Kernel image regression methods have been shown to provide excellent efficiency in many image processing tasks, such as image and light-field compression, Gaussian Splatting, denoising, and super-resolution. Parameter estimation for these methods frequently employs gradient-descent iterative optimization, which poses a significant computational burden for many applications. In this paper, we introduce a novel adaptive segmentation-based initialization method targeted at optimizing Steered-Mixture-of-Experts (SMoE) gating networks and Radial-Basis-Function (RBF) networks with steering kernels. The novel initialization method allocates kernels to pre-calculated image segments. The optimal number of kernels, kernel positions, and steering parameters are derived per segment in an iterative optimization and kernel sparsification procedure. The kernel information from "local" segments is then transferred into a "global" initialization, ready for use in iterative optimization of SMoE, RBF, and related kernel image regression methods. Results show that drastic objective and subjective quality improvements are achievable compared to the widely used regular-grid initialization, "state-of-the-art" K-Means initialization, and previously introduced segmentation-based initialization methods, while also drastically improving the sparsity of the regression models. For the same quality, the novel initialization yields models with around 50% fewer kernels. In addition, a significant reduction of convergence time is achieved, with overall run-time savings of up to 50%. The segmentation-based initialization strategy itself admits heavy parallel computation; in theory, it may be divided into as many tasks as there are segments in the image. With access to only four parallel GPUs, run-time savings of 50% are already achievable for the initialization.
https://arxiv.org/abs/2409.10101
Recent advancements in single image super-resolution have been predominantly driven by token mixers and transformer architectures. WaveMixSR utilized the WaveMix architecture, employing a two-dimensional discrete wavelet transform for spatial token mixing, achieving superior performance in super-resolution tasks with remarkable resource efficiency. In this work, we present an enhanced version of the WaveMixSR architecture by (1) replacing the traditional transpose convolution layer with a pixel shuffle operation and (2) implementing a multistage design for higher resolution tasks ($4\times$). Our experiments demonstrate that our enhanced model -- WaveMixSR-V2 -- outperforms other architectures in multiple super-resolution tasks, achieving state-of-the-art results on the BSD100 dataset while consuming fewer resources and exhibiting higher parameter efficiency, lower latency, and higher throughput. Our code is available at this https URL.
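The pixel shuffle operation in change (1) is just a fixed, parameter-free rearrangement of channels into space, which avoids the overlapping-stride checkerboard artifacts of a transposed convolution. A minimal NumPy version (mirroring the semantics of PyTorch's `nn.PixelShuffle`; the example array is ours):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r^2, H, W) array into (C, H*r, W*r)."""
    c_r2, h, w = x.shape
    assert c_r2 % (r * r) == 0
    c = c_r2 // (r * r)
    out = x.reshape(c, r, r, h, w)       # split channels into an r x r offset grid
    out = out.transpose(0, 3, 1, 4, 2)   # interleave offsets with space: (c, h, r, w, r)
    return out.reshape(c, h * r, w * r)

# 4 input channels -> 1 output channel, 2x spatial upscale.
x = np.arange(16, dtype=float).reshape(4, 2, 2)
y = pixel_shuffle(x, 2)
```

Each group of r² input channels contributes one sub-pixel position of the output, so the layer trades depth for resolution with zero learned parameters.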
https://arxiv.org/abs/2409.10582
Magnetic Resonance Imaging (MRI) requires a trade-off between resolution, signal-to-noise ratio, and scan time, making high-resolution (HR) acquisition challenging. Therefore, super-resolution for MR images is a feasible solution. However, most existing methods face challenges in accurately learning a continuous volumetric representation from low-resolution images or require HR images for supervision. To solve these challenges, we propose a novel method for MR image super-resolution based on a two-factor representation. Specifically, we factorize intensity signals into a linear combination of learnable basis and coefficient factors, enabling efficient continuous volumetric representation from low-resolution MR images. Besides, we introduce a coordinate-based encoding to capture structural relationships between sparse voxels, facilitating smooth completion in unobserved regions. Experiments on the BraTS 2019 and MSSEG 2016 datasets demonstrate that our method achieves state-of-the-art performance, providing superior visual fidelity and robustness, particularly for large up-sampling scales in MR image super-resolution.
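In its simplest linear form, the two-factor idea — intensities as a linear combination of learnable basis and coefficient factors — reduces to a low-rank factorization. The sketch below is illustrative only: the dimensions and the least-squares refit are our assumptions, standing in for the gradient-based fitting a real model would perform:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-factor model: the intensity signal at each of N voxels is
# a linear combination of K learnable basis vectors of dimension D.
K, N, D = 4, 100, 8
basis = rng.normal(size=(K, D))    # learnable basis factor
coeff = rng.normal(size=(N, K))    # learnable per-voxel coefficient factor
signal = coeff @ basis             # (N, D) modeled intensities

# Refit the coefficients from observed signals by least squares, as a
# stand-in for gradient-based optimization.
coeff_hat, *_ = np.linalg.lstsq(basis.T, signal.T, rcond=None)
```

Because every voxel shares the same small basis, querying a new (continuous) coordinate only requires predicting K coefficients, which is what makes the representation efficient.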
https://arxiv.org/abs/2409.09731
Speech Super-Resolution (SSR) is the task of enhancing low-resolution speech signals by restoring missing high-frequency components. Conventional approaches typically reconstruct log-mel features, followed by a vocoder that generates high-resolution speech in the waveform domain. However, as log-mel features lack phase information, this can result in performance degradation during the reconstruction phase. Motivated by recent advances with Selective State Space Models (SSMs), we propose a method, referred to as Wave-U-Mamba, that directly performs SSR in the time domain. In our comparative study, including models such as WSRGlow, NU-Wave 2, and AudioSR, Wave-U-Mamba demonstrates superior performance, achieving the lowest Log-Spectral Distance (LSD) across various low-resolution sampling rates, ranging from 8 kHz to 24 kHz. Additionally, subjective human evaluations, scored using Mean Opinion Score (MOS), reveal that our method produces SSR with natural and human-like quality. Furthermore, Wave-U-Mamba achieves these results while generating high-resolution speech over nine times faster than baseline models on a single A100 GPU, with parameter sizes less than 2% of those of the baseline models.
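The headline metric, Log-Spectral Distance, has a concrete definition: the RMS difference between log power spectra, averaged over frames, with lower being better. A self-contained NumPy version (frame and FFT sizes here are arbitrary choices, not those used in the paper):

```python
import numpy as np

def log_spectral_distance(x, y, n_fft=256, hop=128, eps=1e-10):
    """Frame-wise Log-Spectral Distance (in dB) between two waveforms."""
    def power_spectrogram(s):
        frames = [s[i:i + n_fft] * np.hanning(n_fft)
                  for i in range(0, len(s) - n_fft + 1, hop)]
        return np.abs(np.fft.rfft(frames, axis=-1)) ** 2

    X, Y = power_spectrogram(x), power_spectrogram(y)
    diff = 10 * np.log10(X + eps) - 10 * np.log10(Y + eps)
    # RMS over frequency bins, then mean over frames.
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=-1))))

t = np.linspace(0, 1, 8000, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)
```

Identical signals give exactly 0 dB; a simple amplitude mismatch already produces a distance of several dB, so the metric penalizes spectral-envelope errors of the kind SSR must fix.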
https://arxiv.org/abs/2409.09337
Progress on hyperspectral image (HSI) super-resolution (SR) still lags behind research on RGB image SR. HSIs usually have a high number of spectral bands, so accurately modeling spectral band interaction for HSI SR is hard. Also, training data for HSI SR is hard to obtain, so datasets are usually rather small. In this work, we propose a new test-time training method to tackle this problem. Specifically, a novel self-training framework is developed, in which more accurate pseudo-labels and more accurate LR-HR relationships are generated so that the model can be further trained with them to improve performance. To better support our test-time training method, we also propose a new network architecture to learn HSI SR without modeling spectral band interaction, and a new data augmentation method, Spectral Mixup, to increase the diversity of the training data at test time. We also collect a new HSI dataset with a diverse set of images of interesting objects, ranging from food to vegetation, materials, and general scenes. Extensive experiments on multiple datasets show that our method can improve the performance of pre-trained models significantly after test-time training and outperform competing methods significantly for HSI SR.
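Spectral Mixup is described only at a high level in the abstract; one plausible reading, sketched below, builds a new training cube whose bands are random convex combinations of the original bands, so spatial structure (and hence any LR-HR spatial relationship) is preserved while spectra are remixed. The Dirichlet weighting is our assumption, not the paper's recipe:

```python
import numpy as np

def spectral_mixup(hsi, rng):
    """Hypothetical Spectral Mixup sketch: each output band is a random
    convex combination of the input bands; spatial content is untouched."""
    bands = hsi.shape[0]
    weights = rng.dirichlet(np.ones(bands), size=bands)  # (bands, bands), rows sum to 1
    return np.tensordot(weights, hsi, axes=1)            # remix the bands

rng = np.random.default_rng(0)
cube = rng.random((31, 8, 8))      # toy 31-band hyperspectral patch
aug = spectral_mixup(cube, rng)
```

Because the combinations are convex, augmented values stay within the range of the original data at every pixel, which keeps the augmentation physically plausible.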
https://arxiv.org/abs/2409.08667
Over the last decade, deep learning has shown great success at performing computer vision tasks, including classification, super-resolution, and style transfer. Now, we apply it to data compression to help build the next generation of multimedia codecs. This thesis provides three primary contributions to this new field of learned compression. First, we present an efficient low-complexity entropy model that dynamically adapts the encoding distribution to a specific input by compressing and transmitting the encoding distribution itself as side information. Secondly, we propose a novel lightweight low-complexity point cloud codec that is highly specialized for classification, attaining significant reductions in bitrate compared to non-specialized codecs. Lastly, we explore how motion within the input domain between consecutive video frames is manifested in the corresponding convolutionally-derived latent space.
https://arxiv.org/abs/2409.08376
Super-resolution (SR) aims to enhance the quality of low-resolution images and has been widely applied in medical imaging. We found that the design principles of most existing methods are influenced by SR tasks on real-world images and do not take into account the significance of the multi-level structure in pathological images, even though they can achieve respectable objective metric scores. In this work, we delve into two super-resolution working paradigms and propose a novel network called CWT-Net, which leverages cross-scale image wavelet transforms and a Transformer architecture. Our network consists of two branches: one dedicated to learning super-resolution and the other to high-frequency wavelet features. To generate high-resolution histopathology images, the Transformer module shares and fuses features from both branches at various stages. Notably, we have designed a specialized wavelet reconstruction module to effectively enhance the wavelet-domain features and enable the network to operate in different modes, allowing for the introduction of additional relevant information from cross-scale images. Our experimental results demonstrate that our model significantly outperforms state-of-the-art methods in both performance and visualization evaluations and can substantially boost the accuracy of image diagnostic networks.
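The high-frequency wavelet features consumed by the second branch come from a discrete wavelet transform; one level of the Haar DWT is small enough to write out in full (a generic sketch, not necessarily CWT-Net's wavelet choice):

```python
import numpy as np

def haar_dwt2(img):
    """One level of the 2D Haar discrete wavelet transform.

    Splits an even-sized image into a low-pass approximation (LL) and three
    high-frequency detail subbands (LH, HL, HH) at half resolution.
    """
    a = (img[0::2] + img[1::2]) / 2        # vertical low-pass
    d = (img[0::2] - img[1::2]) / 2        # vertical high-pass
    ll = (a[:, 0::2] + a[:, 1::2]) / 2     # low-low: coarse approximation
    lh = (a[:, 0::2] - a[:, 1::2]) / 2     # horizontal details
    hl = (d[:, 0::2] + d[:, 1::2]) / 2     # vertical details
    hh = (d[:, 0::2] - d[:, 1::2]) / 2     # diagonal details
    return ll, lh, hl, hh

img = np.arange(16, dtype=float).reshape(4, 4)
ll, lh, hl, hh = haar_dwt2(img)
```

The transform is perfectly invertible (each 2x2 block can be recovered from the four subband coefficients), so routing LH/HL/HH into a dedicated branch loses no information while isolating exactly the detail content SR must restore.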
https://arxiv.org/abs/2409.07092
Deep-learning-based single-image super-resolution (SISR) algorithms currently fall into two main model families: one based on convolutional neural networks and the other based on Transformers. The former stacks convolutional layers with different kernel sizes, enabling the model to better extract local image features; the latter uses the self-attention mechanism to establish long-distance dependencies between pixels and thus better extract global features. However, both approaches face their own problems. Based on this, this paper proposes a new lightweight multi-scale feature fusion network built on two-way complementary convolutional and Transformer branches, which integrates the respective strengths of Transformers and convolutional neural networks through a two-branch architecture to fuse global and local information. Meanwhile, considering the partial information loss incurred when deep networks are trained on low-resolution images, this paper designs a modular multi-stage feature-supplementation connection that fuses feature maps extracted in the shallow stages of the model with those extracted in the deep stages, minimizing the loss of information beneficial to image restoration and facilitating a higher-quality restored image. Experimental results show that, compared with other lightweight models with the same number of parameters, the proposed model achieves the best image restoration performance.
https://arxiv.org/abs/2409.06590
Very low-resolution face recognition is challenging due to the serious loss of informative facial details in resolution degradation. In this paper, we propose a generative-discriminative representation distillation approach that combines generative representation with cross-resolution aligned knowledge distillation. This approach facilitates very low-resolution face recognition by jointly distilling generative and discriminative models via two distillation modules. Firstly, the generative representation distillation takes the encoder of a diffusion model pretrained for face super-resolution as the generative teacher to supervise the learning of the student backbone via feature regression, and then freezes the student backbone. After that, the discriminative representation distillation further considers a pretrained face recognizer as the discriminative teacher to supervise the learning of the student head via cross-resolution relational contrastive distillation. In this way, the general backbone representation can be transformed into discriminative head representation, leading to a robust and discriminative student model for very low-resolution face recognition. Our approach improves the recovery of the missing details in very low-resolution faces and achieves better knowledge transfer. Extensive experiments on face datasets demonstrate that our approach enhances the recognition accuracy of very low-resolution faces, showcasing its effectiveness and adaptability.
https://arxiv.org/abs/2409.06371
Due to their text-to-image synthesis capability, diffusion models have recently seen increased use in visual perception tasks, such as depth estimation. The lack of good-quality datasets makes the extraction of fine-grained semantic context challenging for diffusion models. A semantic context with fewer details further degrades the creation of effective text embeddings used as input to diffusion models. In this paper, we propose EDADepth, a novel enhanced data-augmentation method for estimating monocular depth without using additional training data. We use Swin2SR, a super-resolution model, to enhance the quality of input images. We employ the BEiT pre-trained semantic segmentation model for better extraction of text embeddings. We introduce the BLIP-2 tokenizer to generate tokens from these text embeddings. The novelty of our approach is the introduction of Swin2SR, the BEiT model, and the BLIP-2 tokenizer into the diffusion-based pipeline for monocular depth estimation. Our model achieves state-of-the-art (SOTA) results on the $\delta_3$ metric on the NYUv2 and KITTI datasets. It also achieves results comparable to those of the SOTA models in the RMSE and REL metrics. Finally, we show improvements in the visualization of the estimated depth compared to SOTA diffusion-based monocular depth estimation models. Code: this https URL.
https://arxiv.org/abs/2409.06183
Score-based diffusion methods provide a powerful strategy to solve image restoration tasks by flexibly combining a pre-trained foundational prior model with a likelihood function specified during test time. Such methods are predominantly derived from two stochastic processes: reversing Ornstein-Uhlenbeck, which underpins the celebrated denoising diffusion probabilistic models (DDPM) and denoising diffusion implicit models (DDIM), and the Langevin diffusion process. The solutions delivered by DDPM and DDIM are often remarkably realistic, but they are not always consistent with measurements because of likelihood intractability issues and the associated required approximations. Alternatively, using a Langevin process circumvents the intractable likelihood issue, but usually leads to restoration results of inferior quality and longer computing times. This paper presents a novel and highly computationally efficient image restoration method that carefully embeds a foundational DDPM denoiser within an empirical Bayesian Langevin algorithm, which jointly calibrates key model hyper-parameters as it estimates the model's posterior mean. Extensive experimental results on three canonical tasks (image deblurring, super-resolution, and inpainting) demonstrate that the proposed approach improves on state-of-the-art strategies both in image estimation accuracy and computing time.
https://arxiv.org/abs/2409.04384
Single hyperspectral image super-resolution (single-HSI-SR) aims to improve the resolution of a single input low-resolution HSI. Due to the bottleneck of data scarcity, the development of single-HSI-SR lags far behind that of RGB natural images. In recent years, research on RGB SR has shown that models pre-trained on large-scale benchmark datasets can greatly improve performance on unseen data, which may stand as a remedy for HSI. But how can we transfer the pre-trained RGB model to HSI, to overcome the data-scarcity bottleneck? Because of the significant difference in channels between the pre-trained RGB model and HSIs, the model cannot focus on the correlation along the spectral dimension, limiting its utility on HSI. Inspired by HSI spatial-spectral decoupling, we propose a new framework that first fine-tunes the pre-trained model with the spatial components (known as eigenimages), and then infers on unseen HSI using an iterative spectral regularization (ISR) to maintain the spectral correlation. The advantages of our method are: 1) we effectively inject the spatial texture processing capabilities of the pre-trained RGB model into HSI while preserving spectral fidelity, 2) learning in the spectrally decorrelated domain can improve the generalizability to spectral-agnostic data, and 3) our inference in the eigenimage domain naturally exploits the spectral low-rank property of HSI, thereby reducing complexity. This work bridges the gap between pre-trained RGB models and HSI via eigenimages, addressing the issue of limited HSI training data, hence the name EigenSR. Extensive experiments show that EigenSR outperforms the state-of-the-art (SOTA) methods in both spatial and spectral metrics. Our code will be released.
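Eigenimages have a concrete linear-algebra meaning: an SVD along the spectral dimension splits an HSI into spatial factors (the eigenimages) and per-band spectral weights, and the spectral low-rank property means only a few factors are needed. A NumPy sketch on a synthetic rank-3 cube (sizes and rank are illustrative, and the paper's actual decomposition may differ in detail):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy HSI: 31 spectral bands over an 8x8 patch, constructed with spectral rank 3.
bands, h, w = 31, 8, 8
hsi = (rng.normal(size=(bands, 3)) @ rng.normal(size=(3, h * w))).reshape(bands, h, w)

# Spatial-spectral decoupling via SVD along the spectral dimension:
# right singular vectors (reshaped) are "eigenimages" carrying spatial content.
X = hsi.reshape(bands, h * w)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 3                                      # spectral rank kept
eigenimages = Vt[:k].reshape(k, h, w)      # spatial factors fed to the RGB model
spectral_basis = U[:, :k] * s[:k]          # per-band mixing weights
recon = (spectral_basis @ Vt[:k]).reshape(bands, h, w)
```

Super-resolving the k eigenimages instead of all 31 bands, then remixing with `spectral_basis`, is what lets a channel-limited RGB backbone operate on an HSI without touching the spectral correlation.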
https://arxiv.org/abs/2409.04050
In recent years, facial recognition (FR) models have become the most widely used biometric tool, achieving impressive results on numerous datasets. However, inherent hardware challenges or shooting distances often result in low-resolution images, which significantly impact the performance of FR models. To address this issue, several solutions have been proposed, including super-resolution (SR) models that generate highly realistic faces. Despite these efforts, significant improvements in FR algorithms have not been achieved. We propose a novel SR model, FTLGAN, which focuses on generating high-resolution images that preserve individual identities rather than merely improving image quality, thereby maximizing the performance of FR models. The results are compelling, demonstrating a mean d' value 21% above the best current state-of-the-art models: d' = 1.099 and AUC = 0.78 at 14x14 pixels, d' = 2.112 and AUC = 0.92 at 28x28 pixels, and d' = 3.049 and AUC = 0.98 at 56x56 pixels. The contributions of this study are significant in several key areas. Firstly, a notable improvement in facial recognition performance has been achieved on low-resolution images, specifically at resolutions of 14x14, 28x28, and 56x56 pixels. Secondly, the enhancements demonstrated by FTLGAN show a consistent response across all resolutions, delivering uniformly outstanding performance, unlike other comparative models. Thirdly, an innovative approach has been implemented using triplet loss logic, enabling the training of the super-resolution model solely with real images, in contrast with current models, and expanding potential real-world applications. Lastly, this study introduces a novel model that specifically addresses the challenge of improving classification performance in facial recognition systems by integrating facial recognition quality as a loss during model training.
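The triplet loss logic referenced in the third contribution is standard. In its plain form it pulls the SR face's embedding (anchor) toward the same identity's embedding (positive) and pushes it at least `margin` away from a different identity (negative); the embeddings and margin below are toy values, not FTLGAN's:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss on embedding vectors (squared Euclidean)."""
    d_pos = np.sum((anchor - positive) ** 2)   # anchor-to-same-identity distance
    d_neg = np.sum((anchor - negative) ** 2)   # anchor-to-other-identity distance
    return max(0.0, d_pos - d_neg + margin)

a = np.array([1.0, 0.0])   # embedding of the super-resolved face
p = np.array([1.0, 0.1])   # same identity (close)
n = np.array([0.0, 1.0])   # different identity (far)
```

Because the loss is computed on recognition embeddings rather than pixels, it directly optimizes what matters for FR, which is how identity preservation gets baked into the SR training.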
https://arxiv.org/abs/2409.03530
Recent Vision Transformer (ViT)-based methods for Image Super-Resolution have demonstrated impressive performance. However, they suffer from significant complexity, resulting in high inference times and memory usage. Additionally, ViT models using Window Self-Attention (WSA) face challenges in processing regions outside their windows. To address these issues, we propose the Low-to-high Multi-Level Transformer (LMLT), which employs attention with varying feature sizes for each head. LMLT divides image features along the channel dimension, gradually reduces spatial size for lower heads, and applies self-attention to each head. This approach effectively captures both local and global information. By integrating the results from lower heads into higher heads, LMLT overcomes the window boundary issues in self-attention. Extensive experiments show that our model significantly reduces inference time and GPU memory usage while maintaining or even surpassing the performance of state-of-the-art ViT-based Image Super-Resolution methods. Our code is available at this https URL.
https://arxiv.org/abs/2409.03516
We present aTENNuate, a simple deep state-space autoencoder for efficient, end-to-end online raw speech enhancement. The network's performance is primarily evaluated on raw speech denoising, with additional assessments on tasks such as super-resolution and de-quantization. We benchmark aTENNuate on the VoiceBank + DEMAND and the Microsoft DNS1 synthetic test sets. The network outperforms previous real-time denoising models in terms of PESQ score, parameter count, MACs, and latency. Even as a raw-waveform processing model, it maintains high fidelity to the clean signal with minimal audible artifacts. In addition, the model remains performant even when the noisy input is downsampled to 4 kHz and quantized to 4 bits, suggesting general speech-enhancement capability in low-resource environments.
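The core primitive of a deep state-space network is a discrete linear recurrence applied along the waveform: x_{k+1} = A x_k + B u_k, y_k = C x_k + D u_k. A minimal single-channel sketch (not aTENNuate's actual parameterization; A, B, C, D here are arbitrary stable example values):

```python
import numpy as np

def ssm_scan(u, A, B, C, D):
    """Run a discrete linear state-space recurrence over a 1-D signal u.

    x_{k} = A x_{k-1} + B u_k   (state update)
    y_k   = C x_k     + D u_k   (readout)
    """
    x = np.zeros(A.shape[0])
    y = np.empty(len(u))
    for k, u_k in enumerate(u):
        x = A @ x + B * u_k
        y[k] = C @ x + D * u_k
    return y

# Example: a stable 2-state system (eigenvalues inside the unit circle),
# so the impulse response decays over time.
A = np.diag([0.9, 0.5])
B = np.array([1.0, 1.0])
C = np.array([0.5, 0.5])
D = 0.0
```

Because the recurrence carries all history in a fixed-size state, the model processes audio online, sample by sample, with constant memory; this is what makes state-space layers attractive for real-time raw-waveform enhancement.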
https://arxiv.org/abs/2409.03377
Training Single-Image Super-Resolution (SISR) models with pixel-based regression losses can achieve high scores on distortion metrics (e.g., PSNR and SSIM), but often yields blurry images due to insufficient recovery of high-frequency details. Conversely, using GAN or perceptual losses can produce sharp images with high perceptual metric scores (e.g., LPIPS), but may introduce artifacts and incorrect textures. Balancing these two types of losses can achieve a trade-off between distortion and perception, but the challenge lies in tuning the loss-function weights. To address this issue, we propose a novel method that incorporates Multi-Objective Optimization (MOO) into the training process of SISR models to balance perceptual quality and distortion. We conceptualize the relationship between loss weights and image quality assessment (IQA) metrics as black-box objective functions to be optimized within our Multi-Objective Bayesian Optimization Super-Resolution (MOBOSR) framework. This approach automates the hyperparameter tuning process, reduces overall computational cost, and enables the use of numerous loss functions simultaneously. Extensive experiments demonstrate that MOBOSR outperforms state-of-the-art methods in terms of both perceptual quality and distortion, significantly advancing the perception-distortion Pareto frontier. Our work points towards a new direction for future research on balancing perceptual quality and fidelity in nearly all image restoration tasks. The source code and pretrained models are available at: this https URL.
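As a toy illustration of the Pareto frontier that MOBOSR navigates: each loss-weight configuration yields a (distortion, perception) score pair, and the multi-objective optimizer keeps the non-dominated configurations. The filter below is a generic Pareto check, not the paper's algorithm; the example scores are made up:

```python
def pareto_front(points):
    """Return indices of non-dominated points.

    Lower is better in every coordinate, e.g. (1 - SSIM, LPIPS).
    A point p is dominated if some other point q is <= p in all
    coordinates and differs from p somewhere.
    """
    front = []
    for i, p in enumerate(points):
        dominated = any(
            all(q[k] <= p[k] for k in range(len(p))) and q != p
            for j, q in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front

# Hypothetical (distortion, perception) scores for four weight settings:
# the first three trade off against each other; the last is strictly worse.
scores = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0), (3.0, 3.0)]
```

In the full framework, a Bayesian-optimization loop would propose new loss weights, train (or partially train) the SISR model, score it with IQA metrics, and keep expanding this non-dominated set instead of tuning weights by hand.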
https://arxiv.org/abs/2409.03179