In surveillance, accurate license plate recognition is hindered by the often low quality and small dimensions of plate images, compromising recognition precision. Despite advancements in AI-based image super-resolution, methods like Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs) still fall short in enhancing license plate images. This study leverages the cutting-edge diffusion model, which has consistently outperformed other deep learning techniques in image restoration. By training this model on a curated dataset of Saudi license plates at both low and high resolutions, we demonstrate the diffusion model's superior efficacy. The method achieves 12.55% and 37.32% improvements in Peak Signal-to-Noise Ratio (PSNR) over SwinIR and ESRGAN, respectively. Moreover, our method surpasses these techniques in terms of Structural Similarity Index (SSIM), registering 4.89% and 17.66% improvements over SwinIR and ESRGAN, respectively. Furthermore, 92% of human evaluators preferred our images over those from other algorithms. In essence, this research presents a pioneering solution for license plate super-resolution, with tangible potential for surveillance systems.
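Both reported gains are in standard full-reference metrics. For reference, a minimal evaluation sketch (assuming scikit-image and equal-size uint8 RGB images; the paper's exact evaluation pipeline is not specified here):

```python
# Hypothetical evaluation sketch: comparing a super-resolved plate against
# its ground truth with the two metrics reported above (PSNR and SSIM).
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(sr_image: np.ndarray, hr_image: np.ndarray) -> tuple[float, float]:
    """Return (PSNR in dB, SSIM) for uint8 RGB images of equal size."""
    psnr = peak_signal_noise_ratio(hr_image, sr_image, data_range=255)
    ssim = structural_similarity(hr_image, sr_image, channel_axis=-1, data_range=255)
    return psnr, ssim
```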
https://arxiv.org/abs/2309.12506
Machine learning and deep learning methods have been widely explored for understanding the chaotic behavior of the atmosphere and furthering weather forecasting. There has been increasing interest from technology companies, government institutions, and meteorological agencies in building digital twins of the Earth. Recent approaches using transformers, physics-informed machine learning, and graph neural networks have demonstrated state-of-the-art performance on relatively narrow spatiotemporal scales and specific tasks. With the recent success of generative artificial intelligence (AI) using pre-trained transformers for language modeling and vision with prompt engineering and fine-tuning, we are now moving towards generalizable AI. In particular, we are witnessing the rise of AI foundation models that can perform competitively on multiple domain-specific downstream tasks. Despite this progress, we are still in the nascent stages of a generalizable AI model for global Earth system models, regional climate models, and mesoscale weather models. Here, we review current state-of-the-art AI approaches, primarily from the transformer and operator learning literature, in the context of meteorology. We provide our perspective on criteria for success towards a family of foundation models for nowcasting and forecasting weather and climate. We also discuss how such models can perform competitively on downstream tasks such as downscaling (super-resolution), identifying conditions conducive to the occurrence of wildfires, and predicting consequential meteorological phenomena across various spatiotemporal scales, such as hurricanes and atmospheric rivers. In particular, we examine current AI methodologies and contend they have matured enough to design and implement a weather foundation model.
https://arxiv.org/abs/2309.10808
Three-dimensional electron microscopy (3DEM) is an essential technique for investigating volumetric tissue ultrastructure. Due to technical limitations and high imaging costs, samples are often imaged anisotropically, where resolution in the axial direction ($z$) is lower than in the lateral directions $(x,y)$. This anisotropy can hamper subsequent analysis and visualization tasks. To overcome this limitation, we propose a novel deep-learning (DL)-based self-supervised super-resolution approach that computationally reconstructs isotropic 3DEM from the anisotropic acquisition. The proposed DL-based framework is built upon a U-shaped architecture incorporating vision-transformer (ViT) blocks, enabling high-capacity learning of local and global multi-scale image dependencies. To train the tailored network, we employ a self-supervised approach. Specifically, we generate pairs of anisotropic and isotropic training datasets from the given anisotropic 3DEM data. Feeding the given anisotropic 3DEM dataset through the trained network then yields the isotropic 3DEM. Importantly, this isotropic reconstruction approach relies solely on the given anisotropic 3DEM dataset and does not require pairs of co-registered anisotropic and isotropic 3DEM training datasets. To evaluate the effectiveness of the proposed method, we conducted experiments using three 3DEM datasets acquired from brain tissue. The experimental results demonstrate that our proposed framework successfully reconstructs isotropic 3DEM from the anisotropic acquisition.
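The self-supervised pairing exploits the fact that lateral $(x,y)$ slices are already high-resolution: degrading them along one lateral axis mimics the coarser axial sampling. A toy sketch of that idea, assuming simple block-averaging as the degradation (the paper's actual degradation model may differ):

```python
# Minimal sketch of the self-supervised pairing idea: lateral slices of the
# anisotropic volume are already high-resolution, so degrading them along one
# lateral axis mimics the axial (z) undersampling and yields training pairs.
import numpy as np

def make_training_pair(volume: np.ndarray, factor: int):
    """volume: (z, y, x) anisotropic stack; returns (lr, hr) xy-slice pairs."""
    zdim, ydim, xdim = volume.shape
    x_keep = (xdim // factor) * factor
    hr = volume[:, :, :x_keep]  # xy planes serve as "isotropic" ground truth
    # Block-average along x to simulate the coarser axial sampling.
    lr = hr.reshape(zdim, ydim, x_keep // factor, factor).mean(axis=-1)
    return lr, hr
```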
https://arxiv.org/abs/2309.10646
Atmospheric flows are governed by a broad variety of spatio-temporal scales, making real-time numerical modeling of such turbulent flows in complex terrain at high resolution computationally intractable. In this study, we demonstrate a neural network approach, motivated by Enhanced Super-Resolution Generative Adversarial Networks, that upscales low-resolution wind fields into high-resolution wind fields over an actual wind farm in Bessaker, Norway. The neural network-based model successfully reconstructs fully resolved 3D velocity fields from a coarser scale while respecting the local terrain, and easily outperforms trilinear interpolation. We also demonstrate that an appropriate cost function based on domain knowledge can alleviate the need for adversarial training.
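As one hypothetical reading of a "cost function based on domain knowledge" for velocity fields, a fidelity term can be paired with a divergence penalty encouraging approximate mass conservation. The sketch below assumes (B, 3, D, H, W) velocity tensors and is not the paper's actual loss:

```python
# Hypothetical domain-informed cost: L1 fidelity plus a divergence penalty
# that nudges the predicted 3D wind field towards mass conservation.
import torch

def wind_loss(pred: torch.Tensor, target: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """pred, target: (B, 3, D, H, W) velocity components (u, v, w)."""
    fidelity = (pred - target).abs().mean()
    du_dx = torch.gradient(pred[:, 0], dim=-1)[0]
    dv_dy = torch.gradient(pred[:, 1], dim=-2)[0]
    dw_dz = torch.gradient(pred[:, 2], dim=-3)[0]
    divergence = (du_dx + dv_dy + dw_dz).pow(2).mean()
    return fidelity + lam * divergence
```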
https://arxiv.org/abs/2309.10172
In this paper, we propose an Instant Photorealistic Style Transfer (IPST) approach designed to achieve instant photorealistic style transfer on super-resolution inputs, without pre-training on pair-wise datasets or imposing extra constraints. Our method utilizes a lightweight StyleNet to enable style transfer from a style image to a content image while preserving non-color information. To further enhance the style transfer process, we introduce an instance-adaptive optimization that prioritizes the photorealism of outputs and accelerates the convergence of the style network, allowing training to complete within seconds. Moreover, IPST is well-suited for multi-frame style transfer tasks, as it retains the temporal and multi-view consistency of multi-frame inputs such as video and Neural Radiance Fields (NeRF). Experimental results demonstrate that IPST requires less GPU memory, offers faster multi-frame transfer speed, and generates photorealistic outputs, making it a promising solution for various photorealistic transfer applications.
https://arxiv.org/abs/2309.10011
Multi-contrast magnetic resonance imaging (MRI) reflects information about human tissue from different perspectives and has many clinical applications. By utilizing the complementary information among different modalities, multi-contrast super-resolution (SR) of MRI can achieve better results than single-image super-resolution. However, existing multi-contrast MRI SR methods have shortcomings that may limit their performance. First, existing methods either simply concatenate the reference and degraded features or exploit global feature matching between them, both of which are unsuitable for multi-contrast MRI SR. Second, although many recent methods employ transformers to capture long-range dependencies in the spatial dimension, they neglect that self-attention in the channel dimension is also important for low-level vision tasks. To address these shortcomings, we propose a novel network architecture with compound attention and neighbor matching (CANM-Net) for multi-contrast MRI SR. The compound self-attention mechanism effectively captures dependencies in both the spatial and channel dimensions, and the neighborhood-based feature-matching modules match degraded features with adjacent reference features and then fuse them to obtain high-quality images. We conduct SR experiments on the IXI, fastMRI, and real-world scanning datasets. CANM-Net outperforms state-of-the-art approaches in both retrospective and prospective experiments. Moreover, our robustness study shows that CANM-Net still achieves good performance when the reference and degraded images are imperfectly registered, demonstrating good potential for clinical applications.
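To illustrate the channel-dimension self-attention the abstract's second point refers to, here is a minimal transposed-attention block in which attention is computed across channels rather than spatial positions; CANM-Net's actual compound-attention design may differ:

```python
# Minimal channel-dimension self-attention: the attention map is (C x C),
# computed over channels instead of spatial tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSelfAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.qkv(x).reshape(b, 3, c, h * w).unbind(dim=1)
        q = F.normalize(q, dim=-1)  # normalise along the spatial token axis
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)).softmax(dim=-1)  # (b, c, c)
        out = (attn @ v).reshape(b, c, h, w)
        return self.proj(out) + x
```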
https://arxiv.org/abs/2307.02148
Current scene text image super-resolution approaches primarily focus on extracting robust features, acquiring text information, and complex training strategies to generate super-resolution images. However, the upsampling module, which is crucial in converting low-resolution images to high-resolution ones, has received little attention in existing works. To address this issue, we propose the Pixel Adapter Module (PAM), based on graph attention, to mitigate pixel distortion caused by upsampling. The PAM effectively captures local structural information by allowing each pixel to interact with its neighbors and update features. Unlike previous graph attention mechanisms, our approach achieves a 2-3 order-of-magnitude improvement in efficiency and memory utilization by eliminating the dependency on sparse adjacency matrices and introducing a sliding-window approach for efficient parallel computation. Additionally, we introduce the MLP-based Sequential Residual Block (MSRB) for robust feature extraction from text images, and a Local Contour Awareness loss ($\mathcal{L}_{lca}$) to enhance the model's perception of details. Comprehensive experiments on TextZoom demonstrate that our proposed method generates high-quality super-resolution images, surpassing existing methods in recognition accuracy. For single-stage and multi-stage strategies, we achieved improvements of 0.7\% and 2.6\%, respectively, increasing the performance from 52.6\% and 53.7\% to 53.3\% and 56.3\%. The code is available at this https URL.
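A minimal sketch of the sliding-window idea: each pixel attends only to its $k \times k$ neighbourhood gathered with `F.unfold`, so no sparse adjacency matrix is ever built. This is one plausible reading of the PAM, not its published implementation:

```python
# Sliding-window pixel attention: neighbours are gathered densely with
# F.unfold, avoiding any explicit (sparse) adjacency matrix.
import torch
import torch.nn.functional as F

def window_attention(feat: torch.Tensor, k: int = 3) -> torch.Tensor:
    """feat: (B, C, H, W); returns features updated from k*k neighbours."""
    b, c, h, w = feat.shape
    pad = k // 2
    neigh = F.unfold(feat, kernel_size=k, padding=pad)          # (B, C*k*k, H*W)
    neigh = neigh.reshape(b, c, k * k, h * w)                   # (B, C, k*k, HW)
    query = feat.reshape(b, c, 1, h * w)
    scores = (query * neigh).sum(dim=1, keepdim=True) / c**0.5  # (B, 1, k*k, HW)
    weights = scores.softmax(dim=2)
    out = (weights * neigh).sum(dim=2)                          # (B, C, HW)
    return out.reshape(b, c, h, w)
```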
https://arxiv.org/abs/2309.08919
Accelerated by the increasing attention drawn by 5G, 6G, and Internet of Things applications, communication and sensing technologies have rapidly evolved from millimeter-wave (mmWave) to terahertz (THz) in recent years. Enabled by significant advancements in electromagnetic (EM) hardware, mmWave and THz frequency regimes spanning 30 GHz to 300 GHz and 300 GHz to 3000 GHz, respectively, can be employed for a host of applications. The main feature of THz systems is high-bandwidth transmission, enabling ultra-high-resolution imaging and high-throughput communications; however, challenges in both the hardware and algorithmic arenas remain for the ubiquitous adoption of THz technology. Spectra comprising mmWave and THz frequencies are well-suited for synthetic aperture radar (SAR) imaging at sub-millimeter resolutions for a wide spectrum of tasks like material characterization and nondestructive testing (NDT). This article provides a tutorial review of systems and algorithms for THz SAR in the near-field with an emphasis on emerging algorithms that combine signal processing and machine learning techniques. As part of this study, an overview of classical and data-driven THz SAR algorithms is provided, focusing on object detection for security applications and SAR image super-resolution. We also discuss relevant issues, challenges, and future research directions for emerging algorithms and THz SAR, including standardization of system and algorithm benchmarking, adoption of state-of-the-art deep learning techniques, signal processing-optimized machine learning, and hybrid data-driven signal processing algorithms...
https://arxiv.org/abs/2309.08844
Head-related transfer functions (HRTFs) are crucial for spatial soundfield reproduction in virtual reality applications. However, obtaining personalized, high-resolution HRTFs is a time-consuming and costly task. Recently, deep learning-based methods have shown promise in interpolating high-resolution HRTFs from sparse measurements. Some of these methods treat HRTF interpolation as an image super-resolution task, which neglects spatial acoustic features. This paper proposes a spherical convolutional neural network method for HRTF interpolation. The proposed method realizes the convolution process by decomposing and reconstructing HRTFs through spherical harmonics (SHs). The SHs, an orthogonal function set defined on a sphere, allow the convolution layers to effectively capture the spatial features of HRTFs, which are sampled on a sphere. Simulation results demonstrate the effectiveness of the proposed method in achieving accurate interpolation from sparse measurements, outperforming the SH method and learning-based methods.
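The core operation, decomposing sphere-sampled data into SH coefficients and reconstructing from them, can be sketched as a least-squares fit (assuming SciPy's `sph_harm(m, n, azimuth, colatitude)` convention; the network's learned layers are omitted):

```python
# Least-squares spherical-harmonic analysis/synthesis of sphere-sampled data,
# the basic operation behind the SH decomposition described above.
import numpy as np
from scipy.special import sph_harm

def sh_basis(azimuth, colatitude, order: int) -> np.ndarray:
    """Complex SH basis matrix, shape (num_samples, (order+1)**2)."""
    return np.column_stack([
        sph_harm(m, n, azimuth, colatitude)
        for n in range(order + 1)
        for m in range(-n, n + 1)
    ])

def sh_coefficients(values, azimuth, colatitude, order: int):
    """Fit SH coefficients up to `order` to (possibly complex) HRTF samples."""
    coeffs, *_ = np.linalg.lstsq(sh_basis(azimuth, colatitude, order),
                                 values, rcond=None)
    return coeffs

def sh_reconstruct(coeffs, azimuth, colatitude, order: int):
    return sh_basis(azimuth, colatitude, order) @ coeffs
```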
https://arxiv.org/abs/2309.08290
Due to their highly structured characteristics, faces are easier to recover than natural scenes for blind image super-resolution. Therefore, we can extract the degradation representation of an image from the low-quality and recovered face pairs. Using the degradation representation, realistic low-quality images can then be synthesized to fine-tune the super-resolution model for the real-world low-quality image. However, such a procedure is time-consuming and laborious, and the gaps between recovered faces and the ground-truths further increase the optimization uncertainty. To facilitate efficient model adaptation towards image-specific degradations, we propose a method dubbed MetaF2N, which leverages the contained Faces to fine-tune model parameters for adapting to the whole Natural image in a Meta-learning framework. The degradation extraction and low-quality image synthesis steps are thus circumvented in our MetaF2N, and it requires only one fine-tuning step to get decent performance. Considering the gaps between the recovered faces and ground-truths, we further deploy a MaskNet for adaptively predicting loss weights at different positions to reduce the impact of low-confidence areas. To evaluate our proposed MetaF2N, we have collected a real-world low-quality dataset with one or multiple faces in each image, and our MetaF2N achieves superior performance on both synthetic and real-world datasets. Source code, pre-trained models, and collected datasets are available at this https URL.
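The MaskNet's role, down-weighting low-confidence positions in the loss, can be sketched as a confidence-weighted fidelity term; the names and normalisation below are illustrative assumptions, not MetaF2N's exact formulation:

```python
# Confidence-weighted fidelity: a predicted weight map down-weights positions
# where the recovered face is unreliable (the role the MaskNet plays above).
import torch

def weighted_fidelity_loss(sr: torch.Tensor, gt: torch.Tensor,
                           weight_map: torch.Tensor, eps: float = 1e-8):
    """sr, gt: (B, C, H, W); weight_map: (B, 1, H, W) in [0, 1]."""
    per_pixel = (sr - gt).abs()
    return (weight_map * per_pixel).sum() / (weight_map.sum() * sr.shape[1] + eps)
```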
https://arxiv.org/abs/2309.08113
Talking Face Generation (TFG) aims to reconstruct facial movements, achieving natural lip motion from audio and the facial features potentially correlated with it. Existing TFG methods have made significant advancements in producing natural and realistic images. However, most works rarely take visual quality into consideration, and it is challenging to ensure lip synchronization while avoiding visual quality degradation in cross-modal generation methods. To address this issue, we propose a universal High-Definition Teeth Restoration Network, dubbed HDTR-Net, for arbitrary TFG methods. HDTR-Net can enhance teeth regions at an extremely fast speed while maintaining synchronization and temporal consistency. In particular, we propose a Fine-Grained Feature Fusion (FGFF) module to effectively capture fine texture feature information around the teeth and surrounding regions, and use these features to refine the feature map to enhance the clarity of teeth. Extensive experiments show that our method can be adapted to arbitrary TFG methods without degrading lip synchronization or frame coherence. Another advantage of HDTR-Net is its real-time generation ability: for high-definition restoration of synthesized talking-face video, its inference speed is $300\%$ faster than the current state-of-the-art super-resolution-based face restoration.
https://arxiv.org/abs/2309.07495
Audio super-resolution is a fundamental task that predicts high-frequency components for low-resolution audio, enhancing audio quality in digital applications. Previous methods are limited in the scope of audio types they support (e.g., music, speech) and the specific bandwidth settings they can handle (e.g., 4 kHz to 8 kHz). In this paper, we introduce a diffusion-based generative model, AudioSR, that is capable of performing robust audio super-resolution on versatile audio types, including sound effects, music, and speech. Specifically, AudioSR can upsample any input audio signal within the bandwidth range of 2 kHz to 16 kHz to a high-resolution audio signal with 24 kHz bandwidth at a sampling rate of 48 kHz. Extensive objective evaluation on various audio super-resolution benchmarks demonstrates the strong results achieved by the proposed model. In addition, our subjective evaluation shows that AudioSR can act as a plug-and-play module to enhance the generation quality of a wide range of audio generative models, including AudioLDM, Fastspeech2, and MusicGen. Our code and demo are available at this https URL.
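Since AudioSR accepts arbitrary input bandwidths but outputs at a fixed 48 kHz rate, a plausible preprocessing step is to resample the input before inference, leaving the model to synthesise only the missing high-frequency content. A sketch assuming torchaudio (AudioSR's actual input pipeline may differ):

```python
# Illustrative preprocessing for a bandwidth-flexible audio SR model: bring
# whatever arrives up to the model's 48 kHz working rate first.
import torchaudio
import torchaudio.functional as AF

def load_for_sr(path: str, model_sr: int = 48_000):
    waveform, sr = torchaudio.load(path)
    if sr != model_sr:
        waveform = AF.resample(waveform, orig_freq=sr, new_freq=model_sr)
    return waveform
```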
https://arxiv.org/abs/2309.07314
We aim to provide a general framework for computational photography that recovers the real scene from imperfect images via Deep Nonparametric Convexified Filtering (DNCF). It consists of a nonparametric deep network that resembles the physical equations behind image formation, such as denoising, super-resolution, inpainting, and flash. DNCF has no parameterization dependent on training data, and therefore generalizes strongly and is robust to adversarial image manipulation. During inference, we also encourage the network parameters to be nonnegative, creating a bi-convex function of the input and parameters; this suits second-order optimization algorithms under tight running-time budgets, yielding a 10X acceleration over Deep Image Prior. With these tools, we empirically verify its capability to defend image classification deep networks against adversarial attack algorithms in real time.
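The nonnegativity constraint on parameters can be realised with a simple projected-gradient step, clamping after each update; the bi-convex structure and solver details are the paper's, and this sketch only illustrates the projection:

```python
# Projected-gradient sketch of the nonnegativity constraint: after each
# optimiser step, parameters are projected back onto the nonnegative orthant.
import torch

def projected_step(params, optimizer):
    optimizer.step()
    with torch.no_grad():
        for p in params:
            p.clamp_(min=0.0)
```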
https://arxiv.org/abs/2309.06724
Convolution is a fundamental operation in image processing and machine learning. Aimed primarily at maintaining image size, padding is a key ingredient of convolution which, however, can introduce undesirable boundary effects. We present a non-padding-based method for size-keeping convolution based on the preservation of the differential characteristics of kernels. The main idea is to make convolution over an incomplete sliding window "collapse" to a linear differential operator evaluated locally at its central pixel, which no longer requires information from the neighbouring missing pixels. While the underlying theory is rigorous, our final formula turns out to be simple: the convolution over an incomplete window is achieved by convolving its nearest complete window with a transformed kernel. This formula is computationally lightweight, involving neither interpolation nor extrapolation, nor restrictions on image and kernel sizes. Our method favours data with smooth boundaries, such as high-resolution images and fields from physics. Our experiments include: i) filtering analytical and non-analytical fields from computational physics, and ii) training convolutional neural networks (CNNs) for the tasks of image classification, semantic segmentation and super-resolution reconstruction. In all these experiments, our method exhibits visible superiority over the compared ones.
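A toy 1-D version of the boundary formula: the transformed kernel is obtained by matching the kernel's discrete moments (its differential characteristics) over the shifted, complete window. This sketch is exact for polynomials up to the kernel length and illustrates the idea rather than the paper's 2-D construction:

```python
# The convolution a kernel would perform over an incomplete boundary window is
# reproduced by a transformed kernel over the nearest complete window, chosen
# so that discrete moments (and hence the differential operator) agree.
import numpy as np

def boundary_kernel(kernel: np.ndarray, shift: int) -> np.ndarray:
    """kernel: odd-length 1-D kernel on offsets -r..r; shift: distance from the
    boundary pixel to the centre of its nearest complete window."""
    r = len(kernel) // 2
    offsets = np.arange(-r, r + 1)
    moments = np.array([(kernel * offsets**m).sum() for m in range(len(kernel))])
    shifted = offsets + shift  # window positions seen from the boundary pixel
    vander = np.stack([shifted**m for m in range(len(kernel))])
    return np.linalg.solve(vander, moments)

print(boundary_kernel(np.array([-0.5, 0.0, 0.5]), shift=1))
```

Running it prints `[-1.5, 2, -0.5]`: the central difference collapses to the standard second-order one-sided derivative stencil at the boundary, with no padding required.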
https://arxiv.org/abs/2309.06370
Contrastive learning has emerged as a prevailing paradigm for high-level vision tasks and, by introducing suitable negative samples, has also been exploited for low-level vision tasks to achieve a compact optimization space that accounts for their ill-posed nature. However, existing methods rely on manually predefined, task-oriented negatives, which often exhibit pronounced task-specific biases. In this paper, we propose an innovative approach for the adaptive generation of negative samples directly from the target model itself, called ``learning from history''. We introduce the Self-Prior guided Negative loss for image restoration (SPNIR) to enable this approach. Our approach is task-agnostic and generic, making it compatible with any existing image restoration method or task. We demonstrate the effectiveness of our approach by retraining existing models with SPNIR. The results show significant improvements in image restoration across various tasks and architectures. For example, models retrained with SPNIR outperform the original FFANet and DehazeFormer by 3.41 dB and 0.57 dB on the RESIDE indoor dataset for image dehazing. Similarly, they achieve notable improvements of 0.47 dB over IDT on SPA-Data for image deraining and 0.12 dB over lightweight SwinIR on Manga109 for 4x-scale super-resolution. Code and retrained models are available at this https URL.
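A sketch of the "learning from history" idea: predictions saved from earlier training steps serve as negatives, and a contrastive ratio (of the kind used in prior contrastive restoration work) pulls the current output toward the ground truth and away from its own past mistakes. The loss form and weighting are assumptions; see the paper for the actual SPNIR formulation:

```python
# Illustrative history-based contrastive term: detached predictions from
# earlier epochs act as negatives for the current restoration.
import torch
import torch.nn.functional as F

def self_prior_negative_loss(pred, gt, history_preds, weight: float = 0.1):
    """history_preds: list of detached predictions from earlier training steps."""
    positive = F.l1_loss(pred, gt)
    negative = torch.stack([F.l1_loss(pred, h.detach())
                            for h in history_preds]).mean()
    # Ratio form: small distance to GT, large distance to history negatives.
    return positive + weight * positive / (negative + 1e-8)
```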
https://arxiv.org/abs/2309.06023
Medical image segmentation is critical for diagnosing and treating spinal disorders. However, the presence of high noise, ambiguity, and uncertainty makes this task highly challenging. Factors such as unclear anatomical boundaries, inter-class similarities, and irrational annotations contribute to this challenge. Achieving both accurate and diverse segmentation templates is essential to support radiologists in clinical practice. In recent years, denoising diffusion probabilistic modeling (DDPM) has emerged as a prominent research topic in computer vision. It has demonstrated effectiveness in various vision tasks, including image deblurring, super-resolution, anomaly detection, and even semantic representation generation at the pixel level. Despite the robustness of existing diffusion models in visual generation tasks, they still struggle with discrete masks and their various effects. To address the need for accurate and diverse spine medical image segmentation templates, we propose an end-to-end framework called VerseDiff-UNet, which leverages the denoising diffusion probabilistic model (DDPM). Our approach integrates the diffusion model into a standard U-shaped architecture. At each step, we combine the noise-added image with the labeled mask to guide the diffusion direction accurately towards the target region. Furthermore, to capture anatomy-specific prior information in medical images, we incorporate a shape prior module. This module efficiently extracts structural semantic information from the input spine images. We evaluate our method on a single dataset of spine images acquired through X-ray imaging. Our results demonstrate that VerseDiff-UNet significantly outperforms other state-of-the-art methods in accuracy while preserving the natural features and variations of anatomy.
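A minimal conditional-diffusion training step in the spirit described above: the noisy mask is concatenated with the spine image so the denoiser is steered by anatomy at every timestep. Noise-schedule handling and the shape prior module are omitted, and all names are illustrative:

```python
# Conditional DDPM training step for segmentation: diffuse the mask, condition
# the denoiser on the image, regress the injected noise.
import torch
import torch.nn.functional as F

def diffusion_step(denoiser, image, mask, alphas_cumprod):
    """image: (B, C, H, W); mask: (B, 1, H, W); alphas_cumprod: (T,) schedule."""
    b = mask.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=mask.device)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    noise = torch.randn_like(mask)
    noisy_mask = a_bar.sqrt() * mask + (1 - a_bar).sqrt() * noise
    pred_noise = denoiser(torch.cat([image, noisy_mask], dim=1), t)
    return F.mse_loss(pred_noise, noise)
```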
https://arxiv.org/abs/2309.05929
Super-resolution of images captured in ultra-dark environments is a practical yet challenging problem that has received little attention. Due to uneven illumination and low signal-to-noise ratio in dark environments, a multitude of problems such as lack of detail and color distortion may be magnified in the super-resolution process compared to normal-lighting environments. Consequently, conventional low-light enhancement or super-resolution methods, whether applied individually or in a cascaded manner, often encounter limitations in recovering luminance, color fidelity, and intricate details. To overcome these issues, this paper proposes a specialized dual-modulated learning framework that, for the first time, attempts to deeply dissect the nature of the low-light super-resolution task. Leveraging natural image color characteristics, we introduce a self-regularized luminance constraint as a prior for addressing uneven lighting. Expanding on this, we develop Illuminance-Semantic Dual Modulation (ISDM) components to enhance feature-level preservation of illumination and color details. Besides, instead of deploying naive up-sampling strategies, we design the Resolution-Sensitive Merging Up-sampler (RSMU) module, which brings together different sampling modalities as substrates, effectively mitigating artifacts and halos. Comprehensive experiments showcase the applicability and generalizability of our approach to diverse and challenging ultra-low-light conditions, outperforming state-of-the-art methods with a notable improvement (i.e., $\uparrow$5\% in PSNR and $\uparrow$43\% in LPIPS). Especially noteworthy is the 19-fold increase in the RMSE score, underscoring our method's exceptional generalization across different darkness levels. The code will be available online upon publication of the paper.
https://arxiv.org/abs/2309.05267
Transformer-based methods have shown impressive performance in image restoration tasks such as image super-resolution and denoising. However, through attribution analysis, we find that these networks can only utilize a limited spatial range of input information. This implies that the potential of the Transformer is still not fully exploited in existing networks. In order to activate more input pixels for better restoration, we propose a new Hybrid Attention Transformer (HAT). It combines both channel attention and window-based self-attention schemes, thus making use of their complementary advantages. Moreover, to better aggregate cross-window information, we introduce an overlapping cross-attention module to enhance the interaction between neighboring window features. In the training stage, we additionally adopt a same-task pre-training strategy to further exploit the potential of the model. Extensive experiments have demonstrated the effectiveness of the proposed modules. We further scale up the model to show that the performance of the SR task can be greatly improved. Besides, we extend HAT to more image restoration applications, including real-world image super-resolution, Gaussian image denoising, and image compression artifact reduction. Experiments on benchmark and real-world datasets demonstrate that our HAT achieves state-of-the-art performance both quantitatively and qualitatively. Codes and models are publicly available at this https URL.
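A minimal sketch of the hybrid idea: window-partitioned self-attention handles local spatial interactions while a squeeze-and-excitation-style channel branch injects global channel statistics. HAT's published block (including the overlapping cross-attention) is richer than this:

```python
# Hybrid attention sketch: non-overlapping window self-attention modulated by
# a global channel-attention branch.
import torch
import torch.nn as nn

class HybridAttentionSketch(nn.Module):
    def __init__(self, dim: int, window: int = 8, heads: int = 4):
        super().__init__()  # dim must be divisible by heads
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, dim // 4, 1), nn.GELU(),
            nn.Conv2d(dim // 4, dim, 1), nn.Sigmoid())

    def forward(self, x):  # x: (B, C, H, W), with H and W divisible by window
        b, c, h, w = x.shape
        ws = self.window
        windows = (x.reshape(b, c, h // ws, ws, w // ws, ws)
                     .permute(0, 2, 4, 3, 5, 1)
                     .reshape(-1, ws * ws, c))      # (B*nWin, ws*ws, C)
        sa, _ = self.attn(windows, windows, windows)
        sa = (sa.reshape(b, h // ws, w // ws, ws, ws, c)
                .permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w))
        return sa * self.channel(x) + x
```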
https://arxiv.org/abs/2309.05239
In many imaging applications where segmented features (e.g. blood vessels) are further used in other numerical simulations (e.g. finite element analysis), the obtained surfaces do not have resolutions fine enough for the task, so increasing the resolution of such surfaces becomes crucial. This paper proposes a new variational model for solving this problem, built on an Euler-elastica regulariser. Further, we propose and implement two numerical algorithms for solving the model: a projected gradient descent method and the alternating direction method of multipliers. Numerical experiments on real-life examples (including two taken from the outputs of another variational model) illustrate the method's effectiveness. The advantages of the new model are shown through quantitative comparisons of the standard deviations of Gaussian and mean curvatures, from the viewpoint of discrete geometry.
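For orientation, the classical Euler-elastica energy that such a regulariser builds on penalises both length/area and squared curvature; the paper's exact surface formulation and fidelity term are not reproduced here:

```latex
% Classical Euler-elastica regulariser (level-set form), with a > 0 weighting
% length/area and b > 0 weighting squared curvature; the model's fidelity term
% and surface representation are assumptions not reproduced here.
\mathcal{E}(u) = \int_{\Omega} \left( a + b\,\kappa^{2} \right)
\lvert \nabla u \rvert \, dx,
\qquad
\kappa = \nabla \cdot \frac{\nabla u}{\lvert \nabla u \rvert}
```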
https://arxiv.org/abs/2309.05071
Despite substantial advances, single-image super-resolution (SISR) always faces the dilemma of reconstructing high-quality images from the limited information in one input image, especially in realistic scenarios. In this paper, we establish a large-scale real-world burst super-resolution dataset, i.e., RealBSR, to explore the faithful reconstruction of image details from multiple frames. Furthermore, we introduce a Federated Burst Affinity network (FBAnet) to investigate non-trivial pixel-wise displacements among images under real-world image degradation. Specifically, rather than using pixel-wise alignment, our FBAnet employs a simple homography alignment from a structural geometry perspective and a Federated Affinity Fusion (FAF) strategy to aggregate the complementary information among frames. The fused informative representations are fed to a Transformer-based burst representation decoding module. Besides, we have conducted extensive experiments on two versions of our dataset, i.e., RealBSR-RAW and RealBSR-RGB. Experimental results demonstrate that our FBAnet outperforms existing state-of-the-art burst SR methods and also achieves visually pleasing SR predictions with fine details. Our dataset, codes, and models are publicly available at this https URL.
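Homography alignment of burst frames is standard and can be sketched with OpenCV; the feature detector (ORB) and RANSAC estimator below are assumptions, not necessarily FBAnet's own estimation procedure:

```python
# Standard homography alignment of one burst frame onto a reference: match
# features, fit a RANSAC homography, warp the frame onto the reference grid.
import cv2
import numpy as np

def align_to_reference(frame: np.ndarray, reference: np.ndarray) -> np.ndarray:
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(frame, None)
    k2, d2 = orb.detectAndCompute(reference, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    h, w = reference.shape[:2]
    return cv2.warpPerspective(frame, H, (w, h))
```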
https://arxiv.org/abs/2309.04803