Transformer-based encoder-decoder models have achieved remarkable success in image-to-image transfer tasks, particularly in image restoration. However, their high computational complexity, manifested in elevated FLOPs and parameter counts, limits their application in real-world scenarios. Existing knowledge distillation methods in image restoration typically employ lightweight student models that directly mimic the intermediate features and reconstruction results of the teacher, overlooking the implicit attention relationships between them. To address this, we propose a Soft Knowledge Distillation (SKD) strategy that incorporates a Multi-dimensional Cross-net Attention (MCA) mechanism for compressing image restoration models. This mechanism facilitates interaction between the student and teacher across both channel and spatial dimensions, enabling the student to implicitly learn the attention matrices. Additionally, we employ a Gaussian kernel function to measure the distance between student and teacher features in kernel space, ensuring stable and efficient feature learning. To further enhance the quality of reconstructed images, we replace the commonly used L1 or KL divergence loss with a contrastive learning loss at the image level. Experiments on three tasks (image deraining, deblurring, and denoising) demonstrate that our SKD strategy significantly reduces computational complexity while maintaining strong image restoration capabilities.
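One way to picture the Gaussian-kernel feature term: with a Gaussian kernel k, the squared distance between student and teacher features in kernel space reduces to 2 - 2·k(s, t). A minimal PyTorch sketch under that reading (the bandwidth sigma, shapes, and names are illustrative assumptions, not the paper's code):

```python
import torch

def gaussian_kernel_distance(f_student, f_teacher, sigma=2.0):
    # In the RKHS of a Gaussian kernel, ||phi(s) - phi(t)||^2
    # = k(s,s) + k(t,t) - 2 k(s,t) = 2 - 2 exp(-||s - t||^2 / (2 sigma^2)).
    sq = (f_student - f_teacher).pow(2).flatten(1).sum(dim=1)  # per-sample ||s - t||^2
    k_st = torch.exp(-sq / (2 * sigma ** 2))
    return (2.0 - 2.0 * k_st).mean()

# Hypothetical usage on matched feature maps of shape (B, C, H, W).
loss_feat = gaussian_kernel_distance(torch.randn(4, 64, 32, 32),
                                     torch.randn(4, 64, 32, 32))
```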
https://arxiv.org/abs/2501.09321
Model compression through knowledge distillation has seen extensive application in classification and segmentation tasks. However, its potential in image-to-image translation, particularly in image restoration, remains underexplored. To address this gap, we propose a Simultaneous Learning Knowledge Distillation (SLKD) framework tailored for model compression in image restoration tasks. SLKD employs a dual-teacher, single-student architecture with two distinct learning strategies applied simultaneously: Degradation Removal Learning (DRL) and Image Reconstruction Learning (IRL). In DRL, the student encoder learns from Teacher A to focus on removing degradation factors, guided by a novel BRISQUE extractor. In IRL, the student decoder learns from Teacher B to reconstruct clean images, with the assistance of a proposed PIQE extractor. These strategies enable the student to learn from degraded and clean images simultaneously, ensuring high-quality compression of image restoration models. Experimental results across five datasets and three tasks demonstrate that SLKD achieves substantial reductions in FLOPs and parameters, exceeding 80%, while maintaining strong image restoration performance.
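Read literally, the dual-teacher setup yields one feature-matching term per strategy. A minimal PyTorch sketch of that structure (plain L1 terms stand in for the paper's BRISQUE/PIQE-guided losses; all tensor names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def slkd_loss(f_student_enc, f_teacher_a_enc, out_student, out_teacher_b):
    # DRL: the student encoder follows Teacher A's encoder features
    # (degradation removal); IRL: the student decoder follows Teacher B's
    # reconstruction.  The guided extractors of the paper are omitted.
    loss_drl = F.l1_loss(f_student_enc, f_teacher_a_enc)
    loss_irl = F.l1_loss(out_student, out_teacher_b)
    return loss_drl + loss_irl

loss = slkd_loss(torch.randn(2, 64, 64, 64), torch.randn(2, 64, 64, 64),
                 torch.rand(2, 3, 256, 256), torch.rand(2, 3, 256, 256))
```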
https://arxiv.org/abs/2501.09268
Large language models (LLMs) have spurred development in multiple industries. However, the growing number of their parameters brings substantial storage and computing burdens, making it essential to explore model compression techniques for parameter reduction and easier deployment. We propose SWSC, an LLM compression method based on the concept of Shared Weight for Similar Channel. It uses the K-Means clustering algorithm to cluster model weights channel-by-channel, generating clusters with highly similar vectors within each. A representative vector from each cluster is selected to approximately replace all vectors in that cluster, significantly reducing the number of model weight parameters. However, this approximate replacement inevitably degrades model performance. To tackle this issue, we perform singular value decomposition on the weight error between the pre- and post-compression weights, and retain the larger singular values and their corresponding singular vectors to compensate for the accuracy loss. The experimental results show that our method can effectively ensure the performance of the compressed LLM even under low-precision conditions.
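A small NumPy/scikit-learn sketch of the two SWSC steps as described — channel-wise K-Means weight sharing plus low-rank SVD compensation of the resulting error (cluster count and rank are illustrative, not the paper's settings):

```python
import numpy as np
from sklearn.cluster import KMeans

def swsc_compress(W, n_clusters=32, rank=4):
    # Step 1: cluster weight channels (rows) and share one centroid per cluster.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(W)
    W_shared = km.cluster_centers_[km.labels_]
    # Step 2: SVD of the compression error; keep the largest singular values
    # and their singular vectors as a low-rank compensation term.
    E = W - W_shared
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    E_lowrank = (U[:, :rank] * S[:rank]) @ Vt[:rank]
    return W_shared + E_lowrank, km.labels_

W_hat, labels = swsc_compress(np.random.randn(256, 512).astype(np.float32))
```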
https://arxiv.org/abs/2501.08631
Optical remote sensing images play a crucial role in the observation of the Earth's surface. However, obtaining complete optical remote sensing images is challenging due to cloud cover. Reconstructing cloud-free optical images has become a major task in recent years. This paper presents a two-flow Polarimetric Synthetic Aperture Radar (PolSAR)-Optical data fusion cloud removal algorithm (PODF-CR), which achieves the reconstruction of missing optical images. PODF-CR consists of an encoding module and a decoding module. The encoding module includes two parallel branches that extract PolSAR image features and optical image features. To address speckle noise in PolSAR images, we introduce dynamic filters in the PolSAR branch for image denoising. To better facilitate the fusion between multimodal optical images and PolSAR images, we propose fusion blocks based on cross-skip connections to enable interaction of multimodal data information. The obtained fusion features are refined through an attention mechanism to provide better conditions for the subsequent decoding of the fused images. In the decoding module, multi-scale convolution is introduced to obtain multi-scale information. Additionally, to better utilize comprehensive scattering information and polarization characteristics to assist in the restoration of optical images, we use a dataset for cloud restoration called OPT-BCFSAR-PFSAR, which includes backscatter coefficient feature images and polarization feature images obtained from PolSAR data and optical images. Experimental results demonstrate that this method outperforms existing methods in both qualitative and quantitative evaluations.
https://arxiv.org/abs/2501.07901
Speech super-resolution (SR), which generates a waveform at a higher sampling rate from its low-resolution version, is a long-standing critical task in speech restoration. Previous works have explored speech SR in different data spaces, but these methods either require additional compression networks or exhibit limited synthesis quality and inference speed. Motivated by recent advances in probabilistic generative models, we present Bridge-SR, a novel and efficient any-to-48kHz SR system in the speech waveform domain. Using tractable Schrödinger Bridge models, we leverage the observed low-resolution waveform as a prior, which is intrinsically informative for the high-resolution target. By optimizing a lightweight network to learn the score functions from the prior to the target, we achieve efficient waveform SR through a data-to-data generation process that fully exploits the instructive content contained in the low-resolution observation. Furthermore, we identify the importance of the noise schedule, data scaling, and auxiliary loss functions, which further improve the SR quality of bridge-based systems. The experiments conducted on the benchmark dataset VCTK demonstrate the efficiency of our system: (1) in terms of sample quality, Bridge-SR outperforms several strong baseline methods under different SR settings, using a lightweight network backbone (1.7M); (2) in terms of inference speed, our 4-step synthesis achieves better performance than the 8-step conditional diffusion counterpart (LSD: 0.911 vs 0.927). Demo at this https URL.
https://arxiv.org/abs/2501.07897
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in chart understanding tasks. However, interpreting charts with textual descriptions often leads to information loss, as it fails to fully capture the dense information embedded in charts. In contrast, parsing charts into code provides lossless representations that can effectively contain all critical details. Although existing open-source MLLMs have achieved success in chart understanding tasks, they still face two major challenges when applied to chart-to-code tasks: (1) low executability and poor restoration of chart details in the generated code, and (2) a lack of large-scale and diverse training data. To address these challenges, we propose ChartCoder, the first dedicated chart-to-code MLLM, which leverages Code LLMs as the language backbone to enhance the executability of the generated code. Furthermore, we introduce Chart2Code-160k, the first large-scale and diverse dataset for chart-to-code generation, and propose the Snippet-of-Thought (SoT) method, which transforms direct chart-to-code generation data into step-by-step generation. Experiments demonstrate that ChartCoder, with only 7B parameters, surpasses existing open-source MLLMs on chart-to-code benchmarks, achieving superior chart restoration and code executability. Our code will be available at this https URL.
https://arxiv.org/abs/2501.06598
This paper introduces an all-neural text formatting (TF) model designed for commercial automatic speech recognition (ASR) systems, encompassing punctuation restoration (PR), truecasing, and inverse text normalization (ITN). Unlike traditional rule-based or hybrid approaches, this method leverages a two-stage neural architecture comprising a multi-objective token classifier and a sequence-to-sequence (seq2seq) model. This design minimizes computational costs and reduces hallucinations while ensuring flexibility and robustness across diverse linguistic entities and text domains. Developed as part of the Universal-2 ASR system, the proposed method demonstrates superior performance in TF accuracy, computational efficiency, and perceptual quality, as validated through comprehensive evaluations using both objective and subjective methods. This work underscores the importance of holistic TF models in enhancing ASR usability in practical settings.
https://arxiv.org/abs/2501.05948
Video restoration plays a pivotal role in revitalizing degraded video content by rectifying imperfections caused by various degradations introduced during capturing (sensor noise, motion blur, etc.), saving/sharing (compression, resizing, etc.) and editing. This paper introduces a novel algorithm designed for scenarios where noise is introduced during video capture, aiming to enhance the visual quality of videos by reducing unwanted noise artifacts. We propose the Latent space LSTM Video Denoiser (LLVD), an end-to-end blind denoising model. LLVD uniquely combines spatial and temporal feature extraction, employing Long Short Term Memory (LSTM) within the encoded feature domain. This integration of LSTM layers is crucial for maintaining continuity and minimizing flicker in the restored video. Moreover, processing frames in the encoded feature domain significantly reduces computations, resulting in a very lightweight architecture. LLVD's blind nature makes it versatile for real, in-the-wild denoising scenarios where prior information about noise characteristics is not available. Experiments reveal that LLVD demonstrates excellent performance for both synthetic and captured noise. Specifically, LLVD surpasses the current State-Of-The-Art (SOTA) in RAW denoising by 0.3 dB, while also achieving a 59% reduction in computational complexity.
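To make the "LSTM in the encoded feature domain" idea concrete, here is a toy PyTorch sketch: frames are encoded to a downsampled latent, an LSTM runs over time at each latent location, and a decoder maps back to frames. Layer sizes and structure are illustrative only, not the LLVD architecture:

```python
import torch
import torch.nn as nn

class TinyLatentLSTMDenoiser(nn.Module):
    """Toy illustration only: denoise in a downsampled feature space and run
    an LSTM across frames for temporal consistency (sizes are arbitrary)."""
    def __init__(self, ch=16):
        super().__init__()
        self.enc = nn.Conv2d(3, ch, 4, stride=4)           # frame -> latent
        self.lstm = nn.LSTM(ch, ch, batch_first=True)      # shared across locations
        self.dec = nn.ConvTranspose2d(ch, 3, 4, stride=4)  # latent -> frame

    def forward(self, video):                              # video: (B, T, 3, H, W)
        B, T, C, H, W = video.shape
        z = self.enc(video.reshape(B * T, C, H, W))
        ch, h, w = z.shape[1:]
        # Treat each latent location as a sequence over time for the LSTM.
        seq = z.reshape(B, T, ch, h * w).permute(0, 3, 1, 2).reshape(B * h * w, T, ch)
        seq, _ = self.lstm(seq)
        z = seq.reshape(B, h * w, T, ch).permute(0, 2, 3, 1).reshape(B * T, ch, h, w)
        return self.dec(z).reshape(B, T, C, H, W)

out = TinyLatentLSTMDenoiser()(torch.rand(1, 5, 3, 64, 64))
```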
https://arxiv.org/abs/2501.05744
Advancements in imaging technology have enabled hardware to support 10 to 16 bits per channel, facilitating precise manipulation in applications like image editing and video processing. While deep neural networks promise to recover high bit-depth representations, existing methods often rely on scale-invariant image information, limiting performance in certain scenarios. In this paper, we introduce a novel approach that integrates a super-resolution architecture to extract detailed a priori information from images. By leveraging interpolated data generated during the super-resolution process, our method achieves pixel-level recovery of fine-grained color details. Additionally, we demonstrate that spatial features learned through the super-resolution process significantly contribute to the recovery of detailed color depth information. Experiments on benchmark datasets demonstrate that our approach outperforms state-of-the-art methods, highlighting the potential of super-resolution for high-fidelity color restoration.
https://arxiv.org/abs/2501.05611
Blind face restoration is a highly ill-posed problem due to the lack of necessary context. Although existing methods produce high-quality outputs, they often fail to faithfully preserve the individual's identity. In this paper, we propose a personalized face restoration method, FaceMe, based on a diffusion model. Given a single or a few reference images, we use an identity encoder to extract identity-related features, which serve as prompts to guide the diffusion model in restoring high-quality and identity-consistent facial images. By simply combining identity-related features, we effectively minimize the impact of identity-irrelevant features during training and support any number of reference image inputs during inference. Additionally, thanks to the robustness of the identity encoder, synthesized images can be used as reference images during training, and identity changing during inference does not require fine-tuning the model. We also propose a pipeline for constructing a reference image training pool that simulates the poses and expressions that may appear in real-world scenarios. Experimental results demonstrate that our FaceMe can restore high-quality facial images while maintaining identity consistency, achieving excellent performance and robustness.
https://arxiv.org/abs/2501.05177
In Costa Rica, an average of 5 tons of seashells are extracted from ecosystems annually. Confiscated seashells cannot be returned to their ecosystems because their origin cannot be recognized. To address this issue, we developed a convolutional neural network (CNN) specifically for seashell identification. We built a dataset from scratch, consisting of approximately 19,000 images from the Pacific and Caribbean coasts. Using this dataset, the model achieved a classification accuracy exceeding 85%. The model has been integrated into a user-friendly application, which has classified over 36,000 seashells to date, delivering real-time results within 3 seconds per image. To further enhance the system's accuracy, an anomaly detection mechanism was incorporated to filter out irrelevant or anomalous inputs, ensuring only valid seashell images are processed.
https://arxiv.org/abs/2501.04873
Recently, Transformer networks have demonstrated outstanding performance in the field of image restoration due to their global receptive field and adaptability to input. However, the quadratic computational complexity of Softmax attention poses a significant limitation on its extensive application in image restoration tasks, particularly for high-resolution images. To tackle this challenge, we propose a novel variant of the Transformer. This variant leverages the Taylor expansion to approximate Softmax attention and utilizes the concept of norm-preserving mapping to approximate the remainder of the first-order Taylor expansion, resulting in linear computational complexity. Moreover, we introduce a multi-branch architecture featuring multi-scale patch embedding into the proposed Transformer, which has four distinct advantages: 1) various sizes of the receptive field; 2) multi-level semantic information; 3) flexible shapes of the receptive field; 4) accelerated training and inference speed. Hence, the proposed model, named MB-TaylorFormer V2 (the second version of the Taylor-formula-expansion-based Transformer), has the capability to concurrently process coarse-to-fine features, capture long-distance pixel interactions at limited computational cost, and improve the approximation of the Taylor expansion remainder. Experimental results across diverse image restoration benchmarks demonstrate that MB-TaylorFormer V2 achieves state-of-the-art performance in multiple image restoration tasks, such as image dehazing, deraining, desnowing, motion deblurring, and denoising, with very little computational overhead. The source code is available at this https URL.
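The first-order Taylor trick can be sketched directly: approximating exp(q·k) by 1 + q·k lets attention factorize into sums over keys, giving cost linear in sequence length. A hedged PyTorch sketch (L2-normalizing q and k keeps the kernel non-negative; the paper's norm-preserving remainder correction is omitted):

```python
import torch
import torch.nn.functional as F

def taylor_linear_attention(q, k, v):
    # Softmax attention with exp(q.k) ~ 1 + q.k (first-order Taylor):
    # out_i = sum_j (1 + q_i.k_j) v_j / sum_j (1 + q_i.k_j), computed in O(N d^2).
    q = F.normalize(q, dim=-1)                       # (B, N, d), q.k in [-1, 1]
    k = F.normalize(k, dim=-1)
    kv = torch.einsum('bnd,bne->bde', k, v)          # sum_j k_j v_j^T
    num = v.sum(dim=1, keepdim=True) + torch.einsum('bnd,bde->bne', q, kv)
    den = k.shape[1] + torch.einsum('bnd,bd->bn', q, k.sum(dim=1)).unsqueeze(-1)
    return num / (den + 1e-6)

out = taylor_linear_attention(torch.randn(2, 1024, 32),
                              torch.randn(2, 1024, 32),
                              torch.randn(2, 1024, 32))
```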
https://arxiv.org/abs/2501.04486
We present numerical and analytical results on the formation and stability of a family of fixed points of deep neural networks (DNNs). Such fixed points appear in a class of DNNs when the dimensions of the input and output vectors are the same. We demonstrate examples of applications of such networks in supervised, semi-supervised and unsupervised learning, such as encoding/decoding of images and restoration of damaged images, among others. We present several numerical and analytical results. First, we show that for untrained DNNs with weights and biases initialized by normally distributed random variables, only one fixed point exists. This result holds for DNNs of any depth (number of layers) $L$, any layer width $N$, and sigmoid-type activation functions. Second, it has been shown that for a DNN whose parameters (weights and biases) are initialized by a "light-tailed" distribution of weights (e.g. a normal distribution), after training the distribution of these parameters becomes "heavy-tailed". This motivates our study of DNNs with "heavy-tailed" initialization. For such DNNs we show numerically that training leads to the emergence of $Q(N,L)$ fixed points, where $Q(N,L)$ is a positive integer that depends on the number of layers $L$ and the layer width $N$. We further observe numerically that for fixed $N = N_0$ the function $Q(N_0, L)$ is non-monotone: it initially grows as $L$ increases and then decreases to 1. This non-monotone behavior of $Q(N_0, L)$ is also obtained by analytically deriving an equation for the Empirical Spectral Distribution (ESD) of the input-output Jacobian and then solving this equation numerically.
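A fixed point here is an input $x^*$ with $f(x^*) = x^*$ for the network map $f$. A NumPy sketch of the untrained Gaussian-initialized case, where plain fixed-point iteration converges to the single fixed point the paper reports (width, depth, and variances are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 64, 6                                       # layer width and depth
Ws = [rng.normal(0, 1 / np.sqrt(N), size=(N, N)) for _ in range(L)]
bs = [rng.normal(0, 0.1, size=N) for _ in range(L)]

def f(x):
    """L sigmoid layers with equal input and output dimension N."""
    for W, b in zip(Ws, bs):
        x = 1 / (1 + np.exp(-(W @ x + b)))
    return x

x = rng.normal(size=N)
for _ in range(200):                               # fixed-point iteration x <- f(x)
    x = f(x)
print(np.linalg.norm(f(x) - x))                    # ~0 at the fixed point x*
```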
https://arxiv.org/abs/2501.04182
This research focuses on developing a method for restoring the topology of digital images of paper documents captured by a camera, using algorithms for detection, segmentation, geometry restoration, and dewarping. Our methodology employs deep learning (DL) for document outline detection, followed by computer vision (CV) to create a topological 2D grid using cubic polynomial interpolation and to correct nonlinear distortions by remapping the image. Using classical CV methods makes the document topology restoration process faster and more efficient, as it requires significantly fewer computational resources and less memory. We developed a new pipeline for automatic document dewarping and reconstruction, along with a framework and an annotated dataset to demonstrate its efficiency. Our experiments confirm the promise of our methodology and its superiority over existing benchmarks (including mobile apps and popular DL solutions, such as RectiNet, DocGeoNet, and DocTr++) both visually and in terms of document readability via Optical Character Recognition (OCR) and geometry restoration metrics. This paves the way for creating high-quality digital copies of paper documents and enhancing the efficiency of OCR systems. Project page: this https URL
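A compressed sketch of the classical-CV half of such a pipeline: fit cubic polynomials to the detected page edges, interpolate a 2D grid between them, and undo the warp with cv2.remap. This is an illustrative reconstruction, not the paper's code; the edge points are assumed to come from the upstream DL outline detector:

```python
import cv2
import numpy as np

def dewarp(img, top_pts, bot_pts, out_w=800, out_h=600):
    # Fit cubic polynomials y(x) to the detected top and bottom page edges.
    top = np.poly1d(np.polyfit(top_pts[:, 0], top_pts[:, 1], 3))
    bot = np.poly1d(np.polyfit(bot_pts[:, 0], bot_pts[:, 1], 3))
    # Build a sampling grid that blends linearly between the two curves,
    # then pull source pixels back onto the flat output grid.
    xs = np.linspace(top_pts[:, 0].min(), top_pts[:, 0].max(), out_w)
    map_x = np.tile(xs, (out_h, 1)).astype(np.float32)
    t = np.linspace(0, 1, out_h)[:, None]
    map_y = ((1 - t) * top(xs)[None, :] + t * bot(xs)[None, :]).astype(np.float32)
    return cv2.remap(img, map_x, map_y, cv2.INTER_LINEAR)
```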
https://arxiv.org/abs/2501.03145
Diffusion models have demonstrated their utility as learned priors for solving various inverse problems. However, most existing approaches are limited to linear inverse problems. This paper exploits the efficient and unsupervised posterior sampling framework of Denoising Diffusion Restoration Models (DDRM) for the solution of the nonlinear phase retrieval problem, which requires reconstructing an image from noisy intensity-only measurements such as Fourier intensity. The approach combines model-based alternating-projection methods with DDRM to utilize pretrained unconditional diffusion priors for phase retrieval. The performance is demonstrated through both simulations and experimental data. The results demonstrate both the potential of this approach for improving alternating-projection methods and its limitations.
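For reference, the model-based baseline being combined with DDRM is an alternating-projection (error-reduction) loop of this shape; the NumPy sketch below marks where a diffusion denoising step could be inserted (all names and constraints are illustrative):

```python
import numpy as np

def alternating_projections(meas_mag, support, n_iter=200, seed=0):
    x = np.random.default_rng(seed).random(meas_mag.shape) * support
    for _ in range(n_iter):
        X = np.fft.fft2(x)
        X = meas_mag * np.exp(1j * np.angle(X))   # Fourier-magnitude constraint
        x = np.real(np.fft.ifft2(X))
        x = np.clip(x, 0, None) * support         # image-domain constraints
        # The DDRM-based method would insert a diffusion denoising /
        # posterior-sampling step on x here, beyond projections alone.
    return x

truth = np.zeros((64, 64)); truth[24:40, 24:40] = 1.0
rec = alternating_projections(np.abs(np.fft.fft2(truth)), np.ones((64, 64)))
```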
https://arxiv.org/abs/2501.03030
Underwater imaging grapples with challenges from light-water interactions, leading to color distortions and reduced clarity. In response to these challenges, we propose a novel Color Balance Prior Guided Hybrid Sense Underwater Image Restoration framework (GuidedHybSensUIR). This framework operates on multiple scales, employing the proposed Detail Restorer module to restore low-level detailed features at finer scales and the proposed Feature Contextualizer module to capture long-range contextual relations among high-level general features at a broader scale. Hybridizing these different scales of sensing effectively addresses color casts and restores blurry details. To effectively point out the evolutionary direction for the model, we propose a novel Color Balance Prior that serves as a strong guide in the feature contextualization step and as a weak guide in the final decoding phase. We construct a comprehensive benchmark using paired training data from three real-world underwater datasets and evaluate on six test sets (three paired and three unpaired) sourced from four real-world underwater datasets. On this benchmark, we tested 14 traditional methods and retrained 23 existing deep learning underwater image restoration methods, obtaining metric results for each approach. This effort aims to furnish a valuable benchmark dataset as a standard basis for comparison. The extensive experimental results demonstrate that our method outperforms the 37 other state-of-the-art methods overall across various benchmark datasets and metrics, despite not achieving the best results in certain individual cases. The code and dataset are available at this https URL.
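As one plausible instantiation of a color-balance prior (the abstract does not spell out its construction), a gray-world estimate of per-channel gains can serve as the guiding signal; this NumPy sketch is an assumption, not the paper's method:

```python
import numpy as np

def gray_world_prior(img):
    # Estimate per-channel gains that pull the mean color toward gray;
    # the gains (or the balanced image) can then act as a guiding prior.
    img = img.astype(np.float32)
    means = img.reshape(-1, 3).mean(axis=0)
    gains = means.mean() / np.maximum(means, 1e-6)
    balanced = np.clip(img * gains, 0, 255).astype(np.uint8)
    return balanced, gains
```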
https://arxiv.org/abs/2501.02701
In this paper, we propose the first diffusion-based all-in-one video restoration method that utilizes the power of a pre-trained Stable Diffusion and a fine-tuned ControlNet. Our method can restore various types of video degradation with a single unified model, overcoming the limitation of standard methods that require specific models for each restoration task. Our contributions include an efficient training strategy with Task Prompt Guidance (TPG) for diverse restoration tasks, an inference strategy that combines Denoising Diffusion Implicit Models (DDIM) inversion with a novel Sliding Window Cross-Frame Attention (SW-CFA) mechanism for enhanced content preservation and temporal consistency, and a scalable pipeline that makes our method all-in-one to adapt to different video restoration tasks. Through extensive experiments on five video restoration tasks, we demonstrate the superiority of our method in generalization capability to real-world videos and temporal consistency preservation over existing state-of-the-art methods. Our method advances the video restoration task by providing a unified solution that enhances video quality across multiple applications.
https://arxiv.org/abs/2501.02269
The Main Control Room of the Fermilab accelerator complex continuously gathers extensive time-series data from thousands of sensors monitoring the beam. However, unplanned events such as trips or voltage fluctuations often result in beam outages, causing operational downtime. This downtime not only consumes operator effort in diagnosing and addressing the issue but also leads to unnecessary energy consumption by idle machines awaiting beam restoration. The current threshold-based alarm system is reactive and faces challenges including frequent false alarms and inconsistent outage-cause labeling. To address these limitations, we propose an AI-enabled framework that leverages predictive analytics and automated labeling. Using data from 2,703 Linac devices and 80 operator-labeled outages, we evaluate state-of-the-art deep learning architectures, including recurrent, attention-based, and linear models, for beam outage prediction. Additionally, we assess a Random Forest-based labeling system for providing consistent, confidence-scored outage annotations. Our findings highlight the strengths and weaknesses of these architectures for beam outage prediction and identify critical gaps that must be addressed to fully harness AI for transitioning downtime handling from reactive to predictive, ultimately reducing downtime and improving decision-making in accelerator management.
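The Random Forest labeling component maps naturally onto scikit-learn, where predict_proba supplies the confidence scores; a toy sketch with illustrative feature layout and class names (not the paper's):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins: per-outage feature vectors and operator-assigned labels.
X = np.random.randn(80, 32)
y = np.random.choice(["rf_trip", "vacuum", "power_supply"], size=80)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
proba = clf.predict_proba(X[:5])                # per-class probabilities
labels = clf.classes_[proba.argmax(axis=1)]     # predicted outage cause
confidence = proba.max(axis=1)                  # confidence score per outage
```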
https://arxiv.org/abs/2501.01509
Video restoration poses non-trivial challenges in maintaining fidelity while recovering temporally consistent details from unknown degradations in the wild. Despite recent advances in diffusion-based restoration, these methods often face limitations in generation capability and sampling efficiency. In this work, we present SeedVR, a diffusion transformer designed to handle real-world video restoration with arbitrary length and resolution. The core design of SeedVR lies in the shifted window attention that facilitates effective restoration on long video sequences. SeedVR further supports variable-sized windows near the boundary of both spatial and temporal dimensions, overcoming the resolution constraints of traditional window attention. Equipped with contemporary practices, including causal video autoencoder, mixed image and video training, and progressive training, SeedVR achieves highly-competitive performance on both synthetic and real-world benchmarks, as well as AI-generated videos. Extensive experiments demonstrate SeedVR's superiority over existing methods for generic video restoration.
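Shifted window attention restricts attention to local tiles and shifts the tiling between layers so information crosses window borders. A minimal PyTorch sketch of the partitioning step (SeedVR's variable-sized boundary windows and the temporal dimension are not reproduced here):

```python
import torch

def shifted_window_partition(x, win=8, shift=4):
    # Roll the feature map so window borders move, then cut it into
    # non-overlapping win x win tiles; attention runs within each tile.
    B, C, H, W = x.shape
    x = torch.roll(x, shifts=(-shift, -shift), dims=(2, 3))
    x = x.unfold(2, win, win).unfold(3, win, win)     # (B, C, H//win, W//win, win, win)
    return x.contiguous().view(B, C, -1, win * win)   # tokens grouped per window

tiles = shifted_window_partition(torch.randn(1, 16, 64, 64))
```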
https://arxiv.org/abs/2501.01320
Face Restoration (FR) is a crucial area within image and video processing, focusing on reconstructing high-quality portraits from degraded inputs. Despite advancements in image FR, video FR remains relatively under-explored, primarily due to challenges related to temporal consistency, motion artifacts, and the limited availability of high-quality video data. Moreover, traditional face restoration typically prioritizes enhancing resolution and may not give as much consideration to related tasks such as facial colorization and inpainting. In this paper, we propose a novel approach for the Generalized Video Face Restoration (GVFR) task, which integrates video blind face restoration (BFR), inpainting, and colorization tasks that we empirically show to benefit each other. We present a unified framework, termed Stable Video Face Restoration (SVFR), which leverages the generative and motion priors of Stable Video Diffusion (SVD) and incorporates task-specific information through a unified face restoration framework. A learnable task embedding is introduced to enhance task identification. Meanwhile, a novel Unified Latent Regularization (ULR) is employed to encourage shared feature representation learning among the different subtasks. To further enhance restoration quality and temporal stability, we introduce facial prior learning and self-referred refinement as auxiliary strategies used during both training and inference. The proposed framework effectively combines the complementary strengths of these tasks, enhancing temporal coherence and achieving superior restoration quality. This work advances the state-of-the-art in video FR and establishes a new paradigm for generalized video face restoration.
https://arxiv.org/abs/2501.01235