In this paper, we introduce Plug-and-Play (PnP) Flow Matching, an algorithm for solving imaging inverse problems. PnP methods leverage the strength of pre-trained denoisers, often deep neural networks, by integrating them in optimization schemes. While they achieve state-of-the-art performance on various inverse problems in imaging, PnP approaches face inherent limitations on more generative tasks like inpainting. On the other hand, generative models such as Flow Matching have pushed the boundary of image sampling yet lack a clear method for efficient use in image restoration. We propose to combine the PnP framework with Flow Matching (FM) by defining a time-dependent denoiser using a pre-trained FM model. Our algorithm alternates between gradient descent steps on the data-fidelity term, reprojections onto the learned FM path, and denoising. Notably, our method is computationally efficient and memory-friendly, as it avoids backpropagation through ODEs and trace computations. We evaluate its performance on denoising, super-resolution, deblurring, and inpainting tasks, demonstrating superior results compared to existing PnP algorithms and Flow Matching based state-of-the-art methods.
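A minimal sketch of the alternating scheme described above, assuming a generic measurement operator `A`/`AT` and using a stub `velocity` function in place of the pre-trained FM model; the exact update rules of the paper may differ.

```python
import numpy as np

# Hypothetical pre-trained flow-matching velocity field v(x, t); here just a stub.
def velocity(x, t):
    return -x  # placeholder dynamics, not a trained model

def pnp_flow_matching(y, A, AT, n_steps=100, step_size=1.0, seed=0):
    """Sketch of the alternating scheme described in the abstract:
    (1) gradient step on the data-fidelity term ||A x - y||^2,
    (2) reprojection onto the learned FM path by interpolating with fresh noise,
    (3) time-dependent denoising via a one-step flow-matching estimate.
    """
    rng = np.random.default_rng(seed)
    x = AT(y)  # crude initialisation from the measurements
    for k in range(n_steps):
        t = k / n_steps
        # (1) gradient descent on the data-fidelity term
        x = x - step_size * AT(A(x) - y)
        # (2) reproject onto the FM path at time t: mix with fresh Gaussian noise
        eps = rng.standard_normal(x.shape)
        z_t = (1.0 - t) * eps + t * x
        # (3) denoise: one-step extrapolation along the learned velocity field
        x = z_t + (1.0 - t) * velocity(z_t, t)
    return x

# Toy usage: recover a vector from a noisy identity measurement.
A = lambda x: x
AT = lambda r: r
y = np.ones(8) + 0.1 * np.random.default_rng(1).standard_normal(8)
print(pnp_flow_matching(y, A, AT).shape)
```

Note that no ODE is backpropagated through: each step only evaluates the velocity network forward, which is the source of the claimed memory efficiency.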
https://arxiv.org/abs/2410.02423
Mamba, a special case of the State Space Model, is gaining popularity as an alternative to template-based deep learning approaches in medical image analysis. While transformers are powerful architectures, they have drawbacks, including quadratic computational complexity and an inability to address long-range dependencies efficiently. This limitation affects the analysis of large and complex datasets in medical imaging, where there are many spatial and temporal relationships. In contrast, Mamba offers benefits that make it well-suited for medical image analysis. It has linear time complexity, which is a significant improvement over transformers. Mamba processes longer sequences without attention mechanisms, enabling faster inference and requiring less memory. Mamba also demonstrates strong performance in merging multimodal data, improving diagnosis accuracy and patient outcomes. The organization of this paper allows readers to appreciate the capabilities of Mamba in medical imaging step by step. We begin by defining core concepts of SSMs and models, including S4, S5, and S6, followed by an exploration of Mamba architectures such as pure Mamba, U-Net variants, and hybrid models with convolutional neural networks, transformers, and Graph Neural Networks. We also cover Mamba optimizations, techniques and adaptations, scanning, datasets, applications, experimental results, and conclude with its challenges and future directions in medical imaging. This review aims to demonstrate the transformative potential of Mamba in overcoming existing barriers within medical imaging while paving the way for innovative advancements in the field. A comprehensive list of Mamba architectures applied in the medical field, reviewed in this work, is available on GitHub.
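To make the linear-time claim concrete, here is a minimal discretized state-space recurrence of the kind underlying S4-style models; the discretization and parameters are illustrative, and selective models (S6/Mamba) additionally make B, C and the step size input-dependent.

```python
import numpy as np

def ssm_scan(u, A, B, C, dt=0.1):
    """Minimal discretized linear state-space recurrence:
    h_k = Abar @ h_{k-1} + Bbar * u_k,  y_k = C @ h_k.
    A crude Euler-style discretization is used purely for illustration.
    """
    n = A.shape[0]
    Abar = np.eye(n) + dt * A          # discretized state matrix
    Bbar = dt * B                      # discretized input matrix
    h = np.zeros(n)
    ys = []
    for u_k in u:                      # linear-time scan over the sequence
        h = Abar @ h + Bbar * u_k      # state update
        ys.append(C @ h)               # readout
    return np.array(ys)

# Toy usage on a length-16 scalar sequence with a 4-dimensional state.
rng = np.random.default_rng(0)
A = -np.diag(np.arange(1, 5, dtype=float))   # stable diagonal dynamics
B = rng.standard_normal(4)
C = rng.standard_normal(4)
print(ssm_scan(np.sin(np.linspace(0, 3, 16)), A, B, C).shape)  # (16,)
```

The cost of the scan grows linearly with sequence length, in contrast to the quadratic cost of pairwise attention.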
https://arxiv.org/abs/2410.02362
Posterior sampling in high-dimensional spaces using generative models holds significant promise for various applications, including but not limited to inverse problems and guided generation tasks. Despite many recent developments, generating diverse posterior samples remains a challenge, as existing methods require restarting the entire generative process for each new sample, making the procedure computationally expensive. In this work, we propose efficient posterior sampling by simulating Langevin dynamics in the noise space of a pre-trained generative model. By exploiting the mapping between the noise and data spaces which can be provided by distilled flows or consistency models, our method enables seamless exploration of the posterior without the need to re-run the full sampling chain, drastically reducing computational overhead. Theoretically, we prove a guarantee for the proposed noise-space Langevin dynamics to approximate the posterior, assuming that the generative model sufficiently approximates the prior distribution. Our framework is experimentally validated on image restoration tasks involving noisy linear and nonlinear forward operators applied to LSUN-Bedroom (256 x 256) and ImageNet (64 x 64) datasets. The results demonstrate that our approach generates high-fidelity samples with enhanced semantic diversity even under a limited number of function evaluations, offering superior efficiency and performance compared to existing diffusion-based posterior sampling techniques.
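A toy sketch of Langevin dynamics in the noise space of a one-step generator, assuming a hypothetical generator `G` and a linear forward operator; the target density (data fidelity plus a standard Gaussian prior on the latent) follows the standard formulation, not necessarily the paper's exact one, and a real implementation would use autodiff rather than finite differences.

```python
import numpy as np

def numerical_grad(f, z, eps=1e-4):
    """Finite-difference gradient; purely illustrative, use autodiff in practice."""
    g = np.zeros_like(z)
    for i in range(z.size):
        e = np.zeros_like(z); e[i] = eps
        g[i] = (f(z + e) - f(z - e)) / (2 * eps)
    return g

def noise_space_langevin(y, A, G, sigma=0.1, step=1e-2, n_steps=200, seed=0):
    """Langevin dynamics in the noise space of a one-step generator G (e.g. a
    distilled flow or consistency model). The target is proportional to
    exp(-||A G(z) - y||^2 / (2 sigma^2) - ||z||^2 / 2).
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(4)
    def neg_log_post(z):
        r = A @ G(z) - y
        return 0.5 * (r @ r) / sigma**2 + 0.5 * (z @ z)
    samples = []
    for _ in range(n_steps):
        grad = numerical_grad(neg_log_post, z)
        z = z - step * grad + np.sqrt(2 * step) * rng.standard_normal(z.shape)
        samples.append(G(z))           # every iterate yields a new posterior sample
    return samples

# Toy usage with a linear "generator" and a random forward operator.
rng = np.random.default_rng(1)
W = rng.standard_normal((6, 4)); G = lambda z: W @ z
A = rng.standard_normal((3, 6)); y = A @ G(rng.standard_normal(4))
print(len(noise_space_langevin(y, A, G)))
```

The key point the sketch illustrates: each new sample costs one generator evaluation per Langevin step, rather than a full restart of the generative chain.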
https://arxiv.org/abs/2410.02078
Photo-realistic image restoration algorithms are typically evaluated by distortion measures (e.g., PSNR, SSIM) and by perceptual quality measures (e.g., FID, NIQE), where the desire is to attain the lowest possible distortion without compromising on perceptual quality. To achieve this goal, current methods typically attempt to sample from the posterior distribution, or to optimize a weighted sum of a distortion loss (e.g., MSE) and a perceptual quality loss (e.g., GAN). Unlike previous works, this paper is concerned specifically with the optimal estimator that minimizes the MSE under a constraint of perfect perceptual index, namely where the distribution of the reconstructed images is equal to that of the ground-truth ones. A recent theoretical result shows that such an estimator can be constructed by optimally transporting the posterior mean prediction (MMSE estimate) to the distribution of the ground-truth images. Inspired by this result, we introduce Posterior-Mean Rectified Flow (PMRF), a simple yet highly effective algorithm that approximates this optimal estimator. In particular, PMRF first predicts the posterior mean, and then transports the result to a high-quality image using a rectified flow model that approximates the desired optimal transport map. We investigate the theoretical utility of PMRF and demonstrate that it consistently outperforms previous methods on a variety of image restoration tasks.
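A minimal sketch of the two-stage procedure described above, with stub functions standing in for the trained MMSE predictor and the rectified-flow velocity field (both names are placeholders, not the paper's API).

```python
import numpy as np

# Hypothetical components: a posterior-mean (MMSE) predictor and a rectified-flow
# velocity field; both are stubs standing in for trained networks.
def posterior_mean(y):
    return y                      # placeholder: identity instead of a trained MMSE network

def rf_velocity(x, t):
    return np.zeros_like(x)       # placeholder velocity field

def pmrf_restore(y, n_steps=25):
    """Predict the posterior mean, then transport it with a rectified flow by
    Euler-integrating the learned velocity field from t=0 (posterior means)
    to t=1 (distribution of ground-truth images).
    """
    x = posterior_mean(y)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * rf_velocity(x, t)   # Euler step along the flow
    return x

print(pmrf_restore(np.ones((4, 4))).shape)
```

The flow only has to move mass from the distribution of posterior means to the distribution of clean images, which is what approximates the optimal transport map referenced in the abstract.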
https://arxiv.org/abs/2410.00418
Image restoration and spectral reconstruction are longstanding computer vision tasks. Currently, CNN-transformer hybrid models provide state-of-the-art performance for these tasks. The key common ingredient in the architectural designs of these models is Channel-wise Self-Attention (CSA). We first show that CSA is an overall low-rank operation. Then, we propose an instance-Guided Low-rank Multi-Head self-attention (GLMHA) to replace the CSA for a considerable computational gain while closely retaining the original model performance. Unique to the proposed GLMHA is its ability to provide computational gain for both short and long input sequences. In particular, the gain is in terms of both Floating Point Operations (FLOPs) and parameter count reduction. This is in contrast to the existing popular computational complexity reduction techniques, e.g., Linformer, Performer, and Reformer, for which FLOPs overpower the efficient design tricks for the shorter input sequences. Moreover, parameter reduction remains unaccounted for in the existing techniques. We perform an extensive evaluation for the tasks of spectral reconstruction from RGB images, spectral reconstruction from snapshot compressive imaging, motion deblurring, and image deraining by enhancing the best-performing models with our GLMHA. Our results show up to a 7.7 Giga FLOPs reduction with 370K fewer parameters required to closely retain the original performance of the best-performing models that employ CSA.
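The sketch below contrasts a dense channel-wise self-attention with a rank-r variant to show where the FLOPs saving comes from; the random down-projection is an assumption for illustration, whereas GLMHA derives its low-rank factors from the input instance.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_self_attention(X):
    """Channel-wise self-attention (CSA) over X of shape (C, N): a dense CxC map."""
    C, N = X.shape
    attn = softmax(X @ X.T / np.sqrt(N), axis=-1)    # (C, C) channel affinity
    return attn @ X

def low_rank_channel_attention(X, r=4, seed=0):
    """Illustrative low-rank alternative: replace the dense CxC interaction with a
    rank-r factorization, cutting FLOPs from O(C^2 N) to O(C r N).
    """
    rng = np.random.default_rng(seed)
    C, N = X.shape
    P = rng.standard_normal((r, C)) / np.sqrt(C)     # down-projection of channels
    Z = P @ X                                        # (r, N) compressed channels
    attn = softmax(X @ Z.T / np.sqrt(N), axis=-1)    # (C, r) instead of (C, C)
    return attn @ Z                                  # (C, N)

X = np.random.default_rng(1).standard_normal((64, 1024))
print(channel_self_attention(X).shape, low_rank_channel_attention(X).shape)
```

Because the factorized map also has fewer parameters than a full CxC projection, the saving shows up in parameter count as well as FLOPs, independent of sequence length.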
https://arxiv.org/abs/2410.00380
Diffusion models have become increasingly popular for generative modeling due to their ability to generate high-quality samples. This has unlocked exciting new possibilities for solving inverse problems, especially in image restoration and reconstruction, by treating diffusion models as unsupervised priors. This survey provides a comprehensive overview of methods that utilize pre-trained diffusion models to solve inverse problems without requiring further training. We introduce taxonomies to categorize these methods based on both the problems they address and the techniques they employ. We analyze the connections between different approaches, offering insights into their practical implementation and highlighting important considerations. We further discuss specific challenges and potential solutions associated with using latent diffusion models for inverse problems. This work aims to be a valuable resource for those interested in learning about the intersection of diffusion models and inverse problems.
https://arxiv.org/abs/2410.00083
Existing unified methods typically treat multi-degradation image restoration as a multi-task learning problem. Despite performing effectively compared to single degradation restoration methods, they overlook the utilization of commonalities and specificities within multi-task restoration, thereby impeding the model's performance. Inspired by the success of deep generative models and fine-tuning techniques, we propose a universal image restoration framework based on multiple low-rank adapters (LoRA) from multi-domain transfer learning. Our framework leverages the pre-trained generative model as the shared component for multi-degradation restoration and transfers it to specific degradation image restoration tasks using low-rank adaptation. Additionally, we introduce a LoRA composing strategy based on the degradation similarity, which adaptively combines trained LoRAs and enables our model to be applicable to mixed degradation restoration. Extensive experiments on multiple and mixed degradations demonstrate that the proposed universal image restoration method not only achieves higher fidelity and perceptual image quality but also has better generalization ability than other unified image restoration models. Our code is available at this https URL.
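A hedged sketch of what a similarity-weighted LoRA composition step could look like: the cosine similarity, softmax blending, and embedding names are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def compose_loras(W0, loras, task_embs, input_emb, alpha=1.0):
    """Weight each trained adapter (A_i, B_i) by the similarity between the input's
    degradation embedding and that adapter's task embedding, then add the blended
    low-rank update to the frozen base weight W0.
    """
    sims = np.array([input_emb @ e / (np.linalg.norm(input_emb) * np.linalg.norm(e))
                     for e in task_embs])
    w = softmax(sims)                                   # adaptive combination weights
    delta = sum(wi * (B @ A) for wi, (A, B) in zip(w, loras))
    return W0 + alpha * delta

# Toy usage: a 16x16 base weight and three rank-2 adapters for three degradations.
rng = np.random.default_rng(0)
W0 = rng.standard_normal((16, 16))
loras = [(rng.standard_normal((2, 16)), rng.standard_normal((16, 2))) for _ in range(3)]
task_embs = [rng.standard_normal(8) for _ in range(3)]
print(compose_loras(W0, loras, task_embs, rng.standard_normal(8)).shape)
```

Mixed degradations are handled by construction: an input that resembles several training degradations simply receives a blend of the corresponding adapters rather than a single one.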
https://arxiv.org/abs/2409.20197
In this work, we share the insights for achieving state-of-the-art quality in our text-to-image anime image generative model, called Illustrious. To achieve high resolution, dynamic color range images, and high restoration ability, we focus on three critical approaches for model improvement. First, we delve into the significance of the batch size and dropout control, which enables faster learning of controllable token-based concept activations. Second, we increase the training resolution of images, which improves the accurate depiction of character anatomy at much higher resolutions and extends the generation capability to over 20MP with proper methods. Finally, we propose refined multi-level captions, covering all tags and various natural language captions, as a critical factor for model development. Through extensive analysis and experiments, Illustrious demonstrates state-of-the-art performance in terms of animation style, outperforming widely-used models in illustration domains, and facilitating easier customization and personalization thanks to its open-source nature. We plan to publicly release the updated Illustrious model series sequentially, along with sustainable plans for further improvements.
https://arxiv.org/abs/2409.19946
To bridge the gap between artists and non-specialists, we present a unified framework, Neural-Polyptych, to facilitate the creation of expansive, high-resolution paintings by seamlessly incorporating interactive hand-drawn sketches with fragments from original paintings. We have designed a multi-scale GAN-based architecture to decompose the generation process into two parts, each responsible for identifying global and local features. To enhance the fidelity of semantic details generated from users' sketched outlines, we introduce a Correspondence Attention module utilizing our Reference Bank strategy. This ensures the creation of high-quality, intricately detailed elements within the artwork. The final result is achieved by carefully blending these local elements while preserving coherent global consistency. Consequently, this methodology enables the production of digital paintings at megapixel scale, accommodating diverse artistic expressions and enabling users to recreate content in a controlled manner. We validate our approach on diverse genres of both Eastern and Western paintings. Applications such as large painting extension, texture shuffling, genre switching, mural art restoration, and recomposition can be successfully built on our framework.
https://arxiv.org/abs/2409.19690
All-in-one image restoration aims to handle multiple degradation types using one model. This paper proposes a simple pipeline for all-in-one blind image restoration to Restore Anything with Masks (RAM). We focus on the image content by utilizing Mask Image Modeling to extract intrinsic image information rather than distinguishing degradation types like other methods. Our pipeline consists of two stages: masked image pre-training and fine-tuning with mask attribute conductance. We design a straightforward masking pre-training approach specifically tailored for all-in-one image restoration. This approach enhances networks to prioritize the extraction of image content priors from various degradations, resulting in a more balanced performance across different restoration tasks and achieving stronger overall results. To bridge the gap of input integrity while preserving learned image priors as much as possible, we selectively fine-tune a small portion of the layers. Specifically, the importance of each layer is ranked by the proposed Mask Attribute Conductance (MAC), and the layers with higher contributions are selected for fine-tuning. Extensive experiments demonstrate that our method achieves state-of-the-art performance. Our code and model will be released at this https URL.
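A minimal sketch of the selective fine-tuning step: layers are ranked by an importance score (standing in for the paper's Mask Attribute Conductance, which is not reproduced here) and only the top-scoring fraction is marked trainable.

```python
import numpy as np

def select_layers_for_finetuning(importance_scores, budget=0.2):
    """Rank layers by an importance score and mark only the highest-scoring fraction
    as trainable, freezing the rest to preserve the image priors learned during
    masked pre-training. The scores and budget here are illustrative.
    """
    scores = np.asarray(importance_scores, dtype=float)
    k = max(1, int(round(budget * len(scores))))
    order = np.argsort(scores)[::-1]            # layers sorted by decreasing score
    trainable = set(order[:k].tolist())
    return [i in trainable for i in range(len(scores))]

# Toy usage: 10 layers, fine-tune roughly the top 20% by score.
scores = [0.03, 0.41, 0.07, 0.55, 0.02, 0.11, 0.29, 0.05, 0.62, 0.09]
print(select_layers_for_finetuning(scores))
```

Keeping most layers frozen is what lets the model reconcile full (unmasked) inputs at fine-tuning time with the content priors learned from masked pre-training.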
https://arxiv.org/abs/2409.19403
Images captured in challenging environments--such as nighttime, foggy, rainy weather, and underwater--often suffer from significant degradation, resulting in a substantial loss of visual quality. Effective restoration of these degraded images is critical for the subsequent vision tasks. While many existing approaches have successfully incorporated specific priors for individual tasks, these tailored solutions limit their applicability to other degradations. In this work, we propose a universal network architecture, dubbed "ReviveDiff", which can address a wide range of degradations and bring images back to life by enhancing and restoring their quality. Our approach is inspired by the observation that, unlike degradation caused by movement or electronic issues, quality degradation under adverse conditions primarily stems from natural media (such as fog, water, and low luminance), which generally preserves the original structures of objects. To restore the quality of such images, we leveraged the latest advancements in diffusion models and developed ReviveDiff to restore image quality at both the macro and micro levels across key factors determining image quality, such as sharpness, distortion, noise level, dynamic range, and color accuracy. We rigorously evaluated ReviveDiff on seven benchmark datasets covering five types of degrading conditions: Rainy, Underwater, Low-light, Smoke, and Nighttime Hazy. Our experimental results demonstrate that ReviveDiff outperforms the state-of-the-art methods both quantitatively and visually.
https://arxiv.org/abs/2409.18932
Multiple low-level vision tasks such as denoising, deblurring and super-resolution depart from RGB images and further reduce the degradations, improving the quality. However, modeling the degradations in the sRGB domain is complicated because of the Image Signal Processor (ISP) transformations. Despite this known issue, very few methods in the literature work directly with sensor RAW images. In this work we tackle image restoration directly in the RAW domain. We design a new realistic degradation pipeline for training deep blind RAW restoration models. Our pipeline considers realistic sensor noise, motion blur, camera shake, and other common degradations. The models trained with our pipeline and data from multiple sensors can successfully reduce noise and blur, and recover details in RAW images captured from different cameras. To the best of our knowledge, this is the most exhaustive analysis on RAW image restoration. Code available at this https URL
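A toy stand-in for such a degradation pipeline, assuming a single-channel RAW plane and illustrative parameters: a box blur approximating motion/defocus blur followed by heteroscedastic sensor noise (signal-dependent shot noise plus constant read noise). The real pipeline additionally models camera shake and other degradations.

```python
import numpy as np

def degrade_raw(raw, blur_size=5, shot=0.01, read=0.002, seed=0):
    """Apply a crude blur model and a Gaussian approximation of shot + read noise
    to a linear RAW plane in [0, 1]; parameters are illustrative."""
    rng = np.random.default_rng(seed)
    # Separable box blur as a crude blur model
    k = np.ones(blur_size) / blur_size
    blurred = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, raw)
    blurred = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, blurred)
    # Heteroscedastic Gaussian noise: variance grows with the signal
    sigma = np.sqrt(shot * np.clip(blurred, 0, None) + read**2)
    noisy = blurred + sigma * rng.standard_normal(raw.shape)
    return np.clip(noisy, 0.0, 1.0)

clean = np.random.default_rng(1).random((64, 64))   # fake single-channel RAW plane
print(degrade_raw(clean).shape)
```

Working in linear RAW keeps the noise model physically meaningful, which is exactly what the ISP's nonlinear transformations destroy in the sRGB domain.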
https://arxiv.org/abs/2409.18204
Diffusion-based image super-resolution (SR) models have attracted substantial interest due to their powerful image restoration capabilities. However, prevailing diffusion models often struggle to strike an optimal balance between efficiency and performance. Typically, they either neglect to exploit the potential of existing extensive pretrained models, limiting their generative capacity, or they necessitate dozens of forward passes starting from random noise, compromising inference efficiency. In this paper, we present DoSSR, a Domain Shift diffusion-based SR model that capitalizes on the generative powers of pretrained diffusion models while significantly enhancing efficiency by initiating the diffusion process with low-resolution (LR) images. At the core of our approach is a domain shift equation that integrates seamlessly with existing diffusion models. This integration not only improves the use of the diffusion prior but also boosts inference efficiency. Moreover, we advance our method by transitioning the discrete shift process to a continuous formulation, termed DoS-SDEs. This advancement leads to fast and customized solvers that further enhance sampling efficiency. Empirical results demonstrate that our proposed method achieves state-of-the-art performance on synthetic and real-world datasets, while notably requiring only 5 sampling steps. Compared to previous diffusion prior based methods, our approach achieves a remarkable speedup of 5-7 times, demonstrating its superior efficiency. Code: this https URL.
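An illustration of why starting from the LR image instead of pure noise needs so few steps: perturb the (upsampled) LR input to an intermediate noise level and run a handful of denoising steps. The denoiser is a stub, and the simple DDIM-style update below is an assumption, not the paper's domain-shift equation or DoS-SDE solver.

```python
import numpy as np

# Placeholder denoiser standing in for a pretrained diffusion prior.
def denoiser(x, sigma):
    return x  # a trained model would predict the clean image here

def lr_initialized_sampling(lr_up, n_steps=5, t_start=0.4, seed=0):
    """Start the reverse process from the low-resolution input at an intermediate
    noise level and take only a few denoising steps."""
    rng = np.random.default_rng(seed)
    sigma = t_start
    x = lr_up + sigma * rng.standard_normal(lr_up.shape)   # shift into the noisy domain
    for k in range(n_steps):
        sigma_next = sigma * (n_steps - 1 - k) / n_steps
        x0_hat = denoiser(x, sigma)                         # predicted clean image
        # simple deterministic update toward the prediction
        x = x0_hat + (sigma_next / max(sigma, 1e-8)) * (x - x0_hat)
        sigma = sigma_next
    return x

print(lr_initialized_sampling(np.zeros((32, 32))).shape)
```

Because the LR image already carries most of the low-frequency content, the prior only has to travel a short distance in noise level, which is where the 5-step budget becomes plausible.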
https://arxiv.org/abs/2409.17778
The increasing demand for augmented reality (AR) and virtual reality (VR) applications highlights the need for efficient depth information processing. Depth maps, essential for rendering realistic scenes and supporting advanced functionalities, are typically large and challenging to stream efficiently due to their size. This challenge introduces a focus on developing innovative depth upsampling techniques to reconstruct high-quality depth maps from compressed data. These techniques are crucial for overcoming the limitations posed by depth compression, which often degrades quality, loses scene details and introduces artifacts. By enhancing depth upsampling methods, this challenge aims to improve the efficiency and quality of depth map reconstruction. Our goal is to advance the state-of-the-art in depth processing technologies, thereby enhancing the overall user experience in AR and VR applications.
https://arxiv.org/abs/2409.16277
This paper proposes a generative pretraining foundation model for high-quality speech restoration tasks. By directly operating on complex-valued short-time Fourier transform coefficients, our model does not rely on any vocoders for time-domain signal reconstruction. As a result, our model simplifies the synthesis process and removes the quality upper-bound introduced by any mel-spectrogram vocoder compared to prior work SpeechFlow. The proposed method is evaluated on multiple speech restoration tasks, including speech denoising, bandwidth extension, codec artifact removal, and target speaker extraction. In all scenarios, finetuning our pretrained model results in superior performance over strong baselines. Notably, in the target speaker extraction task, our model outperforms existing systems, including those leveraging SSL-pretrained encoders like WavLM. The code and the pretrained checkpoints are publicly available in the NVIDIA NeMo framework.
https://arxiv.org/abs/2409.16117
Recent advancements in adverse weather restoration have shown potential, yet the unpredictable and varied combinations of weather degradations in the real world pose significant challenges. Previous methods typically struggle with dynamically handling intricate degradation combinations and carrying out background reconstruction precisely, leading to performance and generalization limitations. Drawing inspiration from prompt learning and the "Teaching Tailored to Talent" concept, we introduce a novel pipeline, T3-DiffWeather. Specifically, we employ a prompt pool that allows the network to autonomously combine sub-prompts to construct weather-prompts, harnessing the necessary attributes to adaptively tackle unforeseen weather input. Moreover, from a scene modeling perspective, we incorporate general prompts constrained by Depth-Anything features to provide the scene-specific condition for the diffusion process. Furthermore, by incorporating a contrastive prompt loss, we ensure distinctive representations for both types of prompts by a mutual pushing strategy. Experimental results demonstrate that our method achieves state-of-the-art performance across various synthetic and real-world datasets, markedly outperforming existing diffusion techniques in terms of computational efficiency.
https://arxiv.org/abs/2409.15739
This paper presents a versatile image-to-image visual assistant, PixWizard, designed for image generation, manipulation, and translation based on free-form language instructions. To this end, we cast a variety of vision tasks into a unified image-text-to-image generation framework and curate an Omni Pixel-to-Pixel Instruction-Tuning Dataset. By constructing detailed instruction templates in natural language, we comprehensively include a large set of diverse vision tasks such as text-to-image generation, image restoration, image grounding, dense image prediction, image editing, controllable generation, inpainting/outpainting, and more. Furthermore, we adopt Diffusion Transformers (DiT) as our foundation model and extend its capabilities with a flexible any-resolution mechanism, enabling the model to dynamically process images based on the aspect ratio of the input, closely aligning with human perceptual processes. The model also incorporates structure-aware and semantic-aware guidance to facilitate effective fusion of information from the input image. Our experiments demonstrate that PixWizard not only shows impressive generative and understanding abilities for images with diverse resolutions but also exhibits promising generalization capabilities with unseen tasks and human instructions. The code and related resources are available at this https URL
https://arxiv.org/abs/2409.15278
Depth sensing is an important problem for 3D vision-based robotics. Yet, a real-world active stereo or ToF depth camera often produces noisy and incomplete depth which bottlenecks robot performance. In this work, we propose D3RoMa, a learning-based depth estimation framework on stereo image pairs that predicts clean and accurate depth in diverse indoor scenes, even in the most challenging scenarios with translucent or specular surfaces where classical depth sensing completely fails. Key to our method is that we unify depth estimation and restoration into an image-to-image translation problem by predicting the disparity map with a denoising diffusion probabilistic model. At inference time, we further incorporate a left-right consistency constraint as classifier guidance to the diffusion process. Our framework combines recently advanced learning-based approaches and geometric constraints from traditional stereo vision. For model training, we create a large scene-level synthetic dataset with diverse transparent and specular objects to compensate for existing tabletop datasets. The trained model can be directly applied to real-world in-the-wild scenes and achieve state-of-the-art performance in multiple public depth estimation benchmarks. Further experiments in real environments show that accurate depth prediction significantly improves robotic manipulation in various scenarios.
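A rough illustration of using left-right consistency as classifier-style guidance on a disparity estimate: the current sample is nudged down the gradient of a photometric consistency cost. The integer warp and the shared finite-difference gradient are crude simplifications; a real implementation would use a sub-pixel differentiable warp and autodiff inside each diffusion step.

```python
import numpy as np

def warp_right_to_left(right, disparity):
    """Shift each pixel of the right image by its (rounded) disparity; a toy warp."""
    h, w = right.shape
    cols = np.clip(np.arange(w)[None, :] - np.round(disparity).astype(int), 0, w - 1)
    return np.take_along_axis(right, cols, axis=1)

def consistency_cost(disparity, left, right):
    """Left-right photometric consistency: the warped right image should match the left."""
    return float(np.mean((warp_right_to_left(right, disparity) - left) ** 2))

def guided_step(disparity, left, right, scale=0.5, eps=0.5):
    """Nudge the current disparity estimate down a (finite-difference) gradient of the
    stereo consistency cost, in the spirit of classifier guidance."""
    base = consistency_cost(disparity, left, right)
    bumped = consistency_cost(disparity + eps, left, right)
    grad = np.full_like(disparity, (bumped - base) / eps)   # crude shared gradient
    return disparity - scale * grad

left = np.random.default_rng(0).random((8, 16))
right = np.roll(left, -2, axis=1)                 # true disparity of roughly 2 pixels
disp = np.full_like(left, 1.0)
print(guided_step(disp, left, right).shape)
```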
https://arxiv.org/abs/2409.14365
In this paper, we introduce a zero-shot Voice Transfer (VT) module that can be seamlessly integrated into a multi-lingual Text-to-speech (TTS) system to transfer an individual's voice across languages. Our proposed VT module comprises a speaker-encoder that processes reference speech, a bottleneck layer, and residual adapters, connected to preexisting TTS layers. We compare the performance of various configurations of these components and report Mean Opinion Score (MOS) and Speaker Similarity across languages. Using a single English reference speech per speaker, we achieve an average voice transfer similarity score of 73% across nine target languages. Vocal characteristics contribute significantly to the construction and perception of individual identity. The loss of one's voice, due to physical or neurological conditions, can lead to a profound sense of loss, impacting one's core identity. As a case study, we demonstrate that our approach can not only transfer typical speech but also restore the voices of individuals with dysarthria, even when only atypical speech samples are available - a valuable utility for those who have never had typical speech or banked their voice. Cross-lingual typical audio samples, plus videos demonstrating voice restoration for dysarthric speakers are available here (this http URL).
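A minimal sketch of the residual-adapter idea mentioned above, assuming a plain bottleneck MLP added to a frozen layer's activation; the actual adapter design, conditioning on the speaker embedding, and layer placement in the TTS stack are not specified here.

```python
import numpy as np

def residual_adapter(h, W_down, W_up):
    """Small bottleneck projection whose output is added back to the frozen layer's
    activation, so the base model is untouched when the adapter weights are zero."""
    z = np.maximum(0.0, h @ W_down)     # down-project to the bottleneck, ReLU
    return h + z @ W_up                 # up-project and add the residual

# Toy usage: a 256-dim hidden activation with a 16-dim bottleneck.
rng = np.random.default_rng(0)
h = rng.standard_normal((4, 256))                  # (frames, hidden)
W_down = rng.standard_normal((256, 16)) * 0.01
W_up = np.zeros((16, 256))                         # zero-init keeps the base TTS output
print(np.allclose(residual_adapter(h, W_down, W_up), h))   # True at initialisation
```

The zero-initialized up-projection means the pretrained TTS behaviour is preserved exactly at the start of adapter training, which is the usual motivation for this design.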
https://arxiv.org/abs/2409.13910
This article presents an experiment in fine-tuning a pretrained causal language model (Meta's Llama 3.1 8B Instruct) for aiding in three fundamental tasks of philological research: chronological and geographic attribution as well as text restoration in ancient Greek inscriptions and documentary papyri. Using a prompt-based instruct approach, the fine-tuned models surpass the state of the art in key metrics. For inscriptions, the models achieve a lower average character error rate (CER) of 22.5% (vs. 26.3%), while closely matching top-1 accuracy (60.9% vs. 61.8%) and top-20 accuracy (77.5% vs. 78.3%) for sequences up to 10 characters. They also provide a practical advantage by ignoring spaces during reconstruction, aligning better with the scriptio continua typically used in ancient written artifacts. In geographic attribution, the model outperforms previous benchmarks with a top-1 accuracy of 75.0% (vs. 70.8%) and a top-3 accuracy of 83.7% (vs. 82.1%). For dating, it achieves an average deviation of 26.2 years (vs. 29.3) and a median deviation of 1 year (vs. 3) from the actual date range. The models also set new baselines for documentary papyri, with a CER of 16.3%, a top-1 accuracy of 71.3%, and top-20 of 85.0% in text reconstruction; a top-1 accuracy of 66.4% and top-3 of 79.9% in geographic attribution; and, in chronological attribution, a deviation of 21.7 years from the actual termini post/ante quem, with a median deviation of 0 years.
https://arxiv.org/abs/2409.13870