Diffusion-based extreme image compression methods have achieved impressive performance at extremely low bitrates. However, constrained by the iterative denoising process that starts from pure noise, these methods are limited in both fidelity and efficiency. To address these two issues, we present Relay Residual Diffusion Extreme Image Compression (RDEIC), which leverages compressed feature initialization and residual diffusion. Specifically, we first use the image's compressed latent features with added noise, rather than pure noise, as the starting point, eliminating the unnecessary initial stages of the denoising process. Second, we design a novel relay residual diffusion that reconstructs the raw image by iteratively removing the added noise and the residual between the compressed and target latent features. Notably, our relay residual diffusion network seamlessly integrates pre-trained Stable Diffusion to leverage its robust generative capability for high-quality reconstruction. Third, we propose a fixed-step fine-tuning strategy to eliminate the discrepancy between the training and inference phases, further improving the reconstruction quality. Extensive experiments demonstrate that the proposed RDEIC achieves state-of-the-art visual quality and outperforms existing diffusion-based extreme image compression methods in both fidelity and efficiency. The source code will be provided at this https URL.
https://arxiv.org/abs/2410.02640
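The key departure from standard diffusion-based compression is where the reverse process starts. A minimal sketch of the compressed-feature initialization described above (not the authors' code; the names, the intermediate step index, and the DDPM-style noise schedule are assumptions):

```python
import torch

def relay_initialization(z_compressed, alphas_cumprod, t_start):
    """Start denoising from a noised compressed latent instead of pure noise.

    z_compressed   : compressed latent features decoded from the bitstream.
    alphas_cumprod : cumulative products of the noise schedule (length T).
    t_start        : intermediate timestep (< T) at which denoising begins,
                     so the early, nearly uninformative steps are skipped.
    """
    abar = alphas_cumprod[t_start]
    noise = torch.randn_like(z_compressed)
    # Same forward-noising form as DDPM, applied to the compressed latent.
    x_t = abar.sqrt() * z_compressed + (1.0 - abar).sqrt() * noise
    return x_t
```

In the relay view, the remaining steps then remove both the added noise and the residual between the compressed and target latents, rather than synthesizing the image from scratch.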
The total variation (TV) method is an image denoising technique that aims to reduce noise by minimizing the total variation of the image, which measures the variation in pixel intensities. The TV method has been widely applied in image processing and computer vision for its ability to preserve edges and enhance image quality. In this paper, we propose an improved TV model for image denoising, together with the associated numerical algorithm to carry out the procedure, which is particularly effective in removing several types of noise and their combinations. Our improved model admits a unique solution and the associated numerical algorithm guarantees convergence. Numerical experiments demonstrate improved effectiveness and denoising quality compared with other TV models. Such encouraging results further enhance the utility of the TV method in image processing.
https://arxiv.org/abs/2410.02587
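As background for the model being improved, classical (ROF-style) TV denoising minimizes a data-fidelity term plus the total variation of the image. A small smoothed-TV gradient-descent sketch (illustrative only; the paper's improved model and algorithm differ):

```python
import numpy as np

def tv_denoise(f, lam=0.1, tau=0.125, eps=1e-6, n_iter=200):
    """Gradient descent on  lam/2 * ||u - f||^2 + sum sqrt(|grad u|^2 + eps)."""
    u = f.copy()
    for _ in range(n_iter):
        # Forward differences with periodic boundary handling via np.roll.
        ux = np.roll(u, -1, axis=1) - u
        uy = np.roll(u, -1, axis=0) - u
        mag = np.sqrt(ux**2 + uy**2 + eps)
        px, py = ux / mag, uy / mag
        # Divergence of (px, py) via backward differences.
        div = (px - np.roll(px, 1, axis=1)) + (py - np.roll(py, 1, axis=0))
        u = u - tau * (lam * (u - f) - div)
    return u
```

Calling tv_denoise(noisy_image) on a float-valued 2D array returns a smoothed image with edges largely preserved; lam trades off fidelity against smoothing strength.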
Denoising is one of the fundamental steps of the processing pipeline that converts data captured by a camera sensor into a display-ready image or video. It is generally performed early in the pipeline, usually before demosaicking, although studies swapping their order or even conducting them jointly have been proposed. With the advent of deep learning, the quality of denoising algorithms has steadily increased. Even so, modern neural networks still have a hard time adapting to new noise levels and scenes, which is indispensable for real-world applications. With this in mind, we propose a self-similarity-based denoising scheme that weights both a pre- and a post-demosaicking denoiser for Bayer-patterned CFA video data. We show that a balance between the two leads to better image quality, and we empirically find that higher noise levels benefit from a stronger pre-demosaicking influence. We also integrate temporal trajectory prefiltering steps before each denoiser, which further improve texture reconstruction. The proposed method only requires an estimation of the noise model at the sensor, accurately adapts to any noise level, and is competitive with the state of the art, making it suitable for real-world videography.
https://arxiv.org/abs/2410.02572
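The central design choice is how much denoising happens before versus after demosaicking, with the weight tied to the estimated sensor noise level. A schematic sketch of that blending (the function names and the linear weighting rule are assumptions, not the paper's exact scheme):

```python
def denoise_video_frame(raw_bayer, sigma, denoise_cfa, demosaick, denoise_rgb,
                        sigma_lo=2.0, sigma_hi=20.0):
    """Blend a pre-demosaicking and a post-demosaicking denoiser.

    raw_bayer   : Bayer-patterned CFA frame (after trajectory prefiltering).
    sigma       : estimated noise level at the sensor.
    denoise_cfa : denoiser operating on CFA data.
    demosaick   : demosaicking operator.
    denoise_rgb : denoiser operating on RGB data.
    """
    # Higher noise -> more weight on the pre-demosaicking denoiser.
    w = min(max((sigma - sigma_lo) / (sigma_hi - sigma_lo), 0.0), 1.0)

    pre_path = demosaick(denoise_cfa(raw_bayer, sigma))   # denoise, then demosaick
    post_path = denoise_rgb(demosaick(raw_bayer), sigma)  # demosaick, then denoise
    return w * pre_path + (1.0 - w) * post_path
```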
Customized Image Generation, i.e., generating customized images with user-specified concepts, has attracted significant attention due to its creativity and novelty. With impressive progress achieved in subject customization, some pioneering works have further explored the customization of actions and interactions beyond entity (i.e., human, animal, and object) appearance. However, these approaches only focus on basic actions and interactions between two entities, and their effectiveness is limited by the scarcity of "exactly the same" reference images. To extend customized image generation to more complex scenes for general real-world applications, we propose a new task: event-customized image generation. Given a single reference image, we define the "event" as all specific actions, poses, relations, or interactions between different entities in the scene. This task aims at accurately capturing the complex event and generating customized images with various target entities. To solve this task, we propose a novel training-free event customization method: FreeEvent. Specifically, FreeEvent introduces two extra paths alongside the general diffusion denoising process: 1) Entity switching path: it applies cross-attention guidance and regulation for target entity generation. 2) Event transferring path: it injects the spatial features and self-attention maps from the reference image into the target image for event generation. To further facilitate this new task, we collect two evaluation benchmarks: SWiG-Event and Real-Event. Extensive experiments and ablations demonstrate the effectiveness of FreeEvent.
https://arxiv.org/abs/2410.02483
In this paper, we introduce Plug-and-Play (PnP) Flow Matching, an algorithm for solving imaging inverse problems. PnP methods leverage the strength of pre-trained denoisers, often deep neural networks, by integrating them into optimization schemes. While they achieve state-of-the-art performance on various inverse problems in imaging, PnP approaches face inherent limitations on more generative tasks like inpainting. On the other hand, generative models such as Flow Matching have pushed the boundary in image sampling yet lack a clear method for efficient use in image restoration. We propose to combine the PnP framework with Flow Matching (FM) by defining a time-dependent denoiser using a pre-trained FM model. Our algorithm alternates between gradient descent steps on the data-fidelity term, reprojections onto the learned FM path, and denoising. Notably, our method is computationally efficient and memory-friendly, as it avoids backpropagation through ODEs and trace computations. We evaluate its performance on denoising, super-resolution, deblurring, and inpainting tasks, demonstrating superior results compared to existing PnP algorithms and Flow Matching based state-of-the-art methods.
https://arxiv.org/abs/2410.02423
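The algorithm alternates three operations per iteration. A compact sketch under stated assumptions (a pre-trained FM velocity field `fm_model` transporting noise at t=0 to images at t=1, a linear degradation operator, and a simple time schedule; none of these names come from the paper's code):

```python
import torch

def pnp_flow_matching(y, A, At, fm_model, x_init, n_iter=20, n_ode=5, step=1.0):
    """Plug-and-Play reconstruction with a Flow Matching prior (illustrative sketch).

    y        : observed measurements.
    A, At    : degradation operator and its adjoint.
    fm_model : velocity field v(x, t) of a pre-trained flow-matching model.
    """
    x = x_init.clone()
    for k in range(n_iter):
        t = torch.tensor(k / n_iter)            # current position on the FM path
        # 1) Gradient step on the data-fidelity term 0.5 * ||A x - y||^2.
        x = x - step * At(A(x) - y)
        # 2) Reproject onto the learned FM path by re-noising towards time t.
        z = torch.randn_like(x)
        x_t = (1.0 - t) * z + t * x
        # 3) Denoise: integrate the velocity field from t to 1 with Euler steps.
        dt = (1.0 - t) / n_ode
        for i in range(n_ode):
            x_t = x_t + dt * fm_model(x_t, t + i * dt)
        x = x_t
    return x
```

Because each iteration only evaluates the velocity field forward in time, no backpropagation through the ODE solver is needed, which is what keeps the scheme memory-friendly.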
Unrestricted adversarial attacks typically manipulate the semantic content of an image (e.g., color or texture) to create adversarial examples that are both effective and photorealistic. Recent works have utilized the diffusion inversion process to map images into a latent space, where high-level semantics are manipulated by introducing perturbations. However, these approaches often result in substantial semantic distortions in the denoised output and suffer from low efficiency. In this study, we propose a novel framework called Semantic-Consistent Unrestricted Adversarial Attacks (SCA), which employs an inversion method to extract edit-friendly noise maps and utilizes a Multimodal Large Language Model (MLLM) to provide semantic guidance throughout the process. Under the condition of rich semantic information provided by the MLLM, we perform each step of the DDPM denoising process using a series of edit-friendly noise maps, and leverage DPM Solver++ to accelerate this process, enabling efficient sampling with semantic consistency. Compared to existing methods, our framework enables the efficient generation of adversarial examples that exhibit minimal discernible semantic changes. Consequently, we for the first time introduce Semantic-Consistent Adversarial Examples (SCAE). Extensive experiments and visualizations demonstrate the high efficiency of SCA; in particular, it is on average 12 times faster than state-of-the-art attacks. Our code can be found at this https URL.
https://arxiv.org/abs/2410.02240
We introduce MDSGen, a novel framework for vision-guided open-domain sound generation optimized for model parameter size, memory consumption, and inference speed. This framework incorporates two key innovations: (1) a redundant video feature removal module that filters out unnecessary visual information, and (2) a temporal-aware masking strategy that leverages temporal context for enhanced audio generation accuracy. In contrast to existing resource-heavy Unet-based models, MDSGen employs denoising masked diffusion transformers, facilitating efficient generation without reliance on pre-trained diffusion models. Evaluated on the benchmark VGGSound dataset, our smallest model (5M parameters) achieves 97.9% alignment accuracy, using 172x fewer parameters, 371% less memory, and offering 36x faster inference than the current 860M-parameter state-of-the-art model (93.9% accuracy). The larger model (131M parameters) reaches nearly 99% accuracy while requiring 6.5x fewer parameters. These results highlight the scalability and effectiveness of our approach.
https://arxiv.org/abs/2410.02130
We introduce FabricDiffusion, a method for transferring fabric textures from a single clothing image to 3D garments of arbitrary shapes. Existing approaches typically synthesize textures on the garment surface through 2D-to-3D texture mapping or depth-aware inpainting via generative models. Unfortunately, these methods often struggle to capture and preserve texture details, particularly due to challenging occlusions, distortions, or poses in the input image. Inspired by the observation that in the fashion industry, most garments are constructed by stitching sewing patterns with flat, repeatable textures, we cast the task of clothing texture transfer as extracting distortion-free, tileable texture materials that are subsequently mapped onto the UV space of the garment. Building upon this insight, we train a denoising diffusion model with a large-scale synthetic dataset to rectify distortions in the input texture image. This process yields a flat texture map that enables a tight coupling with existing Physically-Based Rendering (PBR) material generation pipelines, allowing for realistic relighting of the garment under various lighting conditions. We show that FabricDiffusion can transfer various features from a single clothing image including texture patterns, material properties, and detailed prints and logos. Extensive experiments demonstrate that our model significantly outperforms state-of-the-art methods on both synthetic data and real-world, in-the-wild clothing images while generalizing to unseen textures and garment shapes.
https://arxiv.org/abs/2410.01801
Diffusion Transformers (DiTs) have gained prominence for outstanding scalability and extraordinary performance in generative tasks. However, their considerable inference costs impede practical deployment. The feature cache mechanism, which involves storing and retrieving redundant computations across timesteps, holds promise for reducing per-step inference time in diffusion models. Most existing caching methods for DiT are manually designed. Although the learning-based approach attempts to optimize strategies adaptively, it suffers from discrepancies between training and inference, which hamper both the performance and the acceleration ratio. Upon detailed analysis, we pinpoint that these discrepancies primarily stem from two aspects: (1) Prior Timestep Disregard, where training ignores the effect of cache usage at earlier timesteps, and (2) Objective Mismatch, where the training target (aligning the predicted noise at each timestep) deviates from the goal of inference (generating a high-quality image). To alleviate these discrepancies, we propose HarmoniCa, a method that Harmonizes training and inference with a novel learning-based Caching framework built upon Step-Wise Denoising Training (SDT) and an Image Error Proxy-Guided Objective (IEPO). Compared to the traditional training paradigm, the newly proposed SDT maintains the continuity of the denoising process, enabling the model to leverage information from prior timesteps during training, similar to the way it operates during inference. Furthermore, we design IEPO, which integrates an efficient proxy mechanism to approximate the final image error caused by reusing cached features. Therefore, IEPO helps balance final image quality and cache utilization, resolving the issue of training that only considers the impact of cache usage on the predicted output at each timestep.
https://arxiv.org/abs/2410.01723
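At inference, the caching mechanism decides per timestep whether to recompute a block's features or reuse previously stored ones. A minimal, framework-agnostic sketch of such a cached block call (the fixed reuse schedule shown here is a stand-in, not HarmoniCa's learned policy):

```python
class CachedBlock:
    """Wrap a transformer block so its output can be reused across timesteps."""

    def __init__(self, block, reuse_schedule):
        self.block = block                    # callable: (x, t) -> features
        self.reuse_schedule = reuse_schedule  # reuse_schedule[t] == True -> use cache
        self.cache = None

    def __call__(self, x, t):
        if self.cache is not None and self.reuse_schedule[t]:
            return self.cache                 # skip computation at this timestep
        out = self.block(x, t)                # recompute and refresh the cache
        self.cache = out
        return out
```

SDT keeps this step-by-step reuse in the training loop so that training sees cached features exactly as inference will, and IEPO weights the objective by a proxy for the final-image error that reuse introduces.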
Classical generative diffusion models learn an isotropic Gaussian denoising process, treating all spatial regions uniformly and thus neglecting potentially valuable structural information in the data. Inspired by the long-established work on anisotropic diffusion in image processing, we present a novel edge-preserving diffusion model that is a generalization of denoising diffusion probabilistic models (DDPM). In particular, we introduce an edge-aware noise scheduler that varies between edge-preserving and isotropic Gaussian noise. We show that our model's generative process converges faster to results that more closely match the target distribution. We demonstrate its capability to better learn the low-to-mid frequencies within the dataset, which play a crucial role in representing shapes and structural information. Our edge-preserving diffusion process consistently outperforms state-of-the-art baselines in unconditional image generation. It is also more robust for generative tasks guided by a shape-based prior, such as stroke-to-image generation. We present qualitative and quantitative results showing consistent improvements (in FID score) of up to 30% for both tasks.
https://arxiv.org/abs/2410.01540
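One way to read the edge-aware scheduler is as a per-pixel noise scale that attenuates noise near detected edges and relaxes towards isotropic Gaussian noise as the process progresses. A hedged sketch of that interpolation (the Sobel edge detector, the schedule shape, and the mixing rule are assumptions made for illustration, not the paper's exact scheduler):

```python
import torch
import torch.nn.functional as F

def edge_aware_noise(x0, t, T, lam=0.8):
    """Sample noise whose per-pixel scale is reduced near edges of x0.

    x0 : clean image batch (B, C, H, W);  t : current timestep;  T : total steps.
    """
    # Simple Sobel edge magnitude as the edge map (stand-in for any detector).
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3).to(x0)
    ky = kx.transpose(2, 3)
    gray = x0.mean(dim=1, keepdim=True)
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    edges = torch.sqrt(gx**2 + gy**2)
    edges = edges / (edges.amax(dim=(2, 3), keepdim=True) + 1e-8)

    # Interpolate from edge-preserving (early t) to isotropic (late t).
    mix = t / T                               # 0 at the start, 1 at the end
    scale = 1.0 - lam * (1.0 - mix) * edges   # lower noise on edges early on
    return scale * torch.randn_like(x0)
```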
Audio-driven talking head generation is a pivotal area within film-making and Virtual Reality. Although existing methods have made significant strides following the end-to-end paradigm, they still encounter challenges in producing videos with high-frequency details due to their limited expressivity in this domain. This limitation has prompted us to explore an effective post-processing approach to synthesize photo-realistic talking head videos. Specifically, we employ a pretrained Wav2Lip model as our foundation model, leveraging its robust audio-lip alignment capabilities. Drawing on the theory of Lipschitz Continuity, we have theoretically established the noise robustness of Vector Quantised Auto Encoders (VQAEs). Our experiments further demonstrate that the high-frequency texture deficiency of the foundation model can be temporally consistently recovered by the Space-Optimised Vector Quantised Auto Encoder (SOVQAE) we introduced, thereby facilitating the creation of realistic talking head videos. We conduct experiments on both the conventional dataset and the High-Frequency TalKing head (HFTK) dataset that we curated. The results indicate that our method, LaDTalk, achieves new state-of-the-art video quality and out-of-domain lip synchronization performance.
https://arxiv.org/abs/2410.00990
Understanding local risks from extreme rainfall, such as flooding, requires both long records (to sample rare events) and high-resolution products (to assess localized hazards). Unfortunately, there is a dearth of long-record and high-resolution products that can be used to understand local risk and precipitation science. In this paper, we present a novel generative diffusion model that downscales (super-resolves) globally available Climate Prediction Center (CPC) gauge-based precipitation products and ERA5 reanalysis data to generate kilometer-scale precipitation estimates. Downscaling gauge-based precipitation from 55 km to 1 km while recovering extreme rainfall signals poses significant challenges. To enforce our model (named WassDiff) to produce well-calibrated precipitation intensity values, we introduce a Wasserstein Distance Regularization (WDR) term for the score-matching training objective in the diffusion denoising process. We show that WDR greatly enhances the model's ability to capture extreme values compared to diffusion without WDR. Extensive evaluation shows that WassDiff has better reconstruction accuracy and bias scores than conventional score-based diffusion models. Case studies of extreme weather phenomena, like tropical storms and cold fronts, demonstrate WassDiff's ability to produce appropriate spatial patterns while capturing extremes. Such downscaling capability enables the generation of extensive km-scale precipitation datasets from existing historical global gauge records and current gauge measurements in areas without high-resolution radar.
https://arxiv.org/abs/2410.00381
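The WDR term augments the usual denoising score-matching loss with a distributional penalty that pushes the marginal of predicted precipitation intensities towards that of the observations, which is what helps preserve the extreme-value tail. A simplified sketch using the sorted-sample 1D Wasserstein distance as the regularizer (this is one common empirical estimator; the paper's exact formulation and noise parameterization may differ):

```python
import torch

def wasserstein_1d(a, b):
    """Empirical 1D Wasserstein-1 distance between two equally sized samples."""
    a_sorted, _ = torch.sort(a.flatten())
    b_sorted, _ = torch.sort(b.flatten())
    return (a_sorted - b_sorted).abs().mean()

def wassdiff_style_loss(score_model, x0, t, noise, sigma_t, lam=0.1):
    """Denoising score matching plus a Wasserstein distance regularization term."""
    x_t = x0 + sigma_t * noise                      # VE-style perturbation
    score = score_model(x_t, t)
    dsm = ((sigma_t * score + noise) ** 2).mean()   # standard score-matching loss
    x0_hat = x_t + (sigma_t ** 2) * score           # Tweedie estimate of the clean field
    wdr = wasserstein_1d(x0_hat, x0)                # match intensity distributions
    return dsm + lam * wdr
```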
This paper analyzes the impact of the causal manner of the text encoder in text-to-image (T2I) diffusion models, which can lead to information bias and loss. Previous works have focused on addressing these issues through the denoising process. However, there is no research discussing how text embeddings contribute to T2I models, especially when generating more than one object. In this paper, we share a comprehensive analysis of text embeddings: i) how text embeddings contribute to the generated images and ii) why information gets lost and becomes biased towards the first-mentioned object. Accordingly, we propose a simple but effective text embedding balance optimization method, which is training-free, with an improvement of 90.05% on information balance in Stable Diffusion. Furthermore, we propose a new automatic evaluation metric that quantifies information loss more accurately than existing methods, achieving 81% concordance with human assessments. This metric effectively measures the presence and accuracy of objects, addressing the limitations of current distribution scores such as CLIP's text-image similarities.
https://arxiv.org/abs/2410.00321
PET imaging is a powerful modality offering quantitative assessments of molecular and physiological processes. The necessity for PET denoising arises from the intrinsic high noise levels in PET imaging, which can significantly hinder the accurate interpretation and quantitative analysis of the scans. With advances in deep learning techniques, diffusion model-based PET denoising techniques have shown remarkable performance improvement. However, these models often face limitations when applied to volumetric data. Additionally, many existing diffusion models do not adequately consider the unique characteristics of PET imaging, such as its 3D volumetric nature, leading to the potential loss of anatomic consistency. Our Conditional Score-based Residual Diffusion (CSRD) model addresses these issues by incorporating a refined score function and 3D patch-wise training strategy, optimizing the model for efficient volumetric PET denoising. The CSRD model significantly lowers computational demands and expedites the denoising process. By effectively integrating volumetric data from PET and MRI scans, the CSRD model maintains spatial coherence and anatomical detail. Lastly, we demonstrate that the CSRD model achieves superior denoising performance in both qualitative and quantitative evaluations while maintaining image details and outperforms existing state-of-the-art methods.
https://arxiv.org/abs/2410.00184
We propose COLLAGE, a novel framework for generating collaborative agent-object-agent interactions by leveraging large language models (LLMs) and hierarchical motion-specific vector-quantized variational autoencoders (VQ-VAEs). Our model addresses the lack of rich datasets in this domain by incorporating the knowledge and reasoning abilities of LLMs to guide a generative diffusion model. The hierarchical VQ-VAE architecture captures different motion-specific characteristics at multiple levels of abstraction, avoiding redundant concepts and enabling efficient multi-resolution representation. We introduce a diffusion model that operates in the latent space and incorporates LLM-generated motion planning cues to guide the denoising process, resulting in prompt-specific motion generation with greater control and diversity. Experimental results on the CORE-4D and InterHuman datasets demonstrate the effectiveness of our approach in generating realistic and diverse collaborative human-object-human interactions, outperforming state-of-the-art methods. Our work opens up new possibilities for modeling complex interactions in various domains, such as robotics, graphics, and computer vision.
https://arxiv.org/abs/2409.20502
Text-to-video diffusion models have made remarkable advancements. Driven by their ability to generate temporally coherent videos, research on zero-shot video editing using these fundamental models has expanded rapidly. To enhance editing quality, structural controls are frequently employed in video editing. Among these techniques, cross-attention mask control stands out for its effectiveness and efficiency. However, when cross-attention masks are naively applied to video editing, they can introduce artifacts such as blurring and flickering. Our experiments uncover a critical factor overlooked in previous video editing research: cross-attention masks are not consistently clear but vary with model structure and denoising timestep. To address this issue, we propose the metric Mask Matching Cost (MMC), which quantifies this variability, and FreeMask, a method for selecting optimal masks tailored to specific video editing tasks. Using MMC-selected masks, we further improve the masked fusion mechanism within comprehensive attention features, e.g., temporal, cross-, and self-attention modules. Our approach can be seamlessly integrated into existing zero-shot video editing frameworks with better performance, requiring no control assistance or parameter fine-tuning while enabling adaptive decoupling of unedited semantic layouts with mask precision control. Extensive experiments demonstrate that FreeMask achieves superior semantic fidelity, temporal consistency, and editing quality compared to state-of-the-art methods.
https://arxiv.org/abs/2409.20500
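One concrete way to instantiate the observation that mask quality varies across layers and timesteps is to score each candidate cross-attention mask against a reference region for the edited concept and average over denoising steps; lower cost would indicate a more usable mask. The sketch below is an illustrative interpretation only (the metric's exact definition is given in the paper):

```python
import torch

def mask_matching_cost(masks, reference, thresh=0.5):
    """Average (1 - IoU) between binarized cross-attention masks and a reference.

    masks     : tensor (T, H, W) of attention maps over denoising timesteps.
    reference : tensor (H, W) reference mask for the edited region.
    """
    ref = (reference > thresh).float()
    costs = []
    for m in masks:
        b = (m > thresh).float()
        inter = (b * ref).sum()
        union = ((b + ref) > 0).float().sum().clamp(min=1.0)
        costs.append(1.0 - inter / union)
    return torch.stack(costs).mean()

# Selection: compute the cost per layer/timestep and keep the lowest-cost masks
# for the masked fusion of temporal, cross-, and self-attention features.
```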
The application of deep learning in cancer research, particularly in early diagnosis, case understanding, and treatment strategy design, emphasizes the need for high-quality data. Generative AI, especially Generative Adversarial Networks (GANs), has emerged as a leading solution to challenges like class imbalance, robust learning, and model training, while addressing issues stemming from patient privacy and the scarcity of real data. Despite their promise, GANs face several challenges, both inherent and specific to histopathology data. Inherent issues include training imbalance, mode collapse, linear learning from insufficient discriminator feedback, and hard boundary convergence due to stringent feedback. Histopathology data presents a unique challenge with its complex representation, high spatial resolution, and multiscale features. To address these challenges, we propose a framework consisting of two components. First, we introduce a contrastive learning-based Multistage Progressive Finetuning Siamese Neural Network (MFT-SNN) for assessing the similarity between histopathology patches. Second, we implement a Reinforcement Learning-based External Optimizer (RL-EO) within the GAN training loop, serving as a reward signal generator. The modified discriminator loss function incorporates a weighted reward, guiding the GAN to maximize this reward while minimizing loss. This approach offers an external optimization guide to the discriminator, preventing generator overfitting and ensuring smooth convergence. Our proposed solution has been benchmarked against state-of-the-art (SOTA) GANs and a Denoising Diffusion Probabilistic model, outperforming previous SOTA across various metrics, including FID score, KID score, Perceptual Path Length, and downstream classification tasks.
https://arxiv.org/abs/2409.20340
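The abstract states that the modified discriminator loss incorporates a weighted reward so that the GAN maximizes the reward while minimizing the loss. A hedged sketch of one way to realize that objective (the non-saturating BCE form and the weight are assumptions; the reward itself comes from the RL-based external optimizer):

```python
import torch

def discriminator_loss_with_reward(d_real, d_fake, reward, w=0.5):
    """Standard discriminator loss minus a weighted external reward.

    d_real, d_fake : discriminator logits on real / generated patches.
    reward         : scalar reward produced by the external optimizer (e.g.,
                     derived from the fine-tuned Siamese network's similarity).
    """
    bce = torch.nn.functional.binary_cross_entropy_with_logits
    loss_real = bce(d_real, torch.ones_like(d_real))
    loss_fake = bce(d_fake, torch.zeros_like(d_fake))
    # Minimizing this objective simultaneously minimizes the usual loss and
    # maximizes the weighted reward, steering the discriminator's feedback.
    return loss_real + loss_fake - w * reward
```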
Recent research on fine-tuning vision-language models has demonstrated impressive performance in various downstream tasks. However, the challenge of obtaining accurately labeled data in real-world applications poses a significant obstacle during the fine-tuning process. To address this challenge, this paper presents a Denoising Fine-Tuning framework, called DeFT, for adapting vision-language models. DeFT utilizes the robust alignment of textual and visual features pre-trained on millions of auxiliary image-text pairs to sieve out noisy labels. The proposed framework establishes a noisy label detector by learning positive and negative textual prompts for each class. The positive prompt seeks to reveal distinctive features of the class, while the negative prompt serves as a learnable threshold for separating clean and noisy samples. We employ parameter-efficient fine-tuning for the adaptation of a pre-trained visual encoder to promote its alignment with the learned textual prompts. As a general framework, DeFT can seamlessly fine-tune many pre-trained models to downstream tasks by utilizing carefully selected clean samples. Experimental results on seven synthetic and real-world noisy datasets validate the effectiveness of DeFT in both noisy label detection and image classification.
https://arxiv.org/abs/2409.19696
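The detector's decision rule follows directly from the prompt design: a sample counts as clean when its image feature aligns better with its class's positive prompt than with the learned negative prompt, which acts as a per-class threshold. A schematic sketch (CLIP-style cosine similarity assumed; names are illustrative, not from the paper's code):

```python
import torch
import torch.nn.functional as F

def is_clean(image_feat, pos_text_feat, neg_text_feat):
    """Flag a (possibly noisily) labeled sample as clean or noisy.

    image_feat    : visual feature of the image (from the adapted encoder).
    pos_text_feat : embedding of the positive prompt for the sample's given label.
    neg_text_feat : embedding of the negative prompt for that label
                    (the learnable threshold).
    """
    sim_pos = F.cosine_similarity(image_feat, pos_text_feat, dim=-1)
    sim_neg = F.cosine_similarity(image_feat, neg_text_feat, dim=-1)
    return sim_pos > sim_neg     # True -> keep for fine-tuning; False -> noisy label
```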
Adverse weather removal aims to restore clear vision under adverse weather conditions. Existing methods are mostly tailored for specific weather types and rely heavily on extensive labeled data. To address these two limitations, this paper presents a pioneering semi-supervised all-in-one adverse weather removal framework built on a teacher-student network with a Denoising Diffusion Model (DDM) as the backbone, termed SemiDDM-Weather. For the DDM backbone of SemiDDM-Weather, we adopt the SOTA Wavelet Diffusion Model, WaveDiff, with customized inputs and loss functions, devoted to facilitating the learning of many-to-one mapping distributions for efficient all-in-one adverse weather removal with limited labeled data. To mitigate the risk of misleading model training due to potentially inaccurate pseudo-labels generated by the teacher network in semi-supervised learning, we introduce quality assessment and content consistency constraints to screen the "optimal" outputs from the teacher network as the pseudo-labels, thus more effectively guiding student network training with unlabeled data. Experimental results show that, on both synthetic and real-world datasets, our SemiDDM-Weather consistently delivers high visual quality and superior adverse weather removal, even when compared to fully supervised competitors. Our code and pre-trained model are available at this repository.
https://arxiv.org/abs/2409.19679
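In the semi-supervised loop, only teacher outputs that pass both screens become pseudo-labels for the student. A schematic sketch of that gating (the specific quality metric, consistency measure, and thresholds here are placeholders, not the paper's choices):

```python
def select_pseudo_labels(teacher_outputs, inputs, quality_score, consistency,
                         q_thresh=0.6, c_thresh=0.9):
    """Keep only 'optimal' teacher restorations as pseudo-labels.

    quality_score : no-reference quality assessment of a restored image.
    consistency   : content-consistency measure between input and restoration.
    """
    selected = []
    for x, y_hat in zip(inputs, teacher_outputs):
        if quality_score(y_hat) >= q_thresh and consistency(x, y_hat) >= c_thresh:
            selected.append((x, y_hat))      # train the student on this pair
    return selected
```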
Accurate dynamic modeling is critical for autonomous racing vehicles, especially during high-speed and agile maneuvers where precise motion prediction is essential for safety. Traditional parameter estimation methods face limitations such as reliance on initial guesses, labor-intensive fitting procedures, and complex testing setups. On the other hand, purely data-driven machine learning methods struggle to capture inherent physical constraints and typically require large datasets for optimal performance. To address these challenges, this paper introduces the Fine-Tuning Hybrid Dynamics (FTHD) method, which integrates supervised and unsupervised Physics-Informed Neural Networks (PINNs), combining physics-based modeling with data-driven techniques. FTHD fine-tunes a pre-trained Deep Dynamics Model (DDM) using a smaller training dataset, delivering superior performance compared to state-of-the-art methods such as the Deep Pacejka Model (DPM) and outperforming the original DDM. Furthermore, an Extended Kalman Filter (EKF) is embedded within FTHD (EKF-FTHD) to effectively manage noisy real-world data, ensuring accurate denoising while preserving the vehicle's essential physical characteristics. The proposed FTHD framework is validated through scaled simulations using the BayesRace Physics-based Simulator and full-scale real-world experiments from the Indy Autonomous Challenge. Results demonstrate that the hybrid approach significantly improves parameter estimation accuracy, even with reduced data, and outperforms existing models. EKF-FTHD enhances robustness by denoising real-world data while maintaining physical insights, representing a notable advancement in vehicle dynamics modeling for high-speed autonomous racing.
https://arxiv.org/abs/2409.19647
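EKF-FTHD filters the noisy measured states through a standard predict/update cycle, with the learned hybrid dynamics supplying the process model. A generic EKF step in numpy for reference (the dynamics f, measurement h, and their Jacobians here are placeholders standing in for the FTHD components):

```python
import numpy as np

def ekf_step(x, P, z, f, F_jac, h, H_jac, Q, R):
    """One Extended Kalman Filter predict/update step.

    x, P : state estimate and covariance.
    z    : noisy measurement (e.g., logged vehicle states).
    f, h : process and measurement models; F_jac, H_jac their Jacobians.
    Q, R : process and measurement noise covariances.
    """
    # Predict through the (learned) dynamics model.
    x_pred = f(x)
    F = F_jac(x)
    P_pred = F @ P @ F.T + Q
    # Update with the measurement.
    H = H_jac(x_pred)
    y = z - h(x_pred)                                   # innovation
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)                 # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```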