Large-scale Text-to-Image (T2I) diffusion models demonstrate significant generation capabilities based on textual prompts. Building on T2I diffusion models, text-guided image editing research aims to empower users to manipulate generated images by altering the text prompts. However, existing image editing techniques are prone to editing unintended regions beyond the target area, primarily due to inaccuracies in cross-attention maps. To address this problem, we propose Localization-aware Inversion (LocInv), which exploits segmentation maps or bounding boxes as extra localization priors to refine the cross-attention maps in the denoising phases of the diffusion process. By dynamically updating the tokens corresponding to noun words in the textual input, we compel the cross-attention maps to closely align with the correct noun and adjective words in the text prompt. Based on this technique, we achieve fine-grained image editing of particular objects while preventing undesired changes to other regions. Our method LocInv, based on the publicly available Stable Diffusion, is extensively evaluated on a subset of the COCO dataset, and consistently obtains superior results both quantitatively and qualitatively. The code will be released at this https URL
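As a concrete illustration of how a localization prior can refine cross-attention, the minimal sketch below (ours, not the paper's implementation; the loss form, map size, and update rule are assumptions) penalizes the fraction of a noun token's attention mass that falls outside its mask and refines the token embedding by gradient descent:

```python
import torch

def localization_loss(attn_map: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Penalize cross-attention mass that falls outside the localization prior.

    attn_map: (H, W) non-negative cross-attention weights for one noun token.
    mask:     (H, W) binary prior from a segmentation map or bounding box.
    """
    inside = (attn_map * mask).sum()
    return 1.0 - inside / (attn_map.sum() + 1e-8)   # 0 when fully inside

# Hypothetical dynamic token update at one denoising step:
token_emb = torch.randn(768, requires_grad=True)    # embedding of a noun token
attn = torch.rand(64, 64) * token_emb.norm()        # stand-in for the UNet's map
mask = torch.zeros(64, 64)
mask[20:40, 20:40] = 1.0                            # object region prior
localization_loss(attn, mask).backward()
with torch.no_grad():
    token_emb -= 0.1 * token_emb.grad               # refine the token, not the model
```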
https://arxiv.org/abs/2405.01496
Video-based remote photoplethysmography (rPPG) has emerged as a promising technology for non-contact vital sign monitoring, especially under controlled conditions. However, the accurate measurement of vital signs in real-world scenarios faces several challenges, including artifacts induced by video codecs, low-light noise, degradation, low dynamic range, occlusions, and hardware and network constraints. In this article, we systematically and comprehensively investigate these issues, measuring their detrimental effects on the quality of rPPG measurements. Additionally, we propose practical strategies for mitigating these challenges to improve the dependability and resilience of video-based rPPG systems. We detail methods for effective biosignal recovery in the presence of network limitations and present denoising and inpainting techniques aimed at preserving video frame integrity. Through extensive evaluations and direct comparisons, we demonstrate the effectiveness of these approaches in enhancing rPPG measurements under challenging environments, contributing to the development of more reliable and effective remote vital sign monitoring technologies.
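For the biosignal-recovery side, a minimal baseline (ours; the band limits and filter order are conventional choices, not the article's exact pipeline) band-passes a skin-region color trace to the human heart-rate band:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def recover_pulse(rgb_trace: np.ndarray, fps: float) -> np.ndarray:
    """Band-pass the green-channel trace to the heart-rate band (0.7-4 Hz).

    rgb_trace: (T, 3) mean skin-pixel RGB values per frame.
    """
    g = rgb_trace[:, 1]
    g = (g - g.mean()) / (g.std() + 1e-8)       # normalize out illumination drift
    b, a = butter(3, [0.7, 4.0], btype="band", fs=fps)
    return filtfilt(b, a, g)

# Estimate heart rate from the dominant frequency of the filtered signal.
fps = 30.0
trace = np.random.rand(300, 3)                  # placeholder for real video data
pulse = recover_pulse(trace, fps)
freqs = np.fft.rfftfreq(len(pulse), d=1.0 / fps)
hr_bpm = 60.0 * freqs[np.abs(np.fft.rfft(pulse)).argmax()]
```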
https://arxiv.org/abs/2405.01230
Self-supervised learning is a crucial approach in machine learning for image denoising when the noisy data are subject to denaturation. However, a theoretical understanding of the performance of approaches that use denatured data is lacking. To provide a better understanding of such approaches, in this paper we analyze in depth a self-supervised denoising algorithm that uses denatured data, through theoretical analysis and numerical experiments. Through the theoretical analysis, we show that the algorithm finds desired solutions to the optimization problem with the population risk, while the guarantee for the empirical risk depends on the hardness of the denoising task in terms of denaturation levels. We also conduct several experiments to investigate the performance of an extended algorithm in practice. The results indicate that training with denatured images works, and that the empirical performance aligns with the theoretical results. These results suggest several directions for further improving self-supervised image denoising that uses denatured data.
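For reference, the population-versus-empirical-risk distinction the analysis rests on, in generic notation of our own (the paper's precise self-supervised objective is not reproduced here): with denoiser f, clean image x, denatured noisy observation x-tilde, and loss l,

```latex
R(f) = \mathbb{E}_{(x,\tilde{x})}\big[\ell\big(f(\tilde{x}),\, x\big)\big],
\qquad
\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(\tilde{x}_i),\, x_i\big).
```

The claim above is then that the algorithm provably minimizes the population risk R(f), while the gap between the empirical risk and R(f) depends on the denaturation level.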
https://arxiv.org/abs/2405.01124
Simulating soil reflectance spectra is invaluable for soil-plant radiative modeling and for training machine learning models, yet it is difficult due to the intricate relationships between soil structure and its constituents. To address this, a fully data-driven soil optics generative model (SOGM), which simulates soil reflectance spectra from soil property inputs, was developed. The model is trained on an extensive dataset comprising nearly 180,000 soil spectra-property pairs from 17 datasets. It generates soil reflectance spectra from text-based inputs describing soil properties and their values, rather than from only numerical values and labels in a binary vector format. The generative model can simulate output spectra from an incomplete set of input properties. SOGM is based on the denoising diffusion probabilistic model (DDPM). Two additional sub-models were also built to complement the SOGM: a spectral padding model that can fill in the gaps for spectra shorter than the full visible-near-infrared range (VIS-NIR; 400 to 2499 nm), and a wet-soil spectra model that can estimate the effects of water content on soil reflectance spectra given the dry spectrum predicted by the SOGM. The SOGM was up-scaled by coupling it with the Helios 3D plant modeling software, which allowed for the generation of synthetic aerial images of simulated soil and plant scenes. It can also be easily integrated with soil-plant radiation models used for remote sensing research, such as PROSAIL. Testing results of the SOGM on new datasets that were not included in model training prove that the model can generate reasonable soil reflectance spectra from the available property inputs. The presented models are openly accessible at: this https URL.
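A trivial sketch of the text-based conditioning interface (the key names and phrasing are hypothetical, not the paper's schema): any subset of available properties is serialized into a prompt, which is what lets the model handle incomplete inputs:

```python
def properties_to_prompt(props):
    """Serialize whatever soil properties are available into a text condition;
    key names and phrasing here are illustrative, not the paper's schema."""
    return "; ".join(f"{name}: {value}" for name, value in sorted(props.items()))

print(properties_to_prompt({"organic carbon (%)": 1.8, "clay (%)": 22}))
# -> clay (%): 22; organic carbon (%): 1.8
```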
https://arxiv.org/abs/2405.01060
We present EchoScene, an interactive and controllable generative model that generates 3D indoor scenes on scene graphs. EchoScene leverages a dual-branch diffusion model that dynamically adapts to scene graphs. Existing methods struggle to handle scene graphs due to varying numbers of nodes, multiple edge combinations, and manipulator-induced node-edge operations. EchoScene overcomes this by associating each node with a denoising process and enables collaborative information exchange, enhancing controllable and consistent generation aware of global constraints. This is achieved through an information echo scheme in both shape and layout branches. At every denoising step, all processes share their denoising data with an information exchange unit that combines these updates using graph convolution. The scheme ensures that the denoising processes are influenced by a holistic understanding of the scene graph, facilitating the generation of globally coherent scenes. The resulting scenes can be manipulated during inference by editing the input scene graph and sampling the noise in the diffusion model. Extensive experiments validate our approach, which maintains scene controllability and surpasses previous methods in generation fidelity. Moreover, the generated scenes are of high quality and thus directly compatible with off-the-shelf texture generation. Code and trained models are open-sourced.
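A minimal sketch of one information-echo round (ours; in the real system the exchange unit operates inside both the shape and layout diffusion branches): per-node denoising states are mixed over the scene graph with a single graph-convolution step:

```python
import torch

def echo_exchange(node_states: torch.Tensor, adj: torch.Tensor,
                  weight: torch.Tensor) -> torch.Tensor:
    """One information-echo round: mix per-node denoising states over the graph.

    node_states: (N, D) intermediate denoising data, one row per scene-graph node.
    adj:         (N, N) adjacency with self-loops, row-normalized.
    weight:      (D, D) mixing matrix of a single graph-conv layer.
    """
    return torch.relu(adj @ node_states @ weight)

N, D = 5, 64
adj = torch.eye(N) + torch.rand(N, N).round()    # toy graph with self-loops
adj = adj / adj.sum(dim=1, keepdim=True)         # row-normalize
states = torch.randn(N, D)
shared = echo_exchange(states, adj, torch.randn(D, D))  # feeds both branches
```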
https://arxiv.org/abs/2405.00915
Objective: To detect infected wounds in Diabetic Foot Ulcers (DFUs) from photographs, preventing severe complications and amputations. Methods: This paper proposes the Guided Conditional Diffusion Classifier (ConDiff), a novel deep-learning infection detection model that combines guided image synthesis with a denoising diffusion model and distance-based classification. The process involves (1) generating guided conditional synthetic images by injecting Gaussian noise into a guide image, then denoising the noise-perturbed image through a reverse diffusion process conditioned on infection status, and (2) classifying infections based on the minimum Euclidean distance between the synthesized images and the original guide image in embedding space. Results: ConDiff demonstrated superior performance with an accuracy of 83% and an F1-score of 0.858, outperforming state-of-the-art models by at least 3%. The use of a triplet loss function reduces overfitting in the distance-based classifier. Conclusions: ConDiff not only enhances diagnostic accuracy for DFU infections but also pioneers the use of generative discriminative models for detailed medical image analysis, offering a promising approach for improving patient outcomes.
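Step (2) reduces to a nearest-synthesis rule in embedding space; a minimal sketch (ours; the embedding network and condition names are placeholders):

```python
import torch

def condiff_classify(guide_emb: torch.Tensor,
                     synth_embs: dict[str, torch.Tensor]) -> str:
    """Pick the condition whose synthesized image lands closest to the guide.

    guide_emb:  (D,) embedding of the original guide image.
    synth_embs: condition name -> (D,) embedding of the image synthesized
                under that condition (e.g. "infected" / "uninfected").
    """
    dists = {c: torch.dist(guide_emb, e).item() for c, e in synth_embs.items()}
    return min(dists, key=dists.get)

guide = torch.randn(512)
label = condiff_classify(guide, {"infected": torch.randn(512),
                                 "uninfected": guide + 0.01 * torch.randn(512)})
print(label)  # "uninfected": its synthesis barely moved the guide image
```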
https://arxiv.org/abs/2405.00858
In this paper, we present a novel approach that combines deep metric learning and synthetic data generation using diffusion models for out-of-distribution (OOD) detection. One popular approach for OOD detection is outlier exposure, where models are trained using a mixture of in-distribution (ID) samples and "seen" OOD samples. For the OOD samples, the model is trained to minimize the KL divergence between the output probability and the uniform distribution while correctly classifying the ID data. In this paper, we propose a label-mixup approach to generate synthetic OOD data using Denoising Diffusion Probabilistic Models (DDPMs). Additionally, we explore recent advancements in metric learning to train our models. In the experiments, we found that metric learning-based loss functions perform better than the softmax loss. Furthermore, the baseline models (including softmax and metric learning) show a significant improvement when trained with the generated OOD data. Our approach outperforms strong baselines on conventional OOD detection metrics.
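The outlier-exposure objective described above can be written compactly; a sketch assuming a lambda-weighted sum of the two terms (the weighting is our assumption):

```python
import math
import torch
import torch.nn.functional as F

def outlier_exposure_loss(logits_id, labels_id, logits_ood, lam=0.5):
    """Cross-entropy on ID data plus KL(p || uniform) on (synthetic) OOD data.

    KL(p || u) = sum_k p_k log(p_k / (1/K)) = -H(p) + log K, so minimizing it
    pushes OOD predictions towards maximum entropy.
    """
    k = logits_ood.size(-1)
    ce = F.cross_entropy(logits_id, labels_id)
    p = F.softmax(logits_ood, dim=-1)
    kl = (p * p.clamp_min(1e-8).log()).sum(-1).mean() + math.log(k)
    return ce + lam * kl

loss = outlier_exposure_loss(torch.randn(8, 10), torch.randint(0, 10, (8,)),
                             torch.randn(8, 10))
```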
https://arxiv.org/abs/2405.00631
Optimizing a text-to-image diffusion model with a given reward function is an important but underexplored research area. In this study, we propose Deep Reward Tuning (DRTune), an algorithm that directly supervises the final output image of a text-to-image diffusion model and back-propagates through the iterative sampling process to the input noise. We find that training earlier steps in the sampling process is crucial for low-level rewards, and that deep supervision can be achieved efficiently and effectively by stopping the gradient of the denoising network input. DRTune is extensively evaluated on various reward models. It consistently outperforms other algorithms, particularly for low-level control signals, where all shallow supervision methods fail. Additionally, we fine-tune the Stable Diffusion XL 1.0 (SDXL 1.0) model via DRTune to optimize Human Preference Score v2.1, resulting in the Favorable Diffusion XL 1.0 (FDXL 1.0) model. FDXL 1.0 significantly enhances image quality compared to SDXL 1.0 and reaches quality comparable to Midjourney v5.2.
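The key trick, stopping the gradient of the denoising network's input while still back-propagating through the sampling recursion, might look roughly like this (a simplified deterministic sampler of our own; the paper's update rule and schedules differ):

```python
import torch
import torch.nn as nn

def drtune_sample(x, eps_model, alphas, sigmas, reward_fn):
    """Simplified deterministic sampler with the denoiser *input* detached.

    Gradients from the reward reach eps_model's parameters through its outputs
    at every step, and reach earlier steps through the linear mixing of x, but
    never flow back through the denoiser input -- which is what makes deep
    supervision of early steps affordable. alphas/sigmas are toy schedules.
    """
    for t in reversed(range(len(alphas))):
        eps = eps_model(x.detach(), t)           # stop-gradient on input only
        x0 = (x - sigmas[t] * eps) / alphas[t]   # grad still flows through x
        a_prev = alphas[t - 1] if t > 0 else 1.0
        s_prev = sigmas[t - 1] if t > 0 else 0.0
        x = a_prev * x0 + s_prev * eps
    (-reward_fn(x)).backward()                   # maximize the reward
    return x

net = nn.Linear(4, 4)
drtune_sample(torch.randn(2, 4), lambda x, t: net(x),
              alphas=[0.9, 0.95, 1.0], sigmas=[0.4, 0.3, 0.1],
              reward_fn=lambda img: -img.pow(2).mean())
```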
https://arxiv.org/abs/2405.00760
Denoising diffusion models have recently gained prominence as powerful tools for a variety of image generation and manipulation tasks. Building on this, we propose a novel tool for real-time editing of images that provides users with fine-grained region-targeted supervision in addition to existing prompt-based controls. Our novel editing technique, termed Layered Diffusion Brushes, leverages prompt-guided and region-targeted alteration of intermediate denoising steps, enabling precise modifications while maintaining the integrity and context of the input image. We provide an editor based on Layered Diffusion Brushes that incorporates well-known image editing concepts such as layer masks, visibility toggles, and independent manipulation of layers, regardless of their order. Our system renders a single edit on a 512x512 image within 140 ms using a high-end consumer GPU, enabling real-time feedback and rapid exploration of candidate edits. We validated our method and editing system through a user study involving both natural images (using inversion) and generated images, showcasing its usability and effectiveness for refining images compared to existing techniques such as InstructPix2Pix and Stable Diffusion Inpainting. Our approach demonstrates efficacy across a range of tasks, including object attribute adjustments, error correction, and sequential prompt-based object placement and manipulation, demonstrating its versatility and potential for enhancing creative workflows.
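At its core, each "brush" amounts to a masked blend of intermediate latents; a minimal sketch of ours (the full editor adds prompt guidance and layer management on top):

```python
import torch

def brush_blend(latents_orig, latents_edit, mask):
    """Region-targeted alteration of an intermediate denoising latent.

    latents_*: (C, H, W) latents at the same timestep; mask: (1, H, W) in
    [0, 1], painted by the user as a brush layer. Outside the mask the
    original latent is kept, preserving image context and integrity.
    """
    return mask * latents_edit + (1 - mask) * latents_orig

# Each layer blend only touches its own mask, so layers compose in any order.
x = torch.randn(4, 64, 64)
m = torch.zeros(1, 64, 64)
m[..., :32, :] = 1.0                      # a brush stroke over the top half
x = brush_blend(x, torch.randn(4, 64, 64), m)
```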
https://arxiv.org/abs/2405.00313
Recent advances in diffusion models can generate high-quality and stunning images from text. However, multi-turn image generation, which is in high demand in real-world scenarios, still faces challenges in maintaining semantic consistency between images and texts, as well as contextual consistency of the same subject across multiple interactive turns. To address this issue, we introduce TheaterGen, a training-free framework that integrates large language models (LLMs) and text-to-image (T2I) models to provide the capability of multi-turn image generation. Within this framework, LLMs, acting as a "Screenwriter", engage in multi-turn interaction, generating and managing a standardized prompt book that encompasses prompts and layout designs for each character in the target image. Based on these, TheaterGen generates a list of character images and extracts guidance information, akin to a "Rehearsal". Subsequently, by incorporating the prompt book and guidance information into the reverse denoising process of T2I diffusion models, TheaterGen generates the final image, conducting the "Final Performance". With the effective management of prompt books and character images, TheaterGen significantly improves semantic and contextual consistency in synthesized images. Furthermore, we introduce a dedicated benchmark, CMIGBench (Consistent Multi-turn Image Generation Benchmark), with 8000 multi-turn instructions. Unlike previous multi-turn benchmarks, CMIGBench does not define characters in advance. Both story generation and multi-turn editing tasks are included in CMIGBench for comprehensive evaluation. Extensive experimental results show that TheaterGen outperforms state-of-the-art methods significantly. It raises the performance bar of the cutting-edge Mini DALLE 3 model by 21% in average character-character similarity and 19% in average text-image similarity.
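A minimal, hypothetical shape for the prompt book the "Screenwriter" maintains across turns (field names are ours, not the paper's schema):

```python
# Field names are illustrative, not the paper's schema.
prompt_book = {
    "turn": 3,
    "characters": {
        "knight": {"prompt": "a knight in silver armor, weary",
                   "layout": [0.10, 0.30, 0.45, 0.95]},   # box: x0, y0, x1, y1
        "dragon": {"prompt": "a small green dragon, curious",
                   "layout": [0.55, 0.40, 0.90, 0.90]},
    },
}

def apply_turn(book, name, new_prompt):
    """One interactive turn edits a single character's prompt; the layouts of
    untouched characters stay fixed, preserving cross-turn consistency."""
    book["characters"][name]["prompt"] = new_prompt
    book["turn"] += 1

apply_turn(prompt_book, "dragon", "a small green dragon, breathing smoke")
```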
https://arxiv.org/abs/2404.18919
U-Nets are among the most widely used architectures in computer vision, renowned for their exceptional performance in applications such as image segmentation, denoising, and diffusion modeling. However, a theoretical explanation of the U-Net architecture design has not yet been fully established. This paper introduces a novel interpretation of the U-Net architecture by studying certain generative hierarchical models, which are tree-structured graphical models extensively utilized in both language and image domains. With their encoder-decoder structure, long skip connections, and pooling and up-sampling layers, we demonstrate how U-Nets can naturally implement the belief propagation denoising algorithm in such generative hierarchical models, thereby efficiently approximating the denoising functions. This leads to an efficient sample complexity bound for learning the denoising function using U-Nets within these models. Additionally, we discuss the broader implications of these findings for diffusion models in generative hierarchical models. We also demonstrate that the conventional architecture of convolutional neural networks (ConvNets) is ideally suited for classification tasks within these models. This offers a unified view of the roles of ConvNets and U-Nets, highlighting the versatility of generative hierarchical models in modeling complex data distributions across language and image domains.
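For reference, the sum-product (belief propagation) message update on a tree-structured graphical model, which the paper argues U-Net components can implement for denoising, is the standard

```latex
m_{i \to j}(x_j) \;\propto\; \sum_{x_i} \psi_{ij}(x_i, x_j)\,\phi_i(x_i)
\prod_{k \in \mathcal{N}(i)\setminus\{j\}} m_{k \to i}(x_i),
```

where the phi are node potentials, the psi are edge potentials, and N(i) denotes the neighbors of node i; roughly, pooling and up-sampling mirror the upward and downward passes over the tree.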
https://arxiv.org/abs/2404.18444
In clinical practice, tri-modal medical image fusion, compared to the existing dual-modal technique, can provide a more comprehensive view of the lesions, aiding physicians in evaluating the disease's shape, location, and biological activity. However, due to the limitations of imaging equipment and considerations for patient safety, the quality of medical images is usually limited, leading to sub-optimal fusion performance and affecting the depth of the physician's image analysis. Thus, there is an urgent need for a technology that can both enhance image resolution and integrate multi-modal information. Although current image processing methods can effectively address image fusion and super-resolution individually, solving both problems synchronously remains extremely challenging. In this paper, we propose TFS-Diff, a model that simultaneously realizes tri-modal medical image fusion and super-resolution. Specifically, TFS-Diff is based on a diffusion model with a random iterative denoising process. We also develop a simple objective function, the proposed fusion super-resolution loss, which effectively evaluates the uncertainty in the fusion and ensures the stability of the optimization process. A channel attention module is also proposed to effectively integrate key information from different modalities for clinical diagnosis, avoiding the information loss caused by multiple rounds of image processing. Extensive experiments on the public Harvard datasets show that TFS-Diff significantly surpasses existing state-of-the-art methods in both quantitative and visual evaluations. The source code will be available on GitHub.
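A common squeeze-and-excitation-style realization of channel attention over concatenated modality features is sketched below (our assumption; the paper's exact module may differ):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation-style channel attention: reweight channels so
    informative modality features dominate the fused representation."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                        # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))          # squeeze: global average pool
        return x * w[:, :, None, None]           # excite: reweight channels

fused = torch.cat([torch.randn(2, 16, 32, 32)] * 3, dim=1)  # three modalities
out = ChannelAttention(48)(fused)
```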
https://arxiv.org/abs/2404.17357
Spiking Neural Networks (SNNs) aim to bridge the gap between neuroscience and machine learning by emulating the structure of the human nervous system. However, like convolutional neural networks, SNNs are vulnerable to adversarial attacks. To tackle this challenge, we propose a biologically inspired methodology to enhance the robustness of SNNs, drawing insights from the visual masking effect and filtering theory. First, an end-to-end SNN-based image purification model is proposed to defend against adversarial attacks, comprising a noise extraction network and a non-blind denoising network. The former extracts noise features from noisy images, while the latter employs a residual U-Net structure to reconstruct high-quality, clean images from the noisy inputs. Simultaneously, a multi-level firing SNN based on the Squeeze-and-Excitation Network is introduced to improve the robustness of the classifier. Crucially, the proposed image purification network serves as a pre-processing module, avoiding modifications to classifiers. Unlike adversarial training, our method is highly flexible and can be seamlessly integrated with other defense strategies. Experimental results on various datasets demonstrate that the proposed methodology outperforms state-of-the-art baselines in terms of defense effectiveness, training time, and resource consumption.
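Schematically, the defense is a purification front-end that leaves the classifier untouched; a sketch of ours in which all three networks are toy placeholders for the paper's SNN-based models:

```python
import torch

def purify_then_classify(x, noise_net, denoiser, classifier):
    """Purification as a pre-processing module: the classifier is untouched."""
    noise_feat = noise_net(x)            # extract (adversarial) noise features
    clean = denoiser(x, noise_feat)      # non-blind: denoiser sees the features
    return classifier(clean)

# Toy placeholders standing in for the paper's SNN-based networks.
noise_net = lambda x: x - x.mean()
denoiser = lambda x, n: x - 0.5 * n
classifier = lambda x: x.flatten(1).softmax(dim=-1)
probs = purify_then_classify(torch.randn(1, 3, 8, 8),
                             noise_net, denoiser, classifier)
```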
https://arxiv.org/abs/2404.17092
This paper aims to explore the evolution of image denoising in a pedagogical way. We briefly review classical methods such as Fourier analysis and wavelet bases, highlighting the challenges they faced until the emergence of neural networks, notably the U-Net, in the 2010s. The remarkable performance of these networks has been demonstrated in studies such as Kadkhodaie et al. (2024). They exhibit adaptability to various image types, including those with fixed regularity, facial images, and bedroom scenes, achieving optimal results while being biased towards geometry-adaptive harmonic bases. The introduction of score diffusion has played a crucial role in image generation. In this context, denoising becomes essential as it facilitates the estimation of probability density scores. We discuss the prerequisites for genuine learning of probability densities, offering insights that extend from mathematical research to the implications of universal structures.
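The link between denoising and density scores mentioned above is the classical Miyasawa/Tweedie identity: for an observation y = x + sigma*z with z ~ N(0, I), the MMSE denoiser exposes the score of the noisy density,

```latex
\hat{x}(y) \;=\; \mathbb{E}[x \mid y] \;=\; y + \sigma^{2}\, \nabla_{y} \log p_{\sigma}(y),
```

which is why a learned denoiser gives access to the probability density scores that score diffusion requires.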
https://arxiv.org/abs/2404.16617
Text-conditioned image-to-video generation (TI2V) aims to synthesize a realistic video starting from a given image (e.g., a woman's photo) and a text description (e.g., "a woman is drinking water."). Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning. In this paper, we propose TI2V-Zero, a zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image, enabling TI2V generation without any optimization, fine-tuning, or introducing external modules. Our approach leverages a pretrained T2V diffusion foundation model as the generative prior. To guide video generation with the additional image input, we propose a "repeat-and-slide" strategy that modulates the reverse denoising process, allowing the frozen diffusion model to synthesize a video frame-by-frame starting from the provided image. To ensure temporal continuity, we employ a DDPM inversion strategy to initialize Gaussian noise for each newly synthesized frame and a resampling technique to help preserve visual details. We conduct comprehensive experiments on both domain-specific and open-domain datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V model. Furthermore, we show that TI2V-Zero can seamlessly extend to other tasks such as video infilling and prediction when provided with more images. Its autoregressive design also supports long video generation.
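The "repeat-and-slide" strategy can be sketched as a sliding context window over frames (a schematic of ours; the actual method modulates the reverse denoising process and adds DDPM-inversion noise initialization and resampling):

```python
def repeat_and_slide(image, t2v_sample, window, num_frames):
    """Sliding-window, frame-by-frame synthesis with a frozen T2V model.

    The window is seeded by repeating the conditioning image; each call
    synthesizes one new frame, then the window slides forward. t2v_sample
    is a placeholder for a denoising pass of the frozen T2V model.
    """
    frames = [image] * window                 # "repeat"
    video = [image]
    for _ in range(num_frames - 1):
        clip = t2v_sample(frames)             # denoise with the window as context
        frames = frames[1:] + [clip[-1]]      # "slide"
        video.append(clip[-1])
    return video

# Toy run with strings standing in for frames:
video = repeat_and_slide("img0", lambda fs: fs[:-1] + [fs[-1] + "+"], 4, 5)
```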
https://arxiv.org/abs/2404.16306
Procedural noise is a fundamental component of computer graphics pipelines, offering a flexible way to generate textures that exhibit "natural" random variation. Many different types of noise exist, each produced by a separate algorithm. In this paper, we present a single generative model which can learn to generate multiple types of noise as well as blend between them. In addition, it is capable of producing spatially-varying noise blends despite not having access to such data for training. These features are enabled by training a denoising diffusion model using a novel combination of data augmentation and network conditioning techniques. Like procedural noise generators, the model's behavior is controllable via interpretable parameters and a source of randomness. We use our model to produce a variety of visually compelling noise textures. We also present an application of our model to improving inverse procedural material design; using our model in place of fixed-type noise nodes in a procedural material graph results in higher-fidelity material reconstructions without needing to know the type of noise in advance.
https://arxiv.org/abs/2404.16292
Diffusion models have demonstrated their capability to synthesize high-quality and diverse images from textual prompts. However, simultaneous control over both global contexts (e.g., object layouts and interactions) and local details (e.g., colors and emotions) still remains a significant challenge. The models often fail to understand complex descriptions involving multiple objects, and either assign specified visual attributes to the wrong targets or ignore them. This paper presents Global-Local Diffusion (GLoD), a novel framework which allows simultaneous control over the global contexts and the local details in text-to-image generation without requiring training or fine-tuning. It assigns multiple global and local prompts to corresponding layers and composes their noises to guide a denoising process using pre-trained diffusion models. Our framework enables complex global-local compositions, conditioning objects in the global prompt with the local prompts while preserving other unspecified identities. Our quantitative and qualitative evaluations demonstrate that GLoD effectively generates complex images that adhere to both user-provided object interactions and object details.
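Noise composition across prompt layers might look like the following sketch (ours; the paper's exact weighting of global and local noises may differ):

```python
import torch

def glod_compose(eps_global, eps_locals, masks):
    """Layer-wise composition of denoiser outputs: each local prompt's noise
    prediction overrides the global one inside its region mask.

    eps_*: (C, H, W) noise predictions; masks: list of (1, H, W) in [0, 1].
    Outside all masks the global prediction survives, which is what preserves
    unspecified identities.
    """
    eps = eps_global.clone()
    for eps_l, m in zip(eps_locals, masks):
        eps = m * eps_l + (1 - m) * eps
    return eps

m = torch.zeros(1, 32, 32)
m[..., 8:24, 8:24] = 1.0                   # region of one local prompt
eps = glod_compose(torch.randn(4, 32, 32), [torch.randn(4, 32, 32)], [m])
```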
https://arxiv.org/abs/2404.15447
Transforming large pre-trained low-resolution diffusion models to cater to higher-resolution demands, i.e., diffusion extrapolation, significantly improves diffusion adaptability. We propose tuning-free CutDiffusion, aimed at simplifying and accelerating the diffusion extrapolation process, making it more affordable and improving performance. CutDiffusion abides by existing patch-wise extrapolation but cuts a standard patch diffusion process into an initial phase focused on comprehensive structure denoising and a subsequent phase dedicated to specific detail refinement. Comprehensive experiments highlight the numerous advantages of CutDiffusion: (1) simple method construction that enables a concise higher-resolution diffusion process without third-party engagement; (2) fast inference speed achieved through a single-step higher-resolution diffusion process and fewer required inference patches; (3) cheap GPU cost resulting from patch-wise inference and fewer patches during the comprehensive structure denoising; (4) strong generation performance, stemming from the emphasis on specific detail refinement.
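The two-phase cut can be sketched as a switch in how patches are drawn during sampling (the patch helpers and switch point below are illustrative assumptions, not the paper's exact recipe):

```python
def cut_diffusion_step(x, t, t_switch, denoise_patch, shifted_patches, tiles):
    """Early steps (t > t_switch) denoise randomly shifted patches so global
    structure emerges; late steps refine fixed non-overlapping tiles for
    detail. Region helpers return (y0, y1, x0, x1) boxes over the latent."""
    regions = shifted_patches(x) if t > t_switch else tiles(x)
    for (y0, y1, x0, x1) in regions:
        x[..., y0:y1, x0:x1] = denoise_patch(x[..., y0:y1, x0:x1], t)
    return x
```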
https://arxiv.org/abs/2404.15141
Existing solutions for 3D semantic occupancy prediction typically treat the task as a one-shot 3D voxel-wise segmentation perception problem. These discriminative methods focus on learning the mapping between the inputs and the occupancy map in a single step, lacking the ability to gradually refine the occupancy map and the imaginative capacity to plausibly complete local regions of the scene. In this paper, we introduce OccGen, a simple yet powerful generative perception model for the task of 3D semantic occupancy prediction. OccGen adopts a "noise-to-occupancy" generative paradigm, progressively inferring and refining the occupancy map by predicting and eliminating noise originating from a random Gaussian distribution. OccGen consists of two main components: a conditional encoder that is capable of processing multi-modal inputs, and a progressive refinement decoder that applies diffusion denoising using the multi-modal features as conditions. A key insight of this generative pipeline is that the diffusion denoising process is naturally able to model the coarse-to-fine refinement of the dense 3D occupancy map, therefore producing more detailed predictions. Extensive experiments on several occupancy benchmarks demonstrate the effectiveness of the proposed method compared to state-of-the-art methods. For instance, OccGen relatively improves the mIoU by 9.5%, 6.3%, and 13.3% on the nuScenes-Occupancy dataset under the multi-modal, LiDAR-only, and camera-only settings, respectively. Moreover, as a generative perception model, OccGen exhibits desirable properties that discriminative models cannot achieve, such as providing uncertainty estimates alongside its multiple-step predictions.
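The "noise-to-occupancy" paradigm amounts to a conditional diffusion-style refinement loop over the voxel grid; a generic stand-in of ours for the real decoder (the update rule below is deliberately simplified):

```python
import torch

def occgen_refine(cond_feats, denoiser, steps, shape):
    """Start from Gaussian noise over the voxel grid and progressively remove
    predicted noise, conditioned on fused multi-modal features. Intermediate
    grids double as coarse predictions, which is where multi-step uncertainty
    estimates can be read off."""
    occ = torch.randn(shape)                                 # (C, X, Y, Z)
    for t in reversed(range(steps)):
        occ = occ - denoiser(occ, t, cond_feats) / steps     # simplified update
    return occ

occ = occgen_refine(None, lambda o, t, c: 0.1 * o, steps=4, shape=(17, 8, 8, 4))
```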
https://arxiv.org/abs/2404.15014
As one of the fundamental video tasks in computer vision, Open-Vocabulary Action Recognition (OVAR) has recently gained increasing attention with the development of vision-language pre-training. To enable generalization to arbitrary classes, existing methods treat class labels as text descriptions, then formulate OVAR as evaluating embedding similarity between visual samples and textual classes. However, one crucial issue is completely ignored: the class descriptions given by users may be noisy, e.g., contain misspellings and typos, limiting the real-world practicality of vanilla OVAR. To fill this research gap, this paper is the first to evaluate existing methods by simulating multi-level noise of various types, revealing their poor robustness. To tackle the noisy OVAR task, we further propose a novel DENOISER framework covering two parts: generation and discrimination. Concretely, the generative part denoises noisy class-text names via a decoding process: it proposes text candidates, then utilizes inter-modal and intra-modal information to vote for the best one. In the discriminative part, we use vanilla OVAR models to assign visual samples to class-text names, thus obtaining richer semantics. For optimization, we alternately iterate between the generative and discriminative parts for progressive refinement. The denoised text classes help OVAR models classify visual samples more accurately; in return, the classified visual samples help better denoising. On three datasets, we carry out extensive experiments to show our superior robustness, and thorough ablations to dissect the effectiveness of each component.
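The generative (denoising) side of the framework can be sketched as propose-then-vote; all components below are hypothetical placeholders for the paper's decoder and encoders:

```python
import torch

def denoise_class_name(noisy_name, propose, text_encode, visual_protos):
    """Propose candidate spellings for a noisy class name, then vote with
    intra-modal (text-to-text) and inter-modal (text-to-visual) similarity."""
    q = text_encode(noisy_name)
    candidates = propose(noisy_name)
    scores = []
    for c in candidates:
        e = text_encode(c)
        intra = torch.cosine_similarity(q, e, dim=0)             # vs. query text
        inter = torch.cosine_similarity(e, visual_protos, dim=-1).max()
        scores.append(intra + inter)
    return candidates[int(torch.stack(scores).argmax())]

torch.manual_seed(0)
enc = {n: torch.randn(16) for n in ["runing", "running", "ruining"]}
best = denoise_class_name("runing", lambda n: ["running", "ruining"],
                          lambda n: enc[n], visual_protos=torch.randn(5, 16))
```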
https://arxiv.org/abs/2404.14890